Partners for Preservation

Every purchase of a Facet book helps to fund CILIP’s advocacy, awareness and accreditation programmes for information professionals.

Partners for Preservation Advancing Digital Preservation through Cross-Community Collaboration

edited by

Jeanne Kramer-Smyth

© Jeanne Kramer-Smyth 2019 Published by Facet Publishing 7 Ridgmount Street, London WC1E 7AE www.facetpublishing.co.uk Facet Publishing is wholly owned by CILIP: the Library and Information Association. Jeanne Kramer-Smyth has asserted her right under the Copyright, Designs and Patents Act 1988 to be identified as author of this work. Except as otherwise permitted under the Copyright, Designs and Patents Act 1988 this publication may only be reproduced, stored or transmitted in any form or by any means, with the prior permission of the publisher, or, in the case of reprographic reproduction, in accordance with the terms of a licence issued by The Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to Facet Publishing, 7 Ridgmount Street, London WC1E 7AE. Every effort has been made to contact the holders of copyright material reproduced in this text, and thanks are due to them for permission to reproduce the material indicated. If there are any queries please contact the publisher. British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library. ISBN 978-178330-347-2 (paperback) ISBN 978-178330-348-9 (hardback) ISBN 978-178330-349-6 (e-book) First published 2019

Text printed on FSC accredited material. Typeset from author’s files in 11/14pt Palatino Linotype and Frutiger by Flagholme Publishing Services Printed and made in Great Britain by CPI Group (UK) Ltd, Croydon, CR0 4YY.

For my parents: I know you would both be proud of this volume. For my husband and son: Thank you for all your love, support, and patience.

Disclaimer

The findings, interpretations, and conclusions expressed in this publication are those of the author(s) and should not be attributed in any manner to The World Bank, its Board of Executive Directors, or the governments they represent.

Contents

List of figures and tables
About the authors
Foreword
Introduction

PART 1 MEMORY, PRIVACY AND TRANSPARENCY
1 The inheritance of digital media, Edina Harbinja
2 Curbing the online assimilation of personal information, Paulan Korenhof
3 The rise of computer-assisted reporting: challenges and successes, Brant Houston
4 Link rot, reference rot and the thorny problems of legal citation, Ellie Margolis

PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE
5 The Internet of Things: the risks and impacts of ubiquitous computing, Éireann Leverett
6 Accurate digital colour reproduction on displays: from hardware design to software features, Abhijit Sarkar
7 Historical building information model (BIM)+: sharing, preserving and reusing architectural design data, Ju Hyun Lee and Ning Gu

PART 3 DATA AND PROGRAMMING
8 Preparing and releasing official statistical data, Natalie Shlomo
9 Sharing research data, data standards and improving opportunities for creating visualisations, Vetria Byrd
10 Open source, version control and software sustainability, Ildikó Vancsa

Final thoughts
Index

List of figures and tables

Figures
7.1 An HBIM+ knowledge framework
9.1 Minard’s depiction of Napoleon’s march on Russia in 1812

Tables
7.1 The main characteristics of HBIM+ in relation to BIM
8.1 Example census table
8.2 Example census table of all elderly by whether they have a long-term illness
8.3 Example census table of all elderly living in households by whether they have long-term illness
8.4 Example of a probability mechanism to perturb small cell values
8.5 Example census table of all elderly by gender and whether they have a long-term illness
9.1 Stages of visualising data, the data visualisation life cycle and what to document

About the authors

Editor

Jeanne Kramer-Smyth has held a position as an electronic-records archivist with the World Bank Group Archives since 2011. She earned her Masters of Library Science from the Archives, Records and Information Management Program at the University of Maryland iSchool after a 20-year career as a software developer designing relational databases, creating custom database software and participating in web-based software development. She is the author of Spellbound Blog, where she has published dozens of essays exploring the intersection of archives and technology, with a special focus on electronic records, digitisation and access. She is also active on Twitter at @spellboundblog. Beyond her work in archives and digital preservation, she is a writer, photographer, graphic designer, creative spirit and fan of board games. Jeanne is the creator of the Pyramid Game Freeze Tag. A fan of many types of fiction, she has a special place in her heart (and large home library) for fantasy, science-fiction, young adult and historical fiction. She has published a number of short stories in anthologies. She lives in Maryland with her husband, son and two cats.

Contributors

Dr Vetria Byrd is an assistant professor of computer graphics technology and Director of the Byrd Data Visualization Lab in the Polytechnic Institute at Purdue University’s main campus in West Lafayette, Indiana. Dr Byrd is introducing and integrating visualisation
capacity building into the undergraduate data visualisation curriculum. She is the founder of the Broadening Participation in Visualization Workshop. She served as a steering committee member on the Midwest Big Data Hub (2016–2018). She has taught data visualisation courses on national and international platforms as an invited lecturer of the International High Performance Computing Summer School. Her visualisation webinars on Blue Waters, a petascale supercomputer at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign (https://bluewaters.ncsa.illinois.edu/webinars/visualization/introduction), introduce data visualisation to audiences around the world. Dr Byrd uses data visualisation as a catalyst for communication, a conduit for collaboration and as a platform to broaden participation of underrepresented groups in data visualisation (Barker, 2014). Her research interests include data visualisation, data analytics, data integration, visualising heterogeneous data, the science of learning, and incorporating data visualisation at the curriculum level and in everyday practice. Dr Ning Gu is Professor of Architecture in the School of Art, Architecture and Design at the University of South Australia. Professor Ning Gu has an academic background from Australia and China. His most significant contributions have been made in research in design computing and cognition, including topics such as computational design analysis, design cognition, design communication and collaboration, generative design systems and building information modelling. The outcomes of his research have been documented in over 170 peer-reviewed publications. Professor Gu’s research has been supported by prestigious Australian research-funding schemes from the Australian Research Council, the Office for Learning and Teaching and the Cooperative Research Centre for Construction Innovation. He has guest edited or chaired major international journals and conferences in the field. He was a visiting scholar at Massachusetts Institute of Technology, Columbia University and Technische Universiteit Eindhoven. Dr Edina Harbinja is a senior lecturer in law at the University of Hertfordshire. Her principal areas of research and teaching are related to the legal issues surrounding the internet and emerging technologies. In her research, Edina explores the application of property, contract law,
intellectual property and privacy online. She is a pioneer and a recognised expert in post-mortem privacy – the privacy of deceased individuals. Her research has a policy and multidisciplinary focus; she aims to explore different options for the regulation of online behaviours and phenomena. She has been a visiting scholar and invited speaker at universities and conferences in the USA, Latin America and Europe, and has undertaken consultancy for the Fundamental Rights Agency. Her research has been cited by legislators, courts and policy makers in the USA and Europe. Find her on Twitter at @EdinaRl. Brant Houston is the Knight Chair in Investigative Reporting at the University of Illinois at Urbana-Champaign, where he works on projects and research involving the use of data analysis in journalism. He is co-founder of the Global Investigative Journalism Network and the Institute for Nonprofit News. He is author of Computer-Assisted Reporting: a practical guide and co-author of The Investigative Reporter’s Handbook. He is a contributor to books on freedom of information acts and open government. Before joining the University of Illinois, he was executive director of Investigative Reporters and Editors at the University of Missouri after being an award-winning investigative journalist for 17 years. Paulan Korenhof is in the final stages of her PhD research at the Tilburg Institute for Law, Technology, and Society (TILT). Her research is focused on the manner in which the web affects the relation between users and personal information, and to what degree the right to be forgotten is a fit solution to address these issues. She has a background in philosophy, law and art, and investigates this relationship from an applied phenomenological and critical theory perspective. Occasionally she co-operates in projects with Hacklabs (Amsterdam: https://laglab.org/) and gives privacy awareness workshops to diverse audiences. Recently she started working at the Amsterdam University of Applied Sciences as a researcher in legal technology. Dr Ju Hyun Lee is a senior research fellow in the School of Art, Architecture and Design at the University of South Australia and a conjoint senior lecturer at the University of Newcastle. Dr Lee has made a significant contribution towards architectural and design research in
three main areas: design cognition (design and language), planning and design analysis, and design computing. As an expert in the field of architectural and design computing, Dr Lee was invited to become a visiting academic at the University of Newcastle in 2011. Dr Lee has developed innovative computational applications for pervasive computing and context awareness in the building environments. The research has been published in Computers in Industry, Advanced Engineering Informatics and Journal of Intelligent and Robotic Systems. His international contribution has been recognised as he has been an associate editor for a special edition of Architectural Science Review, a reviewer for many international journals and conferences, and an international reviewer for national grants. Éireann Leverett once found 10,000 vulnerable industrial systems on the internet. He then worked with computer emergency response teams around the world to reduce cyber risk. He likes teaching the basics and learning the obscure. He continually studies computer science, cryptography, networks, information theory, economics and magic history. Leverett is a regular speaker at computer security conferences such as FIRST (Forum of Incident Response and Security Teams), BlackHat, Defcon, Brucon, Hack.lu, RSA Conference and CCC (Chaos Communication Congress); and at insurance and risk conferences of the Society of Information Risk Analysts, the Onshore Energy Conference, the International Association of Engineering Insurers, the International Risk Governance Council and the Reinsurance Association of America. He has been featured by the BBC, the Washington Post, the Chicago Tribune, the Register, the Christian Science Monitor, Popular Mechanics and Wired magazine. He is a former penetration tester from IOActive, and was part of a multidisciplinary team that built the first cyber risk models for insurance with Cambridge University Centre for Risk Studies and RMS (http://www.rms.com/). Ellie Margolis is Professor of Law at Temple University, Beasley School of Law, where she teaches legal research and writing, appellate advocacy and other litigation skills courses. Her work focuses on the effect of technology on legal research and legal writing. She has written numerous law review articles, essays and textbook contributions. Her
scholarship is widely cited in legal writing textbooks, law review articles and appellate briefs. Dr Natalie Shlomo (BSc, Mathematics and Statistics, Hebrew University; MA, Statistics, Hebrew University; PhD, Statistics, Hebrew University) is Professor of Social Statistics at the School of Social Sciences, University of Manchester. Her areas of interest are in survey methods, survey design and estimation, record linkage, statistical disclosure control, statistical data editing and imputation, non-response analysis and adjustments, adaptive survey designs and small area estimation. She is the UK principal investigator for several collaborative grants from the 7th Framework Programme and H2020 of the European Union, all involving research in improving survey methods and dissemination. She is also principal investigator for the Leverhulme Trust International Network Grant on Bayesian Adaptive Survey Designs. She is an elected member of the International Statistical Institute and a fellow of the Royal Statistical Society. She is an elected council member and Vice-President of the International Statistical Institute. She is associate editor of several journals, including International Statistical Review and Journal of the Royal Statistical Society, Series A. She serves as a member of several national and international advisory boards. Homepage: https://www.research.manchester.ac.uk/portal/natalie.shlomo.html. Ildikó Vancsa started her journey with virtualisation during her university years and has worked with this technology in different ways since then. She started her career at a small research and development company in Budapest, where she focused on areas such as system management, business process modelling and optimisation. Ildikó got involved with OpenStack when she started to work on the cloud project at Ericsson in 2013. She was a member of the Ceilometer and Aodh project core teams. She is now working for the OpenStack Foundation and she drives network functions virtualization (NFV)-related feature development activities in projects like Nova and Cinder. Beyond code and documentation contributions, she is also very passionate about on-boarding and training activities.

Reference

Barker, T. (2014) Byrd Emphasizes Value of Visualization at XSEDE14, HPC Wire, 31 July, https://www.hpcwire.com/2014/07/31/byrd-emphasizes-value-visualization-xsede14/.

Foreword

Partnering for Preservation

Partners for Preservation is an important and timely volume, wonderfully curated and framed by the vision of the volume’s editor, Jeanne Kramer-Smyth. The sections and chapters build the case that partnering for preservation in the digital world is both necessary and mutually beneficial. We have learned that to demonstrate good digital practice and to build sustainable digital preservation programs, we must work within and across an array of domains, bringing together specialists in evolving combinations. As explained in the volume’s introduction, ‘Archivists cannot navigate the flood of technology and change alone. This book aims to help build bridges between archivists and those in other professions who are facing and navigating our common struggles. If archivists can build relationships with those in other professions, we can increase the tools and best practices available to everyone.’ Kramer-Smyth explains, ‘each of the ten chapters presented here was written by a subject matter expert from outside the GLAM (galleries, libraries, archives, and museums) community. The topics were selected to highlight ten digital challenges being faced by other professions.’ These are challenges that have clear relevance and value throughout the digital community. She adds, ‘My hope is that the chapters of this book can inspire archivists across the profession and around the world to see the potential that exists in partnering with others struggling to solve digital challenges.’ Her introduction highlights the flexibility that is built into the volume – the text lends itself to allowing readers to consume the chapters in one gulp, in sections, or individually. This balance of cohesion and modularity echoes core aspects of digital practice that is
the focus of the volume. As members of the digital community, we need to be generalists to have an overall sense of the evolving landscape and specialists as our practice advances. No individual can go it alone, understanding this is key to developing sustainable digital programs and practice, a principle that the volume addresses intentionally and eloquently. Each of the chapters is well constructed and documented, explaining complex and technical topics in understandable and approachable ways. The authors are storytellers with interesting perspectives based on a wealth of accumulated experience. The chapters include definitions, both formal and working, that have value in navigating these topics in local settings. Depending on a reader’s background, the topics in the chapters range from more familiar within GLAM literature, e.g., link rot avoidance and file formats, to less familiar, e.g., the complexities of new and emerging legal issues and specific computer-based functionality. In each case, the chapters demonstrate the importance of context in applying the principles of the topics and issues discussed. Similarly, capturing and managing context to enable long-term record-keeping is at the heart of archival principles as well as practice. The authors include an impressive mix of legal scholars and specialists in various aspects of information and computer science. The text cumulatively mirrors dominant themes in discussions of digital practice – privacy and ethics in a digital age, computational applications with a strong focus on algorithms, managing and protecting personal information even after death, data visualization, digital forensics, open data, software sustainability, and virtual and augmented reality. The way in which the authors delve into challenges and risks associated with their topics could be daunting or overwhelming, but each author also offers possible solutions, highlights options and choices, and shares techniques and tactics to help mitigate carefully explained risks. The examples the contributors share implicitly and explicitly demonstrate opportunities for partnering to preserve that the volume editor sought. There is a strong emphasis throughout the volume on the need for both collaboration and documentation to achieve objectives, both of which are central to archival understanding and practice and both of which reflect strong themes in the broader community. As an archivist
who is responsible for digital preservation in an academic library, I am aware of how important this volume is right now with its focus on partnering across domains to achieve shared outcomes. In the archival community, we just passed the fiftieth anniversary of the electronic records program at the U.S. National Archives, and longer than that for digital content management by ICPSR, an organisation that one author in Part 3 calls out for preserving social science research data. As other communities within GLAM become interested and engaged in archival and special collections content, and as more and more digital content is created by increasingly engaged creators and curators outside the GLAM communities, the message and examples of collaboration are vital. Collaborating in the ways envisioned in this volume allows all of us to cumulatively share our expertise and play to our strengths to the benefit of our collections and to current and future users. I am very pleased to contribute here because this is a significant contribution to the literature and because, like Jeanne, I devote much of my time to building community and to encouraging and enabling cross-community collaboration. The themes in the chapters align with some current projects and initiatives that I am devoted to and that will benefit from the contributions of these authors, including:

• a forthcoming special issue of Research Library Issues (RLI) by the Association of Research Libraries (ARL) on Radical Collaboration, about building and being part of the various communities that form around what we do in digital practice; and
• Building for Tomorrow, an IMLS-sponsored project led by Ann Whiteside at the Harvard University Graduate School of Design to construct a coalition of Digital Architecture, Design, and Engineering (DADE) stakeholders across the domains of design practitioners, libraries and archives, architectural history, digital preservation, intellectual property, and software designers to achieve long-term needs.

Both of these projects reflect ideas and principles like those of the authors in Partnering for Preservation. The carefully selected authors identify important issues in current and future digital practice, provide interesting and illustrative
examples, and share suggestions for collaborating to achieve desired outcomes. In the words of the volume curator, ‘It is my fervent wish that you leave this book inspired.’ I did, and I imagine most engaged readers in any domain that intersects with digital practice, including archives, will as well. I am already an advocate for the importance and benefits of digital preservation and this volume provides wonderful fodder for telling the stories of what we do as archivists and why that is so important. It also provides lessons and insights for anyone in the broad digital community to consider and appreciate.

Dr Nancy Y McGovern
Director, Digital Preservation, MIT Libraries
Director, Digital Preservation Management Workshops
Past President, Society of American Archivists

Introduction Jeanne Kramer-Smyth

We are increasingly immersed in and reliant on the digital world. This reliance creates major digital preservation challenges in the archives professional community – and those challenges have parallels in other professional communities. This book gathers essays from subject matter experts in non-archival professions, addressing a wide array of issues arising from real-world efforts. The only certainty about technology is that it will change. The speed of that change, and the ever-increasing diversity of digital formats, tools and platforms, will present stark challenges to the long-term preservation of digital records. Archivists frequently lack the technical expertise, subject matter knowledge, time, person-power and funding to solve the broad set of challenges sure to be faced by the archival profession. Archivists need to recruit partners from as diverse a set of professions working within the digital landscape as possible. The chapters of this book are grouped into three sections. The first explores topics related to memory, privacy and transparency. Part 2 includes chapters on the physical world – internet-capable devices, colour reproduction and architectural information. The final section digs into aspects of data and programming. You can read this book in a number of ways. You can start at the beginning and read straight through, but each chapter can stand alone and be read in any order. The chapters grouped together have been selected for their related subject matter.

This collection is not a ‘how to’ guide, but rather an exploration of how computers and technology affect our ability to preserve information for the future. Collaboration within professions and dissemination of digital content to audiences rely on the ability of individuals to consume digital products in a predictable manner. Efforts to ensure the smooth workings of these processes are often the same struggles archivists face, but archivists have the added challenge of needing to ensure these processes will work many years in the future. What do we mean when we say ‘digital preservation’? Is it only conquering the dual challenges of extraction from original systems and then storage of 1s and 0s? No: it goes beyond that to include the even greater issues of ensuring renderability and preserving context. Who created the records? How can someone displaced by time or space be sure they are experiencing the content with all of its significant properties intact? In 2008 JISC observed:

   Significant properties, also referred to as ‘significant characteristics’ or ‘essence’, are essential attributes of a digital object which affect its appearance, behaviour, quality and usability. They can be grouped into categories such as content, context (metadata), appearance (e.g. layout, colour), behaviour (e.g. interaction, functionality) and structure (e.g. pagination, sections). Significant properties must be preserved over time for the digital object to remain accessible and meaningful.
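For readers who handle preservation metadata, a minimal sketch of how these significant properties might be recorded for a single digital object follows; the field names and example values are illustrative assumptions only, not drawn from the JISC work or from any formal metadata standard:

# Illustrative sketch only: recording the categories of significant
# properties described above for one (hypothetical) digital object.
significant_properties = {
    "object_id": "report-2018-041.pdf",  # invented identifier
    "content": {"page_count": 12, "language": "en"},
    "context": {"creator": "Records Unit", "created": "2018-11-02"},
    "appearance": {"layout": "A4 portrait", "colour_space": "sRGB"},
    "behaviour": {"interactive_forms": False, "embedded_scripts": False},
    "structure": {"sections": ["summary", "findings", "annex"]},
}

REQUIRED_CATEGORIES = {"content", "context", "appearance", "behaviour", "structure"}

def missing_categories(record):
    """Return any significant-property categories not yet documented."""
    return REQUIRED_CATEGORIES - set(record)

if __name__ == "__main__":
    gaps = missing_categories(significant_properties)
    print("Missing categories:", ", ".join(sorted(gaps)) if gaps else "none")

The point of even so small a checklist is that renderability and meaning depend on documenting all five categories, not merely on storing the bitstream.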

Archivists cannot navigate the flood of technology and change alone. This book aims to help build bridges between archivists and those in other professions who are facing and navigating our common struggles. If archivists can build relationships with those in other professions, we can increase the tools and best practices available to everyone. For example we must consider the following challenges:

• how to ensure that those on different computer systems receive and experience information in the way intended
• how to export content between programs and platforms
• how to ensure proper protection of privacy
• how to interpret properly the meaning of data, including the context and translation of codes.

Digital preservation is complicated. Those who are working to ensure preservation of and access to digital objects for future discovery and reference are working in organisations of all different sizes and missions. The work in the trenches is often shoehorned between other traditional archival activities of appraisal, arrangement, description and responding to research requests. Only rarely can digital preservation professionals working in an archive dedicate much time beyond planning and problem solving for their own organisation. Meanwhile, technology moves fast. Faster than we solve the problems we have, new ones appear. Luckily, there is good news on several fronts: there is an increasing number of individuals in the archival community who focus on research and establishing best practices in digital preservation; the standards, tools, training and resources available to apply digital preservation best practices are maturing and multiplying with each passing year; and the challenges we face in pursuing digital preservation are not unique to archives. This book focuses on these challenges. Each of the ten chapters presented here was written by a subject matter expert from outside the gallery, library, archive and museum (GLAM) communities. The topics were selected to highlight ten digital challenges being faced by other professions. As the authors I recruited from around the world began to send me their chapters, I was pleased to see many common threads surfacing:

• acknowledgement of existing areas of collaboration with the GLAM community
• the challenges of maintaining a high degree of ethics and respecting individuals’ right to privacy
• communities facing the fast evolution of technology, and working hard to keep up with and take advantage of the latest digital innovations
• collaboration between practising professionals and those in academia
• collaboration between practising professionals and computer programmers
• struggles in collaboration due to the very different perspectives of the collaborators
• tools created by non-archival professions that could be leveraged by archivists managing digital archival records (DocumentCloud, for example)
• interlacing of the Internet of Things, data privacy and innovations in software.

I have pondered this overlap of digital challenges since February 2007. I was taking an archival access class in what was then the College of Library and Information Studies at the University of Maryland, College Park (now the UMD iSchool). Our guest speaker that day was Professor Ira Chinoy from the University of Maryland Journalism Department. He was teaching a course on computer-assisted reporting. Inspired by Professor Chinoy’s presentation, I wrote a blog post examining the parallel challenge facing journalists and archivists (Kramer-Smyth, 2007). Professor Chinoy dedicated a large portion of the class to issues related to the Freedom of Information Act and struggling to gain access to born-digital public records (those that were initially created in a digital form, not scanned or digitised from an analogue format). Journalists are usually early in the food chain of those vying for access to and understanding of federal, state and local databases. They must learn what databases are being kept and figure out which ones are worth pursuing. Professor Chinoy relayed a number of stories about the energy and perseverance required to convince government officials to give access to the data they have collected. The rules vary from state to state – see the Maryland Public Information Act (www.marylandattorneygeneral.gov/Pages/OpenGov/pia.aspx) as an example – and journalists must often cite specific regulations to prove that officials are breaking the law if they do not hand over the information. There are officials who deny that the software they use will even permit extractions of the data – or claim that there is no way to edit the records to remove confidential information. Some journalists find themselves hunting down the vendors of proprietary software to find out how to perform the extract they need. They then go back to the officials with that information in the hope of proving that it can be done. I love this article linked to in Professor Chinoy’s syllabus: The Top 38 Excuses Government Agencies
Give for Not Being Able to Fulfill Your Data Request (Mitchell, 2002). After all that work, just getting your hands on the data is not enough. The data is of no use without the decoder ring of documentation and context. I spent most of the 1990s designing and building custom databases, many for US federal government agencies. An almost inconceivable number of person hours go into the creation of most of these systems. Stakeholders from all over the organisation destined to use the system participate in meetings and design reviews. Huge design documents are created and frequently updated, and adjustments to the logic are often made even after the system goes live (to fix bugs or add enhancements). The systems I am describing are built using complex relational databases with hundreds of tables. It is uncommon for any one person to really understand everything in such a system – even if they are on the team for the full development life cycle. Sometimes, with luck, the team includes people with amazing technical writing skills, but usually those talented people write documentation for users of the system. Those documents may or may not explain the business processes and context related to the data. They rarely expose the relationship between a user’s actions on a screen and the data as it is stored in the underlying tables. Some decisions are only documented in the application code itself, which is not likely to be preserved along with the data. Teams charged with the support of these systems and their users often create their own documents and databases to explain certain confusing aspects of the system and to track bugs and their fixes. A good analogy here would be to the internal files that archivists often maintain about a collection – the notes that are not shared with the researchers but instead help the archivists who work with the collection to remember such things as where frequently requested documents are or what restrictions must be applied to certain documents. So where does that leave those who are playing detective to understand the records in these systems? Trying to determine what the data in the tables means from the viewpoint of end-users can be a fool’s errand – and that is if you even have access to actual users of the system. I don’t think there is any easy answer given the realities of how many unique data management systems are in use throughout the public sector.
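As a toy illustration of that ‘decoder ring’ problem, consider a single exported row of coded values. The field names, codes and lookup tables below are invented; in real systems the equivalent codebook is scattered across design documents, support databases and the application code itself, which is exactly the context that so rarely reaches the archive with the data:

# Invented example: one row exported from a hypothetical case-tracking database.
raw_row = {"case_type": "03", "status": "R", "office": "117"}

# The 'decoder ring': lookup tables that, in practice, live in design
# documents, support databases or the application code itself.
codebook = {
    "case_type": {"01": "complaint", "02": "inspection", "03": "appeal"},
    "status": {"O": "open", "R": "referred", "C": "closed"},
    "office": {"117": "Northern regional office"},
}

def decode(row, book):
    """Translate coded values into readable labels, flagging unknown codes."""
    return {
        field: book.get(field, {}).get(value, "UNKNOWN CODE: " + value)
        for field, value in row.items()
    }

if __name__ == "__main__":
    print(decode(raw_row, codebook))
    # Without the codebook, an archivist or journalist sees only '03', 'R' and '117'.

Running the sketch prints readable labels; strip out the codebook and the same row tells a future researcher almost nothing.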

Archivists often find themselves struggling with the same problems as journalists. They have to fight to acquire and then understand the records being stored in databases. I suspect they have even less chance of interacting with actual users of the original system that created the records. There are many benefits to working with the producers of records long before they are earmarked to head to the archives. While this has long been the exception rather than the rule, there have been major efforts to increase use of this approach whenever possible. When I wrote my original blog post in 2007, I imagined a future world in which all public records are online and can be downloaded on demand. My example of forward thinking back then was the US National Archives’ search interface for World War 2 army enlistment records (NARA, 2002). It included links to sample data, record group information, and a frequently asked questions (FAQ) file. Once a record is selected for viewing, every field includes a link to explain the value. But even this extensive detail would not be enough for someone to just pick up these records and understand them – you still need to understand World War 2, army enlistment and other contextual matters. This is where the six-page FAQ comes in. Look at the information it provides – and then take a moment to imagine what it would take for a journalist to recreate a similar level of detailed information for new database records being created in a public agency today (especially when those records are guarded by officials who are leery about permitting access to the records in the first place). Archivists and journalists are concerned with many of the same issues related to born-digital records. How do we acquire the records people will care about? How do we understand what the records mean in the context of why and how they were created? How do we enable access to the information? Where do we get the resources, time and information to support important work like this? While this book can’t make you an expert on any of the topics discussed, be it colour science or the Internet of Things, it does seek to give you enough grounding in each topic to help you start to see the potential for archival collaboration with each of the fields represented. As I write this introduction in 2018, most of the challenges I observed when I wrote that original blog post in 2007 have not gone away. My hope is that the chapters of this book can inspire archivists
across the profession and around the world to see the potential that exists in partnering with others struggling to solve digital challenges.

References

JISC (2008) The Significant Properties of Digital Objects, Joint Information Systems Committee, https://www.webarchive.org.uk/wayback/archive/20140615130716/www.jisc.ac.uk/whatwedo/programmes/preservation/2008sigprops.aspx.

Kramer-Smyth, J. (2007) Understanding Born Digital Records: journalists and archivists with parallel challenges, Spellbound Blog, 17 February, www.spellboundblog.com/2007/02/17/understanding-born-digital-records-journalists-and-archivists-with-parallel-challenges/.

Mitchell, B. (2002) The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request (and Suggestions on What You Should Say or Do), Poynter, 22 August, https://web.archive.org/web/20120806203023/https://www.poynter.org/news/the-top-38-excuses-government-agencies-give-for-not-being-able-to-fulfill-your-data-request.

NARA (2002) World War II Army Enlistment Records, National Archives and Records Administration, https://aad.archives.gov/aad/fielded-search.jsp?dt=893&tf=F&cat=all&bc=sl.

PART 1

Memory, privacy and transparency

While memory, privacy and transparency have long been cornerstones of archival practice in the analogue world, the crossover into the digital realm presents new challenges. The chapters in Part 1 explore the digital challenges being tackled by lawyers, journalists and privacy experts. In each example, consider the ways that these other communities are struggling with and solving their digital problems. The archives theme I see throughout this section is access. How do we ensure it? How do we balance the lure of technology and desire for access with the interests of the individual? Technology is forcing change in the way we think about inheritance, the way that lawyers cite sources, the way that journalists work, and the way that personal information is often immediately accessible online.

1 The inheritance of digital media Edina Harbinja

Introduction

We live in a world where our entire identities are created, developed and stored online, in different accounts owned by various service providers, such as Google, Facebook, Instagram, Twitter, Apple and Microsoft. Once users who engage with all or some of these services die, many interesting and concerning questions for lawyers, but also the wider public, arise (Carroll, 2018). Those with a stake after this unfortunate event may include the deceased’s family and heirs, friends, service providers, researchers, historians, archivists and, sometimes, the public. There have been many cases reported in the media depicting some of these interests and their conflicts, albeit case law is still very scarce in most countries (BBC News, 2005; Gambino, 2013; Sancya, 2005; Williams, 2012). These cases related to some key questions that have largely remained unanswered: should bereaved family members be allowed to access the dead user’s digital accounts? Is the service provider obliged to enable the family this access? Should friends have access to the shared content on Facebook? Do users have a right to decide what happens to these accounts when they die? What about the right of access by the wider public, journalists, archivists and historians in particular? All these questions reveal the complexity of digital assets, remains and posthumous identities. Yet still, in the UK a credible research study has found that 85% of participants are not
considering the implications of digital death (Digital Legacy Association, 2017). In this chapter, I aim to shed some light on these questions and offer some mainly legal solutions, but look at their wider implications as well. There is now abundant legal scholarship in the area, and many authors have embarked on identifying key issues in laws related to digital death, as well as offering some solutions and ideas (e.g. Băbeanu, Gavrilă and Mareş, 2009; Cahn, 2011, 36–7; Haworth, 2014, 535, 538; Hopkins, 2014, 211; Lopez, 2016, 183; McCallig, 2013; Perrone, 2012–2013, 21). This author is one of them and has been writing about the topic for more than six years now (Edwards and Harbinja, 2013a, 2013b; Harbinja, 2014a, 2014b; Harbinja, 2016a, 2016b, 227; Harbinja, 2017a, 2017c, 2017d). In spite of this, many western jurisdictions still struggle to find the right (or any) response to the conundrum of questions around regulation of digital assets on death. No western country so far has found an optimal solution, which would consolidate and resolve issues arising in many different areas of law and regulation. There have been some welcome, albeit sporadic, attempts to legislate and we will look at some of these in the chapter. I also consider where regulation and technology could go next in order to improve this area and bring about more clarity for users, platforms, practitioners, archivists and the public. The chapter starts by looking at some conceptual issues around what digital assets are and whether it is useful to offer a comprehensive definition of them. Further, I will examine key legal issues for any jurisdiction, focusing on the examples of the UK and the USA. These are property, ownership and copyright that digital assets might include. In the next section, we will look at service providers’ contract and terms of service, which govern the assets on a more global level. Some technological solutions will be examined here too. Then I will look at the notion of post-mortem privacy and examine whether there is a case for establishing this concept more strongly in law, regulation and technology. The chapter will conclude by evaluating estate planning options and offering some solutions as a way forward.

The concept of digital assets: what is included in digital inheritance?

The notion of digital assets is a relatively new phenomenon in the UK and globally, lacking a proper legal definition, with diverse meanings attributed to it. See for example Conway and Grattan (2017). For instance, from a layman’s perspective it could be anything valuable online, any asset (account, file, document, digital footprint; music library, social media account, picture, video, different online collection, bitcoin wallet) that has a personal, economic or social attachment to an individual. The legal definition, however, needs a little more precision. Constructing its legal definition and nature enables adequate legal treatment and regulation, though an overly narrow definition risks leaving out assets that do not fit in, or technologies that emerge. So far, there have been a fair few attempts to define and classify them (Băbeanu, Gavrilă and Mareş, 2009, 319; Cahn, 2011; Edwards and Harbinja, 2013b, 115; Haworth, 2014, 538; Hopkins, 2014, 211; Perrone, 2012–2013, 185). Most of the definitions are inductive, however, and try to theorise starting from the existing assets online, trying to make appropriate generalisations and classifications (Harbinja, 2017b, 18–25). Edwards and Harbinja also attempted to define it in their early work in the area (Edwards and Harbinja, 2013b). In my more recent work, I propose that digital assets are defined as any intangible asset of personal or economic value created, purchased or stored online (Harbinja, 2017b, 24–5). These assets could fall within existing institutions of property, rights under the contract, intellectual property, personality right or personal data. Another element should be to exclude from the definition the infrastructure of hosts, social media sites and websites which they create and maintain, e.g. cloud storage, as opposed to the accounts and assets created and occupied by users. This is because the infrastructure is owned or protected by their intellectual property rights and only serves as an enabler for the creation and storage of assets that are physically and logically placed above that layer of the internet (Harbinja, 2017b, 121, 220–5, 245–68). Digital assets have become significantly valuable to online users in the UK and worldwide. As early as October 2011, the Centre for Creative and Social Technology at Goldsmiths, University of London, released a study of internet use in the UK entitled ‘Generation Cloud’,
which determined that British internet users have at least £2.3 billion worth of digital assets stored in the cloud. The study shows that 24% of UK adults estimate that they have digital assets worth more than £200 per person in the cloud, which amounts to at least £2.3 billion in total (Rackspace, 2011). At the same time, McAfee conducted a global study and found that respondents had 2777 digital files stored on at least one device, at a total value of $37,438, with US internet users valuing their assets at nearly $55,000 (McAfee, 2013). PwC conducted a similar survey in 2013 and found that internet users value their digital assets at £25 billion (PwC, 2013). Given the exponential growth in digital usage between 2013 and 2018, this figure would be even higher now, but there is little recent empirical data to evidence this. Despite the growing value and importance, the area and applicable laws are far from clear in the UK and elsewhere in the world. Users, practitioners and service providers struggle to navigate through the complex laws around property law, wills and succession, trusts, intellectual property, data protection, contracts and jurisdiction. All these areas are relevant when discussing digital assets and their transmission on death (Harbinja, 2017b, 13–18). In the next section, I outline some of the key issues that users, practitioners, intermediaries, archivists and others might encounter when dealing with digital assets.

Key legal issues: property, copyright and access

One of the crucial legal issues in this area relates to the question of whether the content of a user’s account can be considered the user’s property or not. This is the first legal concept anyone refers to when they think about what they think they ‘own’. Thus internet users might refer to owning their Facebook account, e-mails, iTunes or Spotify library, YouTube channel, gaming and massively multiplayer online games (MMOG) account, etc. However, in all of these cases ownership is far from what users might expect it to be. It is not property and ownership in the sense of their physical possession, such as their house or car, or even their poem or novel. This is much more complex during a user’s life, and the complexity of law increases once a user dies. I have examined these in examples of e-mails (Harbinja, 2016a), social network accounts (Harbinja, 2017e) and virtual worlds (Harbinja, 2014a, 273). Some authors have argued for propertisation claiming that
e-mails or social network accounts and other assets are clearly users’ property (Atwater, 2006, 397, 399; Banta, 2017, 1099; Darrow and Ferrera, 2006, 281, 308; Erlank, 2012, 22–3; Lastowka and Hunter, 2006, 17–18; Mazzone, 2012, 1643). However, this is not as simple as it sounds and legal and normative arguments go against this opinion (Harbinja, 2017b). First, it is important to note that user accounts are created through contracts between service providers and users, and the account itself and the underlying software is the property or intellectual property of the service provider (Harbinja, 2017b, 121, 220–5, 245–68). However, the legal nature of the content itself is not clear in law in most countries around the globe. If the content is an object of property, then the question of ownership is simple for most European jurisdictions and the USA: it transmits on death, through one’s will or intestate succession. In the UK, for instance, ‘a person’s estate is the aggregate of all the property to which he is beneficially entitled’ (Inheritance Tax Act, s. 5(1), applicable to England, Scotland, Wales and Northern Ireland). Other relevant UK statutes are the Wills Act 1837, s. 3 (which does not extend its effect to Scotland), and the Succession (Scotland) Act 1964, s. 32. In the USA, ‘probate assets are those assets of the decedent, includible in the gross estate under Internal Revenue Code §2033, that were held in his or her name at the time of death’ (Darrow and Ferrera, 2006, 281); see also McKinnon (2011). Conversely, if the content is not property stricto sensu, it can be protected by copyright and arguably transmits on death and lasts for as long as copyright lasts (70 years post-mortem mostly) (Copyright Act 1976; Council of the European Union, 1993; European Parliament and Council, 2006, Article 1; 17 US Code §302); for a more detailed analysis of copyright see Harbinja (2017e). This is not easy to establish for most assets we have assessed, as the law often requires tangibility, rivalrousness (the quality of an object that only one person can possess it without undermining its value) and other incidents or features that the law normally assigns to objects of property. Digital objects do not fit squarely within these traditional legal concepts of property and ownership (Harbinja, 2017b). Personal data, in particular, cannot be owned and many (mainly European) authors have argued against its propertisation (Cohen, 2000; Cuijpers, 2007; Harbinja, 2013; Litman, 2000, 1304; Purtova, 2010), whereas US scholarship has historically
been more inclined towards this concept (Laudon, 1996, 96; Lessig, 2006; Mell, 1996; Schwartz, 2003; Westin, 1967, 40; Zarsky, 2004). Users’ copyright is slightly more straightforward, as a lot of assets include copyrightable materials, especially user-generated content (Harbinja, 2017e). For instance, under section 1 of the UK Copyright, Designs and Patents Act 1988, copyright subsists in original literary, dramatic, musical or artistic works, sound recordings, films and the typographical arrangement of published editions. Thus, for instance, user posts, notes, poems, pictures and some videos fall into these categories and to a great extent meet copyright requirements of fixation and originality in US and UK law (Harbinja, 2017e, 188–92). As suggested above, this content passes on after one’s death to a person’s heirs for 70 years post-mortem in these jurisdictions and entitlement is not debatable. There is a problem in the UK copyright law regarding unpublished works, however. The act requires that unpublished work is embodied in a tangible medium. For instance, if posts on Facebook are considered unpublished (sent to private friends only), heirs are not entitled to copyright in this content as it lacks tangibility (Harbinja, 2017e, 188–92). Generally, a problem that persists in all countries is the extent of access to all this digital content by heirs and next of kin, and this will be discussed in the following section.

Contracts and in-service solutions

Every ‘intermediary’ (service provider, a platform that stores and/or enables digital assets) such as Facebook or Google purports to regulate access to and ownership of user-created digital assets on its platforms according to its own terms of service or contract with its users. Thus, most issues of ownership and access to digital assets are determined at least at first by contract. This is not ideal, perhaps, but it has an indisputable impact on transmission of digital assets on death. In particular, a phenomenon which has the potential to generate much uncertainty and litigation in the field of succession has just emerged, which scholars have termed ‘in-service solutions’ or sometimes ‘social media wills’ (Cahn, Kunz and Brown Walsh, 2016; Edwards and Harbinja, 2013b; Harbinja, 2017c, 26–42). Most platforms promise users ‘ownership’ of their content (Harbinja, 2017b), but the actual access
and licence end on death and this undermines the initial promise – a user cannot pass this ownership on to their heirs, for example. Google, Facebook and others have introduced technical ‘legacy’ tools, giving users choices to delete or transfer digital assets after death. These tools have the advantage of providing post-mortem control over personal data and digital assets to users who may never make a will (Denton, 2016), as well as easy access for designated beneficiaries, but they may also confound the traditional estate administration process.

In 2013, Google introduced Inactive Account Manager as the first in-service solution to address the issue of the transmission of digital assets on death. Inactive Account Manager enables users to share ‘parts of their account data or to notify someone if they’ve been inactive for a certain period of time’ (Google, 2018). Under this procedure, the user can nominate trusted contacts to receive data if the user has been inactive for a chosen time (3–18 months). After their identity has been verified, trusted contacts are entitled to download the data the user left to them. The user can also decide only to notify these contacts of the inactivity and to have all the data deleted. There is a link directly from the user’s account settings (personal info and privacy section) to the Inactive Account Manager. In addition, Google offers the following options if a user does not set up the Inactive Account Manager: close the account of a deceased user, request funds from a deceased user’s account, and obtain data from a deceased user’s account. The process is discretionary, however, and Google does not promise that any of the requests will be carried out (Harbinja, 2017c, 35–7).

Similarly, Facebook’s solution in its terms of use and privacy policy (known as the statement of rights and responsibilities and the data use policy) provides three main options for dealing with assets on its site (accounts containing posts, pictures, videos etc.): memorialisation, deletion or deactivation, and a legacy contact. For a detailed analysis see Harbinja (2017d). The effect of memorialisation is that it prevents anyone from logging into the account, even those with valid login information and a password. Any user can send a private message to a memorialised account. Content that the decedent shared while alive remains visible to those it was shared with (privacy settings remain ‘as is’). Depending on the privacy settings, confirmed friends may still post to the decedent’s timeline. Accounts which are memorialised no longer appear in the ‘people you may know’ suggestions or other
suggestions and notifications. Memorialisation prevents the tagging of the deceased in future Facebook posts, photographs or any other content. Unfriending a deceased person’s memorialised account is permanent, and a friend cannot be added to a memorialised account or profile. Facebook provides the option of removal of a deceased person’s account, but with very general statements and vague criteria. The option is available only to ‘verified immediate family members’ or an executor, and the relationship to the deceased needs to be verified. Facebook only promises that it will ‘process’ these requests, without giving a firm promise of fulfilling special requests (Facebook, 2018a).

Since February 2015 Facebook has allowed its users to designate a friend or family member as their legacy contact, akin to a ‘Facebook estate executor’, who can manage their account after they have died. The legacy contact has a limited number of options: to write a post to display at the top of the memorialised timeline, to respond to new friend requests, and to update the profile picture and cover photo of a deceased user. In addition, a user ‘may give their legacy contact permission to download an archive of the photos, posts and profile information they shared on Facebook’ (Facebook, 2018b). The legacy contact will not be able to log into the account or see the private messages of the deceased. All the other settings remain the same as before memorialisation of the account. Finally, there is an option permitting users to delete their account permanently after their death. For a detailed analysis see Harbinja (2017d).

These in-service solutions are partial but positive and a step in the right direction. They empower users and foster their autonomy and choice (Harbinja, 2017c). They are a start towards what may become a much more comprehensive system of ‘social media wills’, with the number of platforms offering these choices and the number of options ever increasing. Perhaps these in-service solutions may encourage young people to think about their future and make decisions about their digital assets. As digital assets created on platforms increase in number, value and emotional and financial significance, this is socially useful, but administrators of platforms need to do much more to raise awareness of these options and inform their users during registration, or later on, as a layered notice, a push notification or a pop-up window, for instance. The main problem with these tools is that their provisions might
clash with a will (possibly made later in life) or with the rules of intestate succession and heirs’ interests. For example, a friend can be a beneficiary of Google or Facebook services, but would not be an heir or next of kin who would inherit copyright in one’s assets. Elsewhere I suggest that the law should recognise these services as ‘social media wills’, and provide legal solutions such as those embraced by the US Uniform Law Commission in the Revised Uniform Fiduciary Access to Digital Assets Act (RUFADAA, see below) (Harbinja, 2017c, 34–5). The terms of service are also intrinsically unclear and contradictory, and service providers need to make more effort to clarify them and make them more solid and coherent. Finally, there is no indication of how UK and other consumers use these services, nor have service providers been co-operative with researchers and transparent to the public about this. Users should be made aware of these services and service providers need to make more effort in this regard.

Post-mortem privacy

A separate issue surrounding digital assets and death is post-mortem privacy – the protection of the deceased’s personal data (Harbinja, 2017c; Lopez, 2016). Many digital assets include a large amount of personal data (e.g. e-mails, social media content) and their legal treatment cannot be looked at holistically if one does not consider privacy laws and their lack of application post-mortem. Like many legal systems, UK law does not protect post-mortem privacy. The protections for personality and privacy awarded by breach of confidence, data protection and defamation do not apply to the deceased in UK law (Edwards and Harbinja, 2013a). In English law, the principle has traditionally been actio personalis moritur cum persona – personal causes of action die with the person (see Baker v. Bolton, 1808). This principle has been revised by legislation in many contexts, mainly for reasons of social policy, but it persists in relation to privacy and data protection. The same applies in the USA and many European countries (Edwards and Harbinja, 2013a). From the data protection perspective, the UK Data Protection Act 1998 in s. 1 defines personal data as ‘data which relate to a living individual’, denying any post-mortem rights. The same is envisaged in the Data Protection Bill 2017. The rationale behind not giving
protection to the deceased’s personal data in the UK is the lack of ability to consent to the processing of data (HL Select Committee on the European Communities, 1992). In the new and much-debated EU data protection regime, Recital 27 of the General Data Protection Regulation permits member states to introduce some sort of protection for a deceased person’s data, and some states have already done so: France and Hungary, for example, have introduced protection of post-mortem privacy in their legislation (Castex, Harbinja and Rossi, 2018). The UK government’s approach, therefore, is not ideal and does not contribute to legislative harmonisation within the EU, Brexit notwithstanding. I have argued on many occasions that post-mortem privacy deserves legal consideration in the UK, drawing an analogy with testamentary freedom, where individuals are permitted to control their wealth pre-mortem and their autonomy is extended on death. They are not entitled to do the same for their online ‘wealth’, identities and personal data, however (Harbinja, 2017c). The problem of post-mortem privacy has already been observed in US (In Re Ellsworth, 2005; Ajemian v. Yahoo!, Inc., 2017) and German (Kammergericht, 2017) jurisprudence and addressed in US and French legislation. I will sketch the most significant features of these legislative efforts.

US law: the RUFADAA

Similarly to the UK principles of non-survivorship of privacy and data protection, the US Restatement (Second) of Torts states that there can be no cause of action for invasion of privacy of a decedent, with the exception of ‘appropriation of one’s name or likeness’ (Restatement (Second) of Torts §652I, 1977). Some states provide for the protection of so-called ‘publicity rights’ (rights that protect celebrities, usually, but sometimes all individuals’ rights to name, image, likeness etc.) post-mortem, up to a limit of 100 years after death (Edwards and Harbinja, 2013a). Interestingly, however, US states have been the most active jurisdictions in legislating on the transmission of digital assets on death. The initial phase of digital assets legislation started in 2005, and more than 20 US states have attempted to regulate the area over the past 10 years. These
laws seem to have been inspired by the publicity around the Ellsworth case and similar controversies. In In Re Ellsworth, the e-mail provider Yahoo! initially refused to give the family of a US marine, Justin Ellsworth, killed in Iraq, access to his e-mail account. Yahoo! referred to its terms of service, which were designed to protect the privacy of the user by forbidding access by third parties on death. Yahoo! also argued that the US Electronic Communications Privacy Act of 1986 prohibited it from disclosing a user’s personal communications without a court order. The family argued that as his heirs they should be able to access Ellsworth’s e-mails and the entire account, his sent and received e-mails, as his last words. Yahoo!, on the other hand, had a non-survivorship policy and there was a danger that Ellsworth’s account could have been deleted. The judge in this case allowed Yahoo! to enforce its privacy policy and did not order transfer of the account login and password. Rather, he made an order requiring Yahoo! to enable access to the deceased’s account by providing the family with a compact disc (CD) containing copies of the e-mails in the account. As reported by the media, Yahoo! originally provided only the e-mails received by Justin Ellsworth on a CD and, after the family had complained again, allegedly sent paper copies of the sent e-mails later (Kulesza, 2012; Soldier’s Kin, 2005). This case clearly illustrates most of the issues in the post-mortem transmission of e-mails and other digital assets (post-mortem privacy, access, and conflicts between the interests of the deceased and the family).

The legislative responses that followed were partial and piecemeal, rather than comprehensive and evidence-based solutions (Lopez, 2016). The answer to this scattered legislation and possible conflict of laws has been harmonisation within the USA. In July 2012 the US Uniform Law Commission formed the Committee on Fiduciary Access to Digital Assets. The goal of the Committee was to draft an act and/or amendments to Uniform Law Commission acts (the Uniform Probate Code, the Uniform Trust Code, the Uniform Guardianship and Protective Proceedings Act, and the Uniform Power of Attorney Act) that would authorise fiduciaries to manage and distribute, copy or delete, and access digital assets. From 2012, drafts of the Uniform Fiduciary Access to Digital Assets Act (UFADAA) were published online on multiple occasions for the purposes of committee meetings (Lopez, 2016). The process included fierce lobbying efforts by the big tech
companies (e.g. Google and Facebook), connected through a trade association called NetChoice. The companies even moved on to lobby for a completely different act that would replace the UFADAA, resulting in the Privacy Expectation Afterlife and Choices Act 2015 (PEAC). The Uniform Law Commission then decided to revise the UFADAA and incorporate some of the industry concerns and pro-privacy stances, adopting the Revised UFADAA (RUFADAA) in 2015 (Lopez, 2016). Although this initiative was an attempt to improve and develop the existing statutes so as to consider the full range of digital assets, there were many open issues that the committee needed to address when drafting the RUFADAA. For instance, in the prefatory note for the Drafting Committee in the February 2013 draft, the drafters identified the most critical issues to be clarified, including the definition of digital property (section 2) and the type and nature of control that can be exercised by a fiduciary (section 4). It seems that some of the most controversial issues were disputed within the committee, such as clarifying possible conflicts between contract law and the law of succession, and between heirs, family and friends (Harbinja, 2017b).

The RUFADAA includes important powers for fiduciaries regarding digital assets and estate administration. These powers are limited by a user’s will and by the intent expressed in their choice to use online tools to dispose of their digital assets (e.g. Google Inactive Account Manager). A user’s choice made through such a tool overrides any provisions of their will. If the user does not give direction using an in-service solution, but makes provision for the disposition of digital assets in a will, the RUFADAA gives legal effect to the user’s directions. If the user fails to give any direction at all, then the provider’s terms of service apply.
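To summarise this order of priority, a minimal sketch in Python follows. It is a summary device only, assuming a hypothetical function and inputs of my own; it is not legal advice, not an actual implementation of the statute, and not part of the RUFADAA itself.

# Illustrative sketch of the RUFADAA order of priority described above.
# Hypothetical function and inputs; a summary device, not legal advice.

def applicable_direction(online_tool_choice, will_provision, terms_of_service):
    """Return which instruction governs the disposition of a digital asset."""
    if online_tool_choice is not None:
        # A direction given through an in-service tool (e.g. an inactive-account
        # setting) overrides any provision of the will.
        return 'online tool', online_tool_choice
    if will_provision is not None:
        # Absent an online-tool direction, a provision in the will is given effect.
        return 'will', will_provision
    # With no direction from the user at all, the provider's terms of service apply.
    return 'terms of service', terms_of_service

# Example: no online-tool direction, but the will leaves a direction.
print(applicable_direction(None, 'give my e-mail archive to my sister', 'non-transferable'))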

The act also gives the service provider a choice of methods for disclosing digital assets to an authorised fiduciary, in accordance with its terms of service (full access, partial access or a copy in a record). Finally, the act gives personal representatives default access to the ‘catalogue’ of electronic communications and other digital assets not protected by federal privacy law (the content of communications is protected and can only be disclosed if the user consented to disclosure or if a court orders disclosure). Additionally, section 9 of the RUFADAA aims to resolve the issues of potential violations of criminal and privacy legislation, and tackles jurisdiction, mandating that choice-of-law provisions in terms of service do not apply to fiduciaries. Apart from that, the new draft abandoned the digital property notion altogether and left only digital assets, comprising both the content and the log information (data about an electronic communication, such as the date and time a message was sent, the recipient’s e-mail address etc.).

So far, the majority of states have introduced and enacted the RUFADAA (Uniform Law Commission, 2018). Most already had their own digital assets statutes, but this legislation had not been harmonised before the RUFADAA, so the act contributed to the harmonisation of divergent laws – at least in the states that enacted it. Notably, California, where the biggest service providers are based and whose laws are applicable to the terms of service, has not introduced the legislation. It is to be hoped that the act will achieve wider adoption and application in individual states, or even initiate similar efforts in other countries.

An acceptable legal solution for the transmission of e-mails will ideally follow the rationale behind the RUFADAA. It should aim to recognise technology as a way of disposing of digital assets (including e-mails), as a more efficient and immediate solution online. The solution should also consider technological limitations, users’ autonomy and the changing landscape of relationships online (Kasket, 2013, 10; Pennington, 2013, 617). The Uniform Law Conference of Canada followed this approach and enacted a similar act: the Canadian Uniform Access to Digital Assets by Fiduciaries Act 2016 (UADAFA) (ULCC and CHLC, 2016). This act provides a stronger right of access for fiduciaries than the RUFADAA: there is default access to the digital assets of the account holder, and under the UADAFA the instrument appointing the fiduciary, rather than the service provider, determines a fiduciary’s right of access. The Canadian act has a ‘last-in-time’ priority system, whereby the most recent instruction takes priority over an earlier instrument. Interestingly, however, a user who already has a will, but nominates a family member or a friend to access their social media account after their death, restricts their executor’s rights under the will. This is similar to the US RUFADAA in that the deceased’s wishes take priority in any case; the difference is in the mechanism. The RUFADAA is more restrictive in honouring terms of service in the absence of a user’s instruction: service providers are obliged only to disclose the catalogue of digital assets, and not the content. I believe that this solution is more suitable for the online environment, in particular where assets are intrinsically
tied to one’s identity (communications, social networks, multiple accounts with one provider such as Google, where these create a unique profile and identity of a user, and so on).

In the Digital Republic Act 2016 (LOI no. 2016-1321 pour une République numérique), France has adopted a solution quite similar to the RUFADAA. Article 63(2) of the act states that anyone can set general or specific directives for the preservation, deletion and disclosure of their personal data after death (for more information see Castex, Harbinja and Rossi, 2018). These directives would be registered with a certified third party (for general directives) or with the service provider who holds the data (e.g. Facebook, through the policy described above). This is quite a surprising development, which brings the US and French approaches to post-mortem privacy closer together. It is even odder if we consider the conventional and extremely divergent approaches of these jurisdictions to protecting the personal data of living individuals (Castex, Harbinja and Rossi, 2018).

Unfortunately, we will not be seeing similar developments in the UK. The Law Commission has recently initiated a consultation on the reform of the law of wills in England and Wales. In its brief, the Commission asserts that digital assets ‘fall outside the sort of property that is normally dealt with by a will’ (Law Commission, 2017, 231) and that digital assets are primarily a matter of contract law and could be addressed in a separate law reform. This suggestion fails to future-proof the law of wills as these kinds of assets become more common and valuable. In the future, we will see conflicts between wills and the disposition of digital assets online, and this reform is a chance for UK law to show foresight in anticipating these issues and to follow good examples in other countries, as explained above. With Lilian Edwards, I have therefore argued that the Commission should consider digital assets in the ongoing reform, in order to forestall rather than create uncertainty and confusion. We will see quite soon whether the Commission will take our suggestion on board and take this opportunity to bring about some clarity in the law.

Digital estate planning and potential solutions

Digital estate planning is a developing area, with many tech solutions having been developed over the years. Given the lack of regulation and law,
this was perceived as a quick solution for dealing with digital assets on death (Beyer and Cahn, 2012; Hopkins, 2014, 229). These digital estate planning services aim to shift control of digital assets to users by enabling the designation of beneficiaries who will receive the passwords and content of digital asset accounts. Lamm et al. (2014) categorise these solutions somewhat differently, focusing on the character of the actions they promise to undertake on death. They find four categories:

• services offering to store passwords
• services facilitating the administration of digital assets
• services performing specific actions (e.g. removing all the data on behalf of a deceased person)
• services that currently do not exist, but would hypothetically provide services through partnerships with the service providers of the deceased’s accounts.

This categorisation is very similar to the one I used earlier, with the slight difference that it focuses on actions rather than business models (Harbinja, 2017b). In their earlier work, Edwards and Harbinja evaluated some of these ‘code’ solutions and concluded that ‘these are not themselves a foolproof solution’ (Edwards and Harbinja, 2013b) for five main reasons:

• They could cause a breach of terms of service (due to the non-transferable nature of most assets, as suggested above).
• There is a danger of committing a criminal offence (according to the provisions of anti-interception and privacy laws).
• The services are inconsistent with the law of succession and estate administration (they do not fulfil the requirements of will formalities; conflicts with the interests of heirs under wills or laws of intestacy may arise; jurisdiction issues, and so on).
• There are concerns over the business viability and longevity of the market and services.
• There are concerns about security and identity theft (the services store passwords and keys to valuable assets and personal data) (Edwards and Harbinja, 2013b).
Cahn and Lamm et al. also identify most of these problems (Cahn, 2014; Lamm et al., 2014, 400–1). Öhman and Floridi criticise the services from a philosophical perspective, submitting that they commercialise death and dying and violate the dignity of the deceased (Öhman and Floridi, 2017). It is thus not recommended that the services be used in their current form and with the law as it stands now. However, with improvements in the services and their recognition by the law, they have the potential to be used more widely in the future. In principle, the services are more suitable for the digital environment, as they recognise the technological features of digital assets and enable automatic transmission on death. Because of the issues described above, however, the author does not envisage that they will be recognised as legitimate in the near future, at least not outside the USA.

In the UK, there is currently a widespread practice whereby solicitors advise testators to list their accounts and passwords for their heirs to use after their death (Law Commission, 2017). This solution is in breach of most user agreements, which could conceivably lead to premature termination of the account. Passwords change over time and testators may not remember to update the list they prepared at the time they made their will. Leaving a list like this is also very insecure and leaves users vulnerable to security breaches and hacking. Therefore, although this advice may be practical, following it does not overcome the issues discussed above.

Conclusions

This chapter has examined key issues related to digital assets and inheritance. The area is very complex, and digital assets need to be examined individually in order to determine what regulatory and legal regime is best suited to deal with their post-mortem transmission. Most assets involve a myriad of legal relationships, and these surface differently for different types of asset. For instance, copyright might be more important for an asset that includes a number of photographs, poems or stories, property might be in question for music libraries, and post-mortem privacy is a more obvious concern for assets that are more intrinsically related to an individual. Notwithstanding this distinction, all of them have in
common that they are governed primarily by intermediary contracts, and the lack of laws and regulation gives these contracts prevalence in many jurisdictions. There have been some innovative solutions in countries such as the USA and France, but there is still work to do to implement these in practice and test them as technology develops. Other countries, such as the UK, still have a long way to go, and this author has been involved in the ongoing reforms in the area. It is important that legislators and regulators in this country follow good examples and clarify this muddled and complex area. Technology is one way to go, but innovative service provider solutions may not be followed by legislative reforms and often conflict with longstanding legal principles in succession, intellectual property, privacy or property law. It is therefore necessary to introduce specific laws on digital assets in countries where this is not the case, such as the UK. These would ideally recognise user autonomy and post-mortem privacy, which could be expressed in one’s will as well as through in-service solutions. In the UK, this could be done as part of the ongoing reform of the law of wills, and by amending data protection laws. Generally, legislators in all countries that aim to legislate in the area need to make sure that their property, contract, IP, data protection and succession laws are consistent, otherwise efforts in one area may be undermined by conflicting provisions in other areas of law.

In addition, there should be exceptions to these provisions in order to enable access by researchers and archivists in particular, and to balance the rights to privacy and autonomy with the freedom of expression and the interests of the public. This is particularly important where users may choose to be forgotten post-mortem – to have most of their accounts and data deleted. Here, as generally required by data protection and many other laws, privacy and individual rights need to be balanced against the rights of the public, such as the right to freedom of expression, and the right to pursue individual research and archival interests. This does not prevent legislators and users from regulating and promoting user autonomy over digital assets and online data on death.

References

Ajemian v. Yahoo!, Inc. (2017) SJC-12237.
Atwater, J. (2006) Who Owns Email? Do you have the right to decide the disposition of your private digital life?, Utah Law Review, 397–418.
Băbeanu, D., Gavrilă, A. A. and Mareş, V. (2009) Strategic Outlines: between value and digital assets management, Annales Universitatis Apulensis Series Oeconomica, 11, 318–19.
Baker v. Bolton (1808) 170 English Reports 1033.
Banta, N. (2017) Property Interests in Digital Assets: the rise of digital feudalism, Cardozo Law Review, 38, https://ssrn.com/abstract=3000026.
BBC News (2005) Who Owns Your E-mails?, 11 January, http://news.bbc.co.uk/1/hi/magazine/4164669.stm.
Beyer, G. W. and Cahn, N. (2012) When You Pass On, Don’t Leave the Passwords Behind: planning for digital assets, Probate & Property, 26 (1), 40.
Cahn, N. (2011) Postmortem Life On-Line, Probate and Property, 25 (4), 36–41.
Cahn, N. (2014) Probate Law Meets the Digital Age, Vanderbilt Law Review, 67, 1697–1727.
Cahn, N., Kunz, C. and Brown Walsh, S. (2016) Digital Assets and Fiduciaries. In Rothchild, J. A. (ed.), Research Handbook on Electronic Commerce Law, Edward Elgar, https://ssrn.com/abstract=2603398.
Carroll, E. (2018) 1.7 Million US Facebook Users Will Pass Away in 2018, The Digital Beyond, 23 January, www.thedigitalbeyond.com/2018/01/1-7million-u-s-facebook-users-will-pass-away-in-2018/.
Castex, L., Harbinja, E. and Rossi, J. (2018, forthcoming) Défendre les Vivants ou les Morts? Controverses sous-jacentes au droit des données post-mortem à travers une perspective comparée franco-américaine, Réseaux.
Cohen, J. (2000) Examined Lives: informational privacy and the subject as object, Stanford Law Review, 52, 1373–426.
Conway, H. and Grattan, S. (2017) The ‘New’ New Property: dealing with digital assets on death. In Conway, H. and Hickey, R. (eds), Modern Studies in Property Law, Hart Publishing.
Council of the European Union (1993) Council Directive 93/98/EEC harmonizing the term of protection of copyright and certain related rights, 29 October, OJ L290/9.
Cuijpers, C. (2007) A Private Law Approach to Privacy: mandatory law obliged?, SCRIPTed, 24 (4), 304–18.
Darrow, J. and Ferrera, G. (2006) Who Owns a Decedent’s E-Mails: inheritable probate assets or property of the network?, NYU Journal of Legislation and Public Policy, 10, 281–321.
Denton, J. (2016) More than 60% of the UK Population Has Not Made a Will, This is Money, 26 September, www.thisismoney.co.uk/money/news/article-3807497/Nearly-60-Britons-not-written-will.html.
Digital Legacy Association (2017) Digital Death Survey 2017, https://digitallegacyassociation.org/wp-content/uploads/2018/02/Digital-Legacy-Association-Digital-Death-Survey-Data.htm.
Edwards, L. and Harbinja, E. (2013a) Protecting Post-Mortem Privacy: reconsidering the privacy interests of the deceased in a digital world, Cardozo Arts and Entertainment, 32 (1), 101–47.
Edwards, L. and Harbinja, E. (2013b) What Happens to My Facebook Profile When I Die? Legal issues around transmission of digital assets on death. In Maciel, C. and Pereira, V. (eds), Digital Legacy and Interaction: post-mortem issues, Springer.
Erlank, W. (2012) Property in Virtual Worlds, PhD dissertation, Stellenbosch University, http://ssrn.com/abstract=2216481.
European Parliament and Council (2006) Directive 2006/116/EC on the term of protection of copyright and certain related rights (codified version), 12 December, OJ L372/12 (Copyright Term Directive).
Facebook (2018a) Special Request for a Medically Incapacitated or Deceased Person’s Account, https://en-gb.facebook.com/help/contact/228813257197480.
Facebook (2018b) What Data Can a Legacy Contact Download from Facebook?, https://www.facebook.com/help/408044339354739?helpref=faq_content.
Gambino, L. (2013) In Death, Facebook Photos Could Fade Away Forever, Associated Press, 1 March, www.yahoo.com/news/death-facebook-photos-could-fade-away-forever-085129756--finance.html.
Google (2018) About Inactive Account Manager, https://support.google.com/accounts/answer/3036546?hl=en.
Harbinja, E. (2013) Does the EU Data Protection Regime Protect Post-Mortem Privacy and What Could Be the Potential Alternatives?, SCRIPTed, 10 (1), 19–38.
Harbinja, E. (2014a) Virtual Worlds: a legal post-mortem account, SCRIPTed, 11 (3), 273–307, https://script-ed.org/article/virtual-worlds-a-legal-post-mortem-account/.
Harbinja, E. (2014b) Virtual Worlds Players: consumers or citizens?, Internet Policy Review, 3 (4), https://policyreview.info/articles/analysis/virtual-worlds-players-consumers-or-citizens.
Harbinja, E. (2016a) Legal Nature of Emails: a comparative perspective, Duke Law and Technology Review, 14 (1), 227–55, http://scholarship.law.duke.edu/dltr/vol14/iss1/10.
Harbinja, E. (2016b) What Happens to our Digital Assets When We Die?, Lexis PSL, November.
Harbinja, E. (2017a) Digital Inheritance in the United Kingdom, The Journal of European Consumer and Market Law, December.
Harbinja, E. (2017b) Legal Aspects of Transmission of Digital Assets on Death, PhD dissertation, University of Strathclyde, 18–25.
Harbinja, E. (2017c) Post-mortem Privacy 2.0: theory, law and technology, International Review of Law, Computers and Technology, 31 (1), 26–42, www.tandfonline.com/doi/citedby/10.1080/13600869.2017.1275116?scroll=top&needAccess=true.
Harbinja, E. (2017d) Post-mortem Social Media: law and Facebook after death. In Gillies, L. and Mangan, D. (eds), The Legal Challenges of Social Media, Edward Elgar.
Harbinja, E. (2017e) Social Media and Death. In Gillies, L. and Mangan, D. (eds), The Legal Challenges of Social Media, Edward Elgar.
Haworth, S. D. (2014) Laying Your Online Self to Rest: evaluating the Uniform Fiduciary Access to Digital Assets Act, University of Miami Law Review, 68 (2), 535–59.
Hopkins, J. P. (2014) Afterlife in the Cloud: managing a digital estate, Hastings Science and Technology Law Journal, 5 (2), 209–44.
House of Lords Select Committee on the European Communities (1992) Report of the Protection of Personal Data.
In Re Ellsworth (2005) No. 2005-296, 651-DE, Mich. Prob. Ct., 4 March.
Kammergericht (2017) Urteil zu Lasten der klagenden Mutter – kein Zugriff der Eltern auf Facebook-Account ihrer verstorbenen Tochter, 31 May, Aktenzeichen 21 U 9/16, https://www.berlin.de/gerichte/presse/pressemitteilungen-der-ordentlichen-gerichtsbarkeit/2017/pressemitteilung.596076.php.
Kasket, E. (2013) Access to the Digital Self in Life and Death: privacy in the context of posthumously persistent Facebook profiles, SCRIPTed, 10 (1), 7–18.
Kulesza, A. (2012) What Happens to Your Facebook Account When You Die?, blog, 3 February, http://blogs.lawyers.com/2012/02/what-happens-to-facebook-account-when-you-die/.
Lamm, J. et al. (2014) The Digital Death Conundrum: how federal and state laws prevent fiduciaries from managing digital property, University of Miami Law Review, 68 (2), 384–420.
Lastowka, G. and Hunter, D. (2006) Virtual Worlds: a primer. In Balkin, J. M. and Noveck, B. S. (eds), The State of Play: laws, games, and virtual worlds, NYU Press, 13–28.
Laudon, K. (1996) Markets and Privacy, Communications of the ACM, 39 (9).
Law Commission (2017) Making a Will, consultation paper, https://s3-eu-west-2.amazonaws.com/lawcom-prod-storage-11jsxou24uy7q/uploads/2017/07/Making-a-will-consultation.pdf.
Lessig, L. (2006) Code, Version 2.0, Basic Books.
Litman, J. (2000) Information Privacy/Information Property, Stanford Law Review, 52 (5), 1283–1313.
Lopez, A. B. (2016) Posthumous Privacy, Decedent Intent, and Post-mortem Access to Digital Assets, George Mason Law Review, 24 (1), 183–242.
Mazzone, J. (2012) Facebook’s Afterlife, North Carolina Law Review, 90 (5), 1643–85.
McAfee (2013) How Do Your Digital Assets Compare?, 14 May, https://securingtomorrow.mcafee.com/consumer/family-safety/digital-assets/.
McCallig, D. (2013) Facebook After Death: an evolving policy in a social network, International Journal of Law and Information Technology, 22 (2), 107–40.
McKinnon, L. (2011) Planning for the Succession of Digital Assets, Computer Law and Security Review, 27, 362–7.
Mell, P. (1996) Seeking Shade in a Land of Perpetual Sunlight: privacy as property in the electronic wilderness, Berkeley Technology Law Journal, 11, 1–79.
Öhman, C. and Floridi, L. (2017) The Political Economy of Death in the Age of Information: a critical approach to the digital afterlife industry, Minds & Machines, 27 (4), 639–62, https://doi.org/10.1007/s11023-017-9445-2.
Pennington, N. (2013) You Don’t De-Friend the Dead: an analysis of grief communication by college students through Facebook profiles, Death Studies, 37 (7), 617–35.
Perrone, M. (2012–2013) What Happens When We Die: estate planning of digital assets, CommLaw Conspectus, 21, 185–210.
Purtova, N. (2010) Private Law Solutions in European Data Protection: relationship to privacy, and waiver of data protection rights, Netherlands Quarterly of Human Rights, 28 (2), 179–98.
PwC (2013) Digital Lives: we value our digital assets at £25 billion, PricewaterhouseCoopers, https://www.pwc.co.uk/issues/cyber-security-data-privacy/insights/digital-lives-we-value-our-digital-assets-at-25-billion.html.
Rackspace (2011) Hosting Generation Cloud: a social study into the impact of cloud-based services on everyday UK life, 16 November, https://web.archive.org/web/20111027035813/http://www.rackspace.co.uk:80/uploads/involve/user_all/generation_cloud.pdf.
Sancya, P. (2005) Yahoo Will Give Family Slain Marine’s E-Mail Account, USA Today, 21 April, http://usatoday30.usatoday.com/tech/news/2005-04-21-marine-email_x.htm?POE=TECISVA.
Schwartz, P. (2003) Property, Privacy, and Personal Data, Harvard Law Review, 117, 2056–2128.
Soldier’s Kin to Get Access to his Emails (2005) press release, 21 April, www.justinellsworth.net/email/ap-apr05.htm.
ULCC and CHLC (2016) Uniform Access to Digital Assets by Fiduciaries Act (2016), Uniform Law Conference of Canada and Conférence pour l’harmonisation des lois au Canada, https://www.ulcc.ca/images/stories/2016_pdf_en/2016ulcc0006.pdf.
Uniform Law Commission (2018) Fiduciary Access to Digital Assets Act, Revised (2015): introductions and enactments, www.uniformlaws.org/Act.aspx?title=Fiduciary%20Access%20to%20Digital%20Assets%20Act,%20Revised%20(2015).
Westin, A. (1967) Privacy and Freedom, Atheneum.
Williams, K. (2012) Facebook Saga Raises Question of Whether Users’ Profiles Are Part of Digital Estates, Huffington Post, 15 March, https://www.reddit.com/r/privacy/comments/r6rz9/karen_williams_facebook_saga_raises_question_of_/.
Zarsky, T. (2004) Desperately Seeking Solutions: using implementation-based solutions for the troubles of information privacy in the age of data mining and the internet society, Maine Law Review, 56, 13–59.


2 Curbing the online assimilation of personal information
Paulan Korenhof

Introduction

With the current popularity and increasingly important role of the world wide web in western society, many media controllers of previously offline information collections have chosen to follow this public shift towards the online information realm and have uploaded their collection(s) to the web. However, uploading information collections to the web is not without consequences for individuals if the collection contains personal information. This became clear in what is taken to be the first big ‘right to be forgotten’ (RTBF) case, which revolved around two articles from 1998 in the online archive of the Spanish newspaper La Vanguardia (CJEU, 2014). The two articles consisted of just a few sentences and referred to a forced sale of property of a citizen – let’s call him ‘G’ – due to a social security debt. G was disturbed by the online availability of these articles because the issue had long been resolved. When his attempts to have the articles removed from La Vanguardia’s website failed, he focused his wish ‘to be forgotten’ on Google Search, which displayed the articles as top results for a search on his full name. In the end, the Court of Justice of the European Union (CJEU) decided the case in G’s favour and ordered Google Search to remove the URLs linking the 1998 articles to G’s name from the search results. The CJEU held the opinion that an individual can have the right to have certain information removed (in
this case the search results) – even if it is truthful information – if the information ‘at this point in time . . . appears, having regard to all the circumstances of the case, to be inadequate, irrelevant or no longer relevant, or excessive in relation to the purposes of the processing’ (CJEU, 2014).

The Google Spain case is a foretaste of the ‘official’ RTBF, which is formulated in Article 17 of the General Data Protection Regulation (GDPR), which came into force in May 2018. It gives individuals, under certain conditions, the right to ‘obtain from the controller the erasure of personal data concerning him or her’ (GDPR, Article 17). (In this chapter I take the GDPR as given, now implemented, and will not spend time arguing about all its problematic aspects and weaknesses. I will treat the GDPR as a tool, albeit a forced one.) The RTBF is meant to aid individuals in moving beyond their past in the current information age by erasing information that ‘with the passing of time becomes decontextualized, distorted, outdated, no longer truthful (but not necessarily false)’ (Andrade, 2014, 70).

Throughout its development the RTBF has been – and still is – the topic of a heated debate, with opposing opinions on the legal and ethical issues of deleting and/or constricting access to ‘public information’ and the role of technology therein. In these discussions the values of freedom of speech and privacy are often pitted against each other. Freedom of speech advocates (e.g. Baker, 2014) especially look at the judgment in the Google Spain case with horror and regard the RTBF as censorship and even ‘the biggest threat to free speech on the Internet in the coming decade’ (Rosen, 2012). Despite the heated debate and great displeasure in at least a part of the legal, ICT and information professional communities, the RTBF is at our doorstep. It has been presented as a solution to the problem of balancing the impact of digital technologies with access to personal information. But what is the problem with personal information in online public information collections that has led information professionals to now find themselves confronted with this ‘solution’? Instead of focusing on how the RTBF impacts the media, I will concentrate in this chapter on ‘why’ there is a wish for such a ‘solution’ in the first place. Addressing the ‘why’ is a more beneficial topic to help us move forward, because the RTBF is not in all cases a perfect solution to the problems raised by the online availability of information collections,
nor even the best balanced one (Korenhof, 2018, forthcoming). We should therefore take the development of the RTBF as a signal that our information flows are changing to such an extent that a public concern exists over the presence of personal information in the public sphere. I will explain this ‘why’ by exposing the problems that a public information collection can raise for individuals when it is absorbed by the web. I will not go into the manner in which the digitalisation of archives affects the archive and its operations, but will leave that to the archivists. By focusing on the ‘why’ in this context I hope to give readers a broader frame to think about a ‘how’ that is more elegant and tailored to some of the issues at hand than the upcoming RTBF. The issues need rethinking and may be better addressed ‘from the inside’. This chapter is therefore intended to kick-start an interdisciplinary discussion on how to shape our public information flows in the light of the web. Finding the answers to the issues raised will require a communal effort and depends highly on the purposes of the information collection in question.

This chapter consists of three main parts: the ‘what’, the ‘why’ and the ‘how’. The ‘what’ part addresses what it means when information sources are assimilated by the web. In the ‘why’ part, I discuss why this can raise problems for individuals. In the ‘how’ part I briefly explain how the RTBF aims to solve this, and what other options may be feasible. I conclude by opening up the floor for an interdisciplinary approach to (re)think the solution to the problems at hand. Before going into the ‘what’, ‘why’ and ‘how’, let us start with some background.

Background

Although the Google Spain case referred to in the introduction is a legal case, the questions underlying it have an ethical, economic, political and philosophical nature. The case is at its core tied to the fact that in our lives most information is brought to us by technology. Addressing such questions is therefore not just the work of legal scholars, but also the work of those working with, studying and developing information services and structures, whether from a practical, ethical, philosophical, historical, economic or political perspective. Addressing these issues requires an interdisciplinary effort. In order to
contribute to this effort I will cover a certain angle and discuss what it means when information collections are made available online, which issues we may need to address, and what kind of solutions we can think about. I will do this from an applied philosophy of information technology perspective. At the core of this perspective lies the fact that technology is inherently non-neutral; it allows us to perceive, experience and perform things we could otherwise not, and allows us to set goals which were unavailable, or even non-existent, without these technologies (cf. Verbeek, 2005). Technologies ‘suggest, enable, solicit, prompt, encourage, and prohibit certain actions, thoughts, and affects or promote others’ (Lazzarato, 2014, 30), so technology affords us certain uses and/or provides us with certain options (Gibson, 2014, 121). By affording us options and changing what we can do in our environment, technology has a normative impact on our lives (Hildebrandt, 2015, 163).

One of the most important affordances of technology is the retention and conveying of information. Information retention within a technology is what Stiegler labels ‘tertiary memory’ (Stiegler, 1998). Our experiences, ideas and knowledge are exteriorised and inscribed into technologies (Stiegler, 2010, 9). As tertiary memory, technologies afford the ability to ‘make present what is absent in time and/or space’ (Hildebrandt, 2015, 38). They can convey information about individuals who are physically not around or about events that lie in the past – thereby imbuing the absent with a certain presence. However, not every piece of retained information is equally accessible, highlighted, revealed, usable and/or understandable for agents. The manner in which an information source is technologically constituted influences how the information is ‘present’ for people. The presence of information is the quantitative and qualitative proximity of a specific informational referrer in time and space for human agents compared with other informational referrers in the tertiary memory (Korenhof, 2018).

Technology can affect the presence of information on a quantitative level in different ways. For example, technology can extend the existence of information in space and time and thereby increase the likelihood that an agent is exposed to it. This can be as simple as keeping a photo of a deceased loved one. Technology can also affect the presence of information on a qualitative level by
signalling the meaning or authority of information by ranking or highlighting it. An example is the prominent placement of certain articles on the front page of a newspaper. The manner in which information is presented to a specific individual influences the chance that she encounters the information, pays attention to it, and interprets it (Korenhof, 2018). With this framework in the background, I will discuss how the web and its corresponding applications affect the presence of information to such an extent that information that was traditionally already public for a long time has become a new cause for concern.

What: online assimilation of information collections

In order for public information collections to be offered online, the information needs to be digitised and embedded in the web. Information is digitised by turning it into a discrete set of binary values (‘bits’, each represented by either a value 0 or 1) so that it becomes machine readable. This binary format allows certain typical information-processing techniques, like transmission through cables and the ether and flawless infinite replication (Ross, 2013). However, in this digital form, ‘nothing is stored but code: the mere potential for generating an image of a certain material composite again and again by means of numerical constellations’ (Blom, 2016). The result is that in order to perceive the digitised information, we need an electronic device that interprets and parses the binary set into a human-readable form (Chabot, 2013, 63). Digital information thus has an ‘intermittent existence’ (Ross, 2013); it is not there for us until it is generated by an interface, in the case of the web, a browser. Thus the device and browser interface always mediate our experience of online information and are at the same time the ‘world’ in which the information exists for us (Hui, 2014, 52). An online information collection is thus pressed within the rationale of the browser interface that allows us to navigate hyperlinks, search content and copy, paste and save information objects (Manovich, 2001, 76).
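As a toy illustration of this point, the short Python snippet below digitises a string into bits and shows that the bits only become human-readable again once software decodes them. The example text and encoding choices are mine, introduced purely for illustration; they are not drawn from the chapter or from any particular system.

# Digitisation in miniature: text stored as bits is unreadable to humans
# until an interpreting device/interface parses it back into readable form.

text = 'forced sale of property, 1998'

# Digitise: turn the characters into a sequence of binary values (0s and 1s).
bits = ''.join(f'{byte:08b}' for byte in text.encode('utf-8'))
print(bits[:64], '...')   # what is actually stored: just a stream of bits

# Only decoding regenerates the human-readable text - the 'intermittent existence'.
decoded = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode('utf-8')
print(decoded)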

Once digitised, the information collection can be uploaded online, from where it can be accessed at high speed simultaneously from multiple locations, irrespective of where the server that stores the information is located (Ross, 2013), though there can be restrictions, like states blocking content from specific locations. Moreover, not everyone can benefit equally from the web as a ubiquitous information source; the web’s dependency on high-level technological resources puts access at a certain price. Those with insufficient funds have a more difficult time accessing the internet, and may depend on public access resources. There is almost no time restriction on access; the web is always ‘on’, although web pages can close on certain days or at certain times, for example for maintenance purposes or on religious grounds. Online information collections therefore have very high accessibility, and the embedding of an information collection in the web often has far-reaching consequences for its content; when ‘the Net absorbs a medium, it re-creates that medium in its own image. It not only dissolves the medium’s physical form; it injects the medium’s content with hyperlinks, breaks up the content into searchable chunks, and surrounds the content with the content of all the other media it has absorbed’ (Carr, 2011, 90). The information is assimilated by the web’s massive information space alongside other online content (as long as the storage server is running) and there it becomes open to many of the social, technological and economic forces that shape the information flows of the web. The online information can be associated with other information objects, connected by hyperlinks, processed in overviews and easily copied and uploaded elsewhere on the web. Moreover, it can be ‘mined’ and commodified by many of the agents that have found ways to make a profit by processing online information (Anthes, 2015, 28–30).

Housing so many different information types, sources and collections, the web has led to a ‘convergence of previously separate realms of knowledge and practice by reducing the distance between things and individuals and by synchronizing memory through media technologies’ (Hui, 2016). However, by containing this much information, the web also poses a challenge for its users; it houses far more information than a single individual can process. The result is that users increasingly have to rely on index and retrieval systems (Leroi-Gourhan, 1993, 262–3). A pivotal role here is played by search engines, so I will discuss their impact in more detail.

Search engines are essential for retrieving information from the web and much use is made of them; Google Search itself states that it
processes roughly 40,000 searches per second (Google, 2018a). The use of search engines is even visually and practically incorporated in most web browsers and thereby established as the norm (Stanfill, 2015, 1060). Search engines index as much of the web as they can. On request they produce for the user a ranked search result overview from their index based on a specific search string, while taking specifics of the user into account, like the user’s location and search history (Vaidhyanathan, 2012, 21). In order to produce the ranked search result overview, the search engine uses algorithms – ‘sets of defined steps structured to process instructions/data to produce an output’ (Kitchin, 2017). Owing to the prominent role of search engines, the algorithms underlying them are ‘now a key logic governing the flows of information on which we depend’ (Gillespie, 2014, 167), but these algorithms are man-made and inscribed with values set by their designers (König and Rasch, 2014, 13). The designers set the base for what is considered relevant and thereby inscribe the search engine with their own norms, preferences and knowledge background. Thus search engines give rise to a certain bias, expressed in the manner in which they shape the information flows towards the user by including, excluding and ranking information, ‘leading unavoidably to favoring certain types of information while discriminating against others’ (König and Rasch, 2014, 13). The result is, for instance, that many search engines are inclined to focus on English websites (Gillespie, 2014, 177) or websites from western countries (Kuo, 2017).

The views of the designers on what information is ‘relevant’ thus play an important role in search result ranking. Generally the ‘relevance’ evaluation inscribed into search engines is closely tied to the popularity of websites (Gillespie, 2014, 177). An example of this is Google’s PageRank algorithm; PageRank ranks websites by the number of links to the website as well as the importance attributed to the sources linking to the website (Page et al., 1999). Often-linked-to websites gain a certain ‘authoritative’ status and are more likely to end up (high) in the search results (Pasquale, 2015, 64), for example news media and national archive websites.
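To illustrate the link-based popularity logic just described, the sketch below runs a toy version of the idea behind PageRank over a small hypothetical web of four pages. It is not Google’s production algorithm, which is far more complex and proprietary; the page names, damping factor and iteration count are illustrative assumptions only.

# Toy power-iteration sketch of the link-based ranking idea behind PageRank.
# Hypothetical four-page web; purely to show how rank follows from who links
# to whom and how much rank the linking pages themselves hold.

damping = 0.85       # damping factor as in Page et al. (1999)
iterations = 50

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    'news-archive': ['blog', 'forum'],
    'blog': ['news-archive'],
    'forum': ['news-archive', 'personal-page'],
    'personal-page': ['news-archive'],
}

pages = list(links)
rank = {page: 1.0 / len(pages) for page in pages}   # start from equal rank

for _ in range(iterations):
    new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share   # each link passes on part of the linker's rank
    rank = new_rank

# The often-linked-to 'news-archive' ends up ranked highest, mirroring how
# 'authoritative' sources rise to the top of search results.
for page, score in sorted(rank.items(), key=lambda item: item[1], reverse=True):
    print(f'{page}: {score:.3f}')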

Search engines thus redefine the information they have indexed into their own frame of relevance. With this, the expressive power of the original information collection, like an archive’s own database and index system, is bypassed by the expressive power of the search engines’ retrieval mechanisms (Gillespie, 2014, 171). Search engines therefore have a relatively powerful position in establishing the presence of personal information; they affect its presence at a quantitative (access speed) as well as a qualitative (ranking) level. Search engines also have a certain interest; they are industries of information retrieval, which ‘have taken advantage of a fragmented media market to establish their power as distributors of traffic’ (Couvering, 2008). While it is beneficial for many public media sites to receive traffic through search engines, the main profiteer of such indexing is the search engine itself, which profits by mediating the content of others. Given their powerful role, we need to consider the implications of the commercial interests of search engines for our information flows (Hargittai, 2000).

Why: the data subject’s wish to be forgotten

A wide variety of individuals consider the manner in which the web mediates personal information problematic. Requests for the erasure of search results cover a broad range of topics, varying from crime-related information to mundane personal information like opinions on computer games or a lost pet (McIntosh, 2015). What is striking is that the targeted content also includes content to which individuals have contributed voluntarily, like opinion polls and interviews about life experiences and health (McIntosh, 2015). This suggests that the online assimilation of the information leads to a different, and likely more extensive, presence of information than the individuals expected when they participated in the publication. I believe it would be worthwhile to research the exact motivation of individuals who voluntarily co-operated with the publication of an article and then filed an RTBF request about this content. Unfortunately no extensive research has been done on this front yet.

The main issues that individuals are likely to experience as a result of the online assimilation of their personal information are caused by the web’s global accessibility, the convergence of previously separate realms of knowledge, and the fact that, once online, personal information can easily become the object of various kinds of information processing. With global accessibility and the convergence of different contextual knowledge realms, the separating power of space is nullified and the
contextual demarcations that we are used to expecting in our informational interactions are missing. Separations like public–private, work–leisure and national–international are diffuse online – if they exist at all. The result is that online personal information is often open to a more widely mixed social and cultural audience than we may have foreseen, and the information may be viewed and used in unexpected and unintended contexts (Brown, 2008). The audience in the unintended context may lack vital background knowledge to interpret the information correctly, or hold views that stand in stark contrast to those of the intended context. Think for example of an individual who was interviewed about her homosexuality in a youth magazine for teenagers in the Netherlands, and who now has to travel to a country where homosexuality is not socially accepted, and is even punishable by law. Information that is unproblematic in one context can be problematic, or even detrimental, in another. The vulnerability of the contextual integrity and authenticity of online information is further increased by the digital character of the information, which allows easy alteration, replication and transmission; the information can easily be copied, edited, linked or pasted elsewhere online. The online assimilation of personal information can thus heavily affect its presence and places the contextual integrity of the information at stake (cf. Nissenbaum, 2009).

Moreover, the easy global accessibility reduces the effort – and with that the required motivation – needed to access information (Oulasvirta et al., 2012); with a few clicks users can access content from any location. The audience of online information objects is thus bigger than that of their offline counterparts: ‘Now, however, millions of people who cannot or do not want to go to the archives are accessing them in digital form’ (Stallybrass, 2007).

One of the most extensive effects of the online assimilation of personal information is caused by search engines. By allowing name searches, a search engine can provide the user with an overview of the appearances of a particular name, pinpointing its location in a diverse array of online information sources and documents. While not every name search is equally successful, and many return results on multiple individuals with the same name, the profiling mechanics of search engines are likely to narrow the returned results down to those that cross-match on certain points with the profile of the user, like a shared language and/or country. Therefore the chance is increased that
Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 34

34 PART 1 MEMORY, PRIVACY AND TRANSPARENCY

the overview is focused on a few people, or maybe even one individual. With their zoomed-in and combined overview, search engines work as ‘attention lenses; they bring the online world into focus. They can redirect, reveal, magnify, and distort. They have immense power to help and hide’ (Grimmelmann, 2010, 435). They do so according to their own views, evaluations and policies, however. By presenting personal information in a ranked overview according to their own rationale, search engines bypass the manner in which the original information sources categorise their information, and reshape and redefine the context and conditions of information retrieval. Their focus can highlight secondary information in articles and documents, thereby potentially neglecting and even flipping around the original context of the information by turning the name into the headline and the article in which it occurs into information about the individual.

Personal information in nationally relevant sources is particularly susceptible to being discovered via internet search due to the authority that is often attributed to news media and official government records. The result is that old information in these collections can outrank contemporary information in less authoritative sources. This is what happened in the Google Spain case; because of the authority attributed to the archive of the national newspaper La Vanguardia, the time factor lost relative importance and the articles from 1998 ended up as top search results for G’s name. The date on which files are uploaded can also play a significant role in the time stamp of the information object; an information object that originally dates from 1998 may be indexed as originating from the time that it is uploaded, thereby receiving a far more recent time stamp. Moreover, while the two articles originally consisted of only a few lines buried deep in the printed papers (page 13 in one edition and page 23 in the other), the zoom-in mechanics of the search engine pushed them to the front view. The information thus received a far more prominent position in the search results than in its original source. By emphasising and highlighting the presence of a certain name in a biased ranking, search engines can easily present the marginal, outdated or secondary as noteworthy and turn anyone into a spectacle on display (Pasquinelli, 2009). The impact of search engines is particularly relevant in the case of non-famous individuals with a limited online presence and an uncommon name (like the name of the author of this chapter).


The result of the online assimilation of information collections, especially combined with search engines, is that personal information therein can easily become part of our daily informational lives (Manoff, 2010). Especially if these collections contain old, outdated or context-bound personal information, the consequences may stretch far; with a persistent presence of the past, it becomes difficult for individuals to be seen in a contemporary light in the eyes of the internet user. This can hamper change, progress, personal development, an effective revision of opinions, career choices, ideologies, religious views, and so on. However, it is not just the past that can raise problems as a result of the increased presence of, and easy access to, personal information. A variety of corporations and agencies use information on the web to get a better understanding of the individuals whom they target in order to optimise the manner in which they can make a profit from them, deal with them, or deny them certain services. This happens especially at the level of big data; see, for example, Big Data Scoring (http://bigdatascoring.com). Global access can even entail severe risks; cross-national and/or cross-cultural access to and interpretation of personal information can in certain contexts raise risks for the well-being of the individual as a result of different legal and moral views. On the web, spatial and temporal distance thus often lose their separating power (Virilio, 1991, 13). This lack of distance between contexts, combined with precise and effortless retrieval mechanics, is likely the driving force behind the already substantial number of RTBF requests targeting Google Search results (Google, 2018b).

How: the RTBF and alternatives

In order to help individuals escape some of the consequences of overly and/or persistently available personal information on the web, the RTBF has been developed. The subject of Article 17 of the GDPR, the RTBF gives natural persons the right to have – under certain conditions – information relating to them erased (GDPR, Article 17(1)). An individual invokes her RTBF by requesting the controller of the targeted information to erase it. The controller has to comply, unless she can appeal to one of the exceptions to the RTBF. These exceptions entail inter alia the freedom of expression and the protection of the public interest with regard to historical and scientific information
(GDPR, Article 17(3)). Historical information that serves the public interest receives strong protection against erasure under the RTBF. This was evident, for example, in the Google Spain case, where the RTBF was denied with regard to the content of La Vanguardia’s archive. Moreover, the older the record, the bigger the chance that it falls outside the scope of the RTBF, because the GDPR does not protect the personal data of the deceased (GDPR, preamble 27). A good deal of historical public media is thus likely to be able to make a successful appeal to exceptions to the RTBF. Even in the case of ‘sensitive data’ – information about an individual’s race, sexual orientation, sex life, political and/or religious beliefs, health, biometric data and/or trade-union membership (GDPR, Article 9(1)) – exceptions are made for the retention of information for archiving purposes and the like (GDPR, Article 9(2)(j)).

What exactly the consequences of the RTBF will be, how it will work in practice and how it will evolve in jurisprudence remains to be seen. A few months before the GDPR went ‘live’, many companies and organisations struggled to get a grip on the demands of the GDPR, the question of how to be compliant, and how to implement the RTBF (Ram, 2017). Google Search had already been confronted with many RTBF requests and struggled with deciding on the conditions for approval and denial of these requests (Google, 2014). With its focus on the erasure of information, the RTBF poses serious challenges for online media. Combined with the uncertainty over how to implement the RTBF, I therefore propose that we neither wait to see how the GDPR requirements will evolve, nor rely on them to address the issues that we may find important. Some issues may be disproportionately addressed by the RTBF, while others may remain neglected. For example, much of the processing of data – especially sensitive data – is based on consent, but can we expect a regular individual sufficiently to foresee all the potential informational consequences if she consents to participation in a publication? Is this a problem of the amateur who co-operates in a publication, or does the professional community that shapes the information flows have a responsibility here to control the impact?

I would therefore invite information controllers to think proactively and outside the box about what impact the online assimilation of documents and files has on the presence of the personal information therein – and how possible issues that arise can be circumvented in a
manner proportional to the purpose of the information collection. One important issue to address is the convergence of different kinds of knowledge realms on the web. Does the purpose of the information collection require such a convergence? If not, controllers can look for ways to reduce or even prevent a convergence on different levels. One method they can consider is, for instance, the use of a ‘robots.txt’ file to prevent the retrieval of (certain) content by search engines. By denying the retrieval of certain information objects by external search engines, the media controller can keep control over the presentation and retrieval context of her own information collection. The main media can then still be retrieved by search engines, while the risk of disproportionately emphasising details, or of pushing the past forward as contemporary presence, can be averted to a great degree. By restricting the actions of online search engines in this manner, a part of the possible issues may be substantially reduced.
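
To make this tactic concrete, the sketch below shows a minimal, hypothetical robots.txt rule that asks compliant crawlers to stay out of one archive section, and uses Python’s standard urllib.robotparser module to check how such a rule is interpreted. The domain and paths are invented for illustration, and the approach of course only restrains crawlers that honour the robots exclusion protocol.

```python
from urllib import robotparser

# A minimal sketch of the rules a controller might publish at
# https://example-archive.org/robots.txt (the path is hypothetical).
RULES = """\
User-agent: *
Disallow: /archive/personal-notices/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# A compliant crawler should skip the shielded section but can still
# index the rest of the collection.
print(parser.can_fetch("*", "https://example-archive.org/archive/personal-notices/1998-auction.html"))  # False
print(parser.can_fetch("*", "https://example-archive.org/news/today.html"))  # True
```

A per-page noindex directive (in a meta tag or HTTP header) can achieve a similar effect for search engines that support it, removing a page from search results while leaving it readable for visitors who arrive via the archive itself.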

However, refusing to be indexed by external search engines is certainly not the only tactic that an information controller can apply to restrict the impact of the online assimilation of the personal information that she processes. Other methods to reduce the impact for individuals include the anonymisation of personal information or the use of pseudonyms, tactics mentioned in the GDPR (Article 89(1)). In the medical research profession, pseudonymisation and anonymisation are already important tools to safeguard the interests of patients while maintaining databases and publishing research (Neubauer and Heurix, 2011). This is a developing practice (albeit not always a voluntary one) for public information collections. In an online Belgian newspaper archive, an individual’s name was replaced with an ‘X’ in order to prevent the possible negative consequences of the easy and long-term availability of the information tied to the individual’s name (Cour de cassation de Belgique, 2016).
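
As a small illustration of what pseudonymisation can look like in practice, the sketch below replaces the names in a set of records with stable pseudonyms derived from a keyed hash, so that the same person keeps the same pseudonym across records while the name itself disappears from the published collection. The record layout, the names and the key are invented for the example and are not drawn from the Belgian case or any real archive.

```python
import hashlib
import hmac

# Hypothetical person-level records from a publication database.
records = [
    {"name": "Jan Janssens", "text": "Interviewed about a 1998 auction."},
    {"name": "An Peeters", "text": "Quoted in a piece on local sports."},
    {"name": "Jan Janssens", "text": "Mentioned again in a follow-up article."},
]

SECRET_KEY = b"replace-with-a-securely-stored-key"  # assumption: kept outside the archive

def pseudonym(name: str) -> str:
    """Derive a stable, non-reversible pseudonym for a name."""
    digest = hmac.new(SECRET_KEY, name.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return "person-" + digest[:8]

for record in records:
    record["name"] = pseudonym(record["name"])

# The two 'Jan Janssens' records now share one pseudonym, so the collection
# stays internally consistent without exposing the name to name searches.
print(records)
```

Keeping the key out of the published collection matters: without it the pseudonyms cannot be turned back into names, while the controller can still regenerate the pseudonym for a known name if a record has to be located later. Whether this amounts to pseudonymisation or full anonymisation depends, of course, on what other identifying details remain in the records.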

Reducing name-focused retrieval by search engines, or reducing the identifying power of the content itself, are by no means the only tactics that an information controller can use to shape the manner in which her information collection is presented online. In their article ‘A Critical Analysis of Privacy Design Strategies’, Colesky, Hoepman and Hillen (2016) identify several privacy tactics that can help to shape the access to information and the context in which it is accessed. One of them is minimisation. Minimising the collection of personal information is also one of the demands listed in the GDPR (Article 89(1)). It entails a selection of the content, whereby the information that is not needed is excluded, stripped or destroyed (Colesky, Hoepman and Hillen, 2016). Where it is necessary to retain content, there are several hide tactics that can help to nuance the presence of the information. We can restrict access (Colesky, Hoepman and Hillen, 2016), obfuscate the information and/or mix it with other information (Brunton and Nissenbaum, 2011). The latter tactic is often used by ‘reputation managers’ who aim to reshape an individual’s online reputation by adding an abundance of prominently present positive information to the web, thereby drowning the visibility of the targeted information (Ronson, 2016). Another tactic is separation (Brunton and Nissenbaum, 2011). By isolating information collections, or by distributing their content over different locations, we can reduce the risk that the information is combined into a more extensive view of a particular individual. However, whether this tactic can actually be used effectively on the web is still a matter of research – so far, the character of the web seems to promote the exact opposite: the convergence of diverse knowledge realms and contexts. The last tactic worth mentioning is abstraction (Colesky, Hoepman and Hillen, 2016). If content is summarised or grouped at a more general information level, the focus of the content shifts from particular individuals to a general line of information, and the informational visibility of the specific individuals who play a role in the content is reduced.
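
To illustrate the abstraction tactic, the sketch below summarises a set of invented person-level records into counts per year, so that what is published describes the general line of information rather than naming any individual. The fields and figures are hypothetical.

```python
from collections import Counter

# Hypothetical person-level records held by a controller.
records = [
    {"name": "A. Jansen", "year": 1998, "topic": "bankruptcy auction"},
    {"name": "B. de Vries", "year": 1998, "topic": "bankruptcy auction"},
    {"name": "C. Visser", "year": 2001, "topic": "bankruptcy auction"},
]

# Abstraction: publish only aggregate counts per year, not the people.
per_year = Counter(record["year"] for record in records)
for year, count in sorted(per_year.items()):
    print(year, count)  # 1998 2, then 2001 1
```

How far such summarising can go without undermining the purpose of the collection is exactly the kind of proportionality question raised above.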

The privacy-enhancing tactics briefly touched on here are likely not exhaustive; creative and fresh minds may find new tactics to reduce the presence of online personal information in line with the context and purpose of the information collection. Thinking about and applying such tactics can be worthwhile for information controllers, especially with an eye on potential RTBF requests; if effectively applied, the use of context- and purpose-binding tactics is likely to reduce and prevent RTBF requests, as well as their success and impact.

However, more than focusing on complying with RTBF requests or preventing them, it is important for media controllers to think about the impact that their online information sharing has on the people referred to in the information – in the short as well as the long run. Is this impact proportional to and in line with the purpose of the information? Should the content be part of our daily lives, or do
we flourish if it exists at a certain distance? And what about future technologies? Take for example facial recognition; it seems to be just a matter of time before we can retrieve personal information based on a photo that we took of someone on the street. We therefore need to think about what kind of input should give us access to what kind of personal information about someone.

Conclusion

The convergence of separate realms of knowledge, culture and territorial scope on the web raises the question of whether public information should be available across diverse national, cultural and temporal spheres. Some information may carry a risk for the individuals involved, even if it is considered harmless in its original context. In this context the categories marked as ‘sensitive’ in the GDPR make sense: the risk might be far broader than one can initially project. Personal information that is commonplace in one context can be detrimental, even deadly, in another. The RTBF may help to negate the worst damage, but it may be best to see it as an ultimum remedium; the erasure of information objects is not cut out for every case, nor is it very nuanced.

Luckily, next to raising problems, digital technologies also allow all sorts of new ways of interacting and processing. We can thus take a step back and creatively re-evaluate our informational playing field and try to find ways to shape it as we think best – we may find new ways to address the issues. It is therefore worthwhile to explore ways to tinker with the information flow of specific knowledge realms and information collections to provide certain protections for the life-development of individuals, while not requiring concessions in the goal of the information collection. This also requires us to think closely about the mediation and reuse of information collections by other parties, like search engines. Instead of silently accepting third-party information processing and mediation, it is important to think critically about how and what kind of information we do and do not want to be part of this flow. If we manage to implement privacy by design strategies like the ones pointed out in this chapter, we may be able to reach an acceptable level of privacy protection in public information collections that may reduce
the need for RTBF requests, with all their potentially crude consequences. However, I fully realise (especially since I am not a technician) that this is easier said than done. This is why interdisciplinary co-operation is so important if we are to decide how we should deal with personal information in the 21st century, especially since laws, society and (identification) technologies change over time. The future is uncertain.

Acknowledgements

I would like to thank TILT (Tilburg Institute for Law, Technology, and Society), PIlab (Privacy & Identity Lab) and SIDN (Stichting Internet Domeinregistratie Nederland) for having made my research possible. Additionally, I thank Ludo Gorzeman and Jeanne Kramer-Smyth for helpful suggestions and comments on this paper.

Bibliography
Hill, D. W. (2009) Reflections on Leaving Facebook, Fast Capitalism, 5 (2).
Lessig, L. (2009) Code: and other laws of cyberspace, Basic Books.

References
Andrade, N. N. G. de (2014) Oblivion: the right to be different . . . from oneself: re-proposing the right to be forgotten. In Ghezzi, A., Pereira, A. G. and Vesnić-Alujević, L., The Ethics of Memory in a Digital Age, Palgrave Macmillan.
Anthes, G. (2015) Data Brokers Are Watching You, Communications of the ACM, 58 (1), 28–30.
Baker, S. (2014) Inside Europe’s Censorship Machinery, Washington Post, September, https://www.washingtonpost.com/news/volokh-conspiracy/wp/2014/09/08/inside-europes-censorship-machinery/.
Blom, I. (2016) Introduction. In Blom, I., Lundemo, T. and Rossaak, E., Memory in Motion: archives, technology, and the social, Amsterdam University Press, 11–41.
Brown, J. J. (2008) Evil Bert Laden: ViRaL texts, community, and collision, Fast Capitalism, 4 (1).
Brunton, F. and Nissenbaum, H. (2011) Vernacular Resistance to Data Collection and Analysis: a political theory of obfuscation, First Monday, 16 (5).
Carr, N. (2011) The Shallows: what the internet is doing to our brains, WW Norton & Company.
Chabot, P. (2013) The Philosophy of Simondon: between technology and individuation, A&C Black.
CJEU (2014) Google Spain v. AEPD and Mario Costeja González, Court of Justice of the European Union, C-131/12.
Colesky, M., Hoepman, J.-H. and Hillen, C. (2016) A Critical Analysis of Privacy Design Strategies, Security and Privacy Workshop.
Cour de cassation de Belgique (2016) C.15.0052.F/1, 29 April.
Couvering, E. van (2008) The History of the Internet Search Engine: navigational media and the traffic commodity. In Spink, A. and Zimmer, M. (eds), Web Search: multidisciplinary perspectives, 177–206.
Gibson, J. J. (2014) The Ecological Approach to Visual Perception: classic edition, Psychology Press.
Gillespie, T. (2014) The Relevance of Algorithms. In Gillespie, T., Boczkowski, P. J. and Foot, K. A. (eds), Media Technologies: essays on communication, materiality, and society, MIT Press.
Google (2014) Letter to the EU’s Article 29 Working Party, https://docs.google.com/file/d/0B8syaai6SSfiT0EwRUFyOENqR3M/edit.
Google (2018a) Google Search Statistics, www.internetlivestats.com/google-search-statistics/.
Google (2018b) Search Removals under European Privacy Law, https://transparencyreport.google.com/eu-privacy/overview.
Grimmelmann, J. (2010) Some Skepticism About Search Neutrality. In Szoka, B. and Marcus, A. (eds), The Next Digital Decade: essays on the future of the internet, TechFreedom, 435.
Hargittai, E. (2000) Open Portals Or Closed Gates? Channeling content on the world wide web, Poetics, 27 (4), 233–53.
Hildebrandt, M. (2015) Smart Technologies and the End(s) of Law: novel entanglements of law and technology, Edward Elgar.
Hui, Y. (2014) What is a Digital Object? In Halpin, H. and Monnin, A. (eds), Philosophical Engineering: towards a philosophy of the web, Wiley Blackwell.
Hui, Y. (2016) On the Synthesis of Social Memories. In Blom, I., Lundemo, T. and Rossaak, E. (eds), Memory in Motion: archives, technology, and the social, Amsterdam University Press.
Kitchin, R. (2017) Thinking Critically About and Researching Algorithms, Information, Communication & Society, 20 (1), 14–29.
König, R. and Rasch, M. (2014) Reflect and Act! Introduction to the society of the query reader. In König, R. and Rasch, M. (eds), Society of the Query Reader: reflections on web search, Institute of Network Cultures.
Korenhof, P. (2018, forthcoming) The Collective Memory and the Online Assimilation of Personal Information.
Kuo, L. (2017) Digital Hegemony: almost all internet searches in Africa bring up only results from the US and France, Quartz Africa, https://qz.com/995129/google-searches-in-africa-mainly-bring-up-results-from-the-us-and-france/.
Lazzarato, M. (2014) Signs and Machines: capitalism and the production of subjectivity, Semiotext(e).
Leroi-Gourhan, A. (1993) Gesture and Speech, MIT Press.
Manoff, M. (2010) Archive and Database as Metaphor: theorizing the historical record, Portal: Libraries and the Academy, 10 (4), 385–98.
Manovich, L. (2001) The Language of New Media, MIT Press.
McIntosh, N. (2015) List of BBC Web Pages Which Have Been Removed from Google’s Search Results, BBC Internet Blog, June, www.bbc.co.uk/blogs/internet/entries/1d765aa8-600b-4f32-b110-d02fbf7fd379.
Neubauer, T. and Heurix, J. (2011) A Methodology for the Pseudonymization of Medical Data, International Journal of Medical Informatics, 80 (3), 190–204.
Nissenbaum, H. (2009) Privacy in Context: technology, policy, and the integrity of social life, Stanford University Press.
Oulasvirta, A., Rattenbury, T., Ma, L. and Raita, E. (2012) Habits Make Smartphone Use More Pervasive, Personal and Ubiquitous Computing, 16 (1), 105–14.
Page, L., Brin, S., Motwani, R. and Winograd, T. (1999) The PageRank Citation Ranking: bringing order to the web, Stanford InfoLab.
Pasquale, F. (2015) The Black Box Society: the secret algorithms that control money and information, Harvard University Press.
Pasquinelli, M. (2009) Google’s PageRank Algorithm: a diagram of cognitive capitalism and the rentier of the common intellect, Deep Search, 3, 152–62.
Ram, A. (2017) Tech Sector Struggles to Prepare for New EU Data Protection Laws, Financial Times, August, https://www.ft.com/content/5365c1fa-8369-11e7-94e2-c5b903247afd.
Ronson, J. (2016) So You’ve Been Publicly Shamed, Riverhead Books.
Rosen, J. (2012) The Right To Be Forgotten, Stanford Law Review, February, https://www.stanfordlawreview.org/online/privacy-paradox-the-right-to-be-forgotten/.
Ross, A. J. C. (2013) Distance and Presence in Analogue and Digital Epistolary Networks, Techné: research in philosophy and technology, 17 (2).
Stallybrass, P. (2007) Against Thinking, Publications of the Modern Language Association of America, 122 (5), 1580–7.
Stanfill, M. (2015) The Interface as Discourse: the production of norms through web design, New Media & Society, 17 (7), 1059–74.
Stiegler, B. (1998) Technics and Time: the fault of Epimetheus, Vol. 1, Stanford University Press.
Stiegler, B. (2010) For a New Critique of Political Economy, Polity.
Vaidhyanathan, S. (2012) The Googlization of Everything: (and why we should worry), University of California Press.
Verbeek, P.-P. (2005) What Things Do: philosophical reflections on technology, agency, and design, Penn State Press.
Virilio, P. (1991) The Lost Dimension, Semiotext(e).

3 The rise of computer-assisted reporting: challenges and successes
Brant Houston

Introduction

The rise in the number of journalists analysing data with the use of computers and software began in the mid-1980s. Widely known as computer-assisted reporting, the practice started in the USA with a handful of journalists in the late 1970s, grew significantly in the 1980s, spread to western Europe in the 1990s, and then to the rest of the world in the early 21st century. During its rise, the name for the practice has varied, with some researchers seeing an evolution of the practice with a different name for each era. But across the decades, the basic process of using data for news stories has remained the same: to acquire data, to identify and correct inaccuracies in the data, to analyse and visualise the data for meaning and possible stories, and to verify the completed news story for accuracy before publishing or airing. The purpose of the analysis itself has also remained constant. Whatever software is being used, the journalist uses methods to detect patterns or outliers within data sets, to provide context, or to examine trends (Berret and Phillips, 2016). Over time journalists have begun including unstructured data – text, audio, photos or video – in their analyses and have developed more sophistication with news infographics and interactive presentations.

One of the key elements in the use of data in journalism has been collaboration with library researchers and archivists, who also have
dealt with the issues of searching, analysing, storing and retrieving data efficiently. There has also been a developing consensus on a broader definition of the use of data in journalism, as suggested by data scientist Alexander Howard. Howard has defined ‘data journalism’ as the application of data science to journalism to extract knowledge from data and, more specifically, as gathering, cleaning, organising, analysing, visualising and publishing data to support the creation of acts of journalism (Howard, 2014). As a result of its broad definition, the term ‘data journalism’ has begun to encompass other terms such as precision journalism, computer-assisted reporting and computational journalism (Berret and Phillips, 2016). An even broader conceptual definition has stated that data journalism is thinking about how to frame an experiment, gather data and use rigorous methods to build evidence of some finding of journalistic importance.

The basics of computer-assisted reporting have included searching and capturing data online, especially since the advent of the web browser in 1993. While the collaboration between investigative journalists using data and newspaper researchers had been strong since the late 1980s, the web browser brought the two groups into deeper contact. Journalists’ contact with search concepts such as Boolean logic had begun, but once data was available on the web the journalists and researchers worked more closely in newsrooms and started sharing knowledge at conferences. For example, a top news researcher, Nora Paul, held a gathering in 1994 of about 20 journalists at the Poynter Institute for Media Studies where they shared techniques and experiences with data. Paul published a white paper from the meeting and later wrote the book Computer-Assisted Research.

Basic components of computer-assisted reporting are counting and summing by category of information, such as the total amount of contributions to each candidate in an election year. More advanced techniques include performing statistical tests to determine the likelihood of certain occurrences, creating digital maps, using social network analysis to examine relationships between individuals and institutions, and creating algorithms for tasks such as topic modelling, in which possible topics within a body of text are identified automatically.
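
As a concrete illustration of counting and summing by category, the sketch below uses Python’s built-in sqlite3 module to total contributions per candidate in a small, invented table of campaign finance records. The table, columns and figures are hypothetical and stand in for the much larger files a newsroom would actually obtain.

```python
import sqlite3

# Build a tiny, invented table of contribution records in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (candidate TEXT, donor TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?, ?)",
    [
        ("Smith", "Acme Corp", 500.0),
        ("Smith", "J. Doe", 250.0),
        ("Jones", "Acme Corp", 1000.0),
        ("Jones", "R. Roe", 75.0),
        ("Smith", "R. Roe", 125.0),
    ],
)

# The core computer-assisted reporting operation: group by category
# (the candidate), then count and sum the amounts, largest totals first.
query = """
    SELECT candidate, COUNT(*) AS gifts, SUM(amount) AS total
    FROM contributions
    GROUP BY candidate
    ORDER BY total DESC
"""
for candidate, gifts, total in conn.execute(query):
    print(candidate, gifts, total)
```

The same grouping could be done with a spreadsheet pivot table or a database manager; the point is the operation, not the tool.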

The methodologies have also included techniques of gathering and analysing datasets created from surveys, polling, sensors, crowd sourcing, web scraping, social media aggregation and field observation. Practitioners and researchers have also suggested that data journalism include multimedia or interactive presentations of data on the web, or the building of software applications for newsgathering or infographics in newsrooms (Berret and Phillips, 2016).

To understand the evolution of the use of data, some scholars and practitioners have said that there has been a sequence of digital techniques in journalism. They identify the sequence as starting with precision journalism in the 1970s, which concentrated on using statistical methods for news stories. Precision journalism has been defined as the use of social science and statistical analysis for news stories with an emphasis on investigative stories (Meyer, 2002). Following precision journalism came computer-assisted reporting in the 1980s, which emphasised counting, searching, sorting and cross-indexing, but also included statistical analysis and the visualising of data. After computer-assisted reporting came data journalism – a term initially used in western Europe – beginning in the first decade of the 21st century. Some practitioners have argued that data journalism is less focused on traditional journalism such as investigative and accountability reporting, and more on the content of the data itself and sharing that data with the public and other academic disciplines. In that definition, different kinds of data such as audio, photos and videos have been added to the categories of data for analysis. In the last decade, the term ‘computational journalism’ has emerged. It has been described as creating ways to employ algorithms for more efficient newsgathering, news writing and data analysis and presentation, with an emphasis on unstructured big data, machine learning and technologies to analyse structured and unstructured information to find and tell stories (Hamilton, 2016).

No matter what the descriptive term for the use of data by journalists, there is general agreement that the practice has transformed and elevated journalism because it has given journalists a deeper and more detailed understanding of the outcomes of laws and policies. It also has allowed reporters to bring context and detail to stories that in the past had relied on anecdotal cases and unconnected events. Furthermore, the practice has enabled newsrooms to become more transparent with the
public on their methodologies by sharing the data they have used for a story, and to elicit from the public both comment and content for stories.

As computer-assisted reporting has become more widespread and routine, it has given rise to discussion and debate over the ethical responsibilities of journalists. There have been criticisms over the publishing of data that was seen as intrusive and as violating the privacy of individuals. In the cases of datasets leaked by computer hackers or whistle-blowers, there have been efforts to uphold the traditional standards of fairness and accuracy and to consider the threats to the safety of individuals and institutions posed by publishing purloined data. In addition, the very access to, and understanding of, data has been a continuing struggle, whether that means acquiring it through open government datasets, requesting data through freedom of information laws, or scraping individual records from the web and organising them in software such as database managers. In many cases datasets are incomplete, out of date and inconsistent in such items as names. Furthermore, they may vary in file type, ranging from easy-to-use ‘comma separated values’ to frustrating scanned ‘portable document formats’ that require software to extract usable data for analysis.

The history of data in journalism

Some practitioners and scholars have suggested data journalism began with the listing and analysis of tabular information in newspapers. They have cited such examples as the publishing of tables of school enrolments and student costs by the British newspaper the Manchester Guardian in 1821 (Gray, Bounegru and Chambers, 2012), or an investigation by Horace Greeley’s newspaper in the 19th century, which compared US congressional travel reimbursement rates for mileage with the actual mileage of postal routes to the same locations (Klein, n.d.). The investigation revealed that congressmen charged for more mileage than was possible.

While those are valid comparisons with the concept of computer-assisted reporting and its principles of analysing structured data, the journalists behind these news stories clearly did not use computers.


Most scholars and practitioners restrict the definition of computer-assisted reporting to practices that involve the use of computers and software to analyse and present digital information. The first example usually cited is the CBS news project in 1952, which used a Universal Automatic Computer (UNIVAC) mainframe computer to predict the outcome of a US presidential election. By analysing polling data, CBS successfully predicted the result of the election, but did not publish the results until after polling day.

However, that example is regarded as a false start, and the actual beginning of computer-assisted reporting is viewed as the application of social science methods and statistical analysis by journalist Philip Meyer, who examined the causes of the Detroit racial riots in 1967 by surveying city residents, who answered questions about participation in the riots, attitudes on crime and punishment for looters, and their socioeconomic status. While city officials attributed the riots to uneducated migrants from the south of the USA, Meyer found that those who had attended college were just as likely to have participated in the riot and that a higher percentage of the rioters were from the north (Meyer, 2002). As he continued to apply those methods, Meyer wrote the textbook Precision Journalism: a reporter’s introduction to social science methods, which was published in 1973 and had several subsequent editions. His book provided instruction for doing data analysis and called for all journalists to learn data analysis and social science methods, saying the ante for being a journalist had been raised (Meyer, 2002). Meyer’s work included collaboration with investigative reporters, particularly Philadelphia Inquirer journalists Donald Barlett and James Steele, to analyse the sentencing patterns of defendants. Often, datasets had to be built, although another practitioner, Rich Morin at the Miami Herald, made an exceptional analysis of existing digital property assessment records in Florida (Houston, 2012).

From the late 1970s to the early 1980s, a small group of journalists across the USA either were taught by Meyer, who offered workshops for professional journalists at the University of North Carolina, or taught themselves how to do data analysis. In addition to the challenge of learning data journalism while still practising traditional news gathering, the small group had difficulty getting access to mainframe computers, since newspaper mainframe computers were used primarily for business purposes such as finance, advertising and circulation. Indeed, it was not until the 1990s that many newspapers
began to digitise their articles. And when reporters did get time on the mainframe, they had the hurdle of learning whatever analytical language had to be used on it. In many cases the newsroom did not have a mainframe computer available and reporters had to negotiate for access to a local university’s mainframe.

There was one impressive success with the use of a mainframe and with a new approach for journalists to analyse data. In 1976, a team of investigative journalists contracted with a University of Arizona professor to do social network analysis that diagrammed the connections and influence of a powerful group of 40 persons in Phoenix, Arizona. The analysis was part of the Arizona Project, a lengthy investigation into organised crime and government corruption following the killing of a newspaper reporter in Phoenix. However, that too was a false start, and there was no significant use of social network analysis in journalism again until the early 21st century (Houston, 2012).

The advent of Structured Query Language and computer-assisted reporting

In the 1980s, Steve Ross, a professor at Columbia University, offered analytic journalism courses for journalists and journalism students in which he taught computer science and the use of Structured Query Language (SQL). The language did not require sophisticated statistical skills to summarise data or to join datasets together for comparisons, and Elliot Jaspin, an investigative journalist from the Providence Journal-Bulletin in Providence, Rhode Island, adopted what he learned from Ross during a fellowship at Columbia. Jaspin began using datasets in his reporting and gained the use of the newspaper’s mainframe computer. In 1986, Jaspin received recognition for a story that compared a dataset of school bus drivers with a dataset of criminal records and found an alarming number of matches (LaFleur, 2015, 14–16).
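
A join of the kind Jaspin ran is easy to sketch with SQL; the example below uses Python’s sqlite3 module, and the table names, columns and records are invented stand-ins. In a real project, a match on name and birth date would only be a lead to verify, not a publishable finding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bus_drivers (name TEXT, birth_date TEXT, licence TEXT);
    CREATE TABLE convictions (name TEXT, birth_date TEXT, offence TEXT);

    INSERT INTO bus_drivers VALUES
        ('Pat Example', '1950-02-11', 'D-1001'),
        ('Lee Sample', '1948-07-30', 'D-1002');

    INSERT INTO convictions VALUES
        ('Pat Example', '1950-02-11', 'reckless driving'),
        ('Chris Nobody', '1960-01-01', 'fraud');
""")

# Join the two datasets on name and birth date -- the essence of the
# school-bus-driver comparison described above.
query = """
    SELECT d.name, d.licence, c.offence
    FROM bus_drivers AS d
    JOIN convictions AS c
      ON d.name = c.name AND d.birth_date = c.birth_date
"""
for row in conn.execute(query):
    print(row)  # only ('Pat Example', 'D-1001', 'reckless driving') matches
```

Checking that two records really describe the same person – against middle names, addresses and the underlying paper files – remains reporting work that no join can replace.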

Jaspin also began to create a data library for the newsroom. He wanted not only to acquire one or two databases for an individual story, but also to have datasets that could be used as a reference library. For example, he acquired the dataset of all drivers’ licences in Rhode Island, which was public at the time. It came on a nine-track computer tape, which Jaspin had transferred to microfiche. It served as a look-up dataset for reporters seeking to confirm the identity of subjects in stories or to find addresses. He also obtained and made available to the newsroom campaign finance data.

Gaining access to the newspaper mainframe continued to be difficult for Jaspin, but it remained necessary since most government data was maintained on nine-track tapes. As a result, Jaspin worked with programmer Daniel Wood to develop software that would allow data from a nine-track tape to be copied onto a personal computer. The software was called Nine Track Express and enabled journalists with portable nine-track tape drives to transfer the data fairly easily. While the software required a journalist to learn the nuances of bits and bytes in data, it freed reporters from waiting for access to the ever-elusive mainframe (Berret and Phillips, 2016).

In 1989, the term computer-assisted reporting attained more prominence when Jaspin left the newspaper and created the Missouri Institute for Computer-Assisted Reporting (MICAR) at the University of Missouri. Jaspin taught detailed sessions for professional journalists and students on the technical aspects of data, but the success of his programme stemmed from an approach that relied more on database management and spreadsheet software. Computer-assisted reporting received another boost when, in 1989, the series ‘The Color of Money’ won a Pulitzer Prize for the Atlanta Journal-Constitution (LaFleur, 2015) and demonstrated the impact that could be derived from journalists and professors working together. The series of stories relied on analysis of home mortgage data by an in-house data analyst and two university professors. The series exposed racial disparity in mortgage lending and inspired other US newsrooms to begin to adopt computer-assisted reporting and hire journalists who could practise it.

The first conference on using data in journalism was held in 1990 by Professor James Brown of Indiana University–Purdue University at Indianapolis at the National Institute for Advanced Reporting. The conference was held on the Indiana–Purdue campus and was co-hosted by Investigative Reporters and Editors (IRE), a non-profit association based at the University of Missouri. The conference brought together journalists who were practising or wanting to practise precision journalism or computer-assisted reporting (LaFleur, 2015). Even though the first commercial web browser had not been
developed, Brown drew attention to the need to co-operate with library researchers and archivists. Among the speakers at his conferences were Barbara Quint, senior editor at Online Searcher, and Jennifer Belton, head of the Washington Post’s library researchers.

After Jaspin left the University of Missouri in 1992, IRE assumed the administration of the workshops and the data library Jaspin had created at Missouri for reporters throughout the USA. The change in administration was supported by the university, since much of the computer-assisted reporting in the USA was being done by investigative journalists. In 1994, the institute at Missouri was renamed the National Institute for Computer-Assisted Reporting (NICAR) and IRE raised funds for NICAR staff and activities. NICAR expanded its training beyond national conferences and workshops at Missouri and began conducting workshops throughout the USA and in other countries. In 1996, I published the first widely used textbook, Computer-Assisted Reporting: a practical guide. I served as managing director of NICAR and became executive director of IRE in 1997. In 1997 a formal agreement was also reached between IRE and the University of Missouri, giving IRE formal administrative responsibility for NICAR.

Meanwhile, in 1993, a breakthrough in the visualisation of data as a reporting tool was dramatically demonstrated by data journalist Steven Doig, who was at the Miami Herald. He showed the efficacy of visualising data when he collected and overlaid wind speeds and building damage reports after Hurricane Andrew struck Florida in 1992. He was part of a team awarded a Pulitzer Prize for this reporting. Doig’s analysis and map revealed that buildings received heavy damage at low wind speeds because of poor enforcement of building codes (Pulitzer, 2018).

As the practice of data journalism increased, journalists began encouraging changes in US open records laws that would recognise that public access to digital documents should be the same as access to paper documents. Government officials had often denied access to data, citing privacy, public safety concerns and the difficulty and costs of duplicating the data for journalists. At the 1994 meeting at the Poynter Institute for Media Studies, the journalists and researchers came up with 38 typical
excuses for denials they had heard from government officials, along with answers to those excuses (Mitchell, 2002). Journalists overcame many denials by showing that government officials were routinely trying to overcharge for data and did not understand that duplication and redaction of records was more accurate and less costly when done electronically. While some states clarified their freedom of information laws to include electronic data, it was not until 1996 that the US Freedom of Information Act was amended to include digital information. Eventually, the remaining states and local governments in the USA also changed their laws to accommodate requests for public data. Laws in other countries later changed as data journalism spread internationally.

With the creation of the world wide web in 1989 and the subsequent creation of the Mosaic web browser in 1993, data became more accessible to journalists as governments and businesses posted datasets on their websites. Without having to use freedom of information laws, US journalists began downloading datasets more frequently. The US Bureau of the Census was particularly helpful in making data available online in comma-separated value and Excel formats, as well as via geographical information system software for mapping. The proliferation of data also resulted in stories that were not exclusively investigative; in a prelude to what became known as data journalism, reporters began producing lifestyle and sports stories based on data related to those topics.

In 1996, two Danish journalists, Nils Mulvad and Flemming Svith, attended a NICAR workshop in Missouri and then commenced offering workshops for European journalists at the Danish School of Journalism. They created the Danish International Center for Analytical Reporting, wrote a Danish handbook on computer-assisted reporting in 1998, and began organising conferences on data journalism in Europe (LaFleur, 2015). In addition, Mulvad formed a collaborative group of journalists to request and report on farm subsidies data in Europe, which included an audit of open records laws for digital information in European countries; thus the journalists kept a running history of which countries had and had not complied (Kaas & Mulvad, 2015).


In London in 1997, City University journalism professor Milverton Wallace held the first of five annual conferences called NetMedia, which offered sessions on the internet and classes in computer-assisted reporting led by NICAR and Danish journalists. The classes covered the basic uses of the internet, spreadsheets and database managers, and were well attended by journalists from the UK, other European countries and Africa. Also in the UK, the Centre for Investigative Reporting, founded in 1997, began to collaborate with NICAR to offer classes in data journalism during its summer school and has continued to host a conference with the leadership of both UK and US data journalists.

As a result of the various international training efforts, journalists produced stories involving data analysis in multiple countries, including Argentina, Bosnia, Brazil, Canada, Finland, Mexico, the Netherlands, New Zealand, Norway, Russia, Sweden and Venezuela. By the early 21st century, data journalism had begun to be practised by journalists in Africa, Asia and Australia. The creation of the Global Investigative Journalism Conference in 2001 in Copenhagen brought a schedule of lectures and hands-on training sessions in computer-assisted reporting and research to journalists from nearly 40 countries. Attendance at the biennial conference has increased to include journalists from more than 130 countries. In 2005, Wits University in Johannesburg, South Africa, began an annual investigative conference, which provided lectures and hands-on training in computer-assisted reporting to mostly African journalists.

Also in 2005, the term data journalism began to be used more often by programmers in the USA and Europe who had become interested in journalism. In the USA, computer programmer and journalist Adrian Holovaty spurred more interest among programmers when he used police data to create an ongoing, interactive map of crime in the city of Chicago that allowed citizens to choose ways to and from work that were supposedly safer. With funding from the John S. and James L. Knight Foundation, Holovaty then created the now-defunct EveryBlock website in 2007 (Jeffries, 2012), which used local data across the USA for online maps. But the project was later criticised for not checking the accuracy of government data more thoroughly and thus producing misleading maps (LA Times, 2009).

International progress in data journalism continued to move forward as the first decade of the 21st century was ending.


Fred Vallance-Jones and David McKie published Computer-Assisted Reporting: a comprehensive primer in 2008, with a special emphasis on data in Canada. In 2010, the European Journalism Centre in the Netherlands created another data-driven journalism initiative and established a data journalism website covering many of the topics and approaches developed by NICAR. A 2011 initiative produced the Data Journalism Handbook, a compendium of the points of view and knowledge of practitioners and scholars. The Centre also collaborated with the Global Editors Network and the Open Knowledge Foundation on creating data journalism awards. In 2011, the European Investigative Journalism Center and a group of data journalists organised a conference called Data Harvest to bring together journalists and programmers. Also in 2011, the Guardian in the UK received much attention for its analysis of the participants in racial riots in cities in the UK. While it credited part of its analytical approach to Meyer’s work, the Guardian also analysed social media extensively to determine the origins and purposes of the rioters.

The teaching of data journalism became a routine part of journalism conferences in Denmark, Finland, the Netherlands, Norway and Sweden. Through the global investigative conferences, the use of data quickly spread across Eastern Europe, where Drew Sullivan (who formed the Organized Crime and Corruption Reporting Project) and Romanian journalist Paul Radu were strong proponents and practitioners. Journalists began using Google online data tools increasingly throughout the world because they were free and relatively easy to learn. In 2015, Google created the News Lab, now known as the Google News Initiative (https://newsinitiative.withgoogle.com/training/), to focus on support for journalists using data for news stories. The News Lab began to offer tutorials for journalists on analysis and on interactive charts and maps.

Continued challenges

Open government and open data projects in the 21st century brought a promise of easier access to government data, but the data placed on government sites was sometimes quietly redacted or omitted important public information.


Furthermore, costs still thwarted journalists in their quest for data. Initial demands from government agencies have ranged from thousands of dollars to more than $2 billion and have required journalists to spend months or years negotiating for lower costs, even though most freedom of information laws grant waivers to a journalist or person working in the public interest. Government agencies also wrongly apply privacy or security exemptions when rejecting requests for data, which requires journalists to appeal those decisions. Prohibitive search fees, imposed by government agencies to find the data requested, have also continued to increase. In addition, some agencies have refused to release record layouts and code sheets for databases, which makes it impossible or risky to do analysis. Another perennial problem is that government agencies continue to contract with private firms to collect, store and retrieve government data. The contractor may not know that it is obligated under law to release the data to journalists or other interested parties (Lathrop and Ruma, 2010). Last, there are both legitimate and illegitimate privacy concerns (Schreir, 2010, 305–14).

Journalists often still must do extensive data extraction and cleaning on structured data. The databases can suffer from poor design. For example, the Federal Election Commission database of contributors has an occupation field that allows entry of a generic occupation such as lawyer or attorney, or the name of the law firm. Other databases have codes of 1 through 5 for race, yet the same field may have codes running from 1 through 9 because of added ethnicities or data entry errors, without the code sheet being updated. Other databases have simple data entry errors because of a lack of standardisation. The city of St Louis may be entered as St. Louis, St Louis and Saint Louis in the same database. To obtain accurate counts or sums of grouped data it is necessary to correct the data. Cleaning data can be tackled with find-and-replace functions in Excel, with SQL update statements in a database management tool, or with a dedicated data cleaning tool such as the free open source OpenRefine (http://openrefine.org/).
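
The St Louis example can be sketched with a SQL update statement of the kind mentioned above, again run here through Python’s sqlite3 module. The table and its contents are invented, and a real cleaning pass would normally be documented or done on a copy so that the original entries remain recoverable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (donor TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?, ?)",
    [
        ("J. Doe", "St. Louis", 100.0),
        ("R. Roe", "St Louis", 50.0),
        ("A. Poe", "Saint Louis", 75.0),
        ("B. Low", "Kansas City", 20.0),
    ],
)

# Collapse the spelling variants into one standard form so that
# grouping by city produces a single, accurate total.
conn.execute(
    "UPDATE contributions SET city = 'St. Louis' WHERE city IN ('St Louis', 'Saint Louis')"
)

for city, total in conn.execute("SELECT city, SUM(amount) FROM contributions GROUP BY city"):
    print(city, total)  # two rows: Kansas City 20.0 and St. Louis 225.0
```

OpenRefine’s clustering functions do the same job interactively, suggesting groups of near-identical values that can be merged in one step.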

Oftentimes, databases are incomplete because information is self-reported and may be inadvertently not entered or intentionally omitted. A journalist must decide whether to try to fill in some of the information with research or to note the omissions in the story. Geolocation information is particularly difficult, since address records can be entered as intersections or blocks with a generic number for the address. At other times, a geolocation may default to a central location when the record does not have an address.

Another time-consuming task is converting files provided in portable document formats (PDFs) into spreadsheets for analysis. Adobe Acrobat, cometdocs.com and Tabula are off-the-shelf software packages used for such conversions, but the conversion is not always exact when columns of information are transformed. Sometimes a high number of records can cause the programs to crash. Even more difficult are scanned PDFs, which require the optical character recognition software contained in Adobe Acrobat or other packages. Sometimes journalists must perform their own manual transcription and data entry when the documents are unreadable by computer programs.

Unstructured data

Through interaction with information science schools and computer programmers, journalists began adapting software and techniques for dealing with unstructured data in the early 21st century. A free data journalism service for journalists, DocumentCloud (https://www.documentcloud.org/about), was created with funding from the Knight Foundation in 2009 by Aron Pilhofer at the New York Times, Scott Klein and Eric Umansky at the non-profit newsroom ProPublica, and programmer Jeremy Ashkenas. DocumentCloud increased the use of unstructured data by allowing journalists to upload government and business documents in text form to the DocumentCloud site. The data was processed through Thomson Reuters Open Calais, which converted the text to structured data and returned elements of the text such as entities, topic codes, events, relations and social tags, which made the documents easier to analyse. DocumentCloud also allowed journalists to then annotate portions of the documents online.

Despite these collaborative successes, the cultures of journalism and computer programming initially clashed. Programmers believed that journalists were not as transparent with the public as they should be about their analysis of data and that journalists were also reluctant to try new software and approaches. The journalists perceived that the programmers did not respect the need for determining the accuracy and limitations of data sets and did not engage enough in journalism
practices such as interviewing and observations in the field. By 2009, a desire to bring together computer programmers and journalists resulted in creation of the group Hacks/Hackers (https://hackshackers.com/about/) by Pilhofer, Rich Gordon, a professor at Northwestern University, and journalist Burt Herman to bring more sharing and understanding between the two professions and ease some of the culture clash between the groups. This effort accelerated the use of machine learning and analysis of unstructured data, through simple word clouds to the complexities of topic modelling. In 2010, journalists were confronted with the need to begin to analyse large unstructured data. Wikileaks released the Afghan War Diaries and then the Iraq War Diaries, which contained US secret military documents. It followed those releases with others, including state department documents. Other big data leaks followed, including of European banking records and unstructured data of offshore accounts known as the Panama Papers and the Paradise Papers. The International Consortium of Investigative Journalists received those later leaks of big data and has developed a system for organising, analysing and distributing that data to reporters developing stories from the data that involves extensive extraction of e-mails and other documents into structured data (Gallego, 2017). As the access to unstructured and big data increased, the need for improved analytic tools and more sophisticated algorithms also grew. Journalists took an interest in how social media platforms were selecting advertising and information to show consumers and how propaganda could infiltrate those platforms. Conclusion: the future of data journalism In 2008, Georgia Tech hosted the Symposium on Computation and Journalism. Several hundred computer scientists and journalists attended. In 2009, The Center for Advanced Study in the Behavioral Sciences at Stanford University organised a similar workshop and in 2015 Stanford University created the Computational Journalism Lab. Columbia Journalism School began offering a course in computational journalism and the university’s Tow Center received a substantial grant from the Knight Foundation for its computational journalism program. In 2015, Syracuse University began a master’s programme


The Google News Lab has sponsored awards in computational journalism research and several universities have hosted conferences called the Computation + Journalism Symposium since 2014. Among the new approaches to analysis was one called 'algorithmic accountability', in which journalists reverse-engineer algorithms used by social media and online search tools and discover both unintended and intended bias and misdirection. There has continued to be a debate on whether all journalists need to have programming skills, but there has been general agreement that all journalists need to know how programmers do their work.

Despite the recognised need for the skills of data journalism, recent surveys have found that many journalists and journalism students still lack the skills to practise it and that universities do not offer enough classes in the field. A survey by educators from Columbia University and Stanford University in 2016 found that half of the journalism schools in the USA did not offer a data journalism class in their curricula (Berret and Phillips, 2016). A Google News Lab survey in 2017 found that about half of the newsrooms it surveyed had a dedicated data journalist (Rogers, Schwabish and Bowers, 2017). However, the Columbia and Stanford study also suggested creating an outline of the tools and techniques for a data and computational journalism curriculum, which tracks the development of computer-assisted reporting and the future for it. The study also predicts growth in data gathered from drones, sensors and crowdsourcing, and the presentation of data through augmented and virtual reality (Berret and Phillips, 2016).

References
Berret, C. and Phillips, C. (2016) Teaching Data and Computational Journalism, Columbia University.
Gallego, C. S. (2017) Read the Paradise Papers Documents, International Consortium of Investigative Journalists, 20 November, https://www.icij.org/blog/2017/11/read-paradise-papers-documents/.
Gray, J., Bounegru, L. and Chambers, L. (2012) The Data Journalism Handbook: how journalists can use data to improve the news, O'Reilly Media.
Hamilton, J. T. (2016) Democracy's Detectives: the economics of investigative journalism, Harvard University Press.
Houston, B. (2012) Computer-Assisted Reporting: a practical guide, 4th edn, Routledge.
Howard, A. (2014) The Art and Science of Data-Driven Journalism, Tow Center for Digital Journalism.
Jeffries, A. (2012) EveryBlock's Hype 'Definitely Faded,' says Founder Adrian Holovaty, but it's More Popular than Ever Before, 16 August, https://www.theverge.com/2012/8/16/3245325/5-minutes-on-the-verge-with-adrian-holovaty-founder-of-everyblock.
Kaas & Mulvad (2015) Farm Subsidies for Rich People, 29 January, https://www.kaasogmulvad.dk/en/portfolio/farm-subsidies-rich-people/.
Klein, S. (n.d.) Antebellum Data Journalism: or, how big data busted Abe Lincoln, https://www.propublica.org/nerds/antebellum-data-journalism-busted-abe-lincoln.
LA Times (2009) Highest Crime Rate in LA? No, just an LAPD map, Los Angeles Times, 5 April, www.latimes.com/local/la-me-geocoding-errors5-2009apr05-story.html.
LaFleur, J. (2015) A History of CAR, IRE Journal, 2nd quarter.
Lathrop, D. and Ruma, L. (eds) (2010) Open Government: collaboration, transparency, and participation in practice, O'Reilly Media.
Meyer, P. (2002) Precision Journalism: a reporter's introduction to social science methods, 4th edn, Rowman & Littlefield.
Mitchell, B. (2002) The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request, Poynter, 25 August, https://www.poynter.org/news/top-38-excuses-government-agencies-give-not-being-able-fulfill-your-data-request.
Pulitzer (2018) The Miami Herald, www.pulitzer.org/winners/miami-herald.
Rogers, S., Schwabish, J. and Bowers, D. (2017) Data Journalism in 2017: the current state and challenges facing the field today, The Google News Lab.
Schreir, B. (2010) Toads on the Road to Open Government Data. In Lathrop and Ruma (eds), Open Government.


4 Link rot, reference rot and the thorny problems of legal citation
Ellie Margolis

Introduction

Preservation of digital information is important in all fields, but the nature of law and legal writing makes the preservation of digital information and the problem of link rot particularly significant. Legal writing is heavily dependent on citation in a way that is different from many other fields, and it is crucial that cited materials are accessible to future readers. The last 25 years have seen a migration of legal information from print to electronic, dramatically changing the way that lawyers research and access information (Berring, 2000a, 305–6). That same time frame has marked a rise in citation to more non-traditional sources made more accessible through digital search technologies (Margolis, 2011, 911–12).

As the digital revolution has led to increasing citation to electronic sources, the ability to retrieve accurate versions of those sources has grown increasingly problematic. The dual problems of link rot (the URL is no longer available – there is no web content to display) and reference rot (the URL is active, but the information referred to in the citation has changed, or the content of the page has changed) pose significant obstacles to lawyers trying to track down citations to internet sources (Zittrain, Albert and Lessig, 2014, 177). Practising lawyers, legal academics and law librarians have identified a variety of potential solutions to these problems, but none have been widely adopted. The legal field still has a long way to go to ensure the widespread retrievability of electronic citations.


Law and the nature of legal citation

While in many disciplines citation serves largely for the purpose of attribution, in practice-based and academic legal writing citation can play additional important roles. Legal citation practices often seem excessive to those outside the profession. Lawyers are active readers, trained to look for support for even the smallest points and to view citation with a critical eye, checking carefully for the degree of support a cited source provides for the cited proposition. In addition to providing support, a citation can carry a substantive message to the reader about the weight and significance of a legal authority in dictating the outcome of a new situation. Legal readers expect sources, called 'authority' in legal writing, for even the simplest proposition. Thus, the norms of academic and practice-based writing call for detailed citation to sources.

Academic legal writing is notorious for the high number of footnotes included in a typical law review article, sometimes as much as two lines of footnotes per line of text (Magat, 2010, 65). Legal academics are expected to provide citation footnotes for even the most commonly understood facts. The Bluebook, the citation manual for legal academic writing, provides, 'In general, you should provide attribution for all sources – whether legal or factual – outside your own reasoning process' (Columbia Law Review and Harvard Law Review, 2016). Citations demonstrate the comprehensiveness of the writer's research and the quality of support for the writer's assertions. They also serve as research tools, leading readers to additional sources – an important function for academic legal writing. Legal readers expect to be able to confirm a citation by looking at the original and ascertain that the reference is accurate and supports the writer's proposition. The Bluebook contains an elaborate set of citation rules designed to give the reader the ability to locate the source being referenced easily.

Citations in practice-based legal writing are even more significant. In the USA at both state and federal level, law is a text-based system. Legislatures enact statutes. Courts issue written opinions. Administrative agencies publish regulations. Lawyers work within this text-based framework, reading, analysing and making arguments based on the words that have been created by the law-making body. Words matter, and lawyers care deeply about making sure those words are accurate.


When a legal document cites a source, legal readers are inclined to check that source to see its exact wording. It is critically important that citations leading to these sources of law are accurate, reliable and accessible. In applying statutes and regulations, courts are limited to the precise words chosen by the enacting body and cannot go beyond the bounds of their interpretation of those words. In addition, statutes and regulations generally apply only in the version that was in effect at the time a situation arose. If a statute or regulation has been amended, the new version may not apply to a situation that took place before the amendment went into effect. Thus, lawyers need to be able to access not only current versions of statutes and regulations, but prior versions as well.

In addition to statutes and regulations, courts can make law through judicial opinions, called common law. The common law system is built on the concept of precedent – lower courts are bound to follow the decisions of higher courts. When issues arise that have not yet been addressed, the court's job is to take existing legal precedent and apply it to new situations (Kozel, 2017, 791). Legal argument is built on the close reading of precedent, and citation to authority is at the centre of all legal analysis. When making an argument on behalf of a client, lawyers carefully construct analysis by identifying the authority that governs the situation and applying it to the client's particular facts. When issuing decisions, courts set out their understanding of current rules before applying them to the case before them. Those rules are represented by authority and carefully supported by citation. Thus, it is crucial for lawyers and judges to be able to use those citations to look back at the original sources, to understand the reasoning they represent, and to develop new, future-looking arguments.

Whether a judicial decision is based on statutory or common law, the purpose of the written legal opinion is to explain the court's understanding of the law or of the application of the law to the facts of a case. When the court bases that analysis on a source that a subsequent reader cannot access and read, the reader cannot assess how much persuasive weight should be given to the unavailable source. The inability to assess the sources can lead a legal reader to question the validity of the court's reasoning and diminish the authority of the judicial opinion, potentially affecting the stability and growth of the law.


Because the validity of law depends on the authority of those opinions, the ability to find and read cited sources goes to the very heart of the US legal system. As a result of the text-based nature of the legal field, lawyers have created elaborate systems for publishing, maintaining and citing legal authority. The need for reliability and authenticity shaped the legal publishing industry for two centuries, but the rise of digital technology and online information has put a strain on that well-established system.

Legal information and the rise of web citation

For almost two centuries in the USA, legal information was published in a stable and self-contained system, allowing for reliable citation without fear that the source would become inaccessible. As legal information has migrated online, the legal profession has had no choice but to move to web-based citations. At the same time, norms of legal writing have changed, leading to a more permissive understanding of the types of materials that can be cited, including much information available on the web. As a result of all this, the profession has seen a dramatic rise in web citations and the growing problems of link and reference rot.

During the 19th and 20th centuries, publication of legal information was controlled largely by the West Publishing Company (Berring, 2000b, 1675). While some units of government had their own publications systems, for the most part, West published statutes in code books and cases in case reporters. Secondary legal sources were limited in number, consisting primarily of legal encyclopedias, treatises and scholarly journals. All of these print-based materials were regularly published and widely distributed in academic and private law libraries throughout the country. Lawyers rarely referenced materials beyond this relatively limited universe, and because publication was controlled and stable, legal writers knew they could rely on and trust legal citation.

This began to change towards the end of the 20th century, with the advent of the digital revolution, but that change was slow. While legal materials such as statutes and cases were digitised and placed online, digital legal databases replicated the organisation of materials in the print system.


Lawyers, slow to adapt to change, were cautious about looking beyond traditional print sources. In addition, citation rules continue to require citation to the print format for most primary and secondary legal sources. Indeed, The Bluebook citation form is still based primarily on a print system of legal information, although few lawyers are retrieving print sources.

As digital technology became more widespread at the start of the 21st century, the tight hold the print world held on the legal profession began to loosen. Increasingly, legal information began to migrate online. While at first legal research services such as Lexis and Westlaw replicated the print system of legal information in digital form, as more advanced search technology developed, the rigid print organisation gave way to algorithms, leading legal researchers to a wider variety of sources including those available only on the internet (Margolis, 2011). Over time, West's control eroded, as courts and legislatures began to make legal materials available directly on their websites, and many free databases, such as Google Scholar, also began to post legal materials. As a result, lawyers tend to work primarily in an online environment even when using materials that continue to be available in print.

In addition to traditional print resources moving online, since the turn of the 21st century there has been a growth in 'born-digital' legal materials (Liebler and Liebert, 2013, 286–7). Financial constraints have led many state and local government entities, as well as federal agencies, to distribute legal information only online, with no print reference. In 2013, the Government Printing Office (GPO) estimated that 97% of federal government information was born digital (Jacobs, 2014, 1), up from 50% in 2004 (GPO, 2015). Individual states are increasingly publishing primary legal materials online while simultaneously discontinuing print publication (Matthews and Bush, 2007). As print sources have been discontinued, states have had to designate the online version of their legal information as the official version. Lawyers using these materials have no option other than to provide an internet citation. This has further untethered citation of legal materials from the stable, reliable print world of the 20th century and forced lawyers to confront the need to cite internet sources in practice-oriented and academic legal writing.


The simultaneous migration of legal information to digital platforms, growth of sophisticated electronic search algorithms, and increase of born-digital legal information have all contributed to lawyers' increased comfort with online sources of all kinds. As a result, lawyers have begun to rely on information that goes beyond traditional sources of primary law, though this change has been very gradual. The legal field is notoriously slow to adapt to change as a natural result of a system bound heavily in precedent and citation rules requiring print sources. The historically narrow conception of legal authority meant that lawyers were cautious about citing sources that had not typically been used in legal analysis. However, over time lawyers, as with all other segments of society, became more comfortable in an online environment, accustomed to jumping online to find all manner of information quickly and easily. The easy availability of online information, combined with search algorithms that returned a wider range of results, broke down some of the barriers between traditional authority and other sources of information. The temptation to rely on online information thus became impossible to resist, and lawyers began to expand the universe of material considered acceptable and relevant for use in legal analysis (Margolis, 2011). As a result, in addition to traditional primary and secondary legal materials, legal writers are now citing non-legal sources in much greater number, increasing the variety of sources lawyers and legal academics are comfortable citing (Margolis, 2011).

Thus, the number of citations to online sources has grown exponentially in practice-oriented and academic legal writing. Numerous studies conducted between 1995 and 2017 confirm the rise in web-based citation to legal and non-legal sources in legal materials. These studies confirm that while initially lawyers were cautious and cited to the internet rarely, the number and rate of citation to online sources grew rapidly over this time period. The increase in citation was accompanied by a change in citation rules to account for the new kinds of sources lawyers and legal academics relied on. The studies show that in practice-oriented and practical legal writing, citation to a variety of online materials has become commonplace.

There has been a rapid rise in the frequency of internet citations in academic legal writing. The growing interdisciplinary nature of law review articles, in conjunction with the easy availability of information on the web, has increased the breadth of information legal academics look to in their work.


One study found that in 1994 there were four internet citations in three law review articles. By 2003, there were close to 100,000 (Lyons, 2005, 681). Another study looked at the percentage of law review articles citing online sources between 1995 and 2000 and found that the number increased from 0.57% to 23% (Rumsey, 2002, 32–3). That number has doubtless increased in the ensuing years, with more recent studies showing high numbers of web citations in legal journals (Zittrain, Albert and Lessig, 2014, 177).

The citation to online sources in judicial opinions, and in the briefs and other documents submitted to courts, has also grown steadily since the mid-1990s. The first internet citations in a federal court opinion appeared in 1995, and in a state court opinion in 1998 (Torres, 2012, 269). The first documented internet citation by the US Supreme Court was in 1996, when Justice Souter referenced two internet sources describing cable modem technology (Denver Area v. Federal Communications Commission, 1996). While state and federal courts were initially cautious about internet citations, the rate of citing to online sources has grown quickly. Studies of multiple jurisdictions throughout the USA have shown a steady increase in citation to internet sources over time (Torres, 2012, 270–1). Internet citations are now commonplace in appellate court briefs and judicial opinions across state and federal jurisdictions throughout the country.

Because of citation rules requiring reference to print sources where available, citations to primary legal authority such as statutes and cases are rarely to websites. Those sources are the bread and butter of legal analysis and still the most prevalent citations in legal documents. The growing numbers of web citations to primary legal sources are to those 'born-digital' legal sources on government websites. These materials do not account for the rapid rise in internet citation, however. A key factor in the rise in internet citation in academic and practice-based legal writing is the growth in reference to non-legal sources. The use of non-legal sources has increased along with the growth of the internet as a source of ready information (Margolis, 2011). Internet citation to non-legal sources varies by type of source and type of website. A large number of citations are to data and other factual information on government websites – social science data, scientific reports, policies and the like. A wide array of other sources can be found in legal documents. It is not uncommon to see citations to news outlets, Wikipedia, social media sites, university websites and a variety of other individual sites (Margolis, 2011).


Unlike the stable reliability of print legal materials, all of these websites are volatile and subject to change over time. As the citation to online sources has grown, so have the interrelated problems with link rot, reference rot and authenticity of sources.

The growing problem of link rot, reference rot and authentication

Because publication of legal materials was so stable for so long, lawyers and legal academics became accustomed to trusting that they would have access to cited sources and that the content of those sources would be unchanged. The stability of information contributed to the sense that sources were authoritative, and while there has always been a hierarchy of authority, the fundamental idea that a cited source could be relied on was a central part of legal analysis and the development of law. The ephemeral nature of the internet has changed all that.

Fundamentally, link and reference rot call into question the very foundation on which legal analysis is built. The problem is particularly acute in judicial opinions because the common law concept of stare decisis means that subsequent readers must be able to trace how the law develops from one case to the next. When a source becomes unavailable due to link rot, it is as though a part of the opinion disappears. Without the ability to locate and assess the sources the court relied on, the very validity of the court's decision could be called into question. If precedent is not built on a foundation of permanently accessible sources, it loses its authority. So too in appellate briefs, which are often made public as a part of the case record. When an argument in a brief is based on a source which is unavailable because of link or reference rot, it is impossible to assess the strength of the argument or see what role it may have played in shaping the judicial decision. Link and reference rot in opinions and briefs create a kind of instability in the law that did not exist when law was entirely print based.

The impermanence of internet citations threatens scholarly writing as well. Because of the heightened role of footnotes in legal academic writing, the disappearance of sources could make it difficult to trace developments in legal theories and the evolution of different schools of thought.


Thus, the potential inability to locate cited sources and expect that they are unchanged is a significant problem in legal writing and poses profound challenges for information preservationists.

As citation to web sources has grown over the last 30 years, so too has the recognition of the problems with link and reference rot, as well as the difficulty in authenticating primary legal materials. Awareness of these issues became prevalent around the turn of the 21st century. Law librarians and legal scholars have made numerous attempts to catalogue the extent of the problems and, while no comprehensive study has been done, quite a lot of information is available.

A variety of studies of judicial opinions in the last three decades show that link and reference rot are prevalent and worsen over time. The earliest citations are the most susceptible to rot. A 2002 study of federal appellate opinions revealed that in cases from 1997, 86.4% of internet citations no longer led to the cited material, and that 34% of internet citations in cases from 2001 were already inaccessible by 2002 (Barger, 2002, 438). Studies of state appellate court opinions show similar results. A study of internet citations in Washington state supreme and intermediate appellate courts between 1999 and 2005 showed that 35% of the citations were completely invalid and, of the valid citations, 64% did not lead to the material originally cited (Ching, 2007, 396). A similar study of Texas appellate opinions from 1998 to 2011 showed an overall rate of link rot of 36.97%. Studies of other courts show link failures in appellate opinions ranging from 30% to 47% (Peoples, 2016, 22). Some of these studies looked only at link rot. Because reference rot (the link still works but leads to different content) can be extremely difficult to detect, it is highly likely the numbers are even higher.

Link rot and reference rot are also prevalent in academic legal writing, which includes internet citations in higher numbers than judicial opinions. One early study of law review articles from 1997 to 2001 showed that in 2001 only 61.8% of the URLs were still active links, and 38% of the website links did not work even within the year they were published (Rumsey, 2002, 35). More recent studies confirm the continuing problem of link rot in law review articles. For example, a study of the various journals based at Harvard Law School revealed link or reference rot ranging from 26.8% to 34.2% (Zittrain, Albert and Lessig, 2014, 180).
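As a rough illustration of how such surveys can be automated, the sketch below checks a list of cited URLs and reports which no longer resolve; the optional hash comparison is one way to flag possible reference rot, on the assumption that a digest of each page was stored when the citation was first made. The URLs and stored digest are invented placeholders, not data from the studies cited above:

# A minimal link rot audit. URLs and the stored digest are placeholders.
import hashlib
import requests

citations = {
    # cited URL -> SHA-256 hex digest recorded when the page was first cited (or None)
    "https://example.gov/report-2015.html": "REPLACE_WITH_STORED_SHA256_DIGEST",
    "https://example.org/old-policy": None,
}

for url, stored_digest in citations.items():
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        print(f"LINK ROT (no response): {url}")
        continue
    if response.status_code >= 400:
        print(f"LINK ROT (HTTP {response.status_code}): {url}")
    elif stored_digest and hashlib.sha256(response.content).hexdigest() != stored_digest:
        print(f"POSSIBLE REFERENCE ROT (content changed): {url}")
    else:
        print(f"OK: {url}")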


The types of sources subject to link and reference rot vary as widely as the number of citations. Link rot can have a wide variety of causes. A document may simply be taken down if the author no longer wants to make it available. Websites may be updated to modernise their appearance; content may migrate from one URL to another as the owner of the website changes platforms, resulting in a broken link even if the original material still exists on the site. Because content on the web is largely commercial, businesses owning web content may close, rendering material on those sites inaccessible. Link rot can also occur because a website owner forgets to renew a domain registration or simply stops maintaining the website. While skilled researchers might be able to track down content at broken links, the majority of researchers are unlikely to be able to do so.

Reference rot poses a more complicated problem. Unlike print materials, which once printed remain the same, website content can be easily changed, often without leaving an obvious sign that the content is different from before. Thus, a reader tracking down a citation may have no way to know whether the URL leads to the identical material originally referenced. For example, a number of appellate briefs and judicial opinions cite Wikipedia and similar crowd-sourced sites, which are obviously subject to reference rot as users change and edit previous entries. Many commercial websites are updated frequently to stay current, and this process could also lead to reference rot.

Even born-digital government sources, which many assume to be less subject to the problem of link rot, show high levels of rot. The Legal Information Preservation Alliance founded the Chesapeake Project in 2007 to collect and archive born-digital documents. The project has conducted several link rot studies since 2008, finding a growing percentage of link rot in each subsequent year (from 8.3% in 2008 to 38.77% in 2015) (Chesapeake Digital Preservation Group, 2015). These findings show link rot increasing even in domains for the state and federal governments, although they are considered less susceptible to link rot.

Related to the problem of reference rot, particularly for born-digital primary legal authority, is the problem of authentication. Underlying the concern for the inability to read referenced sources is the critical importance of specific information as the basis of legal analysis.


If a court is interpreting a statute, for example, the precise words of that statute are critically important, and if a web version of the statute has been altered, then the court's decision is not based on legitimate authority. An 'authentic' text is one that has been verified, by the governmental entity charged with publishing the content as originally created, to be complete and unaltered. Historically, governmental entities have designated a particular print version of statutes, regulatory material, judicial opinions and so on as the official version. The official designation is typically made by statute or administrative rule. The stability of legal publishing throughout the 19th and 20th centuries created an environment where lawyers could rely on the authenticity of the official print version of legal resources.

As legal information has migrated online, the question of authenticity has arisen. Some government entities have completely discontinued their print publications and make information available only online (Whiteman, 2010, 39). How can a researcher determine if legal content at a particular URL has suffered reference rot and is not the authentic, original version? The federal government and some states have recognised that this is a problem and begun to designate an official online version of various publications. Not all levels of state and federal government have taken this step, however, and of those that have, many have not ensured that the version is both official and authenticated. An authenticated version of an online statute or regulation would give the researcher assurance that the version was not subject to reference rot.

One way to ensure authenticity is to place a digital signature in the document to show that it has not been altered. The US GPO began to do that in 2008. At the state level, the National Conference of Commissioners on Uniform State Laws drafted a model statute mandating the authentication of electronic legal documents. The Uniform Electronic Legal Material Act (UELMA) attempts to ensure that online official legal material will be permanently available to the public in unaltered form. A number of states have now adopted the UELMA, but many still had not as of May 2017 (NCSL, 2018). Thus, while the situation has improved, there is still a risk of reference rot in citations to online-only primary legal authority.
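The digital signatures the GPO applies serve the same basic purpose as the much simpler integrity check sketched below: if the publishing body separately distributes a cryptographic digest of the official file, anyone can recompute it and confirm that the copy in hand is unaltered. This is an illustration of the underlying idea only; the file name and digest are placeholders, and real authentication schemes such as the GPO's rely on full digital signatures rather than bare hashes:

# A minimal sketch of checking a downloaded legal document against a digest
# assumed to be published by the issuing body. File name and digest are placeholders.
import hashlib

OFFICIAL_SHA256 = "replace-with-the-digest-published-by-the-issuing-body"

with open("public-law-115-123.pdf", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

if digest == OFFICIAL_SHA256:
    print("Document matches the published digest; treat it as authentic.")
else:
    print("Digest mismatch; the copy may be altered or out of date.")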


Proposed solutions

So what is being done about the problems of link and reference rot in academic and practice-based legal writing? Because there is no single unified system of law, there is no single solution to the problem of link and reference rot in legal materials. Link and reference rot can be found in judicial opinions in state and federal courts, trial and appellate briefs, and academic legal writing. No single source of citation controls all of these varied sources. As recognition of the problem of link and reference rot has grown, a wide variety of solutions have been proposed and attempted. Many of these solutions have been effective, but link and reference rot continue to be a challenge for lawyers and academics as well as librarians and information preservationists.

Tracking rotten links

While the recognition of link rot and the resultant solutions have started to result in a lower incidence of rot in legal documents, many sources written in the 1990s and the early decades of the 21st century contain link rot. Thus, one group of solutions to overcoming rot involves tracking down broken links and previous content lost through reference rot. This is a research problem. Practising lawyers and legal academics do not always have sophisticated internet research skills (Peoples, 2016, 30–1). Thus law librarians play an important role in helping lawyers track down information hidden behind broken links.

There are in fact some resources available to allow access to web content that has been subject to link and reference rot. For example, the Internet Archive, a non-profit library, has been archiving internet sources since 1996, and researchers can locate archived websites through the Wayback Machine (http://archive.org/web/). Lawyers find the Internet Archive particularly useful to look back at business websites as they were at the time a particular legal dispute arose. It also might be a way to track down legal material such as regulations that have been amended and whose original citation has been left with reference rot. Not all legal materials are included in the Internet Archive, however. The automated crawlers that search for sites may overlook a site containing legal information because it was password protected, or blocked in some way, or merely because the crawl wasn't deep enough and the site was not located.
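For readers curious how this kind of lookup can be scripted, the Internet Archive also exposes a simple availability API that returns the archived snapshot of a URL closest to a given date. The sketch below is a minimal example; the endpoint, parameters and response fields are assumptions based on the Archive's public documentation and may change, and the cited URL and date are invented:

# A minimal sketch: asking the Wayback Machine for the snapshot of a cited URL
# closest to the date it was cited. Endpoint and response fields are assumptions
# based on the Internet Archive's public availability API.
import requests

cited_url = "https://example.gov/withdrawn-regulation.html"  # placeholder
cited_date = "20110315"  # YYYYMMDD, the date the source was originally cited

resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": cited_url, "timestamp": cited_date},
    timeout=10,
)
closest = resp.json().get("archived_snapshots", {}).get("closest")

if closest and closest.get("available"):
    print("Archived copy:", closest["url"], "captured", closest["timestamp"])
else:
    print("No archived snapshot found for", cited_url)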


While the Internet Archive is useful for general sources that may be cited in legal academic and practice-based writing, law-specific archives are more useful. The Legal Information Archive (https://lipalliance.org/related-projects/legal-information-archive/) is a digital archive established to preserve and ensure access to legal information published in digital formats. It is a collaborative project of the law library community, and is available to members of the Legal Information Preservation Alliance. The Cyber Cemetery (https://digital.library.unt.edu/explore/collections/GDCC/) is another internet archive, specifically for government websites that are no longer in operation. Using these and other repositories of archived internet information is a good way to recover information lost to link and reference rot, though they are not universally known. Lawyers and legal academics unfamiliar with these resources may not think to ask a law librarian and instead would assume the information is simply lost. Thus there is a role for librarians and preservationists in continuing to archive online materials and educate researchers about how to find them.

Even when online content is found through a digital repository, there is no guarantee that the site is identical to the original reference. The risk of reference rot is high, and it is important for the researcher to check the date of the citation against the date of the archived content to make sure they match. Here again, librarians and archivists have a valuable role to play in helping lawyers find and verify internet sources.

Avoiding link and reference rot

The most obvious way to avoid link rot is to avoid citing internet sources. Rule 18.2 of The Bluebook requires the use and citation of printed sources when available. The only exception is a digital copy that is authenticated, official or an exact copy of the print source (Rule 18.2.2). Despite this requirement, the number of citations to the internet has continued to grow because of the increase in born-digital documents and the general rise of available information on the web. Because of this, requiring citation to print sources is not a realistic solution.


One relatively simple approach is to fix a website in time by saving the contents as a .pdf file. Individual lawyers often do this to preserve web content they are citing in briefs and other court documents. Depending on how important the source is to the legal argument, lawyers may take the step of creating an appendix for the .pdf files and attaching the appendix to the document. This approach does not ensure that later researchers will be able to access the source, depending on how accessible the court's files are to the public. The fact that lawyers take this step demonstrates the lack of a uniform or widely recognised method for avoiding link and reference rot.

The Judicial Conference of the USA, the national policy-making body for the federal courts, recognised the problem of link rot and in 2009 issued guidelines for citing, capturing and maintaining internet resources, and for using hyperlinks, in judicial opinions. Ironically, this report is itself not accessible because of link rot, but it is described in a variety of places (Liebler and Liebert, 2013, 291). This policy suggests that, while the decisions are within the individual discretion of the judges, any internet material that is central to the reasoning of the opinion should be captured and preserved. The report provides criteria for assessing internet sources, including the likelihood that a source will be changed or removed. Individual judges and courts developed a variety of practices in response, but they have not been uniform in either the federal or state court system.

The most low-tech method of preserving the content of internet citations is to print them out. Some courts and lawyers have made this a practice, though there are obvious limitations. Paper copies of websites cannot include any dynamic data or media such as audio or video that the site may contain. If the court is keeping paper copies of the internet sources cited in an opinion, researchers will most likely access that opinion electronically, but would have to travel to the courthouse to see the original source cited if the URL had suffered link rot. This is also true of paper copies of internet sources cited in briefs and other documents lawyers have filed in the court.

The practice of creating an electronic appendix of preserved web pages and attaching it to a court document has become more prevalent since the problem of link rot has become more widely known. This approach also has limitations if the goal is to make sources available to future readers in order to assess the legitimacy of the legal reasoning.


For documents filed with the court, even those that are filed electronically are not easily available to the general public. For example, the federal court electronic filing system, PACER, creates an electronic docket where case documents can be downloaded, but that docket is not accessible to the public without cost. If courts are creating an electronic file of .pdf copies of internet materials, they are not making them part of published judicial opinions, so they suffer the same access problem.

In response to the growing awareness of citation to internet sources and the subsequent problem of link and reference rot, the US Supreme Court began archiving websites cited in its own opinions. Beginning with the 2005 term, the Supreme Court created an archive on its website: Internet Sources Cited in Opinions, Supreme Court of the US (https://www.supremecourt.gov/opinions/cited_urls/). While this archive is not directly connected to the opinions containing the citations, this approach allows researchers to track down the information if the link in the opinion itself no longer leads to the source. Some other courts have followed suit, but this practice is neither uniform nor widespread.

Not surprisingly, librarians have been at the forefront of coming up with systemic solutions to link rot in judicial opinions. Beginning in 2007, some federal circuit court libraries began tracking citations to online sources and preserving the information at the URLs as .pdf files. Since 2016, all of the federal circuit libraries have captured and archived web pages cited in the judicial opinions of the circuit courts. Some of these libraries have also been archiving URLs in the federal district court opinions. Some of these libraries not only archive the sources, but also add them into the individual docket of the case in which they were cited. Again, however, this practice is not uniform. The degree to which these archives are available to the public similarly varies. Some libraries maintain only an internal database for judges and clerks, while others also maintain a public database on the court's website.

Thus, while there has been some progress, addressing the problem of link and reference rot continues to be necessary. In addition, none of the solutions addressed thus far work well for academic legal writing, where rot is also a significant problem. While individual authors and journals may maintain archives of cited internet material, these are not made available through the publication process.


The most obvious solution to link and reference rot, one that would work for both academic and practice-based legal writing, is to create links that don't rot. This can be done by using a persistent identifier that allows the reference to be located over time. There are a number of persistent identifier systems that could be used (Keele and Pearse, 2012, 383, 391–92). The GPO uses the Persistent Uniform Resource Locator (PURL) system in its archive of government publications, for example. While PURLs and other systems are widely used in other academic fields, however, they have not been widely adopted in legal writing. The Bluebook does not require the use of persistent identifiers in either practice-based writing or law journals, and for the most part they are not widely used.

One system of permanent links has begun to take hold in academic and practice settings: an online preservation service called Perma.cc. The Harvard Library Innovation Lab developed Perma.cc as a response to the growing problem of link rot in legal academic journals (Zittrain, Albert and Lessig, 2014, 178). Perma converts web content to permanent links, which are preserved and cached in a network of university libraries and organisations throughout the country. While Perma.cc was originally developed to address link rot in academic legal scholarship, it has extended its reach to a broader audience. It is widely used in American law schools, and a growing number of courts are using it. Members of the public can create Perma links that are preserved for a set period of time. Courts, libraries and law journals, if they become participating organisations, are able to archive web content permanently with Perma. Because Perma's architecture is based in libraries and other stable institutions, as long as any of these institutions or their successors survive, the links will remain (Zittrain, Albert and Lessig, 2014). As of the fall of 2017, there were over 8000 Perma.cc links in law review articles and over 500 citations in judicial opinions from many different federal and state courts. Perma.cc is likely to expand into the commercial market by allowing accounts for law firms in the near future. It seems likely that Perma.cc will take hold as the predominant method of avoiding link rot.
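Participating organisations can create Perma links through a small REST API as well as through the web interface. The sketch below shows roughly what that looks like from a script; the endpoint, payload and authentication header are assumptions based on the service's public API documentation and should be checked against the current docs, and the API key and cited URL are placeholders:

# A rough sketch of creating a Perma.cc archive via its REST API. The endpoint,
# header format and response field are assumptions; the key and URL are placeholders.
import requests

API_KEY = "your-perma-api-key"
cited_url = "https://example.org/agency-guidance.html"

resp = requests.post(
    "https://api.perma.cc/v1/archives/",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={"url": cited_url},
    timeout=30,
)
resp.raise_for_status()
record = resp.json()

# The returned identifier forms the permanent citation, e.g. https://perma.cc/XXXX-XXXX
print("Archived as: https://perma.cc/" + record["guid"])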


Conclusion

Although law is a field that depends heavily on citation, lawyers are not generally trained to be experts in information preservation. There may never be one single solution to link rot, given the vast multiplicity of academic and practice-based legal publications. As the profession continues to find its own solutions, librarians and digital preservationists will play a significant role in developing others that allow future readers to access links, ensuring the legitimacy and integrity of legal reasoning in all the various forms in which it appears.

References
Barger, C. M. (2002) On the Internet, Nobody Knows You're a Judge: appellate courts' use of internet materials, Journal of Appellate Practice and Process, 4, fall, 417–49.
Berring, R. C. (2000a) Legal Research and the World of Thinkable Thoughts, Journal of Appellate Practice and Process, 2, summer, 305–18.
Berring, R. C. (2000b) Legal Information and the Search for Cognitive Authority, California Law Review, 88, December, 1673–1708.
Chesapeake Digital Preservation Group (2015) 'Link Rot' and Legal Resources On the Web: 2015 data, http://cdm16064.contentdm.oclc.org/cdm/linkrot2015.
Ching, T. S. (2007) The Next Generation of Legal Citations: a survey of internet citations in the opinions of the Washington Supreme Court and Washington appellate courts, 1999–2005, Journal of Appellate Practice and Process, 9, fall, 387–408.
Columbia Law Review and Harvard Law Review (2016) The Bluebook: a uniform system of citation, 20th edn, Harvard Law Review Association.
Denver Area Educational Telecommunications Consortium, Inc. v. Federal Communications Commission (1996) 518 US 727, 777.
GPO (2015) GPO Proposes 21st Century Digital Information Factory, GPO-0433, 1, Government Printing Office, https://www.gpo.gov/docs/default-source/news-content-pdffiles/2004/04news33.pdf.
Jacobs, J. A. (2014) Born-Digital US Federal Government Information: preservation and access, Center for Research Libraries, March, https://www.crl.edu/sites/default/files/d6/attachments/pages/Leviathan%20Jacobs%20Report%20CRL%20%C6%92%20%283%29.pdf.
Keele, B. J. and Pearse, M. (2012) How Librarians Can Help Improve Law Journal Publishing, Law Library Journal, 104 (3), 383, 391–92.
Kozel, R. J. (2017) Precedent and Constitutional Structure, Northwestern University Law Review, 112 (4), 789–837.
Liebler, R. and Liebert, J. (2013) Something Rotten in the State of Legal Citation: the life span of a United States Supreme Court citation containing an internet link (1996–2010), Yale Journal of Law and Technology, 15, winter, 273–311.
Lyons, S. (2005) Persistent Identification of Electronic Documents and the Future of Footnotes, Law Library Journal, 97, fall, 681–94.
Magat, J. A. (2010) Bottom Heavy: legal footnotes, Journal of Legal Education, 60, August, 65–105.
Margolis, E. (2011) Authority Without Borders: the world wide web and the delegalization of law, Seton Hall Law Review, 41, 909–45.
Matthews, R. J. and Bush, M. A. (2007) State-By-State Report on Authentication of Online Legal Resources, American Association of Law Libraries, 14 March, https://www.aallnet.org/wp-content/uploads/2018/01/authenfinalreport.pdf.
NCSL (2018) Uniform Electronic Legal Material Act: state legislation, National Conference of State Legislatures, 6 February, www.ncsl.org/research/telecommunications-and-information-technology/uniform-electronic-legal-material-legislation.aspx.
Peoples, L. F. (2016) Is the Internet Rotting Oklahoma Law?, Tulsa Law Review, 52, fall, 1–39.
Rumsey, M. (2002) Runaway Train: problems of permanence, accessibility, and stability in the use of web sources in law review citations, Law Library Journal, 94, winter, 27–39.
Torres, A. (2012) Is Link Rot Destroying Stare Decisis As We Know It? The internet-citation practice of the Texas appellate courts, Journal of Appellate Practice and Process, 13, fall, 269–99.
Whiteman, M. (2010) The Death of Twentieth-Century Authority, UCLA Law Review Discourse, 58, September, 27–63.
Zittrain, J., Albert, K. and Lessig, L. (2014) Perma: scoping and addressing the problem of link and reference rot in legal citations, Harvard Law Review, 127, February, 176–96.


PART 2

The physical world: objects, art and architecture

The digital is reaching beyond the confines of computers into our physical world more and more with each passing day. The chapters in Part 2 explore the work being done along this ever-blurring edge between two worlds. How is the Internet of Things capturing information about us and the physical world all around us? And how is this flood of data being exposed and leveraged? How is digital colour preserved and accurately reproduced for the human eye? How can building information modelling be extended to support collaborative work on historical buildings, which need management of a broader set of contextual information?

Professionals who are pushing the boundaries between the physical and digital worlds while also concerning themselves with preserving context, accuracy and privacy will pave the way for archivists pursuing similar goals in their work to preserve authentic digital records permanently.


5 The Internet of Things: the risks and impacts of ubiquitous computing
Éireann Leverett

Introduction

It used to be common to believe that it is impossible to hack certain things in life: cars, pacemakers or the electric grid. After 20 years of studying and working in security, it has become apparent to me that all technologies can be subverted in one way or another, including those named above. Slowly the wider population is waking up to the problem, and wondering if we are building a bright and bold technological future on faulty foundations and shaky ground.

Ideally, experts would solve this problem, much as digital archivists work to solve a problem for society so that everyday folks can get on with their business. The strategic problems in security and privacy are similar though: we don't have enough trained people to care about these issues, we don't have clear educational paths into our profession, we're chronically underfunded, and even when we're not, money is continually re-prioritised to address other risks that align better to a quarterly profit cycle. All of this was true before and is still true as the Internet of Things enters common parlance.

Let's turn our thoughts to the Internet of Things: first to what has changed in security, and then to some of the impacts of this slowly accruing risk.


Security

At its core, the Internet of Things is 'ubiquitous computing': tiny computers everywhere – outdoors, at work in the countryside, in use in the city, floating on the sea, or in the sky – for all kinds of real world purposes. It might be a home automation system that starts your coffee when you wake up in the morning, or agricultural infrastructure such as an irrigation system that applies water at the best time of day to avoid evaporation and conserve water. It might help manage traffic flows in a city to optimise driving time in certain neighbourhoods. Microphones may be used in high crime areas to triangulate gunshots to improve police response times.

All of these purposes initially seem logical, and even business critical to the users, yet each of them involves decisions about security and privacy with incredibly long-lasting and far-reaching implications. The coffee maker may be used to penetrate your home network. The agricultural system might be hacked in a regional water dispute and used to destroy crops and manipulate prices. The traffic system could be used to redirect traffic closer to one business than another. Those microphones could easily be used later to survey entire communities deemed dangerous to the status quo. All of these are potential realities (many already having occurred), and we would be wise to try to split the utopian from the dystopian effects.

First let's get a bird's eye view of how these systems operate, and then we can use this to discuss the wider implications for society and consumers. Conceptually, the Internet of Things is best described as 'sensors, actuators, communication and distributed computing'. Sensors can gather a wide variety of types of data including windspeed, temperature, geographical location, velocity, sunlight, pressure, weight or decibels. Each of these sensors provides raw data that can be gathered, processed, analysed, synthesised and finally used to generate decisions. Based on those decisions, actuators perform real world actions, often using sensors to confirm their success, or that the actions have the desired consequences.

Specific examples will help us make this clearer, so let's start with a common household object. A thermostat has a temperature sensor, which it uses to send signals to the boiler, either to turn on the heater if the house gets too cold, or the air conditioning if it gets too warm. The signal has to make it to the boiler, and the boiler has to decide whether to turn on the heater or the air conditioning based on the signal value. Then it sends another signal to an actuator to fire up the boiler.
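To make the sensor, decision and actuator pattern concrete, here is a deliberately simplified sketch of the control loop such a thermostat runs; the thresholds, function names and the direct function calls standing in for network messages are illustrative assumptions, not details of any real product:

# A toy sensor -> decision -> actuator loop for a thermostat.
# Thresholds, names and the simulated sensor are illustrative assumptions.
import random
import time

def read_temperature_sensor():
    """Stand-in for a real temperature reading, in degrees Celsius."""
    return random.uniform(10.0, 35.0)

def actuate(command):
    """Stand-in for the signal sent to the boiler or air-conditioning actuator."""
    print("Actuator command:", command)

TOO_COLD = 18.0
TOO_WARM = 26.0

for _ in range(5):  # a few iterations instead of an endless control loop
    reading = read_temperature_sensor()   # sensor
    if reading < TOO_COLD:                # decision
        actuate("heating_on")             # actuator
    elif reading > TOO_WARM:
        actuate("cooling_on")
    else:
        actuate("all_off")
    time.sleep(1)

Every step in this little loop, the reading, the decision and the command, as well as the channel that carries them between devices, is a separate thing an attacker can target.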


This abstract model of sensors, decision makers and actuators now allows us to explore Internet of Things devices and how they can be hacked or compromised in a broad and jargon-free way.

Let's start with manipulating a sensor. Say you have a sunlight sensor that turns on a streetlight at certain times of night, or off during the day. This type of sensor is vulnerable to physical world sabotage, such as putting tape over the sunlight sensor, or breaking the switch that turns the light on.

The streetlight's sunlight sensor can also be subject to digital sabotage. We can hack the sensor itself, and cause it to report incorrect values. We can hack a central processor of messages, and alter the signal messages that it handles. Examples of these central processors include the router you have in your home, a large industrial ethernet switch, or even one of the core switches of the internet itself. The original message from a sensor goes into a central processor on its way to its final destination, but an altered message comes out and is delivered along its way. We can also hack the actuators and cause them to perform actions aligned with the goals of our sabotage. Lastly, and crucially, we can hack the communications network and send entirely false control messages, which is different from hacking the devices that run the network, although only subtly so.

Why is this last approach important as an example? If you control the communications media, you can control the messages sent between the sensors and actuators unless they are cryptographically protected. If a hacker can control both the information that leads to decisions and their consequences, then they don't need to control the mechanism making decisions. Of course, if they control the decision maker, they don't need to control the sensors or the actuators. So our point here is that there are almost always at least three ways to subvert the system, by compromising only a small part of the system.
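The phrase 'unless they are cryptographically protected' is doing a lot of work here, so the sketch below shows one common protection: a shared-key message authentication code (HMAC) attached to each reading, so that a forged or tampered message no longer verifies. The key, message format and helper names are illustrative assumptions; a real deployment would also need key management, replay protection and usually encryption:

# A minimal sketch of authenticating sensor messages with HMAC-SHA256.
# The shared key and message format are illustrative assumptions.
import hashlib
import hmac

SHARED_KEY = b"provisioned-at-install-time"  # placeholder key

def sign(message: bytes) -> bytes:
    return hmac.new(SHARED_KEY, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sign(message), tag)

# Sensor side: send the reading together with its authentication tag.
reading = b"streetlight-42:lux=3"
tag = sign(reading)

# Controller side: reject anything whose tag does not verify.
print("genuine message accepted:", verify(reading, tag))

tampered = b"streetlight-42:lux=900"  # an attacker rewrites the value in transit
print("tampered message accepted:", verify(tampered, tag))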


Often, it is possible to hack using physical and digital techniques simultaneously, and in collusion. Ideally, a system should be secure enough that neither type of sabotage would be possible. So why is it possible? All the different manufacturers that make Internet of Things components would need to share encryption keys, schemes and protocols to be more secure. This would reduce interoperability (at least in the places where the cryptography used by one vendor is different from another) and means that Internet of Things installers would need to be trained in new skills such as provisioning cryptographic keys and firewalls, when previously they just installed sensors (Wikipedia, 2018a).

When computing is ubiquitous, security needs to be considered every step of the way. A single light sensor on a lamp post may seem insignificant and not worthy of security testing, but when 500,000 streetlights start spelling out abusive messages on satellite images or flashing at intersections to increase the likelihood of car accidents, people start to care about why security wasn't thought about during design or manufacturing.

This failure of risk analysis is endemic throughout purchasing of Internet of Things systems. The person making the first purchase performs a pilot study of a small number of devices. Security or privacy implications aren't considered, because they might derail the pilot. Yet after the pilot, the project gains steam, and more and more of the devices are deployed to wider and wider effect. So something that was only 1% of the original system rapidly becomes 90% and people wonder why security was never considered. The frog has already boiled when we start the debate (Gyimah-Brempong and Radke, 2018).

This is compounded by people considering each component as insignificant, but the entire system as critical. There's a lovely quote that helps us think about this in an elegantly simple way: 'A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable' (Lamport, 1987). Some of those Internet of Things failures won't be accidents, and sufficiently advanced malice is indistinguishable from Murphy.
sufficiently advanced malice is indistinguishable from Murphy. Thus we come to understand that security in the Internet of Things is important, to protect us from not just virtual harm, but real physical harms as the internet becomes more entangled with the physical world. Where once computers lived solely on desktops, now they are in clothes, transport systems, credit cards, phones, cars, pacemakers and nuclear power plants. The consequences of poor computer security are becoming entangled with human safety more rapidly than we can train technologists in the methodologies of both.

Privacy
The Internet of Things is very much an attempt by chip manufacturers to get beyond the desktop. To move computers out of the office and into the wider world they must offer more efficient computation for the electricity used, more processing cycles per battery percentage. In 2011, researchers demonstrated that the energy efficiency of computers doubles every 18 months. Named Koomey’s law, after the lead researcher of the Stanford study in which it was documented, it mirrors Moore’s law, which dictates that processing power doubles in the same time period (Greene, 2018). Once a certain threshold of computations per battery percent is reached, real world Internet of Things applications become much more viable and drive further research and progress along Koomey’s law. The other driver here could be batteries that store more energy, and the two are intimately intertwined. Per Moore’s law, integrated circuit manufacturers also make the chips as cheap and small as possible, and as a result cheaper and smaller battery-powered products can afford to have a chip in them as the cost falls. This culminates in a world where every toy has a voice chip, every kettle a digital clock, and every egg tray an array of sensors. The Internet of Things often rejects standard business models entirely. In a traditional economy, the value of the good on the market shouldn’t fall below the cost of manufacture, unless you are trying to be a loss leader. But with the Internet of Things, a variation of the freemium model applies. We expect services for free (or at least as cheap as Internet of Things devices). The companies make money from the data we generate. First by collecting it all, and second by selling it
onward in many forms to many people. Once it is gathered, it can also be processed, analysed and synthesised. This can be done on the device or elsewhere, and may involve transporting data to other countries where privacy legislation may or may not exist to the same degree one expects of a device that processes data inside a jurisdiction we live in. There is then a challenge of determining which countries those may be, which is not often displayed on the box, or explained at the point of purchase. Creators and marketers of this technology traditionally hide behind the notion of informed consent. Yet when was the last time you read and understood every word of the end-user licence agreement for new software? Consider something as simple as installing an app on your phone. Do you know if it needs access to the microphone before you install it? How is the data processed or analysed? What are your rights to use the app without sharing all your details? Do you even know for certain what data it collects, or which countries it intends to store the data within? Average users are not programmers or internet cartographers (Mason, 2015). They are unlikely to suspect that a popular free flashlight app collects your contact list to sell to marketers. This suggests that these are not informed users, or informed choices, or an informed market. The Internet of Things is no different from the app in the example, and often cheap devices are offered as a loss leader to get access to your data or Wi-Fi network. The device is a trojan horse to convince you to give up your data, which you imagine isn’t even valuable in the first place. The phenomenal growth of tech start-ups in wearables suggests that data about people, their habits and things they use is enormously valuable, far more so than the average consumer expects. In a nutshell, you can’t control dataflows you don’t know exist (Jouvenal et al., 2018). You might expect your smartwatch to keep track of your sleep patterns, but finding out your television keeps recordings of your discussions might be quite a surprise, yet has already happened (BBC News, 2015). As if this isn’t enough, there is a further problem. Privacy International (a charity dedicated to understanding privacy as a human right, in particular with respect to the internet) refers to this problem as ‘data in the wings’. Many Internet of Things devices are
‘over-provisioned’ – either the software or hardware has more functions available than necessary, which leads to more data being collected or generated than even the manufacturer is aware of. Even the companies themselves can be victims of this lack of informed consent, and they pass on their bad decisions further down the chain. Let’s get deeper into the different ways this ‘data in the wings’ problem manifests itself. When you buy a temperature sensor, it comes with a humidity sensor in the chip, even if the display doesn’t use it. This can occur because humidity sensors and temperature sensors are so often sold together that it is not economical to produce a chip without the humidity sensor. It is cheaper to bundle them in hardware and sell the combined chip for either or both functions at similar prices than to produce separate items, whose markets as individual products are too small to justify the cost of production. Similarly, cars slowly became hackable as they were prepared for autonomous driving. The manufacturers added hardware and networks to the vehicles strategically, to prepare the vehicles for assisted and eventually autonomous driving. They were ‘silently’ over-provisioned, deliberately, to prepare for the future. Yet the average consumer is shocked to discover they are hackable, because they still think of them as mechanical and non-digital devices, precisely because they were not informed of this strategic provisioning. Now this brings us back to ‘data in the wings’, by which we mean data derivable by the vendor. Either through over-provisioned software or hardware, or through synthesis with other data available to them, vendors can derive far more data from the data itself. For example, imagine if all those over-provisioned sensors we considered earlier also sent humidity data from people’s homes, even when only a temperature sensor was purchased. In these scenarios the user is truly uninformed about the humidity data being gathered. As we have seen, the vendor might even be ill-informed about the presence of extra hardware. The informed part of the ‘informed consent’ is breaking down significantly in this model, and we do not yet have regulatory structures even to discuss the issue, or to protect the consumer from buying one device and getting their data extracted by the vendor for profit via the process. In the absence of a regulatory framework is each user supposed to
verify all this for themselves? Where is the magazine or website where I can review the security and privacy of Internet of Things devices before I buy them? What about the ways the vendor is deriving data and aggregating it? For example, something that seems innocuous in individual statistics (say for example location data on your phone), when combined with a voter centre database, and your stated political views on social media, becomes a tool for political micro-manipulation and gerrymandering. So it is not just what is gathered, but what it can be aggregated with, and what can be derived from both that must concern us as a society. In short, we make decisions based on what can be done with a single piece of data held by a single company (at best), yet we should be more concerned with the continually growing pile of forgotten data from everyone, and what can be aggregated and derived from it:

    After collection, data is almost certainly bound to be computed upon. It may be rounded up or down, truncated, filtered, scaled or edited. Very often it’ll be fed into some kind of algorithmic machinery, meant to classify it into meaningful categories, to detect a pattern, or to predict what future data points from the same system might look like. We’ve seen over the last few years that these algorithms can carry tremendous bias and wield alarming amounts of power. (Thorp, 2017)

It is a collective risk to society, more than the risk of a single datapoint to an individual. More importantly, the single individual is not empowered to resist such things: simply not participating in sharing phone location data only protects one vote from being manipulated, while the process of micro-targeted propaganda still works on all your phone-owning friends. Feeling your vote is secure because you are not on Facebook or sharing location data is a Pyrrhic victory in the face of a breakdown in democracy. It is a collective risk to all of us, and one that unilateral defections cannot eliminate. So far we have focused on companies using the Internet of Things to get at our data, but the trojan business model they use has another nasty effect on our digital security and privacy; it is an externality on our lives. The loss leader model means companies spend very little
money on the security or privacy of their devices and our data, because it is consumers who will suffer the losses. Those losses will continue to seem wild and unpredictable, like the fish tank hack that led to a casino losing money (Schiffer, 2017). They are unpredictable when viewed individually, but as a collective effect on society they are very predictable, common and should concern consumer rights groups for decades to come.

File formats
To understand computer technologies more transparently, we need to understand file formats. Digital archivists know this well, but it is also an issue for forensics, computer security and safety investigations. Every Internet of Things vendor is free to invent new file formats for storing data, and new protocols for their devices to speak to each other, and often does so. Why do we care so much about file formats? To explain computer security to those new to the issue, I often ask, What do hackers know that we don’t? They know that ‘everything is a file on a computer, or passing through one’. They know that ‘to change the file is to change the world’. To change the file, and have the changes be accepted, we must understand the encoding of data into the file. Without descending into too much technical detail, a company can use either a proprietary or an open data format. If an Internet of Things vendor’s software engineers choose to use open file formats, or document their choices publicly, then we can use such documentation while investigating the security and privacy of such devices. Unfortunately, most companies that make software produce proprietary file structures, where the formatting of the data is not publicly described and must be deduced through painstaking examination of many documents over time. As a result the data structure is usually, though not always, not interoperable with open source software. Proprietary data formats usually obscure data from anyone with an interest in transparency. We must turn to ‘reverse engineers’ who specialise in deconstructing and understanding the executable code of computer programs, and the file formats they produce and consume. Additionally, it is not something we think to question when we
purchase or use software, though there are a growing number of users who ask, Can I export my music library again in the future to an open source file format if I don’t like your software, or am I ‘locked in’ by file format? This lack of transparency leads to many complications, such as how to answer simple questions about a potential intrusion. It should be a simple question: Has my computer or Internet of Things device been compromised? Unfortunately, it is not simple to answer this question. In practice, more time, effort and money must be spent by the victim to understand if they have ‘been hacked’ than they expect, and much of that effort is focused on understanding proprietary file formats along the way. You started with a simple question, but were forced to answer the question, How does this company store and transmit data, technically speaking? just to be able to answer, Has my Internet of Things device been hacked, or is it still under my control? How can this expand beyond the individual to become a problem for society? Imagine a simple case, such as a crop irrigation system that malfunctions. What if it malfunctioned while the farmers were on holiday, and has destroyed $100,000 worth of crops? The insurance company will want to investigate if this is Murphy or malice before paying out. The devices are taken out of service and delivered to a forensic analyst who determines the cause of the outage. To do so, the investigator needs to connect to and collect the data from the device. Now for security reasons we would like this to be as hard as possible, because if anyone can find out the cause of the outage that in itself is a security vulnerability. However, for forensics purposes we would like this to be as easy as possible, so that evidence exists to show the device was hacked: ‘The human victims of WMDs [weapons of mass destruction], we’ll see time and again, are held to a far higher standard of evidence than the algorithms themselves’ (O’Neil, 2016). So there is a natural tension between the openness of standards and data, and the accountability of Internet of Things vendors. Without openness, it is difficult to verify security and privacy claims, probably too expensive or time consuming for the average person to do themselves. We currently lack the structures and regulations for the Internet of Things that we have in place for other industries. Yet the automotive industry was eventually driven to produce a comparative list of safety features and crash test results. This enables car consumers
to make informed choices. How might we accomplish this with the Internet of Things? This issue reaches beyond the end-consumer to the vendors who consume hardware or software from others. Open formats would also assist when disputes occur over the accountability of vendors and the security and privacy these devices provide. A manufacturer of a watch may not know much about the security and privacy of phones, and vice versa. Unfortunately, new smart watches are often phones these days, too, and everyone has to rely on their supply chain to be secured. Yet, as previously stated, where is the regulator or consumer protection response to address these concerns? Large corporations are assumed to be able to secure themselves with big budgets, but the man on the Clapham Omnibus neither realises that he must spend money to protect himself, nor would know what to spend it on if he did (Wikipedia, 2018b). He cannot hire a team to penetration test his new smart watch and manage any vulnerabilities found, as it is not economically viable to do so. This example demonstrates the conflation of service and product that is the core philosophical problem society is now struggling with. If the watch were to catch fire and hurt the wearer, we would clearly see this as a product liability and the manufacturer would be held responsible. However, because the smart watch is also considered a service (after all, it checks the weather and sends text messages), we don’t attach the concept of liability to the manufacturer when we are discussing breaches, or even, pathologically, if a ‘hack’ produces a fire. Services (such as software) have traditionally been exempt from liability. The Internet of Things will force us to find some way to navigate this paradox, and one option may be to make devices open source at some point in their lifecycle (for example if a company goes bankrupt). Another might be to attach liability in some form to services. Then again, we cannot make devices so open that they are easy to hack, either. Sometimes (though not always) open formats can lead to vulnerabilities that are qualitatively different from those found in proprietary systems. So there needs to be a certain amount of work done to secure these systems too. Open source should not be chosen because it makes systems inherently more secure but because it makes systems inherently more manageable in the face of security and privacy problems.
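
To make the file format point above concrete, here is a minimal sketch in Python of what an investigator gains when a format is documented. The record layout is entirely hypothetical, a 16-byte sensor log entry invented for this illustration, but with a published specification a few lines of code suffice to read the data, whereas an undocumented binary blob would force us back to reverse engineering:

import struct

# Hypothetical, openly documented record layout for a sensor log: a 4-byte
# unsigned timestamp, a 2-byte sensor id, a 2-byte flags field and two
# 4-byte floats (temperature and humidity), all little-endian.
RECORD_FORMAT = '<IHHff'
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)   # 16 bytes

def parse_log(blob):
    """Yield one dictionary per fixed-size record found in the blob."""
    for offset in range(0, len(blob) - RECORD_SIZE + 1, RECORD_SIZE):
        timestamp, sensor_id, flags, temperature, humidity = struct.unpack_from(
            RECORD_FORMAT, blob, offset)
        yield {'timestamp': timestamp, 'sensor': sensor_id, 'flags': flags,
               'temperature': temperature, 'humidity': humidity}

# With the specification, a forensic analyst reads the data directly; without
# it, every field boundary above would have to be guessed.
sample = struct.pack(RECORD_FORMAT, 1545216000, 7, 0, 21.5, 40.0)
print(list(parse_log(sample)))

The particular layout does not matter; the point is that openness turns part of a forensic investigation into a routine parsing task.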
Unfortunately, the low cost of the Internet of Things makes the approach of ‘securing all the components’ very, very unlikely to succeed. When you sell each widget for $3.99, you are hardly going to spend $50,000 eliminating vulnerabilities in the product. Yet if you don’t, your customers may very well lose that $50,000 themselves, many times over across each individually vulnerable customer (a classic example of externality). This sets up a tough decision: if components are perfectly secured (which we’re not even sure is theoretically possible), then we will lose some of our consumer rights and ability to verify how they work and where they send our data – the freedom to tinker. We won’t be able to pry them open easily, because they will be so locked down we cannot find much transparency or interoperability. Yet, if they are completely open they will be easily manipulated for malicious aims. Or at the very least you will have to put a large amount of time into configuring them to bring them up to your security standards from an open standard. Thus it is crucially important that Internet of Things vendors fix vulnerabilities found by others. It is easy to assume that they always will, but we might have said the same of companies installing seatbelts in the 1920s too. In reality, it took decades of consumer activism and liability to make seatbelts mandatory, and assign some level of liability to car manufacturers. We should not expect the Internet of Things to become more secure without some consumer pressure either. Decisively, we should be asking who secures, maintains and defends these systems when the company no longer exists, or when, because of its jurisdiction, the company simply does not care.

Non-static code and reproducibility problems
Before the web most computer programs were written monolithically. Each program was written, compiled for a specific type of computer architecture, and then deployed on that architecture. Computer code was updated infrequently, and behaved the same most of the times it was run. After the web, the world slowly drifted towards online web services. A website offers some service, and all other websites simply leverage that service on their page as the consumer browses a web page. This works very well most of the time, as the maintenance of a
webservice seamlessly is performed underneath your user experience. Customer A’s page uses webservice version 1.2, and seconds later Customer B is using webservice 1.3 without a noticeable transition in the website experience. The Internet of Things continues in this tradition, relying heavily on webservices to lower the cost of deployment, but also so that each individual service can be managed centrally and updated frequently. All Internet of Things devices then use the updated service immediately. Firmware, the software closest to the hardware, is often used by an operating system to run individual components such as keyboards or hard drives. With consumer devices the firmware and operating system are often deeply intertwined. Full firmware upgrades are still sometimes necessary (not least for security updates), but these upgrades themselves are also often pushed by third-party webservices, which specialise in the difficulties of updating millions or billions of devices in an orderly manner. It is a good thing for security and privacy when security updates can reach the masses quickly and efficiently. Perhaps it can even be conceptualised as an optimised product recall process, cognitively modelled as the ability to teleport an identical car into your garage fixing a crucial problem, and teleporting the old one away. Unfortunately, this benefit can also create a new problem: reproducibility. When the services are distributed on different domains around the world, it is easy and inexpensive to compose them into Internet of Things services and devices, though we then lose all visibility and control over the consistency of the device’s behaviour. One day it can behave one way, and overnight the services can be updated, and it behaves differently the next day. This has many ramifications, from thwarting forensic investigations to altering user experiences over time or geography. There is another reason why the consistency of service varies. It depends on internet connection and availability. This isn’t as simple as it sounds; a simple on/off binary condition of available internet. For a start, the internet doesn’t behave symmetrically. A connection from A to B does not imply a connection from B to A, despite the common belief that it does. So A can send information to B, but B cannot send information back to A! A simple thought experiment confirms this is possible, when you
think of firewall rules. These contain ingress and egress filtering rules, which are commonly different. So for example I might have one rule that says, ‘ALLOW outgoing traffic to GIANT SEARCH ENGINE CORPORATION’, because my employees need it to do their work. Simultaneously, I might have another rule that says, ‘DENY incoming traffic (not initiated internally) from GIANT SEARCH ENGINE COMPANY’ because I don’t want them indexing a number of external resources I have published for special customers. So now, we know that firewall rules exhibit ‘directionality’; thus egress and ingress rules can be different and often are. Even if we believe that only 50% of internet connections have firewall rules, we can still see this is significant enough to demonstrate that connection is not a symmetrical property. There are also many other reasons this condition can occur in connection state and quality of service: triangular routing, Border Gateway Protocol churn and hijacks, Transmission Control Protocol multicast or anycast, User Datagram Protocol, internet censorship such as the Great Firewall, and geofencing, just to name a few for the interested reader. Why labour this point? Why does it matter? Putting it succinctly the internet looks and behaves differently from any particular vantage point. Internet of Things devices may behave differently in Europe from the way they do in China, and they may behave differently today from how they will tomorrow. Not only because of connection asymmetries, but simply because the company doesn’t want to offer the same service in Country X as Y, or because Country Y doesn’t want the service available. This complicates the daily work of a consumer rights security tester or digital archivist. The tests they run in China won’t match those in Geneva, and the tests they ran in August won’t match those in January. Different markets want different products and features, but in the future regulatory environments will differ too. For example, in Europe you can be fined 2% of global turnover for losing data associated with European customers under General Data Protection Regulation (GDPR) legislation. This isn’t just true for European products, but for anyone who sells products in Europe. These regulatory variations will begin to differentiate products over time, but it will be a very slow process.
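
Returning for a moment to the directionality of firewall rules, here is a deliberately simplified sketch in Python. The network names and rules are invented, and real firewalls track connection state in far more detail, but the asymmetry is the same:

# A toy model of directional firewall rules (names and rules invented).
# Each rule permits new connections in one direction only, so being able to
# reach a destination says nothing about whether it can reach you back.
EGRESS_ALLOWED = {('office', 'search-engine')}   # outgoing connections we permit
INGRESS_ALLOWED = set()                          # no unsolicited incoming connections

def can_connect(source, destination):
    """Return True if a new connection from source to destination is allowed."""
    if source == 'office':
        return (source, destination) in EGRESS_ALLOWED
    return (source, destination) in INGRESS_ALLOWED

print(can_connect('office', 'search-engine'))    # True: A can reach B
print(can_connect('search-engine', 'office'))    # False: B cannot reach A

Multiply this asymmetry by the routing, censorship and geofencing effects listed above and it is clear why the same device can behave quite differently depending on where, and when, it is tested.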
One key role for regulators is to prevent digital lock-in.

Digital lock-in
It’s not just violations of security and privacy that we must protect against; we must also guard against technical lock-in. This is the name we give to measures a company takes to make sure you are forced to use their services, supplies and unique content formats. When you can only use one brand of printer cartridge or can’t transfer all your digital music to another manufacturer’s device, this is a form of technical lock-in. In the Internet of Things similar issues can even lead to lock-out! In one case, a firmware update bricked automatic locks popular with Airbnb providers, requiring manual resets to locks scattered around the world (Lomas, 2017). It’s not even an isolated incident, as shown by Nest shutting down its Revolv smart home hubs (Price, 2016). Since only the vendor can update proprietary software, we must ask questions about what goes wrong when they make mistakes or cancel their projects. Consumers expect to buy devices, and for them to keep working. There are currently no regulations protecting consumers or awarding them damages when companies get it wrong. This rapid obsolescence may be by design or it may be accidentally incentivised by the lack of regulations, but it will continue to provide challenges to those working with the data from the Internet of Things. Another issue of working with data is the sheer volume of it available. Even a simple app on your phone that polls your Global Positioning System (GPS) location every minute leads to half a million data points per person per year. Some might argue that some of that data isn’t useful because you are sleeping some of that time. However, where you sleep can be just as revealing as who else’s phone is nearby when you do. The value of the data is in the eye of the beholder. As we previously discussed, data about your sleeping habits can suddenly become valuable when combined with data about something else, say for example house prices. It can allow someone to infer your wealth, which for an individual is not so very interesting, but when you have half a million data points on every person combined with their housing and wealth you have a very accurate economic model of a neighbourhood.
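
The arithmetic behind that half-a-million figure is easy to check, and the aggregation step is little more than a join. The people, postcodes and prices below are invented purely for illustration:

# One GPS sample per minute, for a year:
samples_per_year = 60 * 24 * 365
print(samples_per_year)             # 525,600 data points per person per year

# A crude, hypothetical join of two individually 'innocuous' datasets.
night_location = {'alice': 'SW1A', 'bob': 'E14'}            # where each phone sleeps
median_house_price = {'SW1A': 1_500_000, 'E14': 450_000}    # public house price data

inferred_wealth = {person: median_house_price[postcode]
                   for person, postcode in night_location.items()}
print(inferred_wealth)              # a wealth estimate nobody consented to

Scale the same join up to every phone in a city and you have exactly the kind of neighbourhood economic model described above, assembled from data that seemed harmless on its own.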
The volume of data generated and captured by the Internet of Things is often beyond human scale. Data measured in terabytes or petabytes present serious challenges to searching and indexing. This has implications for privacy, security and forensics, because either the tools or the talent are lacking. It’s not just the volume of data per device that is staggering, though, it is the number of devices proportional to people. It is estimated by Forbes that Internet of Things devices have a compound annual growth rate of 20%, and we have already passed a crucial inflection point in 2015. In that year the ratio of people to Internet of Things devices was 1:2. Now in 2018 the ratio is about 1:3, but by 2025 we can project it will be 1:9. Why does it matter if Internet of Things devices so dramatically outnumber people? Internet of Things devices need to be secured, investigated and have their data considered for long-term preservation. There are more work hours per digital defender. Even in today’s world the average consumer doesn’t manage all their devices well from a security or privacy perspective. We need specialists to give us advice, but will the number of specialists grow in the same proportion as the number of devices? Will their advice become systematic and simple enough to help us? Or will the average consumer put their blind trust in the corporations providing the services they want to use? Automation will help us to extract data for investigations or archiving, and to secure or harden devices against unwanted attackers. The question still remains: how much automation, and how many specialists will we need? Right now, it is clear we will at least need ‘more’, even if we cannot be specific yet about how many more. The problem is not just the volume of new devices on the market, but also the speed at which they are released and replaced. Many companies are in the business of rapid obsolescence, creating devices which age in one- to two-year life spans, causing consumers to buy a new device more frequently. Consumers drive this rapid obsolescence by requiring ever cheaper electronic devices. This often leads to shorter life times as we still demand all the same features, but more cheaply and smaller. The point here is simple: when you drive down the cost, you pay for it with a legacy of insecure devices.
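
These two trends, more devices per person and shorter device lifetimes, compound each other. A back-of-envelope projection, using the 20% compound annual growth rate quoted above and a rough 2:1 baseline for 2015, shows how quickly the numbers mount; it comes out a little higher than the 1:9 figure projected above, which presumably rests on slightly different assumptions, but the direction is the same:

# Rough projection of Internet of Things devices per person, assuming the
# 20% compound annual growth rate quoted above and about two devices per
# person in 2015. A back-of-envelope sketch, not a forecast.
ratio_2015 = 2.0
annual_growth = 1.20
for year in (2018, 2020, 2025):
    ratio = ratio_2015 * annual_growth ** (year - 2015)
    print(year, round(ratio, 1))    # 2018: ~3.5, 2020: ~5.0, 2025: ~12.4

Every one of those cheap, short-lived devices adds to the legacy of insecure hardware just described.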
That legacy is what we call ‘security debt’, and society is slowly coming to a rational, quantified, understanding of that debt, and its impact on gross domestic product and consumers. Yes, we have cheaper devices, but when we have to buy cyber insurance for our homes and businesses, we will face the cost of those decisions over the last few decades. The externalities (the risks silently transferred to users and society) have a specific price. Whether we pay it directly, via higher cost Internet of Things devices, or indirectly, via regulatory oversight of device failures, there is a significant place here for regulation and regulatory approaches. Interestingly it seems easier to enact privacy data protection laws or regulatory oversight than to create a system of punitive measures for software neglect and vulnerability. In the short term we are more likely to see approaches to manage the data than the software. In limiting our goals to protecting consumers, we fail to recognise the role of companies’ software neglect. This will be evident in both open source and proprietary code, but the solutions discussed and the impacts will be different. Each software philosophy will fall back onto its foundational roots to defend itself. Open source advocates will claim that the software allows for the freedom to tinker, and anyone may find or fix flaws. There may be many eyes on this code, but there are precious few hands typing code to fix the flaws found by the eyes. In contrast, organisations invested in proprietary code bases will find and fix vulnerabilities faster, but the severity of the bugs found will be greater. There will also continue to be proprietary code bases that reach end of life, but are still critical to the functions of society. An obvious solution would be that proprietary code bases would become open source when their parent company dissolves. The code could be placed in code escrow, and then repaired by any interested parties. This may solve the problem of the code base being maintained, but it leaves two other problems unsolved: devices may rely on data that a company holds in order to keep functioning (the infrastructure a company provides to the Internet of Things device may be needed for continual operation) and software companies do not last as long as governments. When governments have to deal with the impact on their operations and constituencies of software companies that make quick profits on cheap Internet of Things devices, but then vanish
when the maintenance bills come due, they are forced to turn towards more long-term thinking about regulation, escrow, and liability. We’ve already seen the nationalisation of a certificate authority that failed spectacularly in the face of being hacked: DigiNotar was taken over by the Dutch government after it was found to have been breached. After discovering how a company was hacked, an investigation naturally turns towards finding out what the hackers did while hacking. In the DigiNotar case, we were surprised to discover that they produced a number of custom Secure Sockets Layer (SSL) and Transport Layer Security (TLS) certificates to compromise others around the world. It was akin to taking over a locksmith to produce new sets of master keys that say, ‘Don’t copy this master key’. It was then discovered that DigiNotar could not be permitted to go bankrupt because the Dutch government relied on its certificate services for various services to its citizens. The government paid for the certificate clean-ups for others (externalities) and to prop up the company to maintain its own national identity systems (dependency). While this example does not concern an Internet of Things company, it does illustrate a combination of externalities and dependencies that such companies may impose on society if we do not keep a careful watch on our relationships with technology.

Conclusion
The open source and proprietary software debate has evolved some nuanced discussions of such issues, but we still lack much of a philosophical intervention framework for data and society. How long should data be stored? What criteria should be used to determine what data is kept? What counts as private data and what does not? Does data belong to the company that gathered it or the person from whom it was gathered? What about the data derived from that data? Does data require maintenance? When it is wrong, how does it affect us? Anyone who has struggled with incorrect data kept by a third party, especially when attempting to start a business or secure a mortgage, has had a taste of how severely this type of situation can impact one person’s life. An individual’s struggle pales in comparison with algorithmic amplifications of other biases in society such as gender gaps and sexism, race-based profiling and housing inequality. Yet all
of these issues will be exacerbated by the Internet of Things, in volume and complexity. It is also true that the Internet of Things will bring many simple benefits. It may well provide us with crucial evidence to reduce those imbalances as well, if its devices are used creatively to document these inequalities. There is a balance to be had between the concerns detailed in this chapter and the recognition that these devices can also advance innovations that are a benefit to both individuals and society. There will be untold economic benefits and new jobs created. Our society will become more interesting and more fluid for their existence, but we need policy makers and the public to recognise and manage the risks these devices can induce as well. A bit of healthy policy towards security, privacy, consumer safety and data protection could enable us to reap the benefits of a functional and safe Internet of Things without triggering a whirlwind of rogue device dysfunction.

References
BBC News (2015) Not in Front of the Telly: warning over ‘listening’ TV, 9 February, www.bbc.co.uk/news/technology-31296188.
Greene, K. (2018) A New and Improved Moore’s Law, MIT Technology Review, https://www.technologyreview.com/s/425398/a-new-and-improved-moores-law/.
Gyimah-Brempong, A. and Radke, B. (2018) Think Data Privacy is Dead? Try replacing the word ‘privacy’ with ‘consent’, Kuow, 16 April, http://kuow.org/post/think-data-privacy-dead-try-replacing-word-privacy-consent.
Jouvenal, J., Berman, M., Harwell, D. and Jackman, T. (2018) Data on a Genealogy Site Led Police to the ‘Golden State Killer’ Suspect. Now others worry about a ‘treasure trove of data’, Washington Post, 27 April, https://www.washingtonpost.com/news/post-nation/wp/2018/04/27/data-on-a-genealogy-site-led-police-to-the-golden-state-killer-suspect-now-others-worry-about-a-treasure-trove-of-data/?noredirect=on&utm_term=.cc9754248a79.
Lamport, L. (1987) Distribution, https://www.microsoft.com/en-us/research/publication/distribution/.
Lomas, N. (2017) Update Bricks Smart Locks Preferred by Airbnb, TechCrunch, 14 August, https://techcrunch.com/2017/08/14/wifi-disabled/.
Mason, B. (2015) Beautiful, Intriguing, and Illegal Ways to Map the Internet, Wired, 10 June, https://www.wired.com/2015/06/mapping-the-internet/.
O’Neil, C. (2016) Weapons of Math Destruction: how big data increases inequality and threatens democracy, Penguin Random House.
Price, R. (2016) Google’s Parent Company is Deliberately Disabling some of its Customers’ Old Smart-home Devices, Business Insider, 4 April, http://uk.businessinsider.com/googles-nest-closing-smart-home-company-revolv-bricking-devices-2016-4.
Schiffer, A. (2017) How a Fishtank Helped Hack a Casino, Washington Post, 21 July, https://www.washingtonpost.com/news/innovations/wp/2017/07/21/how-a-fish-tank-helped-hack-a-casino/?utm_term=.78294c9ccf8d.
Thorp, J. (2017) You Say Data, I Say System, Hacker Noon, 13 July, https://hackernoon.com/you-say-data-i-say-system-54e84aa7a421.
Wikipedia (2018a) December 2015 Ukraine Power Grid Cyberattack, https://en.wikipedia.org/wiki/December_2015_Ukraine_power_grid_cyberattack.
Wikipedia (2018b) The Man on the Clapham Omnibus, https://en.wikipedia.org/wiki/The_man_on_the_Clapham_omnibus.

6 Accurate digital colour reproduction on displays: from hardware design to software features
Abhijit Sarkar

Introduction
In this chapter I will talk about colour. Not colour in a general sense, but colour as it is rendered on a display. I am a colour scientist by profession. Most of my professional experience revolves around computer displays and improving the visual experience on displays. In the imaging industry, many professionals with expertise in colour science typically work for manufacturers of electronic devices like smartphones, tablets, laptops, monitors, or even graphics processors and PC chipsets. Some work for camera or printer manufacturers. Besides the imaging industry, media and entertainment companies, and companies that manufacture paints, textiles, lighting products and so on, also employ colour scientists. The manufacturers need colour scientists and engineers to develop methods and algorithms (probably in collaboration with hardware and software engineers) to achieve a superior colour experience. A lot has to do with the fundamentals of human vision and perception – how our visual system works and perceives colours. Not surprisingly, colour science is a multidisciplinary field involving physics, chemistry, physiology, statistics, computer science and psychology. When creating and consuming a digital artwork, accurate colour reproduction on a display is one of the fundamental requirements. While cameras and printers play a vital role in archival and conservation work, the specific focus of this chapter will be on displays and related topics.
I will start this discussion by describing why display colour accuracy is important, when it is important, and for whom. Then I will get into the details of how display technology has evolved and is still evolving, and how that has affected the notion of colour accuracy. In the next section, I will share some of my first-hand experiences of the various challenges a device manufacturer faces in achieving high levels of display colour accuracy. Then come the software aspects: I will try to familiarise you with some of the software features that can positively or negatively affect colour accuracy, so you are aware of whether, when and how to use them. The section after that has some tips on how to ensure that colour accuracy is preserved during the digitisation of artwork, and afterwards during its consumption. Finally, I consider some of the unsolved challenges we face today, and the outlook for the future.

Display colour accuracy: why it’s important and when
I am sure that as an archivist, you find it critical that the computers and devices used in archival work preserve and reproduce colours accurately. But such a capability is highly desirable to many other users of modern computing devices like desktops, laptops, tablets or all-in-ones, which come with multiple functionalities and uses. When it comes to an average user, the activity that often demands display colour accuracy is online shopping. When buying products such as clothing, shoes, or items for home decoration, not being able to view the actual colour may lead to a disappointing purchase experience, and possibly a return of the purchased merchandise. Preserving colour accuracy is a prerequisite for many professional users in their work. Take for example photographers, creative digital artists or graphics designers – it is not difficult to imagine they would need colours to show up accurately on their monitors. Similarly, professionals in the film and post-production industry, like colourists and directors of photography, would not be able to do their jobs if accurate colours were not ensured throughout the whole process of movie creation. A critical part of those professionals’ job is to preserve the original intent of the director when the movie was shot. Let me give two examples. During film production, raw, unedited footage shot during the day,
commonly called ‘dailies’ or sometimes ‘rushes’, is prepared for a review by the film director, the director of photography (or cinematographer) and select staff and artists. The purpose of the review, undertaken later that night or the following morning, is to determine if the quality of the shots is satisfactory, or if any retake is needed. Among other things, colour tones in the entire footage are scrutinised for inconsistencies and departures from the director’s intent. Colours captured with the film camera need to be reproduced faithfully on the display or on the projector. Colour accuracy must be preserved both during capture and display. The second example is the process of colour grading during post-production, where the colourist tweaks the colours frame by frame to give the content a certain look. This process, sometimes also referred to as colour correction, is very subjective as it does not follow any mathematical rules. Every colourist presumably arrives at a different result. Not surprisingly, many directors of photography have their favourite colourists they like to work with. The director of photography needs to review and approve the final outcome of the colour grading process. Both individuals might view the content on the same display sitting side by side, or they might be sitting in two studio rooms across continents, viewing the content on two entirely different displays. In either case, each display that they use must be painstakingly colour calibrated. There must be a high level of confidence that the specific shade of purple or turquoise would appear almost identical on the monitors, and when measured with a high-quality instrument would yield the same numbers. Well, at least within a certain predetermined tolerance. Thus, it is not surprising that a massive amount of time, money and effort goes into preserving colour accuracy and deploying proper colour management in the movie making process.

Display technology evolution and colour accuracy
For many decades, the professionals engaged in creative pursuits and colour-critical tasks relied on professional-grade monitors for their work. These monitors offered excellent accuracy, stability and repeatability in colour reproduction, but were bulky and expensive, often employing conventional cathode ray tube (CRT) technology,
which is in the process of being phased out. Most professional monitors in the market today are based on liquid crystal display (LCD) technology, which offers advanced digital controls and features. The other display technology that has very recently found its way into mass-market consumer electronics devices like smartphones and high-end television sets is organic light-emitting diode (OLED) technology. This has been drawing significant attention not only because it offers superior visual colour and contrast performance over LCD technology, but also because it is thinner and functionally more power efficient. As an example of why it is more power efficient, imagine a picture of a long dark tunnel, at the end of which you can see a bright light. Even if 90% of the scene is pitch dark, a typical LCD has the backlight at full brightness, just to be able to show the bright end of the tunnel – you cannot selectively turn off the backlight. An OLED display on the other hand would have 90% of the pixels turned off, barring the area showing the end of the tunnel. While it has these advantages, OLED technology poses multiple technical challenges as well. The organic material responsible for the blue colour in an OLED display tends to degrade faster than red or green, giving a permanent, uneven colour tint on the panel over time. The technical term for it is the burn-in effect, which is very dependent on how much the display was used and what content was displayed. Further, the colour performance in an OLED changes significantly as a function of the overall display brightness level and the area of the screen displaying content. So, preserving colour accuracy in an OLED display for a wide variety of content is more challenging than for an LCD display. We should be able to find solutions to technical challenges like these in the coming years, which will enable OLED technology to become a dominant mainstream display technology by ousting LCD, much in the same way as LCD technology replaced CRT technology some decades ago. Two of the current technologies with the most impact on display colour quality are the wide colour gamut and high dynamic range. Both have implications for colour accuracy, so it is important to have a basic understanding of them. Display colour gamut refers to the range of distinct colours a display can reproduce, determined by the saturation of purest red, green and blue (RGB) colours of the display, called ‘the primaries’. Conventional
displays have a standard colour gamut referred to as sRGB, while more modern displays use deeper reds and greens, resulting in a wider colour gamut. Some common examples are DCI-P3 (defined by the Digital Cinema Initiatives and used in the film industry) and Adobe RGB (mainly used by photographers and graphic artists), which contain 35–37% more colours than sRGB. In this regard, it is probably worthwhile to explain the subtle difference between colour gamut and colour space, a closely related term you will likely come across. Colour space refers to the colour representation in a given three-dimensional space, which can either be dependent on a specific capture or display device or can be independent of any device. So, sRGB, DCI-P3 and Adobe RGB are examples of device-dependent RGB colour spaces. CIEXYZ and CIELAB, created by the International Commission on Illumination (CIE), are examples of device independent colour spaces. When a display or a projector conforms to any of those device-dependent colour space specifications, the (device) colour space and colour gamut are used almost interchangeably. However, it is not uncommon for the primaries of any specific device to slightly deviate from the standard device colour spaces. So, when talking about a display device, I recommend using the term colour gamut, and reserve the term colour space for describing the industry standards. By the way, for those of you working with printers, they use a different set of primaries like cyan, magenta, yellow and black (CMYK), but in this chapter I have mostly focused on displays. While we are on the topic of colour space, I should briefly discuss ‘white point’, since it fundamentally affects the concept of colour accuracy. Any given colour space has defined RGB primaries. When these primaries are combined, it results in a shade of white, which is the white point for that specific colour space. Both sRGB and Adobe RGB use a standard white point termed D65, referring to a standard daylight source with a correlated colour temperature (CCT) of 6500K. I will skip the technical definition of CCT, but the higher the number the more bluish or cooler the white point is, and the lower the number the more yellowish or warmer the white point is. Unlike sRGB and Adobe RGB, DCI-P3 has a white point a bit warmer than the D65, more like 6300K. This results in a rather different overall look and colour tone of digital content under DCI-P3 from that used under sRGB or
Adobe RGB. The key reason for this is that DCI-P3 is a digital cinema standard, originally defined for viewing conditions in theatres. For general use, a more appropriate colour space or standard is P3-D65, which has the same colour gamut as DCI-P3, but with a D65 white point. Most consumer-grade wide gamut displays recently released to the market use P3-D65 or Adobe RGB as the default colour profile. Using the appropriate white point is critical for preserving colour accuracy in digital media. However, when it comes to colour accuracy, these wide colour gamut displays make things a bit more complicated, as before their advent all that the mass-market displays were typically able to offer was the sRGB colour gamut. So, when digital content was created, and when it was consumed, the implicit assumption was that it conformed to the sRGB standard. That allowed people to be indolent and bypass colour management altogether in many consumer applications! In other words, it was acceptable if the digital content were missing information about the exact physical colours the stored digital values were meant to represent. The software operating system would simply assume the content to be sRGB and reproduce colours accordingly. Much digital content still follows this implicit assumption, but it falls apart when we introduce a wide gamut display into the workflow. When non-colour-managed digital content is reproduced on such a display, colours look oversaturated and skin tones look unnatural, not to mention that colour accuracy has gone out the window. This happens because the reproduced colours were meant for an sRGB display, not a P3 or Adobe RGB display. A digital value of 255 red maps to a lot deeper red in P3 than in sRGB. Similarly, a digital value of 255 green maps to a lot deeper green in Adobe RGB than in sRGB. Avoiding this situation would require us to use colour-managed content exclusively along with colour management software applications. This does not match the ground reality when we think about either content creation or consumption. I will discuss this in more detail later when I talk about software features. The other latest technology to affect display colour quality profoundly is high dynamic range (HDR) display technology. A display’s dynamic range is defined as the ratio of maximum brightness (of full white) to minimum brightness (of full black) the display is capable of reproducing. It is also called the contrast ratio. Conventional
LCDs have a contrast ratio of up to 800:1. The latest mobile displays released recently report a contrast ratio of up to 1800:1. HDR displays on the other hand have a contrast ratio from 60,000:1 up to 100,000:1. Showing HDR content on an HDR display in full glory requires us to overcome many a hurdle. First of all, each and every piece of hardware and software involved in processing HDR content must represent colours in 30 bits or more, while most of the traditional hardware and software components support only 24-bit colour. In case you are not familiar with the concept of bit-depth, let me clarify what I mean by 30-bit and 24-bit colour. While a painter’s brush brings glory to colours in innumerable combinations of shades and mixtures on the easel, in the digital world colours can be represented only by a finite set of numbers, characterised by bits – 0s and 1s. For a 24-bit system, each of the three primaries – red, green and blue – can be represented by an 8-bit number, giving a total of 256 shades (2⁸ = 256) per primary colour. When you take all combinations of those shades of primaries, you get more than 16 million colours (256³ = 16,777,216). Now you can do the maths for a 30-bit system. Each primary will have 10 bits, leading to 1024 shades per primary, which gives more than a billion colours! You must have seen that ‘billion colours’ phrase in some of the TV advertisements. Beyond requiring adequate bit-depth, rendering HDR content faithfully necessitates having information about the original scene luminance, which is stored in the content’s metadata. If metadata is a new term to you, think of it like a memo describing file content, pasted on top of the file before being dispatched from one office to another. A real world scene containing a 10⁸:1 dynamic range (white is 100,000,000 times brighter than black) needs to be specially processed to render on an HDR display. Technically, this is known as tone mapping. HDR rendering and HDR displays thus employ a whole different set of hardware and software designs, just as HDR capture requires specialised camera and software tools. The context of colour accuracy is also very different from what we have discussed so far. Preserving brightness in a perceptual sense is considered more critical for HDR applications than preserving colour accuracy. Thus we need to be careful when displaying standard dynamic range (SDR) content on an HDR display. We need to ensure HDR tone mapping is turned off. Similarly, HDR content displayed on an SDR display might be subjected to additional tone mapping. It is safe to say that using
colourimetric techniques to accurately quantify and describe the human perception of colour for accurate HDR imaging is a non-trivial challenge. As display technology transitions from LCD to OLED (or one of the other latest technologies), from sRGB to wide gamut, and from SDR to HDR, what is the impact on archival and conservation work? What role does ambient light play in our quest for colour accuracy? How does the business of preserving and representing the colours of yesteryear artwork evolve? I will try to address some of these questions in the next sections. As for the rest, I hope to leave you with some interesting food for thought.

Challenges for a device manufacturer in preserving colour accuracy
There are numerous technical challenges involved in ensuring display colour accuracy during the product manufacturing process. I was fortunate to have first-hand experience of initiating and subsequently helping implement factory colour calibration for the displays in Microsoft Surface products after I joined the company in 2013. I would like to share with you my experience of dealing with display colour calibration, and how my team ran into issues and solved them along the way. This, I hope, will help you understand and appreciate the amount of effort a manufacturer needs to put in to ensure high-quality display colour accuracy. I would expect my experience and learning not to be much different from those of many others in the industry who are involved in similar activities. In the past decade, many leading manufacturers of personal computing devices like laptops and tablets have incorporated display colour calibration in their manufacturing process, thus prioritising colour accuracy. What once used to be a characteristic of professional-grade monitors costing thousands of dollars is almost a mandatory feature in today’s consumer devices, many of which are priced at only several hundred dollars. In the early days of Surface, when Surface RT was released, independent external reviewers were critical of display colours not being accurate in the product. There were a few reasons why this was the case, but before I consider them I need to explain two technical aspects of preserving colour accuracy.
First, to evaluate the accuracy of colour, reviewers either view test images that have an expected appearance, or, if they are more sophisticated, measure specific test colours displayed on the screen with an instrument (called a colourimeter or a spectroradiometer) and check if the values match expected results. The most commonly used metric for perceived colour difference (error) is called Delta E (ΔE). In the year 2000, an advanced formula was established in the industry, Delta E 2000 (ΔE00). Speaking loosely, a colour difference below one unit of ΔE00 is not perceptually noticeable. There are many software tools that can run automated tests and summarise results, including Delta E numbers. These test colours and expected results need to adhere to some universal standard, accepted by the content creators, hardware manufacturers and customers alike. As I explained earlier, the most common such standard is sRGB. Most of the content on the internet uses the sRGB colours by default. So, if a display can show sRGB colours accurately, any sRGB image (or video) will automatically show up accurately on the display, using virtually any image viewer (or video player). Second, I need to explain how a display can show accurate colours. Let us consider the LCD technology since it has been more prevalent in the past several decades. An LCD typically has a light source in the form of a sheet or layer at the back of the display, called a backlight. This can be an array of uniformly distributed white light-emitting diodes (LEDs), or a strip of high brightness white LEDs placed along the edge of the display. In the latter case of edge illumination, the light from the LEDs is distributed with the help of a clear plastic plate called a lightguide. Fluorescent lamps could be used as the light source, though they are less common these days. There are other layers in an LCD serving various purposes, but they are not important for our discussion. What ultimately creates colours is the light emanating from the backlight and passing through RGB colour filters. Every dot on the screen, called the pixel (comes from picture element), consists of those colour filters so that a pixel can show various shades of red, green, blue or any combination of them. The characteristics of those colour filters, how well tuned they are, ultimately determines how much extra work one must put in to achieve accurate colours on the display. That extra work is colour calibration, which I will explain shortly. Now that you have some idea about the relevant technical aspects,

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 110

110 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

let me return to my story around Surface RT. The display in that product was capable of showing a subset of all the sRGB colours – its gamut was smaller than the full sRGB gamut. So when the reviewers viewed an sRGB test image, colours did not show up correctly. In the absence of colour management, all colours got squished to fit to the smaller display gamut, thus giving an overall washed out look, which included even the mid-tones that the display would have been able to show correctly with a little bit of work. As a newly minted colour scientist in the display team, the first thing I tried in the next products, Surface Pro 2 and Surface 2, was the first step towards colour management, to define a display colour profile. That allowed colour-managed applications like Photo Viewer to show the sRGB colours correctly within the constraints of the sub-sRGB display gamut. It seemed like the display suddenly had richer colours, without any additional investment. However, the limitation quickly became clear. Many software applications in the windows ecosystem did not adhere to colour management. Further, there was quite a bit of variation in colour characteristics from one device to the next, so a generic colour profile could only go so far. When we started the development work for Surface Pro 3 in 2014, I figured out a way to address the colour accuracy problem – through factory colour calibration. Using a high-end colour measurement instrument (a spectroradiometer), I could take fast, accurate measurements on the display by showing a specific set of colours, and then fed those measurement data into my new colour calibration algorithm, which computed the necessary settings, basically a set of numbers, to be programmed into the graphics hardware responsible for driving the display. This was the ‘extra work’ I referred to earlier. The hardware settings ‘fixed’ any inaccuracies inherent to the display panel design and construction. Once we are able to specify the hardware settings for an sRGB display, the colour accuracy no longer depends on the colour management capability of the software applications. It gets applied automatically ‘system wide’. The hardware settings are unique to every device, and drastically reduce device-to-device variations in colour performance. There are mainly two aspects involved in factory colour calibration: white point correction and gamma correction. White point correction addresses inaccuracies in the display gamut – it ensures the white on

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 111

SARKAR ACCURATE DIGITAL COLOUR REPRODUCTION ON DISPLAYS 111

the display is not too bluish and not too yellowish, and the fully saturated colours like RGB are showing up as accurately as feasible. Gamma correction ensures smooth gradation (also called ramps) of greys and reds, greens and blues. Imagine the picture of a cloudless azure sky that gets deeper and deeper as you gaze towards the horizon – preserving the azure tone of the sky would require white point correction, and preserving the smooth transition throughout the sky would necessitate excellent gamma correction. Ensuring smooth gradients in a greyscale image is even more challenging since the display must be able to preserve the neutral tone by balancing the red, green and blue, as well as the transitions between adjacent shades. The display panel design, the colour resolution available at various stages of the hardware components, the colour resolution provided by the operating system and software applications – everything plays a critical role in achieving exceptional colour accuracy. Any deficiency in one piece would ruin all the hard labour and investments involved in others. If you think about white point correction and colour accuracy in general, it essentially comes down to how you can achieve a balance between RGB primary colours as the 16 million colours are produced on the display (or 1 billion for a 10-bit display). This balancing act makes it necessary to reduce the brightness of the peak primaries to various degrees until you get the right shade of grey or white – you cannot increase the brightness of the primaries beyond what is physically realisable. Thus, white point correction invariably leads to some loss in original brightness capability in the display. The further the colours are from the target in the pre-calibration state, the greater the luminance loss in the calibration process. Since the settings are applied in the hardware at the factory, as a customer you never get back that original brightness. That is the trade-off of colour calibration, between luminance loss and colour accuracy. While this is not much of a concern for professional-grade desktop monitors, it is a big deal for mobile devices. Luminance loss leads to lower contrast in the display. Lower contrast affects outdoor readability, and can in turn impact battery life as users are forced to crank up the brightness more often. In Surface, for example, we have had to give up some accuracy in the past in favour of attaining the brightness and contrast we wanted to achieve in our products.

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 112

112 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

I did not have much industrial experience when I joined the Surface team. In my first three to four years the progressive improvement in the display colour accuracy of our products matched well the learning curve I and the rest of the team had to traverse. That included gaining collective expertise as a team. Surface, being a system integrator, does not build its own hardware, but rather procures and assembles various pieces based on its own design of the products. So, improving display colour accuracy involved understanding the hardware capability offered by various external partners like Intel and NVIDIA and coming up with any workarounds needed when the capability did not meet our requirements. It involved understanding technology offerings by panel suppliers and other electronic components suppliers, putting in place a brand new manufacturing process, and continuously improving the process every year. Let me share with you two anecdotes. When we first started calibrating Surface Pro 3 devices in the factory, we discovered some discrepancy in the measurement data from the factory stations and the data subsequently collected in our lab. We chased down the root cause to the fact that we were not allowing the displays to be turned on for a certain amount of time in the factory before starting the calibration process. The LED backlight in the display has a certain settling time (or a warm-up time). That required us to add an additional step in the factory colour calibration process so that all displays are in stable state. Any such requirement ultimately has an impact on the manufacturing cost and schedule. Needless to say, our manufacturing team was not too thrilled with this sudden revelation! One issue that was always a bit thorny was how long we were allowed to spend to run the display calibration test on the factory floor. Coming relatively fresh from the academic world, I spent hours and hours in the laboratory perfecting my tests and experiments. It was initially hard for me to realise how every second counts when production is in full swing. There is a maximum time, in the order of tens of seconds, that any specific test can take during production, called a ‘takt time’. In the beginning, it took me and the team a huge amount of effort to ensure that every step of the factory colour calibration process was optimised so that we could achieve desired colour accuracy without a ‘takt time hit’. If anything goes wrong with the test, and a device fails to complete the calibration process success-

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 113

SARKAR ACCURATE DIGITAL COLOUR REPRODUCTION ON DISPLAYS 113

.

fully, it needs to either go through the test again or go through a separate process, typically called Failure Analysis, to figure out what went wrong. If that happened more than a couple of times a day during peak production, it could very well lead to a war-like situation, and we would find ourselves in the centre of a storm until we figured out the root cause, and a solution. Over the past three to four years, significant investment has been made in the overall factory colour calibration process for Surface displays, in resources, infrastructure, development time and process implementation steps. The high-end instruments alone cost tens of thousands of dollars. The colour measurements must be made in a dark chamber, at the centre of the display, at 100% brightness, and after turning off all the software features that might affect the colour and brightness of the display. The devices should ideally be connected to power so that power saving features do not attempt to reduce the brightness automatically. These are only a few of many different considerations in ensuring a reliable, efficient implementation and execution of the factory process. Hopefully, all the above details helped paint a picture in your mind – that the amount of pain a device manufacturer has to go through in ensuring the display colour accuracy meets the benchmark standards. When I joined Surface in 2013, few mobile devices were meeting the average accuracy of less than 2.0 Delta E (ΔE00) in independent reviewers’ assessment. As I write this chapter in 2018, many tablet and smartphone devices are approaching average colour accuracy of 1.0 Delta E. What was a product differentiator for mobile devices back then has become a baseline feature today. Professional-grade displays have always had a very high colour accuracy, and sometimes with a periodic self-calibration capability, all of which typically come at a premium cost and relatively fewer constraints than the mobile devices. Software features and colour accuracy: a complicated story In the previous sections, we mostly talked about hardware aspects, from panel characteristics to hardware design and manufacturing choices. The other critical piece that determines the colour performance of a display is the software, including the operating

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 114

114 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

system and various software applications. I should point out upfront that the forthcoming discussion is not a comprehensive review of various software features in various operating systems that might be relevant here. This should merely serve as a general guideline on what type of software features you should be mindful of. If you are aware of them, you can decide whether to use them, how and when. Probably the most crucial element to mention here is the colour management in the operating system. Think of colour management as the mechanism that allows colour professionals to manage digital settings to ensure that the vision of the content creator is matched with what the viewer sees when they experience the final creation. For colour management to work across various operating systems, devices and software tools, we need to adhere to a certain universally recognised standard. Such an industry standard is established by the International Colour Consortium (ICC). The ICC-based colour management systems allow many different devices with varied colour characteristics and capabilities to speak the same universal language of colour (technical jargon is ‘profile connection space’). Further, device-specific colour characteristics are described in device colour profiles using a standardised format. Colour profiles provide the mechanism for documenting colour-related information during the creation of digital content, so that the consumer of the content can reproduce the visual effect of the original as intended, independent of any device and underlying technology. A colour management system has two main responsibilities: • to assign a specific colour meaning to the digital values in an image • to change the digital values as the image goes from one device to the other so that the visual appearance of the image always stays the same (Fraser, Murphy and Bunting, 2014). The former is achieved by assigning or embedding into the image a profile (called input profile) that describes the colour characteristics of the originating device, for example a digital camera. The latter is achieved by converting from the input profile to a profile (called output profile) that describes the colour characteristics of the device where the image will be viewed, for example a display or a printer.

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 115

SARKAR ACCURATE DIGITAL COLOUR REPRODUCTION ON DISPLAYS 115

Let us consider an example. Imagine you are digitising a piece of artwork. Ideally, you will use the raw capture setting in your professional camera and save the image as a raw DNG (Digital Negative) file, so that you can preserve as much information as possible. Camera characteristics automatically get saved in the image metadata. You can then transfer the image to your computer and open it using a software tool like Adobe Photoshop. While opening the file, you can specify, among other parameters, the colour space, say Adobe RGB or sRGB, which assigns the right profile to the image. You can then perform operations like colour correction and resizing if you want, and when done save the image in any format as needed, say JPEG (Joint Photographic Experts Group) or TIFF (Tagged Image File Format). Colour management principles are enforced differently by different operating systems, for example Image Colour Management (ICM) in Microsoft Windows, ColourSync in Apple’s iOS/Mac operating system and, more recently, colour management in Google Android. Android has traditionally been used in handheld mobile devices, but some laptops running Android have recently started appearing on the market. A key difference in colour management between Windows and iOS/Mac operating system is that in Windows colour management is an opt-in process, so a Windows application can completely ignore colour management if it wants to. Whereas in iOS/Mac operating system, every application is bound to abide by some minimum colour management principles. Some software applications perform the colour management tasks on their own so they can function similarly on different platforms, while the others rely entirely on the operating system for colour management. Thus it is important to be aware of the colour management capabilities of the operating system as well as the specific application you are using. Software features that have an impact on colour fidelity Apart from colour management, there is a whole array of software features that have an impact on colour fidelity. When preserving colour fidelity and accuracy, you should be cautious about any power saving feature that is ‘content adaptive’. This is particularly important if you are using a mobile device as in this case increasing the battery life is a critical requirement, so these

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 116

116 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

features may perform more aggressively. The display is typically the most power-hogging element in a computing device. Apart from setting the display to sleep after a certain amount of time, the graphics hardware in your device may have a feature that kicks in when the device is on battery, and would try to reduce the backlight brightness to save power, and then increase the digital colour values in red, green and blue in a way that the visual appearance does not change significantly. However, these features often do not consider exactly how our visual system responds to colour variations and hue changes that differ in different areas of the colour space, and in a very nonlinear fashion for RGB digital values in the content. The more sophisticated the feature is, the more difficult it is to detect visual artefacts or evidence that the feature is altering the content. Another factor here is that as a user you rarely do what we typically do during development: put two identical devices side by side, one with the feature turned on and another with it turned off, and then compare visual performance on a variety of content to determine if the results are acceptable. If you are measuring colours on the display, it is almost always a bad idea to have the feature turned on. Whether there is such a feature on your device depends on the manufacturer’s product design choices and the operating system’s power policy. If you are unsure and want to be on the safe side, I recommend keeping your device connected to power when you are working, particularly when you need to measure colours or review them with a critical eye. These days it is becoming more and more common for computers to flaunt a ‘blue light reduction’ feature. Apple calls it ‘Night Shift’, Microsoft calls it ‘Night Light’, some Android device manufacturers call it ‘Blue Light Filter’. The main goal is to reduce the amount of blue light produced by the display at night. The reduction could increase over the course of the evening and night. The purpose is to reduce eye strain and sleeplessness. Several research studies have reported that blue light from LEDs affect our circadian rhythm by suppressing melatonin levels responsible for sleep, and in an extreme case can even lead to photoreceptor damage. However, when you need your display to be colour accurate, particularly if you are working at night, you will definitely want to turn the feature off. Other software features that can be problematic for colour-critical work are ‘ambient-adaptive’ ones, for example the ‘auto-brightness’

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 117

SARKAR ACCURATE DIGITAL COLOUR REPRODUCTION ON DISPLAYS 117

feature that automatically adjusts the brightness of the display backlight based on what the ambient light sensor installed on your device senses to be the brightness of your surround. You may wonder: what is wrong with that, if it is not changing the colours on the screen? The truth is we perceive colour as a three-dimensional entity. An increase in brightness can increase the perceived colourfulness or contrast and vice versa. In colour appearance literature, these colour appearance phenomena are referred to as the Hunt effect (perception of colourfulness increases with brightness) and the Stevens effect (perception of contrast increases with brightness) respectively. I will talk a bit more about colour appearance and ambient adaptation at the end of this chapter when I discuss unsolved challenges. My point here is that an ‘auto-brightness’ feature may adversely affect colour fidelity. Colour measurements would invariably be wrong if the feature is on. Another ambient-adaptive feature has recently been introduced by Apple in their iPad products called ‘True Tone’ (which I expect to be incorporated in their other product lines). When this feature is on, the display’s white point responds to changes in the colour of ambient light. A warm ambient light (as in home lighting) makes the display go a bit warmer, and under natural daylight it goes back to its default values. The change is subtle and follows a slow transition. Like autobrightness, you should turn off this feature as well when colour accuracy is important for your work. Digitisation and colour accuracy: planning and thinking ahead As I alluded to before, bypassing colour management during content creation or digitisation was not of a huge concern some decades ago since there was an implicit assumption that the content would be in sRGB colour space. Today’s displays have vastly improved colour capabilities. In the coming years and decades, it will be more and more common for displays to have wider colour gamuts, so I strongly advise you to embrace sound colour management practices. If you are involved in digital content creation or digitisation of existing artwork, the single most important advice I can give you is to start by capturing and preserving as much information as possible, and allow redundant information to be discarded later as and when

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 118

118 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

needed. It is a lot more difficult to synthesise missing colour fidelity information than to discard information that is not needed. This is true for colour gamut, colour bit-depth, resolution as well as dynamic range. For example, while photographing an artwork using a professional camera, you can capture the image as raw (unprocessed data from the camera’s image sensor), and then process with an image-editing software application like Adobe Photoshop to save it in a wide gamut format like Adobe RGB. Any colour-managed image viewer will properly map the colours when showing on an sRGB or a P3 display. In contrast, if you started with an sRGB image, even though the colour management system in your application or operating system would expand the colours to display on a wide gamut display, the accuracy you can expect in the former case will not be available. While digitising, you should use as much colour bit-depth as possible. Earlier, I mentioned 8-bit and 10-bit per primary RGB colours in relation to display hardware. When dealing with digital images, you can save them as 16-bit or 32-bit per primary colour. That ensures the best possible colour fidelity during content creation and/or digitisation. Resolution of the content follows the same traits. Today, televisions boast ultra HD (high definition) or 4K resolution. More strikingly, some of the latest mobile devices come with up to 6 million pixels or more. However, the human eye cannot resolve the pixels at a certain distance beyond a given resolution, and device manufacturers have started recognising that it is not practical or useful to continue boosting the display resolution. I suspect state-of-the-art professional cameras already provide a resolution that will be sufficient in the foreseeable future. Archiving raw camera still photos has additional advantages in that the camera raw format can be imported into image-editing software like Photoshop and then converted into an HDR format. It allows preserving a wider dynamic range present in the original content. Photographers and filmmakers will likely shift more towards HDR content in the coming years and decades. To meet the demand of future content, HDR displays will become more prevalent. SDR content cannot be reliably converted into HDR content. As an analogy, think of the black-andwhite content of yesteryears – we have no reliable way to turn them into colour content. If we have the data in a format that would support the next wave of technology, it is highly desirable.

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 119

SARKAR ACCURATE DIGITAL COLOUR REPRODUCTION ON DISPLAYS 119

If you are involved in archival work that does not involve digitisation, you will still need to ensure that the colour fidelity of the original content is preserved. You must choose proper devices and applications for displaying the content. Many areas discussed earlier in this chapter should give you a good general guideline. It is worthwhile to recognise here the technological revolution on the storage side that has been taking place over the past several decades. Floppy disks have turned into museum articles, the compact discs (CDs) are no longer preferred as mobile devices do not have CD players, and their longevity is not satisfactory anymore. While high capacity external storage disc is the preferred choice today, I suspect that will eventually give way to a technology that has already started reshaping multiple industries. High speed internet at home and work is commonplace today, cellular networks are moving towards 4G and 5G, today’s wider colour gamut, high bit-depth, high resolution content keep pushing the limit of how we transmit and consume data. All of this make cloud storage and computing a choice for the future. Relying on computer hard disk or external hard disk as the primary storage will become less and less common. It is important for archivists and conservationists to embrace this technological shift. What are the unsolved challenges and where to go from here Generally, when we talk about colour accuracy and basic colour management, we implicitly rely on some of the fundamental assumptions and guiding principles behind applied colourimetry and colour modelling. These are areas within colour science that deal with colour measurements and using mathematical models to describe and control device colour characteristics. The effect of ambient light and the surround is not considered in colourimetry, even though they can have a significant impact on our overall perception of colour. That goes into the domain of colour appearance. I will argue that colour accuracy as we know it is more relevant for theoretical comparisons, benchmarking activities and colour-critical tasks under controlled conditions than for our day-to-day activities in real world circumstances. Colour measurements are typically carried out in a dark confined space, and colour accuracy is determined against a reference condition. However,

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 120

120 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

in the real world, what ambient conditions we are in greatly impacts our perception of colours. Ambient colour and brightness, colour and contrast of objects in our field of view, the degree to which we are adapted to the immediate surround – factors like these dictate our perception, introducing a certain level of uncertainty in the final result. For this reason, photographers and graphics artists choose to work under a defined, controlled lighting condition. Thus, the question is: How do we define and achieve real world colour accuracy? While there are colour appearance models to predict how our sensory and cognitive system might behave when presented with a specific set of colours and objects under a specific viewing condition, they are far from perfect in unravelling the immensely complex and sophisticated human visual system. It is equally difficult to implement a thorough and complex colour appearance model in the device hardware. If we force the hardware to do all the complex operations 60 times per second or more, which is how fast they need to process content, we will need a very powerful computer. It is not a realistic proposition for a laptop or a tablet. Thus, finding a display feature that can reasonably achieve real world colour accuracy that we can relate to the world around us, and be implemented across a broad spectrum of devices, is one of the key unsolved challenges in colour reproduction. There is another area where the colour accuracy and colour matching story get a bit complicated. Let me explain with a real example. The other day my wife bought a white woollen hat and a matching scarf from the store. We both thought their off-white colours matched pretty well to be worn together. However, when we arrived home and checked, we were shocked to find that the colours of the garments appeared significantly different under cool fluorescent lighting. I persuaded my wife not to return those items, so that we could have a fun way to demonstrate to our friends the phenomenon of metamerism. When two colours match under one lighting condition but not under a different one, you can safely blame it on illuminant metamerism. A related effect of change in lighting is called colour inconstancy, where an object’s colour appearance changes with lighting. How are those relevant for visual arts and an archivist’s work? First, imagine an artist using two different colour mixtures to paint various

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 121

SARKAR ACCURATE DIGITAL COLOUR REPRODUCTION ON DISPLAYS 121

parts of the sky blending into each other. They matched well in the artist’s studio full of natural daylight, but showed up differently under the warm lighting of a gallery. On the other end of the spectrum, imagine a conservationist performing inpainting to treat damaged areas of an old painting. The inpainting matched the adjacent original paint in the natural daylight of the conservation studio, but appeared more magenta than blue when the painting was returned to the gallery. The phenomenon that can turn out to be more perplexing than the ones described above is called observer metamerism, where two colours under a specific viewing condition match for one person, but not for another, even though both have normal colour vision. This happens when there is a significant difference in the sensitivities of photoreceptors (cones) in the eyes of two individuals. Colour calibration and colour accuracy use one of the standard light sources (for example natural daylight) as the reference condition. There is no way to predict the effect the illuminant or observer metamerism or colour inconstancy can have on a digitised artwork, if it has been photographed only under a single lighting condition. Something like that requires us to capture spectral content of the artwork at different wavelengths. This technique is known as spectral imaging, which is way outside the capabilities of mainstream cameras and displays. Roy Bern’s recently published book is an informative reference on these topics as they pertain to visual arts, and has many real examples involving paintings and paint samples (Berns, 2016). Conclusion In the foregoing discussion, I described some of the shortcomings of current technologies in meeting challenges related to colour perception and colour accuracy. New technologies bring new challenges. Think about virtual and augmented reality. They employ novel, complex types of optical systems posing colour reproduction challenges that are unlike the existing mainstream imaging technologies. If we were to use virtual or augmented reality to experience visual arts, how would we define, measure and achieve colour accuracy, and colour fidelity? The concept of viewing condition is quite different in this case.

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 122

122 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

How would we then adapt our colour appearance models to predict what a viewer is experiencing? What kind of challenges would these technologies pose to colour management systems? Would the current method of defining and generating colour profiles suffice, or would we need to adapt them to these new technologies? I do not think we have answers to all these pertinent questions just yet. However, as an archivist or a conservationist, you want to be mindful of these technological evolutions, and evolve your tools and methods to adapt to technological transitions. As the technology driving the electronic industry continues to evolve at a rapid pace, adapting to such transition is a significant challenge for an archivist, whose work must stay relevant long after today’s technology becomes obsolete in the not-so-distant future. References Berns, R. S. (2016) Colour Science and the Visual Arts: a guide for conservators, curators, and the curious, The Getty Conservation Institute. Fraser, B., Murphy, C. and Bunting, F. (2014) Real World Colour Management, 2nd edn, Peachpit Press.

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 123

7 Historical BIM+: sharing, preserving and reusing architectural design data Ju Hyun Lee and Ning Gu

Introduction Architecture, engineering and construction professional practices (e.g. architectural, engineering and contracting firms and practitioners) have pushed software developers to support standardised data formats and flows for importing, exporting and sharing building data. Recently, the widespread use and proliferation of object-oriented computer-aided design (CAD) packages together with the increased complexity and automation in the construction processes have fostered the uptake and exchange of 3D data during the collaboration process (Singh, Gu and Wang, 2011). Building information modelling (BIM) has facilitated this innovation in building design, construction and management. Through a single digital data repository, BIM shares and maintains an integrated digital representation of all building information throughout the entire project lifecycle (Gu and London, 2010). Creating a BIM ecosystem requires careful consideration for BIM-related products, processes and people to co-evolve (Gu, Singh and London, 2015). BIM supports powerful building data documentation and digitisation, and facilitates more effective communication across different disciplines involved in the building lifecycle. It aims to break down the communication barriers among different stakeholders by maintaining only one consistent digital model for all disciplines through new processes of generating, managing and sharing building information.

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 124

124 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

BIM research has been developed over decades and has largely addressed new building designs, while its applications on existing buildings are still under scrutiny (Volk, Stengel and Schultmann, 2014). Historical BIM (HBIM) is a recent development of BIM focusing on historical buildings, and addresses the historical, cultural and social parameters that exist in this realm (Biagini et al. 2016; Quattrini, Pierdicca and Morbidoni, 2017). Most implementations of Historical BIM (HBIM) involve a parametric library, which is a collection of parametric geometries allowing mathematical modelling (through generations and variations) of shapes and objects, and a mapping system for capturing and translating survey data for heritage conservation (Maurice, McGovern and Sara, 2009), therefore it is also called ‘historic BIM’ or ‘heritage BIM’. HBIM extends the generic BIM approach to document and manage buildings that are historically important with significant heritage values. Because of this, most HBIM literature has focused on addressing the accurate, automated creation of a 3D digital model from survey data through advanced techniques such as terrestrial laser scanning and photogrammetry. Recent research developments on the translation of point clouds into building models is especially common in this domain (Baik, 2017; Chiabrando, Lo Turco and Rinaudo, 2017; López et al., 2017). Point clouds are sets of dense point measurements, collected by laser scanning and represented in a common co-ordinate system (Tang et al., 2010). Recent studies further suggest that HBIM has the potential to facilitate the sharing and reuse of historical building data among a variety of stakeholders including the public (Murphy et al., 2017; Napolitano, Scherer and Glisic, 2017). Despite these preliminary studies, sharing, preserving and reusing historical building data is still largely unexplored. In this context, the chapter first briefly introduces the challenges the design and building industry have faced in sharing, preserving and reusing architectural design data before the emergence and adoption of BIM, and discusses BIM as a solution for these challenges. It then reviews the current state of BIM technologies and subsequently presents the concept of historical BIM+ (HBIM+), which aims to share, preserve and reuse historical building information. HBIM+ is based on a new framework that combines the theoretical foundation of HBIM

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 125

LEE AND GU HISTORICAL BIM+ 125

with emerging ontologies and technologies in the field including geographic information system (GIS), mobile computing and cloud computing to create, manage and exchange historical building data and their associated values more effectively. BIM as a solution for the sector challenges Collaboration in the architecture, engineering and construction industries has long revolved around the development and exchange of 2D drawings and documents, including descriptive specifications and details. Designers have often used full-scale mock-ups, models and 3D virtual models for further visualisation needs, however 2Dbased documentation is still a main means of collaboration in conventional practices. In addition, the lack of trust in the completeness and accuracy of 3D digital models has remained a concern for the collaborators involved (Gu and London, 2010). As a result, data exchange across the disciplines in a building and construction project is still limited to 2D drawings in many practices – especially in small to medium sized firms. An integrated digital model capable of representing design and building data in multiple forms and formats has become essential for supporting greater collaboration and communication in the sector. Strategies and enablers such as agreed protocols, standardised evaluation and validation procedures have been gradually implemented for assigning responsibilities and facilitating effective collaboration and communication. For example, the development of intelligent model checkers has been able to ensure the accuracy of 3D model integration for processes such as design review and evaluation. In the meantime, international standards for paper and digital based documentation in the cultural heritage domain have been established through the Venice Charter and the London Charter (Denard, 2012; Murphy et al., 2017). Those creating 3D digital models for cultural heritage have recently adopted more advanced technologies and techniques – the use of terrestrial laser scanning, photogrammetry and these methods combined with remote sensors. However, these 3D models are largely limited to a digital record of a building or artefact relating only to geometry, texture and other visual properties. A new ontology for cultural heritage buildings has been developed to

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 126

126 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

integrate geometrical and non-geometrical information, which will form an essential part of knowledge-enriched visualisation of architectural heritage (Acierno et al., 2017; Quattrini, Pierdicca and Morbidoni, 2017). The 3D models of historical buildings can also be associated with geographical locations, often using GIS, to aid further analysis and communication (Albourae, Armenakis and Kyan, 2017; Dore and Murphy, 2012). Thus, these new digital cultural heritage models become integrated with added information and intelligence, parallel to the advancements enabled by BIM development (Murphy et al., 2017). An integrated digital system for architectural modelling, data preservation and collaboration requires the exchange of information between the various applications involved in a building project lifecycle including design tools, analysis tools, facilities management tools and document management systems (DMSs). The collaboration features are similar to those in a DMS, which are normally limited to supporting 2D drawings and documents. The new systems on the other hand provide a platform for the integration and exchange of 3D models with embedded intelligence. A different approach to modelling is also required for the collaborative settings, because multiple parties contribute to a centralised model. Further new roles and different relationships within the project teams should be considered following the emergence and adoption of the new approach. Various cultural and perception issues need to be appropriately addressed for existing stakeholders. For example, users with CAD backgrounds, such as designers, seek to support integrated methods of visualisation and navigation that are comparable with the previous applications with which they are familiar. By contrast, users with DMS backgrounds, such as contractors and project managers, expect visualisation and navigation to be the important features that are missing in existing DMS solutions (Singh, Gu and Wang, 2011). As BIM matures, the existing CAD packages and DMS are integrated into a single product. An effective BIM approach should not only have the technological capability to support the collaboration requirements of diverse user groups, but also provide adequate support features to assist users in assessing, designing and implementing BIM in their projects (Singh, Gu and Wang, 2011). In common practice, collaboration is often established in local

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 127

LEE AND GU HISTORICAL BIM+ 127

repositories and is typically accessible through proprietary standalone desktop software. However, information management needs close cooperation of the entire design team, including the architects, engineers, manufacturers, contractors and clients at all project stages. Thus, the scope of collaboration is limited to a single project in isolation and to ad-hoc decentralised and traditional forms of communication such as e-mail, paper printouts or other traditional channels of information exchange (El-Diraby, Krijnen and Papagelis, 2017). Valuable, simultaneous knowledge exchange between various members and sometimes various teams is a core issue in effective BIM implementation, which requires collaborative and collective participation and contribution from all stakeholders in a building project lifecycle. BIM and its key aspects ISO standard 29481-1 defines BIM as an approach that uses ‘a shared digital representation of a built object (including buildings, bridges, roads, process plants, etc.) to facilitate design, construction and operation processes to form a reliable basis for decisions’. It also stands for the shared digital representation of the physical and functional characteristics of any construction works (ISO, 2016). BIM manages the diverse sets of information and provides a digital platform for describing and displaying information required in the planning, design, construction and operation of constructed facilities (ISO, 2016). BIM integrates the lifecycle of the built environment into a common digital information environment and represents real buildings virtually as semantically enriched, consistent, digital building models (Biagini et al., 2016; Eastman et al., 2011; Tang et al., 2010). BIM is regarded as an objectoriented CAD system enabling parametric representation of building components (Cerovsek, 2011; Volk, Stengel and Schultmann, 2014). Objects can have geometric or non-geometric attributes with functional, semantic or topological information (Eastman et al., 2011). Functional attributes may include installation durations or costs and semantic information dealing with connectivity, aggregation, containment or intersection information. Topological attributes store information regarding objects’ locations, adjacency, coplanarity (the status of being in the same plane) or perpendicularity (Biagini et al., 2016). Thus,

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 128

128 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

commercial BIM platforms often provide integrated data management, component libraries and general functionalities (e.g. visualisation and predictive analyses) (Eastman et al., 2011). Key BIM players and associates in the architecture, engineering and construction (AEC) industry include architects, engineers, design consultants, contractors, project managers, facility managers, delegates from government agencies and information technology service providers (Gu, Singh and London, 2015). The perception and expectation of BIM vary across disciplines (Singh, Gu and Wang, 2011; Gu and London, 2010). For example, BIM is often perceived as an extension to CAD for the design disciplines, while non-design disciplines such as contractors and project managers often regard BIM as an intelligent DMS that can quickly extract data from CAD packages directly. BIM application vendors and service providers often aim to integrate the two requirements (Gu, Singh and London, 2015). Since integrated model development needs further collaboration and communication across disciplines and these disciplines often have different perceptions and expectations, a concurrent engineering approach to model development is often needed where multiple parties contribute simultaneously to the shared BIM model (Gu, Singh and London, 2015). Different business models are also needed to suit varying implementation needs, for example, a BIM model can be maintained in-house or outsourced to service providers, which will have different capacities and different financial implications. Looking into the future, BIM calls for new roles and team dynamics within contemporary building and construction projects. An examination of the existing workflow and resourcing capabilities can help the project determine whether these would be internally or externally resourced roles. Organisations need to develop strategies and business models that suit the requirements and practices of different players in the industry, contingent on the capabilities of the firms they work with (Singh, Gu and Wang, 2011). Key BIM participants are often concerned with a lack of training on and awareness of BIM applications (Gu and London, 2010). Therefore, appropriate training and information for various team members are important to enable them to contribute to and participate in the changing work environment. Overall, considering these complexities and varying perception and readiness, a collective and integrated

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 129

LEE AND GU HISTORICAL BIM+ 129

approach to managing the complex interdependencies across key BIM players is required to facilitate BIM adoption effectively and to maximise the impact of BIM. Emerging BIM technologies for enhanced preservation and collaboration Ontology and the semantic web BIM defines and applies intelligent relations between elements in a digital building model, which can include both geometric and nongeometric data (Singh, Gu and Wang, 2011). The main function of BIM is to facilitate building information management and exchange through this consistent relational model between different stakeholders across the entire project lifecycle. Information technologies have enabled specialisation that allows for incorporating increased volumes and diversity of knowledge into the processes (Turk 2016). Gruber (1995) defines ontology as a formal and explicit specification of a particular conceptualisation. In philosophy, an ontology is a systematic account of existence. The ontology in construction informatics defines the field, maps its structure and provides a system to organise the related knowledge. Core themes of ontologies are to create and represent knowledge related to information-processing activities, communication and co-ordination activities or aspects about common infrastructures (Turk, 2006). Aksamija and Grobler (2007) develop an ontology describing the principles for building design and relations among many different factors and variables in a complex building design process. The ontology facilitates communication between people and interoperability between systems (Succar, 2009). As the current dominant technological approach to building project management, BIM has adopted and developed a range of ontologies for formalising the relevant processes and practices. Li et al. (2017) point out that the applications of ontology to BIM have addressed many construction problems such as cost estimation, defect management, construction safety and quantity take-off (a detailed measurement of materials and labour in a building project). For example, Zhang, Boukamp and Teizer (2015) introduce an ontology-based semantic model to organise, store and reuse construction safety knowledge.

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 130

130 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

Ontologies provide abstract representations of the domain knowledge that is machine-readable and suitable for computation (Aksamija and Grobler, 2007). In addition, semantic interoperability implies a shared understanding of the information (Gehre et al., 2017). The data exchange and information sharing in BIM can be perceived as a centralised approach for exchanging files through common services (Mignard and Nicolle, 2014). Mignard and Nicolle (2014) further suggest the concept of semantic BIM that uses ontologies to manage digital models. The ontology of construction informatics is applied to map its structure (Turk, 2006). Semantic BIM organises the related knowledge developed for the building lifecycle at each step of this structure (Mignard and Nicolle, 2014). More recently, Quattrini, Pierdicca and Morbidoni (2017) introduce a web application for demonstrating BIM data exploration using the semantic web, which allows us to exploit both 3D visualisation and the embedded structured metadata. These data are organised by different typologies, e.g. 3D models and their components, digital work sheets, detail drawings, multimedia contents such as PDF documents, videos or images, and web links. Extending these concepts to HBIM, the knowledge-based data enrichment for HBIM enables different users to query a digital repository composed of structured and semantically rich heritage data about the building. These works also show the evolution of BIM and HBIM management from geometric representations to web 3D objects, empowered by the structures and logics of semantic content modelling. In this way, the application of ontology and the semantic web to HBIM can be developed to allow various data to be shared and reused across applications and industries. An open environment for preservation and collaboration needs explicit ontologies with a standard mechanism to access applications (agents), however. Explicit ontologies can be considered as ‘a referring knowledge’ (Koumoutsos et al., 2006). Geomatics BIM creates, manages and shares the building lifecycle data ‘locally’, largely focusing on the relations between different objects of the digital model and different stakeholders during the project lifecycle. GIS on the other hand deals with storing, managing and analysing data

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 131

LEE AND GU HISTORICAL BIM+ 131

‘globally’, describing them spatially in the urban context. The integration of the BIM models with GIS systems can further enhance BIM and HBIM generating even richer data sets of non-architectural information. Ma and Ren (2017) present an application of integrating BIM and GIS for smart city studies, incorporating data sets about specific building facilities and the surrounding urban context. Applications of 3D GIS are now seen in areas of 3D cadaster (spatial database and volumetric representation of 3D property registries), urban planning, disaster management, noise mapping as well as cultural heritage (Dore and Murphy, 2012). For example, the information stored in HBIM can now be connected to specific geographical locations to reference and analyse the heritage data spatially. The integration of BIM and GIS not only supports new building projects regarding purposes such as supply chain management or construction scheduling, but also assists retrofitting preparation and decision making especially relevant to existing buildings. Göçer, Hua and Göçer (2016) introduce a pre-retrofit framework that obtains and integrates multiple forms of building data efficiently to identify existing problems and corresponding solutions. Albourae, Armenakis and Kyan (2017) combine GIS and HBIM for the visualisation, management and analysis of increasingly heterogeneous collections of heritage data. Such integration provides annotated vector and raster layers of information in the georeferenced models, as well as effective ongoing documentation of cultural heritage artefacts through semantic data (Eide et al., 2008; Soler, Melero and Luzón, 2017). For example, the Archive Mapper for Archaeology (AMA) project allows users to import XML data models of their existing archeological datasets and map the cultural heritage data to the ontology schema of the Conceptual Reference Model of the International Committee for Documentation (CIDOC CRM) (Eide et al., 2008). The CIDOC CRM guidelines provide ‘definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation’ (CIDOC CRM, 2018) (see also ISO 21127:2006). These definitions of relationships among already mapped elements supports the enrichment of their semantic meaning. Apollonio et al. (2011) introduce a web-based system using open source software to construct, manage and visualise archeological excavations and the heritage of

Kramer-Smyth final proof 19 December 19/12/2018 13:24 Page 132

132 PART 2 THE PHYSICAL WORLD: OBJECTS, ART AND ARCHITECTURE

Pompeii as well as a Palladio 3D geodatabase. Manferdini and Remondino (2010) also present an open source 3D web-based tool, which can semantically segment complex reality-based 3D models, annotate information and share the results for cultural heritages such as Claude-Nicolas Ledoux’s architecture and the Angkor Wat temple. Thus, the integration has significant potentials for digital heritage modelling, management and analysis. In parallel, with the advancement of mobile computing, a wide range of location-based applications has been developed and adopted for different spatial and historic contexts and purposes (Lee and Kim, 2011). These locationaware technologies are based on the Global Positioning System (GPS) or the Wi-fi Positioning System (WPS). For example, the Wikitude world browser, introduced in 2008, can project the location-specific information on such landmarks. In addition, many recent Social Network Service (SNS) applications also consider multi-user scenarios. They open up tremendous opportunities to explore opportunities such as the personalisation of contents, interactivity, navigation and collaboration for BIM and HBIM. Virtual reality and augmented reality The application of virtual reality and augmented reality to BIM is an emerging research topic, which aims to improve the way building information is accessed in order to enhance project collaboration and productivity (Wang et al., 2014). Milgram and Kishino (1994) use the reality-virtuality continuum to define the series of interrelated virtual and augmented reality concepts along the continuum. Virtual reality has had a long association with the domains of architecture and construction for visualisation and simulation purposes, while augmented reality is an interesting feature of mixed reality by inserting virtual information into the physical scenes of reality. A virtual-realitybased BIM platform can allow us to experience different scenarios being proposed and simulated on real-life projects effectively (Goulding, Rahimian and Wang, 2014). Napolitano, Scherer and Glisic (2017) extend similar concepts to provide virtual tours enhanced with informational modelling. In HBIM, Murphy et al. (2017) develop a web-based platform to create and disseminate conservation documentation, combining virtual reality and a game engine. In their HBIM

approach, the game engine is used as an interactive repository of heritage data as well as their associated knowledge and information. The application of virtual and augmented reality to BIM and HBIM allows for real-time interaction and registration in the digital world across the boundaries between the real and the virtual. These mixedreality platforms can support further interactive relations between persons, objects and locations, when combining with or empowered by emerging technologies introduced above, such as mobile or ubiquitous computing, or location-aware technologies. Although these fast innovations improve users’ experience and communication, these technologies are still experimental and may not be cost-effective at this moment. Mobile or cloud computing As reviewed earlier, many mobile applications are operating through location-based services and SNS. When presented through mobile devices such as a smart phone, mixed-reality platforms such as augmented reality are often used as the interface for assessing the location-specific digital information directly from different physical locations. For example, Keil et al. (2011) develop a mobile augmented reality application that visualises the compelling history about a cultural heritage site directly from its physical location. The combination of mobile computing and augmented reality can also create ubiquitous workspaces that enable borderless interactions between the digital and physical worlds. Kim, Lee and Gu (2011) further suggest extending BIM with mobile computing and augmented reality for intelligent construction management in the AEC industry. Participants in BIM-based construction projects often need continuous on-demand access to information where awareness of the participants’ and their collaborators’ context becomes an important factor to be considered during project collaboration. As a result, mobile augmented reality applications providing access to location-specific and context-aware information can be especially useful for supporting communication and collaboration in BIM. To further enhance the user experiences, these mobile applications can be associated with ubiquitous computing, which provides interactive interfaces, a coupling of bits and atoms, as well as ambient

media in which computational power is seamlessly integrated into the everyday objects and environments (Ullmer and Ishii, 2000). For example, large amounts of smart devices, intelligent transportation systems and home network appliances in smart cities are connected and exchange their data – the Internet of Things. Together they extend human computer interaction (HCI) beyond the desktop, creating instead physical spaces with embodied computer interactions available on the move. These embodied interactions can have meanings and values with social relevance to people in a specific location. Although these applications have great potential, they require intense computation to support effective real-time data processing such as for image processing, natural language processing etc. Mobile cloud computing therefore becomes necessary, focusing on a wide range of potential cloud applications and more specifically power mobile computing (Fernando, Loke and Rahayu, 2013). Armbrust et al. (2009) define cloud computing as ‘both the applications delivered as services over the internet and the hardware and system software in the datacenters that provide those services’. The key strengths of cloud computing are the cloud service providers including Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) (Carolan and Gaede, 2009). Extending the generic cloud computing concepts, mobile cloud computing runs an application on a remote resource-rich server. A mobile device acts like a thin client connecting to the remote server. Its rich and seamless functionality empowers users, regardless of the resource limitations of the mobile devices (Fernando, Loke and Rahayu, 2013). Historical BIM+ Traditional HBIM does not always consider the full BIM lifecycle (Logothetis, Delinasiou and Stylianidis, 2015). It is often limited to the maintenance stage of existing buildings, and sometimes also includes limited considerations for retrofit and deconstruction (Volk, Stengel and Schultmann, 2014). With the Construction Operations Building information exchange (COBie) standard, stakeholders can store building maintenance data in a structured way for facility documentation and management using BIM (Eastman et al., 2011). The standard defines a ‘level of detail’ or ‘level of development’ (LoD) for the

technical equipment (Volk, Stengel and Schultmann, 2014). By contrast, HBIM+ extends HBIM to address the full BIM lifecycle covering sharing, preservation and reuse of historical building data beyond the maintenance stage. HBIM+ places more emphasis than HBIM on information management throughout the entire project, considering more diverse needs of the historical buildings and a wider range of design and maintenance scenarios. For example, for a historical building that requires a long-term building operation plan with new intervention, HBIM+ can adopt ‘as-planned’ BIM creation for the new building addition or extension. This requires much stronger support for the interoperability and communication of the BIM model because of the potential range and number of stakeholders involved. These issues are not often being thoroughly considered in HBIM because of its narrow focus on maintenance. Turk (2016) argues that BIM has largely developed and been adopted for the process of creating information about new or future buildings. In most cases, information flow in a typical BIM lifecycle is linear, ending with the handover of the completed building to the client and the commencement of facilities management. On the other hand, historical buildings are existing buildings, some of which have significant heritage, cultural and social values. The information flow of HBIM+ is therefore more cyclical, with the project lifecycle spanning over a very long period or even being continuous. Figure 7.1 illustrates the HBIM+ knowledge framework. Information flow of building data

[Figure 7.1 An HBIM+ knowledge framework: lifecycle stages (data collection, information modelling, conceptual design, detailed design, analysis, documentation, fabrication, construction logistics, construction 4D/5D, maintenance, refurbishment and deconstruction) linked by 'as-planned' and 'as-built' information flows. Source: partly adapted from Syncronia (2011)]

in a HBIM+ lifecycle can be both ‘as-planned’ and ‘as-built’. Similar to BIM, HBIM+ is also a multidisciplinary collaboration platform, owing to the wide range of stakeholders involved. Because of the potential heritage, cultural and social values, the general public and the other interest groups such as archivists, historians and tourism organisations can also develop strong interests in the model, which is not necessarily the case in other standard building and construction projects. Through the involvement of these additional stakeholders and the subsequent knowledge translation in their respected disciplines, information gathered through the HBIM+ project lifecycle can create new values beyond the model and the project. As HBIM+ may involve very diverse users and disciplines, many of whom are outside the traditional building construction domains, a server approach is more feasible for optimising data interoperability. A BIM server as a collaboration platform maintains a more robust repository of the building data, for storing, checking, viewing, sharing and modifying the data. The BIM server supports the integration, translation and exchange of information between the various applications involved in a building project lifecycle including design tools, analysis tools, facilities management tools and a DMS (Singh, Gu and Wang, 2011). In a HBIM+ server, while a DMS can support the storage and exchange of 2D drawings and documents of the historical building data, the inherited generic BIM features provide the integration and exchange of 3D data such as models and simulations. In this way, HBIM+ facilitates the continuous creation, integration, and exchange of the data related to the historical building, consisting of geometrical and non-geometrical data within and beyond the building and construction disciplines. In BIM, domain knowledge is encoded in various forms such as product libraries, object properties, rules and constraints. In order to retrieve open and documented domain knowledge semantically, BIM uses ontologies that can provide semantic information to facilitate intelligent uses such as information integration and ontology reasoning (Chen and Luo, 2016). Karan and Irizarry (2015) use a query language to access and acquire the data in semantic web format. As a result, the BIM approach can be supported by a wide array of applications such as design and collaboration platforms, analysis tools and project management systems. However, semantic web technologies are still being

developed and there are very limited ontologies globally agreed on for use in the AEC industries. It is often time consuming for organisations to develop their own ontologies, which limits an open environment for preservation and collaboration. For HBIM and HBIM+, the types of applications are specifically tailored for historical buildings. This section focuses on the main characteristics of HBIM+ by drawing a comparison to BIM. The criteria for the comparison are based on the work of Gu, Singh and London (2015), which includes representation, documentation and information management, inbuilt intelligence, analysis and simulation, and collaboration and integration. They are described in Table 7.1. As discussed earlier, the main difference between HBIM+ and HBIM is that HBIM+ addresses the entire project lifecycle whereas HBIM only narrowly emphasises maintenance.

Table 7.1 The main characteristics of HBIM+ in relation to BIM

Representation
BIM:
• clearly represents the design intent for building and construction professionals
• aids design thinking and development
• provides common ground and a visual language for communication between members of the multidisciplinary team
• evolves from symbolic representation to virtualisation
HBIM+:
• clearly represents the design intent for building and construction professionals and beyond
• clearly represents the context – existing buildings
• aids design thinking and development
• provides common ground and a visual language for communication between members of the multidisciplinary team including non-building and construction professionals
• evolves from symbolic representation to virtualisation
• new media-rich representation engages with the non-building and construction professionals

Documentation and information management
BIM:
• has user assistance tools for managing complexity
• improves design decisions through analyses and simulation
• uses active knowledge-based design and management systems
HBIM+:
• has user assistance tools for managing increased complexity
• improves design decisions through analyses and simulation
• optimises heritage, cultural and social values through analyses and simulation
• uses active knowledge-based systems beyond design and management to consider heritage

Inbuilt intelligence, analysis and simulation
BIM:
• has user assistance tools for managing complexity
• improves design decisions through analyses and simulation
• uses active knowledge-based design and management systems
HBIM+:
• has user assistance tools for managing increased complexity
• improves design decisions through analyses and simulation
• optimises heritage, cultural and social values through analyses and simulation
• uses active knowledge-based systems beyond design and management to consider heritage

Collaboration and integration
BIM:
• multi-party collaboration and integration of project information
• data compatibility and consistency
• interconnectivity and interoperability of information flows
• multidisciplinary collaboration through exchanging data across different parties
HBIM+:
• multi-party collaboration and integration of project information in professional, social and community contexts
• increased data compatibility and consistency
• increased interconnectivity and interoperability of information flows
• multidisciplinary collaboration through exchanging data across different parties and interfaces in professional, social and community contexts
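
The role of ontologies and semantic queries described above can be made more tangible with a small sketch. The example below uses Python's rdflib to encode two building elements of a historical building and retrieve them with a SPARQL query; the namespace, class and property names are invented for illustration and do not correspond to ifcOWL, CIDOC CRM or any agreed AEC ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF

# A toy ontology fragment: the namespace and property names are invented for this
# illustration and are not drawn from any published schema.
EX = Namespace("http://example.org/hbim#")

g = Graph()
g.bind("ex", EX)

# Encode two building elements of a historical building with a heritage property.
g.add((EX.window_07, RDF.type, EX.Window))
g.add((EX.window_07, EX.constructionYear, Literal(1892)))
g.add((EX.window_07, EX.heritageSignificance, Literal("high")))
g.add((EX.door_03, RDF.type, EX.Door))
g.add((EX.door_03, EX.constructionYear, Literal(1968)))

# A SPARQL query retrieves domain knowledge semantically, e.g. all elements built
# before 1900 together with their recorded heritage significance (if any).
query = """
PREFIX ex: <http://example.org/hbim#>
SELECT ?element ?year ?significance WHERE {
    ?element ex:constructionYear ?year .
    OPTIONAL { ?element ex:heritageSignificance ?significance }
    FILTER (?year < 1900)
}
"""
for row in g.query(query):
    print(row.element, row.year, row.significance)
```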

As discussed above, emerging ontologies and technologies have played an important role in defining and supporting the advanced HBIM+ knowledge framework. With the evolution in these fields, the capabilities of HBIM+ and stakeholders’ collaboration and user experience would be further enhanced. For example, the general BIM platforms have evolved to adopt augmented and immersive environments, haptics interfaces, 3D printing and holographic imaging (Gu, Singh and London, 2015). They will continue to evolve with newly emerging opportunities such as big data, open source developments, crowdsourcing, social computing and cloud computing to advance the design, analysis, communication capabilities and information management supporting ‘as-planned’ and ‘as-built’ flows. HBIM+ has significant potential to engage a wide range of users for creating optimal cultural, social and economic values. Conclusion Evolving from BIM and HBIM, HBIM+ focuses specifically on sharing, preserving and reusing historical building data for the entire building

project lifecycle. Generic BIM has largely addressed new buildings, and traditional HBIM tends only to highlight the maintenance stage of historical buildings, while HBIM+ opens up to the entire lifecycle management of historical buildings and addresses multiple scenarios for different heritage purposes. HBIM+ advances BIM and HBIM to define and facilitate different needs and scenarios of historical buildings. It expands to the entire project lifecycle and establishes a formal framework to create, manage and exchange historical building data and their associated contexts and values more effectively. Future studies will apply this framework in a series of trials and case studies to test technical aspects of it as well as to survey stakeholder perceptions, which will provide empirical evidence for refining the framework and facilitate its wider adoption. References Acierno, M., Cursi, S., Simeone, D. and Fiorani, D. (2017) Architectural Heritage Knowledge Modelling: an ontology-based framework for conservation process, Journal of Cultural Heritage, 24, supplement C, 124–33, https://doi.org/10.1016/j.culher.2016.09.010. Aksamija, A. and Grobler, F. (2007) Architectural Ontology: development of machine-readable representations for building design drivers. In International Workshop on Computing in Civil Engineering 2007, 168–75. Albourae, A. T., Armenakis, C. and Kyan, M. (2017) Architectural Heritage Visualization Using Interactive Technologies, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W5:7-13, https://www.int-arch-photogramm-remote-sens-spatialinf-sci.net/XLII-2-W5/7/2017/isprs-archives-XLII-2-W5-7-2017.pdf. Apollonio, F. I., Benedetti, B., Gaiani, M. and Baldissini, S. (2011) Construction, Management and Visualization of 3D Models of Large Archeological and Architectural Sites for E-Heritage GIS Systems. In XXIIIrd International CIPA [Comité International de la Photogrammétrie Architecturale] Symposium, Prague. Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., Lee, G., Patterson, D. A., Rabkin, A., Stoica, I. and Zaharia, M. (2009) Above the Clouds: a Berkeley view of cloud computing. In Technical Report UCB/EECS-2009-28, University of California at Berkeley. Baik, A. (2017) From Point Cloud to Jeddah Heritage BIM Nasif Historical

House – case study, Digital Applications in Archaeology and Cultural Heritage, 4, supplement C, 1–18, https://doi.org/10.1016/j.daach.2017.02.001. Biagini, C., Capone, P., Donato, V. and Facchini, N. (2016) Towards the BIM Implementation for Historical Building Restoration Sites, Automation in Construction, 71 (1), 74–86, https://doi.org/10.1016/j.autcon.2016.03.003. Carolan, J. and Gaede, S. (2009) Introduction to Cloud Computing Architecture, white paper, Sun Microsystems. Cerovsek, T. (2011) A review and outlook for a ‘Building Information Model’ (BIM): a multi-standpoint framework for technological development, Advanced Engineering Informatics, 25 (2), 224–44, https://doi.org/10.1016/j.aei.2010.06.003. Chen, G. and Luo, Y. (2016) A BIM and Ontology-Based Intelligent Application Framework. In Advanced Information Management, Communicates, Electronic and Automation Control Conference, Institute of Electrical and Electronics Engineers, 494–7. Chiabrando, F., Lo Turco, M. and Rinaudo, F. (2017) Modeling the Decay in an HBIM Starting from 3D Point Clouds: a followed approach for cultural heritage knowledge, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W5:605-612. CIDOC CRM (2018) What is the CIDOC CRM?, www.cidoc-crm.org/. Denard, H. (2012) A New Introduction to the London Charter. In Bentkowska-Kafel, A., Baker, D. and Denard, H. (eds), Paradata and Transparency in Virtual Heritage, Ashgate, 57–71. Dore, C. and Murphy, M. (2012) Integration of Historic Building Information Modeling (HBIM) and 3D GIS for Recording and Managing Cultural Heritage Sites, paper read at 18th International Conference on Virtual Systems and Multimedia, Milan. Eastman, C., Teicholz, P., Sacks, R. and Liston, K. (2011) BIM Handbook: a guide to building information modeling for owners, managers, designers, engineers and contractors, 2nd edn, Wiley. Eide, Ø., Felicetti, A., Ore, C. E., D’Andrea, A. and Holmen, J. (2008) Encoding Cultural Heritage Information for the Semantic Web: procedures for data integration through CIDOC-CRM mapping. In Arnold, D., Van Gool, L., Niccolucci, F. and Pleti, D. (eds), Open Digital Cultural Heritage Systems Conference, EPOCH/3D-COFORM Publication, 47–53. El-Diraby, T., Krijnen, T. and Papagelis, M. (2017) BIM-Based Collaborative

Design and Socio-Technical Analytics of Green Buildings, Automation in Construction, 82, supplement C, 59–74, https://doi.org/10.1016/j.autcon.2017.06.004. Fernando, N., Loke, S. W. and Rahayu, W. (2013) Mobile Cloud Computing: a survey, Future Generation Computer Systems, 29 (1), 84–106, https://doi.org/10.1016/j.future.2012.05.023. Gehre, A., Katranuschkov, P., Stankovski, V. and Scherer, R. J. (2017) Towards Semantic Interoperability in Virtual Organisations, n.d., www.irbnet.de/daten/iconda/06079005573.pdf. Göçer, Ö., Hua, Y. and Göçer, K. (2016) A BIM-GIS Integrated Pre-Retrofit Model for Building Data Mapping, Building Simulation, 9 (5), 513–27, https://link.springer.com/article/10.1007/s12273-016-0293-4. Goulding, J. S., Rahimian, F. P. and Wang, X. (2014) Virtual Reality-Based Cloud BIM Platform for Integrated AEC Projects, Journal of Information Technology in Construction, 19, 308–25. Gruber, T. R. (1995) Toward Principles for the Design of Ontologies Used for Knowledge Sharing?, International Journal of Human-Computer Studies, 43 (5), 907–28, https://doi.org/10.1006/ijhc.1995.1081. Gu, N. and London, K. (2010) Understanding and Facilitating BIM Adoption in the AEC Industry, Automation in Construction, 19 (8), 988–99, https://doi.org/10.1016/j.autcon.2010.09.002. Gu, N., Singh, V. and London, K. (2015) BIM Ecosystem: the coevolution of products, processes, and people. In Kensek, K. M. and Noble, D. (eds), Building Information Modeling: BIM in current and future practice, John Wiley & Sons, 197–210. ISO 21127:2006 Information and Documentation: a reference ontology for the interchange of cultural heritage information, International Standards Organization. ISO 29481-1:2016 Building Information Models: information delivery manual, Part 1, Methodology and Format, International Standards Organization. Karan, E. P. and Irizarry, J. (2015) Extending BIM Interoperability to Preconstruction Operations Using Geospatial Analyses and Semantic Web Services, Automation in Construction, 53, 1–12, https://doi.org/10.1016/j.autcon.2015.02.012. Keil, J., Zollner, M., Becker, M., Wientapper, F., Engelke, T. and Wuest, H. (2011) The House of Olbrich: an augmented reality tour through architectural history, paper read at 2011 IEEE [Institute of Electrical and Electronics Engineers] International Symposium on Mixed and

Augmented Reality – arts, media, and humanities, 26–29 October. Kim, M. J., Lee, J. H. and Gu, N. (2011) Design Collaboration for Intelligent Construction Management in Mobile Augmented Reality. In Proceedings of the 28th International Symposium on Automation and Robotics in Construction, https://www.scimagojr.com/journalsearch.php?q= 21100205741&tip=sid&exact=no. Koumoutsos, G., Lampropoulos, K., Efthymiopoulos, N., Christakidis, A., Denazis, S. and Thramboulidis, K. (2006) An Intermediate Framework for Unifying and Automating Mobile Communication Systems. In Proceedings of the First IFIP TC6 international conference on Autonomic Networking, Paris, Springer-Verlag. Lee, J. H. and Kim, M. J. (2011) A Context Immersion of Mixed Reality at a New Stage. In Herr, C. M., Gu, N., Roudavski, S. and Schnabel, M. A. (eds), Circuit Bending, Breaking and Mending: proceedings of the 16th International Conference on Computer-Aided Architectural Design Research in Asia, Association for Computer-Aided Architectural Design Research in Asia. Li, X., Wu, P., Shen, G. Q., Wang, X. and Teng, Y. (2017) Mapping the Knowledge Domains of Building Information Modeling (BIM): a bibliometric approach, Automation in Construction, 84, supplement C, 195–206, https://doi.org/10.1016/j.autcon.2017.09.011. Logothetis, S., Delinasiou, A. and Stylianidis, E. (2015) Building Information Modelling for Cultural Heritage: a review, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, no. II-5/W3, 177–83, http://adsabs.harvard.edu/abs/2015ISPAn.II5..177L. López, F. J., Lerones, P. M., Llamas, J., Gómez-García-Bermejo, J. and Zalama, E. (2017) A Framework for Using Point Cloud Data of Heritage Buildings Toward Geometry Modeling in A BIM Context: a case study on Santa Maria La Real De Mave Church, International Journal of Architectural Heritage, 11 (7), 965–86, http://doi.org/10.1080/15583058.2017.1325541. Ma, Z. and Ren, Y. (2017) Integrated Application of BIM and GIS: an overview, Procedia Engineering, 196, supplement C, 1072–9, https://doi.org/10.1016/j.proeng.2017.08.064. Manferdini, A. M. and Remondino, F. (2010) Reality-Based 3D Modeling, Segmentation and Web-Based Visualization. In Ioannides, M., Fellner, D., Georgopoulos, A. and Hadjimitsis, D. G. (eds), Digital Heritage, proceedings of the Third International Euro-Mediterranean Conference,

Lemessos, Cyprus, 8–13 November, Springer, 110–24. Maurice, M., McGovern E. and Sara, P. (2009) Historic Building Information Modelling (HBIM), Structural Survey, 27 (4), 311–27, http://doi.org/10.1108/02630800910985108. Mignard, C. and Nicolle, C. (2014) Merging BIM and GIS Using Ontologies Application to Urban Facility Management in ACTIVe3D, Computers in Industry, 65 (9), 1276–90, https://doi.org/10.1016/j.compind.2014.07.008. Milgram, P. and Kishino, F. (1994) A Taxonomy of Mixed Reality Visual Displays, IEICE Transactions on Information and Systems, 77 (12), 1321–9. Murphy, M., Corns, A., Cahill, J., Eliashvili, K., Chenau, A., Pybus, C., Shaw, R., Devlin, G., Deevy, A. and Truong-Hong, L. (2017) Developing historic building information modelling guidelines and procedures for architectural heritage in Ireland, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, no. XLII-2/W5:539-546, http://doi.org/10.5194/isprs-archives-XLII-2-W5539-2017. Napolitano, R. K., Scherer, G. and Glisic, B. (2017) Virtual Tours and Informational Modeling for Conservation of Cultural Heritage Sites, Journal of Cultural Heritage, 29, 123–9, https://doi.org/10.1016/j.culher.2017.08.007. Quattrini, R., Pierdicca, R. and Morbidoni, C. (2017) Knowledge-based Data Enrichment for HBIM: exploring high-quality models using the semantic-web, Journal of Cultural Heritage, 28, supplement C, 129–39, https://doi.org/10.1016/j.culher.2017.05.004. Singh, V., Gu, N. and Wang, X. (2011) A Theoretical Framework of a BIMBased Multi-Disciplinary Collaboration Platform, Automation in Construction, 20 (2), 134–44, https://doi.org/10.1016/j.autcon.2010.09.011. Soler, F., Melero, F. J. and Luzón, M.-V. (2017) A Complete 3D Information System for Cultural Heritage Documentation, Journal of Cultural Heritage, 23, supplement C, 49–57, https://doi.org/10.1016/j.culher.2016.09.008. Succar, B. (2009) Building Information Modelling Framework: a research and delivery foundation for industry stakeholders, Automation in Construction, 18 (3), 357–75, https://doi.org/10.1016/j.autcon.2008.10.003. Syncronia (2011) Autodesk BIM Conference 2011, http://blog.syncronia.com/autodesk-bim-conference-2011/. Tang, P., Huber, D., Akinci, B., Lipman, R. and Lytle, A. (2010) Automatic reconstruction of As-Built Building Information Models from LaserScanned Point Clouds: a review of related techniques, Automation in

Construction, 19 (7), 829–43, https://doi.org/10.1016/j.autcon.2010.06.007. Turk, Ž. (2006) Construction Informatics: definition and ontology, Advanced Engineering Informatics, 20 (2), 187–99, https://doi.org/10.1016/j.aei.2005.10.002. Turk, Ž. (2016) Ten Questions Concerning Building Information Modelling, Building and Environment, 107, supplement C, 274–84, https://doi.org/10.1016/j.buildenv.2016.08.001. Ullmer, B. and Ishii, H. (2000) Emerging Frameworks for Tangible User Interfaces, IBM Systems Journal, 39 (3.4), 915–31, http://doi.org/10.1147/sj.393.0915. Volk, R., Stengel, J. and Schultmann, F. (2014) Building Information Modeling (BIM) for Existing Buildings: literature review and future needs, Automation in Construction, 38, supplement C, 109–27, https://doi.org/10.1016/j.autcon.2013.10.023. Wang, X., Truijens, M., Hou, L., Wang, Y. and Zhou, Y. (2014) Integrating Augmented Reality with Building Information Modeling: onsite construction process controlling for liquefied natural gas industry, Automation in Construction, 40, supplement C, 96–105, https://doi.org/10.1016/j.autcon.2013.12.003. Zhang, S., Boukamp, F. and Teizer, J. (2015) Ontology-Based Semantic Modeling of Construction Safety Knowledge: towards automated safety planning for job hazard analysis, Automation in Construction, 52, supplement C, 29–41, https://doi.org/10.1016/j.autcon.2015.02.005.

PART 3

Data and programming An argument could be made that everything in this book ties back to data and programming, but the chapters in Part 3 take a more focused look at the mechanics of data, visualisations and software development. As you work your way through them, keep an eye out for discussion of the benefits of standards, communication and careful planning. How can the strategies described by these authors be applied to your next digital preservation project? What groundwork might best be laid with digital record creators before or during record creation? What lies at the intersections of digital communication and human collaboration?

8 Preparing and releasing official statistical data Natalie Shlomo

Introduction In this chapter, we provide an overview of the preparation needed to release statistical data to researchers and the public. This involves protecting the confidentiality of data subjects as well as maintaining high-quality data. Our focus here will be on statistical disclosure limitation (SDL) methods used by statistical agencies and data custodians of official data sources. We also refer to a large body of work in the computer science literature for protecting the privacy of data subjects defined as differential privacy. We distinguish between confidentiality – as described in the statistical literature which refers to guarantees to respondents of surveys and censuses not to divulge their personal information that is shared with the statistical agency – and privacy, which refers to every data subject’s right not to share their personal information. The aim of SDL is to prevent sensitive information about individual respondents from being disclosed. SDL is becoming increasingly important owing to growing demands for accessible data provided by statistical agencies. The statistical agency has a legal obligation to maintain the confidentiality of statistical entities and in many countries there are codes of practice that must be strictly adhered to. In addition, statistical agencies have a moral and ethical obligation towards respondents who participate in surveys and censuses through

confidentiality guarantees presented prior to their participation. The key objective is to encourage public trust in official statistics production and hence ensure high response rates. The information released by statistical agencies can be divided into two major forms of statistical data: tabular data and microdata. Whereas tables have been commonly released by statistical agencies for decades, microdata released to researchers is a relatively new phenomenon. Many statistical agencies have provisions for releasing microdata from social surveys for research purposes, usually under special licence agreements and through secure data archives. Microdata from business surveys which are partially collected by census and have very sensitive data are typically not released. In order to preserve the confidentiality of respondents, statistical agencies must assess the disclosure risk in statistical data and, if required, choose appropriate SDL methods to apply to the data. Measuring disclosure risk involves assessing and evaluating numerically the risk of re-identifying statistical units. SDL methods perturb, modify or summarise the data in order to prevent re-identification by a potential attacker. Higher levels of protection through SDL methods however often negatively impact the quality of the data. The SDL decision problem involves finding the optimal balance between managing and minimising disclosure risk to tolerable risk thresholds depending on the mode for accessing the data and ensuring high utility and fit-for-purpose data where the data will remain useful for the purpose for which it was designed. With technological advancements and the increasing push by governments for open data, new forms of data dissemination are currently being explored by statistical agencies. This has changed the landscape of how disclosure risks are defined and typically involves more use of perturbative methods of SDL. In addition, the statistical community has begun to assess whether aspects of differential privacy which focus on the perturbation of outputs may provide solutions for SDL. This has led to collaborations with computer scientists. One such collaboration was the Data Linkage and Anonymisation Programme at the Isaac Newton Institute, University of Cambridge, from June to December 2016 (INI, 2016). This chapter starts with an overview of SDL methods that have been applied to traditional forms of statistical data, and then looks at how

disclosure risk and data utility are measured and quantified. New forms of data dissemination under consideration, their challenges and technological requirements are then examined and the chapter ends with a discussion of other related areas of research and future directions. Preparing statistical data for release In this section we focus on traditional forms of statistical data: tabular data and microdata. We define disclosure risk scenarios for each type of output, how disclosure risk is measured and quantified and some standard information loss measures to assess data utility. Tabular data Tabular data in the form of hard-copy tables has been the norm for releasing statistical data for decades and this remains true today. There are recently developed web-based applications to automate the extraction of certain portions of fixed tabular data on request, such as neighbourhood or crime statistics for a specific region. One example of this type of application is Nomis (https://www.nomisweb.co.uk/), a service provided by the Office for National Statistics in the UK to provide free access to detailed and up-to-date labour market statistics from official sources. When the data underlying the tables are based on survey microdata, the tables only include survey-weighted counts or averages. Survey weights are inflation factors assigned to each respondent in the microdata and refer to the number of individuals in the population represented by the respondent. The fact that only survey-weighted counts or averages are presented in the tables means that the underlying sample size is not known exactly and this provides an extra layer of protection in the tables. In addition, statistical agencies do not assume that ‘response knowledge’ (whether an individual responds to a survey or not) is in the public domain. Therefore there is little confidentiality protection needed in tabular data arising from survey microdata. Typical SDL approaches include coarsening the variables that define the tables, for example banded age groups and broad categories of ethnicity, and ensuring safe table design to avoid low or zero cell values in the table.
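
To make the last two ideas concrete, the minimal Python sketch below bands a detailed age variable into age groups (coarsening) and releases only survey-weighted counts for the cross-classification. The microdata, column names, weights and band boundaries are invented for illustration and are not drawn from any real survey.

```python
import pandas as pd

# Hypothetical survey microdata: one row per respondent, with a survey weight
# (the inflation factor described above) and two identifying variables.
microdata = pd.DataFrame({
    "age":       [23, 37, 41, 68, 77, 19, 52, 80],
    "ethnicity": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "weight":    [950, 1210, 870, 640, 705, 990, 811, 602],
})

# Coarsening: recode single years of age into banded age groups before tabulation.
bands = [16, 25, 45, 65, 75, 120]
labels = ["16-24", "25-44", "45-64", "65-74", "75+"]
microdata["age_group"] = pd.cut(microdata["age"], bins=bands, labels=labels, right=False)

# Release only survey-weighted counts: sum the weights within each cell of the
# cross-classification instead of publishing the raw sample counts.
weighted_table = (
    microdata.groupby(["ethnicity", "age_group"], observed=False)["weight"]
             .sum()
             .unstack(fill_value=0)
)
print(weighted_table)
```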

However, business statistics arising from business surveys are more problematic since large businesses are sampled with certainty and hence 'response knowledge' is assumed. Therefore, SDL methods for protecting tables generated from business surveys are very different. We will not continue the discussion of disseminating business statistics as this is outside the scope of this chapter. For more details about SDL for tables arising from business statistics see Willenborg and De Waal (2001), Duncan, Elliot and Salazar-Gonzalez (2011) and Hundepool et al. (2012). The biggest challenge when focusing on tabular data arising from microdata is when the microdata contain whole population counts, such as from a census or a register. In this case, there is no extra layer of protection afforded by the sampling. There has been much focus in the statistical literature on SDL methods for census data. See Shlomo (2007) and references therein for an overview, which we briefly describe here. The disclosure risk in a census context arises from small values in a table – 1s and 2s – since these can lead to identity disclosure and allow re-identification. However, the main disclosure risk is attribute disclosure, where an individual can be identified on the basis of some of the variables defining the table and a new attribute may be revealed about the individual. For tabular data, this means that there are only one or two populated cells in a given column or row but the rest of the values are zeros. Indeed, the amount and placement of the zeros in the table determines whether new information can be learned about an individual or a group of individuals. Table 8.1 presents an example census table where we discuss the disclosure risks.

Table 8.1 Example census table

Benefits               Age 16–24   Age 25–44   Age 45–64   Age 65–74   Age 75+
Benefits claimed           17           8           2           4         1
Benefits not claimed       23          35          20           0         0
Total                      40          43          22           4         1

In Table 8.1 we see evidence of small cell values which can lead to re-identification, for example the cell value of size 1 in the cell for ‘Age 75+’ and ‘Benefits claimed’. In addition, the cell value of size 2 in ‘Age 45–64’ and ‘Benefits claimed’ can lead to re-identification since it is

possible for an individual in the cell to subtract him or herself out and therefore re-identify the remaining individual. Moreover, we see evidence of attribute disclosure through the zeros in the cell values for the columns labelled 'Age 65–74' and 'Age 75+' under 'Benefits not claimed'. This means that we have learned that all individuals aged 65–74 are claiming benefits although we have not made a re-identification. However, for the single individual in the column labelled 'Age 75+' we have made an identification and also learned that the individual is claiming a benefit. Other disclosure risks are disclosure by linking tables, since parts of the microdata can be reconstructed if many tables are disseminated from the same data source, and disclosure by differencing, where two tables that are nested may be subtracted one from the other, resulting in a new table containing small cell values. Tables 8.2 and 8.3 show an example of this type of disclosure risk.

Table 8.2 Example census table of all elderly by whether they have a long-term illness

Health                 Age 65–74   Age 75–84   Age 85+   Total
No long-term illness       17           9          6        32
Long-term illness          23          35         20        78
Total                      40          44         26       110

Table 8.3 Example census table of all elderly living in households by whether they have a long-term illness

Health                 Age 65–74   Age 75–84   Age 85+   Total
No long-term illness       15           9          5        29
Long-term illness          22          33         19        74
Total                      37          42         24       103

We see that tables 8.2 and 8.3 are seemingly without disclosure risks on their own. However, Table 8.3 can be subtracted from Table 8.2 to obtain a very disclosive differenced table of elderly living in nursing homes and other institutionalised care with very small and zero cell values.
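
The arithmetic behind this differencing attack is easy to reproduce. The short sketch below, using the counts from Tables 8.2 and 8.3, recovers the table for the elderly population in communal establishments and flags the disclosive small cells (illustrative Python, not an agency tool).

```python
import numpy as np

# Internal cells of Table 8.2 (all elderly) and Table 8.3 (elderly in households);
# rows: no long-term illness / long-term illness; columns: 65-74, 75-84, 85+.
all_elderly = np.array([[17,  9,  6],
                        [23, 35, 20]])
in_households = np.array([[15,  9,  5],
                          [22, 33, 19]])

# Differencing the two nested tables yields elderly in nursing homes and other
# institutional care, a table that was never released directly.
institutional = all_elderly - in_households
print(institutional)
# [[2 0 1]
#  [1 2 1]]

# Every populated cell is a 1 or a 2, so the differenced table is highly disclosive.
small_cells = ((institutional > 0) & (institutional < 3)).sum()
print(f"{small_cells} of {institutional.size} cells are small (1 or 2)")
```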

Since census tables have traditionally been released in hard copy, these latter disclosure risks from linking and differencing tables were controlled by the statistics agency through strict definitions of the variables defining the tables and no release of nested tables differing in a single category. In the section ‘New forms of data dissemination’ we show how these types of disclosure risks have now become more relevant with new forms of data dissemination based on online flexible table generation servers. SDL methods for census tabular data should not only protect small cell values in the tables but also introduce ambiguity and uncertainty into the zero values to avoid attribute disclosure. Tables should be carefully designed with respect to the variables that define the tables and the definition of their categories. There are also general principles regarding population thresholds and the number of dimensions allowed in the table. The SDL methods typically implemented at statistical agencies include pre-tabular methods, post-tabular methods and combinations of both. Pre-tabular methods are implemented on the microdata before tables are compiled; these methods will be covered in the next section, ‘Microdata’. Post-tabular methods are implemented on the cell values of the tables after they are computed and typically take the form of rounding the cell values, either on the cells with small values only or on all cell values of the tables. Random rounding rounds the value of each cell according to a probability mechanism and internal cells and marginal cell values of the tables are rounded separately, resulting in rows or columns of the tables that are not additive – internal cell values do not sum to margins. Controlled rounding ensures that the sum of rounded internal cell values equals the rounded margin value which is a desired property by users of the data. However, controlled rounding is too limiting for the large scale production of census tables. Cell suppression where small cell values in tables are suppressed is also not used in a census context because of the need to ensure consistency that the same cells are suppressed across the many tables produced. A more general case of random rounding is random cell perturbation based on a probability mechanism, which was first carried out at the Australian Bureau of Statistics (ABS) and described in Fraser and Wooton (2005). Table 8.4 opposite demonstrates one such probability mechanism for small cell values of a table. The probabilities define the

Table 8.4 Example of a probability mechanism to perturb small cell values

Original value   Perturbed value
                 0       1       2       3       4
0                0.80    0.05    0.05    0.05    0.05
1                0.05    0.80    0.05    0.05    0.05
2                0.05    0.05    0.80    0.05    0.05
3                0.05    0.05    0.05    0.80    0.05
4                0.05    0.05    0.05    0.05    0.80

chance that a cell value having an original value of, say, 1 will be changed to a perturbed cell value of 2 (in this case the probability is 0.05). The probability of remaining a value of 1 is 0.80. Note that the sum of each row is equal to 1. Depending on a random draw, the value of the cell will change or not change according to the probabilities in the mechanism. In the example above, if the random draw is below 0.05 then the value of 1 will be perturbed to 0; if the random draw is between 0.05 and 0.85 then the value of 1 will remain a 1; if the random draw is between 0.85 and 0.90 then the value of 1 will be perturbed to 2; if the random draw is between 0.90 and 0.95 then the value of 1 will be perturbed to 3; and finally if the random draw is above 0.95 then the value of 1 will be perturbed to 4. Shlomo and Young (2008) modified the probability mechanism so that the internally perturbed cell values in a row or column will (approximately) sum to the perturbed cell value in the respective margin. This is done by a transformation of the probability mechanism so that the frequencies of the cell values from the original table are preserved on the perturbed table. Another feature of the ABS method for cell perturbation of census tables described in Fraser and Wooton (2005) is a 'same cell – same perturbation' approach. This means that any time a cell value is produced in a table as a result of counting the number of individuals having a particular row and column characteristic, this same cell will always receive the same perturbation. This reduces the chance of identifying the true cell value through multiple requests of the same table and averaging out the perturbations.
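
A minimal Python sketch of this kind of mechanism is given below. It draws a perturbed value for each small cell from the transition matrix in Table 8.4 and derives the cell's random draw from fixed per-record keys, so that the same set of records always produces the same draw, in the spirit of the 'same cell – same perturbation' approach just described. The record identifiers are invented and this is an illustration of the general idea only, not the ABS production algorithm.

```python
import numpy as np

rng = np.random.default_rng(20190101)

# Transition matrix from Table 8.4: row = original small cell value (0-4),
# column = perturbed value, entries = perturbation probabilities.
P = np.full((5, 5), 0.05)
np.fill_diagonal(P, 0.80)

def perturb(original_value, cell_draw, transition=P):
    """Perturb one small cell value using a uniform draw in [0, 1)."""
    cumulative = np.cumsum(transition[original_value])
    return int(np.searchsorted(cumulative, cell_draw, side="right"))

# 'Same cell - same perturbation': every record carries a fixed random key, and a
# cell's draw is the fractional part of the sum of the keys of its records, so the
# same set of records always produces the same draw in any requested table.
record_keys = {f"person_{i}": rng.random() for i in range(6)}

def cell_draw(record_ids):
    return sum(record_keys[r] for r in record_ids) % 1.0

cell_records = ["person_1", "person_4"]    # hypothetical cell containing two people
draw = cell_draw(cell_records)
print(perturb(len(cell_records), draw))    # perturbed count released for this cell
```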

We continue the example from Table 8.2 on the census table of all elderly by age group and long-term illness to demonstrate this approach. Suppose a new table is generated similar to Table 8.2 but with the added variable of gender, as shown in Table 8.5.

Table 8.5 Example census table of all elderly by gender and whether they have a long-term illness

Health                 Gender     Age 65–74   Age 75–84   Age 85+   Total
No long-term illness   Males           8           4          3        15
                       Females         9           5          3        17
Long-term illness      Males          10          15          6        31
                       Females        13          20         14        47
Total                                  40          44         26       110

We see that Table 8.2 and Table 8.5 have the same values on the column margins since the same individuals appear in these cells in both tables. Therefore, we need to ensure that the perturbation of these cell values will remain exactly the same even though they appear in different tables. The approach is as follows: each individual in the microdata is assigned a random number. Any time individuals are aggregated to form a cell in a table, their random numbers are also aggregated and this becomes the random draw for the perturbation. Therefore, individuals who are aggregated into same cells will always have the same random draw and therefore a consistent perturbation. Since the tabular data is based on whole population counts, disclosure risk measurement is straightforward as the disclosure risk can be observed. One disclosure risk measure is simply the percentage of the cell values containing a value of 1 or 2. In addition, the placement of zero cell values in the table and whether they appear in a single row or column of the table poses a considerable risk of attribute disclosure. These degenerate distributions in tables where rows or columns have mainly zero cell values with only a few nonzero cell values can be identified through disclosure risk measures defined in Shlomo, Antal and Elliot (2015). These measures assign a value between 0 and 1 for the level of risk caused by degenerate distributions where a value of 1 denotes the extreme case where the entire row or column have zero cell values and only one non-zero cell value. To assess the impact on data utility, information loss measures are generally based on distance metrics between original and perturbed cell values in a table. In addition, a key statistical analysis

for a table of counts is to be able to test for statistical correlations between the variables defining the table through a Chi-square test for independence. We therefore produce an information loss measure comparing test statistics on the original and perturbed tables to ensure that the power of such tests are not impacted by the perturbation and that there is no change in the outcome of statistical inference. Microdata Microdata from social surveys are released only after removing direct identifying variables, such as names, addresses and identity numbers. The main disclosure risk is attribute disclosure where small counts on cross-classified indirect identifying key variables can be used to identify an individual and confidential information may be learned from the remaining sensitive variables in the dataset. Since the prerequisite to attribute disclosure is identity disclosure, SDL methods typically aim to reduce the risk of re-identification. The indirect identifying key variables are those variables that are visible and traceable, and are accessible to the public and to potential intruders with malicious intent to disclose information. In addition, key variables are typically categorical variables, and may include sex, age, occupation, place of residence, country of birth and family structure. Sensitive variables are often continuous variables, such as income and expenditure, but can also be categorical. We define the ‘key’ as the set of combined cross-classified identifying key variables, typically presented as a contingency table containing the counts from the survey microdata. For example, if the identifying key variables are sex (two categories), age group (ten categories) and years of education (eight categories), the key will have 160 (=2x10x8) cells following the cross-classification of the key variables. The disclosure risk scenario of concern at statistical agencies is the ability of a potential intruder to match the released microdata to external sources containing the target population based on a common key. External sources can be either in the form of prior knowledge that the intruder might have about a specific population group, or by having access to public files containing information about the population, such as phone company listings, voter’s registration or even a national population register. Note that the statistical agency does not assume that an intruder would have ‘response knowledge’ on whether

an individual is included in the dataset or not and relies on this extra layer of protection in their SDL strategies. SDL techniques for microdata include perturbative methods, which alter the data, and non-perturbative methods, which limit the amount of information released in the microdata without actually altering the data. Examples of non-perturbative SDL techniques are: • coarsening and recoding, where values of variables are grouped into broader categories such as single years of age grouped into age groups • variable suppression, where variables such as low-level geographies are deleted from the dataset • sub-sampling, where a random sample is drawn from the original microdata. This latter approach is commonly used to produce research files from census microdata where a 1% sample is drawn from the census for use by the research community. Statistical agencies typically use these methods to produce public-use files, which are placed in secure data archives where registered users can gain access to the census sample data. As sampling provides a priori protection in the data there is less use of perturbative methods applied to microdata although there are some cases where this is carried out. The most common perturbative method used in microdata is record swapping. In this approach, two records having similar control variables are paired and the values of some variables – typically their geography variables – are swapped. For example, two individuals having the same sex, age and years of education will be paired and their geography variables interchanged. Record swapping is used in the USA and the UK on census microdata as a pre-tabular method before tabulation. A more general method is the post-randomisation probability mechanism (PRAM) where categories of variables are changed or not changed according to a prescribed probability mechanism similar to that described for table cell values in the section ‘Tabular data’, above (see also Gouweleeuw et al., 1998). A common perturbative method for survey microdata is top-coding, where all values above a certain threshold receive the value of the threshold itself. For example, any individual in survey microdata

earning above $10,000 a month will have their income amount replaced by $10,000. Another perturbative method is adding random noise to continuous variables, such as income or expenditure. For example, random noise is generated from a normal distribution with a mean of zero and a small variance for each individual in the dataset and this random noise is added to the individual’s value of the variable. Disclosure risk for survey microdata is typically measured and quantified by estimating a probability of re-identification. This probability is based on the notion of uniqueness on the key where a cell value in the cross-classified identifying variables may have a value of 1. Since survey microdata is based on a sample, a cell value of 1 is only problematic if there is also a cell value of 1 in the whole population. In other words, we need to know if the sample unique in the key is also a population unique or if it is an artefact of the sampling. Therefore, the disclosure risk measure of re-identification can take two forms: the number of sample uniques that are also population uniques, and the overall expected number of correct matches for each sample unique if we were to link the sample uniques to a population. When the population from which the sample is drawn is unknown, we can use probabilistic modelling to estimate the population parameters and obtain estimates for these disclosure risk measures. For more information on the estimation of disclosure risk measures using advanced statistical modelling, see Skinner and Holmes, 1998; Elamir and Skinner, 2006; Skinner and Shlomo, 2008 and Shlomo and Skinner, 2010. The application of SDL measures to prevent the disclosure of sensitive data leads to a loss of information. It is therefore important to develop quantitative information loss measures in order to assess whether the resulting confidentialised survey microdata is fit for purpose according to some predefined user specifications. Obviously, measures that cause information loss should be minimised in order to ensure high utility of the data. Information loss measures assess the impact on statistical inference: the effects on bias and variance of point estimates, distortions to distributions and correlations, effects on goodness of fit criteria such as the R2 for regression modelling, and so on. For example, some SDL methods, such as adding random noise to a variable such as income where the random noise is generated from a normal distribution with a mean of zero and a small variance, will

not impact on the point estimate of total or average income, but will increase the variance and cause a wider confidence interval. Whereas information loss generally refers to measuring the extent to which a user can replicate an analysis on the perturbed data and obtain the same inference as on the original data, we can also use information loss measures as a means of comparing different SDL methods. The disclosure risk and information loss measures can be used to produce a ‘disclosure risk-data utility’ confidentiality map (Duncan and Lambert, 1989). The SDL method that has a disclosure risk below the tolerable risk threshold as determined by the statistical agency and having the highest data utility (lowest information loss) is selected for confidentialising the data before its release. As mentioned, survey microdata is typically released into data archives and repositories where registered users can gain access to the data following confidentiality agreements. Almost every country has a data archive, which researchers can use to gain access to microdata. One example is the UK Data Service (https://www.ukdataservice. ac.uk/), which is responsible for the dissemination of microdata from many of the UK surveys. See the website for more information. There are other international data archives and repositories and these do a particularly good job of archiving and making data available for international comparisons. One example is the archive of the Integrated Public Use Microdata Series (IPUMS; https://www.ipums. org/) located at the University of Minnesota, which provides census and survey data from collaborations with 105 statistical agencies, national archives and genealogical organisations around the world. The staff ensure that the datasets are integrated across time and space with common formatting, harmonising variable codes, archiving and documentation. See the website for more information. Researchers may obtain more secure data that has undergone fewer SDL methods through special licensing and on-site data enclaves, although this generally involves a time-consuming application process and the need to travel to on-site facilities. Nevertheless, with more demand for open and easily accessible data, statistical agencies have been considering more up-to-date dissemination practices. This is the subject of the next section.
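
As a concrete illustration of the perturbative methods for microdata described above, the sketch below top-codes a hypothetical income variable and then adds mean-zero normal noise. The $10,000 threshold mirrors the example in the text; the income values and the noise standard deviation are invented for the example, and in practice they would be set as part of the agency's SDL strategy.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical monthly incomes (dollars) for survey respondents; not real microdata.
income = np.array([1800, 2500, 3100, 4200, 5600, 7400, 9800, 15200, 23000], dtype=float)

# Top-coding: every value above the threshold is replaced by the threshold itself.
top_coded = np.minimum(income, 10_000)

# Additive noise: mean-zero normal noise applied independently to each respondent.
noise = rng.normal(loc=0.0, scale=250.0, size=income.shape)
noisy = top_coded + noise

# Mean-zero noise leaves point estimates of totals and averages (approximately)
# unchanged, but in expectation it adds its own variance to the released variable,
# widening confidence intervals computed from the perturbed data.
print(round(top_coded.mean(), 1), round(noisy.mean(), 1))
```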

New forms of data dissemination This section focuses on new forms of data release that take advantage of web technologies and computing software to make statistical data more accessible. However, with the need to release more open data, there are implications on disclosure risk scenarios and how statistical data should be protected. In particular, SDL methods are relying more heavily on perturbative methods to account for increasing disclosure risks. Data enclaves and remote access As mentioned above, many statistical agencies have set up on-site data enclaves on their premises. These are largely motivated by the statistical agencies’ obligations to continue to meet demands for large amounts of data and still ensure the confidentiality of individuals. The on-site data enclave is a secure environment where researchers can gain access to confidential data. The secure servers have no connection to printers or the internet and only authorised users are allowed to access them. To minimise disclosure risk, no data is allowed to be removed from the enclave and researchers undergo training to understand the security rules. Researchers are generally provided with software embedded in the system. All information flow is controlled and monitored. Any outputs to be taken out of the data enclave are dropped in a folder and manually checked by experienced confidentiality officers for disclosure risks. The disadvantage of the on-site data enclave is the need to travel, sometimes long distances, to access confidential data. In very recent years, some agencies have been experimenting with evolving on-site data enclaves to permit remote access through ‘virtual’ data enclaves. These ‘virtual’ data enclaves can be set up at organisations, universities and data archives and allow users to access data off-site through virtual private network (VPN) connections and secure password systems. Users log on to secure servers to access the confidential data and all activity is logged and audited at the keystroke level. Similar to the on-site data enclaves, secure servers have embedded software, and data analysis can be performed within the server. Just as in on-site data enclaves, any required outputs are dropped into folders and reviewed


The technology also allows users within the same research group to interact with one another while working on the same dataset. An example of this technology is the Inter-university Consortium for Political and Social Research (ICPSR) housed at the University of Michigan. The ICPSR maintains access to data archives of social science data for research and operates both a physical on-site data enclave and a 'virtual' data enclave. See ICPSR (n.d.) for more information.

Web-based applications

There are two types of web-based applications currently being considered for disseminating statistical outputs: flexible table generators and remote analysis servers.

Flexible table-generating servers

Driven by demand from policy makers and researchers for specialised and tailored tables from statistical data (particularly census data), some statistical agencies are considering the development of flexible online table-generating servers that allow users to define and generate their own tables. Key examples are the ABS TableBuilder (www.abs.gov.au/websitedbs/censushome.nsf/home/tablebuilder) and the US American FactFinder (https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml). Users access the servers via the internet and define their own table of interest from a set of predefined variables and categories, typically from drop-down lists. Generally, a post-tabular SDL method is applied, which perturbs the cell values in the tables before release.

While online table generators carry the same types of disclosure risks as mentioned in the section 'Tabular data', above, the risks of disclosure by differencing and disclosure by linking tables become especially relevant, since there are no interventions or manual checks on what tables are produced or how many times tables are generated. For these types of online table systems, therefore, the statistical community has recognised the need for perturbative methods to protect against disclosure and has been engaging with the computer science literature on differential privacy (Dinur and Nissim, 2003; Dwork et al., 2006; Dwork and Roth, 2014).


Differential privacy is an output perturbation method that assumes a 'worst-case' scenario in which an intruder has knowledge of the full dataset except for one target individual. Differential privacy ensures that an intruder cannot make any inferences (up to a small degree) about the target individual. This 'worst-case' disclosure risk scenario encompasses all the types of disclosure risk defined so far (identity and attribute disclosure, and disclosure by differencing and linking tables), all of which are relevant scenarios for an online flexible table-generating server. The solution for guaranteeing privacy under differential privacy is to add random noise to the cell values in the generated table according to a prescribed probability distribution, similar to the random noise addition approach described in the section 'Microdata', above. The privacy guarantee is set a priori and is used to define the prescribed probability distribution. Research into using the differential privacy approach in an online flexible table-generating server is still ongoing and it has yet to be implemented. Nevertheless, it is useful to see the potential of privacy guarantees based on differential privacy in statistical applications. See Rinott et al. (2018) for more information on current research. One advantage of differential privacy is that it is grounded in cryptography, so the parameters of the perturbation are not secret and researchers can account for the additive noise perturbation in their statistical analysis.

Besides additive random noise on each of the cell values in the generated tables, there are other ways of reducing disclosure risk in an online flexible table-generating server, such as coarsening the categories of variables in the underlying microdata used to generate the tables, thus not allowing single individuals to be released into a requested table. In addition, the typical principles for table design mentioned in the section 'Tabular data', above, also apply here, such as thresholds on the percentage of small cells that can be released in a table, the percentage of zeros that can be released in a table and thresholds on the population sizes. If the requested table does not meet the standards, it is not released through the server and the user is advised to redesign the table.
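A simplified sketch of how such a server might protect a requested frequency table is given below: it first applies an illustrative table-design rule (refusing tables with too many small cells) and then adds Laplace noise to the cell counts. The threshold value and the privacy budget epsilon are hypothetical choices for illustration, not parameters used by any actual agency.

```python
import numpy as np

def release_table(cells, epsilon=0.5, small_cell_limit=0.2):
    """Apply simple release rules, then perturb a requested frequency table.

    `cells` holds the cell counts of the requested table. The small-cell
    threshold and the value of epsilon are illustrative only.
    """
    cells = np.asarray(cells, dtype=float)
    small_cells = np.sum((cells > 0) & (cells < 3))
    if small_cells / cells.size > small_cell_limit:
        return None  # table refused; the user is advised to redesign it

    # For a frequency table, adding or removing one individual changes the
    # cells by at most 1 in total, so Laplace noise with scale 1/epsilon gives
    # an epsilon-differentially private release; rounding and truncating at
    # zero are post-processing steps and do not weaken the guarantee.
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=cells.shape)
    return np.maximum(0, np.round(cells + noise)).astype(int).tolist()

print(release_table([120, 43, 5, 0, 78]))
```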


Remote analysis servers

An extension of the online flexible table-generating server is the remote analysis server, where users submit code for a statistical analysis of a dataset and in return receive a confidentialised output from the analysis, without the need for human intervention in the form of a manual check of the outputs for disclosure risks. As with flexible table-generating servers, the queries are submitted through a remote interface and researchers do not have direct access to the underlying data. The queries may include exploratory analysis, measures of association, regression models and statistical testing. The queries can be run on the original data or on confidentialised data, and may be restricted and audited depending on the level of protection required. O'Keefe and Good (2009) describe regression modelling via a remote analysis server.

Synthetic data

Confidential data is a fundamental product of virtually all statistical agency programs, leading to the publication of public-use products such as summary data and microdata from social surveys. Confidential data may also be used internally within data enclaves. In recent years, there has been a move to produce synthetic microdata as public-use files, which preserve some of the statistical properties of the original microdata. The data elements are replaced with synthetic values sampled from an appropriate probability model, the parameters of which are estimated from the original microdata. Several sets of the synthetic microdata may be released to account for the uncertainty of the model and to obtain correct inference (variance estimates and confidence intervals). See Raghunathan, Reiter and Rubin (2003) and Reiter (2005), and references therein, for more details of generating synthetic data.

Synthesis can also be applied to only parts of the data, so that a mixture of real and synthetic data is released (Little and Liu, 2003). One application which uses partially synthetic data is the US Census Bureau application OnTheMap (http://onthemap.ces.census.gov/). It is a web-based mapping and reporting application that shows where workers are employed and live according to the Origin-Destination Employment Statistics from the 2010 US Census. More information is given in Abowd and Vilhuber (2008).
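To make the idea of model-based synthesis concrete, the short sketch below fits a simple probability model to a single variable of an entirely invented microdata file and draws several synthetic copies from it. A real application would model the joint distribution of many variables, as in Raghunathan, Reiter and Rubin (2003); the variable, sample size and parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical original microdata: log-incomes of 1,000 survey respondents.
original_log_income = rng.normal(loc=10.2, scale=0.6, size=1000)

# Estimate the parameters of a simple model from the original data.
mu_hat = original_log_income.mean()
sigma_hat = original_log_income.std(ddof=1)

# Release several synthetic datasets drawn from the fitted model, so analysts
# can combine estimates across copies and account for model uncertainty.
synthetic_sets = [
    rng.normal(loc=mu_hat, scale=sigma_hat, size=original_log_income.size)
    for _ in range(5)
]

# An analyst's point estimate is the average of the estimates from each copy.
print(np.mean([s.mean() for s in synthetic_sets]))
```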


In practice it is very difficult to capture all conditional relationships between variables and within sub-populations. If the models used in a statistical analysis are sub-models of the model used to generate the data, then the analysis of multiple synthetic datasets should give valid inferences. In addition, partially synthetic datasets may still carry disclosure risks and need to be checked before dissemination.

For tabular data there are also techniques for developing synthetic tables, arising from business statistics. Controlled tabular adjustment (CTA) carries out an advanced cell suppression algorithm to identify highly disclosive cells in the tables and protects them by replacing them with synthetic values that preserve some of the statistical properties of the original table (Dandekar and Cox, 2002).

Conclusion

This chapter focused on the work and techniques involved in preparing statistical data before their release. The main goal of SDL is to protect the privacy of those whose information is in a dataset, while still maintaining the usefulness of the data itself. In recent years, statistical agencies have been restricting access to statistical data because of their inability to cope with the large demand for data while ensuring the confidentiality of statistical units. However, with government initiatives for open data, new ways to disseminate statistical data are being explored. This has led to the initial stages of research collaboration between the statistical community and the computer scientists who have developed formal definitions of disclosure risk and perturbative methods that can guarantee privacy. These methods come at a cost: researchers will have to cope with perturbed data when carrying out statistical analysis, which may require more training. In addition, statistical agencies need to release the parameters of the SDL methods so that researchers can account for the measurement error in their statistical analysis.

References


Abowd, J. M. and Vilhuber, L. (2008) How Protective Are Synthetic Data? In Domingo-Ferrer, J. and Saygin, Y. (eds), Privacy in Statistical Databases, Springer, 239–46.
Dandekar, R. A. and Cox, L. H. (2002) Synthetic Tabular Data: an alternative to complementary cell suppression, manuscript, Energy Information Administration, US Department of Energy.
Dinur, I. and Nissim, K. (2003) Revealing Information While Preserving Privacy, paper given at PODS [Principles of Database Systems] Conference, www.cse.psu.edu/~ads22/privacy598/papers/dn03.pdf.
Duncan, G. T., Elliot, M. and Salazar-Gonzalez, J. J. (2011) Statistical Confidentiality, Springer.
Duncan, G. and Lambert, D. (1989) The Risk of Disclosure for Microdata, Journal of Business and Economic Statistics, 7, 207–17.
Dwork, C. and Roth, A. (2014) The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science, 9, 211–407.
Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006) Calibrating Noise to Sensitivity in Private Data Analysis. In Proceedings of the Third IACR Theory of Cryptography Conference, International Association for Cryptologic Research, 265–84.
Elamir, E. and Skinner, C. J. (2006) Record-Level Measures of Disclosure Risk for Survey Micro-Data, Journal of Official Statistics, 22, 525–39.
Fraser, B. and Wooton, J. (2005) A Proposed Method for Confidentialising Tabular Output to Protect Against Differencing. In European Commission Statistical Office of the European Communities (Eurostat), Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva.
Gouweleeuw, J., Kooiman, P., Willenborg, L. and de Wolf, P. P. (1998) Post-randomisation for Statistical Disclosure Control: theory and implementation, Journal of Official Statistics, 14 (4), 463–78.
Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E. S., Spicer, K. and de Wolf, P. P. (2012) Statistical Disclosure Control, John Wiley & Sons.
ICPSR (n.d.) Data Enclaves, Inter-university Consortium for Political and Social Research, https://www.icpsr.umich.edu/icpsrweb/content/ICPSR/access/restricted/enclave.html.
INI (2016) Data Linkage and Anonymisation, Isaac Newton Institute for Mathematical Sciences, https://www.newton.ac.uk/event/dla.
Little, R. J. A. and Liu, F. (2003) Selective Multiple Imputation of Keys for Statistical Disclosure Control in Microdata, University of Michigan Department of Biostatistics Working Paper 6.


O'Keefe, C. M. and Good, N. M. (2009) Regression Output from a Remote Analysis System, Data and Knowledge Engineering, 68 (11), 1175–86.
Raghunathan, T. E., Reiter, J. and Rubin, D. (2003) Multiple Imputation for Statistical Disclosure Limitation, Journal of Official Statistics, 19 (1), 1–16.
Reiter, J. P. (2005) Releasing Multiply Imputed, Synthetic Public-Use Microdata: an illustration and empirical study, Journal of the Royal Statistical Society Series A, 168 (1), 185–205.
Rinott, Y., O'Keefe, C., Shlomo, N. and Skinner, C. (2018, forthcoming) Confidentiality and Differential Privacy in the Dissemination of Frequency Tables, Statistical Science.
Shlomo, N. (2007) Statistical Disclosure Control Methods for Census Frequency Tables, International Statistical Review, 75, 199–217.
Shlomo, N. and Skinner, C. J. (2010) Assessing the Protection Provided by Misclassification-Based Disclosure Limitation Methods for Survey Microdata, Annals of Applied Statistics, 4 (3), 1291–310.
Shlomo, N. and Young, C. (2008) Invariant Post-tabular Protection of Census Frequency Counts. In Domingo-Ferrer, J. and Saygin, Y. (eds), Privacy in Statistical Databases, Springer, 77–89.
Shlomo, N., Antal, L. and Elliot, M. (2015) Measuring Disclosure Risk and Data Utility for Flexible Table Generators, Journal of Official Statistics, 31, 305–24.
Skinner, C. J. and Holmes, D. (1998) Estimating the Re-identification Risk Per Record in Microdata, Journal of Official Statistics, 14, 361–72.
Skinner, C. J. and Shlomo, N. (2008) Assessing Identification Risk in Survey Microdata Using Log-Linear Models, Journal of the American Statistical Association, 103 (483), 989–1001.
Willenborg, L. and De Waal, T. (2001) Elements of Statistical Disclosure Limitation in Practice, Lecture Notes in Statistics, Springer-Verlag.


9 Sharing research data, data standards and improving opportunities for creating visualisations
Vetria Byrd

Introduction

Data visualisation is becoming an appreciated method for making sense of data, explaining complex information, and representing large amounts of information in a visual, concise manner. The data deluge has arrived, bringing with it grand expectations of new discoveries, better understandings, and improved ability to examine history and culture (Borgman, 2012). The problems that analysts face are becoming increasingly large and complex, not to mention uncertain, ill-defined and broadly scoped (Isenberg et al., 2011). Gone are the days of a single researcher painstakingly working through enormous amounts of data and reporting only a fraction of the insight it contains. Realistic problems often require broad expertise, diverse perspectives and a number of dedicated people to solve them (Isenberg et al., 2011).

The pressure to share data comes from many quarters: funding agencies, policy bodies such as national academies and research councils, journal publishers, educators, the public at large and individual researchers themselves (Borgman, 2012). The idea of sharing research data has been explored from the perspective of computer-supported cooperative work (Isenberg et al., 2011). This approach focused primarily on collaborative systems that enable visualisation tools to be developed that allow multiple users to interact with a visualisation application. Other approaches have focused on visualisation techniques and shared cognition (Yusoff and Salim, 2015) and on remote rendering of visualisations as a service (Paravati et al., 2011).


This chapter looks at the collaborative nature of sharing the underlying data that propels the system, rather than focusing on systems and services. It provides an overview of the visualisation process, and discusses the challenge of sharing research data and the ways data standards can increase opportunities for creating and sharing visualisations, while also increasing visualisation capacity building among researchers and scientists.

Data visualisation: an overview

Data visualisation is a non-trivial, iterative process that transforms data from a raw complex state into a visual representation for the purpose of gaining insight into what the data represents (Card et al., 1999). The process of visualising data entails seven stages (Fry, 2007):

• obtaining the data (acquire)
• providing some structure for the data's meaning, and ordering it into categories (parse)
• removing all but the data of interest (filter)
• applying methods from statistics or data mining to discern patterns (mine)
• choosing a basic visual model to represent the data (represent)
• improving the basic representation to make it clearer and more visually engaging (refine)
• adding methods for manipulating the data or controlling what features are visible (interact).

A short code sketch of these stages follows at the end of this overview. This non-trivial process enables insight which allows for the exploration and explanation of data (Fry, 2007), the analysis of data (Rogers et al., 2016), discoveries to be made (Segel and Heer, 2010), decision making (Craft et al., 2015; Daradkeh et al., 2013; Schneiderman et al., 2013) and storytelling (Kosara and Mackinlay, 2013; Segel and Heer, 2010). By making data more accessible and appealing, visual representations may also help engage more diverse audiences in exploration and analysis (Heer et al., 2010).
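The following minimal sketch walks through the acquire, parse, filter, mine and represent stages on a tiny invented dataset; the values, column names and chart choices are placeholders rather than part of any published example.

```python
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

# Acquire: a tiny, invented raw dataset standing in for downloaded data.
raw = "year,region,value\n2016,North,12\n2016,South,7\n2017,North,15\n2017,South,9\n"

# Parse: give the raw text structure and types.
df = pd.read_csv(StringIO(raw))

# Filter: keep only the data of interest.
north = df[df["region"] == "North"]

# Mine: a simple statistic to discern a pattern (year-on-year change).
print(north["value"].pct_change())

# Represent and refine: a basic, labelled line chart.
plt.plot(north["year"], north["value"], marker="o")
plt.xlabel("Year")
plt.ylabel("Value")
plt.title("North region over time")
plt.show()
```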


The challenge of sharing research data

Data created from research is a valuable resource that can be used and reused for future scientific and educational purposes (Van den Eynden et al., 2011). If the rewards of the data are to be reaped and science moved forward, researchers who produce data must share it, and do so in such a way that the data can be interpreted and reused by others (Borgman, 2012). Data visualisation, as a tool, is a mechanism to facilitate sharing. Although the need for sharing is evident, researchers and scientists who wish to publish their findings face the challenge of striking a delicate balance: sharing data while still advancing their own research, making discoveries and gaining recognition for their work and effort. Borgman (2012) discusses these challenges and examines four rationales for sharing data:

• to reproduce or verify research
• to make results of publicly funded research available to the public
• to enable others to ask new questions of extant data
• to advance the state of the arguments for sharing.

This chapter adds an additional rationale: to allow for and build on the visual representation of data to advance research and scientific discovery. The sharing of visualisations created from the data should be subject to the same level of scientific scrutiny, governance and quality control as the data shared under the rationales listed above. Without data, there would be nothing to represent visually, and the quality of the data shapes the quality of the visualisations.

Representation

The challenge of sharing research data has been declared an urgent problem for over 30 years (Borgman, 2012). As we move deeper into the era of digital transformation, there are many more sources of data of different types (Nambiar, 2017), and an increasing demand for ways to make sense of research data in a timely, efficient and accurate manner (Kum et al., 2011). Visualisation as a tool enables researchers to represent their data not only in basic views (in a table, or as bar, line or point graphs) but also visually, using various techniques to provide insight and demonstrate the implications of what the data represents.


Sharing data, and the steps taken to visualise it, could lead to analysis from different perspectives using the same data, and possibly to unforeseen discoveries. Each discipline and stakeholder potentially represents data in a way that reflects their contribution and addresses their research question, hypothesis or way of thinking (Avgerinou and Ericson, 1997). Representation involves taking decisions on how to represent data visually and could lead to further scholarly questions and answers, for example: can the data be represented differently to examine different aspects of it?

Visualisation capacity building

Each discipline defines visualisation, and the expected level of competency in visualisation, in its own terms. As the role of data visualisation continues to grow at all levels of scholarship, the literature shows that research on and use of data visualisation has evolved from understanding and usage to creation. The use of visualisation skills is often thought of as visual literacy. Broadly defined, visual literacy refers to the skills that enable an individual to understand and use visuals for intentional communication with others (Ausburn and Ausburn, 1978). With time and experience, the ability to identify data-appropriate tools and use them leads to competency and confidence. Having the capacity and ability to understand the data life cycle and to apply the data visualisation process to create visualisations that provide insight leads to visualisation capacity building (Byrd and Cottam, 2016). Visualisation capacity building encapsulates all the visual skills, including visual literacy, understanding and applying the visualisation process, the use of appropriate visualisation tools for different types of data, and the ability to draw and visually represent conclusions from insight gained in the process.

The demand for and creation of data is outpacing the ability to make sense of it. Managing and sharing data to enable reproducibility of research results involves data management, documentation, formatting, storage, and ethics and consent issues. There are mechanisms in place to facilitate data sharing. Funding agencies like the National Science Foundation and the National Institutes of Health require data management plans for proposals submitted for funding consideration.


Scholarly journals require authors of manuscripts to provide data documentation in supplemental documents, as well as documentation of the file formats used for images and data files. These mechanisms are in place to address the reproducibility of research, but they have the added benefit of inadvertently addressing representation, specifically the visual representation of the data. The next section describes how documenting the data life cycle, by documenting the visualisation process from ideation to creation, provides opportunities for creating visualisations.

Documenting the visualisation process: from ideation to creation and beyond

Noah Iliinsky (Iliinsky and Steele, 2011) identifies three general guidelines for creating strong visualisations: understand your data, understand what you want to show, and understand the format of your visualisation and its strengths and limitations. Creating a visualisation requires a number of nuanced judgements (Heer et al., 2010). One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape and colour (Heer et al., 2010). Data characteristics change as data moves through the data visualisation life cycle. To better understand what decisions are made and why, and what worked and what did not work (and why), documentation of, and access to, the data at every stage in the visualisation process (from ideation to creation) is needed. Table 9.1 below lists the data visualisation life cycle and what to document for each stage of visualisation. The table is not intended to be an exhaustive list of content to document, but is presented as a first step towards documenting the process of data visualisation.

A very important part of documenting the visualisation process, not included in Table 9.1, is documenting the types of tools used. Some 80–90% of the work in the visualisation process occurs in the parse, filter and mine stages. The entire visualisation process is non-trivial; however, these three steps (parse, filter and mine) shape and ultimately influence how the data is represented and what insight is gained.


Table 9.1 Stages of visualising data, the data visualisation life cycle and what to document
Source: adapted from Fry (2007)

Acquire
  Description: Obtain the data
  Data visualisation life cycle: Raw, unstructured data
  What to document: Data sources, data stakeholders, accessibility of the data

Parse
  Description: Change the data into a format that tags each part of the data with its intended use
  Data visualisation life cycle: Formatted data
  What to document: How the data is broken into individual parts and converted into usable format

Filter
  Description: Remove portions not relevant to the current task or interest
  Data visualisation life cycle: A subset of the original data set has been created
  What to document: What data has been removed, how the remaining data will serve the resulting visualisation

Mine
  Description: Apply methods from statistics or data mining as a way to discern patterns
  Data visualisation life cycle: Model building and evaluation, clean up the data (handle missing values, noise, etc.), resulting in a subset of the output from the previous stage
  What to document: Methods applied to discern patterns (Fayyad et al., 2002)

Represent
  Description: Choose models to provide a basic visual representation of the data
  Data visualisation life cycle: Data takes on a visual form
  What to document: Visual encodings used to map data values to graphical features (Heer et al., 2010)

Refine
  Description: Improve the basic representation to make it clearer
  Data visualisation life cycle: Different visual techniques might result in a different visual representation or layout of the data
  What to document: What improvements are made to enhance the visual representation

Interact
  Description: Add methods for manipulating the data
  Data visualisation life cycle: The visual representation of the data as the visualisation becomes more dynamic and changes according to user or system inputs
  What to document: What methods are used for manipulating the data

As Table 9.1 shows, parsing converts a raw stream of data into a structure that can be manipulated in software (Fry, 2007). In Visualizing Data: exploring and explaining data with the Processing environment, Ben Fry (2007) states, 'The chapter aims to give you a sense of how files are typically structured because more likely than not, the data you acquire will be poorly documented (if it's documented at all).' Documentation of data is crucial at this level.
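One lightweight way to keep such documentation is to record, for each stage of the life cycle, what was done, with which tool and why. The structure below is only an illustrative sketch of such a record, not an established standard, and all of the example entries are invented.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class StageRecord:
    """Documentation for one stage of the data visualisation life cycle."""
    stage: str        # acquire, parse, filter, mine, represent, refine or interact
    description: str  # what was done at this stage
    tool: str         # software or library used
    decisions: list = field(default_factory=list)   # why the choices were made
    recorded_on: date = field(default_factory=date.today)

# Invented example entries illustrating how a project log might grow.
log = [
    StageRecord("acquire", "Downloaded survey extract from the archive", "web browser",
                ["chose the 2017 wave because it matches the research question"]),
    StageRecord("filter", "Removed records with missing region codes", "pandas",
                ["0.4% of rows dropped; noted in the project notebook"]),
]
```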


Standardisation of data is equally crucial, but perhaps more challenging given the heterogeneous nature and formats of data. There are numerous books and papers on parsing, filtering and mining for different programming languages, data types and platforms; it is beyond the scope of this chapter to list them all. In fact, different intermediate tools might be used in each of these stages to prepare the data for use in a visualisation tool. Often Microsoft Excel is used for initial exploration of data, after which the data may be saved in a different format for another tool. Perhaps this is an oversimplification and a generalisation; however, it is important to realise that more often than not data is messy and complex and rarely comes ready for use with a visualisation tool; some housekeeping is required (a small illustration follows at the end of this section). This data cleanup activity is sometimes referred to as data mining or data wrangling. There is no universal tool for either, so the first visualisation created from the data will rarely be the last visual representation of the data. The data drives the selection of the tool or software used for visual representation. As data continues to grow in volume (scale of data), variety (diversity of data formats), velocity (rate of streaming data) and veracity (uncertainty of data), one thing is certain: it is crucial to represent data visually in a manner that informs without overwhelming.

Documentation of the visualisation process should include a full account of the visualisation tools used to create the final visual of the data, and of any other tools that were considered and perhaps explored, but not used to create the final visualisation. One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape and colour (Heer et al., 2010). The questions and answers will differ depending on the needs of the researcher and stakeholders. For any given data set the number of visual encodings – and thus the space of possible visualisation designs – is extremely large (Heer et al., 2010). There should be a record of how decisions were made. Why was the final tool selected? Were other tools considered? If so, why were those tools not chosen to create visualisations with the data at hand?
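As a deliberately simplified illustration of this kind of housekeeping, the snippet below tidies a small invented 'messy' extract and saves it in a neutral format for whatever visualisation tool comes next; the column names, values and output file name are all hypothetical.

```python
import pandas as pd

# A small, invented "messy" extract standing in for a spreadsheet export.
df = pd.DataFrame({
    " Response Date ": ["2018-03-01", "not recorded", "2018-03-02", "2018-03-02"],
    "Region": ["North", "South", "South", "South"],
    "Value": [12, 7, 9, 9],
})

# Typical housekeeping before any visualisation tool sees the data:
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # tidy headers
df = df.drop_duplicates()                                               # remove repeated rows
df["response_date"] = pd.to_datetime(df["response_date"], errors="coerce")
df = df.dropna(subset=["response_date"])                                # drop unparseable dates

# Save in a neutral format for the next tool in the chain.
df.to_csv("survey_clean.csv", index=False)
```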


Visualisations built on shared data

Years ago, before the big data revolution (Kitchin, 2014) and the age of 'open' movements – open source (Lerner and Tirole, 2001), open hardware (Pearce, 2014), open content, open education and open educational resources (Tom et al., 2008), open government (Janssen, 2012), open knowledge (García-Peñalvo, García de Figuerola and Merlo, 2010), open access (US Department of Energy, 2007), open science (David, 2001) and the open web (Petrie and Bussler, 2008) – the biggest hurdle to creating visualisations was finding and gaining access to data. Today, data is ubiquitous, voluminous, diverse and disparate. It is available in greater abundance than ever before; in some cases, data is even currency. Technology has enabled data sharing on a massive scale. This new wealth of data, combined with innovations in visualisation techniques, can enable 'chains' of visualisations, where the work to create one visualisation enables the creation of another.

One example that we can all relate to is the data used to create maps. As we evolved from cave drawings to putting writing instruments to paper (Friendly, 2008), we developed the 2D map – the visual foundation for the geo-inspired visualisations we enjoy and rely on today. Because maps are so often used in day-to-day modern life, they provide an easy-to-understand framework on which to build visual representations, and map-based visualisations present a low barrier to understanding. Relying on a basic knowledge of maps, often now in digital form, additional information (data) may be added to enhance the visual understanding of geographic locations. This can be seen in visualisations that show temperatures, elevation models, water levels, agriculture, power grids, network connectivity, social groups, you name it. With the advent of the Internet of Things, data increasingly comes with geocoordinates. Countless visualisations have been created where a map is the foundation of the visualisation. In his book The Visual Display of Quantitative Information, Edward Tufte describes the image, well known in the visualisation and statistical communities, of Charles Joseph Minard's flow map of Napoleon's march on Russia in 1812 as perhaps the best statistical graphic ever drawn (Tufte, 2006). The map in Figure 9.1, also available online (Wikimedia, 2018), shows not only geography but also time, temperature, the course and direction of the army's movement, and the number of troops along the paths.


Figure 9.1 Minard's depiction of Napoleon's march on Russia in 1812
Source: EdwardTufte.com (n.d.); Wikimedia (2018)

Maps showing weather data have become commonplace today, and weather data is a perfect example of shared data that creates a chain of visualisations. Weather data is constantly changing, which makes it fascinating historically; in real time the weather can be volatile and unpredictable, which makes the data valuable for building weather models and visual simulations that support decision-making processes. We often take for granted, or do not recognise, the chain of visualisations we see or engage with daily. In a 2017 article called 'How Data Visualization is the Future of Information Sharing', Ryan McCauley and Bob Graves describe multiple scenarios where 'layered data' is used in emergency management, city planning and sustainable infrastructure (McCauley and Graves, 2017).

One of the best examples of shared data enabling a 'chain' of visualisations in the research and scientific community is the Human Genome Project (Robert, 2001), an international, collaborative research programme whose goal was the complete mapping and understanding of all the genes of human beings (Bentley, 2000). The project produced enormous amounts of data, so much that biologists struggled to make sense of it all. It was during the time of the Human Genome Project that the field of bioinformatics emerged, along with the need for tools that would enable deep understanding and insight. Techniques were developed for visualising the genome and its annotations (Loraine and Helt, 2002). Genome browsers (Nicol et al., 2009; Searls, 2000) and tools (Helt et al., 1998; Jackson et al., 2016) were developed to explore and visualise the massive amounts of data the project generated.


The chain of visualisations that continues as a result of shared data and visualisations can be seen in scholarly science and nature publications, and in art and technology. These are just a few examples. Today data is so readily available on the internet that these types of visualisation chains occur daily. Udacity.com and FlowingData.com are two examples of websites where visualisations created from shared data are on display. There are many more.

Reusable data practices

Not all data are created with reuse in mind, but the managers of journals and projects are increasingly targeting their efforts in this area. Information on data management and on ways to make data reusable is available on the websites of many scholarly publishers, such as Elsevier®. In 2015 Springer® piloted a project on linked data in which metadata on conference proceedings would be made available free of charge. Elsevier® maintains a data repository (Elsevier Data Repositories) for potential authors. In the sciences, major steps have been taken to standardise and document data. These are some online databases and community data repositories that use and enforce data standards and documentation:

• Allen Brain Atlas (www.brain-map.org/)
• Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu/)
• Data Repositories (http://oad.simmons.edu/oadwiki/Data_repositories)
• European Human Brain Project (https://www.humanbrainproject.eu/en/)
• Human Connectome Project (www.humanconnectomeproject.org/)
• Human Genome Project (https://www.genome.gov/10001772/all-about-the-human-genome-project-hgp/)
• NASA Data Portal and APIs (https://www.nasa.gov/data)
• National Centers for Environmental Information (NOAA; https://www.ncdc.noaa.gov/)


• Scientific Data (https://www.nature.com/sdata/policies/repositories).

Missed opportunities and long-tail data

For every published paper, there are likely to be missed opportunities for sharing data, for any number of reasons, for example:

• page constraints, or, more likely, a lack of knowledge of how to represent the data visually in an effective manner. Often this data is cast aside, not deleted or discarded but looked on as the output of one of the intermediate stages of the final visualisation. What value exists in that data, even at the intermediate stage?
• data that isn't documented well enough to be used. There are two sides to this scenario: lack of documentation may impede further use of the data and reproducibility of results, or documentation of the process may be insufficient; for whatever reason, the person or system that captured or created the data lacks the capacity to transform it from a raw complex state to an insightful representation that could lead to knowledge building and the sharing of data visualisation.
• resistance to change in the representation of data. A community may have a historical preference for representing data in a certain way, and people are sometimes unyielding and unforgiving of those who try different ways of exploring data to find more insight.

Chan et al. (2014) estimated that more than 50% of scientific findings do not appear in the published literature but reside in file drawers and on personal hard drives. This so-called 'file-drawer phenomenon' dominates 'the long tail of dark data' (Heidorn, 2008), which Ferguson and colleagues define as small, granular data sets collected in the course of day-to-day research. They may include:

• small publishable units (for example targeted endpoints)
• alternative endpoints
• parametric data
• results from pilot studies
• metadata about published data


• unpublished data that includes results from failed experiments
• records that are viewed as ancillary to published studies (Ferguson et al., 2014, 1443).

Although this data may not be considered useful in the traditional sense, data-sharing efforts may illuminate important information and findings hidden in this long tail (Ferguson et al., 2014). In the context of data sharing, this chapter defines small publishable units as results from incremental steps in the research and data visualisation processes that have significant impact on an overall project. The results of small projects can contribute accumulated knowledge to a larger project (Heidorn, 2008). These smaller outcomes may prove to be of substantial significance and may be disseminated as publishable results that support a larger project. Alternative endpoints are outcomes that were not anticipated, but support a different outcome from the one originally expected, leading to an alternative point for project conclusion; alternative endpoints could either extend or shorten the project. Parametric data, as defined in the statistics literature, is data based on assumptions about its distribution (for example, the assumption that the data follows a normal distribution). Bruce and Bruce (2017) and Conover (1980) are resources for readers interested in learning more about parametric data.

Data-sharing best practices

Developing a tracking and reward system for data requires that researchers pay more attention to managing and sharing their data (Goodman et al., 2014). Ferguson and colleagues (2014, 1446) describe data-sharing best practices for long-tail data: the data needs to be discoverable, accessible, intelligible, assessable and usable. Alter and Gonzalez (2018) advocate greater recognition for data creators and for the authors of program code used in the management and analysis of data. This section describes two initiatives that address the need for better documentation of the data life cycle, which will enable continued use of the data and visualisations downstream: the Research Data Alliance and the Open Science Framework.


The Research Data Alliance

One organisation that focuses primarily on sharing research data is the Research Data Alliance (https://www.rd-alliance.org/), which was launched as a community-driven organisation in 2013 by the European Commission, the US National Science Foundation, the US National Institute of Standards and Technology, and the Australian Government's Department of Innovation, with the goal of building the social and technical infrastructure to enable open sharing of data. The website states:

The current global research data landscape is highly fragmented, by disciplines or by domains, from oceanography, life sciences and health, to agriculture, space and climate. When it comes to cross-disciplinary activities, the notions of 'building blocks' of common data infrastructures and building specific 'data bridges' are becoming accepted metaphors for approaching the data complexity and enable data sharing. The Research Data Alliance enables data to be shared across barriers through focused Working Groups and Interest Groups, formed of experts from around the world – from academia, industry and government.

According to the website, participation in the Research Data Alliance is open to anyone who agrees to its guiding principles of openness, consensus, balance, harmonisation and a community-driven, non-profit approach. The growing dependency on data is driving the need for standardisation, documentation and learning from the experiences of other domains. One important area in which data sharing leads to discovery, new knowledge and societal benefit is clinical trial data. Lo (2015) states, 'Given the large volume of data anticipated from the sharing of clinical trial data, the data must be in a computable form amenable to automated methods of search, analysis, and visualization.' This statement can be applied to all data-driven projects. Value will accrue from data sharing only if investigators can easily access results data and can query, align, compute on and visualise what will no doubt be a large amount of complex, heterogeneous data (Lo, 2015). It is beyond the scope of this chapter to discuss the challenges associated with data sharing, but the reader is encouraged to read Lo (2015), who addresses many of them.


The Open Science Framework

To facilitate data sharing, a sharable dataset is needed that consists of the data, metadata and intermediate data, documentation of the decisions made during the data visualisation process, and information about the tool used to create the visualisation. As an example, the Open Science Framework (https://osf.io/), developed by the Center for Open Science, makes transparent, in real time as research is being conducted, virtually all the metadata and data that are required to be shared (Foster and Deardorff, 2017; Lo, 2015). Another key feature of the Open Science Framework is the use of data-sharing plans. Similar to data management plans, data-sharing plans describe what specific types of data will be shared at various points and how to seek access to them (Lo, 2015). Data-sharing plans are required by many federal funders as well as by some foundations, private sponsors and industries. There are plans for managing and sharing data across disciplines and around the world.

Although this chapter highlights the need to share data, and to document the process of visualising data and the associated tools used to create the visualisations, it does not address the underlying challenges of sharing data. The reader is encouraged to review the works of Corti (2014), Ferguson et al. (2014) and Lo (2015) as a starting point for exploring them. For some online resources on managing data see Cousijn and Allagnat (2016), Elsevier (2018), McCauley and Graves (2017) and Springer (2015).

Conclusion

Sharing research data not only advances research but also allows for the reproducibility of visualisations created with the data. It allows data to be analysed from different perspectives and increases the possibility of building on previous research. One important benefit of sharing data, and the process of visualising that data, is that doing so improves visualisation capacity building among researchers and scientists. Today's interest in data science is fuelling the need for standards and documentation for data that continues to grow in size (volume) and type (variety). The variety of data makes standardisation both a challenge and a necessity. If we are to use the resources available for visualisation capacity building and sharing fully, a combination of data management and data-sharing plans that document the data management life cycle will help to improve opportunities for creating visualisations.


References

Alter, G. and Gonzalez, R. (2018) Responsible Practices for Sharing Data, American Psychologist, 73 (2), 146–56.
Ausburn, L. J. and Ausburn, F. B. (1978) Visual Literacy: background, theory and practice, Innovations in Education & Training International, 15 (4), 291–7.
Avgerinou, M. and Ericson, J. (1997) A Review of the Concept of Visual Literacy, British Journal of Educational Technology, 28 (4), 280–91.
Bentley, D. (2000) The Human Genome Project – an overview, Medicinal Research Reviews, 20 (3), 189–96.
Borgman, C. L. (2012) The Conundrum of Sharing Research Data, Journal of the American Society for Information Science and Technology, 63 (6), 1059–78.
Bruce, P. and Bruce, A. (2017) Data and Sampling Distributions. In Bruce, P. and Bruce, A. (eds), Practical Statistics for Data Scientists: 50 essential concepts, O'Reilly, 64–67.
Byrd, V. L. and Cottam, J. A. (2016) Facilitating Visualization Capacity Building. In Proceedings of the 9th International Symposium on Visual Information Communication and Interaction, 156–156, Dallas, TX, ACM.
Card, S. K., Mackinlay, J. D. and Shneiderman, B. (1999) Readings in Information Visualization: using vision to think, Morgan Kaufmann.
Chan, A.-W., Song, F., Vickers, A., Jefferson, T., Dickersin, K., Gotzsche, P. C., Krumholz, H. M., Ghersi, D. and van Der Worp, H. B. (2014) Increasing Value and Reducing Waste: addressing inaccessible research, The Lancet, 383 (9913), 257–66.
Conover, W. J. (1980) Practical Nonparametric Statistics, Wiley & Sons.
Corti, L. (2014) Managing and Sharing Research Data: a guide to good practice, Los Angeles, SAGE.
Cousijn, H. and Allagnat, L. (2016) Managing Your Research Data to Make it Reusable: experts weigh in on good data management, why it's important and how to achieve it, Elsevier, 20 July, https://www.elsevier.com/connect/managing-your-research-data-to-make-it-reusable.
Craft, M., Dobrenz, B., Dornbush, E., Hunter, M., Morris, J., Stone, M. and Barnes, L. E. (2015) An Assessment of Visualization Tools for Patient Monitoring and Medical Decision Making. In Systems and Information Engineering Design Symposium (SIEDS), 2015, 212–17, IEEE.


Daradkeh, M., Churcher, C. and McKinnon, A. (2013) Supporting Informed Decision-making under Uncertainty and Risk through Interactive Visualisation. In Proceedings of the Fourteenth Australasian User Interface Conference – Volume 139, 23–32, Australian Computer Society, Inc.
David, P. (2001) From Keeping 'Nature's Secrets' to the Institutionalization of 'Open Science', Economics Group, Nuffield College, University of Oxford.
EdwardTufte.com (n.d.) Poster: Napoleon's march, https://www.edwardtufte.com/tufte/posters.
Elsevier (2018) Database Linking: linking research data and research articles on ScienceDirect, https://www.elsevier.com/authors/author-services/research-data/data-base-linking/supported-data-repositories.
Fayyad, U. M., Wierse, A. and Grinstein, G. G. (2002) Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann.
Ferguson, A. R., Nielson, J. L., Cragin, M. H., Bandrowski, A. E. and Martone, M. E. (2014) Big Data from Small Data: data-sharing in the 'long tail' of neuroscience, Nature Neuroscience, 17 (11), 1442–7.
Foster, E. D. and Deardorff, A. (2017) Open Science Framework (OSF), Journal of the Medical Library Association, 105 (2), 203–6.
Friendly, M. (2008) A Brief History of Data Visualization. In Handbook of Data Visualization, Springer, 15–56.
Fry, B. (2007) Visualizing Data: exploring and explaining data with the Processing environment, O'Reilly Media, Inc.
García-Peñalvo, F. J., García de Figuerola, C. and Merlo, J. A. (2010) Open Knowledge: challenges and facts, Online Information Review, 34 (4), 520–39.
Goodman, A., Pepe, A., Blocker, A. W., Borgman, C. L., Cranmer, K., Crosas, M., Di Stefano, R., Gil, Y., Groth, P., Hedstrom, M., Hogg, D. W., Kashyap, V., Mahabal, A., Siemiginowska, A. and Slavkovic, A. (2014) Ten Simple Rules for the Care and Feeding of Scientific Data, PLOS Computational Biology, 10 (4), e1003542.
Heer, J., Bostock, M. and Ogievetsky, V. (2010) A Tour through the Visualization Zoo, Communications of the ACM, 53 (6), 59–67.
Heidorn, P. B. (2008) Shedding Light on the Dark Data in the Long Tail of Science, Library Trends, 57 (2), 280–99.
Helt, G. A., Lewis, S., Loraine, A. E. and Rubin, G. M. (1998) BioViews: Java-based tools for genomic data visualization, Genome Research, 8, 291–305.


Iliinsky, N. and Steele, J. (2011) Designing Data Visualizations, Sebastopol, California, O'Reilly.
Isenberg, P., Elmqvist, N., Scholtz, J., Cernea, D., Kwan-Liu Ma, H. and Hagen, H. (2011) Collaborative Visualization: definition, challenges, and research agenda, Information Visualization, 10 (4), 310–26.
Jackson, N., Avery, W., Oluwatosin, O., Lingfei, X., Renzhi, C., Tuan, T., Chenfeng, H. and Jianlin, C. (2016) GMOL: an interactive tool for 3D genome structure visualization, Scientific Reports, 6 (20802).
Janssen, K. (2012) Open Government Data and the Right to Information: opportunities and obstacles, The Journal of Community Informatics, 8 (2), 1–11.
Kitchin, R. (2014) The Data Revolution: big data, open data, data infrastructures and their consequences, London, SAGE Publications.
Kosara, R. and Mackinlay, J. (2013) Storytelling: the next step for visualization, IEEE Computer, 46 (5), 44–50.
Kum, H.-C., Ahalt, S. and Carsey, T. M. (2011) Dealing with Data: governments records, Science, 332 (6035), 1263.
Lerner, J. and Tirole, J. (2001) The Open Source Movement: key research questions, European Economic Review, 45 (4–6), 819–26.
Lo, B. (2015) Sharing Clinical Trial Data: maximizing benefits, minimizing risk, JAMA, 313 (8), 793–4.
Loraine, A. E. and Helt, G. (2002) Visualizing the Genome: techniques for presenting human genome data and annotations, BMC Bioinformatics, 3.
McCauley, R. and Graves, B. (2017) How Data Visualization is the Future of Information Sharing, 30 January, GovTech, www.govtech.com/fs/How-Data-Visualization-is-the-Future-of-Information-Sharing.html.
Nambiar, R. (2017) Transforming Industry through Data Analytics, O'Reilly Media, https://ssearch.oreilly.com/?q=transforming+industry+through+data+analytics+digital+disruption.
Nicol, J. W., Helt, G. A., Blanchard, S. G., Raja, A. and Loraine, A. E. (2009) The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets, Bioinformatics, 25, 2730–1.
Paravati, G., Sanna, A., Lamberti, F. and Ciminiera, L. (2011) An Open and Scalable Architecture for Delivering 3D Shared Visualization Services to Heterogeneous Devices, Concurrency and Computation: Practice & Experience, 23 (11), 1179–95.
Pearce, J. (2014) Open-source Lab: how to build your own hardware and reduce research costs, Amsterdam and Boston, Elsevier.


Petrie, C. and Bussler, C. (2008) The Myth of Open Web Services: the rise of the service parks, Internet Computing, IEEE, 12 (3), 96, 94–95.
Robert, L. (2001) A History of the Human Genome Project, Science (Washington), 291 (5507), 1195.
Rogers, K., Wiles, J., Heath, S., Hensby, K. and Taufatofua, J. (2016) Discovering Patterns of Touch: a case study for visualization-driven analysis in human-robot interaction. In The Eleventh ACM/IEEE International Conference on Human Robot Interaction, 499–500, IEEE Press.
Schneiderman, B., Plaisant, C. and Hesse, B. (2013) Improving Health and Healthcare with Interactive Visualization Methods, HCIL Technical Report, 1, 1–13.
Searls, D. B. (2000) Bioinformatics Tools for Whole Genomes, Annual Review of Genomics & Human Genetics, 1, 251–79.
Segel, E. and Heer, J. (2010) Narrative Visualization: telling stories with data, IEEE Transactions on Visualization and Computer Graphics, 16 (6), 1139–48.
Springer (2015) Springer Starts Pilot Project on Linked Open Data, 4 March, https://www.springer.com/gp/about-springer/media/press-releases/corporate/springer-starts-pilot-project-on-linked-open-data/51686.
Tom, C., Shelley, H., Marion, J. and David, W. (2008) Open Content and Open Educational Resources: enabling universal education, International Review of Research in Open and Distance Learning, 9.
Tufte, E. (2006) Beautiful Evidence, Cheshire, Conn., Graphics Press.
US Department of Energy (2007) Demystifying Open Access, Washington, DC, US Department of Energy, Office of Science; Oak Ridge, Tenn., Office of Scientific and Technical Information.
Van den Eynden, V., Corti, L., Woollard, M., Bishop, L. and Horton, L. (2011) Managing and Sharing Data: a best practice guide for researchers, http://repository.essex.ac.uk/2156/1/managingsharing.pdf.
Wikimedia (2018) File:Minard.png, https://commons.wikimedia.org/wiki/File:Minard.png.
Yusoff, N. M. and Salim, S. S. (2015) A Systematic Review of Shared Visualisation to Achieve Common Ground, Journal of Visual Languages & Computing, 28, June, 83–99.


10 Open source, version control and software sustainability
Ildikó Vancsa

Introduction

This chapter will guide you to the land of open software development to show some of the methods and many of the challenges that people and communities face in today's digital era. Open source as we know it started in the 1980s, with the aim of making the source code of an operating system accessible to the public in order to speed up the process of fixing issues. Sharing not just ideas but also artefacts and the work environment became more and more popular. This shared form of developing software grew in popularity when the internet and world wide web came to life in the 1990s, providing the opportunity for global movements. The term 'open source software' was coined in 1998 to distinguish it from 'free software' (Peterson, 2018), a label often taken to mean that the software was 'free' in the sense of not costing money rather than being freely accessible.

You will learn what open source is and how its innovative and forward-looking nature pushes users and contributors to think about it continuously and take the next steps. But how does this fit into the dynamics of software development, sustainability and productisation (making commercially viable products)? We will seek the answers by going into the details of the processes and the questions they open up in an open, collaborative environment.


Information technology, software and open source

Information technology (IT) by definition is the application of computers to store, study, retrieve, transmit and manipulate data. During the past decade computers shifted from being mystical luxury items to being part of everyday life in ways we could never have predicted. Wired phones became smartphones. Cars became dozens of mini-computers and hundreds of sensors on wheels. Even your television got 'smart'. For the purposes of this exploration, we will set aside the challenges inherent in the enormous amounts of information these devices generate, process and transmit. Instead we will focus on the software and the software development process, highlighting efforts to connect large computer systems from the point of view of the data being shared.

The tools and applications you use in your everyday life are designed and written in various programming languages by software engineers and developers working for small companies, start-ups or multinational companies, or simply doing it as a hobby. The web pages you see and the applications on your smartphone are all part of a larger system. Behind the scenes is extensive cooperation among companies, from multinationals to start-ups, and thousands of individuals from all around the world. They build multilayered systems from numerous pieces of software that are often reusable code libraries and services. In order to achieve a stable service and a good user experience, these pieces need to work together smoothly. Your phone installs updates to keep it fast and efficient. The car mechanic not only changes the oil but also uploads a new version of the software that runs in the background to make your car safer and more powerful. From your regular reliance on these devices, you can get a sense of the importance of keeping the versions of all the pieces working smoothly together. Some of the difficulties of keeping things working together smoothly come from the people and organisations that need to collaborate. As technology evolves, involvement in the IT industry is growing worldwide. You no longer have to work for a company to be able to develop software and share it with the public.

We use the term open source to refer to source code that is published and accessible to anyone to study, modify, reuse or redistribute.


We use the term open source to refer to source code that is published and accessible to anyone to study, modify, reuse or redistribute. The term originates in software development and incorporates the rules and guidelines for accessing and enhancing the source code that programs and applications are built from. Open source has grown beyond program code and become a mindset that hundreds of thousands of people embrace all around the globe. To show all angles of open source would take a book. To understand the legal implications of this style of working and collaboration takes a profession. To get a better view of the challenges of exchanging data and designing, developing and building software in an open, collaborative fashion will take this chapter.

Community and communication: who said what exactly?

We cannot really discuss open source without mentioning the effort that people need to put into communicating with each other. Members of a community must be able to follow, and track back, the information that has been exchanged, no matter what avenue of communication is used. The efforts we stamp with the name 'open source' most commonly formalise around communities. Each of these smaller or larger groups of people has its own governance and project structure, ways of communicating, programming languages it embraces, and tools for software development and testing.

Open source is one of the driving forces of innovation. It is particularly effective at providing collaborative environments for nurturing new ideas and technologies and at keeping the communities' artefacts publicly available. Human interaction is one of the key elements in implementing new functionality, which then becomes part of the software and applications we all use. The common language of worldwide open source communities is English, used for both oral and written communication. As involvement in an open source project grows around the globe, the importance of offline, asynchronous communication and of the ability to document, share and store meeting minutes and decisions rapidly increases. So how do we do all this? And if the text itself is available, is it a given that any individual will be able to read and understand it?


There is no way to list every tool and every possible way to manage the necessary communication, but I will explore a few of them here to give a taste of the options available.

Internet Relay Chat (IRC) is a text-based chat protocol that dates from the early days of the internet, when the network was only beginning to reach the wider public. You can connect to multiple separate networks of servers through various client applications, which might run on your laptop, mobile phone or web browser. Freenode, a network that has been serving open source communities for over 20 years, lets you join and explore discussions in various channels after picking a nickname that identifies you to others. The magic of IRC lies in its simplicity: it is purely text based. While some modern clients can embed photos, videos and emojis, in open source communities these features are rarely used, not least because not all client applications support them. Debates in a public IRC channel can be followed by anyone who joins the channel, as long as they maintain an active connection. While this sounds as if it gives everyone who wants to join an easy way in, a closer look reveals that a dialogue takes place at a given time in a given time zone, which excludes everyone who is not present at that particular time. The same problem occurs if someone loses their internet connection: simple web-based or desktop clients cannot retrieve messages that were exchanged while the client was disconnected from the network. While applications exist to address this issue, we will consider the efforts that give access to information without the need for advanced features, which often require an extra fee.

Open source is more than just public access to software programs and their source code. It incorporates an open and collaborative effort and environment. It is crucial to find a way to provide access to key information, like design discussions and decisions, to anyone who is willing to get involved in development activities. The work of open source communities is often organised into projects and project teams, each of which forms around a modular component of the architecture of the system and software package being produced. This drives the avenues of communication as well, such as having an IRC channel per project and several additional channels to address interactions between the project teams.
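To make the mechanics concrete, here is a minimal sketch of an IRC channel logger written in Python. It speaks the plain-text protocol directly over a socket and appends public messages to a daily log file. The server address, channel name and nickname are illustrative assumptions, not the configuration of any particular community; real community infrastructure uses far more robust bots, with reconnection, authentication and moderation built in.

import socket
from datetime import date

SERVER = "chat.freenode.net"   # assumption: any reachable IRC server will do
PORT = 6667                    # plain-text IRC port
CHANNEL = "#example-project"   # hypothetical project channel
NICK = "meeting-logger"        # hypothetical nickname for the logging bot

def log_channel():
    """Connect to an IRC channel and append public messages to a daily log file."""
    sock = socket.create_connection((SERVER, PORT))
    sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :channel logger\r\n".encode())
    sock.sendall(f"JOIN {CHANNEL}\r\n".encode())

    buffer = ""
    while True:
        buffer += sock.recv(4096).decode("utf-8", errors="replace")
        while "\r\n" in buffer:
            line, buffer = buffer.split("\r\n", 1)
            if line.startswith("PING"):
                # Answer the server's keep-alive, as any simple client must.
                sock.sendall(("PONG" + line[4:] + "\r\n").encode())
            elif " PRIVMSG " in line:
                # Lines look like ':nick!user@host PRIVMSG #channel :message'
                prefix, message = line.split(" PRIVMSG ", 1)
                nick = prefix.lstrip(":").split("!", 1)[0]
                text = message.split(":", 1)[1] if ":" in message else ""
                with open(f"{CHANNEL.lstrip('#')}-{date.today()}.log", "a") as log:
                    log.write(f"{nick}: {text}\n")

if __name__ == "__main__":
    log_channel()

The output of a bot like this is exactly the kind of plain-text daily log described below, which a community's infrastructure team can then publish for anyone to read.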


The text-based nature of IRC makes it relatively easy to save and store the conversations that occur on public channels. For example, discussions can be stored in raw text format daily, which effectively preserves the content of a project-related channel. The granularity of what is saved can be adjusted with additional tools, making it possible to store information from a given time window separately, for example a meeting that takes place in either a common or a project channel. By saving meeting discussions individually we can track back all the information shared during the meeting, including questions and concerns, and design, usability or testing-related messages. The team responsible for the community's infrastructure stores the logs and makes them available to the public. Anyone unable to connect to the internet, or simply unable to attend a meeting for any reason, who would like to know what their teammates were discussing can look up the log files and read them in a browser window.

So you have the ability to access the data, but will you find what you are looking for? And will you understand what you find? While finding information about a discussion from a known date and time may be easy, what if you don't know all the parameters of the chat snippet you are looking for? Search engines can help you to find related information in the published log files. You also have the choice to download the logs and perform searches and other analysis on your laptop or any other resource you have available for the problem. The language in which a discussion is written can also make it harder to find information in the stored files; let's face it, not everyone is fluent in English. So the question is still open: how useful are these logs?

Another interesting aspect of communication, especially in specialised areas like scientific research or telecommunications, is that the words and acronyms used are often specific to the field in question. The number of acronyms is growing rapidly, and there are often overlapping industry segments served by the same open source project that use the same acronyms to mean different things. This can make the discussion even harder to understand for someone working on the open source project who is not deeply involved in either of the served industries.
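As a small illustration of the 'download and search locally' option, the sketch below scans a directory of plain-text channel logs for a keyword within a date range. The one-file-per-channel-per-day naming convention is an assumption carried over from the logging sketch above, not a standard that every community follows.

from datetime import date, timedelta
from pathlib import Path

def search_logs(log_dir, channel, keyword, start, end):
    """Yield (day, line) pairs where the keyword appears in daily channel logs."""
    day = start
    while day <= end:
        log_file = Path(log_dir) / f"{channel}-{day}.log"   # assumed naming scheme
        if log_file.exists():
            for line in log_file.read_text(encoding="utf-8").splitlines():
                if keyword.lower() in line.lower():
                    yield day, line
        day += timedelta(days=1)

# Example: find every mention of an acronym during one week of meetings.
for day, line in search_logs("logs", "example-project", "API",
                             date(2018, 6, 4), date(2018, 6, 10)):
    print(day, line)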


Another form of communication frequently used in the open source community is e-mail. Most will be familiar with the concept of a mailing list, a mechanism for distributing e-mail messages among individuals who are subscribed to a particular list. It is like giving a list of names to the local post office with a note requesting that if anyone on the list sends a letter to the list, everyone on the list receives a copy of the message it contains. E-mail can be created in rich-text or HTML format, which makes it possible to specify colours, fonts and character formatting. E-mail messages can be converted to a text-only format and then saved and stored much like the aforementioned IRC discussions, but data may be lost in the process. In e-mail threads, which are messages sent as replies to an original e-mail, it is common practice to provide comments in the reply as in-line text differentiated from the original message by use of colour. When the information is stored in raw text format and the font colours are lost, it becomes more difficult to read a thread that relied on colour coding to distinguish between the speakers and the replies. The information is saved, stored and published, but how useful is it if the individual is left to navigate and attempt to use it on their own? And there are further open questions, like how long are we storing all this information for? What is the relationship between the viability of the data and the time interval for which it is stored and made publicly available? You can read the text, but for how long is it needed and meaningful to you or anyone else around the globe?

Version control: which milestone or release am I working on right now exactly?

Now that we have warmed up with this short tour of the social and human aspects of handling the information exchanged during open source software development, let's take a look at the software side itself. In this section you will see source code treated as information that is continuously modified in a controlled way. This is most commonly called version control, but is sometimes referred to as revision control or source control. Version control manages changes to a collection of information, whether computer programs or documents (Wikipedia, 2018).
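To make 'managing changes' tangible before we dive into the analogies, here is a minimal sketch using Python's standard difflib module. It records the difference between two drafts of a paragraph in the unified diff format that most version control tools also use; the draft text and file names are invented for illustration.

import difflib

# Two drafts of the same paragraph (hypothetical example text).
draft_1 = [
    "Open source grew out of the free software movement.\n",
    "Communities share code, ideas and infrastructure.\n",
]
draft_2 = [
    "Open source grew out of the free software movement of the 1980s.\n",
    "Communities share code, ideas and infrastructure.\n",
    "They also share the tools used to build and test the code.\n",
]

# A unified diff captures only what changed, plus a little surrounding context --
# essentially what a version control system stores and shows for each revision.
for line in difflib.unified_diff(draft_1, draft_2,
                                 fromfile="chapter.txt@v1", tofile="chapter.txt@v2"):
    print(line, end="")

Reading diffs like this is also how reviewers in an open source project see exactly what a proposed change does before it is merged.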


To give you an example of the challenge, let's imagine that you are writing a book which consists of several chapters, including an introduction and a summary. First you design the architecture of your book by defining the table of contents, starting from the main chapters and working down to a more fine-grained structure. You save the table of contents and start to work on the chapters one at a time, but you might not finish one before starting another. Perhaps you continue to polish the introduction as the book starts to take shape. You are passionate about your work and you have phrases you like – maybe multiple versions expressing the same thought – as you progress towards the end goal of a cohesive final product with a nice flow from beginning to end. So you start saving different versions of sections or whole chapters as they evolve, because you might want to go back to an earlier version you liked better. The longer the book, the more chapters you have and the more hectic your work style, the harder it is to keep track of the different versions of the individual pieces. You might accidentally delete a version you wanted to use, or get into trouble naming the versions. This approach easily leads to lost time and effort as you attempt to keep track of the versions manually; it is much easier if you use a software-based system to support you.

To take it one step further, let's imagine that you work with a group of writers and choose to put the document in a place where it is available to all of them. You are collaborating with them from the first step of creating the table of contents to the last step of putting the whole document together. And if this is not yet terrifying enough, let's imagine that you would like to make this a community effort and open up the idea to the public, so everyone can contribute to your book. You can decide whether or not you like their changes and add what you like to the end result. It can be a group of five people or a group of five hundred. Are you scared yet? The challenges at the individual level are amplified and made exponentially harder by the open environment and the larger number of people trying to add their ideas to the end result. When it comes to software development, the end result has to be an application that does what it was designed for, integrates well with its environment and is easy to use.


To keep order during the development process and over the constructed software, we use revision control systems like Git, a third generation version control tool. The first signs of version control systems date back to the 1960s; the Source Code Control System, created in 1972, became the first such system available to the public in 1977. Regardless of the exact date these systems came alive, their purpose then was the same as it is today.

Writing software is a complex process and hardly a one-person task. To understand it better, think of cars and the number of pieces they have. If one person puts a car together it takes a very long time, which is why, when you hear about the invention of the automobile, you hear even more about the assembly line, which changed the history of making them. A software project consists of files, each an item or unit holding data, like the pages of a book. If you have the access rights you can open a file and read its content, organised into lines, which depending on the file format is either human-readable or not. If files were as well defined as the pieces of a car we could have an assembly line for writing software, but the analogy doesn't work out of the box here. When you add new functionality to your application you need to modify multiple files. If multiple features are added or modified simultaneously, there may even be overlapping lines of code needing modification by different teams. The complexity of the problem is elevated, but the idea of having an assembly line is still not off the table – it just needs a little adaptation. As with the parts of a car, you want the pieces of your software to be developed in parallel, and you want the pieces to fit together without conflict. You also want to save every single step along the way, to have a complete history of the changes and the reasons for them, in case you need to revisit a decision or remove a change. Version control systems are your magical assembly line: they speed up the software development process, help you avoid conflicts most of the time and help you resolve them when they happen. The tools have come a long way from the first generation created in the 1970s to the third generation we have today. So what has changed?


Remember the story of writing a book together with a group of people? Let's imagine that instead of assigning chapters to writers, everyone can edit the whole book. The first generation of version control systems had some limitations in providing this ability. You could access the book and edit its pages, but you could work on only one page at a time, and while you were working on that page no one else could access it to make changes until you had finished yours. Similarly, you could not edit a desired page if someone else was already working on it. But first generation tools are for the history books, and developers and companies have moved over to the second and third generation versions.

As speed and efficiency became more and more important, you no longer wanted the pages of the book blocked from other writers. With second generation tools there is a master copy of your book that can be seen and modified by your fellow authors. You can edit multiple pages at the same time, and so can others. Once someone is happy with their changes, the changes are merged back into the master copy so everyone can see them, no matter who is working on the book. The only thing you have to do before having your changes added to the book is to look at the latest visible version and update your pages with any changes made since you started working. This step is needed to make sure there are no remaining conflicts to resolve before your changes make their way back into the master copy. Second generation tools are widely used, especially in corporate environments. One popular example is Apache Subversion (https://subversion.apache.org), which happens to be open source software itself. These tools provide dramatically more flexibility than the first generation versions did, but they are still not necessarily the most suitable for open source collaboration.

Without going into too much detail, let's take a quick look at third generation tools. These tools are decentralised, which means that there is a master copy of your book and everyone has a local copy of it, whether they sit in the same office as you or anywhere else in the world. These tools track the changes that you made, as opposed to the pages you edited, which is a big difference from the first and second generation tools. You make changes to the local version of the book and your changes get uploaded to the remote master copy in a package most commonly called a 'commit'. You still need to address conflicts locally, but the system is made much more flexible by addressing the challenge from the opposite angle.
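One way to see what 'tracking changes rather than pages' means in practice is to look at how a third generation tool such as Git identifies content. Each snapshot of a file is stored under a hash of its content, so identical content gets the same identifier in every contributor's local copy, anywhere in the world. The sketch below reproduces Git's blob hashing in Python purely as an illustration; the file contents are invented.

import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the identifier Git would assign to a file with this exact content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

chapter_v1 = b"Open source grew out of the free software movement.\n"
chapter_v2 = b"Open source grew out of the free software movement of the 1980s.\n"

# Different content yields a different identifier, so every revision is addressable;
# identical content yields the same identifier on every clone.
print(git_blob_id(chapter_v1))
print(git_blob_id(chapter_v2))
print(git_blob_id(chapter_v1) == git_blob_id(chapter_v1))  # True

A commit, in turn, is just another hashed object that records which file identifiers make up a snapshot, who made it and why, which is what gives the complete history of changes described above.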


Third generation version control systems are very popular in open source software development. We use these tools for two main purposes: to preserve the complete history of changes to the software, and to put a time stamp on a particular version of the software we are working on and release that version. A release of the software can be considered akin to a published edition of a book.

Recall that the changes to the software are just as important as the software itself. When a change to the master copy breaks pre-existing functionality it must be repaired. It is like having a few new paragraphs in your book that make the story line completely inconsistent. How do you fix it? Fortunately there are multiple ways to address the issue. You can work on your book further and make additional changes that correct the flow of your story and put it back on track. That sounds easy, but if you are writing a complex science fiction novel you might find it hard to do. In this case you might want to revisit the changes that you or someone else made to the broken parts of your book. Seeing the previous edits, together with information from the contributing authors about the purpose of their changes, can help you make further edits. You can also decide that one or more of the changes have done more damage than good to the book, in which case the version control system lets you remove those changes and start those pieces over. And it is no different with software.

Releasing software is a complex process. As with a book, it is not enough to write grammatically correct content – you need a cover, illustrations, and methods of printing and distribution. We will not consider the release process itself, but we will look a little at what happens to the master copy or copies. The version control system does more than just keep track of the master version of your software; it also preserves every change made along the way. It even allows you to mark a specific subset of the code with a version stamp and publish the snapshot it identifies. Once you release an edition of your book, the version control system lets you keep working on, and correcting, what you have already published. Now you must manage edits in two versions: your constantly evolving master copy and the recently published edition. That alone doubles the trouble if you are on your own.


If you are using a version control system you get help with all this, as these tools are capable of tracking different versions of the same software or document. The goal is to make it simple to add content to your master copy and fixes to the published editions simultaneously.

To make it easier to see why this is important, let's switch from books to cars. You have a company manufacturing cars; let's call them Company X. They have one model of their hybrid pick-up truck out on the roads. They still produce this model and have assembly lines dedicated to it. Company X is already working on the next model, but also continues to improve the existing one to make it even safer and more stable. They don't want to add new features to the cars already on the road – that would be confusing to buyers. It would also require changes to the assembly line, adding complications and extra cost.

You never use a software tool in isolation: it runs on a physical machine with an operating system, interacting with other software components and with humans. There are many integration points and usability expectations. In open source we say that if you want a more stable edition of the software, use the latest version of it. If you want more functionality, use the latest version of it. In general, just use the latest version of it. This is the guideline of those who develop and test the software; they see the complexity of keeping a released version working alongside all the interdependent software components. If you are the one making the car, you want to operate only the latest assembly line and remove the ones producing older models. You also want your customers driving the latest models, which are safer and faster and include the best combination of features. If you are the car owner, you might want to keep your car for as long as possible. You are used to how it works, you have your daily routine with it, the colour you like is not available in the new model, and you don't want to pay for a new car. And the list goes on and on.

This takes us to two related questions: how long should we continue to support a software release, and why can't we just keep adding new content to the released edition? In other words, the conundrum of long-term support.


The long-term support conundrum: do I really want to keep this for that long?

We have explored how people collaborate on open source software development projects. One of the secrets behind a project's growth is having a lot of people or companies all working together to solve the same problem. No matter who is contributing their ideas to an open source project, one of the main driving forces is interest in the idea itself, as opposed to the quality of the code or the architecture of the prototype of a software component. The biggest challenge after the initial excitement around your project comes not with the growth in the number of developers, but when a growing number of people start to use it. If you think back to the book example, it is like waiting for the reviews to come out after you have published an edition. You hope for a good review and for constructive feedback at the same time, to get a better idea of what content to add to the next edition. You want your readers and the critics engaged enough to keep reading your books and giving you feedback. The same holds true for our car example. The car manufacturer must keep the customers who want parts to maintain older models satisfied, while also pursuing the customers who would like the design process to move faster so they can get to the new models more quickly. You cannot please everyone – you need to thrive on, and survive, the balancing act.

The challenge itself is no different from those faced in commercial software development processes and corporate environments. Open source is simply a different environment, in which the challenges are handled and addressed differently. This section is not about the master copy of your book in the version control system, but rather about the published editions needing corrections. You need to balance your correction efforts to make sure that you don't take too much energy away from creating the next new version, and you must ensure that corrections do not create new flaws. A typo should get fixed both in the master copy and in all published versions in which it is found. Minor changes, like a typo, are unlikely to cause issues.


On the other hand, actual content changes are more questionable. Is the change fixing an issue or is it new content? Can you be sure that it will not cause more damage than it fixes? A group of people caring for the released edition of the book can answer these questions. Providing long-term support becomes a problem when you have multiple editions of your book, multiple models of your car, or multiple versions of your software. The more releases you have, the more work it is to maintain them all. The older ones go out of date: the content of the book becomes less relevant, the features of the car become less suitable for the environment, and your software becomes increasingly hard to integrate with the other pieces of its environment.

Conclusion

In open source, the question is not how long to make a software version available, but rather how long it should be maintained. In software, the older something gets the harder it is to use, because of the rapid changes and innovations in technology. There are new programming languages, new feature expectations and more components to integrate your software with, all of which create new challenges without you having changed a single character in your code. Most open source communities, if they care about older software versions, put effort into maintaining two or three past versions. This usually gives a year and a half or two years to use a version and receive fixes for it. It is important to note that with open source software you always have the option to look into the code and fix it yourself, just as you can with an older model of car. There is a continuous discussion about how much time and how many versions communities decide to keep up to date. The decisions are based on many factors, from the amount of time the design and implementation phases take before release to the number of people available to do the work. There is no silver-bullet answer to this problem, but it highlights one of the main advantages of open source: beyond the shared effort that goes into solving common problems is the sharing of ideas with anyone who is interested in picking them up or participating in developing them further.


Open source projects appear, with smaller groups or larger communities around them facing very similar challenges in how to communicate, share information and collaborate to produce software together that works. Communities and projects also disappear over time as people move on to new ideas and other problems. I could talk about the challenges of using, or even just understanding, the software that is left behind, but it is not a common topic of discussion. The reason is that people talk about what interests them in the present and put their energy into it, rather than focusing on something that solved an issue they had yesterday, last year or a decade ago. Source code left behind is like any form of written or drawn material we find from hundreds or thousands of years ago. As long as you have access to the source code written in a particular programming language, you can understand it if you speak that language. The piece of software might not work anymore on current operating systems without the additional software packages or libraries on which it depends, but the idea is still there. You can choose between figuring out how to build an environment in which that piece of software works, and reimplementing the idea in a programming language of your choice that can run in a current environment.

Some of the key factors in the success of an open source community are adaptability and innovation. Technology evolves quickly and there are always new questions that need new answers. With innovation sometimes comes the need to resolve an old problem in a new environment. We all work together, learn from each other and concentrate on keeping the ideas at the heart of the software alive, rather than focusing on keeping a specific version running.

References

Peterson, C. (2018) How I Coined the Term 'Open Source', 1 February, Opensource.com, https://opensource.com/article/18/2/coining-term-open-source-software.

Wikipedia (2018) Version Control, https://en.wikipedia.org/wiki/Version_control.


Final Thoughts
Jeanne Kramer-Smyth

I am an optimist, but I do not think I am being naive in proposing that the list of potential 'partner professions' for archivists concerned with digital preservation is very long. This is not a one-way street; we are not the only ones who will benefit from such partnerships. The promise of collaboration across professions is something many are pursuing. The popular TED (Technology, Entertainment, Design) conferences and related online videos grew out of cross-collaboration at the intersection of technology, entertainment and design. Now their 'TED Fellows program provides transformational support to an international community of 436 visionaries who collaborate across disciplines to create positive change around the world' (TED, n.d.).

Not everyone is going to be able to seek partners from other sectors with whom to join forces in their digital preservation work. What is more feasible is for those in the R&D arm of the profession to be creative in their hunt for similar challenges already being tackled in the private sector.

By the time you read this section, I am assuming you have read the preceding chapters. As each of these chapters took shape, I took it as a good sign that in every case I could think of multiple archives colleagues with whom I already wanted to share them. I also started spotting common threads that wove from one chapter to the next.

In the chapters on memory, privacy and transparency, the authors touch on a number of related issues about which archivists are often concerned.


Dr Harbinja's discussion of the inheritance of digital media addressed privacy, the wishes of the content creators, and the tug of war at play between social media companies and the evolving frontier of digital inheritance law. In her chapter looking at the GDPR, Paulan Korenhof gave us great insight into the impact of the ever-accessible internet and the corresponding dramatic increase in access to personal information. Brant Houston documented the challenges and successes navigated during the rise of computer-assisted reporting. Professor Margolis showed us the major impacts of link and reference rot on the trustworthiness of legal citations. All of these issues – privacy, the record creator's wishes, the right to access public information, and the importance of building an infrastructure to battle link and reference rot – are crucial to the work of archivists engaged in everything from digital record transfer and ingest to balancing the importance of providing access to records with the privacy and interests of the record creators.

In the chapters that discussed the intersection of the physical and digital worlds, we see that 'going digital' doesn't mean that we completely leave the analogue world behind. We live in the physical world, and the push to continue leveraging technology to improve that physical world presents its own challenges. Éireann Leverett showed us the promise and the risks at play with the Internet of Things. This current snapshot gives us a sense of the challenges on the horizon related to privacy, security, and the appraisal and preservation of this coming tsunami of data. Dr Sarkar gave us a tour of the boundary between how colour information is preserved and how it is presented to us via physical devices. Any archivist responsible for digital content that contains colour needs to figure out the most trustworthy way to preserve it and provide accurate access to their records. Dr Lee and Dr Gu showed us a path to building on the success of building information modelling (BIM) to improve the management of architectural design data for historic buildings. Leveraging standards and centralised repositories can smooth many of the challenges of tying context to design information. All three of these chapters point to the need for archivists to learn about how the interface between technology and the physical world is evolving. At the same time, technological innovation can move so quickly that concerns about privacy, security, records management and data preservation are often left by the wayside. Archivists with an understanding of these issues can contribute significantly by engaging those at the forefront of technological evolution.


In the section on data and programming, we had the opportunity to learn about efforts to protect the privacy of individuals, present information in more accessible ways, and maintain clear channels of communication in decentralised teams. Dr Schlomo detailed what must be considered to balance the conflicting priorities of access to data and individual privacy. Archivists whose digital records contain private information that must be shielded from the public would benefit from these selfsame techniques. In addition, archivists with years of experience walking this line between access and privacy can contribute to this evolving conversation in the data world. Dr Byrd's chapter on leveraging shared data for creating visualisations showed us another example of the ways that clear communication of context and the use of standards can improve outcomes. In her chapter on open source software, Ildikó Vancsa took us through the history of the open source movement and the massive benefits afforded by the use of version control software. The lessons learned by this community can help ensure we do our best as archivists charged with preserving software and digital records.

It is my fervent wish that you leave this book inspired: inspired to look at other professional communities' efforts to solve digital problems with an eye to how they can also support the work of archivists pursuing digital preservation. Don't assume that we need to go it alone. While our digital preservation community is strong and enthusiastic, there are so many others who are going through strikingly similar struggles. Be an ambassador. Tell people what archivists do. Tell people why digital preservation is as hard as it is important. Watch for the lightbulb that goes on over their heads as they recall a situation they have struggled with, professionally or personally, that mimics archivists' digital preservation challenges. I have long been an advocate for recruiting mid-career technologists to shift careers over to work in digital preservation. In the potential partnerships envisioned in this book, we can find other ways to bring the bright light of technologists from across many disciplines to bear on the work we have before us.

Reference

TED (n.d.) TED Fellows Program, TED: ideas worth spreading, https://www.ted.com/participate/ted-fellows-program.


Index

3D models, historical BIM+ (HBIM+) 124, 125-6 ABS TableBuilder, flexible tablegenerating servers 162 academic legal writing, link rot and reference rot 69-70 access, legal issues 6-8 Adobe Acrobat, computer-assisted reporting 57 AMA (Archive Mapper for Archaeology), historical BIM+ (HBIM+) 131 ‘ambient-adaptive’ features, digital colour reproduction 116-17 American Factfinder, flexible tablegenerating servers 162 anonymisation and pseudonymisation, online assimilation 37 Arizona Project, computer-assisted reporting 50 assets, digital see digital assets augmented reality, historical BIM+ (HBIM+) 132-3 authenticity legal information/citation 64, 68-71, 72-6

proposed solutions 72-6 BIM (building information modelling) see historical BIM+ ‘blue light reduction’, digital colour reproduction 116 The Bluebook, legal information/citation 62, 65, 73, 76 ‘born-digital’ legal materials, legal information/citation 65-7 building information modelling (BIM) see historical BIM+ CAD (computer-aided design) systems, BIM (building information modelling) 126, 127-8 census tables SDL (statistical disclosure limitation) 150-5 tabular data 150-5 Chesapeake Project, Legal Information Preservation Association 70 Chinoy, Ira xxiv-xxv


CIDOC CRM (Conceptual Reference Model of the International Committee for Documentation), historical BIM+ (HBIM+) 131 cloud computing, historical BIM+ (HBIM+) 133-4 collaboration historical BIM+ (HBIM+) 126-7, 129-34 SDL (statistical disclosure limitation) 148 colour gamut, digital colour reproduction 105-6 colour reproduction, digital see digital colour reproduction colour space, digital colour reproduction 105-6 common threads, digital challenges xxiii-xxiv communication, open source software 187-90 communities, open source software 187-90 computational journalism see computer-assisted reporting computer-assisted reporting 45-59 Adobe Acrobat 57 Arizona Project 50 data journalism 46-59 data leaks 58 database challenges 56-7 DocumentCloud 57 Freedom of Information Act 53 future 58-9 Google News Initiative 55 government agency policies 51, 52-6 Hacks/Hackers group 58 history of data in journalism 48-50 Howard, A. 46 Jaspin, E. 50-2 Meyer, P. 49

MICAR (Missouri Institute for Computer-Assisted Reporting) 51 Mulvad, N. 53 NICAR (National Institute for Computer-Assisted Reporting) 52, 54 PDF files 57 SQL (Structured Query Language) 50-5 Svith, Flemming 53 techniques 46-7 unstructured data 57-8 contextual integrity and authenticity of online information 33 contextual integrity of information 33 contracts, digital assets 8-11 convergence of knowledge realms 37 copyright Copyright, Designs and Patents Act 1988: 8 legal issues 7-8, 11 CTA (controlled tabular adjustment), tabular data 163 Cyber Cemetery legal information/citation 73 link rot and reference rot 73 data dissemination, SDL (statistical disclosure limitation) 159-63 data enclaves ICPSR (Consortium for Political and Social Research) 160 remote access 159-60 SDL (statistical disclosure limitation) 159-60 ‘data in the wings’, Internet of Things 86-9 data journalism see computerassisted reporting data leaks, computer-assisted reporting 58


Data Linkage and Anonymisation Programme, SDL (statistical disclosure limitation) 148 Data Protection Act 1998:, digital death/inheritance 11 data visualisation 167-81 see also sharing research data documenting the process 171-3 guidelines 171-173 Human Genome Project 175-6 ‘layered data’ 175 map-based visualisations 174-6 National Institutes of Health 170-1 National Science Foundation 170-1 overview 168 process 168, 171-3 representation 169-70 reusable data 176-7 shared data, visualisations built on 174-6 visual literacy 170 visualisation capacity building 170-1 database challenges, computerassisted reporting 56-7 data-sharing plans, sharing research data 180 death see digital death/inheritance Delta E metric, digital colour reproduction 109, 113 differential privacy, SDL (statistical disclosure limitation) 161 DigiNotar, hacking 98 digital assets access 6-8 contracts 8-11 copyright 7-8, 11 legal issues 6-8 property 6-8 in-service solutions 8-11, 14 value 5-6 digital challenges xxii-xxiv common threads xxiii-xxvi digital colour reproduction 101-22

‘ambient-adaptive’ features 116-17 ‘blue light reduction’ 116 colour gamut 105-6 colour space 105-6 Delta E metric 109, 113 digitisation and colour accuracy 117-19 display colour accuracy 102-3 display technology 103-8 file formats 115 film production 102-3 future 119-21 HDR (high dynamic range) display technology 106-8, 118 ICC (International Colour Consortium) 114 LCD (liquid crystal display) technology 104, 107-8, 109 metamerism 120-1 Microsoft Surface 108, 110-13 OLED (organic light-emitting diode) technology 104, 108 preserving colour accuracy 108-13 software features and colour accuracy 113-17 ‘white point’ 105-6 digital death/inheritance 3-4, 6-19 see also legal issues Data Protection Act 1998: 11 digital estate planning 16-18 GDPR (General Data Protection Regulation) 12 legacy tools 9-11 post-mortem privacy 11-16, 18-19 RUFADAA (Revised Uniform Fiduciary Access to Digital Assets Act) 11, 12-16 digital estate planning, digital death/inheritance 16-18 digital lock-in, Internet of Things 95-8 digitisation and colour accuracy, digital colour reproduction 117-19


disclosure risk microdata 155-8 SDL (statistical disclosure limitation) 148 display colour accuracy, digital colour reproduction 102-3 display technology, digital colour reproduction 103-8 DocumentCloud, computer-assisted reporting 57 Elsevier Data Repositories, reusable data 176 e-mail, open source software 190 Facebook, legacy tools 9-11 file formats digital colour reproduction 115 Internet of Things 89-92, 97 ‘file-drawer phenomenon’, sharing research data 177-8 film production, digital colour reproduction 102-3 firewall rules, Internet of Things 93-4 flexible table-generating servers ABS TableBuilder 162 American Factfinder 162 SDL (statistical disclosure limitation) 160-1 footnotes, legal information/citation 62, 68-9 forgotten, wish to be 32-5 see also RTBF (right to be forgotten) Freedom of Information Act xxiv-xxv computer-assisted reporting 53 Freenode, open source software 188 future computer-assisted reporting 58-9 digital colour reproduction 119-21 historical BIM+ (HBIM+) 128-34, 139 GDPR (General Data Protection Regulation)

digital death/inheritance 12 Internet of Things 94 origin 26 RTBF (right to be forgotten) 35-9 geomatics, historical BIM+ (HBIM+) 130-2 GIS (geographic information system), historical BIM+ (HBIM+) 130-1 Git tool, version control 191-2 Google, legacy tools 9-11 Google News Initiative, computerassisted reporting 55 Google Scholar, legal information/citation 65 Google Spain, RTBF (right to be forgotten) 25-6, 36 government agency policies, computer-assisted reporting 51, 52-6 GPS (Global Positioning System), historical BIM+ (HBIM+) 132 hacking DigiNotar 98 Internet of Things 81-5, 89-92 Hacks/Hackers group, computerassisted reporting 58 Harvard Library Innovation Lab legal information/citation 76 link rot and reference rot 76 Perma.cc 76 HBIM+ see historical BIM+ HDR (high dynamic range) display technology, digital colour reproduction 106-8, 118 historical BIM+ (HBIM+) 123-39 3D models 124, 125-6 AMA (Archive Mapper for Archaeology) 131 augmented reality 132-3 BIM (building information modelling) 123-34 BIM definition 127


BIM future 128-34 BIM key aspects 127-9 CAD (computer-aided design) systems 126, 127-8 characteristics 137-8 CIDOC CRM (Conceptual Reference Model of the International Committee for Documentation) 131 cloud computing 133-4 collaboration 126-7, 129-34 development 124-5 diversity 136 emerging BIM technologies 129-34 enhanced preservation technologies 129-34 future 139 geomatics 130-2 GIS (geographic information system) 130-1 GPS (Global Positioning System) 132 HBIM+ knowledge framework 135-6, 138 mobile computing 133-4 ontologies 129-30, 136-8 semantic web 129-30 SNS (Social Network Service) 132 virtual reality 132-3 WPS (Wi-fi Positioning System) 132 history data in journalism 48-50 legal information 64-8 Howard, A., data journalism 46 Human Genome Project data visualisation 175-6 sharing research data 175-6 ICC (International Colour Consortium), digital colour reproduction 114 ICPSR (Consortium for Political and

Social Research), data enclaves 160 information collections online assimilation 29-32 search engines 30-5, 37, 39 information retention, tertiary memory 28 inheritance see digital death/inheritance in-service solutions, digital assets 8-11, 14 Internet Archive legal information/citation 72-3 link rot and reference rot 72-3 Internet of Things 81-99 automation 96 ‘data in the wings’ 86-9 description 82-3 digital lock-in 95-8 file formats 89-92, 97 firewall rules 93-4 GDPR (General Data Protection Regulation) 94 growth rate of devices 96 hacking 81-5, 89-92 Koomey’s law 85 Moore’s law 85 non-static code 92-5 ‘over-provisioning’ 86-9 privacy 84-9 Privacy International 86-7 reproducibility 93-4 risk analysis 84-5 security 81-5, 88-92 ‘security debt’ 96-7 ‘trojan horse’ devices 86-9 variation in behaviour 94 Internet Sources Cited in Opinions legal information/citation 75 link rot and reference rot 75 Supreme Court 75 IPUMS (Integrated Public Use Microdata Series), microdata 158


IRC (Internet Relay Chat), open source software 188-9 IT (information technology), open source software 186-7 Jaspin, E., computer-assisted reporting 50-2 journalism see computer-assisted reporting Judicial Conference of the USA legal information/citation 74 link rot and reference rot 74 Koomey’s law 85 LCD (liquid crystal display) technology, digital colour reproduction 104, 107-8, 109 legacy tools, digital death/inheritance 9-11 Legal Information Archive legal information/citation 73 link rot and reference rot 73 Legal Information Preservation Association, Chesapeake Project 70 legal information/citation 61-77 academic legal writing 69-70 authenticity 64, 68-71, 72-6 The Bluebook 62, 65, 73, 76 ‘born-digital’ legal materials 65-7 Cyber Cemetery 73 footnotes 62, 68-9 Google Scholar 65 Harvard Library Innovation Lab 76 history of legal information 64-8 Internet Archive 72-3 Internet Sources Cited in Opinions 75 Judicial Conference of the USA 74 Legal Information Archive 73 Legal Information Preservation Association 70

legal precedent 63 link rot 61, 68-77 link rot and reference rot 61, 68-77 online material, growth of 64-8 PACER federal court electronic filing system 75 PDF files 73-5 Perma.cc 76 persistent identifier systems 75-6 Persistent Uniform Resource Locator system (PURLs) 76 stare decisis 68 Supreme Court 75 Wayback Machine 72 web citation 64-8 legal issues see also digital death/inheritance access 6-8 copyright 6-8 digital assets 6-8 ownership 6-11 property 6-8 legal precedent, legal information/citation 63 link rot and reference rot academic legal writing 69-70 avoiding 73-6 Cyber Cemetery 73 Harvard Library Innovation Lab 76 Internet Archive 72-3 Internet Sources Cited in Opinions 75 Judicial Conference of the USA 74 Legal Information Archive 73 legal information/citation 61, 6877 PACER federal court electronic filing system 74-5 Perma.cc 76 persistent identifier systems 75-6 Persistent Uniform Resource Locator system (PURLs) 76


proposed solutions 72-6 Supreme Court 75 tracking rotten lnks 72-3 Wayback Machine 72 long-tail data, sharing research data 177-8 long-term support open source software 195-7 version control 195-7 metamerism, digital colour reproduction 120-1 Meyer, P., computer-assisted reporting 49 MICAR (Missouri Institute for Computer-Assisted Reporting) 51 microdata see also tabular data disclosure risk 155-8 IPUMS (Integrated Public Use Microdata Series) 158 SDL (statistical disclosure limitation) 148, 155-8 SDL techniques 156-7 UK Data Service 158 Microsoft Surface, digital colour reproduction 108, 110-13 Missouri Institute for ComputerAssisted Reporting (MICAR) 51 mobile computing, historical BIM+ (HBIM+) 133-4 Moore’s law 85 Mulvad, N., computer-assisted reporting 53 NARA (National Archives and Records Administration) xxvi National Institute for ComputerAssisted Reporting (NICAR) 52, 54 National Institutes of Health, data visualisation 170-1

National Science Foundation, data visualisation 170-1 NICAR (National Institute for Computer-Assisted Reporting) 52, 54 Nomis, tabular data 149 non-static code, Internet of Things 92-5 OLED (organic light-emitting diode) technology, digital colour reproduction 104, 108 online assimilation anonymisation and pseudonymisation 37 information collections 29-32 privacy tactics 35-9 OnTheMap web-based mapping and reporting application, SDL (statistical disclosure limitation) 162-3 ontologies historical BIM+ (HBIM+) 129-30, 136-8 semantic web 129-30 Open Science Framework, sharing research data 180 open source software 185-98 communication 187-90 communities 187-90 e-mail 190 Freenode 188 IRC (Internet Relay Chat) 188-9 IT (information technology) 186-7 long-term support 195-7 terminology 186-7 version control 190-8 Origin-Destination Employment Statistics, SDL (statistical disclosure limitation) 162-3 ‘over-provisioning’, Internet of Things 86-9 ownership, legal issues 6-11


PACER federal court electronic filing system legal information/citation 74-5 link rot and reference rot 74-5 PDF files computer-assisted reporting 57 legal information/citation 73-5 Perma.cc Harvard Library Innovation Lab 76 legal information/citation 76 link rot and reference rot 76 persistent identifier systems legal information/citation 75-6 link rot and reference rot 75-6 Persistent Uniform Resource Locator system (PURLs), legal information/citation 76 post-mortem privacy, digital death/inheritance 11-16, 18-19 precision journalism see computerassisted reporting privacy, Internet of Things 84-9 Privacy International 86-7 privacy tactics, online assimilation 35-9 property, legal issues 6-8 pseudonymisation and anonymisation, online assimilation 37 PURLs (Persistent Uniform Resource Locator system), legal information/citation 76 reference rot see link rot and reference rot remote access data enclaves 159-60 SDL (statistical disclosure limitation) 159-60 remote analysis servers, SDL (statistical disclosure limitation) 162 reporting, computer-assisted see computer-assisted reporting

Research Data Alliance, sharing research data 179-80 reusable data data visualisation 176-7 Elsevier Data Repositories 176 RTBF (right to be forgotten) 25-7, 35-40 alternatives 35-9 GDPR (General Data Protection Regulation) 35-9 Google Spain 25-6, 36 origin 25-6 RUFADAA (Revised Uniform Fiduciary Access to Digital Assets Act) 11, 12-16 SDL (statistical disclosure limitation) 147-63 aims 147-8, 155 census tables 150-5 collaboration 148 data dissemination 159-63 data enclaves 159-60 Data Linkage and Anonymisation Programme 148 differential privacy 161 disclosure risk 148 flexible table-generating servers 160-1 microdata 148, 155-8 OnTheMap web-based mapping and reporting application 162-3 Origin-Destination Employment Statistics 162-3 preparing statistical data for release 149-58 remote access 159-60 remote analysis servers 162 synthetic data 162-3 tabular data 148, 149-55, 163 web-based applications 160-2 search engines information collections 30-5, 37, 39 time stamps 34


‘wish to be forgotten’ 33-5 security, Internet of Things 81-5, 88-92 ‘security debt’, Internet of Things 96-7 semantic web historical BIM+ (HBIM+) 129-30 ontologies 129-30 sharing research data 167-81 see also data visualisation best practices 178-80 challenges 169 data-sharing plans 180 ‘file-drawer phenomenon’ 177-8 Human Genome Project 175-6 long-tail data 177-8 missed opportunities 177-8 Open Science Framework 180 rationales 169 representation 169-70 Research Data Alliance 179-80 visualisations built on shared data 174-6 significant properties, digital objects xxii SNS (Social Network Service), historical BIM+ (HBIM+) 132 Source Code Control System, version control 192 SQL (Structured Query Language), computer-assisted reporting 50-5 stare decisis, legal information/citation 68 statistical data see SDL (statistical disclosure limitation) Supreme Court Internet Sources Cited in Opinions 75 legal information/citation 75 link rot and reference rot 75 Svith, Flemming, computer-assisted reporting 53

synthetic data, SDL (statistical disclosure limitation) 162-3 tabular data see also microdata census tables 150-5 CTA (controlled tabular adjustment) 163 Nomis 149 SDL (statistical disclosure limitation) 148, 149-55, 163 SDL approaches 149-50 TED (Technology, Entertainment, Design) conferences/videos 199 tertiary memory, information retention 28 time stamps, search engines 34 ‘trojan horse’ devices, Internet of Things 86-9 UK Data Service, microdata 158 unstructured data, computerassisted reporting 57-8 version control Git tool 191-2 long-term support 195-7 open source software 190-8 Source Code Control System 192 tools 191-5 virtual reality, historical BIM+ (HBIM+) 132-3 visualisation, data see data visualisation Wayback Machine legal information/citation 72 link rot and reference rot 72 web citation, legal information/citation 64-8 web-based applications, SDL (statistical disclosure limitation) 160-2


‘white point’, digital colour reproduction 105-6

WPS (Wi-fi Positioning System), historical BIM+ (HBIM+) 132