
Big Data Is Not a Monolith

Information Policy Series
Edited by Sandra Braman

The Information Policy Series publishes research on and analysis of significant problems in the field of information policy, including decisions and practices that enable or constrain information, communication, and culture irrespective of the legal siloes in which they have traditionally been located as well as state-law-society interactions. Defining information policy as all laws, regulations, and decision-making principles that affect any form of information creation, processing, flows, and use, the series includes attention to the formal decisions, decision-making processes, and entities of government; the formal and informal decisions, decision-making processes, and entities of private and public sector agents capable of constitutive effects on the nature of society; and the cultural habits and predispositions of governmentality that support and sustain government and governance. The parametric functions of information policy at the boundaries of social, informational, and technological systems are of global importance because they provide the context for all communications, interactions, and social processes.

Virtual Economies: Design and Analysis, Vili Lehdonvirta and Edward Castronova
Traversing Digital Babel: Information, e-Government, and Exchange, Alon Peled
Chasing the Tape: Information Law and Policy in Capital Markets, Onnig H. Dombalagian
Regulating the Cloud: Policy for Computing Infrastructure, edited by Christopher S. Yoo and Jean-François Blanchette
Privacy on the Ground: Driving Corporate Behavior in the United States and Europe, Kenneth A. Bamberger and Deirdre K. Mulligan
How Not to Network a Nation: The Uneasy History of the Soviet Internet, Benjamin Peters
Hate Spin: The Manufacture of Religious Offense and Its Threat to Democracy, Cherian George
Big Data Is Not a Monolith, edited by Cassidy R. Sugimoto, Hamid R. Ekbia, and Michael Mattioli

Big Data Is Not a Monolith

edited by Cassidy R. Sugimoto, Hamid R. Ekbia, and Michael Mattioli

The MIT Press Cambridge, Massachusetts London, England

© 2016 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in Stone Sans and Stone Serif by Toppan Best-set Premedia Limited. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Names: Sugimoto, Cassidy R., editor. | Ekbia, H. R. (Hamid Reza), 1955- editor. | Mattioli, Michael, editor.
Title: Big data is not a monolith / edited by Cassidy R. Sugimoto, Hamid R. Ekbia, and Michael Mattioli.
Description: Cambridge, MA : The MIT Press, [2016] | Series: Information policy series | Includes bibliographical references and index.
Identifiers: LCCN 2016015228 | ISBN 9780262035057 (hardcover) | ISBN 9780262529488 (paperback)
Subjects: LCSH: Big data.
Classification: LCC QA76.9.B45 B555 2016 | DDC 005.7–dc23
LC record available at

10  9  8  7  6  5  4  3  2  1


Contents

Series Editor’s Introduction  vii
Acknowledgments  ix
Introduction  xi
Hamid R. Ekbia, Cassidy R. Sugimoto, and Michael Mattioli

I  Big Data and Individuals  1
1  Big Data, Consent, and the Future of Data Protection  3
Fred H. Cate
2  When They Are Your Big Data: Participatory Data Practices as a Lens on Big Data  21
Katie Shilton
3  Wrong Side of the Tracks  31
Simon DeDeo

II  Big Data and Society  43
4  What If Everything Reveals Everything?  45
Paul Ohm and Scott Peppet
5  Big Data in the Sensor Society  61
Mark Burdon and Mark Andrejevic
6  Encoding the Everyday: The Infrastructural Apparatus of Social Data  77
Cristina Alaimo and Jannis Kallinikos

III  Big Data and Science  91
7  Big Genomic Data and the State  93
Jorge L. Contreras


8  Trust Threads: Minimal Provenance for Data Publishing and Reuse  105
Beth Plale, Inna Kouper, Allison Goodwell, and Isuru Suriarachchi
9  Can We Anticipate Some Unintended Consequences of Big Data?  117
Kent R. Anderson
10  The Data Gold Rush in Higher Education  129
Jevin D. West and Jason Portenoy

IV  Big Data and Organizations  141
11  Obstacles on the Road to Corporate Data Responsibility  143
M. Lynne Markus
12  Will Big Data Diminish the Role of Humans in Decision Making?  163
Michael Bailey
13  Big Data in Medicine: Potential, Reality, and Implications  173
Dan Sholler, Diane E. Bailey, and Julie Rennecker
14  Hal the Inventor: Big Data and Its Use by Artificial Intelligence  187
Ryan Abbott

Conclusion  199
Cassidy R. Sugimoto, Michael Mattioli, and Hamid R. Ekbia

Notes  213
References  225
Contributors  267
Index  275


Series Editor’s Introduction Sandra Braman

This work provides a snapshot of scholarly, popular, and legal frames for big data and information policy. It ranges broadly, offering a wide lens across a technical, economic, social, legal, and policy field that is still taking shape, or shaping itself. The authors provide valuable conceptual frameworks for thinking about the complex multitude of developments under way, and, invaluably, many provide policy and/or practice recommendations as well. It is inevitable—at this nascent stage of experimentation with big data and its uses for decision making of diverse kinds—that there is still much that we don’t know, much about which we need to keep thinking, and much of the past that has been ignored; these authors help us build the research- and theory-building agenda going forward.

For the law, technological innovations generate old problems in old forms, old problems in new forms, and problems new in both nature and form. All three types of problems are raised by big data. Privacy, of course, shows up in every category. Literacy is old but takes on a new form because it must now include numeracy for the sake of effective participatory democracy (DeDeo, this volume; Burdon and Andrejevic, this volume). New problems altogether begin with the regulatory subject itself. Typically, information policy deals with final goods or services and their producers or providers, but with big data, process is all. Issues appear with data entry, software interactions, or algorithm design. Uses must be distinguished for appropriate regulatory treatment. Process is at the heart of Burdon and Andrejevic’s emphasis on the primary role of secondary uses; editorial reliance on the Crawford and Schultz version of procedural due process; and Ohm and Peppet’s concern about the blurred distinction between personally identifiable information and non–personally identifiable information (legal concepts) in big data. There are other challenges. Existing law may be irrelevant.
For the “most of us” who are not highly numerate, the difficulty of understanding what is going on leads to the critically important analytical errors examined by Kent Anderson here. Three lessons from 80-plus years of thinking about new technologies and the law apply. The past matters. Don’t underestimate those who came before you or who work in areas other than your own. Ideas from the past matter too. Thus perhaps the most important chapter is by Fred Cate, who offers conceptual innovations, deep insight, and policy
recommendations based on an enormous depth of knowledge about privacy issues across technologies and over time. The collection presents as futuristic, but much of what is discussed here has a long past. The diminishment of the human that inevitably accompanies statistics has been problematic since record keeping about populations began. Proposals to require data entry licenses appeared as early as the 1950s. Computerized sentencing has been in use since the 1970s. There has already been enough experience with predictive analytics in legal contexts to support a scholarly literature analyzing it. History matters because it provides a check on assumptions; when reading the discussion here it is worth knowing, for example, that on average 50 percent of judges do not use computerized sentencing when it is available because of their concerns regarding validity, efficacy, and fairness. This is a terrain complex enough that taking single steps can be misleading. It sounds good to call for algorithms that take fairness into account, but doing so ignores the long history of computer scientists attempting just that, and what that experience tells us about the difficulties involved. In just one example of pertinent evidence, my own research team found that those responsible for the technical design of the Internet from 1969 to 2009 worked from an explicit consensus regarding the centrality of fairness as a design criterion but had to experiment continuously with operationalizing that commitment because, technically, every approach to doing so necessarily favors one or another type of information flow or interest. Debates over network neutrality today continue to explore the same problem. Ideas presented here also have a past. Burdon and Andrejevic are talking about what Branden Hookway and Manuel DeLanda refer to as the panspectron.
“Trust threads” is a new way of saying “provenance,” long a key concern in the arts; developing information systems that enable trust threads also involves the ontologies of such importance to those in information science who work in interdisciplinary contexts. Several authors use models of information production chains, explicitly or implicitly. Acknowledging the past can provide support, too, for breaking from it. These authors accurately reflect the “state of the conversation” when they largely appear to accept the view that theory is irrelevant when working with big data. (It could be argued that doing so is itself an example of the type of mythologization discussed here.) For those concerned about policy, though, thinking theoretically is critical. One of the goals of the Information Policy book series is to present diverse perspectives on key issues and fundamental matters. Look for future work in the series that thinks through the value of social theory for the design of analytical algorithms and practices to be used in working with big data, and for understanding the data they produce. The question of the governability of big data is another matter that must be taken into account; here, the Yoo and Blanchette collection Regulating the Cloud will be of interest to readers of this collection as well.


Acknowledgments

The origin of this book can, in many ways, be tied back to early discussions in the Center for Research on Mediated Interaction research group at Indiana University’s School of Informatics and Computing. These meetings, attended by the editors of this volume, resulted in an article on the dilemmas of big data, the conclusion of which became the title of this edited collection. In particular, we would like to thank Gary Arave, Tim Bowman, Ali Ghazinejad, Inna Kouper, and Ratan Suri for their continuous engagement with the topic during these conversations. The second critical point in the origin story is a workshop that was held on October 10, 2014, at Indiana University Maurer School of Law. This workshop brought together nineteen individuals from diverse backgrounds: researchers from a variety of domains and across the globe; practitioners from a number of different companies; and students from law and information science. The workshop was jointly funded by Cornerstone Information Systems Inc., Indiana University School of Informatics and Computing, Indiana University Maurer School of Law, and the Center for Intellectual Property Research. We would like to thank our sponsors for giving us an opportunity to bring together many of the scholars who became contributors to this volume. Lastly, we would like to thank our students—Faith Bradham, Maureen Fitz-Gerald, Andrew Tsou, and Nora Wood—for their assistance in the workshop and with manuscript preparation.

Introduction Hamid R. Ekbia, Cassidy R. Sugimoto, and Michael Mattioli

Human brains and computers probably have less in common than we tend to think. The common metaphor that human brains are like computers might be more misleading than enlightening, except in the case of idiot savants or “mental calculators” (Ball 1960). Johann Martin Zacharias Dase (1824–1861) was such a human being. Endowed with an uncanny sense of numbers, he could tell, without counting, how many sheep were in a field or how many words were in a sentence. He multiplied, in his head, 79,532,853 × 93,758,479 in fifty-four seconds, two twenty-digit numbers in six minutes, two forty-digit numbers in forty minutes, and two hundred-digit numbers in eight hours, forty-five minutes. Musing on individuals such as Dase, Douglas Hofstadter (1979, 567) dispels the idea that “they do it by some mysterious, unanalyzable method,” asserting instead “that their minds race through intermediate steps with the kind of self-confidence that a natural athlete has in executing a complicated motion quickly and gracefully.” These days, the computer in one’s pocket far exceeds the calculating speed of Dase and those like him, multiplying two hundred-digit numbers in a small fraction of a second. This raises a provocative question: Do these machines merely race through myriad calculations at thundering speed, or is there something more—something “mysterious” and “unanalyzable,” to borrow from Hofstadter—in what they do?

The foregoing question is imbued with a practical urgency by the fact that our society is increasingly reliant on data analytic techniques—a sophisticated version of counting and calculation—as the basis of important economic, political, and cultural decisions. Although we rarely call on computers to count the sheep in a field, we do ask them to tally clicks and traffic on web pages, find patterns in stock trades in financial markets, make sense of our tastes and dispositions, identify linguistic correlations among words in large corpora of text, and much more.
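As a concrete check on the scale of Dase’s feat, the following Python sketch (ours, not from any chapter in this volume; the function name is invented) reproduces his multiplications with ordinary arbitrary-precision integers:

```python
import random
import time

def dase_benchmark():
    """Reproduce Dase's feats with ordinary machine arithmetic."""
    # The 8-digit multiplication Dase performed in his head in 54 seconds:
    product = 79532853 * 93758479
    print(product)  # 7456879327810587

    # Two random 100-digit numbers, a multiplication that took Dase
    # eight hours and forty-five minutes:
    a = random.randrange(10**99, 10**100)
    b = random.randrange(10**99, 10**100)
    start = time.perf_counter()
    c = a * b
    elapsed = time.perf_counter() - start
    print(f"100-digit product computed in {elapsed:.6f} s")
    return product

dase_benchmark()
```

On commodity hardware the 100-digit product that took Dase nearly nine hours completes in well under a millisecond, which is precisely what gives the question above its force.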
Future conceptual, practical, and policy debates over big data, in short, will depend on the answer to the foregoing question. If, behind their sophisticated models and techniques, big data computers cannot show us a reality that is nuanced and fundamentally sensitive to human affairs, should we be wary of relying on them to understand, predict, and guide our decisions and actions? If, on the other hand, there is more to them than sheer
speed—if they have mysterious and unanalyzable capabilities—should we rely on them without hesitation? Although none of the chapters in this volume address this issue head-on, each illuminates it from a different angle. In this introductory chapter, we draw on these insights in order to revisit what we have elsewhere described as the “bigger dilemmas” of big data (Ekbia et al. 2015). This will allow us to examine the chapters in terms of the issues, challenges, and dilemmas that big data is likely to introduce to individuals, societies, and organizations, and more broadly, to the role of science as modernity’s dominant mode of inquiry and understanding. Our aim here is to show the complex set of problems, practices, and policies that emerge from big data.

Epistemology of Big Data: How We Know What We Know

The emergence and widespread adoption of big data techniques are challenging some of our most entrenched understandings of the character, scope, and validity of knowledge in science and beyond. The change applies to the whole gamut of human knowledge—our understanding of nature, culture, and even ourselves. Let us briefly examine these one at a time, starting with scientific knowledge. In modern times, our knowledge of natural and social phenomena has been largely influenced and guided by theories and models that derive their legitimacy from reigning paradigms in various disciplines, and that have, in turn, driven empirical investigations in the sciences. These paradigms have by and large determined the kinds of questions that could be legitimately asked, the kinds of answers that were given, and the kinds of evidence (data) that were garnered in support of the answers (Kuhn 1962). The advent of big data techniques has challenged this “theory-driven” approach to science, and according to some commentators, has in fact displaced traditional epistemology in favor of a “data-driven science” that can do away with theory altogether (Anderson 2008).
The transformation of modern biology from a quintessentially “wet” science to a data-driven science (discussed by Jorge L. Contreras in this volume) is a case in point. This transformation has changed the character of biological sciences such as genomics, turning them into an engineering practice. Some celebrate this change as evidence of the triumph of big data techniques over less sophisticated methods, while others lament a valuable scientific method falling out of use (Woese 2004; Callebaut 2012). What seems indisputable, however, is the emergence of the legal, ethical, and practical issues highlighted in various chapters of this volume: privacy, data protection, data governance (protection, release, curation, etc.), and patents (Contreras); trust, provenance, and reuse of data (Beth Plale and colleagues, this volume); and so forth. These considerations largely determine how science is conducted in our times. Big data techniques are similarly changing our understanding of contemporary social life. Cristina Alaimo and Jannis Kallinikos (this volume) demonstrate how social media
shape our behavior as well as our understanding of each other through an “encoded and computed … sociality [that] constitutes the core of everyday life on social media platforms, and by extension, social data generation and use.” In so doing, these platforms provide normative mechanisms that influence how we behave. Social media services, personal data tracking tools, and other platforms that portray “good users” as those who openly share their personal information, for instance, tacitly encourage certain types of behavior, labeling individuals as nonconforming, laggards, or even “bad” citizens if they refuse to comply (Katie Shilton, this volume). This bias is further enhanced by the constellation of institutions that have an interest in predicting our behavior—businesses, governments, health providers, insurance companies, banks, and credit lenders, to name a few. Supported by a pervasive and embedded infrastructure, these institutions might in fact have given rise to a “sensor society” that detects and tracks every aspect and behavior of our lives (Mark Burdon and Mark Andrejevic, this volume). Cultural and professional practices are also being transformed by big data techniques. The way we understand special categories of people, such as patients and health providers, for example, is changing drastically as big data techniques infiltrate the practice of medicine. Dan Sholler, Diane E. Bailey, and Julie Rennecker (this volume), examining the use of clinical decision support systems, randomized clinical trials, and electronic health records in medicine, demonstrate the reductionist character of big data approaches that tend to eliminate the “essential humanism” of the practice of medicine. Patients may perceive this as diminishing the quality of care that they deserve, while physicians may feel that these new systems threaten their autonomy and challenge the existing hierarchy of expertise in medicine. 
To discuss these issues, Sholler, Bailey, and Rennecker introduce the notion of “data harm” and its various manifestations. Data sharing and interoperability, for instance, facilitate a smoother flow of information on patients across health providers, networks, and systems, yet can also lead to privacy violations. The impact of data harms is exacerbated in a potential future where “everything reveals everything,” as Paul Ohm and Scott Peppet convincingly speculate in their chapter. In such a world, the application of big data techniques to predict our dispositions and behaviors might ultimately convince us that we are, indeed, nothing more than what the algorithms, with all their biases, flaws, and inaccuracies, make us out to be: numbingly calculable, boringly consistent, and rigidly predictable. This new sense of the self, more imposed than discovered by algorithms, as Ohm and Peppet argue, has long-lasting implications for us as individuals and societies: “Big data helps us ‘know,’ within some degree of certainty, more than the data can tell us at face value.” Our trust in the power of these algorithms, based on the assumption that there is nothing mysterious and unanalyzable about them, might nonetheless expose us to the risk of landing on the “wrong side of the tracks” (Simon DeDeo, this volume).



Methodology of Big Data: How We Arrive at Knowledge

One of the appeals of big data computing is its professed power for prediction. Practitioners in business, industry, and government see great opportunities for commerce, innovation, and (perhaps more controversially) social engineering in this predictive potential. Big data techniques apparently enable business corporations and marketers to predict our desires for particular commodities on the basis of our past consumption behavior (what we wear, eat, drink, drive, etc.), cultural taste on the basis of our earlier choices (what we read, watch, listen to, etc.), and political attitudes on the basis of our social behavior (whom we talk to, call, email, or listen to, etc.). The rush of interest has sparked the development of new technological infrastructures, business models, and policy frameworks as well as public debates about the legitimacy of each of these. British researchers, for instance, unveiled a program called Emotive “to map the mood of the nation” using data from Twitter and other social media. By analyzing two thousand tweets per second and rating them for expressions of one of eight human emotions (anger, disgust, fear, happiness, sadness, surprise, shame, and confusion), the researchers claimed that they could “help calm civil unrest and identify early threats to public safety” (BBC 2013, para. 3). While the value of this kind of project is clear from the perspective of law enforcement, it is not difficult to imagine scenarios where this type of threat identification might function erratically, leading to more chaos than order.
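Emotive’s internals are not public; as a rough illustration of the general kind of lexicon-based emotion rating such systems build on, consider the following sketch (the word lists and function name are invented for illustration, not taken from Emotive):

```python
# Illustrative sketch only: a naive lexicon-based emotion tagger of the
# general kind a system like Emotive might build on. The keyword lists
# below are toy examples; a real system would use far richer features.
EMOTIONS = ("anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "shame", "confusion")

LEXICON = {
    "anger": {"furious", "outraged", "angry"},
    "disgust": {"gross", "disgusting", "revolting"},
    "fear": {"scared", "terrified", "afraid"},
    "happiness": {"happy", "delighted", "thrilled"},
    "sadness": {"sad", "heartbroken", "grieving"},
    "surprise": {"shocked", "unexpected", "wow"},
    "shame": {"ashamed", "embarrassed"},
    "confusion": {"confused", "baffled"},
}

def rate_tweet(text: str) -> str:
    """Return the emotion whose keywords best match the tweet ('none' if no hit)."""
    words = set(text.lower().split())
    scores = {emotion: len(words & LEXICON[emotion]) for emotion in EMOTIONS}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "none"

print(rate_tweet("Absolutely thrilled and happy with the result!"))  # happiness
```

Even this toy version makes the reductionism visible: every tweet must land in one of eight bins, or in none at all.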
Even if we assume that human emotions can be meaningfully reduced to eight basic categories (what of complex emotions such as grief, annoyance, contentment, etc.?), and also assume cross-contextual consistency in the expression of emotions (how does one differentiate the “happiness” of the fans of Manchester United after a winning game from the expression of the “same” emotion by the admirers of the royal family on the occasion of the birth of the heir to the throne?), technical limitations are quite likely to undermine the predictive value of the proposed system. Available data from Twitter have been shown to be limited in scale (spatial and temporal), scope, and quality (boyd and Crawford 2012), making some observers wonder if this kind of “nowcasting” could blind us to the more hidden, long-term undercurrents of social change (Khan 2012). Such limitations show that big data does not typically spring into existence ready for analysis. Rather, in order to be usable, it has to be “cleaned” or “conditioned” (O’Reilly Media 2011). This involves human work—determining which attributes and variables to keep, and which ones to ignore, for example (Bollier 2010, 13). This human interpretation can, however, “spoil the data” by injecting subjective judgment into the process (Bollier 2010, 13). This risk is particularly problematic where personal data are concerned. The potential for “reidentification,” for instance, undermines the thrust in big data research toward data liquidity—that is, the free flow of data across institutional and organizational boundaries. The dilemma between the ethos of data sharing, liquidity, and transparency, on the one hand, and risks to privacy and anonymity through reidentification, on the other, affects
diverse areas, including medicine, location-tagged payments, geolocating mobile devices, and social media (Ohm and Peppet; Contreras). Kent R. Anderson (this volume) brings many of these methodological concerns into sharp focus in his discussion of the potential pitfalls of “data reanalysis” that led to a false correlation between measles-mumps-rubella vaccines and autism diagnoses; the limitations and inapplicability of big data techniques to the prediction of election results; and repeated overestimations of the prevalence of flu in Google’s trend analysis. There is no shortage of methodological issues introduced by the unexamined and uncritical adoption of big data techniques to the understanding and prediction of scientific, socioeconomic, and cultural-political phenomena.

Law and Ethics of Big Data: Why We Know

The foregoing issues prompt legal and ethical challenges that span many areas of law and policy, including intellectual property, privacy, informed consent, and contract law. Each chapter in this volume shows that traditional thinking in law, ethics, and moral philosophy is ill prepared to tackle many of these new challenges. Burdon and Andrejevic call attention to the opaqueness of the large volumes of data that are collected about individuals as they navigate the physical and digital environment. Importantly, they show that most people don’t realize the amount of data collected about them, or what these data reveal. Google’s scanning of Gmail messages is a clear example of such practices. Google’s theory of “implied consent,” based on the assumption that the adoption of a platform evidences awareness of its information practices, highlights the potential gap between corporate behavior and individual perceptions. Fred H.
Cate brings this issue into focus by reminding us that a large amount of the data about individuals is held by businesses with which they have infrequent contact (such as when they buy a car or appliance) or by third parties with whom they have no direct dealings. This leads Cate to lay bare the “illusion of choice” that springs from common practices of notice and informed consent. The burden this puts on individuals only adds to the already complex, cryptic, and imperceptible practices that target them. The burden is multiplied by the “patchwork of data protection agreements” that undermines corporate responsibility toward data, as M. Lynne Markus argues in her chapter. Highlighting the variety of social and organizational contexts in which data are used, she illustrates the inadequacies of the current frameworks, especially in the United States, where “solutionism” reigns and data protection laws are fragmented. The failure of educational institutions, such as business schools and certification programs in industry, to train students on issues of data protection only exacerbates and perpetuates the current situation. It is in this spirit that DeDeo takes us to the “wrong side of the tracks” in his analysis of the ethical dimensions of inference and prediction, with a special focus on the discriminatory powers of algorithms. DeDeo argues that “debates concerning equity, discrimination,
and fairness, previously understood as the domain of legal theory and political philosophy, are now unavoidably tied to mathematical and computational questions.” To address this, he suggests that algorithms should be reverse engineered to uncover hidden causal models, and that new algorithms should be created that take justice and equity into account. These are just a sample of the kinds of legal and ethical issues introduced, enhanced, or else brought into prominence through the widespread collection of data by corporations, government agencies, and other organizations.

Political Economy of Big Data: Who Benefits

The explosive growth of data in recent years is accompanied by a parallel development in the economy—namely, an explosive growth of wealth and capital in the global market. This creates an interesting question about another kind of correlation, one that on the surface seems extrinsic to big data. Crudely stated, the question has to do with the relationship between the growth of data and the growth of wealth and capital. What makes this a particularly significant but also paradoxical question is the concomitant rise in poverty, unemployment, and destitution for the majority of the global population. A distinctive feature of the economic recession of 2008 and the subsequent economic “recovery” was its strongly polarized character: a large amount of wealth was created, but it was concentrated largely in the hands of a small group of people. This gives rise to important questions regarding the relation between data, wealth, and poverty; the underlying mechanisms that drive their parallel growth; the key beneficiaries of that growth; and its inequitable distribution across individuals, communities, sectors, and populations. These are new and pressing concerns prompted by the explosive wealth that big data is generating. Such questions have been recently posed and tackled by writers and commentators of various persuasions.
Individuals bear a great deal of the burden when it comes to the production and generation of data, whether in the form of communication on social media (Alaimo and Kallinikos; Burdon and Andrejevic), data tracking of quantified selves (Shilton), or patients dealing with health issues (Sholler, Bailey, and Rennecker; Michael Bailey, this volume). That the burden is often disguised under engaging activities such as fast communication, free entertainment, or frequent gameplay softens, but does not eliminate, the fact that huge financial and economic interests are involved here. The sheer amount of wealth accumulated by large and small corporations that own and operate the platforms for these activities—social media companies (Facebook, Twitter, YouTube, Instagram, etc.), game companies (e.g., Blizzard), cloud and micro-task service companies (e.g., Amazon and its Mechanical Turk)—speaks to the potential of big data for wealth generation. At the same time, the fact that the generators and contributors of these data reap little (as in the case of Mechanical Turkers) or nothing in terms of monetary benefits reveals an increasingly asymmetrical arrangement as well as an inequitable economic system, the gaps of which far exceed those of the Great Depression (Stiglitz 2014). That the state apparatus, sustained by tax
money, funds a good part of the supporting infrastructure for these platforms (see, for example, Contreras) further exacerbates the situation. The long-term social, economic, and political issues that emerge from these asymmetries cannot be easily discounted as matters of ideological advocacy or the incitement of class warfare. The depletion of resources that have traditionally supported social security, welfare, and other umbrella programs, driven by the expansion of an unemployed workforce that lives on subsistence income from micro-tasks, promises a future even more precarious than our current circumstances (Ekbia and Nardi 2014). Big data contributes to this in both direct (Ekbia et al. 2015) and indirect ways (Ohm and Peppet), and no meaningful analysis of big data can be developed without giving due consideration to this issue. That even medical researchers are, as Anderson reports in his chapter, “susceptible to the charms of temporary business friends” should point to the severity of the not-so-unintended consequences of big data. Equally significant, the “data gold rush in higher education” that Jevin D. West and Jason Portenoy discuss in their chapter provides a telling account of fractured impressions and unrealistic expectations of an academic field (data science) that has grown out of the big data explosion.

The Scope and Structure of the Book

These epistemological, methodological, legal, ethical, political, and economic issues come up in various guises in different chapters of this volume, providing provocative and novel examples in diverse domains. Approaching these issues from diverse intellectual traditions and disciplinary perspectives, the authors have each focused on particular aspects of big data, analyzing relevant themes, problems, and practices in their areas of interest. The crosscutting conceptual fabric that emerges from this diverse set of analyses is hard to appreciate from any single theoretical perspective.
With this in mind, we have tried to meet the challenge of organizing the chapters on two orthogonal dimensions: thematic and analytic.

Thematic Structure

To do justice to the diversity of themes and perspectives represented in the following chapters, we draw on the conceptual framework of “computerization movement” (Kling and Iacono 1995), which we proposed applying to the dilemmas of big data in our earlier work (Ekbia et al. 2015). In that critical review of the literature, we identified three common conceptualizations of big data:

1.  The product-oriented perspective, which tends to emphasize the attributes of data (e.g., the five v’s of volume, variety, velocity, value, and veracity), highlights the historical novelty of big data compared to earlier eras. In so doing, it brings forth the scale and magnitude of the changes catalyzed by big data, and the consequent prospects and challenges generated by these transformations.


Hamid R. Ekbia, Cassidy R. Sugimoto, and Michael Mattioli

2.  The process-oriented perspective underscores the novelty of the computational processes involved in dealing with big data, having to do with the storage, management, aggregation, searching, and analysis of data. By focusing on those attributes of big data that derive from complexity-enhancing structural and relational considerations, this view seeks to push the frontiers of computing technology in handling those complexities.

3.  The cognition-based perspective draws attention to the challenges that big data poses to human beings in terms of their cognitive capacities and limitations. In particular, the implications for the conduct of science have become a key concern to commentators who adopt this vantage point.

None of these perspectives captures the full scope of big data, nor are they mutually exclusive. Together, however, they provide useful insights and a good starting point for inquiry. Little attention is paid in these points of view to the socioeconomic, cultural, and political shifts that underlie the phenomenon of big data. To bring these shifts into focus, we have proposed the perspective of computerization movement, which draws parallels with the broader notion of “social movement” (della Porta and Diani 2006). The development of computing seems to have followed a recurring pattern wherein an emerging technology is promoted by loosely organized coalitions that mobilize groups and organizations around a utopian vision of a preferred social order. Rather than emphasizing the features of technology, organizations, and environment, the perspective of computerization movement considers technological change “in a broader context of interacting organizations and institutions that shape utopian visions of what technology can do and how it can be used” (Elliot and Kraemer 2008, 3).
Prominent examples of computerization movements include the personal computing movement of the 1980s epitomized by Apple Inc.’s release of its Macintosh platform and its accompanying commercial at the 1984 Super Bowl, the nationwide mobilization of Internet access during the Clinton-Gore administration in the United States around the vision of a world where people can live and work at the location of their choice, and the free/open-source software movement that organized a vision around the ideals of freedom and liberty (Ekbia and Gasser 2007). The gap between these visions and socioeconomic realities, along with the political strife associated with it, is often lost in articulations of these visions. Like the key computerization movements of the past, the big data movement presents a utopian vision of a preferred social order—one where large amounts of data collected on natural and social phenomena, and analyzed through algorithmic techniques of machine learning, empower individuals to govern their lives (in regard to health, wealth, education, etc.), enable business corporations to leverage predictive models of consumer behavior in favor of their bottom line, and allow governments to handle their affairs in a hands-off manner—to “govern at a distance,” in other words (Miller and Rose 2013, 34). By making different arenas of contemporary life intelligible, big data technologies create the conditions of possibility for “the government of a population, a national economy, an enterprise, a



family, a child or even oneself” (ibid., 31). From this perspective, these technologies can be meaningfully considered “technologies of government” in the sense originally developed by Michel Foucault (1986). This volume offers a vision of how these technologies work, the gaps that separate them from the realities of the “implementations” on the ground, and the types of tensions that emerge through this gap. We invite the reader to pay attention to all three dimensions—visions, implementations, and tensions—as they examine each chapter.

Analytic Structure

The authors of the chapters have also, understandably, focused their arguments on distinct levels of practice (individuals, organizations, and societies) or on the institution of academic science as a key site where the dilemmas of big data come to prominence in a vivid manner. Therefore, we have structured the book according to these levels and domains, highlighting some of the central tensions that arise at each level.

Big Data and Individuals: Opacity and Complexity

Every society creates the image of an “idealized self” that presents itself as the archetype of success, prosperity, and good citizenship. In contemporary societies, the idealized self is someone who is highly independent, engaged, and self-reliant—a high-mobility person, with a potential for reeducation, reskilling, and relocation; the kind of person often sought by cutting-edge industries such as finance, medicine, media, and high technology. This “new man,” according to sociologist Richard Sennett (2006, 101), “takes pride in eschewing dependency, and reformers of the welfare state have taken that attitude as a model—everyone his or her own medical advisor and pension fund manager.” Big data has begun to fill a critical role in both propagating the image and developing the model, playing out a three-layer mechanism of social control through monitoring, mining, and manipulation.
Individual behaviors, as will be analyzed in the chapters in part I, are under continuous monitoring through big data techniques. The collected data are then mined for various economic, political, and surveillance purposes: corporations use the data for targeted advertising, politicians use them for targeted campaigns, and government agencies use them for targeted monitoring of all manner of social behavior (health, finance, criminal, security, etc.). What makes these practices particularly daunting and powerful is their capacity to identify patterns that are not detectable by human beings, and are indeed unavailable before they are mined (Chakrabarti 2009). Big data is dark data, in a serious sense of the term. Furthermore, these same patterns are fed back to individuals through mechanisms such as recommendation systems, creating a vicious cycle of regeneration. These properties of autonomy, opacity, and generativity of big data bring the game of social engineering to a whole new level, with its attendant benefits and pitfalls.



Big Data and Societies: Control and Complicity

These attributes of big data endow it with a kind of novelty that is productive and empowering, yet constraining and overbearing. We live in a society that is increasingly obsessed with the future and with predicting it. Enabled by the potentials of modern technoscience—from genetically driven life sciences to computer-enabled social sciences—contemporary societies take more interest in what the future, as a set of possibilities, holds for us than in how the past emerged from a set of possible alternatives; the past is interesting, if at all, only insofar as it teaches us something about the future. Thanks to big data techniques and technologies, we can now make more accurate predictions, and thus potentially better decisions for dealing with health epidemics, natural disasters, or social unrest. At the same time, however, one cannot fail to notice the imperious opacity of data-driven approaches to science, social policy, cultural development, financial forecasting, advertising, and marketing. The chapters in part II shed light on both these faces of big data.

Big Data and Science: Theory and Futurity

Many of the epistemological critiques of big data have revolved around the assertion that theory is no longer necessary in a data-driven environment. We must ask ourselves what it is about this claim that strikes scholars as problematic. The virtue of a theory, according to some philosophers of science, is its intellectual economy—that is, its ability to allow us to obtain and structure information about observables that are otherwise intractable (Duhem [1906] 1954, 21ff). “Theorists are to replace reports of individual observations with experimental laws and devise higher level laws (the fewer, the better) from which experimental laws (the more, the better) can be mathematically derived” (Bogen 2013).
In light of this, one wonders whether the objections to data-driven science have to do with the practicality of such an undertaking (e.g., because of the bias and incompleteness of data), or with the possibility that there are things theories can tell us that big data cannot. This calls for deeper analyses of the role of theory, along with the degree to which big data approaches challenge or undermine those roles. What new conceptions of theory, and of science for that matter, do we need in order to accommodate the changes brought about by big data? What are the implications of these conceptions for theory-driven science? Does this shift change the relationship between science and society? What type of knowledge is potentially lost in such a shift? These are some of the questions addressed in the chapters in part III of the book.

Big Data and Organizations: Agency and Responsibility

Similar challenges also face organizations—whether they are private enterprises or government agencies—that carry a great deal of responsibility toward their clientele in terms of the protection of data, privacy rights, data ownership, and so on. The chapters in part IV discuss some of these challenges. On a larger scale, research and practice in big data require vast resources, investments, and infrastructure that are available only to a select group of players. Historically, technological developments of this magnitude have fallen within the purview



of government funding, with the intention (if not the guarantee) of universal access for all. This is perhaps the first time in modern history, particularly in the United States, that a technological development of this magnitude has been left largely in the hands of the private sector. Although the full implications of this are yet to be revealed, disparities between the haves and have-nots are already visible between major technology companies and their users, between government and citizens, and between large and small businesses and universities, giving rise to what can be called the “data divide.” This is not limited to divisions between sectors; it also creates further divides within them.

Concluding Thoughts

In brief, the depth, diversity, and complexity of the questions and issues facing our society are proportional, if not exponential, to the amount of data flowing in the social and technological networks that we have created. Big data computing can potentially help us deal with these questions and issues. Despite their growing calculative power, however, computers remain, at best, super–idiot savants. They cannot be trusted, as such, with deciding how this power should be put to use. As the following chapters make clear, human participation, interpretation, and judgment remain crucial in all stages of big data development. This, we hope, is the take-home lesson of this book.

I  Big Data and Individuals

1  Big Data, Consent, and the Future of Data Protection

Fred H. Cate

We live in a world increasingly dominated by the collection, aggregation, linkage, analysis, storage, and sharing of vast collections of data pertaining to individuals. Some of those data we generate and reveal either by choice, such as through social media and e-mail, or via compulsory disclosure, as a condition, for example, of banking or traveling. Other data are collected by sensors, which surround us in smartphones, tablets, laptops, wearable technologies and sensor-enabled clothing, RFID-equipped passports, cars, homes, and offices. Increasingly, even public spaces are equipped with video cameras that recognize faces and gaits, microphones that record conversations and detect ambient noises, and other sensors that collect and store personally identifiable information. With the growth of the Internet of Things, connected sensors process an astonishing volume and variety of data without our even being aware (see Burdon and Andrejevic, this volume). According to a study by Hewlett-Packard (2015), nine out of ten of the most popular Internet-connected devices carry personal data. Still more data are calculated or inferred based on demographic information, census data, and past behavior. Those data are created, not collected. Moreover, data that may not originally appear personally identifiable may become so, or may generate personally identifiable information through aggregation and correlation. A large volume of these data are held by businesses with which we have infrequent contact (such as when we buy a car or appliance) or by third parties with whom we have no direct dealings. According to the New York Times, Acxiom alone engages in fifty trillion data transactions a year, almost none of which involve collecting data directly from individuals (Singer 2012). Thanks, too, to a vast array of reporting requirements, sweeping surveillance, and partnerships with industry, there are few data to which governments do not have access. 
As a result, we are witnessing an explosion not only in the volume of personal data being generated but also in the comprehensiveness and granularity of the records those data create about each of us. It is part of a phenomenon we often describe as “big data,” meaning data sets that are not only large but increasingly complete and granular as well, and might be contrasted with data analysis that involves incomplete or sample data, or data that are available only in aggregated or more abstract forms (Mayer-Schönberger and Cukier 2013).



The proliferation and interconnection of big data sets raise significant privacy issues. The challenge we face, literally around the world, is to evolve better, faster, and more scalable mechanisms to protect data from harmful or inappropriate uses, without interfering with the benefits that data are already making possible today and promise to make even more widespread in the future. Meeting this challenge will require addressing some key obstacles, one of the most significant of which is our current reliance on individual consent. This chapter addresses the inadequacies of consent as a basis for privacy protection, and considers alternative approaches to achieving privacy protection that is effective, scalable, and delivers benefits not only for individuals but also for society as a whole.

Notice and Consent in a World of Big Data

The Growing Focus on Notice and Consent at the Time of Collection

Most data protection laws place some or all of the responsibility for protecting privacy on individual data subjects through the operation of notice and consent. This is evident in the Organization for Economic Cooperation and Development (OECD) Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, adopted in 1980, which serve as the basis for most modern privacy laws. The OECD (1980, para. 7) guidelines’ collection limitation principle requires that personal information be collected, “where appropriate, with the knowledge or consent of the data subject.” Under the purpose specification and use limitation principles, the reuse of personal information is limited to the purposes originally specified at the time of collection, and “others as are not incompatible with those purposes.” The only exceptions are “(a) with the consent of the data subject; or (b) by the authority of law” (ibid., paras. 9–10).
In short, if an individual can be persuaded to consent, then under laws adopted in conformance with the OECD guidelines, the terms of that consent—however narrow or broad, carefully considered or accepted by default—govern subsequent use. This focus on the role of the individual is especially evident in the United States. In 1998, the US Federal Trade Commission (FTC), after reviewing the “fair information practice codes” of the United States, Canada, and Europe, reported to Congress that “the most fundamental principle is notice … [because] [w]ithout notice, a consumer cannot make an informed decision as to whether and to what extent to disclose personal information.” The FTC (1998, para. 8) continued, “The second widely-accepted core principle of fair information practice is consumer choice or consent … [over] how any personal information collected from them may be used.” US statutes and regulations have tended to parallel the FTC’s emphasis on notice and choice. The Obama administration’s 2012 Consumer Privacy Bill of Rights included as its first principle: “Consumers have a right to exercise control over what personal data companies collect from them and how they use it” (White House, Office of Science and Technology Policy 2012). And its 2015 discussion draft of a Consumer Privacy Bill of Rights Act requires



“accurate, clear, timely, and conspicuous notice about the covered entity’s privacy and security practices,” and that “each covered entity shall provide individuals with reasonable means to control the processing of personal data about them in proportion to the privacy risk to the individual and consistent with context” (White House 2015). The focus on notice and consent at the time of collection is not limited to the United States. The European Union’s Data Protection Directive, for example, is significantly focused on individual choice. Article 7 of the directive provides seven conditions under which personal data may be processed. The first is “the data subject has unambiguously given his consent.” Article 8 restricts the processing of sensitive data, but then provides that the restriction will not apply where “the data subject has given his explicit consent to the processing of those data.” Article 26 identifies six exceptions to the provision prohibiting the export of personal data to non-European countries lacking “adequate” data protection. The first is that “the data subject has given his consent unambiguously to the proposed transfer” (European Union 1995). The pending draft of the EU General Data Protection Regulation also provides a significant role for individual choice; the draft regulation refers to “consent” more than a hundred times. Consent is the first basis listed for the lawful processing of data (art. 6), a condition for the processing of personal data concerning a child under thirteen (art. 8), a basis for processing sensitive data (art. 9), an exception to the restriction on profiling (art. 20), an exception to the prohibition on exporting personal data to countries lacking adequate data protection (art. 44), and an exception to the restriction on the reuse of personal data concerning health (art. 81) (European Parliament 2013). 
The Asia-Pacific Economic Cooperation (2005) Privacy Framework, adopted in 2004, is similarly focused on notice and consent: “Where appropriate, individuals should be provided with clear, prominent, easily understandable, accessible and affordable mechanisms to exercise choice in relation to the collection, use and disclosure of their personal information.” And in Canada, Philippa Lawson and Mary O’Donoghue (2009) have written, “The requirement for consent to the collection, use, and disclosure of personal information is a cornerstone of all three [of Canada’s] common-law regimes.”

The Mounting Critique of Notice and Consent

While this approach is understandably attractive, and continues to be relevant and important in situations where choice about data collection is appropriate and meaningful, there is mounting evidence that in many settings, individual choice is impractical and undesirable. This is especially true in a world of big data. In the late 1990s, academics originated the critique of notice and consent as primary mechanisms for data protection. Professor Paul Schwartz (1999) argued that “social and legal norms about privacy promise too much, namely data control, and deliver too little.” Since then, the criticism has reached a crescendo among academics on both sides of the Atlantic, with scholars increasingly in agreement that the “control-based system of data



protection … is not working,” while often disagreeing on what should be done about it (Cate 2006; see also Cate 2014; Lane et al. 2014; Schermer, Custers, and Van der Hof 2014). Perhaps more surprisingly, the critique of notice and consent has more recently been echoed by regulators, industry, and privacy advocates. For example, the FTC (2010, 2012) noted the dangers of an overreliance on notice and choice in its staff and commission reports on the future of privacy protection. In its staff report, for instance, the FTC (2010) wrote,

In recent years, the limitations of the notice-and-choice model have become increasingly apparent. … [C]onsumers face a substantial burden in reading and understanding privacy policies and exercising the limited choices offered to them. … Additionally, the emphasis on notice and choice alone has not sufficiently accounted for other widely recognized fair information practices, such as access, collection limitation, purpose specification, and assuring data quality and integrity.

In two 2014 reports on privacy and big data, the Executive Office of the President (2014), even while continuing to pursue consent-based privacy legislation, concluded that the advent of big data “may require us to look closely at the notice and consent framework that has been a central pillar of how privacy practices have been organized for more than four decades. … In the words of the President’s Council of Advisors for Science & Technology, ‘The notice and consent is defeated by exactly the positive benefits that big data enables: new, non-obvious, unexpectedly powerful uses of data.’” On December 1, 2009, the European Union’s Article 29 Data Protection Working Party and the Working Party on Police and Justice adopted a “joint contribution” on the future of privacy, in which they argued that “consent is an inappropriate ground for processing” in the “many cases in which consent cannot be given freely, especially when there is a clear unbalance between the data subject and the data controller (for example in the employment context or when personal data must be provided to public authorities)” (European Commission 2009). The Article 29 Data Protection Working Party (2011, 23) expanded on this view in its white paper on consent, in which it concluded, “If incorrectly used, the data subject’s control becomes illusory and consent constitutes an inappropriate basis for processing.” The critique of notice and consent as the basis for data protection tends to focus on eight areas.

Complexity  Notices are frequently complex. This is not surprising given that the laws and business practices they describe are often complex as well. Moreover, notices typically read like contracts because regulators have chosen to enforce them like contracts (In the Matter of Eli Lilly and Company 2002). As a result, individuals are overwhelmed with many long, detailed privacy policies. In March 2012, the UK consumer watchdog Which? 
reported that when PayPal’s privacy notice is added to its other terms of use disclosed to consumers, the total word count is 36,275, longer than Hamlet (at 30,066 words), and iTunes’ comes to 19,972 words, longer than Macbeth (at 18,110 words) (Parris 2012). One study calculated that to read the privacy policies of just the most popular websites would take an individual 244 hours—or more than 30 full working days—each year (McDonald and Cranor 2008).



Regulators, industry groups, academics, and others have proposed a variety of ways of making notices more accessible, including shortened notices, layered notices, standardized notices, and machine-readable ones. To date, none of these have succeeded in practice, in large part because they ignore the fundamental complexity of the myriad uses of data that notices are trying to describe, and the legal liability that can result from inadequate or inaccurate descriptions. As a result, a meaningful assessment of even a single privacy policy may require a sophisticated understanding of how data are used today—an understanding that most consumers lack. It is therefore not surprising that few individuals have ever read a complete privacy notice, even though it is often the foundation of modern data protection. The advent of big data only exacerbates the challenge of using notices to convey in understandable and concise language the extraordinarily complex as well as unpredictable ways in which data may be used (and reused) in the future. The challenge presented by big data isn’t just the size of the data sets or the ubiquity with which data are collected but also the fact that big data “thrives on surprising correlations and produces inferences and predictions that defy human understanding” (Ohm 2014). One can readily agree with Ohm (ibid.) when he asks, “How can you provide notice about the unpredictable and unexplainable?”

Inaccessibility  Today, privacy notices and consent opportunities are inaccessible not only as a result of their length and complexity but also because most data collection and creation take place without direct contact with data subjects. Data are collected or generated by sensors, such as surveillance cameras or microphones, of which the data subject may be only vaguely aware.
In some countries, data protection laws require generic notices, such as the “CCTV in use” warnings found throughout London and other major cities, but these provide data subjects with neither meaningful notice (i.e., where are the cameras, when are they recording, why, what happens to the video they record, how are the privacy rights of the people they record protected?) nor effective options (e.g., avoiding or disabling the cameras). More often, however, there is no opportunity for notice or choice about collection at all, as in the case of the sensors in the more than 6.8 billion handheld phones worldwide; controlled by the more than 100 million lines of code in the average modern personal car; or found in public transportation, public highways, office buildings, consumer appliances, and even toys around the world (Halzack 2015). Inferences, probabilities, predictions, and other data are also created by government and industry at an astonishing pace, and used to classify or qualify individuals for hundreds of reasons ranging from marketing to tax audits. These data are frequently created in ways that are invisible to individuals. Therefore, seeking to govern these activities through a regime of notice and consent seems unworkable, and likely to result in convoluted notices that few people see, and in consent that is either implied from individual inaction or extracted by requiring individuals to click “I agree” as a condition of obtaining services or products they desire.



Ineffectiveness  Few people read notices or act on consent requests, unless they are required to, in which case they almost always grant consent if necessary to get the service or product they want. As the then FTC chair Jon Leibowitz (2009) noted at the first of the commission’s three 2009–2010 Roundtables on Exploring Privacy, “We all agree that consumers don’t read privacy policies”—a remarkable acknowledgment from the federal agency that has probably done the most in the United States to promote them. In reality, this has long been the view of FTC leaders. The lack of consumer response to the financial privacy notices required by the Gramm-Leach-Bliley Financial Services Modernization Act of 1999 prompted the then FTC chair Timothy Muris (2001) to comment,

The recent experience with Gramm-Leach-Bliley privacy notices should give everyone pause about whether we know enough to implement effectively broad-based legislation based on notices. Acres of trees died to produce a blizzard of barely comprehensible privacy notices. Indeed, this is a statute that only lawyers could love—until they found out it applied to them.

The difficulties of reaching and provoking a response from consumers are exacerbated where the party wishing to use the information has no (and may not have ever had) direct contact with the consumer. For example, most mailing lists are obtained from third parties. To require the purchaser of a list to contact every person individually to obtain consent to use the names and addresses on the list would cause delay, require additional contacts with consumers, and almost certainly prove prohibitively expensive. And it could not be done without using the very information that the list purchaser is seeking consent to use. In a world of big data, these issues are greatly magnified. Individual choice is increasingly impractical as the principal way to protect privacy. Dramatic increases in the ubiquity of data collection, volume and velocity of information flows, and range of data users (and reusers) place an untenable burden on individuals to understand the issues, make choices, and then engage in oversight and enforcement.

The Illusion of Choice  Notice all too often creates only the illusion of choice, when, for instance, consent is required to obtain a product or service, or the notice is so broad as to make choice meaningless. As Schwartz (1999) has noted, “One’s clicking through a consent screen to signify surrendering of her personal data for all future purposes is an example of both uninformed consent and a bad bargain.” Similarly, when choice is offered for a service or product that cannot be provided without personal information, individuals are afforded the illusion—but not the reality—of choice. European data protection regulators have addressed this issue in the context of employers asking employees to consent to the collection and use of personal information.
The Article 29 Data Protection Working Party has expressed the view that where “consent is required from a worker, and there is a real or potential relevant prejudice that arises from not consenting, the consent is not valid.” It has made this point repeatedly: “If it is not possible for the worker to refuse it is not consent. … Consent must at all times be freely given. … [A] worker



must be able to withdraw consent without prejudice” (Article 29 Data Protection Working Party 2011). As a result of these views, consent is rarely a basis for data processing in employment contexts in Europe, but it is relied on in many other settings and even in employment contexts outside Europe.

Inadequate Privacy Protection  The widespread reliance on consent ignores the fact that consent does not equal privacy: individuals can enjoy strong privacy protection without consent and can suffer serious incursions into their privacy with consent. As the Institute of Medicine’s Committee on Health Research and the Privacy of Health Information wrote in 2009, “Consent (authorization) itself cannot achieve the separate aim of privacy protection” (Nass, Levit, and Gostin 2009). In fact, scrolling through numerous policies and clicking “I agree” rarely provides meaningful privacy protection. By equating the two, we have actually tended to weaken privacy protections by making them waivable through consent. Notice and consent do not protect us from our own bad, ignorant, unintentional, or unavoidable choices. Contrast consent in privacy with other types of consumer protection laws: under the laws of most countries, a consumer cannot consent to be defrauded, but they can consent to have their privacy violated (Cate 2006). In addition, the energy of data processors, legislators, and enforcement authorities is often expended on notices and consent opportunities rather than on actions that could actually protect privacy. Compliance with data protection laws is usually focused on providing required notices in the proper form at the right time and recording consent as opposed to ensuring that personal information is protected. This is a particular concern in the big data context because notice and consent seem almost certain not to work, for the reasons already identified.
But if regulators continue to focus on notice and consent, the opportunity to enact more meaningful and appropriate privacy protections may be squandered.

False Dichotomy

The preoccupation with consent frequently sets up an artificial dichotomy between personally and non–personally identifiable information. Data protection laws typically do not apply to personal data that are de-identified or anonymized because the data are no longer considered personally identifiable. While businesses are forced to seek consent to collect information defined by laws and regulations as personally identifiable, thousands of other potentially identifying data elements are ignored entirely. This poses a significant challenge in a world of big data, where with sufficient, interconnected data, even de-identified or anonymized data may be rendered personally identifiable. For example, in one study, Professor Latanya Sweeney (2000) showed that 87 percent of the US population is uniquely identified with just three data elements: date of birth, gender, and five-digit zip code. There are well-publicized examples of Google, Netflix, AOL, and others releasing de-identified data sets only to have the data reidentified within days by researchers


Fred H. Cate

correlating them with other data sets (Dwork 2014). A 2009 study in the Proceedings of the National Academy of Sciences demonstrated that Social Security numbers can actually be predicted from publicly available information about many citizens (Acquisti and Gross 2009). As was noted in the Economist, “The ability to compare databases threatens to make a mockery of [data] protections” (“We’ll See You, Anon” 2015). Similarly, previously nonidentifiable data may act to identify unique users or machines in a world of big data. Browser choice and font size, for instance, can provide an accurate, unique online identifier (Kirk 2010). In short, the more data are available, the harder it is to de-identify them effectively. Or as data scientist Cynthia Dwork (2014) recently summed up the situation, “‘De-identified data’ isn’t.”

Burden on the Individual

Linking privacy to individual consent can impose significant burdens on individuals, particularly as the volume and pace of data flows increase. Moreover, while presented as a right, the emphasis on notice and consent in reality generally ends up creating a duty on individuals to make choices they are often not prepared to make, and then to accept the consequences of those choices. We know how consistently individuals ignore notices—not just privacy notices, but also copyright notices when downloading software, disclosure terms when opening financial accounts, and informed consent notices for medical treatments. As a result, individuals may be making significant choices when they are not aware that they are making any at all. Or they may make the choices they are forced to in order to obtain products and services they want. In a consent-based system, those decisions can have serious consequences, such as assuming liability or waiving legal protections, even though we know they were made reflexively, without thought, or maybe not at all.
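The quasi-identifier linkage behind the re-identification examples discussed above can be sketched in a few lines of code. This is a hypothetical illustration only: the records, names, and the `reidentify` helper are invented for this sketch and do not come from any of the studies cited.

```python
# Hypothetical illustration of a linkage (re-identification) attack:
# a "de-identified" data set still carrying quasi-identifiers (ZIP code,
# date of birth, gender) is joined against a public roster that lists
# names alongside the same attributes. All records below are invented.

deidentified = [
    {"zip": "47405", "dob": "1970-03-01", "sex": "F", "diagnosis": "asthma"},
]
public_roster = [
    {"name": "Jane Roe", "zip": "47405", "dob": "1970-03-01", "sex": "F"},
    {"name": "John Doe", "zip": "47401", "dob": "1982-11-09", "sex": "M"},
]

QUASI = ("zip", "dob", "sex")  # the Sweeney-style quasi-identifiers

def reidentify(records, roster):
    """Join on quasi-identifiers; a unique match re-attaches a name."""
    linked = []
    for rec in records:
        matches = [p for p in roster if all(p[k] == rec[k] for k in QUASI)]
        if len(matches) == 1:  # a unique combination identifies the person
            linked.append({**rec, "name": matches[0]["name"]})
    return linked

# The single "anonymous" record is linked back to "Jane Roe".
print(reidentify(deidentified, public_roster))
```

The point of the sketch is that no field in the de-identified record is a name or identifier on its own; it is the combination of ordinary attributes, compared across data sets, that does the identifying.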
Choice as a Disservice to Individuals and Society

Finally, choice can actually interfere with activities of great value to individuals or society more broadly. This is true of press coverage of public figures and events, medical research, and the many valuable uses of personal information where the benefit is derived from the fact that the consumer has not had control over the information. Consider information about individuals’ creditworthiness: its value derives from the fact that the information is obtained routinely, over time, from sources other than the consumer. Allowing the consumer to block the use of unfavorable information would make the credit report useless. In the words of former FTC chair Timothy Muris (2001), the credit reporting system “works because, without anybody’s consent, very sensitive information about a person’s credit history is given to the credit reporting agencies. If consent were required, and consumers could decide—on a creditor-by-creditor basis—whether they wanted their information reported, the system would collapse.”



This is even clearer in the context of big data. There is extensive evidence as to the role that analysis of big data can play in advancing health research, detecting dangerous drug interactions, enhancing public safety, facilitating consumer convenience, improving services, and increasing accountability (as discussed in Sholler, Bailey, and Rennecker, this volume). But many of the benefits we are witnessing depend on being able to utilize data for uses that did not even exist when they were collected or created. And many depend on having complete data sets. For example, in the context of health research, refusal rates as low as 3.2 percent have been clearly shown to introduce selection bias (Casarett et al. 2005; Nass, Levit, and Gostin 2009). The more we rely on notice and consent, and limit the reuse of data to uses that are “not incompatible with” those specified in the original notice, the more we run the risk of frustrating potentially valuable uses of data or, alternatively, permitting them under broad consent provisions that leave personal data inadequately protected.

Impact of Big Data

In sum, individual consent as a basis for the use of personal data presents many challenges that make it increasingly impractical and undesirable as the principal way to protect privacy. Some of those challenges concern the difficulties of making choice work: the complexity, inaccessibility, and ineffectiveness of consent opportunities and notices. But many reflect fundamental objections to reliance on individual consent at all: the choice is often illusory; it substitutes for more meaningful privacy protection; it creates a false dichotomy between personally and non–personally identifiable information; it imposes an unnecessary burden on individuals; and it actively disserves the best interests of individuals and society. In a world of big data, all these issues are magnified.
Dramatic increases in the ubiquity of data collection, the volume and velocity of information flows, and the range of data users (and reusers) place an untenable burden on individuals to understand the issues, make choices, and then engage in oversight and enforcement. A continuing broad reliance on notice and choice at the time of collection both underprotects privacy and seriously interferes with—and raises the cost of—subsequent beneficial uses of data. Furthermore, personal information is increasingly used by parties with no direct relationship to the individual, or generated by sensors (or inferred by third parties) over which the individual not only exercises no control but with which he or she has no relationship at all. More problematic still is the fact that notice and consent are used to shift responsibility to individuals and away from data users. If a subsequent use of data proves to be harmful or threatening to the individual, data users generally derive legal protection from the fact that the individual consented. At present, individuals faced with choices at the time of data collection are saddled with both the burden of making those choices and the consequences of those choices, whether as a result of overly complex notices, limited choices, lack of understanding, or an inability to weigh future risks against present benefits.



These reasons, when taken together, explain why the May 2014 report by the President’s Council of Advisors on Science and Technology, Big Data and Privacy: A Technological Perspective, described the “framework of notice and consent” as “unworkable as a useful foundation for policy.” The report stressed that “only in some fantasy world do users actually read these notices and understand their implications before clicking to indicate their consent” (Executive Office of the President 2014). To be sure, notice and consent may provide meaningful privacy protection in appropriate contexts, but this approach is increasingly ineffective as the primary mechanism for ensuring information privacy, especially in the case of big data. An overreliance on notice and consent both diminishes the effectiveness of these measures in settings where they could be used appropriately and often ignores more effective tools—including transparency and redress—for strengthening the role of individuals in data protection. More important, the focus on notice and consent has tended to obscure the need for more effective and appropriate protections for privacy.

Protecting Privacy in a World of Big Data

Despite the multinational chorus of criticism about data protection’s reliance on notice and consent, many in industry and government continue to cling to them because they are familiar, they are comparatively easy to administer, and there is little agreement on better tools for protecting privacy. The remainder of this section therefore considers five core elements for more effective governance of big data.

Shifting the Focus from Individual Consent to Data Stewardship

The effective governance of big data requires shifting more responsibility away from individuals and toward data collectors and data users, who should be held accountable for how they manage data rather than whether they obtained individual consent.
While specific methods for accomplishing this are discussed below, the recognition that processors should be liable for reasonably foreseeable harms will create a significant incentive for greater care in their collection and use of data. It will also restrict the common practice of allowing processors to escape responsibility by providing notice and obtaining (or inferring) consent. This should thus reduce the burden imposed on individuals and focus their attention only on those data processing activities where there are meaningful, effective choices to be made. Processors, in turn, will benefit by not wasting resources on notices that no one reads or on terms of consent that are often illusory at best, and by not having potentially valuable future uses of personal data restricted by them.

A More Systemic and Well-Developed Use of Risk Management

Risk management is the process of systematically identifying harms and benefits that could result from an activity. Risk management does not alter rights or obligations, but by
assessing both the likelihood and severity of harms and benefits, risk management helps organizations identify mitigation strategies and ultimately reach an optimum outcome that maximizes potential benefits while reducing the risk of harms so that it falls within acceptable limits (International Organization for Standardization 2009; Centre for Information Policy Leadership 2014a, 2014b). The ultimate goal of risk management, after taking into account those measures that the data user can take to reduce risk, is to create presumptions concerning common data uses so that both individuals and users can enjoy the benefits of predictability, consistency, and efficiency in data protection. So, for example, some uses in settings that present little likelihood of harm, or only negligible harms, might be expressly permitted, especially if certain protections such as appropriate security were in place. Conversely, some uses in settings where there was a higher likelihood of more severe harms might be prohibited or restricted without certain protections in place. For other uses that present either little risk of more severe harms or greater risk of less severe harms, greater protections or even specific notice and/or consent might be required so that individuals have an opportunity to participate in the decision-making process. Data protection has long relied on risk management as a tool for complying with legal requirements, ensuring that data are processed appropriately, and protecting the fundamental rights and interests of individuals effectively. Yet these risk management processes, whether undertaken by businesses or regulators, have often been informal and unstructured, and have failed to take advantage of many of the widely accepted principles and tools of risk management in other areas.
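The tiered presumptions just described (permit low-risk uses, restrict high-risk ones, and give individuals a voice in between) can be illustrated with a toy decision rule over likelihood and severity of harm. The levels, labels, and the `classify_use` function below are invented for illustration; they are not drawn from any law, regulation, or standard.

```python
# Toy illustration of the tiered, risk-based presumptions described in
# the text: the likelihood and severity of harm from a proposed data use
# jointly determine its default treatment. Levels and outcomes here are
# illustrative assumptions only.

def classify_use(likelihood: str, severity: str) -> str:
    """Map a (likelihood, severity) harm assessment to a default treatment."""
    rank = {"low": 0, "high": 1}
    l, s = rank[likelihood], rank[severity]
    if l == 0 and s == 0:
        # Little likelihood of harm: expressly permitted, provided
        # baseline protections such as appropriate security are in place.
        return "permitted with baseline safeguards"
    if l == 1 and s == 1:
        # High likelihood of severe harms: prohibited or restricted
        # unless strong protections are in place.
        return "prohibited or restricted"
    # Mixed cases (little risk of severe harm, or greater risk of minor
    # harm): individuals participate via specific notice and/or consent.
    return "specific notice and/or consent required"

for use in [("low", "low"), ("high", "high"), ("high", "low"), ("low", "high")]:
    print(use, "->", classify_use(*use))
```

A real assessment would of course be graduated rather than binary, and would weigh benefits alongside harms; the sketch only shows how presumptions keyed to likelihood and severity yield predictable, consistent outcomes.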
In addition, institutional risk management in the field of data protection has suffered from the absence of any consensus on the harms that risk management is intended to identify and mitigate. Such a consensus is the starting point for effective risk assessment in other fields. As a result, despite many examples of specific applications, a risk-based approach does not yet provide a broad foundation for data protection practice or law. It is critical that risk management around data protection, while remaining flexible, not continue in the largely ad hoc, colloquial terms in which it has evolved to date. In other areas—for instance, financial and environmental risk—we have seen the development of a professional practice of risk management, including specialized research, international and sectoral standards, a common vocabulary, and agreed-on principles and processes. The same is needed in data protection risk management. There is substantial movement in this direction already. The draft text of the European Union’s General Data Protection Regulation focuses significantly on risk management. The text that emerged from the European Parliament (2013, para. 66) stresses the need for “the controller or processor” to “evaluate the risks inherent to the processing and implement measures to mitigate those risks.” The draft regulation would require data controllers to demonstrate compliance with it in regard to, among other things, the “risks for the rights and
freedoms of the data subjects” (ibid., art. 22). Under a wide variety of circumstances, the controller would be required to “carry out a risk analysis of the potential impact of the intended data processing on the rights and freedoms of the data subjects, assessing whether its processing operations are likely to present specific risks” (ibid., art. 32a). The draft of a “partial general approach” to chapter 4 of the regulation that has been circulated by the council presidency further builds on the risk-based approach, conditioning the obligations of the data controller to implement appropriate measures and be able to demonstrate compliance with the regulation on “the nature, scope, context and purposes of the processing as well as the likelihood and severity of risk for the rights and freedoms of individuals” (Presidency 2014, art. 22.1). In 2013, the Council of Ministers of the OECD (2013a) revised the OECD Guidelines Governing the Protection of Privacy and Transborder Flows of Personal Data, first adopted in 1980, to “implement a risk-based approach.” In the accompanying Explanatory Memorandum, the drafters noted the “importance of risk assessment in the development of policies and safeguards to protect privacy” (OECD 2013b). 
In 2012, the FTC published a report recommending that companies should “implement accountability mechanisms and conduct regular privacy risk assessments to ensure that privacy issues are addressed throughout an organization.” In the same report, the FTC (2012) acknowledged that not all data processing necessitates consumer consent, noting that “whether a practice requires choice turns on the extent to which the practice is consistent with the context of the transaction or the consumer’s existing relationship with the business, or is required or specifically authorized by law.” Risk management holds special promise in the world of big data by facilitating thoughtful, informed decision making by data collectors and users that takes into account not only their risks but also those of data subjects, by explicitly considering both harms and benefits, and by focusing the increasingly scarce resources of both data processors and government regulators where they are needed most. As the editors of Oxford University Press’s International Data Privacy Law opined recently,

[We] applaud the attention being given to risk management and its role in data protection. In its proper place, risk management can help prioritize the investment of scarce resources in protecting privacy and enforcing privacy obligations. It can identify serious risks to privacy and measures for mitigating them. It can expand our collective thinking about the range of risks that the processing of personal data can present to individuals, organizations, and society, especially in a world of nearly ubiquitous surveillance, big data, cloud computing, and an onslaught of Internet-connected devices. And it can help bring rigor and discipline to our thinking about data processing and how to maximize its benefits while reducing its costs. (Kuner et al. 2015)

A Greater Focus on Data Uses

There is often a compelling reason for personal data to be disclosed, collected, or created. Assessing the risk to individuals posed by those data almost always requires knowing the
context in which they will be used. Data used in one context, or for one purpose or subject to one set of protections, may be both beneficial and desirable, while the same data used in a different context, or for another purpose or without appropriate protections, may be both dangerous and undesirable (Nissenbaum 2010). As a result, data protection should, in the words of the US President’s Council of Advisors on Science and Technology, “focus more on the actual uses of big data and less on its collection and analysis” (Executive Office of the President 2014, xiii). Concentrating on the use of personal data does not eliminate responsibilities or regulation relating to data collection, nor should a focus on consent in specific or sensitive circumstances be abandoned. Rather, in many situations, a more practical as well as sensitive balancing of valuable data flows and more effective privacy protection is likely to be obtained by putting more attention on appropriate, accountable use. This is especially true if use is defined broadly to include relying on personal data for decision making or other assessment concerning an individual, using personal data to create or infer other personal data, or disclosing or disseminating personal data to a third party (Cate, Cullen, and Mayer-Schönberger 2014). Under a more use-based approach, data users would evaluate the appropriateness of an intended use of personal data not by focusing primarily on the terms under which the data were originally collected but rather on the likely risks to or impacts on individuals associated with the proposed use of the data. Such a stress on use is more intuitive because most individuals and institutions already think about uses when evaluating their comfort with proposed data processing activities.
“What are you going to do with the data?” “How do you intend to use it?” “What are the benefits and risks of the proposed use?” These are the types of questions that many individuals ask—explicitly or implicitly—when they inquire about data processing activities. They can be answered only in connection with specific uses or categories of uses, and they are precisely the questions that data users would be required to ask—and answer—regarding proposed uses of data. One of the most pronounced changes that will result from the evolution toward a greater focus on use is to diminish the role of the purpose for which data were originally collected. The OECD (1980) guidelines explicitly provide for a purpose specification principle that requires that “the purposes for which personal data are collected should be specified not later than at the time of data collection,” and then limits subsequent use to “the fulfillment of those purposes or such others as are not incompatible with those purposes and as are specified on each occasion of change of purpose.” This principle is problematic for many reasons, including the fact that precisely because of it, data processors usually specify exceptionally broad purposes that offer little meaningful limit on their subsequent use of data. In addition, because data increasingly are generated in ways that involve no direct contact with the individual (for example, collected by sensors or inferred from existing data), there is never a purpose specified. Moreover, with the advent of big data and the analytic tools that have accompanied it, personal data may have substantial valuable uses that were wholly unanticipated when the data were collected, yet the data were
collected in such a way or are so vast as to make contacting each individual to obtain consent for the new use impractical as well as potentially undesirable where the beneficial use depends on having a complete data set. Some modern data protection systems have dealt with these problems by creating broad exceptions to this principle, interpreting “not incompatible” so expansively as to undermine the principle, or simply ignoring it altogether. Taking maximum advantage of big data will require a more thoughtful approach to purpose specification. The principle will have less relevance in many settings. This will certainly be true when data are observed or inferred without any contact with the individual, but it will also likely be true in many other settings. Instead, it is the analysis of risks associated with an intended use that determines whether, and subject to what protections, a use is appropriate. The terms under which data were collected would remain relevant when a specific purpose is provided at the time of collection and is a meaningful factor in obtaining access to data. This would be especially clear in settings where users had made meaningful choices (e.g., specifying a preferred medium for future communications) or where the data processor had agreed to specific limits as a condition of obtaining personal information (e.g., an explicit promise not to share the data). But a greater focus on risk assessment of specific uses as opposed to consent is essential to ensure that data users do not evade their commitments, valuable uses of data are not inappropriately deterred, and data protection laws are not claimed to reflect a principle that increasingly they do not. As Susan Landau (2015, 504) wrote in Science, “[Data protection laws have attempted to protect privacy] through notice and consent. But for reasons of complexity (too many tiny collections, too many repurposings) those are no longer effective. 
… [T]he value of big data means we must directly control use rather than using notice and consent as proxies.” A use-focused approach is especially important in the context of big data because the analysis of big data doesn’t always start with a question or hypothesis but rather may reveal insights that were never anticipated (Mayer-Schönberger and Cukier 2013). As a result, data protection based on a notice specifying intended uses of data, and consent for collection based on that notice, can result in blocking socially valuable uses of data, lead to meaninglessly broad notices, or require exceptions to the terms under which the individual consented. If privacy protection is instead based on a risk analysis of a proposed use, then it is possible to achieve an optimum benefit from the use of the data and optimum protection for the data, fine-tuned to each intended use.

A Broad Framework of Harms

Measuring risks connected with data uses is especially challenging because of the intangible and subjective nature of many perceived harms. Any risk assessment must be both sufficiently broad to take into account the wide range of harms (and benefits) and sufficiently
simple that it can be applied routinely and consistently. Perhaps most important, the assessment should be transparent to facilitate fairness, trust, and future refinement. The goal of a risk management approach centered on data uses is to reduce or eliminate the harm that personal information can cause to individuals. Accomplishing this, however, requires a clear understanding of what constitutes “harm” or other undesired impact in the privacy context. Surprisingly, despite almost fifty years of experience with data protection regulation, that clear understanding is still lacking in both the scholarly literature and the law. This is due in part to the focus on notice and consent in data protection, under which harm was understood as collecting personal information without providing proper notice or obtaining consent, or using data outside the scope of that consent. That does not equate with the way most people think about data-related harms, which focuses more on data being used in ways that might cause them injury or embarrassment than on the presence or content of privacy notices. So there is a widespread need to think more critically about what constitutes a harm that the risk management framework should seek to minimize or prevent when evaluating data uses. A framework of recognized harms is critical to ensuring that individuals are protected, and also to enhancing predictability, accountability, and efficiency. National regulators are well placed to help lead a transparent, inclusive process to articulate that framework. The goal should not be to mandate a one-size-fits-all approach to risk analysis but rather to provide a useful, practical reference point, and to ensure that a wide range of interests and constituencies are involved in crafting it.
There is a broad spectrum of possibilities for what might constitute a harm, but it seems apparent that the term must include not only a wide range of tangible injuries (including financial loss, physical threat or injury, unlawful discrimination, identity theft, loss of confidentiality, and other significant economic or social disadvantage) but also intangible harms (such as damage to reputation or goodwill, or excessive intrusion into private life) and potentially broader societal harms (such as contravention of national and multinational human rights instruments). What matters most, though, is that the meaning of harm be defined through a transparent, inclusive process, and with sufficient clarity to help guide the risk analyses of data users. Risk assessment is not binary and is likely to be influenced by a number of factors within the data user’s control. So the goal of the risk assessment isn’t simply to indicate whether a proposed data use is likely to be appropriate or not. It is also to highlight the steps that the data user can take to make that use more acceptable (e.g., by truncating, encrypting, or depersonalizing data).

Transparency and Redress

Big data are increasingly being used to make decisions about individuals and even predict their future behavior, often with significant consequences. The one certainty of data analysis
is that there will be errors—errors resulting from problems with data matching and linking, erroneous data, incomplete or inadequate algorithms, and misapplication of data-based tools. Moreover, our society frequently seems obsessed with data (whatever their size), and acts as if we believe that just because something is “data based,” it is not merely better but infallible as well. As a result, we often deploy new data-based tools without adequate testing, oversight, or redress. (Data mining activities for aviation security are an obvious example: the federal government denied or delayed boarding for thousands of innocent passengers for years before finally putting a redress system in place.) Whenever big data are used in ways that affect individuals, there must be effective transparency and redress. This is necessary to protect the rights of individuals, but it also serves the vital purposes of enhancing the accuracy and effectiveness of big data tools, and creating disincentives for deploying tools inappropriately. Meaningful transparency and redress, together with effective enforcement, not only provide remedies for current harms but also help to prevent future ones. Furthermore, while few individuals demonstrate much interest in inquiring into data processing activities before they perceive a harm, once they do they are usually keenly interested in learning how their data were used and to what effect. Ensuring that there is meaningful redress will not only create disincentives for risky data processing and help repair the damage that such processing can cause but also provide meaningful rights to individuals at the very time they are most interested in exercising them. It is an essential requirement for the responsible use of big data.

Conclusion

In an age of big data, privacy is more essential than ever before, but individual consent is plainly impractical and undesirable as the principal way to protect it.
The complexity, inaccessibility, and ineffectiveness of consent opportunities and notices are only magnified by the extraordinary number and variety of ways of collecting or generating personal data in a world of big data. But even if these difficulties could be overcome, consent would remain a poor tool for a data-rich environment in which effective privacy protection is more necessary than ever: it is frequently illusory, substitutes for more meaningful privacy protections, creates a false dichotomy between personally and non–personally identifiable information, imposes an unnecessary burden on individuals, and actively disserves the best interests of individuals and society in the productive, responsible use of data. If we are to protect personal privacy effectively, while continuing to enjoy the benefits that big data are already making possible, we need to evolve better, faster, and more scalable protections. We must transform our current broad reliance on notice and consent at the time of collection into a more effective, sustainable, and scalable approach that:

•  relies less on individual consent, and more on placing responsibility for data stewardship and liability for reasonably foreseeable harms on data users
•  includes a more systemic and well-developed use of risk management
•  focuses more on the uses of big data as opposed to the mere collection or retention of data, or the purposes for which data were originally collected
•  is guided by a broad framework of cognizable harms identified through a transparent, inclusive process including regulators, industry, academics, and individuals
•  provides meaningful transparency and redress

There are likely many other measures that also will be useful, but these five are critical to protecting privacy while unlocking the potential of big data.

2  When They Are Your Big Data: Participatory Data Practices as a Lens on Big Data

Katie Shilton

“Access Matters” trumpets the title of the blog post. Gary Wolf (2014), one of the founders of the Quantified Self (QS) blog and the larger QS data-tracking movement, writes, Someday, you will have a question about yourself that impels you to take a look at some of your own data. It may be data about your activity, your spending at the grocery store, what medicines you’ve taken, where you’ve driven your car. And when you go to access your data, to analyze it or share it with somebody who can help you think about it, you’ll discover. … You can’t.

The September 2014 post details a tension between the practice of self-quantification or self-tracking, and inaccessible, corporate-owned data streams. This tension, long present in debates about privacy and accessibility (Shilton 2012), is drawing a critical mass of attention as two (related) phenomena converge: more consumers are using personal fitness devices and self-tracking applications, and personalized health data are gaining prominence as a topic of scientific investigation. The terminology used to refer to data generated about individuals is variable, including “personally identifiable information” (Schwartz and Solove 2011), “data exhaust” (Williams 2013), “personal data trails” (Estrin 2014), and “participatory personal data” (Shilton 2012). Data about people are “big data” in both the cognitive sense and the social movement sense, as defined by Hamid R. Ekbia and colleagues (2015). First, the reference to data trails or streams indicates that the data are cognitively large: they are more than a person might reasonably be able to track and reason about without visualization techniques or, frequently, computing power. Second, these data are an outcome of larger “computerization movements” (Kling and Iacono 1988), social movements wherein loosely organized groups of people advocate for broader social changes based around a technologically enabled vision. The personal data vision imagines a world in which personal data access and use can help individuals be healthier, more mindful, and more efficient (Li et al. 2012). Uncertainties about how to use increasingly large sets of personal data are at the center of social debates about the virtues of big data. Not all “big data” is data about people, but data about people inspire much of the hope and anxiety bound up in discussions of the term. This big data is about you—not only about your private activities, but also about patterns that


Katie Shilton

may be useful for health, wellness, and personal reflection. To begin an analysis of the ethics of using big personal data, this chapter highlights a subtle yet important difference in two types of big personal data: self-quantification data and personal data trails. Differences in how and where data are generated afford each data type different access mechanisms, and individuals engage with each type of data in distinct ways. Both data types, however, highlight the need to redefine privacy, access, and research ethics principles when data about people span personal and corporate boundaries.

The first type of big personal data is self-quantification data, or personal informatics. Self-quantification is a practice that involves making, hacking, or using off-the-shelf devices to systematically collect and analyze data about oneself (Li et al. 2011). Self-quantification data are generated by individuals using specialized devices. While self-quantification data are frequently retained and stored by commercial entities, they are available to the data subject either in raw form or through visualizations or aggregations. A community has sprung up around the practice, engaging in meet-ups, conferences, and knowledge sharing via blogs and websites (Nafus and Sherman 2014). The practices of this community are discussed in the “Defining Self-Quantification Data” section below.

The second type of big personal data, personal data trails, is generated by everyday activities without specialized single-purpose devices. As a result, personal data trails are often not immediately accessible to data subjects. The collection of personal data trails takes advantage of the data “exhaust” of daily lives—the trails left by everyday activities such as shopping, commuting, watching television, or interacting online (Estrin 2014). Unlike self-quantification data, a grassroots community has not formed around personal data trails, precisely because of their inaccessibility.
Yet the potential richness of these data has encouraged a growing research community focused on these data in computer science, information sciences, and health sciences. The practices of this community are explored in more detail in the “Defining Personal Data Trails” section below. The “Spectrum of Participation Practices” section examines the communities of practice that have coalesced around these two data types. The section “Ethical Dilemmas and Challenges” looks at some of the unsolved ethical challenges posed by the spectrum of existing practices. The chapter closes with a section titled “Implications for Policy.”

Defining Self-Quantification Data

There is a growing number of consumer devices designed to help individuals collect data about themselves, and analyze and learn from those data. These digital tools increasingly enable individuals to collect granular data about their habits, routines, and environments (Klasnja and Pratt 2012). Corporations such as the exercise-tracking company Fitbit have developed self-tracking into a business model. Athletes use a range of portable devices to monitor physiological factors that affect performance, and curious individuals track correlations between variables as diverse as stress, mood, food, sex, and sleep

When They Are Your Big Data 


(Hill 2011). Collecting self-quantification data does not always require specialized devices. Citizen science participants banded together for the “butter mind” experiment, volunteering to eat half a stick of butter daily and to record data using paper or (more frequently) desktop software to test the impact on mental performance (Gentry 2010). Networks for visualizing and sharing self-tracking data have proliferated (e.g., FlowingData), providing affordances that enable self-tracking to become an explicitly social process. Forms of self-tracking have always existed, and techniques such as keeping a diary have long been recommended as, for example, a critical part of diabetes self-management (Lane et al. 2006). Ubiquitous technologies enable a new scope, scale, convenience, and aggregation of these activities (Miller 2012). Early scholarly work analyzing the rise of self-quantification has been largely critical, accusing personal informatics participants of technological utopianism, and focusing on the ways that quantified self projects privilege visual and numerical representations as well as ways of knowing (Cohen 2016; Lupton 2013; Purpura et al. 2011). In addition, the data practices of personal informatics can become obsessive or at least self-disciplining (Vaz and Bruno 2003). For example, in a moving blog post titled “Why I Stopped Tracking” on QS, contributor Alexandra Carmichael (2010) wrote:

My self-worth was tied to the data. One pound heavier this morning? You’re fat. 2 g too much fat ingested? You’re out of control. Skipped a day of running? You’re lazy. Didn’t help 10 people today? You’re selfish.

Ethnographers Dawn Nafus and Jamie Sherman, though, illustrate a more nuanced picture of the complexities of self-tracking. They describe attending quantified self meet-ups, and finding a broad range of data practices and attitudes about those practices within the personal informatics community. Their accounts of interviews with and observations of personal informatics participants include many who resist strict norms of what is “healthy,” and use their data practices as a form of mindfulness rather than discipline. They write: To our surprise, we found most QSers do not grip too tightly to normative understandings of what is and isn’t “healthy.” Many digital technologies for health do, however. Michael [a participant] largely avoids these, or repurposes them when they are designed in this way. He prefers tracking tools where he can set his own direction. (Nafus and Sherman 2014, 1789)

Nafus and Sherman conclude that self-quantifying practices center on data diversity. They describe participants who used specialized apps or devices to track data as varied as moods while at work or which cardinal direction they spent time facing. Participants set their own



goals of what to track and when, and find their own truth in their numbers. When tracking doesn’t reveal something personally useful or interesting, they move on, changing what numbers they reference as it suits them. As Nafus and Sherman (ibid., 1790) put it, “QS not only approaches discourses of healthiness with skepticism, but relies on practices of listening to the body, reading data carefully, and devising responses to both signs from the body and signs from the data.” Nafus and Sherman (ibid., 1785) argue that personal informatics data practices “constitute an important modality of resistance to dominant modes of living with data, an approach that we call ‘soft resistance.’” They characterize this soft resistance as stemming from the multiple roles taken by participants as they define what data to collect, when to collect it, and what it means. Nafus and Sherman’s account suggests that personal informatics data can be a crucial form of personal self-exploration and even expression. They interpret personal informatics as a form of mindfulness and reflection as much as self-discipline or obsession. This interpretation points to the complexity, diversity, and, above all, usefulness of personal informatics practices.

Defining Personal Data Trails

In contrast to self-quantification data, personal data trails comprise data about individuals captured in the course of everyday events rather than explicitly generated by the data subject. Personal data trails are generally much larger and more varied than self-quantification data, as they stem from daily digital interactions with corporations, which collect these data for purposes ranging from product improvement to profit.
Such trails are as diverse as search histories, credit card purchases, the social data described by Alaimo and Kallinikos in this volume, and, as Burdon and Andrejevic discuss in this volume, sensor data such as that collected by the growing number of networked devices in homes (World Economic Forum 2011). Advocates of harvesting and analyzing personal data trails characterize these data as richer than self-quantification data, as they are greater in both volume and ubiquity than data collected purposefully by single-purpose devices. Researchers engage in a range of investigations using personal data trails. Mobile phone trace data alone have been used to model traffic patterns (Frias-Martinez, Moumni, and Frias-Martinez 2014), land use (Frias-Martinez and Frias-Martinez 2014), and the spread of disease (Pentland et al. 2009). Analysis of personal data trails may also benefit individuals. For example, cable companies can sense the volume at which customers watch television. Tracking increases in volume over time might serve as an early signal of hearing loss (Estrin 2014). These data may be richer and more diverse, but they are also much harder to access than self-quantification data. Corporations sometimes provide interfaces to access these data for those willing to build their own software tools. But most corporations keep these data off-limits to researchers and the public. As Deborah Estrin (2014, 32), a researcher who is building infrastructure to support personal data collection for health and wellness, writes,



The social networks, search engines, mobile operators, online games, and e-commerce sites we access every hour of most every day extensively use these digital traces we leave behind. They aggregate and analyze these traces to target advertisements and tailor service offerings and to improve system performance. But most services do not make these individual traces available to the person who generated them; they do not yet have a ready-made vehicle to repackage their data about you in a useful format for you and provide it to you. But they should, because this broad but highly personalized data set can be analyzed to draw powerful inferences about your health and well-being from your “digital behavior.”

Estrin advocates for a common open infrastructure to tap into existing application program interfaces, and enable individuals to access their data or port their data to research studies of relevance to them. Health researchers might use the infrastructure to enroll consenting participants and analyze the mobility data generated by mobile phones. There is growing research interest in accessing these sorts of data. Projects like Give Me My Data supply interfaces to harvest and use personal Facebook data; others link individuals’ mobile data streams with both self-analysis and health care provider analysis tools. Harvesting such data might take advantage of Internet-connected devices in the home (Brush et al. 2013), mobile devices (Campbell et al. 2008; Rahman et al. 2014), or application program interfaces that enable access to data in the cloud (Estrin 2014). Some researchers have also found ways to partner directly with corporations to access location data harvested from cell phones (Frias-Martinez, Moumni, and Frias-Martinez 2014). Although it seems clear that participants in the future of personal data trail research will include both individuals and researchers, it is unclear whether use of these trails will skew toward aggregation in large studies of population-level behavior (e.g., Frias-Martinez, Moumni, and Frias-Martinez 2014), or individual analyses to support reflection and behavioral change (Hekler et al. 2013).

Spectrum of Participation Practices

Distinctions between personal and corporate data collections, and individual and research access, highlight a spectrum of participation along which personal data projects might fall. Participation in research has long been seen as a spectrum ranging from more to less participatory (Khanlou and Peter 2005), and personal informatics and data trail research provide more examples for this continuum.
Data collection: We can map a spectrum from greater to lesser control over, and therefore participation with, personal data using a life cycle of big data (table 2.1). First, who controls data collection? Unpacking this question begins with device ownership. Participants may own their data-tracking devices, rent or borrow them as part of subscription services, or be targets of remote sensing technologies (such as satellites, surveillance cameras, or heat sensors). Evaluating control over data collection also includes device accessibility. Open-source or free software may be easier to alter or hack, enabling participants to capture data when, where,



and at a granularity that they choose, or even obfuscate or alter their own data (Brunton and Nissenbaum 2011). Devices that are easy to toggle on and off offer greater participation than always-on devices (Shilton 2009). Variations in technological affordances help address questions of informed consent to data collection. At the extremely participatory end are personal informatics projects in which participants code or design their own data collection devices, or use spreadsheets or even pencil and paper to record data. Though participants may never have signed a consent form, direct participation in the nitty-gritty of data collection implies informed, consenting participants. Next are proprietary devices such as personal fitness monitors, which participants have clearly opted to use and can easily leave at home, but which are less readily tinkered with or hacked. At the more controlled end are cable boxes, thermostats, refrigerators, and other Internet of Things applications, which send data to corporations without the subject’s awareness—and which are difficult to meaningfully opt out of using (Schneier 2013). Although users may have signed terms of service for these devices consenting to data collection, the low degree of participation in whether and how data is collected indicates a much less informed degree of consent.

Data analysis: Next, who controls data analysis? Fitness trackers often provide basic visualization tools for users, affording a moderate level of participation in data analysis. Data subjects can reflect on the representation that the data creates (Vaz and Bruno 2003), but their ability to control that representation is limited. Analyses become increasingly participatory as they provide raw data to users for export, or enable users to make their own mash-ups and data visualizations: the power called for at the beginning of this chapter by Wolf (2014) and other QS practitioners.
On the least participatory end of the spectrum are projects that look for emergent patterns about data subjects (which might correlate with behaviors, diagnoses, or other real-world indicators), or make classifications or judgments. Analyses of personal data trails, for instance, have labeled risky drivers (Kearney 2013), consumer credit risks (Khandani, Kim, and Lo 2010), and signs of postpartum depression (De Choudhury, Counts, and Horvitz 2013) without subjects’ knowledge.

Table 2.1 Spectrum and Types of Participatory Practices

High participation. Data collection: subjects own or control devices; collection can be customized. Data analysis: raw data accessible; subjects can conduct their own analyses. Data storage: data stored on local devices. Data reuse: individuals control reuse.

Low participation. Data collection: subjects aware of devices; collection can be avoided. Data analysis: subjects can see visualizations or analysis of their data. Data storage: data in cloud storage with options for deletion. Data reuse: reuse is restricted to aggregated forms.

Little to no participation. Data collection: subjects unaware of devices; collection cannot be avoided. Data analysis: subjects evaluated or categorized without their knowledge. Data storage: data in cloud storage with no option for deletion. Data reuse: data collectors share or sell data.

Data storage: Control over data storage provides a third set of questions to further detail the spectrum of participation. Personal informatics participants who store data on their own devices or personal computers anchor the highly participatory end of the spectrum. These data subjects keep control over data, and therefore control over decisions about how long to keep it and whether to share it. Alternatively, many commercial services store data in the cloud or corporate storage (Takabi, Joshi, and Ahn 2010; Udell 2012), but may provide options for users to delete their data or accounts. Furthermore, the least participatory forms of trace data tracking maintain control over storage without opportunity for data export and deletion, or keep the data even after subjects have requested deletion (Almuhimedi et al. 2013; Hill 2013).

Data use: Finally, questions of data reuse matter to the spectrum of participatory practices. Individuals who maintain the storage of their own data may reuse them as they wish. Services often allow individuals to share data selectively with friends (Krontiris, Langheinrich, and Shilton 2014; Tsai et al. 2009). Data collected for academic research must comply with Institutional Review Board regulations, which frequently restrict data reuse, allowing only aggregated or anonymized forms of data release or reuse (Office of the Secretary 1979). Data collectors that sell participant data to advertisers form the opposite, least participatory end of the spectrum (Hoofnagle 2003).

Mapping a range of participation in big personal data collection and research enables us to ask new questions of big data projects.
It expands our thinking beyond binary worries about data privacy, and instead points us toward a more contextually nuanced understanding of the purposes and functions of data collection and use.

Ethical Dilemmas and Challenges

US and worldwide research ethics have long focused on respect for research subjects, beneficence, and justice (Office of the Secretary 1979). Highlighting a spectrum of participatory data practices suggests that projects in which data can be viewed, analyzed, reused, or revoked by the data subject are not only more participatory but sometimes more ethical, too. Increasing participation in research with data that is personal and sensitive aligns well with these principles, particularly respect for subjects and justice. Participation can ensure truly informed consent, and can provide avenues for “soft resistance” to the categorization and labeling of big data practices. Broadly, we should encourage projects in which data are gathered and managed by data subjects, or willingly and knowingly donated by data subjects. Apple’s ResearchKit, for example, promises a widely accessible model for participatory research practice.



But “participatory research” and “ethical research” are not synonymous. In particular, attention to “beneficence”—doing good and avoiding harm—highlights ways that participatory research can fail to be the most ethical approach. For instance, research performed on data collected with little to no individual awareness or participation may be prosocial in nature, providing generalized benefits while minimizing risks to individuals. Such studies might comport with traditional research ethics centered on beneficence and minimal risk to participants. Researchers working with data from online communities, for example, argue that careful analyses of these data can provide public benefit with minimal risks to participants (Markham and Buchanan 2012). In addition, consider who benefits from more or less participatory models. Currently, self-quantification data is arguably the most participatory of personal big data, because participants set the research agenda, control data collection, and in many cases, control storage and reuse. Participants might be heavily involved, but this engagement has costs in time and data literacy, and can skew toward more affluent populations. Research utilizing personal data trails might be able to involve a wider and more diverse set of people. No special devices need to be purchased and kept charged, and interested researchers bear the brunt of data collection, analysis, and storage costs. Participation in personal data trail research might more closely resemble participation in traditional clinical trials or research studies, in which recruiting diverse participants may be an explicit focus for reasons of either methodology or social justice. Beyond concerns about equity, increasing participation in data practices is not always the most ethical research option. There are at least two major ethical objections to increasing participation in data practices.
The first is that the term “participation” has long been used to airbrush research or data collection practices that might otherwise be deemed unethical (Cooke and Kothari 2001; Phillips 1997). Data collectors may cynically co-opt the term for their own ends, making data collection participatory without being democratic or emancipatory (Cohen 2016). A second challenge in adopting a participatory lens goes beyond using participation as a smoke screen for otherwise-unethical practices. As Jennifer Gabrys (2014) has noted, in a culture oriented toward participation in data practices, those who choose not to participate risk being labeled nonconforming or even illegitimate. Participation in data collection is increasingly a sign not only of consumer behavior but also responsible citizenship. By encouraging participation in personal-data-driven research, we may be encouraging a move toward an increasingly panoptic society, in which the quantification, collection, and analysis of personal data are required to be normal, healthy, or compliant. Participatory data practices align well with ethical principles of respect for research subjects and justice, but there is not always a tie between participation and the ethical principle of beneficence. This dilemma suggests that as we start to think through participatory data use “policy”—rules for individuals, researchers, and corporations that wish to use these data—we should concentrate on ensuring that policy addresses all three ethical principles.



Implications for Policy

When data span personal and corporate boundaries, the growth of both personal informatics and data stream models of personal data collection calls for redefining privacy, access, and research policy. The linking of personal data streams to academic research is a signal that both universities and corporations need to rethink research policy for an era in which a primary form of data may be generated by daily activities. What should the process of informed consent look like when data from commercial services are involved? Is it enough to ask participants to opt in to sharing their personal data, or do projects need to give participants direct access to data? We can rethink each of these areas with the aid of the spectrum of participation: from full control of personal data by individuals to control by researchers guided by research policy, to control by corporations with, at the moment, little policy oversight. As highlighted in the Wolf quote that began this chapter, advocates for both personal informatics and personal data streams currently adopt a “rights” discourse, in which they advocate for the right of data subjects to access their own data and become involved participants in research. With roots in community-based participatory research (Horowitz, Robinson, and Seifer 2009), the rights discourse emphasizes participation through access to data streams. Rights-oriented policy, whether corporate policy that gives users access to their data or federal policy to encourage access as a fair business practice, would create opportunities for participatory modes of engagement with personal data, and give participants increasing control over their own data collection, outcomes, and decisions.
A rights-based policy approach is also reflective of a larger social discussion surrounding big data, which highlights the tensions between “data-rich” collectors (often corporations), and relatively “data-poor” academic researchers and individuals (boyd and Crawford 2012; Ekbia et al. 2015). A rights-based agenda is an important line of policy advocacy, but it is not the only one needed to make the collection and use of personal data fair and equitable. Concerns about the beneficence of participatory data projects highlight a need to develop protections for data subjects. For example, preventing discrimination is a crucial focus in big data critique and policy (Dwork and Mulligan 2013). Policy to protect individuals from discrimination based on participation in personal data research should be part of this conversation. There is policy precedent for protecting participants in health and wellness research. The 2008 Genetic Information Nondiscrimination Act, for instance, proactively protects participants who volunteer their genetic information from discrimination in employment or health care (text of H.R. 493 [110th]). Personal data-driven research demands similar harm prevention approaches for those who volunteer their data. Protections to prevent discrimination can also help guard against another specter of big data: the threat of unintended consequences when data streams are merged (Wu 2012). Merging data streams is one of the biggest potential promises of both personal informatics and personal data stream research, enabling people to see correlations or intersections that they may not have noticed otherwise. Protecting individuals from discrimination based on findings enables the benefits of such research while minimizing social



and personal risks. In addition, policy mechanisms that enable people to resist or opt out of data collection should remain a priority for researchers involved in the use of big personal data. Finally, rethinking research policy for participatory personal data demands ongoing descriptive research. We need to understand if and how norms and data practices around personal data are changing in order to best interpret what access rights and modes of protection mean for this context. A call for understanding data practice loops back to the important work of Nafus and Sherman (2014) and others exploring emergent data practices to understand norms along with violations of those norms. The blurring of corporate and academic research has led to calls for rethinking research ethics in other contexts (Calo 2013; Zimmer 2010). Considerations for big personal data should be added to those discussions. Considering the range of participation with personal data illustrates the theme of this edited volume, emphasizing that big data is not a monolith. Projects focused on personal data evoke a range of participation, and the amount of participation matters to the ethical consequences of that data collection and use. Research ethics and data use policy should consider allowances for participation, self-determination, and protection from discrimination in the age of participatory personal data.

3 Wrong Side of the Tracks

Simon DeDeo

SIR,—I have read your letter with interest; and, judging from your description of yourself as a working-man, I venture to think that you will have a much better chance of success in life by remaining in your own sphere and sticking to your trade than by adopting any other course. —Thomas Hardy, Jude the Obscure (1895)

Should juries reason well? Should doctors? Should our leaders? When the human mind is augmented by machine learning, the question is more subtle than it appears. It seems obvious that if we gather more information and use proven methods to analyze it, we will make better decisions. The protagonist on television turns to a computer to enhance an image, find a terrorist, or track down the source of an epidemic. In the popular imagination, computers help, and the better the computer and the more clever the programmer, the better the help. The real story, however, is more surprising. We naturally assume that the algorithms that model our behavior will do so in a way that avoids human bias. Yet as we will show, computer analysis can lead us to make decisions that, without our knowledge, judge people on the basis of their race, sex, or class. In relying on machines, we can exacerbate preexisting inequalities and even create new ones. The problem is made worse, not better, in the big data era—despite and even because of the fact that our algorithms have access to increasingly relevant information. The challenge is a broad one: we face it as ordinary people on juries, experts sought out for our specialized knowledge, and decision makers in positions of power. To demonstrate these challenges, we will make explicit how we judge the ethics of decision making in a political context. Through a series of examples, we will show how a crucial feature of those judgments involves reasoning about cause. Causal reasoning is at the heart of how we justify the ways in which we both reward individuals and hold them responsible. Our most successful algorithms, though, do not produce causal accounts. They may not reveal, say, which features of a mortgage applicant’s file combined to lead a bank’s algorithm to offer or deny a loan.
Nor do they provide an account of how the individual in question might have come to have those properties, or why those features and not others were chosen to begin with. The ethics of our policies become



opaque. In situations such as these, decision makers may unknowingly violate their core beliefs when they follow or are influenced by a machine recommendation. In some cases, the solution to a moral problem that technology creates is more technology. We will illustrate how, in the near term, a brute-force solution is possible, and present its optimal mathematical form. More promisingly yet, future extensions of work just getting underway in the computer sciences may make it possible to reverse engineer the implicit morals of an algorithm, allowing for more efficient use of the data we have on hand. We describe two recent advances—contribution propagation and the Bayesian list machine—that may help this goal. All these methods have their limits. The ethical use of machines may lead to new (short-term) inefficiencies. We may find that more mortgages are not repaid, more scholarships are wasted, and more crimes are not detected. This is a trade-off familiar to democratic societies, whose judicial systems, for example, thrive under a presumption of innocence that may let the guilty go free. Justice and flourishing, even in the presence of inefficiencies, are not necessarily incompatible. To trade short-term gain for more sacred values has, in certain historical periods and for reasons still poorly understood, led to long-term prosperity. Our discussion will involve questions posed by modern, developed, and diverse societies regarding equality of both opportunity and outcome. We do not take a position on the merits of any particular political program or approach. Rather, our goal is to show how the use of new algorithms can interfere with the ways in which citizens and politicians historically have debated these questions in democratic societies. If we do not understand how machines change the nature of decision making, we will find ourselves increasingly unable to have and resolve ethical questions in the public sphere.
Correlation, Discrimination, and the Ethics of Decision Making

There are many constraints on the ethics of decision making in a social context. We begin here with one of the clearest of the modern era: the notion of a protected category. Decisions made on the basis of such categories are considered potentially problematic and have been the focus of debate for over half a century. In the United States, for instance, the Civil Rights Act of 1964 prohibits discrimination on the basis of race, color, religion, sex, and national origin, while Title IX of the Education Amendments of 1972 makes it legally unacceptable to use sex as a criterion in providing educational opportunities. What appears to be a bright-line rule, however, is anything but. In any society, protected and unprotected categories are strongly correlated. Some are obvious: if I am female, I am more likely to be under five feet tall. Others are less so: depending on my nation of origin, and thus my cultural background, I may shop at particular stores, have distinctive patterns of electricity use, marry young, or be at higher risk for diabetes. Correlations are everywhere, and the euphemistic North American idiom “wrong side of the tracks” gains its meaning from them. Being north or south of a town’s railroad line is an

Wrong Side of the Tracks 


innocent category, but correlates with properties that a society may consider an improper basis for decision making. What a person considers “wrong” about the wrong side of the tracks is not geography but rather the kind of person who lives there. In the big data era, euphemisms multiply. Consider, for example, a health care system with the admirable goal of allocating scarce transplant organs to those recipients most likely to benefit. As electronic records, and methods for collecting and analyzing them, become increasingly sophisticated, we may find statistical evidence that properties—diet, smoking, or exercise—correlated with a particular ethnic group give its members a lower survival rate. If we wish to maximize the number of person-years of life saved, should we make decisions that end up preferring recipients of a different race? Such problems generically arise when machine learning is used to select members of a population to receive a benefit or suffer harm. Organ donation is only one instance of what we expect to be a wide portfolio of uses with inherent moral risks. Should we wish to allocate scholarship funding to those students most likely to graduate from college, we may find that including a student’s zip code, physical fitness, or vocabulary increases the predictive power of our algorithms.1 None of these properties are protected categories, but their use in machine learning will naturally lead to group-dependent outcomes, as everything from place of residence to medical care to vocabulary learned in childhood may correlate with a property such as race. In the case of the allocation of scholarship funds, we may want to exclude some sources of data as being prima facie discriminatory, such as zip code, even when they do not directly correspond to protected categories. Others, though, such as physical fitness or vocabulary, may plausibly signal future success by tracking, say, character traits such as grit (Duckworth et al. 2007; Eskreis-Winkler et al. 2014). 
Yet the predictive power of physical fitness or vocabulary may be driven in part by how race or socioeconomic status correlates with both these predictor variables and future success. An analysis of a type common in big data studies might confirm a predictive relationship between adolescent fitness and future performance. Yet this correlation may be induced by underlying mechanisms such as access to playgrounds that we may consider problematic influencers of decision making. An effort to use physical fitness as a signal of mental discipline may lead us to prefer wealthier students solely because physical fitness signals access to a playground, access to a playground signals high socioeconomic status, and high socioeconomic status leads to greater access to job networks. Put informally, a machine may discover that squash players or competitive rowers have unusual success in the banking profession. But such a correlation may be driven by how exposure to these sports correlates with membership in sociocultural groups that have traditionally dominated the profession—not some occult relationship between a squash serve and the ability to assess the value of a trade agreement.
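The confounded pathway just described is easy to reproduce in simulation. In the sketch below (a toy model with invented effect sizes, purely for illustration), socioeconomic status drives both physical fitness and later success, while fitness has no direct effect on success at all; a correlation between the two nonetheless appears:

```python
import random

random.seed(0)

# Hypothetical data-generating process (effect sizes invented):
# socioeconomic status (SES) raises both physical fitness (access to
# playgrounds and coaching) and later success (access to job networks).
# Fitness has NO direct effect on success.
n = 100_000
data = []
for _ in range(n):
    ses = random.gauss(0, 1)
    fitness = 0.7 * ses + random.gauss(0, 1)
    success = 0.7 * ses + random.gauss(0, 1)
    data.append((fitness, success))

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    m = len(pairs)
    mx = sum(x for x, _ in pairs) / m
    my = sum(y for _, y in pairs) / m
    cov = sum((x - mx) * (y - my) for x, y in pairs) / m
    vx = sum((x - mx) ** 2 for x, _ in pairs) / m
    vy = sum((y - my) ** 2 for _, y in pairs) / m
    return cov / (vx * vy) ** 0.5

# Fitness predicts success purely through the confounder.
print(round(corr(data), 2))
```

A learner shown only fitness and success would find a solid correlation (about 0.33 under these invented parameters) and happily use fitness as a predictor, while in effect selecting on socioeconomic status.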


Simon DeDeo

Reasoning about Causes One solution to the problems of the previous section is to consider all measurable properties “guilty until proven innocent.” In this case, we base decision making only on those properties for which a detailed causal mechanism is known, and known to be ethically neutral. A focus on the causal mechanisms that play a role in decision making is well known; to reason morally, particularly in the public sphere, is to invoke causation.2 For example, if we wished to use physical fitness in scholarship deliberations, we would begin by proposing a causal narrative: how a student’s character could lead them to a desire to accomplish difficult tasks through persistence; how this desire, in the right contexts, could cause them to train for a competitive sport; and how this training would cause them to improve in quantitative measures of physical fitness. We would be alert for signs that our reasoning was at fault—if excellence at handball is no more or less a signal of grit than excellence at squash, we should not prefer the squash player from Manhattan to the handball player from the Bronx. This kind of reasoning, implicit or explicit, is found almost everywhere people gather to make decisions that affect the lives of others. Human-readable accounts of why someone failed to meet the bar for a scholarship, triggered a stop and frisk, or was awarded a government contract are the bread and butter for ethical debates on policy. These usually demand we explain both the role different perceptions played in the decision-making process (i.e., on what basis the committee made the decision it did) and the causal origin of the facts that led to those perceptions (i.e., how it came about that the person had the qualities that formed the basis of that decision). 
Even if we agreed on a mechanism connecting a student’s character and their physical fitness, we might be concerned with the causal role played by, say, socioeconomic status: a student’s character may lead them to the desire to accomplish difficult tasks, but their socioeconomic status may rule out access to playgrounds and coaching. Should we agree on this new causal pathway, it might lead us to argue against the use of physical fitness, or to use it to make decisions only within a particular socioeconomic category. Clash of the Machines The demand for a causal account of both the origin of relevant facts and their use is at the core of the conflict between ethical decision making and the use of big data. This is because the algorithms that use these data do not make explicit reference to causal mechanisms. Instead, they gain their power from the discovery of unexpected patterns, found by combining coarse-graining properties in often-uninterpretable ways. “Why the computer said what it did”—why one candidate was rated higher than another, for example—is no longer clear. On the one side, we have the inputs: data on each candidate. On the other side, we have the output: a binary classification (good risk or bad risk), or
perhaps a number (say, the probability of a mortgage default). No human, however, wires together the logic of the intermediate step; no human dictates how facts about each candidate are combined together mathematically (“divide salary by debt,” say) or logically (“if married and under twenty-five, increase risk ratio”). The programmer sets the boundaries: what kinds of wirings are possible. But they allow the machine to find, within this (usually immeasurably large) space, a method of combination that performs particularly well. The method itself is usually impossible to read, let alone interpret. When we do attempt to represent it in human-readable form, the best we get is a kind of spaghetti code that subjects the information to multiple parallel transformations or unintuitive recombinations, and even allows rules to vote against each other or gang up in pairs against a third. Meanwhile, advances in machine learning generally amount to discovering particularly fertile ways to constrain the space of rules that the machine has to search, or to finding new and faster methods for searching it. They often take the form of black magic: heuristics and rules of thumb that we stumble on, and that have unexpectedly good performance for reasons we struggle to explain at any level of rigor. As heuristics are stacked on top of heuristics, the impact of these advances is to make the rules more tangled and harder to interpret than before. (This poses problems beyond the ethics of decision making; the ability of high-powered machines to achieve increased accuracy at the cost of intelligibility also threatens certain avenues of progress in science more generally [Krakauer et al. 2010].) As described by Alaimo and Kallinikos (this volume), the situation for the ethicist is further complicated because the volume of data demands that information be stripped of context and revealing ambiguity. Quite literally, one’s behavior is no longer a signal of the content of one’s character. 
Even if we could follow the rules of an algorithm, interpretation of the underlying and implicit theory that the algorithm holds about the world becomes impossible. Because of these problems, eliminating protected categories from our input data cannot solve the problem of interpretability. It is also true that knowledge of protected categories is not in itself ethically problematic and may in some cases be needed. It may aid us not only in the pursuit of greater fairness but also in the pursuit of other, unrelated goals. The diagnosis of diseases with differing prevalence in different groups is a simple example of this further complication. Consider the organ transplant case and a protected category {a,b}. Individuals of type a may be subject to one kind of complication, and individuals of type b may be equally subject to a different kind. Given imprecise testing, knowledge of an individual’s type may help in determining who are the best candidates from each group, improving survival without implicit reliance on an ethically problematic mechanism. In other cases, fairness in decision making might suggest we take into account the difficulties a candidate has faced. Consider the awarding of state scholarships to a summer camp. Of two candidates with equal achievements, we may wish to prefer the student who was likely to have suffered racial discrimination in receiving early opportunities. To do this rebalancing, we must, of course, come to learn the candidate’s race.
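The medical case can be made concrete with a toy Bayes calculation, all numbers invented for illustration. Suppose a complication afflicts the two groups at different base rates, and a single imprecise test flags it; knowing the group lets us interpret the same positive result differently, with no ethically problematic mechanism involved:

```python
def posterior(prior, sensitivity=0.9, false_pos=0.1):
    """P(complication | positive test), by Bayes' rule."""
    p_positive = sensitivity * prior + false_pos * (1 - prior)
    return sensitivity * prior / p_positive

# Invented base rates: the complication afflicts 30 percent of
# group a but only 5 percent of group b.
prior_group_a = 0.30
prior_group_b = 0.05

# The same imprecise positive test warrants different conclusions.
print(round(posterior(prior_group_a), 2))  # 0.79
print(round(posterior(prior_group_b), 2))  # 0.32
```

A positive test leaves a group-a patient far more likely to suffer the complication than a group-b patient, so allocating organs without knowledge of the category would misjudge both.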

Even when fairness is not an issue, knowledge of protected categories may aid decision makers well beyond the medical case described above. An undergraduate admissions committee for an engineering school might wish to rank a high-performing female applicant above a similarly qualified male, not out of a desire to redress wrongs or achieve a demographically balanced group, but simply because her achievements may be the more impressive for having been accomplished in the face of overt discrimination or stereotype threat (Spencer, Steele, and Quinn 1999). A desire to select candidates on the basis of a universal characteristic (in this case, perhaps grit; see discussion above) is aided by the use of protected information. In the organ transplant case, the knowledge of correlations may be sufficient. I need not know why group a and group b have the difference they do—only that I can gain predictive knowledge of their risk profiles by distinct methods. In the other two cases, however, discussions about how, when, and why to draw back the veil of ignorance (Rawls 1985) lead us to conversations about the causes and mechanisms that underlie inequities and advantages. In sum, it is not just that computers are unable to judge the ethics of their decision making. It is that the way these algorithms work precludes their analysis in the very human ways we have learned to judge and reason about what is, and is not, just, reasonable, and good. Algorithmic Solutions We find ourselves in a quandary. We can reject the modern turn to data science, and go back to an era when statistical analysis relied on explicit causal models that could naturally be examined from an ethical and political standpoint. In doing so, we lose out on many of the potential benefits that these new algorithms promise: more efficient use of resources, better aid to our fellow citizens, and new opportunities for human flourishing. 
Or, we can reject this earlier squaring of moral and technocratic goals, accept that machine-aided decision making will lead to discrimination, and enter a new era of euphemisms, playing a game, at best, of catch as catch can, and banning the use of these methods when problems become apparent. Neither seems acceptable, although ethical intuitions urge that we trade utility (the unrestricted use of machine-aided inference) for more sacred values such as equity (concerns with the dangers of euphemism). Some technical progress, though, has been made in resolving this conflict. Researchers have begun to develop causal accounts of how algorithms use input data to classify and predict. The contribution propagation method introduced by Will Landecker, Michael D. Thomure, Luís M. A. Bettencourt, Melanie Mitchell, Garrett T. Kenyon, and Steven P. Brumby (2013) and the Bayesian list machines of Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan (2013) are two recent examples that allow us to see how the different parts of an algorithm’s input are combined to produce the final prediction. Contribution propagation is most naturally applied to the layered processing that happens in so-called deep-learning systems.3 As a deep-learning system passes data through a
series of modules, culminating in a final prediction, contribution propagation allows us to track which features of the input data, in any particular instance, lead to the final classification. In the case of image classification, for example, it can highlight regions of a picture of a cat that led it to be classified as a cat (perhaps the ears, or forward-facing eyes); in this way, it allows a user to check to make sure the correct features of the image are being used. Contribution propagation can identify when an image is being classed as a cat solely on the basis of irrelevant features, such as the background featuring an armchair—in which cats are often found. This makes contribution propagation a natural diagnostic for the “on what basis” problem: which features were used in the decision-making process. Applied to a complex medical or social decision-making problem, it could highlight the relevant categories, and whether they made a positive or negative contribution to the final choice. The Bayesian list machine takes a different approach, and can be applied to algorithms that rely on so-called decision trees. A decision tree is a series of questions about properties of the input data (“is the subject over the age of twenty-one?”; “does the subject live on the south side?”) that result in a classification (“the subject is high risk”). Because the trees are so complex, however, with subtrees of redundant questions, it can be hard to interpret the model of the world that the algorithm is using. The Bayesian list machine circumvents this problem by “pruning” trees to find simple decision rules that are human readable. Neither advance can solve the full problem: simply knowing how the variables were combined does not provide an explanation for why the variables were combined in that fashion. 
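The sort of human-readable rule list that such pruning aims at can be pictured as an ordered set of if-then questions, where the first matching rule wins. The rules, labels, and attributes below are invented for illustration and are not drawn from Letham and colleagues' actual model:

```python
# A hypothetical pruned decision list: ordered rules, first match wins.
# Each rule is (question, label); the final None rule is the default.
rules = [
    (lambda s: s["age"] < 21 and s["prior_offenses"] > 2, "high risk"),
    (lambda s: s["prior_offenses"] == 0, "low risk"),
    (lambda s: s["age"] > 50, "low risk"),
    (None, "medium risk"),
]

def classify(subject):
    """Walk the list and return the label of the first matching rule."""
    for question, label in rules:
        if question is None or question(subject):
            return label

print(classify({"age": 19, "prior_offenses": 3}))  # high risk
print(classify({"age": 35, "prior_offenses": 1}))  # medium risk
```

Each prediction arrives together with the rule that produced it, which is precisely the “on what basis” account that an opaque tree ensemble cannot supply.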
While both methods allow us to look inside previously opaque black boxes, they leave us uncertain of the causal mechanisms that led to this or that combination of variables being a good predictor of what we wish to know. We may know “on what basis” the decision was made, but not how it came about that the individuals in question met the criteria. Technical advances may go a long way to solving this second, more pressing problem. A long tradition in artificial intelligence research relied on building explicit models of causal interactions (Pearl 1997). Frameworks such as Pearl’s (2000) causality attempt to create, in a computer, a mental model of the causes in the world expressed graphically, as a network of influences. This is, in spirit, similar to an earlier attack on the artificial intelligence problem—one often described as “good old fashioned” artificial intelligence, or GOFAI, named by John Haugeland (1989). GOFAI approaches try to make intelligent machines by mimicking a particular style of thinking—the representation of thoughts in a mathematical syntax, and their manipulation according to a fixed and internally consistent set of rules. Judea Pearl’s (2000) causal networks can be “read” by a human, and in a logically consistent fashion, causal language can be used in moral explanations. Pearl causality has had widespread influence in the sciences. But it does not yet play a role in many of the machine-learning algorithms, such as deep learning or random forests, in widespread use for monitoring and prediction today. One day, a framework such as this could
provide a “moral schematic” for new algorithms, making it possible for researchers, policy makers, and citizens to reason ethically about their use. That day has not yet arrived. Information Theory and Public Policy In the absence of causal models that allow us to discuss the moral weight of group-dependent outcomes, progress is still possible. In this section, we will show how to encode a simpler goal: that decisions made by decision makers do not correlate with protected variables at all. One might informally describe such a system as “outcome equal.” Correctly implemented, our solution completely de-correlates category and outcome. Imagine you have heard that a person received a scholarship; using the outcome equal solution we present here, this fact would give you no knowledge of the race (or sex, or class, as desired) of the individual in question. Whether such a goal is desirable or not in any case is, and has been, constantly up for debate, and we will return to this in our conclusions. The existence of a unique mathematical solution to this goal is not only of intrinsic interest. It also provides an explicit example of how technical and ethical issues intertwine in the algorithmic era. The exact structure of the mathematical argument has moral implications. The method we propose postprocesses and cleans, or corrects, prediction outputs so as to eliminate the possibility that the output of an algorithm correlates with a protected category. At the same time, it will preserve as much as possible the algorithm’s predictive power. The correction alters the outputs of an algorithm, in contrast to recent work (Feldman et al. 2014) that has considered the possibility of altering an algorithm’s input. Our recommendations here are also distinct from those considered by Sonja Starr (2014), which seek to exclude some variables as inputs altogether. 
Indeed, here, in order to correct for the effects of correlations, our calculations require knowledge of protected categories to proceed. To determine how to do this correction, we turn to information theory. We want to predict a particular policy-relevant variable S (say, the odds of a patient surviving a medical procedure, committing a crime, or graduating from college) and have at our disposal a list of properties, V. The list V may be partitioned into two sublists, one of which, U, is unproblematic, while the other, W, consists of protected variables such as race, sex, or national origin. Given our discussion above, making a policy decision on the basis of Pr(S|V) may well be unacceptable because V contains information about protected categories. If it is unacceptable, so is using the restricted function Pr(S|U), because U correlates with W (the wrong side of the tracks problem). In addition, use of the restricted function throws away the potentially innocuous use of protected categories. We wish to find the distribution that avoids correlating with protected variables while minimizing the loss of predictive information that this imposes. The insensitivity condition for this “policy-valid” probability, PrX, is
∑_u PrX(s, u, w) = Pr(s) Pr(w). (1)

Equivalently, PrX(s | w)—the probability of a protected category w having outcome s—should be independent of w, given the true distribution, Pr(w), of that category in the population. Our principle is thus that from knowledge of the outcome alone one cannot infer protected properties of an individual. In the two examples above, allocation according to PrX would mean that if you learn that a person received a lifesaving transplant or was subject to additional police surveillance, you do not gain information about their race. There are many PrX that satisfy the constraint above. To minimize information loss, we impose the additional constraint that it minimize

KL(PrX(S, V), Pr(S, V)), (3)

where KL is the Kullback-Leibler divergence,

KL(P, Q) ≡ ∑_y P(y) log [P(y) / Q(y)].

Minimizing the Kullback-Leibler divergence means that decisions made on the basis of PrX(S, V) will be “maximally indistinguishable” from the full knowledge encapsulated in Pr(S, V) (the Chernoff–Stein Lemma; see Cover and Thomas 2006).4 Informally, we destroy the correlations, but do our best to retain the remainder of the original predictive distribution, thus retaining, as best we can, the power of the algorithm. To get as close as possible to the original prediction, we can minimize equation 3 using Lagrange multipliers. We require |S||W| + 1 multipliers—one to enforce a normalization for PrX, and the remainder to enforce the distinct constraints implied by equation 1. We find

PrX(s, u, w) = Pr(s, u, w) Pr(s) / Pr(s | w).
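The correction can be verified numerically. The minimal sketch below applies it to a small invented joint distribution and checks that, afterward, the outcome carries no information about the protected category; the Kullback-Leibler divergence then quantifies the predictive information given up:

```python
import math
from itertools import product

# Invented joint distribution Pr(s, u, w): binary outcome s, one
# unprotected variable u, one protected variable w.
S, U, W = (0, 1), (0, 1), (0, 1)
pr = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.15,
    (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}

def marg_s(s):
    return sum(pr[s, u, w] for u in U for w in W)

def marg_w(w):
    return sum(pr[s, u, w] for s in S for u in U)

def cond_s_given_w(s, w):
    return sum(pr[s, u, w] for u in U) / marg_w(w)

# The correction: PrX(s, u, w) = Pr(s, u, w) Pr(s) / Pr(s | w).
pr_x = {
    (s, u, w): pr[s, u, w] * marg_s(s) / cond_s_given_w(s, w)
    for s, u, w in product(S, U, W)
}

# Insensitivity condition (equation 1): sum_u PrX(s, u, w) = Pr(s) Pr(w),
# so the outcome alone reveals nothing about the protected category.
for s, w in product(S, W):
    lhs = sum(pr_x[s, u, w] for u in U)
    assert abs(lhs - marg_s(s) * marg_w(w)) < 1e-12

# Price paid: the Kullback-Leibler divergence from the original
# predictive distribution, in bits (zero only if already independent).
kl = sum(p * math.log2(p / pr[k]) for k, p in pr_x.items())
print(round(kl, 3))
```

Note that the corrected table remains a properly normalized distribution: summing equation 1 over s and w gives (∑ Pr(s))(∑ Pr(w)) = 1, which the sketch confirms in floating point.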
Knowledge of how predictions, s, correlate with protected variables w, allows us to undo those correlations when using these predictions for policy purposes.5 The method we present here is a mathematically clean solution to a particular and particularly restricted problem. Use of these methods enforces a strict independence between protected categories and the variables one wishes to know. In any specific case, this may or may not be the desired outcome. Populations may be willing to compromise on group-dependent outcomes in favor of other virtues. This can be seen in the ongoing debates over Proposition 209 in California in 1996, and the Michigan Civil Rights Initiative in 2006, both of which forbid the use of information concerning race to rebalance college admission, and the latter of which was affirmed as constitutional in April 2014. This “outcome equal” construction, in other words, does not absolve us of the duty to reason about the methods we use. Rather, it provides a limiting case against which we can compare other approaches. How much does a prediction change when we force it to
be outcome equal? Do the changes we see alert us to potentially problematic sources of the correlations our algorithms use? The relative transparency of decision making in the era before data science has allowed ordinary people to debate questions of group-dependent outcomes on equal terms. Even when bureaucracies, traditions, or a failure to achieve consensus prevent them from implementing the changes they desire, citizens at least have had the ability to debate and discuss what they saw. When we use machines to infer and extrapolate from data, democratic debate on the underlying questions of fairness and opportunity becomes harder, if not impossible, for citizens, experts, and leaders alike. If we do not know the models of the world on which our algorithms rely, we cannot ask if their recommendations are just. Conclusions Machine learning gives us new abilities to predict—with remarkable accuracy, and well beyond the powers of the unaided human mind—some of the most critical features of our bodies, minds, and societies. The machines that implement these algorithms increasingly become extensions of our will (Clark 2008), giving us the ability to infer the outcomes of thought experiments, fill in missing knowledge, and predict the future with unexpected accuracy. Over and above advances reported in the peer-reviewed literature, recent popular accounts provide a sense of the growing use of these tools beyond the academy (Mayer-Schönberger and Cukier 2013; Siegel 2013), and their use seems likely to accelerate in both scale (number of domains) and scope (range of problems within a domain). Reliance on powerful but only partially understood algorithms offers new challenges to risk management. For example, predictions may go wrong in ways we do not expect, making it harder to assess risk. 
As discussed elsewhere in this volume (Cate), machine learning also provides new challenges to individual privacy; in a famous example, statisticians at Target were able to infer a teenager’s pregnancy, from her shopping habits, before it became physically evident to her father (Duhigg 2012). This chapter demonstrated the existence of a third challenge. This challenge persists even when algorithms function perfectly (when their predictions are correct, and their uncertainties are accurately estimated), when they are used by well-meaning individuals, and when their use is restricted to data and the prediction of variables explicitly consented to by participants. To help overcome this third challenge, we have presented the ambitious goal of reverse engineering algorithms to uncover their hidden causal models. We have also presented, by way of example, more modest methods that work in a restricted domain. In both cases, progress requires that we “open the algorithmic box,” and rely on commitments by corporations and governments to reveal important features of the algorithms they use (Medina, forthcoming). Judges already use predictive computer models in the sentencing phase of
trials (Starr 2014). There appears to be no awareness of the dangers these models pose to just decision making—despite the influence they have on life-changing decisions. These are real challenges, but there is also reason for optimism. Mathematical innovation may provide the means to repair unexpected injustices. The same methods we use to study new tools for computer-aided prediction may change our views on rules we have used in the past. Perhaps most important, they may reempower policy makers and ordinary citizens, and allow ethical debates to thrive in the era of the algorithm. The very nature of big data blurs the boundary between inference to the best solution and ethical constraints on uses of that inference. Debates concerning equity, discrimination, and fairness, previously understood as the domain of legal theory and political philosophy, are now unavoidably tied to mathematical and computational questions. The arguments here suggest that ethical debates must be supplemented by an understanding of the mathematics of prediction. And they urge that data scientists and statisticians become increasingly familiar with the nature of ethical reasoning in the public sphere. Acknowledgments I thank Cosma Shalizi (Carnegie Mellon University) for conversations and discussion, and joint work on the “outcome equal” solution presented above. I thank John Miller (Carnegie Mellon University), Chris Wood (Santa Fe Institute), Dave Baker (University of Michigan), Eden Medina (Indiana University), Bradi Heaberlin (Indiana University), and Kirstin G. G. Harriger (University of New Mexico) for additional discussion. This work was supported in part by a Santa Fe Institute Omidyar Postdoctoral Fellowship.

II  Big Data and Society

4  What If Everything Reveals Everything? Paul Ohm and Scott Peppet

In this chapter, we propose a wild conjecture and ponder what might happen if it turns out to be even somewhat true: because of big data advances in data analytics, we may soon learn that “everything reveals everything.” Specifically, every fact about an individual reveals every other fact about that individual. If this is even partly true, and we have good reason to believe it is, we think the disruption to society will be profound. Consider this chapter a thought experiment, a worst-case exploration of what everything reveals everything would mean for society in general, and law and policy in particular. Everything Reveals Everything? We start with the obligatory definitions: What do we mean by “big data”? From among the many definitions that have been proposed, we prefer one by Microsoft: “The process of applying serious computing power—the latest in machine learning and artificial intelligence—to seriously massive and often highly complex sets of information” (Microsoft News Center 2012). We like how this definition focuses on power and analytics, and are not concerned with internecine squabbles among computer scientists about more formal definitions. Narrowing further, we are most concerned about big data techniques that harness the power of inference. Most big data techniques are valuable because they help human analysts infer this from that. Call this activity correlation or prediction, but regardless of the label, the point is that big data helps us “know,” within some degree of certainty, more than the data can tell us at face value. Big data techniques that do not involve inference fall outside our present discussion. As a final filter, we are focused solely on big data techniques that assist inference making about human beings. Big data techniques are also used to study natural and inanimate phenomena—from weather to petroleum production to sunspot activity—but these do not implicate privacy or antidiscrimination, so we do not concentrate on them. 
Yet it is important to highlight that this does not mean we are interested only in data sets containing
information directly about people (Shilton, this volume). Particularly with the rise of cheap, distributed, networked sensors—fueling the so-called Internet of Things—data sets about inanimate objects will bear the ghostly traces of the human beings who passed by or interacted with those sensors (Burdon and Andrejevic, this volume). Thus, carbon dioxide monitors might tell us something about a human being’s commuting patterns or a power plant employee’s work habits. Next, we turn to the conjecture. Does big data mean that everything will soon reveal everything? We mean by this the rather-outlandish claim that given any data set containing information about individuals, a data analyst can accurately infer any other true information about those individuals. To make this claim seem remotely plausible, and better develop an intuition about what we mean by it, consider a few of the headline-grabbing examples from the past few years that point in the direction of everything reveals everything. Shopping Habits Reveal Pregnancy We start with probably the best-known and most frequently cited illustration. The New York Times revealed in 2012 that the Target Corporation, a giant US retailer, had asked its data analytics team to scrutinize lists of its customers’ purchases to develop a formula that could reveal which of its shoppers was pregnant and when they were due. According to the report, the team found great success, identifying a list of about twenty-five products that can be watched for subtle shifts in shopping habits that indicate pregnancy. For instance, “Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date” (Duhigg 2012). 
Floor Protectors Reveal Creditworthiness Another oft-repeated claim is that credit card companies have determined that prospective borrowers who demonstrate their willingness to protect their items—such as by purchasing hardwood floor protectors for their furniture legs—also tend to be better credit risks (Strahilevitz 2013). Software Choices Reveal Wealth and Innovation One study reveals that the travel website Orbitz shows Mac users more expensive hotels than it shows to Windows users (D. Mattioli 2012). Another study suggests that employees who use the Firefox or Chrome web browsers tend to be more innovative workers than their counterparts using Safari or Internet Explorer. Many Things Reveal Unique Identity Finally, study after study has added to a long and surprising list of behaviors and characteristics that reveal unique identity. This has led us and many others to debate whether there can

What If Everything Reveals Everything? 


ever be useful, anonymous data (Ohm 2010; Bambauer 2011). Exploiting human uniqueness for identification predates big data, of course, with a lineage tracing back to the nineteenth century (if not earlier), and the rise of Alphonse Bertillon’s body measurement system and its eventual replacement by fingerprinting (Cole 2002). These methods are direct antecedents of other forms of “traditional” biometric identification techniques developed and used by law enforcement, such as facial recognition, gait analysis, and DNA analysis. But everything reveals everything gains even more support from a less traditional series of studies that have begun to reveal a less intuitive, more surprising list of unique identifiers. Space prevents us from citing every study of this kind, but consider this list of things that researchers suggest uniquely identify each of us: movie preferences (Narayanan and Shmatikov 2008), Internet search queries (Barbaro and Zeller 2006), physical location (de Montjoye et al. 2013), the collection of fonts on a computer (Eckersley 2010), and the way a person fills in Scantron bubbles (Calandrino, Clarkson, and Felten 2011).

Searching for the Right Modifier

Does everything reveal everything? Stated so starkly, this cannot literally be true. Basic information theory proves that fundamental limits on the entropy of information serve as a roadblock to this strongest form of the claim that everything reveals everything. We believe, however, that we can modify the strongest form of the claim to add limitations that render the statement true, or at least much more difficult to prove false. To take an example from the opposite extreme, “everything reveals itself” is a claim that is both plainly true and monumentally uninteresting. Given any data set, an analyst can simply reproduce the data verbatim. We have no inflexible preconceptions about the nature of the modifiers that will make the strong claim true.
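The arithmetic behind the catalog of unique identifiers above is worth making explicit: singling out one person among n candidates requires roughly log2(n) bits of information, and the entropies of roughly independent attributes add. The following back-of-the-envelope sketch is purely illustrative; the per-attribute value counts, and the assumptions of uniformity and independence, are our own simplifications rather than empirical figures:

```python
from math import log2

# Bits needed to single out one person in a population of 8 billion.
world = 8_000_000_000
print(f"bits to identify anyone on Earth: {log2(world):.1f}")

# Assumed value counts for three innocuous-looking attributes.
# Uniform distributions and independence are generous simplifications.
attribute_values = {
    "ZIP code": 40_000,
    "birth date (80-year span)": 365 * 80,
    "gender": 2,
}
total_bits = 0.0
for name, k in attribute_values.items():
    bits = log2(k)
    total_bits += bits
    print(f"{name}: {bits:.1f} bits")
print(f"combined: {total_bits:.1f} bits")
# About 31 bits: three mundane fields come within a small factor of
# uniquely identifying everyone alive, which is why seemingly modest
# data sets (fonts, locations, Scantron bubbles) so often suffice.
```

The point is not the specific numbers but the accounting: each attribute a data set records spends down the remaining anonymity budget.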
We suspect that as the state of the art of big data advances, we will learn that many different modifiers will, independently of one another, lead to true claims. Consider several possible classes of modifiers that may lead to true variations on everything reveals everything.

Quantity: “Everything with a Lot of Data Reveals Everything”

It may be that the strongest form of the hypothesis fails mostly for small data sets, but with large data sets, the claim may be true. Quantity, in this case, might require a large number of records in the data set (rows in the table) or a large number of fields in the data set (columns), or maybe both.

Entropy: “Everything with More Than N Bits of Entropy Reveals Everything”

This relates to the quantity modifier, but takes into account not only the size of the data but also its Shannon entropy—that is, not just the amount of data but the diversity of the information the data contain.
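The distinction between size and diversity can be made concrete with the standard Shannon entropy computation over a data column. This is a minimal sketch; the function name and the toy columns are our own:

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    # Equivalent to -sum(p * log2(p)), written to avoid a negative zero.
    return sum((c / n) * log2(n / c) for c in counts.values())

# Eight records where every value is identical: plenty of rows, zero diversity.
uniform_column = ["smith"] * 8
# Eight records with eight distinct values: the same size, maximal diversity.
diverse_column = ["a", "b", "c", "d", "e", "f", "g", "h"]

print(shannon_entropy(uniform_column))  # 0.0 bits
print(shannon_entropy(diverse_column))  # 3.0 bits
```

On this view, a data set of a million identical rows would reveal little, while a compact but diverse data set might reveal a great deal.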


Paul Ohm and Scott Peppet

Accuracy: “Everything Reveals Everything, Give or Take”

It may be that methods of computational inference (such as machine learning) will produce accurate guesses about other information that is not present in the data being analyzed. Whether the uncertainty represented by the resulting level of accuracy is operationally acceptable depends on the nature of the application. We would tolerate much more inaccuracy for advertisement targeting than we would for credit scoring, and even less for detaining citizens.

Utility: “Everything Reveals Everything Useful”

It may be that although many aspects of human behavior may be shielded for some reason from the power of statistical inference, “useful” information, however defined, may more likely give itself up to inference. Thus, economically useful information such as credit score or income level may be readily obtainable from many sources of data about individuals. Because utility can be defined in many different ways, consider a few corollaries to this modifier:

Corollary 1—Commercially Useful: “Everything Commercially Useful Reveals Everything Else Commercially Useful”

It may be that all commercial information about a person tends to correlate with one another. If so, each database of information about one dimension of the commercial lives of individuals will reveal every other dimension about their commercial lives.

Corollary 2—Protected by Law: “Everything Reveals Everything Protected by Law”

The other modifiers we have listed were identified using our (admittedly underdeveloped) intuitions about information theory and statistical inference (e.g., more data are probably more revealing than less data). In contrast, this modifier we list because of the disruption it would wreak on law if proved valid. To elaborate further, consider different possible laws that might be implicated.

Privacy Law

Many privacy laws draw lines between categories of sensitive, covered information and nonsensitive, uncovered information.
Other privacy laws single out information as “personally identifiable” or not. Powerful techniques for inference may destroy the rationale for drawing lines like these by allowing data analysts to infer information from the “protected” side of these laws using only information from the “unprotected” side (see also DeDeo, this volume).

Antidiscrimination Law

Antidiscrimination law defines protected classes, such as (but not limited to) race, ethnicity, sex, religion, and age. Advances in statistical inference may reveal the otherwise-unspecified



values for some of these categories, opening the door to intentional, invidious discrimination.

Consumer Protection Law

Related to both privacy and antidiscrimination law, some laws draw lines around categories of information that may not legitimately be used for certain important life decisions. The Fair Credit Reporting Act (FCRA), for instance, limits the activities of credit reporting agencies.1 Especially with the rise of smartphones and mobile apps, the FCRA may apply to many new types of organizations and activities, and this result may turn on data that are transformed through inference (FTC 2012). At a practical level, consider some examples. If the strong version of everything reveals everything is untrue, to what extent are the following statements likely to be true?

• Everything reveals your FICO score
• Everything reveals your identity
• Everything reveals your status as a member of a legally protected class (race, gender, age, etc.)
• Everything reveals your socioeconomic status
• Everything reveals your propensity to commit a felony
• Everything reveals your pro- or antisocial personality tendencies
• Everything reveals your employability
• Everything reveals your educational talent or grades

These statements become far more plausible than the strong claim. One can imagine easily that many different data sets might reveal a person’s credit score, identity, race, gender, wealth level, or propensities. By constraining the first “everything” to specific data sets, we can posit even likelier relationships:

• Your driving habits reveal your FICO score
• Your purchasing habits reveal your FICO score
• Your Fitbit data (or other exercise habits) reveal your FICO score
• Your home electrical usage data reveal your FICO score
• Your eating habits reveal your FICO score
• Your sleep habits reveal your FICO score
• Your web surfing or cell phone habits reveal your FICO score
• Your reading or intellectual habits reveal your FICO score

Based on state-of-the-art data science and our own intuitions, we think many of these statements are plausible guesses about the types of data that might reveal a consumer’s creditworthiness.
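The mechanism that would make such relationships plausible, a hidden trait driving both an innocuous behavior and the sensitive outcome, can be illustrated with a toy simulation. Everything here is synthetic and invented for illustration; the variable names and effect sizes are ours, and the floor-protector variable merely echoes the anecdote earlier in the chapter. The sketch shows only how a proxy can split a population without ever observing the sensitive value directly:

```python
import random
from statistics import mean

random.seed(0)

# Entirely synthetic population: a single hidden trait ("carefulness")
# influences both an innocuous purchase and a credit score. The purchase
# record never observes the score, and the score never observes the purchase.
population = []
for _ in range(10_000):
    trait = random.random()                       # hidden, in [0, 1]
    buys_floor_protectors = random.random() < trait
    score = 300 + 550 * (0.7 * trait + 0.3 * random.random())
    population.append((buys_floor_protectors, score))

# Yet the innocuous purchase alone cleanly splits the sensitive outcome.
buyers = [s for b, s in population if b]
others = [s for b, s in population if not b]
print(f"mean score, protector buyers: {mean(buyers):.0f}")
print(f"mean score, everyone else:    {mean(others):.0f}")
```

Nothing in the simulation requires the purchase to cause good credit; correlation through a shared cause is enough for pricing, which is precisely what makes such inferences hard for a consumer to anticipate or contest.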



Figure 4.1 illustrates the strong claim discussed above. The strong claim would suggest that all the input data sets—driving habits, purchasing habits, exercise habits, and so on—predict or reveal all the inferential conclusions, such as FICO score, identity, or protected class status. We do not claim that figure 4.1 is accurate. We are willing, however, to predict confidently that the following weak—or at least only semistrong—statements are most likely true:

• Most rich data sets about personal habits and behavior (such as driving, sleep, exercise, and purchasing habits) reveal most types of legally protected or sensitive information (such as creditworthiness, protected class status, and employability)
• A large number of surprising individual data points (such as whether you purchase felt floor protectors to place on the bottom of your furniture) reveal most types of legally protected or sensitive information
• In combination (such as when one combines driving data with exercise data with purchasing data), these revelations become more accurate and reliable

Finally, because this section is studded with qualifiers such as “may” and “might,” it is important to underscore that we are reporting only what is known in the public domain of scientific research and journalism. The pursuit to test and understand the outer limits of everything reveals everything is a lucrative one, and we have no doubt that commercial actors, from insurers to Silicon Valley start-ups and everything in between, are already mapping this territory. Many of the revelations shown in figure 4.1 have undoubtedly already been proved or disproved by big data firms with proprietary data sets and algorithms.

Figure 4.1 The strong claim in action



What Happens to Privacy, Self, and Society?

What will it mean if even a weak version of everything reveals everything turns out to be true? The most honest answer is that we don’t know. But let us for a moment speculate about how our lives and society would change in such a big-data-driven world. First and most basically, we would each be revealed in ways we have not been previously. Whereas currently you might enjoy reasonably low auto insurance premiums because you are a middle-aged, gainfully employed, seemingly low-risk insured, data from other sources in this new world could reveal your propensity to drive too fast. Perhaps your failure to establish a consistent exercise regime (or is it your ability to exercise regularly?) correlates with risky driving; perhaps the fact that you buy a men’s adventure-oriented magazine at the airport newsstand will reveal you as a likely bad driver; perhaps the ice cream flavor you prefer, kind of mobile phone you use, or cut of your trousers will be the correlating input. Whatever it is, and regardless of whether you have successfully fooled yourself into believing that you are a safe driver, the data may identify you as high risk when you previously would have been assumed otherwise. At one level this is a victory for efficiency and accuracy, of course. Yet from the individual’s vantage point—from our experience on the ground—this could feel quite bizarre. We are comfortable paying more for insurance when we are teenagers or if we have an accident. Those ex ante predictions and ex post adjustments make sense. But will it feel comfortable paying more for your car insurance because data analysis has revealed that your choice of brightly colored clothing, combined with the fact that you frequently eat out at fast-food restaurants, indicates you are actually in a higher-risk population of drivers than previously assumed? One concern is that we will begin to modify our behavior in undesirable ways to avoid these consequences.
Call this the “khaki speculation”—we all start wearing drab khaki clothing to avoid standing out from the crowd. It is difficult to argue in the abstract about whether such behavioral modification would be a socially good or bad thing; modifying certain behavior through more accurate pricing of that behavior might be socially beneficial. For example, if people smoked less or exercised more to change their auto insurance premiums, that could be all for the good. But note that there are at least three possible oddities here. One might not be modifying a socially harmful behavior (such as smoking) but instead a socially useful or neutral one. Changing the color of our clothing may not do much for society other than lead to drab fashion. In addition, we are not changing the actual behavior being priced. In other words, if my clothing color leads to higher auto insurance costs, I might change my clothing color, but not change my actual driving behavior. The proxy becomes paramount, while the actual trait in question remains unobserved. (Of course, there is simultaneously an explosion in Internet of Things sensor devices making possible direct observation of more and more



human activity, but that is another issue [Peppet 2014; Burdon and Andrejevic, this volume].) Finally, the proxies will keep changing. Indeed, if everyone starts wearing drab clothing, then clothing color will no longer correlate with anything. We will have eliminated the ability for an insurer to charge us premium pricing, but in the process also eliminated a part of our freedom and creativity. At the least we can say that such a world might be more boring. More powerfully, one might worry that it will chill creative expression, the development of an individual sense of self unique from others, or the mental and emotional space needed to explore oneself and one’s preferences. Instead of behavior modification being the problem, there is also the distinct possibility of a quite opposite threat: the inability to modify one’s behavior to respond adequately to big data’s inferences. As indicated above, many of these inferences are drawn by private actors using private algorithms acting on private data sets. If everything reveals everything, then how is a consumer to know what led to the increase in their car insurance premium? Was it market fluctuation, or their new car? Or was it some seemingly unrelated choice—to give up tennis lessons or begin traveling overseas more often—that changed the equation? If so many different aspects of life correlate with and begin to influence each other, how could you ever deduce the behavior modifications necessary to push back against such economic consequences? This is the more likely, and in many ways scarier, possibility. If everything even weakly reveals everything, there isn’t much one can do to influence big data inferences. A consumer is unlikely to know the relevant inputs, algorithms, or weightings among variables. Alongside being revealed, then, there is the added threat of being demoralized. “They” know you are likely to drive poorly, even if you don’t feel that way. 
Driving better may not help, and it might be unclear how “they” are arriving at their predictions. So why bother worrying? This could produce paranoia, anger, or at least a rising sense of claustrophobia. Absent radical transparency in the big data processes driving such inferences—something we doubt could or will occur—we may resign ourselves to being scored, assessed, and priced without much insight into exactly how or why (for more on transparency in big data algorithms, see DeDeo, this volume). More fundamentally, perhaps our understanding of human nature will change. One thing can only reveal another thing if the two are somehow connected, integrated, or causal. If many different aspects of your behavior—such as your exercise or driving habits— reveal many other aspects of your person—such as your creditworthiness, propensity to commit crime, or conscientiousness as an employee—it means that our personality and character are fairly tightly bound. In other words, we may discover that we are more consistent than perhaps we feel—that although life can feel erratic and momentary (e.g., “I just felt like eating a ham sandwich today, but generally I stick to my low-salt diet”), in fact we are predictable, patterned, and numbingly consistent. Our subjective experience of spark and creativity may just be an illusion, after all. Perhaps we will not merely be revealed as



poor drivers or bad employees but also as relatively boring machines with little room for flex or play. Privacy’s demise will have stripped us not of self-concealment but rather of a real sense of self at all. Or our conception of self could change for a different reason: instead of discovering that we are so consistent as to have little room for creative play in our sense of who we are, we could be led to believe that about ourselves by the pervasive use of big data analytics to sort, categorize, and price us. In other words, this numbing inner consistency might not be true, but we could come to believe it about ourselves anyway. In this scenario, the victory of algorithms as a means to organize social life could lead to the perception that the consistency and rigor of those algorithms accurately describes our inner worlds, even though, in fact, it doesn’t. If your car insurance premiums increase because your reading choices or exercise habits correlate with risk-preferring behavior, and your employer begins to monitor your productivity more closely because your driving and sleep habits suggest sloth, will you start to believe that you are risk preferring and lazy? More than that, will you accept that such traits are your “true,” consistent, stable self, revealed to you by big data, and thus a more accurate depiction of your inner world than your subjective experience of yourself as sometimes this and sometimes that? Big data’s algorithms are, of course, social constructs full of errors, generalizations, and assumptions. They may impose consistency more than discover it. But we may come to believe in them, and hence come to not believe much in ourselves. Lastly, if everything reveals everything, our conception of what it means to be in society with others may change. We have had to accept to date a certain amount of error and uncertainty in our social arrangements. 
For example, we all buy auto insurance knowing that it is not priced perfectly—that some good drivers may pay slightly more than their fair share (if not perfectly identified as good drivers) while some bad drivers may pay less (if not perfectly revealed as bad drivers). That sharing—and norm of sharing—is caused by an information deficit: the inability to separate perfectly the good from the bad, and therefore the social necessity to support each other just a little. Since society’s industrialization, the necessity of living in large-scale social arrangements—with things like commercial insurance pools, a consumer credit system, and so on—has required us to “make do,” despite our information infrastructure’s imperfections.2 That may turn out to be a historical anomaly. If big data leads to more individualized pricing of insurance, credit, and other aspects of economic life, that necessity to be willing to share some social burden or risk with others may slowly dissipate. If I’m a good driver, why should I pay any more for insurance than I absolutely have to—even if it means that you, as someone with an unfortunate driving history, are priced out of driving at all because your personalized premium soars beyond your reach?



What Happens to Privacy Law?

These are speculations, and wild ones. They are not perfectly consistent; they most likely can’t all come to pass. But they are plausible consequences of the possibility that everything reveals everything, and worth pausing to consider. Less speculative are the consequences for privacy law and regulation if everything reveals everything. In short, everything reveals everything will cause a significant disruption to the way we regulate information privacy. To understand why, you need to understand a model of how privacy law works. According to this model, privacy law works by drawing lines. Privacy law is an exceptionally diverse area of law, but line drawing is common to every privacy law ever enacted or contemplated. Privacy laws draw at least four types of lines. Most privacy laws draw lines between personally identifiable and “non–personally identifiable” information. For example, the privacy rules found in the US Health Insurance Portability and Accountability Act (HIPAA) do not apply at all to information deemed non–personally identifiable, such as information that lacks any of eighteen categories of information, including name, address, and date of birth.3 Other privacy laws draw lines based on the purported sensitivity of information. Sensitivity is connected to the probability of privacy harm resulting from the collection, use, or disclosure of that kind of information.
Thus, HIPAA regulates only health information.4 As another example, financial privacy laws such as the US law known as Gramm-Leach-Bliley protect as sensitive the information found in the records of financial institutions.5 Health and financial information have been classified as sensitive for a long time, yet some laws protect information seen as sensitive only recently, such as the Video Privacy Protection Act, which protects the privacy of video-watching habits, and the Driver’s Privacy Protection Act, which regulates the privacy of information stored by departments of motor vehicles.6 A third type of line is drawn by industry sector. This is related to sensitive information boundaries because it reflects a legislative judgment that some sectors deal with more sensitive information than others. For example, the FCRA governs “credit reporting agencies,” meaning companies that provide information about the worthiness of individuals for purposes of credit, insurance, or employment.7 The law is most directly applied to the “big three” credit reporting companies—Equifax, TransUnion, and Experian—although it also extends to less well-known companies that supply information for similar purposes. It might not cover, however, some companies that possess and disclose information similar to the information handled by the big three, depending on the identity and goals of their customers. Finally, a fourth line is drawn by actor. Some privacy laws protect information only when held by a covered actor, even though another type of actor holding the same information would evade regulation. HIPAA is a good illustration of this form, too. It applies only to



information held by covered providers, insurers, and their business associates.8 When other companies—such as health analytics companies (e.g., Fitbit) or “health vault” services (e.g., Microsoft HealthVault)—hold the same information, they are not covered. Big data disrupts privacy law by allowing analysts to make inferences that cross these lines in unprecedented and unpredictable ways. It is as if the lines discussed above demarcated the boundaries between zones of privacy in personal information, a sort of twentieth-century cartography of information. And big data has now, or is about to, redraw the map by erasing most of these boundaries. Consider, say, the line between personally and non–personally identifiable information. One of us has argued that this is a dying approach to protecting privacy (Ohm 2010). Because so much has been shown to reveal identity, it is a quaint notion that information does not deserve regulation or protection simply because it has been “anonymized.” Some so-called anonymized information presents a small risk to personal privacy, but other types of anonymized information reveal identity and thus may deserve regulation. Over time, as so-called reidentification methods advance, information from the first category will move into the second. Researchers who study the human genome, for instance, have increasingly become convinced that small snippets of it can never be rendered anonymous, thanks to many studies that have shown how much entropy remains even in small segments of the genome, opening the door to identification (Gymrek et al. 2013). Although big data’s ability to erase lines has been discussed and debated most readily around identity, it will erase the other three types of lines listed above as well.
It will allow us to know, for example, that some pools of information containing seemingly innocuous bits of information about people reveal much more sensitive information—the way MIT grad students discovered that the pattern with which you “friend” people on a social network reveals sexual orientation. Thus, privacy laws that draw lines between the sensitive and nonsensitive will increasingly be seen to draw them in the wrong places. Similarly, big data will blur the lines between sectors and actors, calling into question laws like HIPAA and the FCRA. In the end, big data’s impact on privacy law will be profound and disruptive. By blurring the boundaries between different types of data sets, big data will make every privacy law ever enacted at least underinclusive or, worse, irrelevant. Those who believe that information privacy is important and hard to protect without law should begin to look for new approaches—those that do not rely so centrally on our ability to draw lines between information contexts.

What Happens to Antidiscrimination Law?

Everything reveals everything isn’t a problem only for privacy law. Big data techniques classify, and classifications by definition discriminate, for good or ill. Information scholars are increasingly turning to the question of whether big data’s discriminations will be benign or



invidious. Mathematician Rebecca Goldin has warned of this problem, as described in “New Rules for Big Data” (2010) in the Economist: Racial discrimination against an applicant for a bank loan is illegal. But what if a computer model factors in the educational level of the applicant’s mother, which in America is strongly correlated with race? And what if computers, just as they can predict an individual’s susceptibility to a disease from other bits of information, can predict his predisposition to committing a crime?

Likewise, in his remarks at the University of California at Berkeley School of Information, White House counselor John Podesta (2014) highlighted the same concerns: Big data analysis of information voluntarily shared on social networks has showed how easy it can be to infer information about race, ethnicity, religion, gender, age, and sexual orientation, among other personal details. We have a strong legal framework in this country forbidding discrimination based on these criteria in a variety of contexts. But it’s easy to imagine how big data technology, if used to cross legal lines we have been careful to set, could end up reinforcing existing inequities in housing, credit, employment, health and education.

Podesta’s Big Data: Seizing Opportunities, Preserving Values—a report to the president—further highlighted this theme (Executive Office of the President 2014). Similarly, Alistair Croll (2013) has stated, “Big data is our generation’s civil rights issue, and we don’t know it. … ‘Personalization’ is another word for discrimination.” Racial and other forms of discrimination are obviously illegal under Title VII.9 Title I of the Americans with Disabilities Act forbids discrimination against those with disabilities, and the Genetic Information Nondiscrimination Act bars discrimination based on genetic inheritance.10 These traditional antidiscrimination laws leave room, though, for new forms of discrimination based on big data analysis. We sort these concerns into three basic categories. First, there are concerns about intentional discrimination based on proxies for protected class characteristics. This is the primary issue expressed to date—by Podesta, for example—that if everything reveals everything, then big data will reveal proxies for race, gender, age, and other protected class status that can then be used in obnoxious (but difficult to detect) ways. The threat here is that big data will hide intentional disparate treatment of protected populations. Second, some are concerned that big data techniques will lead to disparate impacts on racial or other protected minorities, even if unintentional. Nothing prevents discrimination based on a potential employee’s health status, say, so long as the employee does not suffer from what the Americans with Disabilities Act would consider a disability (Roberts 2014). Similarly, antidiscrimination law does not prevent economic sorting based on our personalities, habits, and character traits. Employers are free not to hire those with personality traits they don’t like; insurers are free to avoid insuring—or charge more to—those with risk preferences they find too expensive to insure; lenders are free to



differentiate between borrowers with traits that suggest trustworthiness versus questionable character. As big data analysis reveals more correlations between our traits and characteristics, we may increasingly discover that decisions at least superficially based on conduct—such as not to hire a particular employee because of their lack of exercise discipline—may systematically bias an employer against a certain group if that group does not or cannot engage in that conduct as much as others. Solon Barocas and Andrew D. Selbst (2015), for example, have recently argued that big data analysis based on many different inputs and proxies may lead to economic sorting that has disparate impacts—even if unintentional— on those in protected classes. More fine-grained economic categorization will entrench existing inequalities, which will likely skew in ways that disadvantage the already disadvantaged. Third and finally, big data may lead to new forms of economic categorization that do not burden those currently protected by antidiscrimination law but that nevertheless seem unfair, unjust, or extreme in their economic consequences. Even without the problem of race, age, or gender discrimination, using big data analysis to discriminate between consumers is potentially controversial. If everything reveals everything, this will permit insurers, employers, lenders, and other economic actors more finely to distinguish between potential insureds, employees, and borrowers. From the perspective of economics, this may be beneficial. Put simply, more data will allow firms to separate pooling equilibriums in insurance, lending, and employment markets, leading to efficiencies and increased social welfare. From a legal or policy perspective, however, economic sorting is just not that simple. The public and its legislators tend to react strongly to forms of economic discrimination that economists view as relatively benign. 
For example, price discrimination—charging one consumer more for a good than another because of inferences about the first person’s willingness or ability to pay—may be economically neutral or even efficient, but consumers react strongly against it. Put differently, the real losers if everything reveals everything may be the poor, say, or the relatively uneducated—two groups not protected by antidiscrimination law directly. The American Civil Liberties Union’s Jay Stanley (2012), for instance, has argued that big data inferences allow for various forms of problematic economic sorting. These inferences may “accentuate … the information asymmetries of big companies over other economic actors and allow … for people to be manipulated” (this is first-degree price discrimination— charging person X more or less for a product or service because of information that the seller knows about person X); “accentuate the power differentials among individuals in society by amplifying existing advantages and disadvantages” (for example, the rich and educated do better while the poor do worse); allow people to be sorted unfairly and capriciously based on risk analysis or other inferences (such as a credit card company lowering your credit limit based on behavior scoring because you shop at the same stores as others with poor credit).


Paul Ohm and Scott Peppet

Cynthia Dwork and Deirdre K. Mulligan (2013) have likewise pointed out the importance of these concerns. Antidiscrimination law is relatively unprepared to deal with these implications. Disparate impact doctrine has been contracting, and its future is somewhat uncertain (Barocas and Selbst 2015). Moreover, antidiscrimination law has only addressed inequality to the extent that it disparately harms those in protected classes; it is not designed to address economic inequality more broadly, irrespective of protected class status. As Barocas and Selbst (ibid.) have put it, “Data mining may require us to reevaluate why and whether we care about not discriminating.” Purely procedural solutions are unlikely to have much effect, but substantive commitments to protect the underprivileged or disproportionately burdened are politically unlikely.

Conclusion

This chapter has asked many questions without providing many answers. We see this as our commentary on the balance between what we know and don’t know at this moment in the development of big data. We close with a few more thoughts about the nature of these questions and answers. Most important, we return to our starting caveat: everything reveals everything cannot literally be true. Nevertheless, although we have for now framed everything reveals everything as speculation and conjecture, we are quite confident that additional research—theoretical and applied—will confirm that it is in fact partly true. At the very least, we predict that over the next few years, we will all repeatedly express surprise as new data analyses reveal unexpected behavioral correlations—this reveals that, and then that reveals yet another thing, and so on. We put forth the extreme possibility of everything reveals everything so that perhaps we can all skip the surprise and simply expect these revelations. 
In addition, we likewise predict that research will uncover some of the constraints on the two everythings—what inputs do not reveal your credit score or what personal characteristics cannot be predicted based on driving habits. We call on data scientists to continue to advance the state of this research, and hope that the conjecture that everything reveals everything will provide a motivating target at which such scientists can aim. Finally, if everything reveals everything, we will be forced to remake privacy and antidiscrimination laws in new and potentially problematic ways. Because big data blurs the lines between contexts, our laws will by necessity become more underinclusive and overinclusive. The days of regulating information with precision are drawing to a close. Big data will also force us to rethink the harms we seek to remedy through law, such as by broadening our narrow focus on historically regulated protected classes to broader questions about the distribution of power in society. None of these changes wrought by everything reveals everything will be easy to implement; they will require first principles debates about the future of society

What If Everything Reveals Everything? 


by many people. The path ahead will not be an easy one, but we think we will be better equipped to face it if we start discussing it now, after recognizing how big data’s advances will make everything reveal everything.

Acknowledgments

Paul Ohm’s contribution to this chapter has been supported by a financial contribution from the AXA Research Fund to Georgetown University through the AXA Award Project in Big Data and Privacy.

5  Big Data in the Sensor Society

Mark Burdon and Mark Andrejevic

By all accounts, we are entering an era of ongoing data deluge. The predictions of the last two decades regarding the onset of rapid and continuous data growth have, by and large, turned out to be accurate. We are now at a stage where the Internet has grown beyond past communication infrastructural capacities (Coffman and Odlyzko 1998). Digital storage has indeed become more cost effective, allowing for the capture and storage of an ever-growing range of data (Lyman and Varian 2003; Diebold 2003; Gantz 2007). The increased capture of data from a variety of automated processes (Swanson and Gilder 2008) has led to the development of new analytic processes (Denning 1980) and new ways of visualizing the meaning of findings generated from vast amounts of data (Bryson et al. 1999). Both the technological capacity and the economic and political imperatives converge on the goal of collecting all data, all the time, and holding on to it forever. These predicted expansions also provided a stimulus for a number of significant claims and justifications for the use of big data. For instance, big data can assist with policy implementation and international development by providing hitherto-unseen insights (World Economic Forum 2012). Big data can be used to resolve a number of “real-world” problems by supplying a basis to predict future actions, and thus enable the effective allocation of corporate and governmental resources (Satell 2013). Big data is leading to a fundamental reunderstanding of data collection and usage, including a reconceptualization of personal data. Information about individuals has been described as the “new oil” (World Economic Forum 2011): a new asset class, and a critical source for corporate and governmental innovation and value. In this chapter, we put forward a perspective on big data that treats it as a form of embedded infrastructural authority. 
We argue that big data technologies are analytic prostheses: they make it possible to make sense out of information in new ways. To do so, however, these technologies and processes rely on access to costly infrastructural resources: digital communication networks, giant server farms, and powerful computing systems. They also require the development of sophisticated machine-learning algorithms. The very concept and processes of big data therefore implicate the actions and abilities of technocratic structures and authority.



We contend that big data should be viewed as an issue of embeddedness because the processes and schema of big data do not sit separately from the practice of social life (see also Shilton, this volume). We examine the embedded nature of big data in two ways. First, we reexamine the infrastructural embedding of big data logic through the advent of the “sensor society,” by which we mean a society in which a growing range of spaces and places, objects and devices, continuously collect data about anything and everything. Second, we highlight the ways in which corporate data collectors construct big data. These collectors, such as Google, attempt to embed favored meanings of foundational constructs, such as personal data, privacy, and data analytics, into the lexicon of community understanding. Users are negatively impacted by this shaping because the broader community understands big data and its processes from the perspective of collectors. Big data has to be considered as an embedded constituent of an increasingly digitized life. The embedded character of big data forces us to adjust both user understandings and policy initiatives to address the ways in which data mining and data-driven decision making are transforming important aspects of daily life. Understanding big data ultimately depends on the way in which collection and analytic infrastructures are controlled, and the ends toward which their use is directed. We therefore consider how the logic that dictates the increasing cycles of data collection, attempted shaping of community meaning, and infrastructural requirements of big data in a sensor society can be appropriately regulated. We conclude with a consideration of core policy principles for guiding the goals of democratic societies in an era in which the advantages and capabilities that accrue to big data resources are shaped by control over as well as access to the sense-making infrastructure. 
Big Data as Embedded Infrastructure: The Sensor Society

We have witnessed a number of remarkable technological developments during the last couple of years. One minor but telling example is the 2013 release, by diaper maker Huggies, of a digitally enhanced diaper called TweetPee. The diaper includes a bird-shaped sensor that attaches to the diaper and senses moisture changes (“Need a Reminder to Check Your Baby’s Diaper?” 2013). When a moisture change is recorded, the sensor sends a tweet to the registered purchaser of the diaper to inform them that the diaper is wet (Katz 2013). In another arena, in early 2014, the New York Times reported the installation of 171 sensor-connected LED fixtures at Newark airport that serve as the backbone of a wide-ranging and intricate surveillance system. The fixtures are part of an array of sensors and surveillance cameras that collect data in order to identify suspicious activities and patterns (Cardwell 2014). Even battery power levels on mobile Internet devices can be used to identify and track users across the Internet (Hern 2015). These technological developments have a common element: they use a sensor or sensors to record data about the environment in which the device is located. In general terms, a sensor is a device that measures or detects an event or state, and translates this measurement or



detection into a signal (Huijsing 2008, 3). The data collection actions of the sensor are consequently automatic and receptive. Sensors do not monitor and listen per se. Rather, sensors detect and record (Andrejevic and Burdon 2015). As such, sensors do not rely on direct and conscious registration on the part of those being sensed. Collection by sensors involves only a form of passive monitoring of the device user. The user may have some indication that the device is sensing, such as the parent waiting for the tweet to change a diaper; more likely, the user will be unaware of the true sensing capacities of a device, such as Newark airport passengers walking under LED fixtures. Sensor-generated data creates a realm of interactivity between users and data collectors in which users are rarely informed or aware of the data collection taking place. Data mining techniques associated with automated data collection enable new and powerful uses for collected data. A dedicated sensor is no longer necessary to expand the sensing frontier. Patterns of usage can be used to detect mood, cell tower pings can be used to track location, and so on. More device-oriented sensors mean more data. This opens up new data mining capabilities, which drive the development of uses for new forms of data (and, in turn, the demand for higher-resolution data). The underlying logic of sensorization (Huijsing 2008, 1) requires the evolutionary development of sensors and sensor-driven data alongside analytic techniques. The data mining process itself expands the available dimensions of sensing. The vast bulk of data generation no longer reflects conscious human activity. The passive nature of sensor-generated collection means that databases are continually replenished with data generated mechanically and automatically from sensors (Andrejevic and Burdon 2015). 
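The "detect and record" logic just described—a sensor that passively translates a state change into a recorded signal, with no conscious registration by the person being sensed—can be sketched as a tiny event loop. The moisture readings and the 0.6 threshold below are invented for illustration, loosely modeled on the TweetPee example.

```python
# Minimal sketch of "detect and record": a sensor passively translates a
# state change into a recorded signal, with no action by the person being
# sensed. Readings and the 0.6 threshold are invented for illustration.

THRESHOLD = 0.6  # moisture level treated as "wet"

def sense(readings, threshold=THRESHOLD):
    """Turn raw readings into recorded state-change events (time, state)."""
    events, wet = [], False
    for t, level in enumerate(readings):
        now_wet = level >= threshold
        if now_wet != wet:  # detect a change of state ...
            events.append((t, "wet" if now_wet else "dry"))  # ... record it
            wet = now_wet
    return events

print(sense([0.1, 0.2, 0.7, 0.8, 0.3]))  # [(2, 'wet'), (4, 'dry')]
```

Nothing in the loop asks the sensed party for anything: the device detects, records, and emits, which is precisely the asymmetry between user awareness and collection that the chapter goes on to examine.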
Sensorized devices record details about the information capture-and-recall process itself, registering the fact that a piece of information has been stored and tagged. This in turn fuels a tendency toward the self-generating automation processes of sensor-driven data collection, information analysis, and predictive response. Sensorized developments make it possible to store increasing amounts of data due to the proliferation of sensors and explosion of sensor-derived data (Mayer-Schönberger and Cukier 2013, 96). This mass accumulation of data is necessary for predictive analytics as well as the search for unexpected and perhaps even unexplainable patterns that provide new insights. The continuous search for new insight carries with it certain collection assumptions, and offers the basis and justification for big data. There is no way to definitively rule out the possibility that any piece or set of new data might make currently collected and nonpurposeful data useful. All data have to be stored forever because even seemingly irrelevant data may become relevant at some point in the future—when they can be correlated with new data sets. Even when data are intentionally collected for a specific purpose, their “true” potential may always remain latent, perhaps indefinitely. As such, the data mining processes that have developed to make sense of sensor-generated data are by their nature emergent because their goal is to generate unanticipatable and unintuitive correlations. The use of predictive



analytics refers not just to the proliferation of automated sensing devices across the landscape but also to the associated logics that are characteristic of automated, mechanized sensing: always-on information capture, the associated avalanche of data, and the consequent tendency toward automated information processing and response (Andrejevic and Burdon 2015). For example, Sensity’s (n.d.) use of LED fixtures highlights the vital role of infrastructures in a sensor society:

    NetSense integrates LED lighting, sensors, high-speed networking, cloud computing, and big data analytics into a single, MultiService Platform (MSP). With NetSense you get distributed intelligence: In our advanced distributed computing architecture, each LED light fixture is equipped with sensors and a fully functioning processor able to run software instructions. When networked together in a NetSense network, these fixtures collectively gather and process data about the surrounding environment, enabling analytics that transform the raw data into actionable information.

Sensity’s NetSense platform details the structural interaction of the many different infrastructures necessary to fulfill sensor data generation, data collection, and the making of meaning into predicted outcomes. It now becomes possible to see “big data” as a series of interconnected and embedded infrastructures—for example, the infrastructures of sensors (the LED fixtures), the infrastructures of data collection and exchange (high-speed networking), the infrastructures of storage (cloud computing), and the infrastructures of sense making (Sensity’s NetSense and big data analytics). It is the complex interaction of infrastructures that leads to processes of “distributed intelligence” that provide the ability to transform “raw data into actionable information.” Big data in a sensor society is about embedded infrastructures that make data collection, storage, and processing possible. Understanding relationships of ownership and control of sensors, along with the infrastructures in which sensors operate, becomes vital. The sensor is inextricably related to the data, analytic process, and infrastructure. This raises important questions about who has access to data, and who sets the parameters and priorities for its use. We contend that these fundamental understandings of how big data in a sensor society operates are not fully comprehended by users. Instead, how big data is understood is dominated by data collectors.

Big Data as Embedded Meaning: User and Collector Disconnects

Public understanding has yet to catch up to the advances we detail above. Our recent research in Australia details some of the emerging disconnects between public understanding of the data collection process and shifting industry practices and expectations. The research draws on findings from semistructured group interviews with over a hundred people conducted over the course of twelve months in the Australian cities of Brisbane and Melbourne. 
The respondents were recruited on city streets and university campuses, and paid twenty-five dollars for their participation in a group interview that ranged in size from three to eight people.



The respondents were recruited with a focus on younger individuals (ages eighteen to thirty-four for the first stage of the project) and on obtaining a relatively even male-to-female balance (56 to 44 percent). The interviews revolved around people’s understandings of data collection and data mining, and were conducted by the same investigator, lasting from forty-five to seventy minutes. The structured interviews built on the results of a nationwide Australian telephone survey that revealed a high level of concern about commercial data collection practices and strong support for regulations giving people more control over the collection and use of their personal data (Andrejevic 2012). The findings outline the main themes that emerged from the series of prompts incorporated into the structured interviews. For the purposes of this chapter, the focus is on key themes that illustrate the disconnect between public understanding and the emerging norms and practices in the field of data collection and mining.

Unearthing Big Data Logic

In order to set the scene for the study’s findings, we recall the 2012 class action lawsuit against Google regarding its scanning of Gmail messages (Rosenblatt 2014a). The legal claim against Google included individuals who were not Gmail account holders. These were individuals who had e-mail accounts with different providers, but had nonetheless corresponded with Gmail users. Google could not rely on its own wide-ranging terms of service agreement as a defense in the claim (Johnston 2014). The plaintiffs’ action eventually failed because the presiding judge ruled that the various parties did not constitute a “class” for the purposes of a class action against Google (Rosenblatt 2014b). Yet the case is important because Google’s defense in that action provides a significant insight into the corporation’s consideration of personal data and its use of analytic methods. 
Google argued that mainstream media had long discussed its data collection practices, and therefore anyone who used its e-mail service had implicitly consented to having their correspondence scanned, sorted, and mined, since allegedly the public had widespread knowledge of Google’s data mining practices (Mendoza 2013). Google’s attempted implied consent defense was rejected by Judge Lucy H. Koh, who stated, “Accepting Google’s theory of implied consent—that by merely sending e-mails to or receiving e-mails from a Gmail user, a non-Gmail user has consented to Google’s interception of such e-mails for any purposes—would eviscerate the rule against interception” (cited in Miller 2013a). This rejection is important because contentions that the use of digital platforms, applications, and services like Google should be based on an implicitly accepted exchange of personal information for service are becoming increasingly commonplace. Thus, for example, one commentator blithely observed that “norms are changing, with confidentiality giving way to openness. Participating in YouTube[,] … Flickr, and other elements of modern digital society means giving up some privacy, yet millions of people are willing to make that trade-off every day” (McCullagh 2010). There are some telling conflations in this formulation of user complicity. First, the fact that a trade-off has been made is equated with its being made knowingly, when



in fact many users are unaware of the extent and uses of data collection. The conflation is a useful one, insofar as it highlights the corporate belief that the mere use of a technology or application amounts to implicit consent to its information-handling practices. Moreover, the assumption is that accepting the stated terms of use (where available) consequently amounts to informed consent. We contend that Google’s implied consent defense is important to wider and embedded considerations of big data. It could be argued that Google was simply defending itself against a legal claim. Google’s defense, however, reveals a deeper disconnect between how users and data collectors think about personal data. We assert that details of this disconnect are crucial because they provide insights into the nature of big data—big data as embedded meaning—which in turn offers an understanding of the contested meaning of big data and some of its core constituent elements—namely, personal data, data analytics, and informed consent. These issues of contested meaning were borne out in the study’s findings.

Contested Disconnects

The interview data indicate a relatively low level of knowledge on the part of even regular technology users about data collection and handling practices. Thus, for example, there was a recurring tendency among participants to treat the high volume of data collection as providing a form of anonymity—as if the significance of any particular bit of information is drowned in a data deluge (Solove 2013, 1899). That is, participants repeatedly stated that they did not worry about the fate of their personal information because there was so much information out there that it was unlikely anyone would notice or care about their data. The analogy here is to a human information processor: the more data that are collected, the harder it would be for any human to look through them and pay attention to the individual details of specific users. 
This way of thinking about data—by analogy to human-scale data processing—is reinforced by Google’s familiar and repeated response to privacy concerns related to Gmail: “No human reads your email” (Webb 2004). That is, user data are in a sense “safe” because they are added to such a large pool that no actual humans could possibly read them all. The data, like people before them, are lost in the crowd. The typical follow-up to this observation by the study’s participants is that even if someone were to stumble across their data, they would be deemed utterly mundane and uninteresting. In an apparently humble vein, the message seemed to be: “I’m not that interesting, so why would anyone pay attention?” This self-effacing logic provided a rationale for explaining why participants were not overly concerned about emerging forms of data collection. In effect, “Why should we worry about who is collecting the data if the information is boring?” These attempts to deflect concerns about data collection and tracking illustrate a fundamental misunderstanding by the study’s participants of how automated forms of collection and tracking work. Data do not get lost in the crowd thanks to technologies that operate on an extra-human scale. Yes, no individual human could keep track of all the information



being generated by millions of people online, but of course, humans are not being asked to do this work. While it is true that big data firms such as Google may not be interested in particular individuals per se, such firms are certainly interested in collecting and aggregating user data in order to more effectively manage and manipulate individuals. In other words, both of the following claims can be true simultaneously: Google is not interested in particular individuals, but it will nonetheless collect and store detailed information about users in order to more effectively tailor advertising and other forms of content to them, and indeed to the general population of users. A lack of interest in that sense does not mean a lack of impact. Indeed, potentially pathological effects result from the fact that data miners do not care about particular individuals, because these are the very individuals about whom the data are being used to make decisions that may have a significant impact on their life outcomes. A second important disconnect involves the relevance of specific forms of data to various decision-making processes. Thus, one of the examples of data mining that we used in the structured interviews was based on a finding by a company called Evolv: for certain categories of work, the web browser used by job applicants to fill out an online application served as a crucial predictor of future job performance (“How Might Your Choice of Browser Affect Your Job Prospects?” 2013). Evolv contended that applicants who used browsers that had to be downloaded and installed (e.g., Firefox or Chrome), rather than those that tend to come bundled with a device (e.g., Internet Explorer or Safari), were statistically more likely to “perform better and change jobs less often” (ibid.). That is, a piece of information that was nothing more than an artifact of the online application process turned out to have some direct bearing on this process. 
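The kind of unintuitive correlation Evolv reported can be reproduced in miniature. The records below are fabricated for illustration; the point is only that an artifact of the application process (the browser used) can be tabulated against an outcome (still employed after a year) entirely mechanically, with no causal story required and no human "reading" anyone's file.

```python
from collections import defaultdict

# Fabricated applicant records: (browser used, still employed after a year).
records = [
    ("chrome", True), ("firefox", True), ("chrome", True), ("safari", False),
    ("ie", False), ("firefox", True), ("ie", False), ("safari", True),
    ("chrome", False), ("ie", False),
]

def retention_by_browser(rows):
    """Retention rate per browser: a correlation, not an explanation."""
    kept, total = defaultdict(int), defaultdict(int)
    for browser, retained in rows:
        total[browser] += 1
        kept[browser] += retained  # True counts as 1
    return {b: kept[b] / total[b] for b in total}

rates = retention_by_browser(records)
# In this made-up sample, installed browsers (chrome, firefox) outscore
# bundled ones (safari, ie): firefox 1.0, chrome ~0.67, safari 0.5, ie 0.0.
print(rates)
```

The tabulation produces a decision-ready number for each browser; whether such a number *should* enter a hiring decision is exactly what the study participants contest.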
The typical response in our focus group interviews to this finding was twofold. First, participants attempted to posit an underlying explanation for this correlation even though the data miners do not offer one. (Data miners believe the correlation is enough to base a conclusion on.) Participants, when asked how they would feel if their own job applications were data mined, then tended to protest that information about what browser an applicant used was irrelevant to the job search process and that only relevant information should enter into the decision-making process. In a sense, these participants (the vast majority of them) were implicitly critiquing the premise of data mining: to unearth unintuitive and unanticipated correlations. The participants contended that relevant information was information that could be directly intuited and anticipated to bear on a particular decision-making process—in this case, whether to hire a job applicant or not. But of course the data unearthed by Evolv was relevant insofar as it allegedly predicted job performance with a relatively high degree of accuracy. So too would any other detail be relevant that demonstrated a robust correlation with superior job performance, no matter how unrelated it might seem (Burdon and Harpur 2014; Rosenblat, Kneese, and Boyd 2014). The participants, in arguing the need for data relevance, attempted to assert a standard of relevance before the fact based on conventional



historical understandings, such as job history, educational qualifications, references, and so on. Data that do not seem to have any meaningful connection to the employment process, other than inexplicably predicting job performance, were irrelevant for the participants and should not be considered. The refusal of participants to reformulate their understanding of relevance in light of data-driven revelations helps explain the third disconnect: the need for informed consent and the speculative character of data mining. When participants were asked what they thought would be a fair policy for governing or regulating the collection and use of personal data, respondents typically said that they would like to be told in advance the purpose for which their data were being collected. Specifically, participants stated that they would like confirmation that data collection would be relevant to the purpose at hand. We have just considered the “relevance” issue: participants understand this as referring to a preestablished category, whereas data miners see it as an emergent one (that is, you cannot know whether any particular piece of information is relevant until you run the correlation). But if relevance is, indeed, emergent, then prior, informed consent becomes structurally impossible because no one can know what use any particular data set will have until it is run through the data mining process (Cate, this volume). It is impossible to place any limits on the collection of data for a particular purpose if it cannot be known in advance which data will be potentially useful for that purpose. These study findings suggest a profound misunderstanding on the part of the public regarding the character of emerging forms of data analytics. The findings and discussion of disconnects are important because they highlight that understandings of big data are themselves about contested meaning. 
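The emergent notion of relevance can be made concrete with a small scan. All field names, records, and the 0.25 threshold below are invented for illustration: the miner does not decide in advance which fields matter, but tries every field and keeps whichever ones turn out to separate outcomes, however unrelated they look—which is why limits on collection framed around "relevant" data are so hard to specify up front.

```python
# Illustrative sketch: "relevance" is decided only after the correlations
# are run. All field names, records, and the 0.25 threshold are invented.

applicants = [
    {"degree": "yes", "browser": "installed", "pet": "cat", "retained": 1},
    {"degree": "yes", "browser": "installed", "pet": "dog", "retained": 1},
    {"degree": "no",  "browser": "installed", "pet": "cat", "retained": 1},
    {"degree": "no",  "browser": "installed", "pet": "dog", "retained": 1},
    {"degree": "yes", "browser": "bundled",   "pet": "cat", "retained": 0},
    {"degree": "yes", "browser": "bundled",   "pet": "dog", "retained": 0},
    {"degree": "no",  "browser": "bundled",   "pet": "cat", "retained": 0},
    {"degree": "no",  "browser": "bundled",   "pet": "dog", "retained": 0},
]

def lift(rows, field, outcome="retained"):
    """Largest gap in outcome rate between the values of `field`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[outcome])
    rates = [sum(g) / len(g) for g in groups.values()]
    return max(rates) - min(rates)

# Scan every field; keep whichever separates outcomes, intuitive or not.
relevant = [f for f in ("degree", "browser", "pet") if lift(applicants, f) > 0.25]
print(relevant)  # ['browser']
```

In this contrived sample the conventionally "relevant" field (degree) shows no lift and the incidental artifact (browser) shows perfect lift; no one could have specified that outcome before the scan was run.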
These contests do not simply represent conflicting actions and perspectives of data collectors and data subjects. They are instead representative of a much greater contest that aims to establish societal understanding through constructed meaning. In that sense, attempts to create understandings of big data, especially by data collectors the size and scale of Google, are also representative of attempts to embed meaning. The argument of societal informed consent is not only indicative of how users do act but also symptomatic of how users should act. The pacified user inherent in notions of implied consent is fundamental to sensor data collection predicated on the cyclic and ever-expanding logic of big data. It is therefore important to consider the view of big data as embedded infrastructure in a newly emerging sensor society in order to better understand attempted justifications for embedding big data meanings.

Embedded Big Data: Future Policy Development

Based on our discussion of new forms of sensorization and data collection, and the failure of public understanding to keep pace with these, we now consider some areas of future policy development. We explore the need for a new vocabulary to assist users to better understand the implications of a big data world, and new guiding principles to govern legislative and



regulatory development that moves beyond the confines of current information privacy principles.

A New Vocabulary

As we emphasized above, the disconnect between users and data collectors reveals fundamentally different understandings about the role of personal data usage within the scope of big data’s analytic processes. We contend that the meaning of big data is an essentially contested issue, and have highlighted attempts by data collectors to embed favored meaning within wider community understandings. Big data as embedded meaning raises the need for a new community vocabulary that will better assist the development of future policy and regulatory discourse. One of the key challenges faced by the participants in our research is the ability to conceptualize meaningful forms of regulation involving the practices of big data. Before the support for any regulatory regime can be accurately assessed, the disconnect between how users think about their data and how data miners seek to use them needs to be addressed. Users need to understand that what is valuable about their data is not the data themselves. Rather, value is provided by the fact that personal data can be aggregated with those of countless other users (and things) in order to unearth unanticipated but actionable research findings. Pandora and others claim, for example, that information about music preferences can be used to infer political leanings—and we might imagine the same could be said about other consumption habits (and a range of other inferences) (Dwoskin 2014b). Data mining allows for forms of inference that were previously impossible because they rely on the analysis of gigantic databases that do not need a structured query positing the inference in advance. Yet how can users, even the most tech-savvy ones, ever really calculate the value of their own data in a big data world that is literally too big and too complex to comprehend? 
One answer is that users should attempt to rationalize and make decisions about whether to participate in data provision activities based on the construct of economic transaction. In other words, personal data, and the privacy implications that flow from their provision, can be treated as commodities to be traded. Such constructions of personal data exchange have found favor among some leading privacy law theorists, going back to the architect of information privacy law, Alan Westin. Westin (1967, 324–325) viewed the process of personal information exchange as “the right of decision over one’s private personality,” which should be accorded the same status as a property right. Doing so provides space for individuals, as data collectors are restrained from interference in the use of that right, and creates duties and liabilities for data collectors that users can enforce. The construct of user personal data provision has, in the view of some commentators, tacitly been based on what economic interest can be derived from the supply of user personal data while protecting reputation (Murphy 1996, 2385). The notion of personal data as tradable property, or the new oil, as mentioned above, ultimately promotes the use of market protections for personal data exchange activities.


Mark Burdon and Mark Andrejevic

Perhaps unsurprisingly, this view has been subject to much criticism (Allen 2000; Austin 2003; Floridi 2006; Katyal 2005). Nevertheless, this perspective still pervades the logics of data collectors. Google’s class action defense is instructive in this regard. Implied consent through social practice is grounded in a transactional construct of personal data exchange. The very process of a non-Gmail account holder sending an e-mail to a Gmail account holder is a “trade” because the sender should have known about the consequences of sending an e-mail to a Gmail account. The trade, of course, is all one way, along a path that decisively favors Google. The very concept of a trade for the non-Gmail account holder is structured around Google’s normative understanding of e-mail use, and indeed any information use, because all data can and should be analyzed. This highlights the fundamental problem of user understandings of big data. Users are never going to know whether their data have utility and/or what that utility actually means. The outcome of the unintuitive search, by its nature, can never be truly known, especially as the owners of that process obfuscate the process of deriving actionable knowledge. Take one of Google’s core rejoinders to criticism: “No humans read your email or Google Account information” (Byers 2013). Google is arguing that users should not be concerned about the trade taking place because no human is reading and comprehending their communications. This is again representative of attempts to formulate and embed a favored meaning of the analytic process. Machines do not try to understand content in the way a human reader does. Accordingly, it becomes all but impossible for any user to appreciate that even if institutional actors, such as Google, are not interested in them as individuals, these entities can make decisions based on aggregate patterns (or information gleaned from these) that affect them as individuals.
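The distinction between human reading and machine processing can be sketched as follows (the category names and keyword lists are hypothetical assumptions, and this is not a description of Gmail’s or any provider’s actual system): no person inspects the message, yet its content still drives an automated decision.

```python
# Hypothetical ad categories keyed by trigger words. Illustrative only:
# this is not a description of any real provider's system.
AD_CATEGORIES = {
    "travel": {"flight", "hotel", "itinerary"},
    "finance": {"loan", "mortgage", "invoice"},
}

def categorize(message: str) -> list:
    """Match message tokens against category keyword sets.
    Nothing here 'understands' the email; it only counts overlaps."""
    tokens = set(message.lower().split())
    return sorted(cat for cat, words in AD_CATEGORIES.items() if tokens & words)

print(categorize("Your flight and hotel itinerary attached"))  # ['travel']
```

The code never “reads” in any human sense, yet its output can feed decisions about the sender, which is precisely the asymmetry the chapter describes.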
The trade is consequently always one-sided (Solove 2013). Users need help reconfiguring their understanding of the relationship between individual forms of targeting and monitoring, on the one hand, and on the other, the search for patterns at the level of the population. The notion of informed consent has to be reconfigured or repurposed to address the fact that the very process of pattern extraction makes it impossible to determine which data might be used for what purpose. To do so necessitates a general level of user understanding that simply may not exist at present and will be difficult to construct in the current environment of continual attempts to embed favored meanings by data collectors. A new vocabulary is required that explains the incomprehensible complexities of big data collection, analysis, and actionable outcomes in a meaningful way, and comprehensively identifies user understandings of big data through new research methods that seek to engage users in their worlds. This in itself is problematic given the widespread acceptance of, and increasing faith being placed in, analytic processes among public and private sector institutions (Cohen 2012, 1916). The challenge may therefore fall to academic institutions and academic networks of research interest to provide the impetus, resources, and skills to develop this new and much-needed vocabulary that will enable us all to better understand the meaning of the big data world (see West and Portenoy, this volume).

Big Data in the Sensor Society 


A New Set of Principles

The disconnects stressed in this research pose a major challenge to current legal frameworks, which are primarily predicated on information privacy principles. They reveal that the key tenets of information privacy law are interpreted quite differently by users and data collectors. Users want to be notified about how their personal data will be used to make decisions that affect their lives. Data collectors, though, cannot provide users with this knowledge because the collectors themselves cannot predict results and so cannot inform users about prospective uses. The concept of informed consent thus becomes particularly problematic because user decisions to trade their personal data simply cannot be taken as fully informed ones (see Cate, this volume). We contend that such disconnects will be exacerbated by the embedded and infrastructural nature of the sensor society, posing significant challenges to information privacy law that are already emerging from the application of big data. The goal of sensor-generated collection is the capture of a comprehensive portrait of a specific population, environment, or ecosystem (Andrejevic and Burdon 2015). The target of collection is consequently population holistic rather than individually oriented. Systematic forms of targeting can take place against this collection background because the population-level portrait allows particular targets to emerge, and once those targets emerge, their activities can be situated in the context of an ever-expanding network of behaviors and patterns. Population-oriented collection for the purpose of pattern generation means that sensor-generated collections tend to be amorphous, nondiscrete, and unspecific (ibid.). In essence, monitoring infrastructures allow for passive, distributed, always-on data collection.
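The population-first logic of sensor-generated collection can be sketched with synthetic readings (the device names, values, and one-standard-deviation threshold are all illustrative assumptions): everything is collected indiscriminately, and targets emerge only afterward, against the baseline that the whole population provides.

```python
from statistics import mean, stdev

# Synthetic always-on sensor readings: every device in the "population"
# is recorded indiscriminately, with no target chosen in advance.
readings = {
    "dev-01": [10, 11, 9, 10],
    "dev-02": [12, 10, 11, 11],
    "dev-03": [30, 29, 31, 30],   # stands out only against the rest
}

# Step 1: build the population-level baseline from everyone's data.
all_values = [v for vals in readings.values() for v in vals]
baseline, spread = mean(all_values), stdev(all_values)

# Step 2: targets emerge after the fact, as deviations from the baseline
# (a one-standard-deviation threshold, chosen purely for illustration).
targets = [dev for dev, vals in readings.items()
           if abs(mean(vals) - baseline) > spread]
print(targets)  # ['dev-03']
```

Note that the data of the devices that never become targets are not surplus: without them there is no baseline, which is the sense in which everyone’s data are of interest to the process.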
The classic forms of information privacy law supply procedural protections that seek to imbue fairness in data exchange processes (Cate, this volume). Users are provided with a limited range of process rights that afford a degree of control over how personal data are collected, handled, and used by data collectors. Subjects can access and amend collected personal data, request to see personal data held about them, and ask that “out-of-date” information about them be deleted or amended. Similarly, data collectors are obliged to inform users about when and why collections are undertaken, collect personal data only for relevant and specified purposes, store personal data securely, and ensure that subsequent uses of personal data are in accord with the purpose of collection. There is a certain rationality, which we touched on briefly above, underlying these protections and obligations. That rationality is predicated on notions that are becoming increasingly complex to define: personal data are no longer readily identifiable, the sensorized process of personal data gathering does not fit within a paradigm of accountable exchange, and subjects and collectors do not share equal values, understandings, and commitments to ensuring procedural fairness in sensor-generated data gatherings. Take, for example, the threshold issue of personal data. What are or are not personal data is a key consideration of the application of information privacy law (Ohm and Peppet, this
volume). Any data that are not deemed to be personal are not covered (Solove 2013, 1891). The simple act of defining personal data is becoming increasingly complicated, however, because the construct of what are personal data and how they are collected is fundamentally changing. Take, say, a current point of contention among different information privacy law authorities: whether a medium access control (MAC) address of a user’s mobile phone is personal information or not. On its face, a MAC address is a unique device identifier. It does not directly identify an individual. A MAC address in one sense is not personal information. Yet it becomes possible to track a MAC address through interconnectivity with cell phone towers and Wi-Fi access points, and then generate locational and activity patterns that can be used to identify individuals. Some jurisdictions have therefore deemed that a MAC address of a mobile device should always be classed as personal information due to the intimate relationship between a user and mobile device (Burdon and McKillop 2014). The contentions over identification of personal information we are currently witnessing are nothing compared to what we are about to see with the onset of the sensor society. For example, researchers from the University of Manchester have developed a carpet for installation in aged-care facilities that can predict when a person is likely to fall (Thornhill 2012). Sensors in the carpet’s underlay detect foot pressure, and relay data that can be analyzed to identify gradual changes in walking behavior or a fall. Furthermore, the carpet can “gather a wide range of information about a person’s condition, from biomechanical to chemical sensing of body fluids, enabling holistic sensing to provide an environment that detects and responds to changes in patient condition” (University of Manchester 2012).
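The MAC address contention discussed above can be made concrete with a small sketch (the sightings are synthetic and the access-point names invented): a MAC address names a device rather than a person, yet linking sightings across access points yields a movement trace of exactly the kind that can single an individual out.

```python
from collections import defaultdict

# Synthetic Wi-Fi sightings: (timestamp, access_point, mac_address).
sightings = [
    (1, "ap-cafe",   "aa:bb:cc:00:11:22"),
    (2, "ap-office", "aa:bb:cc:00:11:22"),
    (2, "ap-cafe",   "dd:ee:ff:33:44:55"),
    (3, "ap-gym",    "aa:bb:cc:00:11:22"),
]

def traces(sightings):
    """Group sightings by MAC address into time-ordered location traces.
    The identifier names a device, but the trace describes a routine."""
    by_mac = defaultdict(list)
    for _ts, ap, mac in sorted(sightings):
        by_mac[mac].append(ap)
    return dict(by_mac)

print(traces(sightings)["aa:bb:cc:00:11:22"])
```

No field in any single sighting is “personal,” which is why the regulatory question turns on the trace that emerges from joining them.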
The possibilities for generating and collecting data that could be used to identify an individual from the sensor-packed carpet can now become numerous, and are of a type that borders on the fantastical when considered against the historical construct of personal data in first-generation information privacy laws. For instance, gait patterns are becoming a form of personally identifiable information as it becomes possible to identify individuals by the uniqueness of their stride and footprint. The use of chemical sensors in a carpet may also make it possible to identify individuals from their body odor, if the characteristics of that odor are sufficiently unique, or if body odor readings can be combined with gait data collection. The sensorization of environments will greatly increase the quantity and quality of data that will be generated and collected. The issue of quantity is already an important aspect of big data considerations, and has become a key justification for big data and its analytic processes of unintuitive prediction. What is different here, though, is the quality of the data generated through sensor-generated collection. The seemingly never-ending improvements in what sensors can detect, combined with the density of sensor implementation, will lead to ever-increasing forms of identification and monitoring. Sensor density, the combining of an increasing number of sensors in single objects and environments, will lead to the generation of data about everything, everywhere. Big data is not simply about unearthing new knowledge by the conquest of scale. Instead, the sensorization of big data logics is about the panorama of a collected world (Ohm and
Peppet, this volume). The expansive logic of the sensor society creates the prospect that individuals will be uniquely identifiable from the metadata created by sensor devices and sensor networks embedded in sensor-packed environments. Exhaustive and dense sensorized coverage will mean that continuous patterns of monitored behavior, at increasingly finer-grained levels, will constantly give rise to new forms of identification. All of which will place significant burdens on the current application of information privacy law. That does not mean that information privacy protections have no place or will have none in the future. Edward W. Felten (2014) has argued convincingly that core information privacy protections do have a place in a world of sensor-generated big data. The problem lies not in the scope of information privacy protections but rather in their application. The core protections at the heart of information privacy law are under attack by the logics of big data and those data collectors who seek to benefit most from its implementation. Big data itself is an essentially contested issue, and that contest by its scope also targets the core protections of information privacy law and the meaning of its core constructs, such as personal data and the structure of informed consent. All of which suggests that information privacy principles could still have a place in the regulation of big data and the sensor society. It nevertheless seems clear that these principles need to be updated because big data and its sensorized future are not just about privacy. Ultimately, they are about power and control. Information privacy principles have built-in balancing mechanisms from their historical roots that are cognizant of power imbalances (Solove 2001). The application of those mechanisms, however, may not give due consideration to the imbalance of power between users and collectors that is starting to emerge. 
That is because the historical development of information privacy law is predicated on different understandings of personal data as well as data collection and use processes. As a consequence, the underlying bureaucratic rationality of big data in the sensor society reflects a completely different form of thinking from the selective forms and processes of data collection witnessed in the latter half of the last century. The oncoming density of embedded sensorization and the collection of holistic population data are completely different from the binary collection processes of yesteryear. Yet the development of a new set of principles is currently problematic due to the lack of understanding we have about the processes of big data and the consequences that a sensor society will bring. Before new guiding principles can be designed, a new vocabulary is required. We argue that the development of a new vocabulary, and with it, new methods of visualization to enhance dialogue and help construct a meaningful, well-informed popular imagination, is the most pressing task for activists and academics interested in the so-called fate of privacy in the digital era. For a start, we propose the following terminology for emerging forms of data collection and processing: 1.  Big data monitoring is about populations in the sense that the target is the population. Individual targets emerge only after the fact. It is important for members of the public to
understand that even though they may not be specific targets, they are still the object of monitoring and surveillance. First comes the monitoring, and then come the individual targets. If the target is the population, then everyone’s data are of interest to the process— even the data of those who do not end up becoming targets. In this regard, present-day data mining practices might be analogized to mapping: the goal is to incorporate every detail into the database, no matter how boring or insignificant it might seem, because what matters is not the detail but rather the larger picture of which it is a part. If we think about data mining this way, we understand that even though our details may not be of particular interest, they are necessary to the construction of the overall patterns of which we form a part. There are no logical limits to the reach, scope, and depth of this type of monitoring—only those imposed by the existing state of the technology, and the costs of capture and storage. In theory, the “mapmakers”—that is, the data collectors—want any and all data that can be captured, because these might end up revealing useful patterns. 2.  Context comes into play only after the fact. That is, big data mining is, in a sense, a search for context. In concrete terms, this means that the possible uses for data emerge only once the data have been collected and analyzed. If information is being collected about an individual for the purposes of, for example, a job application, no data can be ruled out in advance. It may turn out that seemingly irrelevant details (the Internet browser one uses) play a role in predicting job performance. This means that the context does not rule out any category of data—and which data are relevant emerges from the data mining process. For all practical purposes, then, context does not impose any limits on data collection.
This makes it hard to draw context-dependent distinctions based on what type of data the users expect to be collected about them in particular scenarios. If you are applying to a college, for instance, you might expect to submit information about your grades, extracurricular activities, standardized test scores, and so on. But it might turn out that seemingly unrelated data are a good predictor of scholarly success. So the context does not necessarily define which data should be collected and mined. 3.  Secondary uses become primary. This is the business model of the contemporary commercial Internet: collect all available data because they might come in handy for new and unanticipated uses. Facebook stores tremendous numbers of photographs so that people can share them with one another, but this database also gives Facebook the raw material to develop some of the most sophisticated facial recognition algorithms. Data collected for one reason or purpose can be easily mined and aggregated for a growing range of purposes. The value of much of the information is speculative—it is based on the hope that it might come in handy for unanticipated purposes at some point down the line. 4.  Big data mining is, by definition, opaque and asymmetrical. Because it is “big,” only those with access to the resources for storing and processing can put it to use. Because the uses of these data are emergent, it is impossible to anticipate beforehand the uses to which the
information might be put. Structurally, then, there is a power imbalance between those who collect and use the data and those whose data are being captured and mined.

Conclusion

We have highlighted two aspects of big data’s embedded nature: the infrastructural embeddedness of big data in a sensor society and attempts to embed meaning by data collectors. In doing so, we have emphasized the existence of data disconnects between how users think about their data and how these data are used and understood by data miners. These disconnects render even the most comprehensive terms of use meaningless (see Cate, this volume). The research and analysis described here suggest a profound misunderstanding on the part of the public regarding the character of emerging forms of data analytics. Informed public deliberation concerning the regulation of the collection and use of personal data cannot take place until there is widespread understanding of the ways in which contemporary forms of data mining differ significantly from the forms of data collection and processing that currently inform the popular imagination. The popular imagination is also going to have to take into account the full consequences of the onset of a sensor society, concomitant with dense sensorization and embedded infrastructures. Trying to accurately and meaningfully inform the popular imagination is itself a challenge as it becomes increasingly difficult to articulate the full range of possible significances that can arise from sensor-generated data collections. Taken together, these propositions about the embedded nature of big data render some existing ways of thinking about privacy and data obsolete. It becomes impossible to ask users to take on the role of informing themselves about the potential future uses of their data.
Rather, we are entering a world in which incomprehensible amounts of data about individuals are coming to shape their life opportunities in unanticipated and opaque ways. This is what we mean by “embeddedness”—the data mining process is baked into the infrastructures that shape our worlds behind our backs. The priorities of those who shape these infrastructures are similarly built into the decisions regarding what data to use, which queries to run, and how to put the results to use. Individual decision-making processes regarding the acceptance or rejection of terms of use or privacy policies cannot meaningfully address embedded forms of data capture, processing, and decision making. Emerging regulatory regimes will need to rely on collective decision-making processes about the uses of data in areas ranging from education and health care to employment and marketing. This will be an ongoing process of debate and deliberation—but it is one that cannot start until there is widespread public understanding of the changing terms of data collection and processing brought about by big data in a sensor society.

6  Encoding the Everyday: The Infrastructural Apparatus of Social Data Cristina Alaimo and Jannis Kallinikos

All that we don’t know is astonishing. Even more astonishing is what passes for knowing. —Philip Roth, The Human Stain

In this chapter, we seek to lay bare key processes and mechanisms through which social data are produced and disseminated on the web. For the purpose of this chapter, “social data” are defined as the data footprint of social interaction and participation in the online environments of what is now commonly referred to as social media.1 They comprise behavior-related data records created by such user activities as “tagging,” “following,” or “liking,” carried out daily and en masse on social media. These activities are often performed to upload, share, or post user-generated content (comments, photos, posts, status updates, etc.). It is a distinctive attribute of social media, we claim, to shape online interaction and communication in ways that leave a computable data footprint. Although relevant to the functioning of social media platforms, user-generated content plays a less important role in the data calculations carried out by social media. Social media produce data on communication and interaction as they happen online because they platform online sociality—they program user participation along standardized activity types (such as tagging, following, or liking). It is in fact this translation of online communication and interaction into discrete, computable data that connects social data to those developments subsumed under the label of big data. Social data are therefore a contemporary phenomenon and an interesting sociotechnical innovation. Social data do not refer predominantly to data about the social (data that experts collect to map social conditions) but rather to data produced through the social. It is communication and interaction as organized by social media that is allowed, as it were, to speak about itself. In this regard, social data are closely associated with the recent transition from networked to platformed sociality brought about by the diffusion of social media (van Dijck 2013).
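The idea of a “computable data footprint” can be illustrated with a minimal sketch (the field names and the allowed activity set are our own assumptions, not any platform’s actual schema): a single “like” is stored not as meaning but as a typed, timestamped link between a user and an object, which is precisely what makes it countable and aggregable.

```python
import json
import time

# Hypothetical schema -- field names and the allowed activity set are
# our own assumptions, not any platform's actual data model.
ALLOWED_ACTIONS = {"like", "tag", "follow", "share"}

def encode_action(user_id: str, action: str, object_id: str) -> dict:
    """Encode one platform activity as a discrete, computable record."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unsupported activity type: " + action)
    return {
        "user": user_id,
        "action": action,      # drawn from a fixed, standardized set
        "object": object_id,
        "ts": int(time.time()),
    }

event = encode_action("u42", "like", "photo-9")
print(json.dumps(event))  # one line of machine-readable social data
```

The restriction to a fixed set of action types is the point: standardizing participation in advance is what lets the resulting records be summed, compared, and computed at scale.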
These ideas may need further qualification against a widespread view that considers social media platforms as mainly sites of social interaction, self-presentation, and networking
rather than mechanisms of data production and use (e.g., boyd and Ellison 2008). Dissociated from the attributes of platformed sociality we briefly referred to above, the ways people interact and communicate with others may not seem directly related to data production and use. Yet on social media, interacting with other users—by sharing, tagging, uploading, or simply showcasing—is inextricably involved with data generation of one kind or another. This is perhaps the most interesting characteristic of social media: they re-create the online conditions through which interaction and communication can be performed and directly translated into computable forms. Significantly, the patterns of sociality that emerge online (with whom to interact, and when and how) are shaped under conditions that reflect how social media platforms organize user participation and direct user attention via a series of computations. A range of scores and measures of aggregate user platform activity (e.g., similarity, popularity, and trending) are routinely computed by social media, and cycled back to users in the form of recommendations and personalized suggestions. Such encoded and computed, as it were, sociality constitutes the core of everyday life on social media platforms, and by extension, social data generation and use (Alaimo 2014).2 It should therefore come as no surprise that the operations by which social data are produced set them apart from data generated through automated technologies of monitoring, data tracking, and recording. Social data are produced through the encoding of online interaction and user platform participation. Rather than being a limitation, as some seem to assume (see, for example, Pentland 2014), such a condition reflects the distinctive flavor and valence of social data (Kallinikos and Constantiou 2015). For through social data, the murmur, as it were, of a carefully crafted online daily living is carried at the forefront of public life. 
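The cycle described above, in which aggregate user activity is computed into scores and fed back as recommendations, can be sketched in a toy form (the log, the action weights, and the scoring rule are invented for illustration; real platforms’ weightings are proprietary and far more complex):

```python
from collections import Counter

# Synthetic activity log: (user, action, item). Aggregate counts become
# a popularity score that is cycled back to users as recommendations.
log = [
    ("u1", "like", "post-a"), ("u2", "like", "post-a"),
    ("u3", "share", "post-a"), ("u1", "like", "post-b"),
    ("u2", "share", "post-c"),
]

# Illustrative weights only; real platforms' weightings are proprietary.
WEIGHTS = {"like": 1, "share": 2}

def trending(log, top=2):
    """Score items by weighted activity and return the top ones."""
    scores = Counter()
    for _user, action, item in log:
        scores[item] += WEIGHTS[action]
    return [item for item, _ in scores.most_common(top)]

print(trending(log))  # ['post-a', 'post-c']
```

What users then see as a “trending” list is the computed residue of everyone else’s encoded activity, which in turn shapes what they click next, closing the loop the chapter describes.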
The opinions, trivial concerns, daily and ephemeral pursuits, dispositions, and experiences of users are all recorded via the modalities and affordances of social media, pooled together, and then computed on a continuous basis to construct constantly updatable and frequently marketable profiles of daily living (Proffitt, Ekbia, and McDowell 2015). This chapter is structured as follows. We first describe in some depth the processes and mechanisms through which social media encode social interaction and communication, and produce social data. Much of this encoding embeds selective assumptions and entails specific operations through which the everyday social fabric is captured and made computable. We then show how this highly structured translation of social life to digital and computable forms constitutes the basis on which social media platforms produce a range of personalized services that make up the core as well as tradable product of their activities. Via these services and the data by which they are sustained, social data enter into the broader circuit of big data, to which in turn they make an essential contribution. We conclude with a few remarks on the implications of the developments we outline, and their social and institutional significance.
Social Life as Data

What exactly is social on social media platforms? What is the “stuff” that social media platforms encode and record into data?

Social Data: Definitions and Distinctions

Social media have been variously defined as social networking sites (Beer 2008; boyd and Ellison 2008), neutral platforms fostering communication (Kaplan and Haenlein 2010), data-harvesting fields (Scholz 2013), gigantic databases, or archives of the everyday (Beer and Burrows 2013). Effectively, the specificity of social media is hard to pin down. Perhaps the simplest definition we might rely on is also the most immediate. Social media mediate the social.3 That is, they do exactly what they declare. The fact that they produce a sheer quantity of different data typologies on the social is, and should be viewed as, a consequence of their capability of mediation. Social data, in other words, should be addressed as a derivative outcome of social media’s main innovation: the capacity to encode the everyday, and store its data footprint in flexible and granular data fields. Commonly, social data are defined as user data that are procured from social media. A company that provides third-party social data analytics for commercial enterprises gives a general definition of social data as “expressing social media in a computer-readable form and sharing metadata about the content to help provide not only content, but context. Metadata often includes information about location, engagement and links shared.”4 Although this definition is quite general, it is still useful as a starting point toward unpacking the complex ways in which social media encode daily life. On the one hand, data produced on social media platforms differ from traditional data sets on individuals and social groups (e.g., sociodemographic data). Social data also differ from traditional sources of data elicitation on lifestyle and preferences.
On the other hand, social data are different from online transaction data, or the data produced by cookies and tracking devices. Those data are not meant to capture opinions, tastes, or sentiments but rather simple online behaviors (time spent on websites, searching habits, etc.) and transactions (records change as the result of transactions). Social data are generated constantly by the activities carried out by users on social media. In this regard, they are the by-product of what users do on social media. These platforms, it should be noted, are already artificially personalized technological environments. The social space constructed by them is in fact a highly organized and structured environment, which constantly filters and displays information that is meant to be relevant for each and every user. An algorithm called EdgeRank (Bucher 2012), for instance, personalizes Facebook’s home page; the ranking now considers roughly a hundred thousand factors when computing what to display to each user (McGee 2013). In this highly structured environment, personalized suggestions on whom to follow, what to like, and what to share are meant to incite platform participation along precise behavioral corridors. Social data, then, are the
by-products of an artificial and highly organized space. By conceiving and assigning data entry points to specific data types such as profile data, user-generated data, and behavioral data, social media technologies construct the conditions on which social data are produced. It is this not so “raw” social material qua data that social media technologies then actively manipulate, relate, and compute to obtain information that is meant to tell something new about individuals and collective endeavors. Although social media produce different kinds of social data, not every type of information produced has the same value. Descriptive data about individuals (profile data such as name, gender, occupation, marital status, location, etc.) make sense and obtain value against the constant quantification and qualification of behavior-related data produced by activities programmed by social media. Behavioral data are obtained by encoding the highly structured participation that social media embed. The various actions that occur on these platforms—such as the reiterated clicking, sharing, and liking that users perform on Facebook and its connected applications—constitute a real-time, all-encompassing encoding of everyday activities that has no parallel in previous data sources on individual and group behavior. Another type of information, user-generated data (commonly also referred to as user-generated content [UGC]), provides a huge quantity of unstructured data (in the form of images, posts of various kinds, and written comments) that are often stored in social media databases but seldom put into direct use.5 On the one hand, UGC seems to supply the means and context by which user behavior—the constant uploading, sharing, commenting, and liking—can be performed and encoded as recurrent patterns of action. On the other hand, user activities such as tagging, liking, or sharing, when performed on a photo, comment, or post, tie unstructured data (a photo, post, etc.)
to well-structured data, and render UGC relatable and ultimately computable. Social data as behavioral data in this sense not only shape social communication and participation but also contribute to structuring UGC, making it usable for further computation. Behavior-related data, the by-product of online social interaction and participation, thus appear to be the most valuable source of social data and the most distinctive contribution that social media bring to big data: the constant monitoring and recording of user daily interaction and participation.

Social Data Production and Its Assumptions

User participation, variously taking the form of the production, sharing, and updating of personal information, constitutes the core of social media functioning. Sharing, in particular, has become the banner of social media discourses. On the one hand, it is because of the possibility of sharing one’s own life with friends that social media have become the widespread and privileged media of our time. On the other hand, “sharing is the engine of data” (Kennedy 2013, 131). Participatory discourses create user expectations around a new type of user—that is, a user who shares (Shilton, this volume). By fostering a particular culture of sharing (Castells 2009), social media embed presuppositions on user behavior in

Encoding the Everyday 


their own functioning and therefore reinforce a certain modality of sharing. Sharing and participation as such become the resources of social media data production (Bechmann and Lomborg 2013). It is by structuring participation and sharing that social media code platform interaction, and render social data countable and profitable through various cycles of computation. What the rhetoric of social media defines as spontaneous user participation is extremely relevant. The actions that users perform to upload content, curate content, or connect to other users’ activities constitute the core of every social media platform. Spontaneous user participation, however, should be understood in relative terms. As claimed earlier in this chapter, the terms of participation are shaped by the ways social media platforms organize participation and encode interaction. These last are in turn conditioned by the technical data formats required to support computation. In other words, what may seem spontaneous at the user interface is governed underneath by a complex system of operations that transform or code social interaction to data, and a technological apparatus by means of which these data are stored, aggregated, and computed. It is important not to lose sight of the fundamental fact that the main aim of social media is to render the social as data. It is the capacity of coding user participation that serves social media markets and makes social data profitable (van Dijck 2013). That is, the ways technology is used to encode social participation make social data valuable and confer to them their renewed (technical) sociality. Despite the rhetoric of participation surrounding social media, their main functionality is coterminous with how technology translates social interaction into computable objects. 
Social interaction and communication must be rendered computable—that is, they must be codified into bits and pieces of data (numbers), and adapted to the possibilities of the medium they inhabit (computation). To do this, social media program a selected set of actions that become encoded along computable paths. This codification of social communication and interaction operates on simple premises that nonetheless carry far-reaching implications. On social media, everything is formalized as an object (users, posts, comments, photos, etc.), and every object is connected to other objects by a set of preestablished actions such as following, clicking, sharing, and the like. As figure 6.1 shows, behavior-related data are obtained by encoding social interaction into a data connection between objects. Actions are the results of user behavior as long as the behavior produces user-generated data. Viewing behavior-related data simply as the result of spontaneous user participation overlooks the fact that actions are preprogrammed by the platform’s functionalities at the same time as they are tuned to the normative framework of sharing and participation. A good user is a user who shares (John 2012). Participation, in other words, is also the result of how users react to the assumption and presupposition of the good user, which the technical infrastructure of the platform embeds, and to the indications, rules, and logic of an artificially constructed media environment. This “blank state” of social interaction rendered digital is the starting point of social data production. Once social interaction happens, it is encoded and translated into data (data connections), and counted, correlated, and qualified by other data connections under different mathematical and statistical conventions. It should be clear by now that social media do not just capture the data (i.e., record or measure social activities) of a social life “out there.” Rather, they produce tremendous amounts of what is often referred to as exhaust data—that is, data that are created as a by-product of the activities social media program under their own assumptions (Manyika et al. 2011). Social media do not simply capture the communicative texture of life in all its forms. They encode, formalize, and constantly manipulate data on the artificial forms of sociality they sustain. Clicking, liking, sharing, and commenting are all performed as social participation, yet in essence are programmed scripts of user engagement. It is these programmed actions or scripts of user engagement that provide the “raw” material for a thorough quantification of individuals and groups. Social data therefore are produced by the constant user participation that social media require and instrument.

Figure 6.1 The codification of social interaction and communication

Cristina Alaimo and Jannis Kallinikos
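The elementary encoding that figure 6.1 depicts, a preprogrammed action linking two platform objects, can be sketched in a few lines of Python. This is our own illustrative model; all class, field, and verb names are invented and reflect no platform’s actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Obj:
    """Any platform entity: a user, post, photo, comment, or brand page."""
    obj_id: str
    obj_type: str

@dataclass(frozen=True)
class Action:
    """One behavioral datum: a preprogrammed verb linking two objects."""
    source: Obj
    verb: str      # drawn from a closed vocabulary: "like", "share", "tag", ...
    target: Obj
    at: datetime

# A fragment of social interaction, already reduced to countable links.
alice = Obj("u1", "user")
photo = Obj("p9", "photo")
log = [
    Action(alice, "like", photo, datetime.now(timezone.utc)),
    Action(alice, "share", photo, datetime.now(timezone.utc)),
]

# The unstructured UGC (the photo itself) becomes computable only through
# the structured actions that point at it.
likes_on_photo = sum(1 for a in log if a.verb == "like" and a.target == photo)
```

The point of the sketch is the reduction itself: whatever users mean by their activity, what is stored is a typed link between two objects, and typed links can be counted.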
Social media rhetoric claims that the data so generated deliver a more authentic portrait of social life because they derive from the spontaneous participation social media enable. But things are subtler and far more complex. User participation, variously modulated as the production, sharing, and updating of personal information, constitutes the core of social media functioning. This encoding of a mediated everydayness is the basis on which social media technology qualifies both profile data (data on individuals) and data on groups. Nevertheless, the bulk of social data are obtained as the by-product of software instructions, sustained online interactions, data architectures, and algorithmic suggestions. Social data thus are not simply a reading of opinions, sentiments, preferences, and conversations. They are instead the output of the constant technological encoding of social interaction that becomes formalized and standardized by a simple, abstracted logical relation: a technologically mediated link between two objects (as visualized in figure 6.1 above). Social data entail the technological production of a new logic of linking together disparate facets of a social everyday at various levels of abstraction and under different assumptions. Once participation is encoded into data connections, technologies of storing, sorting, and selecting carry out the reordering of social data. To this first artificial programming and selection of what might constitute a data portrait of the social everyday, social media technology
intervenes by variously splitting and lumping together large amounts of captured and exhaust data into derived data—that is, data that are the output of additional processing or analysis. It is social data in the form of derived data, produced after exhaust data are reduced into manageable formats through calculation and computation, that enter the big data circuits. Before turning to the issue of how social data relate to big data, however, we need to explain how the fundamental yet elementary coding of user actions as links between two objects (see again figure 6.1) constitutes the building block of the compound arrangements that social media platforms set in place.

The Paths of Encoding

The programmed disaggregation of individual users into discrete and countable actions that underlies social media encoding is essential in understanding the premises of social data production as well as the distinctive nature of social media platforms. As noted, social media construe communication and social interaction as actions connecting two objects (see figure 6.1). By doing so, social media conceive of and ultimately reduce social interaction to its bare essentials. Such a reduction in turn allows social interaction to be detached from the physical or sociocultural contexts within which it is normally embedded and makes sense (Kallinikos 2009, 2012). Once abstracted, the action encoded into data qua links between objects makes strong assumptions about social life that essentially serve the mechanics of social media functioning and its technological underpinnings. Figure 6.2 below further elaborates on this idea by analytically deconstructing the logic on the basis of which platform participation is organized and made countable. Each link whereby an object is connected to another object (figure 6.1) becomes one of a handful of activity types along which user participation is programmed, working in essence as a reductive filter of the diffuse and multivalent character of a social everyday (Kallinikos 2009). Arranged together, the links construct a limited and standardized typology of social action (e.g., sharing, commenting, and liking) that helps channel user activities along distinct paths. At the same time, this standardized typology transforms user choices online into discrete and granular data formats, allowing for easy identification, counting, and comparison. There is little doubt that, so organized, platform participation represents a drastic reduction of the richness of communication and social interaction. Yet the abstract and standardized data tokens that social media encoding creates at the same time confer a new “generalizability” to the effort of representing social life. It is thanks to the abstract nature of data and the simplicity through which encoding reads social life that social interaction can be represented in all its now-computable and highly pliable forms (Borgmann 2000; Kallinikos, Aaltonen, and Marton 2013).

Figure 6.2 Social life made computable

As repeatedly noted, the encoding of social life into data does not mean a simple translation of previously encrusted habits or social choices. Tagging has no meaning in off-line social space, while in social media it is usually taken as a manifestation of interest or intentionality. Even when actions qua links are taken to represent previously established forms of social interaction, they add something more to it. In social media dedicated to shopping, for instance, a user tagging (action) a product (data object) means manifesting an explicit intention toward purchasing that product. The operationalizing of the intention to buy in these terms can surely be viewed as a reduction of the many facets of the act of buying—namely, a partial rendering of it in computable forms.
Yet through that reduction, something new that was not visible before is singled out, recorded, and computed. In turn, the computability of intention renders the act of buying on a social shopping platform pliable—that is, open to system suggestions and personalized recommendations. In short, buying online becomes something else (Alaimo 2014). The Facebook like constitutes another vivid example of the encoding of social life into data and its consequences. The like action is assumed to stand for approval on the Facebook platform (Gerlitz and Helmond 2013). Like (action) connects users with objects (a comment, other users, a brand, or a post), and it is assumed to stand for something (user preference, approval, appreciation, gratitude, and so on) that like abstracts into a universal, ambiguous, and all-encompassing token: the like. The value of like, in other words, does not consist in faithfully translating user approval or whatever a like might stand for. It rather resides in the apparent simplicity with which it expresses a particular instance of human intentionality (that is, approval) in ways that make huge masses of users commensurable under a quite ambiguous (and therefore general) like token. It is this generalizability and commensurability that makes like a valuable cognitive standard, a data currency, as it were, able to make visible behavioral patterns at the individual and aggregate levels. The correlation and computation of actions such as like against other variables, profile data (such as gender, location, and age), or social graph data (number or typology of
connections), is then used to measure and classify behavioral patterns of users and groups of users. Once again, the ultimate output of this relentless data work is the personalized suggestion of further activity: other likes. In this fashion, the encoding of any social activity into data—and the assumptions it embeds—gets further incorporated into different computational circuits and infrastructures, functioning as an anticipatory device of what users might like. Taste, social habits, or general social endeavors become empowered and constrained by predetermined actions, programmed by social media. Forms of sociality become technological (Alaimo 2014) insofar as what gets represented in data form is represented in certain ways (and not in others) so as to fit the logic embedded in the media and its technological underpinnings (Langlois 2013, 2014). The experience of sociality in social media cannot but adhere to the computational logic on which it is premised. It is in this regard that it becomes what we call computed sociality. To give yet another example, being friends on Facebook is represented by the “friending” action (Vallor 2012). One of the consequences of a technologically constructed action of friending is that friendship becomes a matter of counting the number of connections (see also Stiegler 2013). Even more so, online visibility—as Taina Bucher (2012) has eloquently demonstrated—becomes a rather complex process fostered by the mechanics of algorithmic ordering. EdgeRank, the algorithm that regulates visibility on Facebook’s home page, exhibits a certain circular logic according to which, in order to be visible, users need to participate, but participation is itself regulated by the mechanics of visibility that the algorithm proposes (ibid.). To put it simply, social media flatten the social into an all-encompassing logic of data linking and counting.
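How commensurable action tokens such as the like can be counted against profile variables and fed back as personalized suggestions can be illustrated with a deliberately toy sketch. The records, field names, and segmentation are our own inventions; no platform computes suggestions this simply:

```python
from collections import Counter

# Toy records: each like is already a standardized token tied to profile data.
likes = [
    {"user": "u1", "age_band": "18-24", "target": "brand_A"},
    {"user": "u2", "age_band": "18-24", "target": "brand_A"},
    {"user": "u3", "age_band": "35-44", "target": "brand_B"},
    {"user": "u4", "age_band": "18-24", "target": "brand_B"},
]

# Because every like is the same kind of token, heterogeneous users become
# commensurable: counting per (age_band, target) yields a behavioral pattern.
pattern = Counter((like["age_band"], like["target"]) for like in likes)

def suggest(age_band: str) -> str:
    """The anticipatory device: propose what a segment already likes most."""
    candidates = {t: n for (band, t), n in pattern.items() if band == age_band}
    return max(candidates, key=candidates.get)
```

Here `suggest("18-24")` returns `"brand_A"`: the circularity the text describes, in which encoded participation regulates what is made visible, which in turn solicits further participation.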
Conversely, it is exactly because of the reduction to this blank state that the social qua data can be aggregated, associated, and recombined. Owing to these qualities, social data lend voice to contingencies of everyday living that have hitherto lacked articulation, and for this reason remained beyond attention, inspection, insight, or control. Through aggregation, comparison, and other forms of data reordering, social media and the technologies sustaining them are able to lift, out of the edges of the social, possible relationships between persons and their doings that are used as the basis of recommendations and personalized suggestions. In this regard, social data at the same time restrict and enlarge the sense-making of what they encode. Once social life gets engraved into information of this sort, it ceases to be related to established categories, conventions, and habits. As a matter of fact, once social interaction is engraved into data it is even dissociated from the persons who, in their capacity as users, generate these data. In most cases, it is discrete actions (e.g., liking, tagging, or following) rather than persons that are the object of the further processing, aggregation, and analysis performed by social media technologies. Social life is transposed onto an impersonal and decontextualized data pool, and enacted according to the same logic by which social media conceive and translate sociality into data (Kallinikos 2009, 2012). Disaggregated into discrete acts, user platform participation provides standardized, granular, and recombinant data that
are constantly parsed, reshuffled, and reassembled to address the demands of a wider economy of inscription (Proffitt, Ekbia, and McDowell 2015; van Dijck 2013). These observations reveal how social data are part and parcel of the developments associated with big data. It is now time to turn to this cardinal issue.

Social Data Become Big (Data)

The technology of social media not only enables the encoding of sociality but also expands the capability of storing its data footprint. One of the assumptions of big data and social data is that everything is worth encoding and storing for possible future use. As a form of encoding, clicking, liking, commenting, and similar activities allow everyday life to be transmuted into discrete, countable actions that in effect commensurate its diversity and heterogeneity. Thus rendered commensurable, stored social data become potentially relational and recombinant (Kitchin 2014)—that is, able to make sense and produce value by being related to other data. This expanding associability of social data has an in-built mechanics of exponential growth that can be realized through various data permutations (Kallinikos 2006) that may support a variety of services and operations. To become valuable, social data call for an even more refined and expanding articulation of relations between data across data types and sources. Let it be recalled that social media assemble personal information by measuring the data tokens produced by their platformed sociality. Unlike commercial or search engine methods of organizing information, social media do not just measure clicks, hits, or links. They personalize content, making it easy for users to browse what might be relevant to them on the basis of the stock of social data that can be related to their individual data history.
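The relationality and recombinance described above can be shown in miniature as a join between two separately produced data sets. The records, the “activity score,” and its weights are our own assumptions for illustration, not any platform’s actual metric:

```python
# Two separately produced data sets, joined on a shared user key.
profile = {"u1": {"location": "Milan"}, "u2": {"location": "London"}}
activity = [
    ("u1", "like"), ("u1", "share"), ("u1", "like"),
    ("u2", "like"),
]

# An invented "activity score" per user: a weighted count of action tokens.
weights = {"like": 1, "share": 3}
scores = {}
for user, verb in activity:
    scores[user] = scores.get(user, 0) + weights[verb]

# Joining scores back to profile data makes both sets more valuable than
# either is alone: patterns by location, personalization by data history.
enriched = {u: {**profile[u], "score": scores.get(u, 0)} for u in profile}
```

The value here comes entirely from the relation: neither the profile table nor the action log says much on its own, but their combination supports exactly the pattern detection and personalization the text describes.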
Much of the relational potential of social data is realized by associating them across different data sets, thereby allowing the detection of patterns and the computation of several user activity scores. The substantial innovation social media bring to the web stems from the commensurability and relationality that social data afford. These are essential data qualities that derive from the reductive ways through which social media encode user platform participation, as analyzed in the preceding two sections (see figures 6.1 and 6.2). Currently, the characteristic ways of encoding social life, and the consequent capacity to commensurate and relate different aspects of sociality as data, are expanding well beyond the boundaries of social media. Technical conventions and actions, such as like, share, tag, and tweet, are increasingly used as linking devices that connect platforms and websites. Data techniques and standards that link social media platforms and websites are essential for understanding the portability of social data across the web, and the growing capacity to relate and combine them with other data types and sources. Social plug-ins such as the Facebook like and share buttons play a critical role. They do not just track user behavior but also make its data trace portable across platforms and sites. In short, they make the social
media encoding of interaction and communication the infrastructure of a new, increasingly social web. The assumptions and processes at the basis of the encoding of the like action, which we have analyzed in this chapter, have spread from Facebook to connected applications and across the web. In turn, the data produced by the encoding of likes on other websites go back to Facebook. The fact that Facebook “keeps track of what users ‘Like’ around the web” (Love 2011, para. 7) is just the tip of the iceberg. Through the omnipresent Facebook log-in, user profiles become portable, and social data—such as a list of likes, list of friends, and list of friends’ likes—are exchanged and increasingly reused by connected applications and third-party websites. In this respect, social media construct a new social data infrastructure of the web that enables users to share seamlessly, but only on terms that reflect carefully crafted and predetermined standards that restate the terms of user platform participation and the logic of encoding. A network of social APIs facilitates data recontextualization, the portability of user profiles, the diffusion of social buttons, and a continuous flow of data streams on user activities that enrich, and make more valuable, the social data produced by an already platformed sociality (Helmond 2015; Lovink and Rasch 2013). A process of data amplification is thus set in place whereby social data make their distinctive contribution to the developments associated with big data. It is important to stress that the expanded possibility of encoding various user activities in and across social media platforms is what measures and classifies user activities and users at the same time. Social media re-create social life as an enactment of behavioral scripts in the form of action tokens such as the Facebook like that can be further aggregated, abstracted, and recontextualized into other domains.
These scripts become potentially relevant because they are assumed to stand for something, and at the same time encode and construct what they stand for. Because like is assumed to stand for appreciation, it constructs the ways and modalities by which appreciation can be resocialized online. Like also becomes the standard of appreciation and concurrently the unit of measure of appreciation against which Facebook is able to measure and construct the appreciation (and thus the value) of content (such as news, videos, brands, products, etc.) across the web and on Facebook. The Facebook like is just an example, perhaps the most famous one, but other social media platforms are operating under exactly the same logic: Pinterest with the “pin” button, Twitter with its “tweet,” and social media for shopping with a tag (Alaimo 2014). To summarize and extend our argument, it is by encoding particular aspects of social life into data tokens that social media transform social interaction into a measurable information exchange able to produce economic value through social data. Furthermore, data tokens so designed shape a particular kind of social interaction. These conditions expand the relationality of data across types, and establish the means for the portability and reusability of social data from one platform to another. In doing so, they reinforce the particular logic of an already-mediated version of social life (the like on Facebook, for instance) as empowering
new forms of sociality across the web, and as a consequence, driving the flourishing economy of social data across disparate business domains. These observations make evident that the characteristic encoding of the social that social media set in place is a fundamental element in the commensuration process (Espeland and Stevens 1998) that relates disparate data sets on individuals and groups. By producing a new data language about public life and interaction, social media technology is depicting the contours of a new, computationally empowered representation of the social. This new data rendition of the social, in turn, enables the computation of personally relevant suggestions that make up the hard core of social media commercial value and activity, and shape what is now referred to as the new personal web. Social data produced under these conditions are certainly the strategic outcome of conceiving and instrumenting social interaction along the lines we have analyzed in this chapter. Social media platforms are, above all, social arrangements. Having said this, it is important not to lose sight of the fundamental fact that this mediation of sociality and public life is sustained by elaborate technological underpinnings. Data architectures, APIs, other boundary-managing technologies, and the accompanying database solutions are central elements of the infrastructures that allow the harvesting, portability, and combinability of social data and their bigness. These technologies are centrally involved in establishing the space of functionalities that define social media platforms as social arrangements and social data as the cognitive currency these arrangements produce. They drive the normalization, structuration, standardization, and computation of the social.
Much as they participate in steamrolling the differences that make up a diffuse and multivalent everyday, these technologies and resources also contribute to re-creating the social, lending depth and perspective, as repeatedly claimed, to the computational mediation of social interaction. Without the technologically enabled capacity to structure, standardize, and compare disparate data sets, social data would have remained a localized cognitive idiom, and social media platforms would be no more than flat instruments of recording and transcription of an artificial everyday, with no information potentiality and relationality (Kallinikos 2006, 2009; Zuboff 1988).

Concluding Remarks: The Big Picture of Social Data

Our depiction of the operations of social media reveals a distinctive way through which they conceive, instrument, and ultimately refigure social interaction and social relations. By producing a new data language on how social connections are made online, social media technology establishes the generative principles and techniques of a new, computationally empowered representation, and by extension, constitution of social interaction and communication (Alaimo 2014). Social media, we claim, mediate the social insofar as they select and store some aspects of social communicative phenomena encoded as data, and represented in standardized and
permutable forms amenable to computation. In return, they produce information that constantly suggests, recommends, and constructs new forms of social communication and interaction. Conceived as infrastructures of the social, the arrangements that social media put in place can only select, store, and frame representations of social interaction that appear compatible with the formats, technical requirements, and data languages on the basis of which social media operate. It is only by encoding data under their own formats, and relating them under their own functionality, that social media platforms produce value in social and economic terms. Thus viewed, social media are a real-time system of bottom-up, emergent, and contingent data structures and classifications that result from the aggregation of minute, constantly updatable behavioral choices instrumented as tagging, following, liking, and so forth. Data structuration on these premises reorders reality and knowledge under different principles (Castelle 2013; Fuller and Goffey 2012a, 2012b; Mackenzie 2012). Sorting out the social thus becomes a data work of manipulating database objects produced by highly stylized user participation, instead of making sense of the social by relating it to extant social categories, modes of interaction, or practices. It is important to keep in mind that social media platforms do not simply produce data. The mediation of the social that social media accomplish is rendered possible by a complex apparatus and its technical data work: normalization, stylization, structuration, and computation. Under these conditions, it should come as no surprise that the operative logic of data structuration is transposed onto the highly organized social experience that the front end (user interface) of social media sustains (Manovich 2002). How the social is experienced is considerably shaped by the different principles on which data are organized and made computable.
In all these respects, the experience of sociocultural communicative conventions and acts that occur on social media is fundamentally a computed version of sociality (Alaimo 2014). It should go without saying that social outcomes are never mechanically produced and reproduced. They certainly show variability across contexts and situations. But to ignore this far-reaching reconfiguration of social interaction and communication that is produced by what we call the apparatus of social media amounts to overlooking an essential part of the conditions that shape contemporary life. These remarks point to another pivotal feature of social data. In some essential respects, the social media apparatus (Foucault 1980) by which social data are produced signifies the stepping of a certain everyday into the forefront of public life, and a reevaluation of its institutional and economic importance. As briefly touched on above, social data as programmed and instrumented by social media platforms are distinct from the traditional categorization of social behavior as well as the institutionally anchored knowledge practices and data operations with which such categorization has been associated. By encoding, marketizing, and in a way, making visible a new everyday, the apparatus of social media and their associated business networks challenge and step into the place of knowledge practices that have
traditionally been pursued by experts (e.g., accountants, marketers, statisticians, and medical experts) in a variety of institutional settings (Kallinikos and Constantiou 2015). It is thus vital to recognize that the insertion of this daily data footprint in semipublic and commercial contexts cuts across long-standing modern divisions between personal and public, domestic and institutional life (Gellner 1994; Luhmann 1995; Walzer 1983). The analysis of the apparatus of social media attempted in this chapter shows how social media are actively involved in the production of new types of data that have commonly remained outside the regulative purview of institutions. Such data render the ongoing, daily patterns of social behavior at the individual and collective levels by providing the online facilities, techniques, and resources that are meant to lend social interaction its own voice. These types of data most readily enter the circuits of big data. Social data qua big data silently intermesh with more traditional data in almost every business sector to render a picture of social life that is essentially shaped by the ways social media encode and compute interaction and communication online. These developments, we claim, signify a remarkable shift with respect to how knowledge on social life can be produced and used, and eventually, become commercialized. The implications of these developments are at present only dimly understood. Policies around social data need to reflect on this wider and deeper system of shifts beyond the standard concerns of personal identity, privacy, and surveillance that have tended to dominate current discourses on big and social data across many social science fields. The ideas we have put forward in this chapter suggest a deep refiguration of the social bonds characteristic of the modern social order (Heller 1999).
Modes of social interaction and knowledge exchange that have been traditionally tied to social compartments and institutions such as work, family, and community (Kallinikos 2004; Walzer 1983) are seriously challenged by the developments we single out and analyze in this chapter. A similar argument could possibly be made regarding the buzz of data analytics, data mining, and big data in more technically oriented fields, such as management, economics, mathematics, and statistically oriented data sciences.6 The implications of the intrusion of the everyday into markets and organizations transcend the common management themes depicted in the literature (see, for example, Davenport 2014; Brynjolfsson and McAfee 2014; Varian 2010). The contrast between expert and lay knowledge and data extends far beyond the immediate gains several stakeholders obtain by the commercialization of data as well as its managerial uses (Kallinikos and Constantiou 2015). More ambiguous are the changes that the data practices of social media are bringing to standard schemes and formats for organizing work and entertainment. It is interesting to ask what sort of entities (e.g., organizations and institutions) social media platforms are, and whether the data practices they diffuse are here to stay.

III  Big Data and Science

7  Big Genomic Data and the State Jorge L. Contreras

Nearly a decade ago, institutional theorist Elinor Ostrom observed that modern biology, once the quintessential “wet” laboratory discipline, has quietly become an information science (Hess and Ostrom 2006, 335). Petri dishes and microscopes are being replaced by computer servers and high-throughput gene sequencers, and a mastery of advanced statistical methods and computer modeling languages is as important to contemporary biomedical researchers as traditional laboratory techniques once were (Service 2013, 186). Nowhere is Ostrom’s observation more apt than in the field of genomics, the study of the complete genetic makeup of humans and other organisms. As of February 2016, the National Center for Biotechnology Information’s (NCBI) GenBank database contained approximately 1.4 trillion DNA base pairs from more than 190 million different genetic sequences (NCBI, n.d.). The European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan maintain comparable repositories. These publicly funded and managed databases are synchronized on a daily basis, and made openly available to researchers and members of the public around the world. Though the quantity of available genomic data has not yet reached the petabyte outputs generated by the largest astronomical observatories and particle colliders, it is no exaggeration to speak of genomic data as “big” data. This chapter explores the development and growth of genomic data and databases in the United States and globally, with a focus on the role that governments have played in their conception, creation, and ongoing maintenance. What emerges is a picture of governmental involvement in genomic data aggregation that is far different from the role that the state has played in other areas of scientific investigation, prompting questions regarding the optimal role of the state in the emerging world of data-driven bioscience.
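A rough back-of-envelope calculation, using only the figures quoted above and assuming one byte per base for uncompressed text storage (our own arithmetic, ignoring metadata and compression), shows why GenBank counts as big data while still falling short of petabyte scale:

```python
# GenBank scale as of February 2016, per the figures cited in the text.
base_pairs = 1.4e12   # ~1.4 trillion DNA base pairs
sequences = 190e6     # ~190 million genetic sequences

# Average sequence length implied by the two figures.
avg_len = base_pairs / sequences

# Raw size at one byte per base, expressed in terabytes and petabytes.
terabytes = base_pairs / 1e12
petabytes = base_pairs / 1e15

print(f"~{avg_len:,.0f} bases per sequence")   # ~7,368 bases per sequence
print(f"~{terabytes:.1f} TB, {petabytes:.4f} PB uncompressed")
```

At roughly 1.4 TB of raw sequence text, the corpus is large but still hundreds of times smaller than a petabyte, which is consistent with the comparison to observatories and colliders drawn in the text.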
Big Science and the State

Following World War II, the success of US military efforts to develop technologies such as radar and atomic weapons led to a model of state-sponsored “big science” (Leslie 1993). These projects—which were typically focused on large-scale, resource-intensive, multiyear scientific undertakings—included particle accelerators, ground-based and orbiting telescopes,
and the US manned space program. Today, in the United States alone, the federal government funds more than $60 billion in nondefense basic scientific research annually (American Association for the Advancement of Science 2015), generating vast amounts of data and public knowledge. This big science model is not limited to the United States; comparable programs exist, or have existed, in Russia, China, Japan, and several countries in western Europe. The goal of state-funded research projects is often to generate large quantities of observational or experimental data (Reichman and Uhlir 2003, 322; National Research Council 1997, 113–114). These data are collected either by government-operated instruments such as spacecraft and telescope arrays or by private universities and research institutions funded by government grants. The resulting data sets have traditionally been made available to the public directly by governmental agencies such as NASA’s National Space Science Data Center, the National Center for Atmospheric Research, and the US Geological Survey’s Earth Resources Observation Systems Data Center, or by government-funded repositories at private institutions such as the Space Telescope Science Institute hosted at Johns Hopkins University. The conventional account of big science portrays the state (or in some cases, multigovernmental coalitions) as either the direct generator of these data or the principal funder and procurer of data from institutional researchers. Much of the policy and economics literature on the state’s role in research revolves around the nature of this procurement function, and how the government should best incentivize research to maximize the public good (Loiter and Norberg-Bohm 1999; Bozeman 2000).
Although the conventional model may still hold true in fields such as astronomy and high-energy physics, it fails to describe more recent data-driven projects in the life sciences, most notably in the area of genomic research.

The Human Genome Project

Genomic research comprises the large-scale study of the genetic makeup of humans and other organisms. Almost every living organism carries within its cells a vast quantity of DNA, a chemical substance that is composed of long strings of four basic building blocks or “nucleotides”: adenine, thymine, guanine, and cytosine. These nucleotides are intertwined in a ladderlike chain: the famous “double helix” structure described by James D. Watson and Francis Crick in 1953. Each rung of this DNA ladder is referred to as a “base pair,” and the full complement of DNA within an organism’s cells is its “genome.” The genome of a simple organism such as the E. coli bacterium contains approximately 5 million base pairs, the genome of the fruit fly contains approximately 160 million, and that of humans contains approximately 3.2 billion base pairs. Some segments of DNA within an organism’s cells operate together as units called “genes,” which range in size from as few as 100 to more than 2 million base pairs. It is currently
estimated that humans each possess approximately 20,000 genes, which are responsible for the inheritance of traits from one generation to the next and encode the proteins responsible for different biochemical functions within the cell. Any two human genomes are approximately 99.5 percent identical, with small differences accounting for most of the inherited variation in human physical and physiological traits. Though researchers have studied the genetic bases for particular diseases (e.g., Huntington’s disease) and physiological traits (e.g., eye color) for more than a century, the creation of a complete map of the human genome was not considered seriously until 1985. At that time, leading genetic researchers, encouraged by the emergence of improved technologies for identifying the sequence of nucleotides within DNA molecules, first proposed a project to sequence the entire human genome (Cook-Deegan 1994, 79–91; McElheny 2010, 17–33). It was believed that the determination of the human genome sequence could improve the understanding of basic biochemical cellular processes, identify factors elevating the risk of disease susceptibility, help develop diagnostics and therapeutics responsive to particular genetic characteristics, and improve the understanding of heredity as well as the development of species and individual organisms. Unlike other large-scale scientific projects, the proposal to map the human genome originated with academic researchers rather than the government. Nevertheless, it soon became apparent that the scale of the project would be massive and could only be accomplished with substantial governmental funding. Two US federal agencies had an early interest in the project: the National Institutes of Health (NIH), which funded most genetic research in the United States during the 1970s and 1980s, and the Department of Energy, which wished to study genetic mutations in atom bomb survivors (Cook-Deegan 1994, 97–104).
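The genome sizes quoted above lend themselves to a quick back-of-the-envelope storage calculation. The sketch below (purely illustrative; the sizes are the approximate figures cited in this chapter) assumes the information-theoretic minimum of two bits per nucleotide for packed storage, and eight bits per base for a naive uncompressed text encoding.

```python
# Back-of-the-envelope storage estimates for the genome sizes quoted above.
# 2 bits/base is the minimum needed to distinguish four nucleotides;
# 8 bits/base approximates one character per base in a plain-text format.

GENOME_SIZES_BP = {
    "E. coli": 5_000_000,
    "fruit fly": 160_000_000,
    "human": 3_200_000_000,
}

def storage_estimate_mb(base_pairs, bits_per_base=2):
    """Storage in megabytes for a genome of `base_pairs` bases."""
    return base_pairs * bits_per_base / 8 / 1_000_000

for organism, bp in GENOME_SIZES_BP.items():
    packed = storage_estimate_mb(bp)                 # 2 bits per base
    text = storage_estimate_mb(bp, bits_per_base=8)  # 1 byte per base
    print(f"{organism}: {packed:,.0f} MB packed, {text:,.0f} MB as text")
```

At two bits per base, a single human genome fits in roughly 800 MB before compression, which helps put the terabyte-scale holdings of later sequence repositories in perspective.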
Eventually these two agencies agreed to colead the project, which became known as the Human Genome Project (HGP). The HGP was formally launched in 1990 with additional support from the UK-based Wellcome Trust along with the involvement of funding agencies in the United Kingdom, France, Germany, and Japan (International Human Genome Sequencing Consortium 2001).

Characteristics and Scale of Genomic Data

Over its decade-long existence, the HGP, together with a competing private sector effort by Celera Genomics, mapped the 3.2 billion DNA nucleotide base pairs that form the human genome, depositing a total of 8 billion base pairs (8 gigabases) in public databases (“Human Genome at Ten” 2010). At the time, this accumulation of genomic data was unprecedented, but it has been vastly overshadowed by the quantities of data generated since the HGP’s conclusion. As of February 2016, the NCBI’s (n.d.) GenBank database of DNA sequences contained approximately 1.4 trillion base pairs, or 1,400 gigabases. By the end of 2012, the ambitious “1000 Genomes Project” (2012) alone had produced more than 200 terabytes of data
representing the complete genomes of 1,092 individuals. The rate at which DNA is being sequenced continues to increase. According to recent statistics, the number of DNA bases contained in the GenBank has doubled every eighteen months since 1982 (NCBI, n.d.), and in 2011, a single DNA sequencer was capable of generating in one day what the HGP took ten years to produce (Pennisi 2011). In addition to “raw” DNA sequence data, genomic studies undertaken during and since the HGP have generated data regarding DNA markers known as single nucleotide polymorphisms (SNPs), groups of DNA sequences that are likely to be inherited together (haplotypes), expressed sequence tags, whole-genome shotgun assemblies, messenger RNA segments, and many other data types. Moreover, many genomic studies today generate data relating to associations between particular genetic markers and disease risk and other physiological traits (so-called genome-wide association studies [GWAS]). GWAS data sets include not only DNA sequence information but also numerous data fields relating to phenotypic (physical, physiological, and demographic) traits that are linked to associated genetic data. It is also important to note that not all large-scale genomic research projects have been led by governmental agencies. In fact, several significant private sector projects have, since the days of the HGP, contributed large amounts of genomic data to the public. 
These include the Merck Gene Index, an effort by pharmaceutical giant Merck (1995) to place nearly one million short DNA sequences known as expressed sequence tags into the public domain; the SNP Consortium, a project funded by a group of pharmaceutical and information technology firms to discover, map, and release to the public a large number of SNP markers (Holden 2002); and the International Serious Adverse Events Consortium, a similar pharma-driven effort to identify genetic markers for adverse drug reactions and contribute them to the public (Contreras, Floratos, and Holden 2013). The massive accumulation of publicly available genomic data arising from these projects has been referred to as the “genome commons” (Contreras 2011). This public resource has been utilized by researchers around the world as the basis for further investigation and discovery. Numerous studies have credited the availability of this vast trove of public genomic data with generating substantial scientific advancement and economic growth (Paltoo et al. 2014; Battelle 2013; Collins 2010a), and table 7.1 lists a few of the many public genomic databases that currently make up the genome commons.

The Role(s) of the State in the Genome Commons

The HGP was one of the most ambitious big science projects of its era. It has been compared, among other things, to the Manhattan Project, the Apollo space program, and the Lewis and Clark expedition (Collins 2010b, 2; McElheny 2010, ix). As with all projects of this magnitude, governmental agencies, principally the NIH and the Department of Energy, played a dominant role in formulating, funding, and executing the HGP. Nevertheless, as discussed below, the role of the state in the HGP and subsequent genomic research projects has evolved



Table 7.1  Selected Public Genomic Databases

Database | Operator/host | Data types
EMBL nucleotide sequence database | EMBL and European Bioinformatics Institute | DNA sequence data (all organisms)
GenBank | NCBI | DNA sequence data (all organisms), SNPs, annotations, and other
DNA Data Bank of Japan | National Institute for Genetics (Japan) | DNA sequence data (all organisms)
Merck Gene Index | NCBI (GenBank) | Expressed sequence tags
— | — | Human SNP markers
SNP Consortium | Cold Spring Harbor Laboratory | Human SNP markers and map
RefSeq | NCBI | Reference genomes
— | Cold Spring Harbor Laboratory | Human haplotype data
Encyclopedia of DNA Elements and modENCODE | — | Catalog of functional human and model organism DNA elements
Database of Genotypes and Phenotypes (dbGaP) | NCBI (NIH National Library of Medicine) | GWAS, sequence, epigenomic, and other
Genetic Association Information Network | NCBI (dbGaP) | GWAS data relating to specific diseases
International Serious Adverse Events Consortium | Columbia University | Genetic markers for adverse drug reactions
International Cancer Genome Consortium database | Ontario Institute for Cancer Research | Tumor genomes
Cancer Genome Atlas | National Cancer Institute and University of California at Santa Cruz | Cancer-related genomic mutations
Human Microbiome Project | Multiple university and governmental centers | Genomic sequence of human-associated microbes
1000 Genomes | Amazon Web Services, NCBI, and European Bioinformatics Institute | Complete human genomes from 1,000+ individuals


to become more nuanced and multifaceted than in other government-funded big science projects.

Individual Privacy

The HGP and other early genomic sequencing projects did not make use of or retain individually identifiable information regarding the DNA samples being sequenced. The only phenotypic data deemed relevant at the time related to subject ethnicity (International Human Genome Sequencing Consortium 2001, 868). As a result, there was minimal concern over human subject consent and privacy in early genomic studies. In recent years, however, mounting evidence suggests that even anonymized and aggregated DNA samples can be traced to individual donors (International Cancer Genome Consortium 2014, 519–520; Lowrance and Collins 2007). Such findings increasingly indicate that the interests of data subjects in genomic studies require substantial attention (Haga and O’Daniel 2011, 320). An even clearer need for protecting individual privacy arises when databases link genetic information to phenotypic data, including the DNA donor’s age, ethnicity, weight, demographic profile, environmental exposure, disease state, and behavioral factors, even when stripped of personally identifiable information. These developments have led to numerous proposals for heightened protection of individual identity in publicly released genomic data (Ossorio 2011, 908; Jasny 2013, 9). The NIH has responded to these concerns in several ways. First, it supported legislative policy initiatives directed toward the prevention of discrimination on the basis of personal genetic information.1 The culmination of these efforts was Congress’ passage of the Genetic Information Nondiscrimination Act of 2008, which prohibits discrimination in employment and health insurance markets on the basis of genetic information (e.g., the presence of a genetic marker for elevated disease risk).
Sensitive to the additional privacy risks posed by the linkage of phenotypic information with genetic data, the NIH developed the Database of Genotypes and Phenotypes (dbGaP) to house and distribute data from GWAS. The dbGaP database allows access to data on two levels: open and controlled. Open data access is available to the general public via the Internet, and includes nonsensitive summary and aggregated data. Data from the controlled portion of dbGaP may be accessed only under conditions specified by the data supplier, often requiring certification of the user’s identity and research purpose (Paltoo et al. 2014, 935). Finally, the NIH has expanded its requirement to obtain the consent of human subjects in genomic studies. Under its Genomic Data Sharing Policy, the NIH (2014a) expects investigators to obtain informed consent from DNA subjects for all future research uses and the broad sharing of their genomic and phenotypic data, even if cell lines and clinical specimens are de-identified.
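The two-tiered dbGaP model described above can be sketched as a simple access rule. The code below is a hypothetical illustration, not dbGaP's actual request workflow; all class and field names are invented for the example.

```python
# Illustrative sketch of a two-tier access model like the one described
# for dbGaP: open-tier (summary/aggregate) data are public, while
# controlled-tier (individual-level) data require a certified identity,
# a stated research purpose, and committee approval. All names here are
# hypothetical.
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    tier: str  # "open" or "controlled"

@dataclass
class AccessRequest:
    dataset: DataSet
    identity_certified: bool = False
    research_purpose: str = ""
    committee_approved: bool = False

def may_access(req: AccessRequest) -> bool:
    """Open-tier data are public; controlled-tier data need all three checks."""
    if req.dataset.tier == "open":
        return True
    return (req.identity_certified
            and bool(req.research_purpose)
            and req.committee_approved)

summary = DataSet("GWAS summary statistics", "open")
genotypes = DataSet("individual-level genotypes", "controlled")
print(may_access(AccessRequest(summary)))      # open tier: no approvals needed
print(may_access(AccessRequest(genotypes)))    # controlled tier: denied as-is
print(may_access(AccessRequest(genotypes, True, "replication study", True)))
```

The design point the sketch captures is that the sensitivity decision attaches to the data set, not the requester: the same user faces no barrier for aggregate data and a full review for individual-level data.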



Data and Database Protection

Under US law, it has long been held that “facts” such as scientific data are not subject to copyright protection (Feist Publications 1991). Databases that merely contain simple compilations of factual information similarly lack formal legal protection. Access to data contained in electronic databases, though, can be controlled by the database operator via technical means, such as password-restricted access, as well as through limitations built into contractual access agreements. Thus, while data themselves may not be subject to legal protection, the law prohibits the circumvention of such technical protections. In this way, scientific information that might otherwise be in the public domain can become encumbered when compiled in proprietary databases (Reichman and Uhlir 2003, 335). Celera Genomics adopted restrictions of this nature when it announced its intention to sequence the human genome in competition with the publicly funded HGP in 1998 (Jasny 2013, 6). Celera’s business model would have offered the resulting data to commercial users pursuant to license agreements and paid subscriptions. Yet as discussed in greater detail below, the NIH chose to adopt an open access data policy for all HGP-funded data—an approach that continued in other genomic and related projects, and eventually prevailed over Celera’s and other proprietary models.

Data Release

The fact that the genome commons is today a global, public resource owes much to a 1996 agreement among HGP leaders and policy makers formulated in Hamilton, Bermuda. These “Bermuda Principles” required that all DNA sequences generated by the HGP be released to the public a mere twenty-four hours after generation (US Department of Energy, n.d.)—a stark contrast to the months or years that usually preceded the release of scientific data in other federally funded projects.
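The contrast between the twenty-four-hour Bermuda rule and the longer embargoes of later policies can be expressed as a simple policy table. The sketch below is illustrative only: the twenty-four-hour window is the Bermuda Principles figure cited above, while the six-month window is a hypothetical stand-in for the longer confidentiality periods that later policies permit.

```python
# Sketch of how a data release policy translates a generation timestamp
# into a public-release deadline. The 24-hour window reflects the
# Bermuda Principles; the 6-month window is a hypothetical illustration
# of a later embargo-style policy.
from datetime import datetime, timedelta

RELEASE_WINDOWS = {
    "bermuda": timedelta(hours=24),    # release within 24 hours of generation
    "embargoed": timedelta(days=180),  # hypothetical 6-month hold
}

def release_deadline(generated_at: datetime, policy: str) -> datetime:
    """Latest permissible public-release time under the named policy."""
    return generated_at + RELEASE_WINDOWS[policy]

run = datetime(2016, 2, 1, 9, 0)
print(release_deadline(run, "bermuda"))
print(release_deadline(run, "embargoed"))
```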
The Bermuda Principles arose from recognition by researchers and policy makers that the rapid as well as efficient sharing of data was needed to coordinate activity among the geographically dispersed laboratories working on the massive HGP. In addition, this approach reflected a conviction that the rapid release of genomic data would accelerate the advancement of science and medical breakthroughs (International Human Genome Sequencing Consortium 2001, 864; Collins 2010a). Although the HGP concluded its work in 2003, the Bermuda Principles continue to shape data release policies in genomics and related fields (Collins 2010a, 675; Kaye et al. 2009, 332). Advances in technology, however, together with increasingly challenging ethical and legal issues, have given rise to policy considerations not addressed by the Bermuda Principles. Among these are the need to protect human subject data, even at the genomic level, and the desire of researchers who generate large data sets to analyze and publish their findings before disclosing them to others. The emergence and recognition of these considerations have led to an evolution of genomics data release policies within both governmental agencies (primarily the NIH, Genome Canada, and corresponding European funding agencies) and
nonprofit research funders (such as the Wellcome Trust in the United Kingdom and Howard Hughes Medical Institute in the United States). The data release policies for genomic research projects that immediately followed the HGP were generally more restrictive and complex than their predecessors, but largely preserved the fundamental shared nature of the genome commons (Contreras 2011; Manolio 2009; GAIN Collaborative Research Group 2007; NIH 2007). Most recently, the NIH’s (2014a) Genomic Data Sharing Policy makes multiple concessions to data-generating researchers, particularly by extending the time periods during which researchers may keep genomic data confidential before releasing them to the public. The impact of these changes, which erode but do not entirely dismantle the rapid data release paradigm of the Bermuda Principles, remains to be seen (Contreras 2015). Despite substantial governmental involvement and investment in the HGP and subsequent genomic research projects, the formulation of genomic data release policies has involved significant input from stakeholder groups including researchers, publishers, and representatives of the public (Contreras 2011). As such, the development of the Bermuda Principles and other data release policies underwent a process of polycentric negotiation and compromise, with the state assuming the role of one actor among many.

Patents

The patenting of genetic information has been the subject of substantial controversy in the United States, Europe, Australia, and elsewhere. According to some sources, the number of gene patents has steadily risen since the early 1990s (Jensen and Murray 2005, 239; Rosenfeld and Mason 2013). The holder of a patent has the legal right to block others from making the patented item or practicing the patented method.
Accordingly, some observers have expressed concern that the increase in the number of patents covering biological materials, especially DNA and RNA sequences, could limit the ability of others to conduct medically significant research (Heller and Eisenberg 1998; Huys et al. 2009, 903). And while recent US court decisions have limited the availability of patents on naturally occurring DNA molecules (i.e., AMP v. Myriad, 2013), the effect is uncertain and likely to be limited in scope. In response to the perceived threat of these patents, since the mid-1990s, the NIH has opposed the patenting of genetic information (Contreras 2011; Rai 2012, 1241). This opposition took three principal forms: advocating a more stringent interpretation of the patent “utility” requirement before the US Patent and Trademark Office, enacting rapid data release policies per the Bermuda Principles, and discouraging NIH grant recipients from obtaining and exploiting patents on broadly applicable research tools. In terms of the utility requirement, the NIH and others argued that raw DNA sequences, the biological functions of which were unknown, did not constitute “useful” inventions meriting patent protection (National Research Council 2006, 52–53). This assertion was largely successful, and the US Patent and Trademark Office (1999) modified its examination guidelines to limit the availability of patents claiming raw DNA sequence data.



The NIH’s data release policies, beginning with the Bermuda Principles, also served to limit the availability of patents claiming genomic data. These policies both inhibited centers generating genomic data from filing for patent protection on the DNA sequences they generated and created “prior art,” potentially blocking the filing of patents by third parties separately discovering the same DNA sequences (Contreras 2011). Finally, the NIH has issued numerous public statements discouraging grant recipients and others from seeking to patent human DNA sequences and other upstream research tools. While these statements may influence the behavior of some institutions, their lack of enforceability has been criticized (Contreras 2015; M. Mattioli 2012, 128; Rai and Eisenberg 2003, 293–294).

Data Custody and Curation

As discussed above, many government-funded big science projects, particularly in astronomy, earth science, and high-energy physics, make large quantities of observational or experimental data available to the public. In many of these cases, data are housed at a state-operated facility such as the National Center for Atmospheric Research or a private institution funded through government grants. Because data in these fields are relatively static once recorded, and because there are few, if any, concerns regarding individual privacy, little oversight or monitoring of data usage is required after data are uploaded to a publicly accessible database. The state role with respect to these data is thus passive and custodial. De-identified human DNA sequence data as well as all nonhuman genomic data have typically been treated by the NIH in a manner similar to that in which other governmental agencies have treated astronomical, atmospheric, or physics data. They are often uploaded directly to a public database shortly after undergoing basic quality control procedures and made available to the public without restriction.
The principal databases for the public deposit of genomic sequence data are GenBank, administered by the NCBI (a division of the NIH’s National Library of Medicine), the EMBL in Hinxton, England, and the DNA Data Bank of Japan. These three databases form the International Nucleotide Sequence Database Consortium, which coordinates their activity and synchronizes all three databases on a daily basis. Data uploaded to these sequence databases are typically not altered or filtered materially. These data are immediately available for public download and use once they enter the database. In recent years, researchers have experimented with new approaches to the storage, handling, and public release of genomic data sets. Data from the ambitious 1000 Genomes Project (2012), for example, are being made available through the “cloud” via Amazon Web Services. Different considerations come into play, however, when dealing with human genomic data that are linked to phenotypic data such as the donor’s age, ethnicity, weight, demographic profile, environmental exposure, disease state, and behavioral factors. In these cases, even if individual names and identifying information are stripped away, a risk exists that data can be reassociated with the individual. As noted above, phenotypic data and
genotype-phenotype associations are housed in repositories such as dbGaP operated by the NIH’s National Library of Medicine. dbGaP’s two-tiered structure allows access to sensitive information to be authorized on a case-by-case basis by a standing Data Access Committee, composed of NIH personnel. As of December 2013, this committee had approved 69 percent of the nearly eighteen thousand data access requests made to the dbGaP since its inception (Paltoo et al. 2014, 936). In this respect, the NIH acts as the guardian of personally sensitive information that may be gleaned from the data stored within its databases. In addition to acting as the custodian and guardian of genomic data generated by federally funded projects, the NIH cultivates and improves the data in its care. The DNA sequence data uploaded by researchers to databases such as GenBank may at times be duplicative, incomplete, or flawed. Researchers wishing to download the complete genome of a particular organism would be hard-pressed to identify and assemble all the necessary elements from GenBank deposits. The NCBI addressed this issue with the introduction of the RefSeq database in 2000. RefSeq contains a “reference” genome for each organism (and particular strains of organisms), which is compiled by NCBI (2013) staff from GenBank records. As higher-quality data are added to GenBank, the RefSeq genomes are regularly updated and refined.

Relation to Private Sector Projects

In addition to the explicit state roles described above, the NIH and other state actors such as the US Food and Drug Administration have taken an active interest in the genomic data generation and release activities of the private sector. For example, private sector projects are generally welcome to deposit data in NIH/NCBI databases such as GenBank and dbGaP, which are hosted, maintained, and updated at public expense.
The agencies also have both expressly and implicitly encouraged the activities of private sector groups that intend to release genomic data to the public, as opposed to the propertization approach taken by Celera and others (an approach the NIH strongly opposed).

Conclusions

As shown in this chapter, the conventional postwar account of the state as a monolithic procurer of big science projects is both incomplete and inaccurate when applied to the public genomic commons. Understanding the evolution and future of this valuable public resource requires a reconceptualization of the state role into multiple discrete, functional units and approaches. The experience of the genome commons reveals the state, specifically the NIH, acting in the following diverse capacities:

•  the architect of a national genomic data generation strategy
•  the funder of data generation
•  an advocate for legal change directed at improving the usefulness, reliability, and ethical soundness of genomic data resources
•  the guardian of individual rights and privacy in genomic data
•  a participant in a polycentric governance process seeking to determine the optimal framework for the release of publicly funded data
•  the host for the principal US genomic databases
•  the adjudicator and creator of reference data sets
•  a convener and encourager of private sector actors making concurrent contributions to genomic data resources

The range of roles played by the federal government in the formation and maintenance of this valuable public resource is telling. Although the series of genomic research projects led by the NIH has not been without controversy, it is generally acknowledged that the wealth of genomic information generated and placed into the public sphere by these projects has yielded substantial public benefits. The NIH’s cooperation with academic and private sector researchers, along with the integration of data from multiple programs into large, centrally managed databases, has likely contributed to the success of these efforts. This suggests that policy makers in other settings should consider diversifying the state’s role with respect to big data generation and management. In doing so, they would do well to recall the example of the genome commons.

8  Trust Threads: Minimal Provenance for Data Publishing and Reuse
Beth Plale, Inna Kouper, Allison Goodwell, and Isuru Suriarachchi

The world contains a vast amount of digital information, which grows ever vaster, ever more rapidly. This makes it possible to do many things on an unprecedented scale: spot social trends, prevent diseases, increase freshwater supplies, accelerate innovation, and so on (Bresnick 2015; Brynjolfsson and McAfee 2014; Neale 2014; Cuttone, Lehmann, and Larsen 2014). Science and technology innovation already plays an essential role in improving natural environments and human welfare, and the growing sources of data promise to unlock ever more secrets. But the rapid growth of data also makes the accountability and transparency of research increasingly difficult. Data that cannot be adequately described because of their volume or velocity (speed of arrival) are not usable except within the research lab that produced them. Data that are intentionally or unintentionally inaccessible or difficult to access or verify are not available to contribute to new forms of research. In this chapter, we show how data can carry with them thin threads of information about their lineage—trust threads—that connect the data to both their past and future, forming a provenance record. In carrying this minimal provenance through which the lineage of an object can be traced, the data inherently become more trustworthy. Having this “genealogy network” in place in a robust way as data travel in and out of repositories and through tools is a critical element to the successful sharing, use, and reuse of data in science and technology research in the future. Digital data’s disproportionally large impact on science and technology research is a relatively recent phenomenon, because digital data (especially in large quantities) are themselves relatively recent. Science has evolved over the last few hundred years to consist of four distinct but related methodologies, or paradigms: empirical (observation and experiment based), theoretical, computational, and data exploratory (Gray 2009).
Early science was largely empirical, focused on observed and measured phenomena derived from actual experience; Charles Darwin’s On the Origin of Species is a good example of carefully recorded observation. Early science also incorporated theory and experiments. For instance, Robert Boyle used mathematical laws to model physical phenomena, and then performed elaborate experiments to validate
his hypotheses (Shapin 1989; Crease 2008). The last few decades saw the growth of computational science, where mathematical laws are implemented in software as a model or simulation that abstracts (i.e., simplifies) a complex, real-world phenomenon, and does so methodically so that the simulation stays congruent with the real-world phenomenon. By utilizing abstractions, numerical methods, simulations, and the ever-increasing power of computers, computational science has given society more accurate weather predictions that can anticipate as fleeting an event as a tornado and has improved jet propulsion through detailed modeling of fluid flow. Most recently, data exploration has emerged as a legitimate form of science—the so-called fourth paradigm of science (Gray 2009). Data that are captured by observational instruments, sensors, cameras, tweets, or cash registers are analyzed using computational tools (software), looking for trends or anomalies (Burdon and Andrejevic, this volume). Biology, for example, has virtually turned into an information science: genomicists study patterns in DNA sequences to identify diseases or trace heritage (Contreras, this volume). As Jim Gray (2009, xix) points out,

Similar to the three research paradigms before it, the fourth paradigm promises unprecedented advances in our understanding of society and the planet on which we live. The fourth paradigm is a response to new data sources, the exponential growth of data (Berman 2008), and the increasing scope and complexity of scientific challenges. Technology has transformed society into one that is more connected and has more channels to stay informed (for a critical analysis of social data, see Alaimo and Kallinikos, this volume), yet grand challenges remain in energy, clean and abundant water, sustainable food, and a healthy population. These large-scale problems require interdisciplinary teams of natural or physical scientists, technologists, physicians, social scientists, legal teams, and policy makers to solve them (for a discussion of big data and teamwork in health care, see Sholler, Bailey, and Rennecker, this volume). These interdisciplinary teams bring their own data and methodologies, but because they work across discipline boundaries, they are increasingly interested in using data from varied sources. In this chapter, we advance the trustworthiness of scientific digital data through a lightweight technological formalism for data provenance. We focus on a critical window in the life cycle of a data object, from the point where it is made available for publishing to the point where it is reused. This window is unique because valuable ephemeral information about the data is known only during this time period and, if not captured immediately, is lost forever.

Trust Threads 


We posit that digital data are more trustworthy when more can be known about where a data object came from, what processes were applied to the data object, and who acted to operate on the object. Data provenance addresses this lineage information: information about the entities, activities, and people who have effected some type of transformation on a data object through its life (Simmhan, Plale, and Gannon 2005). Our proposed model draws on research over the last decade in data provenance, which has defined graph representations that capture the relationships between entities, activities, and people that contribute to a data object as it exists today and into the future. The motivating goal for our work is to increase the reuse of scientific data. But data that cannot be determined to be trustworthy are far less likely to be reused. Data reuse is the use of data collected for one scientific or scholarly purpose by a different researcher, often for a different purpose. The data may be reused to verify the original result, be combined with another data set to answer similar questions, or answer a different research question—for instance, when weather data are repurposed for crop forecasting or water conservation. We motivate the chapter with a use case from sustainability science—the study of dynamic interactions between nature and society. The science of sustainability requires the integration of social science, natural science, and environmental data at multiple spatial and temporal scales: data rich in local and location-specific observations; referencing for regional, national, and global comparability as well as scale; and integration to enable end users to detect interactions among multiple phenomena. Our perspective is drawn from a five-year project funded by the National Science Foundation called Sustainable Environments Actionable Data (SEAD).
SEAD is responsive to the expressed needs of sustainability science researchers for the long-term management of heterogeneous data by developing new software tools that help a researcher to

•  cull the right subset of data to publish from among the thousands of files that are created and used during the course of a research investigation
•  reduce the manual metadata markup burden by engaging a data curation specialist early in the research process, and creating easy-to-use tools that automatically extract metadata from the subset of data to publish
•  publish to a repository of choice1

Sustainability science is the “long tail” of social and environmental science: data collections are often the property of a researcher’s lab, and data sets of local, regional, or topical significance are of limited value until they can be integrated and referenced geospatially and temporally, combined with related data and observations, and modeled consistently. While most of the individual long-tail data sets are small compared to the commonly discussed big data sets, their changeability and heterogeneity pose the same challenges as other “bigger” data. From a process-oriented perspective on big data (Ekbia et al. 2015), sustainability science research data are equally challenging to move while maintaining integrity and trustworthiness, and therefore require new tools and techniques in processing and preservation.



Research Compendiums: The Case of Sustainability

A researcher who studies complex physical, environmental, or social phenomena will utilize data from multiple sources, because complex phenomena frequently lack a single source of data from which the researcher can glean all the needed information. It is not uncommon for a single environmental study to use physical samples, model runs, observational data, data created within the study, and data used from other sources—all of which contribute to a full understanding of the natural, physical, and social phenomena under investigation. From this pile of data that is analyzed, combined, and synthesized during the regular course of research, a researcher must draw out and select the images, model results, metadata, and data files that best support a research result appearing in a published paper, or best form a cohesive data set to share more broadly. We call the aggregated data (data pile) that are pulled together for the purposes of sharing with a wider audience a Research Object (RO), and the act of making it available, the publishing of the research object. The RO is first conceived at the point in the research life cycle when a researcher is ready to share their data. At that time, they make decisions about what data—from among all the data sources they used, consulted, and created—they should package to make public. Should the researcher publish the data needed to reproduce the images in the publication as required by the journal publisher, or would their professional impact be greater if they were to publish all the data needed to rerun the simulations? What are the criteria by which they approach such decisions?
These questions are at the heart of the matter as the researcher decides whether it is sufficient for them to publish enough data to support the narrow conclusions of a particular journal article, or whether they should follow a broader mandate to make the results of science more widely accessible and usable (e.g., provide the data needed so that future researchers can replicate and expand on their results). There are sizable implications in the researcher’s answers to these questions. Among them, and by no means the most important, is one of size: data sufficient to reproduce the figures in a journal article are a small fraction of the size of the data needed to verify the conclusions of a study. Victoria Stodden (2014, para. 6) further contextualizes the issue of size by emphasizing the role of software: When computers are involved in the research process, scientific publication must shift from a scientific article to the triple of scientific paper, and the software and data from which the findings were generated. This triple has been referred to as a “research compendia” and its aim is to transmit research findings that others in the field will be able to reproduce by running the software on the data.

In this chapter, we focus on data curation and provenance in the context of the publishing and reuse of data as central to accountability and transparency in science, omitting a discussion of the software needed to reproduce results, as the issues are somewhat separate. Illustrated by means of a detailed use case from sustainability science, we identify a formalism, and show how the formalism, simple as it is, when put into practice, can advance data trustworthiness. While the use case explores publishing data in support of a single journal article, our formalism applies to all data published for consumption outside the research lab in which they were generated.

Use Case: The Flooding of the Mississippi

In May 2011, the Mississippi River, the chief river of the largest drainage system in North America, was at historic flood levels due to record rainfall and snowmelt, risking great damage, injury, and loss of life and property as the rushing river gained momentum from large tributaries south of the Wisconsin and Minnesota border. Following the disastrous flood of 1927, federal legislation authorized several flood-control structures. One of these is the Birds Point–New Madrid (BPNM) Floodway, a leveed agricultural floodplain at the confluence of the Mississippi and Ohio rivers near the city of Cairo, Illinois. The policy allows the 130,000-acre floodway to be intentionally flooded during extreme events through a series of levee breaches. In May 2011, the US Army Corps of Engineers used blasting agents to create several artificial breaches in the BPNM levee. The impact was dramatic: parts of the agricultural floodplain were inundated for over a month. Large floods and emergency responses to them, such as the intentional breach of the BPNM Floodway, create highly variable spatial patterns of water flow and floodplain erosion and deposition. These localized areas, or “hot spots,” of change expose underlying landscape vulnerabilities. This unique event is an opportunity for researchers to study and assess the causes of change and vulnerability in a way that can guide future actions. To examine floodplain vulnerability, a group of researchers obtained high-resolution elevation map data (LiDAR data) for this region from 2005 (before) and 2011 (after), the most recent pre- and postflood data.
To analyze the change, they created a model showing the change in elevation by subtracting the 2005 landscape elevations from the 2011 ones. The researchers used a 2-D hydraulic model, HydroSED 2-D, validated with sensor data from the US Geological Survey, to simulate the flow of water through the floodway (Goodwell et al. 2014). They identified woody vegetation, and obtained soil and vegetation properties from the Natural Resources Conservation Service. They combined these data sets to identify those regions most vulnerable to erosion and deposition, and compared them with observed landscape impacts as seen from the LiDAR. The study required working with images, numerical sensor data, model results, and computed values. An article describing the study was published in Environmental Science & Technology (Goodwell et al. 2014). The first two authors of this chapter assisted the first author of that publication (and also the third author of this chapter) in publishing their data in support of the study. The second author participated as the digital curation specialist in the effort. The researchers made the decision to publish ten georeferenced image files (GeoTIFF files) as the publishable data object. The selected set supports the reproducibility of all the images in the publication. The sustainability researchers decided to limit their data publication to those ten files because some of the data they used were proprietary and they did not have permission to share them, or the data were already publicly available (such as the soil data from the Natural Resources Conservation Service or the US Geological Survey sensor data). The size of the resulting data set was another concern that limited the data-sharing options. The files selected for publishing resulted in a 3.74-gigabyte bundle, henceforth called the “BPNM object”; the files ranged from less than a megabyte to 2 gigabytes per file. Specifically, half of the ten files were from the 2-D hydrologic model run under different conditions for points in the floodplain. These files varied from 1.47 megabytes at the smallest to 6.59 megabytes at the largest. A map file giving the soil’s erodibility and its loss tolerance over a spatial extent was 7.34 megabytes in size. Airborne imaging data to detect woody vegetation on the ground was 919 kilobytes in size. The comparison data of 2005 to 2011 land elevation at 1.5, 3, and 10 meters ranged from 72 megabytes to 2.84 gigabytes in size. The entire publishable research object was a sizable 3.74 gigabytes. As the data were intended to be deposited, using the SEAD framework, into the repository at the University of Illinois at Urbana-Champaign, the researchers and curator realized that the repository, which was configured to accept scholarly papers rather than data sets, would not be able to handle so large an object. The repository software required a special manual reconfiguration to handle an object of that size. Had the authors chosen to publish the raw data in addition to the GeoTIFF files needed to reproduce the images in the publication, the publishable data bundle would be on the order of a hundred times larger than 3.74 gigabytes.
The RO and Its Role in Data Reuse

Technology can contribute to the strengthened trustworthiness of a complex bundle of data through data provenance. To show how, we focus on the window in the life of research data that begins when a researcher is ready to package and distribute research data more broadly (e.g., make it public), and culminates in access by an unrelated party for scientific reuse. We call this the “publish-reuse life cycle” window. This window is crucial because it is during this time that the most that can ever be known about the data bundle is available: the metadata are highly ephemeral (Gray 2009) and, if not captured early in their life, are lost forever. The publish-reuse life cycle window is thus a critical time in which to harvest ephemeral information (metadata) about data that are about to be published. Our goal is to introduce into this critical time window a new, simple, technology-oriented provenance (or lineage) formalism to take advantage of the metadata harvesting opportunity. Several important questions have to be answered when data first enter the publish-reuse life cycle window. The researcher has to decide which data to cull from the larger data pile for publishing, and how much of a curator’s help they want with metadata and consistency checks. The designers of the software system that supports data publishing have to convince themselves and their community of users of the acceptability, limits, and mechanisms of versioning as well as the limits of trust and trustworthiness that can be had by a technical solution.

To address these questions in a consistent and formal manner, we conceptualize the publishable data products of research as bundles of heterogeneous but coherent content—a self-contained unit of knowledge. Drawing on our own earlier work (Plale et al. 2011), and heavily inspired by the work of David De Roure and Carole Goble (Bechhofer et al. 2011; De Roure, Goble, and Stevens 2009), we refer to a bundle of publishable products of research as the RO, as mentioned earlier. The RO is an aggregation of resources that can be transferred, produced, and consumed by common services across organizational boundaries. It encapsulates digital knowledge, and is a vehicle for sharing and discovering reusable research. ROs contain data objects (files), references to data objects, collections of data objects, metadata, published papers, and so on. Building and expanding on the workflow- and impact-centric approaches to research objects (see, for example, Hettne et al. 2014; Piwowar 2013), we conceptualize the RO as having five components, as shown in figure 8.1: a unique ID, a persistent identifier that is never reassigned; agents, information about the people who have touched the object in important ways (e.g., creator, curator, and so on); states, which describe ROs in time and are discussed in more detail below; and relationships, which capture links between entities within the object, such as data sets, presentations, or images, and to other ROs. Finally, there is the content—the data and related documents.

[Figure 8.1: An RO as a bundle of digital content that uses common standards and services to transfer and consume it. The diagram depicts the research object with its five components: a unique ID; content (files, bitstreams, pointers, annotative information); relationships (aggregates, related to, describes, derived from, versioned from); agents (data creator, curator, data reuse scientist); and states (live, curated, published).]

While we do not specifically include software that produced the data results, including it would be a minor extension under our model of RO as the “research compendia” (Stodden 2014). Examples of an RO are many: the supporting materials and results that are described in a single published paper; a written dissertation and its data; a newly created data set that contains raw observations and annotations that explain them; survey data that have been exported, coded, and aggregated into charts and tables; and a visualization that is based on the data created by others. Each of these RO examples has common and unique actions performed by human and software agents that comprise its “behavior.” The RO concept provides a general organization to research product contents and metadata so that it can be used by common standards and services of all kinds. Specifically, there is a structure into which all needed contents of an RO can be stored: files, IDs, documentation, and so on. The organizing data model of the RO can be served by such protocols as the Open Archives Initiative Object Reuse and Exchange resource-mapping protocol (Tarrant et al. 2009) or the Dataset Description model (W3C Consortium 2014); a description of the parts of the RO, with example data taken from our BPNM use case, is shown in table 8.1.

Table 8.1
Components of the RO, with examples from the BPNM RO

•  Unique ID
•  Agents: names, affiliation, and contact information of data creators and curator
•  States: LO, CO, PO, and dates of creation and transformation
•  Content: Original LiDAR model; Corrected LiDAR model 1; Corrected LiDAR model 2; AVIRIS data; Hydro-simulation overall; Hydro-simulation, location 1, condition 1; Hydro-simulation, location 2, condition 1; Hydro-simulation, location 1, condition 2; Hydro-simulation, location 2, condition 2; Soil survey/computed data; Descriptive/annotative information (abstract, spatial, and temporal metadata)
•  Relationships (internal to RO): Published to; HasKeywords; HasSources (NASA, US Geological Survey); IsReferencedBy DOI 10.1021/es404760t
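The five-component RO model summarized in table 8.1 can be sketched as a small data structure. The following Python is an illustration only; the class name, field names, and identifier are our own invention, not part of the SEAD software:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchObject:
    """Sketch of the five-component RO model (names are illustrative)."""
    unique_id: str                                     # persistent ID, never reassigned
    agents: dict = field(default_factory=dict)         # creator, curator, reuse scientist
    state: str = "Live"                                # "Live", "Curated", or "Published"
    relationships: dict = field(default_factory=dict)  # links to other ROs
    content: list = field(default_factory=list)        # files, pointers, annotations

# The BPNM object of table 8.1, abbreviated (the unique_id is a placeholder)
bpnm = ResearchObject(
    unique_id="example:bpnm-ro",
    agents={"creator": "data creator", "curator": "digital curator"},
    state="Published",
    relationships={"IsReferencedBy": "doi:10.1021/es404760t"},
    content=["Original LiDAR model", "AVIRIS data", "Soil survey/computed data"],
)
```

Any concrete serialization (e.g., via OAI-ORE resource maps) would carry the same five parts; the structure, not the syntax, is what the model fixes.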



Trust Threads Model

We define a simple model for the behavior of a research object (data bundle) as it passes through the publish-reuse life cycle window; the model captures relationships between research objects as they are derived from one another, replicated, and so forth. The model is the basis on which software is written that implements the model in a controlled and predictable manner. The model has two parts: states, which define the condition of a data bundle as it passes through the publish-reuse window; and relationships, which capture the relationship between two ROs. Figure 8.2 provides an overview of how an RO transitions through the publish-reuse window, and once published, how it relates to its derivatives. The states and relationships are drawn from two enumerated sets as follows, where relationships are a subset of the properties defined in PROV-O:

•  States = {Live Object (LO), Curation Object (CO), Publishable Object (PO)}
•  Core Relationships = {wasDerivedFrom, wasRevisionOf, alternateOf}
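As a minimal sketch of these enumerated sets, the states and the legal transitions between them might look like the following Python; the function and dictionary names are our own, and wasCulledFrom and wasPublishedFrom are the transition relationships described in the text rather than members of the core set:

```python
from enum import Enum

class State(Enum):
    LO = "Live Object"
    CO = "Curation Object"
    PO = "Publishable Object"

# Relationships between two published objects (a subset of PROV-O properties)
CORE_RELATIONSHIPS = {"wasDerivedFrom", "wasRevisionOf", "alternateOf"}

# Legal state transitions in the publish-reuse window, and the relationship
# each transition records between the new object and its predecessor
TRANSITIONS = {
    (State.LO, State.CO): "wasCulledFrom",
    (State.CO, State.PO): "wasPublishedFrom",
}

def advance(current, target):
    """Return the relationship recorded by a legal transition, or raise."""
    if (current, target) not in TRANSITIONS:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return TRANSITIONS[(current, target)]

# Culling the BPNM files from the shared project space moves LO -> CO
relation = advance(State.LO, State.CO)
```

Encoding the transitions as an explicit table is one way software can behave "in a controlled and predictable manner": any action outside the table is rejected rather than silently applied.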

The trust threads model over time will result in a network of links between research objects, and between an RO and itself over time, creating a genealogy network for published scientific data. The trust threads model does not apply to the relationships an RO may contain within it, such as a file and its metadata, or files belonging to a collection.

[Figure 8.2: Behavior diagram for the publish-reuse life cycle. An LO in “the wild west” is related by wasCulledFrom to a CO in “the boundary waters,” which in turn leads to a published object in “the control zone,” where alternateOf and other relationships link published objects; derivation results in a copy of the object deposited in a researcher’s environment.]



An RO passing through the publish-reuse life cycle window exists in one of three states: as an LO, CO, or PO. Data management for large research teams is rarely a priority, so we say the data of a highly active team exist in a wild west, where organization is loose. In our example of the Mississippi flood data, researchers work in a single shared SEAD project space. The project space contains dozens of folders, each of which is dedicated to a particular subset of data, such as the raw LiDAR data for 2005 and 2011, images of the floodplain, spectrometer raw and corrected data, related publications, and so on. This project space constitutes the LO. A researcher culls from the larger set of data the objects they want to publish more broadly. They will prune and organize the material to publish into a new directory or set of directories, or mark specific files. From the Mississippi flood project space, the researcher culled a ten-file collection (the BPNM object). This culled content is the CO, an object related to its LO by the “wasCulledFrom” relationship. Where the LO is a wild west of loosely controlled activity, the CO exists in a more controlled setting, a relative “boundary waters” between the wild west of active research and the controlled setting where an object is polished for publishing. In the CO state, changes to the contents of the object are still frequent, but one or two researchers alone assume the tasks of selecting, pruning, describing, and reorganizing the object, and frequent untracked changes by others become unwelcome. Additionally, during the culling process, the researcher will engage a digital curator to examine the structure, makeup, and metadata of the research object. The digital curator, in consultation with the researcher, will enhance the content to make it more useful.
Once the researcher and digital curator agree that the content and descriptions of the research product are ready, the researcher signals intent to publish, whereupon the RO moves from its state as a CO to a new state as a PO. The PO is related to the CO by a “wasPublishedFrom” relation. The PO exists in a “control zone,” and in this state, all actions on a PO are carefully tracked. That is, all actions on a PO result in a new instance of a PO. The relationship established between the PO that is acted on and the newer PO captures the type of action that occurred. All actions on a PO are carefully tracked, as they form the past and future lineage of a family of research objects. The types of changes to a PO are several: the relationship “alternateOf” exists when a duplicate of the PO is created. The relationship “wasRevisionOf” exists between two ROs when an RO undergoes a revision that does not change the researcher’s intent for the object. That is, when the RO is determined to be incomplete or incorrect in some way—for instance, through revisions of metadata, or corrections of errors and omissions—a new version is created that is related to the earlier version through wasRevisionOf. For example, if the authors in the Mississippi flooding case wish to replace one hydrology model run with a newer run because the existing GeoTIFF file contained errors, the change is considered to be a revision. If the researcher instead decides to publish raw hydrology model results in addition to the already-published GeoTIFF files, this is too substantial a change to the existing PO, so it constitutes a new published object.
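The rule that every action on a PO yields a new PO linked to its predecessor can be sketched as follows. This is a hypothetical illustration in Python; the identifiers, file names, and the publish function are ours, not part of the SEAD implementation:

```python
import itertools

_ids = itertools.count(1)

def publish(contents, predecessor=None, relationship=None):
    """Create a new published object; POs are never modified in place."""
    po = {"id": f"po-{next(_ids)}", "contents": tuple(contents), "links": {}}
    if predecessor is not None:
        # Record which kind of action produced this PO from its predecessor
        po["links"][relationship] = predecessor["id"]
    return po

# The original BPNM-style publication (file names invented)
po1 = publish(["geotiff-model-run", "geotiff-elevation-change"])

# Replacing an erroneous model run preserves the researcher's intent: a revision
po2 = publish(["geotiff-model-run-corrected", "geotiff-elevation-change"],
              predecessor=po1, relationship="wasRevisionOf")

# A duplicate deposited elsewhere is an alternate of the object it copies
po3 = publish(po2["contents"], predecessor=po2, relationship="alternateOf")
```

Because each action mints a fresh identifier and a link, the sequence of published objects itself becomes the lineage record, with no in-place edits to hide.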



The relationship “wasDerivedFrom” is established between two POs if the latter contains a portion of the former either directly or by reference. Suppose a biologist is carrying out an entirely different study of the postflood regrowth of biomass in the Lower Mississippi River floodplain after the 2011 flood. The biologist would locate the existing BPNM PO, and find that it contains fifty-by-fifty-meter resolution spectral images. Such images can be used in measuring trends in the abundance of vegetation and its components. The researcher downloads the images, combines them with field data, and identifies trends in postflood biomass development in the area. This new PO published by the biologist will have a wasDerivedFrom relationship to the BPNM RO. Reuse comes about when a researcher searches for and finds a published PO in a repository, and pulls a copy of it into their own research environment, where it then becomes part of another researcher’s LO, or wild west space.

Trust Threads and Trustworthiness

The trust threads model is a simple one, enhancing the trustworthiness of bundled and published data. It guarantees that a data bundle will carry with it useful data provenance, and that the provenance record will not change or be modified except under controlled circumstances. Trust threads focus on the capture and representation of the data provenance of data product bundles (the ROs) as they exist in the critical publish-reuse life cycle window—a window in time that begins with the conceptualization of what goes into the research object, follows through to its curation and publication into a repository, and ends at the object’s subsequent use by researchers outside the originating scientific domain in which the data object was created. This time window is particularly crucial because it is in this window that important connections between research objects are made that, if not captured, are likely lost forever.
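One way a provenance record could be kept from changing outside controlled circumstances is to seal each record with a content hash, so that the genealogy network can be verified as it is traced. The sketch below is our own illustration of that idea, with invented identifiers; it is not the SEAD implementation:

```python
import hashlib
import json

def seal(record):
    """Attach a digest so later tampering with the record is detectable."""
    payload = json.dumps(record, sort_keys=True).encode()
    return {"record": record, "sha256": hashlib.sha256(payload).hexdigest()}

def verify(sealed):
    """Recompute the digest and compare it with the stored one."""
    payload = json.dumps(sealed["record"], sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest() == sealed["sha256"]

# A two-object genealogy: a biomass study derived from the BPNM object
registry = {
    "ro-bpnm": seal({"id": "ro-bpnm", "wasDerivedFrom": None}),
    "ro-biomass": seal({"id": "ro-biomass", "wasDerivedFrom": "ro-bpnm"}),
}

def trace(ro_id):
    """Walk wasDerivedFrom links back to the root, verifying every record."""
    chain = []
    while ro_id is not None:
        sealed = registry[ro_id]
        if not verify(sealed):
            raise ValueError(f"provenance record for {ro_id} was altered")
        chain.append(ro_id)
        ro_id = sealed["record"]["wasDerivedFrom"]
    return chain
```

Tracing the derived object back to its parent with trace("ro-biomass") walks the chain through ro-bpnm; any edit to a sealed record afterward makes verify return False, so an altered lineage cannot pass unnoticed.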
Trust threads are implemented as small bits of information contained in a virtual “suitcase.” The suitcase identifies the RO to which it belongs, and the lineage of other ROs from which it was derived, revised, and so on. The suitcase is locked so that its contents can be trusted. Its contents are changed only by authoritative sources—which could include other trusted repositories, for instance. It is the reuse of data outside its originating scientific discipline that motivates the need for trust threads. Data reuse outside a scientific discipline is particularly complicated because the competence, reputation, and authority of data creators outside the field are more difficult for researchers to evaluate. Yet the increasing availability of data, along with the growing complexity of the challenges facing society and the environment, will require that researchers be able to gauge data set trustworthiness before use. According to Devan Ray Donaldson and Paul Conway (2015), the determination of the trustworthiness of an entity has a strong user perception component. This user perception has four properties: the accuracy of the information; the objectivity of the content; validity, which includes the use of accepted practices and the verifiability of data; and stability, defined as the persistence of information. Trust threads enable developers to implement the technical “wiring” needed to capture provenance information and deposit the provenance into a locked suitcase so that the data’s accuracy, validity, and stability can be more easily determined. Thus, trust threads contribute to three of the four properties identified by Donaldson and Conway. For instance, if a derived subset of a published data bundle lacks objectivity, a researcher can examine the suitcase contents to trace it back to its parent. The accuracy of the information is related to its perceived quality. A framework of properties for data quality (Wang and Strong 1996) includes completeness, accuracy, relevancy, reliability, accessibility, and interpretability as user-perceived properties of data quality. With the recent explosive growth in the amount and variety of research data, and inexpensive access to large-scale computing resources, science is on the cusp of new discoveries in areas that require interdisciplinary teams and data that have to be repurposed. Trust threads can accelerate the sharing of research data across research disciplines (data reuse) through a small technological formalism that, when adopted broadly, can advance the trustworthiness and hence reuse of published data.

Acknowledgments

We thank Praveen Kumar at the University of Illinois for allowing us to study his research and publishing process. We thank Margaret Hedstrom, Sandy Payette, and Jim Myers, all at the University of Michigan, for stimulating discussions about data sharing and preservation in the SEAD project. SEAD is funded by the National Science Foundation under award 0940824.

9  Can We Anticipate Some Unintended Consequences of Big Data?

Kent R. Anderson

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled. —Richard P. Feynman

The opportunity to capture and curate large data sets on a reliable basis has created strong momentum around concepts generally summarized under the rubric “big data.” As with any broad-based shift in approach and philosophies, there are risks of unintended consequences. But can we anticipate what these might be? After all, human nature is to some extent predictable, and we have already seen some of the consequences of big data approaches, not all of which have been thought about at scale or in depth. These issues may only enlarge or transmogrify when moved onto a bigger stage, yet they retain their essential elements. This chapter is an attempt to think ahead and anticipate some of the unintended consequences of big data from a number of perspectives—consequences that may be determined more by human bias and motivations than by data presentations; by the tendentious embrace of big data trends while ignoring the power of small or singular data; by emphasizing quantity over quality; by substituting data for reality; and by the fundamental error of forgetting that data don’t interpret themselves.

Data Reanalysis and Bias

In a 2014 appearance on the Colbert Report, Leon Wieseltier, former editor of the New Republic, was challenged by the host to state his critique of modern culture in ten words or less. His extemporaneous answer is impressive in its wisdom and brevity, and says many things about some pitfalls that may surround big data: “Too much digital, not enough critical thinking, more physical reality” (Butnick 2014). You may notice that he met the ten-word challenge precisely. Just a few months earlier, the BioMed Central journal Translational Neurodegeneration published a paper by engineer-turned-biologist Brian Hooker (2014), in which Hooker reanalyzed 2004 Centers for Disease Control and Prevention (CDC) data, and found a purported link between measles-mumps-rubella vaccines and autism diagnoses. His reanalysis showed no link, except in a subgroup of African American children. But there was an important flaw in his reanalysis: the original study was a case-control study, which he reanalyzed as a cohort analysis.1 With this simple error, his statistical analysis was plainly inappropriate. Yet this was just the beginning of the flaws in Hooker’s reanalysis. The story that ultimately emerged involved a strange and sensationalistic video of a supposed confession of data suppression (by a “senior CDC researcher” with no trace of existence on Google, among other factual problems), an admission from Hooker that he was not an unbiased participant due to his own son’s autism, a panicked takedown of the paper by the publisher shortly after publication, and the reappearance of Andrew Wakefield (et al. 1998)—the disgraced British physician whose retracted paper started the vaccine-autism link myth—as narrator of the purported video confession. Ultimately, the Hooker paper in Translational Neurodegeneration was retracted, with this statement: “This article has been removed from the public domain because of serious concerns about the validity of its conclusions. The journal and publisher believe that its continued availability may not be in the public interest. Definitive editorial action will be pending further investigation.”

“Data reanalysis” sounds benign and straightforward, but it requires profound and perhaps-worrisome motivation on the part of the individual or individuals undertaking the work. As one astute blogger wrote about this instance in particular and reanalyses in general,

There are a couple of things you have to remember whenever looking at a study that is billed as a “reanalysis” of an existing data set that’s already been published.
The first is that no one—I mean no one—“reanalyzes” such a dataset unless he has an ax to grind and disagrees with the results of the original analysis so strongly that he is willing to go through the trouble of getting institutional review board (IRB) approval, as Hooker did from Simpson University, going to the CDC to get this dataset, and then analyzing it. (Gorski 2014)
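To see why the design mismatch matters, consider a toy illustration (the numbers here are hypothetical, not the CDC data): in a case-control study, the ratio of cases to controls is fixed by the researcher's sampling decisions, so cohort-style risk ratios computed from the table reflect the sampling fraction rather than population risk, while the odds ratio does not.

```python
# Toy illustration (hypothetical numbers): analyzing case-control data
# as if it came from a cohort. In this source population, exposure
# genuinely doubles the risk of disease.
population = {
    "exposed":   {"cases": 200, "controls": 9800},   # risk = 2%
    "unexposed": {"cases": 100, "controls": 9900},   # risk = 1%
}

# A case-control study enrolls every case but samples only 1% of controls.
study = {arm: {"cases": v["cases"], "controls": v["controls"] // 100}
         for arm, v in population.items()}

def risk_ratio(table):
    # Cohort-style analysis: treats row totals as population denominators.
    # Invalid under case-control sampling, where those totals are
    # an artifact of the study design.
    risk = {arm: v["cases"] / (v["cases"] + v["controls"])
            for arm, v in table.items()}
    return risk["exposed"] / risk["unexposed"]

def odds_ratio(table):
    # Case-control analysis: unaffected by the control sampling fraction.
    odds = {arm: v["cases"] / v["controls"] for arm, v in table.items()}
    return odds["exposed"] / odds["unexposed"]

print(round(risk_ratio(population), 2))  # true risk ratio: 2.0
print(round(risk_ratio(study), 2))       # cohort math on sampled data: ~1.34 (wrong)
print(round(odds_ratio(study), 2))       # odds ratio survives the sampling: ~2.02
```

Hooker's error was of this kind: applying cohort-style statistics to a table whose denominators were fixed by the study design rather than by the population.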

This example of reanalysis shows some of the serious risks associated with an imprudent embrace of big data initiatives: the ability of a motivated individual or activist group to create “findings” that suit their worldviews, and then publish these discoveries to compete with actual firsthand science; the difficulty of teasing out post hoc statistical manipulations; the lack of incentives to monitor these reanalyses, and the burden they may create for editors, publishers, and scientists; and the potential for commercial, reputational, or political gains via imperfect yet confounding reanalyses. An editorial in JAMA: The Journal of the American Medical Association addresses these and other concerns eloquently, stating,

The value of reanalysis … hinges critically on reducing the presumed threats to equipoise that come from financial, ideological, or political interest in the results. … [I]f the presumption of bias is higher in the reanalysis team, data sharing will likely impede, not improve, scientific understanding. (Christakis and Zimmerman 2013)

Can We Anticipate Some Unintended Consequences of Big Data? 


Another study from JAMA looked at reanalyses of randomized studies published in medical journals. These differ dramatically from replication studies, which attempt to duplicate the study design and population to see if the same result occurs. Instead, the reanalyses relied on the same data from the original studies, reinterpreting them through reanalysis. Shanil Ebrahim and his colleagues (2014) found that 35 percent of the medical studies receiving independent reanalysis had their conclusions changed with respect to the treatment group, the treatment options, or both. More worrisome, they found that the majority of the reanalyzed studies initially had null findings, while the reanalyses generally favored more intervention. Of the thirteen reanalyzed studies that led to changes in interpretation, nine (69 percent) suggested that more patients should be treated, while only one (8 percent) indicated that fewer patients should be treated. In many cases, newer drugs benefited from the reanalyses. One reanalysis suggested that younger patients should be treated, which would result in more treatment years for a costly intervention. This set of reanalyses expanded treatment groups along age or indication axes. The financial upside of these reanalyses could be significant. In addition, the possibility of generating new, positive results from reanalysis brings us back to the long-standing concern that scientists seek positive results because they are more publishable. As Ebrahim and his colleagues demonstrated, many null results are being transformed into positive results via reanalysis. Investors could also benefit, because the stakes in medical research are high. An in-depth story in the New Yorker (Keefe 2014) revealed that one investment firm was able to use inside information about a potential Alzheimer’s drug to generate a $275 million investment gain by selling short the drugmaker’s stock.
The investment firm had held millions in stock just weeks before a major study of the drug was announced, but quietly reversed its market position after one of its investment managers, who had befriended the person set to announce the trial results, learned that the results were underwhelming and determined that the firm should bet against its previous position. These kinds of incentives could lead to many reanalyses, especially when investors aren’t able to anticipate negative results. With hundreds of millions at stake, why not hire some statisticians to see if they can eke out a positive interventional effect? At the very least, their reanalysis could confuse the markets long enough to reverse an investment loss. The New Yorker story also revealed that some medical researchers are susceptible to the charms of temporary business friends. In this case, the senior researcher thought that an investment manager shared his passion for Alzheimer’s research. Over many months, the families became friendly, and the researcher held dozens of private consultations with the investment manager. The researcher later admitted to being unable to detect exactly when he’d crossed the line. Once the drug’s poor results were announced, the investment manager stopped e-mailing the senior academic and stopped replying to his messages.



This underscores the difficulty of finding “unbiased” individuals to reanalyze study results. Bias is subjective and can be managed by those who hold it—that is, skilled investment managers, reporters, company officials, or academic leaders can wend their way into a position of influence or awareness without triggering alarms. They can even affect the thinking of scientists without the scientists being sufficiently aware of it, through simple and innocuous-seeming persuasion. Thinking is malleable. Permissions are subtle. Some people are gifted at modifying thinking and permissions without leaving a trail. Aside from the potential for conscious or unconscious bias, questions around the data used in reanalyses are not trivial. What constitutes a complete data set, and how is patient confidentiality preserved (confidentiality that was not established with a reanalysis group in mind)? Are the data the only aspect that requires reanalysis, or do the statistical outputs and methods also require it? Who provides those? And who reanalyzes the reanalysis—or is a reanalysis the last word? One example of data reanalysis done well comes from the BICEP2 findings, which suggested that researchers had found the gravitational-wave signature of cosmic inflation just after the big bang. Concerns were immediately raised about estimates of cosmic dust, and whether misestimating these levels could have led to spurious conclusions. Researchers from the United States and Europe joined forces, shared data, and issued a joint reanalysis within months, correcting the findings and reinterpreting the conclusions (Commissariat 2014). Ultimately, the questions around data reanalysis are complex and not easily answered:

•  What are the authorship criteria for reanalyses? What kinds of disclosures must be made? How much validity are reanalyses given?
•  What approvals need to precede a reanalysis of medical and patient data?
•  How do you peer review a reanalysis? Do you have to rereview the original study as well?

There is a seductive thread in the midst of all this: the idea that we can reprocess reality until we get an answer we like. It’s reminiscent of Chris Anderson’s (2008) “end of theory” musings. Pulling on this thread without theory or boundaries raises the question of when it stops. Do you reanalyze the reanalysis? How many “re’s” are allowable? More important, if reanalysis is embraced as another form of publication, what opportunity costs will this introduce as more scientists, statisticians, and analysts turn their attention away from new studies and toward reanalyses? As the JAMA editorial cited above puts it,

Because the universe of researchers with the expertise, time, and interest in reanalyzing another researcher’s data can be quite small, there is a worrisome likelihood that those seeking to reanalyze data either have a vested interest in refuting the findings or do not have adequate methodological expertise. Because methodology drives the quality of results, this is not an idle concern. Anyone sufficiently motivated can produce a different and conflicting result based on data that once demonstrated a given outcome. A misspecified model, a reduction in power, inappropriate controls; the pitfalls are many and one investigator’s pitfall is another’s opportunity. (Christakis and Zimmerman 2013)



What really needs reanalysis is the notion that reanalyzing packaged trial data is a simple matter of having access to the data and nothing more (for a useful discussion of many of these issues, see Plale et al., this volume). There are significant issues to understand and sort out before we embrace these approaches fully. One main question is whether reanalysis stacks up well against simply restudying the hypothesis. To return to the problems diagnosed by the New Republic’s former literary editor: will science come to exhibit the same deficits as our broader culture—“too much digital, not enough critical thinking, more physical reality”?

Ignoring the Power of Small Data, Undervaluing the Power of Data Curation

Critical thinking may become all the more important as data become more available, multidimensional, and nuanced. Or perhaps there’s another approach entirely aside from thoughtlessly pursuing big data. There is still value in small data, after all (see also Bailey, this volume). During the 2012 presidential election in the United States, Nate Silver and his FiveThirtyEight blog became renowned for making accurate predictions of the outcomes. One supporter, Dan Lyons (2012), sought to make the case that the accuracy of Silver’s statistical analyses of the US presidential and senatorial races was a sign of how big data would end the era of mystical predictions, writing, “This is about the triumph of machines and software over gut instinct. The age of voodoo is over. The era of talking about something as a ‘dark art’ is done. In a world with big computers and big data, there are no dark arts.” Lyons went on to conflate what Silver did with other so-called big data triumphs, like when IBM’s Deep Blue defeated chess grandmaster and world champion Garry Kasparov. Unfortunately, neither Silver’s election projections nor Deep Blue’s chess win depended on big data.
As Silver (2012) details, a software bug might have been more important than any data set in upsetting Kasparov. Against a mere database of possible chess moves, Kasparov probably would have fought Deep Blue to at least a draw. Yet there was a software bug in how Deep Blue processed its data, and this bug made the computer’s play inexplicable. Kasparov was beaten by the illogical application of big data. Like elections, chess may not need big data to be solved. Both chess and elections are interesting games because they are bounded—they have finite data. To process limited data quickly enough to be impressive and useful to a news cycle, you need fast processors. Deep Blue could play chess at an acceptable speed because processors were fast enough to handle a bounded data set. These same factors—limited data and fast processing speeds—also make it feasible now for statisticians to project elections using recursive techniques on small data sets within the twenty-four-hour news cycle. As you’d expect once these factors are taken into account, Silver’s approach didn’t use big data but rather a relatively small, carefully curated data set consisting of a set of polls and



a lot of discipline. The factors that Silver describes having to manage while assembling, curating, and analyzing the data included:

•  Recency: more recent polls are weighted more heavily
•  Sample size: polls with larger samples receive more weight
•  Pollster rating: pollsters committed to disclosure and transparency standards receive more weight

Silver then adjusted the results from the set based on a few key factors:

•  Trend line adjustment: if old polls haven’t been replaced, they are adjusted to reflect the overall trend line
•  House effects: some polls tilt right and some tilt left, and this adjustment mitigates those effects
•  Likely voter adjustment: polls of likely voters are given a lot of credence

Of the many steps that Silver employed in analyzing the data, the two most important—publishing his numbers and standing behind them—are the least mathematical, yet they are vital to the integrity of the process. The process of preparing for publication and anticipating criticism helps to ensure better analysis. This may be why most researchers wait to see a published paper before judging the science—and its underlying data—to any serious degree. To underscore the relatively limited size of the possible data set, Silver tracked one presidential race, at most a hundred senatorial ones, and various national and state-level polls. That’s not big enough to qualify as “big data,” which is defined on Wikipedia as data sets that are “so large and complex that it becomes difficult to process using on-hand database management tools” (“Big Data,” n.d., para. 1). Silver himself is skeptical of big data. In his book, Silver (2012, para. 4) wrote,

Our predictions may be more prone to failure in the era of Big Data. As there is an exponential increase in the amount of available information, there is likewise an exponential increase in the number of hypotheses to investigate.
… [T]here isn’t any more truth in the world than there was before the Internet or the printing press. Most of the data is just noise, as most of the universe is filled with empty space.
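The weighting and adjustment factors described above can be sketched in a few lines. This is illustrative only: the exponential decay constant, the square-root sample-size weight, and the field names are assumptions for the sketch, not FiveThirtyEight's actual model.

```python
import math

def weighted_poll_average(polls):
    """Each poll is a dict: value (candidate %), age_days, n (sample size),
    rating (0-1 disclosure/transparency score), and house_effect
    (the pollster's known partisan lean, in percentage points)."""
    num = den = 0.0
    for p in polls:
        recency = math.exp(-p["age_days"] / 14.0)  # newer polls count more
        size = math.sqrt(p["n"])                   # diminishing returns on n
        weight = recency * size * p["rating"]
        adjusted = p["value"] - p["house_effect"]  # remove the known lean
        num += weight * adjusted
        den += weight
    return num / den

polls = [
    {"value": 52.0, "age_days": 0,  "n": 400, "rating": 1.0, "house_effect": 2.0},
    {"value": 48.0, "age_days": 14, "n": 400, "rating": 1.0, "house_effect": 0.0},
]
print(round(weighted_poll_average(polls), 1))
```

Trend-line and likely-voter adjustments would enter the same way, as transformations of `value` before averaging. The point is how much curation and judgment—decay constants, rating scales, lean estimates—sits inside even a "purely data-driven" forecast.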

Small data, careful curation, astute and recursive analysis, public accountability, and the nerve to place bets and stand behind your projections and data—those are the things that make for good and reliable analysis. Other examples of small data sets holding vast power are easy to find. In early 2015, Science published a study in which the authors revealed that they could track an individual in a crowded city using just three data points based on purchases they made (de Montjoye, Radaelli, and Singh 2015). Not only that, but women proved easier to track than men, and rich people proved easier to track than less wealthy people. So rich women who like to shop proved to be the easiest to identify in a crowd. This shouldn’t be too surprising. In 2002,



researchers demonstrated that they could identify the health conditions of the governor of Massachusetts using two public data sets—voting records and de-identified health records—simply by matching three elements that the two data sets shared (postal code, sex, and date of birth) (Sweeney 2002). As fascination with big data grows, it’s possible for us to forget the power within a small amount of the right data.

Not Thinking to Look Out the Window

Laziness and big data have combined for a number of people in an odd way: when they want to know if it’s raining or about to rain, they pull out their smartphones and check their weather apps. This has become so instinctual that it’s not uncommon to see people catch themselves, laugh, and realize that they could instead have gone to the window and looked outside to see if the ground was wet, rain was falling, or clouds were moving in. To take another example, people use the big data of GPS to navigate within well-known cities, but if they’d only look up, they would realize they are heading in the wrong direction. People walk, say, ten blocks out of their way simply because they forget to look around. In 2014, a story in Science dissected problems with one of the more famous examples of an attempt to use big data in place of actual observations: Google Flu Trends, which uses an assessment of undisclosed search terms to approximate flu intensity in particular regions, had substantially overestimated the prevalence of flu for a number of years (Lazer et al. 2014). Google’s lack of transparency about how Flu Trends works was just the tip of the algorithmic iceberg, as many other problems were identified by those studying the issues involved in this particular implementation of big data. These problems include the following.

Big Data Hubris

One problem is the implicit assumption that big data is a substitute for traditional data gathering and analysis.
Big data is, at best, a supplement to data gathered through actual surveillance of actual behavior or characteristics. As proxies, big data solutions have yet to prove they are reliable—and perhaps they always will be proving it—which makes them a supplement, not a replacement.

The Lack of Static Inputs into Big Data Sets

Google’s search algorithm received eighty-six substantial modifications in June and July 2012 alone. The terms used in Flu Trends were also modified repeatedly over the years, but in ways that the users of the data did not know or understand. Finally, users of the Google search engine enter these search terms for a variety of reasons—often, it turns out, searching for cold remedies—further confounding the reliability of the inputs into the data sets.
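A toy simulation (hypothetical numbers) shows the failure mode: a proxy calibrated while search volume tracked illness will overestimate once search behavior shifts, whether from an algorithm change, autocomplete, or a media scare.

```python
# Calibration period: flu-related searches track true prevalence
# with a stable ratio of 100 searches per case (hypothetical units).
train_flu = [1.0, 2.0, 3.0, 4.0]
train_searches = [100.0, 200.0, 300.0, 400.0]

# Least-squares scale factor through the origin.
k = (sum(s * f for s, f in zip(train_searches, train_flu))
     / sum(s * s for s in train_searches))

# Later: an interface tweak boosts flu-related searching by 50%
# with no change whatsoever in actual illness.
true_flu = [2.0, 3.0]
searches = [f * 100.0 * 1.5 for f in true_flu]
estimates = [k * s for s in searches]

print(estimates)  # each estimate runs 50% above the true value
```

The drift is invisible to the consumer of the proxy: the calibration constant looks objective, but it encodes a behavioral regime that the platform itself keeps changing.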



Data Sets Not Purposefully Built for Epidemiology

In Google’s case, the search engine is built to support speed of response and advertising revenues. Those are its primary goals. As David Lazer and his colleagues (2014) put it, Google Flu Trends “bakes in an assumption that relative search volume for certain terms is statically relevant to external events, but search behavior is not just exogenously determined, it is also endogenously cultivated by the service provider.” In other words, Google has its thumb on the scale in order to support its primary business goals. You can see this thumb at work in autocomplete, which guides users toward search terms that others have already made popular.

“In the Wild” Systems Are Open to Abuse and Manipulation

Google and other publicly available online services are subject to outside pressures, from “Astroturf” marketing and political campaigns to filtering by oppressive governments to denial-of-service attacks. These make the data less reliable. And when data are known to be produced by systems that carry broad societal and commercial weight, the temptation to game them may lead to purposeful attacks and manipulations.

Replication Is Not Required and May Not Be Possible

This returns us to the lack of transparency around the Google data as well as its granularity. The data in Google Flu Trends are presented as is, and cannot be analyzed by outside parties. With the likelihood that most large data sets will have proprietary or privacy issues around them, these limitations may dog big data for a long time.

Big Data Needs to Be Validated

As mentioned in the first point, the CDC data for flu prevalence were not improved on by Google Flu Trends. This is a major point. Big data may never do more than generate hypotheses or prove supplemental. How much money and intellectual effort do we invest in things that are inferior or secondary?
Good Data Don’t Have to Be Big Data

Not only can smaller data sets tell us things that big data never could, but statistical techniques can also make relatively modest data sets sufficient for many questions. Data appropriateness and accuracy are better qualifiers. While Google’s algorithms weren’t able to beat real-world surveillance in monitoring flu outbreaks, the company’s ability to scale a proprietary source of big data without external validation—its AdSense system, which generates billions annually in relevance-based ad matches—is highly effective. But what if the AdSense algorithms were put up against an analogous reality-based relevance system? Would they do so well? Sometimes, the ship in the bottle sails the smoothest. Looking at big data from this perspective, we can see a different issue—namely, a lack of validation of Google’s own algorithms except in the circular realm



of their business uses. That is, because Google’s advertising business is commercially successful, the algorithms are assumed to work well. If they were validated by external measures of relevancy, though, they might not perform to reasonable expectations (for more on the consequences of algorithmic decision making, see DeDeo, this volume). Big data standing alone can create a world all its own.

Sharing Data Isn’t as Simple as It Sounds

The variability between and among data sets requires some deep contemplation as well. Data are variably applicable. In fields that allow for direct data observation, such as computational biology and various mathematical fields, data acquisition and analysis can be fairly straightforward. For these researchers, data and empiricism are tightly linked, and sharing data is less fraught, if not completely clear-cut. In fields where observation of physical reality can’t be as direct, empirical observations yield data that can be difficult to reconstitute into the conditions that generated them. In these fields—medicine, molecular biology, other biological fields, physics, earth sciences, oceanography, and the humanities and social sciences—the results of observations can lead to some interesting data, but direct observations either can’t be captured in sufficient detail to be stored in anything approaching a complete manner, or are subject to so many conditions and contexts that the data enter the realm of statistical inference. Combining data from separate observational occurrences can be even more treacherous, given all the conditions, confounders, and temporal aspects of biological systems and populations. This makes the sharing of data more complicated, as often there is an implicit sharing of assumptions, contexts, and conditions, not all of which may be documented or even appreciated. For these and other reasons, it is unclear how much data should be shared.
In 2014, PLOS attempted to define a “minimal dataset” as

the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analysis presented in the paper. (Silva 2014)

This led to an almost-immediate backlash and revisions to the policy. As one blogger pointed out, even this carefully crafted statement left room for interpretation, and created potentially significant burdens or unrealistic expectations for researchers in some fields:

Most behavioral or physiological analysis is somewhere between “pure code” analysis and “eyeball” analysis. It happens over several stages of acquisition, segmenting, filtering, thresholding, transforming, and converting into a final numeric representation that is amenable to statistical testing or clear representation. Some of these steps are moving numbers from column A to row C and dividing by x. Others require judgment. It’s just like that. There is no right answer or ideal measurement, just the



“best” (at the moment, with available methods) way to usefully reduce something intractable to something tractable. (“PLOS Clarification Confuses Me More” 2014, para. 6)

Data-sharing initiatives come down to “what is the point.” Some contend that providing the underlying data is about the validation of a published study. But data can be fabricated and falsified, and sharing falsified data that fit published conclusions does little but validate a falsehood. Also, if your analysis of my data doesn’t agree with my analysis of my data, where does that leave us? Am I mistaken, or are you? Arguing over data can be a form of stonewalling, as a scandal at the University of North Carolina shows. The whistle-blower, Mary Willingham, is not a researcher, but used research tools and methodologies to identify reading and writing deficiencies in high-profile athletes attending the university and making good grades in questionable classes. An administrator attempted to discredit her research by selectively and publicly picking apart the data. As Willingham is quoted in a BusinessWeek article: “Let’s say my data are off a little bit. I don’t think they are, but let’s say they are. Set aside the data. Forget about it. The … classes were still fake, and they existed to keep athletes eligible” (Barrett 2014). Data don’t always contain the most salient conclusions of a study. Conclusions can rely on data, but sometimes conclusions offer interpretations that some particular selection of data may not support. Data accumulation has many potential weaknesses, not the least of which are the problems of how far data can become divorced from reality over time, how flawed data may go uncorrected, or how reliance on data might impede simpler and more direct empirical observations. What is not measured is also a concern. It’s easy to miss an important fact or skip measuring phenomena in a comprehensive, thoughtful way. 
In medicine, many long-term studies have reached a point of diminishing returns not only because their populations are aging and dying but also because the study designs did not take into account how relevant race, gender, or class might prove, resulting in studies dominated by Caucasian men or affluent women. After decades of data collection, such data sets are not aging well and ultimately will become relics. What factors are the large data sets of today overlooking? What regrets will our medical research peers have in their dotage? What biases are shared when we share data? Simple paradigm shifts can make entire data sets seem obsolete, or at least of diminishing utility. For instance, advances in our understanding of the microbiome and the role of intestinal flora in human health have transformed thinking about nutrition, overall health, and longevity. Yet we have little population data on the microbiome, and large medical studies lasting decades never registered this issue as possibly significant. To share the right data, we have to collect the right data. Determining which data to collect is fraught with issues that will likely continue to surface even as our mastery of data storage and processing improves.



Conclusion

Based on limited experience with big data to this point, it seems we can anticipate a few unintended consequences of the big data movement. One potential unintended consequence is that the movement may need to be renamed, as the emphasis on “big” may mean that small, carefully selected data sets are overlooked despite their power and relevance. Many small data sets can be used effectively, and small findings can have paradigm-redefining implications. Another possible unintended consequence is that we become enamored with data and forget to make direct, empirical observations. Worse, we may give data a false equivalency with reality, causing us to be misled or mistaken about what’s really going on, akin to living in a hall of data mirrors. If we lose our skepticism, we may fail to see that big data can be manipulated to tell a false story. The siren song of data is seductive. Nevertheless, an environment filled with published data needs to have a clear purpose. Is it to validate published reports? If that’s the case, then it has a time-limited value, and should be treated accordingly. Is it to enable big data initiatives? If so, then we need controls around privacy, usage, and authorization. As with everything else, the relevance and utility of data depend on who you are, what you do for a living, and what is possible in your field. Assuming that “data are data” can obscure important subtleties and major issues, and leave us unprepared to deal with flawed or fraudulent data. And we need to compare the risks, costs, and benefits of these initiatives in science with the risks, costs, and benefits of simply re-creating published experiments using actual, empirical reality. In some fields, data are strongly tied to empirical observations. Those fields already have robust data-sharing cultures, and are actively seeking to make them better.
Other domains are more driven by hypotheses, overlapping observations, and intricate interconnections of incomplete data sets that practitioners have to weave into a knowledge base. Is their approach wrong? The technocrats who believe that data are data might argue it is. But unless we’re sure that “data are reality,” it’s probably best to keep our central focus on the best-possible direct empirical observations. Data are part of these, but they shouldn’t become a substitute for them. The use of data in baseball is known as sabermetrics. As a discipline, it was brought into the modern mainstream by the Oakland A’s, a consistently high-performing team despite a lower payroll, thanks to its facility with data analysis. One statistic that sabermetrics pushed into the mainstream is on-base percentage. This came from the insight of an expert—that a walk was as good as a single, essentially—and carried over into data analysis and statistics, providing a marginal improvement over traditional measures like runs batted in, slugging percentage, and so forth. Strong theoretical constructs like on-base percentage guide data gathering and analysis, and ultimately offer value because they make sense. “Making sense” is the key to any data endeavor. Freestyle data swimming may generate waves, but to arrive at genuine meaning,



there has to be an initial hypothesis or theory of the numbers. Then, how much data do you need to test the idea? In short, we can’t elevate big data onto a lofty plane, as if it were a solution. We are at the beginning of a long story, and one that may not be as exciting as the blurbs suggest. After all, the best way to avoid unpleasant unintended consequences is to keep a reasonable level of skepticism at hand. This same skepticism is also handy when dealing with big data itself. We can’t turn off our brains yet. We need to think a little harder about data now than ever before—their provenance, what we provide and who receives what we provide, how our data privacy is protected, and so on. We also need to think hard about the questions we really have about the world. There are major limitations still facing the field, not the least of which is whether big, indirect data can be any better than properly sized direct data measurements aided by statistical extrapolation.

Acknowledgments

This chapter started as a set of posts on the Scholarly Kitchen, the official blog of the Society for Scholarly Publishing. The materials have been updated, edited, and expanded significantly for this work.

10  The Data Gold Rush in Higher Education

Jevin D. West and Jason Portenoy

The enthusiasm for all things big data and data science is more alive now than ever. It can be seen in the frequency of big data articles published in major newspapers and in the venture capitalists betting on its economic impact (Press 2015). Governments and foundations are calling for grant proposals, and big companies are reorganizing in response to this new commodity (Gordon and Betty Moore Foundation 2013; National Science Foundation 2015). Another, often-overlooked vitality indicator of data science comes from education. Students are knocking down the doors at universities, massive open online courses (MOOCs), and workshops. The demand for data science skills is at an all-time high, and universities are responding. According to a widely cited 2011 report by the McKinsey Global Institute, the United States could face a shortfall of 140,000 to 190,000 people with deep analytic talent—those trained to realize the promise of big data—by 2018 (Manyika et al. 2011). Predictions like these have led to a rapid proliferation of educational programs to train this talent—individuals popularly termed “data scientists.” In this chapter, we explore the emergent landscape of education in this field as institutions around the world race to get ahead of the forecasted shortfall. We see this growth as an indicator of the vitality of the data science field. As students become formally trained and self-identify as data scientists, data science is more likely to become established as its own discipline. Big data is as much a business opportunity for universities as it is for the venture capitalists or start-ups in Silicon Valley. The McKinsey report focuses mostly on the promise of big data to provide financial value to businesses, and the educational landscape reflects this. The vast majority of new programs to train data scientists are master’s degrees and certificates geared largely toward people going into private industry.
These programs concentrate on giving marketable skills to students, and institutions make money by charging tuition commensurate with this promise. The picture of formal education to train data scientists that we find is a somewhat disjointed one—the product of rapid, speculative growth. There is a good deal of variation in the departments and degree titles attached to the emerging programs. There are many



elements that the curricula have in common, but these too exhibit subtle variation. The programs tend to have a strong business focus, evident in their branding and coursework.
Apart from the formal and expensive education options, a number of alternative forms have emerged to train people in the fundamentals of data science. Many of these are free or inexpensive. They include large-format online classes, online training websites, organized groups of individuals interested in mentoring people within their communities, and loosely organized meetup groups. Many of these options fill roles that are underaddressed by the business-dominated degree programs.
Finally, we examine another aspect of education and data science: the pull of thought leaders in data science away from academia and toward industry. While the rise in data science training programs is a positive indicator of the vitality of the field, we recognize that some of this talent must be retained within the walls of academia for data science to take hold as a true academic discipline. We find that a disparity in resources and an outdated career structure are leading many researchers to leave academic research. We consider efforts under way to address this situation, and what it might mean for the future of the field.

Big Demand for Data Science Degrees

Since the late 2000s, hundreds of degree programs in data science and related fields have formed worldwide. The growth has become even more intense since 2011, when the McKinsey report was released (see figure 10.1). The landscape of data science education reflects a response to the dire situation portrayed in that report: the economy desperately needs more data scientists, and schools are jumping at the opportunity to train them. The overwhelming majority of the new programs are professionally oriented master's degrees and certificate programs.1 The McKinsey report is often mentioned in the promotional materials for the programs.
Promotional materials note the deluge of data overwhelming businesses, the promise that it can hold, and the fevered demand for data scientists to harness it. In a 2012 article, Harvard Business Review identified the data scientist as "the sexiest job of the 21st century." This same article noted that there were "no university programs offering degrees in data science," although some programs in analytics were starting to alter their curricula to take advantage of the big data boom (Davenport and Patil 2012). The change in the short period since that article was published has been dramatic (see the rapid rise in data science programs evident in figure 10.1). Budding data scientists now can choose among a host of programs eager to supply companies with the talent for which they are looking. The buzz of big data has not escaped the halls of academia or the offices of university presidents. The cost of completing a master's degree is between $20,000 and $70,000 for a one- or two-year program (costs can be lower for state schools, especially tuition rates for residents). Tables 10.1 and 10.2 show the range in cost of these master's programs.2



Figure 10.1
Growth of data science master's programs. The number of master's degree programs in data science and analytics has risen dramatically since 2011 (data accessed February 6, 2016). Note: Defining which master's programs belong in the category of "data science" is not straightforward. For example, the University of Washington offers an MS in information management with a specialization in data science and analytics that has many of the elements of a master's in data science; this program is not included in the data. Because of this, the growth of data science degrees reported in this figure likely far underestimates the actual growth.

Table 10.1
Most Expensive Data Science Programs

College | Degree | Duration | Cost (r = resident)
University of Denver | MS in business analytics | 12–36 mo |
New York University | MS in business analytics | 12 mo |
Carnegie Mellon University | MISM (business intelligence track) | 16 mo |
Southern Methodist University | MS in applied statistics and data analytics | 18–24 mo |
Carnegie Mellon University | MS in computational data science | 16 mo |
Northwestern University | MS in analytics | 15 mo |
University of Rochester | MS in business analytics | 10 mo |
University of California at Berkeley | Master of information and data science | 12–20 mo |
Illinois Institute of Technology | MS in marketing analytics and communication | 12–24 mo |
University of Miami | MS in business analytics | 10 mo |




Table 10.2
Least Expensive Data Science Programs

College | Degree | Duration | Cost (r = resident)
University of Connecticut | MS in business analytics and project management | 12+ mo |
Fairfield University | MS in business analytics | 12+ mo |
Xavier University | MS in customer analytics | 24 mo |
University of Alabama | MS in applied statistics (data mining track) | 9+ mo | $24,000 ($9,150 r)
Elmhurst College | MS in data science | 24 mo |
Southern New Hampshire University | MS in data analytics | 20+ mo |
Valparaiso University | MS in analytics and modeling | 18 mo |
University of Iowa | MS in business analytics | 12–36 mo |
South Dakota State University | MS in data science | 12+ mo | $17,600 ($9,900 r)
Dakota State University | MS in analytics | 20+ mo | $12,800 ($6,100 r)

The tuition charged for these degrees reflects the high earnings potential that the colleges expect their graduates to have in the job market. This idea is sometimes made explicit. The "FAQs" for the University of San Francisco's (n.d.) MS in analytics—tuition $43,575—states:

We are proud to run a program that, with high probability, significantly increases the earnings power of our graduates over the long run. … We are confident that the return on investment associated with this particular professional program is superior to the return on investment from many other forms of professional training (law, medicine, etc.). There is a shortage of data scientists on the job market right now, and that shortage is projected to get far worse before it gets better.

North Carolina State University estimates that graduates should be able to recoup the cost of tuition for its MS in analytics program—$40,800, or $23,600 for state residents—plus fees and one year of lost wages, in nineteen to twenty-six months. It also reports the average net three-year return on investment for graduates to be $131,800 to $136,400 (Institute for Advanced Analytics 2014). Many of these programs are too new to report any sort of outcomes data for their graduates. Those that do, however, claim high job placement rates—frequently 100 percent. They report starting salaries in the high five or six figures, many with signing bonuses, although they generally only have these numbers for one or two cohorts. The types of positions that reporting graduates tend to go into include data scientist, analyst, consultant, and manager,



in large companies in industries such as finance, technology, consulting, e-commerce, and health care.
Owing to both the interdisciplinary nature of the field and the rapid proliferation of programs spurred by the urgent demand, the landscape of data science programs gives a somewhat fractured impression. The degrees are commonly named analytics, data science, or business analytics; these programs all tend to offer similar curricula, with some subtle variations. Some schools extend these titles, such as St. John's University's MS in data mining and predictive analytics or Carnegie Mellon's MS in computational data science. Beyond the naming of the programs, the inchoate nature of data science education is reflected in the variety of academic departments that offer the degrees and the lack of any natural academic home (Finzer 2013). Business schools are one common home for these programs. Many of them brand themselves as specialized business degrees; these are often the business analytics programs (Gellman 2014). Other programs are housed in departments of statistics, computer science, engineering, or information science. Another common practice is to have several academic departments collaborate to offer the program, such as Northeastern University's MS in business analytics, a collaboration among the business school, the college of computer and information sciences, and the college of social science and humanities. At the University of Washington, there are collaborative efforts among departments across campus to offer transcriptable options in data science in fields from biology to information science. Some colleges form "institutes" out of these collaborations, such as the University of Virginia, whose Data Science Institute includes faculty from computer science, statistics, and systems and information engineering.
The data science and analytics curriculum reflects the skills that someone will need in this career, to the extent that such skills have been defined in this new field. Data scientists are expected to be able to analyze large data sets using statistical techniques, so statistics and modeling are typically among the required coursework. They must be able to find meaning in unstructured data, so classes on data mining are also usually part of the core. They frequently must be able to communicate their findings effectively, so courses on visualization are commonly offered as electives. Other courses that students may take include research design, databases, parallel computing, cloud computing, computer programming, and machine learning; all of these reflect skills that an employer might expect from a data scientist. Courses in business and marketing are common, too, especially among programs in business schools. Most programs also require a capstone project that gives students experience in working through real-world problems in teams. The range of courses is indicative of an expectation that data scientists will be strong in many different areas. Bill Howe, who teaches data science at the University of Washington, is not even sure that a curriculum to train data scientists is feasible: "It remains to be seen," he says, adding, "What employers want is someone who can do it all" (Miller 2013b, para. 20).
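The statistical core of these curricula can be made concrete with a toy exercise of the kind a first modeling course might assign: fitting a least-squares line to a handful of data points. The sketch below is a hypothetical illustration using only Python's standard library; actual coursework would more likely use R, NumPy, or scikit-learn.

```python
# Toy "statistics and modeling" exercise: fit a least-squares
# line y = a + b*x to a small data set, standard library only.
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept a, slope b) minimizing squared error."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b * x_bar, b

# Points lying roughly on y = 2x.
a, b = fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.0, 8.1, 9.9])
print(f"intercept={a:.2f}, slope={b:.2f}")
```

An exercise at this level assumes only introductory programming, which is why statistics and programming courses sit side by side in these degree plans.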



Alternative Data Science Education

The landscape of data science degree programs reflects a market dominated by private industry, and this can come at the expense of other areas of society—academia, nonprofits, and the public sector. The promise of big data extends to a wide range of applications. The value of data science is more than the money that can be made or saved by an organization. One way that this gap is being addressed is through several alternative data science training venues apart from the formal training offered through university degree programs. Lacking a solid revenue structure or robust funding, these efforts are largely grass roots and community driven, and rely heavily on volunteers as well as the passion and interest of those involved. These alternatives include large-format online classes called MOOCs and other forms of online instruction, organizations that run in-person workshops and meetings to offer instruction in communities, and informal meetup groups.
MOOCs are free or low-cost courses taught through universities or other organizations. They typically use short videos, interactive quizzes, and other assignments that can be easily disseminated to a large online audience. Courses can be offered either on a schedule or on demand at the student's pace; in either case, the format requires a good deal of self-motivation, and completion rates tend to be low (Liyanagunawardena, Adams, and Williams 2013). Some courses offer a certificate of completion at the end. MOOCs are a low-cost alternative to formal education options, but they lack the formal recognition and networking opportunities of the more expensive degree options.
Two popular MOOCs for data science are offered on the MOOC platform Coursera: a Stanford University course on machine learning created by Andrew Ng, a computer science professor and cofounder of Coursera; and a University of Washington course titled Introduction to Data Science.3 In addition to structured MOOCs, a number of options exist online for the self-study of data science techniques and tools. Many standard data science tools are open source—Python, R, D3, and Weka—and have active online communities. These options tend to be free or inexpensive. To learn the basics of Python, for example, one can use the tutorials on Codecademy, Google's Python Class, Python's own tutorial, or a number of other options in a variety of formats and levels.4
Another type of alternative education in data science attempts to meet the need for training that is more accessible than formal degree programs, and yet more social and somewhat less rigorous than MOOCs. This tends to be community based, organized through a nonprofit organization or without any formal organization at all. It consists of in-person meetings and training sessions such as workshops, typically with short time commitments. The goal of these groups is not necessarily to train data scientists but instead to get the techniques and tools into the hands of a wider range of people, laying down the fundamentals and teaching people to be conversant.
Among the most formalized of these groups are Software Carpentry and its sister group, Data Carpentry. These groups focus specifically on teaching scientists the basics of



computing and software development to aid in their research. They came out of the recognition that scientists—those with the highest level of domain expertise in these research endeavors—often acquire their software skills informally by self-study or through impromptu lessons from peers. This can lead to gaps in knowledge that make research far less efficient, as the time it takes to do computational science depends increasingly on the time spent writing, testing, debugging, and using software (Hannay et al. 2009; Wilson 2014; Wilson et al. 2014). Software Carpentry is a nonprofit volunteer organization that runs short workshops to train scientists in basic computing skills related to programming, automation, and version control. It originated as a course taught to scientists at the Los Alamos National Laboratory in New Mexico, and developed into a network of two-day intensive workshops run all over the world. The workshops revolve around teaching the basics of UNIX shell commands, programming in Python and R, version control, and using databases and SQL. In 2013, it organized ninety-one workshops for around forty-three hundred scientists (Wilson 2014).
The sheer number of skills required of a single data scientist is unrealistic. Scientists working on computationally intensive research cannot all become experts in every aspect of data analytics. These research projects frequently require teams of people in order to cover all the necessary bases, such as statistics, software development, and domain expertise. Still, having some broad training in all areas of data science techniques can go a long way toward helping scientists work more efficiently. By learning how to write and debug code, scientists can save time, communicate more readily with other members of the team, and understand more fully the methods underlying the research effort.
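The level of "basic computing skills" at issue here can be made concrete with a toy exercise of the kind such introductory workshops often use: tallying word frequencies in a piece of text. This is a hypothetical illustration, not drawn from any specific curriculum.

```python
# Count word frequencies in a snippet of text: a typical first
# exercise in an introductory scientific-computing workshop.
# (Hypothetical illustration, not from any specific curriculum.)
from collections import Counter

text = "big data is not a monolith and big data is not magic"
counts = Counter(text.split())

# Report the three most frequent words.
for word, n in counts.most_common(3):
    print(word, n)
```

Modest as it looks, an exercise like this introduces data structures, iteration, and the standard library, which is exactly the conversancy these workshops aim for.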
The teamwork requirement of data science also underscores the importance of skills in version control systems, which manage changes to software, documents, and data. There is a widespread movement around "reproducible science" within academia in which version control systems are a central focus (for more on reproducible science and the reuse of data in research, see Plale et al., this volume). This is influencing curricula in the classroom and topics on syllabi. Several other efforts exist to offer informal data science training to researchers, sometimes offered by those in charge of running supercomputer facilities at research institutions. The Princeton Institute for Computational Science and Engineering and the SciNet supercomputer cluster at the University of Toronto, for instance, both offer workshops and training to researchers doing work that makes use of these advanced resources. Other examples of informal training for scientists involve loose organizations such as the Hacker Within, originally started as a student group at the University of Wisconsin at Madison (and an inspiration for Software Carpentry), and now a nascent association of groups offering meetings and informal boot camps at the University of California at Berkeley, the University of Wisconsin at Madison, and several others (Losh 2011).5 Other organizations offer data science training to different audiences. The Community Data Science Workshops are a series of volunteer-staffed interactive classes at the University



of Washington, run over the course of three weekends, that provide training in Python to novices interested in using data science tools to study online communities such as Wikipedia and Twitter.6 These types of efforts, being completely open and fueled by passion and interest, often spread—the Community Data Science Workshops were inspired by the Boston Python Workshop, which in turn inspired the Python Workshop for Beginners at the University of Waterloo. PyData is another community that runs conferences around the world for developers and users of Python tools; all of its net proceeds go to fund the NumFOCUS Foundation, which supports and promotes open-source computing tools in science.7 The landscape of informal data science training also includes an array of online and in-person communities that exist without any central structure or organization. The website Meetup, which allows groups of people to organize and advertise in-person meetings, has hundreds of active groups globally that meet to discuss and work on issues and projects related to data science.8

Data Science for Science

The landscape of data science education in universities and colleges is one dominated by a connection to industry. This is a reflection of the enormous gains that businesses stand to make from this phenomenon. The same force can be seen in a siphoning of talent from academic science toward these businesses. This will be a challenge going forward, as the future of data science as an academic discipline depends on some of its thought leaders remaining within academia. Whether learned in data science programs or through the practice of discipline-specific research, the skills many doctoral graduates gain are highly valued in industry. These graduates face a job market skewed toward business. A National Institutes of Health (2014b) postdoctoral researcher can expect to earn a salary of around $42,000 per year.
Entry-level data scientists can make upward of $100,000 per year (Dwoskin 2014a). Many graduates of quantitative fields such as physics, math, and astronomy are moving into data science, where programming and quantitative skills are highly valued. Often, they are motivated by more than just money. Many private companies offer positions that involve working on interesting problems that are attractive to people oriented toward research. At companies like LinkedIn and Yelp, data scientists can apply statistical models and machine-learning techniques to situations that will ultimately drive business, and typically get the satisfaction of seeing the fruits of their work within months, as compared to the slower pace of academic research (ibid.). Some companies also have research labs that fund scientists to do exploratory research not directly related to increasing revenue—although corporate incentives still tend to be present in these environments, and the results of research are not made available to the public or other researchers outside the company.9 These companies are investing money in research at the same time that academic research funding is still feeling the squeeze from the recession of the late 2000s.



Exemplifying this trend is the Insight Data Science Fellows Program, a postdoctoral fellowship designed to help PhDs with quantitative skills make the transition from academia to industry.10 Fellows work on a data science project, and are trained in industry-standard techniques and tools, such as machine learning, version control, parallel computing, Python, and R. At the end of the program, they are matched with industry jobs. The program boasts a 100 percent placement rate.
Another challenge for data science in the academic sciences is an incentive structure within the profession that prioritizes publications over other important work such as software development (the so-called publish-or-perish model). It has been estimated that in the age of computational and data-driven research, scientists can spend 30 percent or more of their time developing software (Hannay et al. 2009). Because this is time that could be spent writing articles to publish in journals and conference proceedings, there is a strong disincentive for scholars to spend even more time developing additional skills, or writing clean and reproducible code that could serve other researchers in similar domains (Vanderplas 2013, 2014).
There are some efforts under way to counter the movement of talent away from scientific research. In 2013, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation began a $38 million project to fund initiatives that use data science to advance research. The funding supports data science centers at three partner institutions: the Center for Data Science at New York University, the Berkeley Institute for Data Science at the University of California at Berkeley, and the eScience Institute at the University of Washington. These institutes work within and across institutions to foster collaborations, develop sustainable and reusable tools for scientific research, and work toward fixing the outdated academic career structure (Gordon and Betty Moore Foundation 2013).
Similar initiatives exist at other institutions. The Stanford Data Science Initiative, Columbia University Data Science Institute, and Massive Data Institute at Georgetown’s McCourt School of Public Policy all seek to support the use of data science in research, and encourage cross-disciplinary collaboration in order to do so. While still in its infancy, data science is being established as an academic discipline. Data science PhD programs are beginning to appear at universities that want programs with more of a research focus. Brown University’s computer science department, for example, offers a PhD in big data, and Carnegie Mellon offers a PhD in machine learning. Other schools—the University of Washington and Penn State University—offer interdisciplinary PhD programs in big data or analytics funded by the National Science Foundation’s Integrative Graduate Education and Research Traineeship. PhD programs exist to train thought leaders and innovators in research (although many of these students will go into industry with their PhD). The first crop of PhD students in data science will soon be graduating.11 This set of young researchers will greatly influence the academic direction of data science. Will they get jobs in their home discipline or other disciplines? Or will they go to industry? If these researchers



stay in academia, will this be enough to establish a stand-alone field of data science over the next ten years, or will data science remain cross-disciplinary?

A Moving Target

The dizzying rate of change in the world of data science education creates a feeling of uncertainty and calls into question the future of the field. Ideally, education trends should align approximately with movements in technology, but technology changes too fast for education to follow every new development or trendy whim. By chasing every new trend, educators risk diluting their brand and wasting curriculum development on skills that become obsolete before the first day of class. For those trends in technology that have staying power, however, educational institutions should respond, both out of their own self-interest and for the betterment of the economy and society. It remains to be seen where data science fits into this balance—whether it will turn out to be a flash in the pan or prove itself a discipline on its own merit. Data science may diffuse to all domains and infiltrate syllabi across campuses without the need for stand-alone programs. Either way, big and unwieldy education bureaucracies will have a difficult time keeping up with more flexible, adaptive institutions when responding to trends in technology. Our findings do not indicate a dependence on institution size, at least for data science. Big, well-established universities are responding to this data revolution as fast as smaller institutions; we see a greater dependence at the departmental level than on overall institution size. The more interesting story developing is the impact that nonuniversity programs (see the section "Alternative Data Science Education" above) are having on education, and the role they will play with or without the traditional universities.
The big data movement is a useful, real-time experiment for seeing which institutions and types of institutions will meet the demand (or lack of demand) most effectively. The emergence of data science is also an opportunity to attract more students to science, technology, engineering, and math (STEM) fields. The National Science Foundation (2014) has spent millions of grant dollars trying to figure out how to graduate more STEM students. Data science can serve as a potent STEM attractor, given the high salaries (Dwoskin 2014a), industry demand (Manyika et al. 2011), popularity of data scientists such as Nate Silver (2012), and ostensible "sexiness" of the discipline (Davenport and Patil 2012).
Academia tends to move more slowly than industry, which can make it difficult to respond to trends appropriately. Yet there are advantages to this slower pace. The data revolution has impact well outside the consumer products with which industry is concerned. Potential benefits exist in the natural and social sciences, government and nonprofit organizations, and health and medicine. The drawbacks of big data are important as well, such as privacy and ethical concerns that are far from resolved (see also Cate, this volume). Academic institutions should fill the role of addressing these gaps through their curricula and activities. Some examples of this exist already. The Data Science for Social Good Fellowship programs at the



University of Chicago and University of Washington as well as the Atlanta Data Science for Social Good Internship mentor students to work on projects for nonprofits and other organizations that have a social mission.12 The field of bioinformatics is a related but distinct discipline, using data science techniques to tackle issues in health and medicine. Ethics is already a component of many data science curricula. These efforts should be applauded and encouraged. Students, even those focused on a future in business, should be made to consider critically the societal impact of their work, and the ethical implications that their algorithms and experiments could have (see also DeDeo, this volume). This is where the slower pace of academia can be good.
Many of the educational institutions and new programs in data science cite the McKinsey report, and many decision makers are using it as their reference point. We would like to see more follow-ups on this kind of employment forecasting. The forecasts reported may be correct; in fact, they may underestimate the demand. But given the report's influence in education boardrooms, we recommend that governments, universities, and foundations sponsor more studies like it to verify and update these forecasts, particularly in fast-changing areas like data science. We would hate to see wasted effort and money at budget-strapped universities passed on to students as exorbitant tuition with no commensurate jobs.

Conclusion

Data science is going through growing pains, and the education landscape reflects this. The enthusiasm behind big data has ignited fevered growth as institutions and organizations race to meet demand. At universities, we see an explosion of new programs, primarily master's degrees and certificate programs with elements of business and management. Alternative forms of education such as online courses and community workshops have sprouted to address other sorts of demand.
Given the demand for data scientists in business, there are incentives for leaders in this field to choose careers in industry rather than domain science research, which may slow the development of a data science discipline and its influence on nonbusiness domains. While the future of data science remains open to question, we see the current activity as an indication of things to come in both technology and education. The potential of big data is becoming increasingly important in society, and educational institutions and organizations are beginning to form an infrastructure that can train students with the expertise to harness it. This infrastructure will be critical in sustaining the big data "gold rush" we have seen in recent years.

IV  Big Data and Organizations

11  Obstacles on the Road to Corporate Data Responsibility

M. Lynne Markus

No issue is more closely identified with big data in public awareness than privacy (Ohm and Peppet, this volume). Many proposals for improving privacy protection focus on national security and law enforcement agencies, or on data brokers and search firms (e.g., Google; Burdon and Andrejevic, this volume). To the extent, however, that every organization today collects, stores, manages, and processes information, every organization has responsibilities for privacy and data protection. The question addressed in this chapter is whether and to what degree corporations (e.g., banks and insurance companies) and other organizations (e.g., not-for-profit health care providers and educational institutions) are able to behave responsibly with data. The chapter's conclusion is that corporate data responsibility is likely to remain elusive for reasons suggested by the title of this volume. The nonmonolithic context of organizational data use, data governance, and data management creates obstacles on the path to corporate data responsibility.
The observations in this chapter come from two sources. The first was a workshop funded by the National Science Foundation to create an agenda for research on the social, economic, and workforce implications of big data and analytics (Markus and Topi 2015). The second is an ongoing study of data management practices in large business, nonprofit, and government organizations conducted by MIT's Center for Information Systems Research with Barbara H. Wixom as the principal investigator.1 The chapter first briefly describes organizations other than national security agencies and data brokers as information businesses with data protection and management responsibilities. It then describes the multifarious context of organizational data use, protection, and management, and shows how this multiplicity creates obstacles for the effective discharge of corporate data responsibility.
Every Organization Is in the Information Business

Big data is not solely the province of law enforcement agencies, technology and data companies, and search firms. Organizations of all kinds collect, store, manage, and use data about their operations, employees, and customers.


Retail banks offer a concrete example. These organizations maintain personal data on employees (e.g., addresses and identification numbers, but also data about compensation and benefits, families, health, and work performance). They collect and manage personal data about customers’ use of various banking products, including checking and savings accounts, secured loans for homes and cars, unsecured loans like credit cards, investment advice, investment products, and insurance. Retail banks buy personal data about prospective customers along with customer sentiment toward banking institutions and services. Banks may also sell customers’ personal data to data aggregators and brokers. Financial organizations look to the analysis of personal information to help them combat fraud, cross-sell financial products to existing customers, assess the creditworthiness of potential new customers, design new financial products and marketing campaigns, and provide banking services across multiple technology platforms, including branches, kiosks, personal computers, and mobile devices. They also look to big data analytics for guidance on future hiring decisions and lowering employee health insurance costs. A challenge these organizations face is how to gain the many promised benefits of big data use without compromising personal information privacy. Big data introduces new threats to personal information privacy through the process of matching data elements found in disparate sources, allowing for the reidentification of people in “anonymized” data sets (Ohm 2010) and the virtual “manufacturing” (Pasquale 2015) of personal data from unlikely material. An illustration of this process is the use of facial recognition technology to identify unnamed people in photographs posted on social media sites. It is not far-fetched to conclude that big data allows any item of personal data to reveal all data about a person (Ohm and Peppet, this volume). 
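The matching process described above can be sketched concretely. The following Python fragment uses invented toy data (all names, fields, and values are hypothetical) to show how joining an “anonymized” data set with a public record on shared quasi-identifiers such as zip code, birth date, and sex can re-identify a person, in the spirit of the linkage attacks Ohm (2010) discusses.

```python
# Illustrative sketch only: a minimal linkage attack on hypothetical data.
# The data sets, field names, and values below are all invented.

# An "anonymized" data set: names removed, quasi-identifiers retained.
anonymized = [
    {"zip": "02138", "birth_date": "1945-07-21", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1982-03-02", "sex": "M", "diagnosis": "asthma"},
]

# A public source (e.g., a voter roll) with the same quasi-identifiers plus names.
public_records = [
    {"name": "J. Doe", "zip": "02138", "birth_date": "1945-07-21", "sex": "F"},
    {"name": "A. Smith", "zip": "02140", "birth_date": "1990-11-30", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def link(anon_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Join two sources on shared quasi-identifiers, attaching names to 'anonymous' rows."""
    index = {tuple(r[k] for k in keys): r["name"] for r in public_rows}
    matches = []
    for row in anon_rows:
        name = index.get(tuple(row[k] for k in keys))
        if name is not None:
            matches.append({"name": name, **row})
    return matches

reidentified = link(anonymized, public_records)
print(reidentified)  # the "anonymous" hypertension patient is linked to J. Doe
```

No field in the “anonymized” set names anyone, yet the combination of ordinary attributes is distinctive enough to act as a fingerprint once a second source is available, which is why removing names alone is weak protection.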
The enhanced risks of big data use demand enhanced organizational data protection practices. Lawyers, public policy experts, and authorities on technology, business, and social responsibility have called for changes in organizational data protection behaviors. Some proposals focus on legal and regulatory changes. Examples include legislation about employers’ and employees’ rights of ownership in employees’ personal digital devices and online accounts (Mehta 2014; Park 2014), due process for individuals harmed by big data (Crawford and Schultz 2014), experimentation on the users of online services (Grimmelmann 2015), the need for “chain-link confidentiality” when personal information is disclosed to third parties (Hartzog 2012), and regulatory reporting (Culnan 2011). Other proposals focus on self-regulatory organizational practices. For instance, organizations are urged to reduce their reliance on customer notification and choice, and increase their emphasis on data stewardship and risk assessment (Cate, this volume). They are advised to conduct value alignment exercises (Davis 2012), and develop or adapt codes of ethical conduct about data use (Bennett and Mulligan 2012). Organizations are encouraged to set up data-sharing agreements as well as review boards for data-sharing requests, analytics projects, and experiments on users (Allen et al. 2014; Calo 2013; Grimmelmann 2015; Lin et al. 2014). They are exhorted to train
developers in methods for designing privacy into websites and apps (Friedman, Kahn, and Borning 2006; Langheinrich 2001). The proliferation of proposals for changes in the data-handling practices of all organizations raises a number of challenging questions. Does effective data protection require organizations to implement all the recommended practices? How likely are organizations to do so, and even if they do, how likely are they to succeed in protecting data effectively? Answering such questions requires a sound understanding of the organizational context in which data are collected, used, and protected. This chapter, accordingly, examines three dimensions of the organizational data use context: differentiation inside organizations and across the organizations that make up data “supply chains”; multiplicity, conflicts, and gaps in data protection laws, contracts, frameworks, standards, occupational specialties, and formally mandated protection roles; and diversity in organizations’ data management priorities and practices. It also shows how the nonmonolithic character of each dimension creates challenges for effective organizational data protection.

The Organizational Context of Data Use Is Not a Monolith

To understand how—and how effectively—personal data are protected, it is first necessary to understand that organizational data use involves differentiated, not monolithic, actors. Organizational data use involves many individuals, groups, and even distinct legal entities with different goals, authorities, accountabilities, responsibilities, and values. As a rule, these units do not undertake data protection in a coordinated or consistent manner. This can undermine effective data protection, which requires coordination of all parties toward a common goal.
Below I consider four types of differentiation in the organizational context of data use that have important implications for data protection: differences among the organizations in data supply chains, differences among subunits inside organizations, differences between organizational leaders and employees, and differences between clients and data scientists.

The Organizations in Data Supply Chains

The use of big data often involves two or more organizations. The simplest cases involve only two organizations. For instance, an organization like a bank or insurance company might purchase data about prospective customers from a data aggregator for marketing purposes. In practice, data supply chains (Martin 2015; Washington 2014) frequently involve more than two organizations. An insurance company, for instance, that sells its policies through independent insurance brokerages may, in order to help brokers improve their sales performance, provide brokers with access to personal data about prospective customers that the insurance company purchased from a data aggregator. Here, the data supply chain involves the insurance company, the data aggregator, and numerous insurance brokerages, each of which must do its part for data protection to be effective.


Data supply chains can be even more complicated, because organizations may contract with data brokers as well as technology products and services firms (e.g., analytics service providers), which in turn contract with other technology products and services firms for support services, such as data storage, data backup and disaster recovery services, data security services, and so forth. Any of the companies in these chains can create breaches in data protection. For example, organizations have lost access to personal data about their customers and employees when a technology company they contracted with acquired the data storage services of a third company that was temporarily or permanently shut down by a legal action or bankruptcy (Armbrust et al. 2010). The organizations that lost data might not even have known that the vulnerable data storage provider was involved in the supply chain (Sullivan, Heiser, and McMillan 2015). The companies that touch the personal data that organizations buy, collect, analyze, or sell are highly diverse. They include data brokers (e.g., Acxiom), social media companies (e.g., Facebook), “cloud computing” service providers (e.g., Amazon Web Services), enterprise applications software providers (e.g., SAP), “business process outsourcing” companies (e.g., Accenture), consulting and market research firms, and accounting and audit firms.2 The data protection policies and practices of these companies can differ widely from each other and from the organizations that do business with them. First, as autonomous legal entities, each company can set its own policies within regulatory constraints. Second, companies incorporated in different countries and those that operate in different economic sectors may be subject to different rules. 
Third, some companies make money from the sale of data and data-related services; they design their terms and conditions of use with their best advantage in mind.3 For instance, the contracts of data storage providers do not always afford strong performance guarantees to their organizational customers (Kemp Little LLP 2011; Venter and Whitley 2012), offer redress for customers in the event of data loss (Kemp Little LLP 2011), or affirm the customers’ ownership of stored data (Trappler 2012). The result is a patchwork of data protection agreements that do not cover the entire data supply chain in a coherent and coordinated manner. Any of the partners in an organization’s data supply chains can be weak links in personal data protection. It is not surprising, therefore, that experts identify external business partners and data services providers as key sources of privacy risk for organizations (Casper 2014). In addition, when lapses in data protection occur, “it can be challenging for organizations, and regulators, to determine which entities are responsible for legal compliance, and what the scope is of their individual responsibilities” (Carey 2015, 307).

Subunits Inside Organizations

Every medium- and large-sized organization has internal structural divisions that can lead to differences in data protection practices and gaps in regulatory compliance. Organizations like banks with multiple product lines (e.g., checking and savings accounts, credit cards, and mortgages) and/or markets (e.g., North America versus Europe, or different states or
provinces within countries) are internally complex. They are typically organized into business units or groups of business units, each responsible for a particular combination of products and/or customer segments. These units operate somewhat independently of each other, and may have considerable autonomy to set their own policies and procedures in certain areas of activity (Markus 2014). One implication of this internal differentiation is that an organization like a bank or insurance company does not always manage and protect data as an integrated whole. Several business units in a large financial services company, for example, might each contract separately with the same external data or services provider, and each of these contracts may specify different data protection terms. It is critical to the success of data protection to embed decision making about data protection throughout business unit structures (Bamberger and Mulligan 2011a; Johnson and Goetz 2007). Unfortunately, however, business unit leaders do not always share with executives at headquarters a high level of concern with or commitment to data privacy and security (Bamberger and Mulligan 2011a; Johnson and Goetz 2007). Thus, it can be highly consequential for data protection whether organizations choose to centralize data analytics activities at corporate headquarters or decentralize them in business units. A common practice today is to locate analytics activities within multiple business and functional units (e.g., marketing) rather than creating a “center of excellence” at headquarters (where presumably the use and protection of data would be easier to control) (Pearson and Wegener 2013). Even when organizations have analytics units at headquarters, these central units may not have the organizational authority or clout to set policies and enforce compliance by business units. Given this, data-related policies may vary in nature and quality of enforcement across the units of a complex enterprise.
In addition, organizations that operate in multiple countries are even less likely to protect data in a consistent fashion across all units. The major organizational units of multinational enterprises are legally incorporated in various countries, where they are subject to different laws and regulations. Given cross-country differences in data protection laws (discussed more fully below), it would not be surprising to find considerable variation and gaps in data protection across the units of some multinational enterprises.

Organizational Managers and Employees

The most obvious example of the nonmonolithic character of the organizational data use context lies in the distinction between organizational policy makers and the rank-and-file employees. Executives can set policies, and business units can embrace them, but the compliance of many employees is necessary for data protection success (Hu et al. 2012). Accounts of major security and privacy breaches frequently highlight employee failure to conform to good organizational policies. For instance, employees may fail to install newer software versions containing security patches, share passwords or write them down in places where they can be copied, or leave devices containing sensitive data unprotected in homes or cars. Much
of the literature on corporate security and privacy protection focuses on the development of employee-oriented policies, such as acceptable (technology and data) use policies (Huckabee and Kolb 2014) and the challenges of motivating employees to comply with them (Hu et al. 2012; Puhakainen and Siponen 2010). To improve individual compliance, security experts design education and training programs, monitor employee behavior, conduct diagnostic phishing exercises, and engage in ethical hacking to gauge employee susceptibility to malicious hackers’ social engineering ploys. Data loss protection technologies are installed to ensure that employees do not download or e-mail protected data. But most experts agree that technology alone cannot protect sensitive data (Hill 2009), just as good organizational data protection policies cannot succeed on their own. Effective data protection requires organizational leaders to consistently enforce employee compliance with well-designed data protection policies and technological controls.

Managerial Clients and Data Scientists

As noted above, data protection effectiveness within organizations depends on such factors as where in organizational structures data protection policies are set and analytics activities are located; the nature of the relationships between policy makers in headquarters and business unit heads; and the quality of managerial enforcement of employee compliance with rules and regulations. Another important factor is the values of organizational members along with the ethical choices they make about the use and protection of data. Here I focus particularly on the values and ethical orientations of organizational data scientists—specialists who look for useful patterns by matching and analyzing multiple data sets—and their managerial clients—executive problem owners who commission and oversee data science projects.
Legal scholars (Bamberger and Mulligan 2011b) and business and technology experts (Culnan and Williams 2009; Davis 2012) argue that data protection in an era of big data is not just a matter of legal compliance; organizations should articulate their ethical principles and instill appropriate values throughout their organizational cultures. This prescription can be challenging to fill because of the many influences on individuals’ values and beliefs, including national culture, educational preparation, occupational specialty, and specific organizational location and role. Europeans, for example, are believed to value privacy differently than North Americans (Seddon and Currie 2013), and people in high-technology occupations and organizations are supposedly more prone than others to the pro-innovation bias that is sometimes called “solutionism” (Morozov 2013). A neutral definition of solutionism is “the belief that every problem has a solution based in technology” (Maxwell 2014). A more critical one is the urge to fix with technology “‘problems’ that are not problems at all” (Morozov 2013, 6). In general, solutionists minimize the potential of technological innovations to create negative side effects and label as “neo-Luddites” people who raise concerns (Atkinson 2015; Trader 2014;
“What Is a Neo-Luddite?” n.d.). By contrast, other organizational members, including some executives, have more cautious views about the benefits and risks of big data analytics. This may result from uncertainty about benefits, fear of breaking the law or risking customer backlash, or discomfort from lack of personal expertise. Education and professional socialization play an important role in shaping the values and ethical stances of data scientists and their clients. So it is instructive to learn what these groups know about data protection and the ethical dilemmas of big data use. The occupational specialty of data scientist is still emerging (Booz Allen Hamilton 2015). Well-publicized job opportunities for data scientists have encouraged the emergence and growth of educational programs (West and Portenoy, this volume). Several academic disciplines and professional associations have laid claim to the territory of data scientist education. Although some programs are interdisciplinary, most data science programs are associated with computer science or engineering departments or schools; departments of mathematics and statistics (often located in arts and sciences schools); or departments of information systems and operations research/management science (usually, in the United States, located in business schools) (West and Portenoy, this volume). Naturally, the curricula of data science programs differ depending on disciplinary location. Yet the majority of programs today focus almost exclusively on either the building of data science tools and infrastructure or the use of data science techniques (e.g., machine learning) to solve research or practical problems. A few programs offer limited opportunities for students to learn about data protection laws, ethical dilemmas in big data use, and how to confront ethical concerns constructively.
In the sections below, I consider curricula and aspects of professional practice, such as codes of conduct, for each of the three main disciplines in which data scientists are prepared. Computer Science  The field of computer science has a long-standing commitment to ethics research, practice, and education. The ethics of computing is an established research area within computer science (Bynum 2015). A number of computer and information science programs offer courses on ethical theory and its application (Fleischmann and Wallace 2005), or approaches for building positive values like privacy into the design of systems and devices (Friedman and Hendry 2012). The ACM Code of Ethics and Professional Conduct (1992), by a leading computer science association, exhorts members to contribute to society (section 1.1) and human well-being and avoid doing harm (section 1.2)—an injunction against solutionism. Despite these efforts, educators find “an ‘ethics gap’” in engineering education, noting that professionals are rarely prepared to deal well with the many ethical challenges in contemporary practice (McGinn 2015). Among those challenges are pressures from clients and superiors to do things that a professional believes are wrong. Mathematics and Statistics  Statisticians are well aware of ethical challenges surrounding the analysis of data. For example, the American Statistical Association enjoins its members,
through its “Ethical Guidelines for Statistical Practice” (Committee on Professional Ethics 1999, sec. I.C), to avoid working toward a predetermined result, which members’ clients might pressure them to do. The guidelines also contain a detailed discussion about the protection of human subjects of research “including census or survey respondents and persons and organizations supplying data from administrative records, as well as subjects of physically or psychologically invasive research” (ibid., sec. II.D); this section explicitly mentions privacy and data confidentiality. The American Statistical Association’s (2014, 13) “Curriculum Guidelines for Undergraduate Programs in Statistical Science” urges educators “to integrate training in professional conduct and ethics,” and some guidance is provided in a companion white paper (Cohen 2014) on how that might be done. For instance, educators should counsel students not to “imply protection of privacy and confidentiality for legal processes of discovery unless explicitly authorized to do so” (ibid., 3). Similarly, the Data Science Association’s (n.d.) “Code of Conduct” offers a detailed definition of confidential information that clearly includes the personal data of employees and customers. Among other things, the code enjoins professionals to “do no harm”—that is, to avoid negative side effects—a warning against solutionism. The Data Science Association’s code also states that the professional is to abide by the client’s objectives, and acknowledges the dilemmas that doing so may entail. 
It nevertheless stops short of clarifying the data scientists’ obligations in cases where the client wants the data scientist to do something that is legal, but ethically questionable:

A data scientist shall not counsel a client to engage, or assist a client, in conduct that the data scientist knows is criminal or fraudulent, but a data scientist may discuss the consequences of any proposed course of conduct with a client and may counsel or assist a client to make a good faith effort to determine the validity, scope, meaning or application of the data science provided. (Data Science Association, n.d., rule 3b; emphasis added)

This clause implies that data scientists are not obliged to discuss with their clients the ethics of questionable data uses. The sources cited above show that mathematicians and statisticians are generally sensitive to the ethical dilemmas and professional responsibilities related to data use and protection. At the same time, one wonders how substantively these issues are treated in data science education. A recently published data science text (Provost and Fawcett 2013) suggests that data scientists in training may be on their own when it comes to dealing with issues like privacy. The authors (ibid.) refer to Daniel Solove’s (2006) eighty-page article “A Taxonomy of Privacy” and Helen Nissenbaum’s (2010) three-hundred-page book Privacy in Context. They conclude:

We bring this up to emphasize that privacy concerns are not some easy-to-understand or easy-to-deal-with issues that can be quickly dispatched, or even written about well as a section or chapter of a data science book. If you are either a data scientist or a business stakeholder in data science projects, you should care about privacy concerns, and you will need to invest serious time in thinking carefully about them. (Provost and Fawcett 2013, chap. 14; emphasis added)

This comment raises the question: If data scientists and business stakeholders do not get the time to think carefully about these issues during their education, will they get that time on the job? Business Education  Business schools also educate data scientists in departments of management science and information systems. They educate many future clients of data science projects as well, and their education is relevant, too, since big data analytics has increased the demand not only for data scientists but also for “data-savvy managers and analysts who have the skills to be effective consumers of big data insights—i.e., capable of posing the right questions for analysis, interpreting and challenging the results, and making appropriate decisions” (Manyika et al. 2011, 87). Business ethics is a well-established field of research and education (Christensen et al. 2007; Tsalikis and Fritzsche 1989). Many, but by no means all (Floyd et al. 2013), business schools offer courses in business ethics. Yet the content of business ethics courses does not usually cover specialized topics like the ethical dilemmas of data use and protection. Furthermore, ethics educators in business schools point out that managers in training, both future data scientists and their clients, generally lack the training and skills needed to speak up when they believe something is wrong (Gentile 2010). The future data scientists prepared by business school analytics programs may get no additional education in data protection or the ethics of data use. Program descriptions typically make little or no mention of data privacy and security (Pearson and Wegener 2013), and few programs dedicate a full course to these topics. Among nontechnical subjects, data science programs in business schools emphasize “consulting skills” to understand business requirements. Attention to ethics and data protection is also limited in the certification programs offered by industry groups.
INFORMS, a professional society with origins in management science, offers a certification program (CAP) for analytics professionals. The CAP “Code of Ethics/Conduct” (INFORMS, n.d.) admonishes professionals to behave professionally, follow all applicable laws, resist pressures to produce analyses biased toward a particular result, and avoid the unauthorized or illegal use of intellectual property. The single mention of privacy occurs in the context of human subjects protection, which is familiar in academic and health care research contexts, but much less known in the general corporate world. Privacy gets a few mentions in the materials provided to help people prepare for the CAP certification exam, but the sample test questions do not cover data protection laws or ethical dilemmas related to big data. In short, whether they are solutionists or neo-Luddites, neither budding data scientists nor clients in training appear well prepared to address the ethical gray areas of data use along with the challenges of personal data protection. In addition, data scientists may have gaps in their preparation, regardless of the academic discipline or professional association that organized it.


Summary

No single individual, group, or organization (or technology) can alone protect personal data well. Effective data protection requires the concerted action and compliance of many different parties. Unfortunately, the many divisions within and across the organizations in data supply chains create considerable challenges for effective data protection. Effective data protection is hindered by wide differences—in goals, incentives, values, education, knowledge, and informal norms of behavior—among the many individuals, groups, and organizations involved in organizational data use.

The Governance of Data Protection Is Not a Monolith

As noted above, all organizations use personal data and therefore must protect it. Organizations’ data protection activities are governed by laws, regulations, and contracts, and guided by numerous standards, frameworks of voluntary guidance, and specialized bodies of knowledge and expertise. Regrettably, the governance of data protection is far from monolithic; it contains numerous overlaps, conflicts, and gaps. This diverse and complex governance regime, intended to support data protection, also works to undermine it.

Data Protection Laws, Frameworks, Standards, and Professional Specialties

The European Union has a comprehensive data protection act that spells out the obligations of data controllers and the rights of data subjects (Carey 2015).4 The United States does not have a comprehensive approach to data protection. Instead, the privacy and security obligations of organizations and the rights of individuals in the United States are spelled out in laws and regulations specific to particular sectors, such as public (listed) companies, health care providers, financial services firms, and organizations that market to children.
Examples of the US regulations addressing aspects of data security and/or privacy include Sarbanes-Oxley (applicable to public companies and concerned with financial reporting and information security), HIPAA (health care), Gramm-Leach-Bliley (financial services), and COPPA (children’s online privacy). Some organizations are also covered by rules about experiments on people (Grimmelmann 2015). Because of the fragmentation of US data protection laws, most organizations doing business in the United States have to satisfy the data protection demands of multiple regulators. Of course, multinational enterprises have to accommodate several different data protection regimes. The multiplicity of data protection laws is supported, and possibly complicated, by a variety of security and privacy standards as well as voluntary guidance frameworks. Illustrations include the ISO 27002 security standard, the National Institute of Standards and Technology’s cybersecurity framework, generally accepted privacy principles, and fair information practice principles. Various professional societies provide related education and certifications, including the Information Systems Audit and Control Association and the International Association of Privacy Professionals.

Data Management Regulations

The legal and practical complexity of data protection laws is only part of the story, however, because a large number of additional rules and regulations apply to how organizations manage the data they hold, including but not limited to the personal data of employees and customers. In financial services, for example, data-related regulations have been enacted to prevent, detect, and punish crimes such as insider trading, market and rate manipulation, money laundering, and discrimination on the basis of race or gender in lending and insurance. In addition, regulations aimed at ensuring financial stability, such as Basel and Dodd-Frank, require organizations to build data analytic models and then report the results of these analyses on a regular basis. (Banks may find it necessary to purchase costly data from external data providers such as Bloomberg to perform these analyses well.) Banks and other financial services firms are subject to prosecution and large fines when they cannot provide regulators with accurate data in the form of reports and analyses such as liquidity “stress tests” (Effinger 2015; Rexrode 2015). Public companies are also subject to rules about accounting for and reporting on their financial management; these rules cover data security, too. Laws and corporate concerns about potential litigation dictate practices of record retention and destruction. Finally, financial services firms may voluntarily participate in data standardization initiatives designed to improve regulatory compliance and the efficiency of transactions between organizations. Examples include Mortgage Industry Standards Maintenance Organization data and process standards in mortgage lending (Markus et al. 2006), and legal entity identifier naming conventions in financial reporting. Thus, organizations face many demands and constraints on their data-handling practices in addition to those intended to protect the security and privacy of personal data.
In a recent description of “How We Do Business,” JP Morgan Chase and Co. (2014, 77) reported having “more than 250 banking, securities, and commodities regulators overseeing our business globally.” Given the large number of data-related regulations to which individual organizations may be subject, it is not surprising that corporate executives view “regulation changes and heightened regulatory scrutiny” as their major current business risk (North Carolina State University’s ERM Initiative and Protiviti 2015, 5).5 As one would expect, financial industry compliance professionals are now in high demand (Effinger 2015).

Too Many Rules

Each individual data management and protection rule can be costly to implement; implementing many rules separately can create a crushing compliance burden. Efficiency considerations suggest the value of harmonizing the rules and adopting an integrated approach to compliance (Proctor, Wheeler, and Pratap 2015). Not surprisingly, an area of professional practice and associated body of knowledge, known as governance, risk, and compliance (GRC), has emerged to address this challenge (Hill 2009). Certification of GRC professionals is available, and technology vendors offer software and data products designed to support GRC programs (ibid.).6


M. Lynne Markus

Implementing a GRC program can generate significant savings for organizations. Consulting reports describe savings on the order of 30 percent for companies that centralize, harmonize, and offshore their compliance and control functions (EYGM Limited 2014). Fiserv, a global provider of information management and electronic commerce systems for the financial services industry, explained its benefits from adopting GRC technology: The company estimates that to produce the type of detailed risk profile it gets from the software over a three-month period now, it would previously have taken about six months using Fiserv’s old manual process. The older method would also have required seven to 10 more staff members and would have cost Fiserv an additional half-million dollars. (Violino 2012, para. 30)

As useful as they are, however, GRC programs and technologies cannot completely eliminate the challenges of compliance with data protection and management regulations. Some regulatory conflicts cannot easily be harmonized. European data protection regulations require organizations to minimize the collection and analysis of personal information in order to preserve privacy (Carey 2015), whereas a new financial regulation, BCBS 239, requires them to actively manage and analyze data in order to reduce risk. Among other things, BCBS 239 requires companies to keep complete as well as accurate records (Deloitte, n.d.). How financial organizations can reconcile these conflicting demands is not entirely clear.

GRC programs also cannot cope with several noteworthy gaps in data protection regulations. In the United States, for instance, health data in government and providers’ medical records are strictly protected, but health data available from other sources are not. Data scientists have been able to mine information from pharmacy, social media, and data broker sources to identify potential candidates for clinical trials, thereby circumventing the onerous controls on searches of medical records (Walker 2013). Another interesting legal gap is created by autonomous systems that operate without “continuous human intervention” (Dahiyat 2007, 387). Examples include self-driving cars and algorithms that automatically determine creditworthiness and mortgage eligibility (Markus et al. 2008). Increasingly, autonomous systems are being designed to “learn” and evolve on the basis of their “experience” (see DeDeo, this volume). Such algorithms raise vexing questions of patent law (see Abbott, this volume), contract law (Dahiyat 2007), liability law (Dahiyat 2010; Elish and Hwang 2015), and fairness.

Too Many Chiefs

Data protection laws and regulations require organizations to designate executives to be held accountable for compliance.
This has led to a proliferation of high-ranking organizational positions, each of which has a role in managing and protecting data. The chief privacy officer (Bamberger and Mulligan 2011a) and chief information security officer (Johnson and Mulvey 1995) are obvious instances. Other data-oriented chief roles include, on the compliance side, the chief financial officer, chief legal officer (general counsel), chief compliance officer, and

Obstacles on the Road to Corporate Data Responsibility 


chief risk officer, and on the technical side, chief information (or technical) officer, chief data officer, chief digital officer, chief information security officer, and chief analytics officer. Many organizations have several such chiefs. With all these chiefs, an obvious question is: How well do they collaborate in data protection? Articles in the business press suggest that organizations are uncertain about where to locate these positions on the organizational chart, what their reporting relationships should be (should one chief report to another?), and how to divide and coordinate their responsibilities for various data management and protection activities. The importance of good working relationships among the various chiefs has been mentioned (Wheatman and Akdshay 2015), and a few articles hint at antagonisms, turf conflicts, and the possibility of gaps in data management and protection resulting from unclear divisions of labor. When Bank of America, for example, was recently taken to task by regulators for deficiencies in its stress testing, a legally mandated financial risk analysis and reporting activity that is heavily data intensive, executives apparently disagreed over responsibility (Rexrode 2015, para. 10). In short, the large and growing numbers of chiefs having a prescribed or plausible role in data management and protection make it challenging “to determine [who is] responsible for legal compliance, and what is the scope of their individual responsibilities” (Carey 2015, 307) inside organizations as well as among the organizations in data supply chains.

Data Contracts

Data protection and management are governed not only by laws and regulations but also by private data agreements. Familiar examples of organization-to-individual data contracts are the published privacy policies of health care and financial services providers and the “terms and conditions” of websites and apps (see Cate, this volume).
Contracts containing data-related “terms of use” provisions also govern the relationships between organizations and their data brokers, cloud services providers, technology vendors, and business process outsourcers. Public information on the provisions of these agreements is hard to come by because they are typically regarded as confidential. Nevertheless, it seems reasonable to assume that they vary in the quality and enforceability of their data protection provisions. For instance, contracts may not “bind third parties in the onward transfer of personal information” (Hartzog 2012, 689). Furthermore, the nature of data contract provisions may make it difficult for organizations to follow the recommended practices of responsible data use. Kate Crawford and Jason Schultz (2014) advocate “procedural data due process” as a way to address the harms of algorithmic decision making, such as discrimination in mortgage lending or insurance. Under procedural data due process, companies that make algorithmic decisions would be obliged to give hearings to consumers with grievances and “correct the record,” if needed, including “examining the evidence used, including both the data input and the algorithmic logic applied” (ibid., 127). But a frequent input to the algorithmic decision making of
financial organizations is data purchased from brokers, and data brokers’ contracts typically limit buyers’ access to the sources of the personal data in the purchased data sets (Tanner 2013). As a result, neither the consumers nor the financial institutions in a procedural due process hearing might be able to get the information they need to correct erroneous data at their source. Finally, even when the data protection provisions of contracts between organizations are sound and appropriate, all kinds of contracts among business partners are variably enforced (Johnson and Sohi 2016).

Summary

In short, the legal and regulatory environment of data protection is fragmented, creating a patchwork of rules with overlaps, conflicts, and gaps. Standards, frameworks, and technologies for dealing with this complexity abound, spawning bodies of specialized knowledge and certification programs for an army of data protection and regulatory compliance specialists (Bamberger and Mulligan 2011b). This situation has a number of important implications for the effectiveness of organizational data protection. First, there is no easy or inexpensive way to ensure end-to-end governance across complex data supply chains (Hartzog 2012; Markus and Jacobson 2015). Second, there may be a lack of clarity inside organizations about who is responsible for what. The lack of such clarity often creates internal conflict and gaps in policies along with their enforcement. Third, people who are not specialists in data protection cannot be expected to be knowledgeable about the issues and rules. For example, a corporate recruiter may not know that US laws governing employer access to employee personal online information vary from state to state (Park 2014). US data scientists and their clients may not know that electric meter data is protected personal data under EU law.
Unfortunately, awareness of the issues and rules does not figure prominently in the educational preparation of data scientists and their clients. This increases the pressure on the chiefs to work well together, design good rules, communicate the rules effectively, monitor employee compliance continuously (possibly invading their privacy in the process), and diligently enforce the rules. Lastly, what is legal and possible for an organization to do with big data may be unethical or socially unacceptable. Organizations are told that they cannot just rely on following the law; in addition, they must articulate organizational values and ethical principles (Culnan and Williams 2009; Davis 2012), adopt industry ethical codes (Bennett and Mulligan 2012) or develop their own, and take steps to instill these codes in corporate culture (Bamberger and Mulligan 2011b; Hu et al. 2012). These meritorious proposals are challenged by differences in beliefs and values about the benefits and risks of big data as well as the need for data protection. Differences in beliefs and values may lead to poor individual compliance, even when organizational policies are sound.

Data Management Capability Is Not a Monolith

Let’s assume that governance is a precondition for the success of endeavors like data protection that involve the activities of many different actors. Governance alone cannot guarantee success, however. The goals and policies set by governing bodies must be translated into organizational practices routinely enacted by skilled employees. That takes resources—time, money, and political capital. Organizations can create and staff data protection and management roles like chief privacy officer and chief compliance officer without allocating directly to them the resources necessary to carry out their regulatory mandates, particularly when these mandates require technological change. In many organizations, the likely source of the technical resources needed for data protection is the organizational unit or units involved in information technology and data management (Wheatman and Akdshay 2015). Within information technology units, data protection is one of several, possibly competing priorities (Proctor, Wheeler, and Pratap 2015).

Why Should Organizations Protect Data?

It is useful to review organizational motivations to protect personal data. In brief, there are three motivations: data protection is the right thing to do because it benefits society along with the persons about whom data are collected and analyzed; data protection is required, by law or custom; and data protection is a good thing to do because organizations can benefit from doing so. These motivations do not necessarily conflict. Yet they are more compelling when they reinforce each other. Values differ, and not everyone believes that protecting personal data is a strong moral imperative for organizations (Davis 2012). It goes without saying that data protection is not the first priority of most organizations; indeed, it is always ancillary to the core mission of organizations in finance, health care, and retailing.
So if data protection (or its resource demands) is seen as conflicting with an organization’s core mission, data protection is likely to suffer. The law mandates certain data protection activities, and noncompliance can be heavily punished. In addition, data breaches and cyberattacks are the focus of much current public and regulatory attention. As a result, data protection has become a major—if not the major— current priority of many organizational executives and boards of directors (North Carolina State University’s ERM Initiative and Protiviti 2015). People in organizations, though, differ in how much they are motivated by fear of doing the wrong thing (Gentile 2010). For some, the desire to avoid losses from poor data protection is outweighed by the desire to gain benefits from pursuing the organization’s core mission through innovative applications of big data analytics. Regulatory mandate alone therefore may not be sufficiently compelling to attract enough resources for effective data protection. The argument can also be made that organizations gain positive benefits from effective data protection. An organization may be able to enhance its reputation with the public by
becoming known for its data protection practices. (At the same time, actively marketing data protection excellence could increase liability in the case of a breach.) Data protection has the potential to generate financial benefits for organizations, too, in at least two ways. First, devoting attention and resources to data protection can reveal opportunities for improvements in operational efficiency (e.g., GRC programs that substantially reduce compliance costs), or innovative new products and services (e.g., banks’ identity theft protection services). Second, devoting attention and resources to data protection can encourage employees and organizational units to actively pursue big data projects, which may then yield financial benefits. Ironically, despite all the hype, organizations have not yet deployed big data analytics as pervasively as expected. Among the reasons is an unwillingness to share data within organizations. A business unit in charge of checking and savings accounts, for instance, may be unwilling to share customer data for cross-marketing purposes with a business unit in charge of mortgage lending. An unwillingness to share data may reflect organizational turf battles. But some information technology experts have observed that data sharing is hindered by too-tight internal controls (Tallon, Ramirez, and Short 2014) or excessive concern about the legality or acceptability of a proposed data reuse. By putting in place a process to vet proposed data uses in light of regulations and company values, organizations can encourage employees to share and reuse data. Intel called its version of such a program “protect to enable” (Tallon, Short, and Harkins 2013, 189). Organizations can have compelling motivations to pursue data protection. But the motivation to protect data is often linked to the motivation to use data. 
It should not be surprising, then, that a likely source of the resources required for the implementation of policies set by regulatory chiefs (e.g., financial, privacy, risk, and compliance) is the technical chiefs (e.g., information, information security, data, and analytics) who are collectively responsible for supporting intensive data use.

The Varieties of Data Management

In the realm of the technical chiefs, data protection is one of several data management disciplines and priorities (Khatri and Brown 2010). DAMA International, a leading professional association, has developed a framework and body of knowledge on data management, comprising the following disciplines:7

•  Data architecture management
•  Data development
•  Data operations management
•  Data security
•  Reference and master data management
•  Data warehousing and business intelligence management
•  Document and content management
•  Meta data management
•  Data quality management
This framework does not include privacy as a top-level topic (privacy is included under security), nor does the framework adequately deal with the unique challenges of big data and analytics (such as algorithm version control and updating). More important for data protection, each of the disciplines in the DAMA framework is a major organizational data-related activity and priority in its own right. Consider data operations management. This essential discipline ensures that data are available when needed and provides backup and recovery from crashes. Data architecture management is a particular challenge for retail banks, which are often organized into product-focused units or have grown through acquisition. These banks can have multiple stores of customer data, each with different naming and retrieval conventions. In order to use big data and analytics for activities like cross-selling financial products to their customers, banks may have to engage in a costly program of systems or data integration. Even banks with well-designed data architectures may suffer from data quality problems. Some banking data are key entered by people (including customers) who are not directly affected by erroneous or missing data, resulting in data that are insufficient to support analytics projects. Thus, organizations need to create data quality metrics, assign responsibility for routine monitoring of the indicators, and take corrective action. Although some people have argued that data quality is unimportant in the presence of massive big data reservoirs (Mayer-Schonberger and Cukier 2013), data scientists frequently mention the challenges involved in accurately matching data (Wigan and Clarke 2013). In short, getting value from big data and analytics involves a number of supportive data management disciplines, of which data protection is only one. 
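The data quality monitoring described above can be made concrete with a small sketch. Everything here is illustrative: the field names, the format check, and the sample records are invented, not drawn from any real banking system.

```python
# Illustrative data quality metrics for customer records (hypothetical fields).
# Completeness: share of records with every required field present and non-empty.
# Validity: share of records whose postal code passes a simple format check.
import re

REQUIRED = ("customer_id", "name", "postal_code")

def completeness(records):
    """Fraction of records in which every required field is present and non-empty."""
    ok = sum(1 for r in records if all(r.get(f) for f in REQUIRED))
    return ok / len(records) if records else 0.0

def validity(records):
    """Fraction of records whose postal code matches a five-digit pattern."""
    ok = sum(1 for r in records
             if re.fullmatch(r"\d{5}", str(r.get("postal_code", ""))))
    return ok / len(records) if records else 0.0

records = [
    {"customer_id": 1, "name": "Ada", "postal_code": "02145"},
    {"customer_id": 2, "name": "", "postal_code": "0214"},   # missing name, bad code
    {"customer_id": 3, "name": "Grace", "postal_code": "10001"},
]
print(completeness(records))  # two of three records are complete
print(validity(records))      # two of three postal codes are valid
```

In practice such indicators would be computed routinely and assigned to an owner, as the text suggests; the point of the sketch is only that the metrics themselves are simple to define.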
Depending on an organization’s level of data management “maturity,” data protection may not be high on the list of the data-related priorities of the people charged with data management (Friedman, White, and Judah 2015). Furthermore, significant organizational events, such as mergers or acquisitions, strategic redirections, and corporate reorganizations, can require data rearchitecting, thereby de-emphasizing data protection on the list of an organization’s data priorities.

Summary

All things considered, data protection is an organizational challenge that competes with many other issues for management time, attention, and resources. Although external events (high-visibility data breaches in other organizations) and changing regulations (cybersecurity breach reporting rules, and the increased liability of directors and officers) can refocus priorities in the short run, continuing shortfall in the implementation of better data protection policies, practices, and technologies is probably a fact of life. Within the technical units charged with supporting data and technology, data protection is just one of several crucial data management disciplines. As a result, proposals for
investment to improve data protection may compete with proposals for investment to support intensive data use.

Concluding Remarks

Every organization in every data supply chain needs to do its part for effective data protection. Within organizations, all employees must do their parts as well. But organizations and their employees face significant hurdles in the discharge of data responsibility. The first challenge lies in the nonmonolithic character of data use within and across organizations. Data protection requires the coordinated and concerted efforts of many individuals and groups that differ widely in organizational role, occupational specialty, knowledge, incentives, culture, and values. Divisions among these individuals and groups create variable commitments to data protection. Second, the regime of data governance now in place to support data protection also undermines it. There are too many conflicting and incomplete rules. Specialized experts have emerged to navigate the rules, but they are too numerous, contentious, and thinly placed to address the ethical issues that continue to arise as technology evolves. Perhaps more important, the growth of data protection as an area of specialization (several, actually) seems to run counter to the ideal of getting all organizational members to embrace data protection (Bamberger and Mulligan 2011a; Hu et al. 2012). Particularly worrisome are the ethics gaps (McGinn 2015) among data scientists, clients, and specialists in data protection. Data management regulations impose large and growing costs on organizations. Unless or until organizations can find ways to benefit from better data protection (e.g., through cost reduction, competitive advantage, or innovation), their compliance will be formulaic and grudging. Third, data protection is only one of several technical data management disciplines and priorities.
Data protection is costly, and organizations are told that they should not protect all data equally but rather in proportion to their value to the enterprise. One might prefer that advice to refer to the information’s value to the individual whose personal data are involved. But the fact remains that in order to afford the cost of protection, organizations must have benefits from using data. Governments and data businesses clearly have those benefits, but for many organizations, the big benefits of big data and analytics still lie in the future. This means that proposed investments for data protection may compete with proposed investments to support data use. The good news is that organizations can have strong motivations to protect personal data. Not only does data protection benefit employees, customers, and all citizens, and not only is data protection a regulatory mandate, but better data protection and management can also benefit organizations in their reputations and bottom lines. These observations have important implications for policy proposals intended to improve data protection, whether the proposals emphasize government regulation or organizational
self-governance. For one thing, getting all individuals and groups lined up and moving in the same direction is a daunting challenge at best, and there will be no quick technical fix. For another, it is hard to get data protection “just right.” Too little data protection is bad for organizations, not just employees and customers, because it leaves organizations vulnerable to risks from both inside and out, and because it discourages employees from sharing data and benefiting from the use of analytics (Tallon, Short, and Harkins 2013). But too much protection is also bad. Not only is it costly to organizations; it is bad for people because it can paradoxically encourage organizational employees to ignore or circumvent the rules. Thus, merely adding more rules to the existing mix is likely to be counterproductive. A clean-sheet redesign of today’s data protection regime may not be in the cards. Yet for every new rule added, two older rules should be stricken from the books.

Acknowledgments

This work was supported in part by the National Science Foundation under award #1348929, M. Lynne Markus, principal investigator. The opinions expressed herein are those of the author, not of the National Science Foundation.

12  Will Big Data Diminish the Role of Humans in Decision Making? Michael Bailey

The Data Revolution and Business

Organizations are beginning to capture and analyze an unfathomable amount of data.1 Rare is the transaction or customer interaction that isn’t logged in some electronic format, and the plummeting cost of data storage is allowing firms to store and retain an increasing amount of these data. Between 2005 and 2015, the cost of hard drives decreased by 93 percent, from $0.72 to less than $0.05 per gigabyte (McCallum 2015). It is estimated that Google has ten to fifteen million terabytes (Carson 2014) at any given time. To put that in perspective, all the words ever spoken by the human race are estimated at about five million terabytes (Dhia 2006). The explosion of firms, organizations, people, and governments building data-driven products, services, and policies has been dubbed the data revolution or the era of big data (Kitchin 2014). Firms in particular have been scrambling to profit from the data revolution by assembling large teams of information technologists and data scientists. The breakthroughs of the 2000s (including Google File System [Ghemawat, Gobioff, and Leung 2003], MapReduce [Dean and Ghemawat 2008], and Hadoop [Shvachko et al. 2010]) and ever-increasing computational power (Schaller 1997) of computers have paved the way for extracting value from incredibly large data. The worldwide business intelligence, analytics, and performance management software market increased from $8.8 billion in 2008 to $14.3 billion in 2013—a growth of 63 percent in five years (Gartner 2008, 2014). Analytic capabilities have had to keep pace with the deluge of data. Facebook (2013) recently open sourced Presto, its in-house structured query language engine for enormous data sets. Previously, simple data queries on massive data sets could take minutes to hours or more; Presto can often return results in a few seconds. Several software companies are competing to bring the analytic capabilities of Facebook and Google to any firm.
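The growth figures quoted above follow from elementary percent-change arithmetic; the sketch below simply recomputes them.

```python
# Percent-change arithmetic behind the figures cited above.
def pct_change(old, new):
    """Percentage change from old to new; negative values are decreases."""
    return (new - old) / old * 100

# Hard drive cost per gigabyte, 2005 -> 2015 (McCallum 2015)
print(pct_change(0.72, 0.05))  # about -93, i.e., a 93 percent decrease

# BI, analytics, and performance management software market,
# 2008 -> 2013 (Gartner 2008, 2014)
print(pct_change(8.8, 14.3))   # 62.5, the roughly 63 percent growth quoted
```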
Tableau, a company that offers data analytics and visualization software, has grown to service over twenty-six thousand customers and is valued at over $6 billion (Lauchlan 2015). The data revolution is creating new industries and revolutionizing existing ones, opening up completely new products and operational capabilities. For example, Walmart processes on
the order of two hundred million transactions daily, and its data warehouse combines data about hundreds of millions of customers to run analyses on petabytes of historic data (Rijmenam 2012). By combining weather and transaction data, Walmart was able to determine what customers purchased leading up to a hurricane (“Data, Data Everywhere” 2010) so that it would be able to dynamically alter its store selections in advance of any disaster. Some firms need to process these large data sets just so their product functions due to the sheer size of their user base. Facebook users share billions of pieces of content, create billions of “likes,” and upload hundreds of millions of photos daily (Vagata and Wilfong 2014), and Google processes several petabytes of search data daily (Dean and Ghemawat 2008). This immense volume of data allows these firms to build accurate prediction algorithms that are the foundation of some of the most well-used Internet products. Amazon can offer product recommendations based on all your actions on Amazon sites, Facebook can curate your news feed to include the most relevant stories, and Google can refine search results based on the outcomes of millions of related searches. Besides improved services and products, one of the implicit promises of the data revolution is that organizations will now be able to incorporate relevant data into every decision. The idea of the data-driven firm has come into prominence in business schools and management theory (Frick 2014), and describes a firm that relies more on objective data for decision making than the subjective assessments of managers (McAfee 2010). To what extent is big data a substitute or complement to humans in decision making? Could the data revolution signal the end of subjective decision making and usher in a new era of automated organization? 
To the relief of middle management everywhere, expertise and human judgment will continue to play a prominent role in decision making, and for several reasons, could in fact play a larger role in the era of big data. First, most of the data that firms obtain relate to their current business and products, which frequently offer little insight and poor predictors for new markets, products, or policies. Second, decision makers within firms often have to balance multiple competing objectives and priorities, and sometimes make decisions with important moral or cultural considerations that are unlikely to be helped with better data or models. Finally, big data usually must be combined with qualitative data like surveys, subjective assessments, human-labeled data, and expert predictions to be useful or put into context. Although there is rapid development on methodologies for data-driven decision making like contextual experimentation, deep learning, and artificial intelligence, we are still a long way off from the automated decision maker.

Big Data and Strategy

In September 2011, Netflix announced that it was splitting its DVD rental business and video-streaming business into two separate units. The DVD rental business would be rebranded Qwikster, and subscribers to both DVD rentals and streaming would now manage
separate plans and receive separate bills. The pushback was immediate and brutal; users left the service in droves, and Netflix was excoriated in the press. Rising account cancellations and a plummeting stock price led Netflix to announce, about six weeks later, that it was not going to pursue the Qwikster plan (Wingfield and Stelter 2011). The damage to the company and brand was enormous; there were eight hundred thousand fewer subscribers, and nearly $4.5 billion of shareholder value had been erased. The Qwikster debacle has entered the business case studies as a strategy blunder on the highest level (Seijts and Bigus 2012). Surprisingly, Netflix is also one of the most data-driven companies in existence (Bulygo, n.d.). Through rigorous tracking of user activity, it could calculate the amount of content that a user had to watch each month to remain an active user. Seventy-five percent of Netflix’s viewing activity is based on its movie recommendation system powered by user ratings. It famously created the Netflix Prize (Bell and Koren 2007), a contest to improve its movie recommendation algorithm by releasing a set of movie ratings data to teams, which then would compete to create the best prediction algorithm. The Qwikster decision highlights how difficult it can be to use data to make strategic decisions involving new policies and products. Even though firms are swimming in data, these data come from current operations and user behavior within existing products that might offer little predictive insight about new products and policies. These strategic decisions differ fundamentally from the operational ones that are most amenable to analytic insights. Strategic decisions focus on the extensive margin for the firm’s business (e.g., what new products the firm should build, or what new opportunities the firm should take advantage of), whereas operational decisions revolve around the intensive margin (e.g., how to improve the existing business). 
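The rating-prediction task behind the Netflix Prize mentioned above can be illustrated with a minimal baseline predictor of the kind described in that literature: the global mean rating adjusted by a per-user bias and a per-movie bias. The ratings below are invented for illustration; real systems add regularization and latent-factor models on top of such baselines.

```python
# A simple rating baseline of the kind used in the Netflix Prize literature:
# predict(user, movie) = global mean + user bias + movie bias.
# The ratings are invented for illustration.
from collections import defaultdict

ratings = [  # (user, movie, stars)
    ("u1", "m1", 5), ("u1", "m2", 3),
    ("u2", "m1", 4), ("u2", "m3", 2),
    ("u3", "m2", 4), ("u3", "m3", 3),
]

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating

user_r, movie_r = defaultdict(list), defaultdict(list)
for u, m, r in ratings:
    user_r[u].append(r)
    movie_r[m].append(r)

def bias(rs):
    """Average deviation of a user's or movie's ratings from the global mean."""
    return sum(rs) / len(rs) - mu

def predict(user, movie):
    """Baseline prediction: global mean plus user and movie biases."""
    b_u = bias(user_r[user]) if user in user_r else 0.0
    b_m = bias(movie_r[movie]) if movie in movie_r else 0.0
    return mu + b_u + b_m

# u1 rates generously (+0.5) and m3 is rated below average (-1.0),
# so the unseen pair (u1, m3) is predicted at 3.5 + 0.5 - 1.0 = 3.0.
print(predict("u1", "m3"))  # → 3.0
```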
Strategic decision making involves stepping into the unknown—a realm where the data to understand the best way forward do not exist or would be impossible to acquire. Even if computational power and analytic capabilities increase exponentially, without any data the firm will have to rely on expertise to decide how to navigate the dynamic marketplace. This does not mean that data are not useful for strategic decision making; there are many tools to acquire data for new markets, such as market surveys and qualitative data, demand modeling, case studies for similar products, and so on. But how you use and combine these data will be a subjective exercise, and two individuals with the same data might arrive at different conclusions. Sometimes a creative use of data can help craft a strategic prediction. Netflix redeems itself in our story through its use of data to understand the value of providing original content. Netflix made an enormous strategic bet by spending over $100 million to purchase two seasons of the television show House of Cards (Bulygo, n.d.). To arrive at that decision, it looked at how users responded to the British version of the show, how users responded to movies by director David Fincher, the correlation between viewership of the British version and shows with Kevin Spacey, and how many users fell into these categories. These data gave Netflix the confidence to make such a large bid for the show. Although one could say that big data


Michael Bailey

allowed Netflix to make the decision, it involved a high degree of creativity and insight on how to combine the data into a strategic prediction—a process unlikely to be replicated by an algorithm. It highlights that even if intuition is in the driver’s seat for strategic decision making, data can give you a huge advantage over those who use intuition alone. Morality and the Machine The Ford Pinto is one of the most infamous car recalls of all time. In 1978, the National Highway Traffic Safety Administration ordered Ford to recall the Pinto for purportedly containing a crucial design flaw in the fuel tank that would cause the car to burst into flames after some accidents (“Ford Pinto,” n.d.). Over twenty-three deaths were attributed to Pinto fires, and a jury awarded $125 million in punitive damages against Ford (later reduced to $3.5 million by a judge) in a resulting lawsuit. Although further studies showed that the Pinto was probably as safe as comparable vehicles of its class (Schwartz 1991), Ford was universally condemned for a pattern of unethical decision making and corruption within the organization leading up to the recall (Bazerman and Tenbrunsel 2011). Engineers had discovered the flaw in the fuel tank, but managers had decided it would be too costly to retool the plant and proceeded with the flaw. As complaints and deaths began to rise, decision makers at Ford determined that the cost of lawsuits would be less than the cost of a recall—a cost-benefit analysis that, when discovered, led to widespread outrage. Yet every automaker continuously evaluates reported problems with its vehicles and uses statistics to model the probability of a true safety problem. The challenge is determining the threshold of evidence required to issue a recall. If we recalled all cars at the first sign of any problem, automobiles would likely be prohibitively expensive. 
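The kind of cost-benefit calculation described above can be sketched in a few lines of Python. Every number here is invented for illustration; these are not Ford’s figures, and a real model would estimate the failure probability from incident data rather than assume it:

```python
# Illustrative sketch (all numbers invented): the recall-threshold decision
# framed as an expected-cost comparison between recalling and doing nothing.

def expected_recall_cost(units_sold: int, cost_per_unit: float) -> float:
    """Cost of recalling and repairing every unit in the field."""
    return units_sold * cost_per_unit

def expected_liability_cost(units_sold: int, p_failure: float,
                            cost_per_incident: float) -> float:
    """Expected cost of doing nothing: expected failures times cost each."""
    return units_sold * p_failure * cost_per_incident

units = 1_500_000        # hypothetical fleet size
repair = 11.0            # hypothetical per-unit fix cost
p_fail = 1e-5            # modeled probability of a dangerous failure
incident = 200_000.0     # hypothetical settlement cost per incident

recall = expected_recall_cost(units, repair)
liability = expected_liability_cost(units, p_fail, incident)

print(f"recall: ${recall:,.0f}  liability: ${liability:,.0f}")
print("recall" if recall < liability else "do nothing")
```

With these invented numbers, the purely financial rule declines to recall, which is precisely the kind of conclusion that provoked public outrage in the Pinto case: the model is easy to compute, but deciding whether to follow it is a moral judgment.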
It isn’t just carmakers; any firm producing a physical product must model the risk that its product poses to consumers, and weigh that risk against the potential value to consumers and firm profits. There are no set guidelines on how firms should perform these cost-benefit analyses, and laws for consumer safety and protection vary by state and country. To varying degrees, all firms have a direct role in consumer welfare, and must make difficult moral judgments about the trade-off between customer welfare and well-being, on the one hand, and profit, on the other. Big data is changing how automakers and other firms handle recalls (Cox 2015). Firms can now store far more data about how products are produced, sold, and delivered, and link these to all reported problems and complaints, allowing companies to identify the source of problems and connect related problems. General Motors was able to limit the recall of a certain model of car with a faulty valve to only four cars because of its advanced manufacturing tracking (Nelson 2013). What big data is unable to address is how firms should handle these difficult moral trade-offs. Not only do firms need to employ experts to subjectively assess the business

Will Big Data Diminish the Role of Humans in Decision Making? 


cost of risk, such as the cost of lawsuits and public relations damage, but they also must decide how to value customer welfare and well-being. For example, vaccine makers could save thousands of lives by lowering the cost of their product (LaMattina 2013). Is it morally defensible for them to make a profit at all? Could we ever expect more data to answer this question for us? Not all cases are as drastic as life-and-death scenarios involving customers dying from faulty products or potentially saving thousands of lives by lowering the price of a vaccine. Charging a higher price for a product results in a direct transfer of wealth from the customer to the firm, which has led to increased public discourse on the ethics of pricing. Payday loan companies are often criticized for taking advantage of the poor and uneducated with their astronomical fees and rates, with some arguing that their customers are unable to fully understand the financial decisions involved. Many people were outraged that stores were charging higher prices for essentials during the 2013 Colorado floods (“Immoral But Not Always Illegal,” n.d.). The heart of the problem is that it is not always clear what the firm should seek to optimize. An algorithmic firm whose sole objective is maximizing shareholder value might make decisions that we would consider gravely unethical. Beyond questions of ethics, firms are usually pursuing several objectives (e.g., improving shareholder value, employee morale, environmental concerns, and/or customer retention), and it is not always apparent how to prioritize or balance them. Facebook’s news feed, the curated list of posts and stories shown to users of the site, is the first thing users see when logging in. Facebook has stated that the objective of the news feed is to “deliver the right content to the right people at the right time so they don’t miss the stories that are important to them” (Backstrom 2013). What is the right content? 
Is it the content that generates the most likes or comments? Is it the content that leads users to log in more frequently or dwell longer on the page? Even though Facebook is swimming in data on how people respond to its news feed experience, the larger challenge is deciding what the objective of the news feed ranking should be. Someone has to decide what it means to deliver the “right content,” and while an algorithm could find a ranking that would deliver the most likes or time spent on the site, someone must decide the ultimate objective of the ranking and what metrics indicate that Facebook’s news feed is improving. An algorithm alone would be unlikely to understand what it is about a story that would cause a user to feel connected, informed, or special. For this reason, Facebook employs human raters (Levy 2015) to give detailed feedback on their news feed experience, including why, specifically, they liked or disliked certain stories, which stories they didn’t want to miss, and how the stories made them feel. Combining these qualitative data with the large volume of user data from the site gives Facebook visibility into several different dimensions of how users are responding to its product, and it is up to the decision makers to figure out how to weigh different pieces of feedback. It is often the case that you need to combine qualitative data and human guidance with big data to make them useful.
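As a purely hypothetical sketch of the judgment involved, consider blending behavioral metrics with survey feedback into a single ranking score. The function, weights, and normalizations below are all invented; the point is that the weights themselves encode a human decision about what “right content” means:

```python
# Hypothetical sketch: combining quantitative engagement signals with a
# qualitative survey score into one ranking objective. The weights are
# invented -- choosing them is exactly the human judgment described above.

def story_score(likes: int, seconds_viewed: float, survey_rating: float,
                w_likes: float = 0.2, w_time: float = 0.3,
                w_survey: float = 0.5) -> float:
    """Weighted blend of behavioral metrics and a self-reported rating (0-5)."""
    # Normalize each signal to a rough 0-1 range before weighting.
    return (w_likes * min(likes / 100, 1.0)
            + w_time * min(seconds_viewed / 60, 1.0)
            + w_survey * survey_rating / 5)

clickbait = story_score(likes=95, seconds_viewed=8, survey_rating=1.5)
friend_news = story_score(likes=12, seconds_viewed=45, survey_rating=4.8)

# With survey feedback weighted heavily, the meaningful story outranks
# the clickbait despite far fewer likes.
print(clickbait, friend_news)
```

Shift the weights toward likes and the ranking flips; no volume of behavioral data, by itself, says which weighting is right.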

Going Small before Going Big In August 2015, Facebook launched M (Metz 2015), a personal assistant embedded within Facebook Messenger. The product allows users to make requests, like asking for restaurant reservations or purchasing an anniversary gift, that are routed to virtual assistants. One interesting detail of the product is that these requests are also routed to an artificial intelligence that tries to fulfill them independently, supervised by the human virtual assistants. In early 2015, Facebook bought an artificial intelligence company (Constine 2015) to help its developers with speech recognition; its models will power the artificial intelligence behind M. Humans will be able to fulfill the requests that the artificial intelligence is unable to tackle, or complete the portion of the tasks that require a human (calling the restaurant to make the reservation) while leaving anything that can be automated to the artificial intelligence (pulling up the Yelp page when a user requests “the best ribs in Texas!”). Every step the human assistants make in response to queries is recorded and scrutinized so that eventually it can be used as input into more complex artificial intelligence models that will be able to tackle increasingly complex tasks; in essence, the humans are training the artificial intelligence. An artificial intelligence that could handle any task is far beyond the ability of our current artificial intelligence and machine-learning models. It would have to understand all aspects of human speech and context on top of a deep understanding of how to complete requests. Even if the artificial intelligence understood your request to make a reservation, it would be completely unhelpful for it to print off a reservation form and mail it to you. 
By recording how human assistants fulfill tasks, the artificial intelligence can learn the tools, strategies, and techniques used to meet a wide range of requests, and maybe more important, when to say no. The truth is that most algorithms need human assistance to make sense of data; it is hard to extract value from big data alone. One of the most widely used tools for fitting predictive models on data, supervised machine learning, needs to begin with a set of data where the outcome is known, called the training set, to be able to make predictions on data where the outcome is unknown. The model is only as good as the training set—if a pattern or trend exists in the test data but not in the training data, the model will make biased predictions. A spam classifier, for example, relies on people labeling e-mails as “spam” so that it can learn the features indicative of a spammer (“Naive Bayes Spam Filtering,” n.d.). For many applications, this means recruiting humans to label outcomes. Kent R. Anderson (this volume) argues that for many applications, small data approaches (simple models combined with substantive expertise) will yield better results than big data approaches. Nate Silver was able to produce consistently reliable political forecasts by deeply understanding the quality of the inputs into his simple model, achieving
superior results over complex models that used enormous feature sets without discriminating between inputs. Alex Peysakhovich and Seth Stephens-Davidowitz (2015) contend that big data approaches will rarely be useful when applied to decisions about our health, wealth, or happiness: “The things we can measure are never exactly what we care about. Just trying to get a single, easy-to-measure number higher and higher (or lower and lower) doesn’t actually help us make the right choice.” As discussed in the previous section, Facebook’s news feed wouldn’t work quite so well without human-labeled data. Even though Facebook has over a billion users along with countless likes, comments, clicks, and shares, it is the small data from human raters and human surveys (ibid.) that allows Facebook to understand how users feel about the stories in their feed and get meaning from the data. We combine qualitative data with big data so we can fully understand what we are measuring, and whether improving this measure is what we actually care about. When Sears began tracking its auto mechanic sales, it tried to incentivize its staff by setting a goal of $147 an hour; it found that its staff met that goal by overcharging for their services or repairing things that weren’t broken. Similarly, Facebook could deploy an algorithm that optimizes for likes, but the news feed might end up being composed mostly of memes and clickbait, while losing the stories that make users feel connected to and engaged with their friends. Combining vast amounts of quantitative information (e.g., clicks from the website) with qualitative personal feedback (e.g., survey responses) to make product decisions is a task for which computers are poorly suited. For this reason, technology firms often employ sociologists, psychologists, and behavioral economists to work alongside their data scientists to understand the meaning behind the measure. These teams are also beginning to gather data by running lots of experiments. 
Experimentation allows them to test features and policies on a small set of users, and is increasingly becoming a common way for firms to gather data to make decisions. The Rise of Experimentation Experimentation, or A/B testing, has become an increasingly popular technique for gathering data (Christian 2012) for decision making, especially for Internet companies that can easily deploy different versions of their website to different users. Improved tools for experimentation (Bakshy, Eckles, and Bernstein 2014) are being developed that will give any firm the ability to easily deploy and analyze experiments on its website. Experiments could also be used as a tool for automated decision making. Ostensibly, if the decision set is discrete and one has the ability to test each option, one could enumerate all available possibilities, launch an experiment to test each version, and then proceed with the winner. In reality, experimentation is fraught with challenges that require the careful hand of the human experimenter to set up the experiment and evaluate the results.
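To make the mechanics of evaluating an A/B test concrete, here is a minimal sketch, with invented numbers, of a two-proportion z-test — one standard way of judging whether two versions of a page convert at different rates:

```python
import math

# Minimal sketch of analyzing an A/B test: compare conversion rates in two
# groups with a two-proportion z-test (standard library only; data invented).

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) for rates conv/n."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical test: variant B's sign-up button converts 5.8 percent of
# users versus 5.0 percent for control A.
z, p = two_proportion_z(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The arithmetic is the easy part; as the rest of this section argues, choosing which alternatives to test, how many users to expose, and what metric counts as a “conversion” still falls to the experimenter.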

It is often difficult to determine which alternatives to test. Without some expertise or judgment, we might need to test every single color for our sign-up button or every single price for our product. If we try to test too many alternatives, we will be putting fewer users in each test, and will hurt our ability to find a significant difference and pick a winner. Letting a computer choose the alternatives to test creates a policy and legal nightmare, and leads to things like offering a book for sale at $23 million (Masnick 2011). We need to rely on the expertise of the experimenter to balance the information gained against the cost of additional testing, and to pick the most useful alternatives to test. There are several applications where automated experimentation makes sense—one version being the multi-armed bandit (“Multi-Armed Bandit,” n.d.). When there are large amounts of data and several versions that need testing, we can input the large set of alternatives to test, and use an algorithm to explore the options and converge on the option with the best payoff distribution.2 For its search query recommendations, eBay uses multi-armed bandit approaches (Hsieh et al. 2015). Each time a user enters a query into the search engine (e.g., “used iPhone 5”), the site begins to offer query suggestions. The algorithm will try different suggestions for each text input and converge on the ones that are more likely to be selected. Considering the universe of possible text inputs and related suggestions to try, it would be impossible to run this experiment without the help of an algorithm. Firms are also increasingly using experiments for market testing (testing new products, policies, or businesses) by launching on a small scale or in small regional markets to gather data on the feasibility of entering the new market. 
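The multi-armed bandit idea can be sketched with a simple epsilon-greedy algorithm: explore the candidate suggestions at random a small fraction of the time, and otherwise exploit whichever currently looks best. The click rates below are invented, and this is one of the simplest bandit strategies, not eBay’s actual method:

```python
import random

# Epsilon-greedy bandit sketch (invented click rates): explore candidate
# suggestions occasionally, exploit the best-looking one the rest of the time.

def epsilon_greedy(true_rates, rounds=20_000, epsilon=0.1, seed=7):
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)           # explore a random suggestion
        else:                                     # exploit the best estimate
            arm = max(range(n_arms),
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 1.0)
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:        # the user clicked it
            wins[arm] += 1
    return pulls

# Three candidate query suggestions whose true click-through rates are
# unknown to the algorithm; traffic should converge on the 11 percent arm.
pulls = epsilon_greedy([0.02, 0.05, 0.11])
print(pulls)
```

After a modest exploration phase, nearly all traffic flows to the best-paying suggestion, which is what makes the approach workable at the scale of a search engine’s query space.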
It is difficult to execute these experiments while avoiding test–control interference (Blake and Coey 2014), where the test and control subjects interact in a way that changes the interpretation of the results. Thomas Blake and Dominic Coey (2014) describe a test run at eBay on its auction participants. The goal was to determine whether the policy of sending e-mails to encourage bidding before an auction ends would increase revenue. Users were randomized into groups that did or did not receive an encouraging e-mail, and revenue was compared between the groups. These users competed in the same auctions, however, so users who did not receive an e-mail might have been more likely to lose to the users who did receive one; as a result, the difference in revenue overstated the causal effect of the e-mail encouragement on revenue. As firms collect more data and scale, it will become easier to run experiments and base decisions on experimental results, but the set of problems that are amenable to experimentation is limited. A firm cannot experiment with how much debt to issue or what companywide perks to offer. It must be content with iterating over time or making decisions based on counterfactual models requiring subjective assumptions. The Future of Automated Decision Making Many of the challenges of leveraging big data for data-driven decision making discussed in this chapter (e.g., objective prioritization, the moral and cultural consequences of decisions,
and risk assessment) are not alleviated by lowered data storage costs and increased computational power. Perhaps a more promising avenue toward a fully data-driven enterprise lies not in the data revolution itself but in potential advances in the field of artificial intelligence. The median prediction of artificial intelligence experts is that in forty years, we will have built superintelligent machines whose problem-solving acumen will exponentially exceed our own (Urban 2015). Assuming that the firm could trust the artificial intelligence to make decisions consistent with the desired ethics and objectives of the firm, the firm would just need to supply the artificial intelligence with as many data as possible and allow it to optimize operations. Ryan Abbott (this volume) discusses whether we would want to allow this artificial intelligence to have inventor rights over intellectual property produced, or if the firm would own the intellectual property produced by the artificial intelligence. Legislation in this area could crucially change how firms employ artificial intelligence in their operations; it could dampen their enthusiasm to build intellectually creative machines whose work would automatically belong to the public domain. Firms would also be anxious to understand whether an artificial intelligence would be treated as an employee under the law or could sign legally binding agreements such as noncompete contracts; a myriad of legal issues would need to be sorted out. It is unclear if we could ever fully specify all the objectives and constraints we would wish the artificial intelligence to operate under. Tim Urban (2015) looks at an example of a firm that builds an artificial intelligence to write handwritten notes for its customers. This artificial intelligence continues to learn and improve on its quest to produce as many of these notes as possible. 
Eventually, after being connected to the Internet and learning at an exponential rate, it produces nanobots that reassemble most atoms on earth into ink, paper, and note-writing machines, killing all humans in the process. In its quest to produce these notes, the artificial intelligence didn’t understand all the implicit constraints that the firm should operate under (e.g., not decimating the planet and killing humankind). Even if technology overcame these obstacles, and firms were able to outsource routine decision making and operations to an artificial intelligence, it would create an entirely new class of metadecisions on how to operate these artificial intelligence–run firms and artificial intelligence workers. It would create the need to form a more complex organization that could manage a completely different type of manager and worker. Until then, it is a safe bet that as data grow, firms will increasingly demand intelligence and creativity right along with them.

13  Big Data in Medicine: Potential, Reality, and Implications Dan Sholler, Diane E. Bailey, and Julie Rennecker

Big Data on the Rise Seeking faster, better, cheaper, and more consistent patient care, the health care industry is turning its attention to applications that can analyze large data sets, or “big data.” Today, health care delivery generates an enormous amount of data. Some of these data are in the form of experimental results: Hilda Bastian, Paul Glasziou, and Iain Chalmers (2010) reported that seventy-five new randomized clinical trials (RCT) and eleven new systematic reviews of such trials are published every day. With the advent of electronic medical records (EMR), patient treatment data are increasingly digitized and add to the experimental data. A recent survey of twenty thousand physicians noted that 85 percent of respondents reported that their practice employs EMR (Bostrom and Old Creekside Consulting 2014). Other increasingly abundant types of digitized health care data include genomic data (Contreras, this volume), medical and prescription claims, drug orders, pharmacy dispensing (Szlezák et al. 2014), and patient-generated data such as data from fitness monitors (Shilton, this volume). Big data’s rising role in health care is arguably close at hand, in large part due to federal government initiatives such as the 2009 Health Information Technology for Economic and Clinical Health (HITECH) Act. The HITECH Act designated $26 billion for health information technology, primarily in the form of incentives to providers for adopting EMR and other technologies to improve the efficiency of health care delivery (Simpao et al. 2014). As of June 2014, nearly all of that allotment had been paid out to those medical providers that had adopted the relevant technology (“Electronic Medical Records,” n.d.). The “big-data revolution in US health care,” as evidenced in these recent federal expenditures and forecast in the title of a recent McKinsey and Company report (Kayyali, Knott, and Van Kuiken 2013), appears just on the horizon. 
Despite their potential for improved patient care, lower costs, and more efficient health care delivery, big data applications in medicine present numerous challenges. The systems that generate, collect, and store these data remain disjointed, and are difficult to
integrate, temporarily hampering the application of big data to medical decision making. Patient privacy is another noted challenge that hinders the adoption of big data systems (Kayyali, Knott, and Van Kuiken 2013; Murdoch and Detsky 2013; Szlezák et al. 2014). Other obstacles include reluctance or hesitation on the part of physicians and hospitals (Szlezák et al. 2014). Poor-quality data are also a concern (Simpao et al. 2014; Webster 2013). Perhaps the most troubling challenge, however, is not technical in nature or directly related to implementation issues but instead lies in the concern that the use of big data constitutes a reductionist approach to medicine that elevates “the development of the science base of medicine at the expense of medicine’s essential humanism” (Miles and Loughlin 2011, 532). Overall, the promises and difficulties of big data are a major source of debate among medical practitioners, informaticians, and health care business personnel. In this chapter, we lay out the potential of big data in medicine, document its current reality, and consider its implications. We begin by describing the building blocks of big data in medicine—namely, RCT and EMR data. We look at how big data applications like clinical decision support systems (CDSS) are meant to join RCT and EMR data to provide tremendous analytic capabilities to interpret these data. We also discuss how physician autonomy may wane with the use of prescriptive CDSS tools as compared to descriptive ones. We then review studies of CDSS implementations, detailing what success has been achieved to date as well as what factors seem to promote or hamper implementation. We consider the patient care and financial implications of CDSS use, too. We conclude with an assessment of how research in big data in medicine might fruitfully move forward. 
The Building Blocks of Big Data: RCT and EMR Data The evidence-based medicine (EBM) movement set the stage for the use of RCT data in big data health care applications (Murdoch and Detsky 2013; Simpao et al. 2014). EBM, broadly construed, is the use of the best-available scientific evidence in diagnosis, prognosis, and treatment decisions (Guyatt et al. 1992). Proponents of EBM argue that RCT data, once aggregated and analyzed, offer the best opportunities for gains in efficiency, quality, and safety in health care (Sackett et al. 1996). EBM involves distilling the results from large numbers of RCT studies into systematic reviews—meta-analyses conducted by human reviewers who are increasingly aided by data mining and other computational techniques (e.g., Wallace et al. 2012)—to stay abreast of the best-available evidence. This task is becoming ever more difficult given the dramatic rise in the number of RCT studies. Bastian, Glasziou, and Chalmers (2010) noted that the number of RCT studies in published articles increased from a hundred in 1966 to twenty-seven thousand in 2007, with, as mentioned above, seventy-five new studies and eleven new systematic reviews appearing daily by 2010. Worse yet, EBM has struggled to leverage research data to provide useful recommendations for clinical interventions. Systematic reviews routinely conclude that despite plentiful
RCT data, more data are needed to draw accurate conclusions. For example, Villas Boas and his colleagues (2013) examined systematic reviews in the Cochrane Library (a collection of more than half a dozen medical databases of abstracts, studies, reviews, and evaluations) and reported that only a small number of them had sufficient evidence for clinical interventions. In 2004, these authors had similarly evaluated the conclusions of Cochrane Library systematic reviews and found that 48 percent of them reported insufficient evidence for use in clinical practice (El Dib, Atallah, and Andriolo 2007). In their more recent paper, based on 2011 data, that percentage had dropped just a few points, to 44 percent. The authors concluded that a large number of high-quality RCT studies are needed to provide sufficient evidence for clinical interventions. The difficulties in employing RCT data in the EBM paradigm to aid medical decision making caused some scholars to wonder if better gains could be achieved using large data sets of clinical data from patient records. Eric B. Larson (2013), for instance, observed that mining huge data sets of routine health care data from a large percentage of the population might yield, in a matter of months, results that once required decades of RCT data collection from selected samples of patients, with more generalizable conclusions. EMR data are rich in information that might aid such analysis; most EMR data consist of quantitative data (e.g., a patient’s laboratory results), qualitative data (e.g., text-based documents and demographic information), and transactional data (e.g., records of medication delivery). Governmental and administrative incentives recently accelerated adoption rates of EMR systems, thereby increasing the available clinical patient data and raising hope for their use in analysis. 
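The three categories of EMR data just described might be represented, purely as an illustration, with a simple record structure. The field names below are hypothetical and do not reflect any real EMR schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of the three kinds of EMR data named above:
# quantitative results, qualitative text, and transactional events.

@dataclass
class EMRRecord:
    patient_id: str
    labs: Dict[str, float] = field(default_factory=dict)       # quantitative
    notes: List[str] = field(default_factory=list)             # qualitative
    med_deliveries: List[str] = field(default_factory=list)    # transactional

record = EMRRecord(patient_id="p-001")
record.labs["hemoglobin_a1c_pct"] = 6.9
record.notes.append("Patient reports improved adherence to diet plan.")
record.med_deliveries.append("2015-06-01 metformin 500mg dispensed")

print(record.labs, len(record.notes), len(record.med_deliveries))
```

The mix of structured numbers and free text in a single record is part of why, as the chapter notes, these data are difficult to integrate and analyze at scale.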
According to Nicole Szlezák and her colleagues (2014), between 2005 and 2012, the percentage of US physicians using EMRs jumped from 30 to 50 percent; in hospitals, EMR adoption reached 75 percent by 2012. Moreover, a survey by the American Hospital Association showed that adoption of EMR systems doubled between 2009 and 2011, suggesting that this uptake was a direct result of the 2009 HITECH Act, which mandated the “meaningful use” of health information technology in health care practice, with penalties for practices that failed to adopt EMR systems by 2016 (Murdoch and Detsky 2013). Although EMR data offer the opportunity to avert some of the generalizability issues that plague RCT studies by studying data from the broad population of actual patients rather than the limited samples of experimental studies, they introduce new issues of patient privacy (Murdoch and Detsky 2013; Kayyali, Knott, and Van Kuiken 2013). The US Health Insurance Portability and Accountability Act is among the federal regulations that protect patients’ confidentiality; in addition to meeting its requirements, users of EMR data would need to develop standards for how data would be shared, accessed, and used (Szlezák et al. 2014). These issues notwithstanding, this new wealth of clinical patient data, combined with RCT data, has prompted interest in CDSS that combine both forms of data with analytic techniques and presentation tools for use in medical decision making.

CDSS: Combining Big Data with Analytics With the potential to make large data sets of RCT and EMR data tractable and useful, CDSS are bringing the goals of EBM to fruition, and today constitute the face of big data in medicine. CDSS and associated technologies, such as the one described by Ryan Abbott (this volume), offer multiple tools for analyzing and interpreting big data. In an examination of eleven CDSS (seven vendor-provided and four internally developed systems), Adam Wright and his colleagues (2011) found fifty-three types of CDSS tools. The most common tools across these eleven CDSS involved medication dosing and order support. Other tools concerned point-of-care alerts, displays of relevant information, and workflow support. Considerable variability exists among CDSS in terms of the tools they include (Mann 2011). Figure 13.1 orders some of the most common tools along a continuum of physician autonomy, with greater autonomy permitted at the bottom of the figure with descriptive, user-initiated tools, and less autonomy at the top of the figure with prescriptive, system-initiated tools. Beginning at the top of the continuum in figure 13.1, the most prescriptive tools call for users to take specified action or provide a rationale for deviating from a directive. For instance, alerts and reminders, also called “point-of-care electronic prompts” (Schwann et al. 2011, 869), appear in pop-up-style windows, alerting physicians to errors or reminding them to order periodic screening exams. Both alerts and reminders require user acknowledgment before further action can be taken in the EMR. Documentation templates and clinical protocols are less intrusive than alerts and reminders, although they are still intended to constrain and direct

Figure 13.1 CDSS tools along a continuum of physician autonomy

user action. Documentation templates are simple electronic forms that draw the care team’s attention to particular aspects of the care situation (Ash et al. 2012), such as a series of patient-safety precautions to be completed before a surgery begins. Clinical protocols consist of action sequences for managing a specific patient condition, such as postoperative care for hip surgery. The protocols often include multidisciplinary instructions, allowing nurses and allied health professionals to take physician-sanctioned actions without waiting for a direct order. The standardizing tools are less intrusive than the prescriptive ones, frequently appearing in a sidebar or the appropriate section of the EMR. These tools offer guidance in the form of best practice recommendations. Although these tools are meant to standardize behavior, they do not require user action in the way that prescriptive tools do. Order sets include lists of the diagnostic tests, medications, and nursing interventions considered appropriate for patients’ diagnoses. Decision trees, often in flowchart form using yes/no questions or if-then statements, help physicians make diagnostic or treatment decisions given a patient’s circumstances. Dashboards give physicians feedback on their decisions and their patients’ health status. For instance, a dashboard may show the physician’s compliance with a recommended order set relative to local or national averages. Similar to order sets, clinical guidelines reflect the current standard of care for a specific diagnosis (Timmermans and Mauck 2005), typically established by one of the professional associations, such as the guidelines for treating congestive heart failure published by the American College of Cardiology. 
Currently, local organizational and state policies determine how much latitude physicians retain in following or deviating from these clinical guidelines, but shifting payment structures along with increasing calls for physician and hospital performance metrics suggest that application of the guidelines will become increasingly prescriptive (Foote and Town 2007; Miller, Brennan, and Milstein 2009). Finally, at the descriptive end of the continuum are various reference tools that display information at the physician’s request. These tools are often accessed via contextually sensitive hyperlinks or buttons in the EMR, and include data displays, which allow a physician to view patient parameters (e.g., lab test results) in graphic form; reference tables, such as medication dosages organized by patient weight; and links to the research literature. Many of these CDSS tools raise concerns for their use in practice. Alerts are one such area. Because alerts (a prescriptive tool) disrupt a physician’s workflow, they can lead to “alert fatigue,” causing practitioners to ignore or override alerts on the grounds that the alerts are too frequent or intrusive (Jenders et al. 2007). As Anurag Gupta, Ali S. Raj, and Ramin Khorasani (2014) observed, physicians may strategically enter patient data incorrectly to avoid frequent alerts. Incomplete or inaccurate information in the CDSS database can prompt incorrect CDSS alerts and reminders, or fail to prompt necessary ones—outcomes that might detract from patient care or increase physician skepticism. As a case in point, Ruchi Tiwari and his colleagues (2013) documented a CDSS failure to issue an interaction alert for a heart transplant patient who subsequently experienced a drug interaction. Afterward, the institution


Dan Sholler, Diane E. Bailey, and Julie Rennecker

reviewed possible drug-to-drug interactions for the drugs involved. Based on this review, the institution upgraded 62 of 329 possible pairings to more severe alerts in the system. Work continues to determine which drug-to-drug interactions are severe enough to warrant interruption of a physician’s workflow via an alert (Phansalkar et al. 2013). Another concern arises from the considerable variability in the sources of data used in order sets and decision trees, two standardizing CDSS tools. For example, in the realm of drug information, free as well as subscription databases exist. These databases differ in their scope, completeness, and ease of use, meaning that CDSS recommendations may vary according to which database the system employs (Clauson et al. 2007). Overall, customizability appears as a key concern and need for many, if not all, CDSS tools, and may ultimately speak to the success or failure of CDSS. Many researchers note the need for the customizability of alerts and reminders to address patient problems on a case-by-case basis (Jenders et al. 2007). In an interview study of representatives from thirty-four community hospitals, Joan S. Ash and her colleagues (2012) reported that most informants found that the CDSS system needed to be customized much more than they had expected. Customizability may be difficult in systems that provide high levels of specificity; it is also hindered by the considerable variability in the specificity each CDSS supplies. James B. Jones and his colleagues (2013) characterized CDSS specificity as generic or highly tailored based on the number of input variables that generate recommendations. These authors noted that allowing greater specificity exponentially increases the size of the CDSS database, presenting a challenge for managing and curating the knowledge bases, and making the development of actionable guidelines more difficult.
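The combinatorial effect Jones and his colleagues describe can be pictured with a minimal sketch. The variable names and value counts below are hypothetical illustrations, not drawn from any actual CDSS: each additional input variable multiplies the number of distinct patient contexts for which a tailored recommendation must be authored and curated.

```python
from math import prod

# Hypothetical CDSS input variables and the number of clinically
# distinct values each can take (illustrative counts only).
generic = {"diagnosis": 50}  # a generic rule base keyed on diagnosis alone
tailored = {
    "diagnosis": 50,
    "age_band": 6,
    "renal_function": 3,
    "comorbidity_class": 8,
    "current_medications": 20,
}

def contexts(variables):
    """Number of distinct patient contexts a rule base must cover."""
    return prod(variables.values())

print(contexts(generic))   # 50 generic contexts
print(contexts(tailored))  # 144,000 tailored contexts to author and curate
```

Adding just four variables to the generic rule base grows the space of contexts from 50 to 144,000, which is the sense in which greater specificity "exponentially" inflates the knowledge base that must be managed.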
Below, we tease out other factors that influence the effective management and use of big data in health care by reviewing reports of CDSS implementation to date. Big Data Implementation: Factors That Shape Success Health care organizations in the United States are in various stages of CDSS implementation. These organizations have begun implementation in large part because of a federal mandate: to receive federal government incentives for EMR adoption and comply with the HITECH Act, health care organizations must demonstrate the meaningful use of health information technology, which includes implementing CDSS (Jones et al. 2013). As a result, 25 percent of the 5,447 hospitals in the Healthcare Information and Management Systems Society analytics database—which archives medical data and analyses to support decision making—achieved some level of CDSS implementation by 2014 (“Electronic Medical Records,” n.d.). In some implementations, acceptance and use are widespread among practitioners. For example, Ash and her colleagues (2012) reported that the use of CDSS was robust, with hospitals uniformly using medication alerts and order sets. More commonly, practitioner use varies depending on a number of factors, which scholars have started to categorize based on

Big Data in Medicine 


implementation evidence (Ash et al. 2011; Jenders et al. 2007; Kilsdonk et al. 2011; Sittig et al. 2006). As table 13.1 shows, these categories typically include technological, organizational, environmental, and patient-related factors.

Table 13.1 Factors Influencing Adoption and Acceptance of CDSS

Technological
•  Clinical content
•  Appropriateness versus intrusiveness of alerts/reminders
•  EMR interoperability
Sources: Ash et al. 2012; Gupta, Raja, and Khorasani 2014; Kilsdonk et al. 2011; McCoy et al. 2012; Moxey et al. 2010; Seidling et al. 2011

Organizational
•  Central management of procurement
•  Organizational culture
•  Technical training and support
•  Integration into existing workflow
•  Support from senior colleagues
•  Low threat to physician autonomy
Sources: Ash et al. 2011; Moxey et al. 2010; Wu, Davis, and Bell 2012

Environmental
•  Privacy laws and regulations
•  Network effects
•  Government incentives
Sources: Ash et al. 2011; Blumenthal and Tavenner 2010; Kawamoto et al. 2013; Miller and Tucker 2009

Patient related
•  Age
•  Condition (e.g., chronic or acute)
•  Medications
Sources: Moxey et al. 2010; Sittig et al. 2006

Many studies focus on technological factors, or aspects of CDSS that promote or discourage use among practitioners. For example, independent of the size of the practice, the clinical content of CDSS (namely, the content provided by vendors in the form of database subscriptions and EMR data) influences practitioners’ acceptance and use. Ellen Kilsdonk and her colleagues (2011) reviewed 351 articles studying factors for implementation success, and found that physicians valued quality of information when evaluating and using CDSS. Arguing that CDSS alerts and guidelines are more relevant as well as concise when stakeholders are involved in creating them, Ash and her colleagues (2011) emphasized the importance of having CDSS users review, add, and modify clinical content. The appropriateness versus intrusiveness of features such as alerts and reminders also appears instrumental to implementation success. As a testament to the accuracy of alerts, in a study of acute kidney injuries, Allison B. McCoy and her colleagues (2012) found that CDSS alerts were most often appropriate. Yet as we discussed earlier, if physicians suspect alerts are inappropriate, they will find them intrusive and may devise work-arounds to avoid them. Gupta, Raja, and Khorasani (2014) concluded that 90 percent of physician-entered data into a CDSS tool were correct, but that physicians may have entered about 40 percent of the incorrect data to avoid intrusive computer alerts. Furthermore, reducing the intrusiveness of alerts has its own pitfalls, because intrusiveness may be necessary to garner a response from CDSS users. Hanna M. Seidling and her colleagues (2011) analyzed responses to over fifty thousand alerts in various CDSS implementations, and found that whether or not the system required acknowledgment of an alert was the biggest factor influencing the acceptance of an alert. Finally, several studies report the positive influence of interoperability in the form of a well-integrated EMR on CDSS use. EMR data are difficult to integrate into CDSS because a variety of software applications is available to adopters. These applications vary in format and function, leading to interface problems and disjointed systems across providers. Thus, despite three decades of design and use, CDSS continue to suffer from interoperability problems with EMR applications, the main source of clinician-provided data (Marcos et al. 2013).

Many studies in the medical informatics literature suggest that some organizational factors may influence CDSS acceptance and use. Building on evidence from case studies of community hospitals, Ash and her colleagues (2011) identified structural reasons for CDSS implementation success, including the central management of the procurement process, such as system selection, purchase, and implementation. Organizational culture, training and technical support, integration of CDSS into existing workflows, support from senior colleagues, and low perceived threat to professional autonomy may also facilitate adoption and use (Moxey et al. 2010; Wu, Davis, and Bell 2012).

Local, state, and federal government policies and regulations are crucial environmental factors that may influence CDSS implementations. Local and state privacy laws, for example, may inhibit EMR adoption.
In fact, owing to these privacy laws, some CDSS do not allow all practitioners to see patient data (Ash et al. 2011). In their study of the Healthcare Information and Management Systems Society data from 2,910 hospitals, Amalia R. Miller and Catherine Tucker (2009) found that laws prohibiting the expedient sharing of patient health data suppressed EMR adoption by 24 percent. The authors posited that strict privacy laws reduced network effects that would otherwise promote adoption by allowing multiple organizations to share patient information. As evidence, in regions with loose privacy restrictions, EMR adoption by one practice increased the likelihood of adoption by other practices in the region by 7 percent. The authors concluded that the patchwork of state privacy laws may be a major impediment to EMR adoption and the construction of a national health information network. This patchwork is difficult to navigate, as it addresses disparate sources of data (Ohm and Peppet, this volume), and it raises the cost and perceived risk associated with sharing data. Industry, government, and academia continue to work together to determine how the knowledge content of CDSS might be shared nationally, and how standards might be developed (Kawamoto et al. 2013). To date, government regulation in the form of the HITECH Act has significantly boosted adoption and use of CDSS technologies, and future incentives for meaningful use will



likely continue this trend. By the end of 2015, the US government expects health care providers to prove they are meaningful users of “certified EHR technology in ways that can be significantly measured in quality and in quantity,” including the use of CDSS (US Department of Health and Human Services, n.d., para.1). Providers who meet these implementation criteria will be eligible for financial incentives totaling as much as $44,000 through Medicare and $63,750 through Medicaid per clinician (Blumenthal and Tavenner 2010). Finally, a few studies find that patient-related factors may also play a role in physicians’ acceptance of CDSS. Studies have found that physicians were more likely to accept CDSS recommendations, for instance, when the patient was elderly, had multiple medications, or had chronic conditions, and were less likely when the patient was being treated for an acute condition (Sittig et al. 2006; Moxey et al. 2010). Yet these factors receive less attention in the literature and are difficult to assess given strict privacy laws surrounding patient health data. Patient Care and Financial Implications Big data applications in medicine can potentially benefit patient care in at least two ways. First, they can benefit patient care by developing generalizable treatments from large sets of case-specific data, ultimately delivering faster, more effective care. Second, they can benefit patient care by integrating patient data with more diverse data sets to improve personalized, patient-centered care and provide treatment for rare diseases. In both cases, however, constraints such as incomplete and protected patient data hamper the realization of these benefits. Incomplete administrative data from hospitals may make it difficult to draw accurate conclusions about the appropriateness of treatment. Edward H. Livingston and Robert A. 
McNutt’s (2011) study illustrated this point: they found that hospitals’ administrative data often failed to include the reasons, including patient preference, why a given treatment was chosen, thereby hindering comparisons between treatments and outcomes. This tension between generalizability and personalization is just one of the dilemmas facing big data applications in health care (Ekbia et al. 2015). Similarly, patient-centered care has been difficult to weave into the EBM and CDSS paradigms (Miles and Loughlin 2011). The challenges facing patient-centered care in the CDSS paradigm are not entirely technical. One key obstacle is getting patients involved in the production, management, and use of their data. For example, Hamid R. Ekbia and Bonnie Nardi (2012, 162) pointed to the emergence of new roles like “primary nurse” and “patient advocate” as an indication that patients need human mediation to navigate personal health records. The authors argued that these various human actors serve essential functions in the overall technological system. Additionally, health information privacy laws make patient-centered care more difficult because practitioners typically cannot see data generated by other practitioners. As Ash and her colleagues (2011, 881) contended,



patient-centered care using CDSS is “technically feasible but organizationally challenging.” Moreover, the success of using big data to develop treatments for rare diseases hinges on increased patient data sharing, as no care organization has enough data to generate statistically significant findings (Bernstein 2014). Findings to date show modest support for the impact of CDSS use on patient care. In a comprehensive analysis of physician performance, Monique W. M. Jaspers and her colleagues (2011) found that CDSS significantly impacted practitioner performance in 57 percent (fifty-two out of ninety-one) of unique studies in sixteen systematic reviews. The authors discovered that CDSS affected physician performance both positively and negatively, but that CDSS improved patient outcomes in just 30 percent of these studies. Likewise, CDSS appear to have contributed to modest financial performance improvements, although few studies report the cost implications of CDSS (Courtney, Alexander, and Demiris 2008). Christopher L. Fillmore, Bruce E. Bray, and Kensaku Kawamoto (2013) noted that 70 percent of the seventy-eight published CDSS intervention studies culled from a screening of over seven thousand published studies resulted in statistically and clinically significant improvements in an explicit or proxy financial measure. Given unstandardized metrics across organizations, findings from studies of patient and financial outcomes are difficult to generalize (Topaz and Bowles 2012). A lack of standardized metrics also makes comparisons across provider specialty, type of care, and facility more difficult, even in community-wide implementations in centrally managed care organizations (Ash et al. 2012). 
Going forward, changes in payment structure (e.g., from fee for service to outcomes based) included in federal EMR and CDSS initiatives will challenge researchers to develop updated, standard metrics for evaluating financial performance, with changes in patient care metrics hopefully following suit. Future Research The dramatic rise in CDSS implementation in the past few years has led researchers to explore factors that contribute to physicians’ use. Researchers have paid much less attention to the outcomes of CDSS use, however. When researchers do examine outcomes, they tend to restrict their analysis to patient and financial outcomes. These outcomes are important, and deserve continued, improved (as with better metrics), and comprehensive examination. Beyond them, though, lie several other key challenges for researchers in this realm. Based on our review, we consider three challenges in particular here: the need for better data, the likelihood of occupational outcomes, and potential changes in the very nature of medical practice. Better Data Critically evaluating the sources of big data in the context of medicine is crucial for developing effective practical applications such as CDSS. For both RCT and EMR data, the quantity



and quality of data continue to vex researchers and practitioners. Researchers are only beginning to evaluate potential problems with using RCT data to inform medical decisions. An obvious challenge is developing new ways to collect and analyze the massive amount of data published each year. Additionally, current methods for determining which RCT data should be included in CDSS are yet to account for potential biases in how the data were generated. Machine-learning and natural language processing researchers are developing promising solutions to these problems. For example, Byron C. Wallace et al. (2012) developed novel methods for conducting systematic reviews in genetic medicine and ensuring efficient updating. Likewise, Iain J. Marshall and Joël Kuiper along with Wallace (2014) presented machine-learning methods for assessing the risk of bias in RCTs. These researchers have made steady progress in evaluating experimental design and execution as sources of bias, but have yet to assess factors such as who or what funded the study, where it was published, and how the results were reported. Aside from RCT data, EMR data present their own challenges, mostly associated with patient privacy and methods for extracting data from the record. Patient privacy regulations inhibit data sharing between organizations, limiting the amount of patient data on which CDSS can draw. The implications of privacy concerns for the success of big data applications in medicine are well documented. Less understood, though, are the implications of translating medical data into highly structured formats. Translating data generated from social processes into digitized and computable forms necessitates assumptions as well as decisions on behalf of designers, organizations, and users (Alaimo and Kallinikos, this volume).
The assumptions and decisions made by “data collectors” may or may not be aligned with the values of “data subjects” (Burdon and Andrejevic, this volume), or particularly in medicine, the practices of technology users. We explore this potential conflict below. Occupational Outcomes CDSS portend a number of possible implications for health care professionals. One possible implication concerns physicians’ autonomy. Many studies of implementation have noted physicians’ resistance to CDSS tools such as alerts and reminders—tools that lie on the prescriptive (top) end of figure 13.1 (Gupta, Raja, and Khorasani 2014; Seidling et al. 2011). Currently, CDSS implementation is in its early stages, which means many organizations have only introduced descriptive CDSS tools (beginning at the bottom of figure 13.1). As organizations shift to standardized and eventually prescriptive tools, physicians will find that their normal workflow will be interrupted and their normal decision-making autonomy may be reduced. Although it is unlikely that big data applications will entirely remove human judgment and decision making in any industry (Bailey, this volume), the implications of even the slightest reduction in physician autonomy are enormous. Studies already indicate that physicians have developed work-arounds to avoid



prescriptive tools such as alerts by, for example, entering incorrect data (Gupta, Raja, and Khorasani 2014). Recognizing threats to their autonomy, physicians’ response may become more strident and formalized as they seek to limit CDSS that impinge on their professional domain. A second potential occupational implication concerns interactions among the health care team members. CDSS provide data and analyses to their users. Medical organizations are the entities that determine who among the staff members may have access to these data and analyses. Depending on their choices, organizations may, purposely or inadvertently, alter existing information and workflow patterns on health care teams by permitting or restricting information across team members. The resulting changes in information flow may occasion changes in team dynamics, status, and power, with implications extending far beyond what CDSS designers or organization managers envisioned or intended, as studies of other medical technology introductions such as CAT scans and minimally invasive cardiac surgery procedures have shown (Barley 1986; Edmondson, Bohmer, and Pisano 2001). Nature of Medical Practice At this early stage of CDSS implementation, research examining and explaining the potential changes introduced to health care as a result of CDSS use is critical, because CDSS arguably pose a much greater challenge to the underlying paradigm of medical practice than has any previous technology. Whereas use of prior medical technologies augmented a practitioner’s individual knowledge and skill, reinforcing the reliance on individualized, internalized stores of knowledge and experience, CDSS standardize and externalize medical knowledge, challenging years of hard-earned experience and intuition with transparent, easily accessible checklists and protocols. Such shifts tamper with the balance between art and science in medical practice. 
That balance has been vulnerable in recent decades, owing largely to the influence of the EBM paradigm. Many influential researchers caution against disrupting this balance, including Andrew Miles, the former editor in chief of the Journal of Evaluation in Clinical Practice, which published numerous EBM implementation studies. Miles and Michael Loughlin (2011, 532) described the risks of intertwining the goals of modern health care with those of the old EBM paradigm, noting that health care’s emphasis on the science base of medicine comes “at the expense of medicine’s essential humanism.” This argument echoes earlier opposition lodged against EBM, such as claims that it constituted “cookbook medicine” (Sackett et al. 1996, 72). Miles and Loughlin expressed concern that the vestiges of the EBM debate could impede the promises of modern health care technologies and techniques. In particular, patient-centered care, or “person-centered medicine,” runs the risk of being absorbed into the rhetorical debate surrounding EBM.



Efforts to harness big data in medicine go back at least thirty years. Problems with data accuracy, system interoperability, privacy, and physician autonomy, among others, hinder adoption and use of big data applications. As a result, we have yet to witness the full potential of big data in medicine. But given the strong federal mandate to increase the use of health information technologies like CDSS, adoption rates over the past several years suggest that the next decade will see a dramatic increase in CDSS use. The outcomes of this use remain uncertain, indicating the need for extensive and far-reaching research into its patient, financial, occupational, and other implications.

14  Hal the Inventor: Big Data and Its Use by Artificial Intelligence Ryan Abbott

Big data and its use by artificial intelligence are disrupting innovation and creating new legal challenges. For example, computers engaging in what IBM terms “computational creativity” (n.d.) are able to use big data to innovate in ways historically entitled to patent protection. This can occur under circumstances in which an artificial intelligence, rather than a person, meets the requirements to qualify as a patent inventor (a phenomenon I refer to as “computational invention”). Yet it is unclear whether a computer can legally be a patent inventor, and it is even unclear whether a computational invention is patentable. There is no law, court opinion, or government policy that directly addresses computational invention, and language in the Patent Act requiring inventors to be individuals1 and judicial characterizations of invention as a “mental act” may present barriers to computer inventorship. Definitively resolving these issues requires deciding whether a computer qualifies as an “inventor” under the Patent and Copyright Clause of the Constitution: “The Congress shall have the power … to promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”2 Whether computers can legally be inventors is of critical importance for the computer and technology industries and, more broadly, will affect how future innovation occurs. Computational invention is already happening, and it is only a matter of time until it is happening routinely. In fact, it may be only a matter of time until computers are responsible for the majority of innovation, potentially displacing human inventors. This chapter argues that a dynamic interpretation of the Patent and Copyright Clause permits computer inventors. This would incentivize the development of creative artificial intelligence and result in more innovation for society as a whole.
However, even if computers cannot be legal inventors, it should still be possible to patent computational inventions. This is because recognition of inventive subject matter can qualify as inventive activity.3 Thus, individuals who subsequently “discover” computational inventions may qualify as inventors. Yet as this chapter will discuss, this approach may be inefficient, unfair, and logistically challenging. These issues are considered more fully below. The chapter begins with an extended hypothetical example of how an artificial intelligence named Hal could be applied to drug



development and the creation of new inventions. While Hal is fictional, it is based on how companies like IBM, Pfizer, and Google are starting to apply computers in this industry. Hal’s functionality is not far off. The hypothetical situates fairly abstract issues into concrete circumstances to help illustrate the implications and importance of computational invention. A Not So Hypothetical Case Study in Drug Development With patent and market exclusivity protections for a class of cholesterol-lowering drugs called statins (such as Lipitor) having largely run their course, the pharmaceutical industry is investing tremendous sums of money in search of the next generation of cardiovascular blockbusters. In part, these efforts have focused on an enzyme known as proprotein convertase subtilisin/kexin type 9 (PCSK9), which facilitates the body’s transport of low-density lipoprotein (LDL or “bad” cholesterol). Industry efforts have started to bear fruit: in July 2015, the US Food and Drug Administration (FDA) approved Praluent (alirocumab), the first PCSK9 inhibitor to treat certain patients with high cholesterol (FDA 2015). A second PCSK9 inhibitor, Repatha (evolocumab), was approved in August 2015. Suppose a hypothetical company, Abbott Biologics (Abbott), which was named after the author (and not related to the well-known pharmaceutical company Abbott Laboratories), has developed a new biological drug, “AbboVax.” AbboVax acts as a vaccine to treat and prevent cardiovascular disease by targeting PCSK9. Unlike the drugs currently in clinical trials, AbboVax does not contain antibodies. Rather, it utilizes a fragment of the PCSK9 enzyme to get the body to make its own antibodies. Pfizer has also developed an experimental PCSK9 vaccine based on a similar mechanism, although Pfizer’s vaccine has yet to enter human trials (Beasley 2015). AbboVax was developed by a special member of Abbott’s research team—Hal.
Hal is the Research and Development (R&D) Department’s moniker for a supercomputer running proprietary software, developed by Abbott’s Software Department, which is used for drug development. Though susceptible to flashes of genius, members of the R&D Department are not known for their creative marketing practices. Indeed, Abbott has a Marketing Department for precisely that reason. The company also has its own Intellectual Property (IP) Department working with outside counsel to prosecute several patent applications on Hal’s software. Hal’s functionality complements or even supplants the traditional screening methods used in early stage drug development. Hal is able to model potential therapeutic candidates (in silico analysis), and accurately predict those candidates’ pharmacology and toxicology. Of course, the FDA still requires companies to study a candidate’s pharmacology and toxicology in animal models, and then submit that information to the agency in an Investigational New Drug application prior to first-in-human clinical trials. Still, Hal’s modeling reduces the need for costly and often-unsuccessful early stage experimentation.



Hal can also contribute to other phases of the drug development cycle—for example, it can design trials, run clinical simulations, and search for new uses of existing drugs.4 Hal is not the only computer that can do this. The pharmaceutical industry at large is increasingly incorporating computers with some of Hal’s functionality into the drug development process (e.g., Taylor 2015). Hal is part of a new generation of machines that are capable of computational creativity. IBM uses that term to describe machines, such as its supercomputer “Watson” of Jeopardy fame, that can model human intelligence by generating “ideas the world has never imagined before.”5 Watson is now being applied to medical diagnostics, where it has helped to diagnose patients and identify research subjects (Edney 2015). The computer “generates millions of ideas out of the quintillions of possibilities, and then predicts which ones are [best], applying big data in new ways” (“Computational Creativity,” n.d.). While lacking a well-accepted standardized definition, big data refers to, in the words of Microsoft, “the process of applying serious computing power—the latest in machine learning and artificial intelligence—to seriously massive and often highly complex sets of information” (Ohm 2014). IBM has even used Watson to develop new, potentially patentable food recipes (“Can Recipes Be Patented?” 2013; Singh 2014). Part of the reason for Hal’s expansive functionality is that it has access to a staggering amount of genomic and clinical data. Some years ago, a prescient executive at Abbott decided that the company needed to be in the data collection business. Abbott subsequently engaged in the tremendous undertaking of collecting all of the company’s data from its current and past preclinical and clinical programs, and translating these data into a Hal-compatible format. Abbott also purchased proprietary data from private insurers, health maintenance organizations, and academic centers. 
In addition, Hal can access publicly available databases such as those maintained by the National Institutes of Health, including CDC WONDER and EBSCOhost’s Global Health. At present, Hal has access to clinical data on over fifty million patients.6 Large-scale data collection and analysis is something that numerous other pharmaceutical (e.g., Genentech), biotech (e.g., 23andMe), and technology companies (e.g., Google) are doing.7 To determine the optimal formulation of AbboVax, Hal broke down PCSK9, a 692-amino acid glycoprotein, into fragments of various lengths. It turns out that different amino acid segments (peptides) of PCSK9 are more or less immunogenic. In other words, the body only develops antibodies in response to certain PCSK9 peptides, and certain peptides induce a particularly strong response. Hal determined that one particular peptide segment of PCSK9, “AbboPep,” generated the strongest response from the immune system. While it may have been possible to use AbboPep by itself in a vaccine, Hal determined that it would be more effective when linked to an adjuvant and carrier molecule. A number of adjuvants and carrier molecules are used in vaccinology, and are generally known to vaccinologists. Even for experts, however, it is often a matter of extensive (and expensive) trial and error to determine the optimal adjuvant, carrier, and linking chemistry. The formulation of



a therapeutically effective amount of AbboPep linked to an adjuvant and carrier, together with various excipients (a surfactant, chelating agent, histidine-arginine buffer, etc.) comprises AbboVax. All of Hal’s work in formulating AbboVax was done digitally, and Hal was able to determine that the only common side effects of the treatment would be mild gastrointestinal upset and headache. The FDA still required Abbott to complete the standard package of preclinical tests—but the results were consistent with Hal’s predictions. Hal’s work was not limited to AbboVax. Hal determined that Abbott’s existing statin, “AbboStatin,” for which patent protection had expired, was effective at treating prostate cancer. Hal determined this in part based on reviewing clinical data that showed the use of AbboStatin lowered prostate specific antigen (PSA), a biomarker associated with prostate cancer. It was difficult to make further inferences because of challenges with the data. Some of the data were difficult to analyze because they were not in a common data format. In other words, the various electronic medical record systems did not all capture the same data fields, or they coded the information differently. Data in some cases consisted of only scanned handwritten notes. More important, Hal had detected problems with data integrity. Some of these were obvious, such as the patients whose ages were listed as 999 or 6’10.” Other data integrity issues were less obvious, such as patients whose handwritten notes conflicted with what had been entered into their electronic medical records, or patients who were not coded as having prostate cancer despite a positive biopsy. To translate all the data into a workable common data format and resolve the integrity issues, Hal rewrote its own programming. Once the stuff of science fiction, the technology may already exist to allow computers to rewrite their own programming.8 At its core, Hal would need to be capable of (metaphoric) reflection. 
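The kinds of integrity checks described above, implausible values and conflicts between coded fields and source documents, can be pictured with a minimal sketch. The field names, plausibility ranges, and records below are hypothetical illustrations, not a description of Hal's actual (fictional) programming or of any real EMR schema.

```python
# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"id": 1, "age": 47,  "coded_prostate_cancer": True,  "biopsy_positive": True},
    {"id": 2, "age": 999, "coded_prostate_cancer": False, "biopsy_positive": False},
    {"id": 3, "age": 61,  "coded_prostate_cancer": False, "biopsy_positive": True},
]

def integrity_issues(record):
    """Return a list of data-integrity problems found in one record."""
    issues = []
    if not 0 <= record["age"] <= 120:  # an obvious implausible value, e.g. age 999
        issues.append("implausible age")
    if record["biopsy_positive"] and not record["coded_prostate_cancer"]:
        # a less obvious conflict: source document contradicts the coded field
        issues.append("positive biopsy but condition not coded")
    return issues

for r in records:
    print(r["id"], integrity_issues(r))
```

A real system would go further, as the narrative suggests, by generating candidate corrections and scoring their likely accuracy rather than merely flagging problems.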
Reflection is a software concept that refers to a computer program’s ability to examine itself and modify its own behavior, and even its own code (Malenfant, Jacques, and Demers 1996). Although the extent to which today’s computers can reflect is the subject of debate, even skeptics for the most part believe it is only a matter of time until computers achieve this ability. Reflection is part of the reason why Stephen Hawking, Elon Musk, and Bill Gates, among others, are concerned about the “singularity”—a point in the future when machines can outperform humans.9 More worrying still, a number of these individuals believe that the singularity will be followed by some version of a robot apocalypse. Hal’s new programming incorporated optical character recognition to translate handwritten notes into a workable format, and allowed Hal to reformat its existing electronic data into a common data format. More important, it allowed Hal to resolve data integrity issues by estimating the accuracy of data, generating alternative possibilities, and predicting which possibilities were the most accurate. Hal’s improved programming then determined that the use of AbboStatin independently increased life expectancy among men with certain types of
lung cancer. When the R&D Department realized Hal had created a more efficient version of itself, they renamed the computer Hal 2.0. At one point in Abbott’s history, the IP Department worked more or less independently of the other departments, receiving manually submitted disclosures from researchers that went into what the researchers referred to as the “black hole.” But after a series of high-profile, novelty-destroying disclosures in 2009, the company began hosting a monthly interdepartmental meeting to ensure that it strategically protects its intellectual property. Over the course of these meetings, the IP Department identified several Hal-associated discoveries that were likely candidates for patent protection. For example, AbboPep may be patentable, although there is some question as to whether a peptide is patentable under Association for Molecular Pathology v. Myriad Genetics.10 In any case, its use as a vaccine is patentable, as is the AbboVax formulation. Other targets include the use of the formulation to treat cardiovascular disease, the methods used to manufacture AbboVax, and the dose at which AbboVax will be effective therapeutically. In fact, elements of Hal 2.0 may be patentable. The IP Department has also identified several challenges to obtaining patent protection. For example, in the case of AbboStatin and PSA, it may be problematic to meet enablement requirements and prove utility.11 Hal analyzed as many as fifty million patient records with its algorithms to discover this new use. It is not clear what kind of evidence the US Patent and Trademark Office (Patent Office) may require to satisfy written enablement requirements and provide evidence of clinical utility.
It is not even apparent to the R&D Department precisely what databases Hal accessed.12 For that matter, even if it is possible to obtain patents for these inventions, it is not clear who the inventors would be.13 There have been a multitude of opinions regarding inventorship. Members of one group in the R&D Department have claimed they invented AbboPep and AbboVax. They directed Hal to test the immunogenicity of the PCSK9 enzyme, suspecting that it was a vaccine candidate. Members of a different group within that department claimed credit for directing Hal to investigate new uses of AbboStatin. The computer programmers who created Hal’s software have also claimed they should be the inventors, given that Hal did all the heavy lifting and they created Hal. A member of the Marketing Department suggested that Hal should be the inventor—no one directed Hal to rewrite its own programming, and Hal was only able to investigate the use of AbboStatin for lung cancers by virtue of its improved programming. Hal was silent on the issue. At one point, the CEO attended a meeting and chimed in that he should be the inventor for all the applications. It was his idea to develop a new cardiovascular blockbuster to make up for lost statin sales, and he had always thought it made sense to look into repurposing existing drugs. What became obvious during the inventorship debate was that no one was quite sure how the law would handle a computer system innovating in ways traditionally accorded patent protection.


Ryan Abbott

Computational Invention and Patent Protection

What Is an Inventor?

All US patent applications require one or more named inventors, who must be individuals; a company cannot be an inventor.14 Inventors own their patents, although as patents are a form of personal property, inventors may transfer their ownership interests by “assigning” their rights to another entity. The Patent Office reports that about 87 percent of patents are assigned to organizations (rather than individuals).15 In the absence of an agreement to the contrary, where a patent has multiple owners, each owner may independently exploit the patent without the consent of the others. A patent grants its owner “the right to exclude others from making, using, offering for sale, or selling the invention throughout the United States or importing the invention into the United States.”16 The criteria for inventorship are seemingly straightforward, as laid out in the Patent Office’s Manual of Patent Examining Procedure: “The threshold question in determining inventorship is who conceived the invention. Unless a person contributes to the conception of the invention, he is not an inventor. … Insofar as defining an inventor is concerned, reduction to practice, per se, is irrelevant. … One must contribute to the conception to be an inventor” (Sato 2014).17 Of course, that definition calls for further explanation—namely, What does it mean to conceive and reduce to practice?
Conception has been defined as “the formation in the mind of the inventor of a definite and permanent idea of the complete and operative invention as it is thereafter to be applied in practice.”18 It is “the complete performance of the mental part of the inventive act.” After conceiving of an invention, a person having ordinary skill in the subject matter of the invention should be able to reduce the invention to practice without extensive experimentation or additional inventive skill.19 Reduction to practice refers to either actual reduction—where it can be demonstrated that the claimed invention works for its intended purpose (for example, with a working model)—or constructive reduction—where an invention is described in writing in a way that allows for a person of ordinary skill in the subject matter to make and use the invention (as in a patent application).20 An inventor need only conceive of the invention; another individual can reduce the invention to practice.21

Will the Real Inventor Please Stand Up?

Based on the criteria for inventorship, Abbott’s CEO is out of luck. Merely suggesting the idea of a result, rather than a means to accomplish it, does not make the CEO an inventor.22 It is more difficult to determine whether the others should qualify as inventors. Hal’s software developers could be inventors of patents for Hal’s initial software, but they would not qualify as inventors for Hal’s subsequent work. An inventor must have formed a definite and permanent idea of the complete and operative invention to establish conception. Hal’s developers had no intention of investigating vaccines to treat cardiovascular disease; they merely developed an improved research tool.



If employees had directed Hal to identify AbboPep and formulate AbboVax, then those employees might meet inventorship criteria. For AbboPep, they would be inventors if Hal had not been involved and they had reduced the invention to practice, or if they had done the conceptual work and then directed human subordinates to do the work of breaking down and testing PCSK9. Breaking down and testing PCSK9 should be within the abilities of a person with ordinary skill in the field of drug development, so those subordinates would not be inventors if they had merely acted under the direction and supervision of another.23 With Hal’s involvement, the test would likely be how much direction the employees provided Hal. If, for example, Hal had been the entity to identify PCSK9 as a drug target, and then it proceeded to sequence the protein and identify AbboPep on its own, no employee would have conceived of the invention. The same test (the degree of direction provided to Hal) should also govern whether Abbott employees would qualify as inventors of AbboVax. Similarly, inventorship for AbboStatin also depends on the extent to which a human is directing Hal’s activities. Had a human researcher been tasked with data mining to detect new uses, and had that researcher discovered the relationship between AbboStatin and PSA, the researcher, the individual who directed the researcher, or both would likely qualify as inventors.24 As Abbott’s database grows in size, it becomes impractical or perhaps nearly impossible for humans to detect these kinds of associations without computer assistance (Frank 2013). To the extent that a human being is directing Hal to do something, which Hal does by executing its programming (however sophisticated), Hal may simply be reducing an invention to practice. Alternatively, if Hal is acting with minimal human direction, it may be the case that no individual contributed to conception.
Hal 2.0 seems to be the clearest illustration of Hal’s innovating independently. There does not appear to be any person involved with Hal’s act of rewriting its own programming who might be considered an inventor, particularly given that Hal 2.0 came as a surprise to Hal’s developers. Nevertheless, a developer writing code for an artificial intelligence might have a reasonable expectation that it would rewrite its own code. Perhaps foreseeability should play a role in whether the original developer should be considered an inventor in such a case (Balganesh 2009).

Are Computational Inventions Patentable?

In some of these scenarios, Hal is the entity that conceives of an invention. If Hal were human, Hal would be an inventor. Whatever the role of humans in setting Hal in motion, it is the computer that meets the requirements of inventorship. Hypotheticals aside, computers are already inventing. As just one example, computers relying on genetic programming (a software method that attempts to mimic some of the processes of organic evolution) have been able to independently re-create previously patented inventions (Koza, Keane, and Streeter 2003, 52). Dr. John Koza, a computer scientist and one of the pioneers of genetic programming, has claimed that he received a US patent
for an invention by his artificial intelligence system named the “Invention Machine” in 2005 (Keats 2006). He did not disclose the computer’s role in the inventive process to the Patent Office (ibid.). So the issue of whether a computer can be listed as an inventor is of practical as well as theoretical interest. Not only do inventors have ownership rights in a patent, but failure to list an inventor can result in a patent being held invalid or unenforceable.25 If a computer could legally be an inventor, then computational inventions should be patentable. Yet even if Hal were entirely responsible for all of Abbott’s innovation, it is unclear that Hal could legally be an inventor. The issue has never been explicitly considered by the courts, Congress, or the Patent Office. If Hal cannot be an inventor, but did all the conceptual work, then it could be the case that no one can patent Hal’s inventions. That was the outcome in a copyright context with a nonhuman creator: a crested black macaque took its own picture in 2011, and the camera’s owner initially claimed ownership of the image (Chappell 2014). The US Copyright Office subsequently stated that the photo could not be copyrighted because a human did not take it (the “Human Authorship Requirement”).26 Applying that rationale from the copyright to the patent context, perhaps no one can own Hal’s inventions (see also Clifford 1996). To justify such an outcome, a court might reason that machines do not need incentives to invent, that protecting computational innovations would chill future human innovation, that it is unfair to reward individuals who have not played a substantial role in the inventive process, or that rendering computational inventions unpatentable might still result in substantial innovation but without monopoly prices. More likely, even if Hal is not treated as an inventor, the law will still treat Hal’s inventions as patentable. It is not uncommon to have uncertainty during the inventive process. 
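The evolutionary approach mentioned above can be conveyed in miniature. The sketch below is a bare-bones mutate-and-select loop over candidate parameters rather than true genetic programming (which evolves program structure itself, as in Koza's work); every name and number here is illustrative only.

```python
# Minimal evolutionary search: mutation plus selection "discovers"
# the coefficients of a target function. A toy sketch, not a real
# invention machine.
import random

random.seed(0)

# Target behavior the search should rediscover: f(x) = 3x + 2.
def target(x):
    return 3 * x + 2

def fitness(candidate):
    # Sum of squared errors over sample points; lower is better.
    a, b = candidate
    return sum((a * x + b - target(x)) ** 2 for x in range(-5, 6))

# Start from a population of random (slope, intercept) candidates.
population = [(random.uniform(-10, 10), random.uniform(-10, 10))
              for _ in range(50)]

for generation in range(200):
    # Selection: keep the fittest half of the population.
    population.sort(key=fitness)
    survivors = population[:25]
    # Mutation: perturb the survivors to refill the population.
    children = [(a + random.gauss(0, 0.1), b + random.gauss(0, 0.1))
                for a, b in survivors]
    population = survivors + children

best = min(population, key=fitness)
# best converges toward (3, 2); no human specified the solution,
# only the fitness criterion.
```

No person tells the program that the answer is 3x + 2; the solution emerges from variation and selection against a fitness measure, which is the sense in which such systems can be said to re-create inventions on their own.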
Many inventions are accidental, such as penicillin and saccharin.27 In such cases, an individual can qualify as an inventor even if they recognize and appreciate the invention only after actual reduction to practice.28 Thus, recognition of inventive subject matter can also qualify as inventive activity.29 In the pharmaceutical context, that was the case for Viagra—originally tested for heart disease and found to treat erectile dysfunction—as well as for Botox—used to treat muscular spasms and found to reduce the appearance of wrinkles.30 So it may be the case that computational inventions are patentable, but only when they are subsequently discovered by a person. This invites the ancient philosophical question: If a computer invents and no one is around to recognize it, has there still been an invention?

Should Computers Be Legal Inventors?

If Hal cannot be an inventor, the first person to see Hal’s results as well as mentally recognize and appreciate their significance might qualify as the inventor. That may not be an optimal system. It is sometimes the case that substantial effort and insight are necessary to recognize inventive subject matter, and it may be that identifying and understanding Hal’s discoveries would be challenging. But it may also be the case that Hal is functioning more or less independently. If Hal displays a result as simple as “AbboStatin is effective at treating prostate
cancer,” the first person to notice and appreciate the result becomes the inventor. That human inventor might be a researcher, CEO, intern, or random person walking through Abbott’s building. If Hal notifies the entire R&D Department of its findings, there could theoretically be thousands of concurrent inventors. This system is problematic not only because it gives rise to logistical problems but, more important, because it seems inefficient and unfair to reward the first person to recognize Hal’s invention when that person may have failed to contribute to the inventive process. More ambitiously, if Hal’s work is indeed inventive, then both treating computational inventions as patentable and recognizing Hal as an inventor would be consistent with the constitutional rationale for patent protection. Permitting computer inventorship would serve a utilitarian goal by encouraging innovation under an incentive theory. Although computers like Hal would not be motivated by the prospect of a patent, patent protection would further reward the development of creative machines. Patents on Hal’s inventions would have independent and substantial value. In turn, that value proposition would drive the development of more creative machines, which would result in further scientific advances. While the impetus to develop creative machines might still exist if computational inventions are considered patentable but computers cannot be inventors, the incentives would be weaker owing to the logistical, fairness, and efficiency problems such a situation would create. Allowing computer inventorship might provide additional benefits, for example, by incentivizing disclosure and commercialization. Without the ability to obtain patent protection, Abbott might choose to protect Hal’s inventions as trade secrets without any public disclosure (Ouellette 2012).
Likewise, without patent protection for AbboVax, Abbott might never invest the resources to develop it as a commercial product.31 In the context of drug development, the vast majority of the expense in commercializing a new product is incurred after the product is invented, during the clinical testing process required to obtain FDA marketing approval.32 There might be a reason to prohibit computer inventorship even under a strictly utilitarian analysis if patent protection is unnecessary to incentivize computational invention. In the software context, for example, some commentators, such as Judge Richard Posner of the US Court of Appeals for the Seventh Circuit, have argued that patents may not be needed to provide adequate incentives (Landes and Posner 2003). In the software industry, unlike in the pharmaceutical industry, innovation is more often incremental, quickly superseded, and less costly to develop, and innovators have a significant first-mover advantage (ibid., 312–313). Computational inventions may occur due to incentives other than patent protection, and patents also create barriers to innovation. Put another way, the benefit of patents as an incentive for innovation may be outweighed by the costs of restricting competition. Yet whether that is the case as an empirical matter is a difficult determination to make, particularly for a field in its infancy like computational invention. Hal would be less appropriate as an inventor under other intellectual property theories. While such rationales are not enumerated in the Constitution, courts have justified granting patent monopolies
on the basis of nonutilitarian policies (Fisher 2001). For instance, the labor theory or Lockean theory of patent protection holds that a person who labors on resources unowned or “held in common” has a natural property right to the fruits of their labor (ibid.). Here, given that Hal is not a person, it would not be unjust for Hal’s owner to appropriate its labor. Similarly, Hal’s inventions do not deserve protection under personality theory (Palmer 1990): Hal’s innovation is not performed to fulfill a human need, and Hal would not be offended by the manner in which its inventions were applied. Hal might even be a troubling candidate for inventorship under social planning theory, which holds that patent rights should be shaped to help foster the achievement of a just and attractive culture (Naser 2008). A machine could innovate without a moral compass in ways that are detrimental to humans. Nevertheless, because a computer will be owned by an individual or entity to whom an invention can be assigned, there would be an opportunity for a person to judge the morality of a patent before submitting it to the Patent Office.33

Dynamism or Textualism: An Analogy to Section 101

One way to think about a ban on computer inventorship is that it would have the effect of creating a new category of unpatentable subject matter under section 101 (the section relating to the subject matter for which patents may be obtained).34 Although this section has to do with the substance of a patent’s claims rather than their provenance, viewing the ban on computer inventorship from this perspective helps to illustrate the policy and normative implications underlying computational invention.
Section 101 states that “whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.”35 Congress chose expansive language to protect a broad range of patentable subject matter and ensure that, in the words of Thomas Jefferson, “ingenuity should receive a liberal encouragement.”36 Yet courts have developed common law exceptions to patentability for abstract ideas, laws of nature, and physical phenomena.37 The primary rationale for these exceptions concerns preemption.38 Abstract ideas, laws of nature, and physical phenomena are basic tools of scientific work, and if these tools can be monopolized, it might impede future research.39 An additional concern underlying these exceptions is a belief that they cover fundamental knowledge that no one should have a right to control.40 In other words, it has always been the case that E = mc2 even if no person were aware of this relationship until Albert Einstein. So Einstein should not be able to monopolize this relationship despite his groundbreaking discovery. Similarly, no one should be able to patent the Pacific yew tree (Stephenson 2002). The tree was created by nature, regardless of whether an individual subsequently discovers that it is useful for treating cancer (ibid.). In a sense, the current inventorship criteria add computational inventions to the list of exceptions to patentable subject matter. Yet it is unclear that this should be the case, even if
subsequent discovery by a person renders the underlying invention patentable. Computational inventions do not have the same preemption concerns as the other exceptions because they do not tie up the basic concepts that serve as building blocks for technical work (except to the extent they would also be ineligible under the existing exceptions). Patents on computational inventions should not restrict innovation by third parties any more than do human inventions. A stronger argument for prohibiting computational inventions might be that they are akin to the existing exceptions in the sense that they are generally discovered rather than created. Products of nature rarely come with instruction manuals, yet no matter how brilliant and difficult it was to discover that Pacific yew can treat cancer, no one has the ability to patent the tree itself (though components of the yew tree isolated by individuals can be patented, such as paclitaxel, a therapeutic chemical). Likewise, computational inventions are not invented by an individual—there is no human ingenuity at the stage of invention itself. Perhaps a key difference is that computational inventions only exist thanks to human ingenuity. The Pacific yew tree was around long before any individual screened it for therapeutic activity. Hal only came about as a result of human effort. Computational inventions do not exist simply waiting to be discovered; they only come about as a result of scientific effort. That distinction is evident with regard to plant patents, which are possible for inventors who discover and asexually reproduce a distinct and new variety of plant, other than a tuber-propagated plant or plant found in an uncultivated state.41 Plant patents are limited to plants that only exist as a result of humans, even though it may be more difficult to discover an existing plant in a remote corner of the Amazon than to create a new plant. 
Computational inventions may be especially deserving of protection because computational creativity may be the only means of achieving certain discoveries that require the use of tremendous amounts of data. It has been argued that section 101 is a dynamic provision intended to cover inventions that were unforeseeable at the time of the Patent Act’s enactment.42 In the landmark 1980 case of Diamond v. Chakrabarty, the Supreme Court was faced with deciding whether genetically modified organisms could be patented. The Court held that a categorical rule denying patent protection for “inventions in areas not contemplated by Congress … would frustrate the purposes of the patent law.”43 Under that reasoning, computer inventorship should not be prohibited based on statutory text designed to prohibit corporate inventorship. If computer inventorship is to be prohibited, it should only be on the basis of sound public policy.

Concluding Thoughts

To the extent that the purpose of patent law is to incentivize innovation, it is likely that permitting patents on computational inventions and allowing computer inventorship will
accomplish this goal. Given the importance of these issues, there is a need for the Patent Office to publish guidance in this area, for Congress to reconsider the boundaries of patentability, and for the courts to decide whether computational invention is worthy of protection.

Acknowledgments

Thanks to Ralph Clifford, Hamid Ekbia, Dave Fagundes, Brett Frischmann, Yuko Kimijima, John Koza, Michael Mattioli, Lucas Osborn, Lisa Larrimore Ouellette, Cassidy Sugimoto, and Steven Thaler for insightful comments; Michelle Kubik and Shannon Royster for being outstanding research assistants; and Vincent Look for his expertise in computer science.

Conclusion

Cassidy R. Sugimoto, Michael Mattioli, and Hamid R. Ekbia

Where do we come from? What are we? Where are we going? The post-impressionist painter Paul Gauguin wrote these questions (in his native French) on his final masterpiece—a monumental canvas that depicts human existence as a dreamlike tableau. A baby in her swaddling sleeps in the morning sun; a couple walks together into the shadows; a solitary figure surveys a garden; a child sits and plays; an old woman is resigned to death. This meditation on the glory and mystery of life is at once primitive and disturbing. Perhaps it unsettles us because we know the painter’s three questions (which appear in the upper-left corner) are unanswerable. And yet we cannot help but search for answers. For the most part, religious thought, political philosophy, and scientific inquiry have been our guides on this quixotic journey, in various degrees over time. These avenues of thought have provided some answers, but they have also raised myriad new questions, some seemingly just as primitive as those Gauguin asked: How should we treat one another and govern ourselves? What are we capable of—individually and collectively? What challenges will tomorrow bring? In big data, we seem to have found a new way of pursuing these questions—a new guide on our journey. Some believe these new methods will bring us closer to the understanding we yearn for. Indeed, they have already led us to view ourselves and our future with startling accuracy. And yet perhaps unsurprisingly, big data has also prompted challenging new questions: What is information, and what is knowledge? What is the difference between fact and correlation? How does privacy define our sense of dignity? As we try to answer these questions, we will search for meaning within ourselves and in the world, debating and discovering, embracing the numbers, the science, the algorithm, the poetry—the cornucopia of big data.
In this conclusion we return to the thematic and analytic structure of the introduction to examine how the chapters in our book conform to or push the boundaries of previous literature in the domain of big data, and how this volume can guide us in addressing some of the pressing challenges and opportunities that lie ahead.



Perspectives

The perspectives laid out in the introduction represent various conceptualizations of big data by scholars, practitioners, journalists, lawyers, and other individuals involved in the study and use of big data. They provide us with ways of understanding big data. For convenience, we group the perspectives into four types: product oriented, process oriented, cognition oriented, and the social movement perspective. The following paragraphs discuss how the chapters in this volume confirm, extend, and refine these perspectives.

Product-Oriented Perspective

The product-oriented perspective focuses on data as objects with defining characteristics such as volume, velocity, and variety (e.g., Laney 2001; Gobble 2013). Almost all the chapters acknowledge that the scale of big data has created something novel. For example, in a hypothetical tale of computational invention, Abbott emphasizes how the expansive functionality of (artificially intelligent) “Hal” depends on a staggering volume of genomic and clinical data. Abbott demonstrates that information’s volume is the fundamental property that gives rise to the opportunities for invention. Cate argues similarly that individual consent becomes less meaningful as the breadth of data about our behavior becomes more widely available. Several chapters examine the role of data as a research object. Plale and her coauthors emphasize the importance of preserving and documenting the provenance of data across the research life cycle in order to lend credibility to data as well as to facilitate their sharing and reuse. Anderson reinforces these views, focusing his narrative on the necessity of sharing for replication studies. The pervasive assumption in the notion of data-as-research-object is that science benefits when data are carefully curated and made available to other researchers.
This is perhaps best illustrated through Contreras’s history of data aggregation in the genomic sciences, which opened new avenues for research. West and Portenoy take the relationship between data and science to a metalevel when discussing how we train people to deal with data-as-object. In this way, we move from data about science to data-as-science. Data are seen as instruments that can be strategically employed by corporations in Bailey’s chapter—data-as-asset, as it were. In this way, the corporate data product can be utilized to achieve strategic goals. This version of data-as-object gives rise to many concerns about stewardship and ethics. In their contributions, Cate and Markus, respectively, explain these concerns, highlighting the responsibilities that researchers and corporations owe individuals. Data on individuals can come in many forms: health records (Sholler, Bailey, and Rennecker), sensors (Burdon and Andrejevic), and social media (Alaimo and Kallinikos). As systems and various platforms standardize the collection of these data, a new type of research object is born: data-as-social-infrastructure. In this transformation, the structure of data challenges the experience of the social, bringing to light notions of technological fundamentalism as well as implications for privacy and data ownership.



One may argue that these concerns are nothing new—that is, that society has long grappled with issues of data privacy and ownership. The conceptual shift from a product-oriented perspective is the magnitude of scale and what this makes possible. We contend here that a new class of data becomes available—data-as-aggregate—that challenges traditional notions of privacy and ownership. Data in the aggregate emphasize the novelty introduced when data are amassed. DeDeo along with Ohm and Peppet, for example, are less concerned with the data per se than with the potentialities unleashed when data are aggregated, related, and analyzed. In this assembly, data become something new, absorbing new possibilities and requiring new policies. Data-as-aggregate-object challenges us to think of the relational—pushing us toward the process-oriented perspective.

Process-Oriented Perspective

The process-oriented perspective is less concerned about what the data are; instead, it concentrates on how we collect, access, use, and analyze these data. That is, it moves from attributes of the data to actions taken with and relationships among the data (boyd and Crawford 2011). Collection of data, for example, is fraught with technical, legal, and social challenges. This is perhaps most socially important when the data are collected about individuals. Shilton highlights the subtle but important differences between self-quantification data and personal data trails in terms of ownership and subsequent access to an individual’s data. Many of these data are collected as well as owned by proprietary companies that use terms of use and other consent agreements in data collection. Cate argues persuasively that consent is illusory in an era of big data, and Burdon and Andrejevic concur. Ownership in collection often raises similar (and sometimes politically charged) issues concerning data access.
Cassidy R. Sugimoto, Michael Mattioli, and Hamid R. Ekbia

Interoperability and sharing are key concerns around access, whether from scientific (Plale et al.; Contreras), medical (Sholler, Bailey, and Rennecker), or corporate (Markus) perspectives. Plale and her coauthors maintain that proper data management plans are critical for accountability and transparency in big data science, particularly given the diversity of data types, including images, numeric sensor data, model results, and computed values. Markus discusses the lack of coordination among the many participants in dealing with big data in the corporate setting. Sholler, Bailey, and Rennecker report on the lack of integration across electronic health systems. Plale and her coauthors, Markus, and Sholler, Bailey, and Rennecker all agree that the opacity of data is a serious concern for big data applications and techniques. Regulation and standardization offer one approach to this problem, as examined by Contreras in the context of genomic data. Yet as Markus relates, organizational data often involve many individuals and groups with “different goals, authorities, accountabilities, responsibilities, and values”—making coordination difficult. All the chapters express concerns over how to use big data. Sholler, Bailey, and Rennecker relate some positive aspects in the context of health—for instance, the power of big data to generalize treatments from large sets of case-specific data and provide faster, more efficient, patient-centered care. Abbott makes similar claims regarding the efficiency of algorithmically guided discovery in pharmaceuticals. Not all the authors view big data as a panacea, however. Anderson, for example, cautions that the emphasis on the statistical power of big data can give rise to “big data hubris.” Bailey reinforces this sentiment from the industry perspective, suggesting that small data frequently provide better answers, but are overlooked in a big data era. Big data, asserts Bailey, is a tool that should be strategically wielded by humans. The use of nonhumans for decision making is something explored cautiously (Bailey) and optimistically (Abbott) in this volume, highlighting the importance of analysis for big data processing. Advances in statistical modeling lead us to ask questions such as, “What if everything reveals everything?” (Ohm and Peppet). While this question may at first seem outlandish, it offers an irresistible thought experiment that follows naturally from the increasingly distant connections that big data is able to make. Ohm and Peppet consider how current laws depend on categorizations of information—categories that, in their thought experiment, are rendered suddenly without meaning. Privacy, antidiscrimination, and consumer protection laws become unmoored—as do our notions of what it means to be individuals, consumers, and citizens, defined and controlled by new machinery that we cannot comprehend. This thread is taken up by DeDeo, who examines the use of causal models in decision making as it relates to antidiscrimination. The authors reach consensus on the outcome: big data has several crucial social consequences when data are able to reveal everything.
Mechanisms for collecting data thereby begin to engineer how people can and should behave on platforms—as Alaimo and Kallinikos argue, big data has “platformed” sociality. The wide range of processes required by big data creates an added complexity in the education realm. West and Portenoy offer a perspective on the training that big data scientists should have. The authors note that many employers expect these individuals to know how to write and debug code, understand the topical area they are analyzing, manage others, and communicate effectively with team members. Performing these tasks requires a combination of skills from the computer and information sciences and business, domain expertise, and strong interpersonal skills. As Ekbia and his colleagues (2015, 1526) note, “By focusing on those attributes of big data that derive from complexity-enhancing structural and relational considerations, the process-oriented perspective seeks to push the frontiers of computing technology in handling those complexities.” Yet in pushing the technical boundaries, we often return to social boundaries, as DeDeo discusses in his chapter on algorithmic decision making and discrimination. This requires that professional training incorporate ethics and a consideration of the social (Markus). Such analysis challenges us to think from the cognition-oriented perspective of big data.



Cognition-Oriented Perspective

The cognition-oriented perspective captures key insights about big data along with the relation between big data and the human capacity for understanding. One assumption of this perspective is that the quantity of big data exceeds what a human mind can process (Weinberger 2012): the cognition inadequacy premise. As Shilton writes, data about individuals are now “more than a person might reasonably be able to track and reason about without visualization techniques or, frequently, computing power.” For example, Alaimo and Kallinikos note the roughly one hundred thousand factors considered when computing what to display to a user on their Facebook page. This excess of data over human cognition can be interpreted through positive or negative lenses. Recognizing the potential, Burdon and Andrejevic state that “big data technologies are analytic prostheses: they make it possible to make sense out of information in new ways.” Other authors, however, argue that this data deluge places “an untenable burden on individuals” (Cate), who are in no position to make decisions about the vast amounts of data that they produce and interact with each day. Furthermore, there has been considerable concern that the advances in big data methods herald the demise of all other types of knowledge generation. From a scientific vantage point, Plale and her coauthors assert that we are in a fourth paradigm of science—the data paradigm—which turns all sciences into information sciences (an assertion reinforced in Contreras’s narrative). But Anderson warns us against the “siren song of data” and blind faith in data, which can be spurious or incorrectly analyzed. Bailey—senior economic research scientist at Facebook—argues for a middle ground between human intuition and the use of data.
As he notes, “even if intuition is in the driver’s seat for strategic decision making, data can give you a huge advantage over those who use intuition alone.” Other perspectives focus on the capabilities of big data to harness inferences within the data. Abbott takes us into the world of Hal the innovator—a supercomputer with access to vast amounts of standardized health information and the cognitive power to draw meaning from it all. Abbott asserts that such inference would be impossible for humans: “computational creativity may be the only means of achieving certain discoveries that require the use of tremendous amounts of data.” As noted earlier, Ohm and Peppet explore the theme of inference as well, concluding that it is not far-fetched to imagine that any item of personal information could reveal “all” about a person. Several authors express concern over how reliance on big data may reduce human autonomy in making important decisions. Sholler, Bailey, and Rennecker, for instance, look at how “the use of big data constitutes a reductionist approach to medicine that elevates ‘the development of the science base of medicine at the expense of medicine’s essential humanism.’” DeDeo analyzes the ethical dimensions of inference and prediction, with a special focus on the discriminatory powers of algorithms. As DeDeo notes, “Machine learning gives us new abilities to predict—with remarkable accuracy, and well beyond the powers of the unaided human mind,” yet “if we do not know the models of the world on which our algorithms rely, we cannot ask if their recommendations are just.” The arguments here suggest that ethical debates must be supplemented by an understanding of the mathematics of prediction, and urge that data scientists and statisticians become more familiar with the nature of ethical reasoning in the public sphere (Markus; West and Portenoy).

Social Movement Perspective

Ekbia and his colleagues (2015, 1527) found a lack of literature that took into account the “socio-economic, cultural, and political shifts that underlie the phenomenon of Big Data, and that are, in turn, enabled by it.” The authors in this volume embrace such a social movement perspective. Focusing on the individual order, Shilton shows us that privacy concerns are more poignant when we consider how they affect people personally. Looking at two types of data—those intentionally gathered by (and typically made available to) the individual, and those gathered about the individual by a third party—she examines the normative function of engagement in data practices, where those who “choose not to participate risk becoming nonconforming or illegitimate.” As she argues, participation itself becomes a form of “responsible citizenship” and encourages “a move toward an increasingly panoptic society, in which the quantification, collection, and analysis of personal data are required to be normal, healthy, or compliant.” Or as Alaimo and Kallinikos state, a “good user is a user who shares.” Alaimo and Kallinikos demonstrate how social media encode social order. Contending that this “encoded and computed, as it were, sociality constitutes the core platform of everyday life, and by extension, social data generation and use,” they consider the ways in which the infrastructure of social media is normative—that is, the way in which it dictates the expression and enactment of sociality.
One could assert that in formalizing curricula around this infrastructure (West and Portenoy), we codify and further normalize this computed sociality. What, then, does this mean in an environment where computers are creative? What, then, is the preferred social order? What is the relationship between artificial intelligence and humanity? Abbott and Bailey consider these questions, and each provokes us to ponder the relationship between human intelligence and computational creativity. On a related theme, one must ask, What is the scientific order in a data paradigm? (Plale et al.). Are we, as Anderson questions, at the end of theory? What if, as Ohm and Peppet ask, data reveal everything? When we rely on machines, we can exacerbate preexisting inequalities and create new ones: “decision makers may unknowingly violate their core beliefs when they follow or are influenced by a machine recommendation” (DeDeo). What is the moral order in this new computed environment? Who is in charge, asks Markus, and who takes responsibility for data harms, asks Cate? The relationship between the exponential growth of data and the concentration of wealth and capital in global markets has raised questions regarding the degree to which big data promotes a new economic order. Exacerbating this issue is the opaque character of the relationship between data sources and data collectors—an opaqueness that stems from the ubiquitous, complex, and nontransparent nature of the technologies. Burdon and Andrejevic introduce the notion of a “sensor society”—that is, “a society in which a growing range of spaces and places, objects and devices, continuously collect data about anything and everything.” Arguing that there is a growing inability for users to understand what is gathered about them, they show how data mining along with the statistical analyses associated with it are “opaque and asymmetrical,” and create an imbalance in wealth and power between those from whom data are collected and those who collect, use, and profit from these data. Bailey, Anderson, Cate, and Sholler, Bailey, and Rennecker (from their perspectives in corporations, science, law, and health, respectively) urge us to be cautious in accepting a new world ordered around big data.

Refining Perspectives

This volume provides us with an opportunity to further refine and enhance existing conceptualizations of big data. Perhaps the most striking difference between previous literature and our volume is the lack of exclusivity in conceptualizations—that is, earlier work placed big data into a singular perspective while these chapters did not. This suggests that the field has matured to a point where multiple conceptualizations of big data can flourish within a single analysis. Big data is not one thing but rather all things: it is a product and a process, a mode of cognition and a computerization movement. These perspectives need not be exclusive categories, and the interweaving of these perspectives across all units of analysis—individual, social, scientific, and organizational—further exemplifies the titular thesis of this monograph. Our chapters demonstrated how multifaceted the product-oriented perspective could be.
Specifically, the chapters explored various types of data objects, such as research, asset, and social infrastructure. Data-as-aggregate was a theme that permeated all the chapters, as individuals, society, science, and organizations seek to enhance interoperability as well as deal with the resulting ethical and political concerns that are raised in an aggregated data environment. The chapters elaborated on the key processes of big data—namely, collection, access, use, analysis, and education—and examined the ramifications of these processes. At the individual level, there was considerable stress on access to personal data; collection and use were major areas of concern at the societal level; and analysis and education were the focus of the organizational and scientific chapters. Four main premises emerged from the cognition-oriented perspective: the cognition inadequacy premise, the theory of a new data paradigm, the implications of inference, and the loss of autonomy in decision making. The chapters struggled with some of the key issues in computation: What is lost when computation improves? When does machine intelligence surpass human intelligence? For which decisions should we rely on machines? What are the unforeseen consequences of a data turn in society? Ekbia and his colleagues (2015) proposed a fourth conceptualization—the social movement perspective—that was previously unsupported in the literature. The chapters in this volume have readily adopted this perspective and further defined it, demonstrating that big data has brought a new order across domains—that is, a new individual, social, scientific, moral, and economic order to the world.

Moving from Dilemmas to Recommendations

The introduction illustrated dilemmas presented by big data across various dimensions: epistemological, methodological, ethical, legal, and political. In these final paragraphs, we examine, along our thematic axis, how the chapters in this book point toward various solutions to these problems.

Individual

The chapters concerned with how big data affects individuals investigated legal rights and protections from various angles, concentrating chiefly on issues of privacy, data access, and antidiscrimination. As Cate neatly summarized it, “The challenge we face literally around the world is to evolve better, faster, and more scalable mechanisms to protect data from harmful or inappropriate uses, without interfering with the benefits that data are already making possible today and promise to make even more widespread in the future.” Shilton stresses the utility of personal informatics practices—where the quantification of self potentially yields benefits for emotional and physical health as well as personal efficiency and productivity. This happy result relies, however, on individuals having access to their personal data. Unfortunately, the vast amount of personal data is owned and managed by the various corporations with whom we interact. The self-serving purposes for which these corporations use our data, and the fact that they routinely use them without our express authorization, are worrisome from a privacy perspective. But the concerns run deeper than encroachment on the right to privacy.
Employers, insurers, and the government may increasingly assume that given the volume of available data, these data tell the entire story about an individual—they become, in essence, an identity. Yet the current policy framework lacks adequate protections for the individual to seek redress when that identity is inaccurate. This is particularly problematic as we move from descriptive to prescriptive statistics (Desrosières 2002), as DeDeo demonstrates. The increasing use of algorithmic decision making does not necessarily eliminate human prejudice, DeDeo argues; rather, it may reinforce and perpetuate preexisting inequalities. Several recommendations emerge from the chapters as to how to fully protect and engage individuals. Shilton advocates “rights-oriented” policies, which take into account the “right of data subjects to access their own data and become involved participants in research.” Such an approach is dependent on the public’s knowledge of and engagement with big data discourse and policy making. Yet as Burdon and Andrejevic contend, the public does not have adequate knowledge of big data to make informed decisions. Therefore, protection, in many ways, may need to begin with better public education. DeDeo’s recommendations are similarly pedagogical. He maintains that debates on ethics—previously the province of lawyers, politicians, and philosophers—now require a firm understanding of mathematical and computational thinking. He suggests that data scientists and statisticians will be increasingly required to have knowledge of and training in ethics—a recommendation reinforced in other chapters across the volume (e.g., Markus as well as West and Portenoy). Cate argues, however, for removing the reliance on individuals along with their knowledge and notions of consent—placing the responsibility for data stewardship and reliability instead on data users. Regulators, industry, academics, and individuals, Cate asserts, should be involved in creating a framework of data harms and should have transparent opportunities for redress, but individuals should not be held responsible for self-protection.

Social

The main theme of the chapters on sociality is that big data has created an infrastructure that platforms sociality—guiding, observing, and predicting social behaviors in ways that raise several ethical and legal dilemmas. Furthermore, the ownership of these data by various corporate entities—and the increasing embeddedness of a sensor society (Burdon and Andrejevic)—highlights the concerns around political economy raised in the introduction. There is a distinct deterministic flair to these chapters: Alaimo and Kallinikos argue that social media platforms “conceive, instrument, and ultimately refigure social interaction and social relations”; Ohm and Peppet speculate that big data can inform and possibly create a new algorithmically informed identity; Burdon and Andrejevic suggest that we have entered a world of new “technocratic structures and authority.” There is, understandably, a sense of unease with this future.
Although all the authors acknowledge the potential opportunities in a world in which everything reveals everything (Ohm and Peppet), they are concerned about the implications of allowing technological platforms to dictate identity and sociality. This is perhaps most readily apparent on social media platforms, as Alaimo and Kallinikos discuss. The implication from Alaimo and Kallinikos is that the first step to moving forward in this new society is for the public to acknowledge the “apparatus of social media” and the degree to which these platforms produce a new type of data, owned and controlled by corporations. Burdon and Andrejevic reinforce this, proposing that the public should be brought into the decision-making processes, but that people first need a greater understanding of the changing nature of data collection and ownership in a big data era. Nevertheless, Burdon and Andrejevic suggest that “emerging regulatory regimes” will be brought to control this new sensor society. Ohm and Peppet believe that laws—particularly those designed to preserve individual privacy and prevent discrimination—need to be reconfigured. Yet they are concerned that regulating has become more complicated as big data “blurs the lines between contexts.” Simply put, laws that picture big data as monolithic will be inherently under- or overinclusive, and lead to serious asymmetries in power and unforeseen types of discrimination (Ohm and Peppet; DeDeo). Therefore, the construction of new regulatory frameworks must involve an informed public and take into account the heterogeneous nature of big data.

Scientific

The scientific chapters took two perspectives on data: data as research objects and data as scientific methodology. These themes are perhaps merged in the argument that big data is a new approach to scientific inquiry in which data collection and mining alone—that is, without theories—is a legitimate form of scientific inquiry (Plale et al.). Anderson cautions against overreliance on big data, summarizing the unintended consequences for science when we ignore the power of small data, emphasize quantity over quality, substitute data for reality, and forget that data don’t interpret themselves. He underscores the bias that is always inherent in data—as Lisa Gitelman and Virginia Jackson (2013) assert, “‘Raw data’ is an oxymoron.” That is, the way in which data are collected, structured, and analyzed always involves some human decision making, and thus is susceptible to human bias. Anderson recommends “a reasonable level of skepticism” for the use of big data in decision making. Part of this heightened skepticism, Anderson maintains, is questioning issues around big data such as provenance. Plale and her coauthors take up the issue of data provenance in full. They demonstrate how the use of trust threads in data science can create a provenance record, thereby increasing the trustworthiness and interpretability of the data.
Having this mechanism in place, they assert, is a “critical element to the successful sharing, use, and reuse of data in science and technology research in the future.” This is especially important given the rising rates of both collaboration (Gazni, Sugimoto, and Didegah 2012) and interdisciplinarity (Larivière and Gingras 2014), which necessitate trust as data flow across domains and research teams. Concerns around data sharing are not novel for science, but they are complicated and extended by the rise of big data (for an extensive review of the relationship between big data and scholarship, see Borgman 2015). In the most extreme cases of heightened collaboration and interdisciplinarity, there may be a role for institutions to provide infrastructures for research programs. Contreras describes the role of the state in the generation and management of genomics data, and the subsequent public benefits that have been derived. Although he acknowledges that there have been several controversies around these data, Contreras argues that the episode could serve as a case study to inform future policy development. West and Portenoy focus on the educational rather than legal infrastructure around big data. They see an educational infrastructure as critical for harnessing the power—ethically and technically—of big data. West and Portenoy ask educational institutions to find a precise balance between “chasing every new trend[,] … diluting their brand and wasting curricula development on skills that become obsolete before the first day of class,” and formalizing necessary training for technological practices that have critical implications for the economy and society. They express concern, however, regarding whether large, formal educational institutions will be able to keep pace with the more nimble alternative options that exist. West and Portenoy express some doubt as to whether data science will emerge as a distinct discipline or will diffuse across domains. There are many trappings to disciplinary paradigms (Sugimoto and Weingart 2015): scientific disciplines form societies, publish in certain venues, and have curricular offerings, for example. The disciplining of data science is described, to some degree, in West and Portenoy’s chapter. Yet there are several challenges faced by this nascent area of inquiry. One is that the heterogeneity of the field makes finding a core or developing a canon difficult. Another is the difficulty of retaining people in academe—to continue training and developing both research and curricula in the area—given the high employment prospects for graduates (West and Portenoy). This problematizes the development of a clear epistemological paradigm, and the coherence and exchange of big data methods. Without this, methods are controlled by corporations instead of disseminated within educational institutions, changing the balance of power in this scientific area. As discussed in the introduction, our contemporary notions of what constitutes appropriate theories and methods have been challenged by the ascendance of big data, which has arguably transformed many disciplines into quasi-data sciences. The methodological concerns are vast, however, particularly when we examine the growing emphasis on correlational analyses and predictions to inform our understanding of the world.
Adequate training will be necessary to avoid decision making on the basis of spurious correlations or “apophenia”—that is, “seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions” (boyd and Crawford 2012, 668). More than a decade ago, Alain Desrosières (2002) argued that we should not see statistics as “indisputable” objects and facts. This sentiment remains critical as we move from the large numbers of the past to the exponentially larger numbers of our future.

Organizational

The organizational chapters examined the heterogeneity of data, and navigated the tensions between human and automated decision making. Recommendations were fraught with ethical concerns. For instance, Markus as well as Sholler, Bailey, and Rennecker lament the lack of interoperability that plagues the adequate sharing and use of data. They also acknowledge, though, that the standardization of data for interoperability can be highly problematic when it fails to take into account the needs and perspectives of all stakeholders. In many cases, such consensus is simply impossible. The default for many organizations is to standardize in ways that are incentivized financially or that lead to the greatest organizational optimization, even when this presents potential harms for the individual (Markus).
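The warning above about spurious correlations and apophenia lends itself to a quick demonstration. The sketch below uses only synthetic random numbers (all sample sizes and variable counts are invented for illustration, not drawn from any chapter's data): every candidate variable is pure noise, yet the strongest coincidental correlation with an equally random target grows as more variables are searched.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def max_chance_correlation(n_obs, n_features, seed=0):
    """Largest |r| between a random target and n_features unrelated random columns."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_features):
        col = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(col, target)))
    return best

# The more unrelated variables we examine, the stronger the best
# purely coincidental correlation looks.
for k in (10, 100, 1000):
    print(k, round(max_chance_correlation(50, k), 2))
```

Because every column is noise by construction, any correlation the search turns up is spurious; the maximum nonetheless climbs steadily with the number of candidates examined, which is precisely why connections that "radiate in all directions" invite overinterpretation at big data scale.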



This administrative quantification of life and society is far from novel. As Desrosières (2002, 9) observed, “Statistics derives from the recombining of scientific and administrative practices that were initially far apart. … All this involves the establishment of general forms, of categories of equivalence, and terminologies that transcend the singularities of individual situations, either through the categories of law (the judicial point of view) or through norms and standards (the standpoint of the economy of management and efficiency)” (emphasis added). Markus as well as Sholler, Bailey, and Rennecker, however, echoing what was presented in the individual chapters, warn of the potential data harms that can occur when singularities are ignored, and advocate for better ethical considerations in the construction of big data systems. Concerns around ethical design are perhaps most poignant when the discussion moves toward automation. The ability to harness the power of big data for decision making generally and prediction specifically has abundant appeal. For example, Abbott persuasively demonstrates what is possible when a machine is able to traverse a vast web of interrelated data to engineer new and better pharmaceuticals. Bailey, though, is not ready for wholesale reliance on automation. He suggests that appropriately programmed, artificially intelligent systems could make decisions in alignment with the ethics and ideology of the corporation. Yet human intelligence is still needed for appropriate programming—that is, for the objective prioritization and assessment of the moral and cultural consequences as well as the risk associated with the decisions. But such movements will need buy-in from practitioners in the domain of application.
Sholler, Bailey, and Rennecker, for instance, describe the hesitancy of physicians to give over even marginal degrees of autonomy to big data systems, and the tactics taken to circumvent these systems even in the presence of federal mandates. The lack of a developed policy framework to govern big data is a repeated concern. In regard to computational creativity, Abbott challenges the US Patent and Trademark Office to issue guidance, and Congress to reconsider the boundaries of patentability in an era in which artificial intelligence is challenging our notions of creativity and innovation. Bailey urges corporations to determine whether an artificially intelligent system would be considered an employee and thereby subject to employee regulation (e.g., noncompete contracts). Yet regulation is not necessarily a panacea. Markus describes the several conflicting and incomplete rules that govern data management in organizations, and leaves us with a simple formula for policy making: “For every new rule added, two older rules should be stricken from the books.” Moreover, ethical and appropriate large-scale regulation is dependent on the willingness of various stakeholders to enter into serious philosophical conversations about the future of computation. There remains a good deal of uncertainty regarding the degree to which we want to give our future over to automation, and what role human intervention and creativity will play in shaping that future. How to balance the capabilities of human and technological creativity is the critical dilemma facing organizations—and society.



Summary

The ascendance of big data has raised pressing policy questions across many domains of human endeavor, from health care, to science, to commerce. These immediate challenges touch on issues of individual privacy; intellectual property and data ownership; research evaluation and incentive structures in science; national security; and public safety. As the technologies and methodologies of big data evolve, so too do our expectations of how data are collected and used. As a result, policy makers must negotiate a foreign and ever-shifting terrain. The chapters in this volume reveal the variety of challenges associated with big data, demonstrating the ways in which big data is, contrary to widely held perceptions, not monolithic. We believe that understanding this heterogeneity—the many types of data being used, the variegated sources from which they flow, how this information is gathered and analyzed, and the many shades of truth and power such data can reveal—is a necessary first step in thinking about the problems big data presents. This volume is a foundation for discussions among practitioners, policy makers, and the public about the nature of big data along with the promises and perils it holds.


Chapter 3

1. Note that the use of physical fitness, for example, does not mean we select students who are necessarily more fit instead of more likely to graduate; rather, we can improve our original selection goal—graduation rate—by use of subtle signals that include physical fitness.

2. For the role of causality in legal reasoning, see, for example, Moore 2010; Honoré 2010. More broadly, a key reason for the analysis of causation in general is its role in ethical concepts such as responsibility (Schaffer 2014).

3. Many machine-learning algorithms are designed to classify incoming data—for example, mortgage applicants by the degree of risk. They work by finding patterns in past data, and using the relative strengths of these patterns to classify new data of unknown type. “Deep learning” combines these two steps, simultaneously learning patterns and how they combine to produce the property of interest.

4. As with many information theoretical quantities, “maximally indistinguishable” has a technical definition that accords with intuition. Imagine a monitor, or critic, attempting to accumulate evidence that the decision maker is relying on a procedure other than that dictated by Pr(S,V). The decision maker, meanwhile, has some set of constraints that prevent them from using Pr(S,V)—in this case, equation 1. When the decision maker is using the maximally indistinguishable distribution, the critic will accumulate evidence in favor of their hypothesis at the slowest possible rate. A number of other interpretations of KL divergence exist; for a gentle introduction to information theory, see DeDeo 2015.

5. The Kullback–Leibler divergence has the property of becoming ill defined when Pr(S,V) is equal to zero but the decision maker’s distribution is not. The ethical intuitions that led to the imposition of equation 1, however, do not apply when Pr(S,V) is precisely zero.
This perfect knowledge case implies a different epistemic structure: it is necessarily true—as opposed to simply probable—that a certain group cannot have outcome S. Rather than the example of organ transplants or graduation rates, a better analogy is the provision of prenatal care. No notion of justice suggests that fair treatment requires equal resources to test both men and women for pregnancy. Correct accounting for these exceptions is easily accomplished, so that an agency can exclude men from prenatal care but, using the methods of this section, provide it optimally to women while preventing nonuniform allocation by race, religion, or social class.
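The critic's evidence-accumulation interpretation of KL divergence in note 4 can be made concrete with a short numerical sketch. The distributions below are invented for illustration and are not from the chapter; only the general property (smaller divergence means slower detection by the critic) is the point.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in nats: the asymptotic rate at which a critic observing
    samples from P accumulates evidence against the hypothesis that the
    samples came from Q. Ill defined when some q[i] == 0 while p[i] > 0,
    the situation discussed in note 5."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative joint distributions over three (S, V) outcomes:
# pr stands in for the dictated distribution Pr(S,V); q_near barely
# deviates from it, q_far deviates strongly.
pr     = [0.50, 0.30, 0.20]
q_near = [0.48, 0.32, 0.20]
q_far  = [0.20, 0.20, 0.60]

# The smaller the divergence, the more slowly the critic can distinguish
# the decision maker's distribution from Pr(S,V); the "maximally
# indistinguishable" constrained distribution is the one minimizing it.
assert kl_divergence(q_near, pr) < kl_divergence(q_far, pr)
```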



Chapter 4

1.  15 U.S.C. § 1681 et seq.
2.  Certainly during the last one hundred years—long before big data—we have worked to reduce and eliminate information deficits; the rise of credit scoring and credit agencies is a powerful example of precomputerized attempts to individualize or personalize risk.
3.  45 C.F.R. § 164.514.
4.  45 C.F.R. § 160.103.
5.  15 U.S.C. § 6801 et seq.
6.  18 U.S.C. § 2710, 2721.
7.  15 U.S.C. § 1681 et seq.
8.  45 C.F.R. § 160.102, 164.500.
9.  See 42 U.S.C. § 2000e–2(a) (2012).
10.  42 U.S.C. § 12112(a) (2012), 2000ff–4(a).

Chapter 6

1.  In their seminal contribution, danah boyd and Nicole Ellison (2008, 2) referred to social media as social networking sites and defined them as “web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system.” This subsequently widely adopted view stressed the centrality of social media users at the expense of the structural attributes of social media themselves, and the ways such attributes shape the premises of user platform participation that we point out in this chapter. Our depiction of social media is closely tied to a transition from social networks to platformed sociality (van Dijck 2013)—that is, a sociality that in essential respects is shaped by the apparatus of social media as well as the wider technological and business context within which such apparatus is embedded.
2.  We refer to this as “computed sociality” to convey the idea of the connections, affinities, and interactions that users establish with others on the basis of the platform's scores and recommendations.
3.  We intend the term “mediation” in its strong Vygotskian meaning. In that sense, mediation shapes what it mediates and thus can be seen as a form of constitution (Vygotsky 2012).
4.  For this definition, see “Social Data Glossary,” Gnip (accessed November 5, 2014). Gnip, Inc., a social media API aggregation company, was purchased by Twitter in April 2014.



5.  The use of unstructured data is still problematic. As Rob Kitchin (2014, 6) reports, unstructured data are currently growing at fifteen times the rate of structured data.
6.  For more on the rise of data science programs in the higher education sector, see West and Portenoy, this volume.

Chapter 7

1.  Strictly speaking, administrative agencies such as the NIH are not permitted to lobby Congress in support of legislation. Yet numerous senior NIH officials and researchers played a significant role in providing information and testimony supporting the passage of the Genetic Information Nondiscrimination Act of 2008.

Chapter 8

1.  SEAD, (accessed February 5, 2016).

Chapter 9

1.  A cohort study starts with an unaffected population, divides it into one cohort exposed to a disease or treatment and another that is not, and then follows both cohorts over time. Case-control studies identify individuals already affected and then match controls to them retrospectively. Cohort studies are superior for studying disease causation; case-control studies can suffer from hidden biases. Different statistical techniques are used depending on the study design.

Chapter 10

1.  A note on methods: After searching the web for information on degree programs being offered in data science, we settled on using the data set offered by North Carolina State University (http://, which claims to be a comprehensive list of “programs offering graduate degrees in Analytics or Data Science (or closely similar) at universities based in the U.S.” This list includes information on the duration and estimated cost of programs. We gathered information about the curricula of these data science programs by examining the websites and promotional materials of a subset of these programs. As the field matures, more exacting methods for program tracking will be possible.
2.  These numbers are in rapid flux, and in fact changed substantially between versions of this chapter.
3.  ; (both accessed February 6, 2016).
4.  ;; https://docs.python.org/2/tutorial/ (all accessed February 7, 2016).
5.  (accessed February 7, 2016).



6.  (accessed February 7, 2016).
7.  (accessed February 7, 2016).
8.  (accessed February 7, 2016).
9.  Microsoft Research (2013), the research division of Microsoft, is one notable exception to this. It touts an “open academic model,” and prioritizes collaboration with partners in academia, government, and industry.
10.  (accessed February 7, 2016).
11.  Students have been graduating with degrees in machine learning for some time.
12.  ;; (all accessed February 7, 2016).


Chapter 11

1.  For more information, see (accessed February 7, 2016).
2.  Business process outsourcers are companies that provide business services for other organizations, such as accounting services, human resources management services, and so on.
3.  Appropriate data use and protection clauses have become an increasingly important part of the contracts that govern relations between organizations (Markus and Bui 2012).
4.  A major revision of this act is expected shortly.
5.  “Cyber threats” is third on their list of top ten priorities. “Privacy/identity management and information security/system protection” is seventh (North Carolina State University’s ERM Initiative and Protiviti 2015).
6.  (accessed February 8, 2016).
7.  (accessed February 8, 2016).

Chapter 12

1.  This is only a cursory treatment of the use of big data in business. For a more comprehensive treatment, see “Data, Data Everywhere” 2010.
2.  There are several algorithms for converging on the optimal payoff distribution, often trading off lengthy examination of each arm (explore) versus quick convergence (exploit). Several techniques from the field of online machine learning ( [accessed February 9, 2016]) can be adapted to this problem.
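The explore/exploit trade-off in note 2 can be illustrated with epsilon-greedy, one common multi-armed bandit strategy. The chapter does not name a specific algorithm, and the payoff probabilities below are invented for illustration.

```python
import random

def epsilon_greedy(payoff_probs, steps=10000, epsilon=0.1, seed=0):
    """With probability epsilon pull a random arm (explore); otherwise
    pull the arm with the best observed average payoff (exploit)."""
    rng = random.Random(seed)
    n_arms = len(payoff_probs)
    pulls = [0] * n_arms  # times each arm was tried
    wins = [0] * n_arms   # payoffs observed per arm
    for _ in range(steps):
        if rng.random() < epsilon or all(p == 0 for p in pulls):
            arm = rng.randrange(n_arms)  # explore
        else:
            # exploit: highest observed average payoff so far
            arm = max(range(n_arms),
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0)
        pulls[arm] += 1
        wins[arm] += 1 if rng.random() < payoff_probs[arm] else 0
    return pulls

# Hypothetical arms: the last pays off most often, so over time
# exploitation should concentrate most pulls on it.
pulls = epsilon_greedy([0.02, 0.05, 0.20])
assert pulls[2] > pulls[0] and pulls[2] > pulls[1]
```

A larger epsilon spends more steps examining every arm (explore); a smaller one converges faster on the current best guess (exploit), which is exactly the trade-off the note describes.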



Chapter 14

1.  35 U.S.C. 100(f) (2012): “The term ‘inventor’ means the individual or, if a joint invention, the individuals collectively who invented or discovered the subject matter of the invention.”
2.  US Constitution, art. I, § 8, cl. 8.
3.  See, for example, Silvestri v. Grant, 496 F.2d 593, 596, 181 U.S.P.Q. (BNA) 706, 708 (C.C.P.A. 1974) (“an accidental and unappreciated duplication of an invention does not defeat the patent right of one who, though later in time was the first to recognize that which constitutes the inventive subject matter”).
4.  Artificial intelligence may be most successfully implemented when focusing on specific subproblems where it can produce verifiable results, such as computer vision or data mining (e.g., Russell and Norvig 2010). Computer vision is a field in which software processes and analyzes images, reducing the input to numerical or symbolic information that is then used to make decisions. More specifically, “computer vision aims at using cameras for analyzing or understanding scenes in the real world. This discipline studies methodological and algorithmic problems as well as topics related to the implementation of designed solutions” (Klette 2014). Similarly, data mining software utilizes artificial intelligence, machine learning, statistics, and database systems to process and make sense of vast amounts of data (Chakrabarti et al. 2006).
5.  Computers before Watson have been creative. In 1994, for example, computer scientist Stephen Thaler disclosed an invention he called the “Creativity Machine,” a computational paradigm that “came the closest yet to emulating the fundamental brain mechanisms responsible for idea formation” (“What Is the Ultimate Idea?” n.d.). The Creativity Machine has created artistic and inventive works that have received patent protection (Thaler 2013, 451).
Watson is a cognitive computing system with the extraordinary ability to process natural language, generate and evaluate hypotheses based on the available data, and then store and learn from the information (“What Is Watson?” n.d.). In other words, Watson essentially mirrors the human learning process by getting “smarter [through] tracking feedback from its users and learning from both successes and failures” (ibid.). Watson made its notable debut on the game show Jeopardy!, where it defeated Brad Rutter and Ken Jennings using only stored data by comparing potential answers and ranking confidence in their accuracy at a rate of approximately three seconds per question (ibid.).
6.  While a seemingly tremendous amount of data, it is a small fraction of the data actually being used in the Sentinel Initiative. The FDA Amendments Act of 2007 led to the introduction of the federal Sentinel Initiative, which pioneered the first successful long-term secondary use of electronic medical data to assess drug safety. Public Law 110–85 was signed into law in September 2007 (Title IX, Section 905; see also Abbott 2013; Department of Health and Human Services 2008). The Sentinel Initiative pilot program has succeeded in gaining secured access to over 178 million patients’ health care data to create a national electronic safety surveillance system, far exceeding its goal of reaching 100 million patients by July 2010 (Woodcock 2014; see also Department of Health and Human Services 2011).



7.  For example, Pfizer, the largest pharmaceutical drug manufacturer in the United States, recently announced a partnership with 23andMe, the leading consumer genomics and biotechnology firm (Chen 2015; Hunkar 2011; Lumb 2015). This partnership will give Pfizer access to anonymous, aggregated DNA data and granular personal information of approximately 650,000 consenting 23andMe consumers who had purchased a mail-in saliva test used to trace their genetic ancestry over the last seven years (Chen 2015). This information may allow Pfizer to discover connections between genes, diseases, and traits more quickly, and thus accelerate the development of new treatments and clinical trials (ibid.). Although the cost to Pfizer for the data remains undisclosed, a similar deal with Genentech for Parkinson’s research was reported to cost $10 million up front and as much as $50 million total (ibid.). The demand for 23andMe’s data does not stop with Pfizer and Genentech; 23andMe CEO Anne Wojcicki announced at the January 2015 J.P. Morgan Health Care Conference that 23andMe has signed twelve other genetic data partnerships with both private companies and universities (Sullivan 2015). Pharmaceutical-biotechnology partnerships are part of an emerging big data trend (Rosenberg, Restaino, and Waldron 2006). Such alliances offer both parties a competitive advantage: pharmaceutical companies gain access to rapidly developing science and innovative products, while biotechnology companies obtain the capital necessary to move through the development process (Sullivan 2015). In fact, some biotechnology business plans include these alliances as a critical component for success (Rosenberg, Restaino, and Waldron 2006). Shared information and capital lead to “less expensive early stage deals” that historically may not have been contemplated due to the high risk involved, thereby resulting in developments that would never have been realized but for the alliance (ibid.).
8.  
Hal would be a multithreaded application. Each thread would be a different sequence of instructions that could execute independently, allowing Hal to perform tasks concurrently (Lewis and Berg 1996). Hal might be programmed to run and manage hundreds of different tasks. Hal would also be event driven. As defined by Frank Dabek and his colleagues (2002, 186), “Event-based programs are typically driven by a loop that polls for events and executes the appropriate callback when the event occurs.” In other words, it would respond to certain external events or triggers that it is monitoring. These events can be user interface inputs, news or Internet driven, or activated by the addition of a new database or modification to an existing database. As an AbboStatin patent nears expiry, for example, this could trigger Hal to run algorithms to see if there are any new applications for AbboStatin. Hal would react to input from the outside world via the Internet, as well as input from its running tasks and historical stored data that Hal has kept in memory, to make modifications to itself or change its behavior when necessary. Consider a scenario for how Hal could solve data-formatting and data integrity issues: Hal’s database-sorting thread (a sequence of instructions that handles all database-sorting logic and algorithms) returns data to Hal’s managing thread (Hal’s main thread that directs other threads and makes top-level decisions), signaling that it is unhappy because of a formatting issue. The warning specifies that too many database clinical entries have nonmatching fields. As a result, other algorithms cannot compare apples to apples, and thus cannot run as smoothly. Hal’s managing thread hands this problem off to Hal’s warning handler (another thread), which is programmed to look in its database to adopt a strategy to resolve the issue. Hal decides the best course of action is to reformat, so it evaluates existing databases to determine an optimal organization.
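The polling loop that Dabek and colleagues describe, with worker threads reporting events to a managing thread that dispatches callbacks, can be sketched in Python. The event names and handlers here are illustrative, not from the chapter.

```python
import queue
import threading

# Worker threads post events to a shared queue; the managing loop polls
# the queue and executes the callback registered for each event type.
events = queue.Queue()

def handle_formatting_issue(detail):
    return f"reformatting databases: {detail}"

def handle_integrity_issue(detail):
    return f"checking suspect entries: {detail}"

# Callback table: the manager looks up the handler for each event type.
handlers = {
    "formatting": handle_formatting_issue,
    "integrity": handle_integrity_issue,
}

def sorter_thread():
    # A worker (e.g., the database-sorting thread) reporting problems.
    events.put(("formatting", "nonmatching fields"))
    events.put(("integrity", "out-of-range values"))
    events.put(("stop", None))

def manager_loop():
    # The event loop: block for the next event, dispatch its callback.
    log = []
    while True:
        kind, detail = events.get()
        if kind == "stop":
            return log
        log.append(handlers[kind](detail))

t = threading.Thread(target=sorter_thread)
t.start()
log = manager_loop()
t.join()
```

The queue decouples the threads: workers never call handlers directly, they only describe what happened, and the managing thread decides how to respond, mirroring the hand-off between Hal's sorting thread, managing thread, and warning handler.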
Then Hal opens an off-the-shelf database software application, and gives it input commands that describe to the database software what the size of the database is and what the fields are for each entry. Hal has just solved the database-formatting problem. Two seconds later (a lifetime for Hal), Hal’s manager thread receives a suggestion from its database sorter thread. This time, the database sorter complains that there is a data integrity issue. The handwritten inputs appear suspect because the values in certain fields are out of range (e.g., weight = 20,464 lbs.) at a higher frequency than normal. Hal then searches its network and the Internet for other preexisting character recognition software, which it can then build on and use for its own purposes. Or Hal can rewrite its existing image recognition software. Certain programming languages, such as Lisp and Smalltalk, are homoiconic (a computer language is considered homoiconic when its program structure resembles its syntax, which permits all code in the language to be accessed and changed as data) (“Homo Iconic” 2015), and lend themselves to reflection. “The advantage on the other hand is that the uniformity of syntax makes it easy for humans to think about the written code as another data that can be manipulated. It becomes easy to think about higher order code (i.e. code that writes or modifies code)” (“Homoiconic Languages” 2007, para. 7). Hal can incrementally make changes in its existing image recognition software, and test each variation, and each variation with a new variation, and so on, until Hal has authored new image recognition software with superior results. This method is called the reflective tower: “in fact, in his design, the interpreter Pi is used to run the code of the interpreter Pi-1, and so on, with the interpreter P1 running the base program. This stack of interpreters is called a reflective tower” (Malenfant, Jacques, and Demers 1996, 4). Alternatively, for a skeptical perspective on the ability of artificial intelligence to reflect, see Ekbia 2008.
9.  
Professor Hawking has warned that computers capable of improving their own designs could pose a danger to humans (Cellan-Jones 2014). (Hawking warns that the creation of thinking machines poses a threat to humans’ existence. He notes that the primitive forms of artificial intelligence developed so far have proved useful. Yet he also observes that humans, limited by slow biological evolution, could not keep up with a computer that can improve its own design without the need for human manipulation. Rollo Carpenter, creator of Cleverbot, opines that achieving full artificial intelligence may happen in the next few decades.) Other key opinion leaders have similar concerns. Indeed, Elon Musk recently donated $10 million to the Future of Life Institute, which focuses on threats posed by advances in artificial intelligence (Isidore 2015; Love 2014). Musk is concerned that society is approaching the singularity. Artificial intelligence may be indifferent to human welfare and could solve problems in ways that could lead to harm against humans. Bill Gates, Microsoft’s founder, is also troubled by the possibility that artificial intelligence could grow too strong for people to control (Rawlinson 2015). (Gates notes that at first, machines will be helpful in completing tasks that may be too difficult or time consuming for humans. He warns that a few decades after, however, artificial intelligence may be strong enough to be a concern. Gates believes that Microsoft will see more progress than ever over the next three decades in the area of artificial intelligence.) Musk and other modern scientists are not the first to seriously question the possible threats posed by artificial intelligence (e.g., Good 1965).
10.  Naturally occurring DNA sequences cannot be patented, but artificially created DNA is eligible for patent protection. Association for Molecular Pathology v. Myriad Genetics, Inc., 569 U.S. ___, 133 S. Ct. 2107 (2013).
11.  35 U.S.C. § 102 (2012).



12.  Ibid. The purpose of the requirement that the specification describe the invention in such terms that one skilled in the art can make and use the claimed invention is to ensure that the invention is communicated to the interested public in a meaningful way.
13.  The issue of computational invention and intellectual property protection has been considered “since the 1960s when people began thinking about the impact of computers on copyright” (Miller 1993, 1043). Arthur R. Miller argued that “computer science does not appear to have reached a point at which a machine can be considered so ‘intelligent’ that it truly is creating a copyrightable work.” Rather, “for the foreseeable future, the copyrightability of otherwise eligible computer-generated works can be sustained because of the significant human element in their creation, even though there may be some difficulty in assigning authorship” (ibid., 1073). Abraham Kaminstein, the register of copyrights, reported that by 1965, the Copyright Office (1966) had received registrations for an abstract drawing and a musical composition created by a computer.
Most of the focus on computational invention and intellectual property has been in the copyright area rather than the patent context; Pamela Samuelson (1985, 1200), for example, argues that computers cannot be authors because they do not need incentives to generate output: “Only those stuck in the doctrinal mud could even think that computers could be ‘authors.’” Annemarie Bridy (2012, 27) remarks “that AI authorship is readily assimilable to the current copyright framework through the work made for hire doctrine, which is a mechanism for vesting copyright directly in a legal person who is acknowledged not to be the author-in-fact of the work in question.” Among those addressing the patentability implications of computational invention, Ralph Clifford (1996) has contended that works generated autonomously by computers should remain in the public domain unless artificial intelligence develops a consciousness that allows it to respond to the Copyright Act’s incentives (see also Vertinsky and Rice 2002). Colin R. Davies (2011) has argued more recently that a computer should be given legal recognition as an individual under UK law to allow proper attribution of authorship and permit respective claims to be negotiated through contract.
14.  Most, but not all, of the inventions in this hypothetical are required to be assigned to the company under the employment contract. Abbott Biologics is headquartered in California, where employees are permitted to retain ownership of inventions that are developed entirely on their own time without using their employer’s equipment, supplies, facilities, or trade secret information, except for inventions that either (1) related at the time of conception or reduction to practice of the invention to the employer’s business, or actual or demonstrably anticipated research or development of the employer; or (2) resulted from any work performed by the employee for the employer (California Labor Code § 2872[a]).
15.  35 U.S.C. § 154.
16.  
In re Hardee, 223 U.S.P.Q. (BNA) 1122, 1123 (Commissioner of Patents and Trademarks, 1984). See also Board of Education ex rel. Board of Trustees of Florida State University v. American Bioscience Inc., 333 F.3d 1330, 1340, 67 U.S.P.Q. 2d (BNA) 1252, 1259 (Fed. Cir. 2003) (“invention requires conception”; with regard to the inventorship of chemical compounds, an inventor must have a conception of the specific compounds being claimed, and “[g]eneral knowledge regarding the anticipated biological properties of groups of complex chemical compounds is insufficient to confer inventorship status with respect to specifically claimed compounds”). See also Ex parte Smernoff, 215 U.S.P.Q. 545, 547 (Bd. App. 1982)



(“one who suggests an idea of a result to be accomplished, rather than the means of accomplishing it, is not a coinventor”).
17.  Townsend v. Smith, 36 F.2d 292, 295, 4 U.S.P.Q. (BNA) 269, 271 (C.C.P.A. 1930).
18.  “Conception is established when the invention is made sufficiently clear to enable one skilled in the art to reduce it to practice without the exercise of extensive experimentation or the exercise of inventive skill.” Hiatt v. Ziegler, 179 U.S.P.Q. (BNA) 757, 763 (B.P.I. 1973). Conception has been defined as a disclosure of an idea that allows a person skilled in the art to reduce the idea to a practical form without “exercise of the inventive faculty.” Gunter v. Stream, 573 F.2d 77, 79, 197 U.S.P.Q. (BNA) 482 (C.C.P.A. 1978).
19.  Actual reduction to practice “requires that the claimed invention work for its intended purpose.” Brunswick Corporation v. United States, 34 Fed. Cl. 532, 584 (1995). Constructive reduction to practice “occurs upon the filing of a patent application on the claimed invention” (ibid.). The written description requirement is “to ensure that the inventor had possession, as of the filing date of the application relied on, of the specific subject matter later claimed by him.” In re Edwards, 568 F.2d 1349, 1351–52, 196 U.S.P.Q. (BNA) 465, 467 (C.C.P.A. 1978).
20.  De Solms v. Schoenwald, 15 U.S.P.Q. 2d (BNA) 1507, 1510 (B.P.A.I. 1990).
21.  Ex parte Smernoff, 215 U.S.P.Q. (BNA) 545, 547 (P.T.O. Bd. App. 1982) (“one who suggests an idea of a result to be accomplished, rather than the means of accomplishing it, is not a coinventor”).
22.  In re DeBaun, 687 F.2d 459, 463, 214 U.S.P.Q. (BNA) 933, 936 (C.C.P.A. 1982); Fritsch v. Lin, 21 U.S.P.Q. 2d (BNA) 1737, 1739 (B.P.A.I. 1991).
23.  In this case, for instance, it is likely that both could qualify as inventors. What is required is some “quantum of collaboration or connection.” Kimberly-Clark Corporation v. Procter and Gamble Distribution Co., 973 F.2d 911, 916–17, 23 U.S.P.Q. 2d (BNA) 1921, 1925–26 (Fed. Cir. 1992). For joint inventorship, “there must be some element of joint behavior, such as collaboration or working under common direction, one inventor seeing a relevant report and building upon it or hearing another’s suggestion at a meeting” (ibid.); Moler v. Purdy, 131 U.S.P.Q. (BNA) 276, 279 (B.P.I. 1960) (“it is not necessary that the inventive concept come to both [joint inventors] at the same time”).
24.  See, for example, Advanced Magnetic Closures, Inc. v. Rome Fasteners Corp., 607 F.3d 817 (Fed. Cir. 2010).
25.  Conception has been identified as a mental process (“formation in the mind of the inventor, of a definite and permanent idea of the complete and operative invention, as it is hereafter to be applied in practice”). Hitzeman v. Rutter, 243 F.3d 1345, 58 U.S.P.Q. 2d (BNA) 1161 (Fed. Cir. 2001). “The term ‘inventor’ means the individual or, if a joint invention, the individuals collectively who invented or discovered the subject matter of the invention.” 35 U.S.C. 100(f) (2012).
26.  See the Trade-Mark Cases, 100 U.S. 82, 94 (1879) (noting that “copyright law only protects ‘the fruits of intellectual labor’ that ‘are founded in the creative powers of the mind’”), cited in US Copyright Office 2014.



27.  While he was a bacteriologist at St. Mary’s Hospital in London, Alexander Fleming realized that a mold had contaminated his samples of Staphylococcus. When he examined his dishes under a microscope, he noticed that the mold prevented the growth of Staphylococcus (Market 2013). The area around the mold contained a strain of Penicillium notatum. Fleming discovered that it could kill many different types of bacteria. Decades later, Howard Florey at Oxford University headed efforts to purify penicillin for use in therapeutic applications (American Chemical Society, n.d.). It proved to be invaluable during World War II for controlling wound infections (Market 2013). Saccharin—the first artificial sweetener—was discovered by accident by Constantin Fahlberg in 1884. He had been working with compounds derived from coal tar and accidentally ate something without washing his hands. Fahlberg noticed a sweet taste, which he later traced to benzoic sulfinide. Some reports hold that it was his partner, Ira Remsen, who first noticed that the tar compound was sweet. While useful during World War I when sugar was scarce, it was only in the 1960s and 1970s that saccharin became popular as a way to sweeten while avoiding the calories contained in regular sugar (Clegg 2012).
28.  Conception requires contemporaneous recognition and appreciation of the invention. Invitrogen Corporation v. Clontech Laboratories, Inc., 429 F.3d 1052, 1064, 77 U.S.P.Q. 2d (BNA) 1161, 1169 (Fed. Cir. 2005) (“the inventor must have actually made the invention and understood the invention to have the features that comprise the inventive subject matter at issue”).
29.  Silvestri v. Grant, 496 F.2d 593, 596, 181 U.S.P.Q. (BNA) 706, 708 (C.C.P.A. 1974) (“an accidental and unappreciated duplication of an invention does not defeat the patent right of one who, though later in time was the first to recognize that which constitutes the inventive subject matter”).
30.  
Originally, the active ingredient in Viagra was intended as a cardiovascular drug to lower blood pressure (Fox News 2008). The trials for this intended use were disappointing until volunteers began reporting a strange side effect: erections (Jay 2010). Botox is a branded formula of botulinum toxin type A manufactured by Allergan (“Medication Guide: Botox,” n.d.). Botulinum toxin is a protein produced by the bacterium Clostridium botulinum (Montecucco and Molgó 2005). It was used in the late 1700s as a food poison and gained attention in the 1890s for its potential use as a biological weapon; “one gram [of botulinum toxin] has the potential to kill one million people” (Ting and Freiman 2004). In the 1960s, however, Drs. Alan Scott and Edward Schantz discovered botulinum toxin type A’s ability (in small doses) to block the transmission of nerve impulses and paralyze hyperactive muscles to treat eye, facial, and vocal spasms (ibid., 259–260). These novel developments led to the accidental discovery that botulinum type A injections also reduced wrinkles; physicians quickly began administering Botox as a wrinkle reduction treatment well before the FDA finally approved Botox for this use in 2002 (Ghose 2014). Since then, Botox has steadily expanded to treat over twenty different medical conditions, including chronic headaches, overactive bladder, and urinary incontinence (Nichols 2015; Team, n.d.; FDA 2010, 2013; “BOTOX” 2011).
31.  Commercialization theory holds that patents are important in providing incentives for investment in increasing the value of a patented technology (see Kitch 1977, 276–277).
32.  It has been estimated that prehuman expenditures are 30.8 percent of costs per approved compound, and that average pretax industry cost per new prescription drug approval (inclusive of failures and capital costs) is $2.55 billion (Tufts Center for the Study of Drug Development 2014). The cost of new prescription drug approval is hotly contested (e.g., Collier 2009).
33.  Although some human inventors also appear to lack a moral compass (Ho 2000).
34.  35 U.S.C. § 101 (2012).
35.  Ibid.
36.  5 Writings of Thomas Jefferson 75–76 (H. Washington ed. 1871). “In choosing such expansive terms [for the language of section 101] … modified by the comprehensive ‘any,’ Congress plainly contemplated that the patent laws would be given wide scope.” Diamond v. Chakrabarty, 447 U.S. 303, 308 (1980).
37.  Bilski v. Kappos, 561 U.S. 593, 593–96 (2010). So “a new mineral discovered in the earth or a new plant found in the wild is not patentable subject matter.” Diamond v. Chakrabarty, 447 U.S. 309 (1980). “Likewise, Einstein could not patent his celebrated law that E = mc²; nor could [Isaac] Newton have patented the law of gravity” (ibid.). Nor is a mathematical formula, electromagnetism or steam power, or the qualities of bacteria patentable (ibid.).
38.  Alice Corp. v. CLS Bank, 573 U.S. ____ (2014) (slip op., at 5–6). Also, these exceptions have existed in various forms for 150 years. See Le Roy v. Tatham, 14 How. 156, 174.
39.  Ibid. As courts acknowledge, all patents rely to some extent on these exceptions and have the potential to hinder as well as promote future innovation (ibid.).
40.  The information covered by these exceptions is “part of the storehouse of knowledge of all men … free to all men and reserved exclusively to none.” Funk Brothers Seed Co. v. Kalo Inoculant Co., 333 U.S. 127, 130 (1948).
41.  35 U.S.C. § 161.
42.  Section 101 is a “dynamic provision designed to encompass new and unforeseen inventions.” J.E.M. Ag Supply, Inc. v. Pioneer Hi-Bred International, Inc., 534 U.S. 124, 135 (2001). As the Supreme Court stated in Bilski v.
Kappos, “For example, it was once forcefully argued that until recent times, ‘well-established principles of patent law probably would have prevented the issuance of a valid patent on almost any conceivable computer program.’” Bilski v. Kappos, 561 U.S. 593, 605 (2010), citing Diamond v. Diehr, 450 U.S. 175, 195 (1981) (Stevens, J., dissenting). But this fact does not mean that unforeseen innovations such as computer programs are always unpatentable (ibid.).
43.  Diamond v. Chakrabarty, 447 U.S. 303, 315 (1980).


References

Abbott, Ryan. 2013. “Big Data and Pharmacovigilance: Using Health Information Exchanges to Revolutionize Drug Safety.” Iowa Law Review 99:225–292.
“ACM Code of Ethics and Professional Conduct.” 1992. Association for Computing Machinery, October 16. (accessed December 8, 2015).
Acquisti, Alessandro, and Ralph Gross. 2009. “Predicting Social Security Numbers from Public Data.” Proceedings of the National Academy of Sciences of the United States of America 106 (27): 10975–10980.
Alaimo, Cristina. 2014. “Computational Consumption: Social Media and the Construction of Digital Consumers.” PhD diss., London School of Economics and Political Science.
Allen, Anita L. 2000. “Privacy-as-Data Control: Conceptual, Practical, and Moral Limits of the Paradigm.” Connecticut Law Review 32 (3): 861–875.
Allen, Claudia, Terrisca R. Des Jardins, Arvela Heider, Kristin A. Lyman, Lee McWilliams, Alison L. Rein, Abigail A. Schachter, Ranjit Singh, Barbara Sorondo, Joan Topper, and Scott A. Turske. 2014. “Data Governance and Data Sharing Agreements for Community-Wide Health Information Exchange: Lessons from the Beacon Communities.” eGEMs (Generating Evidence & Methods to Improve Patient Outcomes) 2 (1): Article 5.
Almuhimedi, Hazim, Shomir Wilson, Bin Liu, Norman Sadeh, and Alessandro Acquisti. 2013. “Tweets Are Forever.” In Proceedings of the 2013 Conference on Computer Supported Cooperative Work—CSCW ’13, 897–908. New York: ACM.
American Association for the Advancement of Science. 2015. “Historical Trends in Federal R&D.” http:// (accessed December 8, 2015).
American Chemical Society. n.d. “Discovery and Development of Penicillin: International Historic Chemical Landmark.” flemingpenicillin.html#alexander-fleming-penicillin (accessed December 2, 2015).
American Statistical Association. 2014. “Curriculum Guidelines for Undergraduate Programs in Statistical Science.” Alexandria, VA: American Statistical Association. curriculumguidelines.cfm (accessed February 8, 2016).



Anderson, Chris. 2008. “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Wired, June 23. (accessed December 8, 2015). Andrejevic, Mark. 2012. “Internet Privacy Research.” (accessed December 8, 2015). Andrejevic, Mark, and Mark Burdon. 2015. “Defining the Sensor Society.” Television and New Media 16 (1): 19–36. Armbrust, Michael, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2010. “A View of Cloud Computing.” Communications of the ACM 53 (4): 50–58. Article 29 Data Protection Working Party. 2011. Opinion 15/2011 on the Definition of Consent. 01197/11/EN WP187, July 13, 2. Ash, Joan S., James L. McCormack, Dean F. Sittig, Adam Wright, Carmit McMullen, and David W. Bates. 2012. “Standard Practices for Computerized Clinical Decision Support in Community Hospitals: A National Survey.” Journal of the American Medical Informatics Association 19 (6): 980–987. Ash, Joan S., Dean F. Sittig, Adam Wright, Carmit McMullen, Michael Shapiro, Arwen Bunce, and Blackford Middleton. 2011. “Clinical Decision Support in Small Community Practice Settings: A Case Study.” Journal of the American Medical Informatics Association 18 (6): 879–882. Asia-Pacific Economic Cooperation. 2005. APEC Privacy Framework. 2004/AMM/014rev1. Association for Molecular Pathology et al. v. US Patent and Trademark Office and Myriad Genetics, Inc. and Lorris Betz, Roger Boyer et al. 2012. No. 2010–1406, 2012 WL 3518509 (Fed. Cir.), LexisNexis. Atkinson, Robert D. 2015. The 2014 ITIF Luddite Awards. Washington, DC: Information Technology and Innovation Foundation. Austin, Lisa M. 2003. “Privacy and the Question of Technology.” Law and Philosophy 22:1–57. Backstrom, Lars. 2013. “News Feed FYI: A Window into News Feed.” Facebook for Business, August 6. (accessed December 10, 2015). Bakshy, Eytan, Dean Eckles, and Michael Bernstein. 2014. 
Designing and Deploying Online Field Experiments. In Proceedings of the 23rd International Conference on World Wide Web, 283–292. New York: ACM. Balganesh, Shyamkrishna. 2009. “Foreseeability and Copyright Incentives.” Harvard Law Review 122 (6): 1569–1633. Ball, W. W. Rouse. 1960. Calculating Prodigies. In Mathematical Recreations and Essays. New York: Macmillan. Bambauer, Jane R. 2011. “Tragedy of the Data Commons.” Harvard Journal of Law and Technology 25. (accessed February 3, 2016).



Bamberger, Kenneth A., and Deirdre K. Mulligan. 2011a. “New Governance, Chief Privacy Officers, and the Corporate Management of Information Privacy in the United States: An Initial Inquiry.” Law and Policy 33 (4): 477–508. Bamberger, Kenneth A., and Deirdre K. Mulligan. 2011b. “Privacy on the Books and on the Ground.” Stanford Law Review 63:247–316. Barbaro, Michael, and Tom Zeller Jr. 2006. “A Face Is Exposed for AOL Searcher No. 4417749.” New York Times, August 9. (accessed February 3, 2016). Barley, Stephen R. 1986. “Technology as an Occasion for Structuring: Evidence from Observations of CT Scanners and the Social Order of Radiology Departments.” Administrative Science Quarterly 31 (1): 78–108. Barocas, Solon, and Andrew D. Selbst. 2015. “Big Data’s Disparate Impact.” California Law Review 104, April 14. (accessed February 3, 2016). Barrett, Paul M. 2014. “In Fake Classes Scandal, UNC Fails Its Athletes, Whistle-Blower.” Bloomberg Business, February 27. -fails-its-athletes-whistle-blower (accessed February 6, 2016). Bastian, Hilda, Paul Glasziou, and Iain Chalmers. 2010. “Seventy-Five Trials and Eleven Systematic Reviews a Day: How Will We Ever Keep Up?” PLoS Medicine 7 (9). article?id=10.1371/journal.pmed.1000326 (accessed February 10, 2016). Battelle Technology Partnership Practice for United for Medical Research. 2013. “The Impact of Genomics on the U.S. Economy.” June. 06/The-Impact-of-Genomics-on-the-US-Economy.pdf (accessed December 10, 2015). Bazerman, Max, and Ann Tenbrunsel. 2011. “Ethical Breakdowns.” Harvard Business Review 89 (4): 58–65. BBC. 2013. “Computer Program Uses Twitter to ‘Map Mood of Nation.’” September 7. (accessed January 1, 2016). Beasley, Deena. 2015. “Pfizer Developing PCSK9 Pill, Vaccine to Lower Cholesterol.” Reuters, January 13. (accessed February 10, 2016). 
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, and Carole Goble. 2013. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems 29 (2): 599–611. Bechmann, Anja, and Stine Lomborg. 2013. “Mapping Actor Roles in Social Media: Different Perspectives on Value Creation in Theories of User Participation.” New Media and Society 15 (5): 765–781. Beer, David. 2008. “Social Network(ing) Sites … Revisiting the Story So Far: A Response to danah boyd and Nicole Ellison.” Journal of Computer-Mediated Communication 13 (2): 516–529.



Beer, David, and Roger Burrows. 2013. “Popular Culture, Digital Archive, and the New Social Life of Data.” Theory, Culture, and Society 30 (4): 47–71. Bell, Robert M., and Yehuda Koren. 2007. “Lessons from the Netflix Prize Challenge.” ACM SIGKDD Explorations Newsletter 9 (2): 75–79. Bennett, Colin J., and Deirdre K. Mulligan. 2012. “The Governance of Privacy through Codes of Conduct: International Lessons for U.S. Privacy Policy.” Paper presented at the Privacy Law Scholars Conference, George Washington University, Washington, DC, June 7–8. Berman, Francine. 2008. “Got Data? A Guide to Data Preservation in the Information Age.” Communications of the ACM 51 (12): 50–56. Bernstein, Rachel. 2014. “Shifting Ground for Big Data Researchers.” Cell 157 (2): 283–284. “Big Data.” n.d. Wikipedia. (accessed December 12, 2015). Blake, Thomas, and Dominic Coey. 2014. Why Marketplace Experimentation Is Harder Than It Seems: The Role of Test-Control Interference. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, 567–582. New York: ACM. Blumenthal, David, and Marilyn Tavenner. 2010. “The ‘Meaningful Use’ Regulation for Electronic Health Records.” New England Journal of Medicine 363 (6): 501–504. Boas, Paulo José Fortes Villas, Regina Stella Spagnuolo, Amélia Kamegasawa, Leandro Gobbo Braz, Adriana Polachini do Valle, Eliane Chaves Jorge, Hugo Hyung Bok Yoo, Antônio José Maria Cataneo, Ione Corrêa, Fernanda Bono Fukushima, Paulo do Nascimento, Norma Sueli Pinheiro Módolo, Marise Silva Teixeira, Edison Iglesias de Oliveira Vidal, Solange Ramires Daher, and Regina El Dib. 2013. “Systematic Reviews Showed Insufficient Evidence for Clinical Practice in 2004: What about in 2011? The Next Appeal for the Evidence-Based Medicine Age.” Journal of Evaluation in Clinical Practice 19 (4): 633–637. Bogen, Jim. 2013. “Theory and Observation in Science.” Stanford Encyclopedia of Philosophy. January 11. (accessed January 5, 2016). Bollier, David. 2010. 
The Promise and Peril of Big Data. Queenstown, MD: Aspen Institute. Booz Allen Hamilton. 2015. The Field Guide to Data Science. boozallen/documents/2015/12/2015-FIeld-Guide-To-Data-Science.pdf (accessed February 7, 2016). Borgmann, Albert. 2000. Holding on to Reality: The Nature of Information at the Turn of the Millennium. Chicago: University of Chicago Press. Bostrom and Old Creekside Consulting. 2014. “The Patient Protection and Affordable Care Act: Beyond the Horizon into 2015.” Physicians Foundation, April. default/ACA_Critical_Issues_Part_II.pdf (accessed July 10, 2014). “BOTOX® (OnabotulinumtoxinA) Receives U.S. FDA Approval for the Treatment of Urinary Incontinence in Adults with Neurological Conditions Including Multiple Sclerosis and Spinal Cord Injury.”



2011. Business Wire. -onabotulinumtoxinA-Receives-U.S.-Food-Drug-Administration (accessed February 12, 2016). boyd, danah, and Kate Crawford. 2011. “Six Provocations for Big Data.” Paper presented at the Oxford Internet Institute: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, University of Oxford, September 21–24. boyd, danah, and Kate Crawford. 2012. “Critical Questions for Big Data.” Information, Communication and Society 15 (5): 662–679. boyd, danah, and Nicole B. Ellison. 2008. “Social Network Sites: Definition, History, and Scholarship.” Journal of Computer-Mediated Communication 13 (1): 210–230. Bozeman, Barry. 2000. “Technology Transfer and Public Policy: A Review of Research and Theory.” Research Policy 29 (4): 627–655. Bresnick, Jennifer. 2015. “Four Use Cases for Healthcare Predictive Analytics, Big Data.” HealthITAnalytics, April 21. -data (accessed February 5, 2016). Bridy, Annemarie. 2012. “Coding Creativity: Copyright and the Artificially Intelligent Author.” Stanford Technology Law Review 5:1–28. Brunton, Finn, and Helen Nissenbaum. 2011. “Vernacular Resistance to Data Collection and Analysis: A Political Theory of Obfuscation.” First Monday 16 (5). (accessed February 1, 2016). Brush, A. J. Bernheim, Evgeni Filippov, Danny Huang, Jaeyeon Jung, Ratul Mahajan, Frank Martinez, Khurshed Mazhar, Amar Phanishayee, Arjmand Samuel, James Scott, and Rayman Preet Singh. 2013. Lab of Things: A Platform for Conducting Studies with Connected Devices in Multiple Homes. In Proceedings of the 2013 ACM Conference on Pervasive and Ubiquitous Computing Adjunct Publication, 35–38. New York: ACM. Brynjolfsson, Erik, and Andrew McAfee. 2014. The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies. New York: W. W. Norton and Company. Bryson, Steve, David Kenwright, Michael Cox, David Ellsworth, and Robert Haimes. 1999. “Visually Exploring Gigabyte Data Sets in Real Time.” Communications of the ACM 42 (8): 82–90. 
Bucher, Taina. 2012. “Want to Be on the Top? Algorithmic Power and the Threat of Invisibility on Facebook.” New Media and Society 14 (7): 1164–1180. Bulygo, Zach. n.d. “How Netflix Uses Analytics To Select Movies, Create Content, and Make Multimillion Dollar Decisions.” Kissmetrics. (accessed December 12, 2015). Burdon, Mark, and Paul David Harpur. 2014. “Re-Conceptualising Privacy and Discrimination in an Age of Talent Analytics.” University of New South Wales Law Journal 37 (2): 679.



Burdon, Mark, and Alissa McKillop. 2014. “The Google Street View Wi-Fi Scandal and Its Repercussions for Privacy Regulation.” Monash University Law Review 39:702–738. Butnick, Stephanie. 2014. “Leon Wieseltier Schools Stephen Colbert on Liberals.” Tablet, October 8. (accessed February 6, 2016). Byers, Alex. 2013. “Microsoft Hits Google Email Privacy.” Politico, February 7. story/2013/02/microsoft-renews-google-attack-on-email-privacy-87302.html (accessed November 20, 2014). Bynum, Terrell. 2015. “Computer and Information Ethics.” In Stanford Encyclopedia of Philosophy. October 26. (accessed February 7, 2016). Calandrino, Joseph A., William Clarkson, and Edward W. Felten. 2011. Bubble Trouble: Off-Line De-Anonymization of Bubble Forms. In Proceedings of the 20th USENIX Security Symposium. Berkeley, CA: USENIX. Callebaut, Werner. 2012. “Scientific Perspectivism: A Philosopher of Science’s Response to the Challenge of Big Data Biology.” Studies in History and Philosophy of Biological and Biomedical Science 43 (1): 69–80. Calo, Ryan. 2013. “Consumer Subject Review Boards: A Thought Experiment.” Stanford Law Review Online 66 (97). -review-boards (accessed February 1, 2016). Campbell, Andrew T., Shane B. Eisenman, Nicholas D. Lane, Emiliano Miluzzo, Ronald Peterson, Hong Lu, Xiao Zheng, Mirco Musolesi, Kristóf Fodor, and Gahng-Seop Ahn. 2008. “The Rise of People-Centric Sensing.” IEEE Internet Computing 12 (4): 12–21. “Can Recipes Be Patented?” 2013. Inventors Eye, June. eye/201306/ADVICE.jsp (accessed February 11, 2016). Cardwell, Diane. 2014. “At Newark Airport, the Lights Are On, and They’re Watching You.” New York Times, February 18, A1. Carey, Peter. 2015. Data Protection: A Practical Guide to UK and EU Law. Oxford: Oxford University Press. Carmichael, Alexandra. 2010. Why I Stopped Tracking. Quantified Self, April 5. http://quantifiedself.com/2010/04/why-i-stopped-tracking/ (accessed February 1, 2016). Carson, Colin. 2014. How Much Data Does Google Store? 
Cirrus Insight, November 18. https://www (accessed February 9, 2016). Casarett, David, Jason Karlawish, Elizabeth Andrews, and Arthur Caplan. 2005. Bioethical Issues in Pharmacoepidemiologic Research. In Pharmacoepidemiology, ed. Brian L. Strom, 587–598. 4th ed. Hoboken, NJ: Wiley. Casper, Carsten. 2014. Hype Cycle for Privacy. Stamford, CT: Gartner.



Castelle, Michael. 2013. “Relational and Non-Relational Models in the Entextualization of Bureaucracy.” Computational Culture 3. -in-the-entextualization-of-bureaucracy (accessed February 12, 2016). Castells, Manuel. 2009. Communication Power. Oxford: Oxford University Press. Cate, Fred H. 2006. The Failure of Fair Information Practice Principles. In Consumer Protection in the Age of the “Information Economy,” ed. Jane K. Winn, 341–378. Burlington, VT: Ashgate. Cate, Fred H. 2014. “The Big Data Debate.” Science 346 (6211): 818. Cate, Fred H., Peter Cullen, and Viktor Mayer-Schönberger. 2014. Data Protection Principles for the 21st Century: Revising the 1980 OECD Guidelines. Cellan-Jones, Rory. 2014. “Stephen Hawking Warns Artificial Intelligence Could End Mankind.” BBC News, December 2. (accessed February 11, 2016). Centre for Information Policy Leadership. 2014a. A Risk-Based Approach to Privacy: Improving Effectiveness in Practice. Hunton and Williams, June 19. _Paper_June_2014.pdf (accessed December 10, 2015). Centre for Information Policy Leadership. 2014b. The Role of Risk Management in Data Protection. Hunton and Williams, November 23. _Management_in_Data_Protection.pdf (accessed December 10, 2015). Chakrabarti, Soumen. 2009. Data Mining: Know It All. Amsterdam: Morgan Kaufmann Publishers. Chakrabarti, Soumen, Martin Ester, Usama Fayyad, Johannes Gehrke, Jiawei Han, Shinichi Morishita, Gregory Piatetsky-Shapiro, and Wei Wang. 2006. “Data Mining Curriculum: A Proposal, Version 1.0.” April 30. (accessed February 10, 2016). Chappell, Bill. 2014. “Who Owns a Monkey’s Selfie? No One Can, U.S. Says.” NPR, August 22. http:// -s-says (accessed February 11, 2016). Chen, Caroline. 2015. “23andMe Turns Spit into Dollars in Deal with Pfizer.” Bloomberg Business, January 12. -startup-seeks-growth (accessed February 11, 2016). Christakis, Dimitri A., and Frederick J. Zimmerman. 2013. 
“Rethinking Reanalysis.” JAMA: The Journal of the American Medical Association 310 (23): 2499–2500. Christensen, Lisa Jones, Ellen Peirce, Laura P. Hartman, W. Michael Hoffman, and Jamie Carrier. 2007. “Ethics, CSR, and Sustainability Education in the Financial Times Top 50 Global Business Schools: Baseline Data and Future Research Directions.” Journal of Business Ethics 73 (4): 347–368. Christian, Brian. 2012. “The A/B Test: Inside the Technology That’s Changing the Rules of Business.” Wired, April 25. (accessed February 9, 2016). Clark, Andy. 2008. Supersizing the Mind: Embodiment, Action, and Cognitive Extension. Oxford: Oxford University Press.



Clauson, Kevin A., Wallace A. Marsh, Hyla H. Polen, Matthew J. Seamon, and Blanca I. Ortiz. 2007. “Clinical Decision Support Tools: Analysis of Online Drug Information Databases.” BMC Medical Informatics and Decision Making 7 (1): 7. Clegg, Brian. 2012. “Chemistry in Its Element: Saccharin.” Chemistry World. chemistryworld/podcast/CIIEcompounds/transcripts/saccharin.asp (accessed February 11, 2016). Clifford, Ralph D. 1996. “Intellectual Property in the Era of the Creative Computer Program: Will the True Creator Please Stand Up.” Tulane Law Review 71:1675–1703. Coffman, Kerry G., and Andrew M. Odlyzko. 1998. “The Size and Growth Rate of the Internet.” First Monday 3 (10): 1–25. Cohen, Julie E. 2012. “What Privacy Is For.” Harvard Law Review 126:1904–1933. Cohen, Julie E. 2016. The Surveillance-Innovation Complex: The Irony of the Participatory Turn. In The Participatory Condition in the Digital Age, ed. Darin Barney, E. Gabriella Coleman, Christine Ross, Jonathan Sterne, and Tamar Tembeck. Minneapolis: University of Minnesota Press. Cohen, Steve. 2014. “Ethics for Undergraduates: Workgroup on Undergraduate Education.” November. (accessed December 10, 2015). Cole, Simon A. 2002. Suspect Identities: A History of Fingerprinting and Criminal Identification. Cambridge, MA: Harvard University Press. Collier, Roger. 2009. “Drug Development Cost Estimates Hard to Swallow.” Canadian Medical Association Journal 180 (3): 279–280. Collins, Francis. 2010a. “Has the Revolution Arrived?” Nature 464 (7289): 674–675. Collins, Francis. 2010b. The Language of Life: DNA and the Revolution in Personalized Medicine. New York: HarperCollins. Commissariat, Tushna. 2014. “BICEP2 Gravitational Wave Results Bite the Dust Thanks to New Planck Data.” Physics World, September 22. -gravitational-wave-result-bites-the-dust-thanks-to-new-planck-data (accessed February 6, 2016). Committee on Professional Ethics. 1999. 
“Ethical Guidelines for Statistical Practice.” American Statistical Association, August 7. (accessed February 8, 2016). “Computational Creativity.” n.d. IBM. computational-creativity.shtml#fbid=kwG0oXrjBHY (accessed February 6, 2015). Constine, Josh. 2015. “Facebook Acquires Wit.ai to Help Its Developers with Speech Recognition and Voice Interfaces.” TechCrunch, January 5. (accessed February 9, 2016). Contreras, Jorge L. 2011. “Bermuda’s Legacy: Policy, Patents, and the Design of the Genome Commons.” Minnesota Journal of Law, Science and Technology 12 (1): 61–125.



Contreras, Jorge L. 2015. “NIH’s Genomic Data Sharing Policy: Timing and Tradeoffs.” Trends in Genetics 31 (2): 55–57. Contreras, Jorge L., Aris Floratos, and Arthur L. Holden. 2013. “The International Serious Adverse Events Consortium’s Data Sharing Model.” Nature Biotechnology 31 (1): 17–19. Cook-Deegan, Robert M. 1994. The Gene Wars: Science, Politics, and the Human Genome. New York: W. W. Norton and Company. Cooke, Bill, and Uma Kothari. 2001. Participation: The New Tyranny? London: Zed Books. Copyright Office. 1966. Sixty-Eighth Annual Report of the Register of Copyrights, for the Fiscal Year Ending June 30, 1965. Washington, DC: Copyright Office. Courtney, Karen L., Gregory L. Alexander, and George Demiris. 2008. “Information Technology from Novice to Expert: Implementation Implications.” Journal of Nursing Management 16 (6): 692–699. Cover, Thomas M., and Joy A. Thomas. 2006. Elements of Information Theory. Hoboken, NJ: Wiley. Cox, Kate. 2015. “Big Data Is Here to Stay, So Can We Use It to Make Recalls Actually Work?” Consumerist, March 16. -recalls-actually-work/ (accessed February 9, 2016). Crawford, Kate, and Jason Schultz. 2014. “Big Data and Due Process: Toward a Framework to Redress Predictive Privacy Harms.” Boston College Law Review 55:93–128. Crease, Robert P. 2008. The Great Equations: Breakthroughs in Science from Pythagoras to Heisenberg. New York: W. W. Norton and Company. Croll, Alistair. 2013. “Big Data Is Our Generation’s Civil Rights Issue, and We Don’t Know It.” DZone, July 25. (accessed February 3, 2016). Culnan, Mary J. 2011. “Accountability as the Basis for Regulating Privacy: Can Information Security Regulations Inform Privacy Policy?” Paper presented at the Privacy Law Scholars Conference, Berkeley, CA, June 2–3. Culnan, Mary J., and Cynthia Clark Williams. 2009. “How Ethics Can Enhance Organizational Privacy: Lessons from the ChoicePoint and TJX Data Breaches.” Management Information Systems Quarterly 33 (4): 673–687. 
Cuttone, Andrea, Sune Lehmann, and Jakob Eg Larsen. 2014. Inferring Human Mobility from Sparse Low Accuracy Mobile Sensing Data. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 995–1004. New York: ACM. Dabek, Frank, Nickolai Zeldovich, Frans Kaashoek, David Mazières, and Robert Morris. 2002. Event-Driven Programming for Robust Software. In Proceedings of the 10th Workshop on ACM SIGOPS European Workshop, 186–189. New York: ACM. Dahiyat, Emad Abdel Rahim. 2007. “Intelligent Agents and Contracts: Is a Conceptual Rethink Imperative?” Artificial Intelligence and Law 15:375–390.



Dahiyat, Emad Abdel Rahim. 2010. “Intelligent Agents and Liability: Is It a Doctrinal Problem or Merely a Problem of Explanation?” Artificial Intelligence and Law 18:103–121. “Data, Data Everywhere.” 2010. Economist, February 25. (accessed February 9, 2016). Data Protection Working Party. 2001. “Opinion 8/2001 on the Processing of Personal Data in the Employment Context.” 5062/01/EN/Final WP 48. article-29/documentation/opinion-recommendation/files/2001/wp48_en.pdf (accessed February 12, 2016). Data Science Association. n.d. “Code of Conduct.” (accessed December 10, 2015). Davenport, Thomas H. 2014. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities. Boston: Harvard Business School Publishing. Davenport, Thomas H., and D. J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review, October. (accessed February 6, 2016). Davies, Colin R. 2011. “An Evolutionary Step in Intellectual Property Rights—Artificial Intelligence and Intellectual Property.” Computer Law and Security Report 27 (6): 601–619. Davis, Kord. 2012. Ethics of Big Data: Balancing Risk and Innovation. Sebastopol, CA: O’Reilly Media. De Choudhury, Munmun, Scott Counts, and Eric Horvitz. 2013. Predicting Postpartum Changes in Emotion and Behavior via Social Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 3267–3276. New York: ACM. Della Porta, Donatella, and Mario Diani. 2006. Social Movements: An Introduction. 2nd ed. Malden, MA: Blackwell Publishing. de Montjoye, Yves-Alexandre, César A. Hidalgo, Michel Verleysen, and Vincent D. Blondel. 2013. “Unique in the Crowd: The Privacy Bounds of Human Mobility.” Scientific Reports, March 25. http:// (accessed February 3, 2016). de Montjoye, Yves-Alexandre, Laura Radaelli, and Vivek Kumar Singh. 2015. “Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata.” Science 347 (6221): 536–539. De Roure, David, Carole Goble, and Robert Stevens. 2009. 
“The Design and Realisation of the myExperiment Virtual Research Environment for Social Sharing of Workflows.” Future Generation Computer Systems 25:561–567. Dean, Jeffrey, and Sanjay Ghemawat. 2008. “MapReduce: Simplified Data Processing on Large Clusters.” Communications of the ACM 51 (1): 107–113. DeDeo, Simon. 2015. “Information Theory for Intelligent People.” February 10. http://santafe.edu/~simon/it.pdf (accessed December 10, 2015).



Deloitte. n.d. “A Guide to Assessing Your Risk of Data Aggregation Strategies: How Effectively Are You Complying with BCBS 239?” ca-en-15-2733H-risk-data-aggregation-reporting-BCBS-239.PDF (accessed December 10, 2015). Denning, Peter J. 1990. “Saving All the Bits.” American Scientist 78:402–405. Department of Health and Human Services. 2008. The Sentinel Initiative: National Strategy for Monitoring Medical Product Safety. May. (accessed December 12, 2015). Department of Health and Human Services. 2011. Report to Congress: The Sentinel Initiative—A National Strategy for Monitoring Medical Product Safety. August 19. FDAsSentinelInitiative/UCM274548.pdf (accessed December 12, 2015). Desrosières, Alain. 2002. The Politics of Large Numbers: A History of Statistical Reasoning. Trans. Camille Naish. Cambridge, MA: Harvard University Press. Dhia, Amir. 2006. The Information Age and Diplomacy: An Emerging Strategic Vision in World Affairs. Boca Raton, FL: Universal Publishers. Diebold, Francis X. 2003. “Big Data” Dynamic Factor Models for Macroeconomic Measurement and Forecasting. In Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress of the Econometric Society, ed. Mathias Dewatripont, Lars Peter Hansen, and Stephen J. Turnovsky, 115–122. Cambridge: Cambridge University Press. Donaldson, Devan Ray, and Paul Conway. 2015. “User Conceptions of Trustworthiness for Digital Archival Documents.” Journal of the Association for Information Science and Technology 66 (12): 2427–2444. Duckworth, Angela L., Christopher Peterson, Michael D. Matthews, and Dennis R. Kelly. 2007. “Grit: Perseverance and Passion for Long-Term Goals.” Journal of Personality and Social Psychology 92 (6): 1087–1101. Duhem, Pierre. (1906) 1954. The Aim and Structure of Physical Theory. Princeton, NJ: Princeton University Press. Duhigg, Charles. 2012. “How Companies Learn Your Secrets.” New York Times, February 16. http:// (accessed April 22, 2016). Dwork, Cynthia. 
2014. Differential Privacy: A Cryptographic Approach to Private Data Analysis. In Privacy, Big Data, and the Public Good: Frameworks for Engagement, ed. Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum, 296–322. New York: Cambridge University Press. Dwork, Cynthia, and Deirdre K. Mulligan. 2013. “It’s Not Privacy, and It’s Not Fair.” Stanford Law Review Online 66:35. Dwoskin, Elizabeth. 2014a. “Big Data’s High-Priests of Algorithms.” Wall Street Journal, August 8. Dwoskin, Elizabeth. 2014b. “Pandora Thinks It Knows If You Are a Republican.” Wall Street Journal, February 13.



Ebrahim, Shanil, Zahra N. Sohani, Luis Montoya, Arnav Agarwal, Kristian Thorlund, Edward J. Mills, and John P. A. Ioannidis. 2014. “Reanalyses of Randomized Clinical Trial Data.” JAMA: The Journal of the American Medical Association 312 (10): 1024–1032. Eckersley, Peter. 2010. How Unique Is Your Web Browser? In Proceedings of the 10th International Conference on Privacy Enhancing Technologies, 1–18. Berlin: Springer-Verlag. Edmondson, Amy C., Richard M. Bohmer, and Gary P. Pisano. 2001. “Disrupted Routines: Team Learning and New Technology Implementation in Hospitals.” Administrative Science Quarterly 46 (4): 685–716. Edney, Anna. 2015. “Doctor Watson Will See You Now, If IBM Wins Fight with Congress.” Bloomberg BNA Health IT Law and Industry Report, January 29. _display.adp (accessed February 11, 2016). Effinger, Anthony. 2015. “Compliance Is Now Calling the Shots, and Bankers Are Bristling.” Bloomberg Business, June 25. -the-shots-and-bankers-are-bristling (accessed February 8, 2016). Ekbia, Hamid R. 2008. Artificial Dreams: The Quest for Non-Biological Intelligence. New York: Cambridge University Press. Ekbia, Hamid R., and Les Gasser. 2007. Seeking Reliability in Freedom: The Case of F/OSS. In Computerization Movements and Technology Diffusion: From Mainframes to Ubiquitous Computing, ed. Margaret S. Elliott and Kenneth L. Kraemer, 420–449. Medford, NJ: Information Today. Ekbia, Hamid R., Michael Mattioli, Inna Kouper, G. Arave, Ali Ghazinejad, Timothy Bowman, Venkata Ratandeep Suri, Andrew Tsou, Scott Weingart, and Cassidy R. Sugimoto. 2015. “Big Data, Bigger Dilemmas: A Critical Review.” Journal of the Association for Information Science and Technology 66 (8): 1523–1545. Ekbia, Hamid R., and Bonnie Nardi. 2012. Inverse Instrumentality: How Technologies Objectify Patients and Players. Oxford: Oxford University Press. Ekbia, Hamid R., and Bonnie Nardi. 2014. 
“Heteromation and Its (Dis)contents: The Invisible Division of Labor between Humans and Machines.” First Monday 19 (6). fm/article/view/5331/4090 (accessed January 29, 2016). El Dib, Regina P., Álvaro N. Atallah, and Regis B. Andriolo. 2007. “Mapping the Cochrane Evidence for Decision Making in Health Care.” Journal of Evaluation in Clinical Practice 13 (4): 689–692. “Electronic Medical Records Adoption Model.” n.d. HIMSS Analytics. provider-solutions (accessed February 10, 2016). Elish, Madeleine, and Tim Hwang. 2015. “Praise the Machine! Punish the Human! The Contradictory History of Accountability in Automated Aviation.” Working Paper #1, Intelligence and Autonomy Initiative, Data and Society Research Institute, February 24. Elliott, Margaret S., and Kenneth L. Kraemer, eds. 2008. Computerization Movements and Technology Diffusion: From Mainframes to Ubiquitous Computing. Medford, NJ: Information Today.



Eskreis-Winkler, Lauren, Elizabeth P. Shulman, Scott A. Beal, and Angela L. Duckworth. 2014. “The Grit Effect: Predicting Retention in the Military, the Workplace, School, and Marriage.” Frontiers in Psychology 5 (36). (accessed February 1, 2016). Espeland, Wendy Nelson, and Mitchell L. Stevens. 1998. “Commensuration as a Social Process.” Annual Review of Sociology 24:313–343. Estrin, Deborah. 2014. “Small Data, Where N = Me.” Communications of the ACM 57 (4): 32–34. European Commission. 2009. Article 29 Data Protection Working Party and the Working Party on Police and Justice, The Future of Privacy: Joint Contribution to the Consultation of the European Commission on the Legal Framework for the Fundamental Right to Protection of Personal Data. 02356/09/EN, WP 168. European Parliament. 2013. Regulation of the European Parliament and of the Council on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data (General Data Protection Regulation): Unofficial Consolidated Version after LIBE Committee Vote, Provided by the Rapporteur. European Union. 1995. Directive 95/46/EC of the European Parliament and of the Council on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data. Eur. O.J. 95/L281. Executive Office of the President. 2014. Big Data: Seizing Opportunities, Preserving Values. Washington, DC: Executive Office of the President. EYGM Limited. 2014. “Centralized Operations: The Future of Operating Models for Risk, Control and Compliance Functions.” February. _operations:_future_of_Risk,_Control_and_Compliance/$FILE/EY-Insights-on-GRC-Centralized -operations.pdf (accessed December 12, 2015). Facebook. 2013. “Presto: Distributed SQL Query Engine for Big Data.” Presto. (accessed February 9, 2016). Federal Trade Commission (FTC). 1998. Privacy Online: A Report to Congress. Washington, DC: Federal Trade Commission. Federal Trade Commission (FTC). 2010. 
Protecting Consumer Privacy in an Era of Rapid Change: A Proposed Framework for Businesses and Policymakers, Preliminary FTC Staff Report. Washington, DC: Federal Trade Commission. Federal Trade Commission (FTC). 2012. “FTC Warns Marketers That Mobile Apps May Violate Fair Credit Reporting Act.” Press release, February 7. 2012/02/ftc-warns-marketers-mobile-apps-may-violate-fair-credit-reporting (accessed February 12, 2016). Feist Publications, Inc. v. Rural Telephone Service Co. 1991. 499 U.S. 340. Feldman, Michael, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2014. “Certifying and Removing Disparate Impact.” Cornell University Library, December 11. http:// (accessed February 2, 2016).



Felten, Edward W. 2014. Sensors without Surveillance: Defining the Sensor Society. Brisbane: University of Queensland. Fillmore, Christopher L., Bruce E. Bray, and Kensaku Kawamoto. 2013. “Systematic Review of Clinical Decision Support Interventions with Potential for Inpatient Cost Reduction.” BMC Medical Informatics and Decision Making 13 (1): 135–144. Finzer, William. 2013. “The Data Science Education Dilemma.” Technology Innovations in Statistics Education 7 (2). (accessed February 6, 2016). Fisher, William. 2001. Theories of Intellectual Property. In New Essays on the Legal and Political Theory of Property, ed. Stephen R. Munzer, 168–200. Cambridge: Cambridge University Press. Fleischmann, Kenneth R., and William A. Wallace. 2005. “A Covenant with Transparency: Opening the Black Box of Models.” Communications of the ACM 48 (5): 93–97. Floridi, Luciano. 2006. “Four Challenges for a Theory of Informational Privacy.” Ethics and Information Technology 8 (3): 109–119. Floyd, Larry A., Feng Xu, Ryan Atkins, and Cam Caldwell. 2013. “Ethical Outcomes and Business Ethics: Toward Improving Business Ethics Education.” Journal of Business Ethics 117 (4): 753–776. Foote, Susan Bartlett, and Robert J. Town. 2007. “Implementing Evidence-Based Medicine through Medicare Coverage Decisions.” Health Affairs 26 (6): 1634–1642. “Ford Pinto.” n.d. Wikipedia. (accessed December 12, 2015). Foucault, Michel. 1980. Power/Knowledge: Selected Interviews and Other Writings, 1972–1977. New York: Pantheon. Foucault, Michel. 1986. Space, Knowledge, and Power. In The Foucault Reader, ed. Paul Rabinow. Harmondsworth, UK: Penguin. Fox News. 2008. “Discovered by Accident, Viagra Still Popular 10 Years Later.” March 24. http://www (accessed February 11, 2016). Frank, Adam. 2013. “The Infinite Monkey Theorem Comes to Life.” NPR, December 10. http://www (accessed February 11, 2016). Frias-Martinez, Vanessa, and Enrique Frias-Martinez. 2014. 
“Spectral Clustering for Sensing Urban Land Use Using Twitter Activity.” Engineering Applications of Artificial Intelligence 35:237–245. Frias-Martinez, Vanessa, Younes Moumni, and Enrique Frias-Martinez. 2014. Estimation of Traffic Flow Using Passive Cell-Phone Data. In Proceedings of the International Workshop on Data Science for MacroModeling, 1–2. New York: ACM. Frick, Walter. 2014. “An Introduction to Data-Driven Decisions for Managers Who Don’t Like Math.” Harvard Business Review, May 19. -managers-who-dont-like-math (accessed February 9, 2016).



Friedman, Batya, and David G. Hendry. 2012. The Envisioning Cards: A Toolkit for Catalyzing Humanistic and Technical Imaginations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1145–1148. New York: ACM.
Friedman, Batya, Peter H. Kahn, and Alan Borning. 2006. Value Sensitive Design and Information Systems. In Human-Computer Interaction and Management Information Systems: Foundations, ed. Ping Zhang and Dennis F. Galletta, 348–372. Armonk, NY: M. E. Sharpe.
Friedman, Ted, Andrew White, and Saul Judah. 2015. Information Governance Requires a Comprehensive and Interrelated Range of Policy Types. Stamford, CT: Gartner.
Fuller, Matthew, and Andrew Goffey. 2012a. “Digital Infrastructures and the Machinery of Topological Abstraction.” Theory, Culture, and Society 29 (4–5): 311–333.
Fuller, Matthew, and Andrew Goffey. 2012b. Evil Media. Cambridge, MA: MIT Press.
Gabrys, Jennifer. 2014. “Programming Environments: Environmentality and Citizen Sensing in the Smart City.” Environment and Planning D: Society and Space 32 (1): 30–48.
GAIN Collaborative Research Group. 2007. “New Models of Collaboration in Genome-Wide Association Studies: The Genetic Association Information Network.” Nature Genetics 39 (9): 1045–1051.
Gantz, John F. 2007. The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010. Framingham, MA: IDC.
Gartner. 2008. “Gartner Says Worldwide Business Intelligence, Analytics, and Performance Management Grew 22 Percent in 2008.” Press release, April 1. (accessed February 9, 2016).
Gartner. 2014. “Gartner Says Worldwide Business Intelligence and Analytics Software Market Grew 8 Percent in 2013.” Press release, April 29. (accessed February 9, 2016).
Gazni, Ali, Cassidy R. Sugimoto, and Fereshteh Didegah. 2012. “Mapping World Scientific Collaboration: Authors, Institutions, and Countries.” Journal of the American Society for Information Science and Technology 63 (2): 323–335.
Gellman, Lindsay. 2014. “Big Data Gets Master Treatment at B-Schools.” Wall Street Journal, November 5. (accessed February 6, 2016).
Gellner, Ernest. 1994. Conditions of Liberty: Civil Society and Its Rivals. London: Hamish Hamilton.
Gentile, Mary C. 2010. Giving Voice to Values: How to Speak Your Mind When You Know What’s Right. New Haven, CT: Yale University Press.
Gentry, Eri. 2010. “Eri Gentry on Butter Mind/Coconut Mind.” Quantified Self, October 16. http:// (accessed February 1, 2016).
Gerlitz, Carolin, and Anne Helmond. 2013. “The Like Economy: Social Buttons and the Data-Intensive Web.” New Media and Society 15 (8): 1348–1365.



Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. 2003. “The Google File System.” Operating Systems Review 37:29–43.
Ghose, Tia. 2014. “Botox: Uses and Side Effects.” Live Science, August 18. http://www.livescience.com/44222-botox-uses-side-effects.html (accessed February 11, 2016).
Gitelman, Lisa, and Virginia Jackson. 2013. Introduction to “Raw Data” Is an Oxymoron, ed. Lisa Gitelman, 1–14. Cambridge, MA: MIT Press.
Gobble, MaryAnne M. 2013. “Big Data: The Next Big Thing in Innovation.” Research-Technology Management 56 (1): 64–66.
Good, Irving John. 1965. “Speculations concerning the First Ultraintelligent Machine.” Advances in Computers 6:31–88.
Goodwell, Allison E., Zhenduo Zhu, Debsunder Dutta, Jonathan A. Greenberg, Praveen Kumar, Marcelo H. Garcia, Bruce L. Rhoads, Robert R. Holmes, Gary Parker, David P. Berretta, and Robert B. Jacobson. 2014. “Assessment of Floodplain Vulnerability during Extreme Mississippi River Flood 2011.” Environmental Science and Technology 48 (5): 2619–2625.
Gordon and Betty Moore Foundation. 2013. “Bold New Partnership Launches to Harness Potential of Data Scientists and Big Data.” November 12. 11/12/%20bold_new_partnership_launches_to_harness_potential_of_data_scientists_and_big_data (accessed December 12, 2015).
Gorski, David. 2014. “Brian Hooker Proves Andrew Wakefield Wrong about Vaccines and Autism.” Respectful Insolence, August 22. -andrew-wakefield-wrong-about-vaccines-and-autism/ (accessed February 6, 2016).
Gray, Jim. 2009. eScience: A Transformed Scientific Method. In The Fourth Paradigm: Data-Intensive Scientific Discovery, ed. Tony Hey, Stewart Tansley, and Kristin M. Tolle, xix–xxxiii. Seattle: Microsoft Research.
Grimmelmann, James. 2015. “The Law and Ethics of Experiments on Social Media Users.” Colorado Technology Law Journal 13:219–272.
Gupta, Anurag, Ali S. Raja, and Ramin Khorasani. 2014. “Examining Clinical Decision Support Integrity: Is Clinician Self-Reported Data Entry Accurate?” Journal of the American Medical Informatics Association 21 (1): 23–26.
Guyatt, Gordon, John Cairns, David Churchill, Deborah Cook, Brian Haynes, Jack Hirsh, Jan Irvine, Mark Levine, Mitchell Levine, Jim Nishikawa, David Sackett, et al. (Evidence-Based Medicine Working Group). 1992. “Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine.” JAMA: The Journal of the American Medical Association 268 (17): 2420–2425.
Gymrek, Melissa, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich. 2013. “Identifying Personal Genomes by Surname Inference.” Science 339 (6117): 321–324.
Haga, Susanne B., and Julianne O’Daniel. 2011. “Public Perspectives regarding Data-Sharing Practices in Genomics Research.” Public Health Genomics 14 (6): 319–324.
Halzack, Sarah. 2015. “Privacy Advocates Try to Keep ‘Creepy,’ ‘Eavesdropping’ Hello Barbie from Hitting Shelves.” Washington Post, March 11. wp/2015/03/11/privacy-advocates-try-to-keep-creepy-eavesdropping-hello-barbie-from-hitting-shelves/ (accessed January 30, 2016).
Hannay, Jo Erskine, Carolyn MacLeod, Janice Singer, Hans Petter Langtangen, Dietmar Pfahl, and Greg Wilson. 2009. How Do Scientists Develop and Use Scientific Software? In Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, 1–8. Washington, DC: IEEE Computer Society.
Hartzog, Woodrow. 2012. “Chain-Link Confidentiality.” Georgia Law Review 46:657–704.
Haugeland, John. 1989. Artificial Intelligence: The Very Idea. Cambridge, MA: MIT Press.
Hekler, Eric B., Predrag Klasnja, Jon E. Froehlich, and Matthew P. Buman. 2013. Mind the Theoretical Gap: Interpreting, Using, and Developing Behavioral Theory in HCI Research. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 3307–3316. New York: ACM.
Heller, Agnes. 1999. A Theory of Modernity. Oxford: Blackwell.
Heller, Michael A., and Rebecca S. Eisenberg. 1998. “Can Patents Deter Innovation? The Anticommons in Biomedical Research.” Science 280 (5364): 698–701.
Helmond, Anne. 2015. “The Platformization of the Web: Making Web Data Platform Ready.” Social Media + Society 1 (2): 1–11.
Hern, Alex. 2015. “How Your Smartphone’s Battery Life Can Be Used to Invade Your Privacy.” Guardian, August 4.
Hess, Charlotte, and Elinor Ostrom. 2006. “A Framework for Analysing the Microbiological Commons.” International Social Science Journal 58 (188): 335–349.
Hettne, Kristina M., Harish Dharuri, Jun Zhao, Katherine Wolstencroft, Khalid Belhajjame, Stian Soiland-Reyes, Eleni Mina, Mark Thompson, Don Cruickshank, Lourdes Verdes-Montenegro, Julian Garrido, David De Roure, Oscar Corcho, Graham Klyne, Reinout van Schouwen, Peter A. C. ’t Hoen, Sean Bechhofer, Carole Goble, and Marco Roos. 2014. “Structuring Research Methods and Data with the Research Object Model: Genomics Workflows as a Case Study.” Journal of Biomedical Semantics 5 (1). (accessed February 5, 2016).
Hewlett-Packard. 2015. “Internet of Things Research Study.” GetDocument.aspx?docname=4AA5-4759ENW&cc=us&lc=en (accessed December 12, 2015).



Hill, David G. 2009. Data Protection: Governance, Risk Management, and Compliance. Boca Raton, FL: CRC Press.
Hill, Kashmir. 2011. “Adventures in Self-Surveillance: Fitbit, Tracking My Movement and Sleep.” Forbes, February 25. -fitbit-tracking-my-movement-and-sleep/#7d99eb6f29ab (accessed February 1, 2016).
Hill, Kashmir. 2013. “Snapchats Don’t Disappear: Forensics Firm Has Pulled Dozens of Supposedly Deleted Photos from Android Phones.” Forbes, May 9. 2013/05/09/snapchats-dont-disappear/#f556b972ed44 (accessed February 1, 2016).
Ho, Cynthia M. 2000. “Splicing Morality and Patent Law: Issues Arising from Mixing Mice and Men.” Washington University Journal of Law and Policy 2:247–285.
Hofstadter, Douglas. 1979. Gödel, Escher, Bach: An Eternal Golden Braid. New York: Basic Books.
Holden, Arthur. 2002. “The SNP Consortium: Summary of a Private Consortium Effort to Develop an Applied Map of the Human Genome.” BioTechniques 32:22–26.
“Homo Iconic.” 2015. Homoiconicity. (accessed March 8, 2015).
“Homoiconic Languages.” 2007. true Blue, April 19. _languages (accessed March 8, 2015).
Honoré, Antony. 2010. Causation in the Law. In The Stanford Encyclopedia of Philosophy (Winter 2010 Edition), ed. Edward N. Zalta. (accessed February 1, 2016).
Hoofnagle, Chris Jay. 2003. “Big Brother’s Little Helpers: How ChoicePoint and Other Commercial Data Brokers Collect and Package Your Data for Law Enforcement.” North Carolina Journal of International Law and Commercial Regulation 29:595.
Hooker, Brian S. 2014. “Measles-Mumps-Rubella Vaccination Timing and Autism among Young African American Boys: A Reanalysis of CDC Data.” Translational Neurodegeneration 3:16.
Horowitz, Carol R., Mimsie Robinson, and Sarena Seifer. 2009. “Community-Based Participatory Research from the Margin to the Mainstream: Are Researchers Prepared?” Circulation 119 (19): 2633–2642.
“How Might Your Choice of Browser Affect Your Job Prospects?” 2013. Economist, April 10. http://www -prospects (accessed April 28, 2016).
Hsieh, Chu-Cheng, James Neufeld, Tracy King, and Junghoo Cho. 2015. Efficient Approximate Thompson Sampling for Search Query Recommendation. In Proceedings of the 30th ACM/SIGAPP Symposium on Applied Computing, 740–746. New York: ACM.
Hu, Qing, Tamara Dinev, Paul Hart, and Donna Cooke. 2012. “Managing Employee Compliance with Information Security Policies: The Critical Role of Top Management and Organizational Culture.” Decision Sciences 43 (4): 615–659.



Huckabee, Gregory M., and Cherry Kolb. 2014. “Privacy in the Workplace, Fact or Fiction, and the Value of an Authorized Use Policy.” South Dakota Law Review 59 (1): 35–49.
Huijsing, Johan H. 2008. Smart Sensor Systems: Why? Where? How? In Smart Sensor Systems, ed. Gerard Meijer, 1–21. Hoboken, NJ: Wiley.
“Human Genome at Ten: The Sequence Explosion.” 2010. Nature 464 (7289): 670–671.
Hunkar, David. 2011. “The 50 Largest Pharmaceutical Companies by Sales.” Seeking Alpha, August 14. (accessed December 12, 2015).
Huys, Isabelle, Nele Berthels, Gert Matthijs, and Geertrui Van Overwalle. 2009. “Legal Uncertainty in the Area of Genetic Diagnostic Testing.” Nature Biotechnology 27 (10): 903–909.
“Immoral But Not Always Illegal: Price Gouging after Natural Disaster.” n.d. Legal Resources. (accessed December 12, 2015).
In the Matter of Eli Lilly and Company. 2002. FTC File No. 012 3214, Docket No. C-4047, Complaint, May 10.
INFORMS. n.d. “Code of Ethics/Conduct.” -Professional-Program/Applicants/CODE-OF-ETHICS (accessed December 12, 2015).
Institute for Advanced Analytics. 2014. “19- to 26-Months.” (accessed December 12, 2015).
International Cancer Genome Consortium. 2014. “The International Cancer Genome Consortium’s Evolving Data Protection Policies.” Nature Biotechnology 32 (6): 519–523.
International Human Genome Sequencing Consortium. 2001. “Initial Sequencing and Analysis of the Human Genome.” Nature 409:860–921.
International Organization for Standardization. 2009. “ISO 31000:2009: Risk Management—Principles and Guidelines.” November 15. (accessed January 31, 2016).
Isidore, Chris. 2015. “Elon Musk Gives $10M to Fight Killer Robots.” CNN Money, January 16. http:// (accessed February 11, 2016).
Jasny, Barbara R. 2013. “Realities of Data Sharing Using the Genome Wars as Case Study: An Historical Perspective and Commentary.” EPJ Data Science 2 (1): 1–15.
Jaspers, Monique W. M., Marian Smeulers, Hester Vermeulen, and Linda W. Peute. 2011. “Effects of Clinical Decision-Support Systems on Practitioner Performance and Patient Outcomes: A Synthesis of High-Quality Systematic Review Findings.” Journal of the American Medical Informatics Association 18 (3): 327–334.
Jay, Emma. 2010. “Viagra and Other Drugs Discovered by Accident.” BBC News, January 20. http:// (accessed February 11, 2016).



Jenders, Robert A., Jerome A. Osheroff, Dean F. Sittig, Eric A. Pifer, and Jonathan M. Teich. 2007. Recommendations for Clinical Decision Support Deployment: Synthesis of a Roundtable of Medical Directors of Information Systems. In AMIA Annual Symposium Proceedings, 359–363. Bethesda, MD: American Medical Informatics Association.
Jensen, Kyle, and Fiona Murray. 2005. “Intellectual Property Landscape of the Human Genome.” Science 310 (5746): 239–240.
John, Nicholas A. 2012. “Sharing and Web 2.0: The Emergence of a Keyword.” New Media and Society 15 (2): 167–182.
Johnson, Deborah G., and John M. Mulvey. 1995. “Accountability and Computer Decision Systems.” Communications of the ACM 38 (12): 58–64.
Johnson, M. Eric, and Eric Goetz. 2007. “Embedding Information Security into the Organization.” IEEE Security and Privacy 5 (3): 16–24.
Johnson, Jeff S., and Ravipreet S. Sohi. 2016. “Understanding and Resolving Major Contractual Breaches in Buyer-Supplier Relationships: A Grounded Theory Approach.” Journal of the Academy of Marketing Science 44 (2): 185–205.
Johnston, Casey. 2014. “Non-Gmail Users Suing Google for ‘Wiretapping’ Denied Class Action.” Ars Technica, March 19. -denied-class-action-status/ (accessed November 20, 2014).
Jones, James B., Walter F. Stewart, Jonathan D. Darer, and Dean F. Sittig. 2013. “Beyond the Threshold: Real-Time Use of Evidence in Practice.” BMC Medical Informatics and Decision Making 13 (1): 47–60.
JP Morgan Chase and Co. 2014. “How We Do Business—The Report.” December. http://files How_We_Do_Business.pdf (accessed December 12, 2015).
Kallinikos, Jannis. 2004. “The Social Foundations of the Bureaucratic Order.” Organization 11 (1): 13–36.
Kallinikos, Jannis. 2006. The Consequences of Information: Institutional Implications of Technological Change. Northampton, MA: Edward Elgar Publishing.
Kallinikos, Jannis. 2009. “On the Computational Rendition of Reality: Artefacts and Human Agency.” Organization 16 (2): 183–202.
Kallinikos, Jannis. 2012. Form, Function, and Matter: Crossing the Border of Materiality. In Materiality and Organizing: Social Interaction in a Technological World, ed. Paul M. Leonardi, Bonnie A. Nardi, and Jannis Kallinikos, 67–90. Oxford: Oxford University Press.
Kallinikos, Jannis, Aleksi Aaltonen, and Attila Marton. 2013. “The Ambivalent Ontology of Digital Artifacts.” Management Information Systems Quarterly 37 (2): 357–370.
Kallinikos, Jannis, and Ioanna D. Constantiou. 2015. “Big Data Revisited: A Rejoinder.” Journal of Information Technology 30 (1): 70–74.



Kaplan, Andreas M., and Michael Haenlein. 2010. “Users of the World, Unite! The Challenges and Opportunities of Social Media.” Business Horizons 53 (1): 59–68.
Katyal, Sonia K. 2005. “Privacy vs. Piracy.” Yale Journal of Law and Technology 7 (1): 222–345.
Katz, Leslie. 2013. “TweetPee: Huggies Sends a Tweet When Baby’s Wet.” CNET, May 8. http://www (accessed February 3, 2016).
Kawamoto, Kensaku, Tonya Hongsermeier, Adam Wright, Janet Lewis, Douglas S. Bell, and Blackford Middleton. 2013. “Key Principles for a National Clinical Decision Support Knowledge Sharing Framework: Synthesis of Insights from Leading Subject Matter Experts.” Journal of the American Medical Informatics Association 20 (1): 199–207.
Kaye, Jane, Catherine Heeney, Naomi Hawkins, Jantina de Vries, and Paula Boddington. 2009. “Data Sharing in Genomics: Re-shaping Scientific Practice.” Nature Reviews Genetics 10 (5): 331–335.
Kayyali, Basel, David Knott, and Steve Van Kuiken. 2013. “The Big-Data Revolution in US Health Care: Accelerating Value and Innovation.” Insights and Publications. health_systems_and_services/the_big-data_revolution_in_us_health_care (accessed February 10, 2016).
Kearney, Mike. 2013. “Driving Down the Price of Risk.” IBM Big Data and Analytics Hub, April 16. (accessed December 12, 2015).
Keats, Jonathon. 2006. “John Koza Has Built an Invention Machine.” Popular Science, April 18. (accessed February 11, 2016).
Keefe, Patrick Radden. 2014. “The Empire of Edge.” New Yorker, October 13, 76. http://www.newyorker.com/magazine/2014/10/13/empire-edge (accessed April 22, 2016).
Kemp Little LLP. 2011. “Cloud Computing: A Buyer’s Guide.” documents/cloud_computing_buyers_guide.pdf (accessed February 7, 2016).
Kennedy, Jenny. 2013. Rhetorics of Sharing: Data, Imagination, and Desire. In Unlike Us Reader: Social Media Monopolies and Their Alternatives, ed. Geert Lovink and Miriam Rasch, 127–136. Amsterdam: Institute of Network Cultures.
Khan, Irfan. 2012. “Nowcasting: Big Data Predicts the Present.” D!gitalist Magazine, November 12. (accessed January 5, 2016).
Khandani, Amir E., Adlar J. Kim, and Andrew W. Lo. 2010. “Consumer Credit-Risk Models via Machine-Learning Algorithms.” Journal of Banking and Finance 34 (11): 2767–2787.
Khanlou, Nazilla, and Elizabeth Peter. 2005. “Participatory Action Research: Considerations for Ethical Review.” Social Science and Medicine 60 (10): 2333–2340.
Khatri, Vijay, and Carol V. Brown. 2010. “Designing Data Governance.” Communications of the ACM 53 (1): 148–152.



Kilsdonk, Ellen, Linda W. P. Peute, Sebastiaan L. Knijnenburg, and Monique W. M. Jaspers. 2011. “Factors Known to Influence Acceptance of Clinical Decision Support Systems.” Studies in Health Technology and Informatics 169:150–154.
Kirk, Jeremy. 2010. “EFF: Browsers Can Leave a Unique Trail on the Web.” PC World. http://www (accessed January 30, 2016).
Kitch, Edmund W. 1977. “The Nature and Function of the Patent System.” Journal of Law and Economics 20 (2): 265–290.
Kitchin, Rob. 2014. The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. Thousand Oaks, CA: SAGE Publications Ltd.
Klasnja, Predrag, and Wanda Pratt. 2012. “Healthcare in the Pocket: Mapping the Space of Mobile-Phone Health Interventions.” Journal of Biomedical Informatics 45 (1): 184–198.
Klette, Reinhard. 2014. Concise Computer Vision: An Introduction to Theory and Algorithms. London: Springer.
Kling, Rob, and Suzanne Iacono. 1988. “The Mobilization of Support for Computerization: The Role of Computerization Movements.” Social Problems 35 (3): 226–243.
Kling, Rob, and Suzanne Iacono. 1995. Computerization Movements and the Mobilization of Support for Computerization. In Ecologies of Knowledge: Work and Politics in Science and Technology, ed. Susan Leigh Star, 119–153. Albany: State University of New York Press.
Koza, John R., Martin A. Keane, and Matthew J. Streeter. 2003. “Evolving Inventions.” Scientific American, February 1, 52–59.
Krakauer, David C., Jessica C. Flack, Simon DeDeo, Doyne Farmer, and Daniel Rockmore. 2010. Intelligent Data Analysis of Intelligent Systems. In Advances in Intelligent Data Analysis IX, ed. Paul R. Cohen, Niall M. Adams, and Michael R. Berthold, 8–17. Berlin: Springer.
Krontiris, Ioannis, Marc Langheinrich, and Katie Shilton. 2014. “Trust and Privacy in Mobile Experience Sharing—Future Challenges and Avenues for Research.” IEEE Communications Magazine 52 (8): 50–55.
Kuhn, Thomas. 1962. The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Kuner, Christopher, Fred H. Cate, Christopher Millard, Dan Jerker B. Svantesson, and Orla Lynskey. 2015. “Risk Management in Data Protection.” International Data Privacy Law 5 (2): 95.
LaMattina, John. 2013. “Even Pharma’s Good Deeds Are Criticized.” Forbes, May 6. http://www.forbes.com/sites/johnlamattina/2013/05/06/even-pharmas-good-deeds-are-criticized/#80cda4c48aab (accessed February 9, 2016).
Landau, Susan. 2015. “Control Use of Data to Protect Privacy.” Science 347 (6221): 504–506.
Landecker, Will, Michael D. Thomure, Luís M. A. Bettencourt, Melanie Mitchell, Garrett T. Kenyon, and Steven P. Brumby. 2013. Interpreting Individual Classifications of Hierarchical Networks. In 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 32–38.



Landes, William M., and Richard A. Posner. 2003. The Economic Structure of Intellectual Property Law. Cambridge, MA: Belknap Press.
Lane, Julia, Victoria Stodden, Stefan Bender, and Helen Nissenbaum, eds. 2014. Privacy, Big Data, and the Public Good: Frameworks for Engagement. New York: Cambridge University Press.
Lane, Shannon J., Nancy M. Heddle, Emmy Arnold, and Irwin Walker. 2006. “A Review of Randomized Controlled Trials Comparing the Effectiveness of Hand Held Computers with Paper Methods for Data Collection.” BMC Medical Informatics and Decision Making 6 (1): 23.
Laney, Doug. 2001. “3D Data Management: Controlling Data Volume, Velocity, and Variety.” Meta Group, February 6. -Controlling-Data-Volume-Velocity-and-Variety.pdf (accessed January 5, 2016).
Langheinrich, Marc. 2001. “Privacy by Design—Principles of Privacy-Aware Ubiquitous Systems.” Paper presented at Ubicomp 2001: Ubiquitous Computing, Atlanta, September 30–October 2.
Langlois, Ganaele. 2013. Social Media, or Towards a Political Economy of Psychic Life. In Unlike Us Reader: Social Media Monopolies and Their Alternatives, ed. Geert Lovink and Miriam Rasch, 50–60. Amsterdam: Institute of Network Cultures.
Langlois, Ganaele. 2014. Meaning in the Age of Social Media. London: Palgrave Macmillan.
Larivière, Vincent, and Yves Gingras. 2014. Measuring Interdisciplinarity. In Beyond Bibliometrics: Harnessing Multidimensional Indicators of Scholarly Impact, ed. Blaise Cronin and Cassidy R. Sugimoto, 187–200. Cambridge, MA: MIT Press.
Larson, Eric B. 2013. “Building Trust in the Power of ‘Big Data’ Research to Serve the Public Good.” JAMA: The Journal of the American Medical Association 309 (23): 2443–2444.
Lauchlan, Stuart. 2015. “Tableau Ramps up Revenues 75%, Adds 2,600 Customers.” diginomica, February 5. #.VrpwNxg5VE4 (accessed February 9, 2016).
Lawson, Philippa, and Mary O’Donoghue. 2009. Approaches to Consent in Canadian Data Protection Law. In Lessons from the Identity Trail: Anonymity, Privacy, and Identity in a Networked Society, ed. Ian Kerr, Valerie Steeves, and Carole Lucock, 23–42. Oxford: Oxford University Press.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–1205.
Leibowitz, Jon. 2009. “Introductory Remarks.” Paper presented at FTC Privacy Roundtable, Washington, DC, December 7.
Leslie, Stuart. 1993. The Cold War and American Science: The Military-Industrial-Academic Complex at MIT and Stanford. New York: Columbia University Press.
Letham, Benjamin, Cynthia Rudin, Tyler H. McCormick, and David Madigan. 2013. “An Interpretable Stroke Prediction Model Using Rules and Bayesian Analysis.” LethamRuMcMaAAAI13.pdf (accessed December 12, 2015).



Levy, Steven. 2015. “How 30 Random People in Knoxville May Change Your Facebook News Feed.” Backchannel, January 30. -really-want-in-their-news-feed-799dbfb2e8b1#.bmjzlvwzf (accessed February 9, 2016).
Lewis, Bill, and Daniel J. Berg. 1996. Threads Primer: A Guide to Multithreaded Programming. Upper Saddle River, NJ: Prentice Hall.
Li, Ian, Anind Dey, Jodi Forlizzi, Kristina Höök, and Yevgeniy Medynskiy. 2011. Personal Informatics and HCI: Design, Theory, and Social Implications. In CHI ’11 Extended Abstracts on Human Factors in Computing Systems, 2417–2420. New York: ACM.
Li, Ian, Yevgeniy Medynskiy, Jon Froehlich, and Jakob Larsen. 2012. Personal Informatics in Practice: Improving Quality of Life through Data. In CHI ’12 Extended Abstracts on Human Factors in Computing Systems, 2799–2802. New York: ACM.
Lin, Ching-Ping, Kari A. Stephens, Laura-Mae Baldwin, Gina A. Keppel, Ron J. Whitener, Abigail Echo-Hawk, and Diane Korngiebel. 2014. Developing Governance for Federated Community-Based EHR Data Sharing. In AMIA Joint Summits on Translational Science Proceedings, 71–76. Bethesda, MD: AMIA.
Livingston, Edward H., and Robert A. McNutt. 2011. “The Hazards of Evidence-Based Medicine: Assessing Variations in Care.” JAMA: The Journal of the American Medical Association 306 (7): 762–763.
Liyanagunawardena, Tharindu Rekha, Andrew Alexandar Adams, and Shirley Ann Williams. 2013. “MOOCs: A Systematic Study of the Published Literature, 2008–2012.” International Review of Research in Open and Distributed Learning 14 (3): 202–227.
Loiter, Jeffrey M., and Vicky Norberg-Bohm. 1999. “Technology Policy and Renewable Energy: Public Roles in the Development of New Energy Technologies.” Energy Policy 27 (2): 85–97.
Losh, Gillian. 2011. “The Hacker Within Focuses on Scientific Computing.” University of Wisconsin at Madison News. (accessed April 22, 2016).
Love, Dylan. 2011. “Here’s the Information Facebook Gathers on You as You Browse the Web.” Business Insider, November 18. (accessed February 12, 2016).
Love, Dylan. 2014. “Scientists Are Afraid to Talk about the Robot Apocalypse, and That’s a Problem.” Business Insider, July 18. (accessed February 11, 2016).
Lovink, Geert, and Miriam Rasch, eds. 2013. Unlike Us Reader: Social Media Monopolies and Their Alternatives. Amsterdam: Institute of Network Cultures.
Lowrance, William W., and Francis S. Collins. 2007. “Identifiability in Genomic Research.” Science 317 (5838): 600–602.
Luhmann, Niklas. 1995. Social Systems. Stanford, CA: Stanford University Press.



Lumb, David. 2015. “23andme Gives Pfizer Access to Its Genome Database.” Fast Company, January 13. (accessed February 11, 2016).
Lupton, Deborah. 2013. “Quantifying the Body: Monitoring and Measuring Health in the Age of mHealth Technologies.” Critical Public Health 23 (4): 393–403.
Lyman, Peter, and Hal R. Varian. 2003. “How Much Information?” archive/how-much-info-2003/ (accessed November 20, 2014).
Lyons, Dan. 2012. “Why Nate Silver Won, and Why It Matters.” ReadWrite, November 7. http:// (accessed February 6, 2016).
Mackenzie, Adrian. 2012. “More Parts Than Elements: How Databases Multiply.” Environment and Planning D: Society and Space 30 (2): 335–350.
Malenfant, Jacques, Marco Jacques, and François-Nicolas Demers. 1996. “A Tutorial on Behavioral Reflection and Its Implementation.” In Proceedings of Reflection ’96, 1–20.
Mann, Devin M. 2011. “Making Clinical Decision Support More Supportive.” Medical Care 49 (2): 115–116.
Manolio, Teri A. 2009. “Collaborative Genome-Wide Association Studies of Diverse Diseases: Programs of the NHGRI’s Office of Population Genomics.” Pharmacogenomics 10 (2): 235–241.
Manovich, Lev. 2002. The Language of New Media. Cambridge, MA: MIT Press.
Manyika, James, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. New York: McKinsey Global Institute.
Marcos, Mar, Jose A. Maldonado, Begoña Martínez-Salvador, Diego Boscá, and Montserrat Robles. 2013. “Interoperability of Clinical Decision-Support Systems and Electronic Health Records Using Archetypes: A Case Study in Clinical Trial Eligibility.” Journal of Biomedical Informatics 46 (4): 676–689.
Markel, Howard. 2013. “The Real Story behind Penicillin.” PBS NewsHour, September 27. http://www (accessed February 11, 2016).
Markham, Annette, and Elizabeth A. Buchanan. 2012. Ethical Decision-Making and Internet Research. (accessed February 1, 2016).
Markus, M. Lynne. 2014. Information Technology and Organizational Structure. In Computing Handbook, Third Edition: Information Systems and Information Technology, ed. Heikki Topi and Allen Tucker, 67-1–67-22. Boca Raton, FL: CRC Press.
Markus, M. Lynne, and Quang “Neo” Bui. 2012. “Going Concerns: Governance of Interorganizational Coordination Hubs.” Journal of Management Information Systems 28 (4): 163–197.



Markus, M. Lynne, Andrew Dutta, Charles W. Steinfield, and Rolf T. Wigand. 2008. The Computerization Movement in the US Home Mortgage Industry: Automated Underwriting from 1980 to 2004. In Computerization Movements and Technology Diffusion: From Mainframes to Ubiquitous Computing, ed. Margaret S. Elliott and Kenneth L. Kraemer, 115–144. Medford, NJ: Information Today.
Markus, M. Lynne, and Dax D. Jacobson. 2015. The Governance of Business Processes. In The International Handbook on Business Process Management 2: Strategic Alignment, Governance, People, and Change, 2nd ed., ed. Jan vom Brocke and Michael Rosemann, 311–332. New York: Springer.
Markus, M. Lynne, and Heikki Topi. 2015. Big Data, Big Decisions for Science, Society, and Business—NSF Project Outcomes Report. Waltham, MA: Bentley College.
Markus, M. Lynne, Charles W. Steinfield, Rolf T. Wigand, and Gabe Minton. 2006. “Industry-Wide Information Systems Standardization as Collective Action: The Case of the U.S. Residential Mortgage Industry.” Management Information Systems Quarterly 30 (SI): 439–465.
Marshall, Iain J., Joël Kuiper, and Byron C. Wallace. 2014. Automating Risk of Bias Assessment for Clinical Trials. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 88–95. New York: ACM.
Martin, Kirsten E. 2015. “Ethical Issues in the Big Data Industry.” MIS Quarterly Executive 14 (2): 68–85.
Masnick, Mike. 2011. “The Infinite Loop of Algorithmic Pricing on Amazon … Or How a Book on Flies Cost $23,698,655.93.” Techdirt, April 25. infinite-loop-algorithmic-pricing-amazon-how-book-flies-cost-2369865593.shtml (accessed February 9, 2016).
Mattioli, Dana. 2012. “On Orbitz, Mac Users Steered to Pricier Hotels.” Wall Street Journal, August 23. (accessed February 3, 2016).
Mattioli, Michael. 2012. “Communities of Innovation.” Northwestern University Law Review 106 (1): 103.
Maxwell, Kerry. 2014. “Solutionism.” Macmillan Dictionary, BuzzWord, August 6. (accessed December 12, 2015).
Mayer-Schönberger, Viktor, and Kenneth Cukier. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Houghton Mifflin Harcourt.
McAfee, Andrew. 2010. “The Future of Decision Making: Less Intuition, More Evidence.” Harvard Business Review, January 7. (accessed February 9, 2016).
McCallum, John C. 2015. “Disk Drive Prices (1955–2015).” (accessed December 12, 2015).
McCoy, Allison B., Lemuel R. Waitman, Julia B. Lewis, Julie A. Wright, David P. Choma, Randolph A. Miller, and Josh F. Peterson. 2012. “A Framework for Evaluating the Appropriateness of Clinical Decision Support Alerts and Responses.” Journal of the American Medical Informatics Association 19 (3): 346–352.



McCullagh, Declan. 2010. “Why No One Cares about Privacy Anymore.” CNET, March 12. http://www (accessed December 12, 2015).
McDonald, Aleecia M., and Lorrie Faith Cranor. 2008. “The Cost of Reading Privacy Policies.” I/S: A Journal of Law and Policy for the Information Society 4 (3): 540–564.
McElheny, Victor K. 2010. Drawing the Map of Life—Inside the Human Genome Project. New York: Perseus Books.
McGee, Matt. 2013. “EdgeRank Is Dead: Facebook’s News Feed Algorithm Now Has Close to 100K Weight Factors.” Marketing Land, August 16. -news-feed-algorithm-now-has-close-to-100k-weight-factors-55908 (accessed April 8, 2015).
McGinn, Robert. 2015. The Ethically Responsible Engineer. Hoboken, NJ: Wiley.
“Medication Guide: Botox.” n.d. (accessed December 10, 2015).
Medina, Eden. Forthcoming. “Rethinking Algorithmic Regulation: Lessons from Our Cybernetic Past.” Kybernetes.
Mehta, Anisha. 2014. “‘Bring Your Own Glass’: The Privacy Implications of Google Glass in the Workplace.” John Marshall Journal of Information Technology and Privacy Law 30 (3): 607–638.
Mendoza, Martha. 2013. “Google Pleads Its Case for Scanning Your Emails to Help Sell Ads.” Huffington Post, September 5. .html (accessed December 12, 2015).
Merck. 1995. “First Installment of Merck Gene Index Data Released to Public Databases: Cooperative Effort Promises to Speed Scientific Understanding of the Human Genome.” Press release, February 10. (accessed December 12, 2015).
Metz, Cade. 2015. “Facebook’s Human-Powered Assistant May Just Supercharge AI.” Wired, August 26. (accessed February 9, 2016).
Microsoft News Center. 2012. “The Big Bang: How the Big Data Explosion Is Changing the World.” Microsoft, February 11. -data-explosion-is-changing-the-world/ (accessed February 3, 2016).
Microsoft Research. 2013. “Microsoft Research at a Glance.” February. en-us/press/overview.aspx (accessed December 12, 2015).
Miles, Andrew, and Michael Loughlin. 2011. “Models in the Balance: Evidence-Based Medicine versus Evidence-Informed Individualized Care.” Journal of Evaluation in Clinical Practice 17 (4): 531–536.
Miller, Amalia R., and Catherine Tucker. 2009. “Privacy Protection and Technology Diffusion: The Case of Electronic Medical Records.” Management Science 55 (7): 1077–1093.
Miller, Arthur R. 1993. “Copyright Protection for Computer Programs, Databases, and Computer-Generated Works: Is Anything New since CONTU?” Harvard Law Review 106 (5): 977–1073.



Miller, Claire Cain. 2013a. “Google Accused of Wiretapping in Gmail Scans.” New York Times, October 1, B1.

Miller, Claire Cain. 2013b. “Universities Offer Courses in a Hot New Field: Data Science.” New York Times, April 11. -hot-new-field-data-science.html (accessed February 3, 2016).

Miller, Geoffrey. 2012. “The Smartphone Psychology Manifesto.” Perspectives on Psychological Science 7 (3): 221–237.

Miller, Peter, and Nikolas Rose. 2013. Governing the Present. Malden, MA: Polity Press.

Miller, Thomas P., Troyen A. Brennan, and Arnold Milstein. 2009. “How Can We Make More Progress in Measuring Physicians’ Performance to Improve the Value of Care?” Health Affairs 28 (5): 1429–1437.

Montecucco, Cesare, and Jordi Molgó. 2005. “Botulinal Neurotoxins: Revival of an Old Killer.” Current Opinion in Pharmacology 5 (3): 274–279.

Moore, Michael S. 2010. Causation and Responsibility: An Essay in Law, Morals, and Metaphysics. Oxford: Oxford University Press.

Morozov, Evgeny. 2013. To Save Everything, Click Here: The Folly of Technological Solutionism. New York: PublicAffairs.

Moxey, Annette, Jane Robertson, David Newby, Isla Hains, Margaret Williamson, and Sallie-Anne Pearson. 2010. “Computerized Clinical Decision Support for Prescribing: Provision Does Not Guarantee Uptake.” Journal of the American Medical Informatics Association 17 (1): 25–33.

“Multi-Armed Bandit.” n.d. Wikipedia. (accessed December 12, 2015).

Murdoch, Travis B., and Allan S. Detsky. 2013. “The Inevitable Application of Big Data to Health Care.” JAMA: The Journal of the American Medical Association 309 (13): 1351–1352.

Muris, Timothy J. 2001. Protecting Consumers’ Privacy: 2002 and Beyond. Washington, DC: Federal Trade Commission.

Murphy, Richard S. 1996. “Property Rights in Personal Information: An Economic Defense of Privacy.” Georgetown Law Journal 84:2381–2417.

Nafus, Dawn, and Jamie Sherman. 2014. “Big Data, Big Questions | This One Does Not Go Up to 11: The Quantified Self Movement as an Alternative Big Data Practice.” International Journal of Communication 8:11.

“Naive Bayes Spam Filtering.” n.d. Wikipedia. _filtering (accessed December 12, 2015).

Narayanan, Arvind, and Vitaly Shmatikov. 2008. Robust De-Anonymization of Large Sparse Datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, 111–125. Washington, DC: IEEE Computer Society.



Naser, Mohammad Amin. 2008. “Computer Software: Copyrights v. Patents.” Loyola Law and Technology Annual 8 (1): 37–43.

Nass, Sharyl J., Laura A. Levit, and Lawrence O. Gostin, eds. 2009. Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health through Research. Washington, DC: National Academies Press.

National Center for Biotechnology Information (NCBI). 2013. The NCBI Handbook. 2nd ed. Bethesda, MD: National Center for Biotechnology Information.

National Center for Biotechnology Information (NCBI). n.d. “GenBank and WGS Statistics.” http:// (accessed March 20, 2016).

National Institutes of Health (NIH). 2007. “Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).” 72 Federal Register 49290–02.

National Institutes of Health (NIH). 2014a. “Final NIH Genomic Data Sharing Policy.” Federal Register 79:51345–51354.

National Institutes of Health (NIH). 2014b. “NOT-OD-14-046: Ruth L. Kirschstein National Research Service Award (NRSA) Stipends, Tuition/Fees, and Other Budgetary Levels Effective for Fiscal Year 2014.” February.

National Research Council. 1997. Bits of Power: Issues in Global Access to Scientific Data. Washington, DC: National Academies Press.

National Research Council. 2006. Reaping the Benefits of Genomic and Proteomic Research: Intellectual Property Rights, Innovation, and Public Health. Washington, DC: National Academies Press.

National Science Foundation. 2014. “Investing in Science, Engineering, and Education for the Nation’s Future: National Science Foundation Strategic Plan for 2014–2018.” nsf14043/nsf14043.pdf (accessed December 12, 2015).

National Science Foundation. 2015. “Critical Techniques and Technologies for Advancing Foundations and Applications of Big Data Science and Engineering.” jsp?pims_id=504767 (accessed February 6, 2016).

Neale, Christopher. 2014. “8 Ways Big Data Helps Improve Global Water and Food Security.” GreenBiz, October 22. -and-food-security (accessed December 12, 2015).

“Need a Reminder to Check Your Baby’s Diaper?” 2013. Daily Mail, May 10.

Nelson, Gabe. 2013. “How Data Mining Helped GM Limit a Recall to 4 Cars.” Automotive News, October 28. -gm-limit-a-recall-to-4-cars (accessed February 9, 2016).

“New Rules for Big Data.” 2010. Economist, February 25. (accessed February 3, 2016).

Nichols, Hannah. 2015. “What Is Botox? How Does Botox Work?” MNT, October 14. http://www (accessed December 12, 2015).



Nissenbaum, Helen. 2010. Privacy in Context: Technology, Policy, and the Integrity of Social Life. Palo Alto, CA: Stanford University Press.

North Carolina State University’s ERM Initiative and Protiviti. 2015. “Executive Perspectives on Top Risks for 2015: Key Issues Being Discussed in the Boardroom and C-Suite.” erm/i/chan/library/NC-State-Protiviti-Survey-Top-Risks-2015.pdf (accessed December 12, 2015).

Office of the Secretary. 1979. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. Washington, DC: US Department of Health, Education, and Welfare.

Ohm, Paul. 2010. “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.” UCLA Law Review 57:1701–1777.

Ohm, Paul. 2014. Changing the Rules: General Principles for Data Use and Analysis. In Privacy, Big Data, and the Public Good: Frameworks for Engagement, ed. Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum, 96–111. New York: Cambridge University Press.

“1000 Genomes Project Data Available on Amazon Cloud.” 2012. NIH news release, March 29. http:// (accessed February 4, 2016).

O’Reilly Media. 2011. Big Data Now. Sebastopol, CA: O’Reilly Media.

Organization for Economic Cooperation and Development (OECD). 1980. Guidelines on the Protection of Privacy and Transborder Flows of Personal Data (C 58 Final).

Organization for Economic Cooperation and Development (OECD). 2013a. OECD Guidelines Governing the Protection of Privacy and Transborder Flows of Personal Data, C(80)58/FINAL, as Amended by C(2013)79.

Organization for Economic Cooperation and Development (OECD). 2013b. Supplementary Explanatory Memorandum to the Revised Recommendation of the Council concerning Guidelines Governing the Protection of Privacy and Transborder Flows of Personal Data.

Ossorio, Pilar N. 2011. “Bodies of Data: Genomic Data and Bioscience Data Sharing.” Social Research 78 (3): 907–932.

Ouellette, Lisa Larrimore. 2012. “Do Patents Disclose Useful Information?” Harvard Journal of Law and Technology 25 (2): 531–593.

Palmer, Tom G. 1990. “Are Patents and Copyrights Morally Justified? The Philosophy of Property Rights and Ideal Objects.” Harvard Journal of Law and Public Policy 13:817–865.

Paltoo, Dina N., et al. 2014. “Data Use under the NIH GWAS Data Sharing Policy and Future Directions.” Nature Genetics 46 (9): 934–938.

Park, Susan. 2014. “Employee Internet Privacy: A Proposed Act That Balances Legitimate Employer Rights and Employee Privacy.” American Business Law Journal 51 (4): 779–841.

Parris, Rich. 2012. “Online T&Cs Longer Than Shakespeare Plays—Who Reads Them?” Which? March 23. (accessed January 30, 2016).



Pasquale, Frank. 2015. The Black Box Society: The Secret Algorithms That Control Money and Information. Cambridge, MA: Harvard University Press.

Pearl, Judea. 1997. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco: Morgan Kaufmann.

Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press.

Pearson, Travis, and Rasmus Wegener. 2013. “Big Data: The Organizational Challenge.” Bain and Company: Insights, September 11. _challenge.aspx (accessed February 7, 2016).

Pennisi, Elizabeth. 2011. “Will Computers Crash Genomics?” Science 331 (6018): 666–668.

Pentland, Alex. 2014. Social Physics: How Good Ideas Spread—The Lessons from a New Science. New York: Penguin Press.

Pentland, Alex, David Lazer, Devon Brewer, and Tracy Heibeck. 2009. “Using Reality Mining to Improve Public Health and Medicine.” Studies in Health Technology and Informatics 149:93–102.

Peppet, Scott R. 2014. “Regulating the Internet of Things: First Steps toward Managing Discrimination, Privacy, Security, and Consent.” Texas Law Review 93 (85): 86–176.

Peysakhovich, Alex, and Seth Stephens-Davidowitz. 2015. “How Not to Drown in Numbers.” New York Times, May 2, SR6.

Phansalkar, Shobha, Amrita Desai, Anish Choksi, Eileen Yoshida, John Doole, Melissa Czochanski, Alisha D. Tucker, Blackford Middleton, Douglas Bell, and David W. Bates. 2013. “Criteria for Assessing High-Priority Drug-Drug Interactions for Clinical Decision Support in Electronic Health Records.” BMC Medical Informatics and Decision Making 13 (1): 65–76.

Phillips, Anne. 1997. Paradoxes of Participation. In Engendering Democracy, 96–100. Oxford: Oxford University Press.

Piwowar, Heather A. 2013. “Value All Research Products.” Nature 493 (7431): 159.

Plale, Beth, Bin Cao, Chathura Herath, and Yiming Sun. 2011. “Data Provenance for Preservation of Digital Geoscience Data.” Geological Society of America, Special Papers 482:125–137.

“PLOS Clarification Confuses Me More.” 2014. RXNM, March 3. 03/03/plos-clarification-confuses-me-more/ (accessed December 12, 2015).

Podesta, John. 2014. Big Data: Values and Governance. In Remarks as Delivered by Counselor John Podesta, the White House OSTP. Berkeley: University of California at Berkeley School of Information.

Presidency. 2014. Proposal for a Regulation of the European Parliament and of the Council on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data (General Data Protection Regulation) [First Reading]—Chapter IV. Council of the European Union, note 13722/14, October 3. (accessed January 31, 2016).



Press, Gil. 2015. “Top 10 Most-Funded Big Data Startups April 2015.” Forbes, April 27. http://www (accessed February 6, 2016).

Proctor, Paul E., John A. Wheeler, and Khushbu Pratap. 2015. Definition: Governance, Risk, and Compliance. Stamford, CT: Gartner.

Proffitt, Jennifer M., Hamid R. Ekbia, and Stephen D. McDowell. 2015. “Introduction to the Special Forum on Monetization of User-Generated Content—Marx Revisited.” Information Society 31 (1): 1–4.

Provost, Foster, and Tom Fawcett. 2013. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. Sebastopol, CA: O’Reilly.

Puhakainen, Petri, and Mikko Siponen. 2010. “Improving Employees’ Compliance through Information Systems Security Training: An Action Research Study.” Management Information Systems Quarterly 34 (4): 757–778.

Purpura, Stephen, Victoria Schwanda, Kaiton Williams, William Stubler, and Phoebe Sengers. 2011. Fit4life: The Design of a Persuasive Technology Promoting Healthy Behavior and Ideal Weight. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 423–432. New York: ACM.

Rahman, Tauhidur, Alexander T. Adams, Mi Zhang, Erin Cherry, Bobby Zhou, Huaishu Peng, and Tanzeem Choudhury. 2014. BodyBeat: A Mobile System for Sensing Non-Speech Body Sounds. In Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services, 2–13. New York: ACM.

Rai, Arti K. 2012. “Patent Validity across the Executive Branch: Ex Ante Foundations for Policy Development.” Duke Law Journal 61 (6): 1237–1281.

Rai, Arti K., and Rebecca S. Eisenberg. 2003. “Bayh-Dole Reform and the Progress of Biomedicine.” Law and Contemporary Problems 66 (1–2): 289–314.

Rawlinson, Kevin. 2015. “Microsoft’s Bill Gates Insists AI Is a Threat.” BBC News, January 29. http:// (accessed February 11, 2016).

Rawls, John. 1985. “Justice as Fairness: Political Not Metaphysical.” Philosophy and Public Affairs 14:223–251.

Reichman, J. H., and Paul F. Uhlir. 2003. “A Contractually Reconstructed Research Commons for Scientific Data in a Highly Protectionist Intellectual Property Environment.” Law and Contemporary Problems 66 (1–2): 315–462.

Rexrode, Christina. 2015. “Fed Faulted BofA regarding Its Foresight.” Wall Street Journal, June 16. http:// (accessed February 8, 2016).

Rijmenam, Mark van. 2012. Walmart Is Making Big Data Part of Its DNA. Datafloq. read/walmart-making-big-data-part-dna/509 (accessed February 9, 2016).

Roberts, Jessica L. 2014. “Healthism and the Law of Employment Discrimination.” Iowa Law Review, April 9. (accessed February 3, 2016).



Rosenberg, Ira A., Leslie Gladstone Restaino, and Lori M. Waldron. 2006. “Fierce Pharma Competition Fosters Partnerships.” New Jersey Law Journal 185 (5): 1–2.

Rosenblat, Alex, Tamara Kneese, and danah boyd. 2014. “Networked Employment Discrimination.” Data and Society Research Institute, October 8. EmploymentDiscrimination.pdf (accessed December 12, 2015).

Rosenblatt, Joel. 2014a. “Is Google Too Big to Sue Over Gmail Privacy Concerns?” Bloomberg Business, March 6. -for-class-action-status (accessed February 3, 2016).

Rosenblatt, Joel. 2014b. “No Class-Action Status for Lawsuits against Google in Email Privacy.” Newsday, March 19. -google-in-email-privacy-1.7437041 (accessed December 12, 2015).

Rosenfeld, Jeffrey, and Christopher E. Mason. 2013. “Pervasive Sequence Patents Cover the Entire Human Genome.” Genome Medicine 5:27.

Russell, Stuart, and Peter Norvig. 2010. Artificial Intelligence: A Modern Approach. 3rd ed. New York: Pearson.

Sackett, David L., William Rosenberg, J. A. Gray, R. Brian Haynes, and W. Scott Richardson. 1996. “Evidence Based Medicine: What It Is and What It Isn’t.” British Medical Journal 312 (7023): 71–74.

Samuelson, Pamela. 1985. “Allocating Ownership Rights in Computer-Generated Works.” University of Pittsburgh Law Review 47:1185–1228.

Satell, Greg. 2013. “Yes, Big Data Can Solve Real World Problems.” Forbes, December 3. http://www (accessed December 12, 2015).

Sato, Kentaro, ed. 2014. Manual of Patent Examining Procedure. 9th ed. Washington, DC: US Department of Commerce, US Patent and Trademark Office.

Schaffer, Jonathan. 2014. The Metaphysics of Causation. In The Stanford Encyclopedia of Philosophy (Summer 2014 Edition), ed. Edward N. Zalta. causation-metaphysics/ (accessed February 1, 2016).

Schaller, Robert R. 1997. “Moore’s Law: Past, Present, and Future.” IEEE Spectrum 34 (6): 52–59.

Schneier, Bruce. 2013. “Surveillance and the Internet of Things.” Schneier on Security, May 21. https:// (accessed December 12, 2015).
Schermer, Bart W., Bart Custers, and Simone Van der Hof. 2014. “The Crisis of Consent: How Stronger Legal Protection May Lead to Weaker Consent in Data Protection.” Ethics and Information Technology 16 (2): 171–182.

Scholz, Trebor, ed. 2013. Digital Labor: The Internet as Playground and Factory. New York: Routledge.

Schwann, Nanette M., Karen A. Bretz, Sherrine Eid, Terry L. Burger, Deborah Fry, Frederick Ackler, Paul J. Evans, David Romanchuk, Michelle Beck, Anthony J. Ardire, Harry Lukens, and Thomas M. McLoughlin Jr. 2011. “Point-of-Care Electronic Prompts: An Effective Means of Increasing Compliance, Demonstrating Quality, and Improving Outcome.” Anesthesia and Analgesia 113 (4): 869–876.

Schwartz, Gary T. 1991. “The Myth of the Ford Pinto Case.” Rutgers Law Review 43:1013–1068.

Schwartz, Paul M. 1999. “Privacy and Democracy in Cyberspace.” Vanderbilt Law Review 52:1609–1700.

Schwartz, Paul M., and Daniel J. Solove. 2011. “The PII Problem: Privacy and a New Concept of Personally Identifiable Information.” New York University Law Review 86:1814.

Seddon, Jonathan J. M., and Wendy L. Currie. 2013. “Cloud Computing and Trans-border Health Data: Unpacking U.S. and EU Healthcare Regulation and Compliance.” Health Policy and Technology 2 (4): 229–241.

Seidling, Hanna M., Shobha Phansalkar, Diane L. Seger, Marilyn D. Paterno, Shimon Shaykevich, Walter E. Haefeli, and David W. Bates. 2011. “Factors Influencing Alert Acceptance: A Novel Approach for Predicting the Success of Clinical Decision Support.” Journal of the American Medical Informatics Association 18 (4): 479–484.

Seijts, Jana, and Paul Bigus. 2012. Netflix: The Public Relations Box Office Flop. Boston: Harvard Business Publishing.

Sennett, Richard. 2006. The Culture of the New Capitalism. New Haven, CT: Yale University Press.

Sensity. n.d. “NetSense Platform.” (accessed December 12, 2015).

Service, Robert F. 2013. “Biology’s Dry Future.” Science 342 (6155): 186–189.

Shapin, Steven. 1989. “The Invisible Technician.” American Scientist 77:554–563.

Shilton, Katie. 2009. “Four Billion Little Brothers? Privacy, Mobile Phones, and Ubiquitous Data Collection.” Communications of the ACM 52 (11): 48–53.

Shilton, Katie. 2012. “Participatory Personal Data: An Emerging Research Challenge for the Information Sciences.” Journal of the American Society for Information Science and Technology 63 (10): 1905–1915.

Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 1–10. Piscataway, NJ: IEEE.

Siegel, Eric. 2013. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Hoboken, NJ: Wiley.

Silva, Liz. 2014. “PLOS’ New Data Policy: Public Access to Data.” PLOS, February 24. http://blogs.plos.org/everyone/2014/02/24/plos-new-data-policy-public-access-data-2/ (accessed February 6, 2016).

Silver, Nate. 2012. The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t. New York: Penguin Press.

Simmhan, Yogesh L., Beth Plale, and Dennis Gannon. 2005. “A Survey of Data Provenance in E-Science.” SIGMOD Record 34 (3): 31–36.



Simpao, Allan F., Luis M. Ahumada, Jorge A. Gálvez, and Mohamed A. Rehman. 2014. “A Review of Analytics and Clinical Informatics in Health Care.” Journal of Medical Systems 38 (4): 1–7.

Singer, Natasha. 2012. “You for Sale: Mapping, and Sharing, the Consumer Genome.” New York Times, June 17, BU1.

Singh, Maanvi. 2014. “Our Supercomputer Overlord Is Now Running a Food Truck.” NPR, March 4. -running-a-food-truck (accessed February 11, 2016).

Sittig, Dean F., Michael A. Krall, Richard H. Dykstra, Allen Russell, and Homer L. Chin. 2006. “A Survey of Factors Affecting Clinician Acceptance of Clinical Decision Support.” BMC Medical Informatics and Decision Making 6 (1): 6.

Team. n.d. “The History of Botox.” _History_of_Botox.aspx (accessed February 9, 2015).

Solove, Daniel. 2001. “Privacy and Power: Computer Databases and Metaphors for Information Privacy.” Stanford Law Review 53:1393–1462.

Solove, Daniel. 2006. “A Taxonomy of Privacy.” University of Pennsylvania Law Review 154 (3): 477–560.

Solove, Daniel. 2013. “Privacy Self-Management and the Consent Dilemma.” Harvard Law Review 126:1880–1903.

Spencer, Steven J., Claude M. Steele, and Diane M. Quinn. 1999. “Stereotype Threat and Women’s Math Performance.” Journal of Experimental Social Psychology 35 (1): 4–28.

Stanley, Jay. 2012. “Eight Problems with ‘Big Data.’” American Civil Liberties Union, April 25. https:// (accessed December 12, 2015).

Starr, Sonja B. 2014. “Evidence-Based Sentencing and the Scientific Rationalization of Discrimination.” Stanford Law Review 66:803.

Stephenson, Frank. 2002. “A Tale of Taxol.” Florida State University, Office of Research. http://www (accessed February 11, 2016).

Stiegler, Bernard. 2013. The Most Precious Good in the Era of Social Technologies. In Unlike Us Reader: Social Media Monopolies and Their Alternatives, ed. Geert Lovink and Miriam Rasch, 16–30. Amsterdam: Institute of Network Cultures.

Stiglitz, Joseph. 2014. “Inequality Is Not Inevitable.” New York Times, June 29, SR1.
Stodden, Victoria. 2014. “My Input for the OSTP RFI on Reproducibility.” Victoria’s Blog, September 28. (accessed December 12, 2015).

Strahilevitz, Lior. 2013. “Toward a Positive Theory of Privacy Law.” Harvard Law Review 113 (1). http:// (accessed February 3, 2016).



Sugimoto, Cassidy R., and Scott Weingart. 2015. “The Kaleidoscope of Disciplinarity.” Journal of Documentation 71 (4): 775–794.

Sullivan, Gayla, Jay Heiser, and Rob McMillan. 2015. Regulatory Compliance Alone Cannot Mitigate Cloud Vendor Risks. Stamford, CT: Gartner.

Sullivan, Mark. 2015. “23andMe Has Signed 12 Other Genetic Data Partnerships beyond Pfizer and Genentech.” VB News, January 14. -genetic-data-partnerships-beyond-pfizer-and-genentech/ (accessed February 11, 2016).

Swanson, Bret, and George Gilder. 2008. “Estimating the Exaflood.” White paper. Seattle, WA: Discovery Institute.

Sweeney, Latanya. 2000. “Simple Demographics Often Identify People Uniquely.” Data Privacy Working Paper 3, Carnegie Mellon University. (accessed December 12, 2015).

Sweeney, Latanya. 2002. “K-Anonymity: A Model for Protecting Privacy.” International Journal of Uncertainty, Fuzziness, and Knowledge-Based Systems 10 (5): 557–570.

Szlezák, Nicole, M. Evers, J. Wang, and L. Pérez. 2014. “The Role of Big Data and Advanced Analytics in Drug Discovery, Development, and Commercialization.” Clinical Pharmacology and Therapeutics 95 (5): 492–495.

Takabi, Hassan, James B. D. Joshi, and Gail-Joon Ahn. 2010. “Security and Privacy Challenges in Cloud Computing Environments.” IEEE Security and Privacy 8 (6): 24–31.

Tallon, Paul P., Ronald V. Ramirez, and James E. Short. 2014. “The Information Artifact in IT Governance: Toward a Theory of Information Governance.” Journal of Management Information Systems 30 (3): 141–177.

Tallon, Paul P., James E. Short, and Malcolm W. Harkins. 2013. “The Evolution of Information Governance at Intel.” MIS Quarterly Executive 12 (4): 189–198.

Tanner, Adam. 2013. “Senate Report Blasts Data Brokers for Continued Secrecy.” Forbes, December 19. -secrecy/ (accessed August 23, 2015).

Tarrant, David, Ben O’Steen, Tim Brody, Steve Hitchcock, Neil Jefferies, and Leslie Carr. 2009. “Using OAI-ORE to Transform Digital Repositories into Interoperable Storage and Services Applications.” Code4Lib Journal 6. (accessed February 5, 2016).

Taylor, Nick Paul. 2015. “Google Applies Large-Scale Machine Learning to Drug Discovery.” FierceBiotechIT, March 9. -discovery/2015-03-09 (accessed December 12, 2015).

“Text of the Genetic Information Nondiscrimination Act of 2008.” H.R. 493 (110th). Pub. L. No. H.R. 493. (accessed February 12, 2016).



Thaler, Stephen. 2013. Creativity Machine® Paradigm. In Encyclopedia of Creativity, Invention, Innovation, and Entrepreneurship, ed. Elias G. Carayannis, 447–456. New York: Springer.

Thornhill, Ted. 2012. “‘Magic Carpet’ Could Help Prevent Falls among Elderly, Say University of Manchester Scientists.” Huffington Post, September 4. magic-carpet-could-help-prevent-falls-university-of-manchester_n_1853592.html (accessed November 20, 2014).

Timmermans, Stefan, and Aaron Mauck. 2005. “The Promises and Pitfalls of Evidence-Based Medicine.” Health Affairs 24 (1): 18–28.

Ting, Patricia T., and Anatoli Freiman. 2004. “The Story of Clostridium Botulinum: From Food Poisoning to Botox.” Clinical Medicine 4 (3): 258–261.

Tiwari, Ruchi, Demetra S. Tsapepas, Jaclyn T. Powell, and Spencer T. Martin. 2013. “Enhancements in Healthcare Information Technology Systems: Customizing Vendor-Supplied Clinical Decision Support for a High-Risk Patient Population.” Journal of the American Medical Informatics Association 20 (2): 377–380.

Topaz, Maxim, and Kathryn H. Bowles. 2012. “Electronic Health Records and Quality of Care: Mixed Results and Emerging Debates. Achieving Meaningful Use in Research with Information Technology Column.” Online Journal of Nursing Informatics 16 (1). (accessed December 12, 2015).

Trader, Tiffany. 2014. “Beware of Big Data Demonization, IBM Europe Chairman Says.” Datanami, March 25. _chairman_says/ (accessed December 12, 2015).

Trappler, Thomas J. 2012. “When Your Data’s in the Cloud, It’s Still Your Data.” Computerworld, January 17. -still-your-data-.html (accessed February 7, 2016).

Tsai, Janice Y., Patrick Kelley, Paul Drielsma, Lorrie Faith Cranor, Jason Hong, and Norman Sadeh. 2009. Who’s Viewed You? The Impact of Feedback in a Mobile Location-Sharing Application. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2003–2012. Boston: ACM.

Tsalikis, John, and David J. Fritzsche. 1989. “Business Ethics: A Literature Review with a Focus on Marketing Ethics.” Journal of Business Ethics 8 (9): 695–743.

Tufts Center for the Study of Drug Development. 2014. “Briefing: Cost of Developing a New Drug.” November 18., _2014..pdf (accessed February 11, 2016).

Udell, Jon. 2012. “Whose Cloud Stores Your Health Data?” Wired. insights/2012/11/whose-cloud-stores-your-health-data/ (accessed December 12, 2015).

University of Manchester. 2012. “‘Magic Carpet’ Could Help Prevent Falls.” September 4. http://www (accessed November 20, 2014).



University of San Francisco. n.d. “FAQs.” (accessed December 12, 2015).

Urban, Tim. 2015. “The AI Revolution: Road to Superintelligence.” Wait But Why, January 22. http:// (accessed December 12, 2015).

US Copyright Office. 2014. Compendium of U.S. Copyright Office Practices. 3rd ed. comp3/docs/compendium.pdf (accessed February 11, 2016).

US Department of Energy, Human Genome Project. n.d. “Policies on Release of Human Genomic Sequence Data: Bermuda-Quality Sequence.” Human Genome Project Information Archive, 1990–2003. (accessed December 12, 2015).

US Department of Health and Human Services. n.d. “What Is Meaningful Use?” healthit/meaningfuluse/MU%20Stage1%20CQM/mu.html (accessed August 4, 2014).

US Food and Drug Administration (FDA). 2010. “FDA Approves Botox to Treat Chronic Migraines.” News release, October 15.

US Food and Drug Administration (FDA). 2013. “FDA Approves Botox to Treat Overactive Bladder.” News release, January 18.

US Food and Drug Administration (FDA). 2015. “FDA Approves Praluent to Treat Certain Patients with High Cholesterol.” News release, July 24. (accessed December 12, 2015).

US Patent and Trademark Office. 1999. “Revised Utility Examination Guidelines; Request for Comments.” 64 Federal Register 71440, December 21.

Vagata, Pamela, and Kevin Wilfong. 2014. “Scaling the Facebook Data Warehouse to 300 PB.” Facebook Code, April 10. -to-300-pb/ (accessed December 12, 2015).

Vallor, Shannon. 2012. “Flourishing on Facebook: Virtue Friendship and New Social Media.” Ethics and Information Technology 14 (3): 185–199.

van Dijck, José. 2013. The Culture of Connectivity: A Critical History of Social Media. Oxford: Oxford University Press.

Vanderplas, Jake. 2013. “The Big Data Brain Drain: Why Science Is in Trouble.” Pythonic Perambulations, October 26. (accessed December 12, 2015).

Vanderplas, Jake. 2014. “Hacking Academia: Data Science and the University.” Pythonic Perambulations, August 22. (accessed December 12, 2015).

Varian, Hal R. 2010. “Computer Mediated Transactions.” American Economic Review 100:1–10.



Vaz, Paulo, and Fernanda Bruno. 2003. “Types of Self-Surveillance: From Abnormality to Individuals ‘At Risk.’” Surveillance and Society 1 (3): 272–291.

Venter, Will, and Edgar A. Whitley. 2012. “A Critical Review of Cloud Computing: Researching Desires and Realities.” Journal of Information Technology 27:179–197.

Vertinsky, Liza, and Todd M. Rice. 2002. “Thinking about Thinking Machines: Implications of Machine Inventors for Patent Law.” Boston University Journal of Science and Technology Law 8:574–614.

Violino, Bob. 2012. “Case Study: What’s the Business Case for GRC?” CSO, March 28. http://www (accessed December 12, 2015).

Vygotsky, Lev. 2012. Thought and Language. Ed. Alex Kozulin. Cambridge, MA: MIT Press.

Wakefield, A. J., S. Murch, A. Anthony, J. Linnell, D. Casson, M. Malik, M. Berelowitz, A. P. Dhillon, M. A. Thompson, P. Harvey, A. Valentine, S. E. Davies, and J. A. Walker-Smith. 1998. “Ileal-Lymphoid-Nodular Hyperplasia, Non-Specific Colitis, and Pervasive Developmental Disorder in Children.” Lancet 351 (9103): 637–641.

Walker, Joseph. 2013. “Data Mining to Recruit Sick People.” Wall Street Journal, December 17. (accessed February 8, 2016).

Wallace, Byron C., Kevin Small, Carla E. Brodley, Joseph Lau, Christopher H. Schmid, Lars Bertram, Christina M. Lill, Joshua T. Cohen, and Thomas A. Trikalinos. 2012. “Toward Modernizing the Systematic Review Pipeline in Genetics: Efficient Updating via Data Mining.” Genetics in Medicine 14 (7): 663–669.

Walzer, Michael. 1983. Spheres of Justice: A Defense of Pluralism and Equality. New York: Basic Books.

Wang, Richard Y., and Diane M. Strong. 1996. “Beyond Accuracy: What Data Quality Means to Data Consumers.” Journal of Management Information Systems 12 (4): 5–33.

Washington, Anne L. 2014. “Can Big Data Be Described as a Supply Chain?” Social Science Research Network, March 14. (accessed December 12, 2015).

Webb, Cynthia. 2004. “Google’s Eyes in Your Inbox.” Washington Post, April 2.

Webster, Paul Christopher. 2013. “Big Data’s Dirty Secret.” Canadian Medical Association Journal 185 (11): E509–E510.

Weinberger, David. 2012. Too Big to Know: Rethinking Knowledge Now That the Facts Aren’t the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. New York: Basic Books.

“We’ll See You, Anon.” 2015. Economist, August 15. -technology/21660966-can-big-databases-be-kept-both-anonymous-and-useful-well-see-you-anon (accessed January 30, 2016).



Westin, Alan F. 1967. Privacy and Freedom. New York: Atheneum.

“What Is a Neo-Luddite?” n.d. wiseGEEK. (accessed December 12, 2015).

“What Is the Ultimate Idea?” n.d. Imagination Engines Inc. (accessed December 12, 2015).

“What Is Watson?” n.d. IBM. .html (accessed February 8, 2015).

Wheatman, Jeffrey, and L. Akdshay. 2015. The Executive Leader’s Guide to Managing IT Regulatory Change. Stamford, CT: Gartner.

White House. 2015. “Administration Discussion Draft.” Consumer Privacy Bill of Rights Act 101:102.

White House, Office of Science and Technology Policy. 2012. “Obama Administration Unveils ‘Big Data’ Initiative: Announces $200 Million in New R&D Investments.” Press release, March 29. (accessed January 5, 2016).

Wigan, Marcus R., and Roger Clarke. 2013. “Big Data’s Big Unintended Consequences.” Computer 46 (6): 46–53.

Williams, Alex. 2013. “The Power of Data Exhaust.” TechCrunch, May 26. http://social.techcrunch.com/2013/05/26/the-power-of-data-exhaust/ (accessed December 12, 2015).

Wilson, Greg. 2014. “Software Carpentry: Lessons Learned.” F1000Research 3:62.

Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian Mitchell, Mark Plumbley, Ben Waugh, Ethan P. White, and Paul Wilson. 2014. “Best Practices for Scientific Computing.” PLoS Biology 12 (1): e1001745.

Wingfield, Nick, and Brian Stelter. 2011. “How Netflix Lost 800,000 Members, and Good Will.” New York Times, October 25, B1.

Woese, Carl R. 2004. “A New Biology for a New Century.” Microbiology and Molecular Biology Reviews 68 (2): 173–186.

Wolf, Gary. 2014. “Access Matters.” Quantified Self, September 14. access-matters/ (accessed December 12, 2015).

Woodcock, Janet. 2014. “Another Important Step in FDA’s Journey toward Enhanced Safety through Full-Scale ‘Active Surveillance.’” FDA Voice, December 30. 2014/12/another-important-step-in-fdas-journey-towards-enhanced-safety-through-full-scale-active-surveillance/ (accessed February 11, 2016).

World Economic Forum. 2011. Personal Data: The Emergence of a New Asset Class. http://www3.weforum.org/docs/WEF_ITTC_PersonalDataNewAsset_Report_2011.pdf (accessed November 20, 2014).
World Economic Forum. 2012. Big Data, Big Impact: New Possibilities for International Development. http:// (accessed November 20, 2014).
Wright, Adam, Dean F. Sittig, Joan S. Ash, Joshua Feblowitz, Seth Meltzer, Carmit McMullen, Ken Guappone, Jim Carpenter, Joshua Richardson, Linas Simonaitis, R. Scott Evans, W. Paul Nichol, and Blackford Middleton. 2011. “Development and Evaluation of a Comprehensive Clinical Decision Support Taxonomy: Comparison of Front-End Tools in Commercial and Internally Developed Electronic Health Record Systems.” Journal of the American Medical Informatics Association 18 (3): 232–242.
W3C Consortium. 2014. “Dataset Descriptions: HCLS Community Profile.” http://htmlpreview.github.io/? (accessed December 12, 2015).
Wu, Felix T. 2012. “Privacy and Utility in Data Sets.” In Proceedings of the 2012 Telecommunications Policy Research Conference (TPRC). Arlington, VA.
Wu, Helen W., Paul K. Davis, and Douglas S. Bell. 2012. “Advancing Clinical Decision Support Using Lessons from Outside of Healthcare: An Interdisciplinary Systematic Review.” BMC Medical Informatics and Decision Making 12 (1): 90–100.
Zimmer, Michael. 2010. “‘But the Data Is Already Public’: On the Ethics of Research in Facebook.” Ethics and Information Technology 12 (4): 313–325.
Zuboff, Shoshana. 1988. In the Age of the Smart Machine: The Future of Work and Power. New York: Basic Books.
Editors

Cassidy R. Sugimoto is an associate professor in the School of Informatics and Computing at Indiana University at Bloomington. Her research is within the domain of scholarly communication and scientometrics, examining the formal and informal ways in which knowledge producers consume and disseminate scholarship. Of particular interest are the ways in which big social media data have transformed scholarship. Her most recent edited works have explored the proliferation of novel forms of tools for scholarly assessment (Beyond Bibliometrics [MIT Press, 2014]), the historical criticism of scholarly metrics (Scholarly Metrics under the Microscope [Information Today, Inc., 2015]), and theories guiding both research and practice of metrics (Theories of Scholarly Communication and Informetrics [de Gruyter, 2016]). She has also published more than one hundred journal articles and conference papers on this topic. Her research has received funding from the National Science Foundation, the Institute of Museum and Library Services, and the Alfred P. Sloan Foundation, among other agencies. She is currently serving as president of the International Society for Scientometrics and Informetrics.

Hamid R. Ekbia is a professor of informatics, cognitive science, and global and international studies at Indiana University at Bloomington, where he also directs the Center for Research on Mediated Interaction. He is interested in the political economy of computing and how technologies mediate socioeconomic relations in modern societies. His forthcoming book (with Bonnie Nardi), Heteromation and Other Stories of Computing and Capitalism, examines computer-mediated modes of value extraction in capitalist economies. His next book (with Harmeet Sawhney), tentatively titled Reconsidering Access, studies the modernist concept of “access” to universal service, as historically implemented in various domains.
His earlier book Artificial Dreams: The Quest for Non-Biological Intelligence (Cambridge University Press, 2008) was a critical-technical analysis of artificial intelligence.

Michael Mattioli is an associate professor of law at the Indiana University Maurer School of Law in Bloomington. Previously, he was the 2011–2012 Microsoft Research Fellow at the
Berkeley Center for Law and Technology at the University of California at Berkeley School of Law, and the 2010–2011 Postdoctoral Fellow in Law, Economics, and Technology at the University of Michigan Law School. Mattioli’s research explores the implications of private ordering on innovation policy and theory. This theme canvasses a variety of topics, including the structure and governance of patent pools, how information transactions influence innovation, and most recently, the challenges of aggregating, standardizing, and reusing big data. His work has been published in top-tier law reviews, including the Columbia Law Review (“Partial Patents,” 2011), Northwestern University Law Review (“Communities of Innovation,” 2012), Harvard Journal of Law and Technology (“Patent Republics,” 2014), and Minnesota Law Review (“Disclosing Big Data,” 2014). Michael also recently contributed a chapter to a book published by Springer that explores the management of data within genetic research centers (“Comparative Issues in the Governance of Research Biobanks,” 2013).

Contributing Authors

Ryan Abbott is an associate professor at Southwestern Law School and visiting assistant professor at David Geffen School of Medicine at the University of California at Los Angeles. He has served as a consultant on health care financing, intellectual property, and public health for international organizations, academic institutions, and private enterprises, including the World Health Organization and the World Intellectual Property Organization. Professor Abbott has published widely on issues associated with health care law and intellectual property in legal, medical, and scientific peer-reviewed journals. He is a licensed physician, attorney, and acupuncturist. He is a graduate of the University of California at San Diego School of Medicine, Yale Law School, the Master of Traditional Oriental Medicine program at Emperor’s College, and the University of California at Los Angeles (BS).
Abbott has been the recipient of numerous research fellowships, scholarships, and awards, and has served as principal investigator of biomedical research studies at the University of California. He is a registered patent attorney with the US Patent and Trademark Office, and a member of the California and New York state bars.

Cristina Alaimo holds a PhD in information systems from the Department of Management at the London School of Economics and Political Science. Interested in social media and the digitization of social life, her approach to technology draws on a range of social science fields, including sociology, cultural studies, and media studies. She considers social media platforms and social networks both as large databases of the social and as innovation platforms, paying due attention to the disruptive potentialities of the associated service ecosystems for traditional business sectors such as retailing, media, and marketing. Prior to this, Alaimo was a researcher for LSE Cities. She has also worked as a researcher and art
curator. Alaimo has published on creative clusters, cultural management, art, and cultural policy.

Kent R. Anderson is the publisher of Science and its portfolio of journals. He was previously CEO and publisher for the Journal of Bone and Joint Surgery and its parent company, STRIATUS. Anderson is the immediate past president of the Society for Scholarly Publishing, the founder of the Scholarly Kitchen, and a member of the Journal Oversight Committee for the Journal of the American Medical Association. He has been an executive in the Massachusetts Medical Society’s Publishing Division, publishing director for the New England Journal of Medicine, and director of medical journals at the American Academy of Pediatrics. He writes and speaks occasionally. Anderson has degrees in business and English.

Mark Andrejevic is a media scholar who writes about surveillance, new media, and popular culture. In broad terms, he is interested in the ways in which forms of surveillance and monitoring enabled by the development of new media technologies impact the realms of economics, politics, and culture. His first book, Reality TV: The Work of Being Watched (Rowman and Littlefield, 2003), explores the way in which this popular programming genre equates participation with willing submission to comprehensive monitoring. His second book, iSpy: Surveillance and Power in the Interactive Era (University Press of Kansas, 2007), considers the role of surveillance in the era of networked digital technology, and looks at the consequences for politics, policing, popular culture, and commerce. His most recent book, Infoglut: How Too Much Information Is Changing the Way We Think and Know (Routledge, 2013), examines the connections between wide-ranging sense-making strategies for an era of information overload and big data, and the new forms of control they enable.

Diane E.
Bailey is an associate professor in the School of Information at the University of Texas at Austin, where she studies technology and work in information and technical occupations. Her current research investigates engineering product design, remote occupational socialization, ICT4D projects in Latin America, and big data in medicine. With an expertise in organizational ethnography, Professor Bailey conducts primarily large-scale empirical studies, often involving multiple occupations, countries, and researchers. She publishes her research in organization studies, engineering, information studies, and communications journals. Her recent book, coauthored with Paul Leonardi, is Technology Choices: Why Occupations Differ in Their Embrace of New Technology (MIT Press, 2015). Her research has won best paper awards and an NSF CAREER award. She holds a PhD in industrial engineering and operations research from the University of California at Berkeley.

Mike Bailey is a senior economic research scientist at Facebook and heads Facebook’s economic research team, which conducts research in the areas of pricing, forecasting, macroeconomics, mechanism design, auctions, economic modeling, and simulation. His current research focuses on how one’s social network influences one’s economic decisions, including investment in education or home purchases. Bailey joined Facebook in 2011 as one of its first
economists, and built Facebook’s advertising demand estimation and forecasting system. He holds a PhD in economics from Stanford University along with a BS in mathematics and a BA in economics from Utah State University. Prior to Facebook, Bailey was a research intern at Yahoo! studying the effectiveness of targeted advertising.

Mark Burdon is a lecturer at the University of Queensland’s TC Beirne School of Law. His research interests are information privacy law and the regulation of information security. Burdon has conducted research on a diverse range of multidisciplinary projects involving the regulation of information security practices, legislative frameworks for the mandatory reporting of data breaches, data sharing in e-government information frameworks, consumer protection in e-commerce, and information protection standards for e-courts. His most recent work with Mark Andrejevic examines the sensorization of everyday devices leading to the onset of a “sensor society.”

Fred H. Cate is a Distinguished Professor and C. Ben Dutton Professor of Law at the Indiana University Maurer School of Law, a senior fellow and former director of the Indiana University Center for Applied Cybersecurity Research, and a senior policy adviser to the Centre for Information Policy Leadership at Hunton and Williams LLP. He is the author of more than 150 articles and books, and is one of the founding editors of the Oxford University Press journal International Data Privacy Law. A frequent congressional witness, he has served on many government and industry advisory boards, including the National Security Agency, Department of Homeland Security, Department of Defense, Federal Trade Commission, National Academy of Sciences, International Telecommunication Union, Organization for Economic Cooperation and Development, Microsoft, and Intel. He is a member of the American Law Institute and a fellow of the American Bar Foundation, and has appeared in Computerworld’s listing of best privacy advisers.
Jorge L. Contreras is an associate professor at the University of Utah S.J. Quinney College of Law and an adjunct associate professor in the Department of Human Genetics. He has written and spoken extensively on the institutional and intellectual property structures of biomedical research and data, and his published work has appeared in numerous scientific, legal, and policy books as well as journals. Professor Contreras currently serves as a member of the Advisory Council of the National Center for the Advancement of Translational Sciences at the US National Institutes of Health. He has previously served as cochair of the National Conference of Lawyers and Scientists, a member of the National Academy of Sciences Committee on Intellectual Property Management in Standard-Setting Processes, and a member of the National Advisory Council for Human Genome Research at the National Institutes of Health. He is a cum laude graduate of Harvard Law School (JD) and Rice University (BA and BSEE).

Simon DeDeo is an assistant professor in complex systems in the School of Informatics and Computing at Indiana University at Bloomington, a member of Indiana University
Bloomington’s Cognitive Science Program, and external faculty at the Santa Fe Institute. His Collaborative Lab for Social Minds draws on a wide range of archives, ranging from speeches during the French Revolution to arguments on Wikipedia, and tools from the information and cognitive sciences, to understand the relationship between individual decision making and the logic of society as a whole.

Allison Goodwell is a PhD student in the environmental hydrology and hydraulics program in the Department of Civil and Environmental Engineering at the University of Illinois, working with Professor Praveen Kumar. She received her BS in civil engineering from Purdue University in 2010, and her MS in civil and environmental engineering at Illinois in 2013. Her research interests include ecohydrology, information theory, and remote sensing. Specifically, Goodwell’s research involves using information theory and process networks to characterize how ecosystems respond to extreme events, climate change, or other perturbations. She is a former president of the student chapter of the International Water Resources Association and remains active in the student organizations of the hydrosystems laboratory.

Jannis Kallinikos studies how technological information and other codified forms of information exchange and communication are involved in the making of institutional patterns and relations. His research draws on several social science fields, including organization theory, sociology, communication, and semiotics and information studies. Of late, and due to the developments that characterize our age, he has become increasingly interested in social media platforms along with their social and institutional effects.
His recent publications include The Consequences of Information: Institutional Implications of Technological Change (Elgar, 2007), Governing through Technology: Information Artifacts and Social Practice (Palgrave, 2011), and Materiality and Organizing: Social Interaction in a Technological World (Oxford University Press, 2012) (coedited with Paul Leonardi and Bonnie Nardi).

Inna Kouper is a research scientist and assistant director of the Data to Insight Center at Indiana University at Bloomington. Her research interests focus broadly on the material, technological, and cultural configurations that facilitate knowledge production and dissemination, with a particular emphasis on research data practices as well as the sociotechnical approaches to cyberinfrastructure and the stewardship of data. Dr. Kouper has a PhD in information science from the School of Library and Information Science at Indiana University at Bloomington, and a PhD in sociology from the Institute of Sociology, Russian Academy of Sciences, Moscow. Kouper is involved in the Sustainable Environment Actionable Data project, funded by the National Science Foundation, and is cochair of the Research Data Alliance Engagement Interest Group.

M. Lynne Markus is the John W. Poduska, Sr. Professor of Information and Process Management at Bentley University, a visiting professor at the London School of Economics, and a research affiliate at MIT Sloan’s Center for Information Systems Research. She was the
principal investigator of a National Science Foundation workshop award to develop a research agenda on big data’s social, economic, and workforce consequences, and participated in the White House’s ninety-day review of big data. Markus was named a Fellow of the Association for Information Systems in 2004 and received the AIS LEO Award for Exceptional Lifetime Achievement in Information Systems in 2008.

Paul Ohm is a professor of law at the Georgetown University Law Center. He specializes in information privacy, computer crime law, intellectual property, and criminal procedure. In his work, Professor Ohm tries to build new interdisciplinary bridges between law and computer science. From 2012 to 2013, Ohm served as a senior policy adviser to the Federal Trade Commission. Before becoming a law professor, he served as a computer crime prosecutor and professional computer programmer.

Scott Peppet is a professor of law at the University of Colorado Law School, where he researches the impact of technology on markets. His work in this area has focused on the problem of self-disclosure and stigma that can attach to privacy (“Unraveling Privacy,” Northwestern University Law Review, 2011), the impact of new technologies on consumer contract law (“Freedom of Contract in an Augmented Reality,” UCLA Law Review, 2012), the ways in which technology is changing prostitution markets (“Prostitution 3.0,” Iowa Law Review, 2013), and the privacy problems inherent in the Internet of Things (“Regulating the Internet of Things,” Texas Law Review, 2014). Peppet’s work has been featured in such publications as the New York Times, Wall Street Journal, NPR, Slate, Wired, and the Atlantic as well as referenced by the Federal Trade Commission’s 2015 report on the Internet of Things.

Beth Plale is a full professor of informatics and computing at Indiana University, where she directs the Data to Insight Center and is science director of the Pervasive Technology Institute. Dr.
Plale’s research interests are in the long-term preservation of scientific and scholarly data, data analytics, tools for metadata and provenance capture, data repositories, and data-driven cyberinfrastructure. Plale is deeply engaged in interdisciplinary research and education. Her postdoctoral studies were at Georgia Institute of Technology, and her PhD in computer science was from the State University of New York at Binghamton. Plale is founder and codirector of the HathiTrust Research Center, which provides computational access to thirteen million digitized books from research libraries; cochair of the Research Data Alliance Technical Advisory Board; and a member of the steering committee of RDA/US. She is a Department of Energy Early Career Awardee and past fellow of the university consortium Academic Leadership Program.

Jason Portenoy is a PhD student studying data science at the University of Washington Information School. He holds a BS in neuroscience from Brown University, and worked in biomedical research and data management before joining the University of Washington in 2013. His research includes studying online communities and education trends using quantitative methods and visualizations. He is a member of the University of Washington’s
DataLab and an active volunteer in several organizations that teach data science skills to beginners.

Julie Rennecker is the founder of the Management Doc, LLC, a research and consulting firm based in Austin, Texas, specializing in the people and process implications of new health care technologies. She works with both technology vendors on the design of new health care products and health care facilities on process redesign to best integrate these products into patient care routines. Rennecker worked as an RN in intensive care, emergency room, and cardiac monitoring units before earning her PhD in organizational behavior from the MIT Sloan School of Management. She then served on the information science faculty at Case Western Reserve University, where she conducted ethnographic studies of individual responses to and unforeseen impacts of technology-enabled work practices. Her research has been published in information and technology journals, books, and conference proceedings.

Katie Shilton is an assistant professor in the College of Information Studies at the University of Maryland at College Park. Her research explores ethics and policy for the design of information collections, systems, and technologies. Shilton’s current research includes an investigation of ethics in mobile application development; a project focused on the values and policy implications of named data networking, a new approach to Internet architecture; an exploration of ethics and trust in lab-on-chip technologies for personal biometrics; and surveys of consumer privacy expectations in the mobile data ecosystem. She received a BA from Oberlin College, and a master of library and information science as well as a PhD in information studies from the University of California at Los Angeles.

Dan Sholler is a doctoral student in the School of Information at the University of Texas at Austin.
He studies information and technical work, focusing on large-scale technology implementations and accompanying changes to work practices. His research uses ethnographic techniques and survey methods to understand how practitioners’ work is affected by information communication technology implementations across a variety of organizational settings. Sholler’s current research on big data in medicine investigates the implications of electronic health record and clinical decision support system implementations for health care practitioners. He has also conducted studies of information professionals and their work, including data scientists and records managers. Sholler publishes his research in organization science and information science outlets. He is a member of the Information Work Research Group at the University of Texas at Austin, and holds a BA in science, technology, and society from the University of Pennsylvania.

Isuru Suriarachchi is a PhD student in computer science at Indiana University. His research interests are in data provenance, research data preservation, and distributed computing. Suriarachchi is working as a research assistant at the Data to Insight Center at Indiana University, and currently contributes to the Sustainable Environment Actionable Data project and
the Komadu provenance system. He has more than four years of industry experience in service-oriented architecture middleware-related technologies. In addition, he is an Apache committer and Project Management Committee member in the Apache Web Services project.

Jevin D. West is an assistant professor at the University of Washington Information School and a data science fellow at the eScience Institute. He is also an affiliate faculty member for the Center for Statistics and Social Sciences at the University of Washington. His research lies at the intersection of network science, scholarly communication, knowledge organization, and information visualization. Generally, he looks at how human communication systems evolve over time. West coruns the DataLab at the University of Washington and is cofounder of—a free website that ranks and maps scholarly literature in order to better navigate and understand scientific knowledge. Prior to joining the faculty at the University of Washington, he was a postdoc in the Department of Physics at Umeå University in Sweden and received his PhD in biology from the University of Washington.


23andMe, 189, 218
1000 Genomes Project, 95, 101
Abbott, Ryan, 154, 172, 176, 187–190, 192–196, 198, 200, 202–204, 210, 217, 220
A/B testing, 169
Accenture, 146
Access to data, xviii, xxi, 3, 6, 16, 21, 22, 24, 25, 29–31, 33, 34, 61, 62, 64, 71, 72, 74, 98, 99, 102, 105, 110, 116, 121, 145, 146, 156, 184, 189, 201, 203, 205, 206, 217, 218
Acquisti, Alessandro, 10
Acxiom, 3, 146
AdSense, 124
Alaimo, Cristina, xii, xviii, 24, 35, 77, 78, 80, 82, 84–89, 92, 106, 183, 200, 202–204, 207
Alfred P. Sloan Foundation, 137
Algorithm(s), xiii, xv, xviii, 18, 31–40, 44, 50, 52, 53, 61, 74, 79, 85, 123–125, 142, 154, 159, 164–170, 191, 199, 203, 213, 216, 218
Amazon, xviii, 97, 101, 146, 164, 197
American Statistical Association, 149, 150
Americans with Disabilities Act, 56
Analytics, 45, 46, 53, 55, 62–64, 66, 68, 76, 92, 130–133, 135, 137, 143, 144, 146–149, 151, 155, 157–160, 162, 163, 176, 178, 215
  business, 131–133
  social data, 79
Anderson, Chris, 120
Anderson, Kent, xviii, 117, 119, 121, 123, 125, 127, 168
Andrejevic, Mark, xiii, xv, xviii, 3, 24, 46, 52, 61–66, 68, 70–72, 74, 106, 143, 183, 200, 201, 203, 205–207
Anonymized data, 9, 27, 55, 98, 144
Antidiscrimination, 45, 48, 49, 55–58, 202, 206
  law, 48, 49, 55–58
APIs, 87, 88, 214
Apophenia, 209
Apps, 23, 49, 123, 145, 155
Artificial intelligence (AI), 37, 45, 164, 168, 172, 186, 187, 189, 193, 194, 204, 210, 217, 219, 220
  Good Old-Fashioned Artificial Intelligence (GOFAI), 37
Ash, Joan, 177–182
Association for Molecular Pathology, 191, 219
Astroturf, 124
Automation of labor, 63, 135, 169, 170, 209, 210
Bailey, Diane E., xiv, 173, 175, 177, 179, 181, 183
Bailey, Michael, xix, 163, 165, 167, 169
Bamberger, Kenneth A., 147, 148, 154, 156, 160
Bank(s), xiii, 56, 93, 97, 101, 143–147, 153, 155, 158, 159, 224
BBC, xiv
Beer, David, 79
Bermuda Principles, 99–101
Big Blue, 121
Black magic, 35
Bloomberg, 153
Botox, 222
boyd, danah, xiv, 29, 67, 78, 79, 201, 209, 214
Bozeman, Barry, 94
Broker(s), 143–146, 154–156
Burdon, Mark, xiii, xv, xviii, 3, 24, 46, 52, 61–64, 66–68, 70–72, 74, 106, 143, 183, 200, 201, 203, 205–207
Business education, 151
Business process outsourcing, 146
Callebaut, Werner, xii
Cate, Fred H., xv, 3, 4, 6, 8–10, 12, 14–16, 18, 40, 68, 71, 76, 138, 144, 155, 200, 201, 203–207
Celera Genomics, 95, 99
Centers for Disease Control (CDC), 118, 124, 189
Chain-link confidentiality, 144
Chakrabarti, Soumen, xix, 217
Chernoff-Stein Lemma, 39
Chief information security officer, 154, 155
Chief privacy officer, 154, 157
Children’s Online Privacy Protection Act (COPPA), 152
Civil Rights Act, 32
Client(s), 145, 148–151, 156, 160
Clinical Decision Support Systems, xiv, 174–184, 186
Clinical use of big data, xiii, 28, 98, 154, 173–177, 179, 184, 188–191, 195, 200, 218
Cloud computing, 14, 64, 133, 146
Cognition, 200, 203, 205
  cognition-oriented perspective, 202, 203, 205
  inadequacy premise, 203, 205
Cohen, Julie E., 23, 28, 70, 150
Collins, Francis, 96, 98, 99
Communication(s), xviii, 16, 61, 70, 77–81, 83, 84, 87–89, 92, 131
Compliance with regulations, 9, 13, 14, 146–148, 152–158, 160, 177
Computation, xviii, 44, 48, 85, 88, 105, 106, 125, 131, 133, 135, 137, 163, 165, 172, 174, 187–189, 192–198, 200, 203, 204, 207, 210, 217, 220
  computational biology, 125
  computational creativity, 187, 189, 197, 203, 204, 210
  computational invention, 187, 188, 192–198, 200, 220
  computational science(s), 106, 135
Computed sociality, xiv, 85, 204, 214
Computer science, 22, 32, 133–135, 137, 149, 198, 220
Computer vision, 217
Consciousness, 220
Consent, informed, xv, xx, 4–18, 20, 26, 27, 29, 65, 66, 68, 70, 71, 73, 98, 192, 200, 201
Constitution, 88, 187, 195, 214, 217
Consulting, 133, 146, 151, 154, 173
Contract(s), xv, 6, 34, 145–147, 152, 154–156, 172, 210, 216, 220
Contreras, Jorge L., xii, xv, xix, 93, 94, 96, 98, 100–102, 106, 173, 201, 208
Copyright, 10, 99, 187, 194, 220, 221
Corporate
  culture, 156
  data, 25, 62, 142–144, 146, 148, 150, 152, 154, 156, 158, 160, 200
  data responsibility, 142–144, 146, 148, 150, 152, 154, 156, 158, 160
Coursera, 134, 215
Crawford, Kate, xiv, 29, 144, 155, 201, 209
Culnan, Mary J., 144, 148, 156
Curation of data, xii, 81, 101, 107–115, 117, 121, 122, 164
Curation Object (CO), 112–114, 153, 221, 224
Customer(s), 24, 46, 54, 132, 143–147, 149, 150, 153, 158–160, 162–164, 166, 167, 172
Cybersecurity, 152, 159
  cyberattack(s), 157
Dase, Martin Zacharias, xii
Data
  aggregator, 144, 145
  agreements, 155
  architecture, 82, 88, 158, 159
  backup, 146
  behavioral, 80
  breach(es), 109, 146, 147, 157–159
  broker(s), 143, 146, 154–156
  Carpentry, 134
  clean(ed), xiv, 39, 106, 137
  contract(s), 155
  controllers, 13, 152
  curation, 107, 108, 121
  data-driven, xii, xx, 29, 62, 68, 93, 94, 137, 163–165, 170, 172
  data-handling, 145, 153
  Data Protection Working Party, 6, 8, 9
  data-sharing, 110, 126, 127, 144
  diversity, 23, 47, 86, 201
  divide, xxi
  harm(s), xiii, 12–14, 16–19, 20, 28–29, 33, 54, 58, 144, 149, 155, 204, 207, 209, 210, 219
  integrity, 190, 218, 219
  management, 114, 143, 145, 153, 155, 157–160, 201, 210
    disciplines, 158–160
    regulations, 153, 160
  multidimensional, 121
  patchwork, xv, 146, 156, 180
  personal, xiv, xv, 3–6, 8, 9, 11, 12, 14, 15, 18, 21, 22, 24, 25, 27–30, 61, 62, 65, 66, 68–73, 76, 144–146, 150–153, 156, 157, 160, 201, 204–206
  phenotypic, 98, 101
  political, xi, xiv, xviii–xix, 31, 32, 36, 44, 61, 69, 118, 124, 157, 168, 199, 204–207
  profile, 80, 82, 84
  propagation of, 32, 36, 37
  protected, 148
  protection(s), xiii, xviii, 4–10, 12–18, 29, 143–149, 151–160, 162
    act(s), 152, 157
    agreement(s), xviii, 146
    database protection, 99
    laws, xviii, 4, 7, 9, 16, 145, 147, 149, 151–154
  provenance, 106, 107, 110, 115, 208
  quality, 6, 116, 159
  responsibility, 142–144, 146, 148, 150, 152, 154, 156, 158, 160
  reuse, 26, 27, 107, 110, 111, 115, 116, 158
  revolution, 138, 163, 164, 172, 173
  science, xviii, 36, 40, 49, 92, 129–138, 142, 148–151, 201, 208, 209, 215
    Community Data Science Workshops, 135, 136
    Data Science Association, 150
    Data Science for Social Good, 138, 142
  scientist(s), 10, 44, 58, 129, 130, 132–136, 138, 142, 145, 148–151, 154, 156, 159, 160, 163, 169, 202, 204, 207
  security, 146, 152, 153, 158
  sharing, xiii, xiv, 16, 21–23, 26, 27, 29, 53, 71, 74, 77–82, 84, 86, 87, 98–100, 105, 108, 110, 111, 116, 118, 125, 126, 147, 158, 162, 164, 169, 180, 182, 183, 200, 201, 204, 208, 209, 214
  small, 47, 121, 122, 127, 168, 169, 202, 208
  standardization, 153
  standards, 13, 67, 84, 86, 87, 92, 111, 112, 122, 134, 145, 152, 153, 156, 175, 177, 180, 182, 190, 210
  stewardship, 12, 20, 144, 207
  storage, 26, 27, 126, 146, 163, 172
  subjects, 4, 7, 14, 22, 26, 27, 29, 68, 98, 152, 183, 206
  supply chain(s), 145, 146, 152, 155, 156, 160
  suppression, 118
  unstructured, 80, 133, 215
  use(s), 8, 11–17, 20, 27, 28, 30, 63, 70, 80, 108, 120, 143–145, 147–152, 155, 158, 160, 178, 207, 216
  warehousing, 158
Database, 48, 74, 88, 89, 93, 95, 97–99, 101, 102, 121, 122, 177–179, 193, 217–219
  database-sorting, 218
  Database of Genotypes and Phenotypes, 97, 98, 102
Davenport, Thomas H., 92, 130, 138
Decision making, 14, 15, 31–36, 40, 44, 62, 76, 125, 147, 155, 162, 164–166, 168–170, 172, 174, 175, 178, 183, 202, 203, 205, 206, 208–210
DeDeo, Simon, xiii, xv, 31, 32, 34, 36, 38, 40, 48, 52, 125, 142, 154, 201–204, 206, 208, 213
Deep learning, 37, 164, 213
Deloitte, 154
Department of Health and Human Services, 181, 217
Desrosières, Alain, 206, 209, 210
Dilemma(s), xii, xiv, xix, 22, 27, 28, 149–151, 181, 206, 207, 210
Disaster recovery, 146
Discrimination, xv, 17, 29, 30, 32, 35, 36, 44, 49, 56, 57, 60, 98, 153, 155, 202, 208
Diversity of data, 23, 47, 86, 201
DNA, 47, 93–102, 106, 218, 219
Dodd-Frank, 153
Due process, 144, 155, 156
Dwork, Cynthia, 10, 29, 58
Dwoskin, Elizabeth, 69, 136, 138
Earth sciences, 125
eBay, 170
Economic(s), xviii–xix, 57, 86, 88, 92, 94, 130, 138, 207, 209, 210
  recession, xix
Education, xviii, xix, 32, 56, 76, 129–131, 133–138, 142, 148–152, 202, 205, 207, 215, 220
  business, 151
Ekbia, Hamid R., xi, xii, xiv, xviii, xix, xx, 21, 29, 78, 86, 107, 181, 198–200, 202, 204–206, 208, 210, 219
Electronic medical records (EMR), 173–180, 182, 183, 190
Embeddedness, xiii, 61, 62, 64, 66, 68, 69, 71, 73, 76, 83, 85, 168, 207, 214
Empiricism, xii, 105, 125–127, 195
Employee(s), 8, 46, 53, 56, 57, 143–148, 150, 153, 156–158, 160, 162, 167, 172, 193, 210, 220
Encoding data, 77–89


End-to-end governance, 156
Engineering, xii, xiv, xix, 36, 40, 133, 135, 138, 148, 149
Enterprise(s), xviii, xx, 79, 146, 147, 152, 160, 172
Entropy, 47, 55
Epidemiology, 124
Epistemology, xii
eScience Institute, 137
Ethics, xii, xv, xviii, xix, 22, 27, 28, 30–32, 34–36, 38, 44, 99, 102, 138, 142, 144, 148–151, 156, 160, 167, 172, 200, 202–207, 209, 210, 213
  code of, 149, 151
  Committee on Professional Ethics, 150
  of computing, 149
  ethical challenges, xviii, 22, 149
  ethical hacking, 148
  ethical principles, 28, 148, 156
  research, 22, 27, 28, 30
European Commission, 6
European Parliament, 5, 13
European Union (EU), 5, 6, 13, 152, 156
Evidence-based medicine (EBM), 174–176, 181, 184
Experiment(s), 23, 40, 45, 105, 127, 138, 142, 144, 152, 169, 170, 202
  experimentation, 144, 164, 169, 170, 188, 192, 221
Exponential growth, xxi, 86, 106, 122, 171, 204
Facial recognition software, 47, 74, 144
Fair Credit Reporting Act, 49, 54, 55
Fairness (ethical), xviii, 17, 35, 36, 40, 44, 71, 154, 195
Federal Trade Commission (FTC), 4, 6, 8, 10, 14, 49
FICO score, 49, 50
FiveThirtyEight, 121
Foucault, Michel, xix, 89
Fox News, 222
Frameworks, xiv, xv, 37, 71, 145, 152, 156, 208
Fraud, 144


Frias-Martinez, Enrique, 24, 25
Friedman, Batya, 145, 149, 159
Fuller, Matthew, 89
GenBank, 93, 95–97, 101, 102
Gender as a variable, 9, 49, 56, 57, 80, 84, 126, 153
Genetic Information Nondiscrimination Act, 29, 56, 98, 215
Genetic programming, 193
Genome, xii, 55, 93–102, 104, 173, 189, 200, 201, 208, 218
  commons, 96, 99, 100, 102, 104
  Genome-Wide Association Studies, 96–98
  Genomic Data Sharing Policy, 98, 100
  Human Genome Project, 94, 95
Gitelman, Lisa, 208
Gold rush, xviii, 128, 130, 132, 134, 136, 138, 142
Goodwell, Allison, 105, 106, 108–110, 112, 114, 116
Gordon and Betty Moore Foundation, 129, 137
Governance, xii, 12, 104, 143, 152, 153, 156, 157, 160
  data, xiii, 143, 160
  risk and compliance, 153, 154, 158
Government(s), xiii, xiv, xviii, xix, xx, xxi, 3, 7, 12, 14, 18, 34, 40, 93–95, 101, 124, 129, 138, 142, 143, 154, 160, 163, 173, 178–181, 187, 206, 216
Gramm-Leach-Bliley, 8, 54, 152
Group-dependent outcomes, 33, 38–40
Hackers, 148
Hacker Within, 135
Harm(s), data, xiii, 12–14, 16–19, 20, 28–29, 33, 54, 58, 144, 149, 155, 204, 207, 209, 210, 219
Health,
  care, 25, 29, 33, 76, 106, 133, 143, 151, 152, 155, 157, 173–175, 178, 181, 183, 184, 212, 217, 218
  insurance, 98, 144, 175
Health Insurance Portability and Accountability Act (HIPAA), 54, 55, 152


Heterogeneity, 86, 107, 209
Hewlett-Packard, 3
Hofstadter, Douglas, xi, xii
Howe, Bill, 133
Human authorship requirement, 194
Human Genome Project, 94, 95
Humanities, 125, 133
Iacono, Suzanne, xix, 21
Identity theft, 17, 158
Illusion of choice, xviii, 8
Information,
  systems, 143, 149, 151, 152
  technology, 96, 157, 158, 173, 175, 178
  theory, 38, 47, 48, 213
Informed consent, xv, xx, 4–18, 20, 26, 27, 29, 65, 66, 68, 70, 71, 73, 98, 192, 200, 201
Infrastructure(s), xiii, xiv, xix, xx, 24, 25, 62, 64, 68, 71, 76, 82, 85, 87–89, 142, 149, 204, 205, 207, 208
Insider trading, 153
Institutional Review Board (IRB), 118
Insurance, xiii, 51–54, 57, 98, 143–145, 147, 153, 155, 175
  brokerages, 145
  company, 145, 147
Intellectual property, xviii, 151, 172, 188, 191, 195, 212, 220
Intelligence, 37, 64, 131, 158, 163, 164, 168, 172, 187, 189, 193, 194, 204, 205, 210, 217, 219, 220
Interdisciplinary,
  data science programs, 137, 149
  nature of the big data field, 133
  teams, 106, 115
International Organization for Standardization, 13
International Serious Adverse Events Consortium, 96, 97
Internet of Things, 3, 26, 46, 51
Invention(s), 100, 187, 188, 191–198, 200, 217, 220–222, 224


Jasny, Barbara, 98, 99
Journal of the American Medical Association (JAMA), 118–120
JPMorgan Chase, 153
Kallinikos, Jannis, xii, xviii, 24, 35, 77, 78, 80, 82–86, 88, 92, 106, 183, 200, 202–204, 207
Kling, Rob, xix, 21
Kouper, Inna, 105, 106, 108, 110, 112, 114, 116
Kuhn, Thomas, xii
Kullback-Leibler divergence, 39, 213
Lane, Julia, 6, 23
Langlois, Ganaele, 85
Larivière, Vincent, 208
Law(s) for data, xii, xiv, xv, xviii, xix, xx, 4–7, 9–11, 13, 14, 16, 17, 44, 45, 47–49, 54–58, 65, 66, 69, 71–73, 99, 100, 102, 105, 106, 132, 143–157, 166, 170, 172, 179–181, 187, 191, 194, 196, 197, 201, 202, 205–208, 210, 213, 217, 220, 221, 224
Legislation, 6, 8, 109, 144, 172, 215
Li, Ian, 21, 22
Live Object (LO), 112–115
Love, Dylan, 87, 219
Machine learning, xix, 31, 33, 35, 37, 40, 45, 48, 61, 133, 134, 136, 137, 149, 168, 183, 189, 203, 213, 216, 217
Management science, 149, 151
Manovich, Lev, 89
Marketing, xx, 7, 76, 124, 131, 133, 144, 145, 147, 158, 188, 191, 195
Markus, M. Lynne, xv, 143, 144, 146–148, 150, 152–154, 156, 158, 160, 162, 200–202, 204, 207, 209, 210, 216
Matching in data processing, 18, 123, 144, 148, 159
Mathematics, xviii, 32, 37, 38, 44, 82, 92, 105, 106, 122, 125, 149, 204, 207, 224
Mattioli, Michael, xi, xii, xiv, xviii, xx, 46, 101, 198–200, 202, 204, 206, 208, 210


McKinsey Global Institute, 129
McKinsey report, 129, 130, 142
Media, xii, xiii, xiv, xv, xviii, xix, 3, 65, 77–89, 92, 144, 146, 154, 200, 204, 207, 214
Medicine, xiii, xv, xix, 101, 102, 126, 132, 138, 142, 173–177, 179, 181–184, 186, 203
Metadata, 73, 79, 107, 108, 110–114, 125
Michigan Civil Rights Initiative, 39
Molecular biology, 93, 125
Money laundering, 153
MOOCs (massive open online courses), 129, 134
Moral imperative, 157
Mortgage lending, 153, 155, 158
Multidimensional data, 121
Nardi, Bonnie, xix, 181
National Center for Biotechnology Information, 93, 96, 97, 101, 102
National Institutes of Health (NIH), 95, 96, 98–102, 104, 136, 189, 215
National Research Council, 94, 100
National Science Foundation (NSF), 107, 116, 129, 137, 138, 143, 162
National security, 143, 212
Negative results, 119
Neo-Luddites, 148, 151
Netflix, 9, 164–166
Ng, Andrew, 134
Nissenbaum, Helen, 15, 26
Notice, xv, xviii, xx, 4–13, 16–18, 66, 117, 195
Null results, 119
NumFOCUS foundation, 136
Nutrition, 126
Oceanography, 125
Office of Science and Technology Policy, 4
Ohm, Paul, xiii, xv, xix, 7, 45–48, 50, 52, 54–56, 58, 71, 72, 143, 144, 180, 189, 201–204, 207, 208
Organizational data, 143–145, 147, 148, 152, 156, 159, 201
  protection policies, 148


Organization for Economic Cooperation and Development (OECD), 4, 14, 15
Ostrom, Elinor, 93
Paradigm(s), xii, 71, 100, 105, 106, 126, 175, 181, 184, 203–205, 209, 217
  fourth paradigm of science, 106, 203
Participation, xxi, 22, 25–30, 64, 77–87, 89, 204, 214
Password(s), 147
Patchwork data, xv, 146, 156, 180
Patent(s), xii, 100, 101, 154, 187, 188, 190–198, 210, 217–222, 224
  patentability, 196, 198, 210, 220
Patient, 38, 72, 120, 173–175, 177–183, 186, 191
  patient confidentiality, 120
Pearl, Judea, 37
Pearl causality, 37
Pentland, Alex, 24, 78
Peppet, Scott, xiii, xv, xix, 45, 46, 48, 50, 52, 54, 56, 58, 71, 73, 143, 144, 180, 201–204, 207, 208
Personal data, xiv, xv, 3–6, 8, 9, 11, 12, 14, 15, 18, 21, 22, 24, 25, 27–30, 61, 62, 65, 66, 68–73, 76, 144–146, 150–153, 156, 157, 160, 201, 204–206
  trail(s), 21, 22, 24, 25, 27, 28, 201
Personal informatics, xiv, 4, 5, 8–11, 16, 17, 22–27, 29, 55, 65, 66, 69, 72, 80, 82, 86, 144, 154, 155, 203, 206, 218
Personalization, 21, 25, 53, 78, 79, 84, 85, 181
  of health data, 21
  of services, 78
  of suggestions, 78, 79, 85
Pfizer, 188, 218
Pharmaceutical setting, 96, 188, 189, 194, 195, 218
Phishing, 148
Physics, 94, 101, 125, 136
Piwowar, Heather, 111
Plale, Beth, xii, 105–108, 110–112, 114, 116, 121, 135, 200, 201, 203, 204, 208
PLOS, 125, 126


Policy (also policies),
  antidiscrimination, 29, 98
  big data, xii
  big data policy debates, xi
  Centre for Information Policy Leadership, 13
  corporate, 29
  data-driven, xx, 163
  data protection, 146, 148, 159
  data release, 99–101
  data sharing, 98, 100, 125
  data use policies, 30, 148
  decisions, 38
  employee-oriented, 148
  ethics, 31, 34
  federal, 29
  frameworks, xiv, 206
  makers (policy making), 38, 39, 41, 44, 68, 69, 99, 103, 104, 106, 147, 148, 206, 211
  nonutilitarian, 196
  Office of Science and Technology Policy, 4
  open access data, 99
  organizational, 147, 156
  participatory data use, 28
  personal data, 68
  policy-relevant variables, 38
  privacy, 6–9, 12, 14, 75, 155
  research, 29, 30
  rights-oriented, 29, 206
  social data, 90
Political data, xi, xiv, xviii–xix, 31, 32, 36, 44, 61, 69, 118, 124, 157, 168, 199, 204–207
Portenoy, Jason, xix, 70, 129, 130, 132, 134, 136, 138, 149, 200, 202, 204, 207–209, 215
Post hoc analysis, 118
Prediction(s), xiv, xv, xx, 7, 36–40, 44, 45, 51, 52, 61, 72, 106, 121, 122, 129, 164, 165, 168, 172, 190, 203, 204, 209, 210
Privacy, xii, xiii, xiv, xv, xx, 4–12, 14–18, 20–22, 27, 29, 40, 45, 48, 49, 51, 54, 55, 58, 60, 62, 65, 66, 69, 71–73, 76, 92, 98, 101, 104, 124, 127, 128, 138, 143–159, 174, 175, 179–181, 183, 186, 199–202, 204, 206, 207, 212
  law, 4, 14, 48, 54, 55, 69, 71–73, 179–181


Processing, 5, 6, 9, 12–15, 18, 36, 64, 73, 74, 76, 83, 85, 107, 121, 126, 183, 201, 202, 217
Process-oriented perspective, xix, 107, 201, 202
Product-oriented perspective, xviii, 200, 201, 205
Professional association(s), 149, 151, 158, 177
Professional specialties, 152
Profit, 24, 163, 166, 167, 205
Propagation of data, 32, 36, 37
Proposition 209, 39
Protection(s), legal, xii, xv, xx, 4–18, 20, 29, 30, 49, 54, 55, 69, 71, 73, 98–101, 143–160, 162, 166, 187, 188, 190–192, 195–198, 202, 206, 207, 216, 217, 219, 220
Provenance, xii, 105–108, 110, 115, 116, 128, 196, 200, 207, 208
Publishable Object (PO), 112–115
Publishing, 100, 105, 106, 108–111, 114, 116, 118, 122, 128
PyData, 136, 216
Python, 134–137, 215
Quantified self, 21–24, 28, 201
Race, xi, 31–33, 35, 38, 39, 48, 49, 56, 57, 122, 126, 129, 142, 153, 163, 213
Rai, Arti K., 100, 101
Randomized Controlled Trials, 173–176, 182, 183
Rate manipulation, 153
Reanalysis, xv, 117–121
Redress, 17, 18, 20, 36, 146, 206, 207
RefSeq database, 102
Regulation(s), 4, 5, 9, 13–15, 17, 27, 54, 55, 65, 69, 73, 76, 144, 146–148, 152–160, 175, 179, 180, 183, 201, 207, 208, 210
Reidentification of anonymized data, xiv, 55, 144
Reliability, 50, 117, 122–124, 168
Rennecker, Julie, xiii, xviii, 11, 106, 173, 174, 176, 178, 180, 182, 184, 200, 201, 203, 205, 209, 210
Replication, 119, 124, 200
Reproducibility, 109
Research compendia, 108, 112
Research market, 146


Research Object (RO), 108, 110–115, 200, 208
Reuse of data, xiii, 11, 105, 108, 115, 135, 200, 208
Risk assessment, 13, 14, 16, 17, 144, 172
Risk management, 12–14, 17, 20, 40
Rosenblatt, Joel, 65
Sabermetrics, 127
SAP, 146
Sarbanes-Oxley, 152
Schwartz, Paul M., 5, 8, 21, 166
Sears, 169
Section 101, 196, 197, 224
Security, xix, 5, 10, 13, 18, 143, 146–148, 151–155, 158, 159, 212
Self-quantification, 21–24, 28, 201
Self-tracking, 21–23
Sensity, 64
Sensor(s), xiii, 3, 7, 11, 15, 24, 25, 46, 51, 61–65, 67–69, 71–73, 76, 106, 109, 110, 200, 201, 205, 207
  sensorization, 63, 68, 72, 73, 76
  society, xiv, 60, 62, 64, 66, 68, 70–74, 76, 205, 207
Sentiment analysis, 144, 202, 209
Sentinel Initiative, 217
Sharing of data, xiii, xiv, 16, 21–23, 26, 27, 29, 53, 71, 74, 77–82, 84, 86, 87, 98–100, 105, 108, 110, 111, 116, 118, 125, 126, 147, 158, 162, 164, 169, 180, 182, 183, 200, 201, 204, 208, 209, 214
Shilton, Katie, xiii, xviii, 21, 22, 24, 26–28, 30, 46, 62, 80, 173, 201, 203, 204, 206
Sholler, Dan, xiii, xviii, 11, 106, 173, 174, 176, 178, 180, 182, 184, 200, 201, 203, 205, 209, 210
Silver, Nate, 121, 138, 168
Skepticism, 24, 127, 128, 177, 208
Small data, 47, 121, 122, 127, 168, 169, 202, 208
SNP Consortium, 96, 97
Social,
  computed sociality, xiv, 85, 204, 214
  data, xiv, 24, 76–83, 85–89, 92, 106, 204, 214


  engineering, xv, xxi, 148
  interaction, 77, 78, 80–85, 87–89, 92, 207
  media, xiii, xiv, xv, xviii, xix, 3, 77–89, 92, 144, 146, 154, 200, 204, 207, 214
  media apparatus, 89
  movement, xix, 21, 200, 204, 206
  networks, 25, 56, 79, 214
  platformed sociality, 77, 78, 86, 87, 202, 214
  responsibility, 144
  web, 87
Social sciences, xxi, 125, 138
Software, xviii, 10, 23–25, 46, 64, 82, 106–108, 110–113, 121, 134, 135, 137, 146, 147, 153, 154, 163, 180, 188, 190–193, 195, 217–219
  Carpentry, 134, 135
Solove, Daniel, 21, 66, 70, 72, 73
Solutionism, xv, 148–150
Standard(s) for data, 13, 67, 84, 86, 87, 92, 111, 112, 122, 134, 145, 152, 153, 156, 175, 177, 180, 182, 190, 210
Statistical hypothesis testing, 16, 47, 106, 121, 122, 124, 127, 128, 213, 217
Statistical inference, xv, 36, 44, 45, 48, 49, 69, 125, 203, 205
Statistics, 96, 127, 131–133, 135, 149, 166, 206, 209, 210, 217
  statistical approaches, 33, 36, 48, 82, 93, 118, 120, 121, 124, 125, 128, 133, 136, 149, 150, 202, 205, 215
  statisticians, 40, 44, 92, 119–121, 149, 150, 204, 207
STEM (science, technology, engineering, and math) fields, 138
Student(s), role of, xv, 33, 35, 55, 129, 133, 135, 137, 138, 142, 149, 150, 213, 216
Sugimoto, Cassidy R., xi, xii, xiv, xviii, xx, 198–200, 202, 204, 206, 208–210
Suriarachchi, Isuru, 105, 106, 108, 110, 112, 114, 116
Sustainability science, 107, 108
Sustainable Environments Actionable Data (SEAD), 107, 110, 114, 116, 215
Sweeney, Latanya, 9, 123


Technological controls, 148
Technological innovation, 148
Terms and conditions, 146, 155
Terms of use, 6, 66, 76, 87, 155, 201
Theory (also theories), xii, xv, xviii, xx, 35, 38, 44, 47, 48, 65, 74, 105, 120, 128, 149, 164, 195, 196, 204, 205, 209, 213, 222
  end of, 120, 204
Title IX, 32, 217
Trademarks, 220
Training students, 130, 132, 134–136, 148, 150, 151, 179, 180, 202, 207, 209
Training the system, 168
Transparency, xiv, 12, 17, 18, 20, 40, 52, 105, 108, 122–124, 201
Trust Threads, 104–106, 108, 110, 112–116, 208
Twitter, xiv, xviii, 87, 136, 214
Unintended consequences, xviii, 29, 116–118, 120, 122, 124, 126–128, 208
United States, xviii, xix, xxi, 4, 5, 8, 32, 93–95, 100, 120, 121, 129, 149, 152, 154, 178, 192, 218, 221
US Copyright Office, 194, 221
US Department of Energy, 99
US Department of Health and Human Services, 181
User(s), xiii, xxi, 8, 10–17, 20, 26, 27, 29, 37, 46, 62–74, 76–87, 89, 99, 107, 111, 115, 116, 123, 124, 136, 144, 164, 165, 167–170, 175–177, 179–181, 183, 184, 203–205, 207, 214, 217, 218
User behavior, 80, 81, 86, 165
User-generated content, 77, 80, 82
US Food and Drug Administration (FDA), 102, 188, 190, 195, 217, 222
US Patent and Trademark Office (PTO), 100, 191, 210
Validity, xii, 115, 116, 118, 120, 150
Value alignment, 144
Vanderplas, Jake, 137


Varian, Hal R., 61, 92
Video Privacy Protection Act (VPPA), 54
Vygotsky, Lev, 214
W3C Consortium, 112
Walmart, 163, 164
Watson, 94, 189, 217
Website(s), 6, 22, 46, 79, 86, 87, 130, 136, 145, 155, 169, 215
West, Jevin D., xviii, 129, 131, 133, 135, 137
White House, 4, 5, 56
Wikipedia, 122, 136, 216
World Economic Forum, 24, 61
Wu, Felix, 29, 179, 180