The Power of Data: Data Journalism Production and Ethics Studies [1 ed.] 9781032546834, 9781032546841, 9781003426141


Table of contents:
Cover
Half Title
Title
Copyright
Dedication
Contents
List of Figures
List of Tables
List of QR Codes
Foreword
Acknowledgments
1 Data Journalism: An Area in Need of Deeper Exploration
2 Understanding Data Journalism
3 Multiple Models of Data Journalism Production
4 Insight into Reality: Data Collection and Analysis
5 Representing Reality: The Narrative of Data Journalism
6 Social Media-oriented Production of Data News
7 Ethics of Data Journalism Production
8 Big Data Journalism
Conclusion
Index


THE POWER OF DATA
DATA JOURNALISM PRODUCTION AND ETHICS STUDIES
ZHANG Chao

The Power of Data

This book is a theoretical work on data journalism production that drills down into its models, narratives, and ethics. From idea to concept and then to a widespread innovative trend, data journalism has become a new global paradigm, facilitating journalism’s transformation toward data, convergence, and intelligence. Drawing on theoretical resources from communication, narratology, ethics, management, literature and art, game studies, and data science, this book explores the cutting-edge issues in current data journalism production. It critically analyzes crucial topics, including the boundary generalization of data journalism, data science methodology, the illusion of choice in interactive narratives, the word-image relationship in data visualization, and pragmatic objectivity and transparency in production ethics. Equipped with a toolbox of classic examples of global data journalism, this book will be of great value to scholars and students of data journalism or new media, to data journalists, and to journalism professionals interested in these areas.

ZHANG Chao is Professor at the School of Culture and Communication at Shandong University, China. He specializes in data journalism, critical algorithm studies, and international communication.

The Power of Data

Data Journalism Production and Ethics Studies
ZHANG Chao

First published in English 2024 by Routledge
4 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
605 Third Avenue, New York, NY 10158
Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2024 ZHANG Chao
Translated by ZHANG Chao and JIANG Xiancheng

The right of ZHANG Chao to be identified as author of this work has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

English Version by permission of China Renmin University Press.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-1-032-54683-4 (hbk)
ISBN: 978-1-032-54684-1 (pbk)
ISBN: 978-1-003-42614-1 (ebk)
DOI: 10.4324/9781003426141

Typeset in Times New Roman by Apex CoVantage, LLC

Dedicated to my beloved mother, wife, and daughter

Contents

List of Figures viii
List of Tables ix
List of QR Codes x
Foreword xi
Acknowledgments xiv
1 Data Journalism: An Area in Need of Deeper Exploration 1
2 Understanding Data Journalism 8
3 Multiple Models of Data Journalism Production 41
4 Insight into Reality: Data Collection and Analysis 59
5 Representing Reality: The Narrative of Data Journalism 87
6 Social Media-oriented Production of Data News 134
7 Ethics of Data Journalism Production 142
8 Big Data Journalism 173
Conclusion 184
Index 186

Figures

5.1 The pure linear narrative mode and the hybrid linear narrative mode 103
7.1 Algorithmic transparency in news production 157

Tables

2.1 Comparison of computer-assisted journalism, data journalism, and computational journalism 16
4.1 Data types of Data Journalism Awards winners from 2013 to 2018 76
4.2 Data analysis methods for Data Journalism Awards winners from 2013 to 2018 77
4.3 Data processing complexity of Data Journalism Awards winners from 2013 to 2018 78
5.1 The structure of “From rainforest to your cupboard: the real story of palm oil” 105
5.2 Mechanisms of interactive narratives used in data journalism 111
5.3 Different structures of verbal narratives and pictorial narratives 119
7.1 The pragmatic objectivity of data journalism 147

QR Codes

4.1 We Crawled Cai Xukun’s Weibo and Found How Fans Control Celebrities’ Data on Social Media 60
5.1 One report, diverging perspectives 89
5.2 Track national unemployment, job gains and job losses 92
5.3 North Korea is the only country that has performed a nuclear test in the 21st century 93
5.4 The most detailed map of gay marriage in America 94
5.5 People’s republic of Bolzano 94
5.6 Where carbon emissions are greatest 95
5.7 Libor: The spider network 98
5.8 Help us map Trump world 98
5.9 Deals for developers 99
5.10 A full-text visualization of the Iraq War Logs 99
5.11 Zhou Yongkang’s associated persons and properties 99
5.12 America’s unique gun violence problem, explained in 16 maps and charts 100
5.13 Median income across the US 107
5.14 The migrant files 111
5.15 HeartSaver: An experimental news game 112
5.16 People’s republic of Bolzano 116
5.17 How China’s economic slowdown could weigh on the rest of the world 117
5.18 Why the Middle East is now a giant warzone, in one terrifying chart 118
5.19 Iraq’s bloody toll 118
5.20 The one question most Americans get wrong about college graduates 122
5.21 EU referendum: Full results and analysis 123
5.22 The myth of the criminal immigrant 126
5.23 The urban neighborhood Wal-mart: A blessing or a curse? 126
5.24 Here’s how America uses its land 127
5.25 Indonesia plane crash 127
6.1 Is the Nasdaq in another bubble? 139
6.2 Who runs China 140
7.1 This chart shows an alarming rise in Florida gun deaths after ‘Stand Your Ground’ was enacted 152
8.1 Hanergy: The 10-minute trade 174

Foreword

ZHANG Chao is an important scholar of data journalism research in China. When I saw his book The Power of Data: Data Journalism Production and Ethics Studies, I couldn’t help but think of the thought-provoking “prophecy” of Tim Berners-Lee, the founder of the World Wide Web: “Data-driven journalism is the future.”

On November 19, 2010, the British government released a wealth of data on government spending. A panel consisting of Tim Berners-Lee and Francis Maude, then Minister for the Cabinet Office, was asked: “Who would analyse such data?” Berners-Lee’s response:

The responsibility needs to be with the press. Journalists need to be data-savvy. These are the people whose jobs are to interpret what government is doing to the people. So it used to be that you would get stories by chatting to people in bars, and it still might be that you’ll do it that way sometimes. But now it’s also going to be about poring over data and equipping yourself with the tools to analyse it and picking out what’s interesting. Data-driven journalism is the future. (Arthur, 2010)

In 2010, people might more readily have agreed that “digitally-driven journalism is the future,” because digital technology was then changing the shape and culture of journalism; indeed, as the traditional media of journalism fade away, it is indisputable that the journalism of the future will be digital. The distinction between data-driven and digitally-driven is precisely what reflects Berners-Lee’s profound insight into the future of journalism: processing open data will become part of journalism’s commitment to the public.

By the time you pick up this book, more than a decade will have passed since Berners-Lee’s “prophecy.” Now that journalism is in the era of big data, data journalists deal not only with structured small data but also with semi-structured and unstructured big data. Data journalists can no longer stay away from data or merely act as “porters” relaying experts’ conclusions, as in the traditional journalism era; they should be producers of data knowledge with independent data collection, analysis, and visualization capabilities.

What are the global developments in data journalism? What are the cutting-edge issues that deserve attention? This book brings us new thoughts and inspiration with unique perspectives, novel ideas, and in-depth analyses.

The author examines data journalism production and ethics from a global perspective, showing us a macro, diverse, and cutting-edge panorama of data journalism production. The book does not limit its understanding of data journalism to China, but starts from the context in which data journalism was born and developed, and examines related issues on the basis of the true qualities of data journalism – “openness” and “science.” As a result, the book has more depth of thought than many existing data journalism studies.

ZHANG Chao has the courage to expand the theoretical boundaries of data journalism research. This book integrates theoretical resources from the fields of communication, data science, computational science, narratology, cultural studies, rhetoric, and ethics. It uses an interdisciplinary approach to raise the practical issues of data journalism production to the academic level, making an important contribution to the development of the theoretical vision of data journalism. What is remarkable is that ZHANG Chao’s research on data science methodology has largely compensated for the lack of research on this issue in current Chinese data journalism studies.

Globally, ZHANG Chao is also one of the few scholars who have conducted in-depth research on data journalism ethics. He expands our understanding of objectivity and proposes the principle of pragmatic objectivity in data journalism; he keenly captures the significance of transparency in data journalism and explores its dimensions; and he sets out principles for the use of personal data in response to its increasing prevalence. These studies are academically valuable and instructive to the industry as well.

ZHANG Chao was an early scholar in China to study data journalism from a critical perspective. Since the first data journalism paper, Data journalism: Context, genres and ideas, which he and I published in 2016, he has been advocating a critical awareness of data use, which was uncommon at the time. ZHANG Chao brings a critical perspective throughout the book, both in his exploration of data science methodology and in his use of narratological theories to examine the narrative of data journalism. As this book was being finalized, I came across some chapters of The Data Journalism Handbook 2: Towards a Critical Data Practice, which differs most from the 2012 edition in taking a critical perspective on data journalism production. This reflects the current “critical turn” in data journalism research, which I believe is a sign that data journalism research continues to deepen.

I would like to see this book meet readers around the world through Routledge and let the world know the observations and thoughts of Chinese scholars on global data journalism. I also hope that this book becomes an important part

of global data journalism academia and provides more inspiration for future data journalism studies.

ZHONG Xin
Professor and Doctoral Supervisor, School of Journalism and Communication, Renmin University of China

Reference

Arthur, C. (Ed.). (2010). Journalists of the future need data skills, says Berners-Lee. The Guardian. www.theguardian.com/technology/organgrinder/2010/nov/19/berners-lee-journalism-data

Acknowledgments

I started to study data journalism in 2014. At that time, data journalism was a buzzword that had just entered Chinese journalism and academia, but it was not as developed in China as it was in the U.K. and U.S., and Chinese journalists did not have a comprehensive understanding of it. Big data, for example, was then a hotter term than data journalism, so some journalists thought data journalism had to use big data. What is data journalism? What is its relationship with big data? My research on data journalism began with these questions, and this book marks a milestone in that research.

Thank you to Ms. ZHAI Jianghong, Vice President of the Humanities Branch of China Renmin University Press (CRUP), and Ms. GAO Ya of the International Publishing Center of CRUP for recommending this book to Routledge. Thanks to Routledge for introducing this book to the international academic community.

I am grateful to JIANG Xiancheng, a Ph.D. student at the School of Journalism and Communication, Renmin University of China, currently an exchange Ph.D. student at the Wee Kim Wee School of Communication and Information, Nanyang Technological University, for his participation in the translation of this book. Some of the chapters in this book are highly theoretical and demand a high standard of translation, and Mr. JIANG has understood and translated these parts well, reflecting his extensive knowledge and high academic level.

I would like to thank the many media workers who readily accepted in-depth interviews without ever having met me: ZHANG Zehong, YU Mingjun, GUO Junyi, Nicolas Kayser-Bril, and LIU Wen.

I am grateful to the editors of these journals for their support of my research on data journalism: LI Na of Academic Journal of Zhongzhou, ZHU Xuhong of TV Research, LIN Zhuming of China Publishing, and LÜ Xiaodong, GUO Pingping, LI Jing, ZHANG Jun, and HOU Miaomiao of Editorial Friend.

I would like to thank Prof. ZHONG Xin, Prof. PENG Lan, Prof. LIU Xiaoyan, Prof. LIU Hailong, Prof. CAI Wen, and Prof. KUANG Wenbo of the School of Journalism and Communication, Renmin University of China for their academic guidance. I would like to express my gratitude to Prof. ZHONG Xin, DING Yuanyuan, and SHAN Xuemeng for collaborating with me on some of my academic research.

Thanks to Prof. ZHOU Yi and Prof. ZHANG Hongjun of Shandong University for their selfless help and support. Thanks to my students QIAO Diyu, QIN Yan, CHEN Sha, and WANG Yonghua, who helped me with some of the routine work of publication.

Thanks to the School of Culture and Communication of Shandong University for providing financial support for the publication of this book’s English version.

Thank you to my family. I love you all.

1

Data Journalism
An Area in Need of Deeper Exploration

This is a data-driven era. Since The Guardian established the world’s first data newsroom in March 2009, data journalism has spread rapidly around the world. Data journalism is the inevitable result of the full penetration of data technology into journalism and an important turning point in the evolution of journalism (Chen, 2016); it is seen as a new paradigm shift (Zhang & Zhong, 2016). According to Tim Berners-Lee, founder of the World Wide Web, “Data-driven journalism is the future.”

It is also a time when journalism is being challenged by the ceding of media gatekeeping, the questioning of journalistic legitimacy, and the “popularization” of professional skills. According to Alan Rusbridger, editor-in-chief of The Guardian, “Data is not only a new product of the information age, it is also at the heart of changes in the digital industry, finance, and business. It’s more like a collection of truths and facts” (Rogers, 2015, p. 1). The emergence of data journalism has shown journalism an opportunity to stand out: to become a bridge between data and the public through the collection, processing, and presentation of data; and to regain legitimacy and professionalism by reprocessing data to gain insight into social reality. Data journalism was born at the right time. Although there are no exact figures on how many news organizations around the world are involved in data journalism today, we know that The Sigma Awards received 603 entries from 379 organizations in 76 countries and areas in 2022, which reflects journalists’ high interest in data journalism.

Since 2006, when Adrian Holovaty, founder of EveryBlock, introduced the idea of data journalism in his article A Fundamental Way Newspaper Sites Need to Change, data journalism has passed its first decade (Kayser-Bril, 2016). In ten years, data journalism has gone from idea to concept and then to widespread recognition in the news industry. Data journalism is gradually developing its own paradigm and its own “ethos.” Correspondingly, it has gradually become an important topic in international journalism and communication research, in which two orientations have emerged: theoretical and practical. The theoretical orientation aims to theorize data journalism research and explore various theoretical issues in data journalism production; the practical orientation aims to summarize the practical experience of data journalism production and provide references for the industry. This book chooses a theoretical perspective and elevates the practical issues of data journalism to the academic level, with a view to combining theory and practice.

Origin and Significance of the Research

Robert M. Entman (1993) described “framing” as a “fractured paradigm” owing to its divergent research paths. According to our observation, the international discussion of data journalism in recent years has also been in a “fractured” state. The multiple voices in the field of data journalism production and research need to be systematized, and many of the current issues facing the field require active responses.

Origin of the Research

The Unclear Boundaries of Data Journalism

“Data journalism remains a new and vaguely defined practice” (Kayser-Bril et al., 2016). The misconception that any news containing data counts as data news is widespread, and many media outlets even disguise (or misidentify) traditional graphic news as data news. CCTV (China Central Television) once launched a series of news programs including “Data about the Two Sessions” and “Data about Spring Festival,” and some Chinese local TV stations imitated CCTV with their own “Data about” series, which in fact merely report the news with numbers. Some media also make charts of important data from authoritative reports and call them “a chart to understand.” Such so-called data news pieces contain only simple text and charts and cannot genuinely be regarded as data news. Actually, “Data journalism is 80% perspiration, 10% great idea, 10% output” (Rogers, 2011).

Some people’s oversimplification of the definition of data journalism has led to unclear boundaries between it and traditional journalism (Fang & Gao, 2015). Others believe that data journalism is necessarily marked by big data processing (Coddington, 2015). However, even leading media outlets such as The Guardian and The New York Times rarely touch on big data news production, so this overemphasis on big data also deviates from the original meaning of data journalism. In Western journalism, there are also concepts related to data journalism, such as precision journalism, computer-assisted reporting, and computational journalism, and the boundaries and relationships of these concepts are worth studying.

The “Fractured” Perception of Data Journalism

Data journalism has its own style. Some might say that the style of data journalism is scientific because it emphasizes the use of quantitative methods. Indeed, data journalism has this characteristic, but so do precision journalism and computer-assisted reporting.

When we discuss data journalism today, we are not trying to invent a myth that data journalism has always existed but has simply never been called by that name

before. In fact, the rise of contemporary data journalism has a deep political motivation: the open data movement, which is essentially a movement for information democracy. Advocates of the open data movement have been working with the media to promote the development of data journalism. However, data journalism has undergone an interesting “localization” as it has entered other parts of the world, because similar motivations are lacking elsewhere. In Europe and the United States, data journalism is commonly used to cover political issues. In countries such as Malaysia, data journalism has developed only patchily (Winkelmann, 2013). In China, data journalism is mostly regarded as a tool for journalistic innovation (Fang et al., 2016).

There is also a view that data journalism is objective, neutral, and authoritative because it expresses itself through “data.” In fact, however, fake news and bad news exist in data journalism as well. The objectivity, neutrality, and authority of data journalism can only be achieved by following production standards.

Data Journalism as an Emerging Field

From concept to practice, data journalism has evolved over the decade, and many issues surrounding it have been discussed and debated in industry and academia. Because data journalism is an emerging genre of journalism, some of its questions cannot be answered by previous theories and methods. For example, traditional narrative approaches are limited in their interpretation of data journalism (Fang, 2015, p. 43). Data journalism provides researchers with many pressing questions and further possibilities to be explored.

Significance of the Research

Necessity of Clarifying the Boundaries of Data Journalism

The current understanding of data journalism is not uniform, and some accounts have even distorted its connotation. What exactly is data journalism? Why is it a new genre of journalism? How can it be developed? Answering these questions presupposes going back to the original context of data journalism and clarifying its boundaries with other types of journalism. Unclear boundaries create problems for the understanding, operation, and sustainability of data journalism. In fact, some works are “data” journalism in name only and do not embody its substance. It is therefore essential to clarify the boundaries of data journalism.

Importance of Critically Examining Data Journalism

Data journalism has quickly become a trend in the industry in just a few years, a development that carries the logic of techno-optimism. Some believe that data journalism can achieve objectivity in news reporting through “technology-neutral” and “objective” data, and that the analysis of data (especially big data) can reveal the truth behind the data. However, the essence of data journalism is the construction of discourse. In a sense, data journalism “constructs” reality with data visualization, rather than

“reflecting” it in full sincerity. Therefore, this book aims to explore the deeper mechanisms of data reuse and data visualization by introducing some concepts from data science, and to examine data journalism production in a multidisciplinary, multi-perspective manner.

Significance of Introducing New Trends in Global Data Journalism

In all honesty, there is a large gap between the vision, development, and innovation of data journalism in some Chinese media and in the leading media of the U.K. and the U.S. Some Chinese media outlets have gone against the original basic ideas of data journalism; while they have “shown some professionalism,” they have not shown enough “innovation” (Fang & Gao, 2015). This book therefore focuses on new ideas and practices in global data journalism, aiming to bring new insights to Chinese scholars and journalists.

Exploring the Sustainability of Data Journalism

The authority of data journalism lies not in the name of “data” but in a series of professional norms and ethical requirements in its production. The neglect of professional norms is gradually eroding the credibility and professional value of the data journalism field (Fang & Gao, 2015). It is imperative for data journalism to “establish a realistic and systematic set of professional norms” so that data journalists and interested readers can “understand its basic concepts and rules” (Fang & Gao, 2015). This book will attempt to propose a feasible set of ethical norms for data journalism production.

Innovations of the Book

New Perspectives

Current research on data journalism suffers from an emphasis on practice and experience and a disregard for theory and reflection. This book uses theoretical resources from data science, narratology, rhetoric, ethics, and other disciplines to examine data journalism from an interdisciplinary, multidimensional perspective and to analyze its related issues reflexively.

New Issues

There are some major issues in data journalism research that have rarely been explored. For example, data reuse is a core aspect of data journalism, but related research is weak. The importance of data journalism ethics has become increasingly prominent in recent years, but the current related research lacks breadth and depth. While everyone emphasizes the narrative of data journalism, few scholars have actually analyzed it in depth based on narratological theories. There are many similar issues, and this book will explore these important issues in depth and expand the understanding of data journalism.

New Findings

Many people believe that data journalism is a product of the big data era, but we believe there is no direct relationship between the births of the two. Big data has contributed to the global spread of data journalism, but the “data” of big data and the “data” of data journalism are not the same. The global spread of data journalism was not spontaneous within the media but occurred in a symbiotic relationship with the open data movement. Data journalists embrace data journalism out of two different motivations, “journalistic responsibility” and “journalistic innovation,” resulting in different understandings of data among journalists in different countries.

New Perspectives

This book presents many novel perspectives. For example, it argues that the interactive narrative of data journalism operates on two levels: rule-driven narrative and procedural rhetoric. In the rule-driven narrative, the author is not “dead” but rather controls the user’s thinking in a more covert way, while procedural rhetoric conveys information and persuades the user in a seemingly natural way. Interactive narratives thus appear to give users freedom of choice, but in essence this freedom is only an illusion. The book also offers new insights into the principles of data journalism. For example, data journalism should treat objectivity as a method rather than a goal. In addition, the significance of transparency for data journalism lies in the “repeatability” of the data analysis process, which allows others to monitor the data journalism production process. The book also includes a section on data journalism transparency, which is uncommon in existing research worldwide.

Structure of the Book

The chapters in this book correspond to the different stages of data journalism. Chapter 2 is entitled “Understanding Data Journalism.” The first section is devoted to the definition of data journalism in order to clarify its boundaries. The second section explores the contemporary contexts in which data journalism was born and developed, helping to explain why data journalism first emerged in Western countries such as the U.K. and the U.S., and why it is considered a new journalistic paradigm. The third section explores four aspects of data journalism innovation: its epistemology, methodology, narrative, and textual representation. The fourth section delves into the symbiotic relationship between the open data movement and data journalism, and explores possible reasons why the connotation of “data” in data journalism has been blurred in recent years.

Data journalism has developed different production models in practice. Chapter 3, entitled “Multiple Models of Data Journalism Production,” analyzes the endogenous, outsourcing, crowdsourcing, hackathon, and self-organization models of data journalism production, and uses the SWOT analysis framework to examine the endogenous, outsourcing, and crowdsourcing models in detail.

Chapter 4 is entitled “Insight into Reality: Data Collection and Analysis.” The first section summarizes the four methods of data collection: open access, access by application, self-collection, and shared access. The second section explores the problems of data collection with case studies. The third section summarizes the four functions of data analysis: describing, explaining, predicting, and deciding. The fourth section analyzes common problems in data analysis through cases. The fifth section proposes paths to improve data analysis. The sixth section discusses the application of the concepts and methods of data science in the winning entries of the Data Journalism Awards. The seventh section provides an outlook on the trends of data science applications in data journalism.

Chapter 5 is entitled “Representing Reality: The Narrative of Data Journalism.” In this chapter, data journalism is considered as discourse, and its narrative is considered part of the discursive construction. The first section presents the core concepts of the chapter – discourse and narrative – and points out the importance of a post-classical narratological perspective for the narrative study of data journalism. The second section explores how data journalism constructs complex narratives along four dimensions: time, space, social networks, and correlation. The third and fourth sections explore three narrative models of data journalism – linear narrative, parallel narrative, and interactive narrative – arguing that the essence of interactive narrative is not the “Death of the Author” but an illusion of freedom of choice. The fifth section analyzes the word-image relationship in data visualization from the perspective of discourse and examines how the two collaborate in the production of discourse. The sixth section explores story awareness, relevance awareness, and product awareness in data journalism narratives.

As social platforms become the mainstream channel through which users consume news, data journalism needs to be “sharing-centered” on social platforms. Chapter 6 is entitled “Social Media-oriented Production of Data News”; it explores the reasons for social media-focused production of data journalism and the “sharing-centered” production strategy.

Traditional journalistic ethical norms, while still informative, are not fully applicable to data journalism. Chapter 7, entitled “Ethics of Data Journalism Production,” explores the ethics of data journalism in a constructive manner, systematically discussing objectivity, transparency, and the use of personal data in data journalism, and proposing possible solutions. In the first section, the author sorts out the multiple meanings of objectivity and argues that data journalism should adhere to pragmatic objectivity, putting forward specific requirements in three respects: data collection, data analysis, and data visualization. The second section explores the principle of transparency in data journalism, sorting out the connotation and system of transparency and discussing in detail data collection transparency, data analysis transparency, algorithmic transparency, data visualization transparency, producer transparency, and other transparency matters. The third section addresses the secondary use of personal data by data journalists and proposes the principles of informed consent, legality and proportionality, and public interest priority and harm minimization.

When big data becomes a buzzword, what is the practice of big data journalism, and what is its value? Chapter 8 is entitled “Big Data Journalism”; it introduces the current state of big data news production, discusses the social value of big data news in three respects – social environment monitoring, social governance, and media profit models – and explores the requirements big data news places on the media.

References

Chen, M. (2016). Data journalism: An important turning point in the history of news evolution. Media (14), 1. (Published in China.)
Coddington, M. (2015). Clarifying journalism’s quantitative turn: A typology for evaluating data journalism, computational journalism, and computer-assisted reporting. Digital Journalism, 3(3), 331–348. https://doi.org/10.1080/21670811.2014.976400
Entman, R. M. (1993). Framing: Toward clarification of a fractured paradigm. Journal of Communication, 43(4), 51–58. https://doi.org/10.1111/j.1460-2466.1993.tb01304.x
Fang, J. (2015). Introduction to data journalism. China Renmin University Press.
Fang, J., & Gao, L. (2015). Data journalism – An area that urgently needs to be regulated. Chinese Journal of Journalism & Communication, 37(12), 105–124. (Published in China.)
Fang, J., Hu, Y., & Fan, D. (2016). Data journalism practice in the eyes of journalists: Value, path and prospect – A study based on in-depth interviews with seven journalists. Journalism Research (2), 13–19. (Published in China.)
Kayser-Bril, N. (Ed.). (2016, October 13). Celebrating 10 years of data journalism. Nicolas Kayser-Bril. http://blog.nkb.fr/ten-years-datajournalism
Kayser-Bril, N., Valeeva, A., & Radchenko, I. (2016). Transformation of communication processes: Data journalism. https://arxiv.org/ftp/arxiv/papers/1605/1605.01956.pdf
Rogers, S. (Ed.). (2011). Data journalism at the Guardian: What is it and how do we do it? The Guardian. www.theguardian.com/news/datablog/2011/jul/28/data-journalism
Rogers, S. (2015). Facts are sacred: The power of data (Y. Yue, Trans.). China Renmin University Press. (Published in China.)
Winkelmann, S. (Ed.). (2013). Data journalism in Asia. www.kas.de/documents/252038/253252/7_dokument_dok_pdf_35547_2.pdf/9ecd0cfc-9d30-0967-1d7e-04dd9c7c66de?version=1.0&t=1539655194206
Zhang, C., & Zhong, X. (2016). Data journalism: Context, genres and ideas. Editorial Friend (1), 76–83. (Published in China.)

2

Understanding Data Journalism

We are in an era of burgeoning data journalism, but reporting news with “data” is not an invention of 21st-century journalists. In 1810, the Raleigh, N.C., Star conducted a questionnaire survey of local residents on agricultural production and social well-being, which has been regarded as the earliest data journalism practice (Bowers, 1976). The Guardian’s earliest data news practice can be traced back to 1821, when it included in its news a table listing the number of students and the cost of tuition and fees at various schools in Manchester (Rogers, 2015, p. 4). The question is, why did data journalism become a buzzword in journalism in the first decade of the 21st century? What is the difference between data journalism and “using data to report the news”? Is the “data” in data journalism necessarily big data? Why is data journalism regarded as a new journalistic paradigm or genre?

The Boundaries of Data Journalism

In the decade since the concept of data journalism was put forward, the boundaries of data journalism have been a fundamental issue discussed and debated in academia and industry. To address this issue, we first go back to the “origin” of data journalism, sort out the different viewpoints, and try to outline its boundaries.

Definition of Data Journalism

What is data journalism? One view is that data journalism is synonymous with computer-assisted reporting. According to the National Institute for Computer-Assisted Reporting (NICAR), “computer-assisted reporting and data journalism are only different in name, but not in substance.” Another view is that data journalism is different both from computer-assisted reporting and from any other journalistic genre. To study the concept and scope of data journalism, we should go back to the original context in which the concept was proposed and consider what it really refers to.

Adrian Holovaty, founder of the EveryBlock website, was the first to put forward the idea of data journalism. In 2006, he proposed that newspaper websites should stop the story-centric worldview and find stories in structured data (Holovaty, 2006). Holovaty argued that the “structured information” gathered by newspaper reporters (dates, times, locations, the victims of fires, the number of fire stations involved, and so on), which could be stored on computers for later review and comparison, is instead “distilled” by reporters into words that can be used only once, with no chance of being repurposed. Newspaper websites should fully tap the value of structured information to better serve readers, instead of simply publishing articles.
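To make Holovaty’s point concrete, here is a minimal sketch (not from the book; a hypothetical schema with invented records, using only Python’s standard library) of how event facts stored as structured rows can be queried and repurposed in ways a one-off prose story cannot:

```python
import sqlite3

# Hypothetical "fires" table of the kind Holovaty describes: date, location,
# injuries, and responding stations kept as reusable fields rather than prose.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fires (date TEXT, neighborhood TEXT, injuries INTEGER, stations INTEGER)"
)
conn.executemany(
    "INSERT INTO fires VALUES (?, ?, ?, ?)",
    [
        ("2006-03-01", "Lakeview", 0, 2),
        ("2006-03-09", "Pilsen", 3, 4),
        ("2006-04-17", "Lakeview", 1, 3),
    ],
)

# Stored once, the same facts can answer later questions: which neighborhoods
# burn most often, and how many people were hurt there?
for row in conn.execute(
    "SELECT neighborhood, COUNT(*) AS n, SUM(injuries) FROM fires "
    "GROUP BY neighborhood ORDER BY n DESC"
):
    print(row)  # e.g. ('Lakeview', 2, 1)
```

Each new record added to such a table enriches every past and future query, which is exactly the kind of reuse that “distillation” into prose forecloses.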

Adrian Holovaty does not advocate that all reports on all communication platforms should stop being story-centric. He addressed the problem of newspaper websites being a “rehash” of newspaper content and proposed that newspaper websites should make good use of the technological advantages of new media platforms to let structured information play a greater role, so that readers can find connections and new insights in it.

In fact, Adrian Holovaty had practiced the idea of abandoning the story-centric worldview before he published A Fundamental Way Newspaper Sites Need to Change. In May 2005, he launched the website Chicago Crime (chicagocrime.org), which combined Google Maps with data from the Chicago Police Department to create an interactive online map of crime records for each neighborhood, so that Chicago residents could conveniently look up the criminal records near their residences (Batsell, 2015). This is database journalism, a sub-form of data journalism.

One of the first to put Adrian Holovaty’s idea of stopping story-centric reporting into practice was PolitiFact, a data-driven website created by the Tampa Bay Times in August 2007, which made its goal clear in its statement that the site aimed to serve audiences better through databases rather than newspaper stories (Waite, 2007). In its coverage of the 2008 U.S. presidential election, the site won the 2009 Pulitzer Prize for checking more than 750 political claims and helping voters distinguish fact from rhetoric (The Pulitzer Prize, 2009).

The first person to formally put forward the concept of “data journalism” was Simon Rogers, former digital editor of The Guardian and now editor-in-chief of Google Trends. On December 18, 2008, he officially introduced “data journalism” in a blog post on The Guardian website entitled Turning Official Figures into Understandable Graphics, at the Press of a Button:

As of yesterday, our development team has come up with an application which takes the raw data and turns it into an editable map. Which meant that we could produce a fantastic interactive graphic based on these figures. It’s data journalism – editorial and developers producing something technically interesting and that changes how we work and how we see data. (Rogers, 2008)

Examining the views of academia and industry at home and abroad, we find that the view that data visualization is the way data journalism is presented dominates, although there are now some dissenting views. Simon Rogers did not define data journalism in his book Facts Are Sacred: The Power of Data, but from his description we can see that data, data processing, and data visualization are its three elements. The Art and Science of Data-driven Journalism, published by the Tow Center for Digital Journalism at

Columbia’s Graduate School of Journalism, holds that the basic form of data journalism must include three elements:

1) the treatment of data as a source to be gathered and validated, 2) the application of statistics to interrogate it, 3) and visualizations to present it, as in a comparison of batting averages or stock prices. (Howard, 2014)

Paul Bradshaw, a professor at Birmingham City University in the U.K., argues that data journalism is not simply making news with data, but rather a combination of the traditional “nose for news” and the ability to tell fascinating stories with large-scale digital information (Gray et al., 2012). Some scholars have also defined data journalism by stating what it is not: data journalism is not social science, although we use polls, statistics, and other related methods in our reporting; data journalism is not mathematics, although we need to know how to calculate trends or do basic calculations; data journalism is not a beautiful chart or a cool interactive map, although we often use data visualization for analysis or explanation; data journalism is not hardcore programming, although we use code to analyze, scrape, and make charts; and data journalism is not hacking, because we do not do such things (Marzouk & Boros, 2018).

Regardless of whether data journalism is defined through its form or its process, two elements stand out: (1) quantitative information should play a central role in the development or telling of the story, and (2) there should be some visual representation of the data in the story (Zamith, 2019).

Based on these understandings of data journalism from different angles, academia and industry at home and abroad classify definitions of data journalism into three categories: “data journalism is a process,” “data journalism is the composition of some elements,” and a combination of the two.

“Data journalism is a process” regards data journalism as a journalistic production process. Jonathan Stray, a veteran data journalist, holds that data journalism means obtaining, reporting, curating, and publishing data in the public interest (Stray, 2011). Deutsche Welle journalist Mirko Lorenz sees data journalism as a process of refinement, in which raw data is transformed into meaningful content. When complex facts are refined into clear stories that are understood and remembered by the public, the value of data to the public increases. The production process of data journalism can be divided into four stages – data, filter, visualize, and story – and as the stages progress, the value of the data to the public increases (European Journalism Centre, 2010).

“Data journalism is the composition of some elements” means defining data journalism by listing its constituent elements, and such definitions are the most common. Generally, they focus on three elements: object, method, and representation. Representative definitions include:

Data journalism means journalism with structured data.

(Kayser-Bril, 2016)

This definition holds that the core element of data journalism is structured data. Structured data refers to data which can be processed by computers (Kayser-Bril, 2015). However, this definition does not draw a clear line between data journalism and computer-assisted reporting, because structured data has long been the object of computer-assisted reporting in daily news production, and the objects of data journalism now also include semi-structured and unstructured data.

Aimed at serving the public interest, based on open data, relying on special software programs to process the data, data journalism uncovers the news stories hidden behind macro and abstract data, and presents the news in a visual and interactive way. (Liu & Lu, 2014)

This definition covers the three elements of data journalism – object, methodology, and representation – but its description of methodology is not specific: what is a “special” software program?

Data journalism is a new journalistic form in the information society, providing data support for recent events or extracting factual information for reporting from a large amount of data. In the production process, we must rely on Internet technology to collect, process, and analyze data, and make and release news through visual expression. (Shen & Luo, 2016)

Here the definition of methodology (so-called “Internet technology”) is also vague, and we hold that it describes a way of producing data journalism rather than defining it. In 2016, Simon Rogers finally defined data journalism:

Using data to tell stories in the best possible way, combining the best techniques of journalism: including visualizations, concise explanation, and the latest technology. It should be open, accessible, and enlightening. (Rogers, 2016)

Simon Rogers’s definition emphasizes the method and path of telling stories with data, and tries to distinguish 21st-century data journalism from The Guardian’s data-based news stories since 1821 by using expressions such as “best possible way” and “best techniques,” but it still does not clarify the boundaries of data journalism.

The third category combines “data journalism is a process” and “data journalism is the composition of some elements.” A representative definition is:

Data journalism is a style of visual journalism based on data analysis and computer technology that uses data in news narratives to present content that would otherwise be difficult to present by words alone or to identify issues and then uncover news stories through data analysis. (Wu, 2016)
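Because the structured / semi-structured / unstructured distinction recurs throughout these definitions, a brief illustration may help. The sketch below is not from the book; it uses invented records and only Python’s standard library:

```python
import csv
import io
import json

# Structured data: a fixed row-and-column schema that software can process directly.
table = io.StringIO(
    "school,students,tuition\n"
    "Manchester Grammar,980,12\n"
    "Salford Free School,460,0\n"
)
rows = list(csv.DictReader(table))
print(rows[0]["students"])  # fields are addressable by name

# Semi-structured data: self-describing but irregular; fields may be nested or
# missing from record to record (e.g. JSON returned by a public API).
record = json.loads(
    '{"school": "Manchester Grammar", "funding": {"public": false, "sources": ["fees"]}}'
)
print(record["funding"]["sources"])

# Unstructured data: free text; structure must be extracted before any analysis.
text = "The school taught nearly a thousand pupils, most of whom paid no fees."
print("thousand" in text)
```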

To facilitate understanding, we delineate the boundaries of data journalism by defining its components. We hold that: data journalism is a journalistic paradigm which uses data science methods to discover facts from various kinds of data and presents the data through data visualization, serving news value and the public interest.

Compared with traditional news, data, data science, and data visualization are the three distinctive components of data journalism. “News value” and “public interest” are two important criteria for data selection in data journalism, while “data science methods” mark the main methodological difference between data journalism and traditional news. Although some of the viewpoints mentioned here hold that data journalism is based on statistical science (Howard, 2014) and deals with structured data (Kayser-Bril, 2016), in fact, with the development of data journalism practice, data processing methods have expanded from statistics to data science and from structured data to semi-structured and unstructured data. “Discover facts from various kinds of data” points out that the object of data journalism is “data,” which is used to reflect and reveal facts, and “facts” point to the goals and attributes of data journalism. “Data visualization” refers to the transformation of data from the “abstract” to the “intuitive.”

The Differences Between Data Journalism and Traditional Journalism

Whether data journalism is a new journalistic genre or a gimmick depends largely on its particularity. What is the difference between data journalism and traditional journalism? Why is it called “data” journalism or “data-driven” journalism?

Some think that data journalism grows out of traditional investigative journalism, offering a different way to find stories by systematically investigating problems and providing a set of skills that complement traditional reporting methods. Green-Barber (n.d.) believes that data journalism can provide credible evidence to support claims and present information to audiences in the form of data rather than text-based narratives. On this point, she agrees with Adrian Holovaty. Data journalism is the processing and presentation of complex relations which are difficult to perceive in the form of text or tables (Charbonneaux & Gkouskou-Giannakou, 2015). The reason data journalism is named after “data” is that in the past, news reports were mainly text with data as a supplement, or a mix of numbers and text, whereas data journalism is mainly data, with text taking a secondary role (Yan & Li, 2016).

Data journalism also outperforms traditional journalism on the elements of news value: full-time and predictive reporting increases the timeliness of news; the use of large samples of data over long periods presents the impact of events in a macroscopic and comprehensive manner; structured data reveals the inner relationships of news events and highlights the main points of the news; and unique topic selection and visualization bring out the interesting nature of news

(Dai & Han, 2015). Data journalism can encompass information over long periods and has the characteristics of in-depth reporting (Wang, 2016). The use of data has shifted the core of journalists’ work from the pursuit of the quickest report to the search for the real meaning of an event (Gray et al., 2012).

Compared with traditional news reporting, data journalism is a new process of information collection and processing. Bradshaw (2016) uses an inverted pyramid of data journalism to illustrate the production process: data compilation, data cleaning, understanding of data context, and data combination (a minimal code sketch of this process follows below). It seems that distinguishing data journalism from traditional news with “data” as the center is not enough to explain the core differences between the two. Some think that data journalism emphasizes the concept of “products,” stressing the business model and a unified development and operation environment, rather than conceiving of articles and videos (Bai & Ren, 2016). We think that these discussions distinguish data journalism from traditional news reporting in different ways and gradually clarify the boundaries between them. However, we should also see that data journalism differs from other journalistic genres that also use “data.”
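As a rough illustration of Bradshaw’s four stages, the sketch below walks a tiny invented spending dataset through compilation, cleaning, contextualization, and combination. The column names and figures are hypothetical, and the example assumes the third-party pandas library:

```python
import io

import pandas as pd

# 1. Data compilation: gather the raw data (an inline CSV stands in for a download).
raw = io.StringIO(
    "region,spend,year\n"
    "North West,1200000,2023\n"
    "North West,,2023\n"       # missing value
    "LONDON ,9800000,2023\n"   # inconsistent formatting
)
df = pd.read_csv(raw)

# 2. Data cleaning: normalize labels and drop unusable records.
df["region"] = df["region"].str.strip().str.title()
df = df.dropna(subset=["spend"])

# 3. Data context: join reference data that gives the raw numbers meaning.
population = pd.DataFrame(
    {"region": ["North West", "London"], "population": [7_400_000, 8_900_000]}
)
df = df.merge(population, on="region")
df["spend_per_head"] = df["spend"] / df["population"]

# 4. Data combination: aggregate into the figures a story would actually cite.
print(df.groupby("region")["spend_per_head"].mean().round(2))
```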

The Relationship Between Data Journalism and Precision Journalism, Computer-Assisted Reporting, and Computational Journalism

In essence, data journalism belongs to quantitatively-oriented journalism, and several interrelated concepts exist in European and American journalism: precision journalism, computer-assisted reporting, and computational journalism.

Precision journalism can be traced back to the 1950s, when American journalists used computers to analyze information in databases. In the 1960s, the American scholar Philip Meyer put forward the concept of precision journalism, and in 1973 he published Precision Journalism: A Reporter’s Introduction to Social Science Methods, which advocated applying social survey methods to journalistic practice to improve the scientific rigor, authenticity, and objectivity of information communication. Precision journalism arose in America and then spread all over the world (Shi & Zeng, 2014). With the popularization of computers, the improvement of data storage technology, and the rise of computer-assisted reporting, databases gradually became an important source of news leads for journalists. Although computers can help journalists collect and analyze data, computer-assisted reporting was at first only a technique, which did not fundamentally affect the news production process (European Journalism Centre, 2010). The relationship between precision journalism and computer-assisted reporting can thus be summarized as one of idea and means: precision journalism is a journalistic idea, and computer-assisted reporting is a technical means of realizing it.

Is data journalism also a means of precision journalism? There are two views on this. One view is that there is no substantial difference between data journalism and computer-assisted

reporting (NICAR, n.d.): both are quantitative news, using quantitative analysis to obtain data and analyzing the results for news reporting. The other view is that there are differences between computer-assisted reporting and data journalism. (1) In terms of continuity, data journalism is the product of the development of computer-assisted reporting to a certain stage. The rise of data journalism is due not only to the background of the big data era but also to the commercial value of news production (Su & Chen, 2014). (2) In terms of news form, computer-assisted reporting is not an independent news style but a reporting method, whereas data journalism is defined by the way data is used throughout the workflow. Data journalism demands three different news skills: computer-assisted reporting, news application development, and data visualization (Zotto et al., 2015). Therefore, the connotation and extension of data journalism are wider than those of computer-assisted reporting (Fang & Yan, 2013). (3) In terms of the path of content production, the difference lies in the different routes to producing justified beliefs: computer-assisted reporting follows a hypothesis-driven path, while data journalism follows a data-driven path (Parasie, 2019).

Jonathan Zhu holds that the evolution from the emergence of precision journalism to the rise of computer-assisted reporting, and then to database news and data-driven news, is not a relationship of replacement but an incremental one (Wang, 2017). Precision journalism advocates the principle of objectivism and social statistical methods; computer-assisted reporting realizes the digitalization of survey data on this basis; and data journalism inherits these advantages and makes reporting more visible through data visualization (Huang, 2015).

With the development of journalism, it is clearly not comprehensive to compare and distinguish only precision journalism, computer-assisted reporting, and data journalism. In recent years, computational journalism has arisen in the Western press. Mark Coddington regards computational journalism as a strand of technologically oriented journalism centered on the application of computing and computational thinking to the practices of information gathering, sense-making, and information presentation, rather than the journalistic use of data or social science methods more generally (Coddington, 2015). Computational thinking refers to using the basic concepts of computer science to solve problems, design systems, and understand human behavior. Its essence is abstraction and automation, demanding abstraction abilities such as handling abstract algorithms, models, languages, and protocols, and automation abilities such as systematization, programming, and compilation (Wing, 2006). As data journalism makes more use of spreadsheets than algorithms, not all data journalism can be regarded as computational (Stray, 2011).

Coddington (2015) discussed the differences among computer-assisted reporting, data journalism, and computational journalism along several dimensions:

(1) On the professional level, computer-assisted reporting tends to make journalists professional experts, limiting others’ participation in news production and securing journalists’ professional control over content. Data journalism and computational journalism are rooted in open-source culture and tend to be open and participatory. For example, although data journalism emphasizes editorial selection and professional news judgment in data analysis and presentation, its production is open to non-professionals (for example, through crowdsourcing). Computational journalism emphasizes networked collaborative production aimed at a tangible product or platform, while data journalism emphasizes narratives.

(2) In terms of openness, computer-assisted reporting is deeply influenced by traditional journalistic ideas, and its production is not transparent. The use of open data and open-source software makes the production of data journalism more transparent. Although computational journalism is deeply influenced by the open-source movement, its production is not transparent either, as algorithms are often regarded as trade secrets.

(3) From the perspective of epistemology, the object of computer-assisted reporting is a sample, while the object of data journalism and computational journalism is big data. The amount of data processed by computational journalism is larger than that of data journalism.

(4) From the perspective of public initiative, under the influence of traditional ideas of news production, computer-assisted reporting regards the public as “passive receivers” rather than as a creative and interactive part of the news production process (Lewis, 2012). Data journalism allows the public to analyze and understand data through data visualization or Web applications; computational journalism goes further and provides tools for audiences to analyze data with their own computational thinking. In other words, both data journalism and computational journalism essentially regard the public as “active actors.”

(5) From the perspective of quantification, computer-assisted reporting is rooted in social science methods and has the prudent style of investigative reporting and the aim of serving the public interest. Data journalism is characterized by participatory openness and cross-field hybridity.

Mark Coddington’s comparison of computer-assisted reporting, data journalism, and computational journalism lets readers see intuitively the larger and smaller differences among them along different dimensions. However, the objects of data journalism are not all big data: according to our observation of the practices of mainstream media in the U.K. and the U.S., the proportion of big data journalism is extremely low, and structured small data is currently the main object of data journalism. Based on this discussion and analysis, we summarize the differences between computer-assisted reporting, data journalism, and computational journalism in Table 2.1.

Table 2.1 Comparison of computer-assisted journalism, data journalism, and computational journalism

| Dimension | Computer-assisted journalism | Data journalism | Computational journalism |
| --- | --- | --- | --- |
| Openness of news production | Low | High | Relatively low |
| Epistemology of data extraction | Data sampling | All types of data | Big data |
| Types of data used | Structured data | Structured, semi-structured, and unstructured data | Structured, semi-structured, and unstructured data |
| Role of the public | Passive | Active | More active |
| Relationship between news producers and the public | Hierarchical | Equal | Equal |
| Professional norms pursued | Objectivity in news production | Objectivity and transparency in news production | Objectivity in news production |
| Methodology | Statistical science | Data science | Data science |
| Driving forces | Hypothesis-driven | Data-driven | Data-driven |

Contemporary Contexts for the Birth and Development of Data Journalism

Why was data journalism introduced, and why did it develop so rapidly? Why is mainstream media in countries with more developed journalism, such as the U.K. and the U.S., more active in developing data journalism? It is widely believed that the emergence of data journalism is closely related to the era of big data (Yuan & Qiang, 2016). Data journalism is driven by the wave of big data and is an interdisciplinary and cross-disciplinary way of producing news (Wen & Li, 2013), a key innovation and "dividend" for the global media in response to the changes of the big data era (Liu & Lu, 2014; Li & Zhou, 2015). But is the birth of data journalism necessarily linked to big data? The big data boom began in June 2011 after McKinsey & Company released its report Big Data: The Next Frontier for Innovation, Competition, and Productivity. Judged by when they emerged, there is no causal relationship between data journalism and big data. The Guardian tried to produce data journalism as early as 1821, the idea of data journalism was proposed by Adrian Holovaty in 2006, and the concept of data journalism was formally introduced by Simon Rogers in December 2008. In terms of "data," Holovaty and Rogers both referred to structured data rather than big data. If the argument that big data gave rise to data journalism were valid, then the object of data journalism should be big data; at present, however, data journalism at home and abroad still deals mainly with small data. Therefore, "data journalism and the development of big data are basically independent of each other" (Jonathan, 2014). Why, then, was data journalism born in the first decade of the 21st century? We use the phylogenetic approach to examine the factors that made its birth inevitable. According to phylogenetics, there are intrinsic mechanisms in the occurrence and development of things (Zhang, 2007). This method reflects and reveals the historical stages, forms, and laws of the development and evolution of nature, human society, and human thinking, emphasizing a dynamic examination of the object and focusing on the main, essential, and inevitable factors in the historical process (Feng, 2001, p. 218). If the researcher only describes the sequence of events without specifying the profound reasons for their creation and the laws of their development, it cannot be called the phylogenetic approach (Zhang, 2007). Data journalism was born out of a specific context. In linguistics, context refers to the interweaving of words and phrases whose meaning is determined by the segment or dialogue in which they exist. Generally speaking, context represents "the interconnected conditions in which something occurs" (Zhu & Chen, 2011). Context is not an independent entity, nor does it refer only to the external environment; rather, it is the state of coupling between the actor and its environment, the correlation that occurs between different things. Context is a "present" relationship with "immediacy," which also means that the context of what happened in the past is "historical" (Zhu & Chen, 2011). Data journalism is not something independent of and transcending social reality. If we take it out of its social context and examine it in isolation, it will be difficult

to understand why it was born in this specific time and space (Qian & Zhou, 2015). Examining the context in which data journalism was born can not only help us understand the internal logic of its emergence but also help us deeply appreciate the substance of data journalism.

The Political Context: The Open Data Movement

Data journalism is a set of productive activities that revolve around data. Without data, there can be no data journalism; and without sufficient data, there can be no broad and deep data journalism works. Where does data come from? The open data movement that emerged in 2005 has provided a potential data resource for data journalism and has been a major political force in its emergence. Open data is data that can be freely used, reused, and redistributed by anyone, subject only, at most, to the requirement to attribute and share alike (Open Data Handbook, 2011). Open data is neither big data nor unstructured data, but structured data in the hands of the government that is of public interest (Xu & Wang, 2014). The open data movement is closely related to the open government movement, and both are also influenced by the open-source movement. In the open government movement, which emerged worldwide in the 1980s, the public demanded that the functions of government be shifted from traditional public administration to public governance, that the efficiency of government operations be improved, and that the level of social governance be raised. The meaning of open government is "to improve the government's ability to govern through open information, open data, interaction, and dialogue between the government and the public, and cooperation between the government and enterprises and nonprofit social organizations" (Wang & Ma, 2015). The open government movement is motivated by the public's distrust of government and the demand for government "openness." Open data is not equal to open information. The purpose of open government information is to protect the public's right to know, improve government transparency, and promote lawful administration, focusing on political and administrative values; open government data, by contrast, emphasizes the use of government data by the public, focusing on the economic and social values of government data. Open government information focuses on the disclosure of information, while open government data goes deeper, down to the data level (Zheng, 2014). In addition, the reuse of government information resources by society generally requires prior authorization from government departments, while open government data is exempt from authorization. Open data is not only about free access to data, but also about the immeasurable value that can be generated when data from multiple sources are connected and innovatively applied in seemingly unrelated fields, amplifying the energy of data layer by layer (Xu & Wang, 2014). The open data movement of the early 21st century was first launched by the U.S. technology elite, who believed that only open access, whether to code or to data, can truly drive social progress (Tu, 2013, p. 191). In December 2007, open data promoters gathered in California and set out eight standards and principles for open

data: (1) data must be complete; (2) data must be primary; (3) data must be timely; (4) data must be accessible; (5) data must be machine-processable; (6) access to data must be non-discriminatory; (7) data formats must be non-proprietary; and (8) data must be license-free (Tauberer, 2014). In January 2009, U.S. President Barack Obama signed the Memorandum on Transparency and Open Government, and in May the U.S. open public data website Data.gov went live, becoming the world's first open government data website. The site provides the public with free access to machine-readable data resources for data research and data product development (Zhao et al., 2016). In December of that year, the U.S. government issued the Open Government Directive, which proposed three principles of open government: Transparency, Participation, and Collaboration. Government should be transparent, which makes it responsible for explaining and informing citizens of what it is doing; government should be participatory, with public participation helping to improve the efficiency of government and the quality of decision-making; and government should be collaborative, involving more citizens in the government's decision-making process. The U.S. government has now opened up nearly 200,000 datasets covering agriculture, business, climate, consumer affairs, ecology, education, energy, finance, health, manufacturing, oceans, public safety, research, local government, and more. In December 2009, the U.K. Government published the report Putting the frontline first: smarter government, which made open government data and transparency a top national strategy. Then-Prime Minister David Cameron pioneered the concept of a "right to data," pledging to make the right to data universal for the British people. According to Cameron, "Greater transparency is at the heart of our shared commitment to enable the public to hold politicians and public bodies to account" (BBC NEWS, 2010). Open data is not only a "raw material" for the media; more importantly, its goals are aligned with the media's goals. Open government data is one of the most effective vehicles for public scrutiny of the government. In the Western press, where the idea of a Fourth Estate is deeply rooted, open data is an effective way for the press to monitor the government. The ultimate goal of both is the proper functioning of democracy. Open data does not mean that the public can correctly and deeply understand the data related to public interests, nor does it mean that the public will actively inquire about and collect these data. There is still an insurmountable "data gap" between open data and the public. For the media, data related to public interests is itself a part of news production, and when a large amount of data is opened up, how to use these data resources to gain insight into social reality becomes an important issue. In the open data movement, promoters of open data (such as the Open Knowledge Foundation) have formed "partnerships" with media outlets, giving data journalism a different "political air" from computer-assisted reporting. "This new, improved data journalism could start to perform a valuable democratic function: becoming a bridge between those who have the data (and are

terrible at explaining it) and the world, which is crying out for raw information and ways of understanding it" (Rogers, 2014). The public thus has high expectations for data journalism, not just because of its innovative form, but also because of the elements of justice and fairness inherent in open data. Data journalism helps news organizations engage in investigative reporting in a more cost-efficient way, one that monitors government and promotes democracy (Hu et al., 2016).
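To make the newsroom side of this "bridge" concrete, here is a minimal sketch of how a journalist might pull a machine-readable dataset from an open data portal and take a first reporting pass over it. It is an illustration of the workflow described above, not an example from any of the projects cited: the URL and the column names (`region`, `year`) are hypothetical placeholders.

```python
import pandas as pd

# Most open data portals (e.g., Data.gov, data.gov.uk) expose datasets as
# CSV or JSON endpoints; this URL is a hypothetical placeholder.
DATASET_URL = "https://data.example.gov/road-safety-2013.csv"

df = pd.read_csv(DATASET_URL)

# A first reporting pass: how big is the dataset, which fields does it
# carry, and how complete is it?
print(df.shape)                    # rows x columns
print(df.columns.tolist())         # available fields
print(df.isna().mean().round(2))   # share of missing values per column

# Aggregate toward a publishable figure, e.g., incidents per region per year
summary = df.groupby(["region", "year"]).size().reset_index(name="incidents")
print(summary.sort_values("incidents", ascending=False).head())
```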

Technical Context: The Open-Source Movement

If the open data movement provided the political basis for the birth of data journalism and the resources to produce it, another movement in the technology field, the open-source movement, has brought data journalism from "ideal" to "reality." In 1984, Richard Stallman, a researcher at MIT's Artificial Intelligence Laboratory, started the open-source movement (formerly known as the free software movement). Stallman believed that computer software should be free, and that if it was not, a few people would come to rule the industry (Li, 2005). This movement has grown in popularity over the past 30 years and has directly contributed to the development of the open data movement. A direct result of the open-source movement is the proliferation of open-source software, whose owner of the source code and its copyright allows anyone to study, modify, and distribute the software and use it for any purpose (Qian, 2016). The difference between open-source software and proprietary software is that the code of proprietary software is protected by law (Qian, 2016), while the code of open-source software is not. Computer-assisted reporting, which began in the 1950s, uses proprietary software, which is considered one of the differences between data journalism and computer-assisted reporting. The open-source movement promotes the idea of open source as technically transparent and participatory coding, where all source code can be used and modified, and where these modifications can be accessed by others. This idea of sharing is at the heart of the hacker ethic (Li Y., & Li S., 2015), which promotes access to information and computer resources by writing open-source programs (Ji, 2005). In the field of journalism, open-source software provides journalists with low-cost, efficient, and highly innovative tools for news production. According to data journalism expert David McCandless, it is not the increasing volume of data that makes data increasingly important, but the tools and ability of journalists to analyze it (Rogers & Gallagher, 2013). According to Rogers (2015, p. 296), "the barriers for entry have never been lower as free tools change the rules on who can analyse, visualize, and present data." Free tools in this context mostly refer to open-source software. The main reasons the media use open-source software rather than purchase proprietary software are: (1) When the prospect of reusing and reprocessing data is unclear, purchasing proprietary software requires capital and staff investment,

while open-source software is free, easy to obtain, and simple to operate, effectively reducing the cost of investment. (2) The development of particular functions in proprietary software must be carried out by its developer, which is costly, while open-source software can be tailored by the media to their own functional needs. (3) Open-source software can already meet most of the needs of data news production, so there is no need to purchase proprietary software.
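As an illustration of how low the barrier has become, the following sketch produces a publishable line chart with nothing but the free, open-source Python stack. The spending figures are invented for demonstration; a real story would load them from data.

```python
import matplotlib.pyplot as plt

# Invented figures for demonstration purposes only.
years = [2010, 2011, 2012, 2013, 2014, 2015]
spending = [3.1, 3.4, 3.2, 3.8, 4.1, 4.0]  # hypothetical budget, $bn

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(years, spending, marker="o")
ax.set_title("Department spending, 2010-2015 (illustrative data)")
ax.set_ylabel("$ billions")
for side in ("top", "right"):          # strip chart junk for a cleaner look
    ax.spines[side].set_visible(False)
fig.tight_layout()
fig.savefig("spending.png", dpi=150)   # an image ready to drop into a story
```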

Industrial Context: The Crises in British and American Journalism

The industrial context for data journalism is a crisis in journalism, including a crisis of trust and a crisis of professionalism. For British and American journalism, one of the motivations for creating and embracing data journalism was the industry's own need to survive and develop in new ways.

Crisis of Trust in British and American Journalism

One of the fundamental assumptions characterizing the American public is that democracy thrives in part because of the information disseminated by the news media (Altschull, 1988, p. 20). In the competitive audience market, traditional news production, with its emphasis on timeliness and exclusivity, has focused on competing for attention, a scarce resource, while making truth the truly scarce resource. The quality of journalism has declined as a result of the involvement of interest groups and the stimulation of the attention economy. Journalists have focused too much on news events, intentionally or unintentionally abandoning the pursuit of truth and limiting the role of journalism in democratic society (Jiang, 2011). The American public's trust in the media has been declining since the turn of the 21st century. Gallup's 2016 U.S. Media Trust Survey showed that Americans' trust in the mass media to report the news fully, accurately, and fairly had dropped to its lowest level in Gallup polling history, with only 32% saying they have a great deal or fair amount of trust in the media (Swift, 2016). YouGov's survey of the British public's trust in the media shows that BBC journalists are the most trusted, with a trust rate of 61%. As for newspapers, "upmarket" newspapers have the highest trust rate at 45%, while red-top tabloid newspapers have a trust rate of only 13% (Jordan, 2014). The crisis of trust in journalism is found not only in the U.K. and the U.S. but also in a number of other developed countries. In Digital News Report 2016, published by the Reuters Institute for the Study of Journalism, 42% of the public in the U.K. and 30% in the U.S. trust the news media in their own country, and 29% and 27%, respectively, trust journalists (Newman, 2016). The immediate impact of the crisis of trust on journalism is the dissipation of media power. Media power derives from the empowerment of the audience and reflects society's expectation that the media will serve the public interests (Wang, 2013). The power of media depends on how the public identifies with the media, which means that media power can only be realized if the public trusts

the media (Chen, 2011). A media outlet without power cannot gain the support of society and has no reason to exist. The public needs the media because they are hungry for the truth, and the media exists because it can find the truth. In reality, a large proportion of news reports have "news" but no "truth" (Jiang, 2011). In the 21st century, the development of new media has challenged the previous traditional media-centered news production model. The development and empowerment of technology have made possible the rapid dissemination of vast amounts of information, but they have also made it costly to distinguish between true and false information. The self-cleaning function of the "marketplace of ideas" envisaged by Milton has not been realized; in many cases, rumors tend to "dilute" the truth. As credibility becomes a scarce resource, traditional media face a new opportunity to move from an "attention market" to a "credibility market" (Huang, 2013). Rebuilding trust in journalism requires a new style of journalism that is seen as credible and authoritative. A type of journalism that resembles the production of scientific knowledge will undoubtedly have an advantage in the credibility race. According to Brian Keegan, a professor in the School of Humanities and Social Sciences at Northwestern University, "Data-driven journalism can provide a more solid empirical basis for discussions about policy, economic trends, and social change" (Howard, 2014). According to independent journalist Sandra Fish, "Data can trigger hypotheses, it can disprove your hypothesis, but it always enhances a story that would otherwise be based on anecdotes" (Edge, 2015). Data journalism can thus strengthen journalism's "Fourth Estate" role, using software to enable more sophisticated investigations and to uncover connections behind stories that, in the past, were not reported comprehensively or were never reported at all (Gray et al., 2012, p. 137). Data journalism gives journalism back its legitimacy by integrating its strengths (Peng, 2015).

The Crisis of Professionalism in Journalism

Boundary work is an important concept developed by Gieryn, a sociologist of scientific knowledge, in his study of the boundary of science. It means that scientists selectively assign characteristics to scientific institutions (i.e., their practitioners, methods, stock of knowledge, values, and organization) in order to construct a social boundary that distinguishes some knowledge activities as "non-scientific" (Ma, 2013). Gieryn (1983) identified three strategies by which scientists construct boundaries between science and non-science: expansion, expulsion, and protection of autonomy. Through analogous strategies, the profession of journalism maintains its own boundaries and keeps its professionalism from being violated or replaced. With the rise of citizen journalism around the world, the privilege of professional news producers has been broken (Wang, 2016), and communication technologies have made it possible for anyone to become a journalist at minimal cost (Clayton, 2008). Anyone with basic journalistic sensitivity, professional knowledge, and editorial skills can become a journalist and be influential through self-publishing or social networks, even without being recognized by the mainstream media. The

political and sociological significance of self-media lies first and foremost in the fact that it undermines the power of the traditional media (Pan, 2011). The media, no longer holding a monopoly, has to fight for its position of authority by providing knowledge (Kovach & Rosenstiel, 2014, p. 191). At the same time, journalism is in a process of de-boundarization. The boundaries between journalism and other forms of public communication (e.g., public relations, blogs, and podcasts) are disappearing (Loosen, 2015). From the perspective of journalism, de-boundarization is essentially the expansion of journalistic boundaries, a phenomenon due to the endogenous development of journalism and the external pressure of Internet technology (Zhao & Ni, 2016). In this context, open data provides an opportunity for journalism to raise the professional bar. The open data movement has been met with a question: is the public actually empowered by open data? (Janssen & Darbishire, 2012) While data is open, the public is unable to understand abstract data. While open-source software has lowered the bar for collecting, processing, analyzing, and visualizing data, it has to be acknowledged that complex data production is not within the reach of the general public. An intermediary is needed between the government and the public to transform open data into information or knowledge for the public (Baack, 2015). The role of the news media is to serve the public interest through content production and to be a trusted intermediary for open data, which also enhances the professionalism of the media. As a public good, data journalism is contributing to a "democratic conversation" around data, especially in the social space (Boyles & Meyer, 2016). "Data are everywhere all the time," notes Mark Hansen, director of Columbia University's Brown Institute for Media Innovation: "They have something to say about us and how we live. But they aren't neutral, and neither are the algorithms we rely on to interpret them. The stories they tell are often incomplete, uncertain, and open-ended. Without journalists thinking in data, who will help us distinguish between good stories and bad?" (Bell, 2012) "Data journalism gives journalists a whole new role as a bridge between those who have the data (and are terrible at explaining it) and the world, which is crying out for raw information and ways of understanding it" (Rogers, 2015, pp. 34–35). Data is essential to making the journalism of today stronger than what came before (Sunne, 2016).

Innovations in Data Journalism

Everett M. Rogers (1983, p. 23) argues that: "An innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. It matters little, so far as human behavior is concerned, whether or not an idea is 'objectively' new as measured by the

lapse of time since its first use or discovery. The perceived newness of the idea for the individual determines his or her reaction to it. If the idea seems new to the individual, it is an innovation." Data journalism is a kind of "disruptive innovation," but what exactly is "new" about it? We compare data journalism with traditional journalism and computer-assisted reporting, and explore the "newness" of data journalism in understanding, exploring, constructing, and expressing reality.

Discovering the "Reality" in the Data

According to Shen Hao, a professor at Tsinghua University, data journalism has changed the definition of news from "the reporting of newly occurring facts" to "the reporting of newly discovered facts." The object of journalism has expanded from "what happens" to "what is found." Manovich, a professor at the University of California, argues that databases provide a new way to structure our experience of ourselves and of the world. In the past, news producers tended to place less importance on data and did not realize the value of open data; the reuse of open data was extremely limited, and large-scale, massive data was not among journalists' work objects (Manovich, 2001, pp. 194–195). According to Ye Zhenzhen, head of the "Central Kitchen," the media hub of People's Daily, at a time when the virtual and the real are deeply integrated, more and more news is happening in the world of data, and there is a need to tap into these "invisible news scenes" (Guo, 2016). When data journalism was born, there was a debate: "Can data be called journalism?" "Is it journalism to publish a raw database?" Adrian Holovaty responded on his blog on 21 May 2009: "1. Who cares? 2. I hope my competitors waste their time arguing about this as long as possible." (Holovaty, 2009) Publishing data is news, argue Matt Stiles and Niran Babalola of the Texas Tribune in a blog post describing the paper's operating philosophy (Stiles & Babalola, 2010). In May 2015, La Nación, in partnership with three NGOs, developed Declaraciones Juradas Abiertas, a web-based application for visualizing the assets of officials, which gives the public intuitive and easy access to that information both in raw formats and, more importantly, through interactive applications (Mao, 2014). Other media outlets collect their own data and create databases to provide the public with more targeted information services (Bi, 2016). "In the same way that a photograph captures a moment in time, data is a snapshot of the real world" (Yau, 2014, p. 6). As humanity enters the era of big data, the ontological status of data is accentuated. Data goes from being a symbol that describes things to being one of the essential properties of everything in the world

(Huang, 2016). Data is a way of representing reality, from which profound insights can be discovered if it is analyzed over time or in relation to other things. Data-based analysis can also reveal hidden truths. Caixin Media's "From Regulation to Stimulation: A Ten-Year Cycle of the Property Market" reviews the effects of national policy regulation on the property market since 2005, using a dynamic interactive chart to show month-by-month changes in 70 cities across the country over 10 years. Comparing year-over-year increases in house prices across the country makes clear that policy has had an effect on house prices; compared with the two previous adjustments, the stimulus policies were slow to take effect in most cities after the overall house price drop in 2014. "Medicare Unmasked," one of a series of data journalism stories launched by The Wall Street Journal in 2014, used 9.2 million U.S. healthcare claims records to expose the workings of programs totaling $600 billion for seniors and people with disabilities. Other media outlets have since uncovered new healthcare fraud and abuse on the basis of this data (Bi, 2016).
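The computation behind a chart like Caixin's is itself simple enough to sketch. The snippet below is a hypothetical reconstruction rather than Caixin's actual code: it derives year-over-year price changes by city from a monthly price index, and the file name and column names are assumptions.

```python
import pandas as pd

# Assumed input: one row per city per month, with a price index column.
df = pd.read_csv("house_prices.csv")  # columns: city, month, price_index
df["month"] = pd.to_datetime(df["month"])
df = df.sort_values(["city", "month"])

# Year-over-year change: compare each month with the same month a year earlier.
df["yoy_pct"] = df.groupby("city")["price_index"].pct_change(periods=12) * 100

# In which months did prices fall year-over-year in the most cities?
falling = df[df["yoy_pct"] < 0].groupby("month")["city"].nunique()
print(falling.sort_values(ascending=False).head())
```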

The Scientific Path to Reality: Data Science as a Method for Seeking Truth

Julian Assange, the founder of WikiLeaks, believes that "journalism should be more like science," and that "As far as possible, facts should be verifiable. If journalists want long-term credibility for their profession, they have to go in that direction." Assange holds that journalists should "have more respect for readers" (Greenslade, 2010). Journalists often use qualitative methods to seek truth, such as interviewing, gathering information, and using inductive, deductive, and inferential thinking to gain knowledge of, judge, and interpret a phenomenon, issue, or event. According to Margaret Freaney, executive editor of the Atlanta Business Chronicle, the media industry used to call journalism "anecdotal": if a journalist finds three examples, he or she can simply sum up a "trend." Such a story is not really convincing, because the journalist is only talking about exceptional cases. If journalists instead do systematic research and use figures to support their speculation, the story becomes more credible (Xu & Wang, 2015). According to Richard Sambrook, professor of journalism at Cardiff University, "In a world awash with opinion there is an emerging premium on evidence-led journalism and the expertise required to properly gather, analyze and present data that informs rather than simply offers a personal view" (Sambrook, 2010). Data journalism uses data science as a method of truth-seeking and is therefore distinct from other types of journalism. One might ask: doesn't computer-assisted reporting also use data science as a method of truth-seeking? In fact, computer-assisted reporting in the traditional sense uses statistical methods. Although the object of statistics is also data, there is a difference between statistics and data science. While modern statistics developed from dealing with small data obtained from imperfect experiments, data science emerged from dealing with big data. In terms of data objects, data in statistics is structured data, while data in data science includes both traditional structured data as well as unstructured and semi-structured data,

such as text, images, video, audio, weblogs, etc. (Wei & Jiang, 2014). The theoretical basis of data science is therefore statistics, and data science can be seen as the result of the continuous expansion of statistics in research scope and analysis methods, with particular emphasis on data-oriented, algorithm-based methods of data analysis (Wei & Jiang, 2014). For example, Nate Silver, head of the FiveThirtyEight website, builds statistical models to predict news events; he predicted 49 out of 50 states correctly in the 2008 U.S. elections and all 50 states correctly in the 2012 U.S. elections. The U.S. website BuzzFeed, in partnership with the BBC, used algorithms to analyze 26,000 ATP and Grand Slam top men's tennis matches from 2009 to 2015 to come up with a list of matches and players suspected of match-fixing. The methods used in data journalism are thus no longer limited to statistics but extend to data science.
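As a toy illustration of what such data-driven screening means in practice, consider flagging players who lose as heavy favorites far more often than chance would allow. This is emphatically not BuzzFeed's actual model, which analyzed movements in betting odds; the records and the baseline upset probability below are invented for demonstration.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more upsets."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

P_UPSET = 0.10  # assumed baseline chance that a heavy favorite loses

# Hypothetical records: (player, matches played as heavy favorite, losses)
records = [("Player A", 40, 4), ("Player B", 35, 12)]

for player, n, losses in records:
    p_value = binomial_tail(losses, n, P_UPSET)
    flag = "flag for review" if p_value < 0.01 else "within chance"
    print(f"{player}: {losses}/{n} upset losses, p = {p_value:.2g} -> {flag}")
```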

Constructing the Multidimensional Face of Reality: The Capacity of Complex Narratives

The narrative of data journalism is mostly complex, processing and presenting complex relationships that are not easily perceived in text or tables (Charbonneaux & Gkouskou-Giannakou, 2015). It starts from a certain news event and makes the news wider, deeper, and more reflective of its historical nature by using data science methods (Bai & Ren, 2016). "An event is newsworthy not only because of the event itself but also because we can narrate it using existing narrative codes" (Bignell, 2012, p. 70). Whereas language has a limited storytelling capacity, images are polysemous and require the use of language (such as text or spoken language) to "anchor" their meaning. More importantly, language-based narratives tend to follow a linear logic of time or cause and effect, and cannot accommodate additional narrative threads and dimensions. Data journalism has narrative codes that allow complex narratives. The visual representation system of data journalism determines that it excels in complex narratives: the language of data journalism is multimodal, its coding is visual, and its content is high-dimensional. Data journalism breaks through the limitations of the traditional tree-like two-dimensional narrative structure and forms a three-dimensional narrative model (Zhang, 2016). Traditionally, journalists believe that readers are less interested in long stories. Journalists strive to be brief in their storytelling, large descriptive or explanatory paragraphs are sometimes removed from the news, and complex topics are often left untouched (Gans, 2009, p. 204). In news reporting, a balance between macro-narrative and micro-narrative makes the story appear more realistic, objective, and vivid, but it is difficult for traditional news reporting to balance the two in practice (Bai & Ren, 2016). The Wall Street Journal Formula, seen as a good example of combining macro-narrative with micro-narrative, starts with a specific example (person, scene, or detail), passes through transitional paragraphs, moves to the main body of the story, and then returns to the first example. This method of writing starts with a micro-level

narrative, then moves to the macro-level narrative, and finally returns to the micro-level, a humane, story-driven approach mostly used in feature writing. But the purpose of The Wall Street Journal's individual storytelling is to make the story more interesting, to make the audience understand it deeply by "seeing the big in the small" and empathizing with it. Data journalism goes further, because its micro-narrative is not about a particular case but about providing targeted and personalized narrative content to the audience in an interactive, personal way. The Washington Post's Pulitzer Prize-winning piece "People shot dead by police this year," for example, provides both a macro-level overview of deaths by police shooting in the U.S. in 2015 and a meso-level count of the incidence by state. More importantly, audiences can filter the information for what interests them, clicking on options for state, gender, race, age, type of weapon carried, mental illness, and danger level to view charts, case profiles, reported text, videos, and more. The information provided by traditional news reports is homogeneous. When traditional news reports try to accommodate different audiences in terms of content, the result is often one of two situations: "big and empty," where the news has large coverage but little targeted information, or "small and narrow," where the information is targeted at a specific group of people and its distribution is therefore narrow. According to Schudson, modern news often gives people a sense of "standardization": "the coverage should be inclusive of women as well as men, young as well as old, racial minorities as well as whites, and non-heterosexuals as well as heterosexuals" (Schudson, 2013). Data journalism allows a flexible shift between macro-narratives and micro-narratives, thus enabling a move from the standardization of news to the personalization of news. As traditional news pieces cannot present large data sets, the complex nature of big data requires clear and concise narratives to be complemented by interactive, visual elements (Boyles & Meyer, 2016). Through new media platforms and database technology, data journalism can present multiple levels of information, balancing the reach and relevance of the news.
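The reader-side filtering in an interactive like the Post's amounts to slicing a structured dataset on the fields the interface exposes. Here is a minimal sketch of that idea; the file name and column names are hypothetical, chosen only to mirror the facets listed above.

```python
import pandas as pd

# Assumed columns: state, gender, race, age, armed_with, mental_illness
shootings = pd.read_csv("police_shootings_2015.csv")

# One reader's question, as the interface would express it:
# unarmed people under 25 shot in a given state.
view = shootings[
    (shootings["state"] == "TX")
    & (shootings["age"] < 25)
    & (shootings["armed_with"] == "unarmed")
]
print(len(view), "cases match this reader's filters")
print(view[["age", "gender", "race"]].to_string(index=False))
```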

Intuitive Forms of Expressing Reality: The Fun of Data Visualization

The end product of data journalism is a visual symbol system that changes the traditional language-based narrative of news. From linguistic logic to visual logic, data journalism has undergone a fundamental "pictorial turn" in grammar, form, and manner (Liu, 2016). In expressing abstract information such as time and space, charts convey information through figurative elements and logic (Wang, 2014). The intricate relationships and patterns between data and the world "interpreted" by data are "represented" by data visualizations, satisfying people's visual pleasure. Visual pleasure comes from two sources: the pleasure of "gazing" and that of "fascination." "Gazing" is an active human behavior, and what is gazed at is an object that can satisfy people's visual desire and bring pleasure (Wang, 2015). Data visualization simplifies complex issues that are difficult to visualize or comprehend, satisfying people's need to understand information and bringing the pleasure of knowledge acquisition through "gazing." Data

visualization is inherently aesthetic: whether it is a simple static chart or a complex interactive chart, its design follows certain aesthetic principles and pleases the eyes of the audience. In addition, data visualization is an intuitive, clear, and vivid way to expand dimensions of news reporting that were previously out of reach. Before the advent of data journalism, most news could only be presented in the form of text and images. With a set length and time frame, the amount of information that text can carry is limited, and a single news article usually does not present the full information about an event. Data journalism overcomes this shortcoming by using data visualization to bring more information into a smaller news format, presenting the audience with more complete information about the event (Chen, 2016). For example, in Caixin Media's "Zhou Yongkang's Associated Persons and Properties," narrating the relationships among the people in the story in text would eventually overload the audience with information, whereas data visualization makes it possible to understand those intricate relationships with a single chart. "Fascination" allows the audience to become immersed in data journalism. While a data visualization highlights some of the creator's intentions and messages, it also allows the audience to explore on their own. According to Reception Theory, the audience is itself also a creator, especially when interactive tools allow audiences to explore the content that interests them and to have a personalized experience based on human-computer interaction. Compared with traditional news, which is passively received, there is a clear difference in creating an "anthropomorphic communication": the audience is active, not passive. The satisfaction of this "fascination" allows news reporting to leave behind superficial "speed reading" and to increase the immersion and pleasure of news reading. Data visualization has expanded the narrative space of news reporting and enhanced the freedom of news narrative. Now we may no longer fear the multifaceted and complex nature of news, but rather embrace it (Lupi, 2015).

Generalization of the Connotation of "Data" in Data Journalism

The fundamental characteristic of data journalism is not its scientific nature but its openness. The connotation of "data" in data journalism has quietly changed in the decade since its emergence. We first explore the symbiotic relationship between the open data movement and journalism by looking at the history of data journalism and the initial connotation of the word "data" in data journalism, and then analyze the deeper reasons for the generalization of the connotation of "data."

The "Symbiosis" Between the Open Data Movement and Journalism

Although data journalism is an innovative style of journalism, it is not simply done by the media itself but is the result of the symbiosis between journalism and the open data movement. New data is analyzed and communicated to the public by journalists as soon as it becomes available (Stoneman, 2015). The term “symbiosis”

Understanding Data Journalism 29 originally referred to the mutually beneficial dependence of different organisms, and we use it to describe the interdependent and mutually beneficial relationships formed by different social actors. The Open Data Movement and The Guardian’s Proactive Engagement

The Guardian was a key driver in the open data movement. It has long called on the government to release the data it holds, and it launched the Free Our Data project (Rogers, 2015, p. 25). In 2006, The Guardian's technology department published an article entitled Give us back our crown jewels, asking government-funded or government-authorized agencies such as the Ordnance Survey, the Hydrographic Office, and the Highways Agency to make data available to the public for free, and stating, "Why can't government agencies make it as easy for us to access data as Google Maps or the Xtides program?" (Arthur & Cross, 2006) The "Free Our Data" project was thus started, and The Guardian's technology department set up a dedicated website (www.freeourdata.org.uk) to actively promote it. This project reflects the news media's desire for Open Data, and it is easy to see why The Guardian is a pioneer and active explorer of contemporary data journalism. The Guardian's continued support of the open data movement is reflected in the fact that it now accompanies its data news with raw data that is freely available for public download (Xu, 2016). Simon Rogers sees data journalism as combining a wave of Open Data with a new kind of reporting that goes beyond analyzing data to making it available and showcasing the work of journalists (Rogers, 2014). One of the main goals of The Guardian's data blog is to make journalists' collated data available to the public: "Every day we work with datasets from around the world. We have had to check this data and make sure it's the best we can get, from the most credible sources. But then it lives for the moment of the paper's publication and afterward disappears into a hard drive, rarely to emerge again before updating a year later. So, together with its companion site, the Data Store – a directory of all the stats we post – we are opening up that data for everyone. Whenever we come across something interesting or relevant or useful, we'll post it up here and let you know what we're planning to do with it." (Rogers, 2009)

Support for Journalism by Open Data Promoters

Organizations such as the Sunlight Foundation, the Knight Foundation, the Open Data Institute, and the Data Transparency Promotion Consortium have played a huge role in driving the open data movement. These organizations not only provide guidance to governments wanting to run Open Data projects, but also provide tools, advice, courses, and certification on the use of Open Data (Fang, 2015, p. 57). As “data” is seen as a prerequisite for the production of knowledge, Open Data promoters have extended the idea of open-source to raw data sharing in order

to achieve the democratization of information, make more people aware of it, and involve more people in politics. At the same time, Open Data promoters realize that they have to become data brokers, distilling raw data into knowledge and making it available to the public (Baack, 2015). While "information democracy" is about giving everyone the right to interpret raw data, Open Data advocates are aware that the general public does not have the time and expertise to do so. It is unrealistic to rely solely on one's own ability to achieve information democracy, so external intermediaries are needed. In the open data movement, Open Data promoters apply three criteria when selecting intermediaries: (1) Data-driven: intermediaries should be able to handle large and complex datasets to make them accessible to others. (2) Open: empowering intermediaries should make the data from which they generate stories or build applications available to their audiences; the principle of sharing raw data applies here as well. (3) Engaging: empowering intermediaries should actively involve citizens in public issues, which implies that they should not merely be information providers but should have a cooperative relationship with their audiences (Baack, 2015). Those who met these three criteria were NGOs and professional journalists, so Open Data promoters tried to work with them. For the media, training in professional skills and improving data processing capabilities require financial and manpower investment. Open Data, for all its virtues, cannot always be incorporated into the daily production practices of the media, and the involvement of Open Data promoters has given a boost to data journalism practices. For example, the Knight-Mozilla Fellows project's partner media outlets include The New York Times, The Guardian, The Washington Post, ProPublica, Zeit Online, BBC, Al-Jazeera, and others. The series of leaks created by WikiLeaks forms another part of the open data movement. In 2010, Julian Assange released the "Iraq War Logs" and the "Afghanistan War Logs" in collaboration with The Guardian, The New York Times, and Der Spiegel. Assange believes: "If the media put their resources behind both sets of logs, then the basic cause of openness and freedom of information will have been served. My only real job now was to work with them to get the best out of the material. Well, not my only job: I also had to keep them honest, which quickly proved to be a full-time occupation." (Assange, 2013) WikiLeaks now has partnerships with more than 80 news organizations in over 50 countries around the world (Brevini et al., 2013). In his article How WikiLeaks Outsourced the Burden of Verification, American journalist Craig Silverman writes: "If WikiLeaks had released the documents on its own, the initial debate and coverage would have focused on whether the material was real, thus delaying

any discussion about the contents. By enabling a process of distributed verification, Assange was able to ensure the conversation about the documents moved immediately to the information they contained, not whether they were authentic." (Silverman, 2013) The Knight Foundation, which has played a major role in the open data movement, awarded the Associated Press $400,000 in September 2015 to support public access to critical information and expand the AP's work with data journalism teams across the U.S. (Qiu, 2013). The spread of data journalism has been driven by Open Data promoters; data journalism practitioners have not embraced it entirely of their own accord. Data journalism practices not only publicize the open data movement but are also becoming an integral part of it.

Possible Explanations for the Generalization of the Connotation of "Data" in Data Journalism

In the global spread of data journalism, a noteworthy phenomenon is that the connotation of "data" in data journalism is losing its original "Open Data" color. The fact that the Data Journalism Awards have singled out an "Open Data Award" also indicates that Open Data is not the current "standard" for all data journalism. The public tends to understand "data" from a data science perspective, as including both Open Data and other kinds of data. The reasons for this are twofold.

Different Motivations for the Acceptance of Data Journalism by Journalists

Data journalism is a polysemic term that defies facile academic explication (Lewis & Waters, 2018). If Simon Rogers had called data journalism "Open Data journalism" or defined it more clearly, it might not have led to the current variety of academic and industry claims. Of course, this is only hindsight. When Simon Rogers introduced data journalism in December 2008, it was an innovative practice at The Guardian with no visibility. It has been argued that the use of graphs and data access and analysis, two veins of news production development, had been developing in the first decade of the 21st century but were "overshadowed" by other new technologies. In 2010, The Guardian's "Iraq War Logs," an innovative approach to data journalism in the wave of WikiLeaks, attracted worldwide attention. This led to the first surge of data journalism in the U.K. and U.S. mainstream media (Azzelzouli, 2016), and the concept of data journalism was frequently mentioned after WikiLeaks (Qiu, 2013). When The Guardian's data journalism received attention from the media industry, it was easier to recognize and understand data journalism by its external form. The external form of data journalism is "data + data visualization," so this "intuitive" understanding of data journalism neglects the context of data journalism and the connotation of "data." Perhaps the connotation of "data" is not a key

issue for learners, either. As a journalistic practice, learners are more interested in how to learn and practice it than in what it is. This brings us to the question of what motivates data journalists to embrace data journalism in their work. In general, data journalism practitioners are motivated by two main factors: journalistic responsibility and journalistic innovation. The rise of data journalism is tied to the democratic traditions of Western societies, linked to the ideas and actions of open government, and follows a tradition of investigative reporting in the service of the public (Coddington, 2015). In the West, data journalists have created data as glanceable information, allowing users to quickly and instinctively navigate data journalism as a catalyst for dialogue (Boyles & Meyer, 2016). According to Pew's 2015 survey, a majority of Americans (56%) hope that Open Data can help journalists cover government more thoroughly, and 53% say Open Data can make government officials more accountable; combining those who respond affirmatively to these propositions, 66% of Americans harbor hopes that Open Data will improve government accountability (Horrigan & Rainie, 2015). In Germany, data journalism is framed as a promoter of the democratic political ideal. In Greece, the first experiments in data journalism were carried out in a context of deep economic crisis and citizen skepticism toward political leaders and public institutions, with data journalists playing the role of tutors teaching new professional routines and information research methods (Charbonneaux & Gkouskou-Giannakou, 2015). In Sweden, data journalism is practiced by most of the mainstream media, including the public broadcasters (Stoneman, 2015). The relationship between data journalism and audiences based on journalistic responsibility is positive: it reaffirms people's belief that the media is still doing investigative and enterprise reporting in their communities (Boyles & Meyer, 2016). Data journalism is not just rhetoric, but a more capable approach to sacred truths and a greater commitment to the social community (Li, 2015). The public has high hopes for data journalism not only because of its innovative forms but also because of the elements of justice and fairness inherent in data disclosure, which helps news organizations engage in investigative reporting in a more economical way, so that they can better monitor government and promote democracy. In countries and regions where Open Data is not as widespread, data journalism practices are mostly based on journalistic innovation. Fama, a Malaysian freelance journalist, argues: "Data journalism, like all good journalism, is not well represented in the vast majority of the Malaysian media. Journalists, for the most part, do not look for stories, they report what their bosses, and ultimately the government, tells them to." (Winkelmann, 2013) The factors driving the development of data journalism in China are mainly commercial and professional considerations, and the very different context from

the U.K. and the U.S. has limited the role of data journalism in democratic politics in China to some extent (Fang et al., 2016). One researcher's in-depth interviews with early practitioners of data journalism in China found that "they ventured into the idea of doing data journalism because they were influenced by some cutting-edge international media, such as The Guardian in the U.K. and The New York Times in the U.S. The adoption of data journalism is mainly an imitation of the characteristics of Western journalism" (Li, 2017). It is thus easy to understand why Chinese data journalism practitioners focus on "visualization" rather than the highly specialized quantitative and in-depth reporting that is "data-driven" (Fang et al., 2016).

Open Data is Not Sufficient to Support Daily Data Journalism Production

The data used in data journalism generally have two characteristics: they are of public interest, and they are newsworthy. The assumption of the open data movement is that a government opening up its data can form a virtuous circle: the more extensive the data the government opens up, the less costly and risky social innovation becomes, for example, by avoiding repeated data collection; the more capable the data users are, the more effective the use of data can be and the more value can be created; this, in turn, motivates the government to open up even more extensive data (Du & Cha, 2016). Conversely, the lack of open government data can hinder the advancement of data journalism practices (Fang et al., 2016). A large number of government datasets are still held within the administrative system and are not made available to society in a timely manner (Tang & Liu, 2014). There are problems such as serious data fragmentation, a lack of uniform standards for data platforms, and the reluctance of some data-rich departments to share (Sun et al., 2015). The 2016 Open Data Barometer report shows that the 92 countries and regions included in the evaluation have released some government data, but only 10% of the 1,380 government datasets sampled are fully open, and many of these Open Datasets still have quality problems (World Wide Web Foundation, 2015). Much of the data for the U.K., which has the highest rating in the Open Data Barometer report, is also static. The most accessed road safety data, for example, could only be retrieved in its entirety for 2013 and earlier as of September 2015, with no data published for the first half of 2015 or the whole of 2014 (Stoneman, 2015). So even in the U.K. and U.S., where the degree of Open Data is high, data journalists often need to request the data they want through the Freedom of Information Act (FOIA) (Stoneman, 2015). When Open Data is not enough for the daily production of data journalism, expanding the boundaries of data is a viable strategy for data journalists. With the coming of the Big Data Era, the processing of massive amounts of unstructured data has also become an essential part of news production, and the data used by data journalism has inevitably changed. It has also been argued that the underutilization

of Open Data in data journalism is also due to the lack of collaboration between Open Data promoters and the media. The open data movement has so far not seen journalism as crucial; a more formal alliance between journalists and open-data advocates would be a very powerful thing, yet Open Data remains outside the mainstream of journalism (Stoneman, 2015). The key to understanding and defining the boundaries of data journalism is to understand the meaning of data and the context in which it was created. Data journalism is not just about data visualization (Bell, 2015), but also about the political ethos and objectives embedded in it. If data journalism is understood simply through the lens of data science, its boundaries will ultimately become blurred and generalized. Data journalism has been described as an innovative practice not only because it uses data visualization to solve the problem of abstract data that journalism had previously been unable to handle, but also because it is a bridge between Open Data and the public, finding facts in data, using data to create value, and using data visualization to achieve the "functional significance" (Crawford, 2012, p. 41) of journalism in monitoring governments and promoting dialogue.

References

Altschull, J. H. (1988). Agents of power (Y. Huang & Z. Qiu, Trans.). Huaxia Publishing House. (Published in China.)
Arthur, C., & Cross, M. (2006). Give us back our crown jewels. The Guardian. www.theguardian.com/technology/2006/mar/09/education.epublic
Assange, J. (2013). Assange's autobiography: The secrets that cannot be kept secret (14). http://szsb.sznews.com/html/2013-08/02/content_2574446.htm
Azzelzouli, O. (2016). Data journalism meets UK hyperlocal media: What's hindering the potential? Data Journalism. http://datadrivenjournalism.net/news_and_analysis/data_journalism_meets_uk_hyperlocal_media_whats_hindering_the_potential
Baack, S. (2015). Datafication and empowerment: How the open data movement re-articulates notions of democracy, participation, and journalism. Big Data & Society, 2(2), 1–11. https://doi.org/10.1177/2053951715594634
Bai, G., & Ren, J. (2016). Comparison and reexamination of traditional news and data journalism. Social Sciences in Yunnan (1), 186–188. (Published in China.)
Batsell, J. (2015). For online publications, data is news. Nieman Reports. http://niemanreports.org/articles/for-online-publications-data-is-news/
BBC NEWS (2010). David Cameron to make more government data available. BBC NEWS. www.bbc.com/news/10195808
Bell, E. (2012). Journalism by numbers. Columbia Journalism Review. www.cjr.org/cover_story/journalism_by_numbers.php
Bell, M. (2015). What is data journalism? Vox. www.vox.com/2015/2/4/7975535/what-is-data-journalism
Bi, Q. (2016). Open data applications in data journalism. Hubei Social Sciences (7), 190–194. (Published in China.)
Bignell, J. (2012). Media semiotics: An introduction (B. Bai & L. Huang, Trans.). Sichuan Education Publishing House. (Published in China.)
Bowers, T. A. (1976). "Precision journalism" in North Carolina in the 1800s. Journalism & Mass Communication Quarterly, 53(4), 738–740. https://doi.org/10.1177/107769907605300422

Boyles, J. L., & Meyer, E. (2016). Letting the data speak: Role perceptions of data journalists in fostering democratic conversation. Digital Journalism, 4(7), 944–954. https://doi.org/10.1080/21670811.2016.1166063
Bradshaw, P. (Ed.). (2016). The inverted pyramid of data journalism. Online Journalism Blog. https://onlinejournalismblog.com/2011/07/07/the-inverted-pyramid-of-data-journalism/
Brevini, B., Hintz, A., & McCurdy, P. (2013). Amy Goodman in conversation with Julian Assange and Slavoj Žižek. In B. Brevini, A. Hintz, & P. McCurdy (Eds.), Beyond WikiLeaks: Implications for the future of communications, journalism and society. Palgrave Macmillan.
Charbonneaux, J., & Gkouskou-Giannakou, P. (2015). "Data journalism", an investigation practice? A glance at the German and Greek cases. Brazilian Journalism Research, 11(2), 244–267.
Chen, S. (Ed.). (2016). Associated press: Data journalism breaks through traditional news reporting. Tou Tiao. www.toutiao.com/i6260807587923493378/ (Published in China.)
Chen, X. (2011). The change of media power from the development of the media. Modern Audio-Video Arts (6), 11–13. (Published in China.)
Shirky, C. (2008). Here comes everybody: The power of organizing without organizations. Penguin. Cited in Zhou, H., & Wu, X. (2015). Rethinking the crisis of journalism: The power of culture – Professor Jeffrey Alexander's reflections on the sociology of culture. Shanghai Journalism Review (3), 4–12. (Published in China.)
Coddington, M. (2015). Clarifying journalism's quantitative turn: A typology for evaluating data journalism, computational journalism, and computer-assisted reporting. Digital Journalism, 3(3), 331–348. https://doi.org/10.1080/21670811.2014.976400
Crawford, C. (2012). Chris Crawford on interactive storytelling. New Riders.
Dai, S., & Han, X. (2015). Value-added and "alienation": Thinking about the value of news in the paradigm of data journalism. Journalism Research (3), 42–43. (Published in China.)
Du, Z., & Cha, H. (2016). On opening government data to the public. Journal of Capital Normal University (Social Sciences Edition) (5), 74–80. (Published in China.)
Edge, A. (Ed.). (2015). How to enhance your stories with data. Journalism.co.uk. www.journalism.co.uk/news/how-to-enhance-your-stories-with-data-/s2/a564923/
European Journalism Centre (Ed.). (2010). Data driven journalism: What is there to learn? http://mediapusher.eu/datadrivenjournalism/pdf/ddj_paper_final.pdf
Fang, J. (2015). Introduction to data journalism. China Renmin University Press. (Published in China.)
Fang, J., Hu, Y., & Fan, D. (2016). Data journalism practice in the eyes of journalists: Value, path and prospect – A study based on in-depth interviews with seven journalists. Journalism Research (2), 74–80. (Published in China.)
Fang, J., & Yan, D. (2013). Data journalism: Theory and practice. Chinese Journal of Journalism & Communication, 35(6), 73–83. (Published in China.)
Feng, Q. (2001). Great dictionary of philosophy. Shanghai Dictionary Publishing House. (Published in China.)
Gans, H. (2009). Deciding what's news: A study of CBS evening news, NBC nightly news, Newsweek, and Time (L. Shi & H. Li, Trans.). Peking University Press. (Published in China.)
Gieryn, T. F. (1983). Boundary-work and the demarcation of science from non-science: Strains and interests in professional ideologies of scientists. American Sociological Review, 48(6), 781–795. www.jstor.org/stable/2095325?origin=crossref
Gray, J., Chambers, L., & Bounegru, L. (2012). The data journalism handbook. O'Reilly Media.
Green-Barber, L. (Ed.) (n.d.). Beyond clicks and shares: How and why to measure the impact of data journalism projects. Data Journalism. https://datajournalismhandbook.org/handbook/two/situating-data-journalism/beyond-clicks-and-shares-how-and-why-to-measure-the-impact-of-data-journalism-projects

Greenslade, R. (Ed.). (2010). "Data journalism" scores a massive hit with Wikileaks revelations. The Guardian. www.theguardian.com/media/greenslade/2010/jul/26/press-freedom-wikileaks
Guo, P. (2016). Big data and news expression. New Media (7), 45. (Published in China.)
Holovaty, A. (Ed.). (2006). A fundamental way newspaper sites need to change. Holovaty. www.holovaty.com/writing/fundamental-change/
Holovaty, A. (Ed.). (2009). The definitive, two-part answer to "is data journalism?" Holovaty. www.holovaty.com/writing/data-is-journalism/
Horrigan, J. B., & Rainie, L. (Ed.). (2015). Americans' views on open government data. Pew Research Center. www.pewinternet.org/2015/04/21/open-government-data/
Howard, A. B. (Ed.). (2014). The art and science of data-driven journalism. The Tow Center. http://towcenter.org/wp-content/uploads/2014/05/Tow-Center-Data-Driven-Journalism.pdf
Hu, Y., Pratt, & Chen, L. (2016). Data myths in the U.S. election news. The Press (23), 133–135. (Published in China.)
Huang, C. (2013). Big data strategies in converging coverage of complex issues: The example of the Guardian's "Unravelling the riots" feature. Journalism and Mass Communication Monthly (21), 9–15. (Published in China.)
Huang, J. (2015). Observations on news producing mechanism: From precision to data. Chongqing Social Sciences (9), 100–105. (Published in China.)
Huang, X. (2016). The ontological assumption and objective nature of big data. Studies in Philosophy of Science and Technology, 33(2), 90–94. (Published in China.)
Janssen, K., & Darbishire, H. (Ed.). (2012). Using open data: Is it really empowering? www.w3.org/2012/06/pmod/pmod2012_submission_39.pdf
Ji, H. (2005). The technical history of hacker culture. Journal of Social Sciences (5), 124–128. (Published in China.)
Jiang, H. (2011). What journalism can do for democracy: A brief comment on Why democracies need an unlovable press by Michael Schudson. Contemporary Communication (2), 57–60. (Published in China.)
Jonathan, Z. (2014). From big data to data journalism. New Media and Society (4), 11–13. (Published in China.)
Jordan, W. (Ed.). (2014). British people trust Wikipedia more than the news. YouGov. https://yougov.co.uk/topics/politics/articles-reports/2014/08/09/more-british-people-trust-wikipedia-trust-news
Kayser-Bril, N. (Ed.). (2015). Datajournalism. Nicolas Kayser-Bril. http://blog.nkb.fr/datajournalism
Kayser-Bril, N. (Ed.). (2016). Celebrating 10 years of data journalism. Nicolas Kayser-Bril. http://blog.nkb.fr/ten-years-datajournalism
Kovach, B., & Rosenstiel, T. (2014). Blur: How to know what's true in the age of information overload (J. Lu & Z. Sun, Trans.). China Renmin University Press. (Published in China.)
Lewis, N. P., & Waters, S. (2018). Data journalism and the challenge of shoe-leather epistemologies. Digital Journalism, 6(6), 719–736. https://doi.org/10.1080/21670811.2017.1377093
Lewis, S. C. (2012). The tension between professional control and open participation: Journalism and its boundaries. Information, Communication and Society, 15(6), 836–866. https://doi.org/10.1080/1369118X.2012.674150
Li, L. (2005). The free software movement and the spirit of scientific ethics. Journal of Shanghai Normal University (Philosophy and Social Science Edition), 34(6), 39–44.

Li, Y. (2015). Data journalism: The logic of reality and the nature of the "field". Modern Communication (Journal of Communication University of China), 37(11), 47–52. (Published in China.)
Li, Y. (2017). Wandering between the open and conservative strategies: How news organizations adopt data journalism under logic of uncertainty. Journalism & Communication, 24(9), 40–60. (Published in China.)
Li, Y., & Li, S. (2015). Data journalism: "Telling a good story"? – How data journalism inherits and transforms traditional journalism. Journal of Zhejiang University (Humanities and Social Sciences), 45(6), 106–122. (Published in China.)
Li, Y., & Zhou, J. (2015). Data journalism: A Chinese and foreign way of expression. News Research (11), 49–50. (Published in China.)
Liu, T. (2016). China in western data journalism: Searching for an analytic framework of visual frame based on visual rhetoric. Journalism & Communication, 23(2), 5–28. (Published in China.)
Liu, Y., & Lu, Z. (2014). Differences between Chinese practice and foreign practice of data journalism. China Publishing Journal (2), 29–33. (Published in China.)
Loosen, W. (2015). The notion of the "Blurring Boundaries". Digital Journalism, 3(1), 68–84. https://doi.org/10.1080/21670811.2014.928000
Lupi, G. (2015). The architecture of a data visualization. https://medium.com/accurat-studio/the-architecture-of-a-data-visualization-470b807799b4#.jp1jkufua
Ma, L. (2013). Boundary studies in STS – From scientific demarcation to boundary organization. Philosophical Trends (11), 83–92. (Published in China.)
Manovich, L. (2001). The language of new media. The MIT Press.
Mao, C. (Ed.). (2014). The fourth power in the era of big data: The application of officials' property visualization. http://djchina.org/2014/04/18/argentina-declaraciones-juradas/ (Published in China.)
Marzouk, L., & Boros, C. (2018). Getting started in data journalism. Balkan Investigative Reporting Network in Albania. https://birn.eu.com/wp-content/uploads/2018/08/Datajournalism-single-page.pdf
Newman, N., Fletcher, R., Levy, D. A. L., & Nielsen, R. K. (Ed.). (2016). Digital news report 2016. Reuters Institute for the Study of Journalism. http://media.digitalnewsreport.org/wp-content/uploads/2018/11/Digital-News-Report-2016.pdf?x89475
NICAR (Ed.). (n.d.). About IRE. Investigative Reporters and Editors. www.ire.org/nicar/about/
Open Data Handbook (Ed.). (2011). What is open data? Open Knowledge Foundation. http://opendatahandbook.org/guide/en/what-is-open-data/
Pan, X. (2011). A media sociological interpretation of the self-media revolution. Contemporary Communication (6), 25–27. (Published in China.)
Parasie, S. (2019). Data-driven revelation? Digital Journalism, 3(3), 364–380. https://doi.org/10.1080/21670811.2014.976408
Peng, L. (2015). What does the encounter of data and journalism bring? Journal of Shanxi University (Philosophy and Social Science Edition) (2), 64–70. (Published in China.)
The Pulitzer Prize (Ed.). (2009). The 2009 Pulitzer Prize winner in national reporting. The Pulitzer Prizes. www.pulitzer.org/finalists/st-petersburg-times
Qian, J. (2016). As open source data journalism. Journalism Research (2), 6–12. (Published in China.)
Qian, J., & Zhou, J. (2015). From emergence to diffusion: Data journalism from a social practice perspective. Shanghai Journalism Review (2), 60–66. (Published in China.)
Qiu, L. (Ed.). (2013). New role in the age of big data. Lifeweek. www.lifeweek.com.cn/2013/1008/42726_2.shtml (Published in China.)

Rogers, E. M. (1983). Diffusion of innovations (3rd ed.). The Free Press.
Rogers, S. (Ed.). (2009). Welcome to the datablog. The Guardian. www.theguardian.com/news/datablog/2009/mar/10/blogpost1
Rogers, S. (Ed.). (2014). Hey wonk reporters, liberate your data! Mother Jones. www.motherjones.com/media/2014/04/vox-538-upshot-open-data-missing
Rogers, S. (2015). Facts are sacred: The power of data (Y. Yue, Trans.). China Renmin University Press. (Published in China.)
Rogers, S. (Ed.). (2016). Data journalism matters more now than ever before. Simon Rogers. https://simonrogers.net/2016/03/07/data-journalism-matters-more-now-than-ever-before/
Rogers, S. (Ed.). (2018). Turning official figures into understandable graphics. The Guardian. www.theguardian.com/help/insideguardian/2008/dec/18/unemploymentdata
Rogers, S., & Gallagher, A. (Eds.). (2013). What is data journalism at the Guardian? The Guardian. www.theguardian.com/news/datablog/video/2013/apr/04/what-is-data-journalism-video
Sambrook, R. (Ed.). (2010). Journalists can learn lessons from coders in developing the creative future. The Guardian. www.theguardian.com/media/2014/apr/27/journalists-coders-creative-future
Schudson, M. (2013). Reluctant stewards: Journalism in a democratic society. Daedalus, 142(2), 159–176. https://doi.org/10.1162/DAED_a_00210
Shen, H., & Luo, C. (2016). Data journalism: Historical prospect from the perspective of modernity. Journalism Research (2), 1–5. (Published in China.)
Shi, L., & Zeng, Y. (2014). Data journalism in the perspective of convergent communication. Journal of Sichuan Normal University (Social Sciences Edition), 41(6), 143–147. (Published in China.)
Silverman, C. (Ed.). (2013). How WikiLeaks outsourced the burden of verification. Columbia Journalism Review. https://archives.cjr.org/campaign_desk/how_wikileaks_outsourced_the_b.php
Sun, H., Nan, T., Wei, H., Li, J., & Ma, Y. (Eds.). (2015). Government monopoly leads to serious waste of public data tied up in a high shelf. China News. www.chinanews.com/cj/2015/02-25/7076397.shtml (Published in China.)
Stiles, M., & Babalola, N. (Eds.). (2010). Memorial data. The Texas Tribune. www.texastribune.org/2010/05/31/texas-tribune-database-library-update/
Stoneman, J. (Ed.). (2015). Does open data need journalism? Oxford University Research Archive. https://ora.ox.ac.uk/objects/uuid:c22432ea-3ddc-40ad-a72b-ee9566d22b97
Stray, J. (Ed.). (2011). A computational journalism reading list. Jonathanstray. http://jonathanstray.com/a-computational-journalism-reading-list
Su, H., & Chen, J. (2014). From computing to data journalism: Origins, development, and current status of computer-assisted reporting. Journalism & Communication, 21(10), 78–92. (Published in China.)
Sunne, S. (Ed.). (2016). Diving into data journalism: Strategies for getting started or going deeper. American Press Institute. www.americanpressinstitute.org/publications/reports/strategy-studies/data-journalism/
Swift, A. (Ed.). (2016). Americans' trust in mass media sinks to new low. Gallup. https://news.gallup.com/poll/195542/americans-trust-mass-media-sinks-new-low.aspx
Tang, S., & Liu, Y. (2014). "Impression" of global government data openness – An interpretation of the open data barometer. Foreign Investment in China (5), 28–31. (Published in China.)
Tauberer, J. (Ed.). (2014). Open government data definition: The 8 principles of open government data. Open Government Data (The Book). https://opengovdata.io/2014/8-principles/

Tu, Z. (2013). Big data. Guangxi Normal University Press. (Published in China.)
Waite, M. (Ed.). (2007). Announcing PolitiFact. Mattwaite. www.mattwaite.com/posts/2007/aug/22/announcing-politifact/
Wang, B., & Ma, H. (2015). Analytical framework of open government theory: Concepts, policies, and governance. Information and Documentation Services (6), 35–39. (Published in China.)
Wang, G. (2016). New technology application and the shape of news production – A discussion of "what determines news". Journal of Southwest Minzu University (Humanities and Social Science), 37(10), 158–162. (Published in China.)
Wang, H. (2013). Analysis of the alienation of media power. Shandong Social Science (5), 128–130. (Published in China.)
Wang, X. (2015). Gazing at pleasure – The visualization tendency of consumer culture. Popular Literature and Art (2), 262–263. (Published in China.)
Wang, X. (Ed.). (2017). The past and present of data journalism. People's Daily Online. http://media.people.com.cn/n/2014/0710/c386639-25265252.html (Published in China.)
Wang, Y. (2014). Time visualization design in infographic design [Master's dissertation, Nanjing Normal University]. CNKI Theses and Dissertations Database. https://kns.cnki.net/kcms2/article/abstract?v=3uoqIhG8C475KOm_zrgu4lQARvep2SAkbl4wwVeJ9RmnJRGnwiiNVu2_82Y63SZUO3JagsdJJRhCjo7fAcyJ2w77ggNwKbaa&uniplatform=NZKPT (Published in China.)
Wei, J., & Jiang, P. (2014). The statistical connotation of data science. Statistical Research, 31(5), 4–8. (Published in China.)
Wen, W., & Li, B. (2013). Data journalism reports in the era of big data from the US presidential election. Chinese Journalist (6), 80–81. (Published in China.)
Wing, J. M. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35. https://doi.org/10.1145/1118178.1118215
Winkelmann, S. (Ed.). (2013). Data journalism in Asia. Konrad Adenauer Stiftung. www.kas.de/documents/252038/253252/7_dokument_dok_pdf_35547_2.pdf/9ecd0cfc-9d30-0967-1d7e-04dd9c7c66de?version=1.0&t=1539655194206
World Wide Web Foundation (Ed.). (2015). Executive summary and key findings. Open Data Barometer. http://opendatabarometer.org/3rdEdition/report/#executive_summary
Wu, X. (2016). Data journalism: A historical picture from the perspective of modernity. Journalism Research (2), 120–126. (Published in China.)
Xu, D. (2016). Data journalism: Development status and trends. China Publishing Journal (10), 12–15. (Published in China.)
Xu, J., & Wang, W. (2014). Structured, relational open data and its application. Information Studies: Theory & Application, 37(2), 53–56. (Published in China.)
Xu, Y., & Wang, C. (2015). From "quote" to "data": The revolutionary shift from traditional to data journalism. Chinese Journalist (7), 122–123. (Published in China.)
Yan, T., & Li, M. (2016). Data journalism: The innovation and expansion of news reporting. Journal of Xi'an Jiaotong University (Social Sciences), 36(2), 119–126. (Published in China.)
Yau, N. (2014). The beauty of data: Learning visual design in one book (X. Zhang, Trans.). China Renmin University Press. (Published in China.)
Yuan, M., & Qiang, Y. (2016). Review and prospect of data journalism research in China. Journal of Zhengzhou University (Philosophy and Social Sciences Edition), 49(2), 130–135. (Published in China.)
Zamith, R. (2019). Transparency, interactivity, diversity, and information provenance in everyday data journalism. Digital Journalism, 7(4), 470–489. https://doi.org/10.1080/21670811.2018.1554409

Zhang, J. (2016). From "digitization" to "datafication": Deconstruction and reconstruction of data journalism narrative mode. China Publishing Journal (8), 39–43. (Published in China.)
Zhang, N. (2007). Phylogenetic method and historical research. Collected Papers of History Studies (5), 43–50. (Published in China.)
Zhao, J., Di, T., Zhang, P., & Xing, X. (2016). Open data website: A new way to open data. E-Government (3), 109–117. (Published in China.)
Zhao, L., & Ni, N. (2016). Variation and reshaping of journalism border from perspective of social system theory. Academic Journal of Zhongzhou (5), 168–172. (Published in China.)
Zheng, L. (2016). The real-life dilemma of open data. New Media (4), 48–49. (Published in China.)
Zhu, C., & Chen, F. (2011). Contemporary features of the development of context theory and philosophy of technology. Studies in Philosophy of Science and Technology, 28(2), 21–25. (Published in China.)
Zotto, C. D., Schenker, Y., & Lugmayr, A. (Eds.). (2015). Data journalism in news media firms: The role of information technology to master challenges and embrace opportunities of data-driven journalism projects. AIS eLibrary. http://aisel.aisnet.org/ecis2015_rip/49/

3 Multiple Models of Data Journalism Production

The production of data journalism is guided by the concept of openness, which is reflected in the openness of the resources used (open data), the technologies used (open-source software), the forms of cooperation (crowdsourcing, outsourcing, etc.), the creative representation of the news, and many other aspects. Data journalism production is highly inclusive and innovative and is an integral part of Open Journalism. At the same time, data journalism production can be divided into different levels. Alberto Nardelli, The Guardian's current data editor-in-chief, divides the paper's daily data journalism into three categories: quick and relevant numbers, data investigations to support stories, and making complex issues understandable (Reid, 2015). The openness of data journalism and its hierarchical differences inevitably bring about a diversity of production models. Based on the way producers integrate internal and external resources and collaborate with them, we classify data journalism production models into five types: the endogenous model, the outsourcing model, the crowdsourcing model, the hackathon model, and the self-organizing model. The first two are often used by the media in the daily production of data journalism.

Endogenous Model

What is the definition of an endogenous model? Under what circumstances can it be applied? How should we evaluate it? In this section, we explore these questions.

Definition of the Endogenous Model

The concept of "endogenous development" in economics refers to the process of local social mobilization by an organization that brings together various interest groups to pursue a strategic planning process and resource allocation mechanism that meets local aspirations and develops local capabilities in terms of skills and qualifications (Zhang et al., 2007). There is a method of data journalism production similar to endogenous development in economics, which we call the endogenous model: the media absorb, integrate, optimize, and enhance internal resources for data news production.

Most media outlets make up for the lack of their own production capacity by bringing in people with strong data processing skills, training employees, and encouraging them to learn on their own. The endogenous model is suitable for self-sufficient news producers, i.e., producers who have all the staffing and resources necessary for data news production. The New York Times, The Guardian, The Washington Post, The Wall Street Journal, The Economist, ProPublica, FiveThirtyEight, the BBC, Caixin, The Paper, and other media outlets mostly adopt this model for their daily data news production.

Team Building of the Endogenous Model

While the endogenous model relies on the media's own resources, it also adheres to the concept of openness, i.e., openness to those within the media. In the endogenous model, the traditional newsroom-centered working model is broken. News production requires cross-border integration in which people in various positions have to understand each other from the very beginning. This has made programmers, who used to sit downstream of the news production line doing support work, key or even central, and the line between editorial as insider and technology as outsider has blurred. At The Washington Post and the Chicago Tribune, teams of "embedded developers" have emerged (Qian & Zhou, 2015).

The team building of the endogenous model can be divided into two types: team-based and individual-based. For the team-based type, data journalism production teams are formed by members within the media. Such teams can be either professional or ad hoc, depending on the strategic positioning of data journalism in different media. For professional teams, there are independent departments in the media responsible for data journalism production: for example, the data news team of The Paper, the Interactive News Technologies department of The New York Times, and the data news team of the BBC. For ad hoc teams, data journalism production staff are dispersed across media departments and combined on an ad hoc basis. Such teams can be long-term or short-term: long-term ad hoc teams can be formed quickly by people from various departments when a task requires them, with members returning to their departments when it is completed, while short-term teams are formed entirely for a particular task. Caixin uses ad hoc teams to produce interactive data news, drawing people from different departments to work together on a task. Since members from different departments represent their original departments, coordination between departments becomes intra-team coordination, so such teams consume a lot of time in the early stages (Wang, 2014, p. 228). Guo Junyi, head of CCTV's big data journalism project, pointed out the limitations of this model:

When we worked on big data journalism projects, all of the team's copy editors, video editors, studio production staff, and even anchors were called in at short notice. The team was disbanded after the project ended, and we called

these people back to work together on the next project. This model, while flexible, is not professional enough. More than a year of producing big data news has made me realize that big data journalism is very specialized and requires the cooperation of professionals from various fields. (Guo, 2018)

For the individual-based type, data journalism is produced entirely by one journalist, which requires a full set of data journalism skills, including topic selection, data collection, data analysis, news writing, and data visualization. In Facts Are Sacred, Simon Rogers refers to the "short-form of data journalism" in The Guardian's daily data journalism production, where journalists very quickly identify key data, analyze it as the story is happening, and guide the audience to participate in it (Rogers, 2015). Caixin, The Paper, The Guardian, The New York Times, The Washington Post, and other media outlets usually adopt the individual type of production for relatively simple data news. Journalists often produce simple analyses based on secondary data or interpret small-scale data from primary sources, and there are usually fixed templates for this type of data journalism. The Guardian has templates with a choice of chart styles, color palettes, and interactive visualizations that journalists can use to produce data news quickly, and the BBC and the Financial Times have similar ones (Dick, 2014); a sketch of what such a template might look like appears at the end of this subsection.

The short form of data journalism has adapted to the fast-paced work of the news media, but it also poses problems. First, the ability of data journalism to provide complex narratives is forgotten, and data journalism is in danger of being reduced to "number" journalism. The audience's perception of data journalism is formed through the works they encounter; if the short form of data journalism (or even some "pseudo-data journalism") dominates the mainstream, it will inevitably affect the audience's evaluation of data journalism. Second, it is not conducive to the deep digital transformation of the media. Data journalism and media digital transformation are not contradictory: investigative data journalism and large-scale interactive data visualizations require program developers, who are also the key force of convergent journalism. Improving the depth and scale of data journalism works can raise the overall level of media content, while low-level data journalism contributes nothing to the deep digital transformation of the media and is merely an unhelpful embellishment. In the process of developing data journalism, some media have departed from the purpose data journalism had when it emerged (Fang & Gao, 2015).
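To make the idea of such templates concrete, here is a minimal sketch in Python of what a house chart template might look like. The palette, styling, and function name are invented for illustration and are not any outlet's actual template; the point is that a journalist supplies only the data, headline, and source line.

import matplotlib.pyplot as plt

HOUSE_PALETTE = ["#005689", "#c05303", "#6a7a28"]  # invented "house" colors

def quick_bar_chart(labels, values, headline, source, filename="chart.png"):
    """Render a bar chart in the house template and save it to disk."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, values, color=HOUSE_PALETTE[0])
    ax.set_title(headline, loc="left", fontsize=13, fontweight="bold")
    for side in ("top", "right"):          # house style: minimal frame
        ax.spines[side].set_visible(False)
    fig.text(0.01, 0.01, f"Source: {source}", fontsize=8, color="gray")
    fig.savefig(filename, dpi=150, bbox_inches="tight")

# A journalist supplies only the data, headline, and source:
quick_bar_chart(["2013", "2014", "2015"], [120, 95, 140],
                "Example headline", "Example statistics office")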

A SWOT Analysis of the Endogenous Model

The applicability of the endogenous model needs specific analysis. Generally speaking, large media outlets tend to adopt the endogenous model because of their complete staffing and highly capable personnel. Some small or local media may also adopt it, but their low production level and output can reduce data journalism to something of little value that fails to enhance the media brand as expected. We systematically analyze the

Strengths, Weaknesses, Opportunities, and Threats of the endogenous model using the SWOT analysis framework, first proposed by Kenneth R. Andrews of Harvard Business School in 1971 in The Concept of Corporate Strategy to aid strategic decision-making (Ni & Wu, 2001). Which model to adopt requires a joint decision by the media and data journalism practitioners. We use SWOT analysis to systematically compare the differences between the endogenous, outsourcing, and crowdsourcing models. It is important to note that the specifics vary across media; we address the general case here.

Strengths of the Endogenous Model

First, the endogenous model allows the media to keep gatekeeping data journalism production. The media can control the entire production chain from topic selection to publication and can effectively manage the frame, content, and angle of the stories. In our interviews with data journalists in China, we found that gatekeeping is the keyword in data journalism production, as it is in all aspects of news production. Data journalism practitioners can control the entire process to ensure that data journalism is produced in accordance with editorial standards and that the content of reports is adjusted when problems are encountered.

In-depth interview (Editor A, head of data journalism of media B, online interview): For example, when we checked the domestic rocket launch records, the data provided by authorities such as the China Academy of Launch Vehicle Technology, China Great Wall Industry Corporation, and China Aerospace Science and Technology Corporation were inconsistent, so we decided not to write the news related to this.

Second, the endogenous model can effectively integrate resources within the media and promote communication and integration between different positions. Traditionally, the relationship between media positions is mostly upstream and downstream: journalists are upstream while technicians and editors are downstream, which means there is communication between positions but their status is unequal, as journalists are the "protagonists" of news production while technicians and editors tend to work on the basis of journalists' needs. Although the relationship between the various jobs in data journalism can also be upstream and downstream, the more successful endogenous model is one in which journalists, technicians, and editors work together from the topic selection phase. The experience of the first Data Journalism Awards winners in 2012 showed that data journalism competes on teamwork, not team size (Xu & Wan, 2013). "News editors should bring data reporters in at the beginning of the story so that they can start identifying what needs to be developed for the project" (Au, 2016). Programmers need to judge the difficulty and cost based on the data journalist's choice of topic and research questions (Huang & Zhang, 2016).

The endogenous model breaks down the barriers between departments and positions, enabling the exchange and integration of personnel with different disciplinary backgrounds, promoting mutual understanding and skills learning, and to a certain extent changing the previous situation in which each position knew little about the others.

Weaknesses of the Endogenous Model

First, the level of data journalism in the endogenous model is constrained by the media's own capabilities. There is still a big gap between media and specialized companies in data processing and data visualization. If data journalists rely solely on the endogenous model, their level is greatly constrained by their own capabilities, especially the quality of their personnel.

In-depth interview (Reporter C, working for a newspaper in Hunan Province, doing data journalism, online interview): We don't have programmers and can only do some simple data tables.

Some data news in China is relatively simple, often presented as "one chart," reflecting the currently low production capacity of some media. With limited funds and manpower, some media outlets create teams that strive to improve data journalism production through self-learning. Journalists can pick up simple data journalism skills on their own, but higher-level production cannot be self-taught, so the fastest ways to improve within the endogenous model are to train staff or to recruit professionals directly. Aron Pilhofer, editor-in-chief of digital content at The Guardian, argues from an examination of data journalism in the United States that "training has been incredibly impactful in the U.S. to spread data journalism, there's hardly a newsroom in the U.S. that doesn't have a specialist in this area, and training is a huge reason for that" (Reid, 2014).

In-depth interview (Editor A, head of data journalism of media B, online interview): Some of us are from the University of Michigan and Nagoya University, both majoring in data journalism. Data journalism training abroad is already very systematic, and if China still can't produce competent data journalism students, our team will face pressure for subsequent development.

Second, the endogenous model puts new requirements on media production management. The effective operation of the interdisciplinary, cross-departmental teams required for data journalism production places high demands on media management: what used to be intra-departmental management has become cross-departmental team management. According to Pilhofer, one of the problems with data journalism production is that "it's hard to assemble teams that are cross-disciplinary" and "have the right combination of designers, developers, graphic artists, photographers" (Reid, 2014).

The assessment of journalists' work is a real problem in the management of media production. How to treat each participant fairly in data journalism production tests the soundness of the assessment mechanism. In traditional news production, the workload of journalists is mostly calculated by the number of published works, and some media add other criteria, such as the number of retweets, viewership, whether the journalist traveled, and the quality of the articles. In some media, the workload of a multi-person collaborative work is credited to each person as an average. A data journalism director said it would be detrimental to future cooperation if the workload of different people were distinguished too sharply, so she adopted a more flexible approach, somewhere between strictly apportioning each person's workload and crediting everyone equally. In addition to the problem of workload distribution among team members, there are also differences in assessment standards between data news and other news:

The workload of the editorial staff has increased significantly. In the past, a 1,000-word news release could be completed in an hour, but today a piece of data news may take ten or a hundred times longer than before, requiring preferential treatment in terms of salary to guarantee the production of data news. (Zou, 2015)

Some data journalists abroad also regret "the lack of standardization in data journalism," as "there is no equivalent to the payment by wordcount which print journalists have so long worked by" (Bradshaw, 2016). Some data journalism works are simple, some are complex, and some are abandoned midway; how should such different works be assessed? In addition, some ad hoc projects temporarily pull staff away; how to ensure that newsgathering is not greatly affected is a question media managers need to consider.

Opportunities of the Endogenous Model

First, there is a huge demand for simple data journalism in everyday news production. The World News Media Network (WNMN)'s 2015 and 2016 data journalism surveys of 144 media outlets in 40 countries show that data journalism has become an important tool for media to respond to digital communication trends (Gu, 2016). As an emerging style of news reporting, data journalism has been favored by the media in recent years. More media outlets are opening data news columns, providing wide opportunities for the endogenous model. Data journalism columns need to be updated regularly to hold the audience's attention, and news media, as for-profit organizations, must consider production costs, so relying on outsourcing for everything is unrealistic. For simple data news production, the endogenous model is undoubtedly the best choice.

News in the new media era is about timeliness. In the era of information explosion, audiences have a short attention span for most events and topics.

The daily production of data news has to consider the popularity and timeliness of events and topics, audience attention, the news production cycle, and so on. There is no doubt that the endogenous model is better adapted to the "7/24" (24 hours a day, 7 days a week) working rhythm of new media platforms.

Second, the wide variety of data journalism learning resources creates opportunities for journalists to improve their skills. With the continuous development of data journalism, related learning resources are increasing, and journalists can learn online and offline through online courses, professional websites, forums, lectures, hackathons, textbooks, and other forms. These learning resources help fill the gap in journalists' data processing skills.

Threats of the Endogenous Model

We believe that the main threat to the endogenous model comes from models such as outsourcing. In today's world of highly specialized professions, the limitations of the endogenous model undoubtedly leave room for other models to flourish. If the number of people engaged in data journalism grows and costs fall in the future, the media may well abandon the endogenous model.

Outsourcing Model

The endogenous model has its scope of applicability. More complex and specialized data journalism, however, is difficult for the media to produce on their own, so data journalists can choose the outsourcing model and seek external help.

Definition of the Outsourcing Model

The outsourcing model refers to a management model in which an enterprise integrates and utilizes the best external specialized resources to reduce costs, improve efficiency, give full play to its core competitiveness, and enhance its ability to cope with the external environment (Zhang et al., 2012). In the outsourcing model of data journalism, the media integrate and utilize external specialized resources, entrusting all or part of the data journalism production tasks to others according to certain standards. For example, Tian Yan Cha, a company-information search website in China, cooperates with some media to publish data news. Tuzheng, a data news self-publisher, not only produces data news itself but also cooperates with other media or undertakes their outsourced tasks. Periscopic, a data visualization company, and others undertake data journalism for media and NGOs. Germany's Open Data City, which provides data journalism services in cooperation with media, has won a Grimme Online Award and a Lead Award in Germany (Yu, 2015).

Application of the Outsourcing Model

The outsourcing model is widely used in data journalism production. Data reprocessing, visualization, and even the entire production process can be outsourced.

Thomson Reuters uses an outsourcing model for some of its large data journalism projects: the vendor provides programmers, designers, copywriters, and other professionals to design and develop the front end of the relevant applications, and the vendor's team structure is very similar to that of the media, making it easier to achieve the goals Reuters editors want (Au, 2016). ProPublica had 111 partners in the six years after its inception, and in addition to exclusive copy, it also contracts or partners with other media outlets to produce data news stories (Xu C. & Xu Z., 2015). The big data news series "Using big data to tell the stories of a community of shared future," launched by CCTV in October 2015, was a joint effort of four teams: the main news production team composed of members of the CCTV News Center, the shooting team composed of members of CCTV's science documentary team, the data visualization team built by professional computer companies, and the visual effects team that participated in the production of Transformers 3 (Wu et al., 2016). The data visualization team and the post-production visual effects team were outsourcing teams for CCTV. After an outsourced task is delegated, the media remain involved in the production of the news, with the journalist acting much like a project manager, coordinating the smooth running of each step (Yu, 2015). According to Dr. Eddy Borges-Rey of the University of Stirling, outsourcing data analysis is one of the future trends in data journalism production for mainstream media (Bai, 2018).

A SWOT Analysis of the Outsourcing Model

Strengths of the Outsourcing Model

The outsourcing model emphasizes a high degree of specialization, a product of the specialized division of labor and economies of scale; it advocates letting professionals do professional work and letting enterprises do what they are good at (Zhang et al., 2012). The production of CCTV Nightly News' "Data about" series is mostly outsourced, using Baidu map and search data provided by Baidu, microblog data provided by Sina, network security data provided by 360, social network data provided by Tencent, transaction data provided by Alibaba, and data analysis technologies provided by TRS, an information technology company (Chang et al., 2014).

In-depth interview (Editor A, head of data journalism of media B, online interview): There are many cases of outsourcing; in fact, it is mainly the technical aspects that need to be outsourced, because not every media outlet can afford a high-end engineer. For example, if you are making 3D works, not all media can afford to have a 3D team.

For some difficult aspects of data journalism production, outsourcing can certainly help journalists solve problems quickly.

Weaknesses of the Outsourcing Model

As mentioned earlier, media outlets need to be responsible for the content of their stories and ensure the quality of each segment. Some tasks are outsourced precisely because the media lack the human, technical, or professional capacity to achieve the desired goals; with control of content production in the hands of outsourcing companies, the media cannot effectively monitor news production. Some data journalists said that they would not outsource production steps they could not see into or understand.

In-depth interview (Editor A, head of data journalism of media B, online interview): We have not outsourced data production, but we have bought data from other organizations.

Opportunities of the Outsourcing Model

In the era of new media, user experience is given a higher priority. The multi-layered nature of data journalism requires both "fast-food" data news and large-scale products with greater interactivity and professionalism. CCTV's "Data about" series achieved a positive social response: "Data about Spring Festival Travel Rush," aired on Nightly News, drew higher ratings than the program's average (Chang et al., 2014). For data journalism production, the breadth, depth, and relevance of the data, together with the interactive experience and visual pleasure of the visualization, determine the quality of the user experience, and the expertise of outsourcing companies can meet these needs. Some industry insiders predict that relatively independent production teams similar to "data news agencies" will be established to provide articles to different media, one of the future directions of data journalism (Dai, 2017).

Threats of the Outsourcing Model

Advances in technology and the entry of high-quality talent into the media have made high-level data journalism production easier. Open-source professional software has increased in volume, availability, and convenience, which has lowered the professional threshold for data journalism to some extent and may crowd out the original business areas of outsourcing companies. This requires them to meet diverse, multi-level needs in terms of professional standards and services.

Crowdsourcing Model

The outsourcing model usually requires financial support for the subcontractor. For media without such funding but in need of outside assistance, the crowdsourcing model may be a good option.

Definition of the Crowdsourcing Model

Crowdsourcing was first introduced in June 2006 by Jeff Howe, senior editor of Wired, a monthly American magazine, in "The Rise of Crowdsourcing." The crowdsourcing model of data journalism production invites an unspecified group of people to participate in data journalism tasks (e.g., news gathering, data collection, or data analysis). The characteristics of the crowdsourcing model are as follows. First, the openness of production: the crowdsourcing model develops products by integrating external resources, and crowdsourcers make the project process open to outsiders and allow them to participate in an open-source manner. Second, the dynamic nature of the organization: online communities are the main organizational form of crowdsourcing. Third, the self-determination of participation: participants are not in an employment relationship with the task issuer but take part according to their own needs. Crowdsourcing replaces contracting with incentives and emphasizes self-help collaboration, linking participants to complete tasks in an orderly and efficient manner (Tan et al., 2011). Almost all crowdsourcing projects share two attributes: participants are not primarily motivated by money, and they donate their leisure hours to the cause. That is, they contribute their excess capacity, or "spare cycles," to indulge in something they love to do (Howe, 2006, p. 29). Crowdsourcing can help companies leverage external intellectual resources to solve difficulties in R&D or production, improve efficiency, or reduce costs.

Application of the Crowdsourcing Model

The application of the crowdsourcing model in data journalism production falls mainly into two categories: projects that mobilize audiences to help examine documents or data, and projects that mobilize audiences to provide content. When the MPs' expenses scandal broke in the U.K. in 2009, The Guardian asked readers to help sift through the 1 million documents released in order to identify undisclosed malfeasance. More than 20,000 readers participated in the project, and 170,000 receipt records were examined within 80 hours (Chang & Yang, 2014). One reader alone examined 29,000 receipt records (Rogers, 2015, p. 228). In the run-up to the 2012 U.S. presidential election, ProPublica launched "Free the Files," a project that asked audiences to sort through a large amount of complex television campaign data released by the Federal Communications Commission (FCC) and extract key information, such as who bought the ads. The goal of the project was to bring shady deals to light, but the massive amount of data could not be processed by a single news organization (Fang, K., 2015).

Every 17 years, a kind of cicada known as Magicicada breaks out of the ground on the northeast coast of the United States. In 2013, New York Public Radio (WNYC) launched "Cicada Tracker," a data news crowdsourcing project. The station invited listeners to measure soil temperatures around their homes with homemade sensors to

observe the emergence of the cicadas (Yin & Yu, 2016), receiving 1,750 temperature reports from 800 locations (Xu, 2015). For the 2016 U.S. election, the Vox website launched an "Emotion Tracker" on Election Day. Audiences could record which candidate they supported and their mood at the time, and the data were uploaded in real time to a quadrant chart designed by the website, where audiences could log in multiple times to track their change of mood. As a crowdsourcing project, it vividly illustrated people's moods from the night before the election to the announcement of the results, attracting a total of 12,006 participants (Zhou, 2016).

A SWOT Analysis of the Crowdsourcing Model

Strengths of the Crowdsourcing Model

The crowdsourcing model has created a new kind of knowledge marketplace that brings together idle, distributed intellectual resources (Feng, 2012). Clay Shirky points out in Cognitive Surplus that the world has more than a trillion free hours, and that the Internet and digital technologies allow people to use that free time to create valuable content rather than merely consume it; cognitive surplus and web technologies form the software and hardware foundation of crowdsourcing (Zhao, 2015). A successful crowdsourcing model can bring together audiences who are willing to participate, decompose tasks so they can be completed quickly, improve the efficiency of data journalism production, and compensate for the media's lack of manpower, technology, and resources for handling these tasks.

Weaknesses of the Crowdsourcing Model

Crowdsourcing is voluntary: there is no contractual or coercive relationship between the issuer and the taker. Crowdsourcing assumes that the taker is conscientious and responsible, has some expertise, or has an irreplaceable advantage of some kind (e.g., being an eyewitness on the scene or a stakeholder). The potential risk, however, is that the quality of the data and content collected through crowdsourcing is inconsistent, and since implementers have no direct control over that quality, they must weigh the pros and cons before crowdsourcing. In this case, the team has to make sure that the data is properly verified (Au, 2016); a sketch of what such verification might look like follows this subsection. This also means that accuracy in the news production chain depends in large part on the crowdsourcer, leading to blended responsibility (Aitamurto, 2016). Moreover, crowdsourcing is not always successful. For example, to analyze the "Afghan War Logs," OWNI (Objet Web Non Identifié), a French publication covering web culture, launched a crowdsourcing effort in which readers could view the logs and point out stories of interest to them. However, only 250 people participated, and 70% of them stopped after analyzing a single document, so the logs were mainly collated by experts (Léchenet, 2014).
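As a concrete illustration of such verification, the Python sketch below applies simple plausibility and duplicate checks to crowdsourced soil-temperature reports of the kind collected by Cicada Tracker. The field names and thresholds are hypothetical, not WNYC's actual pipeline; real projects would add checks suited to their own data.

from dataclasses import dataclass

@dataclass
class Report:
    user_id: str
    latitude: float
    longitude: float
    soil_temp_f: float  # reported soil temperature, Fahrenheit

def validate(reports):
    """Keep only plausible, non-duplicate crowdsourced reports."""
    seen_users = set()
    accepted = []
    for r in reports:
        if not (-10.0 <= r.soil_temp_f <= 120.0):   # physically implausible
            continue
        if not (-90 <= r.latitude <= 90 and -180 <= r.longitude <= 180):
            continue
        if r.user_id in seen_users:                 # one report per user per batch
            continue
        seen_users.add(r.user_id)
        accepted.append(r)
    return accepted

reports = [Report("a", 40.7, -74.0, 58.0), Report("a", 40.7, -74.0, 58.0),
           Report("b", 40.8, -73.9, 400.0)]
print(len(validate(reports)))  # -> 1: the duplicate and the outlier are dropped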

Opportunities of the Crowdsourcing Model

The proliferation of ideas and the formation of communities in the crowdsourcing model offer it ever more possibilities for application. Jeff Howe, in Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business, argues that crowdsourcing depends on four conditions: the emergence of a class of amateurs, a shift in production from closed exclusivity to open sharing, the democratization of production tools, and the community as an effective model of production organization (Li, 2013). With the popularity of the Internet, crowdsourcing, characterized by freedom, openness, equality, and collaboration, has made the audience's enthusiasm and capacity for innovation more powerful and commercially valuable, and "user-created content" has become a new trend that can lead to breakthrough innovations (Zhang et al., 2012).

Threats of the Crowdsourcing Model

We believe that the threat to crowdsourcing comes from itself, that is, from whether the media can build a successful crowdsourcing model for each different crowdsourcing task. The crowdsourcing of data journalism is suitable for stories that seek breadth and diversity of coverage rather than depth and historical perspective. It is suitable for collecting data that is relevant to the public interest and relatively intuitive, but it struggles to show advantages in reporting on specialized issues and collecting authoritative data (Fang, J., 2015, p. 139). It also requires the media to understand the motivations, needs, and behaviors of the audiences involved. Simon Willison, then a developer at The Guardian, argues that since crowdsourcing takers are not paid, the tasks have to be made interesting. Simon Rogers likewise argues that it is important to design crowdsourcing tasks in a playful way: in The Guardian's crowdsourcing of the MPs' expenses scandal, the newspaper built game mechanics into the project, letting users see which challenges had been accepted and completed so as to give them a sense of achievement and reward (Vehkoo, 2013). During the crowdsourcing process, journalists are also required to verify what the audience submits to ensure quality. The media further need to consider whether takers can professionally recognize and cope with the crowdsourced task; usually, the simpler and more interesting the task, the more people will participate. Media also need to pay attention to community building, creating ongoing relationships with reader groups (Vehkoo, 2013): each crowdsourcing participant should be viewed not as a one-time resource but as a long-term relationship to be built.

Hackathon Model to Inspire Creativity

In data journalism production, programmers are the technical core of the team. Accepting programmers into data journalism not only enables cross-border production but also brings cultural integration. The hackathon model is a challenging and exciting ad hoc production model that originated in software development.

Definition of the Hackathon Model

A hackathon, also known as a programming marathon, is a form of software development in which developers write software collaboratively for a set period of time, with the organizer providing learning opportunities and a venue for participants and publicizing the event on the web (Linders, 2014). Hackathons originated in the 24-hour "marathon bursts" initiated by MIT students in the 1980s, who competed in creativity and skill at a set time and place. Hackathons can last from a few days to a week: developers, designers, and people in other roles get together and work in small teams to develop ideas around a specific theme for a fixed amount of time.

As programmers have entered the data journalism field, data journalism hackathons dedicated to the need for innovation have appeared around the world. Programmers are often unable to identify interesting stories in data, while journalists, though they realize the importance of data journalism, lack the skills to build what they envision. The hackathon is therefore seen as one effective form for promoting collaboration and communication between journalists and programmers.

In March 2010, SETUP, a digital culture organization in Utrecht, the Netherlands, launched the Hacking Journalism project to encourage broader collaboration between developers and journalists. At the time, data journalism was just emerging; no media outlet in Utrecht had the intention or budget to hire programmers, and the media were on the fence about data journalism. The hackathon provided an opportunity for programmers and journalists to collaborate and exchange ideas. It ran for 30 hours, and the 30 participants were divided into six groups, each focusing on a different topic: criminality, health, transportation, security, aging, and energy.

On February 13–14, 2015, Florida International University and the Miami Hacker Consortium jointly organized a "media hackathon" aimed at improving the readability of local data journalism and making stories available to the public in a timely manner (Truong, 2015). In December 2015, Hacks/Hackers Dublin hosted a data journalism hackathon in Dublin, Ireland, on the theme of analyzing survey data. The survey software company Qualtrics had surveyed 5,000 respondents in Europe about their attitudes toward work; participants were divided into six teams that spent two days analyzing the data, and the winning entry was published in the Irish Times (Qualtrics Dublin Office, 2015).

Data journalism hackathons in recent years can be broadly divided into two categories: those that are completely open, hosted by industry associations, NGOs, universities, and media; and those open only to insiders, mostly hosted by the media themselves.

Some data newsrooms have applied hackathons to their daily news production. Pili Hu, head of the Data Lab at Initium Media in Hong Kong, said:

I find that the hackathon culture can work well for news production. Oftentimes, people may dawdle over an article, but they can actually finish it quickly by being more efficient. So I thought we could spend one or two days pushing a project forward, and we could use a "Jackathon" (journalism + hackathon) to do something that we normally cannot do in a very short time. This event basically achieved our intended purpose, that is, to urge people to learn something new and expand their potential, and the Jackathon is also a mechanism for expansion within the organization. We can learn from each other in the way we analyze data and the way we read books. (Qualtrics Dublin Office, 2015)

Although the hackathon model is not common, its usefulness for data journalism production should not be underestimated. First, it fosters collaboration between technical staff and journalists, allowing people from two different disciplines to deepen their mutual understanding. Second, hackathons can produce creative data journalism products in a short period of time, or solve a particular puzzle and improve the efficiency of data journalism production. Third, technicians and journalists can learn new skills from each other.

Self-organization Model

There is a unique pattern in data journalism production in China: participants are not paid and never meet each other, but organize themselves spontaneously to produce data journalism. This unique model is the self-organization model.

Definition of Self-organization Model

The concept of "self-organization" is derived from natural science and engineering. Organization is a process of ordering and a way of constituting things in nature and human society. There are two ways for things to evolve from disorder to order, or from lower to higher order: one is to achieve order by organizing oneself, which is called "self-organization"; the other is to be organized by others, called "hetero-organization," that is, to move passively from disorder to order under external instructions (Yang, 2007). Hermann Haken, the founder of synergetics, holds that self-organization is the formation of spatiotemporal patterns (structures) and/or the performance of functions without an "ordering hand" (Hutt & Haken, 2020). In a hetero-organization, a group of people is appointed by a power holder to accomplish a given task, whereas in a self-organization, people join the group on their own initiative. The creation of a self-organization consists of two stages. First, a number of individuals form a small group. Second, the small group must have a specific purpose for which its members divide their work and take action. Only when the small group enters the stage of self-management and can act spontaneously toward a common goal can it be called a self-organization; otherwise it remains merely a small group (Luo, 2010).

Case: Tuzheng's Self-organization Model

We found that China's Tuzheng is one of the few media outlets that uses a self-organization model for data journalism. Tuzheng is a self-publishing media outlet founded by journalist Dai Yu in July 2014, with selected interns from all over the country as its members. Interns who are interested in data journalism, drawn from key universities at home and abroad across departments and majors, form the main body of data journalism production, gradually moving toward specialization in data journalism. There were three reasons for choosing this group: too few full-time journalists are interested in data journalism, most of those interested in data do not work as journalists, and it takes time for journalists to retrain (Zhang, 2016). At present, Tuzheng cooperates with media such as Southern Window, Southern Weekend, Southern Metropolis Daily, and Guangzhou Daily, and runs columns on Sina, ZAKER, Zhihu, and other well-known Chinese websites.

Tuzheng is self-managed and self-organized with a clear division of labor: it consists of reporters, editors, art editors, programmers, promoters, and coordinators, divided into a news group, a research group, an art editing group, and a promotion group. Tuzheng publishes a few data news items every week, and each member serves a six-month internship. Communication among Tuzheng's members also differs from that in professional media: members communicate through Tower, an online team collaboration system. There is no face-to-face communication between members; everything happens online. Since there are no ready-made training materials for current affairs data journalism, the members compile their own case studies and design their own internal training courses (Dai, 2015). The members of Tuzheng are college students who need to attend classes, so the proper functioning of the whole team depends on the members' self-discipline and the rules they set among themselves. For example, to ensure communication, each team leader arranges a time to talk with the group members at least once a week.

In-depth interview (Liu Wen, journalist at Inner Mongolia Radio and Television, former head of Tuzheng's news group, face-to-face interview): I arrange a meeting time with the group members, usually at 8:00 pm. Since we are very busy, I list the problems in advance so that we can go straight to them during the meeting, and the meeting ends once the problems have been discussed. There is no time for small talk, and this way of meeting is efficient.

The data journalism production process at Tuzheng is roughly the same as in professional media. Unlike in the media, however, there is less communication among members, and tasks are frequently assigned.

Because Tuzheng is a nonprofit self-publishing organization, members are not paid a fixed salary; they participate out of interest and enthusiasm. Usually, the team leader has greater authority in setting topics and articles, but for major or sensitive events, the management of Tuzheng (journalists working in regular media) will intervene and keep an eye on the articles. Data reuse at Tuzheng is relatively simple: most selected topics are based on reports that are already rich in data, and members only need to identify, organize, or lightly analyze the newsworthy data in those reports.

In-depth interview (Liu Wen, journalist at Inner Mongolia Radio and Television, former head of Tuzheng's news group, face-to-face interview): I usually search reports online to determine the topic of articles, so that the data is easier to find when we write. This method is a kind of shortcut.

Complex and in-depth data news production is done by members of the research group, which also collaborates with other media outlets and takes on outsourced work from time to time. Tuzheng has done a successful job as a self-publishing outlet run entirely by university students, but as an outlet that distributes content through the WeChat platform, it has shortcomings, such as its relatively low professional level and efficiency, little communication among members, lack of consensus across positions, and strong subjectivity in each stage of production.

References

Aitamurto, T. (2016). Crowdsourcing as a knowledge-search method in digital journalism. Digital Journalism (2), 280–297. https://doi.org/10.1080/21670811.2015.1034807
Au, E. (Ed.). (2016). Building data teams: Tips for managers, muckrakers and coders. Uncovering Asia Conference. https://2016.uncoveringasia.org/2016/09/25/building-data-teams-tips-for-managers-muckrakers-and-coders/
Bai, Z. (Ed.). (2018). Approaching the frontiers of data journalism – Dr Eddy Georges-Rey gives a series of lectures at the School of Journalism and Publishing. Beijing Institute of Graphic Communication. http://xwcb.bigc.edu.cn/xydt/78553.htm (Published in China.)
Bradshaw, P. (2016, May 3). Data journalism's commissioning problem. Online Journalism Blog. https://onlinejournalismblog.com/2016/05/05/data-journalisms-commissioning-problem/
Chang, J., Wen, J., & Liu, S. (2014). Exploration and attempt of TV data news report – Taking CCTV evening news series report of "Data Stories" as an example. Shanghai Journalism Review (5), 74–79. (Published in China.)
Chang, J., & Yang, Q. (2014). Data journalism: Ideas, methods and influences. Journalism and Mass Communication Monthly (12), 10–18. (Published in China.)
Dai, Y. (Ed.). (2015). "South Window" Tuzheng Data Studio Dai Yu: Politics + data news: How to do it? Tencent News. http://view.inews.qq.com/a/20150717A00NTT00 (Published in China.)
Dai, Y. (Ed.). (2017). Data news cold? Dai Yu: After "disenchantment", the rational return of data journalism in China. http://mp.weixin.qq.com/s/v7SdtIOoye1nqDci9couOA (Published in China.)
Dick, M. (2014). Interactive infographics and news values. Digital Journalism (4), 490–506.

Fang, J. (2015). Introduction to data journalism. China Renmin University Press. (Published in China.)
Fang, J., & Gao, L. (2015). Data journalism – An area in urgent need of regulation. Chinese Journal of Journalism & Communication, 37(12), 105–124. (Published in China.)
Fang, K. (2015, January 17). How does the New York Times play "crowdsourcing"? Fangkc. http://fangkc.cn/2015/01/new-york-times-hive/ (Published in China.)
Feng, X. (2012). Crowdsourcing mode research based on two-sided market [Doctoral dissertation, Wuhan University]. CNKI Theses and Dissertations Database. https://kns.cnki.net/kcms2/article/abstract?v=3uoqIhG8C447WN1SO36whLpCgh0R0Z-i4Lc0kcI_HPe7ZYqSOTP4QmLJECduZNhfP-9zehIVUco0egMDZjSR_EU9JCQP8m3h&uniplatform=NZKPT (Published in China.)
Gu, X. (Ed.). (2016). Data journalism has become the touchstone of whether the media is advanced or not? Go to New York to hear the leading voices of the industry. The Paper. www.thepaper.cn/newsDetail_forward_1468494 (Published in China.)
Guo, J. (Ed.). (2018). The beginning of CCTV big data journalism (part 2). datajournalism. https://mp.weixin.qq.com/s?__biz=MjM5MDM3NzUxMA==&mid=290585660&idx=1&sn=1926c2f6f2dd7747583dd67a94ea23bc&3rd=MzA3MDU4NTYzMw==&scene=6#rd (Published in China.)
Howe, J. (2006). Crowdsourcing: How the power of the crowd is driving the future of business. Random House Business.
Huang, Z., & Zhang, W. (2016). How data journalism is produced – Take Caixin's data visualization works as an example. News and Writing (3), 86–88. (Published in China.)
Hutt, A., & Haken, H. (2020). Synergetics. Springer.
Léchenet, A. (Ed.). (2014). Global database investigations: The role of the computer-assisted reporter. Reuters Institute for the Study of Journalism. https://reutersinstitute.politics.ox.ac.uk/our-research/global-database-investigations-role-computer-assisted-reporter
Li, Y. (2013). The idea of "news co-production" under the "crowdsourcing" model. Journalism Lover (6), 38–40. (Published in China.)
Linders, B. (Ed.). (2014). Experiences and good practices from hackathons. InfoQ. www.infoq.com/news/2014/12/experiences-practices-hackthons
Luo, J. (2010). Self-organization – The third governance model beside market and hierarchy. Comparative Management, 2(2), 1–12. (Published in China.)
Ni, Y., & Wu, X. (2001). On the evolution of enterprise strategic management thoughts. Business Management Journal (6), 4–11. https://doi.org/10.19616/j.cnki.bmj.2001.06.001
Qian, J., & Zhou, J. (2015). From emergence to diffusion: Data journalism from the perspective of social practice. Shanghai Journalism Review (2), 60–66. (Published in China.)
Qualtrics Dublin Office (Ed.). (2015). Data journalism hackathon. Meetup. www.meetup.com/hacks-hackers-dublin/events/226992595/
Reid, A. (Ed.). (2014). Guardian forms new editorial teams to enhance digital output. Journalism. www.journalism.co.uk/news/guardian-forms-new-editorial-teams-to-enhance-digital-output/s2/a562755/
Reid, A. (Ed.). (2015). 'Without humanity, data alone is meaningless': Data journalism insights from the Guardian. Journalism. www.journalism.co.uk/news/-without-humanity-data-alone-is-meaningless-data-journalism-insights-from-the-guardian/s2/a564652/
Rogers, S. (2015). Facts are sacred: The power of data (Y. Yue, Trans.). China Renmin University Press.
Tan, T., Cai, S., & Hu, M. (2011). Foreign research status of crowdsourcing. Journal of Wuhan University of Technology (Information & Management Engineering), 33(2), 263–266. (Published in China.)

Truong, E. (Ed.). (2015, October 7). How to create a hackathon in your newsroom. Poynter. www.poynter.org/2015/how-to-create-a-hackathon-in-your-newsroom/377355/
Vehkoo, J. (Ed.). (2013). Crowdsourcing in investigative journalism. Reuters Institute for the Study of Journalism. https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2017-10/Crowdsourcing_in_Investigative_Journalism_0.pdf
Wang, H. (2014). Organizational behavior: Theory and application. Tsinghua University Press. (Published in China.)
Wu, K., Yan, S., & Shu, T. (2016). The innovation and practice of CCTV data journalism. TV Research (5), 20–21. (Published in China.)
Xu, C., & Xu, Z. (2015). A field perspective on data journalism – A case study of ProPublica's journalism practice. Journal of News Research, 6(9), 209, 220. (Published in China.)
Xu, R., & Wan, H. (2013). Data journalism: The core competitiveness of news production in the era of big data. Editorial Friend (12), 71–74. (Published in China.)
Xu, X. (2015). New pattern of news production in the era of big data: Ideas, practice and reflection of sensor journalism. Chinese Journal of Journalism & Communication, 37(10), 107–116. (Published in China.)
Yang, G. (2007). Self-organization and the self-organization mechanism of community. Southeast Academic Research (5), 117–122. (Published in China.)
Yin, L., & Yu, X. (2016). Analysis of application cases of Internet of Things in foreign countries. News and Writing (11), 18–22. (Published in China.)
Yu, M. (2015). Data journalism practice: Process reengineering and models innovation. Editorial Friend (9), 69–72. (Published in China.)
Zhang, H., Huang, C., & Zhou, Y. (2007). A review of the endogenous development. Journal of Zhejiang University (Humanities and Social Sciences), 37(2), 61–68. (Published in China.)
Zhang, J. (2016). Data journalism: A study on the new way of storytelling in the big data era [Master's dissertation, Lanzhou University]. CNKI Theses and Dissertations Database. https://kns.cnki.net/kcms2/article/abstract?v=3uoqIhG8C475KOm_zrgu4lQARvep2SAkfRP2_0Pu6EiJ0xua_6bqBsWCF_ycD2NYFo7naQ1vE_4REG67QYWkmasz5ZryFLy6&uniplatform=NZKPT (Published in China.)
Zhang, L., Zhong, F., & Hui, T. (2012). A review of crowdsourcing research. Science & Technology Progress and Policy, 29(6), 154–160. (Published in China.)
Zhao, J. (Ed.). (2015). The sudden death of the Guardian Chinese website. Harvard Business Review. www.hbrchina.org/2015-05-22/3001.html (Published in China.)
Zhou, Y. (Ed.). (2016). Annual roundup: Data journalism in the U.S. election. DJCHINA. http://djchina.org/2016/12/12/us-election-media/ (Published in China.)
Zou, Y. (2015). How does visual data journalism change from "work" to "product"? Chinese Journalist (1), 92–93. (Published in China.)

4

Insight into Reality
Data Collection and Analysis

Data journalism is 80% perspiration, 10% great idea, and 10% output (Rogers, 2011). Data reuse is at the heart of data journalism production. Generally speaking, statistical analysis is commonly used to deal with structured small data. As the data used in data journalism shifts from small data to big data that statistical analysis cannot handle, data mining methods become necessary. Data mining is the process of extracting implicit, previously unknown, and valuable knowledge for decision-making from large amounts of data. In their theoretical origins, data mining and statistical analysis are in many cases related: data mining extends and expands statistical analysis methods, drawing on statistical sampling, estimation, and hypothesis testing, as well as artificial intelligence, pattern recognition, search algorithms, modeling techniques, and learning theory (Liang & Xu, 2015, pp. 2–13). At the same time, data mining remains distinct from statistical analysis. Current data journalism mostly uses structured small data, so the media mainly apply statistical analysis in the data reuse process. The process of statistical analysis generally consists of data collection, data cleaning, and data analysis. Data collection is the acquisition of data based on research questions, while data cleaning is the discovery and correction of identifiable errors in data files, including checking data consistency and dealing with invalid and missing values. Data analysis is the process of extracting useful information and drawing conclusions from the collected data using appropriate methods. The process of data mining is more complex, comprising business understanding, data understanding, data preprocessing, modeling, evaluation and interpretation of the model, and implementation. In this chapter, we take into account the characteristics of both statistical analysis and data mining to explore the data collection and data analysis stages of data journalism production.
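To anticipate the cleaning step just described, here is a minimal sketch in Python with pandas; the file name, column names, and plausibility bounds are hypothetical, not drawn from any project discussed in this book.

import pandas as pd

# Hypothetical raw survey file with "region" and "age" columns.
df = pd.read_csv("survey.csv")

# Consistency check: normalize whitespace and casing in a category column.
df["region"] = df["region"].str.strip().str.title()

# Invalid and missing values: coerce non-numeric ages to NaN, then drop them.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df = df.dropna(subset=["age"])

# Drop implausible values (bounds chosen arbitrarily for illustration).
df = df[df["age"].between(0, 110)]

print(df.describe())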

Methods of Data Acquisition

Data journalists' access to data can be broadly classified into four ways: open access, access by application, self-collection, and shared access.

Open Access

Open access is a means of obtaining data through open channels that require no permission. Open-access data is freely available, such as open government data, publicly available annual reports from companies, surveys and research reports published by research institutions, news reports from the media, and other public data. "The Migrant Files," launched by Detective.io in 2014, used data from media reports, government publications, gray literature, UN reports on asylum seekers, and reports on migration and human trafficking in Europe and surrounding areas (Hong, 2014). With advances in data collection technology, some data journalists obtain data from online platforms through web scraping. ProPublica's data story "Dollars for Docs" focused on the various payments that pharmaceutical companies make to doctors, which the companies are required by law to disclose on their websites. Unable to obtain the relevant data through normal channels, the reporters collected them through web scraping (Qiu, 2015b). In October 2016, the U.K. government approved plans to build a third runway at Heathrow Airport. To investigate noise complaints from residents near the airport, BBC reporters wrote a web crawler to capture data from the official website of Heathrow Airport and found that the airport received, on average, one noise complaint every five minutes (Wainwright, 2017). Netease Data Blog published "We Crawled Cai Xukun's Weibo and Found How Fans Control Celebrities' Data on Social Media" on April 5, 2019, which crawled more than 280,000 comments from the Weibo account of Cai Xukun, a popular young Chinese idol, and 64,000 comments from the Weibo account of Pan Changjiang, a well-known Chinese comedian. The story used a force-directed placement algorithm to map the network of co-occurrence relationships between high-frequency words, revealing certain patterns in how fans manipulate the data of these two Weibo accounts. The story also found that more than 286,000 comments on one of Cai Xukun's Weibo posts were contributed by 118,000 accounts, an average of 2.4 comments per account, and that just 3,869 accounts with more than 10 comments each, about 3.3% of the users involved, created 98,000 comments, or 34.5% of the total (Netease data blog, 2019, see QR code 4.1). Open data is favored by data journalists because it is easy to obtain, requires no payment, and involves no complicated procedures such as licensing, requiring only a clear statement of the data's source. In daily production, such data suits the media's 24/7 operating mode.

QR code 4.1 We Crawled Cai Xukun’s Weibo and Found How Fans Control Celebrities’ Data on Social Media. https://mp.weixin.qq.com/s/td1JhXTruR6ZUdcHJuCPyg
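To make the scraping approach described above concrete, the following minimal Python sketch (using the widely available requests and BeautifulSoup libraries) collects items from a hypothetical complaints page. The URL and HTML structure are invented for illustration and do not correspond to any site mentioned in this chapter; a real scraper must also respect the target site's robots.txt and terms of use.

import requests
from bs4 import BeautifulSoup

# Hypothetical page listing complaints; replace with a real, permitted target.
URL = "https://example.org/noise-complaints?page=1"

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Assume each complaint sits in <div class="complaint"> with a date and body.
complaints = []
for item in soup.select("div.complaint"):
    complaints.append({
        "date": item.select_one("span.date").get_text(strip=True),
        "text": item.select_one("p.body").get_text(strip=True),
    })

print(f"Collected {len(complaints)} complaints")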

Access by Application

Access by application refers to obtaining data from the relevant authorities in accordance with government information disclosure laws. In the United States, the Freedom of Information Act was passed in 1966, followed by similar legislation in various states; in 2005, the United Kingdom's Freedom of Information Act came into force, giving the public the right to know any government information relevant to the public interest. China adopted the Provisions on the Disclosure of Government Information in 2007, which stipulate the scope and manner of government information disclosure, as well as the procedures for requesting disclosure and for supervision. By the end of 2016, 115 countries worldwide had enacted legal provisions on freedom of information (Freedominfo, 2016). In countries such as the U.K. and the U.S., it has become common for data journalists to request relevant data from the government under the Freedom of Information Act, a practice known as a FOIA request (Qiu, 2015b). A study of The Guardian's data stories throughout 2012 found that 18 of them were based on data obtained through FOIA requests (Li, 2014).

There are also limitations to obtaining data through applications. First, requesting the desired data takes time. In the U.S., each state has different rules for FOIA requests. Louisiana, for example, requires that all FOIA requests be answered within three days; some states have no such requirement, so obtaining data can take months or even years (Qiu, 2015b). Investigative reporting has a long production cycle, so reporters can wait, but for time-sensitive stories, reporters may have missed the best moment to publish by the time the data arrives, although the data may still be used in future stories. Second, the relevant authorities do not always provide the data. Even in countries where freedom of information laws have been in force for many years, such as the U.K. and the U.S., government departments weigh many complex factors when handling disclosure requests (Fang, K., 2015). Some journalists say that government departments use various excuses to stall, such as claiming that a database is too large to export (Qiu, 2015a). Third, the data obtained is not necessarily useful. In 2015, Sarah Ryley, an investigative reporter for the New York Daily News, requested archival records from the city's police department to investigate how New York evicted residents over alleged violations of its nuisance abatement law. The police department provided the information she wanted, but it did not help her investigation (Bilton, 2016).

Media organizations sometimes collect raw data on their own through questionnaires, crowdsourcing, sensors, and similar means. The proportion of self-collected data is low because the approach has many limitations; for example, media outlets need to follow strict procedures, organize considerable manpower, and spend a great deal of time and money to conduct questionnaire surveys. The larger the survey, the more convincing it is, but the higher the cost, while the credibility and persuasiveness of simple data surveys can easily be questioned. Whether respondents are willing to provide data to the media is also a problem.

The Paper's story "We went to the dating corner 6 times and collected 874 marriage offers" is based on the most "primitive" method of data collection: they sent people to photograph the marriage offers and then, after removing personally identifying information, transformed the collected material into usable data. To compensate for the limitations of self-collection, media outlets sometimes crowdsource data submitted by audiences to meet specific needs. Although data collected this way is neither a full sample nor statistically significant, it can reflect and reveal certain issues to some extent. For example, to promote transparency and public interest in the French water market, France Libertés, in partnership with 60 Millions de Consommateurs, designed a crowdsourcing interface where users would scan their water utility bill and enter the price they paid for tap water; in four months, more than 5,000 people uploaded their home water bills to the platform (Gray et al., 2012). In recent years, sensors have become an important way to collect data. Journalists collect data either by using existing sensor systems in government departments and public facilities, or by buying or renting commercial sensors, sometimes through crowdsourcing (Xu, 2015). Fergus Pitt, a senior fellow at the Tow Center for Digital Journalism at Columbia's Graduate School of Journalism, notes that not all the data journalists need is available from official sources, which may withhold information; this pushes journalists to open up other ways of collecting data. The ubiquity of sensors and their ability to quantify abstract phenomena are the main reasons journalists employ them to gather data (Liu, 2015). To keep the Indian public informed about the air pollution around them, IndiaSpend, an Indian nonprofit news outlet, has installed more than 60 air quality sensors across the country to monitor PM2.5 and PM10 levels, and the public can log in to an interface to view real-time monitoring data. Although the government has also set up monitoring stations, there are only 37 in the country, and not all monitoring data is published (Song, 2016).

Shared access is a way to obtain data through negotiation and cooperation with data holders, for a fee or for free. For example, in 2014, CCTV's Nightly News launched "Data about Spring Festival Travel Rush," using population migration data from Baidu Maps' LBS (Location Based Service) positioning. In 2010, Julian Assange, the founder of WikiLeaks, chose not to release a set of documents immediately, instead cooperating with The Guardian, The New York Times, and Der Spiegel so that the three media outlets would publish them first. In June 2013, Edward Snowden, a former U.S. CIA employee, handed over secret PRISM information to The Guardian and The Washington Post and told the media when it should be released. The Economist launched "How to forecast an American's vote" in the run-up to the 2018 U.S. midterm elections, with the prediction model at its core based on data on 125,000 American voters provided by the pollster YouGov, a leading research organization (Selby-Boothroyd, 2019). When sharing data, data providers need to consider whether such sharing violates conventions, agreements, or even the law.

Problems with Data Collection

In data journalism production, data as a raw material seems ubiquitous and plentiful, but it is in fact scarce. Data journalists encounter many problems in data collection; the main three are as follows.

Insufficient Available Data Resources Due to Data Monopolies

The opening of government data can form a virtuous cycle: the more plentiful and broader the data the government opens up, the lower the cost and risk of social innovation, and as data users become more capable of using it, the greater the value created – which in turn drives the government to open up more data (Du & Cha, 2016). Conversely, a lack of government openness can hinder the growth of data journalism (Fang et al., 2016). A large number of government datasets are still held within the administrative system and are not made available to society in a timely manner (Tang & Liu, 2014). There are problems such as serious data fragmentation, a lack of uniform standards for data platforms, and reluctance to share on the part of some data-rich departments (Sun et al., 2015). According to the Open Data Barometer report released by the World Wide Web Foundation in April 2016, the 92 countries and regions included in the evaluation have all released some government data; 55% of them have set up open data agencies, 76% of which operate online. Europe and the United States topped the list overall: the United Kingdom scored first, followed by the United States, France, and Canada; the Asia-Pacific region came next, with China ranking 55th. The survey showed that 93% of the public and technology professionals across all countries and regions have used government data. Of the 1,380 government datasets sampled, only 10% are fully open, and many of those that are open remain problematic (World Wide Web Foundation, n.d.). The China Open Data Lens shows that the amount of data opened by local governments in China is low: as of May 20, 2015, the surveyed local governments had made public an average of 278 datasets each (Zheng, 2016). China's open data platforms include the Chinese Government Public Information Online, the Beijing Basic Database of Macro-economy and Social Development, and DataShanghai. In 2014, Xinhua News Agency began making its important news materials available to the public for free at info.xinhua.org, and in 2013 Baidu Index repositioned itself as a big data sharing and exploration platform. However, the scope and variety of open data in China are still limited. The data used in data journalism must be both of public interest and newsworthy. Therefore, the degree and quality of data openness largely determine the development of data journalism.

In-depth interview (Editor A, head of data journalism at media outlet B, online interview): There was a lot of news that could not be completed, sometimes because of a lack of data, sometimes because the timing was not right.

Problems with the Quality of the Data

Whether data is actively released by the government or obtained through requests, its quality is an important factor in the value of its use. Assessing data quality requires examining a number of dimensions. First, freedom from error, or the correctness of the data. Second, completeness, which can be viewed from at least three perspectives: schema completeness, column completeness, and population completeness. Third, consistency, including the consistency of redundant data within one table or across multiple tables, the consistency between two related data elements, and the consistency of format for the same data element used in different tables. Fourth, believability, or the extent to which data is regarded as true and credible, including the believability of the source, believability against internal common-sense standards, and believability based on the age of the data. Fifth, the appropriate amount of data, which should be neither too little nor too much. Sixth, timeliness, which reflects how up to date the data is with respect to the task for which it is used. Seventh, accessibility, or the ease with which the data can be obtained (Lee et al., 2006, pp. 55–59).

There have been cases in the United States where media outlets damaged individuals' reputations because they failed to adequately verify the data they obtained. A story in The Texas Tribune branded a local man a criminal, but the error turned out to lie in the state's crime database itself (Bradshaw, 2013). In September 2014, the Centers for Medicare & Medicaid Services (CMS) released Open Payments, a national disclosure program intended to promote a more transparent and accountable health care system, but reporters searching the database found many deficiencies, such as missing physician income data and errors in one-third of the records (Ornstein, 2014). In late 2012, data journalist Nils Mulvad finally got his hands on the veterinary prescription data he had spent seven years fighting for, but he decided not to publish it when he realized it was full of errors (Bradshaw, 2013).

The Open Data Barometer's quality criteria for open data include the following: appropriately licensed, free, properly formatted, up to date, easy to find, sustainable, and linked. Only about half of the government data studied is available in a machine-readable format, and of that machine-readable data, only half can be downloaded in bulk. This makes data reuse complicated, and in some cases impossible, for information intermediaries such as researchers, academics, civil society, and the media (World Wide Web Foundation, 2016). The China Open Data Lens likewise indicates that data should be in an easy-to-use, machine-readable format (e.g., xls rather than pdf); the current data readability rate is 84.1% (Zheng & Gao, 2015), and non-machine-readable data forces data journalists to spend a great deal of time processing data.
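Once a dataset is in a machine-readable format, several of the quality dimensions listed above can be screened automatically before it is used in a story. A minimal sketch in Python with pandas, assuming a hypothetical file incidents.csv with date and city columns:

import pandas as pd

df = pd.read_csv("incidents.csv", parse_dates=["date"])  # hypothetical file

# Completeness: share of missing values per column.
print(df.isna().mean())

# Consistency: duplicate records and variant spellings of the same category.
print("duplicates:", df.duplicated().sum())
print(df["city"].str.strip().str.title().value_counts())

# Timeliness: how recent is the newest record?
print("latest record:", df["date"].max())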

Other data may be faulty at the source. The samples for the 2016 U.S. presidential pre-election polls were allegedly biased, with most of the polling data coming from random telephone interviews conducted over landlines, ignoring other groups. Even for these random telephone interviews, the response rate was below 10%, so the characteristics of a small sample were magnified (Hu et al., 2016). In 2014, the Canadian government collected unemployment data to understand the impact of temporary foreign workers on the unemployment rate of its citizens. The results suggested that the presence of foreign workers would not hinder the employment of local citizens. However, it turned out that the data used to calculate the unemployment rate in these areas did not cover Indigenous people. The mistake is obvious: a large number of people were omitted from the data collection, which produced an inaccurate and potentially destructive picture of Canadian employment (McBride, n.d.).

News should be timely, and the relatively poor timeliness of some open data undoubtedly diminishes its news value. The China Open Data Lens shows that most of the data published across China is static, accounting for nearly 90%, and is updated annually or irregularly (Zheng & Gao, 2015). Even in the U.K., where data openness is high, much of the data is static. For the most accessed road safety data, in September 2015 we were only able to retrieve complete data for 2013 and earlier, with nothing published for the first half of 2015 or for the whole of 2014 (Stoneman, 2015). This means that if the media wanted to use this database for a data story reviewing road safety at the national level after a major accident, they would face two choices: give up the database, or cut the data off at 2013. If they chose the latter, they would inevitably face questions from audiences: why is the most recent data not available? While the media could explain the lack of access to the latest data, such a story would have no advantage in terms of news value or audience needs.

Nate Silver, head of the FiveThirtyEight website, which correctly predicted Obama's election in 2008 and 2012, failed in his 2016 election prediction, in large part because his model relied on polling data. The sampling points for general election polls were not evenly distributed, and the silence, or concealment of their true opinions, of those who supported Trump skewed the data. Sampling methods, sample selection, and similar factors are closely tied to the results, and without enough valid data, Silver could not make accurate predictions. These are all problems in the data quality dimension (Cindy, 2016).

Violations and Controversies Arising from Non-compliance with Ethical Norms

The traditional ethics of news production can hardly regulate news that uses raw data as its material. Imperfect laws and regulations, together with the lack of ethical norms for data collection and publication, often lead to infringements in data journalism.

In December 2012, The Journal News of New York State published a map listing the names and home addresses of all legal gun owners in Westchester and Rockland counties. The newspaper's president, Janet Hasson, argued that New Yorkers have the right to own guns with a permit and also the right to access public information, and that one of the roles of journalists is to report publicly available information in a timely manner, even when it is unpopular. But the people whose information was published believed it violated their right to privacy (Weiner, 2012), and the newspaper eventually withdrew the gun map. A few weeks later, New York State responded by enacting a law that allows the information of registered handgun owners to be anonymized (Beaujon, 2013) and removed from public records for privacy reasons (McBride & Rosenstiel, 2013, p. 239).

Another story carried a potential ethical risk. BuzzFeed and the BBC analyzed 26,000 top tennis matches from 2009 to 2015 using an algorithm to produce a list of matches and players suspected of match-fixing. BuzzFeed shared the investigative methodology, anonymized data, and algorithm for the story, titled "The Tennis Racket," on GitHub. By analyzing win-loss records and betting odds at both the beginning and end of a match, BuzzFeed identified cases with an unusually large swing (e.g., greater than a 10 percent difference); if a player was involved in enough such matches, suspicion fell on them. Although BuzzFeed did not publish the names of suspicious players, that did not stop others from quickly de-anonymizing the players pinpointed by the statistical analysis: a group of undergraduate students from Stanford University made public the names BuzzFeed had kept hidden (Diakopoulos, 2016).

How to deal with leaked data is also a difficult issue. When WikiLeaks released data to multiple news organizations in 2010 and 2011, every media organization had to decide not only whether to publish it but how, balancing the redaction of names of people who might be put at risk against the public's right to know what was done on their behalf by the government (Howard, 2013). Data journalists need to handle personal data properly at every stage of collection, analysis, and publication. In recent years, some data journalism has involved cross-border and cross-organizational cooperation, and it is especially important to avoid leaks of sensitive personal data as it flows among different participants. Colin Porlezza, an academic at the University of London, argues that we should ask the following questions about personal data: Who is in control of the data? With whom can the data be shared? How should the data be protected and secured? News organizations should not only make sure that sensitive data is securely stored electronically, but also that journalists receive appropriate training on how to avoid security risks and the potential misuse of data arising from increased sharing and partnerships (Data Journalism, n.d.). With the increasing popularity of data journalism and the diversification of data collection methods, its ethical issues will become more prominent. How to avoid ethical risks is a topic that needs continued exploration.

Functions of Data Analysis

Data analysis is the process of using appropriate statistical methods to study the collected data in detail, in order to extract as much useful information as possible and form conclusions (Zhao et al., 2015). There are three main types of data analysis methods. The first type, Descriptive Analytics, uses descriptive statistics and data visualization to describe the basic characteristics of data, such as sums, means, and standard deviations. The second type, Predictive Analytics, uses causal analysis, correlation analysis, and other methods to derive potential patterns, common patterns, or future trends from past or current data (Chao, 2016, p. 18). The third type, Prescriptive Analytics, combines a number of influencing factors with known data to provide the best decision option. We classify the functions of data analytics in data journalism into four categories: describing, explaining, predicting, and deciding.

Describing

This type refers to the summary, judgment, and presentation of the current situation, characteristics, and evolution of things through data analysis. Description is the most basic function of data analysis, and it relies on descriptive analysis, which characterizes the measured sample and the population it represents (Zhao et al., 2015, pp. 94–95). After evaluating and cleaning the data, journalists use comparative analysis, averaging, cross-analysis, and other methods to examine central tendency, dispersion, and so on. Of the reports that won Data Journalism Awards from 2013 to 2016, 73% had informing as their primary function (Heravi & Ojo, 2017). We analyzed the Data Journalism Awards winners from 2013 to 2018 and found that 93.5% of them used descriptive analytics (Zhang et al., 2019). Of course, descriptive analytics is not necessarily simple; some data-driven investigative pieces and complex data stories, although their main function is to describe, use sophisticated data analysis methods, and some even use machine learning.
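As an illustration of this descriptive work, here is a minimal sketch in Python with pandas; the file and column names are invented for illustration.

import pandas as pd

# Hypothetical dataset of incidents with an area and a damage estimate.
df = pd.read_csv("incidents.csv")

# Central tendency and dispersion of damage per incident.
print(df["damage"].agg(["count", "sum", "mean", "median", "std"]))

# Cross/comparative analysis: average damage by area.
print(df.groupby("area")["damage"].mean().sort_values(ascending=False))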

Explaining

This type of data analysis is used to explore and reveal the reasons for the occurrence and development of a phenomenon. Journalists often need to combine several data analysis methods to explore and verify explanations, and sometimes to use qualitative analysis as well. In August 2011, riots in London spread to six major cities in the United Kingdom within five days, and competing explanations for the cause of the riots circulated. The Guardian launched a data journalism feature, "Reading the Riots," to give the public a better understanding of who was rioting and why. Using descriptive analysis, The Guardian combined the locations of the riots with maps of the distribution of poverty to refute politicians' claims that the riots had nothing to do with poverty.
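The kind of combination The Guardian performed can be approximated by joining two datasets on a shared geographic unit; a minimal sketch, with invented file and column names:

import pandas as pd

# Hypothetical files: riot incident counts and poverty rates by area.
riots = pd.read_csv("riot_incidents.csv")   # columns: area, incidents
poverty = pd.read_csv("poverty_rates.csv")  # columns: area, poverty_rate

merged = riots.merge(poverty, on="area")

# Do incident counts rise with area poverty? (A correlation, not proof of cause.)
print(merged[["incidents", "poverty_rate"]].corr())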

Among the winners of the Data Journalism Awards from 2013 to 2016, approximately 39% tried to explain some phenomenon to the public. For example, one article explained how millions of minority voters were inadvertently prevented from voting by a computer program designed to identify irregularities during elections (Heravi & Ojo, 2017).

Predicting

This type predicts the degree or probability of a certain attribute of an event or phenomenon over a period of time in the future, mainly using predictive analysis. The journalist must establish an analytical model before predicting, and two types of predictive analytics can be distinguished: regression and classification. In regression, the target variable is continuous; popular examples are predicting stock prices, loss given default (LGD), and customer lifetime value (CLV). In classification, the target is categorical (Baesens, 2016, p. 35). During the 2016 U.S. presidential election, The New York Times' The Upshot section presented predictive coverage not only at the national level but also in all 50 states. Of course, making predictions is risky, especially predictions with short-term visible results: if they fail, the reputation of data journalism suffers. In the 2016 U.S. presidential election, for example, the U.S. mainstream media met its Waterloo in its predictions. Predictions that can only be verified in the long term carry less risk. For example, "Visualizing the Past, Present and Future of Carbon Emissions," a Data Journalism Awards winner, describes the problems the current carbon budget poses for the Earth's ecosystem. Such predictions take a long time to prove, and the public tends to see them as a form of science popularization, without expecting the predictions to be exactly right.
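The regression/classification distinction can be made concrete with synthetic data; a minimal sketch, assuming scikit-learn is available (variables and figures are invented):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))  # one synthetic predictor

# Regression: continuous target (e.g., a vote share).
y_continuous = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)
print(LinearRegression().fit(X, y_continuous).predict([[1.0]]))

# Classification: categorical target (e.g., win vs. lose).
y_categorical = (y_continuous > 0).astype(int)
print(LogisticRegression().fit(X, y_categorical).predict_proba([[1.0]]))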

Deciding

This type of data analysis is used to arrive at optimal options for future action, mainly through prescriptive analysis. When there are few options, descriptive and predictive analysis can identify the optimal solution; in most cases, however, the volume of data and the number of combined options are so large that prescriptive analysis is needed to help people select an optimum from the many choices (Watson & Nelson, 2017, p. 128). Data analysis in data journalism is currently less involved at the decision-making level. For data news production in the era of big data, massive correlated datasets can carry data through the progression from information to knowledge and then to wisdom, and the highest goal of data analysis is to provide personalized decision-making services. Some current data news has already reached the decision-making level. In The New York Times' "Is It Better to Rent or Buy?", users can enter their length of residence, mortgage rate, and down payment to determine whether renting or buying a home is the better choice. ProPublica's "HeartSaver: An Experimental News Game" uses real-world data to let users experience the impact of different factors (e.g., level of hospital care, distance to hospital) on a patient's chances of survival, to aid patient decision-making.
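A toy version of such a decision aid can be sketched in a few lines. This is not the Times' actual model, which accounts for many more factors (taxes, opportunity cost, resale value, and so on); the cost formulas and figures below are crude, invented simplifications.

# Toy rent-vs-buy comparison; all formulas and numbers are illustrative only.
def total_buy_cost(price, down_payment, annual_rate, years):
    # Crude: compound interest on the financed amount, ignoring resale value.
    financed = price - down_payment
    return down_payment + financed * (1 + annual_rate) ** years

def total_rent_cost(monthly_rent, years, annual_increase=0.03):
    return sum(monthly_rent * 12 * (1 + annual_increase) ** y for y in range(years))

years = 10
buy = total_buy_cost(price=400_000, down_payment=80_000, annual_rate=0.04, years=years)
rent = total_rent_cost(monthly_rent=1_500, years=years)
print("better to buy" if buy < rent else "better to rent", round(buy), round(rent))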

Problems with Data Analysis

Data analysis is one way of interpreting data, but it is not the only way, nor does it lead to a single interpretation. In practice, data analysis is prone to a number of problems.

Distortion of Meaning Due to Ignoring the Original Context of the Data

The production of meaning of any text depends on a specific context. Context emphasizes the underlying rules or external environment in which meaning arises; it not only establishes the boundaries of interpretation but also participates, as a productive element, in the direct construction of meaning (Liu, 2018). Data are never neutral "givens" (McBride, n.d.). Data cannot have meaning without context, and once it has meaning, it must exist in some context. Data journalism involves two contexts: the original context of the data and the context set by the news text. The original context refers mainly to the purpose for which and the method by which the data were collected, and relates to the boundaries of data interpretation. The context of the news text refers to the intent with which the news uses the data, and relates to the data's applicability. By moving data from its original context into the context set by the news text, "recontextualization" occurs. According to the British sociolinguist Norman Fairclough, recontextualization is a form of "reperspectivization," a means by which power operates (Wang, 2015). Abstract data may generate new meanings, or even deviant ones, through contextual substitution and transfer. Journalists therefore need to consider the context of the data comprehensively. Generally speaking, the better the fit between the original context of the data and the context set by the news text, the more accurate people's understanding of the data will be, and the more credible the reality the data characterizes.

The Washington Post's "A fascinating map of the world's most and least racially tolerant countries" drew on 30 years of survey data to present differences in public perceptions of racial tolerance across countries and regions. However, the report was criticized by experts for not taking into account differences in questionnaire design and in the understanding of racial tolerance across the years, leading to blind comparisons between datasets (Mitter, 2013). Some journalists mistakenly assume that, for a survey on the same issue, the original context of the data and the context of the news text are identical. It is therefore important to understand whether data are comparable and to be careful when normalizing data across time or from different organizations on the same topic. Likewise, contemporaneous data may not be comparable because of contextual differences. For example, people generally think that high-speed railway stations are far from the city center, when in fact 75% of them are within ten kilometers of downtown. Why does this misunderstanding arise? City size affects people's perception of distance, so the raw distance of high-speed rail stations from the city center is not comparable across cities of different sizes. Caixin's "How far is a high-speed railway station from you" introduces a distance index: the ratio of the straight-line distance from the station to the city center to the radius of the built-up area. A smaller distance index means that the station is easier to reach, while a larger one means it is more remote (Liu Jiaxin, front-end engineer of Caixin Data News Center, speaking at the 2019 Data Creators Conference in Shanghai, China, June 3, 2019; You Shu Editorial Department, 2019). This makes the data comparable and makes it easier to draw reliable conclusions.
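The index is simple enough to state in one line; the sketch below, with invented city figures, shows why the raw distance misleads while the index does not.

# Caixin's distance index: straight-line distance from station to city center
# divided by the radius of the built-up area. City figures are invented.
def distance_index(station_km_from_center, built_up_radius_km):
    return station_km_from_center / built_up_radius_km

# The same 9 km feels close in a big city but remote in a small one.
print(round(distance_index(9, 15), 2))  # large city, radius 15 km -> 0.6
print(round(distance_index(9, 5), 2))   # small city, radius 5 km -> 1.8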

There are also cases where an unreasonable choice of time range makes the recontextualized data heavily biased, or even distorted, in how it characterizes reality. A powerful example is the widely reported "national crime wave" in the United States in 2015. It is true that certain cities saw spikes in crime compared with recent years. But if journalists examine a broader period, they will see that crime rates were higher almost everywhere in the U.S. a decade ago, and almost twice as high two decades ago (Groskopf, 2015). Open data is often used by data journalists, but it is important to note that open data is organized for government agencies, not journalists. That is, most datasets are named, structured, and organized according to the agencies' needs, not those of journalists looking for stories (D'Ignazio, 2019). Journalists, for their part, should collect and analyze data with a good understanding of its original context to avoid misinterpretation.

Unreliable Conclusions from Ignoring the Applicability of Analytical Methods

In September 2016, The New York Times polled 867 likely Florida voters, and its analysis concluded that Hillary Clinton would win by one percentage point. The newspaper sent the raw data to four prominent polling analysts, and the four returned different results, one of them finding a Trump victory. The reason is that this expert used a different research method (Cohn, 2016). This shows that the choice and combination of data analysis methods are crucial to the conclusion. Which methods, variables, models, and algorithms are chosen must be considered as a whole; otherwise the conclusions drawn may be wrong.

"Disasters Cost More Than Ever, But Not Because of Climate Change," published on the FiveThirtyEight website, argues that the increasing losses from disasters over the past 30 years were caused not by climate change but by global GDP growth. To prove the point, the author calculates the ratio of disaster losses to GDP and concludes that this ratio has declined. The report was criticized by numerous U.S. climate scientists for its many errors. First, it did not include all available data. Second, it completely ignores society's investment in disaster resilience. Third, 30 years is too short a time frame for climate analysis. Fourth, and more importantly, the author never adequately demonstrated that the losses really are unrelated to climate change (Skeptical Science, 2014).

The Guardian's "Murders up 10.8% in biggest percentage increase since 1971, FBI data shows" presents the change in the number of people murdered in the United States from 1960 to 2015. The number of deaths peaked in the 1990s and has hovered in the middle range since then, although it has decreased. From the totals alone it is difficult to conclude that the period after 2000 was the best for public safety; yet that is in fact the case. The U.S. population grew by 80% between 1960 and 2015, and measured as murders per 100,000 people, the 2015 figure is close to the lowest since 1960 (Whitby, 2016). Drawing conclusions by comparing only totals is undoubtedly one-sided.
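Normalizing by population is a one-line fix; the figures below are rounded illustrations in the spirit of the FBI series, not exact values.

# Why totals mislead: murders per 100,000 population, not raw counts.
def rate_per_100k(murders, population):
    return murders / population * 100_000

# Rounded illustrative figures for 1960 and 2015.
print(round(rate_per_100k(9_100, 179_300_000), 2))   # about 5.1 per 100k
print(round(rate_per_100k(15_700, 321_400_000), 2))  # about 4.9 per 100k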

"With Few Gun Laws, New Hampshire Is Safer Than Canada," published by Mises Wire, compares murder rates across U.S. and Canadian regions and concludes that Canada's Northwest Territories has a murder rate of between 8 and 10 per 100,000 people, the third highest of all. In fact, the region is so sparsely populated that its total population does not reach 100,000, and the data covered only 2014, a year in which there were three murders in the region, compared with zero in each of the previous three years (McMaken, 2015). In such cases, journalists can avoid mistakes by applying simple data analysis.

There are also complex data analyses that use algorithms and models, but this does not guarantee a reliable conclusion, and the applicability of the method has an even greater impact on it. The construction of models is not arbitrary either. Take prediction models as an example: different prediction models suit sample data of different structures, and the correct choice of prediction model is a critical step in the data mining process (Liang & Xu, 2015, p. 11). It has been argued, for instance, that the general failure of U.S. media predictions in the 2016 election stemmed from prediction models that did not adequately consider the possibility that state polls would be wrong at the same time. Polling errors generally arise when pollsters fail to reach a certain type of voter or misestimate the turnout of a certain type of voter, in which case states with similar demographic composition are likely to show polling errors simultaneously (Cindy, 2016). "Europe One Degree Warmer," produced by the European Data Journalism Network together with several media outlets, analyzed more than 100 million meteorological data points and found that every major European city is warmer in the 21st century than it was in the 20th. The journalists used two datasets that could not be compared directly, so a reconciliation algorithm was applied to allow historical comparisons (EDJNet, 2019).

Mistaking Correlation for Causation

In the face of uncertainty, people like to find causes and transform uncertainties into certainties. The non-deterministic interdependence between variables is called correlation (Jian et al., 2015, p. 264). In business, correlations are sufficient for decision-making. But journalism goes beyond the "what" to answer the "why," which requires data analysis to go further: to explore possible causal relationships on the basis of correlations (Li, 2013). Correlation, however, should not be mistaken for causation. Suppose that countries with higher gun ownership rates have more gun homicides (the two are correlated); there could be several reasons for this. First, gun ownership breeds homicides. Second, the presence of homicides causes more people to buy guns for self-defense. Third, some other factor causes both homicides and gun ownership (possibly poverty). Fourth, it is just a coincidence, a possibility that statistical tests can be used to rule out (Stray, 2014). Correlation is not the same as causation, and a high rate of gun ownership does not necessarily lead to more shootings.
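The third explanation, a lurking variable driving both measures, is easy to demonstrate with a simulation; the numbers below are synthetic and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

# A confounder (say, poverty) drives both variables; neither causes the other.
confounder = rng.normal(size=500)
gun_ownership = confounder + rng.normal(scale=0.5, size=500)
homicides = confounder + rng.normal(scale=0.5, size=500)

# A strong correlation appears despite there being no direct causal link.
print(round(np.corrcoef(gun_ownership, homicides)[0, 1], 2))  # about 0.8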

In data journalism, correlation does sometimes get mistaken for causation through poor recognition of logical relationships. In "Do Fewer Unions Make Countries More Competitive?", published on FiveThirtyEight, a journalist analyzed data from the World Economic Forum's annual Global Competitiveness Index for dozens of developed countries and concluded that countries with higher union density are slightly more economically competitive. A data expert who reanalyzed the original data found that the reporter had confused correlation with causality, and that the linear correlation was not even significant, given the small R2 (Whitby, 2014). It is also important to avoid spurious correlations. Some apparent "correlations" are mere "coincidences" and do not indicate true causality between the two variables: a change in two variables in the same direction may be caused by a change in a third. Forcing a single explanatory variable (unionization) to explain all the variation in the dependent variable (competitiveness) is itself arbitrary (Whitby, 2014). The era of big data has created a large number of spurious correlations that can be mined by anyone with enough processing power on a big dataset. Many of them are merely statistically significant; causal relations, in which a change in one variable causes a change in the other, are much harder to find (Fletcher, 2014).

Another common logical error is to make linear long-term predictions about a non-linear phenomenon. For example, FiveThirtyEight's "An Anti-Death-Penalty Majority Might Be on Its Way in 2044" analyzes data on American attitudes toward the death penalty since 2000, and the reporter concludes that overall support for the death penalty is dropping in the U.S., though slowly, and that if the trend continues, a majority of Americans will oppose the death penalty in about 30 years. But attitudes toward the death penalty evolve non-linearly, shaped by a variety of factors, and a linear prediction that ignores other influences is meaningless.

The Path to Improving Data Analysis in Data Journalism

The various problems of data analysis in data journalism production damage the authenticity and credibility of the news. We believe the quality of data analysis can be improved in two ways: improving journalists' own data analysis ability and bringing experts into the production process.

Improving Journalists' Data Analysis Ability

The Path to Improving Data Analysis in Data Journalism

The various problems of data analysis in data journalism production hurt the authenticity and credibility of the news. We believe that the quality of data analysis can be improved in two ways: improving journalists' data analysis ability and bringing experts into the production process.

Improving Journalists' Data Analysis Ability

In China and abroad, data journalism production, as a new phenomenon, relies mainly on the endogenous model; that is, media outlets absorb, integrate, optimize, and enhance internal resources for data journalism production. Many

media outlets' data journalism teams are regrouped from existing teams, which have changed their names but not their knowledge structures. Some media outlets produce data journalism not as part of their development strategies, but simply because it is popular and signals innovative capability. Survey research shows that most data journalists worldwide have received a high degree of formal training in journalism but far less formal training in data-oriented and technical areas such as data analysis, statistics, coding, data science, machine learning, and data visualization (Heravi, 2018). The key to improving the level of data analysis therefore lies in improving journalists' own analytical ability, and strengthening journalists' self-directed learning, organizing professional training, and recruiting talent are effective ways to do so. Aron Pilhofer of The Guardian, after examining data journalism in the United States, concluded that training has been incredibly impactful in spreading data journalism there: hardly a newsroom in the U.S. lacks a specialist in this area, and training is a huge reason for that (Reid, 2014). At the same time, some Chinese media are bringing in foreign talent to quickly improve their data analysis capabilities, because data journalism training abroad is more mature than in China: more than half of U.S. journalism schools offer data journalism courses, and some have dedicated data journalism programs (Fang & Hu, 2018).

Inviting Experts to Participate in the Production Process

While simple data analysis can be done by journalists, the analysis of complex data sets for complex events needs to be done by experts. Unlike traditional journalism, which treats experts as sources, data journalism incorporates experts into the news production process, embedding them in the editorial team as content producers. The Guardian's "Reading the Riots," for example, was a project in which 30 professionally trained data journalists and five data analysis experts from the London School of Economics and Political Science worked together to analyze the data (Rusbridger & Rees, n.d.). Experts can serve two functions in data analysis. First, experts are gatekeepers for the entire data analysis process: the journalists themselves are the first gatekeepers, and the experts act as second gatekeepers to ensure the reliability and credibility of the analysis. This is in line with the intersubjective objectivity of data journalism, which means that journalists should discuss with experts and peers and listen to different opinions in order to improve data analysis and conclusions, rather than making decisions on their own (Zhang, 2018). Second, experts can be directly responsible for the data analysis, which is equivalent to "outsourcing" the analysis to them. In fact, many large-scale data journalism works resemble scientific research, with experts playing a leading role in the data analysis; after all, the media is constrained by production cycles and by human and material resources. This type of cooperation not only enhances the professionalism and scientific rigor of data analysis, but also gives journalists the opportunity to learn from researchers.

How Data Journalism Awards Winners Apply Data Science

For a long time, journalism was not considered a desirable profession because the skills required of journalists lacked professionalization. Journalism later came to be considered a profession because of its commitment to the public interest and its demand for autonomy (Li, 2012). In the new media era, professional news producers' privileged position in content production has been broken, and communication technologies allow anyone to become a journalist at minimal cost. The massive amateurization of newsgathering techniques (Zhou & Wu, 2015) has led to a crisis of deprofessionalization, directly undermining the legitimacy of journalism. Yet the public's demand for media professionalism is not decreasing but increasing, so the appropriate use of new technologies can help enhance the professionalism of journalism (Peng, 2017). The birth of data journalism provides such an opportunity. Data journalism is considered a new journalistic paradigm because it uses data as the "raw material" for understanding reality, data science as a methodology for seeking truth, and data visualization as a means of representing reality. Data science methodology is what distinguishes data journalism from computer-assisted reporting, precision journalism, and graphic journalism. As an emerging discipline born in the era of big data, data science is a new area of research concerned with huge data sets, involving their collection, preparation, visualization, management, and preservation (Chao & Lu, 2017). Data science aims to analyze and understand the phenomena behind the data by revealing the hidden features of complex social, human, and natural phenomena from points of view beyond traditional methods (Sinha et al., 2021). The open data movement has opened up data, but abstract data is difficult for the public to understand, so data journalists have become the intermediaries connecting data to the public. In its first decade, data journalism has been constructing its own professional discourse but has not completed it, mainly because the industry has not yet formed a set of professional skill standards and ethical norms and lacks consensus on what constitutes "good" data journalism. How to evaluate the professionalism of data journalism has become a new issue in urgent need of study. Since their establishment in 2012, the Data Journalism Awards have represented the highest level of data journalism in the world. Research on the Data Journalism Awards in China and abroad has mostly focused on the winners' data sources, narrative models, data visualization, and professional norms and ethics, but has not yet touched the key issue of data science methodology. This study examines the performance of the Data Journalism Awards winners from 2013 to 2018 in terms of data collection and data analysis, with specific analysis categories including methods of collecting data, data volume, data type, data analysis methods, and data processing complexity (see Note 1), to assess the current performance and trends of data journalism in data science.

Data Collection

A career qualifies as a profession when it has full autonomy, ensuring that it serves the public interest and develops a distinct reputation (Li, 2012). Journalists' use of data science methods in data collection has shifted control of some important data collection to journalists themselves, which has enhanced the professionalism of news production to a limited extent. Of the 36 identifiable samples, only 6, or 16.7%, used data science methods for data collection. For example, Medicamentalia.org used Ruby (a programming language) to crawl data from a database of drug prices in developing countries, and "Fact Check: Trump and Clinton Debate for the First Time" used real-time speech-to-text transcription, driven programmatically, to obtain the raw text of the debate. The percentage is low because a large amount of data is in the hands of governments and corporations, and much of it is directly available through free downloads, public records requests, partnerships, or purchases, whereas collecting data with data science methods such as programming is difficult and specialized and lengthens production time. Most awarded works rely heavily on data collected by governments and corporations; the resulting data journalism lacks critical awareness of these data and constructs reality the way governments and corporations understand it.
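Programmatic collection of the kind Medicamentalia used usually amounts to a short script rather than a large system. The Python sketch below shows the general shape; the URL and the HTML structure are invented for illustration, and the original crawler was written in Ruby.

```python
# Minimal web-scraping sketch; the endpoint and page structure are hypothetical.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.org/drug-prices")  # invented URL
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
records = []
for tr in soup.select("table#prices tr")[1:]:            # skip the header row
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if len(cells) == 3:                                   # country, drug, price
        records.append({"country": cells[0], "drug": cells[1], "price": cells[2]})

print(f"Collected {len(records)} price records")
```

The scraping itself is rarely the hard part; the hours go into handling inconsistent pages, rate limits, and verifying what was collected.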

Data Volume

With the age of Big Data upon us, how much data counts as "Big Data"? The Reuters Institute for the Study of Journalism's Big Data for Media report concluded that most Big Data would be measured in terabytes, petabytes, zettabytes, and beyond (Stone, 2017). Among the winning entries, the one that clearly qualifies is "The Panama Papers," which comprises 2.6 terabytes of data and 11.5 million files. Because most samples do not specify their data volume, we use the number of rows in the dataset to evaluate it: rows in the tens of thousands indicate a large volume of data, and rows in the millions can be classified as big data. Among the 30 identifiable samples, 14 reach the thousand-row level, 8 the tens-of-thousands level, and 3 the million level, indicating that the samples still mainly use small data. There are two main reasons. First, the data journalism production cycle is limited: in 24/7 news production, Big Data-based stories are bound to take more resources and time, and media organizations weigh whether the investment is worthwhile. Second, the ability to process big data is insufficient: most news organizations are short of data science talent and still rely on reporters and editors with liberal arts backgrounds for relatively low-level data journalism production, so some outlets cannot process big data even if they want to.

Table 4.1 Data types of Data Journalism Awards winners from 2013 to 2018

Data Type           Number   Percentage
Structured data     16       48.5%
Unstructured data    9       27.3%
Combined             8       24.2%

Data Type

Data can be divided into structured and unstructured data. Structured data is stored in a database with a defined logical and physical structure, and it is the type mainly used in daily news production. Unstructured data is not stored in a database but exists in various forms, such as text, images, video, audio, and web logs (Zhang, 2017). Of the 33 identifiable samples, 16 used entirely structured data and 9 used entirely unstructured data; for example, "Rhymes Behind Hamilton" used unstructured rhyme data. Eight used a combination of structured and unstructured data (see Table 4.1); for example, "Spies in the Skies" combined structured flight location data from the Flightradar24 flight tracker with unstructured data such as Associated Press reports, U.S. Department of Homeland Security (DHS) reports, and base station simulator parameter files. Seventeen of the samples used unstructured data in some form, suggesting that data journalism production has made great strides in the types of data it processes. The expansion from structured to unstructured data is a key breakthrough in data science skills for data journalists; it revises Adrian Holovaty's early understanding of data journalism as dealing in "structured data" and gives data journalism an increasingly data science-like quality.
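The practical difference between the two data types is easiest to see in code. The toy Python fragment below, with invented report lines, turns one kind of unstructured input, free-text descriptions, into structured rows that can be counted and charted.

```python
# Toy example: extracting structured records from unstructured text.
import re

reports = [
    "2016-03-04: surveillance flight over Baltimore, 4 hours",
    "2016-03-09: surveillance flight over Chicago, 2 hours",
]
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}): .* over (?P<city>\w+), (?P<hours>\d+) hours"
)

rows = [m.groupdict() for line in reports if (m := pattern.match(line))]
print(rows)
# [{'date': '2016-03-04', 'city': 'Baltimore', 'hours': '4'}, ...]
```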

Data Analysis Methods

There are many data analysis methods. Those commonly used by reporters include descriptive data analysis, exploratory data analysis, databases or data warehouses, machine learning, and information retrieval. Of the 31 identifiable samples, 20 used both descriptive and exploratory data analysis, and 9 used descriptive analysis only. Some samples also used databases or data warehouses (3 of 31), machine learning (2 of 31), and information retrieval (2 of 31) (see Table 4.2). Reality is complex, and so is the data that reflects it, which requires journalists to combine several analysis methods when interpreting complex issues. The Globe and Mail's "Unfounded" describes the means, extremes, and distribution of "unfounded" case closure rates, and also explores the correlation between the presence of female police officers and "unfounded" closure rates using correlation tests.

Table 4.2 Data analysis methods for Data Journalism Awards winners from 2013 to 2018

Data Analysis Method         Main Function                                    Number   Percentage
Descriptive data analysis    To describe statistical characteristics         29       93.5%
Exploratory data analysis    To discover new features                        20       64.5%
Database or data warehouse   To process large-scale, structured data          3        9.7%
Machine learning             To learn the intrinsic regularity in the data    2        6.5%
Information retrieval        To search for information in documents           2        6.5%

As an important branch of artificial intelligence, machine learning studies how computer programs automatically improve their performance with accumulated experience (Zhao, 2013, p. 6). "We Trained A Computer To Search For Hidden Spy Planes. This Is What It Found" identifies FBI and DHS spy planes from a variety of flight data using random forest algorithms. In "How does 'Hamilton,' the non-stop, hip-hop Broadway sensation tap rap's master rhymes to blur musical lines?", a syllable segmentation algorithm splits words into phonemes, and the Markov Cluster Algorithm (MCL) and simulated annealing (SA) are used to cluster the syllables. The raw files of Swiss Leaks were huge, with account information scattered across tens of thousands of seemingly unconnected files. Since traditional manual mining could no longer cope with such ponderous unstructured data, the team used Neo4j (a graph database platform), a technology combining database and information retrieval, to handle highly connected data and complex queries, transforming the connections into graph nodes and exploring the links between them.
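The random forest step in the spy-planes piece can be sketched compactly. The Python example below, using scikit-learn, shows the general shape of such a classifier; the feature values and labels are invented stand-ins, not BuzzFeed's actual data or code.

```python
# Sketch of random-forest classification of flight tracks (invented data).
from sklearn.ensemble import RandomForestClassifier

# Each row: [avg turn rate (deg/s), altitude (ft), speed (knots)]
tracks = [
    [2.8, 9500, 110],   # tight circling, low and slow: labeled surveillance
    [3.1, 8800, 105],
    [0.2, 35000, 460],  # straight and fast: labeled ordinary traffic
    [0.1, 37000, 480],
]
labels = [1, 1, 0, 0]   # 1 = suspected surveillance, 0 = ordinary

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(tracks, labels)

unknown = [[2.5, 9000, 115]]
print(model.predict(unknown))        # predicted class for an unlabeled track
print(model.predict_proba(unknown))  # class probabilities, useful for manual review
```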

Data Processing Complexity

How should the data processing complexity of the samples be evaluated? We classify data processing complexity into four levels: low, medium, relatively high, and high. Samples that directly present the original data are rated "low." Those that describe the numerical and distributional characteristics of the data, such as the mean, median, mode, variance, or distribution function, are rated "medium." Those using multiple statistical analysis methods (e.g., correlation analysis, regression analysis, dimensionality reduction, cluster analysis, or principal component analysis, or simple programming tools such as R and Python) are rated "relatively high." Those involving mathematical modeling, big data mining, or algorithmic innovation and improvement are rated "high." The last three levels can be classified as "professional." Among the 34 identifiable samples, 17 were rated "low" and 17 "professional" (see Table 4.3). Half of the works simply presented the raw data directly or presented counts and percentages of it.

Table 4.3 Data processing complexity of Data Journalism Awards winners from 2013 to 2018

Data Processing Complexity   Number   Percentage   Professional or not
Low                          17       50%          Not
Medium                        7       20.6%        Yes
Relatively high               5       14.7%        Yes
High                          5       14.7%        Yes

For example, the individual entry "People's Republic of Bolzano" presents the change in the number of Chinese people born in Bolzano from 2000 to 2013 and the share of the Chinese population in the total local population. The Wall Street Journal's "Battling Infectious Diseases in the 20th Century: The Impact of Vaccines" directly presents the case counts of major infectious diseases by U.S. state over the years. Some works, by contrast, show a high level of data processing. "We Trained A Computer To Search For Hidden Spy Planes. This Is What It Found" uses a large amount of flight tracking data from flight websites to find the flight paths of suspected FBI or DHS aircraft through machine learning. The algorithm first defines flight characteristic metrics, such as turn rate, altitude, and speed, and then trains a random forest to distinguish suspect flights from common flight traffic. With a large amount of data to process, this piece shows the power of machine learning, using a model that relies on labeled data to build predictions. The Globe and Mail's "Easy Money" collected data from different Canadian jurisdictions over 30 years, and the variety of sources and formats made data cleaning particularly difficult. The article innovatively defined a new statistical indicator, the national securities crime recidivism rate; repeated calculation and field research by the journalists verified the accuracy of the indicator, revealing problems in the governance of the country's financial markets. We assign each level a score: 0 for "low," 1 for "medium," 2 for "relatively high," and 3 for "high." The average score of the samples is only 0.94 ((17 × 0 + 7 × 1 + 5 × 2 + 5 × 3) / 34 = 32 / 34 ≈ 0.94), which indicates that many Data Journalism Awards entries operate at a low level of data processing: they focus mainly on presenting data results rather than on processing and analyzing the data itself.

Future Trends in the Application of Data Science to Data Journalism

An analysis of the Data Journalism Awards winners from 2013 to 2018 shows that data journalism performs unevenly across the indexes of data science professionalism. Some entries use big data, others show a high degree of overall professionalism, and a significant number leave a lot of room for improvement in data science. Looking across the winning entries, we see the following trends in the future application of data science to data journalism.

Building Databases to Provide Personalized Services and Innovate Profit Models

As the media usually sits "downstream" in the flow of data, journalists have difficulty accessing data because of data monopolies, missing data, and low data quality. Driven by the open data movement, journalists have gained more and more access to free datasets, and some media can collect various kinds of data on their own. Whether as a product form or as a method of data science analysis, databases are increasingly valued by the media; the Data Journalism Awards even include an Open Data Award to encourage the opening of databases that are in the public interest. There are two main ways for media to build their own databases. First, the media organize and clean open data sets into high-quality open databases. Many government open datasets have quality and format problems that are difficult to handle even when the public can access them. Secondary processing of existing open datasets not only saves costs but also improves the utilization of the data, establishing the media's brand image of serving the public. The second type is the "niche database": the media systematically integrate open data, public data, "leaked" data, and self-collected data around research questions to create a more personalized, user experience-oriented database. Such a database can sharpen the media outlet's positioning and help it provide in-depth services to specific users. "Follow the Money" by Postmedia, a 2018 Open Data Award winner, is a database of political donations that brings together records from all Canadian provinces and territories in one easy-to-search tool, gathering more than six million donation records that add up to about $2 billion (Postmedia, 2018) (a minimal sketch of the kind of table behind such a tool appears after the list of profit models below). ProPublica has in recent years created a series of databases on U.S. health care, retirement, education, and other areas. The Washington Post spent a year compiling and publishing its own database of school shootings in response to the lack of official data on their frequency in the United States. Self-built databases also help media accumulate data resources and improve the efficiency of data journalism production. "Derailed Amtrak Train Sped into Deadly Crash Curve," the 2016 Al-Jazeera winner for "Best Use of Data in a Breaking News Story," was completed in a short period of time largely because the journalists had accumulated the data a year earlier. Self-built databases help the media build "strong ties" with users through interactive design, authoritative data, and topics of public interest, achieving both social and economic benefits. As a kind of product, databases have various profit models. First, the media use the attention the databases generate to sell advertising. The Texas Tribune presents nearly 40 databases on an online platform distinct from its traditional news reporting, and the databases are accessed three times more than the average news story. Second, the media offer fee-based download services for datasets. Since launching its Data Store in 2014, ProPublica has generated more than $200,000 in revenue; selling valuable proprietary datasets is a relatively low-effort way for investigation-heavy news organizations to bring in new incremental revenue from assets that would otherwise have sat dormant (Bilton, 2016).

Third, the media provides user-oriented, targeted services based on databases. Thomson Reuters' core business is to gather data in a particular setting (e.g., credit risk, marketing), build models with it, and sell the output of these models (e.g., scores), possibly together with the underlying raw data, to interested customers (Baesens, 2016, p. 14).
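At its core, a "Follow the Money"-style product is often just a well-indexed table behind a search form. A minimal sketch with Python's built-in sqlite3, using an invented schema and invented rows rather than Postmedia's actual pipeline:

```python
# Minimal searchable-donations-database sketch (schema and rows are invented).
import sqlite3

conn = sqlite3.connect("donations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS donations (
        donor TEXT, party TEXT, province TEXT, amount REAL, year INTEGER
    )
""")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?, ?, ?)",
    [
        ("A. Donor", "Party X", "Ontario", 1200.0, 2016),
        ("B. Donor", "Party Y", "Alberta", 350.0, 2017),
    ],
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_donor ON donations(donor)")

# The query behind an "easy-to-search tool": total giving per donor.
for donor, total in conn.execute(
    "SELECT donor, SUM(amount) FROM donations GROUP BY donor ORDER BY 2 DESC"
):
    print(donor, total)
conn.close()
```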

Using Unstructured Data to Represent the Broader Social Reality

According to projections from IDC (International Data Corporation), 80% of worldwide data will be unstructured by 2025 (King, 2019). The value of big data lies in the insight into social reality that can be drawn from huge amounts of unstructured data, and the use of unstructured data will increase as it becomes an inevitable choice for news production in the Big Data era. While many media outlets now have basic unstructured data mining and analysis skills, well-known outlets such as The New York Times, The Washington Post, The Guardian, and The Financial Times have gone a step further: they can mine and analyze large-scale unstructured data. Data journalism embraces unstructured data for the following reasons. First, unstructured data can meet the needs of data journalism production. Open data is often released late and is of questionable quality, while self-collected data requires heavy investment of human and material resources and is unlikely to become the normal way of collecting data. Expanding to unstructured data reduces journalists' reliance on open data; unstructured data is more ubiquitous and accessible than structured data, allowing the media to cover more topics and better serve society. Second, unstructured data is more "honest" than structured data. The processing of structured data relies on statistical methods, which are inevitably biased in representing reality, whereas unstructured data contains complete, continuous information and key details and is more reliable in representing reality. Third, the media's ability to apply data science has increased. Structured data can be processed with basic statistical principles and software such as Excel and SPSS, while unstructured data places far higher demands on the media. Some outlets are now hiring programmers or collaborating with partners to improve their ability to process unstructured data, which will become an important criterion for measuring data journalism productivity and will lead to disruptive innovation in data journalism production. Whoever can process unstructured data will hold the initiative in the era of big data.

Using Machine Learning to Enhance Big Data Processing

In the era of intelligent media, intelligent data journalism production has become an important development trend, and machine learning is expected to become a must-have technology for processing large data sets in the coming years. Media outlets currently at the forefront of data journalism, such as the Associated Press,

Reuters, The New York Times, The Los Angeles Times, The Washington Post, The Chicago Tribune, BuzzFeed, and ProPublica, are also at the forefront of artificial intelligence technology. There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning uses a training set to teach models to yield the desired output; the training dataset includes inputs and correct outputs, which allow the model to learn over time (IBM Cloud Education, 2020). It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process (Brownlee, 2016). Unsupervised learning is defined as the process of optimizing cluster prototypes based on the similarities among the training examples (Jo, 2021, p. 12). Reinforcement learning is a special type of learning that needs to be treated separately from supervised and unsupervised methods: the system continuously interacts with the environment in search of the desired behavior and gets feedback from the environment (Joshi, 2020, p. 11). There are three main applications of machine learning in data journalism production. First, machine learning helps media with classification and prediction. Supervised learning can help journalists quickly identify and access the data they need. When journalists face large data sets, relying on manual judgment and selection is time-consuming and error-prone; supervised learning can quickly and efficiently identify the data journalists need based on a designed model and is especially useful for large, regular data. Journalists can also use regression analysis, a form of supervised learning, to make predictions. In "Doctors & Sex Abuse" from The Atlanta Journal-Constitution, journalists used 50 crawlers to obtain more than 100,000 physician disciplinary documents from the U.S. health care system, then used machine learning to clean and analyze the documents and retrieve keywords involving sexual assault (Global Investigative Journalism Network, 2017). The effectiveness of supervised learning depends heavily on the reliability of the algorithm and of the training data; if either is unreliable, the results will be wrong. In one Los Angeles Times investigation powered by machine learning, the algorithm had a 24% error rate and required manual review (Nahser, 2018). Second, machine learning helps the media gain insight into the world. Journalists' cognition and experience are limited in the face of huge amounts of data, and relying solely on supervised learning algorithms designed by journalists is not sufficient. Unsupervised learning can autonomously find correlations in massive data and identify the hidden structure of the data. The AP's data journalism program used unsupervised learning to find typical cases of gun misuse from 140,000 manually entered case records, extrapolating the probability of intentional shooting by suspects in cases involving children or police officers (Yu & Chen, 2018). In August 2017, Google and ProPublica launched a machine learning analytics tool called the Documenting Hate News Index, which uses machine learning to generate its content and helps reporters by digging locations, names, and other useful data out of more than 3,000 news reports (Rogers, 2017).
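The unsupervised grouping described for the AP's case records can also be sketched briefly. The snippet below clusters invented two-feature records with scikit-learn's KMeans; it illustrates the idea only and is not the AP's actual method.

```python
# Toy unsupervised-learning example: clustering invented case records.
from sklearn.cluster import KMeans

# Each row: [age of person involved, number of prior incidents] (invented)
records = [[6, 0], [8, 1], [7, 0], [34, 5], [41, 7], [38, 6]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
print(kmeans.labels_)           # two distinct groups emerge, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the centers summarize what each group looks like
```

No labels are supplied; the structure is discovered from similarity alone, which is what makes the method useful when journalists do not yet know what patterns a data set contains.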

Third, machine learning helps media make decisions. Reinforcement learning can help journalists make decisions in specific contexts, a learning approach that is still relatively uncommon in news production; its most famous case is AlphaGo. The "Rock-Paper-Scissors" interactive page launched by The New York Times exploits a person's tendencies and patterns to gain an advantage over its opponent (Bradshaw, 2017). Some media outlets use reinforcement learning to determine the most effective content recommendation options.
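Choosing among recommendation options is often framed as a multi-armed bandit problem, one of the simplest forms of reinforcement learning. The sketch below runs an epsilon-greedy strategy over invented click-through rates; it is purely illustrative, not any newsroom's production system.

```python
# Epsilon-greedy bandit sketch: learning which of three (invented) article
# teasers earns the most clicks. User behavior is simulated, not real data.
import random

true_ctr = [0.04, 0.06, 0.09]   # hidden click-through rate of each teaser
clicks, shows = [0, 0, 0], [0, 0, 0]
epsilon = 0.1                   # fraction of traffic reserved for exploration

random.seed(0)
for _ in range(10_000):
    if random.random() < epsilon or 0 in shows:
        arm = random.randrange(3)                                # explore
    else:
        arm = max(range(3), key=lambda a: clicks[a] / shows[a])  # exploit
    shows[arm] += 1
    clicks[arm] += random.random() < true_ctr[arm]               # simulated click

print([round(c / s, 3) for c, s in zip(clicks, shows)])  # estimated CTR per teaser
print("teaser shown most:", shows.index(max(shows)))     # converges on the best one
```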
The application of machine learning will undoubtedly enhance the professionalism of data science in data journalism production, help the journalism industry discover truth in big data, and move news reporting from "representational reality" closer to "mirror reality." For media deeply committed to data journalism, mastering machine learning is an inevitable choice.

Data journalism is an opportunity for journalism to "re-professionalize" itself. Our study of the 2013–2018 Data Journalism Awards winners shows that the "re-professionalization" of data journalism is not holistic but partial, and it has a long way to go. Data science methodology is the core of data journalism production, so journalists' mastery of it determines how well they can understand the world through data. At a time when big data is everywhere and artificial intelligence is developing rapidly, data journalism needs to keep enhancing its professionalism and irreplaceability so that it can consolidate the legitimacy of journalism and meet the public's expectations.

Note

1 In this study, after excluding awarded works with invalid links, the samples were identified based on the pre-designed analysis categories. The number of identifiable samples varies across analysis categories because some works do not provide downloadable raw data in a particular category.
References

Baesens, B. (2016). Analytics in a big data world: The essential guide to data science and its applications (X. Ke & J. Zhang, Trans.). Posts & Telecom Press.
Beaujon, A. (2013). N.Y.'s tough new gun law also prohibits disclosure of gun owners' names. Poynter. www.poynter.org/2013/n-y-s-tough-new-gun-law-also-prohibits-disclosure-of-gun-owners-names/200714/
Bilton, R. (Ed.). (2016). ProPublica's data store, which has pulled in $200K, is now selling datasets for other news orgs. Nieman Lab. www.niemanlab.org/2016/10/propublicas-data-store-which-has-pulled-in-200k-is-now-selling-datasets-for-other-news-orgs/
Bradshaw, P. (Ed.). (2013, September 13). Ethics in data journalism: Accuracy. Online Journalism Blog. https://onlinejournalismblog.com/2013/09/13/ethics-in-data-journalism-accuracy/
Bradshaw, P. (Ed.). (2017). Data journalism's AI opportunity: The 3 different types of machine learning & how they have already been used. Online Journalism Blog. https://onlinejournalismblog.com/2017/12/14/data-journalisms-ai-opportunity-the-3-different-types-of-machine-learning-how-they-have-already-been-used/
Brownlee, J. (Ed.). (2016). Supervised and unsupervised machine learning algorithms. Machine Learning Mastery. https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
Chao, L. (2016). Data science. Tsinghua University Press. (Published in China.)
Chao, L., & Lu, X. (2017). Data science and its implications on information science. Journal of the China Society for Scientific and Technical Information, 36(8), 761–771. (Published in China.)
Cindy (Ed.). (2016). Why did big data predictions fail in 2016. InfoQ. https://mp.weixin.qq.com/s/6-B17oOEXdx0cwweYCG9fg? (Published in China.)
Cohn, N. (Ed.). (2016). We gave four good pollsters the same raw data. They had four different results. The New York Times. www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html?rref=collection%2Fsectioncollection%2Fupshot&_r=0
Data Journalism (Ed.). (n.d.). Ethical dilemmas in data journalism. Data Journalism. https://datajournalism.com/read/newsletters/ethical-dilemmas-in-data-journalism
Diakopoulos, N. (Ed.). (2016). BuzzFeed's pro tennis investigation displays ethical dilemmas of data journalism. Columbia Journalism Review. www.cjr.org/tow_center/transparency_algorithms_buzzfeed.php
D'Ignazio, C. (Ed.). (2019). Putting data back into context. Data Journalism. https://datajournalism.com/read/longreads/putting-data-back-into-context?utm_source=sendinblue&utm_campaign=Conversations_with_Data_May_Ethical_Dilemmas&utm_medium=email
Du, Z., & Cha, H. (2016). On opening government data to the public. Journal of Capital Normal University (Social Sciences Edition) (5), 74–80. (Published in China.)
EDJNet (Ed.). (2019). Europe is warming, rapidly. EDJNet. www.onedegreewarmer.eu/
Fang, J., & Hu, W. (2018). Global practice in data journalism education: Characteristics, constraints and trends. In W. Qiong & S. Hongyuan (Eds.), China data journalism development report (2016–2017). Social Sciences Academic Press (China). (Published in China.)
Fang, J., Hu, Y., & Fan, D. (2016). Data journalism practice in the eyes of journalists: Value, path and prospect – A study based on in-depth interviews with seven journalists. Journalism Research (2), 74–80. (Published in China.)
Fletcher, J. (Ed.). (2014). Spurious correlations: Margarine linked to divorce? BBC. www.bbc.com/news/magazine-27537142
Freedominfo (Ed.). (2016). Eight countries adopt FOI regimes in 2016. Freedominfo. www.freedominfo.org/2016/12/eight-countries-adopt-foi-regimes-2016/
Global Investigative Journalism Network (Ed.). (2017). The 2016 U.S. data journalism awards are another example of in-depth reporting. Global Investigative Journalism Network. https://cn.gijn.org/2017/01/25/2016
Gray, J., Chambers, L., & Bounegru, L. (2012). The data journalism handbook. O'Reilly Media.
Groskopf, C. (Ed.). (2015). The Quartz guide to bad data. Quartz. https://qz.com/572338/the-quartz-guide-to-bad-data/#user-content-frame-of-reference-has-been-manipulated
Heravi, B. R. (2018). Data journalism in 2017: A summary of results from the global data journalism survey. In G. Chowdhury, J. McLeod, V. Gillet, & P. Willett (Eds.), Transforming digital worlds. iConference 2018. Lecture Notes in Computer Science, vol. 10766. Springer.
Heravi, B. R., & Ojo, A. (Eds.). (2017). What makes a winning data story? UCD iSchool. https://medium.com/@Bahareh/what-makes-a-winning-data-story-7090e1b1d0fc#.3fbubynuo
Hong, Y. (Ed.). (2014). Global Editorial Network data journalism awards announced. Djchina. http://djchina.org/2014/07/13/gendja_2014/ (Published in China.)

Howard, A. (Ed.). (2013). On the ethics of data-driven journalism: Of fact, friction and public records in a more transparent age. The Tow Center. https://medium.com/tow-center/on-the-ethics-of-data-driven-journalism-of-fact-friction-and-public-records-in-a-more-transparent-a063806e0ee3
Hu, Y., Pratt, & Chen, L. (2016). Data myths in the U.S. election news. The Press (23), 133–135. (Published in China.)
IBM Cloud Education (Ed.). (2020). Supervised learning. IBM. www.ibm.com/cloud/learn/supervised-learning
Jian, J., He, X., & Jin, Y. (2015). Statistics. China Renmin University Press. (Published in China.)
Jo, T. (2021). Machine learning foundations. Springer.
Joshi, A. V. (2020). Machine learning and artificial intelligence. Springer.
King, T. (Ed.). (2019). 80 percent of your data will be unstructured in five years. Data Management Solutions Review. https://solutionsreview.com/data-management/80-percent-of-your-data-will-be-unstructured-in-five-years/
Lee, Y. W., Pipino, L. L., Funk, J. D., & Wang, R. Y. (2006). Journey to data quality. The MIT Press.
Li, M. (2013). Professional ability training for news gathering and editing staff in the era of big data. China Publishing Journal (17), 26–30. (Published in China.)
Li, Y. (2012). Reinvent your major or move away from it? – Analyzing the career model of online journalism from the cognitive dimension. Shanghai Journalism Review (12), 42–48. (Published in China.)
Li, Y. (2014). A study on the characteristics of western data journalism – Taking the practice of The Guardian as an example [Unpublished master dissertation]. Renmin University of China. (Published in China.)
Liang, Y., & Xu, X. (2015). Principles, algorithms and applications of data mining. China Machine Press. (Published in China.)
Liu, S. (2015). In the age of technology, journalism can use sensors as reporting tools. Science & Technology for China's Mass Media (6), 30–32. (Published in China.)
Liu, T. (2018). On context: Interpretation method and visual rhetoric analysis. Journal of Northwest Normal University (Social Sciences) (1), 5–15. (Published in China.)
McBride, K., & Rosenstiel, T. (2013). The new ethics of journalism: Principles for the 21st century. CQ Press.
McBride, R. (Ed.). (n.d.). Giving data soul: Best practices for ethical data journalism. Data Journalism. https://datajournalism.com/read/longreads/giving-data-soul-best-practices-for-ethical-data-journalism?utm_source=sendinblue&utm_campaign=Conversations_with_Data_May_Ethical_Dilemmas&utm_medium=email
McMaken, R. (2015). With few gun laws, New Hampshire is safer than Canada. Mises Wire. https://mises.org/wire/few-gun-laws-new-hampshire-safer-canada
Mitter, S. (Ed.). (2013). The cartography of bullshit. Africa is a country. https://africasacountry.com/2013/05/the-cartography-of-bullshit
Nahser, F. (Ed.). (2018). Three examples of machine learning in the newsroom. Global Editors Network. https://medium.com/global-editors-network/three-examples-of-machine-learning-in-the-newsroom-1b47d1f7515a
Netease data blog (Ed.). (2019). We searched CAI's micro blog and found the routine of fans' brushing control comments. www.163.com/data/article/ED4QAHR9000181IU.html
Ornstein, C. (Ed.). (2014). What to be wary of in the govt's new site detailing industry money to docs. ProPublica. www.propublica.org/article/what-to-be-wary-of-in-the-govts-new-site-detailing-industry-money-to-docs

Peng, L. (2017). Better journalism, or worse journalism? – New challenges for the media industry in the era of artificial intelligence. China Publishing Journal (24), 3–8. (Published in China.)
Postmedia (Ed.). (2018). Political donations database allows Canadians to follow the money. Postmedia. www.postmedia.com/2018/03/20/political-donations-database-allows-canadians-to-follow-the-money/
Qiu, Y. (Ed.). (2015a). I'm in the American newsroom: ProPublica, who likes to chew hard data. News Lab. https://posts.careerengine.us/p/5aff37e7f8630d5470f86b17 (Published in China.)
Qiu, Y. (Ed.). (2015b). ProPublica, who likes to chew hard data. News Lab. http://djchina.org/2015/01/22/data_newsroom_propublica/ (Published in China.)
Reid, A. (Ed.). (2014). Guardian forms new editorial teams to enhance digital output. Journalism. www.journalism.co.uk/news/guardian-forms-new-editorial-teams-to-enhance-digital-output/s2/a562755/
Rogers, S. (Ed.). (2011). Data journalism at the Guardian: What is it and how do we do it? The Guardian. www.theguardian.com/news/datablog/2011/jul/28/data-journalism
Rogers, S. (2017, August 18). A new machine learning app for reporting on hate in America. Simon Rogers. https://blog.google/outreach-initiatives/google-news-initiative/new-machine-learning-app-reporting-hate-america/
Rusbridger, A., & Rees, J. (Eds.). (n.d.). Reading the riots: Investigating England's summer of disorder. The Guardian & London School of Economics. http://eprints.lse.ac.uk/46297/1/Reading%20the%20riots(published).pdf
Selby-Boothroyd, A. (Ed.). (2019). The Economist's "Build a voter" models. https://datajournalismawards.org/projects/the-economists-build-a-voter-models/
Sinha, G. R., Srinivasarao, U., & Sharaff, A. (2021). Introduction to data science: Review, challenges, and opportunities. In A. Sharaff & G. R. Sinha (Eds.), Data science and its applications. CRC Press.
Skeptical Science (Ed.). (2014). Cherry picked and misrepresented climate science undermines FiveThirtyEight brand. Skeptical Science. https://skepticalscience.com/fivethirtyeight-pielke-downplay-climate-damages.html
Song, J. (Ed.). (2016). Sensor news in DJA 2016. dyclub. http://mp.weixin.qq.com/s/_aRzWgCBtA8TDykOTosYTg (Published in China.)
Stone, M. L. (Ed.). (2017). Big data for media. Reuters Institute for the Study of Journalism. https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2017-04/Big%20Data%20For%20Media_0.pdf
Stoneman, J. (Ed.). (2015). Does open data need journalism? Oxford University Research Archive. https://ora.ox.ac.uk/objects/uuid:c22432ea-3ddc-40ad-a72b-ee9566d22b97
Stray, J. (Ed.). (2014). Draw conclusions from data (Y. Yang & K. Fang, Trans.). Fangkc. http://fangkc.cn/2014/01/drawing-conclusions-from-data/
Sun, H., Nan, T., Wei, H., Li, J., & Ma, Y. (Eds.). (2015). Government monopolies result in the waste of public data. China News. www.chinanews.com/cj/2015/02-25/7076397.shtml (Published in China.)
Tang, S., & Liu, Y. (2014). 'Impression' of global government data openness – An interpretation of the open data barometer. Foreign Investment in China (5), 28–31. (Published in China.)
Wainwright, D. (Ed.). (2017). How annoying is Heathrow airport? BBC data reporters use a crawler to find out (S. Liang, Trans.). Global Investigative Journalism Network. https://cn.gijn.org/2017/01/12/伦敦希思罗机场有多扰民？bbc数据记者用爬虫揭晓/

Wang, G. (2015). Recontexualization into poetry translation: A case of Long Yingtai's translation of The Rhodora. Foreign Languages and Their Teaching (2), 80–85. (Published in China.)
Watson, M., & Nelson, D. (2017). Managerial analytics: An applied guide to principles, methods, tools, and best practices (Z. Wang & Q. Wang, Trans.). China Machine Press. (Published in China.)
Weiner, R. (Ed.). (2012). N.Y. newspaper's gun-owner database draws criticism. USA Today. www.usatoday.com/story/news/nation/2012/12/26/gun-database-draws-criticism/1791507/
Whitby, A. (Ed.). (2014). Guide to bad data journalism. Prezi. https://prezi.com/pweevqs1hunh/guide-to-bad-data-journalism/
Whitby, A. (Ed.). (2016). Data visualization and truth. Andrew Whitby. https://andrewwhitby.com/2016/09/26/data-visualization-and-truth/
World Wide Web Foundation (Ed.). (n.d.). Executive summary and key findings. The Open Data Barometer. http://opendatabarometer.org/3rdEdition/report/#executive_summary
World Wide Web Foundation (Ed.). (2016). Open data barometer (3rd ed.). https://webfoundation.org/research/open-data-barometer-3rd-edition/
Xu, X. (2015). Application of sensors in data journalism. News and Writing (12), 70–72. (Published in China.)
You Shu Editorial Department (Ed.). (2019). A new approach to data narrative: Data journalist conference speech transcript (2). https://mp.weixin.qq.com/s/uLsLO4MNzPn2wz2fDxKiRg (Published in China.)
Yu, T., & Chen, S. (2018). The application and impact of artificial intelligence in American journalism. Shanghai Journalism Review (4), 33–42. (Published in China.)
Zhang, C. (2018). An analysis of the pragmatic objectivity of data journalism. Academic Journal of Zhongzhou (9), 166–173. (Published in China.)
Zhang, C., Shan, X., & Liu, J. (2019). From deprofessionalisation to re-professionalisation: Applications and trends in data science for data journalism. China Publishing Journal (1), 25–28. (Published in China.)
Zhang, Z. (2017). The classification methods for structural data and non-structural data. Journal of Ningde Normal University (Natural Science) (4), 417–420. (Published in China.)
Zhao, S., Tang, H., & Xiong, H. (2015). Big data analysis and application. Aviation Industry Press. (Published in China.)
Zhao, Y. (2013). Philosophical explorations in machine learning. Central Compilation & Translation Press. (Published in China.)
Zheng, L. (2016). The real dilemmas of open data. New Media (4), 48–49. (Published in China.)
Zheng, L., & Gao, F. (Eds.). (2015). How much data has the government opened up? China open data lens 2015. Open Data China. http://mp.weixin.qq.com/s/mIsmlhk2BKzOHBrJdFkKbg (Published in China.)
Zhou, H., & Wu, X. (2015). Rethinking the crisis of journalism: The power of culture – Reflections on cultural sociology by Professor Geoffrey Alexander. Shanghai Journalism Review (3), 4–12. (Published in China.)

5

Representing Reality: The Narrative of Data Journalism

With the increasing ability of data to interpret and represent reality, human beings are able to achieve a so-called "mirrored existence": a way of being based on computers, networks, and other hardware that uses digital data and its operations to represent the various real relationships of the material world (Jia & Xu, 2013). Humans expect data journalism to be a true reflection of the objective world, and perhaps "Big Data," now in its initial stage, may one day realize this long-cherished dream. Yet "the key word in data journalism is journalism," according to Steve Doig, a data journalist for The Miami Herald (Sunne, 2016). Data journalism differs from traditional journalism in various ways, but it is still essentially journalism. Therefore, data journalism "represents" the world; it does not "mirror" it.

Discourse and Narrative

Discourse refers to a systematically organized set of statements and is an important concept in contemporary cultural and media studies. Discourse defines and describes what can and cannot be said and what can and cannot be done, providing a set of possible expressions for given domains, topics, objects, and processes to be talked about (Fang, 2014, p. 93). According to Michel Foucault, discourse is a way of organizing knowledge that structures the constitution of social (and progressively global) relations through the collective understanding of the discursive logic and the acceptance of the discourse as social fact (Adams, 2017). News discourse can therefore be defined as an act of constructing meaning with subjective intention based on the rules of news narrative language, text, and story (Fang, 2014, p. 94). The structure of news discourse is directly influenced by the process and ideology of news production and indirectly influenced by the institutional and macro-social environment of the news media (Liu, 2006). In modern democracies, news has become an extremely important discourse with the power to articulate and interpret social truth (Sun, 2005).

Data Journalism as Discourse

Our view of data journalism as "discourse" suggests that we take a critical view of data journalism production. Mathematical language has now become a way to

grasp the meaning of images and has been transformed into a "postmodern" science of discourse (Liao, 2016). Jürgen Habermas argues that science and technology have become a "new ideology." The ideology of science and technology differs from the traditional ideology of the classical capitalist era in that it is non-political. Modern science and technology, as latent ideologies, have penetrated the consciousness of the public, and the public has become used to understanding life with the help of the discourse of science and technology (Wu, 2008, p. 49). Data journalism is a genre of journalistic discourse. The "articulation" of data and news gives data journalism a unique discursive connotation. The heart of data journalism is data science. According to Julian Assange, "journalism should be more like science" (Moss, 2010), and data journalism has also been referred to as "scientific journalism" (Khatchadourian, 2020). Following Assange's logic, combining journalism with data science should empower the media to produce data knowledge and earn more trust. If we look back at the history of journalism, however, we can see that our understanding of every kind of news discourse has gone through a process from optimism to criticism. Text, photos, and images were each once considered neutral and transparent, and each went from being trusted to being distrusted. The same thing is happening today: the data and data science methods used in journalism are seen as "scientific," and data visualization, which converts data into visual language, is seen as "transparent." Yet according to Vilém Flusser, images can only be mediations between the world and human beings (Flusser, 2000). Whether news discourse takes the form of language, data, data science methods, or images, it is only a translation of objective reality rather than reality itself. It has been argued that framing in data journalism consists of two phases: the first stage, building logical structures, is drawing conclusions from the data; the second, building forms, is visualizing the data (Mao et al., 2016). Through data visualization, data journalism not only restores reality in a visual sense but also reconstructs it (Liu, 2016). Data journalism production is not a process in which meaning is generated automatically and truth emerges naturally. The algorithms that quietly contribute to data journalism, for example, appear to be "objective," but they are not: a researcher's unconscious cognitive bias may be expressed through the selection criteria chosen for the algorithm (Broussard, 2016). Additionally, the data cleaning phase is inherently subjective, as analysts make decisions about which variables, independent and dependent, will be counted (Boyd & Crawford, 2012). Although the same data and algorithms will yield the same results, the interpretation of those results varies among media outlets. Meaning is not simply given, but is socially constructed across a number of institutional sites and practices (Best & Kellner, 1991, p. 26). We have not found two media analyses of the same data set, but The New York Times' "One Report, Diverging Perspectives" vividly illustrates how interpretations of the same data can differ (see QR code 5.1).


QR code 5.1 One report, diverging perspectives. https://archive.nytimes.com/www.nytimes.com/interactive/2012/10/05/business/economy/one-report-diverging-perspectives.html

The jobs report released by the U.S. government in October 2012 conveyed two main pieces of information: in September, the U.S. added 114,000 jobs, and the unemployment rate dropped to 7.8%. Faced with the same report, different political parties saw different stories. Democrats focused on the 31 consecutive months of job growth under Obama since the 2008 economic crisis, with the unemployment rate dropping more than 2 percentage points, from 10% in 2009 to 7.8%. Republicans countered that 150,000 new jobs are needed just to balance population growth, and that unemployment had remained above 8% for 43 consecutive months. It is evident that data journalism, in both its data processing and its narrative, is influenced by the producer from beginning to end.

In-depth interview (Nicolas Kayser-Bril, Head of Journalism++, via email): Everything about storytelling is subjective, from which topic to choose to which story to tell.

Post-Classical Narrative Perspectives on Data Journalism

Narrative Research

Narratology is the study of narrative as a genre. Its objective is to describe the constants, variables, and combinations typical of narrative and to clarify how these characteristics of narrative texts connect within the framework of theoretical models (typologies) (Fludernik, 2009, p. 8). Narratology took shape in the 1960s under the heavy influence of Structuralism and Russian Formalism. With its focus upon either the story or the discourse of narratives, classical narratology vowed to find, in a "scientific" way, the general structural rules of all narratives (Tang, 2003). In Narrative Discourse, the French narratologist Gérard Genette revises the story-discourse dichotomy by differentiating story (histoire), narrative (récit), and narrating (narration): the word story is used for the signified or narrative content; the word narrative for the signifier, statement, discourse, or narrative text itself; and the word narrating for the producing narrative action and, by extension, the whole of the real or fictional situation in which that action takes place (Genette, 1983, p. 27). In the 1990s, influenced by various rising linguistic and literary theories, the paradigms of narratological study underwent major changes: not only taking into their scope the context of both the utterance and the

interpretation of narratives, but also embracing disciplines in other fields, thus bringing the field into an entirely new stage of pluralism and significantly broadening its horizon (Tang, 2003). The introduction of narratology by Chinese journalism and communication researchers began in the early 1990s and was mainly used to study news writing techniques and news reporting. Fang Yihua (2016) explains the differences between news narratives and literary narratives in five aspects: content preferences, presentation forms, textual properties, contexts, and expression. At present, Chinese scholars' research on news narratives basically follows the classical narratology approach, studying the narrator and point of view in news reports. The extent to which the paradigm of classical narratology can be used to study data journalism is questionable. In the era of new media, the expansion of news has pushed news texts beyond the definition of classical narratology. The concept of text in data journalism is quite broad. Some data news still conforms to the traditional perception of news texts, such as works consisting of text and data visualizations in newspapers, magazines, and television. However, some data news works, in the form of data applications or interactive data, are totally different from traditional news texts. Such data news texts can be better analyzed with the newer narratological view that narrative is a process. The narratives of data journalism also exhibit differences from traditional news narratives. In the 2012 U.S. election coverage, traditional journalists reported that the presidential election was volatile and competitive and would only be resolved on Election Night; at the same time, data journalists were saying that President Obama was consistently and overwhelmingly favored to win reelection, which turned out to be true (Solop & Wonders, 2016). While many people talk about the importance of data storytelling, few use data to tell stories (Citraro, 2014). Only a limited number of data news works really tell a story; many are just collections of information, data, or charts. In fact, traditional narrative approaches have played a less important role in data journalism (Fang, 2015, p. 43). Based on analyses of the types, fact-finding methods, and writing models of the winning entries of the Data Journalism Awards, Zeng et al. (2017) argued that data journalism follows the social science research paradigm, an argument I consider questionable. Unlike classical narratology, our discussion of data news narratives in this book combines narrative research with theories of technology, rhetoric, and discourse, focusing on the narrative dimensions, narrative modes, word-image relations, and narrative principles of data journalism.

Dimensions of Complex Narratives in Data Journalism

There is a distinction between simple news facts and complex news facts. The former are much easier to understand, consisting of a limited number of matters that are only loosely connected; the latter can be understood

only through rational understanding, as they contain many matters with a high degree of interconnectedness (Yang, 2001, pp. 39–40). Although data visualization, with its intuitiveness and interactivity, gives data journalism an advantage in representing complex issues, this does not answer the question of which aspects of complex narratives data journalism actually reflects. It has been argued that for data visualizations involving multiple data objects and various relationships between data, the core of the representation is to reveal the structural relationships of the data, such as association, comparison, and evolution (Peng, 2015). Our analysis of the Data Journalism Awards winners and other outstanding works reveals that the narratives and analyses of complex issues in data news stories are mostly organized around four dimensions: the temporal dimension, the spatial dimension, the social network dimension, and the data relationship dimension. A data news story can be organized along one or several narrative dimensions, and each dimension can be of a different degree of complexity, depending on the specific issue.

Temporal Dimension: Depicting Changes and Trends

Events have been defined as processes. A process is a change, a development, and presupposes therefore a succession in time or a chronology. The events themselves happen during a certain period of time, and they occur in a certain order. (Bal, 2017, p. 177)

In data journalism, time is a dimension that helps one construct a robust and intuitive framework to establish connections between events (Yang, 2015). Any event or phenomenon takes place in the temporal dimension. For a piece of news that covers a long period of time or includes many events, the traditional verbal narrative is often constrained by limited space or length. The timeline in data visualization, by contrast, allows a clear and visual presentation of the chronological or cause-and-effect relationships between events (Zhang G., 2013).

In the temporal dimension, moments and time periods appear in four main combinations: (1) a single moment; (2) a time period made up of multiple moments in series; (3) independent moments; and (4) independent time periods. A single moment usually appears as the background. Except for the single moment, the other three can be narrated with a timeline. Besides this, three types of time points are of importance: the most recent time point, compared time points, and particular time points (Hu, 2017). The logic of showing the evolution of things is often used in data visualization (Peng, 2015), and timelines are essentially the strategic deployment and framing of power discourse in the temporal dimension (Liu, 2016). Timelines connect different moments and time periods, embedding specific topics and events into a macro historical narrative picture and combining them with other information to reflect changes, comparisons, or trends, allowing the audience to understand the news in its historical context.

The discursive or ideological property of time is due to the fact that time is an important element of context. Data journalism can either focus on moments in the present or place them in a historical narrative to re-contextualize the facts. However, it is impossible to cover an infinite time span in a news story; therefore, the choice of beginning and ending points is a process of meaning construction. The Wall Street Journal's "Track National Unemployment, Job Gains and Job Losses" presents data on unemployment rates from January 1948 to August 2016, from which some patterns linking recessions and unemployment can be seen, such as the fact that recession months tend not to be the months with the highest unemployment rates and that the lowest unemployment rates tend to occur one to two years after the end of a recession (see QR code 5.2). The increase in unemployment is represented by the change from green to red. At the macro level, the audience can easily see the cyclical recessions that have occurred in the U.S. across the nearly 70-year chart, and can also identify the two peaks in unemployment, namely from 1983 to 1984 and from 2009 to 2011. The audience can also conclude that the fall of the unemployment rate to 4.9% in 2016 does not mean the best of times has arrived, only that the worst has ended. The Guardian's "China's Financial Crisis" chooses June 2015 as the starting point of the Chinese stock market curve, which is exactly when the Chinese stock market began its huge decline, and presents China's turbulent economic situation by intentionally "stretching" the data scale on the chart (Liu, 2016).

While timelines can provide a great deal of information on the temporal dimension, they also "simplify" the historical context of events. In particular, comparisons on the same timeline force different actors into the same history, even though each actor may face a different situation at the same point in time. A timeline may omit key information about the actors, serving instead the subjective intention of its creator. The Washington Post's "Eight countries, 2056 nuclear tests, 71 years" compares the number of nuclear tests and nuclear warhead stockpiles of eight countries over a 71-year period (see QR code 5.3). It is easy for audiences to see that while the world's nuclear powers stopped testing in the 1990s and are reducing their stockpiles of nuclear warheads, North Korea remains an exception. While North Korea's five nuclear tests in 71 years are unremarkable in terms of numbers, it is the only country to have conducted a nuclear test in the 21st century. Therefore, this news story "naturally" stresses North Korea's "disobedience" against the global trend in the form of a timeline.

QR code 5.2 Track national unemployment, job gains and job losses. http://graphics.wsj.com/job-market-tracker/


QR code 5.3 North Korea is the only country that has performed a nuclear test in the 21st century. www.washingtonpost.com/graphics/world/nuclear-tests/

The time points marked on the timeline include the Castle Bravo hydrogen bomb test in 1954, the peak of annual U.S. nuclear tests (96) in 1962, and the beginning of a significant decline in the U.S. nuclear weapons stockpile in 1992; what they have in common is that they all follow the U.S. historical narrative. Thus, this story includes other countries while discarding their various national contexts.
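To make the mechanics of such timelines concrete, here is a minimal sketch in Python, using the widely available pandas and matplotlib libraries, of a color-encoded time series with a shaded period in the spirit of the unemployment tracker discussed above. All figures are invented placeholders, not actual labor statistics:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly unemployment rates (percent), rising after a shock
# and easing later -- placeholder values only.
dates = pd.date_range("2006-01", "2012-12", freq="MS")
rates = pd.Series(
    [4.7 + 0.08 * max(0, i - 24) - 0.06 * max(0, i - 54) for i in range(len(dates))],
    index=dates,
)

fig, ax = plt.subplots(figsize=(8, 3))
# Encode the rate itself as color: green when low, red when high.
ax.scatter(rates.index, rates, c=rates, cmap="RdYlGn_r", s=12)
# Shade a downturn so readers can relate unemployment peaks to the period.
ax.axvspan(pd.Timestamp("2007-12-01"), pd.Timestamp("2009-06-01"),
           color="grey", alpha=0.2, label="recession (illustrative)")
ax.set_ylabel("Unemployment rate (%)")
ax.legend()
plt.tight_layout()
plt.show()

The design choice mirrors the pieces described above: the color channel carries the variable itself, the shaded span re-contextualizes each moment within a larger historical period, and the chosen start and end dates frame what the reader is invited to compare.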

Spatial Dimension: Achieving Classification and Comparison

While news narratives are traditionally centered on the temporal dimension, the spatial dimension often takes a secondary position. The reason may be that the simultaneity and synchronization of stories unfolding in different spaces is an unattainable ideal for narrators (Gaudreault, 2005, p. 106). The inability to describe both the action and the environment in which it takes place at the same time leads the narrator to sometimes discard spatial information (Gaudreault, 2005, p. 105). In data journalism pieces, different time points are connected by a timeline and can therefore be presented as images; stories set in different places, in turn, are presented through maps. The strength of maps is that spatial positioning provides a reference for correctly judging the connections between things in space, allowing audiences to establish a clear spatial perception. "Maps are some of the most information-dense ways of communicating data," says Len De Groot, director of data visualization at the Los Angeles Times. "You can do a lot in a map because people already understand the fundamentals – unlike, say, a scatterplot." (Miller, 2015)

Maps thus give data journalism its power in spatial narratives. Similar to temporal narratives, spatial narratives can combine the macro level with the micro level. The use of interactive tools in data journalism allows creators to construct a grand narrative about the macro space while also allowing audiences to explore specific information about a particular location. Color-based segmentation and various markers on the map can clearly convey important information within a geographic space (Deng & Lu, 2013). The New York Times' "The Most Detailed Map of Gay Marriage in America" assigns different colors to different areas of the map based on the percentage of same-sex marriages in each region, with darker colors indicating a higher percentage. Clicking on a region shows the number and percentage of same-sex marriages there (see QR code 5.4).


QR code 5.4 The most detailed map of gay marriage in America. www.nytimes.com/2016/09/13/upshot/the-most-detailed-map-of-gay-marriage-in-america.html

QR code 5.5 People’s republic of Bolzano. www.peoplesrepublicofbolzano.com/

People have long been used to seeing the map as a mirror of the world, assuming that a map depicts the world exactly as it is (Zhang X., 2013). Reporting anxiously on the Chinese population in Italy, some Italian media have created the false impression of a "Chinese invasion of Italy." Bolzano is a city in northern Italy with a small Chinese population. To show that the Chinese are not "invading" Italy, "People's Republic of Bolzano," winner of the 2015 Data Journalism Award, used different colored dots to mark the distribution of Chinese residents in different occupations across Bolzano. In the map, the journalist provides only basic geographical information about Bolzano, such as the main streets and rivers, and expands the view of the map to the entire administrative area of the city rather than focusing only on the city center, thus showing how small and scattered the city's Chinese population is (see QR code 5.5). Had the map been presented differently, for example by reducing the scale of the map to make the dots appear dense, by limiting the view to the core of the city, or by increasing the diameter of the dots so that they dominate the image, the map would have conveyed a completely different meaning. So maps are not purely scientific; they are also aesthetic, religious, social, and political. The realization of discourse on maps is achieved through both cartographic techniques and map language, which profoundly and distinctly reflect the values and ideology of the cartographer (Zhang X., 2013).

The Washington Post's "Where carbon emissions are greatest" focuses on global carbon emissions (see QR code 5.6). In this map of the world, the darker the color of a place, the more carbon dioxide is emitted there. Through interactive means, readers of this news story can see both the global average of carbon dioxide emissions from 2001 to 2012 and the specific values for a particular location (located by latitude and longitude). This work visualizes the connection between carbon emissions and global development through spatial narratives.
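As a rough illustration of the color-based segmentation and hover-to-inspect interaction described above, the following sketch uses Python's Plotly Express library to draw a choropleth world map. The country values are invented placeholders, not the Post's data:

import pandas as pd
import plotly.express as px

# Hypothetical national CO2 emissions (gigatonnes) -- placeholder values.
df = pd.DataFrame({
    "iso_alpha": ["CHN", "USA", "IND", "DEU", "BRA"],
    "emissions": [9.8, 5.3, 2.5, 0.8, 0.5],
})

fig = px.choropleth(
    df,
    locations="iso_alpha",            # ISO-3 codes place each value on the map
    color="emissions",                # darker shade = more CO2, as in the example
    color_continuous_scale="Reds",
    labels={"emissions": "CO2 emissions (Gt)"},
)
# Hovering over a country reveals its exact value -- the interactive
# "micro view" that a static map cannot offer.
fig.show()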


QR code 5.6 Where carbon emissions are greatest. www.washingtonpost.com/graphics/national/carbon-emissions-2015/

Narratives in the spatial dimension also tend to obscure the contextual differences between regions. On a map, information about a country's geographical area is the easiest to identify, but information about its population distribution and economic level is not easy to represent. If the producer wants to compare information related to spatial distribution, the map allows the reader to draw conclusions in a visual way. However, if the producer wants to compare information that does not relate to spatial distribution, the map's narrative will be inefficient and may even be misleading. In "Where carbon emissions are greatest," the journalist therefore adds additional axes to represent both the total carbon emissions and the per capita emissions of a country, "highlighting" China and the United States on these two variables (see QR code 5.6). With the help of the axes, the reader can see that China's total carbon emissions are the highest in the world while its per capita emissions are at the lower middle level, whereas both the total and per capita carbon emissions of the United States are among the highest in the world.

During presidential elections, the U.S. media often use a map of the United States to show the results in different states, using red and blue to show the areas won by Republicans and Democrats respectively, which seems objective but is actually misleading. "Red states" are usually larger than "blue states," so readers easily see a large area of red representing the Republican Party, but this does not reflect the true election results. For example, Montana is about 15 times the size of Vermont, yet the two have almost the same number of electors. Critics therefore argue that using a map of the U.S. to represent the vote is a huge distortion of the real situation and can bias readers politically (Stabe, 2016).

The narratives of the spatial dimension are often about geopolitical relations. In political news especially, the spatial narrative is an expression of geopolitical issues. Globalization has brought countries closer together, making today's world more like a "global village." "Data journalism lays out an intuitive, visual, and vivid comparative context, where each country or region exists as a co-text of other countries or regions, thrown into an intertextual contextual relationship." (Liu, 2016) In this intertextual relationship, each country or region is a "mirror" of the others, interconnected and forming a broad "picture" of geopolitical discourse. "Conflicting Claims," a work by the South China Morning Post, shows the waters claimed by six countries and regions in the South China Sea and the islands and reefs they actually occupy. The overlapping boundary lines on the map, showing the overlap of the sovereignty claimed by each country or region, and the different colored dots, showing the islands and reefs each actually occupies, visually portray the complex conflict of interests in the South China Sea and illustrate the difficulty of resolving the issue. Of course, data journalists often combine the temporal and spatial dimensions of narrative in order to make a news narrative informative and comprehensive.

Social Network Dimension: Revealing Relationships in Complex Networks

The American scientist Warren Weaver divided the history of modern science into three distinct stages. The first phase, from the 17th to the 19th centuries, encapsulated what he called "problems of simplicity": during this period, most scientists were fundamentally trying to understand the influence of one variable over another. The second phase, taking place during the first half of the 20th century, involved "problems of disorganized complexity," when researchers started conceiving systems with a substantial number of variables, but the way many of these variables interacted was thought to be random and sometimes chaotic. The last stage, initiated in the second half of the 20th century and continuing today, is critically shaped by "problems of organized complexity." In this third phase, humans realized that the variables in complex systems are highly correlated and interdependent, requiring a new set of thinking, methods, and tools to explore them (Lima, 2011, p. 45). This brings us to another important concept, the complex network, whose structure is irregular, complex, and dynamically evolving in time, with the main focus moving from the analysis of small networks to that of systems with thousands or millions of nodes, and with renewed attention to the properties of networks of dynamical units (Boccaletti et al., 2006). Complex networks, as abstractions of real social relationships, can be used to describe relationships between people, between organizations, and between computers. Through social networks, data news works can show the micro and macro connections between various actors in a social system.

The word "network" here does not refer to the Internet, but rather is a metaphor for the web-like structure between social actors or elements (Lin, 2009, p. 41). A social network is usually presented as a graph consisting of many "nodes" and "edges," where the nodes represent the actors in the social network and the edges represent the relations between the actors. According to the idea of social network analysis, the actions of actors are not isolated but interconnected, and the network among actors also determines the opportunities and outcomes of their actions. This view is called "neo-structuralism" (Lin, 2009). The "social network" has become an increasingly popular way to represent all kinds of collective phenomena (Bounegru et al., 2017). In social network analysis, the main focus is on the following measures (Zhao, 2011), several of which are computed concretely in a short sketch below:

Density: the ratio of the actual ties an actor has to the maximum possible ties.
Centrality: the degree to which an actor is central to the network.
Tie strength: the strength of the ties between two actors in the network.
Position: a set of nodes that are structurally in the same position.
Content: the type or properties of the ties between the actors (e.g., affinity, power distribution).
Role: a relatively fixed pattern of behavior exhibited by nodes in the same position.
Cliques: cohesive subgroups whose members are linked together by common characteristics.

In the past, social networks rarely appeared in news reports because journalists focused more on core events and major issues and less on secondary ones. Social networks explain social events through the lens of relationships, taking a more macro, comprehensive, and complex view of the relationships between subjects and their interactions. With the development of data journalism, there are more reports exploring the relational networks of actors, but the proportion is still relatively small, for two reasons: first, social network analysis is not yet mastered by all journalists and editors; second, social network analysis requires sufficient data.

A complete social network diagram can show three levels of information: the macro view, the relationship view, and the micro view. The macro view provides a bird's-eye view into the network and highlights certain clusters, as well as isolated groups, within its structure. In most cases, the use of color (within nodes or edges) and relevant positioning (grouping) is enough to provide meaningful insight into the network's broad organization (Lima, 2011, p. 91). The relationship view is concerned with an effective analysis of the types of relationships among the mapped entities (nodes). The micro view into the network should be comprehensive and explicit, providing detailed information, facts, and characteristics on a single-node entity (Lima, 2011, p. 92). Data journalists do not necessarily need to use all three views in a single story; which views are used depends largely on the content and purpose of the story. For example, a reporter may use a micro view to reveal the connections between a particular individual and others in an event, a relationship view to reveal the connections among several individuals, or a macro view to reveal the connections among all subjects in an event.

Liliana Bounegru et al. summarize five models in their "Narrating Networks: Exploring the affordances of networks as storytelling devices in journalism" (Bounegru et al., 2017). The first model is "Exploring Associations Around Single Actors," usually describing the relationships around a given social unit, referred to as the ego, resulting in a mini-network or immediate neighborhood surrounding the ego. The second is "Detecting Key Players," depicting key actors based on the number of connections with other nodes. The third is "Mapping Alliances and Oppositions," which depicts associations of nodes as well as the absence of associations between groupings of nodes. The fourth, "Exploring the Evolution of Associations Over Time," is formed around a temporal dimension and shows the transformation of associations of actors over time. The fifth, "Revealing Hidden Ties," is used to depict hidden and potentially incriminating sequences of connections or paths between nodes.

In fact, from a macro perspective, social networks can be divided into two categories: the whole network and the ego network. The whole network is the complete set of ties among all actors in the network, allowing a more precise measurement of the network structure (Kilduff & Tsai, 2003, p. 136). The position of the key actors in the structure and their relationships with other actors can be seen in the whole network. The Wall Street Journal piece "Libor: The Spider Network" analyzes the social relationships between 18 financial institutions and 35 individuals that manipulated LIBOR, the London interbank offered rate and one of the world's most important benchmark interest rates, during the financial crisis. This whole network is actually a combination of ego networks (see QR code 5.7). BuzzFeed's "Help Us Map Trump World" is a crowdsourced project that uses social network analysis to present Trump's network of business connections (see QR code 5.8). Reporters amassed a social network of more than 1,500 individuals and organizations associated with Trump's business empire through public records, news stories, and other means. The work is deliberately unfinished, asking readers to continue to provide relevant information. The reporters identified three individuals with close ties to Trump, all of whom have considerable social networks of their own. A WAMU (American University Radio) work published in 2014, "Deals for Developers," maps a vast network of political contributions by combing through data on nearly 10,000 political contributions to elections in Washington, D.C., since 2003 (see QR code 5.9). Each node in the network represents a different entity, including officials, corporations, relatives of key figures, etc. (Meng, 2016).
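For readers who want to see how the measures listed earlier are computed in practice, here is a minimal sketch using Python's networkx library on a toy graph; the node labels are hypothetical stand-ins for actors, not drawn from any of the stories discussed:

import networkx as nx

# A toy social network: a tight clique (A, B, C) with a chain (C-D-E) attached.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"), ("D", "E"),
])

print(nx.density(G))               # density: actual ties / maximum possible ties
print(nx.degree_centrality(G))     # centrality: C scores highest, bridging both groups
print(list(nx.find_cliques(G)))    # cliques: cohesive subgroups such as {A, B, C}
# Tie strength could be modeled as edge weights, and positions or roles
# identified by comparing nodes' structural patterns.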

QR code 5.7 Libor: The spider network. http://graphics.wsj.com/libor-network/#item=DB

QR code 5.8 Help us map Trump world. www.buzzfeednews.com/article/johntemplon/help-us-map-trumpworld


QR code 5.9 Deals for developers. https://wamu.org/projects/developerdeals/

QR code 5.10 A full-text visualization of the Iraq War Logs. http://jonathanstray.com/a-full-text-visualization-of-the-iraq-war-logs

QR code 5.11 Zhou Yongkang's associated persons and properties. https://datanews.caixin.com/2014/zhoushicailu/

The Associated Press, drawing on the 11,616 "significant action" (SIGACT) logs leaked by WikiLeaks, mapped the network relationships among different U.S. military actions during December 2006, the bloodiest month of the Iraq War (see QR code 5.10). In this graph, each node represents a significant-action log, and the different colors represent different types of actions, including criminal actions, enemy actions, bombings, allied actions, and so on. Log-to-log relationships are measured with the TF-IDF (term frequency-inverse document frequency) weighting scheme and cosine similarity, a technique sketched at the end of this subsection. Using these methods, the graph reveals how different actions cluster together. In "Deals for Developers," users can view the ego network centered on a node (i.e., the person involved) by simply clicking on it in the interactive diagram (Kilduff & Tsai, 2003, p. 134).

Other media outlets use genograms to sort out the social relationships of key figures, an idea similar to social network analysis. Generally speaking, genograms work better when the relationships involved are relatively simple (e.g., mostly one-way relationships) or the number of actors is limited (a few dozen rather than hundreds). For example, Caixin's "Zhou Yongkang's Associated Persons and Properties" (see QR code 5.11) sorts out the relationships between the main individuals and enterprises involved in Zhou Yongkang's corruption scandal. Caixin's series on the scandal runs to 60,000 words, yet a single diagram suffices to show the relationships among the people in the series.
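The TF-IDF and cosine-similarity technique behind the AP's Iraq War Logs graph can be sketched in a few lines of Python with the scikit-learn library. The three "logs" below are invented placeholders, not actual SIGACT entries:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

logs = [
    "ied explosion near patrol route",
    "ied found and cleared on patrol route",
    "criminal detention after market incident",
]

tfidf = TfidfVectorizer().fit_transform(logs)  # one weighted term vector per log
sim = cosine_similarity(tfidf)                 # pairwise similarity matrix

# Logs 0 and 1 share distinctive terms ("ied", "patrol") and so score high
# against each other, while log 2 stands apart -- the clustering effect the
# AP graph exploits at the scale of thousands of documents.
print(sim.round(2))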

Correlation Dimension: Exploring Correlations Among Variables

Relationships in social networks are social relationships, emphasizing the associations between individuals. In this section, we turn to statistical correlations between variables. Correlations can be classified in various ways: positive or negative, linear or non-linear, complete or partial, single or multiple, and so on. Causality is the action of event A (the cause) on event B (the effect); it is also a basic law for understanding things and anticipating trends that humans have gradually worked out over time (Wei Z., 2001). Correlation is useful because it provides clearer perspectives than causality (Mayer-Schönberger & Cukier, 2013, p. 93). Correlation analysis is significant in itself, and it also lays the foundation for studying causal relationships; sometimes, clues to causality are implied in correlation analyses (Peng, 2015). Faced with the various uncertainties of the objective world, people tend to look for causes that transform the uncertain into the certain, and they habitually convert correlations into causal relationships to explain the things around them. In fact, the exploration of causal relationships in the objective world always starts from correlations (Wei B., 2001). In 2008, Wired magazine's editor-in-chief Chris Anderson argued that "the traditional process of scientific discovery – of a hypothesis that is tested against reality using a model of underlying causalities – is on its way out, replaced by statistical analysis of pure correlations." (Mayer-Schönberger & Cukier, 2013, p. 93)

We would like to give an example. What is the relationship between gun violence, the number of guns, and gun regulations? In Vox's "America's unique gun violence problem, explained in 16 maps and charts," the reporter tells the audience that the number of guns in a given place is directly correlated with the number of people who die from shootings. This correlation applies not only to U.S. states but also to other developed countries (see QR code 5.12).

QR code 5.12 America's unique gun violence problem, explained in 16 maps and charts. www.vox.com/policy-and-politics/2017/10/2/16399418/fedex-indianapolis-mass-shooting-gun-violence-statistics-charts

"Brexit: voter turnout by age," produced by the Financial Times in June 2016, found a slight general trend for turnout to increase in line with average age, using an analysis of the referendum results published by the Press Association and demographic data from the U.K. census. Correlation is a model that describes the quantitative relationship between things, while causation describes the nature of the relationship between things; both are indispensable (Wang, 2016). It has been argued that correlation will replace causality as the primary reference for people's decision-making in the era of big data. However, we believe that the importance of causality to news reporting will not be weakened in the era of big data, but rather strengthened. News reporting cannot rely solely on correlations to provide audiences with explanations of social reality, because the task of news reporting is not only to tell people what happened but also to tell them why it happened. The revelation of causes is often the discovery of causality (Peng, 2015). For example, a report by Tuzheng, "A series on disciplinary committee secretaries in 70 large and medium-sized cities," found a correlation between a region of China and the age of that region's CPC disciplinary committee secretary (an official tasked with enforcing the party's internal rules and regulations and combating corruption and malfeasance). The producers of this story interviewed experts and arrived at a reason for the correlation: officials in China's developed regions (i.e., the eastern and central regions) are generally younger and more energetic, but the eastern regions are more concerned with political stability, hence the relatively young disciplinary committee secretaries in central China (Zhang, 2016). Still, one researcher has commented that only a few stories in The Guardian and The New York Times were based on deep data analysis within the stream of more traditional reporting (Lorenz et al., 2011).
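The kind of correlation measurement behind stories like the Brexit turnout analysis can be sketched with SciPy's Pearson correlation coefficient. The numbers below are invented placeholders, not the Press Association's data:

from scipy.stats import pearsonr

avg_age = [36, 39, 42, 45, 48, 51]   # hypothetical average ages of voting areas
turnout = [58, 61, 64, 66, 70, 73]   # hypothetical turnout percentages

r, p_value = pearsonr(avg_age, turnout)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # r near +1: turnout rises with age

# As the chapter stresses, even a strong coefficient only describes the
# quantitative relationship; explaining *why* it holds requires reporting.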

Linear Narrative and Parallel Narrative in Data Journalism

According to Edward Segel and Jeffrey Heer (2011), narrative visualization approaches can be divided into an author-driven approach and a reader-driven approach. A purely author-driven approach has a strict linear path, relies heavily on messaging, and includes no interactivity. In contrast, a purely reader-driven approach has no prescribed ordering of images, no messaging, and a high degree of interactivity. However, most examples of narrative visualization fall somewhere in between, balancing both elements. On this basis, narrative visualization includes three basic models: the Martini Glass Structure, the Interactive Slideshow, and the Drill-Down Story (Segel & Heer, 2011). This classification addresses only data visualization, while the narrative of data stories relies on both text and visualization. Given the specificity of data stories and the limitations of previous classifications, we summarize the narrative patterns of data news based on hundreds of collected cases, classifying the narrative models of data news into linear narrative, parallel narrative, and interactive narrative. It should be noted that data news on new media platforms is hypertext, forming a multi-layered narrative system by means of hyperlinks. Theoretically, a hypertext can have infinite levels of hyperlinks. In order to grasp the characteristics of data news from a macro perspective, this study only analyzes the first layer of data news texts. In fact, our classification applies not only to the first layer of the text but to all other layers as well, depending mainly on how the text is related to its subtexts.

Linear Narrative

The linear narrative is a classical narrative style that focuses on the integrity of the story, the coherence of time, and the causality of the plot, implying a belief in and an appeal to the order and certainty of the world (Sun, 2011). The emergence of the linear narrative is inextricably linked to the development of language and writing. In his Course in General Linguistics, Ferdinand de Saussure argues that the principle of linearity is fundamental in linguistics, as successive morphemes constitute a temporal sequence that can be represented in writing (Sun, 2011). Linear narrative texts have one and only one main plot line that runs throughout, with a clear beginning and a purposeful end. The events in the plot line of a linear narrative are causally related: taken as a whole, a decisive causal law governs the whole plot, although there may also be rich multi-causal relationships between individual events (Weng, 2007). Therefore, in a linear narrative, the narrative flow cannot be changed even if the text contains structures that disrupt the regular time sequence, such as flashbacks (Sun, 2011).

The scholar R. Williams argues that linear narrative involves the priorities of news content (Lin & Wu, 2006, p. 155). A linear narrative is a narrative model centered on the communicator, who expects the audience to read in the order the communicator sets and to endorse the communicator's viewpoint; it reflects the author's desire to control topic and content. The linear narrative of data journalism follows a chronological or causal flow. To use an analogy, the linear narrative of data journalism is like a tree. A simple linear narrative has a single trunk, and the narrative stops at a certain point as the "tree" grows. A more complex linear narrative contains "branches" and "leaves." Generally, the "tree" grows in a linear direction, although there may be secondary events associated with the core of the narrative. The audience's understanding of the news content must rely on the logic set in advance, i.e., each stage of the narrative builds on the previous stage.

The main criterion for determining whether a text belongs to the linear narrative model is whether there is a logic that the author intends to convey between the parts of the text that carry relatively complete meaning (e.g., paragraphs). For example, we can examine whether the paragraphs of a text follow an inverted pyramid structure in order of news value. In other words, a data news text follows a linear narrative model if the relationship between the parts of the text is logical or reflects the intention of the communicator. Data news works employ both a pure linear narrative mode and a hybrid linear narrative mode (see Figure 5.1). The pure linear narrative mode refers to the use of linear narrative in all paragraphs of the text.


Figure 5.1  The pure linear narrative mode and the hybrid linear narrative mode

A hybrid linear narrative mode means that some paragraphs of the text may adopt parallel or interactive narrative, but the whole text still follows a linear logic. The plotline of news in the linear narrative mode usually overlaps with the reading sequence of the audience; it is thus a "closed" narrative. For example, Xinhuanet's work "A Family's 65 Years" adopts a pure linear narrative mode. It follows the chronological order from the founding of the People's Republic of China in 1949 to 2014, combining the experiences of the fictional character Wang Jiefang's family with important events and numbers in the country's history to show the great changes of individuals, families, and the country over 65 years.

A pure linear narrative mode is the inevitable choice for data news published on traditional media platforms. For example, the linearity of information dissemination and audience reception in television dictates that the scripts of the data news it publishes must also be linear. The linear narrative mode assumes that the audience is interested in the content of the entire news text and will read it in full, using suspense or connectives between paragraphs to hold the audience's interest. In CCTV's "Using big data to tell the stories of a community of shared future: The road to the world," the journalist's core point is that the infrastructure construction projects provided to countries along the Belt and Road are changing the economic and social life of these countries. Using the "heat map" of excavators of China Communications Construction Company as a starting point, the reporter points out that the global distribution of excavators is shifting from Central Asia and Eastern Europe to the broader Belt and Road countries, reflecting a global shift in infrastructure construction. The reporter then takes the Padma Bridge in Bangladesh and the "Double West" Highway in Kazakhstan, both built with Chinese participation, and the signal coverage of China Mobile in Pakistan as examples to introduce the positive impact of infrastructure construction on these countries. Representing three different aspects of infrastructure development, the three cases are organically linked by the reporter through the illustration of their commonalities.

However, in the new media environment, the use of parallel and interactive narratives is still relatively limited. On the surface, this is because journalists do not see new media platforms as interactive platforms; at a deeper level, it reflects the habitus of journalists, which is tied to established ways of gathering news and evaluating data news. The linear narrative mode still has a "market" on new media platforms because people's long-established linear reading habits are difficult to change in a short time. Of course, from the perspective of information reception, a linear narrative is undoubtedly more convenient for audiences seeking to learn and understand information. Linear narratives reflect the "initiative" of journalists as elites over the production, flow, and influence of information and knowledge. Although we are now in an era of audience-centeredness, the communicator's control of meaning production and the media power achieved through communication will not change. Although the hybrid linear narrative mode includes parallel or interactive narratives, it still leads the audience to read the news following the journalist's narrative logic, which reflects the media's intention to "control" the audience.

Parallel Narrative

Non-linear narratives are sequenced in a relatively free, fragmented manner rather than in strict chronological order. They include "non-linearity" at the author's creation stage and "non-linearity" at the user's reading stage (Swartz, 2006, p. 326). The non-linear narrative is not a product of new media platforms: literary and film narratives also use non-linear methods, such as flashbacks in literature or parallel editing in film and television. Eddy Borges-Rey's interviews with British data journalists suggest that data journalists have taken on some characteristics of computational thinking (Borges-Rey, 2018), highlighted by the fact that linear narratives are being replaced by more interactive and participatory forms that give audiences multilayered, multi-platform, gamified, database-linked, and vivid content. There are two types of non-linear narratives in data journalism: parallel narrative and interactive narrative.

In literature, the basic characteristic of the parallel narrative is that the story revolves around one or several main characters, with no necessary cause-and-effect relationship between the individual stories and no cumulative dramatic effect (Weng, 2007). It can be said that a parallel narrative consists of several stories "pieced together" like a jigsaw puzzle. This kind of narrative does not have a clear plotline. Each story independently fulfills the narrative requirements of a sub-theme and reflects the overall theme, and the relationship between them is parallel or complementary. The parallel narrative model of data journalism is similar to the "parallel subject" narrative model in literary composition, in which all the stories or plot threads that make up a text revolve around a definite theme, but with no specific causal connection or clear chronological order between them (Long, 2010). There are four distinctive features of a "parallel subject" narrative: first, the theme is the soul of the text; second, the text is composed of multiple parallel stories or plot threads; third, there is no clear causal connection or chronological order among the stories or plot threads; and fourth, the order of the stories or plot threads is interchangeable, and the rearranged text is not fundamentally different from the original (Long, 2010).

Most data news works applying the parallel narrative mode are combined stories, requiring navigation bars to present their structure clearly when published on new media platforms. Parallel narratives are audience-driven, allowing audiences to read outside the journalist's set path. In an era of attention scarcity and fragmented reading, the parallel narrative embodies the idea of not forcing the audience to read the whole text. "From rainforest to your cupboard: the real story of palm oil," published by The Guardian, tells the story of palm oil from rainforest to table, informing the audience that, on the one hand, increased palm oil production means shrinking tropical rainforest and, on the other hand, suitable alternatives to palm oil are hard to find. The entire story is divided into seven sections: rainforest, plantations, community, where it goes, businesses, consumers, and alternatives (see Table 5.1). Each of the seven sections is an independent module with no logical connection to the others, and the audience is free to choose any module of interest. "Homes for the Taking: Liens, Losses and Profiteers," published by The Washington Post, contains eight long stories about how homeowners lost their homes after tax debts as small as $500 snowballed into far larger ones. The parallel narrative presents the complexity and multifaceted nature of events through the "piecing together" of stories. It also allows audiences to select specific content or read in a personalized order, giving them the power to "splice" different parts of the story. It is therefore a semi-open narrative form, meeting audiences' needs for fragmented and autonomous information consumption in the new media environment.

Table 5.1  The structure of "From rainforest to your cupboard: the real story of palm oil"

Topics: Specific Contents

Rainforest: Value of the rainforest: dead and alive; The destruction of Indonesia's forests; Palm oil production by country; The impact of palm oil in pictures
Plantations: Can palm oil be sustainable?
Community: Life on palm oil farms
Where It Goes: Palm oil imports; How much of this is sustainable?
Businesses: What does the marketplace look like?; How much palm oil do businesses use?; Business case studies: the good, the bad, and the ugly
Consumers: What do you value most?
Alternatives: Palm oil v the alternatives: how do they compare?; Why there's no easy answer to finding a sustainable alternative

Interactive Narrative in Data Journalism

While the earliest data news produced by The Guardian dates back to 1821, the data journalism born in the 21st century operates in a very different social context and technological environment. One of the hallmarks of contemporary data journalism is interactive technology, which has given rise to a new type of narrative: the interactive narrative. According to Tom Rawlings, digital design and production director at Auroch Digital:

News organizations have to consider that their news is being delivered by a computer now and computers by their nature are interactive devices. The news organization that relies on static, non-interactive methods of engaging their audience is going to be left behind by the organizations that embrace this. (Reid, 2013)

Among the winners of the Data Journalism Award from 2013 through 2016, 59% were "interactive," while another 27% provided features for searching, filtering, and selection (Heravi & Ojo, 2017); in other words, 86% of the winners used some level of interactivity. Interactive narratives are a growing trend in digital journalism and have become commonly used in data stories on new media platforms. What are the types of interactive narratives in data journalism? How does data journalism represent discourse through interactive storytelling? This section explores these issues.

The Application of Interactive Narrative in Data Journalism

Technically, all input actions of users on new media platforms generate output content, so they can all be called interactions. Interaction is a kind of direct intervention: interactivity means the ability to intervene in a meaningful way within the representation itself, not to read it differently (Cameron, 2017). Interactive narrative refers to the act of narrating with interactive tools. The storyline of an interactive narrative is not the same for everyone but changes based on user input, allowing the narrative text, structure, and experience to be personalized. Unlike interactive narratives, linear narratives focus on the integrity of the story, the coherence of time, and the causality of the plot, implying a belief in and an appeal to the order and certainty of the world (Sun, 2011). The plotline of a linear narrative overlaps with the user's reading order, meaning that different users read the same text; the plot line of an interactive narrative does not match the user's reading order. Indeed, interactive narratives have no definite plot line and are often presented in the form of databases, apps, or games. Technically, interactive narratives are mainly based on databases and algorithms that generate customized content to meet users' needs for specific information, and they therefore have a great advantage in realizing the functional significance of data journalism.

There are several reasons for adopting interactive narratives in data journalism. First, because traditional news narrative forms are inadequate for presenting large-scale data sets, journalists need visually appealing and concise interactive narratives to present them (Boyles & Meyer, 2016). Second, interactive narratives fit the new media culture, which is characterized by decentralization and engagement. Third, interactive narratives can create immersive experiences for users and help build "strong ties" between users and texts. The interactive narratives used in data journalism fall into two categories: database-based exploratory narratives and game-based experiential narratives.

Database-based Exploratory Narrative

Database-based exploratory narratives allow users to explore personalized content based on their preferences. Influenced by the idea that "publishing data is news" (Batsell, 2016), a growing number of data news pieces are database-based. Rather than providing users with a raw data set, data journalists use interactive tools to provide an intuitive, visual interface that retains meaningful information and removes useless variables after judging the value of each variable in the data set. Interactive narratives can maximize the proximity of news, providing users with targeted information that meets their psychological, geographic, and informational needs. For example, "Median Income Across the US," published by New York Public Radio, uses different colors to represent income levels in different neighborhoods, and users can view specific values by mousing over a particular neighborhood (see QR code 5.13).

Scott Klein, a data journalist at ProPublica, believes that making complex data understandable means helping users follow a sequence from the general (the far view) to the particular (the near view). The far view is typically the landing page of the app, focused on broad meaning and context; this page should show the national picture of the data with ranked examples. The near view is the page at the lowest level of abstraction, where the reader is looking at her own school, his own town, etc. (Klein, 2015). It is the means through which readers understand the whole by relating it to the example they understand best. It has been argued that the future of journalism no longer assumes that the public needs some "standard" knowledge, but rather aims to meet the differentiated needs of different groups (Wang & Yu, 2016). Data journalists use interactive narratives to highlight the proximity of news, not only providing users with a service that is truly relevant to them but also building "strong ties" with them.

QR code 5.13 Median income across the US. https://project.wnyc.org/median-income-nation/

Game-based Experiential Narrative

The newsgame is another common form of interactive narrative in data journalism. Newsgames use journalistic principles to develop the media functions of games, providing players with virtual experiences based on real events and issues (Huang, 2014). Games can inspire a different kind of deliberation, one that considers the uncertainty of complex systems instead of embracing simple answers (Bogost, 2011). The difference between a game and a story is that while a story is represented through a series of unchanging facts, a game is represented through a bifurcation tree, allowing the player to make judgments and create his or her own story at each bifurcation. In addition, the reader of a story must infer causality from a series of facts, while the game player is encouraged to explore various possibilities (Fang & Dai, 2010, p. 4). In traditional news narratives, users are bystanders. Interactive narratives, by contrast, can engage users and internalize them as stakeholders in the news. For example, "What is at stake at the Paris climate change conference?," published by the Financial Times, is a game launched in December 2015 ahead of the Paris Agreement that tasked users with keeping the global temperature increase within 2°C by the end of the century. Users were asked to find out how different countries should respond to achieve the desired reduction in emissions. By clicking on "Create your own model" in the calculator, the user can adjust the emissions for each country, and the calculator will generate a prediction based on the user's personal choices. Newsgames can also enhance the fun of sports reporting. For example, BBC's "Euro 2016: Who would you pick in your team of the tournament?" allows users to select who they think are the best players, assemble their own team regardless of the players' nationality, and then choose a team formation and create a league to play against other users' teams (Fang & Fan, 2016).

The user's pleasure in newsgames is achieved through immersion and engagement. Within a text, the aesthetic remains largely immersive as long as the story, setting, and interface adhere to a single schema (Douglas & Hargadon, 2001; Carr et al., 2015, p. 75). Engagement is a more deliberate and critical mode of participation, which happens when less familiar or more difficult material makes demands on the user, who is driven to reread or otherwise reconsider information in an attempt to make sense of it. Engagement involves those portions of a text where extra effort or interpretive skills are called for and where external referents are sought; it also involves the consideration of multiple schemas, whether from within or beyond the text itself (Carr et al., 2015, p. 75). Whether the pleasure comes from immersion or engagement depends on the individual user. For example, a fan's pleasure may come from immersion in a football-themed newsgame, while a non-fan's pleasure may come from engagement.

The Storytelling Mechanism of Data Journalism

The essence of data journalism is news discourse, which represents facts through a particular narrative grammar and narrative structure. The use of interactive narratives in data journalism serves not only information transmission but also persuasion. For data journalism, the narrative itself is the practice of discursive representation, and its ultimate goal is the production of meaning. In visual culture studies, the production of meaning includes two dimensions: connotation and extension. Connotation refers to the thoughts, feelings, and associations that one perceives in a visual culture work, while extension refers to the "literal" meaning of an image (Barnard, 2013, p. 208). Accordingly, we divide the mechanism of the interactive narrative in data journalism into the connotation level of discourse production, which is the rule mechanism, and the extension level of discourse expression, which is procedural rhetoric.

The Author is Not Dead: The Illusion that Users Are Making Their Own Choices Despite the Limitations of the Rules

In narrative studies, the relationship between the author, the text, and the reader has long been a focus. Narratologists long considered the reader passive and the author the master of the narrative. In 1967, the French literary theorist Roland Barthes, in "The Death of the Author," declared that the text is dominated not by the Author but by the reader, thus making the reader the new Author, the God of the text. Barthes destroyed the traditional author-centered "author-text-reader" theoretical structure, establishing a new reader-centered "reader-text-author" structure (Tang, 2008). If the "death of the author" in the linear narrative mainly concerns the reader's dominant position in interpreting the text, in the interactive narrative the reader can not only dominate the interpretation of the text but also "reconstruct" it. As users make different choices when reading data news texts, the narrative texts presented to them differ. From this perspective, there is no author in the interactive narrative; in other words, the user is the author. Users actively participate in reconstructing the meaning and value of the work, and the narrative of the text holds a variety of uncertain possibilities, allowing users to form a narrative system based on their own personality and deeply releasing their freedom of reading.

Is the user really free in the interactive narrative? Interactive narratives are centered on rules, not on neutral technologies. In newsgame-type data news pieces, rules are predetermined to keep the game on track, defining the game's storyline and scoring criteria (Despain, 2015, p. 28). It is rules that give meaning to players' actions (Carr et al., 2015, p. 8). In database-type data news works, the permutations and calculations between different variables are set in advance. The rules are the invisible carriers of the author's will and pre-determine the connotation of the data news text. From this perspective, the rules of computational science are akin to an omniscient "God" that dominates the direction and cognitive logic of the interactive narrative. Interactive narratives are not aimless; they are governed by rules that determine the themes, perspectives, and intentions of the narrative. Therefore, the designer of the interactive narrative's rules is actually the author of the text, but this author is implicit. The interactive narrative seems to give users the freedom to explore and understand the text, but in fact it manipulates users' behavior and thinking through the rules; the author's view is simply translated into rule-driven plot settings, path designs, and algorithms within the user's participation and experience, rather than expressed directly.

In "What is at stake at the Paris climate change conference?" users are asked to adjust the shares of carbon emissions of different countries to achieve a global temperature increase of less than 2°C by the end of the century. The rules of this game are based on temperature prediction models. Therefore, no matter how much the user cuts each country's carbon emissions, even to zero for all countries, the global temperature increase by the end of the century will still be greater than 2°C. The rules of this newsgame show that the causes of the global temperature increase are various, and reducing carbon emissions is only one important measure among others. In interactive narratives, the user's immersion or engagement not only causes the author's presence to be ignored but also reinforces the user's obedience to the rules through constant interaction. Data journalism gives people more choices, but these choices follow a grammatical system controlled by the computer language and can therefore be seen as "the illusion that users are making their own choices" (Liu, 2016). Once the user adapts to the rules of a particular game, the user begins to be "controlled." Thus, the author of the interactive narratives of data journalism is not only not dead, but still controls the user's understanding of the issues, like "God."
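The way a rule can impose a floor on every possible user input, as in the climate calculator just described, can be illustrated with a toy model. This is a deliberately simplified Python sketch with invented coefficients, not the FT's actual temperature model:

# The "rule": user choices go in, a verdict comes out. The coefficients
# below are hypothetical placeholders chosen to mirror the floor described
# above, where even zero emissions breach the 2-degree target.
BASELINE_WARMING = 2.1   # warming already locked in (degrees C), invented value
SENSITIVITY = 0.4        # degrees C per unit of future emissions, invented value

def projected_warming(emissions_by_country):
    return BASELINE_WARMING + SENSITIVITY * sum(emissions_by_country.values())

print(projected_warming({"country_a": 0.0, "country_b": 0.0}))  # 2.1: the floor holds
print(projected_warming({"country_a": 1.5, "country_b": 2.0}))  # 3.5

However radically a user adjusts the inputs, the rule's structure, not the user's freedom, decides which conclusions are reachable; this is the implicit authorship discussed above.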

Procedural Rhetoric: Persuasive Externalized Rules

The rules of interactive narratives are externalized in the form of procedures, which visualize and embody the rules, making them no longer abstract and virtual but visible, perceptible, and experienceable. In an interactive narrative, the user's input produces a specific output, the output requires further input from the user, and so on, until the whole process of the interactive narrative is complete. In this process, procedural rhetoric becomes an important means of conveying information, expressing ideas, and persuading. Procedural rhetoric, first proposed by the scholar and video game designer Ian Bogost, is a general name for the practice of authoring arguments through processes. Its arguments are made not through the construction of words or images, but through the authorship of rules of behavior and the construction of dynamic models (Bogost, 2008, p. 125). Unlike verbal or visual rhetoric, procedural rhetoric is realized in a naturalized way during the interaction with the user, without the user being aware of it.

The first level of procedural rhetoric in interactive narratives (procedural rhetoric I) uses game mechanics to draw the user into a particular interaction diagram, directing the user's attention to a specific area by anchoring. The main techniques include: (1) default views, which provide an initial point of interpretation anchored to the default visual configuration; (2) fixed comparisons, which present some information by default so that users can contrast it with other values in the visualization; and (3) navigation bars and menus, which allow users to search for specific information and constrain the data depiction based on their preferences, a technique that can be called filtering (Hullman & Diakopoulos, 2011).

Representing Reality 111

QR code 5.14 The migrant files. www.themigrantsfiles.com/ Table 5.2  Mechanisms of interactive narratives used in data journalism

Rules Procedural Rhetoric I Procedural Rhetoric II

Database-based Exploratory Narrative

Game-based Experiential Narrative

Existent Existent Nonexistent

Existent Existent Existent

with other values in the visualization; (3) Navigation bars and menus allow users to search for specific information to constrain the data depiction based on a user’s preferences for certain information, which can be called filtering. (Hullman & Diakopoulos, 2011) “The Migrant Files,” published by Detective.io, uses several of these techniques. Its default view sets the sad tone of the story with a gray and white background. The visualization allows the user to see that most of the migrants who went to Europe died before they reached Europe, and many others died after they arrived. The size of the red circles on the graph represent the number of migrants from different countries who died on their way to Europe, which can be seen as fixed comparisons. Users can click on a specific circle to see the deaths of migrants from a particular place at a particular time, enabling information filtering (see QR code 5.14). In simple interactive works, procedural rhetoric allows users to piece together the content of text in constant exploration. Procedural rhetoric can be used to allow both less engaged users to understand the basic information in the text, and more deeply engaged users to explore the information in greater depth and personalization, deepening their understanding of certain issues. The procedural rhetoric of complex interactive narratives, especially gamebased experiential narratives, is much more complex. Within a rule-based framework, procedural rhetoric can realistically simulate real-world choices or dilemmas, “immerse” users in specific situations, and thereby reinforce or change their attitudes and behaviors (also referred to as procedural rhetoric II; see Table 5.2). In newsgames, for example, procedural rhetoric is used to persuade through game experiences such as challenge, fear, tension, fantasy, social interaction, exploration, etc (Despain, 2015, p. 28). Let’s take the topic of getting a license plate for a car in Beijing as an example. If journalists directly tell users the probability of getting a license plate, it is difficult

112  Representing Reality for them to empathize. Caixin has created a newsgame called “We’ll take you to get a license plate,” which is based on the real success rate of getting a license plate in Beijing and allows users to experience for themselves how difficult it is. For example, the probability of getting a license plate for an ordinary car in June 2016 is 1 out of 725 times. After the user applies four times without success, the game will calculate that he or she would have to wait until February 2044 to get one. Through this mini-game, users can experience the feelings of getting a license plate in Beijing, and more deeply appreciate the reality of unequal supply and demand. “On Your Marks: Can you react faster than an Olympic athlete?”, launched by the Financial Times during the Rio 2016 Olympic Games, allows users to test their reaction times in Athletics, Swimming, and Track Cycling. By “racing” with other competitors, users can experience the state of the athletes. ProPublica’s “HeartSaver: An Experimental News Game” is based on the data from Centers for Medicare and Medicaid Services, New York State Department of Health, New York City Department of City Planning, and Google Directions API. The game is set in New York City, where millions of people suffer from heart disease, and where factors such as the level of hospital care and proximity to patients’ homes affect their chances of survival. Users can participate in the game to experience specific decision-making situations and understand the impact of their choices on patients’ chances of survival (see QR code 5.15). When a user plays “HeartSaver: An Experimental News Game,” he or she needs to take into account multiple factors and make a decision in a short period of time. The outcome of the game may be a success or a failure, and the emotional experience for the user may be different. Some users may be heart patients themselves, others may be family members of heart patients, and others may just be curious. But in any case, the procedural rhetoric creates an experiential situation close to the real situation. The judgment made by the user, the references used, and the final result of the game are all calculated based on real data and rules, so the experiences gained by the user are real, measurable, and perceptible. We can say that the essence of procedural rhetoric is to persuade users through simulated experiences based on rules. Procedural rhetoric makes full use of persuasion principles, empathy, and relevance techniques to persuade users “silently” in progressive interactive experiences. In the interactive narrative, the process of participation or experience is the process of narrative, and the pleasure of narrative largely depends on the vividness of procedural rhetoric.

QR code 5.15  HeartSaver: An experimental news game. https://projects.propublica.org/ graphics/heartsaver

Representing Reality 113 Interactive narrative in the new media environment has not really changed the position of the author and the user. The author still dominates the narrative, but only dominates the narrative process through more invisible means. Interactive narrative essentially follows a narrative mechanism that relies on rules and procedural rhetoric, in which rules determine the connotation of the text, procedures materialize the extension of the text, and procedural rhetoric “naturalizes” the rhetorical practice through interaction. The essence of interactive narrative is not the “death of the author,” but an “illusion of choice.” The Relationship Between Word and Image Data journalism was born in the era of the “pictorial turn.” The visualization of data was therefore included in the articulation of data and news, both as a function of journalism and as a result of the development of media culture. The visual is a place where meanings are created and contested (Mirzoeff, 1999, p. 6). Data visualization means that the reality represented by data and the intricate relationships between data are expressed through a combination of visual codes, a seemingly neutral process that actually achieves persuasive discourse production in an implicit way (Li, 2017). At the same time, scholars in the humanities began to pay attention to pictorial representation, and images became a central topic of academic interest (Zheng, 2012). The relationship between words and images has become a frontier issue in visual culture and other fields. Research in journalism and communication has focused on the relationship between words and images in photojournalism. While photographs are an imitation of reality, data visualization presented in images can also “recreate” reality by “translating” abstract data into understandable images. Unlike literary or artistic language, journalistic language is non-fictional, which necessarily limits the freedom of expression. Therefore, researchers on the relationship between words and images in data visualization need to take into account the specificity of data journalism, and cannot mechanically apply the literary viewpoint. Words and images are two transmission channels for discourse production, so studying the relationship between words and images helps to gain insight into the way they collaborate in discourse production. Xu Wei (2012) examines the changes in the word-image relationship in novels and films, arguing that this relationship has gone through three stages: image loyal to words, image beyond words, and image dominating words. What is the relationship between words and images in data visualization works? In the era of “pictorial turn,” do images necessarily have “hegemony”? If not, what are the possible relationships between words and images? Under what conditions are these relationships formed? What role do these relations play in the production of discourse? Before formally discussing the relationship between words and images, we would like to explain the definitions of data visualization, “words,” and “images” discussed in this part. There are two types of relationships between data visualization and data news: first, data visualization is a part of data news text; second,

114  Representing Reality data visualization is the whole content of data news text. The data visualization discussed in this part includes the prior two categories. In data visualization, the text includes different forms such as titles, visual codes, legends, annotations, data sources, bylines, and others. The “words” refer to the linguistic parts such as titles, annotations, data sources, bylines, and others, while the “images” refer to visual codes and legends. The Image-dominant Model

In the era of the “pictorial turn,” one representative view is that images are hegemonic in today’s culture. As images become the dominant way of reflecting and understanding the world, language is considered to be subordinated to images. In data visualization, the image-dominant model means that images rely on their own semantic systems to dominate the meaning of the text. Of course, there are some conditions for image domination. The first condition is the belief that “seeing is believing.” “Seeing is believing” may refer to the witnessing of an event or scene, or it may be an indirect “presence” achieved through the acquisition of information or evidence. One of the most striking features of the new visual culture is the growing tendency to visualize things that are not in themselves visual (Mirzoeff, 1999, p. 5). For data visualization, there is a mediating process of “translating” data into data visualization. Data visualization, because it is derived from “data,” is given the aura of “science” and “objectivity.” In the eyes of the general audience, the “translation” of data into data visualization is not a subjective creation, but a trustworthy and scientific knowledge production process. The second condition is the establishment of context. Context is accompanied by the identification and differentiation of meaning. The function of context is to qualify and guide the interpretation process, enabling people to construct meaning along some shared cognitive framework and comprehension pattern (Liu, 2018b). Like words, images are dependent on context to anchor meaning. Data visualization can only avoid the uncertainty of meaning if the symbols are placed in a specific context. The third condition is a visual design that is based on cultural conventions and therefore can be understood accurately. If the author employs meaning-anchoring techniques to the best of his or her ability to achieve independent representation of images, words become dispensable or even absent, and images become the primary means of anchoring meaning and guiding ideas. Data visualization that satisfies these three conditions can accomplish preferred readings of the images by interpreting the institutional/political/ideological order imprinted on them (Rose, 2016, p. 133). Image domination illustrates the importance of visuality in interpreting the meaning of a text. If the meaning carried by an image is conceptual, the image has the rhetorical function of public discourse production, grasping the established concept in a relatively visual way so as to achieve public identification and dialogue in the visual sense (Liu, 2018b).

Representing Reality 115 But image domination, if abused ─ in other words, if journalists take advantage of people’s trust in data visualization and the belief that “seeing is believing” ─ can create discursive violence under the neutral veil of visual text (Dang, 2015, p. 277). In data journalism, image domination can be used to distort the truth, intentionally or unintentionally, to prove one’s own point of view, often by intercepting the Y-axis of a chart or by not following conventional design principles. For example, in the chart for “Stop and Frisk” produced by New York Public Radio, a brighter pink area indicates more stop-and-frisk operations, and the green dots on the chart indicate the locations where guns were found. On the surface, the picture shows that the green dots do not appear in the bright pink areas, indicating that the stopand-frisk policy was not effective enough (Porway, 2016). However, after changing the color of the graph, it was found that the stop-and-frisk policy was actually working. The Word-dominant Model

Although images have always wanted to break free from language and become dominant in meaning, image domination does not always happen. In which cases, then, will words regain dominance? In other words, under which conditions will word domination happen? In the first case, word domination occurs when the visual codes of data visualization cannot accomplish basic representation. Unlike visual texts such as photographs, paintings, movies, and television, which reproduce or represent reality as “images,” data visualization uses visual codes to translate data into a visible image that matches the user’s visual perception habits. Visual codes are mainly composed of graphical elements such as points, lines, and surfaces, and visual channels such as position, length, area, shape, direction, and hue. Visual codes in the form of graphs need to anchor meaning with the help of words. For example, a bar chart without labels and legends is just a graph without any meaningful information. At this level, words must be dominant. In addition, charts, together with textual content such as titles, annotations, and descriptions of data sources, constitute the second level of data visualization text, where words are not necessarily dominant, and the word-graph relationship we discuss is on this level. In the second case, word domination also occurs when there is uncertainty about the meaning of images. The peculiarly free nature of images in their presentation of things leads to a non-absolute coherence between the signifier and the signified in the representation of images (Zhang, 2015). According to Roland Barthes, any image is polysemous, it contains a “floating” chain of signifieds underneath its signifier, and its readers are free to choose some and ignore the others (Barthes, 2005, p. 28). The image is a decontextualized existence (Long, 2008). Due to journalistic demands for truth, accuracy, and objectivity, words are needed to summarize, explain, or confirm the meaning of images when data visualization may have multiple meanings or may not achieve the intended communication purpose. Let’s take the example of the winner of the Data Journalism Award, “The People’s Republic of Bolzano” (see QR code 5.16). If all the words in the image are

116  Representing Reality

QR code 5.16 People’s republic of Bolzano. www.peoplesrepublicofbolzano.com/

removed, leaving only the two characters and other visual symbols, there are multiple possibilities for the user’s interpretation of the image. For example, could this be meant to say that the two men have a similar part of their eyebrows? Or that the right man’s eyebrows are transplanted onto the left man’s? However, these are not the meanings of the image. The Chinese in Bolzano, Italy make up 0.6% of the entire population. In the image, there are two men standing side by side. The man on the left is a native of Bolzano, and his area in the picture represents the overall population of the Bolzano region. The percentage of Chinese (represented by the Chinese man on the right) in the total population of Bolzano is visualized as a portion smaller than the eyebrow of the man on the left in the picture. The 0.6% share of the Chinese population in the total population of Bolzano is “translated” by the size of the eyebrows. This data visualization is creative in design but would be ambiguous without the introductory text on the left side of the figure to anchor the meaning. In data visualization, visual symbols are good at presenting apparent information, while words cannot only illustrate the literal meaning but also have a meaning beyond the words. Under this condition, words rely on the certainty of their signification and the profundity of their meanings to dominate the meaning of the whole data visualization text. In the third case, words as dominant discourse (Cox & Pezzullo, 2017, p. 74) become unquestioned or unchallenged in the communication of meaning. At this time, if the signified of the image in the picture conflicts with the signified of words, people tend to reimagine the meaning of the image according to the signification of words (Liu, 2018a). The Guardian’s piece “How China’s economic slowdown could weigh on the rest of the world” points to the theme of data visualization in its title. In the introduction, the journalist wrote the following paragraph: In the year to July, China’s customs agency reports that imports from Australia are down by $15bn dollars on the same period last year – a loss which is already equal to 1% of Australia’s GDP, and many other countries stand to lose out to similar degrees. China’s imports overall are down by 14.6% over 2015. Find out what happens if this decline continues for the rest of the year – or worsens – and how that loss compares to each country’s GDP. This report seems to tell the facts with data, and some Chinese media even use it to illustrate the influence of the Chinese economy on the world. But in fact, if

Representing Reality 117

QR code 5.17 How China’s economic slowdown could weigh on the rest of the world. www.theguardian.com/world/ng-interactive/2015/aug/26/china-economicslowdown-world-imports

placed in the context of Western countries at the time, the report implies a deeper logic of slow economic recovery and economic instability in Western countries: Blame China. In the mainstream opinion of Western countries at that time, there was a strong voice that China should be responsible for the slowing down of the global economy. The words in The Guardian’s introduction were a dominant discourse, which visual symbols would encounter great difficulty in attempting to challenge. In the data visualization design (see QR code 5.17), The Guardian creates an image of China as a drag on the world economy. In the word-dominant model, the image is a figurative representation of the meaning of the words. The main purpose of images is to illustrate, prove, and reinforce the propositions and ideas of the words. The term “pictorial turn” is in fact rhetorical. After all, the limitations of images themselves do not guarantee their permanent domination. The Complementary Model of Words and Images

The complementary model of words and images means that the respective strengths of words and images are used to jointly produce discourse. In this relationship, words and images are equal. There are two types of word-image complementarity: redundancy and intertextuality. The first is complementarity on a functional level: redundancy, which is the repetition of information or narrative. In data visualization, words and images occupy two symbolic communication channels, rational and perceptual, respectively. Words are good at expressing abstract and discursive content and are not good at providing visual details or complex narratives, while images are good at expressing figurative and informative content and excel at complex narratives. Using words to narrate events with long time spans takes up a lot of space, while data visualization can incorporate information over decades and centuries into a single chart. In addition, words cannot present both the event and the environment in which it takes place at the same time (Gaudreault, 2005, p. 105), while data maps can provide multidimensional information in the spatial dimension to form narratives of parallel subjects of different regions. Data visualization can also present complex relationships through social network or genograms. For example, Caixin’s “Zhou Yongkang’s Associated Persons and Properties” compares the relationships between the key figures and businesses involved in the Zhou’s scandal.

118  Representing Reality Its series is 60,000 words long, yet a single chart is sufficient to show all the relationships in the story. Words and images stating the same proposition are semantically redundant, but the reading experience for the audience differs due to the different comprehension mechanisms of verbal and visual symbols. “The Tangled Web in the Fight for Syria’s Future,” for example, shows the intricate relationships between different parties in the Middle East and other countries outside the region (see QR code 5.18). How do we show the intricacies and how do we make them accessible to audiences? Whereas cumbersome words are likely to overwhelm audiences, diagrams are much clearer. Different colored lines are interwoven to represent the relationships between different subjects, allowing the relationships to be “present” at the same time and presenting the “intricacies” in a very visually appealing way. Thus, visual text provides much more information than words in a limited space. We can say that visual text is good at complex narratives. The second is complementarity on the level of content: intertextuality, which refers to the way that the meanings of any one discursive image or text depend not only on that one text or image, but also on the meanings carried by other images and texts (Rose, 2016, p. 188). Any new text, according to Jacques Derrida, is intertextual with previous texts, words, and codes, while traces of past texts seep into the work through the author’s “Aufhebung.” (Zhu, 2018) The intertextualization of words and images is generally accomplished by means of annotation. For example, “Iraq’s bloody toll,” published by the South China Morning Post, presents the number of Iraqi civilian and American military casualties since the Iraq War (see QR code 5.19). Each bright red bar in the figure indicates the number of civilian casualties in a given year, and each dark red bar indicates the number of U.S. military casualties.

QR code 5.18 Why the Middle East is now a giant warzone, in one terrifying chart. https:// archive.thinkprogress.org/why-the-middle-east-is-now-a-giant-warzone-inone-terrifying-chart-b2b22768d952/

QR code 5.19 Iraq’s bloody toll. www.scmp.com/infographics/article/1284683/iraqs-bloody-toll

Representing Reality 119 By annotating the bars at particular points in time, the work highlights small contexts within the larger historical context, helping the audience to deepen their understanding and awareness of the casualty figures. The annotations also serve to emphasize certain visual information. The intertextual relationship between words and images can also be achieved with hyperlinks or other interactive means, in addition to annotations on static diagrams. The Negotiation Model of Words and Images

News discourse is a kind of performative discourse that is used to convince readers of the authenticity of what it describes. Thus, journalism transforms the interpretation of events into the truth, into a reality on which the public can act (Van den Hoven, 2016). In data news texts, words and images need to be consistent in meaning. There must not be a word-image opposition or word-image competition as in The Treachery of Images (also known as “This is Not a Pipe”). However, the wordimage relationship in news texts must not be harmonious and complementary. The framing of news narratives determines what is spoken and presented. Entman defined the act of framing as the selection of “some aspects of a perceived reality and making them more salient in a communicating text” (Cox & Pezzullo, 2017, p. 123). Narrative framing, then, refers to the ways in which media organize the bits and facts of phenomena through stories to aid audiences’ understanding (Cox & Pezzullo, 2017, p. 129). Narrative framing provides a framework for understanding the world: what the problem is, who is responsible, and what the solutions are. The negotiation model of words and images is a mutually constraining relationship between the word-anchored frame and the image-anchored frame, which may be due to the adherence to journalistic guidelines, the desire to achieve meaning beyond the words, or the differences in the focuses of narratives. It is an “ambiguous” state in which words and images want to “overstep” each other while trying to “restrain” each other. The headline “North Korea is the only country that has performed a nuclear test in the 21st century” produced by The Washington Post defines North Korea as a country that goes against the tide of global peace. In its verbal narrative structure (see Table 5.3), North Korea has created and should be responsible for the global “nuclear threat.” Table 5.3  Different structures of verbal narratives and pictorial narratives

What Is the Problem

Who Should Be Responsible What Are the Solutions

Verbal Narratives

Pictorial Narratives

North Korea is the only country that has performed a nuclear test in the 21st Century. North Korea

Many countries in history have conducted nuclear tests and possessed nuclear weapons.

Unclear

The United States and the Soviet Union during the Cold War Unclear

120  Representing Reality At the pictorial level, North Korea is undoubtedly not the visual focus of data visualization. The pictorial narrative structure also presents a different frame than the verbal narrative. The data visualization places all of the countries in the world that possess nuclear weapons in a broad historical picture, with the United States and the Soviet Union placed above and at the visual center. The length and number of the nuclear race during the Cold War is well represented in the visual design. North Korea is labeled with a font and scale consistent with other countries, with no special treatment in visual coding. If the headline did not emphasize North Korea, no one would have noticed the message “North Korea is the only country that has performed a nuclear test in the 21st Century.” The verbal and pictorial frames in these cases are not the same but are also not in conflict. They are in a tension, engaging in ambiguous and unspoken discourses. Why is there a word-image negotiation? We believe that this is actually a “ritual strategy of objectivity.” Objectivity requires a set of journalistic norms and uniform technical standards. Objectivity as a “ritual strategy” allows journalists to be free from responsibility for the implicit values or consequences of their reporting (Hackett & Zhao, 2010, p. 21). The procedures set by journalists for reporting news may in fact contain bias (Su, 2013). In The Washington Post story on North Korea and the nuclear threat, the reporter uses both verbal narrative to emphasize his position and pictorial narrative to present the information in a comprehensive and objective manner, to some extent weakening the bias of words, which is a “clever” narrative strategy. The word-image negotiation model we discuss here is judged from the perspective of the communicator, hence the audience’s perception may not necessarily be the same as that of the communicator. Depending on the social contexts in which the audience lives, the audience’s personal experiences, and other factors, the audience may directly perceive the negotiation relationship set by the communicator as word domination or image domination. After all, “the meaning of a text is created at the time of reading, not at the time of writing.” (Bell & Garret, 2016, p. 2) In the era of the “pictorial turn,” words and images are not in an “either-or” relationship, but in a “symbiotic” relationship. No matter which model of word-image relationship is adopted in data news works, the aim is the same: to serve the needs of the overall text to represent reality. Principles of Data News Narrative Data journalism is not favored by the audience just because it involves more data or data visualization. Not all audiences like data, charts, or maps. Audiences expose themselves to data journalism first and foremost to get news information. Therefore, it is important for data journalists to think about how data journalism can meet the diverse needs of audiences and thus increase their reliance. We believe that the narrative of data journalism should include three concepts: story awareness, relevance awareness, and product awareness.

Representing Reality 121 Story Awareness: Enhancing the Attractiveness of News

The advent of data journalism has not changed readers’ love of stories. A study shows that when we are being told a story, not only are the language processing parts in our brain activated, but any other area in our brain that we would use when experiencing the events of the story are too (Widrich, 2012). However, many data journalists believe that data journalism does not need storytelling, but simply presentation of information. This can lead to two extreme results. One extreme result is data news with a one-sided emphasis on visualization. Some journalists extract data of interest to audiences from research reports and make charts directly, which only show others’ conclusions and hardly provoke audiences to explore the deeper content. This lack of original data analysis has made data journalism synonymous with “chart journalism” (Zhang, 2016). Some news works often say “a picture to read . . .”; however, the audience often does not need “this picture” to understand the message the journalist wants to convey. The other extreme result is data news that contains only “data.” Some data news only lists data or various charts but lack any explanation of the logic between the data, as if the audience is a statistician with an extreme interest in charts. According to Mohammed Haddad, Al-Jazeera’s data journalist, “the most important aspect of data visualization is making sure ‘visualizations are clarifying’ the stories rather than simplifying them” (Albeanu, 2015). Data journalism is not the same as bringing data analysis or visualization techniques directly into journalism. The narrative is still at the heart of data journalism, so some of the classic narrative techniques are useful in data journalism. Paul Bradshaw argues that the core process of data journalism is data processing, and that its main task is storytelling (Yu, 2015). In-depth interview (a data journalist surnamed Yu, working for a newspaper in Hunan Province, China): Both storytelling and data are the focus (of data journalism). If there is only data, readers will be overwhelmed too. Andrew Flowers, a data editor who works for FiveThirtyEight, believes that the most important thing in data journalism is the story, followed by the data (Flowers, 2016). He says, “FiveThirtyEight won’t have broad appeal without narrative.” (Ali, 2014) This is because many data news works lack a sense of storytelling. According to Dino Citraro, founder of Periscopic, a data visualization company, information is transformed into a story through five specific elements: setting, plot, characters, conflict, and theme, and all stories must contain all five of these elements. While many people talk about the importance of data storytelling, few people use data to tell stories (Citraro, 2014). In “The One Question Most Americans Get Wrong About College Graduates” made by The New York Times, the journalist reports on the unemployment rate for undergraduates between the ages of 25 and 34. In a conventional way, the reporter

122  Representing Reality

QR code 5.20 The one question most Americans get wrong about college graduates. www. nytimes.com/interactive/2016/06/03/upshot/up-college-unemployment-quiz. html?mtrref=www.google.com.hk&gwh=FAFD15EF2F5B006F34D3F4AC AEB5BB61&gwt=pay&assetType=PAYWALL

would have used data to visualize the unemployment rate, comparing it horizontally or vertically to give the audience a sense of how the unemployment rate has changed and what it means. However, the reporter used a “suspenseful” narrative technique to reorganize this data story, piquing the audience’s interest through the headline “A Question About College Graduates That Most Americans Get Wrong,” and explaining in the body of the story what the question is. Then, through interactive means, the audience was asked to guess the unemployment rate of college graduates between the ages of 25 and 34 in the United States. And a comparison was provided: the unemployment rate for high school graduates was 7.4%. Readers can make their own guesses about the unemployment rate for 25- to 34-year-olds who graduated from a four-year college. If an incorrect answer is chosen, the page appears: the correct answer is 2.4%. The page also provides the results of the Google survey (9.2%) and the average of The New York Times audience answers (6.5%). The page then provides an interpretive analysis of these data (see QR code 5.20). Therefore, although there are new explorations in narrative modes and narrative dimensions, data journalism is still narrative-centered, only that this narrative is built on new technologies and concepts Li Y. and Li S. (2015). Relevance Awareness: Highlighting Content Proximity

Data news works usually calls for a long period of time for its production, so the strength of data journalism lies not in timeliness, but in proximity. Establishing a relationship between data and the audience through proximity requires a great sense of relevance to the audience. In order to strengthen the connection between news content and users, data journalists should judge whether the topic is closely related to the audience in the topic selection stage, which kinds of data are important and relevant to the audience in the data collection stage, whether the analysis results meet the needs of the audience in the data analysis stage, and identify the common and personalized needs of the audience in the data visualization stage. Many mainstream media outlets now use proximity strategies in their data journalism production. In some of the best pieces that use data interactions and visualizations, audiences can find data that is relevant to the area in which they live. Claire Miller, a data journalist working for Trinity Mirror, says, “The idea of making

Representing Reality 123 news personal is something we can do so much better now with online news. People really, really like talking about themselves. And giving them an opportunity to talk about themselves will turn them toward news” (Cooley, 2016). Some data journalists also use databases or other interactive means that can really allow readers to “find themselves” in the news story. New York Public Radio (WNYC) considers its “hurricane maps” to be an important public service. Jim Schachter, WNYC’s vice president of news, said: if we try to compete by typing words against The New York Times, the BBC website, and The Guardian, we can’t win that game. But if WNYC combines audio with powerful, data-driven visualizations and tools and projects, that’s a place where we can make a mark as a small and ambitious news organization. (Oputu, 2014) The Texas Tribune has launched more than 30 database news stories on its online platform, providing both raw data and interactive databases. The Tribune’s biggest magnet by far has been its more than three dozen interactive databases, which collectively have drawn three times as many page views as the site’s stories (Batsell, 2010). The New York Times has launched a calculator that can calculate which is more cost-effective for people working in major American cities, buying or renting a home. Through this interactive application, macro data such as housing prices become relevant to every person’s life. All of these are examples of proximity strategies. Reinforcing relevance to audiences does not mean that data journalists have to produce local news. In fact, the complex narrative capabilities of data journalism can enable multiple levels of proximity within the same text. In “EU referendum: full results and analysis,” The Guardian uses a map of the U.K. population to simulate a U.K. map, with the results of the referendum highlighted in yellow and blue. With the help of different color distributions, the audience can clearly see the voters’ opinions at the national level. It can be found that the public attitudes toward Brexit differ among different regions of the U.K., with the public in Scotland and London preferring to stay in the EU and the public in other regions preferring to leave (see QR code 5.21). If one clicks on the different regions on the U.K. map, one can view the specific voting results on a constituency basis. In the constituency analysis at the bottom of the U.K. map, audiences can click on the “dots”

QR code 5.21 EU referendum: Full results and analysis. www.theguardian.com/politics/ ng-interactive/2016/jun/23/eu-referendum-live-results-and-analysis

124  Representing Reality representing different constituencies to see how different types of voters in that constituency, such as those with higher education, those without formal qualifications, middle-income residents, residents of different ages, and non-British-born residents, stand on the issue of Brexit. In addition to achieving psychological or geographic proximity, data journalism can also highlight the proximity of news by making readers “empathize” with it. Empathy is an alternative or indirect emotional response, which refers to a psychological process in which an individual’s understanding of real or imagined emotions of others triggers similar or identical emotional experiences, or in which the individual is able to recognize and accept the perspective of others and experience their emotions themselves (Zhang & Yang, 2007). The outbreak of the Syrian civil war in 2011 has led to a serious humanitarian crisis. As a reminder of the dangers of war, “What if the Syrian civil war happened in your country?,” a piece of news made by Public Radio International, allows audiences to select their own country to see what would happen if a civil war of the scale of the Syrian civil war occurred in their country. The data used for the analysis include the size of each country’s population, as well as the annual number of deaths, conventional road traffic deaths, lack of safe water, health needs, displacements, hunger, refugees, and more. It has been argued that journalism of the future no longer assumes that the public needs or cares about the same “standard” knowledge, but rather seeks to meet the differentiated needs of different groups for knowledge (Wang, 2007). Proximity in data journalism is not only about providing audiences with a service that is genuinely relevant to them, but also about building strong ties with audiences that will help media organizations secure their position and grow in a highly competitive marketplace. Product Awareness: Emphasizing User Experience User Experience (UX) is the user’s comprehensive feeling about the product based on the interaction between the user and the product, which is multi-attribute and multi-angle, involving the perception of the product attributes (such as whether its function is comprehensive, whether its design is new and novel, etc.), the emotional changes when using the product (such as whether the user is satisfied, happy, etc.), the evaluation of the product (such as whether it works well, whether it is worth the price, etc.), the subsequent behaviors about the product (such as whether to avoid it in the future or buy it again, etc.) and other levels (Liu & Sun, 2011). Given that many data news stories contain complex data and multi-layered information, providing information services in an intuitive, simple, and interesting way requires data journalists to have product awareness that takes into account the user experience. “Public Schools Explorer Database,” a project made by The Texas Tribune, compiled academic, enrollment, and financial records from 8500 schools in 1300 districts across Texas. Although The Texas Education Agency has already published such data on the web, the information on the government site is too complex,

Representing Reality 125 forcing parents to navigate a bewildering sea of acronyms, links, and PDFs. Therefore, The Texas Tribune’s Schools Explorer presents the state’s data on easy-tofollow charts, empowers users to create their own side-by-side comparisons, and uses an algorithm to generate summary sentences (Batsell, 2015, p. 107). In addition to satisfying user experience in content, data news published in new media platforms can be empowered with more functions through interactive means. Therefore, journalists need to take into account the actual needs of users when designing. In 2012, before Hurricane Sandy reached the northeast coast of the United States, The New York Times created an interactive map based on data from the U.S. National Weather Service, in which users could enter their zip codes to find out the potential risk of damage, the intensity of damage, and the nearest relief centers. After the hurricane hit, the interactive map added emergency evacuation centers, emergency supplies distribution points, and other information for the convenience of people in areas affected by the hurricane (Du, 2014). “The Best and Worst Places to Grow Up: How Your Area Compares,” published by The New York Times in 2015, uses a more automated geolocation service, with interactive maps that automatically locate users based on their IP location and tell them how their area rates in terms of child growth. Gregor Aisch, the graph editor of The New York Times, said: asking readers to supply information can sometimes cut down on their desire to use a news app. Past projects have shown that asking people to enter their Zip code automatically guarantees that a large chunk of people won’t use it while using geolocation is a way to ensure that everyone gets their location version. (Ellis, 2015) Sometimes data visualization trying to convey too much information may lead to information overload. If this is the case, journalists should consider using interactive means or multiple graphs to make the information clear enough. Richard Mazza, a computer scientist, argues that it is necessary to provide a global overview of the collection of data and, at the same time, let users analyze specific details or parts that they may judge as relevant to their goal (Mazza, 2009, p. 106). Poor user experience in interaction design is a common problem. Murray Dick’s research into the interaction teams at the BBC, The Guardian, and The Financial Times found that professional norms of interaction design are not fixed and that not all interactive teams comply with best practices in terms of user-centered design (Dick, 2014). The 2019 Data Journalism Award winner, “The Myth of the Criminal Immigrant,” published by The New York Times, uses a comparative approach to present the influx of immigrants and violent crime in major U.S. cities from 1980 to 2016, demonstrating on the one hand that the influx of immigrants is not correlated with violent crime, and on the other hand allowing the data visualization to contain richer information while taking into account the personalized choices of users (see QR code 5.22).

126  Representing Reality

QR code 5.22 The myth of the criminal immigrant. www.nytimes.com/interactive/2018/ 03/30/upshot/crime-immigration-myth.html

QR code 5.23 The urban neighborhood Wal-Mart: A blessing or a curse?. www.npr.org/ 2015/04/01/396757476/the-neighborhood-wal-mart-a-blessing-or-a-curse

The user experience of data news can also be satisfied by its presentation and compatibility across different terminals. Elliot Bentley, the image editor of The Wall Street Journal, said: on desktop, you can fill an article with lots of nice photos and little embeds, but I don’t think that works quite as well on mobile because you can end up cluttering up a small screen and I think it loses some of the impacts that a single graphic has on mobile. (Ciobanu, 2016) Therefore, interactive image news should be designed to meet mobile needs first. The initial version of Caixin’s work, “Zhou Yongkang’s Associated Persons and Properties,” offered horizontal images, so users needed to scroll left and right on their phones to see the full picture. The new version of this work converts the two-layer structure into a three-layer structure that fits within a single cell phone screen. As a result, the new version has accumulated about 3.18 million visits in the seven days it has been online. Lorenz Matzat, who works for Open Data City, said: people come to the website and get a first impression of the interface. But then they are on their own. Maybe they stay for a minute or half an hour. Our job as data journalists is to provide the framework or environment for this. As well as the coding and data management bits, we have to think of clever ways to design experiences. (Gray et al., 2012, p. 53) NPR’s “The Urban Neighborhood Wal-Mart: A Blessing or a Curse?,” in the desktop version, provides multiple graphs in response to the reported Wal-Mart expansions in three cities, including Washington, D.C., Chicago, and Atlanta. This display facilitates comparisons on the desktop version. On mobile, it is provided as a GIF image, which matches well with mobile (see QR code 5.23).

Representing Reality 127

QR code 5.24 Here’s how America uses its land. www.bloomberg.com/graphics/2018-usland-use/

QR code 5.25 Indonesia plane crash. https://graphics.reuters.com/INDONESIA-CRASH/ yzdpxjlrgpx/

Bloomberg’s “Here’s How America Uses Its Land” was designed with the user experience in mind in the data visualization. Bloomberg has compiled the distribution of six categories of land in the U.S.: pasture/range, forest, cropland, special use, miscellaneous, and urban. If a normal map of the United States is used to present the specific regional distribution, it is difficult for users to form an overall impression of the percentage of the six types of land, especially for some users with limited geographic knowledge. In fact, many users only want to know the percentage of different types of land. Therefore, the reporter designed another map to present the size of the six types of land in the United States in a more intuitive way (see QR code 5.24). This chart had the most unique visitors among graphics stories and the secondmost unique visitors among all stories on Bloomberg.com in 2018 (Southern, 2019). In addition, data visualizations allow users to quickly grasp specialized information that is not easy to understand. Reuters’ entry “Indonesia plane crash” (see QR code 5.25), which won the 2019 Data Journalism Award for Best Use of Data in a Breaking News Story, includes a chart of global commercial airliner crashes in various stages of flight from 2006 to 2017. The journalist combines a bar graph of the number of crashes at different stages of flight with a graphic representation of the plane’s flight, allowing users to gain not only data about crashes but also knowledge about aircraft navigation. Note 1 Schemas enable us to perceive objects and occurrences around us and to make efficient sense of them by consulting our ready-made store of similar occurrences and understandings, which we gain from reading, personal experience, and even advice we receive from others. See Douglas, J. Y., & Hargadon, A. (2001). The pleasures of immersion and

128  Representing Reality engagement: Schemas, scripts and the fifth business. Digital Creativity, 12(3), 153–166. https://doi.org/10.1076/digc.12.3.153.3231

References Adams, R. (Ed.). (2017). Michel Foucault: Discourse. Critical Legal Thinking. https://criticallegalthinking.com/2017/11/17/michel-foucault-discourse/ Albeanu, C. (Ed.). (2015). Data journalism: From specialism to “the new normal”. Journalism. www.journalism.co.uk/news/data-journalism-from-specialism-to-the-new-normal-/ s2/a565533/ Ali, T. (Ed.). (2014). The pitfalls of data journalism. Columbia Journalism Review. www.cjr. org/data_points/fivethirtyeight_and_journalism.php Bal, M. (2017). Narratology: Introduction to the theory of narrative. University of Toronto Press. Barnard, M. (2013). Approaches to understanding visual culture (Z. Cong, Trans.). The Commercial Press. (Published in China.) Barthes, R. (2005). Explicit meaning and implicit meaning (Huai Yu, Trans.). Baihua Literature and Art Publishing House. (Published in China.) Batsell, J. (Ed.). (2010). Lone star trailblazer. Cornell University Press. https://archives.cjr. org/feature/lone_star_trailblazer.php Batsell, J. (Ed.). (2015). Engaged journalism: Connecting with digitally empowered, news audiences. Columbia University Press. https://doi.org/10.1080/14241277.2015.1057019 Batsell, J. (Ed.). (2016). For online publications, data is news. Nieman Reports. http://nie manreports.org/articles/for-online-publications-data-is-news/ Bell, A., & Garret, P. (2016). Approaches to media discourse (G. Xu, Trans.). China Renmin University Press. (Published in China.) Best, S., & Kellner, D. (1991). Postmodern theory: Critical interrogations. Macmillan Education UK. https://doi.org/10.5840/radphilrevbooks1993820 Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., & Hwang, D. U. (Eds.). (2006). Complex networks: Structure and dynamics. Physics Reports. https://cosnet.bifi.es/publications/ phys_rep_2006.pdf Bogost, I. (2008). The rhetoric of video games. The MIT Press. Bogost, I. (Ed.). (2011). Persuasive games: Exploitation ware. Game Developer. www. gamasutra.com/view/feature/134735/persuasive_games_exploitationware.php?page=3 Borges-Rey, E. (Ed.). (2018). Towards an epistemology of data journalism in the devolved nations of the United Kingdom: Changes and continuities in materiality, performativity and reflexivity. Journalism, 21(7), 1–18. https://journals.sagepub.com/doi/full/10.1177/ 1464884917693864 Bounegru, L., Venturini, T., Gray, J., & Jacomy, M. (2017). Narrating networks: Exploring the affordances of networks as storytelling devices in journalism. Digital Journalism, 5(6), 699–730. https://doi.org/10.1080/21670811.2016.1186497 Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and Scholarly Phenomenon. Information, Communication & Society, 15(5), 662–679. https://doi.org/10.1080/1369118X.2012.678878 Boyles, J. L., & Meyer, E. (2016). Letting the data speak. Digital Journalism, 4(7), 944–954. https://doi.org/10.1080/21670811.2016.1166063 Broussard, M. (2016). Big data in practice. Digital Journalism, 4(2), 266–279. https://doi. org/10.1080/21670811.2015.1074863 Cameron, A. (Ed.). (2017). Dissimulations: The illusion of interactivity. Millennium Film Journal. http://mfj-online.org/journalPages/MFJ28/Dissimulations.html

Representing Reality 129 Carr, D., Buckingham, D., Burn, A., & Schott, G. (2015). Computer games: Text, narrative and play (Z. Cong, Trans.). Peking University Press. (Published in China.) Ciobanu, M. (Ed.). (2016). Advice from FT and WSJ for getting started with interactive graphics. Journalism. www.journalism.co.uk/news/advice-from-the-financial-times-and-the-wallstreet-journal-for-getting-started-with-interactive-graphics/s2/a677894/ Citraro, D. (Ed.). (2014). A framework for talking about data storytelling. Periscopic. www. periscopic.com/news/a-framework-for-talking-about-data-narration Cooley, B. (Ed.). (2016). Making data stories more personal: Highlights from data journalism UK. Journalism. www.journalism.co.uk/news/making-data-stories-more-personalhighlights-from-data-journalism-uk/s2/a694889/ Cox, J. R., & Pezzullo, P. C. (2017). Environmental communication and the public sphere (5th ed.). SAGE Publications. Dang, X. (2015). The power operation of visual culture. People’s Publishing House. (Published in China.) Deng, W., & Lu, Y. (2013). Point, linear and plane: Three information chart modes in disaster event reporting. Journalism and Mass Communication Monthly (4), 35–40. (Published in China.) Despain, W. (2015). 100 principles of game design (X. Xiao, Trans.). The People’s Posts and Telecommunications Press. (Published in China.) Dick, M. (2014). Interactive infographics and news values. Digital Journalism, 2(4), 490– 506. https://doi.org/10.1080/21670811.2013.841368 Douglas, J. Y., & Hargadon, A. (2001). The pleasures of immersion and engagement: Schemas, Scripts and the Fifth Business. Digital Creativity, 12(3), 153–166. https://doi. org/10.1076/digc.12.3.153.3231 Du, Y. (Ed.). (2014). How does the New York Times use big data to do hurricane disaster reporting. Chuansong. http://chuansong.me/n/1549972 (Published in China.) Ellis, J. (Ed.). (2015). The Upshot uses geolocation to push readers deeper into data. Nieman Lab. www.niemanlab.org/2015/05/the-upshot-uses-geolocation-to-push-readersdeeper-into-data/ Fang, J. (2015). Introduction to data journalism. China Renmin University Press. (Published in China.) Fang, J., & Fan, D. (2016). Sports data journalism in the age of media convergence – Taking the report of the 2016 European cup as an example. News and Writing (8), 77–80. (Published in China.) Fang, T., & Dai, X. (2010). Introduction to game design. Publishing House of Electronics Industry. (Published in China.) Fang, Y. (2014). Introduction to news narratology. China Radio and TV Press. (Published in China.) Fang, Y. (2016). A comparison of news narrative and literary narrative. Modern Communication (Journal of Communication University of China) (5), 60–63. (Published in China.) Flowers, A. (Ed.). (2016). Five thirty eight’s data journalism workflow with R. Channel 9. https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/ FiveThirtyEights-data-journalism-workflow-with-R Fludernik, M. (2009). An introduction to narratology. Routledge. Flusser, V. (2000). Towards a philosophy of photography. Reaktion Books. Gaudreault, A., & Jost, F. (2005). What is film narratology (Y. Liu, Trans.). The Commercial Press. (Published in China.) Genette, G. (1983). Narrative discourse: An essay in method. Cornell University Press. Gray, J., Chambers, L., & Bounegru, L. (2012). The data journalism handbook. O’Reilly Media.

130  Representing Reality Hackett, R., & Zhao, Y. (2010). Sustaining democracy? Journalism and the politics of objectivity (H. Shen & Y. Zhou, Trans.). Tsinghua University Press. (Published in China.) Heravi, B., & Ojo, A. (Eds.). (2017). What makes a winning data story? UCD iSchool. https:// medium.com/@Bahareh/what-makes-a-winning-data-story-7090e1b1d0fc#.3fbubynuo Hu, C. (Ed.). (2017). 13 thoughts of data driven decision making. Sohu. www.sohu. com/a/204942623_236505 (Published in China.) Huang, M. (2014). Newsgames in the context of digitization. Journal of Chongqing University of Posts and Telecommunications (Social Science Edition), 26(5), 94–100. (Published in China.) Hullman, J., & Diakopoulos, N. (2011). Visualization rhetoric: Framing effects in narrative visualization. IEEE Transactions on Visualization & Computer Graphics, 17(12), 2231–2240. https://doi.org/10.1109/TVCG.2011.255 Jia, L., & Xu, X. (2013). The nature of “big data” and its marketing value. Nanjing Journal of Social Sciences (7), 15–21. (Published in China.) Khatchadourian, R. (Ed.). (2020). No secrets. New Yorker. www.newyorker.com/ magazine/2010/06/07/no-secrets Kilduff, M.,  & Tsai, W. (2003). Social networks and organizations. Sage. https://dx.doi. org/10.4135/9781849209915 Klein, S. (Ed.). (2015). Intro: The design and structure of a news application. GitHub. https://github.com/propublica/guides/blob/master/design-structure.md Li, J. (2017). Rhetorical practice of visual framework in data journalism. Journalism and Mass Communication Monthly (5), 9–15. (Published in China.) Li, Y., & Li, S. (2015). Tell a good story? How data journalism inherits and transforms traditional journalism. Journal of Zhejiang University (Humanities and Social Sciences), 45(6), 106–122. (Published in China.) Liao, H. (2016). “Image” in “image”: The visual representation and construction of information graphic. Journal of Central South University (Social Sciences), 22(1), 208–213. (Published in China.) Lima, M. (2011). Visual complexity: Mapping patterns of information. Princeton Architectural Press. Lin, J. (2009). Social network analysis: Theory, methods and applications. Beijing Normal University Publishing House. (Published in China.) Lin, S., & Wu, X. (2006). Literature guidance of film and television theory (TV Volume). Shanghai University Press. (Published in China.) Liu, J., & Sun, X. (2011). What shapes user experience? Advances in Psychological Science, 19(1), 94–106. (Published in China.) Liu, T. (2016). China in western data journalism: Searching for an analytic framework of visual frame based on visual rhetoric. Journalism & Communication, 23(2), 5–28. (Published in China.) Liu, T. (2018a). On context: Interpretation method and visual rhetoric analysis. Journal of Northwest Normal University (Social Sciences) (1), 5–15. (Published in China.) Liu, T. (2018b). On image: “Image” in meaning and visual rhetoric analysis. Journalism Research (4), 1–9. (Published in China.) Liu, X. (2006). The discourse of news and the news of discourse: An interpretation of News as discourse. Hubei Social Sciences (1), 133–135. (Published in China.) Long, D. (2008). Image narration and text narration: Image and text in story painting. Jiangxi Social Sciences (7), 28–43. (Published in China.) Long, D. (2010). On juxtaposition narration as the theme of spatial narration. Jiangxi Social Sciences (7), 24–40. (Published in China.)

Lorenz, M., Kayser-Bril, N., & McGhee, G. (Eds.). (2011). Voices: News organizations must become hubs of trusted data in a market seeking (and valuing) trust. Nieman Lab. www.niemanlab.org/2011/03/voices-news-organizations-must-become-hubs-of-trusted-data-in-an-market-seeking-and-valuing-trust/
Mao, L., Tang, Z., & Zhou, H. (2016). Data journalism reporting: Framework and architecture. News and Writing (7), 35–39. (Published in China.)
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think (Y. Sheng & T. Zhou, Trans.). Zhejiang People's Publishing House. (Published in China.)
Mazza, R. (2009). Introduction to information visualization. Springer-Verlag.
Meng, D. (2016). Production characteristics and narrative mode of data journalism: An empirical study based on nominated works of data journalism award. Contemporary Communication (6), 23–26. (Published in China.)
Miller, G. (Ed.). (2015). How to tell science stories with maps. The Open Notebook. www.theopennotebook.com/2015/08/25/how-to-tell-science-stories-with-maps/
Mirzoeff, N. (1999). An introduction to visual culture. Routledge.
Moss, S. (Ed.). (2010). Julian Assange: The whistleblower. The Guardian. www.theguardian.com/media/2010/jul/14/julian-assange-whistleblower-wikileaks
Oputu, E. (Ed.). (2014). WNYC is beefing up its data journalism. Columbia Journalism Review. www.cjr.org/behind_the_news/wnyc_is_beefing_up_its_data_jo.php
Peng, L. (2015). What does the encounter of data and journalism bring? Journal of Shanxi University (Philosophy and Social Science Edition) (2), 64–70. (Published in China.)
Porway, J. (Ed.). (2016). The trials and tribulations of data visualization for good. Markets for Good. https://marketsforgood.org/the-trials-and-tribulations-of-data-visualization-for-good/
Reid, A. (Ed.). (2013). Newsgames: Future media or a trivial pursuit? Journalism. www.journalism.co.uk/news/newsgames-future-media-or-a-trivial-pursuit-/s2/a554350/
Rose, G. (2016). Visual methodologies: An introduction to researching with visual materials (4th ed.). SAGE Publications Ltd.
Segel, E., & Heer, J. (2011). Narrative visualization: Telling stories with data. IEEE Transactions on Visualization & Computer Graphics, 16(6), 1139–1148. https://doi.org/10.1109/TVCG.2010.179
Solop, F., & Wonders, N. A. (2016). Data journalism versus traditional journalism in election reporting: An analysis of competing narratives in the 2012 presidential election. Electronic News, 4(4), 1–21. https://doi.org/10.1177/1931243116656717
Southern, L. (Ed.). (2019). Inside Bloomberg's 30-person international data journalism team. Digiday. https://digiday.com/media/inside-bloombergs-30-person-international-data-journalism-team/
Stabe, M. (Ed.). (2016). How to cover the U.S. election with a visual map? (S. Shi, Trans.). FT Chinese. www.ftchinese.com/story/001070056?full=y
Su, S. (2013). Fact or ritual: Limitations and countermeasures of news objective strategy. Theory Horizon (1), 155–157. (Published in China.)
Sun, W. (2005). Criticism, integration or manipulation: The publicity of metropolitan newspapers. 2005 China Communication Forum. (Published in China.)
Sun, W. (2011). Research on interactive media narrative [Doctoral dissertation, Nanjing University of the Arts]. CNKI Theses and Dissertations Database. https://kns.cnki.net/kcms2/article/abstract?v=3uoqIhG8C447WN1SO36whNHQvLEhcOy4v9J5uF5Ohrl8vTzlaZqhNhoaD5Dbi-1MV6n4dxajsYfpLymZ9Up5F01uLl2OxQ_h&uniplatform=NZKPT (Published in China.)

Sunne, S. (Ed.). (2016). The rise of data reporting. American Press Institute. www.americanpressinstitute.org/publications/reports/strategy-studies/data-reporting-rise/
Swartz, D. (2006). Culture and power: The sociology of Pierre Bourdieu (D. Tao, Trans.). Shanghai Translation Publishing House.
Tang, F. (2008). On Barthes's concept of "the death of the author". Journal of Hunan University of Technology (Social Science Edition), 13(6), 77–80. (Published in China.)
Tang, W. (2003). Paradigms and dimensions: A review of foreign narratology research. Journal of Foreign Languages (5), 60–66. (Published in China.)
Van den Hoven, P. (2016). Critical rhetoric as a theory of journalist transparency (Y. Yang, Trans.). Global Journal of Media Studies, 3(4), 83–96. (Published in China.)
Wang, C. (2007). On the knowledge form of news in the future. Nanjing Journal of Social Sciences (10), 105–110. (Published in China.)
Wang, C., & Yu, X. (2016). A study of innovation mechanisms in newsrooms: A study of "micro-news production" in three daily newspapers. Shanghai Journalism Review (3), 10–20. (Published in China.)
Wang, T. (2016). Causality in big data and its philosophical connotations. Social Sciences in China (5), 22–42. (Published in China.)
Wei, B. (2001). Analyzing correlation and causation – Two concepts that must be clarified in probability statistics. China Statistics (6), 45–46. (Published in China.)
Wei, Z. (2001). On the definition of causality. Qinghai Social Sciences (1), 117–121. (Published in China.)
Weng, D. (2007). The application of nonlinear narrative in cartoon work. Journal of Fujian Normal University (Philosophy and Social Sciences Edition) (5), 128–130. (Published in China.)
Widrich, L. (Ed.). (2012). The science of storytelling: Why telling a story is the most powerful way to activate our brains. Lifehacker. http://lifehacker.com/5965703/the-science-of-storytelling-why-telling-a-story-is-the-most-powerful-way-to-activate-our-brains
Wu, H. (2008). Science in cultural perspective. Fudan University Press. (Published in China.)
Xu, W. (2012). Studies on the evolution of relation between language and image of new period. Academic Monthly, 44(2), 106–114. (Published in China.)
Yang, B. (2001). On news facts. Xinhua Publishing House. (Published in China.)
Yang, S. (2015). Several methods of data visualization timeline design. Science & Technology for China's Mass Media (Z1), 68–70. (Published in China.)
Yu, M. (2015). Data journalism practice: Process reengineering and models innovation. Editorial Friend (9), 69–72. (Published in China.)
Zeng, Q., Lu, J., & Wu, X. (2017). Data journalism: Concept explication and exploration. Journalism & Communication, 24(12), 79–91. (Published in China.)
Zhang, G. (2013). Data journalism as an open journalism: The data journalism practice of the Guardian. Shanghai Journalism Review (6), 7–13. (Published in China.)
Zhang, J. (2016). Data journalism: A study on the new way of storytelling at the big data era [Master dissertation, Lanzhou University]. CNKI Theses and Dissertations Database. https://kns.cnki.net/kcms2/article/abstract?v=3uoqIhG8C475KOm_zrgu4lQARvep2SAkfRP2_0Pu6EiJ0xua_6bqBsWCF_ycD2NYFo7naQ1vE_4REG67QYWkmasz5ZryFLy6&uniplatform=NZKPT (Published in China.)
Zhang, K., & Yang, L. (2007). A review of studies on empathy. Science of Social Psychology (Z3), 161–165. (Published in China.)
Zhang, W. (2015). Aesthetic disenchantment of graphical supremacy – On the construction of paradigm and intertextuality representation of the modern visual text. Journal of Xinjiang University (Philosophy, Humanities & Social Sciences), 43(4), 107–112. (Published in China.)
Zhang, X. (2013). Discourse, power and politics on the maps [Master dissertation, Lanzhou University]. CNKI Theses and Dissertations Database. https://kns.cnki.net/kcms2/article/abstract?v=3uoqIhG8C475KOm_zrgu4lQARvep2SAkfRP2_0Pu6EiJ0xua_6bqBjOIvHbTFLJCGx3VyzVUjH0kZjySmurzniXL7o__b3hL&uniplatform=NZKPT (Published in China.)
Zhao, L. (2011). Basic theory and method of social network analysis and its application in information science. Research on Library Science (20), 9–12. (Published in China.)
Zheng, E. (2012). Analysis of Mitchell's "pictorial turn" theory. Literature & Art Studies (1), 30–38. (Published in China.)
Zhu, H. (2018). The influence and construction of western intertextuality theory on reader centered theory. Masterpieces Review (24), 163–164. (Published in China.)

6

Social Media-oriented Production of Data News

Data journalism fits the trend of "everything is quantifiable" in the era of big data and has become popular worldwide. However, the data journalism industry has not paid enough attention to social media (Zhang, 2016), and even outlets that recognize its importance mostly treat it as a mere distribution channel rather than designing their products around the attributes and logic of social platforms. Yet social platforms are steadily changing users' news consumption habits. Studying the social media-oriented production of data journalism will not only help data journalists pay attention to social platforms, but also help the industry tap the blue ocean of social platforms and develop new strategies.

What is social media-oriented production? It is a production mode that relies on interpersonal relationships, is oriented toward making users share, and aims to embed news in users' online social life; in short, it adapts news production to social platforms rather than merely using them. Social media-oriented production of data journalism takes "sharing as the core" and emphasizes user orientation and user experience. Letting users "share" is not simply a matter of placing a share button on the interface, which may not be useful in itself, but of implementing the concept of "sharing" across the whole production chain, so that the news itself has the potential to be shared.

Motivations for Social Media-oriented Production of Data Journalism

The data journalism industry's neglect of social platforms is mainly manifested in three aspects. First, the social accounts of data journalism teams are treated only as supplementary channels for media organizations, lacking regular and targeted operation. Second, data journalism products are designed mainly for the web rather than for mobile. Third, the theme, content, writing, and presentation of data news rarely take social platform features into account, and so lack appeal to users. Why should data journalism production pay attention to social platforms?


The "Migration" of Users: From PC to Mobile

The biggest change in global news consumption in recent years is that mobile is gradually becoming people's main source of news. The 42nd Statistical Report on Internet Development in China shows that the number of mobile Internet users reached 788 million in 2018, and that 98.3% of Chinese Internet users went online through cell phones. The number of monthly users of computer-based news services, by contrast, stood at 480 million in June 2017 (iResearch, 2017). In the U.S., only 10 to 15 percent of news clicks come from newspaper websites, while 80 percent are brought in by search and sharing (Xu, 2015). About half of U.S. adults (53%) say they get news from social media "often" or "sometimes" (Shearer & Mitchell, 2021). Social platforms are the new intermediaries between news content and users. Given the significant differences in text formats and reading habits between computer and mobile, data journalists will have to adapt quickly to this trend.

Some international mainstream media are placing growing emphasis on pushing content to social platforms. The Economist entered social media in 2009 and has since shifted from posting a large amount of low-quality content to posting a small amount of high-quality, audience-engaging content that drives subscriptions (Law, 2017). The BBC has been sharing two infographics per day to Facebook, Twitter, and Pinterest since May 2014 as part of its "social media-focused experiments" (Reid, 2014).

Change of Communication Mode: From Vertical Communication to Fission-style Communication

The social relationships between users have enabled content dissemination on social platforms to move beyond a one-time, vertical model to a decentralized, fission-style model: like nuclear fission, information passes rapidly from one user to multiple users. "Sharing" has been transplanted from offline interpersonal interaction to online life, serving both as a means of building and maintaining relationships and as a driving force for the spread of information on social platforms. In sharing, users bring valuable and entertaining content to others, define themselves to others, grow and nourish their relationships, achieve self-actualization, and get the word out about causes or brands (Customer Insight Group, 2017). Each time a user shares a story, the media outlet's brand reputation spreads and new followers arrive. Yet data journalism is still largely produced in a traditional way, with much news published in almost identical computer, print, and mobile versions that account neither for the technical differences between platforms nor for the differences in user habits and communication models. The fission-style communication driven by sharing among users challenges this established production model and requires data journalists to rethink their production concepts.
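
To make the fission metaphor concrete, the toy branching model below shows how reach compounds generation by generation once each wave of sharers recruits, on average, more than one new sharer. The audience sizes and probabilities are invented for illustration; this is a minimal sketch, not a model of any real platform.

```python
import random

def simulate_fission_spread(seed_sharers=5, friends_reached=40,
                            share_prob=0.1, generations=5):
    """Toy branching model: every share exposes a sharer's friends,
    a fraction of whom share again. Spread explodes whenever
    share_prob * friends_reached exceeds 1."""
    random.seed(42)  # reproducible illustration
    sharers, total_reached = seed_sharers, 0
    for gen in range(generations):
        reached = sharers * friends_reached
        total_reached += reached
        # Each exposed friend independently decides to re-share.
        sharers = sum(random.random() < share_prob for _ in range(reached))
        print(f"generation {gen}: reached {reached:>6}, new sharers {sharers}")
    return total_reached

simulate_fission_spread()
```

With these parameters each generation reaches roughly four times as many users as the last; drop share_prob below 1/friends_reached and the same model dies out after a generation or two, which is the one-time, vertical pattern.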

Information Filtering Mechanism: From User Selection to Collaborative Filtering

In the era of mass communication, the media were the gatekeepers of information, and audiences chose from the information the media put out according to their needs. Nowadays communicators and channels are far more diversified, and a new layer of information filtering has emerged on social platforms: collaborative filtering. Media content spreads not only through an outlet's own social accounts but also through users' networks. Each individual sits in the networks of multiple others, and information selection has shifted from "the user chooses" to "friends choose for me." About 20% of WeChat users select articles in their "Subscriptions" and share them to their "Moments," where the other 80% read them (Wang, 2015). Some social media recommendation algorithms are built on the same principle. These changes in filtering mechanisms require data journalists to rethink their products' communication strategies.
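
As a minimal sketch of the "friends choose for me" logic, the snippet below ranks stories for a user by counting how many of their friends shared each one. The follow graph and story names are hypothetical, and real platform algorithms add weights for tie strength, recency, and engagement on top of this basic idea.

```python
from collections import Counter

# Hypothetical follow graph and sharing history, for illustration only.
friends = {
    "ana": ["bo", "chen", "dana"],
    "bo":  ["ana", "chen"],
}
shared = {
    "bo":   ["story_smog", "story_housing"],
    "chen": ["story_smog", "story_metro"],
    "dana": ["story_housing"],
}

def rank_for(user):
    """Rank unseen stories by how many of the user's friends shared them:
    a bare-bones 'friends choose for me' filter."""
    seen = set(shared.get(user, []))
    votes = Counter(
        story
        for friend in friends.get(user, [])
        for story in shared.get(friend, [])
        if story not in seen
    )
    return votes.most_common()

print(rank_for("ana"))
# [('story_smog', 2), ('story_housing', 2), ('story_metro', 1)]
```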

The Long Production Time of Data Journalism Requires a Long "Shelf Life"

Data journalism takes much longer to produce than traditional journalism because of its multiple stages of data collection, analysis, and visualization. According to Google News Lab, 49% of data stories are created in a day or less, 30% take up to a week, 11% take several weeks, and 3% take a month or more (Rogers et al., 2017). Data journalism can indeed be done in a day or less, but the level of analysis and visualization in such quick-turnaround stories is relatively low. The addition of a Best Use of Data in a Breaking News Story (Within First 36 Hours) category to the Data Journalism Award likewise signals that data journalism normally has a longer production cycle than traditional journalism. When selecting topics, data journalists must weigh production time, so many focus on long-lasting topics rather than breaking news in order to extend the "shelf life" of their work.

The content that dominates on social platforms is likewise long-lasting topics rather than breaking news, which fits the long production time of data journalism. New content on social platforms can be shared by users at any time as long as it stays hot, and after attention fades, content with a long shelf life enters a "long tail" distribution and retains the power to spread.

"Sharing" As the Core Production Strategy

"Sharing" is both a production concept for data journalism and an indicator for evaluating a story's dissemination effect, underscoring an important criterion of "good" data news on social platforms: multi-level communication power. Of course, "good" news must also serve the public interest, follow journalistic ethics, reflect professional standards, and provide users with true and accurate information. Under this premise, how can the social media-oriented production of data journalism be centered on "sharing"?

News Content is Worth Sharing

Topics that get users talking are an important driver of resharing. Recall the "back-fence principle": news should be what two housewives talk about at the end of the day while leaning over the backyard fence. Since relationships on social media are an extension of real human relationships, sharing with friends is not simply a way to vent emotions and express opinions; it also functions as "social currency," which Jonah Berger describes as people talking about things that make themselves look good rather than bad (Berger, 2013). Beyond the professional standards of journalism, data journalism should therefore also consider users' psychological motivation to present themselves and share with friends on social platforms. Data journalists can make their work more shareable by improving the interest, quality, style, and format of their stories.

News Has the Potential to Become a Hot Topic

In the age of social media, users connected by interests and shared values circulate similar information, so whether a data journalism piece resonates directly determines the likelihood that it will be reposted. Certain elements of news "go bang like a bomb" (Chen, 2017), quickly igniting particular emotions in users, prompting reshares and rapidly creating hotspots of public opinion. Research has found that viral communication is partly emotionally driven: positive content is more likely to be shared in large quantities, content that evokes high-arousal positive (awe) or negative (anger or anxiety) emotions is more likely to go viral, and it can therefore be useful to feature or design content that evokes activating emotions (Berger & Milkman, 2012).

Data journalists need to think about making news content a hot topic on two levels: information and format. At the information level, journalists need to identify the content most likely to spark users' interest. They should open with the most interesting or relevant fact or the most surprising data point, and include engaging movement or eye-catching colors, as this will be the first thing people see on social media, and maybe the only thing if they scroll past too fast (Segger, 2018). At the format level, data visualization designers should not pursue abundance of information but make the important data clearly identifiable; alternatively, designers can let users gradually uncover the answers to questions posed as they read. The Economist's team, for example, strips each chart down to make just one clear point, so that readers get the sense simply by scrolling through (Southern, 2018). In the social media era the Graphics Interchange Format (GIF), familiar from emojis, can also be used for data visualization, and specifically to emphasize a single data point. The "7 GIFs to show you the 28 years of

development and opening up of Pudong New Area," released in 2018 by ThePaper, highlights the point that Pudong's tertiary industry accounted for more than 50% of its economy for the first time in 2007, capturing a key moment in industrial restructuring. However, if a GIF contains too many variables, it will not impress users. GIFs are attractive not merely because they "move" and "repeat," but because they can spotlight the single most compelling message.
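
For designers who want to try this, the sketch below uses matplotlib's animation API to build a small GIF that reveals a line chart point by point and annotates the one message it exists to deliver. The numbers are invented stand-ins for a Pudong-style series, not the actual figures.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Illustrative numbers only.
years = [2005, 2006, 2007, 2008, 2009]
share = [47.0, 48.5, 50.2, 51.0, 52.3]  # tertiary-industry share, %

fig, ax = plt.subplots(figsize=(4, 3))

def draw(frame):
    """Reveal one point per frame; annotate only the key message."""
    ax.clear()
    ax.plot(years[: frame + 1], share[: frame + 1], marker="o")
    ax.set_xlim(2004.5, 2009.5)
    ax.set_ylim(45, 55)
    ax.set_ylabel("Tertiary industry share (%)")
    if years[frame] >= 2007:
        ax.annotate("first above 50%", xy=(2007, 50.2), xytext=(2004.8, 53.5),
                    arrowprops={"arrowstyle": "->"})

anim = FuncAnimation(fig, draw, frames=len(years), interval=800)
anim.save("tertiary_share.gif", writer=PillowWriter(fps=1))
```

The design choice worth copying is the single annotation: everything in the loop exists to make one data point, the 50% crossing, the thing viewers remember.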

Gamified Design

Gamification bears some similarity to games, but the two are not the same. Gamification does not quite create a game; it transfers some of the positive characteristics of a game to something that is not a game, thus gami-"fy"-ing it. Those positive characteristics are often loosely described as "fun," and they have the effect of engaging players in the activity. The fun of gameplay is engineered by the four building blocks, or defining characteristics, of a game: a goal, rules, a feedback system, and voluntary participation. In gamification these building blocks still appear, but in a less pronounced manner (Kim, n.d.). The gamified design of news has three values: first, it caters to the participatory culture of social platforms; second, it enhances the fun and attractiveness of data products through interaction; and third, it changes the one-time consumption of data news, building a "strong relationship" with users through repeated individual participation and sharing with others.

The first form of gamification is a personalized, user-oriented product built for sharing. A case in point is People's Daily's 2017 interactive H5 piece (H5 refers to mobile web pages opened in a browser) "Look! Here is my military uniform photo," which let users upload their own photos and, through face recognition, generate military uniform portraits in the styles of different eras. The news game was viewed by more than 200 million people in two days.

A second form is to engage users in predictions that deepen their knowledge of key data. The New York Times' "You Draw It: How Family Income Predicts Children's College Chances" asks users to draw the relationship between family income and college attendance before displaying the correct answer, letting them have fun, deepen their understanding, and see what others predicted.

A third form is to let users experience the plot of the news. The Wall Street Journal's virtual reality data story "Is the Nasdaq in another bubble?" presents the rise and fall of the Nasdaq index over 21 years as a three-dimensional runway, with users riding the ups and downs of the index from a first-person perspective, like a roller coaster. Through the exciting experience generated by this procedural rhetoric, users can genuinely feel the dramatic fluctuations of the stock market (see QR code 6.1).


QR code 6.1 Is the Nasdaq in another bubble? http://graphics.wsj.com/3d-nasdaq/
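
The guess-then-reveal pattern behind "You Draw It" can be reduced to a few lines. The console sketch below asks for a prediction and then shows the gap between the reader's mental model and the data; the attendance figure is a made-up placeholder, not the Times' data.

```python
ACTUAL_RATE = 0.65  # hypothetical placeholder value

def prediction_round():
    """Guess first, then see the data: committing to a prediction
    makes the revealed number more memorable and more shareable."""
    guess = float(input("What share of children from high-income "
                        "families attend college? (0-1): "))
    gap = guess - ACTUAL_RATE
    print(f"The data says: {ACTUAL_RATE:.2f}")
    if abs(gap) < 0.05:
        print("Close! Your mental model matches the data.")
    else:
        direction = "over" if gap > 0 else "under"
        print(f"You {direction}estimated by {abs(gap):.2f} - "
              "that surprise is the hook for sharing.")

if __name__ == "__main__":
    prediction_round()
```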

Lightweight Design

Lightweight design means minimizing the technical or content load of data journalism without compromising meaning or user experience. It operates on two levels: technical lightweighting and content lightweighting.

Technical lightweighting means matching the technical form of data journalism products to the performance of mobile devices. One study found that data journalism is increasingly focused on making data lighter for easier loading, using swipes instead of clicks to present visual content suited to mobile reading habits, and linking data stories with social media such as Facebook (Zhao & Chen, 2018). Users want convenience: the more complexity a designer demands of them, the less rewarding the work becomes (Huang, 2019). The Financial Times' "One Belt, One Road" ships in two versions: the desktop feature splits the screen vertically, with the map on the left changing as the reader scrolls, while on mobile the map is pinned to the top of the screen and changes as the reader scrolls (Otter, 2017). CGTN's interactive page "Who Runs China," launched in 2019, uses a similar design, with 2,975 dots representing the 2,975 deputies of the Chinese National People's Congress and analyses of the deputies' gender, age, ethnicity, and education. Whenever the user scrolls the mouse or swipes the mobile screen, the text floats above the visualization and the two work together to present the content; the reading experience is good in both the desktop and mobile versions (see QR code 6.2). Work that is not lightweight either fails to spread on mobile or delivers a poor mobile experience: NetEase's "Cameron was re-elected as British Prime Minister, let's see how the last 11 Prime Ministers are" is a long infographic that is clear and easy to follow on a computer, but mobile users have to slide the screen back and forth to make sense of it (Xu & Wang, 2015).

Content lightweighting means calibrating the amount of information to users' reading habits on social platforms: text should be short, graphics well focused, and video able to tell its story in a short time. According to BuzzFeed, the majority of social media shares happen after people have been on a page for over three and a half minutes on desktop, or over two minutes on a mobile device (Jeffries, 2014).


QR code 6.2 Who runs China. https://news.cgtn.com/event/2019/whorunschina/index.html
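
Technical lightweighting often comes down to shrinking the payload before it reaches the phone. Below is a minimal sketch of one common tactic, point-budget downsampling; the function name and numbers are illustrative.

```python
def lighten(points, budget=150):
    """Downsample a [(x, y), ...] series to at most `budget` points by
    keeping every k-th point, so the payload sent to a phone stays
    small while the line's overall shape survives."""
    if len(points) <= budget:
        return points
    step = len(points) / budget
    light = [points[int(i * step)] for i in range(budget)]
    light[-1] = points[-1]  # always keep the latest value
    return light

# 10,000 raw readings shrink to 150 before being shipped to mobile.
raw = [(i, i % 97) for i in range(10_000)]
print(len(lighten(raw)))  # -> 150
```

A production pipeline would more likely use a shape-preserving method such as largest-triangle-three-buckets, but the budgeting idea is the same: decide what the mobile reader actually needs and send only that.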

National Geographic’s 2017 piece, “What Will Become of Scotland’s Moors?” was designed to lighten the content for the mobile version. Its desktop version has a highly-detailed map of Scotland with multiple layers of data, while on mobile the map is broken into multiple base maps, each with different data (Otter, 2017). The lightweight design of the work fits the news receiving habits of social media users, enhancing the user experience and increasing the likelihood of the work being shared. The medium is the message, and sharing is communicating. Social mediaoriented production is crucial not only for data journalism, but for journalism as a whole. Social platforms are not only a compensatory medium, but also a new force that shapes the concept and way of news production. social media-oriented production requires journalists to adapt to the media properties and communication logic of social platforms. At present, The Economist has learned to turn the “fans” into “subscribers” on social platforms, and social platforms may bring more possibilities for the profitability of data journalism. For data journalism, “sharing is the core” is not just a slogan, but a necessity in the social media era. References Berger, J. (Ed.). (2013). “Contagious”: Jonah Berger on why things catch on. Knowledge at Wharton. https://knowledge.wharton.upenn.edu/article/contagious-jonah-berger-on-whythings-catch-on/ Berger, J., & Milkman, K. (2012). What makes online content viral? Journal of Marketing Research, 49(2), 192–206. https://journals.sagepub.com/doi/pdf/10.1509/jmr.10.0353 Chen, G. (2017). Online media headline mess and remediation. Media (8), 12–13. (Published in China.) Customer Insight Group (Ed.). (2017). The psychology of sharing: What is this study about? The New York Times. www.bostonwebdesigners.net/wp-content/uploads/POS_PUB LIC0819-1.pdf Huang, Z. (2019). What is good data journalism. Shanghai Journalism Review (3), 13–14. (Published in China.) iResearch. (2017). 2017 China mobile terminal news and information industry report. iResearch. http://report.iresearch.cn/report_pdf.aspx?id=3034 (Published in China.) Jeffries, A. (Ed.). (2014). You’re not going to read this. The Verge. www.theverge. com/2014/2/14/5411934/youre-not-going-to-read-this Kim, B. (Ed.). (n.d.). Understanding gamification. American Library Association. https:// journals.ala.org/ltr/issue/download/502/252

Law, D. (Ed.). (2017). The evolution of the Economist's social media team. Medium. https://medium.com/severe-contest/the-evolution-of-the-economists-social-media-team-aee8be7ac352
Otter, A. (Ed.). (2017). Seven trends in data visualization. Global Investigative Journalism Conference. https://gijc2017.org/2017/11/19/data-visualization/
Reid, A. (Ed.). (2014). BBC to launch daily infographics shared on social media. Journalism. www.journalism.co.uk/news/bbc-to-launch-daily-infographics-shared-on-social-media/s2/a556686/
Rogers, S., Schwabish, J., & Bowers, D. (Eds.). (2017). Data journalism in 2017: The current state and challenges facing the field today. News Lab. https://newslab.withgoogle.com/assets/docs/data-journalism-in-2017.pdf
Segger, M. (Ed.). (2018). Lessons for showcasing data journalism on social media. Medium. https://medium.com/severe-contest/lessons-for-showcasing-data-journalism-on-social-media-17e6ed03a868
Shearer, E., & Mitchell, A. (Eds.). (2021). News use across social media platforms in 2020. Journalism. www.journalism.org/2021/01/12/news-use-across-social-media-platforms-in-2020/
Southern, L. (Ed.). (2018). How the Economist uses its 12-person data journalism team to drive subscriptions. Digiday. https://digiday.com/media/economist-using-12-person-data-journalism-team-drive-subscriptions/
Wang, Y. (2015). How to use "social currency" to do a good job in micro-communication. International Communications (9), 63–64. (Published in China.)
Xu, Q., & Wang, D. (2015). Netease new media lab: How data journalism on mobile breaks through page limitations. Science & Technology for China's Mass Media (8), 62–65. (Published in China.)
Xu, Y. (Ed.). (2015). 'Share button' becomes standard for news consumption. Chinese Social Science Network. http://ex.cssn.cn/hqxx/tt/201503/t20150320_1554522.shtml (Published in China.)
Zhang, Q. (2016). Unpacking Netease data blog. News World (6), 56–58. (Published in China.)
Zhao, R., & Chen, Z. (2018). Innovative approaches to global data journalism – Take the award-winning works of data journalism award as an example. News and Writing (11), 84–88. (Published in China.)

7

Ethics of Data Journalism Production

Ethics are associated with professionalism and its power as a form of boundary work, delineating an industry's insiders from outsiders (Lewis & Westlund, 2015). If data journalism, still in the process of development, is to be accepted by the industry and recognized by society, establishing professionalism is particularly important, both for delineating the boundaries of data journalism and for maintaining its legitimacy and credibility. The identity of professionalism, however, is constantly constructed through the efforts of participants rather than being a set-in-stone fact (Xie, 2009, pp. 315–316). Data journalists, like scientists and other knowledge producers, attempt to construct their professional communities by setting professional goals and developing a set of ideologies to justify and achieve those goals. The formation of professional communities is not natural but achieved through practices (including discursive practices), and the process involves both boundary work (Gieryn, 1983) and collective interpretation (Lu & Zhou, 2016).

A survey found that data journalists mostly emphasize the centrality and primacy of data and see themselves as mere translators of abstract technical knowledge (Boyles & Meyer, 2016). The starting point for data journalists is data, as they believe that everything in life can be quantified. Yet data news production still faces many challenges, including untrained analysts, unreliable data, measuring the wrong thing, and unjustified data collection (Stoneman, 2015). Published without context or consideration for ethics, data journalism can cause harm by perpetuating stereotypes and biases (McBride, 2016). The tools of data journalism can also be used to spread untruths, because data points are not facts; they are only given elements from which facts are derived (Kayser-Bril, 2016). Data journalism therefore needs a set of accepted production ethics; otherwise it is likely to become a source of confusion in the public's perception of the real world and eventually lose credibility. The 2016 U.S. presidential election is a case in point: the accuracy of data journalism was called into question when pre-election polls "overwhelmingly" favored Hillary Clinton and even FiveThirtyEight, which had correctly predicted the outcome of the previous two elections, was off (Tesfaye, 2016). As Jacob Harris puts it, "the public has only a limited tolerance for fast-and-loose data journalism and we can't keep fucking it up" (Harris, 2014).

Since the birth of modern journalism in the 17th century, journalism ethics has undergone four "revolutions": (1) the invention of journalistic ethics by the 17th-century periodic press; (2) the fourth-estate ethics of newspapers in the 18th-century Enlightenment public sphere; (3) the liberal ethics of the 19th-century press; and (4) the professional ethics of the mass commercial press of the late 1800s and early 1900s (Ward, 2011, p. 3). Now new media technologies such as digital technology are bringing great "uncertainty" to journalism (Li & Chen, 2016). Since the 1990s, journalism has entered the digital era, and new forms of communication are changing journalism and its ethics. Media ethics requires urgent reflective engagement because basic values are under question and new issues challenge traditional approaches to responsible journalism (Ward, 2011, p. 1). Today, the media revolution calls for an ethics of mixed, global media (Ward, 2011, p. 3). It is particularly necessary to establish ethical norms that fit the real characteristics of data journalism.

In-depth interview (Reporter D, reporter for Caixin's "Digital Speak" section, online interview): Data journalism is produced with reference to general journalism codes. Traditional journalistic ethics norms are still meaningful to data journalism, but they cannot be applied directly.

In our in-depth interviews with Chinese data journalists, some mentioned that certain media outlets "falsify" data in order to prove a point of view. Caixin's reporter D noted that "there is no professional rating for data journalists at the moment." To establish the professionalism of data journalism, it is imperative to "establish a set of systematic professional standards that can guide the reality" (Fang & Gao, 2015). What ethical norms should data journalism establish? Answering such questions comprehensively is not easy; in this chapter, we explore three aspects: objectivity, transparency, and the use of personal data.

The Principle of Objectivity in Data Journalism

The media's adoption of data journalism and the public's trust in it rest largely on the perception that data, and hence data journalism, are objective. Data journalism can convey the truth through the aura of data, or it can mislead public perception through data tricks. The objectivity of data news production has thus become a real issue.

Data journalism, born at the beginning of the 21st century, faces an awkward situation, as Western journalism is currently going through a crisis of public trust. In academia, doubts about objectivity arose in the middle of the 20th century, when postmodernists and others questioned the ideals of truth and objectivity in Western culture and science (Ward, 2011, p. 132). Many people even believe that "realistic news is not objective," "news cannot be objective," or "news does not need to be objective" (Xie, 2009, p. 87). Objectivity is one of the important cornerstones for journalism

to establish its own boundaries and achieve its own legitimacy, and the questioning of journalistic objectivity essentially reflects Western society's reexamination of "journalism as a profession." Data journalism is in fact a product of the Western journalism industry's efforts to rebuild a trusting relationship with the public and to enhance the objectivity and professionalism of news reporting. Western media, however, face a basic logical problem: how can a non-objective journalism produce objective data news? At the same time, although the development of data journalism is in full swing, the lack of professional norms is gradually eating away at its credibility and professional value (Fang & Gao, 2015). What exactly is objectivity? Does journalism need objectivity, and if so, what kind? How should data journalism observe objectivity? This section explores these questions.

Polysemy of Objectivity

In journalism, objectivity is a complex, polysemous, and dynamic concept with three levels of meaning. In the broadest sense, objectivity originates in the subject-object dualism of Enlightenment rationalism. At the middle level, objectivity refers to the basic characteristic that news facts do not change according to people's subjective will, so journalists should reflect the facts as they are when reporting news. In the narrowest sense, objectivity refers to the idea of objectivity in news reporting (Xu, 2007), which gives rise to "objective reporting," a form of reporting that embodies objectivity by separating facts from opinions, presenting news from a viewpoint free of emotion, and striving to be fair and balanced so as to give the audience comprehensive information (Liu, 2015, p. 89). Objectivity cannot be equated with objective reporting; it is far more complex. News is a performative discourse used to convince readers that its content is true (Van den Hoven, 2016). Objectivity is a value, and objective reporting is the form in which that value is put into practice; the main purpose of objective reporting is to demonstrate objectivity in order to win the audience's trust.

The current understanding of objectivity can be broadly divided into two types: ontological and epistemological. Ontological objectivity is related to realism, which holds that the object exists independently of the subjective and treats objectivity as a goal. Epistemological objectivity focuses not on the object of belief but on the process of belief formation; discussions of epistemological objectivity typically concern the objectivity of attitudes, beliefs, judgments, and methods of acquisition. Traditional journalistic objectivity combines ontological and epistemological objectivity, but it places more emphasis on rules and methods at the epistemological level: a journalist will usually say that he used an objective method (epistemological objectivity) to obtain objective facts (ontological objectivity) (Ward, 2011, pp. 128–129).

According to Robert Hackett and Yuezhi Zhao (2010, pp. 57–59), objectivity can be understood in five dimensions. First, objectivity is a normative and ideal set of goals for news writing, even if these goals can never be fully achieved. These

goals divide into those at the cognitive level of journalism, such as the ability to report the world truthfully, and those at the evaluative level, such as the ability to communicate valuable meanings and interpret the world. Second, objectivity is a set of epistemological assumptions about knowledge and the real world, implying that journalists can convey the meaning of facts or events to their audiences through neutral and appropriate reporting skills. Third, objectivity encompasses a set of practices and uniform technical standards for gathering news, though these may vary across media outlets as they evolve over time. Fourth, objectivity is a social structure embodied in various institutionalized forms. Fifth, objectivity is an active component of public discourse about the news: it is an important criterion by which the public judges news, and when the public criticizes reporting for not being objective, it actually reinforces objectivity's character as a criterion for evaluating news (Hackett & Zhao, 2010).

As can be seen, the greatest source of the debate over journalistic objectivity lies in its incommensurability. Kovach and Rosenstiel argue that the original meaning of objectivity is now thoroughly misunderstood and close to being lost (Kovach & Rosenstiel, 2014, p. 97). What exactly is journalistic objectivity? To answer this question, we need to go back to its origins. The idea of objectivity began with realism, which arose in the 19th-century United States as journalism separated from political parties and pursued accuracy. Realists believed that if journalists uncovered facts and put them together in a certain order, the truth would emerge naturally; journalists at the time thought this would help them understand events (Kovach & Rosenstiel, 2014, p. 98). In Liberty and the News, published in 1920, Walter Lippmann argued that the crisis of democracy lay in the scarcity of journalistic facts and the press's failure to perform its duties, resulting in a plethora of distortions that put the truth beyond the public's reach (Hu & Wu, 2010). Journalism, Lippmann argued, needs ways to keep journalists clear-eyed and free from irrational, untested, and unacknowledged assumptions when observing, understanding, and reproducing news; the best remedy for journalistic bias is for journalists to be more scientifically minded. "There is but one kind of unity possible in a world as diverse as ours. It is unity of method, rather than aim; the unity of disciplined experiment" (Kovach & Rosenstiel, 2014, pp. 97–98).

Initially, objectivity was understood as a method that would allow different types of media in the United States to coexist in harmony under general journalistic principles. This also meant that the objectivity of media reporting was evaluated not by its content but by its procedures and editorial methods. The discussion of objectivity now extends beyond journalistic methods to include goals; yet if we understand objectivity as a goal, we also concede its unattainability. The field of journalism has never existed independently; it is interdependent with other fields. One current complaint is that (ontological) objectivity is too demanding an ideal for journalism and hence a myth (Ward, 2011, p. 132), as Western journalism now evidences.

At the same time, the advent of the post-truth era has led people to believe that "knowing the truth is impossible" or that "the truth is unimportant" (Wang, 2017), so the media may shape public opinion more effectively through emotional resonance with the audience than by reporting objective facts (Zhou & Liu, 2017), posing a great challenge to traditional journalistic objectivity.

Objectivity Redefined: A Pragmatic Perspective

One of the cornerstones of journalism's legitimacy lies in its insistence on objectivity; otherwise, journalism is not unlike advertising or public relations. Faced with the controversy over objectivity, journalism has two options: one is to abandon its adherence to traditional objectivity and find an alternative norm; the other is to redefine objectivity. However, the journalism industry does not yet have an alternative to objectivity, and objectivity exists in a wide range of fields such as law and science, and is not exclusive to journalism. Objectivity, as a possible approach to the truth, needs to be redefined in a more practical way (Ward, 2011, p. 152). Stephen Ward proposes to replace traditional objectivity with Pragmatic Objectivity.

Pragmatism began with discussions among members of the Harvard Metaphysical Club in the 1870s, and was formalized in 1877–78 with Charles Sanders Peirce's The Fixation of Belief and How to Make Our Ideas Clear. William James developed Peirce's methodological principles of pragmatism into a more systematic theory for the analysis of various problems. John Dewey emphasized the scientific approach to problem solving and called pragmatism "empirical naturalism" (Peirce, 2007, pp. 4–5). The dominant spirit of pragmatism is to value the fluidity of experience over rigidity, action over talk, the scientific method over superstition, and creative enterprise over conformity (Peirce, 2007, p. 7). In fact, objectivity as a method only realizes the possibility of obtaining the truth, and the objective method under certain conditions may be wrong in the future, so there is no absolute objectivity. According to William James (2012, p. 145), the essential contrast is that for rationalism reality is ready-made and complete from all eternity, while for pragmatism it is still in the making, and awaits part of its complexion from the future.

Traditional objectivity is one-dimensional, i.e., statements are tested by facts while statements beyond facts are subjective, which leads to a passive role of journalists in news reporting. However, journalists are supposed to be active seekers of the truth and need to question and explain things, so pragmatic objectivity can increase journalists' motivation in reporting (Ward, 2011, p. 152). Pragmatic objectivity consists of the objective stance and five criteria (Ward, 2011, pp. 154–155). The objective stance "consists in moral virtues such as honesty and intellectual virtues such as caring about truth in general and caring for specific truths." It requires journalists to "place a critical distance between oneself and the story, to be open to evidence and counter-arguments, to fairly represent other perspectives, and to be committed to the disinterested pursuit of truth for the public."

The first criterion is empirical validity, which tests whether evidence is carefully obtained and gathered and whether data are accurately presented in the news. Empirical validity is broader than reporting facts and includes placing facts in context.

The second criterion is the completeness and implications of the news, which means that "journalists should check to see if the story contains all important facts, avoids hype, and reports on both the risks and benefits (or the positive and negative consequences) of a new policy or scientific discovery for society" (Maeyer, 2015).

The third criterion is consistency. It tests the story for "coherence with existing knowledge and the views of credible experts."

The fourth criterion is self-consciousness. An objective story is self-conscious about the frame it uses to present a study or event, and about the sources chosen.

The last criterion is intersubjective objectivity. Richard Rorty (1992, p. 80) argues that non-coercive agreement provides us with everything we might need to move toward objective truth, that is, intersubjective agreement. Thus, whenever there is intersubjective agreement about the beliefs that guide our actions, we have objectivity; objectivity comes from agreement, just as intersubjectivity does (Rorty, 1992).

The Principle of Objectivity in Data Journalism

Pragmatic objectivity has great implications for data journalism. Data may seem neutral and objective, but it is not: data is by nature a "representation" rather than a "mirror" of the real world. Data journalism is a form of journalistic argument, and a form of social science research argument (Zeng et al., 2017). Objectivity is thus an important criterion for evaluating the reliability of conclusions in data journalism, and data journalism resembles scientific research in that all of its stages bear on pragmatic objectivity (see Table 7.1).

Table 7.1  The pragmatic objectivity of data journalism

| Criterion | Data collection | Data analysis | Data visualization |
| Empirical validity | Accuracy, credibility, and appropriateness of the data, and scientificity of the methods | Choice of data analysis methods, algorithms, and contexts | Data mapping |
| Completeness and implications | Data quality, possible flaws | Possible interpretation of data results, data bias | Whole presentation of data, bias in design, framing, and visual rhetoric |
| Consistency | Similar data and methods, intercorroboration of data | Shared analytical methods | Cultural and aesthetic conventions under certain social conditions |
| Self-consciousness | Critical awareness | Critical awareness | Critical awareness |
| Intersubjective objectivity | Multiple opinions and expert advice | Multiple opinions and expert advice | Multiple opinions and expert advice |

How can data journalism practice pragmatic objectivity? With reference to the five criteria of pragmatic objectivity, we propose principles of objectivity for the data collection, data analysis, and data visualization stages of data journalism production.

The Principle of Objectivity in Data Collection

The reliability of data directly determines the authenticity of data journalism. At the level of empirical validity, journalists need to systematically assess the quality of the data they collect, judged by whether the data meet the needs of the inquiry. The main dimensions of evaluation are as follows.

First, the accuracy of the data. Journalists need to verify that the data is correct. Collected data often contains errors, formatting, spelling, and categorization errors among them, and such data cannot be analyzed as soon as it is obtained; it first needs to be cleaned.

Second, the credibility of the data, including the credibility of the data source, credibility compared with usual internal standards, and credibility given the age of the data. According to Elliot Bentley, graphics editor of The Wall Street Journal, journalists must remain skeptical of data: "unless it's coming from a really verified source, it's always important, especially when you're doing an original analysis of data, to be critical" (Ciobanu, 2016). Usually data from authoritative sources (e.g., governments, leading survey companies, reputable NGOs, and universities) can be trusted. By contrast, Xinhua's "Graduating from a Top School with a High Salary? Data Stories" charts the relationship between overall university ranking and graduation salary ranking, but its data sources, the Chinese Universities Alumni Association (cuaa.net) and iPIN (a job-search website), do not collect complete information on graduates, so the objectivity of the report is questionable (Zhu, 2016).

Third, the appropriateness of the data. Before collection, journalists need to judge, based on research needs, how much data is necessary to address the problem.

Fourth, the scientificity of the data collection methods. Whether for primary or secondary data, it is particularly important to assess how scientifically the data were collected. During the 2016 U.S. presidential election, some media outlets hoped to influence the election by manipulating poll results, choosing poll questions, sample sizes, and error interpretations according to their own preferences (Wang, 2016); their data collection lacked scientificity.

In terms of the completeness and implications of data collection, data journalists need to judge the integrity of the collected data as well as its strengths and weaknesses. For some secondary data, for instance, the original purpose of collection is not exactly the same as the question the journalist is studying, and the journalist needs to understand the problem the data were gathered to address, the purpose of that study, the applicability of its conclusions, the margin of error, and the data's deficiencies. When direct data is unavailable or incomplete, journalists may research with alternative data, and then they need to consider carefully what data is needed. Following the abduction of 276 schoolgirls by the

extremist terrorist group Boko Haram in Nigeria in April 2014, the FiveThirtyEight website produced the data story "Kidnapping of Girls in Nigeria Is Part of a Worsening Problem," examining trends in kidnapping cases in Nigeria over the years. The article concluded that the number of kidnappings had risen over three decades from just two in 1983 to 3,608 in 2013. That conclusion is not true, and the main reason for the error is that the journalist mistakenly equated the number of cases reported in the media with the number of cases that actually occurred (Fang, 2015, p. 194).

The consistency criterion requires journalists to reflect on whether their method of data collection follows generally accepted operating procedures. When collecting secondary data, it is important not to rely on a single data set but to collect from as many sources as possible, so that different data can corroborate each other. Simon Rogers holds that diverse data sources are essential for data journalism (Rogers, 2015, p. 5). At least two data sources should go into a data story to demonstrate objectivity, though in practice not every story meets this standard.

The self-consciousness criterion requires data journalists to maintain a prudent and critical approach to data, to understand data quality systematically, and to comprehend data in its original context. This is an important guarantee of objectivity at the data collection stage.

The criterion of intersubjective objectivity requires journalists to communicate with data collectors and experts, to discuss every aspect of data collection, to listen to opinions and suggestions from different parties, and to ensure that the collected data meet professional standards and research needs. For example, a Miami Herald reporter's story claiming that judges were not punishing some DUI offenders turned out to be inaccurate because the reporter was unaware of special circumstances in the data he obtained: the judge had sentenced first-time defendants who could not afford fines to community service, and because these circumstances were not marked in the database, the reporter concluded that the judge had let some DUI offenders off (Chang & Yang, 2014).
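
Many of these collection-stage checks can be made routine. The pandas sketch below illustrates the spirit of the accuracy and consistency criteria: normalize formats, flag unparseable or duplicate records, and cross-check two independent sources before trusting either. The file names, column names, and the 25% divergence threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical files; both are assumed to carry region, year, cases columns.
official = pd.read_csv("official_kidnapping_stats.csv")
media = pd.read_csv("media_reported_cases.csv")

# Accuracy: normalize formats and flag records that cannot be analyzed yet.
official["region"] = official["region"].str.strip().str.title()
official["year"] = pd.to_numeric(official["year"], errors="coerce")
print("unparseable years:", official["year"].isna().sum())
print("duplicate rows:", official.duplicated().sum())

# Consistency: do two independent sources corroborate each other?
merged = official.merge(media, on=["region", "year"],
                        suffixes=("_official", "_media"))
gap = (merged["cases_official"] - merged["cases_media"]).abs()
suspect = merged[gap > 0.25 * merged["cases_official"]]
print("region-years where the sources diverge by more than 25%:")
print(suspect[["region", "year", "cases_official", "cases_media"]])
```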

The Principle of Objectivity in Data Analysis

In Raw Data Is an Oxymoron, Lisa Gitelman, a professor at New York University, argues that once the method of data collection is fixed, it determines how the data will be presented. However fair and objective the results of data analysis may seem, journalists' values influence the whole process from data construction to interpretation (Xu, 2014, p. 59). "Data does not intrinsically imply truth. Yes, we can find truth in data, through a process of honest inference. But we can also find and argue multiple truths or even outright falsehoods from data" (Diakopoulos, 2013).

At the level of empirical validity, journalists need to judge whether their chosen data analysis methods can effectively answer their questions, and they need to judge the algorithms and models they design against professional standards. Take algorithms as an example: there are superior and inferior algorithms, and

a good algorithm needs to meet the criteria of correctness, readability, robustness, high efficiency, and low storage requirements (Lv, 2009, p. 7). The same algorithm may behave very differently on different data or for different analysis objectives (Hong et al., 2014, p. 74). Moreover, algorithms carry biases, which exist in every aspect of algorithm design and operation (Zhang, 2018). When designing and selecting algorithms, it is necessary to follow algorithmic ethics to avoid misleading audiences.

In terms of completeness and implications, data reporters are expected to evaluate whether their analytical methods reveal the full, deep picture and avoid intentional or unintentional bias. WNYC's "Median Income Across the US" uses medians to present income distributions, a simplification that produces data maps at odds with reality. Some areas on the map are bright green, indicating very high median incomes, and so may lead readers to take them for wealthy neighborhoods. In fact, some of the green areas are among the poorest in the city: in some poor areas more than ten people share a single apartment, driving up household income figures without making the area affluent (Porway, 2016).

The consistency criterion requires that the analysis methods journalists use meet professional standards. Journalists usually need to apply currently accepted methods rather than challenge the established consensus on data analysis, unless they themselves have a deep understanding of those methods.

The self-consciousness criterion requires journalists to know their analysis methods thoroughly and to maintain a critical attitude: to understand the original context of the data they collect, the strengths and limitations of their chosen methods, and the problems that may arise in answering particular questions. In the 2016 U.S. presidential election, the data analytics used by news organizations included traditional polls, polling aggregations and averages, forecasting models, Google search-term analytics, automated social media analytics, and election prediction markets (Kuru, 2016). Journalists using such methods should understand each one and exercise independent judgment about the contexts in which each applies, rather than using them blindly.

The criterion of intersubjective objectivity requires journalists to discuss their analysis methods and conclusions with experts and peers rather than deciding alone, refining both by listening to different opinions. Data is itself a simplification of things, and with simplification comes the danger of distortion (Sun, 2014). The role of experts is therefore very important at the data analysis stage.
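
The WNYC example turns on a single statistical point: a median says nothing about how the underlying values are grouped, or about the unit of analysis. The toy numbers below, invented for illustration, show how a block of low-wage residents can register a high median household income simply because many earners share each household.

```python
import statistics

# Toy numbers: ten low-wage earners share each of three apartments.
individual_incomes = [9_000] * 30  # 30 residents
household_incomes = [90_000] * 3   # 3 pooled households

print(statistics.median(household_incomes))   # 90000 -> looks "bright green"
print(statistics.median(individual_incomes))  # 9000  -> the lived reality
```

Switching the unit from household to person flips the story, which is exactly the kind of choice the completeness criterion asks reporters to surface rather than bury.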

The Principle of Objectivity in Data Visualization

Stuart Hall, in "Encoding/Decoding," argues that "a 'raw' historical event cannot, in that form, be transmitted by, say, a television newscast" (Luo & Liu, 2000, p. 346). In the process of news dissemination, one of the prerequisites from encoding to

decoding is that news "must be encoded and can be encoded." "Must be encoded" means that the subject's knowledge of news facts must be converted into news text through the encoding process (Xiao, 2016, p. 56). "Can be encoded" means that encoding must follow the social, media, and technical rules of discourse, which are the prerequisites for audiences to "decode" texts accurately. In fact, these rules themselves are not entirely "neutral" or "born that way"; they are the product of negotiations among various forces. It is only after the rules are established that they take on a "naturalized" and "objective" appearance. Data visualization is the final step in data journalism production. To achieve objectivity, journalists need to represent data and its meaning accurately and follow design conventions. The empirical validity criterion requires that a data visualization present the data accurately at the design stage. Common inaccuracies include data sizes that are out of proportion to the corresponding areas in the visualization and pie charts whose parts add up to more than 100%. Simon Rogers (2015, p. 275) argues that aesthetic deficiencies and design flaws in data visualization are excusable, but factual distortions and misinformation are not. The completeness and implications criteria require designers to consider whether the visualization presents the results of the data analysis completely, whether design flaws could lead audiences to misinterpret it, and how significant such negative effects are. WNYC produced a news story on "stop and frisk" data in which brighter pink city blocks on a map indicate more stop-and-frisk incidents and green dots indicate where guns were found. The map shows the green dots appearing near the hot pink squares less often than one might expect, implying that stop-and-frisk might not be very effective at getting guns off the streets. However, a citizen journalist plotted a map from the same data and concluded that stop-and-frisk is in fact effective (Porway, 2016). After the 2012 U.S. election results came out, some data visualizations showed the states won by the Democratic candidate Barack Obama and the Republican candidate Mitt Romney on a map of the United States. The map seemed to suggest that the two sides were neck and neck, when in fact Obama took 332 electoral votes to Romney's 206. States with large areas do not necessarily have more electoral votes, so presenting the results on a standard map of the United States may look objective while actually being misleading. The New York Times therefore made a distorted map, turning the states into squares sized according to their electoral votes, so that the comparison of the two candidates' votes would be more visual and accurate (Fang, 2016). The consistency criterion requires that data visualization follow design conventions and conform to people's perceptions of visual symbols in a given period and cultural context. For example, in the use of colors, semantically resonant colors are important for data visualization because they can help improve graph-reading performance (Lin et al., 2013; Lin & Heer, 2014). Michael Dunn, a Florida man, got into an argument with a black youth and shot and killed him in 2012.
Dunn, who was charged with second-degree murder, used Florida’s “Stand Your Ground” law to defend himself. Under the law, anyone can


QR code 7.1 This chart shows an alarming rise in Florida gun deaths after 'Stand Your Ground' was enacted. Insider. www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2

use a weapon wherever they believe they are under attack or that their life or safety is in danger. Reuters published a graphic showing a dramatic spike in Florida murders by firearm after the Stand Your Ground law was enacted in 2005. However, the graph gives the impression that shootings declined after the law was implemented, because the Y-axis values grow from top to bottom instead of the conventional bottom-to-top (Engel, 2014; see QR code 7.1). The process of transforming data into visualizations is not an automatic one. The self-consciousness criterion requires journalists to design with the needs of the general audience in mind while also fitting their own communication intentions. The intersubjective objectivity criterion requires that data visualizations be designed with input from multiple parties in order to correct apparent or potential biases and errors for more accurate representation. Our interviews with data journalists in China revealed that the editorial team or the person in charge often communicates with the designer before and after the production of a data visualization, and the designer realizes that the visualization is not an "individual" work but a collective one that requires the consideration of multiple parties.

The Principle of Transparency in Data Journalism

Transparency has long been neglected by journalistic ethics (Plaisance, 2007). Since the start of the 21st century, transparency has gained prominence in American media discourse (Allen, 2008). In 2009, Patrick Plaisance's Media Ethics: Key Principles for Responsible Practice listed transparency as the primary ethical principle of journalism, while objectivity was not included (Xia & Wang, 2014). In 2013, the Poynter Institute published new ethical guidelines for journalism that replaced "independence" with "transparency." In September 2014, the Society of Professional Journalists expanded the journalism ethic of "Be Accountable" to "Be Accountable and Transparent": journalists should explain to the public the ethical choices they face and make, encourage public participation in discussions about the standards of journalism, respond promptly to questions about the accuracy and objectivity of reporting, and acknowledge, correct, and communicate errors in journalism (SPJ, 2014). Media outlets such as NPR and the BBC have written transparency into their own codes of ethics. Some even believe that objectivity, central in the 20th century, will be replaced, and that transparency will become the new professional norm and narrative strategy (Karlsson, 2010).

The Connotation of Transparency in Journalism

Transparency refers to "the extent to which an organization reveals relevant information about its internal workings" (Grimmelikhuijsen, 2010). "Transparency, as in the increasing ways in which people both inside and external to journalism are given a chance to monitor, check, criticize and even intervene in the journalistic process" (Deuze, 2005). The essence of transparency is to bring the "backstage" of news production to the "front stage," to break the invisible wall between the editorial department and the audience, and to make the previously invisible process of news production visible. What does the principle of transparency require journalists to present to their audiences? Some argue that transparency has three components. First, Accountability, which means that data and methods should be easily accessible or visible to the audience. Second, Interactivity, which gives audiences the opportunity to participate actively in the production of news, for example through open spaces for audience comments and invitations to contribute news material. Third, Background Openness; that is, journalists provide their own personal background information to increase the credibility of their reports (Gynnild, 2014). According to Michael Karlsson, transparency consists of Disclosure Transparency and Participatory Transparency. Disclosure Transparency means that news producers explain and disclose the criteria for news selection and their production methods. Participatory Transparency means that the audience is invited to participate in different stages of the news production process (Karlsson, 2010). Other scholars have divided transparency into three parts: Production Transparency, the transparency of the production process itself; Actor Transparency, the transparency of the journalist's own information; and Dialogue Transparency, the transparency of journalists' interaction with the public (Groenhart & Bardoel, 2012). NPR views transparency as a tool that allows the public to evaluate the work of journalists, who need to reveal as much as possible about how they discover and verify the facts they present, to make their decision-making process clear to the public, and to disclose any relationships, whether with partners or funders, that might appear to influence their coverage (NPR, 2014). Stephen Ward summarizes the specific elements of online journalism that require transparency: 'about' pages on websites; highlighting corrections and apologies online, and maintaining a record of changes; explanatory 'boxes' for controversial decisions; links to codes of ethics, press councils, and ways for the public to question stories; editorial notes at the top or bottom of stories explaining a reporter's relation to a source in the story (or some other matter); links to background knowledge, experts, and other journalistic treatments of the same story; links to original documents, raw interview notes, unedited interview tape, and video of an entire news conference; editorial policies on external partners, citizen images or text, and online forums with editors, reporters, or ombudsmen; placing reader comments and questions alongside the online

story; regular publication of in-house evaluations of compliance with standards; and facts about ownership and readership (Ward, 2015). The Trust Project, run by Santa Clara University's Markkula Center for Applied Ethics, is dedicated to promoting transparency in journalism. Within the project, 75 media professionals developed the Trust Indicators, a list of eight indicators that journalists should disclose, which has been adopted by Google, The Economist, The Washington Post, and other media outlets (The Trust Project, 2014). By sorting through current views on transparency in academia and industry, we have summarized three dimensions of transparency: the communicator dimension, the content dimension, and the feedback dimension. In the communicator dimension, transparency involves the media outlet and its journalists: the outlet should disclose basic information about itself (e.g., ownership, mission), and journalists should publish their own basic information (e.g., expertise, contacts) and explain their relationships with stakeholders in the news story. In the content dimension, journalists need to disclose (sometimes with further explanation) information about the interview process, methodology, sources, and original materials, and to correct errors promptly. In the feedback dimension, media and journalists need to provide spaces for dialogue (e.g., columns, web pages) to interact with audiences, invite them to participate in news production, and present this process publicly. Transparency is meant to emphasize the mediating character of journalists, reminding the audience of the presence of journalists between "reality" and "the representation of reality" (Rupar, 2006). Journalists recognize that news production systems are not perfect and that knowledge is constantly evolving; if audiences do not trust a news story, they can judge for themselves (Rosen, 2017). This is fundamentally different from the past, when journalists claimed to be objective and neutral "intermediaries": they now admit that they are "reflecting" reality, not "mirroring" it.

Transparency in Data Journalism

There is no doubt that data journalism needs to be transparent. "Every algorithm, however it is written, contains human, and therefore editorial, judgments. The decisions made about what data to include and exclude adds a layer of perspective to the information provided" (Bell, 2012). Data journalists should allow others to replicate their work, and therefore need to publish their research methods just as scientists do (Shiab, 2015). For data journalists, making data, data analysis, and algorithms transparent prompts them to review their own processes, aware that an external audience is monitoring them. Transparency benefits the computing field and the data journalism community, and the constructive feedback provided by a knowledgeable audience can improve the accuracy of their work (Stark & Diakopoulos, 2016). Of course, we also need to recognize that transparency remains a goal rather than a reality in journalism (Koliska, 2015). It is imperative to build a set of transparency

norms for data journalism. Based on this discussion, we classify transparency into data collection transparency, data analysis transparency, algorithmic transparency, data visualization transparency, producer transparency, and other transparency matters.

Data Collection Transparency

Transparency in data collection means that data reporters should publish information related to the data (e.g., its specific sources, an evaluation of its quality, and the tools and processes used to collect it) and provide links to the data or the raw data. The specific source of the data refers to the report, database, or institution from which the data came, the context in which the data were collected, and the issues they were originally used to reveal. Data news on new media platforms should provide links to the data or raw data. The degree of transparency in accounting for specific data sources varies across media. In some Chinese data stories, for example, it is common to see "a combination of reports from different media" without any mention of which media have been "combined." Others may indicate the source of the data but provide only vague information, such as "according to the National Bureau of Statistics of China," without disclosing when the data were released or the name of the dataset. Some Western mainstream media, such as The Guardian, are relatively good at explaining the exact source of data and will use hyperlinks to let audiences see the source when referring to certain data. In short, transparency of data sources means that anyone who wants to verify the accuracy and veracity of the data should be able to find the original data or its source at the lowest possible cost. Transparency in data collection also involves an account of data quality. Objectivity requires an assessment of data quality, while transparency requires disclosure of that assessment. In many cases, the raw data do not fit the issues being explored by journalists and need to be "recontextualized." Sometimes the data used by journalists may even be flawed. Even data used in scientific research has limitations, so journalists need to make those limitations clear so that audiences can judge the extent to which they can trust the story. In Caixin's "From Regulation to Stimulation: A Ten-Year Cycle of the Property Market," for example, the journalist gives a detailed account of the sources and quality of the data used at the end of the report. The data on year-over-year price increases and decreases come from the monthly price indices of new residential units in 70 large and medium-sized cities published by China's National Bureau of Statistics. Since the National Bureau of Statistics does not publish the average price of new residential units in individual cities, most of the reported house price data come from the average price of new residential units in a sample of 100 cities compiled by the China Index Research Institute. A few cities are not included in the list monitored by the China Index Research Institute, and the

data for such cities come from other public sources. The statistics on the average price of new residential units in 100 cities began in 2010, so the data in the report from July 2005 to December 2009 are based on the average price of new residential units in 100 cities and the year-over-year price index of new residential units in 70 large and medium-sized cities from the National Bureau of Statistics; the values may therefore deviate slightly. Data on urban disposable income per capita are from Wind Information, a Chinese firm in financial data, information services, and solutions. The description of data quality assessment is not fully transparent in everyday data journalism production. The assessment of data reflects the media's attitude toward the data they receive: whether they trust it completely or acknowledge that it has limitations. Accounting for data quality assessment should become a matter of industry consensus, so that audiences know the media are aware of data quality rather than using data uncritically.

Data Analysis Transparency

Transparency in data analysis refers to the accounting of the factors that influence the results of data analysis. For example, accounting for the treatment of a composite index means disclosing whether the index, after separately quantifying the multiple factors that measure an issue, is independent and exhaustive, whether it is standardized, and whether its weights sum to 1. This is information that should be communicated to the audience (Hu, 2017). An interactive news article on the Financial Times website, "What is at stake at the Paris climate change conference?", explains which team developed the Climate Change Calculator it uses, and includes a hyperlink to a report on the team. The COP21 Climate Change Calculator was co-created by the Financial Times and the creators of the Global Calculator. Its methodology was developed with funding from the Grantham Institute at Imperial College London, by a team based at Imperial College London and the Indian Institute of Science in Bangalore. Readers can learn more about the work behind the tool in a technical note from the scientists who carried out the research. The BBC's "Which sport are you made for?" opened up all the data collected and published the data collation framework and data mining model for the story on GitHub, where the code can be downloaded for free.
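The weight check described above can be made explicit and published alongside a story. A minimal sketch in Python, with invented indicator names, values, and weights:

# Sub-indicators already standardized to a 0-1 scale (hypothetical values).
indicators = {"housing": 0.62, "income": 0.48, "education": 0.75}
weights = {"housing": 0.5, "income": 0.3, "education": 0.2}

# The disclosure the text calls for: weights must sum to 1.
assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"

composite = sum(indicators[k] * weights[k] for k in indicators)
print(round(composite, 3))  # 0.604

Publishing this handful of lines with a story lets readers re-derive the index and see exactly how much each factor was allowed to count.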

Algorithmic Transparency

An algorithm is an exact and complete set of instructions for solving a problem, which can produce the required output in a limited time for given inputs (Lv, 2009, p. 7). Algorithm design is complex and usually consists of six steps: understanding the question, choosing an algorithm design technique, designing and describing

the algorithm, running the algorithm manually, analyzing the efficiency of the algorithm, and implementing the algorithm (Wang & Hu, 2013, pp. 6–7). Algorithms can be used in all aspects of data collection, analysis, and decision making. They are discussed separately here because the development of artificial intelligence and big data has made journalism increasingly dependent on algorithms for news production. Algorithms are invisible to most people as a "black box," which means that many people are unaware of how they work (Rainie & Anderson, 2017). Algorithmic transparency requires an accounting of the use of data and of how the algorithm is involved in the news production process (Stark & Diakopoulos, 2016). Algorithms strongly influence the extraction of data, discourse analysis, and the interpretation of results, because they are human "creations," meaning that multiple human factors, including the creator's intent, are embedded in them (Diakopoulos, 2014). Among the various elements of data journalism transparency, algorithmic transparency may be the most difficult to achieve, because an original algorithm can itself be considered a trade secret and requires considerable intellectual effort from conception to design and application. It has therefore been argued that while there is much discussion about "opening the black box," algorithmic transparency is unlikely to be achieved (Powell, 2016). The rationale for algorithmic transparency lies in the fact that journalism is a public service, meaning that the public has the right to know the mechanisms by which algorithms operate, and their flaws, when the public interest is involved. There are two different kinds of interest-oriented news production (see Figure 7.1): nonprofit news production that serves exclusively the public interest, such as nonprofit journalism and public broadcasting; and news production that balances public and commercial interests, which makes up the majority of the global news industry. Corresponding to these two orientations are two different types of algorithms, which we name open-source algorithms and proprietary algorithms. Open-source algorithms are those that aim to serve society through transparent and participatory coding, making all source code available for use and modification, and ultimately generating products that are co-created and shared. The emergence of open-source algorithms is related to the open-source movement; there are now a number of open-source AI initiatives abroad, such as OpenAI. Once an open-source algorithm is designed, it can be used and modified by all for free.

Figure 7.1  Algorithmic transparency in news production

Therefore, algorithms that involve the public interest and not commercial interests can be open-source algorithms, and open-source algorithms that are continuously modified and improved by all sides will serve society to a greater extent. Because proprietary algorithms are protected by law as knowledge products, their use and modification are restricted. Most of the algorithms currently used in news production are proprietary. When commercial interests are involved, algorithms are often regarded as trade secrets that may legitimately remain undisclosed, and algorithmic transparency is not easily achieved. When the public interest is involved, the public, as users of and parties affected by the news, has the right to know about the algorithms. Two types of algorithmic transparency need to be distinguished: proactive algorithmic transparency and passive algorithmic transparency. Proactive algorithmic transparency means that the news producer voluntarily makes the operating mechanisms and design intentions of the algorithm publicly available and subject to social scrutiny. Open-source algorithms generally choose to make themselves transparent. According to Mark Hansen, a statistician and the director of the Brown Institute at Columbia University, "computer scientists are putting into code our society's values and ethics, and that process has to be out in the open so the public can participate" (Gourarie, 2016). "The Tennis Racket," a 2016 investigative data news story by BuzzFeed in collaboration with the BBC, released the raw data, algorithmic procedures, and analysis process, detailing the story's data acquisition, data preparation, tournament exclusions, odds variation calculations, player selection, simulations, and significance tests, which allowed the public to validate the findings. In "What is at stake at the Paris climate change conference?", published by the Financial Times, users can use the Climate Change Calculator to "control" carbon emissions in different countries and gain insight into global warming trends. The article also provides information on the design and operation of the Calculator so that users can understand how it works. Because the media are honest about the limitations in the design and application of their algorithms, proactive algorithmic transparency not only avoids certain risks, such as the media having to bear the consequences of wrong predictions or biased conclusions; it also helps build trust between the media and users. As the application of algorithms in news production becomes more common, proactive algorithmic transparency will become more common in the future, and algorithms will be improved through continuous open-source sharing and modification. Algorithmic opacity is a common social phenomenon; forcing certain algorithms to be transparent requires legal constraints. Passive algorithmic transparency means publishing all or part of an algorithm as required by law and in accordance with legal procedures. If users suspect or discover that a proprietary algorithm of public interest is racially discriminatory or misleading, they can request that the media disclose information about the algorithm to protect the public's "right to know." As algorithms have not been used in news production for very long, most countries still lack laws and regulations to monitor and audit them. The EU was an early adopter of passive algorithmic transparency. In the General Data Protection

Regulation (GDPR), which came into force in May 2018, the EU gave users "the right to explanation" when they are subject to a conclusion based on an algorithm. However, the Regulation did not actually guarantee the accountability and transparency of algorithms (Sample, 2017). In the era of algorithms, formulating legal provisions for algorithms in different fields and for different purposes so as to achieve effective supervision is a new issue for legislation in every country. From a technical perspective, complete algorithmic transparency is difficult to achieve. Many algorithms are "black boxes," and some are difficult to understand even for their designers. A practical approach, therefore, is to set an appropriate threshold for passive algorithmic transparency, i.e., "meaningful transparency," a lower standard under which stakeholders can intervene in, use, and implement algorithms to ensure they operate responsibly in news production (Brauneis & Goodman, 2017). However, the specific requirements for transparency vary from stakeholder to stakeholder, so "meaningful transparency" is a relative concept that requires algorithm-specific analysis. As a result, it is difficult to set a completely uniform standard for "meaningful transparency"; it requires full discussion among legislators, the relevant industries, the public, and technical personnel. Media organizations need to formulate policies to protect the public's right to know about news algorithms that involve the public interest, or let industry associations issue relevant guidelines. Because of the professionalism and complexity of algorithms, it is difficult for the general public to monitor them effectively. A more feasible approach is to let a trusted third-party verification agency check and evaluate news algorithms that are of public interest or controversial to see whether they are transparent and fair, which would also allay the concerns of algorithm owners about the leakage of trade secrets. However, how to form and empower third-party agencies to verify algorithms is a new issue of social governance in the algorithm era. According to Nicholas Diakopoulos (2015), algorithmic transparency consists of five specific aspects: (1) the criteria used to prioritize, rank, emphasize, or editorialize things in the algorithm, including their definitions, operationalizations, and possibly even alternatives; (2) what data act as inputs to the algorithm – what it pays attention to, and what other parameters are used to initiate the algorithm; (3) the accuracy, including the false positive and false negative rates of errors made in classification (with respect to some agreed-upon ground truth), including the rationale for how the balance point is set between those errors; (4) descriptions of training data and its potential bias, including the evolution and dynamics of the algorithm as it learns from data; and (5) the definitions, operationalizations, or thresholds used by similarity or classification algorithms. Of course, not all algorithms need the same degree of transparency (Diakopoulos, 2016).
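Aspect (3) above, disclosing error rates against an agreed ground truth, is straightforward to compute and publish. A minimal sketch, with invented labels standing in for a real evaluation set:

# 1 = actually suspicious (agreed ground truth); labels are illustrative only.
ground_truth = [1, 0, 1, 1, 0, 0, 1, 0]
predicted =    [1, 0, 0, 1, 1, 0, 1, 0]   # what the classifier flagged

# False positives: flagged but innocent; false negatives: missed cases.
fp = sum(1 for t, p in zip(ground_truth, predicted) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(ground_truth, predicted) if t == 1 and p == 0)

print(f"false positive rate: {fp / ground_truth.count(0):.2f}")  # 0.25
print(f"false negative rate: {fn / ground_truth.count(1):.2f}")  # 0.25

Disclosing both rates, and the rationale for how the balance between them was set, tells audiences how often the algorithm wrongly accuses and how often it misses.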
BuzzFeed published the raw data, algorithmic procedures, and analysis process of "The Tennis Racket," its collaboration with the BBC, on GitHub (Hirst, 2016). Although BuzzFeed stated in advance that the results of the analysis were not evidence that any player had fixed matches, and anonymized the data, those

familiar with the field could easily work out who the suspected match-fixers were from the full data and analysis process BuzzFeed provided. Transparency in this case was somewhat "excessive"; there should be limits to transparency.

Data Visualization Transparency

The visual aspect of news transparency has received less attention (Gynnild, 2014), but the effect of visual communication is powerful. Under the influence of "seeing is believing," people tend to overlook the harm caused by opaqueness in the visualization of data findings. Combining the humanistic and the scientific, data visualization is designed to convey professional content visually, which inevitably involves a trade-off between professionalism and accessibility. Transparency in data visualization means linking news sources in the visualization as much as possible, fully explaining any caveats related to the data or to the chart itself, and "informing the wider public of the danger signs to watch out for before spreading what may turn out to be misinformation" (Burn-Murdoch, 2013). For example, if a country is too small to be shown on a map, the designer may intentionally enlarge it or simply remove it, and the reader then needs an explanation of why this was done. If a data visualization designer uses an unconventional design, the audience needs to be reminded so that they do not read it in the conventional way. For example, Reuters' "Gun Deaths in Florida," mentioned earlier, uses an unconventional design, and the reporter should alert the audience to this in the story.
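The Reuters case suggests one concrete form such a reminder can take. Below is a minimal sketch (using matplotlib, with invented figures rather than the real Florida data) of how an inverted y-axis reverses the reading of the same series, and how the inversion can be flagged for readers:

import matplotlib.pyplot as plt

years = [2003, 2005, 2007, 2009, 2011]   # illustrative values only
deaths = [550, 520, 700, 740, 800]       # not the real Florida figures

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3), sharex=True)

ax1.plot(years, deaths)
ax1.set_title("Conventional axis: the rise reads as a rise")

ax2.plot(years, deaths)
ax2.invert_yaxis()                       # the Reuters-style inversion
ax2.set_title("Inverted axis: the same rise reads as a fall")
ax2.annotate("Note: y-axis is inverted", # the disclosure transparency calls for
             xy=(0.02, 0.02), xycoords="axes fraction")

plt.tight_layout()
plt.show()

The data never change; only the axis direction does, which is why an explicit annotation, not the chart alone, carries the burden of transparency.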

Producer Transparency

Producer transparency in data journalism refers primarily to transparency about the identity and contact information of journalists. In earlier news production, journalists were unknown and mysterious to audiences, often known only by name and never seen in person (this is also true of TV news unless the journalist appears on camera). Producer transparency requires the publication of journalists' identity information, including the departments they work for and the areas they cover, as well as the posting of their photos on websites for public scrutiny. In the past, communication between readers and journalists was not always open; audience queries and complaints about certain stories were often handled by the editorial team and might be ignored. When journalists publish their contact information, especially their social media accounts, the channels through which audiences can monitor journalists become open. Whereas communication with journalists via email and phone is a closed interactive relationship, communication via social media is a completely open one. On media sites, the media can decide whether to make certain audience comments public; on social media, by contrast, audience comments are not under the media's control as long as they do not violate legal requirements. Therefore, publishing journalists' social media accounts does not merely disclose contact channels; it also places journalists under the supervision of society and audiences.

We believe that the personal profiles and contact information of journalists should be made public unless they specialize in undercover inquiries and investigations. Information disclosure allows audiences to get close to journalists and allows journalists to listen to their audiences.

Other Transparency Matters

Transparency in audience evaluation means making doubts and criticisms raised by others publicly available. Some media websites have set up comment sections where audiences can monitor the quality of data news stories by leaving comments. "Organizations that practice data-driven journalism (to the extent this is different from other flavors of journalism) should invite and provide empirical critiques of their analyses and findings" (Howard, 2014). Update transparency refers to the description of specific updates to a data news piece. Some database-style data news pieces are updated continuously, and the audience needs to be informed of the updates and their dates once they are made. Correction transparency means that when a data journalism piece is significantly flawed, the media must explain the reasons for the error and apologize to the audience. The media also need to make transparent other details of data journalism production that should be explained to the audience, such as the motivation for selecting the topic, the problems encountered in reporting, and how they were solved. For example, ProPublica begins "Revealed: The NSA's Secret Campaign to Crack, Undermine Internet Security" with a link to "Why We Published the Decryption Story" to explain to its audience the intent behind the story. Some data news stories are produced collaboratively or receive outside financial sponsorship, which again needs to be communicated to the audience. ProPublica has published articles on why it partnered with The Guardian and The New York Times to cover the Edward Snowden story. The Guardian's "From rainforest to your cupboard: the real story of palm oil" clearly identifies its partners on the page so that audiences are aware of the stakeholders' involvement.

Principles for the Use of Personal Data in Data Journalism

With the development of data journalism, it is becoming commonplace for data journalists to make secondary use of personal data, and some data news pieces are themselves personal databases of specific groups. In The Washington Post's Pulitzer Prize-winning "Police shootings database 2015–2022," users can view information about victims' ages, genders, races, and whether they suffered from mental illness. The Los Angeles Times' "The Homicide Report," a database of homicide victims in Los Angeles, allows users to search for reports by entering addresses or victims' names. "Swiss Leaks," winner of the Data Journalism Awards 2015, contains the accounts of about 100,000 customers in 203 countries. "Publishing data is news," according to Matt Stiles and Niran Babalola of The Texas Tribune (Batsell, 2016). In the information society, the legitimacy of the

collection and use of personal data by the information industry has been generally recognized by legislation and society (Zhang X., 2015). Especially in the era of big data, using personal data to create new social and commercial value has become a social consensus. However, the secondary use of personal data raises legal and ethical issues, and how data journalism can protect personal data security has become an urgent question.

Misuse of Personal Data in Data Journalism

The misuse of personal data in data journalism falls into two categories. The first is "invisible" misuse, such as the collection of personal data without consent, which is rarely known to outsiders. The second is "visible" misuse, such as violations of personal privacy and excessive mining of personal data, which causes obvious harm.

Personal Data Collection Without Consent

The collection of personal data requires the consent of the data subject, of which there are two types: proactive consent, in which the data subject expressly consents, and passive consent, in which consent is deemed to be given as long as the data subject does not expressly object, and refused if the data subject objects (Guo, 2012). Whether data collection rests on proactive or passive consent depends on the circumstances. If an organization publishes data on its own website, the data automatically becomes public information (Shiab, 2015), and data journalists can access it through various technical means (unless the data owner requests otherwise); in this case passive consent applies. However, if sensitive personal information is involved, the proactive consent of the data subject is required. As the use of artificial intelligence in news production has become widespread, some data journalists have resorted to web crawling to obtain personal data, sometimes even in violation of legal requirements. According to Cédric Sam, who works for the South China Morning Post, bots bear as much responsibility as their human creators, and respect for the law is the crucial line between web scraping and hacking (Shiab, 2015). For crawling web data, Victoria Baranetsky, an expert at Columbia University's Tow Center, offers the following suggestions. First, reporters should always pay attention to the terms of service of the website they are accessing to understand whether the company prohibits scraping, even if its data is accessible; if they cannot determine what those terms say, they should consult an attorney. Second, if the terms of use prohibit data scraping, reporters should first seek alternative sources for the information. Third, if the information exists only on a company's site, it can be a good idea to contact the company and see whether it will simply turn over the information. Fourth, whether or not journalists contact the company, they should always diminish the risk of injury to the company or other

individuals through whatever methods possible; for example, journalists should be careful not to build any tool that might overwhelm a company's servers, and if any sensitive data with privacy implications are collected, they should be careful to redact them and apply ethical journalistic standards. Fifth, journalists should be good citizens and ensure that any data collection is done in the public interest (Baranetsky, 2018).
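Two of these courtesies, honoring a site's stated crawling rules and not overwhelming its servers, can be built directly into a scraper. A minimal sketch using Python's standard library; the base URL and paths are hypothetical:

import time
import urllib.request
import urllib.robotparser

BASE = "https://example.gov"               # hypothetical data source
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")           # read the site's crawling rules
rp.read()

pages = [f"{BASE}/records?page={i}" for i in range(1, 6)]
for url in pages:
    if not rp.can_fetch("*", url):         # the site disallows this path
        print(f"skipping disallowed URL: {url}")
        continue
    with urllib.request.urlopen(url) as resp:
        data = resp.read()                 # process/store as needed
    time.sleep(2)                          # polite delay between requests

Checking robots.txt does not settle the legal questions Baranetsky raises about terms of service, but it is the baseline technical expression of the respect for the site that the guidance describes.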

Personal Privacy Violation

Privacy violations in the use of personal data are a real problem facing data journalism, where conflicts arise between privacy and freedom of expression, and between personal data protection and reporting based on personal data (Voorhoof, 2015). It is important to note that some data news reports that obtain personal data under the Freedom of Information Act (FOIA) also violate individual privacy. In December 2012, The Journal News used personal data obtained under FOIA to publish a map of all legal gun permits in Westchester and Rockland counties, New York State, including the names and home addresses of all legal gun owners in the two counties. Janet Hasson, president of The Journal News, argued that New Yorkers have a right to all public information about guns and that one of the roles of journalists is to publish public information in a timely manner, even if it has not previously received widespread attention (Weiner, 2012). The public, however, considered personal gun ownership information to be private (Weiner, 2012). Faced with conflicting opinions from the media and the public, the New York State legislature passed a bill allowing gun registrants to opt out of public records on privacy grounds (Beaujon, 2013; McBride & Rosenstiel, 2013, p. 186), indicating that the legislature believed detailed personal information about gun ownership should not be released in its entirety. Personal privacy violations are also likely to occur in the processing of "leaked" data obtained by the media. When WikiLeaks released data from the U.S. Department of Defense and Department of State to multiple news organizations in 2010 and 2011, every media organization had to decide not only whether to publish it but how, balancing the redaction of the names of people who might be put at risk against the public's right to know what had been done by their government (Howard, 2013).

Excessive Mining of Personal Data

There are two types of excessive mining of personal data in data journalism: over-extended use and over-display. Over-extended use refers to the use of personal data by a data reporter for purposes other than those initially stated, which may violate individuals' rights. Over-display refers to the display of personal data with little public interest value, essentially "display for display's sake." The display of some personal data can even be misleading. Jack Shafer, a commentator for Reuters, believes that much of this personal information adds "nothing to the story and puts some people

unnecessarily at risk" (Howard, 2013); "just because content is publicly accessible does not mean that it was meant to be consumed by just anyone" (Boyd & Crawford, 2012). The Tampa Bay Times released a personal database, "Mug Shot," in which readers can scroll through a timeline of mug shots from the previous 60 days, search by last name or ZIP code, and filter detailed demographic statistics about arrests. However, an arrested person has not necessarily committed a crime and may be acquitted; while the paper explains this and states that no one can be convicted on the basis of the information it publishes, the database still tends to discredit those who are acquitted. The Los Angeles Times' celebrity mug shot galleries draw large crowds of viewers but shy away from showing private citizens. Nora Paul, a scholar at the University of Minnesota, criticized the Tampa Bay Times' "Mug Shot" as "journalistic malpractice": journalism should be about putting important events in a community into context, not this (Milian, 2009).

The Principle of Informed Consent

The principle of informed consent means that the personal data collector is required to inform the data subject of important matters concerning the collection and processing of personal information and to obtain the data subject's express or implied consent before collection (Qiu & Huang, 2014). According to the EU's General Data Protection Regulation, the "consent" of the data subject means any freely given, specific, informed, and unambiguous indication of the data subject's wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her (Intersoft Consulting, 2013). The public may use a single opt-in or opt-out confirmation, depending on the context, or may take a "broad consent" approach, i.e., consent to the use of personal data for a particular category of purposes rather than a particular situation (Qiu & Huang, 2014). "Consent is a fundamental concept in the U.S. approach to data privacy, as it reflects principles of individual autonomy, freedom of choice, and rationality" (Froomkin, 2019). The "Information Security Technology Guidelines for Personal Information Protection Within Public and Commercial Services Information Systems," issued by China in 2012, stipulate the principle of "individual consent," meaning that the consent of the subject of the personal information must be obtained before personal information is handled (Zhang M., 2015). The principle of individual consent applies to data capable of identifying a specific person; consent is not required for the use of data that cannot identify an individual (Xu, 2017). For example, the EU's General Data Protection Regulation does not protect anonymous personal data. The principle of informed consent is widely used in personal information privacy protection systems. In the era of big data, however, the informed consent of everyone is difficult to realize fully because of the high costs involved (Xu, 2017). The German Federal Data Protection Act provides that personal data shall in principle be collected from the data subjects, but may be collected in other ways if it would be

unreasonably costly to collect the data from them and if doing so does not infringe on their significant legitimate interests (Shao, 2017, p. 197). Therefore, in addition to the principle of informed consent, other principles are needed to safeguard the personal information rights of data subjects.

The Principles of Legality and Proportionality

The protection of personal information is not only about establishing a right for individuals but also about building a legal framework that balances the interests of data subjects, information users, and society (Hao, 2016). The use of personal data must be premised on its protection. Data journalists must therefore follow the principles of legality and proportionality in the use of personal data: they need to collect data within the scope of the law, and data collection outside the law must be undertaken only with the consent of the individuals concerned. In addition, the collected data may not be used for purposes other than those originally stated without the consent of the person concerned (Huang, 2013). For example, when a data journalist scrapes personal data from a web page, he or she should not pry into protected data: "If a regular user can't access it, journalists shouldn't try to get it" (Shiab, 2015). In data science there is the concept of the threshold value, the amount of data needed to solve a problem. Problems related to threshold values are also known as predictive data analysis problems, i.e., what level of satisfaction can be achieved in solving the problem when the amount of data reaches a certain size (Li & Cheng, 2012). The collection and use of personal data in data journalism need to be driven by the issue at hand, not merely by the pursuit of large quantities, and the appropriate scope of the data to be scraped must be determined. Otherwise, inappropriate and excessive mining of citizens' personal information by the media may lead to excessive "display" of the data subjects' privacy, harming them and causing adverse social effects (Zhang, 2016). In addition, a distinction should be made between the protection of sensitive private personal information and the use of general personal information, thus achieving a balance of interests (Zhang X., 2015). What counts as sensitive personal information depends on the laws of different countries and regions. For example, according to the EU's General Data Protection Regulation, the processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, or data concerning a natural person's sex life or sexual orientation shall be prohibited (Intersoft Consulting, 2013). When providing data to the media, data holders need to indicate the existence of privacy-sensitive information, anonymize or delete such information, and classify the privacy level of the data so that the media can determine how to use it.

The Principles of Public Interest Priority and Minimizing Harm

As a new paradigm in journalism, data journalism must assume social responsibility and serve the public interest, but data journalists cannot arbitrarily violate the public's personal data rights in the name of the public interest. An important yet vague legal concept, the public interest is not simply the interest of everyone, nor is it the same as the common interest shared by the majority; it is based on a choice of values and is characteristic of its historical stage (Hu, 2008). Generally speaking, the public interest is a major interest enjoyed by the majority of people, and it is holistic, multi-level, and developmental in nature (Xiao, 2009). When personal privacy and the public interest are in conflict, the public interest often takes precedence (Shao, 2017, p. 173). For example, in Europe and the United States, some sensitive personal information is not treated as private once the public interest is involved (Pan, 2015, p. 57). Accompanying public interest priority is the principle of minimizing harm. As discussed earlier, the public interest may justify overriding an individual's right to privacy (Liu, 2006). The media may release personal data in the public interest but should do so in a way that minimizes the harm caused to individuals. "Journalists must weigh the benefits of open data against the risks of personal harm that may come with publication" (Lewis & Westlund, 2015). The personal data to be released must be closely related to the public interest; if personal data, although related to the public interest, may instead harm it after release, the data should not be released. "Minimize Harm" is clearly defined in the SPJ Code of Ethics: journalists should show compassion for those who may be affected by news coverage; use heightened sensitivity when dealing with juveniles, victims of sex crimes, and sources or subjects who are inexperienced or unable to give consent; and consider cultural differences in approach and treatment. In 2012, data journalist Nils Mulvad finally obtained the veterinary prescription data he had been fighting for for seven years, but he decided not to publish it when he realized that it was full of errors, because "publishing it directly would have meant identifying specific doctors for doing bad things which they hadn't done" (Bradshaw, 2013). In this case, Nils Mulvad was guided by the principle of minimizing harm. In 2010, The Guardian published only information on those who had died in its "Afghanistan War Logs," because journalists were unsure whether the database contained important information involving informants (Gray et al., 2012, p. 81). When reusing personal data, journalists can achieve "anonymization" by removing the identifiable elements of personal information through coding or encryption (Zhang X., 2015). For example, after an individual's identifying information (name, address, credit card number, date of birth, etc.) is erased from a database, the remaining data is available for use and sharing (Zhang M., 2015).
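A minimal sketch of this kind of pseudonymization, with an invented record and field names: direct identifiers are replaced with a salted hash, and only the analytically useful fields are kept.

import hashlib

record = {
    "name": "Jane Doe",                 # hypothetical data subject
    "address": "12 Example Street",
    "date_of_birth": "1980-01-01",
    "hospital_visits": 3,               # the analytically useful field
}

SALT = "per-project-secret"             # kept secret; defeats dictionary lookups

def pseudonymize(rec):
    """Replace direct identifiers with a salted hash; keep analytic fields."""
    key = hashlib.sha256((SALT + rec["name"]).encode()).hexdigest()[:12]
    return {"subject_id": key, "hospital_visits": rec["hospital_visits"]}

print(pseudonymize(record))  # e.g., {'subject_id': '…', 'hospital_visits': 3}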

It should be noted that in the past, anonymization could prevent the identification of data subjects; in the era of big data, however, anonymizing a single piece of information is not sufficient, and data subjects may still be identified by combining multiple anonymized pieces of personal information. Data journalists need to evaluate comprehensively the potential risks of anonymous personal data and take corresponding measures to protect the legitimate rights and interests of data subjects. When personal data becomes a ubiquitous resource, data journalists inevitably transform it into commercial and social value. How to use personal data legally and effectively, achieving a win-win of personal data protection and utilization, has become a new issue in the era of big data. It is not only a legal issue but also an ethical one. Without legal and professional ethical regulation, the more data journalism develops, the more potential risks it brings to society. It is therefore necessary for data journalism academia and industry to discuss the use of personal data fully and reach a consensus.

References

Allen, D. S. (2008). The trouble with transparency. Journalism Studies, 9(3), 323–340. https://doi.org/10.1080/14616700801997224
Baranetsky, D. V. (Ed.). (2018). Data journalism and the law. Columbia Journalism Review. www.cjr.org/tow_center_reports/data-journalism-and-the-law.php
Batsell, J. (Ed.). (2016). For online publications, data is news. Nieman Reports. http://niemanreports.org/articles/for-online-publications-data-is-news/
Beaujon, A. (Ed.). (2013). N.Y.'s tough new gun law also prohibits disclosure of gun owners' names. Poynter. www.poynter.org/2013/n-y-s-tough-new-gun-law-also-prohibits-disclosure-of-gun-owners-names/200714/
Bell, E. (Ed.). (2012). Journalism by numbers. Columbia Journalism Review. www.cjr.org/cover_story/journalism_by_numbers.php
Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662–679. https://doi.org/10.1080/1369118X.2012.678878
Boyles, J. L., & Meyer, E. (2016). Letting the data speak. Digital Journalism, 4(7), 944–954. https://doi.org/10.1080/21670811.2016.1166063
Bradshaw, P. (Ed.). (2013). Ethics in data journalism: Accuracy. Online Journalism Blog. https://onlinejournalismblog.com/2013/09/13/ethics-in-data-journalism-accuracy/
Brauneis, R., & Goodman, E. P. (Eds.). (2017). Algorithmic transparency for the smart city. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3012499
Burn-Murdoch, J. (Ed.). (2013). Why you should never trust a data visualisation. The Guardian. www.theguardian.com/news/datablog/2013/jul/24/why-you-should-never-trust-a-data-visualisation
Chang, J., & Yang, Q. (2014). Data journalism: Ideas, methods and influences. Journalism and Mass Communication Monthly (12), 10–18. (Published in China.)
Ciobanu, M. (Ed.). (2016). Advice from FT and WSJ for getting started with interactive graphics. Journalism. www.journalism.co.uk/news/advice-from-the-financial-times-and-the-wall-street-journal-for-getting-started-with-interactive-graphics/s2/a677894/

Deuze, M. (2005). What is journalism? Professional identity and ideology of journalists reconsidered. Journalism, 6(4), 442–464. https://doi.org/10.1177/1464884905056815
Diakopoulos, N. (Ed.). (2013). Rhetoric of data. Nick Diakopoulos. www.nickdiakopoulos.com/2013/07/25/the-rhetoric-of-data/
Diakopoulos, N. (Ed.). (2014). Algorithmic accountability reporting: On the investigation of black boxes. Columbia Journalism Review. www.cjr.org/tow_center_reports/algorithmic_accountability_on_the_investigation_of_black_boxes.php
Diakopoulos, N. (2015). Algorithmic accountability. Digital Journalism, 3(3), 398–415. https://doi.org/10.7916/D8ZK5TW2
Diakopoulos, N. (Ed.). (2016). BuzzFeed's pro tennis investigation displays ethical dilemmas of data journalism. Columbia Journalism Review. www.cjr.org/tow_center/transparency_algorithms_buzzfeed.php
Engel, P. (Ed.). (2014). This chart shows an alarming rise in Florida gun deaths after 'Stand Your Ground' was enacted. Insider. www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2
Fang, J. (2015). Introduction to data journalism. China Renmin University Press. (Published in China.)
Fang, J., & Gao, L. (2015). Data journalism – An area in urgent need of regulation. Chinese Journal of Journalism & Communication, 37(12), 105–124. (Published in China.)
Fang, K. (Ed.). (2016). Voting day for the US election is here! Please take this chart when watching. News Lab. https://newslab2020.github.io/Collection/媒体政治/%5B新闻实验室%5D%20-%202016-11-08%20美国大选投票日来了!围观时请拿好这张图.html (Published in China.)
Froomkin, A. M. (Ed.). (2019). Big data: Destroyer of informed consent. Yale Journal of Health Policy, Law, and Ethics. https://yjolt.org/sites/default/files/21_yale_j.l._tech._special_issue_27.pdf
Gieryn, T. F. (1983). Boundary-work and the demarcation of science from non-science: Strains and interests in professional ideologies of scientists. American Sociological Review, 48(6), 781–795. https://doi.org/10.2307/2095325
Gourarie, C. (Ed.). (2016). Investigating the algorithms that govern our lives. Columbia Journalism Review. www.cjr.org/innovations/investigating_algorithms.php
Gray, J., Chambers, L., & Bounegru, L. (2012). The data journalism handbook. O'Reilly Media.
Grimmelikhuijsen, S. G. (2010). Transparency of public decision-making: Towards trust in local government? Policy & Internet, 2(1), 5–35. https://doi.org/10.2202/1944-2866.1024
Groenhart, H. P., & Bardoel, J. L. H. (2012). Conceiving the transparency of journalism: Moving towards a new media accountability currency. Studies in Communication Sciences, 12(1), 6–11. https://doi.org/10.1016/j.scoms.2012.06.003
Guo, M. (2012). On commercialization of personal data. Legal Forum, 27(6), 108–114. (Published in China.)
Gynnild, A. (2014). Surveillance videos and visual transparency in journalism. Journalism Studies, 15(4), 449–463. https://doi.org/10.1080/1461670X.2013.831230
Hackett, R., & Zhang, Y. (2010). Sustaining democracy? Journalism and the politics of objectivity (H. Shen & Y. Zhou, Trans.). Tsinghua University Press. (Published in China.)
Hao, S. (2016). Path selection of personal information protection in big data era. Journal of Beijing University of Posts and Telecommunications (Social Sciences Edition), 18(5), 1–20. (Published in China.)
Harris, J. (Ed.). (2014). Distrust your data. Source. https://source.opennews.org/en-US/learning/distrust-your-data/

Ethics of Data Journalism Production 169 Hirst, T. (Ed.). (2016). The rise of transparent data journalism-the BuzzFeed tennis match fixing data analysis notebook. https://blog.ouseful.info/2016/01/18/the-rise-of-transpar ent-data-journalism-the-buzzfeed-tennis-match-fixing-data-analysis-notebook/ Hong, S., Zhuang, Y., & Li, K. (2014). Data mining technology and engineering practice. China Machine Press. (Published in China.) Howard, A. B. (Ed.). (2013). On the ethics of data-driven journalism: Of fact, friction and public records in a more transparent age. Tow Center. https://medium.com/tow-center/ on-the-ethics-of-data-driven-journalism-of-fact-friction-and-public-records-in-a-moretransparent-a063806e0ee3 Howard, A. B. (Ed.). (2014). The art and science of data-driven journalism. The Tow Center. http://towcenter.org/wp-content/uploads/2014/05/Tow-Center-Data-Driven-Journalism.pdf Hu, H. (2008). On the legal definition of public interest: From the way of element explanation. Journal of Hebei Normal University (Philosophy and Social Sciences Edition) (4), 56–67. (Published in China.) Hu, C. (Ed.). (2017). 13 ways to think about data-driven decision making. Sohu. www.sohu. com/a/204942623_236505 (Published in China.) Hu, J., & Wu, Y. (2010). The illusion of objectivity of news and the origin of mass communication research. Contemporary Communication (2), 14–17. (Published in China.) Huang, X. (2013). On information ethic conflicts in the process of data mining. Journal of Changsha University, 27(4), 41–43. (Published in China.) Intersoft Consulting (Ed.). (2013). Processing of special categories of personal data. https:// gdpr-info.eu/art-9-gdpr/ James, W. (2012). Pragmatism: A new name for some old ways of thinking (B. Li, Trans.). The Commercial Press. (Published in China.) Karlsson, M. (2010). Rituals of transparency. Journalism Studies, 11(4), 535–545. https:// doi.org/10.1080/14616701003638400 Kayser-bril, N. (Ed.). (2016). Data-driven journalism in the post-truth public sphere. Nicolas Kayser-Bril. https://blog.nkb.fr/datajournalism-in-the-posth-truth-public-sphere/#foot_14 Koliska, M. (Ed.). (2015). Towards editorial transparency in computational journalism. DRUM. https://drum.lib.umd.edu/handle/1903/17031 Kovach, B., & Rosenstiel, T. (2014). The elements of journalism (H. L. Liu & X. D. Lian, Trans.). China Renmin University Press. (Published in China.) Kuru, O. (Ed.). (2016). What the failure of election predictions could mean for media trust and data journalism. Sections Mediashift. http://mediashift.org/2016/11/136541/ Lewis, S. C., & Westlund, O. (2015). Big data and journalism: Epistemology, expertise, economics, and ethics. Digital Journalism, 3(3), 447–466. https://doi.org/10.1080/2167 0811.2014.976418 Li, G., & Cheng, X. (2012). Research status and scientific thinking of big data. Bulletin of Chinese Academy of Sciences, 27(6), 647–657. (Published in China.) Li, Y., & Chen, P. (2016). “Commercialism” dominating and “professionalism” leaving: The discursive formation of journalism transition in China under digital condition. Chinese Journal of Journalism & Communication, 38(9), 135–153. (Published in China.) Lin, S., Fortuna, J., & Kulkarni, C. (2013). Selecting semantically-resonant colors for data visualization. Computer Graphics Forum, 32(3), 401–410. https://doi.org/10.1111/ cgf.12127 Lin, S., & Heer, J. (Eds.). (2014). The right colors make data easier to read. Harvard Business Review. https://hbr.org/2014/04/the-right-colors-make-data-easier-to-read Liu, J. (2015). 
The contemporary western theories of journalism. China Renmin University Press. (Published in China.)

170  Ethics of Data Journalism Production Liu, L. (2006). The interpretive dilemma of “public interest” and its breakthrough. Literature, History, and Philosophy, 2006(2), 160–166. (Published in China.) Lu, Y., & Zhou, R. (2016). Liquid journalism: Reconsidering new practices of communication and journalistic professionalism a case study on coverage of the “criental star” accident by the paper. Journalism & Communication, 23(7), 24–46. (Published in China.) Luo, G., & Liu, X. (2000). Cultural studies reader. China Social Sciences Press. (Published in China.) Lv, G. (2009). Algorithm design and analysis. Tsinghua University Press. (Published in China.) Maeyer, J. D. (Ed.). (2015). Objectivity, revisited. Nieman Lab. www.niemanlab.org/2015/12/ objectivity-revisited/ McBride, K., & Rosenstiel, T. (2013). The new ethics of journalism: Principles for the 21st Century. CQ Press. https://doi.org/10.1057/9781137317 McBride, R. (Ed.). (2016). Giving data soul: Best practices for ethical data journalism. Data Journalism. https://datajournalism.com/read/longreads/giving-data-soul-best-practicesfor-ethical-data-journalism?utm_source=sendinblue&utm_campaign=Conversations_ with_Data_May_Ethical_Dilemmas&utm_medium=emai Milian, M. (Ed.). (2009). Tampa Bay mug shot site draws ethical questions. Los Angeles Times. http://latimesblogs.latimes.com/technology/2009/04/mugshots.html NPR (Ed.). (2014). Transparency. NPR. http://ethics.npr.org/category/g-transparency/ Pan, K. (2015). Challenges and reflections on privacy protection in the era of big data. Science and Technology of China Press. (Published in China.) Peirce, C. S. (2007). The essentials of pragmatism (Y. Tian, Trans.). In J. Dewey, et al. (Eds.), Pragmatism. World Affairs Press. (Published in China.) Plaisance, P. L. (2007). Transparency: An assessment of the Kantian roots of a key element in media ethics practice. Journal of Mass Media Ethics, 22(23), 187–207. https://doi. org/10.1080/08900520701315855 Porway, J. (Ed.). (2016). The trials and tribulations of data visualization for good. Markets for Good. https://marketsforgood.org/the-trials-and-tribulations-of-data-visualizationfor-good/ Powell, A. (Ed.). (2016). Algorithms, accountability, and political emotion. Data Journalism. http://datadrivenjournalism.net/news_and_analysis/algorithms_accountability_and_ political_emotion Qiu, R., & Huang, W. (2014). Ethical issues in big data technology. Science and Society, 4(1), 36–48. (Published in China.) Rainie, L., & Anderson, J. (Eds.). (2017). Code-dependent: Pros and cons of the algorithm age. Pew Research Center. www.pewinternet.org/2017/02/08/code-dependent-pros-andcons-of-the-algorithm-age/ Rogers, S. (2015). Facts are sacred: The power of data (Y. Yue, Trans.). China Renmin University Press. (Published in China.) Rorty, R. (1992). Post-philosophy culture. Shanghai Translation Publishing House. (Published in China.) Rosen, J. (Ed.). (2017). Show your work: The new terms for trust in journalism. Press Think. http://pressthink.org/2017/12/show-work-new-terms-trust-journalism/ Rupar, V. (2006). How did you find that out? transparency of the newsgathering process and the meaning of news. Journalism Studies, 7(1), 127–143. https://doi.org/10.1080/ 14616700500450426 Sample, I. (Ed.). (2017). AI watchdog needed to regulate automated decision-making, say experts. The Guardian. www.theguardian.com/technology/2017/jan/27/ai-artificial-inte lligence-watchdog-needed-to-prevent-discriminatory-automated-decisions

Ethics of Data Journalism Production 171 Shao, G. (2017). Introduction to network communication law. China Renmin University Press. (In Chinese). Shiab, N. (Ed.). (2015). On the ethics of web scraping and data journalism. Institute for Nonprofit News. http://gijn.org/2015/08/12/on-the-ethics-of-web-scraping-and-data-journalism/ SPJ (Ed.). (2014). SPJ code of ethics. The Society of Professional Journalists. www.spj.org/ ethicscode.asp Stark, J. A., & Diakopoulos, N. (Eds.). (2016). Towards editorial transparency in computational journalism. Microsoft Word. https://journalism.stanford.edu/cj2016/files/ Towards%20Editorial%20Transparency%20in%20Computational%20Journalism.pdf Stoneman, J. (Ed.). (2015). Does open data need journalism? Oxford University Research Achive. https://ora.ox.ac.uk/objects/uuid:c22432ea-3ddc-40ad-a72b-ee9566d22b97 Sun, Y. (2014). Guard against the infringement of news concept by technical logic: Analysis of journalism under the influence of Big Data. Southeast Communication (2), 18–20. (Published in China.) Tesfaye, S. (Ed.). (2016). Data journalism didn’t fail: Nate Silver pushes back after The New York Times blasts him for getting Donald Trump so wrong. Salon. www.salon. com/2016/05/05/data_journalism_didnt_fail_nate_silver_pushes_back_after_the_new_ york_times_blasts_him_for_getting_donald_trump_so_wrong/ The Trust Project (Ed.). (2014). We help over half a billion people easily assess the integrity of news worldwide. And we’re growing fast. The Trust Project. https://thetrustproject. org/#indicators Van den Hoven, P. (2016). Critical rhetoric as a theory of journalist transparency (Y. Yang, Trans.). Global Journal of Media Studies, 3(4), 83–96. (Published in China.) Voorhoof, D. (Ed.). (2015). ECtHR decision: Right of privacy vs. data journalism in Finland. ECPMF. https://ecpmf.eu/news/legal/archive/ecthr-decision-right-of-privacy-vsdata-journalism-in-finland Wang, H., & Hu, M. (2013). Algorithm design and analysis. Tsinghua University Press. (Published in China.) Wang, S. (Ed.). (2016). Why are polls missing voters in the 2016 presidential election? RFI. http://cn.rfi.fr/政治/20161114-2016美国总统大选,民调为何测不到选民的心? (Published in China.) Wang, X. (2017). “Post-truth” is essentially post-consensus. Exploration and Free Views (4), 14–16. (Published in China.) Ward, S. J. A. (2011). Ethics and the media: An introduction. Cambridge University Press. https://doi.org/10.1017/CBO9780511977800 Ward, S. J. A. (2015). The magical concept of transparency. In L. Zion & D. Craig (Eds.), Ethics for digital journalists: Emerging best practices. Routledge. Weiner, R. (Ed.). (2012). N.Y. newspaper’s gun-owner database draws criticism. USA Today. www.usatoday.com/story/news/nation/2012/12/26/gun-database-draws-criticism/ 1791507/ Xia, Q., & Wang, Y. (2014). From objectivity to transparency: The evolution history and logic of journalistic authority. Nanjing Journal of Social Sciences (7), 97–109. (Published in China.) Xiao, S. (2009). Chinese academic circles on the public interest of the main views and comments. Journal of Hebei Normal University (Philosophy and Social Sciences Edition), 22(6), 30–36. (Published in China.) Xiao, W. (2016). Epistemology of news frame. China Renmin University Press. (Published in China.) Xie, J. (2009). American media criticism. China Renmin University Press. (Published in China.)

172  Ethics of Data Journalism Production Xu, D. (2014). Big data strategy: Thinking revolution and bonus depression of individuals, enterprises and governments. New Century Press. (Published in China.) Xu, L. (2017). The dilemma and solution of the application of consent in personal information processing. Documentation, Information & Knowledge (1), 106–113. (Published in China.) Xu, Y. (2007). Analysis of the basic meaning of objectivity of news report. Journalism Research (4), 49–53. (Published in China.) Zeng, Q., Lu, J., & Wu, X. (2017). Data journalism: A journalistic argument for social science research. Journalism & Communication, 24(12), 79–91. (Published in China.) Zhang, C. (2018). Algorithms as intermediaries: Algorithmic bias and coping in news production. China Publishing Journal (1), 10–18. (Published in China.) Zhang, M. (2015). Risks and countermeasures of citizens’ personal information data in the era of big data. Information Studies: Theory & Application, 38(6), 57–61. (Published in China.) Zhang, T. (2016). Protection of privacy in datamining and data application in the eve of Big Data. Journal of Hebei Normal University (Philosophy and Social Sciences Edition), 39(5), 127–132. (Published in China.) Zhang, X. (2015). From privacy to personal information: Theory and institutional arrangements for benefit remeasurement, (Gui Enqiang, Trans.). China Legal Science (3), 38–59. (Published in China.) Zhou, R., & Liu, Y. (2017). “Post-truth” is essentially post-consensus. Shanghai Journalism Review (1), 36–44. (Published in China.) Zhu, Y. (2016). The data implications of data journalism. Youth Journalist, 24(24), 27–28. (Published in China.)

8

Big Data Journalism

With the advent of the era of big data, data journalism has expanded its object from structured small data to big data, which is mainly semi-structured and unstructured. Big data journalism is a form of journalism that uses data science methods to discover facts in big data and presents them through data visualization in the service of news value and the public interest. The report Big Data for Media, released by the Reuters Institute for the Study of Journalism, concluded that small data is measured in gigabytes or less, while big data is measured in terabytes and above (Stone, 2014); for journalism, then, terabytes of data can be called big data. With the development of artificial intelligence technology and the increasing abundance of big data resources, big data news production will become one of the future directions of data journalism.

Big data journalism differs from small data journalism on three points. First, in terms of data processing, small data news pieces mostly deal with structured small data, while big data news works mainly deal with semi-structured and unstructured large data sets. Second, in terms of product form, small data news works are mostly presented as text and less often as apps or databases, while big data news works mainly take the form of apps or databases. Third, in terms of function, small data news pieces mostly present information, while big data news pieces can additionally support decision making, prediction, and other functions.

Status of Big Data News Production

Big data news is still in its initial stage. In 2014, CCTV launched a data news series called "Data about Spring Festival" using big data provided by partners, becoming the first media outlet to develop data journalism in the name of "big data." In 2015, CCTV launched the "Using big data to tell the stories of a community of shared future" series, which mined a total of more than 100 million GB of data and produced more than 200 accurate 3D maps, elevating big data news production to a new level. The most significant feature of big data news is open production. For example, CCTV cooperates with professional institutions and companies in big data mining, analysis, and data visualization.

At present, CCTV's partners include Baidu, IZP, Tencent, Sina, Dangdang, Zhaopin.com, 360, China UnionPay, and others. CCTV's big data news pieces are mostly livelihood news and theme-based reports, whereas the big data news pieces of some well-known international media are mainly investigative reports, and these media fully or partially possess the ability to mine and analyze big data themselves.

In 2014, The Wall Street Journal's investigative story "Medicare Unmasked" captured 9.2 million pieces of publicly available government data and purchased billions of additional records from the Centers for Medicare & Medicaid Services (CMS). The journalists used linear regression, logistic regression, PCA (principal component analysis), the K-means algorithm, and multiple expectation-maximization algorithms to analyze the data (a simplified sketch of this kind of screening appears below). The investigation revealed the secrets of the $600 billion health care bill and in April of that year forced the U.S. government to disclose, for the first time, important health care data that had been kept secret since 1979. After the new data was made public, The Wall Street Journal continued its investigation and found that Americans pay at least $60 billion in bogus healthcare payments each year (GIJN Chinese, 2015).

In 2015, the International Consortium of Investigative Journalists (ICIJ) partnered with more than 140 journalists in 45 countries to launch the "Swiss Leaks" investigative series. The project is based on a trove of almost 60,000 leaked files that provide details on over 100,000 HSBC clients and their bank accounts. The ICIJ produced an interactive database organized around three entry points: "Countries," "People," and "Stories." Audiences can search country profiles of HSBC clients' deposits in Switzerland, profiles of more than 60 celebrity clients, deposit disclosures and responses, and more than 100 stories published by the ICIJ and partner media (Global Investigative Journalism Network, 2015).

The Financial Times's 2015 story "Hanergy: The 10-minute Trade" analyzed two years of trading data on Hanergy Thin Film stock – more than 800,000 individual trades on the Hong Kong Stock Exchange – and showed "that shares consistently surged late in the day, about 10 minutes before the exchange's close, from the start of 2013 until February this year" (see QR code 8.1). In 2016, the ICIJ worked with 100 news outlets around the world to produce a big data news report called "The Panama Papers," which analyzed 11.5 million leaked documents (2.6 terabytes of data) to uncover details of offshore accounts in more than 200 countries and territories.
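As a rough illustration of this kind of algorithmic screening – not the Journal's actual pipeline – the following sketch combines PCA and K-means, two of the methods named above, to surface unusual billing profiles in synthetic provider data. The feature names and numbers are invented for the example.

# Sketch: screening provider billing records with PCA and K-means,
# two of the methods the "Medicare Unmasked" team reportedly used.
# The features and synthetic numbers below are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features per provider: claims filed, average charge,
# and distinct patients billed in a year.
typical = rng.normal([200, 80, 150], [40, 15, 30], size=(500, 3))
unusual = rng.normal([950, 400, 60], [50, 40, 10], size=(5, 3))
X = np.vstack([typical, unusual])

# Standardize, reduce to two components, then cluster the profiles.
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X2)

# Tiny clusters far from the bulk of providers are the billing
# profiles a reporter would verify first.
sizes = np.bincount(km.labels_)
rare_clusters = np.where(sizes < 0.02 * len(X))[0]
flagged = np.where(np.isin(km.labels_, rare_clusters))[0]
print("Cluster sizes:", sizes)
print("Rows flagged for follow-up:", flagged)

In practice, flagged rows are leads rather than findings: as in "Medicare Unmasked," statistical outliers become a story only after reporters verify them through records and interviews.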

QR code 8.1  Hanergy: The 10-minute trade. www.ft.com/content/9e87ba44-d20e-11e4-a1a0-00144feab7de

The Social Significance of Big Data Journalism

All industries are currently embracing big data, whose value has gradually emerged in fields such as healthcare and business. With a mission to provide truthful, credible, and high-quality information, journalism has a responsibility to use big data to improve the quality of news once big data becomes a universal resource. The rise of big data journalism is not simply the result of applying new technology; it also rests on the social value that big data journalism possesses.

From Refraction to Mirroring: Upgrading the Media's Ability to Monitor the Social Environment

"Trust, not information, is the scarce resource in today's world" (Lorenz et al., 2011). In the era of information overload, media organizations have shifted from competing in the attention market to competing in the credibility market. Big data journalism helps the media compete for credibility because mining and analyzing big data allows journalism to gain more comprehensive, objective, and in-depth insight into social reality. We can imagine that as big data grows denser, the capacity of data to reflect reality will be greatly enhanced, and human beings will enter "mirror worlds." In 1991, David Gelernter, a professor of computer science at Yale University, argued that the ultimate form of the Internet is the "Mirror World," which connects to and expresses the real world itself (Bao & Song, 2014). With the increased ability of large-scale data sets to "mirror" reality, news reporting has shifted from "refracting" reality to "mirroring" it (Jia & Xu, 2013). Through big data analysis tools, the media's ability to monitor the real-world environment has been strengthened; one day, big data may make journalism a true "watchdog" of society.

Consider the report "Data about Spring Festival Travel Rush," produced by CCTV's Nightly News team in cooperation with Baidu. It uses billions of location data points collected daily by Baidu Migration through LBS (Location Based Services) to analyze the trajectory and characteristics of China's mass migration around the Spring Festival. Because the data comes from Baidu Maps and a limited number of other applications, Baidu Migration cannot yet fully reflect the real-time migration of China's population. In the future, however, the interconnection of different platforms (e.g., the national railroad and national flight platforms) will allow each "isolated data island" to be joined into a whole, and big data news will then have the possibility to "mirror" the complete migration of the Chinese people.
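The core aggregation behind such a migration analysis can be sketched in a few lines, assuming drastically simplified ping records (user, date, city); a real pipeline works on billions of points and far messier location data.

# Sketch: turning raw location pings into city-to-city migration flows,
# the basic aggregation behind a Baidu Migration-style analysis.
# The ping records here are hypothetical and vastly simplified.
import pandas as pd

pings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "date": pd.to_datetime(["2014-01-20", "2014-01-28"] * 3),
    "city": ["Beijing", "Harbin", "Beijing", "Chengdu", "Shanghai", "Chengdu"],
})

# For each user, take the city observed before and after the holiday
# travel window, then count each origin -> destination pair.
pings = pings.sort_values(["user", "date"])
origin = pings.groupby("user").first()["city"]
dest = pings.groupby("user").last()["city"]
flows = (
    pd.DataFrame({"origin": origin, "destination": dest})
    .value_counts()
    .reset_index(name="travellers")
)
print(flows)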

From Information to Wisdom: Upgrading the Media's Ability to Participate in Social Governance

Many countries around the world have adopted big data development strategies as important tools for economic development, public service, and social governance. The U.S. launched the Big Data Research and Development Initiative in March 2012, investing more than $200 million to promote the collection, access, organization, and exploitation of big data – its most significant science and technology development plan since the Information Superhighway initiative of 1993.

Australia, the European Union, Japan, and South Korea have also launched their own big data development strategies. In October 2015, the State Council of China issued the Action Outline for Promoting the Development of Big Data, proposing that a national unified open platform for government data be established by the end of 2018 to make public data resources available in a reasonable and moderate manner in major fields such as credit, transportation, medical care, health, employment, social security, geography, culture, education, science and technology, resources, agriculture, environment, safety supervision, finance, quality, statistics, meteorology, oceanography, and the registration and supervision of enterprises.

The significance of big data for social governance is self-evident. While the traditional social governance model suffers from fragmented and ambiguous decision-making, big data can improve the accuracy and timeliness of governance. Insights and facts based on the empirical analysis of big data can provide powerful support for scientific decision-making and improve social governance (Zheng, 2016). The launch and implementation of big data strategies at the national level is a rare opportunity for media development: big data will enable the media to participate in social governance at a deeper, broader, and more effective level. As a social public instrument, the media needs to serve the formation and expression of public interests (Pan, 2008); it should therefore be deeply involved in social governance in the public interest, express public concerns, influence public decisions, and become an active subject of social governance. As a social organization that produces, processes, disseminates, and manages information, the media is also an important part of social governance itself (Wang, 2016). Open data platforms built on government data give the media an opportunity to develop data news products and in-depth information-processing services (Bi, 2016). Big data journalism produces not only knowledge but also wisdom, because the results of data analysis can be used by the public and other stakeholders for reference, thereby playing a greater role in social governance.

Internet public opinion analysis is an important way for big data news to take part in social governance. For example, Xinhua's public opinion monitoring system uses big data processing technology to analyze, and even predict, the trend of public opinion on popular social issues in real time (He & Wan, 2014). After the 2011 England riots, The Guardian's "Reading the Riots" project used qualitative and quantitative methods to analyze the causes of the riots, the mechanisms by which information spread, and the roles of different actors, overturning earlier erroneous findings by government departments – a typical case of a news medium using big data to participate in social governance. In the future, services such as the prediction and analysis of public opinion trends based on big data will become the focus of Internet public opinion analysis, and the media has great potential in these fields.
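The core of such an opinion monitor is simple to sketch: count daily mentions of a topic, smooth the series, and extrapolate the short-term trend. The mention counts below are hypothetical, and production systems such as Xinhua's add sentiment models, entity extraction, and streaming input.

# Sketch: a minimal public opinion trend monitor - count daily mentions
# of a topic, smooth them, and project the short-term trend.
# The mention counts below are hypothetical.
import numpy as np
import pandas as pd

days = pd.date_range("2017-03-01", periods=14, freq="D")
mentions = pd.Series(
    [120, 135, 150, 170, 160, 210, 260, 300, 280, 350, 400, 390, 450, 470],
    index=days, name="mentions")

# Smooth day-to-day noise with a centered 3-day moving average.
smoothed = mentions.rolling(window=3, center=True).mean()

# Fit a line to the last week and project three days ahead.
recent = mentions.tail(7)
x = np.arange(len(recent))
slope, intercept = np.polyfit(x, recent.to_numpy(), 1)
forecast = slope * np.arange(len(recent), len(recent) + 3) + intercept

print(smoothed.dropna().round(1))
print("Projected mentions, next 3 days:", forecast.round(0))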

With the growing public availability of data across countries, deepening cooperation between media and international anti-corruption organizations, improving capacity in newsrooms to work with data, and the maturation of transnational journalistic cooperation, cross-border investigative journalism has developed in ways that favor media participation in global governance (Zhan, 2015). It is foreseeable that the media will play a greater role in social governance thanks to big data technology and inter-media collaboration.

From Content to Service: Upgrading the Media Profit Model

According to the China Big Data Development Report (2017), there are currently eight main big data-based business models: open data platforms, big data software engineering, big data consulting, the sharing economy, big data credit scoring, big data-based industry operation, big data marketing, and big data trading (State Information Center, 2017). News media have great potential for development and profit under models such as open data platforms and big data consulting. The media in the era of big data is no longer a mere news production organization but a platform for integrated data processing – a data hub able to analyze even very large and complex data sets internally and build stories on the resulting insights (Aitamurto et al., 2011). The media should therefore not rely solely on leaks and open government, which would leave data-driven journalism a niche form of reporting (Baack, 2011), but should also develop internal databases to provide information services such as public opinion monitoring, market analysis, and forecasting.

Many media organizations still profit mainly from the advertising attached to data news pieces. Caixin's "Say it with numbers" column, for example, earns its revenue chiefly from advertising and subscriptions. At present, what high-quality data news works mostly achieve is to expand the influence of the media; it is not realistic to rely solely on attached advertising for profit (Huang, 2016). Advertising remains the main source of profit only because big data journalism is still in its infancy.

In the future, big data journalism will feature a product-based, value-added-chain profit model. Big data news products have a distinct advantage in "shelf life": whether as applications or as data platforms, they can be updated and upgraded, extending their "shelf life" indefinitely and reaping both short-term and long-term benefits. Some important databases require only regular maintenance once built, so their operating costs can be greatly reduced. The particular strength of big data news products lies in prediction, which can deeply explore the business and social value of big data. The Federal Big Data Research and Development Strategic Plan states that big data should be used to enhance the capability of predicting socio-economic development (Tang, 2013). The development of artificial intelligence may push big data news to a smarter stage; in the future, for example, big data news may be able to predict in real time.

Such personalized services will be priced according to user needs. The services that big data journalism can offer individual and corporate users include both free and paid tiers: free services mainly cover general data information and basic inquiries and forecasts, while paid services address the personalized, customized needs of individuals and enterprises, such as professional consulting and decision support. Since big data news involves the public interest and attracts broad social attention, it can continue to draw advertising revenue alongside information services, forming a multi-layer profit system.

Some media outlets have already taken new approaches to profitability. Thomson Reuters' core business is collecting domain-specific data, building specialized analytical models (e.g., credit risk, marketing effectiveness), and then selling the models' outputs (e.g., customer credit risk scores) along with the raw data to the corporate clients who need them (Baesens, 2016). Since the launch of its Data Store in 2014, ProPublica has offered free datasets that had been downloaded more than 4,500 times by 2016. The Data Store also offers premium datasets priced from $200 to upwards of $10,000, generating over $200,000 in total revenue by 2016 (Netease School of Journalism, 2016). In the Data Store 2.0, launched in 2016, ProPublica provides APIs that allow users to "pull data into their own website and mobile tools" (LeCompte, 2016). According to Andrew Leckey, a senior American financial journalist, data-based products are the core business, and highly profitable, for well-known media such as the Associated Press, Reuters, and Bloomberg. As the application of big data in the news industry matures and spreads, the profit model of big data news will diversify further.

What Big Data Journalism Requires of the Media

The World Newsmedia Network (WNN) survey of 144 media outlets in 40 countries in 2015 and 2016 revealed that data journalism and big data journalism strategies are receiving growing emphasis, and that data journalism has become an important tool for media adapting to digital communication trends. When media personnel were asked in which segment big data plays the biggest role, data news production was second only to user analytics (Gu, 2016). Huang Chen, then-assistant editor-in-chief of Caixin.com, believes that in China the concept of big data journalism is currently talked about more than it is practiced (opinion from a Toutiao Media Lab online salon, February 17, 2017). Most media outlets cannot yet produce big data news independently; most of the big data they use is provided by Internet companies or government departments. Big data news production still faces many challenges on the way from experiment to routine. According to the China Big Data Development Report (2017), the factors currently restricting the development of China's big data industry include loopholes in data management, lagging technical development, low openness of data resources, incomplete laws and regulations, imperfect IT infrastructure, and a shortage of high-end, well-rounded talent (State Information Center, 2017).

Since the era of big data has arrived, embracing big data is an inevitable trend for journalism: newsrooms can not only try to produce big data news but also apply big data to every aspect of news production. Of course, not all media need the ability to produce big data news. Because of its complexity and difficulty, big data news is a scarce product that can exist only in a few influential media. Media interested in big data journalism need top-level design to embrace big data comprehensively and maximize the benefits.

To Develop Big Data Strategies for Media Transformation

For media engaged in big data news production, designing a big data strategy for future development at the top level is essential. The application of big data in journalism is comprehensive, spanning news content production, news push, user analysis, and advertising, and some media organizations are already using big data to drive their transformation. At present, however, big data applications in many industries remain experimental, which makes blind investment in big data technology risky for the media. Media should therefore plan their application of big data in stages. They can start by employing programmers and data analysts for user analysis or data news production – a preparatory phase of a big data strategy – and then plan further according to the development of journalism and of big data. It should be noted that a media big data strategy does not mean simply hoarding large amounts of data; it means using appropriate technology to collect data strategically and process it purposefully in line with the media's business strategy, so that data assets create greater economic benefit (Xu, 2017).

To Accumulate Big Data Resources Actively

Big data is not only a new resource for journalism but also an important asset for news media. Huang Chen, then-assistant editor-in-chief of Caixin.com, noted that it is currently almost impossible for Internet companies to hand raw big data to the media, although they will provide conclusions based on their own analysis of it (opinion from a Toutiao Media Lab online salon, February 17, 2017). The media's own capacity to mine and analyze big data therefore matters for the independence of its content production. Since media reports cover all aspects of social life, it is impractical to accumulate big data in every field; the media should instead accumulate big data selectively and explore feasible cooperation with data holders, combining the building of their own databases with the use of others'. According to the Financial Times report "Extracting the Full Potential from Data Journalism in 2017," data journalists can acquire data in three ways: creating datasets, combining datasets, and accumulating data in non-standard data formats (Financial Times, 2017).

During China's "Two Sessions" in 2017, the People's Daily's data providers included the Global Big Data Exchange, Qingbo Big Data, TRS, Fanews, Sogou Big Data Center, Sina, and Tencent (People's Daily Media Technology, 2017).

For news media aiming to serve the public interest, the collection and accumulation of public big data is very important. The data that media should accumulate includes (1) government public data, (2) data from the media's own platforms, (3) social media data, (4) IoT data, and (5) other data related to the public interest. Media can purposefully accumulate big data of a certain category or field according to their own development strategies.

(1) Government public data. Big data has become a national-level strategy in many countries, so a large number of public big data sets and interfaces can be expected to be released in the future, giving the media access to these important data (a sketch of routinely archiving such open data follows this list). Some media outlets are already doing this work with small data. The Texas Tribune in the U.S., for example, has reorganized many of its datasets while using open government data. ProPublica's Data Store offers three types of datasets: Premium Datasets, FOIA Data, and External Data (Xu C., & Xu Z., 2015). These datasets cover compensation paid to physicians by pharmaceutical companies, Medicare spending on hepatitis C drugs, partial disability compensation for injured workers, and many other topics.

Nonprofit organizations, law firms, and big companies are often willing to pay top dollar for these kinds of datasets, largely because the data they offer is either often difficult to obtain or is the result of many hours of manual data entry. (Bilton, 2016)

(2) Data from the media's own platforms. With media convergence, media organizations will run their own new media platforms, making it more convenient to collect, organize, and analyze their own data. At the same time, the digitalization of media content requires the media to manage and reuse their own data effectively.

(3) Social media data. Such data is often available through open or shared access. Its significance for news media is that it contains users' "social connections" and can be used to gain insight into their emotions, attitudes, and behaviors.

(4) IoT data. The Internet of Things (IoT) enables all objects to be connected to the network for identification, management, and control; "Baidu Migration," for example, uses IoT data. IoT data contains rich information about individuals (such as geographic location, movement paths, emotions, and health status) and will play an important role in future big data applications.

(5) Other data related to the public interest. This includes survey reports and news reports from research institutes and scientific institutions.
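As noted under item (1), a minimal sketch of what the routine accumulation of open data can look like follows; the URL and file layout are hypothetical, and a production pipeline would add scheduling, validation, and change detection.

# Sketch: accumulating government open data on a schedule - download a
# published CSV, stamp it with the retrieval date, and store it in a
# local archive so the newsroom builds its own longitudinal dataset.
# The URL and file layout are hypothetical.
import datetime
import pathlib
import urllib.request

DATASET_URL = "https://data.example.gov/api/air-quality/daily.csv"  # hypothetical
ARCHIVE = pathlib.Path("archive")
ARCHIVE.mkdir(exist_ok=True)

today = datetime.date.today().isoformat()
destination = ARCHIVE / f"air-quality-{today}.csv"

# Skip the download if today's snapshot already exists.
if not destination.exists():
    urllib.request.urlretrieve(DATASET_URL, destination)
    print(f"Saved snapshot to {destination}")
else:
    print(f"Snapshot {destination} already archived")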

To Improve Big Data Processing Capability

Helena Bengtsson, data projects editor at The Guardian, points out the current limitations of media outlets in data processing and analysis:

Story-wise, I hope we're heading towards larger and more unstructured data sets. We will be looking for ways to conquer text; we haven't conquered text yet. In one way I wish that we had gotten the Panama Papers ten years from now because we would have been able to do much different stories than we do right now. (Kaplan, 2016)

Guo Junyi, head of CCTV's "Data about" series, made a similar point in an in-depth telephone interview:

We've been making big data news for three years, but we don't have a professional team or a specialized platform for big data news. We are only making special programs periodically now, not to mention hiring dedicated data analysts. It's all still a long way off . . .

Weak big data processing capability is another shortcoming restricting the big data journalism industry. Many media organizations face a double shortage of big data processing technology and talent; many have no specialized data management and analysis departments and are unable to process even small data. Improving media's big data processing capability requires not only building big data platforms but also simultaneous investment in technology and talent (Zhang & Yu, 2013).

Media can enhance their big data processing capabilities in three stages: outsourced big data production, equal cooperation, and independent production. In the early stage, big data processing is outsourced to professional companies, because at this point the media can neither process big data itself nor recruit the necessary talent in the short term; this is the model CCTV adopts when producing big data news – "borrowing others' boats to sail out to sea." At the stage of equal cooperation, the media has a certain number of big data specialists and a certain level of processing ability but still cannot handle complex big data tasks, so it continues to rely on professional teams. This stage clearly differs from outsourcing: in the first stage, the media cannot control the big data production process and can only give outsourcers its requirements, whereas in equal cooperation the media and the professional team stand on a more equal footing. In the independent production stage, the media can autonomously manage the whole big data news production process and is fully capable of producing high-quality big data news works; at that point, the media is, in some respects, also a big data technology company.

The emphasis on improving the media's own big data processing capability matters because it is crucial to maintaining media independence: if big data news production is controlled by people outside the media, the truthfulness, objectivity, and fairness of news reporting will be at risk (Zhang & Zhong, 2016). Guo Junyi, head of CCTV's "Data about" series, said the team prefers technology companies that have no conflict of interest with the news topic and enjoy a good reputation, and that it double-checks the conclusions of big data analysis through interviews and the analysis of other data. Driven by artificial intelligence technology, the media's capacity to collect, process, and analyze big data will keep growing, and big data news products will serve society more widely and deeply. We can imagine that the media may no longer be merely a recipient of technology research and development but become its promoter and participant.

References

Aitamurto, T., Sirkkunen, E., & Lehtonen, P. (Eds.). (2011). Trends in data journalism. http://virtual.vtt.fi/virtual/nextmedia/Deliverables-2011/D3.2.1.2.B_Hyperlocal_Trends_In%20Data_Journalism.pdf
Baack, S. (Ed.). (2011). A new style of news reporting: Wikileaks and data-driven journalism. Social Science Open Access Repository (SSOAR). www.ssoar.info/ssoar/handle/document/40025
Baesens, B. (2016). Analytics in a big data world: The essential guide to data science and its applications (X. Ke & J. Zhang, Trans.). Posts & Telecom Press.
Bao, Z., & Song, G. (2014). Focus on social governance innovation in the era of big data. Red Flag Manuscript (11), 31–32. (Published in China.)
Bi, Q. (2016). Open data applications in data journalism. Hubei Social Sciences (16), 190–194. (Published in China.)
Bilton, R. (Ed.). (2016). ProPublica's data store, which has pulled in $200K, is now selling datasets for other news orgs. Nieman Lab. www.niemanlab.org/2016/10/propublicas-data-store-which-has-pulled-in-200k-is-now-selling-datasets-for-other-news-orgs/
Financial Times (Ed.). (2017). Extracting the full potential from data journalism in 2017. Financial Times. http://johnburnmurdoch.github.io/slides/data-journalism-manifesto/#/
GIJN Chinese (Ed.). (2015). Best data survey story of the year: Secrets hidden in numbers. The Global Investigative Journalism Network. http://cn.gijn.org/2015/06/02/数据驱动的调查报道:藏在数字中的秘密/ (Published in China.)
Global Investigative Journalism Network (Ed.). (2015). Weekly data news picks. http://cn.gijn.org/2015/02/20/每周数据新闻精选2-14-20/ (Published in China.)
Gu, X. (Ed.). (2016). Data journalism has become the touchstone of whether the media is advanced or not? Go to New York to hear the leading voices of the industry. The Paper. www.thepaper.cn/newsDetail_forward_1468494 (Published in China.)
He, H., & Wan, X. (2014). How to innovate and break through the media public opinion business. Chinese Journalist (7), 70–71. (Published in China.)
Huang, Q. (Ed.). (2016). Huang Zhimin, "the first Chinese person in data journalism," left Caixin and will start his own business in big data. Lan Jing Cai Jing. www.lanjinger.com/news/detail?id=21283 (Published in China.)

Jia, L., & Xu, X. (2013). The nature of "big data" and its marketing value. Nanjing Journal of Social Sciences (7), 15–21. (Published in China.)
Kaplan, A. (Ed.). (2016). Data journalism: What's next. Uncovering Asia Conference. http://2016.uncoveringasia.org/2016/09/24/data-journalism-whats-next/
LeCompte, C. (Ed.). (2016). Introducing the ProPublica Data Store 2.0. ProPublica. www.propublica.org/article/introducing-the-new-propublica-data-store
Lorenz, M., Kayser-Bril, N., & McGhee, G. (Eds.). (2011). Voices: News organizations must become hubs of trusted data in a market seeking (and valuing) trust. Nieman Lab. www.niemanlab.org/2011/03/voices-news-organizations-must-become-hubs-of-trusted-data-in-an-market-seeking-and-valuing-trust/
Netease School of Journalism (Ed.). (2016). ProPublica builds a data sales platform. Netease. http://news.163.com/college/16/1009/15/C2URC4L8000181KO.html# (Published in China.)
Pan, Z. (2008). Reflection and prospect: Writing on the 30th anniversary of China's media reform and opening up. Communication & Society (6), 17–48. (Published in China.)
People's Daily Media Technology (Ed.). (2017). People's Daily central kitchen first promoted "data service," how effective? People's Daily Media Innovation. http://mp.weixin.qq.com/s/noivXXRqG93GceMKNRbffA (Published in China.)
State Information Center (Ed.). (2017). China big data development report (2017). State Information Center. www.sic.gov.cn/archiver/SIC/UpFile/Files/Default/20170301105857384102.pdf
Stone, M. L. (Ed.). (2014). Big data for media. Reuters Institute for the Study of Journalism. https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2017-04/Big%20Data%20For%20Media_0.pdf
Tang, J. (2013). Media transformation in the era of big data: Concepts and strategies. News and Writing (9), 23–26. (Published in China.)
Wang, S. (2016). The boundaries and approaches of media involvement in social governance. Editorial Friend (7), 83–86. (Published in China.)
Xu, C. (Ed.). (2017). Data realization and consulting. SDx-SoftwareDefinedx. http://mp.weixin.qq.com/s/xGLtTdyxJeINj8ArimWnfA (Published in China.)
Xu, C., & Xu, Z. (2015). A field perspective on data journalism – A case study of ProPublica's journalism practice. Journal of News Research, 6(9), 209–220. (Published in China.)
Zhan, J. (Ed.). (2015). Cross-border investigation of international journalism and global governance. Global Investigative Journalism Network. http://cn.gijn.org/2015/10/19/国际新闻界的跨境调查与全球治理/ (Published in China.)
Zhang, C., & Zhong, X. (2016). Data journalism: Context, genres and ideas. Editorial Friend (1), 76–83. (Published in China.)
Zhang, Y., & Yu, Y. (Eds.). (2013). Big media in big data era. People's Daily Online. http://cpc.people.com.cn/n/2013/0117/c83083-20231637.html (Published in China.)
Zheng, Z. (2016). Innovation of social governance model based on big data perspective. E-Government (9), 55–60. (Published in China.)

Conclusion

In 2006, Adrian Holovaty argued that data journalism could transform the story-centric model of news production by repurposing structured data. In 2008, Simon Rogers proposed data journalism as a way to turn abstract open data into intuitive data visualizations. As times change, so does the meaning of data journalism. Today data journalism is thriving: open data has made news in database form a powerful tool for some mainstream media to gain attention; interactive visualization has expanded the complex narrative capabilities of data journalism; machine learning makes it easier for data journalists to discover the truth hidden in massive amounts of data . . .

In the first decade of data journalism's existence, data journalists emphasized "data" in order to construct their own professional discourse and to demarcate themselves from traditional journalism. Data journalism became a "tool" for re-professionalizing and legitimizing journalism. Of course, we must also recognize that what makes data journalism a new journalistic paradigm is not the "data" but the data science methodology. The birth and development of data journalism was not directly caused by big data; it is the joint result of multiple factors, such as the open data movement, the open-source movement, and journalism's crisis of trust and professionalism. The trend of data journalism has made the media increasingly aware of the need to integrate data, data science methods, visualization techniques, data analysts, and data visualization designers into news production. Journalism culture is being "reinvented" by technology culture, and the professionalism of journalism is being enhanced through the use of technology and cross-border cooperation.

The process of producing data news resembles scientific research or quantitative analysis, but the essence of data news is still news. Data and data visualization appear objective and neutral, yet they are in fact "representations" of reality. Data collection, analysis, narrative, and visualization are not neutral, natural, or transparent: data collection may be incomplete, data analysis may be wrong, narratives may be biased, and data visualization may distort the facts . . . Journalists therefore need to follow the principles of objectivity and transparency.

We are in an era of social media and big data. The era of social media requires data journalism to adapt to the attributes of each platform and the habits of its users, aiming to stimulate users to "share" and thereby activate the multi-level communication power of data news works.

The era of big data requires the media to participate more deeply in social governance through data journalism, so that the media can shift from "traditional media" to "intelligent media." In the future, algorithm-driven journalism will embrace data further and embed it in the "DNA" of news. Data journalism "will be unshackled from the term 'data' and instead focused on the word 'journalism'" (Usher, n.d.). When World Wide Web founder Tim Berners-Lee's claim that "data-driven journalism is the future" echoes again, data journalists need to ask: how can data-driven journalism become the future?

Reference

Usher, N. (Ed.). (n.d.). What is data journalism for? Cash, clicks, and cut and trys. Data Journalism. https://datajournalism.com/read/handbook/two/situating-data-journalism/what-is-data-journalism-for-cash-clicks-and-cut-and-trys

Index

accountability 32, 159 actor transparency 153 algorithmic transparency 6, 155, 156, 157, 158, 159 Al Jazeera 30, 79, 121 anonymization 166, 167 anthropomorphic communication 28 artificial intelligence 20, 59, 76, 81, 82, 157, 162, 173, 177, 182 Assange, Julian 25, 30, 31, 62, 88 Associated Press 31, 76, 80, 99, 117, 126, 178 author-driven 101 Barthes, Roland 109, 115 BBC 19, 21, 26, 30, 42, 43, 60, 66, 108, 123, 125, 135, 152, 156, 158, 159 Berners-Lee, Tim 1, 64, 185 boundary work 22, 35, 142 brand image 79 BuzzFeed 26, 66, 81, 98, 139, 158, 159, 160 Caixin Media 25, 28 causation 71, 72, 101 CCTV 2, 42, 48, 49, 62, 103, 173, 174, 175, 181, 182 Chicago Tribune 42, 81 citizen journalism 22, 151 classification 68, 81, 93, 100, 101, 102, 159 Cognitive Surplus 51 collaborative filtering 138 complementary model of words and images 117 complex narrative 6, 26, 43, 90, 91, 117, 118, 128 complex network 96 computational journalism 2, 13, 14, 15, 16

computer-assisted reporting (CAR) 2, 8, 11, 13, 14, 15, 19, 20, 25 consistency criterion 147, 149, 150, 151 correction transparency 161 correlation 6, 17, 67, 71, 72, 76, 77, 81, 100, 101 credibility 4, 22, 25, 61, 72, 73, 142, 144, 147, 148, 153, 175 Crowdsourcer 50, 51 crowdsourcing model 5, 44, 49, 50, 51, 52 data acquisition 59, 158 data analysis transparency 6, 155 database-based exploratory narrative 111 database journalism 9 data cleaning 13, 59, 78 data collection transparency 155 data consistency 59 data-driven journalism 1, 9, 12, 22, 161, 177, 185 data flow 66, 79 Data Journalism Awards 6, 31, 67, 68, 74, 76, 77, 78, 79, 82, 90, 91, 161 data mapping 147 data mining 59, 71, 77, 80, 156, 173 data processing 2, 12, 30, 42, 45, 47, 65, 74, 77, 78, 89, 121, 173, 176, 177, 181, 182 data reuse 4, 59, 64 data type 76 data visualization 4, 6, 152, 155, 160 data visualization transparency 6, 155, 160 data volume 74, 75 data warehouse 76, 77 The Death of the Author 6, 109 de-boundarization 23 decoding 151

deprofessionalization 74 Der Spiegel 62 descriptive analytics 67 Deutsche Welle 10 dialogue transparency 153 disclosure transparency 153 The Economist 42, 62, 135, 137, 140, 154 ego network 98, 99 embedded developers 42 empathy 112, 124, 146 encoding 150, 151 endogenous development 23, 41 endogenous model 41, 42, 43, 44, 45, 46, 47, 72 Entman, Robert M. 2 ethical norms 4, 6, 65, 74, 143 EveryBlock 1, 8 excessive mining 162, 163 fact-check 75 The Financial Times 43, 80, 100, 108, 112, 125, 139, 156, 158, 174, 179 fission-style model 135 FiveThirtyEight 26, 42, 65, 70, 72, 121 Foucault, Michel 87 Fourth Estate 19, 22, 143 framing 2, 88, 91, 119, 147 Freedom of Information Act 30, 33, 61, 163 "Free Our Data" 29 Free Software Movement 20 Gallup 21 game-based experiential narrative 11, 107, 108 gamification 138 Genette, Gérard 89 GitHub 66, 156, 159 Google Trends 9 graphic news 2 Graphics Interchange Format (GIF) 126, 128, 138 The Guardian 1, 2, 8, 9, 11, 17, 29, 30, 31, 33, 41, 42, 43, 45, 50, 52, 61, 62, 67, 70, 73, 80, 92, 101, 105, 106, 116, 117, 123, 125, 155, 161, 166, 176, 181 Habermas, Jürgen 88 hackathon model 53, 54 Hall, Stuart 150 hetero-organization 54 Holovaty, Adrian 1, 8, 9, 12, 17, 24, 76, 184 hypothesis testing 59

illusion of choice 113 image-dominant model 114 in-depth reporting 13 information filtering 111, 136 information retrieval 76, 77 informed consent 6, 164, 165 Initium Media 54 interactive narrative 5, 6, 101, 103, 104, 106, 107, 108, 109, 110, 111, 112, 113 International Consortium of Investigative Journalists (ICIJ) 174 Internet of Things (IoT) 180 Internet public opinion analysis 178 Intertextuality 117, 118 investigative journalism 67, 81, 174, 177 journalistic ethics 143, 152 knowledge product 158 learning theory 59 legality 6, 165 legitimacy 1, 22, 74, 142, 144, 146, 161 lightweight design 139, 140 linear narrative 6, 101, 102, 103, 104, 106 linguistics 17, 102 Lippmann, Walter 145 long tail 136 The Los Angeles Times 81, 93, 161, 164 machine learning 67, 73, 76, 77, 78, 80, 81, 82, 184 marketplace of ideas 22 Meyer, Philip 13 Miami Herald 87, 149 minimizing harm 6, 166 mirror world 175 modeling technique 59 Narratology 4, 89, 90 National Institute for Computer-Assisted Reporting (NICAR) 9 the negotiation model of words and images 119 neo-structuralism 96 news discourse 87, 88, 108, 119 New York Daily News 61 The New York Times 2, 30, 33, 42, 43, 62, 68, 70, 80, 81, 82, 88, 93, 101, 112, 121, 122, 123, 129, 138, 151, 161 niche database 79 non-linear narrative 104, 106 NPR 126, 152, 153

objective reporting 144, 146 objectivity 3, 5, 6, 13, 16, 73, 114, 115, 120, 143, 145, 146, 147, 148, 149, 150, 151, 152, 182, 184 open access 6, 18, 59, 60, 180 Open Data Barometer 33, 63, 64 Open Data Movement 3, 5, 18, 19, 20, 23, 28, 29, 30, 34, 74, 79, 184 Open Government Movement 18 open-source algorithm 157, 158 Open Source movement 15, 18, 20, 157, 184 open source software 15, 20, 21, 23, 41 outsourcing model 41, 47, 49 The Paper 42, 43, 138 parallel narrative 101, 102, 103, 104, 105 participatory transparency 153 passive consent 162 pattern recognition 59 Peirce, Charles Sanders 146 People's Daily 24, 138, 180 personal data 6, 66, 143, 161, 162, 163, 164, 165, 166, 167 personalized service 79, 178 Phylogenetics 17 pictorial turn 27, 113, 114, 117, 120 PolitiFact 9 post-truth 146 pragmatic objectivity 6, 146, 147, 148 precision journalism 2, 13, 14, 34, 74 predictive analytics 67, 68 prescriptive analytics 67 principle of objectivity 143, 147, 148, 149, 150, 184 principle of transparency 6, 152, 153 PRISM 62 privacy violation 162, 163 procedural rhetoric 5, 109, 110, 111, 112, 113, 138 producer transparency 6, 155, 160 product awareness 6, 120, 124 production transparency 153 professional norms 6, 74, 125, 144 profit model 7, 79, 177, 178 programming marathon 53 proportionality 6, 165 proprietary algorithm 157, 158 proprietary software 20, 21 ProPublica 30, 42, 48, 50, 60, 79, 81, 107, 112, 161, 178, 180 public interest priority 6, 166

public service 123, 157 Pulitzer Prize 9, 27 qualitative analysis 67 quantitative analysis 14 random forest 77, 78 reader-driven 101 recontextualization 69 red-top tabloid newspapers 21 redundancy 117 regression 68, 77, 174 reinforcement learning 81, 82 relevance awareness 6, 120, 122 reperspectivization 69 re-professionalization 82 Reuters 48, 80, 81, 127, 152, 160, 163, 178 Reuters Institute for the Study of Journalism 21, 75, 173 rhetoric 4, 5, 9, 32, 90, 109, 110, 111, 112, 113, 114, 138, 147 Rogers, Simon 9, 11, 17, 29, 31, 43, 52, 149, 151, 184 Russian Formalism 89 search algorithm 59 secondary data 43, 148, 149 self-collection 6, 59, 61, 62 self-organization 54, 55 self-organizing model 5, 54 semi-structured data 11, 16 shared access 6, 59, 62, 160 shelf life 136, 177 Shirky, Clay 51 signified 89, 115, 116 signifier 89, 115 Snowden, Edward 62, 161 social currency 137 social governance 7, 18, 159, 175, 176, 185 social media-oriented production 6, 134, 136, 140 social network 6, 15, 48, 91, 96, 97, 98, 99, 100, 117 social network analysis 96, 97, 98, 99 Stallman, Richard 20 statistical analysis 59, 66, 77 story awareness 120 strong tie 79, 107, 124 Structuralism 89, 96

structured data 8, 11, 12, 18, 25, 76, 80, 184 supervised learning 81 SWOT Analysis 43, 44, 48, 51 symbiosis 28

Tampa Bay Times 9, 164 technology-neutral 3 techno-optimism 3 The Texas Tribune 64, 79, 123, 124, 125, 161, 180 Thomson Reuters 48, 80 threshold value 165 Tian Yan Cha 47 Tow Center for Digital Journalism 9, 62 The Trust Project 154 Tuzheng 47, 55, 56

unstructured data 11, 12, 16, 18, 33, 76, 77, 80, 173, 181 unsupervised learning 81 update transparency 161 "upmarket" newspapers 21 user experience (UX) 49, 79, 124, 125, 126, 127, 134, 139, 140

vertical communication 135 viral communication 137 visual rhetoric 110 Vox 51, 100

The Wall Street Journal 25, 42, 78, 98, 126, 138, 148, 174 The Wall Street Journal Formula 26 The Washington Post 27, 30, 42, 43, 62, 69, 79, 80, 81, 92, 93, 94, 95, 119, 120, 154, 161 web scraping 60, 162 WikiLeaks 26, 30, 31, 62, 66, 99, 163 WNYC 50, 123, 150, 151 word-dominant model 115, 117 word-image relationship 6, 113, 120 World News Media Network 46, 178 World Wide Web 1, 33, 63, 64, 185

Xinhuanet 103 YouGov 21, 63 Zeit Online 30