Data Democracy: At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering [1 ed.] 0128183667, 9780128183663

Data Democracy: At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering provides a manifesto for data democracy.


English Pages 266 [252] Year 2020


Table of contents:
Cover
Data Democracy: At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering
Copyright
Dedication
How To Use
Contributors
A note from the editors
Foreword
References
Preface
Section I: The data republic
1 - Data democracy for you and me (bias, truth, and context)
1. What is data democracy?
2. Incompleteness and winning an election
3. The story and the alternative story
4. Nothing else matters
References
2 - Data citizens: rights and responsibilities in a data republic
1. Introduction
2. A paradigm for discussing the cyclical nature of data–technology evolution
3. Use cases explaining the black–red–white paradigm of data–technology evolution
3.1 “Thank you visionaries”—black: 100 years of data science (1890–1990)
3.2 “Progress to profit”—red: big data and open data in today's information economy
3.3 “The past provides a lens for the future”—white: looking backward to see forward
4. Preparing for a future data democratization
4.1 The Datamocracy Framework helps envision the future
4.2 Guiding principles within the framework
4.2.1 Feature engineering should creatively use existing data to enhance models without introducing unintended bias. Ideally, inv ...
4.2.2 Machine learning practice should protect the typical data citizen and not exploit their data literacy. Ideally, the data sc ...
4.2.3 Data use should set ethical precedence in this revolution toward progress and harmony. Ideally, the data could be made full ...
5. Practical actions toward good data citizenry
5.1 Use data science archetypes
5.2 Focus on the questions
5.3 Collaborate within the process to build a new culture of data
5.4 Label machine learning products for consumers
6. Conclusion
References
3 - The history and future prospects of open data and open source software
1. Introduction to the history of open source
2. Open source software's relationship to corporations
3. Open source data science tools
4. Open source and AI
5. Revolutionizing business: avoiding data silos through open data
6. Future prospects of open data and open source in the United States
References
Further reading
4 - Mind mapping in artificial intelligence for data democracy
1. Information overload
1.1 Introduction to information overload
1.2 Causes of information overload
1.2.1 Digital transformation
1.2.2 Internet of things
1.2.3 Social media
1.2.4 Cybersecurity
1.2.5 Internet web pages
1.2.6 Emails
1.2.7 Data openness
1.2.8 Push systems
1.2.9 Attention manipulation
1.2.10 Spam email
1.2.11 Massive open online courses (MOOCs)
1.3 Consequences of information overload
1.3.1 Anxiety, stress, and other pathologies
1.3.2 Reduction in productivity
1.3.3 Misinformation
1.3.4 Poor decision-making
1.4 Possible solutions
1.4.1 Literature reviews
1.4.2 Content management systems
1.4.3 Open data portals (data democracy)
1.4.4 Search engines
1.4.5 Personal information agents
1.4.6 Recommender systems
1.4.7 Infographics
1.4.8 Mind mapping
1.5 Artificial intelligence in the reduction of information overload
2. Mind mapping and other types of visualization
2.1 Mind mapping in the visualization of open data
2.1.1 Visualization of open data using mind mapping
2.1.2 Visualization of big open data using mind mapping
2.2 Visualization of content management systems using mind mapping
2.3 Visualization of artificial intelligence results
2.3.1 Types of applications of visualization in AI
2.3.2 Exploratory data analysis as a first step in AI
2.3.3 Software visualization and visual programming of AI applications
2.3.3.1 An example: visualization of complex information in NLU applications
3. Conclusions
References
5 - Foundations of data imbalance and solutions for a data democracy
1. Motivation and introduction
2. Imbalanced data basics
2.1 Degree of class imbalance
2.2 Complexity of the concept
3. Statistical assessment metrics
3.1 Confusion matrix
3.2 Precision and recall
3.3 F-measure and G-measure
3.4 Receiver operating characteristic curve and area under the curve
3.5 Statistical assessment of the insurance dataset
4. How to deal with imbalanced data
4.1 Undersampling
4.1.1 Random undersampling
4.1.2 Tomek link
4.1.3 Edited nearest neighbors
4.2 Oversampling
4.2.1 Random oversampling
4.2.2 Synthetic minority oversampling technique
4.2.3 Adaptive synthetic sampling
4.3 Hybrid methods
5. Other methods
6. Conclusion
References
Section II: Implications of a data democracy
6 - Data openness and democratization in healthcare: an evaluation of hospital ranking methods
1. Introduction
2. Healthcare within a data democracy—thesis
3. Motivation
4. Related works
5. Hospitals' quality of service through open data
6. Hospital ranking—existing systems
7. Top ranked hospitals
8. Proposed hospital ranking: experiment and results
9. Conclusions and future work
References
Further reading
7 - Knowledge formulation in the health domain: a semiotics-powered approach to data analytics and democratization
1. Introduction
2. Conceptual foundations
2.1 Semiotics
2.2 Semantics: lexica and ontologies
2.3 Syntagmatics: relationships and rules
2.4 Syntactics: metadata
2.5 Data interoperability and health information exchange
2.6 Semiotics-based analytics
2.7 Model-based analytics
2.7.1 Information domain delineation: contexts and scope
2.7.2 Data identification (exploration)
2.7.3 Data preparation (data staging)
2.7.4 Information model development
2.7.5 Information presentation
2.7.6 Heuristics-based analytics
3. A semiotics-centered conceptual framework for data democratization
3.1 Data democratization conceptual architecture
3.2 Data democratization governance
4. Conclusion
References
8 - Landsat's past paves the way for data democratization in earth science
1. Introduction
2. Landsat overview
3. Machine learning for satellite data
4. Satellite images on the cloud
5. Landsat data policy
6. Conclusion
References
9 - Data democracy for psychology: how do people use contextual data to solve problems and why is that important for AI systems?
1. Introduction and motivation
2. Understanding context
3. Cognitive psychology and context
4. The importance of understanding linguistic acquisitions in intelligence
5. Context and data, how important?
6. Neuroscience and contextual understanding
7. Context and artificial intelligence
8. Conclusion
References
10 - The application of artificial intelligence in software engineering: a review challenging conventional wisdom
1. Introduction and motivation
2. Applying AI to SE lifecycle phases
2.1 Requirements engineering and planning
2.2 Software design
2.3 Software development and implementation (writing the code)
2.4 Software testing (validation and verification)
2.5 Software release and maintenance
3. Summary of the review
4. Insights, dilemmas, and the path forward
References
Further reading
Index
A
B
C
D
E
F
G
H
I
K
L
M
N
O
P
R
S
T
U
W
X
Z
Back Cover


Data Democracy At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering Edited by

Feras A. Batarseh Ruixin Yang

Academic Press is an imprint of Elsevier 125 London Wall, London EC2Y 5AS, United Kingdom 525 B Street, Suite 1650, San Diego, CA 92101, United States 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom Copyright © 2020 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-12-818366-3 For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner Acquisition Editor: Chris Katsaropoulos Editorial Project Manager: Gabriela Capille Production Project Manager: Punithavathy Govindaradjane Cover Designer: Matthew Limbert Typeset by TNQ Technologies

To the sons and daughters of the digital prison; as we may give you freedom through a data democracy, you must not inherit our thoughts or our ways.

To: Aaron Swartz, the creator of the Open Access Manifesto. I remember one day visiting the John Crerar Library in Chicago. My father (Aaron's grandfather) had spoken to me about it many times, about how he had done research there when he was a young man. The library is now on the campus of the University of Chicago. It is an interesting place, as it is a science library whose mission is that it be open to the public. Although the University of Chicago does not encourage public use any more, if you are assertive, they will let you in. Aaron's grandfather taught me how to do library research when I was very young. We had many reference books at home and used our local public library often. To him, the ability to do research was a fundamental skill to be passed on from father to son. So I took Aaron to the Crerar Library and showed him around and showed him the stacks and all the books that were there. I remember clearly taking a random book off the shelf and discovering that it was from the 19th century and explaining to him how important it was having access to the world's knowledge. Aaron understood the importance of written knowledge and, just as Crerar wanted his library open to the public, how it was vital that everyone should be able to easily access the world's research and knowledge. As Wikipedia points out: "Because the library was incorporated under the 1891 special law, court approval was required for the merger, a condition of the merger was that the combined library would also remain free to the public." We forget that in the last century, all the world's knowledge was available in the libraries; there, books and journals were accessible and open to everyone. Aaron fought so that in this world of bits and bytes we could once again return to a place where everyone could have access to the world's knowledge and research. Robert Swartz (Aaron's father) 2019

Contributors

Feras A. Batarseh  Graduate School of Arts & Sciences, Data Analytics Program, Georgetown University, Washington, D.C., United States
Justin Bui  Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, United States
Deri Chong  Volgenau School of Engineering, George Mason University, Fairfax, VA, United States
Sam Eisenberg  Department of Mathematics, University of Virginia, Charlottesville, VA, United States
Jay Gendron  United Services Automobile Association (USAA), Chesapeake, VA, United States
José M. Guerrero  Infoseg, Barcelona, Spain
Debra Hollister  Valencia College - Lake Nona Campus, Orlando, FL, United States
Dan Killian  Massachusetts Institute of Technology Operations Research Center, Cambridge, MA, United States
Erik W. Kuiler  George Mason University, Arlington, VA, United States
Ajay Kulkarni  The Department of Computational and Data Sciences, College of Science, George Mason University, Fairfax, VA, United States
Abhinav Kumar  Volgenau School of Engineering, George Mason University, Fairfax, VA, United States
Kelly Lewis  College of Science, George Mason University, Fairfax, VA, United States
Connie L. McNeely  George Mason University, Arlington, VA, United States
Rasika Mohod  Volgenau School of Engineering, George Mason University, Fairfax, VA, United States
Patrick O'Neil  College of Science, George Mason University, Fairfax, VA, United States; BlackSky Inc., Herndon, VA, United States
Chau Pham  College of Science, George Mason University, Fairfax, VA, United States
Diego Torrejon  College of Science, George Mason University, Fairfax, VA, United States; BlackSky Inc., Herndon, VA, United States
Ruixin Yang  Geography and GeoInformation Science, College of Science, George Mason University, Fairfax, VA, United States
Karen Yuan  College of Science, George Mason University, Fairfax, VA, United States

A note from the editors

If you consume or create data, if you are a citizen of the data republic (willingly or grudgingly), and if you are interested in making a decision or finding the truth through data-driven analysis, this book is for you. A group of experts, academics, data science researchers, and industry practitioners gathered to write this book about data democracy. Multiple books have been published in the areas of data science, open data, artificial intelligence, machine learning, and knowledge engineering. This book, however, is at the nexus of these topics. We invite you to explore it and join us in our efforts to advance a major cause that we ought to debate.

The chapters of this book provide a manifesto for data democracy. After reading this book, you are informed and suitably warned! You are already part of the data republic, and you (and all of us) need to ensure that our data fall into the right hands. Everything you click, buy, swipe, try, sell, drive, or fly is a data point. But who owns, or should own, those data? At this point, not you! You do not even have access to most of them. The next great empire of our planet will be the one that owns and controls the world's best datasets. This book presents the data republic (in Section 1), introduces methods for democratizing data (in Section 2), provides examples of the benefits of open data (for healthcare, earth science, and psychology), and describes the path forward. Data democracy is an inevitable pursuit; let us begin now.

Feras A. Batarseh
Assistant Teaching Professor, Graduate School of Arts & Sciences, Data Analytics Program, Georgetown University, Washington, D.C., United States & Research Assistant Professor, College of Science, George Mason University, Fairfax, VA, United States

Ruixin Yang
Geography and GeoInformation Science, College of Science, George Mason University, Fairfax, VA, United States

2019

Foreword

The data crisis is an operational and ethical litmus test which the monolithic technology giants have badly failed. The corporations whose profit models depend on data (Facebook, Google, Amazon, and others) have proven inept at safeguarding consumers' personal data and have so outraged the public by sharing and selling personal information that politicians can credibly advocate that they should be torn apart like Ma Bell, a tech monopoly from an earlier era. At the same time, most of the artificial intelligence (AI)-centered digital corporations that dominate the tech industry regard the data they collect as protected intellectual property. They go to great pains to collect the data and consider their aggregation a more than fair trade for "free" services such as Internet search, social networking, and shopping. They will neither acknowledge that consumers have a legitimate claim to their own data nor give up their proprietary stake and make the data "open" and available to everyone; but they have proven again and again that they are not able, or fit, to manage it.

The short-term penalty for their data mishandling is lost customers. Generation Z-ers, the first truly native digital citizens, are abandoning social media; 34% say they will leave it entirely, while 64% say they are taking a break. Privacy concerns are high on their list of reasons why [1]. The longer-term penalties are much worse and threaten us all; if the tech giants cannot ethically manage or control services using machine learning, how will they safeguard the world's most sensitive technology as machine intelligence ineluctably grows? And how will they do so when we share the planet with computers that can outsmart us all?

How did we get to this dangerous state of affairs? To understand, we have to go back a decade, when big data was all the rage. Then, companies with a lot of transactional data could analyze it using data-mining tools and extract useful information like inefficiencies and fraud. Wall Street used big data algorithms to seek out investment opportunities and make trading decisions. A big data technique called affinity analysis let companies discover relationships among consumers and products and offer suggestions for movies, shoes, and other goods. Big data still serves up these enterprise-friendly insights. But around 2009, something really big happened to big data. Three scientists (Hinton, LeCun, and Bengio, all of whom would later win the prestigious A. M. Turing Award) revealed that training learning algorithms on big data yields predictive abilities that exceed hand-coded programs [2]. Soon this technique, called deep learning, fueled amazing breakthroughs in speech recognition, computer vision, and self-driving cars. Corporations everywhere caught on. Since 2009, thanks in large part to deep learning, investment in AI has doubled each year, and now stands at about $30 billion. AI implementation in enterprise grew 270% over the last four years, mostly, again, thanks to deep learning applications. By 2030, AI will add an estimated $15 trillion to the global GDP [2]. Just as electricity powered the 20th century, this century's economic opportunities are driven by AI. To get the latest AI applications to work, high-quality datasets are mandatory. While hackneyed, the aphorism "Data is the new oil" gets truer all the time. As data gain value, their acquisition, ownership, and use grow more controversial. To understand why, we must consider what data are, and where data come from.


Data are discrete pieces of information, such as numbers, words, photographs, measurements, and descriptions. Big data refers to a collection of data so large that it cannot be stored or processed with traditional database or software techniques. For example, Snapchat users share 527,760 photos every minute. Also every minute, 456,000 tweets are sent on Twitter. All these data require hundreds of thousands of terabytes of storage (1 terabyte equals 1024 gigabytes; 1 gigabyte equals 1024 megabytes; 1 megabyte equals 1024 kilobytes). It is estimated that Google, Amazon, Microsoft, and Facebook together store 1.2 million terabytes [3].

Who produces all these data? You! Or rather, your use of the Internet, social media, digital photos, communications like phone calls and texts, and the IoT, or Internet of Things. Your digital activity generates mountains of data, more than half a gigabyte per day for an average user [4]. AI's recent boom can be partly explained by the fact that, for the first time, enough large-scale data are available for high-functioning machine learning systems (the other two drivers of the AI revolution are GPUs and AI-specific processor chips, and key insights, i.e., deep learning).

How do the tech giants profit from data? In two ways. First, companies including Facebook, Amazon, and Google make money by offering their clients curated ad positioning. Based on your digital profile, they target you for their clients' ads. Second, whenever you buy their product or use their service, the tech giants gather data about you and your web activity. These data feed the development of profitable, data-hungry applications and products. Whenever you comment on an Amazon product, tweet on Twitter, or "like" a notice on your tennis league's Facebook page, you are helping mint cash for the world's richest corporations.

Because we, the users, generate the data that are the lifeblood of these companies, we should be paid for its use, right? Don't you own your data? On the Internet, you do not own your data in a traditional sense, the way, say, a photographer owns her photograph. If you publish a photograph on Facebook, it is still yours for personal use, of course. But by electronically signing Facebook's Terms of Use, you give Facebook permission to use your photograph as they see fit and to share your photograph with their business partners and other entities. And there is a lot to share. For each user, on average, Facebook has as much as 400,000 MS Word documents' worth of data. Google has much more, about 3 million MS Word docs per user [5].

Google makes the case that they divorce your identity from your data, and so your privacy is safe with them. Their advertising clients target your "digital identity" with ads, without ever knowing your name or other personal information. In essence, who you are does not matter to Google. What matters are your photos, texts, and browsing history, and they highly prize their access to it. They do not want you to have data rights that will restrict their unimpeded use, and they certainly do not want to make your data open and free to anyone. And despite the tech corporations' promises, they do shamefully little to secure your data or to honor their own Terms of Use. The greatest example so far of how low companies can stoop is a tale of big data, foreign intrigue, and the most important election in a decade: Facebook's Cambridge Analytica scandal.
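As an aside on the storage arithmetic quoted above, the figures are easy to make concrete. The short Python sketch below (illustrative only, not drawn from the foreword's sources) converts the "more than half a gigabyte per day" per-user estimate into yearly totals using the binary units quoted above; the one-million-user population is a purely hypothetical assumption.

# Back-of-the-envelope sketch (illustrative assumptions, not from the foreword's sources).
# Converts the per-user estimate of ~0.5 GB/day into yearly storage totals,
# using the binary units quoted above (1 TB = 1024 GB).

GB_PER_TB = 1024

per_user_gb_per_day = 0.5      # lower bound of "more than half a gigabyte per day"
users = 1_000_000              # hypothetical user population, chosen only for illustration

per_user_gb_per_year = per_user_gb_per_day * 365
total_tb_per_year = per_user_gb_per_year * users / GB_PER_TB

print(f"One user: about {per_user_gb_per_year:.0f} GB per year")
print(f"{users:,} users: about {total_tb_per_year:,.0f} TB per year")

Even at this conservative rate, a modest population of users accumulates hundreds of thousands of terabytes per year, consistent with the scale of the storage figures cited above.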
Briefly, in 2011, due to past failures in keeping user data private, Facebook made an agreement with the Federal Trade Commission (FTC). It required that, among other things, Facebook receive "prior affirmative consent" from users before it shared their data with third parties. This "consent decree" was to last 20 years. Three years later, in 2014, Facebook allowed an app developer to access the personal data of some 87 million users and their Facebook "friends." The developer worked closely with the election consulting firm Cambridge Analytica, which acquired these data and then created an algorithm that could determine personality traits connected to voting behavior for the affected users. In conjunction with a Russian firm, Cambridge Analytica targeted users with ad and news campaigns meant to impact their vote in the United States' 2016 Presidential Election. For breaches of its 2011 agreement with the FTC, Facebook may be fined up to $5 billion [6]. The previous record for fines related to privacy violations belongs to Google; in 2012, it paid $22.5 million to settle FTC charges that it misrepresented privacy assurances to users of Apple's Safari Internet browser. More recently, in early 2019, France fined Google $57 million for failing to tell users how their data were being collected and failing to get users' consent to target them with personalized ads. For Facebook and Google, which in 2018 earned $55 billion and $136 billion, respectively, these fines are little more than slaps on the wrist [7].

The tech giants' own history tells us that modifying their behavior is difficult indeed. Google, now Alphabet, it seems, would prefer to be sued than to change their business practices or protect user privacy. Alphabet employs some 400 lawyers because, among other things, it has been sued in 20 countries for everything from privacy and copyright violations to predatory business practices. In the United States, 38 states sued then-Google when it was discovered that the cars working in its Street View mapping project did more than take pictures. Without permission they hoovered up emails, passwords, and other personal information from computers in houses they passed [8]. Facebook is of course no better. Just weeks after April 2018, when founder Mark Zuckerberg answered questions on Capitol Hill about the Cambridge Analytica scandal and promised to impose harsh new restrictions on third-party use of user data, Facebook shared more user data with at least 50 device manufacturers, including four Chinese companies. The manufacturers were able to access personal data even if the Facebook user denied permission to share their data with third parties [9].

No business entities in world history have possessed wealth comparable to that of the tech giants. The profits they earn put them in a unique category of human enterprise somewhere between corporations and nations. If they were nations, Alphabet and Facebook would rank in the top richest 30% and 41%, respectively. Consequently, they behave with nation-like arrogance, flying above normal corporate constraints of ethics and law, and paying taxes in low-cost tax havens instead of where they make their wealth. As sole providers of their respective services, they act monopolistically. In fact, they are de facto utilities and should be subject to stringent regulations aimed at preserving competition and innovation, or broken up, as the Bell System of telephone companies was in 1984. The tech giants profit from personal data and bulldoze the competition, but their greatest transgression still lies ahead. They are setting up the human race for an AI disaster many have seen coming for years: the intelligence explosion. The formula for the intelligence explosion was laid out in 1965 by English statistician I.J. Good. He wrote

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control [10].


I like to put Good's theorem in a contemporary context. We have already created machines that are better than humans at chess, go, Jeopardy!, and many tasks such as navigation, search, theorem proving, and more. Scientists are rapidly developing resources that fuel AI design, including AI-specific processors, large datasets, and key insights such as, but not limited to, deep learning and evolutionary algorithms. Eventually, scientists will create machines that are better at AI research and development than humans are. At that point, the machines will be able to improve their own capabilities very quickly. These machines will match human-level intelligence, then become superintelligent (smarter in a rational, mathematical sense than any human) in a matter of days or weeks, in a recursive loop of self-improvement [11].

Many experts who consider the future of AI, including myself, have argued that the intelligence explosion is not merely possible, but probable, and will occur in this century [12]. A great number of factors have gone into that conclusion, including the durability of Moore's Law, potential defeaters of AI development, the limitations of existing AI techniques, and much more. It seems inescapable that, barring a cataclysmic disaster or war, scientists will create the basic ingredient of the intelligence explosion, a smarter-than-human machine, in the normal course of developing AI. Its cost will limit the competitors for this dangerous distinction. OpenAI, a nonprofit founded to create beneficial general intelligence free of market pressures, recently revealed their best estimate of the price of this endeavor, and how long it will take: at least $2 billion, and more than 10 years [13].

The intelligence explosion will be the most sensitive event in human history for the simple reason that we have no experience with machines that can outwit us; we cannot be sure their development would not be disastrous. Computer scientists and philosophers refer to this with the masterfully understated term "the control problem." I think there is ample evidence to conclude the tech giants are not fit to guide the development of superintelligence to a safe and beneficial conclusion. And I am not the only one.

The British company DeepMind, creator of the go-dominating programs AlphaGo and AlphaGo Zero, is at the forefront of the intelligence race. Its CEO and cofounder Demis Hassabis expects that superintelligent machines will be created, but to create them safely will require the companies involved to be "open and transparent" about their AI research. The leader in the AI race will have to "slow down ... at the end" to make sure safeguards are in place to prevent a harmful superintelligent AI from taking shape. Based on prior performance, is it conceivable that Alphabet, Facebook, or Amazon will voluntarily slow down as they reach a goal they have spent billions pursuing? If it means watching competitors catch up and perhaps speed past their intelligence benchmarks? Plainly, no. DeepMind CEO Hassabis said, "This is what I worry about quite a lot because it seems like that co-ordination problem is quite difficult." There are many open questions about the ownership and use of the river of data that flows from our personal devices to enrich the corporate giants of technology.
To follow the status quo only allows them to grow wealthier and thereby more removed from the normal ethical checks that govern companies that do not have the funds to sway policy with legions of lobbyists and cannot afford to be sued all the time. For the benefit of mankind, the tech giants must be strenuously regulated in their development of AI or broken apart as other monopolies have been. To do otherwise is to put our fate in the hands of some of the most morally suspect companies ever to exist.

James Barrat
Filmmaker, speaker, and author of "Our Final Invention: Artificial Intelligence and the End of the Human Era"


References
[1] E. Sweeney, Study: 34% of Gen Zers are leaving social media, Marketing Dive (March 12, 2018).
[2] F. Holmes, AI will add $15 trillion to the world economy by 2030, Forbes (February 25, 2019).
[3] B. Marr, How much data do we create every day? The mind-blowing stats everyone should read, Forbes (March 11, 2019).
[4] I. Ahmad, How much data is generated every minute? [Infographic], Social Media Today (June 15, 2018).
[5] M. Travizano, The tech giants get rich using your data. What do you get in return? Entrepreneur (September 28, 2018).
[6] E. Graham-Harrison, C. Cadwalladr, Revealed: 50 million Facebook profiles harvested for Cambridge Analytica in major data breach, The Guardian (March 17, 2018).
[7] Facebook: annual revenue 2018 | Statistic, Statista.
[8] D. Streitfeld, Google concedes that drive-by prying violated privacy, The New York Times (March 12, 2013).
[9] C. Cadwalladr, E. Graham-Harrison, Zuckerberg set up fraudulent scheme to weaponise data, court case alleges, The Guardian, Guardian News and Media (May 24, 2018).
[10] I.J. Good, Speculations Concerning the First Ultraintelligent Machine, Semantic Scholar, January 1, 1965. www.semanticscholar.org.
[11] J. Barrat, Our Final Invention: Artificial Intelligence and the End of the Human Era, Thomas Dunne Books (2015).
[12] Machine Intelligence Research Institute, Intelligence Explosion FAQ, www.intelligence.org.
[13] T. Simonite, OpenAI wants to make ultra-powerful AI. But not in a bad way, Wired (May 1, 2019).

Preface

Most everyone is searching for answers, and for their own version of the truth. Throughout history, Nietzsche, Heisenberg, and Gödel declared that truth is ever changing, volatile, and potentially nonexistent. Others (such as Socrates and Marie Curie) lost their lives upholding their version of the truth. Truth, however, is unsympathetic; once it is realized, humans change, nations rise, and empires fall. To undergo a transition in the definition of truth, ideologies such as democracy replace bloodshed and conflict with a ballot, or a statistic. Statistics (and math) are often referred to as exact sciences. When taken at face value, they are often treated as factual. Facts, nonetheless, are an ongoing pursuit; we establish opinions, experiences, and reasons throughout the process. Inputs to general human reasoning include patterns in nature, structures of thought, trends in our thinking process, and wisdom from experiences. We continually look for those reasoning constructs to create or confirm our understanding of the world and its merit. In this day and age though, data drive change, and computational intelligence is king. Truth in a digital, data-driven world has a different face.

On a societal or national level, such an undertaking needs to be part of a systematic skeleton for a republic, a data republic. In the republic, every decision is data-driven; every fact is validated and evaluated through advanced means of statistical analysis, many of which are presented in this book. That republic, nonetheless, would and should have the perfect democracy, a democracy driven by data, and perfected to a level that Plato could not predict in his own "Republic": justice will be a low-hanging fruit, equality will be a given, and prosperity will be thriving faster than ever before. Liberties are saved in the data republic. However, do we deserve or even want such an "exact" system of governance? Comprehensive freedoms without responsibility could act as a major feeder to a digital prison that we might be building for future generations. It is important to note that openness in data/software/science is positively correlated with the notion of accessibility and the technological freedoms of citizens in a data republic. Data democracy is the antidote to the exacerbating and continuous loss of our online liberties. Proposed resolutions for a more perfect data democracy are presented in this book (open data and software, information overload avoidance, artificial intelligence (AI) ethics, and data imbalance solutions).

Clinical psychologist Jordan Peterson wrote in Maps of Meaning: "The natural, pre-experimental, or mythical mind is in fact primarily concerned with meaning." Consequently, data science promises pointers to finding meaning, in a descriptive, predictive, or prescriptive manner. However, to find the "truth" about any given topic, it is important not only to look for the right answer but also to precisely pose the right question. We have been wrong about many major and serious questions throughout our existence; finding answers to critical questions is frequently challenged by multiple blind spots that deny us a clear vision. Systems with AI can mitigate such challenges; however, they need to be thoroughly validated and verified. Aligning the values of the executor of an AI system and the AI system itself, for instance, is a very difficult task. That notion can certainly be a philosophical undertaking. For AI, errors stem from either the algorithm or the data; example causes include choosing the wrong algorithm, getting low-quality outputs from the models, and data blunders such as biases, incompleteness, and statistical outliers. This book aims to tackle these issues and present solutions through data democracy.

Data democracy provides no guarantees that everyone gets to participate; at this point, suffrage is far from being global. Openness offers itself with high accountability; merit-based systems (such as Open Source Software) are often challenged by a lack of structure and organization. Structures eventually evolve into hierarchies; for example, in a data democracy, owners of structured data will have power over users who need those data as input to their decisions. Creators of insights are also high in the hierarchical order of a data republic. Therefore, data democracy does not guarantee equality of outcome; rather, it creates a novel playing field that ensures equality of opportunity and accessibility to all its members. The data republic, albeit virtual, could eventually transform into a physical national representation of a country. Politically, data would replace experience and wisdom in policy making. Economically, data will replace traditional metrics such as gravity models and the law of supply and demand. Virtual bits could be replicated, shared, and sold infinitely without being affected by scarcity of demand or a lack of supply. Elections, political struggles, warfare, technological advancements, and many other societal aspects (healthcare, energy, education) in such a democracy will be driven by processing data, producing knowledge, and creating intelligence. This book aims to serve as a manifesto for data democracy and a constitution for the inevitable data-driven world that we all live in. Long live the data republic!

Feras A. Batarseh Assistant Teaching Professor, Graduate School of Arts & Sciences, Data Analytics Program, Georgetown University, Washington, D.C., United States & Research Assistant Professor, College of Science George Mason University, Fairfax, VA, United States 2019

1

Data democracy for you and me (bias, truth, and context)
Feras A. Batarseh 1, Ruixin Yang 2
1 GRADUATE SCHOOL OF ARTS & SCIENCES, DATA ANALYTICS PROGRAM, GEORGETOWN UNIVERSITY, WASHINGTON, D.C., UNITED STATES; 2 GEOGRAPHY AND GEOINFORMATION SCIENCE, COLLEGE OF SCIENCE, GEORGE MASON UNIVERSITY, FAIRFAX, VA, UNITED STATES

There are no facts, only interpretations.
Nietzsche

Abstract
The manner in which data are analyzed, modified, or presented can change how we perceive the world. Though we cannot ask all data scientists to provide the absolute truth in every single analysis they undertake, we can put a flag on the planet of data science and declare that data democracy is an obligatory pursuit. Data democracy is neither about data nor about democracy. It is about you and me. In this book, we claim that data democracy is about finding the truth, detecting inherited bias, and realizing context. The three pillars of data democracy are bias, truth, and context, discussed in the sections of this chapter. This book aims to explore them, exploit them, challenge them, philosophize them, and entirely destroy them. Let us begin!

Keywords: Context; Data bias; Data democracy; Global race.

1. What is data democracy?

It has been a fairly long day here at the agency, George thinks to himself as he walks down the stairs of the Smithsonian Metro station in Washington, D.C. George has been working for the US Government for the last 30 years. Every year, he collects and analyzes data on US trade (imports and exports). One can imagine that, after all those years, George (a PhD holder) has become quite the expert in US trade history, trends, and patterns. George, however, is a perfectionist; he does not like to share his raw data before they are cleaned, wrangled, normalized, analyzed, and visualized. Visualizations developed by George are presented to top management at his agency for decision-making on trade deals, tariffs, and other international relations issues. At the end of the day, George has been one of the main experts on the topic, and his insights are trusted. Oh, but wait, George is retiring next month! His visualizations, predictions, statistical models, datasets, sheets, and formulas need to be updated constantly after he leaves; more importantly, the knowledge that he acquired over the last 30 years needs to be extracted and represented in a software system for the next "George" to keep the process alive and provide timely insights to decision makers.

George, in this case, is a data tyrant: he owns lots of data, he owns the process, much of his knowledge is in his head or on his computer, and his knowledge cannot be seamlessly transferred to someone else. What if George is a company (George, Inc.)? One company that owns the knowledge and the keys to a dominant technology and one that is entirely biased toward its own goals. Moreover, what if George is a country? A country that owns the only weapon of mass destruction or the sole nuclear bomb? That would force other countries to compete and negotiate for their safety. These tyrannical entities are what data democracy aims to eliminate. If George had shared his data and process with his colleagues throughout the last 30 years, he would not have had the ability to be a data tyrant; rather, his data, processes, knowledge, and insights would have been democratized. That process is how data democracy is defined within the context of this book. Today, data democracy needs to be applied to data collection, analysis, mining, processing, and visualization around the world (to bring about positive social change). Otherwise, we are at risk of creating data tyrants, in this case an individual, a group of individuals, a company, or a country (one that is not friendly to yours), something we all need to agree to avoid through data democracy; this book will help us maneuver such a pursuit.

2. Incompleteness and winning an election

Kurt Gödel developed an appetite (willingly or not) for baby food and laxatives. Paranoid scientists (such as Gödel) are hyperattentive to details around them; they tend to laser focus on their context. Gödel's incompleteness theorem was the first of a wave of theorems that challenged formal systems; call it a school of thought or an antischool (given its rebelliousness). Other examples include Turing's claims on the halting problem, Einstein's theory of relativity, and Heisenberg's uncertainty principle; such theories rank within a similar scientific folklore and enjoy high popularity. The importance of Gödel's theorem, however, is that it is inherently applicable to other fields. It could be applied to math, physics, chemistry, computer science, and other areas. Even in pop culture, Gödel's rebelliousness proved attractive. For example, the Irish band U2 announced that "everything you know is wrong" on their international music tour (not to claim that they represented the theorem accurately). The theorem nonetheless established, through mathematical proof, that mathematics cannot prove all of mathematics.

In the second chapter of the New York Times bestseller Nudge, the authors list a number of biases and blunders, such as the status quo bias, anchoring bias, availability, framing, and representativeness. They claim that such biases can "nudge" us in different directions and lead to majorly different results. They provide multiple examples in their book; let us discuss one here: think of gerrymandering. Is that a word? Yes, and a very important one in US elections. Laws and legislation can be good for one group of people and bad for others; some laws, however, are bad for everybody. Gerrymandering is an example of that, and of an outcome that is due to a bad political process or a law that was not well thought through. The term "gerrymandering" was coined in 1812, following the creation of a legislative district resembling a mythological salamander under Massachusetts Governor Elbridge Gerry. The term has a negative connotation; it implies that when geographical lines are drawn between districts, they are defined by the potential votes of each area so as to tilt the results of elections. As Fig. 1.1 shows, the population of a district or a state gets sliced up to introduce a bias into the results of an election. By observing the three cases in the figure, it is obvious that how the areas are split up affects the outcome of the elections.

FIGURE 1.1 Gerrymandering bias. (A population of 60% purple wards and 40% gray wards can be districted into 3 purple districts and 0 gray districts, or into 1 purple district and 2 gray districts.)

This is merely an example of how certain facts can be challenged based on how information is presented. Bias can occur at any level, including the presentation level, where information is introduced in a manner that completely masks the truth.
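To make the districting effect in Fig. 1.1 concrete, here is a minimal, hypothetical Python sketch (not taken from the chapter): the same 60/40 split of wards yields opposite district outcomes depending purely on how the wards are grouped. The ward counts and the two districting plans are illustrative assumptions chosen to mirror the figure.

# Illustrative sketch (hypothetical wards and plans, not from the chapter): the same
# 60/40 ward split can yield different district outcomes depending on how wards are
# grouped into districts, mirroring the cases of Fig. 1.1.

from collections import Counter

# 15 wards: 9 purple (60%) and 6 gray (40%); each district groups 5 wards.
wards = ["purple"] * 9 + ["gray"] * 6

def district_winners(districting):
    """Count how many districts each party wins under a given ward grouping."""
    winners = []
    for district in districting:
        votes = Counter(wards[i] for i in district)
        winners.append(votes.most_common(1)[0][0])
    return Counter(winners)

# Plan A: purple wards spread across all three districts -> purple wins 3-0.
plan_a = [[0, 1, 2, 9, 10], [3, 4, 5, 11, 12], [6, 7, 8, 13, 14]]

# Plan B: purple wards packed together -> gray wins 2-1.
plan_b = [[0, 1, 2, 3, 4], [5, 6, 9, 10, 11], [7, 8, 12, 13, 14]]

print("Plan A:", dict(district_winners(plan_a)))  # {'purple': 3}
print("Plan B:", dict(district_winners(plan_b)))  # {'purple': 1, 'gray': 2}

Running the sketch prints a 3-0 purple result for the first plan and a 2-1 gray result for the second, even though the underlying votes never change; only where the boundaries are drawn differs.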

3. The story and the alternative story

Data democracy affects multiple areas of our lives; however, it directly influences pursuits of artificial intelligence (AI). This section introduces the overlap between the two concepts/fields of study.

Path #1 (the story): This is a story of our future with AI, fast forwarding toward alpha humanoid species that have already won the race between man and machine, passing by all the concerns that we now struggle with (such as data democracy) and passing by human intelligence (because who said AI development will stop once we achieve human-level intelligence?). The world is dominated by machines, while human knowledge is doubling every millisecond. It sounds very techy and cool, but we do not know whether that is a good path until we compare it with another.

Path #2 (the alternative story): Let us roll back the clock and imagine a world without AI. When disease spreads through our planet, and viruses are multiplying and mutating, physicians are not able to diagnose or learn about these viruses quickly enough. The amount of information available to experts in the medical field is so extensive that, without the help of machines, it is impossible for a human doctor to catch up.

Both paths have pros and cons (and many other paths can exist); choosing between them is a topic that has been covered by many authors in multiple famous books on AI, including (to name a few):

1. Our Final Invention [1]
2. Superintelligence [2]
3. How to Create a Mind [3]
4. Life 3.0 [4]

Within AI, however, for the purposes of this book, we discover that the verb “democratize” has been used in different scopes and situations to mean multiple things. Primarily though, when it is applied to data as used to fuel intelligence, data democracy means to make data available, reduce the cost, popularize data, disseminate data, and even crowdsource data, depending on the goals of the groups that seek to popularize these competing meanings, or in other words, depending on the context.


FIGURE 1.2 The American artificial intelligence initiative.

4. Nothing else matters

In 2019, President Trump signed the AI initiative [5]. A website (https://www.whitehouse.gov/ai/) was launched to encapsulate values that need to be realized within the AI promise in the United States. This initiative is long overdue; the United States had to acknowledge the threat that is developing from the orient, a major competitor that might already be too big to beat: Chinese AI [6]. A snapshot from the initiative's website is shown in Fig. 1.2; note number 05, AI with American Values. The initiative summarized AI with values in four different aspects: Understandable and Trustworthy AI, Robust and Safe AI, Workforce Impact, and International Leadership. To any AI expert, these four aspects are overly simplified. AI with values (American or any other, for that matter) needs to be discussed on a global scale. It is true that the United States needs to take the lead on that discussion; however, as that leadership slips toward China, technologists in the west continue to worry, and that worry is very justified.

To build a new data republic, and for it to succeed, it needs to be democratized; that is the topic of the next chapter. This section of the book introduces data democracy, its importance, challenges, and drawbacks. The second section of the book is concerned with societal examples that would be affected by different dimensions of data democracy (such as open data) and with life within the data republic (how healthcare and other areas of society would be affected). If cleaned and structured datasets drive intelligence, if they are the next great global commodity, if wars of the future will be about data, and if the world economy will be reshaped through data, then data democracy is the discussion that we need to be having, and nothing else; everything else is noise [7].


References [1] J. Barrat, Our Final Invention: Artificial Intelligence and the End of the Human Era, Thomas Dunne Books, 2015. [2] N. Bostrom, Superintelligence: Paths, Dangers, Strategies, Oxford University Press, 2016. [3] R. Kurzweil, How to Create a Mind: The Secret of Human Thought Revealed, Penguin Books, 2013. [4] M. Tegmark, Life 3.0, Vintage, 2018. [5] The White House, Office of Science and Technology Policy, “The AI Initiative: Accelerating America’s Leadership in Artificial Intelligence”, 2019. [6] K. Lee, AI Superpowers: China, Silicon Valley, and the New World Order, Houghton Mifflin Harcourt, 2018. [7] F. Batarseh, R. Yang, “Federal Data Science: Transforming Government and Agricultural Policy using Artificial Intelligence”, Academic Press, Elsevier, 2017. ISBN: 9780128124437.

2

Data citizens: rights and responsibilities in a data republic
Jay Gendron 1, Dan Killian 2
1 UNITED SERVICES AUTOMOBILE ASSOCIATION (USAA), CHESAPEAKE, VA, UNITED STATES; 2 MASSACHUSETTS INSTITUTE OF TECHNOLOGY OPERATIONS RESEARCH CENTER, CAMBRIDGE, MA, UNITED STATES

What's past is prologue.
William Shakespeare

Abstract
Using William Shakespeare's quote as a theme, the premise of this chapter is that data democratization will emerge from the same forces witnessed in two other historical transformations: (a) the new form of American governance after monarchical rule and (b) globalization resulting from information democratization. The question remains how our industrial and information economies will prepare for this transformation in data. One dimension of the solution lies in the formation of the data citizen. Democracy at once stirs up a notion of majority rule, but history reveals that the yearning for freedom, a pure democracy, is typically tempered with a desire for representation. For instance, although the Internet has given a voice to nearly anyone, the information consumer still looks to representative forms of journalism and information sources such as The Washington Post and Apple News to maintain "good order and discipline" in a global information cacophony. "How does an industrial economy prepare itself for the coming data democratization?" If past is prologue, it is worthwhile reflecting on how the American colonists learned to function as a democracy having lived under monarchy. By gaining their independence, they also accepted a responsibility in the governance of their country, no longer able simply to be governed by the ruling class. They transformed from subjects to citizens. A data citizen is an extension of this premise. As more businesses and management programs adopt concepts like data dexterity and data literacy, they highlight the growing need for data science management and data-adept leaders. A data democracy will not stand for a "ruling class." Using a framework of characteristics drawn from the past, this chapter develops the rights and responsibilities of a data citizen.

Keywords: Data citizen; Data rights; Data use; Machine learning; Technology forecasting.



1. Introduction

As global interconnectedness continues to grow, more and more people are finding themselves living in a data republic. In this republic, data are generated, collected, stored, and analyzed. These activities occur at increasing scales and rates due in part to the citizens' persistent contributions, knowingly or unknowingly, to this growing cache of data. This does not imply a dystopian view. Rather, in this data republic, the citizens, the data citizens, have both rights and responsibilities to shape data usage in the ways they most value. This chapter sets a tone for the remainder of this book by presenting the evolution and growth of the data republic and providing a framework and guiding principles for harmonious progress in continued data usage.

This chapter is structured as follows: Section 2 presents a paradigm for considering the evolutionary relationship between data and technology that serves as a recurring cycle extensible into the future. Section 3 details the interplay between data and technology from 1890 to the present by applying the data–technology cycle from the previous section. Section 4 considers how this historic evolution may provide insight into future data democratization and presents the Datamocracy Framework along with its guiding principles. Section 5 presents pragmatic actions available to readers for promoting good data citizenry. Conclusions appear in Section 6.

2. A paradigm for discussing the cyclical nature of dataetechnology evolution Technology has become increasingly iconic and pop culture in nature. We live in a world where, like shooting stars in a meteor shower, highly anticipated computers, phones, and mobile apps burst onto the public scene seemingly out of nowhere. But these technologies are not apparitions. Rather, significant effort over many years drives the seamless introduction of new and superior technology to consumer markets. Consider the AlphaGo Zero algorithm that surpassed previous successes in playing the complex game of Go. The news erupted onto the public conscience, with one particular headline stating, “Google says its AlphaGo Zero artificial intelligence program has triumphed at chess against worldleading specialist software within hours of teaching itself the game from scratch” [1]. What was not discussed in this article, as is often the case for such media coverage, were the 20 years of research and development

Chapter 2  Data citizens: rights and responsibilities in a data republic

11

required to bring this high-level machine learning capability into existence. Today’s technology, which seems so novel and unexpected, is really the result of years of development. It is the story of metered technological evolution under the watchful eyes of visionaries having a sense of technology forecasting. Technology forecasting is as much art as it is science. Recognized methods and techniques are available for conducting quantitative forecasts, such as the seminal work of Porter [2]. There are also industry analyses providing current and dynamic insights on trends, such as the Gartner Magic Quadrants [3]. Although the treatment of technology forecasting is beyond the scope and needs of this chapter, readers will benefit from a visualization of the data–technology cycle of evolution. It is an incomplete treatment of technique, but it does have value in illustrating what may be found in the literature on technology forecasting. Simply stated, history shows data are a desirable entity, and an increase in the amount of available data often leads to increasing demands for technology to process the new data. People desire information to support their decisions. This desire grows as more data are known to exist, which in turn drives technological developments to capture those data. Increased technological capabilities then lead to an increased desire for even more data. A data–technology cycle is created where having more data leads to the development of more technology, which enables more data, which requires more technology, and so forth. An example of this data–technology cycle lies in the evolution of the US Census. The United States conducted its first official census in 1790. From 1790 through 1840, census enumerators collected population information by household rather than by individual. There were no requirements for information beyond the age, gender, and race of the members of a household. Collecting data on the basis of households allowed for the manual collection and calculation of aggregate statistics. However, for the 1850 census, US Government officials decided that aggregating information by household was no longer acceptable [4]. This decision to collect additional data for the 1850 census equates to an iteration in the data–technology cycle: a requirement for data (in this case, data about the US population) led to the development of new tools (data collection sheets which enumerators could complete by hand) which, in turn, led to a requirement for additional data. The US Census has experienced several additional iterations through the data–technology cycle over its history, and the cycle continues even today. For example, in 1850, US Census enumerators began collecting data on individuals, leading to the development of complicated “tally sheets” to assist census officials with the calculation of statistics. In 1870, the data
requirement grew again, motivating the development of new technology: the Seaton Device. In the late 1940s, the Census Bureau commissioned the first electronic computer designed for civilian use, the UNIVAC I, to process results from the 1950 census [4]. The history of the US Census demonstrates how the data–technology cycle reconciles the tensions between data needs and technology improvements. Energizing this cycle are three virtues of the human spirit: vision, progress, and harmony. These set the broad context for a paradigm describing periods in human history where the collective behavior of groups and individuals espouses these virtues as they intersect and fortify one another. These coinciding cycles of vision, progress, and harmony fuel the human behaviors continuously birthing novel technologies to satisfy human needs. Fig. 2.1 depicts this paradigm of cyclical data–technology evolution. Three phases which seem to repeat throughout history are vision, progress, and harmony, depicted in the symbolic colors of black, red, and white, respectively. Throughout history, there have been periods where a vision emerges from darkness (black), giving way to a period of change and burning progress (red), which in turn gives way to a period of acceptance, integration, and harmony (white). These black–red–white cycles are triggered by some longing, a question or need. The phases within each cycle have a duration tied to the context of the situation and are unique in each cycle. This data–technology cycle is used to represent the essence of scientific discovery throughout history. The next section delves deeper into the context of the black, red, and white periods as they relate to data science.

FIGURE 2.1 Data–technology cycle showing three phases in technology evolution: a macro-phase composed of micro-phases of vision, progress, and harmony over time.


3. Use cases explaining the black–red–white paradigm of data–technology evolution

Future states, regardless of the discipline, are often filled with marvel and surprise. Yet, they typically are traceable back to a recurring cycle of vision, progress, and harmony.

3.1 “Thank you visionaries” (black): 100 years of data science (1890–1990)

Data, or more simply, recorded information, have been around since the dawn of man. Prehistoric cave paintings “stored” information about significant hunts. Information stored in our DNA as natural instinct (such as the fight-or-flight response) encodes natural human responses to keep us safe from harm. These represent some of the oldest data in existence. But this discussion of the data republic begins in more contemporary times, in the year 1890, when the United States Census Bureau electronically processed the census for the first time. The use of the Hollerith Machine and its punch card technology to electronically process the results of the 1890 census revolutionized data processing [4]. Suddenly, a large amount of data could be stored and analyzed at a speed not experienced before. Some 50 years later, data processing took another large leap forward when the first general-purpose electronic computer, the Electronic Numerical Integrator and Computer (ENIAC), was developed in response to the code breaking and target tracking needs of allied forces during World War II [5]. Commercial interests in the latter part of the 20th century, along with the Cold War and research and development funding by the US Government, continued to accelerate the development of high-performance computing systems with increased storage capacity [6]. Concurrent with the rise of ever-advancing computing and processing technology, a new era of mathematicians, economists, statisticians, and scientists began to emerge in industry. These professionals were drawn by the need to analyze data at a scale previously unknown. Innovators such as Sir R. A. Fisher, W. A. Shewhart, W. Edwards Deming, and John Tukey all made significant contributions over this period [7], further advancing the state of practice in analyzing and acting on large volumes of data. These early visionaries pioneered the ability to use data in larger quantities than ever before. Their invention of new tools and techniques significantly advanced our understanding of data processing and laid the groundwork for the data science accomplishments of the early 21st century.

3.2 “Progress to profit” (red): big data and open data in today’s information economy

Data and technology visionaries play a critical role in the data science evolution. These visionaries found their way through the dark times represented in the data–technology cycle, inventing the theoretical statistics and sparking the light guiding humanity out of the data dark ages. As soon as this spark grew into a flame, early adopters and venture capitalists were drawn like moths to this light. Moore [8] notes the role of early adopters who are in a position to boost the trajectory of technology. It is these individuals who have the funding and the influence to grow a flame into a raging fire. This section discusses the early adopters who helped fuel the rise of data, growing a small flame in the “data darkness” into fires ready to consume the world. For purposes of a working definition, the year 2000 is selected as the hinge on which this story is told. Not only does 2000 mark the turn of the century, but the early 2000s coincide with several significant data-related events. Around this time, the Internet took root in modern culture, giving rise to a new crop of e-commerce companies. These companies quickly gained an appreciation of the vast amount of data available to them and sought opportunities to understand these data for financial gain. At the same time, other independent technological advancements were taking place. Significant progress in computing technology and the evolution of virtual computing coincided with the rise of these new tech companies. Material in Ref. [6] provides a more complete analysis of these major events. Government also played a significant role in the proliferation of data. For example, in 2009, the Obama Administration launched the Open Government Initiative [9]. Under this initiative, open and machine-readable data became the new standard for government data, with the intent of making information about government operations more useful and accessible. Over time, this initiative has led to the release of stores of valuable data and made these resources more open and accessible to innovators and the public [10]. Additionally, the Obama Administration appointed the first Chief Technology Officer, the first Chief Performance Officer, and the first Chief Data Scientist of the United States. These positions were created to “promote technological innovation to help achieve our most urgent priorities” and “streamline processes, cut costs, and find best practices throughout our government” [11]. These initiatives signify the increasing volume and variety of open data from government and municipalities.


Today, it is apparent that data are here to stay. Society will continue its reliance on data and analytics, more than ever before. This turns up the data fire and the heat associated with unparalleled progress.

3.3 “The past provides a lens for the future” (white): looking backward to see forward

If the past is prologue, then to understand the harmony phase of the data–technology cycle, we can look back on historic revolutions. One example worth reflecting on is the revolution for American independence and, specifically, how the early British colonists in North America learned to function as a democracy after having lived under monarchy for centuries. By gaining their independence, they also accepted a responsibility in the governance of their country, no longer able to simply be governed by the ruling class. They transformed from subjects to citizens. How did those colonists learn to shift from subject to citizen? It is doubtful that there were workshops or outreach groups to guide their paths. No matter the process, the result is clear. Those beginnings formed the enduring foundation of the present US Federal Government. The data revolution is an extension of this premise. Much like the revolutions which felled the monarchies of 18th century Europe, data use will not stand for a ruling class. History reveals that the yearning for freedom, a pure democracy, is typically tempered with a desire for capable representation. The notion of a republic at once stirs up majority rule and a desire for elected representation to determine courses of action in a democratic republic. Hence, one may expect the transformation of data subjects into data citizens. Businesses have found they cannot subject consumers to existing data usage approaches. Data citizens are emerging and are looking for accountability. Yes, the data citizens wish their data be used to help them, but not exploit them. For businesses and managers, the competitive advantage will come in conjunction with increased data literacy, highlighting a growing need for data science management and data-adept leaders. A second revolution worth reflecting on is presented in Thomas Friedman’s [12] book The World Is Flat and regards the impact of open information on a globalized world economy. For centuries, publishers had full control and served as gatekeepers of the flow of information. The Internet brought with it the ability for more people to share their information. Although the Internet has given a voice to nearly anyone, the information consumer still looks to representative forms of journalism
and information sources such as The Washington Post and Apple News to curate the stream of information and maintain “good order and discipline” in a global information cacophony. Summarizing the major themes in these two historic transformations sets the stage for the ideas of data democratization:

- Freedom lay at the heart of both revolutions: a freedom from monarchy and a freedom from gatekept information.
- Formation resulted from thoughtful consideration of the policy-making processes of the underlying systems, some new world order.
- Flexibility was another design parameter in both revolutions, the degree of which suited the architects of the new governing systems.

These elements are shaped into a theoretical framework for additional analysis.

4. Preparing for a future data democratization

Vision and progress associated with technological advancement sound great, but one ought not lose sight that passions and energy can go either well or poorly based on how data are used. Stephen Hawking [13] speaks of the future and technology in his last published work: “Our future is a race between the growing power of our technology and the wisdom with which we use it. Let’s make sure that wisdom wins.” Will this be a revolution for the people or of the people? Earlier sections of this chapter suggest data, technology, and society are bound together in a cyclical, evolutionary manner. Smaller cycles of vision (black), progress (red), and harmony (white) flow into one another, creating a larger cycle. Though the cycles themselves will come, readers and citizens can better prepare themselves through a framework providing insights to harmonize the progress we seek. This section looks to the future of data citizenry by presenting a theoretical framework for thought, discussion, and education. It begins by recalling two historical transformations: the formation of a new American governance and the globalization that followed information democratization. It then presents a theoretical framework for placing a future view of data within the context of that past. This section closes with guiding principles to enliven the framework.

4.1 The Datamocracy Framework helps envision the future

Understanding the concept of data democratization is aided by the Datamocracy Framework. Frameworks are useful because they help communicate the underlying structure of a system of thought. This framework provides a context for the two transformative events relative to one another. The theoretical Datamocracy Framework shown in Fig. 2.2 melds the ideas of democratization in the US Federal Government and in information sharing while showing where data democratization (freedom) fits in the spectrum.

FIGURE 2.2 Datamocracy Framework showing the trajectory of data democratization (points A–D) relative to two historical examples: the US Federal Government (high formality, low pliancy) and information sharing (high pliancy, low formality).

The figure compares democratic systems along two dimensions, formality and pliancy. Formality (measured along the vertical axis and increasing from the bottom of the scale to the top) is the degree to which there is formalized adoption and codification of policy (formation). The US Federal Government has a high degree of formality, with national elections and a law-making process. Information sharing is less formal: its system allows nearly anyone to publish, with fewer policies codifying the publishing of materials on the web. Therefore, one sees in the figure that the US Federal Government is much more formalized than the current social values which frame information sharing. Pliancy (measured along the horizontal axis and increasing from the left of the scale to the right) is the degree to which customization and the ability to change the present approach exist within the system (flexibility). The US Federal Government is marked by elected representatives who serve a group of heterogeneous voters in a voting district with fixed terms and term limits.
Much more pliant is the information sharing system, which is marked by individual choice of information source and the ability to change from one information source to another whenever one wishes. Therefore, one sees the US Federal Government is much less pliant than the generally accepted means of information sharing. Data democratization exists in the spectrum between these two transformative, historical examples. The figure depicts the shifts in data democratization since the late 1990s using four points in time on a trajectory to a future state. This trajectory may be summed up in the phrase: confused, mesmerized, and now concerned. Data are moving toward a middle ground in the spectrum, having aspects of both formality and pliancy.

1. Point A represents the early days of the World Wide Web, when consumers experienced the new ideas of e-commerce and of monetizing data through advertising, as with Google search in 1998 [14]. In the confusing early days of this new economy, consumers sought little and gave willingly to the early dot-coms as business set the rules and took the initial advantage in this barter system.
2. Point B represents the period where consumers became mesmerized by the power of data in customizing their experience. The launch of Amazon as a marketplace and the release of the iPhone in 2007 [15] capture this period. Data worked behind the scenes to magically recommend the books and apps consumers wanted. Meanwhile, consumers showed an increasing desire for websites and apps to know their preferences and add convenience, with little thought as to how the magic happens.
3. Point C represents an increase in formality concerning the use of data in consumer settings and understandable policy regarding privacy as it relates to data. The General Data Protection Regulation (GDPR) was released as a draft proposal in 2012 for the purpose of protecting people “with regard to the processing of personal data and on the free movement of such data.” The European Parliament and the Council of the European Union adopted GDPR in 2016 as an outward sign of consumer concern. The Regulation became fully enforceable in 2018.


4. Point D represents a possible location of a future state, accounting for uncertainty within the forecast (gray cone). Consumers who once gave away data so willingly now display increasing scrutiny of how their data are used. News stories from 2019 highlight this shift: AT&T stated it would stop selling consumer location data after a call for a federal investigation [16], and apps from Facebook and Google were found to have purposely collected data on teens [17]. This trajectory toward the system of data democratization supports an argument that there is an emergence of a data citizen consciousness.
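To make the framework more tangible, the following is a minimal plotting sketch in Python (assuming matplotlib is available) of the trajectory just described. The coordinates of the two reference systems and of points A–D are hypothetical placements chosen only to illustrate the formality–pliancy spectrum; they are not measurements from this chapter.

```python
# Illustrative sketch of the Datamocracy Framework (cf. Fig. 2.2).
# All coordinates are hypothetical placements on the pliancy/formality axes,
# chosen only to visualize the A -> B -> C -> D trajectory.
import matplotlib.pyplot as plt

# Reference systems (hypothetical positions)
anchors = {
    "US Federal Government": (0.15, 0.90),   # low pliancy, high formality
    "Information Sharing": (0.90, 0.15),     # high pliancy, low formality
}

# Trajectory of data democratization (hypothetical positions)
trajectory = {
    "A (late 1990s: confused)": (0.55, 0.10),
    "B (2007: mesmerized)": (0.75, 0.20),
    "C (2016-2018: concerned, GDPR)": (0.65, 0.45),
    "D (possible future state)": (0.55, 0.65),
}

fig, ax = plt.subplots(figsize=(6, 6))
for label, (x, y) in anchors.items():
    ax.scatter(x, y, marker="s", s=80, color="black")
    ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, 5))

xs, ys = zip(*trajectory.values())
ax.plot(xs, ys, linestyle="--", color="gray")  # the A -> D trajectory
for label, (x, y) in trajectory.items():
    ax.scatter(x, y, s=60)
    ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, -12))

ax.set_xlabel("Pliancy (increasing to the right)")
ax.set_ylabel("Formality (increasing upward)")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title("Datamocracy Framework (illustrative)")
plt.show()
```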

4.2 Guiding principles within the framework

The data citizen depends on and deserves a balanced, well-run system of power. The Datamocracy Framework presents a theoretical backdrop to make the idea of data democratization a more understandable concept. These guiding principles are presented to enliven the framework. This includes looking at the rights of a data citizen and the actions available to citizens and data science practitioners to better execute their responsibilities in a data republic. Before presenting the three guiding principles, it is useful to think about democracy as established in the three branches of the US Federal Government. They serve as a useful analogy for the guiding principles of data democracies. Table 2.1 shows the three branches along with a role and clause (guiding principle) for each as defined in the US Constitution. The legislative branch creates laws and works under the principle of Necessary and Proper, “to make all Laws which shall be necessary and proper for carrying into Execution” [18 art. I, sec. 8]. In this regard, we will think of data use as the necessary and proper use of data in a data republic that enables data to help and serve citizens, but not cause them harm. The executive branch enforces the laws created by the legislative branch and works under a principle of Faithfully Executed, to “take care that the laws be faithfully executed” [19 art. II, sec. 3, clause 5]. The analogy to this is the role of data science practitioners, who must faithfully execute the use of data through their modeling practice. The judicial branch interprets the laws as created and enforced by the other two branches. It works under the principle of Case or Controversy, which means its authority applies only to actual cases and controversies and not hypothetical ones [20 art. III, sec. 2, clause 1]. The analogy for a data republic is one of data rights: taking care to exert only as much formality as demanded by actual needs for the interpretation of data usage and data science execution.

Table 2.1 US Federal Government system as an analogy for data democratization.

Branch      | Role           | Clause               | Datamocracy
Legislative | Create laws    | Necessary and proper | Data use
Executive   | Enforce laws   | Faithfully executed  | Data science
Judicial    | Interpret laws | Case or controversy  | Data rights

The three analogous data republic elements presented in Table 2.1 form the essential building blocks of the three guiding principles of the Datamocracy Framework. Fig. 2.3 presents those interacting elements in visual form, showing the predominant aspects of the principles outlined below. The three guiding principles form by considering the interactions between each pair of elements in the figure, beginning with Data Use at the top and working around clockwise through Data Science and Data Rights.

FIGURE 2.3 Depicting the guiding principles as interacting elements: Data Use, Data Science, and Data Rights, linked by feature engineering, literacy, and transparency.

4.2.1 Feature engineering should creatively use existing data to enhance models without introducing unintended bias. Ideally, investment in the collection of additional data could further support this principle

It is very important when using data for machine learning models that no unintended bias is encoded in other variables which could discriminate a decision based on gender, race, ethnicity, religion, or economic status. Even if the data contain no personally identifiable information, it is possible the data could contain a discriminating feature. One of the more nefarious sorts of unintended bias occurs quite often when using a human-rendered decision as a target variable.


Practitioners ought not assume they have before them some absolute and pure form of judgment in a “0” and “1” labeled column of historical decision outcomes. They must consider how those human-rendered decisions emerged and whether they may stem from bias. Otherwise, their use in a machine learning system results in a biased predictive model, even though on the surface the data do not appear to have any discriminatory features. Ideally, a data science professional would strive beyond the data at hand if they would like to increase the performance of their model. Whether engineering features which could serve as unbiased targets or collecting additional data, they should avoid the low-hanging fruit that may more easily lead to biased models. Freeing a model from deeply encoded information bias may require additional data and thoughtful use of the sources.
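As a minimal illustration of this principle, the sketch below (assuming pandas is available) audits a human-labeled target for association with a protected attribute before any model is trained. The data, column names (gender, approved), and the 80% screening threshold are hypothetical and illustrative only; a real audit would go much further, for example by measuring the disparate impact of the trained model itself.

```python
# Minimal bias audit sketch (hypothetical data and column names).
# Before using a human-rendered decision ("approved") as a training target,
# check whether it is already associated with a protected attribute ("gender").
import pandas as pd

# Hypothetical historical decisions
df = pd.DataFrame({
    "gender":   ["F", "F", "M", "M", "F", "M", "F", "M", "M", "F"],
    "income":   [52, 61, 58, 49, 70, 45, 66, 80, 39, 55],
    "approved": [0, 1, 1, 1, 1, 1, 0, 1, 0, 0],  # human-labeled outcome
})

# Approval rate by protected group; a large gap suggests the labels
# themselves may encode bias, not just the features.
rates = df.groupby("gender")["approved"].mean()
print(rates)

# A simple "four-fifths"-style screen: flag if the lower rate is less than
# 80% of the higher rate (a common rule of thumb, not a legal test).
ratio = rates.min() / rates.max()
if ratio < 0.8:
    print(f"Warning: approval-rate ratio {ratio:.2f} -- labels may be biased.")
```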

4.2.2 Machine learning practice should protect the typical data citizen and not exploit their data literacy. Ideally, the data science professional could enhance the data citizen literacy of those they focus on or serve

The authors have witnessed in their professional dealings and presentations that many people do not understand machine learning, assuming it is too complex and that its basic concepts are not worth understanding. Ironically, most people have personal experience with machine learning: they conducted simple linear regression with pen and paper in school. Machine learning concepts are teachable with familiar examples. It has been the authors’ experience that likening machine learning to student learning works well. A student is trained that “2 + 2 = 4” and later tested by asking “What does 2 + 2 equal?” If the student’s response differs from the correct answer, a systematic way is used to tune the learning. Replacing the term “student” with “machine” in the previous two sentences is a reasonably satisfactory way of explaining machine learning. This short anecdote has gone a long way in helping the authors raise the data literacy among their clients. It also points out that data science professionals will need to consider whether they are unintentionally exploiting data literacy gaps. Ideally, the data science professional could in fact enhance data science literacy among those they serve. One noteworthy example is the
educational outreach of high-tech organizations like NASA to raise the literacy of citizens about space exploration.
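A minimal sketch of the student-learning analogy, assuming scikit-learn is available: the “machine” is shown examples, asked a question it has not seen, and its error is what drives the tuning. The data are made up for illustration.

```python
# "Student learning" analogy as simple linear regression (illustrative data).
# Train: show the machine examples. Test: ask it a question it has not seen.
# Tune: judge how far its answer is from the correct one and adjust.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training examples: hours studied -> exam score
hours = np.array([[1], [2], [3], [4], [5]])
scores = np.array([52, 58, 65, 71, 78])

model = LinearRegression()
model.fit(hours, scores)               # "teaching" the machine

question = np.array([[6]])             # "What score for 6 hours of study?"
answer = model.predict(question)[0]
print(f"Predicted score for 6 hours: {answer:.1f}")

# If the prediction differs from later observed reality, the systematic
# "tuning" is simply refitting with the new example included.
```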

4.2.3 Data use should set ethical precedence in this revolution toward progress and harmony. Ideally, the data could be made fully transparent to data citizens

The third principle relates to the linkage between data rights and data usage. Although this principle is the most abstract of the three, it is potentially the most important and most worthy of awareness by data collectors and data science professionals. Debate continues regarding the use of machine learning and its benefits versus dangers to society. Technological evolutions are unstoppable, but they do not have to emerge as unfettered progress. Consider these extracts from the GDPR adopted in the European Union [21], and note the balance in the language: “The processing of personal data should be designed to serve mankind. The right to the protection of personal data is not an absolute right; it must be considered in relation to its function in society and be balanced against other fundamental rights, in accordance with the principle of proportionality … Rapid technological developments and globalisation have brought new challenges for the protection of personal data. The scale of the collection and sharing of personal data has increased significantly. Technology allows both private companies and public authorities to make use of personal data on an unprecedented scale in order to pursue their activities. Natural persons increasingly make personal information available publicly and globally. Technology has transformed both the economy and social life, and should further facilitate the free flow of personal data within the Union and the transfer to third countries and international organisations, while ensuring a high level of the protection of personal data.” One way to employ this guiding principle and help bring about harmonized progress would be a code of ethics for machine learning. Professionals in law, medicine, finance, and engineering may subscribe to a professional association and its code of ethics. Machine learning currently has no such oversight body, but perhaps it is time for a noteworthy association to construct a code of ethical conduct for machine learning practitioners.


Ideally, the data could be made fully transparent to citizens today. This must be a topic of considerable debate when one considers the proprietary or competitive nature of data, as noted so well in the GDPR language. Yet, where do the data come from? Again, perhaps the fields of law, medicine, and finance can serve as guides toward the appropriate types of transparency.

5. Practical actions toward good data citizenry

In light of the Datamocracy Framework and its guiding principles, there are actions citizens and practitioners of the present data republic can take toward making progress in a more harmonious way.

5.1 Use data science archetypes

The concept of data science archetypes states that an organization needs an essential and minimum set of skills, personalities, and experiences in its data science function. Archetypes, as found in Ref. [22], provide a soft structure for building a data science organization. One may also view the archetypes as guidelines for maturing a data science organization. Use of the data science archetypes increases diversity in thought and helps to avoid blind spots. In practice, the return on establishing a healthy mix of archetypes is a balance in the types of experiences and mentalities existing in a data science organization. Such an organization is less likely to lean too heavily on pure technical prowess at the risk of losing sight of what could be happening to data transparency or bias within its modeling.

5.2 Focus on the questions

Discipline is required to focus on the right questions in the fast-paced world of delivering data-driven insights. Generally, data and models must produce answers. Fig. 2.4 shows a linkage among four key functions of data science [23].

FIGURE 2.4 Questions drive the model building process.

The lower case and brackets used for {d m} are intended to communicate the relative value and “behind the scenes” nature of data and modeling in the overall process. Clarifying the question (Q) is better seen as the prime mover in data science. Striving for answers (A), many data science organizations focus on models (m), as in “what model should be
built?” rather than beginning with the question. Furthermore, data science practitioners often assume the model sought is supported by the data (d) they have or can obtain. Less regularly will the practitioner take the time (sometimes painstaking) to truly unearth the question underlying the modeling effort. This act of highlighting the Q {d m} A process can also increase data literacy because it will require engaging with management and leadership to discuss the data science project before getting deep into data and model engineering.

5.3 Collaborate within the process to build a new culture of data

As data citizens increasingly become versed in the basic elements of data science, they will enable understanding among other citizens who are not as comfortable working with data. Sawicki and Craig [24] highlight three movements that spur data democratization. They note the first two as (a) an increase in computing power and data access and (b) lower barriers to entry for the skills needed to transform data into insights. They find full data democratization will also rely on decreasing the gap between data, their meaning, and the citizens. A similar leveling occurred with credit scores. Consider that 20–25 years ago, most people were confused as to how credit scores were used and how they were determined. They did appreciate the impact, however, when informed they would not receive a credit card. Fast forward to the present time and many more consumers understand how credit scores are used and what goes into those scores. They have even come to understand how they can improve their scores. One only needs to look at television ads about boosting one’s credit score to see the cultural shift in a formerly obscure subject. This shift in credit score literacy came on the heels of a changing culture. People realized that a credit score was not only the key to receiving a credit card but also had an impact on one’s ability to get a job or rent an apartment. Regarding a data culture, Diaz, Rowshankish, and Saleh note “you develop a data culture by moving beyond specialists and skunkworks, with the goal of achieving deep business engagement, creating employee pull, and cultivating a sense of purpose” [25, p. 38]. Credit score literacy emerged through a very subtle, yet demonstrably effective, collaboration in the process, as well as through enhancing the readability of materials relating to credit scores.

5.4 Label machine learning products for consumers

Data literacy rises when data citizens are better able to know when machine learning is part of a decision-making process. One action is increased readability and disclosure. Loan applications, food labeling, and health risk notices on consumer products all use simple and bold labeling to alert consumers about once-technical matters like annual percentage rates, chemistry, and toxicity panels. Loan applications must contain clear and readable information about the interest rates and the terms of the loan. Foods are labeled with warnings such as “this product may contain shell fragments.” Similar things may be done in the realm of data science and machine learning. Imagine a day when a credit or rental application comes back with a statement: “machine learning techniques were used in the determination of this application.” It would be a start to let people know when algorithms are in fact used. Another way to think about this: although the process is not labeled, it is widely known that machines read and filter resumes. As a result, recruiting coaches and job search blogs shaped the culture to change what people include in their resumes because they had to satisfy the machines. Labeling when machine learning is used in decision-making increases transparency, improves ethical image, and alerts people to when they should make sure their data are proper and appropriate for the decisions being made.
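As a sketch of what such labeling could look like in software, the snippet below attaches a machine learning disclosure to a hypothetical application decision. The class, field names, and wording are illustrative assumptions, not a proposed standard.

```python
# Hypothetical decision record that carries a machine learning disclosure,
# analogous to an APR notice on a loan or an allergen warning on food.
from dataclasses import dataclass

@dataclass
class ApplicationDecision:
    applicant_id: str
    outcome: str                 # e.g., "approved" or "declined"
    used_machine_learning: bool
    model_purpose: str = ""      # what the model contributed to the decision

    def disclosure(self) -> str:
        if not self.used_machine_learning:
            return "No machine learning techniques were used in this decision."
        return ("Machine learning techniques were used in the determination "
                f"of this application ({self.model_purpose}).")

decision = ApplicationDecision(
    applicant_id="A-1024",
    outcome="declined",
    used_machine_learning=True,
    model_purpose="credit risk scoring",
)
print(decision.disclosure())
```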

6. Conclusion

We do not know what the future of our data democracy will look like. However, contemplating the evolutionary role of data in business and society helps us prepare for, and maybe even drive, the impending change. What we do know is that a cyclical relationship between data and technology exists, and this interplay has given rise to successive periods of vision, progress, and harmony, which themselves contain subcycles of the same primary phenomenon. From this paradigm, we see that our burgeoning data democracy is the result of a vision which began in the late 1800s and was spurred on by technological progress occurring at the beginning of the 21st century. As a period of progress transitions to a period of harmony, it is up to us, the data citizens, to determine how this harmony will be achieved. Historic examples of two transformations provided in this chapter suggest that the concepts of freedom, formation, and flexibility will have
important parts to play. A data democracy will not stand for a “ruling class.” The Datamocracy Framework depicts the growth of the current data democracy with respect to formality and pliancy. This framework shows our data democracy continues to evolve, and we can anticipate the emergence of a data citizen consciousness. Data science practitioners are in a unique position to help shape this consciousness and can rely on the guiding principles of data use, data science, and data rights to help further development initiatives. The use of data science archetypes, ensuring the questions being asked are the questions that should be answered, and growing the data republic through collaboration and improved data literacy are examples of practical actions aligning with these principles. What’s past is prologue. The rich past of data–technological evolution and adaptive governance approaches provides readers with a strong foundation for exercising their rights and fulfilling their responsibilities as data citizens in the data republic.

References

[1] BBC News, Google’s ’superhuman’ DeepMind AI Claims Chess Crown, BBC News, December 8, 2017 [Online]. Available: https://www.bbc.com/news/technology-42251535.
[2] A.L. Porter, A.T. Roper, T.W. Mason, F.A. Rossini, J. Banks, B.J. Wiederholt, Forecasting and Management of Technology, John Wiley & Sons, New York, 1991.
[3] Gartner, Inc., Gartner Magic Quadrant & Critical Capabilities, Gartner.com, 2019 [Online]. Available: https://www.gartner.com/en/research/magic-quadrant.
[4] United States Census Bureau, History: Innovations, United States Census Bureau, 2019 [Online]. Available: https://www.census.gov/history/www/innovations/.
[5] A. Goldschmidt, A. Akera, John W. Mauchly and the Development of the ENIAC Computer, 2003 [Online]. Available: http://www.library.upenn.edu/exhibits/rbm/mauchly/jwmintro.html.
[6] S. Fahey, The democratization of big data, J. Natl. Secur. Law Policy 7 (2) (2014) 325–331.
[7] F. Batarseh, J. Gendron, R. Laufer, M. Madhavaram, A. Kumar, A context-driven data visualization engine for improved citizen service and government performance, Model. Using Context 2 (November 20, 2018) [Online serial]. Available: https://doi.org/10.21494/iste.op.2018.0303.
[8] G.A. Moore, Crossing the Chasm: Marketing and Selling Disruptive Products to Mainstream Customers, HarperCollins Publishers, Inc., New York, 2002.
[9] P.R. Orszag, Memorandum: Open Government Directive, obamawhitehouse.archives.gov, December 8, 2009 [Online]. Available: https://obamawhitehouse.archives.gov/open/documents/open-government-directive.
[10] The White House, Open Government Initiative, 2009 [Online]. Available: https://obamawhitehouse.archives.gov/open.
[11] The White House, Weekly Address: President Obama Discusses Efforts to Reform Spending, Government Waste; Names Chief Performance Officer and Chief Technology Officer, 2009 [Online]. Available: https://obamawhitehouse.archives.gov/the-press-office/weekly-address-president-obama-discussesefforts-reform-spending-government-waste-n.
[12] T.L. Friedman, The World Is Flat: A Brief History of the Twenty-First Century, Farrar, Straus and Giroux, New York, 2005.
[13] S. Hawking, Brief Answers to the Big Questions, Hodder & Stoughton General Division, London, 2018.
[14] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine, Comput. Netw. ISDN Syst. 30 (1–7) (1998) 107–117 [Online serial]. Available: https://doi.org/10.1016/s0169-7552(98)00110-x.
[15] T. Mickle, Among the iPhone’s biggest transformations: apple itself, Wall Street J. (June 20, 2017) [Online]. Available: https://www.wsj.com/articles/among-the-iphones-biggest-transformations-apple-itself-1497951003.
[16] H. Shaban, B. Fung, AT&T Says It’ll Stop Selling Your Location Data, amid Calls for a Federal Investigation, The Washington Post, January 10, 2019 [Online]. Available: https://www.washingtonpost.com/technology/2019/01/10/phonecompanies-are-selling-your-location-data-now-some-lawmakers-want-federalinvestigation/?utm_term=.353042af1eaf.
[17] A. Schneider, J. Garsd, Facebook, Google Draw Scrutiny over Apps that Collected Data from Teens, npr.org, January 30, 2019 [Online]. Available: https://www.npr.org/2019/01/30/690172103/facebook-google-draw-scrutinyover-apps-that-collected-data-from-teens.
[18] U.S. Constitution, Article I, Section 8.
[19] U.S. Constitution, Article II, Section 3, Clause 5.
[20] U.S. Constitution, Article III, Section 2, Clause 1.
[21] Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance), Orkesterjournalen L 119 (4.5) (2016) 1–88 [Online]. Available: http://data.europa.eu/eli/reg/2016/679/oj.
[22] J. Gendron, S. Mortimer, T. Crane, C. Haynes, Transforming data science teams in the future, in: F.A. Batarseh, R. Yang (Eds.), Federal Data Science: Transforming Government and Agricultural Policy Using Artificial Intelligence, Elsevier Academic Press, Cambridge, MA, 2018.
[23] J. Gendron, Introduction to R for Business Intelligence, Packt Publishing Limited, Birmingham, UK, 2016.
[24] D.S. Sawicki, W.J. Craig, The democratization of data: bridging the gap for community groups, J. Am. Plan. Assoc. 62 (4) (1996) 512–523 [Online serial]. Available: https://doi.org/10.1080/01944369608975715.
[25] A. Diaz, K. Rowshankish, T. Saleh, Why data culture matters, McKinsey Q. (3) (2018) 36–53 [Online]. Available: www.mckinsey.com/w/media/mckinsey/business%20functions/mckinsey%20analytics/our%20insights/mckinsey%20quarterly%202018%20number%203%20overview%20and%20full%20issue/mckinsey-quarterly-2018-number-3.ashx.

3 The history and future prospects of open data and open source software

Feras A. Batarseh 1, Abhinav Kumar 2, Sam Eisenberg 3

1 GRADUATE SCHOOL OF ARTS & SCIENCES, DATA ANALYTICS PROGRAM, GEORGETOWN UNIVERSITY, WASHINGTON, D.C., UNITED STATES; 2 VOLGENAU SCHOOL OF ENGINEERING, GEORGE MASON UNIVERSITY, FAIRFAX, VA, UNITED STATES; 3 DEPARTMENT OF MATHEMATICS, UNIVERSITY OF VIRGINIA, CHARLOTTESVILLE, VA, UNITED STATES

To the past, or to the future, to an age where thought is free, from the age of big brother, from the age of the thought police, from a dead man: Greetings.
1984, by George Orwell

Abstract
“Open data for all New Yorkers” is the tagline on New York City’s open data website. Open government is being promoted in most countries of the western world. Governments’ transparency levels are being measured by the amount of data they share through their online public repositories. Additionally, open source software is promoted in government, academia, and industry; this is the new digital story of this century, and the new testament between the gods of technology and their users. Data and software openness will redefine the path forward and aim to rekindle our collective intelligence. Data and software openness can redefine Data Democracy and be the catalyst for its progress. This chapter provides a historical insight into data and software openness: the beginnings, the heroes, prospects for the future, and all the things we cannot afford to negotiate or lose.

Keywords: Copyleft; Hacking; Open data; Open source software.

1. Introduction to the history of open source

Open source software (OSS) has had a lasting impact on business, government institutions, and scientific research. OSS was initially adopted by and for developers and programming hobbyists. OSS has since expanded to operating systems (most famously, Linux), virtual machines
(such as Oracle’s VM VirtualBox), and Internet clients such as Mozilla Firefox [1]. The history of open source (OS) begins in the 1950s [2]. Computing research, heavily reliant on government funding, was largely relegated to academia. According to David Berry, Professor of Digital Humanities at the University of Sussex, “difficulties faced by computer scientists … and their early links to academic and scholarly norms of research … contributed to sharing practices that later fed directly into the ethics of the [Open-Source movement].” From these origins, “hackers,” ranging from academics to electronics hobbyists, banded into various groups (e.g., MIT’s Artificial Intelligence Club and the Homebrew Computer Club) aiming to “overcome limitations of software systems [and] achieve novel outcomes” [3]. In the 1970s, the hacker culture began to decline as corporations asserted control over “freely sharing files, tools, and information” [4]. Activists like Richard Stallman crusaded against the privatization of software, denouncing proprietary software as “evil, because it required people to agree not to share and that made society ugly” [5] and referring to peers who decided to join for-profit firms as traitors. Stallman created the (initially incomplete) GNU operating system. Later, Linus Torvalds developed and publicly released the Linux kernel which, combined with GNU, produced a freely available, complete operating system. The release of GNU/Linux unleashed a wave of volunteer peer-to-peer collaborations that define OS development to this day. Differing from Stallman’s ideological viewpoint, Torvalds reflected a more practical approach, stating, “I don’t like single-issue people, why should business, which fuels so much of society’s technological advancement, be excluded?” [5]. Eventually, Torvalds’ emphasis on “the pragmatic goal of getting people to collaborate in order to create software more effectively” became dominant over Stallman’s free software ideology [5]. Stallman’s evangelical crusade did, however, provide the legal framework on which OS exists. Stallman’s Free Software Foundation formalized user rights and the copyleft licensing model, legally ensuring liberties that fundamentally differ from proprietary software, namely:

1. The right to full access to the source code
2. The right for anyone to run the program without restriction
3. The right to modify the source code
4. The right to distribute both the original software and the modified software
5. The right for users to know about their OS rights
6. The obligation to distribute derivatives under copyleft

These defined rights [1] were essential to ensuring the growth and stability of the OS community as corporations patented and trademarked software [6]. OSS’s early relationship with industry would come to define many of its later developments.

2. Open source software’s relationship to corporations

Historically, large corporations have had a tumultuous relationship with OSS [7]. Viewing it as a source of competition, Microsoft CEO Steve Ballmer referred to Linux as a “malignant cancer” [8]. As a result, much of OSS was first developed by universities and computer hobbyists. OS usage has now become ubiquitous within private industry. A survey conducted by Black Duck Software showed the “percentage of companies running part or all of their operations on OSS had almost doubled between 2010 and 2015, from 42% to 78%” [7]. Since then, corporations and governments not only incorporate OSS but also contribute to OS projects. The number of firms that contribute to OS projects rose from 50% in 2014 to 64% in 2016, signifying that the private sector will continue to heavily define OS development in the future [7]. Economic and technological necessity has led corporations to embrace OS use and/or development. For example, by open sourcing their software, companies “establish their technology as de facto standards,” profiting by offering related products and services [9]. Security is another concern; proprietary/closed source software often lacks stability. According to a Google Security Survey, Microsoft IIS Web servers were found to be “twice as likely to distribute malware” as open source Apache Web servers [10]. Considering the number of variables constituting the likelihood of an attack, it is difficult to statistically assert such claims. Experts like Rob Nash, senior lecturer in Computing at the University of Washington, are skeptical of the metrics used in these kinds of studies, stating: “there are many variables besides the number of attacks against or reported vulnerabilities in comparable open source and proprietary software.” Regardless, OSS gives opportunities to more developers, novices and experts alike, to possibly fix security flaws and bugs in the code. This is vital in an age where “95% of software bugs are caused by 19 common and well-understood programming errors” [11]. As a result, there is an established opinion that OSS “does not pose any significant barriers to
security, but rather reinforces sound security practices by involving many people that expose bugs quickly” [11]. A changed industry outlook toward OSS has coincided with the popularization of artificial intelligence (AI) and Data Science tools. The intersection of OS and data science is discussed in the next section of this chapter.

3. Open source data science tools

The cost of data storage has declined drastically within the last 10 years, with a 15%–20% cost reduction just in the past several years. Additionally, with cloud services, it is becoming feasible for companies to cheaply store massive datasets. Interpreting interrelated variables on a large scale is impossible to do without data science/analytic tools [12]. Data Science is defined as “the study of the generalizable extraction of knowledge from data” [13]. Specializing in areas such as computer science, mathematics, or statistics, data scientists extract useful information from a dataset. Interest in data science and its applications has grown immensely, according to Google Trends. OS and proprietary tools for data science differ in several key ways. First, proprietary software is usually not as customizable, and if a defect is detected, it cannot easily be changed [14]. Second, because of OSS’s ability to change quickly, OSS tends to be reliable and technically advanced [15]. Third, OSS’s low or zero cost, technical sophistication, and flexibility make OSS an attractive option for researchers, who usually add more features to the software [16]. Fourth, experienced data science professionals, including data miners, data analysts, and data scientists, prefer using OSS because it allows users to edit code for their needs [17]. OS user beliefs also differ; OS users are more likely to value the freedom associated with modifying their software [18]. Those differences, especially the ability to edit source code, are significant in an age where individuals and institutions require customizable data tools for their complex problems. Experienced data scientists will require the use of varied tools, as there is no single machine scheme suitable to all data science deployments. As a result, OS tool development for data science occurs in numerous categories and languages. The increasing number of OS data science tools reflects this trend, totaling 70 since 2014 [13]. Data mining libraries (DMLIB) account for most of the OS tools available, followed by data mining environments (DME), integration toolkits (INT), and BIS (statistical analysis functionalities in business applications). The four most common OSS tool development languages are Java (34%), followed by others (15%), C++ (11%), and Python (3%). Most OS development occurred on the GNU-GPL license (23%), other licenses (22%), and Apache (12%). In scientific and academic publication, the five most commonly
used OS tools are R (43%), followed by Weka (15%), Kepler (7%), LibSVM (7%), and Hadoop (4%). According to a survey conducted by KDnuggets, the five most popular programming languages (among the data science community) in 2017 were Python (52.6%), R (52.1%), SQL (34.9%), RapidMiner (34.9%), and Excel (28.1%, although Excel is not OS) [17]. Users with less technical tasks (such as creating visualizations) prefer the use of closed source software like Excel. OS languages like R and Python are especially used in the private sector for data-related projects. For example, research scientists at Google use R to understand “trends in ad pricing and for illuminating patterns in the search data it collects” [19]. Spotify and Netflix use OS Python modules such as Luigi for data analysis. More than half of all publicly traded companies use some form of OSS. Increasingly, businesses no longer want to simply use OS data science tools but to be involved in their development [20]. Examples of this shift have occurred at companies like Google. Historically, Google tended to be protective of its software and only released publications describing its technologies, rather than the source code [21]. Google’s research papers, such as “MapReduce,” led to the creation of innovative OS projects like Hadoop [21]. Hadoop’s popularity has possibly served as an omen to Google. In 2015, Google released TensorFlow, an OSS library for dataflow programming (often used in machine learning and neural networks research). Already, it is among the top 10 tools used by the data science community, experiencing a 195% increase in usage in 2017 [22]. Despite recent OS advancements, proprietary software still holds relevance within the data science field. OSS has become ubiquitous in the security industry, but among data scientists, proprietary software is still used to a large extent [19]. Although OSS is dominant among professional data scientists and researchers, proprietary software is widely used in large organizations (be they governments or private companies) for various reasons. To illustrate this, it is useful to compare some features of proprietary software (SAS) and OSS (R); both languages are used heavily in statistics [19]. While R is very popular in academia, SAS (Statistical Analysis System) has typically been dominant in the private sector. SAS can do many of the same things as R, including altering and retrieving statistical data, applying statistical functions and packages, and displaying impressive graphics. For small- and medium-sized datasets (mostly used by startups, researchers, and individuals), R performs well as a “data analysis performer and graphics creator” [19,23]. For massive datasets, however, SAS offers several distinct advantages over R. R was designed for statistical computing and graphics, so “data management tends to be time consuming and not as clean as SAS.”
Students who have solely used R develop an unrealistic expectation of the state of the data they will receive. Furthermore, for realistic datasets (which are often messy and rarely clean), performing data manipulation in SAS is common, while in “R [it] is not standard” [24]. The quality and levels of documentation also differ. R offers more open documentation than SAS, but documentation for “[R] is not well organized and often has not been thoroughly tested” [23]. Technical documentation for SAS is also vastly more detailed than for R [23]. A lack of proper documentation plagues most of OSS, not just R. According to a 2017 OS survey by GitHub, “93% of respondents noticed that incomplete or outdated documentation is a pervasive problem” [24]. Technical support is another key differentiator, particularly among nontechnical users and corporations. OSS like R relies on free support from the community; as a result, users cannot expect detailed and accurate answers in a timely manner. In contrast, SAS offers technical support to address user needs quickly [23]. Although SAS’s initial costs may deter some users (the Analytics Pro version of SAS is priced at $8700 for the first year), most of SAS’s products and services do not charge extra for technical support. Performance differences between OS and proprietary software exist as well. Proprietary hardware drivers are typically built in “close cooperation of the hardware vendor and, thus, they perform better” (Optimus Information Inc., 2015). However, optimization has its disadvantages. A common criticism of SAS is that “proper SAS does not run on Macs” [23]. While the Macintosh and Windows interfaces look different, R offers “very similar functionality” [25]. Developers are increasingly adopting Mac and Linux operating systems, resulting in more library/package support in R (as published by Stack Overflow, 2016). OS data science tools are used interchangeably in other fields, particularly AI. As the field of AI gains prominence, it is important to discuss the merit of recent advancements. What is the trend of OS tool development for AI? The next section attempts to answer this question.

4. Open source and AI

In 1956, the RAND Corporation and Carnegie Mellon University developed the first AI language, the Information Processing Language (IPL). IPL pioneered several concepts that were new at the time, including dynamic memory allocation, data types, recursion, and lists. However, IPL was soon replaced by a language with a far simpler and more scalable syntax: Lisp. According to John McCarthy, the creator of Lisp, the language was designed to facilitate experiments for a proposed machine that could “draw immediate conclusions
from a list of premises” and “exhibit common sense in carrying out its instructions” [26]. At the Massachusetts Institute of Technology (MIT) research lab, interest in creating Lisp machines resulted in the creation of two companies: Lisp Machines Inc. (LMI) and Symbolics. Early on, differences emerged between LMI, founded by MIT’s traditional “hackers,” and the more commercially oriented Symbolics. Richard Stallman, then at MIT’s AI lab, described the conflict between the two companies as a “war” [27]. Ultimately, both companies would go on to fail “with the onset of AI winter,” resulting in a sharp decline in demand for specialized Lisp machines [28]. What remained of the schism was Stallman’s desire to create a free operating system, paving the way for the free/libre open source software (FLOSS) movement. The launch of FLOSS would ensure continued academic interest during the AI winter, a period of “reduced commercial and scientific activities in AI” [28]. The OS movement would lead to the creation of languages like C, which, while not focused on AI, “fueled its development” [29]. The Deep Blue computer, an example of symbolic AI, would be developed in C. However, the use of low-level languages like C would be limited. According to Neumann [30], principal researcher at the German Research Center for Artificial Intelligence, the “fuzzy nature of many AI problems” caused programmers to develop higher-level languages that would free them “from the constraints of too many technical constructions.” Although Lisp had solved some of these technical issues, its semantics would prove too difficult for it to become widely popular. Regardless, concepts pioneered by Lisp, including garbage collection and object-oriented programming, would serve the development of future higher-level languages [31]. These languages include Python and Java (Java is not fully OS but is available for free to the public). Python and Java would emerge in the 1990s and eventually become essential for machine learning. R, which “began as an experiment in trying the methods of Lisp implementers to build a small testbed,” eventually became a popular tool for data mining (a subfield of AI) and statistics [32]. As the AI winter ended and “the problems of hardware seemed to be under control,” OS software simultaneously received increased media coverage and recognition [31]. Languages like Python [33], R, Lisp, and Java were used extensively to build the first post-AI-winter tools. The first “officially” OS tool specific to AI was Torch, released in 2002. Torch, a machine learning library, is based on the Lua programming language [34].


5. Revolutionizing business: avoiding data silos through open data

Data are required for companies to operate, and their importance is ever growing. Data are the fuel that machine learning runs on; without data, nothing is known. What is not as clear cut and universal is how the data are treated and shared: how they are improved on, interpreted, and what can be done with them. Companies, just like governments, are readjusting to the latest data technologies and implementing the best data strategies as they are uncovered. Data democratization, achieved through dataset publication without copyright, patent, or other restrictions, allows governments to be more auditable and accountable. They accomplish that by distributing the knowledge needed for decision-making to the public. This is not fundamentally unique to governments; a similar process of choosing and concluding can be distributed among many instead of being handed down via a centralized hierarchy, although in the industry instance, the “public” would be the general workforce. Publicly traded companies that must submit public financial statements for regulatory purposes are a separate case, although in some sense that is another example where the system as a whole can function better by increasing data availability and group decision-making. In this industry scenario, distributing data changes the cadence of information flow and often enables a delegation of problem solving. It prevents data disagreements through increased transparency, enables increased data agility by removing barriers, and reduces data hoarding, which in its own form is a source of power-siloing. It brings about an accountability that can evolve things from the words of a medicine man to the validated knowledge of physicians, ultimately enabling the wisdom of the many to be utilized throughout the organization. A Data Silo can be defined as a repository of information that is controlled and accessed by a single group or department. Silos come in many flavors, from files sitting on computers to tightly held large databases. The most defining feature is how access is granted and how information ultimately flows. It is important to understand the different causes and consequences of siloing data, and the changes that occur as data move out into the open. Data Silos will naturally arise in an organization, and one group or department will often be the primary custodian and steward. This is organic, as centralizing particular data entry can improve the accuracy and efficiency of these processes. Most companies will have a single department or a small number of departments that are ultimately responsible for handling
accounting records and human resource records. Many will choose to have internal and external review processes where light is brought onto the accounting books. A well-known example and pioneer is Google, which started publicly providing annual diversity reports in 2014 and recently released an update this year, 2019 (The Equal Pay ActdNo. 899 of 2008). Through this program, the company has both become aware of some of its own biases and has helped improve its hiring processes to ensure that talent is not being overlooked. It’s not to say the program’s goal was having the body of employees match that of the geographical area; but to ensure that hiring practices would not be influenced by unconscious factors that ultimately removed good candidates. They cross analyzed performance versus demographics and have shown a trend toward population moving toward the geographic trend without any decrease in performance. The publication has helped spur on other companies to be able to follow suit by both raising awareness and by creating metrics that can be tracked, although many keep the data strictly for internal use. When data are isolated and inaccessible but required, redundant data will often result. In an e-commerce company, an accounting group, a website product team, and a marketing group all require knowledge of sales performance to perform the job. Data need to be shared across the organization for such major tasks. Data Silos can also arise for a lack of technical aptitude or experience. Centralized file stores and cloud file shares have made it easier for companies to distribute documents. Spreadsheets are still the dominant data entry and processing form. They also still leave challenges around versions and editing; many workers at any business will experience at least a few emails asking for the latest version of an excel file. Stale files can result to similar disagreements as that of the redundant processing: risks of incorrect conclusions; distrust and wasted time for validation, inefficient sharing, and ineffective processing. Oftentimes, software will use its own data file silo, such as an excel spreadsheet. Raw data are naturally sheathed by a transport and record format. Often it resorts from use of web requests hitting APIs or queries running against data warehouses. These forms are not yet widely performable by the unskilled workers; however, many tools exist to break down barriers and instill an ease of use. Another technical limitation can be the sheer size of datasets. While computing resources available have improved, network speeds have improved too, the storage medium has grown faster. It is cheap enough that many companies have chosen to capture information and left handling it later as an unsolved challenge.


Sharing information with consumers has proven vital in multiple areas and is being enforced through legislation. Food safety standards, for instance, often include provisions that set up reports and databases of inspection violations. The effectiveness of transparency in improving safety is also found in publicized car safety reviews and crash statistics (although a dealership trying to sell a car with a crash history would not benefit from that). The shared data align the public's interests with market performance. The publishing of reviews and ratings has also helped instill product confidence and desirability among consumers; it is one of the ways in which online shopping is an enhancement over the parallel in-store experience. The increase in data availability has greatly changed the knowledge dynamics between a company and its customers and can successfully be used to improve performance, whether it be sales, safety, or satisfaction.

Companies also open up datasets to the general public to share knowledge and foster innovation. A recent surge in machine learning has increased the demand for training data. By cooperating, companies have been able to improve model testing by sharing common datasets. Google recently updated its openness message: We hope that the exceptionally large and diverse training set will inspire research into more advanced instance segmentation models. The extremely accurate ground-truth masks we provide rewards subtle improvements in the output segmentations, and thus will encourage the development of higher-quality models that deliver precise boundaries. Finally, having a single dataset with unified annotations for image classification, object detection, visual relationship detection, and instance segmentation will enable researchers to study these tasks jointly and stimulate progress towards genuine scene understanding (Google 2019).

The opening of data can also distribute decision-making. I have been part of a company that automated its product experimentation tests. Before the automated numbers were shared, longer, nonrepeatable analyses were performed. That required more effort from the engineering teams, causing enough friction to make testing prohibitively expensive and fairly infrequent. Automating the experimentation allowed the company to perform multiple concurrent tests with much shorter testing windows. This increased testing bandwidth helped the transition to data-driven product management, letting everyone focus on hypothesis testing conditions while providing transparent conclusions. Decisions that were previously gut feelings on core e-commerce conversion flows could now be validated. It allowed the company to become aware of not just single sampled statistics but trials conducted over time to reassert business assumptions. It allowed product ideas to flow from the entire team, removing organizational hierarchy as the driver of decisions and making business impact the chief driver instead.

Opening data comes with challenges. There is a balance between security and privacy that prohibits full openness with some data sources. Personal information such as health records and tax identification often remains in secured silos, and aggregation or redaction is used to help anonymize data by removing sensitive information. Some data would jeopardize business function: they could risk employees' privacy, provide enough information for competitors to gain an advantage, pose a legal risk or violation, and so on. Open data also do not guarantee accuracy. With the growing importance of product reviews, companies have been found to plant false reviews on merchandise, both positive in their own favor and negative on competitors' brands. Despite this, the overall trend is that information becomes cheaper to capture and store over time, and large value can be gained by opening it up.
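Returning to the automated experimentation described above, the sketch below is a minimal, hedged illustration (not the company's actual system) of the kind of check such a pipeline might run: a two-proportion z-test on conversion counts for a control and a variant. The function name and the numbers are invented.

# A minimal sketch of an automated A/B (experimentation) check, assuming a
# simple two-proportion z-test on conversion counts; numbers are illustrative.
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided
    return z, p_value

# Hypothetical test: control vs. variant of a checkout flow
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p-value would validate the change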

6. Future prospects of open data and open source in the United States

Compared with the Industrial Revolution, current technological advancements are estimated to be 10 times faster [35,36]. With the advent of the information age, humans can collaborate with greater ease than ever before. Thomas L. Friedman, author of The World Is Flat, describes our new "flat world", an ecosphere of innovation and wealth fostered by globalization [37]. A closer look at recent trends, however, does not reflect an era of lightning-fast technological advancement. Federal research and development (R&D) as a percent of GDP has declined from an average of 1.2% in the 1970s to 0.7% in the 2000s [38]. In 2016, China received more patent applications than the United States, Japan, South Korea, and the European Union combined [35,36]. The number of patent applications does not directly correlate with an upsurge in innovation; more balanced metrics (e.g., the number of patents per researcher), however, support the notion of declining scientific output in the United States [39]. This lack of technological progress has been reflected in economic performance.


Economic studies indicate that as much as "85% of measured growth in US income per capita was due to technological change" [39,40]. Yet the average income for "millennials" (the cohort born between 1980 and 2000) has declined from the previous generation's [41]. Private investment in innovative sectors of the economy has also been lackluster. Before the financial crash of 2007, trillions of dollars flowed into a robust US economy; most of it went into government bonds and housing. Developed nations like the United States depend on exporting technologically advanced products, yet in categories of advanced technology products, including aerospace, biotechnology, and information technology, the United States "has turned from a net exporter to a net importer" [35]. The percentage of Americans graduating college, particularly in science fields, is not meeting the demands of an information age. The United States has one of the lowest shares of degrees awarded in science fields [41]; 54% of all patents are awarded to foreign-born students or researchers, who face increasingly stringent visa restrictions [42].

Technological advancements have been dictated by the availability of data. Data drive knowledge, and data are the subject of many upcoming conflicts around the world. As Batarseh [43] mentioned in an article, "Knowledge however, has a peaceable face; like music, love and beauty – knowledge is the most amiable thing. It transfers through borders, through time, and even some preliminary studies show that knowledge can transfer via our DNA through generations. In many cases, knowledge would either hinder or thrust human progress. Fortunately, our collective knowledge is always growing. One cannot undermine the Knowledge Doubling Curve, it dictates the following: Until year 1900, human knowledge approximately doubled every century; by 1950 however, human knowledge doubled every 25 years; by 2000, human knowledge would double every year. Today, our knowledge is almost doubling every day! Although hard to measure or validate; as a result of such fast pace, three significant questions are yet to be answered: how are we going to manage this mammoth knowledge overflow? What if that knowledge can be organized, structured, and arranged in a way that can allow its usage? What if it falls in the wrong hands?" Well, the question remains: what if data fall hostage into ANY hands? And who has the authority to own data? This book, its authors, and the future of our planet dictate that no one should; a data democracy through open data and OSS is the only peaceable path forward.


References

[1] K. Carillo, C. Okoli, The open source movement: a revolution in software development, J. Comput. Inf. Syst. 49 (2008) 1–9.
[2] D. Berry, Copy, Rip, Burn: The Politics of Copyleft and Open Source, Pluto Press, 2008, p. 280, https://doi.org/10.2307/j.ctt183q67g.
[3] V. Gehring, The Internet in Public Life, Rowman and Littlefield, Lanham, MD, 2004. ASIN: B01FELVZDU.
[4] C. Cathy, Hacking: the performance of technology?, a review of the book Hacker Culture by Thomas Douglas, 9 (2), 2009. Retrieved from https://scholar.lib.vt.edu/ejournals/SPT/v9n2/legg.html.
[5] W. Isaacson, D. Boutsikaris, The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution, Simon & Schuster, 2014.
[6] N. Brown, Patterns of Open Innovation in Open Source Software, SSRN Electronic Journal; Open Innovation: Researching a New Paradigm, Oxford University Press, 2006, https://doi.org/10.2139/ssrn.2470718.
[7] G. Anthes, Open source software no longer optional, Commun. ACM 8 (2006) 15–17, https://doi.org/10.1145/2949684. Section 59.
[8] T. Greene, Ballmer: 'Linux Is a Cancer', an article from The Register, 2001. Retrieved from https://www.theregister.co.uk/2001/06/02/ballmer_linux_is_a_cancer/.
[9] J. West, S. Gallagher, Challenges of open innovation: the paradox of firm investment in open-source software, R and D Management 3 (2006) 319–331, https://doi.org/10.1111/j.1467-9310.2006.00436.x. Section 36.
[10] R. Grimes, Continuing the Web Server Security Wars: Is IIS or Apache More Secure?, an article in InfoWorld, September 7, 2007. https://www.infoworld.com/article/2649431/continuing-the-web-server-security-wars–is-iis-orapache-more-secure-.html.
[11] R. Clarke, D. Dorwin, R. Nash, Is Open Source Software More Secure?, A Report from the University of Washington, 2009.
[12] C. La-Chapelle, The Cost of Data Storage and Management: Where Is it Headed in 2016?, PC Pugs, 2016.
[13] P. Barlas, I. Lanning, C. Heavey, A survey of open source data science tools, Int. J. Intell. Comput. Cybern. 8 (3) (2015) 232–261.
[14] A. Singh, R. Bansal, N. Jha, Open source software vs proprietary software, Int. J. Comput. Appl. 114 (2015) 26–31. Section 18.
[15] M. Muffatto, Open Source: A Multidisciplinary Approach, Imperial College Press, London, United Kingdom, 2006.


[16] D. Lipsa, R. Laramee, Open source software in computer science and IT higher education: a case study, Int. J. Adv. Comput. Sci. Appl. 2 (2011). https://doi.org/10.14569/ijacsa.2011.020102. Section 1.
[17] J. King, R. Magoulas, Data Science Salary Survey 4, O'Reilly Media Data Science Survey, 2018. Section 1.
[18] M. Lemley, Z. Shafir, Who chooses open source software?, SSRN Electron. J. (2009). https://doi.org/10.2139/ssrn.1495982.
[19] C. Ozgur, M. Kleckner, Y. Li, Selection of statistical software for solving big data problems, SAGE Open Access J. 2 (2015). https://doi.org/10.1177/2158244015584379. Section 5.
[20] A. Yakkundi, Digital Experience Technology and Delivery Priorities, A Report by Forrester Research, 2016.
[21] S. Grady, Changing Tack: Evolving Attitudes to Open Source, A Report from RedMonk, 2015.
[22] S. Morgan, Cybersecurity Vendors, Companies, Employers, and Firms, A Report by Black Duck, 2015.
[23] Carnegie Mellon University (CMU), Business Analytics Program Masters, 2020.
[24] F. Zlotnick, GitHub Open Source Survey, Data Set, 2017.
[25] Stanford University, Using R for Windows and Macintosh, 2010.
[26] J. McCarthy, Recursive functions of symbolic expressions and their computation by machine, Commun. ACM 3, Section 4, 184–195. http://doi.org/10.1145/367177.367199.
[27] R. Stallman, My Lisp Experiences and the Development of GNU Emacs, GNU Project – Free Software Foundation, GNU, 2014.
[28] C. Smith, B. McGuire, T. Huang, G. Yang, The History of Artificial Intelligence, University of Washington, 2006.
[29] M. Jones, The Languages of AI, IBM, 2017.
[30] G. Neumann, Programming Languages in Artificial Intelligence, German Research Center for Artificial Intelligence, 2011. Retrieved from https://doksi.hu/get.php?lid=27018.
[31] G. Steele, R. Gabriel, The Evolution of Lisp, ACM SIGPLAN Conference on History of Programming Languages, 1993, pp. 233–330. https://doi.org/10.1145/234286.1057818.
[32] R. Ihaka, R: Past and Future History, A Draft Paper for Interface, University of Auckland, 1998.
[33] G. Meer, How We Use Python at Spotify, Retrieved from Spotify, 2014.


[34] R. Collobert, K. Kavukcuoglu, C. Farabet, Torch7: A Matlab-like Environment for Machine Learning, Published at NIPS, 2011.
[35] Federal Reserve Report, The High-Tech Trade Balance: Importing and Exporting U.S. Aerospace, Nuclear, and Weaponry Technology, FRED Blog, 2018.
[36] J. Grafström, Technological Change and Wage Polarization – the Illiberal Populist Response, Ratio Working Papers 294, Published by the Ratio Institute, 2017.
[37] T. Friedman, The World Is Flat: A Brief History of the Twenty-First Century, Farrar, Straus and Giroux, New York, 2005.
[38] M. Greenstone, A. Looney, A Dozen Economic Facts about Innovation, Published with the Hamilton Project, 2011.
[39] S. Kortum, Research, patenting, and technological change, Proc. Econom. 6 (1997) 1389–1419. https://doi.org/10.2307/2171741. Section 65.
[40] B. Benderly, Rising Above the Gathering Storm: Energizing and Employing America for a Brighter Economic Future, 2007. https://doi.org/10.1126/science.caredit.a0700179.
[41] Citi Bank and Oxford University, Inequality and Prosperity in the Industrialized World, 2017. Retrieved from https://www.oxfordmartin.ox.ac.uk/downloads/Citi_GPS_Inequality.pdf.
[42] A. Phung, Made in America: How Immigrants Are Driving U.S. Innovation, A Report in ThinkBig, 2012.
[43] F. Batarseh, Thoughts on the Future of Human Knowledge and Machine Intelligence, An Article on the London School of Economics (LSE), 2018.

Further reading Institute of Medicine (US), Transforming clinical research in the United States, Drug Discov. Dev. Transl. (2010). https://doi.org/10.17226/12900. Section 3.

4 - Mind mapping in artificial intelligence for data democracy

José M. Guerrero, INFOSEG, Barcelona, Spain

We should always allow some time to elapse, for time discloses the truth.
Seneca

Abstract

One of the main hurdles in the quest for data democracy is the information overload problem that end users have to suffer when trying to analyze all the information available in order to be informed and make sensible decisions. In this chapter, I discuss the definition, causes, consequences, and possible solutions to information overload. One of the solutions lies in the use of artificial intelligence (AI). But AI is a double-edged sword because, in some cases, it helps to reduce information overload and, in other cases, it aggravates the problem. I introduce the use of the mind mapping (MM) technique to help reduce information overload in all cases, and specifically in cases where AI has a negative effect on this problem. For readers unfamiliar with the MM technique, I provide some introductory references. I also provide a detailed example of the advantages of MM in the visualization of the results of the semantic analysis of a complex text using IBM Watson NLU and MM automation software developed by myself to create MindManager mind maps. This is one of those cases where AI produces a large amount of information that has to be organized in the best possible way. MM is, in my opinion, the most adequate way to organize that kind of information. The MM technique is also a very important tool to achieve data democracy through the empowerment of end users when gathering and analyzing complex information. Mind maps are single compressed files containing information, notes, graphs, and links that simplify the organization and treatment of complex information. Finally, an HTML5 version of the mind map used is made available to all readers so that they can appreciate the look and feel of the MM experience by simply using their Internet browser.

Keywords: Artificial intelligence; Data democracy; Information overload; Mind mapping.



1. Information overload

1.1 Introduction to information overload

According to the Cambridge Dictionary [1], information overload is "a situation in which you receive too much information at one time and cannot think about it in a clear way." The basic consequence of information overload is that our brain is unable to understand the information and make sensible decisions. Information overload was already reported in Ecclesiastes 12:12 [2]: "Be warned, my son, of anything in addition to them. Of making many books there is no end, and much study wearies the body," and by Seneca around 50 AD [3]: "What is the use of having countless books and libraries, whose titles their owners can scarcely read through in a whole lifetime?" The symptoms were accelerated by the introduction of printing around 1450. Erasmus complained about the "swarm of new books" [4], but it was not until the 20th century that information overload became a serious problem, because information generated by technological advances grows exponentially instead of linearly, as Kurzweil notes [5].

In 1964, Bertram Gross used the term "information overload" for the first time in his book, "The Managing of Organizations" [6]. In this book, Gross defined the term as follows: "Information overload occurs when the amount of input to a system exceeds its processing capacity. Decision makers have fairly limited cognitive processing capacity. Consequently, when information overload occurs, it is likely that a reduction in decision quality will occur." The problem is thus twofold: difficulty in understanding the information and a reduction in the quality of the decisions made. Toffler popularized the term information overload in 1970 in his book, "Future Shock" [7].

Many other terms equivalent to "information overload" have appeared in technical and business publications. Some of the most common are "infoglut/information glut" [8], "data smog" [9], "infoxication" [10], "infosaturation" [11], and "infobesity" [12]. Another related concept is "information anxiety" [13], defined as "the black hole between data and knowledge." Wurman explored five components of information anxiety: not understanding information, feeling overwhelmed by the amount of information to be understood, not knowing if certain information exists, not knowing where to find information, and knowing exactly where to find the information but not having the key to access it [14].


According to a study from the research group IDC sponsored by Seagate [15], the world's data stored in devices, storage systems, and data centers will reach 175 zettabytes (1 ZB = 10²¹ bytes). By 2025, 60% of the world's data will be created and managed by businesses, 20% of the data in the global datasphere will be critical to our daily lives, the amount of the global datasphere that is subject to data analysis will grow by a factor of 50, and more than a quarter of all data created will be real time.

1.2 Causes of information overload

Besides traditional sources of information overload such as books, magazines, newspapers, letters, brochures, and TV, the main causes of information overload are related to the technological advances that have appeared in recent times. The causes of this exponential increase in information overload can be divided according to the sources of information: digital transformation, Internet of Things (IoT), social media, cybersecurity log files, Internet web pages, emails, data openness, push systems, attention manipulation, spam email, and massive open online courses (MOOCs).

1.2.1 Digital transformation

Digital transformation has produced an important change in the problem of information overload from both a qualitative and a quantitative point of view. In this context, it is convenient to define three related concepts: "digitization," "digitalization," and "digital transformation." In the three cases, I am going to use Gartner's IT Glossary. Digitization [16] is "the process of changing from analog to digital form, also known as digital enablement. Said another way, digitization takes an analog process and changes it to a digital form without any different-in-kind changes to the process itself." Digitalization [17] is "the use of digital technologies to change a business model and provide new revenue and value-producing opportunities; it is the process of moving to a digital business." Digital business transformation [18] is "the process of exploiting digital technologies and supporting capabilities to create a robust new digital business model." This term refers not only to the implementation of digital technologies but also to a customer-driven strategic business transformation [19]. For companies, adopting the emerging new digital technologies is a strategic imperative. However, this is usually done without thinking about how to store and use the exponentially increasing amount of information that is being generated.


1.2.2 Internet of things

IoT is an interdisciplinary domain that utilizes sensors, networking, cloud computing, edge computing, big data, machine learning intelligence, security, and privacy [20]. The term "Internet of Things" was coined by Ashton in 1999 when he included it in the title of a presentation he made at Procter & Gamble [21]. The purpose of IoT is to use sensors in physical devices to monitor and control them. Some of the most important applications are in smart cities, smart homes, healthcare, connected cars, industrial Internet, agriculture, retail, smart grid, and farming. By the year 2020, the IoT will comprise more than 30 billion connected devices [22]. Other interesting statistics are provided by NewGenApps [23].

1.2.3 Social media

Social media sites are an important contributor to information overload. Facebook, Twitter, LinkedIn, Instagram, Pinterest, WeChat, Weibo, WordPress, YouTube, WhatsApp, Google+, Flickr, Snapchat, Slashdot, Digg, Reddit, Quora, Disqus, StackExchange, and Amazon (for product reviews) are the most important sources. In some of these cases, most of the information provided is useless or irrelevant. The causes of this problem are multiple and complex; chief among them is the absence of barriers to the information that is registered, including fake news, disinformation, hoaxes, slander, and personal information without any interest.

The We Are Social report for 2018 [24] provides some interesting facts about social media sites. The number of social media users is 3.196 billion, and the number of active mobile social media users is 2.958 billion. The average Internet user now spends around 6 h each day using Internet-powered devices and services; adding this together for all Internet users, a collective one billion years is spent online. Almost one million people start using social media for the first time every day. BrandWatch also provides an interesting set of 121 social media statistics [25]. On average, people have 5.54 social media accounts. 91% of retail brands use two or more social media channels. 81% of small- and medium-sized businesses use social media. Facebook Messenger and WhatsApp handle 60 billion messages a day. On WordPress alone, 74.7 million blog posts are published every month. 27 million pieces of content were shared every day, and today 3.2 billion images are shared each day. The top three content marketing tactics are social media content (83%), blogs (80%), and email newsletters (77%). 89% of B2B marketers use content marketing strategies.


1.2.4 Cybersecurity

Cybersecurity is the organization and collection of resources, processes, and structures used to protect cyberspace and cyberspace-enabled systems from occurrences that misalign de jure (practices that are legally recognized) from de facto (practices that exist in reality) property rights [26]. Cybersecurity is needed by businesses, private users, governments, and industries. Cybersecurity software generates lots of event logs to prevent and detect attacks. These event logs must be reviewed regularly, adding to information overload. In many cases, post-breach forensics have to be performed. Event logs can be generated by security software applications, firewalls, antivirus software, network routers, workstations, operating systems, and many other infrastructure elements. The number, volume, and variety of cybersecurity event logs have increased greatly, creating the need for log file management [27]. Logs are the primary source of information in cybersecurity. Besides the necessary risk assessment of every application to determine the level of logging needed, the minimum information that has to be logged is the following: user IDs; date and time of log on and log off, and other key events; terminal identity; successful and failed attempts to access systems, data, or applications; files and networks accessed; changes to system configurations; use of system utilities; exceptions and other security-related events such as alarms triggered; and activation of protection systems such as intrusion detection systems and antimalware [28]. Large enterprises generate an estimated 10 to 100 billion events per day [29]. These numbers will grow as enterprises enable more event logging sources in the future.
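As a hedged illustration of the routine log review described above (not an example taken from the chapter or from any specific product), the sketch below scans a plain-text authentication log for failed logon attempts, one of the minimum items listed, and counts them per user ID. The file name auth_events.log and the line format are assumptions.

# A minimal, illustrative log review: count failed logon attempts per user ID.
from collections import Counter
import re

# Hypothetical log line format: "... FAILED LOGON user=jdoe terminal=ws42 ..."
FAILED_LOGON = re.compile(r"FAILED LOGON user=(?P<user>\S+) terminal=(?P<terminal>\S+)")

def count_failed_logons(log_path: str) -> Counter:
    """Count failed logon attempts per user ID in a plain-text event log."""
    failures = Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            match = FAILED_LOGON.search(line)
            if match:
                failures[match.group("user")] += 1
    return failures

if __name__ == "__main__":
    for user, count in count_failed_logons("auth_events.log").most_common(10):
        print(f"{user}: {count} failed attempts")  # candidates for closer review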

1.2.5 Internet web pages

The total number of websites is greater than 1,946,000,000 [30]. Even browsing 1000 websites per day, a human being would need more than 5331 years to see all of them. Hosting Facts offers a detailed list of statistics about the Internet in its web page, The Internet Stats & Facts for 2019 [31]. According to the HTTP Archive Report, the average page weight keeps increasing [32], adding to the information overload problem created by web pages.

1.2.6 Emails

In 2017, the number of email users amounted to 3.9 billion; this figure is expected to reach 4.3 billion in 2022 [33]. In 2017, 269 billion emails were sent and received each day; this figure is expected to increase to over 333 billion daily emails in 2022 [34].


The email messages received can be personal or business related. Email was originally designed as a communications application, but it later started to be used for other functions like task management and personal archiving [35]. These additional functions further complicate the information overload problem caused by email. Spam is a very important component of email messages: as of September 2018, spam messages accounted for 53.5% of email traffic worldwide [36]. The most common types of spam email were healthcare and dating spam.

The Zeldes Infomania report on Intel employees [37] includes the results of an email usage survey given to Intel knowledge workers in 1999, showing that an average of 200 messages wait in an employee's inbox, that 30% of emails are perceived as unnecessary, and that, on average, each employee spends 2.5 h per day managing email. Another survey, from March 2006, shows that an average of 350 messages are received by each employee per week, that each employee spends an average of 20 h per week managing email, and that 2 hours per week are spent processing the 30% of incoming messages viewed as unnecessary. To reduce the problem, email users try to filter and organize the huge number of email messages received. Ranking messages by relevance is usually a useful strategy.
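As a hedged sketch of such relevance ranking (not a method described in the chapter), the code below scores each message by simple keyword matching against a user's interests and sorts the inbox accordingly; the sample messages and keywords are invented.

# A toy relevance ranking for an inbox: score each message by how many of the
# user's interest keywords appear in its subject and body, then sort.
from dataclasses import dataclass

@dataclass
class Message:
    subject: str
    body: str

INTERESTS = {"deadline", "invoice", "meeting", "contract"}  # invented interests

def relevance(msg: Message) -> int:
    """Count how many interest keywords appear in the message text."""
    text = f"{msg.subject} {msg.body}".lower()
    return sum(1 for keyword in INTERESTS if keyword in text)

inbox = [
    Message("Weekly newsletter", "Top ten productivity hacks..."),
    Message("Contract deadline", "The signed contract is due before the meeting."),
    Message("Invoice 1043", "Please find the invoice attached."),
]

for msg in sorted(inbox, key=relevance, reverse=True):
    print(relevance(msg), msg.subject)   # most relevant messages first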

1.2.7 Data openness

"Open knowledge" is any content, information, or data that people are free to use, reuse, and redistribute, without any legal, technological, or social restriction [38]. Open knowledge is what open data become when they are useful, useable, and used. Data can be federated by aggregating several data sources into a single dataset. The most common types of open data are found in justice, the legal system, education and culture, science, finance, statistics, weather, and the environment. Governments around the globe are implementing "open data" strategies to increase transparency, participation, and/or government efficiency [39]. The Open Data Impact Map [40] is a public database of organizations that use open government data from around the world. This map was developed to provide governments, international organizations, and researchers with a more comprehensive understanding of the demand for open data. It includes private companies, nonprofits, academic institutions, and developer groups that use open government data for advocacy, product and service development, improving operations, strategy, and research.

Private businesses are also taking part in this openness movement despite some concerns about the confidentiality of the information. The use of open data by private enterprises raises questions about the definition of this term: when related to business, open data are not only about transparency and accountability. The initiative Datacollaboratives.org [41] is a resource on creating public value by exchanging data. It seeks to provide insight on how responsible exchange of corporate data can improve people's lives. Governments can use sites like this one to learn how their data are being used, and how they could be used even more effectively if they were made more relevant, useful, and accessible for the private sector.

The Deloitte report on Open Data [42] includes a list of reasons why the private sector should open up some of its datasets. Benefits related to government: comply with legal or regulatory obligations, sell data-related services to government, improve collaboration with local government, provide insight into investment and innovation, and support public–private partnerships. Benefits related to other businesses: sell data-related services to other businesses, help partners and the supply chain, gain a richer understanding of customers and their circumstances, improve collaboration, support industrial communities, and support local businesses and charities. Benefits related to customers: comply with legal or regulatory obligations, sell new data-related services to customers, demonstrate transparency and anticorruption measures, build trust, improve reputation and moral standing, crowdsource solutions, crowdsource improvements to data quality, create a platform for social engagement with customers, engage data scientists, and attract talent.

One of the best-known examples of private open data is the Uber Movement open data portal [43], which provides anonymized data from over two billion transportation movements. This information is being used by urban planners and civic communities to analyze investments in new infrastructure. The information in Uber Movement is anonymized and aggregated into the same types of geographic zones that planners use to evaluate which parts of cities need improved and expanded infrastructure. A comprehensive list of data portals from around the world can be found at DataPortals.org [44]. The best way of reducing information overload from this source is by making the disclosed information machine readable. Linked Open Data is a blend of Linked Data and Open Data [45]. It has many applications in open government data, in linked enterprise data, and for electronic media and publishing.

1.2.8 Push systems

Push systems can be defined as software that automates the delivery of information to users [46]. In this kind of system, information is sent to users proactively through a web browser, email, voice mail, or pager. Push systems are usually based on preferences established previously by the users. The reasons for establishing those preferences are typically to have access to special offers, keep track of order status, or stay up to date with information about products or services such as news, software, and weather forecasts. The information received is usually periodic. When this technology is used in combination with the web, it is called webcasting. In 1996, PointCast Inc. released its free software, a combination of screen saver and news-retrieval service [47]. But already in 1997, the first voices against push technology started to appear [48], complaining about its excessive use of bandwidth. Although most pushed information is received through the Internet, in corporate environments intranets are also used for this purpose when the information is generated internally. The advantage of push technology over traditional pull technology is that users do not need to know when and where to look for the information they need; this makes pull technology inefficient compared with push technology [49]. Push systems have been identified as a contributor to information overload problems around the world [50].

1.2.9 Attention manipulation

The previous causes were related to increasing amounts of information, while this one is related to attention. Information providers may deliberately induce information overload to conceal relevant information [51]. Information overload can make readers concentrate on irrelevant information, rendering them unable to make the best decision. Attention manipulation is used fundamentally in marketing, politics, war, and criminal activities. In a world full of information, attention is a scarce resource. Our working memory plays a crucial role in attention [52]. The design of digital interfaces has a very strong influence on attention manipulation and on preventing it [53].

1.2.10 Spam email

Spam is usually defined as unsolicited email. Spam makes up 45% of all emails; this corresponds to about 14.5 billion messages per day [54]. It could also be considered a subset of the generic "emails" cause of information overload. Advertising-related email accounts for approximately 36% of all spam messages, while adult-related (31.7%) and financial (26.5%) spam lag behind. Surprisingly, scams and fraud comprise only 2.5% of all spam email.


A spam bot is a software application designed to send spam emails automatically in large quantities [55]. One of the most important activities of spam bots is to visit websites, collect email addresses, and build a database with them. Later, the database is used to send emails with any desired content. In advertising, spam bots can also be used to generate clicks, impressions, and views. Instagram is an important target for spam bots. In this case, the idea is not to send emails but to "like" photos based on a particular characteristic; the goal is to receive "likes" back from the persons who received the initial "like" from the bot. Spam bots can also be used to complete fake user registrations or fake information requests. This, too, is a serious contribution to information overload because it can induce the owner of the website to initiate an action after having read the fake information.

1.2.11 Massive open online courses (MOOCs)

The term MOOC refers to free (or very low cost), easily accessible, completely online courses [56]. The first MOOC, "Connectivism and Connective Knowledge/2008" (CCK08), was launched in 2008 by Stephen Downes and George Siemens of the University of Manitoba [57]. At the end of 2018, there were more than 101 million MOOC students and over 900 universities offering more than 11,400 courses in total [58]. The most important providers of MOOCs are Coursera (37 million students), edX (18 million), XuetangX (14 million), Udacity (10 million), and FutureLearn (8.7 million).

1.3 Consequences of information overload

The consequences of information overload are multiple, but the main ones are poor decision-making; anxiety, stress, and other pathologies; reduction in productivity; and misinformation.

1.3.1 Anxiety, stress, and other pathologies

Anxiety, stress, health problems, depression, hostility, and unhappiness are some of the most common pathologies resulting from having to work in an information-overloaded environment [59]. Loss of job satisfaction and damage to personal relationships are other undesirable consequences. The Zeldes Infomania report on Intel employees [37] includes information about a worldwide survey conducted in March 2007 among all Intel IT employees, in which 40% responded that email has a negative impact on their stress level and 31% responded that email has a negative impact on their quality of life.


1.3.2 Reduction in productivity

Reduction in productivity is one of the worst consequences of information overload. Sometimes the impact of additional IT investments on the productivity of an organization is not clear. Some studies seem to indicate that it is the actual use of technology that has an impact on productivity [60]. This problem has been known for a long time as the "productivity paradox" [61].

1.3.3 Misinformation

Information overload has serious consequences for the quality of the information received. There is nothing easier than creating content and adding it to the web; there is no need to edit it or verify its correctness. When we have to read too much information, we have no time to decide whether certain information we have read is true or false. Misinformation can take the form of hoaxes or fake news, and with misinformation we fail to improve our knowledge about a subject. In 2013, the chapter "Digital Wildfires in a Hyperconnected World" of the Global Risks Report published by the World Economic Forum [62] warned about the increasing danger of misinformation. Misinformation is clearly related to populism, political manipulation, and marketing strategies. Instead of traditional censorship, a similar effect can be obtained by drowning people in misinformation and noise [63].

1.3.4 Poor decision-making

The "Joining the Dots" report on decision-making for a new era [64] surveyed board-level executives at large organizations: 36% said that their organization was not coping with information overload, 80% said that flawed information had been used to make strategic decisions, and 42% said that their organization had lost a competitive advantage because it was slow to make decisions. Information overload can result in making incorrect decisions, making them for the wrong reasons, not being able to make decisions, or not making them quickly enough. The reasons for this poor decision-making can be not being able to process all the information received, not obtaining all relevant information, and trusting irrelevant or wrong information.

1.4 Possible solutions

There are basically three general ways of fighting information overload. The first is organizing the information to minimize the time needed to read and analyze it. The second is filtering information in active and passive ways. The third is the withdrawal of some channels of information, for example, TV, radio, or newspapers.


In the first group of techniques, we can include literature reviews, content management systems (CMSs) and Open Data portals, infographics, and mind maps. The second group includes search engines, personal information agents, and recommender systems. Infographics and mind maps can also be used to filter and summarize information.

1.4.1 Literature reviews

A good definition of a literature review is the one used in Ref. [65]: "A research literature review is a systematic, explicit, and reproducible method for identifying, evaluating, and synthesizing the existing body of completed and recorded work produced by researchers, scholars, and practitioners." Literature reviews are useful to reduce information overload when doing research on a subject [66].

1.4.2 Content management systems

A CMS is software that allows for the creation and modification of digital content. This can be done on an intranet, on the web, or in a combination of both. Other typical functions include task management and collaboration, search, and web publishing. Some CMSs are intended primarily for building web pages. Some of the most common ones are WordPress, MS SharePoint, Documentum, Squarespace, Wix, Joomla, Magento, and Drupal. CMSs help in the creation of several types of websites, the most important of which are blogs, e-commerce sites, intranets, and corporate websites. The search and web publishing functions of CMSs help in the control of information overload.

1.4.3 Open data portals (data democracy)

Open data are data that anyone can access, use, and share [67]. Open data are generally generated by the public sector, but sometimes the origin is private. Open data portals are websites containing open data and tools to use it free of charge. Data can be in the form of datasets (PDF, XML, RDF XML, ZIP, plain text, RDF, HTML, CSV, Excel, PowerPoint, SPARQL, ODF, TSV, JPEG, Word), visualizations, applications used to process the datasets, and tools for developers to build third-party applications using the datasets. The purpose of making these data available to the public is transparency and promoting the use of such datasets as a driving force in the knowledge society. It can also be helpful, however, in reducing the problems related to information overload. The datasets provided by open data portals are in machine-readable form so that they can be processed to create summaries and reduce the amount of work users must do when interpreting such datasets. A very important component of data democracy is that the average end user has to be empowered to gather and analyze open data without the help of IT specialists. In general, open data portals are not enough by themselves to fulfill this empowerment; additional tools are needed to realize data democracy.

Wiki is a Hawaiian word meaning "quick." In the IT world, a "wiki" is a website that is open to anyone to edit content using a web browser. There is no central organization controlling the editing of the articles. The name wiki tries to describe the speed with which content can be created. Wikis are widely used to promote collaborative learning in schools through the use of content created by students. In companies, wikis are used in knowledge management and in web collaboration. Wikipedia [68,69] is a free online encyclopedia; started in 2001, it now contains more than five million articles. DBpedia [70,71] is a system created by a community of users to extract structured information from Wikipedia and to make this information available on the web.
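As a hedged sketch of this kind of machine-readable access (not an example from the chapter), the code below queries DBpedia's public SPARQL endpoint for a handful of resources and prints their abstracts; the specific query and the result handling are illustrative assumptions.

# Query DBpedia's public SPARQL endpoint for a few programming languages and
# their English abstracts. The query itself is a simple illustrative example.
import requests

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
SELECT ?lang ?abstract WHERE {
  ?lang a dbo:ProgrammingLanguage ;
        dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
LIMIT 5
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["lang"]["value"])
    print(" ", row["abstract"]["value"][:120], "...")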

1.4.4 Search engines

Search engines were one of the first attempts to fight information overload. A search engine, or search service, is "a document retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or in a personal computer" [72]. The first web search engine (Archie) was created in 1989 by Alan Emtage, a student at McGill University in Montreal [73]. Google is the leader in the web search engine market with a share of 73.62% [74], but there are several other very interesting products used to search the web [75].

1.4.5 Personal information agents

Personal information agents monitor users to assist them in, for example, performing autonomous web searches and suggesting information resources that the users have consulted in the past. They can save users a lot of time and reduce the information overload they suffer overall. Some of the most common personal information agents are Apple's Siri, Google Now, Amazon's Alexa, and Microsoft's Cortana [76]. Another common use of personal information agents is in learning [77].


1.4.6 Recommender systems

Recommender systems are software applications based on machine learning that offer suggestions to users about products or services that will likely be of interest to them. Obtaining such recommendations is a very important part of human decision-making. Companies such as Amazon, YouTube, Spotify, Facebook, Google, and Netflix use recommender systems to improve the digital experience of their visitors and customers. The first recommender system was GroupLens, a system for collaborative filtering of net news, designed to help people find articles among all those available [78]. The collaborative approach had been introduced in 1992 in the PARC Tapestry system, an experimental mail system [79].
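The sketch below is a minimal, hedged illustration of the collaborative filtering idea behind such systems (not the GroupLens or Tapestry algorithms themselves): users are compared by the cosine similarity of their rating vectors, and unrated items are scored from the ratings of the most similar users. The rating matrix is invented.

# Minimal user-based collaborative filtering on a tiny, invented rating matrix.
# Rows are users, columns are items; 0 means "not rated".
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1
    [1, 0, 5, 4],   # user 2
    [0, 1, 4, 5],   # user 3
], dtype=float)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recommend(user: int, k: int = 2) -> int:
    """Return the index of the highest-scoring unrated item for `user`."""
    sims = np.array([cosine_sim(ratings[user], ratings[v])
                     for v in range(len(ratings))])
    sims[user] = -1                                  # exclude the user themselves
    neighbors = sims.argsort()[-k:]                  # k most similar users
    scores = sims[neighbors] @ ratings[neighbors]    # similarity-weighted ratings
    scores[ratings[user] > 0] = -np.inf              # only consider unrated items
    return int(scores.argmax())

print("Recommended item for user 0:", recommend(0))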

1.4.7 Infographics

Infographics can be defined as visual representations of complex information in a way that makes it easy for users to understand at first glance. In general, infographics use a very small amount of text. The first infographics are still preserved in European cave paintings [80]. Ancient Egyptians also produced infographics in hieroglyph form [81]. In the 1920s, Isotype (International System of Typographic Picture Education) was the precursor of modern infographics [82]. The central themes of Isotype were housing, health, social administration, and education. In 1982, the newspaper USA Today started using infographics.

1.4.8 Mind mapping

Mind mapping (MM) is a very useful technique for summarizing, grouping, and visualizing complex information. These three uses of MM are very helpful in reducing information overload. A brief introduction to MM can be found in Chapter 8 of Federal Data Science [83]. Infographics are helpful when the information displayed is relatively simple; when dealing with complex information, however, MM has relevant advantages. MM can be used manually with one of the many software applications for editing and visualizing mind maps; for example, Mindjet's MindManager [84,85] can export mind maps into HTML5 format so that they can be visualized in any web browser [86]. When using MM, the increase in productivity when working with information can be estimated at 20%–100% or more, depending on the experience of the user [83]. One of the most obvious applications of MM in the reduction of information overload is evident when it is integrated into CMSs.


The best-known experience is the integration of the application MindManager into MS SharePoint [85]. Mind maps can be displayed on a single screen or page, thereby eliminating the problems related to the use of hyperlinks. This significantly reduces the cognitive load of the user when viewing and analyzing the information displayed on the mind map. While navigating mind maps, human beings do not get disoriented or lost.

1.5 Artificial intelligence in the reduction of information overload

Artificial intelligence (AI) can be used to reduce the severity of the above problems. The most important ways in which AI can help are prioritizing, filtering, automating simple tasks, smart searching, classifying and disseminating complex information, summarizing information, improving the visualization of relevant events and complex information, natural language understanding (NLU), and tagging objects in images and videos. Instead of talking about AI in a generic way, we should rather talk of augmented intelligence or complex intelligence [87], which tries to empower users and reduce their information overload problem. Some of the most important AI applications for information overload reduction already on the market are Google [88], Gmail's Smart Compose for email management [89], Knowmail for email management [90], Accrete.ai for compounding knowledge [91], IBM [92], Slack for collaborative work [93], the Sapho Employee Experience Portal for employee portals [94], Siri, the virtual assistant for Apple's operating systems [95], Cortana, Microsoft's personal digital assistant [96], Alexa, Amazon's personal assistant [97], and eBrevia for contract analytics [98].

In general, AI applications help to reduce the problem of information overload. They are mostly automated features within other applications; typical examples are the filtering and prioritization of incoming emails or software that looks for patterns in the stock market and automatically initiates trades. In the future, with the advent of general AI, this approach to solving the problem of information overload will change. In other specific cases, however, AI can increase the information overload problem. A well-known example is the semantic analysis of text and web pages using NLU software. In cases like this one, visualization techniques can help to keep this additional problem under control, and MM is the ideal tool to do this.


2. Mind mapping and other types of visualization

2.1 Mind mapping in the visualization of open data

MM is a graphical technique for visually displaying and organizing several items of information. Each item of information is written down and then linked by lines to the other items, thus creating a network of relationships. Mind maps are always organized around a single central idea [83,99,100]. As more information is published as open data, mostly related to healthcare, transport, tourism, and weather, it becomes more necessary to improve the management of the flood of information available from different public sources.

2.1.1 Visualization of open data using mind mapping

When the information need of a user is very specific, downloading one spreadsheet, graph, text, or PDF file and storing it in a folder is enough. However, if several files are needed, users have the problem of organizing the complex information downloaded. A series of folders may be required, and additional files will need to be created to document the files downloaded and the folder structure created. MM can solve, or at least reduce the magnitude of, this problem. Mind maps have all the features needed to organize any number of downloaded files and document them in a straightforward way. The downloaded files can be attached or linked to topics of a mind map, comments can be added to the topics, and hyperlinks to websites can be added to reference other documents or web content related to each file. The creation of these mind maps can be done manually or through web applications that help users select and download specific files while browsing open data websites. If each file is downloaded individually, a complementary file has to be created to describe the purpose, content, and applications of that file. MM automation can instead be used to select the files to download and document purpose, content, and applications in the same mind map. Each selected file can be attached to a topic of the mind map, and purpose, content, and applications can be described in a note on the topic. The result is a single mind map file containing any number of attached files that were considered interesting during the selection process on the open data portal. An example of the use of this kind of application can be found in a presentation introducing the applications of MM in the visualization of open data [101]. This example is based on the European Open Data Portal [102].
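The chapter's automation targets MindManager maps; as a hedged, simplified sketch of the same idea, the code below builds a mind map in the open FreeMind (.mm) XML format, with one topic per downloaded dataset, a note describing its purpose, and a link to the local file. The file names and descriptions are invented.

# Build a simple FreeMind-format mind map (.mm) that documents a set of
# downloaded open datasets: one child node per file, with a note and a link.
import xml.etree.ElementTree as ET

datasets = [  # invented examples of files pulled from an open data portal
    {"file": "air_quality_2019.csv", "purpose": "Hourly PM2.5 readings"},
    {"file": "traffic_counts.json", "purpose": "Counts per intersection"},
]

root = ET.Element("map", version="1.0.1")
center = ET.SubElement(root, "node", TEXT="Open data downloads")

for ds in datasets:
    topic = ET.SubElement(center, "node", TEXT=ds["file"], LINK=ds["file"])
    note = ET.SubElement(topic, "richcontent", TYPE="NOTE")
    body = ET.SubElement(ET.SubElement(note, "html"), "body")
    ET.SubElement(body, "p").text = ds["purpose"]   # the topic's descriptive note

ET.ElementTree(root).write("downloads.mm", encoding="utf-8",
                           xml_declaration=True)
print("Mind map written to downloads.mm")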


FIGURE 4.1 Open data mind map.

The need for visualization of simple data is already beginning to be taken into account in open data portals [103]. When dealing with complex data, however, only MM automation provides an efficient way of doing it. In Fig. 4.1, we can see one of those complex visualizations where MM is very useful because the information downloaded includes attached files, tables, charts, and comments or notes.

2.1.2 Visualization of big open data using mind mapping

In the case of Big Open Data, the information may need preprocessing before it is visualized. The R programming language is usually a good option because of its speed in processing Big Data. Once the processing has been done, the complex information generated can be displayed in the form of a mind map. The example of Fig. 4.2 corresponds to Medicare payment data [104], which is available through the Centers for Medicare & Medicaid Services [105].
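The chapter's workflow uses R for this preprocessing step; as a hedged alternative sketch of the same kind of aggregation, the code below uses Python and pandas. The file name and column names are assumptions for illustration, not the actual CMS schema.

# A sketch of the preprocessing step: aggregate a large payments extract into
# a small summary suitable for display as mind map branches.
import pandas as pd

payments = pd.read_csv("medicare_payments.csv")            # hypothetical extract

# Average payment per provider state and specialty, largest averages first.
summary = (payments
           .groupby(["provider_state", "provider_specialty"], as_index=False)
           .agg(avg_payment=("average_medicare_payment", "mean"),
                services=("line_service_count", "sum"))
           .sort_values("avg_payment", ascending=False))

summary.to_csv("medicare_summary.csv", index=False)         # input for the mind map
print(summary.head())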

FIGURE 4.2 Big open data mind map.

2.2 Visualization of content management systems using mind mapping

CMSs are complex and difficult to configure. Once they have been set up, the problem is how to use them without getting disoriented or even lost; being based on hyperlinks makes them prone to disorientation. The only reliable solution is the use of visual techniques. One of the most popular CMSs is Microsoft SharePoint [106], which is also used as a knowledge management system and as a task and project management solution. CMSs, like many other software applications, face an ongoing issue with users' resistance to adopting them in their daily operations, even when the need for change is obvious. The integration of MM software products like MindManager Enterprise [85] into MS SharePoint has proved to offer significant advantages to professional users. The typical disorientation suffered by users when navigating complex MS SharePoint sites is significantly reduced when using the MM integration. MindManager's SharePoint dashboard feature allows users to query tasks, documents, issues, custom lists, calendar items, and other elements and to create complex mind maps that reduce the disorientation effect of web page hyperlinks. Users can find information across several SharePoint sites and work with SharePoint information in context, seeing the big picture and all the details at the same time.

2.3 Visualization of artificial intelligence results When human beings have to analyze and interpret results provided by AI software to make decisions, visualization is of the utmost importance. The use of visualization increases productivity, facilitates decisionmaking, and helps to reduce errors. The more complex the information


generated by AI is, the more necessary visualization becomes. Only as AI becomes advanced enough to act on its own results might the need for visualization decrease.

2.3.1 Types of applications of visualization in AI

In some cases, visualization is applied to the results generated by AI. In other cases, visualization is used to prepare data for AI analysis. The application of visual programming in AI is becoming increasingly common.

2.3.2 Exploratory data analysis as a first step in AI

Exploratory data analysis (EDA) is a very important step in AI when modeling data to discover patterns, problematic issues, and interesting relations between variables. EDA offers a visual overview of the information and helps to discover what kinds of questions can be asked and answered. When the information is multidimensional, EDA helps to avoid manual exploration of all dimensions of the data [107]. Some of the most interesting visual techniques in EDA, when working with multidimensional data, are scatterplot matrices, parallel coordinate plots, scagnostics, multidimensional scaling (MDS), and t-Distributed Stochastic Neighbor Embedding (t-SNE). Scatterplot matrices are one of the most popular visualizations for multidimensional data [108,109]. Parallel coordinate plots are used to represent multidimensional data in a two-dimensional space [110]. Scagnostics (scatterplot diagnostics) summarize potentially interesting patterns in 2D scatterplots and are useful when the number of variables in a scatterplot matrix is large [111]. MDS is a technique for the analysis of similarity or dissimilarity in data on a set of objects [112]. t-SNE is a method to gain insight into how multidimensional objects are arranged in the data space [113,114]: high-dimensional data points are given a location in a two- or three-dimensional map, with the intention of preserving as much of the significant structure of the high-dimensional data as possible. The main advantage of the single map created is that it reveals structure at many different scales, and visualizations created with t-SNE are often significantly better than the ones produced by other methods [113].
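As a brief illustration of two of these techniques, the sketch below (not from the chapter) draws a scatterplot matrix and a t-SNE embedding with standard Python libraries; the Iris dataset is only a stand-in for real multidimensional data, and the parameter values are illustrative.

```python
# Minimal EDA sketch: scatterplot matrix and t-SNE embedding of a small dataset.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

data = load_iris()
frame = pd.DataFrame(data.data, columns=data.feature_names)

# Scatterplot matrix: every pair of dimensions at once.
pd.plotting.scatter_matrix(frame, figsize=(8, 8), diagonal="hist")
plt.savefig("scatter_matrix.png")

# t-SNE: project the 4-dimensional points to 2D while preserving local structure.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(data.data)
plt.figure()
plt.scatter(embedding[:, 0], embedding[:, 1], c=data.target)
plt.savefig("tsne.png")
```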

2.3.3 Software visualization and visual programming of AI applications

Many companies have seen the advantages of allowing users to build AI applications without having to code. This is usually done through the


use of a visual drag-and-drop interface that does not require coding to create complex AI applications. An example is lobe.ai, a visual tool that allows users to build custom deep learning models, quickly train them, and ship them directly in an app without writing any code [115]; it includes a feature to explore datasets visually and focuses on machine vision. Other well-known examples are Microsoft Azure Machine Learning Studio [116], IBM Watson Assistant for building chatbots [117], TensorBoard [118], and Deep Learning Studio [119]. The Internet company Baidu has created a software development platform called EZDL [120] designed for those who do not know how to program. EZDL allows users to build personalized machine learning models through a drag-and-drop interface, and another important advantage is that it can create models using a limited amount of data. Custom models can be built in just a few minutes with only four steps: create model, upload data, train model, and deploy model. EZDL automatically matches the user's needs with the most suitable algorithm to build a custom model.

2.3.3.1 An example: visualization of complex information in NLU applications

In some cases, the results of AI applications are simple and straightforward, for example, when suggesting whether a patient requires hospitalization or when AI is used to determine whether a transaction is fraudulent. In such cases, visualization is not strictly necessary or, at least, not very useful. In other cases, the result is a text or an image such as a simple chart or scatterplot, which can be enough to provide a basic visualization. However, there are other types of AI applications that generate complex information and greatly benefit from visualization; indeed, many AI applications generate complex information that can even increase information overload. One of the most typical examples is text semantic analysis using AI NLU. NLU, also known as natural language interpretation, is part of natural language processing. The applications of NLU include product reviews, user feedback, interaction with customers, news gathering, semantic analysis, text categorization, voice activation, archiving, content analysis, sentiment analysis, market research, machine translation, summarizing texts, dialogue-based applications, market intelligence, call routing in call centers, automated trading, and reputation monitoring. Some of these applications, like call routing for call centers or commands to robots, do not imply a substantial increase in the information generated. But other applications, like semantic analysis or sentiment


analysis, generate a considerable amount of information that increases the information overload problem in a significant way. Some of the best and most popular AI NLU applications are IBM Watson NLU [121] and Google Natural Language API [122]. To illustrate the magnitude of the problem, let us examine an example using IBM Watson NLU to process a text extracted from the article "Death by Information Overload" [123], which appeared in the Harvard Business Review. The text contains eight paragraphs and is reproduced here for the sake of clarity.

"Information overload, of course, dates back to Gutenberg [124]. The invention of movable type led to a proliferation of printed matter that quickly exceeded what a single human mind could absorb in a lifetime. Later technologies – from carbon paper to the photocopier – made replicating existing information even easier. Once information was digitized, documents could be copied in limitless numbers at virtually no cost.

Digitizing content also removed barriers to another activity first made possible by the printing press: publishing new information. No longer restricted by centuries-old production and distribution costs, anyone can be a publisher today. The internet, with its far-reaching and free distribution channels, wasn't the only enabler. Consider how the word processor eliminated the need for a steno-pad-equipped secretary, with ready access to typewriter and White-Out, who could help an executive bring a memo into the world. In fact, a lot of new information – personalized purchase recommendations from Amazon, for instance – is "published" and distributed without any active human input.

With the information floodgates open, content rushes at us in countless formats: Text messages, Twitter tweets on cell phones, Facebook friend alerts, voice mail on BlackBerrys, instant messages, and direct-marketing sales pitches (no longer limited by the cost of postage) on desktop computers. Not to mention the ultimate killer app: email.

Meanwhile, we're drawn toward information that, in the past, didn't exist or that we didn't have access to but, now that it's available, we dare not ignore, such as online research reports, industry data, blogs written by colleagues or by executives at rival companies, Wikis and discussion forums on topics we're following, the corporate intranet, and the latest banal musings of friends in our social networks.

Yet, researchers advise that the stress of not being able to process information as fast as it arrives – combined with the personal and social expectation that someone will answer every email message – can deplete and demoralize them. Edward Hallowell, a psychiatrist and expert on attention-deficit disorders, argues that the modern workplace induces what he calls "attention deficit trait," with characteristics similar to those of the genetically based disorder. Author Linda Stone, who coined the


term "continuous partial attention" to describe the mental state of today's knowledge workers, says she's now noticing "e-mail apnea": the unconscious suspension of regular and steady breathing when people tackle their e-mail.

There are even claims that the relentless cascade of information lowers people's intelligence. A few years ago, a study commissioned by Hewlett-Packard reported that the IQ scores of knowledge workers distracted by e-mail and phone calls fell from their normal level by an average of 10 points – twice the decline recorded for those smoking marijuana, several commentators wryly noted.

Of course, not everyone feels overwhelmed by the torrent of information; some are stimulated by it. Some may fear, however, that this will come with a new phenomenon: information addiction. According to a 2008 AOL survey of 4000 e-mail users in the United States, 46% were "hooked" on e-mail. Nearly 60% of everyone surveyed checked e-mail in the bathroom, 15% checked it in church, and 11% had hidden the fact that they were checking it from a spouse or other family member.

The tendency of always-available information to blur the boundaries between work and home can affect our personal lives in unexpected ways. Consider the recently reported phenomenon of BlackBerry orphans: children who desperately fight to regain their parents' attention from the devices – in at least one reported case, by flushing a BlackBerry down the toilet."

The above text was processed using the IBM Watson demo [125], and part of the results is shown in Figs. 4.3–4.9.
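The same kind of analysis can also be reproduced programmatically with the ibm-watson Python SDK, as in the following sketch; the API key, service URL, version date, input file, and feature limits are placeholders, and option names may vary across SDK releases.

```python
# Minimal sketch: run a Watson NLU semantic analysis with the ibm-watson SDK.
# API key, service URL, version date, and file name are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, SentimentOptions, EmotionOptions, KeywordsOptions,
    EntitiesOptions, CategoriesOptions, ConceptsOptions, SemanticRolesOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01",
    authenticator=IAMAuthenticator("YOUR_API_KEY"))
nlu.set_service_url("YOUR_SERVICE_URL")

text = open("death_by_information_overload.txt").read()  # the quoted passage

result = nlu.analyze(
    text=text,
    features=Features(
        sentiment=SentimentOptions(),
        emotion=EmotionOptions(),
        keywords=KeywordsOptions(limit=10, emotion=True),
        entities=EntitiesOptions(limit=10, emotion=True),
        categories=CategoriesOptions(),
        concepts=ConceptsOptions(limit=8),
        semantic_roles=SemanticRolesOptions())).get_result()

# result is a nested dict with one key per requested feature.
print(result["sentiment"]["document"]["score"])
```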

FIGURE 4.3 Sentiment analysis.


FIGURE 4.4 Emotion.

FIGURE 4.5 Keywords.


FIGURE 4.6 Entities.

FIGURE 4.7 Categories.


FIGURE 4.8 Concepts.

FIGURE 4.9 Semantic Roles.


The results of the semantic analysis done by IBM Watson NLU are grouped into sentiment, emotion, keywords, entities, categories, concepts, and semantic roles. An eighth element, relations, is not shown in this demo, and a ninth element, metadata, is generated when the analysis is done on HTML or URL input. The IBM Watson demo uses hyperlinks to navigate all the pieces of information generated by the semantic analysis; when using hyperlinks to navigate such complex information, users get disoriented and cannot see the "whole picture." As we can see in Fig. 4.3, the UI of the demo shows an Overall Sentiment, but each phrase of the text has to be analyzed further for detailed sentiment analysis; in any ordinary text, this would generate hundreds of scores. Fig. 4.4 displays the results of Emotion analysis: Overall Emotion contains scores for joy, anger, disgust, sadness, and fear, and Targeted Emotion can be generated for each of the phrases of the text. Fig. 4.5 contains a list of keywords found in the text ranked by relevance; in this figure, only five keywords are displayed. Fig. 4.6 displays a list of Entities ranked by relevance score. Entities can be people, companies, organizations, cities, geographic features, job titles, and other information. The total number of Entities in the text is 16, of which the 6 with the highest relevance scores are displayed. In the demo, the information displayed is just the name, type, and score of each Entity; in the real application, each Entity is accompanied by a list of the five Emotions with a score for each of them, followed by the full name of the Entity plus a hyperlink to the corresponding DBPedia entry [126]. Fig. 4.7 contains the list of Categories ranked by relevance score; they are represented as hierarchies of up to five levels. Fig. 4.8 lists Concepts ranked by relevance score; these concepts may not be directly referenced in the text, and the total number of concepts listed is 8. Fig. 4.9 shows the list of Semantic Roles. In the demo, only the first phrase is analyzed; as in the previous cases, an ordinary text could generate hundreds of Semantic Roles. This example shows that when we really want to understand a text, the semantic analysis needed generates a lot of information that can increase the level of information overload. Semantic analysis is equivalent to understanding language, and when the text to understand is long enough, it can add a considerable amount of information to the original text. The results are presented as a series of web pages that can be navigated using hyperlinks. The main problem is the absence of a global view of the


information generated. Another problem is the disorientation of the user when navigating the information using hyperlinks [99]. MM is an interesting option to see the whole picture and the details of the information at the same time. The mind map is a single file that contains, in compressed XML format, the results of the semantic analysis in visual form; it contains text, icons, attached files, links to web pages, and, optionally, images, sound, and video files. In the example, I have used the software MindManager [86]. Fig. 4.10 is a collapsed version of the mind map: all the information appears on a single screen, and there is no need to change to other screens to visualize all the content. MM avoids the disorientation problem involved in all applications that use hyperlinks. When clicking on the nodes that contain only numbers, the branches expand and we can see the details without changing to other screens. Figs. 4.11 and 4.12 are examples for the Categories and Concepts branches: the Categories group includes all categories found in the analysis with a limit of three hierarchical levels, and Concepts include two levels of information and are ordered by a relevance score. In Fig. 4.13, Emotion is also ordered by a relevance score. In Fig. 4.14, Entities have four levels of information and include Emotions at the entity level. In Fig. 4.15, Sentiment only includes a global relevance score for the full text.
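As an illustration of how such a map can be produced automatically, the sketch below converts the result dictionary from the earlier NLU sketch into mind map branches; it again uses the FreeMind XML format as a stand-in for MindManager, and the input file name is hypothetical.

```python
# Minimal sketch: turn a Watson NLU result dict into a mind map with one
# branch per feature (Keywords, Entities, Concepts, Categories).
import json
import xml.etree.ElementTree as ET

result = json.load(open("nlu_result.json"))  # saved output of the NLU sketch above

root = ET.Element("map", version="1.0.1")
center = ET.SubElement(root, "node", TEXT="NLU semantic analysis")

branches = {
    "Keywords": [k["text"] for k in result.get("keywords", [])],
    "Entities": [e["text"] for e in result.get("entities", [])],
    "Concepts": [c["text"] for c in result.get("concepts", [])],
    "Categories": [c["label"] for c in result.get("categories", [])],
}

for name, items in branches.items():
    branch = ET.SubElement(center, "node", TEXT=name)
    for item in items:
        ET.SubElement(branch, "node", TEXT=item)

ET.ElementTree(root).write("nlu_mind_map.mm", encoding="utf-8", xml_declaration=True)
```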

FIGURE 4.10 NLU semantic analysis of a text as a mind map.

FIGURE 4.11 Categories branch of the mind map.


FIGURE 4.12 Concepts branch of the mind map.

FIGURE 4.13 Emotion branch of the mind map.

In Fig. 4.16, Relations are ordered by a relevance score and show the Arguments of each one. In Fig. 4.17, Keywords include an Emotions analysis at the keyword level. Finally, in Fig. 4.18, Semantic Roles are analyzed at the sentence level. An HTML5 version of this mind map can be downloaded [127] and viewed using any web browser. In this way, the reader will appreciate the look and feel of the MM experience by simply using her browser. Mind maps can be used to group complex results in a single file that can be easily distributed and managed. This grouping can reduce information overload significantly and make complex information much more


FIGURE 4.14 Entities branch of the mind map.

FIGURE 4.15 Sentiment branch of the mind map.

FIGURE 4.16 Relations branch of the mind map.


FIGURE 4.17 Keywords branch of the mind map.

FIGURE 4.18 Semantic roles branch of the mind map.

manageable. The possibility of automating the creation of mind maps adds directly to the empowerment of the user, which is one of the critical requisites for the success of data democracy. A mind map can contain all the information, comments, annotations, and links to the documents in a way that facilitates understanding without the need for assistance from IT specialists. Readers interested in further applications of MM can find many examples in Ref. [128], including examples in AI, healthcare, insurance, banking, social networks, e-publishing, Open Data, Big Data, pharma, and other fields.


3. Conclusions

Information overload is still one of the most important problems in the management of information in the public and private sectors, and it is one of the main hurdles in the process of achieving real data democracy. Visualization of complex information must be introduced more widely, and with a broader scope, in software applications to reduce information overload and empower end users. Specifically, MM automation is proposed as the best available solution for the visualization of complex information because of its important advantages over linear text and hyperlink-based web pages and applications. The fact that all the information needed is included in a single compressed file makes mind maps a very useful tool to gather and analyze the complex information needed to empower end users without the help of IT specialists.

To exemplify the advantages of MM automation in visualization, two areas are examined: Open Data and AI. Open Data is one of the areas where more information overload problems can be observed; the advantages and disadvantages of data openness seem to be consistently present at the same time, and visualization through MM automation is an efficient way to reduce these problems. AI offers some partial solutions to information overload, but unfortunately it also helps to increase information overload when the information it generates is of a complex nature. NLU is one of the areas of AI where this problem is most evident. When AI applications generate complex information, using linear text or hyperlink-based representation is not an efficient solution; MM automation seems to be the only real solution to visualize complex information in an effective way.

The combination of AI and MM seems to be very useful in some application cases and deserves further development. The application of MM in AI is very recent, so most of the applications are still in the project phase. I will include just a few examples of the most interesting applications from my point of view, considering that my expertise is greater in MM than in AI; other options should arise when experts in AI start analyzing the possibilities offered by MM automation. These examples are as follows: log file analysis in chatbot systems, intruder detection in cybersecurity, visualization of log files of the activity of self-driving cars, scene analysis, classification of documents, classification of images, AI-assisted visualization of electronic health records providing alerts, integration with AI-driven virtual nursing assistants, analysis of monitoring devices in people, vehicles, or machines, insurance recommender systems, financial portfolio monitoring, risk analysis for personal and business loans, and personalized hospital discharge information.


References

[1] Information Overload, cambridge.org, 2019 [Online]. Available: https://dictionary.cambridge.org/dictionary/english/information-overload.
[2] Ecclesiastes 12:12, biblehub.com, 450 BC [Online]. Available: https://biblehub.com/esv/ecclesiastes/12.htm.
[3] Seneca, "Treatises: On Providence, on Tranquility of Mind, on Shortness of Life, on Happy Life", goodreads.com, 2019 [Online]. Available: https://www.goodreads.com/quotes/3182665-even-for-studies-where-expenditure-is-most-honorable-it-is.
[4] E. Andrew-Gee, "Your Smartphone is Making You Stupid, Antisocial", The Globe and Mail, January 6, 2018 [Online]. Available: https://www.theglobeandmail.com/technology/your-smartphone-is-making-you-stupid/article37511900/.
[5] R. Kurzweil, The Law of Accelerating Returns, January 12, 2004 [Online]. Available: http://www.kurzweilai.net/kurzweils-law-aka-the-law-of-accelerating-returns.
[6] B.M. Gross, The Managing of Organizations: The Administrative Struggle, vol. 1, Free Press, New York, 1964, p. 864.
[7] A. Toffler, Future Shock, Random House, New York, 1970.
[8] M. Marien, Infoglut and competing problems. Key barriers suggesting a new strategy for sustainability, Futures 26 (1994) 2.
[9] D. Schenk, Data Smog: Surviving the Information Glut, HarperCollins Publishers, New York, 1997.
[10] A. Cornella, "Como Sobrevivir a la Infoxicación", docplayer.es, 2000 [Online]. Available: http://docplayer.es/9719171-%20Como-sobrevivir-a-lainfoxicacionalfons-cornella.html.
[11] P. Dias, "From "Infoxication" to "Infosaturation": A Theoretical Overview of the Cognitive and Social Effects of Digital Immersion", Ámbitos: Revista Internacional de Comunicación, núm. 24, Sevilla, España, 2014.
[12] S. Bell, The Infodiet: How Libraries can Offer an Appetizing Alternative to Google 50, The Chronicle of Higher Education, 2004, B15(24).
[13] R.S. Wurman, Information Anxiety, Doubleday, New York, 1989.
[14] J. Girard, M. Allison, Information anxiety: fact, fable or fallacy, Electron. J. Knowl. Manag. 6 (2) (2008) 111–124 [Online]. Available: https://www.researchgate.net/publication/228751372_Information_Anxiety_Fact_Fable_or_Fallacy.
[15] Data Age 2025. The Digitalization of the World. From Edge to Core. David Reinsel, John Gantz, John Rydning. An IDC White Paper #US44413318, Sponsored by Seagate.


[16] Gartner 2019a Documents, January 3, 2019 [Online]. Available: https://www.gartner.com/it-glossary/digitization/.
[17] Digitalization, gartner.com, 2019 [Online]. Available: https://www.gartner.com/it-glossary/digitization/.
[18] Digital Business Transformation, gartner.com, 2019 [Online]. Available: https://www.gartner.com/it-glossary/digital-business-transformation/.
[19] J. Bloomberg, Digitization, Digitalization, and Digital Transformation: Confuse Them At Your Peril, forbes.com, April 29, 2018 [Online]. Available: https://www.forbes.com/sites/jasonbloomberg/2018/04/29/digitization-digitalization-and-digital-transformation-confuse-them-at-your-peril/#4f3e00c22f2c.
[20] R. Armentaro, The Internet of Things, CRC Press, Boca Raton, FL, 2018.
[21] K. Ashton, "That 'Internet of Things' Thing", RFID Journal, June 22, 2009 [Online]. Available: https://www.rfidjournal.com/articles/view?4986.
[22] T. Stack, "Internet of Things (IoT) Data Continues to Explode Exponentially. Who is Using that Data and How?", cisco.com, February 5, 2018 [Online]. Available: https://blogs.cisco.com/datacenter/internet-of-things-iot-data-continues-to-explode-exponentially-who-is-using-that-data-and-how.
[23] 13 IoT Statistics Defining the Future of Internet of Things, NewGenApps, January 8, 2018 [Online]. Available: https://www.newgenapps.com/blog/iot-statistics-internet-of-things-future-research-data.
[24] Digital in 2018: World's Internet Users Pass the 4 Billion Mark, wearesocial.com, 2018 [Accessed June 8, 2019]. [Online]. Available: https://wearesocial.com/uk/blog/2018/01/global-digital-report-2018.
[25] "126 Amazing Social Media Statistics and Facts", brandwatch.com, May 5, 2019 [Online]. Available: https://www.brandwatch.com/blog/amazing-social-media-statistics-and-facts/.
[26] D. Craigen, et al., Defining cybersecurity, Technol. Innov. Manage. Rev. 4 (10) (2014).
[27] K. Kent, M. Souppaya, Guide to Computer Security Log Management, NIST Special Publication 800-92, 2006.
[28] "Best Practices for Audit, Log Review for IT Security Investigations", computerweekly.com, August 2011 [Online]. Available: https://www.computerweekly.com/tip/Best-practices-for-audit-log-review-for-IT-security-investigations.
[29] A.A. Cardenas, et al., Big data analytics for security, IEEE Secur. Privacy 11 (6) (2013) 74–76.
[30] "Total Number of Web Sites", Internetlivestats.com, 2019 [Online]. Available: http://www.internetlivestats.com/total-number-of-websites/#trend.
[31] Hosting Facts, Internet Stats and Facts for 2019, hostingfacts.com, 2019 [Online]. Available: https://hostingfacts.com/internet-facts-stats/.


[32] "Report: State of the Web", httparchive.org, 2019 [Online]. Available: https://httparchive.org/reports/state-of-the-web.
[33] "Number of E-mail Users Worldwide From 2017 to 2022 (in Millions)", statista.com, 2019 [Online]. Available: https://www.statista.com/statistics/255080/number-of-e-mail-users-worldwide/.
[34] "Number of Sent and Received E-mails Per Day Worldwide From 2017 to 2022 (in Billions)", statista.com, 2019 [Online]. Available: https://www.statista.com/statistics/456500/daily-number-of-e-mails-worldwide/.
[35] S. Whittaker, C. Sidner, Email overload: exploring personal information management of email, Proc. ACM CHI'96 Conf. Human Factors Comput. Syst. (1996) 276–283.
[36] "Global Spam Volume as Percentage of Total E-mail Traffic From January 2014 to December 2018, by Month", statista.com, 2018 [Online]. Available: https://www.statista.com/statistics/420391/spam-email-traffic-share/.
[37] N. Zeldes, et al., "Infomania: Why We Cannot Ignore it Anymore", firstmonday.org, August 6, 2007 [Online]. Available: https://firstmonday.org/ojs/index.php/fm/rt/printerFriendly/1973/1848#z6.
[38] "What is Open?", okfn.org, 2018 [Online]. Available: https://okfn.org/opendata/.
[39] N. Huijboom, T. Van den Broek, "Open data: an international comparison of strategies", Eur. J. ePractice, No. 12, March/April 2011.
[40] "Open Data Impact Map", opendataimpactmap.org, 2019 [Online]. Available: https://opendataimpactmap.org/.
[41] "Data Collaboratives", datacollaboratives.org, 2019 [Online]. Available: http://datacollaboratives.org/.
[42] Open Data, Driving Growth, Ingenuity and Innovation, deloitte.com, 2019 [Online]. Available: http://www2.deloitte.com/content/dam/Deloitte/uk/Documents/deloitte-analytics/open-data-driving-growth-ingenuity-and-innovation.pdf.
[43] "Uber Movement", uber.com, 2019 [Online]. Available: https://movement.uber.com/.
[44] "A Comprehensive List of Open Data Portals from Around the World", datacatalogs.org, 2019 [Online]. Available: http://datacatalogs.org/.
[45] "What are Linked Data and Linked Open Data?", ontotext.com, 2019 [Online]. Available: https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/.
[46] "Push Technology", gartner.com, 2019 [Online]. Available: https://www.gartner.com/it-glossary/push-technology.
[47] "Pointcast and its Wannabies", fortune.com, 1996 [Online]. Available: http://archive.fortune.com/magazines/fortune/fortune_archive/1996/11/25/218683/index.htm.


[48] "Networks Strained by Push", cnet.com, 1997 [Online]. Available: https://www.cnet.com/news/networks-strained-by-push/.
[49] N.K. Herther, Push and the politics of the internet, Electron. Libr. 16 (2) (1998) 109–114.
[50] A. Edmunds, A. Morris, The problem of information overload in business organizations: a review of the literature, Int. J. Inf. Manag. 20 (2000) 17–28.
[51] P. Persson, "Attention Manipulation and Information Overload", IFN Working Paper, No. 995, Research Institute of Industrial Economics (IFN), Stockholm, SWE, 2013.
[52] J.W. De Fockert, The role of working memory in visual selective attention, Science 291 (March 2, 2001) 1803–1806.
[53] C. Roda, Human Attention in Digital Environments, Cambridge University Press, Cambridge, UK, 2011.
[54] "Spam Statistics and Facts", spamlaws.com, 2019 [Online]. Available: https://www.spamlaws.com/spam-stats.html.
[55] "Spambot", techopedia.com, 2019 [Online]. Available: https://www.techopedia.com/definition/10889/spambot.
[56] MOOC, futurelearn.com, 2019 [Online]. Available: https://about.futurelearn.com/blog/what-is-a-mooc-futurelearn.
[57] S. Downes, "The Connectivism and Connective Knowledge Course", slideshare.com, 2008 [Online]. Available: https://www.slideshare.net/Downes/cck08the-connectivism-connective-knowledge-course.
[58] D. Shah, "By the Numbers: MOOCs in 2018", classcentral.com, 2018 [Online]. Available: https://www.class-central.com/report/mooc-stats-2018/.
[59] "Information Overload Research Group (IORG)", iorgforum.org, 2019 [Online]. Available: https://iorgforum.org/.
[60] S. Devaraj, R. Kohli, Performance impacts of information technology: is actual usage the missing link? Manag. Sci. 49 (3) (2003) 273–289.
[61] E. Brynjolfsson, The productivity paradox of information technology, Commun. ACM 36 (12) (December 1993) 66–77.
[62] Digital Wildfires in a Hyperconnected World, World Economic Forum, 2013 [Online]. Available: http://reports.weforum.org/global-risks-2013/risk-case-1/digital-wildfires-in-a-hyperconnected-world/.
[63] V.F. Hendricks, M. Vestergaard, Reality Lost. Markets of Attention, Misinformation and Manipulation, Springer Open, New York, 2019.
[64] "Joining the Dots. Decision Making for a New Era", cgma.org, February 2, 2016 [Online]. Available: https://www.cgma.org/resources/reports/joining-the-dots.html.
[65] A. Fink, Conducting Research Literature Reviews, fourth ed., SAGE, Los Angeles, USA, 2014.


[66] C. Rapple, "The Role of the Critical Review Article in Alleviating Information Overload", annualreviews.org, 2011 [Online]. Available: http://www.annualreviews.org/userimages/ContentEditor/1300384004941/Annual_Reviews_WhitePaper_Web_2011.pdf.
[67] EuropeanDataPortal.eu Documents, What is Open Data?, 2019 [Online]. Available: https://www.europeandataportal.eu/elearning/en/module1/#/id/co-01.
[68] Wikipedia, The Free Encyclopedia, 2019 [Online]. Available: https://en.wikipedia.org/wiki/Main_Page.
[69] J. Voss, "Measuring Wikipedia", Proceedings of the ISSI 2005 Conference.
[70] Learn about DBPedia, 2019 [Online]. Available: https://wiki.dbpedia.org/about.
[71] S. Auer, et al., DBpedia: a nucleus for a web of open data, in: K. Aberer, et al. (Eds.), The Semantic Web. ISWC 2007, ASWC, Lecture Notes in Computer Science, vol. 4825, Springer, Berlin, Heidelberg, 2007.
[72] "Definition of Search Engine", ScienceDaily, 2019 [Online]. Available: https://www.sciencedaily.com/terms/search_engine.htm.
[73] "Alan Emtage", InternetHallofFame.org [Online]. Available: https://www.internethalloffame.org/inductees/alan-emtage.
[74] "Search Engine Market Share", NetMarketShare.com, 2019 [Online]. Available: https://www.netmarketshare.com/search-engine-market-share.aspx?qprid=4&qpcustomd=0.
[75] The Best Search Engines of 2019, lifewire.com, 2019 [Online]. Available: https://www.lifewire.com/best-search-engines-2483352.
[76] "We Put Siri, Alexa, Google Assistant, and Cortana through a Marathon of Tests to See Who's Winning the Virtual Assistant Race – Here's what We Found", BusinessInsider.com, 2016 [Online]. Available: https://www.businessinsider.com/siri-vs-google-assistant-cortana-alexa-2016-11?IR=T.
[77] N. Goksel-Canbel, M.E. Mutlu, On the track of artificial intelligence: learning with intelligent personal assistants, Int. J. Hum. Sci. 13 (1) (2016).
[78] P. Resnick, et al., GroupLens: an open architecture for collaborative filtering of netnews, in: Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, NC, 1994, pp. 175–186.
[79] D. Goldberg, et al., Using collaborative filtering to weave an information tapestry, Commun. ACM 35 (12) (1992) 61–70.
[80] "Cave Art", Britannica.com, 2019 [Online]. Available: https://www.britannica.com/art/cave-painting.
[81] Egyptian Hieroglyphs, Ancient History Encyclopedia, 2015 [Online]. Available: https://www.ancient.eu/Egyptian_Hieroglyphs/.
[82] Isotype: International Picture Language, Isotyperevisited.org, 2012 [Online]. Available: http://isotyperevisited.org/2012/08/introduction.html.


[83] J.M. Guerrero, "Data Visualization of Complex Information through Mind Mapping in Spain and the European Union", in: Federal Data Science, Academic Press, Orlando, FL, USA, 2018.
[84] "Introduction to MindManager", mindjet.com, 2014 [Online]. Available: https://www.mindjet.com/video/introduction-to-mindmanager/.
[85] "MindManager Enterprise Overview", mindjet.com, 2019 [Online]. Available: https://www.mindjet.com/video/mindmanager-enterprise-overview/.
[86] Mindjet Documents, MindManager Windows, 2015 [Online]. Available: https://www.mindjet.com/mindmanager-windows/?nav=p-mmw.
[87] F.-Y. Wang, Moving Towards Complex Intelligence?, 2009 [Online]. Available: https://ieeexplore.ieee.org/document/5172882.
[88] "Google App", Google, 2019 [Online]. Available: http://www.google.com/.
[89] P. Lambert, Write Emails Faster with Smart Compose for Gmail, Gmail, 2018 [Online]. Available: https://www.blog.google/products/gmail/subject-write-emails-faster-smart-compose-gmail/.
[90] Knowmail, 2019 [Online]. Available: https://www.knowmail.me/.
[91] "Welcome to the Future of Decision Making", Accrete.ai, 2019 [Online]. Available: https://www.accrete.ai/.
[92] "Information Overload Is Killing Returns. AI Is Helping to Change that", ibm.com, 2018 [Online]. Available: https://www.ibm.com/blogs/watson/2018/02/information-overload-is-killing-returns-ai-is-helping-to-change-that/.
[93] Slack.com, 2019 [Online]. Available: https://www.slack.com/.
[94] "Sapho Employee Experience Portal", sapho.com, 2019 [Online]. Available: https://www.sapho.com/employee-experience/.
[95] "Siri", apple.com, 2019 [Online]. Available: https://www.apple.com/siri/.
[96] Cortana. Personal Digital Assistant, microsoft.com, 2019 [Online]. Available: https://www.microsoft.com/en-us/cortana.
[97] "Alexa", amazon.com, 2019 [Online]. Available: https://www.amazon.com/Amazon-Echo-And-Alexa-Devices/b?ie=UTF8&node=9818047011.
[98] "eBrevia. Contract Analytics", ebrevia.com, 2019 [Online]. Available: https://ebrevia.com/.
[99] J.M. Guerrero, P. Ramos, Introduction to the Applications of Mind Mapping in Medicine, IMedPub, London, UK, 2015.
[100] J.M. Guerrero, Introducción a la Técnica de Mapas Mentales, Editorial UOC, Barcelona, SP, 2016.
[101] J.M. Guerrero, "Open Data and Mind Mapping", slideshare.com, 2013 [Online]. Available: https://www.slideshare.net/jmgf2009/open-datamm.
[102] EU Open Data Portal Documents, European Union, 2019 [Online]. Available: https://data.europa.eu/euodp/en/home.


[103] Visualisation Catalogue, EU Open Data Portal, 2019 [Online]. Available: https://data.europa.eu/euodp/en/visualisation-home.
[104] J.M. Guerrero, "Big Open Data in Medicine with R and Mind Mapping", slideshare.com, 2014 [Online]. Available: https://www.slideshare.net/jmgf2009/big-open-data-in-medicine-with-r-and-mind-mapping.
[105] Medicare Provider Utilization and Payment Data, CMS.gov, 2019 [Online]. Available: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index.html.
[106] Microsoft SharePoint, Microsoft [Online]. Available: https://products.office.com/en-us/sharepoint/collaboration.
[107] "Data Analysis with AI-Driven Cognos Analytics", ibm.com, 2019 [Online]. Available: https://www.ibm.com/uk-en/analytics/business-intelligence.
[108] N. Elmqvist, et al., Rolling the dice: multidimensional visual exploration using scatterplot matrix navigation, IEEE Trans. Vis. Comput. Graph. 14 (6) (November–December 2008) 1539-1148.
[109] H.M. Wu, et al., Matrix Visualization, 2008 [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-540-33037-0_26.
[110] InfoVis Documents, "Parallel Coordinate Visualization", InfoVis Cyberinfrastructure, 2004 [Online]. Available: http://iv.slis.indiana.edu/sw/parallel.html.
[111] L. Wilkinson, et al., "Graph-Theoretic Scagnostics", rgrossman.com, 2005 [Online]. Available: http://papers.rgrossman.com/proc-094.pdf.
[112] I. Borg, J.P.F. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer, New York, NY, 2013.
[113] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
[114] L. Van der Maaten, "Visualizing Data Using t-SNE", youtube.com, 2013 [Online]. Available: https://www.youtube.com/watch?v=RJVL80Gg3lA.
[115] "Lobe: Deep Learning for Everyone", lobe.ai, 2019 [Online]. Available: https://lobe.ai/.
[116] "Machine Learning Studio", Microsoft, 2019 [Online]. Available: https://azure.microsoft.com/en-us/services/machine-learning-studio/.
[117] "IBM Watson Assistant", ibm.com, 2019 [Online]. Available: https://www.ibm.com/cloud/watson-assistant/.
[118] TensorBoard: Graph Visualization, tensorflow.org, 2019 [Online]. Available: https://www.tensorflow.org/guide/graph_viz.
[119] "Deep Learning Studio", Deepcognition.ai, 2019 [Online]. Available: https://deepcognition.ai/.
[120] "EZDL Custom Training and Service Platform", baidu.com, 2019 [Online]. Available: http://ai.baidu.com/ezdl/.


[121] "IBM Watson Natural Language Understanding", ibm.com, 2019 [Online]. Available: https://www.ibm.com/watson/services/natural-language-understanding/.
[122] "Google Natural Language API", google.com, 2019 [Online]. Available: https://cloud.google.com/natural-language/.
[123] P. Hemp, "Death by Information Overload", Harvard Business Review, September 2009 [Online]. Available: https://hbr.org/2009/09/death-by-information-overload.
[124] "Johannes Gutenberg", biographyonline.net, 2019 [Online]. Available: https://www.biographyonline.net/business/j-gutenberg.html.
[125] "IBM Watson Natural Language Understanding Demo", ibm.com, 2019 [Online]. Available: https://natural-language-understanding-demo.ng.bluemix.net/.
[126] DBPedia, 2019 [Online]. Available: https://wiki.dbpedia.org/.
[127] J.M. Guerrero, HTML5 Mind Map Created by Text Processing Using IBM Watson NLU, drive.google.com, 2018 [Online]. Available: https://drive.google.com/file/d/1ERfiV-IfaxGuTvJ8hZpXE4Ib1s-ETsME/view?usp=sharing.
[128] J.M. Guerrero, Mind Mapping Automation Applications, slideshare.com [Online]. Available: https://www.slideshare.net/jmgf2009/presentations.

5

Foundations of data imbalance and solutions for a data democracy

Ajay Kulkarni 1, Deri Chong 2, Feras A. Batarseh 3
1 THE DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES, COLLEGE OF SCIENCE, GEORGE MASON UNIVERSITY, FAIRFAX, VA, UNITED STATES; 2 VOLGENAU SCHOOL OF ENGINEERING, GEORGE MASON UNIVERSITY, FAIRFAX, VA, UNITED STATES; 3 GRADUATE SCHOOL OF ARTS & SCIENCES, DATA ANALYTICS PROGRAM, GEORGETOWN UNIVERSITY, WASHINGTON, D.C., UNITED STATES

In the end, it's all a question of balance.
Rohinton Mistry

Abstract
Dealing with imbalanced data is a prevalent problem while performing classification on datasets. Many times, this problem contributes to bias while making decisions or implementing policies. Thus, it is vital to understand the factors which cause imbalance in the data (or class imbalance). Such hidden biases and imbalances can lead to data tyranny and pose a major challenge to a data democracy. In this chapter, two essential statistical elements are examined: the degree of class imbalance and the complexity of the concept; addressing such issues helps in building the foundations of a data democracy. Furthermore, statistical measures which are appropriate in these scenarios are discussed and implemented on a real-life dataset (car insurance claims). In the end, popular data-level methods such as random oversampling, random undersampling, synthetic minority oversampling technique, Tomek link, and others are implemented in Python, and their performance is compared.

Keywords: Complexity of the concept; Degree of class imbalance; Imbalanced data; Statistical assessment metrics; Undersampling and oversampling.

1. Motivation and introduction

In the real world, data are collected from various sources like social networks, websites, logs, and databases. While dealing with data from different sources, it is very crucial to check the quality of the data [1]. Data


with questionable quality can introduce different types of biases in various stages of the data science lifecycle. These biases can sometimes affect the association between variables, and in many cases they could represent the opposite of the actual behavior [2]. One of the causes that introduces bias in the interpretation of results is data imbalance, a problem that especially occurs while performing classification [3]. It is also noted that, most of the time, data suffer from the problem of imbalance, which means that one of the classes has a much higher percentage of examples than the other [4]. In simple words, a dataset with unequal class distribution is defined as an imbalanced dataset [5]. This issue is widespread, especially in binary (or two-class) classification problems. In such scenarios, the class which has the majority of instances is considered the majority class or negative class, and the underrepresented class is viewed as the minority class or positive class. These kinds of datasets create difficulties in information retrieval and filtering tasks [6], which ultimately result in poor quality of the associated data. Many researchers have studied applications of class imbalance, such as fraudulent telephone calls [7], telecommunications management [8], text classification [4,9–11], and detection of oil spills in satellite images [12].

Machine learning and data mining algorithms for classification are built on two assumptions: the goal is to maximize output accuracy, and the test data are drawn from the same distribution as the training data. In the case of imbalanced data, one or both of these assumptions get violated [13]. Let us consider the example of fraud detection to understand the issue of imbalanced data (or class imbalance) more clearly. Suppose there is a classifier system which predicts fraudulent transactions based on the provided data. In real life, there will always be a small percentage of people who make fraudulent transactions, and most of the instances in the data will be nonfraudulent transactions. In that case, the given classifier will treat many of the fraudulent transactions as nonfraudulent because of the unequal class distribution in the data. The classifier system will achieve a high accuracy but poor performance. Therefore, it is critical to deal with this issue and build accurate systems before focusing on the selection of classification methods. The purpose of this chapter is to understand and implement common approaches to solve this issue. The next section of this chapter focuses on the causes of imbalanced data and how accuracy (and traditional quality measures) can be a misleading metric in these cases. After that, different solutions to tackle this issue are discussed and implemented.


2. Imbalanced data basics

The previous section introduced the meaning of positive class, negative class, and the need to deal with imbalanced data. In this section, the focus will be on the factors which create difficulties in analyzing an imbalanced dataset. Based on the research of Japkowicz et al. [14], the imbalance problem is dependent on four factors: the degree of class imbalance, the complexity of the concept represented by the data, the overall size of the training data, and the type of classifier. Japkowicz et al. [14] conducted experiments by using sampling techniques and then compared results using different classification techniques to evaluate the effects of class imbalance. These experiments indicated the importance of the complexity of the concept and showed that the classification of a domain would not be affected by a huge class imbalance if the concept is easy to learn. Therefore, the focus of the solution ought to be on understanding two essential factors: the degree of class imbalance and the complexity of the concept.

2.1 Degree of class imbalance

The degree of class imbalance can be represented by defining the ratio of the positive class to the negative class. Another way to understand the degree of imbalance is to calculate the imbalanced ratio (IR) [5], for which the formula is given below:

$$\mathrm{IR} = \frac{\text{Total number of negative class examples}}{\text{Total number of positive class examples}}$$

Let us say there are 5000 examples that belong to the negative class and 1000 examples that belong to the positive class; then we can denote the degree of class imbalance as 1:5, i.e., there is one positive class example for every five negative class examples. Using the above formula, IR can also be calculated as follows:

$$\mathrm{IR} = \frac{\text{Total number of negative class examples}}{\text{Total number of positive class examples}} = \frac{5000}{1000} = 5$$

Thus, IR for the given dataset would be 5. In this way, the degree of class imbalance can provide information about data imbalance and can help structure the strategy for dealing with it.


FIGURE 5.1 Example of (A) class overlap and (B) Small disjuncts.

2.2 Complexity of the concept

The complexity of the concept is mainly affected by class overlap and small disjuncts. Class overlap occurs when examples from both classes are mixed to some degree in the feature space [5]. Fig. 5.1A shows class overlap, in which circles represent the negative class and triangles represent the positive class. Another factor which affects the complexity of the concept is small disjuncts, shown in Fig. 5.1B. Small disjuncts occur when the concept represented by the minority class is formed of subconcepts [5,15]. It can easily be seen from the figure that class overlap and small disjuncts introduce more complexity into the system, which affects class separability. This results in more complex classification rules and, ultimately, misclassification of positive class instances. Now let us focus on different approaches which can help to deal with these imbalance issues. The approaches presented in the next section help to improve the quality of the data for better analysis and improved overall results for data science.
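To make these two factors concrete, the following sketch (not part of the original chapter) generates a synthetic two-class dataset in which both the degree of imbalance and the amount of class overlap can be controlled; all parameter values are illustrative.

```python
# Minimal sketch: generate an imbalanced two-class dataset with controllable
# class overlap. `weights` sets the degree of imbalance; `class_sep` controls
# how much the two classes overlap in feature space.
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    n_informative=5,
    weights=[0.95, 0.05],   # roughly 19 negatives per positive
    class_sep=0.5,          # small value -> heavy class overlap
    random_state=0)

counts = Counter(y)
print("class counts:", counts)
print("imbalanced ratio (IR):", counts[0] / counts[1])
```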

3. Statistical assessment metrics

This section outlines different statistical assessment metrics and various approaches to handle imbalanced data. Throughout this section, the "Porto Seguro's Safe Driver Prediction" dataset [16] is used for implementing the various techniques, which are available in Python. The section starts with statistical assessment metrics that provide insights into classification models.

3.1 Confusion matrix

The confusion matrix is a very popular measure used while solving classification problems. It can be applied to binary classification as well as to multiclass classification problems. An example of a confusion matrix for binary classification is shown in Table 5.1.


Table 5.1 Confusion matrix for binary classification.

                          Predicted
                          Negative      Positive
Actual      Negative      TN            FP
            Positive      FN            TP

Confusion matrices represent counts of predicted and actual values. The output "TN" stands for True Negative, which shows the number of negative examples classified accurately. Similarly, "TP" stands for True Positive, which indicates the number of positive examples classified accurately. The term "FP" shows the False Positive value, i.e., the number of actual negative examples classified as positive; and "FN" means the False Negative value, which is the number of actual positive examples classified as negative. One of the most commonly used metrics in classification is accuracy. The accuracy of a model (through a confusion matrix) is calculated using the formula below:

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + FN + TP}$$

Accuracy can be misleading if used with imbalanced datasets, and therefore there are other metrics based on the confusion matrix which can be useful for evaluating performance. In Python, the confusion matrix can be obtained using the "confusion_matrix()" function, which is part of the "sklearn" library [17]. This function can be imported into Python using "from sklearn.metrics import confusion_matrix." To obtain the confusion matrix, users need to provide the actual values and predicted values to the function.
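A minimal usage sketch with made-up labels (not the chapter's insurance data) is shown below.

```python
# Minimal sketch: confusion matrix for a handful of made-up labels.
from sklearn.metrics import confusion_matrix

y_actual    = [0, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [0, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))
```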

3.2 Precision and recall

Precision and recall are widely used and popular metrics for classification. Precision shows how accurate the model is at predicting positive values; thus, it measures the accuracy of a predicted positive outcome [18]. It is also known as the positive predictive value. Recall measures the strength of a model at predicting positive outcomes [18], and it is also known as the sensitivity of a model. Both measures provide valuable information, but the objective is to improve recall without affecting precision [3]. Precision and recall values can be calculated in Python using the "precision_score()" and "recall_score()" functions, respectively.

88 Data Democracy

Both of these functions can be imported from "sklearn.metrics" [17]. The formulas for calculating precision and recall are given below:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

3.3 F-measure and G-measure

F-measure, also known as F-value, uses both the precision score and the recall score of a classifier, and it is another commonly used metric in classification settings. F-measure is calculated as a weighted harmonic mean of precision and recall. For the classification of positive instances, it helps to understand the tradeoff between correctness and coverage [5]. The general formula for calculating F-measure is given below:

$$F_\beta = (1 + \beta^2)\cdot\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision} + \text{recall}}$$

In the above formulation, the importance of each term can be adjusted using different values for β. The most commonly used value for β is 1, which is known as the F-1 measure. G-measure is similar to F-measure, but it uses the geometric mean instead of the harmonic mean:

$$F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}}$$

$$\text{G-measure} = \sqrt{\text{precision}\cdot\text{recall}}$$

In Python, F1-scores can be calculated using “f1_score()” function from “sklearn.metrics” [17], and G-measure scores can be calculated using the “geometric_mean_score()” function from the “imblearn.metrics” library [19].
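Continuing the toy example from the confusion matrix sketch above, the scores can be computed as follows. Note that, in imbalanced-learn, geometric_mean_score() is defined on the class-wise recalls (sensitivity and specificity), so the geometric mean of precision and recall defined above is also computed directly for comparison.

```python
# Minimal sketch: precision, recall, F1, and geometric-mean style scores
# for the same made-up labels used in the confusion matrix example.
from math import sqrt
from sklearn.metrics import precision_score, recall_score, f1_score
from imblearn.metrics import geometric_mean_score

y_actual    = [0, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [0, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_actual, y_predicted)
recall = recall_score(y_actual, y_predicted)

print("precision:", precision)
print("recall   :", recall)
print("F1       :", f1_score(y_actual, y_predicted))
# Geometric mean of precision and recall (the G-measure defined above).
print("G-measure:", sqrt(precision * recall))
# imblearn's G-mean: geometric mean of sensitivity and specificity.
print("G-mean   :", geometric_mean_score(y_actual, y_predicted))
```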

3.4 Receiver operating characteristic curve and area under the curve

The ROC (receiver operating characteristic) curve is used for evaluating the performance of a classifier. It is plotted with the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis. A range of thresholds from 0 to 1 is defined for a classifier to perform classification, and for every threshold, FPR and TPR are plotted against each other. An example of an ROC curve is shown in Fig. 5.2.

Chapter 5  Foundations of data imbalance and solutions for a data democracy

89

FIGURE 5.2 Receiver operating characteristic (ROC) curve and area under the curve (AUC = 0.62); the curve plots the false positive rate against the true positive rate.

The top left corner of the ROC plot represents good classification, while the lower right corner represents poor classification; a classifier is said to be good if its curve reaches toward the top left corner [5]. The diagonal in the plot represents random guessing. If the ROC curve of any classifier is below the diagonal, then that classifier is performing worse than random guessing [5], which entirely defeats the purpose. Therefore, it is expected that the ROC curve should always be in the upper diagonal. The ROC curve provides a graphical representation of a classifier, but it is always a good idea to also calculate a numerical score. It is common practice to calculate the area under the curve (AUC) of the ROC curve. The AUC value is a score between 0 and 1: any classifier whose ROC curve lies in the lower diagonal will get an AUC score of less than 0.5, a curve in the upper diagonal will get an AUC score higher than 0.5, and an ideal classifier, whose curve touches the upper left corner of the plot, will get an AUC score of 1. Python provides a function "roc_curve()" to get the FPR, TPR, and thresholds for the ROC, and AUC scores can be calculated using the "auc()" function, for which users need to provide the FPR and TPR of a classifier. Both of these functions can be imported from "sklearn.metrics" in Python.

Note: In some of the literature, it is mentioned that the precision–recall curve and cost measures (cost matrix and cost-sensitive curves) can be good performance measures in the case of imbalanced data. Interested readers can refer to the following resources for more information about those topics.

- N.V. Chawla, "Data mining for imbalanced datasets: An overview," in Data Mining and Knowledge Discovery Handbook, pp. 875–886, 2009

90 Data Democracy

- H. He and E.A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge & Data Engineering, vol. 9, pp. 1263–1284, 2008
- F. Alberto, S. Garcia, M. Galar, R. Prati, B. Krawczyk, and F. Herrera, "Learning from Imbalanced Data Sets," Springer, 2018.

3.5 Statistical assessment of the insurance dataset

In this subsection, different statistical assessment metrics are applied to the "Porto Seguro's Safe Driver Prediction" dataset. These data contain information which can be useful for predicting whether a driver will file an insurance claim next year or not. The source for this dataset is www.kaggle.com; more information can be found in Ref. [16]. The dataset contains 59 columns and around 595k rows. Every row represents the details of a policy holder, and the target column shows whether a claim was filed (indicated by "1") or not (indicated by "0"). Before applying the different statistical assessment measures, feature selection was performed on all 59 columns, which resulted in the selection of 22 columns: 'target', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_15', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_03_cat', 'ps_car_07_cat', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_15'. In the dataset, there are 573k rows with target 0 and 21k rows with target 1, which means that only 21k out of 595k drivers filed a claim. We can easily see the imbalance in this dataset by observing these numbers: the degree of class imbalance, calculated using the formula given in Section 2, is 26.44. The Python code [20] for calculating IR is shown below (Figs. 5.3–5.6). Furthermore, logistic regression is used to perform classification on the dataset, and it can be observed that the accuracy of the model is 96%. In the next step, the confusion matrix and other statistical assessment metrics were calculated in Python. The results obtained from the analysis are summarized in Tables 5.2 and 5.3. It can be observed from these results that the model has seemingly high-quality results, but it is unable to classify any instance as the positive class; precision, recall, F1 score, and G-mean score values are all 0. Thus, it is a strong indication that balancing the classes is critical before making further predictions.
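Figs. 5.3–5.6 appear in the book as screenshots of the authors' notebook. The sketch below follows the same steps (IR calculation, logistic regression, assessment metrics, and the ROC/AUC plot), but details such as the train/test split and the omission of feature selection are assumptions rather than the authors' exact code.

```python
# Sketch of the steps behind Figs. 5.3-5.6: imbalanced ratio, logistic
# regression, assessment metrics, and the ROC curve.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_curve, auc)
from sklearn.model_selection import train_test_split

data = pd.read_csv("train.csv")          # Porto Seguro training file from Kaggle
X = data.drop(columns=["id", "target"])  # feature selection omitted for brevity
y = data["target"]

# Degree of class imbalance (IR): negatives per positive.
print("IR =", (y == 0).sum() / (y == 1).sum())   # roughly 26.44, as reported in the text

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))
print("F1       :", f1_score(y_test, y_pred, zero_division=0))

# ROC curve and AUC from the predicted probabilities (cf. Fig. 5.6).
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, label="AUC = %.2f" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc_curve.png")
```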


FIGURE 5.3 Calculation of imbalanced ratio.

FIGURE 5.4 Implementation of logistic regression and accuracy of the model.

FIGURE 5.5 Calculation of confusion matrix and other statistical assessment metrics.


FIGURE 5.6 The receiver operating characteristic curve and area under the curve (AUC = 0.62) for the logistic regression model.

Table 5.2 Confusion matrix for the Porto Seguro's safe driver prediction dataset.

                          Predicted
                          Negative      Positive
Actual      Negative      172,085       0
            Positive      6,479         0

Table 5.3 Other results for the Porto Seguro's safe driver prediction dataset.

Statistical assessment metrics      Result
Accuracy                            0.96
Precision                           0
Recall                              0
F1 score                            0
G-mean score                        0


4. How to deal with imbalanced data

The previous section explained the risk of using imbalanced data for prediction without preprocessing, which reflects the critical need to preprocess the data. This part of the chapter concentrates on data-level methods, i.e., sampling methods. Sampling methods can be divided into three categories: undersampling, oversampling, and hybrid methods. For undersampling and oversampling, three different approaches will be explored. Furthermore, these methods will be implemented on the Porto Seguro's Safe Driver Prediction dataset and compared with the results from the previous section. In general, sampling methods modify an imbalanced dataset to make it more balanced or to produce a more adequate data distribution for learning tasks [5]. In Python, these sampling methods can be implemented using the "imbalanced-learn" library [19]; the documentation and installation details of the "imbalanced-learn" library can be found in Ref. [21].

4.1 Undersampling

Undersampling methods eliminate majority class instances in the data to make the data more balanced. In some literature, undersampling is also called "downsizing." There are different methods which can be used for undersampling, but three commonly used methods are covered in this subsection: random undersampling, Tomek links, and edited nearest neighbors (ENN).

4.1.1 Random undersampling

Random undersampling is a nonheuristic undersampling method [5]. It is one of the simplest methods and is generally used as a baseline. In random undersampling, instances from the negative (majority) class are selected at random and removed until their count matches that of the positive (minority) class. Because the majority class instances are selected randomly, the technique is known as "random undersampling." The result is a balanced dataset consisting of an equal number of positive and negative class examples (Figs. 5.7 and 5.8).

Random undersampling can be performed by importing "RandomUnderSampler" from "imblearn.under_sampling." In this example, an object "rus" is created for "RandomUnderSampler()", and the "fit_resample()" method is then called with the data (as "x") and the labels (as "y"). To reproduce the results, "random_state = 0" is passed to "RandomUnderSampler()". The results are stored in "X_resampled" (all of the columns except the target) and "Y_resampled" (the target column).
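A minimal sketch of that step, along the lines of what Fig. 5.7 shows (x and y are the feature matrix and target vector named in the text; the Counter check is an addition for illustration):

    from collections import Counter
    from imblearn.under_sampling import RandomUnderSampler

    # Randomly drop majority-class rows until both classes have the same count.
    rus = RandomUnderSampler(random_state=0)
    X_resampled, Y_resampled = rus.fit_resample(x, y)

    # Both classes should now contain 21,694 instances each.
    print(Counter(Y_resampled))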

FIGURE 5.7 Random undersampling implementation in Python.

FIGURE 5.8 Receiver operating characteristic and area under the curve after performing random undersampling (AUC = 0.63).

After performing random undersampling, there are equal numbers of positive (21,694) and negative (21,694) class instances in the dataset. The results obtained after performing logistic regression on the processed data are summarized in Table 5.4.

Table 5.4  Confusion matrix for Porto Seguro's safe driver prediction dataset.

                     Predicted negative    Predicted positive
  Actual negative    4115                  2402
  Actual positive    2878                  3622

Table 5.5  Other results for Porto Seguro's safe driver prediction dataset.

  Statistical assessment metric    Result
  Accuracy                         0.59
  Precision                        0.56
  Recall                           0.60
  F1 score                         0.58
  G-mean score                     0.59

After random undersampling, the accuracy drops to 59%, but the model is now able to classify positive class instances. The precision, recall, F1 score, and G-mean score values are no longer 0, although there is still considerable room to improve the performance of the model (Table 5.5).

4.1.2 Tomek link

Tomek links form a heuristic undersampling technique based on a distance measure. A Tomek link is established between instances from two different classes, and these links are then used to remove majority class instances [22]. The conceptual working of Tomek links, adapted from Fernández et al. [5], is as follows:

(1) Consider two examples Ai and Aj, where Ai can be represented as (xi, yi) and Aj as (xj, yj).
(2) The distance between Ai and Aj is denoted d(Ai, Aj).
(3) A pair (Ai, Aj) is said to form a Tomek link if there is no example Al such that d(Ai, Al) < d(Ai, Aj) or d(Aj, Al) < d(Ai, Aj).
(4) After identifying the Tomek links, the instance belonging to the majority class is removed, and the instances from the minority class are kept in the dataset (Figs. 5.9 and 5.10).

Tomek links can be implemented in Python using the "TomekLinks()" function from "imblearn.under_sampling." One of the parameters of the function is "sampling_strategy," which defines the sampling strategy. The default strategy is "auto," which means that all classes except the minority class are resampled. Additional details about the function parameters can be found in the documentation.

FIGURE 5.9 Tomek link implementation in Python.
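A minimal sketch of that call, following the same conventions (and the same assumed x and y) as the previous example:

    from collections import Counter
    from imblearn.under_sampling import TomekLinks

    # Remove the majority-class member of every Tomek link.
    tl = TomekLinks(sampling_strategy="auto")
    X_resampled, Y_resampled = tl.fit_resample(x, y)

    # Only a modest number of majority instances are removed;
    # the minority class is left untouched.
    print(Counter(Y_resampled))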

FIGURE 5.10 Receiver operating characteristic and area under the curve after performing Tomek links (AUC = 0.62).

After applying Tomek links, there are still far fewer positive class instances (21,694) than negative class instances (563,777). The results obtained after performing logistic regression on the processed data are summarized in Table 5.6. After performing Tomek link removal, the observed accuracy is 96%, but the model is unable to classify positive class instances. It can also be noted that the precision, recall, F1 score, and G-mean score values are 0. Thus, in this scenario, Tomek links are not a suitable strategy for balancing the class distribution (Table 5.7).

Table 5.6  Confusion matrix for Porto Seguro's safe driver prediction dataset.

                     Predicted negative    Predicted positive
  Actual negative    169,215               0
  Actual positive    6427                  0

Table 5.7  Other results for Porto Seguro's safe driver prediction dataset.

  Statistical assessment metric    Result
  Accuracy                         0.96
  Precision                        0
  Recall                           0
  F1 score                         0
  G-mean score                     0


4.1.3 Edited nearest neighbors

ENN is the third method in the undersampling category. In ENN, instances that belong to the majority class are removed based on their K-nearest neighbors [23]. For example, if three neighbors are considered, each majority class instance is compared with its closest three neighbors; if most of those neighbors belong to the minority class, the instance is removed from the dataset (Figs. 5.11 and 5.12).

ENN can be implemented in Python using the "EditedNearestNeighbours()" function from "imblearn.under_sampling." It also has a "sampling_strategy" parameter for defining the sampling strategy, and the default is "auto," which means that all classes except the minority class are resampled. In addition, users can set the "n_neighbors" parameter, which controls the size of the neighborhood used to compute the nearest neighbors; its default value is 3. The results obtained after performing logistic regression on the processed data are summarized in Table 5.8. They indicate that the accuracy is 96%, but the model is unable to classify positive class instances, the same as with Tomek links.

FIGURE 5.11 Edited nearest neighbor implementation in Python.
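A minimal sketch, following the same conventions (and the same assumed x and y) as the earlier examples:

    from collections import Counter
    from imblearn.under_sampling import EditedNearestNeighbours

    # Remove majority-class instances whose neighborhood disagrees with their label.
    enn = EditedNearestNeighbours(n_neighbors=3)  # 3 is the default neighborhood size
    X_resampled, Y_resampled = enn.fit_resample(x, y)
    print(Counter(Y_resampled))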

FIGURE 5.12 Receiver operating characteristic and area under the curve after performing edited nearest neighbor (AUC = 0.64).

Table 5.8  Confusion matrix for Porto Seguro's safe driver prediction dataset.

                     Predicted negative    Predicted positive
  Actual negative    155,003               0
  Actual positive    6433                  0

Table 5.9  Other results for Porto Seguro's safe driver prediction dataset.

  Statistical assessment metric    Result
  Accuracy                         0.96
  Precision                        0
  Recall                           0
  F1 score                         0
  G-mean score                     0

Therefore, in this scenario, ENN is also not a suitable strategy for balancing the class distribution (Table 5.9). Note: There are other undersampling methods, such as Class Purity Maximization (CPM), Undersampling based on Clustering (SBC), the NearMiss approaches, and some advanced techniques using evolutionary algorithms. Interested readers can find more information about these techniques in F. Alberto, S. Garcia, M. Galar, R. Prati, B. Krawczyk, and F. Herrera, "Learning from Imbalanced Data Sets," Springer, 2018 [5].

4.2 Oversampling

This subsection focuses on oversampling methods, which work in the opposite way to undersampling. In oversampling, the goal is to increase the count of minority class instances to match the count of majority class instances; oversampling thus "upsizes" the minority class. In this section, three commonly used oversampling methods are covered: random oversampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN).

4.2.1 Random oversampling

Random oversampling is a nonheuristic method in the oversampling category [5]. In random oversampling, instances from the minority class are selected at random for replication, which results in a balanced class distribution. Because the minority class instances are chosen at random, the method is called "random" oversampling (Figs. 5.13 and 5.14).

FIGURE 5.13 Random oversampling implementation in Python.

FIGURE 5.14 Receiver operating characteristic and area under the curve after performing random oversampling (AUC = 0.63).

Random oversampling can be performed in Python using "RandomOverSampler()" from "imblearn.over_sampling." The function supports different parameters for resampling; the default is "auto," which is equivalent to not resampling the majority class. More information about the other parameters of "RandomOverSampler()" can be found in the documentation. After performing random oversampling, there are equal numbers of positive (573,518) and negative (573,518) class instances in the dataset. The results obtained after performing logistic regression on the processed data are summarized in Table 5.10. The accuracy is 59%, and the model is able to classify positive class instances after random oversampling. It can also be noted that the precision, recall, F1 score, and G-mean score values are less than or equal to 0.6. These results indicate that random oversampling provides more meaningful results than Tomek links and ENN (Table 5.11).
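A minimal sketch, with the same assumed variable names as before:

    from collections import Counter
    from imblearn.over_sampling import RandomOverSampler

    # Replicate minority-class rows at random until both classes are the same size.
    ros = RandomOverSampler(random_state=0)
    X_resampled, Y_resampled = ros.fit_resample(x, y)
    print(Counter(Y_resampled))  # both classes: 573,518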

Table 5.10  Confusion matrix for Porto Seguro's safe driver prediction dataset.

                     Predicted negative    Predicted positive
  Actual negative    108,341               63,542
  Actual positive    77,842                94,386

Table 5.11  Other results for Porto Seguro's safe driver prediction dataset.

  Statistical assessment metric    Result
  Accuracy                         0.59
  Precision                        0.54
  Recall                           0.60
  F1 score                         0.59
  G-mean score                     0.57

4.2.2 Synthetic minority oversampling technique

SMOTE is another popular oversampling method. The previous method, random oversampling, can lead to overfitting because it simply replicates instances of the minority class at random [24]. To overcome this problem, Chawla et al. [24] developed a technique for generating minority class instances synthetically. In SMOTE, new instances are created by interpolating between several minority class instances that lie close together [5], which allows SMOTE to operate in feature space rather than in data space [24]. The detailed working of the SMOTE method, along with further discussion, can be found in "SMOTE: Synthetic Minority Oversampling Technique" [24] (Figs. 5.15 and 5.16).

In Python, SMOTE can be implemented using the "SMOTE()" function from "imblearn.over_sampling." The SMOTE method is based on a number of neighbors, which can be set in the "SMOTE()" function using the "k_neighbors" parameter; the default is five neighbors, but users can tweak this value. Users can also define a resampling strategy as in the other functions, and the default strategy is "auto," which means the majority class is not targeted. After performing SMOTE, there are equal numbers of positive (573,518) and negative (573,518) class instances in the dataset. The results obtained after performing logistic regression on the processed data are summarized in Tables 5.12 and 5.13.
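A minimal sketch, with the same assumed variable names as before:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    # Synthesize new minority instances by interpolating between nearest neighbors.
    sm = SMOTE(k_neighbors=5, random_state=0)
    X_resampled, Y_resampled = sm.fit_resample(x, y)
    print(Counter(Y_resampled))  # both classes: 573,518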

FIGURE 5.15 SMOTE implementation in Python.

FIGURE 5.16 Receiver operating characteristic and area under the curve after performing SMOTE (AUC = 0.62).

Table 5.12  Confusion matrix for Porto Seguro's safe driver prediction dataset.

                     Predicted negative    Predicted positive
  Actual negative    103,484               68,772
  Actual positive    74,515                97,340

Table 5.13  Other results for Porto Seguro's safe driver prediction dataset.

  Statistical assessment metric    Result
  Accuracy                         0.58
  Precision                        0.57
  Recall                           0.59
  F1 score                         0.58
  G-mean score                     0.58

4.2.3 Adaptive synthetic sampling

ADASYN, developed by He et al. [25], is one of the most popular extensions of SMOTE. There are more than 90 methods based on SMOTE; for the complete list, readers can refer to Ref. [5]. In ADASYN, minority examples are generated according to their density distribution: more synthetic data are generated from minority class samples that are harder to learn than from those that are easier to learn [25]. The two objectives behind the ADASYN technique are reducing the bias introduced by the class imbalance and adaptively shifting the learning toward the difficult examples. The detailed workings of the ADASYN method, along with further discussion, can be found in "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning" by He et al. [25] (Figs. 5.17 and 5.18).

In Python, ADASYN can be implemented using the "ADASYN()" function from "imblearn.over_sampling." The "n_neighbors" parameter sets the number of nearest neighbors used to create the synthetic samples, and its default value is 5. The randomization of the algorithm can be controlled using the "random_state" parameter.
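A minimal sketch, with the same assumed variable names as before:

    from collections import Counter
    from imblearn.over_sampling import ADASYN

    # Generate more synthetic samples for minority instances that are harder to learn.
    ada = ADASYN(n_neighbors=5, random_state=0)
    X_resampled, Y_resampled = ada.fit_resample(x, y)
    print(Counter(Y_resampled))  # class counts end up close, but not exactly equal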

FIGURE 5.17 ADASYN implementation in Python.

FIGURE 5.18 Receiver operating characteristic and area under the curve after performing ADASYN (AUC = 0.62).

Table 5.14  Confusion matrix for Porto Seguro's safe driver prediction dataset.

                     Predicted negative    Predicted positive
  Actual negative    101,158               70,906
  Actual positive    73,534                100,007

Table 5.15  Other results for Porto Seguro's safe driver prediction dataset.

  Statistical assessment metric    Result
  Accuracy                         0.58
  Precision                        0.58
  Recall                           0.59
  F1 score                         0.58
  G-mean score                     0.58

After performing ADASYN, the counts of the negative (573,518) and positive (578,497) class instances are very close. The results obtained after performing logistic regression on the processed data are summarized in Table 5.14. The accuracy is 58%, and the precision, recall, F1 score, and G-mean score values are all below 0.6. The AUC value obtained from the analysis is 0.62 (Table 5.15).

4.3 Hybrid methods

The most popular undersampling and oversampling methods were reviewed and implemented in the previous subsections. In undersampling, majority class instances are removed, while in oversampling, minority class instances are created, in both cases to equalize the class distribution. Hybrid methods are a combination of undersampling and oversampling methods and thus help strike a balance between removing majority class instances and creating minority class instances. In hybrid methods, SMOTE is a popular choice for the oversampling step and can be combined with Tomek links or ENN for undersampling. Python provides two hybrid methods, "SMOTEENN()" (SMOTE + ENN) and "SMOTETomek()" (SMOTE + Tomek links), in "imblearn.combine." More details about their parameters and implementation can be found in Ref. [21].
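A minimal sketch of the two combined samplers, with the same assumed variable names as before:

    from collections import Counter
    from imblearn.combine import SMOTEENN, SMOTETomek

    # Oversample with SMOTE, then clean the result with ENN.
    sme = SMOTEENN(random_state=0)
    X_se, Y_se = sme.fit_resample(x, y)

    # Oversample with SMOTE, then remove Tomek links.
    smt = SMOTETomek(random_state=0)
    X_st, Y_st = smt.fit_resample(x, y)

    print(Counter(Y_se), Counter(Y_st))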


5. Other methods

The previous section discussed data-level methods such as undersampling, oversampling, and hybrid methods, which primarily use sampling to deal with the problem of imbalance in a dataset. In addition to data-level methods, there are also algorithmic-level and ensemble-based approaches. Algorithmic-level approaches focus on modifying the classifier instead of modifying the dataset [5]; they require a good understanding of the classifier and of how it is affected by an imbalanced dataset. Some commonly used algorithmic-level techniques are kernel-based approaches, weighted approaches, active learning, and one-class learning. Ensemble-based approaches combine several classifiers at the final step by combining their outputs; the most commonly used are bagging and boosting. Algorithmic-level and ensemble-based approaches are beyond the scope of this chapter, and interested readers can find a useful review of these methods in "Learning from Imbalanced Data Sets" by Fernández et al. [5].

6. Conclusion

This chapter introduced the problem of data imbalance and different data-level methods for dealing with it. The chapter began with the definitions of imbalance, positive class, and negative class, and then focused on the main causes of imbalance, namely class overlap and small disjuncts. Afterward, different statistical assessment metrics were discussed, implemented, and summarized after performing logistic regression on the "Porto Seguro's Safe Driver Prediction" dataset. The most important part of this chapter was the discussion and comparison of the different data-level methods. For this scenario, the oversampling methods performed better than undersampling: among the undersampling methods, only random undersampling enabled the model to classify positive class instances after processing the data, while all of the oversampling methods discussed in this chapter produced better results, with similar accuracy, precision, recall, F1, G-mean, and AUC scores. It is critical to note that no single method is suitable for all problems. In this scenario, oversampling outperforms undersampling, but the essential step is to preprocess the data with multiple methods and then compare the results in order to select the best possible method for the type of data and the scenario of the problem at hand.


The analysis performed in this chapter also points out the importance of feature engineering: using a combination of different features and/or selecting features before applying these methods can generate better results. In the real world, it is also essential to understand the meaning of the results and how they will help to solve a business problem. Undersampling removes majority class samples from the dataset, which can be problematic in many cases because every record in the dataset may denote something significant. Similarly, oversampling generates samples from the data, which can result in overfitting of the model or in the creation of irrelevant records. Therefore, caution should always be taken when using data-level approaches, especially when dealing with sensitive information. As this book aims to present, the technical and statistical methods presented in this chapter are essential in avoiding blind spots and ensuring that a correct representation of facts, data, and the truth is extracted from the data; that notion leads to better data science and a data democracy.

References
[1] H. He, E. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 9 (2008) 1263–1284.
[2] C. Pannucci, E. Wilkins, Identifying and avoiding bias in research, Plast. Reconstr. Surg. 126 (2010) 619.
[3] N.V. Chawla, Data mining for imbalanced datasets: an overview, Data Min. Knowl. Discov. Handb. (2009) 875–886.
[4] D. Lewis, J. Catlett, Heterogeneous uncertainty sampling for supervised learning, Mach. Learn. Proc. (1994) 148–156.
[5] F. Alberto, S. Garcia, M. Galar, R. Prati, B. Krawczyk, F. Herrera, Learning from Imbalanced Data Sets, Springer, 2018.
[6] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, in: Proceedings of the Fourteenth International Conference on Machine Learning, vol. 97, 1997, pp. 179–186.
[7] T. Fawcett, F. Provost, Adaptive fraud detection, Data Min. Knowl. Discov. 1 (1997) 291–316.
[8] K. Ezawa, M. Singh, S. Norton, Learning goal-oriented Bayesian networks for telecommunications risk management, in: Proceedings of the Thirteenth International Conference on Machine Learning, 1996, pp. 139–147.
[9] S. Dumais, J. Platt, D. Heckerman, M. Sahami, Inductive learning algorithms and representations for text categorization, in: Proceedings of the Seventh International Conference on Information and Knowledge Management, ACM, 1998, pp. 148–155.


[10] D. Mladenic, M. Grobelnik, Feature selection for unbalanced class distribution and naive bayes, in: Proceedings of the Sixteenth International Conference on Machine Learning, vol. 99, 1999, pp. 258–267.
[11] D. Lewis, M. Ringuette, A Comparison of Two Learning Algorithms for Text Categorization, Third Annual Symposium on Document Analysis and Information Retrieval, vol. 33, 1994, pp. 81–93.
[12] M. Kubat, R. Holte, S. Matwin, Machine learning for the detection of oil spills in satellite radar images, Mach. Learn. 30 (1998) 195–215.
[13] F. Provost, Machine learning from imbalanced data sets 101, in: Proceedings of the AAAI'2000 Workshop on Imbalanced Data Sets, vol. 68, 2000, pp. 1–3.
[14] N. Japkowicz, S. Stephen, The class imbalance problem: a systematic study, Intell. Data Anal. 6 (2002) 429–449.
[15] G. Weiss, F. Provost, The Effect of Class Distribution on Classifier Learning: An Empirical Study, 2001.
[16] Kaggle, Available: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction, 2019.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[18] P. Bruce, A. Bruce, Practical Statistics for Data Scientists: 50 Essential Concepts, O'Reilly Media, 2017.
[19] G. Lemaître, F. Nogueira, C. Aridas, Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning, J. Mach. Learn. Res. 18 (1) (2017) 559–563.
[20] Python Software Foundation, Python Language Reference, version 3.7.3. Available at: http://www.python.org.
[21] Imbalanced-Learn, Available: https://imbalanced-learn.readthedocs.io/en/stable/, 2019.
[22] I. Tomek, Two modifications of CNN, IEEE Trans. Syst. Man Cybern. 6 (1976) 769–772.
[23] D. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Syst. Man Cybern. 3 (1972) 408–421.
[24] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[25] H. He, Y. Bai, E. Garcia, S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks, 2008, pp. 1322–1328.

6

Data openness and democratization in healthcare: an evaluation of hospital ranking methods
Kelly Lewis 1, Chau Pham 1, Feras A. Batarseh 2
1 COLLEGE OF SCIENCE, GEORGE MASON UNIVERSITY, FAIRFAX, VA, UNITED STATES; 2 GRADUATE SCHOOL OF ARTS & SCIENCES, DATA ANALYTICS PROGRAM, GEORGETOWN UNIVERSITY, WASHINGTON, D.C., UNITED STATES

The past, like the future, is indefinite and exists only as a spectrum of possibilities.
Stephen Hawking

Abstract
The democratization of data in healthcare is becoming one of the most significant hurdles in patient service, hospital operations, and the medical field as a whole. Controversy around healthcare data has brought challenges to many medical case studies, including legal rights to data, data ownership, and data bias. The experiment performed in this chapter works through thorough hospital quality metrics and provides a new ranking system that compares hospitals and identifies which ones are best for a potential patient, based on their needs and medical status.
Keywords: Data; Healthcare; Hospitals; Patient; Quality.

1. Introduction

Nowhere is scientific stagnation more apparent than in the medical field. Abundant resources are devoted to medical research, yet increased spending has not resulted in tangible outcomes [1]. The National Institutes of Health estimates that "researchers would find it hard to reproduce at least three-quarters of all published biomedical findings" [2]. Many publications in the medical industry are now reliant on large and complex datasets, often yielding contradictory and "irreproducible results" [2].


As a result, "massive investments in basic biomedical and molecular research have resulted in negligible dividends for patients" [3]. The lack of quality publications is due to many factors, including increased administration costs, excess regulation, and an overbearing presence by the pharmaceutical industry. Bigger budgets are used to cover enlarged administration and bureaucratic costs, not scientists or lab gear [4]. Larger bureaucracies are symptoms of an increasingly complex regulatory system, much of which emerged as a reaction against casualties in clinical trials [5]. Regulations, often self-imposed by institutions, have increased time costs for researchers. In 2000, the average cost to develop a new drug was $802 million, and time costs accounted for half of all research and development expenses [4]. Investigators spend large amounts of their time performing administrative and managerial tasks, partly attributable to a high turnover rate within clinical research organizations. Protracted timelines also result in less evidence-based science as companies rush to get their drugs approved.

Swelling costs have led to a reliance on pharmaceutical companies to fund studies. Pharmaceutical companies are businesses whose priority is profit, not scientific advancement; consequently, biases are common and negative findings are regularly underreported in industry-sponsored research. In a 2008 study of antidepressant drug trials submitted to the FDA, "97 percent of the trials [...] yielded positive results" [6]. The same study reported that "39 percent of the studies were found to have negative or questionable results" [6]. The lack of negative trial outcomes is not due to superior research models; the number of FDA drug approvals from industry-sponsored research has steadily declined [3].

A dearth of quality publications cannot purely be blamed on any single institution or factor. Medical research is bound to face discrepancies because of its inherent complexities. Computer scientists, for example, can rely on the fact that a computer will function as programmed. Biologists and chemists, however, rely on cells and molecules that often change in unexpected ways. Errors most frequently occur in data analysis, either intentionally or unintentionally [7]. Measurement errors, unvalidated data, fraud, and misinterpretation of data can result in skewed analysis [2]. A gap in appropriate data analysis is one of the contributing factors to "a general slowdown in medical progress" [2]. Flaws in data analysis have been well documented by researchers, yet institutions have been unable to propose long-term solutions. Publications can improve on studies by increasing transparency standards. Data and conclusions from government-sponsored trials can only be available to the public if they are published. Yet, "less than half of trials initiated by academic researchers are published" [8].


One of the main obstacles to publication was "ongoing data analysis" or "manuscript preparation" [8]. Many of these obstacles could be eased with Open Data and Open Science within healthcare. How would healthcare transform within the data republic? This chapter aims to design and explore that.

2. Healthcare within a data democracy: thesis

While healthcare research is typically privatized to prevent plagiarism or the release of sensitive information, experimental variables should be available to the public to catalyze the research cycle, allow for correct reproductions and alterations, and provide the scientists and citizens of a data republic with a complete understanding of the experiment or research.

3. Motivation

Data privacy and security measures include removing identifying attributes of patients. Removing this part of the data to comply with the De-Identification Act can have negative implications for the outcomes of analysis [9]. Security and privacy issues such as patient privacy and confidentiality are a hurdle to data sharing [10]. Controversially, patients believe they have ownership of their healthcare data, despite the ideology of residential ownership by the hospital or healthcare provider; patients must now sign a consent form to release their data for use [9]. Insufficient data quality attributes in medical registries include ambiguous data definitions, unclear data collection guidelines, poor interface design, data overload, and programming errors. The data users' perspective should be taken into careful consideration when evaluating data quality [11]. Deep learning models do not account for errors, uncertainties, and changes in data distribution, which decreases the validity and veracity of the output; this can be a significant risk, especially in the healthcare setting [12]. Data quality is key to the augmentation of analytical models; however, data quality assurance has not been stringently investigated. Errors and inaccuracies can arise at every stage in the lifecycle of healthcare data analytics, and poor data quality in early stages can lead to increasing error propagation and quality deterioration [13]. Because data are distributed worldwide, there must be an umbrella regulation for openly licensed healthcare data [14]. In short, incomplete, poorly described, or outdated data caused by collecting data from different sources limit the value and reliability of big data usage [9]. The massive data generated through clinical data collection and genomic technologies increase the chance of discovering disease mechanisms; nonetheless, the data quality and its size and complexity give rise to challenges.


To overcome these challenges, computational models need to be scalable [15]. Additionally, there is a shortage of professional data scientists in the field of healthcare, especially ones with the knowledge and skills to run analyses and create proper data storage [9]. The Agency for Healthcare Research and Quality explores data issues and identifies the root causes of medical errors. Major issues include communication problems, which result in poorly documented or lost information on laboratory results, diagnostic testing, or medication information. Inadequate information flow prevents the availability of critical information, the timely and reliable communication of critical test results, and the coordination of medication orders [16]. Furthermore, clinical data linkage may become a challenge: hospitals are not updating and configuring their data, primarily patient data, which can cause data duplication, among other issues [9]. The work done in this chapter aims to quantify those issues and present solutions through data openness.

4. Related works

The most important consideration in healthcare data systems development is the collection of information about patients. Once adequate data are obtained, scientists can proceed to integrate them, build predictive models, and then derive results from those models. In the United States, medical and healthcare providers are not allowed to disclose this information; therefore, data must be created through public queries and pulled from open healthcare data platforms such as those of the Centers for Disease Control and Prevention (CDC) [17]. Google Flu Trends (GFT), a web service operated by Google, was successful in providing specific flu information in selected countries and in predicting outbreaks. Launched in 2008, the project stored and processed search queries from users to make predictions about flu outbreaks. GFT failed after its first 2 years because it did not adapt to human behavioral changes online and did not incorporate new CDC data to enhance its data model. The combination of subjective social media data and healthcare was an additional hurdle in this project [18]. After the GFT trial, researchers at the Department of Statistics at Harvard University created Auto Regression with Google search data (ARGO). ARGO performed better than GFT and competing models because it utilized public data from Google search, previously collected data from GFT, and other publicly available data, including the CDC's. The model could also integrate flu data with other sources while shifting its data mining along with shifts in online human behavior. Overall, the failure of GFT created a skeleton model to improve upon, which led to the success of ARGO [18,19].


Another popular system is the Electronic Remote Blood Issue System (ERBIS), which was tested in three urban hospitals. The challenges of this study were heavily based on the unpredictability of patient cases. ERBIS was a database linked to a blood bank that tracked the blood type of the patient as well as the blood unit status. The experiment took place over 3 years and included both qualitative and quantitative data. The overall challenge in this case was the implementation of a human-computer interactive design. One of the researchers experienced several forms of resistance during the study, including but not limited to participant boycotts, physical tampering with the ERBIS, and spoiling of the data collection sheets. This is a good example of the controversy behind medical data research as well as of the roadblocks in the industry that inhibit innovative growth [20].

Data democratization would allow for transparency of organizational performance, which may have adverse effects on the healthcare industry's economy; that may explain why open data are hard to manage. This ideology ties into the perceived social value of open data. Ironically, the democratization of data would create a booming environment for medical research and perhaps save patients and insurance companies millions through more refined and personalized treatment [21]. Big data analytics has been applied to accumulate, manage, analyze, and assimilate the large volumes of disparate and unstructured data produced by current healthcare systems. Troubleshooting the inherent complexities of healthcare data can be achieved using data integration and data mining to uncover the dataset dependencies and patterns that increase the accuracy of diagnosis, predictions, and the overall performance of the system. In conclusion, effective and extensive collaboration among researchers, computational scientists, and healthcare providers is important in solving the many challenges of applying big data analytics in the medical field [21].

Genetic research, for instance, would vastly benefit from the democratization of data in healthcare; this is how patients can be given personalized treatment based on how their DNA aligns with a database bank of those with similar genetic features. The Protein Data Bank, for example, a freely accessible archive of data and metadata for biological macromolecules, is studied across a range of disciplines and is associated with successes in drug discovery, genetics and genomics, and protein structure prediction [22]. Spangler et al. discussed the successful uses of the Knowledge Integration Toolkit (KnIT), which can perform text mining of scientific literature, visualize that information, and generate hypotheses. KnIT was used to identify prospective kinases of the p53 tumor protein; the top two candidates were PKN1 and NEK1. To validate the result, they carried out two sets of laboratory experiments and found robust interaction between the two kinases and p53.


This is an example of how KnIT can be expected to advance cancer treatment as well as the understanding of the underlying mechanism [23]. Nextstrain is another example of an open-to-the-public system, which has the most up-to-date phylodynamic data in the medical field [24,25]. These are good examples of open data systems that lead to good medical research results; however, hospitals rarely benefit from their own patient data to improve the quality of service at their clinics. The next section explores that endeavor.

5. Hospitals' quality of service through open data

Using open-access hospital data from the Centers for Medicare and Medicaid Services (CMS), a data visualization is built to help users locate the best performing hospital in their area. To visualize information on the quality of care that hospitals provide to their patients, we compiled different measures, including patient payment for treatment(s), complications and death, unplanned hospital visits, imaging efficiency, Medicare spending per person, structural measures, value of care versus cost, and healthcare-associated infections. Users can select multiple hospitals; directly compare and analyze clinical performance based on geography, available treatments, and value of care; and identify discrepancies in health outcomes, utilization, and spending, thus making better decisions in choosing the most suitable facility. Hospitals can use this visualization to determine their healthcare delivery ranking on a county, state, or national scale. Local healthcare providers can utilize this tool to recognize important issues that restrain their quality of service and drill down into areas that need improvement and renovation. With this visualization, hospital data are leveraged to identify underperforming metrics, and healthcare providers are encouraged to translate those data into actionable solutions that improve community health outcomes and lower overall healthcare costs.

The data used for the experiment were retrieved from CMS. There are effectively 10 parameters used to formulate the Hospital Service Score that provide insight into top hospitals (compared with actual patient experience). These measures range from qualitative to quantitative and consider patient experience, technology, and healthcare microeconomics. Structural measures reflect the environment of the hospital and are scored based on the hospital's technology, modernization of digitized files, and equipment. Outpatient imaging efficiency observes the number of specific imaging tests performed in nonurgent instances or in situations that may not be medically appropriate.


The lower the score, the more efficient the hospital's use of imaging, and vice versa. Value of care is determined by patient surveys and is comparative to the cost of treatments (payment for treatments); for example, a facility with a high cost for a knee/hip replacement procedure may not deliver a correspondingly high value of patient care [26]. Additionally, unplanned hospital visits are the summation of days spent in the hospital within 30 days after a patient was first treated and released (readmission). These observed readmissions are based on the treatment of heart attacks, heart failure, pneumonia, chronic obstructive pulmonary disease, surgical procedures, and strokes; a lower percentage for this measure is ideal. Another score that should be low is the score for infections, where a low score indicates better treatment. Complications and deaths covers complications deemed serious but possibly preventable, and the deaths they cause; it also reflects the quality of care received by the patient during treatment and postcare procedures. Moreover, readmissions and death determine the rate of readmission within 30 days of treatment and the rate of death postreadmission; high readmission rates may coincide with a low value of care and with healthcare-associated infections. Medicare hospital spending per beneficiary may likewise depend on the hospital's value of care, efficiency, and treatment cost. Payment for treatments is limited to heart attack, heart failure, pneumonia, and hip/knee replacements for beneficiaries 65 years old or older [26,27].

6. Hospital ranking: existing systems

HealthInSight's (a hospital ranking entity) rankings focus on patient experience, healthcare-associated infections, and readmissions [27]; the last measure is an overall performance metric. The data used are collected from the American Hospital Association and CMS and are then combined with survey information. Rankings are based on 16 adult specialties, number of beds, medical technology, and the performance of several specific procedures. By this analysis, the top five hospitals in the United States are the Mayo Clinic, Cleveland Clinic, Johns Hopkins Hospital, Massachusetts General Hospital, and University of Michigan Hospitals [28]. CMS rankings, on the other hand, are based on two perspectives; the US Department of Veterans Affairs describes them as "Relative Performance compared to other VA medical centers using a Star rating system from 1 to 5 and Improvement compared to its own performance from the past year." Their administration used Strategic Analytics for Improvement and Learning (SAIL) [29].


Our data democracy ranking system challenges these two most commonplace systems and presents an improved ranking system using open data.

7. Top ranked hospitals

Mayo Clinic, Fig. 6.1, is primarily an inpatient treatment facility with an outpatient building for specialty services. It has been consistently ranked a top hospital due to its high scores in multiple specialties. Cleveland Clinic, as seen in Fig. 6.2, is an academic medical center renowned for its cardiology and heart health. Johns Hopkins Hospital, Fig. 6.3, is known as a top-ranking adult and pediatric hospital and also ranks high in specialties. Massachusetts General Hospital, in Fig. 6.4, has the country's largest hospital-based research program and is also known for its treatments for cancer, heart disease, and trauma care. University of Michigan Hospitals, Ann Arbor, is a general medical and surgical facility that ranks high due to its university's reputation as one of the world's great academic institutions, with a medical sector that owns advanced technology and a renowned healthcare system (Fig. 6.5). None of these hospitals publish data publicly for experimentation and research (see Fig. 6.6).

FIGURE 6.1 Mayo Clinic, Rochester, MN. From https://www.flickr.com/photos/diversey/39658224012.


FIGURE 6.2 Cleveland Clinic, Cleveland, OH. From https://commons.wikimedia.org/wiki/File:Cleveland_Clinic_Miller_Family_Pavilion.jpg.

FIGURE 6.3 The Johns Hopkins Hospital, Baltimore, MD. From https://commons.wikimedia.org/wiki/File:JHACH_Exterior_Night.jpg.


FIGURE 6.4 Massachusetts General Hospital, Boston, MA. From https://commons.wikimedia.org/wiki/File:Mass_General_Hospital_-_MGH.jpg.

FIGURE 6.5 University of Michigan Hospitals, Ann Arbor, MI. From https://commons.wikimedia.org/wiki/File:University_of_Michigan_August_2013_071_(Cardiovascular_Center).jpg.

FIGURE 6.6 Correlation scatter plots for CMS patient survey (panels: HIS vs. CMS Survey; RTI vs. CMS Survey; Our Study vs. CMS Survey).

8. Proposed hospital ranking: experiment and results

All the data [26,27] and experimental ranges of this study were standardized (refer to Table 6.1). The final formula includes three base sums, which are weighted by their patient experience impact [30]. High impact metrics are weighted at 50%, as they are the most referenced for patient hospital experience. Medium impact metrics are weighted at 30%, as they are valued by patients but are not the most immediate indicators of value of care. Lastly, low impact metrics are weighted at 20%, as they may indicate a general patient care value. The variables used include:

Payment score calculation: 55,134/(average treatment costs for hospital sample)
Complications/deaths calculation: 218.53/(original sample score)


Unplanned hospital visits calculation: 1497/(original sample score)
Outpatient imaging efficiency calculation: 383.3/(original sample score)
Value of care: scale 1–9, determined by Table 6.2 (calculated as an average for hospitals with values in multiple table categories) (see Tables 6.3 and 6.4).
Healthcare-associated infections: scale 1–11, counted as the number of infections scoring better rates than the national benchmark.
Structural measures: number of structural measures available, on a scale of 0–8.
Medicare spending per patient: score presented by the data; higher spending typically indicates better care.
The formula for the overall score is:

(SUM(Structural Measures, Value of Care vs. Cost, Healthcare-Associated Infections) × 0.50)
+ (SUM(Complications/Deaths, Unplanned Hospital Visits, Payment for Treatment) × 0.30)
+ (SUM(Imaging Efficiency, Medicare Spending per Patient) × 0.20) = SCORE
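A direct transcription of that weighted sum into Python might look as follows; the metric key names are assumptions made for this sketch, not names taken from the study's own code:

    def hospital_score(m):
        """Weighted hospital service score for one hospital, per the formula above.

        `m` maps each standardized metric to its value; the key names are illustrative.
        """
        high = m["structural_measures"] + m["value_of_care_vs_cost"] + m["infections"]
        medium = m["complications_deaths"] + m["unplanned_visits"] + m["payment"]
        low = m["imaging_efficiency"] + m["medicare_spending"]
        return 0.50 * high + 0.30 * medium + 0.20 * low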

The P-value was used to determine the chance of a Type I error, while the R-squared value, a statistical measure of correlation, shows the goodness of fit to the model's regression line.

Table 6.1  Hospital rankings based on different rankings (sorted by our score).

  Hospital name | RTI | CMS patient survey (out of 100%) | CMS survey (1-5 stars) | HealthInSight survey (out of 100%) | HealthInSight (1-5 stars) | Our study (out of 100)
  Mayo Clinic, Phoenix | 241 | 85% | 5 | 97% | 5 | 96.636
  Hospitals of the University of Pennsylvania-Penn Presbyterian, Philadelphia | 225 | 32% | 3 | 68% | 4 | 35.2
  Massachusetts General Hospital, Boston | 354 | 68% | 5 | 71% | 4 | 31.249
  Mayo Clinic, Rochester | 414 | 85% | 5 | 82% | 4 | 27.894
  Mount Sinai Hospital, New York | 192 | 8% | 3 | 11% | 2 | 25.825
  UCSF Medical Center, San Francisco | 296 | 56% | 4 | 80% | 4 | 22.496
  New York-Presbyterian Hospital, New York | 242 | 15% | 3 | 28% | 3 | 20.758


Table 6.1  Hospital rankings based on different rankings (sorted by our score), continued.

  Hospital name | RTI | CMS patient survey (out of 100%) | CMS survey (1-5 stars) | HealthInSight survey (out of 100%) | HealthInSight (1-5 stars) | Our study (out of 100)
  Duke University Hospital, Durham | 178 | 54% | 4 | 72% | 4 | 18.99
  UPMC Presbyterian Shadyside, Pittsburgh | 208 | 10% | 1 | 31% | 3 | 17.654
  Brigham and Women's Hospital, Boston | 177 | 27% | 4 | 62% | 4 | 17.154
  Cedars-Sinai Medical Center, Los Angeles | 252 | 79% | 5 | 48% | 3 | 16.304
  Northwestern Memorial Hospital, Chicago | 228 | 64% | 3 | 57% | 3 | 14.994
  Johns Hopkins Hospital, Baltimore | 355 | 62% | 3 | 65% | 4 | 14.3
  Barnes-Jewish Hospital, St. Louis | 241 | 46% | 2 | 56% | 4 | 14.098
  University of Michigan Hospitals-Michigan Medicine, Ann Arbor | 324 | 34% | 5 | 37% | 3 | 14.046
  Vanderbilt University Medical Center, Nashville | 198 | 65% | 3 | 56% | 3 | 13.996
  Cleveland Clinic, Cleveland | 385 | 40% | 5 | 72% | 4 | 13.596
  Stanford Health Care-Stanford Hospital, Stanford | 250 | 20% | 4 | 40% | 4 | 11.95
  NYU Langone Hospitals, New York | 208 | 11% | 4 | 25% | 3 | 11.356

Table 6.2  Value of care versus cost.

  Complications/Mortality rates | Lower payment | Average payment | Higher payment
  Better                        | 9             | 6               | 3
  Average                       | 6             | 4               | 2
  Worse                         | 3             | 2               | 1

Table 6.3  R-squared values for the models.

  R-squared            | HealthInSight | RTI   | Our study
  CMS survey           | 0.226         | 0.103 | 0.019
  HealthInSight survey | 0.594         | 0.179 | 0.258


Table 6.4  P-values for the models.

  P value              | HealthInSight | RTI   | Our study
  CMS survey           | 0.039         | 0.179 | 0.580
  HealthInSight survey | 0.0001        | 0.071 | 0.026

9. Conclusions and future work

Our ranking system proves to be a stronger ranking system for patient experience, largely because of its patient-focused methodology and its use of more quantitative and qualitative parameters to capture overall hospital quality. Patients can use our ranking system to locate the best hospitals, ranked on the basis of medical advancements in structural measures that allow for more efficient treatment. As seen in Fig. 6.7, our study's trend line has a higher correlation than RTI's when compared with the HealthInSight survey. Our study also had a positive correlation with the CMS survey but fell short of the strength of the RTI and HealthInSight models, which may be due to the extensive additional metrics in our ranking system.

FIGURE 6.7 Correlation scatter plots for HealthInSight patient survey (panels: HIS vs. HIS Survey; Our Study vs. HIS Survey; RTI vs. HIS Survey).


With the use of open data in healthcare, such research can help hospitals make better service decisions for patients. The availability of a variety of patient survey data would be instrumental in improving hospital ranking scores that focus on experience-based metrics. That will eventually be very beneficial to the data citizen in a data democracy.

References
[1] I. Chalmers, Biomedical research: are we getting value for money? J. Signif. 4 (2006) 172–175, https://doi.org/10.1111/j.1740-9713.2006.00200.x, Section 3.
[2] J.J. Berman, Data Simplification, first ed., Elsevier Inc., 2016.
[3] I. Mittra, Why is modern medicine stuck in a rut? Perspect. Biol. Med. 52 (4) (2006) 500–517, https://doi.org/10.1353/pbm.0.0131.
[4] Institute of Healthcare Improvement, Institute of Medicine (USA), 2010.
[5] J. Vijg, The American Technological Challenge: Stagnation and Decline in the 21st Century, Algora Pub, New York, 2011.
[6] J. Lexchin, L.A. Bero, B. Djulbegovic, O. Clark, Pharmaceutical industry sponsorship and research outcome and quality: systematic review, BMJ 326 (2003) 1167, Section 7400.
[7] K. Sainani, What Biomedical Computing Can Learn From Its Mistakes | Biomedical Computation Review, September 2011. Retrieved from: http://biomedicalcomputationreview.org/content/error-e-whatbiomedicalcomputing-can-learn-its-mistakes.
[8] L. Berendt, L.G. Petersen, K.F. Bach, H.E. Poulsen, K. Dalhoff, Barriers towards the publication of academic drug trials. Follow-up of trials approved by the Danish Medicines Agency, PLoS One 12 (5) (2017) e0172581, https://doi.org/10.1371/journal.pone.0172581.
[9] M. Ottom, Big data in healthcare: review and open research issues, Jordan. J. Comput. Inf. Technol. (2017) 37–50.
[10] N. Mehta, A. Pandit, Concurrence of big data analytics and healthcare: a systematic review, Int. J. Med. Inf. 114 (Jun. 2018) 57–65.
[11] D.G.T. Arts, N.F. de Keizer, G.-J. Scheffer, Defining and improving data quality in medical registries: a literature review, case study, and generic framework, J. Am. Med. Inform. Assoc. 9 (6) (2002) 600–611.
[12] C. Xiao, E. Choi, J. Sun, Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review, J. Am. Med. Inform. Assoc. 25 (10) (Jun. 2018) 1419–1428.
[13] R.K. Ferrell, S.R. Sukumar, R. Natarajan, Quality of big data in health care, Int. J. Health Care Qual. Assur. 28 (6) (Jul. 2015) 621–634.


[14] S. Kobayashi, T. Kane, C. Paton, The Privacy and Security Implications of Open Data in Healthcare: A Contribution from the IMIA Open Source Working Group, 2018, pp. 041–047, 27.01.
[15] N.V. Chawla, D.A. Davis, Bringing big data to personalized healthcare: a patient-centered framework, J. Gen. Intern. Med. 28 (Suppl. 3) (September 2013) 660–665.
[16] AHRQ, AHRQ's Patient Safety Initiative: Building Foundations, Reducing Risk | AHRQ Archive, 2018 [Online]. Available: https://archive.ahrq.gov/research/findings/final-reports/pscongrpt/psini2.html.
[17] K.L. Insider Business, "We're Finally Cracking the Secrets of What Makes Us Sick," Business Insider, 2018 [Online]. Available: https://www.businessinsider.com/how-were-finally-cracking-the-secrets-of-what-makes-us-sick-2015-10.
[18] F. Batarseh, E. Latif, Assessing the quality of service using big data analytics, Elsevier's Big Data Res. 4 (2016) 13–24.
[19] R. Blank, A new data effort to inform career choices in biomedicine, Science 358 (6369) (2017) 1388–1389.
[20] D. Furniss, et al., Fieldwork for Healthcare: Case Studies Investigating Human Factors in Computing Systems, Morgan and Claypool, San Rafael, California, 2014.
[21] A. Belle, R. Thiagarajan, S.M.R. Soroushmehr, F. Navidi, D.A. Beard, K. Najarian, Big Data Analytics in Healthcare, BioMed Research International, 2015 [Online]. Available: https://www.hindawi.com/journals/bmri/2015/370194/.
[22] C. Markosian, L. Di Costanzo, M. Sekharan, C. Shao, S.K. Burley, C. Zardecki, Analysis of impact metrics for the Protein Data Bank, Sci. Data 5 (October 2018) 180212.
[23] S. Spangler, et al., Automated hypothesis generation based on mining scientific literature, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2014, pp. 1877–1886.
[24] J. Hadfield, et al., Nextstrain: real-time tracking of pathogen evolution, Bioinformatics.
[25] M. Pandika, Mining Gene Expression Data for Drug Discovery, 2018 [Online]. Available: https://cen.acs.org/pharmaceuticals/drug-discovery/Mining-geneexpression-data-drug/96/i38#.
[26] Hospital Compare Downloadable Data Dictionary, October 2018. Retrieved from: https://data.medicare.gov/data/hospital-compare.
[27] National Hospital Rankings Chart. Retrieved from: https://HealthInSight.org/rankings/hospitals/hospital-rankings?type=hospands=AL.
[28] U.S. News and World Report announces 2018-2019 Best Hospitals rankings, August 2018. Retrieved from: https://www.rti.org/news/us-news-world-reportannounces-2018-2019-best-hospitals-rankings.
[29] Quality of Care, October 2017. From: https://www.va.gov/qualityofcare/measureup/end_of_year_hospital_star_rating_fy2017.asp.
[30] K. Sullivan, Five Hospital Characteristics Necessary for Success, November 2013. Retrieved from: https://www.fiercehealthcare.com/healthcare/5-hospitalcharacteristics-necessary-for-success.


[29] Quality of Care, October, 2017. From, https://www.va.gov/qualityofcare/ measureup/end_of_year_hospital_star_rating_fy2017.asp. [30] K. Sullivan, Five Hospital Characteristics Necessary for Success, November 2013. Retrieved from: https://www.fiercehealthcare.com/healthcare/5-hospitalcharacteristics-necessary-for-success.

Further reading A. Belle, R. Thiagarajan, S.M.R. Soroushmehr, F. Navidi, D.A. Beard, K. Najarian, “Big Data Analytics in Healthcare,” BioMed Research International, 2015 [Online]. Available: https://www.hindawi.com/journals/bmri/2015/370194/. Business Insider, “We’re finally cracking the secrets of what makes us sick”, 2018. Available: https://www.businessinsider.com/how-were-finally-cracking-thesecrets-of-what-makes-us-sick-2015-10. C.S. Kruse, R. Goswamy, Y. Raval, S. Marawi, Challenges and opportunities of big data in health care: a systematic review, JMIR Med. Inform. 4 (4) (November 2016). C. Markosian, L. Di Costanzo, M. Sekharan, C. Shao, S.K. Burley, C. Zardecki, Analysis of impact metrics for the protein Data Bank, Sci. Data 5 (2018) 180212. Deep Learning Meets Genome Biology e O’Reilly Media, 2018. Available: https:// www.oreilly.com/ideas/deep-learning-meets-genome-biology. D. Furniss, et al., Fieldwork for Healthcare: Case Studies Investigating Human Factors in Computing Systems, Morgan and Claypool, San Rafael, California, 2014. F. Cabitza, A. Locoro, C. Batini, A User Study to Assess the Situated Social Value of Open Data in Healthcare, Procedia Computer Science, 2015, pp. 306e313. Hospital Compare Downloadable Data Dictionary, 2018. Retrieved from, https:// data.medicare.gov/data/hospital-compare. “Hospitals of Tomorrow: 10 Characteristics for Future Success”. Retrieved from: https://www.beckershospitalreview.com/hospital-management-administration/ hospitals-of-tomorrow-10-characteristics-for-future-success.html. “Healthcare’s ‘Big Data’ Challenge,” AJMC. [Online]. Available from: https://www. ajmc.com/journals/issue/2013/2013-1-vol19-n7/healthcares-big-data-challenge. J.C. McGrew, A.J. Lembo, C.B. Monroe, An Introduction to Statistical Problem Solving in Geography, Waveland Press, 2014. K. Bansal, J.D. Medaglia, D.S. Bassett, J.M. Vettel, S.F. Muldoon, Data-driven brain network models differentiate variability across language tasks, PLoS Comput. Biol. 14 (10) (2018) e1006487. M.F. Drummond, L.M. Davies, F.L. Ferris, Assessing the costs and benefits of medical research: the diabetic retinopathy study, Soc. Sci. Med. 34 (9) (1992) 973e981. https://doi.org/10.1016/0277-9536(92)90128-d. M. Mandel, The Failed Promise of Innovation in the U.S, July, 2009. Retrieved from: https://www.bloomberg.com/news/articles/2009-06-03/the-failed-promise-ofinnovation-in-the-u-dot-s.



7

Knowledge formulation in the health domain: a semiotics-powered approach to data analytics and democratization
Erik W. Kuiler, Connie L. McNeely
GEORGE MASON UNIVERSITY, ARLINGTON, VA, UNITED STATES

Aliquid stat pro aliquo.1 Anonymous

From the cradle to the grave, from awakening until sleep, the contemporary individual is subjected to an unending barrage of signs through which other persons seek to advance their goals. He is told what to believe, what to approve and disapprove, what to do and not to do. If he is not alert, he becomes a veritable robot manipulated by signs, passive in his beliefs, his valuations, his activities. [...] Against this exploitation of individual life, semiotics can serve as a counter force. When an individual meets the signs with which he is confronted with a knowledge of how signs work, he is better able to co-operate with others when cooperation is justified.2

Charles W. Morris

Abstract
Developments in knowledge-based systems and information and communications technology (ICT) occupy increasingly important positions in the health domain. The advent of ICT has enabled "big data" analytics that emphasize leveraging large, complex datasets to manage population health, drive down disease rates, and control costs. Moreover, knowledge formulation, as well as the growth and application of ICT-supported knowledge-based systems, has important implications for the practice of medicine, healthcare distribution, evidence-based health policy making, and professional research agendas. A conceptual analytical framework is presented that encompasses different aspects of knowledge formulation based on semiotics-focused interdependencies. Semiotics is normative, bounded by epistemic and cultural contexts, and provides the foundation for ontology development.

1 Something stands for something else.
2 [1, p. 240].


Knowledge formulation depends on ontologies to provide repositories for formal specifications of the meanings of symbols delineated by semiotics. Accordingly, the framework posits and addresses semiotics-related pragmatics through two kinds of paradigms: (1) a model-based analytics paradigm that reflects the needs of health researchers and (2) a heuristics-based analytics paradigm that is appropriate for medical staff, patients, and other nonresearch-oriented end-user communities. The framework supports data democratization through the operationalization of these paradigms.
Keywords: Big data; Data analytics; Health; Knowledge.

1. Introduction
The availability of inexpensive high-speed computers, voice recognition and mobile technologies, large datasets in different formats from diverse sources, and the use of social media have provided opportunities for information and communications technology (ICT)-enabled "big data" analytics in the health domain. Big data health analytics and health informatics emphasize leveraging large, complex datasets to manage population health, drive down disease rates, and control costs, as well as the development of knowledge-focused systems that occupy increasingly important positions in the practice of medicine, the democratization of health data access, and the dissemination of healthcare information to support evidence-based health policy formulation, patient and caregiver participation in their own healthcare regimens, and professional research agendas. Health informatics, as a field, is expected to provide a framework for the electronic exchange of health information that complies with all legal requirements and standards. In the United States (US), for example, the impetus for an expanded health informatics came during 2008–10. Under the Health Information Technology (HIT) for Economic and Clinical Health (HITECH) component of the American Recovery and Reinvestment Act of 2009, the Centers for Medicare and Medicaid Services reimburse health service providers for using electronic documents in formats certified to comply with HITECH's Meaningful Use standards [2]. The Patient Protection and Affordable Care Act of 2010 promotes access to healthcare and greater use of electronically transmitted documentation [3]. As is apparent, in addition to supporting healthcare professionals, managers, and payers, the purview of health informatics has changed to address the needs of patients, caregivers, researchers, and other consumers of healthcare information. This emphasis on consumer-accessible health


data and information, i.e., "data democratization" in the healthcare industry and academia, has expanded the scope of access to health informatics to include additional dimensions of medical information, such as health education and promotion, along with public and environmental health awareness. The scope of the health domain in this regard is extensive: clinical health and healthcare delivery, public health, population health, environmental health, global health, and international collaborative research. Big health datasets may contain clinical device sensor data, reflecting the operationalization of Internet of things (IoT) signals, and process-driven data as well as socioeconomic information. Moreover, massive datasets frequently come from widely distributed sources, reflecting different ontological bases and different cultural, institutional, and epistemic contexts, so that health-focused ICT development requires rigorously applied standards in governance and management processes to ensure consistent quality and secure and authorized data access throughout the data analytics and knowledge development lifecycles. These data complexities present multiple challenges for healthcare system developers, as organizations must develop new rules and algorithms to standardize and manage data, derive healthcare information consistently, and support diverse communities that use health information, while preserving patient privacy and supporting information security. Given the complexities of this situation, an integrative approach to knowledge formulation in the health domain is presented here as a practical and adaptable framework to address a number of challenges engendered by the availability of massive health datasets. This approach is encapsulated in a conceptual analytical framework that focuses on different aspects of an epistemically constrained, semiotics-based process of extrapolating signals from noise, transforming signals into data, and, subsequently, into information and then knowledge that may be operationalized as ICT applications to support knowledge-based health analytics.

2. Conceptual foundations
As an episteme, semiotics has its roots in Classical Greece, drawing from four traditions: logic, semantics, rhetoric, and hermeneutics [1,4–7]. Although the term medical semiotics may be of a recent date, the practice of semiotics in medicine is well established in the Western tradition. Galen of Pergamum (130–200 CE), a noted philosopher and physician,


referred to the practice of diagnosis as semiosis, which became the science of symptomology [8].3,4 The scope of modern medical semiotics has expanded with the use of electronic health records (EHR), electronic medical records (EMR), and personal health records (PHR); digital imaging; digitized procedures; increasing sophistication in laboratory test formulation; real-time availability of sensor data; and, what stands out in the popular press, the introduction of genomics-related projects. This expansion has transformed the functions of the Internet such that, in addition to serving as a digital transportation infrastructure, the Internet has become an ever-expanding semiotic network [11–13]. Semiotics, as an intellectual discipline, focuses on the relationships between signs, the objects to which they refer, and the interpreters (human individuals or ICT intelligent agents) who assign meaning to the conceptualizations as well as instantiations of such signs and objects based on those relationships. By focusing on diverse kinds of communications as well as their users, semiotics supports the development of a conceptual framework to explore various aspects of knowledge transformation and dissemination that include the reception and manipulation of signs, symbols, and signals received from diverse sources, such as signals received from medical IoT devices, and the algorithms and analytics required to transform signals and data into information and knowledge. From the perspective of an ICT-focused semiotics, a signal is an anomaly discovered in the context of a perceived indiscriminate, undifferentiated field of noise and is, in effect, the recognition of a pattern out of noise. Frequently, signals provide an impetus to action. For example, in an IoT network, signals reflect properties of frequency, duration, and strength that indicate a change in an object's state or environment to elicit a response from an interpreting agent (be it an ICT or human agent) or to transfer meaning. In the health domain, coded secure signals sent by portable IoT-linked personal monitors enable data gathering to support individual health and well-being [14]. Signals become data, i.e., the creation of facts (something given or admitted, especially as a basis for reasoning or inference), by imposing syntactic and symbolic norms on signals. Data become signs when semantics are applied.

3 Galen posited that the best physician is also a philosopher, arguing that philosophy will enable the physician to distinguish truth from illusion, underlying reality from surface appearances. The application of this insight to symptomology is obvious. Furthermore, he argued that the practice of medicine should be based on empiricism so that experience supported by observation and demonstration would lead to the correct therapeutic results in many instances.
4 Barthes explores semiotic aspects of medicine [9]. See also Burnum for a discussion of semiotics in medicine [10].


Signs reflect cultural and epistemic norms and function as morphemes of meaning by representing objects in the mind of the interpreter [15,16]. In Aristotelian terms, signs and their associated symbols are figures of thought that allow an individual to think about an object without its immediate presence. In this sense, signs have multiple aspects: a designative aspect, which points an interpreter to a specific object (an index); an appraisive aspect, which draws attention to the object's ontological properties; and a prescriptive aspect, which instructs the interpreter to respond in a specific way, such as in response to a transmission stop signal [17]. A symbol is a mark, established by convention, to represent an object, state, process, or situation. For example, a flashing red light is usually used to indicate a state of danger or failure. Information comprises at least one datum. Assuming that the data are syntactically congruent, information constitutes the transformation of data by attaching meaning to the collection in which the individual data are grouped, as the result of analyzing, calculating, or otherwise exploring them (e.g., by aggregation, combination, decomposition, transformation, correlation, mapping, etc.), usually for assessing, calculating, or planning a course of action. Knowledge constitutes the addition of purpose and conation to the understanding gained from analyzing information [18].
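The signal-to-data-to-information-to-knowledge progression described above can be made concrete with a small sketch. The following Python fragment is a minimal illustration under assumed names (Signal, Datum, the heart-rate observation and threshold); none of it is drawn from a health IT standard.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

# Hypothetical types illustrating the signal -> data -> information -> knowledge
# progression sketched in the text (names and values are assumptions).

@dataclass
class Signal:                 # an anomaly detected against background noise
    source: str               # e.g., an IoT-linked personal monitor
    value: float
    unit: str

@dataclass
class Datum:                  # a signal with syntactic/symbolic norms imposed
    patient_id: str
    observation: str          # controlled term, e.g., "heart_rate"
    value: float
    unit: str

def to_datum(signal: Signal, patient_id: str, observation: str) -> Datum:
    """Impose syntactic norms (typed fields, controlled vocabulary) on a raw signal."""
    return Datum(patient_id=patient_id, observation=observation,
                 value=signal.value, unit=signal.unit)

def to_information(data: List[Datum]) -> dict:
    """Attach meaning to a grouped collection of data (here, a simple aggregation)."""
    return {"observation": data[0].observation,
            "n": len(data),
            "mean": mean(d.value for d in data)}

def to_knowledge(info: dict, threshold: float) -> str:
    """Add purpose: turn information into an actionable interpretation."""
    return "follow up recommended" if info["mean"] > threshold else "no action"

signals = [Signal("wrist_monitor", v, "bpm") for v in (88.0, 95.0, 102.0)]
data = [to_datum(s, "patient-001", "heart_rate") for s in signals]
info = to_information(data)
print(info, "->", to_knowledge(info, threshold=90.0))
```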

2.1 Semiotics

Semiotics comprises three interrelated disciplines: syntactics, semantics, and pragmatics [1,19].5 Syntactics focuses on sign-to-sign relationships, i.e., the manner in which signs may be combined to form well-formed composite signs (e.g., well-formed predicates). Semantics focuses on sign–object relationships, i.e., the signification of signs and the perception of meaning, for example, by implication, logic, or reference. Pragmatics focuses on sign–interpreter relationships, i.e., methods by which meaning is derived from a sign or combination of signs in a specific context. From the perspective of ICT-supported healthcare semiotics, pragmatics has two complementary, closely linked components: (1) operationalization support and (2) analytics support. Operationalization pragmatics focuses on the development, management, and governance of ontologies and lexica, metadata, interoperability, etc. Analytical pragmatics focuses on how meaning is derived by engaging rhetoric, hermeneutics, logic, and heuristics and their attendant methods to discern meaning in data, create information, and develop knowledge.

5 Syntactics refers to the analysis of the formal relationships of signs to one another, the use of signs or combinations of signs as they comport to syntactic rules, and how signs may be combined with other signs to form compound signs [1,19].

2.2 Semantics: lexica and ontologies

Lexica and ontologies provide the semantics component for a semiotics-focused approach to information derivation and knowledge formulation. Lexica and ontologies reflect social constructions of reality, defined in the context of specific epistemic cultures as sets of norms, symbols, human interactions, and processes that collectively facilitate the transformation of data into information and knowledge [20–22]. A lexicon functions as a controlled vocabulary and contains the terms and their definitions that collectively constitute the epistemic domain. For example, in the health domain, the Systematized Nomenclature of Medicine Clinical Terms (maintained by the International Health Terminology Standards Development Organization) is a multilingual lexicon that provides coded clinical terminology extensively used in EHR management. RxNorm, maintained by the US National Institutes of Health National Library of Medicine, provides a common nomenclature for clinical drugs with links to their equivalents in a number of other drug vocabularies used in pharmacy management and drug interaction software. The Logical Observation Identifiers Names and Codes (LOINC, managed by the Regenstrief Institute) provide a standardized lexicon for reporting laboratory results. The terms and their definitions that constitute the lexicon provide the basis for the ontology, which, as noted earlier, delineates the interdependencies among categories and their properties, usually in the form of similes, meronymies, and metonymies. Ontologies define and represent the concepts that inform epistemic domains, their properties, and their interdependencies. An ontology, when populated with valid data, provides a base for knowledge formulation that supports the analytics of those data that collectively operationalize that domain. An ontology informs a named perspective defined over a set of categories (or classes) that collectively delimit a domain of knowledge. In this context, a category delineates a named perspective defined over a set of properties; for example, cigarette and smokeless tobacco products are categories in a tobacco product ontology. A property constitutes an attribute or characteristic common to the set of instances that constitute a category, for example, length, diameter, or mode of ingestion. A taxonomy is a directed acyclic perspective defined over a set of categories, for example, a hierarchical tree structure depicting the various superordinate, ordinate, and subordinate categories of an ontology [18]. Ontologies provide the semantic congruity, consistency, and clarity to support different algorithmic-based aggregations, correlations, and regressions [21,22]. From an HIT perspective, ontologies enable the development of interoperable information systems that support, for example, the data and information requirements of healthcare providers, consumers, and payers.
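To make the lexicon/ontology/taxonomy vocabulary concrete, the sketch below encodes the chapter's tobacco-product example as plain Python structures. It is an illustrative toy, not a rendering of SNOMED CT, RxNorm, or LOINC; all terms, properties, and values are assumptions.

```python
# A minimal sketch of lexicon/ontology/taxonomy structures (illustrative only).

lexicon = {                       # controlled vocabulary: term -> definition
    "tobacco product": "a product made or derived from tobacco",
    "cigarette": "a combustible tobacco product wrapped in paper",
    "smokeless tobacco": "a tobacco product consumed without combustion",
}

ontology = {                      # category -> properties common to its instances
    "cigarette": {"length_mm", "diameter_mm", "mode_of_ingestion"},
    "smokeless tobacco": {"mode_of_ingestion", "moisture_content"},
}

taxonomy = {                      # directed acyclic hierarchy: child -> parent
    "cigarette": "tobacco product",
    "smokeless tobacco": "tobacco product",
    "tobacco product": None,      # root category
}

def ancestors(category: str) -> list:
    """Walk the taxonomy from a category up to the root."""
    chain = []
    parent = taxonomy.get(category)
    while parent is not None:
        chain.append(parent)
        parent = taxonomy.get(parent)
    return chain

def shared_properties(cat_a: str, cat_b: str) -> set:
    """Paradigmatic comparison: properties two categories have in common."""
    return ontology.get(cat_a, set()) & ontology.get(cat_b, set())

print(ancestors("cigarette"))                               # ['tobacco product']
print(shared_properties("cigarette", "smokeless tobacco"))  # {'mode_of_ingestion'}
```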

2.3 Syntagmatics: relationships and rules

As morphemes of meaning, signs participate in complex relationships. In paradigmatic relations, signs obtain their meaning from their association with other signs based on substitution, so that other signs (terms, objects) may be substituted for signs in the predicate, provided that the signs belong to the same ontological category (i.e., paradigmatic relations support lexical alternatives and semantic likenesses). Indeed, the notion of paradigmatic relations is foundational for the development of ontologies by providing the means to develop categories based on properties shared by individual instances. For example, if two individuals exhibit the same symptoms, it may be likely that they belong to the same category of patients. In syntagmatic relations, in contrast with paradigmatic relations, signs obtain their meaning by the way they are linked or combined with each other. In the US health domain, for example, the Consolidated Clinical Document Architecture (C-CDA) is an interoperability standard that provides a common architecture, coding, semantic framework, and mark-up language for the creation of electronic clinical documents.6 Each C-CDA-compliant EHR is expected to adhere to the structural templates provided by the C-CDA. In addition to supporting ontology development and data interoperability, paradigmatic and syntagmatic relations also serve important roles in the development of inference engines. There are also a number of rules that support semiotics operationalization. Category assignment rules specify the restrictions imposed on categorical membership, i.e., the values that may be assigned to data fields (for example, a restriction on a range of values that may be selected for a field). Alethic rules indicate the possibility or impossibility of the existence of an entity (logical necessity). For example, in a clinical database, it is necessary for a patient to have an identification number; without such a number, the patient cannot formally exist in that environment. Deontic rules are normative and constrain the formal existence of an item. Such rules come into play, for example, in determining the propriety of an insurance claim as either valid or invalid and the attendant penalties for filing an invalid claim.
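One way to read these rule types is as validation predicates over a record, as in the hypothetical sketch below; the field names, value ranges, and penalty handling are assumptions, not drawn from any clinical system.

```python
# Illustrative rule checks over a hypothetical claim record (all names/values assumed).

record = {"patient_id": "P-1001", "diastolic_bp": 85, "claim_valid": False}

def category_assignment_rule(rec: dict) -> bool:
    # Restricts the values a field may take (here, a plausible physiological range).
    return 0 <= rec.get("diastolic_bp", -1) <= 200

def alethic_rule(rec: dict) -> bool:
    # Logical necessity: a patient cannot formally exist without an identifier.
    return bool(rec.get("patient_id"))

def deontic_rule(rec: dict) -> str:
    # Normative constraint: an invalid claim triggers a penalty workflow.
    return "no penalty" if rec.get("claim_valid") else "penalty review"

print(category_assignment_rule(record), alethic_rule(record), deontic_rule(record))
```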

2.4 Syntactics: metadata

Whereas the lexicon and ontology support the semantic and interpretive aspects of data analytics, metadata support the semantic and syntagmatic operational aspects of data analytics.

6 These standard and introductory materials are available from https://www.hl7.org/implement/standards/ and http://www.hl7.org/implement/standards/product_matrix.cfm?ref=nav.


Metadata are generally considered to be information about data and are usually formulated and managed to comply with predetermined standards [18,25]. Operational metadata reflect the management requirements for data security and safeguarding personally identifiable information; data ingestion, federation, and integration; data anonymization; data distribution; and analytical data storage. Structural (syntactic) metadata provide information about data structures (e.g., file layouts or database table and column specifications). For example, the Health Level 7 Clinical Document Architecture specification provides an increasingly accepted interoperability template for XML-compliant clinical documents [23,24]. Bibliographical metadata provide information about the dataset's producer, such as the author, title, table of contents, and applicable keywords of a document; data lineage metadata provide information about the chain of custody of a data item with respect to its provenance, i.e., the chronology of data ownership, stewardship, and transformations. Metadata also contain filtering rules that specify the requirements that data must meet to be ontologically useful rather than being excluded as "noise." Ontological assignment rules specify the restrictions imposed on ontological category membership, the values that may be assigned to ontological properties (e.g., a restriction on a range of values that may be selected for a property associated with a category), and the valid interdependencies that may exist between ontological categories. Taxonomic assignment rules stipulate branch points in a taxonomy, based on a particular value of a property, or values of a set of properties, reflecting an "if ... then ... else" logic. In addition, the application of ontological and taxonomic assignment rules reflects the use of controlled sets of codes, their values, and their interpretations that delimit the range and domain of the properties identified in the ontology (for example, AK, Alaska; MD, Maryland; and MI, Michigan are members of a hypothetical US State property) [18]. Metadata also provide information on the data storage locations, either as local or as cloud-based data stores. The ingestion, federation, and integration of externally acquired, usually transient, very large complex datasets, containing structured as well as unstructured data, are difficult to manage with currently deployed RDBMS-based architectures. Thus, the introduction of big data analytics will, quite likely, require IT infrastructure and software components specifically dedicated to big data management. Likewise, big data distribution may require specific software packages dedicated to providing related services.
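The assignment rules and controlled code sets described above can be sketched as metadata-driven checks. The example below is hypothetical; the metadata fields, the code set, and the branch labels are assumptions, not an HL7 or ONC schema.

```python
# Sketch of metadata-driven checks (names and values are assumptions).

metadata = {
    "bibliographic": {"producer": "Example Health System",
                      "title": "Hypothetical encounter extract"},
    "lineage": ["source EHR export", "anonymization step", "analytical store load"],
    "code_sets": {"us_state": {"AK": "Alaska", "MD": "Maryland", "MI": "Michigan"}},
}

def ontological_assignment_rule(field: str, value: str) -> bool:
    """Membership restriction: the value must belong to the controlled code set."""
    return value in metadata["code_sets"].get(field, {})

def taxonomic_assignment_rule(record: dict) -> str:
    """Branch-point logic of the 'if ... then ... else' form described in the text."""
    if not ontological_assignment_rule("us_state", record.get("state", "")):
        return "noise"                      # filtered out as ontologically unusable
    elif record.get("setting") == "inpatient":
        return "inpatient branch"
    else:
        return "outpatient branch"

print(taxonomic_assignment_rule({"state": "MD", "setting": "inpatient"}))  # inpatient branch
print(taxonomic_assignment_rule({"state": "ZZ"}))                          # noise
```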

2.5 Data interoperability and health information exchange

Data interoperability is a primary goal of the Office of the National Coordinator (ONC) for HIT in the US Department of Health and Human


Services. The ONC supports health information exchange (HIE) by formulating applicable policies, services, and standards.7 Data interoperability standards enable data exchanges between different systems, regardless of provider, recipient, or application vendor, by means of data exchange schemata and standards. Interoperability standards, implemented at intersystem service levels, establish thresholds for exchange timeliness, transaction completeness, and content quality. Health domain-focused semiotics, predicated on lexica and ontologies, pragmatics, and syntactics, support interoperability requirements to transmit, reuse, and share data, such as patient data, claims data, and payment data, in accordance with predetermined semantic and syntactic rules [25,26].
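A service-level interoperability check of the kind described above might be sketched as follows; the message fields, the 24-hour timeliness threshold, and the example payload are assumptions rather than ONC or HL7 requirements.

```python
from datetime import datetime, timedelta

# Hypothetical HIE message and service-level thresholds (all values are assumptions).
REQUIRED_FIELDS = {"patient_id", "sender", "receiver", "payload"}
MAX_EXCHANGE_DELAY = timedelta(hours=24)

def check_exchange(message: dict, sent_at: datetime, received_at: datetime) -> dict:
    """Evaluate timeliness, completeness, and minimal content quality of an exchange."""
    return {
        "timely": (received_at - sent_at) <= MAX_EXCHANGE_DELAY,
        "complete": REQUIRED_FIELDS.issubset(message),
        "non_empty_payload": bool(message.get("payload")),
    }

msg = {"patient_id": "P-1001", "sender": "clinic-A", "receiver": "lab-B",
       "payload": {"lab_result": "illustrative value"}}
print(check_exchange(msg, datetime(2019, 1, 1, 8, 0), datetime(2019, 1, 1, 9, 30)))
```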

2.6 Semiotics-based analytics

Health data analytics are applied to structured and unstructured, quantitative, textual, and graphical data from diverse sources, such as EHRs, clinical documentation (including handwritten practitioners' notes), magnetic resonance imaging results, lab results, clinical trials, telemedicine data, and genomic data. Descriptive, predictive, and prescriptive analytics of quantitative health data are statistics-based and are applied to structured data (or structured data derived from unstructured data). Data analytics of unstructured text data focus on content analysis and topic analysis, using text mining and natural language processing (NLP) software [27].8
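As a minimal illustration of the text-analytics path, the sketch below runs a bag-of-words model and LDA topic decomposition over a few fabricated note snippets with scikit-learn (assumed installed, version 1.0 or later); it is a generic example, not the NLP workflow of [27].

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy, fabricated note snippets standing in for unstructured clinical text.
notes = [
    "patient reports chest pain and shortness of breath",
    "follow up on elevated blood pressure and medication adherence",
    "chest pain resolved, breathing normal at discharge",
    "blood pressure stable, continue current medication",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(notes)              # bag-of-words matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {top_terms}")
```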

2.7 Model-based analytics

2.7.1 Information domain delineation: contexts and scope
The success of any data analytics project depends on understanding the problem to be solved. A standard approach is to reformulate a scientific or business problem as a research question that data analytics can help solve, helping to define the scope and parameters of the problem. Problems do not occur in vacuums; rather, they and their solutions reflect the complex interplay between organizational cultures and the larger context or contexts in which the organization operates.

7 The authors discuss the role of ontologies in data interoperability, focusing on technical, syntactic, semantic, and organizational aspects of data interoperability.
8 There are a number of standard methods for data analytics and information modeling as well as knowledge-based development, such as the Cross-industry Standard Process for Data Mining (CRISP-DM) and Knowledge Acquisition and Design Structuring (KADS) [30–32].


2.7.2 Data identification (exploration)
Establishing the information domain is a critical analytics goal, and data provide the building blocks with which information may be constructed. Ontologies and lexica provide the basis for defining and ensuring semantic consistency of data items.

2.7.3 Data preparation (data staging)
Because data may come from disparate sources, it may be necessary to alter the data so that units of analysis are defined at the same levels of abstraction and use the same code sets and value ranges. For example, if diagnostic data from one source are aggregated at the county level and the same kind of data from another source are aggregated at the hospital level, their straightforward and uncritical consolidation can make for ecological fallacy and data validity problems [28,29]. The data need to be transformed to conform to the same unit of analysis before they are consolidated for use. A "sandbox" may provide a discrete environment for data staging without compromising the integrity of the authoritative datasets.
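The unit-of-analysis alignment described above might look like the following pandas sketch, in which hospital-level counts are rolled up to the county level before being joined with county-level data; the column names and values are fabricated.

```python
import pandas as pd

# Fabricated example data: one source at hospital level, another at county level.
hospital_level = pd.DataFrame({
    "hospital": ["H1", "H2", "H3"],
    "county":   ["Fairfax", "Fairfax", "Arlington"],
    "cases":    [120, 80, 60],
})
county_level = pd.DataFrame({
    "county":     ["Fairfax", "Arlington"],
    "population": [1_150_000, 236_000],
})

# Roll the hospital-level data up to the shared unit of analysis (county)
# before consolidating, rather than joining mismatched aggregation levels directly.
by_county = hospital_level.groupby("county", as_index=False)["cases"].sum()
staged = by_county.merge(county_level, on="county")
staged["cases_per_100k"] = staged["cases"] / staged["population"] * 100_000
print(staged)
```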

2.7.4 Information model development
Models present interpretations of reality from particular perspectives. To support data analytics, the models typically take the form of mathematical formulas or statistical predicates that reflect a particular interpretation of a dataset, resulting from the application of deductive or inductive approaches. Induction-based algorithms are useful, for example, in unsupervised learning settings, where testable hypotheses have not been established. Examples are text mining and content and topic analyses used, in this case, not to prove or disprove a hypothesis, but simply to explore corpora of documents for lexical clusters and patterns. In contrast, deduction-based algorithms are useful in supervised learning settings, where the purpose is to answer a research question and prove, or disprove, hypotheses formulated before analyzing the data. It is unlikely that an optimally efficient model can be built at first try; therefore, models should be built iteratively. Thus, they may be refined by adding or subtracting determinant or discriminant variables until an unbiased, efficient model emerges from the data under analysis. Only when a model is tested should it be considered for presentation and deployment.
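The iterative refinement described above can be operationalized in many ways; one simple, generic possibility is the backward-elimination loop sketched below with scikit-learn on synthetic data. It illustrates the iterate-and-test idea, not a prescribed modeling method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a prepared analytical dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

features = list(range(X.shape[1]))
best_score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Iteratively drop any feature whose removal improves cross-validated accuracy.
improved = True
while improved and len(features) > 1:
    improved = False
    for f in list(features):
        trial = [c for c in features if c != f]
        score = cross_val_score(LogisticRegression(max_iter=1000), X[:, trial], y, cv=5).mean()
        if score > best_score:
            best_score, features, improved = score, trial, True
            break

print(f"retained features: {features}, cv accuracy: {best_score:.3f}")
```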

2.7.5 Information presentation
Analytical results may be presented as business intelligence dashboards, visual depictions of data analytics results, or formal presentations of research findings, such as a paper for publication in a standard format. A standard format provides an overview of the problem and the research question or hypotheses that provide the impetus for developing the model, followed by a description of the data and any problems encountered with those data, the analytical methods used, and the results of the analysis.


2.7.6 Heuristics-based analytics
Heuristics constitute an iterative, trial-and-error approach to knowledge acquisition that, in practice, reflects the analyst's insights and prior store of knowledge, frequently limited by time constraints. Arguably, heuristics reflect an NP (nondeterministic polynomial time) completeness approach to knowledge acquisition and usually rely on frequency of iterations to minimize issues of aleatory variability and epistemic uncertainty (due to uncertainty of outcomes and limitations in data and knowledge). Thus, as noted by Burnum [10, p. 940]:

From the semiotic perspective, in diagnosis and relating to patients, we physicians are on the receiving end of communication and must assign meaning to signs. Medical signs include the patient, the history and physical examination, test results, and all relevant information. Diagnosis, however, precisely because it does depend on transforming one thing into another, can be hazardous; uncertainty is never far away. ... Semiotics warns us that if we are to minimize errors in interpretation, we must remember that medical signs are but symbolic, often ambiguous proxies of truth whose meaning, furthermore, is shaped by its contexts and whose interpretation lies at the mercy of inference and the experience and bias of the individual physician.

Implicit in this observation are the heuristics involved in medicine, in which medical knowledge is the result of an accretionary process that incorporates individual contributions, bounded by epistemic norms and processes (e.g., peer reviews and shared consultations), which contribute to the knowledge base.9 As shown in Fig. 7.1, a heuristics-based approach to knowledge can be depicted as a "cycle of pragmatism" which, as formulated by philosopher Charles Sanders Peirce [33], integrates reasoning methods with purposeful activities, providing a meta-level framework relating methods of reasoning, inquiry, and research in collaborative, self-correcting cycles [34].

9 Indeed, because medicine depends so heavily on interpretation, it is not too difficult to consider the practice of medicine as the operationalization of hermeneutic circles; i.e., the whole must be understood from the individual and the individual from the whole [35,36].


FIGURE 7.1 Heuristics Analytical Cycle (nodes: Conceptualization & Theory, Knowledge Base, Prediction, Demonstration).

In this depiction, the "Knowledge Base" reflects the results of operationalizing semiotic resources and provides one stage gate in the heuristics cycle; "Prediction" provides another. "Conceptualization and Theory" and "Demonstration" bound the epistemic domain. Abductive reasoning depends on perception and heuristics that support knowledge discovery by positing hypotheses to explain observations in the context of an existing knowledge base. Abductive reasoning is enthymematic rather than syllogistic and focuses on proof by example.10 Frequently, a "hunch" is played; the precise premises that constitute the argument are not necessarily known, and, thus, initially, propositions that seem likely to hold true are explored [34]. Common heuristics methods to support abductive reasoning include text and data mining, statistical analytics, and analogous reasoning.11 Arguably, the health domain can be perceived as "information rich" but "knowledge poor."

10 Abduction approximates an Aristotelian approach that uses an epagoge (proof by example derived from experience with similar cases), in which the major premise is true but the minor premise is probable.
11 Abductive reasoning has been used effectively in nursing, and, under certain circumstances, heuristics outperform "information-greedy" methods, such as regressions, in medical diagnoses [39,40].


Data analytical techniques, such as pattern discovery and formulation, statistics, and algorithm development, have proven to be useful in expanding the knowledge base by extrapolating inferences or forecasting trends from existing information resources, for example, by anticipating a patient's future behavior based on that patient's medical history, performing health insurance trend analyses, or forecasting treatment costs and demand for resources [37,38]. Heuristics-based abduction may provide an impetus to revise epistemic beliefs or perspectives, which may lead to a new series of heuristic cycles to refine or expand the knowledge base. Deduction supports the development of logical inferences, enabled, for example, by theorem provers and inference engines.12 Inference provides the basis for predictions, which supports purposive action. For example, the purpose of MYCIN, a first-generation (mid-1970s) rule-based decision support system, was to provide consultation advice on diagnoses of, and therapies for, bacteremia.13 Induction provides the foundation for developing neural networks, pattern recognition software, Bayesian nets, etc., to determine an epidemiological event based on a shared set of characteristics in a population. (Analyzing outbreaks of such diseases as tuberculosis provides obvious examples.) Collectively, the results of executing the activities that comprise the heuristic analytics cycle depicted support the expansion and refinement of the knowledge base for future use. In addition, the activities depicted in Fig. 7.1 support NLP, albeit on a limited scale, given that language is foundational to semiotics. However, the current state of NLP is such that, as a discipline, it has difficulties dealing with differences in dialect and patois (it seems that each doctor has his or her own way of expressing him- or herself) and that, frequently, a medical facility develops its own patois as well as such linguistic artifacts as negation and figures of speech, as is the case with organizations more generally, including meronymy, metonymy, metaphor, and, perhaps most difficult of all, irony [41].14
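To ground the deduction step, the sketch below implements a toy forward-chaining rule engine in the spirit of early rule-based decision support such as MYCIN; the rules and facts are invented for illustration and do not represent MYCIN's actual knowledge base or any clinical guidance.

```python
# Toy forward-chaining inference engine (illustrative; invented rules, not MYCIN's).
rules = [
    ({"fever", "elevated_wbc"}, "possible_infection"),
    ({"possible_infection", "positive_blood_culture"}, "suspected_bacteremia"),
    ({"suspected_bacteremia"}, "recommend_specialist_consult"),
]

def forward_chain(facts: set) -> set:
    """Repeatedly apply rules whose premises are satisfied until no new facts appear."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

observations = {"fever", "elevated_wbc", "positive_blood_culture"}
print(forward_chain(observations) - observations)
# {'possible_infection', 'suspected_bacteremia', 'recommend_specialist_consult'}
```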

3. A semiotics-centered conceptual framework for data democratization
EHRs provide the foundation for health data democratization, supporting the data and information requirements of different user communities, including researchers, medical providers, pharmacies, insurance companies, and end-user consumers, i.e., patients and caregivers.15

12 In contrast with abductive reasoning, deductive reasoning tends to be syllogistic.
13 XCON, developed in the late 1970s by Digital Equipment Corporation (DEC), helped order and configure DEC VAX computer systems based on customer requirements.
14 However, the Unified Medical Language System supports automated encoding of clinical documents based on NLP methods and techniques [27,44].
15 Data democratization, as a concept, has its provenance, in part, in the end-user computing movement of the 1980s. The intent of data democratization, as represented in the popular media, is to make information available to the "average" user who does not have an extensive ICT background [45].


Electronic health documentation may contain both structured and unstructured data (e.g., coded diagnostic data, clinicians' notes, personal genomic data, and X-ray images) and usually takes one of three forms, each of which must comply with predetermined standards before it is authorized for use: an EMR for use by authorized personnel within a healthcare organization, an EHR that provides health-related information about an individual that may be created, managed, and exchanged by authorized clinical personnel, and a PHR that provides healthcare information about an individual from diverse sources (clinicians, caregivers, insurance providers, pharmacies, and support groups) for the individual's personal use [42,43]. Indeed, PHRs provide the basis for shared decision-making (SDM) by clinicians and their patients, which, although still in its initial stages, shows promise as a foundation for clinician–patient-focused data democratization [46]. A number of healthcare organizations have linked their EHRs to PHRs and made them available via secure portals and mobile devices so that patients, caregivers, and other authorized individuals can access these records to participate more fully in their healthcare regimens. Thus, among patients who initially relied on computer-only access to their PHR information, frequency of use increased when they acquired mobile access to their records. In a recent study of PHR use by diabetes patients, mobile access increased the frequency with which patients accessed their records [47]. In addition, studies of PHR use indicate that the adoption of such records enhances patient-provider communications and increases patient satisfaction [48].16 However, the efficacy of PHR systems may be compromised by limitations that frequently reflect socioeconomic factors, such as race. For example, while patients of color with mobile access to their records are more likely to view their laboratory results within 7 days when not limited to computer-only access [47], disparities have persisted in PHR registrations by race [48,49]. No statistically significant differences have been apparent among white patients' access to PHR information [47].

16 The use of PHRs to support data democratization is recent, and, although the volume of research on this topic is increasing, the results currently are too few to suggest, much less develop, general correlations between data democratization and quality of care. See Pearce et al. [51] for a successful instance of data democratization; see Toscos et al. [52] for a different result.


FIGURE 7.2 A Conceptual PHR System Architecture (components include a lexicon, ontology, and metadata; analytics and reporting; identity/security; data ingestion, federation, and integration; data anonymization; an analytical data store/knowledge base; and data distribution, drawing on EMR/PHR data, pharmacies, laboratories, providers, payers/claims, radiology, and medical devices (IoT)).

3.1 Data democratization conceptual architecture

Fig. 7.2 shows a conceptual architecture that focuses on the complexity of PHR information that must be maintained to support the requirements of various user communities that may obtain access to this information. Effective PHR-based data democratization and ICT implementations require consistently applied semantics and syntactics, robust interoperability mechanisms, and rigorously enforced security and privacy protocols that are at least HIPAA compliant. Combined, these components collaborate to ensure that patient-controlled PHR information, aggregated from diverse sources, may be accessed only by authorized personnel.
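The identity/security and anonymization components in Fig. 7.2 can be sketched as simple functions, as below; the roles, fields, and salted-hash pseudonymization are assumptions for illustration and do not constitute a HIPAA-compliant design.

```python
import hashlib

# Illustrative only: role names, fields, and salted hashing are assumptions,
# not a HIPAA-compliant de-identification or access-control design.
AUTHORIZED_ROLES = {"patient", "caregiver_proxy", "clinician"}
DIRECT_IDENTIFIERS = {"name", "ssn", "address"}

def pseudonymize(record: dict, salt: str) -> dict:
    """Drop direct identifiers and replace the patient ID with a salted hash."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["patient_id"] = hashlib.sha256((salt + record["patient_id"]).encode()).hexdigest()[:16]
    return out

def can_access(role: str, patient_consented: bool) -> bool:
    """Access requires both an authorized role and patient consent."""
    return role in AUTHORIZED_ROLES and patient_consented

phr_record = {"patient_id": "P-1001", "name": "Jane Doe", "a1c_percent": 6.8}
if can_access("clinician", patient_consented=True):
    print(pseudonymize(phr_record, salt="example-salt"))
```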

3.2 Data democratization governance

Organizational PHR governance is an evolving process. Frequently, PHR systems are developed by vendors, insurers, pharmacologists, clinicians, etc., without taking into account the needs of the patients whose records the systems are designed to manage. Patient-centeredness should emphasize respect for patient values, preferences, and expressed needs; information and education; access to care; emotional support; involvement of family and friends; continuity and secure transition between healthcare providers; physical comfort; and coordination of care. Arguably, PHR governance should include not only clinicians but also end-user groups and focus on policies that address data access, including security and privacy (for


example, access by patient proxies, such as caregivers and patients' relatives, and minors); emergency access; PHR contents, including EMR diagnoses and EMR clinical progress notifications; data reuse and patient-entered information; and communications, including clinicians' responses to patients' questions and timely availability of laboratory results [53,54]. In addition, noting that the essence of data democratization is to make the relevant data understandable to everyone, support to ensure end-user competency in PHR system use requires not only access to the records themselves but also effective levels of digital and health literacy.

4. Conclusion
Big data analytics in the health domain emphasize the integration of datasets from diverse sources, each of which may have its own frameworks for creation and interpretation, which require normalization to ensure semantic confluence and syntactic congruence. Semiotics, the discipline that integrates semantics, syntactics, and pragmatics, provides the basis for knowledge formulation by imposing interpretative and morphological consistencies so that data may be transformed dependably into knowledge resources to support user communities. Semiotics supports the process of extrapolating signals from noise, transforming signals into data and, subsequently, into knowledge, epistemically and culturally constrained, which can be operationalized to support strategic planning, clinician–patient SDM, and health data analytics. Integral to a semiotics framework, ontologies and their associated lexica encapsulate the cultural norms and concepts, their properties, and interdependencies that collectively define the health domain and provide the foundations for data interoperability, facilitating the incorporation, synthesis, and integration of data from multiple sources. Knowledge formulation paradigms depend on ontologies to provide repositories for the formal specifications of the meanings of symbols delineated by semiotics. Semiotics is normative, bounded by epistemic and cultural contexts, and provides the foundation for ontology development. When properly implemented, a semiotic framework provides the basis for transforming big data to information to knowledge that may be exchanged with tools for addressing both cultural and technological heterogeneities that otherwise could hinder HIE. Accordingly, data democratization benefits from the execution of a well-designed, rigorously applied, semiotics-based interoperability program. As delineated, the semiotics framework presented here addresses different aspects of pragmatics by supporting two kinds of paradigms: a model-based analytics paradigm and a heuristics-based analytics paradigm. Although data democratization in the healthcare field is still in its


early stages, the framework supports and advances data democratization through the operationalization of these paradigms. The model-based analytics paradigm reflects the needs of analytics-focused researchers, such as data scientists and informaticists; the heuristics-based analytics paradigm is appropriate for medical staff, patients, and other end-user communities. Furthermore, facilitating the effective use of integrated and, where appropriate, synthesized data is fundamental to support program execution and oversight. Accordingly, this semiotics-powered framework supports the data democratization and policy formulation that are critical for providing the institutional guidance, program development, and implementation necessary to achieve community health and societal well-being.

References
[1] C.W. Morris, Signs, language, and behavior, in: C.W. Morris (Ed.), Writings on the General Theory of Signs, Mouton, The Hague, 1946, pp. 73–398.
[2] United States Congress, Health Information Technology for Economic and Clinical Health Act (HITECH), Public Law 111–5 (February 17, 2009). https://www.hhs.gov/hipaa/index.html.
[3] United States Congress, Patient Protection and Affordable Care Act and health-related portions of the Health Care and Education Reconciliation Act of 2010, Public Law 111–152 (December 24, 2009).
[4] Aristotle, in: J.L. Ackrill (Ed.), The Categories and De Interpretatione, Oxford University Press, Oxford, 1963.
[5] Aristotle, De Interpretatione, in: J. Barnes (Ed.), The Complete Works, Vol. 1, Princeton University Press, Princeton, 1984. Bollingen Series: 71.
[6] Aristotle, The Rhetoric, in: J. Barnes (Ed.), The Complete Works, Vol. 2, Princeton University Press, Princeton, 1984. Bollingen Series: 71.
[7] W. Nöth, Handbook of Semiotics, Indiana University Press, Bloomington, 1995.
[8] C. Galenus (Galen), Three Treatises on the Nature of Science, R. Walzer, M. Frede (Transl.), Hackett Publishing, Indianapolis, 1985.
[9] R. Barthes, Sémiologie et médecine, in: R. Bastide (Ed.), Les sciences de la folie, Mouton, The Hague, 1972, pp. 37–46.
[10] J.F. Burnum, Medical diagnosis through semiotics: giving meaning to the sign, Ann. Intern. Med. 119 (9) (1993) 939–943.
[11] J.F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks/Cole, Pacific Grove, CA, 2000.
[12] L. Ohno-Machado, Big science, big data, and the big role for biomedical informatics, J. Am. Med. Inform. Assoc. 19 (e1) (2012).
[13] N.H. Shah, J.D. Tenenbaum, The coming of age of data-driven medicine: translational bioinformatics' next frontier, J. Am. Med. Inform. Assoc. 19 (2012) e1–e2.


[14] C.A. Velasco, Y. Mohamad, P. Ackermann, Architecture of a web of things eHealth framework for the support of users with chronic diseases, in: Proceedings of the 7th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-Exclusion, Vila Real, Portugal, 01–03 December 2016, ACM, pp. 47–53.
[15] C.S. Peirce, Collected Papers of C.S. Peirce, Harvard University Press, Cambridge, MA, 1958.
[16] C.K. Ogden, I.A. Richards, The Meaning of Meaning: A Study of the Influence of Language upon Thought and of the Science of Symbolism, Mansfield Centre, CT, 2013.
[17] C.W. Morris, Signification and Significance: A Study of the Relations of Signs and Values, MIT Press, Cambridge, MA, 1964.
[18] E.W. Kuiler, From big data to knowledge: an ontological approach to big data analytics, Rev. Policy Res. 31 (10) (2014) 311–318.
[19] C.W. Morris, The Foundations of the Theory of Signs, Chicago University Press, Chicago, 1938.
[20] E. Goffman, The Presentation of Self in Everyday Life, Doubleday, New York, 1959.
[21] J.T. Fernández-Breis, R. Valencia-García, R. Martínez-Béjar, P. Cantos-Gómez, A context-driven approach to knowledge acquisition: application to a leukemia domain, in: Modeling and Using Context (CONTEXT 2001), Lecture Notes in Computer Science, vol. 2116, Springer, Heidelberg, 2001.
[22] Y. Liu, A. Coulet, P. LePendu, N.H. Shah, Using ontology-based annotation to profile disease research, J. Am. Med. Inform. Assoc. 19 (2012) e177–e186.
[23] J.M. Ferranti, C. Musser, K. Kawamoto, W.E. Hammond, The clinical document architecture and the continuity of care record: a critical analysis, J. Am. Med. Inform. Assoc. 13 (2006) 245–252.
[24] R.H. Dolin, L. Alschuler, S. Boyer, C. Beebe, F.M. Behlen, P.V. Biron, A. Shabo, HL7 clinical document architecture, release 2, J. Am. Med. Inform. Assoc. 13 (2006) 30–39.
[25] E.W. Kuiler, C.L. McNeely, Federal data analytics in the health domain: an ontological approach to data interoperability, in: F.A. Batarseh, R. Yang (Eds.), Federal Data Science: Transforming Government and Agricultural Policy Using Artificial Intelligence, Elsevier, London, 2018, pp. 161–176.
[26] J.F. Sowa, Ontology, metadata, and semiotics, in: B. Ganter, G. Mineau (Eds.), Conceptual Structures: Logical, Linguistic, and Computational Issues, Springer Verlag, Berlin, 2000, pp. 55–81.
[27] V.M. Pai, M. Rodgers, E. Conroy, J. Luo, R. Zhou, B. Seto, Workshop on using natural language processing for enhanced clinical decision making: an executive summary, J. Am. Med. Inform. Assoc. 21 (2014) 1–4.
[28] C. Ess, F. Sudweeks, Culture, Technology, Communication: Towards an Intercultural Global Village, SUNY Press, Albany, 2001.


[29] S. Schwartz, The fallacy of the ecological fallacy: the potential misuse of a concept and the consequences, Am. J. Public Health 84 (5) (1994) 819–824.
[30] P. Adriaans, D. Zantinge, Data Mining, Addison-Wesley, Harlow, England, 1996. ISBN: 0201403803.
[31] M.J. Berry, G. Linoff, Data Mining Techniques: For Marketing, Sales and Customer Support, Wiley Computer Publishing, Hoboken, NJ, 1997.
[32] R. Studer, V.R. Benjamins, D. Fensel, Knowledge engineering: principles and methods, Data Knowl. Eng. 25 (1998) 161–197.
[33] C.S. Peirce, Pragmatism as a principle and method of right thinking, in: P.A. Turrisi (Ed.), The 1903 Lectures on Pragmatism, SUNY Press, Albany, NY, 1997.
[34] J.F. Sowa, The Cognitive Cycle, 2015 [Online]. Available: http://www.jfsowa.com/pubs/cogcycle.pdf.
[35] H.-G. Gadamer, Truth and Method, 1976, Continuum, New York, 1975.
[36] M. Dorato, Peirce's 'method of tenacity' and the 'method of science': the consistency of pragmatism and naturalism, in: M. Dorato (Ed.), Autonomy of Reason? Autonomie der Vernunft: Proceedings of the V Meeting Italian-American Philosophy, Rome, October 16–19, 2007, LIT Verlag, Vienna, pp. 154–164.
[37] K.B. DeGruy, Healthcare applications of knowledge discovery in databases, J. Healthc. Inf. Manag. 14 (2) (2000) 59–69.
[38] H. Kaur, S.K. Wasan, Empirical study on applications of data mining techniques in healthcare, J. Comput. Sci. 2 (2) (2006) 194–200.
[39] M. Lipscomb, Abductive reasoning and qualitative research, Nurs. Philos. 13 (2012) 244–256.
[40] J.N. Marewski, G. Gigerenzer, Heuristic decision making in medicine, Dialogues Clin. Neurosci. 14 (1) (2012) 77–89.
[41] E.H. Schein, Organizational Culture and Leadership, Wiley, Hoboken, NJ, 2016.
[42] Office of the National Coordinator (ONC), The National Alliance for Health Information Technology Report to the National Coordinator for Health Information Technology on Defining Key Health Information Technology Terms, Health Information Technology, 2008 [Online]. Available: http://www.hitechanswers.net/wp-content/uploads/2013/05/NAHIT-Definitions2008.pdf.
[43] Office of the National Coordinator (ONC), Connecting Health and Care for the Nation: A Shared Nationwide Interoperability Roadmap, Health Information Technology, 2016 [Online]. Available: https://www.healthit.gov/sites/default/files/hie-interoperability/nationwide-interoperability-roadmap-final-version-1.0.pdf.
[44] C. Friedman, L. Shagina, Y. Lussier, G. Hripcsak, Automated encoding of clinical documents based on natural language processing, J. Am. Med. Inform. Assoc. 11 (2004) 392–402.
[45] S.L. Huff, M.C. Munro, B.H. Martin, Growth stages of end user computing, Commun. ACM 31 (5) (1988) 542–550.


[46] S. Davis, A.R. Roudsari, R. Raworth, K.L. Courtney, Shared decision-making using personal health record technology: a scoping review at the crossroads, J. Am. Med. Inform. Assoc. 24 (4) (2017) 857–866.
[47] I. Graetz, J. Huang, R. Brand, J. Hsu, M.E. Reed, Mobile-accessible personal health records increase the frequency and timeliness of PHR use of patients with diabetes, J. Am. Med. Inform. Assoc. 26 (1) (2019) 50–54.
[48] S. Wells, R. Rozenblum, A. Park, M. Dunn, D.W. Bates, Organizational strategies for promoting patient and provider uptake of personal health records, J. Am. Med. Inform. Assoc. 22 (2015) 213–222.
[49] D.W. Roblin, T.K. Houston, J.J. Allison, P.J. Joski, E.R. Becker, Disparities in use of a personal health record in a managed care organization, J. Am. Med. Inform. Assoc. 16 (5) (2009) 683–689.
[50] D.C. Kaelber, A.K. Jha, D. Johnston, B. Middleton, D.W. Bates, A research agenda for personal health records (PHRs), J. Am. Med. Inform. Assoc. 15 (6) (2008) 729–736.
[51] C. Pearce, C.M. Arnold, S. Phillips, K.S. Trumble, K. Dwan, The patient and the computer in the primary care consultation, J. Am. Med. Inform. Assoc. 18 (2) (2011) 138–142.
[52] T. Toscos, C. Daley, L. Heral, R. Doshi, Y. Chen, Impact of electronic personal health record use on engagement and intermediate health outcomes among cardiac patients: a quasi-experimental study, J. Am. Med. Inform. Assoc. 23 (1) (2016) 119–128.
[53] S.R. Reti, H. Feldman, S.E. Ross, C. Safran, Improving personal health records for patient-centered care, J. Am. Med. Inform. Assoc. 17 (2010) 192–195.
[54] S. Collins, D.K. Vawdrey, R. Kukafka, G.J. Kuperman, Policies for patient access to clinical data via PHRs: current state and recommendations, J. Am. Med. Inform. Assoc. (2011) i1–i7.

8

Landsat's past paves the way for data democratization in earth science
Karen Yuan 1, Patrick O'Neil 1,2, Diego Torrejon 1,2
1 COLLEGE OF SCIENCE, GEORGE MASON UNIVERSITY, FAIRFAX, VA, UNITED STATES; 2 BLACKSKY INC., HERNDON, VA, UNITED STATES

We, who use science to pursue our missions, constantly look to new horizons in research and new techniques for gathering, processing, analyzing, evaluating, and understanding the significance of data. We cannot relax these efforts.
William Pecora

Abstract: The Landsat Program has been observing and collecting Earth system data from space since 1972. Over this time, advances in spatial, spectral, and radiometric resolution have allowed the Landsat satellites to evolve into powerful global Earth observation imaging systems. The longevity of the Landsat Program has produced a massive catalogue of imagery data. The sheer magnitude of this collection necessitates the use of scalable techniques to extract value from decades of remote sensing data. With the introduction of commercial cloud computing, it is now possible to cheaply provision resources for processing and storing enormous amounts of such data. Machine learning-based computer vision is advancing at a blistering pace and has shown great promise when applied to satellite imagery. The combination of cloud computing and machine learning offers the ability to analyze the decades of imagery collected by the Landsat Program at a cost never before possible. The focus of this chapter is to discuss the Landsat Program's progression, cloud computing, machine learning, and data policy, and how all these components contribute to a greater understanding of the state of our planet Earth.
Keywords: AWS computing; Deep learning; Landsat; Machine learning.


1. Introduction
As geospatial technology advances, the ability to monitor our ever-changing planet grows. Through programs such as Landsat, a collaboration between the National Aeronautics and Space Administration (NASA) and the United States Geological Survey (USGS), and Sentinel, operated by the European Space Agency (ESA), access to free satellite imagery has never been greater. New commercial sensors are coming online which offer dramatically lower costs for acquiring high-resolution satellite imagery. Advances in machine learning are empowering small groups to process more imagery than ever. The emergence of cloud computing has radically reduced the barrier to entry for big data analytics. In 2008, data from the longest-lasting Earth observing imaging program, Landsat, became freely available to the public. With over 3 petabytes of data already generated, there is a significant demand to improve the efficiency of scientific computing methods for global and regional monitoring [1]. Additionally, with advances in spectral response onboard modern satellites, the shift from image-based to pixel-based analysis [2] will lean heavily on advanced machine learning techniques. This chapter provides an overview of the sequential advancements in information technology, machine learning, remote sensing, and data policy that have led to the unparalleled and increasingly unrestricted access to high quality, low latency geospatial data that we see today.
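As a small taste of the pixel-based analysis mentioned above, the sketch below computes a normalized difference vegetation index (NDVI) from red and near-infrared reflectance arrays with NumPy. The arrays are synthetic stand-ins for Landsat bands (on Landsat 8's OLI, red and NIR correspond to bands 4 and 5), and the 0.4 vegetation threshold is an assumption for illustration.

```python
import numpy as np

# Synthetic reflectance arrays standing in for Landsat red and near-infrared bands.
rng = np.random.default_rng(0)
red = rng.uniform(0.02, 0.25, size=(4, 4))
nir = rng.uniform(0.20, 0.60, size=(4, 4))

# NDVI = (NIR - Red) / (NIR + Red); values near 1 indicate dense vegetation.
ndvi = (nir - red) / (nir + red)

# A simple per-pixel threshold turns the index into a vegetation mask.
vegetation_mask = ndvi > 0.4
print(np.round(ndvi, 2))
print("vegetated pixels:", int(vegetation_mask.sum()), "of", ndvi.size)
```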

2. Landsat overview

The Landsat program, launched in 1972, has been observing the Earth's surface and collecting Earth system variables consistently for over 47 years. Although its technical responsibilities and federal oversight were inconsistent during the early phases of the program [3], the longevity of Landsat and the technological advancements it fostered established the United States as the world leader in land remote sensing technology [4]. Landsat's success lies with the diversity of its applications and its history of successful operational use [5–7]. Additionally, because Landsat shares an orbital plane with other similar missions, the synergy produced by fusing Landsat data with other sensors provides a global, comprehensive view of our planet with increased temporal sampling [8]. As Belward et al. reported in their 2015 study [9], there are over 20 near-polar orbiting civilian imagers with specifications similar to those of the Landsat satellites.


To aid in scientific research, each Landsat satellite maintains a set of remote sensing instruments tailored to collect specific data about our planet. Landsat 1–5 used the Multispectral Scanner System (MSS) as their primary imager. This system used four spectral bands in the visible and near infrared (NIR) region of the electromagnetic spectrum with 80 meter (m) spatial resolution. The Thematic Mapper (TM) was introduced for Landsat 4 and added to its successor mission, Landsat 5. It includes three additional bands, thermal infrared (TIR), mid infrared (MIR), and NIR, with an improved spatial resolution of 30 m. Landsat 7, launched in 1999, holds a multispectral scanner with six bands from the visible to the short wave infrared (SWIR) region of the spectrum. This scanner operates at 30 m spatial resolution, with a 60 m TIR band and a 15 m panchromatic band. Landsat 8, with two radiometers (Operational Land Imager, OLI; Thermal Infrared Sensor, TIRS), was launched in 2013, introducing new observation capabilities to advance land cover and land research applications. Unlike Landsat 8's predecessors, which relied on whiskbroom technology, Landsat 8 uses a pushbroom sensor, which images in a linear array formation and uses no moving parts, reducing the risk of malfunction. This change allowed a longer integration time of the sensor, leading to improved radiometric sensitivity and a higher signal-to-noise ratio (SNR) [10]. Landsat 8's OLI, based on technology from EO-1 ALI,1 has three additional bands (ultra blue, cirrus, and TIR), which open new scientific opportunities for the aerosol/coastal and atmospheric communities. Each band contains 6900 detectors, contributing to the improved SNR [10], a significant technological advancement since prior Landsat imagers in a 705 km orbit had only 16 detectors per band [11]. With greater radiometric sensitivity, refined land cover features can be more readily differentiated [12]. Recent radiometric performance experiments yielded SNR measurements at typical radiance levels that exceed even the program requirements [13]. Landsat 9 is tentatively set to launch in 2020 with two instruments, OLI-2 and TIRS-2. The OLI-2 is a carbon copy of Landsat 8's OLI, while TIRS-2 has been modified to remove a stray light issue associated with TIRS. The improved SNR is significant as imagery analysis shifts from image-based methods to pixel-based methods [14]. Table 8.1 shows the instrument specifications for the Landsat program. The core directive of the Landsat program is to provide global coverage of the Earth's surface. However, the early days of Landsat saw the sensor focused on making local observations due to the technical limitations in processing [14].

1 EO-1 is no longer operational.


Table 8.1 Landsat instrument specifications.

Sensor | Operational period | Altitude (km) | Local mean time | Sensor type | Radiometric resolution | Spatial resolution
L1 MSS | 1972–1978 | 917 | 9:30 a.m. ±15 min | Whiskbroom | 6 bits | 80 m VNIR
L2 MSS | 1975–1982 | 900 | 9:45 a.m. ±15 min | Whiskbroom | 6 bits | 80 m VNIR
L3 MSS | 1978–1983 | 917 | 9:30 a.m. ±15 min | Whiskbroom | 6 bits | 80 m VNIR
L4 MSS, TM | 1982–2000 | 705 | 9:45 a.m. ±15 min | Whiskbroom | 8 bits | 80 m VNIR, 30 m VNIR, 120 m TIR, 30 m IR
L5 MSS, TM | 1984–2013 | 705 | 9:45 a.m. ±15 min | Whiskbroom | 8 bits | 80 m VNIR, 30 m VNIR, 120 m TIR, 30 m IR
L7 ETM+ | 1999–present | 705 | 10:00 a.m. ±15 min | Whiskbroom | 8 bits | 30 m VNIR, 60 m TIR, 30 m MIR, 15 m PAN
L8 OLI, TIRS | 2013–present | 705 | 10:00 a.m. ±15 min | Pushbroom | 12 bits | 30 m VNIR–SWIR, 15 m PAN, 100 m TIRS

MIR, mid infrared; PAN, panchromatic; TIR, thermal infrared; VNIR, visible near infrared. Note: Landsat 6 has been excluded from this table and chapter as it failed to reach orbit in 1993.

As of March 2019, the USGS has accumulated roughly 3000 terabytes of data from the Landsat program (see Table 8.2).

Table 8.2 USGS Landsat Program data status as of March 2019.

Sensor | Scenes/tiles | Total data volume (TB) | Scene size (MB)
L1–5 MSS | 1,328,315 | 65 | 32
L4–5 TM | 2,920,814 | 656 | 263
L7 ETM+ | 2,765,508 | 985 | 487
L8 OLI (TIRS) | 1,466,461 | 1312 | 1813
OLI only | 4,069 | 4.6 | –
TIRS only | 3,516 | 0.1 | –
Total | 8,488,683 | 3022.7 | –

Although data generated by Landsat are spatially, spectrally, radiometrically, and geometrically calibrated [15,16], they require additional processing before analysis. In particular, radiometric, atmospheric, and topographic corrections are often performed [17]. This preprocessing becomes essential when performing time series analysis to ensure that all images are normalized. Historically, researchers who wished to use Landsat had to purchase the data, which placed limitations on the scope of any study. The price ranged from $20 to $200 per scene for MSS, $300 to $400 for TM, and around $600 for ETM+.

For this reason, most change detection analysis was based on bi-temporal Landsat images [18]. However, in 2008, USGS made Landsat data free of cost and available on the Internet [19], opening the use of these data to anyone. In 2017, USGS introduced analysis ready data (ARD) products that require minimal to no spatial, radiometric, geometric, and atmospheric corrections from the user [18] (e.g., orthorectified, atmospherically corrected, and terrain corrected). ARD products are now available for L4 and L5 TM, L7 ETM+, and L8 for the conterminous US, Alaska, and Hawaii (Table 8.3). These products are categorized into three tiers: Tier 1 products are considered the highest quality data [18], followed by Tier 2 and real-time (RT) data products. The Tier 1 data product relies on so-called Level 1 Precision and Terrain-Corrected data. It has been cross-calibrated against other Landsat imagers, and its geodetic error is less than 12 m. This is an important metric because time series analysis requires accurately geo-registered images which overlap in the region of interest. Tier 2 data products include cloud-covered images and lack ground control points, resulting in poor geo-registration. Finally, RT data products are derived from L7 and L8 data and are typically used during emergency management situations since they are available within 12 hours. Other ready-to-use data products include dynamic surface water extent, fractional snow covered area maps, and burned area maps.

Sentinel-2B, also launched in 2017, is managed by the ESA's Copernicus Program [19,20]. Its twin, Sentinel-2A, was launched in 2015 [21]. Similarities between Sentinel 2 and Landsat 8 include the use of a pushbroom sensor with near-identical spectral coverage. The local mean time of image collection is 10:30 a.m., which was selected to align with Landsat 7's collection time, enabling the use of the L7-archived image database to build long-term time series data utilizing both Landsat and Sentinel sensors. Each Sentinel 2 satellite (A and B) provides a 10-day revisit of the Earth's surface, while Landsat 7, 8, and 9 provide a 16-day revisit with an 8-day offset between sensors. This has the potential to provide a 2–4 day revisit in certain areas of the globe, using the combined capabilities of the Sentinel 2 and Landsat satellites.

Table 8.3 USGS Landsat data science products and analysis ready data products status as of March 2019.

Data science product | Tile counts | Total data volume (TB)
Analysis ready data | 1,478,575 | 252.97
Dynamic surface water extent | 1,243,897 | 17.7
Fractional snow covered area | 500,913 | 1.4
Burned area | 701,645 | 2.0
Total | 3,925,030 | 272.67

In the commercial sector, the spatial and temporal resolution offered by Landsat and Sentinel are being surpassed by imaging satellites from the likes of DigitalGlobe, Airbus, Planet, and BlackSky. With a focus on spatial resolution, the WorldView satellites operated by DigitalGlobe and the Pleiades satellites operated by Airbus offer panchromatic resolutions of up to 31 cm [21] and 50 cm [22], respectively. Operating in polar orbits, these constellations offer a couple of revisits per week, achieving a similar revisit rate to Landsat and Sentinel, but at a much higher spatial resolution. Moving further in the temporal resolution domain, an emerging trend sees imaging satellites launched into mid-inclination orbits, offering dramatically higher revisit rates across key latitudes. The BlackSky Global constellation, first launched in 2018, aims to put 30 high-resolution imaging satellites into mid-inclination orbits to achieve previously unattainable intraday revisit rates over critical sections of the globe. When the Global constellation reaches 16 satellites, revisit rates of around eight images per day will become standard. Following the trend of miniaturization common in the satellite industry, BlackSky's constellation consists of microsats (satellites with a wet mass between 10 and 100 kg [23]). As satellites become smaller, their costs are generally reduced due to lower launch costs associated with inserting the satellite into orbit. This drives down the cost of acquiring satellite imagery, expanding the potential research applications of these data. When combined with the broad area coverage offered by Landsat and Sentinel, these satellites offer an unprecedented look at our ever-changing world. They also pose a dilemma: how can we handle the massive amount of data produced every day by all the sensors orbiting the Earth? To take advantage of the data volume, machine learning methods are increasingly being employed for the analysis of satellite imagery.
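To make the combined-revisit arithmetic above concrete, the short sketch below simulates acquisition days for a single mid-latitude site over an 80-day window. The 16-day Landsat cycles and the 8-day Landsat 7/8 offset follow the text; the Sentinel-2 phase offsets are illustrative assumptions rather than actual orbital phasing.

```python
import numpy as np

# Each sensor is modeled as (repeat cycle in days, assumed phase offset in days).
sensors = {
    "Landsat 7":   (16, 0),
    "Landsat 8":   (16, 8),
    "Sentinel-2A": (10, 3),   # offset assumed for illustration only
    "Sentinel-2B": (10, 8),   # offset assumed for illustration only
}

# Days on which at least one sensor images the site during the 80-day window.
passes = sorted({day
                 for cycle, offset in sensors.values()
                 for day in range(offset, 80, cycle)})

gaps = np.diff(passes)
print("acquisition days:", passes)
print("median gap between acquisitions: %.1f days" % np.median(gaps))
```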

3. Machine learning for satellite data

The field of machine learning leverages a collection of analytical approaches to process massive quantities of data, identify hidden patterns, and make accurate predictions based on the data. These approaches range from simple statistics such as maximum likelihood classification (MLC) to more complex models, including random forests (RFs), deep neural networks (NNs), and support vector machines (SVMs). Applications of machine learning include unsupervised clustering, MLC-derived normalized difference vegetation indices, semantic segmentation for land cover classification, and neural network-derived global land cover classification [24–27]. These studies in particular relied on the National Oceanic and Atmospheric Administration's Advanced Very High-Resolution Radiometer (AVHRR) sensor.


However, due to that sensor's limited capabilities, such as its coarse resolution, it was not ideal for regional and local land cover classification studies. Landsat's 30 m spatial resolution is most suitable for land cover classification at regional levels [25]. For example, Landsat imagery has been used to generate large region-level classification maps detailing forest cover [26–29]. When choosing a machine learning algorithm to analyze satellite imagery, considerations include model accuracy, computation speed, robustness to outliers, and generalization to imagery unseen during inference [29]. Many of these considerations are closely linked to the data used for training the machine learning models. A number of studies have experimented with different machine learning algorithms (MLC, NN, RF, SVM) utilizing robust training datasets of varying sizes [30,31]. These studies show that certain machine learning models, such as polynomial regression, can potentially train well with small training sets, whereas other models, such as deep NNs, require significantly larger training sets to yield comparable or superior results. The sparsity characteristics of SVMs yield excellent performance improvements as the training data volume increases. In fact, a recent meta-analysis showed that SVMs yielded the best overall accuracy for pixel-based land cover classification, followed by an NN classifier [32]. In recent years, ensemble methods such as RFs have gained increasing interest due to their tendency to produce models with lower variance and bias. Model variance measures the variability of the model predictions, and model bias measures the difference between the predictions and the correct values. A high-variance model tends to perform poorly on unseen data samples, whereas a high-bias model tends to oversimplify the problem. Studies show that models trained using an ensemble approach are fast and highly accurate [33,34]. Bagging and boosting are the most common ensemble methods in machine learning. Bagging, also known as bootstrap aggregation, takes the average of several models trained on random subsets of the training data. Bagging aims at lowering model variance while slightly increasing model bias as a consequence. The most popular bagging algorithm is the RF classifier, which is an aggregation of decision trees. Decision trees are a popular machine learning model used for classification and regression tasks. The model generates a graphical tree where the internal nodes represent binary splits of the data obtained by optimizing the information gain, and the leaf nodes represent the class labels [35,36]. The biggest advantage of RF classifiers is their interpretability, which is a critical component missing from more complex models such as NNs. On the other hand, boosting sequentially adds "weak" learners to the model to create a strong ensemble model. Boosting aims to lower both model variance and bias, at the cost of an increased risk of overfitting.
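As a concrete illustration of the bagged decision-tree ensemble described above, the minimal sketch below trains a scikit-learn random forest on synthetic per-pixel spectral features. The band values, class labels, and hyperparameters are placeholders rather than settings drawn from any study cited in this chapter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled Landsat pixels: rows are pixels, columns are
# reflectance values in 6 spectral bands; labels are 4 hypothetical land cover classes.
rng = np.random.default_rng(0)
X = rng.random((5000, 6))
y = rng.integers(0, 4, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A random forest: an aggregation of decision trees trained on bootstrap samples.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)

print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
# Feature importances give a rough, interpretable view of which bands drive the splits.
print("band importances:", clf.feature_importances_)
```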


The most common boosting techniques include AdaBoost, gradient boosting, and XGBoost, which differ mainly in the way they generate weak learners. AdaBoost, also known as adaptive boosting, changes the sampling distribution of the training data by modifying instance weights, giving higher importance to incorrectly predicted instances [36,37]. Gradient boosting combines the power of gradient descent with the boosting technique: instead of changing the sampling distribution of the training data, the weak learners are trained on the errors of the current strong learner. Finally, XGBoost, also known as extreme gradient boosting, speeds up training by exploiting the parallel processing of multicore CPUs [37]. This technique is more accurate, scalable, and memory efficient than simple gradient boosting. XGBoost has gained momentum in recent years for land cover classifiers due to its training speed and performance [36–39]. More recently, deep learning has gained tremendous momentum across the remote sensing community [40,41]. Deep NNs, the focus of deep learning, mimic how the human brain operates by utilizing a sequence of computational layers consisting of "neurons," an imitation of biological neurons. These layers are "trained" through a learning process in which each successive layer represents a higher-level feature extractor. Deep learning has excelled in object detection, classification, and regression with state-of-the-art accuracy [42]. Commonly, convolutional NNs are employed when analyzing images. These NN architectures learn convolutional filters which are applied in succession to extract salient features from the image. These features can then be used to perform classification or regression tasks. One significant challenge which arises with any form of machine learning, particularly deep learning, is the need for annotated training data. Acquiring thousands or millions of annotated satellite images is extremely expensive. Oftentimes, researchers will utilize existing datasets and employ a technique known as "transfer learning," where a model trained on one dataset is fine-tuned to work on another dataset by updating the model weights with the new data [43,44]. This reduces the amount of labeled data needed to train a new model because the feature extraction component of the model was already trained using the large, existing dataset. One very popular dataset of annotated satellite imagery is the SpaceNet Challenge dataset [45]. This dataset consists of dozens of WorldView images ranging from 30 to 50 cm resolution, with building and road annotations. By training a model to detect buildings and roads in this imagery and then "transferring" the model to detect another object type, new machine learning models can be developed and deployed at a far lower cost than would be possible when building the model from scratch. Oftentimes, these techniques are used to extract insights from a single image. However, one of the most common uses of satellite imagery is to understand how the world changes.
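To make the transfer learning recipe just described concrete, the sketch below fine-tunes only the classification head of an ImageNet-pretrained ResNet-18 in PyTorch. The five-class land cover problem, the random image batch, and the hyperparameters are illustrative assumptions, not details of any study cited here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pretrained on ImageNet (older torchvision versions use
# models.resnet18(pretrained=True) instead of the weights argument).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor so only the new head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for a hypothetical 5-class land cover problem.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 3-band image chips.
images = torch.randn(8, 3, 224, 224)            # stand-in for satellite image chips
labels = torch.randint(0, num_classes, (8,))    # stand-in for land cover labels
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("training loss after one step:", loss.item())
```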


While many studies rely on single or bi-temporal remote sensing images, new studies [46,47] have expanded to include multitemporal (>10 years) and multisensor imagery analysis. In particular, some of these studies utilized RF models to classify land cover with significant accuracy. In another study, RFs were trained to map biomass using a temporal sequence of Landsat imagery data [48]. Understanding the temporal nature of satellite imagery, and what constitutes a significant change versus an expected change, will become increasingly important as new, high-revisit satellite constellations are deployed.
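As a minimal sketch of the bi-temporal change analysis discussed above, the snippet below compares per-pixel NDVI between two acquisition dates and flags large differences. The random arrays stand in for co-registered red and near-infrared reflectance rasters, and the 0.2 change threshold is an arbitrary illustration, not a recommended value.

```python
import numpy as np

def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index, computed per pixel."""
    return (nir - red) / np.maximum(nir + red, 1e-6)

# Stand-ins for co-registered red/NIR reflectance rasters from two acquisition dates.
rng = np.random.default_rng(1)
red_t0, nir_t0 = rng.random((512, 512)), rng.random((512, 512))
red_t1, nir_t1 = rng.random((512, 512)), rng.random((512, 512))

# Per-pixel NDVI difference between the two dates.
delta = ndvi(red_t1, nir_t1) - ndvi(red_t0, nir_t0)

# Flag pixels whose NDVI change exceeds an arbitrary, illustrative threshold.
changed = np.abs(delta) > 0.2
print("fraction of pixels flagged as changed: %.3f" % changed.mean())
```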

4. Satellite images on the cloud

Traditionally, handling the volume of data produced by the world's imaging satellites would require maintaining a large computing infrastructure. However, the modern approach relies on cloud computing technologies. The concept of cloud computing was developed in the 1960s; however, the term was popularized in 2006 by Google CEO Eric Schmidt at an industry event. In the same year, Amazon launched Amazon Web Services (AWS), a cloud computing platform that provides many products and services, including compute, storage, database, analytics, networking, mobile, developer tools, management tools, IoT, security, and enterprise applications [49]. AWS computing resources are designed to be scalable and offer a pay-for-use model which reduces the cost of operating a cloud system. Some of the services important for scientific computing include Elastic Compute Cloud (EC2) and Simple Storage Service (S3). EC2 offers virtual machines, otherwise known as instances, and S3 offers data storage, both providing near-limitless scalability. Operating on a usage-based pricing model, EC2 pricing starts at $0.096 per hour, while S3 starts at $0.023 per GB per month. Computing, processing, storage, and speed are all necessary components for scientific computing [50] and, within the cloud architecture provided by AWS, are limited only by funding. The pricing models offered by commercial cloud services, such as AWS, obviate the cost of maintaining in-house information technology infrastructure, thereby converting capital expense into operating expense. Additionally, cloud providers offer autoscaling solutions, allowing clients to scale compute resources on demand instead of making predictions about their future compute needs, thereby reducing the risk of over- or under-allocating compute resources [51]. Historically, AWS service pricing has trended downward, increasing its adoption and ushering in a new age of scientific computing built on scalable architectures [52].
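As a minimal sketch of working with Landsat imagery stored on S3, the snippet below uses boto3 to list objects under a public bucket and prefix. The bucket name and key layout are assumptions based on the Landsat archive historically hosted on AWS and may not match the current hosting arrangement.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) S3 client for a publicly readable bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "landsat-pds"        # assumed public Landsat bucket name
prefix = "c1/L8/139/045/"     # assumed Collection 1 path/row key prefix

# List a handful of objects under the prefix; each key points to a scene file.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# A single band file could then be fetched with:
# s3.download_file(bucket, "<key from the listing above>", "band.TIF")
```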


5. Landsat data policy

Monitoring the Earth's surface has been the cornerstone of the Landsat program directive. A landmark data policy milestone was reached in 2008 when Landsat data, including data from its heritage sensors, became available to the public at no cost. Historically, the cost of purchasing imagery limited the scope of any remote sensing study using Landsat data. With the data made available for free, this is no longer a limiting factor. While select data from the Landsat program are made available through a number of data portals, the USGS Earth Resources Observation and Science (EROS) Center and Google Earth Engine (GEE) offer the entire data catalog (not including data from the International Ground Stations) to registered users. Table 8.2 shows Landsat data, as of March 2019, as managed by USGS EROS. Additionally, GEE contains over 8.22 million Landsat Tier 1 and 2 images as of June 2019 (GEE data catalog numbers were generated using a script via the GEE platform on June 26, 2019). The aforementioned data statistics do not include processed and ARD products. Registration to access the data catalog from USGS EROS and GEE is free to the public. Both platforms also provide easy-to-use processing tools to facilitate their use. One such tool, provided by USGS EROS, is the Application for Extracting and Exploring Analysis Ready Samples (AppEEARS). This online tool allows the user to select their region of interest from available Landsat ARD products.

The availability of these free data has great societal impacts, from the international to the regional level. In 2016, the United Nations launched the 2030 Agenda for Sustainable Development featuring the 17 Sustainable Development Goals, a follow-up to the Millennium Development Goals: an ambitious global directive that aims to improve the quality of life around the world while preserving the health of our planet based on science, technology, and innovation [53,54]. The remainder of this section discusses a few Landsat-based studies that were implemented as part of the decision-making process at international, as well as regional, levels.

In 2016, scientists generated global surface water maps (https://global-surface-water.appspot.com/) that span from 1984 to 2015. The developers relied on the entire archive of orthorectified, brightness, and top-of-atmosphere reflectance products of Landsat 5, 7, and 8. They used a combination of expert systems, visual analytics, and evidential reasoning to process the entire collection of Landsat data. Results generated from the Global Surface Water Explorer, a by-product of the study, are used to support the UN Environment and its 193 Member States' sustainable development goal by 2030 (Agenda 2030). The Global Surface Water Explorer study is an apt example of the confluence of the emerging trends discussed in this chapter: applying advanced machine learning algorithms to enormous quantities of data and incorporating the results into an international initiative to improve the state of our planet.

On a regional scale, Al-Bakri et al. [55] mapped crops and identified irrigated areas for three basins located in Jordan. The authors relied on Landsat data with auxiliary data from RapidEye and ground-collected data. This region is of great interest to local and international stakeholders, as a majority of permanent water loss occurs in the Middle East and Central Asia [55]. Jordan's Ministry of Water and Irrigation relied extensively on the study to improve water management, accounting, and auditing. Pringle et al. [56] generated agricultural landscape maps over Queensland, Australia. This study utilized data from 1987 to 2017 using a combination of surface reflectance Landsat data (L5–8) along with data from satellites with similar orbital and spectral specifications, Sentinel-2A and Terra-MODIS. As a result of this study, the government of Queensland funded an online resource that provides seasonal climate and pasture conditions to the Queensland community, with an aim to protect high-value cropping land from nonagricultural development. Torbick et al. [57] used a combination of Landsat 8, PALSAR-2, and Sentinel-1 to map land cover and land use to monitor rice agriculture, an important commodity in Myanmar and South East Asia. Mulianga et al. [58] relied on Landsat 8 and supplementary data to create a Normalized Difference Vegetation Index and a Normalized Difference Water Index over a region of interest in Kenya, which demonstrated great potential for detecting and mapping sugarcane as well as monitoring the condition of these crops. The Kenyan sugar industry has relied on this farm-level information extensively and has used the results to improve planning decisions for the industry's operations. The success of a mission, and the societal benefits it creates, relies on many factors, including the design, manufacture, launch, and operation of the sensor [59]. However, it also depends on data acquisition, accessibility, availability, and continuity, all of which are embodied by the Landsat program.
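Tying these case studies back to the catalog access described earlier in this section, the sketch below shows one way the Landsat archive can be queried through the Earth Engine Python API after free registration. The point of interest, date window, cloud-cover filter, and collection ID are illustrative assumptions and may change as the archive is reprocessed.

```python
import ee

ee.Initialize()  # assumes the account has been registered and authenticated beforehand

# Illustrative region of interest: a point near Fairfax, VA.
roi = ee.Geometry.Point([-77.3, 38.8])

# Landsat 8 Collection 2, Tier 1, Level-2 surface reflectance imagery over the ROI.
collection = (ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
              .filterBounds(roi)
              .filterDate("2019-01-01", "2019-12-31")
              .filter(ee.Filter.lt("CLOUD_COVER", 20)))

print("scenes found:", collection.size().getInfo())

# A median composite gives a simple, largely cloud-free starting point for analysis.
composite = collection.median()
```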

6. Conclusion

The rapid growth in the availability and affordability of remote sensing data has democratized insights into the health and condition of our planet. The confluence of high frequency data acquisition, data availability, and scalable computing has driven the demand for advanced machine learning techniques capable of delivering consistently high accuracy. Furthermore, the continuously increasing rate of data refresh provides a near real-time depiction of our ever-changing planet. Technological evolutions in land surface observations and data policy have established the Landsat program as the baseline in Earth observation data.


This has created the perfect paradigm for machine learning to accurately process petabytes of pixels and provide a robust and continuously updated assessment of the state of our planet. As technology progresses, this trend will continue, empowering researchers from around the world to collect data and gain a better understanding of the ever-changing state of our planet.

References

[1] M.C. Hansen, T.R. Loveland, A review of large area monitoring of land cover change using Landsat data, Remote Sens. Environ. 122 (2012) 66–74.
[2] D.P. Roy, M.A. Wulder, T.R. Loveland, Landsat-8: science and product vision for terrestrial global change research, Remote Sens. Environ. 145 (2014) 154–172.
[3] T. Loveland, J. Dwyer, Landsat: building a strong future, Remote Sens. Environ. 122 (2012) 22–29.
[4] U.S. Land Remote Sensing Act, Ch. 82, 1992. https://www.nasa.gov/offices/ogc/commercial/15uscchap82.html.
[5] L. Blanc, V. Gond, H. Dinh, et al., Remote sensing and measuring deforestation, Land Surf. Rem. Sens. Environ. Risks Book Series (2016) 27–53. Remote Sensing Observations of Continental Surfaces Set.
[6] K. Van der Geest, A. Vrieling, et al., Migration and environment in Ghana: a cross district analysis of human mobility and vegetation dynamics, Environ. Urbanization 22 (1) (2010) 107–123.
[7] J. Yang, P. Gong, R. Fu, The role of satellite remote sensing in climate change studies, Nat. Clim. Chang. 4 (1) (2014) 74.
[8] K. Yuan, K. Thome, J. McCorkel, Radiometric cross-calibration of Terra ASTER and MODIS, Proc. SPIE 9607 (2015) 1–9.
[9] A. Belward, J. Skoien, Who launched what, when, and why: trends in global land cover observation capacity from civilian earth observation satellites, ISPRS J. Photogrammetry Remote Sens. 103 (2015) 115–128.
[10] J.R. Schott, Remote Sensing: The Image Chain Approach, second ed., Oxford University Press, New York, 2007.
[11] B. Markham, D. Helder, Forty-year calibrated record of earth-reflected radiance from Landsat: a review, Remote Sens. Environ. 122 (2012) 30–40.
[12] J. Storey, Landsat image geocorrection and registration, Image Regist. Remote Sens. (2011) 400–414.
[13] P. Coppin, I. Jonckheere, K. Nackaerts, B. Muys, E. Lambin, Digital change detection methods in ecosystem monitoring: a review, Int. J. Remote Sens. 25 (2004) 1565–1596.
[14] C.E. Woodcock, A.A. Allen, M. Anderson, A.S. Belward, R. Bindschadler, W.B. Cohen, Free access to Landsat imagery, Science 320 (2008) 1011.


[15] J.L. Dwyer, D.P. Roy, et al., Analysis ready data: enabling analysis of the Landsat archive, Remote Sens. 10 (9) (2018).
[16] C. Bouzinac, B. Lafrance, L. Pessiot, Sentinel-2 level-1 calibration and validation status from the mission performance centre, in: IEEE International Symposium on Geoscience and Remote Sensing (IGARSS), 2018, pp. 4347–4351.
[17] F. Gascon, C. Bouzinac, et al., Copernicus Sentinel-2A calibration and products validation status, Remote Sens. 9 (2017) 584.
[18] DigitalGlobe, WorldView-3 Data Sheet, June 2019. Available from: https://www.digitalglobe.com/resources#resource-table.
[19] Airbus, Optical and Radar Data, June 2019. https://www.intelligence-airbusds.com/optical-and-radar-data/.
[20] J. Tristancho, J. Gutierrez, Implementation of a Femto-Satellite and a Mini-Launcher, Universitat Politecnica de Catalunya.
[21] T.R. Loveland, A.S. Belward, The IGBP-DIS global 1 km land cover data set, DISCover: first results, Int. J. Remote Sens. 18 (1997) 3289–3295.
[22] V.F. Rodriguez-Galiano, B. Ghimire, J. Rogan, An assessment of the effectiveness of a random forest classifier for land-cover classification, ISPRS J. Photogrammetry Remote Sens. (2012) 93–104.
[23] R.S. DeFries, J.R.G. Townshend, NDVI-derived land cover classification at global scales, Int. J. Remote Sens. (1994) 3567–3586.
[24] S. Gopal, C. Woodcock, A.H. Strahler, Fuzzy ARTMAP classification of global land cover from AVHRR data set, in: Proceedings of the 1996 International Geoscience and Remote Sensing Symposium, May 1996, pp. 538–540.
[25] J. Townshend, C.O. Justice, Selecting the spatial-resolution of satellite sensors required for global monitoring of land transformations, Int. J. Remote Sens. 9 (1988) 187–236.
[26] M.C. Hansen, P.V. Potapov, R. Moore, High resolution global maps of 21st-century forest cover change, Science (2013) 850–853.
[27] J.R. Townshend, J.G. Masek, Global characterization and monitoring of forest cover using Landsat data: opportunities and challenges, Int. J. Digit. Earth (2012) 373–397.
[28] P. Gong, J. Wang, Finer resolution observation and monitoring of global land cover: first mapping results with Landsat TM and ETM+ data, Int. J. Remote Sens. 34 (2013) 2607–2654.
[29] R.S. DeFries, J.C.W. Chan, Multiple criteria for evaluating machine learning algorithms for land cover classification from satellite data, Remote Sens. Environ. 74 (2000) 503–515.
[30] C. Li, J. Wang, L. Wang, Comparison of classification algorithms and training sample sizes in urban land classification with Landsat Thematic Mapper imagery, Remote Sens. (2014) 964–983.
[31] M. Pal, P.M. Mather, Support vector machines for classification in remote sensing, Int. J. Remote Sens. (2005) 1007–1011.


[32] R. Khatami, G. Mountrakis, S.V. Stehman, A meta-analysis of remote sensing research on supervised pixel-based land-cover image classification processes: general guidelines for practitioners and future research, Remote Sens. Environ. (2016) 89–100.
[33] M. Belgiu, L. Dragut, Random forest in remote sensing: a review of applications and future directions, ISPRS J. Photogrammetry Remote Sens. (2016) 24–31.
[34] M. Pal, Random forest classifier for remote sensing classification, Int. J. Remote Sens. 26 (2005) 217–222.
[35] C. Roland, An evaluation of different training sample allocation schemes for discrete and continuous land cover classification using decision tree-based algorithms, Remote Sens. 7 (2015) 9655–9681.
[36] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.
[37] T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 785–794.
[38] C.D. Man, Improvement of land-cover classification over frequently cloud-covered areas using Landsat 8 time-series composites and an ensemble of supervised classifiers, Int. J. Remote Sens. (2018) 1243–1255.
[39] M.D. Chuc, et al., Paddy rice mapping in Red River Delta region using Landsat 8 images: preliminary results, in: 2017 9th International Conference on Knowledge and Systems Engineering (KSE), 2017.
[40] Y. Xu, A co-training approach to the classification of local climate zones with multi-source data, in: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2017.
[41] W. Zhao, S. Du, Learning multiscale and deep representations for classifying remotely sensed imagery, ISPRS J. Photogrammetry Remote Sens. 113 (2016) 155–165.
[42] N. Kussul, M. Lavreniuk, S. Skakun, Deep learning classification of land cover and crop types using remote sensing data, IEEE Geosci. Remote Sens. Lett. (2017) 778–782.
[43] M.Z. Alom, T.M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M.S. Nasrin, M. Hasan, B.C. Van Essen, A.A.S. Awwal, V.K. Asari, A state-of-the-art survey on deep learning theory and architectures, Electronics 292 (2019).
[44] M. Zou, Y. Zhong, Transfer learning for classification of optical satellite image, Sens. Imaging (2018). https://doi.org/10.1007/s11220-018-0191-1.
[45] SpaceNet on Amazon Web Services (AWS), Datasets, the SpaceNet Catalog. Last modified April 30, 2018. https://spacenetchallenge.github.io/datasets.
[46] J. Haas, Y. Ban, Urban growth and environmental impacts in Jing-Jin-Ji, the Yangtze River Delta and the Pearl River Delta, Int. J. Appl. Earth Obs. Geoinf. (2014) 42–55.


[47] N. Tsutsumida, A. Comber, Measures of spatio-temporal accuracy for time series land cover data, Int. J. Appl. Earth Obs. Geoinf. 41 (2015) 46–55.
[48] R. Frazier, N.C. Coops, M. Wulder, Characterization of aboveground biomass in an unmanaged boreal forest using Landsat temporal segmentation metrics, ISPRS J. Photogrammetry Remote Sens. 92 (2014) 137–146.
[49] https://docs.aws.amazon.com/aws-technical-content/latest/aws-overview/aws-overview.pdf, March 2019.
[50] A. Iosup, S. Ostermann, M. Yigitbasi, Performance analysis of cloud computing services for many-tasks scientific computing, IEEE Trans. Parallel Distrib. Syst. 22 (6) (2011) 931–945.
[51] P. Yue, H. Zhou, J. Gong, Geoprocessing in cloud computing platforms – a comparative analysis, Int. J. Digit. Earth 6 (4) (2013) 404–425.
[52] https://aws.amazon.com/blogs/aws/category/price-reduction/, March 2019.
[53] https://www.un.org/sustainabledevelopment/development-agenda/, June 2019.
[54] W. Colglazier, Sustainable development: 2030, Science 349 (2015) 1048–1050.
[55] J. Al-Bakri, S. Shawash, Al Ghanim, Geospatial techniques for improving water management in Jordan, Water (2016).
[56] M. Pringle, M. Schmidt, D. Tindall, Multi-decade, multi-sensor time-series modelling, based on geostatistical concepts, to predict broad groups of crops, Remote Sens. Environ. (2018) 183–200.
[57] N. Torbick, D. Chowdhury, W. Salas, Monitoring rice agriculture across Myanmar using time series Sentinel-1 assisted by Landsat-8 and PALSAR-2, Remote Sens. 9 (2) (2017).
[58] B. Mulianga, A. Begue, P. Clouvel, Mapping cropping practices of a sugarcane-based cropping system in Kenya using remote sensing, Remote Sens. (2015) 14428–14444.
[59] D. Albright, S. Burkhard, A. Lach, Commercial satellite imagery analysis for countering nuclear proliferation, Annu. Rev. Earth Planet Sci. 46 (2018).

9

Data democracy for psychology: how do people use contextual data to solve problems and why is that important for AI systems?

Debra Hollister

Valencia College – Lake Nona Campus, Orlando, FL, United States

"Reality is not a function of the event as event, but of the relationship of that event to past, and future events."
Robert Penn Warren

Abstract: Psychology is a discipline with a long past, but short history. Psychological influences are found in every discipline from fine art to abstract mathematics because psychological forces are at play within the human behavior that necessitated the inception of those disciplines. Likewise, most disciplines' understanding of psychology is fundamentally similar, though the language used to describe the same psychological considerations may vary between an engineer and a writer. In psychology, we look at how individuals develop a pattern of behavior over the different developmental stages and then scaffold in new information to fit into the evolving schema over time. This chapter takes a similar approach to investigating why and how contextual understanding is important in developing more efficient artificial intelligence.

Keywords: Artificial intelligence; Cognitive psychology; Computer programs; Context.

1. Introduction and motivation

Can the cognitive sciences provide some directions and answers to the design of ambient intelligence systems within a data democracy? The use of contextual data in such a system would allow a computing device to build incremental, dynamic models of its various environments and solve issues as they occur using the contexts in the environment. This work will investigate this issue and try to provide information on the way that successful working cognitions use context. This work further investigates how context use can shape cognition in humans, animals, and artificial intelligence (AI). Context is important in humans to process stimuli and information that have been previously encountered, maintain that information while using the brain's intellectual ability to evaluate a situation, and detect a potential solution to the newly presented problem. The various definitions of context will also be evaluated and explained. Information should be available to all individuals who want to, or are required to, evaluate the various forms of collected data that allow information to be readily available to resolve potentially emerging issues and problems. It is important to make sure these data are beneficial and that any harm from these data is minimized for the public [1,2].

There are many reasons that ambient intelligence is not a common application. One of the major reasons is the difficulty of implementing reasoning capabilities that enable the environment to respond to continually changing situations in the immediate vicinity by using context. Because of this major lapse in immediate responsiveness, research efforts should be concentrated on overcoming the limits of the cognitive and understanding capabilities that have been presented thus far in research. In reality, there are some systems that need to be built and maintained to disperse information in more specific rather than generalized areas. However, there remain many areas where a diverse and extensive knowledge base is needed to support a variety of problems. For example, when one has questions regarding the warning lights that are triggered in a car, it is necessary to know the actions and steps that must be taken to keep the car functioning and on track to its destination safely. For this type of scenario, a few researchers have been working to develop programs that are able to take information and apply previously learned knowledge to new problems. These systems should be built on the premise that an application should provide the ability to both manage domain information and generate appropriate responses depending on the information that is available in the environment. This also implies that an interpretation of past events, or context, has to be accomplished by the intelligence system. Cognitive abilities that are enhanced by contextual reasoning enable people to identify and recognize situations that are familiar as well as those that are new. Through the use of contextual data, contextual reasoning is an essential element in assisting users in understanding new situations in their environment and allowing them to assess and react to those situations by narrowing the focus of the information that is presented. There are many different sciences that deal with cognition from diverse areas of study, including fields such as machine learning, philosophy, medicine, psychology, neuroscience, computer science, and computer engineering, as well as AI.


The difference that is most often noted is that while natural intelligence refers to the intelligence displayed by both animals and humans in their everyday behaviors and actions, AI is programmed into machines by humans. Within all of these fields, there exist many differing perspectives and theories as to how effective cognition can be accomplished. However, on analysis, there are unifying themes which emerge and point to necessary ingredients for a successful view of any cognitive operation. One such theme is the necessity of understanding context in a situation where there is a problem that needs to be solved. Within the field of AI, Bazire and Brézillon [3] analyzed a corpus of 166 definitions of context found in a number of domains and concluded that context can be derived from anything that is significant in a given moment, potentially including the environment, an item in that environment, a user or a potential user, or even an observer of the situation.

2. Understanding context

One of the earliest known philosophers in this area of cognitive development was John Locke. Locke, a philosopher born in 1632, was a major influence on the field of psychology and the development of behaviorism through his views on how people matured and developed. He was also an Oxford academic and medical researcher whose "An Essay Concerning Human Understanding" is one of the first great defenses of the application of observation, targets the limits of human understanding, and was considered influential in the development of Western psychology. One area in this field that is investigated in great detail rests on what one can legitimately claim to know, based on how past experiences and knowledge affect present understanding and develop the tools to sustain us through life. His ultimate suggestion was that we are all born with the building blocks to become who we are. Locke's premise was that as we go through life and experience what it has to offer, we form the necessary tools to survive and become individuals. In the Essay Concerning Human Understanding, Locke alluded to ideas on how humans understand the world around them. In the four books of the Essay, Locke considers the sources and nature of human knowledge. In the first book, Locke argued that humans have no innate knowledge at birth, so basically the human mind is like a blank slate, ready to be written on as experiences occur. In the second book, Locke claimed that all ideas come from the experiences that one has. The term "idea," Locke tells us, stands for whatsoever is the Object of the Understanding, when a man thinks [Essay I, 1, 8, p. 47]. He felt that all experience comes from either sensation or reflection.


Sensation explains the things and processes in the external world, while reflection tells us about the operations of our own minds. As we reflect on our internal state, we become conscious of the mental processes we are engaged in. Some ideas we get only from environmental sensations, some only from reflection, and some from both. Knowledge was defined as the perception of the connection and agreement or disagreement of ideas [IV. I. 1. p. 525]. Locke's ideas on knowledge acquisition led him to conclude that we could become trapped in our ideas by trying to define what we know based on what we see, touch, or hear.

3. Cognitive psychology and context

Jean Piaget began by conceptualizing one broad aim for education and learning: that of developing autonomy. Piaget hoped that individuals would be able to learn the ability to self-govern, not only along their academic track but along their moral track as well. The goal was for them to be able to think for themselves and not have to be told what it was they were to do. It is recognized by many that the ideal state would be for children to develop autonomy by utilizing previously learned situations to help solve problems in new situations. Piaget felt that it would be better for children and their development to do away with rewards and punishments and instead exchange points of view with children to lead them on a path toward autonomy. Although Piaget was wise enough to see that there were occasions when punishment would need to be used, these sanctions (rewards and punishments) could be used to help the child learn what was and was not acceptable behavior in their social environment based on the consequences of previous experiences. Piaget argued that cognitive conflict produced by discrepancies between existing mental schemata and perceived events would motivate changes in thinking. The social cognitive approach finds the imparting of information by social agents, in the form of guided instruction and modeling, to be very important in creating changes in behavior through contextual means. Children from an early age construct theories or schemas to help them understand the culture and environment where they are living [3,4]. As they interact with their world and learn how to use context to solve new problems, these theories undergo changes based on the information that they have acquired [4]. However, most of this cognitive revision does not occur under the recognition or control of the child. This actually does not occur until much later in life, somewhere in middle childhood to adolescence [5]. Another consideration regarding this acquisition of context is the demands placed on an individual by the culture that the individual has been exposed to throughout their life. What is acceptable in one culture (context) may not be acceptable in another.


There are many different institutions in a society that range from the formal to the informal, and each rule system has different assigned rules and roles. Aspects of functioning within these different institutions constitute social, cultural, and psychological processes and context. It must be taken into consideration that every person is an active agent in their environment rather than a passive agent in their culture. The actions and reactions produced by an individual or group of individuals form a dialectical and dialogic process, a two-way street that is mediated and negotiated through some type of predetermined language. In psychology, learning is considered to be a relatively permanent change in behavior brought about by practice or experience. The real questions arise when we begin to consider when learning begins and how it happens. This question plagues educators and researchers in many disciplines. Because it is a multidisciplinary question, it has been answered in many different and confusing terms depending on the discipline and what it considers an example of learning. What is obviously a result of learning is that there is a progression of acquired behaviors that leads to a conditioning of reflexes, which eventually leads to a control of behaviors through assimilations that become incorporated in a behavior pattern. This behavior pattern would involve those behaviors that would be considered acceptable in a particular culture or environment. What is known is that after an individual begins to recognize their particular preferences and standards, they tend to select other individuals and activities that share the same standards and preferences, which further reinforces their performances and environments [6–9]. An important effect is that people do not just passively absorb standards of behavior from whatever influences they experience. If the individual does not like the behavior standard, they will disregard it. This provides for more regularity in their behavior and maintains the performance of preferred behaviors [10,11]. One can begin to ascertain that the environment in which one resides is not separate from the situations that one encounters and helps to change the direction that lives can take [12,13]. The cerebral system and the sensory and motor systems allow us to give meaning and direction to our lives [10,14]. Predicting what our preferences and patterns are or will be is often difficult because of the numerous options that one is often challenged by. As individuals, we frequently base our activities on our interests, and these interests can change based on experiences and interactions with others. This ongoing development involves not only physical but emotional and cognitive development as well. This is important because it implies that our cognitive development is undergoing changes that allow one to perceive changes in the environment that one is able to manipulate.


These important goals of learning about a sense of self, relationships to others, and making sense of the world could be reached by every child if every classroom encouraged children to make their own classroom rules and then make decisions regarding appropriate and acceptable behaviors; by helping to foster intrinsic motivation; and by encouraging children to exchange viewpoints. This would encourage more agentic action and lead to further self-reactions and forethought for children learning to control their own environments. We do know that human behavior is partly governed by its own preferences and standards [10]. Through this internal source of direction, people control their lives and derive satisfaction from what they do. Because of these ideas, they partly govern the extent to which social encounters may shape the course of personal development. Individuals and organisms develop and evolve without knowing what the future holds; they are simply making adaptations to the present environment in which they live to make sure that they can sustain life. Natural selection is not a predesigned process capable of foreseeing the future and making adaptations based on what is going to happen. Our evolved mechanisms were constructed and adjusted in response to the statistical composite of situations actually encountered by our species during its evolutionary history [15,16]. These mechanisms were not designed to deal with the everyday circumstances that are commonplace. However, they have not been designed to solve all potential problems under all possible circumstances either, because the human species did not encounter all problems under all circumstances. Socially constructed situations differ markedly, so no single mode of social adaptation fits all situations. Gould makes an interesting point in that biological determinism is often clothed in the language of interactionism to make it more acceptable [17].

4. The importance of understanding linguistic acquisitions in intelligence

Linguistic philosophy helps to point to contextual involvement in linguistic development and comprehension. Nye [18] points out that meanings are, necessarily, shared among speakers of a certain language; however, the ideas behind the meanings are not necessarily the same among all of the speakers of that language [18]. Because of this difference of interpretation, language is a mix of the public meaning of words and the private idea that gives those words meaning. The question then becomes how the public meaning of language can adequately represent the private meaning, and the answer lies with the context of the word and where and how it was learned.


As an example of linguistic acquisition and understanding between man and machine, Eliza was a program written in the mid-1960s to assist in the communication between man and machine [19]. With this program, the user inputs a sentence or statement in natural language; the program then analyzes the statement and generates a response. For some researchers, this was the door they had been waiting for: a way for communication to flow from a machine to a human. The door was now opened wider, the chatbot was now a reality, and new ideas could be developed to allow more people to communicate with machines more efficiently. Using linguistic input, one can now use a voice to find map directions to a particular location, ask a phone a random question, determine if surgery utilizing the da Vinci surgical robot is going as planned, and use a computer to translate a sentence or phrase while enjoying the sights of an unfamiliar place.
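As a rough illustration of the pattern-and-response mechanism described above, the sketch below implements a tiny ELIZA-style responder. The rules shown are invented for this example and are far simpler than the script used by the original program.

```python
import random
import re

# A handful of invented pattern/response rules; the original ELIZA script was far richer.
RULES = [
    (r"i need (.*)",  ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (r"i am (.*)",    ["Why do you say you are {0}?", "How long have you been {0}?"]),
    (r"because (.*)", ["Is that the real reason?", "What other reasons come to mind?"]),
    (r"(.*)\?",       ["Why do you ask that?", "What do you think?"]),
]

def respond(sentence: str) -> str:
    """Return a canned response based on the first rule whose pattern matches."""
    text = sentence.lower().strip()
    for pattern, responses in RULES:
        match = re.match(pattern, text)
        if match:
            return random.choice(responses).format(*match.groups())
    return "Please tell me more."

print(respond("I need a vacation"))
print(respond("I am tired of debugging"))
```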

5. Context and data, how important?

A particular pattern of commands consistently issued by a user can be recognized from data that are continuously monitored by the system. For example, in many new vehicles, there are generally settings for steering wheel placement, mirror placement, and seat placement. In this instance, the requirements of the user (driver) may be based on height and weight. Most car manufacturers realize that, within a family unit, different drivers require different seat, mirror, and steering wheel adjustments, and each key is synched to a particular setting. As soon as that key is inserted into the ignition, those adjustments occur automatically simply because of the data that are linked to that key. It is important to note that as humans grow and develop, they learn through practice and/or experience, and this increases their knowledge and abilities. A human does not always have to have "hands-on" practical involvement to develop a hypothesis of how an incident will evolve but can theorize an outcome based on the past context of similar experiences. This predictive ability is already utilized in many practical instances, not just in vehicles but in other computer-initiated, data-driven programs. For example, one's social media is closely aligned with advertising of previously researched items, including but not limited to government services, personal purchases, and information regarding news items [20]. The data that are regularly extracted from surveys, studies, evaluations, and reviews can all be compiled to ensure that the user is guaranteed to view sources that contain interesting material that may lead to a broader dispersal of information. This customized targeting of items may enhance performance in many areas including education, industry, and business. One issue with this personalized data delivery system is that it is designed to make agentic actions more streamlined, which may lead to less exploration of other options that may be available.

6. Neuroscience and contextual understanding

Much of the research on brain development has focused on the influential role that agentic action plays in shaping the brain and its neuronal and functional structure. Many researchers felt that it was not just exposure to stimulation that made the changes, but agentic action in exploring, manipulating, and influencing the environment that led to changes in brain behavior [21,22]. Agentic action can be defined as behavior that is performed with intentionality, forethought, self-reactions, and self-reflection [23]. Agentic factors that also have explanatory, predictive value may be translatable and modeled [24,25]. Research on brain development emphasizes how much influence agentic action has in shaping the development of the neuronal and functional structure of the brain [21,22]. Stimulation is important in brain development, but the agentic action in exploring, manipulating, and influencing the environment is what really counts in developing brain function. Because each individual helps to regulate their environment through their activities, they are responsible for producing the experiences that form the neurobiological foundation of symbolic, social, psychomotor, cognitive, and other skills. The nature of these experiences is dependent on the types of social and physical environments people select and construct. An agentic perspective can and does promote research that will provide new insights into the social construction of the functional structure of the human brain [26]. Individuals must select which behaviors modeled by others are important to integrate into their own behavioral systems. Bandura felt that the competence of the model performing a behavior would help the individual determine if it was behavior that should be integrated into their own behavioral schema [10]. This self-efficacy is a major determinant of self-regulation and has been a central focus of Bandura's research since the late 1970s. Bandura felt that cognitions would change over time as a function of maturation and experience. The social cognitive approach finds the source of change in maturation, exploratory experiences, and, most important, the imparting of information by social agents in the form of guided instruction and modeling from the "models" in the environment. Information processing concerns the relationship between encoding and retrieval of material that has been read or heard. These two memory processes greatly influence our cognitive behaviors, including perception, attention, learning, and cognition. This formation of associations must take into account memory storage and transference from one type of memory system to another [27].
take into account memory storage and the transference from one type of memory system to another [27]. Contrary to popular belief, the associations that are often called conditioned reactions are largely self-activated (agentic) on the basis of learned expectations rather than being automatically evoked [28]. The critical factor, therefore, is not that events occur together in time, but that people have learned to predict events and to summon appropriate anticipatory reactions because of learned memories [28]. While attention to stimuli is important for retention of material, do the phonological properties of words and concepts, or the visual aspects of envisioned pictures, make a difference in our retention or acceptance of the material? Anne Treisman [29] found that participants in a dichotic listening task who were presented with two passages simultaneously through headphones were unable to analyze the material played to the unattended ear, as long as the attended material "passed the test" on physical characteristics such as loudness, brightness, and pitch [29]. The unattended material was being analyzed, but the analyses disappeared very quickly. The facilitation of student learning was one of the major incentives behind the development and implementation of computer-assisted and computer-enhanced instruction, which has led to the construction of many intelligently designed systems for use in instruction. Research has shown that the brain's left hemisphere is better at processing auditory information [30–32]: listeners identify speech sounds presented to the right ear (left hemisphere) more accurately than those presented to the left ear (right hemisphere). The left hemisphere has also been found to be dominant for the recognition of written letters and words [33–35]. Because written language involves the detection and interpretation of shape configurations, which are thought to be processed primarily in the right hemisphere, this was an interesting research finding. For the deaf and hearing impaired, the spatial aspects of visual speech and the recognition of signs have also been shown to favor left hemisphere dominance [36]. Many words in the English language are homophones, words that sound alike but carry different meanings. For example, the car brakes can stop a car from moving, but I break an egg to make scrambled eggs for breakfast. Most individuals understand the meaning of a sentence because of the context provided by the surrounding words. Another resolution model, Centering Theory, was proposed by Barbara Grosz of Harvard University and Aravind Joshi and Scott Weinstein of the University of Pennsylvania [37]. Centering Theory predicts that individuals use pronouns to refer to the most salient character in the previous sentence or paragraph, typically its subject. However, Jennifer Arnold estimated that only 64% of subject pronouns refer to the previous subject.
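Centering-style pronoun resolution can be made concrete with a small illustration. The sketch below is not the Grosz–Joshi–Weinstein algorithm; it simply links each pronoun to the subject of the preceding sentence, the preference that Arnold's 64% estimate qualifies. The example sentences and the crude capitalized-word subject heuristic are invented for the example.

```python
# Naive, Centering-inspired pronoun resolution: prefer the previous
# sentence's subject as the antecedent. Illustrative only; the sentences
# and the "first capitalized token" subject heuristic are assumptions.

PRONOUNS = {"he", "she", "it", "they"}

def naive_subject(sentence: str) -> str:
    """Guess the subject as the first capitalized token (toy heuristic)."""
    for token in sentence.split():
        if token[0].isupper():
            return token.strip(".,")
    return sentence.split()[0]

def resolve_pronouns(sentences):
    """Pair each pronoun with the subject of the preceding sentence."""
    links, prev_subject = [], None
    for sent in sentences:
        for token in sent.lower().split():
            if token.strip(".,") in PRONOUNS and prev_subject:
                links.append((token.strip(".,"), prev_subject))
        prev_subject = naive_subject(sent)
    return links

if __name__ == "__main__":
    text = ["Maria handed the report to the committee.",
            "She expected a quick reply."]
    print(resolve_pronouns(text))   # [('she', 'Maria')]
```

Even this naive rule resolves the example correctly, which is roughly why subject-preference heuristics work as often as they do.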


When an individual talks, facial movements naturally accompany the production of speech sounds. As infants begin to recognize and produce meaningful sounds, they use these sources of speech information early in life to help them understand the meaning of speech [38,39]. The variety of visual cues produced by the speaker, from the face, tongue, and lip movements, helps the infant determine the meaning of words and enhances the intelligibility of auditory speech [40,41]. In a study by Cohen and Massaro, computer-animated speech, rather than still pictures or a natural talking face, was used to assess comprehension of language [42,43]. The animation provided the dynamic aspect of visual speech that is missing in still pictures, and facial movements were more easily controlled when using animation. Other researchers have shown evidence that individuals display some asymmetry in their articulations while talking [44,45]: individuals tend to open the right side of their mouths faster and wider than the left side. The explanation is that the left hemisphere exercises more control over speech production than the right hemisphere, and therefore more of the right side of the face moves during speech production. Thus, a real face might show a perceptual advantage when presented to the right visual field (processed in the left hemisphere) because there is more information on the talker's right side. If the face is presented to the right of the visual fixation, its right side is closer to the central visual field; if the face is presented to the left of the visual fixation, its right side is farther from the central visual field. The more informative part of the face is therefore closer to the central visual field when the face is presented in the right visual field than in the left visual field. The animated face was used because it could be made more symmetrical, which precluded this potential confounding variable. Another factor that needs to be considered is understanding and reading the emotions associated with facial expressions. In general, emotions are accompanied by a component of motor expression which helps communicate the sender's affective reaction through signs in the face, voice, gestures, and bodily posture; this component is recognized by researchers in the psychological sciences [46]. The effect of emotional expression on emotion recognition is also documented by numerous studies of vocal and facial cues, which demonstrate that an observer is more likely to identify an emotional state accurately when such cues are present [47–49]. In a study by Karim N'Diaye, David Sander, and Patrik Vuilleumier, building on Ekman and Friesen's suggestion that facial expressions convey emotional communication, it was noted that each emotion has a prototypical facial expression [50,51]. If microexpressions could be captured, one could "read" the emotion being conveyed by a person, no matter how discreet their
expressions might be. This is the area of interest researched by N'Diaye, Sander, and Vuilleumier [50]. Appraisal theory predicted that gaze direction would influence the emotion perceived in a face and that this influence would be relevant to the observer's own needs, goals, values, or well-being [52]. Bente et al. [53] demonstrated that avatars are very likely to receive the same person–perception processing as a videotaped individual. Moreover, avatars showing socially interactive expressions such as smiling or eye movements may activate the same brain regions as those triggered by human–human interaction [54]. Many studies have examined the attributes given to different agents and the effect these attributes have on the individuals interacting with them. Several researchers found that nonverbal behavior influenced how an avatar affected an individual interacting with it [55]. In 2008, Baylor and Kim demonstrated that the effects of an avatar's nonverbal behavior depend on the task [56]: the facial expression of avatars contributed to attitudinal learning by the subject but was detrimental to procedural learning by that same subject.

7. Context and artificial intelligence

It is clear that human and artificial cognition require the integration of contextual information to operate as efficiently and effectively as possible. The following paragraphs describe some contextually designed programs that are being refined and that will assist with the goals of a fully functional AI. Context-Based Reasoning, or CxBR, has been used successfully to represent tactical knowledge in simulated as well as physical agents [57]. CxBR dissects an agent's behavior into a context and subcontexts, with each context containing the behavioral knowledge relevant to that context. Each context also contains the environmental conditions that must hold for that context to be in control of the agent. As the situation evolves during a tactical event, another context may become more applicable than the currently active one, and the system then transitions control to the context that better addresses the current situation. GenCL, or Genetic Context Learning, combines CxBR with genetic programming. This architecture was developed by Fernlund and colleagues [58] and strives to incorporate the two concepts of a tactical contextual map and state-dependent learning; successive generations become increasingly competent at the task presented to the system [58]. Although this system appears to function with more variety, all of the contexts must be defined a priori: according to Fernlund et al., new contexts cannot be learned.
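To make the CxBR idea more tangible, the following minimal sketch shows a context-driven control loop. It is an illustration in the spirit of [57], not the authors' formalism; the context names, activation conditions, and driving scenario are invented.

```python
# A minimal Context-Based Reasoning (CxBR) style control loop (illustrative).
# Each context bundles an activation test and a behavior; the active context
# controls the agent until another context's activation condition holds.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Context:
    name: str
    is_applicable: Callable[[dict], bool]   # environmental condition
    behavior: Callable[[dict], str]         # action taken while in control

contexts = [
    Context("emergency_stop", lambda s: s["obstacle_m"] < 5,
            lambda s: "brake hard"),
    Context("urban_driving", lambda s: s["speed_limit"] <= 50,
            lambda s: "drive at %d km/h" % s["speed_limit"]),
    Context("highway_driving", lambda s: True,            # default context
            lambda s: "cruise at 100 km/h"),
]

def control_step(situation, active):
    """Transition to the first applicable context, then act."""
    for ctx in contexts:
        if ctx.is_applicable(situation):
            active = ctx
            break
    return active, active.behavior(situation)

active = contexts[-1]
for situation in [{"obstacle_m": 80, "speed_limit": 100},
                  {"obstacle_m": 60, "speed_limit": 50},
                  {"obstacle_m": 3,  "speed_limit": 50}]:
    active, action = control_step(situation, active)
    print(active.name, "->", action)
```

The key design point is that behavior lives inside contexts; the loop only decides which context should be in control at any moment.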


Turner's Context-Mediated Behaviors, or CMB, is somewhat similar to CxBR. However, CxBR requires that the transitions between contexts be explicitly defined, whereas in the program developed by Turner every context is reviewed and analyzed for each situation: in CMB, all contexts are checked to find the appropriate context to transition to in order to fulfill the objective. Additionally, CMB allows the merging of contexts when a single context cannot successfully address the situation by itself. Contextual Graphs, or CxG [59,60], address decision-making through context. Context is represented at a progressively more developed level so that the situation can be identified more clearly. The decision-making process is broken down into simple questions and actions, making the process more efficient by responding to situations appropriately and rapidly.
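A toy rendering of the contextual-graph idea is given below. The node structure and the help-desk-style scenario are assumptions made for the example; this is not Brezillon's formal CxG model [59,60].

```python
# Toy contextual graph: interior nodes ask a contextual question and branch
# on the answer; leaves are actions. Illustrative only.

graph = {
    "start": ("is the user authenticated?", {"yes": "check_load", "no": "login"}),
    "check_load": ("is system load high?", {"yes": "queue_request", "no": "serve"}),
    "login": "prompt for credentials",          # action leaf
    "queue_request": "defer and notify user",   # action leaf
    "serve": "handle the request now",          # action leaf
}

def traverse(node, context):
    """Walk the graph, answering each contextual question from `context`."""
    while isinstance(graph[node], tuple):
        question, branches = graph[node]
        node = branches[context[question]]
    return graph[node]

ctx = {"is the user authenticated?": "yes", "is system load high?": "no"}
print(traverse("start", ctx))   # handle the request now
```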

8. Conclusion

Researchers must consider cognitive factors when trying to predict human behavior and when designing behavioral interventions. In a world of challenges and hazards, people have to make good judgments about their abilities, anticipate the probable effects of different events and courses of action, size up socio-structural opportunities and constraints, and regulate their behavior accordingly in unfamiliar situations. Another aspect that should be considered is appearance, though one has to realize that what one person finds pleasant looking, another person may find unpleasant. This effect also appears to have a strong influence on learning [61]. It stands to reason that researchers who are interested in imbuing intelligently designed systems with human-like processing should also consider context.

References

[1] S.A. Gelman, H.M. Wellman, Cognition, language, and perception, in: Handbook of Child Psychology, 5th ed., vol. 2, 1998.
[2] M. Raisinghani, A. Benoit, J. Ding, M. Gomez, K. Gupta, V. Gusila, D. Power, O. Schmedding, Ambient intelligence: changing forms of human-computer interaction and their social implications, J. Digit. Inf. 5 (4) (2006).
[3] M. Bazire, P. Brezillon, Understanding Context Before Using It, Springer, Berlin Heidelberg, 2005.
[4] W. Damon, D. Kuhn, R. Siegler, Handbook of Child Psychology, vol. 5, Wiley, 2000.
[5] D. Kuhn, Children and adults as intuitive scientists, Psychol. Rev. 96 (1989) 674–689.
[6] A. Bandura, R.H. Walters, Adolescent Aggression, Ronald Press, New York, 1959.


[7] D. Bullock, L. Merrill, The Impact of Personal Preference on Consistency Through Time: The Case of Childhood Aggression, 1980.
[8] R. Elkin, W. Westley, The myth of adolescent culture, Am. Sociol. Rev. 20 (1955) 680–684.
[9] W. Mischel, Personality and Assessment, Wiley, New York, 1968.
[10] A. Bandura, Social Learning Theory, Prentice Hall, Englewood Cliffs, NJ, 1977.
[11] H.L. Raush, W.A. Barry, R.K. Hertel, M.A. Swain, Communication, Conflict, and Marriage, Jossey-Bass, San Francisco, CA, 1974.
[12] O.G. Brim Jr., C.D. Ryff, On Properties of Life Events, vol. 3, Academic Press, New York, NY, 1980.
[13] D.F. Hultsch, J.K. Plemons, Life Events and Life Span Development, vol. 2, Academic Press, New York, NY, 1979.
[14] R. Harre, G. Gillet, The Discursive Mind, Sage Publications, Thousand Oaks, CA, 1994.
[15] D. Symons, On the Use and Misuse of Darwinism in the Study of Human Behavior, Oxford University Press, New York, NY, 1992.
[16] J. Tooby, L. Cosmides, On the universality of human nature and the uniqueness of the individual: the role of genetics and adaptation, J. Personal. 58 (1990) 17–67.
[17] S.J. Gould, An Urchin in the Storm, Norton, New York, NY, 1987.
[18] A. Nye, Philosophy of Language: The Big Question, Blackwell, New York, 1998.
[19] J. Weizenbaum, ELIZA – a computer program for the study of natural language communication between man and machine, Commun. ACM 9 (1) (1966) 36–45.
[20] C.G. Reddick, A.T. Chatfield, A. Ojo, A social media text analytics framework for double-loop learning for citizen-centric public services: a case study of local government Facebook use, Gov. Inf. Q. 34 (1) (2017) 110–125.
[21] M.C. Diamond, Enriching Heredity, Free Press, New York, NY, 1988.
[22] B. Kolb, I.Q. Whishaw, Brain plasticity and behavior, Annu. Rev. Psychol. 49 (1998) 43–64.
[23] A. Bandura, Social cognitive theory, Annu. Rev. Psychol. 52 (2001) 1–26.
[24] W.A. Rottschaefer, Evading conceptual self-annihilation: some implications of Albert Bandura's theory of the self-system for the status of psychology, New Ideas Psychol. 2 (1985) 223–230.
[25] W.A. Rottschaefer, Some philosophical implications of Bandura's social cognitive theory of human agency, Am. Psychol. 46 (1991) 153–155.
[26] L. Eisenberg, The social construction of the human brain, Am. J. Psychiatry 20 (1995) 1563–1575.
[27] R.C. Atkinson, R.M. Shiffrin, The control of short-term memory, Sci. Am. 225 (1971) 82–90.
[28] A. Bandura, Behavior theory and the models of man, Presidential Address to the APA, American Psychologist, 1974, pp. 859–869.


[29] A. Treisman, Selective attention in man, Br. Med. Bull. 20 (1964) 12–16.
[30] D. Kimura, Cerebral dominance and the perception of verbal stimuli, Can. J. Psychol. 15 (1961) 166–171.
[31] D. Kimura, Functional asymmetry of the brain in dichotic listening, Cortex 3 (1967) 163–178.
[32] M. Studdert-Kennedy, D. Shankweiler, Hemispheric specialization for speech perception, J. Acoust. Soc. Am. 48 (1970) 579–594.
[33] M.P. Bryden, Tachistoscopic recognition, handedness, and cerebral dominance, Neuropsychologia 3 (1965) 1–8.
[34] M.P. Bryden, Laterality: Functional Asymmetry in the Intact Brain, Academic Press, New York, NY, 1982.
[35] M. Mishkin, J.D. Forgays, Word recognition as a function of retinal locus, J. Exp. Psychol. 43 (1952) 43–48.
[36] U. Bellugi, E.S. Klima, Language, spatial cognition and neuronal plasticity, in: Proceedings of the 16th Annual Meeting of the European Neuroscience Association, Madrid, 1993.
[37] B.J. Grosz, A.K. Joshi, S. Weinstein, Centering: a framework for modeling the local coherence of discourse, Comput. Linguist. 21 (2) (1995) 203–225.
[38] B. Dodd, Lip reading in infants: attention to speech presented in and out of synchrony, Cogn. Psychol. 11 (1979) 478–484.
[39] P.K. Kuhl, A.N. Meltzoff, The bimodal perception of speech in infancy, Science 218 (1982) 1138–1141.
[40] C.A. Binnie, A. Montgomery, P.L. Jackson, Auditory and visual contributions to the perception of selected English consonants for normally and hearing impaired listeners, vol. 4, Stockholm, Sweden, 1974, pp. 181–209.
[41] A.Q. Summerfield, Use of visual information in phonetic perception, Phonetica 36 (1979) 314–331.
[42] M.M. Cohen, D.W. Massaro, Synthesis of visible speech, Behav. Res. Methods Instrum. 22 (1990) 260.
[43] M.M. Cohen, D.W. Massaro, Modeling coarticulation in synthetic visual speech, in: Models and Techniques of Computer Animation, Springer-Verlag, Tokyo, 1993, pp. 139–156.
[44] R. Graves, H. Goodglass, T. Landis, Mouth asymmetry during spontaneous speech, Neuropsychologia 20 (1982) 371–381.
[45] M. Wolf, Oral asymmetries during verbal and non-verbal movements of the mouth, Neuropsychologia 25 (2) (1987) 375–396.
[46] R. Davidson, K.R. Scherer, H. Goldsmith, Handbook of Affective Sciences, Oxford University Press, New York, NY, 2003.
[47] D. Keltner, P. Ekman, G. Gonzaga, J. Beer, Facial Expression of Emotion, Oxford University Press, New York, NY, 2003.


[48] K. Scherer, T. Johnstone, G. Klasmeyer, Vocal Expression of Emotion, Oxford University Press, New York, NY, 2003.
[49] P.N. Juslin, K.R. Scherer, Vocal Expression of Affect, Oxford University Press, 2005, pp. 65–135.
[50] K. N'Diaye, D. Sander, P. Vuilleumier, Self-relevance processing in the human amygdala: gaze direction, facial expression and emotion, Emotion 9 (6) (2009) 798–806.
[51] P. Ekman, W.V. Friesen, Facial Action Coding System, Consulting Psychologists Press, Palo Alto, CA, 1978.
[52] D. Sander, D. Grandjean, S. Kaiser, T. Wehrle, K.R. Scherer, Interaction effects of perceived gaze direction and dynamic facial expression: evidence for appraisal theories of emotion, Eur. J. Cogn. Psychol. 19 (2007) 470–480.
[53] G. Bente, N. Kramer, A. Peterson, J. DeRuiter, Computer animated movement and person perception: methodological advances in behavior research, J. Nonverbal Behav. 25 (1993) 151–166.
[54] L. Schilbach, et al., Being with virtual others: neural correlates of social interaction, Neuropsychologia 44 (2006) 718–730.
[55] R. Rickenberg, B. Reeves, The effects of animated characters on anxiety, task performance, and evaluations of user interfaces, in: Proceedings of the CHI Conference, 2000, pp. 49–56.
[56] Y. Kim, A.L. Baylor, A social-cognitive framework for pedagogical agents as learning companions, Educ. Technol. Res. Dev. 54 (6) (2006) 569–596.
[57] A.J. Gonzalez, B.S. Stensrud, G. Barrett, Formalizing context-based reasoning: a modeling paradigm for representing tactical human behavior, Int. J. Intell. Syst. 23 (7) (2008) 822–847.
[58] H. Fernlund, A.J. Gonzalez, M. Georgiopoulos, R. DeMara, Learning tactical human behavior through observation of human performance, IEEE Trans. Syst. Man Cybern. 36 (1) (2006) 128–140.
[59] P. Brezillon, Representation of procedures and practices in contextual graphs, Knowl. Eng. Rev. 18 (2) (2003) 147–174.
[60] J.R. Hollister, S. Parker, A.J. Gonzalez, R. DeMara, An extended Turing test: a context-based approach designed to teach youth in computing, in: Proceedings of the 8th International and Interdisciplinary Conference on Modeling and Using Context, Annecy, France, 2013.
[61] S. Domagk, Do pedagogical agents facilitate learner motivation and learning outcomes? J. Media Psychol. 22 (2010) 84–97.

10

The application of artificial intelligence in software engineering: a review challenging conventional wisdom

Feras A. Batarseh 1, Rasika Mohod 2, Abhinav Kumar 2, Justin Bui 3

1 GRADUATE SCHOOL OF ARTS & SCIENCES, DATA ANALYTICS PROGRAM, GEORGETOWN UNIVERSITY, WASHINGTON, D.C., UNITED STATES; 2 VOLGENAU SCHOOL OF ENGINEERING, GEORGE MASON UNIVERSITY, FAIRFAX, VA, UNITED STATES; 3 DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, UNIVERSITY OF CALIFORNIA, BERKELEY, CA, UNITED STATES

The less there is to justify a traditional custom, the harder it is to get rid of it.
Mark Twain

Abstract

The field of artificial intelligence (AI) is witnessing a recent upsurge in research, tools development, and deployment of applications. Multiple software companies are shifting their focus to developing intelligent systems, and many others are deploying AI paradigms to their existing processes. In parallel, the academic research community is injecting AI paradigms to provide solutions to traditional engineering problems. Similarly, AI has evidently proved useful to software engineering (SE). When one observes the SE phases (requirements, design, development, testing, release, and maintenance), it becomes clear that multiple AI paradigms (such as neural networks, machine learning, knowledge-based systems, and natural language processing) could be applied to improve the process and eliminate many of the major challenges that the SE field has been facing. This survey chapter reviews the most commonplace methods of AI applied to SE. The review covers methods published between 1975 and 2017: 46 major AI-driven methods are found for the requirements phase, 19 for design, 15 for development, 68 for testing, and 15 for release and maintenance. Furthermore, the purpose of this chapter is threefold: firstly, to answer the following questions: is there sufficient intelligence in the SE lifecycle? What does applying AI to SE entail? Secondly, to measure, formulize, and evaluate the overlap of SE phases and AI disciplines. Lastly, this chapter aims to provide serious
questions challenging the current conventional wisdom (i.e., the status quo) of the state-of-the-art, to craft a call for action, and to redefine the path forward.

Keywords: Artificial intelligence paradigms; Design; Lifecycle phase; Requirements; Testing.

1. Introduction and motivation

In the nineteenth century, Charles Babbage built a machine that was capable of performing an assortment of mathematical calculations. The goal of the machine was to execute math correctly; Babbage's purpose was to get rid of the inherent errors that occur when humans do calculations by hand [1]. Many refer to that program as the first software program. Software, since its earliest stages, has aimed to help humans automate processes that require a certain level of "intelligence." The process of building software (i.e., software engineering (SE)) remains primarily a human activity. Within computer science, the field of SE is the dominating industrial field [2]. Nonetheless, there has been an ample amount of research on every aspect of the SE process. As the field advances, many important questions have still not been answered with consensus: how to estimate the time and cost of a project? When to stop testing the system? How to locate and eradicate errors? How to incrementally develop and redesign? How to refactor the code? How to turn requirements into features? And many other important questions. Similar to other engineering disciplines, building software follows a well-defined process; this process is referred to as a lifecycle. Many different lifecycles have been introduced in the literature, and many have been used in industry. Commonplace lifecycle models include waterfall, agile, spiral, rapid, and incremental. Each phase within the lifecycle has its own challenges, drawbacks, and best practices; each phase has therefore become its own research field, with communities of researchers trying to improve it and multiple conferences and journals trying to advance it. In parallel, another area of research that is meant to represent intelligence in a machine is artificial intelligence (AI). Many researchers have claimed that these two areas of research (SE and AI) have not interacted or overlapped sufficiently [1–4]. Software's aim is to model the real world, to represent a certain human activity, or to automate an existing manual process. Similarly, AI, within its many subfields, aims to solve problems, represent intelligence, engineer knowledge, recognize patterns, learn from experience, and eventually answer the famous question posed by Alan Turing: "Can machines think?", think intelligently that is [1]. Intelligence, based on a recent definition from the Artificial General Intelligence proceedings [3,4], is the ability to "perform complex tasks within a complex environment." In 2008, Rech et al. claimed that "the disciplines of AI and SE have
many commonalities. Both deal with modeling real world objects from the real world like business process, expert knowledge, or process models" [3]. Hence, there is no doubt that the disciplines of SE and AI have a lot to share. To evaluate the success of AI when applied to SE, an updated and extensive review of the state-of-the-art is overdue (hence this chapter). There seems to be a general consensus that AI is suitable for solving SE problems [5–7]. There are many methods that apply AI to SE; however, most of the time the application is theoretical, small, mathematical, or applied to a formal problem within SE. No paper was found that addressed the nonfunctional aspects of SE, for instance, the human face of SE, communication issues between teams, or the "real-world" reasons why SE projects fail. For example, a method that intelligently classifies or clusters test cases might be helpful in some situations, but it still would not answer the big and ongoing questions of testing (what is a valid software? When does the SE team claim an errorless system?). SE researchers have been able to address traditional formal issues; however, is AI able to solve the bigger problems in SE? That is still unclear, and this chapter addresses the question. The purpose of the argument presented is not to search for a silver bullet; rather, it is to challenge the many papers that have claimed that the solution to many traditional SE challenges lies in AI. Most importantly, however, very few AI-driven methods have been used in industry; the real evaluation of success should be reflected in the number of success stories and implementations of these methods in industry. There have been a number of reviews that covered the same topic [3,4]; however, none of the ones found are as extensive or as detailed as this manuscript. Additionally, in this chapter, the methods are broken down by SE lifecycle phase. Five phases are used: 1) requirements engineering (RE), 2) design, 3) development, 4) testing, and 5) release and maintenance. These phases are deemed to be the most "commonplace" ones [4]; furthermore, while searching for papers, these phases proved relevant to how the research community is divided: most researchers, when writing about a certain phase, refer to one of these phases. Moreover, some papers per phase were found but not included in this review (they were deemed less important based on citations and relevance to scope). Some papers were more relevant and had a clear application of AI to an SE phase, while others were vague about which phase they belong to or which AI paradigm they use. Nonetheless, to provide a complete review of all papers found in this area of study, a comprehensive list of papers is provided, including the phase that each belongs to and the AI paradigm that it follows. That is discussed further within Sections 2 and 3. This chapter is structured as follows: the following five subsections review AI for each of the five SE lifecycle phases (which constitute
Section 2); Section 3 provides a summary of the review (including the complete list of all AI-driven SE methods); and Section 4 concludes the manuscript and includes insights and ideas for the path forward.

2. Applying AI to SE lifecycle phases

This section has five subsections, each dedicated to an SE phase and its AI methods. Within each phase, multiple AI-inspired methods are reviewed. The review within each subsection is neutral and unbiased, as it merely reports the claims made by the original referenced authors. In contrast, the subsequent sections of this survey (3 and 4) provide arguments, insights, and further discussions.

2.1 Requirements engineering and planning

Requirements analysis and planning is the first stage in the SE process, and it forms the building blocks of a software system. Software requirements describe the outlook of a software application by specifying the software's main objectives and goals. Because of its importance, many researchers have tried to introduce AI into this generally first phase of the lifecycle [5–33]. The Institute of Electrical and Electronics Engineers (IEEE) defines a requirement as a condition or capability needed by a user to solve a problem or achieve an objective, which must be met by a system to satisfy a contract, standard, specification, or other formally imposed document [34]. Requirements thus define the framework for a software development process by specifying what the system must do, how it must behave, the properties it must exhibit, the qualities it must possess, and the constraints that the system must satisfy. RE is the engineering discipline of establishing user requirements and specifying software systems. RE emphasizes the use of systematic and repeatable techniques that ensure the completeness, consistency, and relevance of the system requirements. RE is considered one of the most critical phases of a software development lifecycle, because unclear, ambiguous, or low-quality requirements specifications can lead to the failure of a product or the deployment of a completely undesired product, and can raise development costs. Therefore, and due to the increasing size and complexity of software systems, there is a growing demand for AI approaches that can help improve the quality of RE processes. Research concerning the need for computer-based tools that help human designers formulate formal, process-oriented requirements specifications dates back to 1978 [35]: Balzer et al. determined some attributes of a suitable process-oriented specification language and examined why specifications would still be difficult to write in such a language in the absence of formulation tools.


The researchers argued that the key to overcoming these difficulties was the careful introduction of informality, based on partial rather than complete descriptions, and the use of a computer-based tool which utilizes context extensively to complete these descriptions during the process of constructing a well-formed specification. Objects were perceived entirely in terms of their relationships with each other and a set of primitive operations that allow relationships to be built or destroyed. This allowed incremental elaboration of objects, relationships, and operations, and allowed objects and operations to be modeled almost exactly as a user desires. However, constructing an operational specification still proved difficult and error prone because the specification language was still a formal language. Specification languages only provided partial descriptions because the context provided the rest of the necessary information. The authors further proposed context mechanisms, which were more complex and produced more diffused contexts. There had also been concern over the difficulty and reliability of creating and maintaining specifications over the program's lifecycle. The authors proposed a tool that assisted in converting informal specifications into formal ones [35]; since any changes to the specifications would be made to the informal versions, which are much easier to deal with, the claimed advantages included precision, better focus during development, reduced training for users, and improved maintainability. The authors tested this idea using a prototype system (called SAFE). The system was given input for some tasks, and the prototype then produced a program. In the examples presented in the paper, the prototype performed successfully, and the errors found could be reduced with a small amount of user input. Many other partly automatic tools have been developed to assist SE in the requirements analysis phase, working from artifacts such as textual descriptions and Unified Modeling Language (UML) diagrams. To understand the dynamics of a system, a developer needs to analyze the use case descriptions and identify the actors before modeling the system. The common technique is to use the grammar of the elicited text as the basis for identifying useful information; however, there is a scalability issue due to the unstructured nature of natural language (NL). A semiautomatic approach to extracting the required information using Natural Language Processing (NLP) reduces the time spent on requirements analysis. From a computational perspective, text is a set of letters, words, and sentences arranged in different patterns, and each pattern of words or sentences gives a specific meaning. Vemuri et al. [36] presented a probabilistic technique that identifies actors and use cases and that can be used to approach
this problem. Their research explores Machine Learning (ML) techniques to process these patterns in the text (through NLP as well). A supervised learning algorithm is used to classify extracted features as actors and use cases. Additionally, the input text is provided to an NL processor to obtain word features, and an extraction algorithm predicts use cases as subject–verb–object combinations. These results are used to draw use case diagrams. The process flow of the proposed approach is shown in Fig. 10.1. The experimental work in Ref. [36] successfully extracted actors and use cases using a probabilistic classification model along with minimal assistance from a rule-based approach, and the use cases obtained were crisp and consistent irrespective of the size of the requirements text. R-Tool, NL-OOPS, and CM-BUILDER are a few other NLP-based computer-aided SE tools [37]. Such tools produce class diagrams from the user requirements document (although they still require user intervention). Michl et al. [37] proposed the NL-OOPS (Natural Language–Object-Oriented Production System) project to develop a tool supporting object-oriented analysis. Requirements documents are analyzed with LOLITA (Large-scale Object-based Linguistic Interactor, Translator and Analyzer), a large-scale NL processing system. Both the knowledge in the documents and the knowledge already stored in LOLITA's knowledge base are then used to produce requirements models. The approach was based on the consideration that requirements are often written in unrestricted NL, and in many cases it is impossible to impose restrictions on the language customers use. The object-oriented modeling module implemented an algorithm that filters entity and event nodes in the knowledge base to identify classes and associations.
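A highly simplified sketch of the subject–verb–object idea follows. It is not Vemuri et al.'s trained probabilistic classifier [36]; a hand-written part-of-speech lexicon stands in for NLP tooling, and the example requirement sentences are invented.

```python
# Toy subject-verb-object extraction from a requirement sentence.
# A real pipeline would use an NLP library and a trained classifier;
# here a hand-written lexicon stands in for part-of-speech tagging.

LEXICON = {"administrator": "NOUN", "creates": "VERB", "account": "NOUN",
           "customer": "NOUN", "places": "VERB", "order": "NOUN",
           "the": "DET", "a": "DET", "an": "DET"}

def extract_actor_usecase(sentence):
    """Return (actor, use_case) using a naive first-noun / verb / next-noun rule."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    tagged = [(t, LEXICON.get(t, "OTHER")) for t in tokens]
    nouns = [t for t, tag in tagged if tag == "NOUN"]
    verbs = [t for t, tag in tagged if tag == "VERB"]
    if not (nouns and verbs):
        return None
    actor, verb = nouns[0], verbs[0]
    obj = nouns[1] if len(nouns) > 1 else ""
    return actor, f"{verb} {obj}".strip()

for req in ["The administrator creates an account.",
            "A customer places an order."]:
    print(extract_actor_usecase(req))
# ('administrator', 'creates account')
# ('customer', 'places order')
```

A real pipeline would replace the lexicon with a tagger and the rule with a supervised model, as the authors describe.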

FIGURE 10.1 Intelligently processing requirements text [36]: an input document passes through a preprocessor, actor and use case classifiers, and a post processor to produce a use case diagram.

In parallel, Ninaus et al. [38] proposed a method to reduce the risk of low-quality requirements by improving the support of stakeholders in the development of RE models. The authors introduced the INTELLIREQ environment. This environment is based on different recommendation approaches that support stakeholders in requirements-related activities such as definition, quality assurance, reuse, and release planning. There are four basic types of recommendation approaches: 1. collaborative filtering (an implementation of word-of-mouth promotion), 2. content-based filtering (using search keywords to determine recommendations), 3. knowledge-based recommenders, and 4. group recommenders (recommendations for groups). INTELLIREQ supports early RE, where the major focus is to prioritize high-level requirements in software projects, and it automatically discerns possible dependencies among requirements. The outcome of INTELLIREQ is a consistent set of high-level requirements with corresponding effort estimations and a release plan for the implementation of the identified requirements. Another aspect of RE is the software requirements selection challenge: the problem of choosing the set of requirements to include in the next release of a software system. The optimization of this process is at the core of the next release problem (NRP). NRP is an NP-hard problem which involves two independent and conflicting objectives that must be optimized simultaneously: the development effort (cost) and the clients' satisfaction. This problem cannot be managed by traditional exact optimization methods. In this case, multiobjective evolutionary algorithms (MOEAs) are the most appropriate strategies because MOEAs tackle several conflicting objectives simultaneously without the artificial adjustments included in classical single-objective optimization methods. Further complexity is added with the multiobjective NRP: in real instances, the requirements themselves interact. Chaves-González et al. [39] adopted a novel multiobjective Teaching Learning-Based Optimization (TLBO) algorithm to solve real instances of the NRP. TLBO is a swarm intelligence algorithm which uses the behavior of participants in a classroom to solve the requirements selection problem. In this context, the original TLBO algorithm was adapted to solve real instances of the problem generated from data provided by experts. The authors tested the efficacy of a TLBO scheme modified to work with the multiobjective optimization problem addressed in the study. Various numerical and mathematical experiments confirmed the effectiveness of the multiobjective proposal: the results demonstrated that the multiobjective TLBO algorithm performs better than other algorithms designed for the same task. Sagrado et al. [40], in contrast, proposed an application of Ant Colony Optimization (ACO) to the NRP. The idea is to help engineers decide which set of requirements should be included in the next release. The proposed ACO system is evaluated by comparison with a Genetic Algorithm (NSGA-II) and a Greedy Randomized Adaptive Search Procedure.
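To make the cost-versus-satisfaction trade-off at the heart of the NRP concrete, the sketch below brute-forces requirement subsets under a cost cap. That is feasible only for toy instances; realistic instances require metaheuristics such as the TLBO, ACO, or NSGA-II approaches cited above, and the requirement costs and satisfaction scores here are invented.

```python
# Toy next release problem (NRP): pick requirements maximizing client
# satisfaction while keeping development cost within a budget.
# Brute force over subsets; only viable for very small instances.

from itertools import combinations

# (name, cost, satisfaction) -- invented example data
requirements = [("login", 3, 7), ("reports", 5, 4),
                ("search", 4, 8), ("export", 2, 3)]
BUDGET = 9

def best_release(reqs, budget):
    best, best_value = (), -1
    for r in range(len(reqs) + 1):
        for subset in combinations(reqs, r):
            cost = sum(c for _, c, _ in subset)
            value = sum(s for _, _, s in subset)
            if cost <= budget and value > best_value:
                best, best_value = subset, value
    return [name for name, _, _ in best], best_value

print(best_release(requirements, BUDGET))
# (['login', 'search', 'export'], 18) at a cost of exactly 9
```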
Neumann [41], in contrast, presented an enhanced technique for risk categorization of requirements: Principal Component Analysis combined with Artificial Neural Networks (ANNs). This technique improved the capability to discriminate high-risk software aspects. The approach is built on the combined strengths of pattern recognition and ANNs: principal component analysis is utilized to normalize the input data, thus eliminating the ill effects of multicollinearity, and a neural network is used for risk determination and classification. This procedure gives the technique the capability to discriminate datasets that include disproportionately large numbers of high-risk software modules. The combination of principal component analysis and ANNs has been shown to provide significant improvements over either method by itself. Another aspect of RE is project planning, a crucial activity which significantly affects the success or failure of a project. Accurate estimation of software development effort plays a vital role in the management of software projects because it directly affects project scheduling and planning. Underestimation of effort causes delays and cost overruns, which can lead to project failure; conversely, overestimation is detrimental to the effective utilization of project resources. Specific past experience of individual situations can be a good guiding factor here. Vasudevan [42] presented one of the first experience-based models for software project management (in 1994). In this approach, the focus was on using concrete cases or episodes rather than basic principles. Fuzzy logic was employed to represent case indices, and fuzzy aggregation functions were used to evaluate cases. This provided a formal scheme to quantify the partial matches of a given problem with multiple cases in the database and to utilize these partial matches to compute an aggregated result. Cost estimation and risk assessment functions of software project management were also covered. The author discussed the basic principles of Case-Based Reasoning (CBR) and provided a functional description of the proposed system with details of case representation and case evaluation strategies. CBR is an analogical reasoning method that stores and retrieves past solutions from specific episodes; a case base is a memory bank that represents experience. In this research, the author presented a scheme for software project management that utilizes specific instances of past experience, employing a fuzzy representation of case indices that leads to a partial generalization of cases and makes them applicable to a class of situations.
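A minimal sketch of the case-based idea is shown below. It is not Vasudevan's system [42]; the triangular membership functions, case attributes, and past-project values are invented, but it illustrates the core loop of scoring stored cases by fuzzy partial matches on their indices and reusing the effort of the closest one.

```python
# Toy case-based reasoning for effort estimation with fuzzy case indices.
# Each past case is indexed by team size and requirement volatility;
# triangular membership grades give partial matches, and the estimate
# is the effort of the best-matching stored case. Illustrative only.

def triangular(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# past cases: (team_size, volatility in [0, 1], effort in person-months)
case_base = [(4, 0.2, 11.0), (9, 0.6, 30.0), (15, 0.8, 62.0)]

def similarity(query, case):
    team_q, vol_q = query
    team_c, vol_c, _ = case
    # fuzzy "about the same team size" and "about the same volatility"
    s_team = triangular(team_q - team_c, -5, 0, 5)
    s_vol = triangular(vol_q - vol_c, -0.3, 0, 0.3)
    return (s_team + s_vol) / 2        # simple aggregation

def estimate(query):
    best = max(case_base, key=lambda c: similarity(query, c))
    return best[2]

print(estimate((8, 0.5)))   # reuses the 30.0 person-month case
```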


In the same year (1994), Black, Jr. [43] introduced a system for using AI in requirements management, called the Systems Information Resource/Requirements Extractor (SIR/REX). The first part of SIR/REX, the SIR, provided support to the earliest phases of systems engineering by automatically creating linkages between certain fragments of information; in many cases, SIR could function as a glossary for documents full of "alphabet soup," simplifying readability. Source documents were thus transformed into enriched documents. The second portion of SIR/REX was the Requirements Extractor (REX), which used AI-based NL analysis to extract candidate requirements. Obtaining the requirements expressed by a specification is usually a tedious task; SIR/REX, however, first analyzed an NL document and chose specific sentences with a high probability of being candidate requirements. By searching for morphological root words instead of just keywords, the filter could better locate terms in a document. When human performance was compared with the SIR/REX algorithms in requirement extraction, the algorithms proved superior in both consistency and speed, and the quality and speed of requirements candidate selection improved greatly. Once all the requirements specifications are ready, further analysis has to be performed for software cost estimation. Soft computing-based approaches such as ANNs and fuzzy logic have been widely used; however, due to the inconsistency and vagueness of software project attributes, accurate estimation of development effort seems unreachable in a dataset comprised of heterogeneous projects. Bardsiri et al. [44] proposed a high-performance model, the localized multi-estimator (LMES), in which the process of effort estimation is localized. In LMES, software projects are labeled based on their underlying characteristics; project clusters are then investigated, and the most accurate estimators are selected for each cluster. LMES is thus a combination of project classification and estimator selection. To find the most effective estimators, an estimator vector and a solution vector are produced. Estimator performance is evaluated using widely accepted metrics including the magnitude of relative error (MRE), mean MRE, median MRE, and percentage of prediction. In this model, development effort estimation is localized by assigning different estimators to different software projects. The investigation domain included five single and five hybrid effort estimation models, and three real datasets were utilized to evaluate the performance of LMES using widely accepted performance metrics. The evaluation results showed that LMES outperformed the other estimators. LMES was found to be quite flexible in handling the heterogeneous nature of software project datasets through the localization idea. Furthermore, LMES is not dependent on a particular type of estimator because it considers different types of algorithmic and nonalgorithmic estimators, which makes it able to deal with the uncertainty that exists in the performance of estimation models. Moosavi et al. [45] presented a new estimation model based on a combination of an adaptive neuro-fuzzy inference system (ANFIS) and the Satin Bowerbird Optimization algorithm (SBO). SBO is a novel optimization algorithm proposed to adjust the components of ANFIS by applying small and reasonable changes to variables. Although ANFIS is a strong
and fast method for estimation, the nonnormality of software project datasets makes the estimation process challenging and complicated. To deal with this problem, Moosavi et al. suggested tuning accurate parameters for ANFIS using the metaheuristic SBO algorithm. The proposed hybrid model is an optimized neuro-fuzzy-based estimation model capable of producing accurate estimations for a wide range of software projects; the main role of SBO is to find the ANFIS parameters that yield the most accurate estimates. Among the many software effort estimation models, Estimation by Analogy (EA) is still one of the techniques preferred by software engineers because it mimics the human problem-solving approach. The accuracy of such a model depends on the characteristics of the dataset, which is subject to considerable uncertainty. To overcome this challenge, Azzeh et al. [46] proposed a new formal EA model, FGRA, based on the integration of fuzzy set theory with Grey Relational Analysis (GRA). Fuzzy set theory is employed to reduce uncertainty in the distance measures between two tuples at the kth continuous feature. GRA is a problem-solving method used to assess the similarity between two tuples with M features. Because some of these features are not necessarily continuous and may be of nominal or ordinal scale type, aggregating different forms of similarity measures increases the uncertainty in the similarity degree; thus, GRA is mainly used to reduce uncertainty in the distance measure between two software projects for both continuous and categorical features. Both techniques are suitable when the relationship between effort and the other effort drivers is complex. Fig. 10.2 shows the software effort estimation framework of the GRA-based system. The case retrieval stage of FGRA aims to retrieve historical projects that exhibit large similarity with the project under investigation, and the effort prediction stage derives the final effort estimate based on the retrieved projects; in this stage, the aim is to determine the number of retrieved projects that should be involved in the effort prediction. The proposed FGRA model produced encouraging results on five publicly available datasets when compared with well-known estimation models (CBR and ANN).
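The grey relational part of the FGRA idea can be sketched briefly. The code below ranks historical projects by their mean grey relational coefficient against a query project over already-normalized features; the feature values are invented, the distinguishing coefficient of 0.5 follows common convention, and the fuzzification of distances used in the full model [46] is omitted.

```python
# Grey Relational Analysis (GRA) sketch for analogy-based effort estimation.
# The grade of a query project against each historical project is the mean
# grey relational coefficient over (already normalized) features; the closest
# analogue's effort is reused. Feature values here are invented.

ZETA = 0.5  # distinguishing coefficient, conventionally set to 0.5

def gra_grades(query, candidates):
    """Grey relational grade of `query` against each candidate feature vector."""
    all_deltas = [[abs(q - c) for q, c in zip(query, cand)] for cand in candidates]
    flat = [d for row in all_deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    return [sum((d_min + ZETA * d_max) / (d + ZETA * d_max) for d in row) / len(row)
            for row in all_deltas]

# normalized features (size, complexity, team experience) and known effort
history = [((0.20, 0.30, 0.80), 9.0),
           ((0.60, 0.70, 0.40), 27.0),
           ((0.90, 0.80, 0.20), 55.0)]
query = (0.55, 0.65, 0.45)

grades = gra_grades(query, [features for features, _ in history])
closest = max(zip(grades, history))            # highest grade wins
print(closest[1][1])                           # 27.0 person-months reused
```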

FIGURE 10.2 A grey relational analysis (GRA) software effort estimation framework [46]: a historical dataset and the project to be estimated pass through data preparation, FGRA feature selection, case retrieval, and effort prediction to produce the estimated effort.

Another software cost estimation product, IASCE, a prototypical expert system for estimating the cost of a proposed project within an integrated computer-aided software engineering (CASE) environment, was developed by Wang et al. [47]. IASCE provides support for learning, i.e., tailoring existing models based on experience to the specific needs and characteristics of the environment. It supports multiple software cost estimation models and their corresponding metrics, keeping them tractable for project management control, feedback, and learning activities. In addition, IASCE provides for the establishment of project-specific cost models and corporate metrics for those models, enables tracing of the models and metrics throughout the software lifecycle via feedback and postmortem evaluations, and offers a mechanism for long-range improvement of software cost estimation. The basic architecture of IASCE consists of an Expert System Monitor (ESM), Expert Judgment, Expert Shell1, Expert Shell2, Model Evaluation, and a Model and Metrics Interpreter (MMI). The ESM controls the execution of the IASCE system modules. The Expert Judgment tool consults one or more experts who use their experience and understanding of the proposed project to adjust the estimates given by the expert system. Expert Shell1 contains the inference strategies and control that simulate expert model processing, and it manipulates the rules to produce the final cost estimation. Expert Shell2 plays the role of an administrator for the entry of both new rules and facts. Benala et al. [48] combined the Fuzzy C-Means (FCM) data clustering algorithm with Functional Link Artificial Neural Networks (FLANN) to achieve accurate software effort prediction. FLANN is a computationally efficient nonlinear network capable of complex nonlinear mapping between its input and output pattern spaces; the nonlinearity is introduced by passing the input pattern through a functional expansion unit. The proposed method was evaluated on three real datasets. ANN techniques are very popular for the prediction of software development effort due to their capability to map nonlinear inputs to outputs. Hota et al. [49] explored the Error Back Propagation Network (EBPN) for software development effort prediction by tuning two algorithm-specific parameters, the learning rate and momentum. EBPN is a kind of neural network popularly used as a predictor due to its capability of mapping high-dimensional data; the error back propagation algorithm used with EBPN is one of the most important developments in neural networks. The experimental work was carried out with the WEKA open-source data mining software, which provides an interactive way to develop and evaluate models. EBPN was tested with two benchmark datasets, China and Maxwell, and the authors demonstrated the ability of ANNs to predict software development effort. Furthermore, Nassif et al. [50] carried out research comparing four different neural network models, the Multilayer Perceptron (MLP), General Regression Neural Network (GRNN), Radial Basis Function Neural Network (RBFNN), and Cascade Correlation Neural Network (CCNN), for software development effort estimation. In their study [50], the four models
were compared based on (1) predictive accuracy centered on the mean absolute error criterion, (2) whether the model tends to overestimate or underestimate, and (3) how each model classifies the importance of its inputs. Industrial datasets from the International Software Benchmarking Standards Group (ISBSG) were used to train and validate the four models. The main ISBSG dataset was filtered and then divided into five datasets based on the productivity value of each project. The performance criterion used in this study was the mean absolute residual (MAR). Each model had four inputs, (1) software size, (2) development platform, (3) language type, and (4) resource level, and the software effort was the output of the model. Results showed that the MLP and GRNN models tend to overestimate on 80% of the datasets, followed by the RBFNN and CCNN models, which tend to overestimate on 60% of the datasets. An MLP is a feedforward ANN model that has one input layer, at least one hidden layer, and one output layer. Each neuron of the input layer represents an input vector. If a network is composed of only an input layer and an output layer (no hidden layer), then it is called a perceptron. The MLP model is shown in Fig. 10.3. The GRNN is another type of neural network, proposed by Specht [51]. A GRNN applies regression on continuous output variables and is composed of four layers, as depicted in Fig. 10.4: the first layer is the input layer, in which each predictor (i.e., independent variable) has a neuron, and the second layer is fed from the input neurons. An RBFNN is a feedforward network composed of three layers: an input layer, a hidden layer with a nonlinear RBF activation function, and a linear output layer. Fig. 10.5 shows the diagram of the RBFNN network. A CCNN, also known as a self-organizing network, is composed of input, hidden, and output layers.

FIGURE 10.3 Multilayer perceptron model [50]: input nodes, hidden nodes, and an output node.
FIGURE 10.4 General regression neural network [51]: input neurons feed pattern neurons with activation function exp(-Di^2/2σ^2), followed by summation (numerator and denominator) neurons and an output neuron producing Y(X).
FIGURE 10.5 Radial basis function neural network [50]: an input layer, a hidden layer, and a weighted output layer.

When the training process starts, a CCNN is composed of only an input and an output layer, and each input is connected to each output. In the second stage, neurons are added to the hidden layer one by one. Fig. 10.6 shows a CCNN network with one hidden neuron.

FIGURE 10.6 CCNN with one hidden neuron [51].
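As a concrete, minimal illustration of the feedforward models discussed in this subsection, the sketch below trains a one-hidden-layer MLP on a handful of invented project records with plain backpropagation (including a momentum term, as in the EBPN discussion above) and reports the mean absolute residual (MAR) used as the criterion in [50]. It is written from scratch for illustration; the data, layer sizes, and hyperparameters are assumptions, not the ISBSG setup or the authors' tooling.

```python
# Minimal one-hidden-layer MLP for effort estimation (illustrative only).
# Inputs loosely mirror [50]: size, platform, language type, resource level,
# all scaled to [0, 1]; the output is normalized effort. Data are invented.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.1, 0.0, 0.5, 0.2], [0.4, 1.0, 0.5, 0.4],
              [0.6, 0.0, 1.0, 0.6], [0.9, 1.0, 1.0, 0.8]])
y = np.array([[0.15], [0.35], [0.60], [0.90]])

W1 = rng.normal(0.0, 0.5, (4, 6)); b1 = np.zeros((1, 6))
W2 = rng.normal(0.0, 0.5, (6, 1)); b2 = np.zeros((1, 1))
vW1 = np.zeros_like(W1); vW2 = np.zeros_like(W2)   # momentum buffers
lr, momentum = 0.05, 0.8

for _ in range(3000):
    h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))       # sigmoid hidden layer
    pred = h @ W2 + b2                             # linear output layer
    err = pred - y
    # backpropagation of mean squared error
    gW2 = h.T @ err / len(X)
    gb2 = err.mean(axis=0, keepdims=True)
    dh = (err @ W2.T) * h * (1.0 - h)
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0, keepdims=True)
    # gradient descent with a momentum term
    vW2 = momentum * vW2 - lr * gW2; W2 += vW2; b2 -= lr * gb2
    vW1 = momentum * vW1 - lr * gW1; W1 += vW1; b1 -= lr * gb1

h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
mar = np.abs((h @ W2 + b2) - y).mean()             # mean absolute residual
print(f"training MAR: {mar:.3f}")
```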

2.2 Software design

Similar to requirements, although fewer methods were found, AI has been applied to the design phase as well [52–63]. Software design is the activity of creating an engineering representation (i.e., a blueprint) of a proposed software implementation. In the design process, the software requirements document is translated into design models that define the data structures, system architecture, interfaces, and components [64,65]. The software design phase usually takes only up to 6% of the software development budget, although it can drastically affect the results of software development [66]. To make the design phase more efficient, AI applications have been explored. Architectural design defines the relationships among the major structural elements of the software, and interface design describes how software elements, hardware elements, and end users communicate with one another [65]. Knowledge-Based Systems (KBSs) have various applications in database design; a paper from the Hungarian Academy of Sciences examined the function and architecture of a KBS for supporting the information system design process [67]. Multiple other AI techniques and applications have been proposed for design. This includes a paper by Jao et al. [68] proposing a strategy, based on autonomic computing technology, for applying self-adaptations to dynamic software architectures and to autonomous agents. Fig. 10.7 shows the architecture of autonomous agents presented in that paper. Another paper, by Taylor and Frederick [69], proposed the creation of a third-generation machine environment for computer-aided control engineering (CACE) using an Expert System (a synonym for KBS).

FIGURE 10.7 The architecture of autonomous agents [68]: a planner uses an architectural reference model to produce plans for achieving architectural goals and for handling changes and cascade reactions, while connector and component managers, a sensor, and an effecter manipulate the software architecture of the running software system toward a desired architectural model.

The authors described the faults in the CACE software as the inspiration for applying AI to their proposed environment. That paper focused on the "high-level requirements" for an improved CACE environment; Fig. 10.8 shows the complete functional structure of CACE. Another paper, by Dixon et al. [70], proposed an architecture focused more on evaluation and redesign, emphasizing the iterative nature of the engineering design process.

FIGURE 10.8 Complete functional structure of CACE-III [69]: a design engineer interacts with the expert system, whose rule bases draw on plant models, analysis procedures, and design procedures to transform a problem frame (needs, model, constraints, specs, status) into a solution frame and a final design.

The proposed architecture contains four initial functions: initial design, evaluation, acceptability, and redesign, each represented in the architecture by a separate knowledge source. A fifth function, control, acts as a central control module and decides which knowledge source is invoked next, and a sixth function, the user interface (UI), also exists as a separate knowledge base (a toy sketch of such an evaluate-and-redesign loop is given at the end of this section). Another paper, by Soria et al. [71], suggested that searching the design space can occur more efficiently through AI-based tools. The paper discusses two tools: the first assists in exploring architectural models, while the second assists in refining architectural design models into object-oriented models (which lead to development). Additionally, a paper by Rodríguez et al. [72] suggested solutions for service-oriented architecture (SOA). SOA is a form of software design where services are provided to other components by application components, typically through a communication protocol. The paper offers a conceptualized analysis of AI research works that have aimed to discover, compose, or develop services; the aim of the study is to classify significant works on the use of AI in determining, creating, and developing web services. For AI applications in UI design, few papers were found. A paper by William B. Rouse [73] presents a conceptual structure for the design of man–computer interfaces for online interactive systems; the design of UIs is discussed through visual information processing and mathematical modeling of human behavior, and likely paths of study in man–computer systems are also recommended [73]. Another paper was recently published by Batarseh et al. [74], presenting a model for developing UIs for emergency applications through an understanding of the context of the user (the demographics, age, nationality, and role of the user dictate how the user visualizes the UI). After design, all the decisions taken are transformed into an implementation; that phase is discussed in the next section.
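The evaluate-and-redesign loop referred to above can be illustrated with the toy code below. The knowledge sources are reduced to plain functions, and the design parameters, evaluation rule, and acceptability threshold are invented, so this is a sketch of the control flow rather than Dixon et al.'s knowledge-based architecture [70].

```python
# Toy evaluation-redesign loop in the spirit of an iterative design
# architecture: separate "knowledge sources" for initial design, evaluation,
# acceptability, and redesign, coordinated by a simple controller.
# All rules and thresholds are invented for illustration.

def initial_design():
    return {"thickness_mm": 2.0, "rib_count": 2}

def evaluate(design):
    # stand-in analysis: stiffness grows with thickness and ribs
    return 10 * design["thickness_mm"] + 4 * design["rib_count"]

def acceptable(stiffness, requirement=40.0):
    return stiffness >= requirement

def redesign(design):
    # crude heuristic: add a rib first, then thicken the wall
    if design["rib_count"] < 5:
        return {**design, "rib_count": design["rib_count"] + 1}
    return {**design, "thickness_mm": design["thickness_mm"] + 0.5}

def controller(max_iterations=20):
    design = initial_design()
    for _ in range(max_iterations):
        score = evaluate(design)
        if acceptable(score):
            return design, score
        design = redesign(design)
    return design, evaluate(design)

print(controller())   # ({'thickness_mm': 2.0, 'rib_count': 5}, 40.0)
```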

2.3 Software development and implementation (writing the code)

The software development phase [75], also called the implementation phase, is the phase in the lifecycle where the system is transformed from design to production. In this phase, the desired software components are built either from scratch or by composition. This component-building process is done using the architecture or design document from the software design phase and the requirements document from the requirements analysis phase. This phase also deals with issues of quality, performance, baselines, libraries, and debugging; the end deliverable is the product itself. Different frameworks and methodologies have been followed over the years to achieve the final working product in this


software development phase. Because SE is evolutionary in nature, with long processes and many stages of development, the realities of the requirements often change by the time coding is finished. Automated programming environments are suggested as a solution to this. This automation can be incorporated in the process of code generation, code reuse, and code refactoring. AI can be applied to automate or assist developers in other programming processes such as generating functions and data structures. Autonomous Software Code Generation (ASCG) is an agent-oriented approach for automated code generation [76]. Insaurralde [77] also proposed an Autonomous Development Process, an approach that goes beyond typical software development automation, which usually involves synthesizing software from design models, predefined policies, and fixed rules. The author presented a self-directed development process that can make decisions to develop software. In this approach, an Ontology-Enabled Agent takes the place of the human developer by performing software development activities autonomously. Knowledge captured by the ontological database enables high-level reasoning to interpret, design, and synthesize the system logic. This methodology is implemented using a graphic computer tool. The framework for ASCG is shown in Fig. 10.9. The approach initially implements a single artificial agent, the Software Developer Agent (SDA), which starts the development of the system by reading the requirements specification, given as a physical configuration of the software under development. The SDA captures this information and queries its own internal knowledge by means of a reasoner to make design decisions for the software that realizes the system logic. The system logic is built of interconnected blocks that can exchange information by receiving data from and sending data to other blocks. Following this information, the SDA can generate the software code as specified. This approach moved away from conventional engineering solutions that develop software in a semiautonomous manner and instead deployed a purely automated method. Another aspect of software implementation is software reusability. Reusability is often touted as one of the most crucial processes for advancing software development productivity and quality. In reusability, developers obtain standard components whose behavior and functionality are well described and understood and then integrate these components into a new software system. Several knowledge-based software reuse design models have been proposed in early research by Biggerstaff et al. [78,79]. The Knowledge-Based Software Reuse Environment (KBSRE) for program development helps users acquaint themselves with the application domain environment, find partially matched components in the reusable component library, comprehend the


FIGURE 10.9 Autonomous software code generation framework [77].

lifecycle knowledge of a component, and decompose a component when necessary. KBSRE is a system that allows the implementer to modify the knowledge base, decomposition rules, and sibling rules with minimal effort. Wang et al. discussed three examples of such methods in their literature review [80]. The authors presented the work of Prieto et al. [81], who borrowed notions from library science and developed a multidimensional description framework of facets for classifying components. The other research works discussed were The Programmer's Apprentice system by Waters [82], which provides the user with a knowledge-based editor (KBEmacs), and the Intelligent Design Aid (IDeA), also by Biggerstaff et al. [83]. KBEmacs is based on the refinement-based software development paradigm. Other uses and techniques for AI in coding include Pattern Trace Identification, Detection, and Enhancement in Java, as well as Search-Based Software Engineering. Hewitt et al. [84] presented the idea of developing a Programming Apprentice, a system that is used to assist developers in writing their code as well as establishing specifications, validating modules, answering


questions about dependencies between modules, and analyzing the implications of perturbations in modules and specifications. The authors proposed a method called metaevaluation, which attempts to implement the process that programmers perform to verify that their program meets its specifications. Additionally, Hewitt et al. [84] described metaevaluation as a process that attempts to show that the contracts of an actor will always be satisfied; here, a contract is a statement of what should happen in a program under a set of conditions. Additionally, software refactoring (one of the most expensive parts of software implementation) has seen many successful applications of Search-Based methods (using metaheuristics [85–89] and others). In most of these studies, refactoring solutions were evaluated based on the use of quality metrics. On the other hand, Amal et al. [90] introduced the use of a neural network–based fitness function for the problem of software refactoring. They presented a novel interactive search-based learning refactoring approach that does not require the definition of a fitness function. Software engineers manually evaluate the refactoring solutions suggested by a Genetic Algorithm (GA) for a few iterations; an ANN then uses these evaluations as training examples to evaluate the refactoring solutions for the remaining iterations. The algorithm proposed by Amal et al. is shown in Fig. 10.10. The method proposed in that paper takes the system to be refactored as an input. Afterward, an exhaustive list of possible refactoring types is generated, along with the number of designer interactions allowed during the search process. The output provides the best refactoring sequences that would improve the quality of the system. The approach is composed of two main


FIGURE 10.10 IGA, LGA, and ANN used for implementation [90].


components: the interactive component, an Interactive Genetic Algorithm (IGA), and the learning module, a Learning Genetic Algorithm (LGA). The algorithm starts by executing the IGA component, where the designer manually evaluates the refactoring solutions generated by the GA for a number of iterations. The designer evaluates the feasibility and the efficiency (or quality) of the suggested refactorings one by one, because each refactoring solution is a sequence of refactoring operations; thus, the designer classifies each suggested refactoring as good or not based on his or her preferences. After the IGA component has executed for a number of iterations, all the solutions evaluated by the developer are used as a training set for the LGA. The LGA component trains an ANN to generate a predictive model that approximates the evaluation of the refactoring solutions in the subsequent iterations of the GA. Thus, the approach does not require the definition of a fitness function. The authors used two different validation methods, manual validation and automatic validation, to evaluate the efficiency of the proposed refactorings and analyze the extent to which the proposed approach could improve design quality and propose efficient refactoring solutions. They compared their approach to two other existing search-based refactoring approaches, by Kessentini et al. [91] and Harman et al. [85]. They also assessed the performance of their proposal against the IGA technique proposed by Ghannem et al. [92], where the developer evaluates all the solutions manually. The methodology proposed in Ref. [90] required much less effort and interaction with the designer to evaluate the solutions, because the ANN replaces the designer after a number of iterations or interactions. Many other automated programming environments have been proposed in the literature, such as Language Feature, Meta Programming, Program Browsers, and Automated Data Structuring [76]. Language Feature is a technique based on the concept of late binding (i.e., making data structures very flexible). In late binding, data structures are not finalized into implementation structures; thus, quick prototypes are created, resulting in code that can be modified and managed easily. Another important language feature is the packaging of data and procedures together in an object, giving rise to object-oriented programming, which is found useful in environments where code, data structures, and concepts are constantly changing [76]. LISP, one of the oldest high-level programming languages, is still in use today and provides facilities such as Meta Programming and Automated Data Structuring [76]. Meta Programming is a concept developed through NLP, which is a subfield of AI. Meta Programming is a practice in which computer programs can treat other programs as their data: a program can be created to read, generate, analyze, or transform other programs, and even modify itself while running. Meta Programming


uses automated parser generators and interpreters to generate executable LISP code. Automated Data Structuring means going from a high-level specification of data structures to an implementation structure. When systematic changes are required throughout a codebase, it is more "quality-controlled" and manageable to make them through another program (e.g., a program update manager) than through a manual text editor or a traditional API.
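As a small illustration of the metaprogramming idea above (a program treating another program as its data), consider the following minimal Python sketch. It is not tied to any of the surveyed environments; the toy source text and the renaming transformation are assumptions made for the example, and ast.unparse requires Python 3.9 or later.

```python
import ast

# A program treated as data: plain source text for another (toy) program.
source = """
def area(width, height):
    return width * height

result = area(3, 4)
"""

class RenameFunction(ast.NodeTransformer):
    """Systematically rename a function and every reference to it."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_FunctionDef(self, node):
        if node.name == self.old:
            node.name = self.new
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

tree = ast.parse(source)                                 # read the other program
tree = RenameFunction("area", "rect_area").visit(tree)   # transform it
ast.fix_missing_locations(tree)

print(ast.unparse(tree))                                 # generate the new source text

namespace = {}
exec(compile(tree, "<generated>", "exec"), namespace)    # run the transformed program
print(namespace["result"])                               # -> 12
```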

2.4 Software testing (validation and verification)

Software testing is one of the major targets of AI-driven methods; many methods were found [93–147]. Software testing is the process of executing a program or application with the intent of finding software defects or bugs. Software testing can also be defined as the process of Validating and Verifying (V&V) that a software program meets the business and technical requirements; "building the system right (verification) and building the right system (validation)" is a popular informal definition of V&V [4]. The testing process establishes confidence that a product will operate as expected in a specific situation, but it does not ensure that the product will work well in all conditions. Testing is more than just finding bugs; its purpose can also be quality assurance and reliability estimation. V&V is used to evaluate software's performance, security, usability, and robustness [148,149]. In software testing, the relationships and interactions between the software and its environment are simulated. In the next phase, selecting test scenarios, the correct test cases covering the complete source code are developed and deployed, and input sequences and execution paths are selected. This ensures that all modules of the software are adequately tested; however, one of the most interesting questions in testing remains: when to stop testing? After preparing and selecting test cases, they are executed and evaluated. Testers compare the outputs generated by the executed test cases with the expected outputs based on defined specifications. Testers perform quantitative measurement to determine the status of the process by tracking the number of faults or defects in the software [149]. Other forms of testing include field testing, graphical testing, simulation testing, and many other types. Software testing consumes a substantial amount of total software development resources and time: it is estimated that around 40%–50% of available resources and almost 50% of development time are invested in software testing [150]. AI methods, which are fueled by two parts (the data and the algorithm), are strong candidates for effective and intelligent optimization of software testing. Several works have been reported on applications of AI methods in software testing. One of the early proposals for using AI came from Nonnenmann and Eddy [151]


in 1992. Nonnenmann et al. developed KITSS, a Knowledge-Based Interactive Test Script System, to ease the difficulty and cost of testing at AT&T's Bell Laboratories. KITSS works on Private Branch Exchange (PBX) telephone switches using automated test code generation. KITSS was an automated testing system that focused on functional testing: it inspected the functionality of an application without considering its internal structures. In their study, the researchers restricted KITSS to the problem of generating code tests. The project methodology for designing features included writing test cases in English and then describing the details of the external design. This was a cumbersome process, and only about 5% of the test cases were written in automation languages. Automation also presented problems, including conversion issues (test case to test script). KITSS attempted to solve such problems. Firstly, KITSS had to convert English into formal logic. Secondly, it had to extend incomplete test cases. To solve these problems, the method used an NL processor supported by a hybrid domain model and a completeness and interaction analyzer. With these formulations, the KITSS prototype system could translate test cases into automated test scripts. Eventually, KITSS turned out to be a testing process that resulted in more automation and less maintenance. The KITSS architecture is shown in Fig. 10.11. During the mid-1990s, research on using AI techniques to improve software testing structures became more prominent. AI planners began generating test cases, providing initial states, and setting the goal as testing for correct system behavior. In Ref. [150], the Annealing Genetic Algorithm (AGA) and the Restricted Genetic Algorithm (RGA), two modified versions of the Genetic Algorithm (GA), were applied to testing. Sangeetha et al. [152] presented applications of GA in white-box testing: Control Flow Testing, GA-based Data Flow testing, and Particle


FIGURE 10.11 The KITSS architecture [151].


Swarm Optimization (PSO)–based testing. Sangeetha et al. also presented GA's application in black-box testing: GA-based Functional testing. Other researchers injected AI differently; Briand et al. [153], for example, proposed a methodology to refine black-box test specifications and test suites. Suri et al. [154] presented a literature survey of all possible applications of Ant Colony Optimization (ACO) in software testing. The application of ACO at different stages of the software testing process is shown in Fig. 10.12; the highest percentage, 57%, was for test data generation (TDG). On the other hand, there are certain limits on the use of AI in software testing, including the inability to replace manual testing, weaker detection of contextual faults, dependence on the quality of the tests, and restrictions on software development during testing. Despite the positives and negatives, research on using AI techniques in software testing has continued to flourish. Regression testing is a conventional part of the software testing process: regression tests are performed to retest functionalities of software that are deployed in new versions. This much-needed process can be costly, and its cost can be reduced using ACO [155]. The ACO algorithm takes inspiration from ants finding the shortest distance from their hive to a food source. As proposed by Kire [155], this optimization technique, when applied to regression testing, achieves the following five tasks: (1) generating path details, (2) eliminating redundant test cases, (3) generating pheromone tables, (4) selecting and prioritizing test cases, and (5) selecting the top paths with the least execution time. In this approach, selecting test cases from the test suites reduces testing effort. Furthermore, prioritizing the test cases using an appropriate optimization algorithm leads to better and more effective fault revealing. Thus, the approach results in lower execution cost, lower test design cost, and maximum code coverage.


FIGURE 10.12 Where AI has been applied the most within testing [154].
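As a minimal illustration of the ant colony idea applied to regression test selection and prioritization described above, the following Python sketch maintains a pheromone table over test cases and rewards orderings that reach full branch coverage cheaply. The test suite, coverage sets, runtimes, and parameter values are assumptions invented for the example, not data from the cited papers.

```python
import random

# Toy regression suite: each test case covers some code branches and has a runtime.
TESTS = {
    "t1": ({"b1", "b2"}, 3.0),
    "t2": ({"b2", "b3", "b4"}, 5.0),
    "t3": ({"b5"}, 1.0),
    "t4": ({"b1", "b4", "b5", "b6"}, 6.0),
    "t5": ({"b6"}, 2.0),
}

def order_quality(order):
    """Reward orderings that reach full branch coverage at a low cumulative cost."""
    all_branches = set().union(*(cov for cov, _ in TESTS.values()))
    covered, cost, cost_to_full = set(), 0.0, None
    for name in order:
        cov, t = TESTS[name]
        cost += t
        covered |= cov
        if cost_to_full is None and covered == all_branches:
            cost_to_full = cost
    return 1.0 / cost_to_full if cost_to_full else 0.0

def aco_prioritize(n_ants=10, n_iters=30, evaporation=0.1, seed=0):
    rng = random.Random(seed)
    pheromone = {name: 1.0 for name in TESTS}                       # pheromone table
    heuristic = {name: len(cov) / t for name, (cov, t) in TESTS.items()}
    best_order, best_q = None, -1.0
    for _ in range(n_iters):
        solutions = []
        for _ in range(n_ants):
            remaining, order = set(TESTS), []
            while remaining:                                        # each ant builds an ordering
                names = list(remaining)
                weights = [pheromone[n] * heuristic[n] for n in names]
                pick = rng.choices(names, weights=weights, k=1)[0]
                order.append(pick)
                remaining.remove(pick)
            solutions.append((order, order_quality(order)))
        for name in pheromone:                                      # evaporation
            pheromone[name] *= (1.0 - evaporation)
        for order, q in solutions:                                  # deposit on early, useful tests
            for rank, name in enumerate(order):
                pheromone[name] += q / (rank + 1)
            if q > best_q:
                best_order, best_q = order, q
    return best_order

print(aco_prioritize())   # e.g., a prioritized order such as ['t4', 't2', ...]
```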


Moreover, GAs and fuzzy logic have been used as potential tools in the selection of test cases during testing. It is very important to determine and select the right test cases, and automating this process can significantly increase testing precision and quality [150]. In one review paper, Wang Jun put forward the idea of test case prioritization using GAs [156]. Ioana et al. [157] generated testing paths using test data and a GA. GAs use mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Simulated annealing is a probabilistic technique for approximating the global optimum of a given function; specifically, it is a metaheuristic for approximating global optimization in a large search space. Li et al. [158] proposed using UML Statechart diagrams and ACO for TDG. The authors developed a tool that automatically converts a statechart diagram into a graph. The converted graph is a directed, dynamic graph in which the edges dynamically appear or disappear based on the evaluation of their guard conditions. The authors considered the problem of simultaneously dispatching a group of ants to cooperatively search a directed graph. The ants in this paradigm can sense the pheromone traces at the current vertex and its directly connected neighboring vertices and leave pheromone traces over the vertices. The authors concluded that with this ant colony optimization algorithm, a group of ants can effectively explore the graph and generate optimal test data to satisfy test coverage requirements. Furthermore, Saraph et al. [159] discussed the analysis of inputs and outputs (I/O) to aid with testing; they proposed that using ANNs for I/O analysis (by identifying important attributes and ranking them) can be effective. In their research [160], Xie tackled SE problems that can be solved with the synergy of AI and humans. The presented AI techniques include heuristic search used in test generation; traditional test generation relies on human intelligence to design test inputs with the aim of achieving high code coverage or fault detection capability. A Regression Tester Model, presented by Last et al. [161,162], introduced a fully automated black-box regression testing method using an Info Fuzzy Network (IFN). IFN is an approach developed for knowledge discovery and data mining; the interactions between the input and the target attributes are represented by an information-theoretic connectionist network. Shahamiri et al. developed an automated oracle that can generate test cases and execute and evaluate them based on a previous version of the software under test [163]. The structure of their method is shown in Fig. 10.13. As seen in Fig. 10.13, the Random Test Generator provides test case inputs by means of the specification of system inputs. These specifications contain information about the system inputs, such as data types and value domains. The test bed executes these inputs on Legacy Versions (previous versions of the software under test) and receives the system outputs. Next, these test cases are used to train the IFN model.


FIGURE 10.13 Info fuzzy network (IFN)–based black-box testing [163].
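A minimal sketch of the learned-oracle idea above, with a scikit-learn decision tree standing in for the IFN model; the two version functions, the input specification, and the seeded regression are invented purely for illustration.

```python
import random
from sklearn.tree import DecisionTreeClassifier

def legacy_version(x, y):
    """Previous (trusted) release: classifies a point against a threshold."""
    return int(x + y > 10)

def new_version(x, y):
    """Release under test, with an invented regression for small x."""
    if x < 2:
        return 0
    return int(x + y > 10)

rng = random.Random(1)

# Random Test Generator: draw inputs from the specification of system inputs
# (here, two numeric inputs in [0, 10]).
train_inputs = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(500)]
train_outputs = [legacy_version(x, y) for x, y in train_inputs]

# Train the surrogate oracle on legacy behavior (stand-in for the IFN model).
oracle = DecisionTreeClassifier().fit(train_inputs, train_outputs)

# Execute fresh test cases on the new version and compare with the oracle's prediction.
suspects = []
for case in [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]:
    actual = new_version(*case)
    expected = oracle.predict([case])[0]
    if actual != expected:            # "fault or not fault" decision
        suspects.append(case)

print(f"{len(suspects)} test cases flagged as potential regressions")
```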

Ye et al. [164] also used ANNs to approximate the I/O behavior. Xie et al. [165,166] proposed cooperative testing and analysis, including human–tool cooperation consisting of human-assisted computing and human-centric computing. In human-assisted computing [165], tools have the driver's seat and users provide guidance to the tools so that the tools perform the work better. In contrast, in human-centric computing [166], users have the driver's seat and tools provide guidance to the users so that the users can carry out the tasks. Ye et al. [167] also proposed ANNs as an automated oracle that automatically produces expected outputs and compares the actual outputs with the predicted ones. An Intelligent Search Agent for optimal test sequence generation and an Intelligent Test Case Optimization Agent for optimal test case generation have also been explored [168]. Structural testing is another form of testing that targets the internal structures or workings of the software. Fault-based testing tests software using test data designed to demonstrate the absence of a set of prespecified and frequently occurring faults. Existing software testing methods produce a great deal of information, including inputs and produced outputs, structural coverage, mutation scores, faults revealed, and more; however, such information is not linked to the functional aspects of the software. An ML-based approach introduced by Lenz et al. [169] links information derived from structural and fault-based testing to functional aspects of the program. This linked information is then used to ease testing activity by connecting test results to the application of different testing techniques. Khoshgoftaar et al. [170] suggested using Principal Component Analysis to reduce the number of software metrics and extract the most important ones. Additionally, Pendharkar [171] proposed a Software Defect Prediction Model Learning Problem (SDPMLP) in which a classification model selects appropriate relevant inputs, from the set of all available inputs, and learns the classification function. The problem is solved by finding the combination of a function f and a vector z such that f(z) has the best prediction accuracy. The solution to the SDPMLP comes from identifying all values of U, learning f


(z) for all values of U using the ML algorithm, and selecting the value(s) of U* that provide the best prediction accuracy. In this method, z is a vector of input attributes, U is a binary vector, and U* provides the optimal solution. Fig. 10.14 illustrates a general framework for solving the SDPMLP. AI planning has also been used in testing distributed systems as well as graphical UIs. GUI testing is a process used to test an application and check whether it is working well in terms of functional and nonfunctional requirements. The GUI testing process involves developing a set of tasks, executing these tasks, and comparing the actual results with the expected results. The technique includes detecting the application's reaction to mouse and keyboard events, and the reaction of components such as buttons, dialogues, menu bars, images, toolbars, and text fields to user input. Rauf [150] showed how Graphical User Interface (GUI) testing can derive benefits from the use of AI techniques, including automating test case generation so that tests are regenerated each time the GUI changes. Memon [99] proposed a method to perform GUI regression testing using Intelligent Planning (an AI paradigm); more generic processes can be deployed early in the process and not through the GUI. Although software testing is adopted as a distinct phase in the software development lifecycle, it can be performed at all stages of the development process. Batarseh et al. [172] introduced a novel method called Analytics-Driven Testing (ADT) to predict software failures in subsequent agile development sprints. The agile software development lifecycle is based on the concept of incremental development and iterative deliveries. Predictive testing is a type of testing that iteratively compares previous validation results with the corresponding results of the system. ADT predicts errors with a certain statistical confidence level by continuously measuring the Mean Time Between Failures for software components. ADT then uses a statistical forecasting regression model to estimate where and what types of software system failures are likely to occur; this is an example of a recent


FIGURE 10.14 A general framework for solving the SDPMLP [171].


method that injected Big Data Analytics into testing. Afzal et al. [173,174] also took inspiration from predictive testing and evaluated different techniques for predicting the number of faults, such as particle swarm optimization–based artificial neural networks (PSO-ANN), artificial immune recognition systems (AIRS), gene expression programming (GEP), and multiple regression (MR). AI research has found its way into every stage of the software testing process. Different methods and techniques are used to make the testing process intelligent and optimized. Effective software testing contributes to the delivery of reliable, quality-oriented software products and more satisfied users; software testing thus stands to benefit from better AI techniques. Shahamiri et al. [163] have presented a broad classification of AI and statistical methods that can be applied in different phases of automated testing. However, after evaluating all these methods, can AI answer the pressing and ongoing questions of software testing mentioned earlier? That is yet to be determined. The next section introduces AI paradigms applied to release and maintenance, the last phase in the SE lifecycle.

2.5 Software release and maintenance

Last, but not least, some AI paradigms have also been applied to the release and maintenance phase [175–184]. Software maintenance is defined by the IEEE as the modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment. This process generally begins with understanding what a user desires from the software product and what needs to be changed to fit what the customer wants. Feedback from users is essential for software engineers [185]. Reviews can be false positives or negatives, and it is important for developers to differentiate between true reviews and biased ones. This issue was approached by Jindal and Liu [186] through the method of shingling: they grouped neighboring words and compared groups of words taken from different documents to search for duplicate or near-duplicate reviews from different user IDs on the same product. Through their research, Jindal and Liu were able to classify spam reviews into three categories: untruthful opinions (i.e., false negative or positive reviews), brand-specific reviews (reviews of a manufacturer or brand, not the product), and nonreviews (reviews that express no opinion). This could be classified as a form of Text Mining. Following Jindal and Liu's research, Ott et al. [187] proposed approaches for fake review detection. They found that n-gram–based text categorization provided the best standalone detection approach. The authors of that paper created a model which used Naïve Bayes and


Support Vector Machines techniques with a fivefold cross-validation system to obtain an 86% accuracy. Following Ott et al., Sharma and Lin [188] searched for spam reviews with five criteria in mind. Firstly, they checked the product rating and the customer review for inconsistency: if one was clearly positive and the other clearly negative, the overall review is obviously untruthful. Secondly, they searched for reviews that were mostly about asking questions; they searched for yes–no questions, WH questions (why, what, where), and declarative questions, using the Brill tagger to tag such sentences. The model then searched for reviews written in all capitals, which generally indicates spam. Most recently, Catal et al. [185] have proposed a model using five classifiers and a majority-voting combination rule that can identify deceptive negative customer reviews. To be effective, fake review detection combines NLP and ML with a multiple-classifier system. Their research revealed that the strongest classifiers were libLinear, libSVM, SMO, RF, and J48. Catal and Guldan [185] trained their model using the dataset that Ott et al. [187] had used and determined the accuracy rating for each of the classifiers. Catal and Guldan's model involved running a review through all five classifiers and then using a majority voting system to determine whether the review was genuine or not. Their multiple-classifier model yielded an accuracy of 88.1%, a statistically significant increase over the aforementioned 86%. However, Catal and Guldan determined that the weaknesses of this model were the classification cost and the complexity of the model: it is time consuming and expensive to train five different classifiers (Fig. 10.15).


FIGURE 10.15 Spam detection review model [188].
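The following is a minimal sketch of the five-classifier, majority-voting scheme described above, using scikit-learn stand-ins (LinearSVC for libLinear, SVC for libSVM and SMO, RandomForestClassifier for RF, and DecisionTreeClassifier for J48) over a placeholder corpus; the reviews, labels, and parameter choices are assumptions for illustration rather than the setup of the cited studies.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder corpus of labeled reviews (1 = deceptive, 0 = genuine); a real study
# would use a dataset such as the one from Ott et al. [187].
reviews = ["best hotel ever, amazing, amazing, amazing",
           "room was clean and the staff were polite",
           "do not buy from this brand, the company is terrible",
           "battery lasts two days, as advertised"] * 25
labels = [1, 0, 1, 0] * 25

# Five base classifiers (rough stand-ins for libLinear, libSVM, SMO, RF, and J48),
# combined by hard majority voting over word n-gram features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("liblinear", LinearSVC()),
            ("libsvm", SVC(kernel="rbf")),
            ("smo", SVC(kernel="linear")),
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("j48", DecisionTreeClassifier()),
        ],
        voting="hard",
    ),
)

# Fivefold cross-validation, mirroring the evaluation protocol described above.
scores = cross_val_score(model, reviews, labels, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```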


In the 1980s (1984, to be specific), Charles Dyer [189] researched the use of expert systems in software maintainability. His paper focused on addressing the issue of planning for software maintainability by applying KBSs, or expert systems. The reason software maintainability can be addressed using expert systems is that maintainability can be viewed as a series of questions; depending on the answers, different support structures are chosen. Fig. 10.16 shows an example of a series of questions asked, with the resulting conclusions provided in the bottom half of the figure, and Fig. 10.17 shows an example of the execution of such a system. The advantage of using expert systems is that they offer a systematic method for capturing human judgment. In addition, rules can be added whenever necessary and desired by more experienced software engineers. Although they do not replace a human engineer, expert systems


FIGURE 10.16 Maintainability planning (as depicted from the reference) [189].

FIGURE 10.17 Execution of planning system (as depicted from the reference) [189].


can most definitely serve as effective aids and support systems for human judgment and decision-making. However, the disadvantages of this approach include that it may not be practical for sophisticated and complex systems, as such a system would require an ample number of rules. In addition, such an expert system leaves no room for learning: because everything is explicit, new rules are never added automatically, and old rules are never modified accordingly. All changes must be made manually by an engineer, which is a major drawback. Following Dyer [189], Lapa et al. [190] also investigated maintenance planning; however, their approach applied GAs to preventative maintenance planning. Their research had two goals: first, to present a novel methodology for preventative maintenance policy evaluation based on a cost-reliability model; second, to automatically optimize preventative maintenance policies based on the proposed evaluation methodology. The authors chose GAs because there are many parameters that need to be analyzed, and GAs are strong when searching for an optimum combination of options. The algorithm analyzes the following criteria (as a fitness function): the probability of needing a repair, the cost of such a repair, typical outage times, preventative maintenance costs, the impact of the maintenance on reliability as a whole, and the probability of imperfect maintenance. An objective function able to evaluate the genotype is developed; this function takes into account the models that were built previously to properly evaluate the constructed genotype. After maintenance, engineers implement additions that the user requested or make modifications to improve the software. Lethbridge and Singer [191] have shown that searching is a major task within software maintenance. Software engineers often have to search through the source code to find a certain function, or perhaps just to obtain a better understanding of the software they wish to maintain. Software engineers already use a multitude of search tools, like Unix "grep" or Source Navigator. Tools like these assist engineers in comprehending code as well as locating the parts of the software to be fixed. However, sometimes there is a disconnect between what the engineer is searching for and what is actually returned. Liu and Lethbridge [192] investigated this issue. The authors realized that the problem was that engineers often do not query exactly what they are looking for, which leads the search software to not return what the engineer wants. The authors proposed an intelligent search technique tailored to source code to assist in this endeavor. They constructed knowledge bases to


represent concepts. The knowledge base represents hierarchies of words; during the search, it is used to find synonyms, superconcepts, and subconcepts of the original query. The search process starts with many search candidates generated using several algorithms. This is done by first separating the query into words, splitting at common delimiters (e.g., spaces and underscores). Queries are then split at places where nonalphabetic characters are found before an uppercase letter. A list of related terms is also stored for later evaluation. The words are then compared against a dictionary of stop words (words to ignore); all stop words are removed and stored in a separate list. Search queries are then performed on all of these candidates, and the relevant candidates remain as results. Afterward, the results must be evaluated so that an order can be produced: functions can be created that take into account the similarity between a result and the original query string, and this value is then used to weight the result and produce a ranked list of search results. That method is shown in Fig. 10.18. Liu and Lethbridge determined that their proposal for intelligent search has been useful to software engineers in preliminary experiments. The issue they did find is that the response time of such a search method could be excessive at times. This is attributed to the search candidate generation: because so many candidates are generated, it takes the system a long time to analyze and query through all of them. Mancoridis et al. [193] developed a tool that assists in decomposing programs so that a software engineer might more easily understand a particular program. This is done by treating clustering as an optimization problem. By doing so, the authors were able to create software that can automatically decompose a program. The tool achieves this by clustering source-level modules and dependencies into subsystems and then mapping these modules and dependencies onto a module dependency graph. This is done by systematically navigating through a large search space of all the possible partitions. The optimization is done by trying to maximize the value of an objective function, which, when maximized, represents the optimal partition of the graph. However, this is


FIGURE 10.18 Intelligent search procedure for maintenance [192].


not efficient, so BUNCH (the tool presented in their paper) [193] uses more efficient search algorithms, accepting a possibly slightly suboptimal partition. However, issues include not knowing where to classify a subsystem because it exists in multiple parts of the design, or because it does not clearly reside in any particular part of the design. Sometimes BUNCH is incorrect when compared with the intuition of an experienced developer who is familiar with the code, but the overall evaluation of the tool is that it does a good job considering it has no knowledge of the software design. This kind of system decomposition tool is good for software maintenance because engineers trying to alter the software now have a method by which they can take apart the software and look at particular portions of it. They no longer have to waste time searching through parts of the code that do not explicitly affect the problem. The implementation of AI has reduced wasted time and has made the entire maintenance process more efficient. However, a drawback that most of the presented methods share is that they might not be practical, usable, or even feasible in many cases. Future work on AI-driven methods in software maintenance will have to address such issues to optimize SE overall. This and other issues are discussed in the following sections (3 and 4).

3. Summary of the review

This survey chapter reviewed AI methods applied within SE between the years 1975 and 2017. AI methods are identified based on the definition of AI subfields from a book that is highly cited in the field (Artificial Intelligence: A Modern Approach) [194]. Testing and requirements are the two phases with the highest number of contributions in the field; scientists and researchers seem to agree that these two phases lend themselves to AI more than the others. Neural networks, GAs, KBSs, ML, and NLP are among the AI paradigms applied the most. A complete list of all methods (used, reviewed, or mentioned in this chapter), their associated AI paradigms, and their years of publication is shown in Table 10.1. The method names are provided in the table to the "best of our knowledge": some papers in this list did not have an explicit name for their method, and for some others it was somewhat ambiguous which AI paradigm was used. Please contact the authors for a comprehensive index of further references, details, and discussions per method that were not included due to either the length of this chapter or relevance. As the length of Table 10.1 indicates, there are many AI-driven methods; to summarize, the following statistics are reported for the years 1975–2017: 46 major AI-driven methods were found for the requirements phase, 19 for design, 15 for development, 68 for testing,


Table 10.1 A complete list for all AI-driven methods for SE (ordered by year, 1975–2017). Columns: Name of AI-driven method (paper) | AI paradigm | SE phase | Year.

Man-computer interfaces Constructing a programming apprentice (Planner Project) PSL/PSA

Bayesian Nets Design 1975 Knowledge-Based Systems Development 1975

Informality in program specifications

Natural Language Processing Knowledge-Based Systems Development 1982 Knowledge-Based Systems Design 1984 Rule-Based System Design 1984

Programmer's apprentice (PA) Architecture for applying AI to design Creating a third-generation machine environment for computer-aided control engineering (CACE) Intelligent maintenance


Automation

Release and 1976 Maintenance Requirements 1977

Expert Systems

Classifying reusable modules Code allocation for maintenance

Classification Knowledge-Based Systems

KEE-connection Project planning and control AI applications to information systems analysis LaSSIE Automatic extraction of candidate requirements IASCE: a computer-aided software design tool Predicting development faults with ANN Risk assessment in software reuse KITSS-a knowledge-based interactive test script system A software reuse environment for program development Software project management SIR/REX (Systems Information Resource/ Requirements Extractor) Early detection of program modules in the maintenance phase Exploring the behavior of software quality models Dynamic TDG TCG A model for the design and implementation of software systems BUNCH

Knowledge-Based Knowledge-Based Knowledge-Based Knowledge-Based Automation

Test data generation Automatic reengineering of software


Systems Systems Systems Systems

Expert Systems

Release and 1984 Maintenance Development 1987 Release and 1987 Maintenance Design 1988 Requirements 1990 Design 1991 Development 1991 Requirements 1992 Requirements 1992

Neural Networks Testing 1992 Knowledge-Based Systems Requirements 1993 Knowledge-Based Systems Testing 1993 Knowledge-Based Systems Development 1994 Case-Based Reasoning Natural Language Processing Neural Networks Neural Networks Genetic Algorithms Planning and Heuristics Multiagent AI Implementations Clustering, Optimization Algorithms Genetic Algorithms Genetic Programming

Requirements 1994 Requirements 1994 Release and 1995 Maintenance Testing 1995 Testing Testing Design

1997 1997 1998

Release and 1999 Maintenance Testing 1999 Release and 2000 Maintenance


Automatic software test data generation Fault allocations in the code Software project effort estimation NL-OOPS: A requirements analysis tool

Genetic Algorithms Search Methods

Genetic Programming Natural Language Processing Comparison of artificial neural network and Neural Network regression models Software risk analysis Neural Networks Devising optimal integration test orders Genetic Algorithms Using automatic test case optimization for Genetic Algorithms NET components Predicting software development faults Neural Networks Software cost estimation Neural Networks Extracting test sequences from a Markov Ant Colony Optimization software usage model Automated test reduction Data analytics (Machine Learning) Test case generation and reduction Machine Learning Intelligent software testing Neural Networks /Data Mining Automatic software testing Automation Automated adaptations to dynamic software Autonomous Agents architectures Software test data generation Ant Colony Optimization Automated GUI regression testing Planning and Heuristics Black-box testing Fuzzy Networks Automated test data generation Data analytics (Machine Learning) Estimating the defect content after an Machine Learning inspection Test case generation and reduction Neural Networks An approach to test Oracle Neural Networks An approach for QoS-aware service Genetic Algorithms composition Adaptive fuzzy logicebased framework Fuzzy Logic Test sequence generation for state-based Ant Colony Optimization software testing An approach for integrated machine fault Data analytics (Machine diagnosis Learning) Automatic test case optimization Genetic Algorithms A dynamic optimization strategy for Genetic Algorithms evolutionary testing Stress testing real-time systems Genetic Algorithms Mutation-based testing Genetic Algorithm (GA), Bacteriological Algorithm (BA)

Testing 2001 Release and 2002 Maintenance Requirements 2002 Requirements 2002 Requirements 2002 Requirements 2002 Testing 2002 Testing 2002 Testing 2002 Requirements 2003 Testing 2003 Testing

2003

Testing Testing

2003 2003

Testing Design

2003 2004

Testing Testing Testing Testing

2004 2004 2004 2004

Testing

2004

Testing Testing Design

2004 2004 2005

Requirements 2005 Testing 2005 Testing

2005

Testing Testing

2005 2005

Testing Testing

2005 2005


Software reliability estimations using dynamic weighted combinational models

Neural-Networks

Design

Evolutionary Algorithm

Development 2006

Fuzzy Logic

Genetic Algorithms Genetic Algorithms Neural Networks Genetic Programming

Release and 2006 Maintenance Release and 2006 Maintenance Release and 2006 Maintenance Release and 2006 Maintenance Requirements 2006 Requirements 2006 Requirements 2006 Testing 2006

Neural Networks Planning

Testing Testing

Search Algorithms

Development 2007

Decision Support Machine Learning Planning and Heuristics Search-Based Refactoring

Release and Maintenance Testing Design Development

Fuzzy Clustering Genetic Algorithms

Requirements 2008 Requirements 2008

Genetic Algorithms Search-Based Techniques

Requirements 2008 Requirements 2008

Bayesian Nets

Testing

2008

Machine Learning

Testing

2008

Machine Learning

Testing

2008

Fuzzy Logic Genetic Algorithms Search Algorithms

2006

2006 2006

2007 2007 2008 2008

Multiagents Testing 2008 Neural Networks, CBR, AI Testing 2008 Planning Case-Based Reasoning Requirements 2009


The use of a fuzzy logic–based system in Fuzzy Logic cost-volume-profit analysis under uncertainty Software effort estimation Fuzzy Grey Relational Analysis Approach for critical path definition Fuzzy Logic Estimate the software development effort Fuzzy Neural Network R-TOOL-A natural language–based tool Natural Language Processing Generation of Pairwise Test Sets Artificial Bee Colony Software Testing Genetic Algorithms Software Testing Genetic Algorithms Automatically finding patches Genetic Programming Fault diagnosis and its application to rotating Neuro-Fuzzy machinery Web service composition in cloud computing Planning and Heuristics QoS-based dynamic web service composition Ant Colony Optimization Autonomic, self-organizing service-oriented Automation architecture in service ecosystem Analyzing what makes software design Context-Based Reasoning effective Software reliability modeling Genetic Programming Supporting quality-driven software design Ontologies and Planning/ Automation HACO algorithm for next release problem Hybrid Ant Colony (NRP) Optimization Algorithm (HACO) Optimized fuzzy logic–based framework Fuzzy Logic Predicting the development effort of short Fuzzy Logic scale programs Search-based methods for software Genetic Programming development effort estimation Statistical-based testing Automated Search Presenting a regression test selection Cluster Analysis technique Repairing GUI Test Suites Genetic Algorithms Proposals for a Software Defect Prediction Probabilistic Neural Model Learning Problem (SDPMLP) Network (PNN) Autonomic computing approach in service- Case-Based Reasoning oriented architecture QoS-aware automatic service composition Functional Clustering An automated approach for the detection Genetic Algorithm (GA) and correction of design defects and Genetic Programming (GP) Software release planning with dependent Ant Colony Optimization requirements

Requirements 2009 Requirements 2009 Requirements 2009 Requirements 2009 Requirements 2009 Testing

2009 2009 2009 2009 2009

Design Design Design

2010 2010 2010

Design

2010

Design Design

2010 2010

Release and 2010 Maintenance Requirements 2010 Requirements 2010 Requirements 2010 Testing Testing

2010 2010

Testing Testing

2010 2010

Design

2011

Design 2011 Development 2011 Release and 2011 Maintenance


Analogy-based software effort estimation Test case selection and prioritization GA-based test case prioritization technique Semisupervised K means (SSKM) Intelligent test oracle construction for reactive systems Automated choreographing of web services A review and classification of literature on SBSE Software effort prediction Automated analysis of textual use cases

Fuzzy Logic Ant Colony Optimization Genetic Algorithms Machine Learning Machine Learning

Requirements 2011 Testing 2011 Testing 2011 Testing 2011 Testing 2011

Planning and Heuristics Search-Based Optimization (SBO) Fuzzy Clustering, functional link Neural Networks Natural Language Processing Ant Colony Optimization Ant Colony Optimization

Design 2012 Development 2012

Generating test data for structural testing Research of path-oriented test data generation Automatic test data generation for software Evolutionary Algorithms path testing Automatic generation of software test data Hybrid Particle Swarm Optimization (PSO) Puzzle-based Automatic Testing environment Mutation/GE and Planning (PAT) Autonomous development process for code Artificial Agents generation Evolutionary Algorithm Using a multiobjective optimization-based approach for software refactoring Model refactoring Genetic Algorithms LMES: localized multiestimator model Classification and Clustering Framework for software development effort Neural Network and Fuzzy estimation Logic Automated software testing for application Artificial Bee Colony maintenance Automated TDG technique for objectGenetic Algorithms oriented software Linking software testing results Machine Learning NSGA-III Evolutionary Algorithm Defining a fitness function for the problem Genetic Algorithms of software refactoring Multiobjective for requirements Ant Colony Optimization selection INTELLIREQ Knowledge-Based Systems ANFIS model Neuro Fuzzy

Requirements 2012 Requirements 2012 Testing Testing

2012 2012

Testing

2012

Testing

2012

Testing

2012

Development 2013 Development 2013 Development 2013 Requirements 2013 Requirements 2013 Testing

2013

Testing

2013

Testing 2013 Development 2014 Development 2014 Requirements 2014 Requirements 2014 Requirements 2014


An approach for the integration and test order problem

Multiobjective Optimization Testing

2014

Evolutionary Algorithm

Requirements 2015

Neural Networks Neuro-Fuzzy

Requirements 2015 Requirements 2015

Data analytics (Machine Learning) Fuzzy analogy

Testing

Requirements 2016

Neural Networks

Requirements 2016

Neural Networks Neuro-Fuzzy Classification

Requirements 2016 Requirements 2016 Release and 2017 Maintenance Requirements 2017

Neuro-Fuzzy Natural Language Processing Context

2015

Requirements 2017 Development 2017

and 15 for release and maintenance. Many (if not most) AI paradigms have been applied to some phase of SE. Success is not limited to any particular paradigm, and no correlation was found between the SE phase and the AI paradigm used. Other insights were found, though; the next section provides such discussions, challenges the conventional wisdom in most of the papers in Table 10.1, and provides guidance on the path forward.

4. Insights, dilemmas, and the path forward

As established in this chapter, multiple papers have been published making the case for the use of AI methods in SE and providing examples of where AI could be used. Example applications included:
1. Disambiguating natural language requirements, through text mining, NL processing, and other possible "intelligent" means.
2. Deploying "intelligence" to the prioritization and management of requirements.


3. Using Data Analytics, ML, and ANNs to predict errors in software.
4. Mining through the system to eradicate any potential run-time issues.
5. Using CBR and Context to understand user requirements and ensure acceptance.
6. Using expert systems and KBSs to represent the knowledge of the user and the system logic.
7. Using Fuzzy Logic and AI Planning to predict the cost and time of software.
8. Searching through the code for issues using GA or ANN algorithms.
And many other ones covered in this review. Papers used for this review have been collected following these criteria:
1. Top-cited papers in the field (a search that took almost 1 week).
2. Papers by publisher; the following publishers have been tackled: IEEE, AAAI, Springer, ACM, and Elsevier.
3. Google Scholar searches for papers, by SE phase.
4. Our university library; all papers found there on AI methods in SE were included.
Additionally, the papers included are ones that follow these two rules:
1. They clearly belong to one of the phases of SE (as confirmed by the authors of the paper).
2. They directly use an AI method and are not just loosely connected to AI or knowledge engineering. For example, multiple papers were found that cover knowledge engineering and management or search-based algorithms, but those were excluded; we looked for papers that explicitly claimed a direct relationship to AI methods.
It is deemed necessary to provide an updated state-of-the-art of AI methods applied to SE, especially with the current hype around AI; that is one of the main motivations of this chapter (although we stopped at 2017, more recent years certainly have more papers as well; we will leave that as part of ongoing work). However, after closely evaluating the state-of-the-art, four outcomes are evident:
1. There is a general (and informal) consensus among researchers that AI methods are good candidates to be applied to SE. No paper was found that concluded that AI should be kept out of SE!
2. Most AI methods were applied to one phase of SE, hence the structure of Section 2 of this chapter. In some cases, however, AI methods were applied across multiple phases. That was loosely deployed, and


Papers used for this review were collected using the following criteria:

1. Top-cited papers in the field (a search that took almost one week).
2. Papers by publisher; the following publishers were covered: IEEE, AAAI, Springer, ACM, and Elsevier.
3. Google Scholar searches for papers, by SE phase.
4. Our university library; all papers found there on AI methods in SE were included.

Additionally, the papers included are ones that follow two rules:

1. They clearly belong to one of the phases of SE (as confirmed by the authors of the paper).
2. They directly use an AI method and are not merely loosely connected to AI or knowledge engineering. For example, multiple papers were found that cover knowledge engineering and management or search-based algorithms, but those were excluded; we looked only for papers that explicitly claimed a direct relationship to AI methods.

It is deemed necessary to provide an updated state of the art of AI methods applied to SE, especially given the current hype around AI; that is one of the main motivations of this chapter (although we stopped at 2017, more recent years certainly have more papers, and we leave those as ongoing work). After closely evaluating the state of the art, four outcomes are evident:

1. There is a general (and informal) consensus among researchers that AI methods are good candidates to be applied to SE. No paper was found that concluded that AI should be kept out of SE!
2. Most AI methods were applied to a single phase of SE, hence the structure of Section 2 of this chapter. In some cases, AI methods were applied across multiple phases, but that was loosely deployed, and most of the time it was not clear how such a method would be used in the real world. For the formal methods reviewed in the five subsections of Section 2, one can notice multiple successful methods that can improve the phase.
3. The "engineering" process of SE lends itself to AI and to being intelligent.
4. AI has been successfully used to build software systems and, as reviewed, across all phases. Contrary to that, when developing an intelligent agent, an AI system, or any other form of system that has intelligence, SE best practices are very rarely applied. Most of the time, an AI system is built with quick prototyping or rapid development, eliminating the conventional SE phases. In reality, AI systems are built incrementally, based on trial and error, on choosing "what works best," or on other informal models. AI paradigms are used in SE, but SE paradigms are not commonly used in AI.

The four outcomes listed above lead to a number of dilemmas. The first (challenging conventional wisdom): if more AI is applied to SE, and if the process of developing software becomes completely intelligent, would that not mean that software should start building itself? And is that not what Artificial General Intelligence (AGI) is? Most researchers and papers included in this review were chasing reduced development cost, fewer errors, and an improved process (among many other optimizations); however, none were found that aimed to inject "general" intelligence into the entire SE process. There is therefore an implicit assumption that the SE lifecycle will still be monitored by a human, which defeats the purpose of the activity (if automation aims to control the evaluation, planning, testing, and other phases of the lifecycle, the human role should be solely to monitor the process, and all empirical decisions should be taken by AI algorithms). This dilemma becomes part of the traditional questions about the "goodness" of AI, AI replacing human jobs, AI democratization, and other facets (scientific and philosophical) of the AI argument that are outside the context of this survey.


The second dilemma (one that challenges the status quo): if the SE process succeeds in becoming completely intelligent, as most of the papers "hoped" or aimed for, would the fields of SE and AI not morph into one field? That is a possibility, because most, if not all, AI systems are realized through software anyway. Robots without software cannot do anything, intelligent agents without software cannot do anything, and so on. Therefore, if the software that "generates" the intelligence is designed using AI, developed through AI, and tested with AI, then AI becomes the SE lifecycle itself, and the two fields merge into one (sharing the same challenges, prospects, and research communities). Does that mean the death of the conventional SE lifecycle? A dilemma worth investigating.

The third dilemma (less a dilemma and more a quandary) is the following: most, if not all, of the papers found claimed that the SE process "lends" itself to AI. It is valid that a science advances when it is studied in an interdisciplinary setting (at the intersection of two research fields). However, that claim rests on the notion that SE is merely a science. Most engineers, researchers, and practitioners agree that SE is both a science and an art. That "artistic" perspective on applying AI to SE is clearly missing from the literature, and it is very unclear whether AI would be able to address it.

The fourth dilemma that begs to be addressed is, as mentioned in the outcomes, that no paper was found recommending against using AI in SE. Most experiments presented in the papers reviewed were looking for (or comparing) the true positives (correctly identified results) and the false positives (incorrectly identified results) of applying AI to SE. That is a major shortcoming in the overall discussion: because true negatives (correctly rejected) and false negatives (incorrectly rejected) were not explicitly explored, there is still a large blind spot in this research area. It therefore remains unclear whether AI should be applied further to SE. There are many success stories, but how many failure stories exist? That is still to be determined. One thing that is not ambiguous, however, is that at this point SE needs AI much more than AI needs SE. Errors and bad practices can hinder the progress of AI, but they cannot halt it once intelligence is accomplished. Rarely has it been the case that intelligence is not accomplished due to a software error (with a few exceptions; for example, driverless cars in autopilot mode can fail in exactly this way). The goal of AI is achieving intelligence; the goal of SE is very different: building a valid, verified system, on schedule, within cost, and without maintenance or user-acceptance issues. Therefore, given the very different premises and promises of the two fields, and all the dilemmas that their overlap can cause, it is not straightforward to claim the success of, or even the need for, applying AI methods to SE. Before such a claim is made, the questions posed in this review need to be sufficiently and scientifically answered.
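To make the fourth dilemma concrete, the short sketch below computes the full confusion matrix for a hypothetical "AI for defect prediction" experiment. The predicted and actual labels are invented for illustration and do not come from any surveyed paper; the point is only that reporting true and false positives alone, as many of the reviewed experiments do, leaves the two "rejected" cells (and the metrics derived from them) invisible.

from collections import Counter

# Hypothetical experiment results: for each module, whether the model
# flagged it as defective (predicted) and whether it actually was (actual).
predicted = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
actual    = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0]

# Count all four cells of the confusion matrix, not just TP and FP.
cells = Counter(
    ("TP" if p and a else "FP" if p and not a else "FN" if a else "TN")
    for p, a in zip(predicted, actual)
)
tp, fp, fn, tn = cells["TP"], cells["FP"], cells["FN"], cells["TN"]

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # needs FN: invisible if only TP/FP are reported
specificity = tn / (tn + fp)   # needs TN: the other "rejected" cell

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} specificity={specificity:.2f}")

Nothing here depends on the particular AI technique under evaluation; the blind spot the text describes is simply the absence of the rejected cells from most of the reported evidence.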


