Data Analytic Literacy

The explosive growth in volume and varieties of data generated by the seemingly endless arrays of digital systems and applications ...


Table of contents :
Preface
Contents
About the Author
Part I: Literacy in the Digital World
Introduction
Chapter 1 Literacy and Numeracy as Foundations of Modern Societies
Chapter 2 Reframing the Idea of Digital Literacy
Part II: The Structure of Data Analytic Literacy
Introduction
Chapter 3 Data Analytic Literacy: The Knowledge Meta-Domain
Chapter 4 Knowledge of Data
Chapter 5 Knowledge of Methods
Chapter 6 Data Analytic Literacy: The Skills Meta-Domain
Chapter 7 Computing Skills
Chapter 8 Sensemaking Skills
Part III: The Transformational Value of Data Analytic Literacy
Introduction
Chapter 9 Right Here, Right Now: Data-Driven Decision-Making
Chapter 10 Going Beyond: Bridging Data Analytics and Creativity
List of Figures
List of Tables
Index



Andrew Banasiewicz

Data Analytic Literacy

ISBN 978-3-11-099975-4
e-ISBN (PDF) 978-3-11-100167-8
e-ISBN (EPUB) 978-3-11-100176-0
Library of Congress Control Number: 2023935707

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the internet at http://dnb.dnb.de.

© 2023 Andrew Banasiewicz
Cover image: Eugene Mymrin/Moment/Getty Images
Typesetting: Integra Software Services Pvt. Ltd.
Printing and binding: CPI books GmbH, Leck
www.degruyter.com

The ideas summarized in this book are dedicated to those always in my heart and in my mind: my wife Carol, my daughters Alana and Katrina, my son Adam, my mother-in-law Jenny, and the memory of my parents Andrzej and Barbara, my sister Beatka and my brother Damian, and my father-in-law Ernesto.

Preface

What is the scope and the structure of rudimentary data analytic competency? Answering that question is at the core of this book. The rapid growth and proliferation of means and modalities of utilizing data can sometimes blur the line between basic and more advanced data analytic knowledge and skills, and differences in perspectives can further muddy those waters. And yet it is an important question, because as much as it could be argued that, given the ubiquity of data, everyone needs basic data utilization abilities, it would be unreasonable, and in fact even counterproductive, to suggest that everyone needs to be able to build and validate complex, multi-attribute predictive models. Where, then, should we draw the line between basic and advanced data analytic capabilities? And what, exactly, should comprise the basic data analytic know-how?

In a sense it is easier to say what should not be a part of rudimentary data analytic capabilities than what should be a part of it. Still, given the ever-greater importance of data in commercial and even in personal affairs, clearly defining what constitutes data analytic literacy is as important at the onset of the Digital Age as the ideas of literacy and numeracy were at the onset of the Enlightenment and the subsequent Industrial Age. And yet, the idea of what it means to be data analytically literate is anything but clearly framed. To some it is simply an application of basic statistics, to others it is ‘data science light’, and to still others it is being able to interpret and utilize data analytic outcomes without necessarily being able to compute those outcomes. The perspective of this book is that, to some extent, it is a little bit of all of those perspectives, but all wrapped into a conceptually sound and practically meaningful process of systematically transforming the often messy and heterogeneous data into informative insights. As such, it has a defined boundary and clearly delineated components tied together by a rational, easy-to-follow process, with the ultimate goal being simplicity.

To that end, the ideas outlined here go beyond the non-informative narrative of data being mind-bogglingly voluminous and diverse, data analytic techniques being numerous and complex, and data processing and analysis tools coming in many different types and forms. Though all of that might indeed be true, continuing to focus on largely unexplained complexity does more to erect barriers to learning than to fire up the desire to become data analytically proficient. And given how the ceaseless march of digitalization and automation is changing the way work is done and lives are lived, and the resultant, ever-growing centrality of data to not only professional but also personal lives, becoming a fully engaged member of digital society demands the ability to consume data.

The implicit assumption of the data analytic process and supporting ideas outlined in this book is that analyses of data are a manifestation of a rational desire to either acquire new knowledge or to test the efficacy of prior beliefs. That is not a trivial assumption as it plays a key role in differentiating between informative ‘signal’


and non-informative ‘noise’, the two abstractly framed constituent parts of any dataset. Given that informative signal is any data element that increases one’s knowledge while non-informative noise is the leftover part that does not, the distinction of what is what may change, to some degree, with the scope of informational pursuits. Under those assumptions, being able to engage in meaningful analyses of data requires well-balanced abilities to understand structural and informational nuances of data, as well as technical and informational aspects of data analytic techniques, in addition to robust hands-on data manipulation and analysis skills. It also requires robust data analytic sensemaking skills, an often overlooked component of the larger data analytic competencies. In a sense, that is a particularly tricky part because it calls for high degrees of what could be characterized as data analytic situational awareness – being able to correctly interpret the often nuanced and just about always contingent results of data analyses. Data are imperfect, and when coupled with approximation-minded tools of exploration they paint an informational picture that is only approximately – under the best of circumstances – correct. Being data analytically situationally aware means having the requisite mental tools to assess the informational value of data analytic outcomes in a way that maximizes the informational soundness of takeaways.

On a more personal level, this book is a product of more than two decades of applied industry experience, coupled with about a decade and a half of academic teaching and research, and my view of what constitutes the elementary set of data analytic skills and competencies, framed here as data analytic literacy, is a product of lots of doing coupled with lots of reflecting on ways and means of transforming raw, messy data into valid and reliable insights. The structure of this book parallels that journey of discovery: The first couple of chapters (Part I) are meant to set the stage by taking a closer look at the general ideas of literacy and numeracy in the context of the still unfolding digital reality brought about by the proliferation of electronic transaction processing, communication, and tracking infrastructure. Part II, which is the heart of the book, details the logic and the contents of the idea of data analytic literacy, seen here as a metaconcept encapsulating a distinct combination of conceptual knowledge and hands-on skills that are needed to engage in basic analyses of data. The last two chapters, jointly comprising Part III of this book, are meant to take a closer look at data analytic literacy as an integral part of the ongoing socioeconomic evolution; more specifically, they are meant to examine the impact of robust data analytic competencies on how work is done and lives are lived today and take the proverbial stab at how both are likely to evolve going forward.

Andrew Banasiewicz

Contents

Preface
About the Author

Part I: Literacy in the Digital World

Chapter 1 Literacy and Numeracy as Foundations of Modern Societies
The Notions of Literacy and Numeracy
Quantitative Literacy
Data Literacy
Digital Literacy
Literacy in the Digital Age
On Being Digital Age Literate
The Many Faces of Digital Literacy
Reconceptualizing Digital Literacy
Knowledge vs. Skills

Chapter 2 Reframing the Idea of Digital Literacy
The Digital Revolution
Communication in the Digital Age
Understanding Data
Classifying Data
Toward a General Typology of Data
Supporting Considerations
The Metaconcept of Data Analytic Literacy
Framing Data Analytic Literacy

Part II: The Structure of Data Analytic Literacy

Chapter 3 Data Analytic Literacy: The Knowledge Meta-Domain
The Essence of Knowing
Analytic Reasoning
Human Learning
The Knowledge Meta-Domain
The Domain of Conceptual Knowledge

Chapter 4 Knowledge of Data
Data Types
What is Data?
Data as Recordings of Events and States
Encoding and Measurement
Organizational Schema
Data Origin
General Data Sources
Simplifying the Universe of Data: General Data Classification Typology

Chapter 5 Knowledge of Methods
Data Processing
Data Element Level: Review and Repair
Data Table Level: Modification and Enhancement
Data Repository Level: Consistency and Linkages
Data Analyses: The Genesis
Data Analyses: Direct Human Learning
Exploratory Analyses
Exploring Continuous Values
Exploring Associations
The Vagaries of Statistical Significance
Confirmatory Analyses
The Importance of Rightsizing
The Question of Representativeness

Chapter 6 Data Analytic Literacy: The Skills Meta-Domain
Skill Acquisition
Skill Appropriateness and Perpetuation
The Skills Meta-Domain

Chapter 7 Computing Skills
The Ecosystem of Data Tools
Comprehensive Data Analytic Tools
Limited-Purpose Data Analytic Tools
Structural Computing Skills
Some General Considerations
Accessing and Extracting Data
Transforming Raw Data to Analysis-Ready Datasets
Data Feature Engineering
Data Table Engineering
Data Consistency Engineering
Enriching Informational Contents of Data
Informational Computing Skills
Choosing Data Analytic Tools
Computing Data Analytic Outcomes

Chapter 8 Sensemaking Skills
Factual Skills: Descriptive Analytic Outcomes
Key Considerations
Data Storytelling
Visual Data Storytelling
Inferential Skills: Probabilistic Data Analytic Outcomes
Probabilistic Thinking
Mental Models
Probabilistic Inference
Interpreting Probabilistic Outcomes
Assessing the Efficacy of Probabilistic Results
Statistical Significance
Effect Size
Making Sense of Non-Numeric Outcomes
The Text Mining Paradox
Sensemaking Skills: A Recap

Part III: The Transformational Value of Data Analytic Literacy

Chapter 9 Right Here, Right Now: Data-Driven Decision-Making
Bolstering Intuitive Sensemaking
Creative Problem-Solving and Deductive vs. Inductive Inference
Casting a Wider Net: Evidence-Based Decision-Making
What is Evidence?
Transforming Data into Evidence
Mini Case Study 1: Baselining the Risk of Securities Litigation
Aggregate Frequency and Severity
Mini Case Study 2: Benchmarking the Risk of Securities Litigation
Industry Definition and Data Recency
Recent Cross-Industry SCA Trends
Benchmarking Company-Specific SCA Exposure

Chapter 10 Going Beyond: Bridging Data Analytics and Creativity
Sensemaking in the Age of Data
The Essence of Intelligence
Chipping Away . . .
Intelligence and Consciousness
Data and Creativity
Co-Joined Simulations
A Deeper Dive into Creativity
Breaking Through
Organic vs. Augmented
Transcending Reality with Data
Symbolic Data Analysis and Visualization
SDA and Big Data
Some Closing Thoughts
Data Analytic Literacy: A Synopsis

List of Figures
List of Tables
Index

About the Author Andrew Banasiewicz is the Koch Chair and Professor of Business Analytics at Cambridge College, where he also serves as the Founding Dean of the School of Business & Technology; he is also the founder and principal of Erudite Analytics, a consulting practice specializing in risk research and modeling. He is a former senior-level business analytics and data science industry professional with over two decades of risk management and insurance industry experience; his primary area of expertise is the development of custom, data-intensive analytic decision support systems. Since becoming a full-time academician, researcher and consultant, Andrew has been a frequent speaker at conferences around the world, speaking on data analytic innovation, evidence-based decision-making, and data-driven learning. He is the author of six other books, including Evidence-Based Decision-Making published in 2019, and Organizational Learning in the Age of Data published in 2021, over 40 journal and conference papers, and numerous industry white papers. He holds a Ph.D. in business from Louisiana State University; he is also a fellow of several academic and professional organizations. On a more personal note, Andrew is an avid outdoorsman, scuba diver, and a fitness enthusiast and finisher of multiple marathons and Ironman triathlons.


Part I: Literacy in the Digital World

Writing and numeral systems are the most foundational elements of organized societies; in fact, it is hard to think of another invention that has had as profound and prolonged an effect on the coalescing of what is commonly referred to as civilizations. Not surprisingly, the oldest major urban civilization, believed to have existed in Mesopotamia, the region in modern-day Iraq situated between the Tigris and Euphrates rivers, was also the birthplace of the first formal writing and numeral systems. The Sumerians, the people of that region whose civilization was a collection of independent city-states, developed the first writing system, known as cuneiform, around 3200 BCE; it was used for more than 3,000 years until it was abandoned in favor of alphabetic scripts around 100 BCE (the Sumerians also developed the first formal numeral system1 around 3400 BCE). The most widely used writing system today, the Latin alphabet, was developed from the Etruscan2 script at some point before 600 BCE, and the foundation of the modern numeral system – the set of ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 – originated in India around 600–700 CE (those symbols are often referred to as ‘Arabic numerals’ in the Western tradition because they were introduced to Europeans by Arab merchants; ironically, the Arabic mathematicians who popularized that numeral system referred to them as ‘Hindu numerals’).

And so throughout the history of organized societies, from the early Sumerians onward, full participation in the lives of societies has called for the ability to write, read, and use numbers; over time, the ability to write and read became known as literacy, and the ability to use numbers as numeracy. Today, the Digital Revolution that has been sweeping the world over the past few decades is effectively expanding that longstanding notion of literacy. Rapid and wide-scale migration to electronic interaction and communication modalities is rendering the ability to utilize those technologies as fundamental as the ability to read, write, and use numbers has been for centuries. The emerging concept of digital literacy is often used as an umbrella term encompassing one’s ability to use a growing array of digital technologies to find, evaluate, create, and communicate information. However, the general idea of digital literacy lacks definitional clarity and operational specificity; moreover, its common usage strongly emphasizes the ‘hard’ aspects of digital infrastructure by drawing attention to familiarity with technologies, while at the same time largely overlooking the ‘soft’ aspects, most notably, the ability to utilize data. And yet, it is hard to look past the fact that the ability to extract meaning out of torrents of data generated by the ubiquitous electronic transaction processing and communication infrastructure is a key element of what it means to be digitally literate,

1 They are also credited with development of the modern conception of time (i.e., dividing day and night into 12-hour periods, hours into 60 minutes, and minutes into 60 seconds), formal schooling, governmental bureaucracy, systems of weights and measures, and irrigation techniques, among their other inventions.
2 A somewhat enigmatic civilization that flourished during the Iron Age (roughly 1200 BC to 550 BC) in what is today central Italy.


thus those skills and competencies warrant an explicit and distinct designation, framed here as data analytic literacy. The goal of Part I of this book, comprised of chapters 1 and 2, is to lay a foundation on which the notion of data analytic literacy can be built. To that end, Chapter 1 offers a broad overview of the general notion of literacy as seen from the perspective of electronic data, and Chapter 2 re-examines the idea of digital literacy in the context of different types and sources of data.

Chapter 1 Literacy and Numeracy as Foundations of Modern Societies

Intuitively, the idea of literacy is straightforward: to be able to read and write. But implicitly, that deceptively simple notion assumes ‘manual’ reading and writing skills – printed text and the old-fashioned pen and paper. Similarly, the equally longstanding notion of numeracy, or the ability to use numbers, assumes a direct approach, meaning mental skills involving counting and similar operations. And to be sure, those traditional notions of literacy and numeracy are, in a manner of speaking, alive and well; after all, the first years of formal schooling are focused primarily on acquisition of those skills.

The emergence of the modern electronic transaction processing and communication infrastructure that began in the latter part of the 20th century and rapidly accelerated with the onset of the 21st century, coupled with the subsequent emergence of mobile computing and social media, is beginning to redefine what it means to be literate. The incessant digitization of commercial and interpersonal interactions through a massive shift toward electronic everything – commercial transactions, personal interactions, tracking and measurement, general communications, etc. – is effectively expanding the boundaries of literacy. That trend is clearly evidenced by a plethora of modern extensions or reinterpretations of the notion of literacy, with offshoots such as digital literacy, data literacy, or quantitative literacy now being widely used. Moreover, the term ‘numeracy’ has largely faded away and the term ‘literacy’ is now used to describe skills that seem to be more closely aligned with the ability to use numbers rather than letters. For instance, the notion of ‘quantitative literacy’ may seem nearly oxymoronic, given that taken, well, literally, the term ‘literacy’ connotes familiarity with letters while the term ‘quantitative’ clearly suggests numeric abilities.

The appropriateness of a particular label aside, the underlying idea of being able to access, use, and contribute to digitally encoded information is now rapidly emerging as a key skillset; in fact, it could be argued that the very participation in many personal and commercial aspects of modern societies is now predicated on those competencies. And yet, one of the core elements of that skillset, framed here as data analytic literacy, is yet to be formalized. A review of scientific, i.e., academic, and practitioner sources reveals a dizzying array of takes on what it means to be a competent user of modern electronic data resources, but those individual interpretations approach the task at hand from different perspectives, ultimately producing a collection of definitionally vague and largely non-operationalizable characterizations. Some see data analytic competencies as proficiency with basic statistics, others expand that scope to also include familiarity with machine learning algorithms, and some others cast an even wider net by equating data analytic competencies with general structured reasoning abilities; oddly, most see familiarity with data types and sources as being distinct from


the ability to extract meaning out of data. All told, there are numerous partial answers that do not add up to a single complete conclusion. It is the goal of this book to contribute to the much-needed definitional and operational clarity of what it means to be data analytically literate by offering a comprehensive framework tying together the now incomplete and/or disparate elements of that critical set of skills and competencies.

The focus of this chapter is twofold: First, it is to take a closer look at the longstanding notions of literacy and numeracy, in the context of the recent blurring of the line of demarcation between those two important ideas. Second, it is to explore the modern-day extensions of the notion of literacy, brought about by the ubiquitous digital communication and transactional systems, platforms, applications, and devices.

The Notions of Literacy and Numeracy

One of the pillars of modern society is the ability to read, write, and use numbers. Reading and writing skills are commonly characterized as literacy, a term derived from the Latin literatus, or knowledge of letters; the ability to count falls under the broad umbrella of numeracy, from the Latin numerus, or knowledge of numbers. Notionally, both ideas arose along with the early writing and numeral systems, some 5,000 years ago, but it was the more recent invention of the printing press in the 15th century, and the subsequent blossoming of the Scientific Revolution in the 16th and 17th centuries, that thrust the ability to read, write, and count into the spotlight. Nowadays, however, the label of literacy is used more expansively; in fact, its common usage not only encompasses the notionally distinct ideas of reading, writing, and using numbers – it also includes skills and competencies that stretch beyond those rudimentary skills. The notions of ‘quantitative literacy’, ‘computer literacy’, ‘digital literacy’, ‘data literacy’, ‘analytic literacy’, ‘statistical literacy’, and ‘informational literacy’ are all examples of the extended framing of the basic idea of literacy. Interestingly, the relatively recently introduced concept of multiliteracies extends the meaning of literacy even further, by positing that non-verbal forms of communication, most notably the ability to understand physical expressions of body language, sign language, and dance, as well as fixed symbolisms as exemplified by traffic signs, nautical and aeronautical symbols, and even corporate logos, should all be included within the scope of what it means to be literate.

What about being numerate? It seems that notion has largely fallen by the wayside; in fact, it looks as if the idea of numeracy has effectively been incorporated under the ever-expanding framing of literacy. Nowadays, the label of ‘literacy’ still communicates the ability to read and write, but the manner in which that term is currently used suggests that its original definitional scope now transcends its literal meaning. Consequently, it is now more appropriate to think of literacy as a term of art used to describe basic proficiency in any area that is related to literacy in its original sense – the ability to communicate using writing and reading – or, stated differently, as proficiency in a wide range of communication-related activities.


As of late, the so-expanded conception of literacy has been applied to skills and competencies relating to various aspects of the rapidly expanding digital communications infrastructure that is defining the now unfolding Digital Age. The enormous volumes and diverse varieties of data generated by the modern electronic transaction processing, tracking, and communication infrastructure direct the focal point of the idea of literacy toward the broadly defined ability to utilize data. Here, the (somewhat oxymoronic-sounding) notion of quantitative literacy, along with the related notions of digital literacy and data literacy, has come to the forefront of what it means to be literate in the Digital Age; moreover, the unrelenting march of digitization of more and more aspects of commercial and personal interactions can be expected to continue to highlight the importance of the ability to engage with those technologies, which necessarily encompasses the ability to use data. More specifically, the ability to extract meaning out of digitally encoded data is now as essential to fully engaging commercially, civically, and socially as reading and writing have been in the past (and, to be sure, still are today); yet, what it means to be ‘quantitatively’, ‘data’, and ‘digitally literate’ is still definitionally and operationally fuzzy.

Quantitative Literacy

What does it mean to be quantitatively proficient? On the surface, it is intuitively straightforward – simply put, it means familiarity with basic numerical methods. But what, exactly, does that mean? The scope of ‘quantitative methods’ is mind-bogglingly wide, as it encompasses basic mathematical operations (e.g., addition, exponentiation, etc.), statistical methods, core research design principles, and basic computational techniques – moreover, there are no universally accepted framings of what specific elements of knowledge should be considered ‘basic’ or ‘foundational’ within each of those domains. As noted by one researcher, ‘there is considerable disagreement among scholars and researchers regarding [quantitative literacy] terminology, definitions, and characteristics . . . terms such as quantitative reasoning and numeracy [are] used as synchronous concepts.’1 Another researcher2 points out that the definitional framing of quantitative literacy can vary across perspectives, with some emphasizing formal mathematical and statistical knowledge, others general problem-solving competencies, structured reasoning, and the ability to analyze data, and still others focusing more on broad cognitive abilities and habits of mind, inclusive of interpretive sensemaking. Formal definition-wise, the National Council on Education and the Disciplines defines quantitative literacy as ‘contextually appropriate decision-

1 Craver, K. W. (2014). Developing Quantitative Literacy Skills in History and the Social Sciences: A Web-Based Common Core Standards Approach. Rowman and Littlefield Publishers, 21.
2 Steen, L. A. (2000). Reading, Writing, and Numeracy. Liberal Education, 86, 26–37.


making and interpretation of data’, while other contributors see it as ‘sophisticated reasoning with elementary mathematics rather than elementary reasoning with sophisticated mathematics’,3 or as ‘. . . the ability and disposition to use [quantitative reasoning] (the application of mathematical and statistical methods to a given situation) in daily life’,4 or as ‘aggregation of skills, knowledge, beliefs, dispositions, habits of mind, communication capabilities, and problem solving skills that people need to autonomously engage in and effectively manage situations in life and at work that involve numbers, quantitative or quantifiable information, or textual information that is based on or has embedded in it some mathematical elements’.5

However, it is not just the definitional variability that is troubling. For instance, one of the above-noted conceptualizations frames quantitative literacy as ‘aggregation of skills, knowledge, beliefs, dispositions, habits of mind, communication capabilities, and problem-solving skills’ – is such an open-ended, multifaceted definition operationalizable, or even informative? Setting aside the obvious redundancy of ‘skills’ and ‘problem solving skills’, a definition that is so all-encompassing invites numerous questions such as: knowledge of what? what skills? what beliefs? It is also worth asking if it is appropriate, from the standpoint of concept validity, for a single notion to encompass such a wide array of diverse elements. And lastly, why are ‘communication capabilities’ necessary to the attainment of quantitative literacy? How difficult would it be to find exceptionally quantitatively competent individuals with poor communication skills?

It is hard to avoid the conclusion that when looked at from the definitional perspective, the notion of quantitative literacy manifests itself as a collection of heterogeneous, at times even conflicting conceptualizations; not only that, but the resultant definitional fogginess also effectively bars meaningful operationalizations of that important idea. To that end, the definitional vagueness-precipitated operational difficulty is a two-tiered problem: Firstly, the competing definitions emphasize largely incommensurate combinations of distinct elements of numerical knowledge, the ability to reason analytically and to manipulate data and identify data patterns, and the vaguely framed cognitive abilities, habits of mind, and communication competencies. Secondly, each of those component parts itself encompasses a large and diverse body of knowledge which in turn implies wide ranges of degrees of competency, hence a mere reference to, for instance, mathematical and statistical knowledge offers limited informational value. In other words, if one were aiming to become quantitatively literate, what specific mathematical and statistical knowledge would s/he need to master?

3 Grawe, N. (2011). Beyond Math Skills: Measuring Quantitative Reading in Context. New Directions for Institutional Research, 149, 41–52.
4 Tunstall, L. S., Matz, R. L., and Craig, J. C. (2016). Quantitative Literacy Courses as a Space for Fusing Literacies. The Journal of General Education, 65(3–4), 178–194.
5 Gal, I. (1997). Numeracy: Imperatives of a Forgotten Goal. In I.A. Steen (ed.), Why Numbers Count: Quantitative Literacy for Tomorrow’s America, 36–44, The College Board.


Another shortcoming of the current conception of the notion of quantitative literacy is more nuanced, as it stems from the implied meaning of that notion, which suggests quantitative data. And indeed, that is the ‘classic’ image of data as rows and columns of numbers, but what is now broadly known as structured numeric data is but one of three general types of data, with the other two being (typically unstructured) text and image data. Thus the standard framing of quantitative literacy does not account for the ability to tap into the informational content of non-numeric data, which is a considerable limitation as by some estimates unstructured, non-numeric data account for as much as 80% to 90% of data captured today. In that sense, being quantitatively literate, in the traditional sense, is tantamount to being able to utilize no more than about 20% of data available nowadays.

While the multiplicity of rich and varied perspectives played an essential role in the forming of the comparatively new idea of being able to use data, it is now important to begin to take steps toward bringing about the much-needed definitional sharpness by scoping that important idea in a manner that is more reflective of the informational reality of the Digital Age. Quantitative competency is of foundational importance to being able to fully partake in the informational riches of the postindustrial society, and thus the currently unsettled state of definitional and operational framing of that critical skillset is not only undesirable – it is, in fact, unsustainable. However, as evidenced by the preceding analysis, the longstanding notion of quantitative literacy lacks the much-needed definitional and operational clarity, which suggests the need to reframe its key tenets, seen here as structured mathematical reasoning. Moreover, those general capabilities need to be further framed in the context of what can be considered, in a manner of speaking, the universal unit of meaning in the Digital Age, which unsurprisingly is data.
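To make the three general types of data mentioned above concrete, the short Python sketch below is a purely illustrative toy example (not drawn from the book; all variable names are hypothetical) showing the same commercial event captured as structured numeric data, as unstructured text, and as image-style pixel data:

    # Illustrative toy example: one customer interaction captured as the three
    # general types of data discussed above.

    # 1. Structured numeric data: fixed, named fields arranged like a table row.
    structured_record = {"customer_id": 1024, "purchase_amount": 39.99, "item_count": 3}

    # 2. Unstructured text data: free-form language describing the same event.
    text_record = "Checkout was quick, but the package arrived two days late."

    # 3. Image data: a tiny 2x2 grayscale 'image' stored as pixel intensities (0-255).
    image_record = [[12, 200],
                    [87, 255]]

    for label, value in [("structured", structured_record),
                         ("text", text_record),
                         ("image", image_record)]:
        print(label, "->", value)

Only the first of the three lends itself to immediate arithmetic summarization; the other two require additional processing before they yield comparable insights, which is precisely why equating quantitative literacy with numeric data leaves most of today’s data out of reach.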

Data Literacy

Once again, it is important to start with the basic question: What does it mean to be data literate? A commonly held conception is that it is a reflection of the ability to read, comprehend, and communicate data as information. And once again, the framing of what constitutes ‘data’ is often implied to be the neatly organized (i.e., structured) arrays of mostly numeric values, which is not surprising given that structured data have traditionally been the focus of data analytics (because those data lend themselves to comparatively easy computer processing and analyses – more on that later). Interestingly, the broad characterization of modern-day data generating infrastructure as ‘digital’6 also plays a role in reinforcing the almost instinctive association

6 The terms digital and digit are both derived from the Latin word for finger, as fingers are often used for counting.


of the idea of ‘data’ with numerically encoded details, because of the expansive meaning of the term ‘digital’. The Oxford dictionary defines digital as something that is ‘expressed as series of the digits 0 and 1, typically represented by values of a physical quantity . . .’; taken literally, that definition suggests a measurable quantity expressed as a series of digits, which in turn implies numeric values. That, however, represents an overly restrictive interpretation of the meaning of the term digital, as manifestly non-numeric values, such as letters, can also be represented as series of discrete digits; for instance, in the standard ASCII encoding the letter ‘B’ is encoded as the string ‘01000010’. Unfortunately, that more complete framing of ‘digital data’ is not always clearly understood.

Recognizing the typological diversity of data has important practical implications. As noted earlier, structured numeric data represent just one type of data, and that type also has an ever-shrinking share of all data generated nowadays. And so, to the extent that data can be a source of informative insights, it is important to at least be cognizant of different types of data and their potential informational utility. Looking past that overly restrictive conception of data, the notion of data literacy also confounds familiarity with data sources, types, and structures with the ability to transform data into information, which entails knowledge of statistical or algorithmic (machine learning) methods, along with software tools that can be used to analyze data. A case in point: An administrator of a particular database would generally be expected to have a robust knowledge of data contained therein, but not necessarily of data analytic tools and techniques required to transform those data into meaningful information. In many organizational settings, familiarity with data and ability to analyze data are functionally distinct, and while it is certainly possible to be both data- and analytics-savvy, that typically requires a very particular type of background.

Still, the idea of data literacy warrants attention because, after all, data are the focal point of data analytics. However, to be epistemologically sound and practically meaningful, it needs clearly defined boundaries that show its distinctiveness from other Digital Age interpretations of the foundational notion of literacy. To that end, to be conceptually and operationally meaningful, the idea of data literacy needs to be rooted in a comprehensive and complete typology of data types and data sources. Stated differently, without the foundation of a clear and complete description of any and all data in terms of different types and origins, it is simply not meaningful to contemplate data literacy. And to be sure, as evidenced by the chorus of concurring opinions of practitioners and academicians, data variety can be overwhelming; after all, it is one of the defining characteristics of what is commonly referred to as ‘big data.’
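To make the encoding point above concrete, the following short Python sketch (an illustrative example, not taken from the book) shows that a number and a letter alike reduce to strings of binary digits; it assumes the standard ASCII/UTF-8 character encoding:

    # Illustrative example: both numeric and non-numeric values are stored as bits.
    number = 66
    letter = "B"

    # The integer 66 written out as an 8-bit binary string.
    print(format(number, "08b"))          # 01000010

    # The letter 'B' encoded as a byte under UTF-8 (identical to ASCII here),
    # then written out bit by bit - the same pattern as the number 66.
    for byte in letter.encode("utf-8"):
        print(format(byte, "08b"))        # 01000010

The same byte-level representation underlies text, images, and audio alike, which is the sense in which manifestly non-numeric content is every bit as ‘digital’ as numeric data.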

Digital Literacy

Digital technologies continue to reshape how lives are lived and how work is done, and the impact of the ongoing maturation and proliferation of those systems is not only profound – it is also multifaceted. And while advanced computer technologies


offer life-changing functionalities, being able to fully utilize those systems’ capabilities calls for specific skills. The notion of digital literacy emerged in response to the need to clearly frame and communicate those skills, which is particularly important in educational contexts to identify specific skills and competencies that are needed to enable one to take full advantage of the various marvels of the modern digital infrastructure. The concept of digital literacy traces its roots to the 1980s emergence and subsequent rapid proliferation of personal computing, further bolstered by the birth of the World Wide Web in the early 1990s. While numerous definitions have been put forth, one of the more frequently cited is one developed by the American Library Association (ALA), which frames digital literacy as ‘the ability to use information and communication technologies to find, evaluate, create, and communicate information [emphasis added], requiring both cognitive and technical skills’.7 Implied in that framing are three distinct building blocks: understanding of data structures and meaning, familiarity with data analytic methods and tools, and the ability to use information technologies. While not expressly recognized as such in the ALA’s definition, those three distinct ‘building blocks’ can themselves be framed as more narrowly defined literacies: data literacy (understanding of data structures and meaning), analytic literacy (familiarity with data analytic methods and tools), and technological literacy (the ability to use information technologies). In view of that, digital literacy can be seen as a metaconcept, a presumptive, higher-order (i.e., more aggregate and/or abstract) notion that supports rational explanations of complex phenomena, and the three more narrowly defined literacies – data, analytic, and technological – can be seen as its component parts, as graphically summarized in Figure 1.1.

Figure 1.1: The Metaconcept of Digital Literacy. [Diagram: digital literacy shown as a metaconcept comprising analytic literacy, data literacy, and technological literacy.]

7 https://literacy.ala.org/digital-literacy/.


Clearly, the commonly held conception of digital literacy, as illustrated by ALA’s framing, lumps together the earlier discussed notions of quantitative literacy (characterized here as analytic literacy) and data literacy with the idea of technological proficiency. If digital literacy is indeed meant to be a metaconcept or a meta-domain, doing so might seem to make good sense, but unless care is taken to thoughtfully delineate the scope and content of each of the three sub-domains, doing so can ultimately cast doubt on the discriminant validity of different variants of Digital Age-related literacies. For instance, the earlier outlined data literacy is commonly framed as the ability to read, comprehend, and communicate data as information, which as noted earlier confounds familiarity with data sources and types with the ability to transform data into information, which entails knowledge of data analytic methods and tools, and thus is at the heart of data analytic literacy.

It is also important to expressly address the distinctiveness of technological competence, the third element comprising the metaconcept of digital literacy. As defined here, technological literacy entails functional understanding of man-made systems designed to facilitate communication- and transaction-oriented interactions, including the manner in which those systems capture, store, and synthesize data. In a more expansive sense, it also entails understanding of planned and unplanned consequences of technology, and the interrelationships between technology and individuals and groups. Joining together the scopes and the contents of analytic, data, and technological literacies, digital literacy can be framed as a set of technical, procedural, and cognitive skills and competencies that manifest themselves in broadly defined info-technological competence, familiarity with general sources and types of data generated by the modern transactional and informational electronic infrastructure, and the ability to extract meaning out of those data. And while it might be difficult to dispute the desirability of such a broad and diverse knowledge base and skillset, it is nonetheless hard to avoid seeing it as the Digital Age’s version of a Renaissance person.8 How attainable, to more than the select few, is such a goal?

There is yet another aspect of digital literacy that warrants a closer look. As popularly used and understood, that notion tends to confound data- and technology-related competencies with broad demographic characterizations, perhaps best exemplified by the label of digital nativity. The idea of digital nativity recently gained widespread usage, particularly among educators looking to explain, rationalize, and possibly leverage natural technology leanings of persons born into the modern technology-intensive society. While well-intended and not entirely unfounded, the notion of digital nativity nonetheless confuses technological habits, which in some instances could even be characterized as addiction to a particular aspect of information technology, such as social media, with the broader and more balanced (in terms of types of information technologies) view of technological literacy. It is hard to deny that electronic communication

8 For clarity, the term ‘Renaissance man’ is commonly used to describe a person of wide interests who is also an expert in several different areas.


tools and applications seem more natural to those born into a world already saturated with such technologies, but that sense of natural acceptance should not be confused with the idea of meaningful understanding of the full extent of informational utility of those tools. Even more importantly, it should not be taken to imply a comparable level of familiarity with other information technology tools, such as those used in transaction processing (e.g., the ubiquitous electronic bar and QR code scanners). All considered, while the idea of digital nativity has the potential to offer informative insights in some contexts, using it as a proxy for broad and balanced digital literacy does not seem warranted. It is also worth noting that while there is ample evidence supporting the idea that digital natives are more instinctive users of digital communication technologies, there is no clear evidence pointing to a correspondingly deeper immersion with those technologies, most notably in the sense of developing familiarity with data captured by those devices. To that end, there is even less evidence suggesting heightened understanding or appreciation of the potential informational value of data generated by digital systems. In fact, there is evidence supporting the opposite conclusion, namely that being a digital native can lead to blissful disregard of the ‘how it works’ and ‘what is the value of it beyond what meets the eye’ aspects of familiarity. In that sense, it could be argued that the more user-friendly a given technology, the more hidden its augmented utility. Out of sight, out of mind.

Literacy in the Digital Age

One of the more curious aspects of the numerous commentaries on the importance of data in the modern, data-driven society is the relatively scant attention given to the most voluminous type of data – text (considered here in a somewhat narrow context of written language9). Accounting for some 80% to 90% of all data generated today, text data hold a great deal of informational potential, but realizing that potential is fraught with analytic difficulties. Digital devices, including computers, ‘understand’ numbers because numbers can be manipulated and interpreted using explicit, structured logic; the same does not apply to text, where informational content combines explicit elements, e.g., words, as well as nuanced implicit elements, typically in the form of implied sentiment. While words can have multiple meanings, which can lead to analytic (in the sense of machine processing) ambiguities, it is really the implied sentiment that is difficult to capture algorithmically, primarily because it can vary significantly across situations, contexts, and intended meanings. Hence while it is relatively straightforward to program computers to analyze numeric data, text-oriented data analytic

9 From the perspective of computer processing, any expression that includes more than just numeric values can be considered text (more on that later).


programming is significantly more challenging, and, needless to say, manually processing the almost unfathomably vast volumes of text data is simply not viable. All in all, tapping into the informational potential of the much-talked-about ‘big data’, much of which is text, demands the use of automated systems, which puts yet another twist on the longstanding notion of literacy, as in the ability to read and write. For centuries, being ‘literate’ meant being able to directly understand and use written language; to be sure, that skill is still important – however, the rapid proliferation and the growing importance of torrents of digital ‘writings’ now adds another dimension to that skill, in the form of the ability to use available tools and techniques to extract meaning out of the ever-greater torrents of digital text data.
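The contrast drawn above between analyzing numbers and analyzing text can be illustrated with a small Python sketch (a toy example of the general point, not code from the book): summarizing numeric values is a single unambiguous operation, whereas even a simple word-counting approach to text sentiment stumbles over something as basic as negation.

    # Numeric data: explicit, structured logic yields an unambiguous summary.
    readings = [4.2, 5.1, 3.8, 4.9]
    print(sum(readings) / len(readings))            # the mean, full stop

    # Text data: a naive word-count 'sentiment' score misses implied meaning.
    positive_words = {"good", "great", "helpful"}
    negative_words = {"bad", "poor", "slow"}

    def naive_sentiment(text):
        words = text.lower().replace(".", "").split()
        return sum(w in positive_words for w in words) - sum(w in negative_words for w in words)

    print(naive_sentiment("The service was good."))      # +1, as expected
    print(naive_sentiment("The service was not good."))  # still +1: negation is lost

Capturing the effect of ‘not’, sarcasm, or broader context is exactly the kind of implicit meaning that makes automated text analysis so much harder than numeric analysis.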

On Being Digital Age Literate

While (hopefully) informative, the preceding overview of quantitative, data, and digital literacies leaves a core question unanswered: What does it mean to be literate in the Digital Age? It is a vexing question. To start with, when considered from the perspective of the foundational ability to communicate using written language, there is a tendency to think of literacy as a binary state, meaning someone can be either literate or illiterate. Such a reductive way of framing that particular learned ability to perform a specific task seems warranted by the clarity and relative specificity of reading and writing skills. That, however, is not the case with the Digital Age literacies, as each of those distinct sets of capabilities is itself a composite of numerous, complementary-but-distinct areas of competency. Moreover, it is intuitively obvious that, for instance, quantitative skills or familiarity with data types and sources can vary, at times dramatically, across individuals, which suggests that those competencies should be looked at from the perspective of degree of mastery. While on the one hand convincing, that line of reasoning glosses over the distinction between basic and more advanced skills. Going back to the foundational notions of literacy and numeracy – the basic abilities to read, write, and use numbers that comprise those descriptive notions do not contemplate, for instance, the ability to understand complex scientific writing or knowledge of calculus; similarly, the scope of what it means to be digital, data, or quantitatively literate should be limited to just the foundational skills that comprise each of those competencies.

Doing so, however, requires reducing each of the multiple manifestations of the contemporary notion of literacy to a unique set of skills that define each of those capabilities in a way that captures its distinctiveness. Given the overtly overlapping framings of digital, quantitative, data and other contemporary forms of literacy, that will require re-examining those notions from the perspective of construct and discriminant validity. In simple terms, construct validity concerns the soundness of the conceptual characterization of a particular latent (i.e., not directly observable) notion that is intended to communicate a particular meaning. The earlier overview of the concept of ‘quantitative


literacy’ offers a good backdrop here: The lack of a singular, clear, operationalizable framing of that idea is suggestive of a lack of construct validity; in order for the idea of quantitative literacy to exhibit the necessary construct validity, and thus be considered conceptually sound, it would need to be reframed in a way that unambiguously communicates a clearly defined set of skills and competencies (as noted earlier, it does not). The related requirement of discriminant validity calls for a particular notion, like quantitative literacy, to exhibit clear distinctiveness from other literacy-related notions. Simply put, to be meaningful, a concept needs to capture something that is not being captured by other concepts. And again, the overlapping descriptions of different contemporary applications of the general idea of literacy demonstrate the lack of discriminant validity. What then does it mean to be digitally literate? And is the idea of being digitally literate redundant with other expressions of modern-day literacy?

The Many Faces of Digital Literacy

The numerous embodiments of the general idea of literacy, such as digital, quantitative, analytic, data, and informational literacy, are all intended to capture the broadly defined ability to utilize the various tools and applications that comprise the modern digital infrastructure. Some, such as digital literacy, can be seen as multifaceted umbrella terms, while others, such as data literacy, can be seen as component parts of more general skillsets. Taken together, that somewhat haphazard collection of related and often overlapping descriptors offers a richness of different perspectives, but at the same time falls short of painting a clear, informative picture of the essence of what it means to be digitally literate. Meaningful reframing is needed.

Firstly, there is a need to expressly differentiate between the two distinctly different implications of the idea of ‘digital’: the ability to use digital technologies, and the ability to utilize digitally encoded information. Starting with the former, digital technologies are electronic devices and systems that utilize binary codes, 1s and 0s, also known as bits, to encode, store, and transfer information; they represent an improvement over older, analog technologies, which used continuous electric signals that were subject to decay and interference (and were much harder to store and preserve). In that sense, being ‘digital’ does not necessarily mean being different from analog in terms of outward appearance or even functionality – to an ordinary listener, digital music likely sounds the same as analog music. The point here is that the ability to use digital technologies is, in principle, in no particular way different than the ability to use analog technologies, both of which require learning how to interact with or operate a particular device or system; in that sense, digital technologies may not require a specifically ‘digital’ knowledge to operate. Anyone who has a basic understanding of time can


use either a digital or an analog watch, but lacking the underlying understanding of units of time would render both watches equally useless.10 That is not at all the case with digitally encoded information, or data. Recording and storing details of transactions, communications, and other interactions and states digitally is tantamount to encoding meaning using varying arrays of binary (1s and 0s) values, which in turn means that accessing, manipulating, and, ultimately, utilizing the resultant data requires specific knowledge and skills. Hence in contrast to reading digital content (e.g., magazine articles posted online), which generally requires little more than just basic reading skills, extracting meaning out of digitally encoded information, or data, calls for an array of specific skills needed to access and manipulate data. It thus follows that data analytic competencies, or data analytic literacy, are at the heart of what it means to be digitally literate. Interestingly, that conclusion is consistent with the earlier noted ALA’s framing of digital literacy, which the (American Library) Association sees as ‘the ability to use information and communication technologies to find, evaluate, create, and communicate information, requiring both cognitive and technical skills’. Focusing the notion of ‘information and communication technologies’ on means of accessing information of interest, and framing the somewhat vague notion of ‘cognitive skills’ by expressly relating those abilities to the basic tenets of quantitative literacy, described earlier as competency with structured mathematical reasoning, suggests that the essence of being technologically, quantitatively, analytically, and data literate can be captured in a single, multidimensional summary construct of data analytic literacy.

Reconceptualizing Digital Literacy The importance of broadly defined digital informational proficiency in the Digital Age cannot be overstated: Steady onslaught of automation technologies is producing everexpanding volumes and varieties of data, and digitally encoded information is rapidly becoming the standard mean of communication; consequently, the ability to tap into the informational content of data is now an essential competency. The currently underway Digital Revolution is ushering in a new era of information exchange and learning, which calls for broadening of the longstanding notion of literacy to accommodate the ability to access and use digital information. To date (2023), numerous conceptualizations have been proposed geared toward addressing the somewhat different aspects of broadly defined digital/data/analytic skillset, but as surmised in the preceding overview, those differently framed expressions of modern-day literacy  This rather rudimentary distinction does not explicitly address natively digital technologies, such as the various social media platforms, but the general argument extends into those technologies as well. Any new technology, whether it is digital, analog, or even mechanical (i.e., not requiring electricity to operate) has a learning curve.


exhibit a persistent lack of construct and discriminant validity. As a result, the current state of knowledge of digital/data/analytic competencies can be characterized as a confusing, inconsistent, and incomplete hotchpotch of ideas that neither singularly nor collectively offer a conceptually clear and operationalizable answer to the fundamental question: What specific skills and competencies are necessary to be able to access, utilize, and communicate digitally? Figure 1.2 offers a concise summary of some of the more frequently mentioned literacy ‘extensions.’

Figure 1.2: The Many Faces of Modern-Day Literacy. [Diagram: the ability to access, utilize, and communicate digital information at the center, surrounded by digital literacy, analytic literacy, quantitative literacy, data literacy, information literacy, statistical literacy, and computer literacy.]

Of the several modern embodiments of the general idea of literacy depicted above, the notion of digital literacy continues to attract the strongest attention, perhaps because, name-wise, it is nearly synonymous with the Digital Age. Discussed in more detail earlier, its most widely accepted framing, one offered by the American Library Association, defines it as ‘the ability to use information and communication technologies to find, evaluate, create, and communicate information [emphasis added], requiring both cognitive and technical skills.’ Graphically summarized earlier in Figure 1.1, its framing encompasses what are shown to be standalone notions of data and data analytic literacies, but its definition also confounds the idea of technology as a largely tangible ‘thing’ (e.g., a particular physical gadget or an electronic system) with the highly intangible digitally encoded information captured by or flowing through it. As noted earlier, using digital technologies does not necessarily demand any special skills (in relation to using older, analog technologies) – many, in fact, might be easier to operate than the older analog technologies they replaced, which suggests that it is the digitally encoded information, rather than the technology ‘shell’ that should be the focus of digital literacy.


Whether it is considered as a standalone notion, as shown in Figure 1.2, or a part of the summary construct of digital literacy, the notion of quantitative literacy encompasses a combination of familiarity with the rules of numeric operations and the ability to extract meaning out of observed quantities, broadly known as inductive reasoning. In other words, to be deemed quantitatively literate one needs to exhibit proficiency with the general use of numbers as a way of representing multitude or magnitude estimates (i.e., quantities), and proficiency with the logic of structured inductive reasoning, or accepted ways of interpreting the symbolism of numeric expressions. But what, exactly, constitutes quantity? That is a critically important consideration because upon closer examination it becomes evident that the concept of quantity conveys a surprisingly wide range of meanings. In a general definitional sense, it is an expression of a magnitude or a multitude; its scope thus encompasses anything ranging from rough approximations or ballpark figures, as exemplified by visual estimates of the amount of liquid in a container or the number of attendees at a gathering, to precisely calculated physical measurements. Given that, when considered from the perspective of being able to access and utilize digital information, the core idea embedded in the notion of quantitative literacy is the ability to draw inferences from data using structured inductive reasoning. However, if left alone, in a manner of speaking, the idea of quantitative ability is notionally incomplete because it does not explicitly consider the nature of the underlying quantities. Data literacy, another embodiment of the modern conception of literacy shown as one of the satellite concepts in Figure 1.2, contributes that piece of the larger picture. Defined as the ability to read, understand, create, and communicate meaning using data, it can be seen as complementary to the idea of quantitative literacy. Another complementary dimension of modern-day literacy is represented by analytic and statistical literacies. Notionally related to quantitative literacy, analytic literacy has a wider scope: at its core it is a manifestation of more general sensemaking capabilities, which include the aforementioned structured inductive reasoning as well as deductive reasoning, the comparatively less structured (i.e., more individually determined) mental processes used to draw inferences, or arrive at conclusions that logically follow from stated premises. While also related to quantitative literacy, yet another modern-day literacy offshoot, statistical literacy, reflects familiarity with fundamental concepts and key techniques of statistical analysis, broadly defined as the science of analyzing and interpreting data. Two additional literacies – information and computer – further contribute to the modern-day conception of literacy by drawing attention to the importance of the ability to find, evaluate, use, and communicate information in all its various formats (information literacy), and the ability to use computing technologies (computer literacy). Turning back to Figure 1.2, the modern-day conception of literacy manifests itself in a plethora of somewhat differently focused sets of skills and competencies. Looking past numerous logical and definitional overlaps (e.g., statistical literacy could be considered to be a part of quantitative literacy, data literacy could be seen as a part of analytic


literacy, and statistical, quantitative, data, and analytic literacies could all be seen as component parts of the summary construct of digital literacy), there is also a clearly manifest need to offer a more explicit, operationally clear framing of what, exactly, comprises the foundational skillset that is at the core of each of those literacy offshoots. For instance, taking a close look at the idea of statistical literacy as a standalone concept, it becomes obvious that it is neither definitionally clear nor operationally explicit. It is well known that statistics plays an important – indeed, a key – role in extracting insights out of numeric data; given that, it is generally assumed that being Digital Age-literate should manifest itself in, among other things, proficiency in statistics. However, given that statistics is a very broad field that lends itself to substantial degrees of proficiency variability, what exact statistical knowledge and skills amount to being statistically literate, keeping in mind that being literate implies a baseline level of proficiency? It is a difficult question to answer, but tackling it becomes quite a bit more manageable when the hard-to-operationalize ‘desired degree of proficiency’ is reframed in terms of specific elements of statistical knowledge. Not only is that way of approaching statistical proficiency more operationally, and thus practically, meaningful, it is also in closer alignment with the idea of literacy as a manifestation of the ability to access, utilize, and communicate digital information. The notion of analytic literacy is similarly complex, largely because the meaning of the term ‘analysis’ is very broad. In a general sense, to analyze is to methodically examine the constitution or structure of something, typically for purposes of explanation. Given that just about anything can be analyzed, the methods of analysis can vary considerably, ranging from qualitative, which center on the use of subjective judgment, to quantitative, which emphasize objective, structured mathematical techniques. Moreover, qualitative and quantitative analytic approaches are themselves families of more specific techniques – for instance, commonly used qualitative techniques include ethnographic, narrative, and case studies; commonly used quantitative techniques include descriptive, prescriptive, and predictive methodologies (each of which can be further broken down into a number of specific techniques). And thus a similar question comes to the forefront here: What, exactly, can be considered a basic, literacy-level analytic skillset? That ambiguity is even more pronounced in the context of yet another modern literacy embodiment captured in Figure 1.2: data literacy. While quantitative and statistical literacy are both expansive and thus in need of specification, the idea of data familiarity seems outright unbounded, given the dizzying array of types and sources of data. And though the now-clichéd notion of ‘big data’ emphasizes the overwhelming volumes of data generated by the ever-expanding electronic interchange and communication infrastructure, it is the seemingly endless diversity of data types and sources that actually poses the greatest obstacle to trying to grasp the totality of what is ‘out there.’ The widely cited definition of data literacy framing it as the ability to read, understand, create, and communicate data as information takes for granted knowing how one type of data may differ from another, and how that difference may impact


what it takes – skill- and knowledge-wise – to ‘read, understand, create, and communicate data as information.’ Can someone with deep familiarity with, for instance, the structured point-of-sale retail scanner data captured by virtually all retail outlets, but no appreciable knowledge of text-encoded unstructured product review data, be considered data literate? If not, what exactly should be the scope and depth of one’s data knowledge to meet that requirement? In the era of practically endless arrays of information available on the Web, it is easy to find numerous characterizations of digital, data, analytic, and other modern embodiments of the longstanding notion of literacy, but as outlined above, those descriptions generally lack epistemological rigor and operational clarity. Considered jointly, those diverse interpretations of what it means to be Digital Age-literate paint a conceptually confusing and operationally unworkable picture. There are numerous and significant redundancies among the individual notions, as illustrated by the overlap between analytic, statistical, and quantitative literacies, and even setting those problems aside, the individual modern-day literacy conceptions lack specificity. ‘Quantitative skills’, ‘statistical proficiency’, ‘data familiarity’, and ‘cognitive abilities’ all sound very Digital Age-like, but as discussed earlier, all are in need of clarifying details. In short, the various dimensions of what it means to be Digital Age-literate need to be reduced to a set of distinct and individually meaningful (i.e., exhibiting robust construct and discriminant validity) elements of knowledge that together form a coherent composite of foundational skills and competencies. It is argued in this book that data analytic literacy is such a metaconcept because, as detailed in the ensuing chapters, it captures the key elements of knowledge that form the foundation of Digital Age literacy.

Knowledge vs. Skills

Gently implied in the preceding overview of the different faces of the modern-day conception of literacy is the importance of distinguishing between abstract conceptual (also referred to as theoretical) knowledge of data structures and of data manipulation, analysis, and interpretation approaches, and the comparatively more tangible applied skills that are required not only to manipulate and analyze data, but also to make sense of data analytic outcomes. It is intuitively obvious that while distinct, those two broad dimensions of data analytic abilities are highly interdependent – the yin and the yang of digital competency – insofar as both are necessary to becoming data analytically literate, as graphically depicted in Figure 1.3. That is not a new idea – in fact, professional training in fields such as medicine or engineering has long been rooted in combining strong theoretical understanding of, for instance, biology and chemistry (medicine) or mathematics and physics (engineering) with robust experiential learning. The appropriateness of that approach to developing professional skills is almost intuitive because the ability to ‘do’ is inseparably tied to understanding the essence of a particular problem, which draws on the appropriate facets


Figure 1.3: The Core Building Blocks of Data Analytic Literacy. (Data analytic literacy as the combination of applied skills – proficiency with data manipulation, analysis, and communication tools – and conceptual knowledge – familiarity with concepts, processes, and techniques.)

of scientific knowledge, and being able to undertake appropriate actions typically demands practical application skills. Similarly, the ability to access, utilize, and communicate digital information requires a combination of appropriate foundational knowledge of data structures and data analytic techniques, and computation-focused execution skills. However, there are notable differences between the practice of, for instance, engineering and the practice of data analytics. By and large, engineers use abstract concepts to design and build physical things, such as bridges or microchips, whereas data analysts use abstract concepts to extract equally intangible insights out of data – given that, the distinction between ‘theoretical’ and ‘applied’ is not nearly as clear-cut. And given that the conceptualization of data analytic literacy presented in this book draws an explicit distinction between conceptual knowledge and procedural skills, as graphically depicted in Figure 1.3, it is reasonable to ask: what, exactly, is the difference between knowledge and skills? When considered from a phenomenological11 perspective (discussed in more detail in Chapter 3), the idea of knowledge has a broad meaning, encompassing three distinct dimensions: semantic (facts and concepts, typically acquired in the course of formal education), procedural (behaviors and skills, often learned in the course of performing certain tasks), and episodic (emotions and experiences associated with certain events) – in that very general sense, skills are seen as a facet of knowledge. Such a broad characterization, however, muddies important differences between the ideas of knowledge and skills when the two are considered in the context of data analytics. In that particular setting, the otherwise non-specific framing of what constitutes the general idea of ‘knowledge’ needs to be adapted to account for the difference between the stable theoretical foundations of data analytics and the evolutionarily dynamic nature of applications of those theoretical foundations, typically in the form of computing tools and applications. When

 In simple terms, phenomenology is a philosophy of experience, or more specifically, the study of structures of consciousness, as experienced from the first-person point of view.


looked at from the broad phenomenological perspective, the former falls under the umbrella of semantic knowledge, which is familiarity with abstract concepts and ideas not tied to any specific objects, events, domains, or applications, as exemplified by the notion of probability. The latter, on the other hand, falls under the umbrella of procedural knowledge, commonly referred to as know-how, or the ability to perform specific tasks, as exemplified by being able to estimate the probability of occurrence of an event of interest. In contrast to the ethereal semantic knowledge, which exists independently of any applications, procedural knowledge is inescapably tied to specific tools and applications – in fact, within the confines of data analytics, procedural knowledge manifests itself as the ability to use specific data manipulation and analysis tools. For example, while discussed in more or less detail in a great number of books, journal articles, and online summaries, the idea and the logic (i.e., the semantic knowledge) of the Pearson product-moment correlation are always the same, but the type of (procedural) knowledge required to compute the Pearson correlation coefficient varies depending on the type of tool used. The knowledge – really, the skills – needed to compute that coefficient using a programming language such as Python is very different from the knowledge needed to compute the same coefficient using a GUI (graphical user interface)-based, i.e., point-and-click, system such as SPSS. It follows that the ability to use an idea such as correlation should be seen as a combination of two distinct but interdependent parts: first, it needs to be understood for what it is – an abstract notion that encapsulates specific transcendental truths – and second, as an estimate that can be computed using distinct tools that require tool-specific know-how. Implied in that distinction is that while the abstract conceptual knowledge is universal, in the sense that it does not vary across tools or applications, the ability to apply that abstract knowledge, framed in Figure 1.3 as applied skills, is inescapably tied to specific applications. A perhaps less immediately obvious reason for expressly differentiating between conceptual knowledge and applied skills (i.e., procedural knowledge) is that the advancing capabilities of modern data analytic tools make it easy to complete data analytic tasks without fully understanding the underlying theoretical rationale. The above-mentioned GUI-based SPSS system makes it exceedingly easy to compute a wide range of statistical estimates with little-to-no appreciable understanding of the underlying statistical concepts, yet as capable as that system is, it is not capable of detecting all instances of input data not meeting the requirements of a particular estimation technique, or the outright inappropriateness of a particular technique. The Pearson correlation mentioned earlier requires input data to be continuous, normally distributed, and relatively outlier-free, but the system will compute reasonable-looking estimates for data that do not meet those requirements, so long as input values are encoded as ‘numeric’ (meaning, using digits to represent individual values). Consequently, someone who has the requisite SPSS usage skills could, for instance, compute a Pearson correlation between


a digitally encoded but manifestly categorical data feature/variable12 such as ‘region’13 and another (continuous or categorical) variable; but since the use of the Pearson correlation requires both variables to be continuous, the resultant coefficient estimate would be invalid, though it would very likely have the appearance of a valid estimate . . . Just knowing where and how to ‘click’ can lead to lots of misinformation if not accompanied by robust conceptual knowledge. Another important difference between conceptual knowledge and applied skills is the degree of permanence. Some of the foundational notions of statistical probability and inference date back to the late 18th century, and many of the current techniques are several decades old;14 even neural networks, the backbone of machine learning applications, are several decades old (the initial backpropagation method for training artificial neural networks was first developed in the 1960s). Those concepts are as true and applicable today as they were decades or even centuries ago – in fact even more so, because the means (computer hardware and software) and opportunities (lots and lots of data) to use those concepts are so much more readily available nowadays. However, in contrast to largely unchanging statistical concepts, applications of those concepts, encapsulated in computer software tools, are in a state of ongoing evolution. Looking back at the original programming languages developed in the middle of the 20th century (the oldest programming language still used today, FORTRAN, was introduced in 1957), tools of data manipulation and analysis have undergone tremendous evolutionary change. There are now distinct families of software applications that can be used to access, extract, manipulate, analyze, and visualize data: some of those tools are limited-purpose applications, such as those used to interact with data stored in relational databases (SQL) or to visually convey trends and relationships (Tableau), whereas others are more general-purpose systems (SAS, SPSS) capable of performing a wide range of data manipulation and analysis tasks. Moreover, some are open-source, meaning freely available (Python, R), while others are proprietary and subscription-based (SAS, SPSS); when considered from the standpoint of usage skills, some are scripting languages (Python, R) with relatively steep learning curves, whereas others are GUI-based, point-and-click systems (SPSS) that do not require any coding skills.

Although it is common to use the term ‘variable’ to refer to what more broadly can be characterized as ‘data feature’, the use of the latter is preferred here because the term ‘variable’ signifies something that is subject to change, which in the context of data means assuming different values across records, thus rendering that label inappropriate for data elements that may be fixed (i.e., are ‘constant’) across records in a particular set of data.
It is fairly common to use digit-expressed values to encode distinct groups or categories – for instance, rather than using labels such as ‘Northeast’, ‘Mid-Atlantic’, ‘Southeast’, etc., region could be encoded using ‘1’, ‘2’, ‘3’, etc.
Bayes’ theorem, a cornerstone of probability theory, dates back to 1761 and Gauss’ normal distribution to around 1795; Fisher’s seminal Statistical Methods and Design of Experiments were published in 1925 and 1935, respectively, and Tukey’s The Future of Data Analysis was published in 1961. Those contributions ground much of the knowledge that forms the foundation of basic data analytic competencies.
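To make the Pearson correlation example above concrete, the following minimal Python sketch (an illustration only, assuming the pandas and scipy libraries are available; the small data set is invented for demonstration) computes the coefficient twice: once between two continuous measures, where the estimate is meaningful, and once between a continuous measure and a digit-encoded ‘region’ label, where the software happily returns a reasonable-looking but uninterpretable number. The equivalent conceptual check applies unchanged to a GUI tool such as SPSS – only the ‘click path’ differs.

# A minimal sketch: the tool computes a coefficient either way -
# only conceptual knowledge reveals that the second one is invalid.
import pandas as pd
from scipy.stats import pearsonr

# Invented illustration data: monthly ad spend, sales, and a sales region
# encoded with digits (1 = 'Northeast', 2 = 'Mid-Atlantic', 3 = 'Southeast').
df = pd.DataFrame({
    "ad_spend": [10.2, 14.5, 9.8, 20.1, 17.3, 12.6],
    "sales":    [101.0, 130.4, 95.2, 180.7, 150.9, 120.3],
    "region":   [1, 2, 3, 1, 2, 3],   # categorical, despite the numeric encoding
})

r_valid, p_valid = pearsonr(df["ad_spend"], df["sales"])    # both continuous: interpretable
r_invalid, p_invalid = pearsonr(df["region"], df["sales"])  # 'region' is a label: not interpretable

print(f"ad_spend vs sales: r = {r_valid:.2f} (meaningful)")
print(f"region   vs sales: r = {r_invalid:.2f} (reasonable-looking, but invalid)")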


When considered jointly, many of those tools offer essentially the same data analytic functionality, which presents a choice – further underscoring the desirability of expressly differentiating between conceptual knowledge and applied skills.

✶✶✶

While distinct, conceptual knowledge and applied skills are seen here not just as interdependent, but also as mutually reinforcing. A general, abstract understanding of, for example, different types of numeric data is usually deepened as a result of immersing oneself in those data, which requires appropriate coding skills (or familiarity with some type of GUI-based system). In fact, the essence of data analytic competency lies at the intersection of abstract knowledge and applied skills; a more in-depth exploration of that idea is the focus of the next chapter.

Chapter 2 Reframing the Idea of Digital Literacy

The plethora of digital technologies-related conceptions of literacy clearly attests to the multidimensional nature of the general idea of being able to engage with various modes of electronic information sharing. However, as the overview presented in Chapter 1 reveals, the distinct manifestations of what has been characterized here as Digital Age literacy, as exemplified by the established notions of digital, data, and quantitative literacies, lack the much-needed construct and discriminant validity. Their vague and overlapping descriptions paint a confusing picture of what, exactly, are the distinct skills and competencies that form the foundation of modern-day techno-informational proficiency. The analysis of the scopes and contents of those widely used framings of digital technologies-related literacies undertaken in Chapter 1 suggests that the notion framed here as data analytic literacy embodies the most fundamental and foundational building blocks of the broadly framed idea of being Digital Age-literate. It is the goal of this chapter to offer a fuller, more complete introduction of the concept of data analytic literacy, with a particular emphasis on its derivation logic and its epistemological structure. The notion of data analytic literacy is discussed in this chapter from the perspective of a metaconcept, or a higher-order (i.e., comprised of multiple distinct component parts) abstract representation of a particular phenomenon. That phenomenon is the ability to communicate digitally, which is seen here as a composite of distinct-but-interdependent elements of conceptual knowledge and applied skills. Consequently, the goal of this chapter is to offer a general outline of the metaconcept of data analytic literacy, and to demonstrate how this newly framed conception of what is believed to be the heart of techno-informational competency captures the core elements of digital, data, quantitative, and other offshoots of modern-day literacy discussed in Chapter 1. Given the many potential avenues that could be taken to approach the task of framing all that it takes to be data analytically literate, that task is approached here from the perspective of what it takes to access, understand, and use information of interest. Moreover, it is important to note that as used here, the idea of ‘accessing information’ is meant to encompass not just the ability to get to or retrieve information, but also the more open-ended – and more in need of clear definitional and operational specification – sense of being able to extract meaning from digitally encoded information. Lastly, the overview of the genesis and conceptual framing of the metaconcept of data analytic literacy is set against the backdrop of the rise of the Digital Age and the resultant proliferation of digital means of transacting and communicating. With that as the backdrop, the idea of data analytic literacy is presented as the embodiment of the widely held belief that data is the currency of the Digital Age, hence developing basic


abilities to access, use, and communicate using data is at the core of being Digital Age-literate.

The Digital Revolution

Originally developed to transmit the US Air Force’s radar signals, telephone modems began to be used as a computer-to-computer communication medium in the early 1950s, forming an early foundation for the development of large-scale electronic networks. The first step in that direction was taken in the 1960s with the US government’s ARPANET (Advanced Research Projects Agency NETwork), the early predecessor of today’s Internet. Around the same time, a method of combining transistors into densely packed arrays, known today as microchips, was invented at Texas Instruments and Fairchild Semiconductor, and that invention fueled a rapid evolution and expansion of ARPANET. The network was a great success, but participation in it was limited to certain academic and research organizations – those that had contracts with the US Department of Defense. The compelling information sharing functionality of ARPANET was mimicked by other, technologically similar networks that offered electronic information sharing capabilities to entities other than those connected to the US Department of Defense; however, since no standard electronic communication protocols existed at that time, those networks were not able to communicate with each other. That changed on January 1, 1983. On that day, considered by many the official birthday of the Internet, a new communication protocol was introduced: called the Transmission Control Protocol/Internet Protocol, or TCP/IP for short, it allowed different computers on different networks to ‘talk’ to each other. Following the 1989 invention of the World Wide Web (the WWW, or simply the Web, was actually launched in 1991), the Internet grew from a vehicle designed to universalize and democratize electronic information sharing into a commercial juggernaut that transformed the way business is conducted and gave rise to whole new industries. Still, up to that point, the Internet was just a medium of electronic information sharing – that changed in November of 1992, when the evolution of the Internet, or more specifically the Web, took another step forward with the development of the first website, which was CERN’s, the European Organization for Nuclear Research (where the Web was conceived). Initially 100% text, the first websites were very basic, especially by today’s standards, but their functionality grew rapidly. The rapid-fire succession of coding-related functionality enhancements – starting with HTML (HyperText Markup Language) as the basic tool of website construction, followed by JavaScript, a programming language that supports dynamically updating content, and later by Cascading Style Sheets, or CSS, a rule-based style sheet language used for describing the presentation of a document written in a markup language such as HTML – created what is now the cornerstone Web technology. It is worth noting that another staple Web functionality, the search engine, was introduced years before the


emergence of today’s king of Web search, Google.1 ALIWEB (Archie Like Indexing for the Web), the original search engine, was launched in 1993, when the Web was just two years old and there were fewer than 1,000 websites. The rapidly expanding technological capabilities and user functionalities of the Internet began to shape what could be considered the digital socio-economic reality. The 1999 launch of PayPal, a leading online payment company, facilitated the growth of online commerce and, more broadly, the online economy. The early 2000s witnessed the rise of multiple social media platforms: the now defunct Friendster (2001) and MySpace (2003), and the now dominant LinkedIn (2002) and Facebook (2004). It was also around that time that what was initially considered a ‘nice to have’ – a meaningful commercial and social online presence in the form of a well-designed and high-functioning website or a well-articulated individual profile – began to be recognized as a key part of business strategy by commercial firms, and as an important part of just being by individuals. Stated differently, the Digital Era was born, and to use a well-known expression, the rest is history . . . Frequently referred to as the Digital Revolution, it represents a large-scale shift from largely standalone mechanical and analog devices to interconnected networks of digital electronic devices, systems, and applications, which in more abstract terms can be characterized as the emergence of the virtual, i.e., online, socio-economic reality. And much like the transition from agrarian to industrial society, which brought with it sweeping social, economic, and cultural changes, the currently underway maturation of the electronically rendered virtual dimension of socio-economic reality is reshaping numerous facets of commercial and social life, at the core of which is the ability to communicate digitally. It is important to note that the conception of what it means to ‘communicate’ is also expanding along with the emergence of new communication modalities. Generically defined as the competencies needed to send and receive information, present-day communication skills need to be broad enough to accommodate different types of information and different information sharing modalities: verbal and non-verbal, physical (as in handwritten), and electronic, most notably digital. Setting aside verbal, non-verbal, and physical (i.e., handwritten) communications, as those modalities fall outside the scope of this overview, the idea of digital communication, as a key expression of the aforementioned emergent virtual socio-economic reality, warrants closer examination.

Communication in the Digital Age

Digital communication can be defined as the use of strings of discrete values, typically 1s and 0s, to encode and transfer information. It is important to note that while the

 Google Beta was launched in September of 1998.


conventional conception of communication implies exchange of information between persons, digitally encoded information can be transferred person-to-person(s), computer-to-computer, computer-to-person, or person-to-computer. In fact, human-computer interaction (HCI) is an established area of research within computer science; HCI researchers study the ways humans interact with computers with an eye toward designing new technologies, and new interfaces to existing technologies, to facilitate individuals’ and organizations’ ability to make fuller use of computing capabilities.2 All considered, the combination of rapid technological advancements – in particular, the evolution of artificial intelligence-enabled autonomous systems – and an equally rapid spread of digital communication infrastructure into more and more aspects of commercial and individual lives is beginning to redefine the traditional conception of what it means to communicate. Though the Merriam-Webster dictionary defines communication as ‘a process by which information is exchanged between individuals through a common system of symbols, signs, or behavior’, in the era of ubiquitous, automated, and progressively more autonomous technological infrastructure built around information exchange a broader framing of that general notion seems warranted – one that explicitly accounts for the phenomenon of bidirectional transfer of information between humans, between artificial systems, and between artificial systems and humans. Part and parcel of that broadening conception of the notion of communication, seen as a phenomenon encompassing the traditional human-to-human information exchange as well as informational exchanges between artificial systems and between artificial systems and humans, is the growing importance of being able to understand the ‘language’ of digital communication, which is data. It is instructive to note that some data are contained entirely within the closed-loop ecosystems of artificial system-to-artificial system communications, as illustrated by the Internet of Things (a network of interconnected physical devices which are able to perform numerous tasks autonomously, because of what can be characterized as innate abilities to function in the desired manner by communicating with other devices by means of exchanging data). At the same time, a large share of the data generated by the modern digital infrastructure is meant to be ‘consumed’ by humans, because its informational content is primarily meaningful within the context of human decision-making. In that sense, extracting meaning out of data captured by a wide array of transactional and communication systems and devices can be seen as one of the key elements of HCI. To put it simply, in order for humans to ‘hear’ what the artificial systems are ‘saying’, the former have to be proficient in the language used by the latter, meaning they have to have the ability to extract meaning out of raw data. But what if, as is the case with social media platforms, data are captured in a form, such as text, that is easily understood by humans without the need for any

 For a more in-depth discussion of HCI, see Banasiewicz, A. (2021). Organizational Learning in the Age of Data, Springer Nature: Cham, Switzerland.


processing? That ease of understanding is illusory because it overlooks a critical consideration in the form of data volumes. Even a relatively small slice of the staggeringly voluminous torrents of data generated by any of the widely used social media platforms would more than likely be prohibitively large for manual human processing, as in being read and summarized by human editors. Because of that, analyses of text data are usually undertaken using automated analytic algorithms, but the outcomes of those analyses can be questionable given the numerous challenges of algorithmizing nuanced human communications (more on that in Chapter 8). A simple example is offered by the now ubiquitous online consumer product and service reviews, which are commonly used by brands to stay abreast of consumer sentiment and to pinpoint positive and negative review themes. To analyze a set of such reviews, a text mining algorithm needs to be trained and tested using large enough and adequately representative samples (typically one for training and a separate one for validation) of those reviews. However, as the volume of such reviews grows, cross-record semantic, syntactical, grammatical, and other sources of linguistic variability increase, which renders the task of selecting adequately representative samples – and by extension, training and validating text mining algorithms – more and more difficult. Still, some 80% to 90% of the data generated nowadays are text, which means that even elementary data analytic proficiency requires at least some familiarity with the means and mechanisms of extracting meaning out of large pools of text data. In a more general sense, seen here as digitally encoded machine-to-human communications, data captured by an array of electronic transactional and communication systems can take a variety of forms, which in turn has a direct impact on analyses of those data. It is intuitively obvious that analyses of, for instance, numeric and text data call for different data analytic approaches – what is somewhat less intuitively obvious is that more nuanced differences, such as the type of encoding of numeric data, can have an equally profound impact on how data are to be analyzed. It thus follows that development of robust data analytic capabilities, framed here as data analytic literacy, requires sound understanding of distinct data types.
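As a deliberately tiny illustration of the training-and-validation pattern described in the review example above, the following Python sketch (an assumption-laden toy example using the scikit-learn library; the handful of reviews and their ‘positive’/‘negative’ labels are invented) splits labeled product reviews into training and validation samples, fits a simple bag-of-words classifier, and checks it on the held-out reviews – the step that becomes progressively harder as linguistic variability grows.

# A toy sketch of the train/validate pattern used in text mining;
# real applications require far larger, representative samples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reviews = [
    "great battery life and sharp screen", "arrived broken and support was useless",
    "exactly as described, works perfectly", "stopped working after two days",
    "fast shipping, love this product", "terrible quality, do not buy",
    "best purchase I made this year", "waste of money, very disappointed",
]
labels = ["positive", "negative", "positive", "negative",
          "positive", "negative", "positive", "negative"]

# Separate training and validation samples, as described above.
train_x, valid_x, train_y, valid_y = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels)

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_x, train_y)                                    # train on one sample ...
print("validation accuracy:", model.score(valid_x, valid_y))   # ... validate on another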

Understanding Data

The distinction between text- and digit-encoded information made earlier draws attention to the question of what, exactly, data is. While the meaning of the term ‘data’ may seem intuitively obvious, most find it difficult to come up with a clear and concise characterization of that important notion; in fact, there are varying conceptions of what qualifies as data, so much so that the use of that seemingly straightforward term may conjure up materially different images across users and across situations. Moreover, quite commonly the term ‘data’ is used interchangeably with the term ‘information’, especially when referring to evidence supporting a particular conclusion or a point of view; further adding to the blurring of the distinction between those two


notions is that some commonly used sources, such as the Oxford Languages dictionary, tend to define one in terms of the other, as exemplified by the framing of information as ‘facts about something or someone.’ At the same time, decision theory and data analytics focused researchers tend to draw a sharp distinction between data and information, characterizing the former as facts and the latter as interpreted facts, often in the form of insights derived from analyses of facts.3 Consequently, while the general idea of data has a ring of familiarity, it is a hollow label with inconsistent descriptions, in dire need of clarification. One of the main contributors to the epistemological inefficacy of the general conception of data is inattention to the origins, or sources, of data. If data are to represent the fundamental elements from which meaning is eventually derived, data need to emanate from some type of an original source – a belief that loosely parallels the general idea of first principles.4 In an abstract sense, data can be conceptualized as original elements of meaning brought into existence by something or someone. That ‘something’ could be a point-of-sale bar code reading device recording details of a transaction, such as item number, price, purchase date, etc.; that ‘someone’ could be a researcher capturing a particular response to a survey question. In that sense, data are encoded elements of meaning – captured using numbers, letters, or combinations of both – and potential building blocks of information. A closely related but nonetheless distinct source of epistemological inefficacy of the common conception of data is the tendency to confuse the type of encoding employed to capture a particular state or event, i.e., a fact, with the informational content, or meaning, contained therein. Overlooking that important distinction is problematic because it gives rise to inconsistent frames of reference, which is at the root of the aforementioned definitional fuzziness; by extension, that fuzziness can then lead to an epistemologically shaky framing of the idea of data analytic literacy. In other words, to be theoretically sound and practically meaningful, framings of the specific skills and competencies deemed necessary to being able to extract meaning out of data need the footing of a solid foundation of clearly delineated, distinct (i.e., non-overlapping), and complete (i.e., encapsulating all general types) data forms.

Classifying Data

Broadly defined, data is anything known or assumed to be fact. That is a very broad, and frankly not a particularly informative, characterization. Seeking greater definitional

This overview does not consider ideas such as the underlying truthfulness of data, i.e., facts; to most, the intuitive meaning of the term ‘fact’ is analogous to ‘truth’, but as it is well known data can be fabricated or intentionally (or unintentionally) distorted. Given that, the framing of fact used here is that which is believed to be true.
The fundamental concepts or assumptions that form the basis of a system or a theory.


clarity, however, turns up a surprisingly confusing array of taxonomies and general descriptions that together paint an inconsistent, if not outright confusing, picture. In order to be meaningful, any data type classification schema needs to exhibit the basic MECE – mutually exclusive and collectively exhaustive – characteristics: a MECE principle-compliant classification should enumerate data types in a way that accounts for all possible variations of data encodings, and all those enumerated variations should be distinct and different. A review of scientific research sources (electronic research databases such as JSTOR, ERIC, and ProQuest), and a separate review of more practitioner-oriented open online sources, reveals a surprising paucity of data type focused classification schemas. What emerges is a largely incoherent potpourri that confuses data structures, as seen from the perspective of storage and retrieval, with data sources and with data elements, the latter seen from the perspective of algebraic expressions (e.g., arithmetic partial data types). For example, one classification schema divides data types into integer, floating-point number, character, string, and Boolean; another one into integer, real, character, string, and Boolean values; and yet another into integer, character, date, floating point, long, short, string, and Boolean values. Moreover, the everyday conception of data also tends to conjure up portrayals of neatly arranged rows and columns of numbers, as illustrated by the now ubiquitous spreadsheets. However, since facts can be encoded not only numerically but also textually and graphically, data can take on multiple, highly dissimilar forms; moreover, those forms may or may not be arranged into recognizable and repeating structures. In fact, given that, volume-wise, social media platforms are among the largest contributors to ‘big data’, and those data are predominantly unstructured collections of text and images, uniformly structured rows and columns of numeric values are nowadays the exception rather than the norm. Recognizing that is important not only from the perspective of typological completeness, but also because if not all data are numeric, the ability to extract meaning out of data must encompass more than just quantitative competency.
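The confusion between storage-level data types and informational content is easy to demonstrate. In the hypothetical Python fragment below (the specific value is invented), the same fact – a US postal code – can be stored as an integer, a floating-point number, or a string, yet none of those storage ‘data types’ says anything about whether arithmetic on it would be meaningful.

# Storage type describes how a value is encoded, not what it means:
# averaging postal codes is syntactically legal but semantically empty.
zip_as_int = 2116            # leading zero lost: Boston's 02116 becomes 2116
zip_as_float = 2116.0
zip_as_str = "02116"         # preserves the leading zero, permits no arithmetic

print(type(zip_as_int), type(zip_as_float), type(zip_as_str))
print((zip_as_int + 10001) / 2)   # runs without error, yet the 'average ZIP code' is meaningless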

Toward a General Typology of Data

Considering the idea of data types from the standpoint of data analytic literacy, it is taken here to be largely self-evident that developing robust data analytic abilities requires the foundation of a conceptually and operationally sound data classification schema. However, given the combination of the current status quo, which is a confusing maze of largely incongruent data type characterizations, and the sheer variety of data currently in existence, the development of a comprehensive data typology is a daunting task. To make that task more manageable, a key distinction needs to be made: considering that, broadly characterized, data represent recordings of events or states that can be encoded and organized in multiple ways, a MECE-compliant data


classification schema needs to expressly differentiate between structural and informational dimensions of data. The former manifests itself in two key considerations: a general organizational structure, which considers data as a collection of individual elements, commonly referred to as data features or variables, and the type of encoding; the latter can be encapsulated in distinct data origins. The structural dimension of data reflects the manner in which data are captured, organized, and stored. Understanding of those largely technical characteristics is essential to being able to access, manipulate, and analyze data. Data that are structured and encoded in a particular way can be manipulated and analyzed in ways that are in keeping with those characteristics. It is important to note that the structural dimension of data is focused on what is permissible, in the sense of the resultant operations yielding valid outcomes, and not at all on what might be appropriate or desirable from the sensemaking point of view. The informational dimension of data, on the other hand, aims to capture the sensemaking content of data. In other words, it considers data from the perspective of general origins that encapsulate broad types of meanings embedded in data. For example, UPC scanner-captured retail transaction details contain different embedded meanings than online platform-captured consumer product reviews. Given that data can be seen as raw material to be used as input into knowledge creation processes, a clear understanding of the informational dimension of data is as important as a strong command of the structural characteristics of data. Together, the structural and informational dimensions of data help with framing the broad conceptual foundation of data analytic literacy, as those dimensions are suggestive of the specific elements of conceptual knowledge and distinct skills that are needed to, firstly, be able to access and manipulate data, and, secondly, be able to extract informative insights out of raw data. Those are, however, relatively abstract, high-level objectives in need of clarifying details. The more operationally minded details are needed to shed light on the key elements of knowledge and sets of skills that are necessary to meet the structural and informational challenges associated with extracting insights out of data.

Supporting Considerations

The structural dimension of data is a reflection of how individual data elements (i.e., data features or variables) are encoded and organized. A data element can be thought of as a measurement of a state (as in a state of being, as exemplified by demographic attributes) or an event (e.g., a product purchase), and it can be encoded in one of three general formats: numeric, text, or image. The captured states or events can be very detailed or specific (e.g., a specific item purchased at a specific location and time) or comparatively more aggregate (e.g., total product sales), depending on data capture goals and mechanisms. In terms of encoding, the seemingly self-evident labels of ‘numeric’, ‘text’, and ‘image’ are somewhat more nuanced than what meets the eye.


Numeric data are encoded using digits, but it is important to note that digital expression does not necessarily imply a quantitative value – in other words, ‘5’ could represent either a quantity (e.g., 5 units of a product) or just a label (e.g., Sales Region 5); such potential ambiguity underscores the importance of clear operational definitions. The second type of data encoding, text, can take the form of words or phrases, but it can also take the form of alphanumeric values, which are often unintelligible strings (frequently used as unique identifiers, such as customer IDs) comprised of some mix of letters, digits, and special characters (e.g., $ or &). In other words, from the perspective of data types, text is any sequence of letters, digits, and special characters, which may or may not contain informative content. And lastly, the third and final type of encoding, image, encompasses essentially any encoding that is neither numeric nor text, which typically means a wide array of literal or symbolic visual expressions. The ability to recognize the overt differences separating the three distinct types of data encoding needs the ‘companion’ knowledge and skills that are necessary to identify (conceptual knowledge) and execute (applied skills) data type-warranted and/or allowable manipulations. To that end, numeric data can be either continuous or categorical. Continuous numeric data elements are expressions of a magnitude or a multitude of a countable phenomenon, whereas categorical data elements are expressions of discrete groupings.5 When considered from the perspective of the structural dimension only, text and image data can be either symbolic or informative. Symbolic data represent emblematic or tokenized characterizations of states or events, whereas informative data are expressions of meaning encoded using some combination of letters, digits, and other characters. Also important from the perspective of the structural dimension of data – in fact, suggested by the ‘structural’ label itself – is the organization of data elements. Typically grouped into collections commonly referred to as data files, those collections of individual data elements can be either structured or unstructured. The former are arrangements of individual elements using layouts that follow the familiar-to-most fixed, pre-defined structure where rows = records and columns = variables, as illustrated by a set of customer purchase records in a spreadsheet. The latter, on the other hand, simply do not adhere to a predetermined, repeatable layout, meaning that unstructured data do not follow a fixed record-variable format; just about any social media data file can be expected to be unstructured. In fact, the vast majority of data, and certainly the bulk of ‘big data’, are unstructured; in that sense, the unstructured format should be considered the norm and the structured format a special case. However, structured data are easy to describe, store, query, and analyze, whereas unstructured data are not.

In a more technically explicit sense, a continuous variable can take any value (i.e., integers or fractions) between the minimum and the maximum, meaning it can assume an unlimited number of values; in contrast to that, a categorical (also known as discrete) variable can only take on one of a limited, and typically fixed, number of possible values.
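The structural contrast just described can be sketched in a few lines of Python (a purely illustrative, invented example): the first object follows a fixed rows-equal-records, columns-equal-variables layout, while the second – a social-media-style post – mixes free text, nested fields, and optional elements that fit no predefined grid.

# Structured: every record repeats the same fixed set of fields.
purchases = [
    {"customer_id": "C001", "item": "K-204", "price": 19.99, "date": "2023-03-01"},
    {"customer_id": "C002", "item": "K-118", "price": 7.49,  "date": "2023-03-02"},
]

# Unstructured (or, at best, semi-structured): no fixed record-variable format.
post = {
    "user": "@runner42",
    "text": "Love the new trail shoes!! best grip I've had in years",
    "replies": [{"user": "@hiker", "text": "which model?"}],
    "media": None,            # optional elements appear only in some records
}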


Consequently, until relatively recently the notion of ‘data analysis’ has been synonymous with structured (and predominantly numeric) data, though recent advances in computational tools and techniques are beginning to change that. Still, according to some estimates, unstructured data account for as much as 90% of all data captured worldwide, but currently only about 1% of those data are analytically utilized. Lastly, when considering the difference between numeric and non-numeric (i.e., text and image) data elements in the context of the difference between structured and unstructured data collection layouts, it is important to note that while it is common for numeric data to be structured and non-numeric data to be unstructured, there are instances where that is not the case. For example, when two or more numeric, structured data files are concatenated (i.e., merged together – more on that in Chapter 7), and overtly the same data features, such as ‘cost’, are encoded differently across some of those files (for instance, the encoding of ‘cost’ in one file could allow decimal values while in another file it could not), the resultant combined file might no longer exhibit a single, fixed layout. It is also possible for typically unstructured text data to be structured, which is often the case when text values represent choices selected from a predefined menu of options. Of course, those are relatively infrequent situations, but nonetheless they underscore the importance of carefully examining data prior to making layout-related decisions. The core data structure-related considerations that play an important role in developing an analytically robust understanding of data are graphically summarized in Figure 2.1. The goal of this summarization is to offer an easy-to-grasp enumeration of the distinct aspects of data that need to be expressly recognized as a part of framing a conceptually complete and operationally clear notion of data analytic literacy.

Figure 2.1: Core Data Utilization Considerations. (Ability to Utilize Data → Insights. Structural dimension: numeric data – continuous or categorical; text and image data – symbolic or informative; structured vs. unstructured organization. Informational dimension: numeric data – continuous or categorical (nominal or ordinal); text and image data – classification; data origin.)
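To make the file-concatenation caveat mentioned above concrete, here is a minimal pandas sketch (invented values; the pandas library is assumed to be available) in which the overtly identical ‘cost’ feature is encoded with decimals in one file and as whole numbers stored as text in another – after merging, the combined column no longer has a single, consistent encoding and must be explicitly repaired.

# Two 'structured' files whose shared 'cost' feature is encoded differently.
import pandas as pd

file_a = pd.DataFrame({"item": ["A1", "A2"], "cost": [19.99, 4.50]})   # decimals allowed
file_b = pd.DataFrame({"item": ["B1", "B2"], "cost": ["20", "5"]})     # whole numbers, stored as text

combined = pd.concat([file_a, file_b], ignore_index=True)
print(combined["cost"].map(type).unique())    # mixed float and str values in one column
print(combined.dtypes)                        # the column degrades to the catch-all 'object' dtype

# Careful examination and an explicit repair restore a single, fixed encoding.
combined["cost"] = pd.to_numeric(combined["cost"])
print(combined.dtypes)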


Turning to the informational dimension of data, there are numerous data considerations that play critical roles in the processes used to extract meaning out of data, as graphically summarized in Figure 2.1. Building on the earlier discussion of different types of data feature encodings, continuous numeric data represent measurable quantities; from the perspective of data sensemaking that means that continuous numeric values permit standard arithmetic operations (addition, subtraction, multiplication, division, exponentiation, and extraction of roots), which translates into a breadth of statistical techniques that can be used to extract meaning out of those values. Numerically encoded categorical data features, on the other hand, offer significantly less sensemaking flexibility as those values do not allow the aforementioned standard arithmetic operations, which translates into considerably fewer data analytic pathways. Moreover, not all numeric categorical values have comparable informational value – here, ordinal (i.e., ordered classes, as exemplified by the standard college classification groupings of freshman, sophomore, junior, and senior) values are informationally richer than nominal (i.e., unordered labels, as exemplified by gender or ethnicity groupings) ones. A somewhat different set of considerations applies to text and image data considered from the standpoint of informational content. Here, the basic distinction is between data features that contain informative insights and those that can be seen as being a part of a larger set. For instance, text-encoded online consumer product reviews contain insight hidden in what, from a data analysis point of view, is a combination of expressions (i.e., single words and multi-word terms) and syntactical and semantic structure, which suggests different sensemaking pathways than a collection of images. Yet another important consideration that arises in the context of the informational dimension of data is encapsulated as ‘data origin’ in Figure 2.1. As used in the context of data analytic literacy, the origin of data is meant to reflect a general data creation mechanism, as exemplified by bar code scanners widely used to record retail transactions. As detailed in ensuing chapters, understanding the origin-related roots of data plays an important role in making sense of data’s informational content.
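A brief Python illustration of those informational distinctions (hypothetical values; the pandas library is assumed): ‘price’ is continuous and supports the full range of arithmetic, the digit-encoded ‘region’ is nominal despite looking numeric, and ‘class_standing’ is ordinal – its categories can be meaningfully ordered but not averaged.

import pandas as pd

df = pd.DataFrame({
    "price": [12.99, 8.50, 23.75, 5.25],                      # continuous: arithmetic is meaningful
    "region": pd.Categorical([1, 3, 2, 1]),                   # nominal: digits used purely as labels
    "class_standing": pd.Categorical(
        ["sophomore", "freshman", "senior", "junior"],
        categories=["freshman", "sophomore", "junior", "senior"],
        ordered=True),                                         # ordinal: the order is informative
})

print(df["price"].mean())            # a valid, interpretable quantity
print(df["class_standing"].min())    # ordering is defined for ordinal categories
# df["region"].mean() would either fail or, if region were left as plain integers,
# produce a number with no meaningful interpretation.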

The Metaconcept of Data Analytic Literacy

The steady transition from Industrial to Digital Age technologies and industries manifests itself in, among other things, not only vast quantities of data, but also a growing recognition of the centrality of data to value creation processes. Not surprisingly, it is now common for organizations to espouse the value they believe is locked in the vast torrents of data generated by modern electronic transaction processing and communication infrastructure. Structured, unstructured, numeric, text – data generated by the ever-expanding arrays of systems and devices vary considerably in terms of structural and informational characteristics, and so extracting value out of those heterogeneous torrents of data, particularly in the form of decision-guiding insights, requires


well-honed skills and competencies. As discussed in Chapter 1, those competencies need to reflect an appropriate mix of conceptual knowledge and applied skills (see Figure 1.3), which is something that the often-invoked ideas of digital, quantitative, analytic, and data literacy hint at, but do not quite get at. The summary concept of data analytic literacy proposed in this book is meant to address those shortcomings by offering a new, MECE-compliant conceptualization of what it means to be literate in the Digital Age. It is framed here as a metaconcept, a composite of multiple otherwise distinct skills and competencies, some in the form of distinct domains of theoretical knowledge and others in the form of distinct sets of applied skills. Recall Figure 1.2 in Chapter 1, showing the various embodiments of the general notion of Digital Age-related literacies. To be singularly inclusive of the core elements of knowledge and applied skills that are needed to access and utilize diverse types of data, the notion of data analytic literacy needs to clearly define categories of knowledge and skills, spell out the elements comprising those categories, and specify the relationships among those entities. This task is accomplished here with the use of a three-tiered logical model graphically summarized in Figure 2.2.

Figure 2.2: Three-Tiered Conceptual Model. (META-DOMAIN: general categories; DOMAIN: classes within general categories; SUB-DOMAIN: specific skills and competencies.)

The key benefit of the three-tiered organizational structure depicted in Figure 2.2 is that it facilitates bringing together otherwise disparate elements of conceptual knowledge and applied skills in a way that demonstrates the necessary interdependencies, ultimately yielding an informationally complete and logically consistent ‘whole’. The broadest, scope- and content-wise, components of that whole, framed here as meta-domains, capture the general categories of conceptual knowledge and applied skills. Each of those broad categories is broken down into more specific sets of classes of knowledge and skills, framed here as domains of conceptual knowledge and applied skills. And lastly, the most operationally specific level, framed here as the sub-domain, captures the specific elements of conceptual ‘know what and why’, and applied ‘know how’.


Framing Data Analytic Literacy

The general typological6 logic summarized in Figure 2.2 suggests that, when considered in the more abstract sense, proficiency with using structured and unstructured, numeric and non-numeric data needs to be considered in a broader context than just the ability to analyze data. More specifically, it points toward knowing when and how to perform any necessary data manipulation and data preparation tasks. Often given, at best, cursory attention in traditional data analytics overviews, the knowledge and skills needed to recognize the need for and to undertake any data due diligence and data feature engineering that might be necessary are critically important. It is so because data generated by the various transactional, communication, and tracking systems are rarely analysis-ready (more on that in Chapter 5), and typically require substantial preprocessing, which in turn calls for specific knowledge and skills. Data utilization, which is the essence of data analytics, is rich with possibilities, but realizing those possibilities is again contingent on a highly nuanced combination of general procedural and methodological knowledge, coupled with applied skills that are needed to execute the desired analytic processes. Here, the application of data analysis related knowledge and skills can be characterized as dynamically situational, meaning it is shaped by the interplay between data specifics and expected informational outcomes. And lastly, communication of data analytic outcomes to interested stakeholders demands the ability to correctly capture and translate the often esoterically expressed outcomes (i.e., buried in statistical details and nuances) into meaningful takeaways. Set within that general context and bringing to bear the typological logic summarized in Figure 2.2, data analytic literacy can be seen as:
1. an interactive combination of conceptual knowledge and applied skills (i.e., the two distinct meta-domains), and further:
a. within the knowledge meta-domain, it can be seen as a combination of familiarity with data (domain 1) types (sub-domain 1) and origins (sub-domain 2), and familiarity with methods (domain 2) of processing data (sub-domain 3) and analysis (sub-domain 4), and
b. within the skills meta-domain, it can be seen as a combination of proficiency in computing (domain 3) focused on structural (sub-domain 5) and informational (sub-domain 6) outcomes, and the ability to extract informational content, framed as sensemaking (domain 4) abilities manifesting themselves in abilities to extract factual (sub-domain 7) and inferential (sub-domain 8) insights,
2. all of which ultimately combine to give rise to the ability to transform raw data into decision-guiding knowledge.

 For clarity, there are two basic approaches to classification: typology and taxonomy. The former conceptually separates a given set of items multidimensionally, where individual dimensions represent concepts rather than empirical cases; the latter classifies items based on empirically observable and measurable characteristics. The approach used here is described as typological because it is a multidimensional conceptual design.

Figure 2.3 offers a more detailed summary of all pertinent considerations.

[Figure 2.3 depicts data analytic literacy as two meta-domains, Knowledge, comprising the Data (Type, Origin) and Methods (Processing, Analytic) domains, and Skills, comprising the Computing (Structural, Informational) and Sensemaking (Factual, Inferential) domains, jointly giving rise to the ability to transform data into decision-guiding knowledge.]

Figure 2.3: Data Analytic Literacy.

In a more narratively expressive sense, the idea of data analytic literacy is seen here as a summary construct or a metaconcept comprised of two interdependent meta-domains of knowledge and skills – the former embodying theoretical understanding of applicable concepts and processes, and the latter representing the ability to execute appropriate data processing and analysis steps. As graphically depicted in Figure 2.3, while distinct, the knowledge and skills meta-domains are seen as interdependent, in the sense that each contributes distinctly and vitally but only partially to meeting the overall requirements of data analytic literacy. The knowledge meta-domain is made up of two more operationally distinct domains of data and methods; the former encompasses familiarity with generalizable types and sources of data, while the latter encompasses familiarity with data processing and data analytic approaches. The data domain is then broken down into two distinct sub-domains of types and origin. The types sub-domain captures the knowledge of distinct and generalizable varieties of data, as seen from the perspective of encoding (e.g., numeric vs. text) and organizational structure (e.g., structured vs. unstructured). The origin sub-domain considers data from the perspective of generalizable sources, as illustrated by point-of-sale transactions or consumer surveys. The methods domain is comprised of two distinct sub-domains of processing, which encapsulates familiarity with analytic data preparation processes and approaches as exemplified by data feature engineering or missing value imputation, and analytic, which encompasses familiarity with methods of data analysis, framed here in the context of widely used exploratory and confirmatory techniques. The skills meta-domain is comprised of two distinct domains of computing and sensemaking, where the former encompasses computational skills manifesting themselves in familiarity with specific tools used to manipulate and analyze data, and the latter delineates the somewhat less ‘tangible’ abilities to extract and communicate meaning, often hidden in technical data analytic outcomes. The computing domain is comprised of two distinct sub-domains of structural and informational. The structural sub-domain encompasses an array of data wrangling and data feature engineering (i.e., data pre-processing or data preparatory) computational skills, while the informational sub-domain captures the core (i.e., comprising the elementary know-how that falls within the scope of the basic skills focused notion of data analytic literacy) data analysis related computational skills. And lastly, the sensemaking domain is made up of two distinct sub-domains of factual and inferential skills, where the former encompasses a collection of competencies that are needed to be able to summarize data into ‘what-is’ type informational outcomes, whereas the latter encapsulates the know-how needed to extract probabilistic insights out of data. The general typological logic framing the organizational structure of the broad metaconcept of data analytic literacy is graphically depicted in Figure 2.4.

[Figure 2.4 maps the Figure 2.3 hierarchy onto the meta-domain, domain, and sub-domain tiers.]

Figure 2.4: The Typological Logic of the Data Analytic Literacy Metaconcept.
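As a concrete illustration of one task sitting in the processing sub-domain mentioned above, missing value imputation, the following is a minimal, hypothetical Python (pandas) sketch; the dataset and column names are invented, and the median and mode choices are just one common convention rather than a recommendation drawn from this book.

    import pandas as pd

    # Hypothetical order records with gaps in both a continuous and a categorical feature.
    orders = pd.DataFrame({
        "order_amount": [25.0, None, 40.0, None, 31.0],
        "region": ["East", "East", None, "West", "West"],
    })

    # One common convention: fill continuous gaps with the median and
    # categorical gaps with the most frequent value (the mode).
    orders["order_amount"] = orders["order_amount"].fillna(orders["order_amount"].median())
    orders["region"] = orders["region"].fillna(orders["region"].mode()[0])

    print(orders)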

It is important to note that the interdependent nature of the ‘knowledge’ and ‘skills’ dimensions of data analytic literacy, graphically depicted with a bidirectional arrow in Figures 2.3 and 2.4, is a core determinant of what it means to be ‘data analytically literate.’ Conceptual knowledge plays a key role in appreciating the nuanced character of data analytics, and applied skills are the means that ultimately transform raw data into informative insights. And while that might seem obvious, it is also important to strike a good balance between those two core elements, which is especially critical within the basic capabilities-minded data analytic literacy. As noted earlier, to be literate is to possess the basic skills, which has implications for the depth of data and methods related knowledge, and the extent of computing and sensemaking skills. The bulk of this book – Part II, comprised of chapters 3 thru 8 – details the specifics of the scope of conceptual knowledge and the extent of applied computational skills that are believed to constitute those foundational competencies. To that end, some of the ensuing overview might sound quite familiar, and some other aspects might be less so. Data processing and data analysis related approaches are fairly well established, and so the overview offered in the next section is focused on delineating the key elements of those otherwise broad domains of knowledge; in contrast to that, the overview of data knowledge is built around a newly introduced general typology of data, meant to facilitate quick and meaningful immersion in diverse and widely varied types of data (see Figure 2.1). The overview of computing related skills and sensemaking abilities also offers a largely new perspective. Here, the discussion of structural and informational data processing and analysis tools is built around another newly developed framework, this one focused on imposing meaningful structure onto the confusing ecosystem of open-source, proprietary, scripting, and GUI-based coding applications. The goal of that grouping and description-based overview is twofold: First, it is to offer a succinct summary of what computing tools might be most appropriate to accomplish different aspects of broadly conceived data analyses, and second, to highlight alternative means of achieving the same outcomes. And lastly, the explicit focus on sensemaking skills aims to underscore the key differences between factual and inferential interpretations of data analytic outcomes.

Part II: The Structure of Data Analytic Literacy

The preceding section aimed to look at the longstanding notion of literacy from the perspective of Digital Age communication. The first step, in Chapter 1, was to examine the foundational notions of literacy and numeracy, framed as the ability to read and write and to use numbers, respectively, and then to take a closer look at their modern-day derivatives in the form of digital, quantitative, data and other applications of the general idea of literacy to today’s techno-informational electronic infrastructure. Building on that foundation, the summary notion of data analytic literacy was then derived, seen as a core element of being able to fully engage with the broadly defined informational resources of the Digital Age. Framed as a combination of two interdependent meta-domains of data analytic approach-informing conceptual knowledge and data analytic action-enabling computational and sensemaking skills and competencies, the three-tiered conceptualization of data analytic literacy was outlined in Chapter 2. The overriding goal of the data analytic literacy framework outlined there was to, firstly, pull together the key elements of what is currently a largely haphazard mix of epistemologically unclear, overlapping, and operationally vague notions of digital, data, analytic, quantitative, and other modern-day embodiments of the longstanding notion of literacy, and, secondly, to offer a singular, epistemologically sound and operationally explicit framing of what it means to be data analytically literate. Building on the broad overview of the core elements of data analytic competencies offered in chapters 1 and 2, this section takes a closer look at what, exactly, it means to be data analytically literate. More specifically, the ensuing six chapters – Chapter 3 thru Chapter 8 – jointly offer an in-depth overview of the three-tier structure of data analytic literacy, first introduced in Chapter 2. Organization-wise, chapters 3 and 6 are focused on a more explicit examination of the distinction between the meta-domains of knowledge and skills, as the two notionally distinct but process-wise interdependent macro elements that combine to jointly give rise to data analytic literacy. Contents of chapters 3 and 6 are meant to draw a clear distinction between the conceptual, ‘what’ and ‘why’ focused ‘knowledge’ meta-domain, and the computation and interpretation, ‘how to’ focused ‘skills’ meta-domain. The remaining four chapters comprising this segment – chapters 4, 5, 7, and 8 – offer an in-depth discussion of the four distinct domains of data analytic literacy: data, methods, computing, and sensemaking.


Chapter 3 Data Analytic Literacy: The Knowledge Meta-Domain

One of the defining characteristics of any domain of professional practice is the command of subject matter knowledge. Attorneys, for instance, are expected to be well versed in key legal concepts, doctrines, and procedures, accountants are expected to exhibit a command of generally accepted accounting principles, and investment professionals are expected to have an in-depth understanding of different investment mechanisms and strategies. What does the idea of knowledge and knowing mean within the confines of data analytic literacy? To answer this question, it is essential to first look deeper into what it means to ‘know’. Perhaps the most intuitive conception of what it means to know is that of being familiar with or aware of something, as in some fact, circumstance, or occurrence. For instance, it is widely known that having been established in 1872 makes Yellowstone the oldest national park in the United States, and in fact, in the world. Most people also know some number of other people; hence to know also means to have developed a relationship with someone, typically through meeting and spending time with them. What the general conception of knowing does not clearly account for is an important distinction between being cognizant of something and knowing how to do something. For example, someone might have a general understanding of the principles of flight, but not have the skills to operate an aircraft. That distinction speaks to the difference between conceptual knowledge and applied skills discussed in the previous chapter, and recognizing it is an important element of most areas of professional practice. The fuzziness of the distinction between knowing as in being familiar with something and knowing as in being able to do something is particularly pronounced in the context of data analytics, where those two complementary but distinct aspects of knowing are often lumped together. That is inappropriate because doing so overlooks the fact that the ability to analyze data is a product of several distinct areas of knowledge, most notably understanding of data structures, proficiency with data manipulation and analysis tools, and familiarity with data analytic methods, and being proficient in one of those domains does not imply proficiency in other domains. For instance, database administrators typically exhibit strong understanding of data structures coupled with proficiency with specific data manipulation tools such as SQL, but rarely have equally strong knowledge of data analytic methods; data analysts, on the other hand, can be expected to be highly proficient with data analytic methods (statistical techniques and/or machine learning algorithms) and computational tools such as Python or SAS, but not necessarily with different types of data structures or specific data access and manipulation tools, such as NoSQL. Moreover, some, most notably academic researchers, might have very deep theoretical knowledge of, for instance, statistical methodologies, and comparatively poor proficiency with applied methods and tools used to extract, manipulate, and analyze data. All considered, the development of robust and balanced abilities to utilize data calls for, firstly, expressly differentiating between theoretical knowledge that captures the ‘what’ and ‘why’ facets of data analytics, and applied skills that encapsulate the ‘how’ dimension, and, secondly, carefully delineating elements of both. The goal of this chapter is twofold: First, it is to take a deeper look at the essence of knowing and remembering, as considered from the broad perspective of the knowledge meta-domain part of data analytic literacy. Second, it is to offer a general overview of the Knowledge meta-domain of data analytic literacy, as a foundation for an in-depth analysis of the two domains (Data and Methods) and four distinct sub-domains (Type and Origin of data, and Processing and Analytic methods) comprising that broad dimension of data analytic literacy.

The Essence of Knowing

From the perspective of machine-like information processing, a movie is essentially a series of still frames shown in fast succession, but to the human eye it appears as a continuous flow. Moreover, a movie typically includes ‘chunks’ of thematic details, yet human senses capture those discrete audio-visual stimuli and the brain effortlessly (assuming a well-made production) infers the multitudes of causally and otherwise interconnected meanings necessary to understand and appreciate the underlying plot. At the end of a movie, it is natural to say that one knows what it was all about; in fact, most of us are so comfortable with the idea of knowing that the age-old epistemological1 question of ‘how does one know that one knows’ is rarely asked. A part of the reason is that the very idea of what it means to know is so all-encompassing. Variously defined as understanding of, awareness of, skill in, or familiarity with ideas, concepts, objects, or situations, the general conception of what it means to ‘know’ points toward just about everything. The distinction between semantic and procedural knowledge made in Chapter 1 suggests that when considered in the context of data analytic competencies, the seemingly monolithic notion of what it means to know should be considered from the perspective of at least a couple of meaningful angles. Moreover, learning related research points toward two other ways of thinking about the idea of knowing, by drawing a distinction between explicit and tacit knowledge, which also has important implications for the idea of data analytic literacy.

 Epistemology is a branch of philosophy concerned with understanding the nature of knowledge and belief; as used here, epistemology represents an attempt to understand the core essence of a particular concept or idea.


[Figure 3.1 distinguishes knowledge sources, explicit (factual and objective; community-shared and intermittent; exposure-, absorption-, and availability-shaped) and tacit (interpretive and subjective; individually held and constantly accruing; exposure-, absorption-, and perspective-shaped), from knowledge dimensions: semantic (ideas, facts, and concepts; not related to specific experiences), procedural (behaviors, habits, and skills; implicit or unconscious), and episodic (events, experiences, and emotions; recall-based).]

Figure 3.1: The Summary Construct of Knowledge.

Combining those two somewhat different perspectives on knowledge suggests that the idea of knowing should be seen as a broad summary construct, or a metaconcept, graphically summarized in Figure 3.1. When considered from the epistemological perspective, that which constitutes knowledge can be understood in terms of two separate sets of considerations: sources, which captures the ‘how’ aspect of what we know or believe to be true, and dimensions, which represents the ‘what’ facet of what we know, in the sense of the type of knowledge. The source of knowledge can take the form of formal learning, often referred to as explicit knowledge (also known as declarative), or informal learning, which is also known as tacit knowledge (also known as non-declarative), best exemplified by skills, familiarity with, or understanding of topics or endeavors encountered while working or pursuing a hobby. The second and somewhat distinct set of considerations instrumental to dissecting the essence of how one knows that one knows characterizes the typologically distinct dimensions of knowledge. Here, the individually held ‘I know’ can be grouped into three broad, general kinds of knowledge: semantic, which encapsulates abstract ideas and facts, procedural, which captures behavioral abilities to perform specific tasks, and episodic, which encompasses an array of hedonic or emotive memories. Those distinctions play an important role in understanding the essence of what it means to be data analytically literate. The summary depiction of that metaconcept shown in Figure 2.3 differentiates between conceptual knowledge and applied skills, though in a more abstract, broader sense, what is framed there as distinct can be seen as two manifestations of the general idea of ‘knowing something’ because data analytic literacy takes a relatively narrow view of skills. In a general sense, skill is the ability to do something, where that ‘something’ can assume a wide array of forms. Negotiating an agreement entails specific, predominantly mental skills, and riding a bike also involves skills, though obviously different ones, in the form of specific physical abilities. Organizations often think in terms of the difference between ‘hard’ and ‘soft’ skills, where the former typically encompass job-specific, teachable abilities such as using specific tools or applications, and the latter include an individual’s social abilities to relate to and interact with others, which includes competencies such as time management, leadership, or conflict resolution. All in all, while the general conception of skills entails a broad and heterogeneous mix of abilities and competencies, the notion of data analytic literacy only contemplates a specific, relatively narrowly framed subset in the form of computing and sensemaking skills. The former can be interpreted, in the context of the high-level descriptive summarization of knowledge shown in Figure 3.1, as a procedural dimension of knowledge, and the latter as a manifestation of the explicit source of knowledge.

Analytic Reasoning

Taking a closer look at the idea of what it means to know is also beneficial from the standpoint of shedding additional light on the idea of analytic reasoning that is implicit in the logic of data analytic literacy. While perhaps most strongly implied by the sensemaking domain (see Figure 2.3) of data analytic literacy, analytic reasoning permeates the entirety of data analytic literacy because it is a manifestation of the ability to logically and carefully think through the informational opportunities and the computational and sensemaking challenges associated with a particular data analytic initiative. It is a notoriously elusive element of the data analytic skillset, one that draws from factual and objective (i.e., explicit) as well as interpretive and subjective (i.e., tacit) facets of knowledge to address the (data analytic) demands of a particular situation by identifying the most appropriate course of action. Within the confines of analytic reasoning processes, the differently acquired and differently oriented aspects of knowledge impact each other, so that, for instance, one’s explicit knowledge of how to fit regression models can be expected to be at least somewhat altered by one’s practical experience (tacit knowledge) acquired in the course of conducting regression analyses in the past. One could even go as far as suggesting that the ongoing interplay between explicit and tacit sources of knowledge, and semantic and procedural dimensions of knowledge, can, within the confines of data analytics, produce different effective topical knowledge, something that is in fact common within applied data analytics. Here, individuals may start with the same foundational knowledge of a particular topic – as would tend to be the case with those who took the same course, taught by the same instructor – only to end up at some point in the future with materially different effective knowledge of that topic, because of divergent tacit influences. Advanced degree students offer a good example here: Consider two hypothetical individuals, similar in age, who enrolled at the same time in the same statistics doctoral program. Those two individuals ended up taking the same set of courses, and upon completing their degree requirements graduated at the same time. At the point of receiving their doctoral degrees, both individuals could be assumed to have comparable levels of knowledge of a particular domain of statistical knowledge, such as regression analysis. Their degrees in hand, the two chose to pursue different career paths, with one taking a position as an assistant professor at another academic institution, and the other opting to join a consulting organization as a statistical analyst. Over the course of the next decade or so, the former engaged in theoretical research focused on the use of matrices in linear least square estimation (a subset of mathematical considerations addressing an aspect of regression analysis), while the consultant was primarily focused on developing applied solutions for his corporate clients, such as building predictive models to forecast future cost of insurance claims. If their knowledge of regression analysis was to be reassessed at that point (a decade or so after earning their doctorates), it would be reasonable to expect their respective effective knowledge of regression to be materially different. Why? Because of the interplay of the dissimilarity of their post-graduation work experience (tacit knowledge) and the manner in which their on-the-job learning interacted with their initially comparable explicit knowledge (i.e., the professor’s work deepened the semantic dimension of his knowledge, while the consultant’s work deepened the procedural dimension). Ultimately, even though both started with more-or-less the same foundational regression knowledge, the two overtly like-credentialled professionals could end up with materially different effective knowledge of regression. The idea of effective topical knowledge draws attention to the importance of understanding the mechanics of learning. The relatively recent explosion of interest in machine learning seems to have re-emphasized the importance of that longstanding notion, albeit nowadays focused more on artificial systems’ ability to extract and draw inferences from patterns in data (more on that in chapters 9 and 10). Still, the development of foundational data analytic skills and competencies is ultimately focused on learning, thus taking a closer look at the mechanics of human learning is important.

Human Learning

Current neuroscientific research suggests that, when considered from the perspective of the human brain’s inner workings, the process of learning is an outcome of developing new neuronal2 connections through a general process that encompasses encoding, consolidation, storage, and recall or remembering, as graphically summarized in Figure 3.2 below.

 Neurons are cells considered to be the fundamental units of the brain and are responsible for sending and receiving signals, in the form of electro-chemical impulses that transfer information between the brain and the rest of the nervous system.


[Figure 3.2 depicts incoming information being encoded into short-term memory, consolidated into long-term memory, and later activated into active memory, where retrieval triggers re-consolidation.]

Figure 3.2: The General Process of Learning.

As depicted above, information is first encoded in short-term memory, into which sensory inputs enter in two somewhat fleeting forms: iconic, or visual, and echoic, or auditory (research suggests that short-term iconic memories have an average duration of less than 1 second, while echoic memories last about 4–5 seconds). The process of learning is then initiated, starting with the formation of new neuronal connections followed by consolidation, which is when newly formed remembrances are strengthened and then stored in a specific part of the brain as long-term memories. Subsequent retrieval of earlier stored information from long-term to active memory results in re-consolidation, or strengthening of the stored information’s recall, or remembering. In a more abstract sense, learning can be characterized as modifying information already stored in memory based on new input or experiences. It is an active process that involves sensory input to the brain (which occurs automatically), and an ability to extract meaning from sensory input by paying attention to it long enough (which requires conscious attention) to reach temporary, or short-term, memory, where consideration for transfer into permanent, or long-term, memory takes place. An important aspect of human memory, however, is its fluid nature, which is very much unlike any machine-based storage and retrieval systems. Each subsequent experience prompts the brain, in a cyclical fashion, to reorganize stored information, effectively reconstituting its contents through a repetitive updating procedure known as ‘brain plasticity’ (it all happens at the subconscious level). This process is generally viewed as advantageous, since improvements are made repeatedly to existing information, but it can have adverse consequences as well, most notably when our memories of the past, rather than being maintained and protected, are amended or changed beyond recognition. In view of that, the widely used characterization of learning as the ‘acquisition of knowledge’ oversimplifies what actually happens when new information is added into the existing informational mix – rather than being ‘filed away’ and stored in isolation, any newly acquired information is instead integrated into a complex web of existing knowledge. Consequently, ongoing learning should not be viewed as a strictly additive process, but rather as a more complex, evolutionarily cumulative one. That is tremendously important within the confines of data analytic literacy because it suggests that ongoing learning and immersion in data exploration and sensemaking will yield continuous competency improvements – practice does make perfect after all.

Type-wise, long-term memories, or more specifically knowledge, can be grouped into two broad types: declarative, which closely aligns with the earlier discussed explicit knowledge, and non-declarative, which parallels the earlier discussed tacit knowledge. When considered from the standpoint of learning, declarative knowledge encompasses information of which one is consciously aware, as exemplified by familiarity with statistical techniques; as graphically depicted in Figure 3.1, explicit knowledge can manifest itself as semantic or episodic – e.g., knowledge of the meaning of specific statistical concepts, and remembrances of past experiences using those concepts, respectively. Non-declarative or tacit knowledge encompasses information of which one may not be fully consciously aware, as exemplified by skills in assessing the efficacy of data analytic outcomes. It is the core of experiential learning, and the reason why experience in a particular area or with a particular activity endows one with that elusive – in the sense of what, exactly, it is – but nonetheless valuable dimension of knowledge. Tacit knowledge is why experienced analysts may at times be suspicious of seemingly correct data analytic outcomes for reasons that they themselves may find difficult to immediately pinpoint. Understanding of the core mechanics of learning processes plays an important role in developing basic data analytic competencies because it is suggestive of different learning foci and modalities, and the manner in which those different learning avenues can contribute to achieving better outcomes. The distinction between declarative/explicit and non-declarative/tacit knowledge underscores the importance of learning that balances structured learning of established concepts and practices, and more self-guided experiential learning of applications of those concepts. Drawing attention to short-term to long-term (memory) consolidation of new information and the importance of re-consolidation of those initial memory imprints as a condition of learning underscores the importance of repetition – learning of often esoteric data analytic concepts and applications demands reiteration in the sense of re-performing the same task. In fact, that is likely the key reason why equating the development of meaningful data analytic competencies with a couple or so ‘one-off’ courses in quantitative analysis or statistics has consistently failed to produce appreciable lasting statistical knowledge. Building even the most rudimentary data analytic competencies is, general process-wise, no different than learning how to use letters and numbers – both require thoughtful approaches, persistence, and repetition. The data analytic literacy conceptualization first introduced in Chapter 2 (Figure 2.3) is intended to serve as a template guiding such systematic building of basic data analytic competencies. At its most rudimentary level, the approach differentiates between conceptual knowledge and procedural skills, hence the in-depth exploration of the ideas encompassed in the said conceptualization begins with a general overview of the knowledge meta-domain.


The Knowledge Meta-Domain

Learning to engage in meaningful analyses of data should begin with a robust foundation of conceptual knowledge. This is one of the fundamental beliefs that underlie the notion of data analytic literacy discussed in this book. It is a reflection of a simple observation: Data captured by the seemingly endless arrays of systems and applications represent encoded information with characteristics that are determined by data capture, organization, and storage systems and mechanisms; it thus follows that development of a meaningful grasp of the informational value of data calls for basic understanding of ‘data creation’ mechanisms. Similarly, analyses of data are typically geared toward derivation of insights that are ‘approximately correct’, which in turn calls for understanding of basic rules of probabilistic inference. In other words, the highly intangible nature of data and data analytics suggests that the natural means of experiencing reality, most notably physical senses such as sight and touch used to experience observable physical things, need to be replaced with typically abstract proxies, the general workings of which need to be understood in order for their outcomes to be fully and validly utilized. In more concrete terms, that means that learning different ways and means of extracting insights out of data needs to begin with acquiring rudimentary familiarity with data-defining mechanisms and structures, and basic understanding of conceptual foundations of methods used to extract meaning out of data. One of the challenges posed by learning the basics of data analytics is a lack of universally agreed on approaches to doing that (that is, after all, the inspiration for this book). Even a casual review of a cross-section of college-level undergraduate or graduate data analytics related curricula will quickly yield a picture of significant philosophical (e.g., theory vs. applications, learning ‘about’ analytics vs. learning the ‘how to’ of analytics, etc.), scope (e.g., focus on classical statistics vs. a more machine learning orientation, etc.), analytic approach (e.g., heavy coding vs. result interpretation, etc.), and other differences. For those learning data analytics primarily in on-the-job applied settings, the resultant data analytic skills often end up confounded with specific data and data analytic applications that are tied to a particular industry or a decision-making context. Almost as a norm, academic institutions, which are rightfully expected to be the primary engine of data analytic learning, (still) tend to approach teaching the basics of data analytics by relying on collections of loosely related courses that align more closely with their home domains, such as statistics or computer science, than with the more broadly defined process of systematically transforming raw data into informative insights. The result is an uneven set of skills and competencies that emphasizes some aspects of data analytics while largely overlooking other ones. To that end, some areas, such as computational programming or statistical analyses, tend to immerse learners deeply in their respective whats and whys, but other aspects of data analytics, such as data due diligence and feature engineering, tend to get no better than cursory attention, if at all. And yet, as so convincingly communicated by the well-known (to academicians and practitioners) adage ‘garbage in, garbage out’, robust data examination and preparation skills are at the core of ascertaining the informational validity of data analytic outcomes. And to be sure, those who develop their data analytic skills in more hands-on, organizational settings usually fare no better. Organizations that facilitate peer-based data analytic education tend to approach that task from the perspective of their own, somewhat idiosyncratic informational needs, which results in learning experiences that are focused largely on certain subsets of data, data analytic methods, and informational outcomes. It is all a long way of saying that in spite of being pivotally important in the data-everywhere reality of the Digital Age, data analytics, when looked at as a distinct domain of knowledge, is still in its infancy. Changing that clearly undesirable status quo begins with framing the multifaceted domain of conceptual knowledge that is required to lay the foundation for development of robust data analytic skills and competencies.

The Domain of Conceptual Knowledge

Building on the foundation of the earlier synopsis of the mechanics of (human) learning and the structure of what it means to ‘know’, this section offers a general overview of the conceptual knowledge that lays the necessary groundwork for building data analytic literacy. It is important to emphasize, again, that the focus here is on basic abilities, which means foundational knowledge needed to review and prepare data for analyses, and to conduct meaningful analyses. Admittedly, the notions of ‘basic’ or ‘foundational’ knowledge and abilities are prone to interpretation, but the details discussed in chapters 4 thru 7 will draw an explicit line of demarcation, both in terms of the scope as well as the depth of knowledge. As noted earlier, it is the goal of this book to contribute to establishing data analytics as a self-defined area of study and practice; with that in mind, the data analytic literacy conceptualization first outlined in Chapter 2 (Figure 2.3) is meant to offer a high-level summary of the structure of that summary concept. Focusing in on conceptual knowledge, hereon referred to simply as the knowledge meta-domain, that general aspect of data analytic abilities is itself a summary construct comprised of two distinct domains of data and methods, where the former encapsulates general familiarity with distinct data types (sub-domain 1) and data origin (sub-domain 2), and the latter captures understanding of data processing methods used to prepare data for analysis (sub-domain 1) and analytic methods used to extract meaning out of data (sub-domain 2). Figure 3.3 offers a high-level view of that aspect of data analytic literacy.

[Figure 3.3 reproduces the Figure 2.3 hierarchy, with the knowledge meta-domain marked as conceptual and the skills meta-domain as procedural.]

Figure 3.3: The Structure of Data Analytic Literacy: The Knowledge Meta-Domain.

The essence of the knowledge dimension of data analytic literacy is to distill the confusing, if not outright overwhelming, volume of data manipulation, review, restructuring, and analysis related information readily available via numerous academic and industry sources and outlets. An informational spectrum covering textbooks, academic research papers, trade publications, and industry white papers available in print and electronically leaves virtually no topic uncovered, but the sheer volume of what is ‘out there’ makes it challenging to reduce that rich universe to a coherent and adequately inclusive yet manageable set of data analytic literacy shaping knowledge. Additionally, keeping in mind that data analytics is an interdisciplinary field that pools together elements of knowledge from established domains, primarily statistics and computer science, but also decision sciences and cognitive psychology, the potential volume of what needs to be known is not only potentially explosive in terms of size – it is also almost impossibly broad. And yet, not all aspects of those component disciplines warrant consideration, simply because not all are necessary to developing foundational data analytic capabilities. To be appropriate and sufficient, knowledge drawn from those and other domains needs to be directly related to what one needs to know to fully and correctly assess and amend – as needed – available data, to conduct basic data analyses, and lastly, to translate the often esoteric data analytic outcomes into valid and informationally meaningful insights. To that end, the so-framed conceptual knowledge of data and data analytic methods needs to encompass a complete understanding of generalizable types of data and ways of manipulating and transforming data, but at the same time, knowledge of methods needs to be contained to just those that are necessary to conduct basic analyses of those data. That seemingly asymmetric view of data analytic knowledge is in fact a logical consequence of the spirit of data analytic literacy, which is to enable one to extract insights (i.e., basic information) out of all available data. The ensuing two chapters offer a detailed examination of the two domains comprising the knowledge meta-domain: Chapter 4 provides an in-depth overview of the data domain related knowledge, framed in the context of the type and origin sub-domains, and Chapter 5 offers an equally detailed overview of the methods domain, framed in the context of the processing and analytic sub-domains.

Chapter 4 Knowledge of Data

Nowadays, data are not just voluminous – they are also quite varied, and that miscellany poses a challenge to fostering the increasingly essential data analytic literacy; in fact, the idea of general data familiarity feels almost unattainable in view of the seemingly endless arrays of sources and types of data. And indeed, when thinking of data in terms of narrowly defined source-type conjoints, just the task of identifying all sources of data seems overwhelming. That line of reasoning, however, is premised upon the often-implicit assumption that data familiarity manifests itself as a product of understanding of data’s computational characteristics and their informational content, but is that indeed the case? For instance, what specific data knowledge (beyond being able to identify appropriate data features) would an analyst tasked with summarizing product sales details need to be able to complete the assigned task? The answer is that the analyst would need to be able to discern individual data features’ computational characteristics, which is essential to being able to apply correct – in the data processing sense – statistical operations; that is all. Those using the resultant information – brand and marketing managers, and other business users – would need meaningful knowledge of the informational content of individual data features to correctly interpret the computed summaries, but that knowledge is not necessary to be able to correctly compute the estimates of interest. Stated differently, the general idea of data familiarity can manifest itself either as the ability to correctly manipulate data, or as the ability to validly interpret the informational content of data (or, of course, as both). It thus follows that when the idea of data familiarity is considered within the confines of data analytic literacy, it should be framed within the narrower context of familiarity with computational aspects of data, an idea that encapsulates understanding of individual variables’ encoding properties, which is what ultimately determines which mathematical operations can be performed. Correct interpretation of individual data elements’ encoding properties is rooted in discerning the two key properties shown earlier (Chapter 2, Figure 2.2): data type and data origin. Given that the notions of ‘type’ and ‘origin’ can be interpreted in a variety of different ways in the context of data analytics, some definitional level-setting is warranted. As used here, data type refers to a computationally distinct category of data features, where computational distinctiveness manifests itself in the use of different data manipulation techniques and different mathematical operations. Starting from the premise that data can be conceptualized as encodings of (actual or presumed) facts, typically representing states (as in the state of being, such as ‘returning customer’) or events (e.g., demographic details, sales transactions), computational distinctiveness arises out of:


– the manifest nature of data encoding, where data can be encoded as numbers (i.e., strings of digits only), as text (letters only or strings of mixed characters), or as image (essentially anything other than numbers or text),
– the range of values individual data elements are allowed (in a factual or logical sense) to assume, and
– permissible operations (e.g., addition, subtraction, etc.) that can be performed on individual data elements.

In addition, since individual data elements are typically grouped into sets, commonly referred to as data files, the general organizational structure of those aggregates also contributes to the definition of data type. Here, the distinction between structured and unstructured data plays an important role in determining how individual data files can be manipulated and analyzed. Complementing the computational distinctiveness-minded type of data dimension of data analytic knowledge is the analyses-informing data origin facet of data familiarity. It represents knowledge of generalizable informational sources of data, and it plays a critical role in sound analytic insight creation. It is important here to underscore the ‘generalizable’ part of this characterization, which in the context of data means either a specific type of technology used to generate data or a clearly defined derivation logic for data that were captured using other means. The former is exemplified by the so-called ‘scanner data’, which are data captured using electronic scanning devices, commonly used at a point-of-sale to record individual transactions, and the latter is well illustrated by geodemographic data, which are location-based demographic aggregates derived from the US Census Bureau’s demographic details. Here, understanding the location-time-item framed layout of scanner data files, and, separately, understanding that geodemographic values are computed estimates (rather than actual recorded values) derived from the Census Bureau’s details1 is essential to producing valid and reliable data analytic outcomes.

 By means of averaging census block (the smallest geographic unit used by the Census Bureau) values.

It is important to note the difference between how individual data elements can be analytically processed and what type of data analytic outcomes might offer the most informative insights. To continue with the earlier example of computing product sales summaries, the outcome of such analysis could express sales details as the sum, the mean, or the mode, and from the computational perspective all three could be seen as statistically valid estimates. However, interpretation-wise, each might support a somewhat different informational takeaway, which underscores the importance of expressly differentiating between the earlier mentioned computational characteristics and informational content of data. Knowing what is permissible, analysis-wise, is distinct and different from knowing what might be desirable, information usage-wise – computational data familiarity always demands the former but not necessarily the latter. One of the key benefits of focusing only on what operations are permissible is that it greatly reduces the otherwise overwhelming variety of data to just a small handful of types that exhibit shared computational characteristics. For example, automotive accident data typically encompass structured numeric values in the form of accident codes, as well as unstructured text adjuster notes – while informationally similar (both describe different aspects of auto accidents), structured numeric and unstructured text data require fundamentally different processing and analytic approaches; conversely, point-of-sale-captured product codes and automotive accident codes are informationally quite different but computationally indistinct because of shared computational characteristics (i.e., both use structured numeric values). A more in-depth exploration of those simple interdependencies is at the core of the comprehensive, computational data familiarity focused taxonomy of data types discussed here.
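To make that contrast tangible, here is a minimal, hypothetical Python sketch (not taken from the book); the accident codes and adjuster notes are invented, and the simple tokenization stands in for richer text-processing approaches.

    import pandas as pd

    # Hypothetical automotive accident records: a structured, numerically coded feature
    # and an unstructured free-text adjuster note describing the same events.
    claims = pd.DataFrame({
        "accident_code": [101, 207, 101],
        "adjuster_note": [
            "Rear-end collision at low speed, minor bumper damage.",
            "Vehicle hydroplaned; airbag deployed, driver uninjured.",
            "Rear-ended at stoplight; whiplash claim filed.",
        ],
    })

    # Structured, coded values can be batch-processed with simple frequency logic...
    code_counts = claims["accident_code"].value_counts()

    # ...whereas the informationally similar text requires a different pathway,
    # e.g., tokenization before any quantitative summary becomes possible.
    tokens = claims["adjuster_note"].str.lower().str.findall(r"[a-z]+")
    word_counts = tokens.explode().value_counts()

    print(code_counts.head())
    print(word_counts.head())

The specific operations matter less than the general point: the numerically coded feature can be summarized directly, whereas the text feature first has to be converted into an analyzable form.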

Data Types

The seemingly straightforward notion of ‘data type’ turns out to be surprisingly unclear when examined more carefully, largely because of the lack of a clear and concise definitional framing of the notion of ‘data.’ Given the centrality of that question to the idea of data analytic literacy, and more immediately, to delineating distinct types of data, it is important to anchor the review of distinct types of data in a conceptually and operationally robust definition of what, exactly, data is.

What is Data?

Merriam-Webster dictionary, a staple reference source, defines data as ‘factual information used as a basis for reasoning, discussion, or calculation’, and it frames ‘type’ as ‘a particular kind, class, or group’. Focusing on the definition of data, if data are information, what then is the definition of information? According to the same source, it is ‘knowledge obtained from investigation, study, or instruction’, which then leads to looking at the definition of knowledge, which, according to Merriam-Webster, is ‘the fact or condition of knowing something with familiarity gained through experience or association, or [a second variant] being aware of something.’ Interestingly, another authoritative dictionary source, Oxford Languages, sees data as ‘facts and statistics collected together for reference or analysis’, and it sees information as ‘facts provided or learned about something or someone’, suggesting that a ‘fact’ could be one or the other . . . Moreover, Oxford Languages frames knowledge as ‘facts, information, and skills acquired by a person through experience or education; the theoretical or practical understanding of a subject’ – essentially, just about anything can qualify as knowledge, and known facts could be data, or information, or knowledge. The goal of this brief overview is to highlight an unexpected linguistic challenge, namely that the English language does not currently appear to have a universally accepted, concise term that singularly captures the idea of ‘data’, as in electronic recordings of events or states. When considered within the confines of data analytics, framing ‘data’ as ‘information’ is not only confusing – it is outright counterproductive because it blurs the distinction between inputs and outputs of data analyses. In a general sense, if data is to represent inputs into analyses and information is to represent outcomes of those analyses, data cannot be defined as information, which is why the earlier noted framing of data used in this book characterizes it as ‘electronic recordings of events or states’.2 Drawing that distinction is also important from the standpoint of delineating distinct and generalizable types of data, because doing so calls for an epistemologically sound definition of what, exactly, is data, and what it is not. It is also worth noting that the data vs. information definitional ambiguities that stem from the perplexing rationale of equating data with information fail to recognize the well-established data utilization progression of data → information → knowledge, which suggests that data can be transformed into information, which then can be transformed into knowledge (and which clearly implies the epistemological distinctiveness of those three concepts). That progression encapsulates the essence of data analytics, which is a set of processes and procedures used to transform raw, and thus generally non-informative, data into informative insights.

Data as Recordings of Events and States In principle, data may exist is electronic or non-electronic formats (e.g., an old-fashioned paper telephone book can be considered a database), but in practice, it is reasonable to think of modern data as electronic phenomena, an assertion further supported by the fact that analysis of data is nearly synonymous with the use of some type of computer software. Given that and turning back to the task of delineating distinct types of data, the electronic character of data adds a yet another potential source of ambiguity. More specifically, electronic data exist as recordings that are stored, organized, and manipulated in computer systems using specific software tools, and those data storage tools have their own ways of ‘seeing’ data, which manifest themselves as specific taxonomies that consider data from the technical perspective of storage and manipulation rather than

 Even if data were captured manually, as in hand-recording of responses to a survey, those manually captured data would more than likely be subsequently converted into some form of electronic format to enable the use of electronic means of data analysis.

Data Types

59

informational content. In other words, understanding of data requires being able to see it from the dual perspectives of ‘what it means’ and ‘how it is structured’, or informational and computational, respectively, perspectives. All considered, the seemingly simple idea of ‘data type’ can quickly become unexpectedly complex, ultimately giving rise to competing perspectives on how the rich varieties of available data can be described and organized. Moreover, the everyday conception of data can also be surprisingly varied. To most, it conjures up images of numeric values, or facts encoded using digits as representations3 of those facts, even though that form is just one of sever distinct data encodings. More specifically, data can also be encoded using symbols other than digits – in fact, according to industry sources, some 80% to 90% of data generated nowadays are encoded using text (letters and/or mixed characters, such as combinations of letters, digits, and special characters commonly used in unique identifiers) or images. Thus from the perspective of computer processing, text in the form of words and text in the form of mixed characters all are usually treated as strings, or series of characters, which leads to a three-part, encoding type-based data types of numeric, or digitsonly, string, which includes text-only and mixed character expressions, and image, or visual representations. A distinct yet closely related (to data encoding) is the actual or implied measurement scale, a characteristic primarily associated with numeric data; in fact, measurement of properties of events or states is the essence of numeric data. An event, a state, or any other discernable phenomenon that is measurable can be quantified – i.e., expressed as a numeric value – using one of the four cumulatively additive measurement scales: nominal, which defines attributes’ identities in form of unordered labels, ordinal, which assigns specific rankings to individual labels, interval, which expresses values as points on a defined continuum, and ratio, which expresses values measured on a defined continuum as deviations from the true zero. Nominal and ordinal scales can be grouped into a broader class of categorical (also known as discrete) values, and interval and ratio scales can be grouped into a broader class of continuous values, as graphically illustrated in Figure 4.1. The use of the two broader as opposed to the four more granular types is common in applied analytics because while it is always necessary to differentiate between categorical and continuous data, due to fundamental

 While it could be argued that the terms ‘numeric’ and ‘digital’ are, in principle, interchangeable since data encoding-wise numeric values are ultimately strings of digits, the common usage of the latter of the two terms now refers specifically to the binary (0–1) system used in modern electronics, whereas the former encompasses much more than just the binary system. Still, that intersection of implied and common usages of those two terms poses a definitional problem as the term ‘number’ – the root of ‘numeric’ – is commonly understood to represent a quantity, which in turn implies that all numeric data are expressions of some measurable amounts, but that is not the case since data encoded using digits can also represent discrete categories.






Figure 4.1: Categorical vs. Continuous Values.

Making sense of the overwhelming data varieties also hinges on recognizing data organization, or layout, related differences. The everyday conception of data tends to conjure up images of neatly organized rows and columns, or in more abstract terms, of the commonly used two-dimensional data matrices, where rows (typically) delimit individual records and columns delimit individual variables. Popularized by the ubiquitous spreadsheet applications and generically known as structured data, such repetitive, templated data layouts have long been the staple of data analytics because they naturally lend themselves to computer processing (no matter how many records need to be processed, the consistent layout makes batch processing fast and efficient). However, text and image data, as well as some numeric data, are predominantly organized and stored using unstructured layouts, which do not follow a predefined, repetitive format. Much harder to process,5 unstructured data now represent the vast majority of what is generically known as big data. The difference between the two general types of data layouts is graphically summarized in Figure 4.2.

Figure 4.2: Structured vs. Unstructured Data.

The preceding overview suggests that data, as seen from the general perspective of data analytic literacy, can be classified into several distinct, mutually exclusive and collectively exhaustive types, framed by individual data elements’ encoding characteristics, underlying (if applicable) measurement scale, and general dataset organizational structure. It is important to note that the classificatory schema implied by that framing is focused only on defining and describing structurally distinct classes of data – it does not address informational content related differences, which also play an important role in the broadly defined data sensemaking process, and thus will be addressed separately.

 It is also worth noting that, given that the difference between interval and ratio scales amounts to the latter’s inclusion of the absolute zero, that distinction is overlooked in practice as well as in some data analytic applications (e.g., SPSS Statistics).

 Because, in principle, each record follows a different format and, again in principle, each individual record can also be informationally distinct, all of which requires that each record be processed independently; by way of contrast, structured data are typically batch-processed because all records follow the same format and contain the same type of information.



Encoding and Measurement

In keeping with the simple definition of data as electronic recordings of events or states, as noted earlier, digits-encoded data are generally considered numeric. Such a broad characterization, however, can give rise to data type misunderstandings because it does not expressly differentiate between the use of digits as expressions of measurements and the use of digits as categorical labels. It is an easily overlooked data nuance, but digital encoding does not necessarily imply a quantity (see the earlier footnote on ‘numeric’ vs. ‘digital’ encoding) because, within the realm of data, the use of digits is not limited to just expressions of measurable quantities. For instance, the value ‘5’ could represent either a quantity, e.g., 5 units of product X, or a categorical label, e.g., Region 5, which suggests the need for an additional data due diligence step in the form of scale of measurement assessment. A simple way to distinguish between the two is to determine if a particular data feature/variable is continuous or categorical: the former indicates a measurable quantity, whereas the latter suggests a digits-encoded categorical label.6 Correctly identifying individual data features as continuous or categorical can be accomplished in one of several ways: by far the simplest is to reference a ready description in the form of supporting documentation (commonly referred to as a data dictionary); if that is not available, which unfortunately is common, the second simplest way to make that determination is to examine the range of values assumed by the data feature of interest, in the context of the self-manifest variable name. For example, ‘Income’ vs. ‘Income Range’ names suggest that the former might be continuous and the latter might be categorical – examining the range of values could then provide further clarifying insights (e.g., a scattering of exact amounts would reinforce the initial ‘continuous’ assessment, whereas recurring ranges of values, such as ‘under $50,000’, ‘$50,000–$99,999’, ‘$100,000–$149,999’, etc., would reinforce the initial ‘categorical’ assessment).

 A more in-depth assessment could further differentiate between unordered nominal variables (e.g., gender) and rank-ordered ordinal variables (e.g., gold-silver-bronze status); the latter is informationally richer (as it contains the additional information of rank-ordering), thus making that distinction might be desirable. It is worth noting that some data analytic applications, such as SPSS Statistics, expressly differentiate between nominal and ordinal values.
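The determination described above lends itself to a quick programmatic check. Below is a minimal sketch – in Python, using the pandas library, with hypothetical column names – of how the clues just described (encoding, number of distinct values, and the values themselves) can be pulled together for a first-pass continuous vs. categorical assessment; the final judgment is still left to the analyst.

```python
import pandas as pd

# Hypothetical data extract: 'Income' holds exact amounts, 'Income_Range' holds recurring labels
df = pd.DataFrame({
    "Income": [52300.75, 87110.10, 43975.00, 120500.25, 61240.90],
    "Income_Range": ["$50,000-$99,999", "$50,000-$99,999", "under $50,000",
                     "$100,000-$149,999", "$50,000-$99,999"],
})

def describe_feature(series: pd.Series) -> dict:
    """Summarize the clues used to judge whether a feature is continuous or categorical."""
    return {
        "numeric_encoding": pd.api.types.is_numeric_dtype(series),
        "distinct_values": series.nunique(),
        "sample_values": series.head(3).tolist(),
    }

for column in df.columns:
    print(column, describe_feature(df[column]))
    # 'Income': numeric encoding, scattered exact amounts -> likely continuous
    # 'Income_Range': text encoding, few recurring labels  -> likely categorical
```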


While making the above determination might seem minor, nothing could be further from the truth. Continuous and categorical data features allow vastly different permissible operations: the former can be manipulated using all arithmetic operations (addition, subtraction, multiplication, division) whereas the latter cannot, which means that continuous variables are informationally richer than categorical ones (i.e., they can be described using a wide range of statistical parameters). Moreover, the continuous vs. categorical distinction is one of the key determinants of the type of data analytic techniques that can be used (more on that in later chapters). Paralleling the earlier made literacy vs. numeracy distinction, string data can be characterized as recordings of events or states encoded using letters only, or a combination of letters, digits, and special characters. In a manner reminiscent of the quantity vs. category distinction made in the context of digits-encoded (i.e., numeric) data, while the everyday conception of the notion of ‘text’ conjures up images of words arranged into sentences in accordance with applicable syntax rules, when considered from the perspective of data types, text data also include strings of not immediately intelligible (from the human perspective) alpha-numeric and special characters. In that sense, the former are notionally reminiscent of continuous numeric values because of the rich meaning contained in the collection of expressions (i.e., single words and multi-word terms) and any associated syntactical and semantic structures; the latter, on the other hand, are reminiscent of informationally poorer categorical values. An aspect of text data that is somewhat unique to that particular data type, and in fact is among its defining characteristics, is explosively high dimensionality. The notion of data dimensionality is a reflection of the number of potential elements of meaning, where an element of meaning could be an explicit word, a multi-word expression, or an implied idea. For instance, a document which is k words long and where each word comes from a vocabulary of m possible words will have a dimensionality of k^m; simply put, a relatively short text file (e.g., a couple or so standard typed pages) can contain a surprisingly large number of potential elements of meaning. Perhaps a less confusing way to think of data dimensionality as an expression of potential elements of meaning is to think in terms of the difference between a text expression, e.g., ‘high value’, and a numeric value, e.g., ‘$12,425’. The term ‘high value’ can have many different meanings depending on the context and how it is used, i.e., it can be explosively high-dimensional, whereas the ‘$12,425’ magnitude has far fewer potential meanings. It follows that text data tend to be highly informationally nuanced, primarily because of their syntactical structure but also because like terms can take on different meanings in different contexts, and the use of punctuation, abbreviations, and acronyms, as well as the occurrence of misspellings, can further change or confuse computerized text mining efforts. The third and final broad form of data, image, encompasses visual representations of anything that could range from an abstract form, such as a corporate logo, to a direct visual depiction of an object of interest, such as a picture of a product. Moreover, to the extent to which video is – at its core – a sequence of single images, the scope of what constitutes image data includes static images as well as videos.


That said, it is important to note that this brief and general characterization of image data is based on overt appearance or representation, and does not consider how those data might be represented in internal computer storage.7

Organizational Schema

The tripart numeric – text – image typology highlights individual data element-level distinctions; additional differences arise when data are aggregated – for storage and usage – into data files or tables. The most familiar data file layout is a two-dimensional grid, in which rows and columns are used to delimit individual data records and data elements (i.e., features or variables); typically, rows delimit data records (e.g., transactions, customer records, etc.) while columns delimit individual variables, producing a matrix-like data layout. Known as the structured data format, that layout’s persistent order makes it ideal for tracking of recurring events or outcomes, such as retail transactions as exemplified by point-of-sales (e.g., the ubiquitous UPC scanners) data capture. Overall, structured data are easy to describe, query, and analyze, thus even though, according to IDC, a consultancy, those data only account for about 10% of the volume of data captured nowadays, structured data are still the main source of organizational insights. Many other data sources, however, generate data that are naturally unstructured, primarily because what those data capture and how they are recorded do not lend themselves to a persistent, fixed record-variable format, as exemplified by Twitter records or Facebook posts. In principle, any layout that does not adhere to the two-dimensional, persistently repetitive format can be considered unstructured; consequently, the largely non-uniform text and image data are predominantly unstructured. That lack of persistent content and form8 means that unstructured data are considerably more challenging to describe, query, and analyze, and thus in spite of their abundance, analytic utilization of unstructured data has been comparatively low. That said, advances in text and image data mining technologies are slowly making unstructured text and image data more analytically accessible.

 Commonly, electronic images are composed of elements known as pixels, which are stored in computer memory as arrays of integers – in other words, within the confines of computer storage, image data can be considered numeric. However, analysis-wise, and thus from the perspective of data analytic literacy, image data are not only visually different, but working with images calls for a distinct set of competencies, all of which warrants treating those data as a separate category.

 As exemplified by Twitter records, individual records in an unstructured data file vary not only in terms of positioning (i.e., physical location, when reading from left to right) related organizational structure of individual data elements, but also in terms of their content, or mix of data elements.
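To make the structured vs. unstructured contrast concrete, the short sketch below (Python, with hypothetical field names) juxtaposes a structured, row-and-column transaction table with unstructured, social media-style records whose content and layout vary from record to record; the dictionary records stand in, loosely, for the kind of non-templated data described above.

```python
import pandas as pd

# Structured data: every record follows the same row-and-column template
transactions = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "item": ["Soft Drink 12oz", "Soft Drink 2L", "Snack Bar"],
    "price": [1.29, 2.49, 0.99],
})

# Unstructured (strictly, semi-structured) data: each record carries a different mix of elements
social_posts = [
    {"user": "a_smith", "text": "Loving the new flavor!", "hashtags": ["#newflavor"]},
    {"user": "b_jones", "text": "Long checkout lines again today..."},
    {"user": "c_lee", "text": "Coupon worked", "image_attached": True, "store": "Downtown"},
]

# The consistent layout allows batch operations on the structured table...
print(transactions["price"].mean())

# ...whereas the varying records must be handled one at a time
for post in social_posts:
    print(post["user"], "-", len(post["text"].split()), "words,",
          "hashtags" if "hashtags" in post else "no hashtags")
```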


Data Origin

Framed here as cognizance of generalizable informational origins of data, origin is the second of the two dimensions of data knowledge, as graphically summarized in Figure 3.2. Commonly understood to be the source or the place from which data are derived, even if confined in scope to ‘just’ the data captured by the modern electronic transaction processing and informational infrastructure, the idea of data origin still conjures up an almost endless array of systems and applications. However, when considered from the data analytic perspective, that level of data source granularity entails an excessive degree of specificity that does not translate into differences in how data are stored, manipulated, and analyzed. For instance, point-of-sales data are captured using numerous, technologically distinct systems and technologies, but from the standpoint of data processing and analysis, those origin-related differences are not important because the structure and informational content of otherwise differently sourced data are generally the same. A second key data origin related consideration is generalizability of data classification typologies. The notion of data analytic literacy described in this book is context-neutral, meaning that it is intended to be equally meaningful when used in the context of different organizational settings, in the sense of being equally applicable to for-profit commercial as well as governmental and nonprofit organizations, and on a more micro level, also to organizations representing different industry segments, such as retail, healthcare, manufacturing, or information technology. At the same time, the distinct sources also need to exhibit high degrees of distinctiveness, often framed using the general logic of the MECE (mutually exclusive, collectively exhaustive) principle. When considered in the context of those more general source-related considerations, the initially dizzying variety of data origins can be reduced to a more manageable set of general source categories. More specifically, data that, broadly speaking, can be seen as representing differently encoded and structured byproducts of the modern digital infrastructure can be broken up into several distinct origin focused groupings.

General Data Sources

The following are five broad informational origin categories of data: passively observational, actively observational, derived, synthetic, and reference; a brief description of each grouping follows.

Passively observational. Perhaps best exemplified by point-of-sales product scanner recordings or RFID sensor readings, passively observational data are typically a product of automated transaction processing and communication systems; those systems capture rich arrays of transactional and communication details as an integral part of their operational characteristics.


Hence those data are captured ‘passively’ because recording of what-when-where type details is usually a part of those systems’ design, and they are ‘observational’ because they are an ad hoc product of whatever transaction or communication happens to be taking place. From the informational perspective, the scope and the informational content of passively observational data are both determined by the combination of a particular system’s operating characteristics and the type of electronic interchange; and lastly, given the ubiquitous nature of electronic transaction processing and communication systems, coupled with the sheer frequency of commercial transactions and interpersonal communications, the ongoing torrents of passively observational data flows are staggering.

Actively observational. The analog to passively observational data, actively observational data share some general similarities with that general data source, but are noticeably different in terms of the key aspects of what they represent and how they are captured. Perhaps best exemplified by consumer surveys, actively observational data can be characterized as purposeful and periodic: those data are purposeful because they capture pre-planned or pre-determined measurements of attitudinal, descriptive, or behavioral states, characteristics, and outcomes of interest; they are periodic because they are captured either on an as-needed or a recurring but nonetheless periodic basis.

Derived. As suggested by their name, derived data are created from other data, and as a broad category are perhaps best exemplified by the earlier mentioned US Census Bureau-sourced geodemographics. A detailed counting of all US residents, undertaken by the Census Bureau every 10 years (as mandated by the US Constitution), produces more than 18,000 variables spanning social, economic, housing, and demographic dimensions, but those detailed data can only be used for official governmental purposes; the US law, however, allows census-derived block level averages9 to be used for non-governmental purposes. Thus geodemographics, so named because they represent geography-based demographic averages, are data that were sourced from the detailed Census Bureau’s data, and though those averages are based entirely on the census details, informationally and analytically they represent a distinct class of data. The same reasoning applies to other derived data categories, such as brand-level aggregates sourced from SKU-level (stock keeping unit, a distinct variant of a given brand, such as differently sized and packaged soft drink brand’s varieties) details.

Synthetic. Popularized by the recent explosion of interest in machine learning applications, synthetic data are artificially, typically algorithmically, generated rather than representing real-world outcomes or events; their generation parameters, however, are usually derived from real-life data.
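Before turning to the benefits of synthetic data, here is a minimal sketch of the ‘generation parameters derived from real-life data’ idea (Python with numpy; the variable names and the normal-distribution assumption are purely illustrative): a synthetic stand-in for a sensitive numeric field is produced by sampling from a distribution fitted to the real values.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical sensitive field, e.g., real transaction amounts that cannot be shared
real_amounts = np.array([12.40, 58.10, 33.75, 101.20, 47.60, 22.95, 76.30, 64.05])

# Derive generation parameters from the real data...
mean, std = real_amounts.mean(), real_amounts.std(ddof=1)

# ...and generate synthetic records that mimic those characteristics without
# reproducing any actual transaction (normality is an illustrative simplification)
synthetic_amounts = rng.normal(loc=mean, scale=std, size=1000).clip(min=0)

print(f"real mean/std:      {mean:.2f} / {std:.2f}")
print(f"synthetic mean/std: {synthetic_amounts.mean():.2f} / {synthetic_amounts.std(ddof=1):.2f}")
```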

 Keeping in mind that the size of census block groups can vary considerably between rural (typically relatively large areas with comparatively low population counts) and urban (usually much smaller areas with much higher population counts) areas, a census block group will usually contain between 600 and 3,000 individuals.


One of the key benefits associated with synthetically generated data is that those data can be structured to replicate the key informative characteristics of sensitive or regulated data, enabling organizations to leverage the informational value of those restricted data without the risk of running afoul of the numerous (and expanding) data access, sharing, and privacy considerations. Consequently, synthetic data are commonly used to train machine learning algorithms and to build and validate predictive models. For example, healthcare data professionals are able to use and share patient record-level data without violating patient confidentiality; similarly, risk professionals can use synthetic debit and credit card data in lieu of actual financial transactions to build fraud-detecting models.

Reference. The last of the data origin-framed types of data, reference data can be described as data that make other data more understandable. In keeping with the everyday definition of the term ‘reference’, which is a source of information used to ascertain something, reference data are a listing of permissible values and other details that can be used to ascertain validity of other data. For example, reference data could include a listing of wholesale and retail product prices set by the manufacturer, which can then be used in determining discounted or sales prices; reference data definitions also commonly spell out a number of data layout considerations, such as the fact that the last two digits might represent decimal values. That particular data facet tends to be confused with the similar-but-distinct master data, typically framed as agreed upon definitions shared across an organization. While both reference and master data serve as general benchmarks in the data sensemaking process, the former can be seen as a source of objective and/or technical specifications (e.g., set product prices, units of measurement, etc.), while the latter usually represent the currently agreed upon – within the confines of individual organizations – meaning and usage of different data elements.

The goal of the above general origin-focused data classification is to enable reducing the otherwise overwhelming complexity of modern data ecosystems to a manageable number of analytically meaningful categories, with the ultimate goal of facilitating the development of foundational data analytic competencies. It is important to keep in mind that learning how to extract meaning out of data is a journey that starts with the development of basic-yet-robust data analytic skills and competencies. To that end, the five broad sources of data outlined above are meant to contribute to laying that foundation by offering a way of grouping many overtly different types of data into a handful of general but highly dissimilar categories.


Simplifying the Universe of Data: General Data Classification Typology

The preceding overview provided a general outline of the core data classificatory considerations, focusing primarily on individual data elements, while also addressing data collections commonly referred to as data tables or data files. One additional consideration that has not yet been addressed, but that is needed to complete the framing of data types, is that of data repositories, which are standalone groupings of multiple data tables or files. To be clear, a single data file can constitute a standalone data repository, but it is more common – especially in organizational settings – for a data file to be a part of a larger collection of, typically, related files.10 For example, the earlier mentioned point-of-sales scanner data are usually stored in organized collections of data generically referred to as databases or data warehouses (the latter tend to be larger and encompass wider assortments of data) for reasons that include safekeeping, ongoing maintenance, and convenience. Typically, individual data tables comprising a particular database are linked together in accordance with the logic of a chosen data model,11 such as relational or network, which further shapes the organizational structure of data. And although those technically idiosyncratic organizing aspects of multi-file data collections might seem tangential to data analytic literacy, familiarity with their basic characteristics is in fact important to correctly interpreting the contents of individual data tables (because data elements in table X might be analytically linked with data elements in table Y, in which case computing certain estimates would require some understanding of the nature of those linkages). Therefore, while an in-depth discussion of competing data models and types of databases falls outside of the scope of this overview, a high-level overview of general data aggregates – known as data lakes and data pools – is seen here as playing an important role in developing a more complete understanding of the potential informational value of available data. Broadly characterized as centralized repositories of data, data lakes and data pools are similar insofar as both typically encompass multiple data files, often representing different facets of organizational functioning, such as product, sales, distribution, and promotional details, customer records, etc. The key difference separating data lakes from data pools is standardization, which is the extent to which data captured from different sources have been transformed into a single, consistent format. To that end, data lakes are usually not standardized, whereas data pools are typically standardized. That difference has important implications for downstream data utilization:12 data contained in data pools are more immediately usable, whereas data contained in data lakes may require potentially extensive data pre-processing.

 In general, a collection of related data tables/files is referred to as a database, whereas a collection of related as well as unrelated data tables is referred to as a data warehouse.

 An abstract organizational template that specifies how different elements of data relate to one another and to the properties of real-world entities.


In view of that, individual data elements contained in two or more distinct data tables that are a part of a particular data pool might have more self-evident informational content and value, in contrast to individual data elements spread across two or more data tables contained in a data lake, which may have comparatively more hidden informational value. In other words, tapping into the informational content of the loosely linked, unstandardized contents of data lakes requires either deeper initial familiarity with the contents of individual data tables or more significant investments of time and effort into developing the requisite knowledge. When considered jointly with the earlier discussed data elements and data tables, data repositories round off a general, tripart or three-tier data framing taxonomy of data elements – data tables – data repositories, graphically summarized in Figure 4.3.


Figure 4.3: Data Elements – Data Tables – Data Repositories.

As abstractly summarized in Figure 4.3, a single data feature,13 such as ‘Customer ID’, ‘Selling Price’, or ‘Age’, can be framed in terms of one of three general types (numeric, which can be continuous or categorical, text, or image), and one of five distinct origins (passively observational, actively observational, derived, synthetic, and reference). Analytic properties of that data feature are further framed by the character of the data table the data feature of interest is a part of, which can be either structured or unstructured; lastly, those properties are also shaped by the type of the data repository the data element-containing table is a part of, which can be either a standardized data pool or an unstandardized data lake. It is important to consider the logic of this three-dimensional data classification schema in the clarifying context of the earlier in-depth overview of the data sensemaking considerations comprising each of those three broad grouping dimensions, to fully appreciate the impact of factors not shown in Figure 4.3, such as continuous variables’ scale of measurement. And lastly, it is also important to keep in mind the purpose of this high-level data typology, which is to support the development of basic abilities to identify the key computational and informational aspects of data, as a part of the broader set of skills and competencies comprising data analytic literacy.

 In the context of data management, ‘downstream’ refers to the transmission of data from the point of origin toward an end user (a comparatively less often used ‘upstream’ term refers to the transmission from the end user to a central repository).

 The term ‘data feature’ is used here in place of the more commonly used ‘variable’ term because the latter signifies a value that is subject to change, meaning it can assume different values across data records, which renders it inappropriate for data elements that may be fixed (i.e., are ‘constant’) across records in a particular set of data.



Chapter 5 Knowledge of Methods

The second key domain of the broad knowledge dimension of data analytic literacy is focused on familiarity with methods used to manipulate and analyze data, a broad category encompassing general procedures and processes, as well as the more operationally minded data analytic techniques. It is important to stress that, particularly within the confines of data analytic literacy, the idea of ‘methodological knowledge’ pertains to understanding of applicable rationale and concepts, and it is considered to be separate and distinct from (though obviously closely related to) familiarity with data manipulation and analysis tools and applications, most notably programming languages and software systems. Consequently, the overview of methodological knowledge presented in this chapter is geared toward delineating and discussing – in as simple terms as possible – the core theoretical considerations that underpin the ability to engage in basic data manipulation and analysis work. Framed in the larger context of the conceptual structure of data analytic literacy as graphically summarized in Figure 2.3 (Chapter 2), the ensuing methodological overview aims to expressly define the scope of what constitutes ‘fundamental’ or ‘basic’ data analysis related knowledge. Admittedly, it is a speculative task as there is no generally accepted conception of what, exactly, constitutes foundational data analytic knowledge; in a sense, that is not surprising given the plethora of inconsistent, even conflicting framings of what it means to be data analytically literate (discussed at length in chapters 1 and 2). And yet, it is intuitively obvious that clearly defining the scope of what constitutes elementary data analytic skills and competencies is essential to mapping out systematic and explicit means and mechanisms of knowledge and skill acquisition. The perspective offered in this book is that the broadly defined task of transforming generally not instructive raw data into informative insights demands the foundation of appropriate conceptual knowledge and applied skills, recognized here as two distinct competencies. Moreover, the data analytic literacy rationale outlined in this book further stipulates that the development of the desired methodological ‘know what and why’ should precede acquisition of ‘how-to’ skills, but one should inform the other. That means that the scope of what constitutes conceptual knowledge of interest should reflect the ‘tried and true’ theoretical means and ways of extracting insights out of data, and applied skills should reflect the best available modalities and mechanisms that can be used to operationalize the earlier learned theoretical steps. In other words, the conceptual knowledge and applied skills aspects of data analytic literacy should fit together like pieces in a puzzle, so that together they can give rise to a comprehensive ability to extract meaning out of raw data. Focusing on the core elements of data analytic literacy-enabling theoretical knowledge, the ensuing overview begins with the know what and how required to pre-process, i.e., prepare for analyses, raw data, encapsulated in the data processing sub-domain, followed by conceptual knowledge needed to analyze data, encapsulated in the data analyses sub-domain (see Figure 2.3 in Chapter 2).


Variously referred to as data wrangling, munging, or cleaning, the data processing sub-domain of data analytic literacy describes the rationale and utility of processes used to transform ‘raw’, i.e., not directly analytically usable, data into formats allowing those data to be analyzed. The data analyses sub-domain, on the other hand, encompasses a broad array of statistical and logical techniques that are commonly used to describe and illustrate, condense and recap, and model and evaluate available data. While abstract in terms of its content, the scope of methodological knowledge of data processing and analyses is framed by two comparatively tangible considerations: 1. what are the characteristics of the data to be used (discussed at length in Chapter 4)? and 2. what are the desired data analytic outcomes? The former comprises two distinct facets: understanding of the impact of the earlier discussed data characteristics and, separately, the ability to discern and remedy structural and incidental data analytic impediments. The latter tends to be more nuanced, but in the most general sense, it is rooted in understanding the difference between exploratory and confirmatory data analytic outcomes.

Data Processing

Development of a comprehensive understanding of the appropriate logical and procedural steps that might need to be taken to prepare data for upstream usage requires looking at data from three different perspectives: organization (structured vs. unstructured), type (numeric, text, image), and encoding (continuous vs. categorical). While not all of those distinct characteristics may be salient in a given situation, overall, all can have a direct and material impact on common data manipulation and preparation steps, such as merging of multiple files or aggregating of records within individual files. For example, aggregating a data file typically requires a clear understanding of that file’s organization, variable types, and variable encodings to ensure that the resultant aggregate outcome file correctly captures the input file’s contents; on the other hand, when merging two or more files it may only be necessary to assess cross-file organizational comparability to assure correct record alignment. Separately, building robust knowledge of data processing related approaches and techniques also demands being able to discern and remedy structural and incidental data analytic impediments. Structural data analytic impediments are data layout characteristics that stem from data capture and storage process mechanics; common examples include inconsistent or ambiguous data layouts and formats, or data feature definitions. Incidental impediments, on the other hand, encompass a wide array of data capture, storage, or handling created defects; the most common examples include duplicate, inaccurate, incomplete, or missing data values. Structural impediments are usually systemic (i.e., a predictable result of a particular data capture system’s operating characteristics) and as such lend themselves to the development of standard remediative processes; incidental impediments tend to be more random and thus require more situation-specific solutions, which underscores the importance of robust conceptual knowledge of the potential sources and the nature of those deficiencies, correcting of which is essential to upstream data utilization.
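The two data preparation steps mentioned above – merging files and aggregating records – can be illustrated with a minimal Python/pandas sketch; the table and column names are hypothetical, and the point is simply that merging hinges on a shared key while aggregation hinges on knowing which variables can be summed versus merely counted.

```python
import pandas as pd

# Hypothetical item-level sales extract and a product reference table
sales = pd.DataFrame({
    "sku": ["A1", "A2", "B1", "A1", "B1"],
    "units": [2, 1, 3, 4, 1],
    "revenue": [2.58, 2.49, 2.97, 5.16, 0.99],
})
products = pd.DataFrame({
    "sku": ["A1", "A2", "B1"],
    "brand": ["Brand A", "Brand A", "Brand B"],
})

# Merging: record alignment depends on a shared key ('sku' here)
merged = sales.merge(products, on="sku", how="left")

# Aggregating: continuous features are summed, categorical ones are counted
brand_level = merged.groupby("brand").agg(
    total_units=("units", "sum"),
    total_revenue=("revenue", "sum"),
    line_items=("sku", "count"),
)
print(brand_level)
```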


While some aspects of the required data processing knowledge are formulaic, as exemplified by rules governing aggregation of continuous and categorical variables,1 other ones require situationally determined approaches, typically framed in the context of a careful examination of the difference between ‘current’ and ‘desired’ states of data, as seen from the perspective of data utilization. A structural data impediment known as the ‘flipped’ data layout, frequently encountered with point-of-sale (POS) data captured by retail stores (typically using the ubiquitous bar code scanners), offers a good example of situationally determined data processing knowledge. To start with, in a traditional, two-dimensional POS data layout rows delimit records and columns delimit variables; in that particular layout, rows typically represent items (products) and columns usually represent attributes of those items, i.e., data features, such as price. Consequently, such data structures can be characterized as ‘item-centric’ because items sold are the key organizational units of data, whereas buyers of those items (if they can be identified by means such as loyalty program membership or payment method linking) are an attribute associated with item sales. (Or stated differently, items sold are rows and buyers of those items are columns or data features.) It follows that item-centric data layouts enable transaction-focused but not buyer-focused analyses, thus when the latter is of interest, which, as can be expected, is more often than not, the originally item-centric POS data have to be recoded, or ‘flipped’, into a buyer-centric layout, as graphically illustrated in Figure 5.1.


Figure 5.1: Transforming Item-Centric Data Layout into Buyer-Centric Layout.
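A minimal Python/pandas sketch of the kind of ‘flip’ depicted in Figure 5.1 follows; the column names are hypothetical, and the recoding shown – pivoting item-level sales into one row per buyer – is just one of several ways such a transformation could be implemented.

```python
import pandas as pd

# Hypothetical item-centric POS extract: one row per item sold, buyer as an attribute
pos = pd.DataFrame({
    "item":  ["Item 1", "Item 2", "Item 3", "Item 1", "Item 2"],
    "buyer": ["Buyer 1", "Buyer 1", "Buyer 2", "Buyer 3", "Buyer 3"],
    "units": [1, 2, 1, 3, 1],
})

# Buyer-centric layout: one row per buyer, one column per item, units purchased as values
buyer_centric = pos.pivot_table(index="buyer", columns="item",
                                values="units", aggfunc="sum", fill_value=0)
print(buyer_centric)
```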

 Continuous variables can be summed or averaged (mean, median), in addition to also being expressed in terms of variability (standard deviation), extreme values (minimum, maximum), percentages, etc.; categorical variables, on the other hand, can only be expressed in terms of specific values (first, last) or frequencies/counts.


It should be noted that the ‘original’ vs. ‘transformed’ data layout difference shown in Figure 5.1 illustrates an incommensurability between data layout and analytic needs, not an inherent problem. Item-centric data layouts are perfectly suited for analyses of product sales trends, drivers of sales, or cross-product differences – it is when the analytic focus is on buyers rather than items sold that the item-centric layout becomes unsuitable. Moreover, making that determination is highly situational, and it requires a general understanding of the nuances of data structures as well as of the general methods and processes that can be used to amend those structures (separately from knowing how to execute the desired data manipulations). Turning back to the second of the two data-related analytic impediments, incidental impediments, those data challenges can be thought of as deficiencies brought about by data capture or handling, perhaps best exemplified by incorrect or missing values. Identification of those deficiencies is at the heart of a process commonly known as data due diligence. That general process of reviewing data for incidental deficiencies is rooted in two interrelated considerations: knowing what to look for, and knowing how to look. Those considerations are in turn rooted in examining data from the perspective of the tripart data framing taxonomy discussed in Chapter 4 (see Figure 4.3), built around the hierarchical structure of data elements, data tables, and data repositories. The idea of knowing what to look for and knowing how to look takes on a different meaning when considered from the standpoint of individual data elements, data tables, and data repositories.

Data Element Level: Review and Repair

Reflecting the atomic view of data, data element-focused due diligence is primarily concerned with identifying and correcting deficiencies of individual data features (i.e., variables2). While obviously interdependent, review and correction are nonetheless seen here as two somewhat distinct elements of knowledge:

1. Data Element Review. It is concerned primarily with assessing accuracy, completeness, and consistency of individual data elements. Ascertaining data accuracy requires the ability to describe properties of individual categorical and continuous data elements; it may also require the use of reference data or other benchmarks to determine correctness of data values.

 While recognizing that it is deeply rooted in everyday practice, generically applying the label ‘variable’ rather than ‘data feature’ or ‘data element’ is discouraged here because doing so may lead to characterizing fixed values (i.e., constants) as variables; after all, the term ‘variable’ denotes a value that is liable to change. In that sense, the terms ‘data feature’ and ‘data element’ are more universally appropriate; moreover, the appropriateness of using data feature/element label is even more pronounced when describing attributes of unstructured data records. That said, the common usage of terms ‘data element’, ‘data feature’, and ‘variable’ is meant to communicate the same general meaning.


Assessing completeness of data elements is more straightforward as it calls for identification of instances of missing values and the ability to quantify the extent of that problem (typically expressed as a proportion of the total number of records). And lastly, assessment of consistency of data elements requires the ability to examine like data elements, such as date values, to ensure they use the same representation within individual tables (for example, ‘date’ can be represented using numerous formats such as ‘dd/mm/yy’, ‘dd/mm/yyyy’, ‘mm/dd/yy’).

2. Data Element Repair. The two commonly occurring data element fixes are missing value imputation and normalization of values. Imputation of missing values entails addressing two distinct considerations. First, is the data element under consideration repairable, as considered from the perspective of the proportion of missing values? A commonly used (in practice) rule of thumb is that when the proportion of missing values exceeds about 20% to 25% of all values, it may be analytically unsound to consider imputing missing values. Second, if the proportion of missing values is below the aforementioned threshold, an imputation method needs to be chosen, keeping in mind the underlying measurement properties (i.e., nominal vs. ordinal vs. continuous) and making a well-considered imputation technique choice (e.g., mean replacement, median replacement, linear interpolation). The second of the two commonly observed data element repairs – normalization – within the confines of data element3 repair entails bringing about cross-record consistency, as exemplified by numeric values using the same number of decimal places across records or categorical values adhering to a singular grouping and labeling schema.
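The review-then-repair logic described above maps naturally onto a few lines of code. The sketch below (Python/pandas, hypothetical column names) quantifies missingness, applies the roughly 20%–25% repairability rule of thumb mentioned above, and then imputes with a measurement-scale-appropriate method – median replacement for a continuous feature, mode replacement for a categorical one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 48500, np.nan, 75300, 58200, 49900],          # continuous
    "region": ["North", "South", None, "South", "North", "North", None, "East"],   # categorical
})

MAX_MISSING_SHARE = 0.25  # illustrative threshold from the rule of thumb above

for column in df.columns:
    missing_share = df[column].isna().mean()
    print(f"{column}: {missing_share:.0%} missing")
    if missing_share > MAX_MISSING_SHARE:
        print(f"  -> imputation likely unsound; consider dropping or re-sourcing {column}")
    elif pd.api.types.is_numeric_dtype(df[column]):
        df[column] = df[column].fillna(df[column].median())        # continuous: median replacement
    else:
        df[column] = df[column].fillna(df[column].mode().iloc[0])  # categorical: mode replacement

print(df)
```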

Data Table Level: Modification and Enhancement

Individual data elements are commonly organized into collections known as data tables or files, which typically combine informationally or otherwise related data elements and facilitate orderly storage and retrieval of data. Data tables can be thought of as standalone mini informational repositories whose contents are subject to informational optimization-minded review; more specifically, the spirit of data table level due diligence is to identify possible modifications and/or enhancements to table-bound collections of data elements.

1. Data Table Modification. The vast majority of data available to business and other organizations represent passively captured (i.e., recorded as a part of a larger process) transactional and communication details reflecting various facets of their operations – consequently, the manner in which those data are structured is determined by capture and storage rather than usage driven considerations.

 Normalization is a multi-faceted concept that takes on different meanings across the continuum of data analytics; consequently, the seemingly repetitive use of the idea of normalization actually entails operationally distinct, context determined steps and outcomes.


Predictably, data tables frequently require substantial structural modifications before meaningful analyses can be undertaken; those modifications, commonly referred to as data feature engineering, can take the form of aggregation, normalization, and standardization. Aggregation transforms granular details into summary values; for example, point-of-sales transactional data are captured at the individual item level, which can be summed to the unit or brand level. The remaining two feature engineering steps – normalization and standardization – are more nuanced, and in fact are frequently confused. Without delving into the underlying mathematical details, normalization entails bringing about cross-data element uniformity in the sense of making (actual or implied) measurement scales comparable, whereas standardization entails rescaling of individual data elements to a mean of 0 and a standard deviation of 1, which effectively replaces original values with an idealized notion of the number of standard units away from the mean of 0. The core benefit of normalization is that it enables direct comparisons of effects measured on different scales – for instance, the originally not directly comparable ‘household income’ (typically measured in thousands of dollars) and ‘age’ (typically measured in years) magnitudes, once normalized, can be compared to one another (e.g., the average Brand X customer may have a household income that falls in the 60th percentile overall, and that customer may fall into the 80th percentile age-wise). Standardization also contributes to making comparative assessments easier, but the primary benefit of standardization is cross-record (e.g., across customers or companies) rather than cross-data feature comparisons. Given their notional similarity and the resultant usage and methodological fuzziness, it is worthwhile to consider their distinctiveness in the context of their respective computational formulas, which are as follows:

Standardization: (x − x̄) / SD

Normalization: (x − min) / (max − min)

where x is the actual value, x̄ is the mean, SD is the standard deviation, min is the minimum value, and max is the maximum value.
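The two formulas can be expressed in a few lines of code; the sketch below (Python/pandas, with illustrative column names) applies min–max normalization and z-score standardization to the ‘household income’ and ‘age’ example used above.

```python
import pandas as pd

customers = pd.DataFrame({
    "household_income": [42000, 58000, 61000, 97000, 150000],  # measured in dollars
    "age": [23, 31, 44, 52, 67],                               # measured in years
})

# Normalization: (x - min) / (max - min), rescales each feature to the 0-1 range
normalized = (customers - customers.min()) / (customers.max() - customers.min())

# Standardization: (x - mean) / SD, re-expresses each value as standard units from the mean
standardized = (customers - customers.mean()) / customers.std()

print(normalized.round(2))
print(standardized.round(2))
```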

2. Data Table Enhancement. Enriching the potential informational content of data tables most commonly takes the form of adding new features (data elements) derived from existing ones. New feature derivation typically makes use of simple arithmetic or logical operations to compute new data elements as derivatives of one or more existing features.


A common example is offered by using a ‘date’ field to create a binary ‘yes-no’ indicator, where a coupon redemption ‘date’ field is recoded into an easier to use yes vs. no encoded ‘coupon redemption’ indicator. A somewhat more conceptually and methodologically complex example of data table enhancement is offered by the creation of interaction terms, which capture the otherwise lost information created by interdependence between two standalone variables.4
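Both enhancement examples mentioned above can be sketched briefly in Python/pandas; the column names are hypothetical, and the interaction term is computed as a simple product of two standalone features, which is one common (but not the only) way such terms are formed.

```python
import pandas as pd

orders = pd.DataFrame({
    "coupon_redemption_date": ["2023-02-14", None, "2023-03-02", None],
    "price": [19.99, 24.50, 9.75, 14.25],
    "quantity": [2, 1, 4, 3],
})

# Derived binary indicator: a populated redemption date becomes a yes/no flag
orders["coupon_redeemed"] = orders["coupon_redemption_date"].notna().map({True: "yes", False: "no"})

# Derived interaction term: product of two standalone features
orders["price_x_quantity"] = orders["price"] * orders["quantity"]

print(orders)
```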

Data Repository Level: Consistency and Linkages

The third and final data due diligence tier is the data repository, an enterprise data storage entity (or a group of entities, such as several databases) in which data have been organized and partitioned for analytical, reporting, or other purposes. While all data repositories contain multiple data tables, some, most notably data lakes, are primarily data holding entities that encompass unprocessed structured and unstructured data. Other types of repositories, most notably data pools, best exemplified by data warehouses, are intended for more defined purposes, such as tracking of customer purchase behaviors, thus their data contents tend to be processed, structured, and unified. Still, given that both data lakes and data pools draw data from a variety of sources (which raises the possibility of cross-source data element encoding and other types of differences), when the data analytic focus extends beyond a single table it is necessary to undertake repository-level data due diligence. More specifically, it is necessary to assure consistency of related data elements (e.g., like-formatting of ‘date’ fields or comparability of classificatory schemas used to denote ‘accident type’), in addition to also specifying explicit cross-table linkages for accurate and complete data sourcing.

1. Data Repository Consistency. The often-expansive varieties of organizational data commonly encompass multiple sources of essentially the same types of data, which can give rise to taxonomical, formatting, and other differences. For example, a large casualty insurance company’s workers compensation claims data repository may encompass data captured using now-retired mechanisms, i.e., so called legacy systems, and data captured using newer claim processing systems. The old and the new systems may use different accident and injury type encoding taxonomies, something that is, in fact, rather common, rendering notionally the same data (e.g., ‘accident type’, ‘injury type’, etc.) and by extension the same types of informational outcomes (e.g., cross-time comparisons of types of accidents and injuries) inconsistent.5
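A minimal Python/pandas sketch of the consistency-then-linkage sequence follows; it illustrates the taxonomy reconciliation just described together with the cross-table linking discussed under point 2 below. The legacy-to-current code mapping and the table and column names are hypothetical, and a real reconciliation effort would involve far richer taxonomies.

```python
import pandas as pd

# Claims captured by a legacy system and by a newer system, using different taxonomies
legacy_claims = pd.DataFrame({"claim_id": [1, 2], "accident_code": ["SLP", "MVA"]})
current_claims = pd.DataFrame({"claim_id": [3, 4], "accident_type": ["Fall", "Vehicle"]})

# Step 1 - consistency: map the legacy codes onto the current classification
legacy_to_current = {"SLP": "Fall", "MVA": "Vehicle"}  # hypothetical crosswalk
legacy_claims["accident_type"] = legacy_claims["accident_code"].map(legacy_to_current)

all_claims = pd.concat([legacy_claims[["claim_id", "accident_type"]],
                        current_claims], ignore_index=True)

# Step 2 - linkage: join the reconciled claims to a related table via a shared key
payments = pd.DataFrame({"claim_id": [1, 2, 3, 4], "paid": [1200, 5400, 800, 3100]})
linked = all_claims.merge(payments, on="claim_id", how="left")
print(linked)
```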

 Interaction effects are typically considered in the context of causal modeling, where the combined effect of two explanatory variables, singularly known as ‘main effects’, can have an otherwise unaccounted for impact on a response variable.

 The data consistency problem tends to be particularly pronounced for acquisition focused companies because the combination of competing taxonomies (e.g., different industry classification schemas) and transaction processing systems creates a very high probability of incommensurability of notionally like data captured using functionally different systems; moreover, those differences may be extremely difficult – at times even impossible – to reconcile.


And yet, repository-wide consistency of like-data is one of the keys to increasing organizational data utilization, which underscores the importance of examining – and reconciling, when possible – taxonomical and formatting representations of notionally like data spread across distinct datasets (tables and/or databases).

2. Data Repository Linkages. Full utilization of available data is another common problem confronting business and other ‘data-rich but information-poor’ organizations. Data generated by the numerous and diverse systems and devices that comprise the broadly conceived electronic transaction processing and communication infrastructure are commonly stored and processed on a standalone basis – e.g., Product X UPC scanner sales data are distinct from Product X shipment and tracking data – which requires purposeful and explicit linking of informationally related data tables. It is important to note that linking of distinct data tables needs to be preceded by a repository-wide data consistency examination, to assure taxonomical and formatting comparability of linked data files. One of the most common methods of implementing linking of informationally related data files within a repository is to create a master file that enumerates individual file linkages.

The preceding high-level overview of data element, data table, and data repository related data validity and usability considerations was meant to highlight the complex nature of data, specifically when considered from the perspective of potential deficiencies. It is easy to get excited about the informational value of the rich varieties of data now so readily available, but that excitement needs to be accompanied by the realization of the often significant amount of work that might be required to make those data analytically usable. More importantly, the conceptual knowledge that is required to be able to review, identify, and remedy any systemic or situational data challenges, in combination with applied data manipulation skills (discussed later), is seen here as one of the fundamental building blocks of data analytic literacy. The remainder of this chapter is focused on the key elements of data analysis related conceptual knowledge. Keeping in mind the foundational scope of data analytic skills and competencies discussed in this book, the ensuing overview of the otherwise very expansive domain of the conceptual know what and why of basic data analytics is limited to what are considered here the most critical elements of that knowledge.



Data Analyses: The Genesis

The ability to learn and the desire to understand things represent the essence of what is commonly believed to be intelligent life, and thus they are among the defining characteristics of mankind. And while systematic attempts at trying to gain understanding of the world are as old as the oldest human civilizations, it was the period now known as the Age of Reason (or more broadly as the Enlightenment), which began in Europe in the latter part of the 17th century, that set into motion a rigorous philosophical and scientific discourse, which in turn ushered in an era of unprecedented techno-scientific progress. Not surprisingly, the roots of the core elements of data analysis related knowledge reach all the way back to the rise of the Age of Reason, and retracing those roots offers an interesting glimpse at the steady progression of conceptual knowledge that underpins the ability to transform raw and often messy data into informative insights. Figure 5.2 offers a graphical summary of the key developments.

[Figure 5.2 depicts a timeline stretching from the Age of Reason to the Age of Data: early probability (the work of Cardano, Fermat, and Pascal); the rise of statistics (Bayes’ theorem, 1761; Gauss’ normal distribution, ca. 1795); statistical inference (least squares, 1805; the first statistics department, 1911); the maturation of statistics (Fisher’s Statistical Methods, 1925, and Design of Experiments, 1935; Tukey’s The Future of Data Analysis, 1961); machine learning and neural networks (backpropagation method for training artificial neural networks, 1960s); and data science (1997), big data (1998), IoT (1999), deep learning, and cognitive computing.]

Figure 5.2: The Development and Maturation of Data Analytic Methods and Mechanisms.

Grounded in the foundational work of Pascal, Fermat, and Cardano, who laid the early mathematical foundations of the now widely used probability theory, and further fueled several decades later by the seminal work of Gauss (normal distribution) and Bayes (conditional probability), modern data analytic rationale and computational techniques began to emerge in the form of statistical inference. Broadly characterized as the theoretical rationale and computational methods that can be used to draw generalizable conclusions from imperfect and/or incomplete (i.e., sample) data, statistical inference formed the basis of the steady growth and maturation of the theory and practice of data analytics. The emergence and rapid development of electronic devices and the broader electronic infrastructure in the latter part of the 20th century gave rise to previously unimaginable volumes and varieties of data, which in turn created the need for more automated data analytic mechanisms.


Commonly known as machine learning, those automated data analytic mechanisms are built around the idea of extracting insights out of data without following explicit (i.e., human) instructions, relying instead on computing logic inspired by biological neural networks (the processing units comprising the human brain), broadly known as artificial neural networks (ANN). The more recent and thus more advanced embodiments of ANN, deep learning and cognitive computing, represent more evolved, in the sense of being even more autonomous, data analytic modalities.

Data Analyses: Direct Human Learning

The goal of the very abbreviated history of the evolution of data analytic methods and mechanisms summarized in Figure 5.2 is to offer a general backdrop to help contextualize the ensuing overview of the core elements of data analytic knowledge. In principle, that knowledge encompasses the longstanding concepts and methods of probability and statistical inference, seen here as a mechanism of structured human sensemaking, as well as the comparatively more recent, largely technology-driven developments in the form of machine learning, deep learning, and cognitive computing, which reflect the progressively more automated – even autonomous – ways and means of utilizing data.6 And while the idea of data analytic literacy discussed in this book is aimed at structured human sensemaking manifesting itself as data manipulation and analysis related conceptual knowledge and applied skills, it also gently implies the importance of appreciating the differences between human analytic sensemaking and machine-centric data processing and analyses. That is because human sensemaking based data analytic competencies exist in a larger context of multi-modal data utilization in which the ceaselessly advancing artificial systems play an ever-greater role, as discussed in chapters 1 and 2. Stated differently, to appreciate the benefits of being data analytically literate in a world which is growing more and more reliant on technology, it is important to consider the idea of data related knowledge from the broad perspective of direct and indirect human learning, the former encapsulated in the idea of data analytic literacy, and the latter in the dynamic context of machine learning. Focusing on direct human learning, the development of a sound understanding of the essence of data analysis centers on the notions of estimation and inference. Within the confines of data analysis, estimation is the process of finding the best possible approximations for unknown quantities, whereas inference is the process of drawing conclusions from available evidence.

6 Those considerations are discussed in more detail in Banasiewicz, A. (2021). Organizational Learning in the Age of Data, Springer Nature.


At the intersection of those two pivotal ideas is the notion of statistical inference, a special type of inference broadly defined as the process of using structured approaches and techniques to draw conclusions about a population of interest while only using incomplete and typically imperfect subsets (i.e., samples) of data. The goal of statistical inference encapsulates the core utility of data analytics as seen from the perspective of human learning: to derive maximally valid and reliable information out of (typically) imperfect and messy data.

At the most rudimentary level, determining sound approaches to analyses of available data is shaped by a combination of data type and informational focus, or stated differently, by the interplay between available data and informational needs. The former encompasses the earlier discussed knowledge of the computational implications of encoding related differences between continuous and categorical numeric values, and more broadly between numeric, text, and image data, and still more broadly between structured and unstructured data, while the latter reflects the desired data analytic outcomes. While desired outcomes might seem hard to meaningfully frame, when considered in a more abstract sense they can be grouped into two broad categories: the generation of new insights, and the examination of the validity of currently held beliefs. When considered from the perspective of data analyses, the generation of new insights takes the form of exploratory analyses, whereas the validation of prior beliefs encompasses a broad family of statistical techniques jointly characterized as confirmatory analyses.

Broadly characterized, exploratory data analysis (EDA) entails sifting through new or previously not analyzed data in search of informationally meaningful patterns or associations. Spanning a wide range of degrees of complexity and sophistication, and typically making use of descriptive statistical techniques to bring out patterns and associations hidden in the details of raw data seen as collections of informative and non-informative features (the latter commonly referred to as noise or residuals7), the purpose of EDA is to extract informative patterns while discarding non-informative noise.8 Confirmatory data analysis (CDA), on the other hand, is a collection of statistical techniques focused on empirically testing the validity of prior beliefs using the tools of statistical inference; to be testable, those prior beliefs need to be expressed as formal hypotheses, which can be derived either from established theoretical frameworks or from practical experience. In contrast to EDA, which can be seen as evidence-gathering detective work, CDA can be portrayed as a court trial where the validity of evidence is tested.

7 As originally detailed by John Tukey in his seminal 1977 work Exploratory Data Analysis, Addison-Wesley.
8 It is worth noting that the EDA process is meant to be iterative – following the extraction of initial informative patterns, residuals are then recursively examined for additional patterns until no new patterns are discernable, at which point the residuals are discarded and the analysis is deemed complete.


A brief but important note on the definitional scope of CDA: In the context of data analytic literacy, the scope of confirmatory analyses is narrowed to just those elements of that broad domain of knowledge that relate to testing of knowledge claims, formally known as hypotheses, seen here as a part of foundational data analytic knowledge and competencies. However, when considered in a broader methodological sense, CDA also encompasses making forward-looking projections, commonly referred to as predictive analytics. Itself a broad domain of knowledge, predictive analytics utilizes comparatively methodologically complex multivariate statistical models and machine learning algorithms, and the use of those techniques requires considerable methodological knowledge (as well as computational skills) – simply put, the predictive analytics facet of CDA represents advanced, not foundational, data analytic competencies, and thus falls outside the scope of this basic knowledge focused overview.

Method-wise, EDA and CDA data analytic techniques can be grouped into dependence and interdependence families: Dependence methods aim to discern (EDA) or confirm (CDA) the existence, strength or direction of causal relationships between the phenomena of interest (typically referred to as dependent, criterion, or target variables) and explanatory or predictive factors (commonly labelled as independent or predictor variables); interdependence methods, on the other hand, aim to either uncover (EDA) or confirm (CDA) more generally defined associations between data features or between entities. The regression family of methods, which includes linear, ordinal, and ridge, as well as binary and multinomial logistic techniques, offers probably the best-known example of dependence techniques, while correlation and cluster analysis are among the best-known interdependence methods. However, the bulk of the dependence and interdependence families of methods are comparatively complex multivariate – i.e., simultaneously considering interactions among multiple variables – techniques; in fact, linear and logistic regression are often considered 'entry level' predictive analytic methods, thus a more in-depth overview of those techniques falls outside the scope of basic knowledge focused data analytic literacy. The general difference between the rationale of dependence vs. interdependence methods is graphically summarized in Figure 5.3.

Figure 5.3: Dependence vs. Interdependence Methods.


Rounding out the general overview of exploratory and confirmatory data analyses is the somewhat less formalized mode of analysis. When considered from that perspective, exploratory and confirmatory analyses can take either statistical or visual path, with the former manifesting itself in formal mathematical tests, while the latter emphasizing more approximate visual assessment of graphically expressed patterns and relationships. More specifically, the statistical mode of analysis makes use of structured mathematical or logical techniques that yield probabilistically conclusive results, typically expressed as quantitative estimates; those estimates are then interpreted within the general rules of statistical inference discussed earlier, producing specific conclusions. In contrast to that, the visual mode of analysis replaces numeric statistical tests with visual interpretation of graphically expressed aggregate summary trends and patterns, which are then interpreted in a less defined or less structured manner (i.e., is there a visually clear difference between X and Y?). In a way of an illustration, a brand analyst interested in determining if there are meaningful differences in average purchase amounts across several different regions could utilize the statistical mode of analysis to search for statistically significant differences, which in this case would be tantamount to computing formal statistical estimates using a method such as analysis of variance (ANOVA with F-test), and then drawing conclusions warranted by structured interpretation of the computed numeric test parameters. The same analyst could also use the visual mode of analysis, which in this case would mean graphically summarizing average sales for each region using, for instance, bar graphs to portray region-specific average sale magnitudes. The ensuing interpretation would entail ‘eyeballing’ of graphically depicted differences while also considering the underlying numeric (i.e., average sales values), and the final conclusion would be based largely on subjective interpretation of perceived differences. As illustrated by that simple example, statistical data mining largely delegates the task of identifying meaningful differences, patterns, and associations to structured, algorithmic logic, whereas visual data mining relies on subjective assessment of mostly graphically expressed differences, patterns, and associations. Instinctively, data users tend to favor the statistical mode of analysis, partly because of past academic training steeped in the scientific method (discussed in more depth in the context of confirmatory analyses), and partly because of what is generally seen as higher degree of analytic precision (algorithmic, and thus objective, comparisons of mathematically derived estimates are generally seen as more robust than subjective eyeballing of graphically expressed differences). And in many instances, most notably when available data are highly accurate and complete, that is not only appropriate but in fact desirable – however, there are situations where that might not be the case. Starting from the premise that not everything that is worth recording (as data) is fully and reliably recordable, there are decision contexts that are characterized by limited, incomplete or otherwise messy data; in those contexts, the perceived robustness of statistical data mining may be illusory because as it is well-known, garbage in, garbage out . . . 
In those situations, visual data mining may offer more dependable sensemaking mechanisms, because it
affords data users the freedom to interpret available facts (i.e., data) in a broader informational context.9 It is important to keep in mind that the use of highly structured mathematical methods cannot compensate for significant data shortcomings – in fact, the perceived accuracy of those methods may give rise to an unwarranted level of confidence in data analytic outcomes (more on that later). With that in mind, the mode of analysis should be aligned with the quality of available data and with the general information usage context – for example, with whether the applicable trends are stable or volatile, and with how closely the data analytic outcomes map onto the way the resultant information is to be used. The preceding general characterization of exploratory and confirmatory analyses hides numerous nuances, and while an in-depth discussion of those nuances falls outside of the scope of this book (and there is a rich variety of statistics focused texts readily available), some of the key data analytic proficiency related considerations are discussed next.
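
Before moving on, the two modes of analysis can be made concrete with a brief sketch; the regions and purchase amounts below are entirely hypothetical, simulated only to illustrate the brand-analyst example described above, and the code is an illustrative outline rather than a prescribed workflow:

```python
# An illustrative sketch of the two modes of analysis; the regions and purchase
# amounts are hypothetical, simulated purely for demonstration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
regions = {
    "North": rng.normal(52, 9, 200),   # average purchase amounts, in dollars
    "South": rng.normal(50, 9, 200),
    "West":  rng.normal(57, 9, 200),
}

# Statistical mode: a formal one-way ANOVA (F-test) comparing the region means
f_stat, p_value = stats.f_oneway(*regions.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Visual mode: the per-region means one would plot as a bar graph and 'eyeball'
for name, values in regions.items():
    print(f"{name}: mean purchase = {values.mean():.2f}")
```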

Exploratory Analyses

The open-ended character of the idea of data exploration can make the task of developing a robust understanding of the spirit of EDA overwhelming. However, rather than trying to make sense of data exploration by thinking in terms of the endless arrays of possible data analytic outcomes, it might be more informative to consider it from the perspective of the scope of analysis, which can take on one of two forms: descriptive, which entails examining data features, typically one at a time, with the goal of deriving empirical pictures or characterizations of phenomena of interest, and associative, where potential relationships between two or more variables10 are investigated, and which represents a comparatively more methodologically involved way of exploring data. Descriptive EDA tends to be methodologically simpler, typically relying on basic univariate (i.e., single variable) analysis, while associative EDA tends to utilize comparatively more complex bivariate (pairs of variables) and multivariate (interactions among three or more variables) techniques. Methodological complexity aside, being geared toward different types of informational utilities, descriptive and associative exploratory analyses yield noticeably different types of data analytic outcomes: Descriptive analyses attribute previously unknown characteristics to known or anticipated states or outcomes, as exemplified by profiling high value customers using available descriptive characteristics (demographics, purchase specifics, etc.),11 whereas associative analyses are comparatively non-specific with regard to their outcomes.

9 This line of reasoning loosely parallels the logic of Bayesian statistics (discussed in more depth in Chapter 8), which formally incorporates prior beliefs into probabilistic inference.
10 It is worth noting that a number of sources also expressly differentiate between bivariate (two variables) and multivariate (more than two variables) analyses, but since the former is a special case of the latter, the more parsimonious two-tier univariate vs. multivariate distinction is used here.


Built around the idea of an open-ended search for any and all patterns and associations that might be hidden in data, associative EDA aims to identify any previously unknown relationships between two or more variables, with its outcomes being jointly shaped by analysis-impacting data features' characteristics (i.e., the earlier discussed data types) and types of comparisons, discussed later in this section.

The data characteristic that exerts the most profound impact on descriptive as well as associative EDA is the explicit or implied measurement scale, and especially the distinction between categorical and continuous data features. To restate, categorical variables assume only values that represent discrete and distinct categories; moreover, the organizational logic of those categories (within a distinct data feature) can be either nominal, which are unranked labels, such as 'male' and 'female' gender designations, or ordinal, where individual categories are rank ordered, as exemplified by 'small', 'medium' and 'large' size groupings. Lastly, it is also important to keep in mind that categorical variables can be encoded using any combination of digits, letters, and special characters (including digits-only); in other words, categorical variables can be either numeric or text (or alphanumeric). Continuous variables, on the other hand, capture values of measurable properties expressed on ranges bounded by two extremes,12 where any value between the two extremes is possible, as exemplified by age or weight; it follows that continuous values have to be encoded as digits-only. It is important to note that whereas data features encoded using letters or any combination of letters, digits, and special characters should be treated as categorical, digits-only encodings are more nuanced. The overt similarity of digits-only data element encodings can be confusing; thus it is important to not lose sight of the elementary difference between the use of digits as labels and the use of digits as magnitudes: the former should be treated as categorical and the latter as continuous variables. The strong emphasis on correctly identifying individual data features as either categorical or continuous in terms of their underlying measurement properties stems from recognition of fundamental differences in the mathematical (and thus computational) properties of the two general variable types. More specifically, basic arithmetic operations – addition, subtraction, division, multiplication – cannot be performed on categorical variables, which renders those data features informationally poorer than variables measured on a continuous scale, as continuous variables allow all basic arithmetic operations.

11 The so-called high-value customer baseline is typically constructed by first grouping customers into mutually exclusive spending-based groupings, which is then followed by profiling each group (e.g., computing mean or median values for each group for continuous variables, or frequency counts for categorical variables) using behavioral, demographic, psychographic and other details.
12 Being bounded by two extremes does not contradict the idea of an infinite number of values because the distance between those two extremes can be divided into ever smaller units (to use a graphical parallel, any line segment can be subdivided into ever smaller segments, which can go on, in principle, infinitely).
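
A minimal sketch of the high-value customer baseline idea described in the footnote above might look as follows; the column names, spending figures, and the use of spending terciles are illustrative assumptions rather than a prescribed recipe:

```python
# A minimal sketch of the 'high-value customer baseline' idea from the footnote
# above; all column names and values are hypothetical.
import pandas as pd

customers = pd.DataFrame({
    "annual_spend": [120, 310, 95, 870, 640, 150, 1200, 75, 430, 990],
    "age":          [34, 45, 29, 52, 47, 31, 58, 26, 41, 50],
    "region":       ["N", "S", "N", "W", "W", "S", "N", "S", "W", "N"],
})

# Step 1: mutually exclusive spending-based groupings (here, simple terciles)
customers["value_tier"] = pd.qcut(customers["annual_spend"], 3, labels=["low", "mid", "high"])

# Step 2: profile each group - means/medians for continuous features,
# frequency counts for categorical ones
print(customers.groupby("value_tier", observed=True)["age"].agg(["mean", "median"]))
print(customers.groupby("value_tier", observed=True)["region"]
      .value_counts().unstack(fill_value=0))
```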


Consequently, basic descriptive characterizations of categorical data features are limited to just raw counts and relative frequencies of individual categories, whereas continuous data features can be described using numerous measures of central tendency, variability, and spread.13 Approach-wise, being comprised of a finite, and typically relatively small, number of groups, categorical data features naturally lend themselves to visual descriptions, with basic tools such as bar or pie graphs offering an easy-to-interpret way of conveying absolute and relative frequency counts. Still, it is worth noting that some categorical classifications can be quite numerous. For instance, descriptors such as 'occupation' or 'location' can yield a large number of distinct values (categories), which when graphed or tabulated might produce visually and informationally unintelligible representations. In those situations it might be appropriate to collapse granular categorizations into more aggregate ones; here, the reduction of informational specificity can be worthwhile in view of the enhanced readability of outcomes. An added and somewhat less obvious benefit of collapsing multitudinous categories is the corresponding reduction of the amount of noise, or non-informative content in data, which stems from the tendency of low frequency categories to contribute disproportionately more noise than meaningful information.
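
The following brief sketch illustrates the point about describing a categorical feature and collapsing its low-frequency categories; the occupation labels and the 5% cutoff are hypothetical choices made purely for illustration:

```python
# A small sketch of describing a categorical feature and collapsing its
# low-frequency categories into an 'Other' bucket; the occupation labels and
# the 5% threshold are illustrative assumptions.
import pandas as pd

occupation = pd.Series(
    ["teacher"] * 40 + ["nurse"] * 35 + ["engineer"] * 15 +
    ["florist"] * 3 + ["falconer"] * 2 + ["glassblower"] * 5
)

counts = occupation.value_counts()                  # raw frequencies
shares = occupation.value_counts(normalize=True)    # relative frequencies
print(counts)
print(shares.round(3))

# Collapse categories that account for less than 5% of all records
rare = shares[shares < 0.05].index
collapsed = occupation.where(~occupation.isin(rare), other="Other")
print(collapsed.value_counts())
```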

Exploring Continuous Values

As noted earlier, the key analysis-related difference between categorical and continuous data features is that the latter can be manipulated using basic arithmetic operations. Consequently, exploration of the informational content of individual continuous variables can yield numerous and varied analytic outcomes, most notably in the form of statistical descriptions of the typical, or average, value, the typical amount of variability or deviation from the average value, and the observed range of values, or the difference between the smallest and the largest value, all of which manifest themselves in the form of multiple, seemingly competing estimates. Thus while, at least in principle, more analytic outcomes should yield deeper insights, that plurality can be overwhelming. Moreover, the confusing plurality of notionally alike but computationally different estimates is further compounded by some unfortunate common usage conventions, such as equating the general notion of 'average' with the arithmetic mean. All considered, to assure the methodological soundness of exploratory analyses of continuous data features, it is essential to carefully consider the three core characteristics13 of the so-measured variables – average, variability, and range – in the context of the specific computational approaches that can be used to generate those statistics.

13 The computational differences between categorical and continuous variables also suggest that informationally richer continuous data features can be converted (i.e., recoded) into categories, but not the other way around – for instance, continuously measured 'age' (e.g., 59) can be grouped into 'age range' (e.g., 50–59), but it is impossible to discern exact age from age range.
Moreover, it is just as important to evaluate the appropriateness, or lack thereof, of competing estimation approaches, especially in applied data analytic contexts.

Estimating Average: Setting aside the existence of other, albeit relatively rarely used in applied analytics, expressions of the 'mean' in the form of geometric and harmonic means, there are three distinct mathematical expressions of the general notion of average: mean, median, and mode, and in spite of often being the top-of-mind choice, the mean is not always the most appropriate statistic to use. In fact, quite often the mean may be the least appropriate expression of average to use because, as a computed parameter,14 its estimate is influenced by outliers, or atypically small or large values. Under some very specific circumstances, most notably when the distribution of the variable of interest is perfectly symmetrical (i.e., the so-called standard normal distribution), the mean, median, and the mode can be expected to yield approximately the same numeric estimate, rendering choosing one over the others largely a moot point, unless a broader description of 'average' is sought.15 However, in practice those instances are fairly rare, as many of the commonly analyzed phenomena or outcomes are asymmetrically distributed, which is usually a consequence of more cases clustering toward either of the two ends of the underlying value continuum. For instance, considering customer value to a brand, virtually all brands boast noticeably more low value than high value customers; similarly, there are significantly more trivial than severe automotive accidents, there are considerably more people living in poverty than those living in opulence, and the list goes on. Thus quite commonly the values of the mean, the median and the mode will diverge, ultimately leading to a dilemma: Which of those numerically different estimates of the average offers the most robust approximation of the 'typical' value? Under most circumstances, when describing either positively or negatively skewed variables,16 the median tends to yield the most dependable portrayal of the typical value. The reason for that is rather simple: The median is the center-most value in a sorted distribution, with exactly half of all records falling above and below it;17 it is also unaffected by outliers, since it is an actual, rather than a computed, value.

14 The sum of all values divided by the number of individual values (typically the number of records in a dataset).
15 Standard deviation, perhaps the most commonly used measure of variability, can only be used with the mean.
16 A distribution is skewed if one of its tails is longer than the other; a positive skew means that the distribution has a long tail in the positive direction (i.e., higher values), while a negative skew means that the distribution has a long tail in the negative direction (i.e., lower values).
17 In situations in which there is an even number of data records (in which case there is no actual value that falls symmetrically in the center of the distribution), the median is computed by taking the average of the two values closest to the middle of the distribution.


The far more frequently, and often erroneously, used mean tends to yield biased estimates of the average because it is a computed estimate (derived by summing all values and dividing the total by the number of cases) whose magnitude is directly impacted by outlying, or extreme, cases. Lastly, in theory, the mode can also yield a reliable depiction of the average (like the median, it is also an actual, rather than a computed, value), but in practice there are often multiple modes, especially in larger datasets, which ultimately diminishes the practical utility of that statistic. The preceding considerations are summarized in Figure 5.4.

Figure 5.4: Choosing the Appropriate Average. (The figure contrasts a symmetrical distribution, where mean = median = mode, with a skewed distribution, where the mode, median, and mean diverge.)
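
A short, self-contained illustration of the divergence depicted in Figure 5.4 – using a small set of hypothetical, positively skewed 'customer value' figures – is sketched below:

```python
# A small, hypothetical illustration of the divergence summarized in Figure 5.4.
import pandas as pd

customer_value = pd.Series([40, 55, 60, 60, 70, 80, 95, 110, 150, 2400])  # one extreme value

print("mean:  ", customer_value.mean())           # 312.0 - pulled up by the 2,400 outlier
print("median:", customer_value.median())         # 75.0 - the center-most, outlier-resistant value
print("mode:  ", customer_value.mode().tolist())  # [60] - the most frequent actual value
```

As the output shows, a single extreme value pulls the mean far above the median, which is why the median is usually the safer depiction of the 'typical' value for skewed data.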

Another important consideration when estimating the deceptively straightforward average values is the generalizability of the estimate. In many, perhaps even most business and other applied data analytic contexts, parameter values such as the mean or the median are computed using samples of applicable data. It is well-known that even the most carefully selected sample can be expected to be a somewhat imperfect reflection of the underlying population, as captured in the notion of sampling error. Simply put, it is extremely difficult to, firstly, account for all possible population-defining characteristics and, secondly, to select a sample in such a way that its makeup does not deviate in any way from the underlying population in terms of those characteristics. In view of that, to be generalizable onto the larger population, sample-derived parameter estimates need to account for that imprecision, which is typically accomplished by recasting the so-called point estimates, which are exact values, into confidence intervals, or value ranges that account for sampling error and frame the estimated range in terms of the chosen level of statistical significance18 (discussed in more detail later in this section). Thus rather than reporting that, for instance, a sample-derived ‘mean purchase price’ is $20.95 (a point estimate), one would report that – assuming the commonly used 95% level of confidence (see footnote #18) and the estimated standard error of $1.50 – the ‘mean purchase price’ falls somewhere between $18.01 and $23.89 ($20.95 ± 1.96 × $1.50).

18 Confidence Interval = Mean ± factor × Standard Error (a measure of variability of the sample-estimated mean), where 'factor' is the number of standard units (1.645, 1.96, and 2.576 for the 90%, 95%, and 99% levels of significance).
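
The recasting of a point estimate into a confidence interval reduces to a few lines of arithmetic; the sketch below simply reproduces the purchase-price example from the preceding paragraph (the $20.95 mean and $1.50 standard error are the values assumed there):

```python
# Recasting a point estimate into a 95% confidence interval, reproducing the
# purchase-price example above ($20.95 mean, $1.50 standard error).
mean_purchase = 20.95     # sample-derived point estimate
standard_error = 1.50     # variability of the sample-estimated mean
z = 1.96                  # standard units for the 95% confidence level

lower = mean_purchase - z * standard_error
upper = mean_purchase + z * standard_error
print(f"95% confidence interval: ${lower:.2f} to ${upper:.2f}")   # $18.01 to $23.89
```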


Estimating Variability: Keeping in mind that the essence of computing an average is to collapse multiple individual data records into a single summary value, an estimate of cross-record value variability adds an important and complementary piece of information. Broadly characterized as the amount of spread of actual values around the estimated average, it captures the typical degree of departure (i.e., the up and down deviations) from the computed average. In a sense, it can be seen as a measure of the heterogeneity of individual records with regard to the value of a particular variable – the higher the deviation, the more diverse (in terms of the spread of values) a data set. It is appropriate to think of variability as a complement to average – in other words, a measure that captures additional information lost during the collapsing of individual records. Given that, it is reasonable to expect computationally distinct expressions of that general notion to be associated with the mean, median, and the mode. Variance, computationally defined as the average squared deviation from the mean, and its derivative the standard deviation, computed as the square root of variance, are both expressions of the variability of individual values around the computed mean; standard deviation is expressed in easy-to-interpret 'standard units away from the mean' and thus tends to be the (arithmetic) mean related expression of variability most commonly used in practice. An alternative measure of variability is the average absolute deviation, also referred to as the mean absolute deviation. It is a comparatively lesser-known measure of variability, but it is a generalized summary measure of statistical dispersion that, in principle, can be used with the mean, median, and the mode, though in practice it tends to be used with the median and mode only.

Estimating Range: The third and final continuous variable descriptor is the range, which is simply a measure of dispersion of the values of individual data records, operationalized as the difference between the largest and smallest value of a data feature in a dataset. While conceptually similar to standard deviation, it is computationally quite different as it only takes into account two values – the maximum and the minimum – and ignores the rest (whereas the standard deviation is computed using all values). While useful as an indication of the absolute spread of data points, range can nonetheless produce unreliable or even unrealistic estimates, especially if outliers are present. An obvious remedy would seem to be to simply filter out extreme values, but doing so is both difficult and problematic. It is difficult because while there are several recommended outlier identification approaches,19 those mechanisms are closer to general rules of thumb than precise and objective calculus; thus while it might be relatively easy to pinpoint some outliers in some situations (e.g., notable business tycoons would certainly stand out if compared to others in their respective high school graduating

19 One of the better-known outlier identification approaches is to convert raw values into z scores, which express each value as the number of standard deviations away from the mean; then, if a value has a high enough or low enough z score, it can be considered an outlier. A widely used rule of thumb suggests that values with a z score greater than 3 or less than –3 ought to be considered outliers.


classes in terms of income), it is quite a bit more difficult to do so in other situations (e.g., the cost of insurance claims tends to grow in a continuous fashion, making it difficult to separate ‘normal’ from ‘outlying’ values). It is problematic because, after all, outliers are typically valid data points and excluding them from analyses can bias the resultant estimates. For instance, when measured in terms of economic damage, hurricanes Katrina (2005, $186.3B), Harvey (2017, $148.8B), and Maria (2017, $107.1B) are, as of early part of 2023, the three costliest storms in US history – when compared to the median cost, Katrina, Harvey, and Maria were each more than 50 times costlier, which makes all three clear outliers. Given that, should those storms be excluded from any damage forecasts and related estimates? It is a difficult question to answer because while those are real and important data points (the argument against exclusion), those three data points can be expected to exert disproportionately large, and potentially distorting influence on at least some of the forward-looking estimates (the argument for exclusion). Oftentimes, the basic descriptive estimates summarized above are computed using subsets of available data with the purpose of projecting those sample-based estimates onto a larger population. Under most circumstances, even samples selected using carefully designed approaches, such as stratified sampling,20 are somewhat imperfect representations of the underlying populations, which means that sample-based estimates are imperfect representations of (unknown) population values. In order to account for that degree of inaccuracy when projecting sample-based estimates onto the underlying population, it is necessary to recast those point estimates into confidence intervals, which represent ranges of probable values. Framed in the context of statistical significance (discussed later), confidence intervals are commonly computed at one of three confidence levels: 90%, 95%, or 99%, with the 95% being the most commonly used one. Keeping everything else (most notably the number of records and variability) the same, the higher the confidence level, the wider the confidence interval. Given that narrower ranges are preferred to wide ones, the choice of the level of significance entails a tradeoff between the desired level of statistical confidence (an important consideration more fully addressed later) and the desired specificity of estimates. The 95% level of statistical confidence, falling in the middle of the three most commonly used thresholds tends to be the default choice because it is often seen as the optimal solution to that trade-off.

20 A three-step sample selection method used with populations that encompass distinct segments (e.g., ethnic groups) – the first step is to divide the population into subpopulations (called strata) that differ in important ways, then determine each subpopulation's share of the total, and lastly to randomly select an appropriately sized sample from each subpopulation; the subsamples are then combined into a single stratified sample. Stratified sampling is one of four distinct variants of probability sampling (the other three are simple random, systematic, and cluster).
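
Pulling the preceding ideas together, the short sketch below computes the standard deviation, the mean absolute deviation, and the range for a set of simulated insurance-claim-like amounts, and applies the |z| > 3 rule of thumb mentioned in the earlier footnote to flag a potential outlier; the figures are hypothetical and the code is only an illustrative outline:

```python
# A brief sketch of the variability and range descriptors discussed above, plus
# the |z| > 3 rule of thumb for flagging outliers; the claim amounts (in $000s)
# are simulated and purely illustrative.
import numpy as np

rng = np.random.default_rng(11)
claims = np.append(rng.normal(2.0, 0.5, 50), 40.0)   # 50 typical claims plus one extreme one

std = claims.std(ddof=1)                              # sample standard deviation
mad = np.abs(claims - claims.mean()).mean()           # mean absolute deviation
value_range = claims.max() - claims.min()             # driven almost entirely by the extreme claim

z_scores = (claims - claims.mean()) / std
outliers = claims[np.abs(z_scores) > 3]               # the common |z| > 3 rule of thumb
print(f"std = {std:.2f}, MAD = {mad:.2f}, range = {value_range:.2f}")
print("flagged as outliers:", outliers)
```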


Exploring Associations

A natural extension to examining informational properties of individual data features is to assess potential relationships among those features. For example, the analysis of 'customer spend' almost naturally suggests looking into possible associations between that outcome and various customer characteristics, such as 'age' or 'income'. A word of caution: As postulated by Simpson's paradox, the direction or strength of empirical relationships may change when data are aggregated across natural groupings that should be treated separately. In the context of the above example, assessment of a potential relationship between, for instance, 'age' and 'spending' should contemplate the possibility of natural data subsets, such as gender-based ones. In that case, a search for an overall (i.e., no gender breakouts) relationship might yield no discernible (i.e., statistically significant) age-spending association, but within-gender investigations may uncover a different pattern.

As implied by the above example, the assessment of between-variable relationships is, at its core, a pairwise comparison, even though numerous such relationships may be examined concurrently.21 And once again, the categorical vs. continuous distinction plays a pivotal role, as it determines how, method-wise, individual pairwise relationships should be assessed. More specifically, the categorical vs. continuous distinction gives rise to three methodologically distinct types of associations: 1. both data features are categorical, 2. both are continuous, and 3. one is categorical and the other continuous.

Categorical Associations: Relationships between two categorical variables are most commonly assessed using contingency table based cross-tabulations (often abbreviated as crosstabs) with the χ2 (chi square) test of statistical significance of individual pairwise associations. In a statistical sense, χ2 is a nonparametric test, which means that it places no distributional requirements on data; however, in order to yield unbiased estimates, the test requires data to represent raw frequencies (not percentages), and the individual categories to be mutually exclusive and exhaustive. The underlying comparisons are conceptually and computationally straightforward: the analysis simply compares the difference between the observed and expected (i.e., calculated using probability theory, in this case by multiplying the row and column totals in the contingency table, and dividing the product by the total count) frequencies in each cell to identify patterns of co-occurrences that are more frequent than those that could be produced by random chance. It should be noted that the underlying assessment is binary, or yes vs. no, in its nature, meaning that while it might point toward the presence of an association, it offers no information regarding the strength of that association.

21 As exemplified by the familiar correlation matrices, which combine multiple values into a single grid; given the two-dimensional (rows vs. columns) layout of those matrices, each coefficient estimate represents a distinct association at the intersection of an individual row and column.


Since the χ2 test only answers the question of whether or not the association of interest is statistically significant, to assess the strength of categorical associations a different statistic, known as Cramer's V, is needed.

Continuous Associations: Relationships between two continuous variables are most often examined using correlation analysis. In the definitional sense, correlation is a concurrent change in the value of two numeric data features; more specifically, it is a measure of linear association between pairs of variables. In keeping with the informationally richer nature of continuous variables, correlation analysis yields not only an assessment of statistical significance of individual associations, but also an estimate of their direction and their strength, where both estimates are captured using a standardized correlation coefficient with values ranging from −1 to 1. The sign of the correlation coefficient (i.e., '−' or the implied '+') denotes the direction of the association, which can be either positive (an increase in the value of one variable is accompanied by an increase in the value of the other one) or negative (an increase in the value of one variable is accompanied by a decrease in the value of the other); the strength of the association is represented by the magnitude of the correlation coefficient, which ranges from perfectly negative or inverse (−1) to perfectly positive or direct (1), with the value of 0 reflecting no association.

An important, though often overlooked caveat of correlation analysis is that it is limited to only the simplest possible type of association, which is linear. A linear relationship is just one of many different types of associations; graphically represented by a straight line plotted in the context of Cartesian coordinates (the familiar X-Y axes), it stipulates that a change in X (data feature 1) is accompanied by a proportional change in Y (data feature 2), with the magnitude and direction of that change being captured in the correlation coefficient, as graphically illustrated in Figure 5.5. The linear nature of correlation is an important caveat because it means that the absence of statistically significant correlation should be interpreted as the absence of linear association only, not the absence of any association, as there could well be an unaccounted for nonlinear relationship between X and Y. That very important caveat is, unfortunately, often overlooked in applied analyses, leading to possibly erroneous conclusions of 'no association' between variables of interest. Yet another important aspect of the linear character of correlation analysis is the implicit assumption of a constant rate of change, meaning that the direction and strength of the X-Y association will remain unchanged, in theory, indefinitely. Here, unfortunately, there is no easy formulaic fix – the best, practice-rooted advice is to consider the correlation coefficient-communicated strength and direction of the X-Y association only within a 'reasonable' range of X and Y values; of course, the conclusion of what constitutes 'reasonable' is inescapably situational and judgment-prone. Also worth noting is that while numerous correlation formulations have been proposed in the course of the past several decades, two particular approaches gained widespread usage among academicians as well as practitioners: Pearson's product-moment correlation and Spearman's rank correlation.


Figure 5.5: Linear vs. Nonlinear Associations.
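
The caveats discussed above can be made tangible with a few lines of simulation; in the sketch below the data are entirely synthetic, generated only to show that a strong but nonlinear (U-shaped) relationship yields a near-zero Pearson coefficient, and that a handful of injected outliers dampens Pearson's r more than Spearman's rank-based ρ:

```python
# A simulated illustration of the correlation caveats discussed above; all data
# are synthetic and the specific numbers carry no substantive meaning.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 500)

# Strong but nonlinear (U-shaped) dependence: Pearson's r is close to zero
# because the association is not linear
y_curved = x ** 2 + rng.normal(0, 0.2, 500)
print("Pearson r, U-shaped relationship:", round(stats.pearsonr(x, y_curved)[0], 3))

# A linear relationship contaminated with a few extreme outliers: Pearson's r
# is dampened, while the rank-based Spearman's rho is largely unaffected
y_line = 2 * x + rng.normal(0, 1, 500)
y_line[:5] += 60
print("Pearson r, with outliers:   ", round(stats.pearsonr(x, y_line)[0], 3))
print("Spearman rho, with outliers:", round(stats.spearmanr(x, y_line)[0], 3))
```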

Interestingly, being more widely recognized,22 Pearson's correlation is also more frequently used inappropriately, meaning used in situations in which Spearman's correlation might be more appropriate. The choice between the two hinges on the X and Y variables' distributional characteristics, most notably the shape of each of the two variables' distribution. Pearson's correlation is appropriate when variables are normally distributed (i.e., symmetrical) and relatively outlier-free, as the presence of outliers may exaggerate or dampen the strength of a relationship estimated using that formulation. Spearman's correlation, on the other hand, is appropriate when one or both variables are skewed; moreover, Spearman's correlation estimates are also largely unaffected by outliers.

Mix-Pair Associations: The logic that governs assessing relationships between continuous and categorical variables is noticeably different from the logic of assessing like-type variables' associations, as the former aims to capture different expressions of co-occurrence. To determine if a categorical and a continuous variable are related, average (mean) values of the continuous variable need to be contrasted across the categories comprising the categorical variable; if the mean values are found to be different, the continuous and the categorical variables of interest are deemed to be related. In the simplest case scenario, which is one where the categorical variable is comprised of just two groups, the assessment logic calls for comparing the mean values for 'group 1' and 'group 2' using the t-test,23 an inferential statistic developed expressly to determine if there is a statistically significant difference between the means of two groups.

22 Commonly known as Pearson's r correlation coefficient (in fact, in common usage the r coefficient is frequently, and incorrectly, seen as a universal correlation symbol); Spearman's rank correlation coefficient is denoted by the Greek letter ρ (rho), and Kendall's rank correlation coefficient is denoted by another Greek letter, τ (tau).
23 While the focus of this overview is limited to just a high-level overview of general methodological approaches, it is nonetheless worth noting that there are several different computational t-test formulations, most notably independent samples, paired samples, and one-sample t-tests; the general discussion presented here implicitly assumes the use of the independent samples t-test.


If the categorical variable of interest is comprised of more than two groups, the comparison needs to utilize a different inferential statistic known as the F-test. Though the general rationales of the F-test and t-test are conceptually similar – i.e., both compare values of means computed for each distinct group comprising the categorical variable – it is nonetheless important to use the correct test (the t-test for two-group categorical variables and the F-test for categorical variables comprised of more than two groups) to assure sound results.

Regardless of the type of association, assessment of relationships is overwhelmingly probabilistic (meaning results are only approximately correct) due to the imperfect nature of data used to estimate those relationships.24 Stated differently, if data used to assess cross-variable associations are incomplete (i.e., represent a subset of the total population) and/or analytically imperfect (e.g., contain missing, incorrect or outlying values), assessments of those relationships need to be expressed in terms of the degree of believability, technically referred to as statistical significance. While, in principle, the desired statistical significance can be set at any level, the three commonly used p-value thresholds are 1%, 5% or 10% (usually expressed as .01, .05 and .1, respectively), when considered from the perspective of the probability of an estimated value not being true, or 99%, 95% and 90%, when considered from the perspective of an estimated value being true.25 The three distinct tests of association outlined earlier, where at least one of the two variables is categorical – the χ2 test, t-test, and the F-test – are all in fact mechanisms of assessing statistical significance under conditions of different underlying scales of measurement (i.e., continuous vs. categorical). In terms of their practical interpretation, those tests can be seen as 'yes' vs. 'no' (i.e., statistically significant vs. not statistically significant) indicators. That, however, is only partly true for Pearson's and Spearman's correlation analyses used to assess bivariate continuous variables' associations. In keeping with being informationally richer, the assessment of Pearson's and Spearman's correlation coefficients incorporates formal tests of statistical significance (the F-test), but in addition to that, it also includes additional information in the form of the direction and the strength of association-communicating correlation coefficients.

24 Those typically fall into two general categories: 1. structural deficiencies, such as incorrect or missing values, and 2. sampling deficiencies. The latter stem from the fact that for reasons of practicality or applicability, the vast majority of applied analyses use subsets of all pertinent data, and it is generally accepted (because it can be easily demonstrated) that it is nearly impossible to select a sample that is a perfect representation of the underlying population.
25 Rooted in the formal logic of hypothesis testing, according to which the null hypothesis can be erroneously rejected or accepted, those two complementary expressions consider result believability from the perspective of either a 'false positive', known as Type I error and expressed as .01, .05, or .1 (i.e., 1%, 5%, 10%), or the complementary level of confidence, expressed as 99%, 95%, or 90%. So, either the probability of being incorrect or the probability of being correct.
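
A compact sketch of the mixed-pair logic – a t-test when the categorical variable has two groups and an F-test (one-way ANOVA) when it has more – is shown below; the promotion groups and spending figures are simulated for illustration only:

```python
# A sketch of the mixed-pair association logic: compare mean spend across the
# levels of a categorical variable; the promotion groups and amounts are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
spend_a = rng.normal(55, 12, 150)   # customers who received Promotion A
spend_b = rng.normal(50, 12, 150)   # customers who received Promotion B

# Two categories: independent-samples t-test
t_stat, p_t = stats.ttest_ind(spend_a, spend_b)
print(f"t-test:  t = {t_stat:.2f}, p = {p_t:.4f}")

# More than two categories: F-test (one-way ANOVA)
spend_c = rng.normal(53, 12, 150)   # a third promotion group
f_stat, p_f = stats.f_oneway(spend_a, spend_b, spend_c)
print(f"F-test:  F = {f_stat:.2f}, p = {p_f:.4f}")
```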


The Vagaries of Statistical Significance

The general logic of statistical significance testing characterized above seems compelling, and as a more in-depth examination of those tests' computational mechanics would reveal, the statistical significance tests' computational logic is also mathematically sound. There is, however, a critical flaw in those tests that severely limits their practical usefulness: tests of statistical significance are strongly influenced by sample size, and are thus biased. More specifically, the likelihood of a test statistic, such as a t-test or a correlation coefficient, being deemed statistically significant is highly correlated with the number of records used in computing statistical significance. That dependence is so pronounced that even at a moderately large record count, such as 10,000 or so, inordinately trivial magnitudes can be deemed to be statistically significant. In fact, it can be easily shown (by recomputing a particular test of statistical significance using randomly selected smaller and larger subsets of data) that merely reducing the number of records will often result in initially (i.e., computed using a larger number of records) statistically significant estimates becoming not significant (when re-computed using a smaller number of records). The reason for that is fairly straightforward: A key input into the statistical significance calculation is the standard error of the mean (commonly referred to simply as standard error), a measure of the relative precision of sample-based mean estimates, the magnitude of which sets the statistical significance evaluation benchmark.26 Mathematically, the standard error is the standard deviation (a measure of dispersion of values in a set of data) divided by the square root of the number of records; the standard deviation is itself computed as the square root of variance, which in turn is the average of the squared deviations from the mean, derived by dividing the sum of squared deviations by the number of records. This simple progression highlights the cascading impact of the number of records on upstream estimates, and it clearly shows that finding progressively smaller computed effects to be statistically significant is an inescapable consequence of the mathematical mechanics of those tests.

If simply working with a larger dataset will, everything else being the same, translate into ever smaller estimates being deemed statistically significant, what then are the implications of using statistical significance tests in the era of big data? The sample size dependence is not just a hard to overlook limitation – it casts doubt on the informational utility of statistical significance testing in exploratory analyses (and quite possibly in other contexts where large data sets are utilized). Consequently, it is not surprising that a growing chorus of statisticians is advocating relying on what is commonly referred to as effect size, a measure that better captures the practical importance of estimates by assessing the meaningfulness of associations.

26 The expected value of a parameter under consideration forms the basis for determining what constitutes the typical, i.e., not brought about by random chance, magnitude of an estimated quantity – if the expected value is small, small observed values will be seen as not chancy, and thus statistically significant.
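
The sample size dependence described above is easy to demonstrate; in the sketch below a fixed, trivially small mean difference of 0.5 units (roughly 0.03 standard deviations) is tested at several record counts, with the p-value shrinking as the number of records grows – the data and group sizes are, of course, hypothetical:

```python
# A demonstration of the sample size dependence described above: the same
# trivial 0.5-unit mean difference moves from clearly 'not significant' to
# highly 'significant' purely as records are added.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
for n in (200, 5_000, 100_000):
    group_a = rng.normal(100.0, 15.0, n)
    group_b = group_a + 0.5     # contrived: an identical group shifted by a fixed, trivial amount
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"n = {n:>7,} per group: t = {t_stat:.2f}, p = {p_value:.4f}")
```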


For relationships between two continuous variables, effect size can be calculated using the earlier discussed Pearson's correlation coefficient (r), or another statistic known as Cohen's d, which is computed as the difference between two means divided by the pooled standard deviation (i.e., estimated using data from both groups).27 To assess the effect size between categorical variables, one of the most commonly used measures is the phi (φ) coefficient, computed as the square root of the ratio of the χ2 (chi square) test statistic to the sample size (n);28 Cramer's V, which can be seen as an extension of the phi coefficient, is another frequently used categorical effect size estimate.
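
The two effect size measures can be computed directly from their definitions; the sketch below uses made-up group values and a made-up 2×2 contingency table, and is meant only to illustrate the mechanics (for a 2×2 table, Cramer's V reduces to the phi coefficient):

```python
# The two effect size measures computed from their definitions; the group values
# and the 2x2 contingency table below are hypothetical.
import numpy as np
from scipy import stats

# Cohen's d: difference between two means divided by the pooled standard deviation
group_1 = np.array([52.0, 61.0, 58.0, 49.0, 66.0, 57.0])
group_2 = np.array([45.0, 50.0, 48.0, 53.0, 41.0, 47.0])
n1, n2 = len(group_1), len(group_2)
pooled_sd = np.sqrt(((n1 - 1) * group_1.var(ddof=1) + (n2 - 1) * group_2.var(ddof=1))
                    / (n1 + n2 - 2))
cohens_d = (group_1.mean() - group_2.mean()) / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")

# Phi coefficient: the square root of chi-square divided by the total count
table = np.array([[120, 80],
                  [60, 140]])
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
phi = np.sqrt(chi2 / table.sum())
print(f"chi-square = {chi2:.1f}, p = {p:.4f}, phi = {phi:.2f}")
```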

Confirmatory Analyses

The second of the two broad data utilization avenues, confirmatory analyses, aims to assess the validity of prior beliefs using structured approaches and tools of statistical inference. By way of contrast, whereas the goal of the earlier discussed exploratory analyses is to derive new knowledge claims, the goal of confirmatory analyses is to validate existing knowledge claims, using a mechanism known as the scientific method. Tracing its origins to 17th century Europe, where it emerged as a means of advancing objectively verifiable knowledge, the scientific method can be seen as a procedural manifestation of inductive reasoning,29 which uses observed facts (i.e., data) as the basis from which probabilistic general conclusions are derived. The process that governs the scientific method is as follows: A conjecture of interest, such as a theoretical prediction or some longstanding belief, is reframed into a testable hypothesis,30 which is a tentative statement that can be proved or disproved using data. Once the requisite data have been captured or accessed, appropriate analytic approaches are identified, following which analyses are carried out, yielding a conclusion that either proves or disproves the stated hypothesis. It is important to note the tentative character of inductively generated insights, which are accepted as true until or unless refuting evidence emerges; that basic epistemological property is known as falsifiability, and it posits that a knowledge claim that cannot be refuted is not a scientific claim.

One of the core elements of the scientific method's knowledge claims validation process is the earlier discussed statistical significance testing, which serves as the litmus test of hypothesis falsification efforts.

27 Cohen's d = (Mean 1 – Mean 2) / Pooled Standard Deviation.
28 φ = √(χ2 / n).
29 The analog to inductive reasoning is deductive reasoning, which derives definite conclusions (in contrast to the probabilistic conclusions yielded by inductive reasoning) from accepted premises – for instance, if both X and Y premises are true, conclusion Z is taken to be true.
30 A general belief or a conjecture might suggest that, for example, different promotional offers can be expected to yield varying response rates – to be empirically testable, such a general conjecture needs to be restated as a more tightly worded hypothesis that lends itself to being answered in a simple yes vs. no manner – for example, 'Promotion A Response Rate will be greater than Promotion B Response Rate'.


And although the informational goals of exploratory and confirmatory analyses are noticeably different, the logic of significance testing (i.e., to differentiate between spurious and systematic effects) is the same in both contexts, and thus so are the limitations of hypothesis testing, most notably its sample size dependence. The challenge that is more unique to confirmatory analysis, however, is that the scientific method-based testing of knowledge claims requires a clear 'yes' vs. 'no' assessment mechanism, which suggests a more nuanced use of the idea of 'effect size' introduced earlier. More specifically, given that there are no universally accepted standards for what constitutes 'large', as in 'significant', and 'small', as in 'not significant', effects, the desired outcome of remedying significance tests' sample size dependence requires a somewhat more methodologically involved approach built around several technical considerations, which include the earlier discussed ideas of statistical significance testing and effect size, as well as an additional consideration in the form of statistical power. Statistical power is the probability that a given test can detect an effect (e.g., a material difference between two groups or a meaningful relationship) when there is one. Admittedly, the relationship between the notions of 'statistical power' and 'statistical significance' is somewhat confusing, given that, in a general sense, both speak to the efficacy of the mechanism of hypothesis testing. That said, when looked at from the perspective of distinctiveness, statistical power can be framed as a critical test design consideration that captures the desired efficacy of the test of interest (e.g., the F-test), whereas statistical significance can be seen as a key test evaluation element, one that spells out the criterion to be used to decide if the hypothesis of interest is true or false. In that sense, both notions contribute something unique to the task of determining the optimal sample size that should be used in testing of hypotheses, or one that simultaneously minimizes the undesirable sample size effect while maximizing the power of the test.

To restate, unlike the open-ended exploratory analyses, the purpose of which is to sift through all available data in search of any informative insights, confirmatory analyses have a comparatively more defined focus, which is to test the validity of specific knowledge claims in an objective 'yes' vs. 'no' manner. In view of that, and acknowledging the challenge stemming from the sample size dependence of statistical significance testing, confirmatory analyses impose additional (i.e., beyond those discussed in Chapter 4) analytic dataset preparation steps. It is important to temper the instinct of using more records just because there are many thousands, even millions of records readily available; the nuanced nature of hypothesis testing does not respond well to a 'brute force' type of mindset that is typically rooted in the unfounded belief that larger record counts translate into more robust data analytic outcomes. In fact, reliance on excessively large, 'unfiltered' datasets may not only render the process of statistical significance testing indiscriminate – it may unnecessarily inflate the amount of non-informative noise, ultimately resulting in less robust outcomes.


The Importance of Rightsizing It follows from the preceding overview that sound means of determining the number of records to be used in testing of hypotheses is critically important to confirmatory analyses, given the outsized influence of sample size on the efficacy of statistical methods of inquiry. With that in mind, there are well established sample size determination methods that could be leveraged here to determine the optimal sample size, defined here as the number of records needed to attain the desired degree of statistical significance, in a way that supports the desired mitigation of the adverse effects of sample size inflation. Before delving into operational details of those methods, it is important to note two general considerations: Firstly, sample size determinations are made at the level of individual groups, which are distinct subsets of records associated with a particular state or effect type. For instance, a test contrasting response rates generated by Promotion A and Promotion B is comprised of two distinct groups (those who received Promotion A and those who received Promotion B), which requires that the optimal sample size be determined for each of the two groups. Secondly, while the general logic of sample size determination is essentially the same for continuous (i.e., comparison of means) and categorical (i.e., comparison of proportions), there are differences in estimation mechanics, hence optimal sample size determination needs to be considered separately for continuous and categorical values. Consequently, the first step in the process of determining the optimal number of records to be used to test hypotheses of interest calls for clear specification of the type of test, which can be either a comparison of continuous or categorical quantities in the form of means or proportions, respectively. More specifically, estimation of the optimal sample size for testing effects expressed either as means or as proportions calls for three distinct inputs: 1. chosen level of statistical significance, which captures the probability of incorrectly concluding that there is an association or a difference (between means or proportions) and expressed as α value (also known as the Type I error) and typically set to equal .1, .05 or .01, 2. desired power, which is an expression of the probability that the test can detect an effect, if there is one, commonly set at 80% or 90%, 3. effect size, which is an expression of practically meaningful differences (between means or proportions) or associations. Recalling that sample size needs to be determined at the level of individual groups, group-level optimal sample size needed to enable empirically sound test of a hypothesis of interest can be determined as follows:   Significance + Power 2 Sample Sizegroup 1 = 2 Effect Size


Of the three inputs, the level of statistical significance and statistical power are both chosen by the data user, while the effect size can be estimated using standard computational methods, one developed for continuous quantities and another for categorical ones. Starting with the former, the effect size is expressed as a simple ratio of the absolute value of the difference between the two means, and the standard deviation of the outcome of interest, summarized as follows:

\[ \text{Effect Size}_{\text{continuous}} = \frac{|\text{Mean}_1 - \text{Mean}_2|}{\text{Standard Deviation}} \]

Moving onto the categorical effect size estimation, that estimate is expressed as a ratio of the absolute value of the difference in proportions between groups 1 and 2, and the square root of the product of the overall proportion and one minus the overall proportion, where the overall proportion is computed by taking the mean of the proportions of the two groups, summarized as follows:

\[ \text{Effect Size}_{\text{categorical}} = \frac{|\text{Proportion}_1 - \text{Proportion}_2|}{\sqrt{\text{Overall Proportion}\,(1 - \text{Overall Proportion})}} \]

Expressed in a more mathematically formal manner, the optimal sample size determination logic outlined above can be stated as follows:

\[ N_i = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^{2} \]

where,

\[ ES_{continuous} = \frac{|\mu_1 - \mu_2|}{\sigma} \qquad \text{or} \qquad ES_{categorical} = \frac{|p_1 - p_2|}{\sqrt{p(1-p)}} \]

and where,
N is the sample size needed for each group
α is the level of statistical significance
Z1-α/2 is the statistical significance term; it takes on a standard value of 1.645, 1.96, or 2.576 units for the .1, .05, and .01 levels of significance, respectively
Z1-β is the statistical power term; for the commonly used 80% or 90% power values, it takes on standard values of 0.84 or 1.282, respectively
ES is the effect size
μ is the mean
σ is the standard deviation
p is the proportion
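To make the above mechanics concrete, the brief sketch below implements the per-group sample size logic just described using only Python’s standard library (NormalDist supplies the Z multipliers listed above); it is a minimal illustration rather than prescribed code, and the function names and example response rates are hypothetical.

```python
# Minimal sketch of the per-group sample size logic described above;
# names and example values are illustrative assumptions, not the book's own code.
import math
from statistics import NormalDist


def effect_size_continuous(mean_1, mean_2, std_dev):
    # |Mean1 - Mean2| / Standard Deviation
    return abs(mean_1 - mean_2) / std_dev


def effect_size_categorical(p_1, p_2):
    # |Proportion1 - Proportion2| / sqrt(Overall Proportion * (1 - Overall Proportion)),
    # with the overall proportion taken as the mean of the two group proportions
    p_overall = (p_1 + p_2) / 2
    return abs(p_1 - p_2) / math.sqrt(p_overall * (1 - p_overall))


def per_group_sample_size(effect_size, alpha=0.05, power=0.80):
    # N = 2 * ((Z(1-alpha/2) + Z(1-beta)) / ES)^2, rounded up to a whole record count
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Hypothetical example: Promotion A vs. Promotion B response rates of 12% vs. 10%
es = effect_size_categorical(0.12, 0.10)
print(per_group_sample_size(es, alpha=0.05, power=0.80))  # records needed per group
```

As the example suggests, a small effect size translates into per-group record counts in the thousands, which is why the effect size, rather than sheer record availability, should drive the sizing decision.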


The preceding sample size determination details play a pivotal role in ensuring a high degree of believability of data analytic results, which should obviate the need for the often-seen practice of differentiating between practically meaningful and not practically meaningful statistically significant results.31 It is difficult to defend a practice that effectively subjectifies manifestly objective data analytic outcomes, which is what happens when subjective judgment is used to decide which statistically significant outcomes should be deemed material – at the same time, the impulse to do so is understandable in view of the earlier discussed sample size dependence of significance tests (which results in magnitudinally trivial differences or associations being deemed as ‘significant’ as those with much larger values). That undesirable state of affairs is often a result of what could be characterized as a lack of analysis dataset selection discipline, manifesting itself in the use of all available records to compute the effects of interest (which itself is a product of the erroneous belief that more is better). Using rational sample sizing geared toward the attainment of multifaceted analytical robustness (i.e., a combination of the chosen level of significance, the desired power, and an empirical effect size assessment) should obviate the need for infusing biased subjective judgment into the assessment of data analytic outcomes.

The Question of Representativeness

To give rise to valid and reliable estimates, samples used in confirmatory analyses also need to be representative of the underlying population, if the results of confirmatory analyses are to be generalized onto that population (which is usually the case). It is commonly, and often erroneously, assumed that drawing a random sample from a population of interest will assure a high degree of sample-to-population similarity – here, it is not the general belief that is invalid but rather the associated, and largely glossed over, assumptions. In order for a randomly selected sample to closely mirror the larger population, that population needs to be highly homogeneous in terms of its key characteristics, which in practice is relatively rare; moreover, the selection logic requires that each element in the population has an equal chance of being selected, which in many applied contexts may not be the case. The makeup of the US population, widely depicted as a mosaic of distinct ethnic, occupational, socio-economic, and

 It is a common practice in applied business analytics to differentiate between statistically significant results that are seen as ‘material’ and those that are not, which is a direct consequence of the earlier discussed inflationary impact of sample size on tests of statistical significance. Essentially, given a large enough sample (typically a few thousand records and up), even trivially small magnitudes can become highly statistically significant, which then forces data users to infuse subjective judgment (something that runs counter to the very idea of objective analyses) by more or less arbitrarily differentiating between material and immaterial outcomes; the ultimate goal of optimal sample size selection is to make that undesirable practice unnecessary.


many other sub-segments, offers a good illustration of the fallacy of an unqualified belief in the efficacy of random selection: Even assuming that one had the ability to select at random from a population of more than 330 million spread over a large and widely dispersed area (and that is a big assumption that, in reality, may only be true for the US Government), independent random drawings of reasonably large (i.e., a few thousand) sets of individuals would more than likely yield compositionally different samples. A common business practice of conducting customer insights focused research by selecting what is believed to be a representative sample of all users of a particular brand illustrates the difficulty of fulfilling the second random sample assumption, which is that each element in the population of interest (here, all users of Brand X) must have an equal probability of being selected. More often than not, a given brand (or any other business entity) may simply not have a complete listing of all its users; instead, it may only have a listing of those who identified themselves through means such as loyalty program enrollment or product registration signups. In short, while selecting at random may seem like a good idea, it is important to carefully consider the underlying methodological demands before committing to that course of action. Given the difficulties of selecting adequately representative samples, it is reasonable to assume some degree of non-representativeness, and then take steps to, first, assess the extent of non-representativeness, and then to identify appropriate remediation steps. Here, the statistical concept of sampling error offers a way to estimate the degree of sample non-representativeness;32 it is computed by dividing the standard deviation of the population by the square root of the sample size, and then multiplying the resultant quotient by the multiplier value associated with the data user-chosen level of statistical significance.33 Commonly referred to as the z-score, the said multiplier represents approximately 1.65, 1.96, or 2.58 standard units away from the mean, which corresponds to .1, .05, and .01 levels of significance, respectively, or alternatively, accounts for 90%, 95%, or 99% of the area under the standard normal distribution. It is worth noting that, as implied by the sampling error computational formula (see footnote #33), the sample-to-population disparity, as measured by sampling error, decreases as sample size increases. However, as also implied by that formula, the change is not proportional, meaning that large increases in sample size will yield comparatively small decreases in sampling error. Simply put, it all means that while it might be tempting to increase the sample size to reduce the sampling error, the benefit of doing that might not be worth

 It is worth noting that the use of the term ‘error’ in statistics is somewhat different from the everyday meaning conveying a mistake; in statistics, the notion of error is used to denote a deviation from (sampling error) or variation around (standard error of the mean) some point of reference. In other words, sampling error is not a mistake in the sense of an action that can be avoided or corrected – rather, it is an extremely difficult to avoid consequence of subsetting data, or retrieving/using just a part (i.e., a subset) of a larger universe.
33 \( \text{Sampling Error} = Z\text{-score} \times \frac{\text{Standard Deviation}}{\sqrt{\text{Sample Size}}} \)


the cost in the form of the earlier discussed adverse impact of large sample sizes on the assessment of statistical significance. If some degree of sample non-representativeness can be assumed to be unavoidable, it is reasonable to ask why even bother with assessing that problem, i.e., with quantifying sampling error? While in a broad sense the answer touches on a myriad of philosophical and methodological considerations, in the narrower context of applied analytics it centers on the efficacy of statistical inferences, commonly manifesting themselves as generalizability and projectability of data analyses-derived insights. Derivation of generalizable and/or projectable insights from imperfect and/or incomplete data (if data were 100% accurate and complete, coverage- and values-wise, the resultant data-derived values would be certain and error-free and data analytic results could be stated as facts rather than inferences) calls for assessing the degree of believability of what should be considered speculative insights, which in turn necessitates formal assessment of sampling error. Recalling that within the confines of statistical analyses the term ‘error’ is used to denote the degree of deviation rather than a mistake (as used elsewhere), within the confines of generalizing or projecting sample-derived estimates, the estimated sampling error can be used to transform sample-bound (and thus non-generalizable and non-projectable) point estimates into confidence intervals, in a manner discussed in the context of exploratory analyses.34

✶✶✶

The preceding overview of foundational knowledge of data analysis takes a relatively narrow view of exploratory and confirmatory analyses by focusing on just the basic univariate (one variable) and bivariate (relationships between pairs of variables) data analytic methods. As briefly noted earlier, both exploratory and confirmatory analyses also encompass broad arrays of multivariate (relationships among multiple variables) dependence and interdependence methods that aim to discern (exploratory analyses) or affirm (confirmatory analyses) the existence, strength, and/or direction of relationships between the phenomena or outcomes of interest (i.e., dependent variables) and explanatory or predictive factors (i.e., independent variables). However, those more methodologically complex methods fall outside of the scope of elementary knowledge and skills focused data analytic literacy.

 The ‘standard error’ statistic used there in the context of mean estimates should be replaced with the ‘sampling error’ estimate, which captures the deviation of interest in this context.
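Before turning to the skills meta-domain, the short sketch below ties together the sampling error computation described in the footnotes above and its use in converting a point estimate into a confidence interval; it is a minimal, hypothetical illustration (the sample mean, standard deviation, and record count are made up) rather than code drawn from the text.

```python
# Minimal sketch of the sampling error and confidence interval logic described above;
# the sample values below are hypothetical, for illustration only.
import math
from statistics import NormalDist


def sampling_error(std_dev, sample_size, confidence=0.95):
    # Sampling Error = Z-score * (Standard Deviation / sqrt(Sample Size))
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 for 95% confidence
    return z * std_dev / math.sqrt(sample_size)


# Hypothetical sample: mean purchase amount of $52.40, standard deviation of $18.75,
# estimated from a sample of 1,200 records
mean, sd, n = 52.40, 18.75, 1200
err = sampling_error(sd, n, confidence=0.95)
print(f"point estimate: {mean:.2f}, 95% confidence interval: "
      f"({mean - err:.2f}, {mean + err:.2f})")
```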

Chapter 6 Data Analytic Literacy: The Skills Meta-Domain

Complementing the knowledge meta-domain of data analytic literacy is the skills dimension. Simply defined, a skill is a learned ability to do something. It can manifest itself in physical dexterity or prowess, as exemplified by swimming or gymnastics skills, and it can also manifest itself in mental mastery, as illustrated by chess playing skills. As exemplified by the ability to play chess, there can be considerable overlap between skills and knowledge – in fact, there is a nonreciprocal relationship between the two: Generally speaking, formation of skills requires at least some prior knowledge, while becoming knowledgeable does not require becoming skillful. And that is precisely the case with the development of data analytic competencies. Conceptual knowledge was discussed first – i.e., ahead of the ensuing overview of the skills meta-domain – because a robust foundation of understanding of the core elements of data analytics related ‘whats’ and ‘whys’ is necessary for effective and efficient development of ‘how-to’ skills. Attempting to learn how to use data manipulation and analysis tools – a core data analytic skillset – would be hampered by the lack of understanding of basic data structures and key elements of data analytic logic, due to the lack of necessary anchoring that is needed to comprehend the operational structure and logic of individual tools. It would be akin to someone trying to learn how to play a particular game, such as tennis, without first becoming acquainted with the general rules of the game. Within the confines of data analytics, knowledge and skills are considered distinct from one another in the sense that both require a clearly delineated learning focus; that learning oriented distinctiveness, however, recognizes that both are elements of a larger system of data analytic competencies, and thus ultimately are interrelated. That separate-but-related characterization is at the core of the general idea of literacy discussed in Chapter 1, as developing the conceptual knowledge of the rules of syntax is distinct from learning how to physically write, but that knowledge and those skills combine to form the competency of literacy. Recalling the earlier drawn distinction between conceptual knowledge and applied skills (Chapter 1), and the more in-depth examination of the phenomenological1 character of those ideas offered in Chapter 3, the focus of this chapter as well as the next two chapters (7 and 8) is on in-depth examination of the scope and contents of applied data analytic skills. And while the ability to manipulate and analyze data is framed here as a distinct meta-domain of data analytic literacy, it is important to keep in mind the interdependence between execution-minded skills and the earlier

 In simple terms, phenomenology is a philosophy of experience, or more specifically, the study of structures of consciousness, as experienced from the first-person point of view. https://doi.org/10.1515/9783111001678-008


acquired conceptual knowledge of data structures and methods of manipulation and analysis of data. In other words, getting the most out of the ensuing discussion of the ‘how-to’ aspect of data manipulation and analysis is contingent on keeping in mind and referencing the corresponding ‘whats’ and ‘whys.’ And lastly, the framing of data analytic skills described here goes beyond the traditional data ‘crunching’ scope as it also encompasses the ability to make sense of the often esoteric data analytic outcomes. Characterized here as sensemaking skills, those abilities are seen as a necessary component of the overall data analytic literacy, because they are essential to completing the process of translating raw data into informative insights. Just like the often overlooked skills needed to make typically messy data analytically usable, sensemaking skills encompass another important, yet often underweighted facet of data analytic literacy, which is making methodologically nuanced data analytic results informationally clear, an undertaking that is chock-full of often unexpected obstacles.

Skill Acquisition

The emergence of telegraphic communication technologies at the turn of the 19th and 20th centuries created the need to train field operators in the new skills of translating human language into Morse code,2 and decoding Morse code messages. Research efforts that were undertaken at that time to better understand the mechanisms of telegraphic language skill acquisition suggested that the progression of learning was not continuous; the initially rapid skill development was typically followed by extended plateaus during which little additional learning happened, and those plateaus tended to persist until operators’ attention was freed by the automation of lower level skills, in the form of unconscious responses to a particular activity. In that particular context, response automation involved linking inputs, in the form of the individual letters comprising a message, with specific Morse code symbols by means of automatic physical operator responses in the form of appropriate telegraph machine tap sequences. In other words, advancing to the next competency level required first attaining unconscious competence in more fundamental competencies. Nowadays, skill acquisition research tends to be focused primarily on clinical (e.g., medical training), remedial (i.e., deficiency correcting), or athletic skill acquisition. However, even though neither those more contemporary nor the original telegraphic communication focused skill development research findings are directly applicable to abilities to manipulate and analyze data, there are nonetheless some

 A method of encoding text characters as standardized sequences of two different signal durations, called dots and dashes; named after Samuel Morse, one of the inventors of the telegraph. It is interesting to note that the fundamentally binary nature of Morse code is highly analogous to the now widely used binary digital systems.


distinct and informative similarities. Chief among those is the idea of persistent, reinforced practice, which highlights the important distinction between massed or continuous, and distributed or spaced across time, approaches to skill development. Emerging research evidence suggests that massed practice might be preferred for discrete tasks, such as learning a specific routine, as exemplified by aggregation of individual records, whereas distributed practice tends to be more effective for continuous and somewhat less defined tasks, such as data due diligence (which usually encompasses numerous, context-dependent activities). The key factors to consider in both settings are the difficulty of the skill to be acquired, the availability of appropriate information, and the form in which that information is available, such as definitional explanation only, explanation supported by illustrative examples, case studies, etc. Reinforcing or correcting feedback also plays a critical role in skill development, with particular emphasis on extrinsic feedback that provides augmented information about skill-produced outcomes, where correct results are disclosed and explained. In terms of the underlying cognitive processes, which are at the center of developing data analytic competencies, the process of nurturing mental skills is often encapsulated in what is informally known as the four stages of competence, or more formally as the conscious competence learning model. Reminiscent of the well-known Maslow’s hierarchy of needs model (and often incorrectly attributed to Abraham Maslow), the conscious competence model is graphically summarized in Figure 6.1.

Figure 6.1: Conscious Competence Learning Model (a four-stage progression, from bottom to top: Unconscious Incompetence, Conscious Incompetence, Conscious Competence, Unconscious Competence).

The competence development process summarized in Figure 6.1 starts from the point of not knowing what one does not know and gradually ascends toward attainment of subconscious mastery, which, as noted in the context of learning Morse code, is a necessary precondition of advancing toward more advanced skills. The starting point of the conscious competence learning model, unconscious incompetence, is not knowing how to do something and not clearly recognizing that deficiency. Within the confines of data analytics, that can manifest itself in, for instance, not knowing how to aggregate transaction


detail data into purchase level data, which is often a consequence of a lack of appropriately nuanced data knowledge and corresponding data manipulation skills. Becoming more knowledgeable of data specifics can be expected to lead to recognition of that and other data manipulation deficiencies, though one still may not know how to aggregate detailed data, which is the essence of the conscious incompetence stage of the competence learning process. Once the individual acquires the skills of interest and is able to aggregate detailed transactions data, they move into the conscious competence part of the learning process, but the ability to execute those skills is still cognitively taxing. Lastly, and typically following significant amounts of practice, the ability to recognize the problem and take appropriate action steps becomes nearly automatic, at which point the individual can be said to have reached the unconscious competence level of the learning process. The conscious competence learning model offers a very general template to facilitate thinking about the mechanics of developing robust data analytic skills. To start, it is important to resist framing the idea of unconscious incompetence in an overly restrictive fashion. For instance, even those with very limited data analytic backgrounds and skills are generally aware of the fact that the vast majority of data are somewhat imperfect, in the sense of containing one or more deficiencies that need to be reconciled before those data can be used, though most may not be able to clearly identify what those deficiencies are, or even how to determine if those deficiencies are indeed present. Such a lack of guiding or actionable abilities is framed here as unconscious incompetence, because it manifests itself in the absence of abilities that are needed to recognize a discernible problem. And then considering the other end of the conscious competence learning model, unconscious competence does not mean taking actions without any need for conscious rational thought – it simply means having deeply engrained abilities to evaluate, diagnose, and take appropriate, though still well-reasoned steps.

Skill Appropriateness and Perpetuation

Within the confines of elementary data analytic competencies, one of the more notable differences between what constitutes conceptual knowledge and what constitutes applied skills is that the former is comparatively static in relation to the latter, which is continuously evolving. More specifically, some of the foundational notions of statistical probability and inference date back to the late 18th century, and many of the current data analytic techniques are several decades old;3 even neural networks, the backbone of machine learning applications, date back to the 1960s, when the initial backpropagation

 Bayes’ theorem, a cornerstone of probability theory, dates back to 1761 and Gauss’ normal distribution to around 1795; Fisher’s seminal Statistical Methods and Design of Experiments were published in 1925 and 1935, respectively, and Tukey’s The Future of Data Analysis was published in 1961. Those contributions ground much of the knowledge that forms the foundation of basic data analytic competencies.


method for training artificial neural networks was first developed. Those concepts are as true and applicable today as they were decades ago, in fact even more so because the means (computer hardware and software) and opportunities (volume and variety of data) to use those concepts are so much richer and so readily available nowadays. However, in contrast to the generally unchanging statistical and related concepts, applications of those concepts, encapsulated in computer software tools, are in a state of ongoing evolution. Looking back at the original programming languages developed in the middle of the 20th century (the oldest programming language still used today, FORTRAN, was introduced in 1957), tools of data manipulation and analysis continue to change, with ‘new’ replacing ‘old’ on an ongoing basis. In contrast to just three or four decades ago, today there are distinct families of tools that can be used to access, extract, manipulate, analyze, and visualize data: Some of those tools are limited purpose applications, such as those used to interact with data stored in relational databases (SQL), whereas others are more general purpose systems capable of performing a wide range of data manipulation and analysis tasks (SAS); some are open-source, meaning freely available (Python, R), while others are proprietary, subscription-based (SAS, SPSS); some are just scripting languages built around defined syntax (Python, R), others are GUI (graphical user interface) based, point-and-click systems with scripting capabilities (SPSS). Moreover, some of those tools offer essentially the same data manipulation and analysis functionality: As discussed in more detail in Chapter 7, the open-source Python and R programming languages offer essentially the same data analytic capabilities as the SAS, SPSS, and MATLAB proprietary systems, which underscores the earlier mentioned one-to-many relationship of conceptual knowledge to applied skills. In other words, a given data analytic concept can be operationalized using multiple competing data analytic tools, which necessitates explicit differentiation between knowing the ‘whats’ and the ‘whys’, framed in this book as the knowledge meta-domain, and being able to operationalize that knowledge, framed here as the skills meta-domain of data analytic literacy. Another key skill-related challenge is that of maintaining or perpetuating a particular set of skills. Rapidly changing technological landscapes can render once robust data manipulation and analysis abilities lacking, if not altogether obsolete. COBOL (COmmon Business-Oriented Language), first introduced in 1959, was the go-to programming language at the onset of the Digital Revolution, but the advent of Internet-based and mobile technologies thrust languages such as Java, Python or Swift into the forefront, pushing the once-dominant COBOL into the background.4 That

 To parallel a famous quote from Mark Twain, ‘the reports of my death are greatly exaggerated’, contrary to a popular sentiment, COBOL’s demise has also been greatly exaggerated. While indeed eclipsed by languages like Python or Java and relegated to background tasks, it is still the IT backbone for the type of electronic utilities many take for granted, such as using ATM machines, electronic airline ticketing or instant insurance quoting; in fact, though not at the same rate as languages like Python, the use of COBOL continues to grow.


ceaseless, technological advancement-driven upskilling trend can currently (the early part of 2023) be seen with the rising importance of unstructured data management and analysis tools, such as the Apache Hadoop and Spark, MongoDB, KNIME or MonkeyLearn platforms. Though not a replacement for established (and predominantly structured data focused) tools like the Python and R open-source languages or the SAS and SPSS proprietary data management and analysis systems, those more recently developed applications nonetheless add to the challenge of maintaining a current and adequately well-rounded data analytic skillset.

The Skills Meta-Domain

Set in the context of the preceding, background-informing overview of conceptual knowledge vs. applied skills differences and similarities, the remainder of this chapter is focused on an introductory overview of the second of the two most aggregate building blocks of data analytic literacy: the skills meta-domain. It is framed here as the broadly defined ability to engage in analyses of data, as well as to translate the often-obtuse data analytic outcomes into meaningful insights. In contrast to the ‘what’ and ‘why’ of data analytics focused knowledge meta-domain discussed in Chapters 3 through 5, the skills dimension encapsulates the comparatively more tangible ‘how to do it’ competency. It is important to keep in mind that though the two meta-domains are depicted as notionally separate, when considered from the perspective of data analytic competency development the two are not just complementary – they are mutually necessary. Just as one cannot be deemed to be literate, in the traditional sense, without being able to read and write, attainment of data analytic literacy is contingent on developing a sound conceptual understanding of the key concepts and processes used to extract insights out of data, and the ability to execute appropriate data analytic steps. In short, it requires being able to think it and to do it. As graphically summarized in Figure 6.2, the skills meta-domain is comprised of two distinct, more narrowly framed domains: computing, which is the ability to manipulate and analyze data using appropriate software applications, and sensemaking, framed here as the ability to translate data analytic outcomes into informative, meaningful insights. In order to make those still broad domains more operationally meaningful, the computing domain is broken down into two sub-domains: structural, defined here as the ability to review, manipulate, and feature-engineer data, and informational, seen here as the ability to analyze data. Similarly, the sensemaking domain is also broken down into two distinct sub-domains: factual, which encompasses descriptive insights, and inferential, which captures probabilistic data analytic outcomes. When contrasted with the highly abstract knowledge of data analytic concepts and processes, data analysis related skills are comparatively tangible, because they are largely tied to familiarity with data manipulation, processing, and analysis tools. At the same


Figure 6.2: The Structure of Data Analytic Literacy: The Skills Meta-Domain. (Data Analytic Literacy branches into the (Conceptual) Knowledge meta-domain – Data [Type, Origin] and Methods [Processing, Analytic] – and the (Procedural) Skills meta-domain – Computing [Structural, Informational] and Sensemaking [Factual, Inferential].)

time, as noted earlier, while the knowledge meta-domain of data analytic literacy is relatively slow changing (i.e., the core concepts of exploratory and confirmatory analyses outlined earlier have not materially changed in decades), the ecosystem of data analysis tools and related skills is very dynamic: New programming languages continue to emerge, longstanding data management and analysis tools, such as SAS or SPSS, are continuously updated, and new types of often specialized computing capabilities and applications, as exemplified by graph databases or data visualization applications, continue to enlarge the already expansive data manipulation and utilization tools ecosystem. In a sense, the key challenge characterizing the skills meta-domain is not so much learning how to use selected data analytic tools, but rather settling on which tools to use, while keeping abreast of new developments. To put all that in context, the task of, for instance, assessing a potential association between two continuous variables typically points toward a single methodology (correlation analysis), but the task of computing the requisite correlation coefficient can be accomplished with the help of numerous tools, all of which can be expected to produce comparable estimates (since they are just different mechanisms of accomplishing the same computational task). As implied in this simple illustration, data analytic literacy related skills encompass not only the ability to use appropriate software applications, but also being able to identify appropriate applications. The following scenario offers an illustrative summary: Hoping to compile an up-to-date baseline of their customer base, management of Brand X selects a subset of their customers who made a purchase within the past 6 months. The resultant data file is formatted as CSV (comma-separated values), a common structured data layout where each line represents a data record, and the individual values within a record are separated by commas. That particular file format is spreadsheet-friendly, meaning it can be easily converted into a Microsoft Excel or Google Spreadsheet file, and given that those applications also offer a wide range of data analytic functionalities, taking that route often seems natural. But here is a potential problem: Spreadsheet applications combine data and data manipulation and analysis coding into a single file, the limitation of which becomes evident when one tries to replicate the analysis using a more recent


batch of data. In order to update the earlier analysis, it would then be necessary to recreate all data coding in the new file,5 which can be quite time consuming. Any organization that uses its transactional data for ongoing tracking and analysis would quickly realize that data analytic applications that treat data and data manipulation and coding as two separate entities6 offer a more efficient mechanism for conducting recurring analyses.

✶✶✶

The ensuing two chapters offer an in-depth exploration of the two dimensions comprising the skills meta-domain: Chapter 7 tackles the broad domain of computing skills, as seen from the perspective of foundational data analytic skills, and Chapter 8 takes on the task of framing sensemaking skills, framed here as the ability to translate data analytic outcomes into analytically robust and informationally sound insights.

 If the contents of the ‘old’ and ‘new’ files are exactly the same and the only changes are in the form of magnitudes of individual values, it might be possible to simply copy-and-paste the new values onto the existing data file; doing so, however, creates the possibility of numerous potential errors, as even seemingly unchanged data layouts may contain minor, easy to overlook formatting differences; all considered, that type of overwriting is not recommended.
 Those applications, which include proprietary manipulation and analysis systems such as SAS or SPSS Statistics, and open-source programming languages such as R or Python, actually break the data analytic workflow into three distinct parts: data (which are converted into their respective native file formats), a syntax editor, and output files; in contrast to that, spreadsheet applications combine all three into a single physical file.
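To make the contrast between spreadsheet-style and script-style workflows concrete, the minimal sketch below shows an analysis kept separate from the data it processes, so that a newer batch of records can be analyzed simply by pointing the same script at a different file; the file names, column names, and the use of the pandas library are illustrative assumptions rather than details drawn from the Brand X scenario.

```python
# Minimal sketch of a script-based workflow in which the analysis code lives apart
# from the data; file and column names below are hypothetical.
import pandas as pd  # assumes the pandas package is installed


def summarize_customers(csv_path):
    # Read the raw CSV extract and compute a small baseline summary
    df = pd.read_csv(csv_path)
    return {
        "customers": df["customer_id"].nunique(),
        "average_purchase": df["purchase_amount"].mean(),
        "purchases_per_customer": len(df) / df["customer_id"].nunique(),
    }


# Re-running the same analysis on a newer extract requires no re-coding -
# only the input file changes.
print(summarize_customers("customers_2023_q1.csv"))
print(summarize_customers("customers_2023_q2.csv"))
```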

Chapter 7 Computing Skills

The distinction between knowing what data manipulation and analysis applications to use and knowing how to use those tools is a manifestation of the wealth of choices that are available to those interested in tapping into the informational content of data. However, the recent explosive growth in interest in open-source programming languages, most notably Python and R, can sometimes obscure the fact that those tools offer but one avenue of tapping into the informational content of data. That is an important point because R and Python are both programming languages, which means having to learn specific syntactic structures and associated rules, and that is something that can be seen as a barrier by those who either do not have the time or are simply not interested in learning how to code. Fortunately, there are alternative, ‘no coding required’ means of tapping into the informational content of data that offer essentially the same data manipulation and analysis functionality. Recalling the high-level overview of the procedural skills meta-domain of data analytic literacy (see Figure 6.2 in Chapter 6), the computing dimension of broadly defined abilities to manipulate and analyze data is comprised of two distinct sub-domains: structural and informational. The structural aspect of the broadly defined computing skillset encompasses the know-how required to review, manipulate, and feature-engineer data, whereas the informational facet of computing skills centers on being able to extract meaning out of data, typically using the statistical techniques discussed in Chapter 5 (Knowledge of Methods). Oftentimes those two sets of skills are blended, in spite of ample practical evidence suggesting that doing so can lead to short-changing of data preparation, or more specifically, data feature engineering skills.1 With that in mind, the ability to manipulate and prepare data for upstream utilization, framed here as structural computing skills, is discussed separately from the skills needed to extract meaning out of available data. Still, tool-wise, the two distinct sub-domains of the broad computing skillset overlap, because some of the commonly used programming languages and data management and analysis systems span the full lifecycle of data preparation and analysis. At the same time, other commonly used data applications have narrower, more defined scopes and capabilities. Not surprisingly, the sheer number of competing data management and analysis languages and systems, coupled with the often unclear application scopes of those tools, can paint a confusing picture, especially for

 A common scenario in basic data analytics focused courses is the use of analytically refined datasets that require little-to-no meaningful feature engineering, which has the effect of side-stepping learning how to deal with messy data and instead focusing primarily on learning how to analyze ready-to-use data. And while finding meaningful patterns and associations is, undeniably, the ultimate goal of data analytics, not knowing how to make data usable is also the ultimate obstacle to reaching that goal. https://doi.org/10.1515/9783111001678-009


those new to data analytics. And yet, a robust understanding of the ecosystem of data utilization related tools is essential to developing meaningful data analytic competencies – with that in mind, the next section offers a simple data access, manipulation, and analysis tool classification typology, along with brief descriptions of the most commonly used tools.

The Ecosystem of Data Tools

The computing aspect of the skills meta-domain of data analytic literacy can be seen as proficiency with software tools, defined here as programming languages and packaged software applications, that can be used to access, manipulate, and analyze data. A casual online search of available tools yields a dizzying array of data storing and querying, reporting, visualization, mining, online analytical processing (OLAP), and statistical analysis focused languages and systems, some freely available and others requiring paid subscriptions, some offering a very wide range of functionalities while others are focused on narrowly defined utilities, and some taking the form of rudimentary programming languages while others are packaged in intuitive point-and-click graphical interfaces. To say that available options are plentiful and choosing among them confusing would be an understatement. It is difficult to make an informed tool selection choice without first gaining at least a rudimentary understanding of the broader ecosystem of those tools. To that end, the readily available online descriptions are numerous and largely inconsistent, ultimately offering little help to those trying to grasp the totality of what is ‘out there’, and then also trying to discern clear differences between tools that may seem similar in some regards while still being different in a number of ways. Fortunately, when the scope of the evaluation is narrowed to tools that directly contribute to the process of, firstly, accessing, reviewing, and pre-processing of data, and then, secondly, transforming appropriately prepared data into informative insights, the initially confusing assortment of computer languages and software packages can be organized along a small number of meaningful classificatory dimensions. More specifically, when considered from the perspective of such a general classification shaped by the intuitively obvious criteria of what purpose each tool serves, its fundamental functionality, and its availability (i.e., free vs. subscription based), the otherwise confusing mix can be arranged into several distinct categories derived using the logic graphically summarized in Figure 7.1. The three-dimensional classification schema depicted in Figure 7.1 groups data analytics software tools using three sets of binary attributes: commercial vs. open-access (availability), systems vs. languages (functionality), and comprehensive vs. limited purpose (purpose). Commercial software products are developed and marketed by software development focused organizations; that designation has two key implications: On the one hand, the use of those tools has monetary costs, nowadays commonly in the form of annual subscriptions, but on the other hand, those tools typically offer dedicated


Figure 7.1: Data Analytics Related Tools Classification Framework (three binary dimensions: commercial vs. open-access, general-purpose vs. limited-purpose, and systems vs. languages).

product support. Their direct analog, availability-wise – open-access software tools – usually reside in the public domain, meaning they are not owned by any particular entity, which means that those tools can be freely accessed by anyone (i.e., at no cost), but at the same time there is usually no dedicated support (although some, such as the Python or R programming languages, have vast user communities that share their knowledge on dedicated sites such as Stack Overflow, though it would be an overstatement to suggest that those open forums offer the same level of support as that offered by dedicated commercial software teams). Data manipulation and analysis software tools categorized as systems are those that can be actuated using a GUI (graphical user interface, a typically point-and-click functionality that allows users to interact with software using standard menus and graphical icons) that can be used to access and actuate distinct functions; on the other hand, those categorized as languages only offer syntax-based functionality (i.e., require coding). Lastly, general-purpose data analytic tools offer a full lifecycle of functionalities, ranging from data manipulation and processing to analysis, visualization, and reporting, which is in contrast to limited-purpose applications which are focused on specific facets of data analytics, such as data visualization. One of the classificatory categories discussed above – open-access – requires a bit more clarification. As used here, the definitional scope of open-access software tools encompasses open-source software, and the distinction between the two is worth noting. Setting aside legalistic nuances of what constitutes an ‘open’ license, ‘open-access’ simply means free to use, whereas ‘open-source’ means free to amend, in the sense of being able to access and change the basic executable program structure, commonly referred to as source code. As implied by that distinction and in a way that is largely unique to computer software, open-access does not necessarily imply open-source;


more formally stated, all open-source software is also open-access, but not all open-access software is open-source (i.e., not all open-access software publishes its source code). But that is not all. As defined by the Open Source Initiative, a non-profit organization, to be considered open-source, software has to meet the following criteria:
– the product license must be technology-neutral and not be specific to a product (e.g., it could be used with PC as well as Apple operating systems),
– it must be free in terms of initial distribution and redistribution (no royalties or fees for any form of use),
– it must be free of any license contingencies (i.e., no restrictions on other software distributed along with the licensed software),
– the source code must be included, and must be free of any distribution/redistribution restrictions, and there must be no restrictions on modifications or derived works, and lastly,
– usage and distribution shall not discriminate against persons, groups, or fields of endeavor.
It is important to note that the above framing does not expressly address private vs. commercial use; consequently, some open-source programs are free to individuals but restricted when it comes to commercial use. Interestingly, the open-source terms also do not prohibit charging for support provided for open-source applications, which is why some manifestly open-source applications might ultimately have costs associated with using them. And while the distinction between open-access and open-source is somewhat tangential, the two terms are often confused, in view of which the preceding clarification seems warranted. Turning back to the data analytic software tools classification framework summarized in Figure 7.1, as graphically depicted there, the three dichotomous dimensions that frame the overall classification logic are assumed to be independent of one another, meaning that each offers a distinct perspective of the ecosystem of data analytic software tools. It is important to note that the goal of the classification schema outlined in Figure 7.1 is to simplify and inform, rather than to try to offer an exhaustive typology of data analytic tools, which is in keeping with the stated goal of this book, namely to contribute to the development of basic, but nonetheless robust data analytic skills. Also in keeping with that general goal is the application selection rationale, which is focused on the most commonly used and thus most representative tools in each of the delineated groupings. All in all, the goal of the aforementioned classification schema as well as the ensuing overview is to delineate distinct means of carrying out the data manipulation and analysis tasks outlined earlier in the context of the overview of conceptual knowledge of data processing and analytic methods. Keeping all those considerations in mind, the ensuing overview is structured as follows: Using the generalized data analytic continuum as a conceptual backdrop, distinct software tools are contrasted and briefly described within the confines of distinct three-dimensional groups formed by combinations of the bipolar characteristics


summarized in Figure 7.1. For example, the first grouping represents a conjoint of general-purpose, commercial, and systems, which forms a distinct grouping of data manipulation and analysis tools that are both widely used and can carry out some or all of the data processing and analysis steps discussed in prior chapters. It should be noted, however, that while the aforementioned classification logic suggests a total of eight such distinct groupings (2x2x2), the scope of the ensuing overview is constrained to just a select subset of data analytic software clusters that encompass tools that are both widely used and capable of carrying out the earlier discussed data preparation and analysis steps. The overview is nested in a high-level distinction between comprehensive and limited purpose tools, where the former encompasses applications that offer a full spectrum of capabilities ranging from rudimentary data manipulation to advanced data analyses, whereas the latter is comprised of tools that offer more specialized or narrower sets of capabilities. In total, out of the eight possible software clusters suggested by the three bipolar classificatory dimensions, four select clusters are discussed, and each cluster is comprised of tools that offer comparable (often practically the same, especially when considered from the perspective of fundamental data manipulation and/or analysis processes) capabilities. Aiming to emphasize the distinct combinations of scope, function, and access that jointly characterize each distinct grouping of data manipulation and analysis software tools, the ensuing brief overviews are meant to draw attention to distinct families of applications, and to also suggest alternative mechanisms of carrying out basic data manipulation and analysis tasks.

Comprehensive Data Analytic Tools

This broad category encompasses the most widely used general purpose – i.e., spanning the full continuum of data processing and analysis – software tools. The applications reviewed here are divided into two sub-categories: general-purpose commercial data management and analysis systems, and general-purpose open-access programming languages.
1. General-Purpose – Commercial – Systems. This category encompasses commercial (i.e., paid access) data analytic applications that offer full data analytic lifecycle capabilities, which range from rudimentary data manipulation and processing to a full range of data analysis and reporting capabilities. Noting that there are several tools that fall into this cluster,2 the two best-known examples of those applications are SAS and SPSS.

 While not discussed here, other popular data analytic systems include Stata, a general-purpose statistical software package developed by StataCorp, MATLAB, a computing system, which also includes its own programming language, developed by MathWorks, and Minitab, a statistics package developed by Penn State. Those tools are not discussed here because they tend to have niche appeal – MATLAB’s


The early predecessor of SAS (initially short for Statistical Analysis System, though now it is just SAS) was developed in the late 1960s at North Carolina State University to support analyses of large quantities of agriculture data, but the commercial interest known today as SAS Institute was formally launched in 1976. Nowadays offered (in its relevant form, as SAS Institute now offers multiple product lines) under the umbrella of SAS Business Intelligence and Analytics, it is a suite of applications for self-service analytics encompassing a comprehensive graphical user interface (GUI) and its own computer programming language, used primarily for data processing and statistical analysis (also worth noting is SAS’ Proc SQL, which makes SAS coding a lot easier for those familiar with the SQL query language). SPSS3 (initially short for Statistical Package for the Social Sciences, though now it is just SPSS) was initially developed at Stanford University in the 1960s for the social sciences, and it was the first statistical programming language for the PC (though it was initially developed for IBM mainframe computers). Like SAS, SPSS is a suite of applications for self-service analytics offering a full range of capabilities from data processing to statistical analysis, which can be accessed either through an intuitive and comprehensive GUI or its own programming language (also worth noting is SPSS’ paste function, which creates syntax from steps executed in the user interface). There are two core benefits to both SAS and SPSS: First, they make the full spectrum of data manipulation and analysis capabilities accessible without the need for programming skills. More specifically, their menu-driven point-and-click interfaces make it possible to execute complex data preparation and analysis steps in a coding-free manner, while at the same time their respective programming languages offer parallel coding-based functionality. Second, both SAS and SPSS offer dedicated support in the form of carefully curated materials and live technical assistance, which can be a substantial benefit to all users, but particularly new ones.
2. General-Purpose – Open-Access – Languages. A direct alternative to paid commercial data analytic systems, open-access programming languages also offer a full lifecycle of data processing and analysis capabilities, but in the form of programming languages only.4 As can be expected, not all data users are interested in investing the required time and effort to learn the logic and nuances of a programming language, but for those who do, that avenue offers a potentially cost-free alternative to pricey commercial

use skews heavily toward engineering and related scientific usage contexts, while Stata and Minitab are used primarily by academicians.  Now IBM SPSS Statistics following IBM’s purchase of SPSS Inc in 2009; the SPSS product family also includes SPSS Modeler, a visual (i.e., drag-and-drop only, no syntax editor) data science and machine learning system.  The R programming language offers its own free, open-source limited purpose GUI known as Rattle, but functionality-wise it falls considerably short of the comprehensive interfaces offered by SAS and SPSS Statistics.


systems. The two core data analytics related open-access and open-source programming languages are R and Python. Broadly characterized, R is a programming language and computing environment with a focus on statistical analysis; a key part of its ecosystem are the more than 15,000 packages,5 themselves open-source, that have been developed to facilitate loading, manipulating, modeling, and visualizing data. The R language is used mostly for statistical analysis, and it allows technical analysts with programming skills to carry out almost any type of data analysis, but its syntax can be considered somewhat complex, which translates into a relatively steep learning curve. While overtly similar to R in terms of its general usage, Python is more of a general-purpose programming language that is often the go-to choice of technical analysts and data scientists. While it can be used to support rudimentary data manipulation and analysis, it is particularly popular in machine learning, web development, data visualization, and software development; currently (2023) it is one of the most widely used programming languages in the world, as evidenced by the fact that its open ecosystem boasts more than 200,000 available packages. Its syntax structure is considered to have a high degree of readability, which means it is comparatively easy to learn. The benefits of the open-source R and Python languages are somewhat analogous to those of the SAS- and SPSS-exemplified commercial systems: The most visible benefit is that they are, nominally, cost-free (although there are companies that provide fee-based support services, which suggests that ultimately there might be some cost associated with using those languages). A perhaps less visible benefit is their large and dynamic user and contributor universe, which translates into an almost overwhelming array of packages that can be used to carry out a wide array of data processing and analysis steps. At the same time, it is important to keep in mind that the open character of those ecosystems means that there are no formal quality review processes or standards, and the sheer number of currently available packages suggests a possibly considerable volume of duplicates. All considered, wading through thousands of similar-but-different packages offered by contributors with varying degrees of programming skill can be a challenging undertaking.
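As a small illustration of the one-to-many relationship between a data analytic concept and the tools that can operationalize it, the sketch below computes a correlation coefficient of the kind mentioned earlier using the open-source pandas package; the input file and column names are hypothetical, and the same computation could just as readily be carried out in R (e.g., with its built-in cor() function) or in a commercial system such as SAS or SPSS.

```python
# Minimal sketch of a basic analysis carried out with an open-source package;
# the input file and column names are hypothetical.
import pandas as pd  # assumes pandas is installed, e.g., via pip install pandas

# Load a flat (CSV) file into a data frame and review it
df = pd.read_csv("customer_purchases.csv")
print(df.describe())  # basic descriptive statistics for the numeric columns

# Pearson correlation between two continuous variables
r = df["purchase_amount"].corr(df["tenure_months"])
print(f"correlation coefficient: {r:.3f}")
```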

Limited-Purpose Data Analytic Tools

The applications discussed here offer more narrowly scoped but deeper, in the sense of specific utilities, capabilities that address distinct facets of the overall data analytic process, once again focusing on basic data manipulation and analysis related competencies. Tools that fall into this broad cluster of limited-purpose applications can be further

 In this context, a package is a self-contained application developed to carry out a particular task, such as a particular type of statistical analysis (e.g., F-test, logistic regression, etc.), file merging, graphing, etc.


differentiated along the function (software systems vs. programming languages) and access (commercial/paid access vs. open access) dimensions, which gives rise to two general categories, limited-purpose commercial systems and limited-purpose open-access programming languages:
1. Limited-Purpose – Commercial – Systems. Much like the earlier discussed general-purpose systems, limited-purpose applications are commercially available (i.e., offering fee-based access), but their functionality is geared toward specific facets of data analytics, such as data manipulation and reporting, or data visualization. As can be expected, there are numerous applications that fall into this broad category, but the two that are particularly relevant to basic data analytic work are Microsoft Excel and Tableau. Microsoft Excel is the most widely used spreadsheet application, and it also offers some built-in data analytic functionality. It is used primarily for charting and graphing, and for tracking and managing various tasks and functions, but it should be noted that it is currently limited to around 1 million rows, or data records, which limits the size of data files it can process (in a way of a contrast, the earlier discussed SAS, SPSS, R, and Python do not have such constraints, assuming ample physical memory). Perhaps even more importantly, an Excel file is an amalgamation of data and data related coding, which is convenient in some situations, most notably basic data wrangling and reporting, but limiting in other ones, such as analyses that require frequent refreshes with more recent data. Tableau is a data visualization and analytics platform that allows users to create reports and share them as standalone outcomes or as parts of other applications. It is almost entirely GUI-driven, and although much of the Tableau platform runs on top of its own query language, VizQL, which translates drag-and-drop dashboard and visualization components into efficient back-end queries, the platform does not offer a syntax editor. Its core premise is to allow users to build vivid and compelling data visualizations without any coding, hence it is particularly appealing to new or light users who are willing to trade away the more extensive customization capabilities offered by other, more customizable (and thus more involved, learning-wise) data visualization applications in exchange for ease of use.
2. Limited-Purpose – Open-Access – Languages. Given the existence of the earlier outlined general-purpose programming languages, R and Python, the very existence of limited-purpose languages may be puzzling, but upon closer examination their value becomes clear. In the most general sense, general-purpose languages support virtually any type of data manipulation outside of structured or unstructured data reservoirs (i.e., databases), but cannot be used to access and manipulate data inside of either structured or unstructured databases (largely because their designs do not contain semantics that are precise enough to support the requisite types of operations). And that is where limited-purpose languages come in: Those more narrowly focused tools incorporate precise semantics designed expressly with


specific types of data structures in mind, which gives those tools the ability to manipulate data inside of structured and unstructured databases, in addition to extracting data out of those databases. Recalling the fundamental differences between structured and unstructured data structures, it should not be surprising that interacting with (i.e., manipulating and extracting data from) those two very different data collections requires dedicated programming languages. To that end, SQL is the programming language used to access and manipulate data inside of structured databases, and NoSQL6 is its analog used to interact with unstructured databases. SQL, or Structured Query Language, is among the best known and most widely used programming languages – in fact, it is likely the most widely used limited-purpose language. Originally developed in the late 1970s by IBM, it was first publicly released in 1986,7 and has since become the go-to language for storing, retrieving, and manipulating structured data in relational databases. Within the confines of data analytic literacy, the value of developing basic SQL programming skills is that it offers the means of accessing and retrieving data out of structured data reservoirs, and of outputting selected data as general-purpose files, or so-called 'flat files', which can then be easily read into a data analytic environment of choice, such as SAS, SPSS, R, Python, or Excel. However, when accessing SQL related documentation it is important not to confuse SQL, the freely available programming language, with SQL Server, a proprietary (owned by Microsoft Corp.) database management software, or with MySQL, an open-source relational database management system built on top of the SQL programming language. The NoSQL query language, formally known as MongoDB Query Language or MQL, offers an essentially parallel data access and manipulation functionality for unstructured databases, where data are organized into data collections characterized as documents rather than tables (which comprise structured databases). It is worth noting that the notion of 'document' here is somewhat different from the everyday conception that conjures up images of official records or some other written or electronic forms or files. Within the confines of unstructured databases, documents are self-describing, hierarchical, tree-like data structures that consist of maps, collections, and scalar values (i.e., magnitudes); as independent, standalone entities, individual documents can be similar to one another, but should not be expected to be structurally alike. The preceding overview is meant to offer a general backdrop against which the two more narrowly framed computing sub-domains – structural and informational – can be discussed. Some of the tools delineated in the classificatory context provided by

6 While it is popularly held that NoSQL stands for 'non-SQL', some argue that it is actually more appropriate to think of that name as standing for 'not only SQL' because NoSQL is able to accommodate a wide variety of data types.
7 The date SQL was adopted as a standard by the American National Standards Institute (ANSI); it was adopted as a standard by the International Organization for Standardization (ISO) the following year.


the three distinct sets of characteristics of data analytic tools summarized in Figure 7.1 can be used within both sub-domains, but some others are largely applicable to only one – a more in-depth discussion follows.

Structural Computing Skills

Within the confines of data analytics, the very similar sounding 'structure', 'structural', and 'structured' are used in a variety of distinct contexts, which can easily lead to confusion. There are abstractly defined data structures, which represent different logical solutions to the problem of organizing data elements in collections of data; individual data elements and data collections are then each characterized by distinct structural characteristics, which reflect endemic properties of individual data elements (e.g., continuous vs. categorical) and organizational blueprints of data collections (e.g., relational vs. object-oriented databases); and, of course, there are structured and unstructured datasets. It thus follows that the idea of structural computing skills could be considered in a variety of related but nonetheless distinct contexts, each reflecting a particular facet of broader computational skills, which manifest themselves in knowing how to carry out any appropriate data wrangling tasks. Recalling the earlier discussed three-tier data categorization of data element – data table – data repository (see the Data Aggregates section): when considered from the perspective of individual data elements, structural computing skills entail the ability to review and repair (as needed) individual data features, as exemplified by imputing of missing values; nomenclature-wise, those skills fall under the general umbrella of data feature engineering. When seen from the perspective of data tables, structural computing skills entail the ability to modify and enhance datasets, as exemplified by normalizing or standardizing select data features or by creating new variables, such as indicators or interactions derived from existing data features; that broad set of procedures is referred to as data table engineering. Lastly, when considered from the perspective of data repositories, computing skills entail the ability to assess (and correct, as needed) the consistency of data features across datasets, and the ability to establish cross-table linkages for data sourcing and analysis; that is referred to here as data consistency engineering. When considered jointly, data feature, data table, and data consistency engineering are referred to as data wrangling or data munging. A couple of important considerations come to the forefront here: Firstly, structural computing skills need to be considered in the context of two distinct sets of activities: data access and extraction, and separately, subsequent data processing (i.e., preparing extracted data for upstream analyses). Accessing and manipulating data in their native environments requires basic proficiency with specific software tools that are capable of interacting with different types of data reservoirs. For example, accessing data stored in structured relational databases commonly used to organize and store transactional


and related data, or accessing unstructured databases used to store and organize freeform, predominantly textual data, requires familiarity with the earlier discussed limited-purpose programming languages (SQL and NoSQL, respectively). Subsequent data processing, typically carried out following extraction of data out of those native environments, calls for a different set of software tools, in the form of the open-access programming languages and commercial general- and limited-purpose systems which were also discussed earlier. Secondly, one of the core assumptions made when framing structural computing skills is that raw, unprocessed data are not immediately usable because of where and how they are stored, and how they are structured. For instance, structured transactional data (e.g., point-of-sale data captured by retail stores using the ubiquitous bar code readers) stored in widely used relational databases8 are organized in a manner that favors organizational clarity and storage efficiency. That means that captured details might be spread across several tables of like data elements (loosely resembling the organization of a physical warehouse), which ultimately calls for being able to identify, access, extract, and then amalgamate the data elements of interest into a single dataset.9

Some General Considerations

Delineating essential and rudimentary structural computing skills is perhaps the most challenging aspect of framing basic (as opposed to more advanced) data analytic competencies, because making data analytically usable can entail a wide range of different alterations, and there are no immediately obvious lines of demarcation separating basic and more advanced data alteration skills. Further adding to that difficulty is the asymmetrical nature of data wrangling activities: some of the more commonly employed, and thus essential, data wrangling tasks can also be seen as some of the more complex, which is in contrast to the methodological dimension of data analytic competencies, where the most frequently used basic methods are also comparatively simple. That complexity, however, is largely limited to the underlying rationale of

8 Collections of data that organize data in one or more tables of columns and rows and where relations among tables are predefined, making it easy to understand how different data structures relate to each other.
9 Extraction of data out of their native storage environments is a common practice because doing that creates virtually unlimited data analytic possibilities – that said, there is a number of highly standardized uses of, particularly transactional, data, such as generation of periodic sales reports, which may not require extracting of data out of their native environments. There is a wide array of reporting software designed to work directly with relational and other types of databases, sometimes jointly (and somewhat generically) described as business intelligence (BI) tools; those reporting tools aggregate, standardize, and analyze predetermined data elements on an ongoing basis (typically in fixed intervals, such as monthly or quarterly) and produce easy to read numeric and graphical summaries of key performance indicators (KPIs), often in the form of interactive dashboards.


data wrangling steps, and generally does not extend to the associated 'how-to' executable computational steps (i.e., specific data wrangling SAS, SPSS, R, Python, etc. routines). The reason behind that is that the underlying rationale needs to be understood in its entirety in order to correctly identify appropriate data feature or data structure engineering procedures, whereas the executable computational procedures are generally carried out using 'pre-packaged' standard routines (in fact, it is difficult to think of a data wrangling step that cannot be executed with the help of one of those standard routines). The earlier discussed data analytic systems (SAS and SPSS) as well as general-purpose (R, Python) and limited-purpose (SQL, NoSQL) programming languages all incorporate rich libraries of specialized pre-coded routines, which can be thought of as large standalone macros that only require users to specify a handful of key parameters. So, the execution part is generally straightforward (assuming basic proficiency with appropriate software tools), but assuring that actual outcomes are in line with intended changes is contingent on understanding the underlying rationale and procedural logic governing those routines. Consequently, the ensuing overview of core structural computing skills is approached from the perspective of the underlying rationale and procedural logic, as essential prerequisites to correctly identifying the data wrangling steps suggested by the desired data analytic outcomes, selecting appropriate data wrangling procedures, and deploying those procedures in a manner that is in keeping with the expected data transformation outcomes and procedure-specific application logic. The application specifics of individual data wrangling routines are tied to the operational specifics (e.g., the type of tool, as in a programming language vs. a GUI-based system, syntax, etc.) of the data analytic tools discussed in the previous section, and thus are addressed in detail in application-specific usage documentation that is readily available online. Turning back to the earlier made distinction between structural computing skills needed to access and manipulate data in their native environments (i.e., data reservoirs holding those data), and computing skills that are needed to prepare extracted data for upstream analyses, those two sets of skills can be summarized as follows:
1. accessing and extracting data out of their native environments
2. preparing extracted data for upstream analyses, which can be broken down into two more operationally meaningful sets of structural computing activities:
   a. transforming raw data into analysis-ready datasets
   b. enriching the informational contents of data


Accessing and Extracting Data

Historically, accessing and extracting data out of their native environments tended to be entrusted to database administrators or to other information technologists, but the growing embrace of self-service analytics, where line-of-business professionals are encouraged to perform their own data queries, is heightening the need for more widely distributed data extraction skills. The exact makeup of those skills, however, is almost entirely situationally determined, as it reflects the specifics of where data of interest are stored (i.e., the type of data reservoir, as in the earlier discussed data lakes or data pools) and how those data are organized (i.e., what specific logical data model, as exemplified by relational or object-oriented, is used). SQL and NoSQL, both discussed earlier, are the two widely used data access and extraction tools, and basic proficiency with them is therefore at the core of this facet of structural computing skills. The limited-purpose, open-access SQL programming language is the go-to means for accessing and manipulating data stored in relational databases, which are widely used for organizing and maintaining transactional (e.g., sales) and related data; its unstructured data analog, the NoSQL language, is used to access and manipulate data stored in NoSQL databases.10 Though both are programming languages that require coding, their narrow scope (primarily database querying) coupled with their declarative nature and semantic syntax make them comparatively easy to learn and use, and both are backed by ample, multi-modal (instructional videos, documents, etc.) learning support readily available online. One aspect of accessing and extracting data out of their native environments that is especially situational relates to the type of data storage. Data tables in data pools are usually standardized, which means that data captured from different sources were transformed into a single, consistent format; data contained in data lakes, however, are normally not standardized, which means that data of interest might be spread across differently formatted and structured tables. Clearly, that adds an additional level of complexity, and may effectively constrain self-service analytics, instead requiring more skilled help to access and extract such heterogeneous data.

10 Comparatively lesser known than relational databases, the general designation of NoSQL databases encompasses four distinct types: 1. document-based, which organize data elements into document-like files (an unstructured analog to structured tables); 2. key-value, where recorded values are linked with specific keys; 3. wide column-based, which are similar to structured relational databases but where columns are dynamic rather than fixed; and 4. graph databases, which store individual values as nodes and edges, where the former store entity (e.g., people, products) and the latter store cross-entity relationship specifics.
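To make the extraction step more concrete, the following is a minimal sketch of SQL-based data access written in Python, using the built-in sqlite3 module and the pandas library; the in-memory database, the 'sales' table, and its column names are hypothetical stand-ins for whatever a real transactional reservoir would contain, and a production setting would connect through the appropriate database driver rather than SQLite.

    import sqlite3
    import pandas as pd

    # For illustration only: build a tiny in-memory SQLite database with a
    # hypothetical 'sales' table; in practice the connection would point at
    # an existing transactional database.
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE sales (customer_id TEXT, product_id TEXT, "
        "sale_date TEXT, sale_amount REAL)"
    )
    connection.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [("C1", "P10", "2023-01-15", 19.99),
         ("C2", "P11", "2023-02-03", 5.49),
         ("C1", "P12", "2023-02-20", 42.00)],
    )

    # The extraction step: a declarative SQL query that selects the data
    # elements of interest ...
    query = """
        SELECT customer_id, product_id, sale_date, sale_amount
        FROM sales
        WHERE sale_date >= '2023-02-01'
    """
    extract = pd.read_sql_query(query, connection)

    # ... written out as a general-purpose 'flat file' that can be read into
    # SAS, SPSS, R, Python, or Excel.
    extract.to_csv("sales_extract.csv", index=False)
    connection.close()

The same SELECT logic carries over to other relational databases; only the connection details change.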


Transforming Raw Data to Analysis-Ready Datasets

It is taken here to be intuitively obvious that transforming raw data into analysis-ready datasets is largely determined by the combination of what data are to be used and what the expected data analytic outcomes are, which together suggest numerous possible data wrangling steps. That said, while the specifics of what, exactly, needs to be done may vary across situations, there are several frequently recurring procedures that cut across different contexts; those recurring data wrangling processes are at the center of the structural computing skills discussed here. As could be expected, some are relatively straightforward while others are comparatively more complex, when considered from the perspective of the underlying procedural logic. And lastly, it should also be noted that while data wrangling encompasses data deficiency remediation, in a more general sense it should not be taken as an indication of inherent shortcomings of data, but rather as a manifestation of the diversity of the informational potential of data. For instance, point-of-sale (POS) captured transactional data can be used to generate ongoing sales reports, and at the same time they can also be used to glean insights into buyer behaviors – the former typically requires minimal pre-processing, whereas the latter often demands fundamental re-arranging of the structure of those data. When the original structure and/or layout of data are not conducive to generating data analytic outcomes of interest, data need to be subjected to specific conversions, broadly characterized here as data wrangling or data munging, and more specifically comprised of data feature engineering, data table engineering, and data consistency engineering. The goal of data feature engineering is to transform individual data elements into features, commonly referred to as variables, that can support the desired usage of the informational contents of data; the goal of data table engineering is to maximize the discoverable (with the help of exploratory or confirmatory analyses discussed in Chapter 5) informational content of data tables; and lastly, the goal of data consistency engineering is to expand the informational scope of available data by enabling linking of structurally incongruent data tables.

Data Feature Engineering

To restate, the goal of data feature engineering is to transform raw data elements into analytically usable data features. As succinctly conveyed by the well-known computer science GIGO (garbage in, garbage out) concept, data due diligence, framed here as the ability to use appropriate computational tools to assess individual data elements and to enact any needed or desired content or structural changes or amendments, is of foundational importance to upstream data utilization. The general-purpose open-access R and Python programming languages, and the general-purpose commercial SAS and SPSS (as well as other similar statistical analysis tools such as MATLAB, Stata, or Minitab) are typically used to shape raw data into analyzable data files. Given that the same set of tools (after all, those are general-purpose applications) is used for


upstream data analytic processing, computing skills geared specifically toward transforming raw data into usable data need additional clarification. Recalling the three-tier data framing taxonomy (see Figure 4.3), and the discussion of data preparation methods (Chapter 5, Knowledge of Methods), transforming raw, meaning not immediately analytically usable, data elements into analyzable data features takes on a different meaning in the context of structured and unstructured data. Starting with structured data, data feature engineering typically involves one or more of the following:
– Imputation: Most commonly taking the form of replacing missing data with some substitute value, but it can also include correcting visibly inaccurate values (e.g., a person's age showing as an improbable 560 years); depending on the encoding type, i.e., continuous vs. categorical, missing continuous data can be replaced with the average of non-missing values, and missing categorical data can be replaced with a catch-all 'other' category. All data analytic programming languages and systems discussed earlier incorporate standard missing value imputation routines.11
– Outlier identification: The efficacy of many basic types of analyses is contingent on continuous input data being relatively outlier-free, meaning not containing values that are abnormally (vis-à-vis other values in a particular dataset) large or small; consequently, identifying the existence of such values is an important data due diligence step. And while all aforementioned general-purpose programming languages and data analytic systems offer convenient means of graphically (typically using a scatterplot) and numerically (typically using granular frequency distribution tables, such as 100 percentiles) examining the distribution of values of individual data features, determining what constitutes an abnormally large or small value can be difficult. A commonly used approach is to convert raw data values into z-scores, which are magnitudes expressed in the number of standard deviations away (above or below) from the mean,12 and then consider values that fall outside of ±3 standard deviations away from the mean to be outliers. It is important, however, to keep in mind that outliers are real values that capture unusual, but nonetheless real, occurrences, which means that there are distinct data analytic contexts, such as analyses of severe weather events, where elimination of outliers may be undesirable as it can lead to under- or over-stating of estimated effects.
– Distributional correction: Another common requirement of frequently used data analytic techniques is the assumption of normality, meaning that the distribution of individual data values should be symmetrical, i.e., roughly fitting the familiar bell-shaped

11 It should be noted that in addition to the very basic approaches (e.g., using the mean or median for continuous data features and lumping all categorical missing values into a catch-all 'other' category), the standard imputation procedures also tend to include more advanced, probabilistic estimation-based imputation approaches.
12 Computed as (Raw Value – Mean) / Std. Deviation.


curve. Interestingly, many real-life phenomena are skewed, i.e., are distributed non-symmetrically, rendering the need to bring about (approximate) normality a common one. There are several different types of mathematical transformations that can be used here, with logarithmic (commonly referred to as 'log'), square root, and reciprocal transformations being the most commonly used ones. It should be noted that the use of a particular transformation does not guarantee the desired outcome; in other words, not all data features that are not normally distributed can be made normal. And lastly, all major data analysis languages and systems support standard transformation routines.
– Binning: Often seen as an alternative to the distributional correction outlined above, data binning (also referred to as data bucketing) is simply the grouping of original values of a continuous data feature into intervals, known as bins, and then using the resultant bin values13 in lieu of the original continuous data values. As implied in this brief description, binning effectively converts continuous data features into categorical ones, and while under most circumstances it is undesirable to do so because of a significant loss of information (continuous values allow all arithmetic operations and thus can yield richer arrays of informational outcomes, whereas categorical values do not allow arithmetic operations and thus are informationally poorer), in situations in which a continuous data feature of interest exhibits undesirable properties that cannot be remedied, converting that data feature into a categorical one with the help of binning may make it possible to 'salvage' an otherwise analytically unusable feature.
The idea of reviewing and repairing individual data features implicitly assumes knowing, or at least being able to estimate, what constitutes correct or acceptable values. It is intuitively obvious that formal, published support documentation, in the form of data dictionaries or similar, offers the most dependable means of making those determinations. In instances in which such supporting documentation is not readily available, which are fairly common, structured, defined data features offer a number of natural bases for making those determinations. Potential data feature defects, such as the presence of inconsistent values, can be surmised from the preponderance of values with the help of descriptive summaries that capture the key variable-level characteristics, typically in the form of frequency counts for categorical data features and distribution-describing parameters (e.g., mean, standard deviation, median, minimum, maximum values) for continuous data features. That, however, is not the case with unstructured data.

13 Typically representing either a sequence (e.g., Bin 1, Bin 2, etc.) or a central value that is representative of individual intervals (e.g., average of Bin 1, average of Bin 2, etc.).
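The following is a minimal sketch, using pandas and numpy, of the structured data feature engineering steps outlined above: correcting and imputing values, flagging outliers with z-scores, applying a log transformation, and binning a continuous feature. The data frame, column names, and cutoffs are hypothetical and chosen purely for illustration.

    import numpy as np
    import pandas as pd

    # Hypothetical customer-level data with a missing value and a visibly
    # inaccurate value (an improbable age of 560).
    df = pd.DataFrame({
        "age": [34, 45, np.nan, 52, 560],
        "income": [42000, 58000, 61000, 39000, 1250000],
        "segment": ["A", "B", None, "A", "B"],
    })

    # Correct the visibly inaccurate value before imputing.
    df.loc[df["age"] > 120, "age"] = np.nan

    # Imputation: replace missing continuous values with the mean,
    # and missing categorical values with a catch-all 'other' category.
    df["age"] = df["age"].fillna(df["age"].mean())
    df["segment"] = df["segment"].fillna("other")

    # Outlier identification: convert income to z-scores and flag values
    # falling outside +/- 3 standard deviations from the mean.
    z = (df["income"] - df["income"].mean()) / df["income"].std()
    df["income_outlier"] = z.abs() > 3

    # Distributional correction: a log transformation to reduce skewness
    # (log1p tolerates zero values, should any be present).
    df["income_log"] = np.log1p(df["income"])

    # Binning: group the continuous income values into three intervals.
    df["income_bin"] = pd.cut(df["income"], bins=3,
                              labels=["low", "medium", "high"])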


The lack of consistent organizational structure means that the notion of review and repair of individual data features is not as applicable to unstructured data; in fact, the very notion of 'data feature' takes on a somewhat different meaning.14 To that end, the very idea of data feature engineering of unstructured data takes on an almost entirely different meaning, as the conception of what constitutes a 'data feature' differs substantially between structured and unstructured data. The orderly two-dimensional layout of structured data, where rows typically delimit individual data records and columns delimit variables, is naturally indicative of individual data features; defined largely by the absence of such a defined structure, unstructured data's features are not immediately visible. In fact, whereas the overriding goal of structured data feature engineering is to review and, as needed, modify existing features, the goal of unstructured data feature engineering is the discovery of data features that are hidden in raw data. Doing so falls under the broad umbrella of text mining, where an approach known as the 'bag-of-words' offers a means of converting unstructured text data into structured numeric data. Broadly characterized, the bag-of-words method uses a multistep process that starts with the delineation of high-frequency terms (i.e., those repeating across multiple records), where a 'term' could be a single word or a multi-word expression, followed by disambiguation, or distinguishing between similar terms, and tokenization, or creation of distinct data features. Final outcome-wise, bag-of-words-generated data features are usually indicator-coded (also referred to as dummy-coded), giving rise to 'present' (typically encoded as '1') vs. 'absent' (typically encoded as '0') categorical variables. Tool-wise, the basic computational logic of the bag-of-words method can be accessed using the open-access R and Python programming languages (as language-specific add-on packages), and is also available as a part of general-purpose commercial systems (though it may require more premium-level functionality). And though text mining, as a broader undertaking, falls outside the scope of basic data analytic competencies, the ability to convert text data into simple, structured categorical indicators should be considered a part of the elementary data analytic skillset, especially considering that the vast majority of data generated nowadays are text.

14 Consider a simple example of a text data file containing Twitter records or Amazon online product reviews – given the free-formatted layout of such files, data features are not initially evident as, in principle, each data record represents a somewhat different mix of data elements; in other words, raw unstructured text data can be considered featureless (to that end, one of the core elements of text mining is to identify recurring data features, which in itself is a complex undertaking given the numerous semantic and syntactical nuances of text data).
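The following is a minimal, pure-Python sketch of the bag-of-words logic described above: simple tokenization, delineation of higher-frequency terms, and indicator coding of 'present' vs. 'absent'. The sample records and the frequency cutoff are hypothetical, and production-grade text mining would typically rely on dedicated R or Python add-on packages rather than hand-rolled code.

    from collections import Counter

    # Hypothetical free-form text records (e.g., short product reviews).
    reviews = [
        "great product fast shipping",
        "product arrived late",
        "fast shipping great price",
    ]

    # Step 1: tokenize each record and count how often each term appears
    # across the whole collection.
    tokenized = [text.lower().split() for text in reviews]
    term_counts = Counter(term for tokens in tokenized for term in tokens)

    # Step 2: keep only 'high frequency' terms, here defined (arbitrarily)
    # as terms appearing in the collection more than once.
    vocabulary = sorted(term for term, count in term_counts.items() if count > 1)

    # Step 3: indicator-code each record (1 if the term is present, 0 if
    # absent), which yields a structured, analyzable representation.
    indicator_matrix = [
        {term: int(term in tokens) for term in vocabulary}
        for tokens in tokenized
    ]

    for row in indicator_matrix:
        print(row)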


Data Table Engineering

The next step in the review and enhancement-minded data wrangling process is to assess the efficacy of data collections contained in standalone data tables. The following are the most often conducted data wrangling/munging steps:
– Data layout restructuring. Re-arranging of the organizational and logical structure of a data table, geared toward redefining the basic units of analysis; it is best exemplified by transforming item-centric data layouts into buyer-centric layouts. The need to restructure the basic layout of a data table frequently arises in the context of transactional data, such as point-of-sales (POS) data captured by retail outlets, because the manner in which those data are captured and then stored allows some, but not all, commonly performed types of analyses. The general logic as well as the mechanics of data layout restructuring can be somewhat confusing but might be easier to grasp in the context of a specific, relatable type of data. POS-captured transactional data offer a good backdrop here because those data capture potentially informative (to brand managers and other organizational decision-makers) purchase-related details, but those details are structured in a way that does not allow purchase-focused analyses. Not surprisingly, being able to convert those data from their native 'as-captured' format into an analysis-friendly format is among the key requirements of being able to fully utilize their informational potential. Figure 7.2 offers a high-level schematic of the logic and the process of data layout restructuring.


Figure 7.2: Converting Item-Centric Data Structures into Buyer-Centric Structures.

The schematic shown in Figure 7.2 illustrates the essence of a common data layout restructuring, in which an item-centric layout (i.e., one where rows represent individual items sold, such as a can of beans) is transformed into a buyer-centric layout (i.e., one where rows represent purchasers of items). That type of data transformation is usually necessitated by the


difference between capture- and storage-dictated and analysis-dictated layouts. As captured and as stored, POS data are commonly organized using the traditional two-dimensional data layout where rows delimit records and columns delimit individual data features, in which case records represent individual items and data features represent attributes of those items, such as selling price or buyer, as graphically depicted by the left-hand side 'Conceptual Data Model' graph. In that layout, as shown in the 'Physical Data Layout' part of Figure 7.2, the number of records is equal to the number of individual items purchased, and each item is attributed with purchaser information. The result is an item-centric data layout where buyers are treated as attributes of individual items sold, or, framed in the context of the standard two-dimensional data matrix, where records represent items and variables represent buyers of those items. Data structured that way lend themselves to basic sales-focused analyses, such as periodic sales reporting,15 but not to buyer-focused analyses. In order to support the latter, the item-centric data layout needs to be rearranged into a buyer-centric layout, as graphically illustrated in Figure 7.2. Informally known as 'flipping' of data, it is effectively a process of changing the unit of analysis in a dataset – operationally, it is comprised of two distinct steps: step 1 is to transform individual data columns containing buyer identification into rows by means of attributing items sold to buyers (i.e., making the change shown in the 'Conceptual Data Model' part of Figure 7.2); step 2 is to reduce the dimensionality of the so-restructured data file by eliminating the duplicate buyer identification information, as graphically illustrated by the 'Physical Data Layout' part of Figure 7.2 (in the example shown there, a single buyer is purchasing five items).

15 That type of instantaneous tracking and reporting capability is why it is common to see summary reporting/business intelligence applications integrated with native database management systems.
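The following is a minimal sketch of the 'flipping' logic described above, using pandas to re-arrange a hypothetical item-centric point-of-sale extract into a buyer-centric layout; the table and column names are illustrative only.

    import pandas as pd

    # Hypothetical item-centric POS data: each row is an item sold, with the
    # buyer recorded as an attribute of the item.
    items = pd.DataFrame({
        "buyer_id": ["B1", "B1", "B2", "B2", "B2"],
        "item": ["cereal", "milk", "cereal", "bread", "milk"],
        "amount": [4.99, 2.49, 4.99, 3.29, 2.49],
    })

    # 'Flip' the data so that each row represents a buyer and each column an
    # item, with cell values holding the total spent by that buyer on that item.
    buyers = items.pivot_table(
        index="buyer_id",
        columns="item",
        values="amount",
        aggfunc="sum",
        fill_value=0,
    )

    print(buyers)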


– Deduplication. Removal of a data table's redundant data, typically in the form of rows whose record-identifying values are the same across two or more records, as exemplified by the 'Buyer 1' records in the 'Physical Data Layout' part of Figure 7.2. Definition-wise, a record is considered to be a 'duplicate' if it can be considered an exact copy of another record; in practice, however, that determination is tied to just the record-identifying feature, such as 'Customer ID'. In other words, if two separate data records in the same data table have identical, for instance, 'Customer ID' record identifiers, those records would be considered duplicates, even if other data features were not exactly the same between the two records. While such a limited duplicate identification scope may seem somewhat counterintuitive, it in fact reflects the importance of assuring that individual data records are independent of one another, which is one of the key assumptions of the basic exploratory and confirmatory analyses discussed in Chapter 5. The presence of duplicates could be a consequence of data restructuring, which was the case with converting POS-captured item-centric data into buyer-centric datasets, or it could be a result of an unanticipated data capture or processing error; it could also arise during merging of standalone data tables (discussed in the Data Consistency Engineering section). Computation-wise, duplicate identification routines are readily available as a part of the data due diligence capabilities of the R and Python open-source programming languages as well as the SAS and SPSS commercial applications (and other commercial data analytic applications such as MATLAB, Minitab, or Stata). While those applications vary in terms of their operational specifics, they generally require the user to specify two key parameters: 1. the specific data feature to be used as the target for duplicate identification, which is typically the aforementioned record-identifying feature, and 2. which record should be considered 'primary' or the 'parent', which would usually be either the first or the last record (going from the top to the bottom of the data table) in a series of duplicates. Given that the standard duplicate identification routines can use those parameters to create 'primary' vs. 'duplicate' indicators that could then be used to quickly eliminate the redundant records, it is important to carefully consider the 'primary' record designation to make sure that appropriate records are being eliminated.16

16 As discussed in more detail in the Data Consistency Engineering section, merging of two standalone but informationally related data tables may produce (in the combined data file) records that are duplicate but that have materially different contents (for instance, one could contain all customer-level demographic and past purchase details, while the other could only contain that customer's recent promotional responses), in which case being deemed 'primary' may need to be considered more in the context of content than the order of appearance.
17 That name is a more direct reflection of the commonly used computational normalization formula, which is: Xnorm = (Xactual – Xmin) / (Xmax – Xmin).
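The following is a minimal sketch of deduplication using pandas; the 'customer_id' record-identifying feature and the choice of the last occurrence as the 'primary' record are illustrative assumptions.

    import pandas as pd

    # Hypothetical data table with duplicate record identifiers.
    customers = pd.DataFrame({
        "customer_id": ["C1", "C1", "C2", "C3", "C3"],
        "region": ["East", "East", "West", "South", "South"],
        "last_purchase": ["2023-01-10", "2023-03-02", "2023-02-14",
                          "2023-01-05", "2023-02-20"],
    })

    # Flag duplicates using only the record-identifying feature; keep='last'
    # designates the last occurrence as the 'primary' record, so earlier
    # occurrences are the ones marked for elimination.
    customers["is_duplicate"] = customers.duplicated(subset="customer_id",
                                                     keep="last")

    # Drop the redundant records.
    deduplicated = customers[~customers["is_duplicate"]].drop(columns="is_duplicate")
    print(deduplicated)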


– Normalization. Also known as min-max scaling,17 it is a process aimed at making the informational content of a data table more immediately meaningful, which is commonly accomplished by adjusting values of multiple data features measured on different scales to a notionally common scale. A common example from the arena of sport performance captures the idea and the value of normalization well: let us assume that Data Feature A contains individual athletes' footspeed, and Data Feature B contains individual athletes' physical strength; the former is typically measured as distance divided by time, and the latter as the greatest load (in pounds or kilos) that can be fully moved. The resultant data are neither immediately meaningful (e.g., is running 100 meters in 14.1 seconds or being able to bench press 375 pounds good or just average?) nor directly comparable (i.e., should a given athlete be considered 'fast' or 'strong' or both?), but recasting those otherwise incommensurate measures in a common scale bound by 0 and 1 will make each measure immediately interpretable (e.g., a footspeed of 0.9 would suggest that the athlete's speed is in the top 10% of the group) and directly comparable (e.g., a footspeed of 0.9 and a strength of 0.7 lend themselves to immediate comparative conclusions). An alternative to normalization is z-score standardization, mentioned earlier in the context of outlier identification – it recasts values expressed in original units of analysis into standard measures expressed in terms of the number of standard deviations away from a mean of 0. It is notionally similar to min-max normalization in the sense that it transforms quantities expressed in different units – as exemplified earlier by time-over-distance vs. weight – into quantities expressed in abstract (i.e., 'standard') but ultimately more immediately informative units. At the same time, z-score standardization is computationally and interpretationally somewhat different, primarily because it does not have defined upper and lower limits. To use the same sports performance example, the 14.1-second 100-meter dash could translate into 2.5 standard deviations above the mean, and the bench press of 375 pounds could translate into 1.7 standard deviations above the mean; depending on the intended usage, either min-max normalization or z-score standardization could be appropriate. Computation-wise, R and Python include pre-packaged z-score standardization functions; Python also includes a standard data feature normalization function, but in R, normalization requires a two-step computation of first determining the minimum and maximum values and then computing the normalized value using the simple formula (Xactual – Xmin) / (Xmax – Xmin). Somewhat analogously, the SAS and SPSS commercial packages include built-in standardization functions, but normalization requires a similar two-step computation.
There is one additional data table engineering consideration, one aimed at the size aspect of data collections, as considered from the perspective of potential redundancies. It goes beyond the earlier discussed deduplication as it frames the idea of redundancy in the broader context of similar rather than just identical data; moreover, it takes a more global view of data tables by considering both the number of data records as well as the number of data features. Given that, it can manifest itself in two distinct aspects of data record aggregation and dimensionality reduction; the difference between those two general data table size reduction approaches is graphically illustrated in Figure 7.3.

Figure 7.3: Data Record Aggregation vs. Dimensionality Reduction.

– Data record aggregation. It is a process of combining or 'rolling up' informationally related (and thus not necessarily duplicate) disaggregate details into summative totals. In contrast to the earlier discussed deduplication, the goal of which is to eliminate redundant records, the goal of data record aggregation is to combine similar records into higher-order, more general aggregates. Using again the example of POS-captured transactional data, where individual data records represent single items, such as a 48 oz. box of ready-to-eat Brand X cereal, along with their descriptive attributes (selling price, selling time, selling location, etc.), data record aggregation would yield a number of different summaries, reflecting different informational needs. One such summary could take the form of brand-level daily totals, where all sizes (commonly referred to as SKUs, for 'stock-keeping units') of the aforementioned ready-to-eat Brand X cereal (assuming it comes in more


than just the 48 oz. box size) sold on a given day across all sales outlets are added together to yield a single 'Total Brand X Sales on Day Y' total. As implied by this simple example, that particular type of data record aggregation would effectively create new aggregate data features as replacements for the original, disaggregate attributes; for example, the originally captured 'selling price', 'selling time', and 'selling location' would be replaced with summary outcome measures, such as 'total $ sales', 'number of units sold', etc. An important consideration that permeates data aggregation is loss of information. In general, whenever granular data are 'rolled up' into more aggregate summary measures, there is an associated, and largely inescapable, loss of information. In the above example, where transaction-level details are aggregated into daily sales totals, some information, such as SKU-level sales or time-of-day sales volume variability, is lost, unless steps are taken to capture that information in appropriately designed summary measures (e.g., rather than expressing daily sales totals as a single Brand X summary measure, those totals could be computed at the overall brand level as well as at the level of individual SKUs). A somewhat different but equally important loss-of-information consideration is the type of summary measure to be used. Continuing with the example of aggregating transaction-level details into daily sales summaries, the aggregated outcome measures could take the form of totals (sums) or averages (means or medians); moreover, when capturing 'average sales', a more expansive view of the notion of 'average' might be warranted, one that captures an estimate of choice, e.g., the mean, along with its associated measure of variability, which for the mean is the standard deviation.18 All in all, data aggregation needs to be approached from the perspective of, on the one hand, increasing the informational utility of data while, on the other hand, minimizing the loss of information.

18 When electing to use the median instead of the mean, a different measure of variability needs to be used, known as the median absolute deviation, or MAD (standard deviation is a property of the mean and should not be used alongside the median). MAD is defined as the median of the absolute deviations from the data's median, computed as MAD = median(|Xi – median(X)|), where Xi denotes the individual actual values.
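The following is a minimal sketch of data record aggregation using pandas, rolling hypothetical transaction-level records up into brand-by-day summaries; retaining a sum, a mean, a standard deviation, and a unit count alongside one another is one way of containing the loss of information discussed above. All names and values are illustrative.

    import pandas as pd

    # Hypothetical transaction-level (SKU) records.
    transactions = pd.DataFrame({
        "brand": ["X", "X", "X", "Y", "Y"],
        "sku": ["X-48oz", "X-24oz", "X-48oz", "Y-12oz", "Y-12oz"],
        "sale_date": ["2023-05-01", "2023-05-01", "2023-05-01",
                      "2023-05-01", "2023-05-01"],
        "sale_amount": [4.99, 3.49, 4.99, 2.19, 2.19],
    })

    # Roll individual transactions up to brand-by-day aggregates; capturing
    # the total, the mean, and the standard deviation alongside the unit
    # count limits the loss of information that aggregation entails.
    daily_brand_sales = (
        transactions
        .groupby(["brand", "sale_date"])["sale_amount"]
        .agg(total_sales="sum", average_sale="mean",
             sale_std="std", units_sold="count")
        .reset_index()
    )

    print(daily_brand_sales)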


– Dimensionality reduction. The notion of 'dimensionality' of data is an expression of the number of data features comprising a particular dataset; given that, dimensionality reduction is simply a structured process of decreasing the number of features in a data table. It is critical to note, however, that reducing the number of features does not mean simply dropping some and keeping others – the above discussed minimization of the loss of information is just as important in the context of dimensionality reduction as it is in the context of data record aggregation. With that in mind, dimensionality reduction can be more formally defined as a structured process of decreasing the number of data features in a data table in a way that minimizes the associated loss of information. It is worth noting that the idea of data dimensionality is nowadays most readily associated with machine learning applications used with high-dimensional data. More specifically, situations in which datasets used to train machine learning algorithms are believed to contain excessively large numbers of variables in relation to the number of records19 give rise to a general estimation bias concern.20 In that particular context, the assessment of dimensionality is rooted in the relationship between the number of data records and the number of data features in a particular data table, hence a relatively small (in terms of the number of records) dataset can be deemed high-dimensional if the number of data features is close to the number of records. However, in many practical data analytic applications, i.e., in the context of the basic data analytics discussed in this book, a dataset can also be deemed high-dimensional even if the number of data features is comparatively small in relation to the number of records. POS-captured transactional data again offer a good illustration here: sales detail tables commonly contain millions of records, in relation to which several hundred data features may appear modest (i.e., not high-dimensional), yet a dataset containing that many distinct data features might be difficult to work with, especially if there is a high degree of cross-feature similarity. Moreover, even though in terms of the 'number of data features' vs. 'number of data records' ratio such a dataset may not be deemed high-dimensional, it may nonetheless contain redundant features, and thus not be informationally parsimonious. Rooted in Occam's (also written as Ockham's or Ocham's) razor, and commonly known as the principle of parsimony, this idea suggests that individual data features should be largely

19 The commonly used rule of thumb is that to be deemed 'high-dimensional', a dataset needs to encompass a number of variables that approaches or exceeds the number of records.
20 Though falling outside the scope of this basic competencies focused overview, it is worth noting that the estimation bias concern is rooted in two basic problems: 1. insufficient variability to support multivariate maximum likelihood estimation (i.e., simultaneous fitting of multiple parameters), and 2. overfitting, which is a situation where a multivariate model is too tailored to the unique specifics of a particular dataset.


informationally non-overlapping because the best explanations are those that use the smallest possible number of elements.21 All in all, the assessment of data tables' dimensionality and the possible need to reduce it should take into account the ratio of the number of features to the number of records as well as informational parsimony. Should it become evident that for either of the two aforementioned reasons a dataset of interest is in need of dimensionality reduction, there are several possible dimensionality reduction avenues that can be taken. Those can be grouped into two general categories: computation-based and estimation-based. The computation-based approach leverages general familiarity with the content of individual data tables and uses basic logical data manipulation steps to reduce the number of features by combining closely related features; consumer surveys offer an illustrative example of the use of that approach. Using established, meaning previously formally validated, psychometric scales,22 a product satisfaction survey might ask multiple disagree-agree type questions (typically using a Likert scale that offers respondents several evaluative choices anchored by opposing reactions, such as 'strongly disagree' and 'strongly agree'); designed to assess underlying latent constructs that are difficult to measure directly, such as the idea of 'satisfaction', multi-item psychometric scales yield data features that are meant to be combined using a pre-determined logic, such as computing an average of all individual items comprising a particular scale. Estimation-based dimensionality reduction approaches are both more numerous and more methodologically complex; given the basic data analytic competencies focus of this overview, only a very abbreviated, high-level overview is offered here. There are two general sub-sets of estimation-based techniques: those that only retain the most informative features, and those that combine multiple features into a smaller subset of new features. The former are nearly synonymous with machine learning techniques, such as random forest, while the latter make use of a wide cross-section of multivariate statistical techniques, with principal component analysis (PCA) being perhaps the best-known example of an estimation-based dimensionality reduction methodology. PCA converts cross-feature correlations into a set of linearly uncorrelated higher-order summary features that collapse sets of multiple individual features into single 'components' that contain most of the information in the larger set, thus fulfilling the requirement of making a dataset more informationally parsimonious while at the same time minimizing the loss of information.

21 This idea is usually operationalized as the degree of association between individual data features (e.g., the earlier discussed correlation); it plays a pivotal role in regression analyses, perhaps the most widely used family of multivariate statistical techniques, where it manifests itself as collinearity.
22 Systems of measuring not directly observable attributes such as attitudes or beliefs; as data capture instruments, psychometric scales are collections of statements asking essentially the same but differently worded, or closely related, questions which, when interpreted jointly, offer an assessment of a property or a characteristic that cannot be reliably assessed using direct means (as in a single, specific question, such as asking a person's age).
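The following is a minimal sketch of estimation-based dimensionality reduction via PCA; it relies on the scikit-learn Python library, which is one of several possible routes to the same outcome, and the correlated input data are simulated purely for illustration.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical continuous data: 100 records described by 6 features that
    # are, by construction, highly intercorrelated.
    rng = np.random.default_rng(seed=42)
    base = rng.normal(size=(100, 2))
    noise = rng.normal(scale=0.1, size=(100, 6))
    features = np.hstack([base, base * 1.5, base * -0.8]) + noise

    # PCA expects continuous inputs; standardizing first keeps features
    # measured on different scales from dominating the components.
    standardized = StandardScaler().fit_transform(features)

    # Collapse the six correlated features into two uncorrelated components
    # that retain most of the information in the original set.
    pca = PCA(n_components=2)
    components = pca.fit_transform(standardized)

    print(pca.explained_variance_ratio_)   # share of information retained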


Execution-wise, both the open-source R and Python languages and the commercial SAS and SPSS (as well as other like systems, including MATLAB, Stata, or Minitab) include standard PCA estimation modules. It is important to keep in mind that PCA requires input data to be continuous, in addition to which users need to make several estimation-related choices, such as selecting a specific component rotation approach (the goal of which is to enhance the interpretability of the overall solution); thus referencing a more in-depth description of this method is strongly urged prior to using it.
– Joint record aggregation and dimensionality reduction. The preceding overview of data aggregation and dimensionality reduction implicitly assumes that the two are conducted independently of one another, which is not always the case. An effective way of storing and maintaining continuously accruing tracking data, which are data that contain the same type of information captured at different points in time, is to combine the cross-sectional (i.e., different entities) and longitudinal facets into a single, structurally more involved data table. A good example of such a complex data structure is offered by one of the key tables (known as Fundamentals) comprising the Compustat database, a vast collection of financial, statistical, and market data on active and inactive companies throughout the world. Built to serve as a cumulative repository of regulatorily mandated financial disclosures of companies traded on public stock exchanges (e.g., the New York Stock Exchange or NASDAQ in the U.S.), the Fundamentals table encompasses multiple years of data for each of thousands of companies traded on public exchanges; moreover, it also keeps track of any corrective resubmissions, known as financial restatements. It all amounts to a complex data structure, the generalized layout of which is graphically summarized in Figure 7.4.

Figure 7.4: Generalized Fundamentals Data Table Layout.

For a single company (Company X), there are multiple years of data (Data Year 1 through Data Year n), and each year encompasses two distinct data formats (Data Format A and B), representing original and resubmitted data. When considered strictly from the perspective of data layout, the Fundamentals table combines two separate cross-sectional dimensions, one in the form of multiple companies and the other in the form of original vs. restated filings, and it also encompasses a longitudinal dimension represented by multiple years of data for each company and each filing status (i.e., original vs. restated). The resultant multidimensional structure offers an efficient solution to storing related data features, but it poses data utilization problems arising out of the lack of a singularly consistent unit of analysis. More specifically, due to the fact that some rows represent different annual, and within those, different filing status values, all for each individual company, while other rows represent different companies,23 the so-structured dataset contains



numerous systemic duplicates, which are redundancies directly attributable to the organizational structure of the data (other types of duplicates, referred to here as random, are usually an unintended consequence of data capture or processing errors). To eliminate those systemic duplicates it is necessary to collapse the Fundamentals data along either the cross-sectional dimension, if the goal of ensuing analyses is to examine cross-company differences and similarities, or along the longitudinal dimension, if the goal is to examine aggregate (i.e., all companies rolled into one) cross-time trends. Though not expressly shown in Figure 7.4, the Fundamentals data table encompasses a large number (close to 2,000) of individual features, but many of those features are informationally redundant as they represent multiple operational manifestations of the same underlying measure. For instance, the general measure of 'revenue' is shown as 'total revenue', 'gross revenue', 'net revenue', etc., all of which are highly intercorrelated, suggesting the need to also reduce the dimensionality of that data table. The 'before' and 'after' are graphically illustrated in Figure 7.5.

23 As noted earlier, the standard two-dimensional structured data layout is rooted in the idea that rows delimit individual entities, meaning that each such 'entity' – which could be a company or a time period in the context of Figure 7.4 – needs to be of the same type (i.e., a company or a time period) in order for the ensuing analysis of variability (cross-entity differences in values) to yield an unambiguous interpretation. For instance, the analysis of Total Revenue (one of the Fundamentals variables) needs to be undertaken either in the context of individual companies (cross-company revenue variability – i.e., averaged or summed across time and filing status, thus no time dimension) or in the context of time (longitudinal revenue variability – i.e., averaged or summed across all companies, meaning no company dimension).


Figure 7.5: The ‘Before’ and ‘After’ of Dimensional Reduction.
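The following is a minimal sketch, using pandas, of collapsing a Fundamentals-like table along either its cross-sectional or its longitudinal dimension; the companies, years, filing statuses, and revenue figures are hypothetical.

    import pandas as pd

    # Hypothetical multidimensional layout: company x data year x filing status.
    fundamentals = pd.DataFrame({
        "company": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
        "data_year": [2021, 2021, 2022, 2022, 2021, 2021, 2022, 2022],
        "filing_status": ["original", "restated"] * 4,
        "total_revenue": [100, 102, 110, 111, 80, 81, 90, 92],
    })

    # Collapse along the cross-sectional dimension to examine cross-company
    # differences (values averaged across years and filing statuses) ...
    by_company = fundamentals.groupby("company")["total_revenue"].mean()

    # ... or collapse along the longitudinal dimension to examine aggregate
    # cross-time trends (values summed across companies and filing statuses).
    by_year = fundamentals.groupby("data_year")["total_revenue"].sum()

    print(by_company)
    print(by_year)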


The core idea captured by the preceding example is that there are situations that necessitate jointly considering the otherwise distinct procedures of data record aggregation and dimensionality reduction. The emphasis here is not on a distinct procedure (as the previously discussed rules governing aggregation and dimensionality reduction apply here), but rather on more careful planning of how to approach the task of making the contents of a particular data table analytically usable, given the potentially different types of data analytic outcomes that could be expected. In short, when dealing with complex data, the importance of forward-looking data analytic planning cannot be overstated.

Data Consistency Engineering

The third and final aspect of transforming raw data into analytically usable datasets entails jointly considering all distinct data sources, which usually manifest themselves in multiple standalone data tables. The widespread digitization of ever more diverse aspects of commercial and social interactions translates into a rich diversity of data sources, which means that multisource analytics is becoming the norm. The so-called '360-degree customer view' offers perhaps the most visible example – it is based on a simple (conceptually, not necessarily operationally) idea, according to which developing a robust understanding of buyer behavior demands pooling together otherwise disparate elements in the form of current and past purchases, demographics, attitudes, etc. But amalgamating source- and type-dissimilar data into a single, integrated dataset can be an involved undertaking, and it brings into scope the importance of assuring analytic comparability of the individual inputs, in the form of distinct data files. Assuring consistency of the combined dataset is a process comprised of two related but distinct sets of activities: data table audit, and file merging and/or concatenation.
– Audit of data tables. Much like it is with assembling a physical item, 'doing' needs to be preceded by 'planning', or carefully assessing the required fit. Within the confines of joining together distinct data tables, it entails carefully reviewing the informational contents and encoding properties of individual tables, with the primary focus on assessing the degree to which the organizational structures of individual data tables can be brought into alignment with one another. That, however, can take on one of two different forms, depending on how the individual files are to be combined. More specifically, data files can be merged or concatenated. Merging is the process of combining contents (which could encompass all features in individual tables or just select subsets of interest) of multiple data files by linking records in File A with records in File B, and so on, based on values of a shared (meaning appearing in both data tables in exactly the same form) unique record identifier. The net effect of merging the desired contents of two or more data tables is an expansion of the number of data features; the number of data records should generally not


Within the confines of the 360-degree customer view example, file merging would be used to, for instance, combine customer demographic details (File A) with customer purchases (File B). Concatenation, on the other hand, is the process of combining records spread across multiple data tables. Here, each data file is presumed to contain a subset of a larger universe of data, meaning that File A and File B, and so on, each contain exactly the same data features but distinct and non-overlapping data records; consequently, when two or more separate data tables are concatenated, the number of data features is not expected to change, while the number of data records is expected to reflect their combined total. In the context of the 360-degree customer view example, old and current customer purchases could be concatenated.
– File merging and/or concatenation. This is the 'doing' part of the two-part data consistency engineering process, the goal of which is to implement the desired combinations of informationally related data tables, subject to considerations coming out of the audit of data tables. While the choice between merging and concatenation is primarily a reflection of the contents and structures of the individual data tables that are to be combined, that choice applies just to structured data files – unstructured data collections can only be concatenated, because they lack the organizational consistency that is needed for merging.

All considered, the difference between those two very different approaches to combining data files is graphically illustrated in Figure 7.6. As noted earlier, merging results in a greater number of distinct data features associated with a particular set of records, which may result in the combined dataset becoming high-dimensional – a possibility that should be considered prior to merging. Concatenation can yield very high record count datasets, which can become more difficult to manipulate and, more importantly, can exacerbate the shortcomings of widely used tests of statistical significance (discussed at length in Chapter 5 in the Vagaries of Statistical Significance section). In contrast to the instinctive belief held by many, more (records) is not necessarily better within the realm of statistical analyses, thus when concatenating large data tables, the size of the combined dataset needs to be carefully considered. When the combined dataset is to be used to carry out the basic descriptive and association analyses discussed in earlier chapters, and if the informational content of interest is spread across multiple data tables (e.g., each data table contains sales details for just one calendar year), it might be appropriate to consider taking a sample of each table prior to concatenation, as a way of containing the number of records in the combined dataset.

 Under some circumstances, most notably when File A and File B contain different numbers of records, the number of records in the combined dataset may temporarily increase (if the target table, or the one into which additional data features are added happens to contain fewer records than the contributing table); however, under those circumstances the ‘extra’ records would typically be eliminated due to their incompleteness, which would ultimately result in an unchanged number of records in the combined dataset.

Figure 7.6: Merging vs. Concatenation in the Context of Structured Data. (Schematic: under MERGING, records in Table 1 and Table 2 are linked on a shared Entity ID, combining their respective data features; under CONCATENATION, Table 1 and Table 2 contain the same data features and their records are stacked into a single, longer table.)

In terms of more procedural considerations, merging imposes specific requirements on the data collections that are to be combined; those include commonality of the unique record identifier (i.e., the same variable formatted the same way), identical record counts (merging requires record-to-record linking), uniformity of data aggregation levels, and record ordering uniformity (all individual data files need to be sorted in the same fashion, i.e., in either ascending or descending order, using the values of the aforementioned unique record identifier feature). Failure to meet one or more of those requirements will inescapably result in undesired consequences, which can range from an outright failure to produce a combined dataset (i.e., the merge process will fail to complete) to the creation of duplicate records (which will need to be removed prior to analyses of data).

Concatenation, being comparatively more straightforward, may invite less scrutiny, but that process can also go astray; of particular concern here are the encoding specifics of individual data features. Since all data tables being combined that way need to include the same data features, it is important to assure cross-table invariance with regard to not only the encoding type (e.g., continuous vs. ordinal vs. nominal) but also more minute aspects, such as the width (the number of storage bits allocated to a particular type of value – otherwise alike data features may exhibit different widths across data tables) or the type of quantity encoding (e.g., including decimals vs. not – again, manifestly alike data features may include decimal values in one table and not in another).

Execution-wise, open-source general-purpose languages (R and Python) as well as commercial general-purpose systems (SAS, SPSS, and others) all include standard data file merging and concatenation routines. Given the numerous technical considerations associated with combining standalone data tables, it is important to carefully examine routine-specific input file requirements to make sure that individual data files have been adequately pre-processed prior to execution.

As evidenced by the somewhat lengthy overview of the numerous procedures and considerations that jointly comprise the broad process of transforming raw data into analyzable datasets, making data usable can be a tedious, involved process. Mastering that process is seen here as the heart of fundamental data analytic competencies, not only because it is so clearly essential to being able to extract meaning out of data, but also because those abilities do not lend themselves to the type of automation that made carrying out computationally involved data analytic tasks, such as computing a correlation matrix, as simple as a handful of mouse clicks or invoking a pre-packaged routine. Making data usable is a purposeful process built around conscious awareness of the many critical considerations, understanding of alternative courses of action and their implications, and the skills needed to execute the desired action steps.
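By way of illustration, the following is a minimal sketch – written in Python using the pandas library, with hypothetical table and column names – of a simple pre-merge audit followed by a merge and a concatenation; it is intended only as an illustration of the general logic, not as a definitive recipe.

import pandas as pd

# File A: customer demographics; File B: customer purchases. The table and
# column names are hypothetical.
demographics = pd.DataFrame({
    "customer_id": ["001", "002", "003"],
    "age": [34, 51, 27],
    "region": ["East", "West", "East"],
})
purchases = pd.DataFrame({
    "customer_id": ["001", "002", "003"],
    "purchase_amt": [120.50, 89.99, 42.00],
})

# Audit step ('planning'): confirm the shared record identifier exists in both
# tables, is stored the same way, and contains no duplicate values.
assert "customer_id" in demographics.columns and "customer_id" in purchases.columns
assert demographics["customer_id"].dtype == purchases["customer_id"].dtype
assert demographics["customer_id"].is_unique and purchases["customer_id"].is_unique

# Merging ('doing'): adds data features; the record count should not change.
merged = demographics.merge(purchases, on="customer_id", how="inner")

# Concatenation: stacks tables that share the same data features; the record
# count grows to the combined total, the feature count stays the same.
purchases_2022 = pd.DataFrame({
    "customer_id": ["004", "005"],
    "purchase_amt": [15.25, 310.00],
})
all_purchases = pd.concat([purchases, purchases_2022], ignore_index=True)

print(merged)
print(all_purchases)

The audit assertions mirror the 'planning before doing' logic described above – if any of them fails, the combining step should not be attempted until the underlying inconsistency is resolved.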


Enriching Informational Contents of Data

An important, though often overlooked, aspect of data manipulation and engineering skills is the ability to expand the available data's informational value. The earlier outlined data feature engineering is the first step in that direction, but the focus there is on correcting structural deficiencies – the next step is to look beyond 'what is' and toward 'what could be.' The 'what is' represents the current state of a properly cleaned and organized dataset: no duplicates, analytically appropriate dimensionality, etc. The 'what could be' represents looking at that collection of features from the perspective of the informational needs at hand and asking a basic question: Are there features that the contemplated analyses call for, but that are not readily available? If the answer is 'yes', it might be possible to derive new data features by means of either restructuring or combining the existing ones. To be clear, the idea here is to go beyond the earlier discussed amending of existing features by means of normalization, z-score standardization, or binning – going beyond those statistical, structure-minded enhancements entails tapping into manifest informational facets of existing data elements to create new data features. For example, a brand manager might be interested in delineating demographic and behavioral differences between 'active' and 'inactive' (i.e., lapsed) customers, but normally, data available for such analysis does not include a ready-to-use 'active-inactive customer' indicator, though it may be possible to derive such an indicator by leveraging existing data features. More on that shortly.

When considered from a general informational perspective, the totality of data enhancement efforts can be characterized as conceptual data model specification; on a more micro level, newly derived data elements can be characterized as effects, or metrics created to capture particular latent (i.e., not expressly accounted for) phenomena. Although in theory an infinite number of such custom variables can be derived, they fall into three basic categories: indicators, indices, and interaction terms. Indicators are typically dichotomous measures created to denote the presence or absence of an event, trait, or phenomenon of interest; commonly referred to as 'flags' or 'dummy-coded' variables, indicators are structured, categorical data features encoded numerically (e.g., 0–1) or as strings (e.g., yes vs. no). A common example of a derived indicator is to use a date field, such as 'Product Return Date', to create an action-denoting flag, such as 'Return' (coded as 0–1), where the presence of an actual date value is encoded as '1' and all other values as '0'.25

 Transaction processing systems typically auto-generate ‘time stamps’ to go along with recorded events, thus within the confines of individual data records, the presence of a date value indicates that a particular event took place, and the absence indicates that it did not take place. Working with raw date fields can be cumbersome and at times tricky, thus re-casting the informational content of a particular ‘date’ field into an easy-to-use binary indicator will almost always enhance the utility of a dataset.


Another type of derived effect, the index, is notionally reminiscent of the indicator insofar as both capture latent effects of interest, but the two are distinct because indices capture 'rates' rather than 'states' (e.g., the rate of pharmaceutical product liability lawsuits, derived from raw litigation tracking data) and thus are typically expressed as continuous variables. And lastly, interactions are combinations of two or more original variables intended to capture the joint impact of those variables that goes beyond their individual impact.

While the rationale for deriving indicators and indices seems straightforward, the need for interaction effects is more nuanced, as it emanates from two largely distinct considerations. The first is the need to fully explain the phenomenon of interest, which may require linking two or more standalone data elements. An obvious example is a thunderstorm, which is a general name for a weather phenomenon that combines lightning, thunder, and rain. Hence to fully explain the phenomenon known as a thunderstorm it is necessary to derive an interaction effect that combines the joint occurrence of the otherwise standalone phenomena of lightning, thunder, and rain. The second reason for the creation of interaction effects is a bit more obtuse, as it stems from what is technically known as collinearity, or correlation between predictor variables in regression models (the presence of which can make it difficult, at times even impossible, to reliably estimate individual regression coefficients). And though regression analyses fall outside the scope of this basic data analytic competencies focused overview, in a more general sense, collinearity increases explanatory redundancy, and by doing so it diminishes informational parsimony.26 For those reasons, it is generally recommended to reduce collinearity, which is usually accomplished by eliminating collinear data features.27 Doing so, however, may inadvertently result in loss of information, but that loss can be minimized when an interaction combining the effect of the two collinear variables is created as a new, standalone measure28 (it should be noted that interactions are not limited to just combining two features – while not often seen, three-way, or even higher-order, interactions can be created).

Turning back to the binary 'active-inactive customer' indicator mentioned earlier, Figure 7.7 offers a summary of the logic used to derive that indicator from a 'transaction date' field. Relying on the combination of business logic (e.g., customers who did not make a purchase in the past 12 months are considered 'inactive') and the existing Transaction Date feature, the new Customer Status data feature is derived, as graphically summarized in Figure 7.7.

This line of reasoning stems from the principle of parsimony commonly known as Occam's (also spelled Ockham's) razor, which states that entities should not be multiplied without necessity; in other words, if two explanations account for all pertinent facts, the simpler of the two is preferred.
Typically, if two predictors in regression analysis are collinear, the one that exhibits stronger association with the target or dependent variable is retained and the other one is eliminated.
In that situation, one of the collinear features would still be eliminated, but at least some of that feature's informational content would be captured, and thus retained, by the newly created interaction effect.

Figure 7.7: Derived Data Example: Creating Customer Status Indicator. (Schematic: a table with Customer ID, Transaction $, and Transaction Date columns, where the Transaction Date values – e.g., 2/20/2023, 12/18/2022 – are used, together with the stated business logic, to assign a Customer Status of Active or Inactive.)

In more operationally explicit terms, derivation of the new Customer Status data feature makes use of conditional 'if-then' logic (e.g., if Transaction Date is any non-missing value then Customer Status = Active, else Customer Status = Inactive), readily available as defined functions29 in general-purpose open-access computational languages (R and Python) as well as general-purpose commercial systems, such as SAS or SPSS. It should be noted that the example shown above highlights just one of many potential applications of derived data logic – for instance, the 'Transaction $' feature could be used in conjunction with appropriate business logic to derive a 'customer value' (e.g., high – average – low) indicator.

Somewhat further afield, in terms of the level of difficulty, is applying the above informational enrichment logic to unstructured data. While clearly beyond the scope of rudimentary data analytic competencies, it is nonetheless worth mentioning that the idea of data emergent from data is at the core of sentiment analysis, which makes use of text analytics, natural language processing, and computational linguistics to identify and categorize opinions expressed in text – most notably, social media postings, such as online product reviews.

It is worth noting that derivation of the other two types of new data features – indices and interaction terms – can also be accomplished using readily available routines. Here, numerous operators and functions are available in R to create new variables, Python users tend to rely on direct column assignment or the pandas assign() function (Python itself has no specific command for declaring new variables), SPSS Statistics users tend to use the Compute Variable function, and SAS users utilize LENGTH, ATTRIB, or assignment statements.
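As an illustration of that conditional derivation logic, the following minimal Python (pandas) sketch derives a Customer Status indicator from a hypothetical Transaction Date field, using an assumed 12-month activity window and an assumed reference date; it is one possible rendering of the logic rather than the definitive implementation.

import numpy as np
import pandas as pd

# Hypothetical customer transaction table; missing dates denote customers with
# no recorded purchase.
customers = pd.DataFrame({
    "customer_id": ["001", "002", "003", "004"],
    "transaction_amt": [120.50, None, 89.99, None],
    "transaction_date": pd.to_datetime(["2023-02-20", None, "2022-12-18", None]),
})

# Assumed business logic: a non-missing transaction date within the past 12
# months (relative to an assumed reference date) maps to 'Active', else 'Inactive'.
reference_date = pd.Timestamp("2023-03-01")
cutoff = reference_date - pd.DateOffset(months=12)
is_active = customers["transaction_date"].notna() & (customers["transaction_date"] >= cutoff)
customers["customer_status"] = np.where(is_active, "Active", "Inactive")

print(customers[["customer_id", "transaction_date", "customer_status"]])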

✶✶✶

The preceding overview speaks to the importance of developing robust and comprehensive data manipulation skills. While it is not feasible to foresee all possible data manipulation needs that may emerge, the data due diligence and data feature engineering steps outlined above are common to most analyses of structured data because, as much as structured data may vary, at times considerably, in terms of size and content, their organizational schemas – and thus processing requirements – tend to cut across those differences.

Informational Computing Skills

The goal of the numerous data element, data table, and data consistency engineering steps discussed in the previous section is to transform raw and often messy data into analytically usable datasets. Interestingly, in contrast to the nuanced and situational nature of data wrangling, the subsequent data utilization lends itself to a far greater degree of automation, as even syntax-based means of data analysis, such as the R and Python programming languages, boast extensive libraries of packaged applications. In that sense, while the abstractly defined informational goals of analyses require clear understanding of the differences among distinct data analytic methods and techniques, the execution-oriented computing skills are largely centered on identification, proper specification, and deployment of standalone packaged data analytic utilities. In short, in view of the advanced nature of modern data analytic tools, informational computing skills manifest themselves primarily in knowing which specific estimation 'package' to use, and how to use it correctly – meaning in keeping with applicable assumptions and restrictions, and in alignment with the informational goals of interest.

Focusing once again on just rudimentary data analytic competencies, informational computing skills largely take the form of the ability to execute the earlier discussed (in Chapter 5, Knowledge of Methods) exploratory and confirmatory analyses. Moreover, given that data analytic outcomes can take either numeric or visual form, those skills need to encompass the ability to produce numeric estimates as well as visually expressed outcomes. To a large extent, those abilities are tied to specific data analytic tools, which include the general-purpose programming languages, most notably R and Python, and commercial systems, such as SAS or SPSS, as well as select limited-purpose applications, which include the ubiquitous MS Excel and the increasingly popular Tableau. While some more advanced or inclined data users may be proficient in multiple substitute or surrogate tools, such as SAS and R, beginners as well as more casual data users tend to focus on narrower sets of tools (which often means choosing a particular general-purpose tool, such as SPSS or Python, along with need-aligned limited-purpose tools).

It is important to not lose sight of the breadth of available choices – while throughout this book attention is drawn to just the most representative sets of data manipulation and analysis related software applications, those are just the most commonly used tools.30

Also not included here are advanced data analytic systems – i.e., those combining a high degree of automation with breadth and depth of data mining and estimation techniques – such as SAS Enterprise Miner or JMP (also part of SAS Institute).


Less widely used (in the realm of data analytics) programming languages, such as Java and C++, and commercial data analytic systems, such as MATLAB or Stata, also offer the 'bells and whistles' required to carry out fundamental – as well as, in many instances, advanced – data analyses. The universe of 'other' choices is even greater when it comes to limited-purpose applications, especially data visualization tools, where applications such as QlikView, Dundas BI, or MS Power BI, to name just a few out of numerous competing offerings, offer alternatives to Tableau.

With so many different options readily available, it can be difficult to pick a set of tools to suit one's needs and abilities because there are no universally 'best' tools – there are just tools that are in better or worse alignment with individual needs. And while it is difficult to address all possible choice-impacting considerations, there are a handful of factors that capture the most salient points.

Choosing Data Analytic Tools

Perhaps the most top-of-mind data analytic tool selection consideration is the tool-specific learning curve, or the rate of a person's progress in developing basic usage proficiency. Within the confines of data analytic software applications, the learning curve is strongly and positively correlated with the degree of function automation, which is the extent to which the functionality of a particular software application can be accessed using a familiar-to-most point-and-click GUI (graphical user interface). Starting with the general-purpose data analytic applications discussed earlier, SPSS offers arguably the greatest degree of function automation, which manifests itself in an elegant, highly evolved GUI-based way of interacting with the software's functionality.31 That is not surprising, as ease of use has been one of the SPSS design hallmarks for the past several decades (SPSS version 1 was released in 1968; as of the writing of this book, SPSS Statistics just released version 28). SPSS Statistics' key competitor, SAS, also offers a GUI front end in the form of the SAS Enterprise Guide application, though it is generally considered not as evolved as the one offered by SPSS. As can be expected, the R and Python programming languages do not boast such highly automated GUI-based interfaces, though both offer workspaces known as integrated development environments, or IDEs, to support easier execution, debugging (identifying and correcting syntax errors), and outcome review.

It should be noted that SPSS Statistics (which is the full name of the application, meant to differentiate it from its sister application, SPSS Modeler, which is focused on data mining and text analysis) has its own built-in syntax language which offers a parallel to the point-and-click functionality (i.e., one can interact with the software using either its GUI or by executing syntax commands), as well as the ability to save, edit, and re-use syntax capturing GUI-accessed functions. In other words, executable syntax associated with each function accessed via the system's GUI interface can be saved as a separate syntax file (a single file is typically used to capture distinct sets of commands), which can be modified and run separately, just like R or Python syntax (much of the same is true of SAS).


The best known of those interfaces is RStudio, which was initially developed for R but can now be used with both R and Python.

Looking beyond function automation, the second key data analytic tool selection consideration is cost (in the eyes of many it may rank higher than function automation), and here the open-access R and Python programming languages have an undeniable advantage, since open-access means cost-free. Moreover, the most widely used IDE, RStudio, as well as popular code sharing and project collaboration platforms, such as GitHub, are also free of cost, and are supported by large communities of users. However, it is important to keep in mind that those are resources that require time, effort, and commitment, which means that individual (R and/or Python) learners need to self-navigate the rich but also confusing and at times overwhelming arrays of resources – as can be expected, that works for some, but not for all. With that in mind, there are numerous service providers that offer paid support for those interested in learning how to program in R and/or Python but who require more structured, dedicated learning support mechanisms.

Yet another data analytic tool selection consideration is the prevailing data usage manner. This is not meant to suggest that data users restrict themselves to one and only one data usage modality, but rather it is a manifestation of the fact that many data users are professionals who see data analytics as that which helps them do what they do better. A brand manager, for example, relies on data analyses-derived insights to make better informed brand related decisions – that individual may use data to track key brand performance indicators. A different professional, such as a marketing analyst, may use data in a wider array of situations, such as to assess the impact of different promotions or to segment different customer populations. In a more general sense, the former is focused on certain types of recurring analyses (to estimate the same types of outcomes at different points in time), whereas the latter's data usage can be described as ad hoc, or as-needed, meaning it can vary from instance to instance. As a general rule, recurring analyses are best served by syntax-based applications, while ad hoc analyses tend to be well-served by GUI-based systems, used in that functionality; that said, as noted above, the GUI-based systems (i.e., SAS and SPSS Statistics) also incorporate their own syntax languages, which makes those tools more versatile from the standpoint of prevailing data usage.

Computing Data Analytic Outcomes

The broadly framed basic data analytic capabilities have been described in Chapter 5 in the context of two general data analytic approaches: exploratory and confirmatory. And while the two approaches are distinct in terms of key methodological considerations and the specifics of computational know-how, those differences do not translate into appreciable data analytic tool disparities. As suggested by their general designations, the general-purpose open-access programming languages and the general-purpose commercial systems both support essentially any type of data analysis, ranging from simple univariate descriptive summaries to complex multivariate predictive modeling.
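By way of a simple illustration, the following minimal Python sketch – using the pandas and SciPy libraries, with made-up promotional response data – computes a few of the rudimentary outcomes discussed earlier: a univariate descriptive summary, a bivariate correlation, and a two-sample t-test; the variable names and values are hypothetical.

import pandas as pd
from scipy import stats

# Made-up data: spend and store visits for customers exposed to offers A and B.
data = pd.DataFrame({
    "offer": ["A"] * 5 + ["B"] * 5,
    "spend": [120, 95, 140, 110, 130, 80, 75, 105, 90, 85],
    "visits": [4, 3, 5, 4, 5, 2, 2, 3, 3, 2],
})

# Exploratory: univariate descriptive summary of spend.
print(data["spend"].describe())

# Exploratory: strength and direction of the spend-visits association.
r, p_corr = stats.pearsonr(data["spend"], data["visits"])
print(f"spend vs. visits: r = {r:.2f} (p = {p_corr:.3f})")

# Confirmatory-style comparison: two-sample t-test of spend across the offers.
spend_a = data.loc[data["offer"] == "A", "spend"]
spend_b = data.loc[data["offer"] == "B", "spend"]
t_stat, p_ttest = stats.ttest_ind(spend_a, spend_b, equal_var=False)
print(f"offer A vs. B spend: t = {t_stat:.2f} (p = {p_ttest:.3f})")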


In that sense, informational computing skills ultimately manifest themselves as competencies in one or more of those tools. There are, however, some differences between exploratory and confirmatory analyses in terms of limited-purpose tools:
– Exploratory analyses. The general purpose of data exploration is to uncover previously unknown patterns and relationships. Given such an open-ended scope, both in terms of data analytic goals and expected outcomes, exploratory analyses place particularly high value on visual renderings of data analytic outcomes. As captured by the well-known adage – 'a picture is worth a thousand words' – well-crafted visualizations, ranging from single-outcome graphs and charts to larger story-telling summative dashboards, can offer more impactful means of sharing data analytic outcomes. In fact, the popularity of the open-access Python and commercial Tableau32 data analytic tools can be largely attributed to their diverse (Python) and easy to use (Tableau) visualization capabilities. Of course, not all data exploration outcomes lend themselves to graphing or charting – inferential analyses, as exemplified by parameter estimates (e.g., average purchase amount) expressed as statistical significance-bound confidence intervals, or estimates of the strength and nature of cross-variable associations captured by correlation coefficients, t-tests, F-tests, and χ2 tests, all lend themselves more naturally to numeric representations. Thus, in addition to general-purpose commercial systems or general-purpose programming languages, extracting and communicating exploratory data analytic insights is greatly aided by competencies with limited-purpose commercial systems such as MS Excel or Tableau.
– Confirmatory analyses. Focused on formal assessment of the validity of knowledge claims, confirmatory analyses are heavily rooted in examination of numerically expressed outcomes. In contrast to visualization-heavy exploratory analyses, the use of graphing and charting in confirmatory analyses is far more limited, and it skews heavily toward evaluations of statistical properties of data analytic outcomes. In keeping with that, confirmatory analyses related informational computing skills place strong emphasis on general-purpose commercial systems and/or open-access programming languages, all of which encompass robust statistical assessment related visualization capabilities.

✶✶✶

The overview of computational skills presented in this chapter was geared toward painting a clear picture of the many nuanced steps that may need to be taken to make raw data analytically usable. The next chapter continues that theme by addressing considerations that are critical to transforming data analytic outcomes into informative insights.

 Tableau Public is a free access (but not open-source) platform allowing anyone to create and publicly share data visualizations online; given its ‘open to the world’ character, it is most appropriate for students, bloggers or journalists rather than business users wishing to keep their analyses private (who typically opt for fee-based Tableau Desktop).

Chapter 8 Sensemaking Skills

While undeniably essential, the ability to compute data analytic outcomes is ultimately a means to an end, which is the extraction of informative, valid, and reliable insights out of data. The set of abilities needed to translate the often esoteric outcomes of data analyses into such insights is framed here as sensemaking skills. It encompasses a relatively broad set of capabilities and competencies that are at the core of correctly interpreting data analytic outcomes; those frequently overlooked abilities play a critical role in shaping the efficacy of the resultant knowledge. Underscoring the criticality of sensemaking skills is a rather obvious observation: once translated into typically qualitative generalizations, such as 'accounting accruals increase the likelihood of securities litigation', data analyses-derived insights can be embraced or not, but they are no longer scientifically (i.e., using the accepted tools and rationale of the scientific method) examinable. Simply put, lack of proper result due diligence can lead to enshrining of empirically questionable conclusions as justified beliefs. In that sense, the ability to correctly interpret data analytic outcomes is as important as the ability to generate those outcomes.

Yet in contrast to the well-defined, almost 'tangible', in a manner of speaking, computational skills described in the previous chapter, sensemaking skills are elusive. Almost as a rule, data analytics focused overviews stop at result generation, implicitly assuming that utilization of the generated results is either straightforward or can take on too many different paths to be meaningfully summarized. Applied data analytic experience, however, suggests that post-analysis sensemaking may not be as straightforward as some may want to assume, and while there are indeed untold many potential data usage contexts, those contexts nonetheless share some distinct commonalities. Within the confines of the basic data analyses discussed in this book, complexity tends to manifest itself as the difference between desired result utilization and those results' interpretation limits. For example, business and other practitioners prefer exact to approximate estimates, or point estimates over confidence intervals, to use the technical jargon; moreover, they also like to ascribe a certain degree of confidence to those typically sample-based estimates, most often in the form of the familiar notion of statistical significance. In short, practitioners desire to frame sample-derived point estimates as exhibiting a certain (e.g., 95%) level of significance, but that desire runs counter to the basic precepts of statistical analyses. That simple example is at the heart of the importance of sensemaking skills: to correctly translate technical results into information, which is at the core of data storytelling.

As graphically summarized in Figure 6.1, the broad domain of sensemaking skills can be broken down into two distinct sub-domains: factual and inferential. Factual sensemaking skills represent the ability to correctly interpret the explicit informational content of data analytic outcomes, while inferential sensemaking skills represent the ability to extract implicit, but still valid and reliable, informational content of data analytic outcomes.


More specifically, recalling the general distinction between exploratory and confirmatory analyses, where the former focuses on delineation of 'what is' while the latter ventures into the speculative realm of presumptive interpretations and extrapolations, telling a largely factual, descriptive story calls for a somewhat different set of competencies than telling a story built around speculative estimates.

Factual Skills: Descriptive Analytic Outcomes

While it is predictive analytics, and more recently the machine learning facets of data analytics, that tend to headline discussions of data utilization, it is the humble descriptive exploratory analyses that consistently deliver decision-guiding value. The reason is rather simple: Describing 'what is' is both simpler to undertake and simpler to make sense of; additionally, summarizing and describing emerging trends and relationships is generally seen as largely factual and thus more immediately believable than speculative predictions generated by complex multivariate statistical models or black box machine learning algorithms. Still, interpretation of comparatively simple descriptive data analytic outcomes can be nuanced, and care must be taken to avoid potential interpretation pitfalls.

Broadly characterized, descriptive analyses are geared toward answering the general question of 'what is' – What is the structure of Brand A's customer base? What are the leading indicators of securities litigation? What are the key drivers of customer loyalty? Those are just a few of the seemingly endless array of potential questions, all geared toward uncovering previously unknown patterns and relationships, inspired and guided by the overall goal of infusing objective insights into a wide array of organizational planning and decision-making processes. Focused on describing and/or summarizing individual data features and their interactions (i.e., associations between individual features), descriptive data analytic outcomes are most commonly generated using subsets of all potentially available and applicable data1 and thus are meant to be interpreted as approximately true – meaning, as largely qualitative takeaways. For instance, a statistical test of difference between response rates to promotional offers A and B (in this case, the t-test) showing A to be greater than B should be interpreted in the general context of A being more appealing than B, rather than in the context of the numeric difference between the two response rates. As discussed in the Vagaries of Statistical Significance Tests section (Chapter 5), there is a tendency to quantify such differences in terms of the exact numeric spread, e.g., response rate A minus response rate B, and even to ascribe statistical significance to that numeric difference.

In more formal terms, the totality of all potentially available data can be characterized as a population of interest, whereas a particular subset of data used in the analysis can be characterized as a sample; in applied business analyses, the choice of a particular sample is often a manifestation of the desire to make the results as applicable as possible to the decision context.


Doing so is incorrect as it disregards that the A and B groups are subsets of all responders and thus the two response rates should be treated as sample means subject to sampling error (which means that, response rate-wise, their estimates should be expressed as confidence intervals). It is important to emphasize that the reasoning outlined here implicitly assumes that the desired informational outcome is a generalizable conclusion that one type of promotional offer (e.g., A) can be seen as being more attractive than the other type of offer (e.g., B), which is in line with the common practice of conducting limited-scale in-market tests prior to committing to a large-scale rollout. That said, it is conceivable that groups A and B used here together comprise the total population of interest and, furthermore, that the informational goal is to simply compare the two outcomes, without any subsequent generalizations. Should that be the case, the simple 'A minus B' type of comparison would be warranted, and there would not be a need to consider the impact of sampling error. Recognizing that distinction underscores the importance of sensemaking skills.
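To make that distinction concrete, the following minimal Python sketch – using the statsmodels library, with made-up responder counts and group sizes – expresses the two response rates as sample-based estimates bounded by confidence intervals, rather than as a bare 'A minus B' point difference.

from statsmodels.stats.proportion import proportion_confint, proportions_ztest

# Hypothetical in-market test results: responder counts and group sizes.
responders = [230, 180]
group_sizes = [2000, 2000]

# Each response rate is a sample-based estimate, so express it as a confidence interval.
for offer, count, n in zip(["A", "B"], responders, group_sizes):
    low, high = proportion_confint(count, n, alpha=0.05, method="wilson")
    print(f"Offer {offer}: rate = {count / n:.3f}, 95% CI = ({low:.3f}, {high:.3f})")

# A two-proportion z-test speaks to whether the observed difference is larger
# than sampling error alone would plausibly produce - not to the exact size of
# the difference, and not to its future replicability.
z_stat, p_value = proportions_ztest(responders, group_sizes)
print(f"Two-proportion z-test: z = {z_stat:.2f}, p = {p_value:.3f}")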

Key Considerations

By and large, however, applied business and other types of analyses use subsets of data to derive insights that can be generalized onto larger populations. In view of that, one of the key descriptive data analyses related sensemaking skills is the ability to soundly characterize the generalizability of those outcomes, seen here as the degree to which the resultant takeaways can be applied to a larger context of interest. And while within the realm of statistical analyses that usually implies formal assessment of sample-to-population projectability, in applied data analytics the idea of a population might be somewhat elusive. For example, for consumer staple brands, such as ready-to-eat breakfast cereals, the population is the totality of buyers of all ready-to-eat cereal brands, but when considered more closely, that overtly simple definition becomes surprisingly difficult to operationalize because of buyer classification ambiguity. Is it just the consistent purchasers of those products, or should the framing of that population also include anyone who occasionally purchases ready-to-eat cereal, however infrequently, or maybe even those who purchased a given brand a single time? And once definitionally agreed on, is that population operationally knowable – meaning, could all ready-to-eat breakfast cereal purchasers be identified? Not likely, as even in today's era of ubiquitous data capture both cash and credit transactions cannot be readily attributed to identifiable individuals (which is why retailers rely so heavily on loyalty programs to identify and track purchasers). The point of the above analysis is to draw attention to the practical limitations of the longstanding statistical notion of 'population' by showing that while it may make sense to use it in some situations, it may not make sense to use it in other contexts. And by extension, not all data analytic outcomes should be interpreted in the traditional scientific context of sample-to-population generalizability; in some contexts, a degree-of-applicability based interpretation might be more appropriate.


To be clear, the idea of 'degree of applicability' is not a formal statistical concept; rather, it is a practice-born recommendation rooted in the recognition of some considerable theory building vs. applied analysis differences. First and foremost, the goal of theoretical research is to uncover and/or test universally true generalizations, which contrasts sharply with applied analytics that are typically geared toward identification of entity- (e.g., a company or a brand) and situation-specific insights. In a very broad sense, theoretical research aims to find truths that apply equally to all, whereas applied research tends to focus on competitively advantageous insights. In keeping with those fundamental differences, the most important methodological considerations surrounding theory building research pertain to (point-in-time) sample-to-universe generalizability, whereas the manner in which applied analytic insights are used suggests a stronger emphasis on future replicability. Interestingly, the widespread use of the notion of statistical significance in applied data analytics is likely a product of a somewhat erroneous interpretation of the meaning of that notion, one which confuses sample-to-population generalizability with now-to-future replicability. To use the earlier example of responses to promotions A and B – if A and B represent limited in-market tests, the decision-makers are ultimately most interested in the future replicability of the observed results, but the notion (and the underlying statistical tests) cannot be used to attest to that.

If not statistical significance, what then could be used to supply some tangible footing to the highly interpretive character of descriptive data analytic outcomes? The answer is effect size, which is the magnitude of an estimate or the strength of the estimated relationship. Recalling the earlier discussed shortcomings of statistical significance tests, which can be effectively fooled by the number of records used to compute the requisite test statistics, making sense of effect size magnitudes can also be fraught with interpretational ambiguity. A good example is offered by correlation analysis: The commonly used Pearson correlation expresses the strength and the direction of bivariate associations on a scale ranging from −1 (perfect inverse correlation) to 1 (perfect direct correlation). While manifest correlations such as −.02 or .87 are clearly weak and strong, respectively, what about, for instance, a correlation of .42? There are no hard and fast interpretation rules for qualifying such neither pronouncedly low nor high associations,2 thus magnitudinally alike correlation estimates can, and frequently are, interpreted differently across users and situations. That said, there are structured evaluation steps that can be taken to remove much of that subjectivity – those are discussed in more detail later in this chapter.

 A quick internet search will uncover numerous sources relating absolute r coefficient (correlation) values to qualitative categories such as r < 0.3 = very weak, 0.3 < r < 0.5 = weak, 0.5 < r < 0.7 = moderate, r > 0.7 = strong association; while it might be tempting to embrace such seemingly reasonable schema, it is important to note that it is just one of many such interpretations. Moreover, such generalizations do not account for sometimes significant differences in the amount of variability – and thus implicit range of the strength of association – across different data sets (which is likely one of the reasons behind lack of generally agreed strength of association interpretation schemas).
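One simple, widely used way to put such an estimate on firmer footing is to report it together with its own confidence interval. The following minimal Python sketch – using NumPy and SciPy, with made-up paired observations – computes a Pearson correlation and a conventional Fisher z-transformation based 95% confidence interval around it; it is offered as an illustration of that general practice.

import numpy as np
from scipy import stats

# Made-up paired observations (e.g., advertising spend and sales, arbitrary units).
x = np.array([12, 15, 9, 20, 17, 14, 22, 11, 18, 16])
y = np.array([30, 36, 25, 41, 33, 32, 40, 29, 35, 37])

r, p_value = stats.pearsonr(x, y)

# Fisher z-transformation based 95% confidence interval for the correlation.
n = len(x)
z = np.arctanh(r)
z_se = 1 / np.sqrt(n - 3)
low, high = np.tanh(z - 1.96 * z_se), np.tanh(z + 1.96 * z_se)

print(f"r = {r:.2f}, p = {p_value:.3f}, 95% CI = ({low:.2f}, {high:.2f})")

Reporting the interval alongside the point estimate makes the imprecision of the estimate explicit, which tempers the temptation to over-interpret a single number such as .42.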


Interpretational challenges are not limited to numerically expressed outcomes; in fact, the now very popular visual data representations can give rise to numerous interpretational challenges of their own. Perhaps the most obvious are individual-level differences, where multiple users of data analytic outcomes may draw somewhat differing conclusions from the same visualizations. Somewhat less obvious are differences that may emanate from, well, not paying close enough attention to scale and related choices, as captured in the examples shown in Figure 8.1.

Figure 8.1: Stock Prices for a Sample Security. (Two line charts of the same price series: the left panel plotted on a $0–$30 axis, the right panel on a more granular $19–$27 axis.)

The trendlines shown above depict exactly the same stock price data, just graphed using different scales: the left side uses a $0–$30 scale, while the right side uses a more granular $19–$27 scale. As is clearly visible, the right-hand side trend suggests a noticeably greater degree of (stock price) volatility than the left-side trend, and considering that the choice of scale granularity is not always well thought out (in fact, in some applications, such as MS Excel, it is auto-selected using the distribution of the data to be graphed), there is an element of chance that sneaks into the interpretation exemplified in Figure 8.1. If one analyst picked a less granular and another one picked a more granular scale, and if the two reached somewhat different conclusions, which conclusion would be right? It is a question with no clear answer, as scales that are exceedingly coarse may diminish perceived variability, while scales that are exceedingly granular may create an illusion of significant variability – but what constitutes 'exceedingly' coarse or granular is not always clear. All considered, drawing true and justified conclusions from observed patterns and associations calls for careful and deliberate choices.
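The effect is easy to reproduce. The following minimal Python sketch – using the matplotlib library, with a made-up price series – plots the same data twice, once on a wide $0–$30 axis and once on a narrow $19–$27 axis, mirroring the two panels of Figure 8.1.

import matplotlib.pyplot as plt

# Made-up daily closing prices for a sample security.
prices = [24.1, 24.6, 23.9, 25.2, 24.8, 25.6, 24.3, 25.9, 25.1, 26.2]

fig, (ax_wide, ax_narrow) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: wide $0-$30 axis - the series looks nearly flat.
ax_wide.plot(prices)
ax_wide.set_ylim(0, 30)
ax_wide.set_title("$0-$30 scale")

# Right panel: narrow $19-$27 axis - the same series looks volatile.
ax_narrow.plot(prices)
ax_narrow.set_ylim(19, 27)
ax_narrow.set_title("$19-$27 scale")

plt.tight_layout()
plt.show()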

Data Storytelling

One of the common usages of descriptive analyses is the development of multivariable profiles, as exemplified by customer profiles (sometimes referred to as customer baselines).


Those analytically rudimentary but informationally highly useful applications are essentially composites of otherwise standalone attributes that, when pulled together, 'tell a story'; that story could be in the form of prototypes of different types of buyers, often expressed in terms of basic demographics (e.g., average age, gender splits, etc.), psychographics (e.g., lifestyles and hobbies), and behaviors (e.g., average purchase amount, average repurchase frequency, etc.). That manner of data utilization, nowadays referred to as storytelling with data, places heavy emphasis on data visualization (more on that later), which at times can lead to glazing over some equally – perhaps even more – important considerations of validity and reliability of data analytic outcomes.

Broadly characterized, validity is the degree of truthfulness of, in the case of data analytics, outcomes and/or conclusions.3 The idea of validity is used widely in applied as well as in theoretical research, thus there are numerous manifestations of that general notion; of those, face validity and content validity are of particular importance to analytic storytelling.4 The former is a reflection of a largely subjective assessment of whether or not a particular idea captures what it is supposed to capture, whereas the latter encapsulates the logical consistency of a particular notion – together, those two distinct validity dimensions frame the idea of believability of data analytic takeaways. The second key evaluative facet of analytic storytelling is the notion of reliability, which captures the repeatability or consistency of a particular insight or observation. A data analytic outcome can be considered reliable if conclusions drawn from it are dependable, meaning decisions based on those conclusions can be expected to yield predictable outcomes. Conceptually, reliability can be thought of as a ratio of informative content to total content, with the difference between total and informative content representing non-informative noise; in applied data analytic settings, reliability of data analytic outcomes is often evaluated with the help of pilot learning, or smaller scale 'trial runs' of data analytics-suggested decisions.5

Implied in the above brief summarization of validity and reliability is a substantial degree of evaluative judgment, which is particularly pronounced in the context of validity, as neither face nor content validity lend themselves to structured (i.e., objective) assessment. While undeniably important, validity assessment is just as undeniably individualized and situational, and as such it is impacted by more than simply understanding the technical language of statistical analyses.

In the more abstract sense, validity can be conceptualized as the degree of agreement between the manner in which a particular idea is conceptualized and the manner in which it is operationalized.
Other manifestations of validity include discriminant (the degree to which two notionally distinct constructs are independent), convergent (the degree to which two measures of constructs that theoretically should be related are, in fact, related), concurrent (the stability of the observable indicator–latent construct association), and predictive (the degree to which an observable indicator correctly signals change in the latent construct of interest) validity; as suggested by the brief definitions, those additional validity manifestations tend to be used in theoretical research.
 Within the confines of theoretical analyses, reliability of data analytic outcomes is commonly assessed using methods such as split-half or test-retest.


Individual-level and situational factors can exert an unexpectedly profound impact on the efficacy of informational takeaways in ways that may not be overtly discernable. Influences such as cognitive bias, or sensemaking conclusions that deviate from rational judgment, constrained processing capabilities known as human channel capacity, and brain self-rewiring, or neuroplasticity (also known as brain plasticity), all influence, even shape, data storytelling. Those involuntary and subconscious mechanisms have biological roots and thus cannot be 'turned off', but understanding the nature of their influence and being aware of their impact can materially diminish the potentially undesirable effect they exert on human sensemaking.

Cognitive bias impacts the manner in which stored information is used. Reasoning distortions such as the availability heuristic (a tendency to overestimate the importance of available information) or confirmation bias (favoring of information that confirms one's pre-existing beliefs) attest to the many ways subconscious information processing mechanics can warp how overtly objective information shapes individual-level sensemaking. To make matters worse, unlike machines that 'remember' all information stored in them equally well at all times, the brain's persistent self-rewiring renders older, not sufficiently reinforced (i.e., not recalled or activated, in the sense of being 'remembered') memories progressively 'fuzzier' and more difficult to retrieve. As a result, human recall tends to be incomplete and selective. Moreover, the amount of information the human brain can cognitively process in attention at any given time is limited due to a phenomenon known as human channel capacity. Research suggests that, on average, a person can actively consider approximately 7 ± 2 discrete pieces of information. In other words, a consumer trying to decide between brands A and B will base their choice on a comparison of somewhere between 5 and 9 distinct brand attributes. When coupled with the ongoing reshaping of previous learnings (neuroplasticity) and the possibly distorted nature of perception (cognitive bias), channel capacity constraints further underscore the limits of human information processing capabilities. They also underscore the value of relying on objective data when making complex determinations.

Compounding the problem, making sense of data analytic outcomes can also be impacted by situational factors such as group dynamics. Contradicting longstanding conventional wisdom, which suggests that groups make better decisions than individuals, recent research in the areas of social cognition and social psychology instead suggests that the efficacy of group-based decisions cannot be assumed to outperform the efficacy of choices made by individuals. In fact, it is the combination of cognitive (individual), social (group), and situational (expressly neither individual nor group) factors that jointly determine the efficacy of decisions. The widely embraced higher levels of confidence attributed to group decisions may at times even be misguided because of a phenomenon known as groupthink, a dysfunctional pattern of thought and interaction characterized by closed-mindedness, uniformity expectations, and biased information search.


The idea of groupthink explains the strong preference for information that supports the group's view; it is the reason potentially brilliant breakthrough ideas die in committees in business settings, and it is the reason the governing bodies of oppressive regimes are so astonishingly like-minded. An altogether different aspect of group dynamics is group conflict, yet another situational influencer of human sensemaking. As suggested by social exchange theory, which views the stability of group interactions through a theoretical lens of negotiated exchange between parties, individual group members are ultimately driven by the desire to maximize their benefits, thus conflict tends to arise when group dynamics take on a more competitive than collaborative character. Keeping in mind that the realization of group decision-making potential requires full contributory participation on the part of all individual group members, within-group competition reduces the willingness of individuals to contribute their best to the group effort. Not only can that activate individuals' fears of being exploited, as well as heighten their desire to exploit others, it can compel individuals to become more focused on standing out in comparison with others. That in turn can activate tendencies to evaluate one's own information more favorably than that of others, and also to evaluate more positively any information that is consistent with one's initial preferences.

The preceding brief overview of human sensemaking obstacles is not meant to suggest that human reason cannot be relied on – clearly, that is not the case, as attested to by, among other things, human scientific discoveries. It is meant to suggest that in certain contexts, the inner workings of the human brain can undermine rational judgment, and being aware of those pitfalls is an important part of storytelling.

Visual Data Storytelling

'A picture is worth a thousand words' is a popular adage that succinctly explains the popularity of graphically captured trends and associations. Visual representations of data offer particularly attractive means of conveying results of descriptive analyses, where simple bar, pie, or line graphs,6 or multi-chart composites known as dashboards, can offer easy to grasp mechanisms of conveying meaning. The use of graphics to communicate data analytic outcomes is now popularly referred to as visual data storytelling; that particular approach to conveying patterns and relationships embedded in data is effective because it takes advantage of the way the human brain processes information: to effectively use its computing resources, the brain relies on categorization of simple general attributes, rather than the detailed analyses required by numeric and similar information.7 But, because visual representations of meaning are, in effect, auto-processed, to serve their intended purpose data visualizations require careful and deliberate planning, the importance of which is illustrated in Figure 8.2.

Graph is a chart that plots data along two dimensions – e.g., a line graph charts magnitudes (dimension 1) across time (dimension 2); while the terms 'graph' and 'chart' tend to be used interchangeably, technically, a graph is a particular type of chart.
While not yet fully understood, researchers now believe that visual systems are able to automatically categorize pictorial representations without requiring the attention span that is required to process numeric representations; one current hypothesis suggests that the visual system is able to 'compress' the resolution of visual stimuli, which obviates the need for detailed analyses by instead categorizing information based on simple general attributes.

Figure 8.2: Effective Expository Visualization. (Schematic: a bar graph vs. a pie graph summarizing the same set of data.)

The bar (left) and pie (right) graphs summarize the same set of data, yet absent any explanatory value labeling, the bar chart more unambiguously captures and communicates the intended meaning, which is to highlight the magnitudinal differences among the four groups. Given that, the left-hand side graph is more effective because it communicates the intended meaning more clearly on an unaided (e.g., without the use of any value labels) basis. While that might seem intuitively obvious, the failure to recognize that important point is one of the more commonly encountered data visualization design deficiencies, which manifest themselves in relying more on one's own stylistic preferences than on the aforementioned good design principles (e.g., some may find pie graphs more visually attractive than bar graphs) or in adding informationally superfluous elements, such as the depth dimension.

The choice of what chart to use is probably the most top-of-mind consideration in data visualization. At the core of chart type selection logic is the idea of visual grammar, a summary construct that ties together the core design elements that shape how the intended meaning is communicated. The notion of visual grammar expressly differentiates between expository and declarative communication intents, as well as between factual and conceptual content. Effective expository visualizations tend to emphasize completeness and objectivity, while effective declarative visualizations usually place a premium on sound logic and clarity. Well thought out factual visualizations, on the other hand, place a premium on the appropriateness and the ease of interpretation of selected charts, in contrast to conceptual presentations, which aim to capture the essence of abstract ideas.

Data analytic result visualizations can range considerably in terms of complexity, where complexity is a manifestation of the number of distinct data elements encompassed in a visualization. In general, the larger the number of data elements included in a single chart, the more complex it is, and thus the more difficult it is to comprehend the content, as graphically illustrated by Figure 8.3.

Figure 8.3: The Impact of the Number of Data Elements on Complexity. (Left panel: Four Data Elements; right panel: Two Data Elements.)

In general, the larger the number of data elements included in a single chart, the more complex it is, and thus the more difficult its content is to comprehend, as graphically illustrated by Figure 8.3. The four-element, left-hand-side image might be chosen because it is believed to parsimoniously capture and tie together multiple elements of meaning, which might be clear to the analyst putting the graph together (having been immersed in the analysis depicted in the graph), but the intended audience might find it taxing and confusing to make sense of the interplay of four distinct elements. The two-element, right-hand-side image is significantly easier to interpret, precisely because the smaller number of elements unambiguously communicates the intended meaning. In this context, less is more. The idea of visual storytelling, however, stretches beyond single charts – in fact, to many it is now nearly synonymous with dashboards, thematically related collections of at-a-glance snapshots of key outcomes relevant to a particular goal or process. Popularized by easy-to-use software applications (e.g., the earlier discussed Tableau and MS Power BI), dashboards combine stylistically varied and aesthetically pleasing visualizations with the goal of capturing all pertinent information in a single view (i.e., a page, computer screen, etc.). The composite character of dashboards, however, adds yet another dimension to complexity considerations, as graphically illustrated by Figure 8.4 (a picture is worth a thousand words, after all). While the sample dashboard depicted in Figure 8.4 might be seen as somewhat extreme in terms of the sheer number of distinct visualizations and the diversity of stylistic expressions, it nonetheless draws attention to the importance of carefully considering the complexity of multi-element visualizations. Just because something is possible does not mean it is desirable, or even preferable. It is hard to deny that the sample dashboard shown in Figure 8.4 is visually impressive, but it is just as hard to deny that it is also informationally overwhelming. To most, digesting that amount of informational content would require a considerable amount of effort, which runs counter to the very idea of quick and effortless visual representation of outcomes. At the same time, it offers a very concise and creative reference source, where numerous related outcomes are summarized in an easy-to-understand, single-view manner. The clarity-of-message vs. scope-of-coverage distinction implied here underscores the importance of intentionality of design.


Figure 8.4: Sample Dashboard.

Multi-element dashboards should not be seen as one-size-fits-all in terms of their design – some may be intended to offer easy-to-grasp summarizations of outcomes of interest, while others may be intended to serve as a comprehensive and centralized reference source; neither is preferable per se, but the intended use should be among the core design considerations. (A quick note: Not expressly included in this brief overview are infographics, which are notionally similar to dashboards but are usually manually drawn, with greater emphasis on aesthetics and comparatively lower emphasis on data; as such, those particular visualizations fall outside the scope of this overview, focused as it is on basic data analytic competencies.) Turning back to the more rudimentary design considerations, at its core, data visualization can be seen as an application of the graphical arts to data. And while the array of data analytic outcome charting choices might be overwhelming, an organization known as Visual-Literacy.org developed what it calls the Periodic Table of Visualization Methods,8 which, in a manner similar to the familiar periodic table of (chemical) elements, delineates and organizes many common and some not so common forms of visual communication of information. In total, 100 distinct visualization types are grouped into six categories: data, information, concepts, strategy, metaphors, and compound visualizations. Focusing on what could be characterized as the garden variety of organizational data visualization charts, some of the more commonly used graphs include line and bar charts, histograms, scatter plots, and pie charts, including their variation known as donut charts. Line charts, which show relationships between variables, are most commonly used to depict trends over time, as exemplified by cross-time sales trends; as of late, it has become customary to use line charts to capture multiple trends by stacking multiple trend lines.

8 https://www.visual-literacy.org/periodic_table/periodic_table.html


Bar charts are most frequently used to summarize and compare quantities across different categories or sets, as in comparing total sales of different products. Visually similar histograms, which are often confused with bar charts, are best used for visually summarizing distributions of individual variables; the key difference between the two is that bar charts are meant to be used with unordered categories (e.g., when comparing sales of Brand X to sales of Brand Y, only the relative bar height – which in the context of the standard two-dimensional Cartesian or x-y coordinates captures the magnitude of the two categories – matters; the ordering of bars along the horizontal or x-axis is informationally inconsequential), whereas histograms are used with ordered categories, where both the height and the order of individual bars combine to communicate the intended information. Scatter plots, also known as X-Y plots, capture the somewhat more abstract joint variation of two variables; under most circumstances, scatter plots are used to convey a visual summary of either the spread of individual data features or the relationship, as in correlation, between features. And lastly, pie charts and their derivative form known as donut charts are typically used to show the relative proportions of several groups or categories, as exemplified by the distribution of total company revenue across product categories. It should be noted, however, that in spite of their rather common usage, the value of pie charts as a communication tool can be questionable, especially when used with more numerous categories, because it can be difficult to visually surmise the magnitudes captured by the angled 'slices' comprising those charts.
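As an illustration of matching chart type to purpose, the following minimal sketch – again a hedged example written in Python with matplotlib and numpy, using entirely made-up data – renders the four 'garden variety' chart types just described: a line chart for a cross-time trend, a bar chart for unordered category comparisons, a histogram for the distribution of a single variable, and a scatter plot for the joint variation of two variables.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)                    # reproducible, made-up data
    months = np.arange(1, 13)
    sales = 100 + 5 * months + rng.normal(0, 8, 12)   # cross-time sales trend
    brands = ["Brand X", "Brand Y", "Brand Z"]
    brand_sales = [220, 180, 140]                     # unordered categories
    ages = rng.normal(40, 12, 500)                    # a single continuous variable
    spend = 2 * ages + rng.normal(0, 20, 500)         # a second, related variable

    fig, axes = plt.subplots(2, 2, figsize=(9, 6))
    axes[0, 0].plot(months, sales)          # line chart: trend over time
    axes[0, 0].set_title("Line: trend over time")
    axes[0, 1].bar(brands, brand_sales)     # bar chart: category comparison
    axes[0, 1].set_title("Bar: category comparison")
    axes[1, 0].hist(ages, bins=20)          # histogram: distribution of one variable
    axes[1, 0].set_title("Histogram: distribution")
    axes[1, 1].scatter(ages, spend, s=8)    # scatter plot: joint variation
    axes[1, 1].set_title("Scatter: joint variation")
    plt.tight_layout()
    plt.show()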

Inferential Skills: Probabilistic Data Analytic Outcomes

In the broad context of applied analytics, it is usually safe to assume some degree of imprecision, primarily because of data imperfections (missing, miscoded, or incomplete data, etc.), and because the bulk of applied data analyses are based on subsets of all available and applicable data, which gives rise to random error. Confirmatory analyses dive deeper into the realm of estimation murkiness by adding probabilistic inference, which expresses an outcome of interest in terms of defined probability distributions, as exemplified by the well-known standard normal distribution.

9 In a broader sense, the domain of confirmatory analyses also encompasses prescriptive and predictive analytics, the goals of which are to delineate and explain the strongest predictors of outcomes of interest (prescriptive analytics) and to estimate the likelihood or the magnitude of future outcomes of interest (predictive analytics), typically making use of multivariate statistical models, such as linear or logistic regression, or supervised machine learning algorithms. As could be expected, those are more advanced data analytic applications and thus fall outside the scope of the basic-competencies-focused data analytic literacy knowledge and skills discussed in this book.


When considered within the narrow realm of just basic data analytic competencies,9 confirmatory analyses aim to assess the validity of prior beliefs using a combination of structured data analytic approaches and tools of statistical inference, most notably the logic of the scientific method, to generate probabilistic estimates of states or outcomes of interest. Implied in this brief characterization is a distinct set of data analytic skills. Recalling that a skill represents the ability to do something well, it is important to ask: what, exactly, constitutes a skill in the context of validating knowledge claims? To start with, within the confines of the earlier discussed, factually minded descriptive analyses, sensemaking skills were framed as general abilities to capture and communicate informative patterns and relationships. Given that the goal of confirmatory analyses is to probabilistically assess the validity of existing beliefs, it follows that the confirmatory analog to exploratory sensemaking ought to be construed as the ability to validly and reliably capture and convey the probabilistic meaning of data analytic outcomes. Central to that framing of inferential skills is the notion of probabilistic thinking.

Probabilistic Thinking

Few mathematical concepts have had as profound an impact on decision-making as the idea of probability. Interestingly, the development of a comprehensive theory of probability had relatively inglorious origins in the form of a gambling dispute, which compelled two accomplished French mathematicians, Blaise Pascal and Pierre de Fermat, to engage in an intellectual discourse in 1654 that ultimately laid the foundations of the theory of probability. The initial mathematical framing of probability was limited to the analysis of games of chance until, some 150 years later, another renowned French mathematician, Pierre-Simon Laplace, began to apply the idea of probabilistic estimation to diverse scientific and practical problems, ultimately leading to the formalization of the theory of errors, actuarial mathematics, and basic statistics. More than a century later, in 1933, a noted Russian mathematician, Andrey Kolmogorov, outlined an axiomatic approach that forms the basis of the modern theory of probability. Today it is hard to think of a domain of modern science that does not make extensive use of the general ideas and computational mechanics of probability, and more and more aspects of everyday life are also being framed in probabilistic terms. It might even be reasonable to go as far as saying that nowadays, rational decision-making and probabilistic thinking are inseparable. And yet, probabilistic reasoning can also be highly subjective. The reason for that is encapsulated in the very essence of subjectivity. Many individual and group decisions are 'personal' in the sense of being based on or influenced by personal feelings, tastes, or opinions; consequently, some alternatives or outcomes might be preferred to others (due to influences such as cognitive bias). However, it is important not to equate 'subjective' with 'flawed' – some subjective feelings or beliefs can be shown to be justified and thus can be considered sound from the perspective of choice-making; others are biased, prejudiced, or otherwise factually unfounded.


In view of that, it is neither appropriate nor necessary (or realistic, for that matter) to set aside subjective beliefs – it may suffice to simply use objectively derived probabilistic 'validity assessments' to differentiate between empirically sound and unsound prior beliefs. Doing so is the essence of probabilistic thinking. Delving a bit deeper into the idea of probabilistic thinking, that essential inferential sensemaking skill is intertwined with the similar but distinct notion of mental models, a notion that is somewhat difficult to precisely frame and, without doubt, one of the more ethereal aspects of thinking. Broadly characterized as subjective prior conceptions, mental models can have a strong, even profound impact on subjective evaluations of probabilistic data analytic outcomes, and thus warrant closer examination.

Mental Models

Broadly characterized as abstract representations of the real world that aid key aspects of cognitive functioning, particularly reasoning about problems or situations not directly encountered, mental models embody individualized interpretations of reality. As is the case with all individual perception- and experience-based generalizations, mental models are subject to flawed inferences and bias, and, as implied by their name, are not objectively discernible or reviewable. That said, while mental models as such are hard to ascertain, their impact on sensemaking can be estimated using a combination of Bayesian and frequentist probabilistic inference (this is one instance where the two otherwise competing estimation approaches can work collaboratively; the key differences between the two approaches are outlined below). Starting from the premise that the former, which aims to interpret objective evidence through the lens of subjective prior beliefs, can be used as an expression of mental model-influenced inference, whereas the latter, which sees probability as an algorithmic extrapolation of the frequency of past events, can serve as an objective baseline, a generalizable subjectivity estimate can be derived. Given that, it is possible to estimate the extent of subjectivity of individual mental model representations by relating individual perspectives (Bayesian probability) to a baseline represented by a group-wide perspective composite (frequentist probability). The resultant pooled individual vs. baseline perspective variability can be interpreted as an aggregate measure of perspective divergence, summarized here as the cognitive diversity quotient (CDQ). Estimation-wise, leveraging the widely accepted computational logic of statistical variance, CDQ can be estimated as follows:

    CDQ = \sqrt{\frac{\sum_{i=1}^{n} (\text{Perspective}_i - \text{Baseline})^2}{\text{Number of Perspectives}}}

Outcome-wise, the cognitive diversity quotient is a standardized coefficient, which means its magnitude can be compared across different group opinion situations. Therefore, in situations that lend themselves to such analyses, the use of CDQ can infuse elements of analytic logic into the often ill-structured mechanics of organizational consensus building.
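A minimal computational sketch of the CDQ formula as stated above, assuming a handful of hypothetical individual (Bayesian) probability judgments and a hypothetical frequentist baseline; the values and variable names are illustrative only:

    from math import sqrt

    # Hypothetical individual probability judgments (Bayesian perspectives)
    perspectives = [0.30, 0.45, 0.55, 0.40, 0.70]

    # Hypothetical frequentist baseline, e.g., the historical relative frequency
    baseline = 0.42

    # CDQ: root of the mean squared deviation of individual perspectives
    # from the group-wide baseline, per the formula above
    cdq = sqrt(sum((p - baseline) ** 2 for p in perspectives) / len(perspectives))
    print(f"CDQ = {cdq:.3f}")

Larger CDQ values indicate greater divergence of individual perspectives from the group-wide baseline; values near zero indicate near-consensus.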


Moreover, CDQ can also provide important insights when used longitudinally, meaning it can be used to track and evaluate changes in situation- and decision-point-specific consensus across time. Also worth mentioning is that the use of such an objective assessment method with inherently individual, perspective-ridden evaluations of problems or situations can help with addressing the seemingly inescapable polarity of organizational diversity. More specifically, it is common to come across cherished organizational attributes that at the same time can also act as obstacles to efficient organizational decision-making; colloquially stated, even good things may need to be managed to continue to produce good outcomes. And lastly, being able to address perspective differences in what could be seen as a more democratic manner (in computing the CDQ score, all opinions are treated the same) may diminish the threat of groupthink, and by doing so it may foster greater cognitive diversity. The above outlined reasoning and approach might seem esoteric, even overengineered, but given the potentially profound impact of mental models on how probabilistic outcomes are interpreted, it is important to at least carefully consider the spirit of those ideas. As noted by L. Atwater, a political strategist, 'perception is reality', after all.

Probabilistic Inference

As noted earlier, it is important not to equate 'subjective' with 'flawed', while at the same time being open to the possibility of some prior beliefs being empirically unfounded, and thus flawed. In keeping with that general idea, the centerpiece of probabilistic thinking is the somewhat redundant-sounding probabilistic inference, which encapsulates the mechanics of estimating the likelihood of a particular belief being true. More formally described as the task of estimating the probability of one or more states or outcomes of interest, probabilistic inferences can be drawn using one of two competing approaches: Bayesian and frequentist. The former is named after Thomas Bayes, an English clergyman and statistician who derived what is now known as Bayes' Theorem, which relates the conditional and marginal probabilities of two random events. Broadly speaking, Bayesian probability treats likelihood as a measure of the state of knowledge known and available to the decision-maker at decision time; as such, a Bayesian probability is a function of largely subjective prior beliefs (of the decision-maker and/or decision-influencers) and past outcomes, or objective data. While it can be used in a wide array of situations, Bayesian probability estimation logic is particularly well suited to problems characterized by sparse or otherwise undependable or poorly projectable historical outcome data, as exemplified by acts of terror. Offering a conceptually and computationally distinct way of approaching probabilistic inference is the frequentist approach to probability estimation, which, as implied by its name, is strictly focused on analyses of objective data.


In fact, the very essence of the frequentist approach is to set aside any prior beliefs and use nothing but 'hard' data to estimate the probability of the outcome of interest, which is expressed as the relative frequency of occurrence (hence the name 'frequentist') of that outcome in the available data. Under the frequentist view, if 2% of publicly traded companies end up becoming entangled in securities class action litigation on an annual basis, on average, a given public company faces a 2% chance of incurring this type of litigation. It follows that the frequentist approach is particularly well suited to problems characterized by abundant historical outcome data coupled with relatively stable longitudinal trends, as exemplified by automotive accidents.10 Looking beyond approach-related differences in estimation logic and thinking in terms of the everyday language of applied analytics, probability estimation is simply an attempt to predict unknown outcomes based on known parameters. For instance, weather forecasters strive to predict future conditions such as air temperature, the amount of precipitation, or the amount of sunshine by relying on historical trends and weather element interdependencies. Using statistical likelihood estimation techniques, forecasters can estimate unknown parameters based on known (i.e., historical) outcomes. That said, the users of the resultant estimates are rarely cognizant of which of the above two broad approaches was utilized, even though the estimates can differ substantially depending on the choice of approach. That is because while the frequentist approach leverages only objective outcome data, the Bayesian method combines objective data with subjective judgment, and combining those two elements can significantly alter forward-looking projections. As suggested earlier, when historical data are robust and cross-time trends are relatively stable, the frequentist approach is likely to produce more dependable probability estimates; when either of those two conditions is not met, or when the phenomenon in question varies significantly across distinct occurrences (as tends to be the case with terrorist attacks), Bayesian estimation is the preferred approach.

10 In the U.S., although the number of drivers and the number of vehicles on the road both grew steadily over the past couple of decades, the overall (i.e., nationwide) automotive accident rate remained remarkably flat.
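To make the frequentist vs. Bayesian contrast concrete, the following minimal sketch – a simplified illustration using hypothetical counts and a Beta prior (a standard choice for rate estimation, though not one prescribed in this chapter) – compares a frequentist rate estimate, i.e., the relative frequency of occurrence, with a Bayesian posterior estimate that blends the same data with a subjective prior belief.

    # Hypothetical data: 3 occurrences of the outcome of interest in 50 trials
    occurrences, trials = 3, 50

    # Frequentist estimate: relative frequency of occurrence in the data
    frequentist_estimate = occurrences / trials

    # Bayesian estimate: a Beta(a, b) prior encoding the subjective belief that
    # the rate is around 20%, updated with the observed data (Beta-Binomial model)
    prior_a, prior_b = 2, 8                          # prior mean = 2 / (2 + 8) = 0.20
    posterior_a = prior_a + occurrences
    posterior_b = prior_b + (trials - occurrences)
    bayesian_estimate = posterior_a / (posterior_a + posterior_b)

    print(f"Frequentist estimate: {frequentist_estimate:.3f}")   # 0.060
    print(f"Bayesian estimate:    {bayesian_estimate:.3f}")      # 0.083

As the number of trials grows, the two estimates converge, mirroring the point made above: with abundant and stable historical data the frequentist approach suffices, whereas with sparse data the subjective prior carries substantial weight.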


Albeit more esoteric, frequency distributions are nonetheless important to appreciating the information value of probability; they depict the spread of possible values of what is technically referred to as a random variable, a mathematical characterization of a quantity or event whose values are determined by chance, as exemplified by natural events such as hurricanes or man-made ones such as fraud. Although easy to construct (especially given that the earlier discussed data analytic software tools include built-in, standard routines for doing so), their underlying mathematical descriptions can be comparatively technical, possibly preventing less technically inclined audiences from using those tools. Still, those ultimately conceptually straightforward instruments offer succinct summaries of the likelihood of expected future outcomes of interest; thus, developing at least a rudimentary understanding of the different types of frequency distributions – most notably, the 'classic' standard normal, chi-square, binomial, and Poisson – plays an important role in developing robust probabilistic inference skills.

Interpreting Probabilistic Outcomes

Building on the discussion of the notion of mental models as an expression of subjective beliefs, and on the foundational overview of statistical methods of probabilistic inference, this section tackles the central theme of probabilistic thinking: the interpretation of probabilistic statistical estimates. When considered within the relatively narrow confines of basic data analytic competencies, assessing the efficacy of prior beliefs is comparatively methodologically involved, but at the same time interpretationally straightforward. It is methodologically involved because it demands a sound understanding of the general logic and distinct computational elements of statistical inference, and it is interpretationally straightforward because of the nearly formulaic sensemaking mechanics. Still, deriving sound conclusions from technical data analytic outcomes demands interpreting those results in line with the limitations of the underlying data, the constraints stemming from assumptions imposed by the methods used to analyze data, and the general principles that guide the nuanced process of transforming statistical outcomes into informative insights. Moreover, even though all probabilistic analyses can be assumed to be rooted in the reasoning and applications of the scientific method, it would be a mistake to assume that all probabilistic analyses should be evaluated from the same philosophical perspective. At issue here is the difference between analyses geared toward the identification of broad generalizations, often grouped under the umbrella of theoretical research, and decision support analytics, typically geared toward the identification of competitively advantageous insights. Though the former is most readily associated with theoretical academic research, and the latter with applied, organizational management-supporting research, that should not be taken to mean that only academic entities have reasons to engage in theoretical research. There are numerous political, social, and commercial organizations with widespread operations, and some of those organizations at some point in time may have informational needs that fall under the general category of theoretical research. In view of that, it is worthwhile to understand the difference between interpreting results geared toward broad generalizations and interpreting confirmatory data analytic outcomes geared toward the identification of competitively advantageous insights. The gist of the difference between the two result interpretation mindsets is in what can be considered the burden of proof.

11 Using the earlier (Chapter 5) discussed t-test, F-test, or χ2 tests, as appropriate.


The generalizability conclusion is rooted in (typically sample-derived) outcomes being deemed statistically significant,11 but that level of proof may not be necessary for results geared toward the identification of competitively advantageous insights (in fact, it could be argued that to be competitively advantageous, insights should be unique rather than universal). All considered, while it is a matter of commonly accepted practice to use the notion (and the specific tests) of statistical significance as a general litmus test of the informational efficacy of data analytic outcomes, it is not always appropriate to do so. The very essence of applied data analytic outcomes suggests that those unique, advantage-minded insights should be looked at through a different lens than analyses geared toward yielding universally true generalizations. But the speculative nature of confirmatory analyses gives rise to an inescapable need for an objective result validation mechanism, one that can separate the proverbial wheat (informative insights) from the chaff (non-informative noise), which raises the question: if not statistical significance, then what? That is a loaded question, and it is explored in the next section.

Assessing the Efficacy of Probabilistic Results

The general logic of statistical inference, as well as the specific computational tests of statistical significance used to operationalize that logic, were discussed in the context of exploratory analyses in Chapter 5. It should be noted that the same logic and the same tests apply to confirmatory analyses – the contexts in which those tests are used are somewhat different, as are some aspects of their interpretation, but other than that, there are no differences in how the efficacy of exploratory and confirmatory analyses is assessed.

Statistical Significance

The idea behind statistical significance is to assess the possibility that an observed difference, such as Response Rate A vs. Response Rate B, or an association, such as a correlation between Age and Response Propensity, is a manifestation of a true – meaning generalizable – difference or association. Given that broad purpose, statistical significance offers a great deal of utility (subject to the earlier discussed limitations) when searching for universal generalizations, but not so when searching for unique insights. Rather predictably, those interested in the latter are oftentimes trying to 'bend' tests of significance (most notably the t-test, F-test, and χ2 test) to fit purposes which fall outside the applicability limits of those techniques. One of the more commonly seen misapplications of statistical significance testing is to attribute a particular level of significance, e.g., 0.05 (which can also be stated as 95%), to exact magnitudes, technically known as point estimates. For example: Is the difference between Response Rate A at 2.1% and Response Rate B at 2.6% statistically significant? While seemingly unambiguous, this question is in fact nuanced because it confounds statistical and practical considerations.


When considered from the statistical perspective, it is asking if the observed 0.5% difference (2.6% – 2.1%) can be interpreted as 'real' (rather than spurious), but when considered from the applied decision-making perspective it is asking if, when repeated, Promotion B can be expected to generate about a 0.5% higher response rate than Promotion A. What often happens in such situations is that technical analysts answer the statistical question, but the ultimate users of that information interpret the results in the context of the business question that is more meaningful to them. And so the statistical conclusion that the difference between the 2.1% Promotion A and the 2.6% Promotion B response rates is 'statistically significant' at, let's say, the 0.05 level ends up being interpreted as saying that, going forward, Promotion B can be expected to deliver about a 0.5% response lift over Promotion A. There are two distinct problems highlighted by the above scenario: an invalid attribution of statistical significance to a point estimate, and an unwarranted implied ascertainment of the longitudinal stability of observed differences. The finding of statistical significance of the difference between the two response rates attests only to there being a difference – namely, factoring in the relative imprecision of the two response rate magnitudes (primarily due to sampling error-related effects), there is no more than a 5% (given the 0.05, or 95%, level of significance) probability that the observed response rate differential was brought about by random chance. In other words, the finding does not attest to the difference in response rates being equal to 0.5% (2.6% – 2.1%); it merely attests to there being a statistically material difference between the two response rates (with the implication that the 'true' and typically unknown magnitude of the difference could be higher or lower than the observed 0.5%). The second problem illustrated by the above scenario is more tacit though equally consequential: an unwarranted conclusion of longitudinal stability attributed to statistical significance. In business, as well as other organizational settings, the prevailing reason organizations undertake analyses of data is to support future decisions. In view of that, users of data analytic insights are mainly interested in ascertaining the stability, or now-to-future projectability, of those insights – in the context of the earlier example of the difference in response rates to promotions A and B, the users of that information want to make sure that, going forward, Promotion B can be expected to deliver better results. Unfortunately for those users, tests of statistical significance cannot accommodate those needs because those tests are not forecasting mechanisms. The computational mechanics of significance tests are built around implicitly point-in-time analyses of variability, which give rise to assessments of the probability of estimated magnitudes, such as the 0.5% differential between response rates A and B, falling outside the range delimited by the chosen level of significance. It follows that the evaluative logic of statistical significance supports sample-to-population generalizability assessments, but not now-to-future projectability. All in all, assessments of statistical significance should be interpreted as the degree of confidence in the validity of estimates at a given point in time; since there is no longitudinal element to significance testing, the notion of statistical significance cannot be used to attest to the temporal stability of estimates.
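The distinction drawn above can be illustrated with a short sketch. The example below – a hedged illustration using a conventional two-proportion z-test and a 95% confidence interval, with hypothetical sample sizes that are not given in the text – shows that a 'significant' result speaks only to the existence of a difference, while the confidence interval makes clear that the 'true' difference could sit well above or below the observed 0.5%; it also shows how the same rates, observed on much larger samples, produce a far smaller p-value without the difference itself becoming any more meaningful.

    from math import sqrt, erf

    def normal_cdf(x):
        return 0.5 * (1 + erf(x / sqrt(2)))

    def two_proportion_test(r1, n1, r2, n2):
        # Two-proportion z-test plus a 95% CI for the difference (p2 - p1)
        p1, p2 = r1 / n1, r2 / n2
        pooled = (r1 + r2) / (n1 + n2)
        z = (p2 - p1) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        p_value = 2 * (1 - normal_cdf(abs(z)))
        margin = 1.96 * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return p_value, (p2 - p1 - margin, p2 - p1 + margin)

    # Hypothetical samples: 2.1% vs. 2.6% response rates, 10,000 contacts per group
    p_val, ci = two_proportion_test(210, 10_000, 260, 10_000)
    print(f"n=10,000 per group:  p-value = {p_val:.3f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")

    # Same observed rates, but 100,000 contacts per group
    p_val, ci = two_proportion_test(2_100, 100_000, 2_600, 100_000)
    print(f"n=100,000 per group: p-value = {p_val:.6f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")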


Looking beyond invalid attributions of statistical significance to a point estimate and unwarranted ascertainments of the longitudinal stability of observed differences, the usage of statistical significance should also be carefully considered in the context of theoretical research aimed at universal generalizations. While rarely expressly stated, broadly defined theory development and validation is a process built around distributed analytic efforts. The general idea here is that independent researchers interested in the same phenomenon independently design and execute similarly themed but separate empirical research studies; if and when a consensus conclusion is reached, universally true knowledge claims emerge. Within that broad context, theoretical research utilizes tests of statistical significance as a means of pooling independently conceived and executed investigations of a particular set of knowledge claims; more specifically, tests of statistical significance serve as a common validation mechanism, one that assures that each independent conclusion is validated using the same criteria. In other words, a single finding of statistical significance is merely one link in a chain; under most circumstances, to be accepted as 'generally true', a finding needs to be sufficiently cross-validated (as a single research study may produce false positive results or be methodologically or otherwise flawed). Doing so is rarely practical in applied organizational settings, which further weakens the case for relying on statistical significance in those settings. The well-known dependence of statistical significance tests on the number of records used in the underlying test statistic calculations, discussed at length in Chapter 5, is yet another shortcoming that becomes even more pronounced in the single-validation contexts outlined above. In a manner that is counterintuitive to most, as the number of records increases, significance tests become progressively less discerning: even magnitudinally trivial differences and associations attain statistical significance, which erodes the tests' ability to differentiate between spurious and material effects. That problem is particularly visible in applied data analyses, where the use of large datasets is routine. It is important to note that for the purposes of significance testing, 'large' means several thousand records or more, which is not particularly large in the modern era, in which datasets of millions of records and larger are common. Moreover, significance testing is a pass-fail process,12 which means that magnitudinally trivial effects can be deemed as 'significant' as much larger effects. The picture that emerges here is one of limited applicability of statistical significance testing to applied analytics.

12 An effect of interest either is or is not statistically significant at a given level of significance, such as the commonly used 0.05 / 95% threshold; in a statistical sense, effects that are significant at the same level are considered to be equally important. The rather commonly used applied logic of differentiating between 'practically material' and 'practically immaterial' statistically significant results runs counter to the very idea of objective significance testing, as it infuses biased subjective judgment into the manifestly objective significance testing process.


That is not to say that statistical significance cannot or should not be used in those contexts – however, outside of carefully designed analytic settings – i.e., testing the generalizability of sample-derived estimates using carefully selected and mindfully sized datasets – the application logic and computational mechanics of statistical significance tests do not align with the informational demands of applied data analytics. In short, that assessment mechanism lacks the ability to consistently and objectively differentiate between informative insights and non-informative noise (in the form of spurious associations). But, if not statistical significance, then what? As the limitations of statistical significance become more widely recognized and acknowledged, reliance on effect size as the measure of importance is gaining wider acceptance.

Effect Size

Within the realm of statistical analyses, an effect is an expression of bivariate association; in the most direct sense, an effect size is simply a quantitative assessment of the magnitude of a statistical estimate, expressed in relation to the estimate's degree of imprecision. When looked at as a means of assessing the informational efficacy of data analytic outcomes, effect size offers an alternative to tests of statistical significance, though it is common to see both approaches used together, in conjunction with one another. Doing so, however, is not recommended because it is quite possible for the two approaches to yield contradictory findings (as in magnitudinally trivial effects being statistically significant), which would do more to confuse than to inform. The essence of effect size-based assessment is to circumvent the earlier discussed limitations of statistical significance testing, which means that using the two approaches together is simply counterproductive. When considered in the context of rudimentary data analytic competencies (which do not encompass more advanced sub-domains such as multivariate predictive modeling, experimental design, or time series analyses), effect size manifests itself as an assessment of the magnitude of the strength of the relationship between two variables, which can be operationally expressed in one of three functional forms: correlation, mean difference, and odds ratio. Correlation analysis, which as discussed earlier captures the direction and the magnitude of an association between two normally distributed continuous variables,13 yields probably the most straightforward assessment of effect size. The reason is that the correlation coefficient is standardized, ranging in value from −1 (perfect negative or inverse association) to +1 (perfect positive or direct association); consequently, within the confines of correlation, effect size is simply the absolute value of the correlation coefficient, where larger values indicate a stronger effect.

13 This applies to Pearson's r correlation, which is the most widely used correlation statistic; Spearman's ρ (rho) rank correlation and Kendall's τ (tau) rank correlation are the two lesser known correlation statistics that can accommodate rank-order (i.e., not continuous) data.


Cohen's standard is an often-recommended interpretation guide, according to which correlations smaller than 0.1 (disregarding the sign) should be considered spurious or informationally immaterial, correlations between 0.1 and 0.3 should be interpreted as 'weak', correlations between 0.3 and 0.5 as 'moderate', and those 0.5 or greater as 'strong'. The second of the three expressions of association is mean difference. Effect size evaluation of mean difference can be characterized as a standard deviation-adjusted assessment of the magnitude of statistical effects; computation-wise, the best-known approach is the standardized mean difference, with the most widely used formulation known as Cohen's d, which is computed as follows:

    \text{Effect Size}_{\text{mean difference}} = \frac{\text{Mean}_1 - \text{Mean}_2}{\text{Std. Deviation}_{\text{diff}}}

As clearly visible in the above computational formula, mean difference expressly takes into account the variability of estimated effects, which here is meant to capture the degree of imprecision of the estimate(s) of interest. Overall, the larger the standard deviation in relation to the mean, the less precise the estimate in question, which in turn means a smaller effect size. A commonly used interpretation logic of standardized mean difference-expressed effect size categorizes effect sizes of less than 0.2 as spurious or 'trivial', 0.2 to 0.5 as 'small', 0.5 to 0.8 as 'medium', and those greater than 0.8 as 'large.' It is worth noting that the Cohen's d formulation assumes that the underlying distributions are normal (i.e., symmetrical) and homoscedastic (i.e., exhibit comparable error14 variability); moreover, its computational logic makes no allowances for the impact of outliers, all of which suggests that more computationally involved approaches that relax the said assumptions and rely on trimmed mean-based comparisons might be more appropriate,15 though those considerations reach beyond the scope of basic data analytic competencies. The third and final manifestation of effect size is the odds ratio. The notion of odds is defined as the probability that the event of interest will occur, divided by the probability that it will not occur. In a binary, success-failure context, it is the ratio of successes to failures (in contrast to the notionally similar idea of probability, which expresses the chances of the outcome of interest, e.g., success, as a fraction of the total, i.e., successes + failures). The odds ratio captures the association between two categorical data features, as in two sets of counts; more specifically, it quantifies the impact of the presence or absence of A on the presence or absence of B, under the assumption of A – B independence.
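The following minimal sketch illustrates the mean difference-based effect size just described, using hypothetical outcome values for two groups and a pooled standard deviation as the denominator (one common way of operationalizing the standard deviation term in Cohen's d); the interpretation labels follow the thresholds given above.

    from statistics import mean, stdev

    # Hypothetical outcome values for two groups (e.g., order values under two promotions)
    group_1 = [98, 105, 110, 93, 101, 99, 104, 96]
    group_2 = [112, 108, 119, 103, 115, 109, 111, 106]

    # Pooled standard deviation (assumes comparable variability in the two groups)
    n1, n2 = len(group_1), len(group_2)
    pooled_sd = (((n1 - 1) * stdev(group_1) ** 2 + (n2 - 1) * stdev(group_2) ** 2)
                 / (n1 + n2 - 2)) ** 0.5

    cohens_d = abs(mean(group_1) - mean(group_2)) / pooled_sd

    # Interpretation thresholds described in the text
    if cohens_d < 0.2:
        label = "trivial"
    elif cohens_d < 0.5:
        label = "small"
    elif cohens_d < 0.8:
        label = "medium"
    else:
        label = "large"

    print(f"Cohen's d = {cohens_d:.2f} ({label})")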

14 As noted earlier, in the context of statistical analyses, the notion of 'error' is used to denote deviation, as in the difference between an estimated mean and individual actual values.
15 Those interested in additional details and alternative formulations are encouraged to consult more in-depth overviews, such as Wilcox, R. (2022). One-way and two-way ANOVA: Inferences about a robust, heteroscedastic measure of effect size. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 18(1), 58–73.


                     Present    Absent
    Present              A          B
    Absent               C          D

    \text{Effect Size}_{\text{Odds Ratio}} = \frac{A \times D}{B \times C}

Figure 8.5: Calculating Effect Size for Odds Ratio.
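A minimal computational sketch of the odds ratio laid out in Figure 8.5, using hypothetical cell counts (for instance, customers who did or did not receive an offer, cross-tabulated against whether they did or did not purchase):

    # Hypothetical 2x2 contingency table cell counts (layout as in Figure 8.5)
    a, b = 120, 380   # first event present: second event present (a) vs. absent (b)
    c, d = 60, 440    # first event absent:  second event present (c) vs. absent (d)

    odds_ratio = (a * d) / (b * c)
    print(f"Odds ratio = {odds_ratio:.2f}")
    # > 1: odds of one event increase when the other is present (akin to positive correlation)
    # = 1: the two events are independent
    # < 1: odds of one event decrease when the other is present (akin to negative correlation)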

Evaluation of the odds ratio-expressed effect size is framed in the context of a 2x2 contingency table, as summarized in Figure 8.5. Interpretation of the odds ratio-expressed effect size is as follows: when the effect size equals 1.0, the two events are considered independent (the odds of one event are the same whether the other event is present or absent); when the effect size is greater than 1.0, the odds of one event increase when the other event is present (analogous to positive correlation); and lastly, when the effect size is smaller than 1.0, the odds of one event decrease when the other event is present (analogous to negative correlation). The type of effect notwithstanding, it is important to be mindful of the impact of method-specific assumptions on the interpretation of results. For example, Pearson's r correlation-expressed effect size reflects the degree to which two continuous, symmetrically distributed variables are linearly associated, as graphically illustrated in Figure 8.6.

Figure 8.6: Graphical Representation of the Logic of Pearson Correlation. (Relating two normally distributed, homoscedastic continuous variables can result in a positive association, no association, or a negative association.)


Keeping those result-framing considerations in mind is critical not only to assuring the validity of results (i.e., input data are expected to meet the normality and related requirements, such as equal error variance, formally known as homogeneity of variance), it is also critical to correctly interpreting those results. In the case of Pearson's r correlation, the effect estimate only captures the extent to which two data features are linearly related; if the underlying association is nonlinear, which, as can be expected, is common, it will not be captured, thus both the presence and absence of correlation need to be interpreted in the narrow context of linear associations. Similarly, the mean difference-expressed effect size imposes specific requirements on the data to be used in estimation, namely normality of the underlying distributions, equality of error variances, and independence of individual records (meaning that the value of a given variable does not depend on the value of the same variable in the preceding record). Consequently, interpretation of mean difference-expressed effect size should be couched in the extent to which the data features of interest meet those requirements. Lastly, while somewhat less bound by assumptions (largely because counts-based categorical data permit fewer operations), the interpretation of odds ratio-expressed effect size needs to clearly differentiate between the meaning of the odds ratio and the notionally similar but mathematically different interpretation of probability. The distinction between the two, briefly mentioned earlier, can be confusing given that odds are defined in terms of the probability of occurrence – still, the informational takeaways are somewhat different, because odds relate the chances of the event of interest occurring to the chances of it not occurring, whereas probability relates the chances of the event of interest to the total or combined sum of outcomes. So, for example, the odds of rolling a '5' using a standard die are 1-in-5 (since there are five other possible outcomes), or 20%, whereas the probability of rolling a '5' is 1-in-6 (since there are a total of 6 possible outcomes), or about 16.7%. Turning back to broader considerations of interpreting probabilistic outcomes, it is also important to not lose sight of the fact that drawing (statistical) inferences from data is not only inescapably speculative, but also prone to situational and personal influences. The preceding effect size evaluation overview offers a good illustration of those considerations. Setting aside any potential data limitations, most notably some degree of incompleteness and/or inaccuracy, analyses of data are about understanding patterns and relationships of the past, whereas drawing inferences from data is almost always forward-looking in nature. For example, starting with quantifying an effect, as in computing an A-B correlation, and then qualifying the magnitude of that effect, as in translating numeric A-B correlation estimates into more interpretationally straightforward terms, such as 'weak' vs. 'strong' association, the goal of effect size assessment is to determine the validity of the association of interest. Those assessments, however, reflect the past, while the idea behind statistical inference is to draw future-looking conclusions. That is an important, though often overlooked, distinction: merely being valid, as in being true, is not enough to assure that what has been observed in the past can be expected to also hold true in the future.


Therefore, along with the assessment of data analytic result validity, it is also important to examine the reliability, or dependability, of those results. As a general rule, validity does not imply reliability, and vice versa. Inferences drawn from data can be true, in the sense of truthfully reflecting (past) patterns and associations, but may be unreliable, in the sense of not offering dependable (forward-looking) decision guidance. That tends to happen in situations in which a sudden, unexpected event disrupts the continuity of historical trends and, by extension, the past-to-future projectability of insights derived from those trends. Examples include macroeconomic events, such as the 2007–2008 global financial crisis, as well as competitive events, such as the 2007 unveiling of the original iPhone by Apple Inc., both of which effectively reset numerous trends and associations based on pre-event data.

✶✶✶

The preceding exploration of data analytic result interpretation skills implicitly assumed that input data are structured and numeric. While historically structured numeric data have been the focal point of applied data analyses, it is hard to overlook that nowadays the vast majority of data – by some estimates as much as 85% or 90% – are unstructured, and predominantly non-numeric. Consequently, any overview of sensemaking skills would be incomplete without at least considering the factors that determine the ability to utilize results of non-numeric analyses, most notably in the form of text mining and sentiment analysis.

Making Sense of Non-Numeric Outcomes

It is well known that computer systems are very adept at finding patterns and relationships in structured numeric and numerically expressible (e.g., text labels that can be interpreted in a manner paralleling numeric values) data; those systems, however, are considerably less adept at extracting meaning out of unstructured text and symbolic data. At the core of the problem is the complexity of human communication in general, and of human language in particular, along with the lack of clearly discernible patterns of value/expression recurrences; also impeding computerized analyses of unstructured data is the difficulty of translating largely intuitive human sensemaking into adequately explicit and complete computer instructions. Much of the meaning contained in text and image data is either implied (text) or symbolic (image), both of which are nuanced and situational, and neither of which lends itself to easily programmable computer logic. Consequently, even just focusing on comparatively simple text data, such as online product reviews, extracting meaning that at least notionally resembles the insights contained in trends and relationships derived from structured numeric data is far more difficult, and the resultant informational takeaways tend to be considerably more speculative, in the sense of completeness and accuracy.


At the same time, the sheer volume of text data makes anything other than limited spot-checking practically implausible, resulting in speculative results that are not easy to validate. When considered from an operational perspective, the general endeavor of text mining is a complex undertaking that requires a combination of specific know-how and purpose-built applications; not surprisingly, it largely falls outside the scope of basic data analytic literacy. However, considering the sheer ubiquity of text data, even rudimentary data analytic competencies cannot entirely exclude non-numeric data, especially given that those data account for as much as 90% of data captured nowadays. Setting aside more advanced text mining applications, such as delving into consumer brand sentiment, which requires highly specialized knowledge and applications, other types of text-encoded insights are comparatively more accessible. An approach commonly known as bag-of-words16 considers text data from the broad conceptual perspective of signaling theory, which sees a body of text as a collection of informative signal and non-informative noise (and dismisses any syntactical and related information). In that context, the goal of text mining is to first identify and then extract informative signals, and then to convert the resultant metadata (data about data) into more easily analyzable structured numeric data. With that in mind, and starting from the premise that the data to be analyzed represent a subset of a larger collection of like data, mining of text entails four distinct steps: retrieval, summarization, structural mining, and digitization. Retrieval entails the selection and extraction of a subset of data of interest, which often starts with the selection of a smaller set to be used to train the algorithm to be used in mining. Summarization is the process of condensing a typically large number of data records by identifying and tabulating recurring terms and expressions; the outcome of this process typically takes the form of an abstract or a synopsis. The next step in the process, structural mining, is probably the broadest as well as the most complex aspect of the bag-of-words method, as it involves transforming (still) text values into statistically analyzable, categorized metadata. The last general text mining process step, digitization, entails digit-coding of individual metadata features, or converting textual expressions into effect-coded (i.e., present vs. absent, typically encoded as 1 and 0, respectively) numeric data features, as graphically illustrated in Figure 8.7; the so-converted, initially textual expressions can then be combined with structured numeric data.

16 The simplest approach to text mining; it reduces the text to be analyzed into unordered collections of words, where grammar and even the word order are disregarded, which inescapably leads to lumping together of stem-related (words derived from the same root term) but differently valenced (the same term used in the positive vs. negative context) expressions. Sentiment analysis is the more advanced analog to bag-of-words; its goal is to extract the general attitude or sentiment by interpreting not only collections of words but also the grammatical and syntactical structures.


Figure 8.7: Digitization of Text Metadata. (Text metadata comprising Text Expression 1 through Text Expression n are converted into effect-coded structured numeric data, in which each record – Record 001 through Record n – is coded 1 or 0 to indicate the presence or absence of each expression.)

The general process of transforming text metadata into effect-coded structured numeric data summarized in Figure 8.7 is notionally reminiscent of the process used to transform an item-centric data layout into a buyer-centric layout (see Figure 5.1) discussed earlier; in that sense, it can be seen as yet another example of basic data manipulation. It should be noted that not expressly shown in Figure 8.7 is a typical next step following digitization, in which text metadata are attributed to structured data records, typically by means of common identifier-based linking of records in unstructured text data with records in structured numeric data (which parallels the file merging process discussed in Chapter 7). The reason for that is that, typically, the ultimate goal of bag-of-words-based text mining is to enrich the informational content of largely numeric and structured transactional data by attributing additional data features to entities (i.e., data records) of interest. For example, brand managers are keenly interested in the so-called 360° customer view, or comprehensive customer profiles compiled by pooling customer-level data from the various touchpoints that customers use to purchase products and receive services. But nowadays, customer interactions span a wide range of modalities, including in-store and online purchases, company website visits, online and phone support, and participation in various product/service review and discussion forums – all of the individual touchpoints generate standalone data flows ranging from structured numeric transactions to unstructured text to (also unstructured) web logs. To create a singular, comprehensive, multisource and multiattribute 360° customer view, it is necessary to amalgamate the carefully processed and curated informational content of informationally and organizationally diverse data, a part of which is the process summarized in Figure 8.7.
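The following minimal sketch illustrates the digitization step in the spirit of Figure 8.7: a deliberately simplified, pure-Python rendering in which single words stand in for the multi-word expressions that a real structural mining step would produce, applied to hypothetical product review snippets.

    # Hypothetical text records (e.g., short product review snippets)
    records = {
        "Record 001": "fast shipping great price",
        "Record 002": "poor packaging slow shipping",
        "Record 003": "great product fast delivery",
    }

    # Build the retained 'metadata' vocabulary: the set of informative terms
    vocabulary = sorted({word for text in records.values() for word in text.split()})

    # Effect-code each record: 1 if the term is present, 0 if it is absent
    digitized = {
        record_id: [1 if term in text.split() else 0 for term in vocabulary]
        for record_id, text in records.items()
    }

    print(vocabulary)
    for record_id, row in digitized.items():
        print(record_id, row)

The resulting rows of 1s and 0s can then be linked, via a common identifier, to structured numeric records, as described above.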

The Text Mining Paradox

Although deeper mining of text data in the form of sentiment analysis falls outside the scope of this book, awareness of issues relating to that important aspect of data analytics should be considered a part of the basic data analytic knowledge base. To that end, extracting insights out of famously voluminous unstructured text data (the now popular 'big data' label is largely used in reference to social media and other mostly text data) necessitates the use of automated text mining mechanisms, which gives rise to a paradoxical choice: data that are too voluminous for anything other than automated 'black box' analyses are also too voluminous for standard result validation approaches, a situation that leads to a choice of either 'using and trusting' the results or forgoing mining of text data altogether.


Proponents of the former often argue that limited spot-checking of results offers an adequate assessment of outcome validity, but that argument is rooted in a common misapplication of sampling logic, specifically relating to the sample-to-population generalizability discussed earlier. The idea that text mining results can be adequately validated by manually reviewing small subsets of the corpus of interest (a corpus being the body of text being analyzed) is not just unconvincing – it is flawed in principle. The reason is as follows: there are two broad approaches to selecting representative samples, probability and non-probability sampling (both discussed in more detail in Chapter 5). Starting with the former, the very idea behind probability sampling, most frequently operationalized using random sampling,17 is rooted in the belief that each element in the population of interest can be seen as an undifferentiated part of a homogeneous total, sort of like a single ball in a container full of like balls, often used in lottery drawings. In order for that to be true, however, each record needs to have an equal probability of being selected, and the resultant mix of so-selected records must support the assumption that it is a good representation of the underlying population (here, the totality of text data under consideration). Is it reasonable to assume that in a given corpus, any record is essentially like any other record, and that choosing some, typically relatively small, subset of those records offers a close enough representation of the entire corpus? And is it operationally feasible to select a subset of that corpus in a way that gives every record the same chance of being selected? In a sense, it seems like assuming that in a particular document, all individual lines of text are informationally alike, and that choosing any random subset of those lines will give a good portrayal of the entire document . . . And yet, failure to meet those requirements will more than likely produce a sample that simply does not offer a sufficiently close reflection of the underlying whole. Given the demands associated with probability sampling methods, non-probability sampling might offer a viable alternative, but non-probability samples cannot be assumed to be representative of the larger population. One of the most widely used non-probability sampling approaches is convenience sampling, which, as suggested by its name, is built around the idea of using whatever subset of data is available or accessible. Convenience and other non-probability samples can certainly be used to generate initial exploratory or descriptive insights, but they cannot be used as the basis for ascertaining the validity of text mining insights.

 For the purposes of parsimony of explanation, otherwise distinct probability sampling techniques – most notably simple random, systematic, and stratified sampling methods – are all lumped together under the general random sampling umbrella. In a strict sense, only in simple random sampling does each record have the same probability of being selected; the mechanics of systematic and stratified sampling result in not all records having the same probability of being selected, but delving into those considerations would detract from the focal point of this argument.
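
To make the footnote's distinction concrete, the sketch below shows how systematic and stratified selection are typically operationalized; the 'source' strata, the corpus size, and the per-stratum sample size are invented solely for illustration.

import random

random.seed(1)
# A hypothetical corpus in which each record carries a stratum label
# (for instance, the platform a piece of text came from).
corpus = [{"id": i, "source": random.choice(["reviews", "tweets", "emails"])}
          for i in range(10_000)]

# Systematic sampling: every k-th record after a random start.
k = 100
start = random.randrange(k)
systematic_sample = corpus[start::k]

# Stratified sampling: a fixed number of records drawn at random from each stratum.
stratified_sample = []
for source in ("reviews", "tweets", "emails"):
    stratum = [record for record in corpus if record["source"] == source]
    stratified_sample.extend(random.sample(stratum, k=30))

print(len(systematic_sample), len(stratified_sample))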

It is reasonable to assume that the ability to make sense of human language will continue to evolve, leading to ever-more efficacious text mining algorithms; that progress is already clearly visible in the form of popular applications such as Amazon’s Alexa or Apple’s Siri. Still, the essence of data analytic proficiency demands robust understanding of the key operational characteristics of the ever-more automated data analytic tools, even as those tools continue to evolve.

Sensemaking Skills: A Recap

When working with data, the distinction between knowledge and skills is not always clear, and in fact, it would be a stretch to suggest that the two are clearly distinct and fully independent of one another. It is difficult to envision someone developing robust knowledge of, for instance, statistical methods without also developing at least rudimentary familiarity with different types of data, and at least some degree of proficiency with the computational tools used to manipulate and analyze data. At the same time, it is just as unworkable to suggest that competency in, for instance, statistical methodologies implies equally robust computational skills or data type familiarity. After all, data analytics is a multifaceted domain, a composite of elements of conceptual knowledge and applied skills drawn from otherwise distinct domains of statistics, computer science, and decision theory. It thus follows that attainment of data analytic literacy, as framed in Figure 2.2 discussed in Chapter 2, necessitates carefully 'choreographed' immersion in thoughtfully selected elements of data, statistical, computational, and sensemaking conceptual knowledge and applied skills, combined into a singular set of complementary competencies.

The idea of 'carefully choreographed' immersion in the composite elements of data analytic literacy warrants additional consideration. The side-by-side tabular depiction of the structure of data analytic literacy shown earlier in Figure 2.2 draws attention to distinct elements of data analytic competency, but at the same time it masks their interdependence, which is an important aspect of data analytic know-why, -what, and -how. Knowledge of data analytic methods, familiarity with different data types, and the ability to carry out any appropriate data analytic steps all function as components of a larger system of being able to transform raw and messy data into informative insights. That interdependency is particularly pronounced in the context of data analytic sensemaking, as graphically illustrated in Figure 8.8. The symbolism of Figure 8.8 is meant to communicate that knowledge of data and of data analytic methods dynamically interacts with computational skills to produce (data analytic) outcomes, which necessitate assessment and interpretation, jointly characterized here as result sensemaking. As the degree of proficiency in methods, data, and computing increases, it can be expected that the ability to extract deeper meaning out of available data will also need to grow.

Figure 8.8: Sensemaking as System. [Diagram: Methods, Data, and Computing interacting to produce Sensemaking.]

Noting that is important because the conventional approach to teaching data analytics is focused primarily on methods, data, and computing, and rarely goes beyond the 'tried and', well, not all that 'true' approaches to assessing the efficacy of data analytic outcomes. Just as important is noting that the 'standard', statistical significance focused methods of assessment may not be able to offer the type of validation that is called for in applied data analytics. Hence, in order to be able to boil data analytic outcomes down to their practical informational essence, data analytics practitioners need to take a more expansive view of the means and ways of assessing the often nuanced results of confirmatory and exploratory analyses. And while the objective forms of assessment should be at the center of that process, subjective result sensemaking should not be entirely dismissed, as cognitive dissonance may signal a potential problem somewhere in the many steps taken to derive data analytic insights. Representing the state of sensemaking conflict that emerges when external evidence and internal beliefs or the general sense of reason suggest conflicting conclusions, cognitive dissonance should not be disregarded, as it can compel deeper assessment of not just the outcomes of interest, but the entire process that generated those outcomes. In a more general sense, it is healthy, in fact desirable, to approach any empirical result with a dose of skepticism because data are often messy and incomplete, and data analytic methods are essentially structured means of arriving at approximately true conclusions. Hence, the possibility of data analyses producing flawed results is ever-present (which is why statistical significance approaches, but never reaches, 100%). At the same time, data are usually unbiased and encompass a much wider and deeper slice of reality than anyone's subjective experience, which suggests that data can paint a more accurate picture than one's mental models. Consequently, it stands to reason that, when properly vetted, cleansed, and analyzed, data can be a powerful teacher, and so it is important to approach any new insights with an open mind, particularly in view of perception-warping cognitive bias and other
shortcomings of human thinking and remembering mechanisms.18 Experienced professionals sometimes see data analytics as a competitor, fearing that data-driven decision-making will render their professional judgment and experience irrelevant – while understandable, such sentiments are unfounded. In all but a small number of highly structured, repetitive decision contexts (i.e., clerical rather than professional tasks), data analytic insights need to be contextualized and synthesized with other information, which requires the uniquely human 'touch'. In short, data-driven decision-making needs human sensemaking to be effective, and combining experience-rich professional judgment with the breadth, depth, and objectivity of data-derived insights offers the most direct path to enhancing the efficacy of organizational decisions. In fact, as argued in Part III of this book, data analytics can enrich and advance one's ability to discharge one's professional duties.

 For a more in-depth discussion of those mechanisms and their impact on data analysis related sensemaking processes see Banasiewicz, A. (2019). Evidence-Based Decision-Making: How to Leverage Available Data and Avoid Cognitive Biases, Routledge: New York.

Part III: The Transformational Value of Data Analytic Literacy

Steam, oil, electricity, and data all have one thing in common: each ushered in a new chapter in the ongoing story of socioeconomic development. The invention of the steam engine fundamentally changed how many aspects of work were done, which was the linchpin of the Industrial Revolution. The subsequent discovery of oil, first, and then of electricity further evolved not only how work was done, but also how lives were lived. Better and more ubiquitous tools and technologies powered first by steam, then by oil and electricity fundamentally reshaped not just many aspects of work and personal lives, but also ultimately remade the prevailing social order, in which landowners were replaced by industrialists as the dominant social class. The more recent emergence of progressively more autonomous, intelligent network-driven automation, accompanied by bio-info-electro-mechanical symbiosis, is beginning to exert an even more profound impact on humanity, as it is beginning to redefine what it means to be human.

The goal of this section is to take a closer look at data analytic literacy as an integral part of the ongoing socioeconomic evolution; more specifically, it is to examine the impact of robust data analytic competencies on how work is done and lives are lived today, and how both are likely to evolve tomorrow. Hence, in contrast to the prior chapters, which were focused on detailing the core elements of data analytic competencies, the remainder of this book looks at the value of becoming data analytically literate from a more futuristic perspective of the transformative and even transcendental impact of the rapidly expanding digital automation. Chapter 9 – Right Here, Right Now: Data-Driven Decision-Making – examines how the ubiquity of data coupled with expanding abilities to utilize data is taking much of the guesswork out of organizational choices and decisions; stated differently, it looks at the value of data analytic literacy from the perspective of transformative change. Chapter 10 – The Future: Data-Driven Creativity – takes the current technological trends to their logical, relatively near-term conclusion by painting a picture of what work and life are likely to be in the future, which is to say that it considers data analytic capabilities from the more 'out there' transcendental perspective.


Chapter 9 Right Here, Right Now: Data-Driven Decision-Making

Data is emerging as the currency of the Digital Age, quite literally, in the form of cryptocurrency. A decentralized digital medium of exchange, cryptocurrency is an application of the idea of an electronic ledger distributed across a closed computer network, with individual transactions validated using network-based cryptographic processing of transaction data stored in blocks and linked together in a chain (hence the name blockchain) and duplicated across the entire network. Operating as independent private networks, the various types of cryptocurrencies are not reliant on any central authority such as a government or a bank and are (still, as of the early part of 2023) outside of established governmental monetary controls; moreover, existing as digital, cryptographically validated records shared across distributed ledgers, cryptocurrencies are also nearly impossible to counterfeit. Their digital sophistication aside, at their core, cryptocurrencies are another form of data.

Data can also be seen as the currency of the Digital Age in a less literal sense. Described earlier as recordings of states and events, data can be seen as encoded information, where insights are hidden in patterns and relationships that can be uncovered with the help of purposeful and thoughtful analyses. In that sense, data can be seen as a potential source of informative, decision-guiding insights, which can support fact- rather than intuition-driven decision-making.

Broadly defined, intuition is the ability to know something without relying on conscious reasoning, a higher sense of perception.1 It is an integral part of human experience, and as such, a natural element of human decision-making. Given that the human brain is essentially an incredibly powerful parallel processing mechanism, intuition, even though it is subconscious, is nonetheless a product of information processing, much like rational sensemaking. But in contrast to rational thinking, which is usually deliberate, logical, and factual, the typically unintentional intuition relies on different mechanisms to generate its conclusions. Those mechanisms are geared toward quick snap judgments based on limited, at times even superficial, information, which stands in contrast to the slow and thorough evaluation of all available information that characterizes rational thinking.

 Although frequently used interchangeably with instinct, intuition can be characterized as 'gut feeling', while instinct as 'gut reaction'; in that sense, intuition captures learned, subconscious processing of information that might be too complex for rational thinking, such as choosing a mate, whereas instincts are innate reactions (thus behaviors rather than feelings) that stem from a combination of past experiences and innate inclinations.

Still, there are instances where the combination of limited information and snap judgment is good enough, as is often the case with 'fight or flight' automatic reactions to frightening situations. But there are other situations in which intuition can lead to flawed, even disastrous choices, as vividly illustrated by the fates of once high-flying companies such as Circuit City, Toys R Us, Compaq, or Minolta, whose managers chose to follow their 'gut' rather than readily available marketplace trends and other objective evidence when making important organizational decisions.

Softly implied in this line of reasoning is that the efficacy of intuitive choice-making can be bolstered by immersive informatics – an ongoing, systematic infusion of factual insights into general sensemaking processes. Given that intuition is a learned process that relies on informational inputs to generate conclusions (i.e., it is not a manifestation of hard-coded inclinations), it stands to reason that intuitive sensemaking could be primed with data, especially when data are 'consumed' on an ongoing basis. This is where data analytic literacy plays a central role: Immersing oneself in data on an ongoing basis systematizes what is sometimes referred to as learning with data, a process of creating perspective-neutral, unbiased intuition-informing insights. In short, immersive informatics offers a solution to the pervasive problem of over-reliance on intuition in professional settings: Given that relying on intuition is a largely involuntary response to uncertainty, rather than trying to suppress it, which might be unrealistic, why not better inform it?

Infusing subconscious intuitive sensemaking with objective, data analyses-derived insights is tantamount to countering one of the core reasons behind faulty intuition – the ubiquitous cognitive bias. A multifaceted phenomenon that manifests itself in dozens of discernable, perception-warping effects, as exemplified by the availability heuristic, belief and confirmation inclinations, or neglect of probability predispositions, cognitive bias gives rise to sensemaking conclusions that deviate from rational judgment. And while becoming aware of its rationality-hindering effects is an important first step, immersive informatics is a necessary follow-up. A combination of cognitive bias awareness and thoughtful and systematic infusion of analytically robust objective information holds the key to 'recalibrating' intuitive sensemaking mechanisms. If the subconscious, abbreviated (i.e., snap judgment, as in hurried or impetuous conclusions) sensemaking mechanisms of intuition can draw on a robust informational base, the impact of cognitive bias can be greatly reduced, in some situations even altogether eliminated. Disastrous outcomes, such as those vividly illustrated by Kodak's ill-fated decision to not pursue the digital photography it invented,2 are avoidable, if broadly conceived consumption of data-derived facts and patterns becomes a staple part of decision-makers' informational diet.

 The Eastman Kodak Company invented digital photography in the 1970s, long before any of its competitors, but decided to kill their own innovation out of fear that it would cannibalize their highly profitable film business, to which they clung all the way to their 2012 bankruptcy.

The Kodak case highlights yet another systemic problem related to data analytic literacy, encapsulated in the longstanding view of executive decision-makers existing largely outside the realm of engaging in hands-on exploration of data. Simply put, that mindset cannot be reconciled with the data-everywhere reality of the Digital Age. The duty of care, a fiduciary obligation of corporate directors and officers, mandates that they pursue the organization's interests with reasonable diligence and prudence – not availing oneself of the ability to dive into insights hidden in readily available data can be reasonably interpreted as failing to fully discharge those obligations. That is not to say that CEOs, CFOs, or corporate board members are expected to become data scientists; but just as they are expected to be able to read and understand the often technical language of financial and economic reports, they also need to exhibit rudimentary data analytic proficiency, not only to be able to engage in ad hoc analyses of available data, but also to be able to discern the analytic efficacy of data analytic outcomes prepared by others and shared with them.

Bolstering Intuitive Sensemaking

The idea of immersive informatics raised in the previous section is particularly cogent in the context of organizational decision-making, which typically entails choosing between competing courses of action. It is nearly instinctive for decision-makers to favor subjective beliefs and interpretations, as doing so is hard-coded in human psychobiological makeup. But, when coupled with the earlier discussed shortcomings of intuitive sensemaking, perhaps best illustrated by the multifaceted and ubiquitous cognitive bias, instinctive information processing can lead to distorted views of reality. The well-known tendencies to, for instance, place more emphasis on recent events, highlight positive and look past negative outcomes, or rely on imperfect recall, can all lead to unwarranted or outright faulty conclusions. That is not to say that intuitive sensemaking lacks merit in its entirety, but in situations that demand careful and thoughtful review of available evidence, it can lead to otherwise avoidable errors in judgment, as exemplified by the disastrous choices of Kodak, Circuit City, or Toys R Us managers. There are ample reasons to believe that many so-called ill-fated decisions can be more aptly characterized as ill-informed – it was not fate, but poorly thought-out choices that doomed many once high-flying companies.

The opportunity to bolster intuitive sensemaking, however, can be expected to vary across different types of decisions. Casting a wide net while still focusing largely on the professional aspect of decision-making, decisions can be grouped into several different categories, each considering decision types from a somewhat different perspective:

– Organizational and individual. Defining organizations as groups of individuals working together toward shared goals suggests that some decisions made by organizational members are organizational outcomes focused, while others are individual focused. The former tend to take the form of collective collaborations, where the infusion of objective, data-derived insights offers perhaps the most visible benefits, which includes
reducing the potentially adverse consequences of dysfunctional group dynamics, such as groupthink. As can be expected, individual decisions tend to be more self-centered; as such, those decisions can be unduly influenced by rushed judgment, subjective feelings and emotions, and a host of situational factors, suggesting that objective evidence can be as valuable in those contexts as it is in broader organizational contexts.

– Strategic and operational. Decisions made to advance high-level organizational goals and objectives, such as being a leading provider of a particular type of service or a type of product, are typically characterized as strategic; decisions that contribute to routine operations fall under the broad umbrella of operational choices. One of the more direct benefits of the seemingly ever-present electronic transactional infrastructure is the ubiquity of granular, near-real-time data, which are nowadays commonly used to support operational decisions; strategic decision-making, on the other hand, often continues to be seen more as art than science. Admittedly, strategic decision-making is inescapably more speculative and subject to a wider array of known and unknown influences – at the same time, unaided human sensemaking is subject to the earlier discussed cognitive bias or channel capacity-imposed limitations, something that the often-rich varieties of objective information can help to rectify.3

– Major and minor. Be it organizational or individual, strategic or operational, some decisions are more consequential than others; major decisions are those that contemplate large-scale changes, while minor decisions are those that contemplate comparatively smaller changes. Of course, there is no single, objective line of demarcation that separates those two types of decisions, as what is deemed 'minor' in one organizational setting might be deemed 'major' in another. Setting the definitional distinction aside, whatever decisions are considered to be 'major' should be more heavily rooted in objective evidence because doing so can be expected to diminish the probability of undesirable outcomes, as reliance on objective evidence can be shown to materially reduce perception-warping individual cognitive distortions and dysfunctional aspects of group dynamics.

– Routine and non-routine. Routine decisions do not require much consideration, as exemplified by inventory replenishment, whereas non-routine decisions call for careful and thoughtful analysis of pertinent factors and alternatives. Though somewhat reminiscent of the major vs. minor distinction, the routine vs. non-routine grouping offers a different perspective as it considers decision types from the vantage point of process rather than impact. As can be expected given the current levels of informational and transactional automation, it is common for routine decisions to be entirely data-driven, while the use of objective evidence in non-routine decisions can be expected to vary across decision types and organizational settings.

 Interestingly, the volume and diversity of available information can be both overwhelming and confusing, especially when different sources are pointing toward conflicting take-aways. That said, it is possible to synthesize source- and type-dissimilar information into a singular, summative set of conclusions using a systematic and objective analytic approach; for a more in-depth discussion of those mechanisms and their impact on data analysis related sensemaking processes see Banasiewicz, A. (2019). Evidence-Based Decision-Making: How to Leverage Available Data and Avoid Cognitive Biases, Routledge: New York.

Implied in the simple categorization outlined above is that some decisions more naturally lend themselves to being data-driven, while some others tend to skew more toward what is commonly characterized as creative thinking. Routine, minor, and operational choices tend to be driven by substantively different sensemaking mechanics than non-routine, major, strategic decisions. The latter almost always call for a wider and more thorough information search as well as more careful evaluation of all pertinent inputs, which typically encompass subjective beliefs as well as objective information. The resultant open-ended, unbounded character of the underlying decision-making process is frequently framed as creative problem solving, and it is believed to exist outside the realm of ideas that lend themselves to clear operational understanding, as the 'creative' part of the description tends to overshadow the 'problem-solving' part.

Creative Problem-Solving and Deductive vs. Inductive Inference

The idea of creative problem-solving has an almost irresistible appeal, as in the minds of many it conjures up images of breakthrough ideas or captivating designs. But what, exactly, does it entail? Understood by most as an expression of the ability to imagine, or the forming of novel mental images, the notion of creative thinking is about as poorly defined as it is appealing. It is broadly characterized as a process that is relatively unstructured and that encourages open-ended, or not rigidly constrained, solutions, and that aims to balance divergent (multiplicity of ideas) and convergent (selection of the most compelling alternatives) thinking by encouraging brainstorming of ideas while discouraging immediate judgments, and while also pushing past prior beliefs and other barriers (such as accepted practices). At first blush that conception of creative problem-solving paints a compelling picture, though it is not clear what makes that process 'creative' – other labels, such as 'open-minded', might seem equally suitable, if not more fitting. There is no clear basis for believing that merely encouraging open-ended brainstorming and pushing past accepted beliefs and practices will lead to creative, in the sense of being novel, outcomes – it may simply lead to the acceptance of previously dismissed courses of action. Even more importantly, the above framing implicitly assumes in-the-mind human sensemaking, or stated differently, does not expressly account for the possible impact of data.

Brainstorming, a key element of creative problem-solving, is characterized as a group problem-solving technique that involves spontaneous contribution of ideas from group
members, and those contributions are commonly understood to be whatever (related to the problem at hand) ideas materialize in group members' minds. As such, brainstorming, and by extension creative problem-solving, places heavy emphasis on human imagination, all while largely disregarding the potential benefits of looking to data for inspiration. And that presents an opportunity to re-frame the idea of creative problem-solving as a process rooted in 'creative' use of valid and reliable inferences.

Figure 9.1: The Two Dimensions of Inference. [Diagram: Inference branches into Deductive (Accepted premises → Conclusion) and Inductive (Data → Conclusion).]

Accepted premises that form the basis of deductive inference can take a variety of forms ranging from formal theories to speculative conjectures; inferences drawn from those sources comprise the traditional informational source of creative decision-making. Systematic and persistent immersion in appropriate (i.e., related to a particular type of decision) data, characterized earlier as immersive informatics, can add an empirical dimension, in the form of inductive inference, to creative decision-making, which can not only bolster the efficacy of the resultant conclusions – it can also expand the breadth of emerging conclusions. The idea here is to redefine the mechanism of brainstorming: Rather than being seen (and practiced) as a group exercise in deductive reasoning-fueled thinking of potential solutions to a problem at hand, it should be seen as a broader endeavor that also encompasses systematic assessment of applicable data.

In more technical terms, the above reframing of brainstorming, and thus of the broader idea of creative problem-solving, calls for moving beyond the now dominant hypothetico-deductive sensemaking mindset, which is not as simple as it may sound, as that knowledge creation approach has been educationally hard-coded into learners' psyche by years of theory-first educational philosophy. A byproduct of the 17th century's blossoming of the Age of Reason (more commonly known as the Enlightenment, a period in European history which marked the rise of modern science) and its embrace of hypothesis testing-centric knowledge creation, the currently dominant hypothetico-deductive educational mindset conditions brainstorming to be approached from the perspective of empirical tests of theoretical conjectures (i.e., accepted premises in Figure 9.1), which manifests itself as deductive inference. Stated differently, the decision-makers of today spent their formative learning years immersed in a mindset in which new ideas flow out of prior beliefs, and data are used primarily to test and develop new theories, rather than to give rise to directly usable conclusions.
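
The contrast between the two branches of Figure 9.1 can be made tangible with a small sketch; the store-format labels, the revenue figures, and the premise being tested are all invented for illustration, and the permutation test stands in for whatever confirmatory procedure a given analysis would actually call for.

import random
import statistics

random.seed(7)

# Hypothetical daily revenue figures (in $ thousands) for two store formats;
# the numbers are made up purely for illustration.
mall_stores = [random.gauss(52, 8) for _ in range(60)]
street_stores = [random.gauss(48, 8) for _ in range(60)]

# Deductive use of data: start from an accepted premise ("mall locations
# outsell street locations") and test it, here with a simple permutation test.
observed_diff = statistics.mean(mall_stores) - statistics.mean(street_stores)
pooled = mall_stores + street_stores
extreme = 0
for _ in range(5_000):
    random.shuffle(pooled)
    permuted_diff = statistics.mean(pooled[:60]) - statistics.mean(pooled[60:])
    if permuted_diff >= observed_diff:
        extreme += 1
print(f"observed difference: {observed_diff:.1f}, permutation p-value: {extreme / 5_000:.3f}")

# Inductive use of data: no prior conjecture; simply summarize and look for
# patterns (here, which format dominates the top revenue decile) that may
# suggest previously uncontemplated conclusions.
labeled = [("mall", x) for x in mall_stores] + [("street", x) for x in street_stores]
top_decile = sorted(labeled, key=lambda record: record[1], reverse=True)[:12]
print({fmt: sum(1 for f, _ in top_decile if f == fmt) for fmt in ("mall", "street")})

The first half starts from an accepted premise and asks the data to confirm or refute it (deductive inference); the second half starts from the data and lets whatever pattern emerges suggest a conclusion (inductive inference).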

That is not to say that the hypothetico-deductive mindset did not work well – after all, when placed in the context of the roughly 6,000-year-long timeline of the evolution of human civilization, the scientific and technological progress that followed the victory of scientific reason over dogma some three centuries ago has been nothing short of astounding. But progress also brings with it new opportunities, and at least some of those new opportunities cannot be realized without some rethinking of, well, how thinking is done. Up until just a few short decades ago, the amount and diversity of data were extremely limited, as were data analytic capabilities (e.g., calculating something as simple as the mean and the standard deviation without the use of electronic computers is quite taxing) – in those days, the deductive method offered the best mechanism of knowledge creation. But times have changed, and while the deductive method remains relevant, the inductive method now offers a highly viable parallel knowledge creation mechanism; flashes of brilliance can now come from peering first into data . . .

Broadening the conception of creative problem-solving to expressly incorporate data-centric, or inductive, brainstorming is important not only because of the upside of more – i.e., accepted premises-based as well as data-derived – insights, but also because reliance on deductive reasoning alone may, at times, lead to myopic thinking. A clear case in point is offered by business education: Business students are introduced to distinct business sub-disciplines, such as management or economics, all of which aim to describe and rationalize their respective facets of business through the lenses of attitudinal and behavioral theoretical frameworks. In that general context, business students end up being conditioned to view organizational decision-making through the prism of innumerable decision frameworks, ranging from the ubiquitous SWOT analysis to agile management to business process reengineering, and on and on . . . Devoting the bulk of their educational time and effort to internalizing all sorts of sensemaking templates, they spend comparatively little time learning how to directly, creatively, and meaningfully engage with 'raw' organizational management related evidence, most notably numeric and text, structured and unstructured data. In the end, rather than being a journey of discovery paralleling learning to become a skilled chess player, management education can be more aptly characterized as indoctrination into the art of pigeonholing, where the primary focus is on learning how to fit observed fact patterns into predetermined templates and solutions. In view of that, it is not surprising that some of the most iconic business leaders did not graduate from business schools . . . the now dominant, in terms of market capitalization and influence, technology companies, including household names such as Apple, Microsoft, Twitter, Oracle, Dell, and Facebook, were all started by those who learned how to creatively solve problems elsewhere . . . Perhaps not having been indoctrinated into standard management thinking freed those and other business innovators to truly think creatively?

It seems reasonable to suspect that there might not be a single answer to such a profound question, but it also seems reasonable to suspect that over-reliance on deductive reasoning rooted in, at
times questionable, decision frameworks can lead to under-utilization of organizational data, which can manifest itself in a variety of undesirable outcomes. Perhaps the most obvious one arises when data are looked at through the prism of prior belief testing – one of the consequences of that mindset is a heightened inclination to dismiss unexpected findings as flukes, which can mean missing out on opportunities to uncover thought-provoking, maybe even eye-opening insights. Simply put, when the goal is to find X, it is easy to dismiss or overlook anything other than X. A somewhat less obvious deductive reasoning pitfall might be the framing of the underlying research question. The logic of the scientific method implicitly assumes that the informational essence of the conjectures that are to be tested (which, by the way, typically emanate from whatever happens to be the accepted view) holds the greatest knowledge potential. In other words, it assumes that the conjectures to be tested are the 'right' questions to ask. However, in rapidly evolving environmental, sociopolitical, economic, and even situational contexts that simply may not always be the case. Trend-wrecking change can relatively quickly and quietly lessen the informational value of established beliefs, ultimately rendering empirical testing of conjectures derived from those conceptual roots informationally unproductive, if not outright counterproductive. Hence, to yield maximally beneficial outcomes, organizational data need to be seen not only as a source of prior beliefs validation, or conjecture testing, but also as a source of open-ended, exploratory learning. In a more general sense, to arrive at truly novel solutions, decision-making needs to be rooted in a well-balanced mix of pondering new implications stemming from established premises (deductive reasoning) and seeking previously not contemplated possibilities by delving into patterns and relationships hidden in data (inductive reasoning).

Casting a Wider Net: Evidence-Based Decision-Making

In the context of organizational management, the idea of data-bolstered decision-making is a key element of evidence-based management, a relatively recent paradigm built around explicit use of the best available evidence as the basis of managerial decision-making. The present-day conception of evidence-based management is rooted in the broader idea of evidence-based professional practice, the notional origins of which can be traced back to the 1980s efforts on the part of the British government to emphasize the need for public policy and practices to be informed by an accurate and diverse body of objective evidence. In the US, evidence-based practice first took root in medicine, most notably as an outcome of the Evidence-Based Medicine Working Group's 1992 publication of 'Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine'.4

 Evidence-Based Medicine Working Group (1992). Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine. JAMA, The Journal of the American Medical Association, 268(17), 2420–2426.

Characterized as a new paradigm which '. . . de-emphasizes intuition, unsystematic clinical experience and pathophysiological rationale as sufficient grounds for clinical decision making and [instead] stresses the examination of evidence from clinical research . . .', it advocated the integration of practitioners' clinical expertise with the best available external research evidence. It is important to note that the idea of 'research evidence' was framed as predominantly academic empirical research published in scientific journals, which was to be critically reviewed and synthesized using two distinct-but-complementary methodological tools: systematic reviews, which provide structured means of identifying and summarizing (in terms of objectives, materials, and methods) empirical studies relating to topics of interest, and meta-analyses, which offer a statistical mechanism for analyzing and synthesizing the results of selected studies.

That approach has two main shortcomings: Firstly, there are some 28,000 academic journals that together publish about 1.8 million articles per year, and thus even a relatively narrowly framed inquiry is likely to yield a very large number of research studies to be reviewed, a process that is not only very time- and resource-consuming, but also fraught with methodological challenges stemming from the general idea of reviewer bias.5 Secondly, and in a way even more consequentially, in contrast to natural sciences-based fields, such as medicine, in which practice is very closely linked with ongoing scientific research, social sciences-rooted fields, such as management, are not nearly as closely aligned with ongoing theoretical research. In fact, there is a well-known academic-practice gap (ironically, it gave rise to its own stream of academic research), as management practitioners tend to see academic researchers as unconcerned with practical problems and outright dismissive of practitioners' informational needs, instead choosing to focus on jargon-laden, overly mathematical, and overly theoretical inquiries. Unfortunately, leading proponents of evidence-based management (EBM) practice continue to look past those glaring limitations of theoretical business research, instead clinging to the unrealistic idea of academic business research being the main guiding light of applied managerial decision-making.

At issue here is not just a general lack of relevance, but also a possible lack of exposure: Unlike medical practitioners, who can be assumed to have gone through standardized professional training culminating in formal credentials and licenses to practice, business practitioners come from virtually all educational and applied backgrounds, with many never having taken any business courses, and thus not having been exposed to academic business research.

 For more details see Banasiewicz, A. (2019). Evidence-Based Decision-Making: How to Leverage Available Data and Avoid Cognitive Biases, Routledge: New York.

That is not surprising given that business management is an occupation, not a profession, as clearly evidenced by the fact that outside of a handful of domains, such as accounting, becoming a business manager or a business professional (a staff member, a consultant, etc.) does not require specific academic training or a license to practice. So, if management education is not necessary to practice management, why would largely academic theoretical management research be the primary source of insights and inspiration for managers? And especially now, given that the rampant digitization of commercial and interpersonal interactions generates vast quantities of data which can be used to glean timely and tailored decision-guiding insights . . . As noted earlier, data is the currency of the Digital Age, and more and more, data is the predominant source of managerial insights. In contrast to generalizations-minded theoretical research, which by its very nature is focused on the search for universal truths (rather than addressing the often nuanced and situational informational needs of decision-makers), exploration of readily available and typically detailed and relevant transactional and communication data can be expected to yield insights that are not only far more tailored to the informational needs at hand, but also considerably more timely than insights generated by the slow-moving academic research enterprise. So, in order for applied managerial decision-making to become truly evidence-based, that which is characterized as 'evidence' needs to be more closely aligned with the informational needs of its users.

What is Evidence?

The standard, i.e., dictionary-based conception of evidence posits that it is 'an outward sign' or 'something that furnishes proof'; using that framing as a starting point, decision-making related evidence can be framed as facts or other organized information presented to support or justify beliefs or inferences. Implied in that conceptualization is the requirement of objectivity, meaning that facts or information presented as evidence should be independently verifiable, or stated differently, should be expressions of measurable states or outcomes, rather than subjective perspectives, judgments, or evaluations. The requirement of objectivity is critical to assuring the persuasiveness of evidence, or its power to reduce decision-making uncertainty. Independently verifiable, external-to-self evidence also reduces the perception-warping influence of cognitive bias, and in doing so helps to create a frame of reference that is common to all involved in the decision-making process. In fact, basing decisions on objective evidence is just as beneficial even if the decision-making process involves only a single individual. Nowadays, there are numerous and diverse sources of decision inputs, ranging from entirely subjective past experiences and predominantly subjective expert judgment to objective research findings and results of analyses of available data. Such a heterogeneous mix, however, can result in a confusing informational mosaic, especially if different sources suggest different courses of action. Consequently, focusing on just objectively verifiable factual evidence can lead to greater informational clarity; sometimes, less is more . . .

However, less in terms of scope does not necessarily mean less in terms of volume. As discussed in Chapter 4, the rapidly expanding digital, data-generating infrastructure produces vast quantities of data, which suggests that even narrowing the scope of evidence to just transactional sources can still yield overwhelming volumes. Moreover, while informationally rich, data available to business and other organizations also tend to be incomplete, in the sense that not everything that is knowable is being captured (e.g., while UPC scanners record all electronic sales of a particular product, non-electronic transactions are not captured, thus UPC scanner-based data paint an incomplete picture of total sales of a product of interest), and messy (e.g., erroneous or incomplete transactions due to system or operator errors, etc.). It follows that imperfect data give rise to imperfect information, though that does not necessarily mean biased information. From a statistical point of view, so long as those imperfections are nonsystematic, meaning they do not bias informational takeaways, random data imperfections may impact the precision of estimates, but not the general pattern- and trend-related conclusions. Still, to yield insights that can be considered factual, in the sense of being independently verifiable, and unbiased, in the sense of not being shaped to align with a particular perspective, available data need to be analyzed in a transparent, replicable manner.
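
The distinction between nonsystematic and systematic imperfections can be illustrated with a small simulation; the 'true' sales figures, the noise level, and the 5% under-count below are invented solely to show how random error widens the spread of an estimate without shifting it, whereas systematic error shifts it.

import random
import statistics

random.seed(11)

# Hypothetical "true" daily sales (in units) for a product of interest.
true_sales = [random.gauss(200, 20) for _ in range(1_000)]

# Nonsystematic (random) error: recording noise that is as likely to overstate
# a value as to understate it.
random_error = [x + random.gauss(0, 15) for x in true_sales]

# Systematic error: for example, a scanner that consistently misses ~5% of units.
systematic_error = [x * 0.95 for x in true_sales]

print(f"true mean: {statistics.mean(true_sales):.1f} (sd {statistics.stdev(true_sales):.1f})")
print(f"random error: mean {statistics.mean(random_error):.1f} is nearly unchanged, "
      f"but sd grows to {statistics.stdev(random_error):.1f}")
print(f"systematic error: mean {statistics.mean(systematic_error):.1f} is biased downward")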

Transforming Data into Evidence

In general, the simpler the analysis, the more believable the outcomes. Complex, multivariate predictive model-generated conclusions, typically in the form of forward-looking predictions, are methodologically intricate and their informational outcomes are inescapably speculative, which erects natural acceptance barriers. By way of contrast, basic descriptive and confirmatory analyses that are at the core of data analytic literacy are methodologically straightforward and yield predominantly factual outcomes. And while predictive multivariate analyses (not discussed in this book) as well as basic descriptive and confirmatory analyses (discussed in earlier chapters) both meet the objectivity requirement of evidence-based management, the latter tend to be far more persuasive precisely because of their easy-to-follow derivation logic and matter-of-fact content. Moreover, individually simple exploratory or confirmatory data analytic outcomes can be linked together into a compelling, evidence-based story. With that in mind, the remainder of this chapter is focused on two distinct types of data storytelling that are used to support common types of business decisions, such as risk management and mitigation: baselining and benchmarking.

While similar in the sense that both are meant to offer an evidence-based decision foundation, there are somewhat subtle but nonetheless important differences between the two: Baselining is a tool of self-focused assessment, whereas benchmarking is a method of externally focused comparisons. General concept-wise, baselines are rooted in the idea that certain types of assessment are multifaceted – for instance, the performance of a
brand, a company, or individual team members typically manifests itself in multiple outcomes; for a brand, those typically include top-line measures of total revenue and bottom-line measures of profitability, along with key operating characteristics such as current customer retention and new customer acquisition. In that context, the role of a baseline is to offer a concise summary of key performance indicators (KPIs); the earlier discussed dashboards are commonly used templates.

Benchmarking, on the other hand, is used to answer a frequently asked follow-up question: How does the performance of the brand (or any other entity) of interest compare to similar brands? The importance of that question stems from the difficulty of qualitatively characterizing quantitative outcomes – for example, should a 10% profit be considered 'good'? How about 5% year-over-year revenue growth? It is intuitively obvious that answering such questions in a defensible manner requires the use of an objective point of reference, commonly referred to as a benchmark. Under most circumstances, evaluative benchmarks can be either adapted or estimated. In applied business settings, the former often take the form of the so-called 'rules of thumb',6 as exemplified by insurance claim incidence rates – for instance, on average, about 4% of auto insurance policies incur a claim. Rules of thumb are rarely formally published; instead, they tend to be informally disseminated through word-of-mouth. Estimated benchmarks, on the other hand, are usually custom-derived and tailored to the needs of individual companies, even specific decision scenarios – for example, large business organizations routinely make use of handpicked peer groups as the basis for qualifying important measures, such as performance outcomes or risk exposures. Moreover, though in principle benchmarks can be either static or dynamic, the former are far more common. And while no universally accepted benchmarking methodologies are readily available, it is widely accepted among practitioners that benchmarks should be relevant to the purpose at hand, impartial and bias-free, informationally valid, and easy to interpret. Next, two illustrative case studies, one focused on baselining and another focused on benchmarking, are outlined as examples of compelling storytelling with data using basic data analytic techniques.
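
Before turning to the case studies, a minimal sketch of the baseline-then-benchmark logic may be helpful; the brand KPIs, the peer companies, and their values below are entirely hypothetical and stand in for whatever measures and peer group a given organization would actually use.

# A minimal baselining-and-benchmarking sketch; the KPI values and the peer
# group below are entirely hypothetical.
brand_kpis = {"revenue_growth": 0.05, "profit_margin": 0.10, "customer_retention": 0.82}

peer_group = {
    "Peer A": {"revenue_growth": 0.03, "profit_margin": 0.08, "customer_retention": 0.79},
    "Peer B": {"revenue_growth": 0.07, "profit_margin": 0.12, "customer_retention": 0.85},
    "Peer C": {"revenue_growth": 0.04, "profit_margin": 0.09, "customer_retention": 0.80},
}

# Baseline: a concise, self-focused summary of the brand's own KPIs.
print("Baseline:", {kpi: f"{value:.0%}" for kpi, value in brand_kpis.items()})

# Benchmark: qualify each KPI against the peer-group median, turning a raw
# number (e.g., a 10% profit margin) into an evaluative statement.
for kpi, value in brand_kpis.items():
    peer_values = sorted(peers[kpi] for peers in peer_group.values())
    peer_median = peer_values[len(peer_values) // 2]
    verdict = "above" if value > peer_median else "at or below"
    print(f"{kpi}: {value:.0%} is {verdict} the peer median of {peer_median:.0%}")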

Mini Case Study 1: Baselining the Risk of Securities Litigation

Securities class actions (SCA) are often characterized as low frequency, high impact events.

 It is worth noting that in a broader context that includes federal, state, and local governments as well as various non-governmental agencies and entities, adapted benchmarks can include formally developed and disseminated measures; the focus of this overview is limited to evidence-based management related benchmarking applications.

On average, a company traded on a public U.S. exchange (to be subject to U.S. securities laws, a company does not need to be domiciled in the United States, it just needs to have securities traded on a U.S. stock exchange) faces a roughly 4% chance of incurring securities litigation, and the median settlement cost is about $8,750,000; factoring in defense costs (given the nuanced and complex nature of SCA cases, even companies with sizable in-house legal staffs tend to use specialized outside law firms), estimated to average about 40% of settlement costs, the total median SCA cost is about $12,250,000. Those are just the economic costs – securities litigation also carries substantial though hard-to-quantify reputational costs, as being accused of financial fraud often brings with it waves of adverse publicity. In more tangible terms, in the 25-year period from 1996 to 2021,7 there have been a total of 6,118 SCAs filed in federal courts; to-date (some of the more recent cases are still ongoing), 2,352 settlements arose out of those lawsuits, adding up to a grand total of more than $127.5 billion. Those 25-year aggregates hide considerable cross-time and cross-industry variability, which is explored in more detail below, as a step toward uncovering potential future shareholder litigation trajectories.
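
The cost figures quoted above combine through simple arithmetic, sketched below; the final 'expected annual cost' line is an added illustration of how frequency and severity can be folded together, not a figure taken from the text.

# Back-of-the-envelope arithmetic using the figures quoted above.
median_settlement = 8_750_000
defense_costs = 0.40 * median_settlement          # defense costs average ~40% of settlement
total_median_cost = median_settlement + defense_costs
print(f"total median SCA cost: ${total_median_cost:,.0f}")    # $12,250,000

# Folding in the ~4% annual frequency gives a rough expected annual cost for a
# single publicly traded company (an added illustration, not a figure from the text).
print(f"rough expected annual cost: ${0.04 * total_median_cost:,.0f}")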

Aggregate Frequency and Severity

The preceding summary suggests that the two core SCA tracking metrics are frequency, or 'how often', and severity, or 'how much'. Starting with the former, Figure 9.2 shows the year-by-year (using the lawsuit filing date as the assignment basis) distribution of the 6,118 SCAs that have been recorded in the twenty-five-year span between 1996 and 2021; the dark-shaded bars highlighting years 1996, 2002, 2005, 2007, 2010, 2014, and 2017 relate to the key post-PSLRA U.S. legislative acts and the U.S. Supreme Court rulings, as a means of visually exploring potential associations. One of the most striking informational elements of the summary of annual SCA frequencies is the exceptionally high 2001 count (498 cases). Buoyed by a sudden influx of the so-called 'IPO laddering' cases which followed on the heels of the bursting of the dot-com bubble of the early Internet era (between 1995 and 2000), a period characterized by excessive speculation in Internet-related companies, year 2001 set the all-time record for the number of securities fraud cases. Setting that single anomalous year aside, the second key takeaway is a distinct upward-sloping ebb and flow pattern, suggesting steady cross-time growth in average annual SCA frequency. However, there appears to be no visually obvious association between the incidence of securities fraud litigation and the distinct SCA-related legislative and legal developments (the timing of which is highlighted by the dark-shaded bars in Figure 9.2).

 1996 is commonly used as the beginning of the 'modern' era of securities litigation because it was the year the Private Securities Litigation Reform Act (PSLRA) of 1995 went into effect; the Act's goal was to stem frivolous or unwarranted securities lawsuits, and its provisions fundamentally reset the key aspects of securities litigation.

Figure 9.2: Annual SCA Frequency: 1996–2021. [Bar chart of annual filing counts; median annual count = 213.]
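
A frequency baseline of the kind summarized in Figure 9.2 reduces to a simple count-by-year; the sketch below uses randomly generated stand-in filing dates rather than the actual SCA records, so its counts are illustrative only.

from collections import Counter
from datetime import date
import random
import statistics

# Randomly generated stand-in filing dates; the real analysis would read the
# actual SCA records from a securities litigation database.
random.seed(3)
filings = [date(random.randint(1996, 2021), random.randint(1, 12), random.randint(1, 28))
           for _ in range(6_118)]

# Frequency ("how often"): count filings per year, using the filing date as the
# assignment basis, then summarize with the median annual count.
annual_counts = Counter(filing.year for filing in filings)
print(sorted(annual_counts.items())[:3], "...")
print("median annual count:", statistics.median(annual_counts.values()))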

A potential alternative explanation of the gradual uptick in the annual incidence of shareholder litigation might be the gradual broadening of the scope of disclosure materiality. More specifically, traditionally rooted almost exclusively in financial performance measures, the definition of what constitutes 'material disclosure' is now beginning to encompass non-financial environmental, social justice, and governance (jointly known as ESG) considerations, resulting in public companies having to contend with a broader array of potential securities litigation triggers, ultimately manifesting itself in a higher average frequency of SCA litigation.

Figure 9.3: Annual Median SCA Settlements. [Chart of annual median settlement amounts; overall median = $8,750,000.]

A similar upward-trending ebb and flow pattern characterizes the second key facet of shareholder litigation – severity, graphically summarized in Figure 9.3. There are two key takeaways here: First, there again appears to be no obvious impact of the individual legal developments, highlighted by the dark-shaded bars in Figure 9.3, on the median settlement value. Second, there is a pronounced upward drift in annual median settlement amounts; however, that conclusion warrants closer examination in view of the potentially moderating impact of company size, which is rooted in the positive correlation between the magnitude of SCA settlements and the size of settling companies, as measured by market capitalization. More specifically, for all 1996–2021 SCA settlements (n = 2,352) for which a market capitalization value was available around the time of the settlement announcement (n = 1,373), the value of the Pearson correlation between Settlement Amount and Market Capitalization was .21 (p
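
The severity side of the baseline rests on two equally simple computations – a median and a Pearson correlation – sketched below with randomly generated settlement and market capitalization values standing in for the n = 1,373 matched records, so the printed correlation will not equal the .21 reported above; statistics.correlation defaults to Pearson's r and requires Python 3.10 or later.

import random
import statistics

random.seed(5)

# Randomly generated settlement amounts and market capitalizations (in $ millions)
# standing in for the matched records described above; the resulting correlation is
# therefore illustrative and will not equal the reported .21.
market_cap = [random.lognormvariate(7, 1.2) for _ in range(1_373)]
settlement = [max(1.0, 8.75 + 0.002 * mc + random.gauss(0, 20)) for mc in market_cap]

# Severity ("how much"): the median settlement, plus the Pearson correlation between
# settlement size and company size (statistics.correlation defaults to Pearson's r;
# Python 3.10+).
print(f"median settlement: ${statistics.median(settlement):,.2f} million")
print(f"Pearson r (settlement vs. market capitalization): "
      f"{statistics.correlation(settlement, market_cap):.2f}")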