Analyzing Media Messages: Using Quantitative Content Analysis in Research [5 ed.] 1032264691, 9781032264691


Table of contents:
Cover
Half Title
Title
Copyright
Dedication
Contents
Preface
1 Introduction
2 Defining Content Analysis as a Social Science Tool
3 Designing a Content Analysis
4 Computers and Content Analysis
5 Measurement
6 Sampling
7 Reliability
8 Validity
9 Data Analysis
Appendix A: Sample Protocol
Appendix B: Reporting Standards for Content Analysis Articles
References
Index

Citation preview

Analyzing Media Messages

The fifth edition of this comprehensive and engaging text guides readers through the essential tools and skills necessary to conduct quantitative content analysis research. Readers will find a clear definition of quantitative content analysis and step-by-step instructions on designing a content analysis study, along with examples of content analysis studies and journal articles. This edition has been updated with the latest methods in sampling in the digital age, computerized content analysis, and the uses of social media in content analysis research. It maintains the concise, accessible approach of previous editions while including refreshed examples and discussions throughout. This is an essential text for content analysis courses in communication and media studies programs at all levels, as well as a useful supplementary text in more general research methods courses.

Daniel Riffe is Professor Emeritus in the School of Journalism and Media at the University of North Carolina at Chapel Hill, USA, and former editor of Journalism and Mass Communication Quarterly. His research examines mass communication and environmental risk, political communication and public opinion, international news coverage, and research methodology. Before joining UNC-Chapel Hill, he was Presidential Research Scholar in the Social and Behavioral Sciences at Ohio University.

Stephen Lacy is Professor Emeritus at Michigan State University, USA, where he studied content analysis and media managerial economics for more than 30 years in the School of Journalism and Department of Communication. He has co-written or co-edited five other books and served as co-editor of the Journal of Media Economics.

Brendan R. Watson is an independent scholar. His research examines the role of public affairs news/information in helping communities cope with social upheaval due to the increasing urbanization, globalization, and pluralism of postindustrial society. He also studies research methodology. He has taught graduate seminars in content analysis at Michigan State University and the University of Minnesota, where he was previously on the faculty. He holds a Ph.D. in Mass Communication from the University of North Carolina at Chapel Hill. He currently works for Level, a performance marketing agency.

Jennette Lovejoy is Professor and Chair of the Department of Communication and Media at the University of Portland, USA. She teaches critical media, journalism, and research methods. Her interdisciplinary research spans media, health, and content analysis methods. She has edited multiple books and published in leading communication and medical journals.

Analyzing Media Messages Using Quantitative Content Analysis in Research Fifth Edition

Daniel Riffe, Stephen Lacy, Brendan R. Watson, and Jennette Lovejoy

Designed cover image: BlackJack3D/© Getty Images Fifth edition published 2024 by Routledge 605 Third Avenue, New York, NY 10158 and by Routledge 4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN Routledge is an imprint of the Taylor & Francis Group, an informa business © 2024 Daniel Riffe, Stephen Lacy, Brendan R. Watson, and Jennette Lovejoy The right of Daniel Riffe, Stephen Lacy, Brendan R. Watson, and Jennette Lovejoy to be identified as authors of this work has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this book may be reprinted or reproduced or utilized in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers. Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe. First edition published by Routledge 1998 Fourth edition published by Routledge 2019 Library of Congress Cataloging-in-Publication Data Names: Riffe, Daniel, author. | Lacy, Stephen, 1948– author. | Watson, Brendan R., author. | Lovejoy, Jennette, author. Title: Analyzing media messages : using quantitative content analysis in research / Daniel Riffe, Stephen Lacy, Brendan R. Watson and Jennette Lovejoy. Description: Fifth edition. | New York, NY : Routledge, 2024. | First edition published by Routledge 1998. Fourth edition published by Routledge 2019. | Includes bibliographical references and index. Identifiers: LCCN 2023028911 (print) | LCCN 2023028912 (ebook) | ISBN 9781032264691 (hardback) | ISBN 9781032264677 (paperback) | ISBN 9781003288428 (ebook) Subjects: LCSH: Content analysis (Communication) | Mass media– Research–Methodology. | Mass media–Statistical methods. Classification: LCC P93 .R54 2024 (print) | LCC P93 (ebook) | DDC 302.2301/4–dc23/eng/20231016 LC record available at https://lccn.loc.gov/2023028911 LC ebook record available at https://lccn.loc.gov/2023028912 ISBN: 978-1-032-26469-1 (hbk) ISBN: 978-1-032-26467-7 (pbk) ISBN: 978-1-003-28842-8 (ebk) DOI: 10.4324/9781003288428 Typeset in Times New Roman by Apex CoVantage, LLC

Daniel Riffe: For Florence, Ted, Eliza, Bridget, Brynne, Hank, Annie, James, and Elizabeth
Stephen Lacy: For I. P. Byrom, N. P. Davis, and A. G. Smith
Brendan R. Watson: For Joan and Maroun
Jennette Lovejoy: For Georgia, Sawyer, and Roan

Contents

Preface viii
1 Introduction 1
2 Defining Content Analysis as a Social Science Tool 21
3 Designing a Content Analysis 36
4 Computers and Content Analysis 57
5 Measurement 71
6 Sampling 92
7 Reliability 118
8 Validity 152
9 Data Analysis 168
Appendix A: Sample Protocol 192
Appendix B: Reporting Standards for Content Analysis Articles 197
References 199
Index 220

Preface

The purpose of the fifth edition of this book is to help facilitate the development of a science of communication, in particular as it relates to different forms of mediated communication. A communication science is at the heart of all our social sciences because communication increasingly defines what we do, how we do it, and even who we are individually, socially, and culturally. In fact, never before in human history has mediated communication been so central, pervasive, and important to human civilization. A good communication science is necessary if humanity is to fully understand how communication affects us. Absent good understandings from such a communication science, we will always be at the mercy of unintended, unforeseen consequences. But absolutely necessary to the development of a communication science is a means of logically assessing communication content. Broadly speaking, communication content varies based on a large set of factors that produce and deliver that communication. And, in turn, the variations in communication content affect a large set of individual, group, institutional, and cultural factors. In other words, understanding communication content is necessary and central to any communication science in which the goal is to predict, explain, and potentially control phenomena (Reynolds, 1971). More specifically, we believe the systematic and logical assessment of communication content requires quantitative content analysis, the topic of this book. Only this information-gathering technique enables us to illuminate patterns in communication content reliably and validly. And only through the reliable and valid illumination of such patterns can we hope to illuminate content causes or predict content effects. We bring to this effort our experiences conducting or supervising hundreds of quantitative content analyses in our careers as researchers, examining content ranging from White House coverage, to portrayal of women and minorities in advertising, to the sources given voice in local government news. The content analyses have included theses and dissertations, class projects, and funded studies, and have involved content from sources as varied as newspapers, broadcast media, Twitter, and websites. Some projects have been descriptive, whereas

others have tested hypotheses or sought answers to specific research questions. They have been framed in theory about processes that affect content and about the effects of content. If conducting or supervising those studies has taught us anything, it is that some problems or issues are common to virtually all quantitative content analyses. Designing a study raises questions about accessing content, sampling, measurement, reliability, and data analysis—fundamental questions that arise whether the researcher is a student conducting her first content analysis or a veteran planning her 20th, whether the content being studied is words or images, and whether it comes from social networking sites or a legacy medium. In preparing this book for the fifth edition, we re-engage these recurring questions. Our goal is to make content analysis accessible, not arcane, and to produce a comprehensive guide that is also comprehensible. We hope to accomplish the latter through clear, concrete language and by providing numerous examples—of recent and “classic” studies—to illustrate problems and solutions. We see the book as a primary text for courses in content analysis, a supplemental text for research methods courses, and a useful reference for fellow researchers in mass communication fields, political science, and other social and behavioral sciences.

This fifth edition varies from the previous four because a new coauthor, Jennette Lovejoy, has joined the team, while original author Frederick Fico has stepped down. In addition to Fico, we owe thanks to many for making this book possible: teachers who taught us content analysis (Donald L. Shaw, Eugene F. Shaw, Wayne Danielson, and James Tankard); colleagues who provided suggestions on improving the book; and our students who taught us the most about teaching content analysis. Jennette and Brendan learned content analysis by studying previous editions of this very book and doing content analysis with their mentors, with whom they are now co-authors. Finally, our deepest appreciation goes to our families, who often wonder whether we do anything but content analysis.

Daniel Riffe
Stephen Lacy
Brendan R. Watson
Jennette Lovejoy

1 Introduction

Of all the social science methods (e.g., experiments, focus groups, surveys, etc.) available for use in researching the broad domain of communication, content analysis is intuitively the most central: “Because all human verbal and mediated exchanges involve messages (content), content analysis is particularly important for the study of communication” (Lacy, Watson, Riffe, & Lovejoy, 2015, p. 807). One of the authors recalls early impressions of content analysis, based on published studies in graduate school half a century ago. Content analysts worked in libraries looking at printed (or microfilmed) newspaper pages to measure space given to a particular topic or person. While such studies, describing messages produced by professional communicators in covering particular topics, have merit and import for communication science, things have changed obviously since a half-century ago. The “broad domain” for contemporary study requires reconceptualization of what is a medium of communication (e.g., ranging from legacy public news media in print, visual, or digital forms, to neighborhood, community, or corporate websites, to social media apps targeting broad or narrow audiences); who is a communicator (ranging from traditional trained professionals, to individuals using social media to influence followers or represent their real or “ideal” selves, to interest groups or communities that form via social media to address common interests or goals); and what constitutes a message (ranging from a Fortune 500 company’s corporate mission statement, to a newspaper editorial, to legacy media news posted on Facebook, to a candidate’s speech, a hate group’s online manifesto, an Instagram “selfie,” or a posted response to any of these). Consider, then, the diversity of these quantitative content analyses. With early COVID-19 infection rates three times greater among Black Americans than among Whites, and Black Americans twice as likely to die from the virus, Biswas, Sipes, and Brost (2021) compared Spring 2020 general media and Black media coverage of Black-related COVID-19 issues. Contrary to the authors’ hypothesis, Black media did not include more “social responsibility”

frames (e.g., blaming systemic racism and inequality for unequal healthcare). However, a significantly greater percentage of Black media items (64%) included “action or solution frames” (e.g., safety guidance, testing options, etc.). When analyses focused specifically on Black-majority cities, general media were more likely to use social responsibility and consequence (e.g., health, political, economic) frames, while Black media were more likely to use the action/solution and individual responsibility (e.g., action by the President or a specific local official) frames.

After several popular rappers publicly disclosed battles with anxiety, depression, and other mental health conditions, scholars (Kresovich, Collins, Riffe, & Carpentier, 2021) explored mental health themes in lyrics of Billboard chart-topping rap songs, noting a significant two-decade increase. A plurality (28%) of songs referenced anxiety, 22% mentioned depression, and 8% alluded to suicide. “Contributing stressors” included the artists’ social environment and unhappy love life. Kresovich et al. pondered what effects these “increasingly prevalent messages may have in shaping mental health discourse and behavioral intentions” (p. 286).

Through the lens of “public diplomacy” (Golan, 2013)—in which “governments communicate and build relationships with foreign publics in pursuit of political objectives” (Fitzpatrick, Fullerton, & Kendrick, 2013, p. 1)—Sobel, Riffe, and Hester (2016) collected a year’s Twitter feeds from four embassies on the US State Department’s “watch list” (designated as “dangerous or unstable”: Afghanistan, Libya, Nigeria, and Syria) and four not on the list. A simple random sample of 25% of each embassy’s tweets was drawn, for a total of 2,625. Among “watch list” feeds, only the embassy in Syria commented on the ongoing civil war; Embassy Kabul was silent on the conflict in Afghanistan. Sobel et al. concluded there was little consistency among embassies in “formally furthering the State Department mission” (p. 102).

Video-game researchers (Lynch, Tompkins, van Driel, & Fritz, 2016) looked at female character “sexualization” across three decades—a period encompassing the 1996 Tomb Raider game that introduced Lara Croft, a character described as highly sexualized yet strong, bold, educated, and capable (p. 569). Sexualization increased from 1992 to 2006, but declined from 2007 to 2014. Lynch et al. reported a persistent relationship between sexualization and capability, a fact that may help “empower female gamers” (p. 578), though female characters were more often in secondary roles (p. 580).

Bastien (2018) compared newspaper coverage with verbatim transcripts from televised debates in five Canadian federal campaigns (1968–2008). Debate reports became increasingly “analytical and judgmental” and less “factual”: the presence of journalists’ opinions in paragraphs increased from 14% to 24% (p. 9). However, agendas of politicians and journalists were correlated: “the longer an issue is debated by the leaders, the more it is reported” (p. 1757).

Lee and Riffe (2017) explored how corporations and an industry monitoring group focus media attention on corporate social responsibility (CSR) activities (e.g., efforts to improve the environment, community, and employees). Data from 7,672 press releases from 223 US corporations, 1,064 New York Times and Wall Street Journal articles, and ratings of corporations by a CSR monitoring group showed stronger relationships between ratings and news coverage than between press releases and coverage. Companies may need to heed such monitoring groups and reconsider what they provide in press releases. Indeed, Ki and Hon (2006) earlier explored Fortune 500 companies’ website promotion of CSR activities involving education, the community, and the environment, finding that few sites communicated effectively about CSR.

While many of these content analyses focused on the presence or representation of individuals, groups, and ideas in messages, other scholars have analyzed social media content (numbers of readers, sharing/linking, networking, etc.) to illustrate how communities of ideas are created by communicating via social media, and how “dominant media narratives” can be reshaped by that discourse. Indeed, research shows some editors and journalists attend to social media and alter content based on social media responses (Tandoc & Vos, 2016). Qualitative research has examined “right-wing populism” (RWP) in social media platform content (e.g., Engesser, Ernst, Esser, & Büchel, 2017), while other researchers have explored the function of online RWP for followers. Heiss and Matthes (2020) captured 13,358 Facebook posts from political parties, candidates, and their followers in Austria and Germany, seeking anti-immigrant and anti-elite sentiments. They created “dictionaries” of anti-immigrant and anti-elite terms to use with a computerized coding process (see Chapter 4, this volume, for a discussion of automated approaches; a simplified sketch of dictionary-based coding appears below). Confirming that RWP party sites had more anti-immigrant and anti-elite references than non-RWP sites, and that the references triggered angry posts from followers, Heiss and Matthes surveyed followers and non-followers. Data indicated that respondents’ anti-immigrant beliefs drove them to follow RWP sites that promoted anti-immigrant policies and provided a “community” of followers sharing such beliefs, resulting in a reciprocal “nativist spiral”: citizens with strong anti-immigrant attitudes exposed themselves to RWP content that reinforced their anti-immigrant attitudes.

Harlow and Kilgo (2021) showed how social media disrupt the “dominant media narrative” and “the hierarchy of social struggle” (Kilgo & Harlow, 2019) by amplifying protest coverage differently than mainstream news media do. Harlow and Kilgo sampled protest-related stories from national, metropolitan, and local newspapers, and collected Facebook “media engagement data” from its Application Programming Interface (API). Facebook users forwarded and redistributed news selectively, amplifying particular narratives and acting as “powerful gatewatchers” to mediate the news organization’s role in legitimizing or delegitimizing movements (p. 680).
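The dictionary-based coding Heiss and Matthes (2020) describe can be illustrated with a minimal, hypothetical sketch in Python. The dictionaries and posts below are invented for demonstration and are far simpler than a validated research dictionary, which would also require testing against human coding.

```python
# Minimal sketch of dictionary-based coding: score each post 1 or 0 for whether
# it contains at least one term from each (illustrative, not validated) dictionary.
import re

dictionaries = {
    "anti_elite": ["establishment", "corrupt politicians", "mainstream media"],
    "anti_immigrant": ["mass migration", "open borders"],
}

posts = [
    "The establishment ignores ordinary people.",
    "Great weather at the rally today!",
    "Open borders are a failure of corrupt politicians.",
]

def code_post(text, terms):
    """Return 1 if any dictionary term appears in the text, else 0."""
    return int(any(re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
                   for term in terms))

for label, terms in dictionaries.items():
    hits = [code_post(post, terms) for post in posts]
    print(label, hits, "total:", sum(hits))
```

In practice, the hard work lies in building and disambiguating the dictionaries and validating the automated codes against trained human coders, as Chapter 4 discusses.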

Although these nine studies differ in purpose, focus, and techniques, they reflect a range of applications of quantitative content analysis—a research method defined briefly as the systematic assignment of communication content to categories according to rules specified in a coding protocol, and the analysis of relationships involving those categories using statistical methods. Usually, content analysis involves drawing representative samples of content, training human coders to use a protocol to apply category rules to measure or reflect differences in content, and measuring the reliability (agreement or stability over time) of coders in applying the rules. Resulting data are usually analyzed to describe patterns or characteristics, or to identify relationships among the content qualities examined. If the categories and rules are sound and reliably applied, the chances are that the study results will be valid (e.g., that the observed patterns are meaningful). Though most of these procedures are well established, contemporary scholars are exploring new uses of computers to complement human coding and deal with large amounts of text, as discussed below. This skeletal definition deliberately lacks mention of the researcher’s specific goal (e.g., to test hypotheses about rap song lyrics), specification of types of communication to be examined (e.g., corporate websites, Instagram “selfies,” or protest news), content qualities explored (e.g., presence of a reporter’s opinion, a reference to mental health, or an anti-immigrant term), or types of inference that will be drawn from the data (e.g., that social media influence dominant narratives or that reporter and candidate agendas match). Such specification of terms is essential to an effective study design. Moreover, the definition does not prescribe types of data analysis researchers might pursue. Some analyses are univariate (Vogt, 2005, p. 333), focusing on the distribution of cases on a single variable (e.g., the age distribution of models as child, teen, young-adult, adult, and elderly in television advertising), without relating it to any others. Other analyses are bivariate (p. 28), relating two variables (e.g., age distribution of models—the dependent variable—and type of programming airing the ads—the independent variable—to determine if elderly models, for example, are more often used in news than sports programming). Multivariate analyses (p. 201) involve relationships among three or more variables, often designating two or more independent variables (e.g., age distribution of models by programming type and by time of day—morning, afternoon, early evening, prime time). In later chapters, we emphasize that the complexity of data analysis is limited only by the research objective, how variables are measured (Chapter 5), and how the units of analysis are drawn (i.e., probability sampling; Chapters 6 and 9). However, before a more comprehensive definition of this versatile method is developed in Chapter 2, we offer an overview of content analysis in mass communication research and examples of its use in other fields and disciplines.
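To make the univariate and bivariate analyses just described concrete, here is a minimal sketch in Python with invented data; the variable names echo the hypothetical television-advertising example above and are not drawn from any study cited in this chapter.

```python
# Hypothetical example: analyzing coded content data.
# Each row is one coded television ad; values come from a coding protocol.
import pandas as pd
from scipy.stats import chi2_contingency

ads = pd.DataFrame({
    "model_age": ["elderly", "young-adult", "adult", "elderly", "teen",
                  "adult", "young-adult", "elderly", "adult", "teen"],
    "program_type": ["news", "sports", "news", "news", "sports",
                     "sports", "news", "news", "sports", "sports"],
})

# Univariate: distribution of cases on a single coded variable.
print(ads["model_age"].value_counts())

# Bivariate: cross-tabulate model age (dependent) by program type (independent).
table = pd.crosstab(ads["model_age"], ads["program_type"])
print(table)

# Chi-square test of independence (appropriate only with adequate cell counts).
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```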

Communication Research

Whereas some scholars approach communication messages from perspectives associated with the humanities (e.g., as literature or art), others employ a social science approach based in empirical observation and measurement. Typically, the latter approach means researchers identify questions or problems (e.g., derived from the scholarly literature or professional practices), identify concepts or factors that “in theory” may be involved, and propose possible explanations or relationships among concepts. Implausible explanations are discarded, and viable ones tested empirically, with theoretical concepts now measured in concrete observable terms. If members of an ethnic minority, for example, believe that they are underrepresented in news content (in terms of their census numbers), a researcher may propose that it is because minorities are underrepresented among occupational groups that serve more often as news sources. This proposition, suggesting different concepts to be “operationalized” into measurement procedures, can be tested empirically. Similarly, if researchers want to address how social media fostered concerted activity during the 2020 Black Lives Matter protests, operational procedures can be developed and used to collect data on social media content, which might be compared with the protest activities of individuals who maintain the social media accounts. Put another way, explanations for problems or questions for such researchers are sought and derived through direct and objective observation and measurement rather than through one’s reasoning, intuition, faith, ideology, or conviction. In short, these researchers employ what is traditionally referred to as the “scientific method.” The centuries-old distinction between idealism (i.e., the mind and its ideas are “the ultimate source and criteria of knowledge”) and empiricism (i.e., observation and experimentation yield knowledge) continues to hold the attention of those interested in epistemology or the study of knowledge (Vogt, 2005, pp. 105–106, 149). Content analysis assumes an empirical approach—a point made more emphatically in later chapters.

Content Analysis and Mass Communication Effects Research

Scholarly or scientific study of mass communication is fairly new, with roots in early 20th-century work by political scientists concerned with the effects of propaganda and other persuasive messages (McLeod, Kosicki, & McLeod, 2009). In addition to communication scholars, researchers from disciplines such as sociology, economics, and psychology have focused on communication processes and effects, contributing their own theoretical perspectives and research methods.

Powerful Effects?

One particularly durable communication research perspective reflects a behavioral science orientation that grew from early 20th-century theories that animal and human behaviors could be seen as stimulus–response complexes. Some communication researchers have viewed communication messages and their assumed effects from this same perspective. Researchers interested in these effects typically adopted experimentation for testing hypotheses. Participants were assigned to different groups; some were exposed to a stimulus within a treatment (a message), whereas others were not (the control participants). Under tightly controlled conditions, subsequent differences in what was measured (e.g., attitudes about an issue or behavioral intention) could be attributed to exposure/non-exposure differences. Meanwhile, for most of the first half of the 20th century, there was a widespread assumption—among scientists and the public—that stimuli such as mass persuasive messages could elicit powerful responses, even outside the experimental laboratory. Why? Propaganda, as seen during the World Wars, was new and frightening (Lasswell, 1927; Shils & Janowitz, 1948). A 10-volume summary of 13 Payne Fund Studies conducted from 1929 to 1932 suggested movies’ power to affect children’s attitudes, emotions, moral standards, and perceptions of daily conduct (Lowery & DeFleur, 1995, p. 51). Anecdotal evidence of the impact of Communist or Nazi oratory in Europe, or the radio demagoguery of Father Charles E. Coughlin in America (Stegner, 1949), heightened concern over mass messages and collective behavior. Media were able to leapfrog official national boundaries and were believed capable of undermining national goals (Altschull, 1995). Broadcast media demonstrated a capacity for captivating, mesmerizing, and holding people in rapt attention, and inciting collective panic (Cantril, Gaudet, & Hertzog, 1940). With the rise of commercial advertising and public relations agencies, persuasive campaigns used messages crafted to make people do what a communicator wanted (McLeod et al., 2009). These assumptions about powerful effects contributed to early models or theories that used metaphors such as hypodermic needle or bullet. In the language of the latter, all one had to do was shoot a persuasive message (a bullet) at the helpless and homogeneous mass audience, and the desired effects would occur. Experimental studies of messages and their effects were interpreted as supporting these assumptions. Of course, the assumption that audience members were uniformly helpless and passive was a major one. Methodologists warned that the artificially controlled and contrived conditions in laboratory settings meant experimental attitude-change findings lacked real-world generalizability (Hovland, 1959). Others suggested that scientists’ emphasis on how best to do things to the

audience was inappropriate; Bauer (1964, p. 322) questioned the “moral asymmetry” of such a view of the public. Nonetheless, content analysis was useful within the powerful effects perspective because of the implicit causal role for communication content. It was important to study because it was believed to have an effect (Krippendorff, 2004a). Scholars scrutinized content for variables that could affect people. One might catalogue appeals used in propaganda, another might describe status differences among sources in persuasive messages, and a third might analyze whether antisocial behavior was sanctioned or ignored in television programs.

Limited Effects?

However, assumptions that powerful effects were direct and uniform were eventually challenged as simplistic (Severin & Tankard, 2000). Experimental findings had, in some cases, suggested that messages might change subjects’ knowledge but not the targeted attitudes or behaviors. Researchers conducting public opinion surveys brought field observations that ran counter to cause–effect relations found in laboratory settings. Examination of how people are exposed to messages in the real world and mixed results on the effectiveness of persuasive message “bullets” suggested that a limited effects perspective might be worth exploring (Chaffee & Hochheimer, 1985). Non-laboratory audiences had only an opportunity for exposure to particular content; they were not forced to attend like experimental participants. Under “natural” conditions, audiences (who, surveys showed, were not uniformly helpless or passive, nor, for that matter, very uniform in general) used media and messages for their own individual purposes, chose what parts of messages—if any—to attend or ignore, and rejected much that was inconsistent with their attitudes, beliefs, and values (Lazarsfeld, Berelson, & Gaudet, 1944). A decision to accept, adopt, or learn a message was a function of existing psychological and social characteristics and not necessarily mere exposure to the manipulated, artificial credibility of a source trying to persuade. Social affiliations such as family and community involvement were important predictors of people’s attitudes and behaviors, and networks of personal influence were key in their decisions (Carey, 1996).

Contingency Effects?

Research during the second half of the 20th century thus suggested that the effects—powerful or limited—of mass media are contingent on a variety of factors and conditions. This contingency effects approach allowed theorists to reconcile conflicting conclusions of the powerful and limited effects approaches.

Rather than being the result of any single cause (e.g., the message), communication effects reflected a variety of contingent conditions (e.g., whether the message is attended to alone or as part of a group and what motivates one to attend). Put another way, the effect of a particular message may be moderated (Vogt, 2005, pp. 103, 195), or modified by attributes of individuals (e.g., horror movies are more frightening to young children; age is a moderator because it interacts with exposure). Exposure may be mediated through an intervening variable (p. 190): for example, exposure to footage of police using force during an arrest may have a different effect on a viewer whose father works in law enforcement than on a viewer whose father does not. The father’s occupation intervenes and mediates the effect. However, despite increasing interest in contingent conditions, in what people do with media messages, and in how—or if—they learn from them, content analysis remained an important means of categorizing content. Messages were now analyzed in terms of differences in psychological or social gratifications consumers might seek (e.g., escape from boredom, being “connected,” or having something to talk about), cognitive images they develop (e.g., views of gender roles or of the acceptability of antisocial acts), and what they deem important on the news media agenda (e.g., what issues in a political campaign were worth considering and what attributes of issues were critical). These studies of cognitive (not attitudinal) effects and people’s social and psychological uses and gratifications of media and content reflected a view of the audience far different from the “morally asymmetrical” view criticized by Bauer (1964, p. 322). These triggered additional studies aimed at measuring content variables associated with those uses and effects. For example, content analysts have categorized entertainment content to answer questions about how ethnic and gender stereotypes are learned (Mastro, 2009; Smith & Granados, 2009). They have looked at content ranging from daytime soap operas to reality programs because of guiding assumptions about psychological and social gratifications people achieve by viewing those shows (Rubin, 2009). They have examined victim gender in “slasher” horror movies because of concern that such violence is desensitizing (Sapolsky, Molitor, & Luque, 2003; Sparks, Sparks, & Sparks, 2009). And content analysis has shown how different communicators “frame” the same events, because scholars argue that frames shape interpretations (Biswas et al., 2021; Reese, Gandy, & Grant, 2001). According to Tankard (2001, pp. 100–101), “A frame is a central organizing idea for news content that supplies a context and suggests what the issue is through the use of selection, emphasis, exclusion, and elaboration.” Moreover, as public and personal social media platforms have enabled virtually anyone to communicate publicly to audiences of unknown size, content analysis remains key to the study of those communicators and content. These

messages are not the work of traditional professional communicators, but they reach large audiences nonetheless, and scholars ponder the motives and gratifications of those consuming them and effects on their knowledge and beliefs. Content analysis remains important for researchers exploring how individual-level cognitive processes and effects relate to message characteristics (Shrum, 2009; Oliver & Krakowiak, 2009). For example, scholars have argued that important differences between one message’s effects and another’s may be due less to the communicator’s or audience member’s intent (e.g., to inform or be informed) than to different cognitive or other processes (e.g., transportation and enjoyment, entertainment, arousal, mood management, social isolation, and so on) triggered by content features or structure (Bryant, Roskos-Ewoldsen, & Cantor, 2003; Green, Brock, & Kaufman, 2004; Oliver & Krakowiak, 2009; Vorderer & Hartmann, 2009). These additional layers of complexity, and compelling questions of “what causes what” and whether relationships are unidirectional or even reciprocal (two variables mutually influence each other), point to the need—in all of social science, not just communication—for more sophisticated research designs that may incorporate multiple methods and data forms. Multi-method studies in communication might couple content analysis with surveys or experiments, while multi-form designs might use official transcripts as a baseline for comparison with mediated reports, or mainstream media coverage to compare with the responses and reinterpretations of online “gatewatchers” (e.g., Harlow & Kilgo, 2021), to name just a few examples.

Content Analysis and the Context of Production

Thus far, the discussion has implicitly viewed communication content as an antecedent condition, presenting possible consequences of exposure ranging from attitude change to the different gratifications people obtain from media or cognitive images they learn. However, content—whatever the medium that conveys it—is itself the consequence of a variety of other antecedent conditions or processes that may have led to or shaped its construction. A news site’s or aggregator’s content, for example, is a consequence of the organization’s selection from an array of possible stories, graphics, interactive features or affordances, and other content. That content may be a consequence of editors’ application of what has traditionally been called “news judgment,” based on numerous factors that visitors to the site need or want. The content is also shaped by other constraints, such as the kinds of motion graphics or interactivity available, how often material is updated, and so on. The content a researcher examines thus reflects all those antecedent choices, conditions, constraints, or processes (Stempel, 1985). In some instances, aggregator content is the consequence of an algorithm reflecting a user’s prior choices.

Similarly, individual news stories are the consequence of influences including (but not limited to) a news organization’s market (Lacy et al., 2010; Lacy, Watson, & Riffe, 2011; Lacy, 1987); resources available for staffing (Lacy et al., 2012; Fico & Drager, 2001); on-scene reporter judgments and interactions with purposive and non-purposive sources (Bennett, 1990; Duffy & Williams, 2011; Lawrence, 2010); and decisions about presentation style, structure, emphasis (as in the “framing” process described previously), and language, to name a few (Scheufele & Scheufele, 2010). Media sociologists no longer view news reporting as “mirroring” reality but speak instead of journalistic practices and decisions that constitute the manufacturing of news (Cohen & Young, 1981). News content is the product or consequence of those routines, practices, and values (Shoemaker & Reese, 1996; Reese, 2011), is constructed by news workers (Bantz, McCorkle, & Baade, 1997), and reflects both the professional culture of journalism and the larger society (Berkowitz, 2011). Examples of “content as consequence” abound. Under the stress of natural disasters (e.g., tornadoes, hurricanes, or earthquakes), individual journalists produce messages in ways that differ from routine news work (Dill & Wu, 2009; Fontenot, Boyle, & Gallagher, 2009). Different ownership, management, operating, or competitive situations have consequences; news organizations in different competitive situations allocate content differently (Lacy et al., 2012; Beam, 2003; Lacy, 1992). The presence of women in top editorial positions has consequences for how reporters are assigned beats (Craft & Wanta, 2004) and the newsroom culture (Everbach, 2005), though evidence on the effects of female management on content is mixed (Beam & Di Cicco, 2010; Everbach, 2005) or perhaps issue-dependent (Correa & Harp, 2011). Predictably, some international coverage in US news media is a consequence of having a US military presence overseas; absent a state of war, “foreign news” is relatively rare (Allen & Hamilton, 2010). Facing censorship in authoritarian countries, correspondents gather and report news in ways that enable them to get stories out despite official threats, sanctions, or barriers (Riffe, 1984, 1991; Riffe, Kim, & Sobel, 2018). Symbols that show up in media messages at particular points in time (e.g., allusions to nationalism or solidarity during a war) are consequences of the dominant culture and ideology (Shoemaker & Reese, 1996); images, ideas, or themes reflect important antecedent cultural values. “Content as consequence” is applicable to non-news communication, too. Recall Ki and Hon (2006), whose examination of Fortune 500 companies’ websites allowed them to critique those companies’ communication strategies— strategies that were antecedent to the site content. Scholars often speak of such evidence as unobtrusive or non-reactive. That is, researchers can examine content after the fact of its production and draw inferences about the conditions of its production without making the communicators self-conscious or reactive to being observed while producing it (Weber, 1990).

Letters, diaries, bills of sale, or archived newspapers, tweets, or blog posts—to name a few—can be examined and conclusions drawn about what was happening at the time of their production, or what the producer wanted to have known about their production. Indeed, with platforms and applications like Instagram, Twitter, Tumblr, and Facebook, content is a consequence of “construction” by individual users, for many reasons—to comment on events, signal support for a movement, share one’s experiences, or project an image that may or may not be accurate, to name a few. Original posts on such sites and comments are constructed content. Even the act of liking, linking, or forwarding material on these sites is an act of construction and communication.

The “Centrality” of Content

So, communication content may be viewed as end product, the assumed consequence of antecedent individual, organizational, social, and other contexts. The validity of that assumption depends on how closely the content evidence can be linked empirically (through observation) or theoretically to that context. As noted earlier, communication content also merits systematic examination because of its assumed role as cause or antecedent of a variety of individual processes, effects, or uses people make of it. Figure 1.1 is a content-centered model illustrating why content analysis can be integral to theory-building about both communication effects and processes. The centrality remains regardless of the importance (for theory-building) of myriad non-content variables, such as individual human psychological or social factors and the larger social, cultural, historical, political, or economic context of communication. However, if the model graphically illustrates the centrality of content, it does not accurately reflect the design of many mass communication studies. As Shoemaker and Reese (1990, p. 649) observed, most content analyses are not linked “in any systematic way to either the forces that created the content or to its effects.” As a result, Shoemaker and Reese (1996, p. 258) warned mass communication theory development could remain “stuck on a plateau” until that integration occurs. A 1996 study (Riffe & Freitag, 1997) of 25 years of content analyses published in Journalism & Mass Communication Quarterly revealed that 72% of the 486 studies lacked a theoretical framework linking content to either its antecedents or its consequences. Trumbo (2004, p. 426) placed the percentage at 73% in his analysis of Journalism & Mass Communication Quarterly content studies during the 1990 to 2000 period. Not surprisingly, only 46% of the cases examined by Riffe and Freitag involved formal research questions or hypotheses about testable relations among variables—testing that is essential to theory-building.

Figure 1.1 Centrality model of communication content

Still, research in this field is dynamic, although the scientific goal of prediction, explanation, and control (Reynolds, 1971) of media phenomena may still be decades away. However, quantitative content analysis of media content is key to such a goal. Since initial publication of this book in 1998, hundreds of content analysis-related studies have been published in Journalism & Mass Communication Quarterly and other refereed journals, such as the Journal of Broadcasting & Electronic Media and Mass Communication and Society, using the kind of quantitative content analysis examined in this book. According to Wimmer and Dominick (2011, p. 156), about a third of all articles published in those three journals in 2007 and 2008 employed quantitative content analysis, a higher

proportion than the 25% that Riffe and Freitag (1997) reported for 25 years of Journalism & Mass Communication Quarterly. Of the 2,534 articles Lovejoy, Watson, Lacy, and Riffe (2014) studied from Journalism & Mass Communication Quarterly, Journal of Communication, and Communication Monographs between 1985 and 2010, 23% involved content analysis. Consistent with the emphasis on the “centrality” of content in understanding communication processes and effects, many studies place content analysis research into the context of framing, agenda-setting, cultivation, and various persuasion theories. Research on content antecedents is still largely atheoretical, though, with some studies using the Shoemaker and Reese (1996) hierarchy of influences approach to order, interpret, and interrelate influences on content. Theories addressing effects and antecedents of social media content will be difficult to synthesize, as such content ranges from comments on public events to Instagram selfies and school shooter manifestos.

Description as a Goal

Of course, not all research has theory-building as a goal. Simple descriptive studies of content have value. A Southern daily newspaper publisher, stung by criticism that coverage of the African American community was excessively negative, commissioned one of the authors to examine that coverage. The publisher needed an accurate description of his paper’s coverage to respond to the criticism and, perhaps, change the coverage. Thus, some descriptive content analyses may be “reality checks” whereby representations or portrayals of groups, phenomena, traits, or characteristics are assessed against a “standard” (Wimmer & Dominick, 2011, pp. 158–159). Such comparisons to normative data can, in some instances, index media distortion (Mastro & Greenberg, 2000; Smith & Granados, 2009). For example, more than 30 years ago, a study of characters in television advertising during Saturday morning children’s programming reported a female and ethnic presence far smaller than those groups’ presence in the population, according to census data (Riffe, Goldson, Saxton, & Yu, 1989). Historically, when new content and delivery forms evolve, they lend themselves to such descriptive “real-world” comparisons. Early video games, for example, were criticized because of assumptions about imitative aggression or learning of gender roles among users—a research focus previously applied to content ranging from comic books to movies, television, and popular music. Martins, Williams, Harrison, and Ratan (2008) analyzed 150 top-selling video games, measuring physical dimensions of animated characters and converting the dimensions to real-human “equivalencies.” Animated female characters were far more slender than their real-world counterparts—a pattern consistent with the thinness ideal cultivated by many media.
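Such “reality checks” often reduce to comparing coded frequencies against a population benchmark. The sketch below uses invented counts and hypothetical census proportions (not data from the studies above) to show the kind of goodness-of-fit comparison involved.

```python
# Hypothetical comparison of coded character counts to census proportions.
from scipy.stats import chisquare

observed_counts = [120, 18, 12]        # coded characters: group A, B, C
census_shares = [0.60, 0.25, 0.15]     # hypothetical population proportions

n = sum(observed_counts)
expected_counts = [share * n for share in census_shares]

stat, p = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```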

Or consider the study by Law and Labre (2002) analyzing male body images in magazines. Although the research used a longitudinal (1967–1997) design, it was essentially a descriptive study of how male body shapes became increasingly lean and muscular in visual representations. Law and Labre suggested that males’ exposure to idealized mediated body images may parallel the experience women face. Content on some social media applications and websites cannot always be assessed against a normative “gold standard” such as the census; instead, the content unabashedly reflects the authors’ own world views or beliefs. Recall, for example, Heiss and Matthes’s (2020) documentation of anti-immigrant and anti-elite sentiments expressed on right-wing populist websites. Indeed, studies of sites that disseminate misinformation or blatantly false conspiracies require assessment only against what extant evidence and consensus confirm.

Finally, descriptive content analyses sometimes serve as a first phase in programs of research. Research on anonymous news sources is illustrative. Reporters sometimes hide a source’s identity (e.g., “a senior official, speaking on condition of anonymity, said . . .”), despite complaints about the source’s lack of public accountability (Duffy & Williams, 2011; Sobel & Riffe, 2016). Initially, Culbertson (1975, 1978) analyzed representative content to describe message variables associated with unnamed sources. Based on those results, Culbertson and Somerick (1976, 1977) conducted an experiment (participants received simulated news stories either with or without anonymous sources) to test the effects of unnamed sources on believability. More recently, a program of research used experiments to test the effects of media framing of government policies on audience members, usually fashioning (manipulated) experimental treatment frames from examples found in analysis of media content (e.g., de Vreese, 2004, p. 39; de Vreese, 2010; de Vreese & Boomgaarden, 2006).

Research Applications: Making the Connection

As many of the examples presented above have shown, content analysis is often an end in itself—a method to answer research questions about content. However, some of the examples featured designs that brought together multiple forms or sources of content: Harlow and Kilgo (2021) studied news coverage and Facebook forwards; Lee and Riffe (2017) used press releases, corporate rating reports, and news coverage; and Bastien (2018) used newspaper coverage and debate transcripts. Other examples illustrate the method’s use in conjunction with other methods: for example, Heiss and Matthes’s (2020) analysis of right-wing populist Facebook posts and subsequent survey of followers and non-followers. In fact, numerous studies have involved multiple methods or data forms.

Scheufele, Haas, and Brosius (2011) explored the “mirror or molder” role of stock price and trading coverage on subsequent market activity. Data on four leading German dailies’ coverage and the two most frequently visited financial websites were matched with stock prices and trading volume for companies ranging from blue-chip stocks to lightly capitalized and traded firms. The authors concluded that media “mirror [rather] than shape what happens at stock markets” (p. 63), but also affect online traders who may trade immediately after reading reports.

Another study (Eddy, Riffe, Cohen, & Kim, 2021), examining alignment between nine presidents’ foreign-policy priorities and international news coverage, brought together a half-century of New York Times content data, computational analysis of countries mentioned in 284 major presidential speeches during the 50 years, and qualitative “close readings” of foreign-policy summaries for each president written by political scientists and presidential historians with the nonpartisan Miller Center of Public Affairs. The authors concluded that the popular perception of the president as “newsmaker-in-chief” who is able to turn the “gaze” of the Times is—at least in terms of foreign policy—“part truth, part myth” (p. 21). Coverage sometimes “echoed” presidential priorities, but trend analysis showed the echo becoming “increasingly faint” (p. 22).

To examine how “compliant” the press was in responding to the official US stance on US forces’ abuse of Iraqi prisoners at Abu Ghraib Prison during the Iraq War, Rowling, Jones, and Sheets (2011) studied White House speeches, interviews, press conferences, and press releases; statements made on the floor of Congress and recorded in the Congressional Record; and news coverage by CBS News and the Washington Post. Despite congressional challenges, the administration’s “national identity-protective” frames—minimization, disassociation, and reaffirmation—“were largely echoed by the press” (p. 1057).

Observers have argued that journalistic news judgment overvalues extreme groups and undervalues “moderate” groups. Identifying more than 1,100 (not-for-profit) advocacy groups from Internal Revenue Service databases, McCluskey and Kim (2012) interviewed top executives of 208 groups, characterizing groups as very conservative, very liberal, and moderate. They analyzed the 20 largest-circulation US dailies and “matched” each group with the newspaper nearest its headquarters. Content analysis showed moderate groups received less prominence than extreme groups.

McCombs and Shaw (1972) hypothesized an agenda-setting function of mass media coverage of a political campaign in which differential media emphasis, over time, communicates to the public a rough ranking (agenda) of important issues. The authors surveyed undecided voters about the most important issues in a campaign and content analyzed campaign coverage in nine state, local, and national media, finding a strong positive correlation between the media and public agendas.
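The agenda-setting comparison just described is commonly expressed as a rank-order correlation between the media agenda and the public agenda. A minimal sketch with invented issue rankings (not McCombs and Shaw's data) follows.

```python
# Hypothetical agenda-setting comparison: rank issues by media emphasis
# (e.g., story counts) and by public importance (e.g., "most important problem"
# poll percentages), then correlate the two rankings.
from scipy.stats import spearmanr

issues = ["economy", "crime", "environment", "health care", "immigration"]
media_rank = [1, 2, 5, 3, 4]    # 1 = most heavily covered
public_rank = [1, 3, 5, 2, 4]   # 1 = most often named most important

rho, p = spearmanr(media_rank, public_rank)
print(f"Spearman's rho = {rho:.2f}, p = {p:.3f}")
```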

Similarly, Wanta, Golan, and Lee (2004) combined content analysis of network newscasts with national poll data, showing that amount of coverage of foreign nations is strongly related to public opinion about the importance of those nations to US interests. However, they also examined how negatively or positively the nations were portrayed, and found a “second-level” agenda-setting effect involving those attributes: the more negative the coverage, the more likely poll respondents were to think negatively about the nation.

However, as impressive as the agenda-setting approach has been over the last five decades, and despite the innovative mixed-method approaches highlighted above, methodological integration is still relatively rare. A quarter-century ago, Riffe and Freitag (1997) found only 10% of content analyses published in 25 years of Journalism & Mass Communication Quarterly involved a second research method—a pattern that continued through this book’s fourth edition in 2019.

Innovation and Expanding the Research Reach

While the definition of content analysis offered earlier emphasized trained human coders applying classification rules, researchers have taken advantage of a range of computational methods. At one end of the continuum, computer functions and software are used to access, retrieve, categorize, filter, and manage content units so humans may manually code appropriate content. At the other end, researchers have “trained” computers to apply codes. This “algorithmic text analysis” (ATA) is “a computer application that assigns numeric values to attributes of media content based on a set of programmed rules” (Lacy et al., 2015, p. 9), which some scholars call “machine learning,” “supervised machine learning,” or, simply, “computer coding.” The operative term is “programmed rules.” Innovations and issues in computing will be explored in Chapter 4, but a few examples illustrate the growth of computational methods.

“Arab Spring” protests pitted citizens against authoritarian regimes in Tunisia and Egypt in 2011. Examining sources quoted in news coverage and facing a data set of more than 60,000 tweets, researchers set about sorting and filtering the raw data, identifying unique sources, linking to specific articles, and importing those articles into a template to simplify human coding. They used a programming language (Python) to write a script that located target elements in the large, raw data set; spreadsheet and statistical software to organize and identify the objects for their analysis; open-source software to convert dynamic web pages into static objects; and “a Web-based electronic coding interface to facilitate the work of human coders and reduce error” (Lewis, Zamith, & Hermida, 2013, p. 41; see also Hermida, Lewis, & Zamith, 2013). This “hybrid” approach retained the “systematic rigor and contextual awareness” of trained human coders, while “maximizing the large-scale capacity of Big Data and the efficiencies of computational methods” (Lewis et al., 2013, p. 47). Analysis showed that nonelite sources were retweeted more often than elite sources or other journalists.
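As a rough, hypothetical illustration of the “access, retrieve, and filter” end of that continuum, the sketch below uses Python to reduce a tweet export to a random subsample and write out a coding sheet for human coders; the column names, keywords, and file name are assumptions for this example, not the procedures used by Lewis et al.

```python
# Hypothetical workflow: filter a tweet export down to relevant items, draw a
# random subsample, and write a coding sheet for human coders to complete.
import pandas as pd

# In practice this would be a large export, e.g. pd.read_csv("tweets_raw.csv").
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "user": ["@a", "@b", "@c", "@d"],
    "text": [
        "Thousands join the protest downtown",
        "Lovely sunset tonight",
        "March organizers announce new route",
        "Team wins big in overtime",
    ],
})

keywords = ["protest", "march", "demonstration"]   # illustrative relevance filter
pattern = "|".join(keywords)
relevant = tweets[tweets["text"].str.contains(pattern, case=False, na=False)]

# Simple random sample of relevant tweets for manual coding (seed for replicability).
sample = relevant.sample(n=min(500, len(relevant)), random_state=42)

# Empty columns the human coders will fill in according to the coding protocol.
sample = sample.assign(source_type="", tone="")
sample.to_csv("coding_sheet.csv", index=False)
print(f"{len(relevant)} relevant tweets; {len(sample)} written to coding_sheet.csv")
```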

Opperhuizen, Schouten, and Klijn (2019) studied 2,265 news articles about gas drilling in the Netherlands across 25 years and 5 different newspapers, using “supervised machine learning” (SML) and training a computer “to recognize patterns in the text that correspond to the manually assigned codes” (p. 8). A subset of 102 articles was “inductively” coded by humans using frame categories (personalization, dramatization, and negativity) and then served as the “training document” for the computer algorithm—a process the researchers described as “challenging” and “very time-consuming” (p. 18). (A simplified sketch of this kind of supervised coding appears at the end of this section.)

Baden and Tenenboim-Weinblatt (2017) amassed a data set of more than 200,000 news texts in 13 Israeli, Palestinian, and international media over a decade. To qualify for scrutiny, texts had to reference both sides of the Israeli–Palestinian conflict. After “a laborious qualitative pilot study,” the authors created “a large, fine-grained dictionary of 1,974 semantic concepts” (pp. 9–10). Adjusted for idiomatic usage, the final dictionaries contained 6,500–10,500 search terms and more than 34,000 “disambiguation criteria” (p. 10) for terms with multiple meanings. Co-occurrences of concepts appearing together were also targeted. However, the authors said their “fine-grained automated analysis” (p. 8) yielded only “a bird’s eye perspective, using highly abstracted data [that] deliberately glosses over the specific conflict events and political controversies covered” (p. 19). They noted that the inductive, algorithmic approach is vulnerable because even a catalogue of 1,974 concepts “is bound to be incomplete” (p. 20).

In sum, computers’ ability to perform large and small operations very quickly enables researchers to manage and analyze mammoth data sets. The hybrid approach used by Lewis et al. (2013) represents a prime example of utilizing computing to access, filter, and sort content, coupled with a human–computer interface that helps reduce human-coding and data-entry errors. Yet, as we shall see in Chapter 6, samples of available digital content units may or may not represent valid populations of relevant content. Tweets about Black Lives Matter that can be captured today may not include those posted yesterday or a week ago and then removed. In addition, much of the available online or social media content that can be accessed with computers—while allowing researchers to “look inside” the dynamic discourse or exchanges around a topic, event, or issue—represents a conceptualization challenge. Social media content often has multiple “authors” (e.g., site developers, those who post and others who respond, and still others who forward or retweet) and “editors” who occasionally alter or remove material or deny access to individuals or groups they deem inappropriate. In other words, what does computerized retrieval and analysis of the fluid discourse around a topic reveal about which author or which editor? These represent ongoing challenges for content analysts. Exploration of the most valid ways to examine the content of social media and platforms continues to be a formidable task.
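A minimal sketch of the supervised approach described above follows, using scikit-learn; the texts, labels, and category are invented, and a real application would require a far larger hand-coded training set plus validation against held-out human coding.

```python
# Minimal supervised-learning sketch: learn to reproduce human-assigned codes
# from text features, then apply the fitted model to uncoded items.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-coded training examples (invented): 1 = "negativity" frame present.
texts = [
    "Residents fear the drilling will ruin their homes",
    "Company reports steady production and new jobs",
    "Earthquake damage blamed on gas extraction",
    "Officials announce routine maintenance schedule",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

uncoded = ["Cracked walls and angry homeowners after latest tremor"]
print(model.predict(uncoded))   # predicted code for the new, uncoded text
```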

18

Introduction

theoretical and applied questions explored by journalism or mass communication researchers. As we noted earlier (Lacy et al., 2015, p. 17), it is intuitively the most central method for researching the broad domain of communication. Moreover, because communication itself is so central to individual and social behavior, content analysis has utility in many scholarly disciplines: journals include examples in sociology, political science, banking, economics, natural resources and the environment, medicine, and nutrition, to name a few. Because messages presumably indicate the communicator’s psychological state, content analysis has a long history of use in psychology, dating perhaps to the examination by Gordon Allport in the 1940s of more than 300 letters from a woman to her friends. Those Letters from Jenny (Allport, 1965) were a non-reactive measure of the woman’s personality for Allport and his associates, whose work heralded what Wrightsman (1981) called the “golden age” of personal documents as data in personality analysis. In the 1980s, psychologists used content analysis of verbatim explanations (CAVE) in individuals’ speaking and writing to see if they describe themselves as victims, blaming others or other forces for events. Earlier research used questionnaires to elicit causal explanations, but these have limited use if potential participants are “famous, dead, uninterested, hostile, or otherwise unavailable” (Zullow, Oettingen, Peterson, & Seligman, 1988, p. 674). However, content analysis can be used with “interviews, letters, diaries, journals, school essays, and newspaper stories, in short, in almost all verbal material that people leave behind” (p. 674). Zullow et al. examined President Lyndon Johnson’s Vietnam War press conferences: when Johnson offered optimistic causal explanations, bold and risky military action followed, whereas pessimistic explanations predicted passivity on the president’s part. Content analysis has also been used to examine the evolution of academic disciplines. An economic historian (Whaples, 1991) collected data on content and authorship variables for the first 50 volumes of the Journal of Economic History, examining how researchers’ focus shifted from one era to another and isolating a particular paradigm change—cliometrics—that swept the profession in the 1960s and 1970s. In organizational and strategic management, Duriau, Reger, and Pfarrer (2007) used quantitative content analysis of 98 scholarly journals to assess the field’s use of the technique during 1980–2005, when it “gained its legitimacy as a methodology in the management field” (p. 8). They reported data on research themes, data sources, theoretical stances, coding approaches, and analytical methods (p. 8), and documented a trend in the reporting of reliability across the 25-year period. As noted later (Chapter 7), Lovejoy, Watson, Lacy, and Riffe (2014, 2016) studied top communication journals across the 1980–2014 period and noted increased reliability reporting but not reporting of how the content sample for

reliability assessment was selected—a process that has important consequences for the validity of the assessment (Lacy & Riffe, 1996; Lacy et al., 2015).

A team of communication and economics scholars (Rickard, Noblet, Duffy, & Brayden, 2018) examined media coverage of risks and benefits associated with marine aquaculture (cultivating seafood in salt or brackish water) to gauge its acceptance in Maine and Massachusetts. Coupling a 15-year content analysis of 3 regional newspapers with focus group discussions with residents, the team found extensive coverage of environmental risk, but focus group participants generally lacked awareness of aquaculture and used beliefs about traditional terrestrial agriculture to inform beliefs about aquaculture risks and benefits.

Arguing that there is a knowledge gap about wildlife (e.g., characteristics and dangers to humans) between the public and scholars/scientists, and that media both perpetuate and reflect public perception, biologists Unger and Hickman (2020) located 288 newspaper articles across 153 years encompassing 5 “conservation eras” (Exploitation, 1850–1899; Protection, 1900–1929; Game Management, 1930–1965; Environmental Management, 1966–1979; and Conservation Biology, 1980–2016). To examine how public perceptions changed across these eras, they focused on the lowly hellbender salamander—an amphibian described as an “ideal case,” given its long history of public and scientific curiosity, geographic range, recent decline, and “history of persecution” as ugly, poisonous, and troublesome for fishermen. Hypothesizing a change in coverage—a proxy for public opinion—during the later eras, they found that hellbenders were viewed more positively in the last 40 years of the study, with a statistically significant spike during the Conservation Biology era.

Bank marketing researchers (Czarnecka & Mogaji, 2020) systematically examined emotional appeals in ads for financial services—primarily loans—in 2,900 editions of 8 UK papers from 3 strata (“quality,” “mid-market,” and “popular”) with a combined circulation of 4.9 million across 12 consecutive months. About 43% of the loan ads had an emotional appeal, as determined by two trained coders who, working independently, each coded every ad. Positive emotions were most common (95% of ads), with the most frequent being relief (64%), security (53%), and adventure (37%).

Sociologists (McLoughlin & Noe, 1988) content analyzed 26 years (936 issues and more than 11,000 articles) of Harper’s, Atlantic Monthly, and Reader’s Digest to examine coverage of leisure and recreational activities within a context of changing lifestyles, levels of affluence, and orientation to work and play. Pratt and Pratt (1995) examined food, beverage, and nutrition advertisements in leading consumer magazines with primarily African American and non-African American readerships to gain “insights into differences in food choices” (p. 12) related to racial differences in obesity rates and “alcohol-related physiologic diseases” (p. 16).

A political scientist (Moen, 1990) explored Ronald Reagan’s “rhetorical support” for social issues embraced by the “Christian Right” by categorizing words used in each of his seven State of the Union messages. Systematic content analysis has been used in the humanities, too. Simonton (1994) used computerized content analysis to contrast the style of the popular and more obscure of Shakespeare’s 154 sonnets in terms of whether they feature primitive emotional or sensual meanings or cerebral, abstract, rational, and objective meanings.

Summary

As it has evolved, the field of communication research has seen a variety of theoretical perspectives influencing how scholars develop research questions and methods to answer those questions. The focus of inquiry has often been the communication content that is central to meaning, and systematic content analysis is the most appropriate method for examining it. Scholars have examined content assumed to be the cause of particular effects or reflecting the antecedent context or process of its production. Content analysis has been used in mass communication and other fields to describe content and to test theory-derived hypotheses. It has involved multiple content forms and designs that incorporate multiple methods. The variety of applications may be limited only by the analyst’s imagination, theory, and resources, as illustrated by the examples described in this chapter. While computers are increasingly employed in content analysis to access, retrieve, and manage large data sets, thus far they lack the ability to evaluate context and disambiguate many subtleties of language.
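As a concrete, entirely hypothetical illustration of the kind of computational assistance described in this chapter, the sketch below filters a large archive of tweets down to a manageable random sample for trained human coders, in the spirit of the hybrid approach of Lewis et al. (2013). The file name, column names, and keyword filter are invented for illustration and are not drawn from that study.

# A minimal, hypothetical sketch of a "hybrid" workflow: software filters and
# organizes a large set of tweets so that trained human coders can code a
# manageable subset. File and column names are invented for illustration.
import pandas as pd

# Load a (hypothetical) archive of tweets with columns: tweet_id, user, text, retweets
tweets = pd.read_csv("tweets_archive.csv")

# Keep only tweets that appear to quote or mention a source, using a crude keyword
# filter; a real project would use more careful rules and report them in the protocol
mentions_source = tweets["text"].str.contains(r"according to|said|sources", case=False, na=False)
candidates = tweets[mentions_source].drop_duplicates(subset="tweet_id")

# Draw a random sample of manageable size for human coding
sample_for_coders = candidates.sample(n=min(500, len(candidates)), random_state=42)

# Write a coding sheet that human coders fill in according to the protocol
sample_for_coders = sample_for_coders.assign(source_type_code="", tone_code="")
sample_for_coders.to_csv("coding_sheet.csv", index=False)

print(f"{len(candidates)} candidate tweets; {len(sample_for_coders)} sampled for human coding")

The division of labor is the point: the software narrows and organizes the material, while humans apply the coding protocol to what remains.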

2

Defining Content Analysis as a Social Science Tool

The previous chapter’s preliminary definition of quantitative content analysis permitted a broad overview of the method’s importance and utility for numerous communication research applications: the systematic assignment of communication content to categories according to rules specified in a coding protocol and the analysis of relationships involving those categories using statistical methods. A more specific definition derived from previous ones can now be proffered, distinguished from earlier definitions by its view of the centrality of communication content. Content analysis procedures and purposes draw on the social science approach to knowledge—a system of standards and guidelines for generating relational statements that describe and explain human behavior and mental processes. Reynolds (1971, p. 4; original emphasis) said science provides: (1) A method of organizing and categorizing “things,” a typology; (2) Predictions of future events; (3) Explanations of past events; (4) A sense of understanding about what causes events. And occasionally mentioned as well is: (5) The potential for control of events. These goals are not accomplished by any single study or program of study. They come from the accumulation of research, synthesized and presented in theory. Theory-building and -testing are the goals of the scientific process (Shoemaker, Tankard, & Lasorsa, 2004). Because human behavior is complex and changes over time, this accumulation of research is never complete. And, as noted in Chapter 1 of this volume, changing technology makes human communication increasingly dynamic and complex. This warrants new studies re-examining and refining existing theories, while building new theories incorporating technology’s impact. Because content analysis allows sophisticated examination of fundamental processes of human communication within a given context, it can contribute to the accumulated research and the building of theory.


Just as the body of social science knowledge changes, the methods used by scientists evolve as scholars investigate and evaluate those methods. However, Reynolds (1971) identified three characteristics of generating scientific knowledge that are consistent across time: science is abstract, intersubjective, and empirically relevant.

According to Reynolds (1971, p. 14), scientific abstractness “means that a concept is independent of a particular time or place.” If theoretical concepts are tied to a place and time, they cannot predict the future. In addition, abstractness is efficient in enhancing scientific understanding. Having theories that are unique to specific times and places would require an overwhelming number of theories for understanding the world. The concept of media agenda-setting, for example, is abstract enough to allow examination of the news media’s role in every election, even though the degree of agenda-setting can vary. Had it applied only to the 1972 election, it would be merely an historical artifact.

Intersubjectivity means scientists who study an area agree on what a concept means and also on the validity of relationships among concepts (Reynolds, 1971). As examples of the former, relevant scholars generally have come to agree on the meaning of concepts such as agenda-setting (McCombs & Reynolds, 2009) and diffusion of innovation (Rogers, 2003). Intersubjectivity also includes agreement among scholars about the use of a logic system for developing relational statements within a theory. Reynolds (1971) calls this “logical rigor.” Unlike some fields, such as economics and political science, that have adopted mathematics for theory-building, communication has no agreed-upon logic system for theory-building. Rather, communication scholars tend to use what is labeled “informal logic” or “natural language reasoning” (Johnson, 1999; Johnson & Blair, 2000) for creating more explicitly delineated theory.

“Empirically relevant” means scientists are able to compare theoretical statements with objective empirical data (Reynolds, 1971). If statements in a theory cannot be tested against measures of real phenomena, their validity cannot be established independently, and the five goals of science outlined above cannot be achieved. Moreover, an important part of empirical relevance is the ability of scientists to replicate the empirical results of other scientists (McEwan, Carpenter, & Westerman, 2018). Consistent results across studies, scientists, and time are the strongest form of validation for theoretical statements. The relationship between news media content and the issues considered important by members of the public (e.g., agenda-setting) has been examined and supported to varying degrees in hundreds of studies.

More recently, Kelly and Westerman (2020) articulated the necessity for consistency in scientific research in the field of communication, calling this effort a “dedication to validity” (p. 178). They argued that science is inherently iterative, but most importantly it adheres to the same backbone of research principles (see Kerlinger & Lee, 2000) time and time again, regardless of method. This backbone is paramount in the collaboration, replication, and measurement refinement

(Kelly & Westerman, 2020) needed to advance the field of communication, and justifies the need for rigorously consistent practices and procedures in the content analysis method.

Adapting a Definition

As a more specific definition of content analysis is developed, the result will reflect the social scientific principles elaborated above. As a data-generating process, content analysis lends itself to theory-testing, and the results of testing theoretical relationships will suggest new ideas for adjusting existing theories and for building new theories to explain antecedents or effects of content, as suggested by the centrality model explained in Chapter 1 of this volume.

Stempel (2003, p. 209) suggested a broad view of content analysis, what he called “a formal system for doing something we all do informally rather frequently—draw conclusions from observations of content.” What makes quantitative content analysis more “formal”? In his first edition, Krippendorff (1980, p. 21; emphasis added) emphasized reliability and validity: “Content analysis is a research technique for making replicable and valid inferences from data to their context.” Emphasizing “data” reminds the reader that quantitative content analysis is reductionist, employing sampling and operational procedures that reduce communication phenomena to manageable data (e.g., numbers) from which inferences about the phenomena may be drawn.

Kerlinger (1973) suggested that content analysis is conceptually similar to “pencil-and-paper” scales used by survey researchers to measure attitudes—a parallel to Chapter 1’s emphasis on viewing communication content as an unobtrusive or non-reactive indicator. According to Kerlinger, content analysis should be treated as “a method of observation” akin to observing people’s behavior or “asking them to respond to scales,” except that the investigator “asks questions of the communications” (p. 525).

Krippendorff (2004a, p. 18; 2019, p. 24; emphasis added) would later refine his definition: “Content analysis is a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use.” On one hand, this refinement acknowledges the obvious—that messages have meaning, or, as Krippendorff wrote, “are produced by someone to have meaning for someone else” (2019, p. 25). But by including “other meaningful matter” with “texts,” the definition extends the range of forms of content that may be analyzed to encompass Chapter 1’s “broad domain for contemporary study.” Researchers may consider memes, art, maps, emojis, lyrics, GIFs, AI language, and other innovative symbols, sounds, and so on as meaningful content. This definition also acknowledges the importance—for drawing valid inferences—of the context (i.e., communication ecosystem) in which content is encountered.

Each of these definitions is useful, sharing emphases on the systematic and objective nature of quantitative content analysis. However, most forgo discussion of the specific goals, purpose, or type of inferences to be drawn, other than suggesting that valid inferences are desirable. Moreover, some might apply equally to qualitative analysis of messages. Stempel’s (2003) and Krippendorff’s (1980) definitions, for example, do not mention quantitative measurement (although both researchers have remarkable records of scholarship using quantitative content analysis).

Content Analysis Defined

The definition guiding this volume, by contrast, is informed by a view of the centrality of content for understanding theoretically significant processes and effects of communication, and of the utility, power, and precision of quantitative measurement:

Quantitative content analysis is the systematic and replicable examination of symbols of communication, which have been assigned numeric values according to valid measurement rules (called a protocol), and the analysis of relationships involving those values using statistical methods, to describe the communication, draw inferences about its meaning, or infer from the communication to its context, of both production and consumption.

What do the key terms in this definition mean?

Systematic

We may speak of a method being systematic on several levels. Scientists are systematic in their approach to knowledge: researchers require generalizable empirical, not just anecdotal, evidence. Explanations of phenomena, relationships, assumptions, and presumptions are not accepted uncritically, but are subjected to a system of observation and empirical verification. The scientific method is a system with a step-by-step process of problem identification, hypothesizing of an explanation, and testing of that explanation (McLeod & Tichenor, 2003). The goal of science is to build systematically related sets of theoretical statements that explain relationships among precisely defined concepts. These sets of propositions are called theory; and, when supported empirically, theories can be generalized to appropriate types of human behavior and mental processing. Thus, from a theory-building point of view, the next step in systematic research requires identification of key terms or concepts involved in a phenomenon, specification of possible relationships among concepts, and generation of testable hypotheses (“if . . . then . . .” statements about one concept’s influence on another). In addition to its important role in theory-building and -testing,


content analysis is useful for practical problems and in generating baseline data for new communication phenomena that accompany developing technologies across physical and virtual spaces, as noted in the previous chapter. Testing of hypotheses is not paramount in these instances. However, whether looking critically at communication phenomena, testing theory-driven hypotheses, generating baseline data, or solving practical problems, we may speak of researchers being systematic in terms of a study’s research design: the planning of operational procedures to be employed. The researcher, who determines in advance such research design issues as the time frame for a study, what form of communication is the focus of the study, what concepts are to be examined, and how precise the measurement must be—in effect laying the ground rules in advance for what qualifies as evidence of sufficient quality that the research question can be answered—is also being systematic. Research design is explored more fully in Chapter 3. Replicable

Two defining traits of science are objectivity and reproducibility or replicability. To paraphrase Wimmer and Dominick (2011, p. 157), a particular scientist’s “personal idiosyncrasies and biases,” views, and beliefs should not influence either the method or findings of an inquiry. Research definitions and operations that were used must be reported exactly and fully so that readers can understand exactly what was done. That exactness means that other researchers can evaluate the procedure and the findings, and, if desired, repeat the operations. Defining concepts in terms of the actual, measured variables is operationalization. For example, a student’s maturity and self-discipline (both abstract concepts) may be measured or operationalized in terms of number of classes missed and assignments not completed. Both can be objectively measured and reproduced by another observer. A researcher interested in a politician’s popularity on Twitter might operationalize that concept (popularity) in terms of the number of followers or the average number of times a tweet is retweeted. Both measures are elements of popularity and could be easily replicated. Similarly, examining how much of a Reddit subthread is authored by different contributors or dominated by a few is a fairly straightforward way to operationally define whether the subthread serves as an echo chamber or a venue for public discourse featuring varied perspectives. However, researchers must follow a clear methodological process consistent with accepted scientific practice and explicitly and precisely explained so subsequent researchers have the opportunity to replicate, extend, and refine an empirical study that builds upon a body of literature. Consider a protracted, hypothetical example from mass communication research, focusing on how gender is portrayed (i.e., normatively or nonnormatively). Suppose researchers published a content analysis of gender

portrayal in original programming for teens available through streaming services (e.g., Hulu, Disney+, Prime Video), and reported that gender was stereotypically represented as binary and unrepresentative of the LGBTQ+ teen US population. Obviously, coders counted people in the programming and identified them as male, female, or they—an easily replicable operationalization. Now, consider the number of points in the research process where the chosen operational procedure could influence what was found, and how any unclear or imprecise reporting of that procedure could influence the study’s replicability.

For example, how did the researchers operationally define how gender was identified and counted into a category? Was the decision based on an assessment by trained coders making judgments? Did they refer to a coding protocol definition or rule based on specific criteria (e.g., hair length and style, face shape, eye shape, gestures, body type, voice pitch, clothing style, etc.)? What happened if a person was assigned to the “male” category in one scene and subsequently to the “female” category in another scene, indicating fluidity? Did the person “count” in two different types of portrayal and thereby receive a final code of “they”? Making such judgments without a clear procedural rule is like asking someone to record the heights of a group of friends without providing them with a tape measure.

Did coders examine and code content individually, or were coding determinations a consensual process, with two or more coders reaching a decision? Did anyone confirm that coders understood the criteria and applied them in the same way when identifying and counting gender portrayal? Were individual coders consistent across time, or did their judgments become less certain and less reliable after they became fatigued, consulted with one another, or talked with the senior researcher about the study’s objectives? Did the study offer a quantitative measure of reliability that reached acceptable standards? How and when was that assessment performed?

Moreover, teen streaming programs present both foreground and background characters. Did the definition of which person to identify and code take this into account? Did a person in the background “weigh” as heavily as one in the foreground or one in a major or speaking role? Did a person have to be on-screen for a specified length of time to be identified and counted? Were coders able to freeze scenes and carefully analyze and appropriately code people into a category (decreasing the chance of missing characters but totally unlike typical audience viewing)?

Finally, how did the researchers conclude that programming portrayed normative gender stereotypes once the data were collected? Did they tally how many entire streaming programs had at least one counter-stereotypical gender portrayal (i.e., non-binary) or compare the percentage of total non-binary people on-screen with data on US teen gender identities? This example used what at first blush seemed a rather “simple” form of coding and measurement—identifying and counting gender portrayals—but it

demonstrates how difficult it might be to reproduce findings, as required by our definition of quantitative content analysis, without clear reporting of even such simple operational procedures. So, what would happen if coders tried to measure more abstract variables, such as attractiveness, bias, presence of particular frames, or fairness and balance, or tried to code the deeper meaning of symbols rather than manifest content? Other researchers applying the same system, research design, and operational definitions to the same content should replicate the original findings. Only then can an observed relationship be generalized with a high level of certainty. Only after repeated replications can researchers develop a new theory or challenge an existing theory to explain a phenomenon. Symbols of Communication

Our definition also recognizes that the communication content suitable for content analysis is as variable as the many purposes and media of communication. All communication uses symbols, whether verbal, textual, or visual. The meanings of these symbols can vary by degrees from person to person and culture to culture, but shared meaning of symbols is essential for social groups to exist. Moreover, the condition under which symbols of communication were produced is variable: that is, it may have been natural or manipulated. As Kerlinger (1973, p. 527) stated, content analysis “can be applied to available materials and to materials especially produced for particular research problems.” For example, scholars can analyze current online content or archived content from newspapers, magazines, videos, tweets, and social networking sites; or participants in experimental conditions of exposure may be asked to write, draw, or report postexposure sentiments that may be subjected to content analysis.

Developments in artificial intelligence have opened multiple new areas of content for analysis. For instance, responses from voice assistants such as Google or Alexa can be recorded and content analyzed (e.g., Frehmann, Ziegele, & Rosar, 2022), and messages and actions while participants are engaged in virtual or augmented reality (AR) experiences may be recorded and analyzed (e.g., see Liao, 2021 for content analysis of AR tweets). Though these areas are nascent, content and communication in them are growing and, with suitable theoretical underpinning, may be fertile ground for content analysis. Virtual chatbots (e.g., ChatGPT) or transcripts from virtual shopping assistants could be analyzed. The possibilities are increasingly plentiful, with technologies able to capture and record communication symbols for analysis, whether “in real time” or afterward.

Although the phrase “symbols of communication” suggests all-inclusiveness and broad applicability of the content analysis method, recall the requirement that content analyses should be systematic and replicable, and the goal that they should be driven by the scientific method.

What represents appropriate and meaningful content for analysis must be based on the research task and specified clearly and without ambiguity. However, even that straightforward requirement is made complex because, as noted in Chapter 1, communication processes involve different media (if any) of communication (e.g., print versus broadcast versus social media) or different functions (e.g., entertainment versus news versus social networking), to name only two dimensions. The issue is compounded by potential differences in the symbols in different media and the units used for coding (e.g., themes, frames, entire news stories, or 280-character message strings).

Appropriate communication content for study might thus be individual words or labels in advertising copy, news stories, movies, phrases or themes in political speeches, individual postings or entire exchanges among Facebook or Reddit posters, and entire recorded conversations between two people in a viral YouTube video. Within these text units, the focus might be further sharpened to address the presence of particular frames, as Elmasry and el-Nawawy (2020) did in exploring frames used in news coverage of Muslim and non-Muslim perpetrators in the weeks following the 2016 Orlando and 2017 Las Vegas mass shootings.

Visual communication for study might include photos, graphics, or display advertisements in a variety of media. For example, Seyidoglu et al. (2022) examined evolving gender and race representation in popular running magazine cover images over an 11-year period. Johnson and Pettiway (2017) quantitatively and qualitatively studied visual expressions of Black identity on 46 African American museum websites, concluding that those elements promoted identity and provided counter-stereotypes.

Video or film content analyses might involve movies, entire newscasts, television advertisements, individual programs or episodes of streaming series, and many social media platforms (e.g., TikTok, Twitter, Snapchat, Instagram, LinkedIn, BeReal, Facebook, etc.). With movies, advertisements, and TV programs, scholars might code individual camera shots or scenes, particular sequences of scenes (e.g., in a dramatic program, only the time when the protagonist is on-screen with the antagonist), or entire dramatic episodes. While television programs, films, and video advertising content remain important areas of examination, social media video content analyses have proliferated because of high user engagement, their viral nature, and shareability.

The definition of communication content could be extended to include song lyrics, graffiti, or even gravestones (i.e., tombstone inscriptions indicate a culture’s views about the hereafter, virtue, redemption, etc.). In fact, if transcripts or audio recordings are available, interpersonal exchanges may be suitable for content analysis, and, as noted previously, analyses may extend to interpersonal interactions that take place in virtual spaces. Students of nonverbal communication may record encounters between people and designate how sequences of physical movements, gestures, and expressions constitute units of communication, whether in person, recorded online, or in virtual spaces.

More than two decades ago, the Internet became a focal point for content analysis, and social networking site studies have boomed more recently. Social media, broadly defined as including forms such as Twitter and Instagram, may be analyzed to explore “real-time” diffusion of news, though capturing relevant populations of these messages theoretically may be challenging (see Chapters 4 and 6). For example, health communication researchers have examined timely topics ranging from engagement with official COVID-19 TikTok dance videos (Li, Guan, Hammond, & Berrey, 2021), to weight loss surgery (Meleo-Erwin, Basch, Fera, & Smith, 2021), to human papillomavirus vaccine message strategies on YouTube in Korea (Kim, Lee, Heo, & Baek, 2021), and health and environmental communications in YouTube public service announcements about the global water crisis (Krajewski, Schumacher, & Dalrymple, 2019).

The use of the Internet and social networking sites for political communication has drawn considerable attention. Druckman, Kifer, and Parkin (2010, 2014, 2017, 2018) have documented the rapid growth in online political campaigning. Hale and Grabe (2018) examined visuals and text in subreddit forum posts for Donald Trump and Hillary Clinton in the 2016 election, finding consistent positive support for Trump, perhaps reflecting the young, male audience of Reddit. Coe and Griffin (2020) content analyzed Trump tweets during the first two years of his term, reporting that tweets about racial/ethnic groups accounted for two-thirds of tweets on marginalized groups and were negative in tone.

Numeric Values or Categories According to Valid Measurement Rules and Statistical Analysis of Relationships

Our definition specifies further that quantitative content analysis involves numeric values assigned to represent measured differences in symbols, with rules for assigning values detailed in the coding protocol. For example, a simple look at social media video advertising for representation and inclusion of racially diverse individuals might adhere to the following procedure. First, an advertisement receives a case number (001, 002, etc.), differentiating it from all other cases. Another number reflects the social media channel distributing the ad (1 = Instagram, 2 = Facebook, etc.), while a third number specifies the length of the video ad (e.g., 5 seconds, 8 seconds, etc.). A number is assigned reflecting the type of video advertising (1 = skippable streaming, 2 = non-skippable streaming, etc.), another number categorizes the product type being advertised (1 = clothing, 2 = beauty products, etc.), and yet another number indicates the total count of racially diverse individuals present. Different numeric values are assigned to differentiate African American individuals from Asian individuals, Hispanic individuals, and so on. Finally, coders might use a 1 (“low sexual appeal”) to 5 (“high sexual appeal”) rating scale, assigning a value to indicate the level of sexual appeal in the portrayal of an individual. The coding protocol presumably provides a detailed operational definition for “low” and “high” sexual appeal.
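As an illustration only, the numeric scheme just described might be recorded as one row per advertisement, with the protocol’s allowed values enforced when data are entered. The variable names, category labels, and value ranges in the sketch below are hypothetical, standing in for whatever a real protocol would specify.

# Hypothetical sketch of a single coded case under a protocol like the one described
# above. Variable names, category labels, and value ranges are illustrative only.
ALLOWED = {
    "channel": {1: "Instagram", 2: "Facebook", 3: "TikTok"},
    "ad_type": {1: "skippable streaming", 2: "non-skippable streaming"},
    "product_type": {1: "clothing", 2: "beauty products", 3: "food/beverage"},
}

def code_case(case_id, channel, length_seconds, ad_type, product_type,
              n_diverse_individuals, sexual_appeal):
    """Return one coded record, enforcing the protocol's numeric rules."""
    if channel not in ALLOWED["channel"]:
        raise ValueError("channel code not defined in protocol")
    if ad_type not in ALLOWED["ad_type"]:
        raise ValueError("ad type code not defined in protocol")
    if product_type not in ALLOWED["product_type"]:
        raise ValueError("product type code not defined in protocol")
    if not 1 <= sexual_appeal <= 5:
        raise ValueError("sexual appeal must be rated 1 (low) to 5 (high)")
    return {
        "case_id": case_id,                     # e.g., "001"
        "channel": channel,                     # 1 = Instagram, 2 = Facebook, ...
        "length_seconds": length_seconds,       # measured, not judged
        "ad_type": ad_type,
        "product_type": product_type,
        "n_diverse_individuals": n_diverse_individuals,
        "sexual_appeal": sexual_appeal,         # per the protocol's operational definitions
    }

record = code_case("001", channel=1, length_seconds=8, ad_type=1,
                   product_type=2, n_diverse_individuals=3, sexual_appeal=2)
print(record)

The point is simply that every number in the data set traces back to a rule in the protocol, which is what makes the coding replicable.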

A crucial element in the assignment of numbers involves the validity of assignment rules in the protocol and the consistency and reliability of their application. The rules must assign numbers that accurately represent the content’s meaning. For example, if a person on a social media video advertisement is assigned a 1 for low sexual appeal, the portrayal must be such that most viewers would perceive it as lacking sexual appeal. Creating rules to help coders assign numbers reliably is sometimes easy, but difficulties can arise in creating rules that reflect the “true” manifest meaning of the content (validity). Put another way, reliability and validity must be addressed with particular care when assignment of a numerical value is based not merely on counting (e.g., it is a skippable streaming ad or not) but on scores or ratings for perceived sexuality, identity, race, gender, and so on.

For example, consider how Álvarez, González, and Ubani (2021) examined gender roles and displays of primary characters in Graphics Interchange Formats (GIFs). The authors positioned gender as a dual construct, distinguishing perceived “sex” (i.e., biological) and “gender” (i.e., cultural presentation). In order to code reliably for measures of specific gender representations and stereotypes, the authors used student coders majoring in communication, provided extensive training with the protocol on non-sample GIFs, and used an online Slack channel to resolve discrepancies during training, retraining, and codebook refinement. Elements of gender representation included “overt sexuality” (sexually revealing clothing and nudity), “objectification” (breast size, appropriateness of attire and clothing), “type of setting” (workplace, home, etc.), and “gesture and nonverbal expressions” (type of gesture, facial expression, use of hands, gaze, etc.). The authors carefully developed a protocol creating mutually exclusive categories and extensive operational definitions for coders, thus contributing specificity and empirical data on stereotyping and gender representation using a numerical coding scheme.

Rather than using the close reading approach of, say, literary criticism, examining units of communication and offering qualitative assessment of what was observed, quantitative content analysis reduces units to numbers that retain important information about content units (e.g., how each scores on a variable and differs from others) but are amenable to arithmetical operations that can be used to summarize or describe the whole set. Chapter 5 more fully addresses the necessity of creating valid measurement rules.

Describing and Inferring

Simple description of content has its place in communication research. For example, applied content analysis research is often descriptive. As noted in Chapter 1, a Southern daily commissioned one of this volume’s authors to examine the paper’s treatment of the local African American community. The publisher’s goal was to respond to focus group complaints about underrepresentation in the


news pages and excessively negative coverage. The research involved analysis of six months of coverage and culminated in a report indicating what percentage of news stories focused on African Americans and what part of that coverage dealt with negative news. Other applied descriptive content analyses might be conducted by news organizations in order to bring news-site stories in line with reader preferences discovered via cookies, readership surveys, or focus groups. For example, if search and click histories show visitors accessed more items about topic X and fewer about topic Y, a careful site manager might examine the current level of each before selecting future topics for site postings. Public relations applications might involve profiling a corporation’s image on the business’s website. If particular angles in the organization’s publications are ineffective, change may be in order. Agency practitioners might analyze a new client’s web and mobile presence to evaluate and plan their actions. On the other hand, there are also instances in which description is an essential early phase of a program of research. For example, early researchers in mass communication found it useful to provide descriptive profiles of media, such as what percentage of space was being devoted to local news. More recent examples might focus on the number of likes and shares achieved by a political post or the number of reader comments on a news article posted on Twitter. In their study of 25 years of Journalism & Mass Communication Quarterly content analyses, Riffe and Freitag (1997) found that a majority of published studies might qualify as descriptive: 54% involved no formal hypotheses or research questions, and 72% lacked any explicit theoretical underpinning. Kamhawi and Weaver (2003) reported similar data. The lack of theoretical foundation is not unique to content analysis approaches. Potter (2018) studied 211 articles in 6 communication journals and found only 28% had a theoretical foundation. Researchers also continue to discover entirely new research domains with previously unexplored messages or content. Song lyrics have been examined for reaction to and commentary on social and political protests. For example, Mozie (2022) analyzed 66 rap songs released following George Floyd’s death in 2020, finding a significant negative correlation between negative affect toward the police and a desire for retaliation. Some descriptive data can be used for the second goal of content analysis specified in our definition: to draw inferences about meaning or infer from the communication to its context, of both production and consumption. In fact, simple descriptive data invite inference-testing (i.e., conclusions about what was not observed based on what was observed). A simple example is the “why” question raised even in descriptive content analyses. Why do food ads in popular US magazines underrepresent women of color? Why does a Southern daily provide so little “good news” of the African American community? Why do social media selfies reflect the strong gender stereotyping found in other media? Why are some digital news sites more linked-to by posters on Facebook than others? Why do so many long-standing journalistic practices and routines break down in crisis situations?

Social scientists using quantitative content analysis techniques generally seek to do more than describe. Content analysts—whether conducting applied or basic research—typically do not collect descriptive data and then ask questions. Instead, they conduct research to answer questions. In the case of basic research, that inquiry is framed within a particular theoretical context. Guided by that context, they select content analysis from a variety of methods or tools that may answer those questions. From their data, they seek to answer theoretically significant questions by inferring meaning or consequences of exposure to content or inferring what might have contributed to the content’s form and meaning.

The researcher must be guided by theory if they are to draw inferences from content about the consequences of the consumption of content or about the production of content. For example, Shin and Thorson (2017) based their study of sharing fact-checking messages on Twitter during the 2012 presidential election on social identity theory. A study of radio competition (Lacy et al., 2013) tested whether the financial commitment model (Lacy, 1992) can be used to explain variations in local government coverage by radio news.

These examples of inference-drawing suggest the range of appropriate targets of inference (e.g., the antecedents or consequences of communication, as discussed in Chapter 1). However, students with a grounding in research design, statistics, and sampling theory will recognize that there are other questions of appropriateness in inference-drawing. Conclusions of cause–effect relationships, for example, require particular research designs. Perhaps more basic, statistical inference from a sample to a population requires a particular type of sample (see Chapter 6). Also, use of certain statistical tools for such inference-testing assumes that specific measurement requirements have been met (Riffe, 2003, pp. 184–187; Weaver, 2003).

Issues in Content Analysis as a Research Technique

Interestingly, the strengths of quantitative content analysis (primarily its emphasis on replicability and quantification) are the focus of some criticisms of the method. Critics have argued that it places too much emphasis on comparative frequency of different symbols’ appearance. In some instances, they argue, the presence—or absence—of even a single particularly important symbol may be crucial to a message’s impact. Holsti (1969, pp. 10–11) described this focus on “the appearance or nonappearance of attributes in messages” as “qualitative content analysis” and recommended using both quantitative and qualitative methods “to supplement each other.”

A more important criticism repeated by Holsti (1969, p. 10) is the charge that quantification leads to trivialization: critics have suggested that because some researchers “equate content analysis with numerical procedures,” problems are selected for research simply because they are quantifiable, with emphasis on “precision at the cost of problem significance.” However, superficiality of

research focus is more a reflection of the researchers using content analysis than a weakness of the method. Trivial research is trivial research whether it involves quantitative content analysis, experimental research, or qualitative research. Some might argue that theory, as the rationale for a study, and validity, as the gold standard for data quality, are at risk given the rapid advances in computing capacity or advances in data-searching and data-analysis capabilities (Mahrt & Scharkow, 2013). Though researchers can swiftly collect vast amounts of online information via database searching, there are few parameters or guides to ensure quality, rigor, and consistent and complete sampling (Blatchford, 2020). Using these online tools, for example, researchers could collect millions of tweets exchanged during the Super Bowl, the Democratic National Convention, a royal marriage, a protest, or some other event. Frequency counts of words might be interpreted as proxies for public sentiment (Prabowo & Thelwall, 2009; Thelwall, Buckley, & Paltoglou, 2011), or linkage patterns among words might indicate relationships among attitude or opinion objects, though such linkages are arguably a poor substitute for the larger context surrounding a communication. The potential problems with this approach are readily apparent: for one, only about a quarter of the US population uses Twitter (Auxier & Anderson, 2021) and it is not clear how this group differs from the other three-quarters. Despite advances in computing capacity, it is difficult to ascertain whether any set of tweets actually represents all relevant tweets (Bialik, 2012). Again, the problem is not with the available and evolving tools, but how they are used; regardless of how they are gathered, social science data have always varied in quality and validity. Chapter 4 examines the use of computing in content analysis. Another criticism involves the distinction between manifest and latent content. Analysis of manifest content assumes that, with the message, “What you see is what you get.” The meaning of the message is its surface meaning. Latent analysis is reading between the lines (Holsti, 1969, p. 12). Put another way, manifest content involves denotative meaning—the meaning most people share and apply to given symbols. Given that “shared” dimension, it is rather curious to suggest that analysis of manifest content is somehow inappropriate. Latent or connotative meaning, by contrast, is the meaning given by individuals or small groups to symbols. The semantic implications notwithstanding, this distinction has clear implications for quantitative content analysis. Consider, for example, Kensicki’s (2004) content analysis of frames used in covering social issues (pollution, poverty, and incarceration), in which she concluded that news media seldom identified causes or effects for the issues, nor did they often suggest the likelihood that the problems could be solved. In the study, two coders had to agree on how to identify evidence pointing to the cause, effect, and responsibility for each of those issues. Discussing the “lone scholar” approach of early framing research, on the other hand, Tankard (2001, p. 98) described it as “an individual researcher working alone, as the expert, to identify

the frames in media content.” Tankard asked, “Does one reader saying a story is using a conflict frame make that really the case?” (p. 98; original emphasis).

The difference between latent and manifest meaning is not always as clear-cut as such discussions indicate. Symbols in any language that is actively used change in meaning over time. A manifest meaning of a word in 2023 may not have been manifest 100 years before. For example, ghosting, a word that has arisen with social media and online dating apps, means (to younger generations) to cut someone off from communication, typically after a date. Although this word was historically associated with the soul of a person after death, the new manifest meaning was added to dictionaries in the 2010s. To a degree, the manifest meaning of a symbol reflects the proportion of people using that symbol for that meaning. This somewhat arbitrary nature of language is made more concrete by the existence of dictionaries that explain and define shared meaning. Researchers need to be careful of the changing nature of symbols when designing content analysis research, particularly when employing longitudinal designs.

Language users share meaning, but they may also have idiosyncratic variations of meanings for common symbols. How reliable will the data be if the content is analyzed at a level that implicitly involves individual interpretations? We concur with Holsti (1969), who suggested that the requirements of scientific objectivity dictate that coding should be restricted primarily to manifest content; the luxury of latent meaning analysis comes at the interpretative stage, not at the point of coding.

Advantages of Quantitative Content Analysis of Manifest Content

The strengths of quantitative content analysis of manifest content are numerous. First, it is an unobtrusive, non-reactive measurement technique. The messages are separate and apart from communicators and receivers. Armed with a strong theoretical framework, researchers can draw conclusions from content evidence without having to gain access to communicators who may be unwilling or unable to be examined directly. As Kerlinger (1973, p. 525) observed, the investigator using content analysis “asks questions of the communications.” Second, because content often has life beyond its production and consumption, longitudinal studies are possible using archived materials that may outlive communicators, their audiences, or the events described in the communication content. Third, quantification or measurement by coding teams using a well-developed protocol permits reduction to numbers of large amounts of information or numbers of messages that would be logistically impossible to understand well with close qualitative analysis. Properly operationalized and measured, such a process of reduction nonetheless retains meaningful distinctions among data. Fourth, as shown in Chapter 1, the method is virtually unlimited in its applicability to a variety of questions important to many disciplines and fields because of the centrality of communication in human affairs.

Finally, because the reliability of content analysis data is invested in the protocol and not just the coders, the consistency of the application by many coders can be measured within and across studies using the same training and protocol. This ability adds to establishing the validity of the reliable variables in the protocol.

Researchers should heed Holsti’s (1969, p. 15) advice on when to use content analysis, which harks back to the criticism that the method’s focus on precision leads to trivial topics: “Given the immense volume and diversity of documentary data, the range of possibilities is limited only by the imagination of those who use such data in their research.” Holsti suggested three “general classes of research problems which may occur in virtually all disciplines and areas of inquiry” (p. 15). Content analysis is useful, or even necessary, when:

1 data accessibility is a problem, and the investigator is limited to using documentary evidence (p. 15);
2 the communicator’s “own language” use and structure are critical (e.g., in psychiatric analyses) (p. 17); and
3 the volume of material exceeds the investigator’s individual capability to examine it (p. 17).

Summary

If Holsti’s (1969) advice on when to use content analysis is instructive, it is also limited. Like so many of the definitions explored early in this chapter, its focus is primarily on the attractiveness and utility of content as a data source. Recall, however, the model in Chapter 1 on the centrality of communication content. Given that centrality, both as an indicator of antecedent processes and effects and consequences, content analysis is indeed necessary, and not just for the three reasons cited by Holsti.

As a social science method, content analysis needs to be systematic, following generally accepted design, coding, and analytical processes to generate data that will lead to valid conclusions about behavior and mental processing. The raw material of the data is composed of symbols in a variety of media, and the symbols must be assigned to numbers taking their context into consideration. Patterns in these numbers, and the corresponding content, are identified using statistics. Content analysis is crucial to any theory dealing with the impact or antecedents of content. It is not essential to every study conducted, but in the long run one cannot study communication without studying content that carries symbolic meaning. Absent knowledge of the relevant content, all questions about the processes generating that content—or the effects that content produces—are meaningless.
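Because this chapter notes that the consistency of coders’ application of a protocol can itself be measured, a minimal illustration may help. The sketch below computes simple percent agreement between two coders on the same set of units; the coder decisions are invented, and percent agreement is only the crudest index, since chance-corrected coefficients are taken up in Chapter 7.

# Minimal sketch: simple percent agreement between two coders who independently
# applied the same protocol to the same ten content units. Values are invented.
coder_a = [1, 2, 2, 3, 1, 1, 2, 3, 3, 1]
coder_b = [1, 2, 1, 3, 1, 2, 2, 3, 3, 1]

agreements = sum(a == b for a, b in zip(coder_a, coder_b))
percent_agreement = agreements / len(coder_a)

print(f"Agreement on {agreements} of {len(coder_a)} units "
      f"({percent_agreement:.0%})")  # 80% in this invented example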

3

Designing a Content Analysis

The previous chapter described how quantitative content analysis—a social science method like surveys and experiments—is used to describe phenomena, observe their interrelationships, and make predictions about those interrelationships. The process of learning empirically about a phenomenon consists of three phases or stages: conceptualization of the inquiry; formulation of a research design to gather needed information; and data collection and analysis to get answers.

Conceptualization occurs at an abstract level, where researchers formulate basic questions they want to answer: for example, “To what degree and why are social media messages misogynistic?” Such broad questions are not answered easily and often involve multiple research projects. Conducting the research requires a plan about the steps that are necessary to address more specific questions emerging from the broad, abstract question of interest. This plan is called the research design.

Any process that requires management and expenditure of resources (e.g., time, energy, funding) is more likely to reach its goals when broken down into the three steps mentioned above—steps analogous to stages a property owner might go through when building a structure. In fact, thinking about construction projects to learn about research projects is apt: in both cases, correct decisions produce things of enduring value and incorrect decisions produce regrettable costs.

A construction project begins with the property owner’s vision of the structure’s appearance and function—a vision parallel to the research conceptualization phase. A property owner might envision a home, a strip mall, or an office building. Similarly, a researcher will envision a variable being described, a hypothesis being tested, or a causal model that explores a sequence or process being estimated (e.g., A affects B, which affects C). The building construction vision includes general ideas about features dictated by the structure’s function or purpose, but must also consider context: a home in an industrial area would never do. In like manner, a research project takes account of its context: that is, previous research into which the study “fits.”


Designing a Content Analysis 37 Once a property owner gets beyond the general vision, a far more detailed planning process is necessary to illuminate the “how to” of the construction: precise architectural blueprints directly address how the goals for the structure result in decisions about materials, open spaces, entrances, and so on. This parallels the design stage in research, which requires decisions about data collection, measurement, analysis, and, above all, whether the design adequately addresses research questions and hypotheses. In both research design and building construction, “blue-sky” wishes and hopes must be fitted into the realities of what can be done, given available time and resources. Finally, the builder executes the architectural plan, providing even more detailed instructions for contractors, carpenters, electricians, plumbers, roofers, and others to make the structure a reality. Similar details in research design specify what data to seek, how to collect it appropriately, and how to analyze it statistically. Obviously, all three phases of the process are essential. The property owner’s vision provides the focus, direction, and purpose of the architectural plan. That plan is needed before the work can begin. And, finally, trained workers must carry out the plan reliably, or the structure will have flaws undermining its purpose. Similarly, content analysis research demands careful thinking about research goals and skilled use of data collection and analysis tools to learn about the phenomenon of interest. Conceptualization in Content Analysis Research Design A content analysis research design provides the link between conceptualization and data analysis. Research conceptualization involves addressing goals shown in Figure 3.1. What question is the research supposed to answer? Is the purpose of the study description or testing relationships among variables? Is the goal to illuminate causal relationships? From a larger perspective, where does this study “fit” in the communication process described in Chapter 1? Is the study all about content characteristics? Will it assess antecedent conditions that shape content? Will it assess how content produces particular effects? In the design, how will antecedents and/or effects be linked to the content variables? The focus on each or all of the purposes described in Figure 3.1 for a particular study affects the design of that study. For example, a content analysis designed to describe messages may require little more than counting. But a content analysis designed to test how myriad factors affect a particular content variable must collect and analyze data in a way different than simply describing one or more variables. And assessing how content variations may result from some cause or may affect some audience may require design decisions about collecting noncontent data.



Figure 3.1 Purposes of content analysis that guide research design

Despite the specific purposes of particular studies, all research projects aim to contribute to answering broader research questions that are often the basis for series of studies or even whole careers, depending on how abstract the question. For example, one of the authors of this book started with the question: “How does competition among journalism outlets affect the journalism they produce?” Other such questions include: “How does social media content shape young people’s perceptions of their peers’ behaviors?” and “How does use of social media affect political participation?” Moving from broad questions to research design requires more specific questions. For example, after asking the broad question about political participation, researchers might ask more particularly: “Does regularly attending to political tweets result in higher probability of voting?” and “Are people who view larger numbers of political videos on Facebook more likely to donate to political candidates?” These more concrete questions allow researchers to produce specific research designs because they suggest ways of measuring (viewing political videos) abstract concepts (use of social media) presented in the broad question. It is not the intent of this book to suggest specific research questions for researchers to pursue. Rather, the next section discusses some general ways in which content analyses may be grouped, and at the end of the chapter we provide a helpful structure for pursuing a multi-study research program.

A General Typology for Content Studies

Content analysis has been employed in three general ways: studies using content analysis only; studies incorporating content analysis into designs with other methods to explore influences on content; and studies using content analysis with other methods to explore content effects. The first approach is often atheoretical and aims to compare content against some normative standard, such as comparing the presence of a particular group (e.g., women) in media content with the prevalence of that group in the general population. The other two approaches advance knowledge best when based in theory. Because theories aim to predict and explain (Reynolds, 1971), their theoretical statements usually deal with causality. What factors cause content to take a given form? How does content result in a particular effect? Theory addresses relationships at two related levels—micro (individuals) and macro (collectives or groups). Macro-level studies encompass the actions of individuals (e.g., elections), but social forces within the environment influence those actions.

Content Analysis Only Designs

These studies typically take two forms. The first looks at one or more variables across time. For example, studies of images in media content are common and often track one or more variables across time to describe changes in content characterizations. Webb et al. (2017) examined female models' attributes and attire on covers of Yoga Journal between 1975 and 2015, concluding that more recent images emphasized thin and lean bodies, objectifying women.

The second kind of study that uses only content variables assesses a multitude of factors, observable through content analysis, for their influence on some other particular content variable of interest. For example, a team of scholars compared partisan and nonpartisan outlets as well as local and national outlets to examine patterns of incivility in comments on news outlets' Facebook pages (Yi-Fan Su et al., 2018). Another study that used only content analysis looked at how Instagram posts set newspapers' agendas in the 2016 presidential primaries (Muñoz & Towner, 2017).

Most of these studies are macro level because their focus is on the content produced or consumed by groups of people, and they assume or imply a causal sequence. However, they offer only correlations between content types, without empirically linking how what appeared in one set of content (the independent variable: e.g., a press release or a Twitter post) was consumed by another person and motivated that person to write a news story (the dependent variable). A true causal sequence can be established only by measuring independent variables hypothesized to influence production and then examining the connection between people's decisions as individuals and groups and the resulting content (the dependent variable).



Content Analyses and Influences on Content

This type of content analysis has exploded during the past three decades. Stimulated by Shoemaker and Reese's (1991, 1996, 2014) theorizing, researchers have examined how a range of individual and group behaviors, to name only two "levels" of influence, affect production of content. Schmuck and Hameleers (2020) studied Facebook and Twitter populist posts by Dutch and Austrian politicians and found that populist content was greater in both countries before elections than after, suggesting populism was more a strategy for generating votes than an ideology. Watson (2014) examined how journalists' individual-level political ideologies, environmental beliefs, and endorsement of different journalistic roles, along with the percentage of the local population identified as working in the oil industry, affected Gulf Coast newspapers' coverage of the 2010 BP oil spill.

Content Analyses and Content as Independent Variable

Two prolific lines of research—agenda-setting and cultivation research—have combined content analysis and survey methods to examine the influences of content. In first- and second-level agenda-setting studies, news content emphasis on issues at time 1 is assessed for its influence on audience issue priorities or thinking at time 2 (Kim, Kim, & Zhou, 2017; McCombs, Shaw, & Weaver, 2013). Cultivation studies have assessed how violence in mass media content affects both fear levels among audience members and their general ideas about social reality (Gerbner, Signorielli, & Morgan, 1995; Potter, 2014). Other studies have looked at information campaigns' effects on citizen knowledge and university donations (Ball-Rokeach, Rokeach, & Grube, 1984) and at "political knowledge gaps" resulting from differences in use of social media (Gil de Zúñiga, Weeks, & Ardèvol-Abreu, 2017).

Research Hypotheses and Research Questions

The designs of these study examples range from very simple to very complex. But as with other quantitative research methods, content analysis requires resources of time and effort that should be used efficiently and effectively. In particular, content analyses should not be carried out absent an explicit hypothesis or research question to guide the design of the inquiry. As McCombs (1972, p. 5) argued, a hypothesis (or, presumably, a research question) "gives guidance to the observer trying to understand the complexities of reality. Those who start out to look at everything in general and nothing in particular seldom find anything at all." Research lacking explicit a priori research questions and/or hypotheses may encourage a hunt for any statistically significant associations, which may be spurious, regardless of whether they can be explained by post hoc theorizing. To guard against post hoc hypothesizing and theorizing,

authors can pre-register their research questions and hypotheses, as well as their methodology, with AsPredicted (http://aspredicted.org) or the Open Science Framework (https://osf.io/2vu7m). These sites differ in how much privacy they give an author's plans.

Careful thinking about a problem or issue and review of previous related research are absolutely vital to the formulation of hypotheses or questions that are, in turn, essential to successful research design. Reviewing previous research provides guidance on variables to examine and on how—or how not—to collect data to measure them. Moreover, hypotheses formulated to build on or extend such research give guidance on how to measure the variables to be explored. An explicit hypothesis (or question) guides both data collection and variable measurement in good research design.

A hypothesis explicitly asserts that a state or level of one or more variables is associated with a state or level of one or more other variables. A hypothesis is appropriate where there is adequate theoretical or empirical support in the existing literature for specifying a relationship between two or more variables. A hypothesis may assert an actual causal relationship or merely a predictable association (discussed below). Which of these two assumptions is adopted will guide the research design. Hypotheses often take the form of conditional statements ("if X, then Y"). Here are two examples of published hypotheses:

    Females' Instagram selfies will reveal more of their unclothed bodies than males' Instagram selfies. (Döring, Reif, & Poeschl, 2016, p. 958)

    Gubernatorial candidates in more competitive contests will exhibit more personalization on social media. (McGregor, Lawrence, & Cardona, 2017, p. 269)

Note that each hypothesis identifies the content data the study must collect (selfies posted on Instagram and personalization on social media), and the independent variables (female/male and degree of electoral competitiveness) that are not content variables. Moreover, the first hypothesis "locks" the study into a two-level dependent variable (sparsely clothed versus fully clothed), while the second specifies two types of information (personalization versus nonpersonalization) on social media. Hypotheses commit the researcher to specific variables, guiding the design but also limiting the generalizability of results, even if the hypothesis is supported: that is, the first holds only for Instagram selfies, and the second holds only for personalizing information on social media such as Twitter and Facebook.

Research questions are more tentative because the researchers are unable to predict possible outcomes based on existing theoretical knowledge or empirical



evidence. Gil de Zúñiga et al. (2017, p. 110) posed a research question addressing the relationship between people's belief that they acquire news passively and their political knowledge:

    What is the relationship between individuals' perceptions that the news finds them and their political knowledge?

The authors did not use a hypothesis because existing research was inconsistent about this relationship. As government agencies began to use Twitter to communicate with the world, Sobel, Riffe, and Hester (2016, p. 86) asked:

    When faced with major world events, how do US Embassy Twitter feeds respond?

This was a very broad question because little literature existed to suggest a hypothesis or even guide the research question. The authors selected 14 world events (e.g., Obama's re-election, the London Olympics, etc.) during the time period studied and examined which received the most attention in embassy tweets.

An important difference between hypotheses and research questions is that the former usually indicate the direction of a relationship. In the second example above, increased electoral competitiveness results in more posts personalizing the candidate (McGregor et al., 2017). The ability to indicate direction comes from theory or existing research. However, research questions ask if a relationship of any kind exists.

Hypotheses and/or research questions not only define the nature of variables to be coded but also permit visualization of what kind of data analysis will address the hypothesis or research question. In the selfies hypothesis above, a two-by-two contingency table would display the proportions of sparse or fully clothed images produced by females and males. Riffe (2003, pp. 184–188) argued this "preplanning" is essential to effective data analysis. Moreover, such "preplanning" provides an opportunity to revise the study before the investment of time and money. Visualizing the analysis before the study begins can feed back into decisions about the wording of study hypotheses, about what content to examine, about the level at which content variables should be measured, and about the best analysis technique to be employed.

Correlation, Causation, and Design

Research design that flows from a hypothesis should deal explicitly with whether the study’s purpose is to demonstrate correlation or causation. In the first example above, can there be something about gender that causes differences in female and male self-presentation on Instagram? If so, we believe there is a causal

relationship at work. On the other hand, it could—more likely—be that there are no inherent causal gender differences. Rather, females' Instagram posts adhere more to gender stereotypes because of varying levels of internalized cultural expectations as to how different genders are performed. If we believe the latter to be the case, then we are exploring a correlation in the relationship between gender and self-presentation in Instagram posts. In both cases, we may be able to make good predictions, but only in the first case would we know why those predictions are good.

Correlation between two variables means that an increase or decrease in one variable is associated with an increase or decrease in another variable. If higher levels of one variable are associated with higher levels of some other variable, the correlation is positive (e.g., news start-ups with larger staffs publish more stories). If the correlation is negative, higher levels of one variable are associated with lower levels of the other (e.g., as the minutes of advertisements in a program increase, the time devoted to program content will decrease). The problem in inferring causation from some observed correlation is that the observed changes may be coincidental. In the summer, sales of ice-cream and murder rates are positively correlated with one another, but it is unlikely that ice-cream sales cause murders to spike, or vice versa. A third variable is likely the cause of change in both of the first two variables. By simply observing associations between variables, it is easy to leap to incorrect, spurious inferences.

A causal relationship, on the other hand, is a special kind of correlation that satisfies the logical conditions for inferring a necessary or sufficient connection between a change in one thing and a change in another thing. Prior theory must be consulted to assess whether particular influences are necessary or sufficient for the expected change to occur. If a particular factor is necessary, it must change or else the expected change cannot happen. If a particular factor is sufficient, its change may bring about the expected change, but that expected change may also happen because of changes in other factors as well. Three logical conditions must be met for such inferences.

One condition necessary for demonstrating a causal relationship is time order. The hypothesized cause should precede the effect. Suppose a researcher wanted to examine whether forcing online news commenters to sign in with their Facebook accounts would lead to more civility in news comment sections. A poorly designed study might develop a measure of civility and attribute the observed degree of civility to the change in how commenters sign in and are identified without measuring the degree of civility before the sign-in requirement was instituted (see Design A in Figure 3.2). A better study (Design B) would measure the degree of civility in news comments both before (at Time 1 [T1], the first point in time) the change (which occurs at T2) and after the change (at T3). This is a before/after design with a clear time order. It should be clear, however, that some other variable or variables might explain any change in civility between T1 and T3.



Figure 3.2 Research designs involving time order

The second condition necessary for a causal relationship is that a correlation between the variables actually exists. If we cannot observe both of the variables changing, or if we have a research design that does not permit one of the variables to change, we cannot logically infer a causal relationship. If, for example, we had no data on civility levels before the policy change that is comparable to data we have after the change to commenting policy, we could never infer a cause–effect relationship between requiring users to sign in and civility. So, we

must be able to observe that different levels or degrees of the cause are associated with observed levels or degrees of the effect. However, we now run head-on into the problem of possible "third variables" causing the change in the degree of civility. What if, between T1 and T3, in addition to the sign-in policy change (T2), a news event occurred evoking strong emotional reactions (e.g., Black Lives Matter or the #MeToo movement), which may have affected the tone of public discourse, at least around that event? If such a scenario occurred, it would be difficult to attribute the change from T1 to T3 solely to the change in sign-in policy, rather than to a change in public discourse due to the emotions stirred by the social movement.

To have greater certainty that the observed change is due to our independent variable, we need to use a multivariate design to bring such "third variables" under control. One approach is to identify two similar news organizations, with similar news commenting sections, only one of which made the change to require those posting comments to log in using their Facebook profiles (T2). We would measure the degree of civility in both organizations' news comments prior to the change (T1) and after the change (T3). The study will now have defined and ensured the necessary variation on a special independent variable (whose values are "sign-in change" and "sign-in did not change"). If there is a difference in the degree of change between T1 and T3 for the organization that changed its sign-in policy versus the organization that did not, the study will have found variation in the dependent variable that is related to variation in the independent variable, thereby supporting the inference of a causal connection. The change requiring news commenters to log in using their Facebook profiles likely influenced the change in the degree of civility.

This third requirement for demonstrating a causal relationship is the most difficult to establish, however. It involves the control of all (known and unknown) rival explanations for why changes in two variables are systematically and predictably related. Rival explanations are the full range of potential and possible alternative explanations and associated variables for what is plausibly interpreted as a cause–effect relationship between two variables. For example, in addition to requiring users to log in using their Facebook profiles, a news organization could have made other changes, such as introducing algorithms to block comments using uncivil language, encouraging more visible participation of moderators in comment threads, and so on. Thus, different rates of change at T3 could be due to one of these other factors or to some combination of all three changes (Facebook sign in, algorithmic filtering, and increased moderator activity). Researchers designing content analyses try to control as many factors as possible that might give plausible rival explanations for an observed relationship. Previous research and theory may give guidance on what rival explanations to control. Some studies may be able to remove such rival explanations through the logic of their research designs or by collecting the necessary data on them to permit their statistical control in the data analysis.
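The logic of this comparison design can be previewed before any coding begins. The sketch below is a hypothetical illustration only: the outlet labels, time points, and civility scores are invented placeholders, and the "civility" values stand in for whatever measure the coding protocol ultimately defines.

```python
import pandas as pd

# Hypothetical mean civility scores (0-10) for two similar outlets.
# Outlet A changed its sign-in policy between T1 and T3; outlet B did not.
coded = pd.DataFrame({
    "outlet":        ["A", "A", "B", "B"],
    "signin_change": [True, True, False, False],
    "time":          ["T1", "T3", "T1", "T3"],
    "civility":      [4.2, 6.1, 4.4, 4.6],
})

# Change from T1 to T3 within each outlet.
wide = coded.pivot(index="outlet", columns="time", values="civility")
wide["change"] = wide["T3"] - wide["T1"]
print(wide)

# A larger shift for the outlet that changed its policy supports, but does not
# prove, a causal inference; rival explanations must still be ruled out.
print("Difference in change (A minus B):", wide.loc["A", "change"] - wide.loc["B", "change"])
```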



It is impossible, however, for any single non-experimental study to control or measure every potential important variable that could influence a relationship of interest. Equally important, few phenomena are themselves the results of a single cause. This explains why most scientists are reluctant to close the door on any area of study. It also explains why scientists continue to "tinker" with ideas and explanations, incorporating ever more variables in their research designs, seeking contingent conditions, and testing refined hypotheses in areas in which the bulk of evidence points in a particular direction.

Sometimes simply going through the process of graphically identifying elements of a research study can help the researcher avoid pitfalls—and identify rival explanations. Alternative ways of illustrating research designs or of depicting the testing of various research hypotheses and questions—and the types of inferences that can be drawn—have been offered by Holsti (1969, pp. 27–41) and Stouffer (1977). Moreover, assuming a researcher wants to engage the phenomenon of interest across multiple studies, a graphical representation of findings such as a line drawing with arrowheads indicating tested relationships can help the researcher keep track of important variables whose interrelationships affect what they are trying to discover. This sort of modeling is discussed further in Chapter 9.

Good Design and Bad Design

For Babbie (2013) and Holsti (1969), research design is a plan or outline encompassing all the steps in research, ranging from problem identification through interpretation of results. Kerlinger (1973, p. 346) argued that the "outline of what the investigator will do from writing the hypotheses and their operational implications to the final analysis of the data" is part of research design. Holsti (1969, p. 24) described research design simply as "a plan for collecting and analyzing data in order to answer the investigator's question." A simple definition, yes, but its emphasis on utilitarianism—"to answer the investigator's question"—is singular and suggests the gold standard for evaluating research design. How can research be designed to answer a specific question? Holsti (1969, pp. 24–26; original emphasis) argued:

    A good research design makes explicit and integrates procedures for selecting a sample of data for analysis, content categories and units to be placed into the categories, comparisons between categories, and the classes of inference which may be drawn from the data.

To quote Stouffer (1977, p. 27), strong design ensures that "evidence is not capable of a dozen alternative interpretations." By careful design, the researcher

eliminates many of the troublesome alternative or rival explanations that are possible and "sets up the framework for 'adequate'" testing of relationships "as validly, objectively, accurately, and economically as possible" (Kerlinger, 1973, p. 301). Thus, the hallmark of good design, according to Kerlinger, is the extent to which the design enables the researcher to answer the question, controls extraneous independent variables, and permits generalizable results.

The emphasis in these definitions on "alternative interpretations" (Stouffer, 1977, p. 27) and "troublesome alternative . . . explanations" (Kerlinger, 1973, p. 301) reflects far more than intolerance for ambiguity. It captures the essence of what makes good, valid research design. Imagine that somewhere among all the communication messages ever created by all communicators, there are message characteristics or variables that would enable a researcher to answer a particular research question. Unfortunately, that same set of messages also contains information that is irrelevant to the researcher, the answers to countless other questions, and even answers that may distort the answer the researcher seeks. A good research design is an operational plan that permits the researcher to locate precisely the data that permit the question to be answered.

Elements of Research Design

Often the heart of a research design is some sort of comparison of content that has theoretical importance. In particular, content is often compared across time, across content-producing organizations, or among people. Note that such designs usually incorporate more than one hypothesis or question. Finally, where possible, research designs may usefully take advantage of existing data-gathering or variable-measurement techniques that have been successfully used in past research. In fact, this is most useful for building a body of integrated knowledge in social science research.

Comparisons may also be made between media (contrasting one communicator or one medium with another), within media (comparing various social media platforms or TV networks), between markets, between nations, and so on. Moreover, content analysts may link their research to other methods and/or other data, such as comparisons between content data and survey results (e.g., the agenda-setting studies discussed earlier) or between content data and non-media data (e.g., comparing minority representation in advertising with census data). The ability to study important phenomena increases with the triangulation of several data-collection methods, and our confidence in findings increases with a convergence of findings from data collected using different methods. Very powerful designs incorporate a number of design elements and data-gathering methods to address research problems.



Macro-level content analysis research often combines existing data with content analysis results to answer research questions. As daily newspapers continued to decline in number from the beginning of the 21st century, observers expressed concern about the impact of this process on knowledge about community institutions such as city government and schools. Some of the dailies were replaced with weekly newspapers, and an important question was whether the new weeklies could provide the community information that was needed. Lacy et al. (2012) used a national sample of newspaper content over 13 weeks and found that both dailies and weeklies were continuing to emphasize city government news, but daily newspapers provided greater coverage and a larger number and range of news sources in their reporting.

Although testing relationships among variables and comparing content among media and over time have been emphasized, the value and validity of so-called "one-shot" design studies that do not compare across media or time should be acknowledged. Such studies are important for many reasons raised earlier: their focus may be on variable relationships that do not involve time or comparisons; they may be crucial to a new area of inquiry; or the content being analyzed may be viewed as the consequence of antecedent processes or the cause of other effects.

Our emphasis on hypothesized relationships and research questions, however, is a product of years of working with students who sometimes prematurely embrace a particular research method or procedure without thinking through what it is actually useful for accomplishing. One of the authors of this text recalls overhearing a student telling a classmate that she had decided to "do content analysis" for her thesis. "Content analysis of what?" the second student asked. The first student's response was: "I don't know. Just content analysis." This is analogous to a child who enjoys pounding things and wants to hammer without thinking about what is being built.

A General Model for Content Analysis

Based on this volume's definition of content analysis and the need for careful conceptualization and research design, how should a researcher go about conducting a content analysis? Table 3.1 offers a design model involving primary and secondary questions that a researcher might ask or address at different stages. This model is organized under larger headings representing the three processes of conceptualization and purpose; design or planning of what will be done to achieve that purpose; and data collection and analysis. Although Table 3.1 suggests a linear progression—and certain steps should precede others—the process is viewed as recursive, in that the analyst must continually refer back to the theory framing the study and must be prepared to refine and redefine when situations dictate.

Table 3.1 Conducting a content analysis

  Conceptualization and Purpose
    Identify the problem
    Review theory and research
    Pose specific research questions and hypotheses

  Design
    Define relevant content
    Specify formal design
    Create dummy tables
    Operationalize (coding protocol and sheets)
    Specify population and sampling plans
    Pretest and establish reliability procedures

  Analysis
    Process data (establish reliability and code content)
    Apply statistical procedures
    Interpret and report results

Conceptualization and Purpose

What Is the Phenomenon or Event to Be Studied?

In some models of the research process, this is called "problem identification" or "statement of the research objective." Researchable problems may come from direct observation or may be suggested by previous studies or theory. Personal observation, or a concern with some communication-related problem or need, is always an acceptable place from which to start an inquiry. Observations can be about any form of communication: "Why do my friends who spend most of their time gaming act differently than my friends who spend most of their time on TikTok and Instagram?" "Can our different sources of news and information explain why my parents and I have different opinions about political issues?"

Once identified, a question should be broken into its constituent parts and possible relationships should be considered: "Where do my parents and I find our news?" "How does that news differ?" "What issues do we argue about?" "Do our positions make us act differently?" The goal in asking these sub-questions is to identify phenomena that might vary together (correlation), thus satisfying at least one condition necessary in a causal relationship. Asking questions is the first step, but immersion in relevant scientific theory and empirical research is the necessary follow-up.

How Much Is Already Known about the Phenomenon?

Have any studies of this or related phenomena been conducted already? Is enough already known to permit hypotheses and the testing of variable relationships, or is the study’s purpose more likely to be exploratory or descriptive?



Inexperienced—and even some experienced—researchers often approach this step too casually. The result may be a review of existing research and theory that excludes knowledge crucial to framing the problem properly. The incomplete review of existing knowledge occurs mostly for one or a combination of five reasons: (a) overdependence on web searches or computer indexes that may be incomplete (some may not include all relevant journals or all volumes of those journals); (b) exclusion of important journals from the review; (c) unfamiliarity with scholarship from other fields; (d) impatience to get on with a project before examining all relevant materials; and (e) belief that only recent research is relevant and useful.

What Are the Specific Research Questions or Hypotheses?

Will the study examine correlations among variables or will it test causal hypotheses? Will its purposes include inference to the context of message production or consumption? It is at the conceptualization stage that many studies are doomed to fail simply because the researcher may have spent insufficient time thinking about the existing research. This step includes identification of key theoretical concepts that may be operative and may involve a process of deduction, with the researcher reasoning what might be observed in the content if certain hypothesized relationships exist. Moreover, a study's publication and contribution success are related to how well it fits into the context of past research, adding to the body of knowledge, refining some concept of interest, qualifying past assumptions or findings, or even correcting conceptual confusion or methodological mistakes.

In sum, conceptualization involves problem identification, examination of relevant literature, a process of deduction, and a clear understanding of the study's purpose. That purpose will guide the research design.

Design

What Content Will Be Needed to Answer the Specific Research Question or Test the Hypothesis?

Will newspaper content, broadcast video, multimedia, social media, or some other form of communication content be involved? What resources are available and accessible and over what time period? Most importantly, what specific units of content will be examined to answer the question? Another issue that arises during this planning and design phase has to do with the availability of appropriate materials for analysis (e.g., newspapers, videos, texts, web pages, tweets). A disproportionate number of content analyses still examine traditional print media—newspapers in particular—due in part to the fact that newspapers are better indexed and archived in databases that are widely available

in many libraries. Video (as opposed to print transcripts), audio, website, and social media data all pose challenges in terms of accessing the appropriate content to answer a research question because they are less often indexed and archived. Nonetheless, logistical and availability factors should not be as important in planning as the theoretical merit of the research question itself, and the study should reflect how today's audiences are interacting with and creating content. However, not all researchers have unlimited resources or access to ideal materials for content analysis. Therefore, to be realistic, every design phase should involve some assessment of feasibility and accessibility of materials.

What Is the Formal Design of the Study?

How can the research question or hypothesis best be tested? How can the study be designed and conducted in such a way as to assure successful testing of the hypothesis or answering the research question? Recall an earlier observation that good research design is an operational plan for the study that ensures that the research objective can be achieved. Recall also that the formal content analysis research design is the blueprint for execution of the study. It is directed by what occurred in the conceptualization process, particularly the decision to propose testable hypotheses or pursue a less specific research question. Each of these objectives suggests particular decisions in the study design process, such as a study's time frame (e.g., a study of tweets before and after the platform doubled its character limit to 280 characters), how many data points are used, or any comparisons that may be involved, whether with other media or other sources of data.

Many content analysts find it useful at this point in planning to preplan the data analysis. Developing "dummy tables" (see Table 3.2) that show various hypothetical study outcomes, given the data collected for study variables and their measurement levels, can help the researcher evaluate whether study design decisions on these matters will even address the hypothesis or the research question. Table 3.2 breaks down characters in a TV program by gender, race, and whether the character speaks or not. At this point, some researchers realize that their design will not achieve their goal; better now, however, than later.

Table 3.2 Example of a dummy table

                        Character Has
  Character Is          Speaking Role    Nonspeaking Role
  Female of color            ?%                ?%
  Male of color              ?%                ?%
  White female               ?%                ?%
  White male                 ?%                ?%
  Total                     100%              100%
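Once coding is under way, the dummy table can be filled in directly from the coded data. The snippet below is a minimal sketch rather than part of any published protocol; the file name, column names, and category labels are hypothetical placeholders for whatever the coding sheet actually contains.

```python
import pandas as pd

# Hypothetical coding-sheet export: one row per character coded.
coded = pd.read_csv("coded_characters.csv")  # columns: character_group, speaking_role

# Column percentages mirroring Table 3.2: each role column sums to 100%.
dummy_table = pd.crosstab(
    index=coded["character_group"],   # e.g., "Female of color", "White male", ...
    columns=coded["speaking_role"],   # e.g., "Speaking", "Nonspeaking"
    normalize="columns",              # proportions within each column
) * 100

print(dummy_table.round(1))
```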



How Will Coders Know the Data When They See It?

What units of content (words, square inches, tweets, video scenes, etc.) will be placed in the categories? The analyst must move from the conceptual level to the operational level, describing abstract or theoretical variables in terms of actual measurement procedures that coders can apply. What sorts of operational definitions will be used? What kind of measurement can be achieved (e.g., simple categories, such as male or female characters; real numbers, such as story length; or ratings for fairness or interest on a scale)?

"Units of content" vary in terms of ease of identification or specification. Television programs and movies are discrete packages of content, and they contain scenes, which can be easily identified. Commercial websites are similar, with discrete articles, posts, and visuals (photographs and videos). Social media can be more difficult because individual users may not share a common definition of what makes a content unit. For example, Cortese et al. (2018) studied how young women presented their smoking behavior on Instagram. They settled on and searched for 18 hashtags that would likely yield images of women smoking and then analyzed the images for a number of characteristics. Thus, the first step was finding posts (unit) and then images (another unit). The process of finding and generating units is discussed further in Chapter 5.

The heart of a content analysis is the content analysis protocol that explains how the variables in the study are to be measured and recorded on the coding sheet or other medium. It is simple enough to speak of abstract concepts such as a tweet's emotional valence, but a coder for a Twitter content analysis must know what it looks like in text.

How Much Data Will Be Needed to Test the Hypothesis or Answer the Research Question?

What population of communication content units will be examined? Will sampling from that population be necessary? What kind of sample? How large a sample? A population of content is simply the entire set of potential tweets, broadcast programs, documents, web pages, and so on within a pertinent time frame (which is, of course, also an element of design). When appropriate, researchers use representative samples of the population rather than examining all the members. However, in some situations, sampling is inappropriate. If the focus is on a particular critical event (e.g., the September 11 terrorist attacks or a major oil spill) within a specified time period, probability sampling might miss key parts of the coverage. Or, if one is working with content that might be important but is comparatively scarce (e.g., sources cited in early news coverage of AIDS), one would be more successful examining the entire population of AIDS stories. Chapter 6 discusses sampling in more detail.

How Can the Quality of the Data Be Maximized?

The operational definitions will need to be pretested and coders will need to be trained in their use. Before and during coding, coder reliability (or agreement in using the protocol procedures) will need testing. Chapter 7 addresses the logic and techniques of reliability testing.

Many researchers test coding instructions during the process of developing them. Then coders who will be applying the protocol are trained. A pretest of reliability (how much agreement is there among the coders in applying the protocol?) may be conducted and the instructions refined further. We emphasize here that maximizing data quality by testing reliability, achieving reliability, and reporting reliability is necessary in content analysis research. Lacy and Riffe (1993) argued that reporting content analysis reliability is a minimum requirement if readers are to assess the validity of the reported research.
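As a rough illustration of what a reliability pretest involves, the sketch below computes simple percent agreement between two coders on one variable. The coder decisions shown are invented, and simple agreement is not corrected for chance; Chapter 7 presents the coefficients (e.g., Krippendorff's alpha) that should actually be calculated and reported.

```python
# Hypothetical pretest: two coders apply the protocol to the same 10 units.
coder_a = ["civil", "uncivil", "civil", "civil", "uncivil",
           "civil", "civil", "uncivil", "civil", "civil"]
coder_b = ["civil", "uncivil", "civil", "uncivil", "uncivil",
           "civil", "civil", "civil", "civil", "civil"]

# Proportion of units on which the two coders assigned the same value.
agreements = sum(a == b for a, b in zip(coder_a, coder_b))
percent_agreement = agreements / len(coder_a)
print(f"Simple percent agreement: {percent_agreement:.0%}")

# Low agreement at this stage signals that protocol definitions need refinement
# and that coders need further training before full coding begins.
```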

Data Collection and Analysis

What Kind of Data Analysis Will Be Used?

Will statistical procedures be necessary? What statistical tests are appropriate once content analysis data have been collected? A number of factors influence the choice of statistical tests, including level of measurement and type of sample used. (Inferential statistics are inappropriate when using a population or nonrandom sample.) Some content analyses involve procedures of varying complexity that examine and characterize relationships among and between variables.

It is helpful to think before the study begins about whether particular statistical analyses have requirements. For example, one may be interested in using hierarchical linear modeling (HLM) to estimate separate individual (level 1), organizational (level 2), and national (level 3) influences on how reporters cover social protest. HLM requires a minimum number of observations, which varies based on the source/author consulted (Hox, Moerbeek, & Van de Schoot, 2017), in order to produce stable estimates of effects observed at the different levels. These requirements are rarely met with "naturally occurring" data; thus, they need to be considered as part of the formal design process. That said, other studies merely report simple percentages or averages. These issues are discussed in detail in Chapter 9.
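For instance, a researcher anticipating a multilevel analysis can sketch the model in advance to see what the data must contain. The snippet below is a hypothetical two-level illustration (stories nested within outlets) using the statsmodels library; the file and variable names are placeholders, and a fully nested three-level design of the kind described above would require additional specification or dedicated HLM software.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per coded story, with a numeric framing score,
# a story-level predictor, and the outlet that published the story.
stories = pd.read_csv("coded_stories.csv")  # columns: framing_score, reporter_experience, outlet

# Two-level random-intercept model: stories (level 1) nested within outlets (level 2).
# Sketching this early shows how many outlets, and how many stories per outlet,
# the design must deliver to produce stable estimates.
model = smf.mixedlm("framing_score ~ reporter_experience",
                    data=stories,
                    groups=stories["outlet"])
result = model.fit()
print(result.summary())
```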

Has the Research Question Been Answered or the Research Hypothesis Tested Successfully?

What are the results of the content analysis and any statistical tests? What is the importance or significance of the results? Interpreting and reporting the results is the final phase. It enables scientists to evaluate and build on the work of others. The actual form of the report



depends on the purpose of the study and the appropriate forum (a thesis committee, the readers of a trade publication, colleagues in the academic community, etc.). The importance of a research finding is determined by connecting the found relationship with the problem that underlies the research. A relationship can be statistically strong but have little importance for scholarship or society. The importance of a research finding cannot be determined statistically. Rather, it is determined by the finding's contribution to developing theory and solving problems. Only when the statistical measures of strength of a relationship are put in the context of theory and existing knowledge can importance be evaluated.

Research Program Design

One of the authors of this text routinely asks doctoral students (and new doctoral program graduates seeking a job), "What is your dependent variable?" This question implies a lot. Does the researcher have an enduring focus on a particular research problem or phenomenon? Is there an overall theoretical coherence to this focus and research? Is there a "fire in the belly" that will motivate study after study to illuminate and understand some part of the communication world?

The focus of this chapter has been on the design of a single study. But if a body of knowledge is to be built in communication science, multiple studies will be needed. Recall that the goal of science is prediction, explanation, and control of some phenomenon. Almost always, multiple causes of varying strengths interacting under varying conditions will be affecting that phenomenon. No study (and no single researcher) can hope to illuminate them all. So, this chapter concludes not with a suggested program of research, but with a brief suggestion for how a researcher (or researchers) might organize such a program of research.

Shoemaker and Reese (1996, 2014) have provided a general organizing framework for a research program that applies to studies about antecedents and content. They argue that five levels of influence affect content variation, and conceptualize these levels in terms of higher-level constraints that limit the freedom of lower-level factors. More broadly, one can see the possibility of higher-level factors actively influencing lower-level ones. Indeed, there may be conditions under which lower-level factors are impervious to higher-level ones, or may even affect them. Studies can be conducted within and across these levels, all focused ultimately on some dependent content variable of interest. The five levels emphasized by Shoemaker and Reese are: individual media worker characteristics; media organization routines; media organization characteristics; the environment of media organizations; and societal ideology and culture.

Media Worker Characteristics

These personnel have the most direct, creative influence on content. Variables possibly influencing their work include demographic ones, such as gender, race, and other variables that are usually the focus of sociological research. Political orientation, values, attitudes, and the like, as well as psychological processes, might also be investigated for their influence on content. For digital media studies involving blogging or social network content, this level may engage an entire research program.

Media Organization Routines

Routines are the repeated patterns of interaction that enable an organization to function and reliably achieve its goals. Such routines may include deadlines for media content production, expectations for content amounts and packaging, and publication cycles. Shoemaker and Reese (1996, 2014) identify routines involving news sources, audiences, and processors. Obviously, such routines will differ across organizations producing the same kind of media content (e.g., news) and across organizations producing different kinds of content (e.g., news outlets and advertising agencies). Research programs may focus within this routine level or on how such routines affect the work of those who directly produce content.

Media Organization Characteristics

A media organization's characteristics may include its goals and resources, the way it allocates rewards and punishments for achieving or thwarting those goals, the internal differentiation of power and responsibility within the organization, and the way it interacts with the external environment. Interaction with other organizations and institutions encompasses resource dependencies and relative power. Research programs may focus on different organizational characteristics and on how variations across those characteristics influence organizational routines and media worker outputs.

Media Organization Environments

This level includes both other organizations and social institutions that affect the work of media organizations. Research programs may focus, for instance, on how laws and governmental regulations influence a media organization, or on how the organization copes with competitors, critics, and interest groups. For example, news organizations and public relations firms operate in very different environments.



Societal Ideology

Shoemaker and Reese's (1996, 2014) approach focuses on ideology as a way in which dominant economic interests influence organizational environments, organizational characteristics, routines, and media workers. More broadly, however, research programs may focus on how national cultures or even subcultures within a nation may influence content. Research programs at this level are likely to be heavily international, but studies of communications within a single country's subgroups may be undertaken, too. Given what is aptly called the "World Wide Web," studies of the Internet may produce examples of new cross-national cultures being formed or of the growth of subcultures within a nation that are focused on particular beliefs or values.

Summary

Content analysis involves conceptualization, design, and execution phases. The research design of a study is its blueprint—the plan specifying how a particular content analysis will be performed to answer a specific research question or test a specific research hypothesis. Design considerations include time, comparisons with other media or data sources, operationalization and measurement decisions, sampling, reliability, and appropriate statistical analysis. Ultimately, good research design can be evaluated in terms of how well it permits answering the research question and fulfilling the study's purpose.

Research design is like any other process—the more a person does it, the more adept she or he will become. Inexperienced researchers may find it difficult, but the remaining chapters in this volume will help them put flesh on the research design skeleton presented in this chapter. Also, reading existing research and soliciting advice from colleagues are both useful means of developing a design that will accomplish a study's goals.

The importance of research design in the success of any project cannot be overemphasized. Embarking on a content analysis without a well-thought-out design is equivalent to leaving New York City with the goal of reaching a specific address in Seattle but not having GPS software or a map to help. The odds against arriving at the desired destination are very high indeed.

4

Computers and Content Analysis

The use of computers to analyze communication content can range from querying a database of news articles for coverage related to the Black Lives Matter movement (Mourão, Kilgo, & Sylvie, 2018) to searching TikTok for videos with COVID-19-related hashtags (Li, Guan, Hammond, & Berrey, 2021). A "hybrid" content analysis might employ computer programming to "freeze" dynamic or fluid online content and create a database for later analysis (Zamith, 2017) or use a Python script to parse data in a large Twitter data set to streamline subsequent manual coding (Lewis, Zamith, & Hermida, 2013). These hybrid content analysis approaches capitalize on the strengths of computational methods to capture, store, and parse "big data." But to be considered content analysis—as defined in Chapters 1 and 2—the research design should ultimately rely on human coders applying a predetermined codebook or coding protocol to categorize media content for the purposes of making valid inferences about that content.

On the other hand, using computer algorithms to tally the frequency of terms in nearly one-and-a-half million news articles to explore how diversity in coverage relates to increasing media concentration within markets is more appropriately labeled automated or algorithmic text analysis (ATA) (Hendrickx & Van Remoortere, 2022). This term can describe any computer-based analysis of content that does not use human coders. It is possible, however, to use human-coded data to help "train" a computer to perform specified functions automatically in classifying content (Su et al., 2016).

Computers can be used to enhance content analysis in a number of ways (the sketch following this list illustrates several of them):

1. To access or gather content (includes retrieving content from databases, freezing dynamic content, creating custom databases, scraping data, etc.).
2. To parse content into data tables for analysis.
3. To sort and/or filter content, including sampling.
4. To organize coding tasks (e.g., create dynamic interface or coding template).
5. To validate codes entered on coding sheet.
6. To code content (i.e., ATA).
7. To analyze coded data.
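As a concrete illustration of uses 1 through 5, the sketch below shows one way a hybrid design might turn a file of collected posts into a random sample and a blank coding sheet for human coders. It is a hypothetical example, not a published workflow: the file names, field names, keyword, and code values are placeholders, and the same logic would apply to news articles retrieved from a database.

```python
import json
import pandas as pd

# 1. Access/gather: load posts previously scraped or retrieved and saved as JSON.
with open("collected_posts.json", encoding="utf-8") as f:
    posts = json.load(f)  # assumed: a list of dicts with "id", "date", and "text" keys

# 2. Parse: arrange the content in a data table.
table = pd.DataFrame(posts)

# 3. Sort/filter and sample: keep posts mentioning the topic, then draw a random sample.
relevant = table[table["text"].str.contains("vaccine", case=False, na=False)]
sample = relevant.sample(n=min(300, len(relevant)), random_state=42)

# 4. Organize the coding task: export a coding sheet with empty columns for human coders.
sample.assign(relevant_code="", tone_code="", coder_id="").to_csv("coding_sheet.csv", index=False)

# 5. Validate codes: once coders return the completed sheet, flag entries that fall
#    outside the values allowed by the protocol.
completed = pd.read_csv("coding_sheet_completed.csv")
allowed_tones = {"positive", "negative", "neutral"}
invalid = completed[~completed["tone_code"].isin(allowed_tones)]
print(f"{len(invalid)} rows have missing or invalid tone codes")
```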




In addition to elaborating on these uses, this chapter draws methodological distinctions among content analysis (using human coders), ATA (using automated algorithmic coding), and hybrid designs (using computers to improve efficiency and reliability but still relying ultimately on human coders who recognize meaningful context and the complexities of human language). It does not provide a "how-to" guide to using computers in content analysis. Rather, it presents an overview of how computers can improve efficiency and reliability in content analysis, along with issues to consider in deciding whether and how to use computers in a content analysis research design. Finally, it outlines some contemporary challenges content analysts face in an increasingly privacy-conscious era, as organizations limit the use of application programming interfaces (APIs) to access digital data.

Even the most passionate content analyst might agree that, when done correctly, content analysis can be a resource-intensive slog, to the point of discouraging its use (Conway, 2006). Students, in particular, may gravitate toward ATA because they perceive it as being "easier" in terms of time and cost. However, while ATA is often more efficient than human coding, it is far from easy. The technical understanding and abilities needed to produce valid data using ATA are significant and expose the dilettante who is simply pursuing what they perceive as "easy" (i.e., less work). Moreover, this is increasingly true as ATA approaches come to rely on complex natural language processing algorithms to get at deeper meanings in human language. These algorithms are far more complicated than earlier "dictionary"-based approaches that measured sentiment based on the textual frequency of words classified as "negative" or "positive" (by the researcher who developed the study's "dictionary"). These recent developments reinforce the fact that the perceived ease, or even efficiency, of ATA methods is an insufficient justification for including them in a research design, even in the most resource-constrained research programs.

As is true for all rigorous social-scientific inquiry, decisions on whether and how to use computers in a research design should be made on the basis of several considerations, the least important of which is efficiency:

1. Will the use of computers yield valid measures?
2. Do those measures meaningfully address the research questions posed by the study?
3. Will the use of computers improve reliability?
4. Will the use of computers improve efficiency (i.e., save time and/or money)?

The goal of social-scientific research should always be to produce valid conclusions, though too often validity is not directly addressed by content analysts



(see Chapter 8, this volume). An algorithmic coder may be used in such a way that it improves efficiency, but at significant cost in terms of the validity of the study's data. In such instances, improved efficiency or even improved reliability (see Chapter 7, this volume)—the two primary strengths of algorithmic coders—cannot justify the use of computers in a research design.

That said, if the four considerations outlined above are met, algorithmic coding may provide other benefits in addition to improved efficiency and reliability. For example, computer codes that direct the algorithmic coder's classifications must be unambiguous and exhaustive; thus, the algorithmic coder can be more transparent and replicable than its human counterparts. If human coders are provided with ambiguous and incomplete codebooks, this issue may be "remedied" informally during the human coder training process (Hak & Bernts, 1996) as coders discuss and reach agreement on what they think a protocol definition means. Whether such remedies are documented in the study report is an empirical question for students of best practice.

Distinguishing Algorithmic Text Analysis

If a study primarily uses an algorithmic coder—that is, a computer application that assigns numeric values to communication content based on either a pre-programmed set of rules or a machine-learning approach—it is not "content analysis" because: (1) it follows different processes/methods; and (2) these distinct processes have unique implications for the validity of the data (Lacy, Watson, Riffe, & Lovejoy, 2015). Because it follows a separate set of processes (i.e., it is a distinct research method), a separate term is needed for the use of algorithmic coders. The phrase "algorithmic text analysis" (ATA) was used above, with "text" broadly defined as encompassing any fixed form of communication that can be analyzed. This term has been used by others to describe a broad range of analytical methods that rely on algorithmic coders (Tenenboim-Weinblatt & Baden, 2021; Yarchi, Baden, & Kligler-Vilenchik, 2021; Neff, 2020); it is also synonymous with "automated text analysis" (Hendrickx & Van Remoortere, 2022; Song et al., 2020).

ATA is not synonymous with "computer-aided text analysis" (CATA) (see, e.g., Neuendorf, 2017), because it is possible to use computer aids to freeze dynamic web content (Zamith, 2017), to retrieve or sample data from databases (see Chapter 6, this volume), or to sift and sort data and organize the coding task and store coded data, all while still relying primarily on human coders. Thus, CATA does not truly distinguish the use of an algorithmic coder as following a different set of research processes. Blending the use of computers and computer programming with human coding is what Lewis, Zamith, and Hermida (2013) termed a "hybrid" approach to conducting content analysis. Such approaches should be encouraged because they have the potential to improve efficiency and reduce human errors
(e.g., a computer can prescribe the possible codes for a given variable and flag mis-keyed entries). That said, it is important to distinguish studies of communication content that rely on algorithmic coders to do the actual classification and assign numeric values as a distinct method—or, probably more accurately, a distinct set of methods. The purpose in delineating what is not content analysis is not to make claims as to the superiority of one method or another, or otherwise create false divisions between scholars who use different methods. Labeling these as distinct research methods draws attention to the fundamentally different research designs that are required when using human versus algorithmic coders and identifies distinct concerns that these methods raise about the validity of data and the inferences that can be drawn based on those data. Practitioners of either method can learn from the other’s insights and best practices.

Unique to ATA is the requirement that the data an algorithmic coder will classify be carefully accessed, cleaned, and prepared for analysis. It is important, of course, in both approaches to ensure that relevant data are accessed. When using an algorithmic coder, however, the precision of keywords used to sample content from databases may assume greater importance. Obviously, the first step in content analysis is eliminating irrelevant articles from the sample. Often the first variable in a human coding protocol is simply whether the content is relevant (e.g., has a search for the keyword “cancer” in a study of the disease erroneously identified the constellation; see Stryker, Wray, Hornik, & Yanovitzky, 2006). Coding software, however, has more difficulty selecting relevant content because it tends to be less flexible than humans in understanding context. Therefore, ATA will likely code more “false positives,” potentially introducing significant errors into the data. Researchers using human coders, by contrast, can sacrifice the narrow precision of ATA for the sake of broad recall: because human coders can pick out false positives, a broader search that retrieves the greatest number of potentially relevant articles is preferred.

Another process that distinguishes ATA from content analysis is “stemming” the text, so that noun, verb, and adjective forms of a single root word (e.g., “bike,” “biking,” and “bikeable”) are coded as pertaining to the same activity (Grimmer & Stewart, 2013). Function words that serve grammatical purposes but do not convey relevant meaning (e.g., definite and indefinite articles) are also typically removed from text before it is analyzed.

The fact that content analysis and ATA are distinct methods calls into question the idea that they should necessarily produce equivalent data and that potentially “imperfect human annotations” or coding should nonetheless be considered the gold standard for validating ATA (Song et al., 2020). To clarify: a valid criticism of many algorithmic analyses is that they cannot reliably classify more complex, nuanced forms of human language, such as irony, sarcasm, and humor, so many researchers implicitly assume that human coders are better at classifying the “true” meaning behind these forms. Indeed, human-coded data on these forms is
often used to train and (if validity is addressed at all) validate algorithmic classifiers. But is human coding really the gold standard? According to Song et al.’s (2020) review of algorithmic studies of political communication, only 58% reported efforts to validate the algorithmic methods. Of those, 88% used human coders to validate algorithmic results. Yet, only 38% of studies using human validation reported reliability between and among human coders—testament to the lack of scrutiny applied to the assumption that human coding is the gold standard. Song et al. do not argue against human coders being used to validate algorithmic approaches (or to train algorithmic classifiers), but they do observe that “actual practices of the proper validation of automated text analysis,” and discussions of best practice, have “profound implications” (p. 563). Furthermore, they argue that all ATA studies should report validation metrics; and, where human coders are the gold standard, that performance by such coders must also be validated, consistent with the best practices of content analysis described within this volume (see Appendix B). Human coders can, after all, be sources of error or invalidity, due perhaps to insufficient training with the protocol, “coding drift” due to fatigue in coding a large data set over an extended period, and so on.

So, should we consider other gold standards—“some forms of ‘objective,’ or intersubjectively valid, measurements that serve as a reference” (Song et al., 2020, p. 551)—when assessing the validity of ATA data? In one creative approach, Pilny, McAninch, Slone, and Moore (2019) used a supervised machine-learning approach to study “relational uncertainty” (individuals’ confidence in commitment level in a relationship) in online relationship forums. They used a human-coded data set from forum postings to train the algorithmic classifier. After establishing human–machine reliability in coding online postings, they created an “independent test set” to validate their data: they recruited survey participants who were in relationships to write at least 200 words about the certainties and uncertainties in their current relationship and to complete a questionnaire measuring their perceived levels of relational certainty. These measures were positively correlated with those generated by algorithmic analysis of the participants’ essays. The authors acknowledge the limitations of this approach, but argue that, as with content analysis, those using ATA must attempt to establish the validity of their data. They quote Grimmer and Stewart (2013, p. 269), who wrote that scholars must “validate, validate, validate.” That said, Song et al. (2020) argue that human coders need not—and, in some instances, should not—be considered the gold standard against which algorithms are validated.

Advantages and Disadvantages of ATA

As noted above, ATA has some distinct advantages over its human counterpart. Primarily, scholars using “traditional” tools would find it very difficult, if not
impossible, to analyze the large data sets that algorithmic coders are able to analyze very quickly in “big data” research projects. Freelon, McIlwain, and Clark (2018), for example, analyzed more than 40 million tweets about Black people killed by police to examine how Black Lives Matter activists, political conservatives, and “unaligned” Twitter users wielded different types and degrees of social power in online conversations about police violence. It is difficult to imagine analyzing such a huge data set using human coders. By contrast, the algorithmic coder makes this a simple task due to vastly improved coding speed, with the added benefit of substantial cost savings, assuming access to computing infrastructure.

The algorithmic coder is also perfectly reliable: given the same rules and the same content, it will always assign the same codes. The validity of ATA data, however, is still determined by the validity of the conceptualization of the variable defined by the researcher, as well as the validity of the operationalization in the code that the algorithm follows. Given that an algorithmic coder relies on human programming, the algorithmic coder is no more objective than a human coder. That said, unlike humans, who are susceptible to fatigue, human error, and so on, the algorithmic coder always executes commands with perfect fidelity. Furthermore, artificial intelligence is improving computers’ capacity to process complex human language peppered with colloquialisms, satire, irony, and emotion. Add this to the algorithmic coder’s efficiency and reliability, as well as advances in natural language processing, and it is probable that human coders and “traditional” content analysis will become obsolete at some point in the future.

Practically speaking, though, that day is still quite a long way off. Human coders will continue to apply protocols for the foreseeable future, for a range of reasons. First, as previously discussed, the development of many of these computational methods still relies on human-coded training sets that are used in supervised machine learning and against which the performance of algorithms is validated. Second, the validity of algorithmic coding of complex human language remains a primary concern. Some computational methods are no better than a coin flip in terms of classifying content into the correct categories (e.g., whether a particular statement is sarcastic or not). ATA is also rarely used in studies of visual media because those media pose their own challenges for algorithms, though technologies that can be applied to visual media are improving. Joo and Steinert-Threlkeld (2022), for example, used a visual classifier to examine how protesters were portrayed via images shared on Twitter. Such analyses, however, are still relatively basic in terms of the research questions they ask.

When ATA Is Best Applied

Currently, ATA methods are best suited for analyzing manifest variables (Zamith & Lewis, 2015). Kornfield et al. (2018), for example, used Linguistic Inquiry and Word Count (LIWC) to examine whether language used in an online
alcohol abuse recovery group could predict relapse. LIWC is a commonly used program that classifies word usage based on a predetermined set of dictionaries, each of which is thought to represent a different, psychologically meaningful category of language (Tausczik & Pennebaker, 2010). Kornfield et al. hypothesized that relapse would be positively predicted by the use in addicts’ posts of words from the “negative affect” dictionary, whereas relapse would be negatively predicted by the use of words in the “first-person plural pronouns” dictionary (e.g., “we”), perhaps indicating stronger social support (e.g., relationships) than posts primarily referring to “I.” Such dictionary approaches are also quite common in sentiment analysis, which seeks to classify the valence (positive, negative, or neutral) of emotions expressed publicly in text, based primarily on the frequency of the use of either positive or negative words. Sentiment analyses of tweets about specific candidates and policy issues, for example, have been shown to predict election outcomes (Ceron, Curini, & Iacus, 2017). The problem with dictionary-based approaches is that they give no information about how particular words are used in context. Keyword-in-context (KWIC) and concordance approaches improve slightly in giving words more contextual meaning by examining which words occur most frequently together, but extracting individual words from their context limits the ability to understand those words’ actual meaning (e.g., was a specific word or phrase used ironically or sarcastically?). Analyses of which words cluster together or “co-occur” within a specified distance (e.g., within five or ten words before or after the key term) and network analyses (e.g., key terms co-occur in other pieces of content, or from other sources, etc.) can provide additional contextual information. But these methods still might not capture the full meaning of human language that depends on words beyond an arbitrary interval or distance. Natural language processing, machine learning, and artificial intelligence are areas where computational linguistics and computer science researchers are working toward programming computers that can understand human language in its full complexity and context. Despite advances in these areas, however, recognizing more complex human language, such as humor, sarcasm, and irony, remains beyond the grasp of the algorithmic coder. Moreover, even as ATA capabilities advance—and they are advancing quickly—there are fundamental differences in the motivations driving those who build algorithmic tools (i.e., computer scientists) compared to social scientific scholars. In an aptly titled research paper—“Detecting sarcasm is extremely easy ;-)”—two computer scientists (Parde & Nielsen, 2018) described an algorithmic approach to identifying sarcasm in tweets. They created a test data set by collecting tweets that had hashtags corresponding to six emotions and others with the hashtag #sarcasm, assuming that the latter were indeed sarcastic. They trained an algorithm to identify sarcastic tweets within the whole data set and it correctly identified 68% of the actual sarcastic tweets (i.e., those originally
self-identifying as #sarcasm). However, its precision was only 53%, hardly better than a coin flip: that is, although the algorithm correctly identified 68% of the tweets that were actually sarcastic (its recall), only 53% of the tweets it labeled as sarcastic actually were sarcastic (its precision). The results were better—82% and 75%, respectively—for detecting sarcasm in Amazon reviews. The key point is that, for these computer scientists, even an incremental improvement in the ability of an algorithm to recognize complex language was an achievement. By contrast, for social scientists, an algorithm whose sarcasm labels are correct only 53% of the time is unacceptable if valid inferences are to be drawn about human behavior.

The tools that computer scientists develop are often used for data mining—a largely atheoretical approach of inductively discovering and extracting relationships that may exist in the data. Content analysis is a method that seeks to draw valid inferences about communication content (including its antecedents and consequences) in a way that builds and tests theory about human communication. These different goals help explain differences in what is considered acceptable in the two fields. Different orientations are also at the root of some of the challenges to creating collaborative, multidisciplinary teams of computer and social scientists to advance the application of ATA. Nonetheless, by recognizing and accepting the different approaches, and the value each brings to building both research tools and theory, it is possible to build bridges between the two orientations.

Hybrid or Computer-Aided Content Analysis

While ATA refers to research designs that rely on algorithmic coders, there are numerous examples of content analyses that use computers in the research design but ultimately use human coders to categorize and assign numeric values to content. As mentioned above, this hybrid approach—which involves using computers in studies but not fully automating the coding—might more aptly be labeled “computer-aided text analysis” (CATA). Computers can be used to retrieve content from databases; and, in cases where there is no database or repository of online content, such as occurs with social media’s fluid content, computational methods can “freeze” dynamic web content, creating searchable “databases” of static, cross-sectional observations. Computer programs can identify tweets that use specific keywords or hashtags or mention specific users’ Twitter accounts. In addition, computers can organize coding tasks, providing a coding sheet for coded data, and then validate codes that are entered to eliminate mis-keying.

In one example, Lewis et al. (2013) created such an interface to streamline coding about the Arab Spring protests. On the left-hand side of the computer screen, the interface pulled a piece of content sampled from a database of 60,000 tweets. This ensured that content was arranged in order and not
inadvertently skipped over, eliminating errors that can arise when pairs of coders are tasked with dual-coding the same units. On the right-hand side was a coding sheet for entering data, with dropdown menus that eliminated invalid, mis-keyed values of variables (e.g., by preventing coders from entering a “2” for a category with only acceptable codes of present = 1 and absent = 0). This study illustrates that computers can be used to: (1) gather data, amassing 60,000 tweets; (2) extract data fields (i.e., parse data) of interest by organizing the fields in a spreadsheet; and (3) filter data based on date ranges, keywords, and so on. Without the aid of a computer, each of these tasks would have been extremely laborious for a human to perform manually. For example, if one assumes one minute to process each tweet, that task alone would have taken 25 forty-hour work weeks—or an entire year for a half-time graduate research assistant! Such hybrid or computer-aided approaches to conducting content analyses can leverage the efficiencies and reliability of computers, cutting back on potential human error, while still relying on human coders to recognize the full richness and complexity of human language.

One reason Twitter data are popular for content analysis is that they are “structured.” Each tweet has various data points—author, associated user profile page, time, date, replies, mentions, text or content, URLs mentioned in the tweet, and so on. All of these data points are stored in a consistent, structured database that can be accessed through Twitter’s Application Programming Interface (API). Users can “call” fields from Twitter’s databases and store them in table format for analysis. On the downside, scholars are increasingly aware of fake social media accounts, sites where users can “purchase” followers, and other developments that have cast doubt on what the data ultimately reflect (see Karpf, 2012; Zamith & Lewis, 2015). Unstructured online data are not stored in any predetermined way (e.g., on individual websites), but can also be accessed using software and then sorted, though not as efficiently or, perhaps, reliably. Computers can also access semi-structured data, which are not quite as uniformly structured as tweets, but might have some relatively consistent, recognizable form—for example, newspaper articles, with headlines, subheads, bylines, body of articles, and other consistent formats.

A challenge of studying online communication content is that very little of it is structured, and data that were once semi-structured are often less so when they move to online distribution. Television news homepages, for example, have a much less structured form than traditional scheduled newscasts, which tend to be linear (a lead, then one story follows another). Online content is dynamic, meaning that a web page is updated multiple times throughout the day and previous versions of the page may not be archived (Zamith, 2017). Working with less structured data, such as social media content, requires more sophisticated programming to “scrape” data from the homepages,
“freeze” it so it can be reliably analyzed at a specific point in time, and parse it into fields (e.g., topic, date, gender of the person controlling the site, etc.) that can be stored in a data table for analysis. As an example of dealing with semi-structured data, Zamith (2017) described using custom Python scripts to “freeze” dynamic web news content and then parsing that content so that changes in the placement of news stories across different organizations’ websites could be tracked. Change in placement was used as a measure of newsworthiness over time.

“Scaling up” Content Analyses

The size and complexity of data sets used in communication research are increasing, offering new opportunities for social scientists but also placing new demands on the technical skills researchers need to access data. While “upscaling” the size of samples is appealing, sampling from some data sets leads to concerns about privacy, research ethics, and retaining a sharp focus on data validity and foundational values of social scientific research (e.g., transparency and reproducibility). Challenges facing content analysts today include accessing ephemeral digital communication data and establishing that the data represent the manifest meaning of the symbols and messages being analyzed.

Increasingly, commercial digital media platforms are revoking researchers’ access to users’ data and curtailing API access. An API is a program that allows two applications to exchange information. In this context, popular communication platforms such as Facebook and Twitter had APIs that allowed researchers to collect data on posts to those platforms. However, platforms have started to limit or restrict API access. For instance, in 2015 Facebook shut down the API that allowed researchers access to individual posts, and three years later it shut down access to posts on Facebook pages, leading Freelon (2018) to consider the future of “computational research in the post-API age.” He argued that researchers will need to be equipped with even more technically advanced skills in order to “scrape” data from web pages as they are increasingly denied access to simple APIs. However, such researchers should be aware that doing so will likely violate a platform’s terms of service (TOS) agreement and even, potentially, research ethics. A “cease and desist” notice from a social media platform might throw a significant wrench in a dissertating student’s progress or a junior scholar’s pretenure research productivity. Moreover, as mentioned in Chapter 1, questions about the representativeness of data from different platforms remain, given that tweets and posts may be removed by authors or platform administrators, and that the “populations” of such content vary over time. Chapter 6 addresses questions and issues of sampling and data generalizability.
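To make the freeze-and-parse workflow described above more concrete, the minimal sketch below retrieves a hypothetical news homepage, extracts headline text and on-page placement, and stores the result as a structured table. The URL and CSS selectors are assumptions for illustration only; real sites differ, and, as discussed above, scraping may violate a publisher's or platform's terms of service, so researchers should verify what is permitted before collecting data.

import csv
import datetime

import requests
from bs4 import BeautifulSoup

HOMEPAGE_URL = "https://www.example-news-site.com/"  # hypothetical site


def freeze_homepage(url):
    """Retrieve the raw HTML so the page can be archived and re-parsed later."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_headlines(html):
    """Parse headline text, link, and placement into structured rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumes headlines are links inside <h2> tags; adjust the selector per site.
    for position, link in enumerate(soup.select("h2 a"), start=1):
        rows.append({
            "retrieved_at": datetime.datetime.now().isoformat(),
            "position": position,  # placement as a rough proxy for newsworthiness
            "headline": link.get_text(strip=True),
            "url": link.get("href", ""),
        })
    return rows


if __name__ == "__main__":
    html = freeze_homepage(HOMEPAGE_URL)
    rows = parse_headlines(html)
    with open("homepage_snapshot.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["retrieved_at", "position", "headline", "url"]
        )
        writer.writeheader()
        writer.writerows(rows)

Re-running such a script at regular intervals would yield repeated "frozen" snapshots whose story placements can be compared over time, in the spirit of Zamith's (2017) approach.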

In a world that is increasingly focused on data privacy, social media companies will likely impose even stricter terms of service and become more aggressive in enforcing them. Legislation such as the European Union’s General Data Protection Regulation, which took effect in 2018, has introduced the concept of “privacy by design”—that is, users’ data should be private unless they opt in to a higher degree of sharing. Other legislative bodies have followed suit with their own privacy regulations (European Union, n.d.). These regulations are putting greater pressure on communication platforms to ensure users’ privacy and adhere to and enforce privacy policies, which led Freelon (2018, p. 668) to caution that communication researchers “need to keep the tenuousness of our access to digital data firmly in mind.” While there are channels to create direct research partnerships with platforms to gain access to communication data within TOS agreements, such opportunities lack independence from the platforms and are generally not accessible to younger scholars or investigators from less prominent institutions.

Furthermore, while changes in privacy preferences and regulations are restricting access to users’ data, universities’ institutional review boards (IRBs) are taking ever more interest in reviewing research that uses digital communication data. The era of IRBs with little awareness of or interest in social media research is over. They are now well aware of the prevalence of social media research and keenly interested in its implications for privacy and protection of vulnerable populations. As an example of evolving policies, in 2021 Indiana University’s IRB still allowed the use of “existing and public” data without IRB review, but at the same time stated that public availability of data is insufficient to establish the public nature of those data (Indiana University, 2021). If individual users do not clearly intend their data to be used publicly, or if the researcher gathers data in which a user can be personally identified, the researcher must obtain IRB approval for the research. For example, a researcher may be studying an open online support group for people recovering from substance abuse. While the group may be technically “public,” it is likely that many of the participants intend their history of substance abuse to remain private. As a result, the IRB’s view is that the forum is not public, so the researcher must seek human subjects research review.

Proper access to digital communication data, including seeking IRB review where required, is only one challenge scholars face. As noted above, once researchers gain access to data, there is an increasing understanding that the data may not be representative of the attitudes, behaviors, and messages the researchers assume. Updated, accurate estimates of the scale of fake social media accounts are difficult to find, but most experts agree that spam and bot accounts now number in the millions. In 2022, more than one in three US social media users admitted to creating “fake accounts” to post messages free of judgment or to follow other users anonymously (Mello, 2022).

More worrisome, however, are troll accounts that attempt to influence public opinion and stoke conflict around controversial political topics such as elections, social movements like Black Lives Matter, and health issues like COVID-19. A new research industry has emerged to create algorithms to detect these fake accounts, whose widespread existence cuts right to the heart of data validity. Yet, content analysts rarely discuss them or how they may affect the validity of social media data.

Scholars also face the challenge of the increasing commercialization of tools for analyzing digital communication content, particularly “plug-and-play” methods that do not require extensive coding knowledge. Based on technology developed by Harvard Political Science Professor Gary King at the Institute for Quantitative Social Science (Roush, 2008), Crimson Hexagon was a frequently used commercial tool for “social listening” and sentiment analysis prior to its merger with Brandwatch in 2018. Another popular tool, Linguistic Inquiry and Word Count (LIWC), was developed by University of Texas at Austin Psychology Professor James Pennebaker. It still offers a relatively affordable academic license and transparent documentation, but a commercial offshoot—Receptiviti—was founded by Pennebaker and partners in 2015 (Crunchbase, n.d.).

A study conducted by researchers from Princeton and Harvard universities highlights these tools’ usefulness as well as their drawbacks (Jamal, Keohane, Romney, & Tingley, 2015). The research team used Crimson Hexagon to categorize tweets from Arab countries that mentioned the United States in terms of whether the topic was social or political, and whether the valence was positive or negative. Negative assessment was primarily attached to political tweets, especially those relating to interventionist policies. For social issues, there was no more scorn for the United States than there was for Iran. The authors concluded that Arab Twitter users did not express generalized hatred toward the United States, but they were angry about political intervention in the Arab world.

Intriguing as these results are, they are not easily replicable, which is regrettable, as replication (see Chapter 2, this volume) is an important, albeit rare, aspect of the scientific process. Indeed, replication is one of the primary tools we have to guard against human bias in our research. If others can follow an identical research process and obtain the same result, at the very least this suggests that the study’s methodology did not distort the data. Jamal et al.’s research is not replicable because it relied on proprietary Twitter data that the team accessed via an expensive Crimson Hexagon license. The researchers could not openly share these data without violating the service’s user agreement, which was written with the aim of preserving the data’s commercial value. (For instance, the data could be used to study reactions to new product launches or to examine public sentiment toward products.) Such user agreements can handcuff certain research activities, such as sharing data, or the use
of data accessed without a company’s permission (e.g., accessed via the Twitter API or another Twitter tool; Twitter, 2018). In addition, Crimson Hexagon’s high cost meant that it was not widely available even before the merger with Brandwatch, so most researchers have been unable to replicate Jamal et al.’s data-gathering process. Finally, the algorithm the research team used to classify the content is proprietary and confidential (as is the case with other commercial software). In effect, the data were classified within a “black box,” beyond the view, inspection, and assessment of peer scholars who were keen to review the research.

In response to the barriers to replication posed by commercialized data and analysis tools, Trilling and Jonkman (2018) have offered a list of broad, “best-practice” requirements for the use of computer platforms in hybrid content analysis and ATA designs, which are meant to achieve goals of the scientific process, especially replicability:

1. The platform should be scalable, to work on a laptop computer, but also on servers to support “big data” projects.
2. It should not depend on commercial software to run. Rather, it should be free and open source. The latter implies that the source code of the program is freely available for inspection and alteration; open source is the antithesis of the commercial “black box.”
3. The platform should be adaptable to a wide range of projects and collaborations, which requires it to be open source.
4. The platform should give users advanced control, but also provide an easy-to-use interface for the novice user: that is, it should not be arcane and accessible only to a small subset of social scientific scholars with expert technical abilities; it should be widely usable by the whole community of interested scholars.

In addition to these requirements, the platform should be geared toward analyzing publicly accessible data sources that can be readily shared, at least within the community of interested scholars, keeping in mind that the sharing of research data is also an important expectation of major research funders, such as the National Science Foundation.

Summary

The use of computers in studies that analyze media content can greatly improve efficiency, allow researchers to study more complex data sets, and enhance reliability, which is a precondition for validity, the most important consideration for social scientific scholars. It is probable, but still a distant promise, that human coders will be rendered obsolete by the algorithmic coder. However, it is questionable to assume that, at present, human and algorithmic coders should
produce equivalent, much less identical, data, and that if the two fail to produce such data, it is necessarily a “fault” or flaw in the algorithmic coder. The use of algorithmic coders is distinguished as different from content analysis by its name: algorithmic text analysis (ATA). Studies using human or algorithmic coders follow very distinct processes that have different implications ultimately for the validity of the data. There are some tasks (such as word counts) that a computer will always do better than any human coder. A human coder, however, should at present be superior at recognizing messages with multiple interpretations or that rely heavily on connotation—humor, for example. Currently, the algorithmic coder is best suited to studying text rather than visual media, such as photography and video, and particularly manifest content. Computers, though, can also be used to enhance the efficiency and reliability of human coders, albeit on different scales. They can be used to gather, parse, sort, and filter data, as well as to organize the coding task and validate codes applied to content. Thus, they should be used in content analysis to improve efficiency and reliability, as long as their performance of tasks does not significantly threaten the validity of data. Finally, this chapter discusses how content analysis studies, particularly samples of content, can be “scaled up,” particularly with an eye toward further developing tools that make content analysis more efficient. Validity, a primary concern of science, is dependent on other aspects of the social scientific process, including the role of peer review and the replicability of research. It is therefore important that social scientists in particular encourage the development of tools for content analysis and ATA that are open source. The transparency of such tools allows them to be freely examined and adapted to a variety of research questions, and to be readily available and usable by the community of scholars for purposes of peer review and replication.

5 Measurement

Social science research methods use what Babbie (2013) calls a “variable language” to study variations of attributes among people and people’s artifacts. Any phenomenon that varies in value when it is measured and assigned numbers is called a variable. When individual people or artifacts represented by a variable are assigned numbers, they can be summarized and analyzed with statistics. Content analysis measurement involves assigning numbers based on instructions in a coding protocol.

Measurement links the conceptualization, data collection, and analysis steps presented in Chapter 3. Careful thinking about that linking process forces a researcher first to identify the properties of content that represent theoretical concepts of interest (e.g., bias, frames, etc.) and then to transform those properties into numbers that can be analyzed. In more concrete terms, measurement is the reliable and valid process of assigning numbers to units of content. Measurement failure, conversely, creates unreliable and invalid data that lead to inaccurate conclusions, significantly wasting effort and resources. In content analysis, establishing adequate intercoder reliability is key to assessing measurement success. Intercoder reliability means that trained coders applying the same classification rules articulated in a predetermined protocol to the same content will almost always assign the same numbers. Beyond reliability of measurement, validity of measurement requires that the assignment of numbers accurately represents the focal concept. As Rapoport (1969, p. 23) said:

    It is easy to construct indices by counting unambiguously recognizable verbal events. It is a different matter to decide whether the indices thus derived represent anything significantly relevant to the subject’s mental states.

As with all social science measurement, some error will exist, but a carefully constructed measurement procedure that is adjusted with use will give valid and reliable measurement.
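To make the idea that trained coders "will almost always assign the same numbers" concrete, the minimal sketch below computes simple percent agreement between two coders' codes for the same units. The coder names and codes are hypothetical, and percent agreement is shown only for illustration; Chapter 7 discusses the chance-corrected reliability coefficients that should actually be reported.

def percent_agreement(coder_a, coder_b):
    """Share of units on which two coders assigned the identical code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Coders must rate the same, non-empty set of units.")
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return matches / len(coder_a)


# Hypothetical codes for ten units: 1 = attribute present, 0 = absent.
coder_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
coder_b = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0]
print(f"Percent agreement: {percent_agreement(coder_a, coder_b):.0%}")  # 80%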

Content analysis shares a common approach and problem with other observational methods, such as survey and experimental designs: isolating a phenomenon enables us to study it, but that isolation removes it from its context, resulting in some distortion in our understanding. In content analysis, reducing a body of content into more easily studied content units risks losing context that provides fuller meaning. Yet, reducing content to units is necessary for the definition and measurement of variables of interest. Ultimately, the success of predictions based on the use of these content units must validate the choice of variables and how they have been measured.

Content Units and Variables in Content Analysis

As noted in Chapter 1, researchers often test relationships in which content variables change in response to antecedent causes, and/or in which content variables cause subsequent effects. How researchers identify appropriate variables for content analysis in such relationships depends first on the hypotheses and research questions guiding a particular study. In other words, theory and prior research point to variables that might be observed, while hypotheses and questions derived from theory and prior research give even more explicit insights into possible relationships among variables. However, decisions on how to define and observe relevant variables also depend on how they are embedded in observable messages—the context noted above.

Sometimes social scientists’ search for manageable units is straightforward, even absent guidance from theory or research. In survey research, for example, individual people are interviewed, and there is little difficulty in distinguishing individual survey units (people) from one another. Variables are measured characteristics of people at the individual (micro) level of analysis. Research questions might ask how one person’s height, political views, or occupational status differs from another’s. This quickly becomes more complex and ambiguous, however, when researchers examine groups, organizations, social systems, or even societies. Similarly, communication content creation can be influenced by micro-level and macro-level variables, and content’s impact can also occur at both levels. Variables at both levels influence each other, and such relationships can be complicated and difficult to identify.

Nonetheless, content analysis requires that content be divided into observable communication forms and units in a manner justified by logic, research, and/or theory. As with survey research, this sometimes seems simple: stories in newspapers, comments on Facebook, ads in a women’s magazine, and so on. The medium in which the communication appears usually provides a means of separating one type of communication content from another. Separation among content types within particular media may also seem relatively easy. Researchers might look at a website aimed at women to assess ads (paid content attempting to sell products) as distinct from posts, articles,
photos, videos, and other material. Finally, within those ads, they might define and measure variables: for example, whether the ad includes the image of a woman; if it does, other variables of interest about that image may be examined (e.g., body type, clothing, interactions with others, etc.). Content analysts must also consider how crudely or precisely variables such as body type, clothing, or interactions can be measured. This degree of precision—or “level of measurement” for variables—must correspond to the phrasing of relationships among those variables in hypotheses. In turn, the level of measurement of these variables then determines the types of statistical analysis that can be used to test the hypotheses.

This chapter links theory and measurement by conceptualizing content in three “descending levels.” Again, hypotheses and research questions give guidance that is more or less explicit for decisions as researchers descend through these levels. The top level is composed of content forms—the manifest ways in which a “universe” of communication may be decomposed into parts. Included within those forms are units of observation that more specifically identify content likely to include variables of interest. Finally, at the lowest level are units of analysis—the content that includes the variables of interest measured at a level informed by the hypotheses or research questions and appropriate to the mode of statistical analysis.

Content Forms

Consider three hypotheses:

H1: Newspaper news stories most often report public policy issues from the standpoint of institutional leaders.

H2: Verbal interchanges among characters on a streaming TV comedy are more likely to include cruel humor than are verbal interchanges among characters on a network television comedy.

H3: Women pictured in Instagram selfies will be portrayed in more revealing clothing than Instagram selfies of men.

These hypotheses suggest a variety of content forms and combinations of forms that can be analyzed. Although Chapters 1 and 2 emphasized familiar distinctions among print, broadcast, and online media, a broader classification scheme distinguishes among written, verbal, and visual communication. These three forms are basic and can be found across media.

Written communication informs with text—a deliberate presentation of language using combinations of symbols. H1 requires observation of content forms that are written. The text can be on paper, an electronic screen, or any other type of physical surface, but an important characteristic of written communication is that a reader must know the language in order to understand the communication.

Most content analysis studies have involved text because historically text has been the primary way mass-produced content has been preserved. Of course, as new, non-text-based media platforms have grown and preserved non-text messages, content studies of them have followed, though perhaps not to the extent that they should have done. For reasons of time and money, content analysts have long been drawn to well-indexed and -archived textual content— disproportionately, newspapers—that make their lives easier. More recently, as noted in Chapter 4, they have relied heavily on Twitter because it has been more accessible than other social media and the data are well structured (Giglietto, Rossi, & Bennato, 2012). It is relatively easy to parse out a tweet’s sender, replies, hashtags, and so on. In most, but not all, content analyses of text, division of text into meaningful units in order to define the variables of interest is fairly straightforward. As suggested above, historically, much of mass communication research, particularly content analysis, has focused on articles in newspapers or print magazines. For example, Kilgo, Mourão, and Sylvie (2019) examined how four US newspapers covered social protests following the fatal shootings of Trayvon Martin and Michael Brown. These national newspapers used familiar tropes, focusing on protester violence and privileging official sources, though they did shift to discuss more substantive reasons for the protests after courts failed to hold the shooters responsible. The expansion of digital media has allowed easy combination of various forms of content and given individuals and organizations an ability to access high-quality visual and verbal communication with mobile devices. Text continues to be important, but content analysts must now deal with many other forms of mediated communication. Verbal communication is spoken communication, both mediated and nonmediated, intended for aural processing. H2 requires analysis of such verbal communication, specifically in the context of video content. When aural content is preserved, it is often saved as text in transcripts that are particularly suited for content analysis. For example, content analysis of aural content may examine doctor–patient conversations. Using digital audio recorders connected unobtrusively to common medical equipment, Vashi and Rhodes (2011) recorded 477 conversations of emergency-room providers with patients about discharge instructions. In particular, the researchers wanted to examine providers’ efforts to confirm patients’ understanding of instructions and found that understanding was confirmed in only 22% of conversations. Visual communication involves efforts to communicate through non-text symbols processed with the eyes. H3 requires this kind of visual content analysis. Visual communication includes still visuals (e.g., photographs and graphics) and motion visuals (e.g., film and video). The former are often easier to content analyze than the latter because stills freeze relationships among visual elements.
Motion visuals often require repeated viewing to identify elements, symbols, and relationships in visual space. Famulari (2020) content analyzed major news organizations’ textual and visual framing of President Donald Trump’s brief “zero tolerance” immigration policy, which aimed to deter illegal immigration and involved separating children from their parents. Sympathetic, human-interest frames showing immigrants in shelters, distressed children at the border, and so on were contrasted with images of immigrants, particularly single men, crossing the border, emphasizing potential threats of illegal immigration. In other examples, Hum et al. (2011) analyzed gender role and identity construction in Facebook users’ profile photographs, while Döring, Reif, and Poeschl (2016) found more gender stereotype reinforcement in Instagram selfies than in magazine ads.

Additional complexity arises with presentations using more than one communication form. Social media often combine text with video, graphics, or photographs, while reports about events may vary from digital platform to digital platform. For instance, Thorson et al. (2013) studied Occupy movement videos that were available on Twitter and YouTube. They accessed the footage through both platforms to create a more diverse pool of video, finding both tweets and shared videos essential to a tweet’s meaning. Combinations of communication forms will likely expand as people increasingly use multimedia, such as the Internet and mobile devices, to access information.

Special Problems Associated with Measuring Non-Text Forms

Recall the emphasis in Chapter 2 on denotative or manifest meaning and validity. Non-text communication adds dimensions that can cloud manifest meaning. For example, spoken or verbal communication depends, like text, on the meaning of words or symbols, but also involves vocal inflection, tone, and even body language that affect the meaning applied by receivers. The simple verbal expression “the hamburger tastes good” can be ironic or not depending on the inflection added to the words. No such emphasis can easily be inferred from written text unless explicitly stated. Inflection and tone can be difficult to interpret and categorize, placing an extra burden on content analysts to develop thorough coding instructions for spoken or verbal content. Visual communication can create similar problems because of ambiguities not easily resolved within the message itself. For instance, a text description can easily reveal age: “Smith is 35.” A visual representation alone of Smith is likely more vague or imprecise. Olson (1994) found that she could establish reliability for coding character ages in TV soap operas by defining wide age ranges (e.g., 20–30 years old, 30–40, etc.). This may not be a problem in some research, but it can affect validity. Consider studies recording ages of characters in advertising. It is often difficult to differentiate a 16-year-old from someone who is 18,
yet some content analyses argue that 18-year-old characters are adults, whereas 17-year-olds are teenagers. Because of the shared meaning of so many commonly used words, written text may provide within-message cues that reduce ambiguity. Shared meanings of visual images are less common. Researchers should be cautious and thorough about assigning numbers to symbols whose meanings rely on visual cues. Consider the task of inferring the socioeconomic status of characters in television programs: identifying a White, pickup-driving man in his 40s as “working class” on the basis of his clothing—denim jeans, flannel shirt, and a baseball cap—may be more reliable and valid than using similar cues to assign a teenager to that class. Perhaps more complex, reconsider the Chapter 2 example of studying the normative or non-normative gender portrayal of the LGBTQ+ teen US population by examining streaming service programming.

Combinations of forms—visual with verbal, for example—can generate coding problems because of between-form ambiguity. That is, multiform communication requires consistency among forms if accurate communication is to occur. If visual, text, and verbal forms are inconsistent, meaning becomes ambiguous and content categorization more difficult. A television news package might have text describing a demonstration as non-violent, while accompanying video shows people hurling bottles. Researchers can categorize each form separately, but reconciling and categorizing the combined meaning may be difficult. One form may dominate the other in people’s reactions.

Cox (2022) faced this problem when examining Facebook posts by three cable news networks (Fox, CNN, and MSNBC) and three national daily newspapers (New York Times, Washington Post, and Wall Street Journal) relating to the Black Lives Matter (BLM) movement in summer 2020. She solved the cross-media issue by using closed captioning from the video, and not coding the visual elements; the unspoken assumption was that words create frames for what can sometimes be ambiguous video. She found conservative news outlets (Fox and Wall Street Journal) more likely to use the term “riot” in reference to BLM activities than other outlets, but news coverage of the protests by all media outlets was generally more positive than earlier coverage of BLM.

Units of Observation

Once researchers identify the communication form or forms addressed by their research questions, the next step is to specify units of observation likely to provide access to content containing relevant data. Units of observation are more specific distinctions that sharpen a study’s focus on the content of interest. Complications can occur because “units of observation” can be layered and need to be broken down successively to find appropriate units for addressing hypotheses or research questions. This is analogous to Internet or web map searches, proceeding through successively tighter foci that begin with a broad national view
and ultimately reach a street corner or even a specific address of interest. All the intermediate steps between start and endpoint are layered units of observation.

The hypotheses presented above illustrate this process of narrowing of our observation. H1, for instance, requires that a newspaper be defined in a way differentiating it from all other text media of communication. This is Observation Level 1 (OL1). For OL2, news stories must then be defined to differentiate them from all other newspaper content (e.g., editorials, ads, letters, etc.). News stories that are thus defined and identified can be coded for content variables addressing H1. The process would be similar with H2. The analyst must first define content at OL1—how streaming TV videos differ from all other videos. At OL2, the analyst distinguishes programs on network TV from non-network TV shows. OL3 defines how comedies differ from other broadcast or streamed content. Finally, at OL4, the analyst must define how humorous verbal interactions among characters differ from other interactions.

Note that there is no “system” for specifying how many observation levels the analyst must travel through before reaching the content of interest. This is because content for study was not created with researchers’ needs in mind! Content analysts must use the successive definitions of “units of observation” to explain explicitly the steps taken to generate or reach units of content addressing hypotheses and research questions. Only then can other researchers and reviewers follow those steps and understand the data used in the study.

Basic Units of Observation Used in Content Analysis

Every study, then, must determine the most relevant and useful units of observation for the research goals. As emphasized, no uniform process exists for selecting units of observation that are relevant to all studies. Nonetheless, content units can be divided into physical and meaning units, which differ in terms of the approaches used to classify and measure them.

Physical Units

The most basic physical units are time and space measures of content. These units are discrete and, in the case of commercial media, have been standardized. A written article takes up space on paper or the screen, as does a photograph. An item that takes up space can be recorded as a single item, or recorded based on the total space it occupies (e.g., square inches or square centimeters). Measurement of space for verbal and moving visual communication has little usefulness. Spoken verbal content does not involve space, and the size of space devoted to a visual element depends on the display equipment used (larger screens have more space than smaller screens). Therefore, instead of space, verbal and moving visual messages are measured by numbers of discrete items
and/or time devoted to visual and verbal content. For example, television video can be evaluated by measuring the number of seconds given to a character or topic. The average length of time given to a character or topic may be assumed in such research to reflect amount or depth of information. A movie featuring a particular character for 30 minutes provides more information than if it devotes 15 minutes to the character.

Although they are among the most objective units used in content analysis, physical units are often used to infer the (antecedent) values of the sender/creator and/or the units’ effect on receivers (consequences). A person posting ten photographs of an event on Facebook likely cares more about that event than someone who posts just one. A three-minute video about an event uploaded to the Internet by a TV newsroom suggests that the organization’s journalists consider the event more important than another they cover in a one-minute video. For example, measures of broadcast time were used in a study to explore how reporter gender was related to the gender of sources used on-air by television networks covering the 2004 presidential race (Zeldes & Fico, 2010). Analysis of physical units indicated that women reporters gave women sources more air time compared to the time male reporters gave women. The authors offered several hypotheses about why this might have occurred, but without directly measuring the reporters’ motives, no causal inference could be drawn.

Two assumptions undergird any inferences from physical or time measures of units to content antecedents or effects. The first is that allocation of content space or time is systematic and not random, and that such space or time can be measured. The second assumption is that the greater the content space or time devoted to some issue, subject, or person, the greater will be the content’s impact on the audience. So, for example, an online news site that allocates 75% of news content space to stories about the city in which it is located might plausibly be assumed to be making a conscious effort to reach readers interested in local events. At the same time, allocating 75% of space to local coverage plausibly has a different impact on the total audience than allocating 40%, at least in terms of the probability of exposure to such content.

Meaning Units

Meaning units also have a physical and/or temporal manifestation, but the meaning of the units depends on the symbols used in the message. Content analysts can be interested in physical units, meaning units, or both. Sources in a news story, for example, will have words attributed to them that take up a certain amount of space or time. However, the focus of interest for a study may be on the meanings of the words, and those meanings provide both richness and ambiguity to inferences from them to antecedent causes or subsequent effects. One of the most basic types of meaning units in content analysis is what Krippendorff (1980) called syntactical units. These occur as discrete units in
a language or medium. The simplest syntactical unit in language is the word, but phrases, sentences, paragraphs, articles, and even books are also examples. Because they are standardized parts of a language, syntactical units can be useful to indicate context. A paragraph in a text relates to the same topic; all the sentences provide context for all the other sentences. Syntactical units in the Bible, for example, would be particular verses within chapters. In plays, they are dialogue within scenes and acts. In television programs, they can be phrases in verbal comments, scenes, interactions among people within scenes, and so on. Most visual messages also contain syntax that has to do with how people perceive the images. Dark and light tones, receding lines, forms, and so on comprise a visual language typically taught in art schools.

Although some common, standardized syntactical units can help identify context, not all messages use these, and not all content of interest has standardized context. For example, how can a particular unit of content be validly separated from other content? How can passages of dialogue in a television comedy be separated from dialogue that comes before or after that particular content unit? Deciding this may not be easy because context can change within a TV program scene. In addition, does the breaking up of a scene change the context and distort the meanings communicated? An even more foundational problem, discussed in more detail in Chapter 8, concerns the very meanings that can be inferred from the words in the unit of content. Almost all content analysts must cope with these problems because of the focus on some kind of syntactical units. Such units are examined in studies of bias, framing, diversity, persuasiveness, sexuality, violence, and so on.

Sampling Concerns with Units of Observation

As explained above, units of observation can be nested in a larger message that requires moving from higher observational levels to lower ones. Unless data can be collected for all the defined units of observation, sampling will have to be applied in one way or another (see Chapter 6). If researchers choose to use a census or a purposive sample of all units of observation in a population, the reasoning for that decision is all the justification needed in reporting the study. But if random sampling within units of observation is used because a population of those units is too large, then the rules of random sampling and inference should be considered. Specifically, researchers can only generalize from a sample of one type of observation units to the population of such units from which the sample was taken, and within a certain amount of sampling error. Consider, then, an example of what this would mean if a random sample were selected from multiple units of observation for the Instagram selfie study referenced above, which tested the hypothesis that women pictured in Instagram selfies will be portrayed in more revealing clothing than men pictured in Instagram selfies.


Such sampling is similar to cluster or multistage sampling (described further in Chapter 6). First, Instagram posts might have to be sampled from some population of such posts. Next, a sample of Instagram posts with selfies would need to be selected. Next, equal numbers of Instagram posts with selfies that portray women and Instagram posts that portray men would need to be chosen. Each of these steps produces some level of sampling error that depends on the size of the sample. These final sampling errors produced by sampling across all units of observation will require special calculation. Of course, a detailed research design helps a researcher anticipate such sampling decisions.

Units of Analysis

A unit of analysis refers to demarcated content about which the variables of theoretic interest can be defined and observed. For the example hypotheses presented above, this is the level of observation at the heart of the content analysis where data are measured and recorded to answer hypotheses and research questions. In H3, for instance, "revealing clothing," as defined in the protocol, is employed to assign numbers (i.e., coding) to represent the clothing in the images. When a content analysis is at the unit of analysis level, two critically important measurement design decisions come into play: the definitions of content (its classification into categories) and measurement of content (in particular, the level of measurement). These decisions will determine how the hypothesis is addressed or the research question answered.

Classification Systems

Content analysis protocols contain descriptions for how content can be observed and classified into categories that make up the variables. In other words, each content variable in a hypothesis must have at least two distinct categories in which the content units are placed. For example, for H3, a researcher might task coders with looking at one or more types of clothing (e.g., tank tops, blouses, etc.) or body parts (e.g., knees, shoulders, stomach) that are unclothed. The coders might then utilize a particular level of measurement determined by the researcher, who might have opted for a dichotomous variable (i.e., revealing/not revealing) or perhaps some sort of scale (e.g., a photo in which bare shoulders and stomach are visible is categorized as more revealing than one that shows just exposed shoulders, which in turn is more revealing than one in which no part of the body is unclothed). Of course, as noted, the researcher has to define in advance what kinds of observations in Instagram selfies would constitute data or evidence for determining whether a selfie conforms to gender stereotypes—a definition guiding the rules detailed in a coding protocol. A classification system is a collection of definitions linking observed content to categories that measure the variables in hypotheses or research questions.
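For researchers who keep their protocols in machine-readable form alongside the written coding instructions, a classification system for one variable can be sketched as a small lookup that documents the categories and rejects invalid codes at data entry. The sketch below (in Python) is a hypothetical illustration of the "revealing clothing" variable from H3; the variable name, category labels, and numeric codes are invented for the example and are not drawn from any published protocol.

```python
# Minimal sketch of one classification system from a coding protocol.
# The variable name, categories, and codes below are hypothetical examples.
CLOTHING_EXPOSURE = {
    "variable": "clothing_exposure",
    "definition": "Degree to which the pictured person's clothing exposes "
                  "bare shoulders and/or stomach (see written protocol for full rules).",
    "categories": {
        1: "not revealing (no bare shoulders or stomach visible)",
        2: "somewhat revealing (bare shoulders only)",
        3: "revealing (bare shoulders and stomach)",
    },
}


def record_code(value, system=CLOTHING_EXPOSURE):
    """Accept a coder's entry only if it is a defined category value."""
    if value not in system["categories"]:
        raise ValueError(f"{value} is not a valid code for {system['variable']}")
    return value


print(record_code(2))  # a coder assigns category 2 to one selfie
```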

A content analysis protocol, then, is a collection of classification systems for multiple variables. As in the example above, when variables are nominal level, they will have categories (revealing/somewhat revealing/not revealing). Similarly, a variable assigning values based on political "leaning" of content may have categories such as liberal, conservative, or neither. The variable and all categories will have definitions to guide the assignment of values for the categories. (See Appendix A for a detailed protocol example from Adams's [2020] study of newspaper editorial endorsements of presidential candidates.)

Variable definitions can use a range of content characteristics. Studies of online news sites have looked at the geographic emphasis of stories; others have assigned numbers representing an order of importance based on space or time devoted to stories. The classification system translates messages into variables either by assigning content units to categories or by assigning numbers to content units based on time or space used for the messages. Content analysts should consult conceptual and operational definitions used in past studies that may relate to their research. Building a body of knowledge is done most efficiently and effectively when researchers use common definitions. Of course, new ground may have to be broken, old mistakes corrected, and previous measures improved.

Deese (1969) presented a typology that has proved useful in conceptualizing and classifying content analysis variables. That said, the classification system used in a particular content analysis will draw most usefully from related past research and will be guided most efficiently by the study's specific hypotheses or questions. The following subsections present a modified, edited version of Deese's typology.

Grouping

This involves classifying content into groups when the units of analysis share some common attribute, such as topic, geographic location, or emotional valence (positive or negative depiction). In a study of the motives behind professional athletes' tweets, Hambrick, Simmons, Greenhalgh, and Greenwell (2010) classified the tweets according to the presence or absence of interaction with followers and found that a plurality of them involved interactivity.

Scaling

Some content can be classified on the basis of a numerical scale. These scales or continua can be based on a variety of characteristics, such as intensity, frequency of appearance, position, length, and time (Deese, 1969). It is fairly common to find content analyses using one or more of these types of scales. Lacy et al. (2012) created a source diversity scale to study sources used in city government articles. An article received a score of 1 for each of 11 different types of


sources represented in the article, with the scale ranging from 0 to 11. Boukes, Jones, and Vliegenthart (2022) studied whether news values in newspapers, television programs, and websites predicted a story's prominence. They measured prominence by location (position) and length, and found that a variety of news values correlated with prominence, but eliteness and conflict were the strongest predictors.

Relationships

Both scales and group classification methods represent practical efforts to make language more concrete in order to assign numbers to individual cases. Some concepts, such as friendship among TV characters, may not fit groups and scales well. Scales and groups categorize content by common characteristics rather than interactions that exist among people and artifacts within the content. For example, in a content analysis of episodes of Survivor (a "reality" TV show), Wilson, Robinson, and Callister (2012) explored the participants' antisocial behavior. They found that indirect aggression and verbal aggression were the most common forms of such behavior. The former, which occurred without the victim's knowledge, was defined as any act or words designed to hurt the victim or destroy the victim's relationships. The latter was defined as a direct attempt to diminish or humiliate the victim.

Binary Attribute

In English and related languages, characteristics attributed to a person or thing often have an opposite. Good is the opposite of bad; bright is the opposite of dark. These binary structures are often found in content, although concepts need not be construed in these terms. In journalism, many readers assume reporters to be either subjective or objective. Binary attribute classification is a special case of groupings and is often accompanied by a scale regarding the degree to which the attribute applies to a person or thing. For example, Adams (2020) measured the number of positive and negative mentions of personal characteristics in newspaper editorial endorsements of Hillary Clinton and Donald Trump during the 2016 election. Her binary attributes for each characteristic were used in a scale formed by adding the number of incidences. In another example of a binary variable, Oswald and Bright (2022) examined how climate-change skeptics interacted with opposing viewpoints in a Reddit forum specifically for skeptics. First, they labeled each post either "consonant" (i.e., it was skeptical of climate change) or "dissonant" (i.e., it accepted the concept of anthropogenic global warming and its seriousness). Rather than disrupting the "echo chamber" nature of the skeptics' forum, the dissonant posts led to more commenting activity, reinforcing the group's beliefs.
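For readers who want to see the arithmetic, the sketch below shows how binary attribute codes might be summed into a simple additive scale. It is only a hypothetical illustration of the general approach described above, written in Python; the characteristic names and codes are invented and are not Adams's (2020) actual coding scheme.

```python
# Hypothetical binary-attribute codes for one editorial endorsement:
# 1 = at least one positive (or negative) mention of the characteristic, 0 = none.
editorial = {
    "honest_positive": 1, "honest_negative": 0,
    "experienced_positive": 1, "experienced_negative": 1,
    "temperament_positive": 0, "temperament_negative": 1,
}

positives = sum(v for k, v in editorial.items() if k.endswith("_positive"))
negatives = sum(v for k, v in editorial.items() if k.endswith("_negative"))

# Simple additive scales built from the binary attributes.
print(positives, negatives, positives - negatives)  # prints: 2 2 0
```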

Although the above classification methods are independent of whether a content unit is a physical or a meaning unit, some types of units tend to be classified with certain methods. Physical units, such as length of video, could be categorized into groups (short, medium, and long) based on time (a length classification method), but using a scale of time (e.g., minutes and seconds) would be preferable because it is more precise and allows use of more sophisticated statistics, as discussed below. Classification systems are crucial in a particular study, but systems that are employed have implications beyond that study because they relate to the validity of concepts and how those concepts have been measured. Validity is established across time, which is why replication of a system is important. Ultimately, selection of a classification system for content should have a theoretical basis, and the validity of variables in a system must be argued logically and/or established empirically. This is discussed more fully in Chapter 8.

Classification System Requirements

All content classification systems must meet particular requirements dictated by the logic of empirical inquiry. Meeting these requirements is necessary, but not necessarily sufficient, for helping to establish the validity of the concepts and measures used in a content analysis. Creating categories requires specific instructions for defining variables so they can be coded reliably. These coding instructions for defining variables must meet five requirements. Definitions for variables must: (a) reflect the purpose of the research; (b) be mutually exclusive; (c) be exhaustive; (d) be independent; and (e) be derived from a single classification principle (Holsti, 1969, p. 101). To reflect the purpose of the research, the researcher must adequately define the variables theoretically. Then, coding instructions must clearly specify how and why content units will be placed in categories for these variables. This specificity requires detail that will guide coders in distinguishing among content units that seem similar. Novice content analysts tend to err on the side of too little detail in the variable and category specification. Details allow other researchers to replicate content analyses. These instructions provide the operational definitions that go with the theoretical definitions of the concepts underlying the variables. The operational definition should be a reliable and valid measure of the theoretical concept. Classification systems must be mutually exclusive when assigning numbers to recording units for a given variable. If magazine articles about environmental issues (unit of analysis) must be classified as pro- or anti-environment, a single article cannot logically be both. Using statistics to study patterns of content requires units to be unambiguous in their meaning; assigning more than one number to a unit for a given variable creates ambiguity.


Of course, it may be that an article contains both pro- and anti-environmental statements. In such cases, the problem is solved by selecting smaller units of analysis that can be classified in mutually exclusive ways. Instead of selecting the whole article as the unit of analysis, individual paragraphs or sentences within the article could become the focus for content as either pro- or anti-. Setting up mutually exclusive categories requires a close examination of the categories and careful testing to reduce or eliminate ambiguity. This subject is discussed more fully in Chapter 7. In addition to mutual exclusivity for variable categories, classification systems must be exhaustive. Every relevant unit must fit into a subcategory. This requirement is easy to fulfill in areas of content research that have received a great deal of attention. However, in rarely analyzed types of content (e.g., suicide notes), exhaustive category coding schemes will be more difficult to create. Often, researchers fall back on an “other” category for any units that do not fit within defined categories. This may be appropriate if a researcher is interested primarily in one category of content, such as local news coverage. In such situations, all non-local coverage could be grouped together, with no loss of important information. That said, “other” categories should generally be used with caution. Arguably, when content analysts rely on the “other” category, they are admitting that it contains content units they know too little about to categorize meaningfully. And the more units that fall within the “other” category, the more information about the content is lost. Extensive pretesting with content similar to that being studied will help create exhaustive categories, and minimize reliance on the “other” category. If the latter grows beyond 10% of units classified, the content analyst should examine those units closely to determine if there were other meaningful categories that should have been added to the system. Researchers can adjust the classification system and fine tune definitions during pretesting, but if changes to the classification system prove necessary during the data-collection process, the reliability of the protocol must be re-established and prior units recoded. Independence in classification means that placing a unit in one category does not influence the placement of other units. However, this rule is often ignored when ranking is involved, or when coding an item with a given value on one variable requires coding a particular value of some other variable. Independence is also important in assessing coder reliability and statistical analysis. Inference from a sample relationship to the population would be biased in unknown ways without classification independence. For example, suppose a researcher examines two TV situation comedies and two TV dramas for number of characters/persons of color (Black, Hispanic, Native American, or Asian) during a season. Each program has 20 new episodes per year. One system involves assigning ranks based on the number of characters who fall into the four categories mentioned above. The program with the most

characters of color during the season is ranked first, that with the second most characters of color is ranked second, and so on. Another system involves calculating the average number of characters in each category per episode. Now suppose that Comedy A has five characters of color over the course of the season, while Comedy B has three, Drama A has four, and Drama B has two. The ranking system might suggest that TV comedies provide the audience with much more exposure to such characters, because comedies ranked first and third, while dramas ranked second and fourth. Ranking "creates" this impression of an important difference because the assignment of rankings is not independent: assignment of the first three ranks determines the fourth. By contrast, the independent calculations provided in the second system give an average of .20 characters per comedy episode (8 characters divided by 40 episodes) and an average of .15 characters per dramatic episode (6 characters divided by 40 episodes). The conclusion based on these averages is that neither program type provides extensive exposure to characters of color. So an independent assignment system—such as averaging the number of characters per episode—provides a more valid conclusion.

Finally, each category should have a single classification principle separating different levels of analysis. For example, a system for classifying news stories could have two dimensions: geographic location (local, national, international) and topic (economic, political, cultural, social). Each of these two dimensions must have a rule for classifying units. It would be a violation of the single classification rule to have a classification system that treated local, national, international, and economic as being in one dimension. Classification in such a scheme would mix geographic and topic cues in the content. Consequently, coders would have great difficulty categorizing content relating to local economic issues.

Levels of Measurement

Content can be assigned numbers that represent one of four levels of measurement: nominal, ordinal, interval, and ratio. These are the same measurement levels used in all social science and reflect the type of information conveyed by the numbers assigned. Nominal measures have numbers assigned to categories of content. In studying which presidential candidate was mentioned most in tweets after a debate, a researcher would give each candidate a number and assign the appropriate number on the basis of the candidate mentioned in the tweets. The number used to distinguish each candidate is arbitrary—the Democratic candidate might be given a 1 and the Republican candidate a 2, but using a 10 for the Democrat and a 101 for the Republican would work just as well. The numbers carry no meaning other than "connecting" the candidate to a tweet and distinguishing the candidate categories from one another. Put another way, nominal measures have only the property of equivalency or non-equivalency: if 41 is the code for tweets


about the Republican, all such tweets receive the value 41, and no other value is applied to tweets about the Republican candidate; different codes must be used for other candidates. Analyzing news sources in articles about city government published in newspapers and on citizen journalism websites, Fico et al. (2013a; 2013b) classified various types of sources (e.g., government official, citizen, etc.) as being either present or absent in the articles. The coders used numbers to distinguish present and absent categories, but the authors reported their results by label of source category, not by assigned number.

Nominal measures can take one of two different forms. In the first, each category in the variable gets a number to designate membership, so, in the earlier example, the Democratic candidate would be recorded as a 1, the Republican candidate as a 2, and an independent candidate as a 3. In the second form, each category (e.g., candidate) is treated as a separate variable and each case (e.g., tweet) is assigned a number that either includes or excludes the case from the variable. Each tweet would get a 1 on a "Democrat present" variable if the Democrat is mentioned and a 0 (zero) if not. Each tweet also gets a 1 or a 0 for a variable for every type of candidate mentioned (e.g., "Republican present," "Independent present"). The number of such variables would equal the number of parties with candidates in the race; each category within the first form of variable becomes a variable in the second form. With the one-variable approach, the variable has multiple categories with one number each. With the multi-variable approach, each category becomes a variable, with one number for having the variable characteristic and one for not having that characteristic. This approach allows a single article to be placed into more than one classification, so it is particularly useful if a unit needs to be classified into more than one category of a nominal variable. For example, if individual tweets deal with more than one candidate, the multi-variable system might be preferable to the one-variable approach. After coding for multiple variables, the data can be recombined into one variable with data analysis software. For example, having a variable for the presence of the Democratic candidate and another for the Republican would allow researchers to create a four-category variable at a later stage: only Republican mentioned in a tweet; only Democrat mentioned; Republican and Democrat both mentioned; and neither candidate mentioned.
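A minimal sketch of that recombination step appears below, assuming Python with the pandas library and hypothetical column names (dem_present, rep_present) coded 1/0 as just described; any statistical package could do the same thing.

```python
import pandas as pd

# Hypothetical multi-variable coding of four tweets (1 = candidate mentioned, 0 = not).
tweets = pd.DataFrame({
    "dem_present": [1, 0, 1, 0],
    "rep_present": [0, 1, 1, 0],
})


def combine(row):
    """Recombine two dummy variables into one four-category nominal variable."""
    if row["dem_present"] and row["rep_present"]:
        return "both mentioned"
    if row["dem_present"]:
        return "only Democrat mentioned"
    if row["rep_present"]:
        return "only Republican mentioned"
    return "neither mentioned"


tweets["candidate_mention"] = tweets.apply(combine, axis=1)
print(tweets["candidate_mention"].value_counts())
```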

Ordinal measures also place content units into exclusive categories, but in this case the categories have an order. Each category is greater than or less than the other categories. Arranging categories in rank order carries more information about the content than just placing units into categories. The ordering of units can be based on any number of characteristics, such as prominence (which article appeared first in a news feed), amount of content that fits a category (publications with more assertions than others), or order of placement of a unit within a publication (front-page placement in newspapers carries more importance than inside placement).

Interval measures have the property of order, but the number assignment also assumes that the differences between the numbers are equal. They are called interval measures because the interval between any two values is equal to the interval between any other two values. In other words, the difference between variable values 2 and 3 (1 unit) is equal to the difference between values 7 and 8 or between 13 and 14. The simple process of counting numbers of content units illustrates interval measures. If a researcher wants to study the number of posts on a company Facebook page over time, they could count the individual posts published daily for a period of time.

Ratio measures are similar to interval measures because the difference between numbers is once again equal, but ratio data also have a meaningful zero point. Counting the number of words in a magazine issue has no meaningful zero point because every magazine must contain words by definition. However, if one counts the number of active verbs in a magazine issue, the measure is a ratio. It would be possible (although not likely) for a magazine to be written totally with passive verbs. Because ratio data have a meaningful zero point, researchers can find ratios among the data (e.g., Magazine A has twice as many active verbs as Magazine B). In some situations, ratio data can be created from a nominal classification system when the ratio of units in one category to all units is calculated. For example, Beam (2003) studied whether content differed between groups of newspapers with strong and weak marketing orientations. Beam first classified content units (self-contained units that could be understood independently of other content on the page) into a variety of categories for topic and type of item, then calculated the percentage of content units within various nominal categories (e.g., content about government or the "public sphere"), and finally compared percentages for strong market-oriented newspapers with percentages for weak market-oriented newspapers. This transformation of nominal data to ratio data was used because the number of content units varied from newspaper to newspaper, usually based on circulation size. A ratio measure allows researchers to compare relative emphasis regardless of number of units.

One advantage of using interval- and ratio-level variables in content analysis is that they allow the use of more sophisticated statistical procedures, primarily because the measures permit computation of means and measures of dispersion, such as variance and standard deviation. For example, r or Pearson's Product Moment Correlation is foundational for a number of higher-level statistical analyses, such as regression procedures, which allow researchers to control statistically for the influences of a variety of variables and to isolate relationships of interest (see Chapter 9). Lacy et al. (2013) regressed organizational and market variables on five measures of sources to evaluate which organizational or market


variables best predicted the sourcing. Similarly, Shin and Thorson (2017) used regression to predict the selective retweeting of fact-checking messages based on valence toward a given candidate.

Importance of Measurement Levels

Selecting a measurement level for variables depends on two rules: the chosen measurement level should be theoretically appropriate and carry as much information about the variables as possible. "Theoretically appropriate" means the measurement reflects the nature of the content and the particular hypotheses. If a hypothesis states that women will use more descriptive adjectives in Facebook posts than men, content will have to be analyzed with a nominal variable called writer gender. The variable for descriptive adjectives could take several forms. One measure would be nominal, classifying posts simply by whether they contain descriptive adjectives. However, this nominal level fails to incorporate the reality of writing because it treats all posts equally, irrespective of whether they contain one descriptive adjective or a hundred. A better measure, with more information, would be to count the number of descriptive adjectives in each post. This ratio-level measure would allow far more sophisticated statistical procedures. In fact, the level at which a variable is measured determines what types of statistical procedures can be used because each procedure assumes a level of measurement. Procedures that assume an interval or ratio level are called parametric procedures. These require certain population distributions if the researcher is to describe more precisely the population parameters with sample statistics. Nonparametric procedures (with nominal- and ordinal-level measures) make no such assumptions about population distribution, are less precise at describing the population, provide less information about patterns in data, are often more difficult to interpret, and make it harder to apply statistical controls.

Rules of Enumeration

No matter what classification system is used, quantitative content analysis requires rules coders must follow to connect content with numbers. Of course, the numbers reflect the level of measurement selected by the researcher for the content variable(s) of interest. The rules may be as simple as applying a 1 to a certain variable category of content unit (e.g., positive posts), and a 0 to other content units (e.g., negative posts). Enumeration rules for nominal data require arbitrarily picking numbers simply to distinguish among non-equivalent groups. In some cases, however, anticipating the planned data analysis provides guidance. For instance, a multivariate analysis may use a nominal-level independent variable. In this instance, coding “present” as 1 and “absent” as 0 facilitates the analysis. Or, if the analysis deals with gender, content including males could be

coded as 0 and content including females coded as 1 (or vice versa). These are called "dummy variables" when used in regression analysis. For interval or ratio data, enumeration rules might be instructions about what part of physical content to include or exclude. For example, rules about counting words in texts require a physical description of which words to count. Do coders count words in a headline? Do they count only those words in original Facebook posts or those in the original posts and the subsequent comments? In this case, a ratio scale facilitates analysis using correlation or regression analysis methods.

Enumeration rules must be clear and consistent, with the same numbers applied in the same way to all equivalent content units. If a scholar studies the percentage of time devoted to violent acts during a television program, rules of enumeration must identify clearly the point at which the timing of each violent act begins and ends. Enumeration rules are part of a classification system and are written for each variable in a content analysis protocol. They must provide consistent numbering of content and must be applied consistently by coders. Their success or failure affects the reliability as well as the validity of the study.

Measurement Steps

The following steps summarize the process of measuring content:

1. Develop hypotheses or research questions. Research questions and hypotheses force researchers to identify variables they want to study and the levels of measurement for the variables. They are the basis for a study and should be stated explicitly and referred to during data conceptualization, analysis, and presentation in an article.

2. Examine existing literature employing the variable or discussing its measurement. Social science should build on what is known—knowledge best synthesized in theory. However, explicit theory is sometimes absent, so new research is based instead on existing empirical studies. Commentary articles address methodology and measurement issues and should be considered (e.g., Lacy et al., 2015). In short, reviewing existing literature is crucial for accurate measurement. Initially, the literature provides a theoretical definition of variables being addressed in the research. Such definitions are important for guiding measurement because of their role in establishing the face validity of a measure. If the measurement of the variable reflects a reasonable representation of the theoretical definition of a variable, the measure can be said to have face validity (see Chapter 8).

3. Use good previous measures, or, if these are not good enough, adjust your measures. How good an existing measure is depends in part on its reliability. To evaluate reliability, use the suggestions in Chapter 7. Reviewing the literature will provide potential operationalization of variables. The social


scientific process also requires replication, which requires sharing of instruments and data with other scholars. Replication requires use of the same measures across studies. Content analysts should publish coding protocols when given the chance, but scholars should also freely request coding protocols from published authors. It is a waste of effort to recreate the wheel when adequate measures exist or provide a foundation. That said, researchers should use existing measures critically, evaluating their reliability, and understand that all measures have error. It should also be noted that adapting a measure, changing it even slightly from its original form, is not the same as adopting it "as is." The variable being studied might be slightly different from those in existing literature. If a modified measure is used, the adapted one should have face validity and be consistent with existing measures. The modification should strive to reduce measurement error by being more consistent with the theoretical definition of the variable. During modification, the researcher should determine the appropriate level of measurement for the variable—one that also better reflects the theoretical definition. When appropriate, the researcher should aim for higher levels of measurement that provide more precise tests of hypotheses.

4. Establish explicit coding instructions, including content categories defined in as much detail as is possible and practical. Each variable should have an explicit classification system and enumeration rules. A list of variables with category labels and corresponding values for each is insufficient. Generally, the more detailed the definitions, the higher the reliability. This includes detailing frequently occurring content that does not fit a variable. However, a researcher must be careful not to be so detailed as to make the application of the coding instruction too difficult. The coding instructions include any technical information about how the process will work, such as rules for rounding off numbers. All of this must be done and presented in a logical order that will allow a coder to refer to the instructions easily as he or she codes.

5. Establish a system for recording data to be entered into a computer. Virtually all quantitative content analysis projects use computers for analyzing data. Unless data are entered directly from the content into a computer, coding sheets will be required as an intermediate step. Coders record numbers for the variables' categories on the sheets and then enter them into a computer, retaining the sheets for verification purposes. It is possible to record directly from content to computers, though this process might interfere with the coding "flow," as coders switch from content to computer and back, and might increase the risk of keyboard input error. Some software allows coders to view content on half of a split screen and enter values on the other half (as discussed in Chapter 4; Lewis, Zamith, and Hermida (2013) used such an interface when coding Arab Spring tweets). A variety of coding-sheet formats can be used and may be as simple or as complex as the study requires. Chief criteria are efficiency in data input and cost reduction. The protocol

and coding sheets must be used easily together, with variables arranged and numbered consistently between them. More information about coding sheets is provided in Chapter 7.

Summary

Measurement is the process of moving from theoretical definitions of concepts to numerical representations of those concepts as variables. This process is called operationalization. The measurement process involves identifying the appropriate content of interest and designing an appropriate classification system for that content. The classification system uses content units to develop definitions of variables and categories for the variables. These variable categories must be translated into numbers, which requires the selection of appropriate levels of measurement, a system for classifying content, and rules for applying numbers to the content. This process is governed by coding instructions in a coding protocol that maximize the validity and reliability of the measurements of the content concepts of interest. The instructions should allow a variety of coders to replicate the measurements. Such measurements are then statistically analyzed to address study hypotheses or research questions. Almost always, such analyses are performed by statistical packages on computers.

6 Sampling

When designing a study, content analysts must ask, “How much data will be needed to adequately test the hypotheses or answer the research questions?” In an ideal world, sampling would not be an issue. Researchers would include all relevant content. A study of gender representation on television would examine every program on every channel during every pertinent time period. However, time and money limitations force trade-offs between ideal and practical. Human coding of all relevant content is impractical with thousands or even millions of content units in the population. In other situations, all relevant content cannot be obtained. Because of numerous issues discussed below, most content analysts sample relevant content rather than conducting a census of all content. How a sample is selected is critical because it determines the appropriate type of statistics that can be used (inferential or descriptive) and the extent to which results can be generalized. The more representative the study data, the more valid the conclusions about the represented group. A sample is a subset of units from the population being studied. When probability samples (units are randomly chosen) are selected, scholars can make valid inferences, within a sampling error interval, about the population under study. A probability sample permits estimation of sampling error with a given level of confidence. If samples are drawn in any manner other than randomly (and many are, or must be), sampling error cannot be calculated, it is impossible to estimate how much the sample differs from the population, and generalizations from sample to population are invalid. When sampling, the researcher should define the most appropriate universe, population, and sampling frame for the research purpose and design. The universe includes all possible units of content being considered. The population is composed of all the sampling units to which the study will infer. The sampling frame is the actual list of units from which a sample is selected. An example helps to clarify these three elements. If a researcher wanted to study the historical accuracy of William Shakespeare’s plays, the universe would be all of the plays written by Shakespeare, whether published or unpublished.


However, Shakespeare probably wrote plays that were unpublished or lost, so the population would comprise only his published plays. Finally, the sampling frame would be plays written by Shakespeare to which the researcher had access. A sample randomly drawn from this frame would also be a sample of the population if the frame and the population were the same. If the researcher could not gain access to one play, the population and the sampling frame would be different. When an intact set of all population units is unavailable, the sampling frame becomes the available content that is sampled and about which inference is drawn. This example illustrates how content a researcher seeks to study is not necessarily the same as content available for sampling. For instance, a content analysis exploring the portrayal of women in YouTube videos could not reasonably include a list of all characters before the content is sampled. The population (all women in YouTube videos) can be specified, but the sampling frame cannot. This problem is addressed below in the section that deals with multistage sampling.

Sampling Time Periods

Most survey researchers conduct cross-sectional studies, sampling people at one point in time to investigate behaviors, attitudes, and perceptions. Although some content analysts also conduct cross-sectional research, most study designs examine content across time. Because communication occurs on a continuing and often regular basis, understanding the antecedents and effects of content is difficult without looking at different points in time. Interesting longitudinal designs are possible when content is available from several time periods. For example, Danielson, Lasorsa, and Im (1992) compared the readability of newspapers and novels from 1885 until 1989, finding that the New York Times and Los Angeles Times became harder to read but novels became easier. Such longitudinal designs (discussed in Chapter 3) require populations and sampling frames that incorporate time as well as content.

Because content analysts sample time and content, confusion can occur about which populations are linked to inferences drawn from the data. For example, Kim, Carvalho, and Davis (2010) studied news framing of poverty in newspaper and television news. They took a purposive sample (discussed below) of television content from three broadcast networks (ABC, CBS, and NBC) and CNN, and of daily newspapers from four high-income and four low-income states. If an outlet published more than 60 stories over the study's 15 years, 60 articles were randomly selected. The use of random content sampling allowed inference to the entire time period for these news outlets while making coding manageable. However, inference could not be made to news outlets other than those purposively selected.


Scholars have expressed concerns about the time dimension in studying online (Mahrt & Scharkow, 2013) and mobile content. The lack of a predictable publication cycle and the ability of almost anyone to post content make sampling across time even more important (and difficult) when dealing with online content. However, these sampling problems are a matter of degree, not of kind: media content has historically changed across time, from multiple daily editions of newspapers in the 19th century to today's ever-changing social media environment. In addition, interpersonal communications (writing and phone calls) have always generated rapidly changing content lacking a discernible routine. The biggest problem relating to time in Internet and mobile sampling occurs when content is not archived with a timestamp. This content must be collected as it is posted, generating sampling problems that can be addressed using software to capture content at predetermined, randomly selected times. In effect, researchers studying un-archived and un-stamped content use software to generate their own archive. In sum, content analysts should make clear whether inferences concern content, time, or both dimensions. The appropriateness of inferences is based on which dimension was sampled using probability.

Sampling Techniques

Put simply, sampling means selecting a group of content units to analyze. However, to estimate sampling error and infer to a population, the sample must be a probability sample. With non-probability samples, sampling error estimates are meaningless and inferential statistics invalid. The basic problem addressed in this chapter, then, is how to collect samples that allow valid population inferences. The techniques outlined below will help with this.

Census

Every content unit in a population is included in a census. This technique often makes the most sense for research examining particular events or series of events. Walsh-Buhi et al. (2021) used keywords and a Facebook search tool to identify all Instagram posts in a 12-month period about daily oral pre-exposure prophylaxis (PrEP) to reduce the risk of acquiring HIV. Their goal was to describe the available PrEP information accurately. They discovered only 275 total posts, and dropped 25 that were not in English or had non-functioning links. They then coded the entire census of 250 because of the limited number of posts. Deciding between a census and a sample becomes an issue of best use of coders’ time to accomplish research goals. Whether a census is feasible depends on the resources and goals of individual projects. In principle, the larger the number

of content units that are coded, the more representative the data will be, but the more resources the project will require.

Non-Probability Sampling

Non-probability samples are often used despite their limitations in estimating sampling error. Such samples are appropriate under certain conditions, but they are also frequently used because adequate sampling frames are unavailable. Two commonly used non-probability techniques are convenience samples and purposive sampling. A study of content analysis articles in Journalism & Mass Communication Quarterly from 1971 to 1995 found that 9.7% of all studies used convenience samples and 68.1% used purposive samples (Riffe & Freitag, 1997).

Convenience Samples

A convenience sample uses particular content simply because it is available. The study of local television news provides a good example. Until the growth of the Internet, local TV newscasts were unavailable outside particular markets. As a result, programs had to be taped in the market of origin. Generalizing outside a single market might require people around the country to tape newscasts. Now, though, local television stations upload content online, with the majority providing text as well as video, so national probability samples of local television are possible (Baldwin et al., 2009). The Internet has also simplified sampling for content produced by most legacy media outlets. However, content on websites and in print products distributed by the same organization may not be equivalent, and increasing use of pay walls by legacy outlets requires research funding or reliance on library access to content. In addition, while a convenience sample can be thought of as a census of a population defined by availability rather than research questions, this population is a biased representation of the universe of units. Moreover, it is impossible to estimate the extent of the bias. As a result, convenience samples do not allow population inferences. That said, they can be justified under three conditions. First, the material being studied must be difficult to obtain. For example, a random sample of the population of magazines published in 1900 cannot be obtained. A sampling frame of such magazines would always be incomplete because a complete list is unavailable and most magazine editions published in 1900 no longer exist. Nonetheless, a researcher could acquire lists of magazine collections from libraries around the country and generate a random sample of the surviving magazines from that year. However, this would be extremely expensive and time-consuming work and the end result would still not represent the population of all magazines published in 1900.


Second, convenience sampling may be justified if a lack of resources limits the researcher's ability to generate a random sample of the population. Each individual researcher must decide how much time and money they are prepared to invest in a project. Whatever decision they reach, though, they should remember that their study will eventually be evaluated by journal reviewers. Third, convenience sampling may be justified when exploring an under-researched but important area. When little is known about a topic, even convenience samples can help generate hypotheses for further study. When such exploratory research is undertaken, the topic should be important to scholarly, professional, or policymaking communities. Of course, some under-researched areas are destined to remain that way because they are neither interesting nor important. On the other hand, convenience samples can provide a useful starting point for further research into fascinating, but previously neglected, subjects. Thus, while researchers should always attempt to minimize any bias in the data and must justify their use of convenience samples, such studies can have value. Science is cumulative, and consistent results from many convenience samples over time can contribute to theory-creation and -testing. In addition, data from such samples can suggest research questions and hypotheses to be checked through replication with probability samples or censuses.

Purposive Sampling

Purposive sampling also yields a non-probability sample, but this time for reasons dictated by the research project. Studies of particular types of publications or particular times may be of interest because the publications were important or the time period (and events within it) had historical significance. For example, Di Cicco (2010) studied 1967–2004 newspaper coverage of political protest in the New York Times, Washington Post, Seattle Times, San Francisco Chronicle, and Los Angeles Times. The Seattle and San Francisco papers were purposively selected because those cities were centers of protest, while the other dailies were among the top five circulation leaders during the study period. Given the large number of news stories published during such a lengthy period, limiting the study to these five publications made it manageable. Purposive samples differ from convenience samples because they require specific research justification, other than simple lack of money and availability. One often-used type of purposive sample is consecutive unit sampling, which involves, for example, examining all content produced during a certain time period. Analyzing all Facebook postings and comments during a two-week period is a consecutive day sample. Consecutive day sampling can be important when studying a continuing or unfolding news story because connected events

cannot be examined adequately otherwise. Such samples are often found in studies of elections, ongoing controversies, and prolonged crises or protests.

Probability Sampling

The core notion of probability sampling is that each member of a population has an equal chance of being in the sample. When done properly, characteristics found more frequently in the population—whether of TV dramas, Facebook posts, or TikTok videos—will also turn up more frequently in the sample, while less frequent characteristics will turn up less frequently. A simple example illustrates how this (usually much more complicated) process works. Take a coin. Its population consists of a head and a tail. The chance of getting a head (or a tail) on a single flip is 50%. Flip 100 times and close to half the results—but rarely exactly half—will be heads. Flip 1,000 times and the percentage will even more closely approach 50%. With an infinite number of flips, the "expected value" of the percentage of heads will be 50%. Similarly, if a very large number of relevant content units are in a sample, the sample's value on any variable being explored in content will approximate (within calculable sampling error) the population value of that variable.

An extension of this logic would be for a researcher to take many samples from the same population, one at a time. The best guess for the value of each sample's mean would be the population mean, although in reality the sample means would vary from that population mean. If an infinite number of samples were taken from a population, the average of all those sample means would equal the population mean. If the means of all the samples were plotted on a graph, the result would be a distribution of sample means called the sampling distribution. With an infinite number of samples, the sampling distribution of any population has the characteristics of a normal curve. One such characteristic is that the mean, median (the middle score in a series arranged from lowest to highest), and mode (the most frequent score value) are all equal. Moreover, 50% of all the sample means will be on either side of the true population mean; and 68% of all sample means will be within plus or minus one standard error (SE) of the true population mean (standard error is an estimate of how much the sample means in a sampling distribution vary from the population mean). That any sampling distribution, regardless of the population distribution, takes on a normal distribution when infinite numbers of samples are drawn is the central limit theorem.

Of course, no researcher ever draws an infinite number of samples. Each researcher draws a single sample, but the central limit theorem allows them to estimate the amount of sampling error in a probability sample at a particular level of probability or confidence. In other words, the researcher can calculate the probability that a particular sample mean (for a random sample) is close to


the true population mean in that distribution of infinite (but theoretically “drawable”) random samples. This probability can be calculated because the mean of an infinite number of samples (the sampling distribution) will, again, equal the population mean, and the distribution will be normal. The sampling error for a particular sample, when combined with the sample’s mean or proportion, allows researchers to estimate the population mean or proportion within a particular range (plus or minus) and with a particular level of confidence that the range includes the population value. The best guess at the unknown population mean or proportion is the sample mean or proportion, and calculating sampling error allows researchers to estimate the range of error in this guess. Crucial to understanding inference from a probability sample to a population is sampling error—an indication of the sample’s accuracy. Sampling error for a given sample is represented by standard error, which is calculated differently for means and proportions. Standard error of the mean is calculated by using a sample’s standard deviation—the average distance that cases in the sample vary from the sample mean. The standard deviation is divided by the square root of the sample size. The equation for standard error of the mean is:

SE(m) = SD / √n

in which
SE(m) = standard error of the mean
SD = standard deviation
n = sample size

The standard error of the mean is applied to interval- or ratio-level data. Nominal-level data use a similar equation for standard error of proportions. The equation for standard error of proportions is:

SE(p) = √(pq / n)

in which
SE(p) = standard error of the proportion
p = the proportion of sample with this characteristic
q = (1 − p)
n = sample size

Standard error formulas adjust the sample's standard deviation for sample size because size is one (usually the most important) of three factors that affect how good an estimate a sample mean or proportion will be. The larger the sample, the better the estimate of the population. Very large and very small case values

will crop up in any sample. The more cases in a sample, the less the impact of the large and small values on the sample mean or proportions. The second factor affecting the accuracy of a sample estimate is the variability of case values in the sample, which reflects the homogeneity of the population. If the case values vary widely, the sample will have more error in estimating the population mean or proportion because variability results from the presence of large and small values for cases. Sample size and variability of case values are related, because the larger a sample, the more likely case variability will decline. The third factor affecting the accuracy of a sample's estimate of the population is the proportion of the population in the sample. If a large proportion of the population is in the sample, the amount of error declines because the sample distribution better approximates the population distribution. However, a sample size must equal or exceed about 20% of the population before this factor plays much of a role in estimating sampling error. In fact, most statistics books ignore the influence of population proportion because surveys, which are often used in sociology and political science, usually sample from very large populations. As a result, sampling a high proportion of a large population is not necessary to generate a representative sample. That said, content analysts should not automatically ignore the impact of the population proportion in a sample, because their studies often include fairly high proportions of populations. When the percentage of the population in a sample of content exceeds about 20%, a researcher should adjust the sampling error formula using the finite population correction (fpc). To adjust the standard error for a sample, the standard error is multiplied by the fpc formula, which is:

fpc = √((N − n) / (N − 1))

in which
fpc = finite population correction
n = sample size
N = population size

For further discussion of the fpc, see Moser and Kalton (1972).
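The three formulas above translate directly into code. The sketch below, in Python, simply restates them as functions; the sample values at the end are invented for illustration.

```python
import math


def se_mean(sd, n):
    """Standard error of the mean: standard deviation divided by the square root of n."""
    return sd / math.sqrt(n)


def se_proportion(p, n):
    """Standard error of a proportion, where q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)


def fpc(n, N):
    """Finite population correction, used when the sample exceeds roughly 20% of the population."""
    return math.sqrt((N - n) / (N - 1))


# Hypothetical example: 300 stories sampled from a population of 1,000 stories,
# with 40% of the sampled stories showing the characteristic of interest.
se = se_proportion(p=0.40, n=300)
print(round(se, 4))                   # unadjusted standard error
print(round(se * fpc(300, 1000), 4))  # adjusted, because 300/1,000 exceeds 20%
```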


As noted above, communication occurs on a continuing and often regular basis, and understanding content's antecedents and effects is difficult without including time in the study design. Thus, sampling decisions often involve both time and content. Several probability sampling techniques are available, and decisions about probability sampling depend on numerous issues, but virtually every decision involves time and content dimensions. Researchers must decide whether probability sampling is appropriate for both of these dimensions as well as how randomness is to be applied. For example, a probability sample could be taken for both time and content (e.g., 20 randomly selected movies from each of 10 randomly selected years between 1993 and 2023), for content alone (e.g., a random sample of all movies released in 2023), for time alone (e.g., all Disney movies in 10 years randomly selected from between 1993 and 2023), or for neither time nor content (e.g., all movies released in 2023). In a strict sense, all content involves a time dimension. However, the concept of sampling time used here concerns trends over periods longer than a year, which represents a natural planning cycle for most media.

Simple Random Sampling

Simple random sampling occurs when all units in the population have an equal chance of being selected. If a researcher wanted to study gender representation in all feature films produced by major studios in a given year, random sampling would require a list of all films produced by those studios. The researcher would then determine the number of films in the sample (e.g., 100 out of a population of 375 films). Using a computer or a random numbers table, the researcher would select 100 numbers between 1 and 375 and locate the corresponding films on the list.

Simple random sampling can be conducted under two conditions: with units replaced in the population after being selected, or without replacement. With replacement, a unit could be selected for the sample more than once. Without replacement, each unit can appear only once in a sample. When units are not replaced, every unit does not have an exactly equal chance of being selected. For example, in a population of 100, every unit has a 1 in 100 chance of being selected before the first draw. On the second draw, each remaining unit would have only a 1 in 99 chance. This variation is not a serious problem because, even without replacement, each potential sample of a given size has an equal chance of being selected, even if each unit did not. When populations are large, the small variation of probability without replacement has negligible impact on sampling error estimates.

Simple random sampling works well for selecting a probability sample. However, it may not be the best or most feasible sampling technique in all situations. For example, if the population list is particularly long, or if the population cannot be listed easily, an alternative random sampling technique might be in order.
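As a simple illustration of the film example above, here is a minimal Python sketch (the film list is hypothetical) of drawing a simple random sample without replacement:

```python
import random

# Hypothetical sampling frame: 375 feature films released by major studios
films = [f"film_{i}" for i in range(1, 376)]

random.seed(42)                     # fixed seed so the draw can be reproduced
sample = random.sample(films, 100)  # simple random sample without replacement
```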

Systematic Sampling

Systematic sampling involves selecting every nth unit from a sampling frame. The sampling interval (n) is determined by dividing the sampling frame size by the sample size. So, if a sample includes 1,000 sentences from a book with 10,000 sentences, the researcher would select every tenth sentence after a starting point.

Taking every nth unit becomes a probability sample when the starting point is randomly determined. The researcher could randomly select a number between one and ten, which would be the number of the first sentence taken. Every tenth sentence after that would be selected until the complete sample is in hand. Because the starting point is randomly selected, each unit has an equal chance of being selected.

Systematic sampling may work well when simple random sampling creates problems. However, it has limitations. First, it requires a listing of all possible units for sampling, so if the sampling frame is incomplete (the entire population is not listed), inference cannot be made to the population. Second, systematic sampling is subject to periodicity, which involves a “bias” in how units in a list are arranged (Wimmer & Dominick, 2011). For example, suppose a researcher wants to study firearm advertising in monthly sporting magazines, such as Field & Stream. If the researcher took 4 copies per year for 20 years using systematic sampling, a biased sample could result. Assuming a 2 is picked as the random starting point and every third copy is selected after the first, the researcher would end up with 20 editions from February, 20 from May, 20 from August, and 20 from November. This creates a problem because advertising and editorial space vary by month, and eight months are missing from the sample.
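A minimal sketch of systematic selection with a random start (the sentence IDs are hypothetical):

```python
import random

def systematic_sample(frame, sample_size):
    """Select every nth unit from a sampling frame after a random start."""
    interval = len(frame) // sample_size
    start = random.randint(0, interval - 1)
    return frame[start::interval][:sample_size]

# Hypothetical frame of 10,000 sentence IDs; draw every tenth sentence (1,000 in all)
sentences = list(range(1, 10_001))
sample = systematic_sample(sentences, 1_000)
```

Note that if the frame has a cyclic ordering (such as monthly issues), this procedure reproduces the periodicity problem described above.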

Stratified Sampling

Stratified sampling involves breaking down a population into smaller groups and randomly sampling within those groups. The groups are more homogeneous than the entire population with respect to some important characteristic. For example, to study jingoistic language about the Vietnam War between 1966 and 1974 in Senate floor speeches, a sample could be randomly selected or stratified by year. Stratified random selection would be better because the language likely changed over time, as support for the war was stronger in 1966 than in 1974. A simple random sample might generate a sample with most of the speeches either at the beginning or the end of this period. Using years as strata, however, yields smaller, more homogeneous groupings and enhances representativeness. The percentage of the total 1966–1974 speeches that were made each year would determine the percentage of the sample to come from that year.

Stratified sampling serves two purposes. First, it increases representativeness of a sample by using knowledge about the distribution of units to avoid the oversampling and undersampling that can occur with simple random sampling when samples are not large. This is proportionate sampling, or selecting sample units from within strata based on a stratum’s proportion in the population, as in the Senate speeches example. A study of Facebook or Twitter posts might stratify by topic areas. The percentage of sample messages from a given topic area would represent that topic area’s proportion of the entire population. If 20% of all messages address “movies,” then 20% of sample posts should be about movies. This makes the sample more representative.

Second, stratifying can increase the number of units in a study when that type of unit is a small proportion of the population. This disproportionate sampling involves selecting a sample to represent a stratum that is larger than that stratum’s population proportion. Such deliberate oversampling allows the sample from the stratum to be large enough for comparison with others. If, for instance, only 10% of 1,000 Twitter account owners are older than 60, and the study is about the relationship between age and tweets, the researcher might want to oversample the 60-plus age stratum because a simple random sample of 200 would yield only 20 people from that stratum, which would be insufficient for valid comparisons with larger population groups (strata). So, disproportionate sampling oversamples particular units to obtain enough cases for valid analysis, but it yields a final sample that is no longer representative of the entire population precisely because the 60-plus stratum is overrepresented.

Recall that stratification in sampling is based on some important variable. Because mass communication media produce content on a regular basis (e.g., every day, every week, or once a month), stratified sampling takes advantage of known variations within production cycles. Daily newspaper print editions, for example, vary in size by day of the week because of cyclic variation in advertising. Stratified sampling ensures neither “large news hole” nor “small news hole” days are overrepresented. We examine systematic variation and media sampling in detail below.

Stratified sampling requires adjustments to sampling error estimates. Because sampling is within homogeneous subgroups, standard error is reduced. The standard error for stratified samples is the sum of standard errors across all strata (Moser & Kalton, 1972).
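A minimal sketch of proportionate stratified sampling (the data structures are hypothetical; allocation is rounded, so the final sample can differ slightly from the target size):

```python
import random
from collections import defaultdict

def proportionate_stratified_sample(units, strata, total_n):
    """Draw a proportionate stratified sample.

    units   -- list of items (e.g., Senate speeches)
    strata  -- parallel list of stratum labels (e.g., the year of each speech)
    total_n -- desired overall sample size
    """
    by_stratum = defaultdict(list)
    for unit, stratum in zip(units, strata):
        by_stratum[stratum].append(unit)

    sample = []
    for members in by_stratum.values():
        n_stratum = round(total_n * len(members) / len(units))  # stratum's share of the sample
        sample.extend(random.sample(members, min(n_stratum, len(members))))
    return sample
```

For disproportionate sampling, the allocation line would simply be replaced with a fixed per-stratum quota.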

Cluster Sampling

Simple random, systematic, and stratified sampling all require a list as a sampling frame. This list indicates how many units are in the population and allows calculation of probabilities. Often, however, complete lists of population units are unavailable. To sample when no such list exists, researchers use cluster sampling, which involves selecting content units from clusters or groups of content. Consider that mass media products often include clusters of content. For instance, Google News presents each day’s array of articles, divided into topic clusters (e.g., sports, business, entertainment). Similarly, listing all websites is impossible, but search engines can use cities to identify clusters of local websites, which can be sampled when geography is important. Cluster sampling allows probability selection of groups (clusters) and subgroups; random sampling within groups and subgroups would lead to the specific content units.

Although cluster sampling can be useful, it can introduce additional sampling error (compared to simple random sampling) because of intra-class correlation. Content units, such as entertainment articles, that cluster together may do so because they are similar.

These shared characteristics create a positive correlation among attributes. Clusters are more likely to include units with similar characteristics and exclude units that have different characteristics. As a result, the sample, although randomly determined, may not be representative. That said, intra-class correlation can be anticipated, and some studies (e.g., Moser & Kalton, 1972) provide formulas for estimating it.

Multistage Sampling

Multistage sampling is not a form of probability sampling in itself. Rather, it refers to the common practice of using one or more of the aforementioned probability sampling techniques at different stages in generating a sample. Recall that the simplest form of probability sample lists all units, randomly selects from them, and proceeds with analysis. However, as noted earlier, most content is not easily listed, and media content is often in packages or clusters. Moreover, most content also has a time dimension. Indeed, Berelson (1952) said content has different dimensions that require sampling consideration: titles, issues, or dates, and relevant content within dimensions. A multistage sampling procedure might address all these dimensions as stages, with a random sample taken at each stage to facilitate population inference. For example, someone studying the content of local talk radio programs would have to randomly select radio stations, days of the week, and particular programs. Yet another stage might be particular topics within the programs (e.g., government). When studying organizational Facebook pages, organization type, specific organizations, and date might be stages. Pure multistage sampling requires random sampling at each stage. Multistage sampling can also combine various techniques reflecting the research’s purpose, with the guiding principle being to produce a sample that is as representative as possible for population inference. The researcher must determine the number of stages in the process. Danielson and Adams (1961) used sophisticated multistage sampling to study the completeness of campaign event coverage available to average readers during the 1960 presidential campaign. They selected 90 daily newspapers, stratifying for ownership type (group and non-group), geographic region, and time of publication (a.m. or p.m.). Despite slightly oversampling Southern dailies, the sample’s characteristics matched the population. Meanwhile, a systematic random sample of 42 campaign events was drawn from the population of 1,033 events covered by 12 large dailies between September 1 and November 7, 1960. The process of sampling celebrity tweets could have one, two, or three sampling stages. The first might involve randomly selecting celebrity type (sports, movie, music, etc.). A second might involve selecting one or more celebrities from the selected types. A third stage might be to randomly sample tweets by the selected celebrities. Contrast this with a single-stage probability procedure that lists every tweet by every celebrity over a given time period and then randomly selects from the list. The multistage process would take considerably less time.
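Following the celebrity tweet example above, here is a minimal sketch of a three-stage sample (the data structures are hypothetical):

```python
import random

def multistage_celebrity_tweet_sample(celebs_by_type, tweets_by_celeb,
                                      n_types, n_celebs, n_tweets):
    """Three-stage sample: celebrity types, then celebrities, then tweets.

    celebs_by_type  -- dict mapping type (e.g., 'sports') to a list of celebrities
    tweets_by_celeb -- dict mapping each celebrity to a list of their tweets
    """
    sample = []
    types = random.sample(list(celebs_by_type), min(n_types, len(celebs_by_type)))  # stage 1
    for ctype in types:
        celebs = celebs_by_type[ctype]
        for celeb in random.sample(celebs, min(n_celebs, len(celebs))):             # stage 2
            tweets = tweets_by_celeb[celeb]
            sample.extend(random.sample(tweets, min(n_tweets, len(tweets))))        # stage 3
    return sample
```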

Just as stratified sampling alters the formula for standard error, so too does multistage sampling: it introduces sampling error at each stage, and error estimates must be adjusted.

Stratified Sampling for Media

Deciding between simple random or stratified sampling usually involves efficiency. Content produced by legacy commercial media has predictable variations reflecting news cycles and advertising support. For example, politicians often release “negative” information late on Friday afternoons because weekend newsrooms have smaller staffs and people are less likely to watch weekend TV news. The number of pages in printed daily newspapers varies by weekday on the basis of advertising lineage, but a news outlet’s website is not limited by time or space. If systematic variations affecting content are known, they can be used to select a representative sample more efficiently. The variations identify subsets of more homogeneous content that permit using a smaller stratified sample that will be just as representative as a larger simple random sample.

Several studies have looked at stratified sampling in various media forms to identify efficient sample sizes and techniques for inferring to a particular time period. These studies have also examined different types of variables (see Table 6.1).

Table 6.1 Efficient stratified sampling methods for inferring to content

Type of Content | Nature of Sample
Year of daily print newspapers | Two constructed weeks from year (randomly selecting two Mondays, two Tuesdays, two Wednesdays, etc.)
Year of health stories in daily print newspapers | Six constructed weeks
Year of the New York Times online | Six randomly selected days
Five years of daily print newspapers | Nine constructed weeks
Year of online Associated Press stories | Eight constructed weeks
Year of weekly print newspapers | Randomly select one issue from each month in the year
Year of evening television network news | Randomly select two days from each month’s newscasts during the year
Season of TV program sexual content | Randomly select five programs for program-level behavior and seven programs for character behavior
Year of print news magazines | Randomly select one issue from each month in a year
Five years of print consumer magazines | One constructed year (randomly select one issue from each month)
Year of online press releases | Twelve constructed weeks (three weeks per quarter)

Note: These are general rules of thumb; exceptions may be found in the original articles (cited in the text).

Daily Print Newspapers

Because of their traditional importance as a journalistic mass medium, daily newspapers have received more attention in sampling efficiency studies than other media forms. These studies have concentrated on efficiency of sampling for inference to typical levels of content by using the “constructed week,” created by randomly selecting an issue for each day of the week. An early study by Stempel (1952) concluded that twelve days (two constructed weeks of two Sundays, two Mondays, etc.) were sufficient for representing a year’s content in a six-day-a-week newspaper. Riffe, Aust, and Lacy (1993) conducted a replication and extension of Stempel’s study, comparing simple random, constructed week, and consecutive day sampling for efficiency for a seven-day-a-week newspaper. They found that one constructed week adequately predicted the population mean for a year’s issues, and two constructed weeks worked even better. Song and Chang (2012) examined sampling efficiency for China’s People’s Daily and reached similar conclusions: one constructed week was representative and two worked better. Two constructed weeks of daily newspapers works well to infer to a year of content, but researchers often study content changes across longer time periods. Lacy et al. (2000) examined efficiency in sampling daily newspaper content across five years, concluding that nine constructed weeks from a five-year period were as representative as two constructed weeks drawn from each year if the variable of interest did not show great variance. These studies addressed sampling efficiency for general patterns of newspaper content, but the same systematic variations in news hole (e.g., large Sunday but small Saturday issues) might not affect the presence of particular topics attracting researchers’ interest. News coverage of specific topics typically reflects variations in the environment rather than size of the news hole. For example, newspapers are more likely to cover city government the day after a city government meeting (Baldwin et al., 2009). Luke, Caburnay, and Cohen (2011) looked at health studies in daily newspapers and again found that constructed week sampling was more efficient, but it took six constructed weeks, rather than two, to represent health stories. (Six constructed weeks provided a representative sample for a five-year period as well as a one-year period.) This runs counter to the nine constructed weeks suggested by Lacy et al. (2000) for representative content. Having six constructed weeks represent five years is a concern if one wants a representative sample, because one or more years could have only a few days in the sample. If the five-year period included extraordinary events (e.g., the Great Recession of 2008–2009 or the COVID-19 pandemic), such events might reduce or increase the space given to topics such as health during any given time period. Either way, smaller samples are less likely to be representative. Scholars would be advised to evaluate the nature of their samples in terms of accomplishing project goals.
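A minimal sketch of building constructed weeks for a year of daily issues (the year and number of weeks are illustrative):

```python
import random
from datetime import date, timedelta

def constructed_weeks(year, n_weeks=2):
    """For each weekday, randomly pick n_weeks dates from that weekday's
    occurrences in the year, yielding n_weeks constructed weeks of issue dates."""
    days, d = [], date(year, 1, 1)
    while d.year == year:
        days.append(d)
        d += timedelta(days=1)

    sample = []
    for weekday in range(7):  # 0 = Monday ... 6 = Sunday
        candidates = [day for day in days if day.weekday() == weekday]
        sample.extend(random.sample(candidates, n_weeks))
    return sorted(sample)

# Two constructed weeks (14 issue dates) from 2023
print(constructed_weeks(2023, n_weeks=2))
```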

Weekly Print Newspapers

The sparse research on daily newspaper sampling seems extensive compared to studies of sampling weekly newspapers. Lacy, Robinson, and Riffe (1995) wanted to see if stratified sampling would also improve sampling efficiency with weeklies. The results indicated stratified sampling offers some efficiency gains over simple random sampling, but the influence of cycles in weeklies’ content is not as strong as with dailies. A simple random sample of 14 issues or one issue randomly selected from each month (12 issues) were the most efficient approaches to representing a year’s content. The authors concluded that the former was preferable when management needed to make risky decisions, whereas the latter worked better when time and money were more important considerations than risk.

Print Magazines

Sampling studies have addressed efficiency for weekly news magazines and monthly consumer magazines. Riffe, Lacy, and Drager (1996) found that selecting one Newsweek issue randomly from each month was most efficient for inferring to a year’s content; the next most efficient method was simple random selection of 14 issues from a year. This result mirrors those for weekly newspapers, which have the same publication cycle.

Unlike news magazines, however, consumer magazines usually appear monthly, and the best approach to studying a year’s content is to examine all 12 issues. However, if a researcher wants to study longer-term trends in consumer magazines, stratified sampling might prove more efficient than simple random sampling. Using Field & Stream and Good Housekeeping as examples of monthly consumer magazines, Lacy, Riffe, and Randle (1998) found that a constructed year (one issue from January, one from February, one from March, etc.) from a five-year period produced a representative sample for that time frame. A longitudinal study of consumer magazines over, say, 20 years could divide that period into five-year sub-periods and use a constructed year for each sub-period to make valid inferences to two decades of the magazines under study.

Network Television News

Although television content analyses are plentiful, research on valid and efficient sampling methods is practically nonexistent. Types of samples found in published research include randomly selecting twelve composite weeks from sixty months (Weaver, Porter, & Evans, 1984), using the same two weeks (March 1–7 and October 1–7) from each year between 1972 and 1987 (Scott & Gobetz, 1992), sampling two constructed weeks per quarter for nine years (Riffe et al., 1986), and using four consecutive weeks per six-month period (Ramaprasad, 1993).

The variation in types of sampling reflects particular research questions, but stems also from the absence of guidance from sampling studies about television news. Riffe, Lacy, Nagovan, and Burkum (1996) explored network news sampling by using a year’s worth of Monday through Friday broadcasts from ABC and CBS as the populations. The most efficient sampling for network TV news was randomly selecting two days per month: that is, a total of 24 days from the year. Simple random sampling required 35 days to predict a year’s content adequately. The authors cautioned that researchers should be aware of extreme variations in particular content categories that could affect sample adequacy.

In the absence of studies demonstrating superior efficiency of stratified sampling, researchers should use simple random sampling, because applying constructed weeks, months, and years to forms of content not known to be influenced by weekly, monthly, and yearly cycles will introduce unknown biases into the data.

Network Television Sexual Content

Because social scientists are interested in the sexual content of media programming (Coyne et al., 2019), Manganello, Franzini, and Jordan (2008) explored sampling of sexual behavior in television network entertainment programs. Using ten primetime programs from the 1998–1999 season, they concluded that five randomly selected programs were sufficient for representing program-level sexual behaviors for the season, but seven random programs were needed to represent an individual character’s behaviors.

Sampling Digital Content

Historians will likely call the 21st century’s first ten years the “digital decade,” as social media and smartphones were introduced, Netflix began streaming series and movies, and legacy media outlets accepted their future in digital delivery. The digital delivery of content creates benefits and problems for social scientists. When media companies put content online, content acquisition by scholars became easier and typically less expensive. However, the explosive growth of social networking sites like Facebook and Twitter soon indicated that these new information distribution and networking systems would have a huge impact on individuals and social groups. Equally obvious was the fact that accessing populations and even probability samples would become increasingly difficult, due to the absence of sampling frames, the existence of private areas on social media sites, and the expense of acquiring and analyzing large data sets. For instance, Twitter users post an average of 500 million tweets per day (Aslam, 2022b). Similarly, in 2022, more than 1.88 billion users visited Facebook each day (Aslam, 2022a).

In addition, the complex nature of social networking sites and processes affects how they are studied and, therefore, the process of sampling. Most legacy outlets traditionally used distribution systems that were very slow and expensive and made interaction almost impossible. They were oriented exclusively toward a mass audience. By contrast, digital platforms allow people to communicate not only to large numbers of others (mass communication) but also to a single person (interpersonal communication). These two dimensions can even combine when an interpersonal message is redistributed over networks. (This is especially important during crises.) Sampling has become more difficult because all three of these forms of communication (mass, interpersonal, and their combination) can occur in a single data set.

As with all sources of content, the type of sample (probability, convenience, or purposive) is determined by the research questions, access to content, and cost of access. The impact of these factors can vary by type of digital content. For example, digital content designed for mass consumption is easier to access than content that is not, although access to the former may require payment. Twitter makes some of its content more readily available than does Facebook; and Snapchat is a platform for sharing messages that last for only a short period of time, creating additional sampling problems. The following discussion of digital sampling focuses on the Internet in general and social networking platforms such as Twitter, Facebook, and YouTube in particular.

Digital distribution occurs primarily through fiber-optic, cellular, and Wi-Fi systems, and is displayed mostly through websites or applications. A prime difference between websites and social networking platforms is the tendency of the former to represent organizations, both for-profit and nonprofit, instead of individuals, whereas the latter represent the content of organizations, social groups, and individuals. Thus, social networking platforms tend to generate very large data sets (compared to most websites), which can complicate the process of drawing representative samples of social network content.

Sampling the Web

Discussing online content sampling problems, Karlsson (2012) noted that they stem from four dimensions: interactivity, immediacy, multimodality, and hyperlinks. As a result, he argued, online information is erratic and unpredictable. A lack of predictability, from a researcher’s perspective, presents sampling challenges, reducing the ability to use stratified sampling and requiring a longer time frame for simple random sampling. Stempel and Stewart (2000) said a serious problem confronting Internet studies is the absence of population sampling frames. To a degree, the Internet is like a city without a telephone book or a map for guidance, with new houses being built and old houses deserted with no listing of the changes. Sampling therefore requires creative solutions.

Despite these difficulties, numerous studies on a wide range of topics have sampled news websites. Kim, Thrasher, Kang, Cho, and Kim (2017) studied coverage of e-cigarettes by three newspaper websites, three television networks, and six print newspapers by downloading articles from the Internet and finding transcripts on a news database. Carpenter, Boehmer, and Fico (2016) used a convenience sample of for-profit and nonprofit news sites in five cities and downloaded online content from ten sites to study how the two types of sites differed in journalistic role enactment.

As with analog communication, researchers have investigated efficient online sampling. Hester and Dougall (2007) used six months of content from news aggregator Yahoo! News that included stories from several legacy media organizations (e.g., Associated Press, USA Today, CNN, etc.). They concluded that constructed week sampling was the most efficient type of random sampling. However, the minimum number of weeks was two, while some types of news required up to five constructed weeks. Tamul and Martínez-Carrillo (2018) used keywords about immigration to create a sampling frame for immigration stories on three Arizona daily newspaper websites. After identifying 18 themes, they found that 14 randomly selected weeks were representative of all themes for a year, and that 14 randomly constructed weeks were representative of 17 of the 18 themes. Another study examined topic, geographic bias, number of links, and uses of multimedia presentation in stories on the New York Times website (Wang & Riffe, 2010). The authors found that six randomly selected days could represent an entire year of content. They cautioned, however, that their finding might not be generalizable to other websites because the Times has a large news staff, and smaller staffs might result in more or fewer content variations from day to day.

Given the limited number of news website sampling studies and variations among those that exist, scholars should be careful about sampling. The elasticity of the online news hole has reduced the impact of both advertising and news cycle on content. In addition, the size of sample that is required to be representative may vary depending on the nature of the variable being studied (theme, topic, story length, etc.). When in doubt, scholars should use simple random sampling and increase sample size as a category becomes more variable.

Of course, organizations other than legacy news outlets also post content. Connolly-Ahern, Ahern, and Bortree (2009) studied a year of content on two press release services (PR Wire and Business Wire) and one news service (Associated Press), concluding that constructed weeks are efficient but more than two are necessary for representativeness. Indeed, the authors recommended at least twelve constructed weeks (three per quarter) for online press releases and eight constructed weeks for the Associated Press website. However, the necessary sample size was very topic-dependent. Researchers should consult the tables in the article when sampling these services.

The need for larger samples found by Connolly-Ahern et al., when compared to efficient legacy news sampling (newspapers, magazines, and television), represents the lack of consistent and predictable variations or cycles that allow more efficient stratification.

McMillan (2000) made a series of recommendations after analyzing 19 research studies that examined online content. First, she warned scholars to be aware that the Internet is both similar to and dissimilar from legacy media. Researchers must understand that people use them differently. Second, online sampling can be very difficult because sampling frames are not readily available and content can change quickly. Third, the changing nature of the Internet can make coding difficult: either content must be “captured” in some form and/or sampling must take the change itself into consideration. Fourth, researchers must be aware that the multimedia nature of the Internet can affect various study units. Fifth, the changeable nature of websites can make reliability testing difficult because coders may not be coding identical content.

As with all media, decisions on online sampling depend on how the research is conceptualized. A study of representative content on Facebook, for example, presents certain problems, whereas sampling legacy news sites creates others. Convenience samples create fewer problems, but results cannot be generalized. In all studies, researchers must be aware of the time dimension and changing web content.

When dealing with online content without a readily available sampling frame, multistage sampling might be employed. The first stage could involve using a range of search engines and algorithms to generate multiple lists of sites. These lists become a sampling frame after duplicates are removed. The second stage would involve selecting from the sites in the sampling frame. More stages could be added if other variables (such as geography) are important in generating a sample. However, this approach creates other problems. First, search engines and algorithms generate long lists of sites, the lists are not randomly selected, and the search engines have different algorithms for generating the order in their lists. As a result, creating a sampling frame from search results can be time-consuming, and samples might be more representative of commercial and organizational sites than individual sites. Second, content on some pages changes at varying rates. Such variability yields a process similar to putting personal letters, books, magazines, and newspapers into the sample population and trying to get a representative sample. Using categories other than topics to classify web pages might be an answer, but a standardized typology of categories has yet to be developed and accepted.

To deal with the changeable nature of news websites, Kutz and Herring (2005) developed micro-longitudinal sampling using a software program that would download specified page components (e.g., headlines) every 60 seconds from a news site. The program would only download elements that had been changed since the last visit. Using CNN, BBC, and Al Jazeera, the researchers analyzed the nature of these changes. Most of the changes on Al Jazeera indicated new stories, but most on the BBC and CNN were revisions of previous stories.

As noted in Chapter 1, the size and complexity of the Internet has led to the development of machine-learning approaches to sampling. For example, Chau and Chen (2008) provided a substitute for traditional search engines in the form of topic-specific search engines that learn from training documents. Such engines find the appropriate URLs and then filter documents from URLs that are not relevant to the research question being studied. This approach uses both content and structure (links) to collect web content and compares favorably to keyword and lexicon-based approaches. After documents are identified and filtered, the researcher may want to take a sample from the population that results from searching for the topics online.

Sampling with Databases

The digitization of media content has also enhanced content analysis through increases in storage capacity. Messages of all types have been digitized, preserved, and made available online. With this increased capacity has come an ability to search and retrieve specific types of content from a wide range of available databases. At its simplest level, a database is a structured collection of data that can be easily searched (and units retrieved) with computers. The data may take many forms, but content databases are typically text, visual, and auditory messages from media ranging from newspapers to social networking sites. Databases can be commercial (e.g., Factiva, LexisNexis, PR Wire) or researchers may create them (Lacy et al., 2015). Of course, others could be a combination of the two. The creator decides the content and organization of each database. Most databases are searched with keywords—terms specifically associated with study concepts. For example, Watson (2017) studied the relationship between local newspaper coverage of violent crime in Minneapolis and St. Louis and local online searches about crime in those two cities. He accessed the ProQuest database for the papers’ coverage and retrieved weekly search data from the Google Trends site. He tested a series of search “strings” for the two newspapers using terms such as “police,” “murder,” “homicide,” “arrest,” and so on. Commercial databases, such as LexisNexis, are often used to generate sampling frames with keywords so random samples can be generated from them. As noted above, Tamul and Martínez-Carrillo (2018) used keywords about immigration to create a sampling frame of LexisNexis articles for their study of sampling efficiency in identifying themes. Scholars studying themes in samples based on LexisNexis should examine their article.

Although databases can be powerful tools for accessing content, the process has its limits. For instance, it is highly unlikely that a single database will contain the universe of content a researcher hopes to study. Wu (2015) compared articles about post-traumatic stress disorder in the LexisNexis and America’s News databases and found 94% overlap. However, Weaver and Bimber (2008) compared news coverage of nanotechnology in LexisNexis and Google News and found just 71% overlap. All media databases are purposefully organized populations, not representative of the universe. One way of dealing with the limitations of any given database is to use more than one to collect needed content. This would require comparison of the range of content in the databases by discovering what content is and is not included. In addition, researchers should explore whether indexing and archiving software for the databases are equivalent.

Existing research has yielded little information about the processes used to generate content samples from databases. After examining 83 content analysis studies, Stryker et al. (2006) reported that only 39% provided keywords from the search and only 6% discussed the validity of the keywords. The keywords employed are crucial in determining a sample’s ability to yield valid results. Sobel and Riffe (2015) used LexisNexis to study the New York Times’ coverage of Botswana, Ethiopia, and Nigeria. They found 7,454 articles about these countries by using their names as keywords. However, only 19% of these stories had one of the countries as the main focus of the article. Clearly, then, searches that use a single keyword may yield content irrelevant to the study. Instead, researchers should use strings of keywords based on previous research and compare the output of various forms of the keyword strings (e.g., see Watson, 2017).

Stryker et al. (2006) advocate conducting and reporting formal evaluations of the recall and precision of a search string. Recall measures a string’s ability to retrieve the pertinent content, while precision measures whether the retrieved content is actually relevant to the study’s goals. Recall is calculated by dividing the number of relevant articles retrieved by the number of relevant articles in the database. Precision is calculated by dividing relevant articles retrieved by all articles retrieved. These two measures are often negatively correlated: that is, the more precise (narrow) a keyword string, the more likely it will miss relevant content. Relevant database content is established with a protocol that has reached acceptable reliability and has been applied by two or more independent coders.

Precision and recall measures can be used to create a “correction coefficient” that better estimates error associated with samples from the database (Stryker et al., 2006). The correction is calculated by dividing the precision by the recall. For example, if a study sampled magazine articles about student loan debt and had a precision measure of .75 (i.e., 75% of the articles retrieved were pertinent to the study) and a recall of .5 (i.e., 50% of the pertinent articles in the database were captured in the sample), the coefficient would be 1.5. If the search string had identified 100 articles, a more accurate estimate of pertinent articles in the database would be 150 (100 × 1.5).

A correction coefficient of less than 1 indicates the search string overestimated the number of articles, while a coefficient of greater than 1 indicates that the string underestimated the number of articles in the database. Stryker et al. (2006) suggested that this correction coefficient is accurate for longer time periods (more than a month) but not for short time periods (a day or week).

Researchers using databases to access content should provide a detailed description, including a discussion of what relevant media outlets were included in and excluded from the database. In addition, search strings should be reported, and the process by which they were determined should be explained. The researcher should also calculate precision, recall, and the correction coefficient, and report all three in the article.
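A minimal sketch of these calculations, using the hypothetical figures from the student loan debt example above:

```python
def recall_precision(relevant_retrieved, total_retrieved, relevant_in_database):
    """Recall and precision of a keyword search string."""
    recall = relevant_retrieved / relevant_in_database
    precision = relevant_retrieved / total_retrieved
    return recall, precision

# Hypothetical counts consistent with the example above
recall, precision = recall_precision(relevant_retrieved=75,
                                     total_retrieved=100,
                                     relevant_in_database=150)
coefficient = precision / recall          # 0.75 / 0.50 = 1.5
estimated_pertinent = 100 * coefficient   # 100 retrieved articles imply about 150 pertinent in the database
```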

Sampling Social Media

Perhaps the most revolutionary use of digital media has been the development of social media. Social media platforms provide a flexible and instantaneous system for people or organizations to interact with others one-to-one or as a mass audience. People can create their own content or re-post news and information from organizations and journalists. The large numbers of people who use social media and the fact that both organizations and individuals utilize it have led to a variety of data and sampling problems (Mneimneh et al., 2021). The problems researchers face and their solutions depend on several conditions: Is the social media content provided by an organization or an individual? How accessible is the content? How much funding does the researcher have? How much technical expertise does the research team possess?

Mneimneh et al. (2021) described five common approaches to collecting social media data, some of which were addressed in Chapter 4 of this volume.

First, researchers can collaborate with social media companies. Some companies, such as Facebook, provide access to data for selected projects. However, because of the growing politicization of social media over the past decade, selected projects tend to be technical, such as work on machine learning and artificial intelligence (Meta, 2022), rather than content-oriented.

Second, researchers can use available data sets of social media content, such as Harvard Dataverse (2022). Unfortunately, the odds of finding an existing data set to match the research goals are quite low.

Third, researchers can buy data from third-party vendors like Brandwatch (2022) and Meltwater (2022) that collect and sell data from a variety of sources. This approach can be expensive.

Fourth, researchers can use “scraping” tools in a variety of programming languages to collect data from web pages. Scraping can be used with any public web data, and Python scraping tools can be found online (Beautifulsoup4 4.11.1, 2022; Request 2.28.1, 2022).
For example, Sendra and Farré (2020) used Netlytic software to scrape content from Instagram with the hashtag #chronicpain to study how chronic pain sufferers describe their condition. Using a Python script, Comfort and Hester (2019) collected 440,574 tweets about the People’s Climate March during a 2015 UN climate change conference; a second script searched the sample for key messaging promoted by environmental groups. As noted in Chapter 4, commercial “plug-and-play” tools for gathering digital data require little knowledge of coding, but they may lack transparent documentation, potentially limiting researchers’ ability to replicate methods and findings.

Fifth, many researchers have used application programming interfaces (APIs), which have proved popular because they are generally easy to use and are sometimes provided for free by social media companies (e.g., Twitter, Facebook, TikTok, Instagram, etc.). In their study of the impact of right-wing postings, Heiss and Matthes (2020) used a Facebook API to download a year’s worth of Facebook posts by German and Austrian political parties. However, while APIs can be very useful in collecting large amounts of data, they also specify and limit the types and amounts of information made available. Moreover, some companies have begun restricting access to API data because of concerns about terms of service agreements and users’ privacy, as noted in Chapter 4.

Studying social media content does not require the use of API or scraping software, though such resources may be useful when generating large amounts of descriptive data. Samples will vary depending on the type of API and scraping algorithm used. Traditionally, Twitter allowed access to tweets either through its “firehose,” which provided all tweets located by a set of terms, or through its streaming API, offering a random selection of 1% of the firehose (Joseph, Landwehr, & Carley, 2014). Morstatter, Pfeffer, Liu, and Carley (2013, p. 406) found that the streaming API data estimated top hashtags well for large samples (n), “but is often misleading for a small n.” A number of studies have examined various algorithms that can be used to access social media data (Rusmevichientong, Pennock, Lawrence, & Giles 2001). For examples, see Bruns and Liang (2012); Palguna et al. (2015); Rezvanian and Meybodi (2017); and Gjoka, Kurant, Butts, and Markopoulou (2009).

As always, the logistics of sampling social media content depend on the goal of the research and the nature of the data being sought. Sampling becomes easier when studying public content generated by specific organizations. Using the keyword “Coronavirus,” Hillyer, Basch, and Basch (2021) studied YouTube videos about COVID-19 transmission by sampling the 100 most viewed English and Spanish videos available on three dates between January 1 and June 30, 2020. In a similar study, Li, Guan, Hammond, and Berrey (2021) analyzed 331 videos from the COVID-19 TikTok information hub, which constituted a census of videos posted by 8 public health and UN agencies.
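For readers unfamiliar with the scraping approach described above, here is a minimal, hypothetical sketch using the requests and Beautiful Soup libraries; the URL and the tag being collected are placeholders, and any real collection should respect a site's terms of service:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example: collect headline text from a public news page.
url = "https://example.com/news"
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]
for headline in headlines:
    print(headline)
```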

As with other media, samples of social media content serve two fundamental purposes: to describe a population and to test hypotheses about relationships among content, antecedent, and effect variables. Because of the nature of social media, description can be conceptualized as addressing the content posted by people or the content most accessed by people. The distinction is important because people who tweet vary in number of followers: in 2021, Justin Bieber had 114 million followers, whereas the average number of followers for a Twitter account was just 707 (Aslam, 2022b). Not surprisingly, “expert” Twitter users tweet different types of content compared to the average user. Ghosh et al. (2013) examined tweets generated by 587,759 experts—Twitter posters followed by at least 10 users. This sample of expert tweets had more useful and trustworthy information, as determined by a sample of volunteer online survey respondents, and covered more diverse topics.

Just as they should examine the content and structure of archived legacy media and databases carefully, researchers studying social media content need to consider the ways that the sampling process might bias their data. For example, Thorson et al. (2013) looked at Twitter and YouTube coverage of the November 2011 Occupy movement. Using keywords, they captured 43,378 YouTube videos and 417,413 tweets from which they extracted 22,768 videos. Using both Twitter and YouTube yielded a more diverse collection of video than using YouTube alone. Pfeffer, Mayer, and Morstatter (2018, p. 15) demonstrated that Twitter accounts can be overrepresented in its 1% streaming API (discussed above): “Twitter cannot provide scientifically sound random samples via its Sample API,” and it is possible to influence the extent and content of topics in the Sample API.

Sampling Suggestions for Digital Media

Digital media have opened up many new research questions while also providing unprecedented access to content from both organizations and individuals. However, sampling has become more difficult for a variety of reasons. Here are some questions researchers should address when sampling digital content for a study:

• Is the study of content and messages created by organizations or individuals?
• Which organizations and/or individuals should be studied?
• Is the goal to describe a population or examine relationships among variables?
• What time period should be studied?
• After identifying the population, is a sampling frame available for the content? Can a list be made of all the sampling units?
• If no, can multistage sampling generate a sampling frame?
• If no, can a database provide an adequate sampling frame for study?
• If no, can an API or scraping software be used to generate an adequate sampling frame? Does an online search identify potential APIs or scraping tools?
• Once a sampling frame has been identified, is the frame too long to conduct a census?

• If the frame is too long, can a simple random sample be generated?
• Have studies identified stratified sampling that would create a more efficient representative sample?
• If a representative sample is impossible, is a convenience or purposive sample available that would achieve the study goals?

Sampling Individual Communication

Mass communication messages usually have the sampling advantage of being regular in their production cycle. Because such communication usually comes from institutions and organizations, records of its creation are often available. More problematic is the study of individual communication, such as letters, email messages, and Facebook posts. For example, to analyze letters written by soldiers during the American Civil War, identifying a sampling frame of available letters is a burdensome but essential task. Research about individuals’ communication will be only as valid as the sampling frame is complete.

Of course, researching the communications of particular individuals, such as politicians, writers, and artists, often involves a census of all available material. Trying to research individual communications of “non-notable” people should involve probability sampling, even if the entire universe cannot be identified. However, accessing such communications is problematic and convenience samples often result. For example, in an early examination of digital interpersonal communication, Dick (1993) studied the impact of user activity on sustaining online discussion forums. Unable to sample such forums randomly, Dick used three active forums from the GEnie system, tallying 21,584 messages on 920 topics between April 1987 and September 1990. Because Dick was interested in relationships rather than behavior, this set of messages, which was strictly a census, was adequate for exploration.

The scientific method is the solution for the inability to sample randomly. If there are strong relations of interest in larger populations, they will usually be found consistently even in non-probability samples. However, accumulated support from convenience samples works best if such samples span geographic locations and time periods.

Summary

Content analysts must choose from a variety of techniques for selecting content. Which is most appropriate depends on the theoretical issues and practical problems inherent in the research project. If the number of recording units involved is small, a census of all content should be conducted. If the number of units is large, a probability sample is likely more appropriate because it allows inference to the population, within a certain confidence interval.

As with techniques for selecting content, the researcher must choose from several different probability sampling methods, including simple random, systematic, stratified, cluster, and multistage sampling. Once again, the most appropriate probability sample depends on the nature of the research project. Probability samples are necessary for the use of statistical inference. Efficient sampling of mass media to infer to content for a given time period may involve stratified sampling because mass media content often varies systematically with time periods and routines. Content analysts need to be aware that sampling involves probability samples based on time, content, or both.

Digital media create their own sampling problems. Sampling is easier for publicly available digital content. Digital communication is often available in databases that are organized for particular purposes and thus are probably not representative. Researchers should therefore examine both the content and the structure of databases. APIs and scraping tools can sometimes facilitate access to both private and public content. However, regardless of the technique that content analysts decide to employ, they should always beware of potential bias when using any sampling tool or procedure, not least because many readers, and most reviewers, will swiftly recognize such bias.

7

Reliability

One of the questions all content analysts face is: “How can the quality of the data be maximized?” To a considerable extent, data quality reflects the reliability of the measurement process. Reliable measurement in content analysis is crucial to the validity of the research conclusions. If one cannot trust the measures, one cannot trust analyses that use those measures.

The core notion of reliability is simple: measurement instruments (protocols) applied to observations must be consistent over time, place, coder, and circumstance. As with all measurement, one must be certain one’s measuring instrument does not develop distortions. If, for example, one had to measure day-to-day changes in someone’s height, would a metal yardstick or a rubber one work better? Clearly, the rubber yardstick’s own length would more likely vary with the temperature and humidity on the day the measure was taken and with the measurer’s pull on the yardstick. Indeed, a biased measurer might stretch the rubber yardstick. Similarly, if one wanted to measure the presence of people with disabilities in television programs, an untrained coder’s results would likely be different from those of a trained coder following explicit coding instructions.

In this chapter, we deal with specific issues in content analysis reliability that involve definition of concepts and their operationalization in a study protocol; training of coders in applying those concepts; and mathematical measures of reliability that permit assessment of how effectively the protocol and the coders have achieved reliability.

Reliability: Basic Notions

In content analysis, reliability is defined as consistency among coders in applying a protocol to categorize content. But what is being measured is the ability of the coding protocol to produce consistent measurement rather than the ability of the coders to agree with one another. Content analysis as a research method is based on the assumption that explicitly defined variables in a protocol with adequate instructions will control assignment of numbers to content units by coders.


If variable and category definitions do not control assignment of content, then human biases may be doing so in unknown ways. If this is so, findings are likely to be of questionable validity and unreplicable by others. Yet, as noted in Chapter 2, replicability is a defining trait of science and is thus crucial to content analysis as a scientific method. The problem of assessing reliability ultimately comes down to testing coder consistency to verify the assumption that content coding is determined solely by the variable definitions and category operationalizations in the protocol.

Achieving reliability in content analysis begins with defining variables and categories (subdivisions of the variable) relevant to the study goals. Coders are trained to apply those definitions to the content of interest. The process ends with assessment of reliability using tests that indicate numerically how well the concept definitions have controlled the assignment of content to appropriate analytic categories. These steps obviously interrelate, and if a single one fails, overall reliability must suffer. Without clarity and simplicity of concept definitions, coders will fail to apply them properly when looking at content. Without coder diligence and discernment in applying the concepts, the reliability assessment will be unacceptable. Without the assessment, an alternate interpretation of any study’s findings could be “coder bias.” Failure to achieve reliability in a content study means replication attempts by the same or other researchers will be of dubious value.

Variable Definitions and Category Construction

Reliability in content analysis starts with the variable and category definitions and the rules for applying them in a study. These definitions and the rules that operationalize them are specified in a content analysis protocol—a guidebook that answers the question posed in Chapter 3: “How will coders know the data when they see it?”

For example, one of the authors developed a protocol to study nonprofit professional online news sites in order to compare them to news sites created by daily newspapers. The first step was to code which sites fit the concept of “nonprofit professional online news sites.” The variable was labeled “type of news site” and the categories were “1” or “2,” with 1 representing the nonprofit sites and 2 representing legacy newspaper sites. The rules for giving the site a 1 (nonprofit) were: (1) it has 501(c) status; (2) it pays a salary to at least some staff; (3) its geographic market includes city, metro, or regional areas; (4) it publishes general news and opinion information rather than niche information; and (5) it posts such information multiple times during the week. The protocol then explained how coders could find information to address each of these characteristics.
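To preview the numerical assessment discussed later in this chapter, here is a minimal sketch of a simple percent-agreement check between two coders applying such a protocol; the codes are hypothetical, and simple agreement does not correct for chance the way the coefficients discussed later do:

```python
def percent_agreement(coder_a, coder_b):
    """Share of units on which two coders assigned the same category."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must judge the same units")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codes for ten sites (1 = nonprofit news site, 2 = legacy newspaper site)
coder_a = [1, 2, 2, 1, 1, 2, 1, 2, 2, 1]
coder_b = [1, 2, 2, 1, 2, 2, 1, 2, 2, 1]
print(percent_agreement(coder_a, coder_b))  # 0.9
```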

Conceptual and Operational Definitions

Conceptual and operational definitions specify how the concepts of interest can be recognized in the content of interest. Think of it this way: “A concept is an abstraction that describes some portion of reality. It is a general name for specific instances of the phenomenon described” (Shoemaker, Tankard, & Lasorsa, 2004, p. 4). For example, the concept of a person’s social media use is defined as the collection of digital interactive platforms and programs that person uses to interact with other people. Each variable in a content analysis (e.g., postings on Facebook, Instagram, and TikTok) is the operationalized definition of that more abstract concept. Each category of each content variable is an operational definition as well, but one that is subsumed by the broader operational definition of the variable.

A simple example clarifies this process. The concept of prominence was defined and measured in a study of political visibility of state legislators (Fico, 1985). As an abstract concept, prominence means something that is first or most important and clearly distinct from all else in these qualities. In a news story about the legislative process, prominence can be measured (operationalized) in numerous ways. For example, a political actor’s prominence can be measured in terms of “how high up” the actor’s name appears in a story. The actor’s prominence can be assessed according to how much story space or time is given to assertions attributed to or about the actor. Prominence can even be assessed by whether the political actor’s photo appears within the article, or his or her name is in the headline.

Certainly, it can be argued that the concept of prominence is best tapped by several measures, such as those noted—story position, space, accompanying photograph—but combined into an overall index. In fact, many concepts are operationalized in just this way. Of course, using several measures of a concept to create an index requires making sure that the various components indicate the presence of the same concept. For example, story space or number of paragraphs devoted to a politician may not be a good measure of prominence if the politician is not mentioned until the last few paragraphs of a story.
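As an illustration only (the components and equal weighting are hypothetical, not those used by Fico, 1985), a prominence index combining several such measures might look like this:

```python
def prominence_index(first_mention_paragraph, total_paragraphs,
                     paragraphs_about_actor, photo_present, name_in_headline):
    """Combine several indicators of an actor's prominence into one score."""
    position_score = 1 - (first_mention_paragraph - 1) / total_paragraphs  # earlier mention = higher
    space_score = paragraphs_about_actor / total_paragraphs                # share of story about the actor
    return position_score + space_score + int(photo_present) + int(name_in_headline)

# An actor first mentioned in paragraph 2 of 20, discussed in 5 paragraphs,
# pictured in the story, and named in the headline
print(prominence_index(2, 20, 5, True, True))  # 3.2
```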

Concept Complexity and Number of Variables

As we explain below, the more conceptually complex the variables and categories, the harder it will be to achieve acceptable reliability. For complex content, more time and effort must be available for coder training and actual coding, or the analysis may have to be less extensive. That is, if variables are simple and easy to apply, reliability is more easily achieved. A large number of complex variables increases the likelihood of coders making mistakes, diminishing the reliability of the data.

Reliability is also easier to achieve when a concept is more, rather than less, manifest. Recall from Chapter 2 that something manifest is observable "on its face," and therefore easier to recognize and categorize than content with mostly latent meaning. The easier it is to recognize when the concept exists in the content, the easier it is for coders to agree, and thus the greater the chance of achieving reliability. For example, recognizing the race of a character in a streaming television series is easier than categorizing the presence of subtle institutional racism in the series. Or, if racial diversity in the series is operationalized simply as the number of times a person of color appears, coders will probably easily agree on the count of non-White characters. However, categorizing whether the characters face discrimination may require more complex judgments, which will likely affect coder reliability.

Although reliability is easiest to achieve when content is more manifest (e.g., counting names), strictly manifest content is not always the most interesting or the most important content. Therefore, content studies might address content that also has some degree of latent meaning. Content is rarely limited to only manifest or only latent meaning. Two problems can ensue, one of which affects the study's reliability. First, as the proportion of latent meaning increases, agreement among coders becomes more difficult to achieve. The second problem may affect interpretation of the study's results. Specifically, even though trained coders may achieve agreement on content high in latent meaning, it may be unclear whether untrained observers of the content (e.g., book readers, TV program viewers, etc.) experience meanings as defined in the protocol and applied by the researchers. For example, few television viewers repeatedly rewind and review programs to describe relationships among actors. Here, the issue of reliability relates in some sense to validity—the degree to which the study and its operationalizations "matter" in the real world (see Chapter 8).

These concerns do not mean researchers should avoid studies involving latent meaning; that decision depends on the goals of the research. For example, Simon, Fico, and Lacy (1989) studied defamation in stories of local conflict. The definition of defamation came from court decisions: words that tend to harm reputations of identifiable individuals. Simon et al.'s study further operationalized per se defamation (which harms reputation on its face) and per quod defamation (which requires interpretation that harm has occurred). Obviously, what harms reputation depends on what the reader or viewer of the material brings to it (latent content). To call a leader "tough" may be an admirable characterization to some and a disparaging one to others. Furthermore, it is doubtful that many readers of these stories had the concept of defamation in mind as they read them (although they may have noted that sources were insulting one another). However, the goal of Simon et al.'s study was to determine when stories might risk angering one crucial population of readers: people defamed in the news who might bring a lawsuit.


The concepts of manifest and latent meaning exist on a continuum. Some symbols are more manifest than others because a higher percentage of receivers share a common meaning for those symbols. Few people would disagree on the common, manifest meaning of the word car, whereas the word cool has multiple uses (as an adjective, adverb, verb, and noun) in a standard dictionary. Likewise, latent meanings of symbols vary according to how many members of the group using the language share the latent meaning. Latent or connotative meaning can also evolve over time. In the 1950s, the majority of the US population considered a Cadillac the ultimate automotive symbol of wealth. Today, the Cadillac may still symbolize wealth, but it may not be the ultimate symbol in everyone's mind. Other cars—such as Mercedes and BMWs—have come to symbolize wealth just as much as or even more than Cadillacs.

The point is that variables requiring difficult coder decisions, whether because of concept complexity or limited shared meaning, will affect reliability and time needed for coding. The more complex categories there are in a protocol, the more time will be needed for training. Before each coding session, instructions should require that coders first review the protocol rules governing the categories. Coding sessions may be restricted to a set duration or amount of content to reduce the likelihood of coder fatigue systematically degrading coding toward the end of a session.

Content Analysis Protocol

However simple or complex the variables, definitions and coding procedures must be articulated unambiguously. This is done in the content analysis protocol. The protocol's importance cannot be overstated. It defines the study in general and the coding rules applied to content in particular.

Purpose of the Protocol

First, the protocol sets down the rules governing the study—rules that bind the researchers in the way they define and measure the content of interest. Once the protocol has been judged to be reliable and coding has begun, these rules must be invariant across the duration of the study: that is, content coded on day 1 should be coded in precisely the same way on day 100. Second, the protocol is the archival record of the study’s operations and definitions, or how the study was conducted. Thus, the protocol makes it possible for other researchers to interpret the results and replicate the study. Such replication strengthens the ability of science to build a body of findings and theory. The content analysis protocol can be thought of as a cookbook. Just as a cookbook specifies ingredients, amounts of ingredients needed, and procedures for combining and mixing them, the protocol specifies the study’s conceptual and operational definitions and the ways they are to be applied. To continue the analogy, if a cookbook is clear, one need not be a cordon bleu chef to make a good

stew. The same is true for a content analysis protocol. If concepts and procedures are sufficiently clear and procedures for applying them straightforward, anyone with the necessary training and practice should be able to apply the protocol consistently. If the concepts and procedures are more complex, then more exhaustive training will allow coders to apply the protocol precisely and assign the content consistently.

Protocol Development

Of course, making variables sufficiently clear and procedures straightforward and explicit may not be such a simple process. Because variables that remain defined in a researcher's head are not likely to be very useful, the researcher starts by writing down the definitions. Although that sounds simple, the act of putting even a simple variable into words is as likely as anything else to illuminate sloppy or incomplete thinking. Defining variables forces more discerning thinking about what the researcher really means by a concept underlying the variable. The dynamic of articulation and response, both within oneself and with other researchers and coders, drives the process that clarifies variables. This interactive, iterative process forces the researcher to formulate variables in words and sentences that are less ambiguous and less subject to alternative interpretations that miss the concept the researcher had in mind.

Protocol Organization

Because it is the documentary record of the study, care should be taken to organize and present the protocol in a coherent manner. The document should be sufficiently comprehensive for other researchers to replicate the study without additional information from the original study team. Furthermore, the protocol must be available to anyone who wishes to use it to help interpret, replicate, extend, or critique research governed by the protocol. A three-part approach works well for protocol organization. The first part is an introduction specifying study goals and generally introducing major concepts. For example, in a study of local government coverage (Fico et al., 2013b; Lacy et al., 2012), the protocol introduction specified the content and news media to be examined (news and opinion stories in eight types of news outlets). The second part specifies procedures governing how the content will be processed. For example, the aforementioned protocol explained to coders which stories were to be excluded and included. The third part of the protocol specifies each variable used in the content analysis, and therefore carries the weight of the protocol. The overall operational definition is given for each variable, along with operational definitions of each category and numerical values assigned to various categories. These are the actual instructions used by coders to assign content to specific values of particular variables


and categories. Obviously, instructions for variables range from relatively simple (e.g., types of social media) to complex (e.g., degree of interactivity).

How much detail should the category definitions contain? Only as much as is necessary. As just noted, defining the concepts and articulating them in the protocol is an iterative and interactive process. The protocol itself undergoes change before it is judged reliable, as coders attempt to use the definitions in their practice sessions, assessing their interim agreement at various stages in the training process. Category definitions become more coder-friendly as examples and exceptions are integrated. However, extremes in category definition—too much or too little detail—should be avoided. Definitions that lack detail permit coders too much leeway in interpreting when categories should be used. Meanwhile, excessively detailed definitions may promote coder confusion or result in coders forgetting particular rules while coding.

Table 7.1 shows coding instructions that formed part of a protocol used with a national sample of more than 47,000 news stories from roughly 800 news media outlets. The protocol consisted of two sections: the first (shown in Table 7.1) was applied to all stories; the second, which was more complex, was applied only to local government stories.

Table 7.1 Coding protocol for local government coverage

Introduction

This protocol addresses the news and opinion coverage of local governments by daily newspapers, weekly newspapers, broadcast television stations, local cable networks, radio news stations, music radio stations, citizen blogs, and citizen news sites. It is divided into two sections. The first addresses general characteristics of all local stories, and the second concerns the topic, nature, and sources of coverage of local (city, county, and regional) governments. The content will be used to evaluate the extent and nature of coverage and will be connected with environmental variables (size of market, competition, ownership, etc.) to evaluate variation across these environmental variables.

Procedure and Story Eligibility for Study

Our study deals with local public affairs reporting at the city/suburb, county, and regional government levels. These areas include the local governmental institutions closest to ordinary people, and therefore more accessible to them. A city government (also sometimes called a "township") is the smallest geopolitical unit in America. Many cities (townships) are included in counties, and many counties may be connected to a regional governmental unit.

A story may NOT be eligible for coding for the following reasons:

1 The story deals with routine sports material.
2 The story deals with routine weather material.
3 The story deals with entertainment (e.g., plays).
4 The story deals with celebrities (their lives).
5 The story deals with state government only.
6 The story deals with national government only.

Read the story before coding. If you believe a story is NOT eligible for the study because it deals with excluded material noted above, go on to the next story. Consult with a supervisor on the shift if the story is ambiguous in its study eligibility.

Variable Operational Definitions

V1: Item Number (assigned)
V2: Item Date: month/day/year (two digits: e.g., Aug. 8, 2008 is 080808)
V3: ID Number of the City (assigned)—see list. Assign 999 if DMA sample
V4: Item Geographic Focus
Stories used in this analysis were collected based on their identification as "local" by the news organization. Stories that address state, national, or international matters would not be included unless some "local angle" was present. The geographic focus of the content is considered to be the locality that occurs first in the item. Such localities are indicated by formal names (e.g., a Dallas city council meeting) used first in the story. In some cases, a formal name will be given for a subunit of a city (e.g., the "Ivanhoe Neighborhood" of East Lansing), and in these cases the city is the focus. Often the locality of a story is given by the dateline (e.g., Buffalo, NY), but in many cases the story must be read to determine the locality because it may be different than that in a dateline. If no locality at all is given in the story, code according to the story's dateline.
1 = listed central city: see list
2 = listed suburb city: see list
3 = other local geographic area
V5: ID Number of DMA (assigned number)—see list. Assign 99 if city council sample
V6: ID Number of outlet (assigned)—see list
V7: Type of Medium (check ID number list)
1 = daily newspaper
2 = weekly newspaper
3 = broadcast television
4 = cable television
5 = news talk radio
6 = non-news talk radio
7 = citizen journalism news site
8 = citizen journalism blog site
V8: Organizational Origin of Content Item
1 = Staff Member: (Code story as 1 if there is any collaboration between news organization staff and some other story information source.)

a Includes items from any medium that attribute content to a reporter's or content provider's NAME (unless the item is attributed to a source such as those under the code 2 below). A first name or a username suffices for citizen journalism sites.
b Includes items by any medium that attribute content to the news organization name (e.g., by KTO; by the Blade; by The Jones Blog). Such attribution can also be in the story itself (e.g., KTO has learned; The Blade's Joe Jones reports).
c Includes items that attribute content to the organization's "staff" or by any position within that organization (e.g., "editor," etc.).


d FOR TV AND RADIO ONLY, assume an item is staff-produced if:

1) A station copyright is on the story page (the copyright name may NOT match the station name). However, if an AP/wire identification is the only one in the byline position or at the bottom of the story, code V8 as 2 even if there is a station copyright at the bottom of the page.
2) A video box is on a TV item or an audio file is on a radio item.
e FOR RADIO ONLY, assume an item is staff-produced ALSO if the item includes a station logo inside the story box.
f FOR NEWSPAPER ONLY, assume an item is staff-produced ALSO if the item includes:

1) An email address with newspaper URL.
2) A "police blotter" or "in brief" section of multiple stories.
2 = News and Opinion Services:
a This includes news wire services such as Associated Press, Reuters, and Agence France Press, and opinion syndicates such as King's World.
b This includes news services such as the New York Times News Service, McClatchy News Service, Gannett News Service, and Westwood One Wire.
c This includes stories taken WHOLE from other news organizations as indicated by a different news organization name in the story's byline.
3 = Creator's Online Site (for material identified as 7 or 8 in V7):
a Used ONLY for online citizen journalism sites whose content is produced by one person, as indicated by the item or by other site information.
b If the site uses material from others (e.g., "staff," "contributors," etc.), use other V8 codes for those items.
4 = Local Submissions: Use this code for WHOLE items that include a name or other identification that establishes it as TOTALLY verbatim material from people such as government or nongovernment local sources. The name can refer to a person or to an organization. Such material may include:
a Verbatim news releases.
b Official reports of government or nongovernment organizations.
c Letters or statements of particular people.
d Op-ed pieces or letters to the editor.
e Etc.
5 = Can't Tell: The item includes no information that would result in the assignment of codes 1, 2, 3, or 4 above.

Coding Sheet

Each variable in the content analysis protocol must correspond unambiguously to the actual coding sheet used to record numbers assigned to content attributes of each unit of study content. A coding sheet should be coder-friendly

and can be printed on paper or presented on a computer screen. Each form has advantages and disadvantages.

Paper sheets allow flexibility when coding. With paper, a computer need not be available while coding content, and the periodic interruption of coding for keyboarding is avoided. Not having interruptions is especially important when categories are complex, and the uninterrupted application of the coding instructions can improve reliability. Paper sheets are useful particularly if a coder is examining content that is physically large, such as a print newspaper. Using paper adds more time to the coding process, however. Paper coding sheets require coders to write the values, which someone must then key into a computer. If large amounts of data are being analyzed, the time increase can be considerable. This double recording on paper and into computer also increases the risk of transcribing error. On the other hand, having paper sheets provides backup for the data should a hard drive crash. If only a computer is used, back up the data after every session on an external drive or the Cloud. There are also web-based interface possibilities, which have the benefit of automatically backing up the data and allowing researchers to track the completion of coding tasks, how long it takes to complete a coding task, or even time of day when coding is being completed. Is a particular coder more error prone when rushing through coding faster than others and/or coding in the small hours of the morning?

Organization of the coding sheet will, of course, depend on the specific study. However, variables on the coding sheet should be organized to follow the order of variables in the protocol, which in turn follows the flow of the content of interest. Coders should not be darting back and forth repeatedly within the content to determine the variable categories. If an analysis involves recording who posted a comment on Facebook, for example, that variable should be coded relatively high up on the coding sheet because coders will encounter a poster's name early in the coding process. Planning the sheet design along with the protocol requires the researcher to visualize the process of data collection and determine how problems can be avoided.

Coding sheets usually fall into two types: single case and multiple cases. The former have one or more pages for each case or recording unit. For example, the analysis of suicide notes for themes might use a sheet for each note, with several content categories on the sheet. Table 7.2 shows the single-case coding sheet associated with the coding instructions given in Table 7.1. Each variable (V) and a response number or space are identified with a letter and number (V1, V2, etc.) that corresponds with the definition in the coding protocol. Matching variable locations on the protocol and on the coding sheet reduces time and confusion while coding.

Multi-case coding sheets allow an analyst to put more than one case on a page. This type of coding sheet often appears as a grid, with the cases placed along the rows and the variables listed in the columns. This is the form used when setting up a computer database in Excel or SPSS. Figure 7.1 shows an abbreviated


Table 7.2 Coding sheet

Content Analysis Protocol AAA for Assessing Local Government News Coverage

V1: Item Number ________________
V2: Item Date ________________
V3: ID Number of the City ________________
V4: Item Geographic Focus
1 = listed central city: see list
2 = listed suburb city: see list
3 = other local geographic area ________________
V5: ID Number of DMA ________________
V6: ID Number of Outlet ________________
V7: Type of Medium (Check ID Number List)
1 = daily newspaper          2 = weekly newspaper
3 = broadcast television     4 = cable television
5 = news talk radio          6 = non-news talk radio
7 = citizen news site        8 = citizen blog site ________________
V8: Organizational Origin of Content Item
1 = staff member             2 = news and opinion services
3 = creator's online site    4 = local submissions
5 = can't tell ________________

multi-case coding sheet for the same local government coverage study. Each row contains data for one issue of an outlet (daily newspaper, weekly newspaper, broadcast TV station, etc.); this example contains data for seven cases. Each column holds the number given in the protocol for the variable listed. Each item is represented by a row, and coders will record the origin of that item in the eighth column. For instance, the item in the first row was published on August 12, 2021 and was created by a staff writer at a daily newspaper.

Figure 7.1 Coding sheet for local government coverage
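To make the grid concrete, the sketch below writes a small multi-case sheet to a CSV file that Excel or SPSS can open. Apart from the first row's date (081221), medium (1 = daily newspaper), and origin (1 = staff member), which mirror the example just described, the values are invented for illustration.

import csv

# Columns follow the protocol variables V1-V8; each row is one coded item.
FIELDS = ["V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8"]
rows = [
    [1, "081221", 12, 1, 99, 104, 1, 1],  # Aug. 12, 2021: daily newspaper, staff item
    [2, "081221", 12, 3, 99, 104, 1, 2],  # same outlet, wire-service item (invented)
    [3, "081321", 47, 2, 99, 233, 3, 1],  # broadcast TV staff item (invented)
]

with open("coding_sheet.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(FIELDS)
    writer.writerows(rows)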

As discussed in Chapter 4, computer programs can make the use of digital coding sheets easier. Lewis, Zamith, and Hermida (2013) used a split computer screen when recording values for variables. Content was viewed on the left portion of the screen, while the right portion was the coding sheet. This project used a dropdown menu for assigning values, which can help eliminate miskeyed values. Content analysts should look for or create programs to help with coding and storage as long as the changes reduce error and make the process more efficient.

Coder Training

The process of variable and category definition, protocol construction, and coder training is iterative. Central to this process—its length and when it stops—are the coders themselves. The coders, of course, change as they engage in successive encounters with both the content of interest and the way that content is captured by the concepts defined for the study. A content analysis protocol will go through many drafts during pretesting as variables are refined, measures specified, and coding procedures worked through.

As part of the process of developing an effective coding protocol, researchers should examine other protocols, particularly those that have generated publishable data. Published authors should share content analysis protocols upon request, and some journals publish protocols along with articles. Studying others' approaches should help researchers gain insight into effective protocol design. To that end, Appendix A of this volume presents the coding protocol used by Adams (2020) in examining editorial endorsements by top-100 (by circulation) US daily newspapers during the 2016 Clinton–Trump presidential campaign. This protocol allowed coders to record explicit mentions of both candidates' competence and integrity, and also to record the positive or negative tone of mentions of seven character traits that may have been used to endorse or disendorse one candidate or the other.

Coding Process

This process is both facilitated and made more complex by increasing the number of coders. Like everyone, researchers carry mental baggage that influences their perception and interpretation of communication content. A single coder may not notice the dimensions of a concept they are missing, or understand how a protocol that is perfectly clear to them may be opaque to someone else. Several coders are more likely to hammer out operational definitions that are clearer and more explicit. On the other hand, a large number of coders might find it difficult to reach agreement on the classification of content units, or their operationalization may


create problems that would not occur with fewer coders. At some point, a concept and its measure may not be worth further expenditure of time or effort, and recognizing that a variable should be dropped before coding may not be easy, either.

Although the protocol may be well organized and clearly and coherently written, a content analysis still involves systematic training of coders to use it. An analogy to a survey is useful. Telephone survey interviewers must be trained in the rhythms of the questionnaire and gain comfort and confidence in reading questions and recording responses. Coders in a content analysis must grow comfortable and familiar with protocol definitions and how they relate to the content of interest. Coding rules must be explicit in the coding protocol, though. The purpose of coder training is to learn how to apply those explicit rules, not to create additional rules to guide coding. If additional rules are necessary, they should be made explicit in the coding protocol before coding begins.

The first step in training coders is to familiarize them with the content that will be analyzed. The aim is not to precode material, so content in the study sample should not be used at this time. That said, the coders should be familiarized with content that is as similar as possible to what they will be coding. This could be content from the same sources but a different time frame, or content from similar sources. The familiarization process is meant to increase coders' comfort level with the content of interest, give them an idea of what to expect, and determine how much energy and attention will be needed to comprehend it.

To help minimize differences among coders, the study should establish general procedures that all coders should follow when dealing with content: for example, how many pieces of content a coder may deal with in each coding session or the maximum length of time of each session. The procedure may also require that each session begins with a full reading of the protocol to refresh coders' memories of category definitions. Coders should also familiarize themselves with the content analysis protocol, discussing it with the supervisor and other coders during training and dealing with problems in applying it. This will clarify whether coders are approaching the content from similar or different frames of reference. Obviously, any differences will need to be addressed because these will almost certainly result in coder disagreements and poor study reliability.

Sources of Coder Disagreements

Differences among coders have various origins. Some are relatively easy to address, such as simple confusion over definitions. Others may be impossible to solve, such as a coder who simply does not follow the procedure specified in the protocol. Coding is a patient, detail-oriented task. Many people do not possess the attributes that are needed to undertake such work.

Protocol Problems

Differences due to inadequate category definitions must be addressed. Does disagreement exist because a category is ambiguous or poorly articulated in the protocol, or is the problem a coder who simply does not understand the concept or the rules for operationalizing it? If several coders disagree on which category to assign a content unit, the problem probably lies in the category or variable. Specifically, disagreement may reflect fundamental ambiguity or complexity in the variable, or the rules assigning content to variable categories may be poorly spelled out in the protocol.

The simplest approach to such a variable or category problem is to revise its definition to eliminate all ambiguity and confusion. If this revision fails to resolve the disagreement, attention must be turned to the variable categories and their definitions. Perhaps an overly complex variable or its categories can be broken down into several parts that are easier to handle. For example, a study on defamation (Fico & Cote, 1999) initially required coders to identify defamation in general and then code items as containing defamation per se and defamation per quod. Courts interpret defamation per quod to mean that the defamation exists in the context of overall meanings people might bring to the reading. This definition resulted in poor coder reliability, although better reliability was achieved on recognition of defamation in general and defamation per se. Therefore, the solution was obvious: given defamation in general, defamation per quod was defined to exist simply when defamation per se was ruled out. In other words, if all defamation was either per se or per quod, getting a reliable measure of the former automatically generated a reliable measure of the latter. Although such a process results in a lack of independence between categories, this is not a problem if data from only one category are used in analysis.

Sometimes, however, coders find it impossible to use a particular variable reliably. In such cases, the only solution may be to drop the variable from the study. For example, Fico and Soffin (1995) tasked coders with distinguishing between "attack" and "defense" assertions in news articles, but the content intermixed these assertions to such a degree that achieving acceptable reliability proved impossible. The variable was dropped from the research.

Coder Problems

If only one coder consistently disagrees with the others, it may be that something is preventing that coder from applying the definitions correctly. Between-coder reliability measures make it easy to identify problem coders by comparing the agreement of all possible pairs of coders. Attention must then be given to retraining the problem coder or even removing them from the study. There may be several reasons why a coder persistently disagrees with others on the application of category definitions. The easiest coder problems to solve


involve the application of procedures. In the authors' experience, problems have occurred because: coders were inadequately trained; coders spent insufficient time determining the appropriate value; the protocol was not reviewed by coders as specified in the general instructions; coders became fatigued during a coding session; or coders "learned" the most prevalent values of variables over time and developed "rules of thumb" that led them to code too quickly.

Some studies may involve specialized knowledge that coders need to learn. For example, some coders involved in the aforementioned local government project knew little about the structure, officials, and terms associated with local government, so the principal investigators created an explanatory booklet for the coders (Fico et al., 2013a; Lacy et al., 2012). Content analysts should assess any need to familiarize coders with essential terminology during early training with the protocol and similar content, then integrate such terminology into subsequent training.

Symbolic Complexity

The nature of the language and symbols coded for some variables can impact levels of reliability and ease of achieving consistent coding (Lacy et al., 2015). As discussed earlier, for example, the generation of reliable data becomes more difficult as the proportion of a message using latent meaning increases. Visual symbols, such as those found in photographs and videos, tend to be more ambiguous than text. Difficulties generated by symbolic complexity are usually dictated by the research questions and hypotheses. The area of interest determines the content to be coded, not the other way round. To improve reliability with symbolic complexity, make sure the protocol is well developed and the coders receive adequate training with it.

Language Issues

Coders must have a sound grasp of the language being coded. Problems may arise due to the semantic or syntactic limitations of non-native speakers, or simply because coders do not possess a sufficiently broad vocabulary. (It should be remembered that even native speakers may struggle when dealing with specialized terminology.) Other problems with cross-national coders coding English may arise because of cultural differences or ignorance of certain frames of reference. These issues will be encountered most frequently when coders deal with variables involving latent meaning. One of the authors of this volume worked as a student on a content study in a class of students from the United States, Bolivia, Nigeria, France, and South Africa. The study involved applying concepts such as terrorism to stories about international relations. As might be imagined, what constitutes terrorism from one perspective may be viewed as national liberation from

another. In another internationally diverse coding team, one of the coders was unable to distinguish between international humanitarian efforts and efforts to achieve regional hegemony. Despite extensive protocol training, the problem coder remained consistently unreliable in pairwise comparisons with the seven other coders, indicating a deep-seated inability to transcend personal connotations derived from latent meanings. Thus, they had to be dismissed.

It is possible to overcome frame-of-reference problems, but this will inevitably involve committing more time to training and possibly coding. Such issues might also indicate that the study itself requires more careful definition of terms, given the context of cultural or social differences. However, it is worth reiterating that good coding requires the ability to attend to the protocol and training, and not everyone possesses this.

Peter and Lauf (2002) examined factors affecting inter-coder reliability in a study of cross-national content analysis—defined as comparing content in the different languages of more than one country. They concluded that some coder characteristics can affect inter-coder reliability in bilingual content analysis. Ultimately, their recommendations centered on failure to check reliability among the people who train the coders. They suggested that cross-country content analysis can be reliable as long as three conditions are met: "First, the coder trainers agree in their coding with one another; second, the coders within a country group agree with one another; and, third, the coders agree with the coding of their trainers" (p. 827).

Experience and research suggest that coders should code in their native language. If the coding team includes non-native speakers, the study leaders should prepare for additional training and beware of possible problems in creating a common frame of reference. If groups of coders are coding in different languages, Peter and Lauf's (2002) recommendations regarding trainer reliability should be followed. For examples of this process, see Peter and De Vreese (2004) and Joshi, Peter, and Valkenburg (2011).

Reliability Assessment

Reliability Tests

Ultimately, the process of concept definition and protocol construction must cease and the researcher must assess the degree to which the protocol can be reliably applied. There are three dimensions to reliability (Krippendorff, 2004a, pp. 214–216): stability, reproducibility, and accuracy. Stability refers to a coder consistently applying the protocol to the same set of content at two points in time. Intra-coder—or “within-coder”—assessment tests whether slippage has occurred in the coder’s understanding or application of the protocol definitions. Checking stability is necessary when coding continues for a long time, although there is no accepted definition of how long that time might


be. That said, if a project involves more than a month of coding, stability testing would undoubtedly reinforce the argument for data validity. Reproducibility involves two or more coders applying the protocol to the same content. It should measure the ability of the coding protocol—the measurement tool—to produce similar results; it is not a measure of coders’ ability to agree with one another. Each variable in the protocol is tested for reproducibility by looking at agreement among coders in applying relevant category values to the content. For example, two coders code 100 tweets dealing with abortion. Coding the variable for pro-choice or pro-life, they compute the percentage of those stories on which they have agreed that the particular story is pro-choice or pro-life according to the coding definitions. Most measures of reliability reported in journal articles refer to this reproducibility (also known as inter-coder reliability). Establishing reproducibility is easier and more accurate when at least three coders are involved. It is difficult to resolve disputes during training with only two coders if a deciding “vote” about diverging interpretations of variable definitions is absent. Three or more coders provide a greater awareness of how the variable definitions could be interpreted and, as a result, an easier and more effective identification of sources of disagreement. Accuracy relates to whether or not coding is consistent with some external standard for the content, much as one resets (or “calibrates”) a household clock to a “standard” provided by one’s mobile phone after a power outage. A problem in content analysis is how to come by a standard with limited measurement error. One possible solution is to compare content analysis data with a standard established by experts, but there is no way to verify the degree of bias in the experts’ standard. Therefore, most content analyses are limited to testing reproducibility and occasionally stability. Although reliability tests are framed as comparing coders’ assignment of content to categories, reliability is in fact a measure of the entire process during which trained coders apply a well-developed protocol to relevant content. Because of the need for replication in social science, the role of the protocol in establishing and testing reliability should be primary. Scholars seeking to replicate a study’s findings should always be able to apply the same protocol and achieve the same results even though they did not create or refine that protocol. Establishing reliability (reproducibility and stability) is a necessary—but not, in itself, sufficient—step in arguing for data validity. Coder training sessions constitute a kind of reliability pretest. However, wishful thinking and post hoc rationalizations of why errors have occurred (e.g., “I made that mistake because I was interrupted”) mean more formal and rigorous procedures must be applied. In fact, formal coder reliability tests should be conducted during the coder training period itself as an indicator of when to proceed with the study. As mentioned earlier, training tests should not be conducted with the actual study content because coders must code independently of both other

coders and themselves. If content is coded several times, prior decisions are sure to contaminate subsequent ones. Furthermore, repeated coding of the same content inflates the ultimate reliability estimate, generating false confidence in the study's overall reliability.

Of course, at some point, the training stops, coding begins, and assessment of achieved reliability for the study must take place. Two issues are addressed during this process. The first concerns the selection of content used in the reliability assessment. The second concerns the statistical reliability tests to be used.

The process of testing reproducibility should include at least one coder who was not involved in the creation or development of the protocol. Recall that reliability assessment is primarily a test of the protocol, not of the coders' ability to reach agreement. Creators of the protocol have already come to some sort of agreement; put differently, their coding could artificially inflate reliability because they may share biases other coders do not have.

Selection of Content for Testing

If the number of content units being studied is small, protocol reliability can be established by having three or more coders code all the units. Otherwise, researchers must randomly select content samples for reliability testing. Unfortunately, the literature contains rather arbitrary and ambiguous advice about how much content to use when establishing protocol reliability. Wimmer and Dominick (2003) suggest between 10% and 25% of the body of content should be tested, whereas Kaid and Wadsworth (1989) argue that between 5% and 7% of the total is adequate. Finally, one popular online resource (Lombard, Snyder-Duch, & Bracken, 2010) recommends that the reliability sample “should not be less than 50 units or 10% of the full sample, and it rarely needs to be greater than 300 units.” However, the foundations for these suggestions are far from clear. The number of units that are needed will be addressed below, but probability sampling should be used when a census is impractical. Random sampling accomplishes two things. First, it controls for inevitable human biases in selection. Second, it produces, with a known probability of error, a sample reflecting the appropriate characteristics in the overall population of content being studied. Without a random sample, inference that the reliability outcome represents all the content being studied cannot be supported. Given a random sample of sufficient size, coder reliability testing should reflect the full range of potential coding decisions that must be made in the entire body of material. The problem with non-random selection of content for reliability testing is the same as the problem with a non-random sample of study content: tested material may be atypical of the entire body of content that will be coded. A non-representative sample yields reliability assessments whose relation to the entire body of content is unknown.


Using probability sampling to select content for reliability testing also enables researchers to take advantage of sampling theory to determine how much material should be tested. Random sampling can specify sampling error at known levels of confidence. For example, if two researchers using randomly sampled content achieve a 90% level of agreement, the actual agreement they would achieve coding all material could vary above and below that figure, according to the computed sampling error. That computed sampling error would vary with the size of the sample—the bigger the sample, the smaller the error and more precise the estimate of agreement. Therefore, if the desired level of agreement is 80%, and the achieved level on a coder reliability test is 90% plus or minus five percentage points, the researchers can proceed with confidence that the desired agreement level has been reached or exceeded. However, if the test produced a percentage of 84%, the plus or minus five percentage points sampling error would include a value of 79%—that is, below the required standard of 80%.

A study that assessed the reporting of reliability processes in Communication Monographs, the Journal of Communication, and Journalism & Mass Communication Quarterly articles from 1985 to 2010 (Lovejoy et al., 2014) found that 24% did not include any information about reliability tests, and only 34% of those that provided reliability testing information described the reliability sampling process. The good news is that the reporting of reliability sampling information improved over the time period. However, anything less than 100% reporting is clearly insufficient.

Selection Procedures

Assuming content for a reliability test will be selected randomly, how many content units must be selected? Lacy and Riffe (1996) noted that the answer to this question depends on several factors: the total number of units to be coded; desired degree of confidence in the eventual reliability assessment; and degree of precision desired in the reliability assessment. Although each of these three factors is under the researcher’s control, a fourth must be assumed on the basis of prior studies, a pretest, or a guess: the researcher’s estimate of the actual agreement that would have been obtained had all the content of interest (i.e., a census) been used in the reliability test. For reasons explained below, the estimate of actual agreement should be set five percentage points higher than the minimum required reliability for the test. This buffer will ensure a more rigorous test (i.e., the achieved agreement will have to be higher for the reliability test to be judged adequate). The first step in applying this procedure is to compute the number of content cases required for the reliability test. When researchers survey a population, they use the formula for the standard error of proportion to estimate a minimal sample size necessary to infer to that population at a given confidence level. A similar procedure is applied here to a population of content. One difference, however, is that a

content analysis population is likely to be far smaller than the population of people involved in a survey. This makes it possible to correct for a finite population size when the sample makes up 20% or more of the population. This has the effect of reducing the standard error and giving a more precise estimate of reliability. The formula for the standard error can be manipulated to solve for the sample size needed to achieve a given level of confidence. This formula is:

n = ((N − 1)(SE)² + PQN) / ((N − 1)(SE)² + PQ)

in which
N = the population size (number of content units in the study)
P = the population level of agreement
Q = (1 − P)
n = the sample size for the reliability check

Solving for n gives the number of content units needed in the reliability check. Note that the standard error (SE) reflects the confidence level desired in the test. This is usually set at the 95% or 99% confidence level (using a one-tailed test because interest is in the portion of the interval that may extend below the acceptable reliability figure). For the rest of the formula, N is the population size of the content of interest, P is the estimate of agreement in the population, and Q is 1 minus that estimate.

For example, a researcher might assume an acceptable minimal level of agreement of 85% and P of 90% in a study using 1,000 content units (e.g., newspaper stories). One further assumes a desired confidence level of .05 (i.e., the 95% confidence level). A one-tailed z-score—the number of standard errors needed to include 95% of all possible sample means on agreement—is 1.64 (a two-tailed test z-score would be 1.96). Because the confidence level is 5% and our desired level of probability is 95%, SE is computed as follows:

.05 = 1.64(SE) or SE = .05 / 1.64 = .03

Using these numbers to determine the test sample size to achieve a minimum 85% reliability agreement, and assuming P equals 90% (5% above our minimum), the results are:

n = ((999)(.03)² + (.9)(.1)(1,000)) / ((999)(.03)² + (.9)(.1)) = 90.9 / .99 ≈ 92

In other words, 92 test units out of the 1,000 (e.g., tweets) are used for the coder reliability test. If a 90% agreement in coding each variable on those 92 test units is achieved, the chances are 95 out of 100 that at least an 85% or better


agreement would exist if the entire content population were coded by all coders and reliability measured. Once the number of test units needed is known, selection of the particular ones for testing can be based on any number of random techniques (see Chapter 6). All coders then code the selected units. This procedure is also applicable to studies in which coding categories are measured using interval or ratio scales. The calculation of standard error is the only difference.

If these formulas seem difficult to use, Tables 7.3 and 7.4 may be useful. These apply to studies that deal with nominal-level percentage of agreement. Table 7.3 is configured for a 95% confidence level, while Table 7.4 is configured for the more rigorous 99% confidence level. Furthermore, within each table, the number of test cases needed has been configured for 85%, 90%, and 95% estimates of population coding agreement. Researchers should set the assumed level of population agreement (P) at a high enough level (we recommend at least 90%) to ensure that the reliability sample includes the range of category values for each variable. Otherwise, the sample will not represent the population of content.

Table 7.3 Content units needed for reliability test based on various population sizes, three assumed levels of population inter-coder agreement, and a 95% level of probability

                      Assumed Level of Agreement in Population
Population Size          85%        90%        95%
10,000                   141        100         54
5,000                    135         99         54
1,000                    125         92         52
500                      111         84         49
250                       91         72         45
100                       59         51         36

Table 7.4 Content units needed for reliability test based on various population sizes, three assumed levels of population inter-coder agreement, and a 99% level of probability

                      Assumed Level of Agreement in Population
Population Size          85%        90%        95%
10,000                   271        193        104
5,000                    263        190        103
1,000                    218        165         95
500                      179        142         87
250                      132        111         75
100                       74         67         52
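For researchers who prefer to script the calculation rather than consult the tables, the sketch below implements the sample-size formula given earlier. The function name is ours, and results can differ from the published table values by a unit or so depending on how SE is rounded.

def reliability_sample_size(N, P, se):
    """Content units needed for the reliability check, rounded to a whole unit."""
    Q = 1 - P
    n = ((N - 1) * se**2 + P * Q * N) / ((N - 1) * se**2 + P * Q)
    return round(n)

# The chapter's example: 1,000 units, assumed agreement P = .90, and a
# 95% one-tailed confidence level, so SE = .05 / 1.64, rounded to .03.
print(reliability_sample_size(1000, 0.90, se=0.03))  # 92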

Using a statistic he developed to assess reliability (see below), Krippendorff (2013) suggests a different approach to selecting units for the reliability test. He provides a table with reliability sample sizes that are a function of the researcher's selection of an acceptable minimum level for Krippendorff's CAlpha and an acceptable p-value, as well as the number of coders and the probability of selecting the least frequent value from among all population values.

Researchers could use all study content to test reliability in order to eliminate sampling error. However, if they do not use such a census, the cases used for reliability testing must be randomly selected from the population of content of interest to have confidence in the reliability results. The number of units should be calculated using either Lacy and Riffe's (1996) or Krippendorff's (2013) procedure. The level of sampling error for reliability samples should always be reported. Regardless of which sampling process is used, the sample should be checked to verify that all categories for all variables have been selected at least once by coders (Krippendorff, 2013).

When to Conduct Reliability Tests

The process of establishing reliability involves two types of tests. The first is a series of pretests that occurs during training. As mentioned above, reproducibility pretesting serves as part of an iterative process of coding, examining reliability, adjusting the protocol, and coding again in order to improve the protocol. The length of this process reflects several factors, but it should continue until reliability has reached an acceptable level, as discussed below. Formal reliability pretests with coders working independently to classify content will determine when this point is reached. Study coding should begin once the pretests have demonstrated that the protocol can be applied reliably. It is during coding of the study content units that final protocol reliability is established. As coding gets under way, the investigators must select the content for the reliability test, as described above. Generally, it is a good idea to wait until about 10% to 15% of the coding has been completed before beginning the reliability test. This gives coders sufficient time to develop a coding routine and become familiar with the protocol. The content used for establishing and reporting reliability should be coded by all the coders, and the reliability content units should be interspersed with the study content so the coders are unaware of which content units are being coded by everyone. This “blind” approach to testing will yield a better representation of reliability than having an identifiable set of reliability content coded separately from the normal coding process. If coders are aware of the test content, they might try harder or become nervous, either of which could influence the results. If the study’s coding phase will exceed a month in length, the investigators should consider testing the stability of the process, as discussed above, by


administering multiple tests using units randomly selected from the study content. However, if the initial reliability test demonstrated sufficient reliability, the additional samples do not need to be as large as in the initial test. Samples in the 30 to 50 range should be sufficient to demonstrate maintenance of reliability. Most content analysis projects do not involve enough content to require more than two reliability tests, but in some cases researchers should consider more tests. For example, the aforementioned coding of local government news coverage (Fico et al., 2013a; Lacy et al., 2012) took more than four months and involved three reliability tests. As a rule of thumb, when additional tests are required, the second test should take place when between 50% and 60% of the content has been coded, while the third should be conducted in the 80% to 90% completion range.

In addition to incorporating extra inter-coder reliability tests within long-term projects, study leaders should consider instituting intra-coder reliability tests. Coders can stray unconsciously from the process specified in the protocol and develop their own heuristics for applying values—heuristics that may be inconsistent with the protocol. There is no advice in the literature about when intra-coder tests should be administered, but it would be sensible to include such assessments within the schedule for inter-coder testing. Previously coded content can be integrated into the set of content units selected for the inter-coder assessment.

A major concern with long-term projects is what to do if reliability falls below acceptable levels. If initial reliability of the protocol is high, deterioration likely reflects coder problems. Coders whose reliability has slipped must be identified and either retrained or dismissed from the study. Moreover, all of their coding work since their last acceptable reliability tests must be recoded by other coders.

Reliability Coefficients

The degree of reliability that applies to each variable in a protocol is reported with a reliability coefficient—a summary statistic for how often trained coders using the protocol agreed on classification of content units. The literature about content analysis reliability and inter-rater agreement in medicine encompasses more than thirty different reliability coefficients (Nili, Tate, & Barros, 2017), but only four of these have featured regularly in communication studies (Lovejoy, Watson, Lacy, & Riffe, 2016): percentage of agreement (also called Holsti's coefficient); Scott's pi; Cohen's kappa; and Krippendorff's original CAlpha. Moreover, percentage of agreement has fallen out of use as a primary reliability coefficient because it overstates the true level of reliability by not taking unjustified (i.e., not prompted by the protocol) or "chance" agreement (defined below) into consideration. Therefore, it is an inadequate test of reliability. By contrast, the other three coefficients do consider "chance" agreement.

Researchers are encouraged to explore this wide-ranging literature, but they should remember that the goals and processes of medical inter-rater agreement and content analysis reliability are not the same. Medical diagnosis does not follow a set of instructions and definitions that are developed for a given problem. Instead, health practitioners depend on general guidelines and tests that are unique to each patient. In short, health raters have more influence in generating data than coders do in content analysis.

Lovejoy et al. (2016) examined 672 articles in Journalism & Mass Communication Quarterly (JMCQ), the Journal of Communication (JoC), and Communication Monographs (CM) that used content analysis to generate data. The percentage of articles reporting a "chance-corrected" reliability coefficient for every variable increased over the course of the 1985–2014 period. Nevertheless, between 2010 and 2014, 25% of content analysis articles in JMCQ, 23% of articles in JoC, and 43% of articles in CM still did not report reliability coefficients that corrected for chance. During the same five years, 68% of articles in CM, 50% of articles in JMCQ, and 41% of articles in JoC did not report a chance-corrected coefficient for each variable in the study.

Percentage of Agreement

Before discussing which coefficient scholars should use, this section examines the four that they most often use. Percentage of agreement among two or more coders was the first coefficient researchers used to assess reliability, but they have since learned its limits. This coefficient is easily calculated by dividing the number of coder agreements by the total judgments made by the coders. All coding decisions can be reduced to dichotomous outcomes for figuring simple agreement: each possible pair of coders is compared for agreement or disagreement. For example, if three coders categorize an article, the total number of dichotomous coding decisions will equal three: coders A and B, coders B and C, and coders A and C. Similarly, four coders will yield six decisions (A and B, A and C, A and D, B and C, B and D, and C and D), and so on. As mentioned above, the percentage of agreement coefficient overestimates reliability because it does not control for the influence of agreements due to accident or error. However, the fact that the term “chance” is used with other reliability coefficients to summarize such agreements does not mean they are actually due to chance or guesses. Most, or even all, agreements could be the result of a well-developed protocol and good training, especially if agreement is high and the data distribution is not skewed. The terms “chance,” “accident,” and “error” are not equivalent: “chance” implies randomness, whereas “error” and “accident” imply some non-random process. Although percentage of agreement can inflate reliability, it is useful during protocol development and coder training as a way of identifying where and why disagreements are occurring. It also helps in understanding the nature of the


data when it is compared with other reliability coefficients. As discussed below, sometimes a study has a high level of simple agreement but low CAlpha, kappa, or pi. Examining these together can help improve the protocol for future use. Thus, content analysts should report both a simple agreement figure and one or more reliability coefficients. The simple agreement figure should be placed in an endnote as information for researchers conducting replications. However, decisions about the reliability of a variable in the protocol should be based on a coefficient that adjusts for "chance" agreement.

Coefficients that Evaluate Chance Agreement

Consider the possibility that some coder agreements might occur among untrained coders who are not guided by a protocol. These have traditionally been called “chance” agreements. One of the earliest reliability coefficients that “corrects” for chance agreement is Scott’s pi (Scott, 1955). This involves only two coders and it is used with nominal data. Correcting for chance leads to the calculation of “expected agreement” using probability theory. Scott’s pi computes expected agreement by using the proportion of times particular values of a category are used in a given test. Assume that a variable (genre of streaming movie) has four categories (drama, comedy, science fiction, and horror) and that two coders have coded 10 units of content for a total of 20 coding decisions. Drama has been used 40% of the time (i.e., in total, the two coders have selected drama as the correct coding category on 8 occasions), comedy has been used 30% of the time (on 6 occasions), and both science fiction and horror have been used 15% of the time (each on 3 occasions). Here is where the multiplication rules of probability apply. We multiply because chance involves two coders rather than one. The probability of a single “event” (a streaming movie’s genre being drama) equals .4, but the probability of two such events (two coders coding the same movie as drama) requires .4 to be multiplied by .4. Of course, this makes intuitive sense: a single event is more likely to occur than two such events occurring. In this example, the expected agreement is .4 multiplied by .4, or .16 (movie drama), plus .3 multiplied by .3, or .09 (comedy), plus .15 multiplied by .15, or .022 (science fiction), plus .15 multiplied by .15, or .022 (horror). The expected agreement by chance alone would then be the sum of the four products: .16 + .09 + .022 + .022, or .29 (29%). The computing formula for Scott’s pi is:

pi = (OA − EA) / (1 − EA)

in which
OA = observed agreement
EA = expected agreement

In this formula, OA is the agreement achieved in the reliability test and EA is the agreement expected by chance, as just illustrated. Note that the expected agreement is subtracted from both the numerator and denominator. In other words, chance is eliminated from both the achieved agreement and the total possible agreement. To continue with the example, suppose the observed agreement between the two coders coding the four-value category for 10 streaming movies is 90% (i.e., they have disagreed only once). In this test, Scott's pi would be:

pi = (.90 − .29) / (1 − .29) = .61 / .71 = .86

The result (.86) can be interpreted as the agreement that has been achieved as a result of the category definitions and their diligent application by coders after a measure of possible chance agreement has been removed. Finally, Scott’s pi has an upper limit of 1.0 (in the case of perfect agreement) and a lower limit of −1.0 (in the case of perfect disagreement). Figures around 0 (zero) indicate that chance is more likely governing coding decisions than the content analysis protocol definitions and their application. As mentioned above, there are other ways to assess the impact of chance. For example, Cohen (1960) developed kappa, which has the same formula as Scott’s pi:

kappa = (P0 − Pe) / (1 − Pe)

in which
P0 = observed agreement
Pe = expected agreement

Kappa and pi differ, however, in how expected agreement is calculated. Recall that Scott (1955) squared observed proportions for each value of a category on the assumption that all coders are using those values equally. In other words, if the team of two coders selected drama (value 1) on 8 out of 20 occasions, .4 is squared regardless of whether one of the coders chose that genre six times and the other only twice. By contrast, kappa uses expected agreement based on the proportion of a particular value of a category used by one coder multiplied by the proportion for that value used by the other coder. These proportions are then added for all the values of the category to get the expected agreement. In the example, one coder has chosen value 1 (drama genre) in 6 of 10 decisions (.6), while the second coder has chosen value 1 in 2 of 10 decisions (.2). Therefore, while pi yielded an expected value of .16 (.4 × .4) for drama, kappa yields an expected value of .12 (.6 × .2). Kappa will sometimes produce somewhat higher reliability figures than pi, especially when one value of a category is used much more often than others. See Cohen (1960) for further explanation of this coefficient.
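To make the arithmetic concrete, the following Python sketch computes percentage of agreement, Scott's pi, and Cohen's kappa for two coders and one nominal variable. The codings are hypothetical but arranged to match the chapter's streaming-movie illustration for pi (pooled proportions of .4, .3, .15, and .15 with a single disagreement), so pi works out to the .86 reported above; note that this arrangement gives each coder four dramas, unlike the separate 6-versus-2 split used to illustrate kappa's expected agreement.

# A minimal sketch (hypothetical codings) of percentage of agreement,
# Scott's pi, and Cohen's kappa for two coders and one nominal variable.
from collections import Counter

coder_a = ["drama"] * 4 + ["comedy"] * 3 + ["scifi", "horror", "scifi"]
coder_b = ["drama"] * 4 + ["comedy"] * 3 + ["scifi", "horror", "horror"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # simple agreement

# Scott's pi bases expected agreement on the pooled proportions of both coders.
pooled = Counter(coder_a + coder_b)
expected_pi = sum((count / (2 * n)) ** 2 for count in pooled.values())
scotts_pi = (observed - expected_pi) / (1 - expected_pi)

# Cohen's kappa bases expected agreement on each coder's own proportions.
counts_a, counts_b = Counter(coder_a), Counter(coder_b)
expected_kappa = sum((counts_a[c] / n) * (counts_b[c] / n) for c in pooled)
kappa = (observed - expected_kappa) / (1 - expected_kappa)

print(f"Agreement = {observed:.2f}, pi = {scotts_pi:.2f}, kappa = {kappa:.2f}")
# Agreement = 0.90, pi = 0.86, kappa = 0.86

With more coders, non-nominal data, or missing values, researchers would normally turn to the software discussed below rather than hand calculations.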


Standard kappa is used for nominal-level measures, and all disagreements are assumed to be equivalent. However, Cohen (1968) developed a weighted kappa to address disagreements that vary in the potential seriousness of their consequences (e.g., a psychiatrist reads a patient’s diary and erroneously concludes that the patient suffers from a personality disorder rather than psychosis). Krippendorff (1980) developed another coefficient—CAlpha—that is similar to Scott’s pi, followed by a family of further Alpha coefficients (Krippendorff & Craggs, 2016; Krippendorff, Mathet, Bouvry, & Widlöcher, 2016) that are useful under various circumstances (see below). The original CAlpha is presented by the equation:

CAlpha = 1 − (D0 / Dc)

in which
D0 = observed disagreement
Dc = expected disagreement

The process of calculating D0 and Dc depends on the level of measurement (nominal, ordinal, interval, or ratio) used for the content variables. The difference between CAlpha and pi is that Krippendorff's statistic can be used with non-nominal data and multiple coders. CAlpha also corrects for small samples (Krippendorff, 2013), and some CAlpha computer programs can accommodate missing data. CAlpha and pi are equal when nominal variables, two coders, and a large sample are used. For more details about CAlpha, see Krippendorff (2013).

Krippendorff (2004b) stated that a reliability coefficient can be an adequate measure of reliability under three conditions. First, checking content for reliability requires two or more coders working independently but applying the same instructions to the same content. Second, a coefficient must treat coders as interchangeable and presume nothing about the content other than that it is separated into units that can be classified. Third, a reliability coefficient must control for agreement due to chance.

Krippendorff (2004b) pointed out that most reliability coefficients share the same basic structure, subtracting expected agreement from observed agreement; they vary, however, in how that expected agreement is calculated. Scott's pi and Krippendorff's (2004a) CAlpha are the same, except that CAlpha adjusts for small-sample bias; the two coefficients differ by (1 − pi)/n, where n is the sample size, so the difference between pi and CAlpha approaches 0 (zero) as n increases. Krippendorff (2004b) criticized Cohen's kappa because the expected disagreement is calculated by multiplying the proportion of a category value used by one coder by the proportion used by the other coder (as described above). Therefore, according to Krippendorff, the expected disagreement is based on

the coders' preferences, which violates the second and third of his three conditions. Nevertheless, he concluded that Scott's pi is acceptable with nominal data and large samples (although he failed to define what qualifies as "large"). On the other hand, he recommended CAlpha for studies in which data other than nominal measures are used, multiple coders are involved, and/or the samples are small.

There are several options for researchers who wish to use software to calculate reliability coefficients. For example, ReCal, a web-based tool launched in 2010 and updated in 2013, will calculate all four of the coefficients discussed above (Freelon, 2010; Freelon, 2013). It can be accessed at http://dfreelon.org/utils/recalfront/. This site has modules for nominal, ordinal, interval, and ratio data that work with data in Excel, SPSS, Stata, and other statistical programs. Hayes (2005) developed a macro for use with SPSS to calculate CAlpha. De Swert (2012) has written a useful handbook on this macro's application. Finally, there are modules to calculate CAlpha for both Stata (Staudt & Krewel, 2015) and R (Hughes, 2021).

Krippendorff (2016a) suggested that bootstrapping should be used to estimate CAlpha if the reliability sample is small and the sample distribution is irregular. Bootstrapping involves drawing a large number of sub-samples (with replacement, as described in Chapter 6) from the original reliability sample as a way of creating a better estimate of the population distribution and the reliability coefficient. Hayes (2005) includes an algorithm for bootstrapping CAlpha in his macro.

Pearson's Product–Moment Correlation

Although it is not a reliability coefficient, Pearson's correlation coefficient (r) is sometimes used as a check for accuracy of measurement with interval- and ratio-level data. This statistic, which is explained more fully in Chapter 9, measures the degree to which two variables, or two coders in this case, vary together. Correlation coefficients can be used when coders are measuring space or time. With this usage, the coders become the variables, and the recording units are the cases. If, for example, two coders measured the length of a movie scene (in seconds), or the space devoted to a symbol in a newspaper campaign ad (in square inches), a correlation coefficient would measure the coders' similarity in their use of high- and low-scale values to describe the length of the scene or the space given to the symbol, relative to one another. Krippendorff (1980) warned against using correlations for reliability because association is not necessarily the same as agreement. However, this is not a problem if the assignment of meaning and the accuracy of measurement for content units are determined separately. The correlation coefficient is not used to measure category assignment, but rather to measure the consistency of measuring instruments such as clocks and rulers.
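Krippendorff's warning is easy to see with made-up numbers. In the hypothetical Python sketch below (scipy is assumed to be installed), one coder's stopwatch runs a constant five seconds long, so the two sets of scene timings correlate perfectly even though the coders never record the same value.

# A minimal sketch (hypothetical timings): association is not agreement.
from scipy.stats import pearsonr

coder_a = [30, 45, 60, 90, 120]      # scene lengths in seconds
coder_b = [35, 50, 65, 95, 125]      # systematically 5 seconds higher

r, _ = pearsonr(coder_a, coder_b)
exact_matches = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

print(f"Pearson's r = {r:.2f}, exact agreement = {exact_matches:.2f}")
# Pearson's r = 1.00, exact agreement = 0.00

Here the correlation certifies that the two "clocks" rank and space the scenes consistently, which is what the chapter suggests Pearson's r should be used for, while agreement-based coefficients would penalize the constant offset.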


Other Forms of Alpha

The discussion above applies to coefficients used to calculate the reliability of variables predefined in the coding protocol. But some content may not have easily recognizable and discrete units that can be coded. Therefore, Krippendorff et al. (2016) developed a variation of CAlpha (UAlpha) that can be used to calculate a reliability coefficient when the units have not been predefined. They suggest conversation among people as an example of this type of content. In such situations, grammatical and syntactical structures found in written communication often break down and can make unitizing difficult. The same argument might apply to some tweets and online comments. Unfortunately, Krippendorff et al. do not explore the process by which such coding occurs. Researchers interested in examining unstructured data should access their article and any related literature. In addition, Krippendorff and Craggs (2016) introduced an Alpha (MVAlpha) to use in calculating reliability for variables that are applied to units with multivalues. The exact nature of these multivalued variables is unclear. At one point, the authors write, “Literary theory has argued for some time that all texts have multiple interpretations” (p. 182). However, later in the article (p. 185), the discussion centers on the presence of two words—“relieved and ashamed”—as representing an emotional state in a text. These two examples are not necessarily the same type of coding decision. In addition, the article does not address how the coding of multivalued units differs from single-valued variables. Because units are defined by the variables in a protocol, it seems unnecessary to have units with multivalues. This issue is discussed in Chapter 5. Controversy about Reliability Coefficients

During the first two decades of the 21st century, scholars debated which of the many reliability coefficients is most appropriate as an omnibus coefficient for estimating reliability (Feng & Zhao, 2016; Gwet, 2008, 2014; Hayes & Krippendorff, 2007; Krippendorff, 2012, 2016b; Quarfoot & Levine, 2016; Zhao, Feng, Liu, & Deng, 2018; Zhao, Liu, & Deng, 2012). Limitations of space preclude extensive discussion of this debate; suffice it to say that it focused on the process of calculating expected agreement using the "chance agreement" concept and on the assumptions underlying calculation of some of the coefficients. The pi, kappa, and CAlpha coefficients have all been criticized because they can produce low coefficients despite high percentages of agreement among coders (Gwet, 2008; Krippendorff, 2011; Lombard, Snyder-Duch, & Bracken, 2004; Potter & Levine-Donnerstein, 1999; Zhao et al., 2012) and because they conservatively assume—and correct for—a maximum level of chance agreement (Zhao et al., 2012). More recently, the debate has evolved into arguments about the definition of reliability (Feng & Zhao, 2016; Krippendorff, 2016b; Zhao et al., 2018).

However, these discussions often seem to get bogged down in detail while ignoring the larger picture of reliability and the corresponding role of reliability coefficients. The following observations are offered to scholars who might wish to examine the debate:

• Concern with reliability should not focus solely on whether coders of content are consistent in their agreements and disagreements. As emphasized previously, reliability reflects the application of a well-developed protocol by trained coders to content of interest. By definition, replication of a research project should use the original protocol, even if it is modified, but the coders will likely be different. Therefore, reliability must be more a function of the protocol and training than of agreement and disagreement among a certain set of coders.
• Generating reliable data is not the ultimate aim of developing a protocol. Rather, the goal should be to generate valid data, with adequate reliability simply a threshold to help establish data validity. The higher the reliability, the greater the likelihood of generating valid data. It is worth remembering this goal—data validity—when using automated textual analysis.
• The goal of protocol development and coder training should not be to generate data that will be publishable, but to reach the highest possible level of validity.
• Discussing a coefficient in isolation from the acceptable level of that coefficient adds little to the debate. Most discussions of coefficients ignore what level should be obtained to have confidence in the validity of the data. Indeed, different coefficients might have different minimal acceptable values. To argue that one coefficient is better or worse than another simply because it has a higher or lower value sounds like the argument in the movie This Is Spinal Tap that a guitar amplifier with a maximum of 11 on its volume dial is better than one with a maximum of 10. Communication and media studies need research that explores the relationship between reliability coefficients and the various forms of validity discussed in Chapter 8.
• Reviewers and researchers should consider the stage of research in a given theoretical area when evaluating protocol reliability. The more advanced the theory and research in an area, the higher should be the threshold for acceptable levels of agreement. The false-positive findings for relationships that occur in newly emerging research areas become less acceptable as those areas become better understood. Using data with lower reliability, and therefore lower validity, can result in the acceptance of relationships that do not predict social phenomena well.
• The term "chance agreement" is often used when calculating expected agreement for these coefficients, but with adequate training and a well-developed protocol, very few agreements among coders occur by chance. Error agreements (not due to protocol and training) are a form of measurement error that cannot be accurately identified. Because of this, scholars use probabilities based on decisions made by coders to calculate expected agreement. How those probabilities are calculated depends on the assumptions underlying the coefficients and varies from coefficient to coefficient.

One central element of the debate noted earlier is the occurrence of high simple agreement with low reliability coefficients, first described more than four decades ago (Kraemer, 1979). Potter and Levine-Donnerstein (1999) note that with a two-valued measure (e.g., "Is a terms of use agreement link clearly visible on a website's front page—yes or no?"), the frequent occurrence of one value and the infrequent occurrence of the other (e.g., 97% of sites have links while 3% do not) creates an "imbalance" that is "overcorrected" in the chance agreement component of Scott's pi. Other scholars recently noted that this phenomenon also occurs with kappa (Gwet, 2008) and CAlpha (Zhao et al., 2012).

Gwet (2008) addressed the high agreement/low reliability phenomenon found with pi and kappa by developing a new coefficient called AC1. He divided agreements and disagreements into groups based on four conditions for two coders and two categories: (1) both coders assign values based on the correct application of the protocol; (2) both assign based on randomness; and for (3) and (4), one coder assigns randomly while the other assigns based on the correct application of the protocol. Gwet argued that kappa and pi assume only two types of agreement—one based on the correct application of the protocol and one based on randomness—which ignores the other two conditions. He conducted a Monte Carlo study with data from psychology and psychiatry, and concluded that AC1 produces lower variances than pi or kappa and provides a better estimate of expected agreement. However, AC1 can be applied only to nominal data, so Gwet (2014) subsequently developed the AC2 coefficient, which can be used with ordinal and interval data.

Krippendorff (2011, 2012, 2016b) responded to criticisms of CAlpha by insisting that variables that do not vary (i.e., have an "imbalanced" distribution) are not useful and that the uneven distribution could represent inadequate reliability sampling. While these arguments may be valid in some situations, there are circumstances when the population distribution may actually be extremely uneven among categories. Krippendorff's (2016b) observation about the need for variance is true when social scientists study relationships. However, content analysis is also used to describe content without examining relationships. As mentioned in Chapter 1, some content analyses describe content in order to compare it with some external standard. For example, historically, the percentage of television characters who are people of color has been small (Mastro & Greenberg, 2000). A content analysis of clinical studies aimed at discovering how often fathers are present in such studies (Davison et al., 2016) found two variables with high levels of simple agreement above 90% but CAlphas below .7. This distribution reflected conditions in the population itself.
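The imbalance problem is easy to reproduce with hypothetical data. In the Python sketch below (the website codings are invented), two coders agree on 96 of 100 sites, yet because "yes" dominates the distribution, Scott's pi lands below .5.

# A minimal sketch (hypothetical data) of high simple agreement with a low
# chance-corrected coefficient when one category dominates.
from collections import Counter

coder_a = ["yes"] * 94 + ["no"] * 2 + ["yes", "yes", "no", "no"]
coder_b = ["yes"] * 94 + ["no"] * 2 + ["no", "no", "yes", "yes"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

pooled = Counter(coder_a + coder_b)
expected = sum((count / (2 * n)) ** 2 for count in pooled.values())  # pi's expectation
pi = (observed - expected) / (1 - expected)

print(f"Simple agreement = {observed:.2f}, Scott's pi = {pi:.2f}")
# Simple agreement = 0.96, Scott's pi = 0.48

Because the pooled distribution is 96% "yes," the expected agreement is already about .92, leaving the coefficient little room to credit the coders' near-perfect agreement.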

Certainly, studies of representation in the media and research are worthwhile, and the resulting distribution among categories may be skewed. In such situations, kappa, pi, and CAlpha could be considerably lower than percentage of agreement. The size of the reliability sample would be irrelevant because the population itself has the same "one-sided" distribution. In some situations, even if the coding has only a small amount of error, the reliability coefficient will report reproducibility as lower than it actually is.

Selecting a Reliability Coefficient

Debate about the limitations and advantages of various reliability coefficients is a healthy part of social science. However, any decision to move away from current practice should be based on both empirical and theoretical grounds. Thus far, there has been little empirical research on the connection between levels of reliability and valid conclusions about data. How reliable do data need to be in order to yield valid conclusions about relationships in those data? Nor has much empirical research used actual content analysis data to explore the implications of using various coefficients. Scholars should be encouraged to engage in such research. Until it is resolved whether an omnibus coefficient is appropriate, or even possible, and what that coefficient would be, the following recommendations for reporting reliability coefficients are offered:

1. Report a reliability coefficient that corrects for "chance" agreements in calculating expected agreement for each variable in the study. Replication requires this. An average or "overall" measure of reliability can hide weak variables, and not every researcher will want to use every variable from previous studies.
2. Use either a census or a probability sample to calculate reliability coefficients, and report the sampling error for the reliability coefficients if it is a probability sample.
3. If the data are not skewed, report Krippendorff's CAlpha and the simple percentage of agreement (Lacy et al., 2015).
4. An important question is: "What is an acceptable level for a given coefficient?" Krippendorff (2004a) suggested that a CAlpha of .8 indicates adequate reliability. However, in the same text, he wrote that variables with CAlphas as low as .667 could be acceptable for drawing tentative conclusions. For established areas of research, a variable's reliability should be .8 or preferably higher. In newer areas of research, variables in the region of .7 to .8 could be reported, but the researcher should justify the use of such variables by addressing the issue of validity and scholarly importance.
5. If the data show a high level of simple agreement, but reliability coefficients do not reach these levels of acceptability, report Gwet's AC2 as the reliability coefficient (Lacy et al., 2015) and explain why it is being used.

6. If a researcher is uncomfortable with relying on CAlpha alone, CAlpha and other coefficients should be reported in the article, along with an explanation of why multiple coefficients have been included. Authors should also explain their acceptance level for each type of reliability coefficient. Data that meet requirements for multiple coefficients have a stronger argument for their validity.
7. Adhering to these rules will require random selection of an adequate sample size. All the categories for each variable should be reflected in the sample. If they are not, then sample size should be increased.

Equally important to establishing reliability for variables in a given protocol is establishing variable reliability across time. Social science advances through improved measurement, and improved measurement requires consistent reliability. If scholars aim to standardize protocols for commonly used variables, the reliability of these protocols will increase over time.

Summary

Conducting and reporting reliability assessments of the content analysis process are not optional. Reliability assessment of how well the protocol is applied by coders is an essential, but in itself insufficient, step in establishing data validity. However, this has not always been the case. A study of 581 content analyses that appeared in Communication Monographs, the Journal of Communication, and Journalism & Mass Communication Quarterly between 1985 and 2010 (Lovejoy et al., 2014) found that 24% of articles did not report any information about reliability, and only 49% included a reliability coefficient that took chance into consideration. Reporting on reliability has improved over time, but there has been little improvement in the provision of details on the use of a census or probability sample for reliability or in transparent descriptions of how the reliability sample was selected.

Moreover, full information on the content analysis should be disclosed or at least made available for other researchers to examine or use. A full report on content analysis reliability would include protocol definitions and procedures. Because space in journals is limited, the protocol should be made available by study authors on request. In addition, information on the training of coders, the number of content items tested, and how they were selected should be included in the article. At the very least, the specific coder reliability tests applied and the achieved numeric reliability, along with confidence intervals, should be included for each variable in the published research. Meanwhile, journal editors should invite authors of content analysis studies to submit their protocols whenever such articles are accepted, as collecting and sharing these protocols will enhance fellow-scholars' understanding of previous studies and improve measurement.
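One way to produce the confidence intervals recommended above, consistent with the bootstrapping approach mentioned earlier in this chapter, is to resample the reliability-test units with replacement. The Python sketch below is a hypothetical, minimal illustration for simple agreement; in practice, researchers would apply the same logic, or tools such as Hayes's macro, to a chance-corrected coefficient.

# A minimal sketch (hypothetical data): percentile bootstrap confidence
# interval for simple agreement, resampling reliability-test units.
import random

random.seed(42)
agreements = [1] * 45 + [0] * 5        # 50 test units; 1 = coders agreed
n = len(agreements)

estimates = []
for _ in range(5000):
    resample = random.choices(agreements, k=n)   # draw n units with replacement
    estimates.append(sum(resample) / n)

estimates.sort()
low, high = estimates[int(0.025 * 5000)], estimates[int(0.975 * 5000)]
print(f"Agreement = {sum(agreements) / n:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# Agreement = 0.90, with an interval of roughly [0.82, 0.98]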

When applying reliability tests, researchers should randomly select a sufficient number of units for the tests, then decide whether each variable reaches acceptable reliability levels based on coefficients that take possible error agreements into consideration, and report simple agreement in an endnote to assist in replication of the study. If a variable does not reach an acceptable level of reliability, it should be dropped. Using variables with inadequate reliability can result in reporting relationships that do not exist, or concluding that relationships that are present in the population do not exist. Unfortunately, unlike sampling error, the measurement error induced by unreliable data cannot be estimated.

Dropping a variable from analysis does not mean the variable and its reliability coefficient should go unreported. Such dropped variables should be discussed in an endnote as this information will help other scholars develop and code difficult variables. Failure to systematize and report the procedures used, as well as to assess and report reliability, virtually invalidates whatever usefulness a content study may have for building a coherent body of research. Therefore, journals that publish content studies should always insist on such assessments.

8

Validity

Validity is a requirement for meaningful social scientific measurement. Although reliability is requisite, it is insufficient in itself to establish validity. That said, content analysts rarely give much attention to validity in their articles, beyond establishing the reliability of their measures. Yet, if authors want to assert both the meaning of their measures and how those measures relate to theoretical concepts, content analyses must substantively address questions of validity. What does the term valid mean? “I’d say that’s a valid point,” one person might respond to an argument presented by another. In this everyday context, validity can relate in at least two ways to the process by which one knows things with some high level of probability. First, valid can mean the speaker’s argument refers to some fact or evidence (e.g., that the national debt in 2023 topped $31 trillion). A reference to a particular fact suggests, of course, that the fact is part of objective reality. Second, valid can mean the speaker’s logic is persuasive, because observation of facts leads to plausible inferences or deductions. In social science, validity relates to both of these everyday ways in which people make inferences and interpretations of reality. However, social scientific validity is more rigorous. First, social science breaks up “reality” into conceptually distinct parts that people believe exist and that have observable indicators of their existence. Second, social science operates with logic and rules of observation to connect those concepts to their indicators in ways that help us predict, explain, and potentially control that reality. Content analysis must therefore incorporate these two processes. Scholars need to address how a concept they have defined about some part of “communication reality” actually exists. Second, researchers are obliged to address how their category measurement of that communication concept is appropriate. If researchers are mistaken about the existence of that part of communication reality, or if they measure it incorrectly, their predictions about the communication process fail. Yet, even with reality-based concepts and measures, researchers face the problem of linking those concepts through data-collection and -analysis methods that


have the highest likelihood of producing successful predictions. So, the second validity problem focuses on these "linking processes" that tie together concepts, measurements of them, observations about their interconnections, and predictions about their future states.

All of this may sound formidable enough, and philosophers of science urge humility even when successful predictions do suggest that particular concepts allow the understanding and measuring of reality. Bertrand Russell illustrated this with a story of a chicken that, day after day, was well fed, watered, and cared for in every way by a farmer. That chicken could predict these interconnected events to a high degree of probability, suggesting a very good understanding of reality. Until, of course, the day when the farmer showed up with his axe. That chicken did not even know, much less understand, the larger context of which it was a part.

This chapter deals first with the validity of the concepts in our theories and then with the validity of the observational processes we use to link those concepts. Finally, taking a lesson from Russell's chicken, it addresses a wider context called "social validity" by considering how a scientifically valid, empirically sound content analysis relates to the wider communication world experienced by people.

The Problem of Measurement Reliability and Validity

Content analysis studies the reality of communication in our world. It does this through the creation of reliable and valid categories that make up the variables that relate to one another in hypotheses or models of communication processes. As emphasized in the previous chapter, concepts in hypotheses and research questions must be defined operationally in terms of the actual, concrete process of measuring variables. For example, several content analysis studies have addressed the concept of news "quality" (Lacy & Rosenstiel, 2015). Just what is quality in news content? Who says that the measures used to assess quality are good measures? Is quality, like beauty, in the eye of the beholder?

The answer to that third question is, of course, "yes": sometimes quality is in the eye (or mind) of the beholder. However, multiple beholders can often agree. This question also illustrates another potentially troublesome validity issue in content analysis measurement: communication is not simply about the measurable frequency or occurrence of communication elements; it is also about the meanings of all the words, expressions, inflections, gestures, symbols, signs, and so on that people use to communicate. In measuring something like news "quality," researchers attempt to employ an operational definition that sufficiently reduces all the potential ambiguity in the measurement of communication reality; they do not attempt to apprehend that reality in its complex entirety. Researchers


ought not to assume that such ambiguities can always be resolved, but resolving ambiguity is often accomplished by connecting content measurements to previous research. These issues become critical in terms of efforts to achieve reliability in content category measures. Recall that measurement reliability is a necessary but, in itself, insufficient condition for measurement validity. A measure can be reliable in its application but wrong in what researchers assume it is measuring. For example, a measure may sufficiently reduce ambiguity about what makes news “quality” by specifying particular indicators that can be reliably measured, but one may question the indicators’ validity as representing the complex concept. A valid measure must be both reliable in its application and valid for what it measures. In other words, one may achieve high levels of coder agreement on the existence or state of a content variable, but the operational definition may have no more than a tenuous connection with the concept of interest. As noted in Chapter 4, much of the concern with computer content analysis is that the validity of concept measurement is compromised by its focus on keywords absent any context that gives them meaning. Part of the solution to this problem, for scholars generally, is multiple measures of the concept of interest, as has been done with the concept of news quality (Lacy & Rosenstiel, 2015). Ultimately, though, content analysts must ask the most consequential question of their variables: “Do they validly assess something meaningful beyond their utility in the particular study?” In some studies, the validity of the measures used is assessed in the broader research stream of which the particular study is a part. Too often, however, reports will focus on the reliability of the measures for the study in question while ignoring or assuming their validity. Types of Measurement Validity Analysts such as Holsti (1969) and Krippendorff (2013) have discussed validity at length. In particular, Holsti’s familiar typology identifies four types or tests of measurement validity—face, concurrent, predictive, and construct—that apply to the operational terms of variables used in hypotheses and questions. These same tests of validity may be applied to measures, constructs, and even relationships. Face Validity

The most common validity test used in content analysis, and certainly the minimum requirement, is face validity. Basically, the researcher makes a persuasive argument that, absent any immediately obvious contradictory information, a particular measure of a concept makes sense on its face. Coding presidents’ State of the Union addresses over time for references to public issues like foreign policy

would—"on the face of it"—indicate changes in the United States' foreign policy agenda over time (e.g., Eddy, Riffe, Cohen, & Kim, 2021). Put bluntly, the researcher assumes that the sufficiency of a measure is obvious and requires little additional explanation. Relying on face validity can arguably be appropriate when agreement on a particular measure's sufficiency is high among relevant researchers. Using measures from previous studies enhances face validity because other researchers have successfully argued the connection between a measure and the underlying concept.

Concurrent Validity

Validity should be well established for purposes of inference. One of the best techniques is to correlate a measure used in one study with a similar one used in another study. Concurrent validity can also be established with two different methods and measures yielding the same conclusion. In effect, the two methods can provide mutual—or concurrent—validation. In the early years of the 21st century, as traditional legacy media faltered, questions arose about whether “citizen journalism” could replace legacy news outlets in fulfilling the same journalistic functions for communities. (Citizen journalism is journalism created by non-professionals.) A content analysis of 64 news sites and blogs in 2007 in 15 randomly selected US cities (Lacy, Riffe, Thorson, & Duffy, 2009) found that most were blogs (opinion) rather than news sites. Five of the cities had no news sites and the mean or average for others was less than one news site per city. A 2007 survey of 104 city government reporters in 38 states (St. Cyr, Carpenter, & Lacy, 2010) found that the mean number of citizen news sites covering city government was .57. In sum, measures of citizen journalism sites through content analysis and survey were consistent, providing concurrent validity for both measures. Predictive Validity

A test of predictive validity correlates a measure with a predicted outcome. If the outcome occurs as expected, confidence in the measure’s validity is increased. More specifically, if a hypothesized prediction is confirmed, confidence in the validity of the measures of the variables in the hypothesis is strengthened. The classic example cited by Holsti (1969, p. 144) concerns a study of suicide notes left in real suicides and a companion sample from non-suicides. The real notes were used to put together a linguistic “model” predicting suicide. Based on the model, coders successfully identified notes from real suicides, thereby validating the predictive power of the content model. The aforementioned citizen journalism study by Lacy et al. (2009), which examined its prevalence in 2007, was later expanded to include citizen journalism websites, blog sites, and commercial websites in 46 markets in 2009 (Lacy et al., 2010). Measures used in the


2007 study were included and the conclusions were consistent: the 2007 measures predicted the 2009 content.

Construct Validity

Construct validity involves the relation of an abstract concept to the observable measures that presumably indicate the concept’s existence and change. The underlying notion is that a construct exists but is not directly observable, except through one or more indicators used to measure the concept. Therefore, some change in the underlying abstract concept will cause observable change in the measures. Statistical tests of construct validity assess whether the measures relate only to that concept and to no other (Hunter & Gerbing, 1982). If construct validity of measures exists, then any change in the measures, and the relation of the measures to one another, is entirely a function of their relation to the underlying concept. If construct validity does not exist, then measures may change because of their relation to some other, unknown concepts. In other words, construct validity enables the researcher to be confident that when the measures vary, only the concept of interest is varying. Put another way, the issue of construct validity involves whether measures “behave” as theory predicts, and only as theory predicts (Wimmer and Dominick, 2003). Construct validity must exist if a field such as mass communication is to build a cumulative body of scientific knowledge across a multitude of studies. Common constructs used across studies help bring coherence and focus to a body of research. Valid constructs also make for more efficient research, enabling researchers to take the next step in extending or applying theory without needing to duplicate earlier work. Few studies in the field validate measures this way, however. Validity in Observational Process Given that researchers have enough confidence in the validity of concept measures (variables), the question becomes how they are linked in a way that validly describes social reality. Every social science method has a set of procedures meant to ensure that observations minimize human biases in how such reality is perceived. In survey research, such procedures include random sampling to make valid inferences from characteristics in a sample to characteristics in a population. In an experiment, procedures include randomly assigning subjects to create equivalent control and experimental treatment groups, thus permitting logical inference that only the treatment administered to the experimental group could have caused an observed effect. Chapter 7 discussed how content analysis uses protocol definitions and tests for “chance” agreement to minimize the influence of human biases and chance in coding. Application of these procedures strengthens confidence in survey, experimental, or content

analysis findings. However, given science's primary goal—the prediction, explanation, and potential control of phenomena—how does content analysis achieve validity?

Internal and External Validity

Experimental method provides some ways of thinking about validity in the research process that can be related to content analysis. Assessing experimental method in educational research, Campbell and Stanley (1963) distinguished between an experimental design’s internal and external validity. By internal validity, they meant the ability of an experiment to illuminate valid causal relationships. An experiment does this through the use of controls to rule out other possible sources of influence and rival explanations, as prescribed in the set of procedures above. By external validity, Campbell and Stanley meant the broader relevance of an experiment’s findings to the vastly more complex and dynamic pattern of causal relations in the world in general. An experiment may increase external validity by incorporating “naturalistic settings” into the design. This permits assessment of whether causal relations observed in the laboratory are in fact important, relative to other influences operating in the world. However, by their nature, laboratory experiments cannot entirely replicate complex naturalistic settings. Notions of internal and external validity in experimental design are also useful for thinking about content analysis validity. Obviously, when used alone, content analysis cannot possess internal (causal) validity in the sense outlined by Campbell and Stanley because it cannot control all known and unknown “third variables.” Inferring causal relations requires knowledge of the time order in which cause and effect operate, knowledge of their joint variation, control over the influence of other variables, and a rationale explaining the presumed cause– effect relationship. That said, content analysis can incorporate other research procedures that strengthen the ability to make such causal inferences in the building of theory. For instance, if some content is thought to produce a particular effect on audiences, content analysis could be paired with survey research designs to explore that relationship, as in the agenda-setting studies described in Chapter 1. These designs are discussed in more detail later in this chapter. Traditionally, content analysis was a very strong research technique in terms of the external validity or generalizability of research, primarily because sampling of content and media allows generalization from a sample to the population of all content or media. However, this type of external validity claim rests heavily on whether a census—or appropriate sample—of the content was collected, as discussed in Chapter 6. Establishing that an appropriate sample was used was once relatively straightforward, albeit expensive and time consuming. For example, when studying the universe of US daily newspaper coverage,


researchers had access to a list of all US daily newspapers. Establishing the external validity of sampled content is more difficult when studying ever-changing digital communications because it is impossible to list all the producers of digital content. Thus, to produce valid content measures of digital content, the population of relevant content must often be limited to a non-probability sample from an unknown universe. For example, Tillery (2019) wanted to understand social movement strategies deployed by the Black Lives Matter (BLM) movement through Twitter. It would have been almost impossible to define the population of all tweets related to BLM—an essential step in drawing a representative probability sample—so Tillery chose to define the population as the Twitter feeds of six prominent local affiliate chapters of the national BLM organization. He defended this decision by claiming that these accounts represented how “core activists” sought to shape the movement. However, by focusing solely on these “official accounts,” many significant core activists in a movement that has taken on a grassroots nature were omitted from the sample, limiting the external validity of the data. The notion of external validity can also be related to a study’s social validity— the social significance of the content being explored and the degree to which content categories have broader social meaning. To return to the previous example, the inability to generalize to the entire BLM movement’s digital organizing strategies does not necessarily impact the social validity of the study’s conclusions. If the researcher can clearly demonstrate that the sampled Twitter accounts had significant influence over the BLM movement in a way that had social resonance, they can argue for social validity. Some of these issues are discussed in the following pages. Slater (2013) explored how content analysis can fit into a broader program of research. For example, he discussed its use with surveys and how it can be used as a foundation for experiments. Underlying this discussion is the realization that the creation, testing, and support for social science theory are parts of a process that involves multiple methods. Figure 8.1 summarizes several types of validity. Note first that internal validity deals with the design governing data collection and how design may strengthen causal inference. Data collection also requires assessment of measurement validity consisting of face, concurrent, predictive, and construct validity. Statistical validity is a subset of internal validity addressing measurement decisions and the assumptions about data required for particular statistical analyses. Finally, the external and social validity of a content analysis presupposes the internal validity of measurement and design that makes content analysis a part of scientific method. However, the notion of external and social validity used here goes beyond those qualities to assess the social importance and meaning of the content being explored. The overall validity of a study therefore depends on a number of interrelated factors, discussed in the next section.


Figure 8.1 Types of content analysis validity

Internal Validity and Design

Content analysis by itself can best illuminate patterns, regularities, or variable relations in content. However, content analysis alone cannot establish the antecedent causes producing those patterns in the content; nor can it alone explain the effects that content produces. Of course, the analyst may make logical inferences to antecedent causes or subsequent effects of such content, as discussed in Chapter 1, with its model showing the centrality of content to communication processes and effects. Also, certain research designs pairing content analysis with other methods strengthen the ability to infer such causal relationships, thereby enhancing internal validity. In one way or another, then, content analysis designs should address issues of control, time order, and correlation of variables included in a causal model. Control in Content Analysis

Designs that attempt to explain patterns of content must look to information outside the content of interest. This requires a theoretical or hypothesized model, including the kinds of factors that may influence content. In other words, this model is assumed to control for other sources of influence by bringing them into the analysis. The model itself is derived from theory, previous research, and/or personal observations. Consider a simple example: a researcher, noting the collapse of an authoritarian regime, predicts the rapid growth of new journalistic ventures as alternative political voices seek audiences, even as existing news outlets “open up” to previously taboo topics. Two issues need to be emphasized. First, regardless of how plausible a model may be, certain variables that may be crucial in explaining the process


of interest are always left out of the research design. However, the second issue makes the first one interesting: such a model can always be empirically tested to assess how well it works to explain patterns in content. If the model does not work as planned, interesting and engaging work to devise a better model can be undertaken. Unimportant variables can be dropped from the model and other, theoretically interesting ones added. In this example, failure to find new journalistic ventures might reflect limited access to facilities and equipment, rather than lack of dissent. Similarly, failure to find public criticism of former party programs in existing news outlets may indicate residual citizen distrust of journalists who were viewed for years as political party tools. A well-thought-out model that identifies relevant concepts in the causal process is essential to introducing control variables in content analysis studies. The model should include specifying mediating and moderating constructs, as discussed in Chapter 9.

Time Order in Content Analysis

Furthermore, in designing such a model, these presumed influences must incorporate the time element into the design, as noted in Chapter 3. Such incorporation may be empirical—data on the hypothesized cause occurs and is collected and measured before the content it presumably influences. For example, a number of studies have examined the relationship between newspaper quality at time 1 and circulation at time 2 (Lacy & Fico, 1991; St. Cyr, Lacy, & Guzman-Ortega, 2005). Such studies also require statistical controls to eliminate competing explanations for the studied relationship. Incorporation of the time element may also be assumed from the logic of the design. For instance, Lovejoy, Watson, Lacy, and Riffe (2016) studied three communication journals over thirty years to examine variations among the journals’ reporting of content analysis reliability statistics. The use of reliability coefficients in the articles changed over time. However, logically, the use of coefficients could not explain the change in time, so the predictive relationship must exist in the other direction. Obviously, exploring effects of content is the converse of the situation just discussed. Here, time must also form part of the design, but other methods to assess effect are mandatory, too. The logic of the design for content affecting behavior and attitudes may be sufficient for controlling time order. For example, the Great American Values Test (Ball-Rokeach, Rokeach, & Grube, 1984) used a field experiment to measure the relationship between a specific television program about American values (produced by the researchers) and changes in attitude and behavior related to those values. The researchers measured attitudes both before and after the communities saw the program to control time order. (They used a similar community, which did not receive the program, as a control.)

Perhaps the most frequent multi-method example of content analysis research that assesses effect is the agenda-setting research described in Chapter 1. This line of research explores whether differences in news media coverage of various topics at time 1 create a similar subsequent ordering of topic importance among news consumers at time 2. Of course, the possibility that consumers' news priorities influence the media, or that each influences the other, must be taken into account in the design.

Causation in Content Analysis

Establishing a causal relationship requires specification of time order, control for competing explanations, and demonstration of joint variation or correlation. Time order is essentially built into the design. If content is influenced by an antecedent variable, then the antecedent variable must come first. If content influences an individual, the content must be created and accessed by that individual. The requirements of control and correlation in causality among variables are established statistically. Assumptions about independent and dependent variable influences must be explicit in any kind of multivariate analysis. Furthermore, the analysis must consider direct and indirect causal flows, and whether any variables moderate or mediate the nature of the model’s relationships. Chapter 9 discusses the process of modeling variable relationships. Statistics used for analyzing content data are also discussed in Chapter 9. These techniques range from simple correlation measures for relating two variables to multivariate techniques enabling the analysis to more fully control and assess the effects of multiple variables. Different statistics have different assumptions that must be considered. Specific techniques that can be employed will also depend on the level at which variables have been measured. Furthermore, if content data have been randomly sampled, tests of statistical significance must be employed for valid inferences to content populations. These issues relate to the statistical validity of the analysis of content. Measuring content before an effect is a necessary step, but research suggests that some, if not many, content-effect relationships may be reciprocal. For example, there is evidence that political interest/participation can lead to increased media use, which in turn reinforces interest/participation and leads to further media use (Kruikemeier & Shehata, 2017; Lee & Xenos, 2022). Results indicate that such relationships are complex, with variations across media regarding their influence. The existence of reciprocal relationships between media use and various behaviors and attitudes creates an extra burden in establishing causality because it requires the use of longitudinal designs and waves of data from various points in time. Content from more than one time point needs to be matched with other sources of data from more than one time point with the appropriate time order.
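For illustration, the hypothetical Python sketch below (the topic shares and survey figures are invented, and scipy is assumed to be installed) matches content data from time 1 with audience data from time 2 in the spirit of the agenda-setting designs described above, correlating each topic's share of coverage in one month with the share of respondents naming that topic most important the following month.

# A minimal sketch (hypothetical data): lagged association between media
# coverage at time 1 and audience topic salience at time 2.
from scipy.stats import spearmanr

topics = ["economy", "crime", "environment", "health", "education"]
coverage_t1 = [0.35, 0.25, 0.15, 0.15, 0.10]   # share of stories, month 1
salience_t2 = [0.40, 0.20, 0.10, 0.20, 0.10]   # share of respondents, month 2

rho, p_value = spearmanr(coverage_t1, salience_t2)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.2f})")
# rho is roughly .87 for these made-up figures

Such a correlation establishes joint variation and time order but not control; ruling out reciprocal influence would require the longitudinal, multi-wave designs discussed above.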

External Validity and Meaning in Content Analysis

A study may have strong internal validity, as discussed above. However, its findings may be so circumscribed by theoretical or methodological considerations that they have little or no relevance beyond the research community to which they are meaningful. Certainly, research adds to the body of knowledge through intersubjective validation among a group of researchers. Isaac Newton summed this up in his often-quoted saying, "If I have seen farther it is by standing on the shoulders of giants" (Oxford University, 1979, p. 362). The researcher interacts professionally within a community of scientists. But the researcher is also part of wider society, interacting with it in a variety of roles, such as parent, neighbor, consumer, voter, and citizen. In these broader researcher roles, the notion of validity can also have a social dimension related to how new knowledge is understood, valued, or used.

Successful communication requires exchange of knowledge that is meaningful to both the message sender and the receiver. This meaningfulness results from a common language, a common frame of reference for interpreting the concepts being communicated, and a common evaluation of the relevance, importance, or significance of those concepts. In this social dimension of validity, the broader importance or significance of what has been found can be assessed.

External Validity and the Scientific Community

First, though, research must be placed before the scientific community for assessment of a study’s meaningfulness as valid scientific knowledge. The minimum requirement is validation through a “blind” peer-review process in which competent judges assess the study’s fitness to be published as part of scientific knowledge. Specifically, scientific peers must agree that the study should be published in a peer-reviewed venue. As in scientific method generally, the “blind” peer-review process is meant to minimize human bias in the assessment of studies. Two or three judges who are unknown to the author review the work of an author who is unknown to them. The judges apply scientific criteria for validating the research’s relevance, design and method, analysis, and inference. Requirements for such scientific validation are relatively straightforward. Presumably, the research demonstrably grows from previous work, with the researcher explicitly calling attention to its relevance for developing or modifying theory, replicating findings, extending the line of research and filling research gaps, or resolving contradictions in previous studies. This process must be completed before the research is deemed fit for presentation or publication as part of scientific knowledge. Over time, other scholars will adopt the published study as valid and the basis for their own studies. Researchers choose to submit work for peer review for several reasons. First, judges provide comments and criticisms for improving the study. Every piece of research has flaws and limitations, and expert judges can help illuminate these

Validity 163 for correction or at least acknowledgment by researchers. Second, researchers submit work for peer review so that the study might inform and assist other researchers in collectively building a body of knowledge that advances science’s goals of predicting, explaining, and potentially controlling phenomena. The judgment of the scientific community provides the necessary link between the internal and external validity of research. Clearly, research that is flawed because of some aspect of design or measurement cannot be trusted to generate meaningful knowledge. The scientific validation of research is necessary before that research can—or should—have any broader importance. In essence, internal validity (the study is deemed fit to be accepted as scientific knowledge) is a necessary—but not necessarily sufficient—condition for external validity (the study knowledge has wider implications for part or the whole of society). However, the status of any one study as part of scientific knowledge is still tentative until other research provides validation through replication and extension. This validation takes place with direct replication and extension in similar studies, as in agenda-setting research. Replication of findings is important because any one study, given even the most critical scrutiny, may produce an atypical result by chance. However, if study after study replicates similar patterns, the entire weight of the research as a whole strengthens confidence in the knowledge that has been found. Recall from Chapter 6 that even data sets drawn from nonprobability samples can be useful if their findings make a cumulative contribution. Scientific community validation of a study can also happen through the use, modification, or further development of that study’s definitions or measures, or through more extensive work in an area to which a study has drawn attention. The attention to media agenda-setting across multiple decades is an example of collective validation from multiple studies by multiple researchers. External Validity as Social Validity in Content Analysis

The validation of research method and inference is usually determined by the scientific community acting through peer-review and replication processes, as discussed above. Such validation is necessary but insufficient in itself for establishing the broader meaning and importance of research for audiences beyond the scientific community. The external validity of a content analysis beyond the scientific community can be strengthened by maximizing its social validity on two dimensions: the social importance of the content and how it has been collected; and the social relevance of content categories and the way they have been measured and analyzed. The social validity of content studies is discussed in the following sections. Nature of the Content

The social validity of a content analysis increases if the content being explored is important. The more that content is attended to by audiences, the greater the

social validity of analyses exploring the content. Thus, one aspect of importance is the sheer size of the audience exposed to the content. Much of the research into and social attention paid to digital media, for example, reflect the facts that digital content is readily available, that it allows interactivity, and that large numbers of people use it for many hours on a daily basis. Another dimension of the importance of content concerns the exposure of some critical audience to its influence. Children’s television advertising is explored because of its implications for the social development of a presumably vulnerable and impressionable population. Similarly, Twitter has become more important since 2016, when it became a primary means of influencing political agendas. Finally, content importance may reflect some crucial role or function it plays in society. For example, advertising is thought crucial to the economic functioning of market societies. The effectiveness of advertising in motivating consumers to buy products affects not only producers and consumers but the entire fabric of social relations linked in the market. Furthermore, advertising messages can have cultural by-products because of the social roles or stereotypes they communicate. Similarly, news coverage of political controversy is examined because it may influence public policy that affects millions of people. Whatever the importance of the content, the social validity of the analysis will also be affected by how that content has been gathered and analyzed for study: for example, whether the content has been selected through a census or a probability sample will influence what generalizations can be validly made. A major goal in most research is generating knowledge about populations of people, social institutions, or documents. Knowledge of an unrepresentative sample of content is frequently of limited value for knowing or understanding a population. A random sample or even a census of the relevant population of content enables the researcher to speak authoritatively about the characteristics of that population. Findings from content selected purposively or for convenience cannot be generalized to wider populations. However, a strong case for the social validity of purposively selected content may be made in specific contexts. For example, the news content of “prestige” journalism outlets is clearly atypical of general news coverage. However, such news outlets influence policymakers as well as other news outlets, so they are important because they are atypical. Probability, Validity, and the Nature of Content

Establishing external validity through social science does not mean establishing certainty. Social science is probabilistic, and validity is based on causal relationships having high probabilities of occurring, as described by theory and research. Science cannot establish certainty because it is impossible to examine every case that is described by theory and research, and because cases change over time.

Validity 165 The nature of content itself can limit the ability to establish validity through content analysis. As discussed in Chapters 2 and 7, symbols have both manifest and latent meanings. Content analysis is better at categorizing manifest content, but even this is not always easy. Almost all words have more than one manifest meaning (just look in the dictionary), and a particular latent meaning of a word can become adopted so widely over time that it becomes manifest and part of common language. As a result, manifest and latent meanings are determined by the context in which a word is used. The same can be said of images. A photograph revealing a crying face could represent any number of emotions (sorrow, joy, relief, resignation, etc.), but the true meaning may be determined by the visual and textual context surrounding the facial image. Therefore, given the changing nature of words and images, context is essential for valid classification of these symbols. Krippendorff’s (1980) concept of “semantical validity” (p. 157) relates to the notion of relevance in content analysis. He asserted that this involves assessing “the degree to which a method is sensitive to the symbolic meanings that are relevant within a given context” (p. 157). In particular, he argued that high semantical validity is achieved when the “data language corresponds to that of the source, the receiver or any other context” (p. 157). To what extent, then, do content categories have corresponding meanings to people other than researchers? This question is particularly pertinent when a researcher attempts to pursue an analysis focusing on content that is heavy with latent meaning. Manifest content is more easily recognized and counted than latent content: Person A’s name appears in a story, maybe accompanied by a picture; a television show runs X number of commercials; a novel’s average sentence length is X number of words. Analyses that attempt to capture more latent content deal with more holistic or “gestalt” judgments, evaluations, and interpretations of content and context. Studies attempting to analyze content with extensive latent meaning assume that important characteristics of communication may not be captured through sampling, category definition, reliability assessment, or statistical analysis of the collected data. Instead, the proper judgment, evaluation, or interpretation of communication content rests with the researcher. This assumption that the researcher will be able to accomplish these tasks is problematic for several reasons. In particular, it is seldom argued explicitly in analyses of heavily latent content that the researcher’s experience, intuition, knowledge, or whatever renders them competent to make such judgments. A reader must simply believe the meaning of content is illuminated by the discernment of the researcher who brings appropriate context to the communication. In other words, the researcher who analyzes latent content knows what content in a communication “actually is” and what it “actually does” to audiences of that communication. Tankard (2001) found this to be the case with early framing research.

Therefore, a study of latent content must assume that the researcher possesses one or both of two different—and indeed contradictory—qualities that displace explicit assessments of the reliability and validity of content studies. The first is that the researcher is an authoritative interpreter who can intuitively identify and assess meaning embedded in some communication sent to audiences. Thus, the researcher is the source of the study’s reliability and validity of measurement. However, this requires a large leap of faith in researchers. Specifically, a reader must believe that, while human biases in selective exposure, perception, and recall exist in the naive perceivers of messages, the researcher is somehow immune to such biases and can perceive “real” meaning. The second, contrary quality assumed in the analysis of latent content is that the researcher is representative of the audience of a communication. In other words, readers must believe that the researcher is a “random sample” (with an n of 1) who “knows” the content’s effects on audiences because he or she experiences and identifies them. Of course, few would trust the precision or generality of even a well-selected random sample with an n of 1. Yet, readers are expected to believe that a probably very atypical member of the audience—the researcher—can experience the content in precisely the same way as other audience members. Given the probabilistic nature of validity and the fact that reliability is a necessary but, in itself, insufficient condition for validity, a highly reliable protocol used with manifest content has a higher probability of achieving face validity for the variables and relationships than when these conditions are not met. If variables in the protocol have been used in previous studies and judged to have reliability and face validity, there is a higher probability the variables have construct validity. If the relationships that include the content variables are consistent across studies, there is a higher probability of predictive validity. Of course, these problems might also exist in quantitative analyses of manifest content. For example, Austin, Pinkleton, Hust, and Coral-Reaume Miller (2007) found large differences in the frequency with which trained coders and a group of untrained audience members assigned content to categories. This suggests that the untrained audience might have been influenced by latent meaning of the content or that they selected a different manifest meaning than the one emphasized in the protocol. Although content analysis standards are satisfied by appropriate reliability tests, the social validity of a study may be limited if content categories have little or no meaning to broader audiences. Claims made about content analyses of manifest content should thus always be tentative and qualified. That, however, is the nature of the entire scientific enterprise. In other words, quantitative content analysis is necessary, even if sometimes insufficient in itself, for the development of a science of human communication.

Summary

While reliability is a requisite for validity, social scientific research requires that scholars establish that measures accurately capture meaningful social constructs and that those measures connect to plausible inferences that allow us to make meaningful deductions about the social world. Yet, content analysts all too often pay scant attention to—or even completely ignore—the need to explicitly establish the broader validity of their measures beyond reliability. If they are to make significant contributions to social scientific knowledge, they should first engage more deeply with the validity of their measures by addressing the following two questions: "Does the measure identify a meaningful communication concept?" and "Is the operationalization of that measurement appropriate?" Second, content analysts should also make the external validity of their measures explicit in terms of how well they capture meaningful social phenomena in a way that allows us to make meaningful inferences and deductions about social reality. In other words, to establish the appropriateness of content analysis, analysts need to demonstrate that their measures are both reliable and valid.

9

Data Analysis

Like most research methods, content analysis is comparable to detective work. Content analysts examine evidence to solve problems and answer questions, limiting the examinations to relevant evidence. The research design, measurement, and sampling decisions discussed in Chapters 3, 5, and 6 are, in effect, the content analyst's rules for determining relevant evidence and how to collect it, whereas Chapters 7 and 8 offer insights to help ensure the evidence is of optimal quality. Ultimately, data must be reduced and summarized. Patterns within the evidence must be plumbed for meaning. In quantitative content analysis, data analysis involves statistical procedures—tools that summarize data so patterns may be illuminated. The goal here is to help researchers think efficiently and logically about data analysis. The strategy is to illustrate the logical bases of several commonly used analysis techniques and to provide guidance on applying them. Some techniques are basic: descriptive measures, such as means and proportions, along with correlation and tests of statistical significance. Others include analysis of variance (ANOVA) and multivariate techniques. Basic notions of probability are presented to facilitate understanding of how and why particular statistics work. While causal modeling and reciprocity are mentioned, detailed discussion of these techniques and the mathematical basis of statistics are beyond the scope and goal of this chapter.

Analyzing Content Data

Although many disciplines employ content analysis, communication researchers have been among the method's most persistent users. An unpublished examination of data tables and analysis sections of 239 studies in Journalism & Mass Communication Quarterly from 1986 through 1995 indicated that content analysts relied on several basic analysis techniques and a few more advanced ones. That is, a small number of tools were useful for a variety of tasks. Some of these tools are simple: 28% of the 239 studies employed only means, proportions, or frequencies. When other techniques were used, they were often

employed in combination with means and proportions. Statistical techniques included chi-square and Cramer's V (used in 37% of studies) and Pearson's product–moment correlation (15%). Techniques to assess differences between the means or proportions of two samples were used in 17% of studies. More advanced techniques included ANOVA (6%) and multiple regression (8%). Only 7% of studies used more sophisticated statistical techniques. The purpose of this chapter, therefore, is to review these techniques and explain how they relate to particular content analysis goals. In fact, analysis techniques should be considered carefully in terms of a study's goals before any data are collected. Decisions on data collection, measurement, and analysis are inextricably linked to one another, to the study's overall research design, and to research questions or hypotheses the study addresses.

Fundamentals of Analyzing Data

Thinking about Data Analysis

The goal of a particular data analysis may be relatively simple: to describe certain characteristics of a sample or population. For example, researchers may examine the frequency of occurrence of some particular characteristic to assess what is typical. By contrast, the goal may be to go beyond mere description to illuminate relationships among characteristics in some sample or population. To describe relationships, researchers would focus on patterns of association between and among characteristics. Familiarity with relevant previous research, knowledge of theory and theoretical concepts, and well-focused questions facilitate data collection and are also crucial for good data analysis. Previous research and supported theory provide guidance on what variables to examine and how to measure them. Research and theory also provide direction for the formulation of hypotheses or research questions that lend focus to data collection and analysis. Moreover, effective replication of studies and building of a coherent body of research and theory may require the use of identical measures and data analysis techniques for maximum comparability across studies. Hypotheses and Research Questions

Quantitative content analysis is much more efficient when explicit hypotheses or research questions are posed than when a researcher collects data without either. A hypothesis is an explicit statement predicting that a state of one variable is associated with a state in another variable. A research question is more tentative, merely asking if such an association exists. Hypotheses or research questions permit research designs to focus on collecting only relevant data. Furthermore, an explicit hypothesis or research question

permits the researcher to visualize an analysis that addresses the hypothesis or question. Researchers can even prepare dummy tables as part of this visualization. An inability to visualize what the completed analysis tables “should look like,” given the hypotheses or questions, may signal problems in conceptualizing the study or the collection and measurement of data. If a hypothesis predicts, for example, that anonymous comments online are more likely than signed comments to include negative language, the simplest approach is to measure the type of comment (anonymous or signed) and the comment’s valence (positive, negative, or neutral). A test of this hypothesis can be visualized as involving nominal-level or categoric data (as discussed in Chapter 5). The hypothesis would be supported if the proportion of anonymous comments with negative language is greater than the proportion of signed comments with negative language. Now assume that the researcher is interested in the degree of negative language in anonymous comments, rather than simply its presence or absence. A more refined and detailed level of measurement, at the interval level, would be needed. For instance, coders could count the number of negative, positive, and neutral statements in each comment. Averages (means) for each type of valence can be calculated for both anonymous and signed comments. In this revised example, the hypothesis would be supported if the mean number of negative comments were higher for anonymous comments than for signed comments. Although a researcher’s specification of a hypothesis or research question affects the nature of data analysis, that analysis is also affected by whether the researcher plans to make inferences from the study content to a larger population of content. If all data from a population have been collected (e.g., all of an author’s poems, or all of a celebrity’s tweets for a year), then that question is moot because the sample is the population. If only a small part of the known content is studied, the way the data have been selected determines whether inferences about the parent population can be made. As discussed in Chapter 6, probability sampling enables the researcher to make valid inferences to some population of interest. Only this allows researchers to calculate sampling error—a measure of how much the sample and population may differ at a certain level of confidence. Describing and Summarizing Findings The type of analysis a researcher chooses to use depends on the goals of the research, the level at which variables have been measured, and whether data have been randomly sampled from some population. The analysis techniques we discuss in this chapter range from relatively simple ways to describe data to more complex formulas that illuminate relationships among variables in the data. Many others are also available.
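To make the two analysis strategies concrete, the sketch below (in Python, with invented data) computes the nominal-level comparison as proportions of comments containing negative language and the interval-level comparison as mean counts of negative statements; the variable names and values are hypothetical.

    # Hypothetical coded data: one record per comment.
    # "negative" flags any negative language (nominal-level coding);
    # "neg_count" is the number of negative statements (interval-level coding).
    anonymous = [
        {"negative": 1, "neg_count": 3},
        {"negative": 1, "neg_count": 2},
        {"negative": 0, "neg_count": 0},
    ]
    signed = [
        {"negative": 0, "neg_count": 0},
        {"negative": 1, "neg_count": 1},
        {"negative": 0, "neg_count": 0},
    ]

    def prop_negative(comments):
        # Proportion of comments containing any negative language.
        return sum(c["negative"] for c in comments) / len(comments)

    def mean_negative(comments):
        # Mean number of negative statements per comment.
        return sum(c["neg_count"] for c in comments) / len(comments)

    print(prop_negative(anonymous), prop_negative(signed))  # about 0.67 vs. 0.33
    print(mean_negative(anonymous), mean_negative(signed))  # about 1.67 vs. 0.33

The hypothesis would be supported in the first comparison if the anonymous proportion exceeds the signed proportion, and in the second if the anonymous mean exceeds the signed mean, subject of course to the significance testing discussed later in this chapter.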

Describing Data

Numbers are at the heart of the coding process. Unsurprisingly, counting is at the heart of the analysis. What may be surprising, however, is how often very basic arithmetic, such as calculating a mean or proportion, suffices to clarify what is found. Counting

Once data have been collected using the appropriate level of measurement, one simple summarizing technique is to display results in terms of the frequencies with which the values of a variable occurred. Of course, the content analysis coding scheme provides the basic guidance for such a display. For instance, in a study of 200 streaming programs, the data on the number of Latino/Latina characters can be described in terms of the raw numbers (e.g., 50 programs have Latino/Latina characters and 150 do not), or by counting the number of Latino/Latina characters (e.g., the 50 programs have a total of 110 Latino/Latina characters). Sometimes an important element of counting is to see how content elements group together on the basis of similar characteristics. A procedure called cluster analysis can be used for identifying such groups (Discovering Statistics, 2017). Chang, Chang, and Tseng (2010) used automated textual analysis to identify characteristics (author, title, publication year, citations, etc.) of journal articles about science education and then applied various forms of cluster analysis. They identified nine main topics in four journals studied from 1990 to 2007. These topics were then used to identify relationships among the topics and trends across the 17 years. Displaying data in these ways may not be illuminating, however, because raw numbers and even clustering may not provide a meaningful reference point. Thus, summarizing tools, such as proportions or means, are used if an appropriate level of measurement was obtained. Means and Proportions

A mean is simply the arithmetic average of a number of scores measured at the interval or ratio level. It is a sensitive measure because it is influenced by and reflects each individual score. A mean provides a reference point for what is most common or typical in a set of scores and for what is not typical. In the previous example, if the mean number of Latino/Latina characters is 1, one can expect that many programs in the sample have a single Latino/Latina character, although one also expects variability. Furthermore, the mean has the advantage of being stable across samples. If several samples were taken from a population, their means would vary less than other measures of central tendency, such as the median (the value at the midpoint of the distribution of cases).
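A minimal sketch of these summaries in Python, again with invented numbers of Latino/Latina characters coded in ten hypothetical programs:

    from collections import Counter
    from statistics import mean, median

    # Hypothetical number of Latino/Latina characters in each of ten programs.
    characters = [0, 0, 1, 0, 2, 1, 0, 3, 1, 2]

    print(Counter(characters))  # frequency of each value: Counter({0: 4, 1: 3, 2: 2, 3: 1})
    print(mean(characters))     # 1, i.e., one character per program on average
    print(median(characters))   # 1.0, the midpoint of the distribution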

A proportion can be used with variables measured at the nominal as well as the interval or ratio level of measurement. The proportion reflects the degree to which a particular category dominates the sample or population, and it has an implicit reference point for discerning the meaning of findings. If 55 of 100 movies have graphic violence, that is 55%. Because the reference point is 100%, the importance of that 55% is easily grasped, and comparisons are possible across samples (e.g., 55% of 1980s movies versus 60% of 1990s movies). To illustrate, consider a study of coverage of county governments using a national sample of daily and weekly newspapers (Fico et al., 2013a). The authors calculated the mean number of unique sources quoted in articles in the two types of newspapers. Dailies averaged 2.77 sources in county government stories while weeklies averaged 1.9. Also, 14.2% of the dailies’ stories quoted ordinary citizens, compared to 7.9% of the weeklies’ stories. A question necessarily occurs about what to do when variables have been measured at the ordinal level (e.g., coders have tried to assign favorability rankings such as “most favorable,” “somewhat favorable,” “neutral,” “somewhat unfavorable,” and “most unfavorable”), with numbers used to indicate the assigned rankings. Although researchers may employ these numbers as indicators of differences among units, in much the same way as interval or ratio scales do, an ordinal scale does not meet the mathematical assumptions of these two higher levels. Intervals between numbers in ordinal scales have no consistent meaning, in terms of the concept being measured. With interval and ratio scales, 4 is twice the value of 2, and 8 is twice the value of 4, etc. By contrast, one does not really know how much more favorable a favorability ranking of 3 is compared to a favorability ranking of 2, so an average favorability of 2.4 is meaningless. Thus, the safe solution for analyzing data measured at the ordinal level is to report proportions for each separate value that makes up the scale. The Significance of Proportions and Means

Data from samples can be easily described using the basic tools described above. However, if data come from a probability sample, the aim is not just to describe the sample, but also to describe the population from which the data were drawn. Generalizing Sample Measures

Calculating sampling error (see Chapter 6) permits one to make inferences from a probability sample to a population. Recall that sampling error varies with sample size and desired level of confidence (almost always 95% or 99% for social science purposes) for conclusions drawn from the analysis. Consider a content analysis of a random sample of 400 primetime (8–11 p.m.) broadcast network television shows drawn from a population of such shows. The proportion of sample shows with African American characters is 25%. Is this the

proportion of such shows in the population? Might that population proportion be 30% or 20%? Sampling error allows a researcher to answer these questions at the desired confidence level. The most common ways of finding sampling error for a sample of a given size are error tables, which are available online and frequently in statistics books, and output from data analysis computer programs. For a sample size of 400 at the 95% level of confidence, sampling error for the proportion works out to nearly 5%. Therefore, in the population of relevant primetime shows, the proportion with African American characters could be as low as 20% or as high as 30%. The interval is smaller with larger samples.

The Significance of Differences

Describing findings from a random sample may be interesting, but a research problem frequently focuses on exploring differences in some characteristic in two or more such samples. In fact, hypotheses often emphasize such differences: for example, “Facebook posts are more likely than tweets to have video links.” The analysis frequently goes beyond simply describing if two (or more) sample means or proportions are different, because an observed difference begs the question of why it occurs. However, when random sampling has been used to obtain samples, the first possible answer to consider is that the difference does not really exist in the population but rather is an artifact of sampling error. Testing the statistical significance of differences in means or proportions addresses whether the observed differences among samples could be explained in this way. In probability terms, testing the significance of differences assesses the chance that an observed difference in means or proportions of two samples represents a real difference in the population, as opposed to a difference due to sampling error. In effect, the two samples represent two separate populations. The aforementioned study about numbers of sources quoted in county government articles by daily and weekly newspapers (Fico et al., 2013a) reported the dailies’ mean as 2.77 sources and the weeklies’ mean as 1.9. Because this was a randomly selected sample, the authors could describe the difference as statistically significant (probably not due to sampling error) at the p < .001 level. In other words, the difference likely existed in the populations of US dailies and weeklies. The chance was only one in a thousand that the observed difference was a misleading sampling “fluke.” Testing statistical significance is called statistical inference. Two-Sample Differences and the Null Hypothesis

The starting assumption of statistical inference is that the null hypothesis is true—there really is no difference in the population between apparently different samples. The question comes down to determining whether the samples

belong to one common population or really represent two distinct populations, as defined by the variable of interest. Probability samples reflect the population from which they are drawn, but not perfectly. For example, if the mean number of sexual references computed for a census of reality TV programs was subtracted from the mean number for a census of other primetime TV programs, any difference would be real between the two program types. Using random samples of programs, however, could reveal differences due to sampling variation. Tests of difference of means or proportions calculate how likely it is that the difference between two groups found using probability sampling could have occurred by chance (i.e., sampling error). If the sample difference is so large that it is highly unlikely under the assumption of no real population difference, then the null hypothesis is rejected in favor of the hypothesis that the two groups do represent two separate populations. The null hypothesis is rejected at a certain level of probability (usually 95%). In this case, there remains a slim (5%) chance of no difference in the population, but a 95% chance that the difference is in the population of programs. The statistical measures used to test differences of proportions and means are called z- and t-statistics. A z-statistic can be used to test difference of means and proportions, whereas a t-statistic is only a difference of means test. For both statistics, a sampling distribution has been computed that indicates how likely it is that the obtained sample statistic (z or t) could differ from 0 (zero) if the two samples were actually from the same population. The sampling distribution for the t-statistic was developed for small (under 30) samples, while the sampling distributions for the z- and t-statistics become approximately equal with sample sizes larger than 30. Computer analysis programs can easily compute the statistics and the probability that their magnitude could occur by chance. Below, we present the formulas by which the two statistics are calculated. Difference of proportions test is:

z = (P1 − P2) / √[P1(1 − P1)/n1 + P2(1 − P2)/n2]

in which
P1 = the proportion of the first sample
n1 = the sample size of the first sample
P2 = the proportion of the second sample
n2 = the sample size of the second sample

The numerator is the difference between the proportions being compared, and the denominator is the estimate for the standard error of the difference in the proportions.
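A sketch of this test in Python follows; it uses the unpooled standard error shown above, and the proportions and sample sizes (movies with graphic violence in two hypothetical samples) are invented for illustration.

    from math import sqrt
    from scipy.stats import norm

    def z_test_proportions(p1, n1, p2, n2):
        # z = (P1 - P2) / sqrt(P1(1 - P1)/n1 + P2(1 - P2)/n2)
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        z = (p1 - p2) / se
        p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed probability
        return z, p_value

    # Hypothetical example: 55% of 400 sampled 1980s movies vs. 60% of 400 1990s movies.
    print(z_test_proportions(0.55, 400, 0.60, 400))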

Difference of means test is:

t = (X̄1 − X̄2) / √(s1²/n1 + s2²/n2)

in which X̄1 and X̄2 are the sample means, s1² and s2² are the sample variances, and n1 and n2 are the sample sizes of the two groups.

The numerator is the difference between the means being compared, and the denominator is the estimate of the standard error of the difference between sample means. This is a basic t-statistic formula. Separate formulas are used when the two means have equal variance and when the two means have unequal variances. As with all statistics, researchers should be aware of assumptions to be met for particular tests, check their data, and adjust the statistics as necessary. This can be done using standard statistical programs such as SPSS. The result of the computation is a z or t value that is compared to probabilities in a table to determine the likelihood that the obtained difference is due to sampling error or is a real population difference. The critical values in the tables come from the sampling distributions for the z- or t-statistics. A low probability (.05 or less) indicates that the sample means are so different that they very likely reflect a real population difference between the two. This is the inverse of saying one’s confidence in the decision to reject the null hypothesis is 95%. Differences in Many Samples

A somewhat different approach is needed when the researcher is comparing the differences among three or more groups. As in the two-sample problem, the researcher wants to know if these samples all come from the same population. For example, the use of the term abortion in four Republican platforms in the last four presidential elections could be compared to see if this issue increased in importance during this period. What is needed is a single, simultaneous test for the differences among the means in groups. Why a single test and not simply a number of tests contrasting two pairs of means or proportions at a time? If a great many two-pair comparisons are made, some will turn up false differences due to random sampling alone. Recall that the 95% level of confidence is used to reject the null hypothesis. This means that, although true population differences are most likely to show up in samples about 95% of the time, in 5% of the samples an apparently insignificant difference will be obtained that does not truly represent any real difference in a population. Therefore, as the number of comparisons increases, it is ever more likely that at least one comparison will produce a false finding. Equally important, it is impossible to know which one is false. One possible way around this problem is to run a series of two-mean tests but with a more rigorous level of confidence required (e.g., 99% or 99.9%.)
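The build-up of false positives is easy to see with a short calculation (the comparison counts here are arbitrary, and the comparisons are assumed to be independent):

    # Probability of at least one false "significant" difference across k
    # independent comparisons, each tested at the .05 level.
    for k in (1, 3, 10):
        print(k, round(1 - 0.95 ** k, 3))
    # 1 -> 0.05, 3 -> 0.143, 10 -> 0.401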

However, a better solution is to run a single test that simultaneously compares mean differences, called analysis of variance (ANOVA). Unlike difference of proportions and difference of means tests, ANOVA uses not only the mean but also the variance in a sample. The variance is the standard deviation squared. (The standard deviation is a measure of how individual members of some group differ from the group mean.) ANOVA tests whether variability between the groups being compared is greater than the variability within each group. Obviously, variability within each group is to be expected. If all the groups really come from one population, then the variability between the groups will be low compared to the variability within the groups. Therefore, ANOVA computes an F-ratio that takes a summary measure of between-group variability and divides it by a summary measure of within-group variability:

F = between-group variability / within-group variability

As in the case of a difference in means or a difference in proportions test, the null hypothesis predicts no difference (i.e., all the groups come from the same population, and any difference is merely random variation). The empirically obtained F-ratio from the groups can then be assessed to determine whether the null hypothesis should be rejected. The larger the obtained F, the bigger the differences between groups. A computer analysis program will display a numeric value for the calculated F along with a probability estimate that such a difference could have occurred by chance under the null hypothesis of no difference in the population. The smaller that probability estimate, the more likely it is that the groups really do come from different populations. For example, Shen and Bissell (2013) content analyzed the Facebook pages of six large cosmetics companies to explore how they used social media to engage with customers and build brand loyalty. Using ANOVA with F tests to evaluate differences among the six pages, they found that the companies rarely used two-way dialogue with customers on the pages. Table 9.1 summarizes the various descriptive measures that are used with nominal, ordinal, interval, and ratio data.

Table 9.1 Common data descriptive techniques in content analysis

Level of Measure    Summary Measure                  Significance Test (if Needed)
Nominal             Frequency                        —
                    Proportion                       Sample error
                    Difference of proportion         z-test
Ordinal             Frequency                        —
                    Proportion                       Sample error
                    Difference of proportion         z-test
Interval            Frequency                        —
                    Mean and standard deviation      Sample error
                    Difference in means              z-test, t-test
                    ANOVA                            F-test
Ratio               Frequency                        —
                    Mean and standard deviation      Sample error
                    Difference in means              z-test, t-test
                    ANOVA                            F-test
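A sketch of a one-way ANOVA in Python using scipy's f_oneway function; the per-genre counts of sexual references are invented for illustration.

    from scipy.stats import f_oneway

    # Hypothetical counts of sexual references per episode in three program genres.
    reality = [4, 6, 5, 7, 3]
    drama = [2, 3, 1, 2, 4]
    comedy = [3, 2, 4, 3, 2]

    f_ratio, p_value = f_oneway(reality, drama, comedy)
    print(f_ratio, p_value)  # a large F and a small p suggest real between-group differences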

Finding Relationships

Summary measures describing data and, where needed, their statistical significance are obviously important. However, as suggested in Chapters 3 and 5, measures describing relationships are key to the development of social science. Specifically, such measures are useful and necessary when social scientists generate hypotheses about the relationship between two (or more) things. Such hypotheses are frequently stated in terms of "The more of one, the more (or less) of the other." For example, "The more videos a website carries, the more unique visitors the site will attract during a month." Note that this hypothesis requires a higher level of measurement (ratio level).

The Idea of Relationships

Identifying how two variables covary, or correlate, is one of the key steps in identifying causal relationships, as noted in Chapter 3. The assumption is that such covariance is causally produced by an association—that it is systematic and therefore recurring and predictable. The null hypothesis is that the variables are not related at all, and that any observed association is random or reflects the influence of some other unknown force acting on the variables of interest. Put differently, if the observed association is purely random, what is observed on one occasion may be completely different from what is observed on another occasion. To restate a point made in Chapter 3, covariation means the presence or absence of one thing is observably associated with the presence or absence of another thing. Covariation can also be thought of as the way in which the increase or decrease in one thing is accompanied by the increase or decrease in another thing. These connections are straightforward and, in fact, relate to many experiences in the daily lives of most people. However, although the notion of relationship is intuitively simple, it becomes rather more complicated when one wants to know the relative strength or degree of the relationship being observed.

First, what is meant by a strong or weak relationship? Second, what does a relationship that is somewhere in the middle look like? And, third, on what basis, if any, is there confidence that a relationship of some type and strength exists? That is, how confident can one be in one’s assumed knowledge about a particular relationship? Relationship Strength

Some observed relationships are clearly stronger than others. Think about the strength of a relationship in terms of degree of confidence. If, for instance, one had to bet on a prediction about a relationship, what knowledge about the relationship would maximize one’s chance of winning? Betting confidence should come from past systematic observations (a social science approach) rather than subjectivity (e.g., “I feel lucky today”). Note that the question asked how one might “maximize one’s chances of winning” rather than “ensure winning” because social science deals with probability, not certainty. For example, does the gender of a reporter predict the writing of stories about women’s issues? If the traditional concept of professional “news values” strictly guides reporters’ selection of stories, then gender would be inconsequential to story selection: men and women reporters would write equally often about women’s issues. If the prediction were that women were more likely than men to write about women’s issues, then gender should be systematically linked to story topic. Under the strongest possible relationship, all women write only about women’s issues, and no men do. In that case, knowing the gender of the reporter would enable the researcher to predict the topic of each reporter’s stories perfectly: 100% of bets would be won by predicting that every story written by a woman dealt with a topic of interest to women, while every story by a man would deal with some other kind of topic. Of course, such perfect relationships seldom exist. For example, if women reporters write 70% of their stories on women’s issues and men write 10% of their stories on women’s issues, one could better predict the likelihood of either gender producing stories about women’s issues than if these percentages were unknown. However, the prediction would not be correct 100% of the time. Past data can be useful in predicting future behaviors, but the degree to which the prediction would be correct can vary from never to 100% of the time. What is needed is a number or statistic that neatly summarizes the strength observed in relationships. Several measures of association do exactly this. They are employed depending on the level of measurement of the variables in the relationships being explored. Techniques for Finding Relationships

The measures of association described below do something similar to the previous example. Based on data from a population or a sample, a mathematical pattern of

the association, if any, is calculated. The measures of association discussed below set a perfect relationship at 1 and a non-relationship at 0 (zero). A statistic closer to 1 thus describes a stronger relationship than a statistic closer to 0. There is an additional problem (analogous to the problem that arises when generalizing from a sample mean or proportion to a population mean or proportion) if data used to generate the statistic measuring strength of association have been drawn from a probability sample. A statistical measure of association could merely be an artifact of sampling error; that is, the association may not actually exist in the population from which the sample was drawn. Procedures of statistical inference exist to permit researchers to judge when a relationship in randomly sampled data most likely reflects a real relationship in the population.

Chi-Square and Cramer's V

Chi-square indicates the statistical significance of the relationship between two variables measured at the nominal level. Cramer’s V is one of a family of measures that index the strength of that relationship. V alone suffices when all population data have been used to generate the statistic, but both measures are needed when data have been randomly sampled from some population of interest. Put another way, chi-square answers the key question of the likelihood that the observed relationship exists in the population, while V assesses the strength of the relationship in the population. The chi-square test of statistical significance assumes that the randomly sampled data accurately describe, within sampling error, the population proportions of cases falling into the categorical values of variables being tested. For example, a random sample of 400 television dramas might be categorized into two values: “contains physical violence” and “no physical violence.” The same shows might also be categorized into two values of a sexuality variable: “contains sexual depictions” and “no sexual depictions.” Four possible combinations of the variables could be visualized in terms of a dummy 2 × 2 table: violence with sexual depictions; violence without sexual depictions; no violence but with sexual depictions; and no violence and no sexual depictions. A hypothesis linking the two variables might be that violent and sexual content are more likely to be present in shows together. If sample data seem to confirm this, how does chi-square put to rest the question of whether this may be a statistical artifact? Chi-square starts with a null hypothesis assumption of no association between variables in the population: any sample finding to the contrary is the result of sampling error. In the above example, what might a lack of association between violence and sexuality look like? As the chi-square formula presented below illustrates clearly, chi-square is based on a null pattern reflecting the proportions of the values of each of the two variables being tested. Assume, for example, that 70% of all programs lack violence and 30%

have violent depictions. Furthermore, suppose that half of all programs have some form of sexual content. If knowing the violence content of a show was of no help in predicting its sexual content (i.e., if sexual and violent content are unrelated), then sexual content should be included in about half of both the violent and the nonviolent programs. By contrast, if the two types of content are associated, one would expect a greater concentration of sex in programs that also have violence. For each cell in the table linking the two variables (violence, sex; violence, no sex; no violence, sex; no violence, no sex), chi-square calculates the theoretical expected proportions based on this null relationship. The empirically observed data are then compared cell-by-cell with the proportions expected under the null relationship. The squared difference between the observed and expected values in each cell, relative to the expected value, goes into the computation of the chi-square statistic. Therefore, the chi-square is large when differences between empirical and theoretical cell frequencies are large, and small when obtained data more closely resemble the pattern expected with the null relationship. If the empirically obtained relationship is identical to the hypothetical null relationship, chi-square is 0. This chi-square statistic has known values that permit a researcher to reject the null hypothesis at the standard 95% and 99% levels of probability. The computational work in computing a chi-square is still simple enough to do by hand (although tedious if the number of cells is large). Once again, though, statistical computer programs produce chi-square readily. The formula for hand computation is:

χ² = Σ (fo − fe)² / fe

in which
fo = the observed frequency for a cell
fe = the frequency expected for a cell under the null hypothesis

It is important to know that a relationship is real (statistically significant) in the population from which the data have been obtained. Cramer's V can indicate precisely how important the relationship is, with values ranging from 0 to a perfect 1. Based literally on the computed chi-square value, V takes into account the number of cases in the sample and the number of values of the categorical variables being tested. V and chi-square make it possible to distinguish between small but nonetheless real associations between two variables in a population and associations that are both real and relatively more important. Statistical significance alone is not a sufficiently discerning measure because a large enough sample will by itself "sweep up" small though real relationships. V therefore permits an assessment of the actual importance of the relationship in the population of interest. A statistically significant relationship that is small in the population of interest will produce a small V. A significant relationship that is large in the population will produce a large V, with 1 indicating a perfect relationship. However, V tends to take low values because a V close to 1 would require extreme distributions.
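A sketch of the same computation in Python with scipy's chi2_contingency function; the 2 × 2 counts are invented, and Cramer's V is derived from the result using the hand formula given just below.

    from math import sqrt
    from scipy.stats import chi2_contingency

    # Hypothetical 2 x 2 table from 400 randomly sampled dramas:
    # rows = violence present/absent, columns = sexual depictions present/absent.
    observed = [[80, 40],
                [120, 160]]

    # correction=False matches the basic hand formula (no Yates' correction).
    chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

    n = sum(sum(row) for row in observed)
    min_dim = min(len(observed) - 1, len(observed[0]) - 1)
    cramers_v = sqrt(chi2 / (n * min_dim))
    print(chi2, p_value, cramers_v)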

Cramer’s V is often produced by computer analysis programs, but it may be easily calculated by hand once chi-square has been calculated:

V = √[χ² / (n × min(r − 1, c − 1))]

in which
χ² = the calculated chi-square for the table
n = the sample size
min(r − 1, c − 1) = the lesser of the number of rows minus 1 and the number of columns minus 1

Hum et al. (2011) used chi-square to examine Facebook profile photographs for users between the ages of 18 and 23. Users averaged about 20 photos per album and very few were inappropriate. Chi-square showed no statistical difference on the basis of gender. Gajda and Wolowicz (2022) used Cramer's V to explore representation of women in 75 Polish school texts. Regardless of the subject matter, men appeared more often in the content, which is consistent with studies from other European countries.

Higher-Level Correlation

Correlation techniques are also available for levels of measurement that are higher than nominal. The most common of these are two rank-order correlations: Spearman’s rho and Kendall’s tau (Blalock, 1979). Both can be used with ordinal-level data to determine how similarly two variables share common rankings. In general, they tend to produce the same inference. However, we focus on tau because: (1) rho is more sensitive to error and discrepancies in the data; (2) rho has larger asymptotic variance when data are normally distributed; and (3) tau allows calculation of a partial correlation that removes the influence of a third variable (Statistical Odds & Ends, 2019). Tau can be calculated either without considering ties in the ranking of two items (tau-a) or when considering ties (tau-b). However, tau-b will work both with and without ties in the data. The formula for tau-b is:

tau-b = S / √{[½n(n − 1) − T][½n(n − 1) − U]}

in which
S = the number of concordant ranks minus the number of discordant ranks
T = ties of ranks in one variable
U = ties of ranks in the second variable
n = sample size
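In practice the computation is usually left to software. The sketch below uses scipy's kendalltau function (which computes tau-b, handling ties) on hypothetical topic-emphasis percentages for two news sites; Pearson's r, discussed below, is obtained the same way with scipy.stats.pearsonr.

    from scipy.stats import kendalltau

    # Hypothetical percentages of coverage devoted to six topics by two news sites.
    site_a = [30, 22, 15, 12, 11, 10]
    site_b = [25, 27, 10, 14, 12, 12]  # note the tie (12, 12)

    tau, p_value = kendalltau(site_a, site_b)
    print(tau, p_value)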

Calculating tau-b involves comparing each rank with every other rank. The comparison is concordant if a second rank is consistent with the initial overall rankings. It is disconcordant if a second rank is inconsistent with the initial rankings. For example, a comparative study might rank the emphasis that two news sites give to an array of topics (crime, schools, etc.). Using raw frequency of stories might be misleading if one site routinely has more total articles, but converting topic frequencies to percentages of the total makes the sites comparable. Ranks can then be assigned to each site’s percentages to reflect topic emphasis; rank-order correlation would show the sites’ similarity. For example, Riffe, Kim, and Sobel (2018) used tau-b to correlate data on the shrinking annual international news hole of the New York Times over a 50-year period with the amount of news coverage generated annually by the newspaper’s own correspondents, annual amounts of “borrowed” news taken from other news organizations, and changing annual levels of press freedom in countries where events occurred. The researchers found a decline in the Times’ amount of international news coverage but an increase in borrowed news—that is, the two trends were negatively correlated. Pearson’s product–moment correlation is employed with data measured at the interval and ratio levels. Unlike in the example of news sites’ topic emphases, where the sites’ total frequencies for each topic are used to create ranks comparable with rho or tau, Pearson’s correlation employs the original measurement scales of individual variables. Because more information is provided by interval and ratio scales (e.g., the mean and variance), the latter correlation provides a more sensitive measure of any degree of association, and it is considered more powerful because it can reveal a significant association when Kendall’s tau analysis of the same data would not. The formula for the Pearson product–moment correlation is:

r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]

in which
X = each case of the X variable
X̄ = the mean of the X variable
Y = each case of the Y variable
Ȳ = the mean of the Y variable

It is worth mentioning, however, that Pearson's correlation makes an important assumption about what it measures—specifically, that any covariation is linear—which means the increase or decrease in one variable is uniform across the

values of the other variable. A curvilinear relationship would exist, for example, if one variable increased across part of the range of the other variable, then decreased across some further part of the range of that variable, then increased again. A relationship would certainly exist, but not a linear one, and not one that could be well summarized by the Pearson correlation. An easy way to envision a curvilinear relationship is to think about the relationship of coding time and reliability in content analysis. As a coder becomes more practiced in using a content analysis system during a coding session, reliability should increase; the relationship is steady and linear. However, after a time, fatigue can increase, and reliability may curve or “tail off.” Inspection of a scatter diagram, as shown in Figure 9.1, is frequently recommended if a researcher suspects that a linear relationship between two variables does not exist. In such a diagram, each case relating the values of the

Figure 9.1 Scatter diagrams of correlations

two variables is plotted on a graph. If a linear relationship exists, dots representing the joint values will be tightly clustered and uniformly increasing or decreasing. Both Kendall’s and Pearson’s correlation measures provide summary numbers for the strength of association between two variables. Both can range from a perfect −1 (negative) correlation to a perfect +1 (positive) correlation. With perfect negative correlation, for example, every instance in which one variable is high would find the second variable correspondingly low. Because both variables are measured on ordinal, interval, or ratio scales, the correlation measures are more sensitive to small differences in the variables than would be the case for Cramer’s V, which uses nominal-scale variables. Kendall’s rank-order correlation and Pearson’s product–moment correlation are thus more powerful tests than those available for nominal-level data, and they are more likely to find a relationship that actually exists in the population of interest whereas Cramer’s V measure of association might not. Perfect relationships are rare in the world, of course, and a data set will have a number of inconsistencies that depress the size of the correlations. Statistics textbooks usually consider correlations of .7 or above to be strong, correlations of between .4 and .7 to be moderate, and correlations of between .2 and .4 to be weak. Kendall’s and Pearson’s correlations also share the important ability to calculate partial correlation coefficients that remove (control for) the influence of variables other than the two being considered in the initial correlation. For example, if a study found a correlation between number of posts on a Facebook page and number of comments left on the page, a researcher could control for the percentage of posts that include video. Such a partial correlation would tell the researcher to what degree the number of comments was related to total posts after the influence of the proportion with video is removed. Another use of Pearson’s r is the calculation of R-squared, which helps a researcher assess more precisely the importance of one variable’s influence on another. R-square measures the proportion of variance shared by two variables. Thus, an r between variables of .7 produces an R-square of .49, meaning that just under half (.50 or 50%) of one variable’s variance is linearly related to another variable’s variance. This is why r must be relatively large to be meaningfully related to another variable. Correlation and Significance Testing

Correlation and Significance Testing

As with all statistics, correlations from randomly sampled data require tests of significance for valid generalization to the populations of content. The null hypothesis, in this case, is that the true correlation equals 0 (zero). As in the case of chi-square and Cramer's V, correlation coefficients also have well-known mathematical properties.

Therefore, the question about a correlation found in a random sample is whether it is large enough, given the sample size, to rule out the possibility that it is due to sampling error. The answer is provided by an F-test of statistical significance. The larger the F, the greater the chance that the obtained correlation reflects a real correlation in the population. The computational process producing the F is also accompanied by a probability value that gives the probability that the relationship in the data was produced by chance. It is also possible to put a confidence interval around a Pearson's correlation. With such an interval, the researcher can argue (at the 95% or higher confidence level) that the true population correlation lies somewhere within the range formed by the coefficient plus or minus the interval.
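The significance test and confidence interval described here are produced automatically by most statistical packages. The sketch below (again with invented data) shows the p-value reported by scipy for Pearson's r and a 95% confidence interval computed with the standard Fisher z transformation.

```python
# Illustrative sketch: significance and a 95% confidence interval for Pearson's r.
import numpy as np
from scipy import stats

x = np.array([3, 7, 5, 9, 12, 4, 8, 10, 6, 11])
y = np.array([14, 30, 22, 35, 48, 18, 33, 40, 25, 44])

r, p_value = stats.pearsonr(x, y)          # p-value tests H0: true correlation = 0
n = len(x)

# Fisher z transformation: arctanh(r) is approximately normal with SE = 1/sqrt(n - 3)
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)             # about 1.96 for a 95% interval
lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

print(f"r = {r:.2f}, p = {p_value:.4f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```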

Causal Modeling

Finding relationships through measures of association is important. However, life is usually more complicated than two things varying together in isolation. For example, while it may be interesting and important to find a relationship between gender and social media behavior, gender alone is unlikely to explain everything about the content people post on Facebook. In fact, gender may be a relatively small component of the total array of factors that influence the messages people share on any social media platform. Furthermore, these factors may not influence the variable of interest directly. Factor A, for example, may influence factor B, which then influences factor Y, which is what one really wants to know. More galling still, factor D, thought initially to be important in influencing factor Y, may not be influential at all. Rather, factors A and B may be influencing factor D, which then merely appears to influence factor Y (this is called a spurious correlation).

Researchers need a means of comprehending how all these factors influence each other and, ultimately, some variable of interest. How much does each factor influence that variable of interest, correcting for any mutual relationships? The whole package of factors or variables that researchers believe might directly or indirectly influence the variation of some variable of interest can be assembled and tested in a causal model. Knowing what to include in the model (and what to leave out) is guided by theory, previous research, and logic, as is predicting which variables influence which other variables in the model, whether that influence is positive or negative, and the relative magnitude of those influences. Such a model often permits researchers to grasp a bigger, more complex piece of reality in a conceptually neat package that is relatively easy to comprehend. Furthermore, each assumed influence in the model provides guidance for the whole community of researchers working on similar or related problems. Such models can be tested in a variety of ways, including path analysis and structural equation modeling. However, the model always comes first; seeing how well it fits data comes later.


Figure 9.2 Hypothesized model showing relationships between economic, newsroom, and content variables

One of the easiest ways to think about a model of multiple causal influences is to draw a picture of it, as illustrated by Figure 9.2, which was used to predict fair and balanced reporting as an outcome of economic and newsroom factors (Lacy, Fico, & Simon, 1989). Note first that each variable is named. Variables causally antecedent to others are on the left, and the "ultimate" dependent variable is on the extreme right, with arrows indicating the assumed causal flows from one variable to the next. The plus and minus signs indicate the expected positive or negative relationships. The six arrows with signs are the graphic representation of hypotheses presented explicitly in the study. Arrows that lack such signs indicate research questions or simply a lack of knowledge about what to expect.

For this model, the causal relationship flows in one direction. However, models can involve variables influencing each other. Mutual influence between variables can occur in two ways. First, the influence between two variables occurs either simultaneously or so quickly that a time lag cannot be measured. Second, the influence between two variables is cyclical, with a lag that can be measured. Models can be drawn that incorporate these reciprocal relationships.

As noted above, such models may guide future research and may change with more and higher-quality research. This change occurs both theoretically and empirically. First, the model grows as variables outside the original model are brought into it. For example, new variables causally prior to all the other variables in the model may be added. In addition, new variables that intervene between two already in the model may be added.

Such models also change as they undergo empirical testing. Specifically, each arrow linking two variables in the model can be tested against data to determine the validity of that particular relationship. Furthermore, the whole model can be tested to determine the validity of all its separate parts and its overall usefulness as a model describing social reality. Those interested in testing causal models should consult an advanced statistics text, such as Tabachnick and Fidell (2013).
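As a purely illustrative sketch of how each arrow in such a model can be tested against data, the code below estimates a simple hypothetical three-variable path model (A to B to Y, plus a direct A-to-Y path) with a pair of ordinary least squares regressions, a technique introduced more formally in the next section. The variables and data are simulated, not drawn from Lacy, Fico, and Simon (1989).

```python
# Illustrative sketch: testing a simple path model A -> B -> Y (with a direct A -> Y path).
# All data and variable names are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
A = rng.normal(size=n)                                   # e.g., market competition
B = 0.6 * A + rng.normal(scale=0.8, size=n)              # e.g., newsroom resources
Y = 0.5 * B + 0.2 * A + rng.normal(scale=0.7, size=n)    # e.g., a content measure

# Each arrow in the model is estimated with its own regression.
path_B = sm.OLS(B, sm.add_constant(A)).fit()                         # A -> B
path_Y = sm.OLS(Y, sm.add_constant(np.column_stack([A, B]))).fit()   # A -> Y and B -> Y

print("A -> B coefficient:", round(path_B.params[1], 2))
print("A -> Y and B -> Y coefficients:", np.round(path_Y.params[1:], 2))
```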

Multiple Regression

Techniques such as ordinary least squares regression and its variants are needed to assess how well variables in a causal model and the paths among them explain variation in some dependent variable of interest. Multiple regression permits assessment of the nature of the linear relationship between two or more independent variables and some dependent variable of interest. Correlation can only indicate when two things are strongly (or weakly) related to each other. By contrast, multiple regression can indicate how, for every unit increase in each independent variable, the dependent variable will have some specified change in its unit of measure.

Multiple regression requires that the dependent variable be interval or ratio level, although the independent variables can be dichotomous in nature (called dummy variables). (A form of regression called logistic regression is available to assess independent variable effects on a dependent variable measured at the nominal level. Readers should consult an advanced statistics book [e.g., Tabachnick & Fidell, 2013] for an explanation of how to use it.) When all independent variables are dummy variables, multiple regression is equivalent to ANOVA. The technique also assumes that each of the interval/ratio variables is normally distributed around its mean. Whether the data set meets this requirement can be assessed by examining each variable's measures of skewness, or departure from a normal distribution. Because small samples are likely to be more skewed, the technique is also sensitive to the overall number of cases providing data for the analysis. Tabachnick and Fidell (1996, p. 132) reported that testing the multiple correlation (the effect of all the independent variables together on the dependent variable of interest) requires a minimum of 50 cases plus 8 times the number of independent variables. Testing the effect of individual independent variables on the dependent variable requires a sample of at least 104 plus the number of independent variables. For example, if a researcher had an equation with 6 independent variables, he or she would need a sample of at least 98 cases (50 + 48) to test the multiple correlation (the correlation between the dependent variable and the collection of independent variables) and 110 cases (104 + 6) to test the relationships between the dependent variable and the individual independent variables.

Multiple regression assesses the nature of the way variables vary together, but it does so while also controlling for all the ways other variables in the model are varying.


Think of it this way: multiple regression correlates each independent variable with the dependent variable at each measurement level of all the other variables. Regression analysis creates an equation that allows the best prediction of the dependent variable based on the data set. The equation takes the following form:

y = a + b1X1 + b2X2 + . . . + bnXn + e

In this equation, y is the value of the dependent variable when various values of the independent variables (X1, X2 . . . Xn) have been placed in the equation. The letter a represents the intercept and would be the value of y when all the Xs equal 0 (zero). The letter e represents the error term, which is the variation in y not explained by all the Xs. The error term is sometimes dropped, but it is important to remember that all statistical analysis has error. Each independent variable has a regression coefficient, represented by b1, b2 . . . bn. This coefficient is the amount by which the corresponding X value is multiplied in calculating the y value; it specifies how much the dependent variable changes for a given change in each independent variable.

However, regression coefficients are expressed in the original units of the variables, which can make comparisons difficult. Therefore, the regression coefficients can be standardized to facilitate comparisons among the contributions of independent variables. This standardization of coefficients is similar to the standardization of exam scores, or putting the scores on a curve. It standardizes each variable by subtracting the variable's mean from each score and dividing by the standard deviation. The standardized—or beta—coefficients are most useful for within-model comparisons of the relative importance of each independent variable's influence on the dependent variable. Beta weights are not comparable across data sets. Multiple regression computes a beta for each independent variable, with the beta varying according to each variable's standard deviation. The interpretation is that for each change of 1 SD in the independent variable, the dependent variable changes by some part of its standard deviation, as indicated by the beta coefficient. For example, a beta of .42 means that, for each increase of 1 SD in the independent variable, the dependent variable would increase by .42 of its standard deviation. If a second independent variable had a beta of .22, it is easy to see that it is less influential because its variation produces relatively less variation in the dependent variable.

Another statistic used in multiple regression is the multiple R-squared (coefficient of determination). This is the proportion of the dependent variable's variance accounted for by all the variation of the independent variables in the model. In other words, a large multiple R-squared produced by a model means that the set of included variables is indeed substantively important in illuminating the social processes being investigated.

A smaller multiple R-squared means that independent variables not included in the model are important and in need of investigation. The adjusted multiple R-squared modifies the multiple R-squared by taking into consideration the number of independent variables and the number of cases. The adjusted multiple R-squared is a better measure of fit when probability samples are used with multiple regression.

Finally, if the data were drawn from a random sample, a test of statistical significance is necessary to determine whether the coefficients found in the regression analysis are really 0 (zero) or reflect some actual relationship in the population. Regression analysis also generates significance tests to permit the assessment of each coefficient and of the entire set of variables in the regression analysis as a whole. In a study of news sourcing at radio stations in the United States, Lacy et al. (2013) used multiple regression and found that radio stations tended to perform worse than daily newspapers in the number and diversity of sources in stories. However, there were two exceptions: public radio stations and stations cross-owned with television stations performed better with sourcing than commercial stations that were not cross-owned.

Table 9.2 summarizes various measures of association used with nominal, ordinal, interval, and ratio data.
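To make the regression quantities described above concrete, the sketch below fits a multiple regression with Python's statsmodels and prints the unstandardized coefficients, standardized (beta) coefficients, R-squared, adjusted R-squared, and the overall F-test. The variables and data are invented for illustration only.

```python
# Illustrative sketch: multiple regression with unstandardized and standardized (beta) coefficients.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 150  # comfortably above the 50 + 8m and 104 + m guidelines for m = 2 predictors
df = pd.DataFrame({
    "staff_size": rng.normal(50, 10, n),
    "pct_video": rng.uniform(0, 60, n),
})
df["source_count"] = (0.08 * df["staff_size"] + 0.03 * df["pct_video"]
                      + rng.normal(0, 1, n))

X = sm.add_constant(df[["staff_size", "pct_video"]])
model = sm.OLS(df["source_count"], X).fit()

print(model.params)                   # a (intercept) and the unstandardized b coefficients
print(model.rsquared, model.rsquared_adj)
print(model.fvalue, model.f_pvalue)   # overall F-test and its p-value

# Beta weights: refit after converting every variable to z-scores (mean 0, SD 1).
z = (df - df.mean()) / df.std()
betas = sm.OLS(z["source_count"],
               sm.add_constant(z[["staff_size", "pct_video"]])).fit().params
print(betas)                          # standardized coefficients, comparable within the model
```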

Feedback Relationships

Establishing the time order of causal relationships can be complicated for models with feedback loops, where concepts influence each other across time. These feedback loops (reciprocal relationships) can vary in the time required for the influence to occur. For example, news content influences the public agenda, which in turn influences news content, which in turn influences the public agenda. Theory can be helpful in understanding these feedback loops, but the existence of the loops can influence the empirical testing of the theory. Survey data are collected simultaneously and can create difficulty in identifying time order for examining causal relationships. Both research design and statistics can be useful in dealing with these loops.

Table 9.2 Common data association techniques in content analysis

Level of Measure    Summary Measure             Significance Test (if Needed)
Nominal             Cramer's V; Phi             Chi-square
Ordinal             Kendall's tau               z-test
Interval            Pearson's r; Regression     F-test; F-test
Ratio               Pearson's r; Regression     F-test; F-test


If the profit level for a streaming video service is hypothesized to affect the number and quality of streaming programs produced by the service, then the profits must be measured before programs are counted and evaluated for quality. Of course, the number and quality of programs may well influence profit level, which would require that quality and number of programs be measured before profit levels.

Two frequently used statistical techniques can examine the mutual influence of variables and control for feedback loops. First, structural equation modeling (SEM) is a set of procedures that requires an explicit model to be tested by evaluating the simultaneous relationships among a collection of latent variables, which are unobservable (e.g., attitudes), and measured variables, which are observable (e.g., content). It has many uses but can create ambiguity in interpretation because of the complexity of the models (Tabachnick & Fidell, 2013). Second, feedback loops can be examined using two-stage multiple regression. Used often in econometrics, this approach controls for the simultaneous influence of the dependent and independent variables. It allows a researcher to evaluate the influence of variable x (the independent variable) on y (the dependent variable) while controlling for the influence of y on x. For more background, see Wooldridge (2015). A detailed discussion of SEM and two-stage regression is beyond the scope of this volume, but researchers should be aware of any feedback relationships and simultaneity within the models they are testing and should consult sources on the statistics useful in evaluating such relationships.

Statistical Assumptions

These analysis procedures have been presented in such a manner that they should be intuitively easy to grasp. However, this runs the risk of oversimplification. In particular, statistical procedures carry certain assumptions about the data being analyzed. If the data depart greatly from the assumed conditions (e.g., extreme values, or outliers, with regression analysis), the analysis will lack validity. Researchers should always test data for these assumptions. For example, Weber (1990) pointed out that content analysts should be particularly careful in this regard when transforming frequencies, time, and space measures into percentages to control for length of a document. Percentages have a limited range, and their distribution is not linear; means and variances for percentages are not independent; and content analysis data are often not normally distributed. Linearity, independence of mean and variance, and normal distribution of variable values are all assumptions of commonly used statistical procedures. Therefore, when transforming content measures into percentages and using sophisticated statistical analysis, data should be checked to see whether they fit such assumptions.

Statistical procedures vary in how sensitive they are to violations of assumptions. With some procedures, minor violations will not result in invalid conclusions. However, researchers will have more confidence in their conclusions if data are consistent with statistical assumptions. Readers should consult statistics books to help them evaluate assumptions about data (Blalock, 1979; Tabachnick & Fidell, 2013).
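A quick screening of this kind can be done in a few lines; the sketch below (with invented data) checks a variable's skewness, runs a normality test, and flags extreme outliers before any model is fit.

```python
# Illustrative sketch: screening a variable for skewness, non-normality, and outliers.
import numpy as np
from scipy import stats

story_pct = np.array([2, 5, 3, 8, 4, 6, 40, 3, 7, 5, 4, 6, 2, 9, 5])  # % of space per topic

print("Skewness:", round(stats.skew(story_pct), 2))     # values near 0 suggest symmetry

w, p = stats.shapiro(story_pct)                         # Shapiro-Wilk normality test
print("Shapiro-Wilk p-value:", round(p, 4))             # a small p flags non-normality

z_scores = np.abs(stats.zscore(story_pct))
print("Possible outliers (|z| > 3):", story_pct[z_scores > 3])
```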

Summary

Data analysis explores and interprets large collections of data with the aim of finding meaning in what has been observed. The term "statistics" encompasses a set of tools that describe the data and allow social scientists to recognize patterns of differences and relationships within the data. Beyond description and finding patterns in a set of data, the central limit theorem provides a foundation for concluding whether the data set characteristics, differences, and relationships identified in a probability sample likely exist in the population from which that sample was taken. This process is statistical inference, and it uses an estimate of sampling error to establish a probability of sample data representing population data at an assumed level of confidence. This is a powerful tool that is the basis for the use of statistics in both science and business activities.

This chapter comprises a very brief survey of some often-used statistics. It is an introduction and jumping-off point for further exploration of the tools that provide rigor and precision to social science. Regardless of the statistics that are produced from a set of data, deriving meaning from them is the goal. Statistical analysis can help people understand data patterns only when the analysis is conducted in ways that are consistent with standard practices and when the results are examined within the framework of social science theory and existing research. Which statistics are appropriate to a particular study depends on the hypotheses or research questions, the level of measurement of variables (nominal, ordinal, interval, ratio), and the nature of the sample. Like any good tool, statistics must be appropriate to the project. One size does not fit all.

Used properly, statistical techniques are valuable ways of expanding one's understanding. Yet, they can also generate previously unimagined puzzles and questions. Few studies do not contain a sentence beginning "Further research is needed to . . ." For most researchers, that sentence is less an acknowledgment of the study's limitations and more an invitation to join the exploration.

Appendix A Sample Protocol

The following combination protocol and code sheet was modified from one created by Kirsten Adams as an assignment in a content analysis class taught at the University of North Carolina. Data from this protocol were used in the article "Between Trump and a hard place: Civil gatekeeping and moral equivalence in press endorsements of 2016 presidential candidates," Journalism Studies, 21(11), 1531–1550 (Adams, 2020).

Introduction

This study will examine US newspapers' editorial endorsements in the 2016 US presidential election. Political endorsements entail publicly declaring support for a particular candidate for elected office. In this case, endorsements will include all those made by the top 100 news organizations (based on daily circulation figures) for presidential candidates in the 2016 election.

Procedure

The following steps should be taken in the content analysis coding. Coders will first make note of basic information about the editorial (newspaper ID number, publication date, and newspaper name) and the political party (or parties) the newspaper endorsed in the 2008 and 2012 US presidential elections. Coders will then turn to the editorials' content, noting both the explicit candidate endorsed (or disendorsed) in each endorsement and the valence/tone (positive, negative, or neutral/both) for each candidate. After this, coders will measure the presence or absence of explicit mentions of candidates' character, based on Iyengar, Peters, and Kinder's (1982) conceptualization of presidential candidate character references. If explicit mentions of any of the seven character traits are present, the valence/tone of each trait as it is used in the endorsement (positive, negative, or neutral/both) will be coded for each candidate.

Block 1

Coder ID: Place a 1 by the coder's ID and leave the other spaces blank.

Coder 1 ______   Coder 2 ______   Coder 3 ______   Coder 4 ______

Newspaper ID: Place the appropriate number associated with the newspaper from the newspaper ID list.

Publication month: Use the pull-down menu for the newspaper edition's publication month.

Publication day: Use the pull-down menu for the newspaper edition's date of publication.

Newspaper name: Write the name of the newspaper in which the editorial appeared.

Editorial article length (number of paragraphs, excluding numbered or bulleted lists): Paragraphs are collections of text delimited by either an indentation or a line of white space. Lists of items are not counted as paragraphs. Place the appropriate number of paragraphs in the space below.

______

Block 2

Data on 2012 endorsements come from UC Santa Barbara's American Presidency Project (Woolley & Peters, 2016). Look at the provided list. Which political party did this newspaper endorse in 2012? Place a 1 by the appropriate response and leave the other responses blank.

______ Democratic Party
______ Republican Party
______ Other Party
______ Split between parties
______ None

Data on 2008 endorsements come from UC Santa Barbara's American Presidency Project (Woolley & Peters, 2016). Look at the provided list. Which political party did this newspaper endorse in 2008? Place a 1 by the appropriate response and leave the other responses blank.


______ Democratic Party
______ Republican Party
______ Other Party
______ Split between parties
______ None

Block 3

Was any candidate explicitly endorsed in this editorial endorsement? (This statement would often be in the title, or in the lede or last paragraph.) Place a 1 by the appropriate response and leave the other responses blank.

Hillary Clinton ______
Donald Trump ______
Gary Johnson ______
None ______

Was any candidate explicitly disendorsed in this editorial endorsement ("Do not vote for . . ." or "Say no to . . .")? (This would include statements in the title, or in the lede or last paragraph. Place a 1 by the appropriate response and leave the other responses blank.)

Hillary Clinton ______
Donald Trump ______
Gary Johnson ______
None ______

Overall, was each candidate referred to positively, negatively, or neutrally/both in this endorsement? (Select "No mention" if the candidate was not mentioned in this endorsement.) Place a 1 in the appropriate response option and leave the non-appropriate options blank.

                   Positive   Negative   Neutral/Both   No Mention
Hillary Clinton
Donald Trump
Gary Johnson

Block 4

A candidate's character will be measured by whether an endorsement mentions the candidate's "integrity (honesty, morality, trustworthiness), competence (leadership ability, strength, knowledge, psychological fitness)" (King, 1995, p. 87). This study will focus specifically on explicit character traits in order to get at the shared manifest meaning in these symbols of communication. Candidates' characters will be measured by the presence or absence of explicit mentions of the above seven traits. If explicit mentions of any of the seven traits are present, the valence of those judgments for each will be coded positive, negative, or neutral/both positive and negative.

Examples of explicit mentions of traits from real endorsements:

• Honesty: "liar," "struggles with honesty," "truthful," "honest"
• Morality: "bigot," "honorable," "self-serving," "selfish," "thoughtful," "(un)ethical"
• Trustworthiness: "(un)trustworthy," "dependable," "(un)reliable"
• Leadership ability: "(in)competent," "true leader," "respectable," "responsible," "reckless"
• Strength: "thin-skinned," "thick-skinned," "strong," "weak"
• Knowledge: "(in)experienced," "knowledgeable," "skilled," "expert," "educated," "ignorant"
• Psychological fitness: "(un)fit," "(un)stable," "dangerous"

In this endorsement, were any of the seven character traits mentioned about each candidate? Place a 1 for "yes" and a 0 for "no" after the name.

Hillary Clinton _______   Donald Trump _______   Gary Johnson _______

Based on the previous response, if this endorsement mentioned at least one character trait for Hillary Clinton, complete the following. If not, leave the space blank. (As a reminder, valence/tone is measured by whether the endorsement refers to a trait positively, negatively, or neutrally (both positively and negatively), or the trait was not mentioned at all in the endorsement.)

Overall, what valence/tone was used for Hillary Clinton's character trait mentioned in this endorsement? (Select "No mention" if this was not mentioned in relation to Hillary Clinton in this endorsement.) Place a 1 in the appropriate response and leave the other responses blank.

Valence/Tone of Hillary Clinton's Characteristics

                        Positive   Neutral/Both   Negative   No Mention
Honesty
Morality
Trustworthiness
Leadership Ability
Strength
Knowledge
Psychological Fitness

Based on the earlier response, if this endorsement mentioned at least one character trait for Donald Trump, complete the following. If not, leave the space blank. (As a reminder, valence/tone is measured by whether the endorsement refers to a trait positively, negatively, or neutrally (both positively and negatively), or the trait was not mentioned at all in the endorsement.)


Overall, what valence/tone was used for Donald Trump's character trait mentioned in this endorsement? (Select "No mention" if this was not mentioned in relation to Donald Trump in this endorsement.) Place a 1 in the appropriate response and leave the other responses blank.

Valence/Tone of Donald Trump's Characteristics

                        Positive   Neutral/Both   Negative   No Mention
Honesty
Morality
Trustworthiness
Leadership Ability
Strength
Knowledge
Psychological Fitness

Based on the earlier response, if this endorsement mentioned at least one character trait for Gary Johnson, complete the following. If not, leave the space blank. (As a reminder, valence/tone is measured by whether the endorsement refers to a trait positively, negatively, or neutrally (both positively and negatively), or the trait was not mentioned at all in the endorsement.)

Overall, what valence/tone was used for Gary Johnson's character trait mentioned in this endorsement? (Select "No mention" if this was not mentioned in relation to Gary Johnson in this endorsement.) Place a 1 in the appropriate response and leave the other responses blank.

Valence/Tone of Gary Johnson's Characteristics

                        Positive   Neutral/Both   Negative   No Mention
Honesty
Morality
Trustworthiness
Leadership Ability
Strength
Knowledge
Psychological Fitness

Thank you for completing the coding.

Appendix B Reporting Standards for Content Analysis Articles

The following suggestions are aimed at standardizing the reporting process for content analysis articles. The suggestions are based on Lombard, Snyder-Duch, and Bracken (2004), Lovejoy, Watson, Lacy, and Riffe (2014; 2016), and Lacy, Watson, Riffe, and Lovejoy (2015). These reporting standards represent the need for replication as a foundational element of social science. Replication requires sufficient and detailed information about a study.

Sampling

• The nature and selection process of the study sample should be clearly described in the article. This requires a specific and detailed description of the sampling method (census, simple random sampling, etc.) and a justification for the sampling method.
• If a probability sample is used, the population and sampling frame should be explicitly described. Report the sample and, if possible, the population sizes.
• If a probability sample is used, descriptive statistics (mean, median, range, and standard deviation) should be reported for each variable in a footnote, table, or the main text.
• If a non-probability sample is used, it should be justified and its limitations specified.

Coders, Variables, and Protocol

• Articles should be transparent about study variables that ultimately failed to reach acceptable reliability levels and were thus dropped. Science is cumulative, and reporting unsuccessful efforts can help advance communication research both theoretically and methodologically by allowing other scholars to learn from such experiences.
• Articles should report the number of coders employed and who supervised the coding, the administration of reliability testing, and so on. Articles should report how the coding work was distributed (what percentage of it was done by the PI or the coder(s), were sets of coding units/assignments assigned randomly to avoid systematic error, etc.).


• The role, if any, that the coders played in developing the protocol should be reported.

Reliability

• The sample used for the reliability check should be either a census of the study sample or a randomly selected subgroup of the study sample. As discussed in Chapter 7, ensure the sample is sufficiently large to represent all variables and categories, and report the basis for judging it to be sufficiently large. Report the selection method and the number (not the percentage) of units used in the reliability check.
• Protocol reliability should be established during a pilot test before the study coding begins. This step should be reported.
• Articles should report how coding reliability problems were resolved (retraining and retesting, coder consensus, dropping the variables, etc.).
• Assessing main study protocol reliability should use units from the study sample and be conducted during the coding process. Coders should not know which content units are being used in the reliability check.
• If probability sampling was used to generate a reliability sample, the process for determining the number of reliability cases should be explained and justified by citing literature. Two such processes—Krippendorff (2013) and Lacy and Riffe (1993)—have been mentioned in this volume.
• Krippendorff's Alpha should be reported for each variable, along with the percentage of agreement (which could be placed in a footnote). If one or more variables show a high level of agreement but a low level for the reliability coefficient, the data should be examined to determine why (see Chapter 7). If the variable data are skewed, the author should report Gwet's AC2 along with Alpha and explain why AC2 is appropriate. (A brief computational sketch follows this list.)
• In addition to reporting the reliability coefficients, a confidence interval should be reported for each.
• The article should justify the decision that the reliability is sufficiently high for each variable to be included in the analysis.
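For illustration, the sketch below computes Krippendorff's Alpha and simple percentage agreement for one variable coded by two coders. It relies on the third-party Python package `krippendorff` (an assumption on our part; this particular package is not discussed in this volume), and the coder data are invented.

```python
# Illustrative sketch: Krippendorff's Alpha and percentage agreement
# for one nominal variable coded by two coders on ten reliability units.
# Requires: pip install krippendorff numpy
import numpy as np
import krippendorff  # third-party package; use np.nan for units a coder skipped

coder1 = [1, 2, 2, 1, 3, 2, 1, 1, 2, 3]
coder2 = [1, 2, 2, 1, 3, 1, 1, 1, 2, 3]
reliability_data = np.array([coder1, coder2], dtype=float)  # rows = coders, columns = units

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")

agreement = np.mean(np.array(coder1) == np.array(coder2))

print(f"Krippendorff's Alpha = {alpha:.3f}")
print(f"Percentage agreement = {agreement:.0%}")
```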

References

Adams, K. (2020). Between Trump and a hard place: Civil gatekeeping and moral equivalence in press endorsements of 2016 presidential candidates. Journalism Studies, 21(11), 1531–1550. Allen, C. J., & Hamilton, J. M. (2010). Normalcy and foreign news. Journalism Studies, 11, 634–649. Allport, G. W. (Ed.) (1965). Letters from Jenny. New York: Harcourt, Brace & World. Altschull, J. H. (1995). Agents of power (2nd ed.). New York: Longman. Álvarez, D., González, A., & Ubani, C. (2021). New feminist studies in audiovisual industries: The portrayal of men and women in digital communication: Content analysis of gender roles and gender display in reaction GIFs. International Journal of Communication, 15, 31. Retrieved July 31, 2023, from https://ijoc.org/index.php/ijoc/article/ view/14907. Aslam, S. (2022a) 63 Facebook statistics you need to know in 2022. Omnicore. Retrieved September 8, 2022 from https://www.omnicoreagency.com/facebook-statistics/. Aslam, S. (2022b). Twitter by the numbers: Stats, demographics, and fun facts. Omnicore. Retrieved September 1, 2022 from https://www.omnicoreagency.com/twitter-statistics/. Austin, E. W., Pinkleton, B. E., Hust, S. J. T., & Coral-Reaume Miller, A. (2007). The locus of message meaning: Differences between trained coders and untrained message recipients in the analysis of alcoholic beverage advertising. Communication Methods and Measures, 1(2), 91–111. Auxier, B., & Anderson, M. (2021). Social media use in 2021. Pew Research Center. Retrieved July 24, 2023, from https://www.pewresearch.org/internet/2021/04/07/ social-media-use-in-2021/. Babbie, E. (2013). The basics of social research. Boston, MA: Cengage Learning. Baden, C., & Tenenboim-Weinblatt, K. (2017). Convergent news? A longitudinal study of similarity and dissimilarity in the domestic and global coverage of the Israeli– Palestinian conflict. Journal of Communication, 67, 1–25. doi: 10.1111/jcom.12272. Baldwin, T., Bergan, D., Fico, F., Lacy, S., & Wildman, S. S. (2009). News media coverage of city governments in 2009. East Lansing: Quello Center for Telecommunication Management and Law, Michigan State University. Ball-Rokeach, S. J., Rokeach, M., & Grube, J. W. (1984). The great American values test: Influencing behavior and belief through television. New York: Free Press. Bantz, C. R., McCorkle, S., & Baade, R. C. (1997). The news factory. In D. Berkowitz (Ed.), Social meanings of news: A text-reader (pp. 269–285). Thousand Oaks, CA: Sage.


Bastien, F. (2018). Using parallel content analysis to measure mediatization of politics: The televised leaders’ debates in Canada, 1968–2008. Journalism, 21(11), 1743–1761. doi: 10.1177/1464884917751962. Bauer, R. A. (1964). The obstinate audience: The influence process from the point of view of social communication. American Psychologist, 19, 319–328. Beam, R. A. (2003). Content differences between daily newspapers with strong and weak market orientations. Journalism & Mass Communication Quarterly, 80, 368–390. Beam, R. A., & Di Cicco, D. T. (2010). When women run the newsroom: Management change, gender, and the news. Journalism & Mass Communication Quarterly, 87, 393–411. Beautifulsoup4 4.11.1. (2022). Retrieved August 29, 2022, from https://pypi.org/project/ beautifulsoup4/. Bennett, L. W. (1990). Toward a theory of press–state relations in the United States. Journal of Communication, 40(2), 103–127. Berelson, B. R. (1952). Content analysis in communication research. New York: Free Press. Berkowitz, D. (Ed.) (2011). Cultural meanings of news: A text-reader. Thousand Oaks, CA: Sage. Bialik, C. (2012). Tweets as poll data? Be careful. Wall Street Journal, February 12. Retrieved July 12, 2012, from http://online.wsj.com/article/SB100014240529702036460 04577213242703490740.html. Biswas, M., Sipes, C., & Brost, L. (2021). An analysis of general-audience and Black news sites’ coverage of African American issues during the COVID-19 pandemic. Newspaper Research Journal, 42(3), 397–415. Blalock, H. M., Jr. (1979). Social statistics (rev. 2nd ed.). New York: McGraw-Hill. Blatchford, A. (2020). Searching for online news content: The challenges and decisions. Communication Research and Practice, 6(2), 143–156. doi.org/10.1080/2204 1451.2019.1676864. Boukes, M., Jones, N. P., & Vliegenthart, R. (2022). Newsworthiness and story prominence: How the presence of news factors relates to upfront position and length of news stories. Journalism, 23(1), 98–116. Brandwatch. (2022). Retrieved August 29, 2022, from https://www.brandwatch.com/. Brown, J. D., & Campbell, K. (1986). Race and gender in music videos: The same beat but a different drummer. Journal of Communication, 36(1), 94–106. Bruns, A., & Liang, Y. E. (2012). Tools and methods for capturing Twitter data during natural disasters. First Monday, 17(4). https://doi.org/10.5210/fm.v17i4.3937. Bryant, J., Roskos-Ewoldsen, D., & Cantor, J. (Eds.) (2003). Communication and emotion: Essays in honor of Dolf Zillmann. Mahwah, NJ: Lawrence Erlbaum Associates. Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally. Cantril, H., Gaudet, H., & Hertzog, H. (1940). The invasion from Mars. Princeton, NJ: Princeton University Press. Carey, J. W. (1996). The Chicago school and mass communication research. In E. E. Dennis & E. Wartella (Eds.), American communication research: The remembered history (pp. 21–38). New York: Routledge. Carpenter, S., Boehmer, J., & Fico, F. (2016). The measurement of journalistic role enactment: A study of organizational constraints and support in for-profit and nonprofit journalism. Journalism & Mass Communication Quarterly, 93, 587–608.

References 201 Ceron, A., Curini, L., & Iacus, S. M. (2017). Politics and big data: Nowcasting and forecasting elections. New York: Routledge. Chaffee, S. H., & Hochheimer, J. L. (1985). The beginnings of political communication research in the United States: Origins of the “limited effects” model. In M. Gurevitch & M. R. Levy (Eds.), Mass communication yearbook 5 (pp. 75–104). Beverly Hills, CA: Sage. Chang, Y. H., Chang, C. Y., & Tseng, Y. H. (2010). Trends of science education research: An automatic content analysis. Journal of Science Education and Technology, 19(4), 315–331. Chau, M., & Chen, H. (2008). A machine learning approach to web page filtering using content and structure analysis. Decision Support Systems, 44(2), 482–494. Coe, K., & Griffin, R. A. (2020). Marginalized identity invocation online: The case of President Donald Trump on Twitter. Social Media + Society, 6(1). https://doi. org/10.1177/2056305120913979. Cohen, J. A. (1960). Coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 31–46. Cohen, J. A. (1968). Weighted kappa: Nominal scale agreement with a provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220. Cohen, S., & Young, J. (Eds.) (1981). The manufacture of news. London: Constable. Comfort, S. E., & Hester, J. (2019). Three dimensions of social media messaging success by environmental NGOs. Environmental Communication, 13(3), 281–286. doi:10.108 0/17524032.2019.1579746. Connolly-Ahern, C., Ahern, L. A., & Bortree, D. S. (2009). The effectiveness of stratified sampling for content analysis of electronic news source archives: AP Newswire, Business Wire, and PR Wire. Journalism & Mass Communication Quarterly, 86, 862–883. Conway, M. (2006). The subjective precision of computers: A methodological comparison with human coding in content analysis. Journalism & Mass Communication Quarterly, 83(1), 186–200. Correa, T., & Harp, D. (2011). Women matter in newsrooms: How power and critical mass relate to the coverage of the HPV vaccine. Journalism & Mass Communication Quarterly, 88, 301–319. Cortese, D. K., Szczypka, G., Emery, S., Wang, S., Hair, E., & Vallone, D. (2018). Smoking selfies: Using Instagram to explore young women’s smoking behaviors. Social Media + Society, 4(3). https://doi.org/10.1177/2056305118790762. Cox, J. B. (2022). Black lives matter to media (finally): A content analysis of news coverage during summer 2020. Newspaper Research Journal, 43(2), 155–175. Coyne, S. M., Ward, L. M., Kroff, S. L., Davis, E. J., Holmgren, H. G., Jensen, A. C., Erickson, S., & Essig, L. W. (2019). Contributions of mainstream sexual media exposure to sexual attitudes, perceived peer norms, and sexual behavior: A meta-analysis. Journal of Adolescent Health, 64(4), 430–436. Craft, S. H., & Wanta, W. (2004). Women in the newsroom: Influences of female editors and reporters on the news agenda. Journalism & Mass Communication Quarterly, 81, 124–138. Crunchbase (n.d.). Receptiviti. Retrieved January 17, 2023, from https://www.crunch base.com/organization/receptiviti. Culbertson, H. M. (1975). Veiled news sources—who and what are they? ANPA News Research Bulletin, No. 3, May 14.


Culbertson, H. M. (1978). Veiled attribution—an element of style? Journalism Quarterly, 55, 456–465. Culbertson, H. M., & Somerick, N. (1976). Cloaked attribution—what does it mean to readers? ANPA News Research Bulletin, No. 1, May 19. Culbertson, H. M., & Somerick, N. (1977). Variables affect how persons view unnamed news sources. Journalism Quarterly, 54, 58–69. Czarnecka, B., & Mogaji, E. (2020). How are we tempted into debt? Emotional appeals in loan advertisements in UK newspapers. International Journal of Bank Marketing, 38(3), 756–776. Danielson, W. A., & Adams, J. B. (1961). Completeness of press coverage of the 1960 campaign. Journalism Quarterly, 38, 441–452. Danielson, W. A., Lasorsa, D. L., & Im, D. S. (1992). Journalists and novelists: A study of diverging styles. Journalism Quarterly, 69, 436–446. Davison, K. K., Gicevic, S., Aftosmes-Tobio, A., Ganter, C., Simon, C. L., Newlan, S., & Manganello, J. A. (2016). Fathers’ representation in observational studies on parenting and childhood obesity: A systematic review and content analysis. American Journal of Public Health, 106(11), e14–e21. Deese, J. (1969). Conceptual categories in the study of content. In G. Gerbner, O. R. Holsti, K. Krippendorff, W. J. Paisley, & P. J. Stone (Eds.), The analysis of communication content (pp. 39–56). New York: Wiley. De Swert, K. (2012). Calculating inter-coder reliability in media content analysis using Krippendorff’s alpha. Center for Politics and Communication. Retrieved January 16, 2019, from http://www.polcomm.org/wp-content/uploads/ICR01022012. pdf. de Vreese, C. H. (2004). The effects of frames in political television news on issue interpretation and frame salience. Journalism & Mass Communication Quarterly, 81, 36–52. de Vreese, C. H. (2010). Framing the economy: Effects of journalistic news frames. In P. D’Angelo & J. A. Kuypers (Eds.), Doing news framing analysis: Empirical and theoretical perspectives (pp. 187–214). New York: Routledge. de Vreese, C. H., & Boomgaarden, H. (2006). Valenced news frames and public support for the EU. Communications, 28(4), 361–381. Di Cicco, D. T. (2010). The public nuisance paradigm: Changes in mass media coverage of political protest since the 1960s. Journalism & Mass Communication Quarterly, 87, 135–153. Dick, S. J. (1993). Forum talk: An analysis of interaction in telecomputing systems. Unpublished doctoral dissertation, Michigan State University. Dill, R. K., & Wu, H. D. (2009). Coverage of Katrina in local, regional, national newspapers. Newspaper Research Journal, 30(1), 6–20. Discovering Statistics (2017). Cluster analysis. Retrieved June 24, 2022, from https:// www.discoveringstatistics.com/2017/01/13/cluster-analysis/. Döring, N., Reif, A., & Poeschl, S. (2016). How gender-stereotypical are selfies? A content analysis and comparison with magazine adverts. Computers in Human Behavior, 55, 955–962. doi:10.1016/j.chb.2015.10.001. Druckman, J. N., Kifer, M. J., & Parkin, M. (2010). Timeless strategy meets new medium: Going negative on congressional campaign web sites, 2002–2006. Political Communication, 27(1), 88–103.

References 203 Druckman, J. N., Kifer, M. J., & Parkin, M. (2014). Congressional campaign communications in an Internet age. Journal of Elections, Public Opinion, and Parties, 24, 20–44. doi:10.1080/17457289.2013.832255. Druckman, J. N., Kifer, M. J., & Parkin, M. (2017). Consistent and cautious: Online congressional campaigning in the context of the 2016 presidential election. In J. Baumgartner & T. Towner (Eds.), The Internet and the 2016 presidential campaign (pp. 3–24). Lanham, MD: Lexington. Druckman, J. N., Kifer, M. J., & Parkin, M. (2018). Resisting the opportunity for change: How congressional campaign insiders viewed and used the Web in 2016. Social Science Computer Review, 36, 392–405. doi:10.1177/0894439317711977. Duffy, M. J., & Williams, A. E. (2011). Use of unnamed sources drops from peaks in 1960s and 1970s. Newspaper Research Journal, 32(4), 6–21. Duriau, V. J., Reger, R. K., & Pfarrer, M. D. (2007). A content analysis of the content analysis literature in organization studies: Research themes, data sources, and methodological refinements. Organizational Research Methods, 10(1), 5–34. doi: 10.1177/1094428106289252. Eddy, K. A., Riffe, D., Cohen, M. S., & Kim, S. (2021). “Newsmaker-in-chief”? US presidents’ foreign-policy priorities and international news coverage, from LBJ to Obama. International Communication Research Journal, 56(1), 10–25. Elmasry, M. H., & el-Nawawy, M. (2020). Can a non-Muslim mass shooter be a “terrorist”? A comparative content analysis of the Las Vegas and Orlando shootings. Journalism Practice, 14(7), 863–879. doi:10.1080/17512786.2019.1643766. Engesser, S., Ernst, N., Esser, F., & Büchel, F. (2017). Populism and social media: How politicians spread a fragmented ideology. Information, Communication & Society, 20(8), 1109–1126. doi:10.1080/1369118X.2016.1207697. European Union (n.d.). Privacy by Design. European Data Protection Supervisor. Retrieved January 17, 2023, from https://edps.europa.eu/data-protection/our-work/ subjects/privacy-design_en. Everbach, T. (2005). The “masculine” content of a female-managed newspaper. Media Report to Women, 33, 14–22. Famulari, U. (2020). Framing the Trump administration’s “zero tolerance” policy: A quantitative content analysis of news stories and visuals in US news websites. Journalism Studies, 21(16), 2267–2284. doi:10.1080/1461670X.2020.1832141. Feng, G. C., & Zhao, X. (2016). Do not force agreement: A response to Krippendorff (2016). Methodology, 12(4), 145–148. Fico, F. (1985). The search for the statehouse spokesman. Journalism Quarterly, 62, 74–80. Fico, F., & Cote, W. (1999). Fairness and balance in the structural characteristics of stories in newspaper coverage of the 1996 presidential election. Journalism & Mass Communication Quarterly, 76, 123–137. Fico, F., & Drager, M. (2001). Partisan and structural balance in news stories about conflict generally balanced. Newspaper Research Journal, 22(1), 2–11. Fico, F., Lacy, S., Baldwin, T., Wildman, S. S., Bergan, D., & Zube, P. (2013a). Newspapers devote far less coverage to county government coverage than city governance. Newspaper Research Journal, 34(1), 104–111. Fico, F., Lacy, S., Wildman, S. S., Baldwin, T., Bergan, D., & Zube, P. (2013b). Citizen journalism sites as information substitutes and complements for newspaper coverage of local governments. Digital Journalism, 1(1), 152–168.


Fico, F., & Soffin, S. (1995). Fairness and balance of selected newspaper coverage of controversial national, state and local issues. Journalism & Mass Communication Quarterly, 72, 621–633. Fitzpatrick, K., Fullerton, J., & Kendrick, A. (2013). Public relations and public diplomacy: Conceptual and practical connections. Public Relations Journal, 7(4), 1–21. Fontenot, M., Boyle, K., & Gallagher, A. H. (2009). Comparing type of sources in coverage of Katrina, Rita. Newspaper Research Journal, 30(1), 21–33. Freelon, D. G. (2010). ReCal: Intercoder reliability calculation as a web service. International Journal of Internet Science, 5(1), 20–33. Freelon, D. (2013). ReCal OIR: Ordinal, interval, and ratio intercoder reliability as a web service. International Journal of Internet Science, 8(1), 10–16. Freelon, D. (2018). Computational research in the post-API age. Political Communication, 35(4), 665–668. doi:10.1080/10584609.2018.1477506. Freelon, D., McIlwain, C., & Clark, M. (2018). Quantifying the power and consequences of social media protest. New Media & Society, 20, 990–1011. doi: 10.1177/1461444 816676646. Frehmann, K., Ziegele, M., & Rosar, U. (2022). “Alexa, Siri, Google, what do you know about Corona?” A quantitative survey of voice assistants and content analysis of their answers on questions about the COVID-19 pandemic. SCM Studies in Communication and Media, 11(2), 278–303. Gajda, A., & Wolowicz, A. (2022). If not in science, then where are the women? A content analysis of school textbooks. Education as Change, 26(1), 1–26. Gerbner, G., Signorielli, N., & Morgan, M. (1995). Violence on television: The Cultural Indicators Project. Journal of Broadcasting & Electronic Media, 39, 278–283. Ghosh, S., Zafar, M. B., Bhattacharya, P., Sharma, N., Ganguly, N., & Gummadi, K. (2013). On sampling the wisdom of crowds: Random vs. expert sampling of the twitter stream. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (pp. 1739–1744). New York: ACM. Giglietto, F., Rossi, L., & Bennato, D. (2012). The open laboratory: Limits and possibilities of using Facebook, Twitter, and YouTube as a research data source. Journal of Technology in Human Services, 30(3–4), 145–159. Gil de Zúñiga, H., Weeks, B., & Ardèvol-Abreu, A. (2017). Effects of the news-findsme perception in communication: Social media use implications for news seeking and learning about politics. Journal of Computer-Mediated Communication, 22(3), 105–123. Gjoka, M., Kurant, M., Butts, C. T., & Markopoulou, A. (2009). A walk in Facebook: Uniform sampling of users in online social networks. Retrieved January 11, 2019, from https://arxiv.org/abs/0906.0060. Golan, G. (2013). An integrated approach to public diplomacy. American Behavioral Scientist, 57(9), 1251–1255. Green, M. C., Brock, T. C., & Kaufman, G. F. (2004). Understanding media enjoyment: The role of transportation into narrative worlds. Communication Theory, 14(4), 311–327. doi: 10.1111/j.1468–2885.2004.tb00317.x. Grimmer, J., & Stewart, B. M. (2013). Text as data: The promises and pitfalls of automated content analysis methods for political texts. Political Analysis, 21, 267–297. doi: 10.1093/pan/mps028.

References 205 Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48. Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gathersburg, MD: Advanced Analytics. Hak, T., & Bernts, T. (1996). Coder training: Theoretical training or practical socialization. Qualitative Sociology, 19, 235–257. doi:10.1007/BF02393420. Hale, B. J., & Grabe, M. E. (2018). Visual war: A content analysis of Clinton and Trump subreddits during the 2016 campaign. Journalism & Mass Communication Quarterly, 95, 449–470. Hambrick, M. E., Simmons, J. M., Greenhalgh, G. P., & Greenwell, T. C. (2010). Understanding professional athletes’ use of Twitter: A content analysis of athlete tweets. International Journal of Sport Communication, 3(4), 454–471. Harlow, S., & Kilgo, D. K. (2021). Protest news and Facebook engagement: How the hierarchy of social struggle is rebuilt on social media. Journalism & Mass Communication Quarterly, 98(3), 665–691. Harvard Dataverse. (2022). Retrieved August 29, 2022 from https://dataverse.harvard.edu/. Hayes, A. F. (2005). An SPSS procedure for computing Krippendorff’s alpha. Retrieved September 19, 2018, from www.afhayes.com/spss-sas-and-mplus-macros-and-code. html. Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89. Heiss, R., & Matthes, J. (2020). Stuck in a nativist spiral: Content, selection, and effects of right-wing populists’ communication on Facebook. Political Communication, 37(3), 303–328. doi:10.1080/10584609.2019.1661890. Hendrickx, J., & Van Remoortere, A. (2022). Exploring the link between media concentration and news content diversity using automated text analysis. Journalism, 0(0). doi:10.1177/14648849221136946. Hermida, A., Lewis, S. A., & Zamith, R. (2013). Sourcing the Arab Spring: A case study of Andy Carvin’s sources on Twitter during the Tunisian and Egyptian revolutions. Journal of Computer-Mediated Communication, 19(3), 479–499. Hester, J. B., & Dougall, E. (2007). The efficiency of constructed weeks sampling for content analysis of online news. Journalism & Mass Communication Quarterly, 84, 811–824. Hillyer, G. C., Basch, C. H., & Basch, C. E. (2021). Coverage of transmission of COVID-19 information on successive samples of YouTube videos. Journal of Community Health, 46(4), 817–821. Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley. Hovland, C. I. (1959). Reconciling conflicting results derived from experimental and survey studies of attitude change. American Psychologist, 14, 8–17. Hox, J. J., Moerbeek, M., & Van de Schoot, R. (2017). Multilevel analysis: Techniques and applications (3rd ed.). New York: Routledge. Hughes, J. (2021). krippendorffsalpha: An R package for measuring agreement using Krippendorff’s alpha coefficient. Retrieved July 23, 2023, from https://arxiv.org/ abs/2103.12170.


Hum, N. J., Chamberlin, P. E., Hambright, B. L., Portwood, A. C., Schat, A. C., & Bevan, J. L. (2011). A picture is worth a thousand words: A content analysis of Facebook profile photographs. Computers in Human Behavior, 27(5), 1828–1833. Hunter, J. E., & Gerbing, D. W. (1982). Unidimensional measurement, second order factor analysis and causal models. Research in Organizational Behavior, 4, 267–320. Indiana University (2021). Research using online tools and mobile devices. Retrieved January 17, 2023, from https://research.iu.edu/compliance/human-subjects/guidance/ mobile.html. Iyengar, S., Peters, M. D., & Kinder, D. R. (1982). Experimental demonstrations of the “not-so-minimal” consequences of television news programs. American Political Science Review, 76(4), 848–858. Jamal, A. A., Keohane, R. O., Romney, D., & Tingley, D. (2015). Anti-Americanism and anti-interventionism in Arabic Twitter discourse. Perspectives on Politics, 13, 55–73. doi: 10.1017/S1537592714003132. Johnson, M. A., & Pettiway, K. M. (2017). Visual expressions of Black identity: African American and African museum websites. Journal of Communication, 67, 350–377. Johnson, R. H. (1999). The relation between formal logic and informal logic. Argumentation, 13, 265–274. Johnson, R. H., & Blair, J. A. (2000). Informal logic: An overview. Informal Logic, 20(2), 93–107. Jones, R. L., & Carter, R. E., Jr. (1959). Some procedures for estimating “news hole” in content analysis. Public Opinion Quarterly, 23, 399–403. Joo, J., & Steinert-Threlkeld, Z. C. (2022). Image as data: Automated content analysis for visual presentations of political actors and event. Computational Communication Research, 4(1), 11–67. doi:10.5117/CCR2022.1.001.JOO. Joseph, K., Landwehr, P. M., & Carley, K. M. (2014). Two 1%s don’t make a whole: Comparing simultaneous samples from Twitter’s streaming API. In W. G. Kennedy, N. Agarwal, & S. J. Yang (Eds.), International conference on social computing, behavioral-cultural modeling, and prediction (pp. 75–83). Cham: Springer. Joshi, S. P., Peter, J., & Valkenburg, P. M. (2011). Scripts of sexual desire and danger in US and Dutch teen girl magazines: A cross-national content analysis. Sex Roles, 64(7), 463–474. Kaid, L. L., & Wadsworth, A. J. (1989). Content analysis. In P. Emmert & L. L. Barker (Eds.), Measurement of communication behavior (pp. 197–217). New York: Longman. Kamhawi, R., & Weaver, D. (2003). Mass communication research trends from 1980 to 1999. Journalism & Mass Communication Quarterly, 80(1), 7–27. Karlsson, M. (2012). Changing the liquidity of online news: Moving towards a method for content analysis. International Communication Gazette, 74, 385–402. Karpf, D. (2012). Social science research methods in Internet time. Information, Communication, & Society, 15, 639–661. doi:10.1080/1369118X.2012.665468. Kelly, S. & Westerman, D. (2020). Doing communication science: Thoughts on making more valid claims. Annals of the International Communication Association, 44(3), 177–184. doi:10.1080/23808985.2020.1792789. Kensicki, L. J. (2004). No cure for what ails us: The media-constructed disconnect between societal problems and possible solutions. Journalism & Mass Communication Quarterly, 81(1), 53–73.

References 207 Kerlinger, F. N. (1973). Foundations of behavioral research (2nd ed.). New York: Holt, Rinehart & Winston. Kerlinger, F. N., & Lee, H. B. (2000). Foundations of behavioral research (4th ed). Belmont, CA: Wadsworth Thomson Learning. Ki, E., & Hon, L. C. (2006). Relationship maintenance strategies on Fortune 500 company web sites. Journal of Communication Management, 10(1), 27–43. Kilgo, D. K., & Harlow, S. (2019). Protests, media coverage, and a hierarchy of social struggle. International Journal of Press/Politics, 24, 508–530. Kilgo, D. K., Mourão, R. R., & Sylvie, G. (2019). Martin to Brown: How time and platform impact coverage of Black Lives Matter movement. Journalism Practice, 13(4), 413–430. doi:10.1080/17512786.2018.1507680. Kim, J., Lee, J., Heo, J., & Baek, J. (2021). Message strategies and viewer responses: Content analysis of HPV vaccination videos on YouTube. Journal of Health Communication, 26(12), 818–827. doi:10.1080/10810730.2021.2015644. Kim, S. H., Carvalho, J. P., & Davis, A. C. (2010). Talking about poverty: News framing of who is responsible for causing and fixing the problem. Journalism & Mass Communication Quarterly, 87, 563–581. Kim, S. H., Thrasher, J. F., Kang, M. H., Cho, Y. J., & Kim, J. K. (2017). News media presentations of electronic cigarettes: A content analysis of news coverage in South Korea. Journalism & Mass Communication Quarterly, 94, 443–464. Kim, Y., Kim, Y., & Zhou, S. (2017). Theoretical and methodological trends of agendasetting theory: A thematic analysis of the last four decades of research. Agenda Setting Journal, 1(1), 5–22. King, E. G. (1995). The flawed characters in the campaign: Prestige newspaper assessments of the 1992 presidential candidates’ integrity and competence. Journalism & Mass Communication Quarterly, 72(1), 84–97. Kornfield, R., Toma, C. L., Shah, D. V., Moon, T. J., & Gustafson, D. H. (2018). What do you say before you relapse? How language use in peer-to-peer online discussion forum predicts risky drinking among those in recovery. Health Communication, 33, 1184–1193. doi:10.1080/10410236.2017.1350906. Kraemer, H. C. (1979). Ramifications of a population model for k as a coefficient of reliability. Psychometrika, 44, 461–472. Krajewski, J. M. T., Schumacher, A. C., & Dalrymple, K. E. (2019). Just turn on the faucet: A content analysis of PSAs about the global water crisis on YouTube. Environmental Communication, 13(2), 255–275. doi:10.1080/17524032.2017.1373137. Kresovich, A., Collins, M. K., Riffe, D., & Carpentier, F. R. (2021). A content analysis of mental health discourse in popular rap music. JAMA Pediatrics, 175(3), 286–292. Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage. Krippendorff, K. (2004a). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage. Krippendorff, K. (2004b). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30, 411–433. Krippendorff, K. (2011). Agreement and information in the reliability of coding. Communication Methods and Measures, 5(2), 93–112.



Krippendorff, K. (2012). Commentary: A dissenting view on so-called paradoxes of reliability coefficients. In C. T. Salmon (Ed.), Communication yearbook 36 (pp. 481–499). New York: Routledge. Krippendorff, K. (2013). Content analysis: An introduction to its methodology (3rd ed.). Thousand Oaks, CA: Sage. Krippendorff, K. (2016a). Bootstrapping distributions for Krippendorff’s alpha. Retrieved July 23, 2023, from http://www.afhayes.com/public/alphaboot.pdf. Krippendorff, K. (2016b). Misunderstanding reliability. Methodology, 12(4), 139–144. Krippendorff, K. (2019). Content analysis: An introduction to its methodology (4th ed.). Thousand Oaks, CA: Sage. Krippendorff, K., & Craggs, R. (2016). The reliability of multi-valued coding of data. Communication Methods and Measures, 10(4), 181–198. Krippendorff, K., Mathet, Y., Bouvry, S., & Widlöcher, A. (2016). On the reliability of unitizing textual continua: Further developments. Quality & Quantity, 50(6), 2347–2364. Kruikemeier, S., & Shehata, A. (2017). News media use and political engagement among adolescents: An analysis of virtuous circles using panel data. Political Communication, 34(2), 221–242. Kutz, D. O., & Herring, S. C. (2005). Micro-longitudinal analysis of web news updates. In Proceedings of the 38th Annual Hawaii International Conference on Social Sciences (pp. 1–10). Retrieved January 12, 2019, from www.computer.org/csdl/proceedings/ hicss/2005/2268/04/22680102a.pdf. Lacy, S. (1987). The effects of intracity competition on daily newspaper content. Journalism Quarterly, 64, 281–290. Lacy, S. (1992). The financial commitment approach to news media competition. Journal of Media Economics, 59(2), 5–22. Lacy, S., Duffy, M., Riffe, D., Thorson, E., & Fleming, K. (2010). Citizen journalism web sites complement newspapers. Newspaper Research Journal, 31(2), 34–46. Lacy, S., & Fico, F. (1991). The link between newspaper content quality and circulation. Newspaper Research Journal, 12(2), 46–57. Lacy, S., Fico, F. G., Baldwin, T., Bergan, D., Wildman, S. S., & Zube, P. (2012). Dailies still do “heavy lifting” in government news, despite cuts. Newspaper Research Journal, 33(2), 23–39. Lacy, S., Fico, F., & Simon, T. F. (1989). The relationships among economic, newsroom and content variables: A path model. Journal of Media Economics, 2(2), 51–66. Lacy, S., & Riffe, D. (1993). Sins of omission and commission in mass communication quantitative research. Journalism Quarterly, 70, 126–132. Lacy, S., & Riffe, D. (1996). Sampling error and selecting intercoder reliability samples for nominal content categories. Journalism & Mass Communication Quarterly, 73, 963–973. Lacy, S., Riffe, D., & Randle, Q. (1998). Sample size in multi-year content analyses of monthly consumer magazines. Journalism & Mass Communication Quarterly, 75, 408–417. Lacy, S., Riffe, D., Stoddard, S., Martin, H., & Chang, K. K. (2000). Sample size for newspaper content analysis in multi-year studies. Journalism & Mass Communication Quarterly, 78, 836–845. Lacy, S., Riffe, D., Thorson, E., & Duffy, M. (2009). Examining the features, policies and resources of citizen journalism: Citizen news sites and blogs. Web Journal of Mass Communication Research, 15(1), 1–20.

Lacy, S., Robinson, K., & Riffe, D. (1995). Sample size in content analysis of weekly newspapers. Journalism & Mass Communication Quarterly, 72, 336–345.
Lacy, S., & Rosenstiel, T. (2015). Defining and measuring quality journalism. New Brunswick, NJ: Rutgers School of Communication and Information.
Lacy, S., Watson, B. R., & Riffe, D. (2011). Study examines relationship among mainstream, other media. Newspaper Research Journal, 32(4), 53–67.
Lacy, S., Watson, B. R., Riffe, D., & Lovejoy, J. (2015). Issues and best practices in content analysis. Journalism & Mass Communication Quarterly, 92, 791–811. doi:10.1177/1077699015607338.
Lacy, S., Wildman, S. S., Fico, F., Bergan, D., Baldwin, T., & Zube, P. (2013). How radio news uses sources to cover local government news and factors affecting source use. Journalism & Mass Communication Quarterly, 90, 457–477.
Lasswell, H. D. (1927). Propaganda technique in the world war. New York: Peter Smith.
Law, C., & Labre, M. P. (2002). Cultural standards of attractiveness: A thirty-year look at changes in male images in magazines. Journalism & Mass Communication Quarterly, 79(3), 697–711.
Lawrence, R. G. (2010). Researching political news framing: Established ground and new horizons. In P. D’Angelo & J. A. Kuypers (Eds.), Doing news framing analysis: Empirical and theoretical perspectives (pp. 265–285). New York: Routledge.
Lazarsfeld, P. F., Berelson, B., & Gaudet, H. (1944). The people’s choice. New York: Columbia University Press.
Lee, S., & Riffe, D. (2017). Who sets the corporate social responsibility agenda in the news media? Unveiling the agenda-building process of corporations and a monitoring group. Public Relations Review, 43, 293–305. doi:10.1016/j.pubrev.2017.02.007.
Lee, S., & Xenos, M. (2022). Incidental news exposure via social media and political participation: Evidence of reciprocal effects. New Media & Society, 24(1), 178–201.
Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of Big Data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1), 34–52.
Liao, T. (2021). You can see all that from right here? A content analysis of in situ augmented reality tweets. Communication Studies, 72(6), 1073–1088. doi:10.1080/10510974.2021.2011352.
Li, Y., Guan, M., Hammond, P., & Berrey, L. E. (2021). Communicating COVID-19 information on TikTok: A content analysis of TikTok videos from official accounts featured in the COVID-19 information hub. Health Education Research, 36(3), 261–271. doi:10.1093/her/cyab010.
Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2004). A call for standardization in content analysis reliability. Human Communication Research, 30, 434–437.
Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2010). Practical resources for assessing and reporting intercoder reliability in content analysis research projects. Retrieved May 2, 2023, from http://matthewlombard.com/reliability/index_print.html.
Lovejoy, J., Watson, B. R., Lacy, S., & Riffe, D. (2014). Assessing the reporting of reliability in published content analyses: 1985–2010. Communication Methods and Measures, 8(3), 207–221.
Lovejoy, J., Watson, B. R., Lacy, S., & Riffe, D. (2016). Three decades of reliability in communication content analyses: Reporting of reliability statistics and coefficient levels in three top journals. Journalism & Mass Communication Quarterly, 93(4), 1135–1159.



Lowery, S. A., & DeFleur, M. (1995). Milestones in mass communication research: Media effects (3rd ed.). White Plains, NY: Longman.
Luke, D. A., Caburnay, C. A., & Cohen, E. L. (2011). How much is enough? New recommendations for using constructed week sampling in newspaper content analysis of health stories. Communication Methods and Measures, 5(1), 76–91.
Lynch, T., Tompkins, J. E., van Driel, I. I., & Fritz, N. (2016). Sexy, strong, and secondary: A content analysis of female characters in video games across 31 years. Journal of Communication, 66, 564–584. doi:10.1111/jcom.12237.
Mahrt, M., & Scharkow, M. (2013). The value of big data in digital media research. Journal of Broadcasting & Electronic Media, 57(1), 20–33.
Manganello, J., Franzini, A., & Jordan, A. (2008). Sampling television programs for content analysis of sex on TV: How many episodes are enough? Journal of Sex Research, 45(1), 9–16.
Martins, N., Williams, D. C., Harrison, K., & Ratan, R. A. (2008). A content analysis of female body imagery in video games. Sex Roles, 61, 824–836.
Mastro, D. (2009). Effects of racial and ethnic stereotyping. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 325–341). New York: Routledge.
Mastro, D. E., & Greenberg, B. S. (2000). The portrayal of racial minorities on prime time television. Journal of Broadcasting & Electronic Media, 44(4), 690–703.
McCluskey, M., & Kim, Y. M. (2012). Moderation or polarization? Representation of advocacy groups’ ideology in newspapers. Journalism & Mass Communication Quarterly, 89(4), 565–584.
McCombs, M. E. (1972). Mass media in the marketplace. Association for Education in Journalism, Journalism Monograph No. 24.
McCombs, M., & Reynolds, A. (2009). How the news shapes our civic agenda. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 1–17). New York: Routledge.
McCombs, M. E., & Shaw, D. L. (1972). The agenda-setting function of mass media. Public Opinion Quarterly, 36, 176–187.
McCombs, M. E., Shaw, D. L., & Weaver, D. H. (2013). Communication and democracy: Exploring the intellectual frontiers in agenda-setting theory. New York: Routledge.
McEwan, B., Carpenter, C. J., & Westerman, D. (2018). On replication in communication science. Communication Studies, 69(3), 235–241.
McGregor, S. C., Lawrence, R. G., & Cardona, A. (2017). Personalization, gender, and social media: Gubernatorial candidates’ social media strategies. Information, Communication & Society, 20(2), 264–283.
McLeod, D. M., Kosicki, G. M., & McLeod, J. M. (2009). Political communication effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 228–251). New York: Routledge.
McLeod, D. M., & Tichenor, P. J. (2003). The logic of social and behavioral sciences. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 91–110). Boston, MA: Allyn & Bacon.
McLoughlin, M., & Noe, F. P. (1988). Changing coverage of leisure in Harper’s, Atlantic Monthly, and Reader’s Digest: 1960–1985. Sociology and Social Research, 72, 224–228.

McMillan, S. J. (2000). The microscope and the moving target: The challenge of applying content analysis to the World Wide Web. Journalism & Mass Communication Quarterly, 77, 80–98.
Meleo-Erwin, Z. C., Basch, C. H., Fera, J., & Smith, B. (2021). Discussion of weight loss surgery in Instagram posts: Successive sampling study. JMIR Perioperative Medicine, 4(2), e29390. doi:10.2196/29390.
Mello, J. P., Jr. (2022). A third of US social media users creating fake accounts. Tech News World, August 10. Retrieved January 17, 2023, from https://www.technewsworld.com/story/a-third-of-us-social-media-users-creating-fake-accounts-176987.html.
Meltwater. (2022). Retrieved August 29, 2022, from https://www.meltwater.com/en.
Meta. (2022). Meta research. Retrieved August 29, 2022, from https://research.facebook.com/publications/?s.
Mneimneh, Z., Pasek, J., Singh, L., Best, R., Bode, L., Bruch, E. et al. (2021). Data acquisition, sampling, and data preparation considerations for quantitative social science research using social media data. Retrieved January 15, 2023, from https://psyarxiv.com/k6vyj.
Moen, M. C. (1990). Ronald Reagan and the social issues: Rhetorical support for the Christian Right. Social Science Journal, 27, 199–207.
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s firehose. International AAAI Conference on Web and Social Media, July. Retrieved July 26, 2023, from https://arxiv.org/pdf/1306.5204.pdf.
Moser, C. A., & Kalton, G. (1972). Survey methods in social investigation (2nd ed.). New York: Basic Books.
Mourão, R. R., Kilgo, D. K., & Sylvie, G. (2018). Framing Ferguson: The interplay of advocacy frames, journalistic frames and sourcing in newspaper coverage of Michael Brown. Journalism, 22(2), 320–340.
Mozie, D. (2022). “They killin’ us for no reason”: Black Lives Matter, police brutality, and hip-hop music—A quantitative content analysis. Journalism & Mass Communication Quarterly, 99(3), 826–847.
Muñoz, C. L., & Towner, T. L. (2017). The image is the message: Instagram marketing and the 2016 presidential primary season. Journal of Political Marketing, 16, 290–318. doi:10.1080/15377857.2017.1334254.
Neff, T. (2020). Transnational problems and national fields of journalism: Comparing content diversity in US and UK news coverage of the Paris Climate Agreement. Environmental Communication, 14(6), 730–743. doi:10.1080/17524032.2020.1716032.
Neuendorf, K. (2017). The content analysis guidebook (2nd ed.). Thousand Oaks, CA: Sage.
Neumann, R., & Fahmy, S. (2012). Analyzing the spell of war: A war/peace framing analysis of the 2009 visual coverage of the Sri Lankan Civil War in Western newswires. Mass Communication and Society, 15, 169–200.
Nili, A., Tate, M., & Barros, A. (2017). A critical analysis of inter-coder reliability methods in information systems research. In K. Riemer, M. Indulska, & V. Tuunainen (Eds.), Proceedings of the 28th Australasian Conference on Information Systems (pp. 1–11). Retrieved July 26, 2023, from https://aisel.aisnet.org/acis2017/99/.
Oliver, M. B., & Krakowiak, M. (2009). Individual differences in media effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 517–531). New York: Routledge.



Olson, B. (1994). Sex and soap operas: A comparative content analysis of health issues. Journalism Quarterly, 71, 840–850. Opperhuizen, A. E., Schouten, K., & Klijn, E. H. (2019). Framing a conflict! How media report on earthquake risks caused by gas drilling: A longitudinal analysis using machine learning techniques of media reporting on gas drilling from 1990 to 2015. Journalism Studies, 20(5), 714–734. doi:10.1080/1461670X.2017.1418672. Oswald, L., & Bright, J. (2022). How do climate change skeptics engage with opposing views online? Evidence from a major climate change skeptic forum on Reddit. Environmental Communication, 16(6), 805–821. doi:10.1080/17524032.2022.2071314. Oxford University (1979). Newton, Isaac. In The Oxford Dictionary of Quotations (3rd ed.) (p. 362). New York: Oxford University Press. Palguna, D. S., Joshi, V., Chakaravarthy, V. T., Kothari, R., & Subramaniam, L. V. (2015). Analysis of sampling algorithms for Twitter. In Q. Y. Hong & M. Wooldridge (Eds.), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (pp. 967–973). Palo Alto, CA: AAAI Press. Parde, N., & Nielsen, R. D. (2018). Detecting sarcasm is extremely easy ;-). In Association for Computational Linguistics, Proceedings of the Workshop on Computational Semantics beyond Events and Roles. Retrieved July 26, 2023, from https://aclanthol ogy.org/W18-1303.pdf. Peter, J., & De Vreese, C. H. (2004). In search of Europe: A cross-national comparative study of the European Union in national television news. Harvard International Journal of Press/Politics, 9(4), 3–24. Peter, J., & Lauf, E. (2002). Reliability in cross-national content analysis. Journalism & Mass Communication Quarterly, 79, 815–832. Pfeffer, J., Mayer, K., & Morstatter, F. (2018). Tampering with Twitter’s sample API. EPJ Data Science, 7(1), 50. Pilny, A., McAninch, K., Slone, A., & Moore, K. (2019). Using supervised machine learning in automated content analysis: An example using relational uncertainty. Communication Methods and Measures, 13(4), 287–304. doi:10.1080/19312458.2019.16 50166. Potter, W. J. (2014). A critical analysis of cultivation theory. Journal of Communication, 64(6), 1015–1036. Potter, W. J. (2018). A review and analysis of patterns of design decisions in recent media effects research. Review of Communication Research, 6, 1–29. Potter, W. J., & Levine-Donnerstein, D. (1999). Rethinking validity and reliability in content analysis. Journal of Applied Communication Research, 27(3), 258–284. Prabowo, R., & Thelwall, M. (2009). Sentiment analysis: A combined approach. Journal of Informetrics, 3, 149–157. Pratt, C. A., & Pratt, C. B. (1995). Comparative content analysis of food and nutrition advertisements in Ebony, Essence, and Ladies’ Home Journal. Journal of Nutrition Education, 27, 11–18. Quarfoot, D., & Levine, R. A. (2016). How robust are multirater interrater reliability indices to changes in frequency distribution? American Statistician, 70(4), 373–384. Ramaprasad, J. (1993). Content, geography, concentration and consonance in foreign news coverage of ABC, NBC and CBS. International Communication Bulletin, 28, 10–14.

Rapoport, A. (1969). A system-theoretic view of content analysis. In G. Gerbner, O. Holsti, K. Krippendorff, W. J. Paisley, & P. J. Stone (Eds.), The analysis of communication content (pp. 17–38). New York: Wiley.
Reese, S. D. (2011). Understanding the global journalist: A hierarchy-of-influences approach. In D. Berkowitz (Ed.), Cultural meanings of news: A text-reader (pp. 3–15). Thousand Oaks, CA: Sage.
Reese, S. D., Gandy, O. H., Jr., & Grant, A. E. (Eds.) (2001). Framing public life: Perspectives on media and our understanding of the social world. Mahwah, NJ: Lawrence Erlbaum Associates.
Request 2.28.1. (2022). Retrieved August 29, 2022, from https://pypi.org/project/requests/#history.
Reynolds, P. D. (1971). A primer in theory construction. Indianapolis, IN: Bobbs-Merrill.
Rezvanian, A., & Meybodi, M. R. (2017). A new learning automata-based sampling algorithm for social networks. International Journal of Communication Systems, 30(5), e3091.
Rickard, L. N., Noblet, C. L., Duffy, K., & Brayden, W. C. (2018). Cultivating benefit and risk: Aquaculture representation and interpretation in New England. Society & Natural Resources, 31(12), 1358–1378. doi:10.1080/08941920.2018.1480821.
Riffe, D. (1984). International news borrowing: A trend analysis. Journalism Quarterly, 61, 142–148.
Riffe, D. (1991). A case study of the effect of expulsion of US correspondents on New York Times’ coverage of Iran during the hostage crisis. International Communication Bulletin, 26, 1–2, 11–15.
Riffe, D. (2003). Data analysis and SPSS programs for basic statistics. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 182–208). Boston, MA: Allyn & Bacon.
Riffe, D., Aust, C. F., & Lacy, S. R. (1993). The effectiveness of random, consecutive day and constructed week samples in newspaper content analysis. Journalism Quarterly, 70, 133–139.
Riffe, D., Ellis, B., Rogers, M. K., Ommeren, R. L., & Woodman, K. A. (1986). Gatekeeping and the network news mix. Journalism Quarterly, 63, 315–321.
Riffe, D., & Freitag, A. (1997). A content analysis of content analyses: 25 years of Journalism Quarterly. Journalism & Mass Communication Quarterly, 74, 873–882.
Riffe, D., Goldson, H., Saxton, K., & Yu, Y. C. (1989). Females and minorities in TV ads in 1987 Saturday children’s programs. Journalism Quarterly, 66(1), 129–136.
Riffe, D., Kim, S., & Sobel, M. R. (2018). News borrowing revisited: A 50-year perspective. Journalism & Mass Communication Quarterly, 98(4), 909–929. doi:10.1177/1077699018754909.
Riffe, D., Lacy, S., & Drager, M. (1996). Sample size in content analysis of weekly news magazines. Journalism & Mass Communication Quarterly, 73, 635–644.
Riffe, D., Lacy, S., Nagovan, J., & Burkum, L. (1996). The effectiveness of simple random and stratified random sampling in broadcast news content analysis. Journalism & Mass Communication Quarterly, 73, 159–168.
Rogers, E. M. (2003). Diffusion of Innovation (5th ed.). New York: Free Press.
Roush, W., Jr. (2008). How Crimson Hexagon translates the blogosphere’s Babel into wisdom. Xconomy, November 12. Retrieved January 17, 2023, from https://xconomy.com/boston/2008/11/12/how-crimson-hexagon-translates-the-blogospheres-babelinto-wisdom/.



Rowling, C. M., Jones, T. J., & Sheets, P. (2011). Some dared call it torture: Cultural resonance, Abu Ghraib, and a selectively echoing press. Journal of Communication, 61, 1043–1061.
Rubin, A. M. (2009). Uses-and-gratifications perspective on media effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 181–200). New York: Routledge.
Rusmevichientong, P., Pennock, D. M., Lawrence, S., & Giles, C. L. (2001). Methods for sampling pages uniformly from the World Wide Web. Proceedings of the AAAI Fall Symposium on Using Uncertainty within Computation (pp. 121–128). Menlo Park, CA: AAAI Press.
St. Cyr, C., Carpenter, S., & Lacy, S. (2010). Internet competition and US newspaper city government coverage: Testing the Lowrey and Mackay model of occupational competition. Journalism Practice, 4(4), 507–522.
St. Cyr, C., Lacy, S., & Guzman-Ortega, S. (2005). Circulation increases follow investments in newsrooms. Newspaper Research Journal, 26(4), 50–60.
Sapolsky, B. S., Molitor, F., & Luque, S. (2003). Sex and violence in slasher films: Reexamining the assumptions. Journalism & Mass Communication Quarterly, 80(1), 28–38.
Scheufele, B., Haas, A., & Brosius, H. (2011). Mirror or molder? A study of media coverage, stock prices, and trading volumes in Germany. Journal of Communication, 61, 48–70.
Scheufele, B. T., & Scheufele, D. A. (2010). Of spreading activation, applicability, and schemas: Conceptual distinctions and their operational implications for measuring frames and framing effects. In P. D’Angelo & J. A. Kuypers (Eds.), Doing news framing analysis: Empirical and theoretical perspectives (pp. 110–134). New York: Routledge.
Schmuck, D., & Hameleers, M. (2020). Closer to the people: A comparative content analysis of populist communication on social networking sites in pre- and post-election periods. Information, Communication & Society, 23(10), 1531–1548.
Scott, D. K., & Gobetz, R. H. (1992). Hard news/soft news content of national broadcast networks. Journalism Quarterly, 69, 406–412.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Sendra, A., & Farré, J. (2020). Communicating the experience of chronic pain through social media: Patients’ narrative practices on Instagram. Journal of Communication in Healthcare, 13(1), 46–54.
Severin, W. J., & Tankard, J. W., Jr. (2000). Communication theories: Origins, methods, and uses in the mass media (5th ed.). New York: Addison Wesley Longman.
Seyidoglu, J., Roberts, C., Darroch, F., Hillsburg, H., Schneeberg, A., McGettigan-Dumas, R., Huddle, M., & Montaño, A. (2022). Racing for representation: A visual content analysis of North American running magazine covers. Communication & Sport, 10(4), 642–663. doi:10.1177/21674795211000325.
Shen, B., & Bissell, K. (2013). Social media, social me: A content analysis of beauty companies’ use of Facebook in marketing and branding. Journal of Promotion Management, 19(5), 629–651.

Shils, E. A., & Janowitz, M. (1948). Cohesion and disintegration in the Wehrmacht in World War II. Public Opinion Quarterly, 12, 300–306, 308–315.
Shin, J., & Thorson, K. (2017). Partisan selective sharing: The biased diffusion of fact-checking messages on social media. Journal of Communication, 67(2), 233–255.
Shoemaker, P. J., & Reese, S. D. (1990). Exposure to what? Integrating media content and effects studies. Journalism Quarterly, 67, 649–652.
Shoemaker, P. J., & Reese, S. D. (1991). Mediating the message: Theories of media influence on mass media content. New York: Routledge.
Shoemaker, P. J., & Reese, S. D. (1996). Mediating the message: Theories of influences on mass media content (2nd ed.). White Plains, NY: Longman.
Shoemaker, P. J., & Reese, S. D. (2014). Mediating the message in the 21st century: A media sociology perspective. New York: Routledge.
Shoemaker, P. J., Tankard, J. W., Jr., & Lasorsa, D. L. (2004). How to build social science theories. Thousand Oaks, CA: Sage.
Shrum, L. J. (2009). Media consumption and perceptions of social reality: Effects and underlying processes. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 50–73). New York: Routledge.
Simon, T. F., Fico, F., & Lacy, S. (1989). Covering conflict and controversy: Measuring balance, fairness and defamation in local news stories. Journalism Quarterly, 66, 427–434.
Simonton, D. K. (1994). Computer content analysis of melodic structure: Classical composers and their compositions. Psychology of Music, 22, 31–43.
Slater, M. D. (2013). Content analysis as a foundation for programmatic research in communication. Communication Methods and Measures, 7(2), 85–93.
Smith, S. L., & Granados, A. D. (2009). Content patterns and effects surrounding sex-role stereotyping on television and film. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 342–361). New York: Routledge.
Sobel, M. R., & Riffe, D. (2015). US linkages in New York Times coverage of Nigeria, Ethiopia, and Botswana (2004–13): Economic and strategic bases for news. International Communication Research Journal, 50(1), 3–23.
Sobel, M., & Riffe, D. (2016). Newspapers use unnamed sources less often in high-stakes coverage. Newspaper Research Journal, 37(3), 299–311.
Sobel, M., Riffe, D., & Hester, J. B. (2016). Twitter diplomacy: A content analysis of eight US embassies’ Twitter feeds. Journal of Social Media in Society, 5(2), 75–107.
Song, H., Tolochko, P., Eberl, J. M., Eisele, O., Greussing, E., Heidenreich, T., Lind, T., Galyga, S., & Boomgaarden, H. G. (2020). In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis. Political Communication, 37(4), 550–572. doi:10.1080/10584609.2020.1723752.
Song, Y., & Chang, T. K. (2012). Selecting daily newspapers for content analysis in China: A comparison of sampling methods and sample sizes. Journalism Studies, 13(3), 356–369.
Sparks, G. G., Sparks, C. W., & Sparks, E. A. (2009). Media violence. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 269–286). New York: Routledge.



Statistical Odds & Ends. (2019). Spearman’s rho and Kendall’s tau. Retrieved June 15, 2022, from https://statisticaloddsandends.wordpress.com/2019/07/08/spearmans-rhoand-kendalls-tau/. Staudt, A., & Krewel, M. (2015). KRIPPALPHA: Stata module to compute Krippendorff’s alpha intercoder reliability coefficient. Ideas. Retrieved December 12, 2022, from https://ideas.repec.org/c/boc/bocode/s457750.html. Stegner, W. (1949). The radio priest and his flock. In I. Leighton (Ed.), The aspirin age: 1919–1941 (pp. 232–257). New York: Simon & Schuster. Stempel, G. H., III. (1952). Sample size for classifying subject matter in dailies. Journalism Quarterly, 29, 333–334. Stempel, G. H., III. (1985). Gatekeeping: The mix of topics and the selection of stories. Journalism Quarterly, 62(4), 791–796, 815. Stempel, G. H., III. (2003). Content analysis. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 209–219). Boston, MA: Allyn & Bacon. Stempel, G. H., III, & Stewart, R. K. (2000). The Internet provides both opportunities and challenges for mass communication researchers. Journalism & Mass Communication Quarterly, 77, 541–548. Stouffer, S. A. (1977). Some observations on study design. In D. C. Miller (Ed.), Handbook of research design and social measurement (3rd ed.) (pp. 27–31). New York: McKay. Strodthoff, G. G., Hawkins, R. P., & Schoenfeld, A. C. (1985). Media roles in a social movement. Journal of Communication, 35(2), 134–153. Stryker, J. E., Wray, R. J., Hornik, R. C., & Yanovitzky, I. (2006). Validation of database search terms for content analysis: The case of cancer news coverage. Journalism & Mass Communication Quarterly, 83(2), 413–430. Su, L. Y., Cacciatore, M. A., Liang, X., Brossard, D., Scheufele, D. A., & Xenos, M. A. (2016). Analyzing public sentiments online: Combining human and computer-based content analysis. Information, Communication & Society, 20(3), 406–427. doi:10.108 0/1369118X.2016.1182197. Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins. Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). New York: HarperCollins. Tamul, D. J., & Martínez-Carrillo, N. I. (2018). Ample sample? An examination of the representativeness of themes between sampling durations generated from keyword searches for 12 months of immigration news From LexisNexis and newspaper websites. Journalism & Mass Communication Quarterly, 95(1), 96–121. Tandoc, E. C., Jr., & Vos, T. P. (2016). The journalist is marketing the news: Social media in the gatekeeping process. Journalism Practice, 10(8), 930–966. Tankard, J. W., Jr. (2001). The empirical approach to the study of framing. In S. D. Reese, O. H. Gandy, & A. E. Grant (Eds.), Framing public life: Perspectives on media and our understanding of the social world (pp. 95–106). Mahwah, NJ: Lawrence Erlbaum Associates. Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29, 24–54. doi:10.1177/0261927X09351676.

Tenenboim-Weinblatt, K., & Baden, C. (2021). Gendered communication styles in the news: An algorithmic comparative study of conflict coverage. Communication Research, 48(2), 233–256. doi:10.1177/0093650218815383.
Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406–418.
Thorson, K., Driscoll, K., Ekdale, B., Edgerly, S., Thompson, L. G., Schrock, A., Swartz, L., Vraga, E. K., & Wells, C. (2013). YouTube, Twitter, and the Occupy movement. Information, Communication & Society, 16, 421–451. doi:10.1080/1369118X.2012.756051.
Tillery, A. B., Jr. (2019). What kind of movement is Black Lives Matter? The view from Twitter. Journal of Race, Ethnicity, and Politics, 4(2), 297–323. doi:10.1017/rep.2019.17.
Trilling, D., & Jonkman, J. G. F. (2018). Scaling up content analysis. Communication Methods and Measures, 12, 158–174. doi:10.1080/19312458.2018.1447655.
Trumbo, C. (2004). Research methods in mass communication research: A census of eight journals 1990–2000. Journalism & Mass Communication Quarterly, 81, 417–436.
Twitter (2018). Search tweets. Retrieved August 10, 2018, from https://developer.twitter.com/en/docs/tweets/search/overview.html.
Unger, S. D., & Hickman, C. R. (2020). A content analysis from 153 years of print and online media shows positive perceptions of the hellbender salamander follow the conservation biology. Biological Conservation, 246, 108564. doi:10.1016/j.biocon.2020.108564.
Vashi, A., & Rhodes, K. V. (2011). “Sign right here and you’re good to go”: A content analysis of audiotaped emergency department discharge instructions. Annals of Emergency Medicine, 57, 315–322. doi:10.1016/j.annemergmed.2010.08.024.
Vogt, W. P. (2005). Dictionary of statistics and methodology: A nontechnical guide for the social sciences (3rd ed.). Thousand Oaks, CA: Sage.
Vorderer, P., & Hartmann, T. (2009). Entertainment and enjoyment as media effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed.) (pp. 532–550). New York: Routledge.
Walsh-Buhi, E., Houghton, R. F., Lange, C., Hockensmith, R., Ferrand, J., & Martinez, L. (2021). Pre-exposure prophylaxis (PrEP) information on Instagram: Content analysis. JMIR Public Health and Surveillance, 7(7), e23876.
Wang, X., & Riffe, D. (2010). An exploration of sample sizes for content analysis of the New York Times web site. Web Journal of Mass Communication Research, 20. Retrieved July 26, 2023, from https://wjmcr.info/2010/05/01/an-exploration-of-sample-sizes-for-content-analysis-of-the-new-york-times-web-site/#:~:text=This%20study%20found%20that%20a,enormous%20variations%20in%20Web%20content.
Wanta, W., Golan, G., & Lee, C. (2004). Agenda setting and international news: Media influence on public perceptions of foreign nations. Journalism & Mass Communication Quarterly, 81, 364–377.
Watson, B. R. (2014). Assessing ideological, professional, and structural biases in journalists’ coverage of the 2010 BP oil spill. Journalism & Mass Communication Quarterly, 91, 792–810. doi:10.1177/1077699014550091.
Watson, B. R. (2017). Murder she searched: The effect of violent crime and news coverage on residents’ information needs. Mass Communication and Society, 20, 241–259.



Weaver, D. A., & Bimber, B. (2008). Finding news stories: A comparison of searches using LexisNexis and Google News. Journalism & Mass Communication Quarterly, 85(3), 515–530. Weaver, D. H. (2003). Basic statistical tools. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 147–181). Boston, MA: Allyn & Bacon. Weaver, J. B., Porter, C. J., & Evans, M. E. (1984). Patterns in foreign news coverage on US network TV: A 10-year analysis. Journalism Quarterly, 61(2), 356–363. Webb, J. B., Vinoski, E. R., Warren-Findlow, J., Burrell, M. I., & Putz, D. Y. (2017). Downward dog becomes fit body, inc.: A content analysis of 40 years of female cover images of Yoga Journal. Body Image, 22, 129–135. Weber, R. P. (1990). Basic content analysis (2nd ed.). Newbury Park, CA: Sage. Whaples, R. (1991). A quantitative history of the Journal of Economic History and the cliometric revolution. Journal of Economic History, 51, 289–301. Wilson, C., Robinson, T., & Callister, M. (2012). Surviving Survivor: A content analysis of antisocial behavior and its context in a popular reality television show. Mass Communication and Society, 15(2), 261–283. Wimmer, R. D., & Dominick, J. R. (2003). Mass media research: An introduction (7th ed.). Belmont, CA: Wadsworth. Wimmer, R. D., & Dominick, J. R. (2011). Mass media research: An introduction (9th ed.). Belmont, CA: Wadsworth. Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Toronto: Nelson Education. Woolley, J., & Peters, G. (2016). 2016 general election editorial endorsements by major newspapers. Retrieved August 8, 2023, from http://www.presidency.ucsb.edu/ data/2016_newspaper_endorsements.php. Wrightsman, L. S. (1981). Personal documents as data in conceptualizing adult personality development. Personality and Social Psychology Bulletin, 7, 367–385. Wu, L. (2015). How do national and regional newspapers cover post-traumatic stress disorder? A content analysis. Paper presented at AEJMC Annual Convention, San Francisco, CA, August. Yarchi, M., Baden, C., & Kligler-Vilenchik, N. (2021). Political polarization on the digital sphere: A cross-platform, over-time analysis of interactional, positional, and affective polarization on social media. Political Communication, 38(1–2), 98–139. doi: 10.1080/10584609.2020.1785067. Yi-Fan Su, L., Xenos, M. A., Rose, K. M., Wirz, C., Scheufele, D. A., & Brossard, D. (2018). Uncivil and personal? Comparing patterns of incivility in comments on the Facebook pages of news outlets. New Media & Society, 20, 3678–3699. doi:10.1177/1461444818757205. Zamith, R. (2017). Capturing and analyzing liquid content: A computation process for freezing and analyzing mutable documents. Journalism Studies, 18, 1489–1504. doi:1 0.1080/1461670X.2016.1146083. Zamith, R., & Lewis, S. C. (2015). Content analysis and the algorithmic coder: What computational social science means for traditional modes of media analysis. Annals of the American Academy of Political and Social Science, 659, 307–318. Zeldes, G., & Fico, F. (2010). Broadcast and cable news network differences in the way reporters used women and minority group sources to cover the 2004 presidential race. Mass Communication and Society, 13(5), 512–514.

Zhao, X., Feng, G. C., Liu, J. S., & Deng, K. (2018). We agreed to measure agreement: Redefining reliability de-justifies Krippendorff’s alpha. China Media Research, 14(2), 1–15.
Zhao, X., Liu, J. S., & Deng, K. (2012). Assumptions behind intercoder reliability indices. In C. T. Salmon (Ed.), Communication yearbook 36 (pp. 419–480). New York: Routledge.
Zullow, H. M., Oettingen, G., Peterson, C., & Seligman, M. E. P. (1988). Pessimistic explanatory style in the historical record: CAVing LBJ, presidential candidates, and East versus West Berlin. American Psychologist, 43, 673–682.

Index

abstractness, scientific 22 access to data 66–8, 113–14 accuracy, as dimension of reliability 134–5 Adams, J. B. 103 Adams, K. 82 agenda-setting, media and 22, 40, 161 Ahern, L. A. 109–10 algorithmic sampling 114 algorithmic text analysis (ATA) 16, 57, 58–60; advantages and disadvantages 61–2, 69–70; applications and best practice for 62, 69; versus content analysis 59–60, 70; data preparation 60; and data validation 59–62; false positives 60; motivation behind 63–4; programmed rules 16; and visual media 62 Allen, C. J. 10 Allport, G.: Letters from Jenny 18 alpha coefficients 139–40, 144–6 see also Krippendorff’s CAlpha Altschull, J. H. 6 Álvarez, D. 30 ambiguities, reducing 153–4 analysis of variance (ANOVA) 176–7 Anderson, M. 33 antecedents of content 9–10, 12 Application Programming Interfaces (API) 65, 66–8, 114, 115 Ardèvol-Abreu, A. 40 artificial intelligence (AI) 27, 62–3 Aslam, S. 107, 115 assumed level of population agreement (P) 138, 138 assumptions, statistical 190

augmented reality (AR) 27 aural content 74 Aust, C. F. 105 Austin, E. W. 166 automated text analysis see algorithmic text analysis (ATA) Auxier, B. 33 Baade, R. C. 10 Babbie, E. 46, 71 Baden, C. 17, 59 Baek, J. 29 Baldwin, T. 95, 105 Ball-Rokeach, S. J. 40, 160 Bantz, C. R. 10 Barros, A. 140 Basch, C. E. 114 Basch, C. H. 29, 114 Bastien, F. 2, 14 Bauer, R. A. 7, 8 Beam, R. A. 10, 87 Bennato, D. 74 Bennett, L. W. 10 Berelson, B. 7 Berkowitz, D. 10 Bernts, T. 59 Berrey, L. E. 29, 57, 114 beta coefficients 188 Bialik, C. 33 bias: coder 119; in data 95–6; minimizing 156, 162; periodicity, systematic sampling and 101 big data 16, 57, 62 Bimber, B. 112 binary attributes, in classification system 82–3

Index Bissell, K. 176 Biswas, M. 8 bivariate data analysis 4 Black Lives Matter (BLM), coverage of 57, 75, 76, 158 Black media coverage 1–2 Blair, J. A. 22 Blalock, H. M., Jr. 181, 191 Blatchford, A. 33 Boehmer, J. 109 Boomgaarden, H. 14 bootstrapping 145 Bortree, D. S. 109–10 Boukes, M. 82 Bouvry, S. 144 Boyle, K. 10 Bracken, C. C. 135, 146 Brayden, W. C. 19 Bright, J. 82 Brock, T. C. 9 Brosius, H. 14–15 Bruns, A. 114 Bryant, J. 9 Büchel, F. 3 Buckley, K. 33 bullet, persuasive message 7 Burkum, L. 107 Butts, C. T. 114 Caburnay, C. A. 105 Callister, M. 82 Campbell, D. T. 157 Cantor, J. 9 Cantril, H. 6 Cardona, A. 41 Carey, J. W. 7 Carley, K. M. 114 Carpenter, S. 22, 109, 155 Carpentier, F. R. 2 Carvalho, J. P. 93 case values 98–9 categories 84–8, 124 category definitions 119, 124, 131 causal models 185–7, 186 causal relationships: causal models 186, 186–7; conditions for 42–5; in content analysis 161; and correlation 42–3, 44; feedback loops 189–90; identifying 39; rival explanations 45–6, 47; statistical


analysis of 161; time order 43–4, 44 census sampling technique 94–5, 139 centrality of content 11–14, 12, 23–4, 35 central limits theorem 97 Ceron, A. 63 Chaffee, S. H. 7 chance agreements 140–4, 147 Chang, C. Y. 171 Chang, T. K. 105 Chang, Y. H. 171 chatbots 27 Chau, M. 111 Chen, H. 111 chi-square: Cramer’s V and 169, 179–81; formula 180; relationships and 184 Cho, Y. J. 109 Clark, M. 62 classification systems 80–5; Deese’s typology 81; requirements 83–5; units of analysis and 87–8 cluster analysis 171 cluster sampling 102–3 coder disagreements, sources of 130–3 coder reliability 53 coders 119; bias 119; fatigue 122; human 57, 60–2; number of 129–30 coder training 129–30, 134–5 coding see also algorithmic text analysis (ATA); computers in content analysis: operationalizations 26–7, 62; process 129–30; and reliability testing 139–40 coding instructions 83, 90, 119, 124 coding protocols 71, 90, 118–19, 124–6, 128, 147; Appendix A: Sample Protocol 192; human coding protocol 60, 62 coding sheets 90–1, 126–8, 128, 128, 129 Coe, K. 29 coefficient of determination (multiple r -squared statistic) 188–9 coefficients, standardized (beta) 188 Cohen, J. A. 143, 144 Cohen, M. S. 15, 105, 155 Cohen, S. 10 Cohen’s kappa 140, 143–4, 146, 148–9 Collins, M. K. 2 Comfort, S. E. 114



communication content 1, 18, 35; analyzing 67; appropriate and meaningful 28; centrality of 11–14, 12, 23–4, 35; descriptive 13–15, 30–1; digital 64–5, 67; film and video 28, 75; five levels of influence 55; interpersonal exchanges 28; meaningful 23; nonverbal 28; population of 52; symbols of 27–8, 33–4; text units 28; verbal 74; visual 28, 74; written 73–4 Communication Monographs 13, 136, 141, 150 communication processes 28 communication research 5–14 communicators 1 computer-aided text analysis (CATA) 59, 64 computers in content analysis 21, 33, 64 see also algorithmic text analysis (ATA); computer-aided text analysis (CATA); bestpractice requirements 69; big data 17, 57, 62; considerations 58; dynamic web content, freezing 64; examples 16–17; and human coders 57, 60–2, 62, 69–70; and human language 58; hybrid approaches 17, 57–60, 69; organization of coding tasks 16–17, 64–5; programmed rules 16; uses of 16–17, 57–8 concepts see also operationalizations: complexity of 120, 122; defining 25, 120; intersubjectivity 22; manifest versus latent content 121–2; operationalizations 25–7; relationships between 22 conceptualization in content analysis design 36, 37–8, 49–50, 114–15 concordance 62 concurrent validity 154, 155 confidence levels 137–8 Connolly-Ahern, C. 109–10 consecutive unit sampling 96–7 constructed week / month sampling 105–6, 109–10 construct validity 154, 156 consumer magazines 106

content see also communication content: antecedent 9–10, 12; categories 154, 166; characteristics 81; as consequence 10–11; and context 23; forms of 73–6; manifest versus latent 33–4, 165–6; multiple sources 14–16; nature of 4, 165; social validity of 163–4; types, in content analysis 72–3 content analyses, designing 54–6; conceptualization 36, 37–8, 49–50; content analysis only designs 39; content as independent variable 40; correlation and causation 42–5; five levels of influence 54–6; formal design 51–3; good versus bad design 46–7, 51; influences on content 40; macro versus micro 39, 48; materials for analysis 50–1; model for 48–9, 49 table; “one-shot” studies 48; results 53–4; stages of 36–7; use of (types) 39 content analysis see also computers in content analysis; quantitative content analysis: versus algorithmic text analysis (ATA) 59–60, 70; applications of 17–20; causal inferences 157; causation in 161; control in 159–60; cross-country 132–3; as datagenerating process 23; defined 1, 24; importance of 8–9; models in 159–60; new areas (artificial intelligence) 27; relevance in 165; as social science tool 21, 23, 35; statistical validity of 161; time order 160–1; tools for 68–9; validity 157; variables 39, 42–5, 52; when to use 35 content analysis of verbatim explanations (CAVE) 18 content analysis protocols 34–5, 52, 81, 122–4, 126, 129–31 content analysts, challenges facing 66 content data, analyzing see data analysis content-effect relationships 161 content units in content analysis 72 context: concordance 62; and content 23, 165; false positives 60;

Index Keyword-in-context (KWIC) 62; syntactical units of observation 79 context analysis, as research technique 32 contingent conditions, of communication 8 control, in content analysis 159–60 convenience samples 95–6 Conway, M. 58 Coral-Reaume Miller, A. 166 Correa, T. 10 correction coefficient 112–13 correlation see also rank-order correlations: and causal relationships 42–3, 44; coincidental 43; F-test of statistical significance 185, 189; perfect 184; scatter diagrams 183; spurious 43, 185 correlation coefficient 184 Cortese, D. K. 52 Cote, W. 131 counting, data 171 covariance 177, 182 Cox, J. B. 76 Coyne, S. M. 107 Craft, S. H. 10 Craggs, R. 144, 146 Cramer’s V 179, 180–1, 184 Crimson Hexagon 68–9 cross-sectional studies 93 cultivation research 40 Curini, L. 63 curvilinear relationships 183–4 Czarnecka, B. 19 Dalrymple, K. E. 29 Danielson, W. A. 93, 103 data: access to 66–8; comparison of (in research design) 47; describing 171–7; quality 33, 53; quantity 33, 34, 52; recording 90; relevance 60 data analysis 51, 168, 191; common tools 168–9; describing data 171–7; dummy tables 51, 51; goal of 169; statistical inference 170, 173–4; summary measures 176–7; univariate / bivariate / multivariate 4 databases: sampling 111–13; searching 33 data collection 53, 158 data mining 64

223

data privacy regulations 67–9 data sets 66 Davis, A. C. 93 Davison, K. K. 148 Deese, J. 81 definitions: categories 124; conceptual 120; operational 83, 120, 123–4; problems 131; theoretical 83 DeFleur, M. 6 Deng, K. 146 dependent variables 188 descriptive adjectives 88 descriptive content analysis 13–15, 30–1; to draw inferences 31–2 De Swert, K. 145 De Vreese, C. H. 14, 133 Di Cicco, D. T. 96 Dick, S. J. 116 dictionary-based approaches 62 differences: among groups 175–6; analysis of variance (ANOVA) 176–7; statistical significance of 173–5 digital content see online content digital distribution 108 digital media 74 Dill, R. K. 10 dimensional order, in classification systems 85 disproportionate sampling 102 Dominick, J. R. 12–13, 25, 101, 135, 156 Döring, N. 41, 75 Dougall, E. 109 Drager, M. 10, 106 van Driel, I. I. 2 Druckman, J. N. 29 Duffy, K. 19 Duffy, M. 155 Duffy, M. J. 10, 14 dummy tables 51, 51 dummy variables 89, 187 Duriau, V. J. 18 dynamic web content 64–6 Eddy, K. A. 15, 155 Elmasry, M. H. 28 el-Nawawy, M. 28 empirical relevance 22 Engesser, S. 3 enumeration, rules of 88–9

224

Index

Ernst, N. 3 error tables 173 Esser, F. 3 estimate of actual agreement 136–7, 137 Evans, M. E. 106 Everbach, T. 10 expected agreement 143, 147–8 experimental design, and validity 157 external validity 162; in content analysis 157–8; scientific validation 162–3; social validity 163 Facebook 107; Application Programming Interface (API) 3, 66; profiles 45, 75; sampling from 3, 40, 76, 88–9, 96, 120, 176 face validity 154–5, 166 Famulari, U. 75 Farré, J. 114 feedback loops, in relationships 189–90 Feng, G. C. 146, 146–7 Fera, J. 29 Fico, F. 10, 78, 109, 121, 123, 131, 132, 140, 160, 172, 186 Fidell, L. S. 187, 190, 191 film content 28 finite population correction 99 Fitzpatrick, K. 2 Fontenot, M. 10 frames/framing 1–2, 8, 33, 75 Franzini, A. 107 F-ratio 176–7 Freelon, D. 62, 66–7, 145 Frehmann, K. 27 Freitag, A. 11, 13, 16, 31, 95 Fritz, N. 2 F-test of statistical significance 184–5, 189 Fullerton, J. 2 Gallagher, A. H. 10 Gandy, O. H., Jr. 8 Gaudet, H. 6, 7 General Data Protection Regulation (GDPR) 67 Gerbing, D. W. 156 Gerbner, G. 40 Ghosh, S. 115 Giglietto, F. 74 Gil de Zúñiga, H. 40, 42 Giles, C. L. 114

Gjoka, M. 114 Gobetz, R. H. 106 Golan, G. 2, 15 Goldson, H. 13 González, A. 30 Grabe, M. E. 29 Granados, A. D. 8, 13 Grant, A. E. 8 Green, M. C. 9 Greenberg, B. S. 13, 148 Greenhalgh, G. P. 81 Greenwell, T. C. 81 Griffin, R. A. 29 Grimmer, J. 60, 61 grouping, in classification system 81 Grube, J. W. 40, 160 Guan, M. 29, 57, 114 Guzman-Ortega, S. 160 Gwet, K. L. 146, 148–9 Haas, A. 14–15 Hak, T. 59 Hale, B. J. 29 Hambrick, M. E. 81 Hameleers, M. 40 Hamilton, J. M. 10 Hammond, P. 29, 57, 114 Harlow, S. 3, 9, 14 Harp, D. 10 Harrison, K. 13 Hartmann, T. 9 Hayes, A. F. 145, 146 Heiss, R. 3, 14, 114 Hendrickx, J. 57, 59 Heo, J. 29 Hermida, A. 16, 57, 59, 90, 129 Herring, S. C. 110–11 Hertzog, H. 6 Hester, J. B. 2, 42, 109, 114 Hickman, C. R. 19 hierarchical linear modeling (HLM) 53 high agreement/low reliability phenomenon 148 Hillyer, G. C. 114 Hochheimer, J. L. 7 Holsti, O. R. 32–5, 46, 83, 154, 155 Holsti’s coefficient (percentage of agreement) 140, 141–2, 148 Hon, L. C. 3, 10 Hornik, R. C. 60 Hovland, C. I. 6

Index Hughes, J. 145 Hum, N. J. 75, 181 human communication 21 human language 58 see also natural language human language processing 62 Hunter, J. E. 156 Hust, S. J. T. 166 hypotheses: conditional statements 41; content forms 73–4; preregistration of 41; in research 40–3, 48, 169–70; testing 25 Iacus, S. M. 63 Im, D. S. 93 independence, in classification systems 84–5 individual communication sampling 116 inferences, drawing 24, 64; appropriateness 32; causal 157; from data analysis 170; through descriptive content 31–2; for time periods 93 influences, on content 40 innovation, diffusion of 22 Instagram 41–3, 79–80, 94, 114 institutional review boards (IRBs) 67 inter-coder reliability 71, 133–4, 140 internal (causal) validity 157, 158–9, 163 Internet content 29 see also online content; changing nature / variability 110; constructed week / month sampling 109–10; newspaper websites 109; press releases 109; sampling 95, 108–9; sampling efficiency 109–10; sampling frame 108–10; search engines 111 interpersonal exchanges, for content analysis 28 intersubjectivity 22 interval measures 85, 87, 88, 89, 98 interviews, survey units 72 intra-class correlation 102 intra-coder reliability tests 140 Jamal, A. A. 68–9 Janowitz, M. 6 Johnson, M. A. 28 Johnson, R. H. 22 Jones, N. P. 82 Jones, T. J. 15 Jonkman, J. G. F. 69

225

Joo, J. 62 Jordan, A. 107 Joseph, K. 114 Joshi, V. 133 Journalism & Mass Communication Quarterly 11–13, 31, 95, 136 Journal of Broadcasting & Electronic Media 12 Journal of Economic History 18 Kaid, L. L. 135 Kalton, G. 99, 102–3 Kamhawi, R. 31 Kang, M. H. 109 kappa 143, 144 Karlsson, M. 108 Kaufman, G. F. 9 Kelly, S. 22, 23 Kendall’s tau 181–2, 184 Kendrick, A. 2 Kensicki, L. J. 33 Keohane, R. O. 68–9 Kerlinger, F. N. 22, 23, 27, 34, 46–7 Keyword-in-context (KWIC) approaches 62 keywords: for database searches 111, 112; precision 60; for sampling frame 109 Ki, E. 3, 10 Kifer, M. J. 29 Kilgo, D. K. 3, 9, 14, 57, 74 Kim, J. 29 Kim, J. K. 109 Kim, S. 10, 15, 155, 182 Kim, S. H. 93, 109 Kim, Y. 40 Kim, Y. M. 15 King, Professor Garry 68 Kligler-Vilenchik, N. 59 Klijn, E. H. 16–17 Kornfield, R. 62–3 Kosicki, G. M. 5 Kraemer, H. C. 148 Krajewski, J. M. T. 29 Krakowiak, M. 9 Kresovich, A. 2 Krewel, M. 145 Krippendorff, K. 7, 23, 24, 78, 133, 139, 144, 144–7, 146, 146–7, 148, 149, 154, 165, 198; semantical validity (1980) 165

226

Index

Krippendorff’s CAlpha 139, 140, 144–5, 146, 148–50 Kruikemeier, S. 161 Kurant, M. 114 Kutz, D. O. 110–11 Labre, M. P. 14 Lacy, S. 1, 10, 13, 16, 18, 18, 18–19, 32, 48, 53, 59, 81–2, 87, 89, 105, 106, 107, 111, 121, 123, 132, 136, 139, 140, 149, 153–5, 155, 160, 186, 189, 197 Landwehr, P. M. 114 Lasorsa, D. L. 21, 93, 120 Lasswell, H. D. 6 latent content 33–4, 121–2, 132–3, 165–6 Lauf, E. 133 Law, C. 14 Lawrence, R. G. 10, 41 Lawrence, S. 114 Lazarsdfeld, P. F. 7 Lee, C. 15–16 Lee, H. B. 22 Lee, J. 29 Lee, S. 3, 14, 161 Letters from Jenny (Allport, 1965) 18 Levine, R. A. 146 Levine-Donnerstein, D. 146, 148 Lewis, S. C. 16–17, 57, 59, 62–3, 64, 65, 90, 129 LexisNexis 111–12 Li, Y. 29, 57, 114 Liang, X. 114 Liao, T. 27 linear relationships 183–4 Linguistic Inquiry and Word Count (LIWC) 62–3, 68 lists, as sampling frame 102, 110 Liu, H. 114, 146 Liu, J. S. 146 logic 22 logical rigor 22 logistic regression 187 Lombard, M. 135, 146 longitudinal studies 34, 93 Lovejoy, J. 1, 13, 18, 18, 59, 136, 140, 141, 150, 160, 197 Lowery, S. A. 6 Luke, D. A. 105 Luque, S. 8 Lynch, T. 2

machine learning 63, 111; supervised machine learning (SML) 16–17 magazines, print 106 Mahrt, M. 33, 94 Manganello, J. 107 manifest content 33–4, 62; in concepts 121–2; establishing validity 165–6; quantitative content analysis, advantages of 34; symbols 122 Markopoulou, A. 114 Martínez-Carrillo, N. I. 109, 111 Martins, N. 13 mass communication messages 116 Mastro, D. 8 Mastro, D. E. 13, 148 Mathet, Y. 144 Matthes, J. 3, 14, 114 Mayer, K. 115 McAninch, K. 61 McCluskey, M. 15 McCombs, M. 15, 22 McCombs, M. E. 40 McCorkle, S. 10 McEwan, B. 22 McGregor, S. C. 41, 42 McIlwain 2018 62 McLeod, D. M. 5, 24 McLeod, J. M. 5 McLoughlin, M. 19 McMillan, S. J. 110 mean 170, 171; distribution around 187; population mean 97–8; sample mean 97–9; standard error of 98; tests of difference 174–6 meaning units of observation 78–9, 83 measurement: accuracy 145; defined 71; failure 71; reliability of 71; in social science 71; steps 89; summary of 91; validity of 71 measurement levels 85–8 measurement validity, types of 154–6 measures, content category 154–6 measures of association 189; chi-square 179–81; Cramer’s V 179, 180–1, 184 media: agenda-setting of 22, 161; stratified sampling 104, 104–6, 109–10 median 171 media organizations 55 media workers 55

Index Meleo-Erwin, Z. C. 29 Mello, J. P., Jr. 67 messages: exposure to 8; impact of 6–8; mass communication 116; persuasive 6–8 Meta 113 Meybodi, M. R. 114 micro-longitudinal sampling 110–11 Mneimneh, Z. 113 mobile content 94 models, in content analysis 159–60 Moen, M. C. 20 Mogaji, E. 19 Molitor, F. 8 Moore, K. 61 Morgan, M. 40 Morstatter, F. 114, 115 Moser, C. A. 99, 102–3 Mourão, R. R. 57, 74 Mozie, D. 31 multi-case coding sheets 127–8 multiple regression 187–90 multiple r -squared statistic (coefficient of determination) 188–9 multistage sampling 103–4, 110 multivariate analysis 4 Muñoz, C. L. 39 mutual exclusivity, in classification 83–4 Nagovan, J. 107 natural language processing 22, 58, 63–4 Neff, T. 59 network television 106–7 Neuendorf, K. 59 news content 14–15, 105 see also network television; local 95, 155; microlongitudinal sampling 110–11; online 109–10, 119; quality of 153; sampling efficiency 109–10 news media 33; stratified sampling methods 104, 109–10 newspapers 96, 104–6; sampling efficiency studies 105–6; stratified sampling 104, 104–6, 109–10 Nielsen, R. D. 63 Nili, A. 140 Noblet, C. L. 19 Noe, F. P. 19 nominal measures 85–8, 88, 98, 170 non-parametric procedures 88 non-probability sampling 95–6

227

non-text communication content 75–6 non-verbal communication content 28 null hypothesis 173–6 numeric values: assigning 29–30, 83–4, 90; nominal measures 85–6, 98; ordinal scales 172; rules for 30 objectivity, as trait of science 25 observational processes, validity in 156–9 observation levels 77 Oettingen, G. 18 Oliver, M. B. 9 Olson, B. 75 online content see also dynamic web content; Internet content: across time periods 94; content variations 104, 104 table 6.1; data privacy regulations 67–8; digital delivery of 107, 108; discussion forums 116; establishing external validity 157–8; for mass consumption 108; sampling 107, 107–9, 115–16, 158; timestamps 94; unstructured 65–6 operational definitions, of variables 83, 120, 123–4 operationalizations 25–7, 62 see also concepts Opperhuizen, A. E. 16–17 ordinal measures 85, 86–7, 88 ordinal scales 172 Oswald, L. 82 “other” category, in classification 84 oversampling see stratified sampling Palguna, D. S. 114 Paltoglou, G. 33 parametric procedures 88 Parde, N. 63 Parkin, M. 29 partial correlation coefficients 184 Payne Fund Studies 6 Pearson’s product-moment correlation (r) 87–8, 145, 182–5 peer-review process 162–3 Pennebaker, J. W. 63 Pennock, D. M. 114 percentage of agreement 140, 141–2, 148 periodicity, systematic sampling and 101 persuasive messages 6–8 Peter, J. 133

228

Index

Peterson, C. 18 Pettiway, K. M. 28 Pfarrer, M. D. 18 Pfeffer, J. 114, 115 physical units of observation 77–8, 83 Pilny, A. 61 Pinkleton, B. E. 166 Poeschl, S. 41, 75 political communication 2–3, 8, 15, 28, 38, 61; 2016 US election 82; multistage sampling 103; via social media 29, 38–40, 68 population 92–3; of content 136–8; finite population correction 99; homogeneity of 99; proportion in sample 99; samples 52, 136; standard error of proportion 136 population distributions 88, 99 population mean 97–8 population value 98 Porter, C. J. 106 Potter, W. J. 31, 40, 146, 148 Prabowo, R. 33 Pratt, C. A. 19 Pratt, C. B. 19 predictive validity 154, 155 probability 142 probability sampling 52, 92, 94, 97–9, 103, 136; for reliability testing 136 programmed rules (ATA) 16 prominence, concept of 120 propaganda, as persuasive message 6 proportion 172; estimating 98; sampling error for 172–3, 177; tests of difference 174–7 proportionate sampling 101 protocol reliability 135, 139 protocols 147; Appendix A: Sample Protocol 192; coding 71, 90, 118–19, 124–6, 128; content analysis 34–5, 52, 81, 122–4, 126, 129–30; human coding 60, 62; impact on reliability 134; problems 131; reliability testing of 147 purposive sampling 93, 95, 96–7 quantitative content analysis: defined 4, 23–4; manifest versus latent content 33–4; numeric values 29–30; for social science 32–3, 166; strengths and criticisms 32–3

quantitative measurement 24 Quarfoot, D. 146 questions, in research design 38, 40–2, 48–9 Ramaprasad, J. 106 Randle, Q. 106 random sampling 92, 93, 100; multistage 103–4; for reliability samples 135–6; simple 100, 102, 107, 109; stratified sampling 101–2, 104 table 6.1, 104–6, 109–10; in survey research 156; systematic 100–1, 102; and units of observation 79 rankings see also ordinal measures: in classification systems 84–5; ordinal scales 172 rank-order correlations 181–2; Kendall’s tau 181–2, 184; Spearman’s rho 181 Rapoport, A. 71 Ratan, R. A. 13 ratio measures 85, 87–9, 88, 98 reality TV shows 82 ReCal (software) 145 Reddit 25, 82 Reese, S. D. 8, 10, 11, 40, 54–6, 110 Reger, R. K. 18 regression analysis 188–9; dummy variables 89, 187; logistic regression 187; multiple regression 187–90; two-stage multiple regression 190 regression coefficient 188 Reif, A. 41, 75 relationships: in classification system 82; covariance 177, 182; curvilinear 183; feedback loops 189–90; linear 183; measures of association 179, 189 table 9.2; perfect 184; in social science 177–80; statistical significance of 180–1, 184, 189; strength of 178–80; structural equation modeling (SEM) 190 relevance: in content analysis 165; empirical 22; as variable 60 reliability 118, 167 see also bias; coder training; assessing 133–6; and concept complexity 120, 122, 132; defined 118; definition of
146–7; importance of protocol 134, 147; inter-coder reliability 133–4; latent versus manifest content 121–2, 132–3; reliability processes 136; reporting of 18–19, 136, 139; three dimensions of 133–4 reliability coefficients 140–50; calculation software 145; chance agreements 140–4, 147; Cohen’s kappa 144, 146, 148–9; controversy about 146–8; expected agreement 147–8; Gwet AC 148; high agreement/low reliability phenomenon 148; Krippendorff’s CAlpha 146, 148–50; multivalue variables 146; percentage of agreement 140, 141–2, 148; Scott’s pi 146, 148–9; selecting 149–50 reliability samples 135–6, 145 reliability testing 133–50; during coding phase 139–40; content sample size 135–8, 140; content units 138, 139, 151; correlation coefficients 145; estimate of actual agreement 136–7; pretesting 139; of protocols 147, 150; reporting 150; sampling method 135–6; Scott’s pi 140, 142–3; summary 150 replication 67–9, 90; definition of variables 83; as trait of science 25–7 reproducibility see also replication: as dimension of reliability 134 research design 36, 38, 54–6 see also content analyses, designing; comparison of data 47; correlation and causation 42–5; defined 46; drawing inferences 32; elements of 47; five levels of influence 54–6; formal design 51–3; good versus bad 46–7, 51; illustrating 46; materials for analysis 50–1; model for 48; “one-shot” studies 48; purpose of 32; results 53–4 researcher: analysis of latent content 166; role of 162 research hypotheses 40–3, 48, 169–70 research questions 38, 40–1, 169–70; defining 41–2, 48, 49 research techniques, content analysis 32–3 results, interpreting 53–4

Reynolds, A. 22 Reynolds, P. D. viii, 12, 21–2, 39 Rezvanian, A. 114 Rhodes, K. V. 74 Rickard, L. N. 19 Riffe, D. 1, 2, 3, 10, 11, 13, 14, 15, 16, 16, 18, 18–19, 31, 32, 42, 53, 59, 95, 105, 106, 107, 109, 112, 136, 139, 140, 155, 160, 182, 197 right-wing populism (RWP), in social media 3 Robinson, K. 82, 106 Rogers, E. M. 22 Rokeach, M. 40, 160 Romney, D. 68–9 Rosar, U. 27 Rosenstiel, T. 153–4 Roskos-Ewoldsen, D. 9 Rossi, L. 74 Roush, W., Jr. 68 Rowling, C. M. 15 R-square 184 Rubin, A. M. 8 Rusmevichientong, P. 114 Russell, Bertrand 153 sample mean 97–8 samples: biased representations 95; defined 92; standard deviation of 98; variability of case values 99 sample size 66, 98–9; standard error of proportion 136–7 sampling 92; cluster 102–3; convenience 95–6; cross-sectional studies 93; databases 111–13; digital content 107–16, 158; individual communication 116; machine learning 111; multistage 103–4, 110; non-probability 95–6; population 52; probability 52, 92, 94, 97–9, 136; proportion of population 99; purposive 93, 95, 96–7; random 92, 93, 100–1, 135–6, 156; social media content 113–15; for social validation 164; standard error of proportion 136–7; stratified sampling 101–2, 104–6, 109–10; summary of 116–17; time and content dimensions 99–100, 104–6, 160–1; time periods 93; and units of observation 79

sampling distribution 97–9, 174 sampling efficiency 105–6, 109–10 sampling error 92; calculating 97–9, 102, 104; computed sampling error 136–7, 139; error tables 173; intra-class correlation 102–3; for proportion 172–3; and units of observation 80 sampling frame 92–3; for Internet sampling 108–10; lists 102, 110; using keywords 109 sampling stages see multistage sampling sampling techniques 94; census 94–5, 139; consecutive unit sampling 96–7; probability 103, 136 Sapolsky, B. S. 8 Saxton, K. 13 scaling, in classification system 81–2 scatter diagrams 183, 183 Scharkow, M. 33, 94 Scheufele, B. 14–15 Scheufele, B. T. 10 Scheufele, D. A. 10 Schmuck, D. 40 Schouten, K. 16–17 Schumacher, A. C. 29 scientific method of content analysis 21–2, 24 scientific process 21 scientific research 22 scientific validation, of research 162–3 Scott, D. K. 106 Scott, W. A. 142 Scott’s pi 140, 142–3, 142–6, 146, 148–9 scraping tools 113–14 search engines 111 search strings 112–13 Seligman, M. E. P. 18 semantical validity (Krippendorff, 1980) 165 Sendra, A. 114 sentiment analysis 62, 68 Severin, W. J. 7 Seyidoglu, J. 28 Shaw, D. L. 15, 40 Sheets, P. 15 Shehata, A. 161 Shen, B. 176 Shils, E. A. 6 Shin, J. 32, 88

Shoemaker, P. J. 10, 11, 21, 40, 54–6, 110, 120 Shrum, L. J. 9 Signorielli, N. 40 Simmons, J. M. 81 Simon, T. F. 121, 186 simple random sampling 100, 102, 107, 109 single-case coding sheets 127, 128 Slater, M. D. 158 Slone, A. 61 Smith, B. 29 Smith, S. L. 8, 13 Snyder-Duch, J. 135, 146 Sobel, M. R. 2, 10, 42, 112, 182 social listening 68 social media content: accessing data 113–14; available data sets 113; conceptualization 114–15; data privacy regulations 68–9; diffusion of news 29; fake/troll accounts 67–8; health communications 29; political communication 29, 38–40, 68; sampling 107–8, 113–15; scraping tools 113–14; unstructured content 65–6; video content analyses 28, 75 social science: approach to knowledge 21–2; finding relationships 177–80; and quantitative content analysis 32–3, 166 social validity 158, 163–4 societal ideology, as influence on content 56 Soffin, S. 131 Song, H. 59, 60–1 Song, Y. 105 song lyrics, analysis of 2, 31 source diversity scale 81–2 Sparks, C. W. 8 Sparks, E. A. 8 Sparks, G. G. 8 Spearman’s rho 181 St. Cyr, C. 155, 160 stability, as dimension of reliability 133–4 standard deviation, of sample 98 standard error (SE): finite population correction 99; of mean 98; of proportion 98, 136–7; sample size, formula for 137

Stanley, J. C. 157 statistical assumptions 190 statistical inference 173–4 statistical procedures 88, 169 statistical validity 158 Staudt, A. 145 Stegner, W. 6 Steinert-Threlkeld, Z. C. 62 ‘stemming’ (of root word) 60 see also data preparation under algorithmic text analysis (ATA) Stempel, G. H., III. 9, 23, 24, 105, 108 Stewart, B. M. 60, 61 Stewart, R. K. 108 Stouffer, S. A. 46–7 stratified sampling 101–2; for media 104, 104–6, 109–10 streaming services 26 structural equation modeling (SEM) 190 Stryker, J. E. 60, 112–13 Su, L. Y. 57 summary measures 176–7 survey research 72 Sylvie, G. 57, 74 symbols: of communication content 27–8, 33–4; complexity of 132; latent versus manifest content 122 syntactical units of observation 78–9 systematic approach of content analysis 24–5 systematic sampling 100–1, 102 Tabachnick, B. G. 187, 190, 191 Tamul, D. J. 109, 111 Tankard, J. W., Jr. 7, 8, 21, 33–4, 120, 165 Tate, M. 140 Tausczik, Y. R. 63 technological change 21, 33, 66 Tenenboim-Weinblatt, K. 17, 59 text see communication content text units (for analysis) 28 Thelwall, M. 33 theoretical definitions, of variables 83 theoretically appropriate, defined 88 theory-building 22, 24–5 Thorson, K. 32, 75, 88, 115, 155 Thrasher, J. F. 109 Tichenor, P. J. 24 Tillery, A. B., Jr. 158

time order: in causal relationships 43–4, 44; in content analysis 160–1 time periods: inferences for 93; longitudinal designs 93; online and mobile content 94; sampling 93, 99–100, 104–6 timestamps (online content) 94 time units of observation 77, 77–8 Tingley, D. 68–9 Tompkins, J. E. 2 Towner, T. L. 39 transcripts 74 Trilling, D. 69 Tseng, Y. H. 171 t-statistics (difference of means) 174–7 Twitter: Application Programming Interface (API) 65, 114, 115; augmented reality (AR) tweets 27; data structure 65; “expert users” 115; popularity (measuring) 25; tweets, analyzing 2, 33, 63–5, 68–9, 74, 103, 114, 158 Ubani, C. 30 Unger, S. D. 19 units of analysis 28, 73, 80, 84 units of content 52 units of observation 73, 76–7; classification systems 80–1; physical versus meaning 77–9, 83; sampling concerns 79; sampling errors 80 univariate analysis 4 universe 92, 157–8 validation: latent versus manifest content 165–6 validity 167; and algorithmic coding 59–62; of content analysis 157, 159; defined 152–3; external 157–8, 162–3; internal (causal) 157, 158–9, 163; probabilistic nature of 164–6; in research process 157; role of researcher 162; scientific 162–3; social 158, 163–4; statistical 158; and technological change 33; types 157–63, 159 Valkenburg, P. M. 133 Van Remoortere, A. 57, 59 variable language 71

variables 71 see also classification systems; measurement; research design; in causal models 186 figure 9.2, 186–7; coding sheets 126–7; coincidental correlation 43; in content analysis 39, 42, 52; correlation versus causation 42–3; defining, coding instructions for 83, 119, 128, 131; degree of precision 73; dependent 188; dropped 150; identification of 72–3; independent 45; instructions for 123–4; nominal measures 85–6, 87; replication 83; “third variables” 45 variance: analysis of variance (ANOVA) 176–7; proportion shared by two variables (R-square) 184 Vashi, A. 74 verbal communication 74 video content 28, 75 video games 2, 13 virtual reality experiences 27 virtual shopping assistants 27 visual communication 74–5 visual media 62 Vliegenthart, R. 82 Vogt, W. P. 4, 5, 8 voice assistants 27 Vorderer, P. 9 Wadsworth, A. J. 135 Walsh-Buhi, E. 94 Wang, X. 109 Wanta, W. 10, 15–16 Watson, B. R. 1, 10, 13, 18, 18, 40, 59, 111–12, 140, 160, 197

Weaver, D. H. 31, 32, 40, 106, 112 Webb, J. B. 39 web content see dynamic web content Weber, R. P. 10, 190 Weeks, B. 40 Westerman, D. 22, 23 Whaples, R. 18 Widlöcher, A. 144 Williams, A. E. 10, 14 Williams, D. C. 13 Wilson, C. 82 Wimmer, R. D. 12–13, 25, 101, 135, 156 Wooldridge, J. M. 190 words, frequency and linkage 33 Wray, R. J. 60 Wu, H. D. 10 Wu, L. 112 Xenos, M. 161 Yanovitzky, I. 60 Yarchi, M. 59 Yi-Fan Su, L. 39 Young, J. 10 YouTube 75, 93, 114–15 Yu, Y. C. 13 Zamith, R. 16, 57, 59, 62, 65, 65–6, 90, 129 Zeldes, G. 78 Zhao, X. 146, 148 Zhou, S. 40 Ziegele, M. 27 z-statistics (difference of means and proportions) 174–7 Zullow, H. M. 18