Analyzing Media Messages: Using Quantitative Content Analysis in Research [4 ed.] 9781138613980, 1138613983


English | 235 pages | 2019


Table of contents:
Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Table of Contents
Preface
Chapter 1: Introduction
Communication Research
Content Analysis and Mass Communication Effects Research
Content Analysis and the Context of Production
The “Centrality” of Content
Description as a Goal
Research Applications: Making the Connection
Innovation and Expanding the Research Reach
Research Applications: Content Analysis in Other Fields
Summary
Chapter 2: Defining Content Analysis as a Social Science Tool
Adapting a Definition
Content Analysis Defined
Issues in Content Analysis as a Research Technique
Advantages of Quantitative Content Analysis of Manifest Content
Summary
Chapter 3: Computers and Content Analysis
Distinguishing Algorithmic Text Analysis (ATA)
Advantages and Disadvantages of ATA
When ATA Is Best Applied
Hybrid or Computer-Aided Content Analysis
“Scaling Up” Content Analyses
Summary
Chapter 4: Measurement
Content Units and Variables in Content Analysis
Content Forms
Units of Observation
Units of Analysis
Levels of Measurement
Measurement Steps
Summary
Chapter 5: Sampling
Sampling Time Periods
Sampling Techniques
Stratified Sampling for Legacy Media
Sampling Digital Content
Sampling Suggestions for Digital Media
Sampling Individual Communication
Summary
Chapter 6: Reliability
Reliability: Basic Notions
Variable Definitions and Category Construction
Content Analysis Protocol
Coder Training
Reliability Assessment
Reliability Coefficients
Summary
Chapter 7: Validity
The Problem of Measurement Reliability and Validity
Tests of Measurement Validity
Validity in Observational Process
External Validity and Meaning in Content Analysis
Summary
Chapter 8: Designing a Content Analysis
Conceptualization in Content Analysis Research Design
Good Design and Bad Design
A General Model for Content Analysis
Research Program Design
Summary
Chapter 9: Data Analysis
An Introduction to Analyzing Content
Fundamentals of Analyzing Data
Describing and Summarizing Findings
Finding Relationships
Statistical Assumptions
Summary
Appendix: Reporting Standards for Content Analysis Articles
Sampling
Coders, Variables, and Protocol
Reliability
References
Author Index
Subject Index


Analyzing Media Messages

Analyzing Media Messages, Fourth Edition provides a comprehensive guide to conducting content analysis research. It establishes a formal definition of quantitative content analysis; gives step-by-step instructions on designing a content analysis study; and explores in depth several recurring questions that arise in such areas as measurement, sampling, reliability, data analysis, and the use of digital technology in the content analysis process. The fourth edition maintains the concise, accessible approach of the first three editions while offering updated discussions and examples. It examines in greater detail the use of computers to analyze content and how that process varies from human coding of content, incorporating more literature about technology and content analysis throughout. Updated topics include sampling in the digital age, computerized content analysis as practiced today, and incorporating social media in content analysis. Each chapter contains useful objectives and chapter summaries to cement core concepts.

Daniel Riffe is Richard Cole Eminent Professor in Media and Journalism at UNC-Chapel Hill and former editor of Journalism & Mass Communication Quarterly. His research examines mass communication and environmental risk, political communication and public opinion, international news coverage, and research methodology. Before joining UNC-Chapel Hill, he was Presidential Research Scholar in the Social and Behavioral Sciences at Ohio University.

Stephen Lacy is Professor Emeritus at Michigan State University, where he studied content analysis and media managerial economics for more than 30 years in the School of Journalism and Department of Communication. He has co-written or co-edited five other books and served as co-editor of the Journal of Media Economics.

Brendan R. Watson is an Assistant Professor of Journalism Innovations at Michigan State University. His research examines the role of public affairs news/information in helping communities cope with social upheaval due to the increasing urbanization, globalization, and pluralism of postindustrial society. He also studies research methodology. He has taught graduate seminars in content analysis at MSU and the University of Minnesota, where he was previously on the faculty. He holds a Ph.D. in Mass Communication from the University of North Carolina at Chapel Hill.

Frederick Fico is Professor Emeritus from the School of Journalism at Michigan State University, where he studied and taught content analysis for more than 30 years. His research specialties are news coverage of conflict, including elections, and how reporters use sources, particularly women and minorities. His research explores the implications of empirical findings for values of fairness, balance, and diversity in reporting.

Routledge Communication Series

Jennings Bryant/Dolf Zillmann, Series Editors

Selected titles include:

The Business of Sports: Off the Field, in the Office, on the News, 3rd Edition
Mark Conrad

Advertising and Public Relations Law, 3rd Edition
Carmen Maye, Roy L. Moore, and Erik L. Collins

Applied Organizational Communication: Theory and Practice in a Global Environment, 4th Edition
Thomas E. Harris and Mark D. Nelson

Public Relations and Social Theory: Key Figures, Concepts and Developments, 2nd Edition
Edited by Øyvind Ihlen and Magnus Fredriksson

Family Communication, 3rd Edition
Chris Segrin and Jeanne Flora

Advertising Theory, 2nd Edition
Shelley Rodgers and Esther Thorson

An Integrated Approach to Communication Theory and Research, 3rd Edition
Edited by Don W. Stacks, Michael B. Salwen, and Kristen C. Eichhorn

Analyzing Media Messages: Using Quantitative Content Analysis in Research, 4th Edition
Daniel Riffe, Stephen Lacy, Brendan R. Watson, and Frederick Fico

For a full list of titles, please visit: https://www.routledge.com/Routledge-Communication-Series/book-series/RCS.

Analyzing Media Messages Using Quantitative Content Analysis in Research Fourth Edition

Daniel Riffe, Stephen Lacy, Brendan R. Watson, and Frederick Fico

Fourth edition published 2019
by Routledge
52 Vanderbilt Avenue, New York, NY 10017

and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2019 Taylor & Francis

The right of Daniel Riffe, Stephen Lacy, Brendan R. Watson, and Frederick Fico to be identified as authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilized in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

First edition published by Routledge 1998
Third edition published by Routledge 2013

Library of Congress Cataloging-in-Publication Data
A catalog record has been requested for this book

ISBN: 978-1-138-61397-3 (hbk)
ISBN: 978-1-138-61398-0 (pbk)
ISBN: 978-0-429-46428-7 (ebk)

Typeset in Sabon by Swales & Willis Ltd, Exeter, Devon, UK

Daniel Riffe: For Florence, Ted, Eliza, Bridget, Brynne, and Hank
Stephen Lacy: For I. P. Byrom, N. P. Davis, and A. G. Smith
Brendan R. Watson: For Joan and Maroun
Fred Fico: For Beverly, Benjamin, and Faith

Contents

Preface viii

1 Introduction 1
2 Defining Content Analysis as a Social Science Tool 20
3 Computers and Content Analysis 36
4 Measurement 47
5 Sampling 71
6 Reliability 98
7 Validity 132
8 Designing a Content Analysis 148
9 Data Analysis 168

Appendix: Reporting Standards for Content Analysis Articles 192
References 194
Author Index 213
Subject Index 219

Preface

The purpose of this book is to facilitate the development of a science of communication, in particular as it relates to mediated communication. A communication science is at the heart of all our social sciences because communication increasingly defines what we do, how we do it, and even who we are individually, socially, and culturally. In fact, never before in human history has mediated communication been so central, pervasive, and important to human civilization. A good communication science is necessary if humanity is to fully understand how communication affects us. Absent good understandings from such a communication science, we will always be at the mercy of unintended, unforeseen consequences.

But absolutely necessary to the development of a communication science is a means of logically assessing communication content. Broadly speaking, communication content varies based on a large set of factors that produce that communication. In turn, the variations in communication content affect a large set of individual, group, institutional, and cultural factors. In other words, understanding communication content is necessary and central to any communication science in which the goal is to predict, explain, and potentially control phenomena (Reynolds, 1971). More specifically, we believe the systematic and logical assessment of communication content requires quantitative content analysis, the topic of this book. Only this information-gathering technique enables us to illuminate patterns in large sets of communication content with reliability and validity, and through the reliable and valid illumination of such patterns can we hope to illuminate content causes or predict content effects.

We bring to this effort our experiences conducting or supervising hundreds of quantitative content analyses in our careers as researchers, examining content ranging from White House coverage, to portrayal of women and minorities in advertising, to the sources given voice in local government news. The content analyses have included theses and dissertations, class projects, and funded studies, and have involved content from sources as varied as newspapers, broadcast media, and social media. Some projects have been descriptive, whereas others have tested hypotheses or sought answers to specific research questions. They have been framed in theory about processes that affect content and about the effects of content.

If conducting or supervising those studies has taught us anything, it is that some problems or issues are common to virtually all quantitative content analyses. Designing a study raises questions about sampling, measurement, reliability, and data analysis. These fundamental questions arise whether the researcher is a student conducting her first content analysis or a veteran planning her twentieth, whether the content being studied is words or images, and whether it comes from social networking sites or a legacy medium.

In preparing this book for the fourth edition, we re-engage these recurring questions. Our goal is to make content analysis accessible, not arcane, and to produce a comprehensive guide that is also comprehensible. We hope to accomplish the latter through clear, concrete language and by providing numerous examples—of recent and “classic” studies—to illustrate problems and solutions. We see the book as a primary text for courses in content analysis, a supplemental text for research methods courses, and a useful reference for fellow researchers in mass communication fields, political science, and other social and behavioral sciences.

This fourth edition varies from the previous three because a new coauthor (Brendan R. Watson) has joined the team. His participation represents the reduction in scholarly activity by two of the authors, who are now emeritus, and his presence contributes a deeper understanding of the growing use of computers for a variety of activities in content analysis.

We owe thanks to many for making this book possible: teachers who taught us content analysis—Donald L. Shaw, Eugene F. Shaw, Wayne Danielson, James Tankard, G. Cleveland Wilhoit, and David Weaver—colleagues who provided suggestions on improving the book; and our students who taught us the most about teaching content analysis. Brendan learned content analysis by studying the second edition of this very book and doing content analysis with his mentors, with whom he is now coauthor. Finally, our deepest appreciation goes to our families, who often wonder whether we do anything but content analysis.

—Daniel Riffe
—Stephen Lacy
—Brendan R. Watson
—Frederick Fico

1 Introduction

Consider the diversity of these quantitative content analyses. Epps and Dixon (2017) examined Facebook sharing by 381 survey respondents in order to compare shared rap songs with Billboard’s top rap songs. Though respondents were familiar (70% reported “strong familiarity,” p. 474) with the chart-toppers, what they chose to share was an imperfect mirror of Billboard’s ratings: 73% of top songs involved sexual explicitness compared to 52% of those shared; 62% of top sellers objectified women compared to only 32% of shared songs; and 57% of top songs used derogatory words to describe women (e.g., “slut,” “dog,” “bitch”) compared to just 35% of shared songs.

Other researchers (Lynch, Tompkins, van Driel, & Fritz, 2016) looked at female character “sexualization” in video games across three decades, a time period encompassing the 1996 Tomb Raider game that introduced Lara Croft, a character scholars have described as highly sexualized yet strong, bold, educated, and capable (p. 569). While sexualization increased from 1992 to 2006, it declined from 2007 to 2014; moreover, Lynch et al. (2016) reported a persistent relationship across time between female character sexualization and character capability, a fact that may help “empower female gamers” (p. 578), even though female characters were more often in secondary than in primary roles (p. 580).

Johnson and Pettiway (2017) examined cultural projection on 46 African and African-American museums’ websites, recording the visual imagery, affordances, and tactics the museums used to express black identity, and called the websites “digital disruptors on an Internet mostly controlled by companies led by white men” (p. 371).

Two traditionally marginalized groups—women and protestors—were the focus of a four-and-a-half-decade (including before and after 1973’s Roe v. Wade case legalizing abortion) content analysis of New York Times and Washington Post abortion protest coverage (Armstrong & Boyle, 2011). Despite the “uniqueness of the issue to women, (and) to the feminist movement” (p. 171), men appeared more often as sources.

Challenging the dichotomy of audience members as consumers or citizens, Mellado and van Dalen (2017) drew three successive samples of Chilean newspaper articles (totaling more than 3,500), coding each for 19 indicators in order to confirm a three-dimensional categorization of news as serving civic, infotainment, or service functions for audiences.

A study of Norwegian local political campaigns (Skogerbo & Krumsvik, 2015) examined the influence of social media on “mediatization,” a process wherein “parties and politicians adapt their practices to formats, deadlines and genres that are journalistically attractive” (p. 350), thus allowing candidates to help set the political issue agenda. Following 21 candidates from seven different parties in five municipalities, the researchers gathered data on local newspaper coverage and on the content and linking (sharing or retweeting) in candidates’ Facebook and Twitter postings (total of 2,615 items). While candidates were active on social media, “there was surprisingly little evidence that social media content travelled [sic] to local newspapers and contributed to agenda setting” (p. 350). Similarly, Bastien (2018) compared issue agendas between newspaper coverage and transcripts from televised debates in five Canadian federal campaigns between 1968 and 2008. Reporting on the debates became increasingly “analytical and judgmental” and less “factual” in style: the presence of journalists’ opinions in paragraphs increased from 14% to 24% across the study period (p. 9). “On the other hand, the correlation between the agendas of both politicians and journalists is steady: the longer an issue is debated by the leaders, the more it is reported by journalists” (p. 15).

Visitors to the political blogosphere may assume that its news content is qualitatively different from mainstream media, which are often dismissed as partisan, pro-status quo, or slaves to advertisers. Leccese (2009) coded more than 2,000 links on six widely read political blogs, discovering that 15% looped readers back to another spot on the blog, 47% linked to mainstream media websites, and 23% linked to other bloggers. Only 15% linked to primary sources.

In order to examine how one political “tradition”—“going negative” with advertising—has fared in the 21st century, Druckman, Kifer, and Parkin (2010) analyzed more than 700 congressional candidate websites from three election cycles (2002, 2004, 2006), and compared website and television advertising negativity. Contrary to predictions (e.g., Wicks & Souley, 2003) that web advertising would be more negative, Druckman et al. (2010) found 48% of candidates went negative on the web, but 55% went negative in their television ads. By the 2008 campaign cycle (N = 402 sites), Druckman, Kifer, and Parkin (2014) were able to determine that non-incumbency and availability of consultant guidance were pivotal in how aggressively candidates used technologies on websites, an insight that has spurred subsequent surveys of “insiders” who ran congressional campaigns between 2008 and 2014 (Druckman, Kifer, & Parkin, 2017; Druckman, Kifer, Parkin, & Montes, 2017).

Lee and Riffe (2017) explored how corporations and an industry monitoring group focus media attention on corporate social responsibility (CSR) activities (e.g., efforts to improve the environment, the community, and employees). Data from 7,672 press releases from 223 U.S. corporations, 1,064 New York Times and Wall Street Journal articles, and ratings of corporations by a CSR monitoring group showed stronger relationships between ratings and news coverage than between press releases and news coverage. Companies may need to heed such monitoring groups, lest they learn of their own shortcomings through the media, and promote CSR topics that interest the media—corporate governance, consumer issues, diversity, and the environment. Earlier, Ki and Hon (2006) explored Fortune 500 companies’ web communication strategies, coding company sites’ ease of use, openness, and public access, as well as site promotion of firms’ CSR activities involving education, the community, and the environment, finding that few sites communicated effectively about CSR.

Systematic content analysis showed that Survivor, a long-running “reality” television program, routinely offered viewers high doses of antisocial behavior, with indirect aggression (behind the victim’s back) the most common (73% of antisocial behaviors), followed at 23% by verbal aggression and deceit at 3% (Wilson, Robinson, & Callister, 2012).

Although these studies differ in purpose, focus, techniques employed, and scientific rigor, they reflect the range of applications possible with quantitative content analysis, a research method defined briefly as the systematic assignment of communication content to categories according to rules, and the analysis of relationships involving those categories using statistical methods. Usually, such content analysis involves drawing representative samples of content, training human coders to use category rules developed to measure or reflect differences in content, and measuring the reliability (agreement or stability over time) of coders in applying the protocol. The collected data are then usually analyzed to describe typical patterns or characteristics or to identify important relationships among the content qualities examined. If the categories and rules are sound and are reliably applied, the chances are that the study results will be valid (e.g., that the observed patterns are meaningful). Though most of these procedures have remained constant over time, contemporary scholars are exploring new ways to utilize computers to complement human coding to deal with large amounts of text, explorations that are discussed below.

This skeletal definition deliberately lacks any mention of the specific goal of the researcher using quantitative content analysis (e.g., to test hypotheses about sharing rap videos), any specification of appropriate types of communication to be examined (e.g., corporate websites, video games, or political blogs), the content qualities explored (e.g., presence of a reporter’s opinion, use of profanity in a shared song, or negativity on candidate websites), or the types of inferences that will be drawn from the content analysis data (e.g., concluding that antisocial behavior goes unpunished in reality television). Such specification of terms is essential to a thorough definition. However, before a more comprehensive definition of this versatile research method is developed in Chapter 2, we offer an overview of the place of content analysis in mass communication research and examples of its use in other fields and disciplines.
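Because measuring the reliability of coders is central to the method just defined (and is treated fully in Chapter 6), a minimal sketch may help make the idea concrete. The Python example below computes simple percent agreement and Cohen’s kappa for two coders’ judgments of a handful of stories; the codes and category labels are invented for illustration and do not come from any study discussed here.

```python
# Minimal sketch: agreement between two coders on one nominal variable.
# The codes are hypothetical; real studies report coefficients such as
# those discussed in Chapter 6 (e.g., Krippendorff's alpha).
from collections import Counter

coder_a = ["negative", "neutral", "negative", "positive", "neutral",
           "negative", "positive", "neutral", "negative", "neutral"]
coder_b = ["negative", "neutral", "positive", "positive", "neutral",
           "negative", "positive", "negative", "negative", "neutral"]

n = len(coder_a)

# Percent agreement: share of units the two coders coded identically.
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Agreement expected by chance, from each coder's marginal distribution.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
categories = set(coder_a) | set(coder_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

# Cohen's kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Percent agreement alone can overstate reliability when one category dominates the data, which is why chance-corrected coefficients are generally preferred; the reliability chapter takes up these issues in detail.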

Communication Research

Whereas some scholars approach communication messages from perspectives associated with the humanities (e.g., as literature or art), many others employ a social science approach based on empirical observation and measurement. Typically, that means that these researchers identify questions or problems (either derived from the scholarly literature or occurring in professional practice), identify concepts that “in theory” may be involved, and propose possible explanations or relationships among concepts. Implausible explanations are discarded, and viable ones tested empirically, with theoretical concepts now measured in concrete, observable terms. If members of an ethnic minority, for example, believe that they are underrepresented in news content (in terms of their census numbers), a researcher may propose that racism is at work or that minorities are underrepresented among occupational groups that serve more often as news sources. Each of these propositions involves different concepts to be “operationalized” into measurement procedures and each can be tested empirically. Similarly, if researchers want to address how social media help achieve concerted action during a crisis such as the 2011 Arab Spring, operational procedures can be developed and used to collect data on social media content, which can be compared with data for official media.

Put another way, explanations for problems or questions for such researchers are sought and derived through direct and objective observation and measurement rather than through one’s reasoning, intuition, faith, ideology, or conviction. In short, these communication researchers employ what is traditionally referred to as the scientific method. The centuries-old distinction between idealism (an approach that argues that the mind and its ideas are “the ultimate source and criteria of knowledge”) and empiricism (an approach that argues that observation and experimentation yield knowledge) continues to hold the attention of those interested in epistemology or the study of knowledge (Vogt, 2005, pp. 105–106, 149). Content analysis assumes an empirical approach, a point made more emphatically in later chapters.

Another important distinction involves reductionism and holism. Much of communication social science adheres implicitly to a reductionist view—i.e., that understanding comes through reducing a phenomenon to smaller, more basic, individual parts (Vogt, 2005, p. 267)—rather than holism, an assumption that wholes can be more than or different from the sum of their individual parts (Vogt, 2005, p. 145). From a holistic perspective, the whole “is literally seen as greater than the sum of its parts” (McLeod & Tichenor, 2003, p. 105). For example, collectivities such as communities have properties or characteristics that are more than the aggregate of individuals within them. Although the reductionism–holism debate most often involves the place of individuals in larger social systems, it might as easily address the distinction between individual communication messages or message parts, and “the media,” news, and entertainment as institutions.

Content Analysis and Mass Communication Effects Research

The scholarly or scientific study of mass communication is fairly new. Historians have traced its beginnings to early 20th-century work by political scientists concerned with effects of propaganda and other persuasive messages (McLeod, Kosicki, & McLeod, 2009; Rogers, 1994; Severin & Tankard, 2000). In addition to scholars in communication, researchers from disciplines such as sociology, psychology, political science, and economics have focused on communication processes and effects, contributing their own theoretical perspectives and research methods. Regardless of whether they were optimistic, pessimistic, or uncertain about communication’s effects, researchers have recognized content analysis as an essential step in understanding those effects.

Powerful Effects?

One particularly important and durable communication research perspective reflects a behavioral science orientation that grew out of early 20th-century theories that animal and human behaviors could be seen as stimulus-response complexes. Some communication researchers have viewed communication messages and their assumed effects from this same perspective. Researchers interested in these effects typically have adopted experimental methods for testing hypotheses. Participants were assigned to different groups; some were exposed to a stimulus within a treatment (a message), whereas others were not (the control participants). Under tightly controlled conditions, subsequent differences in what was measured (e.g., attitudes about an issue, or perhaps purchasing or other behavioral intention) could be attributed to the exposure–non-exposure difference.

Meanwhile, for most of the first half of the 20th century, there existed a widespread assumption—among scientists and the public—that stimuli such as mass persuasive messages could elicit powerful responses, even outside the experimental laboratory. Why? Propaganda, as seen during the world wars, was new and frightening (Lasswell, 1927; Shils & Janowitz, 1948). Reinforcement came in the form of a 10-volume summary of 13 Payne Fund Studies conducted from 1929 to 1932 that showed movies’ power “to bring new ideas to children; to influence their attitudes; stimulate their emotions; present moral standards different from those of many adults; disturb sleep; and influence interpretations of the world and day-to-day conduct” (Lowery & DeFleur, 1995, p. 51). Anecdotal evidence of the impact in Europe of communist or Nazi oratory or, in America, the radio demagoguery of Father Charles E. Coughlin (Stegner, 1949) heightened concern over mass messages and collective behavior. Broadcast media demonstrated a capacity for captivating, mesmerizing, and holding people in rapt attention and for inciting collective panic (Cantril, Gaudet, & Hertzog, 1940). With the rise of commercial advertising and public relations agencies, carefully organized persuasive campaigns used messages that were constructed to make people do what a communicator wanted (Emery, Emery, & Roberts, 2000; McLeod et al., 2009). Communication media were increasingly able to leapfrog official national boundaries and were believed capable of undermining national goals (Altschull, 1995).

These assumptions about powerful media effects were consistent with the early 20th-century behaviorist tradition and contributed to early models or theories of communication effects that used metaphors such as hypodermic needle or bullet. In the language of the latter, all one had to do was shoot a persuasive message (a bullet) at the helpless and homogeneous mass audience, and the communicator’s desired effects would occur. Data generated in experimental studies of messages and their effects were interpreted as supporting these assumptions of powerful effects. Of course, the assumption that audience members were uniformly helpless and passive was a major one. Methodologists warned of the artificiality of controlled and contrived conditions in laboratory settings and cautioned that experimental attitude-change findings lacked real-world generalizability (Hovland, 1959). Still others suggested that scientists’ emphasis on understanding how to best do things to the audience was inappropriate; Bauer (1964) questioned the “moral asymmetry” (p. 322) of such a view of the public. Nonetheless, content analysis found a home within the powerful effects perspective because of the implicit causal role for communication content described in the models, tested in the experiments, and ascribed—by the public as well as scientists and policymakers—to content, whether it was propaganda, popular comics or films, pornography, political promises, or persuasive advertisements.

In short, communication content was important to study because it was believed to have an effect (Krippendorff, 2004a; Krippendorff & Bock, 2009). Scholars scrutinized content in search of particular variables that, it was assumed, could affect people. One researcher might thus catalog what kinds of suggestions or appeals were used in propaganda, another might describe the status or credibility of sources in persuasive messages, and still others might analyze whether antisocial behavior was sanctioned, applauded, or ignored in popular television programs.

Limited Effects?

However, the assumption that powerful effects were direct and uniform was eventually challenged as simplistic and replaced by more careful specification of factors that contribute to or mitigate effects (Severin & Tankard, 2000). Experimental findings had, in fact, suggested that in some cases, mass media messages were effective in changing subjects’ knowledge but not the targeted attitudes or behaviors. Researchers conducting public opinion surveys brought field observations that ran counter to cause–effect relations found in laboratory settings. Examination of how people are exposed to messages in the real world and the mixed results on real-world effectiveness of persuasive message “bullets” suggested that a more limited effects perspective might be worth exploring (Chaffee & Hochheimer, 1985; Klapper, 1960). Under natural, non-laboratory field conditions, members of the audience (who, it turned out, were not uniformly helpless or passive, nor, for that matter, very uniform in general) used media and messages for their own individual purposes, chose what parts of messages—if any—to attend, and rejected much that was inconsistent with their existing attitudes, beliefs, and values (Lazarsfeld, Berelson, & Gaudet, 1944). Social affiliations such as family and community involvement were important predictors of people’s attitudes and behaviors, and networks of personal influence were key factors influencing their decisions (Carey, 1996). Real-world (non-laboratory) audience members had only an opportunity to be exposed to particular media content. They were not forced to attend to the message like experimental participants. A decision to accept, adopt, or learn a message was a function of existing psychological and social characteristics, and not necessarily mere exposure to the manipulated, artificial credibility of a source trying to persuade as part of an experimental treatment.

Contingency Effects?

Research during the last half of the 20th century suggested that the effects—powerful or limited—of mass media are contingent on a variety of factors and conditions. This contingency effects approach allowed theorists to reconcile conflicting conclusions of the powerful and limited effects approaches. Rather than being the result of any single cause (e.g., the message), communication effects reflected a variety of contingent conditions (e.g., whether the message is attended to alone or as part of a group). Of course, some contemporary research on content—particularly that aimed at impressionable children, whether online, in video games, or elsewhere—continues to adhere implicitly to powerful effects assumptions.

However, despite increasing interest in what people do with media messages and how or if they learn from them—rather than a powerful effects focus—content analysis remained an important means of categorizing all forms of content. The communication messages that might previously have been analyzed because of assumed effects were now related to differences in psychological or social gratifications consumers gained from media use (e.g., escape from boredom, being “connected” to what is going on, or having something to talk about), to differences in cognitive images they developed (e.g., views of appropriate gender roles or of the acceptability of antisocial acts), and to differences in what they deemed important on the news media agenda (e.g., what issues in a political campaign were worth considering and what attributes of issues were critical). In short, different theories or hypotheses about varied cognitive (not attitudinal) effects and people’s social and psychological uses and gratifications of media and media content were developed that reflected a view of the audience experience far different from the “morally asymmetrical” view criticized by Bauer (1964, p. 322). These triggered additional studies aimed at measuring content variables associated with those uses and effects.

For example, content analysts have categorized entertainment content to answer questions about how ethnic and gender stereotypes are learned (Mastro, 2009; Smith & Granados, 2009). They have looked at content ranging from daytime soap operas to reality programs because of guiding assumptions about psychological and social gratifications people achieve by viewing those shows (Rubin, 2009). They have examined victim gender in “slasher” horror movies because of concern that the violence in such films has a desensitizing effect (Sapolsky, Molitor, & Luque, 2003; Sparks, Sparks, & Sparks, 2009). They have analyzed “movement” of issues on the media’s agenda during political campaigns, assuming that readers recognize the priorities journalists give issues and issue attributes, internalize that agenda, and use it as a basis for voting decisions (McCombs & Reynolds, 2009; McCombs & Shaw, 1972). And, systematic content analysis has shown how different communicators “frame” the same events, because scholars argue that frames shape interpretations (Reese, Gandy, & Grant, 2001; Tewksbury & Scheufele, 2009). Tankard’s (2001) definition of framing in news is illustrative: “A frame is a central organizing idea for news content that supplies a context and suggests what the issue is through the use of selection, emphasis, exclusion, and elaboration” (pp. 100–101).

Content analysis remains an important tool for researchers exploring more directly how individual-level cognitive processes and effects relate to message characteristics (Bradac, 1989; Shrum, 2009; Oliver & Krakowiak, 2009). For example, scholars have argued that important differences between one message’s effects and another’s may be due less to the communicator’s or audience member’s intent (e.g., to inform or be informed) than to different cognitive or other processes (e.g., transportation and enjoyment, entertainment, arousal, mood management, etc.) triggered by content features or structure (Bryant, 1989; Bryant, Roskos-Ewoldsen, & Cantor, 2003; Green, Brock, & Kaufman, 2004; Oliver & Krakowiak, 2009; Thorson, 1989; Vorderer & Hartmann, 2009; Zillmann, 2002).

Content Analysis and the Context of Production

Thus far, our discussion has implicitly viewed communication content as an antecedent condition, presenting possible consequences of exposure to content that may range from attitude change (from a powerful effects perspective) to the gratifications people obtain from media use or the cognitive images they learn from it. However, content is itself the consequence of a variety of other antecedent conditions or processes that may have led to or shaped its construction. A news web page, for example, might be conceived as being a consequence of the news organization’s selection from an array of possible stories, graphics, interactive features or affordances, and other content. In terms of the individual site manager or editor, that page’s content is a consequence of editors’ application of what has traditionally been called “news judgment,” based on numerous factors that visitors to the site need or want from that content. Of course, that judgment is shaped by other constraints, such as what kinds of motion graphics or interactivity are available, how often material is updated, and so on. The content a researcher examines reflects all those antecedent choices, conditions, constraints, or processes (Stempel, 1985).

Similarly, individual news stories are the consequence of a variety of influences, including (but not limited to) a news organization’s market (Lacy, 1987, 1988; Lacy, Duffy, Riffe, Thorson, & Fleming, 2010; Lacy, Watson, & Riffe, 2011); the resources available for staffing (Fico & Drager, 2001; Lacy et al., 2012); on-scene reporter judgments and interactions with both purposive and non-purposive sources (Bennett, 1990; Duffy & Williams, 2011; Entman, 2010; Lawrence, 2010; Westley & MacLean, 1957); and decisions about presentation style, structure, emphasis (as in the “framing” process described previously), and language, to name a few (Scheufele & Scheufele, 2010). Media sociologists do not view news reporting as “mirroring” reality, but speak instead of journalistic practices and decisions that collectively constitute the manufacturing of news (Cohen & Young, 1981). News content is the product or consequence of those routines, practices, and values (Reese, 2011; Shoemaker & Reese, 1996), is constructed by news workers (Bantz, McCorkle, & Baade, 1997), and reflects both the professional culture of journalism and the larger society (Berkowitz, 2011; Mellado et al., 2017).

Examples of “content as consequence” abound. Under the stress of natural disasters (e.g., tornadoes, hurricanes, or earthquakes), individual journalists produce messages in ways that differ from routine newswork (Dill & Wu, 2009; Fontenot, Boyle, & Gallagher, 2009; Whitney, 1981). Different ownership, management, operating, or competitive situations have consequences; news organizations in different competitive situations allocate content differently (Beam, 2003; Lacy, 1987). The presence of women in top editorial positions has consequences for how reporters are assigned beats (Craft & Wanta, 2004) and the newsroom culture (Everbach, 2005), but evidence on effects of female management on content is mixed (Beam & Di Cicco, 2010; Everbach, 2005) or perhaps issue-dependent (Correa & Harp, 2011), though Beam and Di Cicco (2010) found increased feature treatment in news with senior female editors. Predictably, a good deal of international coverage in U.S. news media is a consequence of having a U.S. military presence overseas; absent a state of war, “foreign news” is relatively rare (Allen & Hamilton, 2010). Facing censorship in authoritarian countries, correspondents gather and report news in ways that enable them to get stories out despite official threats, sanctions, or barriers (Riffe, 1984, 1991; Riffe, Kim, & Sobel, 2018). Many of the symbols that show up in media messages at particular points in time (e.g., allusions to nationalism or solidarity during a war effort) are consequences of the dominant culture and ideology (Shoemaker & Reese, 1996); communication messages that contain particular images, ideas, or themes reflect important—and clearly antecedent—cultural values.

“Content as consequence” is applicable to non-news communication, too. Recall Ki and Hon (2006), whose examination of Fortune 500 companies’ websites allowed them to critique those companies’ communication strategies, strategies that were antecedent to the site content. Scholars often speak of such evidence as unobtrusive or non-reactive. That is, researchers can examine content after the fact of its production and draw inferences about the conditions of its production without making the communicators self-conscious or reactive to being observed while producing it. As a result, according to Weber (1990), “[T]here is little danger that the act of measurement itself will act as a force for change that confounds the data” (p. 10). Letters, diaries, bills of sale, or archived newspapers, tweets, or blog posts—to name a few—can be examined and conclusions drawn about what was happening at the time of their production.

The “Centrality” of Content

So, communication content may be viewed as an end product, the assumed consequence or evidence of antecedent individual, organizational, social, and other contexts. The validity of that assumption depends on how closely the content evidence can be linked empirically (through observation) or theoretically to that context. As just noted, communication content also merits systematic examination because of its assumed role as cause or antecedent of a variety of individual processes, effects, or uses people make of it. Figure 1.1 is a simple, content-centered model summarizing the previous discussion and illustrating why content analysis can be integral to theory-building about both communication effects and processes.

[Figure 1.1 Centrality model of communication content. Antecedent conditions, both (a) individual psychological/professional and (b) social, political, economic, cultural, or other contextual factors, are assumed or demonstrated to affect communication content, which in turn is an antecedent or correlate of (a) assumed or demonstrated, (b) immediate or delayed, and (c) individual, social, or cultural effects.]

The centrality remains regardless of the importance (for theory-building) of myriad non-content variables such as individual human psychological or social factors and the larger social, cultural, historical, political, or economic context of communication. However, if the model illustrates graphically the centrality of content, it does not reflect accurately the design of many mass communication studies. As Shoemaker and Reese (1990) observed, most content analyses are not linked “in any systematic way to either the forces that created the content or to its effects” (p. 649). As a result, Shoemaker and Reese (1996) warned, mass communication theory development could remain “stuck on a plateau” (p. 258) until that integration occurs. A 1996 study (Riffe & Freitag, 1997) of 25 years of content analyses published in Journalism & Mass Communication Quarterly revealed that 72% of the 486 studies lacked a theoretical framework linking the content studied to either the antecedents or consequences of the content. Trumbo (2004, p. 426) placed the percentage at 73% in his analysis of Quarterly content studies during the 1990–2000 period. Not surprisingly, only 46% of the cases examined by Riffe and Freitag (1997) involved formal research questions or hypotheses about testable relations among variables, testing that is essential to theory-building.

Still, research in this field is dynamic, although the scientific goal of prediction, explanation, and control (Reynolds, 1971) of media phenomena may still be decades away. However, quantitative content analysis of media content is key to such a goal. Since initial publication of this book in 1998, hundreds of content analysis-related studies have been published in Journalism & Mass Communication Quarterly and other refereed journals such as the Journal of Broadcasting & Electronic Media and Mass Communication and Society, using the kind of quantitative content analysis examined in this book. According to Wimmer and Dominick (2011, p. 156), about a third of all articles published in those three journals in 2007 and 2008 employed quantitative content analysis, a proportion higher than the 25% that Riffe and Freitag (1997) reported for 25 years of Journalism & Mass Communication Quarterly. Of the 2,534 articles Lovejoy, Watson, Lacy, and Riffe (2014) studied from Journalism & Mass Communication Quarterly, the Journal of Communication, and Communication Monographs between 1985 and 2010, 23% involved content analysis. Consistent with this book’s emphasis on the “centrality” of content in understanding processes and effects of communication, many of these studies place the content analysis research into the context of framing, agenda-setting, cultivation, and a variety of persuasion theories. Research on content antecedents has grown since Shoemaker and Reese (1996) spotlighted the approach with their hierarchy of influences theory. During the past three decades, scholars have examined content antecedents using theories from the other social sciences as well as communication research.


Description as a Goal

Of course, not all individual research projects have theory-building as a goal. Even apparently simple descriptive studies of content may be valuable. A southern daily newspaper publisher, stung by criticisms that his paper was excessively negative in coverage of the African-American community, commissioned one of the authors of this text to measure coverage of that segment of the community. That publisher needed an accurate description of his paper’s coverage over time to respond to criticisms and perhaps to change the coverage.

Some descriptive content analyses are “reality checks” whereby portrayal of groups, phenomena, traits, or characteristics are assessed against a standard taken from real life (Wimmer & Dominick, 2011, pp. 158–159). Such comparisons to normative data can, in some instances, serve as an index of media distortion. For example, a study (Riffe, Goldson, Saxton, & Yu, 1989) of characters’ roles in television advertising during Saturday morning children’s programming reported a female and ethnic presence far smaller than those groups’ presence in the real-world population as reflected in census data. Moreover, when new media or content forms evolve, they lend themselves to descriptive studies and similar “real-world” comparisons. Video games, for example, have been examined because of assumptions about imitative aggression or learning of gender roles among users, a research focus previously applied to media content as varied as comic books, movies, television, and popular music. Martins, Williams, Harrison, and Ratan (2008) judged the level of realism in 150 top-selling video games. They also measured physical dimensions of animated characters, converting the dimensions to their “equivalencies” if the characters had been real, and comparing those, in turn, to real-world female body size and features. They found that animated female characters in video games had smaller chests, waists, and hips than their real-world counterparts, a pattern the authors deemed consistent with the thinness ideal cultivated by many media. Or consider the study by Law and Labre (2002) analyzing male body images in magazines from 1967 to 1997. Although the study implicitly incorporated time as part of its longitudinal design, it was essentially a descriptive study of how male body shapes have become increasingly lean and muscular in visual representations. Law and Labre suggested that male exposure to idealized body images may thus parallel the experience women face with mediated body images.

Descriptive content analyses sometimes serve as a first phase in a program of research. The research program (as opposed to a single study) on anonymous attribution by Culbertson (1975, 1978) and Culbertson and Somerick (1976, 1977) is illustrative. Reporters sometimes hide a source’s identity behind a veil of anonymity (e.g., “a senior White House official, speaking on condition of anonymity, said today . . .”), despite critics’ complaints about lack of public accountability for the source (Duffy & Williams, 2011). In the first phase, Culbertson (1975, 1978) analyzed representative newspapers and newsmagazines. Based on the results, which described variables associated with unnamed attribution, Culbertson and Somerick (1976, 1977) conducted a field experiment (participants received simulated news stories either with or without veiled attribution) to test effects of unnamed attribution on the believability of the report. More recently, a body of studies conducted by a variety of researchers has used experiments to test the effects of media frames of government policies on audience members’ thoughts, usually fashioning those (manipulated) experimental frames from actual examples used in mass media content (e.g., de Vreese, 2004, p. 39; de Vreese, 2010; de Vreese & Boomgaarden, 2006).
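The “reality check” comparisons described above come down to simple arithmetic: a group’s share of the coded content is set against its share of a real-world benchmark such as census data. The sketch below uses invented figures, not data from the studies cited, to show one common way of expressing such a comparison.

```python
# Minimal sketch of a "reality check": compare a group's share of coded
# content against a census benchmark. All numbers here are hypothetical.
content_share = {"female": 0.18, "male": 0.82}   # share of coded characters
census_share = {"female": 0.51, "male": 0.49}    # share of real-world population

for group in content_share:
    # Representation index: 1.0 means parity with the benchmark;
    # values below 1.0 indicate underrepresentation in the content.
    index = content_share[group] / census_share[group]
    print(f"{group}: content {content_share[group]:.0%}, "
          f"census {census_share[group]:.0%}, index {index:.2f}")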

Research Applications: Making the Connection

As many of these examples have shown, content analysis is often an end in itself, a method used to answer research questions about content. However, some of the previous examples illustrate how the method can be used in conjunction with other research strategies. In fact, despite Shoemaker and Reese’s (1996) complaint about non-integration of media content analyses into studies of effects or of media workers or other types of communications, some studies have attempted precisely such a linkage.

Scheufele, Haas, and Brosius (2011) designed a study to explore the “mirror or molder” role of media coverage of stock prices and trading; specifically, they asked about the short-term effect of media coverage on subsequent market activity. Data on coverage in four leading German daily papers and the two most frequently visited financial websites were matched with stock prices and trading volume for companies ranging from blue-chip DaimlerChrysler to lightly capitalized and traded companies. While the authors cited clear evidence in their data that media “mirror (rather) than shape what happens at stock markets” (Scheufele et al., 2011, p. 63), they concluded that online coverage reflects market reality, but it also affects online traders who may trade immediately after reading reports.

To examine how “compliant” the press and Congress were in responding to the official U.S. government stance on the 2004 Abu Ghraib prison scandal, Rowling, Jones, and Sheets (2011) systematically examined White House speeches, interviews, press conferences, and press releases; statements made on the floor of Congress and recorded in the Congressional Record; and news coverage by CBS News and the Washington Post. “[D]espite the challenges from congressional Democrats,” coverage did not reflect the range of views in play, and the Republican administration’s three “national identity-protective” frames—minimization, disassociation, and reaffirmation—“were largely echoed by the press” (p. 1057).

Observers have argued that U.S. politics has become increasingly polarized in the 21st century, and that journalistic news judgment overvalues extreme groups and undervalues “moderate” groups. After identifying more than 1,100 (not-for-profit) advocacy groups from Internal Revenue Service databases, McCluskey and Kim (2012) interviewed top executives of 208 groups and characterized 20 as very conservative, 41 as very liberal, and 71 as moderate, in order to contrast coverage of moderate and extreme groups. They analyzed the 20 largest-circulation dailies in the United States and “matched” each group with the daily newspaper nearest its headquarters. Content analysis revealed that moderate groups were given less prominence than extreme groups and tended to be covered by smaller newspapers.

McCombs and Shaw (1972) hypothesized an agenda-setting function of mass media coverage of different aspects of a political campaign in which differential media emphasis, over time, communicates to the public a rough ranking (agenda) of important issues. In theory, the media agenda would be recognized, learned, and internalized by the public until the public’s priority ranking of issues mirrors the differential emphasis on issues in the media. In a survey, the authors asked undecided voters the most important issues in a campaign, and they analyzed campaign coverage in nine state, local, and national media, finding a strong, positive correlation between the media and public agendas, and supporting the hypothesized media effect. Similarly, Wanta, Golan, and Lee (2004) combined content analysis of network newscasts with national poll data, showing that amount of coverage of foreign nations is strongly related to public opinion about the importance of those nations to U.S. interests. However, they also examined how negatively or positively the nations were portrayed, and found a “second-level” agenda-setting effect involving those attributes: the more negative the coverage of a nation, the more likely poll respondents were to think negatively about the nation.

These studies involving communication content and survey measures of its (presumed) effect represent important steps in moving beyond merely describing content and assuming effects, or surveying attitudes and presuming a causal role for content. Via such an integration, researchers can respond to the challenge posed by Shoemaker and Reese (1990) in their aptly titled article “Exposure to What? Integrating Media Content and Effects Studies,” and developed in their book Mediating the Message: Theories of Influences on Mass Media Content (Shoemaker & Reese, 1996). However, as impressive as the agenda-setting approach is, such methodological integration is rare. Riffe and Freitag (1997) found only 10% of content analyses published in 25 years of Journalism & Mass Communication Quarterly involved a second research method, a pattern that has remained largely unchanged since the publication of this book’s first edition.
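The agenda-setting comparison described above is, at bottom, a rank-order correlation between the media’s issue ranking and the public’s. A minimal sketch follows, with invented ranks used purely for illustration rather than data from McCombs and Shaw (1972) or Wanta, Golan, and Lee (2004).

```python
# Minimal sketch: Spearman's rho between a media issue agenda and a public
# issue agenda. The issues and ranks are hypothetical.
media_rank = {"economy": 1, "crime": 2, "education": 3,
              "environment": 4, "health": 5, "immigration": 6}
public_rank = {"economy": 1, "crime": 3, "education": 2,
               "environment": 6, "health": 4, "immigration": 5}

issues = list(media_rank)
n = len(issues)

# With untied ranks, Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
d_squared = sum((media_rank[i] - public_rank[i]) ** 2 for i in issues)
rho = 1 - (6 * d_squared) / (n * (n ** 2 - 1))

print(f"Spearman's rho across {n} issues: {rho:.2f}")
```

A rho near +1 would indicate that the public’s issue ranking closely mirrors the media’s differential emphasis; a value near zero would indicate no correspondence between the two agendas.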

Innovation and Expanding the Research Reach

One place where content analysis has evolved is in the use of computational methods to complement human coding. At one end of the continuum, innovations have involved use of computer functions and software to access, retrieve, categorize, filter, and otherwise manage content units to enable researchers to manually code appropriate content. At the other end of the continuum, researchers have in effect “trained” computers to apply codes, sometimes “validated” by parallel human coding. This has been called “algorithmic text analysis” or ATA (Lacy, Watson, Riffe, & Lovejoy, 2015), “a computer application that assigns numeric values to attributes of media content based on a set of programmed rules” (p. 9), which some scholars call “machine learning,” “supervised machine learning,” or simply computer coding. The operative term is “programmed rules.” While computing in content analysis will be explored fully later (Chapter 3), the growth of computational methods and ATA merits a peek in this introductory chapter.

“Arab Spring” protests pitted citizens against authoritarian regimes in Tunisia and Egypt in 2011. With the goal of analyzing sources in news coverage and faced with a data set of more than 60,000 tweets, researchers faced a difficult task of trying to sort and filter the raw data, identify the unique sources, link to specific articles, and have those articles “imported” into a template that made it easy to code directly into a statistical package. They:

created a Python script to categorize a large dataset; used spreadsheet and statistical software to organize the data and identify the objects of our analysis; converted dynamic Web pages into static objects with open-source software; and developed a Web-based electronic coding interface to facilitate the work of human coders and reduce error.
(Lewis, Zamith, & Hermida, 2013, p. 41; see also Hermida, Lewis, & Zamith, 2013)

The analysis showed how NPR reporter Andy Carvin gave greater voice to non-elite sources by retweeting them than he did to elite sources or other journalists. Such a “hybrid” approach, “to enhance, rather than supplant, the work of human coders,” retained the “systematic rigor and contextual awareness” of traditional content analysis, while “maximizing the large-scale capacity of Big Data and the efficiencies of computational methods” (Lewis et al., 2013, p. 47).
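The data-management end of that continuum is easy to picture. The sketch below is not the Lewis, Zamith, and Hermida (2013) script; it is a hypothetical Python example of the same general step, reducing a large tweet export to the units a human coder will actually judge and writing them to a coding sheet. The file names and column names are assumptions.

```python
# Hypothetical preprocessing step for a hybrid (computer-aided) content
# analysis: reduce a large tweet export to unique, codable units and write
# a coding sheet for human coders. File and column names are invented.
import csv

seen_ids = set()
with open("tweets_raw.csv", newline="", encoding="utf-8") as infile, \
     open("coding_sheet.csv", "w", newline="", encoding="utf-8") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    # Human coders fill in the last two columns by hand.
    writer.writerow(["tweet_id", "source_handle", "text", "source_type", "notes"])
    for row in reader:
        # Keep only retweets, where other voices are being given exposure.
        if not row["text"].startswith("RT @"):
            continue
        # Drop duplicates so each retweeted message is coded only once.
        if row["tweet_id"] in seen_ids:
            continue
        seen_ids.add(row["tweet_id"])
        handle = row["text"].split()[1].lstrip("@").rstrip(":")
        writer.writerow([row["tweet_id"], handle, row["text"], "", ""])
```

The computer handles volume and bookkeeping; the judgment calls, such as classifying each source as elite or non-elite, remain with trained human coders.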

Introduction  17 Opperhuizen, Schouten, and Klijn (2018) studied 2,265 news articles about gas drilling in the Netherlands across 25 years and five different newspapers, using “supervised machine learning” (SML), “in which an algorithm learns to recognize patterns in the text that correspond to the manually assigned codes” (p. 8). In other words, on the basis of “handmade codes, the computer ‘learned’ to recognize codes in documents” (p. 2) and “predict” the articles’ classification. A subset of 102 articles was “inductively” coded by humans using frame categories (personalization, dramatization, and negativity) and then served as the “training document” for the algorithm. The researchers called it “challenging” and “very time costly” to apply SML, and predicted that “much more research [is] needed in the field of social and communicational science, to make SML an accessible technique for content analysis” (p. 18). Baden and Tenenboim-Weinblatt (2017) sought to examine more than 200,000 news texts in 13 Israeli, Palestinian, and international media over a decade. Qualifying texts had to reference both sides of the Israeli–Palestinian conflict. After “a laborious qualitative pilot study,” the authors created “a large, fine-grained dictionary of 1,974 semantic concepts” (pp. 9–10). Adjusted for idiomatic usage in each language, the final dictionaries contained 6,500–10,500 search terms and more than 34,000 “disambiguation criteria” (p. 10). “Co-occurrences” of concepts were also in play. These steps were taken to program “a fine-grained automated analysis” (p. 8). Nonetheless, the authors said their analysis yielded only “a bird’s eye perspective, using highly abstracted data” that, over time, “deliberately glosses over the specific conflict events and political controversies covered” (p. 19). In addition, “the inductive, algorithmic approach adopted here is vulnerable to flaws in the automated comparative measurement of news contents” and because the catalog of 1,974 concepts “is bound to be incomplete” (p. 20).
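For readers unfamiliar with how a hand-coded “training document” teaches an algorithm, the sketch below shows the general shape of supervised text classification. It is not the pipeline used in either study above: the article snippets, frame labels, and model choice (a TF-IDF representation feeding a logistic regression classifier from the scikit-learn library) are illustrative assumptions only.

# A generic supervised machine learning sketch: a small human-coded subset
# "teaches" a classifier, which then predicts frame labels for unseen text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-coded training examples (article text, frame label); all hypothetical.
train_texts = [
    "Residents describe sleepless nights as tremors rattle their homes.",
    "One farmer says the quakes have ruined his barn and his trust in the state.",
]
train_labels = ["dramatization", "personalization"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)   # the classifier "learns" from the hand-coded subset

unlabeled = ["A new report on drilling-related quake damage was released today."]
print(model.predict(unlabeled))        # the algorithm "predicts" a frame for the unseen article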

Research Applications: Content Analysis in Other Fields The examples cited thus far have shown the utility of systematic content analysis, alone or in conjunction with other methods and tools, for answering theoretical and applied questions explored by journalism or mass communication researchers. However, scholarly journals include examples of content analysis in academic disciplines as varied as sociology, political science, economics, psychology, and nutrition, to name a few. For example, because messages presumably indicate the psychological state of the communicator, content analysis has a long history of use in psychology. Most sources have dated the earliest such use to the examination by Gordon Allport in the 1940s of more than 300 letters from a woman to her friends. Those Letters from Jenny (Allport, 1965) were a non-reactive measure of the woman’s personality for Allport and his associates, whose work heralded

18 Introduction what Wrightsman (1981) called the “golden age” of personal documents as data in analyzing personality. In the 1980s, psychologists used content analysis of verbatim explanations (CAVE) in examining individuals’ speaking and writing to see if they describe themselves as victims and blame others or other forces for events. Earlier research used questionnaires to elicit these causal explanations, but questionnaires have limited use if potential participants are “famous, dead, uninterested, hostile, or otherwise unavailable” (Zullow, Oettingen, Peterson, & Seligman, 1988, p. 674). However, researchers can use explanations recorded in “interviews, letters, diaries, journals, school essays, and newspaper stories, in short, in almost all verbal material that people leave behind” (Zullow et al., 1988, p. 674). Zullow et al. examined President Lyndon Johnson’s Vietnam War press conferences. Whenever Johnson offered optimistic causal explanations, bold and risky military action followed. Pessimistic explanations predicted passivity on Johnson’s part. Analysis of presidential nomination acceptance speeches between 1948 and 1984 showed that nominees who were “pessimistically ruminative” (dwelling on the negative) in causal explanations lost nine of ten elections. Content analysis has been used to examine evolution of academic disciplines. An economic historian (Whaples, 1991) collected data on two dozen content and authorship variables for every article in the first 50 volumes of the Journal of Economic History, examining how researchers’ focus shifted from one era to another and isolating a particular paradigm change—cliometrics—that swept the profession in the 1960s and 1970s. Sociologists (McLoughlin & Noe, 1988) analyzed 26 years (936 issues and more than 11,000 articles) of Harper’s, Atlantic Monthly, and Reader’s Digest to examine coverage of leisure and recreational activities within a context of changing lifestyles, levels of affluence, and orientation to work and play. Pratt and Pratt (1995) examined food, beverage, and nutrition advertisements in leading consumer magazines with primarily African-American and non-African-American readerships to gain “insights into differences in food choices” (p. 12) related to racial differences in obesity rates and “alcohol-related physiologic diseases” (p. 16). A political scientist (Moen, 1990) explored Ronald Reagan’s “rhetorical support” for social issues embraced by the “Christian Right” by categorizing words used in each of his seven State of the Union messages. Moen justified use of content analysis on familiar grounds: content analysis is non-reactive (i.e., the person being studied is not aware he or she is being studied), allows “access” to inaccessible participants (such as presidents), and lends itself to longitudinal— over time—studies. Systematic content analysis has been used in the humanities. Simonton (1994) used computerized content analysis to contrast the style of the (consensually defined) popular and more obscure of Shakespeare’s 154 sonnets

in terms of whether a sonnet’s vocabulary features primitive emotional or sensual meanings or cerebral, abstract, rational, and objective meanings.

Summary

As it has evolved, the field of communication research has seen a variety of theoretical perspectives that influence how scholars define research questions and the methods they use to answer those questions. The focus of their research has often been communication content. Scholars have examined content because it is often assumed to be the cause of particular effects, and it reflects the antecedent context or process of its production. Content analysis has been used in mass communication and in other fields to describe content and to test theory-derived hypotheses. The variety of applications may be limited only by the analyst’s imagination, theory, and resources, as shown in the content analyses described in the introduction to this chapter and other examples throughout.

2 Defining Content Analysis as a Social Science Tool

In the preceding chapter, we offered a preliminary definition of quantitative content analysis that permitted an overview of the method’s importance and utility for a variety of communication research applications: the systematic assignment of communication content to categories according to rules and the analysis of relationships involving those categories using statistical methods. A more specific definition derived from previous ones can now be proffered. It is distinguished from earlier definitions by our view of the centrality of communication content. In the complete definition, we address both purpose and procedure of content analysis and discuss its constituent terms. Content analysis’ purpose and procedure draw on the social science approach to knowledge: a system of standards and guidelines for generating relational statements that describe and explain human behaviors and mental processes. Reynolds (1971) said science provides: (1) A method of organizing and categorizing “things,” a typology; (2) Predictions of future events; (3) Explanations of past events; (4) A sense of understanding about what causes events. And occasionally mentioned as well is: (5) The potential for control of events. (p. 4, emphasis in original) These goals are not accomplished by any one study or even one program of study. They come from the accumulation of research that is synthesized and presented in theory. Theory-building and -testing are the goals of the scientific process (Shoemaker, Tankard, & Lasorsa, 2004). Because human behavior is complex and changes over time, this accumulation of research is ongoing and never complete. New studies support or reject older studies and theories, and more useful theories replace ones that do not receive empirical support. Because it allows sophisticated analysis with the fundamental process of human communication, content analysis can contribute significantly to the accumulation of research and the building of theory.

Content Analysis as a Social Science Tool  21 Just as the body of social science knowledge changes, the methods used by scientists evolve as scholars investigate those methods. However, Reynolds (1971) identifies three characteristics of generating scientific knowledge that are consistent across time. He says science is abstract, intersubjective, and empirically relevant. Abstraction concerns the range of behaviors to which social science applies. Reynolds (1971) states: “In its simplest form, abstractness means that a concept is independent of a particular time or place” (p. 14). If theoretical concepts are tied to a place and time, they cannot predict the future. In addition, abstractness is efficient in creating scientific understanding. Having theories that are unique to a specific time and place would require an overwhelming number of theories to understand the world. The concept of media agenda-setting, for example, is abstract enough that it allows examination of the news media’s role in every election, even though the degree of media agenda-setting can vary. Agenda-setting would not be useful if it applied only to the 1972 election. That would make it an historical artifact. Intersubjectivity requires that scientists who study an area agree on what a concept means and also on the validity of relationships among the concepts (Reynolds, 1971). An example of the former is that relevant scholars generally have come to agree on the meaning of concepts such as agenda-setting (McCombs & Reynolds, 2009), diffusion of innovation (Rogers, 2003), and financial commitment (Lacy, 1992). Intersubjectivity also includes agreement among scholars about the use of a logic system for developing relational statements within a theory. Reynolds (1971) calls this logical rigor. Unlike some fields, such as economics and political science, that have adopted mathematics for theory-building, communication has no agreed upon logic system for theory-building. Rather, communication scholars tend to use what is labeled “informal logic” or “natural language reasoning” (Johnson, 1999; Johnson & Blair, 2000) for creating more explicitly delineated theory. This approach has the advantage of including empirical examination of premises in the reasoning (Johnson & Blair, 2000). Deductive mathematic logic systems do not require empirical tests of premises. Empirical relevance concerns the ability to compare theoretical statements with objective empirical data (Reynolds, 1971). If statements in a theory cannot be tested against measures of real phenomena, their validity cannot be established independently, and the five goals of science cannot be achieved. An important part of empirical relevance is the ability of scientists to replicate the empirical results of other scientists (McEwan, Carpenter, & Westerman, 2018). Consistent results across studies, scientists, and time are the strongest form of validation for theoretical statements. The relationship between news media content and the issues considered important by the members of the public (e.g., agenda-setting) has been examined and supported to varying degrees with hundreds of studies.


Adapting a Definition As a more specific definition of content analysis is developed, the result will reflect the nature of the social scientific process elaborated above. As a data-generating process, content analysis lends itself to the testing of theory, but the results of testing theoretical relationships will suggest new ideas for adjusting existing theories, and even building new theories to explain either the antecedents of content or the effects of that content as suggested by the centrality model. Stempel (2003) suggested a broad view of content analysis, what he called “a formal system for doing something we all do informally rather frequently—draw conclusions from observations of content” (p. 209). What makes quantitative content analysis more formal than this informal process? Weber’s (1990) definition specifies only that “Content analysis is a research method that uses a set of procedures to make valid inferences from text” (p. 9, emphasis added). In his first edition, Krippendorff (1980) emphasized reliability and validity: “Content analysis is a research technique for making replicative and valid inferences from data to their context” (p. 21, emphasis added). The emphasis on data reminds the reader that quantitative content analysis is reductionist, with sampling and operational procedures that reduce communication phenomena to manageable data (e.g., numbers) from which inferences may be drawn about the phenomena themselves. Berelson’s (1952) often-quoted definition—“[C]ontent analysis is a research technique for the objective, systematic, and quantitative description of the manifest content of communication” (p. 18)—includes the important specification of the process as objective, systematic, and focusing on content’s manifest (or denotative or shared) meaning (as opposed to connotative or latent “between-the-lines” meaning). Kerlinger (1973) suggested that content analysis is conceptually similar to “pencil-and-paper” scales used by survey researchers to measure attitudes, a parallel consistent with the emphasis we placed in Chapter 1 on communication content as an unobtrusive or non-reactive indicator. Content analysis, according to Kerlinger (1973), should be treated as “a method of observation” akin to observing people’s behavior or “asking them to respond to scales,” except that the investigator “asks questions of the communications” (p. 525). Each of these definitions is useful, sharing emphases on the systematic and objective nature of quantitative content analysis. However, most forego discussion of the specific goals, purpose, or type of inferences to be drawn from the technique other than to suggest that valid inferences are desirable. Moreover, some of the definitions might apply equally to qualitative analysis of messages. Stempel’s (2003) and Krippendorff’s (1980), for example, do not mention quantitative measurement (although each of

those researchers has assembled a remarkable record of scholarship using quantitative content analysis).

Content Analysis Defined Our definition in this volume, by contrast, is informed by a view of the centrality of content to the theoretically significant processes and effects of communication (see Chapter 1), and of the utility, power, and precision of quantitative measurement. Quantitative content analysis is the systematic and replicable examination of symbols of communication, which have been assigned numeric values according to valid measurement rules, and the analysis of relationships involving those values using statistical methods, to describe the communication, draw inferences about its meaning, or infer from the communication to its context, both of production and consumption. What do the key terms of this definition mean? Systematic One can speak of a research method as being systematic on several levels. Scientists are systematic in their approach to knowledge: the researcher requires generalizable empirical, not just anecdotal, evidence. Explanations of phenomena, relationships, assumptions, and presumptions are not accepted uncritically, but are subjected to a system of observation and empirical verification. The scientific method is a system with its step-by-step process of problem identification, hypothesizing of an explanation, and testing of that explanation (McLeod & Tichenor, 2003). The goal of science is to build systematically related sets of propositions that explain relationships among precisely defined concepts. These sets of propositions are called theory, and when supported empirically, theories can be generalized to an appropriate range of human behaviors and mental processing. Thus, from a theory-building point of view, systematic research requires identification of key terms or concepts involved in a phenomenon, specification of possible relationships among concepts, and generation of testable hypotheses (if-then statements about one concept’s influence on another). In addition to its important role in theory-building and -testing, content analysis is useful for practical problems and in generating baseline data for new communication phenomena that accompany developing technologies. Testing of hypotheses is not paramount in these instances. However, whether testing theory-driven hypotheses, generating baseline data, or solving practical problems, one may speak of the researcher being systematic on another level in terms of the study’s research design: the planning of operational procedures to be employed. The researcher,

24  Content Analysis as a Social Science Tool who determines in advance such research design issues as the time frame for a study, what kind of communication constitutes the focus of the study, what the concepts are to be, or how precise the measurement must be—who, in effect, lays the ground rules in advance for what qualifies as evidence of sufficient quality that the research question can be answered—is also being systematic. Research design is explored more fully in Chapter 8. Replicable Two defining traits of science are objectivity and reproducibility or replicability. To paraphrase Wimmer and Dominick (2011), a particular scientist’s “personal idiosyncrasies and biases” (p. 157), views, and beliefs should not influence either the method or findings of an inquiry. Findings should not be influenced by the researcher’s beliefs or hopes as to the outcome. Research definitions and operations that were used must be reported exactly and fully so that readers can understand exactly what was done. That exactness means that other researchers can evaluate the procedure and the findings and, if desired, repeat the operations. This process of defining concepts in terms of the actual, measured variables is operationalization. A quick example is that a student’s maturity and self-discipline (both abstract concepts) may be measured or operationally defined in terms of number of classes missed and assignments not completed. Both can be objectively measured and reproduced by another observer. A researcher interested in how popular a politician is on Twitter might operationalize that concept in terms of the number of followers or the average number of times a tweet is retweeted. Both measures are elements of popularity and could be easily replicated. Still another researcher, examining whether social networking sites represent venues for public discourse, could look for examples of citizengenerated and uploaded content on Twitter and Facebook, as well as interactive discussions of specific topics important to communities. In summary, other researchers applying the same system, the same research design, and the same operational definitions to the same content should replicate the original findings. Only then can a discovered relationship be generalized at a high level of probability. Only after repeated replications can a researcher develop a new theory or challenge and modify existing theory that explains a phenomenon. Examining how much of a news site’s content is authored by members of the community is also a fairly straightforward and easy way to assess whether the site is a venue for dealing with community issues. However, consider the systematic and replicable requirements in terms of a protracted example, this one from mass communication research. Suppose a team of researchers had published a content analysis of the portrayal of children of color in programming available through

Content Analysis as a Social Science Tool  25 streaming services (e.g., Netflix and Prime Video), and reported that the number of those characters was unrepresentative of the U.S. population. Obviously, the researchers counted the number of characters of color, an easily replicable operationalization, right? Consider in addition how many places along the way that the operational procedure used could have influenced what was found and how unclear or imprecise reporting of that procedure could influence the replicability of the study by other researchers. For example, how did the researchers operationally define which characters to count as characters of color? Was the decision based on the assessment of trained coders making judgments with or without referring to a definition or rule based on some specific criterion (e.g., skin color, eye shape, surname, etc.)? Making such judgments without a rule is like asking someone to measure the height of a group of friends but without providing a tape measure. Did coders examine and code content individually, or were the coding determinations a consensual process, with two or more coders reaching a decision? Did anybody check the coders to make sure all the coders understood the criterion and applied it the same way in making the character count? Were the individual coders consistent across time, or did their judgments become less certain? Did their counting become less reliable after they became fatigued, consulted with one another, or talked with the senior researcher about the study’s objectives? Did the study offer a quantitative measure of reliability that reached acceptable standards? Moreover, streaming video programs present both foreground and background characters. Did the definition of a character of color take this into account? Did a character in the background “weigh” as much as one in the foreground or one in a major or speaking role? What about groups or crowd scenes? Were coders able to freeze scenes and count characters (a process that decreases the chance of missing characters but that is unlike the typical audience viewing experience)? Finally, how did the researchers conclude the extent of underrepresentation once the data were collected? Did they tally how many entire programs had at least one minority character or compare the percentage of total characters that were non-white with census data (e.g., the percentage of the real population that is composed of people of color)? The previous example used what at first blush seemed a rather simple form of coding and measurement—counting characters of color—but it demonstrated how difficult it might be to reproduce findings, as required by our definition of quantitative content analysis, without the clear reporting of even such simple operational procedures. What would happen if coders were trying to measure more abstract variables, such as attractiveness, bias, presence of a particular frame, or fairness and balance, or were trying to code the deeper meaning of symbols rather than the manifest content?
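To make that last question about reporting choices concrete, the brief sketch below contrasts the two options using invented counts; it is not drawn from any actual study of streaming programming.

# Hypothetical coded data: for each program, the coders' count of characters
# of color and of all characters.
programs = [
    {"title": "Program A", "characters_of_color": 0, "total_characters": 12},
    {"title": "Program B", "characters_of_color": 3, "total_characters": 10},
    {"title": "Program C", "characters_of_color": 1, "total_characters": 18},
]
census_share = 0.40  # hypothetical share of people of color in the population

# Option 1: share of programs with at least one character of color
programs_with_any = sum(1 for p in programs if p["characters_of_color"] > 0)
print(f"{programs_with_any / len(programs):.0%} of programs had at least one character of color")

# Option 2: share of all coded characters who are characters of color, versus the census benchmark
total_coc = sum(p["characters_of_color"] for p in programs)
total_chars = sum(p["total_characters"] for p in programs)
print(f"{total_coc / total_chars:.0%} of characters, versus {census_share:.0%} of the population")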

26  Content Analysis as a Social Science Tool Symbols of Communication This definition also recognizes that the communication content suitable for content analysis is as variable as the many purposes and media of communication. All communication uses symbols, whether verbal, textual, or images. The meanings of these symbols can vary from person to person and culture to culture by a matter of degrees, but shared meaning of symbols is essential for social groups to exist. Moreover, the condition under which the symbols of communication were produced is variable in that it may have been natural or manipulated. As Kerlinger (1973) stated, content analysis “can be applied to available materials and to materials especially produced for particular research problems” (p. 527). For example, scholars can analyze “current” online content or archived content from newspapers, magazines, video, tweets, and social networking sites, or participants may be placed in experimental situations or exposure conditions, be asked to write or report their post-exposure sentiments, and then those sentiments may be subjected to content analysis. Although the phrase “symbols of communication” suggests allinclusiveness and broad applicability of the content analysis method, recall the requirement that content analyses be systematic and replicable, and the hope that they be driven by the scientific method. What represents appropriate and meaningful communication for content analysis must be based on the research task and specified clearly and without ambiguity. However, even that straightforward requirement is made complex because processes in communication involve questions of medium (if any) of communication (e.g., print versus broadcast versus mobile versus social media) or different functions (e.g., entertainment versus news versus social networking), to name only two. The issue is compounded further by potential differences in the symbols examined and the units used for coding (e.g., themes, frames, or entire news stories, or 280-character message strings). Appropriate communication content for study might thus be individual words or labels in advertisement copy, news stories, movies, phrases or themes in political speeches, individual postings or entire exchanges among Facebook posters, and entire recorded conversations between two people. Within these text units, the focus might be further sharpened to address the presence of particular frames, as Hamdy and Gomaa (2012) did in exploring official, independent, and social media framing of Egypt’s spring 2011 uprising. Visual communication for study might include photos, graphics, or display advertisements. Examples range from Neumann and Fahmy’s (2012) exploration of war or peace frames in three international news services’ visual images of the Sri Lankan Civil War to Johnson and Pettiway’s (2017) quantitative and qualitative study of visual expressions

Content Analysis as a Social Science Tool  27 of black identity on African-American museum websites. They analyzed a range of visual elements (e.g., layout, colors, typeface, visual images) on 46 sites and concluded these visual elements promoted identity and provided counterstereotypes. Video or film content analyses might involve movies, entire newscasts, individual programs or episodes of streaming series, or even the news “tickers” that crawl along the bottom of the screen (Coffey & Cleary, 2008). Within movies and programs, scholars can code individual camera shots or scenes, particular sequences of scenes (e.g., in a dramatic program, only the time when the protagonist is on screen with the antagonist), or entire dramatic episodes. Local public access broadcasting represents a source of diverse messages, often far less constrained by advertising pressures than network content. Animated images in video games have drawn scholarly attention (e.g., Martins et al., 2008). The definition of communication content could be extended to include song lyrics, graffiti, or even gravestones (tombstone inscriptions indicate a culture’s views about the hereafter, virtue, redemption, etc.). In fact, if transcripts or audio recordings are available, interpersonal exchanges may be suitable for content analysis. One can easily imagine extending this approach to examination of presidential and other political debates, or to phrases and words used in the conversations in a group of people, and so on. Additionally, students of nonverbal communication may record encounters between people and designate how sequences of physical movements, gestures, and expressions constitute units of communication. About two decades ago, the Internet became the focal point for content analysis, and more recently social networking site studies have boomed as well. For example, health communication researchers have examined health websites, both commercial and nonprofit, that feature both patient–physician and patient-to-patient dialogue (Rice, Peterson, & Christie, 2001; Sundar, Rice, Kim, & Sciamanna, 2011; West & Miller, 2009). Social media, broadly defined as including forms such as Twitter and Facebook, may be analyzed to explore “real-time” diffusion of news, though capturing theoretically relevant populations of these messages remains a challenge. Personal web home pages might be examined to determine what kind of “self” the creator is projecting to the online masses (Dominick, 1999; Papacharissi, 2002). Tremayne (2004) has analyzed nearly 1,500 news stories on ten news organizations’ online news pages in terms of a “century-long shift toward long journalism, long on interpretation and context, short on new fact” (p. 237), documented earlier by Barnhurst and Mutz (1997). The use of the Internet and social networking sites for political communication has become particularly active. Negative online advertising began during the 1996 presidential campaign. By the 2000 race between Al Gore and George W. Bush, online mudslinging had reached the point

28  Content Analysis as a Social Science Tool that Wicks and Souley (2003) found three-fourths of the releases on the candidates’ sites contained attacks on the opponent. Druckman, Kifer, and Parkin (2010, 2014, 2017, 2018) have documented the rapid growth in online political campaigning. Hale and Grabe (2018) examined the visual and text in the subreddit forum posts for Trump and Clinton in the 2016 election. They found higher and more consistent positive support for Trump, which probably reflects the young male demographic of Reddit. Brummette, DiStaso, Vafeiadis, and Messner (2018) analyzed tweets and then used social network analysis and cluster analysis to study the use of the term “fake news” during the 2016 election. They concluded that members of opposing political parties use the term to condemn messages from the opposition and to disparage the opposition. Numeric Values or Categories According to Valid Measurement Rules and Statistical Analysis of Relationships The definition specifies further that quantitative content analysis involves numeric values assigned to represent measured differences in symbols. For example, a simple look at Internet or mobile video advertising and the representation and inclusion of diverse characters might follow this procedure. First, a commercial receives a case number (001, 002, etc.) differentiating it from all other cases. Another number reflects the organization that distributed the ad (1 = Politico, 2 = BuzzFeed, etc.), whereas a third number specifies the length of the advertisement (1 = 15 seconds, 2 = 30 seconds, etc.). Another number is assigned to reflect the advertised product (1 = clothing, 2 = financial institutions, etc.), whereas another indicates the total count of characters of color presented in the commercial. Different numeric values are assigned to differentiate African-American characters from Asian, Hispanic, or white characters. Finally, coders might use a 1 (very negative) to 5 (very positive) rating scale, assigning a value to indicate how positively the character is portrayed. Of course, a crucial element in assigning these numbers involves the validity of the assignment rules and the consistency of their application. The rules must assign numbers that accurately represent the content’s meaning. If a television character receives a 1 for being portrayed negatively, the character must be portrayed in a way that the great majority of viewers would perceive it to be negative. Creating number assignment rules that will help coders assign values reliably is relatively easy, but difficulty can arise in creating rules that reflect the “true” manifest meaning of the content (validity). Put another way, reliability and validity issues must be addressed with particular care when assignment of a numerical value is not merely based on counting (e.g., how many characters are African-American), but on assignment of some sort of score or rating. Consider, for example, the task facing Law and Labre (2002, p. 702) in developing their “male scale” to study three decades of changing male

Content Analysis as a Social Science Tool  29 body images in magazine visuals. Combining two different dimensions (low, medium, and high body fat, and not, somewhat, and very muscular), Law and Labre (2002) established eight types of body composition for coders to use, ranging from “low body fat/not muscular” at one end of the scale to “high body fat/somewhat muscular.” Thirty-eight cropped photos were then sorted by three judges to identify the best photo to represent each of the eight types on the scale. The sorting process served to provide an empirical basis for the eight conceptual levels. With training, coders using the eight-image male scale were able to achieve acceptable levels of reliability, calculated using a common formula (see Chapter 6). The point here is that measures of concepts or variables such as physical attractiveness or—in news content—fairness or balance, traditionally viewed as individually variable (“in the eye of the beholder”), can be developed empirically and, with enough care, used reliably. Rather than using the close reading approach of, say, literary criticism and examining a set of units of communication, and then offering a qualitative assessment of what was observed, quantitative content analysis reduces the set of units to numbers that retain important information about the content units (e.g., how each scores on a variable and is different from others) but are amenable to arithmetical operations that can be used to summarize or describe the whole set. For example, using the system described earlier for examining diversity in online advertising, a study might report the average number of African-American characters in clothing advertisements or the percentage of those characters in a particular age cohort. The numbers assigned to variables measured in a single unit of communication make it possible to determine if that unit is equal to other units or, if not equal, how different it is. Describing and Inferring Simple description of content has its place in communication research, as noted earlier. For example, applied content analysis research is often descriptive. Years ago, one of the authors was commissioned by a southern daily to examine the paper’s treatment of the local African-American community. The publisher’s goal was to respond to focus group complaints about underrepresentation in the news pages and excessively negative coverage. The research involved analysis of six months of coverage and culminated in a report that indicated what percentage of news stories focused on African Americans and what part of that coverage dealt with negative news. Other applied descriptive content analyses might be conducted to bring news site stories in line with reader preferences discovered via cookies, readership surveys, or focus groups. For example, if search and click histories show visitors accessed more items about topic X and fewer about topic Y, a careful site manager might examine the current level of each before selecting future topics

30  Content Analysis as a Social Science Tool for site postings. Public relations applications might involve profiling a corporation’s image on the business’s website. If particular angles in the organization’s publications are ineffective, change may be in order. Agency practitioners might analyze a new client’s web and mobile presence to evaluate and plan their actions. On the other hand, there are also instances in which description is an essential early phase of a program of research. For example, researchers in mass communication have found it useful, at one point in the early years of their discipline, to provide descriptive profiles of media such as what percentage of space was devoted to local news. More recent examples might focus on the number of likes and shares given a political post. In a study of 25 years of Journalism & Mass Communication Quarterly content analyses, Riffe and Freitag (1997) found that a majority of published studies might qualify as descriptive: 54% involved no formal hypotheses or research questions, and 72% lacked any explicit theoretical underpinning. Kamhawi and Weaver (2003) reported similar data. Researchers also continue to discover entirely new research domains with previously unexplored messages or content. Anyone who has monitored the evolution of communication technologies during the last three decades has seen content offered in new forms. Forty years ago, few music videos existed, but concern about their content quickly prompted researchers to describe the extent of the sex and violence themes in them (Baxter, DeRiemer, Landini, Leslie, & Singletary, 1985; Vincent, Davis, & Boruszkowski, 1987). Twenty years ago, who among political communication researchers could have envisioned the role—deservedly or not—that television comedy programs (e.g., Last Week Tonight with John Oliver, Full Frontal with Samantha Bee, etc.) would assume in American political discussions? A few presidential races ago, how many experts on political communication would have anticipated the extent of attack campaigning on the web (Druckman et al., 2010; Wicks & Souley, 2003). Some descriptive data are involved in the second goal of content analysis specified in the definition: to draw inferences about meaning or infer from the communication to its context, both of production and consumption. In fact, simple descriptive data invite inference testing (i.e., conclusions about what was not observed based on what was observed). A simple example is the “why” question raised even in descriptive content analyses. Why does a southern daily provide so little “good news” of the African-American community? Why does one network’s nightly newscast mirror the other networks’? Why are some digital news sites more linked-to by posters on Facebook than others? Why do so many long-standing journalistic practices and routines break down in crisis situations? Social scientists using quantitative content analysis techniques generally seek to do more than describe. Content analysts—whether conducting

Content Analysis as a Social Science Tool  31 applied or basic research—typically do not collect descriptive data and then ask questions. Instead, they conduct research to answer questions. In the case of basic research, that inquiry is framed within a particular theoretical context. Guided by that context, they select content analysis from a variety of methods or tools that may provide answers to those questions. From their data, they seek to answer theoretically significant questions by inferring the meaning or consequences of exposure to content or inferring what might have contributed to the content’s form and meaning. To draw from content inferences about the consequences of consumption of content or about production of content, the researcher must be guided by theory. For example, Shin and Thorson (2017) based their study of sharing fact-checking messages with Twitter during the 2012 presidential election on social identity theory (Tajfel & Turner, 1979; Turner, Hogg, Oakes, Reicher, & Wetherell, 1987). Using a combination of human coding and automated textual analysis, they discovered “selective sharing” of fact-checked messages as well as the presence of hostile media perception of fact-checking organizations. A study of radio competition (Lacy et al., 2013) tested whether the financial commitment model (Lacy, 1992) can be used to explain variations in local government coverage by radio news. The authors found a significant but weak positive relationship between two or more radio stations providing news in a market (competition) and the number of sources used in local government stories and the diversity of the sources. These examples of inference-drawing suggest the range of appropriate targets of inference (e.g., the antecedents or consequences of communication as discussed in Chapter 1). However, students with a grounding in research design, statistics, or sampling theory will recognize that there are other questions of appropriateness in inference-drawing. Conclusions of cause–effect relationships, for example, require particular research designs. Perhaps more basic, statistical inference from a sample to a population requires a particular type of sample (see Chapter 5). Also, use of certain statistical tools for such inference testing assumes that specific measurement requirements have been met (Riffe, 2003, pp. 184–187; Stamm, 2003; Weaver, 2003).

Issues in Content Analysis as a Research Technique What we as the authors of this volume see as the strengths of quantitative content analysis (primarily its emphasis on replicability and quantification) are the focus of some criticisms of the method. Critics of quantitative content analysis have argued that the method puts too much emphasis on comparative frequency of different symbols’ appearance. In some instances, they have argued, the presence—or absence—of even a single particularly important symbol may be crucial to a message’s impact.

32  Content Analysis as a Social Science Tool Holsti (1969) described this focus on “the appearance or nonappearance of attributes in messages” as “qualitative content analysis” (p. 10) and recommended using both quantitative and qualitative methods “to supplement each other” (p. 11). However, a more important criticism repeated by Holsti (1969) is the charge that quantification leads to trivialization; critics have suggested that because some researchers “equate content analysis with numerical procedures” (p. 10), problems are selected for research simply because they are quantifiable, with emphasis on “precision at the cost of problem significance” (p. 10). Although this criticism could be dismissed out of hand as circular, the point it raises about a method focusing on trivial issues seems misdirected. Superficiality of research focus is more a reflection of the researchers using content analysis than a weakness of the method itself. Trivial research is trivial research whether it involves quantitative content analysis, experimental research, or qualitative research. Some might argue that theory, as the rationale for a study, and validity, as the gold standard for data quality, are at risk of taking a back seat to advances in computing capacity or advances in data-searching and data-analysis capabilities (Mahrt & Scharkow, 2013). Using these tools, for example, one could collect millions of tweets exchanged during the Super Bowl, the Democratic National Convention, a royal marriage, or some other event. Frequency counts of words might be interpreted as proxies for public sentiment (Prabowo & Thelwall, 2009; Thelwall, Buckley, & Paltoglou, 2011), or linkage patterns among words might indicate relationships among attitude or opinion objects, though such linkages are arguably a poor substitute for the larger context surrounding a communication. The potential problems with this approach are readily apparent: for one, only about a third of the U.S. population use Twitter (Statista, 2018a) and it is not clear how this group differs from the other two-thirds. Despite advances in computing capacity, it is difficult to ascertain whether any set of tweets actually represents all the relevant tweets (Bialik, 2012). Again, the problem is not with the available and evolving tools, but how they are used; social science data have always, regardless of how they are gathered, varied in quality and validity. Another criticism involves the distinction between manifest and latent content. Analysis of manifest content assumes, as it were, that with the message, “what you see is what you get.” The meaning of the message is its surface meaning. Latent analysis is reading between the lines (Holsti, 1969, p. 12). Put another way, manifest content involves denotative meaning— the meaning most people share and apply to given symbols. Given that “shared” dimension, to suggest that analysis of manifest content is somehow inappropriate is curious. Latent or connotative meaning, by contrast, is the meaning given by individuals or small groups to symbols.

Content Analysis as a Social Science Tool  33 The semantic implications notwithstanding, this distinction has clear implications for quantitative content analysis. Consider, for example, Kensicki’s (2004) content analysis of frames used in covering social issues (pollution, poverty, and incarceration), in which she concluded that news media seldom identified causes or effects for the issues, nor did they often suggest the likelihood that the problems could be solved. Two coders had to agree on how to identify evidence pointing to the cause, effect, and responsibility for each of those issues. Discussing the “lone scholar” approach of early framing research, on the other hand, Tankard (2001) described it as “an individual researcher working alone, as the expert, to identify the frames in media content” (p. 98). This approach made frame identification “a rather subjective process” (p. 98). Finally, Tankard (2001) asked, “Does one reader saying a story is using a conflict frame make that really the case?” (p. 98, emphasis in original). The difference between latent and manifest meaning is not always as clear-cut as such discussions indicate. Symbols in any language that is actively being used change in meaning with time. A manifest meaning of a word in 2018 may not have been manifest 100 years before. The word cool applied to a book, for example, means to most people that it is a good book, which would make it manifest. This meaning currently found in dictionaries was not manifest in 1850. To a degree, the manifest meaning of a symbol reflects the proportion of people using the symbol for that meaning. This somewhat arbitrary nature of language is made more concrete by the existence of dictionaries that explain and define shared meaning. Researchers need to be careful of the changing nature of symbols when designing content analysis research. Language users share meaning, but they also can have idiosyncratic variations of meanings for common symbols. How reliable can the data be if the content is analyzed at a level that implicitly involves individual interpretations? We concur with Holsti (1969), who suggested that the requirements of scientific objectivity dictate that coding be restricted primarily to manifest content; the luxury of latent meaning analysis comes at the interpretative stage, not at the point of coding.

Advantages of Quantitative Content Analysis of Manifest Content The strengths of quantitative content analysis of manifest content are numerous. First, it is an unobtrusive, non-reactive measurement technique. The messages are separate and apart from communicators and receivers. Armed with a strong theoretical framework, the researcher can draw conclusions from content evidence without having to gain access to

34  Content Analysis as a Social Science Tool communicators who may be unwilling or unable to be examined directly. As Kerlinger (1973) observed, the investigator using content analysis “asks questions of the communications” (p. 525). Second, because content often has a life beyond its production and consumption, longitudinal studies are possible using archived materials that may outlive the communicators, their audiences, or the events described in the communication content. Third, quantification or measurement by coding teams using a welldeveloped protocol permits reduction to numbers of large amounts of information or numbers of messages that would be logistically impossible to understand well with close qualitative analysis. Properly operationalized and measured, such a process of reduction nonetheless retains meaningful distinctions among data. Fourth, the method is, as shown in Chapter 1, virtually unlimited in its applicability to a variety of questions important to many disciplines and fields because of the centrality of communication in human affairs. Finally, because the reliability of content analysis data is invested in the protocol and not just the coders, the consistency of the application by many coders can be measured within and across studies using the same protocol. This ability adds to establishing the validity of the reliable variables in the protocol. Researchers should heed Holsti’s (1969) advice on when to use content analysis, advice that hearkens back to the criticism that the focus of the method on precision leads to trivial topics: “Given the immense volume and diversity of documentary data, the range of possibilities is limited only by the imagination of those who use such data in their research” (p. 15). Holsti (1969) suggested three “general classes of research problems which may occur in virtually all disciplines and areas of inquiry” (p. 15). Content analysis is useful, or even necessary, when: 1

data accessibility is a problem, and the investigator is limited to using documentary evidence (p. 15);
2 the communicator’s “own language” use and structure is critical (e.g., in psychiatric analyses) (p. 17); and
3 the volume of material exceeds the investigator’s individual capability to examine it (p. 17).

Summary If Holsti’s (1969) advice on when to use content analysis is instructive, it is also limited. Like so many of the definitions explored early in this chapter, its focus is primarily on the attractiveness and utility of content as a data source. Recall, however, the model in Chapter 1 on the centrality of communication content. Given that centrality, both as an indicator

of antecedent processes or effects and consequences, content analysis is indeed necessary, and not just for the three reasons cited by Holsti. Content analysis is crucial to any theory dealing with the impact or antecedents of content. It is not essential to every study conducted, but in the long run one cannot study communication without studying content that carries symbolic meaning. Absent knowledge of the relevant content, all questions about the processes generating that content or the effects that content produces are meaningless.

3 Computers and Content Analysis

Today, it is hard to imagine a content analysis design that does not involve computers to execute some aspect of the research project—for example, to query databases, to sift a big data set based on keywords, to use an Excel spreadsheet as a coding sheet to record data, or to analyze data (see Figure 3.1). But to be considered “content analysis,” the research design must ultimately rely on human coders applying a predetermined protocol to assign values to media content for the purposes of making valid inferences about that content. This does not rule out hybrid designs, which incorporate computers and computational social scientific methods with human coding (see Lewis et al., 2013), or multi-methods designs that might compare, for example, human and algorithmic coders (Conway, 2006). Human-coded data might also be used as training sets in supervised machine learning, the process of iteratively improving the ability of an algorithm to perform a task without explicitly programming it (Opperhuizen et al., 2018). This chapter highlights the various ways that computers can be used to enhance studies of communication content, while drawing methodological distinctions between content analysis (using human coders),

1. To access or gather content (includes retrieving content from databases, freezing dynamic content, and creating custom databases, scraping data, etc.)
2. To parse content into data tables for analysis
3. To sort and/or filter content, including sampling
4. To organize coding tasks (e.g., create dynamic interface or coding template)
5. To validate codes entered on coding sheet
6. To code content (i.e., automated textual analysis or ATA)
7. To analyze coded data

Figure 3.1  Uses of computers in research studies of communication content

Computers and Content Analysis  37 algorithmic text analysis or ATA (using algorithmic coders), and hybrid designs that use computers to improve efficiency and reliability, but still rely ultimately on human coders who can best recognize the meaningful context and complexities of human language. This chapter is less of a “how-to” on using computers in content analysis, and more of a survey of how computers can improve efficiency and reliability in content analysis, and those considerations one should have in mind when making decisions about whether and how to use computers in a research design. Done correctly, even the most passionate content analyst might find the method a resource-intensive slog, to the extent that it may discourage use of the method (Conway, 2006). Students in particular gravitate toward algorithmic text analysis (ATA) because they perceive it as being easier, which is not the same as being efficient. Certainly, the use of computers in content analysis greatly improves efficiency, reducing time and costs of performing traditionally resource-intensive content analyses. However, ATA is not easy: the technical understanding and abilities needed to produce valid data using ATA are significant and betray the dilettante who is simply pursuing what is perceived as “easy” (i.e., less work). Furthermore, the efficiency of ATA methods is an insufficient justification for including ATA in one’s research design, even in the most resource-constrained research programs. Following the criteria for all rigorous social science inquiry, decisions as to whether and how to use computers in one’s research design should be made based on several considerations, the least of which is efficiency: 1 Will the use of computers yield valid measures? 2 Do those measures meaningfully address the research questions posed by the study? 3 Will the use of computers improve reliability? 4 Will the use of computers improve efficiency (i.e., expend less time and/or money)? The goal of social science research should always be to produce valid conclusions. It is possible that one can use an algorithmic coder in such a way that improves efficiency, but that comes at significant cost in terms of the validity of the study’s data. In such instances, improved efficiency or even improved reliability—the two primary strengths of algorithmic coders—cannot justify the use of computers in a research design. But if the criteria above are met, the algorithmic coder not only greatly improves efficiency and reliability, but has other benefits as well. For example, the computer code that directs the algorithmic coder’s classifications must be unambiguous and exhaustive; thus, the algorithmic coder can be more transparent and replicable than its human counterparts, who are often working from ambiguous and incomplete protocols, two weaknesses that are addressed (though, contrary to best

practices, often undocumented) in the human coder training process (Hak & Bernts, 1996).
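Before turning to that distinction, one of the simplest hybrid uses listed in Figure 3.1, having the computer validate codes as they are entered (item 5), can be illustrated in a few lines. The variable names and allowed values below are hypothetical and are not part of any protocol discussed in this book.

# A minimal validation step for a hybrid design: catch miskeyed entries
# before they reach the data file. Variable names and values are hypothetical.
ALLOWED = {
    "relevant": {0, 1},          # 0 = not relevant, 1 = relevant
    "tone": {1, 2, 3, 4, 5},     # 1 = very negative ... 5 = very positive
}

def validate(entry):
    """Return a list of problems with one coder's row; an empty list means OK."""
    problems = []
    for variable, allowed_values in ALLOWED.items():
        if entry.get(variable) not in allowed_values:
            problems.append(f"{variable} = {entry.get(variable)!r} is not allowed")
    return problems

row = {"relevant": 1, "tone": 7}   # a miskeyed tone value
print(validate(row))               # -> ['tone = 7 is not allowed']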

Distinguishing Algorithmic Text Analysis (ATA) If a study primarily uses an algorithmic coder (i.e., uses a computer application to assign numeric values to communication content based on either a preprogrammed set of rules of or a machine-learning approach), it is not “content analysis,” because: (1) it follows different processes/ methods; and (2) these distinct processes have unique implications for the validity of the data (Lacy et al., 2015). Because it follows a separate set of processes (i.e., it is a distinct research method), we should apply a separate term to the use of algorithmic coders. We propose algorithmic text analysis (ATA), with text broadly defined as encompassing any fixed form of communication that can be analyzed. We prefer the term ATA compared to computer-aided text analysis (CATA) (e.g., see, Neuendorf, 2017) because it is possible to use computer aids to freeze dynamic web content for coding (Zamith, 2017), to retrieve data from databases (see Chapter 5), or to sift and sort data and organize the coding task and to store coded data, while still relying primarily on a human coder. Thus, CATA does not truly distinguish the use of an algorithmic coder as following a different set of research processes. Blending the use of computers and computer programming with human coding is what Lewis et al. (2013) referred to as a hybrid approach to conducting content analysis. We not only expect, we encourage, these hybrid approaches to conducting content analysis. Doing so has the potential to improve efficiency and reduce human errors (e.g., a computer can validate data entries to prevent miskeyed entries). That said, we believe that it is important to distinguish studies that rely on algorithmic coders to do the classification and to assign numeric values to communication content as its own, distinct method—or, probably more appropriate, as a distinct set of methods. In delineating what is not content analysis, our purpose is not to make claims as to the superiority of one method or another, or otherwise create false divisions between scholars using different methods to answer research questions about communication content. By labeling these distinct research methods, we are simply drawing attention to fundamentally different research designs that are required for using human versus algorithmic coders, as well as distinct concerns these methods raise concerning the validity of data and the inferences that can be made based on those data. As an example of the unique processes that ATA entails (compared to content analyses using human coders), algorithmic coder data need to be carefully accessed, cleaned, and prepared for analysis. First, it is important to make sure that one is accessing relevant data. When using

Computers and Content Analysis  39 an algorithmic coder, the precision of keywords used to sample content from databases becomes more important. Where the human coder will eliminate irrelevant articles from the sample—often the first question in a human coding protocol gives a definition of what is a relevant piece of content, and the first variable is simply whether the content is relevant or not—computers will code even false positives, potentially introducing significant error into the data. On the other hand, when using a human coder, one can sacrifice narrow precision for the sake of broad recall, knowing that the human coder can pick out the false positives, and thus the broader search that will retrieve the greatest number of relevant articles is preferred. Another distinct ATA research process is “stemming” the text, so that noun, verb, and adjective forms of the same word (e.g.,“bike,” “biking,” and “bikeable”) are coded as pertaining to the same activity (Grimmer & Stewart, 2013). Function words that serve grammatical purposes but do not convey meaning (e.g., the, a, an) are also typically removed from text before it is analyzed. These are some of the unique research processes that help define ATA as a distinct set of methods.
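As a minimal sketch of these preprocessing steps (assuming the third-party NLTK package is available; the tiny stopword list and the example sentence are invented for illustration):

# Illustrative ATA preprocessing: lowercase, tokenize, drop common function
# words, and stem the remaining tokens.
# Assumes the third-party NLTK package (pip install nltk) is installed.
import re
from nltk.stem import PorterStemmer

# A tiny, illustrative set of function words; real projects would use a
# published stopword list and document that choice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

stemmer = PorterStemmer()

def preprocess(text):
    """Return stemmed content words from a raw text string."""
    tokens = re.findall(r"[a-z']+", text.lower())        # crude tokenizer
    content_words = [t for t in tokens if t not in STOPWORDS]
    return [stemmer.stem(t) for t in content_words]       # e.g., "biking" -> "bike"

if __name__ == "__main__":
    print(preprocess("The city is biking to work and building bikeable streets."))

Because stemming and stopword decisions can affect results, such preprocessing choices should be reported in the method section just as any other operational decision would be.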

Advantages and Disadvantages of ATA The distinct advantage of ATA is that the algorithmic coder can very quickly analyze large data sets, allowing scholars to engage in “big data” research. One way to define “big data” research is that it involves the analysis of a data set so large that the analysis by “traditional” tools is difficult, if not impossible. Freelon, McIlwain, and Clark (2018), for example, analyzed a collection of more than 40 million tweets about African Americans killed by police to examine how Black Lives Matter activists, political conservatives, and “unaligned” Twitter users wielded different types and degrees of social power in online conversations about police violence. It would be nearly impossible to imagine analyzing such a population data set using human coders. The algorithmic coder, however, makes it possible by vastly improving on the speed of coding, which, assuming there’s no need for very expansive computing infrastructure, manifests in substantial cost savings as well. The algorithmic coder is also 100% reliable. The validity of ATA data, however, is still determined by the validity of the conceptualization of the variable defined by the researcher, as well as the validity of the operationalization in the code that the algorithmic coder follows. To the extent that an algorithmic coder relies on human programming, the algorithmic coder is no more objective than a human coder. Unlike humans, though, who are susceptible to fatigue, human error, and so on, the algorithmic coder always executes commands with perfect fidelity. Due to the algorithmic coder’s efficiency and reliability, as well as advances in natural language processing and artificial intelligence that

40  Computers and Content Analysis are improving the capacity of computers to process complex human language peppered with colloquialisms, satire, irony, emotion, and so on, it is probable that the human coder and “traditional” content analysis will be made obsolete at some point in the future. Practically speaking, though, that day is still quite a way off. First, the development of many of these computational methods also relies on “gold standard” training sets that have been coded by humans, which are used in machine learning and against which the performance of the algorithm is tested. (Though we would point out that because they follow different research processes and those processes have different bearing on the validity of research data, it is a mistake to assume that the human and algorithmic coder should generate identical data and that discrepancies are necessarily the failure of the algorithmic coder, and not vice versa.) Second, the validity with which algorithmic coders can classify complex human language remains the primary concern in algorithmic text analysis (ATA), as some computational methods are no better than a random coin flip in terms of being able to correctly classify content into the correct categories (e.g., whether a particular statement is sarcastic or not). ATA is also rarely used in studies of visual media because those media pose their own challenges for algorithms, though technologies that can be applied to visual media are also improving. For example, Zhu, Luo, You, and Smith (2013) used face recognition software to analyze social media images of Barack Obama and Mitt Romney during the 2012 presidential election cycle, including their association with the number of views and comments their images received. Such analyses, however, are still nascent and relatively basic in the research questions that they ask. (Indeed, Zhu et al. (2013) were primarily concerned with simply developing a face recognition algorithm.)
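As a minimal sketch of the kind of "gold standard" comparison described above, the following checks an algorithmic coder's output against human-coded labels for one dichotomous variable and reports accuracy, precision, and recall; the label lists are invented for illustration only.

# Compare an algorithmic coder's labels against a human-coded "gold standard"
# sample for one dichotomous variable (1 = present, 0 = absent).
# The two label lists below are invented for illustration.
human     = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
algorithm = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

tp = sum(1 for h, a in zip(human, algorithm) if h == 1 and a == 1)
fp = sum(1 for h, a in zip(human, algorithm) if h == 0 and a == 1)
fn = sum(1 for h, a in zip(human, algorithm) if h == 1 and a == 0)
agree = sum(1 for h, a in zip(human, algorithm) if h == a)

accuracy  = agree / len(human)
precision = tp / (tp + fp)   # of the items the algorithm flagged, how many humans also flagged
recall    = tp / (tp + fn)   # of the items humans flagged, how many the algorithm found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")

Such raw agreement figures are only a first step; chance-corrected reliability coefficients of the kind discussed in Chapter 6 are still needed before treating algorithmic output as trustworthy.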

When ATA Is Best Applied Currently, ATA methods are best suited for analyzing particularly manifest variables (Zamith & Lewis, 2015). Kornfield, Toma, Shah, Moon, and Gustafson (2018), for example, used Linguistic Inquiry and Word Count (LIWC) to examine whether language used in an online alcohol abuse recovery group could predict relapse. LIWC is a commonly used program that classifies word usage based on a predetermined set of dictionaries, each of which is thought to represent a different psychologically meaningful category of language (Tausczik & Pennebaker, 2010). For example, Kornfield et al. (2018) hypothesized that relapse would be positively predicted by the frequency of words used from the “negative affect” dictionary in addicts’ posts, whereas relapse would be negatively predicted by use of words in the “first-person plural pronouns” dictionary (e.g., we), perhaps indicating stronger social support than posts primarily referring to “I.”
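LIWC itself is commercial software, but the underlying logic of any dictionary approach is simply counting how often a text's words fall into predefined category word lists. A minimal sketch, with made-up mini-dictionaries standing in for real LIWC categories:

# Illustrative dictionary-based coding in the spirit of LIWC-style tools.
# The two "dictionaries" below are invented stand-ins, not actual LIWC lists.
import re

DICTIONARIES = {
    "negative_affect": {"sad", "angry", "hopeless", "awful", "craving"},
    "first_person_plural": {"we", "us", "our", "ours"},
}

def dictionary_scores(text):
    """Return the share of a post's words falling into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {name: 0.0 for name in DICTIONARIES}
    return {
        name: sum(w in wordlist for w in words) / len(words)
        for name, wordlist in DICTIONARIES.items()
    }

post = "We are hopeless about our awful week, but we keep going."
print(dictionary_scores(post))

The validity of such scores rests entirely on the validity of the dictionaries themselves, which is why researchers typically rely on word lists validated in prior work rather than ad hoc lists like the ones above.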

Computers and Content Analysis  41 Such dictionary approaches are also quite common in sentiment analysis, which seeks to classify the emotional valence (positive, negative, or neutral) of text messages based primarily on the frequency of the use of either positive or negative words. Sentiment analyses of tweets about specific candidates and policy issues, for example, have been shown to predict election outcomes (Ceron, Curini, & Iacus, 2016). The problem with dictionary-based approaches is that they give no information about how words are used in context. Keyword-in-context (KWIC) and concordances approaches improve slightly in giving words more contextual meaning, by examining which words occur most frequently together, but still primarily extract individual words from their context, limiting the ability to understand those words’ actual meaning (e.g., was a specific word or phrase meant to be ironic or sarcastic?) Analyses of which words cluster together or “co-occur” within a specified distance, as well as network analyses of pieces of content, sources, and so on, can provide additional contextual information to algorithmic text analyses, though they also may not capture the full meaning of human language that may depend on words beyond a set “distance.” Natural language processing, machine learning, and artificial intelligence are areas where computational linguistics and computer science disciplines are working toward programming computers that can understand human language in its full complexity and context. Despite advances in these areas, however, recognizing more complex human language, such as humor (including sarcasm and irony), remains beyond the grasp of the algorithmic coder. Also worth noting is that even as algorithmic coding capabilities advance, there are fundamental differences in the motivations driving those who build algorithmic tools (i.e., computer scientists) compared to social science scholars. In an aptly titled research paper, “Detecting sarcasm is extremely easy ;-),” two computer scientists applied an algorithmic approach to correctly recall 68% of tweets containing sarcasm from a training data set with precision of only 53%—for detecting sarcasm in Amazon reviews (Parde & Nielsen, 2017). For computer scientists, even an incremental improvement over previous attempts in the ability of an algorithm to recognize complex language was a great success, but for social scientists the algorithm’s ability to correctly identify sarcasm only 68% of the time is unacceptable for valid inferences about human behavior. While we assume content analysis is a social science method that seeks to draw valid inferences about communication content in a way that builds and tests theoretical knowledge, the tools that computer scientists develop are often used for data mining, a largely atheoretical approach of inductively discovering and extracting relationships that may exist in the data. These different orientations toward research between computer and social scientists are at the root of some of the challenges to creating

42  Computers and Content Analysis collaborative, multidisciplinary teams of computer and social scientists to further advance algorithmic text analysis (ATA) and its application. That said, by recognizing and accepting these different approaches, and the value each brings to building research tools and theory that help answer important research questions, it is possible to strengthen bridges between the disciplines.
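Returning to the keyword-in-context (KWIC) idea discussed earlier in this section, the core operation is simply extracting a window of words around each occurrence of a keyword so that usage can be inspected. A minimal sketch; the sample text is invented:

# Minimal keyword-in-context (KWIC) extraction: show each occurrence of a
# keyword with a few words of surrounding context for closer inspection.
import re

def kwic(text, keyword, window=4):
    """Return each occurrence of keyword with `window` words on either side."""
    words = re.findall(r"\w+", text.lower())
    hits = []
    for i, w in enumerate(words):
        if w == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append(f"{left} [{w}] {right}")
    return hits

sample = ("The council praised the new bike lanes, although critics said "
          "the bike lanes were painted over existing parking.")
for line in kwic(sample, "bike"):
    print(line)

Even this simple output makes clear why context windows help but do not settle questions of irony or sarcasm, which may hinge on material well outside the window.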

Hybrid or Computer-Aided Content Analysis

While ATA refers to those studies that rely primarily on algorithmic coders, there are numerous examples of content analyses that use computers as part of their design but ultimately use human coders to categorize and assign numeric values to the communication content under study. This so-called "hybrid approach" might more aptly be labeled "computer-aided text analysis," which refers to those studies that use computers in the design but do not fully automate the coding itself. Computers can be used to retrieve content from databases and social media, to "freeze" ever-changing, dynamic web content, and to create one's own database of static, cross-sectional observations. They can be used to sift data (e.g., to identify only those tweets that use specific keywords or hashtags, or mention specific users' Twitter accounts), to organize the coding task, to serve as a coding sheet for recording data, and to validate those codes in order to eliminate miskeying. Lewis et al. (2013), for example, described a coding interface created to streamline the coding task. On the left-hand side of the computer screen, the interface pulled from a database the specific piece of content—in their case, a tweet—that was to be coded. This left-hand window ensures that content being coded is not rearranged out of order, that content is not inadvertently skipped over, and so on, thus eliminating the unreliability that results from a mismatch in the content units that pairs of coders are coding. In Lewis et al.'s (2013) coding interface, the coding sheet was on the right-hand side. Drop-down menus eliminated the possibility of invalid, miskeyed values of the variables. (One can even use a simple tool such as Excel to validate keyed entries by, for example, constraining coders from being able to enter "2" for a category that is supposed to be coded present = 1, absent = 0. In such a scenario, the coder who entered "2" would be directed by the computer to make a valid entry ["1" or "0"].) Such hybrid or computer-aided approaches to conducting content analyses can leverage the efficiencies and reliability of computers, cutting back on potential human error, while still relying on human coders who can recognize the full richness and complexity of human language. Lewis et al. (2013), using a study of sourcing in tweets about the Arab Spring as a case study, describe processes for using computers and computer programming to: (1) gather data (i.e., to gather tweets); (2) extract

Computers and Content Analysis  43 data fields (i.e., to parse data) that were of interest to their analysis by organizing those fields into a spreadsheet; and (3) filter data based on date ranges, keywords, and so on. Without the aid of a computer, each of these tasks would have been laborious for a human to perform manually, particularly working with a corpus of 60,000 tweets. (Assuming, very conservatively, one minute to process each tweet, the task would take 25 40-hour workweeks—or an entire year for a 20 hours per week graduate research assistant.) One of the reasons Twitter data are popular for content analysis is that they are what is known as “structured data.” Each tweet has various data points—an author, an associated user profile page, a time, a date, replies, mentions, the text of the tweet, URLs mentioned in the tweet, and so on. All of these data points are stored in a consistent, structured database, which can be accessed through Twitter’s application programming interface (API), which allows users to “call” fields from Twitter’s databases and store them in table format for analysis. On the downside, scholars also are increasingly aware of fake social media accounts, sites where one can purchase followers, and so on, that have drawn into question what these structured data are ultimately capturing (see Karpf, 2012; Zamith & Lewis, 2015). Unstructured data, such as individual websites, are not stored in a predetermined way but can be studied. Online content can also take the form of semi-structured data that do not necessarily conform to the datatable structure that tweets do, but which have some relatively consistent, recognizable form (e.g., newspaper articles, with headlines, subheads, bylines, body of articles, etc.). A challenge of studying online communication content is that very little of it is structured, and data that were once semi-structured are often less-so online (newspapers’ home pages have a much less structured form than traditional print front pages, with their somewhat standard sizes, columns, headline conventions, above-/below-the-fold distinctions). Online content is also dynamic, meaning that web pages are being continually updated, and previous versions of a web page may not be archived (Zamith, 2017). Thus, processes such as gathering data via Twitter’s API and parsing the various fields in the data, while requiring some technical knowledge, are relatively straightforward. On the other hand, working with less structured data, such as stories from different news organizations’ home pages, for example, requires more sophisticated programming to “scrape” the data from the home pages, freeze it so it can be reliably analyzed at a specific point of time, and parse the data into fields (e.g., headline, subhead, byline, dateline, body, etc.) that can be stored in a data table for analysis. As an example of these processes, Zamith (2017) described a process of using custom Python scripts to “freeze” dynamic web content and then parsing that content so that changes in

the placement of news stories across different organizations' websites, as a function of judgments of newsworthiness, could be tracked over time.
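As a minimal sketch of the parsing step in such a workflow (not Zamith's actual scripts), the following takes a "frozen" HTML snapshot of a hypothetical article page, extracts headline, byline, and body fields, and appends them to a data table. It assumes the third-party beautifulsoup4 package, and the tag and class names are invented for illustration:

# Parse a "frozen" HTML snapshot of a hypothetical news article page into
# fields and append the record to a CSV data table for later analysis.
# Assumes beautifulsoup4 (pip install beautifulsoup4); the HTML and its class
# names below are invented for illustration.
import csv
from bs4 import BeautifulSoup

frozen_html = """
<html><body>
  <h1 class="headline">Council approves new bike lanes</h1>
  <span class="byline">By A. Reporter</span>
  <div class="story-body"><p>The city council voted 5-2 on Tuesday ...</p></div>
</body></html>
"""

soup = BeautifulSoup(frozen_html, "html.parser")
record = {
    "headline": soup.select_one(".headline").get_text(strip=True),
    "byline": soup.select_one(".byline").get_text(strip=True),
    "body": soup.select_one(".story-body").get_text(" ", strip=True),
}

with open("articles.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(record.keys()))
    if f.tell() == 0:            # write the header only for a new file
        writer.writeheader()
    writer.writerow(record)

Real home pages are messier than this snippet, which is precisely why less structured data demand more sophisticated (and more carefully validated) parsing code.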

“Scaling Up” Content Analyses Certainly, the size and complexity of data sets in communication research are increasing. How can studies of communication content be “scaled up” while retaining a focus first and foremost on data validity and also preserving other foundational values of social science research (e.g., transparency and reproducibility)? There are powerful enterprise tools, such as Crimson Hexagon (www.crimsonhexagon.com/), that corporate brands and scholars alike are using to understand large social media data sets. The advantage of such enterprise solutions is that they have purchased from Twitter access to all tweets, including historical data that are not available to users of the open Twitter API, which draws from a sample of tweets and only covers the past seven days of tweets at the time of this writing (Twitter, 2018). Such data access, though, comes at a steep cost for anyone except those from the most generously funded university (or corporation). As an example, a group of researchers from Princeton and Harvard Universities used Crimson Hexagon to categorize tweets from Arab countries that mentioned the United States based on whether the topic was either social or political, as well as their valence (positive or negative) (Jamal, Keohane, Romney, & Tingley, 2015). Their findings, though largely descriptive in nature, were interesting: negative assessment was primarily attached to political tweets, especially those regarding interventionist policies. On the other hand, there was no more scorn for the United States than Iran on social issues. Thus, the authors concluded that Arab Twitter users are not expressing a general hatred toward the United States, but rather anger over interventionist political policies in the Arab world. However, intriguing as the results were, the problem with the study is that it is not possible to easily replicate it, though replication is an important, albeit rare, element of the scientific process. Replication is one of the primary tools we have to guide against human bias in our research; if others can follow an identical research process and obtain the same results, it suggests that at least the methods that the study followed were objective. The Jamal et al. (2015) research is not replicable because, first, Twitter data are themselves proprietary, and their use and researchers’ ability to openly share research data containing tweets are restricted by the service’s user agreement, which aims to preserve the commercial value of the data. For example, it is helpful and valuable to a company such as Apple to use Twitter to study reactions to new product launches, or to examine sentiment toward its products. However, such agreements handcuff the sharing of data among scholars. Second, Crimson Hexagon is not widely available to other researchers due to its high cost, so others are not able

Computers and Content Analysis  45 to freely replicate the data-gathering process. Third, the algorithm used to classify the content is proprietary and confidential (as is the case with other commercial software); for all intents and purposes, the data were classified within a “black box,” beyond the view, inspection, and assessment of peer scholars tasked with reviewing the research. In response, Trilling and Jonkman (2018) have offered a list of broad “best practice” requirements for the use of computer platforms in hybrid content analysis and algorithmic text analysis (ATA) designs, best practices meant to preserve values of the scientific process, especially replicability: 1 The platform should be scalable, to work on a laptop computer, but also on servers to support “big data” projects. 2 It should not depend on commercial software to run; it should be free and open-source—open-source implies that the source code of the program is freely available for inspection and alteration; open source is the antithesis to the commercial “black box.” 3 The platform should be adaptable to a wide range of projects and collaborations, which requires the platform to be open-source. 4 The platform should give users advanced control, but also provide an easy-to-use interface for the novice user (i.e., the platform should not be accessible only to a small subset of social science scholars with expert technical abilities). It should be widely usable by the community of scholars. We add to the list that the platform should be geared toward analyzing publicly accessible data sources that can be readily shared, at least within the community of interested scholars, keeping in mind that the sharing of research data is also an important expectation of major research funders, such as the National Science Foundation.
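One small, platform-independent practice that supports the transparency and replicability goals above is to log exactly how a data set was gathered, so the retrieval step can be reported and, where data access allows, repeated. A minimal sketch; every field value shown is a hypothetical placeholder:

# Write a simple "collection manifest" documenting how a data set was
# gathered, so the retrieval step can be reported and repeated.
# All field values below are hypothetical placeholders.
import json
from datetime import datetime, timezone

manifest = {
    "query_terms": ["#example_hashtag", "example keyword"],
    "date_range": {"start": "2019-01-01", "end": "2019-01-31"},
    "source": "public API (sample stream)",
    "retrieved_at": datetime.now(timezone.utc).isoformat(),
    "script_version": "collect_posts.py v0.3",
    "records_retrieved": 58214,
}

with open("collection_manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)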

Summary The use of computers in studies that analyze media content can greatly improve efficiency, allowing researchers to study more complex data sets, and reliability, a precondition for validity. It is probable, but still a distant promise, that the human coder will be made obsolete by the algorithmic coder. We challenge the notion that human and algorithmic coders necessarily will produce equivalent, much less identical, data, or that if they fail to produce equivalent data, it is necessarily a “fault” of the algorithmic coder. We distinguish the use of algorithmic coders as a different method by assigning a different name: algorithmic text analysis (ATA). Currently, the algorithmic coder is best suited to studying text (rather than visual media, such as photography and video) and particularly manifest variables.

Computers, though, can also be used to enhance the efficiency and reliability of human coders, albeit on a smaller scale. This chapter noted how computers can be used to gather, parse, sort, and filter data, as well as to organize the coding task and validate codes applied to content. Computers should be used in content analysis to improve efficiency and reliability, as long as their use to perform tasks does not significantly threaten the validity of data. Lastly, this chapter discussed how content analysis can be "scaled up," particularly with an eye toward further developing tools that make content analysis more efficient. Not only is validity a primary concern, but so are other aspects of the social science process, including the role of peer review and the replicability of research.

4 Measurement

Social science research methods use what Babbie (2013) calls a variable language to study variations in attributes among people and people's artifacts. A concept that shows variation in values when it is measured is called a variable. Variables can be summarized and analyzed quantitatively by assigning numbers to show these variations, and content analysis assigns numbers that show variation in communication content. Measurement links the conceptualization, data collection, and analysis steps presented in Chapter 8. Careful thinking about that linking process forces a researcher to first identify properties of content that represent the theoretical concepts of interest (e.g., bias, frames, etc.) and then to transform those properties into numbers that can be analyzed statistically. In more concrete terms, measurement is the reliable and valid process of assigning numbers to units of content. Measurement failure, on the other hand, creates unreliable and invalid data that lead to inaccurate conclusions, which, particularly in content analysis, means a significant waste of effort and other resources. In content analysis, establishing adequate intercoder reliability is a key part of assessing measurement success. Intercoder reliability means that different trained coders applying the same classification rules as set forth in a predetermined protocol to the same content will almost always assign the same numbers. Beyond reliability of measurement, validity of measurement requires that assignment of numbers accurately represents the concept being studied. As Rapoport (1969) said:

It is easy to construct indices by counting unambiguously recognizable verbal events. It is a different matter to decide whether the indices thus derived represent anything significantly relevant to the subject's mental states. As with all social science measurement, some error will exist, but a carefully constructed measurement procedure that is adjusted with use will give valid and reliable measurement. (p. 23)

48 Measurement Content analysis shares a common approach and a common problem with other observational methods such as survey research and experimental design. Isolating a phenomenon enables us to study it, but that very isolation removes the phenomenon from its context, resulting in some distortion in our understanding of that phenomenon in the world. In content analysis, reducing a body of content into units of content that are more easily studied risks losing the communication context that provides a fuller meaning. Yet that reduction of content to units is necessary for the definition and measurement of the variables of interest in studies of human communication. Ultimately, then, the success of our predictions based on our use of these content units must validate the choice of the variables we study and of the ways we have measured them. As noted in Chapter 1, we test relationships in which content variables change in response to antecedent causes, and/or we test content variables as causes of some subsequent effects.

Content Units and Variables in Content Analysis How we identify variables appropriate for content analysis in such relationships depends first on the hypotheses and research questions guiding a particular study. In other words, theory and prior research give us some idea of the variables we might observe, and the hypotheses and questions we derive from theory and prior research give us even more explicit guidance. However, defining and observing relevant variables also depends on how they are embedded in the observable world of communication. This is partly a question of validity in content analysis, discussed in Chapter 7. But part of this validity problem comes from how we divide a universe of observable content into the more manageable parts, or units, of content. It is with these units that we address our hypotheses and questions. Sometimes the search for the units that compose our variables is straightforward, even absent guidance from theory or research. If we think of survey research, for example, we know that objective reality includes individual people, and there is little problem in distinguishing where one person stops and another begins. So, our variables, in this case, are characteristics of people found at the individual level of analysis. We might, for instance, ask how one person’s height, political views, or occupational status differs from another’s. This may quickly get more complex and ambiguous, however, when we ask about pairs, groups, organizations, social systems, or even societies. It may not be so clear where one leaves off and another begins. As noted above, content analysis requires that we “cut up” observable communication behaviors in a way that is justified by logic or theory. Very commonly, as in the case of individual people, this may sometimes seem simple: we look at stories in newspapers, comments on Facebook, ads in a women’s magazine, and so on. The medium in which

Measurement  49 the communication appears provides both a means of separating one kind of communication content from another and an expectation that what we are looking for can be found. This separation of content may also seem relatively easy within a medium. Let’s say we are looking at a website aimed at women, and we want to assess ads (paid content that attempts to sell a product) as distinct from the stories and other material a women’s website might include. Finally, within the ads, we might define, look for, and measure variables. Specifically, we may ask if an ad includes the image of a woman, and if so we can look at a variety of other variables of interest about that image of a woman (e.g., body type, clothing, interactions with others, etc.). We must also consider how crudely or precisely we can measure variables such as body type, clothing, or interactions. This degree of precision, or “level of measurement,” for our variables must correspond to the phrasing of the relationships among variables in our hypotheses. Our level of measurement of these variables then determines the kind of analysis we can conduct to test the hypotheses. This chapter links theory and measurement by conceptualizing content in three “descending levels” with each broader level, including the lower ones. Again, our hypotheses and research questions give guidance that is more or less explicit for our decisions as we descend through these levels. At the broadest level, we consider content forms, the manifest ways in which a “universe” of communication may be decomposed into parts. Included within those forms are the units of observation that more specifically direct us to content likely to include variables of interest. And at the lowest level are the units of analysis, the content that includes the variables of interest measured at a level informed by our hypotheses or questions and appropriate to our mode of statistical analysis.

Content Forms

Consider these three hypotheses:

H1: Newspaper news stories most often report public policy issues from the standpoint of institutional leaders.

H2: Verbal interchanges by characters on a streaming TV comedy are more likely to include humor that is cruel than are verbal interchanges among characters on a network television comedy.

H3: Women pictured in Instagram selfies will be portrayed in more revealing clothing than Instagram selfies of men.

These hypotheses suggest a broad variety of content forms and combinations of forms that can be analyzed. Although Chapters 1 and 2 have emphasized familiar distinctions among print, broadcast, and online

media, a broader classification scheme distinguishes among written, verbal, and visual communication. These three forms are basic and can be found across media. Written communication informs with text, the deliberate presentation of a language using combinations of symbols. H1 above requires observation of content forms that are written. The text can be on paper, an electronic screen, or any other type of physical surface. An important characteristic of written communication is that a reader must know the language to understand the communication. (But coding protocols should be adequately specific such that, assuming basic linguistic familiarity, one should not have to understand unwritten cultural expectations not explicitly laid out in the coding protocol.) Historically, most content analysis studies have involved text because throughout most of history text has been the primary way mass-produced content has been preserved. Of course, as new non-text-based media platforms have grown, content studies of them have followed, albeit perhaps not to the extent that they should. Content analysts are still drawn to well-indexed and archived content—disproportionately, newspapers—that makes the analysts' lives easier. In most, but not all, content analyses of text, it is straightforward to divide the text in theoretically meaningful or practical ways in order to then define the variables of interest. As suggested above, much of mass communication research, particularly content analyses, has focused on news stories in newspapers or stories in print magazines. In a typical example, McCluskey and Kim (2012) examined how the ideology of 208 advocacy groups was portrayed in 4,304 articles published by 118 newspapers in 2006. They found that more moderate groups were less prominent than more extreme ones. However, the expansion of digital media has allowed easy combination of various forms of content and the ability of individuals and organizations to capture high-quality visual and verbal communication with mobile devices. Text continues to be important, but content analysts must deal with many other forms of mediated communication. Verbal communication, by contrast, is spoken communication, both mediated and non-mediated, intended for aural processing. H2 requires analysis of such verbal communication, specifically in the context of video content. When aural content is preserved, it is often saved as text in transcripts, particularly for the purposes of content analysis. One frequent purpose of content analysis of auditory content is to examine doctor–patient conversations in a number of care settings. Vashi and Rhodes (2011) used digital audio recorders unobtrusively connected to common medical equipment to record 477 conversations that emergency-room providers had with patients about discharge instructions. In particular, the researchers wanted to know the extent to which the conversations contained an opportunity to ask questions and the extent

Measurement  51 to which providers confirmed that the patient understood the discharge instructions given. (Providers confirmed patients’ understanding in only 22% of conversations analyzed.) One might also be interested in studying calls to various types of support hotlines. Auditory content, however, is not always readily available in its original form. Gilat and Shahar (2007) faced the problem of acquiring verbal content from adolescent suicide prevention hotlines in order to compare suicidal ideations between those using the telephone hotline, one-on-one synchronous chat rooms, and asynchronous online support groups. The hotline calls were not recorded to protect callers’ confidentiality, but after calls concluded, staff logged the calls in detail, which the authors argued convincingly is an accurate record of the contents of the conversation. Similar methods were used to compare the different social support reasons that men versus women called sexual assault survivors’ hotlines (Young, Pruett, & Colvin, 2018). It is also common to study just the auditory portions of video content. Fouts and Burggraf (1999), for example, coded prime time network television situation comedies to evaluate the verbal reinforcement of females’ body weight (below average, average, above average). Comments toward main characters were classified as positive or negative toward body weight. Below average weight female characters received more positive comments from males than did the other two groups. Visual communication involves efforts to communicate through nontext symbols processed with the eyes. H3 requires this kind of visual content analysis. Visual communication includes still visuals, such as photographs, and motion visuals, such as film and video. Still visuals are often easier to analyze than motion because stills freeze the relationship among visual elements. Motion visuals often require repeat viewing to identify the elements, symbols, and relationships in the visual space. Kurpius (2002) coded videotapes for the race and gender of sources in newscasts that won James K. Batten Civic Journalism Awards. Kurpius’ conclusion was that civic journalism stories had more source diversity than did traditional TV newscasts. Táboas-Pais and Rey-Cao (2012) discovered that male characters were given more prominence than females in Spanish physical education textbooks between 2000 and 2006. Hum et  al. (2011) found no gender differences in Facebook profile photographs. Döring, Reif, and Poeschl (2016) compared Instagram selfies and magazine advertisements, concluding that Instagram selfies were more likely to reinforce gender stereotypes than the advertisements were. In addition to written, verbal, and visual forms, most digital presentations use more than one communication form. Social media often combine text with video, graphics, or photographs. In addition, reports about events may vary from digital platform to digital platform. For instance, Thorson et al. (2013) studied video of the 2011 Occupy movement that was available on Twitter and YouTube. They found that

52 Measurement scholars should access video through both Twitter and YouTube to create a more diverse pool of video. In addition, they found that both tweets and shared videos are essential to grasping the meaning of a Twitter post. The combination of communication forms will likely expand as people increasingly use web-based and mobile devices for information. Papacharissi (2002) coded personal websites for the presence of graphics, text, and interactivity to see how individuals present themselves online, and found that web page design was influenced by the tools supplied by web space providers. Special Problems Associated with Measuring Non-Text Forms All content analysis projects face common problems such as sampling, reliability, content acquisition, and so on, which are the bases for most of this book. However, projects involving non-text content forms face special problems. Non-text communication adds dimensions that can cloud the manifest content of communication. For example, spoken or verbal communication depends, like text, on the meaning of words or symbols, but it also involves inflection, tone, and even body language that affect the meaning applied by receivers. The simple verbal expression “the hamburger tastes good” can be ironic or not depending on the inflection added to the words. No such emphasis can easily be inferred from written text unless it is explicitly stated in the text. Inflection and tone can be difficult to interpret and categorize, placing an extra burden on content analysts to develop thorough coding instructions for verbal content. Similarly, visual communication can create analysis problems because of ambiguities that are not easily resolved from within the message itself. For instance, a text description of someone can easily reveal age with a number: John Smith is 35. However, a visual representation of that same person can be quite vague as to age. Olson (1994) found that she could establish reliability for coding character ages in TV soap operas by using wide age ranges (e.g., 20–30 years old, 30–40 years old, etc.). This may not be a problem in some research, but reliability may come at the expense of validity when studying content variables such as the ages of characters in advertising. It is often difficult to differentiate a person who is 16 years old from one who is 18 years old, yet some content analyses would suggest that characters 18 years and older are adults, whereas 17-year-olds qualify as teenagers. Because of the shared meaning of so many commonly used words, written text may in effect provide within-message cues that can serve to reduce ambiguity. Shared meanings of visual images are less common. Researchers should be cautious and thorough about assigning numbers to symbols whose meanings are to be drawn from visual cues. Consider the task of inferring the socioeconomic status of characters in television programs. Identifying a white, pickup-truck-driving man in his forties as

Measurement  53 “working class” on the basis of clothing—denim jeans, flannel shirt, and a baseball cap—may be more reliable and valid than using the same cues to assign a teenager to that class. Combinations of communication forms—visual with verbal, for example—can generate coding problems because of between-form ambiguity. That is, multiform communication requires consistency among the forms if accurate communication is to occur. If the visual, text, and verbal forms are inconsistent, the meaning of content becomes ambiguous, and the categorizing of that content becomes more difficult. A television news package might have a text describing a demonstration as nonviolent, whereas the accompanying video shows people throwing bottles. A researcher can categorize the two forms separately, but it becomes difficult to reconcile and categorize the combined meaning of the two forms.

Units of Observation Once we’ve decided on the communication form or forms that address our research concern, we then descend to the units of observation that are likely to help provide access to content containing the variable data that address that concern. Units of observation are more specific demarcations of content that serve to further focus our observation on that content of interest. Here is where additional complication begins to occur, because we frequently have to define layered “units of observation” to finally get to a unit of content that we analyze to address hypotheses or questions. To use a metaphor provided by Internet map programs, one may go through successively tighter focusing that starts with a broad planetary view and ultimately reaches a street corner or address of interest. All the “intermediate focuses” between these endpoints are the layered units of observation. The hypotheses above serve to illustrate this process of narrowing of our observation. H1, for instance, requires that we first define how a newspaper is different from all other text media of communication. We’ll call this observation level 1 (OL1). We then have to define how news stories in newspapers differ from all other content (e.g., editorials, ads, letters, etc.) we might find there. We’ll call this observation level 2 (OL2). Once we have a news story defined, we’re ready to look within it for content that addresses H1. This process would be similar with H2. The analyst must define at OL1 how streaming TV videos differ from all other videos, define at OL2 how shows on network TV programs differ from non-network TV shows, define at OL3 how a comedy show differs from other content broadcast or streamed, and finally define at OL4 how humorous verbal interactions among characters differ between streaming and network comedy programs.

54 Measurement Note that there is no “system” for specifying how many levels of observation units we must define and descend through before reaching the content of interest. That’s because the complex world of communication is what it is. The content we study is created by others who do not have the purposes of content analysts in mind. Content analysts must therefore use successive definitions of “units of observation” to “map” that world so that other researchers can follow where they’ve gone to understand the data presented, and perhaps also find the data they may need. Basic Units of Observation Used in Content Analysis Every study will need to determine the units of observation most relevant and useful for achieving its research goals. As we emphasized above, no uniform “system” exists for selecting units of observation that will be relevant in all studies. Nonetheless, we can provide a broad division of such units into physical units and meaning units, and provide examples of such units of observation that may be useful in research projects. Physical Units The most basic physical units are time and space measures of content. These physical units may also be more or less closely related to the sheer number of some defined units of content that take up space or time. Measuring text and photographs, for example, may involve counting the number of stories and photos, and these will occupy physical space in a content form measured by square inches or centimeters. Some research has assessed how strongly the two are correlated. Windhauser and Stempel (1979) examined correlations among six measures of local newspaper political coverage (article, space, statement, single issue, multiple issue, and headline). Rank order correlations varied from .692 for headline and multiple issues to .970 for space and number of articles. Windhauser and Stempel concluded that some physical measures might be interchangeable. The type of measure used should be based on the type of information needed in the study, the variables being investigated, and the theoretical basis for the study. Measurement of space for verbal and moving visual communication has little meaning, of course. Verbal content does not involve space, and the size of space devoted to a visual element depends on the equipment used to display the visual (larger screens have more space than smaller screens). Instead of space, verbal and moving visual communication has a time dimension measured by units of time (seconds, minutes) devoted to the visual and verbal content. For example, television video can be evaluated by measuring the number of seconds of time given to a character or topic. The average length of time given to topics may be assumed in such research to relate to the depth of treatment. The more time devoted to

Measurement  55 a character or topic, the more information the video contains about the character or topic. Although they are among the most objective units used in content analysis, physical units often are used, as in the statement above about depth of coverage, to infer to the values of the sender and/or to their impact on the receivers. For example, measures of broadcast time were used in a study (Zeldes & Fico, 2010) to explore how the gender of reporters was related to the gender of sources used by television networks covering the 2004 presidential race. An effect-size computation indicated that women reporters gave women sources an important “boost” in airtime compared to the time that male reporter colleagues gave women sources. Zeldes and Fico (2010) inferred that this difference was caused by greater aggressiveness in finding women sources by women reporters, perhaps supplemented by newsroom mandates to include more women sources in stories, but certainly other explanations are plausible (e.g., women sources may have felt more comfortable talking to women reporters). So, caution in making such inferences from measures of physical content alone is always necessary. More fundamentally, two assumptions undergird any inferences from physical or time measures of units to content antecedents or content effects. The first is that the allocation of content space or time is systematic and not random. Part of this first assumption is that these systematic allocations are identifiable as nonrandom content patterns. The second assumption is that the greater the content space or time devoted to some issue, subject, or person, the greater will be the impact on the audience for the content. So, for example, an online news site that allocates 75% of its news content space for stories about the city in which it is located might plausibly be assumed to be making a conscious effort to appeal to readers interested in local happenings. At the same time, allocating 75% of space to local coverage plausibly has a different impact on the total readership than allocating 40%, at least in terms of the probability of exposure to such content. Meaning Units Meaning units may involve the kind of physical and temporal qualities described above, but are less standardized. Sources in a news story, for example, will have words attributed to them that take up a certain amount of space or time. But it is the meaning of the words that may be the focus of interest for a study, and those meanings provide both richness and ambiguity to inferences from them to antecedent causes or subsequent effects. One of the most basic types of symbolic units in content analysis is what Krippendorff (1980) called syntactical units. Syntactical units occur

56 Measurement as discrete units in a language or medium. The simplest syntactical unit in language is the word, but sentences, paragraphs, articles, and books are also syntactical units. Syntactical units in the Bible, for example, would be particular verses within chapters. In plays, they are dialogue within scenes and acts; in television, such dialogue would be found in the commercials and programs. A special problem with analyzing such syntactical units concerns context: How do we validly separate a particular unit of content, say dialogue in a television comedy, from dialogue that comes before or after that unit? In particular, does that separation from context distort the meanings communicated? An even more foundational problem, discussed in more detail in Chapter 7, concerns the very meanings we infer from the words in the unit of content that we analyze. Almost all content analysis must cope with these problems because of the focus on some kind of syntactical units. For example, such units are examined in studies of bias, framing, diversity, persuasiveness, sexuality, violence, and so on. Sampling Concerns with Units of Observation Recall how we define nested units of observation in a process of tightening our observational focus on the content of interest in a study, moving from higher observational levels to lower ones. Unless we can collect all data for all the defined units of observation, we’re going to have to sample one way or another. If we have a census or a purposive sample of all units of observation, our reasoning to justify the decision is all we need to present. However, if we are random sampling within units of observation because there is too much data, we need to keep the rules of random sampling inference in mind. Specifically, we can only generalize from a sample of one type of observation units to the population of such units from which the sample was taken. And we make such inferences with a certain amount of possible error. Consider, then, an example of what this would mean if we randomly sampled from multiple units of observation for the Instagram selfie study referenced above: H3: Women pictured in Instagram selfies will be portrayed in more revealing clothing than Instagram selfies of men. Such sampling is similar to cluster or multistage sampling (described more in Chapter 5). We might first have to sample Instagram posts from some population; we might then have to sample Instagram posts that are selfies; and we might then have to sample in equal parts Instagram posts that portray women and Instagram posts that portray men. Each of these samples produces some level of sampling error that depends on the size of the sample. This final sampling error produced by sampling across all units of observation will require special calculation.
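As a small illustration of the calculation involved at a single stage, the sketch below computes the 95% margin of error for a proportion estimated from one simple random sample; combining error across the multiple stages described above requires the multistage sampling approaches discussed in Chapter 5 and is not shown here. The sample values are hypothetical:

# Margin of error (95% confidence) for a proportion estimated from one
# simple random sampling stage -- e.g., the share of sampled selfies that
# portray women. Multistage designs compound error across stages.
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the confidence interval for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

p_hat, n = 0.48, 400                      # hypothetical sample proportion and size
moe = margin_of_error(p_hat, n)
print(f"{p_hat:.2f} +/- {moe:.3f}")       # roughly +/- 0.049 for these values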


Units of Analysis A unit of analysis refers to that demarcated content about which we can define and observe one or more variables of theoretic interest. Given any of the example hypotheses above, we’ve finally reached a level at which we can observe data that give us our answer. In fact, we’ve also gotten to the heart of the work content analysis must do as a valuable informationgathering method. We must define the variables in our hypotheses in terms that are observable in content, and, given relevant observations, we must quantify what we’ve found. To continue the example with H3, just how do we define “revealing clothing” and how do we measure clothing types (or body/skin exposure) we find using that definition? When a content analysis is at the unit of analysis level, then we make two types of decisions of critical theoretical and empirical importance. These are the definitions of content (its classification into categories) and measurement of content (in particular, the level of measurement). These decisions determine how the hypothesis is addressed or the research question is answered. Note, by the way, that a single “unit of analysis” may sometimes exactly correspond to what we want to know and the variable we want to measure: Does an Instagram post include a female, yes or no? But we can “mine” that unit of analysis for more than just that yes or no, presence or absence, variable. In the ad example, we can look at the clothing the subjects are wearing, the body pose they are pictured in, or even the race and age of pictured subjects. Classification Systems Content analysis protocols contain descriptions for how content can be observed and classified into categories that make up the variables. In other words, each content variable in a hypothesis must have at least two categories in which the content units are placed. To continue the example of H3 above, we might look at one or more types of clothing (e.g., tank tops, blouses, etc.) or body parts (e.g., knees, shoulders, stomach) that are unclothed. Then we need to decide on a level of measurement: Do we use a dichotomous variable (i.e., revealing/not revealing) or do we develop a scale (e.g., a photo that shows the shoulders and stomach unclothed is more revealing than a photo that just shows the shoulders exposed)? Of course, as noted above, we’d have to carefully define which particular elements in an Instagram picture determine whether a given Instagram post conforms to gender stereotypes, such as those related to body displays. A classification system, then, is a collection of definitions that link observed content to the theoretical, conceptual variables in our hypotheses or questions. As in the example above, when variables are nominal

58 Measurement level, they will have categories. For a different example, a variable that assigns values based on political leaning of content will have categories such as liberal, conservative, and neutral. The variable and all categories will have definitions to guide assignment of values for the categories. Variable definitions can use a range of content characteristics. Some studies of news have looked at geographic emphasis of the articles, whereas others have assigned numbers that represent an order of importance based on the physical location of articles within the paper or within a newscast. The classification system translates messages into variables by assigning content units to categories. A content analyst is well advised to consult conceptual and operational definitions used in past studies that relate to the present one. Given that the aim of research is to build a body of knowledge, this is most efficiently and effectively done when researchers use, or at least start with, existing definitions. Of course, new ground may have to be broken, or even old mistakes corrected, when it comes to such definitions and measures. Deese (1969) provided a six-part typology useful in conceptualizing content analysis variables that is frequently used. But we emphasize that the classification system used in a particular content analysis will draw most usefully from related past research and will be guided most efficiently by the study’s specific hypotheses or questions. Deese’s classifications include: Grouping: Content is placed into groups when the units of analysis (e.g., sources in a news story) share some common attribute (e.g., they take a position on some issue). In a study of the motives behind professional athletes’ tweets, Hambrick, Simmons, Greenhalgh, and Greenwell (2010) found that the plurality of tweets involved interactivity with followers. Class structure: Class structure is similar to grouping, but the groups have a hierarchical relation, with some classes (or groups) being higher than others. Deese (1969) said, “Abstractly, a categorical structure can be represented as a hierarchically ordered branching tree in which each node represents some set of attributes or markers which characterize all concepts below that node” (p. 45). The Strodthoff, Hawkins, and Schoenfeld (1985) classification system using three levels of abstraction (general, doctrinal, and substantive) for content involved a class structure based on how concrete magazine information was. Dimensional ordering or scaling: Some content can be classified on the basis of a numerical scale. Deese (1969) gave five abstract properties that typically involve scaling: (a) intensity; (b) numerosity; (c) probability; (d) position or length; and (e) time. It is fairly common to find content analyses using one or more of these types

Measurement  59 of scales. Washburn (1995) studied the content of radio news broadcasts from three commercial and three nonprofit networks by using length of broadcast in minutes and seconds, the number of separate news items, and the average time given to news items. All of these represent a form of dimensional scaling. Shin and Thorson (2017) studied how partisanship related to the sharing and commenting on social media about fact-checking. Each message was coded as positive, negative, or neutral toward each presidential candidate in the 2012 election. Spatial representation and models: Language can be thought of as representing a cognitive space or map. The meaning of words and language can be placed in a mental spatial model that allows a person to evaluate objects, issues, and people along continua or dimensions. Osgood, Suci, and Tannenbaum (1957) pioneered using the idea of semantic space to develop the semantic differential for measuring the meaning people attach to given concepts. Spatial models assume content has two or more dimensions that need description. Describing content along these dimensions of meaning is similar to applying a semantic differential through content analysis. A film could be described along seven-point scales as good/ bad, effective/ineffective, and so on. The use of such models allows content analysts to explore complex meanings attached to symbols. For example, Hua and Tan (2012) identified cultural dimensions in which Chinese and American media described the success of athletes in the 2008 Olympics. Chinese media more often used a cultural dimension emphasizing social support, while U.S. media more often used a dimension focusing on the qualities of the individual athlete. Using spatial models such as the semantic differential has the potential for analyzing messages for latent meaning. However, the groups of people (coders) applying the spatial dimensions must be a representative sample of those using or creating the content for the conclusions to have validity. Abstract relations: Both scales and maps represent practical efforts to make language more concrete. Some of these abstract concepts, such as friendship among TV characters, may not fit maps and scales well. Scales and maps categorize content by common characteristics rather than by relations that exist among elements within the content. A classification system can specify and make more concrete these abstract relations expressed in content. Wilson et  al. (2012), for example, looked at types of antisocial behavior engaged in by characters in a content analysis of television episodes of Survivor, a “reality” show. They found that indirect aggression and verbal aggression were the most frequent types

60 Measurement portrayed. Indirect aggression, which took place without the victim’s knowledge, was defined as acts or words designed to hurt the victim or destroy the victim’s relationships. Verbal aggression was defined as a direct attempt to diminish or humiliate the victim. They also found that such behavior had increased compared to assessments in earlier studies. Binary attribute structure: In English and related languages, characteristics attributed to a person or thing often have an opposite. Good is the opposite of bad, and bright is the opposite of dark. These binary structures are often found in content, although concepts need not be thought of in these terms. In the field of journalism, reporters are assumed to be either subjective or objective by many readers. Shin and Thorson (2017) classified tweets about fact-checking as either mentioning bias in the fact-check or not. A study of MTV content by Brown and Campbell (1986) classified content of music videos as prosocial and antisocial. In a two-candidate election, social media comments about the election would support one candidate or the other. Such classification systems as in the above example from Deese (1969) are crucial for a particular study. But they have implications beyond any particular study because content classifications relate to the validity of the concepts and the way those concepts have been measured. The selection of a classification system for content should have a theoretical basis. The validity of the variables in such a system must be argued logically and/or established empirically. This is discussed more fully in Chapter 7. Classification System Requirements Although studies vary in the content classification systems that are useful, all must meet particular requirements dictated by the logic of empirical inquiry. Meeting these requirements is necessary, but not sufficient, for helping to establish the validity of the concepts and measures used in a content analysis. The process of creating categories requires specific instructions for defining variables in content so they can be coded reliably. These coding instructions for defining variables must meet five requirements. Definitions for variables must: (a) reflect the purpose of the research; (b) be mutually exclusive; (c) be exhaustive; (d) be independent; and (e) be derived from a single classification principle (Holsti, 1969, p. 101). To reflect the purpose of the research, the researcher must adequately define the variables theoretically. Then the coding instructions must clearly specify how and why content units will be placed in categories

Measurement  61 for these variables. This specificity requires detail that will guide coders in distinguishing among content units that seem similar. Novice content analysts tend to err on the side of too little detail in the variable and category specification. Detail allows other researchers to replicate content analyses. These instructions provide the operational definitions that go with the theoretical definitions of the variables. The operational definition should be a reliable and valid measure of the theoretical concept. Classification systems must be mutually exclusive when assigning numbers to recording units for a given variable. If magazine articles about environmental issues must be classified as pro-environment and anti-environment, the same article cannot logically be considered to be both. Using statistics to study patterns of content requires that units be unambiguous in their meaning, and assigning more than one number to a content unit for a given variable creates ambiguity. Of course, it may be that an article contains both pro-environmental and anti-environmental statements. In such cases, the problem is solved by selecting smaller units of analysis that can be classified in mutually exclusive ways. Instead of selecting the article as the unit of analysis, statements within the article could become the focus for content as pro or anti. Setting up mutually exclusive categories requires a close examination of the categories and careful testing to reduce or eliminate ambiguity, a matter dealt with more fully in Chapter 6. In addition to mutual exclusion for categories of a variable, classification systems must also be exhaustive. Every relevant unit of content must fit into a subcategory. This requirement is easy to fulfill in areas of content research that have received a great deal of attention. However, in newer types of content (e.g., Instagram and YouTube video), exhaustive category coding schemes will be more difficult to create. Often researchers fall back on an “other” category for all the units that do not fit within defined categories. This may even be appropriate if a researcher is interested primarily in one category of content (e.g., local news coverage). In such situations, all non-local coverage could be grouped together with no loss of important information. However, the use of “other” should be undertaken cautiously. The more that relevant content units fall within the “other” category, the less information the researcher has about that content. Extensive pretesting with content similar to that being studied will help create categories that are exhaustive. Researchers can adjust the classification system and finetune definitions as they pretest. The requirement of independence in classification requires that placing a content unit in one category does not influence the placement of the other units, a rule often ignored when ranking of content is involved, or when coding one variable requires coding a particular value of some other variable. Independence is also an important characteristic for assessment

62 Measurement of coder reliability and later for statistical analysis of collected content data. For coder reliability, both agreements (and disagreements) on, say, variable 2, may be “forced” by the coding instructions for variable 1, thereby distorting the reliability assessment for variable 2. For statistical analysis of collected data, an inference from a relationship in a sample of content to the population would be biased in some unknown way. An example can illustrate the point. Suppose a researcher examines two TV comedies and two TV dramas for the number of characters during a season who are people of color. Each program has 20 new episodes per year. One system involves assigning ranks based on the number of characters who are people of color. The program with the most such characters during a season is first, the second most is assigned second, and so on. Another system involves calculating the average number of characters in a category per episode. Suppose for the entire season, comedy A has five characters of color, while comedy B has three, drama A has four, and drama B has two. The ranking system might result in the conclusion that TV comedies provide the audience with much more exposure to characters of color during a season because the comedies ranked first and third and the dramas ranked second and fourth. This first system creates this impression of an important difference because the assignment of rankings is not independent. The assignment of the first three ranks determines the fourth. But the independent calculations provided by the second system give an average of .20 characters per comedy episode (eight characters divided by 40 episodes) and an average of .15 characters per dramatic episode (six characters divided by 40 episodes). The conclusion based on the second system is that neither program type provides extensive exposure to characters of color. So, the independent assignment system, such as average number of characters per episode, provides a more valid conclusion of television programming. Finally, each category should have a single classification principle that separates different levels of analysis. For example, a system for classifying news stories by emphasis could have two dimensions: geographic location (local, national, international) and topic (economic, political, cultural, social). Each of the two dimensions would have a separate rule for classifying units. It would be a violation of the single classification rule to have a classification system that treated local, national, or international location, and economic topic, as if the four represented a single dimension. A rule that would allow classification in such a scheme mixes geographic and topic cues in the content. Systems that do not have a single classification principle often also violate the mutually exclusive rule. A classification system that uses local, national, international, and economic would have difficulty categorizing content that concerns local economic issues.
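The contrast between the two systems in this example is easy to verify with a short calculation. The following Python sketch is ours, not the authors'; the program labels and season counts simply restate the hypothetical numbers above.

```python
# A minimal sketch of the independence example: comparing a rank-based
# system with independent per-episode averages.

season_counts = {          # characters of color per season (20 episodes each)
    "comedy A": 5,
    "comedy B": 3,
    "drama A": 4,
    "drama B": 2,
}
episodes_per_program = 20

# Rank-based system: assigning the first three ranks determines the fourth,
# so the assignments are not independent.
ranked = sorted(season_counts, key=season_counts.get, reverse=True)
for rank, program in enumerate(ranked, start=1):
    print(rank, program)

# Independent system: average characters per episode by program type.
comedy_avg = (5 + 3) / (2 * episodes_per_program)   # 8 / 40 = 0.20
drama_avg = (4 + 2) / (2 * episodes_per_program)    # 6 / 40 = 0.15
print(comedy_avg, drama_avg)
```

The second calculation supports the more modest (and more valid) conclusion that neither program type provides extensive exposure to characters of color.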


Levels of Measurement Content can be assigned numbers that represent one of four levels of measurement: nominal, ordinal, interval, and ratio. These levels concern the type of information the numbers carry, and they are the same levels of measurement used in all social sciences. Nominal measures have numbers assigned to categories of content. If one wants to study which presidential candidate was mentioned the most in tweets after a debate, a researcher would assign each candidate a number and assign the appropriate number to each article on the basis of the candidate written about in the tweets. The number used for each candidate is arbitrary. The Democratic candidate might be given a 1 and the Republican candidate a 2. However, assigning the Democrat a 10 and the Republican candidate a 101 would work just as well. The numbers carry no meaning other than connecting the candidate to a tweet. Put another way, nominal measures have only the property of equivalency or non-equivalency (if 41 is the code for tweets about the Republican candidate, all such tweets receive the value 41, and no other value is applied to tweets about the Republican candidate). In an analysis of news sources articles about city government that were found in newspapers and on citizen journalism websites, Fico et al. (2013a, 2013b) classified various types of sources (e.g., government official, citizen, etc.) as being either present or absent in the articles. The coders used numbers, but in the article the results were reported by the label of category and not by the assigned number. Nominal measures can take two different forms. The first treats membership in a collection of categories as a variable. Each category in the variable gets a number to designate which category is mentioned in the tweet. For example, in a presidential race, if a tweet mentions a Democratic candidate, it is recorded as a 1; if it mentions a Republican candidate, it receives a 2; and if it mentions an independent candidate, it receives a 3. If two or more different types of candidate are mentioned, the variable would receive a 4. Each number represents a category of the variable. The second form of nominal measure is to treat each category above as a dummy variable. Instead of assigning a tweet a 1 if a Democrat is mentioned, a 2 if a Republican, or a 3 if an independent is mentioned, and so on, each type of candidate would become a separate variable. For the Democratic variable, the tweet would receive a 1 if a Democrat is mentioned and a 0 (zero) if no Democrat is mentioned. For the Republican variable, it would receive a 1 if a Republican is mentioned and a 0 (zero) if no Republican is mentioned. For the independent variable, the tweet would receive a 1 if an independent candidate is mentioned and a 0 (zero) if no independent candidate is mentioned. The number of such variables would equal the number of different candidate

64 Measurement categories in the first approach mentioned above, except there is no need for a multi-candidate variable (two or more candidates) as occurs with the first approach. The use of dummy variables allows researchers to calculate the presence of multiple candidates in a tweet by combining these dummy variables into new variables with a statistical package. The same dummy variable approach could be applied using the names of the candidates rather than the type of candidate. Mentioning candidate Jane Smith, for example, would receive a 1, and no mention of candidate Smith would receive a 0 (zero). With the one-variable approach, the variable has multiple categories with one number each. With the multivariable approach, each category becomes a variable with one number for having the variable characteristic and one for not having that characteristic. The multivariable approach allows the same article to be placed into more than one classification. It is particularly useful if a unit needs to be classified into more than one category of a nominal variable. For example, if individual tweets deal with more than one candidate, the multivariable system might work better. After coding for multiple variables, the data can be recombined into one variable to use in the analysis. For example, having a variable for the presence of the Democratic candidate and one for the Republican candidate would allow the researcher to later create a variable with four categories: Republican only mentioned, Democrat only mentioned, Republican and Democrat mentioned, and neither candidate mentioned. Krippendorff and Craggs (2016) introduced the concept of multivalued coding of data, which they described as texts and nonverbal messages that could have multiple interpretations. Of course, it is true that any collection of symbols can have multiple manifest and latent meanings. It is context that determines the meaning of any given message. As discussed above, content analysts have options for creating variables that can control for multiple meanings in messages. The first step is to break messages into smaller units. The smaller the units, whether text or visual, the fewer alternative interpretations they are likely to have. Also, researchers do not have to be interested in all the meanings a message or unit could hold, and coders can code for prominence or the overall meaning of a unit. As mentioned above, a single variable can be turned into multiple variables by measuring the presence or absence of content in a unit through dummy-coding. Krippendorff and Craggs (2016) criticize this approach to the multivalued variables because: (1) “It introduces entirely artificial variety into the coding process” (p. 186); (2) “Second, all contingencies intrinsic to multiple valued accounts of phenomena are lost” (p. 186); (3) this process tends to produce more unreliable variables than does the process of assigning one value from among multiple categories in a variable; and (4) using “dummy” variables for all possible values can provide less information about the reliability of the protocol. These criticisms are not necessarily valid.
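Before turning to each of these criticisms, it may help to see what the recombination of dummy variables described above looks like in practice. The following sketch is our illustration, not code from the book; it assumes a small pandas data frame with hypothetical column names for the Democratic and Republican mention dummies.

```python
import pandas as pd

# Hypothetical coded data: one row per tweet, with 1/0 dummy variables
# recording whether each type of candidate is mentioned.
tweets = pd.DataFrame({
    "dem_mentioned": [1, 0, 1, 0],
    "rep_mentioned": [0, 1, 1, 0],
})

# Recombine the two dummies into one four-category nominal variable.
def combine(row):
    if row["dem_mentioned"] and row["rep_mentioned"]:
        return "both mentioned"
    if row["dem_mentioned"]:
        return "Democrat only"
    if row["rep_mentioned"]:
        return "Republican only"
    return "neither mentioned"

tweets["candidate_mention"] = tweets.apply(combine, axis=1)
print(tweets["candidate_mention"].value_counts())
```

The recombined variable carries the same information as the original dummies, which is why the choice between the two forms is largely a matter of coding convenience and analytic purpose.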

Measurement  65 With regard to number 1, they argue that developing a list of all possible characteristics (values) can be an enormous task. This assumes that researchers are interested in all the possible characteristics, which is rarely the case. First, not all characteristics fall on the same dimension, and therefore into the same variable. Second, scholars are usually interested in specific characteristics that are identified by theory or previous research. The fact that an approach may not be appropriate under some conditions is not a reason to reject it for all conditions. For number 2, they argue, “The apparent complexity of a person cannot possibly be captured by coding for these qualities separately” (p. 186). As mentioned above, using multiple dummy variables allows for the recombination of these variables into more complex variables. For example, if political tweets are dummy-coded for three important campaign issues (e.g., abortion, immigration, and deficit spending), these three variables could then be used to create a variable with eight cells. The scholar may or may not be interested in all of those cells. In addition, a series of dummy variables has been used to generate an index of source diversity in news stories (Lacy et  al., 2013). In other words, observation units can be manipulated to form different analysis units. The third argument that the multivariable approach tends to be more unreliable is not supported by Krippendorff and Craggs (2016) with anything other than their opinion. This position runs counter to the experience of the authors of this text, who have participated in more than 100 content analyses. Our experience is just the opposite. Dummy variables can usually be coded with more consistency. The fourth argument by Krippendorff and Craggs (2016) assumes, “It follows that the cells of the huge multi-dimensional spaces that dummy variables create are mostly empty” (p. 187). This is not necessarily true. The number of dummy variables and the resulting matrix of cells need not be large or full of empty cells, particularly if the protocol is based on existing research and theory. Criticisms of existing approaches are best based on empirical tests rather than possible problems that are not inherent in a content analysis protocol. Content analysis and particular approaches to measurement may not be appropriate for answering every research question, but that is not an indictment of approaches for protocol development that have proven useful in the past. Ordinal measures also place content units into categories, but the categories have an order. Each category is greater than or less than all other categories. Arranging categories in order carries more information about the content than just placing units into categories. The ordering of units can be based on any number of characteristics, such as prominence (which article appeared first in a publication), the amount of content that fits a category (publications with more assertions than others), and the

66 Measurement order of placement of a unit within a publication (front-page placement in newspapers carries more importance than inside placement). Interval measures have the property of order, but the number assignment also assumes that the differences between the numbers are equal. They are called interval measures because each interval is equal to all other intervals. The difference between 2 and 3 is equal to the difference between 7 and 8 or 13 and 14. The simple process of counting numbers of content units illustrates interval measures. If one wants to study the number of articles in a newsmagazine over time, the researcher could count the articles with headlines published in each issue for a period of time. Ratio measures are similar to interval measures because the difference between numbers is equal, but ratio data also have a meaningful zero-point. Counting the number of words in a magazine issue has no meaningful zero-point because a magazine must have words by definition. However, if one counts the number of active verbs in a magazine issue, the measure is a ratio. It would be possible (although not likely) for a magazine to be written totally with passive verbs. Because ratio data have a meaningful zero-point, researchers can find ratios among the data (e.g., magazine A has twice as many active verbs as magazine B). In some situations, ratio data can be created from a nominal classification system when the ratio of units in one category to all units is calculated. For example, Beam (2003) studied whether content differed between groups of newspapers with strong and weak marketing orientation. Beam classified content units (self-contained units that could be understood independently of other content on the page) into a variety of categories for topic and type of item. Beam then calculated the percentage of content units within the various categories (e.g., content about government or the “public sphere”) and compared the percentages for strong market-oriented newspapers with the percentages for weak marketoriented newspapers. This transformation of nominal data to ratio data was used because the number of content units varies from newspaper to newspaper, usually based on circulation size. A ratio measure allows one to compare relative emphasis regardless of number of units. It is important to think carefully about levels of measurement at the design stage of a content analysis, because those decisions should inform how hypotheses or research questions are phrased (hypothesizing differences in proportions suggests categorical data and chi-square tests, where hypothesizing correlations suggests ordinal or higher levels of measurement). It also affects what statistical tests one can/will use to answer those hypotheses and research questions. One advantage to using interval- and ratio-level variables with content analysis is that they allow the use of more sophisticated statistical procedures, primarily because the measures permit computation of means and measures of dispersion (variance and standard deviation). These procedures, such as multiple regression, allow researchers to control

statistically for the influences of a variety of variables and to isolate the relationships of interest. For example, Hindman (2012) used hierarchical regression to identify blocks of factors predicting beliefs that health care reform would produce benefits for one's family. Shin and Thorson (2017) used regression to predict the selective retweeting of fact-checking messages based on valence toward a given candidate.

Importance of Measurement Levels

Selecting a measurement level for a variable depends on two rules: the measurement level selected should be theoretically appropriate, and it should carry as much information about the variable as possible. Theoretically appropriate means the measurement reflects the nature of the content and the particular hypotheses. If a hypothesis states that female writers will use more descriptive adjectives than male writers, content will have to be assigned to a nominal variable called writer's gender. The variable of descriptive adjectives could take several forms. One measure would be nominal, classifying articles by whether they contain descriptive adjectives. However, this nominal level fails to incorporate the reality of writing because it treats all articles equally, whether they have one descriptive adjective or 100. A better measure, carrying more information, would be to count the number of descriptive adjectives in each article. This is a ratio-level measure that would allow far more sophisticated statistical procedures. In fact, the level at which a variable is measured determines what types of statistical procedures can be used because each procedure assumes a level of measurement. Procedures that assume an interval or ratio level are called parametric procedures; they assume certain population distributions in order to describe the population parameters more precisely with sample statistics. Nominal- and ordinal-level measures make no such assumptions about the population distribution and are less precise at describing the population of interest. Such non-parametric statistics provide less information about patterns in data sets, are often more difficult to interpret, and make statistical controls harder to apply. The relationship between measurement and statistics is discussed more in Chapter 9.

Rules of Enumeration

No matter what classification system is used, quantitative content analysis requires the creation of rules coders must follow for connecting content with numbers. The numbers will, of course, represent the level of measurement the researcher selected for the content variable(s) of interest. The rules may be as simple as applying a 1 to a certain type of content unit, say positive stories, and a 0 to other content units, say negative stories. Enumeration rules for nominal data require arbitrarily

68 Measurement picking numbers for the groups. But looking forward to the kind of analysis anticipated can give guidance to numbers given. For instance, a multivariate analysis predicting some dependent variable may use a nominal-level independent variable. In this instance, coding “present” as 1 and “absent” as 0 facilitates that analysis. Or, if the analysis deals with gender, for instance, relevant content including males can be coded as 0 and content including females can be coded as 1. For interval or ratio data, the enumeration rules might be instructions about what part of the physical content unit to include or exclude. For example, rules about counting words in any form of text require a physical description of which words to count in relation to the text. Do coders count words in a headline? Do coders count the words only in an original Facebook post or in the original post and comments? In this case, a ratio scale is used that facilitates analysis using correlation or regression analysis methods. No matter what the rules are for assigning numbers to content, they must be clear and consistent. The same numbers are applied in the same way to all equivalent content units. If a scholar studies the percentage of time during a television program devoted to violent acts, the rules of enumeration must clearly identify the point at which the timing of a violent act begins and when it ends. The success of enumeration rules affects the reliability as well as the validity of the study. The rules must provide consistent numbering of content.
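Enumeration rules are easiest to audit when they are written as explicit, testable procedures. The sketch below is a hypothetical illustration of a word-count rule, not part of any published protocol; it assumes the rule excludes a post's first (headline) line and ignores comments.

```python
# Hypothetical enumeration rule: count the words of an original Facebook post,
# excluding the headline/first line and excluding any comments.

def count_words(post_text: str, include_first_line: bool = False) -> int:
    lines = post_text.strip().splitlines()
    if not include_first_line and len(lines) > 1:
        lines = lines[1:]          # rule: the first (headline) line is excluded
    return sum(len(line.split()) for line in lines)

example_post = "Breaking news headline\nThe city council voted 5-2 to approve the budget."
print(count_words(example_post))   # counts only the body text
```

Writing the rule as a function makes the inclusion and exclusion decisions visible, so other researchers can apply exactly the same counting procedure.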

Measurement Steps The following five steps summarize the process of measuring content: 1 Develop research hypotheses or questions. Research questions and hypotheses force researchers to identify the variables they want to study and even the level of measurement for observing the variables. The hypotheses and research questions form the basis of the study. They should be explicitly stated and referred to as the data are analyzed and explained. 2 Examine existing literature that has used the variable or that discusses the measurement of the variable. Social science research should build on what is already known. This knowledge is best synthesized in formal theory. However, explicitly presented theory is sometimes absent, so new research is based on existing empirical studies. Occasionally, commentary articles address methodology and measurement issues. Reviewing the existing literature in whatever form is crucial for accurate measurement. The initial use of the literature is to provide a theoretical definition of the variables being addressed in the research. Theoretical definitions are important for guiding measurement because they play a

Measurement  69 role in establishing the face validity of a measure. If the measurement of the variable reflects a reasonable representation of the theoretical definition of a variable, the measure can be said to have face validity (see Chapter 7). 3 Use good previous measures, or, if the existing measures are not good enough, adjust your measures. Reviewing the literature will provide theoretical definitions of variables and potential operationalization of those variables. However, researchers need to be cautious about using existing measures. They should be used critically. The variable being studied might be slightly different from those in existing literature, and all measures also have error. If a modified measure is used, the new one should have face validity and be consistent with existing measures. The modification should be aimed at reducing measurement error by making the new measure more consistent with the theoretical definition of the variable being studied. During this step, the researcher has to decide the appropriate level of measurement for the variables. This level must reflect the theoretical definition. If a new measure is being developed, this new measure should aim at higher levels of measurement when appropriate. Higher-level measurements provide more precise tests of hypotheses. 4 Create coding instructions. Explicit coding instructions require that content categories for each variable be defined in as much detail as is possible and practical. A list of variables with category labels and the corresponding values for each is insufficient. Generally, the more detailed the definitions, the higher the reliability. However, a researcher must be careful not to be so detailed as to make the application of the coding instruction too difficult. Defining the categories involves selecting among type of content, setting up the classification system, and deciding the enumeration rules. All of this must be done and presented in a logical order that will allow a coder to refer to the instructions easily as he or she codes the content being studied. The coding instructions include any technical information about how the process will work. This would include rules for rounding off numbers and any physical limits used to narrow down the content being studied. 5 Create a coding system for recording data that will go into a computer. Any quantitative content analysis project will use a computer for analyzing the data. Unless data are entered directly from the content into the computer, projects require coding sheets. Numbers for the categories are put on these sheets and then entered into a computer. Although it is possible to record from content to computers, the process might interfere with the flow of coding and could take more time as coders move from content to computer and back.

A variety of coding sheet formats can be used. The primary criteria are efficiency of data entry and keeping costs down. It is important that the coding instructions (or protocol) and the coding sheets be easy to use together; the variables should be arranged and numbered consistently between the two. In Chapter 6, we go into more detail about coding sheets.
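One common arrangement, sketched below, is a flat table in which each row is a coded unit and each column is a variable numbered to match the protocol. This is our illustration rather than a format prescribed here; the variable names and values are hypothetical.

```python
import csv

# Hypothetical coding sheet: one row per coded story, columns numbered to
# match the protocol (V1 = story ID, V2 = coder ID, V3 = topic, V4 = tone).
rows = [
    {"V1_story_id": 101, "V2_coder": 1, "V3_topic": 2, "V4_tone": 1},
    {"V1_story_id": 102, "V2_coder": 1, "V3_topic": 4, "V4_tone": 0},
]

with open("coding_sheet.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
# The resulting CSV can be read directly by a statistical package.
```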

Summary Measurement is the process of moving from theoretical definitions of concepts to numerical representations of those concepts as variables. This process is called operationalization. The measurement process involves identifying the appropriate content of interest and designing the appropriate classification system for that content. The classification system uses content units to develop definitions of variables and categories for the variables. These variable categories must be translated into numbers, which requires the selection of appropriate levels of measurement, a system for classifying content, and rules for applying numbers to the content. This process is governed by coding instructions that maximize the validity and reliability of the measurements of the content concepts of interest. The instructions should allow a variety of coders to replicate the measurements. Such measurements are then statistically analyzed to address study hypotheses or research questions. Almost always, such analyses are performed by statistical packages on computers.

5 Sampling

A question content analysts must ask is, “How much data would be needed to adequately test the hypotheses or answer the research questions?” In an ideal world, sampling would not be an issue for social scientists. Researchers would include all relevant content in their studies. For example, a study concerning gender representation on television would examine every program on every channel during all pertinent time periods. However, researchers face trade-offs between the ideal and practical limitations of time and money. Coding all relevant content with human coders is impractical when thousands or even millions of content units are in the population. In other situations, researchers find that all content cannot be obtained. Because of a number of issues discussed below, most content analysts use a sample of all relevant content rather than a census. How a researcher selects a sample is extremely important because it determines the appropriate type of statistics that will be used (inferential or descriptive) and the extent to which the results can be generalized. Social science theory aims to describe people’s behavior and mental processes. The more representative the data, the more valid are the conclusions for the represented group. A sample is a subset of units from the entire population being studied. The usual goal of such samples is to represent the population. When probability samples (units are randomly chosen) are selected, scholars can make valid inferences about the population of content under study. The inferences drawn from a probability sample are subject to sampling error, but statistical procedures enable researchers to estimate this sampling error with a given level of probability. If researchers assemble samples in any way other than random sampling (and many do or must), sampling error cannot be calculated accurately. It becomes impossible to estimate how much the sample differs from the population from which it was taken. Therefore, any generalizations from the sample to the population are suspect. When sampling, the researcher must define the universe, population, and sampling frame appropriate to the research purpose and design.

72 Sampling The universe includes all possible units of content being considered. The population is composed of all the sampling units to which the study will infer. The sampling frame is the actual list of units from which a sample is selected. An example may help clarify the relationship among these groups. If a researcher were to study the historical accuracy of William Shakespeare’s plays, the universe would be all plays written by Shakespeare, published or unpublished. Because Shakespeare might have written some plays that were unpublished and lost over time, the population would be all published plays attributed to Shakespeare. Finally, the sampling frame would be a list of plays available to the researcher. A sample of plays randomly taken from this list would be a sample of the population if the sampling frame and population were the same. If one of the plays had gone out of print and a copy was not available, the population and sampling frame would differ. When an intact set of all units of a population is unavailable, the sampling frame becomes the available content that is sampled and about which inference is made. The Shakespeare example illustrates that the content one seeks to study is not necessarily the same as the content available for sampling. For example, a content analysis exploring the portrayal of women in YouTube videos could not reasonably include a list of all the characters before the content is sampled. The population (all women in YouTube videos) can be specified, but the sampling frame cannot. This problem may be solved with a technique called multistage sampling, which is explained below.

Sampling Time Periods Most survey researchers conduct cross-sectional studies. They sample people at one point in time to investigate behaviors, attitudes, and perceptions. Although some content analysts conduct cross-sectional research, most studies examine content that appears over time. Because communication occurs on an ongoing and often regular basis, it is difficult to understand the forces shaping content and the effects of content without examining content at various times. When content is available from several time periods, some interesting longitudinal designs are possible. For example, Danielson, Lasorsa, and Im (1992) compared the readability of newspapers and novels from 1885 until 1989, and found that the New York Times and Los Angeles Times became harder to read but novels became easier to read. Such long-term research designs, discussed in Chapter 8, require a population and sampling frame that incorporates time as well as content. Because content analysts sample units of content and time, confusion can occur as to which type of population inferences are applicable. For example, Kim, Carvalho, and Davis (2010) studied the news framing of

Sampling  73 poverty in newspapers and on television news. They took a purposive sample (discussed below) of television content from three broadcasting networks (ABC, CBS, and NBC) and CNN. For newspapers, they selected one daily newspaper from each of four states that ranked in the top ten for median household incomes and one daily newspaper from each of four states in the bottom ten median-income states. If the news outlets published more than 60 stories during the 15 years of the study, the authors randomly selected 60 articles. The use of random content sampling allowed the authors to infer to the entire time period for these news outlets while making the coding time manageable. However, inference could not be made to news outlets other than those studied because the outlets were a purposive sample. Scholars have expressed concerns about the timing of content posted online (Mahrt & Scharkow, 2013), which also applies to mobile content. The lack of a predictable publication cycle for web content and the ability of almost anyone to post content make sampling from time even more important (and difficult) with online and mobile content. Although digital distribution creates time sampling problems, those problems are a matter of degree, and not of kind. Media content has a history of changing across time—from the multiple daily editions of newspapers in the 19th century to the ever-changing Facebook pages today. In addition, interpersonal communications through writing and phone calls have always generated rapidly changing content that followed no discernible routine. The impact of time on the Internet and mobile samples creates the biggest problem when the content is not archived with a timestamp. Archived content can be searched and a sampling frame created. Material that is not archived and timestamped must be collected as it is posted, which generates sampling problems that can be addressed using software to scrape Internet content at randomly selected and predetermined times. In effect, researchers seeking to study un-archived and un-timestamped content must generate their own archive using software. Other online sampling issues will be addressed below. No matter the distribution system, when researchers infer in content analysis, they should make clear whether the inference concerns content producers, time, or both. The appropriate dimension of inference (content or time) is based on which was selected with a probability sample.
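Generating such an archive usually means scheduling one's own capture software. The sketch below, which is ours and purely illustrative, mixes predetermined daily captures with captures at randomly selected times; the dates and frequencies are assumptions, not recommendations.

```python
import random
from datetime import datetime, timedelta

random.seed(42)  # record the seed so the sampling plan can be replicated

start = datetime(2019, 3, 1)
days = 14

schedule = []
for day in range(days):
    date = start + timedelta(days=day)
    # Predetermined capture at noon each day ...
    schedule.append(date.replace(hour=12))
    # ... plus two captures at randomly selected minutes of that day.
    for _ in range(2):
        minute_of_day = random.randrange(24 * 60)
        schedule.append(date + timedelta(minutes=minute_of_day))

for capture_time in sorted(schedule):
    print(capture_time.isoformat())
```

The scraping itself would be handled by whatever collection tool the researcher uses; the point of the sketch is that the capture times, like any other sampling decision, can be randomized and documented in advance.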

Sampling Techniques At its most basic level, sampling means selecting a group of content units to analyze. To estimate the sampling error and infer to a large population of content, the sample must be a probability sample. Any estimate of sampling error with a non-probability sample is meaningless, and inferential statistics describing some population from which the sample was drawn have no validity.

74 Sampling The basic problem researchers face is collecting a sample that will allow valid conclusions about some larger group without taking excessive amounts of time to complete the project. The sampling techniques below help do that. Census A census means every unit in a population is included in the content analysis, and often makes the most sense for research that examines a particular event or series of events. Jung (2002) wanted to study how Time and Fortune magazines covered three mergers that involved the parent corporation, Time, Inc. Jung started by examining all issues from three newsmagazines and three business magazines, including Time and Fortune, published the month before the merger announcements. All articles and visuals from the 22 issues of these six magazines that included coverage were analyzed. Because of the relatively small population of content dealing with these mergers, a random sample of all content would not have identified all of the stories and would have likely distorted the results. Deciding between a census and a sample becomes an issue of how best to use coders’ time to accomplish the research goals. Whether a census is feasible depends on the resources and goals of individual research projects. The following principle applies: the larger the number of content units that are coded, the less bias in the data, but the more resources the project will require. Non-Probability Sampling Despite the limitations of non-probability samples in generating estimates of sampling error, they are used often. Such samples are appropriate under some conditions and often must be used because an adequate sampling frame is not available. Two non-probability techniques are commonly used: convenience samples and purposive sampling. In a study by Riffe and Freitag (1997) of content analysis articles in Journalism & Mass Communication Quarterly from 1971 to 1995, they found that 9.7% of all studies used convenience samples and that 68.1% used purposive samples. Convenience Samples A convenience sample involves using content because it is available. Historically, the study of local television news provides a good example. Until the growth of the Internet, local TV newscasts were rarely available outside the particular market. As a result, the programs had to be taped in the area of origin. A study trying to generalize outside a single market might require people around the country to tape newscasts.

Sampling  75 Now, all local television stations place their content online. In addition, the majority of stations also provide text as well as video stories. As a result, a national probability sample of local television can be collected (Baldwin, Bergan, Fico, Lacy, & Wildman, 2009). In another example, most university libraries purchase books on the basis of faculty and student requests, and do not provide a selection of books representing any larger population. One way to think of a convenience sample is that it is a census in which the population is defined by availability rather than research questions. However, this population is a biased representation of the universe of units, and that bias is impossible to estimate. The Internet has also made sampling easy for content produced by most legacy media outlets. However, problems arise because content on websites and in print products by the same organization may not be equivalent, and the increasing use of paywalls by legacy outlets will require that content analysts acquire funding or once again depend on library access to newspaper organizations’ content. Convenience samples do not allow inference to a population, but they can be justified under three conditions. First, the material being studied must be difficult to obtain. For example, a random sample of the magazines published in 1900 cannot be obtained. A sampling frame of such periodicals would be incomplete because a complete list of magazines from that period is unavailable. More importantly, most magazine editions from that period no longer exist. A researcher could, however, acquire lists of magazine collections from libraries around the country and generate a random sample from all the surviving magazine copies from 1900. This, however, would be extremely expensive and time-consuming, and it would still not represent the population of all magazines at the time. Such an example points to a second condition that would justify a convenience sample: resources limit the ability to generate a random sample of the population. Just how much time and money a researcher should be willing to spend before this condition is satisfied is a question for each researcher to answer. Whatever a scholar’s decision, it will eventually be evaluated by journal reviewers. The third condition justifying convenience sampling is when a researcher is exploring some under-researched but important area. When little is known about a research topic, even a convenience sample becomes worthwhile in generating hypotheses for additional studies. When such exploratory research is undertaken, the topic should be of importance to the scholarly, professional, or policymaking communities. Of course, some areas are under-researched and destined to remain that way because they are neither very interesting nor very important. With little research available, such samples provide a starting point for scholarship. However, the researcher should attempt to reduce bias

76 Sampling and to justify the use of such limited samples. The value of research using convenience samples should also not be diminished. Science is a cumulative process that aims to develop systematic generalizations in the form of theory. Over a period of time, consistent results from a large number of convenience samples can contribute to theory creation and testing. In addition, these samples can suggest important research questions and hypotheses to be checked with probability samples or censuses. Of course, such replication occurs across time, and time itself may result in content changes that might be misread as inconsistencies between the studies’ findings. Purposive Sampling Purposive sampling uses a non-probability sample for logical or deductive reasons dictated by the nature of the research project. Studies of particular types of publications or particular times may be of interest because these publications were important or the time played a key role in history. For example, Di Cicco (2010) studied newspaper coverage of political protest during eight years between 1967 and 2004 in the New York Times, Washington Post, Seattle Times, San Francisco Chronicle, and Los Angeles Times. The Seattle Times and San Francisco Chronicle were selected because those cities have been centers of protest, and the other three were selected because they ranked in the top five circulation leaders during the period of study. Given the large number of news stories published during such a time period, limiting the years and publications made the study manageable. Purposive samples differ from convenience samples because purposive samples require specific research justifications other than lack of money and availability. An often-used type of purposive sample is consecutive unit sampling, which involves taking a series of content produced during a certain time period. Analyzing Facebook postings and comments during a two-week period is a consecutive day sample. Consecutive day sampling can be important when studying a continuing news or feature story because connected events cannot be examined adequately otherwise. Such samples are often found in studies of elections and continuing controversies. Probability Sampling The core notion of probability sampling is that each member of a population has an equal chance of being included in the sample. If this is so, characteristics found more frequently in the population—whether of TV dramas, Facebook posts, or poems—will also turn up more frequently in the sample, and less frequent characteristics in the population will turn up less frequently in the sample.

A simple example can illustrate how this (much more complicated) process works. Take a coin. Its population consists of a head and a tail. The chance of getting a head (or a tail) on a single flip is 50%. Flip 100 times and very close to half—but rarely exactly half—of the flips will be heads. Flip 1,000 times and the proportion of heads will approach 50% even more closely. Given an infinite number of flips, the "expected value" of the proportion of heads will be 50%. Similarly, the expected value of any content variable being explored will approximate, within a calculable sampling error, the actual population value of that variable if a very large number of relevant content units is included in the sample.

An extension of this logic would be for a researcher to take many samples from the same population, one at a time. The best guess for the value of each sample mean would be the population mean, although in reality the sample means would vary from that population mean. However, if an infinite number of samples were taken from a population, the average of all the sample means would equal the population mean. If the means of all these samples were plotted along a graph, the result would be a distribution of sample means, which is called the sampling distribution. With an infinite number of samples, the sampling distribution of any population will have the characteristics of a normal curve. One characteristic is that the mean, median (the middle score in a series arranged from low to high), and mode (the most frequent score value) are all equal. Moreover, 50% of all the sample means will fall on either side of the true population mean, and 68% of all sample means will be within plus or minus one standard error (SE) of the true population mean (standard error is an estimate of how much the sample means in a sampling distribution vary from the population mean). That any sampling distribution, regardless of the population distribution, will take on a normal shape when an infinite number of samples is taken is called the central limit theorem.

Of course, a researcher never draws an infinite number of samples. The researcher draws a single sample, but the central limit theorem allows a researcher to estimate the amount of sampling error in a probability sample at a particular level of probability. In other words, the researcher can calculate the probability that a particular sample mean (calculated by the researcher from a random sample) is close to the true population mean in the distribution of infinite (but theoretically "drawable") random samples. This probability can be calculated because the mean of an infinite number of samples (the sampling distribution) will equal the population mean, and the distribution will be normal. The sampling error for a particular sample, when combined with a sample mean or proportion for that sample, allows a researcher to estimate the population mean or proportion within a given range (plus or minus) and with a given level of confidence that the range includes the population value.
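The coin-flip logic is easy to verify by simulation. The following sketch, our illustration only, draws many random samples of 100 flips and shows that the sample proportions center on the population value of .50 with a spread close to the theoretical standard error.

```python
import random
import statistics

random.seed(1)
population_p = 0.5     # a fair coin: the population proportion of heads
n = 100                # flips per sample
num_samples = 10_000   # many samples, standing in for "infinite"

sample_props = []
for _ in range(num_samples):
    heads = sum(random.random() < population_p for _ in range(n))
    sample_props.append(heads / n)

mean_of_means = statistics.mean(sample_props)
observed_se = statistics.stdev(sample_props)
theoretical_se = (population_p * (1 - population_p) / n) ** 0.5

print(round(mean_of_means, 3))    # very close to 0.50
print(round(observed_se, 4))      # close to the theoretical standard error
print(round(theoretical_se, 4))   # 0.05
```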

The best guess at the unknown population mean or proportion is the sample mean or proportion, and calculating the sampling error allows a researcher to estimate the range of error in this guess. Crucial to understanding inference from a probability sample to a population is sampling error, an indication of the accuracy of the sample. Sampling error for a given sample is represented by standard error. Standard error is calculated differently for means and proportions. The standard error of the mean is calculated from a sample's standard deviation, which is the average distance that cases in the sample vary from the sample mean. The standard deviation is divided by the square root of the sample size less one. The equation for the standard error of the mean is

SE(m) = SD / √(n − 1)

in which

SE(m) = standard error of the mean
SD = standard deviation
n = sample size

The standard error of the mean is applied to interval- or ratio-level data. Nominal-level data use a similar equation for the standard error of proportions:

SE(p) = √(p × q / n)

in which

SE(p) = standard error of the proportion
p = the proportion of the sample with the characteristic
q = (1 − p)
n = sample size
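These formulas can be computed directly in a few lines. The sketch below is ours, with made-up numbers; it applies the standard error of the mean and of a proportion to a small hypothetical sample and also applies the finite population correction discussed in the following paragraphs.

```python
import statistics

# Standard error of the mean for a hypothetical ratio-level variable,
# e.g., number of sources cited per story.
sources_per_story = [2, 4, 3, 5, 1, 3, 4, 2, 6, 3]
n = len(sources_per_story)
sd = statistics.pstdev(sources_per_story)   # SD computed with n in the denominator
se_mean = sd / (n - 1) ** 0.5               # SE(m) = SD / sqrt(n - 1)

# Standard error of a proportion, e.g., the share of stories quoting a citizen.
p = 0.40
q = 1 - p
se_prop = (p * q / n) ** 0.5                # SE(p) = sqrt(p * q / n)

# Finite population correction, applied when the sample is a large share of
# the population (here, 10 stories sampled from a population of 40).
N = 40
fpc = (1 - n / N) ** 0.5
se_prop_adjusted = se_prop * fpc

print(round(se_mean, 3), round(se_prop, 3), round(se_prop_adjusted, 3))
```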

Standard error formulas adjust the sample's standard deviation for the sample size because sample size is one of three factors that affect how good an estimate a sample mean or proportion will be. Sample size is usually the most important: the larger the sample, the better the estimate of the population. Very large and very small case values will crop up in any sample, but the more cases in a sample, the smaller the impact of those extreme values on the mean or proportion.

The second factor affecting the accuracy of a sample estimate is the variability of case values in the sample, which reflects the variability (homogeneity) of the population. If the case values vary widely in a sample, the sample will have more error in estimating the population mean or proportion, because variability results from the presence of large and small values for cases. Sample size and variability of case values are related because the larger the sample, the more likely it is that case variability will decline.

The third factor affecting the accuracy of a sample's estimate of the population is the proportion of the population in the sample. If a high proportion of the population is in the sample, the amount of error will decline because the sample distribution is a better approximation of the population distribution. However, a sample must equal or exceed about 20% of the population cases before this factor plays much of a role in estimating sampling error. In fact, most statistics books ignore the influence of population proportion because surveys—which, along with experiments, dominate fields such as psychology, sociology, and political science—usually sample from very large populations. As a result, sampling a high proportion of a large population is not necessary to generate a representative sample.

Content analysis researchers should not automatically ignore the impact of the population proportion in a sample because their studies often include fairly high proportions of the population. When the percentage of the population in a sample of content exceeds about 20%, a researcher should adjust the sampling error using the finite population correction (fpc). To adjust the standard error for a sample, the standard error formula is multiplied by the fpc, which is

fpc = √(1 − n/N)

in which

fpc = finite population correction
n = sample size
N = population size

For further discussion of the fpc, see Moser and Kalton (1972).

Recall that sampling decisions involve both time and content. A variety of probability sampling techniques—permitting sampling error calculation—are available, and decisions about probability sampling depend on a variety of issues, but virtually every decision involves time and content dimensions. Researchers must decide whether probability sampling is appropriate for both these dimensions and how randomness is to be applied. For example, a probability sample can be taken for both time and content (e.g., a random sample of 20 movies from each of ten randomly selected years between 1988 and 2018), for just content (e.g., a random sample of all movies released in 2018), for just time (e.g., examine all Paramount movies in ten years randomly selected from between 1988 and 2018), or for neither (e.g., examine all movies released in 2018). In a strict sense, all content involves a time dimension. However,

80 Sampling the concept of time sampling used here concerns trend studies over periods longer than a year, which represents a natural planning cycle for most media. Simple Random Sampling Simple random sampling occurs when all units in the population have an equal chance of being selected. If a researcher wanted to study the gender representation in all feature films produced by the major studios during a given year, random sampling would require a list of all films. The researcher would then determine the number of films in the sample (e.g., 100 out of a population of 375 films). Then, using a computer or random numbers table, the researcher would select 100 numbers between 1 and 375 and locate the appropriate films on the list. Simple random sampling can occur with two conditions: when units are replaced in the population after they are selected and when they are not replaced. With replacement, a unit could be selected for the sample more than once. Without replacement, each unit can appear only once in a sample. When units are not replaced, every unit does not have an exactly equal chance of being selected. For example, in a population of 100, when the first unit is selected, every unit would have a 1 in 100 chance. On the second draw, each remaining unit would have a 1 in 99 chance. This variation is not a serious problem because even without replacement each potential sample of a given size has an equal chance of being selected, even if each unit did not. When populations are quite large, the small variation of probability without replacement has negligible impact on sampling error estimates. Simple random sampling works well for selecting a probability sample. However, it may not be the best sampling technique in all situations. If the population list is particularly long or the population cannot be listed easily, a random sampling technique other than simple random sampling might be in order. Systematic Sampling Systematic sampling involves selecting every nth unit from a sampling frame. The particular number (n) is determined by dividing the sampling frame size by the sample size. If a sample will include 1,000 sentences from a book with 10,000 sentences, the researcher would select every tenth sentence. Taking every nth unit becomes a probability sample when the starting point is randomly determined. The researcher could randomly select a number between 1 and 10, which would be the number of the first sentence taken. Every tenth sentence after that would be selected until the complete sample is in hand. Because the starting point is randomly selected, each unit has an equal chance of being selected.
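Both techniques are straightforward to implement once a sampling frame exists. The sketch below is our illustration; the film list, sample size, and random seed are hypothetical. It draws a simple random sample without replacement and a systematic sample with a random start.

```python
import random

random.seed(7)  # record the seed so the sample can be reproduced

# Hypothetical sampling frame: 375 feature films released in a given year.
sampling_frame = [f"film_{i:03d}" for i in range(1, 376)]

# Simple random sampling without replacement: 100 of the 375 films.
simple_sample = random.sample(sampling_frame, k=100)

# Systematic sampling: every nth unit after a random start.
sample_size = 100
interval = len(sampling_frame) // sample_size          # every 3rd film here
start = random.randrange(interval)                     # random starting point
systematic_sample = sampling_frame[start::interval][:sample_size]

print(len(simple_sample), len(systematic_sample))
```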

Sampling  81 Systematic sampling may work well when simple random sampling creates problems. However, systematic sampling can generate problems under two conditions. First, it requires a listing of all possible units for sampling. If the sampling frame is incomplete (the entire population is not listed), inference cannot be made to the population. A second problem is that systematic sampling is subject to periodicity, which involves a bias in the arrangement of units in a list (Wimmer & Dominick, 2011). For example, a researcher wants to study the gun advertising in sporting magazines, and Field & Stream is one of the monthly magazines. If the researcher took four copies per year for 20 years using systematic sampling, a biased sample could result. Assuming a 2 is picked as the random starting point and every third copy is selected after the first, the researcher would end up with 20 editions from February, 20 from May, 20 from August, and 20 from November. This creates a problem because advertising and editorial space varies by month, and eight months are not represented in the final sample. Stratified Sampling Stratified sampling involves breaking a population into smaller groups and random sampling from within the groups. These groups are more homogeneous than the entire population with respect to some characteristic of importance. If one wanted to study the jingoistic language about the Vietnam War between 1964 and 1974 in speeches made on the Senate floor, the sample could be randomly selected or stratified by year. Stratified random selection would be better because the language likely changed with time. Support for the war was much stronger in 1966 than in 1974. A simple random sample could generate a sample with most of the speeches either at the beginning or end of this period. Using years as strata, however, makes smaller homogeneous groups that would guarantee a more representative sample. The percentage of total speeches that were made in each year would determine the percentage of the sample to come from that year. Stratified sampling serves two purposes. First, it increases the representativeness of a sample by using knowledge about the distribution of units to avoid the oversampling and undersampling that can occur from simple random sampling. This is proportionate sampling, which selects sample sizes from within strata based on the stratum’s proportion of the population. A study of Facebook or Twitter postings might stratify by topic areas. The percentage of sample messages from a given topic area would represent that topic area’s proportion of the entire population. If 20% of all messages address the topic “movies,” then 20% of the sample should come from posts about movies. This will make the sample more representative of the entire posting activity. In some situations, stratifying can increase the number of units in a study when those types of units make up a small proportion of the

population. This is disproportionate sampling, which involves selecting a sample from a stratum that is larger than that stratum's proportion of the population. This procedure allows a large enough sample for comparison. If, for instance, only 10% of 1,000 Twitter account owners in a study are older than 60, and the study concerns the relationship between age and tweets, the researcher might want to disproportionately sample from the stratum of participants older than 60. With 10% aged 60 or older, a simple random sample of 200 would likely yield only 20 people in this category. A sample of 1,000 would yield only 100, which is not large enough for valid comparisons with larger groups in the population. Disproportionate sampling oversamples particular units to obtain enough cases for a valid analysis. However, it yields a sample that is no longer representative of the entire population because a subset of members is over-represented in the sample.

Because mass communication media produce content on a regular basis, say every day or every week, stratified sampling can take advantage of known variations within these production cycles. Daily newspapers, for example, vary in size with days of the week because of cyclical variations in advertising. We examine these systematic variations in media in more detail later in the chapter.

Stratified sampling requires adjustments to sampling error estimates. Because sampling comes from homogeneous subgroups, the standard error is reduced. The standard error of proportion for stratified samples equals the sum of the standard errors for all strata (Moser & Kalton, 1972).

Cluster Sampling

Simple random, systematic, and stratified sampling require a list as a sampling frame. This list tells the researcher how many units make up the population and allows the calculation of probabilities. Often with communication research, however, complete lists of units are unavailable. To sample when no list is available, researchers use cluster sampling, which is the process of selecting content units from clusters or groups of content. Mass media products often include clusters of content. For instance, each time you check Google News, you find a cluster of many articles, divided into topic clusters such as sports, business, local, and entertainment. A television news program is a cluster of stories. Listing all websites is impossible, but search engines can find local websites by city, which forms a cluster for sampling when geography is important to a study. Cluster sampling allows the probability selection of groups and then subgroups; random sampling within those subgroups would lead to the specific content units.

Cluster sampling can introduce additional sampling error compared to simple random sampling because of intra-class correlation. Content units,

Sampling  83 such as entertainment articles, that cluster together may do so because they are similar in nature. These shared characteristics create a positive correlation among the attributes. By selecting clusters, a researcher is more likely to include units with similar characteristics and exclude units that have different characteristics from units in selected clusters. As a result, the sample, although randomly determined, may not be representative. Intra-class correlation can be anticipated, and statistics books (Moser & Kalton, 1972) provide formulas for estimating such biases. Multistage Sampling Multistage sampling is not a form of probability sampling such as simple random, systematic, and stratified sampling techniques. Rather, it is a description of a common practice that may involve one or several of these techniques applied at different stages of generating a sample. Recall that the simplest form of probability sample would be to list all recording units, randomly select from them, and proceed with the analysis. However, as just noted, most content is not easily listed. Often content comes in packages or clusters. Moreover, most content has a time dimension as well. Indeed, Berelson (1952) said mediated content has three different dimensions that must be considered in sampling: titles, issues or dates, and relevant content within the issues. A sampling procedure may be designed that addresses all these dimensions as stages of sampling. At each stage, a random sample must be taken to make inference to the population. Someone studying the content of local talk radio programs would have to randomly select the radio stations, then particular days from which to take content, and then the particular talk programs. Yet another stage might be the particular topics within the talk radio programs. For organizational Facebook pages, the type of organization, particular organizations, and date would be the stages. Pure multistage sampling requires random sampling for each stage. Multistage sampling can also combine a variety of sampling techniques. The techniques should reflect the purpose of the research, with the guiding principle being an effort to produce as representative a sample as possible for inferring to the population. Danielson and Adams (1961) used a sophisticated multistage sampling procedure to study the completeness of campaign coverage available to the average reader during the 1960 presidential campaign. They selected 90 daily newspapers in a procedure that stratified for type of ownership (group and non-group), geographic region, and time of publication (a.m. or p.m.). With the exception of a slight oversampling of southern dailies, the sample’s characteristics matched the population’s. The sampling of campaign events came from a population of events covered by 12 large daily newspapers from September 1 to November 7, 1960. These 1,033 events were narrowed to 42 by systematic random sampling.

84 Sampling The researcher determines the number of stages in a multistage sampling process. The process of sampling celebrity tweets could have one, two, or three sampling stages. For example, the first stage of a sample could involve randomly selecting types of celebrity (sports, movie, TV, etc.). A second stage could involve selecting one or more particular celebrities from the selected types. A third stage could be randomly sampling the tweets by the selected celebrity. This process could be reduced to one stage of probability sampling by listing every tweet by every celebrity for a given time period and randomly selecting a given number. The multistage selection process would take considerably less time. Just as cluster and stratified sampling alter the formula for figuring standard error, so does multistage sampling. Multistage sampling introduces sampling error at each stage of sampling, and estimates of error must be adjusted.
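A minimal sketch of the celebrity-tweet example may help make the stages concrete. Everything here is hypothetical (the celebrity types, names, and tweet lists are invented placeholders); the point is simply that a random selection is made at each stage so that inference to the population remains possible.

```python
import random

# Hypothetical sampling frames for each stage of a multistage design.
celebrity_types = {
    "sports": ["athlete_a", "athlete_b", "athlete_c"],
    "movies": ["actor_a", "actor_b"],
    "tv":     ["host_a", "host_b", "host_c", "host_d"],
    "music":  ["singer_a", "singer_b"],
}
# Each celebrity is assumed to have a list of tweet IDs collected for the study period.
tweets_by_celebrity = {name: [f"{name}_tweet_{i}" for i in range(200)]
                       for names in celebrity_types.values() for name in names}

# Stage 1: randomly select two celebrity types.
types_selected = random.sample(list(celebrity_types), 2)

# Stage 2: randomly select one celebrity from each selected type.
celebs_selected = [random.choice(celebrity_types[t]) for t in types_selected]

# Stage 3: randomly select 50 tweets from each selected celebrity.
sample = {c: random.sample(tweets_by_celebrity[c], 50) for c in celebs_selected}

print({c: len(tweets) for c, tweets in sample.items()})
```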

Stratified Sampling for Legacy Media

The question of whether to use simple random sampling or stratified sampling usually involves efficiency. Media content produced by legacy commercial enterprises has predictable variations that reflect news cycles and advertising support. For example, politicians often release information that might be negative late on a Friday afternoon because newsrooms have smaller staffs and people are less likely to watch TV news on the weekend. The number of pages in printed daily newspapers varies by day on the basis of advertising lineage. At the same time, a news outlet's website is not limited by time or space as are its legacy products. Such systematic variations can affect content. If systematic variations in content are known, these variations can be used to select a representative sample more efficiently. These variations allow identification of subsets of more homogeneous content that can be used to select a smaller stratified sample that will be just as representative as a larger simple random sample.

Several studies have looked at stratified sampling in various forms of media to identify the most efficient sample size and technique to infer to a particular time period. These studies have often examined types of variables as well (see Table 5.1).

Daily Newspapers

Because of their traditional importance as a journalism mass medium, daily newspapers have received more attention in sampling efficiency studies than other forms of media. These studies have concentrated on efficiency of sampling for inference to typical levels of content by using the constructed week, which is created by randomly selecting an issue for each day of the week.

Table 5.1  Efficient stratified sampling methods for inferring to content

Type of Content: Nature of Sample

Year of daily newspapers: Two constructed weeks from the year (randomly selecting two Mondays, two Tuesdays, two Wednesdays, etc.)
Year of health stories in daily newspapers: Six constructed weeks
Year of the New York Times online: Six randomly selected days
Five years of daily newspapers: Nine constructed weeks
Year of online Associated Press stories: Eight constructed weeks
Year of weekly newspapers: Randomly select one issue from each month in the year
Year of evening television network news: Randomly select two days from each month's newscasts during the year
Year of news magazines: Randomly select one issue from each month in a year
Five years of consumer magazines: One constructed year (randomly select one issue from each month)
Year of online press releases: Twelve constructed weeks (three weeks per quarter)

Note: These are general rules, but researchers should access the articles, cited in the text, from which these rules were taken to find exceptions.

An early study by Stempel (1952) concluded 12 days (two constructed weeks) were sufficient for representing a year's content in a six-day-a-week newspaper. Research by Davis and Turner (1951) and Jones and Carter (1959) found results similar to Stempel's. Riffe, Aust, and Lacy (1993) conducted a more thorough replication of Stempel's (1952) study by comparing simple random sampling, constructed week sampling, and consecutive day sampling for efficiency for a seven-day-a-week newspaper. One constructed week adequately predicted the population mean, and two constructed weeks worked even better (Riffe et al., 1993).

Taking two constructed weeks of daily newspapers works well to infer to one year of representative content, but some researchers are interested in studying content changes across longer time periods. Lacy, Riffe, Stoddard, Martin, and Chang (2000) examined efficiency in selecting a representative sample of daily newspaper content from five years of newspaper editions. They concluded that nine constructed weeks taken from a five-year period were as representative as two constructed weeks from each year, provided the variable of interest did not show great variance.
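For researchers assembling a constructed week sample by computer, the following sketch shows one way to draw two constructed weeks from a year of publication dates. The year 2018 and the assumption of a seven-day-a-week publication are arbitrary illustrations, not requirements.

```python
import random
from datetime import date, timedelta

# All publication dates in the year (a seven-day-a-week outlet assumed).
start, end = date(2018, 1, 1), date(2018, 12, 31)
days = [start + timedelta(n) for n in range((end - start).days + 1)]

# Group dates by weekday (0 = Monday ... 6 = Sunday).
by_weekday = {wd: [d for d in days if d.weekday() == wd] for wd in range(7)}

# Two constructed weeks: randomly select two dates for each day of the week.
constructed = sorted(d for dates in by_weekday.values()
                     for d in random.sample(dates, 2))

print(len(constructed))   # 14 issues, two for each day of the week
```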

These studies addressed selecting representative samples for the content in the newspaper. The same systematic variations in news hole might not affect the presence or absence of particular topics in the news. The coverage of a topic typically reflects variations in the environment rather than news hole. For example, newspapers are more likely to cover city government on the day following a city government meeting (Baldwin et al., 2009). Luke, Caburnay, and Cohen (2011) examined sampling in daily newspapers for health stories and found that constructed week sampling was more efficient, but that it would take six constructed weeks, rather than two, to find a representative sample of health stories. They also found that the six constructed weeks provided a representative sample for a five-year period as well as the one-year period. This runs counter to the nine constructed weeks suggested by Lacy et al. (2000), which applied to representative content overall rather than a particular topic. Having six constructed weeks represent five years is a concern if one wants a representative sample because it is possible that one or more of the years would have only a few days in the sample. If the five-year period included extraordinary events (e.g., the Great Recession of 2008–2009, or the beginning year of the Iraq War), these extraordinary events might have taken enough space to reduce the amount of coverage for topics such as health. Scholars would be advised to check the nature of their samples and whether they can accomplish the goals of the project.

Weekly Newspapers

The sparse research on daily newspaper sampling seems extensive compared to research about sampling weekly newspapers. Lacy, Robinson, and Riffe (1995) studied sampling of weeklies to see if stratified sampling would improve sampling efficiency with weeklies as it does with dailies. The results indicated stratified sampling has some efficiency advantage compared to simple random sampling, but the influence of cycles in content is not as strong in weeklies as in dailies. They concluded that a simple random sample of 14 issues or one issue randomly selected from each month (12 issues) were the most efficient approaches. They also concluded that the former was preferable when managers need to make risky decisions, and the latter worked when decisions were less risky and time and money were the important considerations.

Magazines

Magazine sampling studies have addressed efficient sampling for weekly newsmagazines and for monthly consumer magazines. Riffe, Lacy, and Drager (1996) used Newsweek and found that selecting one issue randomly from each month was the most efficient sampling method for inferring to a year's content. The next most efficient method was simple

Sampling  87 random selection of 14 issues from a year. This result was consistent with those for weekly newspapers, which have the same publication cycle. Unlike newsmagazines, consumer magazines usually appear monthly, and the best approach to studying a year’s content is to examine all issues. However, if a researcher wants to study long-term trends in consumer magazines, stratified sampling might prove more efficient than simple random sampling. Lacy, Riffe, and Randle (1998) used Field & Stream and Good Housekeeping as examples of consumer magazines and found that a constructed year (one issue from January, one from February, one from March, etc.) from a five-year period produced a representative sample of content for that period. A longitudinal study of consumer magazines could divide the time period into five-year sub-periods and use constructed years to make valid inferences to the magazines under study. Network Television News Although television content analyses are plentiful, sampling studies to find valid and efficient sampling methods are practically nonexistent. Types of samples found in published research include randomly selecting 12 composite weeks from 60 months (Weaver, Porter, & Evans, 1984), using the same two weeks (March 1–7 and October 1–7) from each year between 1972 and 1987 (Scott & Gobetz, 1992), sampling two constructed weeks per quarter for nine years (Riffe, Ellis, Rogers, Ommeren, & Woodman, 1986), and using four consecutive weeks per six-month period (Ramaprasad, 1993). The variation in types of sampling methods reflects particular research questions, but it also reflects the absence of guidance from sampling studies about television news. Riffe, Lacy, Nagovan, and Burkum (1996) began exploration into network news sampling by using a year’s worth of Monday through Friday broadcasts from ABC and CBS as the populations. The most efficient form of sampling for network TV news was randomly selecting two days from each month for a total of 24 days from the year. It took 35 days with simple random sampling to predict adequately a year’s content. They cautioned that researchers should be aware of extreme variations in particular content categories. In the absence of sampling studies that demonstrate efficient forms of stratified sampling, researchers should use simple random sampling. Applying constructed weeks, months, and years to forms of media content that are not influenced by weekly, monthly, and yearly cycles will introduce unknown bias into the data.
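As an illustration of the most efficient design reported by Riffe, Lacy, Nagovan, and Burkum (1996), the sketch below randomly selects two weekday newscast dates from each month of a year. The year is arbitrary, and weekends are excluded only because the populations in that study were Monday through Friday broadcasts.

```python
import random
from calendar import monthrange
from datetime import date

year = 2018  # arbitrary study year
sample_days = []
for month in range(1, 13):
    weekdays = [date(year, month, d)
                for d in range(1, monthrange(year, month)[1] + 1)
                if date(year, month, d).weekday() < 5]   # Monday through Friday only
    sample_days.extend(random.sample(weekdays, 2))       # two newscasts per month

print(len(sample_days))   # 24 randomly selected weekday newscasts
```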

Sampling Digital Content Historians likely will call the first ten years of the 21st century the digital decade. During this time, social media and smartphones were introduced,

88 Sampling Netflix began streaming series and movies, and legacy media outlets had finally accepted that the future is digital delivery. Digital delivery of content creates benefits and problems for social scientists. Once media companies put content online, it made acquisition by communication scholars easier and in most cases less expensive. However, as social networking sites such as Facebook and Twitter exploded with users, it became obvious that these new information distribution and networking systems would have a huge impact on individuals and social groups. Equally obvious was the fact that accessing populations and even probability samples would be difficult. These difficulties come from the lack of a sampling frame, the existence of private areas on social media sites, and the expense of acquiring and analyzing large data sets. Twitter users tweet an average of 500 million tweets per day (Internet Live Stats, 2018). In 2013, Facebook users shared 4.75 billion pieces of content daily (Zephoria Digital Marketing, 2018). In addition, the very nature of social networking sites affects the way they are studied, and therefore the sampling process. Most legacy outlets traditionally used distribution systems that were very slow and made interaction almost impossible. The cost of production and distribution of analog content meant that most legacy media were oriented toward a large audience. Digital communication creates platforms that allow people to communicate to large numbers of people (mass communication) and to communicate with a single person (interpersonal). These two approaches can even be combined when an interpersonal message is redistributed over networks, which is especially important during crises. Making sampling more difficult is the fact that all three of these functions can be combined in the same data set. As with all sources of content, the type of sample (probability, convenience, or purposive) is determined by the research questions, access to content, and the cost of that access. The role of these three factors can vary with the type of digital content the researcher proposes to study. Digital content that is designed for mass consumption is easier to access than content that is not, although mass content may require payment. Twitter makes some of its content more readily available than does Facebook. Snapchat was created as a platform that shared messages for a short time period, which of course created sampling problems, but the company began to move away from this approach in 2017 (Wagner, 2017). The discussion below of digital sampling will examine sampling the World Wide Web in general and sampling social networking platforms such as Twitter, Facebook, and YouTube. Digital distribution occurs primarily through fiber-optic, cellular, and Wi-Fi systems, and is displayed mostly through websites or applications. A prime difference between websites and social networking platforms is the tendency of websites to represent organizations, both for-profit and nonprofit, instead of individuals, while social networking sites tend to represent the content of

organizations, social groups, and individuals. This explains why social networking platforms generate such large data sets compared to most websites and why generating a representative sample of social networking content can be so complicated.

Sampling the Web

Karlsson (2012) discussed the problems with sampling online content, and said the problems stem from four dimensions: interactivity, immediacy, multimodality, and hyperlinks. As a result, he argued that online information is erratic and unpredictable. Unpredictability, from a researcher's perspective, can present sampling challenges. It reduces the ability to use stratified sampling and it requires a longer time frame for simple random sampling. Stempel and Stewart (2000) said a serious problem confronting Internet studies is the absence of sampling frames for populations. To a degree, the Internet is like a city without a telephone book or map to guide people. New houses are being built all the time, and old houses are being deserted with no listing of the changes. Sampling requires creative solutions.

Despite sampling difficulties, digital content distribution has resulted in a large number of studies on a wide range of topics that have sampled news websites. Kim, Thrasher, Kang, Cho, and Kim (2017) studied the coverage of e-cigarettes on three newspaper websites, on three television networks, and in six print newspapers by downloading articles from the web and finding transcripts from a news database. Johnson and Pettiway (2017) used the lists from the African American Museums and the American Alliance of Museums to identify 46 African-American museum websites. They analyzed how these organizations expressed black identities through their digital platforms. Carpenter, Boehmer, and Fico (2016) used a convenience sample of for-profit and nonprofit news sites in five cities to study how the two differed in journalistic role enactment. They downloaded online content from the ten sites. Druckman et al. (2010) studied U.S. Senate and House websites for negative messages in the 2002, 2004, and 2006 elections. They first identified the candidates using National Journal, Congressional Quarterly, and state and national party websites. They used stratified random sampling to select about 20% of the races and then used the National Journal to find candidates' websites.

As was the case with analog communication, researchers have begun to investigate efficient sampling from the web. Hester and Dougall (2007) used six months of content from news aggregator Yahoo! News that included stories from several legacy media organizations (e.g., Associated Press, USA Today, CNN, etc.). They concluded that constructed week sampling was the most efficient type of random sampling.

90 Sampling However, the minimum number of weeks equaled two, and some types of news required up to five constructed weeks. Another study examined topic, geographic bias, number of links, and uses of multimedia presentation in the stories on the New York Times website (Wang & Riffe, 2010). The authors found that only six randomly selected days could represent the entire year of content. They cautioned, however, that this study might not be generalizable to other websites because the New York Times has the largest news staff of any daily newspaper in the United States. Smaller staffs might result in more or fewer content variations from day-to-day. Of course, organizations other than legacy news outlets post their content online. Connolly-Ahern, Ahern, and Bortree (2009) studied a year of content on two press release services (PR Wire and Business Wire) and one news service (Associated Press). They concluded that constructed weeks are efficient but more than two are required for representative samples. When sampling press releases, Connolly-Ahern et al. (2009) recommended at least 12 constructed weeks (three per quarter) for online press releases and eight constructed weeks for the Associated Press website. However, there was great variation in needed sample size depending on the topic. Scholars should consult the tables in the article when sampling these services. The need for larger samples found here, when compared to efficient legacy news sampling (newspapers, magazines, and television), represents a lack of consistent variation that would allow more efficient stratification. McMillan (2000) analyzed 19 research articles and papers that studied web content and generated a series of recommendations. First, McMillan warned scholars to be aware of how the web is similar to and not similar to legacy media. Researchers must understand that people use the web differently than they use traditional media. Second, sampling the web can be very difficult because sampling frames are not readily available, and the content can change quickly. Third, the changing nature of the web can make coding difficult. Content must either be “captured” in some form and/or sampling must take change into consideration. Fourth, researchers must be aware that the multimedia nature of the web can affect the various study units. Fifth, the changing nature of sites can make reliability testing difficult because coders may not be coding identical content. As with all sampling, the process of sampling online content depends on how the research is conceptualized. A study of representative content on Facebook, for example, presents certain problems, whereas sampling legacy news sites would create other problems. Convenience sampling creates fewer problems, but the results cannot be generalized beyond the sample. In all studies, researchers must be aware of the time element of changing web content. When dealing with content that does not have a readily available sampling frame, selecting representative content from online sites could

Sampling  91 use multistage sampling. The first stage would involve using a range of search engines and algorithms to generate multiple lists of sites. These lists become a sampling frame once the duplicate sites are removed. The second stage would be selecting from among the sites in the sampling frame. If other variables such as geography are important in generating a sample, then more stages could be involved. However, this approach presents other problems. Search engines and algorithms generate long lists of sites, the lists are not randomly selected, and the various search engines have different algorithms for generating the order in their lists. As a result, creating a sampling frame from search results can be quite time-consuming, and the sample might be more representative of commercial and organizational sites than individual sites. Second, the content on some pages changes at varying rates. The process is similar to putting personal letters, books, magazines, and newspapers all into the sample population and trying to get a representative sample. The answer might be using categories other than topics to classify web pages, but a standardized typology of categories has yet to be developed and accepted by scholars. To deal with the changing nature of news websites, Kutz and Herring (2005) developed micro-longitudinal sampling using a software program that would download specified components of a page (e.g., headlines) every 60 seconds from a news site. The program would only download elements that had been changed from the last visit. Using CNN, BBC and Al Jazeera, they analyzed the nature of changes on the sites. Most of the changes on Al Jazeera were new stories, but most of the changes on BBC and CNN were revisions of previous stories. As noted in Chapter 1, the size and complexity of the web has led to the development of machine-learning approaches to sampling. For example, Chau and Chen (2008) provided a substitute for traditional search engines in the form of topic-specific search engines that learn from training documents. Such vertical engines need to find the appropriate URLs and then filter the documents from those URLs that are not relevant to the research question being studied. Their approach uses both content and structure (links) to collect web content and compares favorably to keyword and lexicon-based approaches. After the documents are identified and filtered, the researcher may want to take a sample from the population that results from searching the web for the topics. Sampling with Databases Digitization of media content has also enhanced content analysis through the resulting increase in storage capacity. Messages of all types have been digitized, preserved, and made available online. With this increased capacity has come the ability to search and retrieve specific types of content from a wide range of available databases.

92 Sampling At its simplest level, a database is a structured collection of data that can be easily searched and retrieved with computers. The data can take many forms, but content databases are typically text, visual, and auditory messages that have appeared in a range of media from newspapers to social media. Databases can be commercial (e.g., Factiva, LexisNexis, PR Wire) or researchers can create them (Lacy et al., 2015). Of course, they could be a combination of the two. Just what content goes into a database and how the database is organized is decided by the database creator. Most databases are searched with keywords—terms specifically associated with the concepts being studied. For example, Watson (2017) studied the relationship between local newspaper coverage of violent crime in Minneapolis and St. Louis and local online searches about crime in those two cities. He accessed the ProQuest database for the newspapers’ coverage and retrieved weekly search data from the Google Trends site. He tested a series of search “strings” for the two newspapers that used terms such as “police,” “murder,” “homicide,” “arrest,” and so on. Although databases can be powerful tools for accessing content, the process has its limit. It is highly unlikely that one database will contain the universe of content a researcher hopes to study. Wu (2015) compared articles about post-traumatic stress disorder in LexisNexis and America’s News databases and found a 94% overlap. Weaver and Bimber (2008) used two databases (LexisNexis and Google News) to study news coverage of nanotechnology. They found a 71% overlap. All media databases are purposefully organized populations and not representative of the universe. One way of dealing with the limitation of any given database is to use more than one database to collect needed content. This would require a comparison of the range of content in the databases by discovering what content is and is not included. In addition, researchers need to explore whether the indexing and archiving software for the databases are equivalent. One problem with existing research literature has been the absence of information about the process used to generate the content sample from databases. Stryker, Wray, Hornik, and Yanovitzky (2006) examined 83 content analysis studies, and reported that only 39% provided keywords from the search and only 6% discussed the validity of the keywords. The keywords are crucial in determining the ability of a sample to yield valid results. Sobel and Riffe (2015) studied the New York Times coverage of Botswana, Ethiopia, and Nigeria by using LexisNexis. They found 7,454 articles about the countries by using the countries’ names as keywords. However, only 19% of these stories had one of the countries as the main focus of the article. Searches that use a single keyword may yield content that is not relevant to the study. Researchers should use strings of keywords based on previous research and compare the output of various forms of the keyword strings (e.g., see Watson, 2017).
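The comparison of alternative keyword strings can be prototyped before a database search is run. The sketch below is purely illustrative: it applies a hypothetical narrow string and a hypothetical broad string, written as regular expressions, to a handful of invented article texts and counts what each retrieves. An actual study would translate the preferred string into the database's own search syntax and validate it against human-coded relevance judgments.

```python
import re

# Hypothetical article texts already retrieved from a database.
articles = [
    "Police arrested a suspect in the downtown homicide late Tuesday.",
    "The council debated the parks budget for three hours.",
    "A murder trial opened Monday in county court.",
    "Officers responded to an assault near the stadium.",
]

# Two candidate search strings, expressed here as regular expressions.
strings = {
    "narrow": r"\b(murder|homicide)\b",
    "broad":  r"\b(police|murder|homicide|arrest(ed)?|assault)\b",
}

for name, pattern in strings.items():
    hits = [a for a in articles if re.search(pattern, a, re.IGNORECASE)]
    print(name, len(hits))   # the narrow string retrieves 2 articles, the broad string 3
```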

Sampling  93 Stryker et  al. (2006) advocate that researchers conduct and report formal evaluations of the recall and precision of a search string. Recall measures a string’s capability of retrieving the pertinent content, and precision measures whether the retrieved content is actually relevant to the study’s goals. Recall is calculated by dividing the relevant articles retrieved by relevant articles in the database. Precision is calculated by dividing relevant articles retrieved by all articles retrieved. The two measures are often negatively correlated. The more precise a keyword string, the more likely it will miss relevant content. Relevant content in the database is established with a protocol that has reached acceptable reliability and has been applied by two or more independent coders. Precision and recall measures can be used to create a correction coefficient that better estimates error associated with samples from the database (Stryker et al., 2006). The correction is calculated by dividing the precision by the recall. For example, if a study sampled magazine articles about student loan debt and had a precision measure of .75 (i.e., 75% of the articles retrieved were pertinent to the study) and a recall of .5 (i.e., 50% of the pertinent articles in the database were captured in the sample), the coefficient would equal 1.5. If the search string had identified 100 articles, a more accurate estimate of pertinent articles in the database would be 150 (100 × 1.5). A correction coefficient less than 1 indicates the search string overestimated the number of articles, and a coefficient greater than 1 indicates that the string underestimated the number of articles in the database. Stryker et al. (2006) said this correction coefficient is accurate for longer time periods (more than a month) but not for short time periods (a day or week). Researchers using databases to access content should provide a detailed description of the process. This would include a discussion of what relevant media outlets were included and excluded in the database. In addition, the search strings used should be reported, and the process by which they were determined should be explained. The researcher should also calculate the precision and recall and report them in the article, as well as reporting the correction coefficient. Sampling Social Media Perhaps the most revolutionary use of digital media has been the development of social media. Platforms such as Facebook, YouTube, Twitter, Snapchat, and many more have provided a flexible and instantaneous system for people or organizations to interact with people one-on-one or as a mass audience. We can create our own content or we can access news and information from organizations and journalists. Each of the individual social network sites creates sampling difficulties. Sampling Twitter has received the most attention because the company provides greater access to messages than does Facebook. However, this

94 Sampling seems to create a bias toward studying Twitter that may over-represent its social impact. In April 2018, Facebook had 2,234 million global users compared to Twitter’s 330 million (Statista, 2018b). On the other hand, Twitter has allowed President Donald Trump to bypass and even attack legacy media and influence the national political agenda more directly (Keith, 2016). This impact would argue for Twitter’s importance at least for the period that Trump ran and served as president. Waters, Burnett, Lamm, and Lucas (2009) analyzed 275 randomly selected incorporated nonprofit organizations’ Facebook pages to examine how they were using social networking. Naaman, Boase, and Lai (2010) examined the tweets from 350 Twitter users and classified them into two groups. The majority communicated primarily about themselves and a smaller group primarily shared information. Hanusch and Bruns (2017) created a sample of 4,189 Australian journalists by searching Twitter, examining news organization web pages, and searching journalists’ Twitter lists of followers and people they follow. They used the resulting Twitter accounts to examine how the journalists branded themselves on Twitter. In a study of how European politicians use Twitter, Theocharis, Barberá, Fazekas, Popa, and Parnet (2016) identified 2,482 candidates who had a Twitter presence in 2014. They studied their tweets over four weeks and found that most candidates use Twitter as a broadcast rather than interactive tool, and concluded this use results from concern about the incivility that can often flow from political discussion on Twitter. None of these studies used a representative sample. Most studies of social networking sites tend to examine one platform at a time, but Thorson et al. (2013) looked at the relationship between Twitter and YouTube in their coverage of the Occupy movement during November 2011. Using keywords related to the Occupy movement, they searched for YouTube videos and tweets. Using commercial software, they captured 43,378 YouTube videos and 417,413 tweets from which they extracted 22,768 videos. They also downloaded metadata from YouTube. They concluded that using both Twitter and YouTube yielded a more diverse collection of video than was evident from using YouTube alone as a source. Researchers studying public organizations and their representatives can create a fairly good sampling frame for tweets by searching the Internet. Organizations that interact with the public want their messages to be heard and tend to make those messages available. However, when Twitter is used for interpersonal communication, the problem becomes more difficult because of the large number of people who use Twitter. This is true for studies of Twitter use during natural and man-made crises. In these types of studies, either a census or representative sample of large data sets are useful. Efforts to generate representative samples have resulted in a number of methods.

Sampling  95 One of the most often used ways to generate representative samples is accessing tweets from Twitter. Traditionally, Twitter allowed access to tweets either through its firehose, which provides all tweets pertaining to a given set of terms, or through its streaming API, which was a random selection of 1% of the firehose tweets (Joseph, Landwehr & Carley, 2014). The firehose provides a census within the selected stream of tweets and could either be used in its entirety or a probability sample could be taken. However, access to the firehose is expensive (Joseph et al., 2014). The API stream is available without cost, and Morstatter, Pfeffer, Liu, and Carley (2013) tested an API stream against the firehose for the defined topics. They found the “streaming API data estimates the top hashtags for a large n well, but is often misleading for a small n” (p. 406). They concluded that how well the API streaming represents all tweets depends on the coverage of the topic by the tweets and the nature of the analysis by the researchers. Joseph et al. (2014) compared multiple samples from the API streaming data using the same search terms. They found that 96% of the tweets found in one sample were found in all samples, and concluded that using more than one API streaming sample did not significantly increase the numbers of tweets in the database. In another examination of the API stream, Ghosh et al. (2013) compared the API tweets with tweets generated by 587,759 experts—Twitter posters who were followed by at least 10 Twitter users. They concluded that the expert sample of tweets had more useful and trustworthy information, as determined by a sample of survey respondents on Mechanical Turk, was more diverse in topics, and yielded more popular content. They concluded that the choice of the two approaches depends on the goals of the project. Algorithmic Sampling Another approach for identifying and sampling web content is the creation of specific algorithms that use the nodes (web pages) and edges (hyperlinks) of the web graph (the structure of connections on the web) to generate a uniform sample, which means all possible samples are equally probable. A variety of approaches and algorithms are available and a literature exists that addresses the most effective and efficient approaches (Rusmevichientong, Pennock, Lawrence, & Giles 2001). Algorithms can also be used to sample social networking content. Similarly, researchers have investigated learning algorithms for sampling social media (Rezvanian & Meybodi, 2017). Bruns and Liang (2012) discussed using open-source software for selecting and analyzing tweets from the API stream. Palguna, Joshi, Chakaravarthy, Kothari, and Subramaniam (2015) used the Twitter API to explore what size of samples were needed to represent the API stream. They found that a sample of

96 Sampling 8,000 tweets was sufficient for estimating frequency of nouns and 2,000 tweets worked for estimating positive/negative sentiment of tweets. Gjoka, Kurant, Butts, and Markopoulou (2009) compared four crawling algorithms, which visit sites and download content, to see which was best at generating a random sample of Facebook users. They concluded that the Metropolis-Hastings approach was better than two other approaches in generating a representative sample.
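For readers unfamiliar with the Metropolis-Hastings random walk mentioned above, the sketch below applies it to a small invented friendship graph. At each step the walk proposes a randomly chosen neighbor and accepts the move with probability min(1, degree of current node divided by degree of proposed node), which over many steps visits users at roughly equal rates instead of favoring heavily connected accounts. This is a toy illustration, not the procedure used in the studies cited.

```python
import random

# Toy undirected "friendship" graph: user -> set of friends.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "c"},
    "e": {"c"},
}

def mh_random_walk(graph, start, steps):
    """Metropolis-Hastings random walk that samples nodes roughly uniformly."""
    current = start
    visited = []
    for _ in range(steps):
        proposed = random.choice(sorted(graph[current]))
        # Accept the move with probability min(1, deg(current) / deg(proposed)).
        if random.random() <= len(graph[current]) / len(graph[proposed]):
            current = proposed
        visited.append(current)
    return visited

walk = mh_random_walk(graph, start="a", steps=10_000)
# After a burn-in period, each user should appear at roughly equal rates.
```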

Sampling Suggestions for Digital Media

Digital media have opened up a wealth of new research questions and access to content from both organizations and individuals. However, sampling has become more difficult for a variety of reasons. Here are some questions and suggestions for sampling content for a study.

•	Are you studying content and messages created by organizations or individuals?
•	Which organizations and/or individuals should be studied?
•	What time period should be studied?
•	After identifying the population, is a sampling frame available for the content? Can a list be made of all the sampling units?
•	If yes, is the frame too long to study all of the units? If not too long, conduct a census.
•	If the frame is too long, can a simple random sample be generated?
•	Have studies identified stratified sampling that would create a more efficient representative sample?
•	If a sampling frame cannot be generated, have sampling studies suggested ways to generate a representative sample using either general commercial search engines or specialized ones? Which approach works best?
•	If a representative sample is impossible, is a convenience or purposeful sample available that would allow you to identify theoretically interesting relationships among the variables of interest?

Sampling Individual Communication

Mass communication messages usually have the sampling advantage of being regular in their creation cycle. Because such communication usually comes from institutions and organizations, records of its creation are often available. More problematic is the study of individual communication such as letters and email. If someone wants to analyze the letters written by soldiers during the American Civil War, identifying the sampling frame of available letters is a burdensome task, but it is the first step that must be taken. Research about individual communication will be only as valid as the list of such communication is complete.

Of course, researching the communication of particular individuals, such as politicians, writers, and artists, often involves a census of all available material. Research on the individual communication of non-notable people should involve probability sampling, even if the entire universe cannot be identified. However, just accessing such communications is a problem, and convenience samples often result. For example, in an early examination of Internet interpersonal communication, Dick (1993) studied the impact of user activity on sustaining online discussion forums. Being unable to sample such forums randomly, Dick used three active forums from the GEnie system and came up with 21,584 messages about 920 topics in 53 categories between April 1987 and September 1990. Because Dick was interested in examining relationships and not describing behavior, the set of messages, which was strictly a census of the three forums, was adequate for his exploratory research.

The scientific method, through replication and the accumulation of evidence, offers a solution to the inability to sample randomly. If strong relations of interest exist in larger populations, then they usually will be found consistently even in non-probability samples. However, accumulated support from convenience samples works best if these samples come from a variety of situations (e.g., the samples are from several locations and time periods).

Summary Content analysts have a variety of techniques at their disposal for selecting content. The appropriate one depends on the theoretical issues and practical problems inherent in the research project. If the number of recording units involved is small, a census of all content should be conducted. If the number of units is large, a probability sample is likely to be more appropriate because it allows inference to the population from the sample. A variety of probability sampling methods are available, including simple random, systematic, stratified, cluster, and multistage sampling. The appropriate probability sample also depends on the nature of the research project. However, probability samples are necessary if one hopes to use statistical inference. Efficient sampling of mass media to infer to content for a given time period often involves stratified sampling because mass media content varies systematically with time periods. Content analysts need to be aware that sampling may involve probability samples based on time, content, or both.

6 Reliability

One of the questions all content analysts face is: “How can the quality of the data be maximized?” To a considerable extent, the quality of data reflects the reliability of the measurement process. Reliable measurement in content analysis is crucial to the validity of the research conclusions. If one cannot trust the measures, one cannot trust any analysis that uses those measures. The core notion of reliability is simple: the measurement instruments (protocols) applied to observations must be consistent over time, place, coder, and circumstance. As with all measurement, one must be certain that one’s measuring stick does not develop distortions. If, for example, one had to measure day-to-day changes in someone’s height, would a metal yardstick or one made of rubber be better? Clearly, the rubber yardstick’s own length would be more likely to vary with the temperature and humidity of the day the measure was taken and with the measurer’s pull on the yardstick. Indeed, a biased measurer might stretch the rubber yardstick. Similarly, if one wanted to measure the presence of people with disabilities in television programs, one would find different results by using an untrained coder’s assessment or by using trained coders with explicit coding instructions. In this chapter, we deal with reliability in content analysis. Specific issues in content analysis reliability involve the definition of concepts and their operationalization in a content analysis protocol, the training of coders in applying those concepts, and mathematical measures of reliability permitting an assessment of how effectively the content analysis protocol and the coders have achieved reliability.

Reliability: Basic Notions Reliability in content analysis is defined as consistency among coders in applying a protocol to categorize content. Indeed, content analysis as a research tool is based on the assumption that explicitly defined variables in a protocol with adequate instructions will control assignment

Reliability  99 of numbers to content units by coders. If the variable and category definitions do not control assignment of content, then human biases may be doing so in unknown ways. If this is so, findings are likely to be of questionable validity and unreplicable by others. Yet replicability is a defining trait of science, as noted in Chapter 2, and is thus crucial to content analysis as a scientific method. The problem of assessing reliability comes down ultimately to testing coder consistency to verify the assumption that content coding is determined by the variable definitions and category operationalizations in the protocol. Achieving reliability in content analysis begins with defining variables and categories (subdivisions of the variable) that are relevant to the study goals. Coders are then trained to apply those definitions to the content of interest. The process ends with the assessment of reliability through reliability tests. Such tests indicate numerically how well the concept definitions have controlled the assignment of content to appropriate analytic categories. These steps obviously interrelate, and if any one fails the overall reliability must suffer. Without clarity and simplicity of concept definition, coders will fail to apply them properly when looking at content. Without coder diligence and discernment in applying the concepts, the reliability assessment will prove inadequate. Without the assessment, an alternate interpretation of any study’s findings could be “coder bias.” Failure to achieve reliability in a content study means replication attempts by the same or by other researchers will be of dubious value.
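As a preview of what such a test produces, the sketch below computes simple percent agreement between two coders who categorized the same ten content units; the data are invented. In practice, researchers rely on reliability coefficients that also account for chance agreement, but the underlying logic of comparing independent coders' decisions is the same.

```python
# Hypothetical codes assigned by two coders to the same ten content units
# (categories 1 and 2 of a two-category variable).
coder_1 = [1, 2, 2, 1, 1, 2, 1, 1, 2, 2]
coder_2 = [1, 2, 1, 1, 1, 2, 1, 2, 2, 2]

agreements = sum(a == b for a, b in zip(coder_1, coder_2))
percent_agreement = agreements / len(coder_1)
print(percent_agreement)   # 0.8, i.e., the coders agreed on 8 of 10 units
```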

Variable Definitions and Category Construction Reliability in content analysis starts with the variable and category definitions and the rules for applying them in a study. These definitions and the rules that operationalize them are specified in a content analysis protocol, a guidebook that answers the question “How will coders know the data when they see it?” For example, one of the authors developed a protocol to study nonprofit professional online news sites in order to compare them to news sites created by daily newspapers. The first step was to code which sites fit the concept of “nonprofit professional online news sites.” The variable was labeled “type of news site” and the categories were “one” or “two,” with one representing these nonprofit sites and two representing legacy newspaper sites. The rules for giving the site a one were: (1) it has 501(c) status; (2) it pays a salary to at least some of the staff; (3) its geographic market includes city, metro or regional areas; (4) it publishes general news and opinion information rather than niche information; and (5) it posts such information multiple times during the week. The protocol then explained how coders could find information to address each of these characteristics.
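Because the five rules amount to an explicit decision procedure, they can even be mirrored in a short script, which some projects find useful for documenting and double-checking coding decisions. The sketch below is hypothetical; the field names are invented, and a site meeting all five rules would be coded one, with other sites handled according to the rest of the protocol.

```python
def codes_as_nonprofit_news_site(site) -> bool:
    """Apply the protocol's five rules for category one ('nonprofit professional
    online news site'). `site` is a hypothetical record of judgments a coder
    makes after inspecting the site."""
    return (
        site["has_501c_status"]
        and site["pays_staff_salary"]
        and site["market_scope"] in {"city", "metro", "regional"}
        and site["publishes_general_news_and_opinion"]
        and site["posts_multiple_times_per_week"]
    )

example_site = {
    "has_501c_status": True,
    "pays_staff_salary": True,
    "market_scope": "metro",
    "publishes_general_news_and_opinion": True,
    "posts_multiple_times_per_week": True,
}
# True here means the site meets all five rules and would be coded as category one;
# sites that do not would be coded following the protocol's remaining instructions.
print(codes_as_nonprofit_news_site(example_site))
```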

Conceptual and Operational Definitions

Conceptual and operational definitions specify how the concepts of interest can be recognized in the content of interest. Think of it this way: a concept is a broad, abstract idea about the way something is or about the way several things interrelate. Each variable in a content analysis is the operationalized definition of that broader, more abstract concept. Each category of each content variable is an operational definition as well, but one subsumed by the broader operational definition of the variable.

A simple example makes this process clear. In a study of political visibility of state legislators (Fico, 1985), the concept of prominence was defined and measured. As an abstract concept, prominence means something that is first or most important and clearly distinct from all else in these qualities. In a news story about the legislative process, prominence can be measured (operationalized) in a number of ways. For example, a political actor's prominence in a story can be measured in terms of "how high up" the actor's name appears. The actor's prominence can be assessed according to how much story space or time is taken up with assertions attributed to or about the actor. Prominence can even be assessed by whether the political actor's photo appears with the article, or his or her name is in a headline. Certainly, it can be argued that the concept of prominence is best tapped by several measures, such as those noted—story position, space, accompanying photograph—but combined into an overall index. In fact, many concepts are operationalized in just this way. Of course, using several measures of a concept to create an index requires making sure that the various components indicate the presence of the same concept. For example, story space or number of paragraphs devoted to a politician may not be a good measure of prominence if he or she is not mentioned until the last paragraphs of a story.

Concept Complexity and Number of Variables

The more conceptually complex the variables and categories, the harder it will be to achieve acceptable reliability for reasons that we explain in the following section. Either more time and effort must be available for a complex analysis (to train coders, to spend more time coding, or both), or the analysis itself may have to be less extensive. That is, if the variables are simple and easy to apply, reliability is more easily achieved. A large number of complex variables increases the chances that coders will make mistakes, diminishing the reliability of the data.

Reliability is also easier to achieve when a concept is more, rather than less, manifest. Recall from Chapter 2 that something manifest is observable "on its face," and therefore easier to recognize and count

Reliability  101 than content with mostly latent meaning. The simpler it is to recognize when the concept exists in the content, the easier it is for the coders to agree, and thus the better the chance of achieving reliability in the study. For example, recognizing the race of a character in a streaming television series is easier than categorizing the presence of subtle institutional racism in the series. Or, if racial diversity in the series is operationalized simply as the number of times a person of color appears in the programs, coders will probably easily agree on the count of non-white characters. However, categorizing whether the characters face discrimination may require more complex judgment, and thereby affect coder reliability. Although reliability is easiest to achieve when content is more manifest (e.g., counting names), strictly manifest content is not always the most interesting nor important content. Therefore, content studies might address content that also has some degree of latent meaning. Content is rarely limited to only manifest or only latent meaning. Two problems can ensue, one of which affects the study’s reliability. First, as the proportion of meaning in content that is latent increases, agreement among coders becomes more difficult to achieve. Beyond within-study reliability, however, a second problem may occur that engages the interpretation of study results. Specifically, even though trained coders may achieve agreement on content with a high degree of latent meaning, it may be unclear whether naive observers of the content (e.g., book readers, TV program viewers, etc.) experience the meanings defined in the protocol and applied by the researchers. Few viewers of television video, for example, repeatedly rewind and review these programs to describe the relationships among actors. Here, the issue of reliability relates in some sense to the degree to which the study and its operationalizations “matter” in the real world (see Chapter 7). These issues do not mean that studies involving latent meaning should not be done or that they fail to have broader meaning and significance. That depends on the goals of the research. For example, Simon, Fico, and Lacy (1989) studied defamation in stories of local conflict. The definition of defamation came from court decisions: words that tend to harm the reputation of identifiable individuals. Simon et al.’s study further operationalized “per se” defamation that harms reputation on its face, and “per quod” defamation that requires interpretation that harm to reputation has occurred. Obviously, what harms reputation depends on what the reader or viewer of the material brings to it (latent content). To call a leader “tough” may be an admirable characterization to some and a disparaging one to others. Furthermore, it is doubtful that many readers of these stories had the concept of defamation in mind as they read (although they may have noted that sources were insulting one another). However, the goal of Simon et al.’s study was to determine when stories might risk angering one crucial population of readers: people defamed in the news who might bring a lawsuit.

102 Reliability These concepts of manifest and latent meaning can be thought to exist on a continuum. Some symbols are more manifest than others in that a higher percentage of receivers share a common meaning for those symbols. Few people would disagree on the common, manifest meaning of the word car, but the word cool has multiple uses in a standard dictionary as a verb and as a noun. Likewise, the latent meanings of symbols vary according to how many members of the group using the language share the latent meaning. The latent or connotative meaning can also change with time. In the 1950s in America, a Cadillac was considered the ultimate automotive symbol of wealth by the majority of people in the United States. Today, the Cadillac still symbolizes wealth, but it is not the ultimate symbol in everyone’s mind. Other cars, such as the Mercedes and BMW, have come to symbolize wealth as much or more so than the Cadillac. The point is that variables requiring difficult coder decisions, whether because of concept complexity or limited common meaning, will affect reliability and the time needed for coding. The more complex categories there are, the more time will be needed for training. Before each coding session, instructions should require that coders first review the protocol rules governing the categories. Coding sessions may be restricted to a set amount of content or a set amount of time to reduce the chance that coder fatigue will systematically degrade the coding of content toward the end of the session.

Content Analysis Protocol However simple or complex the variables, the definitions and coding procedures must be articulated clearly and unambiguously. This is done in the content analysis protocol. The protocol’s importance cannot be overstated. It defines the study in general and the coding rules applied to content in particular. Purpose of the Protocol First, the protocol sets down the rules governing the study—rules that bind the researchers in the way they define and measure the content of interest. Once the protocol has been judged to be reliable and coding has started, these rules must be invariant across the life of the study. Content coded on day 1 of a study should be coded in the identical way on day 100 of the study. Second, the protocol is the archival record of the study’s operations and definitions, or how the study was conducted. Therefore, the protocol makes it possible for other researchers to interpret the results and replicate the study. Such replication strengthens the ability of science to build a body of findings and theory.

Reliability  103 The content analysis protocol can be thought of as a cookbook. Just as a cookbook specifies ingredients, amounts of ingredients needed, and the procedures for combining and mixing them, the protocol specifies the study’s conceptual and operational definitions and the ways they are to be applied. To continue the analogy, if a cookbook is clear, one does not need to be a chef to make a good stew. The same is true for a content analysis protocol. If the concepts and procedures are sufficiently clear and procedures for applying them straightforward, anyone with training and practice should be able to apply the protocol consistently. If the concepts and procedures are more complex, then more exhaustive training will allow coders to apply the protocol precisely and to assign the content consistently. Protocol Development Of course, making variables sufficiently clear and the procedures straightforward may not be such a simple process. Variables that remain in a researcher’s head are not likely to be very useful. Therefore, the researcher starts by writing down the definitions. Although that sounds simple, the act of putting even a simple variable into words is more likely than anything else to illuminate sloppy or incomplete thinking. Defining variables forces more discerning thinking about what the researcher really means by a concept underlying the variable. The dynamic of articulation and response, both within oneself and with other researchers and coders, drives the process that clarifies variables. This interactive, iterative process forces the researcher to formulate variables in words and sentences that are less ambiguous and less subject to alternative interpretations that miss the concept the researcher had in mind. Protocol Organization Because it is the documentary record of the study, care should be taken to organize and present the protocol in a coherent manner. The document should be sufficiently comprehensive for other researchers to replicate the study without additional information from the researchers. Furthermore, the protocol must be available to any who wish to use it to help interpret, replicate, extend, or critique research governed by the protocol. A three-part approach works well for protocol organization. The first part is an introduction specifying the goals of the study and generally introducing the major concepts. For example, in a study of local government coverage (Fico et  al., 2013a; Lacy et  al., 2012), the protocol introduction specified the content and news media to be examined (news and opinion stories in eight types of news outlets). The second part specifies the procedures governing how the content was to be processed. For example, the protocol explained to coders which stories were to be excluded and included.

104 Reliability The third part of the protocol specifies each variable used in the content analysis, and therefore carries the weight of the protocol. For each variable, the overall operational definition is given along with the definitions of each category and the numerical values assigned to the various categories. These are the actual instructions used by the coders to assign content to particular values of particular variables and categories. Obviously, the instructions for some variables will be relatively simple (e.g., types of social media) or complex (e.g., degree of interactivity). How much detail should the category definitions contain? It should have only as much as necessary. As just noted, defining the concepts and articulating them in the protocol involves an iterative and interactive process. The protocol itself undergoes change before it is judged reliable, as coders in their practice sessions attempt to use the definitions, assessing their interim agreement at various stages in the training process. Category definitions become more coder-friendly as examples and exceptions are integrated. However, extremes in category definition—too much or too little detail—should be avoided. Definitions that lack detail permit coders too much leeway in interpreting when the categories should be used. Definitions that are excessively detailed may promote coder confusion or may result in coders forgetting particular rules in the process of coding. The coding instructions shown in Table 6.1 provide an example of part of a protocol used with a national sample of news content from roughly 800 news outlets. The protocol was applied to more than 47,000 stories across multiple news media outlets and had two sections. This is the first section that was applied to all stories. The second section was more complex and applied only to local government stories. Table 6.1  Coding protocol for local government coverage Introduction This protocol addresses the news and opinion coverage of local governments by daily newspapers, weekly newspapers, broadcast television stations, local cable networks, radio news stations, music radio stations, citizen blogs, and citizen new sites. It is divided into two sections. The first addresses general characteristics of all local stories, and the second concerns the topic, nature, and sources of local governments (city, county, and regional) governments. The content will be used to evaluate the extent and nature of coverage and will be connected with environmental variables (size of market, competition, ownership, etc.) to evaluate variation across these environmental variables. Procedure and Story Eligibility for Study Our study deals with local public affairs reporting at the city/suburb, county, and regional government levels. These areas include the local governmental institutions closest to ordinary people, and therefore more accessible to them. A city government (also sometimes called a “township”) is the smallest geopolitical unit in America. Many cities (townships) are included in counties, and many counties may be connected a regional governmental unit.

A story may NOT be eligible for coding for the following reasons:
1 The story deals with routine sports material.
2 The story deals with routine weather material.
3 The story deals with entertainment (e.g., plays).
4 The story deals with celebrities (their lives).
5 The story deals with state government only.
6 The story deals with national government only.

Read the story before coding. If you believe a story is NOT eligible for the study because it deals with excluded material noted above, go on to the next story. Consult with a supervisor on the shift if the story is ambiguous in its study eligibility. Variable Operational Definitions V1: Item Number (assigned) V2: Item Date: month/day/year (two digits: e.g., Aug. 8, 2008 is 080808) V3: ID Number of the City (assigned)—see list. Assign 999 if DMA sample. V4: Item Geographic Focus Stories used in this analysis were collected based on their identification as “local” by the news organization. Stories that address state, national, or international matters would not be included unless some “local angle” was present. The geographic focus of the content is considered to be the locality that occurs first in the item. Such localities are indicated by formal names (e.g., a Dallas city council meeting) used first in the story. In some cases, a formal name will be given for a subunit of a city (e.g., the “Ivanhoe Neighborhood” of East Lansing), and in these cases the city is the focus. Often the locality of a story is given by the dateline (e.g., Buffalo, NY), but in many cases the story must be read to determine the locality because it may be different than that in a dateline. If no locality at all is given in the story, code according to the story’s dateline. 1 = listed central city: see list 2 = listed suburb city: see list 3 = other local geographic area V5: ID Number of DMA (assigned number)—see list. Assign 99 if city council sample V6: ID Number of outlet (assigned)—see list V7: Type of Medium (check ID number list) 1 = daily newspaper 2 = weekly newspaper 3 = broadcast television 4 = cable television 5 = news talk radio 6 = non-news talk radio 7 = citizen journalism news site 8 = citizen journalism blog site (continued)

Table 6.1  (continued) Introduction V8: Organizational Origin of Content Item: 1 = Staff Member: (Code story as 1 if there is any collaboration between news organization staff and some other story information source.) a Includes items from any medium that attribute content to a reporter’s or content provider’s NAME (unless the item is attributed to a source such as those under the code 2 below). A first name or a username suffices for citizen journalism sites. b Includes items by any medium that attribute content to the news organization name (e.g., by KTO; by the Blade; by The Jones Blog). Such attribution can also be in the story itself (e.g., KTO has learned; The Blade’s Joe Jones reports). c Includes items that attribute content to the organization’s “staff” or by any position within that organization (e.g., “editor,” etc.). d FOR TV AND RADIO ONLY, assume an item is staff produced if: 1) A station copyright is on the story page (the copyright name may NOT match the station name). However, if an AP/wire identification is the only one in the byline position or at the bottom of the story, code V7 as 2 even if there is a station copyright at the bottom of the page. 2) A video box is on a TV item or an audio file is on a radio item. e FOR RADIO ONLY, assume an item is staff-produced ALSO if the item includes a station logo inside the story box. f FOR NEWSPAPER ONLY, assume an item is staff-produced ALSO if the item includes: 1) An email address with newspaper URL. 2) A “police blotter” or “in brief” section of multiple stories. 2 = News and Opinion Services: a This includes news wire services such as Associated Press, Reuters, and Agence France Press, and opinion syndicates such as King’s World. b This includes news services such as the New York Times News Service, McClatchy News Service, Gannett News Service, and Westwood One Wire. c This includes stories taken WHOLE from other news organizations as indicated by a different news organization name in the story’s byline. 3 = Creator’s Online Site (for material identified as 7 or 8 in V8): a Used ONLY for online citizen journalism sites whose content is produced by one person, as indicated by the item or by other site information. b If the site uses material from others (e.g., “staff,” “contributors,” etc.), use other V8 codes for those items. 4 = Local Submissions: Use this code for WHOLE items that include a name or other identification that establishes it as TOTALLY verbatim material from people such as government or nongovernment local sources. The name can refer to a person or to an organization.

Reliability  107 Such material may include: a Verbatim news releases. b Official reports of government or nongovernment organizations. c Letters or statements of particular people. d Op-ed pieces or letters to the editor e Etc. 5 = Can’t Tell: The item includes no information that would result in the assignment of codes 1, 2, 3, or 4 above.

Coding Sheet Each variable in the content analysis protocol must relate unambiguously to the actual coding sheet used to record the content attributes of each unit of content in the study. A coding sheet should be coder-friendly. Coding sheets can be printed on paper or presented on a computer screen. Each form has advantages and disadvantages. Paper sheets allow flexibility when coding. With paper, a computer need not be available while coding content, and the periodic interruption of coding content for keyboarding is avoided. Not having interruptions is especially important when categories are complex, and the uninterrupted application of the coding instructions can improve reliability. Paper sheets are useful particularly if a coder is examining content that is physically large, such as a newspaper. Using paper, however, adds more time to the coding process. Paper coding sheets require the coder to write the value; someone else must then keyboard it into the computer. If large amounts of data are being analyzed, the time increase can be considerable. This double recording on paper and keyboard also increases the chance of transcribing error. On the other hand, having the paper sheets provides a backup for the data should a hard drive crash. The organization of the coding sheet will, of course, depend on the specific study. However, the variables on the coding sheet should be organized to follow the order of variables in the protocol, which in turn follows the flow of the content of interest. The coders should not have to dart back and forth repeatedly within the content of interest to determine the variable categories. If, for example, an analysis involves recording who posted a comment on Facebook, that category should be coded relatively high up the coding sheet because coders will encounter a poster’s name early in the coding process. Planning the sheet design along with the protocol requires the researcher to visualize what the process of data collection will be like and how problems can be avoided. Coding sheets usually fall into two types: single case and multiple cases. The single-case coding sheets have one or more pages for each case or recording unit. The analysis of suicide notes for themes might use a “sheet” for each note, with several content categories on the sheet.

Table 6.2  Coding sheet

Content Analysis Protocol AAA for Assessing Local Government News Coverage

V1: Item Number  ________________
V2: Item Date  ________________
V3: ID Number of the City  ________________
V4: Item Geographic Focus  ________________
  1 = listed central city: see list
  2 = listed suburb city: see list
  3 = other local geographic area
V5: ID Number of DMA  ________________
V6: ID Number of Outlet  ________________
V7: Type of Medium (Check ID Number List)  ________________
  1 = daily newspaper
  2 = weekly newspaper
  3 = broadcast television
  4 = cable television
  5 = news talk radio
  6 = non-news talk radio
  7 = citizen news site
  8 = citizen blog site
V8: Organizational Origin of Content Item  ________________
  1 = staff member
  2 = news and opinion services
  3 = creator's online site
  4 = local submissions
  5 = can't tell

Table 6.2 shows the single-case coding sheet associated with the coding instructions given in Table 6.1. Each variable (V) and a response number or space is identified with a letter and numbers (V1, V2, etc.) that corresponds with the definition in the coding protocol. Connecting variable locations on the protocol and on the coding sheet reduces time and confusion while coding. Multi-case coding sheets allow an analyst to put more than one case on a page. This type of coding sheet often appears as a grid, with the cases placed along the rows and the variables listed in the columns. This is the form used when setting up a computer database in Excel or SPSS. Figure 6.1 shows an abbreviated multi-case coding sheet for a study of monthly consumer magazines. Each row contains the data for one issue of the magazine; this example contains data for seven cases. Each column holds the numbers for the variable listed. Coders will record the number of photographs in column 4 for the issue listed on the row. For instance, the March issue in 1995 had 45 photographs in the magazine.
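Because a multi-case coding sheet is simply a grid of cases by variables, it maps directly onto a flat data file that Excel or SPSS can read. The sketch below, which is our own illustration rather than part of the study materials, writes one row per magazine issue; the abbreviated column names and file name are assumptions based on Figure 6.1.

```python
import csv

# Columns follow the multi-case layout of Figure 6.1; the short field names
# are assumed abbreviations for illustration.
FIELDS = ["id", "month", "year", "photos", "stories", "health_stories",
          "total_space", "food_ad_pages"]

rows = [
    {"id": "03", "month": "03", "year": "95", "photos": 45, "stories": 32,
     "health_stories": 15, "total_space": 130, "food_ad_pages": 35},
]

with open("magazine_coding.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()    # one header row of variable names
    writer.writerows(rows)  # one row per case (magazine issue)
```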

Coder Training The process of variable and category definition, protocol construction, and coder training is an iterative process. Central to this process—how long it goes on and when it stops—are the coders.

ID #   Month   Year   # of Photos   # of Stories   # of Health Stories   Total space   Pages of Food Ads
01     01      95     42            15             09                    102           29
02     02      95     37            21             10                    115           31
03     03      95     45            32             15                    130           35
04     04      95     31            25             08                    090           27
05     06      95     50            19             12                    112           30
06     01      96     43            19             11                    120           25
07     02      96     45            23             17                    145           29

Figure 6.1  Coding sheet for monthly consumer magazines

The coders, of course, change as they engage in successive encounters with the content of interest and the way that content is captured by the concepts defined for the study. A content analysis protocol will go through many drafts during pretesting as variables are refined, measures specified, and procedures for coding worked through. Coding Process This process is both facilitated and made more complex depending on the number of coders. Along with everyone else, researchers carry mental baggage that influences their perception and interpretation of communication content. A single coder may not notice the dimensions of a concept being missed or how a protocol that is perfectly clear to him or her may be opaque to another. Several coders are more likely to hammer out conceptual and operational definitions that are clearer and more explicit. On the other hand, the disadvantage of using several coders is that agreement on classifying content units might be more difficult, or their operationalization may reveal problems that would not occur with fewer coders or with only one coder. At some point, a concept and its measure may just not be worth further expenditure of time or effort, and recognizing that a variable should be dropped before coding may not be easy either. Although the protocol may be well organized and clearly and coherently written, a content analysis must still involve systematic training of coders to use the protocol. An analogy to a survey is useful. Telephone survey administrators must be trained in the rhythms of the questionnaire and gain comfort and confidence in reading the questions and recording

110 Reliability respondent answers. Coders in a content analysis must grow comfortable and familiar with the definitions of the protocol and how they relate to the content of interest. The first step in training coders is to familiarize them with the content being analyzed. The aim here is not to precode material, and indeed content not in the study sample should be used for this familiarization process. The familiarization process is meant to increase the coders’ comfort level with the content of interest, to give them an idea of what to expect in the content, and to determine how much energy and attention is needed to comprehend it. To help minimize coder differences, the study should establish a procedure that coders follow in dealing with the content. For example, that procedure may specify how many pieces of content a coder may deal with in a session or a maximum length of time governing a coding session. The procedure may also specify that each coding session must start with a full reading of the protocol to refresh coder memory of category definitions. Coders should also familiarize themselves with the content analysis protocol, discussing it with the supervisor and other coders during training and dealing with problems in applying it to the content being studied. During these discussions, it should become clear whether the coders are approaching the content from similar or different frames of reference. Obviously, differences will need to be addressed because these will almost certainly result in disagreements among coders and poor study reliability. Sources of Coder Disagreements Differences among coders can have a number of origins. Some are relatively easy to address, such as simple confusion over definitions. Others may be impossible to solve, such as a coder who simply does not follow the procedure specified in the protocol. Protocol Problems Differences because of inadequate category definitions must be seriously addressed. Does disagreement exist because a category is ambiguous or poorly articulated in the protocol? Or is the problem with a coder who just does not understand the concept or the rules for operationalizing it? Obviously, when several coders disagree on which category to assign a content unit, the strong possibility exists that the problem is in the category or variable. A problem may occur because of fundamental ambiguity or complexity in the variable or because the rules assigning content to the variable categories are poorly spelled out in the protocol. The simplest approach to such a variable or category problem is to begin by revising its definition to remove the sources of ambiguity or confusion. If this revision fails to remove the source of disagreement,

Reliability  111 attention must be turned to the fundamental variable categories and definitions. It may be that an overly complex variable or its categories can be broken down into several parts that are relatively simpler to handle. For example, research on defamation (Fico & Cote, 1999) required initially that coders identify defamation in general and, following that, coding copy as containing defamation per se and defamation per quod. Defamation per quod is interpreted by courts to mean that the defamation exists in the context of the overall meanings that people might bring to the reading. With this definition, coder reliability was poor. However, better reliability was achieved on recognition of defamation in general and defamation per se. The solution was obvious: given defamation in general, defamation per quod was defined to exist when defamation per se was ruled out. In other words, if all defamation was either per se or per quod, getting a reliable measure of per se was all that was necessary to also define reliably the remaining part of defamatory content that was per quod. Although this process resulted in a lack of independence between categories, this is not a problem if data from only one category are used in analysis. However, researchers may also have to decide if a category must be dropped from the study because coders cannot use it reliably. In another study of how controversy about issues was covered in the news (Fico & Soffin, 1995), the coders attempted to make distinctions between “attack” and “defense” assertions by contenders on these issues. In fact, the content intermixed these kinds of assertions to such a degree that achieving acceptable reliability proved impossible. The variable was dropped. Coder Problems If only one coder is consistently disagreeing with others, the possibility exists that something has prevented that coder from properly applying the definitions. Between-coder reliability measures make it easy to identify problem coders by comparing the agreement of all possible pairs of coders. Attention must then be given to retrain that coder or to remove him or her from the study. There may be several reasons why a coder persistently disagrees with others on application of category definitions. The easiest coder problems to solve involve applications in procedure. Is the coder giving the proper time to the coding? Has the protocol been reviewed as specified in the coding procedures? Assigning content to categories too quickly is a common coder problem. A coder needs to develop a rhythm for coding and a sense of how long coding variables will take. In some cases, if the content involves specialized knowledge, the coders may need to be educated. For example, some of the eight coders involved in the project about local government knew little about the structure,

112 Reliability officials, and terms associated with local government (Fico et al., 2013b; Lacy et al., 2012). Therefore, the principal investigators created a booklet about local government and had the coders study it. Content analysts should be aware of the need to familiarize coders with terms that are unfamiliar to them. Early training with the protocol and similar content should reveal this condition. At that point, the training involves educating coders with the specialized knowledge they need. More difficult problems involve differences in cultural understanding or frame of reference that may be dividing coders. These differences will be encountered most frequently when coders deal with variables that have more latent content. One author recalls working as a student on a content study in a class of students from the United States, Bolivia, Nigeria, France, and South Africa. The study involved applying concepts such as terrorism to a sample of stories about international relations. As might be imagined, what is terrorism from one perspective may well be national liberation from another. Such frame of reference problems are not impossible to overcome, but they will increase the time needed for coder training and possibly coding. Such issues should also signal that the study itself may require more careful definition of its terms in the context of such cultural or social differences. Peter and Lauf (2002) examined factors affecting intercoder reliability in a study of cross-national content analysis, which was defined as comparing content in different languages from more than one country. They concluded that some coder characteristics affected intercoder reliability in bilingual content analysis. However, most of their recommendations centered on the failure to check reliability among the people who trained the coders. The conclusion was that cross-country content analysis would be reliable if three conditions are met: “First, the coder trainers agree in their coding with one another; second, the coders within a country group agree with one another; and, third, the coders agree with the coding of their trainers” (Peter & Lauf, 2002, p. 827). Symbolic Complexity The nature of the language and symbols coded for some variables can have an impact on the levels of reliability and ease of achieving consistent coding (Lacy et al., 2015). As discussed earlier, as the proportion of a message that uses latent meaning increases, generating reliable data becomes more difficult. Visual symbols, such as those found in photographs and videos, tend to be more ambiguous than text. The difficulty generated by symbolic complexity is dictated by the research questions and hypotheses. The area of interest determines the content to be coded, and not the other way. The way to improve reliability with symbolic complexity is to make sure the protocol is well developed and that coders have adequate training on the protocol.
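As noted above, comparing agreement for every possible pair of coders is a quick way to spot a coder who has drifted from the protocol. The following sketch computes simple percentage of agreement for each pair on one variable; the coder names and codes are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(codes_by_coder):
    """Percentage of agreement for every pair of coders on one variable.

    codes_by_coder maps coder name -> list of codes for the same units,
    in the same order. Returns {(coder1, coder2): proportion agreeing}.
    """
    results = {}
    for a, b in combinations(codes_by_coder, 2):
        codes_a, codes_b = codes_by_coder[a], codes_by_coder[b]
        agree = sum(x == y for x, y in zip(codes_a, codes_b))
        results[(a, b)] = agree / len(codes_a)
    return results

# Hypothetical check on six units: coder_c agrees far less often
# with the other two, which signals a training or procedure problem.
codes = {
    "coder_a": [1, 2, 1, 3, 2, 1],
    "coder_b": [1, 2, 1, 3, 2, 2],
    "coder_c": [1, 3, 2, 3, 1, 2],
}
print(pairwise_agreement(codes))
```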


Reliability Assessment Reliability Tests Ultimately, the process of concept definition and protocol construction must cease. At that point, the researcher must assess the degree to which the protocol can be reliably applied. Reliability falls into three types (Krippendorff, 2004a, pp. 214–216): stability, reproducibility, and accuracy. Stability refers to a coder consistently applying the protocol to the same set of content at two points in time. This “within-coder” assessment tests whether slippage has occurred in the single coder’s understanding or application of the protocol definitions. Checking stability is needed with coding that lasts for a long time period, but there is no accepted definition of a “long time period.” However, if a project takes more than a month of coding, stability testing would improve the argument for data validity. Reproducibility involves two or more coders applying the protocol to the same content. Each variable in the protocol is tested for reproducibility by looking at agreement among coders in applying relevant category values to the content. For example, two coders code 100 tweets dealing with abortion. Coding the variable for pro-choice or pro-life, they compute the percentage of those stories on which they have agreed that the particular story is pro-choice or pro-life according to the coding definitions. The third type of reliability is accuracy, which addresses whether or not the coding is consistent with some external standard for the content, much as one resets (or “calibrates”) a household clock to a “standard” provided by one’s mobile phone after a power outage. The problem in content analysis is how to come by a standard with limited measurement error. One way is to compare the content analysis data with a standard established by experts, but there is no way to verify the degree of bias in the experts’ standards. Therefore, most content analyses are limited to testing reproducibility. Although reliability tests are framed as comparing assignment by coders, it is important to understand that reliability is a measure of the entire process by which trained coders apply a well-developed protocol to relevant content. Because of the need for replication, the role of the protocol in establishing and testing reliability should be emphasized. Coder training sessions constitute a kind of reliability pretest. However, wishful thinking and post hoc rationalizations of why errors have occurred (e.g., “I made that mistake only because I got interrupted by the phone while coding that story”) mean a more formal and rigorous procedure must be applied. In fact, formal coder reliability tests should be conducted during the period of coder training itself as an indicator of when to proceed with the study. Such training tests should not, of course,

114 Reliability be conducted with the content being used for the actual study because a coder must code study content independently, both of other coders and of himself or herself. If content is coded several times, prior decisions contaminate subsequent ones. Furthermore, repeated coding of the same content inflates the ultimate reliability estimate, thus giving a false confidence in the study’s overall reliability. At some point, the training formally stops, coding begins, and during the coding process the actual assessment of achieved reliability must take place. Two issues must be addressed. The first concerns selection of content used in the reliability assessment. The second concerns the actual statistical reliability tests that will be used. The process of testing reproducibility should include at least one coder who was not involved in the initial creating and development of the protocol. Coding after protocol formation could artificially inflate reliability because the creators share biases that other coders might not have. Selection of Content for Testing If the number of content units being studied is small, protocol reliability can be established by having two or more coders code all the content. Otherwise, researchers need to randomly select content samples for reliability testing. Most advice has been arbitrary and ambiguous about how much content to use when establishing protocol reliability. One text (Wimmer & Dominick, 2003) suggests that between 10% and 25% of the body of content should be tested. Others (Kaid & Wadsworth, 1989) suggested that between 5% and 7% of the total is adequate. One popular online resource (http://matthewlombard.com/reliability/index_print. html) suggests that the reliability sample “should not be less than 50 units or 10% of the full sample, and it rarely needs to be greater than 300 units.” However, the foundations for these recommendations are not always clear. The number of units that are needed will be addressed below, but probability sampling should be used when a census is impractical. Random sampling accomplishes two things. First, it controls for the inevitable human biases in selection. Second, the procedure produces, with a known probability of error, a sample that reflects the appropriate characteristics in the overall population of content being studied. Without a random sample, inference that the reliability outcome represents all the content being studied cannot be supported. Given a random sample of sufficient size, the coder reliability test should then reflect the full range of potential coding decisions that must be made in the entire body of material. The problem with nonrandom selection of content for reliability testing is the same as the problem with a nonrandom sample of study content: tested material may be atypical

Reliability  115 of the entire body of content that will be coded. A non-representative sample yields reliability assessments whose relation to the entire body of content is unknown. Using probability sampling to select content for reliability testing also enables researchers to take advantage of sampling theory to answer the question of how much material must be tested. Random sampling can specify sampling error at known levels of confidence. For example, if two researchers using randomly sampled content achieve a 90% level of agreement, the actual agreement they would achieve coding all material could vary above and below that figure according to the computed sampling error. That computed sampling error would vary with the size of the sample—the bigger the sample, the smaller the error and the more precise the estimate of agreement. Therefore, if the desired level of agreement is 80%, and the achieved level on a coder reliability test is 90% plus or minus five percentage points, the researchers can proceed with confidence that the desired agreement level has been reached or exceeded. However, if the test produced a percentage of 84%, the plus or minus 5% sampling error would include a value of 79% that is below the required standard of 80%. A study assessing the reliability process (Lovejoy et al., 2014) reported in Communication Monographs, the Journal of Communication, and Journalism & Mass Communication Quarterly from 1985 to 2010 found that 24% did not include information about reliability tests, and only 34% of the articles that provide information about reliability tests described the reliability sampling process. The good news is that reporting reliability sampling information improved over the time period, but anything less than 100% reporting is insufficient. Selection Procedures Assuming content for a reliability test will be selected randomly, how many units of content must be selected? Lacy and Riffe (1996) noted that this will depend on several factors: the total number of units to be coded, the desired degree of confidence in the eventual reliability assessment, and the degree of precision desired in the reliability assessment. Although each of these three factors is under the control of the researcher, a fourth factor must be assumed on the basis of prior studies, a pretest, or a guess. That is the researcher’s estimate of the actual agreement that would have been obtained had all the content of interest (census) been used in the reliability test. For reasons that we explain later, it is our recommendation that the estimate of actual agreement be set five percentage points higher than the minimum required reliability for the test. This five percentage point buffer will ensure a more rigorous test (i.e., the achieved agreement will have to be higher for the reliability test to be judged adequate).

116 Reliability The first object in applying this procedure is to compute the number of content cases required for the reliability test. When researchers survey a population, they use the formula for the standard error of proportion to estimate a minimal sample size necessary to infer to that population at a given confidence level. A similar procedure is applied here to a population of content. One difference, however, is that a content analysis population is likely to be far smaller than the population of people involved in a survey. This makes it possible to correct for a finite population size when the sample makes up 20% or more of the population. This has the effect of reducing the standard error and giving a more precise estimate of reliability. The formula for the standard error can be manipulated to solve for the sample size needed to achieve a given level of confidence. This formula is n=

[(N − 1)(SE)² + PQN] / [(N − 1)(SE)² + PQ]

in which
N = the population size (number of content units in the study)
P = the population level of agreement
Q = (1 − P)
n = the sample size for the reliability check

Solving for n gives the number of content units needed in the reliability check. Note that standard error gives the confidence level desired in the test. This is usually set at the 95% or 99% confidence level (using a one-tailed test because interest is in the portion of the interval that may extend below the acceptable reliability figure). For the rest of the formula, N is the population size of the content of interest, P is the estimate of agreement in the population, and Q is 1 minus that estimate. As an example, a researcher could assume an acceptable minimal level of agreement of 85% and P of 90% in a study using 1,000 content units (e.g., newspaper stories). One further assumes a desired confidence level of .05 (i.e., the 95% confidence level). A one-tailed z-score—the number of standard errors needed to include 95% of all possible sample means on agreement—is 1.64 (a two-tailed test z-score would be 1.96). Because the confidence level is 5% and our desired level of probability is 95%, SE is computed as follows: .05 = 1.64(SE), or SE = .05 / 1.64 = .03

Using these numbers to determine the test sample size to achieve a minimum 85% reliability agreement and assuming P to equal 90% (5% above our minimum), the results are

n = [(999)(.0009) + .09(1000)] / [(999)(.0009) + .09] = 92

In other words, 92 test units out of the 1,000 are used (e.g., newspaper stories) for the coder reliability test. If a 90% agreement in coding a variable on those 92 test units is achieved, chances are 95 out of 100 that at least an 85% or better agreement would exist if the entire content population were coded by all coders and reliability measured. Once the number of test units needed is known, selection of the particular ones for testing can be based on any number of random techniques (see Chapter 5). All coders then code the selected units. The procedure just described is also applicable to studies in which coding categories are measured using interval or ratio scales. The calculation of standard error is the only difference. If these formulas seem difficult to use, two tables may be useful. Tables 6.3 and 6.4 apply to studies that deal with nominal-level percentage of agreement. Table 6.3 is configured for a 95% confidence level, and Table 6.4 is configured for the more rigorous 99% confidence level. Furthermore, within each table, the number of test cases needed has been configured for 85%, 90%, and 95% estimates of population coding agreement. Researchers should set the assumed level of population agreement (P) at a high enough level (we recommend 90% or higher) to assure that the reliability sample includes the range of category values for each variable. Otherwise, the sample will not represent the population of content. Using a statistic he developed to assess reliability (see below), Krippendorff (2013) provides a different approach toward selecting units for the reliability test. He provides a table with reliability sample sizes that is a function of the researcher’s selection of an acceptable minimum level for Krippendorff’s CAlpha and an acceptable p-value, as well as the number of coders, and the probability of selecting the least frequency value from among all population values. Researchers could always use all study content to test reliability, which eliminates sampling error. If they do not use a census of study content, the cases used for reliability testing must be randomly selected from the population of content of interest to have confidence in the reliability results. The number of units should be taken using the procedure suggested by Lacy and Riffe (1996) or by Krippendorff (2013). The level of sampling error for reliability samples should always be reported.

Table 6.3  Content units needed for reliability test based on various population sizes, three assumed levels of population intercoder agreement, and a 95% level of probability

                 Assumed Level of Agreement in Population
Population Size      85%      90%      95%
10,000               141      100       54
5,000                135       99       54
1,000                125       92       52
500                  111       84       49
250                   91       72       45
100                   59       51       36

Table 6.4  Content units needed for reliability test based on various population sizes, three assumed levels of population intercoder agreement, and a 99% level of probability

                 Assumed Level of Agreement in Population
Population Size      85%      90%      95%
10,000               271      193      104
5,000                263      190      103
1,000                218      165       95
500                  179      142       87
250                  132      111       75
100                   74       67       52
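For combinations of N, P, and confidence level not covered by the tables, the Lacy and Riffe (1996) formula can be computed directly. The sketch below is our own illustration, not code from the original study; it rounds up to the next whole unit, and small differences from the published tables can arise from how the standard error is rounded.

```python
import math

def reliability_sample_size(population_size, assumed_agreement, standard_error):
    """Content units needed for a reliability test (Lacy & Riffe, 1996).

    population_size: N, total content units in the study
    assumed_agreement: P, estimated agreement in the population (e.g., 0.90)
    standard_error: SE, desired one-tailed standard error (e.g., 0.05 / 1.64)
    """
    p = assumed_agreement
    q = 1 - p
    numerator = (population_size - 1) * standard_error**2 + p * q * population_size
    denominator = (population_size - 1) * standard_error**2 + p * q
    return math.ceil(numerator / denominator)

# Worked example from the text: N = 1,000 stories, P assumed to be .90
# (five points above the 85% minimum), SE = .05 / 1.64 rounded to .03.
print(reliability_sample_size(1000, 0.90, 0.03))  # 92
```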

Regardless of which sampling process is used, the sample should be checked to verify that all categories for all variables have been selected at least once by coders (Krippendorff, 2013). When to Conduct Reliability Tests The process of establishing reliability involves two types of tests. The first is a series of pretests that occurs during training. As mentioned above, reproducibility pretesting serves as part of an iterative process of coding, examining reliability, adjusting the protocol, and coding again that aims to improve the protocol. Just how long this process continues reflects several factors, but it should continue until the reliability has reached an acceptable level, as discussed below. Formal reliability pretests with

Reliability  119 coders working independently to classify content will determine when this point is reached. Once the pretests demonstrate that the protocol can be applied reliably, the study coding should begin. It is during the coding of the study content units that the final protocol reliability is established. As the coding gets underway, the investigators must select the content to be used for the reliability test, as described above. Generally, it is a good idea to wait until about 10% to 15% of the coding has been completed to begin the reliability test. This will allow coders to develop a routine for coding and become familiar with the protocol. The content used for establishing and reporting reliability should be coded by all the coders, and the reliability content should be interspersed with the study content so the coders do not know which content units are being coded by everyone. This “blind” approach to testing reliability will yield a better representation of the reliability than having an identifiable set of reliability content coded separately from the normal coding process. If coders are aware of the test content, they might try harder or become nervous. In either case, the reliability results could be influenced. If the study’s coding phase will exceed a month in length, the investigators should consider testing stability of the process, as discussed above, by administering multiple tests. As with the initial reliability tests, content for stability tests and additional reproducibility tests should be randomly selected from the study content. However, if the initial reliability test demonstrated sufficient reliability, the additional samples do not need to be as large as in the initial test. Samples in the 30 to 50 ranges should be sufficient to demonstrate the maintenance of reliability. Most content analysis projects do not involve enough content to require more than two reliability tests, but in some cases the investigators should consider more tests. For example, the coding of local government news coverage mentioned previously (Fico et  al., 2013b; Lacy et  al., 2012) lasted more than four months and involved three reliability tests. When additional tests are involved, a second one should take place after half of the content has been coded and before 60% has been coded. If a third test occurs, it should be in the 80% to 90% completion range. A major concern with longer projects is what to do if reliability falls below acceptable levels. If the reliability of the protocol in the initial test is high, any deterioration will likely reflect problems with coders. If this happens, coders whose reliability has slipped have to be identified and either retrained or dropped from the study. Of course, any content they coded since the last reliability test will need to be recoded by other coders.
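One way to carry out the "blind" approach described above is to draw the reliability subsample at random and then mix those units into each coder's regular queue so that no one can tell which items everyone is coding. This is a simplified sketch under stated assumptions: the identifiers, coder names, and even split of the remaining workload are illustrative, and in practice the reliability units would be released only after coders have settled into a routine.

```python
import random

def build_coder_assignments(study_ids, n_reliability, coders, seed=2019):
    """Draw a random reliability subsample and hide it in each coder's queue.

    study_ids: identifiers for all content units in the study
    n_reliability: size of the reliability subsample (coded by every coder)
    coders: list of coder names
    """
    rng = random.Random(seed)
    reliability_ids = rng.sample(list(study_ids), n_reliability)
    rel_set = set(reliability_ids)
    remaining = [i for i in study_ids if i not in rel_set]
    rng.shuffle(remaining)

    assignments = {}
    for k, coder in enumerate(coders):
        own = remaining[k::len(coders)]   # split the regular workload
        queue = own + reliability_ids     # every coder codes the reliability units
        rng.shuffle(queue)                # intersperse them with the normal coding
        assignments[coder] = queue
    return reliability_ids, assignments

# Hypothetical example: 1,000 stories, a 92-unit reliability sample, three coders.
rel_ids, plan = build_coder_assignments(range(1, 1001), 92, ["A", "B", "C"])
```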

Reliability Coefficients The degree of reliability that applies to each variable in a protocol is reported with a reliability coefficient, which is a summary statistic for

120 Reliability how often trained coders using the protocol agreed on the classification of the content units. Literature about content analysis reliability and inter-rater agreement in medicine contain more than 30 different reliability coefficients (Nili, Tate, & Barros, 2017), but four are most often used in communication studies (Lovejoy, Watson, Lacy, & Riffe, 2016): percentage of agreement (also called Holsti’s coefficient), Scott’s pi, Cohen’s kappa, and Krippendorff’s original CAlpha. Percentage of agreement has fallen out of use as a primary reliability coefficient because it overstates the true level of reliability by not taking unjustified (not prompted by the protocol) or “chance” agreement into consideration. It is not an adequate test of reliability. However, the latter three coefficients do consider “chance” agreement. Researchers are encouraged to explore this wide-ranging literature, but they should remember that the goals and processes of medical interrater agreement and content analysis reliability are not the same. Medical diagnosis does not follow a set of instructions and definitions that are developed for a given problem. Instead, health practitioners depend on general guidelines and chemical tests that are unique to the patient. In short, the variations in the health rater are more prominent in explaining decisions than are coders in content analysis. Lovejoy et  al. (2016) examined 672 articles in Journalism & Mass Communication Quarterly (JMCQ), the Journal of Communication (JoC), and Communication Monographs (CM) that used content analysis to generate data. The percentage of articles reporting a “chance-corrected” reliability coefficient for every variable increased during the 1985 to 2014 period. However, between 2010 and 2014, 25% of the content analysis articles in JMCQ, 23% of the articles in JoC, and 43% of the articles in CM did not report reliability coefficients that corrected for chance. During the same five years, 68% of the articles in CM, 50% of the articles in JMCQ, and 41% of the articles in JoC did not report a chance-corrected coefficient for each of the variables in the study. Percentage of Agreement Before discussing which coefficient scholars should use, we will examine the four most often used. Although it should not be used as a primary reliability coefficient, the percentage of agreement among two or more coders was first used to evaluate reliability. It is calculated as the proportion of correct judgments as a percentage of total judgments made. All coding decisions can be reduced to dichotomous decisions for figuring simple agreement. In such cases, each possible pair of coders is compared for agreement or disagreement. For example, if three coders categorize an article, the total number of dichotomous coding decisions will equal three: coders A and B, coders B and C, and coders A and C. Four coders

Reliability  121 will yield six decisions for comparison (A and B, A and C, A and D, B and C, B and D, and C and D), and so on. The percentage of agreement coefficient overestimates reliability because it does not control for the influence of agreements due to accident or error. However, the fact that the term “chance” is used to summarize such agreements does not mean they are actually due to chance or guesses. Most, or even all, agreements could be the result of a welldeveloped protocol and good training, especially if agreement is high and the data distribution is not skewed. Although percentage of agreement can inflate reliability, it is useful during protocol development and coder training as a way of identifying where and why disagreements are occurring. Percentage of agreement also helps in understanding the nature of the data by comparing it with other reliability coefficients. As discussed below, sometimes a study has a high level of simple agreement but low CAlpha, kappa, or pi. Examining these together can help future studies improve the protocol. Because of this, content analysis studies should report both a simple agreement figure and one or more reliability coefficients. The simple agreement figures should be placed in an endnote as information for researchers conducting replications. However, decisions about the reliability of a variable in the protocol should be based on a coefficient that takes chance agreement into consideration. Coefficients That Evaluate Chance Agreement Consider the possibility that some coder agreements might occur among untrained coders who are not guided by a protocol. These have traditionally been called “chance agreements.” One of the earliest reliability coefficients that “corrects” for chance agreement is Scott’s pi (Scott, 1955). It involves only two coders and is used with nominal data. Correcting for chance leads to the calculation of “expected agreement” using probability theory. Scott’s pi computes expected agreement by using the proportion of times particular values of a category are used in a given test. Here is an example. Assume that a variable (topic of news stories) has four categories (government, crime, entertainment, and sports) and that two coders have coded 10 units of content for a total of 20 coding decisions. Government has been used 40% of the time (i.e., eight of the combined decisions by the two coders selected government as the correct coding category), sports has been used 30% of the time (in six decisions), and crime (in three decisions) and entertainment (in three decisions) have each been used 15% of the time. Here is where the multiplication rules of probability apply. We multiply because chance involves two coders and not one. The probability of a single “event” (a story being about government) equals .4, but the probability of two such events (two coders

122 Reliability coding the same variable as government) requires .4 to be multiplied by .4. This, of course, makes intuitive sense: a single event is more likely to occur than two such events occurring. In this example, the expected agreement is .4 times .4 (government stories), or .16, plus .3 times .3 (sports stories), or .09, plus .15 times .15 (crime stories), or .022, plus .15 times .15 (entertainment stories), or .022. The expected agreement by chance alone would then be the sum of the four products, .29 (29%), or .16 + .09 + .022 + .022. The computing formula for Scott’s pi is Pi =

(%OA − %EA) / (1 − %EA)

in which OA = observed agreement EA = expected agreement In this formula, OA is the agreement achieved in the reliability test and EA is the agreement expected by chance, as just illustrated. Note that the expected agreement is subtracted from both the numerator and denominator. In other words, chance is eliminated from both the achieved agreement and the total possible agreement. To continue with the example, suppose the observed agreement between two coders coding the four-value category for 10 news stories is 90% (they have disagreed only once). In this test, Scott’s pi would be Pi =

(.90 − .29) / (1 − .29) = .61 / .71 = .86
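The hand calculation above can be reproduced in a few lines. The sketch below pools both coders' category choices to form the expected agreement and then applies the pi formula; the coder labels and category names are hypothetical stand-ins for the example in the text.

```python
from collections import Counter

def scotts_pi(coder_a, coder_b):
    """Scott's pi for two coders and one nominal variable."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Coders must judge the same units")
    n_units = len(coder_a)
    # Observed agreement: share of units on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n_units
    # Expected agreement: squared pooled proportions of each category value.
    pooled = Counter(coder_a) + Counter(coder_b)
    total = 2 * n_units
    expected = sum((count / total) ** 2 for count in pooled.values())
    return (observed - expected) / (1 - expected)

# The example in the text: 10 stories, one disagreement, pooled use of
# government 40%, sports 30%, crime 15%, entertainment 15%.
coder_a = ["gov"] * 4 + ["sports"] * 3 + ["crime", "ent", "crime"]
coder_b = ["gov"] * 4 + ["sports"] * 3 + ["crime", "ent", "ent"]
print(round(scotts_pi(coder_a, coder_b), 2))  # 0.86
```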

That .86 can be interpreted as the agreement that has been achieved as a result of the category definitions and their diligent application by coders after a measure of possible chance agreement has been removed. Finally, Scott’s pi has an upper limit of 1.0 in the case of perfect agreement and a lower limit of −1.0 in the case of perfect disagreement. Figures around 0 indicate that chance is more likely governing coding decisions than the content analysis protocol definitions and their application. A number of other forms for assessing the impact of chance are available. Cohen (1960) developed kappa, which has the same formula as Scott’s pi: Kappa =

(P0 − Pe) / (1 − Pe)

Reliability  123 in which P0 = observed agreement Pe = expected agreement Kappa and pi differ, however, in the way expected agreement is calculated. Recall that Scott (1955) squared the observed proportions used for each value of a category assuming all coders are using those values equally. In other words, if 8 of 20 decisions were to select government (value 1) of a category, .4 is squared regardless of whether one of the coders used that value six times and the other only two. However, kappa uses expected agreement based on the proportion of a particular value of a category used by one coder multiplied by the proportion for that value used by the other coder. These proportions are then added for all the values of the category to get the expected agreement. In the example, one coder has used the value of 1 in 6 of 10 decisions (.6), and the second coder has used the value of 1 in 2 of 10 decisions (.2). Therefore, whereas pi yielded the expected value of .16 (.4 × .4), kappa yields an expected value of .12 (.6 × .2). Kappa will sometimes produce somewhat higher reliability figures than pi, especially when one value of a category is used much more often than others. For further explanation of kappa, see Cohen (1960). Kappa is used for nominal-level measures, and all disagreements are assumed to be equivalent. However, if disagreements vary in their seriousness (e.g., a psychiatrist reading a patient’s diary’s content concludes the person has a personality disorder when the person is really psychotic), then a weighted kappa (Cohen, 1968) has been developed. Krippendorff (1980) developed a coefficient, CAlpha, that is similar to Scott’s pi. Since he first introduced CAlpha, Krippendorff has created a family of alpha coefficients (Krippendorff & Craggs, 2016; Krippendorff, Mathet, Bouvry, & Widlöcher, 2016) that are useful under various circumstances. These will be discussed below. His original alpha is now represented with the addition of a subscript C (CAlpha). Krippendorff’s (1980) CAlpha is presented by the equation Alpha =

1 − (D0 / Dc)

in which D0 = observed disagreement Dc = expected disagreement The process of calculating D0 and Dc depends on the level of measurement (nominal, ordinal, interval, and ratio) used for the content variables. The difference between CAlpha and pi is that Krippendorff’s (1980) statistic

124 Reliability can be used with non-nominal data and with multiple coders. The CAlpha also corrects for small samples (Krippendorff, 2013), and some computer programs for CAlpha can accommodate missing data. When nominal variables with two coders and a large sample are used, CAlpha and pi are equal. For more details about CAlpha, see Krippendorff (2013). Krippendorff (2004b) stated that a reliability coefficient can be an adequate measure of reliability under three conditions. First, content to be checked for reliability requires two or more coders working independently applying the same instructions to the same content. Second, a coefficient treats coders as interchangeable and presumes nothing about the content other than that it is separated into units that can be classified. Third, a reliability coefficient must control for agreement due to chance. Krippendorff (2004b) pointed out that most formulas for reliability coefficients have similar numerators that subtract observed agreement from 1. However, they vary as to how the denominator (expected agreement) is calculated. Scott’s pi and Krippendorff’s (2004a) CAlpha are the same, except CAlpha adjusts the denominator for small sample bias, and Scott’s pi exceeds CAlpha by (1 − pi) / n, where n is the sample size. As n increases, the difference between pi and CAlpha approaches zero. Krippendorff (2004b) criticized Cohen’s kappa because expected disagreement is calculated by multiplying the proportion of a category value used by one coder by the proportion used by the other coder (as described above). Krippendorff said the expected disagreement is based, therefore, on the coders’ preferences, which violates the second and third of his three conditions. Krippendorff (2004b) concluded that Scott’s pi is acceptable with nominal data and large samples, although what qualifies as large was not defined. In situations in which data other than nominal measures are used, multiple coders are involved, and the samples are small, CAlpha is recommended. Researchers who want software to calculate reliability coefficients have options. In 2010, ReCal, a web-based software, was introduced that will calculate the four coefficients discussed above (Freelon, 2010). The service can be accessed at http://dfreelon.org/utils/recalfront/. Hayes (2005) developed a macro for use with SPSS to calculate CAlpha. A handbook for applying the macro was created by De Swert (2012). Pearson’s Product–Moment Correlation Although it is not a reliability coefficient, Pearson’s correlation coefficient (r) is sometimes used as a check for accuracy of measurement with interval- and ratio-level data. This statistic, which we explain more fully in Chapter 9, measures the degree to which two variables, or two coders in this case, vary together. Correlation coefficients can be used when coders are measuring space or minutes. With this usage, the coders become the variables, and the recording units are the cases. If, for example, two

Other Forms of Alpha

This discussion applies to coefficients used to calculate the reliability of variables that are predefined in the coding protocol. Some content may not have easily recognizable and discrete units that can be coded. Krippendorff et al. (2016) provided a variation of CAlpha (labeled UAlpha) that can be used to calculate a reliability coefficient when the units have not been predefined. The example they provide of this type of content is conversation among people. In these situations, the grammatical and syntactic structure found in written communication often breaks down, which can make unitizing difficult. The same argument might apply to some tweets and comments online. Unfortunately, the article does not explore the process by which such coding occurs. Researchers interested in examining unstructured data should consult this article and any related literature.

In addition, Krippendorff and Craggs (2016) introduced an alpha (MVAlpha) for calculating reliability when variables are applied to units with multiple values. The exact nature of these multivalued variables is not clear. At one point, they write, "Literary theory has argued for some time that all texts have multiple interpretations" (Krippendorff & Craggs, 2016, p. 182). However, later in the article, the discussion centers on the presence of two words, "relieved and ashamed" (p. 185), as representing the emotional state of a text. These two examples are not necessarily the same type of coding decision. In addition, the article does not address how the coding of multivalued units differs from that of single-valued variables. Because units are defined by the variables in a protocol, it seems unnecessary to have units with multiple values. This issue is discussed in Chapter 4.

Controversy about Reliability Coefficients

Recent years have seen a growing debate about which of the many reliability coefficients is most appropriate as an omnibus coefficient for estimating reliability (Feng & Zhao, 2016; Grant, Button, & Snook, 2017; Gwet, 2008, 2014; Hayes & Krippendorff, 2007; Krippendorff,

2012, 2016; Nili et al., 2017; Quarfoot & Levine, 2016; Zhao, 2012; Zhao, Feng, Liu, & Deng, 2018; Zhao, Liu, & Deng, 2012). The details of this debate are more extensive than can be discussed in the space available here. However, the debate concentrates on the process of calculating expected agreement using the concept of "chance agreement" and on the assumptions underlying the calculation of some of the coefficients. The pi, kappa, and CAlpha coefficients have been criticized because they can produce low coefficients despite high percentages of agreements among coders (Gwet, 2008; Krippendorff, 2011; Lombard, Snyder-Duch, & Bracken, 2004; Potter & Levine-Donnerstein, 1999; Zhao, 2012; Zhao et al., 2012) and because they conservatively assume—and correct for—a maximum level of chance agreement (Zhao, 2012; Zhao et al., 2012). More recently, the debate has evolved into arguments about the definition of reliability (Feng & Zhao, 2016; Krippendorff, 2016; Zhao et al., 2018). The debate often seems to bog down and miss the larger picture of reliability and the corresponding role of reliability coefficients. Should scholars seek to examine the debate, we offer the following observations as guidance:

• Concern with reliability should not be focused solely on whether coders of content are consistent in their agreements and disagreements. As mentioned previously, reliability reflects the application of a well-developed protocol by trained coders to interesting content. By definition, replication of a research project should use the original protocol, even if it is modified, but the coders will likely be different. Therefore, reliability must be more about the protocol and training than about agreement and disagreement among a certain set of coders.

• Generating reliable data is not the ultimate aim of developing a protocol. The goal is to generate valid data, and adequate reliability is simply a threshold to help establish data validity. The higher the reliability, the more likely the data are to be valid. This is certainly true of automated textual analysis.

• The goal of protocol development and coder training should not be to generate data that will be publishable, but to reach the highest level of validity possible.

• Discussing a coefficient in isolation from the acceptable level of that coefficient adds little. Most discussions of these coefficients ignore what level should be obtained to have confidence in the validity of the data. Indeed, different coefficients might have different minimal acceptable values. To argue that one coefficient is better or worse than another because it has a higher or lower value sounds like the argument in the movie This Is Spinal Tap that an amplifier with a maximum of 11 on its dials is better than one with a maximum of 10. We need research that explores the relationship between reliability coefficients and the various forms of validity discussed in Chapter 7.


• Reviewers and researchers should consider the stage of research in a given theoretical area when evaluating protocol reliability. The more advanced the theory and research in an area, the higher the acceptable levels of agreement should be. The false-positive findings for relationships that occur in newly emerging research areas become less acceptable as those areas become better understood. Using data with lower reliability, and therefore lower validity, can result in the acceptance of relationships that do not predict social phenomena well.

• The term "chance" agreement is used when calculating expected agreement for these coefficients, but with adequate training and a well-developed protocol, very few of the agreements among coders occur by chance. If coders do "guess," their guesses are likely educated rather than random. Error agreements (those not due to protocol and training) are a form of measurement error that cannot be accurately identified. Because of this, scholars use probabilities based on coders' decisions to calculate expected agreement. How those probabilities are calculated depends on the assumptions underlying the coefficients and varies from coefficient to coefficient.

One central element of the debate noted earlier is the occurrence of high simple agreement with low reliability coefficients, which was pointed out almost 40 years ago (Kraemer, 1979). Potter and Levine-Donnerstein (1999) have discussed the same phenomenon occurring with Scott's pi. They note that with a two-valued measure (e.g., is a terms-of-use agreement link clearly visible on a website's front page—yes or no?), the frequent occurrence of one value and the infrequent occurrence of the other creates an imbalance (e.g., 97% of sites have the agreement link while 3% do not), which is "overcorrected" in the chance agreement component of Scott's pi. Other scholars have recently joined the discussion, noting that this phenomenon occurs with kappa (Gwet, 2008) and with kappa, pi, and CAlpha (Zhao et al., 2012).

Gwet (2008) addressed the high agreement/low reliability phenomenon found with pi and kappa by developing a new coefficient called AC1. He divided agreements and disagreements into groups based on four conditions for two coders and two categories: (1) both coders assign values based on the correct application of the protocol; (2) both assign based on randomness; and for (3) and (4), one coder assigns randomly and the other assigns based on the correct application of the protocol. He argued that kappa and pi assume only two types of agreement—one based on the correct application of the protocol and one based on randomness—which ignores the other two conditions. He conducted a Monte Carlo study with data from psychology and psychiatry, and concluded that AC1 produces lower variances than pi and kappa and provides a better estimate of expected agreement.

Because AC1 applies only to nominal data, Gwet developed the AC2 coefficient (Gwet, 2014), which can be used with ordinal and interval data.

Another effort to deal with the high agreement/low reliability phenomenon in communication studies came from Zhao (2012), who criticized kappa, pi, and CAlpha because they depend on the distribution of agreements and the number of categories to calculate chance agreement. Zhao argues that it is coding difficulty that determines chance agreement, not the distribution of agreements and categories. He therefore developed alpha1, which calculates chance agreement based on disagreements. His paper also includes a Monte Carlo study based on human coders and concludes that the actual level of agreement correlated highest with his new coefficient (alpha1), followed by percentage of agreement. He reported that Gwet's AC1, kappa, pi, and CAlpha correlated with actual level of agreement at only .559. The "behavior Monte Carlo" study is limited, however, because it used coding of visual representations rather than coding of symbols using a protocol, as found in communication content analysis. In other words, the test did not include a protocol with training. Also, coding difficulty is a function of protocol development and training and cannot itself be measured. The future of alpha1 is unclear.

Krippendorff (2011, 2012, 2016) has responded to criticisms of CAlpha by saying that variables that do not vary are not useful and that an uneven distribution could represent an inadequate reliability sample. These points may be valid in some situations, but there are situations in which the population distribution actually is extremely uneven among categories. Krippendorff's (2016) observation about the need for variance is true for social science that studies relationships. However, content analysis is also used to describe content without examining relationships. As mentioned in Chapter 1, some content analyses describe content in an effort to compare it with some external standard. Historically, for example, the percentage of television characters who are people of color is small (Mastro & Greenberg, 2000). Similar situations have been found in television representation of the elderly (Signorielli, 2001). A content analysis of clinical studies (Davison et al., 2016) to discover how often fathers are present in these studies found two variables with levels of simple agreement above 90% but CAlphas below .7. This issue reflected conditions in the population itself. Certainly, the study of representation in media and research is worthwhile, and the resulting distribution among categories may be skewed. In such situations, kappa, pi, and CAlpha could be considerably lower than the percentage of agreement. It would not matter how large the reliability sample is because the population itself has the same "one-sided" distribution. In some situations, even if the coding has only a small amount of error, the reliability coefficient will report reproducibility at a lower level than it actually is.
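A minimal sketch of the phenomenon follows, using hypothetical data that roughly mirror the skewed terms-of-use example above. The chance-agreement formula used for AC1 follows Gwet (2008); the specific counts are invented for illustration.

```python
# A minimal sketch of high simple agreement with a low chance-corrected
# coefficient, using a hypothetical reliability check of 100 websites coded
# yes/no for a visible terms-of-use link (marginals deliberately skewed).

from collections import Counter

# 94 joint "yes", 2 joint "no", and 4 split decisions (hypothetical data).
coder_a = ["yes"] * 94 + ["no"] * 2 + ["yes", "yes", "no", "no"]
coder_b = ["yes"] * 94 + ["no"] * 2 + ["no", "no", "yes", "yes"]
n = len(coder_a)

p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Scott's pi: expected agreement from the pooled marginals.
pooled = Counter(coder_a + coder_b)
p_e_pi = sum((c / (2 * n)) ** 2 for c in pooled.values())
pi = (p_o - p_e_pi) / (1 - p_e_pi)

# Gwet's AC1: chance agreement from average category proportions, scaled by
# 1/(Q - 1) over the Q categories (Gwet, 2008).
count_a, count_b = Counter(coder_a), Counter(coder_b)
avg_prop = {q: (count_a[q] + count_b[q]) / (2 * n) for q in pooled}
p_e_ac1 = sum(p * (1 - p) for p in avg_prop.values()) / (len(pooled) - 1)
ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)

print(f"simple agreement = {p_o:.2f}")   # about .96
print(f"Scott's pi       = {pi:.2f}")    # far lower, despite near-total agreement
print(f"Gwet's AC1       = {ac1:.2f}")   # stays close to the simple agreement
```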

Selecting a Reliability Coefficient

The discussion about the limits and advantages of various reliability coefficients is a natural part of social science. However, any decision to move away from current practices should be based on both empirical and theoretical grounds. Currently, little empirical research has examined the connection between levels of reliability and valid conclusions about data. How reliable do data need to be in order to yield valid conclusions about relationships in those data? Nor is there much empirical research that uses actual content analysis data to examine the implications of using various coefficients. Scholars should be encouraged to engage in these types of research. Until the issue of whether an omnibus coefficient is appropriate, or even possible, and what that coefficient would be, is resolved, we offer the following recommendations for reporting reliability coefficients:

1 Report a reliability coefficient that corrects for "chance" agreements in calculating expected agreement for each variable in the study. Replication requires this. An average or "overall" measure of reliability can hide weak variables, and not every study will want to use every variable from previous studies.

2 Use either a census or a probability sample to calculate reliability coefficients, and report the sampling error for the reliability coefficients if a probability sample is used (one way of estimating that error is sketched after this list).

3 If the data are not skewed, report Krippendorff's CAlpha and the simple level of agreement (Lacy et al., 2015).

4 An important question is: What is an acceptable level for a given coefficient? Krippendorff (2004a) suggested that a CAlpha of .8 indicates adequate reliability. However, Krippendorff (2004a) also wrote that variables with CAlphas as low as .667 could be acceptable for drawing tentative conclusions. For established areas of research, we suggest a variable's reliability should reach at least the .8 level, and higher is better. In newer areas of research, variables in the .7 to .8 range could be reported, but the researcher needs to justify the use of such variables by addressing the issue of validity and scholarly importance.

5 If the data show a high level of simple agreement but the reliability coefficients do not reach the levels above, report Gwet's AC2 as the reliability coefficient (Lacy et al., 2015) and explain why it is being used.

6 If the scholar is uncomfortable relying on CAlpha alone, he or she should report CAlpha along with other coefficients in the article and explain why multiple coefficients are included. The authors should also state their acceptable level for each type of reliability coefficient. Data that meet the requirement for multiple coefficients have an even stronger argument for their validity.

7 Adhering to these rules will require that an adequate sample size be randomly selected. All of the categories for each variable should appear in the sample. If they do not, the sample size should be increased.

Equally important to establishing reliability for variables in a given protocol is the establishment of variable reliability across time. Social science advances through improved measurement, and improved measurement requires consistent reliability. If scholars aim to standardize protocols for commonly used variables, the reliability of these protocols will increase over time.
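As promised in recommendation 2, here is a minimal sketch of one way to attach sampling error to a reliability estimate: a nonparametric bootstrap over the units in the reliability sample. The data and the choice of Scott's pi are hypothetical illustrations, not a procedure prescribed by this text.

```python
# A minimal sketch: bootstrap a confidence interval for a reliability
# coefficient by resampling the coded units. Data are hypothetical.

import random
from collections import Counter

def scotts_pi(pairs):
    n = len(pairs)
    p_o = sum(a == b for a, b in pairs) / n
    pooled = Counter(v for pair in pairs for v in pair)
    p_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (p_o - p_e) / (1 - p_e)

random.seed(1)
# Hypothetical reliability sample: 60 units, three nominal categories,
# 48 agreements plus 12 independently (randomly) coded units.
pairs = [(random.choice("ABC"),) * 2 for _ in range(48)]
pairs += [(random.choice("ABC"), random.choice("ABC")) for _ in range(12)]

point_estimate = scotts_pi(pairs)
boot = []
for _ in range(2000):
    resample = [random.choice(pairs) for _ in pairs]   # resample units with replacement
    boot.append(scotts_pi(resample))
boot.sort()
low, high = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"pi = {point_estimate:.2f}, 95% bootstrap CI [{low:.2f}, {high:.2f}]")
```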

Summary

Conducting and reporting reliability assessments of the content analysis process is a necessity, not a choice. However, this has not always been the case. A study of 581 content analyses from 1985 to 2010 in Communication Monographs, the Journal of Communication, and Journalism & Mass Communication Quarterly (Lovejoy et al., 2014) found that 24% of the articles did not report any information about reliability and only 49% included a reliability coefficient that took chance into consideration. Reporting on reliability has improved across time, but there was little improvement in the use of a census or probability sample for reliability or in a transparent description of how the reliability sample was selected. Another study reported on the use of reliability coefficients for the same three journals from 1985 to 2014 (Lovejoy et al., 2016). There was general improvement across the 30 years in the Journal of Communication and Journalism & Mass Communication Quarterly, but not in Communication Monographs. Overall, during the 2010–2014 period, 23.4% of all articles reported no reliability coefficient; during this period 28.8% of the articles reported CAlpha, 25.2% reported kappa, and 16.2% reported pi.

Moreover, full information on the content analysis should be disclosed or at least made available for other researchers to examine or use. A full report on content analysis reliability would include protocol definitions and procedures. Because space in journals is limited, the protocol should be made available by study authors on request. Furthermore, information on the training of judges, the number of content items tested, and how they were selected should be included in the article. At a minimum, the specific coder reliability tests applied and the achieved numeric reliability, along with confidence intervals, should be included for each variable in the published research. The editor of Journalism & Mass Communication Quarterly invites authors of content analysis studies to submit their protocol when an article is accepted. Collecting and sharing protocols will help scholars better understand previous studies and improve measurement.

In applying reliability tests, researchers should randomly select a sufficient number of units for the tests, decide whether each variable reaches acceptable reliability levels based on coefficients that take error agreements into consideration, and report simple agreement in an endnote to assist in the replication of the study. Failure to systematize and report the procedures used, as well as to assess and report reliability, virtually invalidates whatever usefulness a content study may have for building a coherent body of research. Students must be taught the importance of assessing and reporting content analysis reliability. Journals that publish content studies should insist on such assessments.

7 Validity

When we introduced the definition of quantitative content analysis in Chapter 1, it was noted that if the categories and rules are conceptually and theoretically sound and are reliably applied, the chance increases that the study results will be valid. The focus of Chapter 6 on reliability leads easily to one of the possible correlates of reliable measurement: valid measurement. What does the term valid mean? “I’d say that’s a valid point,” one person might respond to an argument offered by another. In this everyday context, validity can relate in at least two ways to the process by which one knows things with some degree of certainty. First, valid can mean the speaker’s argument refers to some fact or evidence (e.g., that the national debt in 2018 topped $21 trillion). A reference to some fact suggests, of course, that the fact is part of objective reality. Second, valid can mean the speaker’s logic is persuasive, because observation of facts leads to similar plausible inferences or deductions from them. The social science notion of validity relates to both these everyday ways in which we make inferences and interpretations of our reality. Social science does this in two ways. First, social science breaks up reality into conceptually distinct parts that we believe actually exist, and that have observable indicators of their existence. And second, social science operates with logic and properly collected observations to connect those concepts in ways that help us to predict, explain, and potentially control that reality. Content analysis, then, must also incorporate these two processes in the way this method illuminates reality. First, scholars must address how a concept they have defined about some part of communication reality actually exists in that reality. And even if this is true, researchers must address how their category measurement of that communication concept is an appropriate one. If we are mistaken about that communication part, or if we are measuring it wrongly, then our predictions about the communication process fail. But even with good concepts and measures of communication reality, we then have the second problem of validly linking those concepts

through data collection and analysis methods that have the highest chance of producing successful predictions. So, our second validity problem focuses on these "linking processes" that tie together our concepts, our measurements of them, our observations of their interconnections, and our predictions about their future states. Although all that may sound formidable enough, philosophers of science urge humility even when our successful predictions suggest we have a good handle on understanding and measuring reality. Bertrand Russell illustrated this with a homely story of the chicken that day after day was well fed and watered and cared for in every way by a farmer. That chicken had a very solid record of very good predictions about these interconnected events, suggesting a very good understanding of reality. Until, of course, the day the farmer showed up with his axe. That chicken did not even know, much less understand, the larger context of which it was a part. In the sections of this chapter that follow, we first deal with the validity of the concepts in our theories and then with the validity of the observational processes we use to link those concepts. Finally, taking a lesson from the experience of Bertrand Russell's chicken, we address a wider context that we call "social validity" in which we ask how a scientifically valid content analysis relates to the wider communication world experienced by people.

The Problem of Measurement Reliability and Validity

Content analysis studies the reality of communication in our world. It does so through the creation of reliable and valid categories making up the variables we describe and relate to one another in hypotheses or models of the communication process. As we've emphasized in earlier chapters, we must operationally define content categories for the terms in these hypotheses and questions. For example, several content analysis studies have addressed the concept of news quality (Lacy & Rosenstiel, 2015). However, some scholars have raised concerns about measuring "quality" (Bogart, 2004). Just what on earth can "quality" be in news content? Who says measures such as those are good measures of quality? Is quality, like beauty, in the eye of the beholder?

The answer to that third question is, of course, yes: sometimes quality is in the eye (or mind) of the beholder. That question also nicely illustrates the validity problem in content analysis measurement. Communication is not simply about the occurrence or frequency of communication elements. It is also about the meanings of all the words, expressions, gestures, and so on that we use to communicate in life. So, when we ask about the content analysis validity of our measurement of something such as news "quality," we frequently have an operational definition that has reduced ambiguity in the measurement of communication reality rather than

genuinely apprehended that reality. We ought not to assume that such ambiguities can always be resolved, but resolving ambiguity can often be accomplished by connecting content measurements to previous research.

This becomes a critical problem because of efforts to achieve reliability in our content category measures. Measurement reliability is a necessary but not sufficient condition for measurement validity. A measure can be reliable in its application but wrong in what researchers assume it is really measuring. A valid measure must be both reliable in its application and valid for what it measures. A special problem in content analysis may occur because reliable measurement can come at the expense of valid measurement. Specifically, in order to get high levels of coder agreement on the existence or state of a content variable, the operational definition may have only a tenuous connection with the concept of interest. Much of the concern with computer content analysis (dealt with in Chapter 3) is that the validity of concepts is compromised by the focus on keywords absent any context that gives them meaning. Part of the solution to this problem is multiple measures of the concept of interest, as has been done with the concept of news quality (Lacy & Rosenstiel, 2015). But ultimately, content analysts must ask the most consequential question of their variables: Do they validly assess something meaningful beyond their utility in the particular study? Sometimes the validity of particular measures used in a content study may have been assessed in the broader research stream of which the particular study is a part. Too often, however, research will discuss the reliability of measures at length and ignore or assume the validity of those measures.
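The following is a minimal sketch of the multiple-measures idea just mentioned. The three "news quality" indicators, their names, and their values are invented for illustration; the point is only that several related indicators can be standardized, combined into an index, and checked against one another with correlations.

```python
# A minimal sketch of combining multiple hypothetical indicators of one concept
# (invented "news quality" indicators for six outlets) into a composite index.

import statistics
from itertools import combinations

indicators = {                       # all names and values are hypothetical
    "staff_bylines_per_issue": [12, 8, 15, 5, 10, 14],
    "avg_sources_per_story":   [3.1, 2.2, 3.8, 1.9, 2.7, 3.5],
    "share_local_enterprise":  [0.40, 0.25, 0.55, 0.20, 0.35, 0.50],
}

def zscores(values):
    m, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - m) / sd for v in values]

def pearson_r(x, y):
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)

# Composite index: the mean of the standardized indicators for each outlet.
standardized = [zscores(v) for v in indicators.values()]
index = [statistics.mean(col) for col in zip(*standardized)]
print("composite quality index by outlet:", [round(x, 2) for x in index])

# If the indicators tap the same concept, they should correlate with one another.
for (name_a, a), (name_b, b) in combinations(indicators.items(), 2):
    print(f"r({name_a}, {name_b}) = {pearson_r(a, b):.2f}")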

Tests of Measurement Validity

Analysts such as Holsti (1969) and Krippendorff (2013) have discussed validity assessment at length. In particular, Holsti's familiar typology identifies four tests of measurement validity—face, concurrent, predictive, and construct—that apply to the operational terms we use in our hypotheses and questions. These same tests of validity apply to measures, constructs, and even relationships.

Face Validity

The most common validity test used in content analysis, and certainly the minimum requirement, is face validity. Basically, the researcher makes a persuasive argument that a particular measure of a concept makes sense on its face. Coding a president's State of the Union addresses for references to public issues would, on the face of it, indicate the administration's change in its policy agenda over time. In essence, the researcher assumes that the adequacy of a measure is obvious to all and requires little additional explanation. Relying on face validity can sometimes

be appropriate when agreement on a measure is high among relevant researchers. Using measures from previous studies will enhance face validity because other researchers have successfully argued for the connection between a measure and the underlying concept. However, assuming the face validity of measures is sometimes chancy, especially in broader contexts, because concepts have latent meanings, and various groups may use the same concept, such as fairness, in different ways. For example, one of the authors of this text participated in a study assessing fairness and balance in reporting a state political race (Fico & Cote, 1997). The measure of such fairness and balance—equivalent treatment of the candidates in terms of story space and prominence—was consistent with invocations of "balance" and "nonpartisanship" in codes of ethics. These measures were subsequently discussed with seven reporters who wrote the largest number of campaign stories. None agreed with the research definition. That is not to say that either the research definition or the professional definition was wrong per se. However, what seems obvious on its face sometimes is not obvious to all.

Concurrent Validity

Face validity can and should be strengthened for purposes of inference. One of the best techniques is to correlate the measure used in one study with a similar one used in another study. Concurrent validity can also be established with two different methods and measures yielding the same conclusion. In effect, the two methods can provide mutual or concurrent validation. In the early years of the 21st century, as legacy media faltered, the question arose as to whether citizen journalism could replace legacy news outlets by fulfilling the same functions as those outlets. Citizen journalism is the output of news by nonprofessionals (Wall, 2015). A content analysis of 64 news sites and blogs in 15 randomly selected U.S. cities (Lacy, Riffe, Thorson, & Duffy, 2009) found that most sites were blogs (opinion) and not news sites. Five of the 15 cities had no news sites, and after one city was removed the mean for the other nine cities was less than one news site per city. These data were collected in 2007. A survey of 104 city government reporters in 38 states that same year (St. Cyr, Carpenter, & Lacy, 2010) found that the mean number of citizen news sites covering city government equaled .57. The measures of citizen journalism sites through content analysis and through survey were consistent and provide concurrent validity for the two measures.

Predictive Validity

A test of predictive validity correlates a measure with some predicted outcome. If the outcome occurs as expected, our confidence in the

validity of the measure is increased. More specifically, if a hypothesized prediction is borne out, then our confidence in the validity of the measures making up the operational definitions of concepts in the hypothesis is strengthened. The classic example cited by Holsti (1969, p. 144) concerns a study of suicide notes left in real suicides and a companion sample from non-suicides. Real notes were used to put together a linguistic model predicting suicide. Based on this model, coders successfully classified notes from real suicides, thereby validating the predictive power of the content model. The study that examined the level of citizen journalism in 2007 (Lacy et al., 2009) was expanded to include citizen journalism websites, blog sites, and commercial websites in 46 markets in 2009 (Lacy et al., 2010). The measures used by Lacy et al. (2009) were included, and the conclusions of the expanded study were consistent with those of the earlier one (Lacy et al., 2010).

Construct Validity

Construct validity involves the relation of an abstract concept to the observable measures that presumably indicate the concept's existence and change. The underlying notion is that a construct exists but is not directly observable except through one or more measures. Therefore, some change in the underlying abstract concept will cause observable change in the measures. Statistical tests of construct validity assess whether the measures relate only to that concept and to no other concept (Hunter & Gerbing, 1982). If construct validity of measures exists, then any change in the measures and the relation of the measures to one another is entirely a function of their relation to the underlying concept. If construct validity does not exist, then measures may change because of their relation to some other, unknown concepts. In other words, construct validity enables the researcher to be confident that when the measures vary, only the concept of interest is actually varying. Put another way, the issue of construct validity involves whether measures "behave" as theory predicts and only as theory predicts. Wimmer and Dominick (2003) wrote that:

Construct validity is present if the measurement instrument under consideration does not relate to other variables when there is no theoretical reason to expect such a relationship. Therefore, if an investigator finds a relationship between a measure and other variables that is predicted by theory and fails to find other relationships that are not predicted by theory, there is evidence for construct validity. (p. 60)

Construct validity must exist if a research program in a field such as mass communication is to build a cumulative body of scientific knowledge

across a multitude of studies. Common constructs used across studies help bring coherence and a common focus to a body of research. Valid constructs also make for more efficient research, enabling researchers to take the next step in extending or applying theory without needing to duplicate earlier work. Some of the news quality studies (Lacy & Rosenstiel, 2015), for instance, use the "financial commitment" (Lacy, 1992) construct, which in turn is related to broader economic theory. Few studies in our field, however, validate measures this way.
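A minimal sketch of the construct-validity logic described in the Wimmer and Dominick passage follows. The outlet-level data and variable names are hypothetical: a content measure should correlate with a variable theory links it to and show little relationship with a theoretically unrelated one.

```python
# A minimal sketch of convergent vs. discriminant evidence for construct
# validity, with hypothetical outlet-level data.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical data for eight outlets.
financial_commitment = [2.1, 3.4, 1.2, 4.0, 2.8, 3.1, 1.8, 3.7]   # newsroom-investment index
news_quality_score   = [55, 70, 40, 82, 64, 68, 48, 77]           # content-based quality measure
newsprint_price_paid = [604, 601, 598, 599, 595, 598, 601, 604]   # theoretically unrelated variable

# Theory-predicted relationship: expected to be strong.
print(f"r(quality, financial commitment) = {pearson_r(news_quality_score, financial_commitment):.2f}")
# No theoretical reason to expect a relationship: expected to be near zero.
print(f"r(quality, newsprint price)      = {pearson_r(news_quality_score, newsprint_price_paid):.2f}")
```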

Validity in Observational Process

Given that we have enough confidence in the validity of our concept measures (variables) to use them to address hypotheses and research questions, the question then becomes how we link them in a way that validly describes social reality. Every social science method has a set of procedures meant to ensure that observations are made in ways that minimize human biases in how such reality is perceived. In survey research, such procedures include random sampling to make valid inferences from characteristics in a sample to characteristics in a population. In an experiment, such procedures include randomly assigning subjects to create equivalent control and experimental treatment groups, thereby permitting a logical inference that only the treatment variable administered to the experimental group could have caused some effect. In our previous chapter on reliability, we discussed how content analysis uses protocol definitions and tests for chance agreement to minimize the influence of human biases and chance on coding defined and measured variables. Valid application of these procedures strengthens one's confidence in what the survey, experiment, or content analysis has found. But given science's largest goal, the prediction, explanation, and potential control of phenomena, how does content analysis achieve validity?

Internal and External Validity

Experimental method provides some ways of thinking about validity in the research process that can be related to content analysis. In their assessment of experimental method in educational research, Campbell and Stanley (1963) made the distinction between an experimental research design's internal and external validity. By internal validity, Campbell and Stanley meant the ability of an experiment to illuminate valid causal relationships. An experiment does this by the use of controls to rule out other possible sources of influence and rival explanations. By external validity, Campbell and Stanley meant the broader relevance of an experiment's findings to the vastly more complex and dynamic pattern of causal relations in the world. An experiment may increase external validity by incorporating "naturalistic settings" into the design. This permits

assessment of whether causal relations observed in the laboratory are in fact of much importance relative to other influences operating in the world. However, by their nature, laboratory experiments cannot entirely replicate naturalistic settings.

These notions of internal and external validity in experimental design are also useful for thinking about content analysis validity. A first and obvious observation is that content analysis, used alone, cannot possess internal (causal) validity in the sense that Campbell and Stanley (1963) used the term, because it cannot control all known and unknown "third variables." Inferring causal relations requires knowledge of the time order in which cause and effect operate, knowledge of their joint variation, control over the influence of other variables, and a rationale explaining the presumed cause–effect relationship. However, content analysis can incorporate other research procedures that strengthen the ability to make such causal inferences in the building of theory. For instance, if some content is thought to produce a particular effect on audiences, content analysis could be paired with survey research designs to explore that relationship, as in the agenda-setting and cultivation studies we described in Chapter 1. We discuss some of these designs in more detail in the following sections.

Content analysis can be a very strong research technique in terms of the external validity or generalizability of research that Campbell and Stanley (1963) discussed. This, of course, will depend on such factors as whether a census or appropriate sample of the content has been collected. However, the notion of external validity can also be related to what we call a study's social validity. This social validity will depend on the social significance of the content that content analysis can explore and the degree to which the content analysis categories created by researchers have relevance and meaning beyond an academic audience. We explore some of these issues in the following pages. Slater (2013) provides an extensive discussion of how content analysis can fit into a broader program of research. He discusses the use of content analysis with surveys and how content analysis can be used as a foundation for experiments. Underlying his discussion is the realization that the creation, testing, and support of social science theory is a process that involves multiple methods.

Figure 7.1 summarizes several types of validity. Note first that internal validity deals with the design governing data collection and how designs may strengthen causal inference. Data collection also requires assessment of measurement validity consisting of face, concurrent, predictive, and construct validity. Statistical validity is a subset of internal validity that deals with measurement decisions and the assumptions about data required by particular statistical analyses. Finally, the external and social validity of a content analysis presupposes the internal validity of measurement and design that makes content analysis a part of scientific method.

Figure 7.1 Types of content analysis validity (diagram elements: internal validity: time order, control, correlation; measurement validity: face, concurrent, predictive, construct; statistical validity; external and social validity: scientific validation, nature of content, nature of sample, nature of categories)

However, the notion of external and social validity used here goes beyond those qualities to assess the social importance and meaning of the content being explored. The overall validity of a study therefore depends on a number of interrelated factors we discuss in the following section.

Internal Validity and Design

Content analysis by itself can best illuminate patterns, regularities, or variable relations in some content of interest. Content analysis alone cannot establish the antecedent causes producing those patterns in the content, nor can it explain as causal the subsequent effects that content produces in some social system. Of course, the analyst may make logical inferences to antecedent causes or subsequent effects of such content, as we discussed in Chapter 1, with the model showing the centrality of content to communication processes and effects. Also, certain research designs pairing content analysis with other methods strengthen the ability to infer such causal relationships, thereby enhancing internal validity. In one way or another, then, content analysis designs should address issues of control, time order, and correlation of variables included in a causal model.

Control in Content Analysis

Designs that attempt to explain patterns of content must look to information outside the content of interest. This requires a theoretical or hypothesized model, including the kinds of factors that may influence content. In other words, this model is assumed to control for other sources of influence by bringing them into the analysis. The model itself is derived from theory, previous research, or hunch. Consider a

simple example. A researcher, noting the collapse of an authoritarian regime, predicts rapid growth of new news ventures as alternative political voices seek audiences, even as existing news outlets "open up" to previously taboo topics. Two problems need to be emphasized. First, however plausible a model may be, there is always something—usually a lot of things—that is left out and that may be crucial. However, the second problem is what makes the first one so interesting: such a model can always be empirically tested to assess how well it works to explain patterns in content. If the model does not work as planned, interesting and engaging work to figure out a better model can be undertaken. Unimportant variables can be dropped from the model and other theoretically interesting ones added. In this example, failure to find new news outlets might simply reflect limited access to facilities and equipment rather than a lack of dissent. Also, failure to find criticism of former party programs in existing news outlets might simply indicate residual citizen distrust of journalists who were viewed for years as political party tools. A well-thought-out model that identifies relevant concepts in the causal process is essential to introducing control variables in content analysis studies. The model should include specifying mediating and moderating constructs.

Time Order in Content Analysis

Furthermore, in designing such a model, the researcher must incorporate the time element for these presumed influences into the design, as we note in Chapter 8. Such incorporation may be empirical—data on the presumed cause occurs and is collected and measured before the content it presumably influences. For example, studies have examined the relationship between newspaper quality at time 1 and circulation at time 2 (Lacy & Fico, 1991; St. Cyr, Lacy, & Guzman-Ortega, 2005). Such studies also require statistical controls to eliminate competing explanations for the relationship being studied. Incorporation of the time element may also be assumed from the logic of the design. For example, newspaper content quality and population in a city can be measured at time 1 to predict circulation at time 2. Clearly, the logic of this design rules out reverse influences, such as circulation at time 2 affecting quality or population at time 1. Similarly, Lovejoy et al. (2016) studied three communication journals over 30 years to examine variations among the journals in reporting reliability statistics. The use of reliability coefficients in articles changed over time. Of course, the use of coefficients could not explain the passage of time, so the predictive relationship must exist in the other direction. Obviously, exploring the effects of content is the converse of the situation just discussed. Here, time must also be part of the design, but other methods to assess effect are mandatory as well.
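Before turning to content effects, here is a minimal sketch of the antecedent, time-ordered design just described: quality and population measured at time 1 used to predict circulation at time 2, with population serving as the statistical control. All the city-level numbers are hypothetical, and the specification is an illustration of the logic rather than the models used in the studies cited above.

```python
# A minimal sketch of a lagged design with a control variable, using numpy's
# least-squares routine and hypothetical city-level data.

import numpy as np

quality_t1     = np.array([3.2, 2.1, 4.5, 1.8, 3.9, 2.7, 4.1, 2.4])   # content quality, time 1
population_t1  = np.array([120, 80, 300, 60, 250, 110, 280, 90])      # thousands, time 1 (control)
circulation_t2 = np.array([34, 18, 95, 12, 78, 27, 88, 21])           # thousands, time 2 (outcome)

# Design matrix: intercept, quality at t1, population at t1.
X = np.column_stack([np.ones_like(quality_t1), quality_t1, population_t1])
coefs, *_ = np.linalg.lstsq(X, circulation_t2, rcond=None)

print(f"intercept = {coefs[0]:.2f}")
print(f"quality_t1 coefficient (controlling for population) = {coefs[1]:.2f}")
print(f"population_t1 coefficient = {coefs[2]:.3f}")
```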

The logic of the design for content affecting behavior and attitudes may be sufficient for controlling time order. For example, the Great American Values Test (Ball-Rokeach, Rokeach, & Grube, 1984) used a field experiment to measure the relationship between a specific television program about American values, which the researchers produced, and changes in attitude and behavior related to these values. They measured attitudes before and after the communities saw the program to control time order. They used a similar community, which did not receive the program, as a control. Perhaps the most frequent multi-method example of content analysis research that assesses effect is the agenda-setting research we described in Chapter 1. This line of research explores whether differences in news media coverage frequency of various topics at time 1 create a similar subsequent ordering of importance among news consumers at time 2. Of course, the possibilities that the news priorities of consumers really influence the media or that both influence one another must be taken into account in the design.

Causation in Content Analysis

Establishing a causal relationship requires specification of time order, control, and demonstration of joint variation or correlation. Time order is essentially built into the design. If content is influenced by an antecedent variable, then the antecedent variable must come first. If content influences an individual, the content must be created and accessed by the individual. The requirements of control and correlation in causality among variables are established statistically. Assumptions about independent and dependent variable influences must be explicit in any kind of multivariate analysis. Further, the analysis must consider direct and indirect causal flows, and whether any variables moderate or mediate the nature of the model's relationships. We discuss statistics used for analyzing content data in Chapter 9. These techniques range from simple correlation measures for relating two variables, to multivariate techniques enabling the analysis to more fully control and assess the effects of multiple variables. Different statistics have different assumptions that must be considered. The specific techniques that can be employed will also depend on the level at which variables have been measured. Furthermore, if content data have been randomly sampled, tests of statistical significance must be employed for valid inferences to content populations. These issues relate to the statistical validity of the analysis of content.
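A minimal sketch of the agenda-setting comparison mentioned above follows: Spearman's rank-order correlation between the media agenda (coverage ranks at time 1) and the public agenda (importance ranks at time 2). The issues and their ranks are hypothetical.

```python
# A minimal sketch: Spearman's rho between a media agenda and a public agenda,
# using hypothetical issue rankings with no tied ranks.

issues = ["economy", "crime", "environment", "health", "education", "immigration"]
media_rank_t1  = [1, 3, 5, 2, 6, 4]   # rank of coverage frequency in the content analysis
public_rank_t2 = [1, 2, 6, 3, 5, 4]   # rank of "most important problem" in a later survey

n = len(issues)
d_squared = sum((m - p) ** 2 for m, p in zip(media_rank_t1, public_rank_t2))
rho = 1 - (6 * d_squared) / (n * (n ** 2 - 1))

print(f"Spearman's rho between media and public agendas: {rho:.2f}")
```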

External Validity and Meaning in Content Analysis

A study may have strong internal validity in the senses discussed above. But study findings may be so circumscribed by theoretical or methodological

considerations that they have little or no relevance beyond the research community to which they are meaningful. Certainly, any research should first be a collective communication among a group of researchers. Isaac Newton summed this up in his often-quoted saying, "If I have seen farther it is by standing on the shoulders of giants" (Oxford University, 1979, p. 362). The researcher interacts professionally within a community of scientists. But the researcher is also part of the larger society, interacting with it in a variety of roles such as parent, neighbor, or citizen. In these broader researcher roles, the notion of validity can also have a social dimension that relates to how such knowledge is understood, valued, or used. Successful communication requires the exchange of knowledge that is meaningful to both the message sender and receiver. This meaningfulness results from a common language, a common frame of reference for interpreting the concepts being communicated, and a common evaluation of the relevance, importance, or significance of those concepts. In this social dimension of validity, the broader importance or significance of what has been found can be assessed.

External Validity and the Scientific Community

But first, research must be placed before the scientific community for its assessment of a study's meaningfulness as valid scientific knowledge. The minimum required is validation of the research through the blind peer-review process in which competent judges assess a study's fitness to be published as part of scientific knowledge. Specifically, scientific peers must agree that a particular study should be published in a peer-reviewed venue. As in scientific method generally, the peer-review process is meant to minimize human biases in the assessment of studies. In this process, judges unknown to the author review the work of the author, who is unknown to them. The judges, usually two or three, apply scientific criteria for validating the research's relevance, design and method, analysis, and inference. The requirements for the scientific validation of research are relatively straightforward. Presumably, the current research demonstrably grows out of previous work, and the researcher explicitly calls attention to its relevance for developing or modifying theory, replicating findings, extending the research line and filling research gaps, or resolving contradictions in previous studies. Only after this process is the research deemed fit to be presented or published as part of scientific knowledge.

Researchers submit their final work to a peer-review process for several reasons. First, researchers are given comments and criticisms for improving the study. No research is without flaws and limitations, and other experts in the field can help to illuminate them so they can be corrected or taken into account. Second, researchers submit work

to the peer-review process so that the study might inform and assist other researchers in the collective building of a body of knowledge that advances the goals of science to predict, explain, and potentially control phenomena. The judgment of the scientific community provides the necessary link between the internal and external validity of research. Clearly, research that is flawed because of some aspect of design or measurement cannot be trusted to generate new knowledge. The scientific validation of research is necessary before that research can (or should!) have any broader meaning or importance. In essence, internal validity (the study is deemed fit as scientific knowledge) is a necessary (but not necessarily sufficient) condition for external validity (the study has wider implications for part or the whole of society).

However, the status of any one study as part of scientific knowledge is still tentative until other research provides additional validation through study replication and extension. This validation can take place through direct replication and extension in similar studies, as in the example of agenda-setting research. Replication of findings by other studies is important because any one study, even given the most critical scrutiny, may through chance produce an atypical result. However, if study after study finds similar patterns in replications, the weight of the research as a whole strengthens one's confidence in the knowledge that has been found. Recall that in Chapter 5 it was noted that even data sets drawn consistently from non-probability samples can be useful if their findings make a cumulative contribution. Scientific community validation of a study can also happen through the use, modification, or further development of that study's definitions or measures, or through more extensive work in an area to which some study has drawn attention. The attention to media agenda-setting across multiple decades now is an example of collective validation from multiple studies by multiple researchers.

External Validity as Social Validity in Content Analysis

The validation of research method and inference is usually determined by the scientific community acting through the peer-review and replication process just discussed. This validation is necessary but not sufficient to establish the broader meaning and importance of research to audiences beyond the scientific community. The external validity of a content analysis beyond the scientific community can be strengthened in two ways that maximize its social validity. These concern the social importance of the content and how it has been collected, and the social relevance of the content categories and the way they have been measured and analyzed. In the following sections, we address these issues in the social validity of content studies.

Nature of the Content

The social validity of a content analysis can be increased if the content being explored is important. The more pervasive and important the content of interest to audiences, the greater will be the social validity of the analysis exploring that content. One dimension concerns the sheer size of the audience exposed to the content. Much of the research and social attention to digital media, for example, emerges from the fact that digital content is readily available, that it allows interactivity, and that large numbers of people use its content for many hours on a daily basis. Another dimension of the importance of the content being analyzed deals with the exposure of some critical audience to its influence. Children's television advertising is explored because it may have important implications for the social development of a presumably vulnerable and impressionable population. Twitter has become more important since 2016, when it became a primary way of influencing the political agenda in news media.

Finally, content may be important because of some crucial role or function it plays in society. For example, advertising is thought to be crucial to the economic functioning of market societies. Obviously, the effectiveness of advertising in motivating consumers to buy products will affect not only producers and consumers, but also the entire fabric of social relations linked in the market. Furthermore, advertising messages can have cultural byproducts because of social roles or stereotypes they communicate. Similarly, news coverage of political controversy is examined because it may influence public policy affecting millions. The political ethic of most Western societies is that an informed citizenry, acting through democratic institutions, determines policy choices. Clearly, then, the way these choices are presented has the potential to influence the agendas and opinions of these citizens.

Whatever the importance of the content, the social validity of the analysis will also be affected by how that content has been gathered and analyzed for study. Specifically, whether content has been selected through a census or a probability sample will influence what generalizations can be validly made. A major goal in most research is generating knowledge about populations of people, social institutions, or documents. Knowledge of an unrepresentative sample of content is frequently of limited value for knowing or understanding the population. Probability sampling, however, enables researchers to generalize to the population from which the sample was drawn. Taking a random sample or even a census of the relevant population of content obviously enables the researcher to speak with authority about the characteristics being measured in that population. Findings from content selected purposively or because of convenience cannot be generalized to wider populations. However, a strong case for

the social validity of purposively selected content may be made in specific contexts. For example, the news content of the "prestige" journalism outlets is clearly atypical of news coverage in general. However, those news outlets influence important policymakers and other news outlets as well, and therefore have importance because they are so atypical.

Nature of the Categories

Whatever the size or importance of the audience for some communication, content analysis creates categories for the study of the content of that communication. These categories serve three purposes. First, they are created because the researcher believes they describe important characteristics of the communication. Second, they are created because the researcher believes these communication characteristics are themselves systematically produced by other factors that can be illuminated. And third, they are created because the researcher believes these categories have some kind of meaning or effect for the audiences experiencing them in a communication. The conceptual and operational definitions of a content category that are relevant for these last two reasons are therefore also relevant for a study's social validity. Such concepts and their operational definitions may be interpretable by only a small body of researchers working in some field, or they may be accessible and relevant to far broader audiences.

Krippendorff's (1980) "semantical validity" (p. 157) relates to this notion of relevance in content analysis. Krippendorff (1980) asserts that semantical validity "assesses the degree to which a method is sensitive to the symbolic meanings that are relevant within a given context" (p. 157). In particular, Krippendorff (1980) considered a study to be high in semantical validity when the "data language corresponds to that of the source, the receiver or any other context" (p. 157). To what extent, therefore, do content analysis categories have corresponding meanings to audiences beyond the researchers? This question is particularly important when a researcher explicitly attempts to pursue a content analysis that deals with both the theoretical and practical relevance of the research.

This question is also critical when a researcher chooses to focus on either manifest or latent content in a content analysis. Manifest content, as we discussed earlier in this text, is more easily recognized and counted than latent content. Person A's name is in a story, maybe accompanied by a picture; a television show runs X number of commercials; a novel's average sentence length is X number of words. Analyses that attempt to capture more latent content deal with more holistic or "gestalt" judgments, evaluations, and interpretations of content and its context. Studies attempting to analyze content with extensive latent meaning assume that some important characteristics of communication may not be captured through sampling, category definition, reliability

assessment, or statistical analysis of the collected content data. Instead, the proper judgment, evaluation, or interpretation of communication content rests with the researcher. This assumption about the ability of the researcher to do these things is troublesome on several grounds. In particular, seldom if ever is it argued explicitly in the analysis of heavily latent content that the researcher's experience, intuition, judgment, or whatever is actually competent to make those judgments. We must simply believe that the meaning of content is illuminated by the discernment of the researcher who brings the appropriate context to the communication. In other words, the researcher analyzing latent content knows what that content in a communication "actually is" and what that content "actually does" to audiences getting that communication. Tankard (2001) found this to be the case with early framing research.

A study of latent content must assume, therefore, that the researcher possesses one or both of two different, even contradictory, qualities that displace explicit assessments of the reliability and validity of content studies. The first is that the researcher is an authoritative interpreter who can intuitively identify and assess the meaning embedded in some communication sent to audiences. The researcher is therefore the source of the study's reliability and validity of measurement. But this requires a large leap of faith in researchers. Specifically, we must believe that while human biases in selective exposure, perception, and recall exist in the naive perceiver of some communication, the researcher is somehow so immune that he or she can perceive the "real" content. For example, interpretations of the media's power to control the social construction of reality emerge from the assumed ability of the researcher—but not of media audience members—to stand enough apart from such an effect to observe and recognize it. But if the researcher can observe it, couldn't others? And if others can't observe it, then why can the researcher?

A second but contrary quality assumed for the researcher in the analysis of latent content is that the researcher is a kind of representative of the audience getting a communication. In other words, we must believe that the researcher is a "random sample" (with an n of 1) who "knows" the content's effects on audiences because he or she experiences and identifies them. Of course, few would trust the precision or generality of even a well-selected random sample with an n of 1. Are we to believe that a probably very atypical member of the audience—the researcher—can experience the content in the same way that other audience members would? As mentioned previously in Chapter 2, symbols often carry elements of both manifest and latent meaning. Variable and category definitions found on protocols should be designed to help coders by providing instructions on how content can be coded even if it contains some latent meaning.

Validity  147 It should be noted that these problems might also exist in the quantitative analysis of manifest content. For example, Austin, Pinkleton, Hust, and Coral-Reaume Miller (2007) found large differences in the frequency with which trained coders and a group of untrained audience members would assign content to categories. Although content analysis standards are satisfied by appropriate reliability tests, the social validity of a study may be limited if the content categories have little or no meaning to some broader audience. And no solution to the problem of inferring the effects of content on audiences exists unless content analysis is paired with techniques such as audience surveys or experiments that can better illuminate such effects. The claims made in the content analysis of manifest content should therefore always be tentative and qualified. But that’s the nature of the entire scientific enterprise. In other words, we argue here that quantitative content analysis is necessary, even if sometimes not sufficient, for the development of a science of human communication.

Summary The assumption of this text is that scientific method enables research to speak as truthfully as possible to as many as possible. Accomplishing this is the essence of validity in content analysis as well as in other research. Indeed, other research techniques face parallel, and equally serious, perils in establishing validity. Survey researchers ask questions that implicitly assume a context and frame of reference that interviewees may not share. Experimenters achieve strong causal inference at the cost of isolating phenomena so thoroughly that little may actually be learned about the broader world of interest. However, validity as truth is what inquiry is all about. When it comes to illuminating truth about content, quantitative content analysis is the best way we have. We should accept nothing less in our communication scholarship.

8 Designing a Content Analysis

Research methods such as surveys, experiments, or content analyses are productively used in social science to describe phenomena, observe their interrelationships, and make predictions about those interrelationships. The process of learning about a phenomenon empirically in order to help us do these things consists of three phases or stages: conceptualization of the inquiry, formulation of a research design to gather needed information, and data collection and analysis to get answers. These three phases are comparable to the stages a property owner might go through to have a structure built. In fact, the construction analogy is an apt way to think about research projects. In both cases, right decisions produce something of enduring value and wrong decisions produce regrettable costs. A construction project begins with the property owner's vision of what the structure will look like and how it will function, a vision parallel to the research conceptualization phase. A property owner might imagine a home, a strip mall, an office building, or an apartment complex. A researcher will similarly envision a variable being described, a hypothesis being tested, or a causal model being estimated. The building construction vision includes general ideas about features dictated by the function of the structure. But that vision must also consider the context: a home in an industrial area would never do. In a like manner, a research project takes account of its context: the previous research into which the study fits. And once a property owner gets beyond the general vision, a far more detailed planning process is necessary. At this stage, precise architectural blueprints address directly how the goals for the structure result in decisions about open space, entrances, and so on. This parallels the design stage in research that requires decisions about data collection, measurement, analysis, and, above all, whether the design adequately addresses research questions and hypotheses. In both research design and building construction, "blue-sky" wishes and hopes must be fitted into the realities of what can be done given the time and resources available.

Designing a Content Analysis  149 Finally, the builder executes the architectural plan with even more detailed instructions for contractors, carpenters, electricians, plumbers, masons, roofers, and others needed to make the structure a reality. Similar details in research design specify what data to seek, how to collect it appropriately, and how to analyze it statistically. Obviously, all three parts of the process are essential. The property owner’s vision provides the focus, direction, and purpose of the architectural plan. The architectural plan is needed before the work can begin. Finally, trained workers must carry out the plan reliably, or the structure will have flaws undermining its purpose. Similarly, content analysis research demands careful thinking about research goals and skillful use of data collection and analysis tools to learn about the phenomenon of interest.

Conceptualization in Content Analysis Research Design Content analysis research involves a process in which decisions about research design provide the link between conceptualization and data analysis. Research conceptualization in content analysis involves addressing goals shown in Figure 8.1. What question is the research supposed to answer? Is the purpose of the study description or testing relationships

Figure 8.1  Purposes of content analysis that guide research design. The figure lists five purposes: describe content variables; draw inference about content's meaning; answer research questions or test hypotheses about relationships among content variables; infer from the content to its context of production and consumption; and answer research questions or test hypotheses about relationships among content and non-content variables.

150  Designing a Content Analysis among variables? Is the goal to illuminate causal relationships? From a larger perspective, where does this study “fit” in the communication process described in Chapter 1? Is the study all about content characteristics? Will it assess antecedent conditions that shape content? Will it assess how content produces particular effects? In the design, how will antecedents and/or effects be linked to the content variables? The focus on each or all of the purposes described in Figure 8.1 for a particular study affects the design of that study. For example, a content analysis designed to describe messages may require little more than counting. But a content analysis designed to test how a multitude of factors affects a particular content variable must collect and analyze data in a way different than simply describing one or more variables. And assessing how content variations may result from some cause or may affect some audience may require design decisions beyond content analysis. It is not our intent in this book to suggest specific research questions for researchers to pursue. But we will discuss in the section that follows some general ways in which content analyses can be grouped, and at the end of this chapter we will provide one helpful structure for pursuing a multi-study research program. A General Typology for Content Studies Research using content analysis has been pursued in three general ways: studies that use content analysis only, studies that incorporate content analysis into designs with other methods to explore influences on content, and studies that use content analysis in conjunction with other methods to explore content effects. Content Analysis Only Designs These studies typically take two forms. The first looks at one or more variables across time. For example, studies of sex or violence in the media usually track such a single variable across years or decades in order to assess whether such characteristics are becoming more common or rare in some medium (Sapolsky et al., 2003). However, a second kind of study using only content variables assesses a multitude of factors observable through content analysis for their influence on some particular content variable of interest. For example, a team of scholars examined patterns of incivility in comments on news outlets’ Facebook pages, including comparing partisan and nonpartisan outlets, as well as local and national outlets (Yi-Fan Su et al., 2018). Other studies using only content analysis have looked at “intermedia agenda-setting,” studying, for example, the agendas of Twitter and traditional news media during 2012 presidential primaries (Conway, Kenski, & Wang, 2015), how Instagram posts set newspapers’ agenda in the 2016 presidential primary (Muñoz &

Designing a Content Analysis  151 Towner, 2017), or how news releases influence reporting (Kiousis, Kim, McDevitt, & Ostrowski, 2009). Content Analyses and Influences on Content Various studies have used content analysis in conjunction with available social or other indicators as standards to assess content or as influences on content. Watson (2014) examined how journalists’ political ideologies, environmental beliefs, and endorsement of different journalistic roles, along with the percentage of the local population employed in the oil industry, affected Gulf Coast newspapers’ coverage of the 2010 BP oil spill. The influence of entire national cultures on content has also been explored (Zhang, 2009). Content Analyses and Content as Independent Variable There are two prolific lines of research that have combined content analysis and survey methods to examine the influences of content: agenda-setting and cultivation research. In first- and second-level agenda-setting studies, news content on issues at time 1 is assessed for its influence on audience priorities or thinking at time 2 (Golan & Wanta, 2001). Cultivation studies have assessed how violence in mass media content affects both the fear levels of audience members and their general ideas about social reality (Gerbner, Gross, Morgan, & Signorielli, 1994; Gerbner, Signorielli, & Morgan, 1995). Other studies have looked at information campaigns’ effects on citizens’ knowledge of various topics (An, Jin, & Pfau, 2006) and at “knowledge gaps” that result from differential exposure to media information (Hwang & Jeong, 2009). More rarely, content analysis results inform experimental manipulations to test effects of such content on consumers. Some of this research has used experimental manipulations of commonly found entertainment violence and pornography to assess changes in the attitudes of those exposed to such fare (Allen, D’Alessio, & Brezgel, 1995; Malamuth, Addison, & Koss, 2000). Fico, Richardson, and Edwards (2004) used findings from previous content analysis studies to fashion balanced and imbalanced stories on controversial issues to see if readers would consider such stories biased and if those judgments affected their assessment of news organization credibility. Research Hypotheses and Research Questions The kinds of studies referenced briefly above range from very simple designs to very complex ones. But as with other quantitative research methods, content analysis requires resources of time and effort that should be used efficiently and effectively. In particular, content analyses

152  Designing a Content Analysis should not be carried out absent an explicit hypothesis or research question to efficiently and effectively guide the design of the inquiry. Describing the value of such explicitness, McCombs (1972) argued that a hypothesis (or, presumably, a research question) “gives guidance to the observer trying to understand the complexities of reality. Those who start out to look at everything in general and nothing in particular seldom find anything at all” (p. 5). A lack of explicit a priori research questions and/or hypotheses also encourages a hunt for any statistically significant associations, which may be spurious, regardless of whether they can or cannot be explained by post hoc theorizing. To guard against post hoc hypothesizing and theorizing in one’s own work, and to encourage others to do the same, authors can pre-register their research questions and hypotheses with AsPredicted (http://aspredicted.org). Careful thinking about a problem or issue and review of previous related research is absolutely vital to the formation of such hypotheses or questions that are, in turn, vital to successful research design. Reviewing previous research provides guidance on what variables to examine and on how—or how not—to collect data to measure them. Moreover, the hypotheses formulated to build on or extend such research give guidance on how to measure the variables to be explored. An explicit hypothesis (or question) guides both data collection and variable measurement in good research design. A hypothesis explicitly asserts that a state or level of one variable is associated with a state or level in another variable. A hypothesis is appropriate where there is adequate theoretical or empirical support in the existing literature for a specific relationship between two or more variables. A hypothesis may assert an actual causal relationship or merely a predictable association (discussed below). Which of these two assumptions is made will guide research design. Sometimes taking the form of conditional statements (“if X, then Y”), a hypothesis in quantitative content analysis may be as simple as this one (Döring et al., 2016, p. 958): Females’ Instagram selfies will reveal more of their unclothed bodies than males’ Instagram selfies. Note that the hypothesis has identified the data the study must collect (selfies posted on Instagram), and that the way the hypothesis has categorized such selfies means the study is “locked” into an independent variable with two values (female versus male). Moreover, the hypothesis also (in this case) “locks” the study into a dependent variable with two levels (wearing sparse clothing versus fully clothed). Note also that the hypotheses implicitly suggest a third condition or variable—whether an Instagram post is a selfie—which will be made constant for the study and which directs what kind of posts to collect. This implicit condition

Designing a Content Analysis  153 for this sample hypothesis provides the opportunity for the hypothesized relationship to occur. It also makes for a limitation on the generality of the claims of the study even if the hypothesis is supported: it holds only for selfies posted on Instagram. Research questions are more tentative because the researchers are unable to predict possible outcomes based on existing theoretical knowledge or empirical evidence. Döring et al. (2016, p. 598) also posed the following research question: Are gender stereotypes in Instagram selfies more or less salient than in magazine ads? Because there had not been a previous comparison of selfies and magazine ads, there was not empirical or theoretical evidence to suggest whether women’s self-presentation online is different from their objectification in magazine ads, much less the nature of potential differences. Hypotheses and/or research questions most valuably enable researchers to “get out ahead” of the study. They enable the researcher to visualize what kind of data analysis will address the hypothesis or research question. In the example hypothesis above, a two-by-two contingency table would display the proportions of sparse or fully-clothed images produced by the two poster genders. Riffe (2003, pp. 184–188) called this “preplanning” essential to effective data analysis. Moreover, such “preplanning” provides an opportunity to revise the study before the expenditure of time and money. Analysis visualization before the study begins can feed back into decisions about the very wording of study hypotheses, about what content to examine, about the level at which content variables should be measured, and about the best analysis technique to be employed. Correlation, Causation, and Design Research design that flows from a hypothesis should deal explicitly with whether the study’s purpose is to demonstrate correlation or causation. In the example above, can there be something about gender that causes differences in female and male self-presentation on Instagram? If so, we believe there is a causal relationship at work. On the other hand, it could be—more likely—that there are not inherent, causal gender differences. Rather, females’ Instagram posts adhere more to gender stereotypes because of cultural expectations as to how different genders are performed. If we believe the latter to be the case, then we are exploring a correlation in the relationship between gender and selfpresentation in Instagram posts. In both cases, we may be able to make good predictions, but only in the first case would we know why those predictions are good.
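One way to make this kind of "preplanning" concrete is to mock up the two-by-two contingency table in code before any real coding begins. The sketch below uses invented data for the Instagram selfie hypothesis discussed above; pandas and the variable names (poster_gender, clothing) are our assumptions, not part of the original study.

```python
import pandas as pd

# Hypothetical coded data: one row per Instagram selfie (invented values).
selfies = pd.DataFrame({
    "poster_gender": ["female", "female", "male", "male", "female", "male"],
    "clothing": ["sparse", "full", "full", "full", "sparse", "sparse"],
})

# Two-by-two table of proportions within each gender, mirroring the dummy table
# a researcher might sketch out before collecting any data.
table = pd.crosstab(selfies["poster_gender"], selfies["clothing"], normalize="index")
print(table.round(2))
```

Drafting such a table in advance forces decisions about the number of values each variable will take and the level at which it will be measured, before any coding time has been spent.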

154  Designing a Content Analysis Correlation between two variables means that an increase or decrease in one variable is associated with an increase or decrease in another variable. If higher levels of one variable are associated with higher levels of some other variable, the correlation is positive (e.g., news startups with larger staffs publish more stories). If the correlation is negative, higher levels of one variable are associated with lower levels of the other (e.g., as the minutes of advertisements in a program increases, the number of minutes devoted to the program content will decrease). The problem in inferring causation from some observed correlation is that the observed changes may be coincidental. In the summer, both sales of ice cream and murder rates are positively correlated with one another; certainly, though, it cannot be said that ice cream sales cause murders to spike, or vice versa. Indeed, a third variable may be the cause of change in both of the first two variables, like a puppeteer working two hand puppets. By simply observing associations between variables, it is easy to leap to incorrect, spurious inferences. A causal relationship, on the other hand, is a special kind of correlation that satisfies the logical conditions for inferring a necessary or sufficient connection between a change in one thing and a change in another thing. Prior theory must be consulted to assess whether particular influences are necessary or sufficient for the expected change to occur. If a particular factor is necessary, it must change or else the expected change cannot happen. If a particular factor is sufficient, its change may bring about the expected change, but that expected change may also happen because of the change of other factors as well. Three such logical conditions must be met for such inferences. One condition necessary for demonstration of a causal relationship is time order. The alleged or hypothesized cause should precede the effect. Suppose a researcher wanted to examine whether a change in news commenting online that forced commenters to sign in with their Facebook account led to more civility in news comment sections. A poorly designed study might develop a measure of civility and, without measuring the degree of civility before the sign-in requirement was instituted, attribute the observed degree of civility to the change in how commenters sign in and are identified (see Design A in Figure 8.2). A better study (Design B) would measure the degree of civility in news comments both before (at time 1 [Tl], the first point in time) the change (which occurs at T2) and after the change (at T3). This is a before/after design with a clear time order. It should be clear from our “puppeteer” analogy, however, that some other variable or variables might explain any change in civility between T1 and T3. The second condition necessary for a causal relationship is the observable correlation we described previously. If we cannot observe both of the variables changing, or if we have a research design that doesn’t permit one of the variables to change, we cannot logically infer a causal relationship.

Figure 8.2  Research designs involving time order. Design A (News Site A): the sign-in change occurs at T2 and the civility level is measured only afterward, at T3. Design B (News Site A): the civility level is measured at T1, the sign-in change occurs at T2, and the civility level is measured again at T3. Design C adds a comparison site: News Site A is measured at T1, changes its sign-in policy at T2, and is measured again at T3, while News Site B is measured at T1 and T3 but makes no change at T2.

If, for example, we had no data on civility before the change to the news commenting policy described above that is comparable to the data we have after the change, we could never infer a cause–effect relationship between the sign-in requirement and the civility of comments. So, we must be able to observe that different levels or degrees of the cause are associated with observed levels or degrees of the effect. However, we now run head-on into the problem of possible "third variables" causing the change in the degree of civility in news comment sections. Suppose that between T1 and T3, in addition to the change requiring news commenters to sign in using their Facebook accounts (T2), a significant news event that evoked strong emotional reactions had occurred (e.g., the #BlackLivesMatter or #MeToo movements), itself affecting the tone of public discourse, at least around those significant news topics.

156  Designing a Content Analysis If such a scenario occurred, it would be difficult to attribute the change from T1 to T3 solely to the change in requiring news commenters to sign in using their Facebook profiles, rather than to a change in the public discourse due to the emotions stirred by these social movements. To have greater certainty that the observed change is due to our independent variable, we need to use logic or some multivariate design to bring such “third variables” under control. One way to do so is to identify two similar news organizations, with similar news commenting sections, only one of which made the change to require news commenters to log in using their Facebook profile (T2). We would measure the degree of civility in both organizations’ news comments prior to the change (T1) and after the change (T3). The study will now have defined and ensured the necessary variation on a special independent variable (whose values are “sign-in change” and “sign-in didn’t change”). If there is a difference in the degree of change between T1 and T3 for the organization that changed its signin policy versus the organization that did not, the study will have found variation on the dependent variable that is related to variation in the independent variable, thereby supporting the inference of a causal connection. The change requiring news commenters to log in using their Facebook profile likely influenced the change in the degree of civility. This third requirement for demonstrating a causal relationship, however, is the most difficult to establish. It involves the control of all (known and unknown) rival explanations for why changes in two variables are systematically and predictably related. Rival explanations are the full range of potential and possible alternative explanations for what is plausibly interpreted as a cause–effect relationship between two variables. For example, in addition to requiring users to log in using their Facebook profile, a news organization could have made other changes: introducing algorithms that block comments that use uncivil language, encouraging more visible participation of moderators in comment threads, and so on. Thus, different rates of change at T3 could be due to either of these factors, or some combination of all three changes (Facebook sign in, algorithmic filtering, or participation by moderators). Researchers designing content analyses try to control as many factors as possible that might give plausible rival explanations for an observed relationship. Previous research and theory may give guidance on what rival explanations to control. Some studies may be able to remove such rival explanations through the logic of their research designs or by collecting the necessary data on them to enable their statistical control in the data analysis. It is impossible, however, for any single non-experimental study to control or measure every potential important variable that could influence a relationship of interest. Equally important, few phenomena are themselves the results of a single cause. That explains why most scientists are reluctant to close the door on any area of study. It also explains why

Designing a Content Analysis  157 scientists continue to “tinker” with ideas and explanations, incorporating more variables in their research designs, seeking contingent conditions, and testing refined hypotheses in areas in which the bulk of evidence points in a particular direction. Sometimes simply going through the process of graphically identifying elements of a research study can help the researcher avoid pitfalls—and identify rival explanations. Alternative ways of illustrating research designs or of depicting the testing of various research hypotheses and questions—and the types of inferences that can be drawn—have been offered by Holsti (1969, pp. 27–41) and by Stouffer (1977). Moreover, assuming a researcher wants to engage the phenomenon of interest across multiple studies, a graphical representation of findings such as a line drawing with arrowheads indicating tested relationships can help a researcher keep track of important variables whose interrelationships affect what the researcher is trying to find out. We discuss this more in Chapter 9.
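To make the logic of the two-site, before/after comparison (Design C in Figure 8.2) concrete, the sketch below shows how such data might be tabulated. The comments, variable names, and coding scheme (1 = civil, 0 = uncivil) are invented for illustration; a real study would code far more comments and apply an appropriate test of statistical significance.

```python
import pandas as pd

# Hypothetical coded data for the Design C logic: each row is one comment,
# coded 1 if civil and 0 if uncivil (variable names are assumptions).
comments = pd.DataFrame({
    "site":   ["A"] * 4 + ["B"] * 4,
    "period": ["T1", "T1", "T3", "T3"] * 2,
    "civil":  [0, 1, 1, 1,   # Site A: changed its sign-in policy at T2
               1, 0, 0, 1],  # Site B: made no change
})

# Share of civil comments by site and period, then the before/after change per site.
civility = comments.groupby(["site", "period"])["civil"].mean().unstack()
change = civility["T3"] - civility["T1"]
print(civility)
print("Change beyond the shared trend:", change["A"] - change["B"])
```

The final line subtracts the comparison site's change from the changed site's change, which is how the design separates the sign-in effect from whatever was happening to civility at both sites anyway.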

Good Design and Bad Design For Babbie (2013), Holsti (1969), and Miller (1977), research design is a plan or outline encompassing all the steps in research ranging from problem identification through interpretation of results. Kerlinger (1973) argued that the “outline of what the investigator will do from writing the hypotheses and their operational implications to the final analysis of the data” (p. 346) is part of research design. Holsti (1969) described research design simply as “a plan for collecting and analyzing data in order to answer the investigator’s question” (p. 24). A simple definition, yes, but its emphasis on utilitarianism—“to answer the investigator’s question”— is singular and suggests the gold standard for evaluating research design. How can research be designed to answer a specific question? Holsti (1969) argued that: A good research design makes explicit and integrates procedures for selecting a sample of data for analysis, content categories and units to be placed into the categories, comparisons between categories, and the classes of inference which may be drawn from the data. (pp. 24–26, emphasis in original) For Wimmer and Dominick (1991), that meant “the ideal design collects a maximum amount of information with a minimal expenditure of time and resources” (pp. 24–25). To quote Stouffer (1977), strong design ensures that “evidence is not capable of a dozen alternative interpretations” (p. 27). By careful design, the researcher eliminates many of the troublesome alternative or rival explanations that are possible and “sets up the framework for

158  Designing a Content Analysis ‘adequate’” testing of relationships “as validly, objectively, accurately, and economically as possible” (Kerlinger, 1973, p. 301). Thus, the hallmarks of good design, according to Kerlinger (1973), are the extent to which the design enables one to answer the question, controls extraneous independent variables, and permits generalizable results. The emphasis in these definitions on “alternative interpretations” (Stouffer, 1977, p. 27) and “troublesome alternative . . . explanations” (Kerlinger, 1973, p. 301) reflects far more than intolerance for ambiguity. It captures the essence of what makes good, valid research design. Imagine that somewhere among all the communication messages ever created by all communicators, there are message characteristics or variables that would enable a researcher to answer a particular research question. Unfortunately, that same set of messages also contains information irrelevant to the researcher, the answers to countless other questions, and even answers that can distort the answer the researcher seeks. A good research design is an operational plan that permits the researcher to locate precisely the data that permit the question to be answered. Elements of Research Design Often the heart of a research design is some sort of comparison of content that has theoretical importance. In particular, content is often compared across time, across content-producing organizations, or among people. Note that such designs usually incorporate more than one hypothesis or question. Finally, where possible, research designs may usefully take advantage of existing data-gathering or variable measurement techniques that have been successfully used in past research. In fact, this is most useful for building a body of integrated knowledge in social science research. Comparisons may also be between media (contrasting one communicator or one medium with another), within media (comparing among networks or newspapers with one another), between markets, between nations, and so on. Moreover, content analysts may link their research to other methods and to other data, such as comparisons between content data and survey results (e.g., the agenda-setting studies discussed earlier) or between content data and non-media data (e.g., comparing minority representation in advertising with census data). Our ability to study important phenomena increases with the triangulation of several data collection methods, and our confidence in findings increases with a convergence of findings from data collected using different methods. Very powerful designs incorporate a number of design elements and data-gathering methods to address research problems. Although testing relationships among variables and comparing content among media and over time have been emphasized, one needs to re-emphasize the value and validity of so-called one-shot design studies that might not compare across media or time. These studies are important

Designing a Content Analysis  159 for many reasons raised earlier: their focus may be on variable relationships that do not involve time or comparisons, they may be crucial to a new area of inquiry, or the content that is analyzed may be viewed as the consequence of antecedent processes or the cause of other effects. Our emphasis on hypothesized relationships and research questions, however, is a product of years of working with students who sometimes prematurely embrace a particular research method or procedure without thinking through what it is actually useful for accomplishing. One author recalls overhearing a student telling a classmate that she had decided to “do content analysis” for her thesis. “Content analysis of what?” the second student asked. The first student’s response was, “I don’t know. Just content analysis.” This is analogous to a child who enjoys pounding things and wants to hammer without thinking about what is being built.

A General Model for Content Analysis Based on this volume's definition of content analysis and the need for careful conceptualization and research design, how should a researcher go about the work of conducting a content analysis? We offer below a design model in terms of primary and secondary questions that a researcher might ask or address at different stages. This model is organized under larger headings representing the three processes of conceptualization and purpose; design, or planning of what will be done to achieve that purpose; and data collection and analysis (see Table 8.1). Although Table 8.1 suggests a linear progression—and certain steps should precede others—the process is viewed as a recursive one in the sense that the analyst must continually refer back to the theory framing the study and must be prepared to refine and redefine when situations dictate.

Table 8.1  Conducting a content analysis
Conceptualization and Purpose: identify the problem; review theory and research; pose specific research questions and hypotheses.
Design: define relevant content; specify formal design; create dummy tables; operationalize (coding protocol and sheets); specify population and sampling plans; pretest and establish reliability procedures.
Analysis: process data (establish reliability and code content); apply statistical procedures; interpret and report results.

160  Designing a Content Analysis Conceptualization and Purpose What Is the Phenomenon or Event to Be Studied? In some models of the research process, this is called problem identification or statement of the research objective. Researchable problems may come from direct observation or may be suggested by previous studies or theory. Personal observation, or a concern with some communicationrelated problem or need, is always an acceptable place from which to start an inquiry. But immersion in the scientific theory and empirical research relevant to that personal observation is the necessary follow-up. Ideally, a study’s purpose can be placed in what has been called the “Pasteur quadrant” (Stokes, 1997): an argument that the study is important for both advancing a body of theory and research and also important for solving some practical problem or meeting some social need. Such a context maximizes the possibility of both external funding for research and successful publication in an appropriate peer-reviewed journal. Conversely, a study that has no arguable theoretical relevance or practical importance should be seriously reconsidered (unless, of course, there is a billionaire friend or relative willing to fund it anyway, in which case the authors of this volume would appreciate an introduction). How Much Is Known about the Phenomenon Already? Have any studies of this or related phenomena been conducted already? Is enough known already to enable the researcher to hypothesize and test variable relationships that might be involved, or is the purpose of the study more likely to be exploratory or descriptive? Beginning researchers and even experienced ones often approach this step in too casual a manner. The result is a review of existing research and theory that excludes knowledge crucial to a proper framing of a problem. The incomplete review of existing knowledge occurs mostly for four reasons: (a) an overdependence on web searches or computer indexes that may not be complete (some may not include all relevant journals or all the volumes of those journals); (b) an exclusion of important journals from the review; (c) an unfamiliarity with scholarship from other fields; and (d) an impatience to get on with a project before examining all relevant materials. What Are the Specific Research Questions or Hypotheses? Will the study examine correlations among variables or will it test causal hypotheses? Will its purposes include inference to the context of message production or consumption? It is at the conceptualization stage that many studies are doomed to fail simply because the researcher may not have spent enough time thinking and pondering the existing research. This step includes identification of

Designing a Content Analysis  161 key theoretical concepts that may be operative and may involve a process of deduction, with the researcher reasoning what might be observed in the content if certain hypothesized relationships exist. Moreover, a study’s publication and contribution success is related to how well it fits into the context of past research, adding to the body of knowledge, refining some concept of interest, qualifying past assumptions or findings, or even correcting conceptual confusion or methodological mistakes. In sum, conceptualization involves problem identification, examination of relevant literature, a process of deduction, and a clear understanding of the study’s purpose. That purpose will guide the research design. Design What Content Will Be Needed to Answer the Specific Research Question or Test the Hypothesis? Will newspaper content, broadcast videotape, multimedia, social networking, or some other form of communication content be involved? What resources are available and accessible? Most important, what specific units of content will be examined to answer the question? Another issue that arises during this phase of planning and design has to do with availability of appropriate materials for analysis (e.g., newspapers, tapes, texts, web pages, tweets). It is important to note here that a disproportionate number of content analyses still examine traditional print media, newspapers in particular. This is in part due to the fact that newspapers are better indexed and archived in databases available at many libraries—though this is increasingly not the case of content that appears only on newspapers’ web pages, a significant gap when one considers how audiences today are reading news (Hansen & Paul, 2015). Video (as opposed to print transcripts), audio, website, and social media data all pose greater challenges in terms of accessing the appropriate content to answer one’s research question as they are less indexed and archived. Nonetheless, we emphasize that logistical and availability factors should not be as important in planning as the theoretical merit of the research question itself, and that one’s study should reflect how today’s audiences are interacting with content (it makes no sense to use a database that archives only content appearing in print). However, it is unfortunately true that not all researchers have unlimited resources or access to ideal materials for content analysis. The design phase should, to be realistic, involve some assessment of feasibility and accessibility of materials. What Is the Formal Design of the Study? How can the research question or hypothesis best be tested? How can the study be designed and conducted in such a way as to assure successful

testing of the hypothesis or answering the research question? Recall an earlier observation that good research design is the operational plan for the study that ensures that the research objective can be achieved. Recall also that the formal content analysis research design is the actual blueprint for execution of the study. It is directed by what occurred in the conceptualization process, particularly the decision to propose testable hypotheses or pursue answering a less specific research question. Each of these objectives suggests particular decisions in the study design process such as a study's time frame (e.g., a study of tweets before and after the platform doubled its character limit to 280 characters), how many data points are used, or any comparisons that may be involved, whether with other media or other sources of data. Many content analysts find it useful at this point in planning to preplan the data analysis. Developing "dummy tables" (see Table 8.2) that show various hypothetical study outcomes, given the data collected for study variables and their measurement levels, can help the researcher evaluate whether study design decisions on these matters will even address the hypothesis or the research question. At this point, some researchers realize that their design will not achieve that goal; better now, however, than later.

Table 8.2  Example of a dummy table

Character Is        Has Speaking Role    Has Nonspeaking Role
Female of color     ?%                   ?%
Male of color       ?%                   ?%
White female        ?%                   ?%
White male          ?%                   ?%
Total               100%                 100%

How Will Coders Know the Data When They See It? What units of content (words, square inches, tweets, video scenes, etc.) will be placed in the categories? The analyst must move from the conceptual level to the operational level, describing abstract or theoretical variables in terms of actual measurement procedures that coders can apply. What sorts of operational definitions will be used? What kind of measurement can be achieved (e.g., simple categories such as male or female characters, real numbers such as story length, or ratings for fairness or interest on a scale)? The heart of a content analysis is the content analysis protocol that explains how the variables in the study are to be measured and recorded on the coding sheet or other medium. It is simple enough to speak of abstract concepts such as a tweet's valence, but a coder for a Twitter content analysis must know what it looks like in text. In Chapter 4, we addressed the question of measurement in greater detail.

How Much Data Will Be Needed to Test the Hypothesis or Answer the Research Question? What population of communication content units will be examined? Will sampling from that population be necessary? What kind of sample? How large a sample? A population of content is simply the entire set of potential tweets, broadcast programs, documents, web pages, and so on within a pertinent time frame (which is, of course, also an element of design). When appropriate, researchers use representative samples of the population rather than examining all the members. However, in some situations, sampling is not appropriate. If the focus is on a particular critical event (e.g., the September 11 terrorist attacks or a major oil spill) within a specified time period, probability sampling might miss key parts of the coverage. Or, if one is working with content that might be important but comparatively scarce (e.g., sources cited in early news coverage of AIDS), one would be more successful examining the entire population of AIDS stories. In Chapter 5, we discussed sampling in more detail.

How Can the Quality of the Data Be Maximized? The operational definitions will need to be pretested and coders will need to be trained in their use. Before and during coding, coder reliability (or agreement in using the procedures) will need testing. In Chapter 6, we addressed the logic and techniques of reliability testing. Many researchers test coding instructions during the process of developing them. Then coders who will be applying the rules and using the instructions are trained. A pretest of reliability (how much agreement among the coders is there in applying the rules) may be conducted and the instructions refined further. We emphasize here, however, that maximizing data quality by testing reliability, achieving reliability, and reporting reliability is necessary in content analysis research. Lacy and Riffe (1993) argued that reporting content analysis reliability is a minimum requirement if readers are to assess the validity of the reported research.
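As a purely illustrative example of the kind of reliability pretest described above, the sketch below computes simple percent agreement and Scott's pi for two hypothetical coders judging the same ten units. The data are invented, and Chapter 6 discusses how to choose among reliability coefficients; this sketch shows only the mechanics.

```python
from collections import Counter

# Hypothetical pretest: two coders categorize the same 10 units (e.g., tweet valence).
coder1 = ["pos", "neg", "neg", "neu", "pos", "neg", "neu", "neu", "pos", "neg"]
coder2 = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos", "pos", "neg"]

n = len(coder1)
observed = sum(a == b for a, b in zip(coder1, coder2)) / n  # simple percent agreement

# Scott's pi corrects for chance agreement using the pooled category proportions.
pooled = Counter(coder1 + coder2)
expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
scotts_pi = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.2f}, Scott's pi: {scotts_pi:.2f}")
```

A pretest like this, run on a small set of units before full coding begins, points to category definitions that need tightening while revision is still cheap.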

Data Collection and Analysis

What Kind of Data Analysis Will Be Used? Will statistical procedures be necessary? What statistical tests are appropriate once content analysis data have been collected? A number of factors influence the choice of statistical tests, including level of measurement and type of sample used. (Inferential statistics are inappropriate when using a population or a nonscientific, nonrandom sample.) Some content analyses involve procedures of varying complexity that examine and characterize relationships among and between variables. Here, it is helpful to consider on the front end whether particular statistical analyses impose requirements on the formal design. For example, one may be interested in using hierarchical linear modeling (HLM) to estimate separate individual (level 1), organizational (level 2), and national-level (level 3) effects on how reporters cover social protest. HLM requires a minimum number of observations, which varies depending on the source one consults (Hox, Moerbeek, & Van de Schoot, 2017), in order to produce stable estimates of effects observed at the different levels. Such requirements are rarely met with "naturally occurring" data; thus, they need to be built into the formal design process from the start. Other studies, by contrast, report only simple percentages or averages. These issues are dealt with in detail in Chapter 9.

Has the Research Question Been Answered or the Research Hypothesis Tested Successfully? What are the results of the content analysis and any statistical tests? What is the importance or significance of the results? Interpreting and reporting the results is the final phase. It enables scientists to evaluate and build on the work of others. The actual form of the report depends on the purpose of the study and the appropriate forum (to a publisher, a thesis committee, the readers of a trade publication, colleagues in the academic community, etc.). The importance of a research finding is determined by connecting the found relationship with the problem that underlies the research. A relationship can be statistically strong but have little importance for scholarship or society. The importance of a research finding cannot be determined statistically. It is determined by the finding's contribution to developing theory and solving problems. Only when the statistical measures of strength of a relationship are put in the context of theory and existing knowledge can importance be evaluated.
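For readers who want to see what the multilevel analysis mentioned above looks like in practice, here is a minimal two-level sketch (stories nested within news organizations) rather than the full three-level model; statsmodels' MixedLM is one accessible option, and the data and variable names are invented. As the text notes, far more observations per level would be needed for stable estimates.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical coded data: protest stories nested within news organizations.
# 'framing_score' and 'reporter_experience' are invented variable names.
df = pd.DataFrame({
    "outlet": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "reporter_experience": [2, 5, 8, 1, 4, 9, 3, 6, 7, 2, 5, 10],
    "framing_score": [2.1, 2.8, 3.4, 1.9, 2.5, 3.6, 2.2, 3.0, 3.1, 1.8, 2.6, 3.8],
})

# Random-intercept model: stories (level 1) nested within outlets (level 2).
# A real study would need many more stories and outlets than this toy example.
model = smf.mixedlm("framing_score ~ reporter_experience", data=df, groups=df["outlet"])
result = model.fit()
print(result.summary())
```

Checking such requirements at the design stage, rather than after coding is complete, is the point of the preceding paragraph.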

Research Program Design One of the authors of this text routinely asks doctoral students (and new doctoral program graduates seeking a job), “What’s your dependent variable?” The question implies a lot. Does a researcher have an enduring focus on a particular research problem or phenomenon? Is there an overall theoretical coherence to this focus and research? Is there a “fire in the

Designing a Content Analysis  165 belly” of this researcher that will motivate study after study to illuminate and understand some part of the communication world? The focus of this chapter has been on the design of a single study. But if a body of knowledge is to be built in communication science, particularly in that “Pasteur quadrant,” multiple studies will be needed. Recall that the goal of science is prediction, explanation, and control of some phenomenon. Almost always, multiple causes of varying strengths interacting under varying conditions will be affecting that phenomenon. No study (and no single researcher) can hope to illuminate them all. So, this chapter concludes not with any suggested program of research, but with a brief suggestion for how a researcher (or researchers) might organize such a program of research. Shoemaker and Reese (1996) have provided a general organizing framework for such a program of research that applies to studies about antecedents and content. The foundational analysis they make is that content variation is affected by five levels of influence. Shoemaker and Reese conceptualize these levels in terms of higher-level constraints limiting the freedom of lower-level factors. But more broadly, we can also see the possibility of higher-level factors actively influencing lower-level ones. Indeed, it may even turn out that there are conditions under which lower-level factors may be impervious to higher-level ones, or may even affect higher-level factors. Studies can be conducted within and across these levels, all focused ultimately on some dependent content variable of interest. The five levels emphasized by Shoemaker and Reese (1996) are: individual media worker characteristics, media organization routines, media organization characteristics, the environment of media organizations, and societal ideology and culture. Media Worker Characteristics These personnel have the most direct, creative influence on content. Variables possibly influencing their work include demographic ones such as gender, race, and other variables that are usually the focus of sociological research. Political orientation, values, attitudes, and the like, as well as psychological processes, might also be investigated for their influences on content. For digital media studies involving blogging or social network content, this level may engage an entire research program. Media Organization Routines Routines are the repeated patterns of interaction that enable an organization to function and reliably achieve its goals. Such routines may include deadlines for media content production, expectations for content amounts and packaging, and publication cycles. Shoemaker and Reese

166  Designing a Content Analysis (1996) identify routines involving news sources, audiences, and processers. Obviously, such routines will differ across organizations producing the same kind of media content (e.g., news) and across organizations producing different kinds of content (e.g., news organizations compared to advertising organizations). Research programs may focus within this routine level or on how such routines affect the work of those who directly produce content. Media Organization Characteristics Such characteristics include the goals and resources of organizations producing media content, and who is setting such goals. It includes the way rewards and punishments are allocated for achieving such goals. It includes both the internal differentiation of power and responsibility within the organization and the way the organization interacts with the external environment. This interaction with other organizations and institutions includes resource dependencies and relative power. Research programs may focus on different organization characteristics and on how variations in such organizational characteristics influence organizational routines and media worker outputs. Media Organization Environments This includes both other organizations and social institutions affecting the work of media organizations. Research programs may focus, for instance, on how laws and governmental regulations influence the media organization. It may include how the organization copes with competitors, critics, and interest groups. News organizations, for example, operate in very different environments than do public relations firms. Societal Ideology The Shoemaker and Reese (1996) approach focuses on ideology as a way in which dominant economic interests influence organization environments, organization characteristics, routines, and media workers. More broadly, however, research programs may focus on how national cultures and even subcultures within such nations may influence media content. Research programs at this level may be heavily international. But even within a nation, studies of communications within subgroups in society may be done. Given what is aptly called the “World Wide Web,” studies of the Internet may produce examples of new cross-national cultures being formed or of the growth of subcultures within a nation that are focused on particular beliefs or values.

Designing a Content Analysis  167

Summary Content analysis involves conceptualization, design, and execution phases. The research design of a study is its blueprint, the plan specifying how a particular content analysis will be performed to answer a specific research question or test a specific research hypothesis. Design considerations include time, comparisons with other media or data sources, operationalization and measurement decisions, sampling, reliability, and appropriate statistical analysis. Ultimately, good research design can be evaluated in terms of how well it permits answering the research question and fulfilling the study’s purpose.

9 Data Analysis

Like most research methods, content analysis is comparable to detective work. Content analysts examine evidence to solve problems and answer questions. Of course, scholars limit their examinations to relevant evidence. The measurement, sampling, and research design decisions we discussed in Chapters 4, 5, and 8 are, in effect, the content analyst's rules for determining relevant evidence and how to collect it, whereas Chapters 6 and 7 offer insights to help ensure that the evidence is of optimal quality. Ultimately, however, data collection ceases, and data must be reduced and summarized. Patterns within the evidence must be plumbed for meaning. In quantitative content analysis, the process of data analysis involves statistical procedures, tools that summarize data so that patterns may be illuminated. In this chapter, we aim to help researchers think efficiently and logically about analyzing data quantitatively. The strategy is to illustrate the intuitively logical bases of several commonly used analysis techniques and to provide guidance on applying them. These techniques are basic ones: descriptive measures, such as means and proportions, along with correlation and tests of statistical significance. We also introduce analysis of variance (ANOVA) and multivariate techniques. We present basic notions of probability to facilitate understanding of how and why particular statistics work. On the other hand, detailed discussion of these techniques or the mathematical basis of statistics is beyond the scope and goal of this text.

An Introduction to Analyzing Content Although a number of disciplines employ content analysis, communication researchers have been among the most persistent in exploiting the technique. An unpublished examination of the data tables and analysis sections of 239 studies in Journalism & Mass Communication Quarterly from 1986 through 1995 indicates that content analysts rely on several basic analysis techniques and a few more advanced ones. That is, a limited number of tools turn out to be useful for a variety of tasks. As in

Data Analysis  169 many kinds of work, knowing what tool will serve adequately for which job is essential knowledge. Some of these analysis techniques are very simple. Researchers who produced 28% of the 239 content analysis studies were able to achieve their objectives using only means, proportions, or simple frequency counts. When other techniques have been used, they were often in combination with means and proportions. Techniques for analyzing the content data included chi-square and Cramer’s V (used in 37% of studies) and Pearson’s product–moment correlation (15% of the studies). Techniques to assess differences between means or proportions of two samples were used in 17% of studies. More advanced techniques included ANOVA (used in 6% of the studies) and multiple regression (8% of the studies). Only 7% of the studies employed statistical techniques more sophisticated than these. The purpose of this chapter, therefore, is to review these techniques and emphasize how they relate to the particular content study’s goals. In fact, analysis techniques should be carefully thought through in the context of study goals before any data are even collected. Decisions on data collection, measurement, and analysis are inextricably linked to one another, to the study’s overall research design, and to the questions or hypotheses the study addresses.
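To make these workhorse techniques concrete, the sketch below computes a chi-square test, Cramer's V, and Pearson's product-moment correlation on small invented datasets. It illustrates only the mechanics (SciPy is assumed to be available); later sections discuss when each statistic is appropriate.

```python
import math
from scipy.stats import chi2_contingency, pearsonr

# Hypothetical 2 x 2 table: outlet type (rows) by whether a citizen source appears (columns).
table = [[40, 60],
         [25, 75]]
chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V puts chi-square on a 0-1 scale: sqrt(chi2 / (n * (min(rows, cols) - 1))).
n = sum(sum(row) for row in table)
cramers_v = math.sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))

# Pearson's r for two interval- or ratio-level content measures (invented values).
story_length = [300, 450, 520, 610, 700, 820]
source_count = [2, 3, 3, 5, 6, 7]
r, r_p = pearsonr(story_length, source_count)

print(f"chi2={chi2:.2f}, p={p:.3f}, Cramer's V={cramers_v:.2f}, r={r:.2f}")
```

Which of these tools applies depends on the level of measurement of the variables and on whether the data were drawn as a probability sample, points developed in the sections that follow.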

Fundamentals of Analyzing Data Thinking about Data Analysis The goal of a particular data analysis may be relatively simple: to describe characteristics of a sample or population. For example, researchers may be interested in learning the frequency of occurrence of some particular characteristic to assess what is typical. By contrast, the goal may be to go beyond such description to illuminate relationships in some sample or population. To describe relationships, researchers would focus on illuminating patterns of association between characteristics of one thing and characteristics of another. Of course, some researchers might pursue both goals. Familiarity with the relevant previous research and well-focused questions facilitate data collection and are also crucial for good data analysis. Previous research and the thinking that goes into assessing its meaning are vital to focusing any data analysis. Previous research provides guidance on what variables to examine and on how to collect data to measure them. Earlier research also provides direction for the formulation of hypotheses or research questions that themselves lend focus to both the data collection and data analysis. Finally, effective replication of studies and the building of a coherent body of research may require using identical measures and data analysis techniques for maximum comparability across studies.

170  Data Analysis Hypotheses and Research Questions Quantitative content analysis is much more efficient when explicit hypotheses or research questions are posed than when a researcher collects data without either. A hypothesis is an explicit statement predicting that a state of one variable is associated with a state in another variable. A research question is more tentative, merely asking if such an association exists. Hypotheses or research questions permit research designs to focus on collecting only relevant data. Furthermore, an explicit hypothesis or research question permits the researcher to visualize the kind of analysis that addresses the hypothesis or question. Researchers can even prepare dummy tables to aid in visualization. In fact, the inability to visualize what the completed analysis tables “should look like” given the hypotheses or questions may well signal some problem in conceptualizing the study or in the collection and measurement of the data. If a hypothesis predicts, for example, that anonymous comments online are more likely to include negative language than signed comments, the simplest approach is to measure the type of comment (anonymous or signed) and the valence of the comment (positive, negative, or neutral). Note that this hypothesis can be addressed with nominal-level data (as discussed in Chapter 4). The hypothesis is obviously supported if a greater proportion of anonymous comments have negative language than the proportion of signed comments. Now assume the researcher is interested in the degree of negative language in anonymous comments. A more refined and detailed level of measurement, at the interval level, would be needed than the one using the simple presence or absence of negative language in a comment. Coders could count the number of negative, positive, and neutral statements in each comment. Averages or means for each type of valence can be calculated for both anonymous and signed comments. In this revised example, the hypothesis would be supported if the mean number of negative comments were higher for anonymous comments than for signed comments. Although a researcher’s specification of a hypothesis or research question affects the nature of data analysis, that analysis is also affected by whether the researcher plans to make inferences from the study content to a larger population of content. If all data from a population have been collected (e.g., all the poems of an author or all the tweets from a certain celebrity for a year), then that question is moot. The sample is the population. If only a small part of the known content is studied, how the data have been selected determines whether inferences about the parent population can be made. As discussed earlier in Chapter 5, probability sampling enables the researcher to make valid inferences to some population of interest. Only probability sampling enables researchers to calculate sampling error, a

measure of how much the sample may differ from the population at a certain level of confidence.
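For readers who want to see the arithmetic, the short sketch below (in Python, with a hypothetical sample size and proportion) computes this sampling error for a sample proportion at the 95% level of confidence; the conservative version assumes a 50/50 split, which is the figure published error tables typically report.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate sampling error for a proportion at the 95% level (z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

n = 400                 # hypothetical probability sample of 400 content units
p_observed = 0.15       # hypothetical: 15% of sampled units show the characteristic

# Conservative margin assuming a 50/50 split, as error tables usually do
print(round(margin_of_error(0.5, n), 3))         # about 0.049, i.e., roughly 5 points
# Margin based on the observed proportion itself
print(round(margin_of_error(p_observed, n), 3))  # about 0.035
```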

Describing and Summarizing Findings The researcher’s choice among several types of analysis depends on the goals of the research, the level at which variables have been measured, and whether data have been randomly sampled from some population. The analysis techniques we describe in the following section proceed from relatively simple techniques used to describe data to more complex ones that illuminate relationships. The techniques discussed here are far from exhaustive because several different analysis approaches can achieve the same goals. Describing Data Numbers are at the heart of the content coding process. It should not be surprising that counting is at the heart of the analysis. What may be surprising, however, is how often very basic arithmetic, such as calculating a mean or proportion, suffices to clarify what is found. Counting Once data have been collected using the appropriate level of measurement, one of the simplest summarizing techniques is to display the results in terms of the frequencies with which the values of a variable occurred. The content analysis coding scheme provides the basic guidance, of course, for such a display. For instance, in a study of 200 television programs, the data on the number of Latino/Latina characters can simply be described in terms of the raw numbers (e.g., 50 programs have Latino/Latina characters and 150 do not). Or, in counting the number of Latino/Latina characters, the total number of characters in the 50 programs can be displayed. Displaying data in these ways, however, may not be illuminating because raw numbers do not provide a reference point for discerning the meaning of those numbers. Thus, summarizing tools such as proportions or means are used, depending on the level of measurement employed for the variables being analyzed. Means and Proportions A mean is simply the arithmetic average of a number of scores that are measured at the interval or ratio level. The mean is a sensitive measure because it is influenced by and reflects each individual score. A mean provides

172  Data Analysis a reference point for what is most common or typical in a group. If the mean number of Latino/Latina characters is 1, one expects that many of the programs in the sample have 1 Latino/Latina character, although one also expects variability. Furthermore, the mean also has the advantage of being stable across samples. If several samples were taken from a population, the means would vary less than other measures of central tendency such as the median (the value that is a midpoint for the cases). A proportion can be used with variables measured at the nominal as well as interval or ratio level of measurement. The proportion reflects the degree to which a particular category dominates the sample or population. A proportion is illuminating because it too provides a context for discerning the meaning of findings. If 55 movies out of 100 have graphic violence, that works out to 55%. Because the reference point is 100%, the importance of such a finding is easily grasped, and comparisons are possible across samples (e.g., 55% of 1980s movies versus 60% of 1990s movies). Consider, as an example, a study of coverage of county governments using a national sample of daily and weekly newspapers (Fico et  al., 2013a). The authors calculated the mean number of unique sources quoted in articles by these two types of newspapers. They discovered that daily newspapers averaged 2.77 sources in county government stories and weeklies averaged 1.9 sources. The study also found that 14.2% of the daily newspaper stories about county government quoted ordinary citizens compared to 7.9% of the weekly newspaper stories. A question necessarily occurs about what to do when variables have been measured at the ordinal level (e.g., having coders assign favorability rankings to content). Although ordinal scales use numbers in much the same way interval or ratio scales do, an ordinal scale does not meet the mathematical assumptions of the higher levels. Furthermore, summary measures such as means used with ordinal scales merely “transfer” the underlying conceptual problem of what “more” of the concept means. In other words, if one does not really know how much more a favorability rating of 3 is compared to a favorability rating of 2, having an average favorability rating of 2.4 is not much additional help. The safe solution to analyzing data measured at the ordinal level is to report proportions for the separate values that make up the scale. As mentioned in Chapter 4, ordinal measures can create problems for content variables because of their lack of independence. Nominal-, interval-, and ratio-level variables are more often used and will meet the needs of content analysis. The Significance of Proportions and Means Data from samples can be easily described using the basic tools we just presented. However, if the data come from a probability sample, the aim

Data Analysis  173 is not just to describe the sample, but also to describe the population from which the data were drawn. Generalizing Sample Measures Calculating sampling error permits one to make inferences from a probability sample to a population. We introduced sampling error and level of confidence in Chapter 5. Recall that sampling error will vary with the size of the sample being used and with the level of confidence desired for the conclusions drawn from the analysis. However, for social science purposes, the conventional level of confidence is almost always at the 95% or 99% confidence levels. Consider an example involving a content analysis of a random sample of 400 prime time television shows that run between 8 and 11 p.m. Eastern time drawn from a population of such shows. The proportion of sample shows with African-American characters is 15%. Is this actually the proportion of such shows in the population of television programs? Might that population proportion actually be 20% or 10%? Sampling error allows a researcher to answer these questions at a given confidence level. Three ways are available to find sampling error for a sample of a given size. First, and simplest, error tables for given sample sizes are available online and frequently included in statistics books. Second, many data analysis computer programs include this in the output. Finally, hand computation is described in many statistics and research methods texts. For a sample size of 400 at the 95% level of confidence, the sampling error for the proportion works out to nearly five percentage points. Therefore, in the population of relevant prime time television shows, the proportion of shows with African-American characters could be as low as 10% or as high as 20%. The interval becomes smaller with larger samples. The Significance of Differences Describing findings from a random sample may be interesting, but frequently a research problem focuses on exploring possible differences in some characteristic in two or more such samples. In fact, hypotheses are often stated to emphasize the possibility of such a difference: “Facebook posts are more likely to have video links than are tweets.” The analysis frequently goes beyond simply describing if two (or more) sample means or proportions are different because an observed difference begs the question of why the difference occurs. However, when random sampling has been used to obtain samples, the first possible answer that must be considered is that the difference does not really exist in the population the samples came from and that it is an artifact of sampling error. Tests of the statistical significance of differences in means or proportions

174  Data Analysis address the likelihood that observed differences among samples could be explained by sampling error. Stated more specifically in terms of probability, tests for the significance of differences are used to assess the chance that an obtained difference in the means or proportions of two samples represents a real difference between two populations instead of a difference due to sampling error (null hypothesis). A study of the number of sources quoted in county government articles by daily and weekly newspapers (Fico et al., 2013b) reported dailies had a mean of 2.77 sources and weeklies had a 1.9 sources mean. The authors reported that the difference was statistically significant (probably not due to sampling error) at the p < .001 level. In other words, the difference likely exists in the populations of U.S. dailies and weeklies. Put more statistically, the chance is only one in a thousand that the observed difference was merely a sampling “fluke” that has misled the researchers. Two-Sample Differences and the Null Hypothesis The starting assumption of statistical inference such as the one above is that the null hypothesis is true—there really is no population difference between apparently different groups. Each group can now be considered a subsample because each member of each group was still selected using probability sampling. The question comes down to determining whether the two samples belong to one common population or really represent two distinct populations as defined by the independent variable. Probability samples reflect the population from which they are drawn, but not perfectly. For example, suppose the mean number of sexual references in a census of reality TV programs was subtracted from the mean number of sexual references in a census of other prime time TV programs—any difference could be a real one between the two populations. Samples from each population of programs, however, could turn up differences that are merely due to sampling variation. Do those differences reflect a real programming difference in the populations of interest or a sampling artifact? A difference of means test or a difference of proportions test calculates how likely it is that the difference between two groups found in a probability sample could have occurred by chance (sampling error). If the sample difference is so large that it is highly unlikely under the assumption of no real population difference, then the null hypothesis is rejected in favor of the hypothesis that the two groups in fact come from two different populations. Of course, the null hypothesis is rejected at a certain level of probability (usually set at the 95% level). There remains a 5% chance of making an error but a 95% chance that the right choice is being made.
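The logic can be made concrete with a small simulation. The sketch below (Python; the population proportion and sample sizes are invented for illustration) repeatedly draws two samples from the same population. Even though the null hypothesis is true by construction, the two sample proportions almost never match exactly, which is precisely the sampling variation a significance test must rule out.

```python
import random

random.seed(1)
POP_PROPORTION = 0.30   # assume 30% of all programs have the characteristic
N_PER_SAMPLE = 200      # hypothetical size of each sample

def sample_proportion():
    hits = sum(1 for _ in range(N_PER_SAMPLE) if random.random() < POP_PROPORTION)
    return hits / N_PER_SAMPLE

# Draw two samples from the SAME population 1,000 times and record the gap
gaps = [abs(sample_proportion() - sample_proportion()) for _ in range(1000)]

print(round(sum(gaps) / len(gaps), 3))   # typical gap produced by chance alone
print(round(max(gaps), 3))               # occasionally the gap is much larger
```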

The statistical measures used in the difference of proportions and difference of means tests are called z- and t-statistics. A z-statistic can be used with a difference of means and a difference of proportions test, whereas the t-statistic is used with variables measured at an interval or ratio level. For both statistics, a known sampling distribution indicates how likely a difference of a given size would be if the two samples were actually drawn from the same population (the null hypothesis of a true difference of zero). The sampling distribution for the t-statistic was developed for small samples (under 30), but the sampling distributions for the z- and t-statistics become approximately equal with sample sizes larger than 30. Standard statistical software computes these statistics and the probability that their magnitude could have occurred by chance alone, but it is still possible to calculate them by hand using standard textbook formulas. Examples of these formulas applicable to analyses using proportions and means are the following.

The difference of proportions test is

Z = (P1 − P2) / sqrt[ P1(1 − P1)/n1 + P2(1 − P2)/n2 ]

in which
P1 = the proportion of the first sample
n1 = the sample size of the first sample
P2 = the proportion of the second sample
n2 = the sample size of the second sample

The denominator is the estimate of the standard error of the difference in the proportions.

The difference of means test is

t = (X̄1 − X̄2) / S(X̄1 − X̄2)

in which the denominator, S(X̄1 − X̄2), is the estimate of the standard error of the difference between the sample means. When the two samples are assumed to have equal variances, the denominator uses the pooled variance:

S(X̄1 − X̄2) = sqrt[ ( (n1 − 1)S1 + (n2 − 1)S2 ) / (n1 + n2 − 2) × (n1 + n2) / (n1 n2) ]

When the sample variances are unequal, the denominator is

S(X̄1 − X̄2) = sqrt[ S1/n1 + S2/n2 ]

in which
X̄1 = the mean of the first sample group
X̄2 = the mean of the second sample group
S1 = the variance of the first sample group
S2 = the variance of the second sample group
n1 = the size of the first sample group
n2 = the size of the second sample group

The result of the computation is a value for z or t that is compared to probability values in a table to find how likely it is that the difference is due to sampling error rather than a real population difference. The values in the tables come from the sampling distributions for the z- or t-statistics. A low probability value (.05 or less) indicates that the two sample means are so different that they very likely reflect a real population difference between the two. This is just the inverse of saying one's confidence in the decision to reject the null hypothesis is at the 95% level.

Differences in Many Samples

A somewhat different approach is needed when the researcher is comparing the differences among three or more groups. As in the two-sample problem, the researcher wants to know if these samples all come from the same population. For example, the use of the term abortion in the four Republican platforms from the last four presidential elections could be compared to see if this issue gained in importance during this period. What is needed is a single, simultaneous test for the differences among the means. Why a single test and not simply a number of tests contrasting two means or proportions at a time? The reason is that if a great many comparisons are being made, some will turn up false differences due to random sampling alone. Recall that the 95% level of confidence is being used to reject the null hypothesis. That means that about 5% of the time an apparently significant difference will be obtained that does not truly represent any real difference in the population. Therefore, as the number of comparisons of sample differences becomes larger, it is more and more likely that at least one comparison will produce a false finding. Equally important, it is impossible to know which one is false. One possible way around this problem is to run a series of two-mean tests but with a more rigorous level of confidence (e.g., 99% or 99.9%).

Data Analysis  177 However, a single test that simultaneously compares mean differences is called an analysis of variance (ANOVA). Unlike difference of proportions and difference of means tests, ANOVA uses not only the mean, but also the variance in a sample. The variance is the standard deviation squared, and the standard deviation is a measure of how individual members of some group differ from the group mean. ANOVA is a test that asks if the variability between the groups being compared is greater than the variability within each of the groups. Obviously, variability within each group is to be expected, and some individual scores in one group may overlap with scores in the other groups. If all the groups really come from one population, then the variability between groups will approximately equal that within any one of them. Therefore, ANOVA computes an F-ratio that takes a summary measure of between-group variability and divides it by a summary measure of within-group variability: F = between-group variability / within-group variability As in the case of a difference in means and a difference in proportions test, the null hypothesis predicts no difference (i.e., all the groups come from the same population, and any difference is merely the result of random variation). The empirically obtained ratio from the groups can then be assessed to determine whether the null hypothesis should be rejected. The larger the obtained F, the bigger the differences there are among the various groups. A computer analysis program will display a numeric value for the calculated F along with a probability estimate that a difference this size could have occurred by chance under the null hypothesis of no difference in the population. The smaller that probability estimate, the more likely it is that the groups really do come from different populations. Table 9.1 summarizes the various descriptive measures used with nominal, ordinal, interval, and ratio data.
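As a rough illustration of the tests described above, the following Python sketch computes the difference of proportions z, the pooled (equal-variance) difference of means t, and a one-way ANOVA F by hand, following the formulas in this chapter. The citizen-sourcing percentages echo the county-government example cited earlier, but the per-sample n of 200 and the three groups of scores are hypothetical.

```python
import math
from statistics import mean, variance

def diff_of_proportions_z(p1, n1, p2, n2):
    """z for the difference between two sample proportions."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

def diff_of_means_t(group1, group2):
    """t for the difference between two sample means, pooled (equal-variance) version."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = variance(group1), variance(group2)           # sample variances
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (n1 + n2) / (n1 * n2))
    return (mean(group1) - mean(group2)) / se

def one_way_anova_f(*groups):
    """F = between-group variability / within-group variability."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: sources quoted per story in three kinds of outlets
dailies  = [3, 2, 4, 3, 2, 5, 3, 2]
weeklies = [2, 1, 2, 3, 1, 2, 2, 1]
online   = [1, 2, 1, 1, 2, 1, 3, 1]

print(round(diff_of_proportions_z(0.142, 200, 0.079, 200), 2))  # 14.2% vs. 7.9%, n assumed
print(round(diff_of_means_t(dailies, weeklies), 2))
print(round(one_way_anova_f(dailies, weeklies, online), 2))
```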

Finding Relationships Summary measures describing data and, where needed, their statistical significance are obviously important. However, as we suggested in Chapter 3, measures describing relationships are the key to the development of social science. Specifically, these measures are useful and necessary when the state of knowledge in a social science generates hypotheses about the relationship of two (or more) things. Such hypotheses are frequently stated in terms of “the more of one, the more (or less) of the other.” For example, “The more videos a website carries, the more unique visitors the site will have during a month.” Note that this hypothesis implies a higher level of measurement: the number of individual videos, for example, and unique visitors measured by the number of Internet Protocol addresses to visit the site during the month.

Table 9.1  Common data descriptive techniques in content analysis

Level of Measure    Summary Measure                    Significance Test (If Needed)
Nominal             Frequency                          (none)
                    Proportion                         Sampling error
                    Difference of proportions          z-test
Ordinal             Frequency                          (none)
                    Proportion                         Sampling error
                    Difference of proportions          z-test
Interval            Frequency                          (none)
                    Mean and standard deviation        Sampling error
                    Difference of means                z-test, t-test
                    ANOVA                              F-test
Ratio               Frequency                          (none)
                    Mean and standard deviation        Sampling error
                    Difference of means                z-test, t-test
                    ANOVA                              F-test

The Idea of Relationships Identifying how two variables covary, or correlate, is one of the key steps in identifying causal relationships, as noted in Chapter 8. The assumption is that such covariance is causally produced by something, that it is systematic and therefore recurring and predictable. The null hypothesis is that the variables are not related at all, that any observed association is simply random or reflects the influence of some other unknown force acting on the variables of interest. In other words, if the observed association is purely random, what is observed on one occasion may be completely different than what is observed on some other occasion. To restate one of the points we made in Chapter 8, covariation means that the presence or absence of one thing is observably associated with the presence or absence of another thing. Covariation can also be thought of as the way in which the increase or decrease in one thing is accompanied by the increase or decrease in another thing. These notions are straightforward and, in fact, relate to many things observed in the daily lives of most people. (One of them, romance, makes this quite explicit. The lovestruck ones are “going together,” and maybe later have a “parting of the ways.”) Although this notion of relationship is a simple one intuitively, it gets somewhat more complicated when we want to know the relative strength or degree of the relationship being observed. First, what is meant by a strong or weak relationship? What does a relationship that is somewhere in the middle look like? On what basis, if any, is there confidence that a relationship of some type and strength exists? To put a point to that last question, how confident can one be in one’s assumed knowledge of the particular relationship?

Data Analysis  179 Relationship Strength Some observed relationships are clearly stronger than others. Think about that strength of a relationship in terms of degree of confidence. If, for instance, one had to bet on a prediction about a relationship, what knowledge about the relationship would maximize the chances of winning? Betting confidence should come from past systematic observations (a social science approach) rather than subjectivity (e.g., “I feel lucky today”). Note that the question asked how to maximize the chances of winning rather than to “ensure winning.” Take a hypothetical example: Does the gender of a reporter predict the writing of stories about women’s issues? If the traditional concept of news values strictly guides reporters’ selection of stories, then gender would be inconsequential to story selection: men and women reporters would write equally often about women’s issues. If the prediction were that women were more likely than men to write about women’s issues, then gender should be systematically linked to observed story topic. Under the strongest possible relationship, all women write only about women’s issues, and no men do. In that case, knowing the gender of the reporter would enable the researcher to predict perfectly the topic of the reporter’s stories: 100% of bets would be won by simply predicting that every story written by a woman reporter dealt with a topic of interest to women, and that every story written by a man reporter would deal with some other kind of topic. Of course, seldom do such perfect relationships exist. For example, if women reporters write about 70% of their stories on women’s issues and men reporters write about 10% of their stories on women’s issues, one could better predict the likelihood of either gender producing stories about women’s issues than if these percentages were unknown. However, the prediction would not be correct 100% of the time. Past data can be useful in predicting future behaviors, but the degree to which the prediction would be correct can vary from never to 100% of the time. What is needed is a number or statistic that neatly summarizes the strength observed in relationships. In fact, several measures of association do exactly this and are employed depending on the level of measurement of the variables in the relationships being explored. Techniques for Finding Relationships The measures of association we describe in the following section do something similar to the preceding protracted hypothetical example. Based on data from a population or a sample, a mathematical pattern of the association, if any, is calculated. The measures of association we discuss in the following set a perfect relationship at 1 and a non-relationship at 0. A statistic closer to 1 thus describes a relationship with more substantive significance than a smaller one.

180  Data Analysis If the data used to generate the statistic measuring strength of association have been drawn from a probability sample, an additional problem exists. It is analogous to the problem in generalizing from a sample mean or proportion to a population mean or proportion. A statistical measure of association could merely be an artifact of sampling error, a sample that turns out by chance to be different in important ways from the population from which it was drawn. Procedures of statistical inference exist to permit researchers to judge when a relationship in randomly sampled data most likely reflects a real relationship in the population. Chi-Square and Cramer’s V Chi-square indicates the statistical significance of the relationship between two variables measured at the nominal level. Cramer’s V is one of a family of measures indexing the strength of that relationship. Cramer’s V alone suffices when all population data have been used to generate the statistic. Both measures are needed when data have been randomly sampled from some population of interest. Put another way, chi-square answers the key questions about the likelihood of the relationship being real in that population. Cramer’s V answers the question about the strength the relationship has in that population. The chi-square test of statistical significance is based on the assumption that the randomly sampled data have appropriately described, within sampling error, the population’s proportions of cases falling into the categorical values of the variables being tested. For example, a random sample of 400 television drama shows might be categorized into two values of a violence variable: “contains physical violence” and “no physical violence.” The same shows might also be categorized into two values of a sexuality variable: “contains sexual depictions” and “no sexual depictions.” Four possible combinations of the variables could be visualized in terms of a dummy 2×2 table: violence with sexual depictions, violence without sexual depictions, no violence but with sexual depictions, and no violence and no sexual depictions. A hypothesis linking the two variables might be that violent and sexual content are more likely to be present in shows together. If sample data seem to confirm this, how does chi-square put to rest the lingering anxiety that this may be a statistical artifact? Chi-square starts with the assumption that there is in the population only random association between the two variables and that any sample finding to the contrary is merely a sampling artifact. What, in the example just cited, might a purely random association between such variables as violence and sexuality look like? As in the hypothetical example using gender and story topic, chi-square constructs such a null pattern based on the proportions of the values of

the two variables being tested. Assume, for example, that 70% of all programs lack violence and 30% have violent depictions. Furthermore, suppose that half of all programs have some form of sexual content. If knowing the violence content of a show was of no help in predicting its sexual content, then sexual content should be included in about half of both the violent and the nonviolent programs. However, if the two types of depictions are associated, one would expect a much greater concentration of sex in programs that also have violence. For each cell in the table linking the two variables (violence, sex; violence, no sex; no violence, sex; no violence, no sex), chi-square calculates the theoretical expected proportions based on this null relationship. The empirically obtained data are then compared cell by cell with the expected null relationship proportions. Specifically, the squared difference between the observed and expected values in each cell goes into the computation of the chi-square statistic. Therefore, the chi-square statistic is large when the differences between empirical and theoretical cell frequencies are large and small when the empirically obtained data more closely resemble the pattern of the null relationship. In fact, when the empirically obtained relationship is identical to the hypothetical null relationship, chi-square equals 0. This chi-square statistic has known values that permit a researcher to reject the null hypothesis at the standard 95% and 99% levels of probability. The computational work in computing a chi-square is still simple enough to do by hand (although tedious if the number of cells in a table is large). Again, statistical computer programs produce chi-square readily. The formula for hand computation is

Chi-square = ∑ (fo − fe)² / fe

in which fo = the observed frequency for a cell fe = the frequency expected for a cell under the null hypothesis Knowing that a relationship is statistically significant or real in the population from which the sampled relationship has been obtained is important. Cramer’s V statistic can indicate how important, with values ranging from 0 to a perfect 1.0. Based literally on the computed chi-square measure, V also takes into account the number of cases in the sample and the number of values of the categorical variable being interrelated. Cramer’s V and chi-square make it possible to distinguish between a small but nonetheless real association between two variables in a population and an association that is both significant and relatively more important. Statistical significance alone is not a discerning enough measure

because a large enough sample will by itself “sweep up” small but real relationships. Cramer’s V therefore permits an assessment of the actual importance of the relationship in the population of interest. A statistically significant relationship that is small in the population of interest will produce a small V. A significant relationship that is large in the population will produce a large V, with a 1.0 indicating a perfect relationship. However, V tends to take low values because a V close to 1.0 would require extreme distributions. Cramer’s V is produced by computer analysis programs, but is easily calculated by hand once chi-square has already been produced:

V = sqrt[ X² / ( n × min(r − 1, c − 1) ) ]

in which
X² = the calculated chi-square for the table
n = the sample size
min(r − 1, c − 1) = the number of rows minus 1 or the number of columns minus 1, whichever is smaller

Higher-Level Correlation

Correlation techniques are also available for levels of measurement higher than the nominal. Spearman’s rank order correlation, or rho, can be used with ordinal-level data, and, as its name implies, is frequently used to determine how similarly two variables share common rankings. The computing formula for the statistic is as follows:

rs = 1 − 6 ∑D² / [ n(n² − 1) ]

in which
D = the difference between the two ranks for each case
n = the sample size

For example, a comparative study might rank the emphasis that two news sites give to an array of topics. Using raw frequency of stories might be misleading if one site has more articles, but converting frequencies to percentages makes the data comparable. Ranks can then be assigned to each site’s percentages to reflect topic emphasis; rank order correlation would show the sites’ comparability. Another study (Fico, Atwater, & Wicks, 1985) looked at rankings of source use provided by newspaper and broadcast reporters. Spearman’s rank order correlation made it possible to summarize the extent to which these reporters made similar valuations of the relative worth of particular kinds of sources.

Pearson’s product–moment correlation is employed with data measured at the interval and ratio levels. Unlike the example just cited, wherein topic emphasis was reduced to ranks, it employs the original measurement scales of the variables of interest, and because more information is provided by interval and ratio scales, Pearson’s provides a more sensitive summary of any degree of association. In fact, because of this, Pearson’s correlation is considered more powerful, able to turn up a significant association when the same data, analyzed using Spearman’s correlation, could not. The formula for the Pearson product–moment correlation is

r = ∑(X − X̄)(Y − Ȳ) / sqrt[ ∑(X − X̄)² × ∑(Y − Ȳ)² ]

in which
X = each case of the X variable
X̄ = the mean of the X variable
Y = each case of the Y variable
Ȳ = the mean of the Y variable

It is worth mentioning, however, that the Pearson correlation makes an important assumption about what it measures, specifically that any covariation is linear. What this means is that the increase or decrease in one variable is uniform across the values of the other. A curvilinear relationship would exist, for example, if one variable increased across part of the range of the other variable, then decreased across some further part of the range of that variable, then increased again. A relation would certainly exist, but not a linear one, and not one that could be well summarized by the Pearson measure. An easy way to envision a curvilinear relationship is to think about the relationship of coding time and reliability in content analysis. As a coder becomes more practiced in using a content analysis system during a coding session, reliability should increase; the relationship is steady and linear. However, after a time, fatigue occurs, and reliability curves or “tails off.” It is frequently recommended that a scatter diagram, as shown in Figure 9.1, be inspected if it is suspected that a linear relationship between the two variables of interest does not exist. In such a scatter diagram, each case relating the values of the two variables is plotted on a graph. If the two variables are highly related linearly, the dots representing the joint values will be tightly clustered and uniformly increasing or decreasing.

Figure 9.1  Scatter diagrams of correlations (panels show no correlation, positive correlation, negative correlation, and curvilinear correlation)

Both Spearman and Pearson correlation measures provide summary numbers for the strength of association between two variables. Both can range from a perfect −1 (negative) correlation to a perfect +1 (positive) correlation. In the case of a perfect negative correlation, for example, every instance in which one variable is high would find the variable in relation to it is correspondingly low. Because both variables are measured on scales using several or more points, the correlation measures are much more sensitive to small differences in the variables than would be the case for Cramer’s V. Spearman’s rank order correlation and Pearson’s product–moment correlation are thus more powerful tests than those available for nominal-level data. If a relationship actually exists in the population of interest, Spearman’s and Pearson’s correlations will find it when Cramer’s V might not. Perfect relationships are rare in the world, of course, and a data set will have a number of inconsistencies that depress the size of the correlations.
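The association measures discussed in this section can be computed by hand from their formulas. The sketch below (Python) works through a hypothetical 2 × 2 violence-by-sexual-content table for chi-square and Cramer’s V, and a hypothetical set of paired interval measures for Pearson’s r and Spearman’s rho; standard statistical packages report the same values along with their significance tests.

```python
import math

def chi_square(observed):
    """Chi-square for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n   # expected count under the null
            chi2 += (fo - fe) ** 2 / fe
    return chi2, n

def cramers_v(chi2, n, rows, cols):
    return math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def spearman_rho(x, y):
    """Rank-order correlation via 1 - 6*sum(D^2)/(n(n^2-1)); assumes no tied ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical 2x2 table: rows = violence yes/no, columns = sexual content yes/no
table = [[60, 40],
         [90, 210]]
chi2, n = chi_square(table)
print(round(chi2, 2), round(cramers_v(chi2, n, 2, 2), 2))

# Hypothetical paired measures for two content variables
videos   = [2, 5, 1, 7, 3, 9, 4, 6]
visitors = [11, 30, 8, 45, 20, 60, 24, 33]
print(round(pearson_r(videos, visitors), 2), round(spearman_rho(videos, visitors), 2))
```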

Data Analysis  185 Statistics textbooks usually consider correlations of .7 or above to be strong, correlations of between .4 to .7 to be moderate, and correlations between .2 and .4 to be weak to modest. However, recall that r is a measure of the strength of the linear relationship between two variables. Another use of r, however, the r-square proportion, also helps a researcher assess more precisely how important one variable’s influence on another is. R-square means the proportion of one variable’s variation accounted for by the other. Thus, an r of .7 produces an r-square of .49, meaning that just under half of one variable’s variance is linearly related to another variable’s variance. That is why r must be relatively large to be meaningfully related causally to another variable. Correlation and Significance Testing As just discussed in the context of chi-square and Cramer’s V, the correlations from randomly sampled data require statistical tests of significance for valid generalization to the populations of content of interest. The null hypothesis, in this case, is that the true correlation equals 0. As in the case of chi-square and Cramer’s V, the correlation coefficients also have mathematical properties that are well known. Therefore, the question about a correlation found in a random sample is whether it is large enough, given the size of the sample, that it cannot reasonably be due to chance. The answer is provided by an F-test of statistical significance. The larger the F, the greater the chance that the obtained correlation reflects a real correlation in the population rather than a statistical artifact generated from random sampling. The computational process producing the F is also accompanied by a probability value giving the probability that the relationship in the data was produced by chance. It is also possible to put a confidence interval around a Pearson’s correlation. With such an interval, the researcher can argue (at the 95% or higher confidence level) that the true population correlation is somewhere within the interval formed by the coefficient plus or minus the interval. Causal Modeling Finding relationships through the measures of association just described is important. However, life is usually more complicated than two things varying together in isolation. For example, it may be interesting and important to find a relationship between reporter gender and news story topic. Gender alone, however, is hardly likely to explain everything about story topic. In fact, gender may be a relatively small component of the total package of factors influencing the presence or absence of topics about, for example, women in the news.

186  Data Analysis Furthermore, these factors may not directly influence the variable of interest. Factor A, for example, may influence factor B, which then influences factor Y, which is what one really wants to know. More galling still, factor D, thought to be important in influencing factor Y, may not be influential at all. Factors A and B may really be influencing factor D, which then merely appears to influence factor Y (this situation is called spurious correlation). Researchers need a means of comprehending how all these factors influence each other, and ultimately some variable of interest. How much does each of these factors influence that variable of interest, correcting for any mutual relationships? The whole package of factors or variables directly or indirectly influencing the variation of some variable of interest can be assembled and tested in a causal model. Knowing what to include in the model and what to leave out is guided by theory, previous research, and logic. This is similarly the case when predicting which variables influence which other variables in the model, whether that influence is positive or negative, and the relative magnitude of those influences. What is interesting and important in such a model is that it permits researchers to grasp a bigger, more complex piece of reality all tied into a conceptually neat package that is relatively easy to comprehend. Furthermore, each assumed influence in the model provides guidance to the whole community of researchers working on similar or related problems. Such models can be tested in a variety of ways, including path analysis and structural equation modeling. However, first comes the model; seeing how well it actually fits data comes later. In fact, one of the easiest ways to think about a model of multiple causal influences is to draw a picture of it. Such models are easily drawn, as illustrated by Figure 9.2, used to predict fair and balanced reporting as an outcome of economic and newsroom factors (Lacy, Fico, & Simon, 1989). Note first that each variable is named. Variables causally prior to others are on the left, with the “ultimate” dependent variable on the extreme right. The arrows indicate the assumed causal flows from one variable to the next. The plus and minus signs indicate the expected positive or negative relationships. The arrows and signs are the symbolic representation of hypotheses presented explicitly in the study. Arrows that lack such signs would indicate research questions or simply lack of knowledge about what to expect. Note that in the example model, there are six arrows with signs that correspond to explicit study hypotheses. For this model, the causal relationship flows in one direction. However, models can involve variables influencing each other. Mutual influence between variables can occur in two ways. First, the influence between two variables occurs either simultaneously or so quickly that a time lag cannot be measured. Second, the influence between two variables is cyclical with a lag that can be measured. Models can be drawn that incorporate these reciprocal relationships.

Figure 9.2  Hypothesized model showing relationships between economic, newsroom, and content variables (economic variables: circulation as a control, group ownership, intercity competition, and direct competition; newsroom variable: average reporter work load; content variables: imbalance and fairness; plus and minus signs on the arrows mark the hypothesized positive and negative relationships)

As noted, such models give guidance to future research, but a model itself also undergoes change. This change occurs both theoretically and empirically. First, the model grows as variables outside the current model are brought into it. For example, new variables causally prior to all the other variables in the model may be added. In addition, new variables may be added that intervene between two already included in the model. Such models also change as they undergo empirical tests. Specifically, each arrow linking two variables in the model can be tested against data to determine the validity of the relationship. Furthermore, the whole model can be tested all at once to determine the validity of all its separate parts and its overall usefulness as a model describing social reality. Those interested in testing causal models should consult an advanced statistics text, such as Tabachnick and Fidell (2013).

Multiple Regression

Techniques such as ordinary least squares regression and its variations are needed to assess how well variables in a causal model and the paths among them explain variation in some dependent variable of interest. Multiple regression permits assessment of the nature of the linear relationship between two or more variables and some dependent variable of interest. Correlation can only indicate when two things are strongly (or weakly) related to each other. Multiple regression can indicate how, for

188  Data Analysis every unit increase in each independent variable, the dependent variable will have some specified change in its unit of measure. Multiple regression requires that the dependent variable be interval or ratio level, although the independent variables can be dichotomous in nature (called dummy variables). (A form of regression called logistic regression is available to assess independent variable effects on a dependent variable measured at the nominal level, and readers should consult an advanced statistics book for details on using this.) When all independent variables are dummy variables, multiple regression is equivalent to ANOVA. The technique also assumes that each of these interval/ratio variables is normally distributed around its mean. Whether the data set meets this requirement can be assessed by examining each variable’s measures of skewness or departure from a normal distribution. Because small samples are likely to be more skewed, the technique is also sensitive to the overall number of cases providing data for the analysis. Tabachnick and Fidell (1996, p. 132) report that testing the multiple correlation (the effect of all the independent variables together on the dependent variable of interest) requires a minimum of at least 50 cases plus eight times the number of independent variables. In order to test the effect of individual independent variables on the dependent variable, the sample would need to be at least 104 plus the number of independent variables. For example, if a researcher had an equation with six independent variables, he or she would need a sample of at least 98 cases (50 + 48) to test the multiple correlation and 110 cases (104 + 6) to test the relationships between the dependent variable and individual independent variables. Multiple regression assesses the nature of the way variables vary together, but it does so controlling for all the ways other variables in the model are varying as well. Think of it this way: multiple regression correlates each independent variable with the dependent variable at each measurement level of all the other variables. Regression analysis creates an equation that allows the best prediction of the dependent variable based on the data set. The equation takes the following form: y = a + b1 X1 + b2 X2 +  bn Xn + e. In the equation, y is the value of the dependent variable when various values of the independent variables (X1 X2 . . . Xn) have been placed in the equation. The letter a represents an intercept point and would be the value of y when all the Xs equal zero. The e represents the error term, which is the variation in y not explained by all the Xs. The error term is sometimes dropped, but it is important to remember all statistical analysis has error.

Each independent variable has a regression coefficient, which is represented by b1, b2 . . . bn. This coefficient is the amount by which the X value is multiplied in figuring the y value; it specifies how much the dependent variable changes for a given change in each independent variable. However, regression coefficients are expressed in the original units of the variables, and because of this, they can be difficult to compare. To compare the contributions of independent variables, the regression coefficients can be standardized. Standardization of coefficients is similar to standardization of exam scores, or putting the scores on a curve. Each variable is standardized by subtracting the variable mean from each score and dividing by the standard deviation, which expresses the scores in standard deviation units. The standardized, or beta, coefficients are most useful for within-model comparisons of the relative importance of each independent variable’s influence on the dependent variable. Beta weights are not comparable across data sets. Multiple regression computes a beta for each independent variable. The beta varies according to each variable’s standard deviation. The interpretation is that for each change of 1 SD in the independent variable, the dependent variable changes by some part of its standard deviation as indicated by the beta coefficient. For example, a beta of .42 means that for each increase of 1 SD in the independent variable, the dependent variable would increase by .42 of its standard deviation. If a second independent variable had a beta of .22, it is easy to see that it is less influential because its variation produces relatively less variation in the dependent variable. An additional statistic, used along with multiple regression, is the multiple r-squared statistic (coefficient of determination). The multiple r-squared statistic is the proportion of the dependent variable’s variance that is accounted for by all of the variation of the independent variables in the model. In other words, a large multiple r-squared produced by a model means that the set of variables included is indeed substantively important in illuminating the social processes being investigated. A smaller multiple r-squared means that independent variables outside the model are important and in need of investigation. The adjusted multiple r-squared modifies multiple r-squared by taking into consideration the number of independent variables and the number of cases. When probability samples are used with multiple regression, adjusted multiple r-squared is a better measure of fit. Finally, if the data were drawn from a random sample, a test of statistical significance is necessary to determine whether the coefficients found in the regression analysis are really zero or reflect some actual relationship in the population. Regression analysis also generates significance tests to permit the assessment of each coefficient and of the entire set of variables in the regression analysis as a whole. Table 9.2 summarizes various measures of association used with nominal, ordinal, interval, and ratio data.
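A compact sketch of these regression quantities appears below (Python with NumPy; the data set is hypothetical and far smaller than the minimum case counts recommended above, so it is illustrative only). It fits an ordinary least squares equation, then computes r-squared, adjusted r-squared, and standardized beta coefficients.

```python
import numpy as np

# Hypothetical data: predict an imbalance score from work load and a competition dummy
y  = np.array([4.0, 6.0, 3.0, 8.0, 5.0, 9.0, 2.0, 7.0, 6.0, 5.0])
x1 = np.array([10., 14., 9., 20., 12., 22., 8., 18., 15., 13.])   # reporter work load
x2 = np.array([1., 0., 1., 0., 1., 0., 1., 0., 0., 1.])           # dummy: competition present

X = np.column_stack([np.ones(len(y)), x1, x2])      # intercept column plus predictors
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)      # a, b1, b2

y_hat = X @ coeffs
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

n, k = len(y), 2                                    # cases, independent variables
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Standardized (beta) coefficients: b * (SD of x / SD of y)
betas = [coeffs[j + 1] * np.std(x, ddof=1) / np.std(y, ddof=1)
         for j, x in enumerate([x1, x2])]

print(np.round(coeffs, 3), round(r_squared, 3), round(adj_r_squared, 3), np.round(betas, 3))
```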

190  Data Analysis Feedback Relationships As mentioned previously in this chapter, establishing the time order of causal relationships can be complicated for models with feedback loops where concepts influence each other across time. These feedback loops can vary in time required for the influence to occur. For example, news content influences the public agenda, which in turn influences news content, which in turn influences the public agenda. Theory can be helpful in understanding these feedback loops, but the existence of the loops can have an impact on the empirical testing of the feedback loops. Survey data are collected simultaneously and can create difficulty in identifying time-order for examining causal relationships. Both the research design process and statistics can be useful in dealing with these loops. If the profit level for a streaming video service is hypothesized to affect the number and quality of streaming series produced by the service, then the profits must be measured before the series are counted and evaluated for quality (however measured). Of course, the number and quality of series may well influence the profit level, which would require that the quality and number of series be measured before the profit levels. Two statistical techniques are used often to examine the mutual influence of variables or to control for feedback loops. First, structural equation modeling (SEM) is a set of procedures that require an explicit model to be tested by evaluating the simultaneous relationships among a collection of latent variables, which are unobservable (e.g., attitudes), and measured variables, which are observable (e.g. content). It has many uses, but can create ambiguity in the interpretation because of the complexity of the models (Tabachnick & Fidell, 2013). Feedback loops also can be examined using two-stage multiple regression. Used often in econometrics, this approach controls for the simultaneous influence of the dependent and independent variables. It allows a researcher to evaluate the influence of variable x (independent variable) on y (dependent variable) when controlling for the influence of y on x (for more background, see Wooldridge, 2015). A detailed discussion of SEM and two-stage regression is beyond the scope of this volume, but researchers should be aware of any feedback relationships and simultaneity within the models they are testing and access information about useful statistics in evaluating the relationships.
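The two-stage logic can be sketched as follows (Python with NumPy; the variables, the instrument, and the simulated feedback structure are all hypothetical and are not drawn from this chapter). The endogenous predictor is first regressed on an instrument assumed to affect it but not the outcome directly, and its fitted values then replace it in the second-stage equation; a dedicated two-stage routine would also correct the second-stage standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical simultaneous system: an unobserved factor ('common') drives both
# profits and series quality, so an ordinary regression of quality on profits is biased.
instrument = rng.normal(size=n)             # assumed to affect profits only
common = rng.normal(size=n)
profits = 2.0 * instrument + common + rng.normal(size=n)
quality = 1.5 * profits + common + rng.normal(size=n)

def ols_slope(x, y):
    X = np.column_stack([np.ones(len(y)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]

print(round(ols_slope(profits, quality), 2))        # biased upward from the true 1.5

# Stage 1: regress the endogenous predictor on the instrument; keep the fitted values
Xz = np.column_stack([np.ones(n), instrument])
b1, *_ = np.linalg.lstsq(Xz, profits, rcond=None)
profits_hat = Xz @ b1

# Stage 2: regress the outcome on the fitted values from stage 1
print(round(ols_slope(profits_hat, quality), 2))    # close to the true 1.5
```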

Statistical Assumptions These procedures were presented in a manner designed to be intuitively easy to grasp; however, one runs the risk of oversimplifying. In particular, statistical procedures carry certain assumptions about the data being analyzed. If the data differ to a great degree from the assumed conditions (e.g., a few extreme values or outliers with regression analysis), the analysis will lack validity. Researchers should always test data for these assumptions. For example, Weber (1990) pointed out that content analysts

Table 9.2  Common data association techniques in content analysis

Level of Measure    Summary Measure     Significance Test (if Needed)
Nominal             Cramer's V, Phi     Chi-square
Ordinal             Spearman's rho      z-test
Interval            Pearson's r         F-test
                    Regression          F-test
Ratio               Pearson's r         F-test
                    Regression          F-test

should be particularly careful in this regard when transforming frequencies, time, and space measures into percentages to control for length of a document. Percentages have a limited range, and the distribution is not linear; means and variances for percentages are not independent; and content analysis data are often not normally distributed. Linearity, independence of mean and variance, and normal distribution are assumptions for commonly used statistical procedures. When transforming content measures to percentages and using sophisticated statistical analysis, data should be checked to see if they fit assumptions. Statistical procedures vary in how sensitive they are to violations of assumptions. With some procedures, minor violations will not result in invalid conclusions. However, researchers will have more confidence in their conclusions if data are consistent with statistical assumptions. Readers should consult statistics books to help them evaluate assumptions about data (Blalock, 1972; Tabachnick & Fidell, 2013).
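A quick screening step along these lines might look like the following sketch (Python; the data and the informal reading of the statistic are illustrative rather than rules from this chapter), which computes a simple moment-based skewness measure for a percentage variable before more sophisticated analysis is attempted.

```python
from statistics import mean, stdev

def skewness(values):
    """Simple moment-based skewness; values near 0 suggest a roughly symmetric distribution."""
    m, s, n = mean(values), stdev(values), len(values)
    return sum(((x - m) / s) ** 3 for x in values) / n

# Hypothetical variable: percentage of each document devoted to a topic
topic_pct = [2, 3, 0, 5, 1, 0, 4, 2, 45, 3, 1, 0, 2, 6, 1]

print(round(skewness(topic_pct), 2))   # a large positive value flags a long right tail
```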

Summary Data analysis is exploration and interpretation, the process of finding meaning in what has been observed. Regardless of what the numbers turned up through statistical techniques, deriving meaning from them is the goal. Statistical analysis can help people understand data patterns only when the analysis is conducted in ways consistent with standard practices. This chapter is a very brief survey of some often-used statistics. Which statistics are appropriate to a particular study depends on the hypotheses or research questions, the level of measurement of variables, and the nature of the sample. Like any good tool, statistics must be appropriate to the project. One size does not fit all. Used properly, statistical techniques are valuable ways of expanding one’s understanding. Yet they can generate puzzles and questions not thought of before. It is the rare study that does not contain a sentence beginning, “Further research is needed to . . .” For most researchers, that sentence is less an acknowledgment of the study’s limitations and more of an invitation to join the exploration.

Appendix Reporting Standards for Content Analysis Articles

The following suggestions are aimed at standardizing the reporting process for content analysis articles. The suggestions are based on Lombard et al. (2004) and Lacy et al. (2015). These reporting standards represent the need for replication as a foundational element of social science. Replication requires sufficient and detailed information about a study.

Sampling

• The nature and selection process of the study sample should be clearly described in the article. This requires a specific and detailed description of the sampling method (census, simple random sampling, etc.) and a justification for the sampling method.
• If a probability sample is used, the population and sampling frame should be explicitly described. Report the sample and, if possible, the population sizes.
• If a probability sample is used, descriptive statistics (mean, median, range, and standard deviation) should be reported for each variable in a footnote, table, or text.
• If a non-probability sample is used, it should be justified and its limitations specified.

Coders, Variables, and Protocol

• Articles should be transparent about study variables that ultimately failed to reach acceptable levels and were thus dropped. As noted above, science is cumulative, and reporting on efforts that are unsuccessful can nonetheless help advance communication research both theoretically and methodologically, as well as allow other scholars to learn from such experiences.
• Articles should report the number of coders employed and who supervised the coding, the administration of reliability testing, and so on.
• Articles should report how the coding work was distributed (what percentage of it was done by the PI or the coder[s], whether sets of coding units/assignments were assigned randomly to avoid systematic error, etc.).
• The role, if any, that the coders played in developing the protocol should be reported.

Reliability

• The sample used for the reliability check should either be a census of the study sample or a randomly selected subgroup of the study sample. As discussed in Chapter 6, make sure the sample is large enough to represent all variables and categories and report how you determined it was large enough. Report selection method and the number (not percentage) of units used in the reliability check.
• Reliability of protocol should be established during a pilot test before the study coding begins. This step should be reported.
• Articles should report how coding reliability problems were resolved (retraining and retesting, coder consensus, dropping the variables, etc.).
• Establishing reliability of the protocol should use units from the study sample and be conducted during the coding process. Coders should not know which content units are being used in the reliability check.
• If probability sampling was used to generate a reliability sample, the process for determining the number of reliability cases should be explained and justified by citing literature. Two processes are mentioned in this volume (Krippendorff, 2013; Lacy & Riffe, 1996).
• Until the controversy about reliability coefficients is resolved, Krippendorff’s CAlpha should be reported for each variable, along with the percentage of agreement, which could be placed in a footnote. If one or more variables show a high level of agreement but low level for the reliability coefficient, the data should be examined to determine why (see Chapter 6). If the variable data are skewed, the author should report Gwet’s AC2 along with CAlpha and argue for why AC2 is appropriate.
• In addition to reporting the reliability coefficients, a confidence interval should be reported for each.
• The article should justify the decision that the reliability is sufficiently high for each variable to be included in the analysis.

References

Allen, C. J., & Hamilton, J. M. (2010). Normalcy and foreign news. Journalism Studies, 11, 634–649. Allen, M., D’Alessio, D., & Brezgel, K. (1995). A meta-analysis summarizing the effects of pornography II. Human Communication Research, 22(2), 258–283. Allport, G. W. (Ed.) (1965). Letters from Jenny. New York: Harcourt, Brace & World. Altschull, J. H. (1995). Agents of power (2nd ed.). New York: Longman. An, S., Jin, H. S., & Pfau, M. (2006). The effects of issue advocacy advertising on voters’ candidate issue knowledge and turnout. Journalism & Mass Communication Quarterly, 83(1), 7–24. Armstrong, C. L., & Boyle, M. P. (2011). Views from the margins: News coverage of women in abortion protests, 1960–2006. Mass Communication and Society, 14(2), 153–177. Austin, E. W., Pinkleton, B. E., Hust, S. J. T., & Coral-Reaume Miller, A. (2007). The locus of message meaning: Differences between trained coders and untrained message recipients in the analysis of alcoholic beverage advertising. Communication Methods and Measures, 1(2), 91–111. Babbie, E. (2013). The basics of social research. Boston, MA: Cengage Learning. Baden, C., & Tenenboim-Weinblatt, K. (2017). Convergent news? A longitudinal study of similarity and dissimilarity in the domestic and global coverage of the Israeli–Palestinian conflict. Journal of Communication, 67, 1–25. doi: 10.1111/jcom.12272 Baldwin, T., Bergan, D., Fico, F., Lacy, S., & Wildman, S. S. (2009, July). News media coverage of city governments in 2009. Research report. Quello Center for Telecommunication Management and Law, Michigan State University. Ball-Rokeach, S. J., Rokeach, M., & Grube, J. W. (1984). The great American values test: Influencing behavior and belief through television. New York: Free Press. Bantz, C. R., McCorkle, S., & Baade, R. C. (1997). The news factory. In D. Berkowitz (Ed.), Social meanings of news: A text reader (pp. 269–285). Thousand Oaks, CA: Sage. Barnhurst, K. G., & Mutz, D. (1997). American journalism and the decline in event-centered reporting. Journal of Communication, 47(4), 27–53. Bastien, F. (2018). Using parallel content analysis to measure mediatization of politics: The televised leaders’ debates in Canada, 1968–2008. Journalism, 1–19, published online January 22, 2018. doi: 10.1177/1464884917751962

References  195 Bauer, R. A. (1964). The obstinate audience: The influence process from the point of view of social communication. The American Psychologist, 19, 319–328. Baxter, R. L., DeRiemer, C., Landini, N., Leslie, L., & Singletary, M. W. (1985). A content analysis of music videos. Journal of Broadcasting & Electronic Media, 29, 333–340. Beam, R. A. (2003). Content differences between daily newspapers with strong and weak market orientations. Journalism & Mass Communication Quarterly, 80, 368–390. Beam, R. A., & Di Cicco, D. T. (2010). When women run the newsroom: Management change, gender, and the news. Journalism & Mass Communi­ cation Quarterly, 87, 393–411. Bennett, L. W. (1990). Toward a theory of press–state relations in the United States. Journal of Communication, 40(2), 103–127. Berelson, B. R. (1952). Content analysis in communication research. New York: Free Press. Berkowitz, D. (Ed.) (2011). Cultural meanings of news: A text-reader. Thousand Oaks, CA: Sage. Bialik, C. (2012, February 12). Tweets as poll data? Be careful. The Wall Street Journal. Retrieved July 12, 2012, from http://online.wsj.com/article/SB10001 424052970203646004577213242703490740.html. Blalock, H. M. L., Jr. (1972). Social statistics (2nd ed.). New York: McGraw-Hill. Bogart, L. (2004). Reflections on content quality in newspapers. Newspaper Research Journal, 25(1), 40–53. Bradac, J. (Ed.) (1989). Message effects in communication science. Newbury Park, CA: Sage. Brown, J. D., & Campbell, K. (1986). Race and gender in music videos: The same beat but a different drummer. Journal of Communication, 36(1), 94–106. Brummette, J., DiStaso, M., Vafeiadis, M., & Messner, M. (2018). Read all about it: The politicization of “fake news” on Twitter. Journalism & Mass Communication Quarterly, 95, 497–517. Bruns, A., & Liang, Y. E. (2012). Tools and methods for capturing Twitter data during natural disasters. First Monday, 17(4). Bryant, J. (1989). Message features and entertainment effects. In J. Bradac (Ed.), Message effects in communication science (pp. 231–262). Newbury Park, CA: Sage. Bryant, J., Roskos-Ewoldsen, D., & Cantor, J. (Eds.) (2003). Communication and emotion: Essays in honor of Dolf Zillmann. Mahwah, NJ: Lawrence Erlbaum Associates. Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally. Cantril, H., Gaudet, H., & Hertzog, H. (1940). The invasion from Mars. Princeton, NJ: Princeton University Press. Carey, J. W. (1996). The Chicago school and mass communication research. In E. E. Dennis & E. Wartella (Eds.), American communication research: The remembered history (pp. 21–38). New York: Routledge. Carpenter, S., Boehmer, J., & Fico, F. (2016). The measurement of journalistic role enactment: A study of organizational constraints and support in for-profit and nonprofit journalism. Journalism & Mass Communication Quarterly, 93, 587–608.

196 References Ceron, A., Curini, L., & Iacus, S. M. (2016). Politics and big data: Nowcasting and forecasting elections. London: Routledge. Chaffee, S. H., & Hochheimer, J. L. (1985). The beginnings of political communication research in the United States: Origins of the “limited effects” model. In M. Gurevitch & M. R. Levy (Eds.), Mass communication yearbook 5 (pp. 75–104). Beverly Hills, CA: Sage. Chau, M., & Chen, H. (2008). A machine learning approach to web page filtering using content and structure analysis. Decision Support Systems, 44(2), 482–494. Coffey, A. J., & Cleary, J. (2008). Valuing new media space: Are cable network news crawls cross-promotional agents? Journalism & Mass Communication Quarterly, 85, 894–912. Cohen, J. A. (1960). Coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 31–46. Cohen, J. A. (1968). Weighted kappa: Nominal scale agreement with a provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220. Cohen, S., & Young, J. (Eds.) (1981). The manufacture of news. London: Constable. Connolly-Ahern, C., Ahern, L. A., & Bortree, D. S. (2009). The effectiveness of stratified sampling for content analysis of electronic news source archives: AP Newswire, Business Wire, and PR Wire. Journalism & Mass Communication Quarterly, 86, 862–883. Conway, B. A., Kenski, K., & Wang, D. (2015). The rise of Twitter in the political campaign: Searching for intermedia agenda-setting effects in the presidential primary. Journal of Computer-Mediated Communication, 20, 363–380. doi: 10.1111/jcc4.12124 Conway, M. (2006). The subjective precision of computers: A methodological comparison with human coding in content analysis. Journalism & Mass Communication Quarterly, 83(1), 186–200. Correa, T., & Harp, D. (2011). Women matter in newsrooms: How power and critical mass relate to the coverage of the HPV vaccine. Journalism & Mass Communication Quarterly, 88, 301–319. Craft, S. H., & Wanta, W. (2004). Women in the newsroom: Influences of female editors and reporters on the news agenda. Journalism & Mass Communication Quarterly, 81, 124–138. Culbertson, H. M. (1975, May 14). Veiled news sources—who and what are they? ANPA News Research Bulletin, No. 3. Culbertson, H. M. (1978). Veiled attribution—an element of style? Journalism Quarterly, 55, 456–465. Culbertson, H. M., & Somerick, N. (1976, May 19). Cloaked attribution—what does it mean to readers? ANPA News Research Bulletin, No. 1. Culbertson, H. M., & Somerick, N. (1977). Variables affect how persons view unnamed news sources. Journalism Quarterly, 54, 58–69. Danielson, W. A., & Adams, J. B. (1961). Completeness of press coverage of the 1960 campaign. Journalism Quarterly, 38, 441–452. Danielson, W. A., Lasorsa, D. L., & Im, D. S. (1992). Journalists and novelists: A study of diverging styles. Journalism Quarterly, 69, 436–446. Davis, J., & Turner, L. W. (1951). Sample efficiency in quantitative newspaper content analysis. Public Opinion Quarterly, 15, 762–763.

References  197 Davison, K. K., Gicevic, S., Aftosmes-Tobio, A., Ganter, C., Simon, C. L., Newlan, S., & Manganello, J. A. (2016). Fathers’ representation in observational studies on parenting and childhood obesity: A systematic review and content analysis. American Journal of Public Health, 106(11), e14–e21. Deese, J. (1969). Conceptual categories in the study of content. In G. Gerbner, O. R. Holsti, K. Krippendorff, W. J. Paisley, & P. J. Stone (Eds.), The analysis of communication content (pp. 39–56). New York: Wiley. De Swert, K. (2012). Calculating inter-coder reliability in media content analysis using Krippendorff’s alpha. Center for Politics and Communication, 1–15. Retrieved January 16, 2019, from www.polcomm.org/wp-content/uploads/ ICR01022012.pdf. de Vreese, C. H. (2004). The effects of frames in political television news on issue interpretation and frame salience. Journalism & Mass Communication Quarterly, 81, 36–52. de Vreese, C. H. (2010). Framing the economy: Effects of journalistic news frames. In P. D’Angelo & J. A. Kuypers (Eds.), Doing news framing analysis: Empirical and theoretical perspectives (pp. 187–214). New York: Routledge. de Vreese, C. H., & Boomgaarden, H. (2006). Valenced news frames and public support for the EU. Communications, 28(4), 361–381. Di Cicco, D. T. (2010). The public nuisance paradigm: Changes in mass media coverage of political protest since the 1960s. Journalism & Mass Communication Quarterly, 87, 135–153. Dick, S. J. (1993). Forum talk: An analysis of interaction in telecomputing systems. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI. Dill, R. K., & Wu, H. D. (2009). Coverage of Katrina in local, regional, national newspapers. Newspaper Research Journal, 30(1), 6–20. Dominick, J. R. (1999). Who do you think you are? Personal home pages and self presentation on the World Wide Web. Journalism & Mass Communication Quarterly, 77, 646–658. Döring, N., Reif, A., & Poeschl, S. (2016). How gender-stereotypical are selfies? A content analysis and comparison with magazine adverts. Computers in Human Behavior, 55, 955–962. doi: 10.1016/j.chb.2015.10.001 Druckman, J. N., Kifer, M. J., & Parkin, M. (2010). Timeless strategy meets new medium: Going negative on congressional campaign web sites, 2002–2006. Political Communication, 27(1), 88–103. Druckman, J. N., Kifer, M. J., & Parkin, M. (2014). Congressional campaign communications in an Internet age. Journal of Elections, Public Opinion, and Parties, 24, 20–44. doi: 10.1080/17457289.2013.832255 Druckman, J. N., Kifer, M. J., & Parkin, M. (2017). Consistent and cautious: Online congressional campaigning in the context of the 2016 presidential election. In J. Baumgartner & T. Towner (Eds.), The Internet and the 2016 presidential campaign (pp. 3–24). Lanham, MD: Lexington. Druckman, J. N., Kifer, M. J., & Parkin, M. (2018). Resisting the opportunity for change: How congressional campaign insiders viewed and used the Web in 2016. The Social Science Computer Review, 36, 392–405. doi: 10.1177/0894439317711977 Druckman, J. N., Kifer, M. J., Parkin, M., & Montes, I. (2017). An inside view of congressional campaigning on the Web. Journal of Political Marketing. Advanced online publication. doi: 10.1080/15377857.2016.1274279

198 References Duffy, M. J., & Williams, A. E. (2011). Use of unnamed sources drops from peaks in 1960s and 1970s. Newspaper Research Journal, 32(4), 6–21. Emery, M. C., Emery, E., & Roberts, N. L. (2000). The press and America: An interpretive history of the mass media (9th ed.). Boston, MA: Allyn & Bacon. Entman, R. M. (2010). Framing media power. In P. D’Angelo & J. A. Kuypers (Eds.), Doing news framing analysis: Empirical and theoretical perspectives (pp. 331–355). New York: Routledge. Epps, A. C., & Dixon, T. L. (2017). A comparative content analysis of antiand prosocial rap lyrical themes found on traditional and new media outlets. Journal of Broadcasting & Electronic Media, 61, 467–498, doi: 10.1080/08838151.2017.1309411 Everbach, T. (2005). The “masculine” content of a female-managed newspaper. Media Report to Women, 33, 14–22. Feng, G. C., & Zhao, X. (2016). Do not force agreement: A response to Krippendorff (2016). Methodology, 12(4), 145–148. Fico, F. (1985). The search for the statehouse spokesman. Journalism Quarterly, 62, 74–80. Fico, F., Atwater, T., & Wicks, R. (1985). The similarity of broadcast and newspaper reporters covering two state capitals. Mass Communication Review, 12, 29–32. Fico, F., & Cote, W. (1997). Fairness and balance in election reporting. Newspaper Research Journal, 71(3–4), 124–137. Fico, F., & Cote, W. (1999). Fairness and balance in the structural characteristics of stories in newspaper coverage of the 1996 presidential election. Journalism & Mass Communication Quarterly, 76, 123–137. Fico, F., & Drager, M. (2001). Partisan and structural balance in news stories about conflict generally balanced. Newspaper Research Journal, 22(1), 2–11. Fico, F., Lacy, S., Wildman, S. S., Baldwin, T., Bergan, D., & Zube, P. (2013a). Citizen journalism sites as information substitutes and complements for newspaper coverage of local governments. Digital Journalism, 1(1), 152–168. Fico, F., Lacy, S., Baldwin, T., Wildman, S. S., Bergan, D., & Zube, P. (2013b). Newspapers devote far less coverage to county government coverage than city governance. Newspaper Research Journal, 34(1), 104–111. Fico, F., Richardson, J., & Edwards, S. (2004). Influence of story structure on perceived story bias and news organization credibility. Mass Communication and Society, 7, 301–318. Fico, F., & Soffin, S. (1995). Fairness and balance of selected newspaper coverage of controversial national, state and local issues. Journalism & Mass Communication Quarterly, 72, 621–633. Fontenot, M., Boyle, K., & Gallagher, A. H. (2009). Comparing type of sources in coverage of Katrina, Rita. Newspaper Research Journal, 30(1), 21–33. Fouts, G., & Burggraf, K. (1999). Television situation comedies: Female body images and verbal reinforcements. Sex Roles, 40(5–6), 473–481. Freelon, D. G. (2010). ReCal: Intercoder reliability calculation as a web service. International Journal of Internet Science, 5(1), 20–33. Freelon, D., McIlwain, C., & Clark, M. (2018). Quantifying the power and consequences of social media protest. New Media & Society, 20, 990–1011. doi: 10.1177/1461444816676646

References  199 Gerbner, G., Gross, L., Morgan, M., & Signorielli, N. (1994). Growing up with television: The cultivation perspective. In J. Bryant & D. Zillmann (Eds.), Media effects: Advances in theory and research (pp. 17–41). Hillsdale, NJ: Lawrence Erlbaum Associates. Gerbner, G., Signorielli, N., & Morgan, M. (1995). Violence on television: The Cultural Indicators Project. Journal of Broadcasting & Electronic Media, 39, 278–283. Ghosh, S., Zafar, M. B., Bhattacharya, P., Sharma, N., Ganguly, N., & Gummadi, K. (2013, October). On sampling the wisdom of crowds: Random vs. expert sampling of the twitter stream. In Proceedings of the 22nd ACM international conference on information and knowledge management (pp. 1739–1744). New York: ACM. Gilat, I., & Shahar, G. (2007). Emotional first aid for a suicide crisis: Comparison between telephonic hotline and Internet. Psychiatry, 70, 12–18. doi: 10.1521/ psyc.2007.70.1.12 Gjoka, M., Kurant, M., Butts, C. T., & Markopoulou, A. (2009). A walk in Facebook: Uniform sampling of users in online social networks. Retrieved January 11, 2019, from https://arxiv.org/abs/0906.0060. Golan, G., & Wanta, W. (2001). Second-level agenda setting in the New Hampshire primary: A comparison of coverage in three newspaper and public perceptions of candidates. Journalism & Mass Communication Quarterly, 78, 247–259. Grant, M. J., Button, C. M., & Snook, B. (2017). An evaluation of interrater reliability measures on binary tasks using d-prime. Applied Psychological Measurement, 41(4), 264–276. Green, M. C., Brock, T. C., & Kaufman, G. F. (2004). Understanding media enjoyment: The role of transportation into narrative worlds. Communication Theory, 14(4), 311–327. doi: 10.1111/j.1468-2885.2004.tb00317.x Grimmer, J., & Stewart, B. M. (2013). Text as data: The promises and pitfalls of automated content analysis methods for political texts. Political Analysis, 21, 267–297. doi: 10.1093/pan/mps028 Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48. Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gathersburg, MD: Advanced Analytics. Hak, T., & Bernts, T. (1996). Coder training: Theoretical training or practical socialization. Qualitative Sociology, 19, 235–257. doi: 10.1007/BF02393420 Hale, B. J., & Grabe, M. E. (2018). Visual war: A content analysis of Clinton and Trump subreddits during the 2016 campaign. Journalism & Mass Communication Quarterly, 95, 449–470. Hambrick, M. E., Simmons, J. M., Greenhalgh, G. P., & Greenwell, T. C. (2010). Understanding professional athletes’ use of Twitter: A content analysis of athlete tweets. International Journal of Sport Communication, 3(4), 454–471. Hamdy, N., & Gomaa, E. H. (2012). Framing the Egyptian uprising in Arabic language newspapers and social media. Journal of Communication, 62(2), 195–211. Hansen, K. A., & Paul, N. (2015). Newspaper archives reveal major gaps in digital age. Newspaper Research Journal, 36(3), 290–298.

200 References Hanusch, F., & Bruns, A. (2017). Journalistic branding on Twitter: A representative study of Australian journalists’ profile descriptions. Digital Journalism, 5(1), 26–43. Hayes, A. F. (2005). An SPSS procedure for computing Krippendorff’s alpha. Retrieved September 19, 2018, from www.afhayes.com/spss-sas-and-mplusmacros-and-code.html. Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89. Hermida, A., Lewis, S. A., & Zamith, R. (2013). Sourcing the Arab Spring: A case study of Andy Carvin’s sources on Twitter during the Tunisian and Egyptian revolutions. Journal of Computer-Mediated Communication, 19(3), 479–499. Hester, J. B., & Dougall, E. (2007). The efficiency of constructed weeks sampling for content analysis of online news. Journalism & Mass Communication Quarterly, 84, 811–824. Hindman, D. (2012). Knowledge gaps, belief gaps and public opinion about health care reform. Journalism & Mass Communication Quarterly, 89(4), 585–605. Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley. Hovland, C. I. (1959). Reconciling conflicting results derived from experimental and survey studies of attitude change. The American Psychologist, 14, 8–17. Hox, J. J., Moerbeek, M., & Van de Schoot, R. (2017). Multilevel analysis: Techniques and applications (3rd ed.). New York: Routledge. Hua, M., & Tan, A. (2012). Media reports of Olympic success by Chinese and American gold medalists: Cultural differences in causal attribution. Mass Communication and Society, 15(4), 546–558. Hum, N. J., Chamberlin, P. E., Hambright, B. L., Portwood, A. C., Schat, A. C., & Bevan, J. L. (2011). A picture is worth a thousand words: A content analysis of Facebook profile photographs. Computers in Human Behavior, 27(5), 1828–1833. Hunter, J. E., & Gerbing, D. W. (1982). Unidimensional measurement, second order factor analysis and causal models. Research in Organizational Behavior, 4, 267–320. Hwang, Y., & Jeong, S. (2009). Revising the knowledge gap hypothesis: A metaanalysis of thirty-five years of research. Journalism & Mass Communication Quarterly, 86(3), 513–532. Internet Live Stats (2018). Twitter usage statistics. Retrieved June 28, 2018, from www.internetlivestats.com/twitter-statistics/. Jamal, A. A., Keohane, R. O., Romney, D., & Tingley, D. (2015). Anti-Americanism and anti-interventionism in Arabic Twitter discourse. Perspectives on Politics, 13, 55–73. doi: 10.1017/S1537592714003132 Johnson, M. A., & Pettiway, K. M. (2017). Visual expressions of black identity: African American and African museum websites. Journal of Communication, 67, 350–377. Johnson, R. H. (1999). The relation between formal logic and informal logic. Argumentation, 13, 265–274. Johnson, R. H., & Blair, J. A. (2000). Informal logic: An overview. Informal Logic, 20(2), 93–107.

References  201 Jones, R. L., & Carter, R. E., Jr. (1959). Some procedures for estimating “news hole” in content analysis. Public Opinion Quarterly, 23, 399–403. Joseph, K., Landwehr, P. M., & Carley, K. M. (2014). Two 1% s don’t make a whole: Comparing simultaneous samples from Twitter’s streaming API. In W. G. Kennedy, N. Agarwal, & S. J. Yang (Eds.), International conference on social computing, behavioral-cultural modeling, and prediction (pp. 75–83). Cham: Springer. Jung, J. (2002). How magazines covered media companies’ mergers: The case of the evolution of Time Inc. Journalism & Mass Communication Quarterly, 79, 681–696. Kaid, L. L., & Wadsworth, A. J. (1989). Content analysis. In P. Emmert & L. L. Barker (Eds.), Measurement of communication behavior (pp. 197–217). New York: Longman. Kamhawi, R., & Weaver, D. (2003). Mass communication research trends from 1980 to 1999. Journalism & Mass Communication Quarterly, 80(1), 7–27. Karlsson, M. (2012). Changing the liquidity of online news: Moving towards a method for content analysis. International Communication Gazette, 74, 385–402. Karpf, D. (2012). Social science research methods in Internet time. Infor­ mation, Communication, & Society, 15, 639–661. doi: 10.1080/1369118X. 2012.665468 Keith, T. (2016, November 18). Commander-in-Tweet: Trump’s social media use and presidential media avoidance. National Public Radio. Retrieved June 28, 2018, from www.npr.org/2016/11/18/502306687/commander-in-tweettrumps-social-media-use-and-presidential-media-avoidance. Kensicki, L. J. (2004). No cure for what ails us: The media-constructed disconnect between societal problems and possible solutions. Journalism & Mass Communication Quarterly, 81(1), 53–73. Kerlinger, F. N. (1973). Foundations of behavioral research (2nd ed.). New York: Holt, Rinehart & Winston. Ki, E., & Hon, L. C. (2006). Relationship maintenance strategies on Fortune 500 company web sites. Journal of Communication Management, 10(1), 27–43. Kim, S. H., Carvalho, J. P., & Davis, A. C. (2010). Talking about poverty: News framing of who is responsible for causing and fixing the problem. Journalism & Mass Communication Quarterly, 87, 563–581. Kim, S. H., Thrasher, J. F., Kang, M. H., Cho, Y. J., & Kim, J. K. (2017). News media presentations of electronic cigarettes: A content analysis of news coverage in South Korea. Journalism & Mass Communication Quarterly, 94, 443–464. Kiousis, S., Kim, S., McDevitt, M., & Ostrowski, A. (2009). Competing for attention: Information subsidy influence in agenda building during election campaigns. Journalism & Mass Communication Quarterly, 86(3), 545–562. Klapper, J. T. (1960). The effects of mass communication. New York: Free Press. Kornfield, R., Toma, C. L., Shah, D. V., Moon, T. J., & Gustafson, D. H. (2018). What do you say before you relapse? How language use in peer-to-peer online discussion forum predicts risky drinking among those in recovery. Health Communication, 33, 1184–1193. doi: 10.1080/10410236.2017.1350906 Kraemer, H. C. (1979). Ramifications of a population model for k as a coefficient of reliability. Psychometrika, 44, 461–472.

202 References Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage. Krippendorff, K. (2004a). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage. Krippendorff, K. (2004b). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30, 411–433. Krippendorff, K. (2011). Agreement and information in the reliability of coding. Communication Methods and Measures, 5(2), 93–112. Krippendorff, K. (2012). Commentary: A dissenting view on so-called paradoxes of reliability coefficients. In C. T. Salmon (Ed.), Communication yearbook 36 (pp. 481–499). New York: Routledge. Krippendorff, K. (2013). Content analysis: An introduction to its methodology (3rd ed.). Thousand Oaks, CA: Sage. Krippendorff, K. (2016). Misunderstanding reliability. Methodology, 12(4), 139–144. Krippendorff, K., & Bock, M. A. (Eds.) (2009). The content analysis reader. Thousand Oaks, CA: Sage. Krippendorff, K., & Craggs, R. (2016). The reliability of multi-valued coding of data. Communication Methods and Measures, 10(4), 181–198. Krippendorff, K., Mathet, Y., Bouvry, S., & Widlöcher, A. (2016). On the reliability of unitizing textual continua: Further developments. Quality & Quantity, 50(6), 2347–2364. Kurpius, D. D. (2002). Sources and civic journalism: Changing patterns of reporting? Journalism & Mass Communication Quarterly, 79, 853–866. Kutz, D. O., & Herring, S. C. (2005, January). Micro-longitudinal analysis of web news updates. In System Sciences, 2005. HICSS’05. Proceedings of the 38th annual Hawaii international conference (pp. 102a–102a). IEEE. Retrieved January 12, 2019, from www.computer.org/csdl/proceedings/ hicss/2005/2268/04/22680102a.pdf. Lacy, S. (1987). The effects of intracity competition on daily newspaper content. Journalism Quarterly, 64, 281–290. Lacy, S. (1988). The impact of intercity competition on daily newspaper content. Journalism Quarterly, 65, 399–406. Lacy, S. (1992). The financial commitment approach to news media competition. Journal of Media Economics, 59(2), 5–22. Lacy, S., Duffy, M., Riffe, D., Thorson, E., & Fleming, K. (2010). Citizen journalism web sites complement newspapers. Newspaper Research Journal, 31(2), 34–46. Lacy, S., & Fico, F. (1991). The link between newspaper content quality and circulation. Newspaper Research Journal, 12(2), 46–57. Lacy, S., Fico, F. G., Baldwin, T., Bergan, D., Wildman, S. S., & Zube, P. (2012). Dailies still do “heavy lifting” in government news, despite cuts. Newspaper Research Journal, 33(2), 23–39. Lacy, S., Fico, F., & Simon, T. F. (1989). The relationships among economic, newsroom and content variables: A path model. Journal of Media Economics, 2(2), 51–66. Lacy, S., & Riffe, D. (1993). Sins of omission and commission in mass communication quantitative research. Journalism Quarterly, 70, 126–132.

References  203 Lacy, S., & Riffe, D. (1996). Sampling error and selecting intercoder reliability samples for nominal content categories. Journalism & Mass Communication Quarterly, 73, 963–973. Lacy, S., Riffe, D., & Randle, Q. (1998). Sample size in multi-year content analyses of monthly consumer magazines. Journalism & Mass Communication Quarterly, 75, 408–417. Lacy, S., Riffe, D., Stoddard, S., Martin, H., & Chang, K. K. (2000). Sample size for newspaper content analysis in multi-year studies. Journalism & Mass Communication Quarterly, 78, 836–845. Lacy, S., Riffe, D., Thorson, E., & Duffy, M. (2009). Examining the features, policies and resources of citizen journalism: Citizen news sites and blogs. Web Journal of Mass Communication Research, 15(1), 1–20. Lacy, S., Robinson, K., & Riffe, D. (1995). Sample size in content analysis of weekly newspapers. Journalism & Mass Communication Quarterly, 72, 336–345. Lacy, S., & Rosenstiel, T. (2015). Defining and measuring quality journalism. New Brunswick, NJ: Rutgers School of Communication and Information. Lacy, S., Watson, B. R., & Riffe, D. (2011). Study examines relationship among mainstream, other media. Newspaper Research Journal, 32(4), 53–67. Lacy, S., Watson, B. R., Riffe, D., & Lovejoy, J. (2015). Issues and best practices in content analysis. Journalism & Mass Communication Quarterly, 92, 791–811. doi: 10.1177/1077699015607338 Lacy, S., Wildman, S. S., Fico, F., Bergan, D., Baldwin, T., & Zube, P. (2013). How radio news uses sources to cover local government news and factors affecting source use. Journalism & Mass Communication Quarterly, 90, 457–477. Lasswell, H. D. (1927). Propaganda technique in the world war. New York: Peter Smith. Law, C., & Labre, M. P. (2002). Cultural standards of attractiveness: A thirtyyear look at changes in male images in magazines. Journalism & Mass Communication Quarterly, 79(3), 697–711. Lawrence, R. G. (2010). Researching political news framing: Established ground and new horizons. In P. D’Angelo & J. A. Kuypers (Eds.), Doing news framing analysis: Empirical and theoretical perspectives (pp. 265–285). New York: Routledge. Lazarsfeld, P. F., Berelson, B., & Gaudet, H. (1944). The people’s choice. New York: Columbia University Press. Leccese, M. (2009). Online information sources of political blogs. Journalism & Mass Communication Quarterly, 86(3), 578–593. Lee, S., & Riffe, D. (2017). Who sets the corporate social responsibility agenda in the news media? Unveiling the agenda-building process of corporations and a monitoring group. Public Relations Review, 43, 293–305. doi: 10.1016/j. pubrev.2017.02.007 Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of big data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1), 34–52. Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2004). A call for standardization in content analysis reliability. Human Communication Research, 30, 434–437.

204 References Lovejoy, J., Watson, B. R., Lacy, S., & Riffe, D. (2014). Assessing the reporting of reliability in published content analyses: 1985–2010. Communication Methods and Measures, 8(3), 207–221. Lovejoy, J., Watson, B. R., Lacy, S., & Riffe, D. (2016). Three decades of reliability in communication content analyses: Reporting of reliability statistics and coefficient levels in three top journals. Journalism & Mass Communication Quarterly, 93(4), 1135–1159. Lowery, S. A., & DeFleur, M. (1995). Milestones in mass communication research: Media effects (3rd ed.). White Plains, NY: Longman. Luke, D. A., Caburnay, C. A., & Cohen, E. L. (2011). How much is enough? New recommendations for using constructed week sampling in newspaper content analysis of health stories. Communication Methods and Measures, 5(1), 76–91. Lynch, T., Tompkins, J. E., van Driel, I. I., & Fritz, N. (2016). Sexy, strong, and secondary: A content analysis of female characters in video games across 31 years. Journal of Communication, 66, 564–584. doi: 10.1111/jcom.12237 Mahrt, M., & Scharkow, M. (2013). The value of big data in digital media research. Journal of Broadcasting & Electronic Media, 57(1), 20–33. Malamuth, N. M., Addison, T., & Koss, J. (2000). Pornography and sexual aggression: Are there reliable effects and can we understand them? Annual Review of Sex Research, 11(1), 26–94. Martins, N., Williams, D. C., Harrison, K., & Ratan, R. A. (2008). A content analysis of female body imagery in video games. Sex Roles, 61, 824–836. Mastro, D. (2009). Effects of racial and ethnic stereotyping. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 325–341). New York: Routledge. Mastro, D. E., & Greenberg, B. S. (2000). The portrayal of racial minorities on prime time television. Journal of Broadcasting & Electronic Media, 44(4), 690–703. McCluskey, M., & Kim, Y. M. (2012). Moderation or polarization? Repre­ sentat­ion of advocacy groups’ ideology in newspapers. Journalism & Mass Communication Quarterly, 89(4), 565–584. McCombs, M. E. (1972). Mass media in the marketplace. Journalism Mono­ graphs, 24. McCombs, M. E., & Shaw, D. L. (1972). The agenda-setting function of mass media. Public Opinion Quarterly, 36, 176–187. McCombs, M., & Reynolds, A. (2009). How the news shapes our civic agenda. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 1–17). New York: Routledge. McEwan, B., Carpenter, C. J., & Westerman, D. (2018). On replication in communication science. Communication Studies, 69(3), 235–241. McLeod, D. M., Kosicki, G. M., & McLeod, J. M. (2009). Political communication effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 228–251). New York: Routledge. McLeod, D. M., & Tichenor, P. J. (2003). The logic of social and behavioral sciences. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 91–110). Boston, MA: Allyn & Bacon. McLoughlin, M., & Noe, F. P. (1988). Changing coverage of leisure in Harper’s, Atlantic Monthly, and Reader’s Digest: 1960–1985. Sociology and Social Research, 72, 224–228.

References  205 McMillan, S. J. (2000). The microscope and the moving target: The challenge of applying content analysis to the World Wide Web. Journalism & Mass Communication Quarterly, 77, 80–98. Mellado, C., Hellmueller, L., Márquez-Ramírez, M., Humanes, M. L., Sparks, C., Stepinska, A., Pasti, S., Schielicke, A., Tandoc, E., & Wang. H. (2017). The hybridization of journalistic cultures: A comparative study of journalistic role performance. Journal of Communication, 67, 944–967. doi: 10.1111/jcom.12339 Mellado, C., & van Dalen, A. (2017). Challenging the citizen–consumer journalistic dichotomy: A news content analysis of audience approaches in Chile. Journalism & Mass Communication Quarterly, 94, 213–237. doi: 10.1177/1077699016629373 Miller, D. C. (1977). Handbook of research design and social measurement (3rd ed.). New York: McKay. Moen, M. C. (1990). Ronald Reagan and the social issues: Rhetorical support for the Christian Right. The Social Science Journal, 27, 199–207. Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013, July). Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s firehose. ICWSM. Retrieved January 19, 2019, from www.aaai.org/ocs/index. php/ICWSM/ICWSM13/paper/download/6071/6379. Moser, C. A., & Kalton, G. (1972). Survey methods in social investigation (2nd ed.). New York: Basic Books. Muñoz, C. L., & Towner, T. L. (2017). The image is the message: Instagram marketing and the 2016 presidential primary season. Journal of Political Marketing, 16, 290–318. doi: 10.1080/15377857.2017.1334254 Naaman, M., Boase, J., & Lai, C. H. (2010, February). Is it really about me? Message content in social awareness streams. In Proceedings of the 2010 ACM conference on computer supported cooperative work (pp. 189–192). New York: ACM. Neuendorf, K. (2017). The content analysis guidebook (2nd ed.). Thousand Oaks, CA: Sage. Neumann, R., & Fahmy, S. (2012). Analyzing the spell of war: A war/peace framing analysis of the 2009 visual coverage of the Sri Lankan Civil War in Western newswires. Mass Communication and Society, 15, 169–200. Nili, A., Tate, M., & Barros, A. (2017). A critical analysis of inter-coder reliability methods in information systems research. Australian Conference on Information Systems, Hobart, Australia. Oliver, M. B., & Krakowiak, M. (2009). Individual differences in media effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 517–531). New York: Routledge. Olson, B. (1994). Sex and soap operas: A comparative content analysis of health issues. Journalism Quarterly, 71, 840–850. Opperhuizen, A. E., Schouten, K., & Klijn, E. H. (2018). Framing a conflict! How media report on earthquake risks caused by gas drilling: A longitudinal analysis using machine learning techniques of media reporting on gas drilling from 1990 to 2015. Journalism Studies, published online January 11, 2018. doi: 10.1080/1461670X.2017.1418672 Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press. Oxford University (1979). Newton, Isaac. In The Oxford Dictionary of Quotations (3rd ed., p. 362). New York: Oxford University Press.

206 References Palguna, D. S., Joshi, V., Chakaravarthy, V. T., Kothari, R., & Subramaniam, L. V. (2015). Analysis of sampling algorithms for Twitter. In Q. Y. Hong & M. Wooldridge (Eds.), Proceedings of the twenty-fourth international joint conference on artificial intelligence (pp. 967–973). Palo Alto, CA: AAAI Press. Papacharissi, Z. (2002). The presentation of self in virtual life: Characteristics of personal home pages. Journalism & Mass Communication Quarterly, 79, 643–660. Parde, N., & Nielsen, R. D. (2017). Detecting sarcasm is extremely easy ;-). Proceedings of the workshop on computational semantics beyond events and roles (SemBEaR-2018) (pp. 21–26). Stroudsburg, PA: Association for Computational Linguistics. Peter, J., & Lauf, E. (2002). Reliability in cross-national content analysis. Journalism & Mass Communication Quarterly, 79, 815–832. Potter, W. J., & Levine-Donnerstein, D. (1999). Rethinking validity and reliability in content analysis. Journal of Applied Communication Research, 27(3), 258–284. Prabowo, R., & Thelwall, M. (2009). Sentiment analysis: A combined approach. Journal of Informetrics, 3, 149–157. Pratt, C. A., & Pratt, C. B. (1995). Comparative content analysis of food and nutrition advertisements in Ebony, Essence, and Ladies’ Home Journal. Journal of Nutrition Education, 27, 11–18. Quarfoot, D., & Levine, R. A. (2016). How robust are multirater interrater reliability indices to changes in frequency distribution? The American Statistician, 70(4), 373–384. Ramaprasad, J. (1993). Content, geography, concentration and consonance in foreign news coverage of ABC, NBC and CBS. International Communication Bulletin, 28, 10–14. Rapoport, A. (1969). A system-theoretic view of content analysis. In G. Gerbner, O. Holsti, K. Krippendorff, W. J. Paisley, & P. J. Stone (Eds.), The analysis of communication content (pp. 17–38). New York: Wiley. Reese, S. D. (2011). Understanding the global journalist: A hierarchy-of-influences approach. In D. Berkowitz (Ed.), Cultural meanings of news: A text-reader (pp. 3–15). Thousand Oaks, CA: Sage. Reese, S. D., Gandy, O. H., Jr., & Grant, A. E. (Eds.) (2001). Framing public life: Perspectives on media and our understanding of the social world. Mahwah, NJ: Lawrence Erlbaum Associates. Reynolds, P. D. (1971). A primer in theory construction. Indianapolis, IN: Bobbs-Merrill. Rezvanian, A., & Meybodi, M. R. (2017). A new learning automata-based sampling algorithm for social networks. International Journal of Communication Systems, 30(5), e3091. Rice, R. E., Peterson, M., & Christie, R. (2001). A comparative features analysis of publicly accessible commercial and government health database websites. In R. E. Rice & J. E. Katz (Eds.), The Internet and health communication: Expectations and experiences (pp. 213–231). Thousand Oaks, CA: Sage. Riffe, D. (1984). International news borrowing: A trend analysis. Journalism Quarterly, 61, 142–148. Riffe, D. (1991). A case study of the effect of expulsion of U.S. correspondents on New York Times’ coverage of Iran during the hostage crisis. International Communication Bulletin, 26, 1–2, 11–15.

References  207 Riffe, D. (2003). Data analysis and SPSS programs for basic statistics. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 182–208). Boston, MA: Allyn & Bacon. Riffe, D., Aust, C. F., & Lacy, S. R. (1993). The effectiveness of random, consecutive day and constructed week samples in newspaper content analysis. Journalism Quarterly, 70, 133–139. Riffe, D., Ellis, B., Rogers, M. K., Ommeren, R. L., & Woodman, K. A. (1986). Gatekeeping and the network news mix. Journalism Quarterly, 63, 315–321. Riffe, D., & Freitag, A. (1997). A content analysis of content analyses: 25 years of Journalism Quarterly. Journalism & Mass Communication Quarterly, 74, 873–882. Riffe, D., Goldson, H., Saxton, K., & Yu, Y. C. (1989). Females and minorities in TV ads in 1987 Saturday children’s programs. Journalism Quarterly, 66(1), 129–136. Riffe, D., Kim, S., & Sobel, M. R. (2018). News borrowing revisited: A 50-year perspective. Journalism & Mass Communication Quarterly, 98(4), 909–929. doi: 10.1177/1077699018754909 Riffe, D., Lacy, S., & Drager, M. (1996). Sample size in content analysis of weekly news magazines. Journalism & Mass Communication Quarterly, 73, 635–644. Riffe, D., Lacy, S., Nagovan, J., & Burkum, L. (1996). The effectiveness of simple random and stratified random sampling in broadcast news content analysis. Journalism & Mass Communication Quarterly, 73, 159–168. Rogers, E. M. (1994). A history of communication study: A biographical approach. New York: Free Press. Rogers, E. M. (2003). Diffusion of Innovation (5th ed.). New York: Free Press. Rowling, C. M., Jones, T. J., & Sheets, P. (2011). Some dared call it torture: Cultural resonance, Abu Ghraib, and a selectively echoing press. Journal of Communication, 61, 1043–1061. Rubin, A. M. (2009). Uses-and-gratifications perspective on media effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 181–200). New York: Routledge. Rusmevichientong, P., Pennock, D. M., Lawrence, S., & Giles, C. L. (2001, November). Methods for sampling pages uniformly from the World Wide Web. Proceedings of the AAAI fall symposium on using uncertainty within computation (pp. 121–128). Menlo Park, CA: AAAI Press. St. Cyr, C., Carpenter, S., & Lacy, S. (2010). Internet competition and US newspaper city government coverage: Testing the Lowrey and Mackay model of occupational competition. Journalism Practice, 4(4), 507–522. St. Cyr, C., Lacy, S., & Guzman-Ortega, S. (2005). Circulation increases follow investments in newsrooms. Newspaper Research Journal, 26(4), 50–60. Sapolsky, B. S., Molitor, F., & Luque, S. (2003). Sex and violence in slasher films: Re-examining the assumptions. Journalism & Mass Communication Quarterly, 80(1), 28–38. Scheufele, B., Haas, A., & Brosius, H. (2011). Mirror or molder? A study of media coverage, stock prices, and trading volumes in Germany. Journal of Communication, 61, 48–70. Scheufele, B. T., & Scheufele, D. A. (2010). Of spreading activation, applicability, and schemas: Conceptual distinctions and their operational implications for measuring frames and framing effects. In P. D’Angelo & J. A. Kuypers

208 References (Eds.), Doing news framing analysis: Empirical and theoretical perspectives (pp. 110–134). New York: Routledge. Scott, D. K., & Gobetz, R. H. (1992). Hard news/soft news content of national broadcast networks. Journalism Quarterly, 69, 406–412. Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325. Severin, W. J., & Tankard, J. W., Jr. (2000). Communication theories: Origins, methods, and uses in the mass media (5th ed.). New York: Addison Wesley Longman. Shils, E. A., & Janowitz, M. (1948). Cohesion and disintegration in the Wehrmacht in World War II. Public Opinion Quarterly, 12, 300–306, 308–315. Shin, J., & Thorson, K. (2017). Partisan selective sharing: The biased diffusion of factchecking messages on social media. Journal of Communication, 67(2), 233–255. Shoemaker, P. J., & Reese, S. D. (1990). Exposure to what? Integrating media content and effects studies. Journalism Quarterly, 67, 649–652. Shoemaker, P. J., & Reese, S. D. (1996). Mediating the message: Theories of influences on mass media content (2nd ed.). White Plains, NY: Longman. Shoemaker, P. J., Tankard, J. W., Jr., & Lasorsa, D. L. (2004). How to build social science theories. Thousand Oaks, CA: Sage. Shrum, L. J. (2009). Media consumption and perceptions of social reality: Effects and underlying processes. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 50–73). New York: Routledge. Signorielli, N. (2001). Age on television: The picture in the nineties. Generations, 25(3), 34–38. Simon, T. F., Fico, F., & Lacy, S. (1989). Covering conflict and controversy: Measuring balance, fairness and defamation in local news stories. Journalism Quarterly, 66, 427–434. Simonton, D. K. (1994). Computer content analysis of melodic structure: Classical composers and their compositions. Psychology of Music, 22, 31–43. Skogerbo, E., & Krumsvik, A. H. (2015). Newspapers, Facebook and Twitter: Intermedial agenda setting in local election campaigns. Journalism Practice, 9, 350–366. doi: 10.1080/17512786.2014.950471 Slater, M. D. (2013). Content analysis as a foundation for programmatic research in communication. Communication Methods and Measures, 7(2), 85–93. Smith, S. L., & Granados, A. D. (2009). Content patterns and effects surrounding sex-role stereotyping on television and film. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 342–361). New York: Routledge. Sobel, M. R., & Riffe, D. (2015). US linkages in New York Times coverage of Nigeria, Ethiopia and Botswana (2004–13): Economic and strategic bases for news. International Communication Research Journal, 50(1), 3–23. Sparks, G. G., Sparks, C. W., & Sparks, E. A. (2009). Media violence. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 269–286). New York: Routledge. Stamm, K. R. (2003). Measurement decisions. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 129–146). Boston, MA: Allyn & Bacon. Statista (2018a). Twitter penetration rate in the United States from 2014 to 2020. Retrieved June 10, 2018, from www.statista.com/statistics/183466/share-ofadult-us-population-on-twitter/.

References  209 Statista (2018b). Most famous social network sites worldwide as of April 2018, ranked by number of active users (in millions). Retrieved June 28, 2018, from www.statista.com/statistics/272014/global-social-networks-ranked-bynumber-of-users/. Stegner, W. (1949). The radio priest and his flock. In I. Leighton (Ed.), The aspirin age: 1919–1941 (pp. 232–257). New York: Simon & Schuster. Stempel, G. H., III (1952). Sample size for classifying subject matter in dailies. Journalism Quarterly, 29, 333–334. Stempel, G. H., III (1985). Gatekeeping: The mix of topics and the selection of stories. Journalism Quarterly, 62(4), 791–796, 815. Stempel, G. H., III (2003). Content analysis. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 209–219). Boston, MA: Allyn & Bacon. Stempel, G. H., III, & Stewart, R. K. (2000). The Internet provides both opportunities and challenges for mass communication researchers. Journalism & Mass Communication Quarterly, 77, 541–548. Stokes, D. E. (1997). Pasteur’s quadrant: Basic science and technological innovation. Washington, DC: Brookings Institution Press. Stouffer, S. A. (1977). Some observations on study design. In D. C. Miller (Ed.), Handbook of research design and social measurement (3rd ed., pp. 27–31). New York: McKay. Strodthoff, G. G., Hawkins, R. P., & Schoenfeld, A. C. (1985). Media roles in a social movement. Journal of Communication, 35(2), 134–153. Stryker, J. E., Wray, R. J., Hornik, R. C., & Yanovitzky, I. (2006). Validation of database search terms for content analysis: The case of cancer news coverage. Journalism & Mass Communication Quarterly, 83(2), 413–430. Sundar. S. S., Rice, R. E., Kim, H., & Sciamanna, C. N. (2011). Online health infor­ mation: Conceptual challenges and theoretical opportunities. In T. L. Thompson, R. Parrott, & J. F. Nussbaum (Eds.), The Routledge handbook of health communication (2nd ed., pp. 181–202). New York: Routledge. Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins. Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). New York: HarperCollins. Táboas-Pais, M. I., & Rey-Cao, A. (2012). Gender differences in physical education textbooks in Spain: A content analysis of photographs. Sex Roles, 67(7–8), 389–402. Tajfel, H., & Turner, J. C. (1979). An integrative theory of intergroup conflict. In W. G. Austin & S. Worchel (Eds.), The social psychology of intergroup relations (pp. 33–47). Monterey, CA: Brooks/Cole. Tankard, J. W., Jr. (2001). The empirical approach to the study of framing. In S. D. Reese, O. H. Gandy, & A. E. Grant (Eds.), Framing public life: Perspectives on media and our understanding of the social world (pp. 95–106). Mahwah, NJ: Lawrence Erlbaum Associates. Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29, 24–54. doi: 10.1177/0261927X09351676 Tewksbury, D., & Scheufele, D. A. (2009). News framing theory and research. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 17–33). New York: Routledge.

210 References Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406–418. Theocharis, Y., Barberá, P., Fazekas, Z., Popa, S. A., & Parnet, O. (2016). A bad workman blames his tweets: The consequences of citizens’ uncivil Twitter use when interacting with party candidates. Journal of Communication, 66(6), 1007–1031. Thorson, E. (1989). Television commercials as mass media messages. In J. Bradac (Ed.), Message effects in communication science (pp. 195–230). Newbury Park, CA: Sage. Thorson, K., Driscoll, K., Ekdale, B., Edgerly, S., Thompson, L. G., Schrock, A., Swartz, L., Vraga, E. K., & Wells, C. (2013). Youtube, Twitter, and the Occupy movement. Information, Communication & Society, 16, 421–451. doi: 10.1080/1369118X.2012.756051 Tremayne, M. (2004). The Web of context: Applying network theory to use of hyperlinks in journalism on the Web. Journalism & Mass Communication Quarterly, 81, 237–253. Trilling, D., & Jonkman, J. G. F. (2018). Scaling up content analysis. Communication Methods and Measures, 12, 158–174. doi: 10.1080/19312458.2018.1447655 Trumbo, C. (2004). Research methods in mass communication research: A census of eight journals 1990–2000. Journalism & Mass Communication Quarterly, 81, 417–436. Turner, J. C., Hogg, M. A., Oakes, P. J., Reicher, S. D., & Wetherell, M. S. (1987). Rediscovering the social group: A self-categorizing theory. Oxford: Blackwell. Twitter (2018). Search tweets. Retrieved August 10, 2018, from https://developer. twitter.com/en/docs/tweets/search/overview.html. Vashi, A., & Rhodes, K. V. (2011). “Sign right here and you’re good to go”: A content analysis of audiotaped emergency department discharge instructions. Annals of Emergency Medicine, 57, 315–322. doi: 10.1016/j.annemergmed. 2010.08.024 Vincent, R. C., Davis, D. K., & Boruszkowski, L. A. (1987). Sexism on MTV: The portrayal of women in rock videos. Journalism Quarterly, 64, 750–755, 941. Vogt, W. P. (2005). Dictionary of statistics and methodology: A nontechnical guide for the social sciences (3rd ed.). Thousand Oaks, CA: Sage. Vorderer, P., & Hartmann, T. (2009). Entertainment and enjoyment as media effects. In J. Bryant & M. B. Oliver (Eds.), Media effects: Advances in theory and research (3rd ed., pp. 532–550). New York: Routledge. Wagner, K. (2017, May 9). Snapchat messages won’t always disappear as fast as they used to. Recode. Retrieved June 28, 2018, from www.recode. net/2017/5/9/15595040/snapchat-product-update-limitless-q1-earnings. Wall, M. (2015). Citizen journalism: A retrospective on what we know, an agenda for what we don’t. Digital Journalism, 3(6), 797–813. Wang, X., & Riffe, D. (2010, May). An exploration of sample sizes for content analysis of the New York Times web site. Web Journal of Mass Communication Research, 20. Retrieved from http://wjmcr.org/vol20. Wanta, W., Golan, G., & Lee, C. (2004). Agenda setting and international news: Media influence on public perceptions of foreign nations. Journalism & Mass Communication Quarterly, 81, 364–377.

References  211 Washburn, R. C. (1995). Top of the hour newscast and public interest. Journal of Broadcasting & Electronic Media, 39, 73–91. Waters, R. D., Burnett, E., Lamm, A., & Lucas, J. (2009). Engaging stakeholders through social networking: How nonprofit organizations are using Facebook. Public Relations Review, 35, 102–106. Watson, B. R. (2014). Assessing ideological, professional, and structural biases in journalists’ coverage of the 2010 BP oil spill. Journalism & Mass Communication Quarterly, 91, 792–810. doi: 10.1177/1077699014550091 Watson, B. R. (2017). Murder she searched: The effect of violent crime and news coverage on residents’ information needs. Mass Communication and Society, 20, 241–259. Weaver, D. A., & Bimber, B. (2008). Finding news stories: A comparison of searches using LexisNexis and Google News. Journalism & Mass Communi­ cation Quarterly, 85(3), 515–530. Weaver, D. H. (2003). Basic statistical tools. In G. H. Stempel III, D. H. Weaver, & G. C. Wilhoit (Eds.), Mass communication research and theory (pp. 147–181). Boston, MA: Allyn & Bacon. Weaver, J. B., Porter, C. J., & Evans, M. E. (1984). Patterns in foreign news coverage on U.S. network TV: A 10-year analysis. Journalism Quarterly, 61(2), 356–363. Weber, R. P. (1990). Basic content analysis (2nd ed.). Newbury Park, CA: Sage. West, D. M., & Miller, E. A. (2009). Digital medicine: Health care in the internet era. Washington, DC: Brookings Institute Press. Westley, B. H., & MacLean, M. S., Jr. (1957). A conceptual model for communication research. Journalism Quarterly, 34, 31–38. Whaples, R. (1991). A quantitative history of The Journal of Economic History and the cliometric revolution. The Journal of Economic History, 51, 289–301. Whitney, D. C. (1981). Information overload in the newsroom. Journalism Quarterly, 58, 69–76, 161. Wicks, R. H., & Souley, B. (2003). Going negative: Candidate usage of Internet Web sites during the 2000 presidential campaign. Journalism & Mass Communication Quarterly, 80(1), 128–144. Wilson, C., Robinson, T., & Callister, M. (2012). Surviving Survivor: A content analysis of antisocial behavior and its context in a popular reality television show. Mass Communication and Society, 15(2), 261–283. Wimmer, R. D., & Dominick, J. R. (1991). Mass media research: An introduction (3rd ed.). Belmont, CA: Wadsworth. Wimmer, R. D., & Dominick, J. R. (2003). Mass media research: An introduction (7th ed.). Belmont, CA: Wadsworth. Wimmer, R. D., & Dominick, J. R. (2011). Mass media research: An introduction (9th ed.). Belmont, CA: Wadsworth. Windhauser, J. W., & Stempel, G. H., III (1979). Reliability of six techniques for content analysis of local coverage. Journalism Quarterly, 56, 148–152. Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Toronto: Nelson Education. Wrightsman, L. S. (1981). Personal documents as data in conceptualizing adult personality development. Personality and Social Psychology Bulletin, 7, 367–385. Wu, L. (2015, August). How do national and regional newspapers cover posttraumatic stress disorder? A content analysis. Paper presented at Annual Convention, AEJMC, San Francisco, CA.

212 References Yi-Fan Su, L., Xenos, M. A., Rose, K. M., Wirz, C., Scheufele, D. A., & Brossard, D. (2018). Uncivil and personal? Comparing patterns of incivility in comments on the Facebook pages of news outlets. New Media & Society, 20, 3678–3699. doi: 10.1177/1461444818757205 Young, S. M., Pruett, J. A., & Colvin, M. L. (2018). Comparing help-seeking behavior of male and female survivors of sexual assault: A content analysis of a hotline. Sex Abuse, 30, 454–474. doi.org/10.1177/1079063216677785 Zamith, R. (2017). Capturing and analyzing liquid content: A computation process for freezing and analyzing mutable documents. Journalism Studies, 18, 1489–1504. doi: 10.1080/1461670X.2016.1146083 Zamith, R., & Lewis, S. C. (2015). Content analysis and the algorithmic coder: What computational social science means for traditional modes of media analysis. The Annals of the American Academy of Political and Social Science, 659, 307–318. Zeldes, G., & Fico, F. (2010). Broadcast and cable news network differences in the way reporters used women and minority group sources to cover the 2004 presidential race. Mass Communication and Society, 13(5), 512–514. Zephoria Digital Marketing (2018). The top 20 valuable Facebook statistics – updated June 2018. Retrieved June 28, 2018, from https://zephoria.com/top15-valuable-facebook-statistics/. Zhang, Y. (2009). Individualism or collectivism? Cultural orientations in Chinese TV commercials and analysis of some moderating factors. Journalism & Mass Communication Quarterly, 86(3), 630–653. Zhao, X. (2012, August). A reliability index (Ai) that assumes honest coders and variable randomness. Paper presented at the annual convention, Association for Education in Journalism and Mass Communication, Chicago, IL. Zhao, X., Feng, G. C., Liu, J. S., & Deng, K. (2018). We agreed to measure agreement: Redefining reliability de-justifies Krippendorff’s alpha. China Media Research, 14(2), 1–15. Zhao, X., Liu, J. S., & Deng, K. (2012). Assumptions behind intercoder reliability indices. In C. T. Salmon (Ed.), Communication yearbook 36 (pp. 419–480). New York: Routledge. Zhu, J., Luo, J., You, Q., & Smith, J. R. (2013, December). Towards understanding the effectiveness of election related images in social media. In W. Ding, T. Washio, H. Xiong, G. Karypis, B. Thuraisingham, D. Cook, & X. Wu (Eds.), Proceedings of the 2013 IEEE 13th international conference on data mining workshops (ICDMW) (pp. 421–425). Piscataway, NJ: IEEE. Zillmann, D. (2002). Exemplification theory of media influence. In J. Bryant & D. Zillmann (Eds.), Media effects: Advances in theory and research (2nd ed., pp. 19–41). Mahwah, NJ: Lawrence Erlbaum Associates. Zullow, H. M., Oettingen, G., Peterson, C., & Seligman, M. E. P. (1988). Pessimistic explanatory style in the historical record: CAVing LBJ, presidential candidates, and East versus West Berlin. The American Psychologist, 43, 673–682.

Author Index

Adams, J. B. 83 Addison, T. 151 Ahern, L. A. 90 Allen, C. J. 151 Allen, M. 10 Allport, G. W. 17–18 Altschull, J. H. 6 An, S. 151 Armstrong, C. L. 1 Atwater, T. 182–183 Aust, C. F. 85 Austin, E. W. 147 Baade, R. C. 10 Babbie, E. 47, 157 Baden, C. 17 Baldwin, T. 75, 86 Ball-Rokeach, S. J. 141 Bantz, C. R. 10 Barberá, P. 94 Barnhurst, K. G. 27 Barros, A. 120 Bastien, F. 2 Bauer, R. A. 6, 8 Baxter, R. L. 30 Beam, R. A. 10, 66 Bennett, L. W. 9 Berelson, B. 7 Berelson, B. R. 22, 83 Bergan, D. 75 Berkowitz, D. 10 Bernts, T. 38 Bialik, C. 32 Bimber, B. 92 Blair, J. A. 21 Blalock, H. M. L., Jr. 191 Boase, J. 94 Bock, M. A. 7 Boehmer, J. 89

Bogart, L. 133 Boomgaarden, H. 14 Bortree, D. S. 90 Boruszkowski, L. A. 30 Bouvry, S. 123 Boyle, K. 10 Boyle, M. P. 1 Bracken, C. C. 126 Bradac, J. 9 Brezgel, K. 151 Brock, T. C. 9 Brosius, H. 14 Brown, J. D. 60 Brummette, J. 28 Bruns, A. 94, 95 Bryant, J. 9 Buckley, K. 32 Burggraf, K. 51 Burkum, L. 87 Burnett, E. 94 Button, C. M. 125–126 Butts, C. T. 96 Caburnay, C. A. 86 Callister, M. 3 Campbell, D. T. 137–138 Campbell, K. 60 Cantor, J. 9 Cantril, H. 6 Carey, J. W. 7 Carley, K. M. 95 Carpenter, C. J. 21 Carpenter, S. 89, 135 Carter, R. E., Jr. 85 Carvalho, J. P. 72–73 Ceron, A. 41 Chaffee, S. H. 7 Chakaravarthy, V. T. 95–96 Chang, K. K. 85–86

Chau, M. 91 Chen, H. 91 Cho, Y. J. 89 Christie, J. 27 Clark, M. 39 Cleary, J. 27 Coffey, A. J. 27 Cohen, E. L. 86 Cohen, J. A. 122–123, 124 Cohen, S. 10 Colvin, M. L. 51 Connolly-Ahern, C. 90 Conway, B. A. 150 Conway, M. 36, 37 Coral-Reaume Miller, A. 147 Correa, T. 10 Cote, W. 111, 135 Craft, S. H. 10 Craggs, R. 64–65, 123, 125 Culbertson, H. M. 13–14 Curini, L. 41 D’Alessio, D. 151 Danielson, W. A. 72, 83 Davis, A. C. 72–73 Davis, D. K. 30 Davis, J. 85 Davison, K. K. 128 Deese, J. 58–59, 60 DeFleur, M. 6 Deng, K. 126 DeRiemer, C. 30 De Swert, K. 124 de Vreese, C. H. 14 Di Cicco, D. T. 10, 76 Dick, S. J. 97 Dill, R. K. 10 DiStaso, M. 28 Dixon, T. L. 1 Dominick, J. R. 12, 13, 24, 27, 81, 114, 136, 157 Döring, N. 51, 152, 153 Dougall, E. 89–90 Drager, M. 9, 86–87 Druckman, J. N. 2, 28, 30, 89 Duffy, M. 9, 135 Duffy, M. J. 9, 14 Edwards, S. 151 Ellis, B. 87 Emery, E. 6 Emery, M. C. 6 Entman, R. M. 9 Epps, A. C. 1

Evans, M. E. 87 Everbach, T. 10 Fahmy, S. 26 Fazekas, Z. 94 Feng, G. C. 125–126 Fico, F. 1, 9, 55, 63, 75, 89, 100, 101, 103–104, 111–112, 119, 135, 140, 151, 172, 174, 182–183, 186 Fidell, L. S. 187, 188, 190, 191 Fleming, K. 9 Fontenot, M. 10 Fouts, G. 51 Freelon, D. 39 Freelon, D. G. 124 Freitag, A. 12, 15–16, 30, 74 Fritz, N. 1 Gallagher, A. H. 10 Gandy, O. H., Jr. 8 Gaudet, H. 6, 7 Gerbing, D. W. 136 Gerbner, G. 151 Ghosh, S. 95 Gilat, I. 51 Giles, C. L. 95 Gjoka, M. 96 Gobetz, R. H. 87 Golan, G. 15, 151 Goldson, H. 13 Gomaa, E. H. 26 Grabe, M. E. 28 Granados, A. D. 8 Grant, A. E. 8 Grant, M. J. 125–126 Greenberg, B. S. 128 Greenhalgh, G. P. 58 Green, M. C. 9 Greenwell, T. C. 58 Grimmer, J. 38 Gross, L. 151 Grube, J. W. 141 Gustafson, D. H. 40 Guzman-Ortega, S. 140 Gwet, K. L. 125–126, 127–128 Haas, A. 14 Hak, T. 38 Hale, B. J. 28 Hambrick, M. E. 58 Hamdy, N. 26 Hamilton, J. M. 10 Hansen, K. A. 161 Hanusch, F. 94

Harp, D. 10 Harrison, K. 13 Hartmann, T. 9 Hawkins, R. P. 58 Hayes, A. F. 124, 125–126 Hermida, A. 16 Herring, S. C. 91 Hertzog, H. 6 Hester, J. B. 89–90 Hindman, D. 67 Hochheimer, J. L. 7 Hogg, M. A. 31 Holsti, O. R. 32, 33, 34–35, 60, 134, 136, 157 Hon, L. C. 3, 10 Hornik, R. C. 92 Hovland, C. I. 6 Hox, J. J. 164 Hua, M. 59 Hum, N. J. 51 Hunter, J. E. 136 Hust, S. J. T. 147 Hwang, Y. 151 Iacus, S. M. 41 Im, D. S. 72 Jamal, A. A. 44 Janowitz, M. 6 Jeong, S. 151 Jin, H. S. 151 Johnson, M. A. 1, 26–27, 89 Johnson, R. H. 21 Jones, R. L. 85 Jones, T. J. 14–15 Jonkman, J. G. F. 45 Joseph, K. 95 Joshi, V. 95–96 Jung, J. 74 Kaid, L. L. 114 Kalton, G. 82, 83 Kamhawi, R. 30 Kang, M. H. 89 Karlsson, M. 89 Karpf, D. 43 Kaufman, G. F. 9 Keith, T. 94 Kensicki, L. J. 33 Kenski, K. 150 Keohane, R. O. 44 Kerlinger, F. N. 22, 26, 34, 158 Ki, E. 3, 10 Kifer, M. J. 2, 28

Kim, H. 27 Kim, J. K. 89 Kim, S. 10, 151 Kim, S. H. 72–73, 89 Kim, Y. M. 15, 50 Kiousis, S. 151 Klapper, J. T. 7 Klijn, E. H. 17 Kornfield, R. 40 Kosicki, G. M. 5 Koss, J. 151 Kothari, R. 95–96 Kraemer, H. C. 127 Krakowiak, M. 9 Krippendorff, K. 7, 22–23, 55–56, 64–65, 113, 117–118, 123–124, 125–126, 128, 129, 134, 145 Krumsvik, A. H. 2 Kurant, M. 96 Kurpius, D. D. 51 Kutz, D. O. 91 Labre, M. P. 13, 28–29 Lacy, S. 9, 10, 12, 21, 31, 38, 65, 75, 85, 86–87, 92, 101, 103–104, 112, 115, 119, 120, 129, 134, 135, 136, 137, 140, 163 Lacy, S. R. 85–86 Lai, C. H. 94 Lamm, A. 94 Landini, N. 30 Landwehr, P. M. 95 Lasorsa, D. L. 20, 72 Lasswell, H. D. 6 Lauf, E. 112 Law, C. 13, 28–29 Lawrence, R. G. 9 Lawrence, S. 95 Lazarsfeld, P. F. 7 Leccese, M. 2 Lee, C. 15 Lee, S. 3 Leslie, L. 30 Levine-Donnerstein, D. 126, 127 Levine, R. A. 126 Lewis, S. C. 16, 36, 38, 40, 42–43 Liang, Y. E. 95 Liu, H. 95 Liu, J. S. 126 Lombard, M. 126, 192 Lovejoy, J. 12, 16, 115, 120, 130, 140 Lowery, S. A. 6 Lucas, J. 94 Luke, D. A. 86

Luo, J. 40 Luque, S. 8 Lynch, T. 1 MacLean, M. S., Jr. 9 Mahrt, M. 32, 73 Malamuth, N. M. 151 Markopoulou, A. 96 Martin, H. 85–86 Martins, N. 13, 27 Mastro, D. 8 Mastro, D. E. 128 Mathet, Y. 123 McCluskey, M. 15, 50 McCombs, M. 8, 21 McCombs, M. E. 8, 15, 152 McCorkle, S. 10 McDevitt, M. 151 McEwan, B. 21 McIlwain, C. 39 McLeod, D. M. 5, 6 McLeod, J. M. 5, 6, 23 McLoughlin, M. 18 McMillan, S. J. 90 Mellado, C. 1–2, 10 Messner, M. 28 Meybodi, M. R. 95 Miller, D. C. 157 Miller, E. A. 27 Moen, M. C. 18 Moerbeek, M. 164 Molitor, F. 8 Montes, I. 2 Moon, T. J. 40 Morgan, M. 151 Morstatter, F. 95 Moser, C. A. 82, 83 Muñoz, C. L. 150–151 Mutz, D. 27 Naaman, M. 94 Nagovan, J. 87 Neuendorf, K. 38 Neumann, R. 26 Nielsen, R. D. 41 Nili, A. 120, 126 Noe, F. P. 18 Oakes, P. J. 31 Oettingen, G. 18 Oliver, M. B. 9 Olson, B. 52 Ommeren, R. L. 87 Opperhuizen, A. E. 17, 36

Osgood, C. E. 59 Ostrowski, A. 151 Palguna, D. S. 95–96 Paltoglou, G. 32 Papacharissi, Z. 27, 52 Parde, N. 41 Parkin, M. 2, 28 Parnet, O. 94 Paul, N. 161 Pennebaker, J. W. 40 Pennock, D. M. 95 Peter, J. 112 Peterson, C. 18 Peterson, M. 27 Pettiway, K. M. 1, 26–27, 89 Pfau, M. 151 Pfeffer, J. 95 Pinkleton, B. E. 147 Poeschl, S. 51 Popa, S. A. 94 Porter, C. J. 87 Potter, W. J. 126, 127 Prabowo, R. 32 Pratt, C. A. 18 Pratt, C. B. 18 Pruett, J. A. 51 Quarfoot, D. 126 Ramaprasad, J. 87 Randle, Q. 87 Rapoport, A. 47 Ratan, R. A. 13 Reese, S. D. 8, 10, 12, 14, 15, 165–166 Reicher, S. D. 31 Reif, A. 51 Rey-Cao, A. 51 Reynolds, A. 8, 21 Reynolds, P. D. 12, 20, 21 Rezvanian, A. 95 Rhodes, K. V. 50–51 Rice, R. E. 27 Richardson, J. 151 Riffe, D. 3, 9, 10, 12, 13, 15–16, 30, 31, 74, 85, 86–87, 90, 92, 115, 120, 135, 153, 163 Roberts, N. L. 6 Robinson, K. 86 Robinson, T. 3 Rogers, E. M. 5, 21 Rogers, M. K. 87 Rokeach, M. 141

Romney, D. 44 Rosenstiel, T. 133, 134, 137 Roskos-Ewoldsen, D. 9 Rowling, C. M. 14 Rubin, A. M. 8 Rusmevichientong, P. 95 Sapolsky, B. S. 8, 150 Saxton, K. 13 Scharkow, M. 32, 73 Scheufele, B. 14 Scheufele, B. T. 9–10 Scheufele, D. A. 8, 9–10 Schoenfeld, A. C. 58 Schouten, K. 17 Sciamanna, C. N. 27 Scott, D. K. 121–123, 124 Scott, W. A. 87 Seligman, M. E. P. 18 Severin, W. J. 5, 7 Shahar, G. 51 Shah, D. V. 40 Shaw, D. L. 8, 15 Sheets, P. 14–15 Shils, E. A. 6 Shin, J. 30, 59, 60, 67 Shoemaker, P. J. 10, 12, 14, 15, 20, 165–166 Shrum, L. J. 9 Signorielli, N. 128, 151 Simmons, J. M. 58 Simon, T. F. 101, 186 Simonton, D. K. 18–19 Singletary, M. W. 30 Skogerbo, E. 2 Slater, M. D. 138 Smith, J. R. 40 Smith, S. L. 8 Snook, B. 125–126 Snyder-Duch, J. 126 Sobel, M. R. 10, 92 Soffin, S. 111 Somerick, N. 13–14 Souley, B. 2, 27–28, 30 Sparks, C. W. 8 Sparks, E. A. 8 Sparks, G. G. 8 Stamm, K. R. 31 Stanley, J. C. 137–138 St. Cyr, C. 135, 140 Stegner, W. 6 Stempel, G. H., III 9, 22–23, 54, 85, 89 Stewart, B. M. 39 Stewart, R. K. 89

Stoddard, S. 85–86 Stokes, D. E. 160 Stouffer, S. A. 157, 158 Strodthoff, G. G. 58 Stryker, J. E. 92, 93 Subramaniam, L. V. 95–96 Suci, G. J. 59 Sundar, S. S. 27 Tabachnick, B. G. 187, 188, 190, 191 Táboas-Pais, M. I. 51 Tajfel, H. 31 Tan, A. 59 Tankard, J. W., Jr. 5, 7, 8–9, 20, 33, 146 Tannenbaum, P. H. 59 Tate, M. 120 Tausczik, Y. R. 40 Tenenboim-Weinblatt, K. 17 Tewksbury, D. 8 Thelwall, M. 32 Theocharis, Y. 94 Thorson, E. 9, 135 Thorson, K. 31, 51–52, 59, 60, 67, 94 Thrasher, J. F. 89 Tichenor, P. J. 5, 23 Tingley, D. 44 Toma, C. L. 40 Tompkins, J. E. 1 Towner, T. L. 151 Tremayne, M. 27 Trilling, D. 45 Trumbo, C. 12 Turner, J. C. 31 Turner, L. W. 85 Vafeiadis, M. 28 van Dalen, A. 1–2 Van de Schoot, R. 164 van Driel, I. I. 1 Vashi, A. 50–51 Vincent, R. C. 30 Vogt, W. P. 4–5 Vorderer, P. 9 Wadsworth, A. J. 114 Wagner, K. 87 Wall, M. 135 Wang, D. 150 Wang, X. 90 Wanta, W. 10, 15, 151 Washburn, R. C. 59 Waters, R. D. 94 Watson, B. R. 9, 12, 16, 92, 120, 151 Weaver, D. A. 92

Weaver, D. H. 30, 31 Weaver, J. B. 87 Weber, R. P. 10, 22, 190–191 West, D. M. 27 Westerman, D. 21 Westley, B. H. 9 Wetherell, M. S. 31 Whaples, R. 18 Whitney, D. C. 10 Wicks, R. 182–183 Wicks, R. H. 2, 27–29, 30 Widlöcher, A. 123 Wildman, S. S. 75 Williams, A. E. 9, 14 Williams, D. C. 13 Wilson, C. 3, 59–60 Wimmer, R. D. 12, 13, 24, 81, 114, 136, 157 Windhauser, J. W. 54 Woodman, K. A. 87

Wooldridge, J. M. 190 Wray, R. J. 92 Wrightsman, L. S. 18 Wu, H. D. 10 Wu, L. 92 Yanovitzky, I. 92 Yi-Fan Su, L. 150 Young, J. 10 Young, S. M. 51 You, Q. 40 Yu, Y. C. 13 Zamith, R. 16, 38, 40, 43–44 Zeldes, G. 55 Zhang, Y. 151 Zhao, X. 125–126, 127, 128 Zhu, J. 40 Zillmann, D. 9 Zullow, H. M. 18

Subject Index

Note: Page numbers in italics indicate figures and those in bold indicate tables. abstractness, as trait of science 21 abstract relations in classification 59–60 access to data 44–45, 75, 88, 92, 93 accuracy, as reliability type 113 agenda-setting, media and 8, 15, 21, 141, 143, 151 algorithmic sampling 95–96 algorithmic text analysis (ATA) 16–17, 37–42; advantages and disadvantages of 37, 39–40; applications and best practice for 40–42, 45; as distinct from content analysis 38–39, 45–46 alpha coefficients 117, 120–121, 123–125, 127–130 ambiguity: non-text forms of communication and 52–53; variables, as source of 110–111 analysis of variance (ANOVA) 176–177, 178 antecedents of content 10, 11, 11, 12, 139, 141; framework for research program about 165–166 audiences, assumptions about content effects on 6, 7, 8, 9 big data, computer content analysis and 39, 44–45 binary attribute structure in classification 60 bullet, persuasive message 6, 7 category definitions 104, 145–147; as source of coder disagreements 110–111

causal relationship 154–157; control of rival explanations and 156–157; defined 154; models 185–187, 187; observable correlation condition for 154–155; time order condition for 154–155, 155 census sampling technique 74 centrality of content 11–12, 11 central limits theorem 77 chance agreements, coefficients and 121–124 chi-square: Cramer’s V and 180–182; formula for 181; relationships and 181–182 classification systems: Deese’s 58–60; defined 57–58; requirements for 60–62; units of analysis and 57–62 class structure in classification system 58 cluster sampling 82–83 coder disagreements, sources of 110–112 coder reliability assessments 113–119; content selection for tests 114–115; selection procedures and 115–118, 118; timing of tests and 118–119 coders, reporting standards for 192–193 coder training 108–112, 113–114; coding process and 109–110; disagreement sources and 110–112 coding, computer see algorithmic text analysis (ATA) coding sheets 69–70, 107–108, 108, 109 Cohen’s kappa 122–123, 124; high agreement/low reliability controversy with 125–128

communication content: appropriate and meaningful 26; broadcast or film 27, 51; centrality model of 11; description content and 29–30; digital 27–28, 51–52; forms of 26–28, 49–53; inference drawing and 30–31; verbal 50–51; visual 26–27, 51–52 Communication Monographs 115, 120, 130 computers in content analysis: algorithmic text analysis (ATA) 16–17, 37–42; big data and 39, 44–45; concordances and 41; databases and 91–93; dynamic web content, “freezing” 16, 42, 43–44; human language and 39–40, 41; hybrid approaches 36–37, 38, 42–44, 45; keyword searches and 41, 92, 93; machine learning and 16–17, 36, 40, 41; overview of 36–38; sampling digital content 73, 87–97; social media and 16, 88–89, 93–96; software, corporate and open-source 44–45; uses of 36, 42 concept: complexity and number of variables 100–102; defined 100 conceptual and operational definitions 100, 145 conceptualization in content analysis design 149–150, 160–161 concordances, computer content analysis and 41 concurrent validity 135 consecutive unit sampling 76 construct validity 136–137 content analyses, designing: conceptualization in 149–150, 160–161; correlation, causation and 153–157, 155; described 148–149; good vs. bad design 157–159; model for 159, 159–164; purpose of 149, 149–150, 160–161; typology for studies 150–151 content analysis: defined 3, 23; evolving communication technologies and 30; nonreactive nature of 10, 18, 22, 33; production context of 9–11; reliability see reliability in content analysis; research applications of 14–19; as research technique, issues in 31–33; role of, in mass

communication research 4; as social science tool 20–25; use of, to examine influences on content 151; see also computers in content analysis; content analysis protocol; data analysis content analysis articles, reporting standards for 192–193 content analysis as social science tool: adapting definition for 22; communication content and 26–28; definition of 23; describing/inferring and 29–31; measurement and 28–29; replicability of science and 24–25; systematic research and 23–24 content analysis model: conceptualization and purpose 160–161; data collection and analysis 163–164; design 161–163; overview of 159, 159 content analysis of verbatim explanations (CAVE) 18 content as consequence 9–10 content-centered model 11 content centrality 11–12, 11 content forms, measurement and 49–53 content, measurement steps for 68–70 content units in content analysis 48–49 contingency effects of mass communication 7–9 convenience sampling 74–76 correlation: causal relationship and 154–156; content analyses and 153, 155; defined 154; scatter diagrams of 183, 184; significance testing and 185; spurious 186; strength of 179; techniques for 179–185, 191 counting, as data description 171 covariation, defined 178 Crimson Hexagon 44–45 daily newspapers, sampling research on 84–86, 85 data, access to sampling 44–45, 75, 88, 92, 93 data analysis: association techniques 191; describing data 171–172; fundamentals of 169–171; hypotheses and research questions for 170–171; introduction to 168–169; proportions and means, significance of 172–177; statistical

assumptions and 190–191; see also describing data for analysis; differences; relationships databases, sampling with 91–93 data mining 41 describing data for analysis 171–172, 178; counting and 171; mean and 171–172; proportion and 172 descriptive content analysis 13–14, 29–30 dictionary-based approaches, computer analysis and 17, 40–41 differences: in samples 176–177; of means test 175; of proportions test 175; significance of 172–174; two-sample, null hypothesis and 174–176 digital content, sampling 73, 87–97; algorithmic sampling 95–96; databases and 91–93; overview of 87–89; social media and 93–96; suggestions for 96 dimensional ordering in classification 58–59 disproportionate/proportionate sampling 81–82 dummy tables 162, 162 dummy variables 63–64, 188 dynamic web content, “freezing” 16, 42, 43–44 effects research, mass communication: content analysis and 5–9; contingency effects and 7–9; limited effects and 7; overview of 5; powerful effects and 5–7 empirical relevance, as trait of science 21 empiricism 4 enumeration, rules of 67–68 external validity: category nature and 145–147; content nature and 144–145; overview of 137–139, 139; scientific knowledge and 142–143; social validity and 143–147; types of 139 Facebook, sampling with 88, 93–94 face validity 134–135 feedback relationships 190 finite population correction 79 frames/framing: content analyses about 33; defined 8–9; see also sampling frame F-ratio 177

grouping in classification system 58 holism 5 Holsti’s coefficient 120–121 hypotheses 151–153, 170; defined 152, 170; research design and 160–162, 163, 164; theory-driven, testing 23 idealism 4 independence in classification 61–62 individual communication sampling 96–97 inference drawing 22, 23, 29–31; causal relationship and 154–156; computer content analysis and 36, 41; concurrent validation and 135; convenience samples and 75; correlation and 154; descriptive data and 30–31; null hypothesis and 174; physical content only, cautions in 55; random sampling and 56, 71, 73, 81, 83; sampling error and 78–79, 173; standard error and 78–79; statistical, procedures of 180; validity and 132, 138–139 inferring from content 29–31 intercoder reliability 47; see also coder reliability assessments; coder training internal validity 137, 138, 139–141; content analysis control and 139–140; correlation and 141; explained 137; time order and 140–141; types of 139 Internet sampling 73, 87–96, 97 intersubjectivity 21 interval measures 66, 178, 191 Keyword-in-context (KWIC), computer content analysis and 41 keyword searches 41, 92–93 Krippendorff’s cAlpha 123–124; high agreement/low reliability controversy with 125–128 latent content 32–33, 64, 100–102, 112, 135, 145–146 layered units of observation 53–54 legacy media, sampling research on 84–87, 85 Letters from Jenny (Allport) 17–18

LexisNexis 92 Linguistic Inquiry and Word Count (LIWC), computer content analysis and 40 machine learning 16–17, 36, 40, 41 magazines, content analysis of 18 magazines, sampling research on 85, 86–87 manifest content 32, 33–34, 64, 100–101, 102, 145, 146–147; advantages of quantitative content analysis of 33–34; ambiguity in non-text forms of 52–53 mass communication research: content analysis role in 4; effects research, content analysis and 3–9; scientific method of 4, 20–21 mean: defined 171–172; differences, significance of 173–174; formula for difference of 175; null hypothesis and 174; number of samples and 176–177; sample measures, generalizing 173; standard error formula for 78, 175–176 meaning units of observation 55–56 measurement: enumeration, rules of 67–68; non-text forms and 52–53; overview of 47–48; units of analysis and 57–62; validity 134–137, 139; variables and 48–49; see also measurement, levels of measurement, levels of 63–67, 178, 191; importance of 67; interval 66; nominal 63–65; ordinal 65–66; ratio 66–67 measurement of content 28–29, 57, 68–70 measurement reliability, validity and 133–134 measurement validity, tests of 134–137; concurrent validity 135; construct validity 136–137; face validity 134–135; predictive validity 135–136 media, agenda-setting of 8, 15, 21, 141, 143, 151 median, defined 77 media organizations, general framework for research program about 165–166 Mediating the Message: Theories of Influences on Mass Media Content (Shoemaker and Reese) 15

mediatization 2 multi-case coding sheets 109, 109 multiform communication 51–52, 53 multiple regression 187–189, 190; equation 188–189; r-squared statistic 189 multistage sampling 83–84 network television news, sampling research on 85, 87 news content, influences on 9–10 news judgment 9, 15 newspapers, sampling research on 84–86, 85 nominal measures 63–65, 178, 191 non-probability sampling 74–76 non-reactive, content analysis as 10, 18, 22, 33 non-text forms, measurement and 52–53 null hypothesis, two-sample differences and 174–176 numeric values, assigning 28–29; rules for 67–68 observational process, validity in 137–141; design and internal validity 139–141; internal/external validity and 137–139, 139; overview 137 observation, units of 53–56; layered 53; meaning 55–56; physical 54–55; sampling concerns with 56; syntactical 55–56 online content: “freezing” dynamic websites 16, 43–44; sampling of 73, 87–97; social media 43–44, 51–52, 93–96 open-source software, computer content analysis and 45 operationalization 24 ordinal measures 65–66, 178, 191 “other” category use in classification 61 parametric procedures 67 Pasteur quadrant 160; see also social validity Payne Fund Studies 6 Pearson’s product-moment correlation 124–125, 183, 184; formula for 183; variable changes and 183–184

peer-review process, validity and 142–143 percentage of agreement 120–121, 128, 193 periodicity, systematic sampling and 81 persuasive messages 5, 6, 7 physical units of observation 54–55 population 27, 31, 56, 62, 67, 71–82, 116–117, 128, 145, 163–164, 171–182 predictive validity 135–136 probability sampling 71, 76–80; used in reliability testing 115 production context of content analysis 9–11 product-moment correlation, Pearson’s 124–125, 183, 184 proportion: defined 172; differences, significance of 172–174; formula for difference of 175; null hypothesis and 174; number of samples and 176; sample measure, generalizing 173; standard error formula for 78 proportionate/disproportionate sampling 81–82 protocol, content analysis 102–108, 104–107, 108, 109; coding sheets and 107–108, 108, 109; development of 103; organization of 103–104; problems with 110–111, 112; purpose of 102–103; reporting standards for 192–193 purposive sampling 76 quantitative content analysis: criticism of 31–32; definition of 3, 23; of manifest content, advantages of 33–34; measurement and 28–29; range of applications 1–3; see also content analysis rank order correlation (rho), Spearman’s 182, 184 ratio measures 66–67, 178, 191 ReCal (software) 124 reductionism 4–5 relationships: causal, defined 154; causal models of 185–187, 187; chi-square and 180–181; correlation, defined 154; Cramer’s V and 180, 181; described 179; feedback loops in 190; finding techniques for 179–185, 191;

linear/curvilinear 183, 184; multiple regression and 187–189; Pearson’s product-moment correlation 124–125, 183, 184; research design and 153–157, 155; scatter diagrams of 183, 184; significance testing and 185; Spearman’s rho 182, 184; strength of 179 reliability coefficients: Alpha, variations of 125; controversy about 125–128; overview of 119–120; Pearson’s product-moment correlation as 124–125; percentage of agreement 120–121; research literature containing 120; selection of 129–130; software for calculating 124 reliability in content analysis: achieving 99; coder bias and 99, 112; coder training and 108–112; complexity of concept and 100–102, 112; conceptual/operational definitions of 100; content analysis protocol and 102–108, 104–107, 108; defined 98; overview of 98–99; reporting standards for 193; summary of 130–131; variables, number of 100–102; see also coder reliability assessments; reliability coefficients reliability, reporting standards for 193 replicability trait of science 24–25 reporting standards for content analyses 192–193 reproducibility, as reliability type 113 research applications, content analysis 14–19 research design: correlation, causation and 153–157; described 148–149; elements of 158–159; good vs. bad 157–159; model for research program 164–166 research questions 151–153, 170; defined 170; research designs and 160–162, 163, 164 rival explanations 156–157 r-squared statistic, multiple regression and 189 r-square proportion 185 sample, defined 71 sample size, formula for 116–117 sampling: algorithmic 95–96; big data and 39, 44–45; census 74; cluster

82–83; convenience 74–76; databases 91–93; digital content 87–97; individual communication 96–97; multistage 83–84; non-probability 74–76; overview of 71–72; probability 76–80; proportionate/disproportionate 81–82; purposive 76; reporting standards for 192; simple random 80; social media content 93–95; stratified 81–82; stratified for legacy media 84–87; systematic 80–81; techniques 73–84; time periods 72–73 sampling distribution 77, 175–176 sampling error, calculating 78–79, 173 sampling frame 72 sampling techniques: census 74; non-probability 74–76; overview of 73–74; probability 71, 76–84 scaling in classification 58–59 scientific method of content analysis 4, 20–21 scientific process 4, 20–21; peer-review and 142–143 Scott’s pi 121–122, 123, 124; high agreement/low reliability controversy with 125–128 semantical validity 145 semantic differential 59 significance testing, correlation and 185 simple random sampling 80–81 single case coding sheets 107–108, 108 social media, sampling 93–96 social science, approach to knowledge of 20–21, 132–133; peer-review process and 142–143 social validity 143–147; types of 139 societal ideology, as influence on media content 166 spatial models in classification 59 spatial representation in classification 59 Spearman’s rank order correlation (rho) 182, 184 spurious correlation 186 stability, as reliability type 113 standard deviation (SD), in standard error formula 78 standard error (SE): adjusted for stratified samples 82; fpc formula and 79; of mean formula 78, 175–176; of proportion formula 78; sample size, formula for 116–117 stratified sampling 81–82; for legacy media 84–87, 85

structural equation modeling (SEM) 190 supervised machine learning (SML) 16–17, 36, 40, 41 symbolic complexity 112 symbols of communication content 26–28 syntactical units of observation 55–56 systematic approach of content analysis 22, 23–24 systematic sampling 80–81 theoretical definitions of variables 68–69 theoretically appropriate, defined 67 time order in content analysis 140–141, 154, 155 time units of measurement 54–55 Twitter: application programming interface (API) 43, 44, 95; big data sets and 16, 39, 44, 88; sampling with 43, 88, 93–95, 96; structured data format of 43 two-sample differences, null hypothesis and 174–176 units of analysis 57–62; classification systems for 57–62; defined 57 units of observation: meaning 55–56; physical 54–55; sampling concerns with 56 universe 71–72 validity: external 137–189, 139, 141–147; internal, research design and 139–141; measurement reliability and 133–134; measurement, tests for 134–137; overview of 132–133, 139; peer-review process and 142–143; semantical 145; social 143–147; social science meaning of 132–133; types of 139 validity, defined 132 variables 47–48; in content analysis 48–49; reporting standards for 192; theoretical definitions of 68–69 variance, defined 47 verbal communication 50–51 visual communication 26–27, 51–52 weekly newspapers, sampling research on 85, 86 written communication 50 YouTube 51–52, 94