Using Corpus Methods to Triangulate Linguistic Analysis

This book builds on Baker and Egbert's previous work on triangulating methodological approaches in corpus linguistics, Triangulating Methodological Approaches in Corpus Linguistic Research, and takes triangulation one step further to highlight its broader applicability when implemented with other linguistic research methods. The volume showcases research methods from other linguistic disciplines and draws on nine empirical studies from a range of topics in psycholinguistics, applied linguistics, and discourse analysis to demonstrate how these methods might be most effectively triangulated with corpus-linguistic methods. A concluding chapter synthesizes these findings as a means of pointing the way toward future directions for triangulation and its implications for future linguistic research. The combined effect reveals the potential for the triangulation of these methods not only to enhance rigor in empirical linguistic research but also to deepen our understanding of linguistic phenomena and variation by studying them from multiple perspectives, making this book essential reading for graduate students and researchers in corpus linguistics, applied linguistics, psycholinguistics, and discourse analysis.

Jesse Egbert is Associate Professor of Applied Linguistics at Northern Arizona University. His research focuses on register variation and methodological issues in corpus linguistics. His recent books include Register Variation Online (2018) and Triangulating Methodological Approaches in Corpus Linguistic Research (2016). He is General Editor of the journal Register Studies.

Paul Baker is Professor of English Language at Lancaster University. His research involves applications of corpus linguistics and his recent books include American and British English (2017), Using Corpora to Analyze Gender (2014) and Discourse Analysis and Media Attitudes (2013). He is the commissioning editor of the journal Corpora.

Routledge Advances in Corpus Linguistics
Edited by Tony McEnery, Lancaster University, UK and Michael Hoey, Liverpool University, UK

Metaphor, Cancer and the End of Life: A Corpus-Based Study
Elena Semino, Zsófia Demjén, Andrew Hardie, Sheila Payne and Paul Rayson

Understanding Metaphor through Corpora: A Case Study of Metaphors in Nineteenth Century Writing
Katie J. Patterson

TESOL Student Teacher Discourse: A Corpus-Based Analysis of Online and Face-to-Face Interactions
Elaine Riordan

Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014
Edited by Vaclav Brezina, Robbie Love and Karin Aijmer

Multilayer Corpus Studies
Amir Zeldes

Self-Reflexive Journalism: A Corpus Study of Journalistic Culture and Community in The Guardian
Anna Marchi

The Religious Rhetoric of U.S. Presidential Candidates: A Corpus Linguistics Approach to the Rhetorical God Gap
Arnaud Vincent

Using Corpus Methods to Triangulate Linguistic Analysis
Edited by Jesse Egbert and Paul Baker

For more information about this series, please visit: www.routledge.com/Routledge-Advances-in-Corpus-Linguistics/book-series/SE0593

Using Corpus Methods to Triangulate Linguistic Analysis
Edited by Jesse Egbert and Paul Baker

First published 2020 by Routledge
52 Vanderbilt Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2020 Taylor & Francis

The right of Jesse Egbert and Paul Baker to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested

ISBN: 978-1-138-08254-0 (hbk)
ISBN: 978-1-315-11246-6 (ebk)

Typeset in Sabon by Apex CoVantage, LLC

Contents

List of Figures
List of Tables
List of Contributors

1 Introduction
Jesse Egbert and Paul Baker

2 Triangulating Text Segmentation Methods with Diverse Analytical Approaches to Analyzing Text Structure
Erin Schnur and Eniko Csomay

3 Working at the Interface of Hydrology and Corpus Linguistics: Using Corpora to Identify Droughts in Nineteenth-Century Britain
Tony McEnery, Helen Baker and Carmen Dayrell

4 Analysing Representations of Obesity in the Daily Mail via Corpus and Down-Sampling Methods
Paul Baker

5 Connecting Corpus Linguistics and Assessment
Geoffrey T. LaFlair, Shelley Staples, and Xun Yan

6 Examining Vocabulary Acquisition Through Word Associations: Triangulating the Psycholinguistic and Corpus-Based Approaches
Dana Gablasova

7 If Olive Oil Is Made of Olives, then What's Baby Oil Made of? The Shifting Semantics of Noun+Noun Sequences in American English
Jesse Egbert and Mark Davies

8 Corpus Linguistics and Event-Related Potentials
Jennifer Hughes and Andrew Hardie

9 Priming of Syntactic Alternations by Learners of English: An Analysis of Sentence-Completion and Collostructional Results
Stefan Th. Gries

10 Usage-Based Theories of Construction Grammar: Triangulating Corpus Linguistics and Psycholinguistics
Nick C. Ellis

11 Synthesis and Conclusion
Jesse Egbert and Paul Baker

Index

Figures

2.1 Manual Text Segmentation
2.2 Segmented Text
2.3 Algorithm for Similarity Measure
2.4 Vocabulary Profile Based on Similarity Measures at Each Word (500-word Segment)
3.1 Mentions of DROUGHT in UK Newspapers, 1800–1899, Excluding Non-UK Specific References
3.2 DROUGHT in UK Countries, as Mentioned in Nineteenth-century British Newspapers, 1846–1852
3.3 DROUGHT in UK Counties, as Mentioned in Nineteenth-century British Newspapers, 1846–1852
3.4 DROUGHT in UK Towns and Cities, as Mentioned in Nineteenth-century British Newspapers, 1846–1852
3.5 DROUGHT in UK Countries, as Mentioned in Nineteenth-century British Newspapers, 1880
3.6 DROUGHT in UK Counties, as Mentioned in Nineteenth-century British Newspapers, 1880
3.7 DROUGHT in UK Towns and Cities, as Mentioned in Nineteenth-century British Newspapers, 1880
4.1 Number of Articles About Obesity Over Time
5.1 Dimension 1: Literate vs. Oral Discourse Across ECPE Score Levels
5.2 Dimension 3: Prompt Dependence vs. Lexical Variety Across ECPE Score Levels
5.3 Dimension 2: Colloquial Fluency Across MELAB Score Levels
5.A1 ECPE Rating Scale
5.A2 MELAB Rating Scale
7.1 An Example of the Final Classification Instrument for NN Sequences
7.2 NN Frequency by Agreement Correlations Across Time Periods
7.3 Diachronic Change in the Semantic Categories Within Pattern 1—Frequent → Frequent
7.4 Diachronic Change in the Semantic Categories Within Pattern 1—Infrequent → Infrequent
7.5 Diachronic Change in the Semantic Categories Within Pattern 1—Infrequent → Frequent
8.1 Positioning of Breaks Within and Between Trial Blocks in Experiment 1
8.2 Positioning of Breaks Within and Between Trial Blocks in Experiment 2
8.3 Electrode Sites Grouped Into Nine Zones for Statistical Analysis
8.4 Grand Average ERPs Across a Representative Sample of Electrode Sites, From −200 ms Pre-stimulus to 800 ms Post-stimulus
8.5 Topographic Scalp Maps Showing Mean Amplitude Between 300 and 500 ms
8.6 Grand Average ERPs Across a Representative Sample of Electrode Sites, From −200 ms Pre-stimulus to 800 ms Post-stimulus
8.7 Topographic Scalp Maps Showing Mean Amplitude Between 300 and 500 ms
9.1 The Effect of PRIME_COMPL_CX on the Predicted Probability of Completing the Target Sentence Fragment With a Prepositional Dative
9.2 The Effect of TARGET_STIM_V on the Predicted Probability of Completing the Target Sentence Fragment With a Prepositional Dative
9.3 The Effect of ITEM_NO_WOUTFILLER : TARGET_STIM_V on the Predicted Probability of Completing the Target Sentence Fragment With a Prepositional Dative
9.4 The Effect That Speaker-specific Intercepts Have on the Average Predicted Probability of Completing the Target Sentence Fragment With a Prepositional Dative

Tables

2.1 Authentic Lectures Corpus Composition
2.2 Linguistic Features (Selection) on the Three Dimensions of Linguistic Variation in Academic Classroom Talk
2.3 Mean Percentage of Segments Containing Each Function
2.4 Co-occurrence of Functions
3.1 Newspapers Included in the British Library Nineteenth-century Newspaper Corpus
3.2 Droughts in the Nineteenth Century Based on Hydrometeorological Evidence
3.3 The Degree to Which Geoparsing Thinned Examples of DROUGHT from Each Newspaper
3.4 Top Ten Ranking Years When DROUGHT Appears Most Frequently (Normalised to per Million Words) in Nineteenth-century Newspapers
3.5 The Thinning of Examples From the Glasgow Herald, by Decade
4.1 Collocates of Obese
4.2 Collocates of Obesity
4.3 Summary of Findings From the Different Approaches
4.4 Proportions of Agreement Between Pairs of Analysis
4.5 Combinations of Sets and Number of Findings Elicited by Each Combination
5.1 Writing Prompts Represented in the ECPE Corpus
5.2 The ECPE Essay Corpus
5.3 Variables Included in the Multi-dimensional Analysis
5.4 Factor Loadings of Individual Variables for Dimension 1 and Dimension 3
5.5 Overview of the MELAB Corpus
5.6 Variables Included in Multi-dimensional Analysis
5.7 Factor Loadings of Individual Variables Included on Dimension 2 in the MELAB Study
6.1 Mean Number of WAs in L1 and L2
6.2 First WA by Target Word (Responses in English Only)
6.3 Number of WAs in Superordinate Relationship to the Target Word
6.4 Proportion of WAs Included in the New-GSL
6.5 Ecocentrism: Most Frequent WAs
6.6 Terroir: Most Frequent WAs
6.7 Transhumance: Most Frequent WAs
7.1 Composition of COHA, by Register and Decade
7.2 Semantic Categories of NNs, With Rephrased Sentences and Examples
7.3 Classification Agreement Results
7.4 Fleiss' Kappa Results Across the 13 Semantic Categories for the Reduced Set of 974 NNs
7.5 Normed Rates of Occurrence (per Million Words) of NN Semantic Categories Across Six Time Periods
8.1 Bigrams for Experiment 1
8.2 Bigrams for Experiment 2
8.3 Summary of Post Hoc ANOVA Results Carried Out at Left, Right, and Midline Electrode Sites in the 350–500 ms Latency Range
8.4 Summary of Post Hoc ANOVA Results Carried Out at Anterior, Central, and Posterior Electrode Sites in the 350–500 ms Latency Range
8.5 Statistical Measures of Collocation Strength Ranked by Strength of Correlation
8.A1 Statistical Properties of Experiment 1 Bigrams Extracted From Written BNC1994
8.A2 Statistical Properties of Experiment 2 Bigrams Extracted From Written BNC1994
8.A3 Statistical Measures of Collocation Strength for Each Bigram Pair in Experiment 2 Part 2
9.1 Construction Frequencies Obtained in the Priming Experiment
9.2 Observed Dative-alternation Construction Frequencies With Give in the ICE-GB
9.3 Fixed Effects in the Final Regression Model (LR-tests for deletion)
9.4 Collexeme Strengths of Verbs From Gries and Stefanowitsch's (2004) ICE-GB Data Together With Predicted Probabilities for Target Fragment Verbs in Figure 3.2
10.1 Contingency Table
10.2 The Intercorrelations of the Predictor Variables for the Stimulus Set
10.3 A Linear Mixed Model Predicting Word 2 VOT in ms
10A.1 The 192 Stimuli
11.1 The Nine Studies in This Volume Organized Into the Two Types of Methodological Triangulation, With Research Questions and Methods

Contributors

Helen Baker is Newby Research Fellow in the Department of Linguistics and English Language at Lancaster University. Her work, combining social history with corpus analysis, has led to a number of publications on marginalized identities in early-modern England, including her book Corpus Linguistics and 17th Century Prostitution (with Tony McEnery, Bloomsbury, 2016).

Paul Baker is Professor of English Language at the Department of Linguistics and English Language, Lancaster University. He specializes in corpus linguistics, particularly using and developing corpus methods to carry out discourse analysis, as well as being involved in research in gender, sexuality, health and media language. He has written 18 books, including Using Corpora in Discourse Analysis (2006).

Eniko Csomay is a professor at San Diego State University. She completed her B.A. at Eötvös University in Budapest (HU), her M.A. at the University of Reading (UK) and her Ph.D. at Northern Arizona University in Flagstaff (USA) in applied linguistics. Her research focuses on the application of corpus-driven methods to text analysis, particularly as it relates to classroom discourse structure. She has published in multiple international journals, e.g. in Applied Linguistics and the Journal of English for Specific Purposes.

Mark Davies is Professor of Linguistics at Brigham Young University in Provo, Utah, USA. His main areas of research include corpus linguistics and language variation and change. He is the creator of most of the corpora from https://www.english-corpora.org, which are probably the most widely used online corpora.

Carmen Dayrell is a senior research associate in the Department of Linguistics and English Language at Lancaster University. As part of the ESRC Centre for Corpus Approaches to Social Sciences she has worked on a number of discourse-related projects, many related to climate change and the environment.

Jesse Egbert is Associate Professor in the Applied Linguistics program at Northern Arizona University. He specializes in corpus-based research on register variation, particularly academic writing and online language, and methodological issues in quantitative linguistic research, including methodological triangulation. He has more than sixty research publications. He is co-editor (with Paul Baker) of Triangulating Methodological Approaches in Corpus Linguistic Research (Routledge: 2016) and co-author (with Doug Biber) of Register Variation Online (Cambridge University Press: 2018).

Nick C. Ellis is Professor of Psychology and Linguistics at the University of Michigan. His research interests include language acquisition, cognition, emergentism, corpus linguistics, cognitive linguistics, applied linguistics, and psycholinguistics. Recent books include: Usage-based Approaches to Language Acquisition and Processing: Cognitive and Corpus Investigations of Construction Grammar (Wiley-Blackwell, 2016, with Römer and O'Donnell), Language as a Complex Adaptive System (Wiley-Blackwell, 2009, with Larsen-Freeman), and Handbook of Cognitive Linguistics and Second Language Acquisition (Routledge, 2008, with Robinson). He serves as General Editor of Language Learning.

Dana Gablasova is Lecturer at the Department of Linguistics and English Language, Lancaster University. She is also a member of the ESRC Centre for Corpus Approaches to Social Science (CASS). Her research focuses on corpus-based approaches to studying L2 vocabulary and spoken L1 and L2 production. She is also interested in the use of corpora for language assessment and language education.

Stefan Th. Gries is Professor of Linguistics in the Department of Linguistics at the University of California, Santa Barbara (UCSB) and Chair of English Linguistics (corpus linguistics with a focus on quantitative methods, 25%) at the Justus-Liebig-Universität Giessen. He was a Visiting Chair (2013–2017) of the Centre for Corpus Approaches to Social Science at Lancaster University, Visiting Professor at five LSA Linguistic Institutes between 2007 and 2019, and the Leibniz Professor (spring semester 2017) at the University of Leipzig.

Andrew Hardie is Reader in Linguistics at Lancaster University. His main research interests are the methodology of corpus linguistics; the descriptive and theoretical study of grammar using corpus data; the languages of Asia; and applications of corpus methods in the humanities and social sciences. He is one of the lead developers of the Corpus Workbench analysis software, and the creator of its online interface, CQPweb. He is co-author, with Tony McEnery, of the book Corpus Linguistics: Method, Theory and Practice (2012).

Jennifer Hughes is a postdoctoral researcher at the ESRC Centre for Corpus Approaches to Social Science, Lancaster University. Her main research interest is the application of neuroimaging techniques to collocation and other corpus phenomena. She completed her Ph.D. in 2018, entitled The Psychological Validity of Collocation: Evidence from Event-Related Brain Potentials.

Geoffrey T. LaFlair is an assessment scientist at Duolingo. He earned his Ph.D. in applied linguistics, specializing in language assessment and quantitative research methods, from Northern Arizona University. His research interests center on quantitative analyses of language assessments and the language in and elicited by language assessments. His research has been published in Language Testing, Applied Linguistics, and The Modern Language Journal.

Tony McEnery is Distinguished Professor of English Language and Linguistics at Lancaster University. He has published widely on corpus linguistics and is the author of Corpus Linguistics: Method, Theory and Practice (CUP, 2012, with Andrew Hardie).

Erin Schnur is a post-doctoral researcher at the Second Language Teaching and Research Center at the University of Utah. She is currently developing a multi-language corpus of spoken second language speech and a suite of associated materials for use in primary and secondary language classrooms. Her research interests include corpus research methodology, learner corpus research, and English for academic purposes.

Shelley Staples is Associate Professor of English Applied Linguistics and Second Language Acquisition and Teaching at University of Arizona. Her research focuses on quantitative and qualitative corpus-based discourse analysis of L2 academic writing and speech as well as health care communication. Her work has recently been published in journals such as Journal of English for Academic Purposes, English for Specific Purposes Journal, and Applied Linguistics. She is the author of The Discourse of Nurse-Patient Interactions (John Benjamins, 2015).

Xun Yan is an assistant professor of linguistics, second language acquisition and teacher education (SLATE), and educational psychology at the University of Illinois at Urbana-Champaign. His research interests include speaking and writing assessment, scale development and validation, psycholinguistic approaches to language testing, rater behavior and cognition, and language assessment literacy. His work has been published in Language Testing, Assessing Writing, Journal of Second Language Writing, and Asian-Pacific Education Researchers.

1 Introduction

Jesse Egbert and Paul Baker

How Tall Are the Pyramids?

In the sixth century B.C., the Greek philosopher Thales of Miletus wanted to know the height of the pyramids in Egypt. Even if he had possessed a measuring tape of sufficient length and had been able to climb to the top, the shape of the pyramids would have rendered it impossible to measure their exact height at the center. So Thales discovered a clever solution to his problem in which “he measured the height of the pyramids by the shadow they cast, taking the observation at the hour when our shadow is of the same length as ourselves” (Diogenes Laertius, 2018, 1.1, p. 27). Thales used his knowledge of geometry to deduce that his shadow and the shadow of the pyramids projected similar right-angled triangles. Thus, at the moment when the length of his shadow matched his height, the length of the pyramids’ shadows would also match their height. This was the first recorded use of triangulation, or the process of using “distances from and direction to two landmarks in order to elicit bearings on the location of a third point (hence completing one triangle)” (Baker & Egbert, 2016, p. 3). The efficacy of triangulation has ensured that it is still used today; land surveyors, for example, still locate a third point by taking distances from and directions to two known landmarks. However, the concept has been extended to involve other forms of analysis. A related term, methodological triangulation, has been used for decades by social scientists as a means of explaining behavior by studying it from two or more perspectives (Webb, Campbell, Schwartz, & Sechrest, 1966; Glaser & Strauss, 1967; Newby, 1977; Layder, 1993; Cohen & Manion, 2000).

Methodological triangulation involving corpus data (large bodies of naturally occurring text, encoded in electronic form) is clearly on the rise in linguistic research. Recent decades have seen a surge in the use of corpora, and of the corpus linguistic methods and specialist computer software that accompany them, in empirical linguistic research (e.g. Sampson, 2013). While many studies in linguistics use

corpora as their sole source of data, it is becoming increasingly common in empirical linguistics for researchers to triangulate corpus linguistic methods with methods and data from other areas in linguistics. This type of methodological triangulation has proven to be a highly effective means of explaining linguistic phenomena. McEnery and Hardie (2012) note the trend, explicitly encouraging researchers to engage with this type of research, stating that:

the triangulation of corpus methods with other research methodologies will be an important further step in enhancing both the rigour of corpus linguistics and its incorporation into all kinds of research, both linguistic and non-linguistic. To put it another way, the way ahead is methodological pluralism. This kind of methodological triangulation is already happening, to some extent. . . . But we would argue that it needs to be taken further.
(p. 227)

Despite this apparent trend, there has been very little work that has described and evaluated state-of-the-art methods for triangulating corpus linguistics with other research methods in linguistics. One notable exception is a special issue of Corpus Linguistics and Linguistic Theory on ‘Corpora and Experimental Methods’ (see Gilquin & Gries, 2009). However, the studies included in the special issue focussed solely on triangulating corpora with experimental psycholinguistic methods, which is only one of many potentially fruitful areas of research in linguistics that can benefit from triangulation with corpus data.

In a 2016 volume published in Routledge’s Advances in Corpus Linguistics series (the same book series as the present volume), Triangulating Methodological Approaches in Corpus Linguistic Research, we (the two editors) carried out a large-scale experiment on methodological triangulation within corpus linguistics (Baker & Egbert, 2016). Our book presented a cohesive, demonstrative overview of ten methods within corpus linguistics using a single corpus, exploring the extent to which those methods complemented each other. We gave analysts the same corpus and research questions, asking them to work independently and produce a report of their approach and findings, which we then subjected to a comparative analysis in the final chapter of the book. We found a largely complementary picture, with most authors making unique discoveries alongside a smaller number of shared findings and a tiny number of contradictory ones. At the end of the book we argued that

While there are certainly challenges inherent to triangulation research, we believe the benefits of triangulation make efforts to overcome these challenges worthwhile. Moreover, the challenges associated with triangulation research may help motivate corpus linguists

to develop synergy in their research through effective collaborative relationships with scholars from different research orientations.
(ibid., 207–8)

The success of that book project inspired us to think about triangulation between corpus linguistic methods and other linguistic methodologies, which we see as a natural extension and an exciting next step. Thus, this edited volume focuses on triangulating corpus linguistic methods with a wide range of other research methods in linguistics (e.g. psycholinguistic experiments, critical discourse analysis, language assessments). Specifically, we have three primary goals for this volume:

1. Showcase a variety of state-of-the-art research methods in linguistics outside of the realm of corpus linguistics (CL).
2. Include a series of empirical studies on a range of topics in psycholinguistics, applied linguistics, and discourse analysis which triangulate non-CL methods with CL methods.
3. Investigate the extent to which non-CL research methods can complement CL methods to enhance our understanding of linguistic processes and variation.

This book is structured around nine empirical studies, each of which triangulates CL data with non-CL data. We have selected expert researchers who have a strong track record of triangulating corpus methods with other linguistic methods. These nine studies fall into three major areas of linguistics: discourse analysis, applied linguistics, and psycholinguistics. The introduction and conclusion chapters are written by the two editors.

In this introductory chapter we consider the relationship between corpus linguistics and triangulation with reference to relevant earlier research. We then describe the perspective that this collection takes on triangulation and outline the aims of the book and its more specific research questions. This is followed by a summary of how we selected the topics and the instructions that were given to authors, and we conclude the chapter by introducing the nine approaches taken by the authors who contributed chapters, as well as giving a brief summary of the book’s concluding chapter. Let us begin, then, with a brief introduction to corpus linguistics, which, along with triangulation, is one of the two thematic threads which run through this book.

Corpus Linguistics

Linguistics is the scientific study of language. In essence, linguists share the goal of discovering and describing patterns in human language. Linguistics, broadly defined, encompasses many research areas related to the underlying structure of language, the use of languages by their speakers,

how languages vary across social groups and change over time, how language is processed in the brain and acquired by speakers, and how languages are typologically related, to name just a few. As with any scientific enterprise, linguistics relies heavily on empirical data to test hypotheses and explore patterns. There are many different types of linguistic data. For example, psycholinguists use response times, eye tracking data, and neuroimaging; sociolinguists use interviews, questionnaires and surveys; and critical discourse analysts consider both texts and the myriad set of contexts which account for the ways that such texts were produced and received. Although the sources of linguistic data differ widely across the different subfields of linguistics, one source of data is used in nearly all of them: the corpus.

A corpus is a large collection of naturally occurring texts (i.e. texts which were not originally created for the purposes of language analysis). Ideally, corpora are sampled from and thus meant to be representative of a larger linguistic population, allowing researchers to generalize findings from the corpus sample back to the population. Corpora are typically stored digitally and processed with the aid of computers, enabling researchers to process large amounts of corpus data automatically, with increased speed and reliability. Even corpus methods that are qualitative in nature benefit from the use of computer software that allows researchers to search for patterns, sort results, and annotate texts more easily and efficiently.

Corpus-based research has proliferated during the past two or three decades, as reflected in the existence of several dedicated corpus linguistics journals including International Journal of Corpus Linguistics, Corpora, Corpus Linguistics and Linguistic Theory, and International Journal of Learner Corpus Research. It is also increasingly common to find corpus-based studies published in applied linguistics journals (e.g. Applied Linguistics, TESOL Quarterly), sociolinguistics journals (e.g. Language Variation and Change, Journal of Sociolinguistics), discourse analysis journals (e.g. Discourse Studies, Discourse Processes), and general linguistics journals (e.g. Language, Journal of English Linguistics). The reason for the widespread appeal of corpus methods is quite simple: corpora reveal how real people use real language. Whether you’re a psycholinguist interested in the effect of input frequency, a historical linguist interested in diachronic language change, a sociolinguist interested in dialect variation, a language acquisition expert interested in developmental sequences or a critical discourse analyst interested in social relations and the power of language to influence people, corpora can often provide a good amount of naturally occurring language data to answer at least some of the research questions in a study.

Paradoxically, despite being arguably the most ubiquitous type of data in linguistics, corpora are seldom used as the sole data source in most subfields of linguistics. In other words, it is becoming increasingly common for researchers to triangulate corpora and corpus linguistic methods with other data sources

and methods in linguistic research. So, what is triangulation and how has it been used in corpus-based research?

Triangulation

As mentioned earlier, triangulation is a research approach that takes two or more perspectives to investigate a research question. In the words of Cohen, Manion, and Morrison (2000), triangulation is an “attempt to map out, or explain more fully, the richness and complexity of human behavior by studying it from more than one standpoint” (p. 112). Denzin (1970) identified six major types of triangulation: time triangulation, space triangulation, combined levels of triangulation, theoretical triangulation, investigator triangulation, and methodological triangulation. The first three of these types can be grouped together within the category of data triangulation, leaving us with four main types of triangulation.

In data triangulation, the researcher applies the same method to data sets from different times, locations or groups of participants. There are many examples of corpus studies that investigate language from different times (i.e. diachronic studies) and places (i.e. dialect studies), but these studies are typically focussed on describing variation rather than triangulating data. Data triangulation can also be performed with two or more groups of participants. In contrast to other studies with multiple participant groups, the primary goal in data triangulation studies is to validate data through cross-verification. In theoretical triangulation, a researcher draws on two or more theories by actively testing each of them or appealing to each of them to interpret research findings. In investigator triangulation, two or more researchers apply the same methods to the same data set. Whereas data and theoretical triangulation are not particularly common in corpus-based research, there have been some studies based on investigator triangulation. For example, Marchi and Taylor (2009) used investigator triangulation in a corpus-based study of journalist language, and Baker (2015) triangulated findings about representations of foreign doctors from five investigators. The final type, methodological triangulation—or “between methods triangulation” as Denzin (1970) calls it—relies on more than one method to answer the same research question.

Although the triangulation of data, investigators, and theories are valuable approaches to data analysis, in this volume we focus on methodological triangulation, the most widely used and, arguably, the most useful and applicable type of triangulation for corpus research. Despite criticisms from some scholars (see, e.g. Silverman, 1985; Fielding & Fielding, 1986), researchers have noted a large number of benefits associated with methodological triangulation, arguing that it:

1. Presents the situation in a more detailed and balanced way (Altrichter, Posch, & Somekh, 1996).

2. Leads to a deeper and more comprehensive understanding of the issue under investigation (Denzin & Lincoln, 1994, p. 5; Bekhet & Zauszniewski, 2012; Baker & Egbert, 2016).
3. Provides confirmation of findings (Bekhet & Zauszniewski, 2012).
4. Facilitates cross-checking of data to search for (ir)regularities (O’Donoghue & Punch, 2003; Baker & Egbert, 2016).
5. Provides evidence for the validity of research findings (Thurmond, 2001; Cohen et al., 2000).
6. Demonstrates the reliability of the methods and findings (Marchi & Taylor, 2009).
7. Provides opportunities for more robust interpretations (Layder, 1993, p. 128).
8. Leads to increased collaboration among scholars with different theoretical orientations and/or methodological expertise (Baker & Egbert, 2016).

Defining Triangulation

Some scholars in previous research apply the term ‘triangulation’ only to cases where the sole objective of using more than one method is to determine whether the validity of one method is confirmed by the other(s). As noted earlier, this is an oft-cited benefit of methodological triangulation (see Bekhet & Zauszniewski, 2012), but we argue that triangulation has benefits that reach beyond confirmation of findings. Thus, in this volume we adopt a much broader definition of methodological triangulation that extends to:

• “applying two methodologies separately to the same question” (Marchi & Taylor, 2009, p. 5)
• “the combination of two or more . . . methodologic approaches . . . within the same study” (Thurmond, 2001, p. 253)
• “the use of two or more different kinds of methods in a single line of inquiry” (Risjord, Moloney & Dunbar, 2001, p. 41)
• “the observation of the research issue from (at least) two different points” (Denzin, 1970, p. 178)
• “more than one kind of method to study a phenomenon” (Bekhet, 2012, p. 2)

Building on definitions provided by these and other scholars, for the purposes of this volume we adopt a broad and inclusive definition of methodological triangulation which includes any study that applies two or more independent methods to answer the same research question(s). In adopting this definition of triangulation, it is important to note the distinction made by Marchi and Taylor (2009) between methodological triangulation, in which the methods are carried out independently; and

multi-method research, in which the methods are intertwined and often interdependent in the research process. The boundary between methodological triangulation, on the one hand, and multi-method/multi-disciplinary research, on the other, is a fuzzy one at best. For example, the following three chapters in this volume are situated on the fuzzy boundary between methodological triangulation and multi-method research: Chapter 2 (Triangulating Text Segmentation Methods With Diverse Analytical Approaches to Analyzing Text Structure), Chapter 5 (Triangulating Corpus Linguistics and Language Assessment: Using Corpus Linguistics to Enhance Validity Arguments) and Chapter 7 (If Olive Oil Is Made of Olives, Then What’s Baby Oil Made of? The Shifting Semantics of Noun+Noun Sequences in American English). Each of the studies in these three chapters relies on multiple methods during the course of the study, but they do not meet the criteria for the narrow definition of triangulation because they do not have the goal of using one method to validate or confirm the other(s). However, when we adopt our broader definition of triangulation, these three chapters fit squarely within the domain of methodological triangulation.

In the following section we present an overview of previous research that has triangulated corpus linguistic methods with non-corpus linguistic methods. This is organized according to the subfield of linguistics within which the research was conducted, in the following order: discourse analysis, applied linguistics, and psycholinguistics.

Triangulation Between Corpus Linguistics and Other Linguistic Methods

Discourse Analysis

In this book we acknowledge the range of different meanings held by discourse, which is keenly demonstrated by the three chapters in the first third of the book. Discourse can refer to “language above the sentence or above the clause” (Stubbs, 1983, p. 1), language style or register (Ringbom, 1998) or, taking a critical perspective, “practices which systematically form the objects of which they speak” (Foucault, 1972, p. 49). Additionally, we can refer to discourse markers which can have organizational functions, e.g. to show a change in topic or speaker (Aijmer, 2015); or discourse units which operate as discrete, coherent segments of texts, usually focussed around a topic or communicative purpose (Degand & Simon, 2009).

Discourse analysis is one area that has seen an abundance of triangulation research with corpus linguistic methods. A prime example of this type of triangulation is a study of the discourses of refugees and asylum seekers in the UK in which the authors propose a set of methods that draws on both corpus linguistics and CDS methods (Baker et al.,

2008). This spurred additional research in this area, including studies that have directly explored the benefits and drawbacks of triangulating critical discourse studies and corpus methods (see e.g. Baker & Levon, 2015). Many language studies that take a critical stance and also use corpora will often use corpus tools to identify frequent or salient linguistic features, which are then subjected to a more detailed qualitative analysis, such as Baker, Gabrielatos, and McEnery (2013), Brindle (2016) and Jeffries and Walker (2017). The same applies to scholars who use corpora to carry out various forms of discourse analysis but do not explicitly engage with critical perspectives, such as Bednarek and Caple (2012), Harvey (2013) and Partington, Duguid and Taylor (2013). In addition to carrying out textual analysis using a combination of computer-assisted and close reading approaches, discourse analysts also undertake contextual analyses in order to interpret, explain and critique findings. For example, Baker et al. (2013) draw upon readership demographics, government policy, outcomes of complaints to press regulators, attitude polls, crime statistics and political enquiries in their analysis of representations of Muslims in the British press. Critical discourse studies is triangulatory in and of itself, regardless of whether corpus approaches are combined with other techniques within an individual study. Chapters 2–4 in this volume are good examples of research at the intersection of corpus linguistics and discourse studies.

Applied Linguistics

There are many areas within applied linguistics that are actively triangulating corpus linguistic methods with other methodological approaches. In this section we briefly discuss this type of research in language acquisition, language testing, text readability, and lexicography.

The use of corpus data and methods in research on language acquisition is not a new idea. However, the growth of research in this area in recent years is indicated by the advent of conferences devoted to learner corpus research as well as the new journal International Journal of Learner Corpus Research. While not all learner corpus studies are examples of triangulation research, it has become quite common for researchers in this area to triangulate corpus linguistic methods with other methods, such as ratings of speech and writing (LaFlair, Staples, & Egbert, 2015; Staples, LaFlair, & Egbert, 2017), proficiency levels (Gilquin, Bestgen & Granger, 2016; Paquot, 2018), and standardized test scores (Biber, Gray, & Staples, 2014; Biber, Reppen, & Staples, 2017).

There has also been a recent surge of interest in triangulating corpus linguistic methods with language testing methods. Language Testing recently published a Special Issue on Corpus Linguistics and Language Testing that was entirely devoted to the topic. This special issue was based on papers presented in a colloquium devoted to the same topic

at the American Association of Applied Linguistics. LaFlair and Staples (2017) demonstrated how corpus linguistic data can be used as evidence in the validity argument for a test. Römer (2017) uses corpus-based evidence to argue that the lexis/grammar division in rating rubrics is artificial and should be informed by recent corpus-based findings about the lexis-grammar interface. Lu (2017) and Kyle and Crossley (2017) both analyze the relationship between test scores and corpus linguistic measures in order to answer questions about language proficiency. Finally, Scott Jarvis introduces a new construct definition for lexical diversity that draws on both objective corpus methods and subjective reader perceptions. The authors of two commentary articles in this issue present an optimistic outlook for the future potential of methodological triangulation between language testing and corpus linguistics methods (see Cushing, 2017; Egbert, 2017; Xi, 2017).

For many decades, researchers and educators have been interested in automated measures of text readability (see Bormuth, 1969; Dale & Tyler, 1934; Dolch, 1948). These measures are typically simple formulas that account for some measure of (i) syntactic complexity and (ii) lexical complexity (a toy illustration of such a formula is given at the end of this section). Researchers have often attempted to validate readability measures by correlating the scores that texts receive from the automated measure with the results of comprehensibility tests, Cloze tests, or C-tests (see, e.g. Cunningham, Hiebert, & Mesmer, 2018). While these studies are not usually framed as corpus-based studies, they are nearly always based on a corpus of texts on which researchers attempt to validate the readability measures.

Lexicography is another area that has a great deal of untapped potential for methodological triangulation with corpus linguistic methods. Lexicographers have a long history of working with corpora. From its beginnings, the Oxford English Dictionary has been based on and informed by language from natural texts. However, traditional lexicography has not normally involved empirical triangulation between corpus linguistic methods and other methods in lexicography. Rather, lexicographers have drawn on corpus data when it has been helpful to disambiguate word senses and update dictionaries for neologisms. There have, however, been recent calls for lexicography to incorporate more empirical methods, including psycholinguistic methods (e.g. forced choice tasks and acceptability judgments) and corpus linguistic methods (see Arppe & Jarvikivi, 2007). One example of this type of study comes from Gries (2006), who applied behavioral profile analysis in conjunction with several corpus linguistic methods to identify prototypes, determine the distinctiveness of different word senses, and explore word sense networks.

This brief review of previous literature represents only a small sample of the many studies that have triangulated corpus methods with other methods in applied linguistics. Studies in this volume that fall within this area include those in Chapters 5–7.
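To make the readability formulas mentioned above more concrete, the following is a minimal sketch of how such a measure can be computed over a corpus of texts. The weighting scheme and the function name are hypothetical and purely illustrative; published measures, including those reviewed by Cunningham, Hiebert, and Mesmer (2018), use their own empirically derived variables and coefficients.

```python
import re

def toy_readability(text):
    """Illustrative readability score: average sentence length stands in for
    syntactic complexity and average word length for lexical complexity.
    The weights (0.5 and 5.0) are arbitrary, not from any published formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    mean_sentence_length = len(words) / len(sentences)          # words per sentence
    mean_word_length = sum(len(w) for w in words) / len(words)  # characters per word
    return 0.5 * mean_sentence_length + 5.0 * mean_word_length  # higher = harder
```

Validation of this kind of score would then proceed as in the studies cited above: compute it for every text in a corpus and correlate the resulting scores with comprehension, Cloze, or C-test results for the same texts.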

Psycholinguistics

The field of psycholinguistics has a long history of triangulating corpus methods with psycholinguistic methods. In fact, it could even be said that this type of triangulation is becoming the norm rather than the exception in current psycholinguistic research. In 2009 the journal Corpus Linguistics and Linguistic Theory devoted a special issue to this type of triangulation. In the introduction to that issue, Gilquin and Gries (2009, p. 9) state:

Because the advantages and disadvantages of corpora and experiments are largely complementary, using the two methodologies in conjunction with each other often makes it possible to (i) solve problems that would be encountered if one employed one type of data only and (ii) approach phenomena from a multiplicity of perspectives.

They further observe that when corpus linguistic and psycholinguistic methods are triangulated, it is almost always psycholinguists who are incorporating corpus methods into their research. They conclude by calling on corpus linguists to “look more into the possibilities of complementing their corpus studies with experimental data” (Gilquin & Gries, 2009, p. 16). The papers included in that special issue triangulate (a) lexical elicitation data and corpus frequencies (McGee, 2009; Nordquist, 2009), (b) corpus-based measures of idiomaticity with idiomaticity judgments by native speakers (Wulff, 2009), (c) psycholinguistic measures of formulaicity (e.g. formulaicity ratings, response times) and association measures (Ellis & Simpson-Vlach, 2009), and (d) acceptability judgments and corpus frequencies (Baroni, Guevara, & Zamparelli, 2009).

For researchers within the usage-based paradigm, it has become extremely common to triangulate corpus methods with psycholinguistic methods. Research within this framework relies heavily on triangulating corpus-based frequency information for words and constructions with psycholinguistic measures, such as lexical decision tasks (e.g. Forster, 1976; Balota & Chumbley, 1984), other measures of lexical response times (Kirsner, 1994), sentence completion tasks (Gries & Wulff, 2005), and eye-tracking (Demberg & Keller, 2008). Other more advanced neurolinguistic methods such as EEG and MRI have also been proposed (see, e.g. Molinaro & Carreiras, 2010). We only have space to mention a few examples of triangulation between psycholinguistic and corpus linguistic methods within the usage-based framework. The reader is referred to more comprehensive reviews of this literature in Ellis (2002), Ellis (2012), and Ellis (2017).

Research focussed on phraseology and formulaic language is another fruitful area of triangulation between psycholinguistic and corpus

linguistic methods. Generally speaking, research within this area has produced overwhelming evidence that frequent multi-word expressions are processed faster than control phrases (see, e.g. Bannard & Matthews, 2008; Janssen & Barber, 2012; Janssen & Caramazza, 2011; Siyanova-Chanturia, Conklin, & Schmitt, 2011; Siyanova-Chanturia, Conklin, & van Heuven, 2011; Sosa & MacFarlane, 2002; Tremblay, Derwing, Libben, & Westbury, 2011; Tremblay & Tucker, 2011). Some of the research within this area has focussed on the relationship between event-related potentials (ERPs) and the association strength of collocations (e.g. Molinaro & Carreiras, 2010) and other multi-word expressions (e.g. Siyanova-Chanturia & Spina, 2015; Siyanova-Chanturia, Conklin, Caffarra, Kaan, & Van Heuven, 2017). The most common measures of association strength are sketched at the end of this section.

Another novel and interesting application of methodological triangulation between corpus linguistic and psycholinguistic methods is recent research on the cognitive processes associated with reading literary texts (see, e.g. Stockwell & Mahlberg, 2015). Mahlberg, Conklin, and Bisson (2014) triangulate eye-tracking data and corpus frequency data for repeated clusters to answer questions about the psycholinguistic processing of literary texts, showing that repeated clusters are processed significantly faster than surrounding text, especially clusters related to body language.

This section has scarcely scratched the surface of the interesting triangulation research that is taking place at the intersection of psycholinguistics and corpus linguistics. However, we hope we have demonstrated the utility of this type of triangulation and set the stage for related studies included in this volume (see Chapters 8–10).
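For readers unfamiliar with the association measures referred to in this section (and returned to in Chapter 8), the sketch below shows one common way of computing several of them from raw corpus counts. The counts in the usage example are invented for illustration; the formulas are the standard ones built around the observed and expected co-occurrence frequencies of a word pair, not the specific implementation used by any of the studies cited above.

```python
from math import log, log2, sqrt

def association_measures(f_pair, f_node, f_colloc, corpus_size):
    """Association scores for a word pair from corpus frequencies:
    f_pair   = co-occurrences of node and collocate,
    f_node   = frequency of the node word,
    f_colloc = frequency of the collocate,
    corpus_size = total number of word (bigram) positions in the corpus."""
    expected = f_node * f_colloc / corpus_size
    # 2x2 contingency table (observed and expected) for log-likelihood.
    observed = [f_pair,
                f_node - f_pair,
                f_colloc - f_pair,
                corpus_size - f_node - f_colloc + f_pair]
    expect = [expected,
              f_node * (corpus_size - f_colloc) / corpus_size,
              (corpus_size - f_node) * f_colloc / corpus_size,
              (corpus_size - f_node) * (corpus_size - f_colloc) / corpus_size]
    ll = 2 * sum(o * log(o / e) for o, e in zip(observed, expect) if o > 0)
    return {
        "MI": log2(f_pair / expected),
        "MI3": log2(f_pair ** 3 / expected),
        "z-score": (f_pair - expected) / sqrt(expected),
        "t-score": (f_pair - expected) / sqrt(f_pair),
        "log-likelihood": ll,
    }

# Invented counts: a pair occurring 120 times, node 8,000, collocate 5,000,
# in a 10-million-word corpus.
print(association_measures(120, 8_000, 5_000, 10_000_000))
```

Different measures weight raw frequency and exclusivity of co-occurrence differently, which is precisely why studies such as the one reported in Chapter 8 ask which of them best matches human processing evidence.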

Outline of the Book

The first three analysis chapters in the book involve different conceptualisations of discourse analysis, with Chapter 2 drawing on the notion of discourse in terms of text structure, Chapter 3 considering discourse in terms of register, in this case nineteenth-century British news articles about drought, and Chapter 4 taking a critical perspective in which discourses are ways of looking at the world that are expressed through language.

In Chapter 2 Erin Schnur and Eniko Csomay aim to find a way to identify text structure by dividing texts into different segments. Two approaches are taken to segmentation: first, a manual approach, where human raters (via Mechanical Turk) were employed to identify cohesive discourse units from a corpus of transcriptions of academic lectures using an existing functional taxonomy; second, an automatic approach, which used computational techniques to identify Vocabulary-Based Discourse Units (lexically similar segments which reflect a shared topic). Here, the idea is that the lexical differences that establish segment boundaries represent topic shifts (a toy sketch of this kind of similarity-based segmentation is given after the Chapter 3 summary below). The manual approach resulted in the identification of

1,056 text segments while the automatic approach gave just 769. The text segments obtained from the two approaches were considered separately, with the first approach being analysed in terms of the distribution of functional categories across texts. This found, for example, that the majority of discourse units within academic lectures are concerned with the transmission of content-related, factual information to students, although a second type of structural unit was centred around class management. The automatically segmented texts were then subjected to a multifactorial analysis to identify clusters of text segments that had similar linguistic characteristics. This led to the identification of three linguistic dimensions which could be labelled as (i) conceptual information focus vs. contextual directive orientation; (ii) personalized framing (or lack of); and (iii) teacher monologue or interactive dialogue. The ‘poles’ of each dimension contained corresponding sets of correlating linguistic features; for example, teacher monologue involved a relatively long teacher turn length while interactive dialogue involved higher use of one-word turns and more use of discourse particles like yeah and well. The two analyses each offered a different but complementary understanding of text segmentation, helping to account for one another’s weaknesses. The manual approach helped to identify patterns of distribution of functional categories across texts while the automatic approach identified the collocating linguistic units that were associated with discrete segments.

Tony McEnery, Helen Baker, and Carmen Dayrell use triangulation in Chapter 3 in order to carry out historical analysis involving the identification of droughts across the nineteenth century. The traditional way of approaching this endeavor would be to use official records on droughts that were made at the time. However, such records are neither comprehensive nor always accurate, especially before the 1850s, so they can only offer a partial picture. It is to this end that a complementary approach was utilised, involving a corpus of nineteenth-century newspaper articles containing news texts from eight British newspapers. By searching these geo-parsed texts for references to droughts and identifying the years where such references were most frequent, the authors are able to produce additional supporting evidence to warrant the accuracy of the drought records. In a number of cases, though, the analysis of news reports resulted in mentions of droughts that had not been recorded, indicating at least seven additional periods of drought in various parts of Britain that were not on record. The study thus shows an original use for corpus data as part of triangulatory research for historians.
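As promised above, the following is a toy sketch of vocabulary-based segmentation of the general kind used for the automatic analysis in Chapter 2: the vocabulary of adjacent stretches of text is compared, and positions where lexical overlap drops sharply are treated as candidate unit boundaries. The window size, threshold, and function names are arbitrary choices made for this illustration, not the actual procedure or parameters reported by Schnur and Csomay.

```python
import re
from collections import Counter
from math import sqrt

def tokenize(text):
    # Lowercase word tokens; a fuller implementation would also remove stop words.
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    # Cosine similarity between two word-frequency vectors held in Counters.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def candidate_boundaries(tokens, window=100, threshold=0.10):
    """Return token positions where the vocabulary of the preceding window
    overlaps only weakly with the vocabulary of the following window."""
    boundaries = []
    for i in range(window, len(tokens) - window + 1, window):
        before = Counter(tokens[i - window:i])
        after = Counter(tokens[i:i + window])
        if cosine(before, after) < threshold:
            boundaries.append(i)
    return boundaries
```

Plotting the similarity values across a whole lecture yields a vocabulary profile of the general sort shown in Figure 2.4, with the deepest valleys marking the strongest candidate boundaries between discourse units.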

Paul Baker considers triangulation of corpus linguistics within the realm of critical discourse studies (Chapter 4). In order to examine the representation of obesity in a corpus of British newspaper articles he employed an approach involving an analysis of expanded concordance lines, analysing a selection of the strongest collocates of the terms obese and obesity. However, he also carried out a close reading of several sets of samples of articles. One sample was taken at random while three others were based on more principled, replicable techniques: articles with the most references to obesity, those occurring in the month where obesity was discussed most often, and those containing the highest proportion of keywords in the corpus as a whole when compared against a reference corpus. Findings were identified and then compared across the different methods of analysis with the aim of identifying combinations of techniques which were most likely to be complementary in terms of eliciting the greatest number of findings as well as findings that were different from one another. The results indicated that combining the corpus approach with one or two sets of text samples was an effective way of addressing different research questions. For example, the corpus analysis performed well in terms of identifying representations of obese people, while analysing the sample of articles containing the most keywords showed how articles referred to the consequences of obesity. In order to account for the possibility that the order in which the five sets of analyses were carried out influenced their outcomes, all approaches were carried out three times, with a consequence of this being that the initial research question was both refined and expanded into a set of smaller, more specific questions. Another potential consequence of combining approaches, then, is to generate questions that would not necessarily have been asked at the outset.

Chapters 5, 6 and 7 turn to studies that can be broadly classed under the term applied linguistics. Chapter 5, authored by Geoffrey LaFlair, Shelley Staples, and Xun Yan, considers a different field of linguistics, that of language testing. The authors use a corpus of essays that were produced by people taking an English language exam, which had been categorised according to given levels of ability. A multidimensional analysis was carried out on the corpus, using 41 pre-identified features that were based on the results of previous research on writing development, and indicating five dimensions (or sets of features that correlated with one another) relating to linguistic ability. A second analysis used a corpus of transcriptions from speaking tests with a different set of linguistic features, this time finding three dimensions. Comparing the results of the analysis against the rubrics used to score these tests, the authors found some mismatches; for example, the written test rubric only scored on three categories while the multidimensional analysis found five distinct sets of linguistic features were being used by writers. The study suggests that corpus approaches could be effectively employed to improve language test rubrics. However, the authors note that the rubric itself was used to group test takers by proficiency level, and that without this ‘external method’ the linguistic insights afforded by the corpus comparison of features across different levels would not have been possible. The multidimensional analysis therefore confirms that parts of the rubric

were working but also suggests more fine-grained distinctions, providing a good example of how a corpus linguistics approach is able to offer additional perspectives on criteria for test scoring.

In Chapter 6, Dana Gablasova uses a word association task to ascertain vocabulary acquisition differences between native and non-native speakers. A group of Slovak students were asked to read two texts which contained 12 target words that were defined in the texts. Half of the texts were in their native language whereas the other half were in English. Students were then presented with the target words on a computer screen and asked to provide the associations that occurred to them. Gablasova considered the number of words that students produced as well as their semantic relationship to the target word (whether they were from a superordinate category or not), and she also compared responses to abstract words and concrete words. Perhaps as expected, students produced more word associations for the words in their native language, although this was not statistically significant. Concrete words were found to be more likely to produce superordinate elicitations whereas this was not the case for abstract words. By conducting searches in a reference corpus, Gablasova was able to obtain frequency information regarding the same words, using this information in order to correlate frequency with the number and type of words that the students elicited. She found that students provided more word associations for frequently occurring words, and that the words they gave tended to be frequent themselves. She also found that individuals who provided more words in the L1 condition were more likely to produce higher numbers of words in the L2 condition, suggesting evidence for individual differences in response behaviour. The use of frequency information derived from a corpus brings a new dimension to word association studies, helping to explain the findings, indicating that when people hear a word, there appears to be competition between superordinate words and high-frequency words in terms of the types of associations that are triggered.

In Chapter 7 Jesse Egbert and Mark Davies are interested in identifying semantic change in noun+noun sequences over time. A large number of such noun sequences were identified in the Corpus of Historical American English (COHA). A classification rubric consisting of 12 semantic categories was produced in order to identify the different sorts of semantic relationships between the nouns in the sequence (e.g. for the sequence brass button, the second noun is implied to be made from the first noun, so this was classed as Composition). In order to quantify diachronic change in the types of noun+noun sequences that are being used, it was necessary to carry out the semantic classification of over 1,500 such sequences. To this end the authors used Mechanical Turk workers (N = 255) to carry out the categorisations, with each sequence categorised independently by eight coders. Inter-rater categorisation agreement was stronger for some categories than others, and the authors correlated

Introduction  15 the results with the frequencies of different category types over different time periods in the COHA, finding that the more frequently occurring sequences in the COHA were more likely to elicit inter-rater agreement and that this relationship was more pronounced for the more recent time periods in COHA. Their analyses of the frequency trajectories of different semantic categories over time concluded with the hypothesis that noun+noun sequences are moving from concrete to abstract meanings. The study indicates how corpus and non-corpus methods can be utilised at different points in a particular sequence across a research study in order to reach an end goal. The final three studies in this book all involve triangulations of corpus linguistics and psycholinguistics and while they each investigate different linguistic phenomena and use different techniques to measure human responses to them, they all involve a similar aim—to compare the extent to which corpus linguistic techniques are able to identify the same kind of features that are recognised by human brains as being particularly frequent or salient in some way. These studies draw on large reference corpora as a way of obtaining examples that are then presented to human subjects, with different forms of measurement employed to gauge the ease or trouble with which they are processed. An overarching theory involves priming—the idea that repeated exposure to particular patterns over time will result in cognitive short-cuts, or recognitions of patterns, and thus reduced processing times—where the brain expects to encounter certain words in certain contexts. As a result, all of these chapters deal with different kinds of word sequences (clusters) or combinations (collocates), and attempt to measure either the effect of repeated exposure to the entire sequence or the effect that the first part of a sequence has in terms of prompting a particular response. In Chapter 8 Jen Hughes and Andrew Hardie investigate collocation, a phenomenon central to many studies of corpus linguistics, denoting a systematically close co-occurrence relationship between two words. However, there has been little work aimed at identifying which of the many algorithms designed to identify collocates actually represents the way that the brain processes language. Hughes and Hardie’s study first uses corpus methods to identify sets of potential collocates and then draws on techniques from experimental psycholinguistics, namely electroencephalography, which involves placing electrodes on the scalp to measure brain activity (known as event-related potentials or ERPs). Their subjects are given pairs of collocates and non-collocates to read on a screen in sequence and electrical brain activity is measured in order to determine which pairs of words trigger the strongest neurological reactions. Once they have established that people’s brains do indeed react in a measurable way when they encounter collocates, the authors return to corpus techniques to ascertain how the various pairs of words relate to a range of different collocational algorithms. Finally, they correlate

16 Introduction these scores with the extent of brain activity that each pair of words elicited. Their results indicate that traditional ­collocational ­measures like log-likelihood and mutual information do not seem to match neurological recognition of word pairs as well as others like Z-score and MI3. The chapter thus indicates how experimental ­psycholinguistics research can help to validate and refine commonly used corpus linguistics techniques. Stefan Gries (Chapter  9) collates the findings of experimental data obtained from learners of English with that of a corpus of native speaker English in order to investigate the extent and manner to which learners exhibit a priming effect on their choice of verb constructions. Subjects were asked to complete fragments of sentences containing verbs like give, hand, lend, loan, post, send and show, although some of the fragments that they were given were designed to bias them into completing sentences in a particular way (either to use a ditransitive verb or a prepositional dative). A generalized linear mixed-effects model indicates that learners were indeed triggered to use certain constructions if they were exposed to them elsewhere in the task. In this study, Gries successfully triangulates verb co-occurrence distributions with data from sentence completion tasks to show that non-native speakers’ constructional choices are strongly influenced by structural priming effects. In Chapter 10, another way of measuring human response to linguistic phenomena involved calculating the VOT (voice onset time) or length that it took for subjects to read a word. Reading time offers a window onto language processing, with more difficult processing tasks resulting in longer reading times. In an experiment designed by Nick Ellis to examine processing of abstract Verb Argument Constructions, participants were presented with pairs of words and recorded reading them aloud. The first word was always a verb (get, add, vote etc), while the second was a preposition (against, between, for etc). These pairs had been elicited from the British National Corpus and they had also been categorised in terms of their frequencies, lengths (in terms of letters), contingency, and semantic prototypicality. Ellis found that all of these factors had an effect on the length of time it took for people to vocalise the second word. In summary, Ellis triangulated VOT with frequency, length, contingency, and semantic prototypicality to reveal a strong psycholinguistic effect of these lexical variables. In the concluding chapter we attempt to synthesize the findings of the previous nine chapters in order to provide a wider understanding of the extent to which non-CL research methods can complement CL methods to enhance our understanding of linguistic processes and variation. After comparing the research questions, methods, and results of the nine studies in this volume we then describe the strengths and limitations of the combinations of methods used and explore the degree to which the triangulation between the various CL and non-CL methods have been

Introduction  17 successful. This is explored both in terms of a synthesis of reflective comments made by the authors regarding methodological triangulation in their studies, as well as a qualitative investigation of the success of the triangulation by the editors. The chapter then enumerates the contributions of this volume to several communities of scholars, including corpus linguists, psycholinguists, discourse analysts, descriptive linguists, and language researchers outside of mainstream linguistics. The book ends with our reflections on the challenges and promise of methodological triangulation in the future of (corpus) linguistic research and research in the social sciences more generally.

References Aijmer, K. (2015). Analysing discourse markers in spoken corpora: Actually as a case study. In P. Baker, & T. McEnery (Eds.), Corpora and discourse studies (pp. 88–109). London: Palgrave Macmillan. Altrichter, H., Posch, P., & Somekh, B. (1996). Teachers investigate their work: An introduction to the methods of action research. London: Routledge. Arppe, A., & Järvikivi, J. (2007). Every method counts: Combining corpus-based and experimental evidence in the study of synonymy. Corpus Linguistics and Linguistic Theory, 3(2), 131–159. Baker, P. (2015). Does Britain need any more foreign doctors? Inter-analyst consistency and corpus-assisted (critical) discourse analysis. Corpora, grammar and discourse: In honour of Susan Hunston (pp. 283–300). Amsterdam: John Benjamins. Baker, P., & Egbert, J. (2016). Triangulating methodological approaches in corpus-linguistic research. New York: Routledge. Baker, P., Gabrielatos, C., Khosravinik, M., Krzyżanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273–306. Baker, P., Gabrielatos, C., & McEnery, T. (2013). Discourse analysis and media attitudes: The representation of Islam in the British press. Cambridge: Cambridge University Press. Baker, P., & Levon, E. (2015). Picking the right cherries? A comparison of corpusbased and qualitative analyses of news articles about masculinity. Discourse & Communication, 9(2), 221–236. Balota, D. A., & Chumbley, J. L. (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology: Human Perception and Performance, 10, 340–357. Bannard, C., & Matthews, D. (2008). Stored word sequences in language learning: The effect of familiarity on children’s repetition of four-word combinations. Psychological Science, 19, 241–248. Baroni, M., Guevara, E., & Zamparelli, R. (2009). The dual nature of deverbal nominal constructions: Evidence from acceptability ratings and corpus analysis. Corpus Linguistics and Linguistic Theory, 5(1), 27–60. Bednarek, M., & Caple, H. (2012). News discourse. London: Bloomsbury.

18 Introduction Bekhet, A. K., & Zauszniewski, J. A. (2012). Methodological triangulation: An approach to understanding data. Nurse Researcher, 20(2), 40–43. Biber, D., Gray, B., & Staples, S. (2014). Predicting patterns of grammatical complexity across language exam task types and proficiency levels. Applied Linguistics, 37(5), 639–668. Biber, D., Reppen, R., & Staples, S. (2017). Exploring the relationship between TOEFL iBT scores and disciplinary writing performance. TESOL Quarterly, 51(4), 948–960. Bormuth, J. R. (1969). Development of readability analyses (Final Report, Project No. 7-0052, Contract No. OEC-3-7-070052-0326). Retrieved from ERIC database. Brindle, A. (2016). The language of hate: A corpus linguistic analysis of white supremacist language. London: Routledge. Cohen, L., Manion, L., & Morrison, K. (2000). Research methods in education. London: Routledge. Cunningham, J. W., Hiebert, E. H., & Mesmer, H. A. (2018). Investigating the validity of two widely used quantitative text tools. Reading and Writing, 31(4), 813–833. Cushing, S. T. (2017). Corpus linguistics in language testing research. Language Testing, 34(4), 441–449. Dale, E., & Tyler, R. W. (1934). A study of the factors influencing the difficulty of reading materials for adults of limited reading ability. The Library Quarterly: Information, Community, Policy, 4, 384–412. Degand, L.,  & Simon, C. A. (2009). On identifying basic discourse units in speech: Theoretical and empirical issues. Discours, 4. Open Access. Retrieved from http://discours.revues.org/5852 Demberg, V., & Keller, F. (2008). Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109, 193–210. Denzin, N. (1970). The research act in sociology. Chicago: Aldine. Denzin, N. K., & Lincoln, Y. S. (1994). Handbook of qualitative research. Beverley Hills, CA: Sage Publications. Diogenes, L. (2018). Lives of eminent philosophers. Retrieved August 15, 2018, form www.iep.utm.edu/thales/ Dolch, E. W. (1948). Graded reading difficulty. Problems in reading (pp. 229– 255). Champaign, IL: The Garrard Press. Egbert, J. (2017). Corpus linguistics and language testing: Navigating uncharted waters. Language Testing, 34(4), 555–564. Ellis, N. C. (2002). Frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 143–188. Ellis, N. C. (2012). What can we count in language, and what counts in language acquisition, cognition, and use. In S. Th. Gries & D. Divjak (Eds.), Frequency effects in language learning and processing 1 (pp.  7–34). Berlin: Walter de Gruyter. Ellis, N. C. (2017). Cognition, corpora, and computing: Triangulating research in usage-based language learning. Language Learning, 67(S1), 40–65. Ellis, N. C., & Simpson-Vlach, R. (2009). Formulaic language in native speakers: Triangulating psycholinguistics, corpus linguistics, and education. Corpus Linguistics and Linguistic Theory, 5(1), 61–78. Fielding, N. G., & Fielding, J. L. (1986). Linking data. Beverley Hills, CA: Sage Publications.

Introduction  19 Forster, K. I. (1976). Accessing the mental lexicon. In E. C. T. Walker & R. Wales (Eds.), New approaches to language mechanisms (pp. 257–287). Amsterdam: North-Holland. Foucault, M. (1972). The archaeology of knowledge. London: Tavistock. Gilquin, G., Bestgen, Y., & Granger, S. (2016). Assessing the CEFR assessment grid for spoken language use: A learner corpus-based approach. 37th International Computer Archive of Modern and Medieval English Conference (ICAME 37). Gilquin, G., & Gries, S. T. (2009). Corpora and experimental methods: A stateof-the-art review. Corpus Linguistics and Linguistic Theory, 5(1), 1–26. Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualitative research. Chicago: Aldine. Gries, S. T. (2006). Corpus-based methods and cognitive semantics: The many senses of to run. In S. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in cognitive linguistics: Corpus-based approaches to syntax and lexis (pp.  57–99). Berlin: Mouton de Gruyter. Gries, S. Th., & Wulff, S. (2005). Do foreign language learners also have constructions? Evidence from priming, sorting, and corpora. Annual Review of Cognitive Linguistics, 3, 182–200. Harvey, K. (2013). Investigating adolescent health communication: A corpus linguistics Approach. London: Bloomsbury. Janssen, N., & Barber, H. (2012). Phrase frequency effects in language production. PLoS One, 7(3), e33202. http://dx.doi.org/10.1371/journal.pone.0033202 Janssen, N., & Caramazza, A. (2011). Lexical selection in multi-word production. Frontiers in Language Sciences, 2, 81. Jeffries, L., & Walker, B. (2017). Keywords in the press: The New Labour years. London: Bloomsbury. Kirsner, K. (1994). Implicit processes in second language learning. In N. Ellis, (Ed.), Implicit and explicit learning of languages (pp.  283–312). San Diego, CA: Academic Press. Kyle, K., & Crossley, S. (2017). Assessing syntactic sophistication in L2 writing: A usage-based approach. Language Testing, 34(4), 513–535. LaFlair, G. T.,  & Staples, S. (2017). Using corpus linguistics to examine the extrapolation inference in the validity argument for a high-stakes speaking assessment. Language Testing, 34(4), 451–475. LaFlair, G. T., Staples, S., & Egbert, J. (2015). Variability in the MELAB Speaking Task: Investigating linguistic characteristics of test-taker performances in relation to rater severity and score. CaMLA Working Papers. Layder, D. (1993). New strategies in social research. Cambridge: Polity Press. Lu, X. (2017). Automated measurement of syntactic complexity in corpus-based L2 writing research and implications for writing assessment. Language Testing, 34(4), 493–511. Mahlberg, M., Conklin, K., & Bisson, M. J. (2014). Reading Dickens’s characters: Employing psycholinguistic methods to investigate the cognitive reality of patterns in texts. Language and Literature, 23(4), 369–388. Marchi, A., & Taylor, C. (2009). If on a winter’s night two researchers . . . A challenge to assumptions of soundness of interpretation. Critical Approaches to Discourse Analysis across Disciplines, 3(1), 1–20. McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.

20 Introduction McGee, I. (2009). Adjective-noun collocations in elicited and corpus data: Similarities, differences, and the whys and wherefores. Corpus Linguistics and Linguistic Theory, 5(1), 79–103. Molinaro, N., & Carreiras, M. (2010). Electrophysiological evidence of interaction between contextual expectation and semantic integration during the processing of collocations. Biological Psychology, 83(3), 176–190. Newby, H. (1977). In the field: Reflections on the study of Suffolk farm workers. In C. Bell & H. Newby (Eds.), Doing sociological research (pp. 109–29). London: Allen and Unwin. Nordquist, D. (2009). Investigating elicited data from a usage-based perspective. Corpus Linguistics and Linguistic Theory, 5(1), 105–130. O’Donoghue, T., & Punch, K. (2003). Qualitative educational research in action: Doing and reflecting. London: Routledge. Paquot, M. (2018). Phraseological competence: A  useful toolbox to delimitate CEFR levels in higher education? Insights from a study of EFL learners’ use of statistical collocations. Language Assessment Quarterly, 15, 29. Partington, A., Duguid, A., & Taylor, C. (2013). Patterns and meanings in discourse: Theory and practice in corpus-assisted discourse studies (CADS). Amsterdam: John Benjamins. Ringbom, H. (1998). Vocabulary frequencies in advanced learner English: A cross-linguistic approach. In S. Granger, (Ed.), Learner English on computer (pp. 41–52). London: Longman. Risjord, M., Moloney, M., & Dunbar, S. (2001). Methodological triangulation in nursing research. Philosophy of the social sciences, 31(1), 40–59. Römer, U. (2017). Language assessment and the inseparability of lexis and grammar: Focus on the construct of speaking. Language Testing, 34(4), 477–492. Sampson, G. (2013). The empirical trend: Ten years on. International Journal of Corpus Linguistics, 18(2), 281–289. Silverman, D. (1985). Qualitative methodology and sociology: Describing the social world. Brookfield, VT: Gower. Siyanova-Chanturia, A., Conklin, K., Caffarra, S., Kaan, E., & Van Heuven, W. J. (2017). Representation and processing of multi-word expressions in the brain. Brain and language, 175, 111–122. Siyanova-Chanturia, A., Conklin, K., & Schmitt, N. (2011). Adding more fuel to the fire: An eye-tracking study of idiom processing by native and nonnative speakers. Second Language Research, 27, 251–272. Siyanova-Chanturia, A., Conklin, K., & van Heuven, W. J. B. (2011). Seeing a phrase ‘time and again’ matters: The role of phrasal frequency in the processing of multi-word sequences. Journal of Experimental Psychology: Language, Memory, and Cognition, 37(3), 776–784. Siyanova-Chanturia, A., & Spina, S. (2015). Investigation of native speaker and second language learner intuition of collocation frequency. Language Learning, 65(3), 533–562. Sosa, A., & MacFarlane, J. (2002). Evidence for frequency-based constituents in the mental lexicon: Collocations involving the word of. Brain and Language, 83, 227–236. Staples, S., LaFlair, G. T., & Egbert, J. (2017). Comparing language use in oral proficiency interviews to target domains: Conversational, academic, and professional discourse. The Modern Language Journal, 101(1), 107–272.

Introduction  21 Stockwell, P., & Mahlberg, M. (2015). Mind-modelling with corpus stylistics in David Copperfield. Language and Literature, 24(2), 129–147. Stubbs, M. (1983). Discourse analysis: The sociolinguistic analysis of natural language. Chicago: University of Chicago Press. Thurmond, V. A. (2001). The point of triangulation. Journal of Nursing Scholarship, 33(3), 253–258. Tremblay, A., Derwing, B., Libben, G.,  & Westbury, C. (2011). Processing advantages of lexical bundles: Evidence from self-paced reading and sentence recall tasks. Language Learning, 61(2), 569–613. Tremblay, A., & Tucker, B. V. (2011). The effects of N-gram probabilistic measures on the recognition and production of four-word sequences. The Mental Lexicon, 6(2), 302–324. Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (1966). Unobtrusive measures. Chicago: Rand McNally. Wulff, S. (2009). Converging evidence from corpus and experimental data to capture idiomaticity. Corpus Linguistics and Linguistic Theory, 5(1), 131–159. Xi, X. (2017). What does corpus linguistics have to offer to language assessment? Language Testing, 34(4), 565–577.

2 Triangulating Text Segmentation Methods with Diverse Analytical Approaches to Analyzing Text Structure Erin Schnur and Eniko Csomay Introduction Researchers interested in examining issues of text structure have often had difficulty reconciling this goal with corpus-based techniques. Corpus linguistic studies often rely on the use of corpora, sub-corpora, or texts as the unit of analysis, in part because corpus tools are well-equipped for this type of analysis. Similarly, existing tools (such as concordancing software packages), enable easy examination of words or groups of words and their co-text, but are generally ill-equipped to examine other units of analysis within the text, or intra-textual features that span larger sections of text. As a result, researchers have relied on non-corpus methodologies, such as manual coding, in order to answer research questions related to text structure. This chapter presents the triangulation of two text segmentation methods (manual and automatic) and two analytical approaches to text analysis (qualitative and quantitative) with the goal of analyzing a corpus of monologic academic lectures for their discourse structure. The first approach to both segmentation and analysis (manual segmentation and qualitative functional coding) relies primarily on meaning (e.g. content, context, and coherence) in arriving at conclusions about text structure. The second approach (automatic segmentation and quantitative analysis) considers linguistic features (e.g. lexical items), their frequency, and their distribution throughout texts in order to investigate text structure. Our goal in this chapter is to investigate the extent to which results from these two methodologies complement each other in contributing to our understanding of the structure of monologic academic lectures. The manual text segmentation method employed human raters through a crowdsourcing platform, and the automatic text segmentation illustrates the design and implementation of a computer program that segments text into lexically coherent units. The units identified with the manual text segmentation method were then analyzed further with a qualitative analytical approach where a taxonomy of functional categories was used to identify the pedagogical purposes of each segment. The

Triangulating Text Segmentation Methods  23 units identified automatically were then analyzed further with a quantitative analytical approach where a multi-dimensional framework (Biber, 1988) was applied to provide a comprehensive linguistic analysis of each segment. Through triangulating these text segmentation methods and analytical frameworks, we were able to study discourse structure “from two [or more] perspectives” and “anchor findings in more robust interpretations and explanations” (Baker & Egbert, 2016, p. 4). We arrived at complementary findings through our results, making a strong case for a combined approach to an all-inclusive analysis of discourse structure. The next section discusses manual and computational approaches to text segmentation, including the advantages and challenges of each. The third section presents the steps for each of the two new methodologies (one manual and one automatic segmentation) and introduces two casestudies. The case studies illustrate how these methods can be applied to corpus-based analyses and can provide insights into different aspects and varying characteristics of text structure. The final section reviews the benefits of each of the methods.

Methods of Investigating Text Structure Studies of discourse organization often take one of two approaches which are, in key aspects, opposite. The first begins with linguistic features of interest and uses those features as a starting place for investigations of text structure (e.g. Allison & Tauroza, 1995; Basturkmen, 2007; Olsen  & Huckin, 1990). In other words, this approach uses linguistic characteristics to identify structure. Corpus linguistic methodologies lend themselves to this type of investigation of text structure, as existing corpus tools, such as various concordancing software packages (e.g. AntConc, Wordsmith Tools), are well-suited to finding and analyzing words (or multi-word units) of interest. This method is useful when conducting top-down analysis of text structure. That is, when lexico-grammatical features that reveal information about discourse organization or text structure have been previously established, those features can be analyzed in terms of their function, use, frequency, and co-text. However, this method is limited in its ability to identify or describe structural or organizational units within texts. The second method of investigating structure begins by identifying structural units, then investigating the characteristics of those units in order to describe text structure. This approach uses structural elements to identify textual features—in other words, it first seeks to identify the discourse units that serve as structural ‘building blocks’ of a text, and then seeks to characterize them in terms of function or lexico-grammatical features (Csomay, 2005). Passonneau and Litman (1997) note that few studies examining large numbers of texts have taken the second approach to investigating text structure, particularly when working with spoken

24  Triangulating Text Segmentation Methods language. However, this approach allows for the bottom-up characterization of text structure and examination of the relationships between structural elements across entire texts. The primary methodological challenge of this segmentation-based approach is the preliminary operationalization and identification of discourse units within texts. Some text types present pre-existing discourse units that can be used as a starting point for this type of analysis. For example, texts within many written registers contain paragraphs, and some spoken text types contain speaker turns. Depending upon the research questions being asked, these units may sufficiently serve as a basis for structural investigation. Other text types are more problematic in this regard. For example, the highly monologic academic lectures of interest in the case studies presented in this chapter do not contain paragraphs or a usefully abundant number of speaker turns to aid the identification of meaningful discourse units. Researchers wishing to combine a segmentation-based approach to investigating text structure with corpus methods or to carry out such research on a large body of texts face the challenge of initial text segmentation. In addition to manual text segmentation methods, in recent years interest has grown in automated (computational) methods of text segmentation. The following sections provide a brief overview of each, along with a discussion of benefits and challenges. Manual Segmentation Manual segmentation methods rely on human raters to read texts and to identify cohesive discourse units within them. Researchers have evaluated human performance in text segmentation tasks when working with a variety of text types. In a 1993 study, Hearst asked raters to identify paragraph boundaries in science articles where all section markers had been removed, and reported on over 80% agreement between raters. Flammia and Zue (1995) asked raters to segment telephone dialogues by identifying topic boundaries. This differed from previous studies both by evaluating the utility of topic shifts as a basis for segmentation and by working with spoken discourse rather than written. They report that when working with spoken dialogue, which is in many ways messier than written discourse, human raters were able to agree on the segmentation of discourse units approximately 60% of the time. Also working with spoken discourse, Passonneau and Litman (1997) used a common-sense definition of communicative intention as the criteria for identifying distinct text segments and had naïve raters segment texts. They found that the notion of communicative intention was understood well enough by untrained raters to result in highly significant agreement. In sum, results of this line of research indicate that segmentation of texts by human raters can be a reliable method of identifying discourse units in order to

Triangulating Text Segmentation Methods  25 investigate discourse structure. However, when working with a large body of texts, manual segmentation presents two primary challenges: the time and resources required, and the difficulty in controlling the level of granularity with which texts are segmented. Manual segmentation requires the close reading of texts and is a timeconsuming process. This can be prohibitive when working with large bodies of texts. Multiple raters can reduce the time needed to segment a large number of words/texts, although recruitment and training of raters presents its own time demands. In addition, the use of a greater number of raters raises concerns about standardization and reliability in the segmentation process. Although research on manual text segmentation generally reports acceptably high agreement among raters, researchers have noted difficulties in standardizing the level of granularity with which raters segment texts, particularly when segmentation is based on meaning (Fournier, 2013; Malioutov & Barzilay, 2006). Granularity is defined here as how fine-grained a rater’s segmentation of a text is; in other words, whether the text is segmented into fewer, longer segments, or more finely by identifying many, short segments. While some types of segmentation, such as segmentation into speaker turns, phrases, or orthographic paragraphs require less subjective decision making on the part of raters, segmentation based on meaning, considering coherence, context, topic, and function, requires raters to make subjective judgements about the relative shifts of these variables within a text. Raters are likely to reach different decisions about which shifts in meaning are significant enough to merit segmentation, and standardization of these subjective decisions across multiple raters is challenging. One method used by researchers to control for differing levels of granularity in rater coding has been to identify boundaries based on majority agreement. This approach anticipates that differences in granularity will impede rater agreement and asks multiple raters to segment the same texts. Possible segment boundaries are evaluated based on the percentage of raters who identify them, and only boundaries identified by the majority of raters are used in the final text segmentation (Hearst, 1993; Passonneau & Litman, 1997). Automatic Segmentation Computational text segmentation methods avoid the issues of recruiting raters, conduct segmentation in a timely manner, and control for reliability by using mathematical algorithms to determine the location of segment boundaries within a text (Pevzner  & Hearst, 2002). Hearst’s seminal TextTiling algorithm (1997) provides the basis for much current work on segmentation algorithms. TextTiling relies on lexical distribution and co-occurrence to identify changes in texts as they, for example, relate to topic shifts, and this use of lexical data has been refined and

26  Triangulating Text Segmentation Methods expanded upon by additional segmentation algorithms within the fields of applied linguistics (Csomay, 2002, 2005; Biber, Connor,  & Upton, 2007) and computational linguistics (e.g. Boguraev & Neff, 2000; Heinonen, 1998). Much of this research, especially as it relates to spoken texts in general (Pevzner & Hearst, 2002; Webber, 2012) and academic lectures in particular (Glass et al., 2007; Lin, Chau, Cao, & Jr, 2005), has been conducted within the fields of computational linguistics and computer science. In addition to answering other research questions, many of these studies look at how automatic segmentation aligns with human raters. In a 1997 study, Passonneau and Litman evaluated three algorithms (noun phrases, cue words, pauses) for automatic text segmentation. They report that no algorithm tested performed as well as humans, although an increase in the number of linguistic features used by the algorithm to identify text boundaries consistently improved performance. Automated segmentation methods are less frequently used in the field of applied linguistics; however, a few researchers have applied these techniques to linguistic studies. This includes Berber Sardinha (2001), who developed and tested an automatic segmentation procedure called Link Set Median (LSM). This method considers lexical cohesion between all sentences in a text in order to identify topic boundaries and was tested on three corpora comprising written texts (research articles, business reports, and encyclopedia articles). In this study, Berber Sardinha reports that the LSM method outperformed random segmentation significantly but did not perform as well as Hearst’s (1997) TextTiling algorithm. Finally, perhaps the most prolific line of corpus research in applied linguistics using automatic text segmentation has come from Biber and Csomay’s VBDU technique, which is based on Hearst’s TextTiling algorithm and is discussed in detail in the next section.

Methodology For this project, a corpus of authentic academic lectures was segmented manually, using a crowdsourcing platform, and automatically, to showcase lexical patterns in text. Resulting segments were then coded for function and analyzed linguistically. In this section, first, we describe the corpus we used. Second, we introduce the manual and the automatic segmentation techniques. Finally, we introduce two case studies illustrating the types of text structure analysis that could be done after segmentation. Corpus Description All texts included in this study are transcripts of authentic university lectures, gathered from two corpora: the TOEFL 2000 Spoken and Written Academic Language corpus (T2K-SWAL) (Biber, Reppen, Clark, &

Table 2.1  Authentic Lectures Corpus Composition

Source        Number of Files    Number of Words
T2K-SWAL                   18            104,940
MICASE                      6             55,474
TOTALS:                    24            160,414

Walter, 2004) and the Michigan Corpus of Academic Spoken English (MICASE) (Simpson, Briggs, Ovens,  & Swales, 2002). Both corpora were designed to include a representative sample of naturally occurring academic discourse, recorded in American universities and transcribed. Files used in this project represent a subset of files from the two parent corpora and met two criteria: (i) lectures were delivered in lower-level, undergraduate courses; and (ii) lectures contain a low level of interactivity. The corpus consists of 24 files and 160,414 words. A description of the corpus, including the source corpus, the number of files, and the total number of words, is presented in Table 2.1. Manual Segmentation Introducing Mechanical Turk Amazon’s Mechanical Turk (MTurk) is an internet crowdsourcing platform for hiring workers to carry out small tasks. The site was introduced in 2005 and was originally intended for small jobs that cannot yet be satisfactorily accomplished by a computer (e.g. audio transcription, market research). Requesters (employers) post Human Intelligence Tasks (HITs), or self-contained jobs. Workers (employees) browse a list of available HITs, select those they would like to complete, and are paid upon completion. Although MTurk was originally developed by Amazon as a means for businesses to outsource certain types of tasks, it has grown in popularity with researchers in a variety of disciplines. MTurk has been used by researchers in diverse fields such as psychology (Goodman, 2013) medicine (Ranard et al., 2014), and biology (Rand, 2012) in order to gather research participants and accomplish data processing tasks (e.g. data cleaning, verification, information collection). MTurk has many benefits for researchers. It provides access to a large and diverse worker population; in 2011 it was estimated that the site had more than 500,000 registered workers from 190 countries. MTurk provides a streamlined participant payment system, and HITs are generally completed quickly, and are relatively inexpensive. Within the social sciences, MTurk’s use by researchers has been increasing since 2010 (Buhrmester, Kwang, & Gosling, 2011). Linguists

28  Triangulating Text Segmentation Methods have used MTurk to investigate acceptability judgements (Sprouse, 2011) and linguistic characteristics of humor (Kao, Levy, & Goodman, 2013). However, despite the potential applicability of MTurk for general text processing, few researchers within corpus linguistics have utilized crowdsourcing. One notable exception is the work of Biber, Egbert, and Davies on web registers (2015), which engaged MTurk workers to classify internet texts into register categories. MTurk workers identified the register of 53,424 texts. The researchers note that “it would not have been feasible to code a corpus of this size (with four raters per document) without the aid of a resource like Mechanical Turk,” (p.  18). They found that, in addition to allowing for quick access to a large workforce, the quality of coding done by MTurk workers was superior to the work done by student participants, indicating that MTurk is an effective means of completing some types of text processing tasks. For the purposes of this project, MTurk offered a means to overcome the practical challenges of manually segmenting an entire corpus of texts. By providing access to a large number of raters, the corpus was segmented quickly, and multiple raters were hired to segment each text, allowing for the use of a majority agreement segmentation method and controlling for rater granularity in accomplishing the segmentation task. The Manual Text Segmentation Process In order to prevent rater fatigue and keep costs manageable, lecture transcripts were split into smaller sections (approximately 1,400 words) before they were given to MTurk workers. This process had the potential side-effect of introducing artificial discourse unit boundaries at the locations where each long text was split into a shorter text excerpt in order to create a HIT; since each MTurk worker was presented with only an excerpt of a text and would not necessarily see the segment before or after the excerpt, the start and end of the excerpt would become externally imposed segment boundaries. In order to avoid this, the texts were split into overlapping sections as illustrated in Figure 2.1.

[Figure: three text sections of 1,000 words each, with 200 words of overlapping text shared between neighbouring sections (Segments 1–3).]

Figure 2.1  Manual Text Segmentation
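A minimal sketch of how this splitting scheme might be implemented is given below. It assumes the 1,000-word blocks and 200-word overlaps shown in Figure 2.1 and detailed in the next paragraph; it is written in Python for illustration, whereas the processing scripts used in the study were written in Perl, and the function name is invented.

```python
def split_into_excerpts(words, block_size=1000, overlap=200):
    """Split a tokenised transcript into overlapping excerpts for rating.

    Each excerpt holds one 1,000-word block plus up to 200 words of shared
    context on either side, so that excerpt edges are not forced to become
    segment boundaries.
    """
    excerpts = []
    for start in range(0, len(words), block_size):
        left = max(start - overlap, 0)
        right = min(start + block_size + overlap, len(words))
        excerpts.append(words[left:right])
    return excerpts


if __name__ == "__main__":
    transcript = [f"w{i}" for i in range(6500)]  # stand-in for a lecture transcript
    for n, excerpt in enumerate(split_into_excerpts(transcript), start=1):
        print(f"excerpt {n}: {len(excerpt)} words")
```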

Triangulating Text Segmentation Methods  29 Each file contained 1,000 unique words, with 200 overlapping words at the start and end of the text sections. This process of splitting texts into excerpts for the initial, pilot MTurk rating resulted in 151 text excerpts and a total of 302 HIT jobs. In order to allow for the identification of text segment boundaries that met a majority agreement criterion, three raters segmented each text: two MTurk workers and one of the researchers. All MTurk workers were required to pass a qualification test (involving a sample of the segmentation task) and reside in the United States. Although imperfect, this was an attempt to control for English language proficiency. In addition, all raters hired for this project had earned a minimum satisfaction rating of 95% and completed 500 HITs as an MTurk worker. Since MTurk workers are non-experts in text segmentation and/or discourse units, detailed instructions were developed and refined during a pilot study. In the pilot study we asked workers to rely on topic change or functional shifts to identify segment boundaries, but we noted that they found it challenging to separate these two constructs. To obtain better results, we provided the extended context to topic and function, and tested which version of instructions would yield the best results. The instructions that resulted in the most consistent results asked workers to consider topic, function, and context in making a holistic decision about where segment boundaries made sense. Consequently, the final version of the directions equated text segments to paragraphs in writing, and asked raters to rely on their intuition, as a reader, of where paragraph boundaries would make sense in each text and create cohesive segments. The directions also addressed practicalities of working with transcriptions of spoken language, such as accounting for transcription conventions, speaker turns, hesitations, or incomplete thoughts, in an attempt to standardize segmentation where possible. Once HITs were completed, a Perl program was used to identify the segment breaks created by each rater. Breaks were identified by their location in the text using word counts, so that within each text, each segment break identified was labelled with the word count of the word preceding the break (see Figure 2.2). Once the locations of the segment breaks were identified, the results were analyzed for near misses, or segment breaks that were very close to each other but not in precisely the same location. Near misses were defined as instances where raters segmented the text within ten words of each other. Each near miss was examined to determine whether it represented true disagreement between raters (in which case, no change was made) or whether the segment breaks were identifying the same general transition between discourse units—for example, in many cases two raters placed a segment break on opposite sides of hesitation markers, such as uh. In these cases, the ratings were adjusted to represent rater agreement.


BREAKS
rater 1: 85, 258, 339, 346, 369, 425, 505, 561, 656, 725, 769
rater 2: 85, 258, 339, 346, 369, 425, 505, 561, 656, 725, 769, 878
rater 3: 85, 195, 258, 346, 369, 399, 561, 656, 769, 844, 878

and um, just imagine being twenty-two and s- having somebody say to you do you want this job? do you want to travel around, the world essentially, on a boat for five years doing what you’ve always wanted do collecting things writing about things, tasting and smelling and eating these things? and he said sure, i think, i would have gone too. [break 1: 85] um, he was already a really good naturalist, he wasn’t just this baby going out he had been working on his craft and his desire for several years at this point, and his knowledge of natural history and geology probably surpassed most of his contemporaries even though he was just twenty-two. and, they set off on a five year voyage. [break 2: 258] um, his life on the Beagle itself wasn’t that great. it was, this cramped little boat and he had about as much room as, i don’t know maybe, like from here to here. and he was always seasick he was just one of those people that never got over being seasick. so you know he was always throwing up and he never could really sleep well and, you can just imagine how hard it is if you’re seasick trying to read as you’re in your |

Figure 2.2  Segmented Text
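The reconciliation of break positions like those in Figure 2.2 can be pictured with the short sketch below. It assumes the ten-word near-miss window described above and the two-out-of-three majority criterion introduced in the next paragraph; in the study itself near misses were inspected by hand and the programs were written in Perl, so this automatic version, with invented break positions, is an illustration only.

```python
from statistics import median

NEAR_MISS = 10   # breaks within ten words of each other count as the same boundary
MAJORITY = 2     # at least two of the three raters must mark a boundary

def consensus_breaks(ratings):
    """ratings: one list of break positions (word counts) per rater."""
    pooled = sorted((pos, rater) for rater, breaks in enumerate(ratings)
                    for pos in breaks)
    if not pooled:
        return []

    # Greedily cluster break positions that lie within NEAR_MISS words.
    clusters, current = [], [pooled[0]]
    for pos, rater in pooled[1:]:
        if pos - current[-1][0] <= NEAR_MISS:
            current.append((pos, rater))
        else:
            clusters.append(current)
            current = [(pos, rater)]
    clusters.append(current)

    # Keep boundaries marked by a majority of raters, at the median position.
    return [int(median(pos for pos, _ in cluster))
            for cluster in clusters
            if len({rater for _, rater in cluster}) >= MAJORITY]


if __name__ == "__main__":
    ratings = [
        [85, 258, 425, 505],   # rater 1 (invented positions)
        [85, 260, 425, 505],   # rater 2
        [85, 130, 258, 500],   # rater 3
    ]
    print(consensus_breaks(ratings))   # -> [85, 258, 425, 505]
```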

Next, the percentage of agreement for each segment break was calculated. If the agreement was 66% or 100%, then the break was used for final segmentation. Agreement was calculated by a third Perl program, which also opened a clean version of the text and inserted breaks which met the majority agreement criterion. Overlapping sections of text were rated six times (by three raters when the overlapping text occurred at the end of one section, and by three new raters when that same section of text occurred at the start of the next text section). Majority agreement for these sections was calculated based on either set of three raters, rather than on all six raters combined. This was to account for the possibility that a segment break (say, a major shift in topic) might be evident to raters in one context (i.e. having read the preceding 1,000 words) but not in the other context (i.e. reading the following 1,000 words but none of the preceding text). Only 30.68% of segment breaks identified by raters met the criterion of majority agreement. This number may seem low, misleading us into thinking that the segmentation was unsuccessful due to low agreement,

Triangulating Text Segmentation Methods  31 when, in fact, differences in granularity resulting in low agreement were expected. To illustrate, we can consider one 1,400-word section of a lecture transcript. In this section, rater one identified 11 segment breaks, rater two identified 13 segment breaks, and rater three segmented the text at a much more granular level and identified 22 segment breaks. For this section of text, there were 46 segment breaks identified, and only 12, or 26% met the criterion for final segmentation. However, rater three (the most granular rater) identified most of the breaks identified by raters one and two (87.50% of the segment breaks identified by raters one and two were also identified by rater three). Rater three simply also identified ten additional breaks that were not identified by either of the other raters. In this way, raters who segmented the text at a more granular level significantly lowered the agreement percentage. However, for the purposes of segmenting this corpus, by definition segments are identified by shifts in the text that are perceptually salient enough to be recognized by a majority of raters. The majority agreement method of segmentation allowed us to reliably identify 100% of such segments, regardless of how many other segment breaks were also identified and discarded from the final analysis. The final step of data processing involved reassembling the text excerpts (1,200–1,400 words) into complete transcripts. Once texts were reassembled, all texts were read to check for errors resulting from the computationally aided reconstruction of the texts. Applying this segmentation technique, we identified 1,056 text segments. The average number of segments per text was 44, and the average segment length was 151 words. However, since the texts vary greatly in length, it is also useful to consider the number of segments per 1,000 words, which was 6.62. The Automatic Segmentation Process Vocabulary-based Discourse Units (VBDUs) Most automatic segmentation techniques rely on orthographic words, syntactic patterns, or semantic features in a text. The identification of discourse units in the method presented here is based on vocabulary, principally arising from Prince’s (1981) notion that new vocabulary entering into the discourse results in the introduction of new topics. Following this principle, Youmans’ (1991) computer program tracked the vocabulary items newly entering into the discourse, and relying on this data, visually showed the vocabulary profile of a text. While this was an important step, his program was unable to automatically segment discourse into smaller units. Based on Youmans’ initial attempt, Hearst (1994, 1997) developed a computer program, called the TextTiler, that compared windows of texts for vocabulary similarity. More specifically, “the TextTiler

32  Triangulating Text Segmentation Methods compares vocabulary (i.e. orthographic words in transcribed texts) in adjacent segments of a text and calculates a similarity measure” (Csomay & Wu, in press) between the two windows at each word. A VBDU, or Vocabulary-based Discourse Unit, is “a block of discourse defined by its reliance on a particular set of words” which can “usually be interpreted as topically-coherent units” (Biber et  al., 2007, p.  156). In other words, a VBDU is a lexically similar segment of text, and these lexical similarities generally reflect a shared topic, while the lexical differences that establish segment boundaries represent topic shifts. The similarity measure indicates the extent to which the words in the first window are identical with those in the second window. Sliding through the text, a similarity value is calculated at each word, and the text between two low similarity values is recognized as a VBDU (Biber, Csomay, Jones, & Keck, 2004). More specifically, as Csomay (2005, p.  247) indicates, if the two windows use the same vocabulary to a large extent, they are considered to belong to a single VBDU, and are interpreted as topically coherent units. In contrast, when the two segments are maximally different in their vocabulary, they are considered to mark the boundaries between two VBDUs, and are interpreted as different units. The formula used to calculate this measure for each word is as follows (adopted from Biber et al., 2007, p. 161) in Figure 2.3. The text segmentation method presented here goes beyond simply using the similarity measure as the basic principle of the TextTiling program (Hearst, 1997). Based on the similarity measures along with other, pre-set parameters, this method segments discourse automatically. For this study, the segmentation was carried out on monologic lectures. A window size of 50 words per adjacent window (100 words total) was appropriate to address our research questions; however, this measure is flexible, and could be changed for other text-types. After plotting the similarity measures at each word (see Figure 2.4 for a visual representation), to establish the unit boundaries for monologic lectures, two additional criteria had to be met: (i) the similarity measure had to be at least 0.3; and (ii) there had to be at least a 30% difference in the similarities measures between the peaks and valleys to mark importance, where peaks

$$\mathrm{similarity} = \frac{\sum_{i=1}^{totaltypes}\bigl(freq_{1}(word_{i}) \times freq_{2}(word_{i})\bigr)}{\sqrt{\sum_{i=1}^{totaltypes}\bigl(freq_{1}(word_{i})\bigr)^{2} \times \sum_{i=1}^{totaltypes}\bigl(freq_{2}(word_{i})\bigr)^{2}}}$$

Figure 2.3  Algorithm for Similarity Measure
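One way the Figure 2.3 measure could be computed and turned into candidate unit boundaries is sketched below, assuming the 50-word adjacent windows described above. The boundary check is a rough stand-in for the published criteria (the 0.3 similarity threshold, the 30% peak-valley difference, and slope-based peak and valley detection) rather than a re-implementation of the authors’ program, and all names are invented.

```python
import math
from collections import Counter

def window_similarity(left, right):
    """The Figure 2.3 measure: a cosine-style similarity between the
    word-frequency vectors of two adjacent windows of text."""
    f1, f2 = Counter(left), Counter(right)
    numerator = sum(f1[w] * f2[w] for w in set(f1) | set(f2))
    denominator = math.sqrt(sum(v * v for v in f1.values()) *
                            sum(v * v for v in f2.values()))
    return numerator / denominator if denominator else 0.0

def similarity_profile(words, window=50):
    """Similarity at each word position between the preceding and the
    following `window` words (50-word adjacent windows for the lectures)."""
    return [window_similarity(words[i - window:i], words[i:i + window])
            for i in range(window, len(words) - window + 1)]

def candidate_boundaries(profile, min_drop=0.3):
    """Mark local minima whose neighbouring peaks exceed them by at least
    `min_drop` (relative). This only approximates the published criteria;
    returned values are positions within the profile, so the window size
    must be added to recover word numbers."""
    boundaries = []
    for i in range(1, len(profile) - 1):
        if profile[i] <= profile[i - 1] and profile[i] <= profile[i + 1]:
            peak = min(max(profile[:i]), max(profile[i + 1:]))
            if peak > 0 and (peak - profile[i]) / peak >= min_drop:
                boundaries.append(i)
    return boundaries


if __name__ == "__main__":
    text = (("the tide rises the tide falls " * 30) +
            ("the capacitor stores charge on each plate " * 30)).split()
    print(candidate_boundaries(similarity_profile(text))[:5])
```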

[Figure: the similarity measure (y-axis, 0–0.9) plotted at each word of a 500-word stretch of text (word number on the x-axis), with shaded boxes marking the four units identified, VBDU 1–VBDU 4.]

Figure 2.4  Vocabulary Profile Based on Similarity Measures at Each Word (500-word Segment)

and valleys were identified automatically through calculating slope measures. All in all, these three parameters (window-size, similarity measure threshold, and peak-valley difference) mark the boundaries resulting in lexically coherent discourse units. While all three of these measures could be manipulated individually and independent of each other in order to gain the best segmentation results for different registers, for monologic lectures the previously presented values were used. Figure 2.4 shows the vocabulary profile of a 500-word segment of text. As described earlier, at each word, a similarity measure is calculated. The plot provides a visual representation of the change in that similarity measure as we go through the text word by word. The figure also shows the segments (marked with the shaded boxes) that have been identified in this stretch of discourse based on the criteria described previously. For example, the first segment, VBDU 1, is between word-number 1 and 95, and the second segment, VBDU 2, is between word-number 96 and 212, and so on. After the automatic segmentation, a multi-dimensional framework can be applied to provide a comprehensive linguistic analysis of the segments. VBDUs have been used to investigate biology research articles and university class sessions (Biber et al., 2007; Csomay & Wu, in press) and the

34  Triangulating Text Segmentation Methods use of lexical bundles in classroom discourse (Csomay, 2013; Csomay & Cortes, 2010). Applying this segmentation technique to the corpus discussed in this chapter resulted in 769 segments. The average number of segments per text was 32, ranging from 14 VBDUs to 81 VBDUs per full text depending on the overall length of each lecture. It is interesting to note that the automatic segmentation process resulted in fewer text segments than manual segmentation. Speaker turns are likely a contributing factor to this difference. Although the lectures are highly monologic, they do contain periods of interaction between the lecturer and other speakers. Human raters were instructed to separate non-lecturer speaker turns into individual segments, which the VBDU program did not do. To apply statistical analyses, VBDUs with at least 100 words were characterized further for their linguistic features. For the monologic lectures presented in this study, units were 230 words long on average, ranging from 101 words to 900 words per segment, resulting in 4.79 units per 1,000 words. Analyses of Text Segments—Two Approaches In this section, through a case study, we illustrate the kinds of analyses that can be carried out after segmenting the texts into sub-units whether applying manual or automatic segmentation techniques. While both case studies have used a corpus as their basis for the segmentation method, here we use two types of approaches researchers can take to analyze the segments. Following Approach #1, the segments are first analyzed through qualitative methods to identify their unit-internal functions. After they are classified into a set of categories, a series of quantitative studies can be carried out to see the most typical types of the functions identified manually, as one of our case-studies illustrates. In Approach #2, the segments are first characterized linguistically through quantitative measures on the basis of which they are then classified into four, broadly defined, communicative functional categories. After this they are categorized according to their pedagogical purposes and re-entered into the discourse flow, as shown in the second case study. Approach #1 Following manual text segmentation, all segments in each text were coded for the presence or absence of eight functions. The functional taxonomy was adapted from the work of Young (1994) and Deroey and Taverniers (2011). The taxonomy was piloted several times, and resulted in eight functions, with definitions, that comprehensively account for the

functions present in lectures. Each function, along with a definition and example, is described as follows:

• Discourse management: indications of the direction, aims, or scope of a lecture, topics to be covered, or the lecture’s place within the scope or sequence of the course.

  Example: Um, what I want to do with this, particular hour, is try and get you to the point where you understand how employers are using these kinds of competency or outcome portfolios to manage their interviewing and selection processes . . .

• Factual instruction: language that contains factual or theoretical information about a topic, historical context for a fact/theory/topic, or reported speech, ideas, or research of others on a topic.

  Example: it used to be quite common, in the eighteenth century for people to um middle class people to build a structure where the lower part of the structure was for their business, whether they were a doctor a grocer a pharmacist, even a brewer, you could imagine it was quite some smells coming out of a brewery. and then the upstairs would be your family’s residence.

• Examples: facts or case studies that are discussed in order to support or elaborate on the lecturer’s point.

  Example: I just want to make sure we all remember what it is, neap tide, the low tide, the lowest tidal phase, the tidal amplitude here is four meters [laughs] you know it’s sixteen and eighteen feet. that’s the low tide, the lowest tide of the year. the highest tide is eight meters, you know that’s longer than thirty feet.

• Procedural instruction: language that discusses, explains, or demonstrates the process of how to do something.

  Example: to find that total capacitance I need to get the areas ad2 is equal to six lambda W. two . . . um P.D. two is equal to twelve lambda . . . plus two W. two. I should break it into inside and outside bars but were gonna put it together here I guess I’m gonna assume that M. two is minimum sized . . . ad1 is equal to . . . six lambda W. one but W. one is . . . four W. two mu P. over mu N.

• Evaluation: groups of words that indicate how to weigh content-related information, or groups of words that indicate the speaker’s viewpoint on, opinion about, feelings about, or agreement/disagreement with content-related information.

  Example: so the real significance of this little pattern won’t really be clear to us until we have a selection of planets around other stars to study . . . you can’t judge the significance of the numerical pattern from one example. and we don’t have good statistics on planets around other stars yet. so the jury is out on what distance might tell us about the way planets form around stars.

• Class management: language that provides lecture or course information (such as timetables, assignment guidelines, etc.), or language that addresses the physical environment (e.g. the whiteboard) or lecturer actions (e.g. passing out handouts), or language that manages the audience (e.g. appeals for quiet, instructions to take notes).

  Example: well let me go ahead and uh hand out the syllabus. uh for the course (so) we can spend a few minutes uh going over that . . . (just) go ahead and pass them on down the row . . . everybody have a copy of the syllabus?

• Personal opinion: the presentation of opinion that is unrelated to content-related information, or facts or examples that support such an opinion.

Example: If the football coach gets paid a bit more than the history teacher, it’s because Americans in general care a lot more about football than history. Both offer heroes and stirring tales of dare and do, but the one appeals to us and the other does not. Probably because history requires too much brain activity. We are intellectually lazy.

In order to perform the functional coding, one researcher read each text segment. The functional taxonomy and definitions were then used to determine the presence or absence of each function—the parts of the segment corresponding to each function were not identified. In addition, a segment was coded as containing a function regardless of the number of occurrences of that function within the segment. Results of this method tell us the proportion of text segments within a text that do or do not contain a particular function. They do not tell us the number of occurrences of a function. Within each text, segments were annotated to indicate which functions were present. Examples 1 and 2 illustrate this point.

Example 1 ( factual instruction, evaluation) let’s go to a little bit, closer, to, Darwin in terms of time . . . and talk about some of his almost contemporaries. um first of all Hutton and Lyell were both geologists, and they came up with not together but sequentially a couple of really important ideas. important eventually to Darwin.

Triangulating Text Segmentation Methods  37 Example 2 (class management, procedural instruction) [unclear] freewrite this, I just want you to get words down on paper, practice writing, that’s the whole idea of this and if you don’t feel like writing[unclear] so, I  want you to reflect on [unclear] direction [unclear] write about. Nothing else, [unclear]. The idea is to write as much as you can and continue writing just get your ideas down on paper. It’s not a [unclear], it’s a free write.
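A coding scheme of this kind lends itself to simple machine-readable storage. The sketch below is purely illustrative and is not the format or tool used in the study (which, as described below, relied on AntConc over the annotated corpus): the data layout, label names and example segments are invented, but it shows how presence/absence coding can be stored so that segments containing a given function can be retrieved automatically.

```python
# Hypothetical representation of functionally coded segments (not the format
# used in the study): each segment records its source text and the set of
# functions judged to be present in it.
segments = [
    {"text_id": "lecture_01", "segment": 1,
     "functions": {"factual instruction", "evaluation"}},
    {"text_id": "lecture_01", "segment": 2,
     "functions": {"class management", "procedural instruction"}},
    {"text_id": "lecture_02", "segment": 1,
     "functions": {"factual instruction", "examples"}},
]

def segments_with(function, coded_segments):
    """Return every segment coded as containing the given function."""
    return [s for s in coded_segments if function in s["functions"]]

print(len(segments_with("factual instruction", segments)))  # -> 2
```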

Once the entire corpus was segmented as described in the method section, and annotated to include the functional coding, AntConc concordancing software was used to identify the frequency with which segments containing each function occurred in the corpus, along with the co-occurrence of functions. The results of this analysis are presented in the case studies section, although before that we will detail the second approach. Approach #2 Once the VBDU units were identified with the TextTiling segmentation method, linguistic analyses were carried out in order to explore each unit’s linguistic make-up and based on its determined text type. Two statistical methods were used to identify text-types: multi-factorial, multi-dimensional analysis and K-means clustering. The goal of a factor analysis is to identify the co-occurrence patterns of a large number of linguistic features using statistical methods. These “co-occurence patterns are analyzed as underlying dimensions of variation” (Biber & Conrad, 2001, p.  6). While the co-occurring patterns are recognized quantitatively, the corresponding dimensions of variation are distinguished qualitatively. With the “assumption that co-occurence reflects shared function”, through interpreting “the functional bases underlying each set of co-occurring linguistic features” (ibid.), we are able to specify the dimensions of variation. Through the process, co-occurring linguistic features would gain positive or negative values. The group of features that co-occur with positive values are clustered on the positive side of that dimension of variation and those features that co-occur with negative values are clustered on the opposite, negative side of that dimension. The values being positive or negative mostly indicate group membership. When those linguistic features with positive values appear together in a given text segment, very few features with negative values are present. Similarly, when linguistic features with negative values are

38  Triangulating Text Segmentation Methods present, hardly any features of the positive set are present. (See text in Examples 3–7.) Based on previous work applying multi-factorial analyses on a larger set of university classroom discourse (Csomay, 2005), we used already existing dimensions of linguistic variation in this study. The three dimensions identified earlier were: Dimension 1—Conceptual informational focus (negative side) versus Contextual directive orientation (positive side); Dimension 2—Lack of Personalized framing (negative side) versus Personalized framing (positive side); and Dimension 3—Teacher monologue (negative side) versus Interactive dialogue (positive side). The full list of associated linguistic features with each of these three dimensions is in Table 2.2. Text samples illustrating the groupings of linguistic features on each dimension are in Examples 3–7. Table 2.2 Linguistic Features (Selection) on the Three Dimensions of Linguistic Variation in Academic Classroom Talk Dimensions

DIMENSION 1
  Negative side (Conceptual, informational focus): word length, attributive adjectives, nouns, words used in one class only, past tense, prepositions, nominalization
  Positive side (Contextual, directive orientation): non-past tense, adverbial clauses (conditional), 1st and 2nd person pronouns, verb-initial lexical bundles, modals, non-passive constructions, directive verbs, contractions, 3rd person pronoun and reduced forms, common vocabulary, activity verbs

DIMENSION 2
  Negative side: Lack of personalized framing
  Positive side (Personalized framing): that deletion, ‘you know’, mental verbs, likelihood verbs with ‘that’, factual verbs with ‘that’, ‘I mean’, 3rd person pronouns (he, she, they)

DIMENSION 3
  Negative side (Teacher monologue): teacher turn length
  Positive side (Interactive dialogue): student turns, teacher turns, one-word turns, discourse particles

Source: Csomay, 2005
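Before turning to the text samples in Examples 3–7, the scoring logic behind such dimensions can be sketched briefly. The snippet below illustrates only the general multi-dimensional procedure: feature rates per unit are standardized across all units, and a unit’s score on a dimension is the sum of its standardized rates for the positive-side features minus those for the negative-side features. The feature names are abridged from Table 2.2, the counts are invented, and the actual feature sets and weights in the study derive from Csomay’s (2005) factor analysis rather than from this simplified calculation.

```python
# Illustrative dimension scoring (not the study's exact computation): z-score
# each feature's rate across all units, then sum positive-side z-scores and
# subtract negative-side z-scores for each unit.
from statistics import mean, stdev

DIM1_POSITIVE = ["non_past_tense", "second_person_pron", "modal", "contraction", "activity_verb"]
DIM1_NEGATIVE = ["noun", "attributive_adj", "past_tense", "preposition", "nominalization"]

def standardize(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s if s else 0.0 for v in values]

def dimension_scores(units, positive, negative):
    z = {f: standardize([u[f] for u in units]) for f in positive + negative}
    return [sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
            for i in range(len(units))]

# Feature rates per 1,000 words for three hypothetical VBDUs
units = [
    {"noun": 310, "attributive_adj": 62, "past_tense": 41, "preposition": 118, "nominalization": 34,
     "non_past_tense": 48, "second_person_pron": 4, "modal": 9, "contraction": 11, "activity_verb": 10},
    {"noun": 190, "attributive_adj": 28, "past_tense": 15, "preposition": 84, "nominalization": 12,
     "non_past_tense": 104, "second_person_pron": 41, "modal": 28, "contraction": 37, "activity_verb": 35},
    {"noun": 240, "attributive_adj": 40, "past_tense": 22, "preposition": 95, "nominalization": 20,
     "non_past_tense": 80, "second_person_pron": 20, "modal": 18, "contraction": 25, "activity_verb": 22},
]
print([round(s, 2) for s in dimension_scores(units, DIM1_POSITIVE, DIM1_NEGATIVE)])
```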

Triangulating Text Segmentation Methods  39 Example 3 ( Dimension 1 negative side: Conceptual, informational focus) the ultimate uh fate of the various provinces of the Roman Empire, and we also have an abundant corpus of linguistic documentation. so to a large extent the Romance languages present an ideal case, uh for studying, the development of language against the background of history. or conversely how history affects language. one of the issues that I want to particularly concentrate on today is the issue of the linguistic, uh impact of language contact. one of the main historical themes that I've been stressing throughout the last uh couple of classes has been that in the history of

The linguistic features associated with the negative side of Dimension 1 (word length, nouns, past tense, prepositions, attributive adjectives, words used in one class only, nominalization) are in bold, underlined. Texts with these features are associated with academic reporting, reading written text out loud, or listings, which may occur, for example, during elicitation.

Example 4 (Dimension 1 positive side: Contextual, directive orientation) This is uh [one syll] in Java. Look in the Bean Box, you can uh play around with (speeds), slow em down or [unclear words]. and then what you can do is link this thing up to some events and go into an (auto button) class here it’s in the toolbox. Create an instant kind of an auto Bean Box. You wanna link this button up to the activity that’s going on here. You can uh . . . change the label, you want the start label. It knows that it’s gonna start over here. . . . And with this button highlighted you can select events that you want that button to initiate or actions. Once you do that you got a line here.

The linguistic features associated with the positive side of Dimension 1 (non-past tense, first and second person pronouns, modals, non-passive constructions, contractions, adverbial subordinator of condition, verbs in directive forms, commonly used words in all class sessions in the corpus, activity verbs, and non-human third person singular pronoun it and other reduced forms such as pro-verb do, demonstrative pronouns) are in

bold italics, underlined. The texts associated with this type of language use are those where references are made to the immediate spatial or mental context or to a hypothetical situation.

Example 5 (Dimension 2 positive side: Personalized framing) Now that’s really a question. Did the uh the Soviets pull their people out? Or did the Chinese push them out? I get, I get both, I get both sides. Well, I, if you take a look at this from you know, purely pragmatic viewpoint, I mean you know, of course you can probably find evidence to both. But why would, I mean, why would, uh, China want to throw out Soviet technocrats when that was basically the only area at this point in time where they were getting the technology that they needed in order to—to advance industry, to—to, at that time. And later on, yeah

The linguistic features associated with the positive side of Dimension 2 (that-deletion, factual verbs, likelihood verbs, mental verbs, ‘you know’, ‘I mean’, and third person pronouns such as he, she, they) are in bold italics. The texts associated with this type of language use are associated with reflecting, and frames to the speaker’s ideas and opinions and to the interaction (holding the floor).

Example 6 (Dimension 3 negative side: Teacher monologue) Teacher: it’s

gonna be tough for you cos that’s really when you

get into management you try, not only do you have to do your job but you have to do everyone else’s job too because they’re not gonna do it as soon as you’d do it so, my, my counseling there is that is to get some real help in the area of delegation if if your goal is management then you really are gonna have to do that and feel comfortable with it, make yourself available, particularly with the people who are working for you or your colleagues that are working with you on a project make yourself available, know that means try

The feature associated with the negative side of Dimension 3, long turns taken by the teacher, is indicated by capitalized letters. The texts associated with this side of Dimension 3 reflect the teacher holding the floor for a longer period of time, producing monologic discourse.

Example 7 (Dimension 3 positive side: Interactive dialogue)

Teacher: Student: Teacher: Student: Teacher: Student:

and help each other figure but this is going to something out do both and this this is on our um data right? right

we’re already working in pairs so (we’ll) actually be working like Student: in fours Student: yeah Teacher: in fours Student: yeah, yeah Student: mm Teacher: well Student: yeah Teacher: yeah Student: some of us are working in pairs Teacher:  Ok (great) and some are not, like A. is working with me so we’re kind of a pair but she‘s doing all the work. Student: boy you’re getting Student: yeah Teacher: get that . . . able to be very flexible. Are you working in a group, in a pair? Student: we were trying to get our titles together, yeah Teacher: Ok Student: we’re working on it Teacher:  so when you come to class you should have you should have a rough draft done Student: right so

The features on this side of Dimension 3 (turns taken by students and by the teacher, discourse particles, one-word turns, lack of turn-length) are capitalized. Texts where interaction is present would have high scores on this dimension. After identifying the broad dimensions of variation, the goal of applying clustering methods is to group text segments together that have similar linguistic characteristics of the three previously identified dimensions together. We applied K-means clustering, which is a statistical method that allows researchers to explore how many clusters should go into the equation. Relying on the constellation of these three dimensions for each VBDU discourse unit, we identified four clusters (Csomay, 2007, pp. 224–229), based on which the following text-types were determined: Cluster 1: Personalized

42  Triangulating Text Segmentation Methods framing; Cluster 2: Informational monolog; Cluster 3: Contextual interactive; and Cluster 4: Unmarked. Text segments classified as Cluster 1 texttype have salient linguistic features of the positive side of Dimension 2 and some features of positive Dimension 1. Text segments belonging to Cluster 2 text-type have linguistic features from the negative side of Dimension 1 and also features from the negative side of Dimension 3 (hence the name, Informational monolog). Text segments from Cluster 3 text-type exhibit features from the positive side of Dimension 1 and from the positive side of Dimension 3 (hence the name Contextual interactive). Finally, text segments from Cluster 4 are those that lack specific features in large numbers from any one of the six groups identified through the multidimensional analysis; thus, they are called “Unmarked”. Overall, of the total segments identified, 577 longer-than-a-hundredwords segments went through quantitative linguistic analysis. Of those, 101 units were classified as Cluster 1 type of text, 180 in Cluster 2 type, 141 of Cluster 3 type and 155 as Cluster 4 type. Once these text-types were established, we were able to feed the units with their cluster types back into their original position in the text and examine text structure based on the linguistic characteristics within the text.
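As a rough illustration of the clustering step just described, the sketch below groups hypothetical VBDUs by their three dimension scores using K-means, here via scikit-learn purely for illustration; the scores are invented, and in the study the four-cluster solution was arrived at and interpreted against the dimensions reported in Csomay (2007).

```python
# Illustrative K-means clustering of VBDUs on their three dimension scores.
import numpy as np
from sklearn.cluster import KMeans

# Rows are VBDUs; columns are scores on Dimensions 1, 2 and 3 (invented values).
scores = np.array([
    [ 4.9, -1.4, -2.5],
    [ 9.5,  4.1, -3.2],
    [-7.7, -0.8, -3.5],
    [-6.2,  0.3,  5.1],
    [ 0.4,  0.1,  0.2],
    [ 5.8,  3.6, -2.8],
    [-5.9, -1.1, -4.0],
    [ 3.9,  0.2,  6.3],
])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scores)
print(kmeans.labels_)           # cluster membership of each VBDU
print(kmeans.cluster_centers_)  # mean dimension profile of each cluster
```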

Case Studies Now that texts in this corpus had been successfully segmented into discourse units and coded for function, corpus tools could be used to pursue a variety of additional research questions. For example, the vocabulary or grammatical structures that characterize segments containing each function could be identified using methods such as keyword analysis. Concordancing software could be used to investigate the use of linguistic features of interest (e.g. stance features, hedging, phrasal or clausal elements) within or across discourse units. On the other hand, we could carry out other, quantitative, analyses that could lead us to text-structure characterization. Case study #1: Distribution of Unit Functions The primary research question addressed for this case study concerns the distribution of functional categories identified in the text segments and across the corpus of low interactive university classes. As a first step in examining the functional composition of lectures, the percentage of segments containing each function was calculated for each text. Table 2.3 presents the mean percentages for the corpus by function. Because each segment can contain multiple functions, percentages do not add up to 100%. The most frequently occurring function within the corpus was factual instruction, which occurred in the majority of text segments—a mean of 81% of segments per text. Factual instruction was followed in frequency by

Table 2.3  Mean Percentage of Segments Containing Each Function

Function                     Mean     SD
Factual Instruction          81.11    18.02
Examples                     33.14    17.90
Class Management             32.19    20.56
Evaluation                   31.82    23.34
Discourse Management         19.34    11.10
Procedural Instruction       16.82    19.82
Personal Opinion             12.02    12.71

examples, class management, and evaluation, each of which occurred in close to one third of the segments, on average, per text. The remaining functions all occurred in a mean of less than 20% of segments per text. In addition to examining the relative frequency of each individual function in the segments, it is possible to investigate how functions co-occur. The mean percentage of segments per text containing each combination of two functions is presented in Table 2.4. Again, it is important to note that these percentages will not add up to 100% as each segment was coded for the presence or absence of all eight functions; many text segments contained only one function, or more than two.

Table 2.4  Co-occurrence of Functions

Functions                                          Percent of Segments
Factual instruction & Examples                     30.75
Factual instruction & Evaluation                   27.20
Factual instruction & Class management             14.90
Discourse management & Factual instruction         14.68
Factual instruction & Procedural instruction       11.74
Procedural instruction & Class management          12.48
Factual instruction & Personal opinion             10.23
Factual instruction & Interaction                  10.13
Discourse management & Class management             5.91
Examples & Evaluation                                9.75
Examples & Class management                          5.87
Discourse management & Evaluation                    4.92
Examples & Interaction                               4.92
Examples & Personal opinion                          4.92
Evaluation & Class management                        4.64
Procedural instruction & Examples                    4.36
Class management & Interaction                       4.07
Discourse management & Examples                      3.60
Discourse management & Procedural instruction        2.84
Procedural instruction & Evaluation                  2.65
Evaluation & Interaction                             2.56
Evaluation & Personal opinion                        2.46
Interaction & Personal opinion                       2.46
Procedural instruction & Interaction                 2.08
Discourse management & Interaction                   1.23
Class management & Personal opinion                  1.23
Discourse management & Personal opinion              0.85
Procedural instruction & Personal opinion            0.28
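Percentages of the kind reported in Tables 2.3 and 2.4 can be derived from the presence/absence coding with a short script. The sketch below, using invented coded segments rather than the study’s data or its AntConc-based procedure, calculates for each text the percentage of its segments that contain a given function or pair of functions, and then averages those per-text percentages, mirroring the “mean percentage of segments per text” figures above.

```python
# Illustrative calculation of mean per-text percentages of segments containing
# each function and each pair of functions (invented data).
from itertools import combinations
from statistics import mean

coded = {
    "lecture_01": [{"factual instruction", "examples"},
                   {"factual instruction", "evaluation"},
                   {"class management", "procedural instruction"},
                   {"factual instruction"}],
    "lecture_02": [{"factual instruction", "examples"},
                   {"discourse management", "class management"},
                   {"factual instruction"}],
}

def percent_containing(segments, functions):
    """Percentage of segments in one text containing all of the given functions."""
    hits = sum(1 for present in segments if functions <= present)
    return 100 * hits / len(segments)

all_functions = sorted({f for segs in coded.values() for seg in segs for f in seg})

single = {f: mean(percent_containing(segs, {f}) for segs in coded.values())
          for f in all_functions}
pairs = {pair: mean(percent_containing(segs, set(pair)) for segs in coded.values())
         for pair in combinations(all_functions, 2)}

print(round(single["factual instruction"], 2))               # -> 70.83
print(round(pairs[("examples", "factual instruction")], 2))  # -> 29.17
```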

Because factual instruction occurred, on average, in 81% of text segments within lectures, it is unsurprising that it occurs frequently in combination with all other functions. Examples, evaluation, and class management all occurred in approximately one third of text segments, and in combination with factual instruction, represent the three most frequent co-occurrences in the corpus. In fact, examples occurred in a mean of 33.14% of segments per text and co-occurred with factual instruction in a mean of 30.75% of segments per text, indicating that almost all occurrences of examples co-occur with factual instruction. Evaluation presents similar results, occurring in a mean of 31.82% of text segments and co-occurring with factual instruction in a mean of 27.20% of text segments (Example 1 given earlier in this chapter presents such a text segment). Both examples and evaluation are referential functions, in that they refer to or are based upon other ­information—examples exemplify something, and evaluation evaluates something. Within the context of academic lectures, it is unsurprising that they frequently refer to factual information. Although class management occurs approximately as frequently as examples and evaluation, in a mean of 32.19% of segments within a text, it does not follow the same pattern. Class management co-occurs with factual instruction in a mean of 14.90% of segments within texts, indicating that less than half of instances of class management occur in the same segment with factual instruction. In fact, class management is an outlier when compared to all other functions in this respect—it is the only function that occurs more frequently without factual instruction than with it. Instead, class management co-occurs relatively frequently with procedural instruction (in a mean of 12.48% of text segments). Occurrences of class management without factual instruction represent a type of discourse unit that is fundamentally different in purpose from the instructional units containing factual instruction. These discourse units are focussed on the business of running a class, rather than on course content. Class management co-occurring with procedural instruction often addresses the steps necessary to complete in-class activities (Example 2), or tasks related to homework, lab sections, etc. Taken together, these results indicate the majority of discourse units within academic lectures are concerned with the transmission

Triangulating Text Segmentation Methods  45 of content-related, factual information to students. This is hardly surprising for academic lectures. A total of 81% of segments in the average text contain factual instruction, which is most frequently accompanied by examples or evaluation (or both). Of the remaining five functions, four of them occur more frequently with factual instruction than without it. The exception is class management which occurs more than 50% of the time without factual instruction and occurs 38.77% of the time in combination with procedural instruction. Case Study #2: Distribution of Linguistic Patterns in Text Structure In this section, we illustrate the relationship and association between language change and instructional purpose in the structure of discourse. The primary research question in this case study is to examine how the units with varying linguistic features behave in marking different communicative purposes as we enter them back into the discourse thread. The following text extract shows three consecutive units exhibiting their text-type (dimension scores and text-types in brackets) and the associated communicative/pedagogical goals. Again, the ultimate question is to see how language change in the structure of discourse corresponds to change in communicative/instructional purpose. In Example 8, Dimension 1 positive features are in bold, Dimension 1 negative features are in bold and underlined, and Dimension 2 positive features are in italics.

Example 8 ( MICASE, Physics, low division undergraduate) Teacher: VBDU #8—Cluster 3: Contextual interactive (Dimension 1: 4.87; Dimension 2: -1.37; Dimension 3: -2.51) (Exposition with a hypothetical, contextualized example) . . . constant speed in a straight line so I’m not accelerating. I’ll be in an inertial frame. uh a reference frame which is not inertial would be uh, for example a driver in a car which would be going around a corner. Then what would happen is I’d see things looked like they’d be pushed against the outer door uh because I’m in a non-inertial frame. you know basically there’s no force pushing you now but then you wanna go straight and so, and if I continued in an inertial frame, the objects would just sort of continue with me but when I turned this way, I’d be in the accelerated frame, trying to go straight. So it looks to me like they’re uh they’ve suddenly started moving. but anyway if you’re in uh, any inertial frame, the laws of Newtonian

46  Triangulating Text Segmentation Methods physics work just fine VBDU #9—Cluster 1: Personalized framing (Dimension 1: 9.47; Dimension 2: 4.12; Dimension 3: -3.24) (Expansion of example with personalization) and what that means is that you can assume, you’re at rest in your own frame by definition, and you can assume you’re at rest and [that0] there’s no experiment in mechanics you can do, which will contradict that assumption. So let’s take a concrete example. You’ll see what I mean. Suppose I’m in oh you know a a something coasting along like this at constant velocity, and I throw a ball. okay? Now, uh what you see because I’m moving as I throw the ball in the air, and it comes down and I catch it and of course from your point of view the balls are passing you know like that. It obeys Newton’s second law, F equals M-A you know, downward force causes there to be an acceleration and has an initial upward velocity this way. And so, applying Newton’s law I should say well (what I’m gonna) do is this . . . fine but now from my point of view in my reference frame I’m at rest. You guys are moving that way but I just throw the ball straight up. Well once again you know, F equals M-A in my frame too but I- in my frame it has an initial velocity like that but no horizontal velocity, so it just it goes straight up and straight down. VBDU #10—Cluster 2 Informational monolog (Dimension 1: -7.73; Dimension 2: -0.76; Dimension 3: -3.52) (Exposition - connecting previously discussed examples to theory) but neither you nor I is especially entitled to say I’m the one who’s at rest and you’re moving. Basically either one of us can say I’m at rest and apply the laws of mechanics and they’ll work just fine. Um, Einstein was very impressed with this principle. Actually people before Einstein had noted it also and it was called the principle of relativity. Essentially that the laws of mechanics work equally well in any inertial frame of reference. Einstein felt that uh this must be some fundamental underlying simplicity of nature that it’s telling us something and so what he wondered was well should this be true not just of the laws of mechanics. But how about say the laws of electricity and magnetism um, and there seemed to be a problem when he did that uh because it appeared that the way the laws of electricity and magnetism were being interpreted in nineteen-oh-five. It appeared that not all inertial frames were on an equal footing that there was one special inertial frame and only that one could be considered to be at rest. And to illustrate where the problem comes let’s just think about light um, you remember that according to Maxwell’s theory . . .

Triangulating Text Segmentation Methods  47 Having gone through the quantitative analyses, the segments are re-entered into the flow of discourse, as shown for three segments, VBDUs 8 to 10, in Example 8. As seen, VBDU 8 was classified as a Cluster 3 text-type (Contextual interactive) exhibiting a high number of positive Dimension 1 scores including first and second person pronouns and conditional clauses. In this segment, the instructor exposes the students to several concepts of physics (inertial and non-inertial frame, referential frame, and accelerated frame) and their connection by putting him/herself in the context of an example (s/he as the driver of the car). In this example, an outside object (the car) is described as it relates to other, surrounding objects. In the next segment, VBDU 9, s/he summarizes the previous example and expands the explanation further with another example. In this case, however, the audience is brought into the example. The example is no longer about an object, but s/he describes the phenomenon from his/ her perspective as well as the students’ perspective. This segment was classified as Cluster 1 text-type, with positive features apparent from mainly Dimension 2 (Personalized framing), including that deletion, third person pronouns, and also some positive Dimension 1 features. Finally, in VBDU 10, the instructor exposes the students to the name of the actual theory that relates to these examples, connecting it to the main point of this segment that discusses Einstein’s theory. No longer is the topic contextualized with examples. In this segment, we see many features that are all associated with the negative side of Dimension 1 (nouns, attributive adjectives, and prepositional phrases) and with dense information packaging.

Conclusion With the goal of analyzing text structure in a corpus of 24 monologic lectures, this chapter triangulated two text segmentation methods and two text-analytical approaches. In analyzing discourse structure from different perspectives this way, we arrived at complementary findings through our results. As the first case study showed, the results indicate that there are at least two distinct types of discourse units that comprise the structure of academic lectures. The first type is instructional and contains factual instruction along with functions (such as evaluation and examples) that support factual instruction or contribute to content delivery. The second type of structural unit centers around class management, often in combination with procedural instruction. The difference between these two types of structural units may be interpreted as a difference between course content, on the one hand, and course administration, on the other. The second case showed how we can examine the flow of discourse from a lexical as well as a linguistic perspective. More specifically, we

48  Triangulating Text Segmentation Methods examined how change in language use corresponds to change in communicative purpose and, in the context of academia, change in pedagogical goals and actions. Indeed, the example illustrates that change in linguistic features mark change in communicative purposes within text. The example also shows that the same broad category of a communicative purpose of a segment (e.g. exposition) can be realized in linguistically varying ways and that variation in language use marks differences in pedagogical purposes. More specifically, and as the extract shows, exposition to information can take place in linguistically condensed ways through using nouns and prepositional phrases, or through contextualized language use through personal pronouns, for example. The latter type may manifest itself through examples that involve the audience. Upon reflection, on the one hand, the ultimate benefit of the first approach to analyze text structure is that we are able to have in-depth analyses of the texts arriving at functional categories and patterns of distribution of those functions across the texts. The drawback of this type of analysis, however, is that it is time-consuming, as only a certain amount of text can be analyzed this way within a reasonable research timeframe and so it is difficult to see how this could be applied to large datasets. On the other hand, the benefits of the second approach to the analysis is that we provided a comprehensive linguistic analysis of the units and that we can show language change and corresponding functions in the text. The drawback for this type of analysis, although it can be done in a timely manner on a large corpus, is that we do not get insights into the qualitative, functional analysis done with the other approach. Through this study, it became clear that each approach lacked what the other one could provide. Hence, the results of our triangulated study presented in this chapter not only provide complementary findings but makes a strong case for a combined approach to an all-inclusive analysis of discourse structure that apply both of these approaches in order to get a full picture. In terms of text segmentation techniques, similar conclusions can be drawn. More specifically, manual text segmentation by human raters, on the one hand, allows the researcher to operationalize the segmentation process and the resulting discourse units based on project-dependent criteria. However, this method poses practical challenges, which are particularly daunting for corpus researchers wishing to segment a large body of texts. Hand segmentation is time-consuming when performed on a large number of texts. Multiple raters can address this problem. However, the necessity of relying on multiple raters introduces challenges of participant recruitment and cost, along with the necessity of controlling for the granularity with which texts are segmented. The method presented here minimizes the challenges of manual text segmentation while creating a

Triangulating Text Segmentation Methods  49 segmented and annotated corpus that is well-suited for use with corpus tools, demonstrating that the continual evolution of resources, such as crowdsourcing platforms, that are available to researchers, in combination with creativity and the use of mixed methods, provides at the least a starting place for research that has long been considered methodologically daunting for corpus researchers. On the other hand, whereas automatic text segmentation methods have been observed to have the advantages of replicability and speed, they continue to introduce concerns regarding the accuracy and the appropriateness of the resultant discourse units for a given research project. While automatic methods segment texts more quickly and consistently than human raters, it is true that they do not always do a better job than human raters of segmenting texts into meaningful discourse units. However, as Biber et al. (2007, p. 156) note, the topical shift identified by a VBDU boundary often seems to correspond to a shift in the communicative purpose or function of the text, and this is promising. It is also noteworthy that, as exemplified by the VBDU patterns shown, the flow of new vocabulary items in discourse tracked by the TextTiling program, for example, would be impossible to do by humans in a reasonable timeframe and on a large number of texts as was the case here. Hence, by applying automatic segmentation with corpus-based methods, we are able to examine textual patterns and patterns of language use that we might not be able to explore otherwise. Triangulating both our segmentation methods and structural analyses allowed us to identify similarities and differences in the resulting insights into text structure. Ultimately, a combined approach would be ideal, providing supplementary types of information: the functional patterns that comprise the structure of a text, revealed by manual analysis, and the comprehensive linguistic description of text segments, revealed by an automated analysis. For that to happen, we will need to be able to automatically associate linguistic characteristics with functions.

References Allison, D., & Tauroza, S. (1995). The effect of discourse organisation on lecture comprehension. English for Specific Purposes, 14(2), 157–173. Baker, P., & Egbert, J. (2016). Triangulating methodological approaches in corpus-linguistic research. New York: Routledge. Basturkmen, H. (2007). Signalling the relationship between ideas in academic speaking: From language description to pedagogy. Prospect: An Australian Journal of TESOL, 22(2), 61–71. Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press. Biber, D., Connor, U., & Upton, T. (2007). Discourse on the move: Using corpus analysis to describe discourse structure. New York, NY: John Benjamins Publishing. Biber, D., & Conrad, S. (2014). Variation in English: Multi-dimensional studies. New York: Routledge.

50  Triangulating Text Segmentation Methods Biber, D., Csomay, E., Jones, J. K., & Keck, C. (2004). A corpus linguistic investigation of vocabulary-based discourse units in university registers. In U. Connor  & T. A. Upton (Eds.), Applied corpus linguistics: A  multidimensional perspective (pp. 53–72). Amsterdam: Rodopi. Biber, D., Egbert, J.,  & Davies, M. (2015). Exploring the composition of the searchable web: A corpus-based taxonomy of web registers. Corpora, 10(1), 11–45. Biber, D., Reppen, R., Clark, V., & Walter, J. (2004). Representing spoken language in university settings: The design and construction of the spoken component of the T2K-SWAL Corpus. ETS TOEFL monograph series. Princeton, NJ: Educational Testing Service. Boguraev, B. K., & Neff, M. S. (2000). Discourse segmentation in aid of document summarization. Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, 3004–3014. Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s mechanical turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3–5. Csomay, E. (2002). Episodes in university classrooms: A corpus linguistic investigation. (Ph.D. Dissertation). Northern Arizona University. Flagstaff, Arizona. Csomay, E. (2005). Linguistic variation within university classroom talk: A corpus-based perspective. Linguistics and Education, 15(3), 243–274. Csomay, E. (2007). A  corpus-based look at linguistic variation in classroom interaction: Teacher talk versus student talk in American University classes. Journal of English for Academic Purposes, 6(4), 336–355. Csomay, E. (2013). Lexical bundles in discourse structure: A corpus-based study of classroom discourse. Applied Linguistics, 34(3), 369–388. Csomay, E., & Cortes, V. (2010). Lexical bundle distribution in university classroom talk. In S. Gries, S. Wulff, & M. Davies (Eds.), Corpus linguistic applications: Current studies, new directions (pp. 153–168). New York, NY: Rodopi. Csomay, E., & Wu, S. M. (in press). Language variation in university classrooms: A corpus-based cross-cultural perspective. Register Studies. Deroey, K. L. B., & Taverniers, M. (2011). A corpus-based study of lecture functions. Moderna Sprak, 105(2), 1–22. Flammia, G., & Zue, V. (1995). Empirical evaluation of human performance and agreement in parsing discourse constituents in spoken dialogue. In Proceedings of the 4th European conference on speech communication and technology (pp. 1965–1968). Madrid, Spain. https://pdfs.semanticscholar.org/0a35/11dd0 25a91bc0ed7904ce4f17794bb8fbd44.pdf Fournier, C. (2013). Evaluating text segmentation (Unpublished doctoral dissertation). Ottawa-Carleton Institute for Electrical and Computer Engineering, University of Ottawa. Glass, J., Hazen, T., Cyphers, S., Malioutov, I., Huynh, D., & Barzilay, R. (2007). Recent progress in the MIT spoken lecture processing project. Interspeech, 2553–2556. Goodman, J. (2013). Data collection in a flat world: The strengths and weaknesses of mechanical turk samples. Journal of Behavioral Decision Making, 26(3), 213–224. Hearst, M. A. (1993). TextTiling: A quantitative approach to discourse segmentation. Technical Report 93/24, Sequoia 2000 Technical Report, University of California, Berkeley.

Triangulating Text Segmentation Methods  51 Hearst, M. A. (1994, June). Multi-paragraph segmentation of expository text. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics (pp. 9–16). Association for Computational Linguistics. Las Cruces, New Mexico. Hearst, M. A. (1997). Text tiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1), 33–64. Heinonen, O. (1998). Optimal multi-paragraph text segmentation by dynamic programming. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (Vol. 2, pp. 1484–1486). Kao, J. T., Levy, R., & Goodman, N. D. (2013). The funny thing about incongruity: A  Computational model of humor in puns. Proceedings of the 35th Conference of the Cognitive Science Society, 1, 728–733. Lin, M., Chau, M., Cao, J., & Jr, J. N. (2005). Automated video segmentation for lecture videos: A linguistics-based approach. IJTHI, 1(2), 27–45. Malioutov, I.,  & Barzilay, R. (2006). Minimum cut model for spoken lecture segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 25–32). Olsen, L. A., & Huckin, T. H. (1990). Point-driven understanding in engineering lecture comprehension. English for Specific Purposes, 9(1), 33–47. Passonneau, R.,  & Litman, D. (1997). Discourse segmentation by human and automated means. Computational Linguistics, 23(1), 103–139. Pevzner, L., & Hearst, M. A. (2002). A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1), 19–36. Prince, E. (1981). Toward a taxonomy of given-new information. In P. Cole, (Ed.), Radical pragmatics (pp. 223–55). New York, NY: Academic Press. Ranard, B. L., Ha, Y. P., Meisel, Z. F., Asch, D. A., Hill, S. S., Becker, L. B., . . . Merchant, R. M. (2014). Crowdsourcing—harnessing the masses to advance health and medicine, a systematic review. Journal of General Internal Medicine, 29(1), 187–203. Rand, D. (2012). The promise of mechanical turk: How online labor markets can help theorists run behavioral experiments. Journal of Theoretical Biology, 229, 172–179. Sardinha, T. B. (2001). Lexical segments in text. In M. Scott & G. Thompson (Eds.), Patterns of text: In honor of Michael Hoey (pp. 213–237). Amsterdam: John Benjamins. Simpson, R. C., Briggs, S. L., Ovens, J., & Swales, J. M. (2002). The Michigan corpus of academic spoken English. Ann Arbor, MI: The Regents of the University of Michigan. Sprouse, J. (2011). A  validation of Amazon mechanical turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods, 43(1), 155–167. Webber, B. (2012). Discourse structures and language technology. Natural Language Engineering, 18(4), 437–490. Youmans, G. (1991). A new tool for discourse anlaysis. The voabulary management profile. Language, 67, 763–789. Young, L. (1994). University lectures—macro-structure and micro-features. In J. Flowerdew, (Ed.), Academic listening: Research perspectives (pp. 159–176). Cambridge: Cambridge University Press.

3 Working at the Interface of Hydrology and Corpus Linguistics Using Corpora to Identify Droughts in Nineteenth-Century Britain Tony McEnery, Helen Baker and Carmen Dayrell Introduction This chapter uses corpus linguistics in concert with a number of other methods to explore whether we can use a historical corpus to explore physical reality in the past. We have previously worked on projects looking at how corpus data may allow access to social attitudes in the past (McEnery  & Baker, 2017a; McEnery  & Baker, 2017b; McEnery  & Baker, 2018), but in this chapter we will look at how a corpus may allow us access to information about the natural environment by exploring the degree to which a corpus can tell us about extreme weather events in the past, in this case droughts and water shortages. Given the goal of our work, and the data available to us, triangulation is not a methodological option, it is a necessity. In order to use a variety of large corpora to explore past water shortages through corpus analysis, we have to accept that we need not only to overcome a series of issues with the data itself but to establish, in so far as we are able, some ground truth regarding climate conditions in the past. We did this by comparing our corpus findings with those provided by hydrologists who had undertaken close reading analyses of existing historic climate records. It is this qualitative work which provides the grounds on which our corpus-based analysis gains credence through triangulation. In the following sections we will review the data available to us and, based on the difficulty of using those data, move to explore the methods used in concert with corpus analysis in this chapter: i) close reading to establish data quality; ii) a review of the available records of droughts and water shortages undertaken by hydrologists; and iii) geoparsing. Following the introduction of these methods, we will explore how closely our corpus analysis matches with what is known about droughts in the nineteenth century before exploring some new droughts which may have been revealed through our analysis. First, however, we will consider the

main hypothesis explored in this chapter—that a corpus may allow us to access the physical reality of the past.

Text and Reality—a Problem How good a reflection of reality is text? Corpus linguistics holds out the promise of being able to reflect not only on language in use or discourse, but also to an extent on the society that produced those texts. While important in contemporary research (Baker et  al., 2008; McEnery, McGlashan, & Love, 2015; Brookes & Baker, 2017), the possibility becomes central in the context of historical research. Methods that may allow us to test the plausibility of what texts appear to show us about discourse, such as triangulating the results with those derived from other methods such as social attitude surveys (as done by Baker, Gabrielatos, & McEnery, 2013, for example) and focus groups (see Holmgreen, 2009, for example) are either only possible to a degree or are impossible in the past. For example, while in the early modern period we may be able to access official records to discover legal sanctions to be taken against beggars, we cannot assess everyday attitudes to the practice of begging except by gathering views mediated by texts—there are no members of the relevant societies left for us to gather opinions from and there were no public surveys of social attitudes undertaken then. This problem becomes all the more acute when we have a narrow range of text types available. In this chapter we look at the possibility of using textual material to explore the physical and social reality of droughts and water shortages in the nineteenth century. Our task is made more difficult by virtue of the fact that although we have access to a large volume of data for our study from a number of authors and publishers (over five billion words of newspaper data from the nineteenth century made available to us by the British Library—see the next section for details), our data is composed of only one text type, newspapers. Newspapers add to the challenge for our work because we know that news values, as much as reality itself, distort and skew what is presented in a newspaper. As Ratia, Palander-Collin, and Taavitsainen (2017, p. 5) put it: “Media suggests frames in which discourse should be understood, but these frames may be incongruous with reality or represent a partial reality only”. So while the promise of exploring historical corpora to reveal forgotten pasts is alluring, it also has the property of a Siren on the rocks— luring unwary analysts to dangerous, decontextualized, conclusions. As if the context outlined was not difficult enough, two further, substantial, problems present themselves for our study. First the data we are using has been produced by running optical character recognition (OCR) software across very old newspapers. This gives rise to a range of issues— poor quality ink or paper leading to blurred images or damage to the originals, for example through wear and tear, insect activity or chemical

54  Using Corpora to Identify Droughts processes.1 This leads to an output that has a degree of error that is often profound—it varies over time within the same newspaper, between newspapers at any given moment in time and, fairly self-evidently, between words of different length (the longer a word the more likely it is that an error will appear in it). So the skew with which the errors manifest themselves in the data are many. This produces a major problem that, if it cannot be controlled, reduces the research value of the data substantially. The second problem requires us to return to the topic of our study before we can explain it. In this work we have been looking to see to what extent newspaper reporting can allow us to access physical reality in the UK in the nineteenth century, specifically with regard to the degree to which the UK as a whole, or parts of the UK, experienced droughts and water shortages. This choice of topic in part mitigates the issue of editorial skew in the data—it is less easy to see the reporting of droughts as being something which would be subject to direct and deliberate manipulation of reporting, unlike, for example, more overtly political issues. Yet it does usher in one more issue that was the starting point for this research—the records in the UK for droughts and water shortage in the UK are very scarce to non-existent before the twentieth century. However, when building long term climate forecasts and exploring issues such as weather patterns in relation to global warming, being able to establish periods when notably less rain fell on the UK, or parts of the UK, is clearly important to meteorologists and hydrologists. To that end, we have been working as a joint team of corpus linguists and hydrologists, in a project linked to the UK water industry, to explore the extent to which we can use the newspapers of the nineteenth century to build a fuller picture of the distribution of droughts and water shortages, as well as the reported effects of those events, from 1800 to 1899.2 This very complex picture calls for a number of methods to be deployed in concert. Accordingly, in this chapter we explore the extent to which (i) corpus linguistics, principally concordancing; (ii) close reading for the purposes of evaluating the quality of OCR texts; (iii) historic hydrology; and (iv) geoparsing can triangulate to both control the multiple sources of error present in such a study and provide a better record of the reality of droughts and water shortage in the nineteenth century. We triangulate between three forms of close reading (i, ii and iii) and one form of computational pre-processing (iv). In our study, we had to deploy methods ii–iv before being able to proceed to the first method, as the following sections will show.

The British Library Nineteenth-century Newspaper Corpus (BLNCNC) The data used in this study has been made available to us by the British Library. The Library holds extensive collections of nineteenth-century

Using Corpora to Identify Droughts  55 newspapers and, as noted, has used optical character recognition in order to render a number of them in machine readable form. This process generates substantial issues, as noted in the previous section. Yet it also opens up very promising avenues of research—the ability to explore language in one genre across a century. The newspapers contain a range of titles, some regionally focussed, some with a more national focus. While not all years are available for all newspapers, either because of gaps in the data or because the newspaper did not publish throughout the century, the scale of data, as Table 3.1 shows, is substantial. With over five billion words of data for our study we should be confident that, if it is possible to explore text as a proxy for physical reality, the availability of sufficient data should not prove to be a problem. The scale of the data does mean, however, that most popular concordancing programs cannot cope with the BLNCNC. Fortunately, we were able to mount the data on CQPweb which was able to both store and allow us to analyse this volume of data (see Hardie, 2012). Yet scale is not the only problem that Table 3.1 makes clear—it also starts to show that compromise and caution should be watchwords for the use of such data. As can be seen, only one source, the Hampshire/ Portsmouth Telegraph, is available for nearly the whole century. While there is data for every year of the century in the data set, the amount Table 3.1 Newspapers Included in the British Library Nineteenth-century Newspaper Corpus*

Newspaper                          Dates of Publication in Nineteenth Century                    Extent
The Era                            1838–1899 (1848, 1849, 1850 missing)                          448,894,847 words in 3,096 texts
Glasgow Herald                     1820–1899 (1823, 1824, 1825, 1828–1843, 1872–1878 missing)    1,780,103,461 words in 13,034 texts
Hampshire/Portsmouth Telegraph     1800–1899 (1879 missing)                                      531,379,257 words in 6,197 texts
Ipswich Journal                    1800–1898 (1829, 1831, 1832, 1897, 1899 missing)              424,191,798 words in 7,000 texts
Northern Echo                      1870–1899 (1871, 1872, 1898 missing)                          386,590,121 words in 8,543 texts
Pall Mall Gazette                  1865–1899 (1872 missing)                                      468,324,154 words in 10,620 texts
Reynold’s Daily                    1850–1899 (none missing)                                      289,617,880 words in 2,635 texts
Western Mail                       1869–1899 (1872, 1896 missing)                                756,039,060 words in 8,718 texts
Total                                                                                            5,085,140,578 words in 59,843 texts

* A year is only noted as being missing when all twelve months of that year are absent.

56  Using Corpora to Identify Droughts varies from year to year and from title to title. Hence the deployment of such data requires caution. But this only begins to outline the problems inherent in the data and this study.

Close Reading and Experimentation The first way in which we will deploy close reading in this chapter is by intersecting the work of Joulain (2017) with our own work. In our work we rely heavily on the findings of Joulain as her painstaking close reading and experimental analysis of the effect of OCR errors on corpus analysis provide us with the warrant to conduct our analysis on noisy data.3 Joulain (2017) explored the BLNCNC to establish the level of error in the corpus by comparing a subset of the BLNCNC to a gold standard collection of data which had been hand corrected. This is the first point at which close reading intersects with our study—the gold standard was the Corpus of Nineteenth-century Newspaper English produced at the University of Uppsala by Erik Smitterberg4 who had read and transcribed approximately 320,000 words of nineteenth-century newspaper material when he allowed Joulain access to his data to conduct her study. Based on the use of 160,616 words of that data, which overlapped with part of the BLNCNC used here, Joulain could test the degree to which errors were present in the un-corrected British Library material. She could also see to what extent the collocation statistics that were run on the uncorrected data were subject to error because of the OCR errors in the BLNCNC. Joulain’s findings are wide ranging, but her key findings of relevance to our study were that, as suspected, “the spread and average quality of the OCR may be expected to vary considerably depending on the year and publication of the sample considered” (Joulain, 2017, p. 94) and that “OCR errors do have an impact on the wordcounts” (Joulain, 2017, p. 97). This alone suggests that a corpus analysis of such data should be heavily supported by close reading and that wholly automated results should be viewed with extreme caution. Yet is close reading a sufficient response to the level of problem encountered—can corpus techniques guide the analyst to a small enough relevant subset of the data within which manual analysis may feasibly occur? If it is not, then a corpus analysis of such data remains a distant dream until the data are substantially improved. Fortunately, Joulain provides insights which help with this question. First, the issue of OCR errors impacts most directly on terms which are infrequent—if a word occurs once or twice in a corpus then any degree of error may prove catastrophic in retrieving that word form at all (Joulain, 2017, p. 121). With more frequent types, although some examples of them may be lost, the production of false positives is typically very low. Second, most of the errors are random; yet the reproduction of a specific word

Using Corpora to Identify Droughts  57 form is deliberate—hence most errors give rise to low frequency types, while high frequency types are much more likely to be real wordforms. Using a frequency threshold of ten reduces the number of errors in the types extracted from her corpus to 7% from 56% when no frequency floor was used (Joulain, 2017, p. 89). The experimental work of Joulain represents an important contribution to the study presented here—using experimental evidence she provides us with parameters which allow us to conduct a corpus-based analysis of drought and water shortage in our data. The proxy for intensity of drought in our analysis is frequency of mention. Our hypothesis was that as droughts intensified or became apparent then they would be mentioned more, i.e. droughts are newsworthy and increased intensity would enhance this newsworthiness and hence mention of the event. Thus we expressly do not wish to rely on low frequency mentions of drought, meaning that Joulain’s frequency threshold was not an impediment to our study; it is consonant with something that we need to do anyway, i.e. focus on high frequency mentions.
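The two points above translate directly into a simple filtering and normalization step. The sketch below is illustrative only, and all of its figures (type frequencies, yearly hit counts and corpus sizes) are invented: word types below a frequency floor of ten are discarded as likely OCR noise, and yearly mentions of drought are normalized per million words so that years with unusually heavy reporting stand out.

```python
# Illustrative filtering and normalization (all figures invented).
FREQUENCY_FLOOR = 10

def filter_ocr_noise(type_frequencies):
    """Discard word types below the frequency floor, where OCR misreadings cluster."""
    return {w: f for w, f in type_frequencies.items() if f >= FREQUENCY_FLOOR}

def per_million(hits_by_year, words_by_year):
    """Frequency of a search term per million words, year by year."""
    return {year: 1_000_000 * hits_by_year[year] / words_by_year[year]
            for year in hits_by_year}

type_frequencies = {"drought": 4213, "drouth": 58, "dronght": 7}   # 'dronght': likely OCR error
print(filter_ocr_noise(type_frequencies))

hits = {1843: 120, 1844: 310, 1845: 295}
corpus_size = {1843: 39_000_000, 1844: 38_000_000, 1845: 41_000_000}
print(per_million(hits, corpus_size))
```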

Close Reading and Hydrology Joulain’s study helps to clear away a series of issues with the data—to that extent mixing a close reading with a corpus analysis is clearly helpful; indeed for us it was essential. But a substantial problem remains; if the newspapers we are studying seem to reveal evidence of a drought or water shortage, how can we have confidence that this is objectively true? Normally within environmental science widespread measurement of rainfall and strict criteria for determining when a period of drought is being experienced allow scientists to be confident that when they declare that a drought is occurring it is in the sense that, barring some problem, with instrumentation for example, another scientist looking at that data using established criteria for what constitutes a drought, would agree. The established scientific definition of drought goes beyond everyday understandings of the word in that drought covers some deficit in one or more components of the hydrological cycle—a drought is more than just no rain for a long time. It covers other components of the hydrological system such as river flows and ground water levels. These deficits also need to be enduring and apparent over a given area for it to be declared as subject to drought.5 Ideally our study would be able to draw on hydrological evidence collected systematically in the nineteenth century to test the effectiveness of our corpus-based technique for discovering droughts. However, nineteenth-century climate records are few and fragmentary. For some elements of drought research, notably rainfall, the records are better, but for other major components of the hydrological system, notably river flow and groundwater levels, extensive records only run back to the mid-twentieth century. In a major study of the available historical

58  Using Corpora to Identify Droughts records from 1800 onwards, Marsh, Cole, and Wilby (2007, p. 88) sound a note of caution warning that: It should be noted that significant uncertainties—and potential bias in selection—are associated with the assessment of the magnitude of the earliest drought episodes, compared with more recent events. These arise from the scarcity of observational data (particularly prior to the 1850s), coupled with potential inaccuracies in precipitation measurement, especially during the winter when snowfall may be underestimated. . ., and the variable quality and limited spatial coverage of some of the documentary material. So as well as having an imperfect corpus, we have an imperfect ground truth that we can appeal to in order to check the usefulness of our approach to studying the correspondence of text to reality in this case. The work of Cole and Marsh (2006) and Marsh et al. (2007), however, provide another methodological point of triangulation for our study. Marsh et al., based on the earlier work of Cole and Marsh, bring together extant records in order to undertake what is currently probably the best available modelling of nineteenth-century droughts in Britain from available weather records that can be produced. This work was able to identify a number of droughts which had, in all likelihood, occurred in the nineteenth century. Yet because of the nature of the data, Marsh et al. could not be sure that they had identified all droughts, nor could they be sure of the severity or duration of the droughts that they identified. While based on some quantitative data gathered at the time as well as written descriptions of the droughts, the analysts still had to engage in a somewhat subjective analysis—they used the available evidence plus their expertise as hydrologists to interpret that evidence in order to identify probable droughts and estimate their severity (Marsh et al., 2007, p. 88): Assessments of drought severity were therefore not based on a single hydrometeorological index. Instead, hydrological judgement was used to bring together the various strands of evidence so that each drought event could be viewed as a whole. Such an approach does not readily allow quantitative thresholds to be established to distinguish categories of drought severity; correspondingly the distinction between ‘major’ and other severe events may not always be clear-cut. Some future re-designation of events may be required as further evidence is assembled. This provides both the motivation for the study presented here and the best available baseline against which to judge the success or failure of a corpus-based approach to this question. The scientific records which would permit a study of drought in the nineteenth century have been

Table 3.2 Droughts in the Nineteenth Century Based on Hydro-meteorological Evidence

Period: 1802–1803; 1806–1807; 1814–1815; 1826–1827; 1844–1845; 1854–1860; 1863–1865; 1867–1868; 1869–1871; 1873–1875; 1884–1885; 1887–1888; 1890–1909 (see note 6)

Focus: national; ?; northern England?; national; national; western Britain; all regions?; all regions; English lowlands; northern and southern England; west and north; southern Britain?

explored and the best analysis that can be provided is in all likelihood partial, and by the authors’ own admission, provisional. However, the analysis is still good enough that it can be used to judge any corpus analysis. Table 3.2 shows the findings of Cole and Marsh (2006, p. 15), summarising the location and duration of drought episodes in the nineteenth century for England and Wales. Data from Table  3.2 provides evidence that can be used to assess a corpus-based study of droughts in the UK. While the regional view is limited and, as indicated by question marks in the table, location is sometimes uncertain, there are sufficient points in the table that are firm to the extent that a corpus analysis could be judged. This then is the second valuable contribution that another research method makes to this chapter. The subjective analyses of hydrologists, based on available evidence, provides crucial points in the century when we can check to see whether our analysis of the newspaper material corresponds with theirs. If it does, the credibility of the analysis is enhanced as is the existence of any other droughts we may identify that are not shown in Table  3.2. On the other hand, if we produce a set of results which have no point of connection with Table 3.2, then the plausibility of the analysis we undertake must be cast in doubt.
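In effect, this check reduces to comparing two sets of years. The sketch below is a minimal Python illustration of the comparison we have in mind, using a subset of the periods from Table 3.2 and invented peak years for a hypothetical newspaper:

```python
# A subset of the drought periods reported in Table 3.2, expanded into years.
baseline_periods = [(1844, 1845), (1863, 1865), (1867, 1868),
                    (1869, 1871), (1884, 1885), (1890, 1909)]
baseline_years = {year
                  for start, end in baseline_periods
                  for year in range(start, end + 1)}

# Peak years for one (hypothetical) newspaper, identified from frequency data.
corpus_peaks = {1844, 1852, 1868, 1870, 1893}

confirmed = corpus_peaks & baseline_years   # corroborated by the hydrologists
candidates = corpus_peaks - baseline_years  # worth close reading as possible
                                            # undocumented droughts
print(sorted(confirmed))   # [1844, 1868, 1870, 1893]
print(sorted(candidates))  # [1852]
```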

A Final Helping Hand—Focussing on the UK

Our goal was to identify drought years in the UK by counting frequencies in British newspapers.7 Unfortunately for our study, the newspapers in question, while they mention droughts and water shortages frequently,

60  Using Corpora to Identify Droughts do not limit themselves to talking about droughts in the UK. However, we wished to focus upon droughts which occurred within Britain, so it was necessary to exclude those newspaper articles which referred to droughts outside Britain. Manually sorting the data to exclude foreign references to drought would have been a mammoth undertaking—there are 4,623 mentions of drought in the Glasgow Herald alone. Fortunately, we were able to substantially improve the selection process, both in terms of accuracy and speed, by using another method, concordance geoparsing, with our dataset (see Gregory & Hardie, 2011; Rupp et al., 2015). Concordance geoparsing is an innovative approach to textual analysis which combines corpus linguistics with GIS (Geographic Information Systems) methods. The technique works in a number of stages which can be summarised as follows: (i) using corpus linguistics software to extract each occurrence of drought and a span of twenty words to the left and right of it; (ii) geoparsing these concordances to identify instances of place-names, for which geo-spatial co-ordinates are found; (iii) manually checking results in order to remove errors and include UK place-names which were spelt incorrectly due to OCR errors; (iv) sorting the placenames in terms of longitude and latitude, and setting aside the placenames which belonged outside of the UK; and (v) applying GIS software to the UK co-ordinates for mapping.8 As a result of geoparsing, we were able to set aside between 3.52% and 21% of the examples we might otherwise have looked at because they were not UK focussed, as shown in Table 3.3. Given the scale of the data we were looking at, being able to set aside data in this way was very helpful. It is also a useful indication of (i) the relative newsworthiness of drought in the newspapers and/or (ii) the degree to which the paper had a national or international focus. It is notable from Table  3.3 that the

Table 3.3 The Degree to Which Geoparsing Thinned Examples of drought From Each Newspaper

Newspaper                          Percentage of Examples Discarded by the Geoparser
The Era                            10.47
Glasgow Herald                     16.19
Hampshire/Portsmouth Telegraph      8.38
Ipswich Journal                     6.65
Northern Echo                       6.89
Pall Mall Gazette                  21.0
Reynold's Daily                     9.75
Western Mail                        3.52

Pall Mall Gazette and to a lesser extent the Glasgow Herald are outliers in this respect. We then turned our attention back to the place-names which were UK specific to identify periods in which drought was most often talked about in the newspapers under consideration. We worked with our data in year-long chunks as we wished to emulate the duration results of Cole and Marsh (see Table 3.2).
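The pipeline described above relies on dedicated geoparsing and GIS software, together with manual checking; the Python fragment below is therefore only a schematic stand-in for stages (i), (ii), and (iv), using a toy gazetteer and a crude bounding box in place of full geo-spatial resolution:

```python
import re

# Rough bounding box for the UK (latitude, longitude); an approximation used
# here purely for illustration.
UK_LAT = (49.8, 60.9)
UK_LON = (-8.7, 1.8)

# Toy gazetteer: place-name -> (latitude, longitude). The real pipeline resolved
# place-names against a full gazetteer and corrected OCR-damaged spellings by hand.
GAZETTEER = {"Greenock": (55.95, -4.77), "Cardiff": (51.48, -3.18),
             "Queensland": (-22.0, 144.0)}

def concordances(text, node=r"droughts?", span=20):
    """Yield a window of `span` words either side of each hit for `node`."""
    words = text.split()
    for i, word in enumerate(words):
        if re.fullmatch(node, word.strip(".,;:'\""), flags=re.IGNORECASE):
            yield " ".join(words[max(0, i - span): i + span + 1])

def is_uk(lat, lon):
    return UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]

def uk_focused(concordance_line):
    """Keep a concordance line only if it names at least one UK location."""
    hits = [coords for place, coords in GAZETTEER.items() if place in concordance_line]
    return any(is_uk(lat, lon) for lat, lon in hits)

line = next(concordances("the long drought in Greenock has dried up Loch Thom"))
print(uk_focused(line))  # True
```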

Hunting Through Time for Droughts

Identifying Peak Years

With an understanding of how to use our data, appropriate thinning based on geoparsing, and the best available record of nineteenth-century droughts, we may now turn to explore a simple hypothesis: droughts were newsworthy to the extent that, where they occurred, they were, at least sometimes, mentioned in the newspapers. While simple, this hypothesis entails droughts being considered newsworthy in the century and, crucially, increasing in newsworthiness as they become more intense and/or endure for longer periods of time. So frequency of mention here, if it is in fact a proxy for the existence of a drought, is also a measure of newsworthiness and news values. Figure 3.1 shows what this simple approach to the problem reveals.

Figure 3.1 shows how many times drought was mentioned in eight British newspapers over the course of the nineteenth century, using the

Figure 3.1 Mentions of drought in UK Newspapers, 1800–1899, Excluding Non-UK Specific References (frequencies per million words, plotted by year)

62  Using Corpora to Identify Droughts categorised query function of CQPweb to separate and then exclude non-UK specific references. The data have been normalised in order to compare frequencies per million words. The graph provides an indication of peak periods where newspapers tended to be more interested in discussing drought. We were tempted at this point to apply statistical measures to the data in order to search for ‘real’ peaks in the data and to take a quasi-quantitative approach to identifying droughts. There was no technical problem in doing so—we had the frequencies we needed to run a process such as that used by Gabrielatos, McEnery, Diggle, and Baker (2012) to identify peaks and troughs of mention in the data, for example. However, once again Joulain guided our analysis. While we might use the data we had as positive evidence of mention, we could not necessarily rely on it to carry out reliable statistical testing because, as mentioned already, Joulain (2017, p. 97) had found a substantial variation in OCR accuracy within and between years. Importantly, Joulain has demonstrated significant variation in OCR quality in 1838 and 1839 in her dataset. Two of the peaks of mention of drought in our data for The Era were in exactly those years (see Table 3.4). If we were to undertake a statistical analysis of such data it might be hard to refute the claim that what we were exploring was a significant difference between the quality of OCR data over time, rather than mentions of droughts and water shortages at all. Our response to this was to reach for one of the longest established, and in some ways simplest, tools in the corpus linguist’s repertoire of methods—concordance analysis. By using the frequency data to downsample to specific years, we could use concordance analysis to swiftly read through the data and establish the ground truth that there was a good discussion of a drought at that point in the data. However, we needed to control the down-sampling. In the absence of safe grounds for using relatively sophisticated statistics for that purpose, we decided to simply focus on the rank of normalized frequency of mentions in each newspaper, limiting our analysis to the top ten years in which mentions peaked for that title. If our hypothesis that drought and newsworthiness aligned was right, we should see in that set a good number of the years identified by Cole and Marsh. If those years figure in the analysis, then we have some grounds to explore other years in the data not reported by Cole and Marsh. Table 3.4 shows the ten years (in descending order) in which drought most frequently appeared in each of the eight newspapers in the BLNCNC. The figure in brackets represents instances per million words. How effective is this simple method in identifying periods of British drought? The first point of note is the commonality of the results. The years which appear most frequently are 1870 (highlighted by 7 newspapers); 1893 (in 6 newspapers); 1868 (in 5 newspapers); and 1874, 1887, 1895, 1896, and 1898 (in 4 newspapers)—all of these years are

Table 3.4 Top Ten Ranking Years When drought Appears Most Frequently (Normalised to per Million Words) in Nineteenth-century Newspapers. Years in Bold Are Those Not Previously Documented as Years of Drought. Years are listed in descending rank order (1–10) for each title.

The Era: 1844 (4.1), 1839 (2.55), 1852 (2.27), 1868 (2.06), 1870 (1.93), 1860 (1.86), 1838 (1.5), 1845 (1.33), 1863 (1.19), 1864 (1.18)
Glasgow Herald: 1826 (12.64), 1893 (5.06), 1844 (4.61), 1821 (4.4), 1846 (4.38), 1859 (4.18), 1850 (4.04), 1895 (3.63), 1889 (3.62) (see note 9), 1822 (3.53)
Hampshire/Portsmouth Telegraph: 1868 (9.62), 1826 (6.28), 1852 (5.6), 1829 (5.52), 1825 (5.48), 1870 (4.89), 1893 (4.34), 1854 (3.92), 1864 (3.79), 1844 (3.69)
Ipswich Journal: 1893 (15.85), 1874 (13.32), 1870 (9.75), 1868 (9.6), 1884 (8.41), 1887 (7.38), 1896 (7.14), 1895 (7.11), 1898 (6.67), 1834 (6.49)
Northern Echo: 1893 (5.87), 1874 (4.19), 1875 (3.71), 1887 (3.45), 1870 (3.06), 1873 (3.0), 1899 (1.78), 1878 (1.51), 1896 (1.46), 1881 (1.44)
Pall Mall Gazette: 1898 (8.31), 1887 (5.42), 1896 (4.74), 1899 (4.29), 1895 (4.09), 1870 (4.06), 1868 (4.0), 1877 (3.71), 1878 (3.62), 1874 (3.38)
Reynold's Daily: 1893 (4.55), 1870 (2.98), 1896 (2.8), 1883 (2.19), 1898 (1.99), 1868 (1.97), 1851 (1.86), 1884 (1.85), 1880 (1.73), 1881 (1.65)
Western Mail: 1887 (5.68), 1893 (5.24), 1870 (4.0), 1899 (3.42), 1874 (3.15), 1869 (2.44), 1895 (2.32), 1871 (2.06), 1898 (1.84), 1880 (1.47)

periods of reported dry spells as confirmed by Cole and Marsh (2006). For instance, they report the United Kingdom as experiencing droughts of long duration in 1869–70 and 1890–1909, with 1893 being the most intense year of drought within the latter period. Indeed, 26% of the years on this table belong to the 1890s, with 1893 taking the top-ranking place in the Northern Echo, the Ipswich Journal, and Reynold's Daily. Other years which have multiple entries in Table 3.4 are 1826; 1844; 1852; 1864; 1878; 1880; 1881; and 1884. Almost all of these years have also been identified by Cole and Marsh (2006) as being periods of low rainfall. They note a drought between the spring of 1826 and the autumn of 1827 in southern Britain. Similarly, between the spring of 1844 and the spring of 1845, rainfall was very low in the east and west, and a long drought was recorded in the north-west. A drought from late 1863 to autumn 1865, particularly severe in 1864, affected western Britain. Lastly, from early 1884 to late 1885 a drought was recorded in northern and southern England (see Waddington, 2016, p. 7).
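The normalisation and ranking that lie behind Figure 3.1 and Table 3.4 can be sketched in a few lines. The Python below is an illustration of the calculation with invented figures, not the CQPweb workflow actually used for the project:

```python
def peak_years(mentions_by_year, words_by_year, top_n=10):
    """Rank years for one newspaper by UK-focused mentions of drought
    per million words of that title."""
    per_million = {
        year: hits / words_by_year[year] * 1_000_000
        for year, hits in mentions_by_year.items()
        if words_by_year.get(year)          # skip years with no text for the title
    }
    return sorted(per_million.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Invented figures purely for illustration, not the corpus counts themselves.
mentions = {1893: 30, 1894: 6, 1895: 18}
words = {1893: 2_000_000, 1894: 2_400_000, 1895: 3_000_000}
print(peak_years(mentions, words, top_n=2))  # [(1893, 15.0), (1895, 6.0)]
```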

64  Using Corpora to Identify Droughts Clearly, then, we have found evidence that newspapers talk of drought more frequently during periods of attested drought; this is a hypothesis which shows both the news values of the time (droughts were newsworthy) and the consequent utility of the newspaper data for exploring such issues as extreme weather events. This first exploration of the data, using one of the simplest methods in corpus linguistics, word frequency, is rewarding—frequency seems to be a good guide to drought. Drought clearly was aligned with the news values of the nineteenth-century press. Yet this simple method only worked so well because we used geoparsing to focus our data on the UK and we explored the BLNCNC using the frequency threshold setting recommended by Joulain (2017). Using Joulain’s insights helped provide an indication of when droughts occurred which intersect with those reported by Cole and Marsh (2006) and gave us critical insights into potential pitfalls with the analysis of the data. Similarly, our triangulation is successful in permitting a use of the data, derived from Cole and Marsh, that is rooted in what is known about drought in the period. However, we can go further than this—some of the droughts indicated by Table 3.4 were not included in the chart based on hydro-meteorological evidence compiled by Cole and Marsh. This is where we should exercise caution once more—a possibility to be explored is that OCR errors are causing peaks of mention that are entirely illusory. However, if these peaks are not illusory then we may pursue the apparent droughts and, by use of a further method, close reading via concordancing, identify where these droughts occurred and what their nature was. In what follows we found very few examples of OCR errors contributing to the peaks of frequency observed—on the contrary, we found much that suggests that by carefully orchestrating the methods we have used together, we have revealed hitherto undiscovered evidence of historic droughts. The Previously Unrecorded Droughts of 1846 and 1850–52 The year 1852 achieves two entries on Table  3.4 while the preceding years, 1846, 1850, and 1851, achieve one entry; none are highlighted by Cole and Marsh (2006) as a year of low rainfall. We explored concordances containing the search term in the four relevant newspapers, Reynold’s Daily, the Hampshire/Portsmouth Telegraph, the Glasgow Herald, and The Era, for the years in question, in order to discover in what contexts references to drought appeared. There is little evidence that 1846 was a year of drought. The year achieves the fifth peak year place in the table using data from the Glasgow Herald but there is only one relevant mention of substance in this newspaper, on 14 August, to “the drought of May and June”; the report proceeded to argue that alarm was unjustified.10 Other newspapers provide patchy supporting evidence for a drought in 1846. The Era

Using Corpora to Identify Droughts  65 carried an advertisement from a company manufacturing paletots, a semi-fitted, double-breasted jacket: “A  burning sun, and the drought occasioned by the dust inhaled on the road, united in causing the heat to be universally pronounced as unusually oppressive, both at the [heath] of Ascot and the downs of Epsom”. The advertisement assured gentlemen that if they purchased paletots, they would remain cool in the hot weather and hence be an object of envy.11 A weather report of the Ipswich Journal noted that the past month of July was not quite so dry and settled as June, but still we suffer from drought, save a few spots where deluging thunderstorms occurred; and the heat has even exceeded that of June at times, to this date.12 An agricultural report from Winchester in the Hampshire/Portsmouth Telegraph noted that barley and oats had not recovered from the “effects of the drought and extreme heat of May and June”.13 Reports of Scottish drought in the year 1850 are more substantial— parts of the country were affected in April and May while the whole country experienced drought in June. On 5 April, it was noted in a report of a fire in Colessie that a scarcity of water and the dryness of the land “from the long protracted drought” caused the flames to spread rapidly.14 In May, the Glasgow Herald mentioned the recent presence of unseasonable snow and “the long drouth” which had caused hill pasture to be unprecedently bare: “there being less grass now than at the commencement of the lambing season, even on ground that has not been fed upon”.15 By the summer, the weather was causing alarm in Scotland: The continued drought is telling severely on pastures and second crop clover grass, the former may be said to be burned up, and the latter the most worthless that has occurred for many years; and in many instances late sown turnips have entirely failed.16 Lanark and Renfrew were mentioned as being particularly affected by long droughts in April, May, and June.17 In September, a sombre report about the body of an infant which was discovered in a pond at Graham Square Mills in Glasgow noted that the child probably would never have been found but for the recent drought drying up the water.18 Toward the end of the same month, the paper updated readers on the progress of the harvest, noting that “green crop and pasture lands are suffering slightly from continued drought”. In Perthshire, the weather was described as being very dry: “dust is blowing about on every highway, the pastures are burnt up, and the brooks are nearly dry”; however, the report reasoned that the Tay was at a similarly low level during the previous year so the current dry spell is “nothing very remarkable”.19

66  Using Corpora to Identify Droughts The dry weather extended to other parts of the country later in the year—a report of 9 August, 1850 mentioned a protracted drought in June in Norfolk and an article of 16 September re-iterated that “an unusual drought prevailed over the country” in the month of July, which was attended by a depreciation in the water quality of the Ness.20 In September, the Glasgow Herald’s nature columnist observed that, “The dryness, for weeks, has almost equalled that which afflicted Italy in the 322d [sic] year after the building of Rome, and we have had dust more than enough to ransom a heptarchy of kings”.21 Concordances for the following year, 1851, initially give little indication that any part of the country was suffering from drought until the publication in December of an article in the Hampshire/Portsmouth Telegraph which evaluated the farming outcomes of an Essex landowner, John Joseph Mechi. It is Mechi who provides us with a first indication that southern Britain may have experienced drought in 1851 as he noted that his potatoes yielded only half a crop, “having been injured by drought”.22 By January 1852, the same newspaper carried more assertive reports of an unusual winter drought stretching from Scotland to southern Britain: Complaints are made of the dryness of the season. Even in the Highlands the general complaint is of drought; an exceedingly rare thing at this period of the year. The people of Cromarty are on short allowance from their water-works; while farmers cannot get water to work their steam-engines, and mill-streams are frequently almost dry. At the other extremity of the island—Hampshire—there are similar complaints. In the New Forest, it is necessary to drive the cattle a considerable distance for water.23 On 17 April, the Hampshire/Portsmouth Telegraph noted that the drought had caused little uneasiness but could result in problems for the spring sown crops should it continue. It added that “green food of every description had become scarce”.24 The Era referenced a long drought in two articles about horse racing in the same month.25 By May, the situation looked more serious, with fewer than two inches of rain having been reported in the previous three months: “nearly all the ponds, ditches, and small wells in this locality have been quite dry for some weeks past”.26 Two other articles of the same day also suggested that drought impacts were becoming widespread. The first, a short article citing the Bristol Gazette, reported that areas where sheep could graze (“sheep keep”) was becoming very short on the Down Farms meaning farmers were forced to pay high prices for hay for their stock.27 The second article, also fairly short in length, told readers that “a drought of extraordinary continuance” was causing great inconvenience to the inhabitants of the suburban

Using Corpora to Identify Droughts  67 Manchester township, with some families struggling to procure sufficient water to meet basic requirements. The article added that the waterworks committee had been forced to purchase 200 million gallons of water from the Manchester, Sheffield, and Lincolnshire Railway Company.28 Toward the end of May, the Hampshire/Portsmouth Telegraph was reporting rain and recovering vegetation and The Era described the relief of housekeepers in Manchester.29 Evidence which indicates the UK was experiencing drought in 1846 exists but is not robust. However, indications that parts of the country were affected by a drought in the period 1850–52 are much stronger and we have been able to trace the development of this drought from Scotland in 1850, to southern England in 1851, and on to northern England in 1852. Moreover, we have discovered that the drought of the period 1851–52 was a rare winter-spring event. Figures 3.2–3.4 show the distribution of mentions of place-names in three geographic areas of land—countries (Figure  3.2), counties (Figure 3.3), and towns and cities (Figure 3.4)30 that co-occur with drought, across the period 1846–1852, in the newspapers we analysed. The Previously Unrecorded Droughts of 1878, 1880, and 1881 Three other years, 1878, 1880 and 1881, achieve multiple entries on Table  3.4 but are not highlighted by Cole and Marsh (2006) as years of drought. We used the same method as before, exploring the concordances containing the search term in the newspapers with the aim of uncovering a narrative of drought. The UK does not appear to have experienced a drought in 1878. The relevant concordances in the Northern Echo for this year are littered with OCR errors where draught is replaced with drought.31 However, in both the Northern Echo and the Pall Mall Gazette, genuine references to drought refer to specific drought events in foreign countries that have slipped past the geoparsing procedure. The geoparser is not able to remove all references to drought outside of the UK, for instance, in matches where a discussion relates to a place but its name is not mentioned within 20 words of our search term or, less commonly, when a place-name is present but it is not recognised as such by the geoparser. Place-names are also sometimes affected by OCR errors, such as “Cape Too-n” and “QUEE N SLAND” which appear in concordances from the Northern Echo in 1878.32 Evidence for drought in 1880 is more substantial. The first indication that Britain had been experiencing droughty conditions came on 25 May 1880 in the Western Mail: “the weather of the past week has, though favourable for actual farm work, been decidedly bad for vegetation, which has been at a standstill, owing to the drought”.33 On the last

Figure 3.2 drought in UK Countries, as Mentioned in Nineteenth-century British Newspapers, 1846–1852. The shade gets progressively darker as the number of mentions increases.

Figure 3.3 drought in UK Counties, as Mentioned in Nineteenth-century British Newspapers, 1846–1852. The shade gets progressively darker as the number of mentions increases. Circles indicate historic counties.

Figure 3.4 drought in UK Towns and Cities, as Mentioned in Nineteenth-century British Newspapers, 1846–1852. Both size and colour represent number of mentions.

Using Corpora to Identify Droughts  71 day of the month, a letter entitled “Insufficiency of water at Cardiff” commented that although a drought had occurred, it did not justify the water shortages that the area was currently experiencing: The past drought has certainly not been of sufficient duration to warrant our unhappy condition, for to come home weary and hungry at night and be told that one can have no dinner as the kitchen fire has to be put out is enough to make most of us the reverse of happy.34 Discussions about the inadequacy of the water supply and a reference to “the late drought” also appeared in the paper on the 17 September.35 In late spring and early August, agricultural reports noted that oats had suffered considerably from the drought.36 The Pall Mall Gazette, in a report of 10 September, also noted that local inhabitants of Cardiff were being called upon to economise their use of water and that the “state of the reservoir affords much anxiety to members of the waterworks committee and the town generally”.37 Reynold’s Daily offers less information but does provide this corroborating evidence from a gardening article of 12 September 1800: “Having regard to the heat and drought which has prevailed, the water-pot should freely be used, as without that, flowers will be but short-lived”.38 The other newspapers we can access which cover 1880 and are able to offer information on the weather of that year are the Glasgow Herald, the Hampshire/Portsmouth Telegraph, and the Northern Echo. Unlike the Western Mail, they tend to discuss drought in relation to Scotland. The Northern Echo offers only one piece of corroborating evidence, in early August, in an article about grouse in Scotland. In the historic county of Ross-shire, the birds were described as being healthy but not numerous due to a drought of May and June whilst the Badenoch and Lochabar districts were reported as having experienced particularly dry weather.39 This report mirrors a very similar article which appeared a few days earlier in the Pall Mall Gazette: “The dry weather in the middle of June had a somewhat damaging effect on the condition of the young birds in Strathspey, but there being an entire freedom from disease, the ill effect of the drought has been got over”.40 In early September, a short report in the Pall Mall Gazette mentioned a dearth of water in the Clyde district, with Greenock having only 36 days of supply in store and the stock of water in Dumbarton estimated to last only two more weeks.41 A  similar report, on 1 September  1880 in the Glasgow Herald, discussed the water supply being threatened. Greenock was mentioned by name but, unfortunately, due to the deterioration of the original issue which was digitised, the extent of the dry weather is unclear.42 However, another article of 25 October described how, owing to the prolonged drought, Loch Thom had dried up and most of the public works in Greenock were predicted to cease operations the following

72  Using Corpora to Identify Droughts day: this was “the most complete water failure which has ever occurred in Greenock”.43 We can thus draw a conclusion that parts of Britain, Scotland, and perhaps Wales in particular, experienced drought in 1880. Figures 3.5–3.7 show the distribution of mentions of place-names in three geographic areas of land which co-occur with drought in newspapers printed in 1880. The following year, 1881, also appears twice in Table 3.4 as a result of data gathered from the Northern Echo and Reynold’s Daily. However, as with 1878, matches for the Northern Echo relate to draught horses or drought outside of the UK. Although most mentions in Reynold’s Daily are very general, a gardening column did note that strawberries were not abundant in the year due to the drought.44 Most of the other newspapers have little to offer in support of dry weather in 1881. The Pall Mall Gazette did carry a report that the crop of beans and root vegetables had “suffered from the extreme drought” in mid-August.45 However, there is evidence that south-east England experienced drought from April to July of 1881. The Ipswich Journal, in an article about fox hunting in Essex and Suffolk, reported in early April: Only a short time ago we were complaining of so much downfall and scarcely any sun. Now that a drought has set in, accompanied by searching winds, the country has become hard and dry as tinder, thus destroying all scenting properties.46 By the end of May, the drought was referred to as having negatively affected the growth of oats.47 This was followed, at the end of July, by references to “the recent droughty weather” impeding the progress of clover seed and to potatoes doing well despite “the recent drought”.48 It appears that this dry spell was broken in August as a report notes that the weather turned wet after 8 August.49 This dry spell does not appear to have made any real impact upon the rest of the country. The Previously Unrecorded Drought of 1821–1822 What about the years which only appear once on Table 3.4 but are also not documented as periods of drought? Although these years only achieve a top ten position using data from a single newspaper, they should not be set aside. After all, a regional newspaper might make much of a drought which only occurred in the area on which it focussed its reporting. From Table 3.4, we can see that four of these years were identified using data from the Glasgow Herald—1821, 1822, 1846, and 1850. It is possible that the relatively high presence of these ‘rogue’ years in the only Scottish newspaper in our dataset can be partly explained by differences in weather patterns between Scotland and England.

Figure 3.5 drought in UK Countries, as Mentioned in Nineteenth-century British Newspapers, 1880. The shade gets progressively darker as the number of mentions increases.

Figure 3.6 drought in UK Counties, as Mentioned in Nineteenth-century British Newspapers, 1880. The shade gets progressively darker as the number of mentions increases. Circles with crosses indicate present-day sub-regions.

Figure 3.7 drought in UK Towns and Cities, as Mentioned in Nineteenth-century British Newspapers, 1880. Both size and shade represent number of mentions.

76  Using Corpora to Identify Droughts Let us look in closer detail at the concordances of 1821–22 in the Glasgow Herald. In July  1821, the newspaper referenced “the long drought” which had lowered water levels and caused many springs to fail: “above a dozen of steam-vessels going up and down the Clyde, grounded at one time at the Point-house, about two miles below this city”.50 At the beginning of the following month, an agricultural article reported that “the present protracted drought [has] very materially injured the hay crop” and may have delayed the harvest.51 More specific information is provided in an edition of 27 July, in which readers were told that a drought “of nearly nine weeks continuance, and almost without a parallel” has been broken by showers.52 Crops were reported as being negatively affected by drought on 13 August and 10 September.53 We can thus safely conclude that Scotland does appear to have experienced a dry spell in 1821. At this point, it is wise to inspect the other regional newspapers we can access through CQPweb for the year 1821—the Ipswich Journal and Hampshire/Portsmouth Telegraph—to see if they also acknowledge this dry spell in Scotland. Neither contains information which suggests that any part of the country was experiencing drought. It is possible that the dry spell of 1821 made little impact in newspapers other than the Glasgow Herald because it was restricted to the far north of the country and had no consequential effects worth reporting for the rest of the country. The picture changes somewhat in 1822 when evidence suggests that this dry weather in Scotland persisted. On 7 June 1822, the weather is described as becoming problematic: “Yesterday the thermometer stood at 76 in the shade, and to-day it promises to be still higher. By and by we shall be in want of rain; but hitherto the drought has made no visible impression on the crops excepting in the lighter description of soils”.54 By 14 June, the “scorching heat and drought” were reported as having injured the corn crop and, to an even greater extent, the grass crops but the arrival of rain had relieved apprehensions.55 However, the dry weather persisted into the summer—on 5 July, the Glasgow Herald referenced continuing drought and in mid-August “the great drought of both last summer and the present”.56 On 6 September, readers were informed that “Swedish turnips are a failing crop, destroyed almost entirely by the drought and fly”.57 The other newspapers we can access through CQPweb for this period are slightly more helpful for 1822 than they were for the previous year. Although the Ipswich Journal offers no supporting evidence for dry weather, the Hampshire/Portsmouth Telegraph noted in July 1822 that a thunderstorm had broken two months of drought.58 The dry weather was reported as prevailing throughout September.59 It appears then, that a dry spell of 1821 which was isolated to Scotland was affecting other parts of the country by 1822.

Using Corpora to Identify Droughts  77 The Previously Unrecorded Drought of 1825 Let us now explore the drought concordances from the Hampshire/­ Portsmouth Telegraph in 1825—was this also a year of problematic weather? In this year, the initial suggestion of a dry spell arrived fairly early, in April, with reference to a “penetrating drought”.60 A report of early May provided more detail: “Within the last few days the country has been refreshed with mild and gentle rains, after the almost unprecedented drought of more than six weeks”.61 July was described as scorching, following a drought continued with little intermission through several months . . . the water meadow and other late growths of hay have been collected in unexampled fine order; the bulk, perhaps, less than usual, but the quality superior. But, on the other hand, the pastures and upland meadows are dried up, summer tares are withered and unproductive, and the second growth of clover is appropriated, where it exists, to the feeding of cattle.62 We examined the Ipswich Journal for corroborating reports of a British drought in 1825. It contains frequent references to drought but offers sparse detail on its nature. A report about the Woodbridge market in Suffolk referenced the “long drought” of the year.63 The following month, the lambs were described as being in fine condition, despite the “long drought”.64 An agricultural report of September 1825 noted that potatoes had been injured by the drought and that the price of store and cattle and sheep had been depressed by the drought.65 The turnip crop, meanwhile, survived the drought.66 We can conclude, therefore, that parts of England experienced droughty weather in 1825. The Previously Unrecorded Drought of 1834 Reports of drought in 1834, a year highlighted by the Ipswich Journal, were also mostly restricted to the south of England. On 31 May, 1834, the Journal reported that “Bury Market, on Wednesday, had more life in the corn trade than for the last three months, and all sorts of corn dearer than last week, owing to the long drought and 4 cold nights”.67 At the start of the summer, the paper reported the exhibitions at Saxmundham and Wickham Market Horticultural Society were disappointing “owing to the forwardness of the season, the continued drought, and the meeting being fixed late”.68 A week later, cattle dealers in Norwich were reported as being unable to sell their animals.69 Barley was described as having ripened prematurely and turnips were reported as being inferior due to drought.70 Unfortunately, we only have access to one other newspaper for 1834—the Hampshire/Portsmouth Telegraph. It does provide supporting evidence, albeit limited, for a dry spell in the period. At the end

78  Using Corpora to Identify Droughts of May, it reported: “Hay making commenced [in Cowes] yesterday but the crops are light, in consequence of the long drought. Every species of vegetation is withering away for want of rain”.71 On 14 July, the same newspaper noted that many parts of the country were affected by “excessive drought” and in some places the oat crop was being fed to sheep by farmers who believed it would never reach maturity.72 Carnivorous birds were reported as suffering extreme hunger due to the drought.73 The other years which appear on Table 3.4 but have not been documented as periods of drought or have already been discussed are 1829, 1838, 1839, 1877, and 1883. The year 1829 achieves fourth peak year place in data collected from the Hampshire/Portsmouth Telegraph. However, there is only one relevant article in this newspaper which refers to droughty, cold weather impeding the spring crops.74 Unfortunately, we are unable to look to other newspapers for substantiating evidence. Only the Ipswich Journal and Glasgow Herald reach as far back as the 1820s and issues for 1829 are missing in both titles. Concordances in the Pall Mall Gazette dating to 1877 reveal a relatively high frequency of drought references but very few are UK specific. This is also the case for 1883 in Reynold’s Daily; and The Era in 1838 and 1839. The articles concerned are not affected by OCR errors, however, to any great extent.

Conclusion Let us begin by summarizing what we have found out about droughts in the nineteenth century in this study. First, droughts align well with news values in the nineteenth century—it seems that when the United Kingdom was experiencing prolonged dry weather, it is often reflected in increased references to drought in British newspapers. By exploring the peak years in which drought is most frequently mentioned we can verify instances of dry weather which are already well-attested by environmental reports—the articles which mention drought in these years might add to what we already know about the nature and geography of drought during these periods. However, our analysis also throws up some peak years which are not listed in Cole and Marsh (2006) as being periods of drought; on having examined concordances for the years 1821–22; 1825; 1834; 1846; 1850–52, and 1880–81 we discovered that these did, in fact, appear to be years in which drought in the UK was being reported. We believe that droughts in these years may have been overlooked by hydrologists. Interestingly, much of the evidence for these dry spells comes in the form of casual asides by news writers: parts of the United Kingdom may have been experiencing drought, but that drought was usually referenced in relation to a more newsworthy story such as poor agricultural prospects. It is possible that these droughts did not achieve headline news because they were not considered to be particularly problematic or extreme.

Using Corpora to Identify Droughts  79 We must draw a second conclusion for this chapter, a conclusion focussed on method. We could easily have set aside all of the other methods we used in this chapter and have relied on frequency alone in order to identify periods of drought. If we had done so we may have drawn a range of conclusions not dissimilar to those drawn. However, those conclusions, while they may have looked the same, would have had many more questions hanging over them and, in all likelihood, would have looked very different indeed. Because of our understanding of the work of Joulain, we were able to explore issues in our data that a simple frequency analysis alone would not necessarily have highlighted. We were also cautious about some findings because of what we knew about the OCR quality of the corpus thanks to Joulain’s analysis. Similarly, because of the findings of Cole and Marsh we were able to root our observations in the reality of weather in the nineteenth century, at least insofar as the historic records of the water cycle in the UK allow. This was important for allowing the findings to make some claim for the ground truth of what was observed in the data. Bringing in those other methods and findings allowed us to provide results which, while they may not be perfect, allowed a more confident assertion of conclusions than an approach which ignored both of those sets of findings could have. Also, it goes without saying that without applying geoparsing to our data it would have been very difficult to focus our analysis on the UK at all without a good deal more manual sorting of the data—this is most clearly shown in the Glasgow Herald in which the second greatest volume of examples were removed by the geoparser. In the Glasgow Herald, the top five collocates which link geo-political entities to drought are, in descending order, Australia, Texas, California, Russia, and Queensland.75 Also, and importantly, the distribution of removed examples from the Herald are not smooth as shown in Table 3.5. As we can see, the mention of overseas droughts as a proportion of all droughts fluctuates across the century in the Herald while the amount of

Table 3.5 The Thinning of Examples From the Glasgow Herald, by Decade

Decade     Years Present        Number of Words in the Corpus in This Decade    Percentage of Examples Removed
1820–29    1820–22; 1826–27     14,326,096                                      5.95
1840–49    1844–49              58,876,788                                      13.20
1850–59    All                  140,612,509                                     15.98
1860–69    All                  268,944,359                                     18.92
1870–79    1870–71; 1879        85,190,566                                      16.45
1880–89    All                  521,479,026                                     18.41
1890–99    All                  625,512,681                                     14.56

80  Using Corpora to Identify Droughts data we have for the Herald increases. By the time we reach the 1840s, the volume of mention of overseas droughts is clearly enough to have begun to distort the reporting of what might have been presumed to be droughts in the UK if we had not taken the foreign drought reporting into account. A penultimate comment should be added here about the pros and cons of triangulation, as we have experienced them. For us, the pros are obvious: (i) triangulation is a means of attempting to falsify a hypothesis from another methodological perspective; (ii) our failure to falsify in this study enhances the credibility both of work done by hydrologists for the century and for the proposition that corpora have a role to play in such investigations; (iii) working on triangulation in a cross-disciplinary context was helpful—the linguists, historians, and hydrologists on this project each made use of important disciplinary expertise in the interpretation of the results reported—the triangulation in this case formed a crucible within which disciplines met and forced fresh perspectives on the area studied. In terms of cons, these can emerge from (i) the individual approaches adopted—for example, the corpus data we used had issues, as reported, and any impact that these had on the findings have the potential to destabilize the triangulation, such as by generating a false falsification; (ii) an unfortunate interaction of the approaches used in the triangulation— for example, if droughts were not newsworthy in the nineteenth century, we may have appeared to falsify the work of the hydrologists on the basis of data that omitted to report droughts; and (iii) there is obvious complexity, as this chapter has shown, in aligning and critically employing different methods and data and focussing them on the same research question. On balance, however, this chapter has engaged more with the positives than the negatives—all of the benefits mentioned here were realised in this work, while of the negatives mentioned, only the third manifested itself in our study. A final point may be apparent to some readers, but is worth stating. This chapter has been about progressing through an analysis, using difficult data that others may despair of, and undertaking simple analyses to support close reading in a context in which different disciplines and methods are deployed to produce an approach to the data that seems to yield results. While the allure of the complex and the automated is understandably real, the power of the simple approach, in the right context, should not be overlooked.

Notes 1. For a good overview of issues facing archivists of newspaper materials see Walravens (2008). 2. UK Natural Environment Research Council grant number NE/L010100/1. 3. While not a productive dimension of study for this paper, Joulain (2017, p. 206) also provides some very useful notes on the production practices and political leanings of the Pall Mall Gazette, Reynold’s Weekly Newspaper,

Using Corpora to Identify Droughts  81 and The Era that could prove particularly helpful to researchers using the data to explore more politically loaded topics in those newspapers. 4. See Smitterberg (2014) for more details and an example of the use of the corpus. 5. For a fuller description of scientific definitions of drought see Tate and Gustard (2000). 6. An accompanying reference to this drought’s duration gives the starting period of late 1889. 7. Initial attempts to use a simple search term, such as drought*, generated too much irrelevant material so we searched for the more complex search term ‘drought OR droughts OR droughty_JJ OR droughted OR drought-* OR drouth OR drouths’. This produced results which included the singular and plural of the term drought, the term droughted, hyphenated terms beginning with drought, (such as drought-stricken and other instances where drought was mistakenly hyphenated), and also adjectival occurrences of droughty which avoided the inclusion of the proper noun Droughty. We also considered both the singular and plural of drouth, an archaic word which was occasionally used in the nineteenth century to mean drought. Most instances of drouth appeared in the Glasgow Herald. 8. We found no evidence that drought was frequently given a metaphorical meaning during the nineteenth century although we did note some rare instances of this. For instance, the Western Mail referred to a “commercial drought” in April 1880. See WMCF_1880_04_05. 9. See footnote 7. 10. GWHD_1846_08_14; Glasgow Herald, August 14, 1846; Issue 4543. 11. ERLN_1846_06_21; The Era, June 21, 1846; Issue 404. 12. IPJO_1846_08_08. 13. HPTE_1846_07_11; Hampshire Telegraph and Sussex Chronicle etc, July 11, 1846; Issue 2440. 14. GWHD_1850_04_05; Glasgow Herald, April 5, 1850; Issue 4923. 15. GWHD_1850_05_20; Glasgow Herald, May 20, 1850; Issue 4936. 16. GWHD_1850_08_19; Glasgow Herald, August 19, 1850; Issue 4962. 17. GWHD_1850_09_16; Glasgow Herald, September 16, 1850; Issue 4970. 18. GWHD_1850_09_16; Glasgow Herald, September 16, 1850; Issue 4970. 19. GWHD_1850_09_23; Glasgow Herald, September 23, 1850; Issue 4972. 20. GWHD_1850_08_09; Glasgow Herald, August  9, 1850; Issue 4959 and GWHD_1850_09_16; Glasgow Herald, September 16, 1850; Issue 4970. 21. GWHD_1850_09_23; Glasgow Herald, September 23, 1850; Issue 4972. 22. HPTE_1851_12_20; Hampshire Telegraph and Sussex Chronicle etc, December 20, 1851; Issue 2724. 23. HPTE_1852_01_10; Hampshire Telegraph and Sussex Chronicle etc, January 10, 1852; Issue 2727. 24. HPTE_1852_04_17; Hampshire Telegraph and Sussex Chronicle etc, April 17, 1852; Issue 2741. 25. ERLN_1852_04_11 and ERLN_1852_04_18. 26. HPTE_1852_05_08. 27. Hampshire Telegraph and Sussex Chronicle etc, May 8, 1852; Issue 2744. 28. ibid. This article is also printed by Reynold’s Daily, see RDNP_1852_05_02. 29. HPTE_1852_05_22 and ERLN_1852_05_02. 30. Figure  3.4 represents towns, cities, and other geographic markers such as rivers and lochs. 31. See, for example, NREC_1878_05_03. 32. See NREC_1878_02_16 and NREC_1878_03_05. 33. WMCF_1880_05_25; Western Mail , May 25, 1880; Issue 3445.

82  Using Corpora to Identify Droughts 34. WMCF_1880_05_31; Western Mail, May 31, 1880; Issue 3450. 35. WMCF_1880_09_17; Western Mail, September 17, 1880; Issue 3544. 36. WMCF_1880_06_15; Western Mail, June  15, 1880; Issue 3463 and WMCF_1880_08_14; Western Mail (Cardiff, Wales), Saturday, August 14, 1880; Issue 3515. 37. RDNP_1880_09_12. 38. NREC_1880_08_03; Northern Echo, August 3, 1880; Issue 3278. 39. PMGZ_1880_07_26. 40. PMGZ_1880_09_02. 41. GWHD_1880_09_01; Glasgow Herald, September 1, 1880; Issue 210. 42. GWHD_1880_10_25; Glasgow Herald, October 25, 1880; Issue 256. 43. PMGZ_1880_09_10. 44. RDNP_1881_07_31. 45. HPTE_1881_08_13. 46. IPJO_1881_04_05. 47. IPJO_1881_05_31. 48. See IPJO_1881_07_26 and IPJO_1881_07_30. 49. IPJO_1881_09_06. 50. GWHD_1821_06_18; Glasgow Herald, June 18, 1821; Issue 1929. 51. GWHD_1821_07_02; Glasgow Herald, July 2, 1821; Issue 1933. 52. GWHD_1821_07_27; Glasgow Herald, July 27, 1821; Issue 1940. 53. GWHD_1821_08_13; Glasgow Herald, August  13, 1821; Issue 1945 and GWHD_1821_09_10; Glasgow Herald, September  10, 1821; Issue 1953. One reference to drought on 10 June, 1822, referred to the weather of the previous year: “The unexampled drought of the last summer destroyed all crops on sandy soils,—which abound much in these islands; arid potatoes were entirely ruined.” 54. GWHD_1822_07_06; Glasgow Herald, June 7, 1822; Issue 2029. 55. GWHD_1822_06_14; Glasgow Herald, June 14, 1822; Issue 2031. 56. GWHD_1822_07_05; Glasgow Herald, July  5, 1822; Issue 2037 and GWHD_1822_08_12; Glasgow Herald, August 12, 1822; Issue 2048. 57. GWHD_1822_09_06; Glasgow Herald, September 6, 1822; Issue 2055. 58. HPTE_1822_07_08; Hampshire Telegraph and Sussex Chronicle etc, July 8, 1822; Issue 1187. 59. HPTE_1822_11_04; Hampshire Telegraph and Sussex Chronicle etc, November 4, 1822; Issue 1204. 60. HPTE_1825_04_11. 61. HPTE_1825_05_02. 62. HPTE_1825_08_08; Hampshire Telegraph and Sussex Chronicle etc, August 8, 1825; Issue 1348. 63. IPJO_1825_07_23; The Ipswich Journal, July 23, 1825; Issue 4554. 64. IPJO_1825_08_27. 65. IPJO_1825_09_03; The Ipswich Journal, September 3, 1825; Issue 4560 and IPJO_1825_09_03. 66. IPJO_1825_10_08. 67. IPJO_1834_05_31. 68. IPJO_1834_07_12; The Ipswich Journal, July 12, 1834; Issue 5025. 69. IPJO_1834_07_19. 70. IPJO_1834_08_30; The Ipswich Journal, August 30, 1834; Issue 5032. and IPJO_1834_11_15; Tuesday’s Post. The Ipswich Journal, November  15, 1834; Issue 5043. Also see IPJO_1834_11_15. 71. HPTE_1834_06_02; Hampshire Telegraph and Sussex Chronicle etc, June 2, 1834; Issue 1808. 72. HPTE_1834_07_14. 73. HPTE_1834_07_28.

74. HPTE_1829_08_10. 75. While a full discussion of the pros and cons of undertaking collocation analysis on such data is the preserve of a future paper, this quick use of collocation shows that the problem that geoparsing addressed was very real.

References Baker, P., Gabrielatos, C., KhosraviNik, M., Krzyżanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273–306. Baker, P., Gabrielatos, C., & McEnery, T. (2013). Discourse analysis and media attitudes: The representation of Islam in the British press. Cambridge: Cambridge University Press. Brookes, G., & Baker, P. (2017). What does patient feedback reveal about the NHS? A mixed methods study of comments posted to the NHS Choices online service. BMJ Open, 7, e013821. Cole, G. A.,  & Marsh, T. J. (2006). The impact of climate change on severe droughts: Major droughts in England and Wales from 1800 and evidence of impact. Bristol, UK: The Environment Agency. Gabrielatos, C., McEnery, T., Diggle, P. J., & Baker, P. (2012). The peaks and troughs of corpus-based contextual analysis. International Journal of Corpus Linguistics, 17(2), 151–175. Gregory, I. N., and Hardie, A. (2011). Visual Gisting: Bringing together corpus linguistics and Geographical Information Systems. Literary and Linguistic Computing, 26(3), 297–314. Hardie, A. (2012). CQPweb—combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409. Holmgreen, L. L (2009). Metaphorically speaking: Constructions of gender and career in the Danish financial sector. Gender and Language, 3(1), 1–32. Joulain, A. T. (2017). Corpus linguistics for history: The methodology of investigating place-name discourses in digitised nineteenth-century newspapers (Doctoral dissertation). Lancaster University. Marsh, T., Cole, G., & Wilby, R. (2007). Major droughts in England and Wales, 1800–2006. Weather, 62(4), 87–93. McEnery, T., McGlashan, M., & Love, R. (2015). Press and social media reaction to ideologically inspired murder: The case of Lee Rigby. Discourse and Communication, 9(2), 237–259. McEnery, T.,  & Baker, H. (2017a). A  corpus-based investigation into English representations of Turks and Ottomans in the early modern period. In M. Pace-Sigge, & K. Patterson (Eds.), Lexical priming: Applications and advances (pp. 42–66). Amsterdam: John Benjamins. McEnery, T., & Baker, H. (2017b). Corpus linguistics and 17th-century prostitution. London: Bloomsbury. McEnery, T.,  & Baker, H. (2018). The poor in seventeenth-century England: A corpus based analysis. Token, 6, 51–83. Ratia, M., Palander-Collin, M., & Taavitsainen, I. (2017). English news discourse from newsbooks to new media. In M. Palander-Collin, M. Ratia, & I. Taavitsainen (Eds.), Diachronic developments in English news discourse (pp. 3–12). Amsterdam: John Benjamins.

84  Using Corpora to Identify Droughts Rupp, C. J., Rayson, P., Gregory, I., Hardie, A., Joulain, A., & Hartmann, D. (2015). Dealing with heterogeneous big data when geoparsing historical corpora. IEEE conference on big data, Bethesda, October 27, 2014, pp. 80–83. Smitterberg, E. (2014). Syntactic stability and change in nineteenth-century newspaper language. In M. Hundt, (Ed.), Late modern English syntax (pp.  311– 330). Cambridge: Cambridge University Press. Tate, E. L., & Gustard, A. (2000). Drought definition: A hydrological perspective. In J. V. Vogt & F. Somma (Eds.), Drought and drought mitigation in Europe. Advances in natural and technological hazards research (Vol. 14, pp. 23–48). Dordrecht: Springer. Waddington, K. (2016). ‘I should have thought that Wales was a wet part of the world’: Drought, rural communities and public health, 1870–1914. Social History of Medicine, doi:10.1093/shm/hkw118 Walravens, H. (Ed.) (2008). Newspapers collection and management: Printed and digital challenges. The Hague: International Federation of Library Associations.

4 Analysing Representations of Obesity in the Daily Mail via Corpus and Down-Sampling Methods Paul Baker Introduction Perhaps more than is the case for other chapters in this collection, there already exists a small but significant amount of triangulation research combining corpus linguistics and critical discourse analysis. The RASIM (Refugees, Asylum Seekers, Immigrants and Migrants) project at Lancaster University used two research teams, one comprising corpus linguists, the other made of critical discourse analysts, to examine representations of RASIM in the UK press. The project resulted in publications which employed one of the two approaches (e.g. Gabrielatos & Baker, 2008, KhosraviNik, 2010) as well as a paper which outlined how corpus linguistics and CDA could be combined in a nine-stage iterative process which involved moving back and forth between different analytical techniques in order to develop and test new hypotheses (Baker et al., 2008). Other triangulation research has attempted to compare analysts in a variety of contexts. For example, Marchi and Taylor (2009) carried out independent corpus-assisted discourse analyses of the same newspaper corpus, considering the representation of journalism within it. Their comparison of results did not consider whether any interpretation was more valid than another but instead focussed on similarities and differences, finding a mixture of convergence (similar findings which indicated broad agreement), divergence (different findings which indicated disagreement) and complementary findings (which were different but indicated overall agreement). A similar study was carried out by Baker (2015), who compared five analysts who used a variety of corpus-assisted methods to examine representations of foreign doctors in a corpus of news articles. A meta-analysis of the reports produced by the analysts chiefly found a complementary picture, with around two thirds of findings made by one analyst only. When the findings of each analyst were compared against each of the other analysts in turn (e.g. comparing Analyst 1 against Analyst 2, then Analyst 1 against Analyst 3, etc), the amount of shared findings was between 13% and 36%. Only one finding out of a total of 26 was made by all five analysts.
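The pairwise comparison of analysts' findings described above can be made concrete with a small sketch. The finding labels and the Jaccard-style overlap measure in this Python fragment are assumptions for illustration only, not a reconstruction of the categories or calculation used in the 2015 study:

```python
from itertools import combinations

# Invented finding labels, for illustration only.
findings = {
    "Analyst 1": {"criminality", "physical prowess", "discrimination"},
    "Analyst 2": {"criminality", "privilege"},
    "Analyst 3": {"physical prowess", "self-made success", "discrimination"},
}

# Compare each pair of analysts and report the proportion of findings they share.
for (name_a, set_a), (name_b, set_b) in combinations(findings.items(), 2):
    shared = len(set_a & set_b) / len(set_a | set_b) * 100
    print(f"{name_a} vs {name_b}: {shared:.0f}% shared findings")
```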

A different approach was taken by Baker and Levon (2015), who analysed a corpus of news articles about masculinity, with Baker carrying out concordance analyses of collocates of six terms referencing racial and social class male identities like black man and working class men, while Levon qualitatively examined a down-sampled set of 54 articles which contained the most mentions of the same terms for each newspaper in the corpus. The comparison of their results found no divergent findings but did find agreement on a number of representations (e.g. black men were represented as criminals, physically impressive, and discriminated against), although there was also a range of complementary (different) findings. The corpus analysis uncovered representations through the identification of unexpected repeated patterns. For example, the word I'm was only a collocate of working-class male identities, and concordance analyses indicated that it was used in constructions where working-class men were likely to 'claim' that identity, whereas the collocate well- occurred with middle-class men in descriptors like well-educated and well-fed, indicating a type of privilege which is bestowed upon them by an unmarked other. The phrase working-class man/men, on the other hand, collocated with self- in terms like self-educated, self-made, and self-motivated, indicating that success for such men is seen as due to their own efforts. The qualitative analysis, in turn, also uncovered implicatures that indicated the reasons why certain representations appeared in the articles (e.g. working-class men were viewed as beleaguered because they were implied to have been 'forgotten by society'). The authors noted that

A weakness of a singular corpus approach to discourse analysis (compared to the qualitative analysis) is that the focus on collocates or other patterns based around word frequencies may mean that in some cases a purely descriptive analysis emerges which does not attempt to provide interpretation, critique or explanation for the patterns found. Nor may such analysis engage with the wider social and historical context beyond the corpus. (ibid. 231)

However, they noted that a qualitative analysis can also result in description of themes and that the findings of such an analysis "may not be generalizable beyond the smaller dataset" (ibid. 233), also running "the risk of being overly subjective, with pattern identification entirely a property of the perspective of the analyst and not accountable to any more 'objective' method" (ibid.). The authors thus argue that the combined analyses help to mitigate each other's shortcomings and advocate that analysts "complement qualitative analyses with corpus approaches" (ibid.).

As an alternative to the nine-stage process developed by Baker et al. (2008), they indicate that there is value in the approaches working as separate components, with a final stage involving comparison and consolidation of findings. They also suggest that "further experiments of this type would be welcome in order to shed more light on the benefits and drawbacks associated with different approaches to critical discourse analysis" (ibid. 234).

The studies described above may have used different analytical methods but they all had one feature in common—the comparison of multiple analysts. This introduces a factor that is difficult to control for in terms of experimental conditions—the outcomes may be as much the result of people's different analytical abilities and interests as of the different techniques of analysis that were employed. This does not invalidate these studies, although an aim of the present study was to consider the effect of trying to remove or at least limit this factor, to determine whether a single analyst tasked with using different methods would reach the same conclusions about triangulation as those made in studies involving multiple analysts. Of course, using a single analyst brings its own problems. The analyst may simply be more experienced in (or prefer) one method compared to others. Additionally, it becomes difficult to carry out a fully independent set of analyses—the outcomes of the first analysis could play a part in how the second analysis proceeds (a point which is discussed in more detail in the following section). Despite these concerns, I believe that there is value in having a single analyst compare multiple methods, so this chapter should be viewed as making a contribution toward the existing body of work which has attempted to triangulate corpus linguistics with critical discourse analysis, rather than making any definitive or final statements on the topic.

In this chapter I combine aspects of Baker and Levon's (2015) experimental design with the one carried out by Baker (2015). This involves building a corpus of newspaper articles and conducting a concordance analysis of collocational pairs of words as well as a qualitative 'close reading' of a down-sampled set of articles. The findings are then subjected to a meta-analysis which involves both quantitative and qualitative comparisons of findings.

In terms of choosing a topic to base the analysis around, I decided to look at representations of obesity in the British press. This was based upon a subjective impression that some newspapers were representing obese people in negative ways, an impression which is also supported by the academic literature. For example, Holland et al. (2011) have criticized media reporting of obesity as alarmist and uncritical, whereas Couch et al. (2015) argue that such reporting is a form of social control. Caulfield et al.'s (2009) diachronic study has indicated a shift in reporting from a deterministic view of obesity (e.g. one caused by genetic factors) to one based on personal responsibility (linked to diet and exercise).

I initially considered building a corpus of articles from a wide range of national newspapers but reasoned that this would introduce an additional factor related to differences between newspapers, potentially giving a much wider range of stances about obesity which would be more difficult to do justice to in a chapter-length analysis. So I decided instead to examine reporting on obesity in just one newspaper, the Daily Mail (and its Sunday equivalent The Mail on Sunday). The Daily Mail is a conservative middle-market newspaper. A YouGov survey (2017) indicated that it is viewed as the most right-wing newspaper in the UK, while according to a National Readership Survey published in 2015, it has 23.5 million readers per month and is the most widely read newspaper (when its print and online versions are considered together) in the UK.1 Its popularity does not mean that it is representative of the British press as a whole (other widely read newspapers like The Guardian, The Telegraph, and The Mirror incorporate different writing styles and political perspectives), but we could view the Daily Mail as influencing large numbers of people in terms of their understandings of obesity and how obese people are viewed.

For the corpus analysis, I decided to repeat the technique used in Baker and Levon (2015), which involved looking at concordance lines of collocational pairs. In that paper, the down-sampled set of articles had been determined by selecting articles which contained the most frequent mentions of the topics under examination. A potential limitation of this study, then, is that I have chosen a single set of corpus procedures to carry out the analysis, which, though shown to be reasonably effective, should not be seen as fully representative of all the possible ways that the corpus could have been analysed. For example, I could have considered key semantic tags, collocational networks or lexical bundles instead of or alongside the procedures I used. My findings thus apply to the specific procedures and texts used in this chapter.

There are also numerous ways that down-sampling can be carried out, so I decided that the present analysis would not just be a comparison of a corpus approach with a single down-sampling approach but would incorporate different ways of down-sampling in order to make a further comparison between them. An approach that had been used in the RASIM project was to carry out down-sampling based on time periods when a particular topic had received the most attention in the media. Another way could be to use a recently developed tool, ProtAnt (Anthony & Baker, 2015), which identifies keywords in a corpus (when compared with an appropriate reference corpus) and then rank-orders individual texts in the corpus based on either the total number or relative proportion of keywords they contain. Additionally, texts which contain the most direct references to the topic of obesity could be argued to be more typical of the corpus, and so down-sampling based on this method could be another way of selecting a few files that are most representative of the whole.

The preceding three methods of down-sampling all rely on some sort of principled reasoning whereby a replicable procedure is put in place for selecting texts based on criteria determined using quantitative measures. However, another method could involve a different principle—choosing a set of texts at random. Such a procedure would produce different sets of texts each time and it would be difficult to know the extent to which the resulting set of texts would be representative of the whole. However, it would be interesting to compare the results of an analysis of the principled down-sampling techniques against a randomly obtained data set, in order to determine which analyses most closely resemble one another (and which ones, if any, resemble the corpus analysis most closely).

The motivation behind this study is slightly different from the other experimental comparisons of methods described previously, in that the earlier body of work has demonstrated that different methods do not produce identical results. Instead we normally find a picture of complementarity rather than convergence or divergence, with each method giving a partial picture of the data. Therefore, in this study, rather than seeking to identify the extent to which methods produce similar findings, I am equally (or more) interested in identifying which combination of two or more methods will result in the greatest coverage of findings. To this end, the focus is on methods that produce distinct results, rather than simply replicating previously found research outcomes.

In the following section I describe how I built the corpus and created the four sets of down-sampled texts. I also discuss the order in which the five sets of analyses occurred. This is followed by a section listing the research questions which the analyses aimed to answer. There are then sections which give examples of the qualitative and corpus analyses. The final two sections outline the comparison of approaches, first using quantitative techniques to compare the extent of similarity or difference between approaches, and then reflecting on the ways that the approaches differed. The chapter ends with some reflections on the benefits and drawbacks of triangulation when carrying out critical research on large amounts of linguistic data.

Corpus Collection and Selection of Articles

A corpus of Daily Mail articles containing the words obese or obesity2 was collected from the online archive Nexis, covering the five-year period from the start of 2012 until the end of 2016. In total, 1,201 articles were collected, consisting of 1,000,740 words. A sorted concordance of the words obese and obesity was used in order to identify 14 duplicate articles, which were subsequently removed by hand.

Four sets of ten files were identified for a qualitative analysis which involved reading each whole text. Set 1 contained the articles with the highest number of references to the terms obese or obesity (the seed words used to collect the data).

Set 2 contained the articles with the highest proportion of keywords, identified using the tool ProtAnt with the million-word BE06 Corpus of written British English (Baker, 2009) as the reference corpus.3 Set 3 contained a spread of ten articles taken from May 2015, which was the month containing the most articles (39), suggesting a period where the subject peaked in terms of media interest (see Figure 4.1). Set 4 contained ten articles that were selected using an online random number generator facility provided by Google. The entire corpus constitutes a nominal fifth set, although as explained later, this was subjected to different analytical techniques from the four samples.

As noted earlier, when multiple forms of analysis are carried out by the same person there is a potential issue around the order in which these analyses occur, particularly if we want to carry out some sort of comparison of the analytical techniques. For example, an analyst may be more thorough during the first analysis they carry out but may feel fatigued and be less rigorous in subsequent analytical stages. Or it could be the case that the later stages of analysis will benefit from findings that were made at earlier stages—we may find it easier to spot an example of something which we identified in an earlier data set, leaving us more time and energy to make further discoveries. Alternatively, our previous encounters with other datasets may prime us to notice the same phenomena again and again, resulting in replication of findings. It is not possible to fully address this issue within this chapter—to do so would require a large meta-study involving numerous analysts carrying out multiple analyses in different orders. Instead, I can only reflect on the possible effects that the order in which the analyses were carried out could have.

Figure 4.1  Number of Articles About Obesity Over Time (y-axis: number of articles per month, 0–40; x-axis: month, January 2012 to November 2016)

To take some of these potential issues into account I carried out the separate analyses over a period of several weeks, to factor in the possibility of analyst 'fatigue' and also to limit (to an extent) the effect of the findings of one analysis priming the next one. One question concerns the order in which the analyses took place, and whether the analyses should occur in a linear fashion (e.g. Set 1, Set 2, Set 3, Set 4, Corpus), or whether they should be more iterative, involving several stages (e.g. Set 1, Set 2, Set 3, Set 4, Corpus, Set 1, Set 2, Set 3, Set 4, Corpus). In Baker et al. (2008) we argued that a productive form of analysis would involve moving between different types of quantitative and qualitative analysis, with each stage informing the direction that the next one could take. Therefore, the full analysis involved consideration of the datasets in the preceding sequence three times, although the first analysis took the most time and the second and third analyses involved more cross-checking back to determine whether I had missed anything as a result of noticing something previously somewhere else. As the research questions developed over the course of the analysis, it was also important to revisit the data sets in order to address questions that had not been considered the first time. This recursive sequence of analysis was not the one I had initially intended to carry out, and it makes it more difficult to attribute findings to a single data set, as something I noticed first in one data set was subsequently sought for in the other sets. However, even if I had only examined each set once, this would have still been the case to a lesser extent, with the analysis of Set 4 being influenced by the findings from Sets 1–3. Separate researchers working independently would not have had this issue of 'pollination', although in conducting the analyses on each set several times, I have at least tried to ensure that no set was unduly privileged. The extent to which the five analyses can be considered truly separate from one another is therefore negligible. What we can compare, though, is the extent to which the holistic set of findings from the three cycles of analysis is found equally in each set, or whether certain ways of sampling the corpus resulted in particular presences or absences.
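To make the four sampling principles described above concrete, the following is a minimal sketch of how such down-sampled sets might be derived programmatically. It is illustrative only: the article fields (text, date), the precomputed keywords list and the function names are hypothetical rather than taken from this study, the keyword score is only a rough analogue of the length-adjusted ranking described in note 3, and the sketch does not reproduce ProtAnt or the original selection procedure exactly.

```python
import random
import re
from math import log

SEED_TERMS = re.compile(r"\bobes(?:e|ity)\b", re.IGNORECASE)

def seed_term_count(article):
    # Set 1 principle: how often the seed terms obese/obesity occur in the article.
    return len(SEED_TERMS.findall(article["text"]))

def keyword_score(article, keywords):
    # Set 2 principle (rough analogue of note 3): log of keyword tokens divided by
    # log of all tokens, to damp the effect of article length.
    tokens = re.findall(r"\w+", article["text"].lower())
    key_tokens = sum(1 for t in tokens if t in keywords)
    if key_tokens < 2 or len(tokens) < 2:
        return 0.0
    return log(key_tokens) / log(len(tokens))

def build_sets(articles, keywords, peak_month=(2015, 5), n=10, seed=1):
    keywords = set(keywords)
    set1 = sorted(articles, key=seed_term_count, reverse=True)[:n]
    set2 = sorted(articles, key=lambda a: keyword_score(a, keywords), reverse=True)[:n]
    peak = [a for a in articles
            if (a["date"].year, a["date"].month) == peak_month]
    step = max(1, len(peak) // n)
    set3 = peak[::step][:n]                          # a spread across the peak month
    set4 = random.Random(seed).sample(articles, n)   # random selection
    return set1, set2, set3, set4
```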

Research Questions

Prior to conducting any analysis, I had created a general research question: 'How is obesity represented in the Daily Mail between 2012 and 2016?' In order to give the analytical process more focus I also devised several smaller, more manageable sub-questions. These sub-questions were refined or added during the analysis, allowing analytical points to emerge that answered questions which were not initially thought of but were identified as relevant in terms of answering the general research question. The final set of sub-questions was as follows:

1. What types of people (or other social actors) are particularly represented as obese or affected by obesity?

2. How are such people represented, particularly in terms of evaluations?
3. What is represented as the consequences of obesity?
4. What is represented as the cause of obesity, and what actions are suggested in order to reduce obesity? (These two questions appear together as they were often linked, e.g. if eating sugar was suggested as a cause of obesity, then not eating sugar, reducing advertising of sugary food or implementing a 'sugar tax' were suggested as actions.)
5. What strategies are used in order to legitimate the positions taken in response to the preceding questions?

There is not enough space to describe the full analysis, so instead, in the following two sections, I provide illustrative examples of the analysis, based on the way I carried out a close reading of a single article, followed by the analysis of some of the collocates.

Example of Qualitative Analysis

The qualitative analysis involved reading each article several times to identify its overall themes, the social actors referred to, the ways they were represented, the arguments that were cited and the reasons given in order to support them. I took each article as a separate text and considered how representations and arguments were developed over the course of the text and how different representations and arguments were prioritised, backgrounded or absent. This involved considering aspects such as the amount of space given over to different aspects of the text and whether particular representations occurred at the beginning, middle or end of the text. I also considered how journalists signalled stance toward different representations or arguments—for example, was a claim about obese people given as the journalist's own opinion or did it appear within a quote from someone who had been interviewed, and if it was the latter, did the journalist signal or imply alignment toward the quote in some way? I considered whether arguments or representations were made by people who were represented as 'experts' or as having a vested interest in some way, and what range of readings or responses an article could potentially elicit. As well as looking at articles as single texts, I also noted cases where similar representations or arguments were repeated across the set I was examining. As noted previously, the analysis itself resulted in the development of the sub-questions beyond the main research question, and as it progressed, the wording of the sub-questions was refined.4 This was one of the reasons why I needed to go back to earlier data sets and look at them again—at the first 'pass' I had not considered questions that emerged later on.

The article I examine in detail for illustration purposes is taken from Set 1 and is entitled "Beware, if you've piled on the pounds you may never lose them again"; it consists of 735 words. The article refers to findings from a decade-long study of obese people and dieting by British researchers at King's College London, reporting that fewer than 1 in 10 of the people in the study managed to lose at least 5% of their body weight. I note that news stories about obesity were often framed around findings from academic studies that quoted figures and cited reasons why people were obese. Obesity in this article is therefore framed as problematic (something which was found in all but one of the articles examined), but also as a problem which can potentially be explained or solved by scientific or academic research. The claims about obesity in the article are thus legitimated by framing it as a report of an academic study. For example, in the second paragraph is the sentence "The findings are a stark warning of the lasting consequences of putting on weight, experts said last night". This quotation is ambiguous—although it appears to be specific in noting the time that the quote was made ("last night"), it does not name the experts, and in implying that the quote is attributed to more than one expert, it conjures up an image of a line of experts intoning the words in chorus. Clearly, this is not an exact representation of reality, and the lack of quote marks around the text attributed to the experts implies that this is perhaps a summary of the views of a number of different people, who were interviewed independently.

The first 362 words of the article report the findings of the study and then the article shifts focus to a quote from the doctor who carried it out, as well as bringing in a report from the World Health Organisation (WHO), another expert source. The WHO's claim that three quarters of men and two thirds of women in the UK will be overweight or obese by 2030 is directly attributed to the WHO with the phrase "according to the WHO". After this is the sentence "The problem is already resulting in soaring rates of type 2 diabetes and is predicted to have a similar impact on the numbers of people with heart disease, strokes and cancer". These claims are not explicitly attributed to anyone, although it seems to be implied that they are also part of the WHO's claims. Then the doctor who carried out the study is quoted as saying that public health strategies are not working.

In the middle part of the article, four paragraphs are given to the views of Tam Fry, head spokesman of the National Obesity Forum (NOF). The Forum is an independent British organisation which campaigns for a more interventionist approach to obesity. Fry does not appear to have been part of the King's College study but presumably, due to his role in the NOF, is given the space to comment on the study. He is quoted as saying that there are "messages aimed at people not to get fat in the first place—but an increasing number don't listen". Blame for obesity is placed on individuals who "don't listen", rather than on the public health strategies mentioned by the doctor who carried out the study.

Fry is then quoted as saying that "The result is that obesity may break the NHS". Highlighting the cost of obesity to the National Health Service was something which was found across all the text samples, as well as being identified via the concordance analysis of collocates. The article then describes a second study by researchers at the University of Birmingham which claims that "well-meaning grandparents may be contributing to the rise in childhood obesity". The article says that "grandparents overfeed their grandchildren, giving them sugary snacks and drinks, and don't let them carry out physical chores". The use of the verb let is notable here, as it implies that permission to engage in a chosen activity is being withheld. This seems an unusual way to refer to children carrying out chores—consulting the 100-million-word British National Corpus shows that chore(s) carries a prosody for unpleasant and boring activities that people cannot avoid, being modified by adjectives like tiresome, routine, tedious, sad, and dreadful. Considering this, it might be argued that a construction like "don't make them carry out physical chores" would more accurately convey general attitudes about chores. However, a verb like make would perhaps acknowledge that chores are unpleasant and that it can be difficult to persuade people to do them, which would imply that grandparents are less irresponsible, compared with the actual wording of the article. The final paragraph of the article notes that the study was carried out on parents of children in two Chinese cities. Coming at the end, this information is therefore somewhat backgrounded, with a potential consequence being that readers may view the study as applying to grandparents everywhere, rather than as a study carried out in a specific cultural context distinct from the King's College study.

In terms of findings, I note the focus on children in the article, as well as the use of academic research and facts and figures to give the article credibility. Thinking about who is viewed as responsible for obesity, we need to take into account the doctor's claim that "public health strategies are not working", along with the 'personal responsibility' position relayed in Fry's description of people as not listening. Additionally, the article implicitly places blame on caregivers (in this case grandparents) for giving children snacks and not 'letting' them do chores. I also note that the article positions obesity as causing other conditions, such as diabetes, and that it highlights obesity as a threat to the National Health Service, thus positioning obese people as a burden to society. However, the article does not explicitly appear to align itself with any of the positions that it references.

Example of Corpus-Based Discourse Analysis

Using AntConc 3.4.4 (Anthony, 2014), the corpus-based discourse analysis was based on collocational analyses of the two search terms used to build the corpus: obese and obesity. Two types of corpus-based discourse analysis were carried out, one based on categorising collocates in order to identify semantic preferences, the other involving close reading of expanded concordance lines which contained collocational pairs. In the interests of methodological transparency, an approach which involved eliciting keywords of the corpus was also considered alongside the collocates approach, with obese and obesity unsurprisingly appearing in the top ten keywords. Some of the other top keywords elicited related to food (food, diet, sugar, eat, eating, calories), others related to health (disease, healthy, health, diabetes, cancer, surgery) or social actors (patients, women, Dr, children, people, professor). However, it was felt that the corpus analysis would end up being much more extensive than the qualitative analyses if numerous keywords were investigated via concordancing (resulting in an unfair comparison), so it was decided to focus on just the two keywords which were closest to the topic under investigation: obese and obesity.

The top 50 collocates5 for each word were examined via an initial concordance analysis and placed into categories based mainly on the grammatical and semantic functions that the collocates fulfilled in relation to the words obese and obesity. Tables 4.1 and 4.2 show how the collocates were categorised. This categorisation helped to identify semantic preferences around the terms obese and obesity—for example the representation of obesity as a problem (epidemic, timebomb, crisis etc.) requiring a solution (strategy, tackle, combat etc.), which is rising (spiralling, soaring, rising etc.) via metaphors of movement (usually upwards, although the verb fuel is different in that it represents obesity as a moving vehicle—so the representation is potentially related to speed rather than direction). We might hypothesise that these semantic preferences are also part of a wider negative discourse prosody which suggests that the increase in obesity is problematic.6

After this initial categorisation of collocates into prosodies, more detailed concordance analyses were carried out, focussing on the research questions outlined earlier. From the 100 collocates, 3,288 concordance lines were produced. An examination of all of these lines would take a long time, resulting in a more detailed analysis than the qualitative analyses of a small number of articles. In order to ensure comparability between the different sets of analyses, I decided to spend roughly the same amount of time carrying out the corpus analysis as I had done with the qualitative analysis. As each qualitative set contained ten articles, I only carried out a more detailed analysis of concordance lines on ten collocates, five from Table 4.1 and five from Table 4.2. The collocates chosen were grossly, branded, proportion, Britons, primary, epidemic, forum, spiralling, blamed, and tackle. These collocates were selected in order to represent multiple categories across Tables 4.1 and 4.2, thus providing potentially a wide coverage in terms of different constructions around the terms obese and obesity.

Table 4.1  Collocates of Obese

Modifying adverbs and adjectives to categorise types of obesity: morbidly, clinically, severely, technically, grossly, officially, super, borderline, dangerously, merely
Verbs of categorisation: categorised, classified, classed, branded, category, labelled, deemed, defined, considered, meaning
Other verbs: threatens, denied, refer, become, becoming
Other descriptions of weight: overweight, underweight
Amounts: proportion, quarter, doubled, numbers, fifth
Social actors—labelled as obese: adults, claimants, smokers, disabled, olds, expecting, Britons, reception, sedentary, diabetic, primary, population, individuals, mice
Likelihood: likely, odds
Grammatical words marking alternatives: or, either

Table 4.2  Collocates of Obesity

Problem: epidemic, timebomb, crisis, tide, grip
Groups and representatives related to obesity: national, forum, international, congress, alliance, chairman, leeds, metropolitan, European, Fry, Haslam, Tam
Rising: fuelling, spiralling, rising, fuelled, soaring, fuel
Ending: strategy, tackling, curb, tackle, combat, preventable, policies, anti
Amounts: rates, prevalence, halve
Causal verbs: related, links, link, linked, blamed, contributing
Other conditions: autism, tooth, decay, illnesses
Types of obesity: maternal, childhood
Other verbs: defined, awaited, publish
Grammatical words: amid

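For readers who want to see the mechanics behind this kind of collocate extraction, here is a minimal sketch of a window-based approach along the lines described in note 5 (a span of five words either side of the node, an MI cut-off of 6 and a minimum of five co-occurrences). The tokenisation, the simple MI formula and the function name are my own simplifying assumptions; the figures reported in this chapter were produced with AntConc, not with this code.

```python
import re
from collections import Counter
from math import log2

def mi_collocates(text, node="obesity", span=5, min_freq=5, min_mi=6.0, top=50):
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    freq = Counter(tokens)
    co = Counter()
    # Count words occurring within +/- span tokens of each instance of the node word.
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        co.update(window)
    scored = []
    for w, f_joint in co.items():
        if f_joint < min_freq:
            continue
        # Mutual Information: log2 of observed co-occurrence over the frequency
        # expected if the node and the collocate were independent of one another.
        mi = log2((f_joint * n) / (freq[node] * freq[w]))
        if mi >= min_mi:
            scored.append((w, round(mi, 2), f_joint))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top]
```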

When carrying out critical discourse analysis on corpus data, a typical concordance table that gives only a few words either side of the search word will often not provide enough context for the analyst to make an accurate interpretation. For example, in Wulff and Baker (forthcoming) a concordance analysis of the term refugees in a corpus of news articles initially indicated that refugees were described negatively, in constructions such as bleeding Britain. However, when expanding concordance lines it became clear that these constructions occurred within quotations that were used in order to be critical of the claim. The initial concordance analysis which was used to categorise the collocates into Tables 4.1 and 4.2 was useful in terms of revealing grammatical patterns or semantic/discourse prosodies. For example, the set of verb collocates which occur with obesity (fuel, fuelled, fuelling, spiralling, rising, soaring) is indicative of a semantic preference which implies that obesity is increasing. However, while concordance lines may reveal the discourse prosody around these movement words, not every line may contain enough information to reveal a prosody (assuming one exists), and such an approach may be even less likely to tell us about legitimation strategies (relating to sub-question 5). For this reason, many concordance lines needed to be expanded, to the extent of looking at a paragraph or even the entire article in order to facilitate interpretation. As an example, here are the expanded concordance lines for the collocation pair of obese with grossly.

1. Parents who let their children guzzle sugar-laden fizzy drinks and gorge themselves on fatty takeaways and convenience foods are also at fault. Often grossly obese themselves, they hypocritically whinge that schools and supermarkets must do more instead of looking at their own behaviour.
2. If you had stratospheric blood pressure, were grossly obese, lived a sedentary lifestyle and smoked constantly, what would you think of a doctor who told you to carry on as you were? That's what I think of George Osborne, and of his admirers.
3. My GP said I was grossly obese and referred me to a dietitian but I'd tried numerous diets in the past and found them just a time-consuming hassle that denied me the foods I enjoyed.
4. To get weight loss surgery on the NHS you must be morbidly obese, the term for grossly overweight.
5. The authorities must be happy as they walk the streets to see the enormous number of grossly obese people who are not endangering their health by eating calves' liver.
6. But then his old buddy Garland turns up with the offer of a job, and Kim finds himself alongside his friend on a boat in the Red Sea as dive-guides to an increasingly fraught and chaotic party of lads' mag journalists and the boat's grossly obese and scarily violent owner.

Reading these six cases, we can see that grossly tends to collocate with obese in the pattern grossly obese (apart from case 5, where morbidly obese and grossly overweight are suggested as equivalents), and that this label is represented as problematic in a range of ways. In line 1 grossly obese people are referred to as hypocritical, in line 3 a doctor who calls someone "grossly obese" refers them to a dietitian (indicating they have a problem), while in line 6 the grossly obese owner of a boat is also described as "scarily violent". We can see negative representations of obese people here, particularly a 'revulsion' discourse which is achieved through the discourse prosody around the word grossly itself,7 but also in terms of the negative representations around such people. In line 1, children are shown as being allowed (by grossly obese parents) to "guzzle sugar-laden fizzy drinks and gorge themselves on fatty takeaways". The choice of the verbs guzzle and gorge, instead of, say, drink and eat, implies greediness, over-indulgence, lack of restraint and ultimately disgust.

Even here, some of the lines are not enough to make a full interpretation. Take line 5 for example, which refers to authorities being happy to see grossly obese people who are not endangering their health by eating calves' liver. I hypothesised that this was part of a sarcastic opinion piece, where obese people were being attacked for some reason, although the main target appeared to be "the authorities". However, it is unclear why the author is focussing on calves' livers. Reading the entire article reveals that it is about a decision to take lightly grilled calves' liver off the menu at the House of Lords, which is represented by the author as "madness" and "health and safety getting out of control". The line referring to grossly obese people occurs in the last sentence of the article and is attributed to Lord Tebbit, who is quoted as being strongly against the decision. Tebbit implies that the officials who made the decision to ban lightly grilled calves' liver are focussing on a trivial issue when there are an "enormous number of grossly obese people" in existence in the UK. Such people are invoked as an authentic problem, contrasted with the decision to ban calves' liver, which is viewed as wrong-headed. Grossly obese people appear, then, to be employed as a justification for the main argument of the article (which is not related to obesity). In a related way, in example 2, the person under criticism is George Osborne, the then Chancellor of the Exchequer, who is blamed for the UK's financial position in an opinion column. Obese people are not the main subject of this article either, but are employed as part of an argumentation strategy, being representative of something which is a problem, in the same way that Tebbit refers to them.

Comparison of Results

Table 4.3 shows the results of the five types of analysis that were carried out (the analysis of collocates and expanded concordance lines and the qualitative analysis of the four sets of ten articles). The findings have been grouped according to the research questions that they addressed, and a tick means that a particular finding was elicited by one of the analytical methods.

Table 4.3  Summary of Findings From the Different Approaches

Research Question: Finding (ticked for each of Set 1, Set 2, Set 3, Set 4 and Set 5 (Corpus) that elicited it)

1 Types of people affected by obesity: Focus on children; Focus on women; Focus on pregnant women and babies
2 Representations of obese people: The labelling of obesity can be unfair or incorrect; Revulsion discourse (obese people are disgusting); Obese people as justification for an argument
3 Consequences of obesity: Obesity causes other health problems; Obesity is a costly burden on the NHS; Being obese is good for you
4 Causes of obesity: Scientific reasons for obesity (e.g. genes, bacteria); Social reasons for obesity (e.g. bullying, stress); Personal responsibility/lifestyle reasons; Caregivers; Political correctness; The European Union; Food and drinks manufacturers and advertisers; Public policy; Obesity has become normalised
5 Legitimation strategies: Quotes of facts and figures from scientific research; Reliance on 'expert' opinion; Use of self-identified obese author
Overall RQ: Focus on problem getting worse

The last row of Table 4.3 does not answer any of the specific research questions, although it was included because it did address the main research question relating to how obesity was represented. The table contains 22 separate 'findings', although no single form of analysis identified all of them, with the individual approaches identifying between 11 and 14 findings each. There is therefore not a large amount of difference in terms of the numbers of findings that the different approaches uncovered, although the corpus approach did slightly better (14 findings) and Set 3 elicited the least (11 findings). The articles in Set 3 were taken from the month containing the highest number of published articles, and while this period was one of intense interest in the topic of obesity, it would appear that as a result the messages and stories around obesity are slightly more uniform than in the sets of articles where other types of sampling were employed. This could perhaps be related to the fact that the articles were only taken from a single newspaper, so there may be a relatively high level of ideological coherence between different articles. Had the corpus consisted of articles from different newspapers we might have seen more debate and oppositional discourses during a period of intense discussion around a topic. With the slightly higher number of findings for the corpus analysis we should also recall that this analysis was carried out last, so I may have been primed to identify a wider range of phenomena due to my previous analytical excursions (although the subsequent second and third rounds of analysis should have reduced this issue to an extent).

The corpus approach appeared to perform best in terms of identifying representations of obese people (RQ2), although it did not produce as many findings for RQ4, which related to causes of obesity. The opposite pattern was found for Set 1 (the sample containing the most frequent references), which suggests that a combination of these two approaches would be especially complementary. Set 2 (articles with the most keywords) identified more consequences of obesity (RQ3) compared to other sets, as well as types of people affected by it (RQ1), although Set 3 (month with the most articles) also performed well for RQ1. Finally, Set 4 performed best at identifying legitimation strategies in the articles (RQ5). Each approach, then, appeared to have particular strengths and weaknesses when we consider the different types of research questions.

Only five findings were uncovered by all approaches—the focus on children as being at particular risk from obesity, the characterisation of obesity as being costly to the National Health Service, representations of food and drinks manufacturers as being at least partly responsible for obesity, the legitimation strategy involving use of expert opinions, and finally, representation of obesity as a problem which is getting worse.

Considering that these findings were identified by all of the analytical techniques, it is likely that they represent popular or frequently cited positions within the newspaper. Three findings were only identified by one approach—the representation of obese people as justification in another argument (via the corpus analysis), the view that being obese is good for you (from Set 2, in an article which claimed that obese people were less likely to suffer from dementia), and legitimation of a stance through the use of a self-identified obese author (from Set 4). These findings could perhaps be viewed as more marginally articulated within the newspaper, compared to those which were found across all the datasets, including the analysis of the whole corpus. It is notable that one finding was identified by all four of the qualitative analyses but was not uncovered by the corpus analysis—the 'deterministic' view that obesity was caused by scientific reasons such as genes or bacteria. Even the analysis of the collocates of causal verbs like related, links, link, linked, blamed, and contributing did not uncover such cases.

With five sets of analysis, there are potentially ten possible ways of making comparisons between different combinations of two sets. Table 4.4 shows the percentage of agreement that different pairs of approaches had, using the same approach as Baker (2015). This was calculated by counting the number of findings that were shared between two sets, dividing this by the total number of findings elicited from them both, and then multiplying by 100 to give a percentage. The last column of Table 4.4 gives the total proportion of findings for each set of analysis. This was calculated by counting the total number of findings for each set (for example, Set 1 resulted in 13 findings), and then dividing this by the total number of findings made by the different sets of analyses collectively (which was 22). Again, this figure was multiplied by 100 and rounded to the nearest whole number to be expressed as a percentage, for example (13/22) × 100 ≈ 59%. Table 4.4 indicates that all of the paired comparisons resulted in between 32% and 63% of shared findings, with no particular approach being hugely different from or similar to another, with the analysis from Set 1 (articles with the most mentions of the search terms) and Set 4 (articles taken at random) being most similar (63%) and Sets 2 (articles with the most keywords) and 4 (random articles) being most different (32%).

Table 4.4  Proportions of Agreement Between Pairs of Analysis

        2      3      4      5 (Corpus)   Total Findings
1       41%    50%    63%    42%          59%
2              53%    32%    44%          57%
3                     50%    56%          50%
4                            50%          59%
5                                         64%
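The agreement measure can be stated compactly: for any two approaches, it is the number of shared findings divided by the number of distinct findings elicited by either of them, multiplied by 100 and rounded. A minimal sketch of that calculation follows; the finding labels are illustrative placeholders rather than the actual findings from Table 4.3.

```python
def agreement(findings_a, findings_b):
    # Shared findings as a percentage of all distinct findings from the two approaches.
    shared = len(findings_a & findings_b)
    total = len(findings_a | findings_b)
    return round(100 * shared / total)

# Hypothetical example with placeholder finding labels.
set1 = {"focus on children", "costly to the NHS", "expert opinion", "problem getting worse"}
set4 = {"focus on children", "costly to the NHS", "expert opinion", "obese author"}
print(agreement(set1, set4))  # 3 shared out of 5 distinct findings -> 60
```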

Table 4.5  Combinations of Sets and Number of Findings Elicited by Each Combination

1 Set  Findings    2 Sets  Findings    3 Sets     Findings    4 Sets        Findings
1      13          1, 2    18          1, 2, 3    19          1, 2, 3, 4    21
2      12          1, 3    16          1, 2, 4    21          1, 2, 3, 5    22
3      11          1, 4    16          1, 2, 5    21          2, 3, 4, 5    22
4      13          1, 5    19          1, 3, 4    18
5      14          2, 3    15          1, 3, 5    20
                   2, 4    19          1, 4, 5    20
                   2, 5    17          2, 3, 4    20
                   3, 4    16          2, 3, 5    20
                   3, 5    17          2, 4, 5    21
                   4, 5    17          3, 4, 5    20

Table 4.5 indicates which combinations of approaches would result in the greatest number of unique findings. When combining two approaches, Set 1 (articles which had the highest number of mentions of the search terms) combined with the collocational/concordance analysis tied for the most findings (19) with the combination of Sets 2 (articles with most keywords) and 4 (random selection). For three approaches, it was some combination of Set 1, Set 2, Set 4 or Set 5 which elicited the most findings (21), whereas any combination of the corpus approach plus three sets gave all 22 findings. The least effective paired combination appears to have been Set 2 (the most keywords) and Set 3 (articles taken from the month with the most articles).
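The counts in Table 4.5 follow directly from Table 4.3: the findings elicited by a combination of approaches are simply the union of the findings each approach elicited on its own. A minimal sketch of that enumeration follows, again with placeholder data rather than the actual findings.

```python
from itertools import combinations

# Placeholder mapping from approach to the findings it elicited.
findings_by_set = {
    "Set 1": {"f1", "f2", "f3"},
    "Set 2": {"f2", "f4"},
    "Set 3": {"f1", "f4"},
    "Set 4": {"f3", "f5"},
    "Corpus": {"f1", "f2", "f5"},
}

def coverage_table(findings_by_set):
    # For every combination of approaches, count the distinct findings covered.
    rows = []
    for r in range(1, len(findings_by_set) + 1):
        for combo in combinations(findings_by_set, r):
            covered = set().union(*(findings_by_set[s] for s in combo))
            rows.append((combo, len(covered)))
    return rows

for combo, n in coverage_table(findings_by_set):
    print(" + ".join(combo), "->", n, "findings")
```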

Benefits and Drawbacks of Triangulation

No analytical approach is likely to be as good as every other valid approach at uncovering every potential finding, and this study supports earlier triangulation studies involving corpora (e.g. Marchi & Taylor, 2009; Baker & Levon, 2016) that conclude that combinations of approaches are generally more productive than a single approach on its own. Reflecting on the different types of analysis used in this chapter, I note some of the advantages that one type potentially has over others. For example, one advantage of the corpus analysis was that it enabled a more accurate impression of how a word may have multiple functions. Take for example the collocate blamed, which occurred with obesity ten times. My initial interpretation of this word was that certain social actors or social practices were being blamed for obesity. However, the concordance analysis found that this was only the case for three concordance lines, for example: "Poor diet and lifestyle have been blamed for the certain obesity epidemic".

In the other seven concordance lines, obesity was being blamed for other (negative) phenomena, for example: "Obesity blamed as snoring soars". In performing better at identifying representations of obese people, we see the collocates approach come into its own—we do not always require a great deal of co-text in order to identify these kinds of representational patterns. However, for other types of research questions, a close reading of whole texts appears more helpful, with the collocates/concordance approach not doing so well at eliciting information about what was seen as responsible for obesity—perhaps because this is encoded in less repetitive patterns and/or in parts of the articles that occurred further away from the node words being examined (although a different corpus approach, e.g. keywords, may have been more fruitful here).

The corpus analysis enabled more certainty that particular linguistic patterns (and the discourses, representations or legitimation strategies associated with them) were reasonably frequent across the whole data set. The number of times that obesity collectively collocated with the words epidemic, timebomb, crisis, tide, and grip was 234, with epidemic and crisis being particularly high (119 and 91 cases respectively). Consulting frequency information allowed the identification of what was representative or typical in the corpus and what was a more unusual, extreme or marginalised position. So apart from grossly, none of the collocates of obese and obesity appeared to give directly offensive representations of obese people, and even grossly only occurred six times as a collocate of obese. While a sense of disgust at obese people was conveyed in the language used by some authors, this appeared to be limited to a small number of cases. On the other hand, for the four qualitative analyses, this sense of disgust was not found in Sets 1, 2 or 3, but it was noted in one of the articles in Set 4 (which consisted of articles taken at random), an opinion piece written by a woman who identified as obese. In this article the author referred to herself and other obese people as "fatties", as well as using phrases like "flabby fright", saying that her reflection "fill[ed] me with revulsion. . . . Now I look at my double chin and bingo wings and want to weep" and that "my feet [are] the only part of my body I don't feel ashamed of". Another article in this set (also written by a woman who identified as obese) also made use of the term "fatties". With two articles in this randomly selected set which both referred to obese people in less than polite language, and both being written by obese people, there is perhaps a danger in seeing this as a commonly occurring feature of the Daily Mail's writing about obese people ("fatties" only occurred in 14 out of 1,187 articles, 1.18% of the corpus). However, the other qualitative analyses did not find it, and the corpus analysis only uncovered one case where obese people were clearly described with a sense of disgust. A potential result of random selection, then, is that some patterns may end up being overstated due to chance.

And while the random set of articles did uncover some interesting answers to the research questions and resulted in a respectable number of findings, we should bear in mind that it was the only way of identifying texts which was not systematic. A repeated version of the study would always obtain the same articles in terms of numbers of mentions of the search term or the month with the most articles, but a random set will be different each time. The random set I obtained was just one of millions, so we cannot make definitive conclusions on the basis of a random set unless we carry out a follow-up study based on analysing numerous random sets. The analysis of Set 4 did allow a link to be made between a representation (that of obese people as disgusting) and the legitimation strategy used with it (the self-proclaimed obese author). This helps us to better understand how the newspaper is able to articulate the revulsion discourse without encountering too much censure.8

Another advantage of the qualitative analysis was that it showed how different legitimation strategies or combinations of causes of obesity occurred in a single article, also showing that some were more foregrounded than others. Prior to the analysis, I had expected the Daily Mail to largely take a 'personal responsibility' stance regarding obesity, putting blame at the individual level, which Lupton (1995) refers to as the neo-liberal stance of taking personal accountability for health. While every approach except Set 2 found evidence for the personal responsibility stance, the different qualitative analyses indicated that it rarely occurred as the only position in a single article but that most of the articles actually featured multiple voices giving different views about who or what was responsible for obesity, a finding which echoes Kwan and Graves (2013, p. 2), who write "the mass media regularly bombard us with news about the fat body, including the latest scientific research. Some of this information is consistent at face value but some of it . . . does not seem to add up". Additionally, the qualitative analyses showed that the personal responsibility stance was often linked to personal opinion—either in opinion columns, or attributed through quotes from people like Tam Fry of the National Obesity Forum. So in the example article from Set 1 that was analysed earlier, the scientist who carried out the study which was the topic of the article blamed public health strategies for not working, whereas Fry was quoted as saying that people were not listening. The two positions could be seen as compatible (e.g. health strategies are not working because people do not listen to them), but this is something which is implied through the positioning of the two quotes in relation to one another, whereas the article could have been written differently to imply that public health strategies are not working for some other reason (such as being intrinsically flawed or having to compete with adverts for fatty food and sugary drinks in the media to get their message across).

The qualitative analysis indicates, then, that the stance of an article can be swayed by editorial decisions to include particular voices (at the expense of others), and that clues can be identified regarding a newspaper's ideological position if we consider which voices were selected to be included but were non-essential to the news story. And it is not the case that the newspaper hammers home one ideological position to the exclusion of all others. Instead it was found that many articles contained "competing constructions" (a term used in the title of the book by Kwan & Graves, 2013), which could be conceived as a form of 'balanced reporting' but might also mean that the impetus for particular changes (such as greater regulation of advertising or levies on manufacturers of sugary drinks) is diluted. So it is difficult to find a coherent stance on obesity, even within a single Mail article, other than that it is a growing problem and a burden on society.

For the purposes of this comparative study, as much as possible, I attempted to carry out the different types of analysis separately from one another, while acknowledging that this was difficult to achieve, and each subsequent analysis built upon what had been done previously. However, this could be taken further. For example, having identified in Sets 2 and 3 that women seemed to be a particular focus of the articles, I could have used targeted searches on the whole corpus for relevant words relating to gender in order to see if this was a generalizable finding.9

In terms of possible drawbacks of triangulation, the amount of time that it takes to undertake an analysis could potentially be seen as an issue. It should be noted that the approaches carried out on their own elicited 11–14 findings (average 12.6), whereas two approaches resulted in 15–19 findings (average 17), three gave 18–21 findings (average 20) and four gave 21–22 (average 21.6). In other words, adding further approaches, for this study at least, appears to have resulted in diminishing returns. This is perhaps unsurprising—some techniques will elicit findings that are more representative of the general picture (such as the focus on children at risk) while others will uncover less frequent phenomena (such as use of a self-identified obese author). The more representative findings are likely to be encountered again and again, whereas the marginal ones will be harder to discover. Within the diminishing returns we are perhaps then more likely to find the rarer cases, although it could be argued that we can at least identify them as being rare, which would be difficult to do if we had only considered them as part of a single small sample. For this reason, I would advocate that an ideal process would involve combining the corpus analysis with one or more of the qualitative analysis samples, as this would enable a better understanding of the representativeness of a finding but also allow us to uncover features of discourse that go beyond the concordance line. However, I would return to the point made in Baker et al. (2008), that a corpus-assisted critical discourse analysis ought to be iterative, cycling back and forth between qualitative and quantitative, allowing each type of analysis to produce new questions and hypotheses that can be explored through a range of methods.


Notes

1. www.pressgazette.co.uk/nrs-daily-mail-most-popular-uk-newspaper-print-and-online-23m-readers-month/
2. Other search terms like overweight, fat or fatty were considered when building the corpus, but it was felt that this would introduce a further level of complexity, so the study focusses on constructions around the terms obese and obesity.
3. ProtAnt (Anthony & Baker, 2015) is a tool for analysing the prototypicality of texts. This is achieved by identifying keywords across a set of texts (by comparing those texts against a reference corpus). The tool then orders the texts in terms of the (relative) frequency of keywords (either in terms of types or tokens) they contain. In experiments the tool performed well at identifying outlier texts from a corpus (as such texts contained fewer keywords), as well as identifying the texts that were most typical of a corpus. However, the tool tends to perform best on texts of a similar word length, and because the corpus used here contained articles of different lengths (the shortest was 71 words while the longest was 3,787 words), using ProtAnt's standard settings would have resulted in skewed results, with text rankings simply correlating with their length. In order to adjust for this, a smoothing correction was applied when counting keywords, which involved taking the logarithm of the total number of key tokens in each file and dividing it by the logarithm of all of the tokens in the respective file.
4. For example, I initially formulated a sub-question 'What is represented as the cause of obesity?' but later refined this to 'What is represented as the cause of obesity, and what actions are suggested in order to reduce obesity?' as causes of obesity were often linked to explicit suggestions for action.
5. Collocates were ordered according to their Mutual Information score, using a span of five words either side of the search term. All collocates had an MI score above 6 and a T score above 2 and occurred at least five times with the node word. These cut-off points were reached by carrying out experiments with different spans and minimum frequencies and were found to give a reasonably productive set of collocates in terms of allowing a range of themes to be identified. The decision to use the top 50 collocates was made in order to produce a consistent number for each of the search terms which would be large enough for semantic preferences (if present) to emerge.
6. This sense of increase being represented via movement as being problematic is particularly denoted by the verb spiralling, and one direction the analysis could take would be to consult a larger reference corpus in order to determine what else tends to be represented as spiralling, priming readers to interpret the word positively or negatively when they encounter it in other contexts. For example, in the 100-million-word British National Corpus, spiralling collocates with costs, inflation, and prices and tends to be used to refer to items becoming more expensive, which is viewed negatively.
7. In the British National Corpus grossly modifies a range of negative adjectives like ugly, offensive, indecent, deviant, inefficient, misleading, unfair, and polluted, as well as a further range of adjectives relating to size: bloated, swollen, distended, extended, huge, inflated, and fat.
8. This strategy appears to be a fairly commonly utilised one in newspaper discourse. For example, in Baker (2014) I discuss how gay people were quoted as giving negative opinions about homosexuality, which helped to legitimate the generally negative stance towards gay people in the Daily Mail.
9. In fact a search of man/men/male/males resulted in 1,341 occurrences while woman/women/female/females occurred 2,510 times, almost twice as often.

Analysing Representations of Obesity  107 pattern when it noted the focus on women as opposed to men. To be fair, another type of corpus procedure, the keywords approach, would have identified this too, as women appeared in the top 30 keywords when the corpus was compared against the BE06 corpus (men did not appear anywhere in the keyword list). However, none of these female identity words were among the top 50 collocates of obese or obesity, although the collocates expecting and maternal did occur. These words only occurred as collocates five and seven times respectively, so it would be difficult to make a case for greater focus on women due to their appearance alone.
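The length-smoothing correction described in note 3 amounts to a one-line calculation per file. The sketch below is illustrative only (it is not ProtAnt's actual code), and the counts used in the example are hypothetical.

```r
# Smoothed keyword score for one text: the log of the number of key tokens
# in the file divided by the log of the total number of tokens in that file.
# This damps the advantage that longer articles would otherwise have.
smoothed_key_score <- function(key_tokens, total_tokens) {
  log(key_tokens) / log(total_tokens)
}

# Hypothetical comparison of a short and a long article from the corpus:
smoothed_key_score(key_tokens = 9,  total_tokens = 71)    # approx. 0.52
smoothed_key_score(key_tokens = 60, total_tokens = 3787)  # approx. 0.50
```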

References
Anthony, L. (2014). AntConc (Version 3.4.3) [Computer software]. Tokyo, Japan: Waseda University. Retrieved from www.laurenceanthony.net/
Anthony, L., & Baker, P. (2015). ProtAnt: A tool for analysing the prototypicality of texts. International Journal of Corpus Linguistics, 20(3), 273–292.
Baker, P. (2009). The BE06 corpus of British English and recent language change. International Journal of Corpus Linguistics, 14(3), 312–337.
Baker, P. (2014). Using corpora to analyse gender. London: Bloomsbury.
Baker, P. (2015). Does Britain need any more foreign doctors? Inter-analyst consistency and corpus-assisted (critical) discourse analysis. In N. Groom, M. Charles, & S. John (Eds.), Corpora, grammar and discourse: In honour of Susan Hunston (pp. 283–300). Amsterdam: John Benjamins.
Baker, P., Gabrielatos, C., Khosravinik, M., Krzyzanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse and Society, 19(3), 273–306.
Baker, P., & Levon, E. (2015). Picking the right cherries? A comparison of corpus-based and qualitative analyses of news articles about masculinity. Discourse and Communication, 9(2), 221–236.
Baker, P., & Levon, E. (2016). "That's what I call a man": Representations of racialized and classed masculinities in the UK print media. Gender & Language, 10(1), 106–139.
Caulfield, T., Alfonson, V., & Shelley, J. (2009). Deterministic? Newspaper representations of obesity and genetics. The Open Obesity Journal, 1, 38–40.
Couch, D., Thomas, S. L., Lewis, S., Blood, R. W., & Komesaroff, P. (2015). Obese adults' perceptions of news reporting on obesity: The panopticon and synopticon at work. Sage Open, 5(4), 2158244015612522.
Gabrielatos, C., & Baker, P. (2008). Fleeing, sneaking, flooding: A corpus analysis of discursive constructions of refugees and asylum seekers in the UK press 1996–2005. Journal of English Linguistics, 36(1), 5–38.
Holland, K., Blood, R. W., Thomas, S. I., Lewis, S., Komesaroff, P. A., & Castle, D. J. (2011). 'Our girth is plain to see': An analysis of newspaper coverage of Australia's future 'Fat Bomb'. Health, Risk & Society, 13, 31–46.
KhosraviNik, M. (2010). The representation of refugees, asylum seekers and immigrants in British newspapers: A critical discourse analysis. Journal of Language and Politics, 9(1), 1–28.
Kwan, S., & Graves, J. (2013). Framing fat: Competing constructions in contemporary culture. New Brunswick, NJ and London: Rutgers University Press.
Lupton, D. (1995). The imperative of health: Public health and the regulated body. London: Sage.
Marchi, A., & Taylor, C. (2009). If on a winter's night two researchers . . . A challenge to assumptions of soundness of interpretation. Critical Approaches to Discourse Analysis across Disciplines, 3(1), 1–20.
Wulff, S., & Baker, P. (forthcoming). Analyzing concordances. In S. Gries & M. Paquot (Eds.), The practical handbook of corpus linguistics. New York, NY: Springer.
YouGov. (2017). How left or right wing are the UK's newspapers? https://yougov.co.uk/news/2017/03/07/how-left-or-right-wing-are-uks-newspapers/

5 Connecting Corpus Linguistics and Assessment

Geoffrey T. LaFlair, Shelley Staples, and Xun Yan

There is a growing interest both in the use of corpus linguistic methods in language assessment research and in the use of assessment in corpus linguistic research (cf. Biber, Gray, & Staples, 2016; LaFlair, Staples, & Egbert, 2015; Staples, 2015; Weigle & Friginal, 2015). To explore the triangulation of methods in these two research areas, this chapter has three main goals: (i) to survey the ways in which corpus linguistics and assessment research can complement each other; (ii) to demonstrate how corpus linguistics is used to evaluate evidence for the scoring inference of speaking and writing assessments; and (iii) to reflect on the benefits and limitations of triangulating corpus linguistics and assessment methods. We compiled two corpora and carried out two case studies. One corpus was composed of responses to the (independent) writing task on Michigan Language Assessment's (MLA) Examination for the Certificate of Proficiency in English (ECPE). The second corpus was constructed from responses to the MLA's Michigan English Language Assessment Battery (MELAB) speaking test. We used a popular corpus linguistic method, multi-dimensional (MD) analysis, to compare representative linguistic features of the writing and speaking performances across proficiency levels and compared the dimensions of language use to the language constructs that were represented in the writing and speaking tasks' rubrics. The two studies demonstrate that triangulating register-based corpus linguistics and assessment research has the potential for informing both areas. For language assessment, the corpus analysis can help identify co-occurrence patterns of linguistic features that characterize different levels of performance, which may or may not reflect the constructs represented in the rubric. These empirical data can be used to evaluate the extent to which test takers use language that was expected given the descriptors in the rubrics. A lack of overlapping linguistic features between the rubric and the MD analysis highlights the potential use of corpus-based register analysis as a bottom-up approach to rubric construction. For corpus linguistics, there is a potential to make positive changes in how reference corpora are specified and created. If we want reference

corpora to represent domains of language use that test takers will be participating in during their studies or professional lives, there is a greater need to ensure balance and representativeness in corpus design. Second, assessment methods allow corpus linguists to understand the language used by speakers and writers more fully, particularly with regard to language development but also in terms of the (perceived) real-world impact of language use.

Introduction
The purpose of this chapter is to draw connections between corpus linguistics and assessment in order to illustrate how language researchers could employ methods and theories from both in order to answer their research questions. There is a wide range of situations in which a language researcher might pair the results of an assessment tool with corpus analysis. In this chapter, we explore relationships between use of language on the test and the test taker's score level, namely, the extent to which the language used by test takers at different score levels in a writing task and a speaking task reflects the language that the rubric targets (e.g., grammar or vocabulary) at those same score levels. In order to investigate this relationship, we draw on methods and theories from corpus linguistics and assessment. From corpus linguistics, we examine the language at different score levels through the lens of corpus-based register analysis, specifically MD analysis (Biber & Conrad, 2009). From assessment, we draw on Bachman and Palmer's (1996, 2010) target language use (TLU) domain analysis and Kane's (2013) argument-based approach to evaluate the validity of the interpretations and uses of a test. We apply MD analysis in order to investigate evidence for the scoring inferences associated with the Examination for the Certificate of Proficiency in English (ECPE) writing test and the Michigan English Language Assessment Battery (MELAB) speaking test. A strength of MD analysis for this type of research is that it can reduce a large number of linguistic variables to a smaller set of dimensions of co-occurring linguistic features. The dimensions are interpreted with respect to the shared communicative functions of the linguistic features that make up each dimension. These dimensions can be compared with the constructs measured by the scoring rubrics (e.g., fluency and vocabulary). Commonalities between the dimensions and the constructs of the scoring rubric across score level are indicators of evidence that supports the scoring inference.

Connections: Corpus Linguistics and Assessment
There are many situations in which a language researcher may be interested in drawing on methods in corpus linguistics and language assessment. One such situation is to compare the linguistic features of a corpus

with some other outcome measure (e.g., proficiency level, perception of effectiveness of an interaction). For example, researchers who investigate interpersonal/inter-cultural communication between nurses and patients may be interested in the relationship between the linguistic features of nurse-patient interactions and patients' perception of the quality of the interaction (see Staples, 2015). Or researchers may be interested in what genres/registers of reading are perceived as more difficult by lay readers and how the difficulty ratings are related to linguistic characteristics of the texts (see Egbert, 2014, 2015). Finally, second language researchers may want to measure how learner language varies across proficiency levels as determined by a rubric, which is the case we investigate here. A number of different corpus methodologies can be applied in language assessment research. Much like the choice of a statistical analysis, the corpus-based methodology employed should reflect the type of question being asked. Two major analytical approaches in corpus-based analyses of assessment include the use of holistic measures (e.g., T-units or C-units) and the use of individual linguistic features. T-units are usually operationalized as independent clauses and all associated dependent clauses (Hunt, 1965); C-units are generally the same as T-units but also include non-complete sentences (Crooks, 1990). One of the major differences between these two approaches is the influence of register. With a T-unit approach, the same (or similar, in the case of C-units) variables are used across registers—no matter whether the assessment is spoken or written, or intended for general purposes (e.g., conversational ability, such as an oral proficiency interview) or academic purposes (e.g., to gauge students' readiness for academic writing at the college level, such as the TOEFL iBT). In contrast, individual linguistic features (and particularly groups of linguistic features, which are the result of MD analysis) can be chosen and interpreted functionally according to register. We argue for the use of linguistic features based on register characteristics to analyze score level differences, thus, a corpus-based register approach. This framework allows us to incorporate linguistic features and the situational characteristics of a register in validity studies, both of which are relevant in investigating the evidence for or against the interpretations and uses of an assessment. MD analysis is a method used in the corpus-based register approach that allows for the integration of co-occurring linguistic features and situational characteristics. The application of MD analysis in this context seems particularly appropriate because rubrics organize how linguistic features are expected to vary across proficiency levels and areas of language knowledge and use such as grammatical complexity, lexical range, and fluency in order to help raters make decisions about test taker proficiency. Multi-dimensional analysis offers an alternative, bottom-up, and empirical approach to understanding how observed linguistic features vary across score levels in speaking and writing assessments

along dimensions of language use. After the linguistic features in the corpus have been identified and counted, the MD analysis starts with a factor analysis (a statistical analysis) that identifies which features co-occur. The linguistic features that emerge from the factor analysis as groups of co-occurring features comprise a factor (as determined by factor loadings). The factors become dimensions after the researchers interpret the overarching communicative functions of the co-occurring features. In this chapter, we compare the dimensions from two MD analyses (from a writing assessment and a speaking assessment) with the categories found on the associated rubrics. We argue that this approach can be used to compare the empirical dimensions of language with dimensions found on the rubric as part of regular investigations of productive (speaking and writing) language assessments. Language assessments are regularly used as tools for understanding how well people comprehend and use language. The interpretations and uses of language assessments have consequences for stakeholders (e.g., test takers, programs, institutions, organizations). It should be regular practice for large-scale language assessments to be evaluated to ensure there is evidence that supports their interpretations and uses (Bachman & Palmer, 2010; Kane, 2013). In this chapter we argue that the principles, foundations, and methodologies that have been developed in the field of corpus linguistics can play an influential role in how the actual language in language assessments is evaluated. There is such a need for evaluating the interpretations and uses of language assessments because they are a snapshot of a portion of a person's language ability at a given time. These snapshots are interpreted as, for example, in the case of the MELAB speaking task, test takers' abilities to express themselves fluently in a wide range of spoken situations of language use (i.e., social, academic, and professional; Michigan Language Assessments [MLA], 2018). These types of claims require systematic evaluation of both the language of the test (i.e., listening/reading passages and speaking/writing performances) and the situations of language use (e.g., registers or target domains) that test users make inferences to on the basis of test takers' scores (i.e., social, academic, and professional domains). In making use of an argument-based approach to validity studies, researchers and assessment specialists evaluate pieces of information that support, or refute, key inferences that test users make when interpreting and using scores from a language test (Chapelle, Enright, & Jamieson, 2008; Kane, 2013). Minimally, there are three inferences that test users make: a scoring inference, a generalization inference, and an extrapolation inference (Kane, 2013). These three inferences form a logical sequence in which the evaluation of scoring inferences should precede the evaluation of the latter two inferences (Chapelle, Enright, & Jamieson, 2010). The scoring inference, which we investigate in this chapter, starts with the actual performance and moves to an observed

score, or number representing the test taker's ability. This inference can be supported with evidence pertaining to the quality and appropriateness of the scoring method (i.e., a rubric based on communicative language abilities and trained scorers). A corpus-based register approach provides effective methods for evaluating the linguistic evidence related to these inferences.

The Scoring Inference and Register Analysis
The scoring inference is what allows test users to make the step from an observed performance (e.g., spoken or written task) to a numerical representation of the performance (i.e., a score awarded to the performance by one or more raters) (Kane, 2013). The strength of the inference that connects the observed performance to a numerical score depends on the evidence that the scoring rubric and task are appropriate for measuring the targeted language skills and that there is statistical evidence to support decisions about examinees (Chapelle et al., 2010). For researchers employing corpus-based methods this largely means investigating how language use differs across score levels and how well that variation mirrors what is expected by the rubric. Most studies of productive language assessments that use corpus methods have focused on analyzing a large number of features and have considered them as individual variables in quantitative analyses. These studies tend to incorporate a statistical test (e.g., correlation, t test, or ANOVA) in order to identify what linguistic features occur at significantly different rates across score levels. These studies tend to find that characteristics related to length (e.g., longer words, t-units, longer texts) and features related to fluency (e.g., rate of speech) are positively related with score level (for an overview of these studies, see LaFlair & Staples, 2017). This approach is perhaps best suited for research questions about which individual features play the largest role in predicting proficiency level. However, for researchers who are interested in the groups of linguistic features that test takers are using for various communicative functions, the corpus-based register approach paired with MD analysis allows researchers to consider how linguistic features co-occur. A reason for using this second approach is that linguistic features tend to co-occur with each other, and patterns of co-occurrence tend to vary by register and communicative function (cf. Biber, 1988; Biber, 2006; Biber, Conrad, & Cortes, 2004). Furthermore, these co-occurring sets of features tend to have functional interpretations (such as interpersonal functions in speaking) when considered as a group of features. In other words, speakers use sets of co-occurring features to accomplish communicative tasks. By conducting statistical tests with individual linguistic features as the predictor variables, the focus of the interpretation shifts from examining how test takers are using language

to achieve communicative goals to examining the discrete units of language that occur in test taker speech or writing. The resulting linguistic features certainly have individual communicative functions. However, an argument that test takers are using language for specific communicative purposes (e.g., rapport with an interlocutor) is better supported when there are a larger number of linguistic features that work together toward achieving the communicative purpose. Thus, if a researcher is interested in understanding how linguistic features function in concert in test taker production, they require techniques such as factor analysis. These analyses help the researcher reduce a large number of variables to a small set of interpretable variables. One such method in corpus-based register analysis is MD analysis (Biber, 1988; Biber & Conrad, 2009), in which researchers approach test performances as registers of language use. LaFlair and Staples (2017) highlighted the connections between the theoretical foundations (i.e., the communicative language movement) of corpus-based register analysis and a component of the test task creation process known as TLU domain analysis (Bachman, 1990; Bachman & Palmer, 1996; Bachman & Palmer, 2010). Both frameworks seek to understand and account for the characteristics of language use situations that can affect the types of linguistic features that speakers (or writers) use, but they differ in the reasons for investigating these characteristics. Corpus-based register analysis describes similarities and differences in uses of linguistic features across different situations of language use (i.e., registers), often to understand language variation (descriptive, sociolinguistic goals). In developing test tasks, especially productive test tasks (i.e., speaking and writing), assessment specialists seek to use the characteristics of targeted registers of language use (e.g., conversations and presentations) to develop tasks that will elicit language from test takers that matches the language in the targeted registers. LaFlair and Staples (2017) demonstrated how corpus-based register analysis could be used to evaluate the extrapolation inference by investigating how well the elicited language from a speaking task approximated target registers of conversation, nurse-patient interaction, university office hours, service encounters, and study groups. The processes used in corpus-based register analysis are also applicable to evaluating how language is used at different score levels in a test task and how those differences relate to the descriptions of the score levels on the rubric. In summary, corpus-based methods are directly applicable to language assessment research that seeks to evaluate linguistic evidence for key inferences. Furthermore, MD analyses have shown that when the patterns of co-occurrence, or correlated features, are reduced through the use of factor analysis, the groupings of features often have clear, functional interpretations (LaFlair & Staples, 2017). The value of this approach was underscored in Biber, Gray, and Staples' (2016) investigation of productive

tasks on the TOEFL iBT. They conducted an MD analysis on spoken and written responses to the TOEFL iBT to examine how well co-occurring linguistic patterns could differentiate between written and spoken tasks and how well co-occurring linguistic patterns could predict test takers' scores. Their first dimension, Literate vs. Oral (nouns, prepositional phrases, adjectives vs. verbs, pronouns, verb + that-clauses) distinguished between mode (speech and writing) and task type (integrated and independent). This dimension showed positive correlations with score (higher scoring responses were associated with more literate uses of language). Additionally, they show support for the scoring inference as higher scoring test takers used more Literate language than low scoring test takers. However, this pattern is not necessarily tied to how the TOEFL writing and speaking rubrics describe differences in language use across proficiency levels. Additionally, Weigle and Friginal (2015) compared the co-occurring linguistic features of TOEFL iBT independent writing with disciplinary writing in the Michigan Undergraduate Corpus of Upper-level Student Papers (MICUSP) through the lens of MD analysis. They found that the performances on the test tasks used more features that are typical of narratives, personal opinions, situational discourse, and expressions of possibility than disciplinary writing. They also found that more proficient L2 English learners (the MICUSP papers) tended to use more linguistic features that are associated with academic writing. These studies highlight the application of MD analysis to productive language assessments. They make steps toward demonstrating the use of MD analysis for investigating the extrapolation inference (LaFlair & Staples, 2017; Staples, Biber, & Reppen, 2018; Weigle & Friginal, 2015) and investigating the use of MD analysis as a method for examining the relationship between test score and dimensions of language use (Biber et al., 2016; LaFlair & Staples, 2017; Staples et al., 2018; Weigle & Friginal, 2015). The latter application takes steps toward establishing a method that uses MD analysis to examine the scoring inference. These studies have shown that dimension scores do positively or negatively correlate with rubric score and that they can predict performance on the test. However, these types of relationships are more meaningful for the scoring inference if the linguistic features that comprise the dimensions reflect the features that are targeted in the rubric score bands. The descriptors in the scoring band of a rubric represent the features that are expected to mark each level of ability. Ideally, performances that share the same rubric scores will also share the same linguistic features, those that are described in scoring bands. The studies described here go one step further by relating the patterns of language use uncovered by MD analyses to the rubric descriptors at different proficiency levels. The following sections report on the methods and results of two studies that used MD analysis to investigate the linguistic features of a writing

task (Study 1: ECPE) and a speaking task (Study 2: MELAB), how the linguistic features varied across score level, and the extent to which the linguistic features and their dimensions reflected the linguistic features that are measured by the rubric. We treat the speaking and writing performances as two registers of language use, which allows for a demonstration of how this approach works in these two different domains. We also show how the co-occurring linguistic features convey particular functions within the context of the language assessment.

Study 1: ECPE
The first study focuses on the ECPE writing test, which features an independent essay task that asks test takers to compose a written response to a statement on a controversial topic. We focus here on a subset of results from a larger analysis (see Yan & Staples, accepted). Here, this case study is used to illustrate the ways in which the scoring inference within a validity argument framework for assessment can be explored using principles of corpus-based register analysis, treating the ECPE as a written register. As discussed earlier, the scoring inference focuses on how language use differs across score levels and its connection to the rubric. Thus, in this study we first examine two dimensions from an MD analysis that illustrate differences across score levels and end by discussing their connections to the rubric. The corpus for this study was created by two of the authors (Staples and Yan) as part of a CaMLA (Cambridge Michigan Language Assessments) research grant (Yan & Staples, 2016). The corpus comprises 595 essays which were identified by CaMLA (now called Michigan Language Assessment) through a random stratified sampling process and represent essays from five different proficiency levels, A through E (A being the highest and E the lowest; see Appendix A for the ECPE writing scale). Each essay is rated independently by two certified raters using the ECPE analytic rating scale; an essay is assigned to a third rater if the original two raters have nonadjacent scores for any of the analytic scoring categories. The ECPE essays in this study were written on one of three prompts (described in Table 5.1) and were transcribed by trained transcribers.

Table 5.1  Writing Prompts Represented in the ECPE Corpus

Prompt   Topic
P1       Use of text messages and internet chat rooms to communicate among teenagers
P2       Use of educational technologies in the classroom
P3       Advantages and disadvantages of a country having a large tourism industry

The resulting corpus includes 177,163 words, with an average essay length of around 300 words. Table 5.2 provides an overview of the ECPE corpus. Most of the test takers were from Southern Europe and South America, with Greek as an L1 comprising the largest group of test takers (88.7%), followed by Spanish (4.7%) and Portuguese (3.5%). Although age and reasons for taking the test were not available for the specific test takers in this study, according to a 2016 ECPE report most ECPE test takers are under 20 years old (67.34%), with 13–16-year-olds being the largest age group (58.13%). The majority of ECPE test takers take the test for educational (34.68%), employment (30.83%), and personal (29.24%) purposes (Cambridge Michigan Language Assessments [CaMLA], 2016a).

Table 5.2  The ECPE Essay Corpus

Prompt    A     B     C     D     E    Total
P1        30    47    60    47    15   199
P2        30    41    60    41    25   197
P3        30    46    60    47    16   199
Total     90    135   180   134   56   595

An MD analysis was conducted to identify linguistic and functional characteristics of the test takers' discourse at the five score levels. The linguistic features used for the MD analysis included grammatical and lexico-grammatical features, lexical bundles, and vocabulary measures. We chose variables that have been found to be important indicators of writing development, or important for differentiating more oral texts from literate texts, thus taking a register approach to our analysis (e.g., Biber, 1988; Biber & Gray, 2013; Biber, Gray, & Poonpon, 2011). Such features included both clausal features (e.g., finite adverbial clauses) and phrasal features (e.g., attributive adjectives). In addition to grammatical features, lexico-grammatical features were included (e.g., likelihood verbs + that complement clauses, such as thinks that). Interactions between lexis and grammar are known to express important functional aspects of language use (such as stance). Another way of capturing lexico-grammatical patterns is through the use of lexical bundles, which have been shown to be an important marker of writing development (e.g., Staples, Egbert, Biber, & McClair, 2013). Finally, lexical variety and frequency are important aspects of language development in general, but particularly in writing (e.g., Laufer & Nation, 1995; Schoonen, Gelderen, Stoel, Hulstijn, & Glopper, 2011; Yu, 2010). Table 5.3 lists the variables included in our study of the ECPE.

For most of the variables, we calculated rates of occurrence for each text normed to 1000 words. For the lexical bundles, we calculated the proportion of bundles that were included in the prompt (prompt bundles; e.g., a large tourism industry) or were not included in the prompt (generic bundles; e.g., in order to) and the proportion of generic bundles that were used for stance and referential bundles (Biber et al., 2004; Simpson-Vlach & Ellis, 2010). Vocabulary frequency bands were also calculated proportionally (the percentage of vocabulary that fell in each of the three bands: 0–500, 501–3000, greater than 3000). Word count was calculated for each text, as well as the average length of words for each text, and the ratio of the number of types of words to the number of tokens. Details of our methods of coding and analyzing these features can be found in Yan and Staples (accepted).

Table 5.3  Variables Included in the Multi-dimensional Analysis

Variable                                          Description/Example
Grammatical and syntactic features
Finite adverbial clauses                          . . . because they are "English" in their customs and practices.
Likelihood verb + that clause                     We think that Eulalie writes or speaks her monologue . . .
Stance noun + that clauses                        The belief that people could be distinguished by . . .
Present tense verbs                               verb (uninflected present, imperative & third person)
Have as a main verb                               have
Mental verbs                                      know, think, believe
Certainty verbs                                   conclude, prove, show, understand, find, know, realize
Communication verbs                               argue, claim, propose, say, tell, suggest
Activity verbs                                    smile, open
Attitude verbs                                    anticipate, expect, prefer
Passive voice                                     Racism was rejected as a scientific concept.
Necessity modals                                  should, must, have to
Pronoun it                                        it
Third person pronouns                             they, she, he
Attributive adjectives                            industrial scale, social reality
Size adjectives                                   large, big, high, long
Time adjectives                                   new, young, old
Topical adjectives                                political, international, national, economic
Premodifying nouns                                metal ions, crime rate
Common nouns                                      inequalities, ontology, formula, proteins
Cognition nouns                                   fact, knowledge
Place nouns                                       country, city, continent, border, mountain, ocean
Human nouns                                       family, parent, teacher, official, president, people
Group nouns                                       committee, congress, bank
Process nouns                                     application, meeting, balance
Technical nouns                                   internet, web
Concrete nouns                                    computer, machine, equipment, video
Nominalizations                                   collectivists, existence, sweetness
Indefinite articles                               a, an
Definite articles                                 the
Prepositions                                      in, at, on
Lexical bundles
Proportion of prompt-match bundles                with their friends, a large tourism industry, to communicate with
Proportion of generic lexical bundles             in order to, on the other hand, a lot of
Proportion of stance bundles                      my opinion is, they should not, it is very important
Proportion referential bundles                    a lot of, more and more, the opportunity to
Vocabulary features
Word count                                        Total number of words
Word length                                       Average word length
Vocabulary not in the first 3000                  Vocabulary less frequent than the first 3000 words
Vocabulary 501–3000                               Vocabulary in the 501–3000 word frequency range
Vocabulary 1–500                                  Vocabulary in the 1–500 frequency range
Type token ratio                                  Ratio of types to tokens in the first 400 words of each text

The MD analysis was conducted on the 41 features included in Table 5.3. We used exploratory factor analysis, performed in R (R Core Team, 2015), to reduce the number of variables to a smaller number of factors that included co-occurring linguistic features. In this way, we were able to see how a constellation of variables worked together to communicate a particular linguistic function in each of the texts. We used the R function 'fa' (factor analysis) from the psych package (Revelle, 2016), choosing principal axis factoring and a Promax rotation (see Biber, 1988, pp. 93–97). The Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO) was .73, which is a 'middling' but still acceptable value for continuing with the factor analysis (Tabachnick & Fidell, 2013). Based on the scree plot of eigenvalues, which revealed a definitive break between the fifth and sixth factor, a five-factor solution was chosen. Variables had to meet a minimal factor loading threshold of +/−.30 to be included in the final factor analysis. The cumulative variance accounted for by the five factors was 35%.
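A minimal sketch of how the feature rates and the factor analysis described above could be produced in R. The data objects ('feature_counts', 'essay_tokens', and 'first_400') are hypothetical stand-ins rather than the authors' actual data; the settings of the 'fa' call (principal axis factoring, Promax rotation, five factors, loadings of at least ±.30) follow the description in the text.

```r
library(psych)

# Hypothetical inputs: 'feature_counts' is an essays-by-features data frame
# of raw counts; 'essay_tokens' is the word count of each essay.
# Most variables are normed to a rate per 1,000 words.
rates <- feature_counts / essay_tokens * 1000

# Type/token ratio over the first 400 words of each essay
# ('first_400' is a hypothetical list of token vectors, one per essay).
ttr <- sapply(first_400, function(tokens) length(unique(tokens)) / length(tokens))

# Exploratory factor analysis with principal axis factoring and a Promax
# rotation; five factors were retained on the basis of the scree plot.
efa <- fa(cbind(rates, ttr = ttr), nfactors = 5, fm = "pa", rotate = "Promax")

# Display loadings, suppressing any below the |.30| threshold.
print(efa$loadings, cutoff = 0.30)
```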

Each variable was assigned to one factor, where it loaded most strongly, and the five factors were interpreted functionally. The five dimensions are as follows:

• Dimension 1: Literate vs. oral discourse
• Dimension 2: Topic-related content
• Dimension 3: Prompt dependence vs. lexical variety
• Dimension 4: Overt suggestions
• Dimension 5: Stance vs. referential discourse

Table 5.4  Factor Loadings a of Individual Variables for Dimension 1 and Dimension 3

Feature                             Dimension 1    Dimension 3
Word length                            0.82
Vocabulary 501–3000                    0.69
Vocabulary not in the first 3000       0.67
Attributive adjectives                 0.57
Passive voice                          0.40
Vocabulary 1–500                      −0.86
Subordinate clauses                   −0.32
Mental verbs                          −0.31
Have as a main verb                   −0.31
Common nouns                                          0.67
Proportion of prompt bundles                          0.61
Premodifying nouns                                    0.54
Proportion of generic bundles                        −0.58
Type/token ratio                                     −0.48
Pronoun it                                           −0.32

Note: a The cutoff for factor loading of individual features was set at ±.30; therefore, loadings lower than ±.30 are not reported in this table.

We focus on two of the five dimensions in our discussion of results (see Table 5.4 for factor loadings): Literate vs. oral discourse (Dimension 1) and Prompt dependence vs. lexical variety (Dimension 3).

Data Analysis
To investigate the differences in the use of the five dimensions across score level, we first computed a dimension score for each text. For each dimension, this consists of adding the positive loading features together and subtracting any negative loading features. After this, we were able to calculate mean dimension scores for each score level and also to compare each score level. We initially computed a two-way factorial MANOVA, using Type III sums of squares, which allowed us to account for both score level and prompt. A two-way MANOVA of the dimension scores among the ECPE essays revealed significant multivariate main effects for score level (Wilks' λ = .62, F(20, 919.66) = 7.02, p < .001, η² = .12) and

prompt (Wilks' λ = .09, F(10, 554) = 127.66, p < .001, η² = .69). The interaction effect between score level and prompt was not statistically significant (Wilks' λ = .85, F(40, 1210.21) = 1.19, p = .20). In the following results section, we focus on the findings from the univariate ANOVA for score level alone. We adjusted the alpha level to .01 to account for the five dimensions. We also conducted correlations between the dimension scores and ECPE score level. Again, we focus on Dimensions 1 and 3. For each dimension, the mean dimension score and a 95% non-parametric bootstrap confidence interval were calculated and plotted for each score band on the rubric that was present in the dataset.
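The dimension scores and the tests reported here could be computed along the following lines. This is a sketch under stated assumptions: the data frame 'ecpe', the feature-name vectors, and the additional dimension columns are hypothetical, and it is assumed, as in Biber's (1988) standard procedure, that feature rates are standardized before being summed, which the chapter does not state explicitly.

```r
library(car)

# Standardize the feature rates, then compute a Dimension 1 score per essay:
# the sum of the positive-loading features minus the sum of the
# negative-loading features (dim2-dim5 are computed analogously).
z <- scale(rates)                      # 'rates' as in the earlier sketch
ecpe$dim1 <- rowSums(z[, dim1_pos]) - rowSums(z[, dim1_neg])

# Two-way factorial MANOVA (score level x prompt) with Type III sums of
# squares; sum-to-zero contrasts are needed for Type III tests.
mod <- lm(cbind(dim1, dim2, dim3, dim4, dim5) ~ score_level * prompt,
          data = ecpe,
          contrasts = list(score_level = contr.sum, prompt = contr.sum))
Manova(mod, type = "III", test.statistic = "Wilks")

# Correlation between Dimension 1 scores and score level
# (score levels E-A coded 1-5 in a hypothetical 'score_numeric' column).
cor.test(ecpe$score_numeric, ecpe$dim1)
```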

Results
Here, we start with a discussion of each dimension, explaining the interpretation of each factor and our functional analysis. Then, we present the results according to score level to explore the ways in which MD analysis can elucidate differences across score levels.

Dimension 1: Literate vs. Oral Discourse
The first dimension (and the dimension with the greatest explanatory power) is one that has appeared, more or less, in almost every multidimensional analysis of speech and writing (e.g., Biber, 1988, 2006). What is interesting here is that it also emerges as an important dimension within the domain of writing alone. Dimension 1 has both positive and negative loading variables, with the positive features (word length, less frequent vocabulary, attributive adjectives, and passive voice) associated with more literate texts and the negative features (highly frequent vocabulary, finite adverbial clauses, mental verbs, and have as a main verb) associated with more spoken aspects of language. In Biber (1988) and Biber (2006), the features identified as "literate" features in this study (attributive adjectives, word length, passive voice) were associated with writing (e.g., academic articles, textbooks) while the features identified in this study as "oral" (mental verbs and finite adverbial clauses) were associated with spoken discourse (e.g., conversation, office hours). Excerpts 1 and 2 illustrate the use of more literate vs. more oral discourse in the ECPE.

Excerpt 1
File A_P3_33; Dimension 1 Score = +17.17
Note: Passive voice in bold, attributive adjectives in italics, vocabulary > 3000 underlined.

Tourism constitutes a financial sector that indisputably makes an enormous contribution to every nation's economy. However, a great deal of controversy is generated with regards to its impact

on the mental state of people who are engaged in touristic activities.

Excerpt 2
File E_P2_71; Dimension 1 Score = −19.64
Note: Adverbial clauses in bold, mental verbs in italics, vocabulary 1–500 underlined.

Children to find alone the answer is much because in that way the student they start and thinking. Another reason is many times the teacher don't care if the student use it and this is a serious problem because then they stop think and start to use calculator and electronic dictionaries to solve all problem.

Figure 5.1  Dimension 1: Literate vs. Oral Discourse Across ECPE Score Levels

When we examined the ANOVA results across dimension scores for Dimension 1, we found a significant difference in relation to score level. Figure 5.1 shows a linear increase in dimension score as score level increases. We also examined the correlation between dimension score and score level and found a significant, positive, moderate correlation (r = .43, p < .001). These findings indicate that as score level increases, test takers are using more of the features that are associated with writing based on previous


studies. They are also using fewer features associated with speech. This is illustrated by the contrast between Excerpt 1, which received a score of A, the highest level, and Excerpt 2, which received a score of E, the lowest level. These results are similar to those observed for the TOEFL iBT. Biber et al. (2016) found that the more literate grammatical features (word length, attributive adjectives and passive voice) were common in higher level writing in the TOEFL iBT while the more oral features (finite adverbial clauses and mental verbs) were associated with speech and lower level writing. This study adds to that analysis by showing that there are a number of lexical aspects that co-occur with the grammatical features (most notably vocabulary frequency). This allows us to capture both vocabulary and grammar, two separate descriptors on the ECPE scale, in one interpretation of the score level.

Dimension 3: Prompt Dependence vs. Lexical Variety
Dimension 3 also contains both lexical and grammatical features. It is notable that it aligns with some previous studies on writing and writing development but departs from others. Like Dimension 1, it also contains both positive and negative features. The positive features at first glance look like a mixture of more informational features (common nouns and premodifying nouns) and use of the prompt (proportion of prompt bundles). However, on closer examination, we found that the premodifying nouns, which in other studies were a sign of more advanced academic writing, were actually related to the prompt bundles. In other words, the prompt bundles contained a great number of premodifying nouns, such as text messaging and internet chat rooms. Excerpt 3 illustrates the use of both prompt bundles and premodifying nouns by a writer on the ECPE.

Excerpt 3
File E_P1_81; Dimension 3 Score = 18.73
Note: Prompt bundles in bold, common nouns italicized, premodifying nouns underlined. Also notable is the repetition throughout (text message/messaging, internet chat rooms, people).

The use text messaging and internet chat rooms it's a way for quickly communication and connection with people no only who are living in the same place or same country but with people who are living yet in abroad. Parents are thinking about this, that use text messages, and internet chat rooms give ability to speaking with other people who are able to see their after years without message and internet chat rooms.

The negative features of Dimension 3 contrast with this heavy reliance on the prompt (proportion of generic bundles, type/token ratio, and pronoun it). These features together display a characteristic of lexical variety in texts (as opposed to prompt dependence). Excerpt 4 displays the use of the negative features of Dimension 3.

Excerpt 4
File A_P2_62; Dimension 3 Score = −7.05
Note: Generic bundles in bold and it underlined; also notable is the variety of language used (technological is the only content word repeated).

It is an undeniable fact of modern, futuristic societies that students can (and are) profoundly influenced by the cornucopia of technological innovations. As a consequence, the majority of them make use of technological devices such as calculators, computers, etc., in class.

When we examined the ANOVA across dimension scores for Dimension 3, we found a significant difference in relation to score level. The mean dimension scores show a linear decrease in dimension score as score level increases (Figure  5.2). The white squares represent the observed mean dimension

Figure 5.2 Dimension 3: Prompt Dependence vs. Lexical Variety Across ECPE Score Levels

score for each rubric score level. The distance between the upper and lower ends of the 95% bootstrap confidence interval represents the range of plausible values for the mean dimension score at each score level for each of the dimensions. Where there is a lack of overlap between rubric score levels, there is a potentially significant difference in mean dimension scores between those score levels.1 We also examined the correlation between dimension score and score level and found a significant, negative, moderate correlation (r = −.27, p < .01). Taken together, these findings indicate that as score level increases, test takers are using more of the features that are associated with lexical variety and fewer of the features associated with prompt dependence. As indicated previously, this finding mediates the claims that have been made about premodifying nouns being a measure of greater proficiency/higher levels of development. Biber et al. (2016) found that higher level scorers on the TOEFL iBT used more premodifying nouns along with other features associated with informational writing. Staples et al. (2016) also found that L1 writers used more premodifying nouns as academic level increased. However, Staples and Reppen (2016) found that Chinese L1 writers used many more premodifying nouns than their English L1 peers in first year composition texts. This study provides more evidence that the use of phrasal features may be more complex than previously thought. This study also emphasizes the importance of including lexical aspects (type/token ratio), grammatical features (nouns), and phraseological features (prompt vs. generic bundles) when investigating L2 writing across score levels.

ECPE: Support for the Scoring Inference
Results of the MD analysis provide supportive evidence for score inferences on the ECPE at the evaluation step. According to Chapelle et al. (2010), score inferences at the evaluation step warrant that performances on a test are evaluated to provide scores reflective of the targeted language abilities. To support the intended score inferences, evidence needs to be examined in regard to several specific assumptions, including: i) the appropriateness of the scoring rubric to provide evidence for targeted language abilities; ii) the appropriateness of task conditions to provide evidence for targeted language abilities; and iii) the statistical characteristics of items, measures, and test forms to support norm-referenced decisions on performances and examinees (Chapelle et al., 2010, p. 7). In the case of ECPE, the MD analysis of lexico-grammatical features of writing performances provides supportive evidence for the first and the third assumptions. With respect to the first assumption, the emergence of Dimensions 1 and 3 supports that the ECPE rating rubric is appropriate for capturing the differences in test takers' writing abilities across proficiency levels. The ECPE Writing rating rubric (see Appendix A) includes three scoring criteria: Rhetoric, Grammar/Syntax, and Vocabulary. The descriptors for Grammar/Syntax and Vocabulary include syntactic complexity, lexical diversity, and sophistication as salient performance features to

reflect test takers' lexico-grammatical knowledge in writing. The descriptors for Rhetoric include topic development, organization, and prompt dependence (reliance on prefabricated language and/or language from the prompt [emphasis added]). In the MD analysis, Dimension 1 is positively associated with the use of low-frequency words, but negatively associated with highly frequent vocabulary, which aligns well with the Vocabulary descriptors. In addition, this dimension has negative loadings on finite adverbial clauses, mental verbs, and have as a main verb. Not only are these features more associated with spoken discourse, they also align with syntax used at lower proficiency levels in academic writing contexts and writing assessments (see e.g., Biber, Gray, & Staples, 2016; Staples et al., 2016). Therefore, this dimension reflects both lexical and syntactic complexity in written discourse, and can be used to support the appropriateness of both Grammar/Syntax and Vocabulary criteria in assessing writing ability. In contrast, Dimension 3 is positively associated with the use of prompt bundles and negatively associated with type-token ratio, which may reflect a lack of ability to use a wide range of vocabulary and develop content beyond the prompt. Thus, this dimension supports the appropriateness of both Rhetoric and Vocabulary criteria. In terms of the third assumption, the MD analysis supports the argument that the ECPE can distinguish high- and low-proficiency writers on the target writing abilities. The moderate correlations between dimension scores and holistic test scores and the significant results on the ANOVA tests suggest that test takers across proficiency levels can be statistically distinguished on the ECPE (with respect to Dimension 1: Literate vs. Oral Discourse and Dimension 3: Prompt Dependence vs. Lexical Variety). Given that features associated with Dimensions 1 and 3 can distinguish test takers across proficiency levels (reflection of scores awarded by human raters) and that those features are aligned with descriptors of key criteria on the ECPE rating scale (what raters are trained to score on), it is reasonable to conclude that, when rating the ECPE essay performances, human raters are generally able to focus on the key criteria in the rubric. Taken together, these findings can be used to support scoring inferences for the ECPE writing test at the evaluation stage.

Study 2: MELAB
The second study focuses on the MELAB speaking test, which is an oral proficiency interview. Here the MELAB is treated as a spoken register, and like the ECPE we use this case study to show how corpus-based register analysis can be used to investigate the scoring inference. The corpus for this study was created as part of a CaMLA research grant (LaFlair, Staples, & Egbert, 2016). The corpus comprises 98 transcribed interactions from the MELAB speaking test. The interactions were identified through a process of random sampling and represent speaking tasks from 22 different raters.

Table 5.5  Overview of the MELAB Corpus

Sub-corpus                        Texts   Mean Words/Text   Total Words
MELAB Test taker, Score 2             8            408.25         3,266
MELAB Test taker, Score 3            59            430.51        25,400
MELAB Test taker, Score 4            31            548.00        16,988
Subtotal all MELAB Test takers       98            465.86        45,654

Seven score bands are represented (2, 2+, 3−, 3, 3+, 4−, 4),

with 4 being the highest score. For our analysis we treated +/− as a whole score to align with the rubric, which gives only whole score descriptions (see Appendix B). Each speaking performance is scored by one rater (the rater who administers the task). Due to budgetary constraints, the first five minutes of the MELAB OPIs were transcribed by trained transcribers. This procedure has been used in other studies of spoken assessments (see e.g., Kang, 2012). The transcribed texts were split into two texts, one for the MELAB rater and one for the MELAB test taker. Splitting the texts allows for the comparison of rater and test taker language. The resulting corpus includes 75,568 words with an average test taker file of around 465 words. Table 5.5 provides an overview of the MELAB corpus. The first language of most of the test takers was Arabic (37.7%) or Chinese (Mandarin or Cantonese; 14.3%). The rest of the test takers were L1 speakers of a wide range of languages (e.g., Farsi, Tagalog, Gujarati, Urdu, French, Japanese). The average age of the test takers in this study was 30, and the test takers ranged in age from 17 to 59. About 50% of the test takers were between 17 and 28 years old. Most of the MELAB test takers in this data set took the speaking task for educational purposes (62.30%). Other people took the test as part of a language requirement for professional certification (22.40%) or other, unspecified reasons (15.30%).

Multidimensional Analysis
The 32 features in Table 5.6 were selected to be included in an exploratory factor analysis because of their demonstrated importance to the construct of spoken English from results of previous research (Biber, 1988, 2006; Biber, Johansson, Leech, Conrad, & Finegan, 1999; Friginal, 2009; Staples, 2015). Notably, these variables differ from those chosen for the ECPE due to the differences in register.

Table 5.6  Variables Included in Multi-dimensional Analysis

Variable                                            Description/Example
Speech
Rate of speech                                      syllables per second
Hesitation markers                                  um, uh
Interaction
Discourse markers                                   now, well
Backchannels                                        yeah, uh huh
Number of questions
Number of turns
Words per turn
Language
Contractions
1st person pronouns and possessive determiners      I, me, my
2nd person pronouns and possessive determiners      You, your
3rd person pronouns and possessive determiners      They, them, their
Word count
Vocabulary                                          Vocabulary in the 501–3000 word frequency range
Vocabulary                                          Vocabulary in the 1–500 frequency range
Certainty adverbials                                certainly, of course
Likelihood adverbials                               maybe, possibly
THAT-deletion w/ verb complement clauses            We thought (THAT) you were going to meet us at the restaurant.
Attributive adjectives                              good job, new friends
Premodifying nouns                                  sales job, language class
Nominalizations                                     admission, education
Present tense verbs                                 My son lives by himself
Past tense verbs                                    I studied mechanical engineering
Mental verbs                                        know, think, believe
Certainty verbs                                     conclude, prove, show, understand, find, know, realize
Existence verbs                                     seem, appear, live
Desire verbs + to-clauses                           I want to study hotel management
Stance verbs + to-clauses                           That's the one I hope to get into
Emphatics                                           So, you like it a lot?
Hedges                                              it sort of depends on what you're studying.
Amplifiers                                          It's really nice.
WH-words                                            All wh-words
Likelihood verbs + that-clause                      It appears that we don't have a sound recording

The factor analysis was carried out in R (R Core Team, 2015) in order to reduce the large number of variables down to a smaller set of interpretable dimensions of language use. As with the ECPE data, we used the 'fa' (factor analysis) function from the psych package (Revelle, 2016), choosing principal axis factoring and a Promax rotation. The Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO) was 0.67, which is 'middling' (Tabachnick & Fidell, 2013). The scree plot of the eigenvalues indicated a three-factor solution. The variables included in the final factor analysis met the minimum threshold of +/−0.30. The three factors accounted for a cumulative variance of 35%. The variables were assigned to the factors on which they loaded most strongly. The three dimensions and their functional interpretations are:

• Dimension 1: Listener-centered Involvement v. Speaker-centered Information
• Dimension 2: Colloquial Fluency
• Dimension 3: Explicit Personal Stance

Table 5.7  Factor Loadings b of Individual Variables Included on Dimension 2 in the MELAB Study

Feature                  Dimension 2
Emphatics                       0.53
Syllables per second            0.52
Contractions                    0.51
Certainty adverbials            0.44
Number of turns                 0.30
Hesitation markers             −0.55

Note: b The cutoff for factor loading of individual features was set at ±.30; therefore, loadings lower than ±.30 are not reported in this table.
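A brief sketch of how the number of factors might be checked and the solution refit for the MELAB data; 'melab_rates' is a hypothetical texts-by-features matrix of the 32 normalized measures, and the settings mirror those described above.

```r
library(psych)

# Scree plot of the eigenvalues, used here to decide on a three-factor solution.
scree(melab_rates, factors = TRUE, pc = FALSE)

# Principal axis factoring with a Promax rotation, as for the ECPE data.
efa_melab <- fa(melab_rates, nfactors = 3, fm = "pa", rotate = "Promax")
print(efa_melab$loadings, cutoff = 0.30)  # suppress loadings below |.30|
```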

For this study, we present the results of one of the three dimensions: Dimension 2, Colloquial Fluency (see Table 5.7 for factor loadings).

Data Analysis
As in the ECPE study, a dimension score was computed for each text on each of the three factors. For each dimension, the mean dimension score and a 95% non-parametric bootstrap confidence interval were calculated and plotted for each whole score band on the rubric that was present in the dataset (i.e., 2, 3, 4). We conducted a MANOVA, using Type III sums of squares, on the data to control for the multivariate nature of the data and the different number of observations in each of the three groups (score levels 2, 3, and 4). The result of the MANOVA (Wilks' λ = .08, F(9, 226.49) = 46.48, p < .001, η² = .72) revealed that there was a significant mean difference between at least one of the groups on one of the dimensions. Univariate ANOVAs were conducted to test whether or not any of the mean dimension scores were significantly different across score level, with an a priori alpha set at 0.016 to adjust for the three comparisons made between the three dimensions. The presentation of the results focuses on Dimension 2.
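As an illustration, the 95% non-parametric bootstrap interval for a mean dimension score and the univariate follow-up test might be computed as below. This is a sketch: the percentile bootstrap is assumed (the chapter does not specify the bootstrap method), and 'melab', with its 'dim2' and 'score_band' columns, is a hypothetical data frame of test-taker dimension scores.

```r
# Percentile bootstrap 95% CI for the mean Dimension 2 (Colloquial Fluency)
# score within one score band, e.g., band 3.
set.seed(1)
d2_band3 <- melab$dim2[melab$score_band == 3]
boot_means <- replicate(2000, mean(sample(d2_band3, replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))

# Univariate ANOVA for Dimension 2 across the three score bands, evaluated
# against the adjusted alpha of .016.
summary(aov(dim2 ~ factor(score_band), data = melab))
```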

Results

As with the results of the ECPE analysis, we start with a discussion of the dimension by explaining the functional analysis. The discussion of the dimension is accompanied by illustrative examples and figures that highlight dimension score patterns within and across the score levels.

Dimension 2: Colloquial Fluency

Dimension 2 is characterized by mostly positive features. These are features that are often present in fluent speech (more syllables per second, more contractions, and more turns in an interaction) and more colloquial speech (emphatics and certainty adverbials). Dimension 2 has one negatively loading feature, hesitation markers, which tends to occur at higher rates in less fluent speech. Excerpts 7 and 8 represent the use of these features by, respectively, a test taker who used positively loading features and a test taker who used more hesitation markers (the negatively loading feature).

Excerpt 7
File 8_D_21F.txt: MELAB score = 4; Test taker Dimension 2 score = +3.46
Note: Emphatics and contractions in bold, hesitation markers underlined (number of turns = 41; speech rate = 4.06 syllables per second)

T: I’m the only one in Canada. So friends, but uh, it’s like, every person has a different jobs, timings and everything, come home at six o’clock or four, and . . .
T: Then go for studying, and just and by eleven or twelve it’s time to sleep because we have to wake up in the early morning at seven o’clock. . . .
T: So it’s hard but we, we are to manage, right?

We can contrast Excerpt 7 (41 turns and 4.06 syllables per second) with the greater occurrence of hesitation markers, fewer turns and slower rate of speech in Excerpt 8.

Excerpt 8
File 4_E_19G.txt: MELAB score = 2; Test taker Dimension 2 score = −0.07
Note: Hesitation markers are underlined (number of turns = 33; speech rate = 2.75 syllables per second)

T: Uh when I finish my Master, I want to get uh, I wish to get a Docto- Doctorate degree and uh because I want to study uh uh what's its name? Uh. I forget its name. . . .
T: Because uh when we studied uh when I studied in uh physics, we studied about uh the smallest thing uh what the nano will talk about.

Figure 5.3  Dimension 2: Colloquial Fluency Across MELAB Score Levels (mean Colloquial Fluency dimension scores, with confidence intervals, plotted against MELAB score bands 2, 3, and 4)

Figure 5.3 shows the distribution of dimension scores across score levels for Dimension 2. There is a clear pattern of test takers using more linguistic features that are associated with colloquial fluency at the higher end of the score scale than at the lower end. The univariate ANOVA revealed a significant mean difference, F(3, 95) = 20.68, p < 0.001, across score levels. The correlation analysis of MELAB score with Dimension 2 score revealed a moderate, positive, significant relationship (r = 0.60, p < 0.001), which indicates that higher scoring test takers tend to use features associated with colloquial fluency at higher rates than lower scoring test takers. The contrast between Excerpt 7 and Excerpt 8 highlights this relationship. The former is from a more proficient test taker (MELAB speaking score of 4), who has a higher rate of speech and uses more contractions and emphatics. The latter excerpt is from a less proficient test taker (MELAB speaking score of 2), who uses fewer features of colloquial fluency and more hesitation markers.

MELAB: Support for the Scoring Inference

The results of the analyses show support for the first assumption of the scoring inference, that the rubric is appropriate for providing evidence of targeted language abilities. They also show support for the third assumption of the scoring inference, that the statistical characteristics support norm-referenced interpretations, which means that the assessment is able to rank order students across a continuum of ability from low to high (e.g., a normal distribution of scores).

The MELAB rubric (see Figure 5.A2 in Appendix B) describes the salient features of the interaction in both the descriptors at each score level and in a “salient features” table which contains three categories of features: speech, interaction, and language. Fluency is targeted in both the table (under the speech category) and the descriptors for the four levels as rate of speech, pausing/hesitation, and prosody. Linguistic connections between Dimension 2, Colloquial Fluency, and the fluency features described in the rubric include hesitation markers (pausing/hesitation) and syllables per second (rate of speech). Additionally, other features co-occurred with fluency features on this dimension: contractions and number of turns. This provides some empirical support for the assumption that the rubric offers evidence of the targeted language abilities. However, the other features that comprise Dimension 2 reflect salient features from other categories on the rubric. For example, number of turns is more related to the role the test taker plays in developing the conversation (in the interaction category) whereas emphatics and certainty adverbials are more related to language use. This highlights a mismatch between the resulting dimension and the salient features categories. In regard to the third assumption, there is clear support for the scoring inference with regard to the fluency features that test takers use in the MELAB speaking task. The top band of the rubric specifies that test takers who received a score of four are highly fluent and use few hesitations in speech. As scores decrease, the rubric describes language that is less fluent and that contains more hesitations. The moderate, positive correlation coefficient and the test of mean differences indicate that the MELAB speaking task can rank order test takers from low proficiency to high proficiency with respect to their fluency.

Conclusions

The results of the two studies have illustrated one way in which corpus-based register analysis can be combined with assessment to investigate linguistic evidence for the scoring inferences of a writing test and a speaking test. In both studies, the MD analysis revealed dimensions of language use that reflected the linguistic features of interest in the rubrics, which lends support to the assumption that the scoring rubric provides evidence of targeted language abilities. Although the dimensions in both studies were correlated with proficiency level, the linguistic features on each dimension were not always patterned along the dimensions or categories in the scoring rubric. The ECPE has three categories and the MD analysis resulted in five dimensions. The MELAB has three categories of salient features on the holistic rubric, but the resulting three dimensions did not fully capture the three salient feature categories.

For assessment researchers and developers, this mismatch between dimension features and the rubric might signal a need for more empirical evaluation regarding the groups of language features that are considered to be markers of different levels of language proficiency. In other words, it could be argued that when the descriptors are finalized for each level of a rubric, they represent an expected pattern of co-occurring features (linguistic, content, and organizational) of the performance at different levels of proficiency. A mismatch between actual language use in the test taker performances and expected language use in the rubric could be indicative of misspecifications in the rubric or a misalignment between the rubric, the task, and the target domain (i.e., the register of language use that the task is supposed to represent). Furthermore, the mismatch between rubric descriptors and linguistic features in a dimension may signal a need to revisit how constructs of language are represented on the rubric. For example, rate of speech, number and placement of pauses, and lack of hesitation markers are common linguistic features that have been considered to be a part of fluency. However, the results of the MELAB speaking task analysis showed that the use of emphatics, certainty adverbials, contractions, and more turns tended to co-occur with higher rates of speech and fewer hesitation markers. Given this result, perhaps we need to consider other features, such as the ability to take more turns, to emphasize ideas, and to express certainty about ideas, as part of an expanded construct of fluency. In the ECPE rubric, grammar and vocabulary in writing are assessed as separate constructs. However, the MD analysis underscored the idea that grammar and vocabulary are indeed very much related, and are perhaps tied to patterns of grammar and vocabulary use in different registers (i.e., oral vs. literate) (Römer, 2017). It is possible that this bottom-up approach to investigating the linguistic features can help assessment researchers re-evaluate and revise the constructs of language knowledge they want to measure. For corpus linguists, the studies discussed here highlight the added value of incorporating assessment tools in their linguistic analyses and interpretations of their results. In the context of language assessment, carrying out a MD analysis and connecting the results back to the descriptors in the rubric, or the assessment tool, provides another layer of insight into the language of the speaking and writing tasks. With the test takers organized by proficiency level by the examiners and the rubric, the results of the analysis of the MELAB highlighted how speakers at different ability levels employed linguistic features related to fluency differently. Likewise, the results of the analysis of the ECPE illustrated how test takers at higher proficiency levels tend to use more literate discourse and a wider variety of lexical items than test takers at lower proficiency levels. Without some external method (in this case the use of detailed rubrics) to group the test takers by proficiency level, these insights would be unavailable to the researcher. The relationship between the scores on these measurement tools and the language features of the texts is not accidental.

It is the result of a process of rubric development and rater training, which provides a strong foundation for inferences about language use at different proficiency levels regardless of the mismatch between the categories on the rubric and the dimensions. We believe that the two fields can benefit from each other beyond this immediate context to other contexts in which the primary focus of the research is not language proficiency. For example, in Staples (2015), a corpus analysis revealed differences in the use of linguistic features by nurses from two different backgrounds: those whose first language was English and were trained in the U.S., and those whose first language was not English and were trained outside the U.S. However, without further information, it was unclear whether these differences mattered in terms of patient perceptions or alignment of the nurses’ discourse with expected behaviors of medical providers in the U.S. (e.g., displays of empathy and rapport). By correlating the linguistic findings with two assessments, one a patient satisfaction measure and one an evaluation of the nurses’ interpersonal communication skills based on an assessment created for overseas-trained doctors, it could be determined that particular aspects of the overseas-trained nurses’ speech (e.g., fewer first person pronouns and prediction modals when conducting a physical exam) were related to lower patient satisfaction scores and lower scores on the interpersonal communication measure. When the broader discourse was analyzed, it became clear that the overseas-trained nurses were using fewer indications, a term that describes a provider’s discussion of what is happening during an exam (e.g., I’m just going to touch your shoulder). In turn, the linguistic features associated with particular levels of evaluation can provide raters of medical providers with more concrete indicators of more and less effective communication within medical interactions. The studies and approaches presented in this chapter have demonstrated how the disciplines of corpus linguistics and assessment can mutually benefit from interdisciplinary collaboration. The fusion of methods from these disciplines can yield new insights into how language varies across contexts as well as how language plays a role in evaluating discourse. We think that future work combining these disciplines would be more fruitful if key considerations in both disciplines are taken into account at the design stages. We believe the key to this process includes good, clear communication and a willingness to understand and appreciate perspectives from the other discipline. Understanding the broader implications required conversations on this topic between the assessment experts and the corpus linguistics expert. From an assessment perspective, it was difficult to grasp, visualize, and describe how the assessment-related aspects of the study and its implications are meaningful to the broader field of corpus linguistics beyond the application of corpus linguistic methods in assessment settings. Additionally, collaboration with corpus linguists keeps the practice of assessment from overemphasizing statistics and measurement and shifts the focus of the research to language.

The reverse of this was true as well. The assessment experts needed to be able to show the corpus linguistics expert how corpus methods are relevant to validity-related research. In addition, perspectives from language assessment help stress the importance of differentiating and categorizing linguistic features in relation to language proficiency and development. In summary, it was necessary for the researchers to develop some expertise in the other discipline: it is difficult to do this type of research well without being involved in the other discipline’s part of the study and without understanding the interpretation and implications from each other’s perspective.

Appendix A

Exceeds Standard

5
Rhetoric: • Topic richly, fully, complexly developed • Organization well controlled; appropriate to the material • Connection is smooth
Grammar/Syntax: • Flexible use of a wide range of syntactic (sentence level) structures; morphological (word forms) control nearly always accurate
Vocabulary: • Broad range; appropriately used

4
Rhetoric: • Topic clearly and completely developed, with acknowledgment of its complexity • Organization is controlled and shows appropriateness to the material • Few problems with connection
Grammar/Syntax: • Both simple and complex syntax adequately used; good morphological control
Vocabulary: • Vocabulary use shows flexibility; is usually appropriate • Any inappropriate vocabulary does not confuse meaning

Standard

3
Rhetoric: • Topic clearly developed, but not always completely or with acknowledgment of its complexity • Organization generally controlled; connection sometimes absent or unsuccessful
Grammar/Syntax: • Both simple and complex syntax present • For some, syntax is cautious but accurate, while others are more fluent but less accurate • Inconsistent morphological control
Vocabulary: • Adequate vocabulary, but may sometimes be inappropriately used

Below Standard

2
Rhetoric: • Topic development usually clear but simple and may be repetitive • Attempts to address different perspectives on the topic are often unsuccessful • Overreliance on prefabricated language and/or language from the prompt • Organization partially controlled
Grammar/Syntax: • Morphological errors are frequent • Simple sentences tend to be accurate; more complex ones tend to be inaccurate
Vocabulary: • Vocabulary may be limited in range, and is sometimes inappropriately used to the point that it causes confusion

1
Rhetoric: • Topic development may be unclear and/or limited by incompleteness or lack of focus • Might not be relevant to topic • Connection of ideas often absent or unsuccessful
Grammar/Syntax: • Pervasive and basic errors in sentence structure and word order cause confusion • Problems with subject-verb agreement, tense formation or word formation • Even basic sentences are filled with errors
Vocabulary: • Incorrect use of vocabulary causes confusion • Even basic words may be misused • May show interference from other languages

Not Scored

Not On Topic: A Not on Topic rating is awarded to any essay that:
• is written on a topic different from those assigned; or
• is connected to the prompt so loosely that the essay could very well have been prepared in advance; or
• requires considerable effort to see any connection between the composition and the prompt.

Figure 5.A1  ECPE Rating Scale

Appendix B

Overall Spoken English Descriptors

Excellent Speaker (4, 4−, 3+)
The test taker is a highly fluent user of the language, is a very involved participant in the interaction, and employs native-like prosody, with few hesitations in speech. The test taker takes a very interactive role in the construction of the interaction, and sustains topic development at length. Prosody is native-like though may be accented. Idiomatic, general, and specific vocabulary range is extensive. There is rarely a search for a word or an inappropriate use of a lexical item. The test taker employs complex grammatical structures, rarely making a mistake.

Good Speaker (3, 3−, 2+)
The test taker is quite fluent and interactive but has gaps in linguistic range and control. Overall, the test taker communicates well and is quite fluent. Accent does not usually cause intelligibility problems, though there may be several occurrences of deviations from conventional pronunciation. The test taker is usually quite active in the construction of the interaction and is able to elaborate on topics. Vocabulary range is good, but lexical fillers are often employed. There are some lexical mistakes and/or lack of grammatical accuracy, usually occurring during topic elaboration.

Marginal/Fair Speaker (2, 2−)
Talk is quite slow and vocabulary is limited. Overall, the pace of talk is slow with numerous hesitations, pauses, and false starts, but fluency may exist on limited topics. Although talk may be highly accented, affecting intelligibility, the test taker can usually convey communicative intent. However, the discourse flow is impeded by incomplete utterances. Also, the test taker does not always understand the examiner. Vocabulary knowledge is limited; there are usually many occurrences of misused lexical items. Basic grammatical mistakes occur.

Poor/Weak Speaker (1+, 1)
Talk consists mainly of isolated phrases and formulaic expressions, and there are many communication breakdowns between the examiner and test taker. The test taker’s abilities are insufficient for the interaction. Some basic knowledge of English exists and some limited responses to questions are supplied. Utterances may not consist of syntactic units, and it is often difficult to understand the communicative intent of the test taker. The test taker also frequently does not understand the examiner. Accent may be strong, making some of the test taker’s responses unintelligible. Vocabulary is extremely limited and sparse.

Salient Features

Speech
  Fluency: rate of speech, pausing/hesitation, prosody (stress, rhythm, intonation)
  Intelligibility: accent, articulation, delivery

Interaction
  Conversational Development: interactional facility (responsiveness), topic development (elaboration)
  Conversational Comprehension: mutual comprehension (test taker comprehensibility and examiner speech adjustment)

Language
  Vocabulary: lexical range (general, specific, idiomatic), use of lexical fillers
  Grammar: utterance length, utterance complexity, syntactic control, morphology

Figure 5.A2  MELAB Rating Scale


Note

1. It should be noted that even when C.I.s overlap there is potential for statistical difference (depending on n-size and amount of overlap).

References Bachman, L. F. (1990). Fundamental considerations in language assessment. Cambridge: Cambridge University Press. Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford, UK: Oxford University Press. Bachman, L. F., & Palmer, A. (2010). Language assessment in practice. Oxford, UK: Oxford University Press. Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press. Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins. Biber, D., & Conrad, S. (2009). Register, genre, and style. Cambridge: Cambridge University Press. Biber, D., Conrad, S., & Cortes, V. (2004). If you look at . . . Lexical bundles in university teaching and textbooks. Applied Linguistics, 25, 371–405. Biber, D., Conrad, S., Reppen, R., Byrd, P., Helt, M., Clark, V., Cortes, V., Csomay, E.,  & Urzua, A. (2004). Representing language use in the university: Analysis of the TOEFL 2000 spoken and written academic language corpus. Princeton, NJ: Educational Testing Service. Biber, D., & Gray, B. (2013). Discourse characteristics of writing and speaking task types on the TOEFL iBT: A  lexico-grammatical analysis. ETS Research Report Series, TOEFL iBT, 19(1), i–132. Biber, D., Gray, B.,  & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical complexity in l2 writing development? TESOL Quarterly, 45(1), 5–35. Biber, D., Gray, B., & Staples, S. (2016). Predicting patterns of grammatical complexity across language exam task types and proficiency levels. Applied Linguistics, 37(5), 639–668. Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The longman grammar of spoken and written English. London: Pearson. Cambridge Michigan Language Assessments [CaMLA]. (2016a). ECPE report. Retrieved from https://michiganassessment.org/wp-content/uploads/2017/12/ ECPE-2016-Report.pdf Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the test of English as a Foreign language. New York, NY: Routledge. Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to Validity Make a difference? Educational Measurement: Issues and Practice, 29, 3–13. doi:10.1111/j.1745-3992.2009.00165.x Crooks, G. (1990). The utterance, and other basic units for second language discourse analysis. Applied Linguistics, 11(2), 183–199. Egbert, J. (2015). Sub-register and discipline variation in published academic writing: Investigating statistical interaction in corpus data. International Journal of Corpus Linguistics, 20(1), 1–29.

Egbert, J. (2014). Student perceptions of stylistic variation in introductory university textbooks. Linguistics and Education, 25, 64–77. Friginal, E. (2009). The language of outsourced call centers: A corpus-based study of cross-cultural interaction. Amsterdam: John Benjamins. Hunt, K. W. (1965). Grammatical structures written at three grade levels (No. 3). Champaign, IL: National Council of Teachers of English. Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. Kang, O. (2012). Impact of rater characteristics and prosodic features of speaker accentedness on ratings of international teaching assistants’ oral performance. Language Assessment Quarterly, 9, 249–269. LaFlair, G. T., & Staples, S. (2017). Using corpus linguistics to examine the extrapolation inference in the validity argument for a high-stakes speaking assessment. Language Testing, 34(4), 451–475. LaFlair, G. T., Staples, S., & Egbert, J. (2015). Variability in the MELAB speaking task: Investigating linguistic characteristics of test-taker performance in relation to rater severity and score. Cambridge Michigan language assessment research reports. Retrieved from https://michiganassessment.org/wp-content/uploads/2015/04/CWP-2015-04.pdf Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3), 307–322. Michigan Language Assessments [MLA]. (2018). MELAB. Retrieved from http://michiganassessment.org/test-takers/tests/melab/ R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from www.R-project.org/ Revelle, W. (2016). Psych: Procedures for psychological, psychometric, and personality research (Version 1.6.9) [Computer software]. Evanston, IL: Northwestern University. Retrieved from http://CRAN.R-project.org/package=psych Römer, U. (2017). Language assessment and the inseparability of lexis and grammar: Focus on the construct of speaking. Language Testing, 34(4), 477–492. Schoonen, R., Gelderen, A. van, Stoel, R. D., Hulstijn, J., & Glopper, K. de. (2011). Modeling the development of L1 and EFL writing proficiency of secondary school students. Language Learning, 61(1), 31–79. doi:10.1111/j.1467-9922.2010.00590.x Simpson-Vlach, R., & Ellis, N. C. (2010). An academic formulas list: New methods in phraseology research. Applied Linguistics, 31(4), 487–512. doi:10.1093/applin/amp058 Staples, S. (2015). Examining the linguistic needs of internationally educated nurses: A corpus-based study of lexico-grammatical features in nurse-patient interactions. English for Specific Purposes, 37, 122–136. Staples, S., Biber, D., & Reppen, R. (2018). Using corpus-based register analysis to explore authenticity of high-stakes language exams: A register comparison of TOEFL iBT and disciplinary writing tasks. Modern Language Journal, 102(2), 310–332. doi:10.1111/modl.12465 Staples, S., Egbert, J., Biber, D., & Gray, B. (2016). Academic writing development at the university level: Phrasal and clausal complexity across level of study, discipline, and genre. Written Communication, 33, 149–183.

Staples, S., Egbert, J., Biber, D., & McClair, A. (2013). Formulaic sequences and EAP writing development: Lexical bundles in the TOEFL iBT writing section. Journal of English for Academic Purposes, 12, 214–225. Staples, S., & Reppen, R. (2016). Understanding first-year L2 writing: A lexico-grammatical analysis across L1s, genres, and language ratings. Journal of Second Language Writing, 32, 17–35. Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics. Boston: Pearson. Weigle, S. C., & Friginal, E. (2015). Linguistic dimensions of impromptu test essays compared with successful student disciplinary writing: Effects of language background, topic, and L2 proficiency. Journal of English for Academic Purposes, 18, 25–39. White, M. (1994). Language in job interviews: Differences relating to success and socioeconomic variables (Unpublished dissertation). Northern Arizona University, Flagstaff, AZ. Yan, X., & Staples, S. (2016). Investigating lexico-grammatical complexity as construct validity evidence for the ECPE writing tasks: A multidimensional analysis. Cambridge Michigan language assessment research reports. Retrieved from https://michiganassessment.org/wp-content/uploads/2017/05/CWP2016-01-Yan-Staples.pdf Yan, X., & Staples, S. (accepted). Fitting MD analysis in an argument-based validity framework for writing assessment: Explanation and generalization inferences for the ECPE. Language Testing. Yu, G. (2010). Lexical diversity in writing and speaking task performances. Applied Linguistics, 31(2), 236–259. doi:10.1093/applin/amp024

6 Examining Vocabulary Acquisition Through Word Associations

Triangulating the Psycholinguistic and Corpus-Based Approaches

Dana Gablasova

Introduction: Vocabulary Knowledge and Lexical Development

Vocabulary is one of the basic building blocks of successful language use, relying on both productive and receptive knowledge of words and phrases. The lexical dimension of language knowledge is also strongly related to the overall mastery of the language underlying such fundamental communicative skills as reading and speaking. Vocabulary knowledge is multidimensional and dynamic; its complexity has been reflected in the large number of approaches taken to measure the different facets of lexical knowledge and development (e.g. Schmitt, 2014; Nation, 2001). Two major approaches have addressed the breadth and depth of lexical knowledge. While the former investigates the range of lexical items known to the speakers and gains in the size of their lexicon, the latter is related to the development in understanding of individual items in terms of their meaning and use (Read, 2004; Henriksen, 1999). This chapter addresses a third approach to vocabulary knowledge, network knowledge, which is concerned with the links between words in the mental lexicon of language users. This approach focuses on the network of lexical and conceptual knowledge created around a word upon its learning, and any subsequent reshaping of that network (Haastrup & Henriksen, 2000). This can be observed through associations that this word activates (Fitzpatrick, 2006; Meara, 2009). As Verhallen and Schoonen (1993, p. 348) stress, “lexical development comprises more than extending vocabulary size and (at the level of individual words) extending knowledge of meaning components. It is also a gradual structuring process, in which higher-level interrelationships between words and their meanings are established”. This process is considered by some researchers to be also directly connected to the depth approach to vocabulary knowledge. As Read (2004, p. 219) puts it, “depth can be understood in terms of learner’s developing ability to distinguish semantically related words and, more generally, their knowledge of the various ways in which individual words are linked to each other” (see also Haastrup & Henriksen, 2000, p. 222).

The quality and depth of a lexical network is important, as relationships between words affect which words and concepts become activated during language use (Kruse, Pankhurst, & Smith, 1987). The number, diversity, and type of lexical and conceptual connections might then influence the conceptual and linguistic quality of the message. The network of associations built around words in the mental lexicon can offer insights into the development of lexical knowledge that may not be observed through more traditional tests (e.g. those measuring vocabulary size). One of the key methods used to gain insight into the organization of the mental lexicon relies on eliciting and evaluating associations between the target (stimulus) word and other items in the lexicon. This technique, the word association task, has a long tradition in second language acquisition studies (see Kruse, Pankhurst, & Smith, 1987; Meara, 2009 inter alia) and corpus-based approaches to the analysis of word associations (WAs) through collocation are also well established (see McEnery & Hardie, 2011, pp. 122–133). These corpus-based approaches, as used in this chapter, have the advantage of allowing frequency information, not normally used in WA studies, to be exploited. The more traditional approach used in vocabulary acquisition research usually takes into consideration the number of WAs produced, speakers’ response times and the lexical, semantic and phonetic connections between the stimulus word and its associations. The corpus-based approach relies on information about word frequency and collocations in language use to explain the WAs, to evaluate the integration of the target words in the lexicon and to offer explanations for trends in WA activation. The established utility of word association tasks for the study of vocabulary acquisition in second language acquisition studies, coupled with the relevance of introducing corpus frequency data into the study of WAs, explains the choice of methods which are triangulated in this chapter—these approaches are, in principle, complementary. In order to demonstrate the use of these two analytical approaches to WAs as a window into vocabulary acquisition, the chapter investigates the acquisition of new, specialized lexis from academic reading in first (L1) and second (L2) language. WAs can help us describe both how new lexical items enter the lexicon and what links are formed between them and the words already in the lexicon. The findings of this study enable us to understand how words acquired by bilinguals through their L1 and L2 are similar to or different from each other and how the lexicons of bilingual speakers develop. With respect to the practical implications of this research, the study was designed to reflect the practice at English-medium high schools in which new subject-specific (technical) words are acquired over the course of learning new content knowledge in the L2. Hence the findings presented in this chapter are of relevance to that context.

The chapter concludes by reflecting upon how the two approaches offer a fruitful source of triangulation for linguists interested in exploring word associations. The study demonstrates the value, and at times the challenges, of applying different methodological approaches (psycholinguistic and corpus-based) to one dataset both in order to gain complementary perspectives on our research questions and to bring different perspectives to bear in explaining the results of our investigation of these questions.

Vocabulary Acquisition: Investigating Lexical Network Knowledge

Using Word Associations to Study L1 and L2 Vocabulary Development

A word association test (WAT) is a method in which participants are given a cue (stimulus) word and are asked to provide one or more words that come to their mind immediately upon hearing or seeing the word. Word associations have been used to investigate a very wide range of mental processes—for example, in customer research WAs have been used to gain an insight into how a key concept related to a product was perceived (Roininen, Arvola, & Lähteenmäki, 2006) and in academic assessment WAs may help to determine students’ content knowledge in a particular discipline (Nakiboglu, 2008). One of the major research areas involving WAs, which is also the focus of this study, investigates vocabulary acquisition of native and non-native speakers and the similarities/differences between the two groups. Traditionally, WAs in these studies have been analysed in terms of the number of WAs provided, response times and the type of lexico-semantic connection between the words (Zareva & Wolter, 2012; Zareva, 2011; Fitzpatrick & Izura, 2011). In terms of lexico-semantic connections, researchers have most often looked at the proportion of syntagmatic and paradigmatic relationships between the cue word and its associations. The responses of L2 speakers in research so far have been compared to WAs elicited from L1 speakers as part of the same study or, more often, obtained from standardized databases containing L1 responses to a set of words (e.g. Zareva & Wolter, 2012). One of the major advantages of using WAs to examine the development and organization of an L2 mental lexicon is that the responses are not constrained by speakers’ knowledge of grammatical rules or stylistic considerations (Fitzpatrick, 2013). With respect to the trends that have emerged from the studies employing WAs, the comparison of the organization of the mental lexicon in L1 and L2 speakers has revealed fundamental differences between these groups (e.g. Singleton, 1999; Pavlenko, 2009). Wolter (2001, p. 42) discusses three types of differences that have been identified. These concern

first, the stability of the connections between words in the mental lexicon, with L2 speakers showing lower consistency in their responses. Second, phonology appears to play a more prominent organizing role in the L2 mental lexicon than it does for native speakers. Finally, the semantic links between words used by L2 speakers tend to differ in a systematic way from those of native speakers. Whereas native speakers’ associations are more paradigmatically oriented, non-native speakers were found to have a tendency to produce more syntagmatic responses, more typical of younger L1 speakers. However, rather than affirming the structural difference between the mental lexicons of L1 and L2 speakers of a particular language put forward in these studies, Wolter (2001) claims that the different findings could be related to the depth of the knowledge of individual items in the lexicon. Wolter argues that “how well a particular word is known may condition the connections made between that particular word and other words in the mental lexicon” (Wolter, 2001, p. 47). It seems that with a higher degree of word knowledge, there is a tendency for L2 speakers’ answers in WATs to resemble those of native speakers more (Namei, 2004; Wolter, 2001). Thus, the degree of integration of a word in the mental lexicon may be also (indirectly) affected by both the user’s proficiency and the relative frequency of the word (Namei, 2004; Wolter, 2001). For example, Zareva (2007) presents evidence that an increasing level of proficiency in L2 speakers results in response types similar to those of L1 users, with advanced L2 speakers giving native-like associations. Further, Zareva claims that if a difference was found, it was one of quantity, not quality: all L2 speakers in her research tended to provide more paradigmatic than syntagmatic answers. The difference between their answers and those of native speakers thus lay in the diversity and number of associations provided. To capture potential differences between L1 and L2 speakers, Fitzpatrick (2006, 2007) proposed a different methodology of WA scoring (which included form-based, position-based, meaning-based, and erratic relationships between words) allowing for better discrimination between the nature of associations. Following this new categorisation, differences in the quality of association patterns were identified between L1 and L2 speakers. Native speakers were found to produce more collocational and synonymous responses, while non-native speakers produced more form-based responses and responses with a weaker relationship between concepts represented by the WAs (Fitzpatrick, 2006). However, it has been argued that there were several limitations in the research that brought evidence of differences between L1 and L2 WAs. First, evidence showed that L1 users are not as homogeneous in their answers as is commonly assumed and strong individual and sociolinguistic preferences exist between language users for a particular style of word association response (Dronjic & Helms-Park, 2014; Fitzpatrick, 2007;

Higginbotham, 2010). These preferences moreover seem stable across the bilingual’s two languages (Fitzpatrick & Izura, 2011). Second, different word classes were found to result in different types of responses, with nouns yielding more paradigmatic responses than verbs and adjectives (e.g. Zareva, 2011; Nissen & Henriksen, 2006). Likewise, words of lower frequency also produced a considerable variety of responses from native speakers (Fitzpatrick, 2007). This is consistent with Wolter’s (2001) argument about the effect of depth of knowledge on the integration of a word in the mental lexicon. It could be assumed that less frequent words are also less well known by speakers generally. The variation in the responses may then reflect the varying degrees of familiarity with these words among native speakers.

Corpus-based Approaches to Network Knowledge

Corpora have been used in vocabulary studies for a long time. While they played a role in the construction of vocabulary lists (e.g. West, 1953) as well as in the development of vocabulary tests based on corpus-based frequency information (Read, 2000), they have not yet been systematically used in WA research despite the long tradition in corpus linguistics of studying associative relationships between words (e.g. collocations). Corpus-based approaches to word networks involve identifying words that systematically co-occur in language use and explaining these patterns. The corpus-based evidence of connections between words is usually based on the frequency with which the words occur together and the exclusivity (strength) of this relationship (Gablasova, Brezina & McEnery, 2017a; Evert, 2008). In addition to describing immediate (first-order) collocations of the target words, corpus methods can also be used to identify more complex associative word relationships in which shared collocates of several words are identified, thus uncovering larger interconnected networks of concepts (Brezina, 2016; Brezina, McEnery & Wattam, 2015; Mollet, Wray & Fitzpatrick, 2012). The studies that have combined corpus information with WA research have followed two main avenues of inquiry. The first group of these studies used corpus evidence (corpus frequency information and collocations) to explain WAs provided by native speakers of the language and to investigate whether the WAs in the L1 could be related to the patterns recurring in language use. These studies (e.g. Spence & Owens, 1990; Wettler et al., 2005; Schulte im Walde et al., 2008; Mollin, 2009) found that there was only a limited overlap between the associations given by L1 speakers and collocations found in corpora. In particular, Mollin (2009) concluded that given the considerable differences between associations found in the corpus and those in the mental lexicon of speakers, the nature of linguistic evidence captured by these two data sources tapped into very different sources and processes of language production. Mollin

(2009) likewise noted that the different approaches to the selection and application of corpus methods (e.g. some of the studies used a span of ±20 words, others applied much narrower collocation windows) made it difficult to evaluate and compare the findings. In addition, these studies often did not discriminate between types of relationships between words and their associations (e.g. in terms of symmetric and asymmetric relationships between words) which can to a large extent affect the type of WAs elicited (Michelbacher, Evert, & Schütze, 2011). The second group of studies used corpus evidence to compare association response patterns from L1 and L2 speakers. Zareva (2011) investigated word frequency information (though not directly corpus-based) as a possible factor in WA response types, finding that it did not seem to contribute much to explaining the variation. Building on this study, Zareva and Wolter (2012) used collocations from the Bank of English to test the hypothesis that the native speakers would produce more collocational associations than the non-native speakers. The results showed that this was, however, not the case. Overall, there has been only a very limited use of corpus data in WA research that is concerned with vocabulary acquisition, whether of L1 or L2 speakers. The current study uses corpus frequency information to examine the WAs provided by L1 and L2 speakers.
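To make the frequency-and-exclusivity logic behind such collocation measures concrete, the following R sketch computes two commonly used association scores. The formulae shown (MI and logDice) are illustrative choices rather than the specific settings used in any of the studies reviewed above, and the figures passed to the function are invented.

```r
# f_node: corpus frequency of the node word; f_coll: frequency of the candidate collocate
# f_co:   co-occurrences of the two within the chosen window; N: corpus size in tokens
collocation_strength <- function(f_node, f_coll, f_co, N, window = 10) {
  expected <- f_node * f_coll * window / N              # co-occurrences expected by chance
  c(MI      = log2(f_co / expected),                    # Mutual Information: observed vs. expected
    logDice = 14 + log2(2 * f_co / (f_node + f_coll)))  # logDice: exclusivity of the pairing
}

collocation_strength(f_node = 5000, f_coll = 12000, f_co = 150, N = 1e8)
```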

The Study and Research Questions

The current study investigates the quality of lexical knowledge gained by bilingual speakers who read academic texts in their second language in comparison to another set of bilingual speakers who gained the same knowledge through their first language (see Gablasova, 2014, 2015 for more details about the research). As little is known about the development of network knowledge of words previously unfamiliar to speakers (Zareva & Wolter, 2012), the study was designed to provide insights into what lexical knowledge is acquired and activated when a new word enters the mental lexicon. In practical terms, the aim of the study was to understand the challenges of lexical and subject learning faced by bilingual students (L1 Slovak speakers) in an L2-medium (English) educational context, in which both the new content knowledge and the new subject vocabulary is gained through their second language. In order to study this type of learning in a controlled experiment, the bilingual Slovak-English speakers, who were students at an English-medium high school, were asked to read two academic texts that contained 12 subject-specific words that were new to the students both in their first and second language. One group of students read these texts in their L1 and the other group read them in their L2. This setup allowed a comparison of the similarities/differences in the acquisition of new words through L1 and L2 (see Section Texts and the Target Words for the description of

the words) and an investigation of different aspects of the newly-formed lexical knowledge. In particular, this chapter uses word associations produced by the two groups of students to examine and compare the network knowledge they developed when learning new words. The study seeks to answer the following three research questions:

RQ1: Is there a difference in the number of WAs provided by the two groups when learning new words?
RQ2: Is there a difference between the type of WAs provided by the two groups?
RQ3: What role does frequency of occurrence play in the activation of WAs?

Methodology

Participants

Participants in the study were 76 bilingual Slovak-English speakers, students at two high schools in Slovakia with a Slovak-English bilingual programme. All participants were 17 to 20 years old and were recruited from among students in the last two years of the programme. The participants used English daily in their education and in after-school activities. At the end of their studies, the students are expected to reach B2/C1 level of the Common European Framework of Reference for Languages (CEFR). Participants were randomly divided into two groups; the groups were balanced for proficiency as established by three tests (two measured their vocabulary size and one their productive language skills). The first group of participants (N = 39) received the reading texts in their L2 (English) (referred to as the ‘L2-input group’ or ‘L2 group’) and the second group (N = 37) received them in their L1 (Slovak) (referred to as ‘the L1 group’).

Texts and the Target Words

The texts used in the study were two expository texts (each approximately 850 words long) on topics new to the participating students: the history and geography of New Zealand.1,2 The first text described the arrival of the Maori in New Zealand and the consequent development of the Maori lifestyle; the second text described the region of the High Country in the South Island in New Zealand. The texts included 12 target words along with their lexical familiarisation (e.g. a definition). For example, the lexical familiarisation included in the text for the target word transhumance was as follows:

Farmers follow several important practices on the land: Transhumance—the seasonal movement (before winter) of stock from exposed, high

mountain slopes to the more sheltered foothills and river flats. This avoids large stock-losses due to the bitter cold of winter.

The target words were all nouns (six abstract and six concrete nouns) which were not known to the participants prior to the study in either their L1 or L2 (as established by a pre-test). One of the major selection criteria stipulated that the words have similar form and pronunciation in Slovak and in English (e.g. English ampelography vs. Slovak ampelografia) so that they would be easily recognisable by participants in either of their two languages. This design allowed a direct comparison of the L1 and L2 groups as they acquired the same set of words. It was important that the same words could be used in both the English and the Slovak texts, as vocabulary acquisition and retention are very sensitive to word properties such as the length of the word. The 12 words were: ampelography, diastrophism, ecocentrism, kumara, moa, moko, pa, perendale, rcd, terroir, transhumance, and whanau. The Slovak equivalents were: ampelografia, diastrofizmus, ekocentrizmus, kumara, moa, moko, pa, perendale, rcd, terroir, transhumancia, and whanau. While the words appeared with somewhat varying frequency in the texts read by the participants (up to four times), their occurrence in these texts was in each case restricted to one paragraph.

Procedure

This study is part of a larger inquiry into content and vocabulary learning of bilingual students evaluating the effect of the language of input (L1 or L2). The procedure consisted of a pre-test, the learning session and a post-test (which contained the WAT). The participants were tested individually and did not know that lexical learning was of interest to the researcher; they were told the study focussed on subject learning by bilinguals and that they would be asked questions about the content of the texts. During the experimental phase, the participants first completed a pre-test that established that they did not have previous knowledge of the words or of subject knowledge in the texts. The prior knowledge of the target words was established using a vocabulary list which contained the 12 target words as well as 56 other words of low frequency that appeared in the reading materials and acted as distractors in the pre-test. Participants were asked to indicate their knowledge of these words using a lexical knowledge scale based on Paribakht and Wesche (1999). The test confirmed that the participants were not familiar with any of the target words. In the learning session that followed, the participants were given 10 minutes to read the first text and then a recording of the text was played to them as well. This process was repeated with the second text. Overall, the learning session took about 30 minutes. The L1 group of students received both of their texts

in Slovak; the L2 group received both texts in English. The listening-while-reading modality was employed in addition to reading in order to ensure that the participants paid attention to the whole of the text (cf. Horst, Cobb, & Meara, 1998). The learning session was followed by a two-minute non-verbal distractor task (a puzzle). After that, the participants received a post-test which first asked about their knowledge of the text content and the knowledge of the target words and, after a short break, they completed the computer-administered WAT, providing their answers orally.

WA Elicitation

There are different approaches to administering a WAT. Most WA measures require participants to provide one word (the discrete elicitation approach) (e.g. Fitzpatrick, 2006, 2007; Fitzpatrick & Izura, 2011). Others ask participants to provide more words (the continuous approach)—either all that the participants can think of or what they can produce within a time limit (Kruse, et al., 1987). While eliciting one word may be enough to identify the strongest link between two words, it has been argued that this approach may limit the data unnecessarily (De Deyne & Storms, 2008; Kruse, et al., 1987). The present study adopted a continuous approach to association elicitation in order to collect multiple responses and gain further insight into the words connected to the target word (De Deyne & Storms, 2008; Kruse et al., 1987). The association task was computer-administered as the trial revealed that this was less distracting than face-to-face responses in the researcher’s presence. In the WAT, participants were shown the cue words and given 30 seconds to provide associations that occurred to them when they saw the words; they were instructed to provide as many associations as they could within the time limit. The time limit was set in order to reduce retrieval inhibition, which occurs when participants are asked to provide an exhaustive list of associations. The WAT was divided into two language blocks, with the same six words presented in the English language block and the same six words in the Slovak language block to allow a direct comparison of participants’ performance on the same set of words in each language. The participants were instructed to respond in the language of the block; this approach was taken to counterbalance the effect of having to produce words in a language different from the language of input. The presentation of the target words within a language block was randomized to prevent priming effects of a particular sequence of words. Before the session, participants received both an oral and written explanation of the task and were given two trial words to practice the task on. Before each language block, participants were given a trial word in order to help them to get accustomed to the language of production. The language blocks were counterbalanced across the participants,

so that half of the participants responded first in English, while the other half received the Slovak block first.
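The block counterbalancing and item randomisation just described can be illustrated with a small R sketch. This is a toy reconstruction, not the original test script; the assignment of the six Slovak-block items is inferred from the list of English-block items given later in the chapter.

```r
set.seed(42)

english_block <- c("ecocentrism", "kumara", "moko", "perendale", "terroir", "transhumance")
slovak_block  <- c("ampelografia", "diastrofizmus", "moa", "pa", "rcd", "whanau")  # inferred split

make_wat <- function(participant_id) {
  # alternate block order across participants; shuffle word order within each block
  block_order <- if (participant_id %% 2 == 1) c("English", "Slovak") else c("Slovak", "English")
  list(block_order = block_order,
       English     = sample(english_block),
       Slovak      = sample(slovak_block))
}

make_wat(1)   # first participant: English block first, randomised item order
```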

Data Analysis

WA Analysis

Before the WAs could be analysed, the data were first processed so that all instances where the participants did not demonstrate knowledge of the target word were removed. These were the following cases:

a. No WA given: the participant responded with, for example, ‘I don’t know the word’;
b. WA for a clearly different word given: e.g. ‘war dance’ given as a WA for moko was clearly related to haka instead; haka was another new word that appeared in the text;
c. No relationship between the WA and the target word could be identified: e.g. the cue word ecocentrism was followed by associations such as ‘window’, ‘lunch’, ‘experiment’.

Only the remaining WAs were further analysed in order to answer the research questions. There was no statistically significant difference in the number or type of excluded responses across the two groups of participants (for a more detailed account see Gablasova, 2012).

Corpus-based Analysis

Drawing on previous corpus-based analysis of WAs as well as broader use of corpora in second language acquisition research, this study uses corpus frequency information to examine the role of usage-based frequency of occurrence in the selection of WAs. Corpora are a unique source of frequency information (Leech, 2011). Given that the information about frequency of word occurrence can be a useful indicator of how likely the word is to be known, encountered or produced by users of English (Ellis, 2014), corpora are ideally placed to help analysts approach issues relating to frequency in the study of language. The study in this chapter is, in many ways, an exploration of this assertion; results, such as those presented later in the chapter (in the section which addresses RQ3 in the Results section inter alia), rely on corpus frequency data to generate insights into word associations. When using corpus methods, there are many techniques and corpora to choose from; we should therefore carefully consider the aims of our analysis and select techniques that are appropriate to our data and research aims. There are different approaches to obtaining information about the frequency of occurrence of target items that could be employed in

vocabulary research (Gablasova, Brezina, & McEnery, 2017b). We have to critically reflect on both the choice of the corpus on which to base the frequency information and the type of frequency that we will measure. In order to assess the effect of word frequency, this study relies on frequency information obtained by means of the average-reduced-frequency (ARF) measure to assess the relative frequency of words. Unlike absolute frequency, which was found to be only indirectly related to the acquisition of new structures, ARF combines information about frequency of items with data about their dispersion in language use (i.e. how evenly or unevenly they are distributed in use). ARF thus favors items that are very frequent but also occur across the whole corpus. Note that ARF is equivalent to a normalized frequency only where the normalized frequency represents something that is evenly spread through the documents in a corpus (see Hlaváčová & Rychlý, 1999; Savický & Hlaváčová, 2002 for a fuller discussion of the conceptualization and use of ARF). The ARF of items in general English was used to produce the new General Service List (new-GSL) (Brezina & Gablasova, 2015) used in this study. The new-GSL lists 2,494 lemmas that were found to be both most frequent and evenly distributed in the English language. This list is based on four corpora of English (the British National Corpus, the Lancaster-Oslo-Bergen (LOB) corpus, the EnTenTen12 corpus and the British English 2006 (BE06) corpus) containing over 12 billion words.3 As the new-GSL is based on four different corpora, it is this list of the most frequent words in English that is used to assess whether or not the WAs produced in the experiment relate to the most frequent words in the English language in RQ3.

Exploring Each Research Question

The three research questions in the study were explored in the following ways. The first research question addressed the number of WAs provided by speakers who acquired the target words through their L1 and L2. For this research question, each acceptable WA was scored with one point. A mean number of WAs in each response language was calculated for each group and compared using an independent-samples t test. The second research question focussed on the semantic analysis of the relationship between the cue word and the first WA produced by participants. In particular, it examined whether the first WA reflected a syntagmatic or a paradigmatic relationship with the target word. The hierarchical semantic relationship between the target word and the WAs was established using WordNet (Princeton University, 2010). While WordNet is, of course, an idealized representation of these relationships, it has the virtue, for experimental purposes, of representing an independent, expert assessment of hierarchical semantic relationships. This factors out the possibility of bias which may result from the experimenter both examining WAs and setting benchmark norms for these words.

When exploring the second research question, only responses from the English part of the test in which participants provided the WAs for the following six words were used for the analysis: ecocentrism, kumara, moko, perendale, terroir, and transhumance. Using WordNet, the first WA given for each target word in English by participants in the L1 and L2 groups was considered to assess whether it is superordinate to the target word or not.

The third research question investigated the relationship between the word frequency and the WAs and focussed on the association of the three abstract words in the responses provided in English (ecocentrism, terroir, and transhumance). The analysis first ascertained what proportion of all WAs provided for each of these target words came from among the most frequent and general words in English, i.e. from the words listed in the new-GSL. As a second step, all WAs provided for these three words were analysed with respect to their corpus-based frequency as well.
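For readers who want to see how the frequency measure behind the new-GSL works in practice, the following is a minimal sketch of average reduced frequency (ARF) as defined by Savický and Hlaváčová (2002): ARF = (1/v) × Σ min(di, v), where f is a word's raw frequency in a corpus of N tokens, v = N/f, and the di are the distances between consecutive occurrences of the word with the corpus treated as a cycle. The function name and the toy positions are illustrative; this is not the code used to build the new-GSL.

```python
# A minimal sketch of average reduced frequency (ARF) following
# Savický & Hlaváčová (2002); illustrative only, not the new-GSL pipeline.
def arf(positions: list[int], corpus_size: int) -> float:
    """positions: sorted 0-based token positions of a word's occurrences."""
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f  # average distance between occurrences
    # distances between consecutive occurrences, with the corpus as a cycle
    gaps = [positions[i] - positions[i - 1] for i in range(1, f)]
    gaps.append(corpus_size - positions[-1] + positions[0])
    return sum(min(d, v) for d in gaps) / v

if __name__ == "__main__":
    # An evenly dispersed word keeps (almost) its full frequency ...
    print(arf([0, 10, 20, 30], 40))   # -> 4.0
    # ... while a 'bursty' word of the same frequency is reduced.
    print(arf([0, 1, 2, 3], 40))      # -> 1.3
```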

Results

Research Question 1: The Number of WAs

In terms of the number of WAs per target word, as shown in Table 6.1, there is a tendency of participants in both groups to provide more associations in their first rather than in their second language. However, the t test did not reveal any statistically significant differences between the two groups in terms of WA numbers. A correlational analysis confirmed a strong relationship between the number of WAs provided by participants in their L1 and L2, observed for both groups (L1 group: r = .731, p < .01; L2 group: r = .798, p < .01).

Table 6.1  Mean Number of WAs in L1 and L2

                     L1 group          L2 group
                     Mean     SD       Mean     SD       t         df    Sig.
Response in L1       6.92     1.97     7.30     2.25     −.643     49    .523
Response in L2       5.88     1.71     6.84     2.08     −1.790    49    .080

Research Question 2: Type of WAs

The second research question focussed on semantic analysis of the relationship of the WAs to the target words. Table 6.2 gives an overview of the first WAs given for each target word in English by participants in the two groups. Bold type indicates which words were considered to be superordinate to the target words. Numbers in parentheses indicate how many respondents produced that association, where the number is greater than 1.

Table 6.2  First WA by Target Word (Responses in English Only)

Ecocentrism
  L1 group: nature (16), belief (3), religion (3), science, flowers
  L2 group: nature (6), belief (4), equality (3), plants (2), ideology, human, religion, view, organisms, faith, worldview
Kumara
  L1 group: potato (13), food (5), sweet potato† (4), crop, plant, sweet, traditional
  L2 group: potato (7), sweet potato† (7), plant (4), food (3), crop, eat, meal, sweet
Moko
  L1 group: tattoo (18), painting, woman, people
  L2 group: tattoo (20), traditions, black, inhabitants
Perendale
  L1 group: sheep (20), merino
  L2 group: sheep (9)
Terroir
  L1 group: wine (20), French (3), system (2), vineyard (2), grapes (2), method, soil, farming, grape
  L2 group: wine (15), territory (3), vineyard (3), friends (2), French, winers, group
Transhumance
  L1 group: sheep (5), moving (4), movement (4), farmers, livestock, animals, transferring, winter, agriculture, move, transporting
  L2 group: movement (3), sheep (3), moving (2), animals (2), change, transporting, transport, migration, mountain, far, move, agriculture, farm

† pronounced as one unit

Table 6.3 shows the proportion of the first word associations that contained a superordinate to the cue word. A strong preference for the superordinate was found for the three concrete words (kumara, moko, and perendale) whereas a strong preference for a choice other than the superordinate word was found for the remaining three abstract words (ecocentrism, terroir, and transhumance). This response pattern was found in both groups. It seems that concrete target words elicited more superordinate words as associations than the abstract target words. (Note that the same tendency was found for the WAs given in participants’ L1, Slovak).

Research Question 3: The Effect of Word Frequency on the Selection and Activation of WAs

With respect to corpus-based frequency of the elicited WAs, as can be seen from Table 6.4, there seems to be a stable pattern found in both groups, with the majority (62 to 77%) of the WAs coming from among the more frequent words in the English language.

Table 6.3  Number of WAs in Superordinate Relationship to the Target Word

                     L1 group                            L2 group
                     superordinate     other             superordinate     other
Target word          No.    Perc.      No.    Perc.      No.    Perc.      No.    Perc.
Ecocentrism          3      12.5       21     87.5       6      27.3       16     72.7
Kumara               19     73.1       7      26.9       19     76         6      24
Moko                 18     85.7       3      14.3       20     87         3      13
Perendale            20     95.2       1      4.8        9      100        0      0
Terroir              2      6.1        31     93.9       3      11.5       23     88.5
Transhumance         4      19         17     81         3      15.8       16     84.2
Concrete words       57     83.8       11     16.2       48     84.2       9      15.8
Abstract words       9      11.5       69     88.5       12     17.9       55     82.1

Table 6.4  Proportion of WAs Included in the New-GSL

                          New-GSL            Not on New-GSL      Total
                          N       Perc.      N       Perc.
Ecocentrism      L1       89      77.4       26      22.6        115
                 L2       96      73.3       35      26.7        131
Terroir          L1       124     62.6       74      37.4        198
                 L2       127     68.6       58      31.4        185
Transhumance     L1       104     69.3       46      30.7        150
                 L2       94      71.8       37      28.2        131
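The proportions reported in Table 6.4 come down to checking each elicited WA against the new-GSL lemma list. The sketch below illustrates one way such a breakdown could be computed; the word list and the responses in the example are invented placeholders rather than the study's data, and real responses would first need to be lemmatized so that, for instance, 'plants' matches the lemma 'plant'.

```python
# Illustrative sketch: share of word associations found on a core word list
# (e.g. the new-GSL). The GSL set and responses below are placeholders.
from collections import Counter

def gsl_breakdown(responses: list[str], gsl_lemmas: set[str]) -> dict:
    """Count how many response words are / are not on the word list."""
    counts = Counter("on_list" if w.lower() in gsl_lemmas else "off_list"
                     for w in responses)
    total = sum(counts.values())
    return {
        "on_list": counts["on_list"],
        "off_list": counts["off_list"],
        "on_list_pct": round(100 * counts["on_list"] / total, 1) if total else 0.0,
    }

if __name__ == "__main__":
    toy_gsl = {"nature", "belief", "plant", "animal", "wine", "winter"}
    toy_responses = ["nature", "belief", "equality", "plant", "organism"]
    print(gsl_breakdown(toy_responses, toy_gsl))
    # -> {'on_list': 3, 'off_list': 2, 'on_list_pct': 60.0}
```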

Table 6.5  Ecocentrism: Most Frequent WAs

L1 group   New-GSL: nature (19), belief (7), plant (7), religion (7), animal (6), centre (4), equal (3)
           Off-list: equality (9)
L2 group   New-GSL: nature (15), belief (9), plant (7), animal (6), equal (6), people (4), view (4), centre (3)
           Off-list: equality (8), ecology (6), organism (5)

Table 6.6  Terroir: Most Frequent WAs

L1 group   New-GSL: wine (31), French (6), quality (5), type (5), soil (4), bottle (4), condition (4), farm (3), field (3), mark (3), region (3), system (3), weather (3)
           Off-list: vineyard (13), grape (11), France (5)
L2 group   New-GSL: wine (25), French (8), soil (8), friends (6), condition (5), group (5), taste (5), farm (4), people (4), bottle (3), weather (3), technique (3)
           Off-list: vineyard (16), grape (11), agriculture (4), territory (4), France (3)

Table 6.7  Transhumance: Most Frequent WAs

L1 group   New-GSL: winter (11), move (8), animal (7), movement (6), people (6), hill (5), mountain (5), transport (5), weather (5), cold (3), slope (3)
           Off-list: sheep (17), livestock (3)
L2 group   New-GSL: winter (7), hill (5), movement (5), move (5), cold (4), condition (4), farmer (4), stock (4), weather (4), animal (3), season (3), mountain (3)
           Off-list: sheep (9)

Next, we considered the WAs for the three words individually. Tables 6.5–6.7 list all WAs with a frequency of three or higher that were provided by participants in each group. These WAs are divided according to their presence or absence from the new-GSL. WAs highlighted in bold type indicate which words appeared both in the lexical familiarisation of the target words in the texts as well as among the WA responses. When all WAs were taken into consideration, a strong preference for particular WAs in both groups became apparent. For example, ‘nature’, ‘equality’, ‘belief’ and ‘plant’ are among the most frequently produced WAs for ecocentrism, while both groups gave ‘wine’, ‘vineyard’, ‘grape’ and ‘French’ as the most common associations for terroir. When looking at the WAs in terms of their frequency in general language use, we see that the majority of WAs most commonly given by the participants are very frequent words (they are included in the new-GSL list), giving further evidence for the overall trend revealed in Table 6.4. In other words, the associations that became activated in response to the target words

come predominately from highly frequent words that are most likely encountered in a variety of contexts. A further analysis of the associations that appeared frequently in participants’ responses but were not in the original texts (i.e. their activation could not be attributed to seeing these words in the input texts together with the target words) suggested the following explanation for their activation: The most frequently produced WAs appear to be related to major semantic (meaning) components of the target words. For example, the meaning of transhumance includes four major concepts: (1) movement of (2) farm animals (3) from one location to another (4) due to unsuitable conditions. While the WAs are undoubtedly related to these concepts, it is interesting to observe that a word (concept) in the original text has been replaced by a word of higher frequency. Thus, for example,

156  Examining Vocabulary Acquisition ‘stock’ which appeared in the definition of transhumance seemed to activate related words of higher frequency, e.g. ‘sheep’ and ‘animals’. Similarly, ecocentrism includes the core components of (1) a worldview/belief that stresses (2) the central role of nature and (3) equality of all living creatures. It seems that WAs such as ‘plant’ and ‘animal’, very frequent words in general use (as indicated by the new-GSL), were activated by the low-frequency word ‘organisms’ that appeared in the original text. This trend suggests that a concept expressed by a (specialized) word of low frequency in the input tends to activate more general words in the mental lexicon that were related semantically to the concept.

Discussion of the Findings In this section the findings of the study are summarized and discussed. In eliciting multiple responses to each target word, this study opted for a more complex approach to WAs. These data thus represent a rich resource for analysing the integration of new words into the mental lexicon of bilingual speakers. The findings are first discussed in terms of the comparison of lexical learning through L1 and L2 and of the insights into novel vocabulary learning. The methodological issues related to the use of corpus methods in psycholinguistic research are discussed next. Comparison of L1 and L2 Learning of Novel Words The data of the L1 and L2-input groups were compared in two main areas and the following trends emerged from the data analysis. First, with respect to the WA productivity (the number of WAs), there was no difference between the two groups, indicating that when advanced bilinguals encounter new words in a meaningful context these create links with words and concepts already present in the mental lexicon regardless of the language of input. At first glance this finding may appear different from that reported by Zareva (2007) who described differences in the quantity of WAs provided by L1 and L2 speakers; however, Zareva also stressed that higher proficiency in L2 was linked to more native-like performance. Moreover, a correlational analysis revealed that there were strong individual preferences in terms of the number of WAs produced by participants—those who provided more WAs in their L1 also provided more WAs in their L2. This brings further evidence of the individual differences in WAT response behavior noted in earlier studies (e.g. Dronjic & Helms-Park, 2014) as well as further evidence that these preferences are also transferred from one of the bilingual’s languages to the other (Fitzpatrick & Izura, 2011). Next, the individual WAs were analysed to better understand how the newly acquired word related to the lexical items already known to the speakers. The first WA, believed to reflect the strongest association of the stimulus word, was examined in terms of its semantic relationship

Examining Vocabulary Acquisition  157 to the stimulus word. The responses of both groups showed a similar trend—for concrete words, the superordinate word was the immediate response; this was, however, not the case for the abstract words. Participants, regardless of the language of input, also showed some common trends in terms of the preferred response to the majority of the target words (e.g. ‘wine’ being the most frequently given first association for terroir or ‘sheep’ as a response to perendale). When all WAs were taken into consideration and ordered according to their frequency as part of a closer investigation into WAs for the abstract words (Tables 6.5–6.7), the overall association preferences of the two groups were remarkably similar (especially given the fact that the L1-input group received their texts in Slovak and were producing the WAs in English). The similarities found between these two groups with respect to different aspects of mental lexicon organization suggest that when exposed to the same input, advanced bilinguals tend to learn new vocabulary in a similar way through their L1 and L2. It is thus possible that some of the differences between L1 and L2 speakers found in previous studies result from difference in exposure and consequent familiarity with different aspects of the word’s meaning (Namei, 2004; Wolter, 2001). However, these findings are also to some extent surprising as they do not correspond with results based on the same participants which measured the newly acquired knowledge of the participants by means of other techniques (Gablasova, 2014, 2015). In particular, when looking at definitions and explanations of the newly learned words, some participants with L2 input showed more limited understanding of the meaning of the target words and, it could be argued, in terms of the depth of knowledge, they acquired the words to a lesser degree than those learning them through their L1. The current study thus suggests that even if the meaning of the target words derived from the context may not be fully understood in the participants’ L2, it is possible for similar lexical networks to be activated by participants when encountering the words in their L1 and L2. Trends in Learning of New Words by Bilinguals The commonalities in vocabulary acquisition found in the L1 and L2 input groups point to the existence of more general trends in the acquisition of new words. The evidence from the WAT suggests that there is a dynamic interaction between semantic and usage-based factors when new words are integrated into existing lexical networks. As discussed in the previous sub-section, syntagmatic semantic relationships appeared to be a strong predictor of the first WAs (especially in the case of concrete words). Likewise, superordinate words were among the most frequent WAs overall even for abstract words (Tables 6.5–6.7) with ‘belief’, ‘vineyard’ and ‘movement’ among the most frequent associations for ecocentrism, terroir and transhumance respectively. However, the semantic

158  Examining Vocabulary Acquisition connection was not the only predictor; it appears that in several cases it competed with usage-based properties of the words activated in the mental lexicon. Corpus-based information about the frequency of words that appeared among WAs thus seemed to be another predictor of which lexical items got activated. Results indicated a preference for more frequent and general words with the majority of WAs found in the new-GSL. Frequency of words also appeared related to the patterns of activation with a rarer word in the input activating what could be considered as semantically-related but more general and frequent items in the mental lexicon (e.g. ‘organisms’ activated ‘plants’ and ‘animals’). The interaction between semantic relationships and corpus frequency information could also further explain some of the WA response patterns. In particular, there appears to be a competition between strong associations with superordinate words and preference for higher-frequency words to be activated first. For example, when we examine WAs for terroir, the two most frequent associations were ‘wine’ and ‘vineyard’ (occurring in 56 and 29 responses across the two groups respectively). These two words could be considered equally salient as they are related to the major meaning components of the target word (terroir is a group of vineyards whose wine shares the same characteristics). However, ‘wine’, the more frequent and general word, rather than the superordinate word ‘vineyard’ appears to be more strongly associated as reflected by being the first choice in many participants’ responses (Table 6.2). A similar pattern can be observed with ecocentrism and the competition between ‘nature’ on the one hand and ‘belief’ and ‘equality’ on the other hand—all three words connected to the major meaning components. Again, ‘nature’, ranked within the first one thousand words in the new-GSL (position 530 on the list), is activated more often than the superordinate ‘belief’, the second most preferred WA. It is possible that the more general words may be activated more quickly and appear more salient to the speakers, with these words having already forged connections with other words/concepts across a variety of contexts.

Triangulation: Combining Psycholinguistic and Corpus Methods Over the last ten years, there has been an increasing enthusiasm about combining corpus and psycholinguistic methods (e.g. Gilquin & Gries, 2009) and the interdisciplinary approaches have proved very fruitful. However, despite the encouraging results, there are potential challenges when different research traditions are brought together (e.g. Gablasova et al., 2017b; Durrant & Siyanova-Chanturia, 2015; Mollin, 2009) that should be addressed if the multiple methods are to be combined and triangulated in a meaningful and appropriate manner. There are some clear benefits of using corpora and corpus methods in psycholinguistic research. The frequency of occurrence of target structures and

Examining Vocabulary Acquisition  159 items is a significant driving force in the process of learning language and it interacts with every stage of this process (e.g. noticing, acquisition, and retention) (Ellis, 2014). Frequency of occurrence of target features is therefore a very powerful explanatory factor when we investigate and explain mechanisms involved in language learning. The current study illustrated how corpus-based frequency information can aid in filling gaps in our understanding of how new, specialized vocabulary items enter the mental lexicon, a process that has so far been mostly explained via word characteristics (such as word class or concreteness), meaning-related connections between words (e.g. paradigmatic and syntagmatic semantic relationships), and speaker characteristics (e.g. proficiency of the speakers and their familiarity with words) (Fitzpatrick, 2007; Wolter, 2001). Moreover, including corpus-based analysis in the current study facilitated both the treatment of more complex WA data elicited through the continuous WAT approach and the searching of all of the WAs for shared trends. Finally, including corpus information also enabled us to relate the data produced in a specific study to more general patterns in language use recorded in corpora. However, while the combination of corpus and psycholinguistic traditions of quantitative research appears promising, we have to be careful when selecting the data and applying the methods from each discipline. As discussed earlier, previous studies using corpus methods in WA research suggested that language processes recorded in corpora and by WAs are not always readily comparable or directly relatable to each other (e.g. Mollin, 2009). Moreover, automatic and quantitative analyses of language can potentially not only uncover meaningful patterns but also obscure them (e.g. the current study noted different preferences with respect to concrete and abstract words). This study set out to achieve two aims: to bring evidence about how bilingual speakers acquire new specialized words through their first and second language and to demonstrate how a combination of psycholinguistic and corpus methods can contribute to a better description, explanation, and understanding of processes involved in vocabulary learning. On a theoretical level, the study contributed to our knowledge of how conceptual networks in bilingual speakers are developed and suggested possible factors that play a role in this process. In terms of methodological exploration, the study demonstrated the value of triangulating techniques from different methodological traditions as each of the methods offered a different perspective on the data, creating a richer and more accurate understanding of the vocabulary acquisition investigated in this study.

Notes 1. It should be noted, however, that this chapter separates out layers and features of vocabulary knowledge which, in practice, may become less discrete and more inter-dependent. However, while noting this, this chapter makes the same assumption as psycholinguistic studies, for experimental purposes, that

160  Examining Vocabulary Acquisition the distinctions drawn are discrete, while tacitly accepting the more blurred and interacting nature of these features. 2. The texts used in the study were based on existing texts rewritten for this study. See Gablasova (2014) for details. 3. For a discussion of the corpora and the rationale for their use in generating the new GSL, see Brezina and Gablasova (2015, pp. 5–7).

References Brezina, V. (2016). Collocation networks: Exploring associations in discourse. In P. Baker  & J. Egbert (Eds.), Triangulating methodological approaches in corpus linguistic research (pp. 90–107). London: Routledge. Brezina, V., & Gablasova, D. (2015). Is there a core general vocabulary? Introducing the new general service list. Applied Linguistics, 36(1), 1–22. Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139–173. De Deyne, S., & Storms, G. (2008). Word associations: Norms for 1,424 Dutch words in a continuous task. Behavior Research Methods, 40(1), 198–205. Dronjic, V.,  & Helms-Park, R. (2014). Fixed-choice word-association tasks as second-language lexical tests: What native-speaker performance reveals about their potential weaknesses. Applied Psycholinguistics, 35(01), 193–221. Durrant, P.,  & Siyanova-Chanturia, A. (2015). Learner corpora and psycholinguistics. In S. Granger, G. Gilquin,  & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 57–77). Cambridge: Cambridge University Press. Ellis, N. C. (2014). Frequency-based accounts of second language acquisition. In S. Gass, & A. Mackey (Eds.), The Routledge handbook of second language acquisition (pp. 193–210). London: Routledge. Evert, S. (2008). Corpora and collocations. In A. Lüdeling  & M. Kytö (Eds.), Corpus linguistics. An international handbook (pp. 1212–1248). Berlin: Mouton de Gruyter. Fitzpatrick, T. (2006). Habits and rabbits: Word associations and the L2 lexicon. EuroSLA Yearbook, 6(1), 121–145. Fitzpatrick, T. (2007). Word association patterns: Unpacking the assumptions. International Journal of Applied Linguistics, 17(3), 319–331. Fitzpatrick, T., & Izura, C. (2011). Word associations in L1 and L2: An exploratory study of response types, response times, and interlingual mediation. Studies in Second Language Acquisition, 33(3), 373–398. Fitzpatrick, T. (2013). Word associations. In C. Chapelle, (Ed.), The encyclopedia of applied linguistics (pp. 6193–6199). Oxford: Blackwell Publishing. Gablasova, D. (2012). Learning and expressing technical vocabulary through the medium of L1 and L2 by Slovak-English bilingual high-school students (Doctoral dissertation). The University of Auckland. Gablasova, D. (2014). Learning and retaining specialized vocabulary from textbook reading: Comparison of learning outcomes through L1 and L2. The Modern Language Journal, 98(4), 976–991. Gablasova, D. (2015). Learning technical words through L1 and L2: Completeness and accuracy of word meanings. English for Specific Purposes, 39, 62–74.

Examining Vocabulary Acquisition  161 Gablasova, D., Brezina, V., & McEnery, A. M. (2017a). Collocations in corpusbased language learning research: Identifying, comparing and interpreting the evidence. Language Learning, 67(S1), 130–154. Gablasova, D., Brezina, V., & McEnery, T. (2017b). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(S1), 155–179. Gilquin, G., & Gries, S. T. (2009). Corpora and experimental methods: A stateof-the-art review. Corpus Linguistics and Linguistic Theory, 5(1), 1–26. Haastrup, K., & Henriksen, B. (2000). Vocabulary acquisition: Acquiring depth of knowledge through network building. International Journal of Applied Linguistics, 10(2), 221–240. Henriksen, B. (1999). Three dimensions of vocabulary development. Studies in Second Language Acquisition, 21(02), 303–317. Higginbotham, G. (2010). Individual learner profiles from word association tests: The effect of word frequency. System, 38(3), 379–390. Horst, M., Cobb, T., & Meara, P. (1998). Beyond a clockwork orange: Acquiring second language vocabulary through reading. Reading in a foreign language, 11(2), 207–223. Hlaváčová, J., & Rychlý, P. (1999). Dispersion of words in a language corpus. In V. Natousekm, P. Mautner, J. Ocelíková, & P. Sojka (Eds.), Text, speech and dialogue (pp. 321–324). Berlin and Heidelberg: Springer. Kruse, H., Pankhurst, J.,  & Smith, M. S. (1987). A  multiple word association probe in second language acquisition research. Studies in Second Language Acquisition, 9(2), 141–154. Leech, G. (2011) Frequency, corpora and language learning. In F. Meunier, S. De Cock & G. Gilquin (Eds.), A taste for corpora: In honour of Sylviane Granger (pp. 7–31). Amsterdam: John Benjamins. McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press. Meara, P. (2009). Connected words: Word associations and second language vocabulary acquisition. Amsterdam: John Benjamins Publishing. Michelbacher, L., Evert, S., & Schütze, H. (2011). Asymmetry in corpus-derived and human word associations. Corpus Linguistics and Linguistic Theory, 7(2), 245–276. Mollin, S. (2009). Combining corpus linguistic and psychological data on word co-occurrences: Corpus collocates versus word associations. Corpus Linguistics and Linguistic Theory, 5(2), 175–200. Mollet, E., Wray, A., & Fitzpatrick, T. (2012). Accessing second-order collocation through lexical co-occurrence networks. In S. Faulhaber & T. Herbst (Eds.), The phraseological view of language: A tribute to John Sinclair (pp. 87–122). Berlin: Mouton de Gruyter. Nakiboglu, C. (2008). Using word associations for assessing non major science students’ knowledge structure before and after general chemistry instruction: The case of atomic structure. Chemistry Education Research and Practice, 9(4), 309–322. Namei, S. (2004). Bilingual lexical development: A Persian–Swedish word association study. International Journal of Applied Linguistics, 14(3), 363–388. Nation, I. S. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.

162  Examining Vocabulary Acquisition Nissen, H. B., & Henriksen, B. (2006). Word class influence on word association test results. International Journal of Applied Linguistics, 16(3), 389–408. Pavlenko, A. (2009). Conceptual representation in the bilingual lexicon and second language vocabulary learning. In A. Pavlenko, (Ed.), The bilingual mental lexicon: Interdisciplinary approaches (pp. 125–160). Bristol: Multilingual Matters. Paribakht, T. S., & Wesche, M. (1999). Reading and “incidental” L2 vocabulary acquisition. Studies in Second Language Acquisition, 21, 195–224. Princeton University. (2010). About wordnet. Retrieved June 2011, from http:// wordnet.princeton.edu Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press. Read, J. (2004). Plumbing the depths: How should the construct of vocabulary knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a second language: Selection, acquisition, and testing (pp. 209–227). Amsterdam: John Benjamins. Roininen, K., Arvola, A., & Lähteenmäki, L. (2006). Exploring consumers’ perceptions of local food with two different qualitative techniques: Laddering and word association. Food Quality and Preference, 17(1), 20–30. Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9, 215–231. Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research shows. Language Learning, 64(4), 913–951. Schulte im Walde, S., Melinger, A., Roth, M., & Weber, A. (2008). An empirical characterisation of response types in German association norms. Research on Language & Computation, 6(2), 205–238. Singleton, D. (1999). Exploring the second language mental lexicon. Cambridge: Cambridge University Press. Spence, D. P.,  & Owens, K. C. (1990). Lexical co-occurrence and association strength. Journal of Psycholinguistic Research, 19(5), 317–330. Verhallen, M., & Schoonen, R. (1993). Lexical knowledge of monolingual and bilingual children. Applied Linguistics, 14(4), 344–363. West, M. (1953). A general service list of English words: With semantic frequencies and a supplementary word-list for the writing of popular science and technology. London: Longman. Wettler, M., Rapp, R., & Sedlmeier, P. (2005). Free word associations correspond to contiguities between words in texts. Journal of Quantitative Linguistics, 12(2–3), 111–122. Wolter, B. (2001). Comparing the L1 and L2 mental lexicon: A  depth of individual word knowledge model. Studies in Second Language Acquisition, 23, 41–69. Zareva, A. (2007). Structure of the second language mental lexicon: How does it compare to native speakers’ lexical organization? Second Language Research, 23(2), 123–153. Zareva, A. (2011). Effects of lexical class and word frequency on the L1 and L2 English-based lexical connections. The Journal of Language Teaching and Learning, 1(2), 1–17. Zareva, A., & Wolter, B. (2012). The ‘promise’ of three methods of word association analysis to L2 lexical research. Second Language Research, 28(1), 41–67.

7 If Olive Oil Is Made of Olives, then What’s Baby Oil Made of? The Shifting Semantics of Noun+Noun Sequences in American English

Jesse Egbert and Mark Davies

Introduction

Noun+noun constructions (NNs)—also known as pre-modifying nouns, nominal pre-modifiers, noun-noun sequences, and noun+noun compounds—are a topic of particular interest in the study of English for a number of reasons. NNs occur when a head noun is pre-modified by one or more nouns (e.g. corpus linguistics, research design, book chapter). However, the nature of the semantic relationship between a pre-modifying noun and a head noun is highly variable (Biber et al., 1999). Even for a given noun, there can be a wide range of meanings. Consider, for example, the variety of semantic relationships that are possible between the noun oil and a pre-modifying noun:

Semantic relationship                          Examples
oil is made from ___________                   olive oil, vegetable oil, coconut oil
oil is used for ___________                    baby oil, motor oil, cooking oil
oil is extracted from ___________              shale oil, coal oil, tar sands oil
oil found at the location of ___________       gulf oil, sea oil, ocean oil
oil belongs to ___________                     state oil, government oil

Each of the preceding examples comes from the list of the 200 most frequent N + oil constructions in the Corpus of Contemporary American English (COCA). This demonstrates the tremendous amount of variation in the semantics of NNs. To further complicate matters, all of this variation exists without any grammatical cues to indicate the semantic relationship between nouns in NN pairs. This intriguing phenomenon has been almost entirely ignored in the literature. Very little empirical research has been focussed on the semantics of NNs and how they have evolved over time. The objective of this study is to take a first step toward filling that gap by triangulating use-based corpus data and user-based classification data to investigate the

164  Shifting Semantics of Noun+Noun Sequences semantic relationships that are possible in NNs and how those categories are changing over time. In the next section, we discuss research into diachronic changes in the use of NNs. After that we review previous literature on the semantics of NN sequences and their classification. We then introduce two approaches to linguistic data: use-based and user-based, and their relevance to the present study. Finally, we introduce the objectives and research questions for this study. Diachronic Change in NNs The use of NNs in English is increasing at an accelerated pace (Biber & Gray, 2013). It has recently been shown that this pattern is part of a larger trend in written English toward the use of more compression in the noun phrase and less elaboration in clauses (Biber & Gray, 2016). While NNs appear to be on the rise in English generally, the rate of this increase varies quite dramatically across registers (Biber & Gray, 2011; Biber, Egbert, Gray, Oppliger, & Szmrecsanyi, 2016). NNs can simply represent a genitive relationship (e.g. FBI’s director = director of the FBI = FBI director). However, there are many other semantic relations that are possible between two nouns in a NN structure, as we observed in the oil examples in the previous section. One of the reasons for the rapid rise in the use of NNs is the fact that this structure is extremely productive, not only in terms of the nouns that are used but also in the semantic relationships that are possible between the two nouns. Biber and Gray (2016) suggest that the list of possible semantic relationship between nouns in NNs is expanding over time, but an empirical investigation of these changes was beyond the scope of their study. While no research has looked at diachronic change in the semantics of NNs, the next section describes previous attempts at describing NN semantics in contemporary use. Semantic Relationships in NNs The multiplicity of possible semantic relationships between nouns in NNs is a phenomenon that has perplexed linguists for decades. In the earliest research on this topic, scholars working within a generative syntax framework were unable to agree on the best way to classify NNs based on their semantic properties (see Lees, 1970; Levi, 1974; Zimmer, 1971). Much of this research relied on paraphrasing NNs into clauses or prepositional phrases in order to determine their meaning (see also Lauer, 1995; Nakov & Hearst, 2006). Based in part on this lack of agreement, one scholar concluded that it is not possible to classify NNs into discrete semantic categories (Downing, 1977). One line of more recent research has focussed on NNs as a third genitive variant (in addition to of and ’s genitives) (see Rosenbach, 2006,

Shifting Semantics of Noun+Noun Sequences  165 2007; Szmrecsanyi et al., 2016). However, it is also clear that many NNs cannot be paraphrased using an ’s or of genitive. Moreover, many of the NNs that can be paraphrased using one of those two variants do not meet the traditional criteria for genitives (see Biber et al., 2016). The Longman Grammar of Spoken and Written English is one of the only reference grammars of English to describe patterns of use for NNs (Biber et al., 1999). The authors of this book include a list of 12 semantic categories for NNs (three of which are subdivided into two types). However, they also acknowledge that the list they provide is by no means exhaustive or all-inclusive (Biber et  al., 1999, p.  591). While this list is useful, it was developed on the basis of the authors’ intuitions after surveying corpus data rather than on an empirical quantitative analysis. More recently, Girju, Moldovan, Tatu, and Antohe (2005) developed a list of 35 semantic classification categories for NNs in a topdown fashion, based on their own experience and intuitions. It appears that the authors of this study developed this list independently of the list produced in Biber et al. (1999), as the lists look quite different and the Girju paper does not make any reference to Biber et al. (1999). Assuming that this is the case, there is a surprising amount of overlap between the two lists. For example, both lists contain categories for location, purpose, source, time/temporal, agent/subjective, partitive/part-whole, and hypernymy/identity. The longer list produced by Girju et  al. is largely a result of an attempt to classify NNs at a more fine-grained level. For example, the authors of that study distinguish between cause (malaria mosquito = mosquito that causes malaria) and make/produce (shoe factory = factory that makes shoes). The list developed in Girju et al. was used by coders in their study to classify a set of NNs into categories based on the semantic relationship between the two nouns. The primary coders were two Ph.D. students who were assigned to select one of the 35 categories for a set of NNs sampled from newspapers. The ultimate goal of Girju et al.’s study was to train computational algorithms to automatically classify NNs. As a result, the methods used by the authors, as well as the results they report, are limited in their usefulness for descriptive linguistic purposes. However, to our knowledge, this is the only previous study that has taken an empirical approach to the semantic relationships between nouns in NNs. The moderate agreement results they achieved are promising in that they suggest that (a) NNs can be classified based on semantics and (b) this can be done with at least some level of reliability. The methods used in the Girju et al. study leave a number of important questions unanswered, such as: What is the best method for establishing a list of semantic categories for NNs? What is the best method for collecting a sample of NNs that represents the full range of semantic categories? Who should classify the NNs? How should this classification be performed? While some of these questions may have been beyond the scope

166  Shifting Semantics of Noun+Noun Sequences of Girju et al.’s study, they are central to the present study. Additionally, our study also attempts to learn about i) an empirically supported list of semantic categories for NNs; ii) the psycholinguistic reality of the semantic relations between nouns in NNs; and iii) diachronic change in the use of NN semantic categories. The present study is the first to attempt to answer those questions. Use-based vs. User-based Approaches to Linguistics In order to achieve the objectives of our study, it is necessary to analyze NNs using data from two different sources: use-based and user-based. In this section, we describe these two types of data and how they can be triangulated in linguistic research. Use-based data includes any language actually produced by ­language users. Corpus data is a prime example of use-based data. However, ­corpora are not the only type of use-based data. Use-based data includes elicited language, sociolinguistic interviews, test responses, and any other type of recorded language data. The analysis of use-based data can p ­ rovide insights into a wide variety of phenomena about how a language is actually used by its speakers, such as frequency of use, language choices, linguistic possibilities and probabilities, diachronic change, and sociolinguistic variation. This study will rely on use-based data to (i) ­generate a list of highfrequency NNs and (ii) measure the frequency of those NNs over time. Unlike use-based data, which is produced by language users, user-based data is data about language that comes from language users. Examples of user-based data include reader/listener perceptions, perceptions of dialect features and accentedness, grammaticality judgments, text classification, and data from psycholinguistic experiments (e.g. sentence completion tasks, lexical decision tasks, eye-tracking, etc.). Technically speaking, linguistic researchers are often users of the language they are analyzing. However, these experts are excluded from our definition for user-based data, which includes only language-related data from non-expert language users. One source of non-expert user-based language data that is becoming increasingly common in linguistic research is crowdsourcing. Crowdsourced data is collected from a large number of people, usually via an internet-based crowdsourcing service (e.g. Mechanical Turk). Within linguistics, crowdsourcing has been used to collect data on reader perceptions (Egbert, 2014; Egbert, 2016), register classification (Egbert, Biber,  & Davies, 2015; Biber  & Egbert, 2016; Asheghi, Sharoff,  & Markert, 2016), word sense disambiguation (Rumshisky, 2011; Jurgens, 2013), among others. Traditional methods in historical linguistics have relied heavily on researcher intuition, judgment, identification, and classification. This approach has always raised questions about the reliability of data coding. However, until recently it was possible for a single researcher to code

Shifting Semantics of Noun+Noun Sequences  167 all of the data for a study due to the relatively small data sets and corpora that have been used (typically just 1–2 million words for the most widely-used historical corpora of English). This is no longer possible in many cases. The ‘lone researcher’ approach is entirely impractical with the much larger historical corpora that are becoming widely used. In this study, we propose a new approach that relies on user-based data from multiple coders. This makes it possible to (i) develop a list of semantic categories for NNs and a NN classification instrument; and to (ii) classify NNs into semantic categories. We believe this approach addresses the issue of reliability because we are able to calculate intercoder reliability and agreement. Additionally, this approach is scalable to very large data sets, especially with the use of crowdsourcing technology. Study Aims and Outline In this study, we attempt to answer the following three research questions about the semantics of NNs: 1. What semantic relationships are possible between nouns in NNs? 2. Can non-expert language users reliably classify NNs into semantic categories? 3. How do the semantic categories for NNs develop historically? This will be accomplished through triangulation of the previously listed use-based and user-based methods. In the next section, we describe these methods in detail. The Results and Discussion section contains the full results of our study and our discussion of them, organized according to the three research questions. We conclude this chapter with a summary of our findings and some reflective comments on the use of triangulation in this study.

Methods

This section contains a detailed description of the methods used in this study. We begin by describing the design of the corpus and the search queries we used. We then explain the methods used to establish a list of semantic relationship categories for NNs, develop an instrument for classifying NNs into these categories, and use that instrument to classify 1,535 NNs. Finally, we describe the methods we used to analyze this data in order to answer our three research questions.

Corpus

The use-based aspects of this study relied on the Corpus of Historical American English (COHA) (Davies, 2010). COHA contains just over

400 million words of published writing across four registers of American English between 1810 and 2009 (see Table 7.1). Roughly half of the words in COHA come from fiction (prose, poetry, and drama). The other half is composed of popular magazines (24%), non-fiction books, balanced across the U.S. Library of Congress classification system (15%), and newspapers (10%). A more complete description of COHA, along with a complete description of all 115,000 texts can be found at http://corpus.byu.edu/coha/.

Table 7.1  Composition of COHA, by Register and Decade

Decade    Fiction        Popular Magazines   Newspapers    Non-fiction Books   Total
1810s     641,164        88,316              0             451,542             1,181,022
1820s     3,751,204      1,714,789           0             1,461,012           6,927,005
1830s     7,590,350      3,145,575           0             3,038,062           13,773,987
1840s     8,850,886      3,554,534           0             3,641,434           16,046,854
1850s     9,094,346      4,220,558           0             3,178,922           16,493,826
1860s     9,450,562      4,437,941           262,198       2,974,401           17,125,102
1870s     10,291,968     4,452,192           1,030,560     2,835,440           18,610,160
1880s     11,215,065     4,481,568           1,355,456     3,820,766           20,872,855
1890s     11,212,219     4,679,486           1,383,948     3,907,730           21,183,383
1900s     12,029,439     5,062,650           1,433,576     4,015,567           22,541,232
1910s     11,935,701     5,694,710           1,489,942     3,534,899           22,655,252
1920s     12,539,681     5,841,678           3,552,699     3,698,353           25,632,411
1930s     11,876,996     5,910,095           3,545,527     3,080,629           24,413,247
1940s     11,946,743     5,644,216           3,497,509     3,056,010           24,144,478
1950s     11,986,437     5,796,823           3,522,545     3,092,375           24,398,180
1960s     11,578,880     5,803,276           3,404,244     3,141,582           23,927,982
1970s     11,626,911     5,755,537           3,383,924     3,002,933           23,769,305
1980s     12,152,603     5,804,320           4,113,254     3,108,775           25,178,952
1990s     13,272,162     7,440,305           4,060,570     3,104,303           27,877,340
2000s     14,590,078     7,678,830           4,088,704     3,121,839           29,479,451
TOTAL     207,633,395    97,207,399          40,124,656    61,266,574          406,232,024

Corpus Analysis

The first step in our study was to extract the most frequent NNs from COHA. COHA was tagged using the CLAWS part-of-speech tagger. Using these tags, we performed a database query that identified all occurrences of two adjacent nouns. In order to ensure that we represented NNs from across the time periods in COHA, we divided the corpus into six time periods (1810–1840; 1850–1880; 1890–1920; 1930–1950; 1960–1980; and 1990–2000) and required that at least the 400 most frequent NNs from each of the six time periods were included in our data set. We limited our analysis to the 1,535 most frequent NNs. Before

Shifting Semantics of Noun+Noun Sequences  169 establishing our final list of NNs, we manually eliminated non-NNs that were in the list as a result of tagging errors. Frequency data (per million words) for each decade was recorded for each of these 1,535 NNS. These normed frequency counts were used for the analysis of diachronic change (Research Question 3). Developing a User-based Instrument for NN Classification The next two steps of our method, developing a list of semantic categories for NNs and an instrument for NN classification, are described together in this section. This is because these two steps were developed simultaneously in a bottom-up fashion through a series of pilot studies. We used the list of semantic categories for NNs in the Longman Grammar of Spoken and Written English (LGSWE) as a starting point for our research (Biber et al., 1999, pp. 590–591). The LGSWE categories are:   1. Composition   2. Time   3. Location   4. Partitive   5. Specialization   6. Institution   7. Identity   8. Source   9. Purpose 10. Content 11. Objective 12. Subjective Pilot Study 1 The first pilot study was performed by the two authors (N = 2). We began by randomly sampling 100 NNs from our list of 1,535 NNs. We then attempted to independently classify each NN into one of the 12 LGSWE categories. A comparison of the results revealed extremely low inter-rater agreement. After discussing the reasons for the disagreements we realized that we did not even agree on the distinctions between the semantic categories. Based on this discussion, we made modifications to the list of categories. We also learned from this pilot study that selecting from a long list of semantic categories is a difficult task. This led us to develop a classification instrument in which coders select from a list of sentences, where each sentence presents the NN in question in the form of an explanation of the relationship between the two nouns (e.g. afternoon tea: ‘tea is found or takes place at the time of afternoon’). Based on our experience

170  Shifting Semantics of Noun+Noun Sequences with the first pilot study, we believed this approach would be superior because it would make the task faster and more natural by eliminating the requirement for coders to memorize and interpret the definitions of the semantic categories. Pilot Study 2 For the second pilot study we used the instrument described earlier. The coding was performed by university students (N = 42) enrolled in classes taught by the two authors. Each coder classified the same set of 30 NNs, which were randomly selected from the original list of 1,535 NNs. This was done using an online survey tool that presented each of the 30 NNs followed by 12 sentences with the instruction to select the sentence that best represented the meaning of the NN. The results of this study revealed that while perfect agreement among the 42 participants was rare, most of the NNs were coded into one semantic category by a large majority of the coders. This led us to the conclusion that more than two raters were needed to get reliable results for this task. Based on our results, we made small modifications to the wording of some of the sentences. We also noticed that three of the more general semantic categories were being overused by some of the participants. This led us to modify the instrument so that it began with the question, “Do any of the following describe the meaning of _______?”, followed by nine options. Then, after a section break, the survey presented the question, “IF NOT, do any of the following describe the meaning of ________?”. This was done with the hope that coders would take the opportunity to select a more specific category, if it was the best choice, rather than repeatedly selecting more general categories that might also make sense within the rephrased sentence. For example, the categories of purpose, topic, and process were overused in the early pilot studies. The modification described here seemed to motivate participants to select the most appropriate option. The list of semantic categories used in this study, along with the rephrased sentences used in the classification instrument, is displayed in Table 7.2. Pilot Study 3 In the final pilot study, we used the modified version of the survey used in Pilot Study 2. We recruited coders (N = 59) through Mechanical Turk. Together, these workers coded 150 NNs, randomly sampled from the original list of 1,535 NNs, and each NN was coded by eight independent workers. The results of this pilot study were encouraging, showing that most of the NNs were coded into a single semantic category by a majority of the eight coders. Based on these results, we decided that the list of semantic categories and the instrument were ready to be used on a large


Do any of the following describe the meaning of health care?
( ) care is made from health. (example: glass window)
( ) care is found or takes place at the time of health. (example: Christmas party)
( ) care is found or takes place at the location of health. (example: corner cupboard)
( ) care is one of the parts that make up a(n) health. (example: cat legs)
( ) care is a person, health is what he/she specializes in. (example: finance director)
( ) care is an institution, health is the type of institution. (example: insurance company)
( ) care is owned by health. (example: pirate ship)
( ) A(n) health care is a(n) health and it is also a(n) care. (example: exam paper)
( ) health is the source of care. (example: plant residue)

If NOT, do any of the following describe the meaning of health care?
( ) health is the topic of care. (example: algebra textbook)
( ) care is a process related to health. (example: eye movement)
( ) health is the purpose or use for care. (example: pencil case)

Figure 7.1  An Example of the Final Classification Instrument for NN Sequences

scale for our final analysis. A screenshot of our final instrument can be seen in Figure 7.1. Classifying NN Sequences After making several rounds of revision to the list of semantic categories and to the classification instrument, based on three pilot studies, we were prepared to collect data on a larger scale. We recruited a large number of coders (N = 255) through Mechanical Turk. These workers coded the full set of 1,535 NNs described in the Methods section. As with Pilot Study 3, each NN was coded by eight independent workers. Data Analysis Agreement and Classification Agreement was measured using Fleiss’ kappa, a statistic for measuring interrater agreement among two or more raters on categorical data. Like

Cohen’s kappa, Fleiss’ kappa accounts for chance agreement among raters, making it more robust than simple percent agreement. However, unlike Cohen’s kappa, Fleiss’ kappa does not require coders to be the same for each item, making it ideal for the design of our study. Fleiss’ kappa calculations were performed in R, using the ‘kappam.fleiss’ function in the irr package (Gamer, Lemon, Fellows, & Singh, 2012).

After measuring inter-rater agreement, we set out to classify as many NNs as possible into a single semantic category. We began by calculating the number of coders that assigned each of the 1,535 NNs into each of the 13 categories (12 semantic categories plus an ‘other’ option). We determined that a NN sequence would be assigned to a particular category if that category was selected by a plurality of the coders. In our study, we used the following definition for plurality. A particular NN sequence was classified as Category X by a plurality if:

a. It was classified as Category X by 5+ raters; or
b. It was classified as Category X by 4 raters, and no other category was selected by more than 2 coders; or
c. It was classified as Category X by 3 raters, and no other category was selected by more than 1 coder.

Quantitative Analysis

The subset of NNs that met the previously listed agreement criteria was included in this study. These NNs, along with normed rates of occurrence (per million words) for each of the six major COHA time periods, were stored in a spreadsheet. These data were used to compute frequency means for each of the 12 semantic categories in each time period. These means were used to measure diachronic change in the use of the 12 semantic categories. They were also used to perform a factorial ANOVA to measure the effect of time, semantic category, and the interaction between those two variables, on the frequency of use of the NNs in the data set. All statistical procedures were performed in R.
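The plurality rule defined above can be stated compactly in code. The following is an illustrative sketch, not the script used in the study; the function name and the toy ratings are invented, and it assumes eight independent ratings per NN sequence, as in the study's design.

```python
# Illustrative sketch of the plurality rule used to assign a NN sequence
# to a single semantic category from eight independent ratings.
from collections import Counter
from typing import Optional

def plurality_category(ratings: list[str]) -> Optional[str]:
    """Return the winning category, or None if no plurality is reached.

    Rule (as described in the chapter):
      - 5+ raters chose it; or
      - 4 raters chose it and no other category got more than 2; or
      - 3 raters chose it and no other category got more than 1.
    """
    counts = Counter(ratings)
    (top_cat, top_n), *rest = counts.most_common()
    runner_up = rest[0][1] if rest else 0
    if top_n >= 5:
        return top_cat
    if top_n == 4 and runner_up <= 2:
        return top_cat
    if top_n == 3 and runner_up <= 1:
        return top_cat
    return None

if __name__ == "__main__":
    print(plurality_category(["time"] * 5 + ["location"] * 3))                   # -> time
    print(plurality_category(["time"] * 4 + ["location"] * 3 + ["purpose"]))     # -> None
    print(plurality_category(["time"] * 4 + ["location"] * 2 + ["purpose"] * 2)) # -> time
```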

Results and Discussion

Semantic Relationships

Before analyzing the quantitative results of the study, we will first take a closer look at the semantic categories included in the final list. The complete list, along with the rephrased sentence used in the instrument and three examples for each, is displayed in Table 7.2. These semantic relationships can be organized on a continuum that ranges from more concrete to more abstract. The categories of Composition, Partitive, and Location are quite concrete, whereas the categories of Process, Purpose, and Topic are more abstract. The categories of Ownership and Source fall somewhere in the middle.

Table 7.2  Semantic Categories of NNs, With Rephrased Sentences and Examples

Category         Rephrased Sentence                                     Examples
Composition      N2 is made from N1                                     brass button, grape juice, paper towel
Time             N2 is found or takes place at the time of N1           Christmas gift, autumn leaf, summer air
Location         N2 is found or takes place at the location of N1       library door, street light, mountain stream
Partitive        N2 is one of the parts that makes up a N1              shirt collar, chicken breast, television screen
Specialization   N2 is a person. N1 is what he/she specializes in       college professor, sales manager, construction worker
Institution      N2 is an institution. N1 is the type of institution    police department, oil industry, law school
Identity         A/an N1 N2 is a/an N1 and it is also a/an N2           patron saint, bow tie, minority student
Source           N1 is the source of the N2                             farm income, man power, drug problem
Purpose          N1 is the purpose or use for N2                        assault weapon, light bulb, operating room
Topic            N1 is the topic of the N2                              tax law, world news, science fiction
Process          N2 is a process related to N1                          data analysis, air conditioning, population growth
Ownership        N2 is owned by N1                                      enemy plane, family mansion, merchant vessel

Scholars in semantics have hypothesized that linguistic forms with concrete meanings tend to develop before forms with more abstract meanings (see Traugott, 1989; Heine, Claudi, & Hünnemeyer, 1991). Based on this, we could hypothesize that concrete NNs were adopted into English first, and more abstract NNs were adopted later. The final list of semantic categories used in this study is by no means exhaustive. As we will see in the next section, there were many NNs that

coders could not agree on. One reason for these disagreements may be that the actual semantic category for the NN was not presented as an option in the instrument, and coders disagreed on what the next best option was. However, based on the results of our pilot studies and data collection, we believe that this list includes many of the important meaning relationships that can exist between two nouns in a NN. We hope that this list serves as a useful starting place for future research.

Classification Agreement

The overall Fleiss’ kappa was .34 for the full set of 1,535 NNs with eight raters and 13 categories (12 semantic categories plus an ‘other’ option). This can be interpreted as an indication of “fair agreement” according to Landis and Koch (1977). As discussed in the Methods section, NNs were assigned to a category if they met our criteria for classification by a plurality of coders. Overall, 974 NNs met these criteria, allowing us to classify approximately 64% of the NNs. Table 7.3 shows the number of NNs that were classified into a semantic category at each level.

Table 7.3  Classification Agreement Results

Raters    NNs    Cumulative NNs
8         133    133
7         171    304
6         174    478
5         238    716
4         230    946
3         28     974

The data in Table 7.3 can be analyzed in many different ways. We could focus on the fact that coders were not able to agree on a single semantic category for 36% of the NNs. We were actually quite encouraged by these results despite being unable to achieve agreement on all of the NNs in our data set. A large and varied group of untrained coders managed to classify nearly two thirds of the NNs in our data set, without any situational or linguistic context to aid them. We should also keep in mind that many of the NNs in our data set were most frequent in earlier time periods (e.g. chain stitch, salt pork, boon companion, tenement house), increasing the likelihood that they would be unfamiliar to contemporary coders. We do, however, believe that the cases where agreement was not achieved raise important questions that must be addressed in future research. These questions include: What additional semantic categories for NNs exist? Are there NNs that represent hybrids (i.e. they can fit within multiple semantic categories)? Would modifications to the

While the answers to these questions are beyond the scope of the current study, we believe they are important questions for future research that will add to our understanding of the semantics of NNs.

The remaining results in this study are based on the reduced data set of 974 NNs that were classified into a single semantic category. We computed Fleiss' kappa for this data set overall, as well as for each of the 13 semantic categories. The overall Fleiss' kappa was .47, revealing a marked, if unsurprising, improvement over the kappa results for the full data set. For the reduced data set, we also computed Fleiss' kappa for each of the 13 categories separately. These results are given in Table 7.4. There is a wide range of variation in the inter-rater agreement across these categories, including three categories that could be interpreted as 'substantial agreement' (.61–.80) and three that would be interpreted as 'slight agreement' (.01–.20) (Landis & Koch, 1977).

After reviewing the agreement results, we were curious to know whether coders are more likely to agree on the semantic category of NNs that are more frequent in recent decades than in the distant past. As discussed earlier, in order to learn about diachronic change in the use of NNs, we included NNs that were much more frequent in historical time periods than in contemporary use. However, we believe this may have presented challenges to coders who are unfamiliar with the language used in those earlier time periods. While a comprehensive analysis of this relationship is beyond the scope of this study, we ran a series of simple correlations between the frequency of the NNs across semantic categories and the Fleiss' kappa agreement across those categories. This same correlation was run for each of the six time periods included in this study.

Table 7.4  Fleiss' Kappa Results Across the 13 Semantic Categories for the Reduced Set of 974 NNs

Category          Fleiss' kappa    Interpretation
specialization    0.731            Substantial
composition       0.727            Substantial
time              0.653            Substantial
location          0.569            Moderate
institution       0.542            Moderate
purpose           0.385            Fair
process           0.369            Fair
partitive         0.262            Fair
source            0.259            Fair
ownership         0.231            Fair
identity          0.206            Slight
other             0.197            Slight
topic             0.162            Slight
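The agreement figures reported in Tables 7.3 and 7.4 can be reproduced in outline with standard statistical libraries. The Python sketch below is not the authors' implementation; the plurality criterion used here (a single most-chosen category, selected by at least three of the eight coders and by strictly more coders than any competing category, which the minimum of three raters in Table 7.3 suggests) and all variable names are illustrative assumptions.

```python
import numpy as np
from collections import Counter
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings: one row per NN, one column per rater; cell values are category
# labels such as "location" or "other" (13 possible labels in total).
def plurality_category(row, min_raters=3):
    """Return the plurality category for one NN, or None if no single
    category is chosen by at least `min_raters` coders and by strictly
    more coders than any other category (illustrative criteria)."""
    counts = Counter(row).most_common()
    top_label, top_n = counts[0]
    if top_n < min_raters:
        return None
    if len(counts) > 1 and counts[1][1] == top_n:
        return None  # tie: no plurality winner
    return top_label

def overall_fleiss_kappa(ratings):
    """Fleiss' kappa for an (n_items x n_raters) array of category labels."""
    table, _ = aggregate_raters(ratings)  # -> (n_items x n_categories) counts
    return fleiss_kappa(table, method='fleiss')

# Toy example: four NNs, each rated by eight coders.
ratings = np.array([
    ["location"] * 8,
    ["location"] * 5 + ["partitive"] * 3,
    ["time"] * 3 + ["purpose"] * 3 + ["other"] * 2,
    ["process"] * 4 + ["topic"] * 4,
])
print([plurality_category(row) for row in ratings])
print(round(overall_fleiss_kappa(ratings), 2))
```

The same kappa computation can then be repeated over the subset of NNs assigned to each category to obtain per-category values of the kind reported in Table 7.4.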

Figure 7.2  NN Frequency by Agreement Correlations Across Time Periods (correlations between NN frequency and Fleiss' kappa agreement, plotted for each of the six time periods)
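The per-period correlations plotted in Figure 7.2 can be illustrated as follows. The chapter reports only 'simple correlations', so the use of Pearson's r here is an assumption, and the small data frame (built from a handful of values in Tables 7.4 and 7.5, with illustrative column names) merely sketches the shape of the computation.

```python
import pandas as pd

# One row per semantic category: its Fleiss' kappa (Table 7.4) and its
# normed frequency (pmw) in the earliest and latest periods (Table 7.5).
# Only six of the thirteen categories are included here, for brevity.
categories = pd.DataFrame(
    {
        "kappa":   [0.542, 0.569, 0.731, 0.385, 0.727, 0.369],
        "freq_p1": [45.06, 83.81, 19.25, 23.60, 85.52, 10.07],
        "freq_p6": [278.27, 207.03, 268.88, 256.35, 127.06, 204.90],
    },
    index=["institution", "location", "specialization",
           "purpose", "composition", "process"],
)

# Pearson correlation between per-category frequency and agreement,
# computed separately for each time period's frequency column.
for period_col in ["freq_p1", "freq_p6"]:
    r = categories[period_col].corr(categories["kappa"])
    print(period_col, round(r, 2))
```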

The results are displayed in bar plot form in Figure 7.2. The six time periods are laid out on the x-axis and the correlations are plotted on the y-axis. It can be seen from this plot that there is a strong relationship between inter-rater agreement among the coders and the frequency of the NNs across time periods. Whereas there is a moderate negative correlation between these two variables in the earliest time period (1810–1849), there is a strong positive correlation in the two most recent time periods (1960–1989; 1990–2009). From these results we can draw two conclusions: (i) there is a relationship between NN frequency and inter-rater reliability; and (ii) this relationship is much more pronounced for recent data sets than for older ones.

Diachronic Change

To answer the third research question, we investigated diachronic changes in the use of NNs across the 12 semantic categories. Our results confirmed findings from previous research by showing that NNs are increasing in frequency over time. We found that this pattern holds true across all of the semantic categories (see Table 7.5). However, our analysis of the data revealed that this general pattern only tells part of the story. There is a large amount of variability across semantic categories in their initial frequency in the earliest time period of the corpus, as well as in their frequencies in the most recent time period. Moreover, the rate of increase over time varies widely across semantic categories.

Table 7.5  Normed Rates of Occurrence (per Million Words) of NN Semantic Categories Across Six Time Periods

Category          1810–1849   1850–1889   1890–1929   1930–1959   1960–1989   1990–2009
Institution           45.06       45.18      110.01      225.24      251.22      278.27
Location              83.81      101.43      152.57      196.94      188.77      207.03
Specialization        19.25       39.06      100.47      158.68      172.52      268.88
Purpose               23.60       20.74       51.00      157.75      214.94      256.35
Composition           85.52       98.13       95.78      115.87      103.39      127.06
Process               10.07       11.60       28.43       79.94      125.64      204.90
Source                33.98       29.02       35.86       56.66       45.12       66.04
Time                  36.94       46.71       40.45       49.34       37.51       45.77
Topic                 16.56       19.52       26.68       38.75       51.28       94.81
Identity              14.26       11.48       22.21       50.63       48.69       56.53
Partitive              3.53        3.12        7.42       20.21       34.67       70.27
Ownership              4.69        4.20        7.27       11.89        9.96       12.98
Other                 16.29       12.62       15.11       26.07       28.22       35.19
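The rates in Table 7.5 are standard per-million-word normalizations of raw counts within each time period's sub-corpus. A minimal illustration of the normalization formula (the count and corpus size below are hypothetical, not COHA figures):

```python
def rate_pmw(raw_count, corpus_size_words):
    """Normalized rate of occurrence per million words."""
    return raw_count * 1_000_000 / corpus_size_words

# Hypothetical example: 1,352 location NNs in a 30-million-word sub-corpus.
print(round(rate_pmw(1352, 30_000_000), 2))  # -> 45.07 occurrences pmw
```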

In the next section we describe the results of a factorial ANOVA aimed at accounting for the effect of time and semantic category, as well as a possible interaction between them.

Statistical Results

We performed a 6 × 13 factorial ANOVA to determine the statistical effects of the variables of time and semantic category on variation in the frequency of NNs in the data set. The results of this analysis revealed a significant interaction effect between the variables of time and semantic category, F(60, 5766) = 4.18, p < .0001, R² = .03. This shows that diachronic change in the use of a particular NN is likely to depend on the semantic relationship between the two nouns. NN frequency was also predicted by semantic category, F(12, 5766) = 3.50, p < .0001, R² = .006, and time, F(6, 5766) = 119.67, p < .0001, R² = .08. However, in the presence of an interaction, it is appropriate to investigate simple effects (i.e. differences between the levels of variable 1 within each level of variable 2, separately) rather than the overall main effects produced by the factorial ANOVA (a schematic sketch of this kind of analysis is given after the list below). Thus, our next step was to describe the diachronic trends in the use of NNs across each of the 13 semantic categories.

Although each of the categories follows a slightly different trend, we managed to identify three major patterns of diachronic change:

1. Frequent → frequent
2. Infrequent → infrequent
3. Infrequent → frequent
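As flagged above, the factorial analysis can be sketched with standard tools. The example below uses synthetic data rather than the study's data set; the column names, the observation unit (one row per NN per time period, which is consistent with the reported error degrees of freedom), and the use of Type II sums of squares are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data standing in for the real data set: one row per NN per time
# period, with its normed frequency, semantic category, and period.
rng = np.random.default_rng(1)
rows = []
for cat, slope in [("location", 20), ("composition", 7), ("process", 35)]:
    for nn in range(40):                      # 40 NNs per category
        for period in range(1, 7):            # six time periods
            freq = slope * period + rng.normal(0, 10)
            rows.append({"category": cat, "period": period, "freq": freq})
df = pd.DataFrame(rows)

# Two-way (factorial) ANOVA with an interaction term between time and category.
model = smf.ols("freq ~ C(period) * C(category)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Given a significant interaction, simple effects of time can be examined
# within each semantic category separately.
for cat, sub in df.groupby("category"):
    sub_fit = smf.ols("freq ~ C(period)", data=sub).fit()
    print(cat, sm.stats.anova_lm(sub_fit, typ=2).loc["C(period)", "PR(>F)"])
```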

In the next three sections we present descriptions of these patterns, including quantitative and qualitative findings from our data.

Pattern 1: Frequent → Frequent

The first pattern we discovered in our data set was that two semantic categories, location and composition, were more frequent than the other categories in the earliest time period (Figure 7.3). Whereas most of the categories occurred fewer than 50 times per million words (pmw), these two categories occurred about 85 times per million words. These categories share the characteristic of being highly concrete. Composition NNs (e.g. stone wall, orange juice, gold watch, wood door) include tangible objects, where N2 is a concrete head noun and N1 is another concrete noun that specifies what N2 is made from. In location NNs (e.g. kitchen table, heart disease, forest fire, mountain resort) N1 describes the place where N2 exists or takes place. Of all the semantic categories in our framework, these two categories are the most concrete. This suggests that the NN grammatical structure may have begun as a means of describing concrete objects and processes and was later adopted for the description of more abstract concepts and processes. This supports the previously mentioned hypothesis that semantic change often shifts from concrete meanings to abstract meanings (see Traugott, 1989; Heine et al., 1991).

The categories of location and composition have both increased in use over time. Based on the data in COHA, the rate of increase for the category of composition has been relatively modest over time, with a net increase of just over 40 occurrences pmw from time period 1 to time period 6. In contrast, the location category has increased in use much more rapidly, with a net increase of more than 120 occurrences pmw.

Figure 7.3  Diachronic Change in the Semantic Categories Within Pattern 1 (Frequent → Frequent): normed frequencies (pmw) of the location and composition categories across the six time periods

Although these two categories are not the most frequently used semantic categories for NNs, they are both more frequent than more than half of the semantic categories. It is interesting to note that location and composition NNs, combined, made up more than 40% of all NNs in our data set during the period of 1810–1849. This proportion shifted quite dramatically to a mere 19% in the most recent time period, showing the relatively rapid diachronic spread of NN constructions into other semantic domains.

Pattern 2: Infrequent → Infrequent

The second pattern comprises semantic categories that have occurred in every time period, but which have remained consistently infrequent relative to the other semantic domains (Figure 7.4). This list includes the following semantic categories: time (e.g. summer day), identity (e.g. student teacher), partitive (e.g. family member), topic (e.g. algebra text), source (e.g. government policy), ownership (e.g. police car), and other. The semantic categories in Pattern 2 are less frequent than the categories in the other two patterns. All of these categories are increasing, just at different rates. Some of these categories, such as source, ownership, and time, are only about 2–3 times more frequent in the most recent time period when compared with the earliest time period. Identity NNs are four times more frequent, and topic NNs are six times more frequent. The semantic category in this pattern that is increasing most rapidly is the partitive category, which is nearly 20 times more frequent in time period 6 when compared with time period 1.

Figure 7.4  Diachronic Change in the Semantic Categories Within Pattern 2 (Infrequent → Infrequent): normed frequencies (pmw) of the time, identity, partitive, topic, source, ownership, and other categories across the six time periods

The semantic categories within this pattern represent the broad range of semantic relationships that can occur in NN sequences even if they are not nearly as frequent in contemporary American English writing as the categories in the other two patterns.

Pattern 3: Infrequent → Frequent

The semantic categories within Pattern 3 are particularly interesting since these NNs have undergone the most rapid changes in frequency during the past 200 years (Figure 7.5). The four categories in this pattern include: institution (e.g. insurance company, stock market), specialization (e.g. police officer, government official), purpose (e.g. credit card, golf course), and process (e.g. tax cut, birth control). On average, these categories are 13 times more frequent in period 6 than in period 1. Moreover, these categories comprise the four most frequent meaning relationship categories for NNs, occurring more than 250 times pmw, on average.

A closer look at the particular NN sequences within these four categories suggests that the rapid increase in the use of these categories reflects societal changes in the United States. The most salient of these changes seem to be related to specialization: of knowledge, labor, industry, and day-to-day processes. This shift in the semantics of NNs seems to correspond to the development of increasing specialization in scientific disciplines, government, commerce, job descriptions, and technology, among many others. Obviously, there is much more to the story since this shift toward specialization could have been expressed in other ways in American English.

Figure 7.5  Diachronic Change in the Semantic Categories Within Pattern 3 (Infrequent → Frequent): normed frequencies (pmw) of the institution, specialization, purpose, and process categories across the six time periods

The other piece to the puzzle seems to be a shift toward increased economy in the language that manifests itself in the use of compressed phrases rather than elaborated clauses, especially in writing (see, e.g. Biber & Gray, 2016).

Conclusion

Summary of Findings

Our end goal in this study was the description of diachronic change in the use of NNs across categories representing distinct semantic relationships between N1 and N2 (RQ 3). Before it was possible to answer that question, however, it was necessary to establish a list of the semantic relationships that are possible between two nouns in an NN (RQ 1) and determine whether non-experts could reliably classify NNs based on the meaning relationship between the two nouns (RQ 2).

To address RQ 1, we developed a list of 12 semantic categories for NNs. This list was initially based on Biber et al. (1999), but we made several revisions to it based on a series of pilot studies. During those same pilot studies, we refined an instrument that could be used by non-expert English speakers for classifying NNs into semantic categories.

In answer to RQ 2 (can non-expert language users reliably classify NNs into semantic categories?), we would answer 'yes'. Using this method, we managed to classify nearly two thirds of the NNs in our data set. However, we also found considerable variation in reliability across the semantic categories, suggesting that some of these categories are much better defined than others in the minds of language users. We also found an effect of time on inter-rater reliability. Raters were much more likely to agree on NNs that are more frequent in recent time periods than those that are more frequent in earlier time periods.

Our analysis of historical change in the use of NNs confirmed that the NN construction is increasing over time in written English. The results of this study have also shown that NNs are becoming increasingly productive, with the construction rapidly spreading across a wide range of semantic categories over the course of 200 years, a relatively short period of time in historical linguistics. We have also shown that different semantic categories have developed over time in very different ways. While all of the semantic categories are increasing over time, we showed a statistical interaction between time and semantic category. We then explored three underlying patterns of historical development. The first pattern includes two semantic categories that have been relatively frequent in all time periods. The second pattern includes seven categories that began with low frequencies and have experienced relatively small increases in frequency over time. The final pattern includes four categories that have
undergone rapid increases in frequency over time, beginning with relatively low frequencies and ending with the highest frequencies in the data set. We believe that one explanation for these changes is the hypothesis that semantic change moves from more concrete to more abstract meanings.

On Methodological Triangulation

The use-based corpus methods in this study were quite straightforward. The user-based methods, on the other hand, presented many challenges. Our experience was that non-expert raters perform best when the instrument relies on terminology and tasks they are already familiar with. In our case, that meant eliminating the names we had developed for the semantic categories and structuring the instrument in the form of a series of simple sentences. We had a similar experience in another project focussed on register classification in which coders performed better when register labels were replaced with familiar situational characteristics (e.g. spoken/written, interactive/non-interactive) (Egbert et al., 2015). We also found that it was not realistic to expect high inter-rater reliability between two raters in the task of classifying NNs. However, we were able to classify most of the NNs when we used eight raters and focussed on plurality agreement rather than inter-rater reliability. We had similar experiences in previous research that triangulated user-based and use-based corpus research methods (see Egbert et al., 2015; Egbert, 2014).

This study would not have been possible without the combination of use-based corpus data and the user-based semantic classification of human raters. High frequency NNs extracted from the corpus (use-based) were classified by raters (user-based) in a series of pilot studies that informed the development of a list of semantic categories and a NN classification instrument. This instrument was used by hundreds of raters (user-based) to classify 1,535 NNs extracted from the corpus (use-based) to represent patterns from six time periods. Finally, diachronic change in the frequencies of each semantic category of NNs (use-based) was measured for each of the categories that were established by raters (user-based). The iterative use-based and user-based stages of this research process combined to create a robust research methodology. This methodology provided data to address the question of historical change in English NNs across semantic categories.

The results of this study demonstrate the value of triangulating corpus linguistic methods with other methods in linguistics. Although it would have been possible for us as linguistic researchers to attempt to classify each of the NNs in our data set, we believe there were major advantages to using non-expert language users for the task. The most obvious advantage was that we were able to classify all of the NNs in our data set in a fraction of the time that it would have taken us. More importantly, we
learned a great deal in this study about the semantics of NNs that we could not have learned if we had attempted to classify them ourselves. We learned that:

1. users can think about the meaning relationship between two nouns in a NN by rephrasing it in the form of a sentence.
2. some semantic categories are much easier for users to agree on than others.
3. users are much more likely to agree on the semantic category of a NN if it is frequent in contemporary English.
4. users can usually identify the semantic relationship between two nouns in a NN without (i) any linguistic context or (ii) grammatical cues.

We believe that discoveries such as these are extremely valuable because they answer questions that are unanswerable using corpus data alone.

References

Asheghi, N. R., Sharoff, S., & Markert, K. (2016). Crowdsourcing for web genre annotation. Language Resources and Evaluation, 50(3), 603–641.
Biber, D., & Egbert, J. (2016). Register variation online. Cambridge: Cambridge University Press.
Biber, D., Egbert, J., Gray, B., Oppliger, R., & Szmrecsanyi, B. (2016). Variationist versus text-linguistic approaches to grammatical change in English: Nominal modifiers of head nouns. In M. Kyto & P. Paivi (Eds.), Handbook of English historical linguistics. Cambridge: Cambridge University Press.
Biber, D., & Gray, B. (2011). The historical shift of scientific academic prose in English towards less explicit styles of expression: Writing without verbs. In V. Bathia, P. Sánchez, & P. Perez-Paredes (Eds.), Researching specialized languages (pp. 11–24). Amsterdam: John Benjamins.
Biber, D., & Gray, B. (2013). Being specific about historical change: The influence of sub-register. Journal of English Linguistics.
Biber, D., & Gray, B. (2016). Grammatical complexity in academic English: Linguistic change in writing. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The Longman grammar of spoken and written English. London: Pearson.
Davies, M. (2010). The Corpus of Historical American English (COHA): 400 million words, 1810–2009. Available online at http://corpus.byu.edu/coha/.
Downing, P. (1977). On the creation and use of English compound nouns. Language, 53(4), 810–842.
Egbert, J. (2014). Reader perceptions of linguistic variation in published academic writing (Unpublished Ph.D. dissertation). Northern Arizona University, Arizona.
Egbert, J. (2016). Stylistic perception. In P. Baker & J. Egbert (Eds.), Triangulating methodological approaches in corpus linguistic research. New York, NY: Routledge.
Egbert, J., Biber, D., & Davies, M. (2015). Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology, 66(9), 1817–1831.
Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2012). irr: Various coefficients of interrater reliability and agreement. R package version 0.84. Retrieved from https://CRAN.R-project.org/package=irr
Girju, R., Moldovan, D., Tatu, M., & Antohe, D. (2005). On the semantics of noun compounds. Computer Speech & Language, 19(4), 479–496.
Heine, B., Claudi, U., & Hünnemeyer, F. (1991). Grammaticalization. Chicago and London: The University of Chicago Press.
Jurgens, D. (2013, June). Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 556–562).
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lauer, M. (1995). Designing statistical language learners: Experiments on noun compounds (Unpublished Ph.D. thesis). Macquarie University, Australia.
Lees, R. (1970). Problems in the grammatical analysis of English nominal compounds. In M. Bierwisch & K. E. Heidolph (Eds.), Progress in linguistics (pp. 174–186). The Hague: Mouton.
Levi, J. N. (1974). On the alleged idiosyncrasy of non-predicate NPs. Papers from the 10th regional meeting, Chicago Linguistic Society (pp. 402–415).
Nakov, P., & Hearst, M. (2006). Using verbs to characterize noun-noun relations. In International conference on artificial intelligence: Methodology, systems, and applications (pp. 233–244). Berlin, Heidelberg: Springer.
Rosenbach, A. (2006). On the track of noun+noun constructions in Modern English. In C. Houswitschka, G. Knappe, & A. Müller (Eds.), Anglistentag 2005 Bamberg: Proceedings of the conference of the German Association of University Teachers of English (pp. 543–557). Trier: Wissenschaftlicher Verlag Trier.
Rosenbach, A. (2007). Emerging variation: Determiner genitives and noun modifiers in English. English Language and Linguistics, 11(1), 143–189.
Rumshisky, A. (2011). Crowdsourcing word sense definition. In Proceedings of the 5th linguistic annotation workshop (pp. 74–81). Prague: Association for Computational Linguistics.
Szmrecsanyi, B., Biber, D., Egbert, J., & Franco, K. (2016). Towards more accountability: Modeling ternary genitive variation in late modern English. Language Variation and Change, 28(1), 1–29.
Traugott, E. C. (1989). On the rise of epistemic meanings in English: An example of subjectification in semantic change. Language, 65(1), 31–55.
Zimmer, K. E. (1971). Some general observations about nominal compounds. Working Papers on Language Universals, Stanford University, 5, 1–21.

8  Corpus Linguistics and Event-Related Potentials

Jennifer Hughes and Andrew Hardie

Introduction

Collocation can be defined as a "co-occurrence relation between two words" (McEnery & Hardie, 2012, p. 240), with collocation extraction being one of the key techniques used in corpus linguistics. Indeed, Gilquin and Gries (2009) find that 32% of a sample of 81 corpus research articles use collocation analysis. However, despite the prevalence of collocation analysis, it is not clear whether corpus-derived collocations are actually processed differently by the brain from non-collocational sequences, and therefore whether collocation can be seen as being a plausible psychological phenomenon.

There is growing recognition of the importance of combining corpus data with experimental work. For instance, Arppe, Gilquin, Glynn, Hilpert, and Zeschel (2010, p. 6) point out that "linguists have made relatively few efforts up until now to test the cognitive reality of corpora". Likewise, Gries (2014, p. 12) argues that "there will be, and should be, an increase of corpus-based studies that involve at least some validation against experimental data". Some psycholinguistic studies have attempted to ascertain the psychological validity of corpus-derived collocations using techniques such as eye-tracking and self-paced reading (e.g. Conklin & Schmitt, 2008; McDonald & Shillcock, 2003a, 2003b; Underwood et al., 2004; Huang, Wible, & Ko, 2012). The results of these studies reveal that sequences of words which form collocations are read more quickly and receive fewer eye fixations than sequences of words which do not form collocations. These studies therefore provide strong evidence in support of the validity of collocation as a psychological phenomenon. However, to ascertain whether or not there exists a neural correlate of corpus-derived collocations, it is necessary to combine corpus data with neuroimaging (also known as brain imaging) techniques, and thus to study processing activity directly rather than via proxy variables such as eye movements.

One such imaging technique is electroencephalography (henceforth EEG), where electrodes placed across the scalp detect some of the electrical
signals that are produced by neural activity (Harley, 2008, pp. 17, 494). Language-related EEG studies almost always use the event-related potential (henceforth ERP) technique, which involves analysing "the momentary changes in electrical activity of the brain when a particular stimulus is presented to a person" (Ashcraft & Radvansky, 2010, p. 61). In line with previous language-related EEG studies, the two experiments presented in this chapter use the ERP technique of analysing brainwave data.

In Experiment 1, we aim to find out whether or not there is a neurophysiological difference in the way that the brain processes pairs of adjacent words which form collocations (henceforth collocational bigrams) compared to pairs of adjacent words which do not form collocations (henceforth non-collocational bigrams). Collocational bigrams are defined operationally as those which have a high transition probability in the written section of the British National Corpus 1994 (henceforth, written BNC1994). An example is clinical trials. Non-collocational bigrams, on the other hand, are operationally defined as bigrams with a transition probability of zero (or, more accurately, a transition probability that is lower than can be measured using the written BNC1994). An example of a non-collocational bigram that corresponds to clinical trials is clinical devices: this bigram is grammatical, and meaningful in terms of both semantics and pragmatics, but simply does not occur in the written BNC1994, so its transition probability is zero. The transition probability is calculated by dividing the number of times the bigram X-then-Y occurs in the written BNC1994 by the number of times word X occurs in the written BNC1994 altogether (McEnery & Hardie, 2012, p. 195).

The use within stimuli of collocational versus non-collocational bigrams, paired in the same way as for clinical trials/devices, is the basis of the two distinct conditions utilized within each of the experiments reported here. Any differences in the ERP data arising from the stimuli across these two conditions will indicate that there is, in fact, a neurophysiological difference between the processing of collocations as opposed to non-collocations. Moreover, evidence of such differences will imply that the ERP technique is sensitive to the transition probabilities between words (as measured in corpus data). However, the reverse does not apply: the lack of a difference between conditions will not imply that collocation lacks neurophysiological reality. This is because EEG provides only a partial measure of brain activity (Kiloh et al., 1981, p. 46). That being the case, a null finding in this study will imply only that the ERP technique is not a suitable one for detecting any neural correlates of collocation (in itself a valuable result in terms of providing guidance for subsequent research).

Experiment 2 consists of two parts. In Part 1, we aim to replicate the results of Experiment 1 using a new set of stimuli in order to increase confidence in the conclusions. In Part 2, we then use the same stimuli set to investigate the strength of the correlation between the transition probability of a bigram and the amplitude of the ERP response to that
bigram being replaced by an equivalent non-collocational bigram. Thus, while Experiment 1 and Experiment 2 Part 1 treat collocational and non-collocational bigrams as disjunct categories, creating a binary variable, Experiment 2 Part 2 treats collocationality as a continuous variable, and aims to determine the extent to which the brain is sensitive to the varying strength of collocations as measured by textual transition probabilities.

According to Millar (2010, p. 275), "there is a need for more studies to validate the numerous statistical measures available to corpus linguists". To address this issue, the second aim of Experiment 2 Part 2 is to compare measures of collocation strength (i.e. association measures) in terms of how closely they correlate with the amplitude of the ERP response, and thus determine which measure(s) may be seen as having the most psychological validity. Eight association measures are tested, namely mutual information, transition probability, MI3, z-score, t-score, log-likelihood, Dice coefficient, and raw frequency. Log-likelihood is a significance statistic, meaning that it measures whether or not the "observed frequency" of a bigram is higher than its "expected frequency" (Hoffman, Evert, Smith, Lee, & Berglund, 2008, p. 150), whereas mutual information and transition probability are effect size statistics, reflecting in different ways the "ratio between observed and expected frequency" (Hoffman et al., 2008, p. 150). The remaining association measures (other than raw frequency) are hybrid measures which combine both significance and effect size, or significance/effect size and frequency (Hoffman et al., 2008, p. 151).

In terms of triangulation, then, this chapter uses a corpus and corpus-related methods to identify pairs of words that regularly co-occur. Then ERP research is carried out in order to better understand how these word pairs are processed in the brain, in contrast to equivalent 'non-co-occurring' pairs, and the extent to which the various measures of collocation match brain activity. The two key questions this study asks then are: 'What relationship, if any, is there between the phenomena of collocation and language processing as observed directly in brain activity?'; and 'Which measure of collocation strength best matches the effect on the brain's processing of collocational, as opposed to non-collocational, word pairs'?
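For concreteness, transition probability and the other association measures named above can all be derived from four corpus counts: the frequency of the bigram, the frequencies of its two component words, and the corpus size. The Python sketch below uses widely cited textbook formulas and hypothetical counts; it is an illustration rather than the exact implementation used in this chapter.

```python
import math

def association_measures(f_xy, f_x, f_y, n):
    """Common bigram association measures from corpus counts:
    f_xy = frequency of the bigram X-then-Y, f_x = frequency of X,
    f_y = frequency of Y, n = corpus size in tokens."""
    expected = f_x * f_y / n
    measures = {
        "raw_frequency": f_xy,
        "transition_probability": f_xy / f_x,
        "mutual_information": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "z_score": (f_xy - expected) / math.sqrt(expected),
        "dice": 2 * f_xy / (f_x + f_y),
    }
    # Log-likelihood (G-squared) over the full 2x2 contingency table.
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    row_totals = [f_x, f_x, n - f_x, n - f_x]
    col_totals = [f_y, n - f_y, f_y, n - f_y]
    ll = 0.0
    for obs, r_tot, c_tot in zip(observed, row_totals, col_totals):
        exp = r_tot * c_tot / n
        if obs > 0:
            ll += obs * math.log(obs / exp)
    measures["log_likelihood"] = 2 * ll
    return measures

# Hypothetical counts: a bigram occurring 420 times, with the adjective
# occurring 500 times and the noun 12,000 times in a 90-million-word corpus.
print(association_measures(420, 500, 12_000, 90_000_000))
```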

Background

Collocation and the Network Model of Language Processing

McEnery and Hardie (2012, p. 240) state that "words are said to collocate with one another if one is more likely to occur in the presence of the other than elsewhere". However, there are many different approaches to the study of collocation, each of which conceptualizes and operationalizes the phenomenon in different ways. In this section we introduce four different approaches, namely Sinclair's (1991) Idiom Principle, Hoey's (2005) theory of Lexical Priming, Wray's (2002) model of formulaic
language, and the collection of usage-based theories known as construction grammar (e.g. Goldberg, 1995).

Across the different approaches to collocation, there is either an explicit or an implicit assumption that collocations are represented in the mind as transitions across a network. The network consists of nodes which, depending on the theory, constitute individual words, collocations, or constructions. These nodes are connected to other nodes via weighted connections. When an individual produces or hears word X immediately followed by word Y, a connection is formed between those two words (or nodes). Then, on subsequent occasions when that individual produces or hears word X, word Y will be mentally activated, along with any other words that have previously followed word X in that individual's language experience. Mental activation is defined as a "state of memory traces that determines both the speed and the probability of access to a memory trace" (Anderson, 2005, p. 455). Through repeated exposure to the same sequence X-then-Y, the connection between word X and word Y is strengthened for that individual. This means that there is an increased probability that word Y will occur after word X in that individual's language production, as well as an increased expectation that word Y will follow word X when that individual is comprehending language. Thus the weightings between nodes are essentially a type of transition probability, whereby the probability of a particular word occurring after the preceding word depends upon the strength of the connection between those words. This connection strength is determined by the individual's level of prior exposure to the collocation.

This network model of language processing is intuitively plausible, as it is in line with what is known about how neurons work in the brain; through the transmission of electrochemical signals, neurons interact and form networks that are capable of representing knowledge (Rosenweig, Leiman, & Breedlove, 1996, p. 35; Sternberg, 2009, p. 35). It is these networks of neurons which must constitute the ultimate basis of linguistic knowledge. That said, if we assume that each node in the language network consists of a word (or a construction), then a node contains much more information than can be represented in a single neuron. Therefore, neurologically, the nodes themselves would be composed of some sort of network representation of the word or construction that they represent.

The network model of language processing is most explicit in Hoey's (2005) theory of Lexical Priming. Hoey (2005, pp. 7–8) explicitly links the notion of collocation to psychology, arguing that "collocation is fundamentally a psychological concept" and that any word that frequently follows another word is mentally activated whenever the first word is spoken or heard. Furthermore, Hoey (2005, p. 8) states that "[a]s a word is acquired through encounters with it in speech and writing, it becomes cumulatively loaded with the contexts and co-texts in which it is encountered". In other words, through repeated exposure to the same words
in collocational patterns, the connection between the nodes representing those words is strengthened.

The network model is also explicit in construction grammar. Constructions are defined as syntactic frames that "specify, not only syntactic, but also lexical, semantic, and pragmatic information" (Fillmore, Kay, & O'Connor, 2003, p. 243). Constructions exist in a network in which, each time a construction is heard or produced, a pattern of nodes is activated. If the nodes are activated frequently enough, the construction becomes entrenched in the mind as a holistically stored processing unit (Langacker, 1987, pp. 59–60). A network relationship also exists between an abstract construction and an instantiation of that construction, i.e. a collocation (Torres Cacoullous & Walker, 2011, p. 236). Furthermore, the network is described as a "structured inventory of constructions" because it contains core constructions which connect to many other constructions as well as peripheral constructions with fewer connections (Tomasello, 2003, p. 6). In this way, "[t]he collection of constructions . . . constitute a highly structured lattice of inter-related information" (Goldberg, 1995), similar to the pattern of interconnected neurons in the brain.

The network model of language processing is not explicit in Wray's (2002) model of formulaic language. Wray (2002) proposes the term formulaic sequence, which she defines as:

    a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar. (Wray, 2002, p. 9)

Wray (2002, p. 270) states that she is not concerned with the mental representation of formulaic sequences because "it is rarely addressed in other models". However, to some extent, the network model is present implicitly in Wray's understanding of formulaicity, because she argues that frequently encountered sequences are accessed more quickly than sequences that are encountered less frequently (Wray, 2002, p. 268). Mental activation is defined as a "state of memory traces that determines both the speed and the probability of access to a memory trace [emphasis added]" (Anderson, 2005, p. 455). Thus, Wray's reference to speed of access is equivalent to the notion of node activation in the network model of language processing.

The network model is also not explicit in Sinclairian theory. Sinclair (1991, pp. 53, 71) proposes the Idiom Principle, i.e. the idea that "a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments". This Idiom Principle is supported by the Open-Choice Principle, whereby any word can be chosen
to occupy a slot, providing that the result is grammatical (Sinclair, 1991, pp. 109–110), the latter being the traditional approach in linguistics to discussing the interaction of lexicon and grammar. Stubbs (1993, p. 21) observes that "Sinclair's work is strangely detached from any psychological or social theory". However, although Sinclair does not explicitly discuss any psychological basis for his theory, his work seems to carry the underlying assumption that semi-preconstructed phrases form a network with weighted connections among nodes. For instance, Sinclair (2004, p. 161) states that a "word becomes associated with a meaning through its repeated occurrence in similar contexts". This suggests that the mind is sensitive to frequency information in the input and that the connections between nodes are strengthened through repeated exposure. Thus, although Sinclair does not explicitly discuss collocations as forming a psychological network, the assumption is there implicitly. Indeed, Stubbs (1993, p. 21) argues that Sinclair's "detailed observations about the co-selection of lexical items are ripe for input into psycholinguistic theory".

Overall, then, the network assumption is prevalent across different approaches to collocation: explicitly or implicitly, collocations are considered to be represented in the mind or brain as transitions across a network, where that network constitutes part or all of an individual's knowledge of language. The prevalence of this assumption, as well as the intuitive plausibility of the network model as compatible with the known behaviour of the brain at the lowest level, that of interconnections among neurons, constitutes strong justification for conducting psycholinguistic experiments which use transition probabilities as a measure of collocation strength.

Collocation and Event-related Potential Components

One way of analysing ERP data is to look for specific ERP components. Kappenman and Luck (2011, p. 4) define an ERP component as "a scalp-recorded voltage change that reflects a specific neural or psychological process". Various ERP components have been identified, but the most widely studied language-related ERP component is the N400 (Swaab, Ledoux, Camblin, & Boudewyn, 2012, p. 399; Luck, 2014a, p. 102). The N400 is labelled as such because it is a negative shift in the ERP waveform occurring roughly 400 ms after stimulus onset; it is typically described as having a central-posterior scalp distribution (Swaab et al., 2012, p. 399; distribution in this sense is a characterisation of which electrodes detect the voltage shift).

Although some N400 activity is elicited in response to encountering any content word (Kumar & Debruille, 2004, p. 89; Luck, 2014a, p. 102), a much larger N400 is elicited in response to reading a semantic error (Kutas & Hillyard, 1980, p. 204). Moreover, in their seminal article on
the N400, Kutas and Hillyard (1980, p. 203) demonstrate that the amplitude of the N400 is directly proportional to the magnitude of the semantic error. For instance, the amplitude of the N400 is larger in response to reading the last word of the sentence He took a sip from the transmitter, where the semantic incongruity is considered "strong", than in response to reading the last word of the sentence He took a sip from the waterfall, where the semantic incongruity is considered "moderate" (as a waterfall is still a source of water, albeit an unusual one to drink from).

To date, there exists just one N400 study which focuses explicitly on collocation. In this study, by Siyanova-Chanturia et al. (2017), collocations are operationalized as binomial expressions (that is, fixed phrases of two content words with an intervening conjunction) such as knife and fork; and matched non-collocations are created by replacing the first word of the binomial with a semantically congruous (spoon and fork) or incongruous (theme and fork) word. The results show that the N400 amplitude is largest in response to reading semantically incongruous non-collocations (theme and fork), smallest in response to reading collocations (knife and fork), and of intermediate size in response to reading semantically congruous non-collocations (spoon and fork). Furthermore, the difference in N400 amplitude between the collocational and non-collocational conditions is no longer apparent when the binomials are presented without the conjunction. Thus, this suggests that it is the (non-)collocational status of the binomials, rather than any properties of the individual words, which causes the difference in N400 amplitude.

Other ERP studies which explicitly investigate collocational processing focus on ERP components other than the N400. For instance, Molinaro and Carreiras (2010, p. 176) and Vespignani, Canal, Molinaro, Fonda, and Cacciari (2010, pp. 1695–1696) find that reading an unexpected ending to an otherwise collocational expression elicits a P300, i.e. a positive shift in the ERP waveform that peaks roughly 300 ms after stimulus onset and is largest over temporal-parietal scalp regions (Polich, 2007, p. 2128). In addition, Vespignani et al. (2010, pp. 1695–1695), Tremblay and Baayen (2010, p. 168), and Molinaro, Barraza, and Carreiras (2013, pp. 127, 129) find that the predictability of a sequence modulates a very early component known as the N1. However, the N1 results are not consistent across these studies, perhaps because voltage shifts which occur very soon after stimulus onset (sensory or exogenous components) are often associated with the physical properties of the stimulus rather than a true experimental effect (Luck, 2014a, p. 33). We therefore focus on the N400, which, as a later component (a cognitive or endogenous component), can more reliably be taken to reflect cognition (Sur & Sinha, 2009, p. 70).

Although the aforementioned studies explicitly focus on collocation, they conceptualize or operationalize collocation somewhat differently from the present study. Siyanova-Chanturia et al. (2017) use binomial
expressions; Molinaro and Carreiras (2010, pp. 179–180) state that their stimuli consist of "idioms or clichés"; Molinaro et al. (2013, p. 124) use the same stimuli as Molinaro and Carreiras (2010), but this time refer to them as "multi-word expressions". Idioms, clichés, and multi-word expressions are particular types of collocations that tend to be formally fixed, to be semantically non-compositional, and to possess clear beginning and end points. By contrast, in this study we adopt a more fluid conceptualization of collocation, whereby combinations of words can be considered collocational even when lacking (some of) the aforementioned properties. Specifically, the collocational bigrams that will be analysed are typically transparently compositional in meaning and consist of word pairs where each word also appears with other words, so that the bigram link is probabilistic rather than total.

The previous N400 studies that explicitly address collocation have not adopted this probabilistic approach. But this approach is taken by a handful of N400 studies which seem to deal with collocation even though they do not use that terminology. For example, Rhodes and Donaldson (2008, p. 52) use the term associative to describe the relationship between isolated words in pairs such as traffic–jam and fountain–pen. The notion of association has been variably defined in the psychology literature, often in terms of stimulus-response relationships between words/concepts (Postman & Keppel, 1970). However, some psychologists use the term associative priming for the link between pairs of words that frequently co-occur. For instance, Molinaro et al. (2013, p. 122) define "associative relations" as "links between items [which] are developed because their lexical items frequently co-occur in language". Since the associative word pairs in Rhodes and Donaldson's (2008, p. 52) study exist in a syntagmatic, or linear, relationship, they are essentially strong collocations. Thus, the present study stands within an existing tradition of work in psychology, despite conceptual and terminological differences.

Method

Design of Experimental Stimuli

Selection of Bigrams

We extracted collocational bigrams from the written section of the BNC1994 using BNCweb (Hoffman & Evert, 2006; Hoffman et al. 2008). We chose to use the BNC1994 because its 90 million words of written English are drawn from a variety of text types and it therefore can be treated as fairly representative of written British English (Aston & Burnard, 1998, pp. 5, 31–32). As the experimental technique involves reading as opposed to listening, only the written part of the BNC1994
was used; the collocations found in writing are "fundamentally different" from those found in speech (Biber, 2009, p. 299).

The first step in extracting bigrams from BNCweb was to use part-of-speech tags to search for bigrams of particular grammatical categories. The grammatical form of a bigram is, of course, something which may well affect its processing as apparent in ERPs; as such it must be controlled within experiments. The simplest way to do this is to look at bigrams of one single grammatical form. Thus, the experiments here are based solely on adjective-noun bigrams. These were extracted using the following CEQL search term: _{ADJ} _{N}. This locates any sequence of two word tokens whose simple part-of-speech tags are ADJ (any kind of adjective) and N (any kind of noun). The next step was to thin (i.e. reduce) the query results from 6,270,939 hits to 250,000 hits (a limitation of the BNCweb software is that certain secondary analysis functions cannot run on more than 250,000 concordance lines at once). We then used the frequency breakdown function to count the different bigram types within these 250,000 random instances. The result was a downloadable frequency list of a very large sample of adjective-noun bigrams from the written BNC1994.

The collocational bigrams in the experiments described ahead were selected from this list. A corresponding non-collocational bigram was devised for each, incorporating the same adjective but a different noun, chosen so as to make the bigram semantically plausible. The non-collocational bigrams were manufactured in such a way as to ensure that they did not occur in the BNC1994 at all; that is, the absence of a particular bigram from the BNC1994 was treated as an indication that that combination of adjective and noun do not form a collocation, or to put it another way that the probability of a transition from the adjective in question to the noun in the manufactured bigram was zero (or as close to zero as is possible to estimate on the basis of the BNC1994).

As far as possible, the nouns in the collocational and non-collocational bigrams were matched for frequency and word length (see Appendix A); but when matching for both length and frequency proved impossible, we prioritized frequency over length. While there is psycholinguistic evidence for a quantitative difference in the processing of words with different frequencies, "a length effect independent of word frequency has been hard to find" (Conklin & Schmitt, 2008, p. 79); thus, it was more crucial to exclude any effect of frequency from interfering with the results.

We also tried to ensure that we did not use any bigram from within a longer n-gram that forms a stronger collocational unit than the bigram itself. For example, the bigram regular basis is part of the longer collocational 4-gram on a regular basis. Embedding regular basis into a sentence without preceding it with on a would create a stimulus likely not to elicit collocational processing, since the bigram would be out of place. That is, the brain response to the noun stimulus could well reflect the
unexpected presence of the whole bigram instead of the expected transition from the adjective to the noun. While regular basis was excluded on this criterion (as all but one of the 436 examples of regular basis in the BNC are within the pattern on a [optional adjective] regular basis), the same exclusion was not applied to similar bigrams, such as foreseeable future, which appear as part of more than one different longer n-gram (for instance, foreseeable future is, like regular basis, nearly always part of a preposition phrase, but with six different prepositions, two of which, in and for, are very frequent). This methodological decision is of course debatable; establishing what does and does not belong within the boundaries of one particular collocation is itself a theoretically fraught question, albeit not one within the scope of this research.

The 30 bigrams used in Experiment 1 are shown in Table 8.1, along with the transition probability (TP) of the collocational bigram within each pair. As noted in the Introduction, transition probability is calculated by dividing the number of times the bigram X-then-Y occurs in the written BNC1994 by the number of times X occurs in the written BNC1994 altogether (McEnery & Hardie, 2012, p. 195). Clearly, what might be called the "real-world" probability of any given word following any other given word in the language production of a given speaker is an individualised psychological fact regarding that particular speaker, arising from the totality of their individual linguistic experience. A particular bigram's transition probability must thus vary, not only across different speakers but also across different contexts of production (cf. Hoey, 2005, p. 178).
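Outside BNCweb, the extraction and frequency-breakdown steps described in this section can be approximated over any POS-tagged corpus. The sketch below uses a toy tag set and toy data; it is not the BNC tagset or the CEQL query mechanism.

```python
from collections import Counter

# Toy POS-tagged text: (word, simplified_tag) pairs, standing in for a
# corpus query such as _{ADJ} _{N}. Tags and data are illustrative only.
tagged = [("the", "ART"), ("foreseeable", "ADJ"), ("future", "N"),
          ("of", "PREP"), ("clinical", "ADJ"), ("trials", "N"),
          ("in", "PREP"), ("clinical", "ADJ"), ("trials", "N")]

# Count adjacent adjective-noun pairs to build a bigram frequency list.
adj_noun = Counter(
    (w1.lower(), w2.lower())
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if t1 == "ADJ" and t2 == "N"
)

# Frequency breakdown of adjective-noun bigram types, most frequent first.
for (adj, noun), freq in adj_noun.most_common():
    print(adj, noun, freq)
```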

Table 8.1  Bigrams for Experiment 1

Condition 1: Collocational Bigrams (TP)    Condition 2: Non-collocational Bigrams
nineteenth century (0.855)                 nineteenth position
endangered species (0.682)                 endangered fish
foreseeable future (0.678)                 foreseeable weeks
wide range (0.28)                          wide city
minimum wage (0.246)                       minimum prize
vast majority (0.183)                      vast opportunity
head teacher (0.162)                       head character
classic example (0.138)                    classic company
profound effect (0.09)                     profound person
random sample (0.057)                      random content
clinical trials (0.048)                    clinical devices
chief executives (0.043)                   chief definitions
key issues (0.028)                         key costs
massive increase (0.022)                   massive statement
crucial point (0.017)                      crucial night

Table 8.2  Bigrams for Experiment 2

TP Band    Collocational Bigrams (exact TP)    Non-collocational Bigrams
0.8 ≤ b