A Practical Handbook of Corpus Linguistics (ISBN 9783030462154, 9783030462161)


English, 686 pages [671], 2020




Table of contents:
Part I Corpus Design
1 Corpus Compilation
1.1 Introduction
1.2 Fundamentals
1.2.1 Representativeness
1.2.2 Issues in Collecting Data for the Corpus
1.2.3 Ethical Considerations
1.2.4 Documenting What Is in the Corpus
1.2.5 Formatting and Enriching the Corpus
1.2.6 Sharing the Corpus
1.2.7 Corpus Comparison
1.3 Critical Assessment and Future Directions
Further Reading
2 Corpus Annotation
2.1 Introduction
2.2 Fundamentals
2.2.1 Part-of-Speech Tagging
2.2.2 Lemmatization
2.2.3 Syntactic Parsing
2.2.4 Semantic Annotation
2.2.5 Annotation Accuracy
2.2.6 Practicalities of Annotation
2.3 Critical Assessment and Future Directions
2.4 Tools and Resources
Further Reading
3 Corpus Architecture
3.1 Introduction
3.2 Fundamentals
3.2.1 Corpus Macro-structure
3.2.2 Primary Data and Text Representation
3.2.3 Data Models for Document Annotations
3.3 Critical Assessment and Future Directions
3.4 Tools and Resources
Further Reading
Part II Corpus methods
4 Analysing Frequency Lists
4.1 Introduction
4.2 Fundamentals
4.2.1 Zipf's Law
4.2.2 Unit of Analysis
4.2.3 Beyond Raw Frequency
Normalising Frequency Counts
Range and Dispersion
4.3 Critical Assessment and Future Directions
4.3.1 Dealing with Homoforms and Multi-word Units
4.3.2 Application of Dispersion (and other) Statistics
4.3.3 Addressing Reliability in the Validation of Frequency Lists
4.4 Tools and Resources
Further Reading
5 Analyzing Dispersion
5.1 Introduction
5.2 Fundamentals
5.2.1 An Overview of Measures of Dispersion
5.2.2 Areas of Application and Validation
5.3 Critical Assessment and Future Directions
5.4 Tools and Resources
Further Reading
6 Analysing Keyword Lists
6.1 Introduction
6.2 Fundamentals
6.3 Critical Assessment and Future Directions
6.3.1 Corpus Preparation
6.3.2 Focus on Differences
6.3.3 Applications of Statistics
6.3.4 Clusters and N-Grams
6.3.5 Future Directions
6.4 Tools and Resources
6.4.1 Tools
6.4.2 Resources (Word Lists)
Further Reading
7 Analyzing Co-occurrence Data
7.1 Introduction
7.1.1 General Introduction
7.2 Fundamentals
7.3 Critical Assessment and Future Directions
7.3.1 Unifying the Most Widely-Used AMs
7.3.2 Additional (Different) Ways to Quantify Basic Co-occurrence
7.3.3 Additional Information to Include
7.4 Tools and Resources
Further Reading
8 Analyzing Concordances
8.1 Introduction
8.2 Fundamentals
8.2.1 Sorting and Pruning Concordances
8.2.2 Qualitative Analysis of Concordance Lines
8.2.3 Quantitative Analysis of Concordance Lines
8.2.4 Pedagogical Applications of Concordance Lines
8.3 Critical Assessment and Future Directions
8.4 Tools and Resources
Further Reading
9 Programming for Corpus Linguistics
9.1 Introduction
9.2 Fundamentals
9.2.1 The Basic Building Blocks of Software Programs
9.2.2 Choosing a Suitable Language for Programming in Corpus Linguistics
9.3 First Steps in Programming
9.3.1 Case Study 1: Simple Scripts to Load, Clean, and Process Large Batches of Text Data
Loading a Corpus File and Showing Its Contents
Loading a Corpus File, Cleaning It, and Showing Its Contents
Loading a Web Page, Cleaning It, and Showing Its Contents
Loading an Entire Corpus and Showing Its Contents
9.3.2 Case Study 2: Scripting the Core Functions of Corpus Analysis Toolkits
Creating a Word-Type Frequency List for an Entire Corpus
Creating a Key-Word-In-Context (KWIC) Concordancer
Creating a “MyConc” Object-Oriented Corpus Analysis Toolkit
9.4 Critical Assessment and Future Directions
9.5 Tools and Resources
Further Reading
Part III Corpus types
10 Diachronic Corpora
10.1 Introduction
10.2 Fundamentals
10.2.1 Issues and Challenges of Diachronic Corpus Compilation
Identifying the Lectal and Diatypic Properties of Texts
Redressing Historical Bias
Diachronic Comparability
10.2.2 Issues and Challenges of Text-Internal Annotation
10.2.3 Issues and Challenges Specific to the Analysis of Diachronic Corpora
10.3 Critical Assessment and Future Directions
10.4 Tools and Resources
Further Reading
11 Spoken Corpora
11.1 Introduction
11.2 Fundamentals
11.2.1 Raw Data and Different Types of Spoken Corpora
11.2.2 Corpus Annotation
Orthographic Transcription
POS-Tagging and Lemmatisation
Parsing
Phonemic and Phonetic Transcription
Prosodic Transcription
Multi-layered and Time-Aligned Annotation
11.2.3 Data Format and Metadata
11.2.4 Corpus Search
11.3 Critical Assessment and Future Directions
11.4 Tools and Resources
Further Reading
12 Parallel Corpora
12.1 Introduction
12.2 Fundamentals
12.2.1 Types of Parallel Corpora
12.2.2 Main Characteristics of Parallel Corpora
12.2.3 Methods of Analysis in Cross-Linguistic Research
12.2.4 Issues and Methodological Challenges
Issues and Challenges Specific to the Design of Parallel Corpora
Issues and Challenges Specific to the Analysis of Parallel Corpora
12.3 Critical Assessment and Future Directions
12.4 Tools and Resources
12.4.1 Query Tools
12.4.2 Resources
12.4.3 Surveys of Available Parallel Corpora
Further Reading
13 Learner Corpora
13.1 Introduction
13.2 Fundamentals
13.2.1 Types of Learner Corpora
13.2.2 Metadata
13.2.3 Annotation
13.2.4 Methods of Analysis
13.3 Critical Assessment and Future Directions
13.4 Tools and Resources
Further Reading
14 Child-Language Corpora
14.1 Introduction
14.2 Fundamentals
14.2.1 Recording and Contextual Setting
14.2.2 Subject Sampling
14.2.3 Size of Corpora and Recording Intervals
14.2.4 Transcription
14.2.5 Metadata
14.2.6 Further Annotations
14.2.7 Ethical Considerations
14.3 Critical Assessment and Future Directions
14.4 Tools and Resources
Further Reading
15 Web Corpora
15.1 Introduction
15.2 Fundamentals
15.2.1 Web as Corpus
15.2.2 Web for Corpus
15.3 Critical Assessment and Future Directions
15.4 Tools and Resources
15.4.1 Web Corpora
15.4.2 Crawling and Text Processing
Further Reading
16 Multimodal Corpora
16.1 Introduction
16.2 Fundamentals
16.2.1 Defining Multimodality and Multimodal Corpora
16.2.2 Multimodality Research in Linguistics
16.2.3 Issues and Methodological Challenges
16.3 Critical Assessment and Future Directions
16.4 Tools and Resources
Further Reading
Part IV Exploring Your Data
17 Descriptive Statistics and Visualization with R
17.1 Introduction
17.2 An Introduction to R and RStudio
17.2.1 Installing R and RStudio
17.2.2 Getting Started with R
Writing and Running Code
Installing and Loading Packages
17.3 Data Handling in R
17.3.1 Preparing the Data
17.3.2
17.3.3 Managing and Saving Data
17.4 Descriptive Statistics
17.4.1 Measures of Central Tendency
17.4.2 Measures of Dispersion
17.4.3 Coefficients of Correlation
17.5 Data Visualization
17.5.1 Barplots
17.5.2 Mosaic Plots
17.5.3 Histograms
17.5.4 Ecdf Plots
17.5.5 Boxplots
17.6 Conclusion
Further Reading
18 Cluster Analysis
18.1 Introduction
18.2 Fundamentals
18.2.1 Motivation
18.2.2 Data
18.2.3 Clustering
Cluster Definition
Proximity in Vector Space
Clustering Methods
Advanced Topics
18.3 Practical Guide with R
18.3.1 K-means
18.3.2 Hierarchical Clustering
18.3.3 Reporting Results
Further Reading
19 Multivariate Exploratory Approaches
19.1 Introduction
19.2 Fundamentals
19.2.1 Commonalities
19.2.2 Differences
19.2.3 Exploring is not Predicting
19.2.4 Correspondence Analysis
19.2.5 Multiple Correspondence Analysis
19.2.6 Principal Component Analysis
19.2.7 Exploratory Factor Analysis
19.3 Practical Guide with R
19.3.1 Correspondence Analysis
19.3.2 Multiple Correspondence Analysis
19.3.3 Principal Component Analysis
19.3.4 Exploratory Factor Analysis
19.3.5 Reporting Results
Further Reading
Part V Hypothesis-Testing
20 Classical Monofactorial (Parametric and Non-parametric) Tests
20.1 Introduction
20.2 Fundamentals
20.2.1 Null-Hypothesis Significance Testing (NHST) Paradigm
20.2.2 Statistical Tests and Their Assumptions
Chi-Squared Test
T-test
ANOVA
Mann-Whitney U Test
Kruskal-Wallis Test
Pearson's Correlation
Non-parametric Correlation Tests
20.2.3 Effect Sizes and Confidence Intervals
20.3 Practical Guide with R
20.3.1 Chi-Squared Test
20.3.2 T-test
20.3.3 Cohen's d with 95% Confidence Intervals – To Be Computed with T-test
20.3.4 ANOVA
20.3.5 Post-hoc T-Test with Correction for Multiple Testing
20.3.6 Mann-Whitney U Test
20.3.7 Kruskal-Wallis Test
20.3.8 Pearson's and Spearman's Correlations
Further Reading
21 Fixed-Effects Regression Modeling
21.1 Introduction
21.2 Fundamentals
21.2.1 (Multiple) Linear Regression
An Example of (Multiple) Linear Regression
Assumptions of Linear Regression
21.2.2 Binary Logistic Regression
An Example of Binary Logistic Regression
Assumptions of Binary Logistic Regression
21.2.3
21.3 Practical Guide with R
21.3.1 Multiple Linear Regression
Creating an Artificial Dataset for a Multiple Linear Regression
Running a Multiple Linear Regression
What Happens If We Make the Effect Sizes Smaller?
What Happens If We Make the Effects “Noisier”?
Manufacturing an Interaction Effect
21.3.2 Binary Logistic Regression
Creating an Artificial Dataset for a Binary Logistic Regression
Running a Binary Logistic Regression
Visualizing the Effects of a Binomial Logistic Regression
Manufacturing and Visualizing an Interaction Effect
21.3.3 Reporting the Results of Regression Analyses
Further Reading
22 Mixed-Effects Regression Modeling
22.1 Introduction
22.2 Fundamentals
22.2.1 When Are Random Effects Useful?
Crossed and Nested Effects
Hierarchical/Multilevel Modeling
Random Slopes as Interactions
22.2.2 Model Specification and Modeling Assumptions
Simple Random Intercepts
Choosing Between Random and Fixed Effects
Model Quality
More Complex Models
Representative Study 1
Representative Study 2
22.3 Practical Guide with R
22.3.1 Specifying Models Using lme4 in R
Overview of the Data Set
A Simple Varying Intercept Instead of a Fixed Effect
More Complex Models
Further Reading
23 Generalized Additive Mixed Models
23.1 Introduction
23.2 Fundamentals
23.2.1 The Generalized Linear Model
23.2.2 The Generalized Additive Model
23.3 Practical Guide with R
23.3.1 A Main-Effects Model
23.3.2 A Model with Interactions
23.3.3 Random Effects in GAMs
23.3.4 Extensions of GAMs
Further Reading
24 Bootstrapping Techniques
24.1 Introduction
24.2 Fundamentals of Bootstrapping
24.2.1 Objectives and Methods
24.2.2 Applications of Bootstrapping in Corpus Linguistics
Estimating Sampling Distributions
Measuring Corpus Homogeneity
Validating Statistical Models
Random Forest Analysis
Additional Applications of Bootstrapping
24.3 Practical Guide with R
Further Reading
25 Conditional Inference Trees and Random Forests
25.1 Introduction
25.2 Fundamentals
25.2.1 Types of Data
25.2.2 The Assumptions
25.2.3 Research Questions
25.2.4 The Algorithms
The CIT Algorithm
The CRF Algorithm
25.2.5 CITs and CRFs Compared with Other Recursive Partitioning Methods
25.2.6 Situations When the Use of CITs and CRFs May Be Problematic
25.3 A Practical Guide with R
25.3.1 T/V Forms in Russian: Theoretical Background and Research Question
25.3.2 Data: Film Subtitles
25.3.3 Variables
25.3.4 Software
25.3.5 Conditional Inference Tree
25.3.6 Conditional Random Forest
25.3.7 Interpretation of the Predictor Effects: Partial Dependence Plots
25.3.8 Conclusions and Recommendations for Reporting the Results
Further Reading
Part VI Pulling Everything Together
26 Writing up a Corpus-Linguistic Paper
26.1 The Structure of an Empirical Paper
26.2 The 'Methods' Section
26.3 The 'Results' Section
26.4 Concluding Remarks
27 Meta-analyzing Corpus Linguistic Research
27.1 Introduction
27.2 Fundamentals
27.3 A Practical Guide to Meta-analysis with R
27.3.1 Defining the Domain and Searching for Primary Literature
27.3.2 Developing and Implementing a Coding Scheme
27.3.3 Aggregating Effect Sizes
27.3.4 Aggregating Effects and Interpreting Results
27.4 Critical Assessment and Future Directions
27.5 Conclusion
27.6 Tools and Resources
Further Reading


Magali Paquot • Stefan Th. Gries (Editors)

A Practical Handbook of Corpus Linguistics


Editors

Magali Paquot
FNRS Centre for English Corpus Linguistics, Language and Communication Institute, UCLouvain, Louvain-la-Neuve, Belgium

Stefan Th. Gries
Department of Linguistics, University of California, Santa Barbara, CA, USA
Justus Liebig University Giessen, Giessen, Germany

ISBN 978-3-030-46215-4    ISBN 978-3-030-46216-1 (eBook)
https://doi.org/10.1007/978-3-030-46216-1

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.


Corpus linguistics is “a whole system of methods and principles” (McEnery et al. 2006: 7f) that can be applied to answer research questions related to language use and variation in a wide variety of domains of linguistic enquiry and beyond. Over the last decades, it has been “among the fastest-growing methodological disciplines in linguistics” (Gries 2015: 93) and is now also developing as a key methodology in the humanities and social sciences.

The tasks of corpus linguists are manifold and complex. They can be grouped into minimally three different, though of course interrelated, areas:

• Corpus design, which requires knowledge about corpus compilation (e.g., the notions of sampling and/or representativeness), data processing for corpus annotation (e.g., tagging, lemmatizing, parsing), and corpus architecture (e.g., representing corpus data in a maximally useful way); then, once there is a corpus,
• Corpus searching/processing, which requires knowledge of, ideally, general data processing (e.g., file management, dealing with different annotation formats, using regular expressions to define character strings that lead to good search results) as well as corpus query tools and methods (from off-the-shelf tools to programming) to address the specificities of various data types (e.g., time alignment of spoken and multimodal corpora or bitext alignment of parallel corpora); then, once there are results from a corpus,
• Statistical analysis to get the most out of the corpus data, which requires knowledge of statistical data wrangling/processing (e.g., establishing subgroups in data or determining whether transformations are necessary), statistical analysis techniques (e.g., significance testing or alternative approaches, regression modeling, or exploratory data analysis), and visualization (e.g., representing the results of complex statistical analysis in ways non-quantitatively minded readers can understand).
This handbook aims to address all these areas with contributions by many of their leading experts, to be a comprehensive practical resource for junior and more senior corpus linguists, and to represent the whole research cycle from corpus creation, methods, and analyses to reporting results for publication. It is divided into six parts.

In Part I, the first three chapters focus on corpus design and address issues related to corpus compilation, corpus annotation, and corpus architecture. Part II deals with corpus methods: Chapters 4–9 provide an overview of the most commonly used methods to extract linguistic and frequency information from corpora (frequency lists, keyword lists, dispersion measures, co-occurrence frequencies, and concordances) as well as an introduction to the added value of programming skills in corpus linguistics. Chapters 10–16 in Part III review different corpus types (diachronic corpora, spoken corpora, parallel corpora, learner corpora, child-language corpora, web corpora, and multimodal corpora), with each chapter focusing on the specific methodological challenges associated with the analysis of that corpus type.

Parts IV–VI aim to offer a user-friendly introduction to the variety of statistical techniques that have been used, or have started to be used, more extensively in corpus linguistics. As each chapter in Parts IV–VI uses R for explaining and exemplifying the statistics, Part IV starts with an introductory chapter on how to use R for descriptive statistics and visualization. Chapters 18 and 19 focus on exploratory techniques, i.e., cluster analysis and the multidimensional exploratory approaches of correspondence analysis, multiple correspondence analysis, principal component analysis, and exploratory factor analysis. Part V focuses on hypothesis-testing (classical monofactorial tests, fixed-effects regression modeling, mixed-effects regression modeling, generalized additive mixed models, bootstrapping techniques, and conditional inference trees and random forests).
It is important to note that the chapters on mixed-effects regression modeling and generalized additive mixed models in particular are primarily meant to give readers a grasp of what these techniques are (more and more corpus-linguistic papers rely on such methods, and it is important for corpus linguists to understand the current literature). However, a single chapter cannot, of course, provide all that is required to get started with statistics of such a level of complexity. Readers who want to use these statistics for their own purposes will necessarily need to read more on the topic.

Part VI aims to pull everything together by providing guidelines for how to write a corpus-linguistic paper and how to meta-analyze corpus-linguistic research. Chapters in Parts IV and V as well as Chaps. 7, 9 and 27 come with additional online material (R code with datasets).

It is our hope that this handbook will serve to help students and colleagues expand their methodological toolbox. We certainly learned a lot while editing this volume!

Louvain-la-Neuve, Belgium    Magali Paquot
Santa Barbara, CA, USA       Stefan Th. Gries



References

Gries, S. T. (2015). Some current quantitative problems in corpus linguistics and a sketch of some solutions. Language and Linguistics, 16(1), 93–117.

McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London/New York: Routledge.


Part I Corpus Design
1 Corpus Compilation (Annelie Ädel)
2 Corpus Annotation (John Newman and Christopher Cox)
3 Corpus Architecture (Amir Zeldes)

Part II Corpus methods
4 Analysing Frequency Lists (Don Miller)
5 Analyzing Dispersion (Stefan Th. Gries)
6 Analysing Keyword Lists (Paul Rayson and Amanda Potts), p. 119
7 Analyzing Co-occurrence Data (Stefan Th. Gries and Philip Durrant), p. 141
8 Analyzing Concordances (Stefanie Wulff and Paul Baker), p. 161
9 Programming for Corpus Linguistics (Laurence Anthony), p. 181

Part III Corpus types
10 Diachronic Corpora (Kristin Davidse and Hendrik De Smet), p. 211
11 Spoken Corpora (Ulrike Gut), p. 235
12 Parallel Corpora (Marie-Aude Lefer), p. 257
13 Learner Corpora (Gaëtanelle Gilquin), p. 283
14 Child-Language Corpora (Sabine Stoll and Robert Schikowski), p. 305
15 Web Corpora (Andrew Kehoe), p. 329
16 Multimodal Corpora (Dawn Knight and Svenja Adolphs), p. 353

Part IV Exploring Your Data
17 Descriptive Statistics and Visualization with R (Magali Paquot and Tove Larsson), p. 375
18 Cluster Analysis (Hermann Moisl), p. 401
19 Multivariate Exploratory Approaches (Guillaume Desagulier), p. 435

Part V Hypothesis-Testing
20 Classical Monofactorial (Parametric and Non-parametric) Tests (Vaclav Brezina), p. 473
21 Fixed-Effects Regression Modeling (Martin Hilpert and Damián E. Blasi), p. 505
22 Mixed-Effects Regression Modeling (Roland Schäfer), p. 535
23 Generalized Additive Mixed Models (R. Harald Baayen and Maja Linke), p. 563
24 Bootstrapping Techniques (Jesse Egbert and Luke Plonsky), p. 593
25 Conditional Inference Trees and Random Forests (Natalia Levshina), p. 611

Part VI Pulling Everything Together
26 Writing up a Corpus-Linguistic Paper (Stefan Th. Gries and Magali Paquot), p. 647
27 Meta-analyzing Corpus Linguistic Research (Atsushi Mizumoto, Luke Plonsky, and Jesse Egbert), p. 661

Part I
Corpus Design

Chapter 1
Corpus Compilation

Annelie Ädel

Abstract This chapter deals with the fundamentals of corpus compilation, approached from a practical perspective. The topics covered follow the key phases of corpus compilation, starting with the initial considerations of representativeness and balance. Next, issues in collecting corpus data are covered, including ethics and metadata. Technical aspects involving formatting and annotation are then presented, followed by suggestions for sharing the corpus with others. Corpus comparison is also discussed, as it merits some reflection when a corpus is created. To further illustrate key concepts and exemplify the varying roles of the corpus in specific research projects, two sample studies are presented. The chapter closes with a brief consideration of future directions in corpus compilation, focusing on the importance of compensating for the inevitable loss of complex information and taking the increasingly multimodal nature of discourse as a case in point.

1.1 Introduction

Given that linguistics is descriptive at its core, many linguists study how language is used based on some linguistic sample. Finding the right material to use as the basis for a study is a key aspect of the research process: we are expected to use material that is appropriate for answering our research questions, and not make claims that go beyond what is supported by the material. This chapter covers the basics of compiling linguistic material in the form of a corpus. Corpus compilation involves “designing a corpus, collecting texts, encoding the corpus, assembling and storing the relevant metadata, marking up the texts where necessary and possibly adding linguistic annotation” (McEnery and Hardie 2012:241). In the process of putting together linguistic data in a corpus, researchers need to make a series of decisions at different steps. The process is described in a general way in this chapter, while more in-depth discussion relating to the compilation of specific types of corpora follows in Chaps. 10–16. Specifics on corpus annotation and corpus architecture follow in Chaps. 2 and 3, respectively.

A. Ädel
Dalarna University, Falun, Sweden
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_1

1.2 Fundamentals

1.2.1 Representativeness

The most basic question to consider when compiling a corpus involves representativeness: what type of speakers/variety/discourse is the corpus meant to represent? In many of the well-known corpora of English, the ambition has been to cover a general and very common type of discourse (such as ‘conversation in a variety of English’) or a very large population (such as ‘second-language learners of English’). However, such a comprehensive aim is beyond the scope of most researchers and should be reserved for large groups of researchers with plenty of resources at their disposal (see e.g. Aston and Burnard (1998) for discussions of how the British National Corpus was designed, or Johansson et al. (1978) on the Lancaster-Oslo/Bergen Corpus). In small-scale projects, the aims regarding representativeness need to be more modest by comparison, for example with a focus on a specialized type of discourse used by a relatively restricted group of speakers.

The general sense of the word ‘sample’ is simply a text or a text extract, but in its more specific and statistical sense it refers to “a group of cases taken from a population that will, hopefully, represent that population such that findings from the sample can be generalised to the population” (McEnery and Hardie 2012:250).1 The aim in compiling a corpus is that it should be a maximally representative—in practice, this translates into acceptably representative—sample of a population of language users, a language variety, or a type of discourse. In most linguistic studies, we have to make do with studying merely a sample of the language use, or variety, as a whole. It is only in rare cases, and when the research question is quite delimited, that it is possible to collect all of the linguistic production of the population or type of discourse we are interested in.
As an example, it may be possible for a researcher in Languages for Specific Purposes to retrieve all of the emails sent and received in a large company to use as a basis for studying the typical features of this specific type of communication in that company.

The corpus builder needs to consider very carefully how to collect samples that maximally represent the target discourse or population. One way of selecting material for a corpus is stratified sampling, where the hierarchical structure (or ‘strata’) of the population is determined in advance. For example, a researcher who is interested in spoken workplace discourse could document demographic information about speakers’ job titles and ages, and whether interactions involve peers or managers/subordinates, and then include in the corpus a predetermined proportion of texts from each category. In the detailed sampling process, it is decided exactly which texts or text chunks to include.

There is a range of possible considerations in deciding on sampling procedures for a corpus, one of which concerns the extent to which the overall design is organized around text production or text reception. For illustration, this is what the compilers of the British National Corpus (Aston and Burnard 1998:28) concluded with respect to the written part of the corpus:

    In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, best-seller lists and library circulation statistics were consulted to select particular examples of them.

1. Samples in the sense ‘text extracts’ are occasionally used in corpora to avoid having one type of text dominate, just because it happens to be long. There are many arguments for using complete texts, however. See e.g. Douglas (2003) and Sinclair (2005).
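The stratified sampling procedure described above can be sketched in a few lines of code. The sketch below is purely illustrative, not part of any actual corpus project: the strata, target proportions, and text labels are invented, and the book's own examples use R rather than Python. Each stratum simply receives a predetermined quota of randomly drawn texts.

```python
import random

# Hypothetical candidate texts for a workplace-discourse corpus; each stratum
# corresponds to one category of the predetermined hierarchical structure.
candidates = {
    "peer_to_peer":        [f"peer_text_{i}" for i in range(200)],
    "manager_subordinate": [f"hier_text_{i}" for i in range(120)],
    "institutional":       [f"inst_text_{i}" for i in range(80)],
}

# Predetermined proportions for each stratum (they should sum to 1.0).
proportions = {"peer_to_peer": 0.5, "manager_subordinate": 0.3, "institutional": 0.2}

def stratified_sample(candidates, proportions, corpus_size, seed=42):
    """Draw a fixed quota of texts from each stratum, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    corpus = {}
    for stratum, share in proportions.items():
        quota = round(corpus_size * share)
        corpus[stratum] = rng.sample(candidates[stratum], quota)
    return corpus

corpus = stratified_sample(candidates, proportions, corpus_size=100)
print({stratum: len(texts) for stratum, texts in corpus.items()})
# {'peer_to_peer': 50, 'manager_subordinate': 30, 'institutional': 20}
```

Fixing the random seed is worth doing in practice, because it makes the exact composition of the corpus reproducible and therefore documentable.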

A concept that is intrinsically related to representativeness is balance, which has to do with the proportions of the different samples included in the corpus. In a balanced corpus, “the relative sizes of each of [the subsections] have been chosen with the aim of adequately representing the range of language that exists in the population of texts being sampled” (McEnery and Hardie 2012:239). In the case of ‘conversation in a variety of English’, the researcher would need a principled way of deciding what proportions to include, for example, of conversations among friends versus among strangers, or unplanned versus preplanned conversations (an interview is an example of the latter), or conversations from institutional/public/private-personal settings, and so on. Such decisions could be based on some assessment of how commonly these different configurations occur or of their relative importance (however this may be defined). Balancing decisions could even be based on comparability with some other corpus: for example, in a diachronic corpus of English (cf. Chap. 10) fiction writing may be deliberately overrepresented and religious writing underrepresented in earlier periods to allow for easier comparison to present-day English. The notions of representativeness and balance are scalar and vague (see e.g. Leech 2007), so there are no hard and fast rules for achieving representativeness and balance in a corpus. The first step is to map out the available types of discourse, in order to find useful categorizations of the different ways of communicating used in the target community. The point that the most important consideration in corpus compilation is “a thorough definition of the target population” which is able to describe the “different situations, purposes, and functions of text in a speech community” was made by Biber (1993:244–245) in a classic piece on representativeness in corpus design. 
Added to this are “decisions concerning the method of sampling” (Biber 1993:244), as the next step is to find some principled way of representing these different ways of communicating. For some of the early standard corpora, this was done by drawing on classifications from library science, where there is a long tradition of cataloguing written publications. For example,


A. Ädel

a list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum was used as a sampling frame for the pioneering Brown corpus, aiming to represent written American English in general (published in 1961); see Francis and Kucera (1979).2 Using stratified random sampling, a one-million-word corpus was produced, consisting of 500 texts of 2,000 words each. However, if the available types of discourse are not already classified in some reliable way, as in the case of spoken language, it means that the corpus builder will have to dedicate a great deal of time to researching the characteristics of the target discourse in order to develop valid and acceptable selection criteria. Douglas (2003) describes this type of situation and includes a useful discussion about the collection of The Scottish Corpus of Texts and Speech.

With a definition of representativeness as the extent to which a corpus reflects “the full range of variability in a population” (Biber 1993:243), it has been suggested that representativeness can be assessed by the degree to which it captures not only the range of text types in a language (external criteria), but also the range of linguistic distributions in a language (internal criteria). Since different linguistic features—vocabulary, grammar, lexicogrammar—vary in frequency and are distributed differently “within texts, across texts, across text types” (ibid.), the corpus should make possible analysis of such distributions. In fact, Biber (1993) suggests a cyclical method for corpus compilation, including as key components theoretical analysis of relevant text types (which is always primary) and empirical investigation of the distributions of linguistic features. However, few corpus projects have attempted this. The literature on corpus design sometimes contrasts ‘principled’ ways of building a corpus to ‘opportunistic’ ones.
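Stratified random sampling of the kind used for Brown can be sketched in a few lines of code. The sampling frame below, with invented category labels and proportions, stands in for the library catalogue; this is an illustrative sketch, not the actual Brown procedure.

```python
import random

# Hypothetical sampling frame: (text category, document id) pairs,
# standing in for a library catalogue of 1,000 items.
frame = [("press", f"p{i}") for i in range(500)] + \
        [("fiction", f"f{i}") for i in range(300)] + \
        [("learned", f"l{i}") for i in range(200)]

def stratified_sample(frame, n_total, seed=42):
    """Draw a stratified random sample: each category contributes
    texts in proportion to its share of the sampling frame."""
    rng = random.Random(seed)
    by_cat = {}
    for cat, doc in frame:
        by_cat.setdefault(cat, []).append(doc)
    sample = {}
    for cat, docs in by_cat.items():
        k = round(n_total * len(docs) / len(frame))
        sample[cat] = rng.sample(docs, k)
    return sample

sample = stratified_sample(frame, n_total=100)
print({cat: len(docs) for cat, docs in sample.items()})
# {'press': 50, 'fiction': 30, 'learned': 20}
```

The fixed seed makes the selection reproducible, which supports the documentation and replicability requirements discussed below.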
An opportunistic corpus is said to “represent nothing more nor less than the data that it was possible to gather for a specific task”, with no attempts made “to adhere to a rigorous sampling frame” (McEnery and Hardie 2012:11). It is, however, very difficult not to include some element of opportunism in corpus design, as we do not have boundless resources. This is especially true of single-person MA or PhD projects, where time constraints may present a major issue. What is absolutely not negotiable, however, is that the criteria for selecting material for the corpus be clear, consistent and transparent. The criteria used when selecting material also need to be explicitly stated when reporting to others about a study—it is a basic principle in research and a matter of making it possible for others to replicate the study. The selection criteria are typically shaped by the specific research interests behind a given corpus project, which should also be spelled out in the documentation about the corpus.

2 Biber (1993:244) defines a sampling frame as “an operational definition of the population, an itemized listing of population members from which a representative sample can be chosen”.

1 Corpus Compilation


1.2.2 Issues in Collecting Data for the Corpus

Corpus compilation involves a series of practical considerations having to do with the question ‘Given the relative ease of access, how much data is it feasible to collect for the corpus?’. Indeed, this needs addressing before it is possible to determine fully the design of a corpus. Relevant spoken or written material may of course be found in many different places, and the effort required to collect it may vary considerably. Some types of discourse are meant to be widely distributed, and are even in the public domain, while others are relatively hidden, and are even confidential or secret. In an academic setting, for example, written course descriptions and spoken lectures target a large audience, while teacher feedback and committee discussions about the ranking of applicants for a post target a restricted audience.

Once the data have been collected, varying degrees of data management will be required depending on the nature and form of the data. If spoken material is to be included in the corpus, it needs to be transcribed, that is, rendered in written form to be searchable by computer. The transcription needs to be appropriately detailed for the research question (see Chap. 11 for key issues involved in compiling spoken corpora). If written material is to be included in the corpus, there are practical considerations regarding how it is encoded. For example, if it can be accessed as plain text files at the time of collection, it will save time. If it is only available on paper, it will need to be scanned using OCR (Optical Character Recognition) in order for the text to be retrieved. If it is only available on parchment, it will need very careful handling indeed by the historical corpus compiler, including manual typing and annotation to represent it.
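Scanned or converted text typically needs post-processing before it is usable for corpus work. The following is a minimal clean-up sketch, handling a few typical artifacts (form feeds at page breaks, words hyphenated at line ends, runs of spaces); real conversion output varies by tool and source document.

```python
import re

def clean_converted_text(raw):
    """Minimal cleanup for text extracted from a PDF or OCR (a sketch;
    the patterns below cover only a few common artifacts)."""
    text = raw.replace("\x0c", "\n")          # form feeds from page breaks
    text = re.sub(r"-\n(?=[a-z])", "", text)  # rejoin words hyphenated at line ends
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)    # collapse runs of blank lines
    return text.strip()

print(clean_converted_text("corpus compi-\nlation  is\x0c\n\n\n\nfun"))
```

Any such automatic clean-up should be spot-checked against the originals, since overly aggressive rules can silently alter the data.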
Even modern text files which are available in pdf format may not be retrievable as plain text at all, or it may be possible to convert the pdf to text, but only with a varying degree of added symbols and garbled text, requiring additional ‘cleaning’.3

3 There are tools that automatically convert pdf files to simple text, such as AntFileConverter: http://www.laurenceanthony.net/software/antfileconverter/. Accessed 24 May 2019.

Section 1.2.5 on Formatting the corpus discusses some of these issues more fully. Nowadays there are massive amounts of material on the web, which are already in an electronic format. As a consequence, it has become popular among corpus builders to include material from online sources (see Chap. 15), which represent a great variety of genres, ranging from research articles to blogs. It is important, however, to make the relevance of the material to the research question a priority over ease of access, and carefully consider questions such as “How do we allow for the unavoidable influence of practicalities such as the relative ease of acquiring public printed language, e-mails and web pages as compared to the labour and expense of recording and transcribing private conversations or acquiring and keying personal handwritten correspondence?” (Sinclair 2005). Even if material is available on the web, it does not necessarily mean that it is easy to access—at least not in the way texts need to be accessed for corpus work. Online newspapers are a case in point. While they often make it possible to search



the archive, they may not make the text files downloadable other than one by one by clicking a hyperlink. The work of clicking the link, copying and saving each individual article manually is then left to the user. This is no small task, and one that tends to be underestimated by beginner corpus compilers. Fortunately, there are ways of speeding up and automating the process in order to avoid too much manual work; Chap. 15 offers suggestions.

Corpus compilers who are able to collect relevant material in the public domain still need to check the accuracy and adequacy of the material. Consider the case of a research group seeking the answer to the question ‘To what extent is (a) the spoken dialogue in the fictional television series X (dis)similar to (b) authentic non-scripted conversation?’. They may go to the series’ website to search for material, following the logic that an official website is likely to be a more credible source for transcripts than a site created by anonymous fans. Before any material can be included in the corpus, however, each transcript needs to be checked against the recorded episode to ensure that the transcription is not only correct, but also sufficiently detailed for the specific research purposes. When collecting material from the web, there may also be copyright restrictions to take into account; see e.g. the section on Ethical considerations below and Section 3.2 in McEnery and Hardie (2012) on legal issues in collecting such data.

Beginner corpus researchers often find themselves confounded by the question ‘How much data do I need in order for my study to be valid?’. There is no rule of thumb for corpus size, except for the general principle ‘the more, the better’. That said, it requires more data to be able to make valid observations about a large group of people and a general type of discourse than a small group of people and a specific type of discourse.
It also requires more data to investigate rare rather than common linguistic features. Thus, the appropriate amount of data depends on the aim of the research. Each study, however, needs to be considered in its context. There are always going to be practical restrictions on how much time a given researcher is able to put into a project. Researchers who find themselves in a situation of not being able to collect as much data as planned will need to adjust their research questions accordingly. With less data—a smaller sample—the claims one is able to make based on one’s corpus findings will be more modest. But most importantly, as discussed above, the issue of representativeness needs to be addressed before a corpus, regardless of size, can be considered appropriate for a given study.
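The point about rare versus common features can be made concrete with a back-of-envelope calculation; the rates below are invented for illustration.

```python
def expected_hits(rate_per_million, corpus_size_words):
    """Expected number of hits for a feature occurring at a given
    rate per million words in a corpus of a given size."""
    return rate_per_million * corpus_size_words / 1_000_000

# A feature at ~5 hits per million words (rare) vs. ~2,000 (common):
print(expected_hits(5, 100_000))     # 0.5 -> a 100k-word corpus is far too small
print(expected_hits(2000, 100_000))  # 200.0 -> ample for the common feature
```

A corpus that yields fewer than a handful of expected hits for the target feature cannot support quantitative claims about it, which is one way of operationalizing the advice to adjust research questions to the available data.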

1.2.3 Ethical Considerations

Corpus compilation involves different types of ethical considerations depending on the type of data. For data in the public domain, such as published fiction or online newspaper text, it is not necessary to secure consent. However, such data may be protected by copyright. For data that is collected from scratch by the researcher, it is necessary to obtain the informants’ informed consent and it may be necessary to ask for institutional approval.



In the case of already published material, permission may be needed from a publisher or some other copyright holder. There are grey areas in copyright law and copyright infringement is looked at in different ways in different parts of the world, so it is difficult to find universally valid advice on the topic, but generally speaking copyright may prove quite a hindrance for corpus compilation. To a certain extent, restrictions on copyright may be alleviated through concepts such as ‘fair use’, as texts in a corpus are typically used for research or teaching purposes only, with no bearing on the market.4 However, copyright holders and judges are likely to distinguish between material that is used by a single researcher only and material that is distributed to other researchers, so it may matter whether or not the corpus is made available to the wider research community. In addition to the potential difference between data gathering for a single use versus data distribution for repeated use by many different people, copyright holders may be more likely to grant permission to use an extract rather than a complete text. In the case of collecting data from informants, approval may be needed from an institutional ethics review board before the project can begin. Even if institutional approval is not needed, consent needs to be sought from the informants in order to collect and use the data for research purposes. Asking for permission to use material for a corpus is often done by means of a consent form, which is signed by each informant, or by the legal guardians in the case of children (see Chap. 14). A consent form should clearly state what the data will be used for so that an informed decision can be made. It needs to be clear that the decision to give consent is completely voluntary. 
The wording of the consent form matters, so it is useful to consult forms used in similar corpus projects for comparison.5 If a participant does not give his or her consent, the data will have to be removed from the corpus. In the case of multi-party interactions, it may still be worth including the data if most participants have given their consent, while blanking out contributions from the non-consenting participant. See Crasborn (2010) for a problematized view of consent in connection with online publication of data. Once permission has been obtained to use data for a corpus, the informants’ integrity needs to be protected in different ways, such as by anonymizing the material. An initial step may be to not reveal the identity of the informants by not showing their real names, for example through ‘pseudonymisation’, whereby personal data is transformed in such a way that it cannot be attributed to a specific informant without the use of additional information, which is kept separately. A second step may be to manipulate the actual linguistic data (that is, what the

4 Fair use is measured through the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion taken, and the effect of the use on the potential market. For more information, see https://fairuse.stanford.edu/overview/fair-use/four-factors/. Accessed 24 May 2019.
5 For sample templates, see the forms from the Bavarian Archive for Speech Signals at http://www.phonetik.uni-muenchen.de/Bas/BasTemplateInformedConsent_en.pdf, or Newcastle University at https://www.ncl.ac.uk/media/wwwnclacuk/research/files/Example%20Consent%20Form.pdf. Accessed 29 May 2019.



people represented in the corpus said or wrote) by also changing names and places mentioned which could in some way give away the source. In the case of image data, this would involve masking participants’ identity in various ways. Confidential data needs to be stored in a safe way. Sensitive information may have to be destroyed if there is a risk that others may access information which informants have been promised will not be revealed. For further reading on ethical perspectives on data collection, see e.g. BAAL’s Recommendations on Good Practice in Applied Linguistics.6 Matters are complicated by the fact that regulations may differ by region. While ethics review boards have been in place for quite some time at universities in the United States, linguists in Europe have been relatively free to collect data. It is not clear, however, what long-term effects the General Data Protection Regulation (https://www.eugdpr.org/; effective as of 2018) will have for data collected in the European Union.

1.2.4 Documenting What Is in the Corpus

As language use is characterized by variability, factors which may have an impact on the way in which language is used should be recorded in some way—these may include demographic information about the speakers/writers, or situational information such as the purpose of the communication or the type of relationship between the discourse participants. Even if the corpus compilers are deeply familiar with the material, it is still the case that memory is both short and fallible, so if they want to use the corpus in a few years’ time, important details of the specific context of the data may well have been forgotten. In addition, if the corpus is made available to others, they need to know what is in it in order to make an informed decision about whether the design of the corpus is appropriate for answering their specific research questions. Anyone who wants claims based on a corpus to be accepted by the research community needs to show in some way that the corpus material is appropriate for the type of research done. With incomplete description of the corpus, people will be left wondering whether the material was in fact valid for the study.

There are several different ways in which information about the corpus design can be disseminated. It can be done through a research publication, such as a research article or an MA thesis, which includes a section or chapter describing the material (for more on this, see Chap. 26). Corpus descriptions are sometimes published in peer-reviewed journals, especially if the corpus is breaking new ground (as is the case in Representative Study 2 below), so that the research community can benefit from discussions on corpus design. It can also be done by writing a report solely dedicated to describing the corpus (and possibly how to use it), which is made

6 See https://baalweb.files.wordpress.com/2016/10/goodpractice_full_2016.pdf. Accessed 24 May 2019.



available either as a separate file stored together with the corpus itself, or online. Corpora often come with “read me” files where the corpus design is accounted for. Some large corpus projects intended to attract large numbers of users, such as the British National Corpus (BNC) and the Michigan Corpus of Academic Spoken English (MICASE), provide relatively detailed reports online.7 There are also published books which offer even more detailed documentation of corpora and recommendations for how to use them (e.g. Aston and Burnard’s (1998) The BNC Handbook and Simpson-Vlach and Leicher’s (2006) The MICASE Handbook). Another reason for documenting what is in the corpus is to enable researchers to draw on various variables in a systematic way when analyzing data from the corpus. As an example, see Chap. 8 and the subsection on quantitative analysis of concordance lines. In a study of that-complementation in English, for each hit in the corpus, the researchers considered external variables such as the L1 of the speaker who had produced the hit and whether the hit came from a written or spoken mode. Through the inclusion of ‘metadata’—data about the data—about the type of discourse represented in the corpus, the corpus user can keep track of or investigate different factors that may influence language use, which may explain differences observed in different types of data. Metadata can consist of different types of information. For example, the corpus compiler may include information based on interviews with participants or participant observation. A common way of collecting metadata is by asking corpus participants to fill out a questionnaire which has been carefully designed by the corpus compiler so as to include information likely to be relevant with respect to the specific context of the discourse included and the people represented. An example of metadata based on a questionnaire from the International Corpus of Learner English (ICLE) is summarized in Fig. 1.1. 
The ICLE is a large-scale project with collaborators from several different countries. (For more information on learner corpora, see Chap. 13.) The corpus includes metadata about the type of discourse included (written essays) and about the language users represented (university students), collected through a questionnaire called a ‘learner profile’, as the contributors are all learners of English. In a language-learning context, some of the variables likely to be relevant include what the learner’s first language is (2e), what the medium of instruction was in school (2i; 2j), how much exposure the learner has had to the second language—whether through instruction in a school context (2l) or through spending time in a context where the second language is spoken (2q). Based on metadata from the questionnaire, it is possible to select a subset of the ICLE corpus, for example to study systematically potential differences in language use between learners who have and who have not spent any time abroad in a country where the target language is spoken natively—and thus test a hypothesis from Second Language Acquisition research.
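Selecting such a subset amounts to filtering corpus files on their metadata. The sketch below uses invented file names and a drastically simplified profile (the real ICLE learner profile records far more variables):

```python
# Hypothetical metadata records keyed by corpus file name;
# the fields are simplified stand-ins for real learner-profile variables.
metadata = {
    "ICLE-GE-001.txt": {"L1": "German", "months_abroad": 0},
    "ICLE-GE-002.txt": {"L1": "German", "months_abroad": 9},
    "ICLE-SW-001.txt": {"L1": "Swedish", "months_abroad": 3},
}

def select_subcorpus(metadata, predicate):
    """Return the file names whose metadata satisfy a predicate."""
    return sorted(f for f, m in metadata.items() if predicate(m))

stayed_abroad = select_subcorpus(metadata, lambda m: m["months_abroad"] > 0)
print(stayed_abroad)  # ['ICLE-GE-002.txt', 'ICLE-SW-001.txt']
```

The same predicate mechanism extends to any recorded variable (L1, medium of instruction, and so on), which is exactly why systematic metadata collection pays off later in the analysis.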

7 See http://www.natcorp.ox.ac.uk/docs/URG/ (Accessed 24 May 2019) and https://web.archive.org/web/20130302203713/http://micase.elicorpora.info/files/0000/0015/MICASE_MANUAL.pdf (Accessed 24 May 2019).



Metadata about the discourse (essay)
1a Title:
1b Approximate length required: -500 words/+500 words
1c Conditions: timed/untimed
1d Examination: yes/no
1e Reference tools: yes/no
1f -> What reference tools? Bilingual dictionary / English monolingual dictionary / Grammar / Other(s)

Metadata about the informant (university student)
2a Surname, First names:
2b Age:
2c Gender: M/F
2d Nationality:
2e Native language:
2f Father’s mother tongue:
2g Mother’s mother tongue:
2h Language(s) spoken at home: (if more than one, give average % use of each)
2i Primary school - medium of instruction:
2j Secondary school - medium of instruction:
2k Current studies:
2l Current year of study:
2m Institution:
2n Medium of instruction: English only / Other language(s) (specify) / Both
2o Years of English at school:
2p Years of English at university:
2q Stay in an English-speaking country: -> Where? When? How long?
2r Other foreign languages in decreasing order of proficiency:

Fig. 1.1 An example of metadata collected for a corpus: The learner profile for the ICLE. (Adapted from https://uclouvain.be/en/research-institutes/ilc/cecl/corpus-collection-guidelines.html. Accessed 24 May 2019)

Three different documents that are commonly used in corpus compilation have been brought up above: (i) the consent form from the participants, (ii) the questionnaire asking for various types of metadata about the participants and the discourse and (iii) a text, possibly in a “read me” file, which documents what is in the corpus. Corpus compilers who are collecting publicly available data in such a way that they do not need (i) or (ii), may still choose to compile metadata to help track for instance various types of sociolinguistic information about the corpus participants. However, if both (i) and (ii) are needed, it is a good idea to investigate the possibility of setting them up electronically, such as on a website, to avoid having to type in all the responses manually.



1.2.5 Formatting and Enriching the Corpus

There is a great deal to be said about how best to format corpus material, but this section will merely offer a few hints on technicalities. (More detailed information is found in Chaps. 2 and 3.) Researchers’ computational needs and programming skills vary. Those who are reasonably computer literate and whose corpus needs are relatively simple are likely to be able to do all the formatting themselves. However, those who wish to compile a corpus involving complex types of information or do advanced types of (semi-)automatic corpus searches would be helped by collaborating with a computational linguist or computer programmer (see Chap. 9).

A plain text format (such as .txt) is often used for corpus files. MS Word formats are avoided, as these add various types of information to the file and do not work with corpus tools such as concordance programs.

When naming files for the corpus, it is useful to have the file name in some way reflect what is in the file. For example, the file name ‘BIO.G0.02.3’ in a corpus of university student writing across disciplines and student levels (Michigan Corpus of Upper-level Student Papers, MICUSP; see Römer and O’Donnell 2011), consists of an initial discipline code (‘Biology’), a student level code (‘G0’ stands for final year undergraduate; while ‘G1’ stands for first year graduate, etc.), followed by a student and paper number (‘02.3’ refers to the third paper submitted by the second student at that level). Codes not only make it easier for the analyst to select the relevant files, but are also useful when analyzing concordance results, as the codes may help reveal patterns in the data. For example, in studying adverbial usage in student writing, the analyst may find that all of the hits for the relatively informal adverbial maybe come from texts coded with the lowest student level (‘G0’).
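A transparent naming scheme like this one can also be parsed mechanically; here is a sketch (the dictionary keys are descriptive labels chosen for the example, not official MICUSP terminology):

```python
def parse_micusp_name(filename):
    """Split a MICUSP-style file name like 'BIO.G0.02.3' into its
    component codes (field names here are illustrative labels)."""
    discipline, level, student, paper = filename.split(".")
    return {"discipline": discipline, "level": level,
            "student": int(student), "paper": int(paper)}

print(parse_micusp_name("BIO.G0.02.3"))
# {'discipline': 'BIO', 'level': 'G0', 'student': 2, 'paper': 3}
```

With such a parser, the analyst can, for instance, group concordance hits by discipline or student level without maintaining a separate lookup table.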
It may be necessary, or just a good investment of time, to add markup, that is, “codes inserted into a corpus file to indicate features of the original text rather than the actual words of the text. In a spoken text, markup might include utterance breaks, speaker identification codes, and so on; in a written text, it might include paragraph breaks, indications of omitted pictures and other aspects of layout” (McEnery and Hardie 2012:246). If we take an example from the corpus of university student writing mentioned above, one of the marked-up features is quoted material. This makes it possible to exclude quotations when searching the running text, based on the logic that most corpus users would be primarily interested in text produced by novice academics themselves, and not material brought in from primary or secondary sources. Markup allows the corpus builder to include important information about each file in the corpus. Various types of metadata can be placed in a separate file or in a ‘header’, so that a computer script or web-based tool for example will be able to use the information in systematic ways when counting frequencies, searching for or displaying relevant data. If we consider the metadata from the ICLE (Fig. 1.1 above) again, it makes it possible to distinguish for instance between those essays which were timed versus untimed, or between essays written by learners who have never



[Fig. 1.2 is not reproduced here: in the original, it shows the XML annotation, with elements labelled 3a–3z, of the sentence ‘But like, I was thinking this is gonna be so embarrassing like in P E!’, each word wrapped in a tag carrying lemma and part-of-speech attributes.]

Fig. 1.2 An illustration of XML annotation: Sentence (a) from the corpus in Representative Study 2. (Based on Rühlemann and O’Donnell 2012:337)

stayed in a country where the target language is spoken versus learners who have reported on relatively extensive stays in such a context.

Another way of adding useful information to a corpus is through annotation, or “codes within a corpus that embody one or more linguistic analyses of the language in the corpus” (McEnery and Hardie 2012:238). Annotation can be done manually or (semi-)automatically (see Chap. 2 for information about automatic annotation). Annotation helps to make the data more interesting and useful. It can be done at any linguistic level, including for example classification of word class for each word in the corpus (POS-tagging; see Fig. 1.2), indication of prosodic features of spoken data, or pragmatic marking of politeness phenomena. Representative Study 2 presents annotations of narratives in conversation, which for example involved adding a code for the degree to which an utterance is represented as verbatim or indirect. Example (a) from the corpus includes a sentence from an utterance where the underlined unit is coded ‘MDD’ (3k in Fig. 1.2) for a verbatim presentation mode.

(a) But like, I was thinking this is gonna be so embarrassing like in P E!

The contemporary standard for corpus markup and annotation is XML (eXtensible Markup Language), where added information is indicated by angle brackets (< and >),



as illustrated in Fig. 1.2, which represents the above sentence. The sentence opens with an <s> tag including a number which uniquely identifies it (3a), and closes with an </s> end tag (3z). Each word also has an opening <w> tag, giving information about lemma forms and part of speech (‘pos’), and a closing </w> tag. The quotative verb thinking (3h), for example, is labelled “VERB” and, more specifically, “VVG” to mark the –ing form of a lexical verb. We can also see, for example, that 3f (was), 3m (is) and 3p (be) instantiate different forms of the lemma BE. XML is ideal “because of its standard nature” and “because so much corpus software is (at least partially) XML-aware” (Hardie 2014:77–78). This does not mean, however, that it is necessary to use XML in corpus building. While Representative Study 2 is at the advanced end regarding corpus formatting, Representative Study 1 uses raw corpus texts and does not even mention XML or annotation. The degree to which a corpus is enriched will depend partly on the research objectives. MICUSP was mentioned above as an example of a corpus created with the aim of mapping advanced student writing across different levels and disciplines. As mentioned, quoted material was marked up to enable automatic separation between the students’ own writing and writing from other sources. It is also an example of a corpus that is distributed to others, which means that the compilers put a greater effort into marking up the data for a range of potential future research projects. For those wishing to learn more about XML for corpus construction, Hardie (2014:73) is a good place to start.

Even more fundamental than markup or annotation is encoding, which refers to “the process of representing a text as a sequence of characters in computer memory” (McEnery and Hardie 2012:243).
We want corpus texts to be rendered and recognized the same way regardless of computer platform or software, but for example accented characters in Western European languages (such as ç and ä) may cause problems if standard encoding formats are not used. How characters are encoded may be an issue especially for non-alphabetical languages. A useful source on the fundamentals of character encoding in corpus creation is McEnery and Xiao (2005), who recommend the format UTF-8 for corpus construction, as it represents “a universal format for data exchange” in the Unicode standard. Unicode is “a large character set covering most of the world’s writing systems, offering a way of standardizing the hundreds of different encoding systems for rendering electronic text in different languages, which were often conflicting” (Baker et al. 2006:163) in the past. Unicode and XML together currently form a standard in corpus building. There are many considerations for formatting corpus material in ways that follow current standards and best practice. An authoritative source is the Text Encoding Initiative (TEI),8 which represents a collective enterprise for developing and maintaining international guidelines. The TEI provides recommendations for different aspects of corpus building, ranging from how to transcribe spoken data to what to put in the ‘header’. As mentioned above, some corpus projects make use of ‘headers’ placed at the top of each corpus file. A TEI-conformant header

8 See https://tei-c.org/. Accessed 24 May 2019.



should at least document the corpus file with respect to the text itself, its source, its encoding, and its (possible) revisions. This type of information can be used directly by linguists searching the corpus texts, but most often it is processed automatically by corpus tools to help the linguist pre-select files, visualize the distribution of variables, display characters correctly, and so on.
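As a much-simplified illustration of how a script can use header metadata to pre-select files and then read word-level annotation, consider the sketch below; the element and attribute names are invented for the example, not the TEI scheme or that of Fig. 1.2.

```python
import xml.etree.ElementTree as ET

# A miniature annotated corpus file: a simplified header plus one
# POS-tagged sentence (element/attribute names are illustrative only).
doc = ET.fromstring("""
<text>
  <header mode="spoken" setting="conversation"/>
  <s n="1">
    <w lemma="I" pos="PRON">I</w>
    <w lemma="be" pos="VBD">was</w>
    <w lemma="think" pos="VVG">thinking</w>
  </s>
</text>""")

# A tool could pre-select files by header metadata...
mode = doc.find("header").get("mode")

# ...and then pull out word/lemma/POS triples from the annotation.
tagged = [(w.text, w.get("lemma"), w.get("pos")) for w in doc.iter("w")]
print(mode, tagged)
# spoken [('I', 'I', 'PRON'), ('was', 'be', 'VBD'), ('thinking', 'think', 'VVG')]
```

Real corpus tools do essentially this at scale, which is why standard-conformant markup pays off: the same few lines work across every file that follows the scheme.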

1.2.6 Sharing the Corpus

One of many ways in which corpora vary is in how extensively and long-term they are intended to be used. A corpus can be designed to be the key material for many different research projects for a long time to come, or it can be created with a single project in mind, with no concrete plan to make it available to others. In the former category, we find ‘standard’ corpora, which are widely distributed and which form the basis for a large body of research. This type of corpus is designed to be representative of a large group of speakers, typically adopting “the ambitious goal of representing a complete language”, as Biber (1993:244) puts it. In the latter category, we find a large and ever-growing number of corpora created on a much more modest scale, focusing on a small subset of language. These are oftentimes used by a single researcher to answer one specific set of research questions, as in the case of Representative Study 1.

Even in the context of a small-scale corpus project, it is considered good practice in research to make one’s data available to others. It supports the principle of replicability in research and it fosters generosity in the research community. Our time will be much better invested if more than one person actually uses the material we have put together so meticulously. Certain types of data will be of great interest to not only researchers or teachers and students, but also the producer community itself, as in the case of sign language corpora (e.g. Crasborn 2010). Sharing one’s corpus is in fact to an increasing extent a requirement; some bodies of research funding make ‘open access’ a precondition for receiving any funding.

When sharing a corpus, it is common to apply licensing. Making a corpus subject to a user licence agreement provides a way of keeping a record of the users and of enforcing specific terms of use.
Corpora published online may, for example, be made available to others through a Creative Commons licence in order to prohibit profit-making from the material.9 However, even with such a licence in place, it may be difficult for corpus compilers to enforce compliance, which is another reason for taking the protection of informants’ integrity very seriously. Even if open access is not a requirement, a researcher applying for funding to compile a corpus may find it worthwhile to include a budget entry for eventually making the corpus available. If, for reasons related to copyright, say, it is not possible to make the complete set of

9 See https://creativecommons.org/licenses/. Accessed 24 May 2019.



corpus files available to others, the corpus could still be made searchable online, with concordance lines from the corpus displayed. Another consideration in sharing corpus resources involves how to make these accessible to others and how to preserve digital data. The easiest option is to find an archive for the corpus, such as The Oxford Text Archive or CLARIN.10

1.2.7 Corpus Comparison

Corpus data are typically studied quantitatively in some capacity. This means that the researcher will have various numbers to relate to, which typically give rise to questions such as ‘Is a frequency of X a lot or a little?’. Such questions are difficult to answer in a vacuum, but are more usefully explored by means of comparison—for example, by studying the target linguistic phenomenon not just in one context, but contrasting it across different contexts. Statistics can then be used to support the interpretation of results across two or more corpora, or to assess the similarity between two or more corpora (see e.g. Kilgarriff (2001) for a classic paper taking a statistical approach to measuring corpus similarity). The researcher may go on to ask qualitative questions such as ‘How is phenomenon X used?’ and systematically study similarities and differences in (sub-)corpus A and (sub-)corpus B. Even if frequencies are similar in a cross-corpus comparison, it may be the case that, once you scratch the surface and do a qualitative analysis of how the individual examples are actually used, considerable differences emerge. In order for the comparison to be valid, however, the two sets ((sub-)corpus A and (sub-)corpus B) need to be maximally comparable with regard to all or most factors, except for the one being contrasted. Some corpora are intentionally constructed for comparative studies (this includes parallel corpora, covered in Chap. 12). In contrastive studies of different languages or varieties, for example, it is useful to have a so-called comparable corpus, which “contains two or more sections sampled from different languages or varieties of the same language in such a way as to ensure comparability” (McEnery and Hardie 2012:240).
The way in which the texts included in the corpora have been chosen should be identical or similar—that is, covering the same type of discourse, taken from the same period of time, etc.—to avoid comparing apples to oranges. Having considered some of the fundamentals of corpus compilation, we next turn to the two sample studies, which illustrate many of the concepts mentioned in this section.

10 See https://ota.ox.ac.uk/ and https://www.clarin-d.net/en/corpora. Accessed 24 May 2019.


A. Ädel

Representative Study 1

Jaworska, S. 2016. A comparative corpus-assisted discourse study of the representations of hosts in promotional tourism discourse. Corpora 11(1): 83–111.

Jaworska (2016:84) makes the point that “corpus tools and methods [are] increasingly used to study discursive constructions of social groups, especially the social Other—that is, groups that have been marginalised and discriminated against”.11 In this study, corpus methods are used to investigate promotional tourism discourse and the ways in which local people (hosts) are represented. Previous research in the area is based on small samples of texts and looks at representations in one destination or region, so there is typically no comparison across contexts. The research questions for the study are:

1. How are hosts represented in tourism promotional materials produced by Western versus local tourist industries?
2. To what extent do these representations differ?
3. What is the nature of the relationship between the representations found in the data and existing stereotypical, colonial, and often gendered ideologies?

To answer these questions, two corpora were created, consisting of written texts promoting tourist destinations that have a history of being colonised. The two corpora represent, on the one hand, a Western, ‘external’ perspective and, on the other, a local, ‘internal’ perspective, which are contrasted in the study. They are labelled the External Corpus (EC) and the Internal Corpus (IC). To create the EC, texts were manually taken from the websites of “some of the largest tourism corporations operating in Western Europe”. A selection of 16 destinations was made, based on the most popular destinations as identified by the companies themselves during the period of data collection—however excluding Southern European destinations, as the focus of the research was on post-colonial discursive practices.
To create the IC, official tourism websites were sourced from the 16 countries selected in the process of creating the EC. All of the websites are listed in an appendix to the article. A restriction imposed on the data selection for both corpora was to include only “texts that describe the countries and its main destinations (regions and towns)” rather than specific resorts or hotels or information on how to get there. This was to make the two corpora as comparable as possible. However, one way in which they differ is with respect to size, with the IC being three

11 A similar example is reported in Chap. 8 and involves a study of how foreign doctors are represented in a corpus of British press articles.



times as big as the EC, as “local tourism boards [offer] longer descriptions and more details” (92). The solution to comparing corpora of different sizes was to normalise the numbers, rather than reduce the size of the IC. The author’s rationale was that reducing the IC would have “compromise[d] the context and the discourse of local tourism boards in that some valuable textual data could have been lost” (92).

The corpora were compared by extracting lists of the most frequent nouns (cf. Chap. 4). From these lists, the most frequent items used to refer to local people (e.g. people, locals, man/men, woman/women, fishermen) were identified. Careful manual analysis was required in order to check that each instance was relevant, that is, actually referring to hosts/local people. The word people, for example, was also sometimes used to refer to tourists. It was found that the IC had not only more tokens of such references, but also more types (F = 68) compared to the EC (F = 20). The tokens were further classified into socio-semantic groups of social actors based on a taxonomy adapted from the literature, for example ‘occupation’ (fisherman, butler), ‘provenance’ (locals, inhabitants), ‘relationship’ (tribe, citizens), ‘religion’ (devotees, pilgrims), ‘kinship’ (son/s, child/ren) and ‘gender’ (man/men, woman/women).

The corpora were compared qualitatively as well, by identifying patterns in the concordance lines and analysing the context (“collocational profiles”) of the references to hosts, specifically of people and locals, which occurred in both corpora. The pattern found for locals was that local people were represented “on an equal footing with tourists” in the IC, while in the EC they were portrayed as “docile, friendly and smiley servants [,] reproduc[ing] and maintain[ing] the ideological colonial asymmetry” (104).
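The normalisation step used to compare corpora of unequal size can be sketched in a few lines of Python. This is a minimal illustration, not the author’s actual procedure: the corpus sizes and word counts below are invented for the example (only their rough 3:1 size ratio echoes the study), and the function name per_million is our own.

```python
# Sketch: comparing raw counts across corpora of unequal size by
# normalising to a common base (here: occurrences per million words).
# All figures below are invented for illustration, not Jaworska's data.

def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical corpus sizes: the IC is roughly three times the EC.
ec_size, ic_size = 200_000, 600_000

# Hypothetical raw counts of host-referring nouns in each corpus.
ec_counts = {"people": 40, "locals": 12}
ic_counts = {"people": 150, "locals": 90}

for word in ["people", "locals"]:
    ec_pm = per_million(ec_counts[word], ec_size)
    ic_pm = per_million(ic_counts[word], ic_size)
    print(f"{word}: EC {ec_pm:.1f} vs IC {ic_pm:.1f} per million words")
```

Because both figures are expressed on the same per-million base, the raw size difference between the corpora no longer distorts the comparison.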

Representative Study 2

Rühlemann, C. and O’Donnell, M.B. 2012. Introducing a corpus of conversational stories: Construction and annotation of the Narrative Corpus. Corpus Linguistics and Linguistic Theory 8(2): 313–350.

Rühlemann and O’Donnell’s (2012) article describes the main features of a corpus of conversational narratives, the Narrative Corpus (NC). Research has shown that it is extremely common for people to tell stories in everyday conversation. The authors hope that the use of the corpus “will advance the linguistic theory of narrative as a primary mode of everyday spoken interaction” (315).



Previous work on this type of discourse has been based not on corpus data, but on elicited interviews or narratives told by professional narrators. The corpus comprises selected extracts of narratives, 153 in all, for a total of around 150,000 words, taken from the demographically sampled ‘casual conversations’ section of the BNC, which is balanced by sex, age group, region and social class, and which totals approximately 4.5 million words. This example is somewhat unusual in that the authors did not collect the data themselves, but instead used a selection of data from an existing corpus. However, given that the intended audience of this handbook is expected to have limited resources for corpus compilation, it seems useful to provide an example of a study where it was possible to use part of an already existing corpus.

The NC is only about 3% of the original collection from the BNC, so the authors have put a great deal of effort into selecting the data, which is done in a transparent and principled way. In the article, they describe (i) the extraction techniques, (ii) the selection criteria and (iii) the sampling methods used in constructing the corpus. In order to (i) retrieve narratives, they (a) read the files manually and (b) used a small set of lexical forms (e.g. it was so funny/weird; did I tell you; reminds me) that tend to occur in narratives, based on the literature or on analysis of their own data. In (ii) deciding what counts as a conversational narrative, they used three selection criteria. First, some kind of ‘exosituational orientation’ needed to be present in the discourse, that is, “linguistic evidence of the fact that stories relate sequences of events that happened in a situation remote from the present, story-telling, situation” (317)—this includes, for example, the use of past tense verbs; items with past time reference, as in yesterday; and reference to locations not identical to the location of speaking.
A second criterion was that at least two narrative clauses be present, which are temporally related so that first one event takes place and then another. A third criterion involved consensus: at least two researchers had to agree that a given example was in fact a narrative. With respect to (iii) sampling, the authors retained the sociological balance of the demographically sampled BNC by choosing two texts from each file insofar as this was possible.

The NC is not only a carefully selected subset of the demographically sampled BNC; it is also annotated. The corpus builders have thus augmented the existing data by adding various types of information—about the speakers (sex, age, social class, region of origin, educational background), about the text (type of narrative; whether a stand-alone story or part of a ‘narrative chain’) and about the utterance (the roles of the participants vis-à-vis the narration; type of quotative verb used to signal who said what in a narrative; to what degree the discourse is represented as being verbatim or more or less indirect). The authors stress that all of the annotation is justified in some way by the literature on conversational narrative, so the rationale for



including a layer of analysis in the corpus text is to enable researchers to answer central research questions in a systematic fashion. The corpus design makes it possible to use the demographic information about the speakers—such as sex—and consider how it is distributed in relation to the number of words uttered by the speakers who are involved in the narratives, as exemplified in Table 1.1. Note the presence of a category of “unknown”, which is useful when relevant metadata is missing.

Each narrative in the corpus is also classified based on a taxonomy of narrative types. This type of information is highly useful, as it not only makes it possible to study and compare different types of narrative, but also shows how the corpus is balanced (or not) with respect to type of narrative. The classification is justified by an observation from the literature that “we are probably better off [] considering narrative genre as a continuous cline, consisting of many subgenres, each of which may need differential research treatment” (Ervin-Tripp and Küntay 1997:139, cited in Rühlemann and O’Donnell 2012:321). The annotation includes two features: experiencer person (whether first person or third person, that is, direct involvement by the narrator versus hearsay) and type of experience (personal experiences; recurrent generalized experiences; dreams; fantasies; jokes; mediated experiences). The last subcategory refers to the common practice of retelling a film or a novel.

At the time of its creation, the NC was the first corpus of conversational narratives to be annotated, so there was no established practice to follow regarding what analytical categories to annotate. However, the authors were able to follow some general guidelines, for example Leech’s (1997) ‘standards’ for corpus annotation concerning how to design the labels in the tagsets (e.g. they should be (a) easy to interpret and (b) concise, consisting of no more than three characters).
Table 1.1 Distributions of male and female narrative participants involved in narratives [based on a subset of the total corpus]

Sex        Number of participants    %      Number of words    %
Female     212                       42     44,476             56
Male       173                       35     24,268             31
Unknown    115                       23     10,079             13
Total      500                       100    78,823             100

Source: Rühlemann and O’Donnell (2012:320)
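The lexical-cue retrieval step in Representative Study 2 (step (i)(b) above) can be approximated with a simple regular-expression filter. This is only a sketch, not the authors’ actual extraction code: the cue list is a small subset of the forms they mention, and the transcript turns are invented for illustration.

```python
import re

# A few of the narrative-initiating cues mentioned by the authors;
# this is a small illustrative subset, not their full set.
cues = [r"it was so (?:funny|weird)", r"did I tell you", r"reminds me"]
pattern = re.compile("|".join(cues), re.IGNORECASE)

# Invented sample of conversational transcript turns.
turns = [
    "Did I tell you about the ferry?",
    "We left at six in the morning.",
    "That reminds me of something that happened last week.",
]

# Keep turns containing a cue as candidate narrative openings
# (to be confirmed by manual reading, as in the study itself).
candidates = [t for t in turns if pattern.search(t)]
print(candidates)  # the first and third turns match
```

As the study makes clear, such automatic retrieval only yields candidates; the selection criteria (narrative clauses, researcher consensus) still have to be applied manually.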



1.3 Critical Assessment and Future Directions

The representation of a group of language users, a variety of language, or a type of discourse in a corpus inevitably involves simplification and loss of complex contextual information. If we consider the future of corpus building from the perspective of this loss, it is interesting to note that few existing corpora reflect a feature which many present-day types of discourse exhibit: multimodality. This represents information of a kind that many corpus creators have expressed an interest in, but which few corpus projects have included (see Chap. 16 for more information). The two sample studies would both have benefitted from multimodal data. In Jaworska (2016:105), this is explicitly commented on by the author, who says that “given that images are an integral part of tourism promotional discourse, further studies would need to complement a quantitative textual analysis with a multi-modal approach based on a systematic examination of the visual material in order to reveal other semiotic resources”. The corpus of narratives described in Rühlemann and O’Donnell (2012) was constructed from the BNC, whose spoken data offer no information beyond the speech signal (sound from recordings) and its transcriptions. This is critiqued in a review of a monograph by Rühlemann, where the reviewer makes the point that “[t]he one glaring limitation to using preexisting transcribed texts such as these from the BNC is the paucity of information on the paralinguistics going on during storytelling, including glance, gesture, tone of voice and, since the central topic of the volume is narrative co-construction and recipient feedback, this is a significant absence” (Partington 2015:169).
Regarding the inevitable loss of contextual information in the making of a corpus, it is important to attempt to compensate by means of rich metadata describing the material. With better metadata about individual texts and speakers, we will be in a better position to understand the data, not only to correlate metadata with variation, but also to see more precisely how corpora differ when they are compared. Corpus enrichment is an important way forward, and this applies not only to metadata but also to linguistic annotation. Some of the possibilities of corpus annotation are presented in the next chapter. In order to promote and make better use of corpus enrichment, there is a need for collaborative work between linguists with a deep knowledge of the needs of different areas, such as Second Language Acquisition or Historical Linguistics, and experts in Computational Linguistics or Natural Language Processing.



Further Reading

Biber, D. 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8(4): 243–257.
Biber’s work is significant not only in having had quite an impact on the field, but also in its attempt to develop empirical methods for evaluating corpus representativeness.

Wynne, M. (Editor). 2005. Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. http://ota.ox.ac.uk/documents/creating/dlc/.
There is a surprising dearth of reference works on corpus compilation. Even if this collection of chapters is not recent, it is still worth reading.

References

Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
Baker, P., Hardie, A., & McEnery, T. (2006). A glossary of corpus linguistics. Edinburgh: Edinburgh University Press.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Crasborn, O. (2010). What does ‘informed consent’ mean in the internet age? Publishing sign language corpora as open content. Sign Language Studies, 10(2), 276–290.
Douglas, F. M. (2003). The Scottish corpus of texts and speech: Problems of corpus design. Literary and Linguistic Computing, 18(1), 23–37.
Francis, W. N., & Kucera, H. (1964/1979). Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Department of Linguistics, Brown University. http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM. Accessed 24 May 2019.
Hardie, A. (2014). Modest XML for Corpora: Not a standard, but a suggestion. ICAME Journal, 38, 72–103.
Jaworska, S. (2016). A comparative corpus-assisted discourse study of the representations of hosts in promotional tourism discourse. Corpora, 11(1), 83–111. https://doi.org/10.3366/cor.2016.0086.
Johansson, S., Leech, G., & Goodluck, H. (1978). Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers. Department of English, University of Oslo. http://clu.uni.no/icame/manuals/LOB/INDEX.HTM. Accessed 24 May 2019.
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133.
Leech, G. (1997). Introducing corpus annotation. In R. Garside, G. Leech, & T. McEnery (Eds.), Corpus annotation: Linguistic information from computer text corpora (pp. 1–18). London: Longman.
Leech, G. (2007). New resources, or just better old ones? In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web (pp. 134–149). Amsterdam: Rodopi.
McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.
McEnery, T., & Xiao, R. (2005). Character encoding in corpus construction. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 47–58). Oxford: Oxbow Books.



Partington, A. (2015). Review of Rühlemann (2014) Narrative in English conversation: A corpus analysis of storytelling. ICAME Journal, 39. https://doi.org/10.1515/icame-2015-0011.
Römer, U., & O’Donnell, M. B. (2011). From student hard drive to web corpus (part 1): The design, compilation and genre classification of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora, 6(2), 159–177.
Rühlemann, C., & O’Donnell, M. B. (2012). Introducing a corpus of conversational stories: Construction and annotation of the Narrative Corpus. Corpus Linguistics and Linguistic Theory, 8(2), 313–350. https://doi.org/10.1515/cllt-2012-0015.
Simpson-Vlach, R., & Leicher, S. (2006). The MICASE handbook: A resource for users of the Michigan Corpus of Academic Spoken English. Ann Arbor: University of Michigan Press.
Sinclair, J. (2005). Corpus and text – basic principles. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1–16). Oxford: Oxbow Books.

Chapter 2

Corpus Annotation

John Newman and Christopher Cox

Abstract In this chapter, we provide an overview of the main concepts relating to corpus annotation, along with some discussion of the practical aspects of creating annotated texts and working with them. Our overview is restricted to automatic annotation of electronic text, which is the most common kind of annotation in the context of contemporary corpus linguistics. We focus on the annotation of texts which typically follow established orthographic principles and consider the following four main types of annotation, using English for the purposes of illustration: (1) part-of-speech (POS) tagging, (2) lemmatization, (3) syntactic parsing, and (4) semantic annotation. The accuracy of annotation is a key factor in any evaluation of annotation schemes and we discuss methods to verify annotation accuracy, including precision and recall measures. Finally, we briefly consider newer developments in two broad areas: the annotation of multimodal corpora and the annotation of Indigenous and endangered language materials. Both of these developments reflect changing priorities on the part of linguistic researchers, and both present significant challenges when it comes to automated annotation.

2.1 Introduction

Annotation provides ways to enhance the value of a corpus by adding information about parts of the corpus. While there may be a variety of types of annotation, including, for example, adding information about persons or places referenced in historical texts, our focus here is linguistic annotation. Such annotation

J. Newman () University of Alberta, Edmonton, Canada Monash University, Melbourne, Australia e-mail: [email protected] C. Cox Carleton University, Ottawa, Canada e-mail: [email protected] © Springer Nature Switzerland AG 2020 M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_2




most typically takes the form of adding linguistically relevant information about words, phrases, and clausal/sentential units, though other linguistic units can also be annotated, e.g., morphemes, intonation units, conversational turns, and paragraphs. The reality of contemporary corpus linguistics is that the corpora we rely on, in most cases, are simply too large for manually adding annotation, and the automated annotation of electronic texts has become the primary focus in the development of annotation methods. Consequently, it is automated linguistic annotation that we will be concerned with in this chapter (see Part III of this volume for discussion of manual annotation in certain kinds of corpora, e.g., annotation of errors in a learner corpus). In order to simplify the discussion that follows, we illustrate our points about the fundamentals of annotation using primarily English data. While the raw text of an unannotated corpus has its own unique value, (wisely) annotated corpora offer great advantages over the raw text when it comes to the investigation of linguistic phenomena. Most linguistic phenomena of interest to linguists are couched in terms of linguistic constructs (the plural morpheme, the passive construction, time adverbials, the subject of a verb, etc.), rather than orthographic words. A corpus that has been annotated with the needs of linguists in mind can greatly facilitate the exploration of such phenomena by reducing the time and effort involved. Even if an automatically annotated corpus is unlikely to meet all the expectations of a researcher in terms of its categories of annotation, it can still be an invaluable resource.

2.2 Fundamentals

In the following sub-sections, we deal with a few main types of annotation (part-of-speech tagging, lemmatization, syntactic parsing, and semantic annotation), the accuracy of annotation, and the practicalities of carrying out annotation of texts.

2.2.1 Part-of-Speech Tagging

Part-of-speech (POS) tagging is a common form of linguistic annotation that labels or “tags” each word of a corpus with information about that word’s grammatical category (e.g., noun, verb, adjective, etc.). Any such tagging assumes prior tokenization of the text, i.e., division of the text into units appropriate for analysis and annotation. Tokenization in fact is a prerequisite for most kinds of annotation and presents challenges in its own right. Although it is convenient to simply refer here to the tagging of “words”, tokenizing a text into word units subsumes a number of critical decisions that we put aside here (see Chap. 3 for further discussion of tokenization and related issues). There are many POS tagsets currently used in English corpus analysis, varying in degree of differentiation of POS categories and in the nature of the categories



themselves (see Atwell 2008 for an overview of English POS tagsets). One commonly used tagset is CLAWS (Constituent Likelihood Automatic Word-tagging System), available in different versions (e.g., CLAWS 5 contains just over 60 tags, while CLAWS 7 contains over 160). (1a) is an example of a sentence that has been tagged using the CLAWS 7 tagset and (1b) shows the descriptions of the tags used in (1a), as given in the CLAWS documentation.1 Most of these tags correspond to familiar parts of speech from traditional grammatical analysis of English (article, infinitive, singular common noun, etc.), though some other tags are less familiar (e.g., after-determiner). A key consideration in preparing tagged corpora is that the tag appears in some predictable format along with the word it is associated with, as is the case in (1a), where the POS tag is appended to a word using the underscore as a separator (see Chap. 3 for further discussion of the integration of POS information into a corpus, including “stand-off” POS annotation). In this version of CLAWS, punctuation marks like a comma are trivially tagged as that punctuation mark.

(1) a. If_CS the_AT government_NN1 continues_VVZ to_TO behave_VVI in_II this_DD1 way_NN1 ,_, it_PPH1 will_VM find_VVI itself_PPX1 facing_VVG opposition_NN1 from_II those_DD2 who_PNQS have_VH0 been_VBN supporting_VVG it_PPH1 many_DA2 years_NNT2 ._.

    b. AT = article
       CS = subordinating conjunction
       DA2 = plural after-determiner
       DD1 = singular determiner
       DD2 = plural determiner
       II = general preposition
       NN1 = singular common noun
       NNT2 = temporal noun, plural
       PNQS = subjective wh-pronoun (who)
       PPH1 = 3rd person singular neuter pronoun (it)
       PPX1 = singular reflexive personal pronoun
       TO = infinitive marker (to)
       VBN = been
       VH0 = have, base form
       VM = modal auxiliary
       VVI = infinitive
       VVG = -ing participle of lexical verb
       VVZ = -s form of lexical verb
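The word_TAG convention in (1a) is straightforward to process programmatically. The sketch below is our own illustration of reading such text back into (word, tag) pairs, not part of CLAWS itself; the handling of hyphenated “ambiguity tags” such as VVG-NN1 (discussed below) simply follows the CLAWS convention that the preferred analysis is listed first.

```python
def parse_tagged(text):
    """Split CLAWS-style word_TAG tokens into (word, tag) pairs.

    Splitting on the *last* underscore means punctuation tagged
    as itself (e.g. ',_,') is handled correctly too.
    """
    pairs = []
    for token in text.split():
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs

def preferred_tag(tag):
    """For a hyphenated ambiguity tag like 'VVG-NN1', CLAWS lists
    the preferred analysis first."""
    return tag.split("-")[0]

sample = "it_PPH1 will_VM find_VVI itself_PPX1 singing_VVG-NN1 ._."
pairs = parse_tagged(sample)
print(pairs)
print(preferred_tag(dict(pairs)["singing"]))  # VVG
```

Splitting on the last underscore rather than the first is the safer choice here, since tags never contain underscores but a corpus word conceivably could.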

1 The full CLAWS 7 tagset can be found at http://ucrel.lancs.ac.uk/claws7tags.html. Accessed 25 May 2019.



Contracted forms of English show some peculiarities when tagged with CLAWS. For example, in the British National Corpus, tagged with CLAWS 5, gonna is segmented into gon (VVG) and na (TO), tagged just like the unreduced equivalent going to would be. Similarly, the reduced form of isn’t, inn’t, appears as the sequence in (VBZ, a tag otherwise reserved for the is form of the verb be), n (XX0, a negative particle), and it (PNP, a personal pronoun) (see Chap. 3 for more information about how to map word forms and annotations in multiple layers or annotation graphs in corpus architecture).

Sometimes, it is useful to allow multiple tags to be associated with the same word. The CLAWS tagger, for example, assigns a hyphenated POS tag, referred to as an “ambiguity tag”, when its tagging algorithm is unable to unambiguously assign a single POS to a word. For example, singing in the sentence She says she couldn’t stop singing, even if she wanted to try is tagged in the British National Corpus by CLAWS 5 as VVG-NN1. The hyphenated tag in this case indicates that the algorithm was unable to decide between VVG (the -ing form of a verb) and NN1 (the singular of a common noun), but the preference is for the VVG tag, which appears as the first element of the hyphenated tag. In some cases, genuine multiple readings of a sentence may be possible, e.g., the sentence the duchess was entertaining from the Penn Treebank, where entertaining could justifiably be tagged either as an adjective or as a present participle. Hyphenated tags also have a useful role to play in the tagging of diachronic corpora, where a word may come to be associated with different parts of speech or different functions through time (cf. Meurman-Solin 2007 and Chap. 10). POS tags may also be attached to sequences of words.
So, for example, in the British National Corpus, sequences such as of course, all at once, and from now on are tagged as adverbs, while instead of, in pursuit of, and in accordance with are tagged as prepositions. These tags, of which there are many, are assigned as part of an “idiom tagging” step after the initial assignment of POS tags and involve matching the sequences of POS-tagged words in the corpus against templates of multiword sequences.2 Indeed, there has been a growing interest in multiword expressions as part of the annotation of a corpus, beyond just multiword parts of speech. The term multiword expression, it should be noted, is used in a great variety of ways in the literature, including, but not limited to, relatively fixed idiomatic phrases (easy as pie), whole proverbs (beggars can’t be choosers), verb-particle phrases (pick up), and light verb constructions (make decisions) (see Schneider et al. 2014 for an overview of multiword expressions as understood in the literature). Each POS tagset, either explicitly or implicitly, embodies some theory of grammar, even if the theory is simply traditional grammar. It would be unrealistic, therefore, to expect that different POS-tagging algorithms for the same language will produce identical results. Consider the tags assigned to rid in the three sentences

2 For more on the treatment of multiword expressions in the British National Corpus, see the section on “Automatic POS-Tagging of the Corpus” at http://ucrel.lancs.ac.uk/bnc2/bnc2autotag.htm. Accessed 25 May 2019.



in Table 2.1, based on four automatic tagging programs, where it can be seen that there is no uniform assignment of the part of speech of rid for any of the three sentences given. Here we see indications of an earlier historical shift in grammatical status, from a past participle to an adjective, with different (unambiguous) solutions provided by the different POS taggers.

Table 2.1 Four tagging solutions for English rid

Tagger                       I am now completely     You are well      I got rid of
                             rid of such things      rid of him        the rubbish
CLAWS7 tagger (a)            Past participle         Past participle   Past participle
Infogistics (b)              Verb base               Verb base         Past participle
FreeLing (Brill-based) (c)   Adjective               Verb base         Past participle
GoTagger (d)                 Adjective               Adjective         Adjective

(a) Available at http://ucrel.lancs.ac.uk/claws/trial.html. Accessed 25 May 2019
(b) Available at http://www.infogistics.com/posdemo.htm. Accessed 25 May 2019
(c) Available at https://github.com/TALP-UPC/FreeLing. Accessed 25 May 2019
(d) Available at https://github.com/adsva/go-tagger. Accessed 25 May 2019

2.2.2 Lemmatization

Another common kind of annotation found in modern-day corpora is lemmatization. In lemmatization, each orthographic word encountered in a corpus is assigned a lemma, or ‘base form’, which provides a level of abstraction from any inflection that might appear in the original orthographic word. If we were to lemmatize this paragraph up to this point, for instance, we would see that the resulting nouns would appear without plural marking, and many verbs without agreement with their subjects, as in (2). Here, the lemmas have replaced the original words, but lemmas could also be added as additional information to the inflected forms (see Chap. 3). Note, too, that in example (2), prior tokenization of the text plays a role in what base forms are recognized. The word modern-day in this example has been tokenized into separate units (modern, -, day), allowing each part to be annotated individually.

(2) another common kind of annotation find in modern – day corpus be lemmatization . in lemmatization , each orthographic word encounter in a corpus be assign a lemma , or ‘ base ’ form , which provide a level of abstraction from any inflection that may appear in the original orthographic word . if we be to lemmatize this paragraph up to this point , for instance , we will see that the resulting noun will appear without plural marking , and many verb without agreement with their subject , as in ( 2 ).

These kinds of lemmas often resemble the headwords found in dictionaries. Like the lemmas found in corpora, dictionary headwords often aim to represent a base word form (e.g., toaster, shine, be), rather than provide separate entries for each distinct


J. Newman and C. Cox

inflected word form (e.g., toasters, shines / shone / shining, is / am / are / were / was / been / being). Both the headwords used in dictionaries and the lemmas found in corpora serve a similar purpose: they allow researchers to locate information more readily, particularly when searching by individual inflected word forms might otherwise make it difficult to find features of interest. In corpus linguistic studies, lemmas can be particularly useful in making complex searches more tractable, especially when a single word appears in many forms. Searches based on lemmas can be invaluable when working with corpora of languages with rich inflection, such as Romance languages like French or Spanish, where a single regular verb may have dozens of distinct inflected forms. While searches such as these can also often be conducted using regular expressions, lemmas generally make these searches more straightforward, and help to ensure that no relevant surface forms are inadvertently overlooked. English is not as richly inflected as some languages, but the variation found in inflected forms of, say, lexical verbs is still considerable (cf. the inflected forms of bring, sing, drive, send, stand, etc.) and being able to search for all instances of such verbs on the basis of the lemmas will save time and effort.3 For some researchers, it is linguistic patterning at the higher lemma level that is of most interest, but for other researchers the individual inflected forms can be the target of interest. Rice and Newman (2005), for example, take a closer look at the inflectional ‘profiles’ of individual verbs in English (e.g., think, rumour, go, etc.), with an eye to how inflected forms are distributed across particular tense, aspect, and subject person categories across different genres. They find that individual verbs (and classes of verbs) often have distinctively skewed distributions of inflection across these categories. 
A verb like think, for instance, shows a marked tendency across all tense and aspects to appear with first-person subjects (e.g., I think, I was thinking, I thought, etc.). These ‘inflectional islands’, as Rice and Newman call them, would be more challenging to study for morphologically complex verbs like think if it were necessary to run individual searches for each distinct inflected form (e.g., think, thinks, thought, etc.). In a lemmatized corpus, however, all instances of think can be retrieved with a single search on the lemma THINK—essentially using lemmatization as a way of getting back to the full range of inflected forms found in the corpus. The choice of lemmatization software often depends on the kinds of language found in the corpus materials. Lemmatization can be an extremely useful tool in the corpus builder’s toolkit, especially when searches of the annotated corpus may need to be run over many inflected forms. Although current English-language lemmatization tools make this process much easier to carry out on large bodies of text, it is often worth bearing in mind that even the most sophisticated lemmatization software will inevitably run into cases that are not entirely clear-cut (e.g., should the

3 Obviously, it will also benefit researchers if any spelling variation in a corpus (older vs. newer spellings, American vs. British spellings, etc.) can be “normalized”, though we take this to be distinct from lemmatization (cf. Chap. 10).



lemma of the noun axes be AXE or AXIS?) and where the resulting lemmas are not necessarily what one might expect. Different analyses can thus lead to different lemma assignments; accordingly, many lemmatization tools are careful to document the language-specific lemmatization guidelines that they follow. As when using other corpus annotations that have been produced by automatic or semi-automatic procedures, understanding the limitations of automatic lemmatization and treating its outputs accordingly with a degree of circumspection is often a necessary part of the corpus annotation and analysis process.
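To make the mechanics concrete, the following toy sketch implements a dictionary-plus-suffix-rules lemmatizer of the kind described above. The exception lexicon and suffix rules here are invented for illustration; real lemmatizers rely on far larger lexicons plus POS information, which is exactly what would be needed to decide between AXE and AXIS for axes:

```python
# Toy lemmatizer: an exception lexicon handles irregular forms, and a few
# naive suffix rules handle regular inflection. Purely illustrative.
EXCEPTIONS = {"was": "be", "were": "be", "is": "be", "am": "be", "are": "be",
              "been": "be", "being": "be", "thought": "think", "shone": "shine"}

SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(token):
    t = token.lower()
    if t in EXCEPTIONS:
        return EXCEPTIONS[t]
    for suffix, replacement in SUFFIX_RULES:
        if t.endswith(suffix) and len(t) > len(suffix) + 2:
            return t[:-len(suffix)] + replacement
    return t

# Lemma-based retrieval: find all inflected forms of THINK in a token list,
# as in the Rice and Newman example discussed above.
tokens = "I thought she thinks thinking is easy".split()
hits = [w for w in tokens if lemmatize(w) == "think"]
```

A single lemma-level query thus retrieves thought, thinks, and thinking at once, which is the practical benefit of searching a lemmatized corpus.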

2.2.3 Syntactic Parsing

The preceding sections have focused on annotations that concern the properties of individual words: in the case of part-of-speech tagging, their membership in particular grammatical classes; in the case of lemmatization, their association with particular headwords. When annotating a corpus, we may also be interested in adding information about the relationships that exist between elements in our texts above the level of the individual word, which may help shed light on larger patterns in our corpus. Identifying particular multiword expressions, as mentioned in the preceding section, is one example of this. Another kind of higher-level annotation is syntactic parsing, which aims to provide information about the grammatical structure of sentences in a corpus. While syntactic structures can be assigned to sentences manually (and manual annotation may be necessary when developing a corpus for a language for which few syntactically annotated corpus resources have already been developed; see Sect. 2.3), it is more common for syntactic annotations to be added automatically by a syntactic parser, a program that provides information about different kinds of syntactic relationships that exist between words in a given text (parses). In the past, syntactic parsers were typically developed following deterministic approaches, often applying sets of carefully crafted syntactic rules or constraints to determine syntactic structure in a fully predictable way. In the 1990s, a new wave of syntactic parsers emerged that adopted a range of novel, probabilistic approaches (see Collins 1999 for an overview). These parsers began by analyzing large numbers of syntactically annotated sentences (a treebank), attempting to learn how syntactic structures are typically assigned to input sentences.
Once trained on a particular sample of sentences, a probabilistic parser could then use that information to determine the most likely parses for any new sentences it encountered, weighing the likelihood of different possible analyses against one another. These probabilistic parsers have become increasingly common in corpus and computational linguistics, and generally work quite well for annotating arbitrary corpus texts in many



languages. One such parser is the Stanford Parser (Klein and Manning 2003),4 and a sample output from this parser is shown in (3), representing the phrase structure tree of She ran over the hill. The phrase structure tree provided by the Stanford Parser also includes automatic part-of-speech tagging, in this case using the tagset from the Penn Treebank (Marcus et al. 1993). The nested parentheses in (3) capture the parent-child relationships between higher- and lower-level constituents (e.g., the determiner the and the noun hill both appearing within a larger noun phrase).

(3) Parse of She ran over the hill.
    (ROOT
      (S
        (NP (PRP She))
        (VP (VBD ran)
          (PP (IN over)
            (NP (DT the) (NN hill))))
        (. .)))

Another common form of syntactic information in corpora is dependency annotation, which indicates the grammatical relationships between sets of words. The online Stanford Parser is able to provide dependency parses for input sentences as well. Example (4) shows the dependency parse for our previous example sentence, She ran over the hill. In this representation, each word in the input is numbered according to its position in the original sentence: the first word, She, is marked with -1, while the fifth word, hill, is marked as -5. Each word appears in a three-part structure that gives the name of the grammatical relationship, followed by the governing and dependent elements (e.g., the nominal subject (nsubj) of the second word in the sentence, ran, is the first word in the sentence, She; the determiner of the fifth word, hill, is the fourth word, the; etc.).

(4) nsubj(ran-2, She-1)
    root(ROOT-0, ran-2)
    case(hill-5, over-3)
    det(hill-5, the-4)
    nmod(ran-2, hill-5)

By tracing these relationships, it is possible to extract features of corpus sentences—subjects of active and passive verbs, objects of prepositions, etc.—that would be difficult to retrieve from searches of unannotated text alone. As informative as these compact, textual representations of constituency and dependency structures can be, it can also be useful to be able to visualize syntactic annotations, especially when reviewing annotated corpus materials for accuracy.
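As a small illustration of tracing these relationships, the sketch below parses dependency triples written in the textual format of example (4) with a regular expression and pulls out, for instance, all nominal subjects. The format assumed here is the Stanford-style output shown above; other dependency formats would need a different pattern:

```python
import re

# Parse dependency triples like "nsubj(ran-2, She-1)" into
# (relation, head, dependent) tuples.
TRIPLE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_deps(text):
    return [(m.group(1), m.group(2), m.group(4)) for m in TRIPLE.finditer(text)]

def pairs_for(text, relation):
    """All (dependent, head) pairs for a given relation, e.g. nsubj."""
    return [(dep, head) for rel, head, dep in parse_deps(text) if rel == relation]

deps = """nsubj(ran-2, She-1)
root(ROOT-0, ran-2)
case(hill-5, over-3)
det(hill-5, the-4)
nmod(ran-2, hill-5)"""
```

Calling `pairs_for(deps, "nsubj")` recovers She as the subject of ran, the kind of query that underlies extracting subjects of verbs from a dependency-parsed corpus.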

4 Available at https://nlp.stanford.edu/software/lex-parser.shtml. Accessed 25 May 2019.



Fig. 2.1 Visualization of the dependency structure for She ran over the hill

Several freely available tools are able to produce graphical representations of the kinds of dependency and constituency structures seen here. In Fig. 2.1, the dependency structure for (4) has been visualized using the Stanford CoreNLP online tool.5

2.2.4 Semantic Annotation

Semantic annotation refers to the addition of semantic information about words or multiword units to a corpus. An example of such an annotation model is the UCREL Semantic Analysis System (USAS).6 A full explanation of USAS can be found in Archer et al. (2004) and Piao et al. (2005).7 USAS relies on a classification of the lexicon into twenty-one broad categories, called discourse fields, represented by the letters of the alphabet, as displayed in Fig. 2.2. These fields are intended to be as general as possible, suitable for working with as many kinds of text as possible and providing an intuitive, immediately understandable breakdown of our conceptual world. Archer et al. (2004) point to similarities in the taxonomies utilized by USAS and the Collins English Dictionary (2001), an indication of how the taxonomy in Fig. 2.2 reflects a common ‘folk understanding’ of our conceptual world rather than a classification arrived at top-down by strict psychological or philosophical criteria. The set of fields includes a category Z assigned to names and function words, with narrower sub-categories indicated in tags by additional delimiting numbers. In (5) we illustrate narrower categories of the Time category in USAS, along with examples. Notice in (5) that words belonging to different parts of speech can be assigned the same semantic tag. The category T1.1.1, for example, includes nouns (history), verbs (harked, as in harked back to), adjectives (nostalgic), and adverbs (already).

5 Available at http://corenlp.run/. Accessed 25 May 2019.
6 UCREL stands for “University Centre for Computer Corpus Research on Language” at Lancaster University.
7 Further information on USAS can also be found online at http://ucrel.lancs.ac.uk/usas/. Accessed 25 May 2019.


Fig. 2.2 The 21 major discourse fields underlying the USAS semantic tagset

(5) T1 Time
      T1.1 General
        T1.1.1 Past: history, medieval, nostalgic
        T1.1.2 Present: yet, now, present
        T1.1.3 Future: shall, will, next
      T1.2 Momentary: midnight, sunrise, sunset
      T1.3 Period: years, century, 1940s
    T2 Beginning/ending: still, began, continued
    T3 Old/new/young; age: new, old, children
    T4 Early/late: early, later, premature

An example of text annotated in accordance with USAS is shown in (6a). The annotated text comes from the Canadian component of the International Corpus of English. Notice that in some cases there are multiple semantic categories associated with one word. Children, for example, is associated with a portmanteau tag, consisting of both S (Social actions, states and processes) and T (Time). The labels “m” and “f” for male and female can also form part of the semantic tag, e.g., girls includes the “f” label, while children includes both “m” and “f”. (6b) illustrates how some multi-word units, e.g., at a time (as in build houses two at a time), have a unique identifier that is associated with each word in the multi-word unit, here i165.3, with each member of the three-word unit assigned an additional suffix 1, 2, or 3.




(6) a. The_Z5 ending_T2- of_Z5 the_Z5 poem_Q3 may_A7+ seem_A8 to_Z5 be_A3+ contradictory_A6.1- because_Z5/A2.2 both_N5 girls_S2.1f marry_S4 and_Z5 have_A9+ children_S2mf/T3- ;_PUNC thereby_Z5 filling_N5.1+ the_Z5 traditional_S1.1.1 female_S2.1 role_I3.1 ._PUNC
    b. at_T1.1.2[i165.3.1 a_T1.1.2[i165.3.2 time_T1.1.2[i165.3.3

The process of annotating a corpus with USAS tags is carried out automatically, relying on a complex combination of lexicon and rules, including prior tokenization and POS tagging (Rayson 2007), and a number of corpora in the International Corpus of English have been semantically annotated using USAS.
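Such annotations can also be read back programmatically. The sketch below (an illustration based only on the word_TAG format of example (6a); real USAS output has further complications) splits each token into its word and its possibly-portmanteau list of tags:

```python
# Split USAS-annotated tokens of the form word_TAG; portmanteau tags such as
# S2mf/T3- are separated on "/". The format follows example (6a) above.
def parse_usas(annotated):
    tokens = []
    for item in annotated.split():
        word, _, tag = item.rpartition("_")
        tokens.append((word, tag.split("/")))
    return tokens

def words_in_field(annotated, letter):
    """All words any of whose tags begins with the given discourse-field letter."""
    return [w for w, tags in parse_usas(annotated)
            if any(t.startswith(letter) for t in tags)]

sample = ("both_N5 girls_S2.1f marry_S4 and_Z5 have_A9+ "
          "children_S2mf/T3- ;_PUNC thereby_Z5")
```

Querying `words_in_field(sample, "S")` returns girls, marry, and children, i.e. all tokens assigned to the Social actions field, including children via its portmanteau tag.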

2.2.5 Annotation Accuracy

Automated annotation is subject to errors, and consequently the accuracy of annotation models is a key consideration for researchers either choosing a model to apply to a corpus or working with a corpus that has already been annotated. A common measure of (token) accuracy is to report the number of correct tags in a corpus as a percentage of the total number of tags in the corpus, or, more likely, in some sample of the whole corpus extracted for the explicit purpose of checking accuracy (cf. Marcus et al. 1993). An accuracy of more than 90% is typical for automatic lemmatization, POS tagging, parsing, and semantic annotation of general English. Another measure of the accuracy of POS taggers is the percentage of sentences in a corpus or corpus sample which have been correctly tagged in their entirety, i.e., sentence accuracy; by this measure, accuracy drops to around 55–57% (cf. Manning 2011). This might seem an overly severe measure, but as Manning points out, even one incorrect tag in a sentence can seriously impact subsequent automatic parsing of that sentence. The concepts of precision and recall (cf. van Halteren 1999:82) are also relevant when evaluating an annotation scheme. Precision is the extent to which the retrieved objects in a query are correctly tagged, e.g., the extent to which the results from searching on the preposition tag consist, in fact, of prepositions. Recall describes the extent to which the objects matching the query include all the target objects in the corpus, e.g., the extent to which searching on a preposition tag successfully retrieves all the objects that the researcher would identify as prepositions. Recall and precision ratios for parts of speech in the British National Corpus have been calculated and show an overall precision rate of 96.25% and an overall recall rate of 98.85% (based on a 50,000-word sample of the whole corpus).8 Hempelmann et al.
(2006) investigated the accuracy of the Stanford Parser (Klein and Manning 2003), along with other parsers, where the accuracy measurements took into account both

8 See the section “POS-tagging Error Rates” in the BNC at http://ucrel.lancs.ac.uk/bnc2/bnc2error.htm (last accessed 25 May 2019) for information on recall and precision rates for individual parts of speech.

Table 2.2 Precision and recall of labeled constituents, based on the Stanford Parser

                   Precision   Recall
WSJ text             84.41     87.00
Expository text      75.38     85.12
Narrative text       62.65     87.56

tags and constituents. The authors reported not just on the accuracy achieved on a commonly used “Gold Standard” test corpus drawn from a parsed excerpt of the Wall Street Journal (WSJ) corpus, but also on a selection of narrative and expository scientific texts, including textbooks. The results are shown in Table 2.2, where one can see that there is a considerable drop in precision as one moves from the WSJ text to the other types of text. The accuracy rates reported above give some sense of what is possible with state-of-the-art automatic annotation. For one thing, the more fine-grained an annotation system, the more difficult it will be to achieve high accuracy. Clearly, the accuracy of a model will also vary depending on the type of text it is applied to, and even within formal written genres of English, accuracy of annotation can vary quite a bit, as seen in Table 2.2 above. The more a text differs, lexically and structurally, from the kind of text that the annotation model was originally trained on (most often, consciously planned written text, such as newspaper text), the more one can expect the accuracy to drop. Panunzi et al. (2004) highlight problems faced in the automatic annotation of Italian spontaneous speech, which included regional and dialectal words, onomatopoeia, pause fillers, etc., and report on the token accuracy of a selection of 3,069 tokens. Their adaptation of an existing annotation tool intended for written texts resulted in a token accuracy of 90.36%, compared with 97% when applied to written texts of official European Union documents. Learner corpora, where the users of language are still acquiring the lexicon and structures of a language, can be expected to present special difficulties when it comes to automatic annotation (see Thouësny 2011 for a combination of methods that achieve significant improvements in accuracy of annotation of a corpus of learners of French; see also Chap. 13).
Transcriptions of highly informal and fragmentary texts such as Twitter present specific challenges to annotation models (Derczynski et al. 2013; Eisenstein 2013; Farzindar and Inkpen 2015; cf. also the issue of ‘noisy’ data in web corpora generally in Chap. 15).
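The precision and recall measures discussed in this section can be computed directly once a manually corrected gold standard exists. The sketch below uses invented toy tag sequences to evaluate an automatic tagger's output for a single tag:

```python
# Precision and recall for one tag, comparing predicted tags against a
# manually corrected gold standard over the same token sequence.
def precision_recall(gold, predicted, tag):
    retrieved = [i for i, t in enumerate(predicted) if t == tag]
    relevant = [i for i, t in enumerate(gold) if t == tag]
    correct = [i for i in retrieved if gold[i] == tag]
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: the tagger over-applies IN (preposition) to one noun.
gold = ["PRP", "VBD", "IN", "DT", "NN", "IN", "NN"]
pred = ["PRP", "VBD", "IN", "DT", "IN", "IN", "NN"]
p, r = precision_recall(gold, pred, "IN")
```

Here two of the three retrieved IN tokens are genuine prepositions (precision 2/3), while both gold-standard prepositions are retrieved (recall 1.0), mirroring the definitions given above.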

2.2.6 Practicalities of Annotation

While the preceding discussion has focused on introducing different kinds of automatic annotation that are commonly assigned to linguistic corpora, it is also reasonable to ask how these methods can be applied in practice when creating a new corpus. At the outset, this involves making decisions as to what kinds of annotations should be added, what conventions should be followed for representing that information consistently, and what tools will be used to apply those conventions to corpus source materials. All of these decisions, along with the availability of existing conventions and annotation tools, can make a significant difference to the overall



process of annotation that follows. In the case of relatively well-resourced languages like English, for which many corpus annotation standards and tools exist, many common annotation tasks can be accomplished with ‘off-the-shelf’ software tools and minimal customization of annotation standards or procedures. In contrast, for many lesser-studied languages and varieties, these same tasks may require the development of annotation conventions that ‘fit’ the linguistic features of the source materials, as well as the implementation of these conventions in existing annotation tools, adding additional complexity to the overall corpus annotation workflow (see Sect. 2.2.3). In this section, we consider the example of applying POS tagging to a collection of unannotated corpus source materials. As mentioned in Sect. 2.2.1, this is a common task in corpus development, and one on which other forms of linguistic annotation (e.g., lemmatization, syntactic annotation) often rely. In the most straightforward case, it may be possible to use existing annotation software to automatically apply a given set of POS tagging conventions to corpus source materials. If one were aiming to create an English-language corpus of the novels of Charles Dickens (1812–1870), one might begin by retrieving plain-text copies of these works from Project Gutenberg,9 then load these unannotated sources into an annotation tool such as TagAnt (Anthony 2015), which provides a graphical user interface for the TreeTagger annotation package (Schmid 1994). Example (7) shows the first paragraph of Dickens’s 1867 novel Great Expectations before and after being loaded into TagAnt, which applies TreeTagger’s existing English POS annotation model and saves the annotated text for further use. (7) My father’s family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip. So, I called myself Pip, and came to be called Pip. 
My_PP$ father_NN ’s_POS family_NN name_NN being_VBG Pirrip_NN ,_, and_CC my_PP$ Christian_JJ name_NN Philip_NP ,_, my_PP$ infant_JJ tongue_NN could_MD make_VV of_IN both_DT names_NNS nothing_NN longer_RBR or_CC more_RBR explicit_JJ than_IN Pip_NP ._SENT So_RB ,_, I_PP called_VVD myself_PP Pip_NP ,_, and_CC came_VVD to_TO be_VB called_VVN Pip_NP ._SENT

Graphical tools such as TagAnt can be useful for exploring corpus annotation, particularly when creating smaller or specialized corpora for languages with existing annotation conventions. However, not all annotation procedures offer graphical user interfaces, and, for larger corpora, it may not be practical to manually load thousands (or millions) of corpus source documents into such an interface for annotation. In these cases, it is often necessary to integrate annotation as part of a larger, automatic workflow. This is sometimes accomplished by invoking individual annotation tools such as TreeTagger in scripts that contain a series of commands that carry out tasks

9 See http://www.gutenberg.org/ebooks/author/37. Accessed 25 May 2019.



for which general-purpose tools have not yet been designed (cf. Chap. 9). Example (8) presents an excerpt of one such script (written in the standard sh(1) command language available on Unix-like operating systems such as macOS and Linux), which calls on TreeTagger to automatically tag all of the given English-language source documents and save the resulting POS-tagged versions under the same name preceded by the label ‘TAGGED-’: (8)

for i in $*; do tree-tagger-english "$i" > "TAGGED-$i"; done

For many larger-scale or more complicated corpus construction projects, it is common for annotation to be implemented in custom software, which may in turn be integrated into larger ‘pipelines’ of natural language processing (NLP) tools that feed corpus source documents through successive stages of annotation and analysis (cf. Chap. 3).10 Libraries of common corpus annotation functions exist for many programming languages (e.g., NLTK for Python; Bird et al. 2009; Perkins 2014), which can greatly facilitate the process of developing custom corpus annotation procedures.

While this discussion has assumed that appropriate tagsets and tagging models exist for the language or variety represented in your corpus, in many cases, existing sets of POS categories may not be suitable for the data, or may not exist at all. In situations like these, it is possible to train some kinds of annotation software to apply new tagging conventions to your corpus sources. This involves manually annotating samples of corpus text as examples of how the new annotation categories should be assigned. These examples are then provided to training software that attempts to learn how these categories should be assigned to other texts, often by assessing the probability of particular sequences of words and annotations occurring together (e.g., is dog more likely to be a noun or a verb when preceded by the words the big?). Creating a new POS tagging model is often an iterative process: small samples of text are manually annotated and used as training data to create a provisional POS tagger, which then automatically annotates more texts. These texts are then reviewed and their POS tags corrected, creating more data for the next round of training and annotation (cf. Cox 2010). While “bootstrapping” a new POS tag assignment model in this way requires some effort, it allows these techniques to be extended to some languages and varieties for which POS-tagged corpora do not already exist.

Once POS annotations have been added to a collection of texts, there are several ways in which they can be used in queries. Some corpus search tools, such as
Once POS annotations have been added to a collection of texts, there are several ways in which they can be used in queries. Some corpus search tools, such as

10 Frameworks such as GATE (Cunningham et al. 2013), UIMA (Ferrucci and Lally 2004), and Stanford CoreNLP (Manning et al. 2014) provide support for creating complex sequences of text annotation and analysis which are particularly valuable for larger-scale corpus construction projects involving multiple forms of annotation. References to several of these frameworks are provided in the Resources section below.



AntConc (Anthony 2016), allow users to search for particular POS tags and tag sequences through a graphical user interface. In other cases, the regular expression search facilities integrated into many common programming languages and text editing tools can often be used to target particular tags without relying on specialized corpus linguistic software. For example, if we wanted to retrieve all of the nouns that appear immediately after the adjective brilliant in an English-language corpus with Penn Treebank-style POS tags, we could search the corpus using the regular expression “brilliant/JJ .*?/NNS?”. This would identify all of the instances in the corpus where the word brilliant has been tagged as an adjective (JJ) and is followed by some sequence of characters (.*?) that has been tagged as a noun (NNS?). More information about applying regular expression searches in corpus linguistic studies can be found in Chap. 9 as well as Weisser (2015, Chap. 6) and Gries (2017).
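The same kind of tag-sensitive search can be scripted outside a dedicated interface. The sketch below assumes a plain-text corpus in word/TAG format (a common but not universal convention; the sample text is invented) and uses a regular expression along the lines just described to collect the nouns following brilliant:

```python
import re

# A toy POS-tagged corpus in word/TAG format with Penn Treebank-style tags.
text = ("a/DT brilliant/JJ scientist/NN spoke/VBD of/IN "
        "brilliant/JJ ideas/NNS ./SENT")

# Match "brilliant" tagged JJ, then capture the next word tagged NN or NNS.
pattern = re.compile(r"brilliant/JJ (\S+)/NNS?\b")
nouns = pattern.findall(text)
```

Running this over the toy corpus retrieves scientist and ideas, i.e. both singular and plural nouns immediately following the tagged adjective.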

Representative Study 1

Newman, J. 2011. Corpora and cognitive linguistics. Brazilian Journal of Applied Linguistics 11(2): 521–559. Special Issue on Corpus Studies: Future Directions, ed. Gries, S.T.

To illustrate some advantages of a POS-tagged corpus, we draw upon Newman’s (2011) investigation into the lexical preferences for nouns in two English constructions: EXPERIENCE N and EXPERIENCE OF N, where N stands for a noun and the small caps for lemmas. What does a corpus teach us about the noun preferences in each of these two constructions – constructions which at first glance seem rather similar in the semantic relations implied? The analysis relies on the collostructional analysis approach pioneered by Stefanowitsch and Gries (2003), although we will not report all the details of the statistical computations here. For this study, Mark Davies’ POS-tagged online Corpus of Contemporary American English (COCA; Davies 2008–) was used. The corpus contained well over 400 million words at the time of the study, allowing for solid statistical calculations. All frequencies reported here are from the 2011 version of COCA. As an initial attempt to establish noun preferences in the two constructions, one could simply retrieve all instances of EXPERIENCE N and EXPERIENCE OF N from the corpus and identify the most frequently occurring nouns in each construction. We say “simply”, but this task is only simply carried out when the corpus is POS-tagged and one is able to search for the verb EXPERIENCE immediately followed by a noun. In the COCA interface the relevant search expression is: [experience].[vv*] [nn*].11 If the corpus were not POS-tagged, one would have to find some alternative way of retrieving the desired forms,

11 Lemma forms are retrieved by square brackets around the search term in the COCA interface.



e.g., by retrieving all tokens of any form of the lemma EXPERIENCE in their sentential context and then working manually through these results to isolate the verbal uses of EXPERIENCE followed by a noun. Given that there are more than 14,000 forms of EXPERIENCE in the corpus, this is not a very appealing task. Instead of relying solely on raw frequency of occurrence of nouns in each construction, however, it has become commonplace to invoke more sophisticated measures of attraction of words to a construction (see Chap. 7 for more information on collostructional analysis). Typically, these measures take into account frequencies in the whole corpus. The relevant statistic that we rely on for this more nuanced investigation into the preferences for the nouns (collexemes) in each of these constructions is called collostructional strength. Put simply, collostructional strength is a measure of the overuse or underuse of a word in a construction in light of expected frequencies. The higher the values, the greater the attraction of the collexeme to the construction.12 Table 2.3 shows the 20 nouns with the strongest collostructional strength in each of the two constructions. Looking at the top 20 collexemes of EXPERIENCE N in Table 2.3, it can be easily seen that the great majority of them are nouns which in fact share a kind of negative nuance: difficulty, depression, pain, stress, anxiety etc. Results for the collostructional analysis of the EXPERIENCE OF N construction, on the other hand, are quite different. The top 20 collexemes in this construction include a mixture of negatively nuanced concepts and more abstract, philosophical concepts, e.g., reality, oneness, modernity, transcendence, otherness. The collostructional profiles in Table 2.3 provide a sophisticated way of demonstrating the quite different types of nouns attracted to the two constructions and lend support to treating the two constructions as objects of study in their own right. 
In addition, this example shows how much more straightforward and replicable this analysis becomes once the analyst can rely on existing POS-tagging.
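For readers who want to see the computation behind collostructional strength, here is a minimal sketch. It computes a one-tailed Fisher-Yates exact p-value for a 2×2 contingency table from first principles and reports its negative log10, following the definition given in note 12. The cell frequencies below are invented toy numbers, not values from Newman (2011):

```python
from math import comb, log10

def fisher_p(a, b, c, d):
    """One-tailed Fisher-Yates exact test: probability of a co-occurrence
    count of a or higher, given fixed margins.
    a = noun in construction, b = other nouns in construction,
    c = noun elsewhere,       d = other nouns elsewhere."""
    n = a + b + c + d
    p = 0.0
    for k in range(a, min(a + b, a + c) + 1):
        p += comb(a + b, k) * comb(c + d, (a + c) - k) / comb(n, a + c)
    return p

def collostructional_strength(a, b, c, d):
    """Negative log10 of the Fisher-Yates p-value; higher = stronger attraction."""
    return -log10(fisher_p(a, b, c, d))

# Toy table: the noun occurs 3 times in the construction, once elsewhere.
strength = collostructional_strength(3, 1, 1, 5)
```

With realistic corpus frequencies the counts are far larger, and practical implementations use log-space arithmetic or a statistics library, but the measure itself is exactly this transformation of the exact-test p-value.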

12 To be more precise, the collostructional strength in Table 2.3 is the negative log10-values of the p-values of the Fisher-Yates exact test (cf. Chap. 7). A collostructional strength >3 is significant at the p < 0.001 level.

But I cannot accept it._ _

Early corpus architectures were aimed at capturing and separating word form tokens, using spaces between token units, often followed by a separator and

3 Note that although A/V signals in multimodal corpora logically precede their transcription, corpus architectures usually implement aligned A/V signals as annotations anchored to the transcription using timestamps. In other words, in much the same way as the POS tag ‘noun’ might apply to the position in the text of a word like ‘bread’, a recording of this word is also a type of datum that can be thought of as happening at the point in which ‘bread’ is uttered. In continuous primary data representations (see below), A/V timestamp alignment therefore ‘tiles’ the text (no span of audio is left without alignment to some token).

3 Corpus Architecture


annotations, as in (2), where a separator '/' marks the beginning of a POS tag (see also Chap. 2).

(2) Mark/NNP agreed/VBD ./SENT This/DT was/VBD ,/, then/RB ,/, the/DT end/NN ./SENT But/RB I/PRP can/MD not/RB accept/VB it/PRP ./SENT

For many linguistic research questions, the representation in (2) is adequate, for example for vocabulary studies: one can extract type/token ratios to study vocabulary size in different texts, find vocabulary preferences of certain authors, etc. However, for many other purposes, the loss of information about the original text from (1) is critical. To name but a few examples:

• Tokens with ambiguous spacing: both 'can not' and 'cannot' are usually tokenized as two units, but to study variation between these forms, one needs to represent whitespace somehow.
• Training automatic sentence/document/subsection splitters: position and number of spaces, as well as the presence of tab characters, are very strong cues for such programs. For example, TextTiling, a classic approach to automatic document segmentation, makes use of tabs as predictors (Hearst 1997).
• Stylometry and authorship attribution: even subtle cues found in whitespace can distinguish authors and styles. For example, US authors are much more likely to use double spaces after a sentence-final period than UK authors, and specific combinations of whitespace practices can sometimes uniquely identify authors (see Kredens and Coulthard 2012:506–507). The proportion of white space has also been used in authorship and plagiarism detection (Canales et al. 2011).

Whitespace and other features of the original primary data can therefore be important, and some corpus architectures employ formats which preserve and separate the underlying data from processes of tokenization and annotation, often using 'stand-off' XML formats.
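The difference between inline and stand-off annotation can be sketched in a few lines of Python. This is a toy illustration of the character-offset idea only, not any particular standard's format:

```python
# A minimal stand-off representation: the primary text is stored unchanged
# (whitespace and all), and each annotation layer refers to it only via
# character offsets, stored separately from the text itself.
text = "Mark agreed. This was, then, the end."

# (start, end, layer, value) tuples in a separate annotation store
annotations = [
    (0, 4, "pos", "NNP"),     # Mark
    (5, 11, "pos", "VBD"),    # agreed
    (11, 12, "pos", "SENT"),  # .
]

# the original text, including its exact whitespace, is always recoverable
for start, end, layer, value in annotations:
    print(repr(text[start:end]), layer, value)
```

Because the text is never rewritten, further layers (lemmas, entities, etc.) can be added later in additional files without touching either the text or the existing annotations.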
In stand-off formats, different layers of information are stored in separate files using a referencing mechanism which allows us, for example, to leave an original text file unchanged. One can then add e.g. POS annotations in a separate file specifying the character offsets in the text file at which the annotations apply (e.g. marking that a NOUN occurred between characters 4–10; see 'Tools and Resources' for more details).

A second important issue in representing language data is the tokenization itself, which requires detailed guidelines, and is usually executed automatically, possibly with manual correction (see Schmid 2008 for an overview). Although a working definition of 'tokens' often equates them with "words, numbers, punctuation marks, parentheses, quotation marks, and similar entities" (Schmid 2008:527), a more


[Figure 3.2 is an annotation grid showing, for examples (3)–(5) below, aligned rows for word forms ('wf'), minimal tokens ('tok'), normalized forms ('norm') and POS tags ('pos').]

Fig. 3.2 Tokenization, normalization and POS tags for word forms in (3)–(5)

precise definition of tokens is simply "the smallest unit of a corpus" (Krause et al. 2012:2), where units can also be smaller than a word, e.g. in a corpus treating each syllable as a token. In other words, tokens are minimal, indivisible or 'atomic' units, and any unit to which we want to apply annotations cannot be smaller than a token (see Representative Corpus 2).

In English, word forms and tokens usually coincide, and tokenization is closely related to prevalent part of speech tagging guidelines (the Penn tag set, Santorini 1990, and CLAWS, Garside et al. 1987, both ultimately going back to the Brown tag set, Kučera and Francis 1967; cf. also Chap. 2). However, modals, negations and other items which sometimes appear as clitics are normally tokenized apart, as in the clitics 'll and n't in (3) and (4). These are represented as separate in the 'tok' (token) rows of Fig. 3.2, but are fused on the 'wf' (word form) level. In (3), separating the clitic 'll allows us to tag it as a modal on the 'pos' layer (MD), just like a normal will. The other half of the orthographic sequence I'll is retained unproblematically as I. In (4), by contrast, separating the negation n't produces a segment wo, which is not a 'normal' word in English, but is nevertheless tagged as a modal.

(3) I'll do it
(4) I won't do it then

In order to make all instances of the lexical item will findable, some corpora rely on lemmatization (the lemma of all of these is will), while other corpora use explicit normalization. This distinction becomes more crucial in corpora with non-standard orthography, as in example (5), featuring the contraction I'm a (frequent in, but not limited to, African American Vernacular English, Green 2002:196).



(5) I'm a do it (i.e. I'm going to do it)

This last example clearly shows that space-delimited orthographic borders, tokenization, and annotations at the word form level may not coincide. To do justice to examples such as (5), a corpus architecture must be capable of mapping word forms and annotations to any number of tokens, in the sense of minimal units. In some cases these tokens may even be empty, as in the position following a in the 'tok' layer for (5) – what matters is not necessarily that 'tok' contains some segmentation of the text in 'wf', but rather that the positions and borders that are required for the annotation table are delimited correctly in order to allow the interpretation of a as corresponding to the 'norm' sequence going (tagged VBG) and to (tagged TO), assuming this is the desired analysis.4

For multimodal data in which speakers may overlap, the situation is even more complex, and an architecture completely separating the concepts of tokens as minimal units and word forms becomes necessary. An example is shown in Fig. 3.3. The example shows several issues: towards the end, two speakers overlap with word forms that only partially occur at the same time, meaning that borders are needed corresponding to these offsets; in the middle of the excerpt, there is a moment of silence, which has a certain duration; and finally, there is an extralinguistic event (a phone ringing) which takes place in part during speaker A's dialogue, and in part during the silence between speech acts.

An architecture using the necessary minimal units can still represent even this degree of complexity, provided that one draws the correct borders at the minimal transitions between events, and adds higher-level spans for each layer of information. In cases like these, the concept of minimal token is essentially tantamount to timeline indices, and if these have explicit references to time (as in the seconds and milliseconds in the 'time' layer of Fig.
3.3), then they can be used for A/V









[Figure 3.3 is a multi-layer annotation grid over a shared timeline: separate transcription and 'pos' layers for two overlapping speakers (including a token actually tagged RB), an 'events' layer with an extralinguistic event ([phone rings]), and a 'time' layer anchoring minimal units to timestamps such as 00:08 and 00:08.1.]

Fig. 3.3 Multiple layers for dialog data with a minimally granular timeline

4 Some architectures go even further and use an 'empty' token layer, using tokens solely as ordered positions or time-line markers, not containing text (e.g. the RIDGES corpus, Odebrecht et al. 2017, or REM, Klein and Dipper 2016). In such cases, tools manipulating the data can recover the covered text for each position from an aligned primary text.


A. Zeldes

signal alignment as well. More generally, an architecture of this kind, which is used by concrete speech corpus transcription tools such as ELAN (Brugman and Russel 2004) or EXMARaLDA (Schmidt and Wörner 2009), can also be thought of as an annotation graph (see Sect. 3.2.3).

A final consideration in cases such as these is the anchoring or coupling of specific layers of information in the data model: in the example above, the two 'pos' layers belong to the different speakers. A user searching for all word forms coinciding with a verbal tag in the corpus would be very surprised to find the word I, which might be found if all VBP tags coinciding with a word form are retrieved (since the second I overlaps with the other speaker's word know). What is meant in such situations is to only look at combinations of POS and word form information from either speaker A or speaker B. In other situations, however, one might want to look at any layers containing some speaker (e.g. search for anyone saying um), in which case some means of capturing the notion of 'any transcription layer' is required. These concepts of connecting annotation layers (posA belongs to spkA) and applying multiple segmentations to the data will be discussed below in the context of graph models for corpus annotations.
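The timeline idea can be made concrete in a few lines of code. The spans and indices below are hypothetical, loosely modeled on the phone-ringing scenario of Fig. 3.3: every layer anchors its spans to shared minimal timeline indices, so overlap can be computed across speakers and event layers.

```python
# Minimal timeline indices stand between every pair of adjacent events;
# each layer's spans are (start_index, end_index, content) triples.
layers = {
    "spkA":   [(0, 2, "I"), (2, 3, "know")],
    "spkB":   [(2, 4, "actually")],            # partially overlaps spkA's "know"
    "events": [(1, 4, "[phone rings]")],       # overlaps speech and silence
}

def overlapping(span_a, span_b):
    # two half-open index ranges overlap iff each starts before the other ends
    (s1, e1, _), (s2, e2, _) = span_a, span_b
    return s1 < e2 and s2 < e1

# find everything co-occurring with spkB's "actually"
target = layers["spkB"][0]
hits = [(name, span) for name, spans in layers.items() if name != "spkB"
        for span in spans if overlapping(span, target)]
print(hits)
```

Replacing the bare indices with timestamps (e.g. 00:08, 00:08.1) turns the same structure into an A/V-aligned model, since the indices then double as anchors into the recording.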

3.2.3 Data Models for Document Annotations

The central concern of annotations is 'adding interpretative, linguistic information to an electronic corpus' (Leech 1997:2), such as adding POS tags to word forms (see Chap. 2). However, as we have seen, one may also want to express relationships between annotations, grouping together multiple units into larger spans, building structures on top of these, and annotating them in turn. For example, multiple minimal tokens annotated as morphemes may be grouped together to delineate a complex word form, several such word forms may be joined into phrases or sentences, and each of these may carry annotations as well. Additionally, some annotations 'belong together' in some sense, for example by relating to the same speaker in a dialogue. If a document contains these kinds of data, the resulting structure is then no longer a flat table such as Fig. 3.2, but rather a graph with explicit hierarchical connections.

For planning and choosing a fitting corpus architecture, it is important to understand the components of annotation graphs at an abstract level, since even if individual XML formats under consideration for a corpus vary substantially (see Sect. 3.4), at an underlying level, the most important factor is which elements of an annotation graph they can or cannot represent. At its most general formulation, a graph is just a collection of nodes connected by edges: for example an ordered sequence of words, each word connected to the next, with some added nodes connected to multiple words (e.g. a sentence node grouping some words, or smaller phrase nodes). Often these nodes and edges will be annotated with labels, which usually have a category name and a value (e.g.
POS=NOUN); in some complex architectures, annotations can potentially include more complex data types, such as hierarchical feature structures, in which annotations can contain not only simple values, but also further nested annotations (see ISO 24612 for a standardized representation for such structures in corpus



markup).

Additionally, some data models add grouping mechanisms to annotation graphs, often referred to as 'annotation layers', which can be used to lump together annotations that are somehow related.5 Given the basic building blocks 'nodes', 'edges', 'annotations' and 'layers', there are many different constraints that can be imposed on the combinations of these elements. Some data models allow us to attach annotations only to nodes, or also to edges; some data models even allow annotations of annotations (e.g. Dipper 2005), which opens up the possibility of annotation sub-graphs expressing, for example, provenance (i.e. who or what created an annotation and when, see Eckart de Castilho et al. 2017) or certainty of annotations (e.g. an 'uncertain' label, or a numerical likelihood estimate of annotation accuracy). Another annotation model constraint is whether multiple instances of the same annotation in the same position are allowed (e.g. conflicting versions of the same annotation, such as multiple POS tags or even syntax trees, see Kountz et al. 2008). This can be relevant not only for fine-grained manual annotations, but also for the application and comparison of multiple automatic tools (several POS taggers, parsers, etc.). Layers too can have different constraints, including whether layers can be applied only to nodes, or also to edges and annotations, and whether layer-element mapping is 1:1 or whether an element can belong to multiple layers. Search engines sometimes organize visualizations by layers, i.e. using a dedicated syntax tree visualization for a 'syntax' layer, and other modules for annotations in other layers.

Basic annotation graphs, such as syntactically annotated treebanks, can be described in simple inline formats. However, as the corpus architecture grows more complex or 'multilayered', the pressure to separate annotations into different files and/or more complex formats grows. To see why, one can consider the Penn Treebank's (Marcus et al.
1993) bracketing format, which was developed to encode constituent syntax trees. The format uses special symbols to record not only the primary text, but also empty categories, such as pro (for dropped subject pronouns), PRO (for infinitive subjects), traces (for postulated movement), and more. In the following tree excerpt from the Wall Street Journal portion of the Penn Treebank, there are two 'empty categories', at the two next-to-last tokens: a zero '0' tagged as -NONE-, standing in for an omitted that (i.e. researchers said *that*), and a trace '*T*-2', indicating that a clause has been fronted (i.e. the text is "crocidolite is . . . resilient . . . , researchers said", which can be considered to be fronted from a form such as "researchers said the crocidolite . . . "):

5 In some formats, XML namespaces form layers to distinguish annotations from different inventories, such as tags from the TEI vocabulary (Text Encoding Initiative, http://www.tei-c.org/) (Accessed 28 May 2019) versus corpus-specific tags (see Höder 2012 for an example). A formal concept of layers to group annotations is provided in the Salt data model (Zipser and Romary 2010), and UIMA Feature Structure Types in the NLP tool-chain DKPro (Eckart de Castilho and Gurevych 2014). NLP tool chain components are often thought of as creating implicit layers (e.g. a parser component adds a syntactic annotation layer), see e.g. GATE Processing Resources or CREOLE Modules in GATE (Cunningham et al. 1997), Annotators components in CoreNLP (Manning et al. 2014) or WebLicht Components (Hinrichs et al. 2010).



( (S (S-TPC-2 (NP-SBJ … (NP (NN crocidolite) ) (, ,) ) (VP (VBZ is) (ADJP-PRD (RB unusually) (JJ resilient) ) … (, ,) (NP-SBJ (NNS researchers) ) (VP (VBD said) (SBAR (-NONE- 0) (S (-NONE- *T*-2) ))) (. .) ))

This syntax tree defines a hierarchically nested annotation graph, with nodes corresponding to the tokens and bracketing nodes, and annotations corresponding to parts of speech and syntactic category labels (NP, VP etc.). However, much of the information is rather implicit; the edges of the tree are marked by nested brackets: the NP dominates the noun 'crocidolite', etc. Annotations are pure value labels (VBD, VP etc.), and one must infer readings for their keys (POS, phrase category). Another 'edge', represented by co-indexing the trace with its location at S-TPC-2, depends on our understanding that *T*-2 is not just a normal token (marked only by a special POS tag -NONE-). This is especially crucial for the dropped 'that', since the number 0 can also appear as a literal token, for example in the following case, also from the Wall Street Journal section of the Penn Treebank:

(NP (NP (DT a) (NN vote) ) (PP (IN of) (NP (NP (CD 89) ) (PP (TO to) (NP (CD 0) )))))
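A bracketing of this kind can be read with a small recursive parser. The sketch below is illustrative rather than a full PTB reader; note that, exactly as discussed above, it treats empty-category leaves such as those tagged -NONE- like any other token unless special handling is added.

```python
import re

def parse_ptb(s):
    """Parse a Penn Treebank-style bracketing into nested (label, children)
    tuples; preterminals end in a list holding the word. A rough sketch,
    not a complete PTB reader (no handling of empty categories or traces)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)
    return node()

tree = parse_ptb("(NP (NP (DT a) (NN vote)) (PP (IN of) (NP (CD 89))))")
print(tree)
```

Reading the tree back is easy; what the nested-tuple result makes visible is how much stays implicit in the format: edges exist only as nesting, and labels are bare values whose keys (POS vs. phrase category) must be inferred.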



At the very latest, once information unrelated to the syntax tree is to be added to the corpus, such as temporal annotation, coreference resolution or named entity tags, multiple annotation files will be needed. In fact, the OntoNotes corpus (Hovy et al. 2006), which contains parts of the Penn Treebank extended with multiple additional layers, is indeed serialized in multiple files for each document, expressing unrelated or only loosely connected layers of annotation. A corpus containing unrelated layers in this fashion is often referred to as a 'multilayer' corpus, and data models and technology for such corpora are an active area of research (see Lüdeling et al. 2005; Burchardt et al. 2008; Zeldes 2017, 2018).

Because of the complexity inherent in annotation graphs, complex tools are often needed to annotate and represent multilayer data, and the choice of search and visualization tools with corresponding support becomes more limited (see Tools and Resources). In the case of formats for data representation, the situation is somewhat less critical, since, as already noted, different types of information can be saved in separate files. This also extends into the choice of annotation tools, as one can use separate tools, for example to annotate syntax trees, typographical properties of source documents, or discourse annotations. The greater challenge begins once these representations need to be merged. This is often only possible if tools ensuring consistency across layers are developed (e.g. the underlying text, and perhaps also tokenization, must be kept consistent across tools and formats). As a merged representation for complex architectures, stand-off XML formats are often used (see Sect.
3.3), and Application Programming Interfaces (APIs) are often developed in tandem with such corpora to implement validation, merging and conversion of separate representations of the same data (for example, the ANC Tool, used to convert data in the American National Corpus and its richly annotated subcorpus, MASC, Ide et al. 2010). For search and visualization of multilayer architectures, either a complex tool can be used, such as ANNIS (Krause and Zeldes 2016; see also Sect. 3.4), or a combination of tools is used for each layer. For example in the TXM text mining platform, Heiden (2010) proposes to use a web interface to query the Corpus Workbench (Christ 1994) for ‘flat’ annotations, TigerSearch (Lezius 2002) for syntax trees, and XQuery for hierarchical XML. The advantage of this approach is that it can use off-the-shelf tools for a variety of annotation types, and that it can potentially scale better for large corpora, since each tool has only a limited share of the workload. The disadvantage is that a data model merging results from all components can only be generated after query retrieval has occurred in each component. This prevents complex searches across all annotation layers: for example, it is impossible to find sentences with certain syntactic properties, such as clefts, which also contain certain XML tags, such as entity annotations denoting persons, and also have relational edges with components of other sentences, such as coreference with a preceding or following entity annotation. These kinds of combinations can be important for example for studying the interplay between syntax and semantics, especially at the levels of discourse and pragmatics: for example, to predict variables such as word order



in practice, we must often be aware of the grammatical functions, semantic roles, information status of entity mentions, degree of formality and politeness, and more.

Representative Corpus 1 The GUM corpus

The Georgetown University Multilayer corpus (GUM, Zeldes 2017) is a freely available corpus of English Web genres, created using 'class-sourcing' as part of the Linguistics curriculum at Georgetown University. The corpus, which is expanded every year and currently contains over 129,000 tokens, is collected from eight open access sources: Wikinews news reports, biographies, fiction, reddit forum discussions, Wikimedia interviews, wikiHow how-to guides and Wikivoyage travel guides. Its architecture can therefore be considered to follow the common tree-style macro-structure with eight subcorpora, each containing simple, unaligned documents.

The complexity of the corpus architecture results from its annotations: as the data is collected, student annotators iteratively apply a large number of annotation schemes to their data using different formats and tools, including document structure in TEI XML, POS tagging, syntactic parsing, entity and coreference annotations and discourse parses in Rhetorical Structure Theory. The complete corpus covers over 50 annotation types (see http://corpling.uis.georgetown.edu/gum/). A single tokenized word in GUM therefore often carries an annotation graph of dozens of nodes and annotations, illustrated using only two tokens from the corpus in Fig. 3.4, which shows the two tokens I know.

Fig. 3.4 Annotation graph for the tokens I know in an interview from GUM



At an abstract level, the interconnected units in Fig. 3.4 all represent nodes in an annotation graph. The two solid rectangles at the bottom of the image are special nodes representing actual tokens from a text: they are unique in that they carry not only a number of annotations, but also references to primary text data (I and know in bold at the top of each box). Their annotations include two distinct POS tags using different tag sets (Penn Treebank and CLAWS), as well as lemmas. These token nodes also function as anchors for the remaining nodes in the graph: every other node in the figure is attached directly or indirectly to the tokens via edges. Layers are represented by dashed rectangles: in this case, each annotation belongs to exactly one layer (rectangles do not overlap). Node annotations are represented as key-value pairs with an ‘=’ sign (e.g. entity = person), while edge annotations look the same but are given in italics next to the edge they annotate. For example, there is a dependency edge connecting the two tokens and carrying a label (func = nsubj, since I is the nominal subject of know), belonging to a layer ‘dependencies’. The single node in the layer ‘sentence types’ above the tokens is annotated as s_type = decl (declarative sentence), and is attached to both tokens, but the edges attaching it are unannotated (no labels). Finally, some layers, such as ‘constituent syntax’, contain a complex subgraph: an NP node is attached to the token I, and a VP node is attached to know, and together they attach to the S node denoting the clause. Similarly, the ‘discourse’ layer, of which we only see one incoming edge, is the entry point into the discourse annotation part of the graph, which places multiple tokens in segments, and then constructs a sub-graph made of sentences and clauses based on Rhetorical Structure Theory (RST, Mann and Thompson 1988). 
The edge is annotated as ‘relname = background’, indicating this clause gives background information for some other clause (not pictured). Note that it is the corpus designer’s decision which elements are grouped in a layer. For example, the constituent S representing the clause has a similar meaning to the sentence annotation in the ‘sentence types’ layer, but these have been modeled as separate. As a result, it is at least technically possible for this corpus to have conflicting constituent trees and sentence span borders for sentence type annotation. If these layers are generated by separate automatic or manual annotation tools, then such conflicts are in fact likely to occur over the course of the corpus. Similarly, a speaker annotation (‘sp_who’) is attached to both tokens, as is the sentence annotation, but it is conceivable that these may conflict hierarchically: a single sentence annotation may theoretically cover tokens belonging to different speakers, which may or may not be desirable (e.g. for annotating one speaker completing another’s sentence). The graph-based data model allows for completely independent annotation layers, united only by joint reference to the same primary text.
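The building blocks discussed in this section (token nodes carrying primary text, structural nodes, labeled edges, layers) can be sketched as a minimal data structure. The classes below are purely illustrative and do not follow any specific tool's API; the labels echo the I know example of Fig. 3.4.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    annotations: dict = field(default_factory=dict)  # key-value pairs, e.g. pos=PRP
    layer: str = ""
    text: str = ""  # only token nodes carry primary text

@dataclass
class Edge:
    source: Node
    target: Node
    annotations: dict = field(default_factory=dict)
    layer: str = ""

# two token nodes anchoring the rest of the graph
i = Node({"pos": "PRP", "lemma": "I"}, layer="tokens", text="I")
know = Node({"pos": "VBP", "lemma": "know"}, layer="tokens", text="know")

# an annotated dependency edge, plus an unannotated sentence-type node
dep = Edge(know, i, {"func": "nsubj"}, layer="dependencies")
sent = Node({"s_type": "decl"}, layer="sentence types")
edges = [dep,
         Edge(sent, i, layer="sentence types"),
         Edge(sent, know, layer="sentence types")]

# e.g. retrieve the tokens covered by the declarative sentence node
covered = [e.target.text for e in edges if e.source is sent]
print(covered)
```

Because layers are just labels on nodes and edges, nothing in this structure forces the 'sentence types' span and a constituent S node to agree, which is exactly the kind of conflict the GUM discussion above anticipates.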




Representative Corpus 2 MERLIN

The MERLIN project (Multilingual Platform for European Reference Levels: Interlanguage Exploration in Context, Boyd et al. 2014) makes three learner corpora available in the target languages Czech, German and Italian, which are richly annotated and follow a comparable architecture to allow for cross-target language and native language comparisons. The project was conceived to collect, study and make available learner texts across the levels of the Common European Framework of Reference for Languages (CEFR), which places language learners at levels ranging from A1 (also called 'Breakthrough', the most basic level) to C2 ('Mastery'). Although these levels are commonly used in language education and studies of second language acquisition, it is not easily possible to find texts coming from these levels, neither for researchers nor for learners looking for reference examples of writing at specific levels. The MERLIN corpora fill this gap by making texts at the A1-C1 levels publicly available in the three target languages above.

To see how the MERLIN corpora take advantage of their architecture in order to expose learner data across levels, we must first consider how users may want to access the data, and what the nature of the underlying primary textual data is. On one level, researchers, language instructors and other users would like to be able to search through learner data directly: the base text is, trivially, whatever a learner may have written. However, at the same time the problems discussed in Sect. 3.2.2 make searching through non-native data, which potentially contains many errors,6 non-trivial. For example, the excerpt from one Italian text in (6) contains multiple errors where articles should be combined with prepositions: once, da 'from' is used without an article in da mattina 'from (the) morning' for dalla 'from the (feminine)', and once, the form da is used instead of dal 'from the (masculine)'.
The data comes from a Hungarian native speaker, rated at an overall CEFR ranking of B2, as indicated by document metadata in the corpus.

(6) Da   mattina al     pomerrigio? Da   prossima mese  posso   lavorare?
    from morning to.the afternoon?  from next     month can.1SG work?
    'From morning to the afternoon? From next month I can work?'

6 This is not to say that native data does not contain errors from a normative perspective, and indeed some corpora, such as GUM in Representative Corpus 1, do in fact annotate native data for errors.



This data is invaluable to learners and educators interested in article errors. However, users interested in finding all usages of da in the L2 data will not be able to distinguish correct cases of da from cases that should have dal or dalla. At the same time, less obvious errors may render some word forms virtually unfindable. For example, the word pomeriggio 'afternoon' is misspelled in this example, and reads pomerrigio (the 'r' doubled, the 'g' not). As a result, users who cannot guess the actual spelling of words they are interested in will not be able to find such cases.

In order to address this, MERLIN includes layers of target hypotheses (TH, see Reznicek et al. 2013). These provide corrected versions of the learner texts: at a minimum, all subcorpora include a span annotation called TH1, which gives a minimally grammatical version of the learner utterance, correcting only as much as necessary to make the sentence error-free, but without improving style or correcting for meaning.7 Figure 3.5 shows the learner utterances on the 'learner' layer, while the TH1 layer shows the minimal correction: preposition+article forms have been altered, and a word-order error in the second utterance has been corrected (the sentence should begin Posso lavorare 'can I work'). The layer TH1Diff further notes where word form changes have occurred (the value 'CHA'), or where material has been moved, using 'MOVS' (moved, source) and 'MOVT' (moved, target).8 These 'difference tags' allow users to find all cases of discrepancies between the learner text and TH1 without specifying the exact forms being searched for.
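A flat approximation of such difference tags can be derived by standard sequence alignment. The sketch below uses Python's difflib and mirrors the CHA/DEL/INS labels; movement tags (MOVS/MOVT) would require extra logic and are not derived here, and the TH1 string is made up to match the corrections described above, not taken from MERLIN itself.

```python
import difflib

def th_diff(learner, th1):
    """Align learner tokens with a target hypothesis and emit flat
    difference tags: CHA (changed), DEL (deleted), INS (inserted)."""
    tags = []
    sm = difflib.SequenceMatcher(a=learner, b=th1)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            tags.append(("CHA", learner[i1:i2], th1[j1:j2]))
        elif op == "delete":
            tags.append(("DEL", learner[i1:i2], []))
        elif op == "insert":
            tags.append(("INS", [], th1[j1:j2]))
    return tags

learner = "Da mattina al pomerrigio".split()
th1 = "Dalla mattina al pomeriggio".split()  # hypothetical minimal correction
print(th_diff(learner, th1))
```

This recovers the article error (da for dalla) and the misspelling without the user having to specify either form, which is exactly what the TH1Diff layer enables in queries.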

Fig. 3.5 Annotation grid for a learner utterance, with target hypothesis (TH) and error annotations, visualized using ANNIS


7 See Reznicek et al. (2012) for the closely related Falko corpus of L2 German, which developed minimal TH annotation guidelines. Like Falko, a subset of MERLIN also includes an 'extended' TH layer, called TH2, on which semantics and style are also corrected. A closely related concept to TH construction which is relevant to historical corpora is that of normalization: non-standard historical spellings can also be normalized to different degrees, and similar questions about the desired level of normalization often arise.

8 Further tags include 'DEL' for deleted material, and 'INS' for insertions.



One consequence of using a TH layer for the architecture of the corpus is that the data may now in effect have two conflicting tokenizations: on the 'learner' layer, the first '?' and the second 'Da' stand at adjacent token positions; on the TH1 layer, they do not. To make it possible to find '?' followed by 'Da' in this instance, while ignoring or including TH layer gaps, MERLIN's architecture explicitly flags these annotation layers as 'segmentations', allowing a search engine to use either one for the purpose of determining adjacency as well as context display size (i.e. what to show when users request a window of +/− 5 units).

One shortcoming of TH annotations is that they cannot generalize over common error types which are of interest to users: for example, they do not directly encode a concept of 'article errors'. To remedy this, MERLIN includes a wide range of error annotations, with a major error-annotation category layer (EA_category, e.g. G_Morphol_Wrong for morphological errors), and more fine-grained layers, such as G_Morphol_Wrong_type. The latter indicates a 'gender' error on prossima 'next (feminine)' in Fig. 3.5, which should read prossimo 'next (masculine)' to agree with mese 'month'. Note however that the architecture allows multiple conflicting annotations at the same position: two 'EA_category' annotations overlap under prossima, indicating the presence of two concurrent errors, and there is no real indication, except for the length of the span, that the 'gender' error is somehow paired with the shorter EA_category annotation. Additionally, the EA layers cannot encode all foreseeable errors of interest: for example, there is no specific category for cases where da should be dal (but not dalla). This type of query can only be addressed using the running TH layer.9

Finally, it should be noted that both tokens and annotations, including TH layers, can be used as entry points for more complex annotation graphs.
In the case of MERLIN, an automatically generated dependency syntax parse layer was added on top of the learner layer, as shown in Fig. 3.6. If the corpus architecture has successfully expressed all annotations including the parse in a single graph, then it is possible to query syntax trees in conjunction with other layers. For example we can obtain syntactic information, such as the most common grammatical functions and distance between words associated with movements (MOVS/MOVT) across gaps on the TH layer. This would not be possible if TH analysis had been implemented

9 A more minimal type of TH analysis is also possible, in which only erroneous tokens are given a correction annotation (see e.g. Tenfjord et al. 2006 for a solution using TEI XML). A limitation of this approach is that the TH layer itself cannot be annotated as a complete independent text (e.g. to compare POS tag distributions in the original and TH text), and that gaps of the type seen in Fig. 3.5 cannot be represented.



[Figure 3.6 shows a dependency tree drawn over the learner tokens of example (6), with relation labels such as adpmod, adpobj and p.]

Fig. 3.6 Dependency parse attached to the annotations of example (6) in MERLIN

in separate files, without consideration for the alignment of each annotation’s structures or the handling of gaps and segmentation conflicts. Similar additional graphs would also be conceivable, for example to link specific MOVS and MOVT locations, but these have not yet been implemented – TH1Diffs are currently expressed as flat annotations whose interconnections are left unexpressed in the data model.

3.3 Critical Assessment and Future Directions

At the time of writing, corpus practitioners are in the happy position of having a wide range of choices for concrete corpus representation formats and tools. However, few tools or formats can do ‘everything’, and more often than not, the closer they get to this ideal, the less convenient or optimized they are for any one task. To recap some important considerations in choosing a corpus architecture and a corresponding concrete representation format:

• Is preservation of the exact underlying text (e.g. whitespace preservation) important?
• Are annotations very numerous, or do they involve conflicting spans, to the extent that a stand-off format is needed?
• Are annotations arranged in mutually exclusive spans? Are they hierarchically nested? Are discontinuous annotations required?
• Are complex metadata management and subcorpus structure needed, or can this information be saved separately in a simple table?
• Does the data contain A/V signals? If so, are there overlapping speakers in dialogue?
• Is parallel alignment needed, i.e. a parallel corpus?


A. Zeldes

These questions are important to address, but the answers are not always straightforward. For example, one can represent ‘discontinuous’ annotations slightly less faithfully by making two annotations with some co-indexed naming mechanism (cf. MOVS and MOVT in Representative Corpus 2). This may be unfaithful to our envisioned data model, but will greatly broaden the range of tools that can be used. In practice, a large part of the choice of corpus architecture is often dictated by the annotation tools that researchers wish to use, and the properties of their representation formats. Using a more convenient tool and compromising the data model can be the right decision if this compromise does not hurt our ability to approach our research questions or applications. For example, many spoken corpora containing dialogue do not model speaker overlap, instead opting to place overlapping utterances in the order in which they begin. This can be fine for some research questions, for example for a study on word formation in spoken language; but not for others, e.g. for pragmatic studies of speech act interactions in dialogue.

Table 3.1 gives a (non-exhaustive) overview of some popular corpus formats and their coverage in terms of the properties discussed above. A good starting point when looking to choose a format is to use this table or construct a similar one, note supported and unsupported features, and rule out formats that are not capable of representing the desired architectural properties.

CoNLL-U is a popular format in the family of tab-delimited CoNLL formats, which is used for dependency treebanks in the Universal Dependencies project (http://universaldependencies.org/). It is enriched with ‘super-token’-like word forms (i.e. multi-token orthographic word forms such as ‘I’m’), open-ended key-value pairs on tokens, and sentence-level annotations as key-value pairs.
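The CoNLL-U conventions just described can be made concrete with a small parsing sketch. The fragment below is a constructed example (not from any real treebank): ten tab-separated fields per token line, ‘#’ comment lines carrying sentence-level key-value metadata, and a range id like ‘1-2’ marking a multi-token orthographic word.

```python
# Parse a minimal, constructed CoNLL-U fragment into metadata,
# token rows, and 'super-token' (multiword) rows.
conllu = """\
# sent_id = 1
# text = I'm here
1-2\tI'm\t_\t_\t_\t_\t_\t_\t_\t_
1\tI\tI\tPRON\tPRP\t_\t3\tnsubj\t_\t_
2\t'm\tbe\tAUX\tVBP\t_\t3\tcop\t_\t_
3\there\there\tADV\tRB\t_\t0\troot\t_\t_
"""

FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats",
          "head", "deprel", "deps", "misc"]

meta, tokens, supertokens = {}, [], []
for line in conllu.splitlines():
    if line.startswith("#"):
        key, _, val = line[1:].partition("=")
        meta[key.strip()] = val.strip()
    elif line.strip():
        row = dict(zip(FIELDS, line.split("\t")))
        # range ids like '1-2' mark multi-token orthographic word forms
        (supertokens if "-" in row["id"] else tokens).append(row)

print([t["form"] for t in tokens])   # ['I', "'m", 'here']
print(supertokens[0]["form"])        # I'm
print(meta["text"])                  # I'm here
```

Note how the ‘super-token’ carries the surface form while the syntactic annotation attaches to the split tokens beneath it; this is exactly the tension between orthographic and analytical tokenization discussed earlier.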
The CWB vertical format (sometimes also called ‘TreeTagger format’, due to its compatibility with the tagger by Schmid 1994) is an SGML format with one token per line, accompanied by tab-delimited token annotations, and potentially conflicting, but not hierarchically nested, element spans.

Elan and EXMARaLDA are two popular grid-based annotation tools, which do not necessarily model a token concept, instead opting for unrestricted layers of spans, some of which can be used to transcribe texts, while others express annotations. They offer excellent support for aligned A/V data and model a concept of potentially multiple speakers, complete with speaker-related metadata, which makes them ideal for dialogue annotation.

FoLiA, GrAF and PAULA XML are all forms of graph-based stand-off XML formats, though FoLiA’s implementation is actually contained in a single XML file, with document-internal references. GrAF has the status of an ISO standard (ISO 24612), and has been used to represent the American National Corpus (https://www.anc.org/). FoLiA has the advantage of offering a complete annotation environment (FLAT, http://flat.science.ru.nl/), though PAULA and GrAF can be edited using multi-format annotation tools such as Atomic (http://corpus-tools.org/atomic/). PAULA is the only format of the three which implements support for parallel corpora and overlapping speakers.
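The ‘one token per line’ idea of the CWB vertical format can be illustrated with a short generation sketch. The token annotations (POS and lemma columns) are a constructed example; the key structural points are the tab-delimited columns and the flat SGML-style span elements, which may conflict but carry no node hierarchy.

```python
# Build a tiny CWB/'TreeTagger'-style vertical file: one token per line,
# tab-delimited token annotations, SGML-style span elements around them.
tokens = [
    ("She",   "PRON",  "she"),
    ("sings", "VERB",  "sing"),
    (".",     "PUNCT", "."),
]

lines = ['<text id="t1">', "<s>"]
lines += ["\t".join(tok) for tok in tokens]
lines += ["</s>", "</text>"]
vertical = "\n".join(lines)
print(vertical)
```

A real CWB corpus would be indexed from such files with the CWB encoding tools; the point here is only the shape of the data model: token-level columns plus flat spans, with no way to express trees.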

Table 3.1 Data model properties for a range of open corpus formats

Format        Whitespace  Standoff  Hierarchy  Confl. spans  Discontinuous  Parallel  Dialogue overlap  Metadata  Subcorpora  Multimodal
CoNLL-U       Yes         No        dep(a)     No            No             No        No                No        No          No
CWB vertical  No          No        No         Yes           No             Yes       No                Yes       No          No
Elan          Yes         Inline    No         Yes           No             No        Yes               Yes       Yes         Yes
EXMARaLDA     Yes         Inline    No         Yes           No             No        Yes               Yes       Yes         Yes
FoLiA         Yes         Inline    Yes        Yes           Yes            No        No                Yes       Yes         No
GrAF          Yes         Yes       Yes        Yes           Yes            No(b)     No(b)             Yes       Yes         No
PAULA XML     Yes         Yes       Yes        Yes           Yes            Yes       Yes               Yes       Yes         Yes
PTB           No          No        Yes        No            No             No        No                No        No          No
TCF           Yes         Inline    Yes        Yes           Yes            No        No                Yes       No          No
TEI XML       Yes         Yes(c)    Yes        No(c)         No(c)          Yes       Yes               Yes       Yes         Yes
TigerXML      No          No        Yes        No            Yes            No(d)     No                Yes(d)    Yes         No
tiger2        Yes         Yes       Yes        Yes           Yes            No        No                Yes       Yes         No
WebAnno TSV   Yes         Inline    dep(a)     Yes           No             No        No                No        No          No

(a) The value ‘dep’ indicates formats with some capacity to express dependency edges between flat units (including, e.g. syntactic dependency or coreference annotation), but without complex node hierarchies.
(b) While GrAF does not explicitly support multiple overlapping speakers or parallel corpora, there are some conceivable ways of representing these using the available graph structure. However, I am not aware of any corpus or tool implementing these with GrAF.
(c) Stand-off annotation has been implemented in TEI XML (see Chapter 20.5 of the TEI P5 guidelines, http://www.tei-c.org/) (accessed 28 May 2019) and can cover a wide range of use cases for discontinuous annotations and hierarchy conflicts. However, it is not frequently used in the TEI community, and there are some limitations (see Bański 2010 for analysis).
(d) TigerXML itself does not implement parallel alignment, but an extension format known as STAX has been developed for parallel treebanking in the Stockholm TreeAligner (Lundborg et al. 2007). Metadata in TigerXML is limited to a predetermined set of fields, such as ‘author’, ‘date’ and ‘description’.



Penn Treebank bracketing (PTB), TigerXML and tiger2 are formats specializing in syntax annotation (treebanks). The PTB format is the most popular way of representing projective constituent trees (no crossing edges) with single node annotations (part of speech or syntactic category). It is highly efficient and readable, but has some limitations (see the ‘crocidolite’ example above). TigerXML is a more expressive XML format, capable of representing multiple node annotations, crossing edges, edge labels and two distinct types of edges. The tiger2 format (Romary et al. 2015) is an extension of TigerXML, outwardly very similar in syntax, but with unlimited edge typing, metadata, multiple/conflicting graphs per sentence and other more ‘graph-like’ features. It enjoys ISO standard status (ISO 24615).

TCF (Hinrichs et al. 2010) is an exchange format used by the CLARIN infrastructure, and in particular the WebLicht NLP toolchain. It is highly expressive for a closed set of multilayer annotations, and has built-in concepts for tokenization, sentence segmentation, syntax and entity annotation. It is also one of the supported formats of the popular WebAnno online annotation tool (Yimam et al. 2013), which also supports a variety of formats of its own, including its highly expressive UIMA-based format (serializable as an ‘inline stand-off’ XMI format), and a whitespace-preserving tab-delimited export, called WebAnno TSV.

An important trend in corpus building tools looking forward is a move away from saving and exchanging data in local files on annotators’ computers or private servers. Corpora are increasingly built using public, version-controlled repositories on platforms such as GitHub. For example, the Universal Dependencies project is managed entirely on GitHub, including publicly available data in multiple languages and the use of GitHub pages and issue trackers for annotation guidelines and discussion. Some tools (e.g. 
the online XML and spreadsheet editor GitDox, Zhang and Zeldes 2017) are opting for online storage on GitHub and similar platforms as their exclusive file repository. In the future we will hopefully see increasing openness and interoperability between tools which adopt open data models and best practices that allow users to benefit from and re-use existing data and software.
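The Penn Treebank bracketing discussed above can be made concrete with a minimal reader sketch. The tree below is a constructed example; the parser handles only the core bracketing scheme (one label per node, terminals as bare words), not PTB extensions such as traces or functional tags.

```python
# A minimal reader for Penn-Treebank-style bracketing: each node is
# represented as (label, children), with terminal words as plain strings.
import re

def parse_ptb(s):
    toks = re.findall(r"\(|\)|[^\s()]+", s)
    def read(pos):
        assert toks[pos] == "("
        label = toks[pos + 1]
        pos += 2
        children = []
        while toks[pos] != ")":
            if toks[pos] == "(":
                child, pos = read(pos)     # recurse into a nonterminal
                children.append(child)
            else:
                children.append(toks[pos]) # terminal word
                pos += 1
        return (label, children), pos + 1
    tree, _ = read(0)
    return tree

tree = parse_ptb("(S (NP (PRP She)) (VP (VBZ sings)))")
print(tree[0])     # S
print(tree[1][0])  # ('NP', [('PRP', ['She'])])
```

The simplicity of this reader reflects why the format is so popular; equally, the data structure shows its limits: there is nowhere to put crossing edges, edge labels, or multiple annotations per node, which is what TigerXML and tiger2 add.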

3.4 Tools and Resources

An important set of tools influencing the choice of corpus architecture is NLP pipelines and APIs, which allow users to construct automatically tagged and parsed representations with complex data models (and these can be manually corrected if needed). Some examples include Stanford CoreNLP (Manning et al. 2014), Apache OpenNLP (https://opennlp.apache.org/) (accessed 28 May 2019), spaCy (https://spacy.io/) (accessed 28 May 2019), the Natural Language Toolkit (NLTK, http://www.nltk.org/) (accessed 28 May 2019), GATE (Cunningham et al. 1997), DKPro (Eckart de Castilho and Gurevych 2014), NLP4J (https://emorynlp.github.io/nlp4j/) (accessed 28 May 2019) and FreeLing (http://nlp.cs.upc.edu/freeling/) (accessed 28 May 2019).



The output formats of NLP tools are often not compatible with corpus search architectures, and may not be readily human-readable (for example, .json files offer very efficient storage, but are only meant to be machine readable). For this reason, NLP tool output must often be converted into corpus formats such as those in Table 3.1. Versatile conversion tools, such as Pepper (http://corpus-tools.org/pepper/) (accessed 28 May 2019), can be used to convert between a variety of formats and make data accessible to a wider range of tools. Another important feature supported by tools such as Pepper is merging data from several formats into a format capable of expressing the multiple streams of input data. Using a merging paradigm makes it possible to build corpora that require some advanced features (e.g. conflicting spans, or multimodal time alignment), which are not available simultaneously in the tools we wish to use, but can be represented separately in a range of tools, only to be merged later on. For example, the GUM corpus described above is annotated using five different tools which are optimized to specific tasks, and the merged representation is created automatically (this is sometimes called a ‘build bot’ strategy; for an example see https://corpling.uis.georgetown.edu/gum/build.html) (accessed 28 May 2019).

Finally, corpus architecture considerations also interact with the choice of search and visualization facilities that one intends to use. Having an annotation tool which supports a complex data model may be of little use if the annotated data cannot be accessed and used in sensible ways later on. Some corpus practitioners use scripts, often in Python or R, to evaluate their data, without using a dedicated search engine (see Chap. 9). While this approach is very versatile, it is also labor intensive: for each new type of information, a new script must be written which traverses the corpus in search of some information. 
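The merging paradigm can be sketched in its simplest form. The two annotation layers below are assumed outputs of two hypothetical tools run over the same tokenized text; merging aligns them on the shared tokenization into one multilayer record per token. (Real merging tools such as Pepper must additionally reconcile tokenization conflicts and span structures, which this sketch deliberately omits.)

```python
# Merge two independently produced token-level annotation layers
# (assumed tool outputs) over a shared tokenization.
tokens       = ["Kim", "visited", "Berlin"]
pos_layer    = ["PROPN", "VERB", "PROPN"]   # from hypothetical tool A
entity_layer = ["person", None, "place"]    # from hypothetical tool B

merged = [
    {"form": f, "pos": p, "entity": e}
    for f, p, e in zip(tokens, pos_layer, entity_layer)
]
print(merged[2])  # {'form': 'Berlin', 'pos': 'PROPN', 'entity': 'place'}
```

The design choice worth noting is that the shared tokenization acts as the alignment key; as soon as the tools tokenize differently (the MERLIN case above), merging requires a richer anchor such as character offsets or a shared timeline.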
It is therefore often desirable to have a search engine that is capable of extracting data based on a simple query. For corpora that are meant to be publicly available to non-expert users, this is a necessity. In public projects, a proprietary search engine tailored explicitly for a specific corpus is often programmed, which cannot easily be used for other corpora. Here I therefore focus on generic, freely available tools which can be used for a variety of datasets. The Corpus Workbench (Christ 1994) and its web interface CQPweb (Hardie 2012) are amongst the most popular tools for corpus search and visualization, but are not capable of representing hierarchical data, and therefore they cannot be used for treebanks. Grid-like data, e.g. from EXMARaLDA or Elan files, can be indexed for search using EXMARaLDA’s search engine, EXAKT (http://exmaralda.org/en/exakt-en/) (accessed 28 May 2019). For treebanks, there are some local user tools (e.g. TigerSearch, Lezius 2002, or command-line tools such as TGrep2 (http://tedlab.mit.edu/~dr/Tgrep2/) (accessed 28 May 2019), the successor of the original Penn Treebank tool, or Stanford’s Tregex (https://nlp.stanford.edu/software/tregex.shtml) (accessed 28 May 2019)). There are only a few dedicated web interfaces for treebanks, notably Ghodke and Bird’s (2010) highly efficient Fangorn (for projective, unlabeled constituent trees), and TüNDRA, the Tübingen aNnotated Data Retrieval Application, for TigerXML style trees and dependency trees (Martens 2013). For small- to medium-sized multilayer corpora, with syntax trees, entity and coreference annotation, discourse parses and more, ANNIS (http://corpus-tools.org/



annis/) (accessed 28 May 2019) offers a comprehensive solution supporting highly complex graph queries over hierarchies, conflicting spans, aligned A/V data and parallel corpora. For larger datasets, KorAP (Diewald et al. 2016) presents a search engine supporting a substantial subset of graph relations, accelerated for text search using Apache Lucene (https://lucene.apache.org/) (accessed 18 June 2019).

Further Reading

McEnery, T., Xiao, R., and Tono, Y. 2006. Corpus-Based Language Studies: An Advanced Resource Book. (Routledge Applied Linguistics.) London: Routledge.
This resource book contains both important readings in corpus linguistics and practical guides to corpus compilation and research. It forms a cohesive introduction and is a useful starting point for newcomers to the discipline. Chapters 3 and 4 in part one of the book offer a good brief overview of issues in corpus mark-up and annotation that relate to corpus architecture, and the third part of the book explores practical case studies with real data that can help familiarize readers with some fundamental examples of corpus architectures.

Lüdeling, A., and Kytö, M. (eds.) 2008–2009. Corpus Linguistics. An International Handbook. (Handbooks of Linguistics and Communication Science 29.) Berlin: Mouton de Gruyter.
A comprehensive, two-volume overview of topics in Corpus Linguistics written by a range of experts. The chapters on corpus types, including Speech, Multimodal, Historical, Parallel and Learner Corpora offer good overviews of issues in specific corpus type architectures, and the chapters on Annotation Standards and Searching and Concordancing should be of interest with respect to corpus architectures as well.

Kübler, S., and Zinsmeister, H. 2015. Corpus Linguistics and Linguistically Annotated Corpora. London: Bloomsbury.
This book gives a comprehensive overview of many aspects of complex annotated corpora, including data models and corpus query languages for treebanks and multilayer corpora. A particular focus on concrete corpora and tools makes it a useful practical introduction.

Zeldes, A. 2018. Multilayer Corpus Studies. (Routledge Advances in Corpus Linguistics 22.) London: Routledge.
As an intermediate-level survey of multilayer corpora and their applications, this two-part volume begins by laying out foundations for dealing with large numbers of concurrent annotations, and goes on to explore some of the applications of such corpora across a range of different research questions, primarily in the area of discourse-level phenomena.



References

Bański, P. (2010). Why TEI stand-off annotation doesn’t quite work and why you might want to use it nevertheless. Proceedings of Balisage: The markup conference 2010. Montréal.
Biber, D. (1993). Representativeness in Corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B., & Vettori, C. (2014). The MERLIN Corpus: Learner language and the CEFR. Proceedings of LREC 2014 (pp. 1281–1288). Reykjavik, Iceland.
Brugman, H., & Russel, A. (2004). Annotating multimedia/multi-modal resources with ELAN. Proceedings of LREC 2004 (pp. 2065–2068). Paris: ELRA.
Burchardt, A., Padó, S., Spohr, D., Frank, A., & Heid, U. (2008). Formalising Multi-layer Corpora in OWL DL – Lexicon Modelling, Querying and Consistency Control. Proceedings of IJCNLP 2008 (pp. 389–396). Hyderabad, India.
Calzolari, N., & McNaught, J. (1994). EAGLES interim report EAG–EB–IR–2.
Canales, O., Monaco, V., Murphy, T., Zych, E., Stewart, J., Castro, C. T. A., Sotoye, O., Torres, L., & Truley, G. (2011). A Stylometry system for authenticating students taking online tests. Proceedings of student-faculty research day, CSIS, Pace University, May 6th, 2011 (pp. B4.1–B4.6). White Plains, NY.
Christ, O. (1994). A modular and flexible architecture for an integrated Corpus query system. Proceedings of COMPLEX 94, 3rd conference on computational lexicography and text research (pp. 23–32). Budapest.
Crasborn, O., & Sloetjes, H. (2008). Enhanced ELAN functionality for sign language corpora. Proceedings of the 3rd workshop on the representation and processing of sign languages at LREC 2008 (pp. 39–42). Marrakesh, Morocco.
Cunningham, H., Humphreys, K., Gaizauskas, R., & Wilks, Y. (1997). Software infrastructure for natural language processing. Proceedings of the fifth conference on applied natural language processing (pp. 237–244). Washington, DC.
Diewald, N., Hanl, M., Margaretha, E., Bingel, J., Kupietz, M., Bański, P., & Witt, A. (2016). KorAP architecture – Diving in the Deep Sea of Corpus data. Proceedings of LREC 2016. Portorož: ELRA.
Dipper, S. (2005). XML-based stand-off representation and exploitation of multi-level linguistic annotation. Proceedings of Berliner XML Tage 2005 (pp. 39–50). Berlin, Germany.
Eckart de Castilho, R., & Gurevych, I. (2014). A broad-coverage collection of portable NLP components for building shareable analysis pipelines. Proceedings of the workshop on open infrastructures and analysis frameworks for HLT (pp. 1–11). Dublin.
Eckart de Castilho, R., Ide, N., Lapponi, E., Oepen, S., Suderman, K., Velldal, E., & Verhagen, M. (2017). Representation and interchange of linguistic annotation: An in-depth, side-by-side comparison of three designs. Proceedings of the 11th linguistic annotation workshop (LAW XI) (pp. 67–75). Valencia, Spain.
Garside, R., Leech, G., & Sampson, G. (Eds.). (1987). The computational analysis of English: A Corpus-based approach. London: Longman.
Ghodke, S., & Bird, S. (2010). Fast query for large Treebanks. Proceedings of NAACL 2010 (pp. 267–275). Los Angeles, CA.
Green, L. J. (2002). African American English: A linguistic introduction. Cambridge: Cambridge University Press.
Greenbaum, S. (Ed.). (1996). Comparing English worldwide: The international Corpus of English. Oxford: Clarendon Press.
Hardie, A. (2012). CQPweb – Combining power, flexibility and usability in a Corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409.
Hearst, M. A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1), 33–64.



Heiden, S. (2010). The TXM platform: Building open-source textual analysis software compatible with the TEI encoding scheme. 24th Pacific Asia Conference on Language, Information and Computation (pp. 389–398). Sendai, Japan.
Hinrichs, E. W., Hinrichs, M., & Zastrow, T. (2010). WebLicht: Web-based LRT services for German. Proceedings of the ACL 2010 system demonstrations (pp. 25–29). Uppsala.
Höder, S. (2012). Annotating ambiguity: Insights from a Corpus-based study on syntactic change in old Swedish. In T. Schmidt & K. Wörner (Eds.), Multilingual Corpora and multilingual Corpus analysis (Hamburg studies on multilingualism 14) (pp. 245–271). Amsterdam/Philadelphia: Benjamins.
Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: The 90% solution. Proceedings of the human language technology conference of the NAACL, companion volume: Short papers (pp. 57–60). New York.
Ide, N., Baker, C., Fellbaum, C., & Passonneau, R. (2010). The manually annotated sub-Corpus: A community resource for and by the people. Proceedings of ACL 2010 (pp. 68–73). Uppsala, Sweden.
ISO 24612. (2012). Language resource management – Linguistic annotation framework (LAF). London: BSI British Standards.
ISO 24615. (2010). Language resource management – Syntactic annotation framework (SynAF). London: BSI British Standards.
Klein, T., & Dipper, S. (2016). Handbuch zum Referenzkorpus Mittelhochdeutsch (Bochumer Linguistische Arbeitsberichte 19). Bochum: Universität Bochum Sprachwissenschaftliches Institut.
Kountz, M., Heid, U., & Eckart, K. (2008). A LAF/GrAF-based encoding scheme for underspecified representations of dependency structures. Proceedings of LREC 2008. Marrakesh, Morocco.
Krause, T., & Zeldes, A. (2016). ANNIS3: A new architecture for generic Corpus query and visualization. Digital Scholarship in the Humanities, 31(1), 118–139.
Krause, T., Lüdeling, A., Odebrecht, C., & Zeldes, A. (2012). Multiple Tokenizations in a diachronic Corpus. In Exploring Ancient Languages through Corpora. Oslo.
Kredens, K., & Coulthard, M. (2012). Corpus linguistics in authorship identification. In P. M. Tiersma & L. M. Solan (Eds.), The Oxford handbook of language and law (pp. 504–516). Oxford: Oxford University Press.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day English. Providence: Brown University Press.
Kupietz, M., Belica, C., Keibel, H., & Witt, A. (2010). The German reference Corpus DEREKO: A primordial sample for linguistic research. Proceedings of LREC 2010 (pp. 1848–1854). Valletta, Malta.
Lee, J., Yeung, C. Y., Zeldes, A., Reznicek, M., Lüdeling, A., & Webster, J. (2015). CityU Corpus of essay drafts of English language learners: A Corpus of textual revision in second language writing. Language Resources and Evaluation, 49(3), 659–683.
Leech, G. N. (1997). Introducing Corpus annotation. In R. Garside, G. N. Leech, & T. McEnery (Eds.), Corpus annotation: Linguistic information from computer text corpora (pp. 1–18). London/New York: Routledge.
Lezius, W. (2002). Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. PhD thesis, Institut für maschinelle Sprachverarbeitung, Stuttgart.
Lüdeling, A., Walter, M., Kroymann, E., & Adolphs, P. (2005). Multi-level error annotation in learner corpora. Proceedings of Corpus Linguistics 2005. Birmingham, UK.
Lundborg, J., Marek, T., Mettler, M., & Volk, M. (2007). Using the Stockholm TreeAligner. Proceedings of the sixth workshop on Treebanks and Linguistic theories. Bergen.
Mann, W. C., & Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3), 243–281.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of ACL 2014: System demonstrations (pp. 55–60). Baltimore, MD.



Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2) (Special issue on using large corpora), 313–330.
Martens, S. (2013). TüNDRA: A web application for Treebank search and visualization. Proceedings of the twelfth workshop on Treebanks and Linguistic theories (TLT12) (pp. 133–144). Sofia.
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book (Routledge Applied Linguistics). London/New York: Routledge.
Odebrecht, C., Belz, M., Zeldes, A., & Lüdeling, A. (2017). RIDGES Herbology – Designing a diachronic multi-layer Corpus. Language Resources and Evaluation, 51(3), 695–725.
Reznicek, M., Lüdeling, A., Krummes, C., Schwantuschke, F., Walter, M., Schmidt, K., Hirschmann, H., & Andreas, T. (2012). Das Falko-Handbuch. Korpusaufbau und Annotationen. Humboldt-Universität zu Berlin, Technical Report, Version 2.01, Berlin.
Reznicek, M., Lüdeling, A., & Hirschmann, H. (2013). Competing target hypotheses in the Falko Corpus: A flexible multi-layer Corpus architecture. In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner Corpus data (pp. 101–124). Amsterdam: John Benjamins.
Romary, L., & Bonhomme, P. (2000). Parallel alignment of structured documents. In J. Véronis (Ed.), Parallel text processing: Alignment and use of translation corpora (pp. 201–217). Dordrecht: Kluwer.
Romary, L., Zeldes, A., & Zipser, F. (2015). <tiger2/> – Serialising the ISO SynAF syntactic object model. Language Resources and Evaluation, 49(1), 1–18.
Santorini, B. (1990). Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd rev.). Technical report, University of Pennsylvania.
Sauer, S., & Lüdeling, A. (2016). Flexible multi-layer spoken dialogue corpora. International Journal of Corpus Linguistics, 21(3) (Special issue on spoken corpora), 419–438.
Schembri, A., Fenlon, J., Rentelis, R., Reynolds, S., & Cormier, K. (2013). Building the British sign language Corpus. Language Documentation and Conservation, 7, 136–154.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the conference on new methods in language processing (pp. 44–49). Manchester, UK.
Schmid, H. (2008). Tokenizing and part-of-speech tagging. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 527–551). Berlin: Mouton de Gruyter.
Schmidt, T., & Wörner, K. (2009). EXMARaLDA – Creating, analysing and sharing spoken language corpora for pragmatic research. Pragmatics, 19(4), 565–582.
Smith, J. R., Quirk, C., & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. Proceedings of NAACL 2010 (pp. 403–411). Los Angeles.
Tenfjord, K., Meurer, P., & Hofland, K. (2006). The ASK Corpus – A language learner Corpus of Norwegian as a second language. Proceedings of LREC 2006 (pp. 1821–1824). Genoa, Italy.
Wichmann, A. (2008). Speech corpora and spoken corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 187–207). Berlin: Mouton de Gruyter.
Yimam, S. M., Gurevych, I., Eckart de Castilho, R., & Biemann, C. (2013). WebAnno: A flexible, web-based and visually supported system for distributed annotations. Proceedings of ACL 2013 (pp. 1–6). Sofia, Bulgaria.
Zeldes, A. (2017). The GUM Corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3), 581–612.
Zeldes, A. (2018). Multilayer Corpus studies (Routledge advances in Corpus linguistics 22). London: Routledge.
Zhang, S., & Zeldes, A. (2017). GitDOX: A linked version controlled online XML editor for manuscript transcription. Proceedings of FLAIRS-30 (pp. 619–623). Marco Island, FL.
Zipser, F., & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. Proceedings of the workshop on language resource and language technology standards, LREC 2010 (pp. 7–18). Valletta, Malta.

Part II

Corpus methods

Chapter 4

Analysing Frequency Lists

Don Miller

Abstract As perhaps the most fundamental statistic in corpus linguistics, frequency of occurrence plays a significant role in uncovering variation in language. Frequency lists have been designed for a variety of linguistic features and employed in order to address a variety of research questions in diverse fields of inquiry, from language learning and teaching to literary analysis to workplace communication. This chapter provides an overview of fundamental concepts and methods in the development and application of frequency lists, including issues related to the target unit of analysis, and key considerations beyond raw frequency, such as the importance of normalising frequency counts when making comparisons and employing additional measures to gain a more comprehensive picture of distributional characteristics.

4.1 Introduction

The use of corpora and corpus-based tools allows for increased efficiency in the identification of linguistic features that are characteristic of a language variety. A fundamental statistic in assessing the saliency of any linguistic feature is frequency of occurrence, or, simply, the number of times a feature of interest occurs in a data set. Probably the most widely known frequency lists are lists of frequently occurring lexical items, such as West’s (1953) General Service List (GSL), Nation’s (2004) lists from the British National Corpus (BNC), or more specialized lists such as Coxhead’s (2000) Academic Word List (AWL). These and other word lists have found wide use in support of language learning and teaching, helping focus efforts on lexical items that will ideally provide the biggest payoff for learners. Indeed, a considerable amount of research has been devoted toward developing and improving such lists (cf. Nation 2016). In addition to helping identify target vocabulary for

D. Miller () Department of Languages and Applied Linguistics, University of California, Santa Cruz, CA, USA e-mail: [email protected] © Springer Nature Switzerland AG 2020 M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_4




study, lexical frequency lists have been used to help educators to better understand the lexical demands of target language uses (e.g. Adolphs and Schmitt 2003; Laufer and Ravenhorst-Kalovski 2010; Nation 2006) or to assess the developing vocabulary sizes of learners (e.g. Batista and Horst 2016; Laufer and Nation 1995). Language teachers and learners have also benefited from frequency analysis of linguistic units beyond lexis. For example, the frequency of different grammatical constructions has informed the selection and ordering of features included in course textbooks (Biber and Reppen 2002; Conrad and Biber 2009), and frequencies of error types have been used to better understand the challenges faced by different groups of second language writers (Doolan and Miller 2012). But the use of frequency lists extends well beyond pedagogically oriented applications, helping to address a wide variety of language-related research questions across numerous fields of inquiry. Baker (2011), for example, identified diachronic variation in British English by analysing word frequencies in corpora reflecting four time periods between 1931 and 2006. Weisser (2016a) examined the frequency of different speech acts in workplace communication as a possible means for differentiating individuals (or groups) and assessing speaker efficiency in service encounters. Ikeo (2016) used frequency of multi-word units to assist in literary analysis, to illustrate how language is used in D. H. Lawrence’s Lady Chatterley’s Lover to establish different characters’ internal states, perceptions, and viewpoints. Frequency of multi-word units has also been used in studies of authorial attribution (e.g. Houvardas and Stamatatos 2006). In speech and language pathology, Liu and Sloane (2006) used frequency profiles to inform the selection of Chinese characters meriting encoding for an augmentative and alternative communication (AAC) system. 
These few examples illustrate the wide diversity of applications of frequency analysis. The following section highlights some fundamental issues related to the construction, analysis, and application of frequency lists.

4.2 Fundamentals

4.2.1 Zipf’s Law

A key to understanding both the efficacy and limitations of frequency lists is Zipf’s law (Zipf 1936, 1949). In researching a variety of languages (English, German, Latin, and Chinese), Zipf observed that texts were made up of relatively few frequently occurring words and a large number of words occurring quite rarely. This observation led to Zipf’s law, which very broadly states that, in human language, frequency of occurrence patterns in inverse proportion to frequency rank. If the most frequently occurring word (rank 1) occurs x times, the next most frequent word (rank 2) occurs approximately half as often. The 10th ranked word occurs approximately twice as often as the 20th ranked word, about 10 times as often as the 100th ranked

4 Analysing Frequency Lists


Table 4.1 Sample rank and frequency distribution from the BNC

Rank    Word      Frequency
1       the       6,187,267
2       of        2,941,444
10      to        917,579
20      at        478,162
100     as        91,583
1000    playing   9,738

Fig. 4.1 Rank vs. frequency plot of word forms in ICE-GB: word form frequency (binary log) plotted against rank of word form frequency (binary log). (I thank Stefan Th. Gries for providing me with this figure)
word, and about 100 times as often as the 1000th ranked word. Table 4.1 illustrates this distribution in the BNC (Leech et al. 2001). This phenomenon is perhaps best visualized via a log-log plotting of rank vs. frequency such as in Fig. 4.1, which illustrates form/rank distributions in the British component of the International Corpus of English (ICE-GB).

In practical terms, the distribution identified in Zipf’s law has two important implications:

1. A relatively small set of words accounts for a large proportion of tokens in a text (or corpus). For example, the 100 most frequently occurring words in the BNC account for approximately 45% coverage of the corpus (Atkins and Rundell 2008). This distributional phenomenon allows researchers, despite immense lexical diversity in language, to achieve high text coverage with a relatively low number of high-frequency words. This is very fortunate for list designers and list users seeking to maximize efforts in language learning. However, . . .

2. . . . a very large proportion of words in a text have a very low frequency of occurrence. Approximately 40% of words in the BNC, for example, are hapax legomena, or words occurring only once in a corpus (Scott and Tribble 2006). Since the vast majority of words in a language are highly infrequent, very large corpora are required to capture and understand the distribution and behaviour of rare lexical features.
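To make the rank–frequency relationship concrete, the check can be sketched in a few lines of Python. The function name and the toy corpus below are invented for illustration; with a real corpus, the token list would come from a tokenizer run over the texts.

```python
from collections import Counter

def zipf_profile(tokens, ranks):
    """For each requested rank r, return (r, observed frequency of the
    r-th most frequent word form, frequency predicted by Zipf's law,
    i.e. f(1) / r)."""
    by_rank = sorted(Counter(tokens).values(), reverse=True)
    return [(r, by_rank[r - 1], by_rank[0] / r)
            for r in ranks if r <= len(by_rank)]

# A toy 'corpus' whose distribution happens to be exactly Zipfian:
toy = ["the"] * 8 + ["of"] * 4 + ["to"] * 3 + ["at"] * 2
print(zipf_profile(toy, (1, 2, 4)))  # [(1, 8, 8.0), (2, 4, 4.0), (4, 2, 2.0)]
```

Run over a real corpus, the observed and predicted values diverge somewhat (Zipf's law is an approximation), but the inverse-proportion trend of Table 4.1 and Fig. 4.1 is usually clearly visible.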


D. Miller

4.2.2 Unit of Analysis

As noted above, perhaps the most widely known frequency lists are lists of individual words. Even at this level, some basic decisions have to be made in the operationalization of the construct of a “word.” In languages using the Roman alphabet, for example, it may seem obvious that a word can be identified as one or more letters surrounded by spaces. However, this is not so simple. In English, for instance, decisions must be made regarding whether to treat compounds—whether (a) open (e.g., school bus), closed (e.g., weekend), or hyphenated (e.g., well-being), and (b) semantically transparent (e.g., school bus) or opaque (e.g., deadline)—as one word or as two. This can be especially tricky, as there is often some variation in how these items appear across a corpus (or even within a text); for example, although the hyphenated form, well-being, is clearly the most common option found in the Corpus of Contemporary American English (COCA) (>8000 occurrences), the forms well being and wellbeing each occur several hundred times. Researchers also need to account for different spellings across different varieties of a language (e.g., behaviour vs. behavior) or even commonly misspelled words (e.g., mispelled). Whatever decisions are made, they should be principled, consistent, and clearly explained (cf. Nation 2016, particularly Sect. 4.4.2, for a discussion and suggestions for dealing with such issues).

In profiling lexical frequencies, researchers also have to decide which related forms should be counted as a single lexical item. Operationalizations of the ‘word’ construct can range from treating all orthographic word forms as distinct words (e.g., book and books are two separate words), to treating all word forms sharing a classical root as members of a word family (e.g., help, helping, helpful, unhelpful, helpfulness, etc. are all members of the word family related to the headword HELP).
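The impact of such operationalization decisions can be illustrated with a minimal, hypothetical tokenizer sketch. The regular expressions below represent just one defensible set of choices (lowercased alphabetic strings, hyphenated compounds either kept whole or split); real projects typically need richer rules for apostrophes, numerals, clitics, and so on.

```python
import re

def tokenize(text, split_hyphens=False):
    """One of many possible operationalizations of 'word': lowercased
    alphabetic strings, with hyphenated compounds either kept whole
    (well-being -> one token) or split into their parts (-> two tokens)."""
    pattern = r"[a-z]+" if split_hyphens else r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text.lower())

text = "Her well-being, their wellbeing, and our well being."
print(tokenize(text))
# ['her', 'well-being', 'their', 'wellbeing', 'and', 'our', 'well', 'being']
print(tokenize(text, split_hyphens=True))
# ['her', 'well', 'being', 'their', 'wellbeing', 'and', 'our', 'well', 'being']
```

Even on this one sentence, the two settings yield different counts for well and being, which would ripple through any frequency list built on top of the tokenizer.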
Bauer and Nation (1993) outline a useful seven-level taxonomy for classifying morphological affixation and, ultimately, the relationships between word forms. Until fairly recently, most of the influential pedagogically oriented word lists have been based on the more inclusive levels of this taxonomy in their operationalization of “word family.” For example, the GSL (West 1953), the BNC 3000 (Nation 2004), and the AWL (Coxhead 2000) are all lists of word families. More recently, however, there has been a trend toward using the lemma, a unit consisting of “lexical forms having the same stem and belonging to the same major word class, differing only in inflection and/or spelling” (Francis and Kučera 1982, 1), rather than the word family, as the unit of analysis. For example, the lemma FALL (v.) has the members fall (v.), falls (v.), fell (v.), and fallen (v.); the separate lemma FALL (n.) has the members fall (n.) and falls (n.). An example of a lemma-based list is the 5000-word list detailed in Davies and Gardner’s Frequency Dictionary of American English (2010). One driver of this trend is technological advances, especially the development of increasingly accurate part-of-speech taggers that allow automatic distinctions to be made (cf. Chap. 2). Another driver is the recognition that using the lemma as the unit of analysis does some of the work in differentiating meaning senses of polysemous words. For example, man (n.)



meaning “the human species” or “an adult male” would be differentiated from man (v.) “to take charge of something”. Whatever unit is chosen, it should be noted that constructing frequency lists based on sets of related word forms (e.g., lemmas or word families) requires the additional step of first compiling those sets. For example, in his construction of the BNC lists, Nation (2004) benefitted from the lemmatized list that Leech et al. (2001) had previously compiled, but he still had to build word families from that list. Fortunately, lists of lemmas and word families for many languages are available online, and many software packages used for word list construction include options for unit of analysis (e.g., word form, lemma, or word family) (see Sect. 4.4.4 for examples). Yet another robust research paradigm has widened the concept of “lexical item” beyond single words to include multi-word units (MWUs), including collocations, “phrasal verbs (call on, chew out), idioms (rock the boat), fixed phrases (excuse me), and prefabs (the point is)” (Gardner 2007, 260). This paradigm is founded on the important role of phraseology in discourse. Research has demonstrated the wide prevalence of multi-word sequences in language (e.g. Biber et al. 1999; Biber et al. 2004; Conklin and Schmitt 2008). Erman and Warren (2000) even propose that multi-word phrases account for over half of both spoken and written English! The ubiquity of MWUs, coupled with their demonstrated psycholinguistic reality (e.g. Conklin and Schmitt 2008; Durrant and Doherty 2010) has led to investigations into MWU frequency distributions in a variety of languages (e.g., English: Shin and Nation 2008; Korean: Kim 2009; Polish: Grabowski 2015), registers (e.g., academic writing: Biber et al. 2004; Simpson-Vlach and Ellis 2010), and disciplines (e.g., engineering: Ward 2007; business: Hsu 2011).
A great deal of research in this paradigm has focused on identifying and analyzing prefabricated ‘chunks’ of formulaic language (e.g., if you look at) (Biber et al. 2004). These contiguous sequences of words of various length are known by many names, including formulaic sequences, lexical bundles, or n-grams (where n = the number of words in the sequence, e.g., bigrams, trigrams, etc.). Conceptually, what separates these units from other MWUs is that, rather than being recognized and interpreted as semantically ‘complete’ units (as with, for example, phrasal verbs or idioms), formulaic sequences may appear semantically or structurally incomplete and are thus typically categorized functionally (e.g., framing: the existence of a; quantifying: in a number of ) or structurally (e.g., PP-based: as a result of; NP-based: the nature of the) (Biber and Barbieri 2007; Biber et al. 1999). From a practical perspective, these contiguous strings of words can be retrieved automatically very easily compared to many other types of MWUs, such as open compound nouns (post office), phrasal verbs (call on, chew out), or restricted collocations (do + homework), as identifying these latter types of MWUs requires distinguishing them from free combinations (e.g., I called on a student who was raising her hand. vs. I called on Friday, but she wasn’t in the office.) and/or identifying them across gaps of variable length (e.g., I have to do homework. vs. I have to do my boring algebra homework.) (Gardner 2007). In the past decade, increasing attention has been given to non-contiguous chunks of frequently co-occurring language. These “discontinuous sequences in which



words form a ‘frame’ surrounding a variable slot (e.g. I don’t * to, it is * to)” (Gray and Biber 2013, 109) have been referred to by different names, including lexical frames (ibid.), gapped n-grams (Cheng et al. 2006), phrase frames or p-frames (Fletcher 2011; Römer 2010), and skipgrams (Cheng et al. 2006). This paradigm allows for the identification of frequently occurring phraseological patterns that would be missed were investigations restricted solely to contiguous strings of words. For example, while the latter would identify the trigram it would be, the former allows for the identification of a larger pattern, it would be * to, and of common fillers of the slot (e.g., interesting, useful, better) (Römer 2010).
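As a rough illustration of the difference between contiguous n-grams and p-frames, the following sketch extracts both from a token sequence. The function names and the toy example are invented; production tools add frequency thresholds, dispersion filters, and text boundaries.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def p_frames(tokens, n, slot):
    """n-grams with the word at position `slot` replaced by a '*' wildcard,
    plus a count of the fillers observed in that slot for each frame."""
    frames, fillers = Counter(), {}
    for gram in ngrams(tokens, n):
        frame = gram[:slot] + ("*",) + gram[slot + 1:]
        frames[frame] += 1
        fillers.setdefault(frame, Counter())[gram[slot]] += 1
    return frames, fillers

tokens = "it would be nice to see it would be hard to say".split()
frames, fillers = p_frames(tokens, 5, 3)
print(frames[("it", "would", "be", "*", "to")])   # 2
print(fillers[("it", "would", "be", "*", "to")])  # Counter({'nice': 1, 'hard': 1})
```

Note that a purely contiguous 5-gram count would find it would be nice to and it would be hard to only once each; the frame-based count recovers the shared pattern it would be * to together with its fillers.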

4.2.3 Beyond Raw Frequency

As noted above, frequency of occurrence can be used in understanding the relative salience of linguistic features in a text or discourse domain. It can also be used for comparison across texts or discourse domains. For example, Table 4.2 shows the 30 most frequently occurring nouns in two corpora, each comprising 30 speeches delivered by one of the candidates, Hillary Clinton or Donald Trump, during the 2016 United States Presidential campaign (corpora adapted from Brown 2017). These ranked lists might be used to highlight and compare some of the typical issues raised by the candidates. Juxtaposing these lists reveals some expected overlapping words (e.g., people, country, America, United States, job), but also some words reflecting each candidate’s key talking points. For example, Hillary Clinton frequently mentioned the economy (business, work, tax), education (school, college), and her opponent (Trump). Donald Trump often spoke about national security and immigration (border, wall, Mexico, ISIS), his negative assessment of the state of the country and its policies (problem, disaster, Obama, Obamacare) and, in another interesting overlap, himself in the 3rd person (Trump). (See Chap. 6 for discussion of more sophisticated methods for comparing salience of lexical items across corpora.) The frequency counts in Table 4.2 are what are referred to as raw frequency, or the actual number of times these items are attested in the corpora. As can be seen from Table 4.2, raw frequency can be used to rank items, and ranked lists can provide some useful insights. However, if the goal is to determine whether a feature occurs more or less often in texts or corpora of unequal size, raw frequency has limitations. In this case, it is necessary to normalise frequency counts.

Normalising Frequency Counts

As can be seen from the ranked lists in Table 4.2, the word job made it into both candidates’ lists of frequently used nouns. In terms of raw frequency, there are more than twice the number of occurrences in the corpus of Donald Trump’s speeches than in the corpus of Hillary Clinton’s speeches (525 compared with 251). While it



Table 4.2 Presidential candidates’ most frequently used nouns in US campaign speeches, 2016

Hillary Clinton
Rank   Lemma                   Frequency
1      People                  558
2      Country                 326
3      America                 320
4      Trump                   300
5      President               285
6      Job                     251
7      Family                  238
8      Year                    168
9      Day                     163
10     Election                158
11     Time                    152
12     Business                144
13     American                143
14     Way                     131
15     Economy                 129
16     Campaign                128
17     Thing                   124
18     Work                    118
19     World                   113
20     Right                   110
21     Life                    104
22     Woman                   103
23     Plan, college           101
24     United States, friend   97
25     Man                     89
26     Kid, future             88
27     Tax                     85
28     Place                   80
29     Community               75
30     School                  74

Donald Trump
Rank   Lemma                   Frequency
1      People                  1254
2      Country                 800
3      Hillary                 575
4      Job                     525
5      Time                    429
6      Thing                   365
7      Year                    360
8      Way                     333
9      American                296
10     Trump                   254
11     Percent                 240
12     Day                     236
13     Deal                    233
14     World                   229
15     Obama                   213
16     United States           208
17     Border                  196
18     Right                   194
19     Folks                   182
20     Number                  173
21     Problem                 167
22     ISIS                    157
23     Trade                   151
24     Mexico                  147
25     Disaster                146
26     Wall                    140
27     Company                 136
28     Guy                     130
29     Obamacare               127
30     Plan                    119

may seem that such a disparity is evidence that Donald Trump was more concerned about the issue of employment, these raw frequencies are not meaningfully comparable without factoring in the size of the corpora in which the items occur, or normalising the frequencies. Normalising frequencies is simply a matter of dividing a raw frequency by the total number of words in a text (or corpus) and, optionally, multiplying the result by a common base that is somewhat comparable to the length of the corpus (or texts within the corpus). Thus, as can be seen in Table 4.3, 100,000 has been set as the base, as one of the corpora is nearly this size, while the other is not too much larger. A note of caution: Though it is common for research



Table 4.3 Computing normalized frequency

Candidate         Raw frequency    Number of words   Step 1: Raw frequency /   Step 2: Multiply by common   Normalised frequency
                  of the word jobs in corpus         total number of words     base (here 100,000)          (per 100,000 words)
Hillary Clinton   252              95,131            252/95,131 = 0.00264      0.00264 × 100,000            264
Donald Trump      525              167,446           525/167,446 = 0.00313     0.00313 × 100,000            313
using corpora comprising tens or hundreds of millions of words to use one million or more as a common base, care should be taken in choosing this figure to avoid misrepresenting the frequency (Weisser 2016b). Using the current example, normalising to occurrences per one million words—which is in this case about 10 times the size of the corpora and 100 times the size of the individual texts—might artificially inflate counts for rare features. As can be seen in Table 4.3, the corpora are quite different in size. Thus, despite the large difference in raw frequency of occurrence, normalized frequency suggests that the word job actually played a more comparable role – at least in terms of frequency – in both candidates’ speeches than the raw frequencies would suggest.
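The two-step computation shown in Table 4.3 amounts to a one-line formula. The helper below is a hypothetical sketch, not the API of any particular corpus tool; the figures are those from the table.

```python
def normalized_frequency(raw_freq, corpus_size, base=100_000):
    """Step 1: divide the raw frequency by the corpus size; step 2: multiply
    by a common base, yielding occurrences per `base` words."""
    return raw_freq / corpus_size * base

# Figures from Table 4.3:
print(normalized_frequency(252, 95_131))   # Clinton: ~264.9 per 100,000 words
print(normalized_frequency(525, 167_446))  # Trump:   ~313.5 per 100,000 words
```

Changing `base` rescales both values by the same factor, so the comparison between the candidates is unaffected; the caution in the text concerns only how the resulting numbers are read and reported.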

Range and Dispersion

In practice, frequency alone—even normalized frequency—is of limited use for understanding the salience of a feature throughout a discourse domain. This point can be illustrated by considering two different content words, each occurring 42 times (per 100,000 words) in the corpus of Donald Trump’s campaign speeches: military (n.) and Virginia (n.). While frequency appears to suggest some sort of parity in salience, these occurrences had very different distributions throughout the corpus. Closer analysis reveals that the word military occurred in 80% of his speeches (24 of 30), suggesting it was one of Donald Trump’s frequent talking points throughout the campaign, whereas the word Virginia occurred in just 7 speeches, illustrating that its use was limited to a much narrower set of contexts—perhaps to speeches given in that state. If the goal is to identify Donald Trump’s pet talking points throughout his campaign, it should be clear that frequency alone would be insufficient. For this reason, researchers typically include measures of dispersion in order to better understand the distributional characteristics of a target feature. Measures of dispersion provide a more comprehensive picture of frequency distributions, allowing researchers to determine whether features are generally representative of a target discourse domain or, alternatively, limited to certain contexts or idiosyncrasies of certain language users.



The simplest measure of dispersion is range, and it is typically operationalized in terms of the number of texts and/or sections in which a feature occurs. Many frequency lists have been designed based on a combination of frequency and range criteria. For example, the words included in Coxhead’s (2000) AWL had to occur 10 times in each of the four main macro-disciplines and in approximately one half of the subdisciplines represented in her Academic Corpus. The formulaic sequences in Hsu’s (2011) Business Formulas List had to occur 10 times per million words, in at least one half of the business disciplines represented in the corpus, and in at least 10% of texts representing each of these disciplines. Though range may be the most commonly employed measure of dispersion in frequency list design, contemporary researchers are increasingly employing measures of dispersion that take into account how evenly a feature occurs throughout a target discourse domain – not simply the number of texts or sections of a corpus that a feature occurs in, but whether the occurrences are evenly distributed or particularly frequent or infrequent in any of these sections (see Chap. 5 for further discussion). For example, the words in Nation’s (2004) three 1000-word BNC-based frequency lists had to achieve a minimum frequency (≥10,000 occurrences in the whole corpus), range (occurrence in 95–98 of 100 one million-word sections), and evenness of distribution across 100 one million-word sections (a dispersion coefficient—Juilland’s D—of ≥.80; cf. Chap. 5 for more about this measure and below for a discussion of its potential shortcomings). Davies and Gardner (2010) determined ranks for words in their dictionary by employing yet another method: multiplying each word’s frequency by its dispersion (also Juilland’s D) and ranking words by the resulting score.
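For readers who want to experiment, Juilland's D over equal-sized corpus parts can be sketched as follows. This is a simplified illustration rather than the canonical implementation of any package; the function name is invented.

```python
import math

def juillands_d(subfreqs):
    """Juilland's D for a word's frequencies across n equal-sized corpus
    parts: 1 - V / sqrt(n - 1), where V is the coefficient of variation
    (population standard deviation / mean) of the sub-frequencies.
    1 = perfectly even distribution; values near 0 = highly skewed."""
    n = len(subfreqs)
    mean = sum(subfreqs) / n
    if mean == 0:
        return 0.0
    sd = math.sqrt(sum((f - mean) ** 2 for f in subfreqs) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)

print(juillands_d([5, 5, 5, 5]))   # 1.0 (perfectly even)
print(juillands_d([20, 0, 0, 0]))  # ≈ 0.0 (all occurrences in one part)
```

Both toy words occur 20 times in total; only the dispersion measure distinguishes the evenly spread word from the clumped one.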
The variety of measures and criteria evidenced in these examples illustrates an important point: there are no hard and fast rules regarding the best frequency or dispersion criteria for constructing frequency lists. Rather, the criteria used have often been somewhat arbitrary, reflecting researchers’ intuitions about what seems reasonable. Oftentimes, this has meant that researchers simply employ criteria used in previous research. Alternatively, they might experiment with different criteria, observing the resulting lists to gauge whether the criteria used lead to lists with desired characteristics (see Representative Study 2).

Representative Study 1
Gardner, D., & Davies, M. 2014. A new academic vocabulary list. Applied Linguistics 35(3):305–327.
Gardner and Davies’ study exemplifies a contemporary approach to constructing and validating a frequency list of academic vocabulary—one based on lemmas rather than word families—by taking advantage of technological advances (e.g., part-of-speech tagging; the ability to compile substantially larger corpora) and the application of statistics to building frequency lists.



The list that Davies and Gardner designed, the 3000-word Academic Vocabulary List (AVL), was culled from a 120+ million word subcorpus of written academic English from the Corpus of Contemporary American English (COCA). According to Gardner and Davies, the AVL improves upon previous efforts in a number of key ways. First, the AVL is based on a contemporary, balanced corpus that is considerably larger than the 3.5 million word corpus from which its most popular predecessor, the AWL (Coxhead 2000), was culled. Their academic subcorpus of COCA represents nine academic disciplines. Approximately 75% of the corpus comprises academic journals, with the remainder from “academically oriented magazines” and financial newspaper articles (p. 313). According to the researchers, the size and breadth of this corpus improve its representativeness of academic writing. Secondly, the AVL is a list of lemmas rather than word families. Gardner and Davies argue that using the lemma as the unit of analysis will allow list users to more accurately target the most frequently occurring forms and meaning senses of academic vocabulary. Third, the authors included a more comprehensive set of statistical measures related to ratio and dispersion than were employed in previous efforts for identifying academic vocabulary (esp. the AWL). Specifically, words in the AVL had to meet the following criteria:

1. Ratio: Selected words had to occur at a rate 50% higher (i.e., at 1.5 times the ‘expected’ rate of occurrence) in their academic corpus than in a nonacademic corpus (the rest of COCA).
2. Range: Selected words had to occur with more than 20% of the expected frequency (cf. Chap. 20) in at least 7 of the 9 disciplines in their academic corpus.
3. Dispersion: In order to demonstrate evenness of distribution, selected words had to achieve a Juilland’s D of at least .80.
4. Discipline measure: Selected words could not occur at more than 3 times the expected frequency in any one discipline, in order to avoid including discipline-specific vocabulary.

Gardner and Davies arrived at each cut-off rate via experimentation, as no guidance was available in previous frequency list research. For example, the 1.5 ratio rate was chosen because the researchers determined that lower ratios allowed too many general words (e.g. different, large), while higher ratios disallowed many words that intuitively belong in a high-frequency academic word list (e.g., require, create).


In addition to the methodology used for corpus design and list extraction, two coverage-based analyses were used to provide evidence of the AVL’s validity. First, to demonstrate the specialized nature of the list, Gardner and Davies compared the coverage of the AVL in two academic corpora to that in two non-academic corpora. Results of this analysis demonstrate that the list performs considerably better coverage-wise in academic writing than in other written genres, both in the academic portion of the corpus from which the list was constructed (13.8%) and in a different academic sub-corpus from the BNC (13.7%) (compared to 8.0% for newspaper articles and 3.4% for fiction). In the second analysis, Gardner and Davies compared the performance of the AVL to that of the AWL in order to demonstrate that it is indeed a more robust list. Because the AWL is a list of 570 word families, they built word families based on the most frequent 570 lemmas in the AVL for this comparison. The resulting 570 word families based on the AVL indeed provided considerably better coverage than the AWL in the two academic corpora—in fact, almost twice the coverage: nearly 14% by the AVL compared to approximately 7% by the AWL. The researchers acknowledge a key concern that may be raised about the fairness of the AVL vs. AWL coverage-based comparison in this second analysis. Many specialized lists that have been designed over the past several decades have relied on West’s (1953) GSL in helping to determine the specialized nature of frequency vocabulary (e.g., the AWL). They have essentially used the GSL as a stoplist, focusing only on words that do not appear in this list. This approach has a few drawbacks. First, it inherits any potential shortcomings of the GSL (or any stoplist employed).
What is perhaps more important, however, is that, as Davies and Gardner demonstrate (and others have noted previously, see esp. Paquot 2007, 2010), many general service list words do have notable importance in academic writing as well. While they are highly frequent and widely dispersed in general English, they may be especially so in academic writing. Nevertheless, because Gardner and Davies used a different methodology—specifically, not using the GSL as a stoplist—their AVL includes many high frequency items that are also found on the GSL and so were not included in the AWL. While this does not detract from the efficacy of the AVL, it does limit the strength of conclusions that can be drawn from this coverage-based comparison.




Representative Study 2
Brezina, V., & Gablasova, D. 2015. Is there a core general vocabulary? Introducing the New General Service List. Applied Linguistics 36(1):1–22.
In this study, Brezina and Gablasova detail their development of an updated general service list, the new-GSL. Of particular note in their study is an additional step that they employ for addressing reliability: comparing frequency list items across different corpora. That is, they identify an overlapping core frequency list based on four corpora that differ from each other in terms of size, genre distributions, and age: the one million-word Lancaster-Oslo/Bergen (LOB) corpus from 1961, the 100 million-word British National Corpus (BNC) from the 1990s, the one million-word British English 2006 corpus (BE06) from 2005–6, and the 12 billion-word internet-based EnTenTen12 from 2012. Using Sketch Engine, a sophisticated web-based corpus tool, Brezina and Gablasova culled four lists of 3000 words, one from each of the corpora. An important difference between their methodology and West’s (1953) is the unit of analysis used. They chose the lemma (i.e., base form + inflectional variants) rather than the word family for two primary reasons: (1) it is beginners, i.e., learners without wide derivational morphological knowledge, who are the most likely users of a general service list; (2) using lemmas—rather than a more inclusive category like word family—can lessen the number of different word senses contained in a single lexical unit. Another key difference is that they selected words for their list by using distributional criteria only: words were ordered according to a composite score based on frequency and range, i.e. Average Reduced Frequency (Savický and Hlaváčová 2002),1 and the top 3000 lemmas (excluding proper nouns) from each corpus were included in each of the four lists.
To address and better understand the issue of diachronic change in frequency profiles, they extracted and compared lists from the four corpora. Differences among the lists were found primarily in content words, often reflecting changes in technology. For example, the BE06 includes a number of words that its earlier counterpart, the LOB, designed 40 years earlier, did not, e.g., CD, computer, email, Internet, mobile, online, video, web, website.

1 West (1953) purposely left out words that he considered highly emotional, potentially offensive, colloquial or slang, regardless of their frequency in the corpus used for constructing the GSL. He felt that users of this list (i.e., language learners), needed to be able to express ideas, not emotions, and he felt that these ideas could be expressed without colloquialisms. Further, to cover as wide a range of notions as possible, West left off words expressing notions already covered by more frequent words and replaced them with less frequent words with different semantic values.



Notable differences were also found between the BE06 and EnTenTen12 corpora, despite their being so close in age. The list from the EnTenTen12 included a higher proportion of words related to the internet and technology (e.g., download, file, install, menu, scan, software, upgrade) and business and advertising (e.g., advertising, CEO, competitor, dealer, discount, logo). The researchers note that these differences likely reflect the online registers of the internet-based EnTenTen12. The authors then compared the frequency lists culled from each corpus to identify an overlapping, “stable” core of general service words—a list not demonstrating bias toward any of the four corpora, and thus exhibiting stability across time. The four lists overlapped by 78 to 84 percent, with correlations between ranks ranging from rs = .762 to rs = .870 (all p < .001). A total of 2116 lemmas were shared by all four lists, thus, according to the authors, constituting a stable core. In order to further modernize their new-GSL, they also added 378 lemmas which were identified in the two recent corpora, the BE06 and the EnTenTen12, completing the final, 2494-word list. Finally, Brezina and Gablasova compared the coverage of their new-GSL with that of West’s (1953) GSL in each of the four corpora from which the former was derived. They found coverage to be generally comparable, with both lists providing on average a bit over 80%. What is most important in this comparison is that the new-GSL is based on lemmas (2494 lemmas in total), whereas the original GSL’s 2000 word families comprise 4114 lemmas. Thus, the researchers argue, an important benefit of the new-GSL is that it provides comparable coverage with considerably less ‘investment’ for learners (i.e., approximately 40% fewer lemmas).
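The Average Reduced Frequency measure used by Brezina and Gablasova can be sketched as follows. This follows the commonly cited formulation (distances between successive occurrences, taken cyclically and capped at v = N/f); the function name is invented for illustration, and readers should consult Savický and Hlaváčová (2002) for the full definition.

```python
def average_reduced_frequency(positions, corpus_size):
    """Average Reduced Frequency: a frequency score discounted for
    clustering. `positions` are the 0-based token indices of the word's
    occurrences in a corpus of `corpus_size` tokens. Distances between
    successive occurrences are taken cyclically and capped at v = N / f."""
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f
    pos = sorted(positions)
    dists = [pos[i] - pos[i - 1] for i in range(1, f)]
    dists.append(corpus_size - pos[-1] + pos[0])  # wrap-around distance
    return sum(min(d, v) for d in dists) / v

print(average_reduced_frequency([0, 25, 50, 75], 100))  # 4.0: evenly spread, ARF equals raw frequency
print(average_reduced_frequency([0, 1, 2, 3], 100))     # 1.12: tightly clustered, heavily discounted
```

Both toy words have a raw frequency of 4 in a 100-token corpus; ARF rewards the evenly spread word and discounts the clustered one, which is why it can serve as a single composite criterion combining frequency and range.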

4.3 Critical Assessment and Future Directions

Three critical issues require attention as frequency list research goes forward: (1) dealing with homoforms and multi-word units, (2) the application of dispersion (and other) statistics to frequency list creation, and (3) addressing reliability/stability in the validation of frequency lists.

4.3.1 Dealing with Homoforms and Multi-word Units

Two critical obstacles that need to be addressed in frequency list research are the handling of homoforms (e.g., river bank vs. investment bank vs. bank as a verb) and multi-word units (e.g., I didn’t care for the movie at all vs. The product is



carried at all locations.) (Cobb 2013). These phenomena pose challenges at both the theoretical level (e.g., agreeing on a consistent operational definition for multi-word units) and the practical level (identification of these phenomena and incorporating them into frequency lists) (Nation 2016). Identification of these units is the first complex hurdle. Take for example a word from Nation’s first BNC 1000 list, course. Among its many meanings are (1) a certain path or direction (e.g., a circuitous course); (2) an academic class or program (e.g., a biology course); (3) a part of a meal (e.g., first course, second course . . . ); (4) to move (e.g., blood coursing through the veins). As Gardner (2007) and Gardner and Davies (2014) note, using the lemma can do some of the work in differentiating meaning senses. For example, meaning 4 of course would be differentiated with a POS tagger. However, this still leaves at least three meanings (1–3) grouped together as course (n.). Yet another challenge is differentiating multi-word units from the same sequences of words not acting as a lexical unit, as discussed above. Differentiation in both instances requires either manual sorting or, to accomplish this task automatically, complex algorithms able to analyse the lexical and grammatical context. Such differentiation would no doubt have some effect on frequency lists. Martinez and Schmitt (2012), for example, found 505 multi-word units in the most frequent 5000 BNC words. In their analysis, they determined that, for example, while the word course currently exists in the first BNC 1000 list, it appears in the BNC more frequently as part of the MWU of course rather than as the single word course. Incorporating this information would put the MWU of course on the BNC 1000 list and shift the single word course to the BNC 3000. Several options have been proposed for incorporating MWUs along with single-word items into frequency lists. 
O’Donnell (2011), for example, proposes possible methods for adjusting frequencies to account for words’ occurrence in frequently occurring n-grams of various sizes so that this information can be used in frequency list construction. The task of identifying and sorting homoforms and multi-word units in corpora and incorporating them into frequency lists is clearly a significant one, and this is only part of the challenge. If we then want to use these lists to profile lexis in other texts or corpora, the same procedure is necessary for each new text. Such disambiguation remains both a theoretical and practical challenge for corpus-based frequency list research.

4.3.2 Application of Dispersion (and other) Statistics

As discussed in Sect., contemporary operationalizations of lexical item salience typically include some measure of dispersion. The most frequently used measure in contemporary frequency list research is Juilland's D (e.g. Davies and Gardner 2010; Gardner and Davies 2014; Leech et al. 2001; Lei and Liu 2016). Gries (2008), however, details limitations of Juilland's D and other "parts-based" measures of dispersion (p. 410). A key issue is that these measures tend to require

4 Analysing Frequency Lists


that parts be of equal size, which Gries notes may lead to less "meaningful" divisions (rather than divisions at, for example, the text level) for corpus parts.2 Gries also notes that these measures may lack the desired sensitivity to distinguish between items with even very different distributional profiles (see Chap. 5 for more information). Further investigating the limitations raised by Gries (2008), Biber et al. (2016) conducted a number of experiments to determine the sensitivity of Juilland's D for words with a range of distributional profiles in the BNC when the number of corpus parts used in the calculation of this statistic was manipulated. What they found was that dispersion scores increased (indicating more even distribution) as the number of parts increased, which would in effect allow even items with highly skewed dispersion to meet selection criteria. They conclude that Juilland's D is "not a reliable measure of lexical dispersion in a large corpus that has been divided into numerous corpus-parts" (p. 442). However, they were able to demonstrate that an alternative measure of dispersion, Gries' (2008) Deviation of Proportions (DP), is more functional and reliable. Their experiments confirmed that, apart from allowing calculation using unequal corpus parts, DP does not suffer from the tendency of Juilland's D to identify highly skewed items as being evenly dispersed. This allows for both the use of natural divisions of parts (e.g., at the text level) and finer-grained sampling (see Representative Study 1 in Chap. 5 for more information). Biber et al. (2016) conclude their study with an extremely important point: While the increasing application of more robust statistics to corpus-based research should be encouraged, a great deal more research is needed to better understand the benefits and limitations of different statistics for different research questions (cf. Gries 2010 and Chap. 5).
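The sensitivity that Biber et al. (2016) report can be reproduced in miniature. The sketch below is a simplified simulation, not their actual experiment; the corpus size, cluster size, and frequency are invented. It computes Juilland's D for a word whose 100 occurrences all fall in the first 10% of a 100,000-token corpus, dividing the same corpus into 10, 100, and 1,000 equal parts:

```python
import math

def juillands_d(counts):
    """Juilland's D for equal-sized corpus parts: 1 - vc / sqrt(n - 1)."""
    n = len(counts)
    mean = sum(counts) / n
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)

# Invented scenario: a 100,000-token corpus; a word occurs 100 times,
# all inside the first 10,000 tokens (a highly skewed distribution).
corpus_size, cluster_size, freq = 100_000, 10_000, 100

d_by_parts = {}
for n_parts in (10, 100, 1000):
    part_size = corpus_size // n_parts
    parts_in_cluster = max(1, cluster_size // part_size)
    # the cluster's occurrences spread evenly over the parts it covers
    counts = [freq // parts_in_cluster] * parts_in_cluster \
             + [0] * (n_parts - parts_in_cluster)
    d_by_parts[n_parts] = round(juillands_d(counts), 3)
```

Even though the word's distribution never changes, D rises from 0.0 (10 parts) to roughly 0.9 (1,000 parts), illustrating how a highly skewed item could pass a D-based selection threshold in a finely divided corpus.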

4.3.3 Addressing Reliability in the Validation of Frequency Lists

Over the past two decades—and as exemplified in the two representative studies in this chapter—it would appear that the greatest efforts in designing and validating frequency lists have gone into three areas: corpus design, item selection criteria, and—in the case of word lists—coverage-based demonstrations of list robustness. Corpora are now often much larger and better balanced and, as a result, perhaps more representative than ever before. The application of additional distributional statistics allows for better targeting of items with desired distributions (e.g., Gardner and Davies 2014). Lexical frequency lists are providing ever higher coverage of target texts or achieving such coverage with fewer words (e.g., Brezina and Gablasova 2015).

2 Gries (2008) and Biber et al. (2016) note that there are adjustments that can be made to account for different-size parts, but they are rarely (if ever) employed in frequency list research.


D. Miller

In the midst of these important developments, one issue that deserves more attention is frequency list generalizability. That is, to what degree are corpus-based frequency lists generalizable to target discourse domains? At present, evidence of frequency list generalizability tends to come in one of two forms, both of which are indirect. The first comprises primarily corpus-external evidence, focused on corpus design (i.e., Do samples represent the diversity of text types, topics, etc. in a proportion reflecting the target language use domain?). Biber (1993) refers to these critical considerations as situational evidence of corpus representativeness (see also Chap. 1). When it appears that corpus design represents the situational parameters of the target discourse domain, this is often taken to suggest that frequency lists generated from such corpora should be generalizable. In word list research, a second form of evidence of generalizability is also typically employed: post-hoc assessment of a list's coverage in other, similarly purposed corpora. As discussed above, Gardner and Davies (2014) assessed the coverage provided by their Academic Vocabulary List (AVL) in the corpus from which it was extracted as well as in an academic subcorpus from the BNC, and found it to be almost identical in both corpora. While coverage is an unquestionably important factor in assessing the value of a list, it can only serve as an indirect measure of reliability. Very rarely do studies include a direct assessment of list generalizability – of the extent to which items on a frequency list produced from one corpus overlap with items on a list extracted from a similarly purposed corpus made up of different texts. Manuals on lexical frequency list research, however, do recommend such comparisons.
Nation and Webb (2010), for example, suggest that list designers "cross-check the resulting list on another corpus or against another list to see if there are any notable omissions or unusual inclusions or placements" (p. 135). Nation (2006), for instance, showed that the set of words in each frequency band provided greater coverage than the set of words in each subsequent frequency band in both the BNC and a comparison corpus. While this analysis provides important evidence regarding the proper ordering of each frequency band, Nation acknowledges that "this approach does not show that each word family member is in the right list" (2006, p. 64). In other words, while the BNC 1 K did in fact provide more coverage than the BNC 2 K in both the BNC and the comparison corpus, there may be words in the BNC 2 K—or even in the BNC 3 K, 4 K, etc.—which provide higher coverage in the comparison corpus than do certain words in the BNC 1 K. For this reason, Nation (2016) recommends checking lists "against competing lists not just for coverage but also for overlapping and non-overlapping words" (p. 132), as was done in Brezina and Gablasova's (2015) study (cf. Representative Study 2). In this analysis, they found 78–84% overlap in pairwise comparisons between their four lists, and 71% overlap among all four. While this level of overlap may or may not be surprising, it is important to consider that lists of high frequency general vocabulary, as was the focus here, would likely have the greatest stability across different corpora. The generalizability of lists of lower frequency features, e.g., specialized vocabulary, MWUs, etc., may be more
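The overlap check Nation recommends is straightforward to compute. The sketch below uses two hypothetical six-item mini-lists purely for illustration; real comparisons would use full frequency lists, and the choice of denominator (smaller list, union, etc.) is itself a methodological decision:

```python
def list_overlap(list_a, list_b):
    """Percentage of shared items between two word lists, relative to the
    smaller list (one simple operationalization; other denominators exist)."""
    a, b = set(list_a), set(list_b)
    return 100 * len(a & b) / min(len(a), len(b))

# Hypothetical mini-lists, for illustration only
list_1 = ["the", "of", "course", "make", "take", "people"]
list_2 = ["the", "of", "course", "make", "give", "people"]
overlap = list_overlap(list_1, list_2)
```

Here five of six items are shared, giving roughly 83% overlap, in the same ballpark as the 78–84% pairwise overlap Brezina and Gablasova report for their high-frequency lists.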



limited without certain methodological adjustments (e.g., to corpus design, selection criteria, etc.). More research is needed to better understand the effects of the different methodologies employed on the ability to capture reliable, generalizable frequency lists. Current practice has been to design new lists by adopting previously used methodologies or to update lists by manipulating several variables (e.g., compiling larger corpora balanced in a more principled way; changing the unit of measurement; employing different measures of dispersion). But it is unclear whether newly employed methodologies are leading to (more) reliable lists, and, if so, which (combination of) methodological adjustments account for seeming improvements. Future research should continue to investigate how different variables such as frequency profiles of target features, selection criteria used, or corpus design may affect the reliability of lists. What combination of selection criteria ensures the greatest list reliability (Park 2015)? What size and composition of corpus is required to capture a reliable, generalizable frequency list (Brysbaert and New 2009; Miller and Biber 2015)? (How) does this requirement change for different features (e.g., lemmas, word families, MWUs) or discourse domains (Biber 1990, 1993)? In what ways might statistical techniques such as bootstrapping (cf. Chap. 24, this volume, esp. Sect.) be applied to help researchers better understand the extent to which their corpora represent the distributions of target features in target discourse domains? Answers to these questions would go a long way towards helping researchers maximize efforts in designing reliable, generalizable frequency lists.

4.4 Tools and Resources

This section highlights three tools that are highly useful for lexical frequency profiling and the construction of frequency lists, including single words and n-grams.

AntConc
Anthony, L. 2014. AntConc (Version 3.4.3) [Computer Software]. Tokyo, Japan: Waseda University. http://www.laurenceanthony.net/. Accessed 7 June 2019.
AntConc is very easy-to-use yet powerful freeware which allows users to quickly construct frequency lists of word forms, lemmas, or n-grams based on their own uploaded texts. A nice feature is that AntConc allows users to build frequency lists based on lemmas, by using lists of lemmas provided in the software for English, French, or Spanish or by uploading their own.

AntWordProfiler
Anthony, L. 2014. AntWordProfiler (Version 1.4.1) [Computer Software]. Tokyo, Japan: Waseda University. http://www.laurenceanthony.net/. Accessed 7 June 2019.
AntWordProfiler is an update of Heatley and Nation's (1994) Range program which, as the original name implies, profiles not only the frequency but also the range of



each word across user-uploaded files. Results can easily be exported into an Excel file, allowing for further calculations of dispersion statistics.

#LancsBox: Lancaster University Corpus Toolbox
Brezina, V., McEnery, T., & Wattam, S. 2015. Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139–173. http://corpora.lancs.ac.uk/lancsbox/. Accessed 7 June 2019.
Another example of useful freeware for corpus analysis, LancsBox includes the functionalities of the Ant-ware tools described above and several more. Users can upload their own texts or use any of the half dozen corpora already embedded into the software. Also embedded into the program is the TreeTagger, allowing for part-of-speech tagging and lemmatization of more than a dozen languages. LancsBox also includes the ability to calculate several measures of dispersion. Yet another useful feature is the option to graphically visualize dispersion across corpora via interactive illustrations.

Further Reading

Nation, I.S.P. 2016. Making and using word lists for language learning and testing. John Benjamins, Amsterdam.
This text provides a comprehensive, practical, and very accessible discussion of important issues in frequency list design and evaluation. Nation provides useful recommendations for researchers through all steps in frequency list development, from designing (or choosing) a corpus to choosing an appropriate unit of analysis (including dealing with homoforms and multi-word units), to determining criteria for word selection and ordering. He also provides a helpful list of questions that can guide the analysis of pedagogically oriented frequency lists and walks readers through specific examples of evaluations via reflections on the merits and shortcomings of the BNC lists he designed.

Baayen, H. 2001. Word frequency distributions. Kluwer Academic, New York.
Whereas Nation's (2016) text is geared more toward language teaching and learning specialists, Baayen's book provides a much more challenging, theoretical discussion of properties and analysis of word frequency distributions. It provides a sophisticated discussion of statistical analyses of these distributions, particularly with regard to rare words, "Large Numbers of Rare Events" (LNRE), which comprise a considerable proportion of natural language. While this book aims to make word frequency-related statistical techniques "more accessible for non-specialists" (p. xxi), it does require that readers have a sound background in probability theory.



References Adolphs, S., & Schmitt, N. (2003). Lexical coverage of spoken discourse. Applied Linguistics, 24(4), 425–438. Atkins, B. T. S., & Rundell, M. (2008). The Oxford guide to practical lexicography. New York: Oxford University Press. Baker, P. (2011). Times may change but we’ll always have money: A corpus driven examination of vocabulary change in four diachronic corpora. Journal of English Linguistics, 39, 65–88. Batista, R., & Horst, M. (2016). A new receptive vocabulary size test for French. The Canadian Modern Language Review, 72(2), 211–233. Bauer, L., & Nation, I. S. P. (1993). Word families. International Journal of Lexicography, 6, 253– 279. Biber, D. (1990). Methodological issues regarding corpus-based analyses of linguistic variation. Literary and Linguistic Computing, 5, 257–269. Biber, D. (1993). Representativeness in corpus design. Literary & Linguistic Computing, 8, 243– 257. Biber, D., & Barbieri, F. (2007). Lexical bundles in university spoken and written registers. English for Specific Purposes, 26, 263–286. Biber, D., Conrad, S., & Cortes, V. (2004). If you look at . . . : lexical bundles in university teaching and textbooks. Applied Linguistics, 25, 371–405. Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The Longman grammar of spoken and written English. London: Longman. Biber, D., & Reppen, R. (2002). What does frequency have to do with teaching grammar? Studies in Second Language Acquisition, 24(2), 199–208. Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics, 21(4), 439–464. Brezina, V., & Gablasova, D. (2015). Is there a core general vocabulary? Introducing the new general service list. Applied Linguistics, 36(1), 1–22. Brown, D. W. (2017). Clinton-Trump corpus. http://www.thegrammarlab.com. Accessed 7 June 2019. Brysbaert, M., & New, B. (2009). 
Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. Cheng, W., Greaves, C., & Warren, M. (2006). From n-gram to skipgram to concgram. International Journal of Corpus Linguistics, 11(4), 411–433. Cobb, T. (2013). Frequency 2.0: Incorporating homoforms and multiword units in pedagogical frequency lists. In L2 vocabulary acquisition, knowledge and use: New perspectives on assessment and corpus analysis. Eurosla Monographs Series 2. Conklin, K., & Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers? Applied Linguistics, 29(1), 72–89. Conrad, D., & Biber, D. (2009). Real grammar: A corpus-based approach to English. New York: Pearson. Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. Davies, M., & Gardner, D. (2010). A frequency dictionary of contemporary American English. New York: Routledge. Doolan, S., & Miller, D. (2012). Generation 1.5 written error patterns: A comparative study. Journal of Second Language Writing, 21, 1–22. Durrant, P., & Doherty, A. (2010). Are high-frequency collocations psychologically real? Investigating the thesis of collocational priming. Corpus Linguistics and Linguistic Theory, 6(2), 125–155. Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text, 20(1), 29–62.



Francis, W., & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin. Fletcher, W. H. (2011). Phrases in English. http://phrasesinenglish.org. Accessed 13 June 2019. Gardner, D. (2007). Validating the construct of word in applied corpus-based vocabulary research: A critical survey. Applied Linguistics, 28(2), 241–265. Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327. Grabowski, L. (2015). Keywords and lexical bundles within English pharmaceutical discourse: A corpus-driven description. English for Specific Purposes, 38(2), 23–33. Gray, B., & Biber, D. (2013). Lexical frames in academic prose and conversation. International Journal of Corpus Linguistics, 18(1), 109–136. Gries, S. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. Gries, S. (2010). Dispersions and adjusted frequencies in corpora: Further explorations. In S. Gries, S. Wulff, & M. Davies (Eds.), Corpus linguistic applications: Current studies, new directions (pp. 197–212). Amsterdam: Rodopi. Heatley, A., & Nation, I. S. P. (1994). Range. [Computer Software]. Victoria University of Wellington, NZ. http://www.vuw.ac.nz/lals/. Accessed 7 June 2019. Houvardas, J., & Stamatatos, E. (2006). N-gram feature selection for authorship identification. In J. Euzenat & J. Domingue (Eds.), Artificial intelligence: Methodology, systems, and applications (AIMSA 2006: Lecture notes in computer science) (Vol. 4183, pp. 77–86). Berlin: Springer. Hsu, W. (2011). A business word list for prospective EFL business postgraduates. Asian ESP Journal, 7(4), 63–99. Ikeo, R. (2016). An analysis of viewpoints by the use of frequent multi-word sequences in DH Lawrence’s Lady Chatterley’s Lover. Language and Literature, 25(2), 159–184. Kim, Y. (2009). Korean lexical bundles in conversation and academic texts. Corpora, 4, 135–165. Laufer, B., & Nation, I. S. P. (1995). 
Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3), 307–322. Laufer, B., & Ravenhorst-Kalovski, G. (2010). Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language, 22, 15– 30. Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in written and spoken English: Based on the British National Corpus. London: Longman. Lei, L., & Liu, D. (2016). A new medical academic word list: A corpus-based study with enhanced methodology. Journal of English for Academic Purposes, 22, 42–53. Liu, C., & Sloane, Z. (2006). Developing a core vocabulary for a Mandarin Chinese AAC system using word frequency data. International Journal of Computer Processing of Oriental Languages, 19(4), 285–300. Martinez, R., & Schmitt, N. (2012). A phrasal expressions list. Applied Linguistics, 33(3), 299– 320. Miller, D., & Biber, D. (2015). Evaluating reliability in quantitative vocabulary studies: The influence of corpus design and composition. International Journal of Corpus Linguistics, 20(1), 30–54. Nation, I. S. P. (2004). A study of the most frequent word families in the British National Corpus. In P. Bogaards & B. Laufer (Eds.), Vocabulary in a second language (pp. 3–14). Amsterdam: John Benjamins. Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59–82. Nation, I. S. P. (2016). Making and using word lists for language learning and testing. Amsterdam: John Benjamins. Nation, I. S. P., & Webb, S. (2010). Researching and analyzing vocabulary. Boston: Heinle. O’Donnell, M. B. (2011). The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal, 35, 135–169.



Park, S. (2015). Methodology for a reliable academic vocabulary list. Unpublished doctoral dissertation, Northern Arizona University, Flagstaff, Arizona. Paquot, M. (2007). Towards a productively oriented academic word list. In J. Walinski, K. Kredens, & S. Gozdz-Roszkowski (Eds.), Practical applications in language and computers 2005 (pp. 127–140). Frankfurt: Peter Lang. Paquot, M. (2010). Academic vocabulary in learner writing: From extraction to analysis. New York: Continuum. Römer, U. (2010). Establishing the phraseological profile of a text type: The construction of meaning in academic book reviews. English Text Construction, 3(1), 95–119. Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. Scott, M., & Tribble, C. (2006). Textual patterns: Keyword and corpus analysis in language education. Amsterdam: John Benjamins. Shin, D., & Nation, I. S. P. (2008). Beyond single words: The most frequent collocations in spoken English. ELT Journal, 62(4), 339–348. Simpson-Vlach, R., & Ellis, N. (2010). An academic formulas list: New methods in phraseology research. Applied Linguistics, 31(4), 487–512. Ward, J. (2007). Collocation and technicality in EAP engineering. Journal of English for Academic Purposes, 6, 18–35. West, M. (1953). A general service list of English words. London: Longman. Weisser, M. (2016a). Profiling agents and callers: A dual comparison across speaker roles and British versus American English. In L. Pickering, E. Friginal, & S. Staples (Eds.), Talking at work: Corpus-based explorations of workplace discourse (pp. 99–126). London: Palgrave Macmillan. Weisser, M. (2016b). Practical corpus linguistics: An introduction to corpus-based language analysis. Oxford: Wiley-Blackwell. Zipf, G. (1936). The psychobiology of language. London: Routledge. Zipf, G. (1949). Human behavior and the principle of least effort. New York: Addison-Wesley.

Chapter 5

Analyzing Dispersion

Stefan Th. Gries

Abstract This chapter provides an overview of one of the most crucial but at the same time most underused basic statistical measures in corpus linguistics, dispersion, i.e. the degree to which occurrences of a word are distributed throughout a corpus evenly or unevenly/clumpily. I first survey a range of dispersion measures, their characteristics, and how they are computed manually; also, I discuss how different kinds of measures are related to each other in terms of their statistical behavior. Then, I address and exemplify the kinds of purposes to which dispersion measures are put in (i) lexicographic work and in (ii) some psycholinguistic explorations. The chapter then discusses a variety of reasons why, and ways in which, dispersion measures should be used more in corpus-linguistic work, in particular to augment simple frequency information that might be misleading; I conclude by discussing future directions in which dispersion research can go both in terms of how the logic of dispersion measures extends from frequencies of occurrence to co-occurrence and, potentially, even key words and in terms of how dispersion measures can be validated in future research on cognitive and psycholinguistic as well as applied-linguistics applications.

5.1 Introduction

Imagine a corpus linguist looking at a frequency list of the Brown corpus, a corpus aiming to be representative of written American English of the 1960s that consists of 500 samples, or parts, of approximately 2000 words each. Imagine further that the corpus linguist is looking at that list to identify verbs and adjectives within a certain frequency range – maybe because he needs to (i) create stimuli for a psycholinguistic experiment that control for word frequency, (ii) identify words from a certain

S. Th. Gries
University of California Santa Barbara, Santa Barbara, CA, USA
Justus Liebig University Giessen, Giessen, Germany
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_5



S. Th. Gries

frequency range to test learners' vocabulary, or (iii) compile a vocabulary list for learners, or some other application. Imagine, finally, that the frequency range he is currently interested in is between 35 and 40 occurrences per million words and, as he browses the frequency list for good words to use, he comes across an adjective and a verb – enormous and staining – that he thinks he can use because they both occur 37 times in the Brown corpus (and are even equally long), so he notes them down for later use and goes on. This is not an uncommon scenario and yet it is extremely problematic because, while that corpus linguist has indeed found words with the same frequency, he has probably not even come close to doing what he actually wanted to do. The frequency range of the words he was interested in – 35–40 – or the actual frequency of the two words discussed – 37 – may have been an operationalization for things that might have to do with how fast people can identify the word in a psycholinguistic experiment (as in a lexical decision task) or with how likely a learner would be to have encountered, and thus hopefully know, a word of that kind of rarity. However, chances are that this choice of words is highly problematic: While both words are equally long and equally frequent in one and the same corpus, they could hardly be more different with regard to the topic of this chapter, their dispersion, which probably makes them useless for the above-mentioned hypothetical purposes, controlled experimentation, vocabulary testing, or vocabulary lists. This is because

• the word enormous occurs 37 times in the corpus, namely once in 35 corpus parts and twice in 1 corpus part;
• the word staining occurs 37 times in the corpus, namely 37 times in 1 corpus part.
In other words, given its (relatively low) frequency, enormous is pretty much as evenly dispersed as a word with that frequency can possibly be while, given its identical frequency, staining is as unevenly dispersed as a word with that frequency can possibly be: enormous is characterized by even dispersion, staining is characterized by a most uneven dispersion, clumpiness, or, to use Church and Gale’s (1995) terms, high burstiness or bunchiness. In the following section, I will discuss fundamental aspects of the notion of dispersion, including some of the very few previous applications as well as a variety of dispersion measures that have been proposed in the past.
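The contrast between enormous and staining can be made concrete with the simplest dispersion statistics. The sketch below assumes, for simplicity, 500 exactly equal parts rather than the Brown corpus's approximately equal samples, and computes range plus Gries's (2008) deviation of proportions (DP) for the two distributions just described:

```python
def dp(counts, part_sizes=None):
    """Gries's (2008) deviation of proportions: 0.5 * sum(|obs_i - exp_i|),
    where obs_i = v_i / f and exp_i = part i's share of the corpus."""
    f = sum(counts)
    n = len(counts)
    sizes = part_sizes or [1 / n] * n   # default: equal-sized parts
    return 0.5 * sum(abs(v / f - s) for v, s in zip(counts, sizes))

# 500 equal parts, both words occurring 37 times, as described in the text
enormous = [1] * 35 + [2] + [0] * 464   # once in 35 parts, twice in 1 part
staining = [37] + [0] * 499             # all 37 tokens in a single part

range_enormous = sum(1 for c in enormous if c > 0)   # attested in 36 parts
range_staining = sum(1 for c in staining if c > 0)   # attested in 1 part
```

Both words have f = 37, but enormous is attested in 36 parts while staining is attested in only 1; DP likewise separates them (≈0.93 vs. ≈0.998, higher values indicating clumpier distributions).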

5.2 Fundamentals

5.2.1 An Overview of Measures of Dispersion

Corpus linguistics is an inherently distributional discipline: Virtually all corpus-linguistic studies with at least the slightest bit of a quantitative angle involve the frequency or frequencies with which

5 Analyzing Dispersion


• an element x occurs in a corpus or in a part of a corpus representing a register or variety or something else, . . . or
• an element x occurs in close proximity (however defined) to an element y in a corpus (or in a part of a corpus).

Also, any kind of more advanced corpus statistic – for instance, association measures (see Chap. 7) or key words statistics (see Chap. 6) – is ultimately based on the observation of, and computations based upon, such frequencies. However, just like trying to summarize the distribution of any numeric variable using only a mean can be treacherous (especially when the numeric variable is not normally distributed), so is trying to summarize the overall 'behavior' (or the co-occurrence preferences or the keyness) of a word x on the basis of just its frequency/frequencies because, as exemplified above, words with identical frequencies can exhibit very different distributional behaviors. On some level, this fact has been known for a long time. Baron et al. (2009) mention Fries and Traver's assessment that Thorndike was the first scholar to augment frequency statistics with range values, i.e. the numbers of corpus parts or documents in which words were attested at least once. However, this measure of range is rather crude: it does not take into consideration how large the corpus parts are in which occurrences of a word are attested, nor does its computation include how many occurrences of a word are in one corpus part – to have an effect on the range statistic, all that counts is a single instance. Therefore, during the 1970s, a variety of measures were developed to provide a better way to quantify the distribution of words across corpus parts; the best-known measures include Juilland's D (Juilland and Chang-Rodriguez 1964, Juilland et al. 1970), Carroll's D2 (Carroll 1970), and Rosengren's S (Rosengren 1971).
To discuss how these statistics and some other competing ones are computed, I am following the expository strategy of Gries (2008), who surveyed all known dispersion measures on the basis of a small fictitious corpus; ours here consists of the following five parts:

Part 1: b a m n i b e u p
Part 2: b a s a t b e w q n
Part 3: b c a g a b e s t a
Part 4: b a g h a b e a a t
Part 5: b a h a a b e a x a t


This 'corpus' has several characteristics that make it useful for the discussion of dispersion: (i) it is small so all computations can easily be checked manually, (ii) the sizes of the corpus parts are not identical, which is more realistic than if they were, and (iii) multiple corpus-linguistically relevant situations are built into the data:

• the words b and e are equally frequent in each corpus part (two times and one time per corpus part respectively), which means that their dispersion measures should reflect those even distributions;



• the words i, q, and x are attested in one corpus part each: i in the first corpus part (which has 9 elements), q in the second corpus part (which has 10 elements), and x in the fifth corpus part (which has 11 elements), which means these words are extremely clumpily distributed, but slightly differently so (because the corpus parts they are in differ in size);
• the word a, whose dispersion we will explore below, is attested in each corpus part, but with different frequencies.

To compute the measures of dispersion to be discussed here, a few definitions are in order; we will focus on the word a:

(1) l = 50 (the length of the corpus in words)
(2) n = 5 (the length of the corpus in parts)
(3) s = (0.18, 0.2, 0.2, 0.2, 0.22) (the percentages of the n corpus part sizes)
(4) f = 15 (the overall frequency of a in the corpus)
(5) v = (1, 2, 3, 4, 5) (the frequencies of a in each corpus part 1-n)
(6) p = (1/9, 2/10, 3/10, 4/10, 5/11) (the percentages a makes up of each corpus part 1-n)
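These definitions translate directly into code. The sketch below (using exact fractions to avoid rounding) derives the quantities for the word a from the part sizes and per-part frequencies given above:

```python
from fractions import Fraction

sizes = [9, 10, 10, 10, 11]                         # tokens in parts 1-5
v = [1, 2, 3, 4, 5]                                 # (5) frequencies of "a" per part
l = sum(sizes)                                      # (1) corpus length in words
n = len(sizes)                                      # (2) number of corpus parts
s = [Fraction(sz, l) for sz in sizes]               # (3) part sizes as proportions
f = sum(v)                                          # (4) overall frequency of "a"
p = [Fraction(vi, sz) for vi, sz in zip(v, sizes)]  # (6) share of each part that is "a"
```

These values (l = 50, n = 5, s = 0.18 ... 0.22, f = 15) are the inputs to all the measures computed below.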

The most important dispersion measures – because of their historical value and evaluation studies discussed below – are computed as discussed in what follows; see Gries (2008) for a more comprehensive overview. The simplest measure is the range, i.e. the number of corpus parts in which the element in question, here a, is attested, which is computed as in (7):

(7) range = number of parts containing a = 5

Then, there are two traditional descriptive statistics, the first of which is the standard deviation of the frequencies of the element in question in all corpus parts (sd, see (8)). This measure requires taking every value in v, subtracting from it the mean of v (f/n, i.e. 3), squaring those differences, and summing them up; then one divides that sum by the number of corpus parts n and takes the square root of that quotient:

(8) sd_population = sqrt( sum_{i=1}^{n} (v_i − f/n)² / n ) ≈ 1.414   (sd_sample has n−1 in the denominator)
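A minimal check of (8), computed from v and the mean f/n:

```python
import math

v, n, f = [1, 2, 3, 4, 5], 5, 15   # per-part frequencies of "a", parts, overall frequency
mean = f / n                        # = 3.0
sd_population = math.sqrt(sum((vi - mean) ** 2 for vi in v) / n)
sd_sample = math.sqrt(sum((vi - mean) ** 2 for vi in v) / (n - 1))
```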

A maybe more useful variant of this measure is its normalized version, the variation coefficient (vc, see (9)); the normalization consists of dividing sd_population by the mean frequency of the element in the corpus parts, f/n:

(9) vc_population = sd_population(v) / mean(v) ≈ 0.471   (vc_sample would use sd_sample)
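The variation coefficient in (9) can be checked the same way, as the standard deviation over the mean of v:

```python
import math

v = [1, 2, 3, 4, 5]                 # per-part frequencies of "a"
mean = sum(v) / len(v)
sd_population = math.sqrt(sum((vi - mean) ** 2 for vi in v) / len(v))
vc_population = sd_population / mean
```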

The version of Juilland’s D that can handle differently large corpus parts is then computed as shown in (10). In order to accommodate the different sizes of the corpus parts, however, the variation coefficient is not computed using the observed frequencies v1-n (i.e. 1, 2, 3, 4, 5 in files 1 to 5 respectively, see (5) above) but using

5 Analyzing Dispersion


the percentages in p_1-n (i.e. how much of each corpus part is made up by the element in question, i.e. 1/9, 2/10, 3/10, 4/10, 5/11, see (6) above), which is what corrects for differently large corpus parts:

(10) Juilland's D: 1 - \frac{sd_{population}(p)}{mean(p)} \times \frac{1}{\sqrt{n-1}} \approx 0.785
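Juilland's D for a can be verified the same way (a sketch, using the proportions p given above):

```python
import math

p = [1/9, 2/10, 3/10, 4/10, 5/11]   # proportions of a in each corpus part
n = len(p)
mean_p = sum(p) / n

# variation coefficient of the proportions, then Juilland's D as in (10)
sd_p = math.sqrt(sum((x - mean_p) ** 2 for x in p) / n)
vc_p = sd_p / mean_p
juilland_d = 1 - vc_p / math.sqrt(n - 1)

print(round(juilland_d, 3))  # 0.785
```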

Carroll’s D2 is essentially a normalized version of the entropy of the proportions of the element in each corpus part, as shown in (11) (see also Gries 2013: Sect. for general applications of this measure). The numerator computes the entropy of the percentages in p_1-n, while dividing it by log2 n amounts to normalizing it against the maximally possible entropy given the number of corpus parts n:

(11) Carroll's D2: \frac{-\sum_{i=1}^{n} \frac{p_i}{\sum p} \times \log_2 \frac{p_i}{\sum p}}{\log_2 n} \approx 0.938
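A sketch of the same computation for Carroll's D2 (again using the proportions p from the running example):

```python
import math

p = [1/9, 2/10, 3/10, 4/10, 5/11]
n = len(p)
total = sum(p)
norm = [x / total for x in p]       # proportions normalized to sum to 1

# entropy of the normalized proportions, divided by the maximal entropy log2(n)
entropy = -sum(x * math.log2(x) for x in norm)
carroll_d2 = entropy / math.log2(n)

print(round(carroll_d2, 3))  # 0.938
```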

The version of Rosengren’s S that can handle differently large corpus parts is shown in (12). Each corpus part’s size in percent (in s) is multiplied by the frequency of the element in question in that part (in v_1-n); of each product, one takes the square root; those square roots are summed up; that sum is squared and then divided by the overall frequency of the element in question in the corpus (f):

(12) Rosengren's (1971) S_adj: \left(\sum_{i=1}^{n} \sqrt{s_i \cdot v_i}\right)^2 \times \frac{1}{f} \approx 0.95   (with min S = 1/n)
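Rosengren's S_adj can likewise be verified (part sizes s and frequencies v as defined in the running example):

```python
import math

s = [0.18, 0.2, 0.2, 0.2, 0.22]   # corpus part sizes in percent
v = [1, 2, 3, 4, 5]               # frequencies of a in each corpus part
f = sum(v)                        # overall frequency of a: 15

# Rosengren's S_adj as in (12): square the sum of sqrt(s_i * v_i), divide by f
s_adj = sum(math.sqrt(si * vi) for si, vi in zip(s, v)) ** 2 / f

print(round(s_adj, 2))  # 0.95
```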

Finally, Gries (2008, 2010) and the follow-up by Lijffijt and Gries (2012) proposed a measure called DP (for deviation of proportions), which falls between 0 (for an extremely even distribution) and 1 - min s (for an extremely clumpy distribution), as well as a normalized version of DP, DP_norm, which falls between 0 and 1; both are computed as shown in (13). For DP, one computes the differences between how much of the element in question is in each corpus part in percent on the one hand and the sizes of the corpus parts in percent on the other – i.e. the differences between observed and expected percentages. Then, one adds up the absolute values of those differences and multiplies the sum by 0.5; the normalization then consists of dividing this value by the theoretically maximal value of DP given the number of corpus parts (in a way reminiscent of (11)):1

(13) DP: 0.5 \times \sum_{i=1}^{n} \left| \frac{v_i}{f} - s_i \right| = 0.18   and   DP_norm: \frac{DP}{1 - \min s} \approx 0.22

The final measure to be discussed here is one that, as far as I can tell, has never been proposed as a measure of dispersion, but seems to me to be ideally suited to be one, namely the Kullback-Leibler (or KL-) divergence, a non-symmetric measure that quantifies how different one probability distribution (e.g., the distribution of all the occurrences of a across all corpus parts, i.e. v/f) is from another (e.g., the

1 As pointed out by Burch et al. (2017), DP_norm is equivalent to a measure called ADA (for average deviation analog) proposed by Wilcox (1973).


S. Th. Gries

Table 5.1 Dispersion measures for several ‘words’ in the above ‘corpus’

                  b            i          q          x
Range             5            1          1          1
sd/vc             0/0          0.4/2      0.4/2      0.4/2
Juilland's D      0.968        0          0          0
Carroll's D2      0.999        0          0          0
Rosengren's S     0.999        0.18       0.2        0.22
DP/DPnorm         0.02/0.024   0.82/1     0.8/0.976  0.78/0.951
KL-divergence     0.003        2.474      2.322      2.184

corpus part sizes s); the KL-divergence is computed as shown in (14):

(14) KL-divergence: \sum_{i=1}^{n} \frac{v_i}{f} \times \log_2\left(\frac{v_i}{f} \times \frac{1}{s_i}\right) \approx 0.137   (with \log_2 0 := 0)

Table 5.1 shows the corresponding results for several elements in the above ‘corpus’. The results show that, for instance, b is really distributed extremely evenly (since it occurs twice in each file and all files are nearly equally large). Note in particular how the values of Rosengren’s S, DP, and the KL-divergence for i, q, and x differ: all three occur only once in the corpus, in only one corpus part, but what differs is the size of that corpus part, and the larger the corpus part in which the single instance of i, q, or x is attested, the more even/expected that distribution is.

In sum, corpus linguists have proposed quite a few different measures of dispersion, most of which are generally correlated with each other, but which also react differently to the kinds of distributions one finds in corpus data, specifically:
• the (potentially large) number of corpus parts in which an element is not attested;
• the (potentially large) number of corpus parts in which an element is attested much less often than the mean;
• the range of distributions a corpus linguist would consider to be different but that would yield the same dispersion measure(s);
• the number of different corpus parts a corpus linguist would assume and their (even or uneven) sizes.2
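For readers who want to verify the DP, DP_norm, and KL-divergence values for a given in (13) and (14), a minimal Python sketch (part sizes and frequencies as in the running example):

```python
import math

s = [0.18, 0.2, 0.2, 0.2, 0.22]   # corpus part sizes in percent
v = [1, 2, 3, 4, 5]               # frequencies of a in each corpus part
f = sum(v)

# DP: half the summed absolute differences between observed and expected percentages
dp = 0.5 * sum(abs(vi / f - si) for vi, si in zip(v, s))
dp_norm = dp / (1 - min(s))

# KL-divergence of the observed distribution v/f from the part sizes s
# (terms with v_i = 0 are skipped, i.e. log2(0) is treated as 0)
kl = sum((vi / f) * math.log2((vi / f) / si) for vi, si in zip(v, s) if vi > 0)

print(round(dp, 2))       # 0.18
print(round(dp_norm, 2))  # 0.22
print(round(kl, 3))       # 0.137
```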

2 Some dispersion measures do not require a division of the corpus into parts and/or also involve the differences between successive mentions in a corpus. These are theoretically interesting alternatives, but there seems to be virtually no research on them; see Gries (2008, 2010) for some review and discussion, as well as the Further Reading section for a brief presentation of one such study, Savický & Hlaváčová (2002).



5.2.2 Areas of Application and Validation

There are a few areas where dispersion information is now considered at least occasionally, though still much too infrequently. The area of research/application where dispersion has gained most ground is that of corpus-based dictionaries and vocabulary lists. Leech et al. (2001) discuss dispersion information of words in the British National Corpus (BNC) and remark that, as in the enormous/staining example above, for instance, the words HIV, lively, and keeper are approximately equally frequent in the corpus but are very differently dispersed in the corpus; they proceed to use Juilland’s D as their measure of choice. Similarly, Davies and Gardner (2010) and Gardner and Davies (2014) also use Juilland’s D in their frequency dictionary and academic vocabulary list, as does Paquot (2010) for her academic keyword list.

It is worth pointing out in this connection that, especially in this domain of dictionaries/vocabulary lists, researchers have often also computed what is called an adjusted frequency, i.e. a frequency that is adjusted downwards depending on the clumpiness/unevenness of the distribution. In the mathematically simplest case, the adjusted frequency is the observed frequency of the word in the corpus times the dispersion value; for instance, Juilland’s usage coefficient U is just that: the frequency of the word in the corpus f times Juilland’s D, a measure that, for instance, Davies and Gardner (2010) use. In the above case, for the word a, U = 15 × 0.785 ≈ 11.78, whereas for q, U = 1 × 0 = 0; similar adjusted frequencies exist for Carroll’s D2 (the so-called Carroll’s Um) and Rosengren’s S (the so-called Rosengren’s AF).

Another area where dispersion information has at least occasionally been recognized as important is psycholinguistics, in particular the domain of lexical decision tasks.
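Such adjusted frequencies are trivial to compute; here is a minimal sketch (using the rounded D = 0.785 for a, hence 11.775 rather than a value based on the unrounded D):

```python
# Juilland's usage coefficient U: observed corpus frequency times Juilland's D
def usage_coefficient(freq, juilland_d):
    return freq * juilland_d

print(round(usage_coefficient(15, 0.785), 3))  # 11.775 (for a)
print(usage_coefficient(1, 0))                 # 0 (for q)
```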
Consider, for instance, Schmid’s (2010:115) concise summary: “frequency is one major determinant of the ease and speed of lexical access and retrieval, alongside recency of mention in discourse.” And yes, for many decades now, (logged) frequency of occurrence has been known to correlate with reaction times to word/non-word stimuli. However, compared to frequency, the other major determinant, recency, has been considered much less in cognitive and psycholinguistic work. This is somewhat unexpected because there are general arguments that support the importance of dispersion as a cognitively relevant notion, as the following quote demonstrates:

    Given a certain number of exposures to a stimulus, or a certain amount of training, learning is always better when exposures or training trials are distributed over several sessions than when they are massed into one session. This finding is extremely robust in many domains of human cognition. (Ambridge et al. 2006:175)

Ambridge et al. do not mention dispersion directly, but what would be its direct corpus-linguistic operationalization. Similarly, Adelman et al. (2006:814) make the valid point that “the extent to which the number of repeated exposures to a particular item affects that item’s later retrieval depends on the separation of the exposures in time and context,” and of course the corpus-linguistic equivalent to this “separation of the exposures in time and context” is dispersion.



More empirically, there are some studies providing supporting evidence for the role of dispersion when it comes to lexical decision tasks. One such study is in fact Adelman et al. (2006). Their study has a variety of general problems:
• they only use the crudest measure of dispersion possible (range) and do not relate their work to previous, more psychological/psycholinguistic work that also studied the role of range (such as Ellis 2002a, b);
• they do not establish any relation to the notion of dispersion in corpus linguistics and, somewhat worse even, refer to range with the misleading label contextual diversity, when in fact the use of a word in different corpus parts by no means implies that the actual contexts of the word are different: no matter in how many different corpus parts hermetically is used, it will probably nearly always be followed by sealed.

Nonetheless, they do show that dispersion is a better and more unique predictor of word naming and lexical decision times than token frequency, and they, like Ellis (2011), draw an explicit connection to Anderson’s rational analysis of memory. More evidence for the importance of dispersion is offered by Baayen (2010), who includes range in the BNC as a predictor in a multifactorial model which ultimately suggests that the effect of frequency – when considered a mere repetition-counter as opposed to some other cognitive mechanism – is in fact epiphenomenal and can partly be explained by dispersion, and by Gries (2010), who shows that lexical decision times from Baayen (2008) are most highly correlated with vc and DP/DPnorm (see Representative Study 2 for details).

In spite of all the effort that has apparently gone into developing measures of dispersion, and in spite of uneven dispersion posing a serious threat to the validity of virtually all corpus-based statistics, it is probably fair to say that dispersion is still far from being routinely included in both (more) theoretical research and (more) practical applications.
One early attempt to study the behavior of these different measures is Lyne (1985), who compared D, D2, and S to each other using 30 words from the French Business Correspondence Corpus, which for that application was divided into 5 equally large parts; on the basis of testing all possible ways in which 10 words can be distributed over 5 corpus parts, Lyne concludes that Juilland’s D performs best (see also Lyne 1986). However, there is little research that includes dispersion on a par with frequency or other corpus statistics, and even less work that attempts to elucidate which measures are best (for what purpose); two studies that begin to work on this important issue are summarily discussed below.



Representative Study 1
Biber, D., Reppen, R., Schnur, E., and Ghanem, R. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4): 439–464.

Starting out from observations in Gries (2008), Biber et al. (2016) is one of the most comprehensive tests, if not the most comprehensive one, of how the perceived default of Juilland’s D behaves, in particular with contemporary corpora that are large and have many different corpus parts, i.e. high values of n. They begin by discussing the mathematical characteristics of Juilland’s D, in particular the fact that the formula shown above in (10) increases “degrees of uniformity” (i.e. evenness of distribution/dispersion across corpus parts) “as the number of corpus parts is increased” (Biber et al. 2016:443); thus, the larger the corpora one considers, the more likely one is to use a relatively large number of corpus parts (for reasons of statistical sampling), and the more Juilland’s D is reduced, which “inflat[es] the estimate of uniformity, and overall, greatly reduc[es] the effective range of values for D” (p. 444).

Biber et al. then proceed with two case studies. The first one explores the D-values of a set of words in the British National Corpus, which, for the purpose of testing what effect the number of corpus parts n one assumes has, was divided into n = 10, 100, and 1,000 equal-sized parts; crucially, the words explored were words for which theoretical considerations would lead an analyst to expect fairly different D-values, contrasting words such as at, all, or time (which should be distributed fairly evenly) with words such as erm, ah, and urgh (which, given their preponderance in spoken data, should be distributed fairly unevenly). Specifically, they analyzed 153 words in 10 categories emerging from crossing (i) several different word frequency bands and (ii) expected distribution (uniform, writing-skewed, and speech-skewed).
In this first case study, they find the expected high D-values for higher-frequency words that would be uniformly distributed or skewed towards writing (i.e. the 90% majority of the BNC), regardless of n. However, they also discover that the D-values for lower-frequency writing-skewed words are quite sensitive to variations of n. Their concern that these results are not due to the larger sampling sizes reflecting the dispersions more accurately is supported by what they find for the speech-skewed words, namely “extremely large discrepancies even for the most frequent speech-skewed words” (p. 450). More precisely, D-values for high-frequency speech-skewed words can vary between very high (e.g. 0.885 for yeah with n = 1000) and very low (e.g.



0.286 for yeah with n = 10). Even more worryingly, “[t]hese discrepancies become even more dramatic as [they] consider moderate and lower-frequency words” (p. 452), with differences in D-values frequently exceeding 0.5 just because of varying n, which on a scale from 0 to 1 of course corresponds to what seems to be an unduly large effect. Their main conclusion from the first case study is that “D values based on 1,000 corpus parts completely fail to discriminate among words with uniform versus skewed distributions in naturalistic data” (p. 454).

In their second case study, Biber et al. created different data sets with, therefore, known distributions of target words across different numbers of corpus parts, but the bottom line of this more controlled case study is in fact the same as that of the first. Their perhaps most extreme, and thus worrying, result is that

    the exact same distribution of a target word – a uniform distribution across 10% of a corpus – can result in a D value of 0.0 when the computation is based on a corpus split into 10 parts, versus a D value of 0.905 when the computation is based on a corpus split into 1000 parts. (p. 457)

As a more useful alternative, they propose to use Gries’s (2008) DP. They recommend DP because it is conceptually simple, can easily handle unequally large corpus parts, and “it seems to be a much more reliable estimate of dispersion (and uniformity) in large corpora divided into many corpus parts” (p. 459). In a direct comparison with Juilland’s D, they show that DP not only returns values from a usefully wider range when given a diverse set of differently dispersed words, but also reacts differently to larger values of n: (1-DP) values are consistently lower for corpus divisions into many parts, which Biber et al. interpret as being desirably compatible with the expected benefits of the finer-grained sampling that comes with increasing n:

    Theoretically, we would expect more conservative estimates of dispersion based on a large number of corpus parts. For example, it is more likely that a word will occur in 6 out of 10 corpus parts than for that same word to occur in 600 out of 1000 corpus parts. The values for 1-DP seem to reflect this fact, resulting in consistently lower values when computations are based on a large number of corpus parts. In summary, DP is clearly more effective than D at discriminating between uniform versus skewed distributions in a corpus, especially when it is computed based on a large number of corpus-parts. (Biber et al. 2016:460)

Biber et al. conclude with a plea for more validation and triangulation when it comes to developing corpus-linguistic statistics and/or more general methods.



Representative Study 2
Gries, S.T. 2010. Dispersions and adjusted frequencies in corpora: further explorations. In Corpus linguistic applications: current studies, new directions, eds. Gries, S.T., Wulff, S., and Davies, M., 197–212. Rodopi, Amsterdam.

The second representative study to be discussed here is concerned with dispersion and its role in psycholinguistic contexts. Gries (2010) is an attempt to provide at least a first glimpse at how different dispersion measures behave statistically and predictively when studied in conjunction with psycholinguistic (reaction time) data. To that end, he conducted two kinds of case studies: first, he explored the degree to which the many existing measures capture similar kinds of dispersion information by examining their intercorrelations; second, he computed the correlations between raw frequency, all dispersion measures, and all adjusted frequencies on the one hand and experimentally obtained reaction time data from lexical decision tasks in psycholinguistics on the other. In what follows, I briefly discuss these two case studies.

As for the first case study, he extracted all word types from the spoken component of the BNC that occur 10 or more times – there are approx. 17,500 such types – and computed all 29 dispersion measures and adjusted frequencies cataloged in the most recent overview article, Gries (2008). All measures were z-standardized (to make their different scales more comparable) and then used as input to both hierarchical agglomerative cluster analyses (see Chap. 18) and principal component analyses (see Chap. 19), separately for dispersion measures and adjusted frequencies. For the former, he used 1 − Pearson’s r (see Chap. 17) as a similarity measure and Ward’s method as an amalgamation rule. The results from both analyses revealed several relatively clear groupings of measures.
For instance, the following clusters/components were well established in both the cluster and the principal component analysis:
• Rosengren’s S, range, and a measure called Distributional Consistency (Zhang et al. 2004);
• Juilland’s D, Carroll’s D2, and a measure called D3 based on chi-squared (Lyne 1985); and, more heterogeneously,
• DP, DPnorm, vc, and idf (inverse document frequency, see Spärck Jones 1972 and Robertson 2004);
• frequency, the maxmin measure (the difference between max(v1-n) and min(v1-n)), and sd.

In fact, the principal component analysis revealed that just two principal components capture more than 75% of the variance in the 16 dispersion



measures explored: many measures behave quite similarly and fall into several smaller groups. Nevertheless, the results also show that the groups of measures sometimes behave quite dissimilarly: “different measures of dispersion will yield very different (ranges of) values when applied to actual data” (Gries 2010:204, his emphasis).

With regard to the adjusted frequencies, the results are less diverse and, thus, more reassuring. All measures but one behave relatively similarly, which is mostly interesting because it suggests (i) that the differences between the adjusted frequencies are less likely to yield very different results, but also (ii) that the computationally very intensive distance-based measures that have been proposed (see in particular Savický and Hlaváčová 2002 as well as Washtell 2007) do not appear to lead to fundamentally different results; given that these measures’ computing time can be ten times as long or much longer for large corpora, this suggests that the simpler-to-compute ‘classics’ might do the job well enough.

The second case study in this paper involves correlating dispersion measures and adjusted frequencies with response time latencies from several psycholinguistic studies, specifically with (i) data from young and old speakers from Spieler and Balota (1997) and Balota and Spieler (1998), and (ii) data from Baayen (2008). All dispersion measures and adjusted frequencies were centered and then correlated with these reaction times (using Kendall’s τ, see Chap. 17). For the Balota/Spieler data, the results indicate that some measures score best (including, for instance, AF, U, and DP), but most measures’ correlations with the reaction times are very similar.
However, for the reaction times of Baayen (2008), a very different picture emerges: While DP scores very well, only surpassed by vc, there is a distinct cline such that some measures really exhibit only very low and/or insignificant correlations with the psycholinguistic comparison data. Gries concludes with some recommendations: Many dispersion measures are relatively similar, but if one is uncertain what measure to trust, it would be useful to compute measures that his cluster/principal component analyses considered relatively different to get a better picture of the diversity in the data; at present and until more data have been studied, it seems as if the computationally more demanding measures may not be worth the effort. Trivially, more analyses (than Lyne’s really small study) are needed, in particular of larger data sets and, along the lines of what Biber et al. (2016) did, of data sets with known distributional characteristics.



5.3 Critical Assessment and Future Directions

The previous sections already touched upon some recommendations for future work. It has hopefully become clear that dispersion is as important an issue as it is still neglected or even completely ignored. While every corpus linguist with only the slightest bit of statistical knowledge knows never to present a mean or median without a measure of dispersion, the exact same advice is hardly ever heeded when it comes to frequencies and dispersions in corpus data: there are really only very few studies that report frequency data and dispersion or, just as importantly, report frequencies and association measures and dispersion, although Gries (2008) has shown that the computation of association measures is just as much at risk as frequencies when dispersion information is not also considered. Thus, the first desideratum is that more research take the threat of underdispersion/clumpiness much more seriously; strictly speaking, reviewers should always request dispersion information so that readers can more reliably infer what reported frequencies or association measures really represent, or whether they represent what they purport to represent.

Second, we need more studies of the type discussed in the representative studies boxes so that we better understand the different measures’ behavior in actual but also controlled/designed data. One issue, for instance, has to do with how corpora are divided into how many parts and how this affects dispersion measures (see for example Biber et al.’s 2016 discussion of the role of the denominator in Juilland’s D, which features the number of corpus parts). Another is how dispersion measures relate to issues outside of corpus linguistics such as, again, psycholinguistically- or cognitively-informed approaches. This is particularly relevant for measures that are advertised as having certain characteristics.
To discuss just one example, Kromer (2003:179) promotes his adjusted frequency measure by pointing to its interdisciplinary/psycholinguistic utility/validity:

    From our point of view, all usage measures considered above have one common disadvantage: their introduction and application are not based psycholinguistically. A usage measure, free from the disadvantage mentioned, is offered below.

However, the advantage is just asserted, not demonstrated, and in Gries (2010) at least – the only study I am aware of that tests Kromer’s measure – his measure scored worse than most others when explicitly compared to psycholinguistic reference data. While that does of course not mean that Kromer’s measure has been debunked, it shows what is needed: more and explicit validation.

That being said, a certain frequent trend in corpus-linguistic research should be resisted, and this is best explained with a very short excursus on association measures (see Chap. 7), where the issue at hand has been recognized earlier than it has in the little existing dispersion research. For several decades now, corpus linguists have discussed dozens of association measures that are used to rank-order, for instance, collocations by the attraction of their constituent words. Some of these measures are effect sizes in the sense that they do not change if the co-occurrence tables from which they are computed are increased by some factor (e.g., the odds ratio); others are based on significance tests, which means they conflate both sample size/actual



Fig. 5.1 The correlation of frequency and DP of words in the spoken BNC

observed frequencies and effect size (e.g., the probably most widely-used measure, the log-likelihood ratio). This is relevant in the present context of dispersion measures because we are now facing a similar issue in dispersion research, namely when researchers and lexicographers also take two dimensions of information – frequency and the effect size of dispersion – and conflate them into one value such as an adjusted frequency (e.g., by multiplication, see Juilland’s U above). To say it quite bluntly, this is a mistake: frequency and dispersion are two different pieces of information, which means conflating them into a single measure loses a lot of information.

This is true even though frequency and dispersion are correlated, as is shown in Fig. 5.1 and Fig. 5.2. Both have word frequency on the x-axis (logged to the base of 10) and a dispersion measure (DP in Fig. 5.1, range in Fig. 5.2) on the y-axis, and both have words represented by grey points. Also, in both plots, the words have been divided into 10 frequency bins, for each of which a blue whisker and the numbers above and below it represent the range of the dispersion values in that frequency bin. For example, in Fig. 5.1, the 6th frequency bin from the left includes words with frequencies



Fig. 5.2 The correlation of frequency and range of words in the spoken BNC

between 2036 and 5838 and DP values between 0.23 and 0.86, i.e. a DP-range of 0.63, also noted in blue at the bottom of the scatterplot. Obviously, there are the expected correlations between frequency and dispersion (R2 = 0.832 for logged frequency and DP), but just as obviously, especially in the middle range of frequencies – ‘normal content words’ with frequencies between 1000 and 10,000 – words can have extremely similar frequencies but still extremely different dispersions.

This means several things. First, even though there is the above-mentioned overall correlation between frequency and dispersion, this correlation can be very much weakened in certain frequency bins; for example, in the 6th frequency bin, R2 for the correlation between frequency and dispersion is merely 0.086. Second, a relatively ‘specialized’ word like council is in the same (6th) frequency bin (freq = 4386, DP = 0.72, range = 292 out of 905) as intuitively more ‘common/widespread’ words like nothing, try, and whether (freqs = 4159, 4199, 4490; DPs = 0.28, 0.28, 0.32; ranges = 652, 664, 671 out of 905); in both plots, the positions of council and nothing are indicated with the c and the n respectively plotted into the graph.



Also, even just in the sixth frequency band, the extreme range values that are observed are 85/905 ≈ 9.4% vs. 733/905 ≈ 81% of the corpus files, i.e. huge differences between words that, in a less careful study that ignores dispersion, would simply be considered ‘similar in frequency’. Finally, these graphs also show that forcing frequency and dispersion into one value, e.g. an adjusted frequency, would lose a huge amount of information. This is obvious from the visual scatter in both plots, but also just from simple math: if a researcher reports an adjusted frequency of 35 for a word, one does not know whether that word occurs 35 perfectly evenly distributed times in the corpus (i.e., frequency = 35 and, say, Juilland’s D = 1) or whether it occurs 350 very unevenly distributed times in the corpus (i.e., frequency = 350 and, say, Juilland’s D = 0.1).

And while this example is of course hypothetical, it is not as unrealistic as one might think. For instance, the products of observed frequency and 1−DP for the two words pull and chairman in the spoken BNC are very similar – 375 and 368.41 respectively – but they result from very different frequencies and DP-values: 750 and 0.5 for pull but 1939 and 0.81 for chairman. Not only is it the dispersion value, not the frequency, that reflects our intuition (that pull is more basic/widely-used than chairman) much better, but this also shows that we would probably not want to treat those two cases as ‘the same’, as we would if we simply computed and reported some conflated adjusted frequency. Thus, keeping frequency and dispersion separate allows researchers to preserve important information, and it is therefore important that we do not give in to the temptation of ‘a single rank-ordering scale’ and simplify beyond necessity/merit – what is needed is more awareness of and sophistication about how words are distributed in corpora, not blunting our research tools.
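The pull/chairman comparison can be reproduced with a couple of lines (a sketch; the frequencies and DP values are those cited above for the spoken BNC):

```python
# adjusted frequency as observed frequency times (1 - DP):
# very different (frequency, DP) pairs can yield near-identical values
def adjusted_freq(freq, dp):
    return freq * (1 - dp)

print(round(adjusted_freq(750, 0.5), 2))    # 375.0  (pull)
print(round(adjusted_freq(1939, 0.81), 2))  # 368.41 (chairman)
```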
In all fairness, even if one decides to keep the two dimensions separate, as one definitely should, there still is an additional unresolved question, namely what kind of threshold value(s) to choose for (frequency and) dispersion. It is unfortunately not clear, for instance, what dispersion threshold to adopt to classify a word as ‘evenly dispersed enough for it to be included in a dictionary’: DP = 0.4/D = 0.8? DP = 0.45/D = 0.85? In the absence of more rigorous comparisons of dispersion measures to other kinds of reference data, at this point any cut-off point is arbitrary (see Oakes and Farrow 2007:92 for an explicit admission of this fact). Future research will hopefully both explore which dispersion measures are best suited for which purpose and how their relation to frequency is best captured. In order to facilitate this necessary line of research, an R function computing dispersion measures and adjusted frequencies is provided at the companion website of this chapter, see Sect. 5.4; hopefully, this will inspire more research on this fundamental distributional feature of linguistic elements and its impact on other corpus statistics such as association measures, key (key) words, and others.



5.4 Tools and Resources

Dispersion is a corpus statistic that has not been widely implemented in existing corpus tools, and it is arguably a statistic that, unlike others, is less obvious to implement, which is why all implementations of dispersion in such general-purpose tools probably leave something to be desired. This is for two main reasons. First, most tools offer only a very small number of measures, if any, and no ways to implement new ones or tweak existing ones. Second, most existing dispersion measures require a division of the corpus into parts, and the decision of how to do this is not trivial. While ready-made corpus tools such as WordSmith Tools or AntConc might assume for the user that the corpus parts to be used are the n (a user-defined number) equally-sized parts a corpus can be divided into, or the separate files of the corpus, this may actually not be what is required for a certain study if, for instance, sub-divisions in files are to be considered as well (as might be useful for some files in the BNC) or when groupings of files into (sub-)registers are what is of interest.

To mention a few concrete examples: WordSmith Tools offers a dispersion plot as well as range and Juilland’s D-values (without explicitly stating that that is in fact the statistic that is provided), while AntConc offers a version of a dispersion plot separately for each file of a corpus, which is often not what one needs. The COCA-associated website https://www.wordfrequency.info/ (accessed 22 May 2019) provides the data that went into Davies and Gardner (2010), which means they provide Juilland’s D for the corpus when split up into 100 equally-sized parts. As is obvious, the range of features is extremely limited and virtually non-customizable.

By far the best – in the sense of most versatile and powerful – approach to exploring issues of dispersion is with programming languages such as R or Python (see Chap.
9), because then the user is not dependent on measures and settings enshrined in ready-made software but can customize an analysis in exactly the way that is needed, develop their own methods, and/or run such analysis on data/annotation formats that none of the above tools can handle. This chapter comes with some companion code for readers to explore as well as an R function to compute a large number of dispersion measures for data provided by a user. This function is an update of the function provided in Gries (2008), which adds the KLdivergence as a dispersion measure, updates the computation of some measures, cleans up the code, and drastically speeds up all computations; see the companion website for how to use it.
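To illustrate the kind of parts-based computation such a function performs, here is a minimal Python sketch of two measures discussed in this chapter, Juilland’s D and Gries’s DP. The function names and toy data are our own; the chapter’s companion R function remains the authoritative implementation.

```python
import math
import statistics
from typing import Sequence

def juilland_d(freqs: Sequence[int], part_sizes: Sequence[int]) -> float:
    """Juilland's D: 1 = perfectly even dispersion, 0 = maximally uneven."""
    n = len(freqs)
    rel = [f / p for f, p in zip(freqs, part_sizes)]  # per-part relative frequencies
    cv = statistics.pstdev(rel) / (sum(rel) / n)      # coefficient of variation
    return 1 - cv / math.sqrt(n - 1)

def gries_dp(freqs: Sequence[int], part_sizes: Sequence[int]) -> float:
    """Gries's DP: 0 = perfectly even dispersion; values near 1 = clumped."""
    s = [p / sum(part_sizes) for p in part_sizes]  # expected proportions (part sizes)
    v = [f / sum(freqs) for f in freqs]            # observed proportions (occurrences)
    return 0.5 * sum(abs(vi - si) for vi, si in zip(v, s))

# A word occurring 10 times in a 5,000-word corpus split into 5 equal parts:
parts = [1000] * 5
even = [2, 2, 2, 2, 2]       # perfectly even  -> D = 1.0, DP = 0.0
clumped = [10, 0, 0, 0, 0]   # maximally clumped -> D = 0.0, DP = 0.8
```

Note that with n equally-sized parts the maximum DP is 1 − 1/n (here 0.8), which is one reason parts definitions matter for interpreting the values.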

Further Reading Burch, B., Egbert, J., and Biber, D. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2):189–216. Burch et al. (2017) is a study that introduces another dispersion measure DA (or MDA in Wilcox’s 1973 terminology) and compares it to the historically most


S. Th. Gries

widely-used dispersion measure, Juilland’s D, and to the recently-proposed measure of Gries’s DP. They define DA and test its performance by, for instance, a simulation study of three different scenarios, creating randomly sampled corpora and comparing the three different dispersion statistics. They also correlate the dispersion statistics for 150 words taken from the British National Corpus using scatterplots and pairwise differences of dispersion statistics. It is worth pointing out, as the authors also do, that (i) this study is based on the overall probably less realistic scenario that all corpus parts are equally large, which is not that likely when corpus parts are considered to be files (e.g., in the BNC) or (sub-)registers (e.g. in the ICE-GB), and that (ii) computing DA can take literally thousands of times longer than D or DP even though its non-linear correlation R2 with DP exceeds 0.99. That being said, their study is nonetheless a good example of exactly the kind of study we need more of to further our understanding of (i) how different dispersion measures react to corpus-linguistic data and (ii) how they react to certain kinds of potentially extreme input data.

Savický, P., and Hlaváčová, J. 2002. Measures of word commonness. Journal of Quantitative Linguistics 9(3):15–31. Savický and Hlaváčová (2002) is another interesting reading. Their study starts out from the question of how to identify “common” words to be included in a universal dictionary. However, they propose to approach dispersion in ways that do not require a division of a corpus into parts – rather, the corpus is treated as a single sequence or vector of words, and dispersion is then used to compute corrected frequencies that are close to the actual observed frequencies when a word is very evenly distributed and (much) smaller when it is not.
They propose three different corrected frequencies – one based on Average Reduced Frequency (fARF), one based on Average Waiting Time (fAWT), and one based on Average Logarithmic Distance (fALD) – and proceed to apply them to data from the Czech National Corpus to test the measures’ stability (how much do they vary when applied to different parts of the overall corpus?) and to exemplify the kinds of words that the measures return as highly unevenly distributed. While these dispersion measures can take much longer to compute than the parts-based measures reported on above, and adjusted frequencies are problematic for the reasons discussed above, this paper is nonetheless noteworthy and interesting for its novel, non-parts-based approach to dispersion.
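As a rough illustration of this non-parts-based logic, the following Python sketch computes a corrected frequency along the lines of Average Reduced Frequency: the corpus is treated as a cyclic sequence of token positions, each inter-occurrence distance is capped at the average distance, and the result is rescaled. This is an assumption-laden sketch, not the paper’s exact formulation, so the published definitions should be consulted before use.

```python
from typing import Sequence

def arf(positions: Sequence[int], corpus_length: int) -> float:
    """Average Reduced Frequency (sketch): evenly spread words get a value
    close to their raw frequency f; clumped words get a much smaller one."""
    f = len(positions)
    v = corpus_length / f  # average distance between occurrences
    # cyclic distances from each occurrence back to the previous one
    dists = [positions[0] + corpus_length - positions[-1]]
    dists += [positions[i] - positions[i - 1] for i in range(1, f)]
    return sum(min(d, v) for d in dists) / v

# f = 4 occurrences in a 100-token corpus:
print(arf([0, 25, 50, 75], 100))  # evenly spread -> 4.0 (equals raw f)
print(arf([0, 1, 2, 3], 100))     # clumped at the start -> 1.12
```

The two toy cases reproduce the behaviour described above: the corrected frequency equals the observed frequency when the word is perfectly evenly distributed and drops sharply when it is clumped.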

References

Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823.
Ambridge, B., Theakston, A. L., Lieven, E. V. M., & Tomasello, M. (2006). The distributed learning effect for children’s acquisition of an abstract syntactic construction. Cognitive Development, 21(2), 174–193.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to R. Cambridge: Cambridge University Press.



Baayen, R. H. (2010). Demythologizing the word frequency effect: A discriminative learning perspective. The Mental Lexicon, 5(3), 436–461.
Balota, D. A., & Spieler, D. H. (1998). The utility of item level analyses in model evaluation: A response to Seidenberg and Plaut. Psychological Science, 9(3), 238–240.
Baron, A., Rayson, P., & Archer, D. (2009). Word frequency and keyword statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20(1), 41–67.
Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics, 21(4), 439–464.
Burch, B., Egbert, J., & Biber, D. (2017). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3(2), 189–216.
Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65.
Church, K. W., & Gale, W. A. (1995). Poisson mixtures. Journal of Natural Language Engineering, 1(2), 163–190.
Davies, M., & Gardner, D. (2010). A frequency dictionary of contemporary American English: Word sketches, collocates and thematic lists. London/New York: Routledge, Taylor and Francis.
Ellis, N. C. (2002a). Frequency effects in language processing and acquisition: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24(2), 143–188.
Ellis, N. C. (2002b). Reflections on frequency effects in language acquisition: A response to commentaries. Studies in Second Language Acquisition, 24(2), 297–339.
Ellis, N. C. (2011). Language acquisition as rational contingency learning. Applied Linguistics, 27(1), 1–24.
Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327.
Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437.
Gries, S. T. (2010). Dispersions and adjusted frequencies in corpora: Further explorations. In S. T. Gries, S. Wulff, & M. Davies (Eds.), Corpus linguistic applications: Current studies, new directions (pp. 197–212). Amsterdam: Rodopi.
Gries, S. T. (2013). Statistics for linguistics with R (2nd rev. and ext. ed.). Berlin/Boston: De Gruyter Mouton.
Juilland, A. G., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter.
Juilland, A. G., Brodin, D. R., & Davidovitch, C. (1970). Frequency dictionary of French words. The Hague: Mouton de Gruyter.
Kromer, V. (2003). A usage measure based on psychophysical relations. Journal of Quantitative Linguistics, 10(2), 177–186.
Leech, G. N., Rayson, P., & Wilson, A. (2001). Word frequencies in written and spoken English: Based on the British National Corpus. London: Longman.
Lijffijt, J., & Gries, S. T. (2012). Correction to “Dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics, 17(1), 147–149.
Lyne, A. A. (1985). Dispersion. In The vocabulary of French business correspondence (pp. 101–124). Geneva/Paris: Slatkine-Champion.
Lyne, A. A. (1986). In praise of Juilland’s D. In Méthodes quantitatives et informatiques dans l’étude des textes, vol. 2 (pp. 589–595). Geneva/Paris: Slatkine-Champion.
Oakes, M., & Farrow, M. (2007). Use of the chi-squared test to examine vocabulary differences in English language corpora representing seven different countries. Literary and Linguistic Computing, 22(1), 85–99.
Paquot, M. (2010). Academic vocabulary in learner writing: From extraction to analysis. London/New York: Continuum.



Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503–520.
Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série), 1, 103–127.
Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 15–31.
Schmid, H. J. (2010). Entrenchment, salience, and basic levels. In D. Geeraerts & H. Cuyckens (Eds.), The Oxford handbook of cognitive linguistics (pp. 117–138). Oxford: Oxford University Press.
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in information retrieval. Journal of Documentation, 28(1), 11–21.
Spieler, D. H., & Balota, D. A. (1997). Bringing computational models of word naming down to the item level. Psychological Science, 8(6), 411–416.
Washtell, J. (2007). Co-dispersion by nearest-neighbour: Adapting a spatial statistic for the development of domain-independent language tools and metrics. Unpublished M.Sc. thesis, School of Computing, Leeds University.
Wilcox, A. R. (1973). Indices of qualitative variation and political measurement. The Western Political Quarterly, 26(2), 325–343.
Zhang, H., Huang, C., & Yu, S. (2004). Distributional consistency: As a general method for defining a core lexicon. Paper presented at Language Resources and Evaluation 2004, Lisbon, Portugal.

Chapter 6

Analysing Keyword Lists Paul Rayson and Amanda Potts

Abstract Frequency lists are useful in their own right for helping a linguist, lexicographer, language teacher, or learner analyse or exploit a corpus. When employed comparatively through the keywords approach, significant changes in the relative ordering of words can flag points of interest. This conceptually simple approach of comparing one frequency list against another has been very widely exploited in corpus linguistics to help answer a vast number of research questions. In this chapter, we describe the method step-by-step to produce a keywords list, and then highlight two representative studies to illustrate the usefulness of the method. In our critical assessment of the keywords method, we highlight issues related to corpus design and comparability, the application of statistics, and the use of clusters and n-grams to improve the method. We also describe important software tools and other resources, as well as providing further reading.

P. Rayson, Lancaster University, Lancaster, UK, e-mail: [email protected]
A. Potts, Cardiff University, Cardiff, UK, e-mail: [email protected]
© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_6

6.1 Introduction

As we have seen from Chap. 4, frequency lists are an essential part of the corpus linguistics methodology. They allow us to see what words appear (and do not appear) in a text, and give an indication of their prominence if we sort the list in frequency order. Beyond the world of the corpus linguistic researcher, frequency lists can be used directly or indirectly to support language learners by providing a way to focus on the more frequent words in a text, or suggesting priorities for language teachers when preparing lesson materials. Lexicographers use frequency lists indirectly when constructing traditional printed dictionaries. Frequency lists have also been allied with other kinds of grammatical information (such as major




word class) and translational glosses (both for the words themselves and example sentences) and turned directly into frequency dictionaries for learners and teachers in a number of languages (e.g. Juilland et al. 1970; Tono et al. 2013).

When used in simple form, frequency lists can sometimes be misleading, and care needs to be taken when making generalisations from them. For example, if frequency counts are derived from a large representative corpus such as the British National Corpus (BNC), we may reasonably claim that high frequency words in the corpus may also have a similarly high usage in the language. However, a word may have a high frequency count in the BNC not because it is widely used in all sections of the corpus, but because it has a very high frequency in only certain parts of the corpus and not in others, e.g. conversational speech rather than newspaper articles. Hence, as described in Chap. 4, and more particularly in Chap. 5, we need to pay careful attention to dispersion or range measures, which can help us estimate, for instance, how widely represented a word is across the various texts, domains, or genres within a written corpus, or across speakers within a spoken corpus.

Using computer software to automate the creation of frequency lists from texts saves the researcher significant amounts of time, but the results can be overwhelming in terms of the amount of information to analyse. One option to reduce the wealth of information is to compare a frequency list from one corpus with another in order to highlight differences in word rank or frequency, since significant changes to the relative ordering of words can flag points of interest (Sinclair 1991: 31). Corpus linguistics is inherently comparative, so a method has evolved to support the comparison of corpora, which helps us study, for example, the differences between tabloid and broadsheet newspapers (Baker et al. 2013), vocabulary variation on the basis of age and gender (Murphy 2010), or grammatical, lexical, and semantic change over 100 years of British and American English (Baker 2017). The resulting widely-used method of keyword analysis is our focus in this chapter.

Hofland and Johansson (1982) were early pioneers of this approach when they carried out a large (for the time) comparison of one million words of American English (represented by the Brown corpus) with one million words of British English (in the LOB corpus). Their study employed a difference coefficient defined by Yule, which varies between +1 and −1, to calculate the difference between the relative frequencies of a word in the two corpora. In addition, Pearson’s statistical goodness-of-fit test, the chi-squared test, was applied, enabling Hofland and Johansson to mark statistically significant differences at the 5%, 1% and 0.1% significance levels (cf. Chap. 20).

Another major milestone in the development of the keywords approach was the inclusion of the method by Mike Scott in his WordSmith Tools software. A number of authors had used significance tests to determine the importance of differences of specific words or linguistic features between corpora, but Scott’s approach (1997)



allowed for a systematic comparison of full word frequency lists. Scott demonstrated that the keyword results enable a researcher to understand the ‘aboutness’ of a text or corpus.

6.2 Fundamentals

The keywords method is conceptually simple, relying on the comparison of (normally) two word frequency lists. The complexity of the method lies in the choice of statistics and frequency cut-offs to appropriately filter the results (for further discussion of this, see Sect. 6.4). As part of the corpus linguist’s toolbox, the keywords method is most appropriate as a starting point to assist in the filtering of items for further investigations, rather than an end in and of itself.

As a first step, two frequency-sorted word lists are prepared, one from the corpus being studied (the ‘target’) and one from a reference corpus. Each word list contains a list of tokens and associated frequencies. The reference dataset in corpus linguistics studies is usually a general corpus, representative of some language or variety of language. However, depending on the research question and aims, a suitable comparison set may also be used, e.g. from a corpus representing a different variety, time, or genre. Rather than using two different corpora entirely, some researchers use various subcorpora from the same (reference) corpus for the ‘target’ and ‘reference’ sets, and this approach is further exemplified in the two representative studies summarised below.

Next, the frequency of each word in the target corpus is compared to its frequency in the reference dataset in order to calculate a keyness value. Finally, the word list for the target corpus is reordered in terms of the keyness values of the words. The resulting sorted list contains two kinds of keyword: positive (those which are unusually frequent in the target corpus relative to the reference corpus) and negative (words which are unusually infrequent in the target corpus). It is also common to describe these two groups of words as overused (for positive) and underused (for negative), particularly in the learner corpus literature (cf. Chap. 13).
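The list-preparation step can be sketched in a few lines of Python. The tokeniser here is deliberately naive (lowercasing and splitting on non-letter characters) and the texts are invented, so this is only a schematic illustration of producing the two frequency lists that feed the keyness calculation, not a substitute for a proper corpus tool.

```python
import re
from collections import Counter

def freq_list(text: str) -> Counter:
    """A deliberately naive tokeniser: lowercase, keep runs of letters/apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Toy 'target' and 'reference' corpora:
target = freq_list("the cat sat on the mat and the cat purred")
reference = freq_list("the dog ran in the park and the dog barked")

print(target.most_common(2))            # [('the', 3), ('cat', 2)]
print(target["cat"], reference["cat"])  # 2 0
```

Real studies must of course make the tokenisation decisions discussed in Sect. 6.3.1 (hyphenation, capitalisation, spelling variation) explicit rather than leaving them to a regular expression.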
In order to perform the keyness calculation for each word in the list, corpus tools set up a 2 by 2 contingency table, as shown in Table 6.1 (see also Chap. 7). The value ‘c’ is the total number of words in the target corpus and ‘d’ is the total number

Table 6.1 Contingency table for each word in the list

                           Target corpus   Reference corpus   Total
Frequency of word          a               b                  a + b
Frequency of other words   c − a           d − b              c + d − a − b
Total                      c               d                  c + d



of words in the reference corpus. Numbers ‘a’ and ‘b’ are termed the ‘observed’ (O) values (i.e. the actual frequencies of a given word in each corpus). Two main significance test statistics have been used in the corpus linguistics literature: chi-squared (as employed by Hofland and Johansson 1982) and log-likelihood (LL) (described by Rayson and Garside 2000 for use in keywords calculations and by Dunning 1993 for calculating collocations). Rayson et al. (2004a) compared the reliability of the two statistics when used for keyness calculations under varying conditions (corpus size, frequency of words) and showed that the log-likelihood test is preferred over the chi-squared test since it is a more accurate statistic for heavily skewed comparisons (e.g. a high ratio of target corpus to reference corpus sizes, or low frequency words), so here we present only the log-likelihood formula.

First, we need to calculate the expected values (E) corresponding to each observed value (O) in Table 6.1:

E_i = \frac{N_i \sum_i O_i}{\sum_i N_i}

Here, N1 = c and N2 = d. Hence, E1 = c*(a + b)/(c + d) and E2 = d*(a + b)/(c + d). Note that the calculation of expected values takes account of the sizes of the corpora, so the raw frequency figures should be used in the table. These expected values are then inserted into the log-likelihood equation:

-2 \ln \lambda = 2 \sum_i O_i \ln \frac{O_i}{E_i}

The log-likelihood score in this case is LL = 2*((a*ln(a/E1)) + (b*ln(b/E2))).1,2 Once this calculation has been performed for each word and the resulting word list has been sorted on the LL value, we can see the words that are most indicative (or characteristic) of the target corpus relative to the reference corpus at the top of the list. Words which occur in the two corpora with roughly similar relative frequencies appear lower down the sorted list.
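The calculation above can be written directly as a small Python function. The variable names follow Table 6.1, and the guards for zero frequencies (a word absent from one corpus) reflect the usual convention that 0 · ln(0) is taken as 0; the example figures are invented.

```python
import math

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """2-cell log-likelihood keyness (after Rayson and Garside 2000).
    a, b = frequency of the word in the target / reference corpus;
    c, d = total number of words in the target / reference corpus."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# Equal relative frequencies give LL = 0; a skew gives a positive score:
print(round(log_likelihood(100, 100, 10_000, 10_000), 2))  # 0.0
print(round(log_likelihood(200, 100, 10_000, 10_000), 2))  # 33.98
```

Sorting the target word list by this score in descending order then yields the keyword list described in the text.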
At this point in the method, different researchers have applied different cut-offs relating to position in the list, significance value, or p-value, and there is little agreement about a preferred approach. We will return to this discussion in our critical assessment in Sect. 6.3.

1. Online calculators and downloadable spreadsheets are available at http://corpora.lancs.ac.uk/sigtest/ (accessed 25 June 2019) and http://ucrel.lancs.ac.uk/llwizard.html (accessed 25 June 2019).
2. It should be noted that this formula represents the 2-cell calculation (Rayson and Garside 2000), which can be used since the contribution from the other two cells is fairly constant and does not affect the ranking order. Other tools, e.g. AntConc, and statistical calculators also support the 4-cell calculation, incorporating contributions from the frequencies of the other words into the log-likelihood value.



Representative Study 1
Seale, C., Ziebland, S., and Charteris-Black, J. 2006. Gender, cancer experience and internet use: A comparative keyword analysis of interviews and online cancer support groups. Social Science & Medicine 62:2577–2590.

This work set out to achieve two aims: (1) to discuss comparative keyword analysis as a possible substitute for the sort of qualitative thematic analysis on which a large number of previous illness studies were based; and (2) to apply this method in analysing gender differences in the online discourse of people with breast and prostate cancer. Previous studies of people with cancer found gender differences which align broadly with findings in more wide-reaching sociolinguistic analyses, namely that women’s style is more expressive compared to men’s style, which is more instrumental (Boneva and Kraut 2002). Seale et al. (2006) expanded considerably on previous studies by incorporating tools from corpus linguistics, notably keyness analysis.

The corpora utilised contained two types of data: research interviews and Internet-based support groups. Qualitative interviews were adopted for secondary analysis from the Database of Individual Patient Experiences project; these were conducted in the UK, 2000–2001, with 97 people with cancer (45 women with breast cancer; 52 men with prostate cancer), totalling 727,100 words. Posts were inspected to extract only those written by people with cancer (as opposed to family members, carers, or those experiencing symptoms that may be associated with cancer), which resulted in a final corpus comprising 12,757 posts and 1,629,370 words.

Often, stylistic, grammatical, or syntactical features of the target corpus are highlighted through keyness comparison with a general reference corpus. However, the comparative keyword analysis reported here did not involve a general reference corpus (which may conflate or obscure gender differences in the specific data sets).
Instead, the breast cancer texts were compared to the prostate cancer texts to facilitate analysis of meanings made by their female and male authors, respectively. Keywords were calculated using WordSmith Tools (Scott 2004). Measures of ‘keyness’ (expressed in positive and negative log-likelihood values with the prostate cancer corpus serving as the target corpus) were provided. WordSmith Tools also provides the corresponding p-values, p < 0.00000001 for all items, rendering this an ineffective method of describing or differentiating results. The ‘top 300’ results from both the breast and prostate cancer corpora were analysed, with some exclusions. Concordances and clusters around keywords were analysed by hand, and keywords were manually placed into



semantic categories. In the opinion of the researchers, “[t]his enabled important and meaningful comparative aspects of these large bodies of text to be identified” in a “more economical and potentially replicable manner than conventional qualitative thematic analysis based on coding and retrieval” (ibid.). This answered the first broad purpose of the paper, as introduced above. Analysis of the interviews and web fora led to findings broadly aligned with previous research on gender differences in the experience of serious illness. Qualitative analysis of the interview corpus also supported findings from previous studies: women were more likely to claim to seek social support on the Internet, whereas men said that they use the Internet to look for information. Quantitative analysis using keyness helped to identify further areas of interesting difference. Compared to women with breast cancer, men with prostate cancer had a much greater number of keywords pertaining to TREATMENT (i.e. catheter, brachytherapy, hormone, Zoladex, treatment), TESTS AND DIAGNOSIS (biopsy, MRI, screening), SYMPTOMS AND SIDE EFFECTS (incontinence, impotence), and DISEASE AND ITS PROGRESSION (PSA [prostate specific antigen], staging, cancer, aggressive). By contrast, women with breast cancer had a greater number of keywords under categories such as SUPPORT (i.e. help, supportive), FEELINGS (scared, hope, depressed), PEOPLE (I, you, husband, ladies), CLOTHING AND APPEARANCE (wear, clothes), and SUPERLATIVES (lovely, definitely, wonderful). Grouping keywords into a range of semantic categories helps to generalise and distil meanings. Viewed as a whole, findings in Seale et al. suggested “that men’s experience of their disease appears to be more localised on particular areas of the body, while women’s experience is more holistic” (2006: 2588). Seale et al. (2006) conceded that the study had a number of limitations. 
The first limitation was to do with sampling: as individual posts had not been linked to poster identity, there was the possibility that overuse of certain keywords by specific individuals had resulted in patterns being interpreted as common across the corpus/sample as a whole. As with other keyness studies, the researchers acknowledged that this work focussed on difference rather than similarity, which has the effect of reifying gender differences. Finally, more complex syntactical or semantic patterns (such as tag questions or cognitive metaphors) were not highlighted and discussed with this methodology or in this study.

Other limitations not acknowledged by the researchers were also present in this study. WordSmith’s ‘keyness’ measure was used to rank results, with the ‘top 300’ skimmed from each corpus for further analysis. However, with all p-values nearing zero and no log-likelihood thresholds included, it is unclear whether the most salient semantic categories were, in fact, included and populated. Finally, while we acknowledge that the creation of ad hoc semantic categories is useful for thematic content analysis,



particularly when the researchers are very well-versed in the content of their corpora, we wonder about the accompanying detriment that this brings, particularly in relation to replicability. A number of other studies make use of further computational methods to undertake semantic annotation and categorisation. Below, we summarise one such work.

Representative Study 2
Culpeper, J. 2009. Words, parts-of-speech and semantic categories in the character-talk of Shakespeare’s Romeo and Juliet. International Journal of Corpus Linguistics 14(1):29–59.

Culpeper (2009) moved beyond analysis of keywords in isolation to consider both key parts-of-speech and semantic categories in a corpus stylistic analysis of character-talk in Romeo and Juliet. In this study, the speech of the six characters with the highest frequency of speech in Romeo and Juliet was isolated to create subcorpora varying in length from 1293 to 5031 words. Culpeper posited that studying these subcorpora would expose the differing speech styles of characters (which, in turn, contribute to reader perception of characterisation). In this case, the very small sizes of data under analysis also allowed for full consideration of all results.

Earlier literary studies have described (contextualised) frequency of features as indicators of an author’s or character’s style. Culpeper’s aim was to “produce key items that reflect the distinctive styles of each character compared with the other characters in the same play, rather than ... stylistic features relating to differences of genre . . . or aspects of the fictional world” (2009: 35). Therefore, in this study, the subcorpus of any individual character’s speech was compared to a reference corpus of all the other characters’ speech (exclusive).

This study also made use of WordSmith Tools (Scott 2004) to calculate keywords. It found that keyword analysis did provide evidence for results that might be predictable (for instance, that “Romeo is all about love” (Culpeper 2009: 53)), but also exposed features that were less easily observable (Juliet’s subjunctive keywords, which may be linked to an anxious style), without relying on intuitions about which parts of the text or which features to focus on (ibid., p. 53).
The study then went on to experiment with analysing key parts-of-speech and key semantic domains using a combination of software. Key parts-of-speech and key semantic categories are calculated by applying the keywords procedure to part-of-speech and semantic tag frequency



lists in the same way as described in Sect. 6.2 for word frequency lists. Keyword analysis can be illuminating, but can also be misleading, because semantic similarity is not explicitly taken into account during analysis. While many (or even most) researchers who use keyword analysis end up grouping or discussing words in semantic categories, this can be a somewhat flawed method, as these categories are subjective, and words which are too infrequent to appear as key on their own will be discounted. In small corpora (such as the Romeo and Juliet corpus), many content words of interest would necessarily be low-frequency. The incorporation of a computational annotation system allows for systematic, rigorous annotation followed by statistical analysis.

In this study, semantic domains were annotated and analysed using UCREL’s (University Centre for Computer Corpus Research on Language, based at Lancaster University) Semantic Analysis System (USAS) in Wmatrix. The input is part-of-speech tagged text produced by CLAWS, which is then run through SEMTAG, which uses lexicons to assign semantic tag(s) to each lexical item or multiword unit (for details, see Wilson and Rayson 1993; Rayson and Wilson 1996). The accuracy rate for SEMTAG is said to be approximately 91% for modern, general language (Rayson et al. 2004b), though the author acknowledged that semantic shift led to some ‘incorrect’ classifications. Some of these were corrected using a historical lexicon, but Culpeper cautiously checked and interpreted the results to be more certain of their meaning and importance.

In the analysis section of the paper, a number of key semantic categories and constituent items (words and set phrases) were presented for Romeo, Nurse, and Mercutio. An indicative selection of Romeo’s key semantic categories appears in Table 6.2. The first two semantic categories (RELATIONSHIP: INTIMATE/SEXUAL and LIKING) are clearly linked semantically and metonymically.
The keyness of these themes was predictable, but did provide empirical and statistical evidence for the topicality of Romeo’s speech and a sense of his role as a lover. The third category in Table 6.2 is illustrative of key semantic domains which were much less predictable. Romeo describes literal light, but also makes use of conventional metaphors, such as light/dark for happiness/unhappiness. Exposure of metaphorical usage is an interesting and useful feature of the semantic tagger.

Interestingly, very few items within key semantic domains were also identified as keywords in this study. In Table 6.2, only three of the 26 listed items were also keywords; in the full table in Culpeper (2009: 48), only five out of 77 total items in key semantic categories were independently identified keywords. This result highlighted a major advantage of key semantic domain analysis: lexical items which may not turn up in keyword analysis due to low frequency combine together to highlight larger fields of key meaning.



Table 6.2 Romeo’s ‘top three’ semantic categories, as rank-ordered for positive keyness (i.e. relatively unusual overuse). Keywords, identified independently earlier in the paper, are emboldened.

Semantic category (tag code and frequency)   Items within the category (and their raw frequencies), up to a maximum of ten types if available
RELATIONSHIP: INTIMATE/SEXUAL (S3.2) (48)    love (34), kiss (5), lovers (3), kisses (2), paramour (1), wantons (1), chastity (1), in love (1)
LIKING (E2+) (38)                            love (15), dear (13), loving (3), precious (2), like (1), doting (1), amorous (1), loves (1)
COLOUR AND COLOUR PATTERNS (O4.3) (33)       light (6), bright (4), pale (3), dark (3), green (2), stained (2), black (2), golden (1), white (1), crimson (1)

Adapted from Culpeper (2009: 48)

As indicators of aboutness and style, all methods have merit. Key part-of-speech categories, keywords, and key semantic categories are usually dominated by a small number of very frequent items, allowing for overlap. Culpeper found that items identified as keywords dominate 66.6% of the semantic categories, meaning that a keyword analysis would reveal most conclusions, but would also leave out a not-insignificant number of findings. The areas not overlapping (particularly the 33.4% between key semantic domains and keywords) are very salient but also very difficult to predict. So, "[w]hen keywords are dominated by ideational keywords, capturing the 'aboutness' of the text, the part-of-speech and particularly the semantic keyness analyses have much more of a contribution to make, moving the analysis beyond what is revealed in the keywords" (ibid.).

6.3 Critical Assessment and Future Directions

In this chapter so far, we have described the origins of and motivations for the development of the keywords method in corpus linguistics, and shown two representative studies using the technique. In the next few sub-sections, we will undertake a critical assessment describing some of the pitfalls and misconceptions surrounding the use of the technique, along with a summary of criticisms, concluding with future directions for the approach.

6.3.1 Corpus Preparation

Given that the input to the keywords method is word frequency lists, the specific details of the preparation of those lists are important, but often overlooked, in the corpus linguistics literature. What counts as a word in such lists (on the basis of tokenisation, punctuation, capitalisation, standardisation, etc.) can potentially make a large difference to the results. Usually, corpus software tools tokenise words by identifying boundaries with white-space characters and removing punctuation characters from the start and end of words. However, different texts and authors might use hyphenation differently; for instance, "todo", "to-do" and "to do" must be carefully cross-matched or standardised, or else the frequencies in contingency tables will not compare word types consistently. Most corpus tools avoid the capitalisation issue completely by forcing all characters to upper- or lowercase, but this can conflate different meanings, e.g. "Polish" versus "polish". Wmatrix attempts to preserve capitalisation if words are tagged as proper nouns, but this, in turn, relies on the accuracy of the POS tagger.

Corpus methods are increasingly being applied to historical corpora, as we have seen in Culpeper (2009), described in Representative Study 2. Many authors rely on modernised or standardised editions to avoid spelling variation issues. Baron et al. (2009) carried out a detailed study to assess the degree to which keyword results are affected by spelling variants in original editions. First, they estimated the extent of spelling variation in various large Early Modern English corpora and found that, on average, in texts from 1500, the percentage of variant types is over 70% and of variant tokens around 40%. In terms of preparing frequency lists, this means, for example, that rather than counting all occurrences of the word "would", corpus software needs to take account of frequencies for other potential variants: "wolde", "woolde", "wuld", "wulde", "wud", "wald", "vvould", "vvold", and so on. The amount of spelling variation drops to less than 10% of types and tokens in corpus texts from around 1700 onwards. In terms of impact on the keyness method, spelling variation is a significant problem, since the rank ordering of words will be affected by the distribution of variant frequencies. Baron et al.
(2009) estimated the difference using rank correlation coefficients on keyword lists calculated before and after standardisation, and found that Kendall's Tau scores can drop as low as 0.6 (where a score of 1 indicates that the two lists are the same, 0 indicates that the rankings are independent, and −1 indicates that one ranking is the reverse of the other; cf. Chap. 17). A similar effect will be observed when applying the keywords approach to computer-mediated communication (CMC) varieties, e.g. online social media, emails, and SMS, so care must be taken with data preparation.
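The standardisation step that Baron et al. describe can be sketched as a simple remapping of variant spellings onto standard forms before frequencies are counted. The variant mapping below is a hand-made illustration, not an actual standardisation resource:

```python
from collections import Counter

# Illustrative Early Modern English variant mapping (assumed for this sketch;
# in practice such mappings come from a standardisation resource or lexicon).
VARIANTS = {"wolde": "would", "woolde": "would", "wuld": "would",
            "vvould": "would", "loue": "love", "haue": "have"}

def standardise(tokens):
    """Map known spelling variants onto a modern standard form."""
    return [VARIANTS.get(t, t) for t in tokens]

tokens = ["wolde", "would", "loue", "love", "love", "haue", "wuld"]
raw = Counter(tokens)               # frequencies split across variants
std = Counter(standardise(tokens))  # variants merged: would=3, love=3, have=1

print(raw.most_common())
print(std.most_common())
```

Without the merging step, the contingency table for "would" would see a frequency of 1 rather than 3, which is exactly the distortion of rank ordering measured by the Kendall's Tau comparison above.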

6.3.2 Focus on Differences

One of the central drawbacks of keyness analysis is its innate focus on difference (and obfuscation of similarity). Baker (2004) undertook a comparative study of online erotica, and explained that while 'large' is a keyword in gay male erotic texts compared to lesbian erotic narratives from the same erotica website, other semantically related words (e.g. 'huge') may have occurred with comparable frequency in both corpora. This may lead analysts to (erroneously) over-generalise about the keyness of 'size' in the gay male corpus, overlooking the central tenet of keyword analysis: that findings and discussion are grounded at the lexical level. Baker (2004) proposed one way to circumvent this focus on differences: to carry out comparisons on more than two sets of data. This is helpful when undertaking keyword analysis on
two target corpora (rather than one target corpus compared to a reference corpus). By calculating keywords in the two target corpora against one another and then, for instance, against a larger reference corpus, both differences and similarities may be highlighted in the emerging results. Another issue in keyword analysis highlighted by Baker (2004) is that the 'strongest' words tend to reveal obvious patterns. While this does provide confirmatory evidence in new studies that the technique is working as expected, this bias can contribute to an unmanageable number of unsurprising keywords being thrown up for analysis. Possible work-arounds have already been demonstrated in some of the studies discussed above: researchers may apply cut-off points related to relative dispersion across texts (see also Chap. 5), frequency in the entire corpus, or maximum p-values, or even switch the focus to use dispersion instead of frequency for keyness calculations (Egbert and Biber 2019). No robust guidelines as to 'appropriate' cut-offs for any of these measures have been recommended in the literature, which points to a need for further development of the method. Similar issues with method settings and parameters can be observed in the area of collocation research (see Chap. 7).

6.3.3 Applications of Statistics

There have been a number of criticisms of the keywords approach in relation to the application and interpretation of the significance test statistics used in the procedure. The method described in Sect. 6.2 can be seen as a goodness-of-fit test, where the null hypothesis is that there is no difference between the observed frequencies of a word in the two corpora. If the resulting metric (log-likelihood in our case) exceeds a certain critical value, then the null hypothesis can be rejected. After choosing a degree of confidence, we can use chi-squared statistical tables to find the critical value, e.g. for the 5% level (p < 0.05) the critical value is 3.84, and for 1% (p < 0.01) it is 6.63 (cf. Chap. 20). For a comparison of two corpora, we use values with 1 degree of freedom, i.e. one less than the number of corpora. However, if the value calculated from the contingency table does not exceed the critical value, this only indicates that there is not enough evidence to reject the null hypothesis; we cannot conclude that the null hypothesis is true (i.e. that there is no significant difference).

It was Dunning (1993) who first brought the log-likelihood test to the attention of the community, proposing it for collocation analysis rather than keywords. Dunning cautioned that we should not rely on the assumption of a normal distribution when carrying out statistical text analysis and recommended log-likelihood, a parametric analysis based on the binomial or multinomial distributions, instead. There is some disagreement in the literature here, with some authors stating that chi-squared assumes a multinomial distribution, making no special distributional assumption of normality. Cressie and Read (1984) showed that Pearson's X2 (chi-squared) and the likelihood ratio G2 (Dunning's log-likelihood) are two statistics
in a continuum defined by the power-divergence family of statistics, and they refer to the long-running discussion (since 1900) of these statistics and their appropriateness for contingency table analysis. Kilgarriff (1996) considered the Brown versus LOB corpus comparison by Hofland and Johansson (1982) and highlighted that too many common words were marked as significant using the chi-squared test. In order to better discriminate interesting from non-interesting results, he suggested making use of the Mann-Whitney test instead, as this uses frequency ranks rather than frequencies directly. However, for words with a joint LOB/Brown frequency above 30, where the test could be applied, 60% of the word types were still marked as significant. Results using Mann-Whitney also suffer towards the low end of the frequency spectrum, especially when words have a frequency of zero in one of the two corpora. This is because a large number of words occur with the same frequency (indeed, usually half of the types in a corpus occur with a frequency of one), so they cannot be satisfactorily ranked. For tables with small expected frequencies, many researchers have used Yates' corrected chi-squared statistic (Y2), and some prefer Fisher's exact test; for more discussion see Baron et al. (2009).

More recent papers have also investigated similar issues of statistical validity and the appropriateness of the keywords procedure as currently envisaged for comparing corpora with specific designs. Brezina and Meyerhoff (2014: 1) showed that using a keywords approach to compare whole corpora "emphasises inter-group differences and ignores within group variation" in sociolinguistic studies. The problem is not the significance test itself, but rather the aggregation of frequency counts for a target linguistic variable, e.g. a word, across speaker groupings. They recommend the Mann-Whitney U test instead, to take account of separate speaker frequency counts and variation within datasets.
As Kilgarriff (2005) reminded us, language is not random, and the assumption of independence of words inherent in the chi-squared and log-likelihood tests "may lead to spurious conclusions when assessing the significance of differences in frequency counts between corpora" (Lijffijt et al. 2016: 395), particularly for poorly dispersed words. Paquot and Bestgen (2009) and Lijffijt et al. (2016) recommended representing the data differently in order to make the assumption of independence at the level of texts rather than the level of words. Lijffijt et al. (2016) recommended other tests that are appropriate for large corpora, such as Welch's t-test, the Wilcoxon rank-sum test and their own bootstrap test (see also Chap. 24). In response to Kilgarriff (2005), Gries (2005) pointed out the importance of correcting for multiple post-hoc tests (e.g. with the Bonferroni or the more recent Šidák correction), since, after applying such corrections, the expected proportion of significant results is observed. Gries (2005) also directed readers to other methods such as effect sizes, Bayesian statistics (later picked up by Wilson 2013) and confidence intervals, and highlighted that null hypothesis significance testing has been criticised in other scientific disciplines for many decades. Concerns over the reproducibility and replicability of scientific results have led the editors of the Basic and Applied Social Psychology journal to ban p-values (null hypothesis significance testing), and the American Statistical Association has produced a policy statement discussing the issues (Wasserstein and Lazar 2016).
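The corrections Gries points to simply tighten the per-test significance threshold. A minimal sketch, assuming a family of 10,000 tests (e.g. one keyness test per word type):

```python
import math

def bonferroni(alpha, m):
    # Per-test threshold: the family-wise alpha divided by the number of tests.
    return alpha / m

def sidak(alpha, m):
    # Per-test threshold keeping the family-wise error rate at alpha,
    # assuming independent tests: 1 - (1 - alpha)^(1/m).
    return 1 - (1 - alpha) ** (1 / m)

m = 10_000  # number of word types tested (an assumed figure)
print(bonferroni(0.05, m))   # 5e-06
print(sidak(0.05, m))        # slightly above 5e-06 (marginally less conservative)
```

Each word's p-value would then be compared against the corrected threshold rather than against 0.05 directly.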

Many misconceptions about statistical hypothesis testing are observable in the corpus linguistics literature and beyond; for further details, see Vasishth and Nicenboim (2016). One specific example that we can demonstrate here illustrates the usefulness of including effect size measures alongside significance statistics to allow for comparability across different sample sizes. As with significance metrics, there are a number of different effect size formulae that could be used. Effect size measures show the relative difference between word frequencies in two corpora, rather than factoring in how much evidence we have in the corpus samples. This means that, unlike log-likelihood scores, they are not affected by sample size. Consider three hypothetical experiments for the frequencies of the words 'blah', 'ping' and 'hoot' in four corpora, as shown in Table 6.3. Here, we are using log-likelihood (LL) as our significance measure and Log Ratio (LR) as the effect size measure (Hardie 2014). In experiment 1, LL tells us that there is enough evidence to reject the null hypothesis at p < 0.0001 (critical value 15.13), and the effect size shows the doubling of the frequency of the word in corpus 1 relative to corpus 2. Compare this with experiment 2, where the word frequencies and corpus sizes are all ten times larger than in experiment 1. As a result, the LL value is ten times larger, indicating more evidence for the difference, but the LR is still the same, given that the ratio of 1000 to 500 is the same as the ratio of 100 to 50. In experiment 3, we retain the same sized corpora as in experiment 2, but the frequencies of the word are closer together; this illustrates that a smaller relative frequency difference is still shown to be significant at the same p-value as in experiment 1.
Importantly, we should note the lack of comparability of the LL score between experiments 1 and 2 (as well as between 1 and 3) because they employ differently sized corpora. In contrast, effect size scores can be compared across all three experiments without the same concerns.

Table 6.3 Three hypothetical keywords experiments

Experiment 1: Corpora 1 and 2 contain 10,000 words each. Frequency of 'blah' in corpus 1 = 100; in corpus 2 = 50. Significance (LL) = 16.99; effect size (LR) = 1.00.
Experiment 2: Corpora 3 and 4 contain 100,000 words each. Frequency of 'ping' in corpus 3 = 1000; in corpus 4 = 500. Significance (LL) = 169.90; effect size (LR) = 1.00.
Experiment 3: Corpora 3 and 4 contain 100,000 words each. Frequency of 'hoot' in corpus 3 = 1000; in corpus 4 = 824. Significance (LL) = 17.01; effect size (LR) = 0.28.
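The LL and LR values in Table 6.3 can be reproduced in a few lines of Python. This is a sketch of the calculation as commonly implemented (observed-over-expected terms for the word's cell in each corpus); the function names are our own:

```python
import math

def log_likelihood(o1, o2, n1, n2):
    """Two-corpus log-likelihood (Dunning 1993) for one word:
    o1, o2 = observed frequencies; n1, n2 = corpus sizes."""
    e1 = n1 * (o1 + o2) / (n1 + n2)  # expected frequency in corpus 1
    e2 = n2 * (o1 + o2) / (n1 + n2)  # expected frequency in corpus 2
    ll = 0.0
    for o, e in ((o1, e1), (o2, e2)):
        if o > 0:                    # zero cells contribute nothing
            ll += o * math.log(o / e)
    return 2 * ll

def log_ratio(o1, o2, n1, n2):
    """Hardie's (2014) effect size: binary log of the relative-frequency ratio."""
    return math.log2((o1 / n1) / (o2 / n2))

# The three hypothetical experiments from Table 6.3:
for o1, o2, n in ((100, 50, 10_000), (1000, 500, 100_000), (1000, 824, 100_000)):
    print(round(log_likelihood(o1, o2, n, n), 2),
          round(log_ratio(o1, o2, n, n), 2))
```

Running this prints 16.99/1.0, 169.9/1.0 and 17.01/0.28, matching the table and making the linear scaling of LL with corpus size directly visible.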


P. Rayson and A. Potts

6.3.4 Clusters and N-Grams

Both Baker (2004) and Rayson (2008) have pointed out a serious limitation of the keywords procedure: it can really only be used to highlight lexical differences, not semantic differences. This means that a word which has one significant meaning might not be correctly signalled as key when its various senses are counted together, thus masking something of interest. To some extent, the procedure implemented in Wmatrix and employed by Culpeper (2009), as described in Representative Study 2, addresses this issue, because words are semantically tagged and disambiguated before the keyness procedure is applied. Researchers may also be interested in (semantic) meaning beyond the single word. The USAS semantic tagger can be used to identify semantically meaningful multiword expressions (MWEs), since such chunks need to be analysed as belonging to a single semantic category or are syntactic units, e.g. phrasal verbs, compounds, and non-compositional idiomatic expressions. Wmatrix then treats these MWEs as single elements in word lists, allowing key MWEs to emerge alongside keywords. Consider the example MWE 'send up'. If this were not identified in advance as a semantically meaningful chunk meaning 'to ridicule or parody', then separate word counts for 'send' and 'up' would be observed and merged with the other occurrences of those words in the corpus, potentially inflating their frequencies incorrectly.

Without the benefit of a semantic tagger, Mahlberg (2008) combined, for the first time, the keywords procedure with clusters or n-grams, i.e. repeated sequences of words counted in corpora. Once the clusters have been counted, key clusters can be calculated using the same procedure as for keywords. Mahlberg then grouped key clusters by function to draw conclusions about local textual functions in a corpus of Charles Dickens' writing, which formed the basis of a corpus stylistic investigation.
This key clusters (or key n-grams) approach can be seen as an extension of the keywords approach; the simple keywords approach is, in fact, a comparison of n-grams of length 1. It has proved to be a very fruitful line of investigation, with a number of other studies employing this method. Paquot (2013, 2014, 2017) used key clusters to identify French learners' lexical preferences and potential transfer effects from their native language. Others have used the key n-gram approach to support native language identification (Kyle et al. 2013) and the automatic assessment of essay quality (Crossley et al. 2013).
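Clusters are simply contiguous word sequences; once they are counted per corpus, the same contingency-table keyness procedure can be run over the cluster frequency lists instead of the word lists. A minimal sketch (the example tokens are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous word sequences of length n (clusters/n-grams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was the best of times it was the worst of times".split()
bigrams = Counter(ngrams(tokens, 2))
trigrams = Counter(ngrams(tokens, 3))

print(bigrams.most_common(3))   # 'it was', 'was the' and 'of times' occur twice
print(trigrams.most_common(2))  # 'it was the' occurs twice
```

Feeding `bigrams` or `trigrams` for a target and a reference corpus into the keyness calculation yields key clusters in exactly the way Mahlberg (2008) describes.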

6.3.5 Future Directions

Many current studies using the keywords method are on English corpora. As the method is readily available in software such as WordSmith and AntConc, which work well in most languages, more thought should be given to how well keywords work in languages other than English, especially those with much more complex inflectional and derivational morphology, e.g. Finnish. For these
languages, it might be the case that comparing surface forms of words from the corpus works less well than comparing lemmas, because the frequencies are too widely dispersed across different word forms (in a similar way to historical spelling variants) to be comparable.

In terms of future directions for keyness analysis, we recommend that more care is taken in the application of the technique. Rather than blindly applying a simple method to compare two relative frequencies, more thought is required to consider the criticisms and shortcomings expressed in the preceding sections. Any metadata subdivisions present within a target or reference corpus should be explored via comparison so that they are not hidden; the corpora should be carefully designed and constructed with the aim of answering specific research questions and facilitating comparability; issues such as tokenisation, lemmatisation, capitalisation, identification of n-grams and multi-word expressions, and spelling variation should be considered; and differences as well as similarities should be taken into account when undertaking the analysis of the keyword results. As a corpus community, we need to agree on better guidelines and expectations for filtering results in terms of minimum frequencies and significance and effect size values, rather than relying on ad hoc solutions without proper justification.

In the future, we recommend investigating the use of statistical power calculations in corpus linguistics. Power calculations can be used alongside significance testing and effect size calculations and are increasingly employed in other disciplines, e.g. psychology. Statistical power is the likelihood that an experiment will detect an effect (or, in our case of comparing corpora, a difference in frequency) when there is an effect to be detected. We can use higher statistical power to reduce the probability of a Type-2 error, i.e.
concluding that there is no difference in frequency of a word between two corpora, when there is in fact a difference. This might mean setting the effect size in advance and then calculating (a-priori) how big our corpora need to be, or at least being able to (post-hoc) calculate and compare the power of our corpus comparison experiments. This might help us answer the perennial question, ‘How big should my corpus be?’ and help researchers determine comparability and the relative sizes of sub-corpora defined by metadata such as socio-linguistic variables. Finally, related to the experimental design and interpretation of results, issues of corpus comparability, homogeneity and representativeness are highly important to consider alongside reliability of the statistical procedure (Rayson and Garside 2000). It should not be forgotten that interpretation of the results of any automatic procedure is the responsibility of the linguist, and the results of the keywords method are a starting point to help guide us rather than an end point of the research.
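As a rough illustration of the a-priori direction only (a back-of-envelope sketch, not a full power calculation, which would involve the noncentral chi-squared distribution): because LL grows linearly with corpus size when relative frequencies are held constant (compare experiments 1 and 2 in Table 6.3), one can estimate how large two equal-sized corpora must be for a given relative-frequency difference to reach a chosen critical value. The function names and the assumption that observed rates match the true rates exactly are ours:

```python
import math

def ll_per_token(p1, p2):
    """LL contribution per corpus token for two equal-sized corpora in which
    the word has relative frequencies p1 and p2 (both > 0)."""
    e = (p1 + p2) / 2
    return 2 * (p1 * math.log(p1 / e) + p2 * math.log(p2 / e))

def corpus_size_needed(p1, p2, critical=15.13):
    """Smallest per-corpus size at which LL would reach the critical value,
    assuming the sample rates match p1 and p2 exactly."""
    return math.ceil(critical / ll_per_token(p1, p2))

# How big must each corpus be for a doubling (LR = 1) of a word occurring
# at 1% vs 0.5% to reach p < 0.0001 (critical value 15.13)?
print(corpus_size_needed(0.01, 0.005))
```

With 10,000-word corpora this configuration gives LL = 16.99 (experiment 1 in Table 6.3), so the required size comes out slightly below 10,000; a proper power analysis would additionally build in sampling variability.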


6.4 Tools and Resources

6.4.1 Tools

Keyness was arguably one of the later additions to the quiver of corpus-linguistic tools; many papers published between 2001 and 2008 (including the two representative studies summarised in this chapter) discussed the infancy of its adoption. Now, however, it is considered one of the five standard methods of the field, alongside frequency, concordance, n-gram and collocation. As a result, nearly all concordancers and corpus-linguistic tools will offer some assistance in the calculation of keyness. Distinguishing features, then, are:

1. the incorporation of more sophisticated taggers, allowing for calculation of key lemmas, parts-of-speech (POS), or semantic domains;
2. the inclusion of built-in reference corpora, often general corpora or subsections thereof, allowing for immediate calculation against a known 'benchmark' without the necessity of sourcing or collecting a comparable corpus;
3. the selection of measures of keyness available (see Sect. 6.3.3 for discussion).

We have provided an overview of popular tools and these features in Table 6.4. If a user has both a reference and a target corpus and is simply interested in straightforward calculation of keywords, we can recommend both AntConc and WordSmith as good beginner-level tools for this method. CQPweb has the greatest variety of measures available; with robust tagging systems and a range of reference corpora, it also allows for calculation of keyness across features and genres. However, key semantic domains are difficult to access, and the inability to upload target corpora may inhibit use for many users interested in exploring their own data. SketchEngine is extraordinarily powerful, with part-of-speech tagging and lemmatisation for a huge number of languages. However, semantic tagging is still under development, and some may disagree with the application of Simple Maths. The most powerful tool for semantic processing is inarguably Wmatrix.
The main interface also offers easy access to key POS and keywords, and a small number of reference corpora are accessible. We recommend Wmatrix for keyness analysis, although with a caveat about size restrictions since it is currently suitable for corpora up to around five million words. The keywords method can also be implemented directly in programming languages such as Python, R and Perl (see Chap. 9).
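As an illustration of that last point, here is a minimal Python sketch of the basic procedure. The tokenisation and the positive-keyness filter are deliberate simplifications (see Sect. 6.3.1 on why tokenisation choices matter); real tools add frequency cut-offs, dispersion filters and richer tokenisation:

```python
import math
import re
from collections import Counter

def tokenise(text):
    # Lowercase and keep runs of letters only -- one of many defensible choices.
    return re.findall(r"[a-z]+", text.lower())

def keywords(target_text, reference_text):
    """Rank target-corpus words by log-likelihood keyness (positive key only)."""
    tgt = Counter(tokenise(target_text))
    ref = Counter(tokenise(reference_text))
    n1, n2 = sum(tgt.values()), sum(ref.values())
    results = []
    for word, o1 in tgt.items():
        o2 = ref.get(word, 0)
        e1 = n1 * (o1 + o2) / (n1 + n2)
        e2 = n2 * (o1 + o2) / (n1 + n2)
        ll = 2 * sum(o * math.log(o / e) for o, e in ((o1, e1), (o2, e2)) if o)
        if o1 / n1 > o2 / n2:          # keep positively key words only
            results.append((word, o1, ll))
    return sorted(results, key=lambda r: r[2], reverse=True)

for word, freq, ll in keywords("love love love kiss hate",
                               "hate hate war war peace"):
    print(word, freq, round(ll, 2))
```

The resulting LL scores can then be filtered against the critical values discussed in Sect. 6.3.3 (e.g. 3.84 for p < 0.05).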

Table 6.4 Overview of available tools. The features compared (with a tick mark indicating full usability and a tilde indicating partial capacity, i.e. beta development or use restricted to special access) are: keywords (*lemmatised), key clusters, key semantic domains, built-in reference corpora, and upload own corpora. The measures of keyness available are:

AntConc v3.4.4: chi-square, LL
CQPweb v3.2.25: LL, log ratio (unfiltered, LL filter, confidence interval filter)
SketchEngine*: simple maths (for details, see https://www.sketchengine.eu/documentation/simple-maths/. Accessed 25 June 2019)
Wmatrix v3: LL, log ratio
WordSmith Tools v5: chi-square (Yates' correction), LL

6.4.2 Resources (Word Lists)

Generation of keywords in a target corpus necessitates some point of comparison: usually either a second target corpus, a reference corpus, or a word list from a large, general corpus. Selection of a reference corpus will impact the results, and some care should be taken to select an appropriate 'benchmark' to highlight differences aligned with a given research question. Many research questions necessitate the collection of specialised reference corpora, or entail comparison of subcorpora. Those wishing to answer more general questions (e.g. 'aboutness') may choose to make use of a general reference corpus. Word lists of many of the largest general corpora are readily available online; we have provided a sample of some English-language resources in Table 6.5, although care should be taken when selecting these to ensure that tokenisation decisions are well documented and comparable.

Table 6.5 Selection of available word lists, with descriptions and weblinks

BNC: A number of word lists from the British National Corpus, including subcorpora divisions (e.g. written or spoken). http://ucrel.lancs.ac.uk/bncfreq/flists.html. Accessed 25 June 2019.
Brown family: Word lists from the 1961, 1991, and 2006 American and British Brown family corpora: Brown, LOB, Frown, FLOB, AmE06, and BE06. http://ucrel.lancs.ac.uk/wmatrix/freqlists/. Accessed 25 June 2019.
COCA: A range of word and phrase lists from the Corpus of Contemporary American English. http://corpus.byu.edu/resources.asp. Accessed 25 June 2019.
KELLY: Multilingual word lists of the most frequent 9000 words in nine languages. https://www.hf.uio.no/iln/english/about/organization/text-laboratory/services/kelly.html. Accessed 25 June 2019.
Moby Project: A range of word and phrase lists, including: five languages; root words, synonyms, and related words; the complete works of Shakespeare; with some lists part-of-speech or IPA coded. https://en.wikipedia.org/wiki/Moby_Project. Accessed 5 July 2019.
Wiktionary: A huge range of word lists from a range of sources and domains (including, e.g., Project Gutenberg), in a large number of languages. https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists. Accessed 25 June 2019.

Further Reading

Bondi, M., & Scott, M. (Eds.). (2010). Keyness in texts. Amsterdam: John Benjamins.
This is quite a comprehensive guide for scholars with a particular interest in keywords and phrases. The collection is divided into three sections: (1) Exploring keyness; (2) Keyness in specialised discourse; and (3) Critical and educational perspectives. Section one deals with a number of the issues that we have touched upon here in
greater detail, with leading scholars such as Stubbs and Scott outlining the main concepts and problems in keyword analysis. Sections two and three function as interesting collections of case studies on corpora drawn from engineering, politics, media, and textbooks, from a range of time periods and places. Archer, D. (Ed.). 2009. What’s in a word-list? Investigating word frequency and keyword extraction. London: Routledge. This edited collection has a number of chapters of particular relevance for scholars interested in keyness. Mike Scott explores reference corpus selection and discusses the eventual impact on findings. Tony McEnery and Paul Baker have chapters using keyness to critically examine the discourses in media and politics, respectively. Those interested in Culpeper’s (2009) paper above may like to read a wider study on Shakespeare’s comedies and tragedies, by Archer, Culpeper, and Rayson. Finally, Archer makes an argument for wider incorporation of frequency and keyword extraction techniques in the closing chapter.

References

Baker, P. (2004). Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics, 32(4).
Baker, P. (2017). British and American English: Divided by a common language? Cambridge: Cambridge University Press.
Baker, P., Gabrielatos, C., & McEnery, T. (2013). Discourse analysis and media attitudes: The representation of Islam in the British Press. Cambridge: Cambridge University Press.
Baron, A., Rayson, P., & Archer, D. (2009). Word frequency and key word statistics in corpus linguistics. Anglistik, 20(1), 41–67.
Boneva, B., & Kraut, R. (2002). Email, gender, and personal relations. In B. Wellman & C. Haythornthwaite (Eds.), The internet in everyday life (pp. 372–403). Oxford: Blackwell.
Brezina, V., & Meyerhoff, M. (2014). Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics, 19(1), 1–28.
Cressie, N., & Read, T. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B: Methodological, 46(3), 440–464.
Crossley, S. A., Defore, C., Kyle, K., Dai, J., & McNamara, D. S. (2013). Paragraph specific n-gram approaches to automatically assessing essay quality. In S. K. D'Mello, R. A. Calvo, & A. Olney (Eds.), Proceedings of the 6th international conference on educational data mining (pp. 216–219). Heidelberg/Berlin: Springer.
Culpeper, J. (2009). Words, parts-of-speech and semantic categories in the character-talk of Shakespeare's Romeo and Juliet. International Journal of Corpus Linguistics, 14(1), 29–59.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analysis. Corpora, 14(1), 77–104.
Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory, 1(2), 277–294.
Hardie, A. (2014). Log Ratio – an informal introduction. CASS blog: http://cass.lancs.ac.uk/?p=1133. Accessed 25 June 2019.


Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Bergen, Norway: The Norwegian Computing Centre for the Humanities.
Juilland, A., Brodin, D., & Davidovitch, C. (1970). Frequency dictionary of French words. Paris: Mouton.
Kilgarriff, A. (1996). Why chi-square doesn't work, and an improved LOB-Brown comparison. In Proceedings of the ALLC-ACH conference (pp. 169–172). Bergen, Norway.
Kilgarriff, A. (2005). Language is never ever ever random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276.
Kyle, K., Crossley, S., Dai, J., & McNamara, D. (2013, June 13). Native language identification: A key N-gram category approach. In Proceedings of the eighth workshop on innovative use of NLP for building educational applications (pp. 242–250). Atlanta, Georgia.
Lijffijt, J., Nevalainen, T., Säily, T., Papapetrou, P., Puolamäki, K., & Mannila, H. (2016). Significance testing of word frequencies in corpora. Literary and Linguistic Computing, 31(2), 374–397.
Mahlberg, M. (2008). Clusters, key clusters and local textual functions in Dickens. Corpora, 2(1), 1–31.
Murphy, B. (2010). Corpus and sociolinguistics: Investigating age and gender in female talk. Amsterdam: John Benjamins.
Paquot, M. (2013). Lexical bundles and transfer effects. International Journal of Corpus Linguistics, 18(3), 391–417.
Paquot, M. (2014). Cross-linguistic influence and formulaic language: Recurrent word sequences in French learner writing. In L. Roberts, I. Vedder, & J. Hulstijn (Eds.), EUROSLA yearbook (pp. 216–237). Amsterdam: Benjamins.
Paquot, M. (2017). L1 frequency in foreign language acquisition: Recurrent word combinations in French and Spanish EFL learner writing. Second Language Research, 33(1), 13–32.
Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In A. Jucker, D. Schreier, & M. Hundt (Eds.), Corpora: Pragmatics and discourse (pp. 247–269). Amsterdam: Rodopi.
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519–549.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000), 1–8 October 2000, Hong Kong (pp. 1–6).
Rayson, P., & Wilson, A. (1996). The ACAMRIT semantic tagging system: Progress report. In L. J. Evett & T. G. Rose (Eds.), Language engineering for document analysis and recognition, LEDAR, AISB96 workshop proceedings (pp. 13–20). Brighton: Faculty of Engineering and Computing, Nottingham Trent University, UK.
Rayson, P., Berridge, D., & Francis, B. (2004a, March 10–12). Extending the Cochran rule for the comparison of word frequencies between corpora. In G. Purnelle, C. Fairon, & A. Dister (Eds.), Le poids des mots: Proceedings of the 7th international conference on statistical analysis of textual data (JADT 2004) (Vol. II, pp. 926–936). Louvain-la-Neuve: Presses Universitaires de Louvain.
Rayson, P., Archer, D., Piao, S. L., & McEnery, T. (2004b). The UCREL semantic analysis system. In Proceedings of the workshop on beyond named entity recognition: Semantic labelling for NLP tasks, in association with the 4th international conference on language resources and evaluation (LREC 2004), 25th May 2004, Lisbon, Portugal (pp. 7–12). Paris: European Language Resources Association.
Scott, M. (1997). PC analysis of key words – And key key words. System, 25(2), 233–245.
Scott, M. (2004). WordSmith tools. Version 4.0. Oxford: Oxford University Press. ISBN: 0-19459400-9.
Seale, C., Ziebland, S., & Charteris-Black, J. (2006). Gender, cancer experience and internet use: A comparative keyword analysis of interviews and online cancer support groups. Social Science & Medicine, 62, 2577–2590.

6 Analysing Keyword Lists


Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press. Tono, Y., Yamazaki, M., & Maekawa, K. (2013). A frequency dictionary of Japanese. Routledge. Vasishth, S., & Nicenboim, B. (2016). Statistical methods for linguistic research: Foundational ideas – Part I. Lang & Ling Compass, 10, 349–369. https://doi.org/10.1111/lnc3.12201. Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. Wilson, A. (2013). Embracing Bayes factors for key item analysis in corpus linguistics. In New approaches to the study of linguistic variability. Language competence and language awareness in Europe (pp. 3–11). Frankfurt: Peter Lang. Wilson, A., & Rayson, P. (1993). Automatic content analysis of spoken discourse. In C. Souter & E. Atwell (Eds.), Corpus based computational linguistics (pp. 215–226). Amsterdam: Rodopi.

Chapter 7

Analyzing Co-occurrence Data

Stefan Th. Gries and Philip Durrant

Abstract In this chapter, we provide an overview of quantitative approaches to co-occurrence data. We begin with a brief terminological overview of different types of co-occurrence that are prominent in corpus-linguistic studies and then discuss the computation of some widely-used measures of association used to quantify co-occurrence. We present two representative case studies, one exploring lexical collocation and learner proficiency, the other creative uses of verbs with argument structure constructions. In addition, we highlight how most widely-used measures actually all fall out from viewing corpus-linguistic association as an instance of regression modeling, and discuss newer developments and potential improvements of association measure research such as utilizing directional measures of association, not uncritically conflating frequency and association-strength information in association measures, type frequencies, and entropies.

Electronic Supplementary Material The online version of this chapter (https://doi.org/10.1007/978-3-030-46216-1_7) contains supplementary material, which is available to authorized users.

S. Th. Gries ()
University of California Santa Barbara, Santa Barbara, CA, USA
Justus Liebig University Giessen, Giessen, Germany
e-mail: [email protected]

P. Durrant
University of Exeter, Exeter, UK

© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_7

7.1 Introduction

7.1.1 General Introduction

One of the, if not the, most central assumptions underlying corpus-linguistic work is captured in the so-called distributional hypothesis, which holds that linguistic elements that are similar in terms of their distributional patterning in corpora also exhibit some semantic or functional similarity. Typically, corpus linguists like to cite Firth's (1957:11) famous dictum "[y]ou shall know a word by the company it keeps", but Harris's (1970:785f.) following statement actually makes the same case much more explicitly, or much more operationalizably:

[i]f we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference of meaning correlates with difference of distribution.

That is, a linguistic expression E – a morpheme, word, construction/pattern, . . . – can be studied by exploring what is co-occurring with E and how often. Depending on what the elements of interest are whose co-occurrence is studied, different terms have been used for such co-occurrence phenomena:

• lexical co-occurrence, i.e. the co-occurrence of words with other words, such as the strong preference of hermetically to co-occur with, or more specifically, be followed by, sealed, is referred to as collocation; for collocations, it is important to point out the locus of the co-occurrence, and Evert (2009:1215) distinguishes between (i) surface co-occurrence (words that are not more than a span/window size of s words apart from each other; often s is 4 or 5), (ii) textual co-occurrence (words in the same clause, sentence, paragraph, . . . ), and (iii) syntactic co-occurrence (words in a syntactic relation);

• lexico-grammatical co-occurrence, i.e. the co-occurrence of words with grammatical patterns or constructions, such as the strong preference of the verb regard to be used in the passive of the as-predicative (e.g., The Borg were regarded as the greatest threat to the Federation), is referred to as colligation or collostruction (see McEnery et al. 2006:11 or Stefanowitsch and Gries 2003).1

Different studies have adopted different views on how collocation in particular, but also co-occurrence more broadly, should be approached – how many elements are considered (two or more)? Do we need a minimum observed frequency of occurrence in some corpus? Is a certain degree of unpredictability/idiosyncrasy that the co-occurrence exhibits a necessary condition for collocation status? etc.
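To make the notion of surface co-occurrence concrete, here is a minimal Python sketch (our own illustration, not taken from the chapter or its companion code) that counts the collocates within a span of s tokens around every occurrence of a node word; the token list and span size are made-up examples.

```python
from collections import Counter

def surface_cooccurrences(tokens, node, span=4):
    """Count word types occurring within +/- span tokens of each
    occurrence of the node word (Evert's 'surface co-occurrence')."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node occurrence itself
                    counts[tokens[j]] += 1
    return counts

tokens = "the jar was hermetically sealed and then hermetically sealed again".split()
print(surface_cooccurrences(tokens, "hermetically", span=2).most_common(3))
```

Real studies would of course add case-folding, tokenization, and sentence-boundary handling; the point here is only that the span parameter s directly implements the surface definition above.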
Also, co-occurrence applications differ in their retrieval procedures: Studies that target a word or a construction may retrieve all instances of the word/construction in question and explore its co-occurring elements; other studies might approach a corpus with an eye to identifying all (strong) collocations for lexicographic, didactic, contrastive, or other purposes. For the sake of generality, we will discuss here a somewhat atheoretical notion of co-occurrence that eschews commitments regarding all of the above questions and is based only on some mathematical relation between the observed co-occurrence and non-co-occurrence frequencies of l elements in a corpus; it goes without saying that different research questions or practical applications may require one or more commitments regarding the above questions (see Bartsch (2004), Gries (2008b), and Evert (2009) for more discussion of the parameters underlying co-occurrence and their historical development).

1 We are ignoring the lexico-textual co-occurrence sense of colligation here.

The simplest possible way to explore a linguistic element (such as hermetically or regard) would be by raw co-occurrence frequency – how often do I find the collocation hermetically sealed in my corpus? – or, more likely, conditional probabilities such as p(contextual element(s)|E) – how likely is a verbal construction to be an as-predicative when the verb in the construction is regard? While obtaining sorted frequency lists that reveal which collocates or constructions occur most often or are most likely around an element E is straightforward, much corpus-linguistic research has gone a different route and used more complex measures to separate the wheat (linguistically revealing co-occurrence data) from the chaff (the fact that certain function words such as the, of, or in occur with everything a lot, qua their overall high frequency). Such measures are often referred to as association measures (AMs) simply because they typically quantify the strength of mutual association between two elements such as two words or a word and a construction. In the following section, we discuss fundamental aspects of the computation of some of the most widely-used AMs.

7.2 Fundamentals

For decades now, AMs have typically been explained on the basis of co-occurrence tables of the kind exemplified in Table 7.1, which contain observed frequencies of (co-)occurrence of a linguistic expression E (for instance a particular word) and one of the l types of contextual elements X1–l (e.g. other words or constructions E can occur with/in); for instance, if the ditransitive construction is attested with l = 80 verb types in a corpus, one would generate 80 such co-occurrence tables. In each such table, cell a is the frequency with which E is observed with/in element X, cell b is the frequency with which E is observed without X (this means the overall frequency of E is a + b), etc. Often, such a table would also contain or at least refer to the corresponding expected frequencies in the same cells a to d, i.e. the frequencies with which X and E would be observed together and in isolation if their occurrences were completely randomized; these frequencies are computed from the row and column totals as indicated in Table 7.1, as they would be for, say, a chi-squared test.

Table 7.1 Schematic co-occurrence frequency table

                         Co-occurring element X      Other elements (not X)      Row totals
Element E                Obs.: a                     Obs.: b                     a + b
                         Exp.: (a+b) × (a+c) / n     Exp.: (a+b) × (b+d) / n
Other elements (not E)   Obs.: c                     Obs.: d                     c + d
                         Exp.: (c+d) × (a+c) / n     Exp.: (c+d) × (b+d) / n
Column totals            a + c                       b + d                       a + b + c + d = n



Table 7.2 Co-occurrence frequencies of regard and the as-predicative in Gries et al. (2005)

                As-predicative                     Other constructions                    Row totals
Regard          80                                 19                                     99
                exp.: 99 × 687 / 138,664           exp.: 99 × 137,977 / 138,664
Other verbs     607                                137,958                                138,565
                exp.: 138,565 × 687 / 138,664      exp.: 138,565 × 137,977 / 138,664
Column totals   687                                137,977                                138,664

As mentioned above, such a co-occurrence table is generated for every element type X1–l ever occurring with E at least once or, if the element analyzed is X, then such a co-occurrence table is generated for every element type E1–l ever occurring with X at least once. For instance, if one studied the as-predicative construction, then X might be that construction and elements E1–l could be all verbs occurring in that construction at least once, and one could use the values in each of the l tables to compute an AM for every one of the l verb types of E co-occurring with X. These results could then be used to, for instance, rank-order the verbs by strength of attraction and then study them, which is often interesting because of how expressions that co-occur with X reveal structural and/or functional characteristics of E (recall the Firth and Harris quotes from above). A large number of AMs has been proposed over the last few decades, including (i) measures that are based on asymptotic or exact significance tests, (ii) measures from, or related to, information theory, (iii) statistical effect sizes, and various other measures or heuristics; Evert (2009) and Pecina (2010) discuss altogether more than 80 measures, and since then even more measures have been proposed. However, the by far most widely-used measures are (i) the log-likelihood measure G2 (which is somewhat similar to the chi-squared test and, thus, the z-score, and which is highly correlated with the p-value of the Fisher-Yates exact test as well as the t-score), (ii) the pointwise Mutual Information (MI), (iii) the t-score, and (iv) the odds ratio (and/or its logged version), which are all exemplified here on the basis of the frequencies in Table 7.2 of the co-occurrence of regard and the as-predicative in the British Component of the International Corpus of English reported in Gries et al. (2005).

(1)  G2 = 2 × Σ(i=1..4) obs_i × log(obs_i / exp_i) ≈ 762.196

(2)  pointwise Mutual Information = log2(a / a_exp) = log2(80 / 0.49) ≈ 7.349

(3)  t = (a − a_exp) / √a = (80 − 0.49) / √80 ≈ 8.889

(4)  odds ratio = (a/b) / (c/d) = (80/19) / (607/137,958) ≈ 956.962  (log odds ratio ≈ 6.864)
All four measures indicate that there is a strong mutual association between X (the as-predicative) and E (regard); if one computed the actual p-value following



from this G2, one would obtain a result of p < 10⁻¹⁶⁷.² However, this sentence also points to what has been argued to be a shortcoming of these measures: The fact that they quantify mutual attraction means that they do not distinguish between different kinds of attracted elements:

• instances of collocations/collostructions where X attracts E but E does not attract X (or at least much less so);
• instances where E attracts X but X does not attract E (or at least much less so);
• instances where both elements attract each other (strongly).

Based on initial discussion by Ellis (2007), Gries (2013a) has shown that each of these three kinds of collocations is common among the elements annotated as multi-word units in the British National Corpus:

• according to or upside down are examples of the first kind: If one picks any bigram that has to or down as its second word, it is nearly impossible to predict which words will precede it, but if one picks any bigram with according or upside as the first word, one is quite likely to guess the second one correctly;
• of course or for instance are examples of the second kind: If one picks any bigram with of or for as the first word, it is nearly impossible to predict which word will follow, but if one picks any bigram with course or instance as the second word, one is quite likely to guess that of or for is the first word;
• Sinn Fein and bona fide are examples of the third kind: each word is very highly predictive of the other.
Crucially, all of the above examples are highly significant – in the spoken part of the BNC, all have G2-values of >178 and p-values of > 0.699), which is more/different evidence that the distribution in Table 7.2 is better quantified with uni-directional measures.⁴ The two versions of this measure are fairly highly correlated with ΔP (r > 0.86 in the as-predicative data, for instance, and > 0.8 in Baayen's 2011 comparison of multiple AMs), but an attractive feature of DKL is that (i) it is a measure that has interdisciplinary appeal given the wide variety of uses that information-theoretical concepts have and (ii) it can also be used for other corpus-linguistically relevant phenomena such as dispersion (see Chap. 5), thus allowing the researcher to use one and the same metric for different facets of co-occurrence data.

Baayen's second proposal is to use the varying intercepts of the simplest kind of mixed-effects model (see Chap. 22). Essentially, for the as-predicative data from Table 7.2 used as an example above, this approach would require as input a data frame in the case-by-variable format, i.e. with 138,664 rows (one for each construction) and two columns (one with the constructional choices (as-predicative vs. other), one with all verb types (regard, see, know, consider, . . . , other)) in the data. Then, one can compute a generalized linear mixed-effects model in which one determines the basic log odds of the as-predicative (−3.4214) but, more crucially, also how each verb affects the log odds of the as-predicative differently, which reflects its association to the as-predicative. These values are again positively correlated with, say, ΔPs, but the advantage they offer is that, because they too derive from the unified perspective of the more powerful/general approach of regression modeling, they allow researchers to effortlessly include other predictors in the exploration of co-occurrence. For instance, the as-predicative is not only strongly attracted to verbs (such as regard, hail, categorize, . . . ) but also to the passive voice. However, traditional AM analysis does not usually consider additional attractors of a word or a construction, whereas within a regression framework those are more straightforward to add to a regression model than with just about any other method. In sum, AM research requires more exploration of measures that allow for elegant ways to include more information in the analysis of co-occurrence phenomena.

3 The Kullback-Leibler divergence is also already mentioned in Pecina (2010).
4 See Michelbacher et al. (2007, 2011) and Gries (2013a) for further explorations of uni-directional/asymmetric measures.
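One family of uni-directional measures of the kind discussed above is ΔP (see Ellis 2007; Gries 2013a): from the same 2×2 table, ΔP(outcome|cue) = a/(a+b) − c/(c+d) when the cue is the row element, and a/(a+c) − b/(b+d) when it is the column element. A minimal Python sketch (ours, not the chapter's companion code) with the Table 7.2 frequencies shows how asymmetric the regard/as-predicative association is:

```python
# Cell frequencies from Table 7.2: regard vs. the as-predicative
a, b, c, d = 80, 19, 607, 137958

# delta P(as-predicative | regard): how predictive the verb is of the construction
dp_cx_given_verb = a / (a + b) - c / (c + d)

# delta P(regard | as-predicative): how predictive the construction is of the verb
dp_verb_given_cx = a / (a + c) - b / (b + d)

print(round(dp_cx_given_verb, 4), round(dp_verb_given_cx, 4))
```

The verb is a far better cue for the construction (≈0.80) than the construction is for the verb (≈0.12), a difference that a single bidirectional score such as G2 or MI would collapse.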

7.3.3 Additional Information to Include

Another kind of desideratum for future research involves the kind of input to analyses of co-occurrence data. So far, all of the above involved only token frequencies of (co-)occurrence, but co-occurrence is a more multi-faceted phenomenon and it seems as if the following three dimensions of information are worthy of much more attention than they have received so far (see Gries 2012, 2015, 2019 for some discussion):

• type frequencies of co-occurrence: current analyses of co-occurrence based on tables such as Table 7.2 do not consider the number of different types that make up the frequencies in cells b (19) and c (607) even though it is well known that type frequency is correlated with many linguistic questions involving productivity, learnability, and language change. So far, the only AM that has ever been suggested to involve type frequencies is Daudaravičius and Marcinkevičienė's (2004) lexical gravity, but there are hardly any studies that explore this important issue in more detail (one case in point is Gries and Mukherjee 2010);

• entropies of co-occurrence: similarly to the previous point, not only do studies not consider the frequencies of types with which elements co-occur, they therefore also do not consider the entropies of these types, i.e. the informativity of these frequencies/distributions. Arguably, distributions with a low(er) entropy would reflect strong(er) associations whereas distributions with a high(er) entropy would reflect weak(er) associations. Since entropies of type frequencies are relevant to many aspects of linguistic learning and processing (see Goldberg et al. 2004; Linzen and Jaeger 2015; or Lester and Moscoso del Prado 2016), this is a dimension of information that should ultimately be added to the corpus linguist's toolbox;

• dispersion of co-occurrence (see Gries 2008a, Chap. 5): given how any kind of AM is based on co-occurrence frequencies of elements in a corpus, it is obvious that AMs are sensitive to underdispersion. Co-occurrence frequencies as entered into tables such as Table 7.2 may yield very unrepresentative results if they are based on only very small parts of the corpus under investigation. For instance, Stefanowitsch and Gries (2003) find that the verbs fold and process are highly attracted to the imperative construction in the ICE-GB, but also note that fold and process really only occur with the imperative in just a single one of the 500 files of the ICE-GB – the high AM scores should therefore be taken with a grain of salt, and dispersion should be considered whenever association is.

To conclude, from our above general discussion and desiderata, one main take-home message should be that, while AMs have been playing a vital role in the corpus-linguistic analysis of co-occurrence, much remains to be done lest we continue to underestimate the complexity and multidimensionality of the notion of co-occurrence. Our advice to readers would be

• to familiarize themselves with a small number of 'standard' measures such as G2, MI, and t; but
• to also immediately begin to learn the very basics of logistic regression modeling to (i) be able to realize the connections between seemingly disparate measures as well as (ii) become able to easily implement directional measures when the task requires it;
• to develop even the most basic knowledge of a programming language like R to avoid being boxed into what currently available tools provide, which we will briefly discuss in the next section.
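The type-frequency and entropy desiderata above are straightforward to operationalize: given the frequencies of the types occurring in, say, a construction's verb slot, the type frequency is the number of distinct types and the Shannon entropy is −Σ p × log2 p. A short Python sketch with invented verb-slot frequencies (a made-up example, not data from the chapter):

```python
import math
from collections import Counter

# Hypothetical verb-type frequencies in some construction's verb slot
slot = Counter({"give": 120, "send": 30, "offer": 25, "hand": 5})

type_freq = len(slot)  # number of distinct co-occurring types
total = sum(slot.values())

# Shannon entropy in bits: low entropy = one type dominates the slot
entropy = -sum((f / total) * math.log2(f / total) for f in slot.values())

print(type_freq, round(entropy, 3))
```

A maximally uniform distribution over these four types would have the ceiling entropy of log2(4) = 2 bits; the skewed distribution above comes out well below that, reflecting the slot's strong(er) association with its dominant type.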

7.4 Tools and Resources

While co-occurrence is one of the most fundamental notions used in corpus linguistics, it is not nearly as widely implemented in corpus tools as it should be. This is for two main reasons. First, existing tools offer only a very small number of measures, if any, and no ways to implement new ones or tweak existing ones. For instance, WordSmith Tools offers MI and its derivative MI3, t, z, G2, and a few less widely-used ones (from WordSmith's website), and AntConc offers MI, G2, and t (from AntConc's website). While this is probably a representative selection of the most frequent AMs, all of these are bidirectional, for instance, which limits their applicability for many questions. Second, these tools only provide AMs for what they 'think' are words, which means that colligations/collostructions and many other co-occurrence applications cannot readily be handled by them. As so often, and as already mentioned in Chap. 5, the most versatile and powerful approach to exploring co-occurrence is with programming languages such as R or Python, because then the user is not restricted to lexical co-occurrence and dependent on measures/settings enshrined in ready-made software black boxes, but can customize an analysis in exactly the way that is needed; some very rudimentary exemplification can be found in the companion code file to this chapter; also, see http://collocations.de for a comprehensive overview of many measures.



Further Reading

Pecina, P. 2010. Lexical association measures and collocation extraction. Language Resources and Evaluation 44(1):137–158.
Pecina (2010) appears to be the most comprehensive overview of corpus- and computational-linguistic AMs focusing on automatic collocation extraction. In this highly technical paper, 82 different AMs are compared with regard to how well they identify true collocations in three scenarios (kinds of corpus data) and evaluated on the basis of precision-recall curves, i.e. curves that determine precision (true positives (correctly identified collocations) / all positives (all identified collocations)) and recall (true positives / all trues (collocations to be found)) values for every possible threshold value an AM would allow for. For two of the three kinds of corpus data, measures that can be assumed to be unknown to most corpus linguists score the highest mean average precision (cosine context similarity and the unigram subtuple measure); for the largest data set, the better-known pointwise MI scores second highest, and some other well-known measures (including z and the odds ratio) score well in at least one scenario.

Wiechmann, D. 2008. On the computation of collostruction strength: testing measures of association as expression of lexical bias. Corpus Linguistics and Linguistic Theory 4(2):253–290.
Wiechmann (2008) also provides a wide-ranging empirical comparison of association measures, specifically those pertaining to collostruction. He focuses on how well various measures of collostruction strength predict the processing of sentences in which a noun phrase is temporarily ambiguous between being a direct object (The athlete revealed his problem because his parents worried) and the subject of a subordinate clause (The athlete revealed his problem worried his parents), using cluster and regression analyses.

References

Ackermann, K., & Chen, Y. H. (2013). Developing the academic collocation list (ACL) – A corpus-driven and expert-judged approach. Journal of English for Academic Purposes, 12(4), 235–247. Baayen, R. H. (2011). Corpus linguistics and naive discriminative learning. Brazilian Journal of Applied Linguistics, 11(2), 295–328. Bartsch, S. (2004). Structural and functional properties of collocations in English. Tübingen: Narr. Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26(4), 28–41. Daudaravičius, V., & Marcinkevičienė, R. (2004). Gravity counts for the boundaries of collocations. International Journal of Corpus Linguistics, 9(2), 321–348. Durrant, P. (2014). Corpus frequency and second language learners' knowledge of collocations. International Journal of Corpus Linguistics, 19(4), 443–477.



Durrant, P., & Schmitt, N. (2009). To what extent do native and non-native writers make use of collocations? International Review of Applied Linguistics, 47(2), 157–177. Ellis, N. C. (2007). Language acquisition as rational contingency learning. Applied Linguistics, 27(1), 1–24. Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native and secondlanguage speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly, 1(3), 375–396. Evert, S. (2009). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1212–1248). Berlin/New York: Mouton De Gruyter. Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. Reprinted in Palmer FR (Ed.), (1968) Selected papers of J.R. Firth, 1952–1959. Longman, London. Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press. Goldberg, A. E., Casenhiser, D. M., & Sethuraman, N. (2004). Learning argument structure generalizations. Cognitive Linguistics, 15(3), 289–316. Gries, S. Th. (2008a). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. Gries, S. Th. (2008b). Phraseology and linguistic theory: A brief survey. In S. Granger & F. Meunier (Eds.), Phraseology: An interdisciplinary perspective (pp. 3–25). Amsterdam/Philadelphia: John Benjamins. Gries, S. Th. (2012). Frequencies, probabilities, association measures in usage−/exemplar-based linguistics: Some necessary clarifications. Studies in Language, 36(3), 477–510. Gries, S. Th. (2013a). 50-something years of work on collocations: What is or should be next . . . . International Journal of Corpus Linguistics, 18(1), 137–165. Gries, S. Th. (2013b). Statistics for linguistics with R (2nd rev. & ext. ed) De Gruyter Mouton: Boston/New York. Gries, S. Th. (2015). 
More (old and new) misunderstandings of collostructional analysis: On Schmid & Küchenhoff (2013). Cognitive Linguistics, 26(3), 505–536. Gries, S. Th. (2018). On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally. Journal of Second Language Studies, 1(2), 276–308. Gries, S. Th. (2019). 15 years of collostructions: Some long overdue additions/corrections (to/of actually all sorts of corpus-linguistics measures). International Journal of Corpus Linguistics, 24(3), 385–412. Gries, S. Th., & Mukherjee, J. (2010). Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes. International Journal of Corpus Linguistics, 15(4), 520–548. Gries, S. Th., Hampe, B., & Schönefeld, D. (2005). Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics, 16(4), 635–676. Hampe, B., & Schönefeld, D. (2006). Syntactic leaps or lexical variation? – More on "Creative Syntax". In S. Th. Gries & A. Stefanowitsch (Eds.), Corpora in cognitive linguistics: Corpus-based approaches to syntax and lexis (pp. 127–157). Berlin/New York: Mouton de Gruyter. Harris, Z. S. (1970). Papers in structural and transformational linguistics. Dordrecht: Reidel. Lester, N. A., & Moscoso del Prado, M. F. (2016). Syntactic flexibility in the noun: Evidence from picture naming. In A. Papafragou, D. Grodner, D. Mirman, & J. C. Trueswell (Eds.), Proceedings of the 38th annual conference of the cognitive science society (pp. 2585–2590). Austin: Cognitive Science Society. Linzen, T., & Jaeger, T. F. (2015). Uncertainty and expectation in sentence processing: Evidence from subcategorization distributions. Cognitive Science, 40(6), 1382–1411.
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. Oxon/New York: Routledge.



Michelbacher, L., Evert, S., & Schütze, H. (2007). Asymmetric association measures. International Conference on Recent Advances in Natural Language Processing. Michelbacher, L., Evert, S., & Schütze, H. (2011). Asymmetry in corpus-derived and human word associations. Corpus Linguistics and Linguistic Theory, 7(2), 245–276. Mollin, S. (2009). Combining corpus linguistic and psychological data on word co-occurrences: Corpus collocates versus word associations. Corpus Linguistics and Linguistic Theory, 5(2), 175–200. Pecina, P. (2010). Lexical association measures and collocation extraction. Language Resources and Evaluation, 44(1), 137–158. Schneider, U. (to appear). Delta P as a measure of collocation strength. Corpus Linguistics and Linguistic Theory. Siyanova-Chanturia, A. (2015). Collocation in beginner learner writing: A longitudinal study. System, 53(4), 148–160. Stefanowitsch, A., & Gries, S. T. (2003). Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics, 8(2), 209–243. Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.

Chapter 8

Analyzing Concordances

Stefanie Wulff and Paul Baker

Abstract In its simplest form, a concordance is a list of all attestations (or hits) of a particular search word or phrase, presented with a user-defined amount of context to the left and right of the search word or phrase. In this chapter, we describe how to generate and manipulate concordances, and we discuss how they can be employed in research and teaching. We describe how to generate, sort, and prune concordances prior to further analysis or use. In a section devoted to qualitative analysis, we detail how a discourse-analytical approach, either on the basis of unannotated concordance lines or on the basis of output generated by a prior quantitative examination of the data, can help describe and, crucially, explain the observable patterns, for instance by recourse to concepts such as semantic prosody. In a section devoted to quantitative analysis, we discuss how concordance lines can be scrutinized for various properties of the search term and annotated accordingly. Annotated concordance data enable the researcher to perform statistical analyses over hundreds or thousands of data points, identifying distributional patterns that might otherwise escape the researcher’s attention. In a third section, we turn to pedagogical applications of concordances. We close with a critical assessment of contemporary use of concordances as well as some suggestions for the adequate use of concordances in both research and teaching contexts, and give pointers to tools and resources.

S. Wulff ()
University of Florida, Gainesville, Florida, USA
e-mail: [email protected]

P. Baker
Lancaster University, Lancaster, UK
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_8




8.1 Introduction

In this chapter, we describe how to generate, manipulate, and analyze concordances. In its simplest form, a concordance is "a list of all the occurrences of a particular search term in a corpus, presented within the context that they occur in; usually a few words to the left and the right of the search term" (Baker 2006: 71). The specified amount of context is often also referred to as the "(context) window/span". Sometimes, concordances are called "key words in context", or KWIC, displays, with "key word" referring to the search word or phrase; this is not to be confused with another use of the term "key word" in corpus linguistics as the words or phrases that are statistically distinctive for a certain corpus sample (see Chap. 6). The term concordance usually refers to the entire list of hits, although sometimes researchers refer to a single line from the list as a concordance. In this chapter, to avoid confusion, we refer to a concordance as the list of citations, distinguishing it from a concordance line, which is a single citation. We have used a range of concordancing tools to create this chapter, but for consistency, we have formatted the concordances in a standard way. Figure 8.1 is an example of a concordance for the search terms refugee and refugees in the British National Corpus (BNC), with a context window of 100 characters around the search terms. The BNC contains data from the late 20th century. Overall, refugee and refugees occur 2,733 times in the BNC, so displaying the entire concordance is not an option here – instead, Fig. 8.1 displays only a snippet of 15 concordance lines. The typical layout of concordances – the search term in the middle; some context around it; the left-hand context aligned flush right and the right-hand context aligned flush left – is meant to make it easier to inspect many examples at one glance in order




Fig. 8.1 Concordance snippet of 15 attestations of refugee|refugees in the BNC [the aligned KWIC display cannot be reproduced in this text-only version]

8 Analyzing Concordances


to identify patterns around the search term that might escape one’s notice if one were reading the examples as one running text. In Fig. 8.1, it is in fact difficult to see recurring patterns, mainly because we are looking only at 15 examples. There are ways, however, to make patterns more easily visible by sorting the concordance, which we turn to next. For now, we can examine the snippet in Fig. 8.1 and already glean some information about how refugee(s) are talked about. For example, we learn about their location: the word embassy is mentioned four times; trains and camp also appear in the context window describing the location of refugee(s). Similarly, looking at the verb predicates following refugee(s), we get a glimpse of the actions associated with them: walking, fleeing, clambered, and crammed are four examples.
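For readers who want to experiment beyond ready-made concordancers, the basic KWIC mechanics are easy to sketch. The following Python fragment is illustrative only; the `kwic` and `display` helpers are our own naming for the purposes of this sketch and are not taken from any particular concordancing tool:

```python
import re

def kwic(text, pattern, window=40):
    """Return (left, match, right) triples for every regex match,
    with up to `window` characters of context on each side."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        lines.append((left, m.group(), right))
    return lines

def display(lines, width=40):
    """Format triples in the classic KWIC layout: left context
    flush right, node word in the middle, right context flush left."""
    return "\n".join(f"{left[-width:]:>{width}} | {node} | {right[:width]}"
                     for left, node, right in lines)

text = ("Most refugees crossed the border at night. "
        "One refugee was turned back. The refugee camp was full.")
conc = kwic(text, r"refugees?\b")
print(display(conc))
```

Real tools add word-based (rather than character-based) windows, case folding, and corpus indexing, but the display logic is essentially the above.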

8.2 Fundamentals 8.2.1 Sorting and Pruning Concordances While concordances are typically formatted slightly differently from regular running text, it can be difficult to see patterns because the attestations in a simple concordance as shown in Fig. 8.1 are in chronological order of their occurrence in the corpus. For that reason, most concordance tools have an option to widen the context window, and/or to sort the concordance display according to the words in the left and/or right-hand context. For example, we could choose to sort the concordance according to the first word to the right of the search word so that we see the immediate right-hand collocates (also called “R1”) of refugee(s) in alphabetical order, which makes it easy to see if there are potential collocates prominently following. Alternatively, we could sort the concordance by the word immediately preceding refugee(s) (the L1 collocates), which would give us an idea of what words, if any, likely modify refugee(s). Below are two examples. Figure 8.2 shows another 15-line snippet from the same concordance for refugee(s) in the BNC that was sorted in a more useful way: according to the first word to the right (R1) of refugee(s), then nested in a sort according to the second word to the right (R2) of refugee(s), and nested in a sort according to the third word to the right (R3) of refugee(s). This particular snippet shows that one of the words immediately to the right of refugee is camps, followed by the preposition in as the second word to the right, followed in the slot in third position to the right by a place name. Generalizing across these examples, we can say that one frequent pattern containing refugee(s) is [refugee camps in PLACENAME]. 
Other R1-collocates occurring at least 5 times or more often in the entire concordance include more specific descriptions of refugees (children, communities, families, population), their movement between countries (exodus, migration), and societal evaluations of admitting refugees to one’s country (crisis, problem, situation).
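The nested R1/R2/R3 sort described above amounts to sorting concordance lines by the first few words of their right-hand context. A minimal sketch, assuming lines are stored as (left, node, right) triples as in the previous fragment:

```python
def sort_right(conc_lines, depth=3):
    """Sort concordance triples (left, node, right) by the first
    `depth` words of the right-hand context (R1, then R2, then R3),
    case-insensitively, as most concordancers do."""
    def key(line):
        return line[2].lower().split()[:depth]
    return sorted(conc_lines, key=key)

conc = [
    ("fled to the", "refugee", "camps in Thailand"),
    ("a growing",   "refugee", "crisis in Europe"),
    ("toured the",  "refugee", "camps in Hungary"),
]
for left, node, right in sort_right(conc):
    print(f"{left:>15} {node} {right}")
```

An L1/L2/L3 sort works the same way, except that the key is built from the last words of the left-hand context read right to left.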



Fig. 8.2 Concordance snippet of 15 attestations of refugee|refugees in the BNC sorted by R1/R2/R3 [the aligned KWIC display cannot be reproduced in this text-only version]



Fig. 8.3 Concordance snippet of 15 attestations of refugee(s) in the BNC sorted by L1/L2/L3 [the aligned KWIC display cannot be reproduced in this text-only version]

Figure 8.3, in contrast, is a snippet from the same concordance sorted in a nested fashion first by L1, then L2, and then L3 collocates. One pattern that emerges from this sorting is the phrase bogus refugee, frequently combined with the verb curb. Another pattern instantly visible is Bosnian refugee. If we had the space to display the entire concordance sorted to the left, many more such patterns would become visible, including phrases like Armenian refugee(s), Bosnian refugee(s), and Palestinian refugee(s), to give but three examples of L1-collocates telling us something about the home countries of refugees; phrases like civilian refugee(s), political refugee(s), and religious refugee(s), reflecting how refugees are categorized



by their (assumed) motivation to leave their home countries; and phrases like genuine refugee(s) and would-be refugee(s), which, alongside bogus refugee(s), are used frequently in the intense societal and political debate around the legitimacy of refugees. Depending on what your end goal is, you may want to not only sort (and sort in different ways to obtain different perspectives on your data), but also prune a concordance. By pruning, we here mean one or more of the following: deleting certain concordance lines and keeping others; narrowing down the context window; or blanking out the search term and/or collocates. Most typically, we delete concordance lines and/or clip the context window in the interest of saving space. Spatial restrictions apply to a handbook article like this one (hence the 15-line snippets as opposed to displaying the concordance in its entirety) as much as to concordances prepared for classroom use (not many students would want to inspect thousands of concordance lines). In a research context, in contrast, especially when researchers want to make a claim for exhaustive data retrieval and/or when the sequencing of attestations in the corpus matters for the research question at hand, one would only delete concordance lines that contain false hits. Another typical reason for deleting certain concordance lines is that, depending on your search term (as well as the functionality of the software you are using and the make-up of the corpus data), the resulting concordance may contain a sizeable share of false hits. Imagine, for example, that you have an (untagged) corpus from which you want to retrieve all mentions of former and current Presidents of the United States.
If you (can only) enter the simple search terms Bush, Clinton, Obama, and Trump, chances are that you will retrieve a number of hits that do not refer to the Presidents, but to other people by the same name, or that do not refer to people at all, but instead, say, to plants (Steffi is hiding behind the bush) or card games (Paul played his trump card). Similarly, you may want to blank out either the search term or the collocates surrounding the search term to create a fill-in-the-blank exercise for teaching purposes. There are countless applications of this; to give but one example, imagine you wanted to teach non-native speakers of English the difference between near-synonymous words such as big, large, and great. You could create a worksheet in a matter of minutes by creating a concordance of these three adjectives, deleting any false hits or hits you deem too confusing or complex for your students, blanking out the search terms themselves, and printing out the formatted concordance (Stevens 1991).

Concordance lines form the basis for both qualitative and quantitative analysis. We define the term quantitative analysis here to refer to analyses that focus either exclusively, primarily, or in the initial stages of analysis, on the distributional patterns and global statistical trends underlying a given phenomenon, while we define the term qualitative analysis to refer to analyses that focus either exclusively, primarily, or in the initial stages of analysis, on in-depth scrutiny of individual attestations of a phenomenon. As we see it, the distinction between the two kinds of analysis is more a matter of degree than a categorical choice, and which form of analysis dominates in a research project depends on the phenomenon



under investigation and the researcher’s goals. Ideally, both forms of analysis are employed to provide complementary evidence: a qualitative analysis may be very thorough, yet leave open questions of generalizability and robustness that can be addressed in a more quantitative study. Conversely, even the most sophisticated quantitative analysis typically entails item-by-item annotation of each data point, which equates to a qualitative analysis of sorts; likewise, the results of a quantitative analysis demand interpretation, which typically requires a qualitative approach. In any case, concordance lines serve to provide the lexical and/or grammatical context of a search term and thus can be needed at all stages of an analysis, from data coding to interpretation.
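The fill-in-the-blank worksheets mentioned earlier (Stevens 1991) can be produced mechanically once concordance lines are available. A minimal sketch, assuming lines are stored as (left, node, right) triples; the `gap_fill` helper is illustrative, not part of any concordancing tool:

```python
def gap_fill(conc_lines):
    """Blank out the node word in each (left, node, right) triple,
    returning worksheet sentences plus the answer key."""
    sentences, answers = [], []
    for left, node, right in conc_lines:
        sentences.append(f"{left} ______ {right}".strip())
        answers.append(node)
    return sentences, answers

conc = [
    ("she made a", "big",   "mistake"),
    ("to a",       "large", "extent"),
    ("it was a",   "great", "success"),
]
worksheet, key = gap_fill(conc)
print("\n".join(worksheet))
```

In practice one would first prune false hits and overly complex lines by hand, as described above, before blanking the node words.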

8.2.2 Qualitative Analysis of Concordance Lines In order to show the benefit of a qualitative analysis of concordance lines, we stick with the subject of constructions of refugees in the BNC. One way that concordance lines can be used fruitfully is to identify semantic prosody (Louw 1993; Sinclair 2004; Stubbs 1995, 2001; Partington 1998, 2004). The term has a range of overlapping but slightly different conceptualizations, but Louw (1993: 159) refers to it as “a consistent aura of meaning with which a form is imbued by its collocates”. One of the most famous examples of a semantic prosody is Sinclair’s (1991: 112) observation that “the verb happen is associated with unpleasant things– accidents and the like”. A semantic prosody can be identified by simply classifying the collocates of a word, either by hand, as Sinclair did, or by using statistical measures to identify them (see Chap. 7), then noting whether their meanings are largely positive or negative. While an approach which considers collocates is useful in obtaining a general sense of a word’s semantic prosody, we would argue that such an approach works best when complemented by concordance analyses which may identify more nuanced uses. For example, Sinclair (2004: 33–34) argues that the term naked eye has a semantic prosody for ‘difficulty’, based upon the identification of concordance lines containing phrases like too faint to be seen with the naked eye. With many corpus tools, a concordance analysis can be used as a starting point for further forms of analysis, such as the identification of n-grams (see Chap. 4) which contain the search term or the collocates of the search term. For example, let us consider some of the collocates of the term refugee(s) in the British National Corpus. Figure 8.4 shows some concordance lines of refugee(s)–selected from the corpus to illustrate the word’s semantic prosody. First, using the Mutual Information statistic and a span of 5 words (see Chap. 
7), we find strong noun collocates like influx, flood and flow, as well as the verb collocate swollen (see the top half of Fig. 8.4). While the collocational analysis only identified a small number of words relating to water, an examination of all the concordance lines of refugee(s) reveals less frequent






Fig. 8.4 Water metaphors used to represent refugees [the aligned KWIC display cannot be reproduced in this text-only version]

cases which are shown in the bottom half of Fig. 8.4 (waves, streaming, streams, poured, outflow, eddy). The examples of concordance lines in Fig. 8.4 indicate a negative semantic prosody of refugees as out-of-control natural disasters, using water metaphors. The qualitative concordance analysis is also helpful in indicating that not all of the co-occurrences of water-related words are used in metaphorical ways. For example, the first concordance line in Fig. 8.4 uses flood in a literal sense to refer to people who have become refugees due to a flood. Such concordance lines are important in showing the danger of interpretative positivism (Simpson 1993: 113) where “a particular linguistic feature, irrespective of the context of its use, will always generate a particular meaning”. The flood refugees case in line 1 was the only such example where a water-related collocate was not used in a metaphorical sense with refugees, so it does not negate the original finding, but it does mean we should adjust our frequencies to take it into account. Let us consider another, even more important example of why a qualitative concordance analysis is important. Looking again at the collocates of refugee(s), the word bogus is the 10th strongest adjectival collocate of the term, occurring 14 times. Another collocate, genuine, appears with the search term 41 times. This entails a contrast between genuine and refugee(s) that calls into question the veracity of people who are identified as refugees. Figure 8.3 gives some of the examples of bogus collocating with the search term. A qualitative analysis of these lines (and others like it) allows us to identify how bogus is used to construct refugees. In the BNC, it always occurs as a modifier of refugees, rather than say, refugees referring to someone or something else as bogus. Additionally, we can look at verbs and verb phrases to identify what bogus refugees are doing and what is being done to



them. For example, five concordance lines refer to curbing or cracking down on bogus refugees, particularly with reference to a new Act or Asylum Bill, which is designed to do this. Another example refers to a former Labour party leader (Neil) Kinnock who is claimed to say that he will not curb [the] flood of bogus refugees. These concordances could be viewed as contributing towards a negative semantic prosody because they imply that someone or something wishes to curb/crack down on refugees. If we look at what bogus refugees are described as doing, we see phrases like 80% cheat their way into Britain and the good life, will grab state handouts and bleed Britain of £100 million. Again, this implies a negative semantic prosody as the examples describe bogus refugees as benefiting financially from the British state. The last example contains a vivid metaphor that brings to mind bogus refugees as leeches or vampires, bleeding Britain. At this point it would be reasonable to conclude that one way that refugees are constructed in the BNC is negatively, as ‘bogus refugees’ who require regulation to stop them from illegally obtaining money from the British government. The fact that the terms bogus and genuine appear as collocates of refugee(s) in the corpus suggests that this occurs quite frequently. However, there is a danger of presenting an over-simplified analysis if we stop here. It is often wise to look at expanded concordance lines before making a strong claim, in order to consider more context. Take for example the line “SCANDAL OF THE BOGUS REFUGEES–80% cheat their way into Britain and the good life”. The use of quotes at the start and end of this line perhaps indicates an intertextual use of language (where a text references another text in some way), and it is worth expanding the line to see whether this is occurring, and if so why. 
A fuller example of this is below:

(1) Intending to increase sensitivity to the supposed threat, the right-wing tabloids have been regaling the public with anti-refugee stories during 1991 and 1992. For example, the Sunday Express of 20 October 1991 headlined its front-page lead article: ‘SCANDAL OF THE BOGUS REFUGEES — 80% cheat their way into Britain and the good life’.

Consulting the header information from this particular file, we see that it is from an article in the New Statesman. Importantly, this article references constructions of refugees in ‘right-wing tabloids’ like the Sunday Express by quoting from them. Reading the whole article, the New Statesman’s tone is critical of such constructions: the article is titled ‘Racist revival’. A similar case is found in the ninth line in Fig. 8.3: a headline in The Times today, bogus refugees bleed Britain of £100 million through. This text is from a transcript of a political debate in parliament. More context is shown below:

(2) Does my right hon. Friend agree that the opportunity for this country to help support genuine refugees abroad through various aid programmes is not helped by the fact that, according to a headline in The Times today, bogus refugees bleed Britain of £100 million through benefit fraud? Has he seen the comments of a DSS officer in the same article that benefit fraud is now a national sport and that bogus asylum seekers think that the way in which this country hands out so much money is hilarious?



Again, the example of bogus refugees is cited in this text in order to be critical of it, arguing that such representations do not help to support genuine refugees, although the speaker still makes a distinction between genuine and bogus refugees. However, the speaker does appear to be critical of the Times’s reference to bogus refugees, and so these two examples indicate that not every mention of a bogus refugee should be seen as uncritically contributing towards the negative semantic prosody. Had our analysis involved a close examination of a small number of full texts, this point would have quickly been obvious. However, due to the nature of a concordance analysis (the number of lines the analyst is presented with, and the fact that each contains only a few words of context on either side of the search term), it is possible that these more nuanced cases may be overlooked. Before making claims, it is important to consider a sample of expanded concordance lines and to maintain vigilance in spotting lines that may be functioning differently than they appear at first glance. In Fig. 8.3, it is the use of quote marks, along with mentions of other texts, which indicates that something is being referenced, e.g. a headline in the Times or a front-page lead article. It is common for someone to voice an oppositional position by first quoting the opinion they disagree with, and a good concordance analysis will take this into account.
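The collocational starting point used in this section, Mutual Information over a five-word span, can be sketched in a few lines. The formulation below, MI = log2(O/E) with E derived from the node and collocate frequencies and the window size, is one common version of the statistic (see Chap. 7 for details and alternatives); the `mi_collocates` helper and the toy token list are illustrative only:

```python
import math
from collections import Counter

def mi_collocates(tokens, node, span=5, min_freq=2):
    """Pointwise Mutual Information, MI = log2(O/E), for words
    co-occurring with `node` within +/- `span` tokens.  O is the
    observed co-occurrence count; E = f(node) * f(w) * 2*span / N
    is the count expected if node and w were independent."""
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            cooc.update(tokens[max(0, i - span):i])   # left window
            cooc.update(tokens[i + 1:i + 1 + span])   # right window
    scores = {}
    for w, observed in cooc.items():
        if observed < min_freq:
            continue
        expected = freq[node] * freq[w] * 2 * span / n
        scores[w] = math.log2(observed / expected)
    return sorted(scores.items(), key=lambda kv: -kv[1])

tokens = ("the flood of refugees grew . another flood of refugees "
          "came . the train arrived . the train left .").split()
top = mi_collocates(tokens, "refugees", span=2)
```

As discussed above, such a collocate list is only a starting point: each high-scoring collocate still needs to be checked in its concordance lines for literal versus metaphorical, and critical versus uncritical, uses.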

8.2.3 Quantitative Analysis of Concordance Lines Many quantitative corpus analyses are based on concordance data (though not necessarily all: one could think of, for example, a study that is based on frequency or collocation lists instead, see Chaps. 4–7). This is particularly true for multifactorial studies, that is, studies that try to explain a linguistic phenomenon with recourse to not one, but several variables (see Chaps. 21–26). These variables have to be identified and coded for, and depending on the phenomenon in question, that may require the researcher to carefully examine the context around a given search term (one may consider that step a qualitative analysis). Let us consider one such example here, namely that-complementation in English. Speakers can choose to realize or omit the complementizer that in object, subject, and adjectival complement constructions as in (3): (3) a. I thought (that) Steffi likes candy. b. The problem is (that) Steffi doesn’t like candy. c. I’m glad (that) Paul likes candy. The variables that govern speakers’ choices to realize or omit the complementizer have been studied intensively (e.g. Jaeger 2010; Wulff et al. 2014). In a recent study, Wulff et al. (2018) explored if and to what extent non-native speakers of English make similar choices to native speakers of English. To that end, they compiled corpus data from native and non-native speaker corpora and ran concordances on



each to retrieve all instances of complementation attested in these corpora. Since the goal was to present a unified analysis that draws together the data from these different corpora and thus allows statistical evaluation of possible contrasts between native and non-native speakers, the first step was to append all concordance lines into one big master spreadsheet (also called a “raw data sheet”). Most concordancers let you save concordances in a format that makes it easy to copy them into a spreadsheet with the left-hand context, the search term, and the right-hand context in separate columns. Not only is that more convenient for visual inspection, as we saw in the examples above; it also makes subsequent coding of the data much easier. We will discuss an example of that below. In a second step, after false hits had been pruned from the raw data sheet, the remaining true hits were coded for different variables known to impact native speakers’ choices. The result was a spreadsheet with one row for each concordance line and one column for each variable considered for analysis. A snippet of that spreadsheet is shown in Fig. 8.5 (we pruned the left- and right-hand context to fit the page). The variables coded for included, among others, the absence or presence of the complementizer (that: yes or no); the type of complementation construction (obj, subj, or adj); the first language of the speaker who had produced the hit (English (GB), German (GER), or Spanish (SP)); the mode the hit came from (s(poken) or w(ritten)); the length of the main clause subject (number of letters); and many others. The conversion of the concordance lines into a spreadsheet like in Fig. 8.5 is helpful in various ways. For one, it facilitates the coding process because you can use the filter and sorting functions that all spreadsheet software includes.
For example, you could sort the entire table by one construction type and look only at, say, object complementation, which will speed up the coding. Alternatively, you could apply a filter to the column containing the right-hand context and opt to see only instances that begin with that, which in turn allows you to reliably identify, and code accordingly, all hits that contain a complementizer in the that-column. A second advantage is that most statistics software requires you to input your data in a tabular format anyway, so you are working with the data in a format that makes loading them into, say, SPSS or R much more convenient (see Chap. 17). A third advantage is that once you are presented with the results of a statistical analysis, you can easily find relevant examples from your data by again applying filters. We do not have enough space to discuss the statistical evaluation and detailed results of this study here; suffice it to say that applying a multiple regression approach to the entire data sample of over 9,000 attestations, the authors found that intermediate-advanced German and Spanish learners rely on the same factors as native speakers do when they choose to realize or omit that. The main difference between the two speaker groups is that learners are overall more reluctant to omit the complementizer, especially if complexity-related variables increase the cognitive cost associated with processing the sentence (for instance, if the complement clause is quite long). For more information regarding regression-type approaches, see Chaps. 21–23.
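The conversion from concordance lines to a case-by-variable table can be sketched as follows. This is a deliberately crude illustration, not the authors' actual coding scheme: the `code_row` helper, its complementizer check, and its subject-length proxy (character length of the first word of the left context) are stand-ins for the careful manual and semi-automatic coding the study used:

```python
import csv
import io

def code_row(left, verb, right, mode):
    """Code one concordance hit for some of the variables discussed
    above (complementizer presence, mode, subject length)."""
    return {
        "left": left,
        "verb": verb,
        "right": right,
        "that": "yes" if right.split()[:1] == ["that"] else "no",
        "mode": mode,
        "subj_len": len(left.split()[0]) if left.split() else 0,
    }

hits = [
    ("I", "thought", "that Steffi likes candy", "s"),
    ("They", "said", "the allied bombardment", "w"),
]
rows = [code_row(*h) for h in hits]

# Write the case-by-variable table in a form any spreadsheet or
# statistics package (e.g. R, SPSS) can load.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
```

Keeping one row per concordance line and one column per variable is precisely the tabular format that regression-type analyses (Chaps. 21–23) expect as input.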

Fig. 8.5 Snippet of a raw data sheet in the case-by-variable format [the table cannot be reproduced in this text-only version]

8.2.4 Pedagogical Applications of Concordance Lines There are two ways in which concordances (or any other corpus-based output) can be used in the classroom: either the students generate concordances themselves in class, or the instructor provides materials that include concordance lines. There is a growing strand of research that explores the efficacy of so-called data-driven learning (DDL) approaches, which give students access to large amounts of authentic language data (Frankenberg-Garcia et al. 2011), and corpus-based materials naturally lend themselves to use in such an approach. Space does not permit a comprehensive review of that literature here; for a good point of departure, see Römer (2008), who provides a general overview of the use of corpora in language teaching. To give but a few examples of prominently referenced studies that specifically tested the usefulness of concordances, we refer the reader to Stevens (1991), who examined the efficacy of learners consulting corpus printouts. One group of students was asked to fill in a blank in single gapped sentences, while another group was instructed to fill in the missing words in a set of gapped concordances. Word retrieval seemed better in the latter condition. Another example is Cobb (1997), who tested the efficacy of concordance-based exercises for vocabulary acquisition compared to traditional vocabulary learning materials; the results of weekly quizzes of more than 100 students throughout one semester likewise suggested that concordance-based exercises were better for vocabulary retention than traditional materials. Boulton (2008, 2010) investigated the efficacy of paper-based concordance materials for vocabulary learning with 62 low-proficiency English learners.
After Boulton identified a set of fifteen common language issues from students’ written productions, one group of students was taught the target language features using concordance materials while another group received traditional materials retrieved from dictionary entries. The results of a post-test suggested that learners who had received the concordance-based materials outperformed those who had not on these target language features.



Finally, for a recent study that explored the effect of concordance-based teaching not only in terms of learners’ performance on a vocabulary test, but its potential effect on learner production, see Huang (2014). More recently, meta-analyses of DDL in language learning testify to the efficacy of this teaching approach (Boulton and Cobb 2017; Lee et al. 2018).

Representative Study 1 Baker, P., and McEnery, A. 2014. Find the doctors of death: the UK press and the issue of foreign doctors working in the NHS, a corpus-based approach. In The Discourse Reader, eds. Jaworski, A., Coupland, N., 465–80. Routledge, London. This study examined the ways that foreign doctors were represented in a 500,000 word corpus of British national newspaper articles (The Times, The Mail, The Guardian, The Express, The Sun, The Star, The Mirror, The Telegraph and The Independent) containing the term foreign, followed by doctor, doc, medic, GP, locum, or physician, or the plural of those words. The data was taken from 2000-2010 and collected using the online news database Nexis UK. Additionally, a 1.3 million word corpus of articles containing the word doctor, taken from the same newspapers and time period, was examined in order to determine if representations of foreign doctors were similar to representations of doctors more generally. Finally, a 1.2 million word corpus of articles containing the word foreign, foreigner or foreigners, taken from the same newspapers and time period, was examined in order to determine if representations of foreign doctors were similar to representations of foreigners more generally. The 1,180 concordance lines containing the search term foreign used to create the 500,000 word corpus were examined qualitatively in order to identify representations of foreign doctors based on the qualities and actions that were attributed to them. Concordance lines with similar representations were grouped into categories. Additional concordance searches were carried out in order to find related examples, e.g. the word killer was examined through concordances as it had been noted in the headlines of texts from a number of other concordance lines. Concordance analyses of the two other newspaper corpora were carried out in similar ways in order to compare representations of doctors, foreigners and foreign doctors.
The analysis found that 41% of the references to foreign doctors directly represented them as incompetent (particularly in terms of not being able to speak English), with a further 16% implying incompetence by calling for tighter regulation of them. There was frequent reference to foreign doctors who had accidentally killed patients, labelling them as killers. The concordance analysis also noted several contradictory

8 Analyzing Concordances


representations, including the view that foreign doctors were desperate to work in the UK and were ‘flooding’ into the country (similar to the water metaphor used to describe refugees), appearing alongside other articles which claimed that foreign doctors ‘ignore’ vacancies in the UK. As well as being regularly described as incompetent and bungling, foreign doctors were also characterized as ‘sorely needed in their own countries’ and the UK was seen as amoral for ‘stripping poorer countries of professionals’. Foreign doctors were thus negatively represented, no matter what position they were seen to take. Representations of doctors were different to those of foreign doctors, with few mentions of the need for tighter regulation and only a small number of references to incompetent doctors. A common phrase in this corpus was ‘see your doctor’, which implied that journalists placed trust in doctors (as long as they were not foreign). Representations of foreigners were largely concerned with political institutions like the foreign office, although a sample of 21 out of 100 concordance lines taken at random (using an online random number generator) showed negative constructions of foreigners involving stereotyping, implying they were taking up British resources or jobs, or controlling British interests. Overall, the analysis indicates that foreign doctors tend to be viewed negatively, as foreigners first and doctors second, with individual stories of negligence being generalized and used as evidence of a wider problem.

Representative Study 2

Gries, S.T., and Wulff, S. 2013. The genitive alternation in Chinese and German ESL learners: towards a multifactorial notion of context in learner corpus research. International Journal of Corpus Linguistics 18(3):327–356.

Gries and Wulff (2013) examined data obtained from the British sub-section of the International Corpus of English and the Chinese and German subsections of the International Corpus of Learner English in order to determine what factors govern learners' choice of either the s-genitive (as in the squirrel's nest) or the of-genitive (the nest of the squirrel), and how learners' choices align with those of native speakers. They annotated 2,986 attestations captured as concordance lines for 14 variables that had previously been shown to impact native speakers' choices, including the semantic relation encoded by the noun phrases, the morphological number marking on the noun phrases, their animacy, specificity, complexity, and, crucially, the L1 background of the learners, among others. The data sample was analyzed with a binary


S. Wulff and P. Baker

logistic regression (Chap. 21) in which the dependent variable was the choice of genitive (s vs. of) and the 14 variables were the predictor variables. The final model suggested that learners generally heed the same variables that native speakers consider in their choice of genitives. The most important variable across speaker groups was segment alternation: native and nonnative speakers alike strongly preferred to opt for the genitive variant that represented the more rigid alternation of consonants and vowels. Overall, the Chinese learners were much better aligned with the native speakers' choices than the German learners were, yet showed a stronger tendency to overuse the s-genitive across different contexts.
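As a minimal illustration of the kind of model used in the study (a binary logistic regression; see Chap. 21), the following Python sketch fits such a model by gradient descent. The data, the two predictors, and the function names are invented for illustration and are not taken from Gries and Wulff's study:

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit a binary logistic regression by plain gradient descent.
    X: list of feature vectors; y: list of 0/1 outcomes (1 = s-genitive)."""
    k, n = len(X[0]), len(X)
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1 / (1 + math.exp(-z)) - yi   # predicted probability minus outcome
            gb += err
            for j in range(k):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        for j in range(k):
            w[j] -= lr * gw[j] / n
    return w, b

def predict(w, b, x):
    """Predicted probability of the s-genitive for one annotated line."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1 / (1 + math.exp(-z))

# Toy annotation: [possessor_animate, possessor_singular];
# here animacy is made to determine the choice perfectly.
X = [[1, 1], [1, 0], [1, 1], [0, 1], [0, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]
w, b = fit_logistic(X, y)
print(predict(w, b, [1, 1]) > 0.5)  # True: animate possessor favours s-genitive
```

In a real study one would of course use an established statistics package and 14 predictors rather than two; the sketch only shows the shape of the model.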

8.3 Critical Assessment and Future Directions

What are the limitations and drawbacks of concordance analysis? For one, it can be time-consuming, particularly if we are using a large corpus or searching for a frequent item. This is a valid concern especially in the contexts of using concordances in the classroom, or for self-study. Secondly, human error and cognitive bias can creep in, meaning that we may over-focus on patterns that are easy to spot, such as a single word appearing frequently at the L1 position, while more variable patterns may go unnoticed. It can be mentally tiring to examine hundreds or thousands of lines, so there is a danger that what we notice first may receive the most attention (which stresses the importance of trying different sorting patterns to yield different perspectives on the data). One option would be to use multiple analysts to carry out coding of concordance lines, with an attendant measure of inter-coder agreement (Hallgren 2012), a practice which is likely to help identify and resolve inconsistencies and coding errors.

This chapter has also discussed the practice of pruning concordance lines, and some tools do allow for a concordance to be randomly 'thinned', giving a smaller sample to work with. However, there is no agreement on what an ideal sample size should be, and it is probably the case that different sample sizes are appropriate for different-sized corpora, different types of corpora, and different search terms. For example, if we want to examine a word that has two main meanings, say a literal and a metaphorical one (such as lamely), a sample of 50 lines would probably be adequate in helping us to see that the metaphorical use is much more typical. However, if we are interested in a word such as like, which has multiple functions, 50 lines may not be enough to ascertain the range of meanings and which ones are more typical.
A good rule of thumb is to start with a reasonably low number (say between 20 and 100), carry out a concordance analysis of a sample of that size, noting the patterns and frequencies of different uses of an item, then take a second random set of concordance lines, of the same size, and redo the analysis. If the
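This two-sample stability check can be sketched in a few lines of Python. The sketch is illustrative only: the function names, the 0.1 agreement tolerance, and the invented coding data are our own assumptions, not part of the chapter.

```python
import random
from collections import Counter

def category_proportions(sample):
    """Relative frequency of each coding category in a sample of coded lines."""
    counts = Counter(sample)
    total = len(sample)
    return {cat: n / total for cat, n in counts.items()}

def samples_agree(coded_lines, n=50, tolerance=0.1, seed=42):
    """Draw two independent random samples of n coded concordance lines and
    check whether every category's proportion differs by less than
    `tolerance` between the two samples."""
    rng = random.Random(seed)
    first = rng.sample(coded_lines, n)
    second = rng.sample(coded_lines, n)
    p1, p2 = category_proportions(first), category_proportions(second)
    cats = set(p1) | set(p2)
    return all(abs(p1.get(c, 0) - p2.get(c, 0)) < tolerance for c in cats)

# Hypothetical coding of 500 lines: 80% literal vs 20% metaphorical uses
codes = ["literal"] * 400 + ["metaphorical"] * 100
print(samples_agree(codes, n=50))
```

If the two samples disagree, the rule of thumb above says to double n and repeat; in code that is just a loop over increasing sample sizes.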



findings from both sets are similar, then your sample size is most likely adequate. If not, double the sample size and repeat the exercise.

Where is concordance analysis headed? To date, most concordancing research has been carried out on corpora of plain text. However, with moves towards multimodal corpora, it is possible to combine concordancing with analysis of sound or visuals (see Chaps. 11 and 16). For example, a sizeable proportion of the 10 million word spoken section of the British National Corpus has been aligned with the original speech files, so when concordance searches are carried out using the online tool BNCweb, it is possible to select a line and hear the speech associated with it. WordSmith 7 also allows the analysis of corpora which contain tags that link to multimedia (audio or video) files.

Work which combines concordance analysis with image data is still in its early stages and, in the absence of adequate tools, can require painstaking hand-analysis. McGlashan (2016) carried out a concordance analysis of text from a corpus of children's books, finding that a common lexical item was the three-word cluster love each other. By comparing instances of this item with the images that occurred alongside it, he found that it was often accompanied by pictures of family members hugging each other. Therefore, the meaning of love each other involved the representation of a tactile component which was only found in the images. McGlashan coined the term collustration to refer to the saliently frequent co-occurrence of features in multiple semiotic modes, and his concordance tables included numbered lines that referred to a grid of corresponding images shown underneath each table.

In summary, concordance analysis is one aspect of corpus linguistics that sets it apart from other computational and statistical forms of linguistic analysis.
It ensures that interpretations are grounded in a systematic appraisal of a linguistic item’s typical and atypical uses, and it guards against interpretative positivism. The inspection of dozens of alphabetically sorted concordance lines enables patterns to emerge from a corpus that an analyst would be less likely to find from simply reading whole texts or scanning word lists. By bridging quantitative and qualitative perspectives on language data, concordance analysis is and will remain a centerpiece of corpus-linguistic methodology.

8.4 Tools and Resources

Table 8.1 provides an overview of the most widely used concordance software, their platform restrictions (if any), pricing, and associated web links. Each concordance tool is slightly different:

• some are tailored more towards research, others were designed primarily with classroom use in mind;
• some can only query corpus files that contain data in Latin alphabet format, while others are Unicode-compatible, i.e. can accommodate any writing system;



Table 8.1 (Software including) concordance tools (information accurate at the time of writing; pricing for single-user licenses)

aConCorde (macOS, free): http://www.andy-roberts.net/coding/aconcorde. Accessed 8 July 2019.
AntConc: http://www.laurenceanthony.net/software/antconc/. Accessed 8 July 2019.
CasualConc: https://sites.google.com/site/casualconc/Home. Accessed 8 July 2019.
Conc: http://software.sil.org/conc/. Accessed 8 July 2019.
Concorder: http://www.concorder.ca/index_en.html. Accessed 8 July 2019.
#LancsBox: http://corpora.lancs.ac.uk/lancsbox/. Accessed 8 July 2019.
MonoConc Pro: http://www.athel.com/mono.html. Accessed 8 July 2019.
Open Corpus Workbench: http://cwb.sourceforge.net/index.php. Accessed 8 July 2019.
Simple Concordance Program: http://www.textworld.com/scp/. Accessed 8 July 2019.
TextSTAT: http://neon.niederlandistik.fu-berlin.de/en/textstat/. Accessed 8 July 2019.
WordSmith Tools: http://www.lexically.net/wordsmith/. Accessed 8 July 2019.

• some can handle regular expressions (cf. Chap. 9), while others only allow simple searches;
• some tools are simple concordancers, others include many other functions, such as generating frequency lists, collocate and n-gram lists, and visualization tools, to name but a few.

The list above is not comprehensive in at least three ways. Firstly, the tools listed are all for offline use; there are also web-based concordancers, such as the Sketch Engine, that either allow access to specific corpora or let the user upload data for online examination.1 Secondly, we only list monolingual concordancers, i.e. tools that let the user examine text from one corpus representing one language. There are also multilingual concordancers specifically designed to query parallel corpora, i.e. corpora that contain texts in multiple languages together with their direct translations (see Chap. 12). Furthermore, it is worth pointing out that in research, there is a growing trend away from ready-made concordance tools and towards writing and adapting

1 For a list of web-based concordancers (and many other corpus-linguistic resources), see http://martinweisser.org/corpora_site/CBLLinks.html. Accessed 31 May 2019.



scripts written in programming environments like Python or R (e.g. Gries 2016). The rationale for many scholars is that this allows them to retrieve, annotate, statistically evaluate, and visually display their language data in one environment; it gives the user maximum control over each analytical step; and it facilitates the free sharing of data and scripts among the scientific community (see Chap. 9). Ultimately, the choice of concordancer is a matter of personal preference, and we encourage the reader to find their own favorite.
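As a minimal illustration of the kind of lightweight script this paragraph describes, the following Python sketch implements a bare-bones keyword-in-context (KWIC) concordancer. The function name, the context width, and the padding choices are our own illustrative assumptions, not an established tool:

```python
import re

def kwic(text, query, context=30):
    """Return keyword-in-context lines for every match of `query`
    (a regular expression) in `text`, with `context` characters of
    left and right co-text, padded so the node words line up."""
    lines = []
    for m in re.finditer(query, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - context):m.start()]
        right = text[m.end():m.end() + context]
        lines.append(f"{left:>{context}} [{m.group()}] {right:<{context}}")
    return lines

sample = "The foreign doctor saw the patient. A doctor was called."
for line in kwic(sample, r"\bdoctor\b"):
    print(line)
```

Extending such a script with sorting on the L1 or R1 position, random thinning, or frequency counts is a matter of a few more lines, which is precisely the appeal of scripting over ready-made tools.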

Further Reading

Hoffmann, S., Evert, S., Smith, N., Lee, D.Y.W., and Berglund, Y. 2008. Corpus linguistics with "BNCweb" – a practical guide. Peter Lang, Bern.

In our chapter, we only provide examples of simple searches, that is, searches that involve a specific whole word or phrase. Sometimes, however, we do not know in advance what specific words or phrases we are looking for. For example, what if we want to create a concordance of all the adjectives in a corpus that is annotated for parts of speech? It would be tedious to enter each adjective individually (and we would likely miss a number of adjectives that do occur but that we failed to think of). If we can instead write a query that finds all adjectives by their tags, we can find all of them with just one search. A second example of a more complex search could be: what if we are interested in finding all words that end in the morpheme -licious without knowing what they are? In this case, we need a query that specifies only the final part of the word, but leaves open what the beginning of the word looks like. In all of these and many other cases, it can be useful to resort to more complex queries that involve what are called regular expressions. For more information on regular expressions, see Chap. 9 in this volume. A good introduction can also be found in Hoffmann et al. (2008), who provide many examples of how to combine regular expressions with corpus annotation such as part-of-speech tags, lemmatization, etc. While Hoffmann et al. (2008) focus specifically on the query syntax associated with the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench as it can be used for complex searches of the BNCweb corpus, it is a great way to get started with complex searches for several reasons: access to the BNCweb is free, and the Corpus Workbench can be used on many other corpora provided they meet certain requirements.
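The two example searches just described can be approximated outside CQP with ordinary regular expressions, for instance in Python. The text fragments below are invented, and the word_TAG format with AJ0 for general adjectives is an assumed BNC C5-style annotation scheme:

```python
import re

# A plain-text search for any word ending in -licious, whatever its beginning
text = "The cake was utterly delicious, almost bootylicious even."
licious = re.findall(r"\b\w*licious\b", text)
print(licious)  # ['delicious', 'bootylicious']

# A tag-based search in a hypothetical word_TAG annotated fragment:
# every adjective is found by its AJ0 tag rather than by listing adjectives
tagged = "a_AT0 truly_AV0 delicious_AJ0 and_CJC memorable_AJ0 meal_NN1"
adjectives = re.findall(r"([A-Za-z]+)_AJ0", tagged)
print(adjectives)  # ['delicious', 'memorable']
```

CQP expresses the same queries over its own positional attributes (word forms, tags, lemmas) rather than over raw strings, but the underlying regular-expression logic is the same.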
That aside, once you understand the basic reasoning behind a corpus query syntax such as the one implemented in CQP, it is relatively easy to work with different implementations of it.

Partington, A. 1998. Patterns and meanings: using corpora for English language research and teaching. John Benjamins, Amsterdam and New York.

Partington presents a series of case studies that illustrate how corpus methods can shed light on diverse areas like synonymy, cohesion, and idioms; analysis of concordances plays a major role throughout.



Sinclair, J. 1991. Corpus, concordance, collocation. Oxford University Press, Oxford.

An early introduction to corpus linguistics written for students in language education.

Stubbs, M. 2001. Words and phrases: corpus studies of lexical semantics. Blackwell, Malden MA.

Stubbs outlines how the meanings of words depend on their contexts, and how the connotations of words arise from their recurring embedding in larger phrases.

References

Baker, P. (2006). Using corpora in discourse analysis. London/New York: Continuum.
Baker, P., & McEnery, A. (2014). Find the doctors of death: the UK press and the issue of foreign doctors working in the NHS, a corpus-based approach. In A. Jaworski & N. Coupland (Eds.), The Discourse Reader (pp. 465–480). London: Routledge.
Boulton, A. (2008). DDL: reaching the parts other teaching can't reach? In A. Frankenberg-Garcia (Ed.), Proceedings of the 8th Teaching and Language Corpora conference (pp. 38–44). Lisbon: Associação de Estudos e de Investigação Cientifíca do ISLA-Lisboa.
Boulton, A. (2010). Data-driven learning: taking the computer out of the equation. Language Learning, 60(3), 534–572.
Boulton, A., & Cobb, T. (2017). Corpus use in language learning: a meta-analysis. Language Learning, 67(2), 348–393.
Cobb, T. (1997). Is there any measurable learning from hands on concordancing? System, 25(3), 301–315.
Frankenberg-Garcia, A., Aston, G., & Flowerdew, L. (Eds.). (2011). New trends in corpora and language learning. New York: Bloomsbury.
Gries, S. T. (2016). Quantitative corpus linguistics with R: a practical introduction (2nd ed.). London/New York: Routledge.
Gries, S. T., & Wulff, S. (2013). The genitive alternation in Chinese and German ESL learners: towards a multifactorial notion of context in learner corpus research. International Journal of Corpus Linguistics, 18(3), 327–356.
Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: an overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23–34.
Huang, Z. (2014). The effects of paper-based DDL on the acquisition of lexico-grammatical patterns in L2 writing. ReCALL, 26(2), 163–183.
Jaeger, T. F. (2010). Redundancy and reduction: speakers manage syntactic information density. Cognitive Psychology, 61, 23–62.
Lee, H., Warschauer, M., & Lee, J. H. (2018). The effects of corpus use on second language vocabulary learning: a multilevel meta-analysis. Applied Linguistics, published online first.
Louw, W. E. (1993). Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology: in honour of John Sinclair (pp. 157–176). Amsterdam: John Benjamins.
McGlashan, M. (2016). The representation of same-sex parents in children's picturebooks: A corpus-assisted multimodal critical discourse analysis. Dissertation, Lancaster University.
Partington, A. (1998). Patterns and meanings: using corpora for English language research and teaching. Amsterdam/Philadelphia: John Benjamins.



Partington, A. (2004). "Utterly content in each other's company": semantic prosody and semantic preference. International Journal of Corpus Linguistics, 9(1), 131–136.
Römer, U. (2008). Corpora and language teaching. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: an international handbook (Vol. 1, pp. 112–130). Berlin: Mouton de Gruyter.
Simpson, P. (1993). Language, ideology and point of view. London: Routledge.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J. (2004). Trust the text: language, corpus and discourse. London: Routledge.
Stevens, V. (1991). Concordance-based vocabulary exercises: a viable alternative to gap-fillers. English Language Research Journal, 4, 47–63.
Stubbs, M. (1995). Collocations and semantic profiles: on the cause of the trouble with quantitative studies. Functions of Language, 2(1), 23–55.
Stubbs, M. (2001). Words and phrases: corpus studies of lexical semantics. Oxford: Blackwell.
Wulff, S., Gries, S. T., & Lester, N. A. (2018). Optional that in complementation by German and Spanish learners. In A. Tyler & C. Moder (Eds.), What is applied cognitive linguistics? Answers from current SLA research (pp. 99–120). New York: De Gruyter Mouton.
Wulff, S., Lester, N. A., & Martinez-Garcia, M. (2014). That-variation in German and Spanish L2 English. Language and Cognition, 6, 271–299.

Chapter 9

Programming for Corpus Linguistics

Laurence Anthony

Abstract This chapter discusses the important role of programming in corpus linguistics. The chapter opens with a history of programming in the field of corpus linguistics and presents various reasons why corpus linguists have tended to avoid programming. It then offers some strong counter arguments for why an understanding of the basic concepts of programming is essential to any corpus researcher hoping to do cutting-edge work. Next, the chapter explains the basic building blocks of all software programs, and then provides a number of criteria that can be used to assess the suitability of a programming language for a particular corpus linguistics project. To illustrate how the building blocks of programming are used in practice, two case studies are presented. In the first case study, basic programming concepts are applied in the development of simple programs that can load, clean, and process large batches of corpus files. In the second case study, these same concepts are applied in the development of a more advanced program that can replicate and, in some ways, improve on the tools and functions found in ready-built corpus tools. The chapter finishes with a critical assessment and discussion of future developments in corpus linguistics programming.

9.1 Introduction

Computer programming has played a key role in corpus linguistics since the field's growth in the 1960s. Early researchers did not have access to any existing corpus-analysis software. As a result, they had to build simple tools from scratch in programming languages such as Fortran and COBOL. This work led to the creation

Electronic Supplementary Material The online version of this chapter (https://doi.org/10.1007/ 978-3-030-46216-1_9) contains supplementary material, which is available to authorized users. L. Anthony () Faculty of Science and Engineering, Waseda University, Tokyo, Japan e-mail: [email protected] © Springer Nature Switzerland AG 2020 M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_9



L. Anthony

of concordancers, such as those of Clark (1966), Dearing (1966), Price (1966), and Smith (1966). Some researchers even created new programming languages for this work, such as ATOL, which was used to develop the CLOC collocation analysis tool for the COBUILD project (Reed 1978; Sinclair et al. 2004; Moon 2007).

From the 1980s onwards, however, the importance of programming in corpus linguistics work diminished somewhat with the development of ready-built, user-friendly tools that could run on corpus linguists' own personal computers. Examples of software from this time include Micro-concord (Johns 1986), the Oxford Concordance Program (Hockey and Martin 1987), the Longman Mini-Concordancer (Chandler 1989), and the Kaye concordancer (Kaye 1990). As these tools evolved and became even more powerful and easy to use, the need for programming within the corpus linguistics community became even less clear.

Today, it is still the case that many corpus linguists see little need to develop their programming skills. Perhaps the main reason for this is the ready availability of freeware toolkits such as AntConc (Anthony 2019) and commercial tools such as WordSmith Tools (Scott 2020). Ready-built tools allow corpus linguists to carry out much, if not all, of what they need to do with their own custom-designed corpora or with directly accessible ready-built corpora, such as the British National Corpus (BNC) (Burnard 2000), the British Academic Spoken English (BASE) Corpus (Thompson and Nesi 2001), the British Academic Written English (BAWE) Corpus (Nesi et al. 2004), and the Michigan Corpus of Academic Spoken English (MICASE) (Simpson et al. 2002). Indeed, a review of the literature published in three mainstream corpus linguistics journals, i.e., Corpora, Corpus Linguistics and Linguistic Theory, and the International Journal of Corpus Linguistics, suggests that the majority of corpus linguists use corpus tools in relatively simple ways to complete a minimal number of analytical tasks.
On the other hand, when corpus linguists are asked what they would want to do with corpora (ignoring what is possible with existing ready-built tools), it is clear that a much wider range of functions is desired (see Anthony 2014). Some of these include:

• the automatic creation of clean, annotated corpora
• the comparison of two or more texts across multiple lexical/grammatical/rhetorical dimensions
• the batch counting of words, phrases, and n-grams (lexical bundles) to give per-text frequency and dispersion information
• the calculation of lexical diversity scores for each file in a learner corpus
• the measurement of distance between priming words and target words
• the identification and counting of collocates of certain target word types in target and reference corpora
• the extraction of complex meaning units such as definitions
• the counting of interesting lexical phenomena, such as disfluencies, in a tagged corpus
• the automated analysis of the rhetorical structure of texts (see Swales 1990)



• the creation of novel and useful visualizations of text data
• the look-up of pronunciations of words found in a corpus in a pronunciation database such as CELEX2 (Baayen et al. 1995)

Another reason why many corpus linguists may see little need to learn programming is the growing preference to use web-based corpora. These corpora are often painstakingly created in teams that involve corpus linguists and software developers, and thus circumvent the need for the user to program their own data collection and cleaning procedures. The corpora are also usually positioned behind a 'wall' and can only be accessed indirectly through either purpose-built web-browser interfaces (e.g. the Michigan Corpus of Upper-Level Student Papers (MICUSP) (http://micusp.elicorpora.info/)) or online corpus management and query systems (e.g. corpus.byu.edu, CQPweb (https://cqpweb.lancs.ac.uk/), Sketch Engine (https://www.sketchengine.co.uk/), and WMatrix (http://ucrel.lancs.ac.uk/wmatrix/)). These systems not only remove the need for the user to develop their own query systems and/or analytical tools, but also greatly reduce (or completely prevent) a user from directly using custom programs with the data.

Corpus linguists may also choose not to learn to program for other, more mundane, reasons. One is that corpus linguists, especially those working in academic institutions, are likely to be pressed for time (see Tribble 2015). MA and PhD students, for example, have a deadline to meet when it comes to analyzing their data and writing up their findings in the form of a thesis. As a result, the advantages they gain by giving up some of this time to learn to program may not be immediately apparent. Another is that even if a corpus linguist has the time to learn to program, they may not know the best way to go about it or even know which programming language to attempt to learn.
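As a sketch of one item on the wish list above (the batch counting of words to give per-text frequency information), the following Python fragment counts word frequencies separately for every plain-text file in a folder. The function names and the deliberately crude tokenization are our own illustrative assumptions:

```python
import glob
import os
import re
from collections import Counter

def tokenize(text):
    """Crude tokenizer for illustration: lowercased alphabetic strings only."""
    return re.findall(r"[a-z]+", text.lower())

def per_text_frequencies(folder, pattern="*.txt"):
    """Count word frequencies separately for every matching file in `folder`,
    yielding the per-text counts from which frequency and dispersion
    figures can then be derived."""
    freqs = {}
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, encoding="utf-8") as fh:
            freqs[os.path.basename(path)] = Counter(tokenize(fh.read()))
    return freqs
```

Even a small script like this already does something most ready-built tools handle awkwardly: it keeps the counts per file, so range and dispersion statistics can be computed in the next step.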
Working directly with computer programmers in the creation and analysis of data is not so straightforward, either. In an academic setting, computer programmers are likely to be working in a different department and/or faculty, introducing problems of physical distance, faculty-culture differences, and meeting-scheduling issues. Corpus linguists may also struggle to communicate to computer programmers exactly what they want to do with the new software, as they will lack the vocabulary commonly used by software engineers and programmers when discussing development issues.

Although there are several reasons why a corpus linguist may not want to learn to program, there are also important reasons why knowledge of programming can benefit a corpus linguist greatly. Biber et al. (1998:254–256) explain that programming allows the corpus linguist to do analyses not possible with existing tools, do those analyses more quickly using a corpus of any size, and also tailor the output to meet their own needs. Similarly, Gries (2009:12) explains that the use of a programming language puts researchers "in the driving seat", enabling them to circumvent the limitations of existing tools in terms of availability, functionality, and user-control. Davies (2011) and Anthony (2009) also acknowledge the importance of programming in corpus linguistics. However, they are more cautious in their recommendations for who should learn to program. Davies suggests that



"corpus users" (researchers who are not involved in creating a corpus) might be able to "get by" with standalone tools and web-based corpora, whereas "corpus creators" (those who build new corpus resources) need at least some experience with programming. Anthony suggests that programming cannot be avoided in any cutting-edge corpus research, although this work may be carried out by a dedicated computer programmer who is part of the team. On the other hand, he recommends that corpus linguists in the team have some experience of programming as it will help when communicating design ideas and explaining potential problems to the programmer.

Continuing from the ideas expressed in Anthony (2009), this chapter is positioned on the side of the argument that corpus linguists do need some basic understanding of programming. Minimally, they need to understand the underlying concepts of programming, which will not only help them to recognize the limitations of the tools they use, but also allow them to work more closely with expert programmers in the development of new and improved tools if the chance arises. Going further, if they hope to become "corpus creators", as is the case with many MA and PhD students embarking on a corpus research project, they are likely to have to collect and process vast amounts of noisy, inconsistent, and poorly formatted data. In this case, a basic ability to program is clearly an advantage. To follow this path, however, they first need to understand the different types of programming languages available and the inherent strengths and weaknesses of each. With this knowledge they can then make an informed decision about which language is most suitable for their needs and begin their journey to learn it. The same applies to "corpus users" who hope to do more complex, multi-leveled investigations than those possible with the features and functions of ready-built tools.
Clearly, many of the leading corpus linguists in the field can be included in this category. Fortunately, as this chapter explains, learning basic programming is becoming an ever-easier task as new learning resources become available and new features and functions are added to programming languages that simplify their use.

9.2 Fundamentals

9.2.1 The Basic Building Blocks of Software Programs

Software programs are built using computer programming languages, i.e., simple, artificial languages that are designed to enable humans to issue instructions to computers in a non-ambiguous way. The instructions written in a computer language consist of statements formed from a vocabulary of concepts recognized by the computer (e.g. variable names, variable values, mathematical operators, string operators, data types, and so on) arranged in a particular order as defined by the syntax of the language. If the statements are correctly formed, they can be parsed by a compiler or interpreter built into the computer language and converted into



a machine language used internally by the computer. Complex instructions can be created by combining individual statements into larger blocks of code following structures that are also predefined by the computer language (e.g. flow structures, functions, methods, objects, and classes).

Computer languages can be considered members of particular language families if they share a similar vocabulary and syntax rules. The most popular languages in the world today are all part of a family of computer languages that evolved from the C language, which was developed towards the end of the 1960s. 'C-like' languages include C, C++, C#, Objective-C, Java, JavaScript, Perl, PHP, Python, and many more. Other well-known languages that have given rise to families of related languages include BASIC, Fortran, Pascal, Lisp, Prolog, Smalltalk, and S, the modern equivalent of which is R.

As with human languages, learning one computer language in a family can be hugely beneficial when learning other languages in that same family. This is one reason why many degree courses in computer science first focus on the C language (or a related 'C-like' language). However, all computer languages share many common vocabulary and syntax features, so learning any computer language will be hugely beneficial to a corpus linguist if and when it becomes necessary to learn another computer language later.

Some common features of computer languages are listed in Table 9.1, where it can be seen that computer languages usually work with a limited set of data types (e.g. Boolean, number, string, . . . ) and that manipulating these data types requires the use of special operator symbols. For example, in Python, the "+" operator is used to add numbers together, but it is also used to concatenate strings of letters together ("abc" + "def" => "abcdef").
Although all computer languages are similar in many ways, each is designed with certain features and idiosyncrasies that offer advantages (as well as disadvantages) over other languages in particular settings. For example, C is a low-level language, i.e., one that closely resembles the computer’s core instruction set. This feature allows it to run very quickly and efficiently, but also makes it quite difficult to read and understand. In contrast, JavaScript, Perl, Python and R are high-level languages (ones that more closely resemble normal human languages), making them easier to read and understand, but also making them slower and less efficient than C.

Table 9.1 Common features of computer languages

Data types:            Boolean, number, string, list, key/value pairs (often termed “dictionary” or “hash”)
Operators:             arithmetic (“+”, “−”, . . . ), comparison (“==”, “<”, “>”, . . . ), condition (“if”, “when”, . . . ), logical (e.g. “OR”, “NOT”, . . . ), string (“.”, “+”, “eq”, . . . )
Control and loop flow: if, while, for/foreach
Modularization:        functions or classes/objects or both
Input/output:          print, read, write
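The reuse of the “+” operator across data types can be checked directly in Python; the following is an illustrative sketch (all variable names are invented for the example):

```python
# Python reuses the "+" operator across several data types:
number_total = 1 + 2             # arithmetic addition
string_total = "abc" + "def"     # string concatenation
list_total = [1, 2] + [3, 4]     # list concatenation

# two other common data types from Table 9.1
flag = True                          # Boolean
frequencies = {"cat": 3, "dog": 1}   # key/value pairs ("dictionary")

print(number_total)   # 3
print(string_total)   # abcdef
print(list_total)     # [1, 2, 3, 4]
```

Because the operator’s meaning depends on the data types it is applied to, mixing types (e.g. `1 + "abc"`) raises an error in Python rather than silently converting.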


L. Anthony

Perhaps the most important feature that distinguishes one programming language from another is the way that it handles the grouping of statements into larger units. Here, there are two main approaches. One is a Functional Programming (FP) approach, where statements that are designed to instruct the computer to perform an action (e.g. sorting) are grouped into a self-contained “function” that does not rely on any external state (e.g. variable values). The other is an Object-Oriented Programming (OOP) approach, where the instructions to perform an action as well as the variables on which that action is performed are all grouped into a single “object class”.

The FP approach generally allows for the easy addition of new functions to a program, but keeping this growing set of functions organized and working well with existing and new variables can become confusing and error prone. The OOP approach, on the other hand, requires more thought when deciding which variables and actions (methods) should be included in objects. However, creating and organizing new objects is relatively simple, and because each object class is independent of the others, error checking is easier.

Computer languages are designed from the ground up to facilitate programming using one or both of these design approaches. C, JavaScript, Perl, and R are examples of languages designed for functional programming, whereas Java is an example of a language designed for object-oriented programming. Python is an example of a language that was designed to accommodate both programming approaches. In practice, current software engineering practices tend to favor an OOP approach, as it allows programs to scale well and lends itself more easily to the modularization of code that is developed in a team.
As a result, many functional programming languages, such as C, Perl, and R, have been adapted or extended to allow for some kind of object-oriented coding, although the designs that have been employed are not always elegant or efficient.
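The contrast between the two approaches can be sketched with a toy token-counting task in Python, which supports both styles (the function and class names here are illustrative, not part of any library):

```python
# Function-based style: a self-contained function that receives all the
# data it needs as parameters and returns a result.
def count_tokens(text):
    return len(text.split())

# Object-oriented style: the data (the text) and the actions performed
# on it (the methods) are grouped together in a single class.
class Text:
    def __init__(self, content):
        self.content = content

    def count_tokens(self):
        return len(self.content.split())

print(count_tokens("the cat sat on the mat"))         # 6 (function call)
print(Text("the cat sat on the mat").count_tokens())  # 6 (method call)
```

Both versions compute the same result; the difference lies in where the data lives and how the code can later be extended.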

9.2.2 Choosing a Suitable Language for Programming in Corpus Linguistics

All modern programming languages can be used to develop programs that will fulfill the needs of corpus linguists, whether they are MA students, PhD students or seasoned experts in the field. However, some languages are more suited to the tasks commonly carried out by corpus linguists, i.e., reading large text files into memory, processing text strings, counting tokens, calculating statistical relationships, formatting data outputs, and storing results for use with other tools or for later use. In order to decide which language is most suitable for a particular corpus linguistics programming task, there are at least four important factors to consider at the outset.

The first consideration is the design of the language. High-level, functional languages are well suited for writing short, simple, “one-off” programs for quick cleaning and processing of a corpus, e.g. renaming files, tokenizing a corpus,



cleaning noise from a particular corpus, and calculating a particular statistic for corpus data. For this reason, languages such as Perl, Python (which can run as a functional language), and R have been commonly used for corpus linguistics applications (e.g. Desagulier 2017; Gries 2016; Johnson 2008; Winter 2019). Python and R, in particular, have grown to be especially popular among corpus linguists due to their strong statistical and data visualization features. High-level object-oriented languages, on the other hand, are more suited for writing longer, more complex, “reusable” programs, as they can be more easily maintained and extended. As a result, corpus toolkits that combine and integrate common corpus functions into a complete package are more likely to be built in a language such as Python or Java. Popular examples are the majority of AntLab tools (Anthony 2019) and the Natural Language Toolkit (NLTK) (https://www.nltk.org/), which are written in Python, and the suite of natural language processing tools offered by the Stanford Natural Language Processing Group (https://nlp.stanford.edu/), which are written in Java. If a corpus linguist hopes to use or contribute to the development of these tools, knowledge of Python or Java would certainly be useful.

A second consideration is whether to use a language that is converted into machine language ‘on the fly’ during runtime (usually referred to as an ‘interpreted’ language) or one that is converted (or compiled) into machine language prior to runtime (usually referred to as a ‘compiled’ language). Interpreted languages, such as Perl, Python, and R, allow for programs to be “prototyped”, meaning that developers can create working programs with placeholders marking yet-to-be-written or unfinished code. They can also be used in a ‘live’, interactive way, with new lines of code written in response to the output generated by the interpreter.
This makes them particularly suited to many MA and PhD student projects, where development speed and flexibility are important factors. They are also useful for carrying out some of the ‘quick and dirty’ programming tasks that a corpus linguist might need to do in order to get their data into a form that can be analyzed. Compiled languages, such as C and Java, on the other hand, require a slower “write-compile-run” design. This process can reduce the number of bugs in the final code (as they can be detected at compile time) and allow the code to run faster than equivalent code written in an interpreted language. For these reasons, compiled languages are often chosen for very large projects, where speed or accuracy are required, such as the engine of the IMS Open Corpus Workbench (CWB) (http://cwb.sourceforge.net/) toolkit, which is written in C, and the tools developed by the Stanford Natural Language Processing Group mentioned earlier, which are developed in Java.

A third consideration is whether the programming language is designed primarily for creating web-based tools, traditional desktop tools, or mobile platform tools. Some languages, such as PHP and JavaScript, have many features designed specifically for the web, with CQPweb using PHP extensively on the server side, and SketchEngine using JavaScript heavily for its interface. In contrast, languages such as Java, Perl, Python, Ruby, and R have features tailored for both web and desktop environments, making them a common choice for many corpus tools. As an example, AntConc (Anthony 2019) is written in Perl and TagAnt (Anthony 2019) is written in Python. For mobile applications, programs are usually written either
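One of the ‘quick and dirty’ tasks mentioned above, renaming corpus files, can be sketched in a few lines of interpreted Python (the folder name and the lowercasing convention are purely illustrative assumptions):

```python
# A typical "one-off" cleaning script: give every .txt file in a corpus
# folder a lowercase file name. The folder name is illustrative.
from pathlib import Path

def lowercase_filenames(folder_name):
    for path in Path(folder_name).glob('*.txt'):
        # rename each file in place, lowercasing its name
        path.rename(path.with_name(path.name.lower()))

# lowercase_filenames('target_corpus')  # uncomment to run on a real folder
```

Such a script would rarely be reused as-is, which is exactly why an interpreted language that can be written and run in seconds is a good fit for this kind of job.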



natively for the operating system in Java (for Android systems) or Swift (for iOS systems) or developed to run smoothly in a mobile web browser using JavaScript on the client side and PHP, Python, or Java on the server side.

The final and perhaps most important consideration is the size and vibrancy of the community that supports a computer language. ‘Popular’ languages such as C, C++, PHP, Python, and R have large, active groups of core developers who continually develop and improve the code base. These languages also have a large group of active users who are willing to answer questions and offer advice on coding through forums such as StackOverflow (https://stackoverflow.com/). In fact, such is the popularity of the Python and R languages among the corpus linguistics community that there are dedicated groups serving this community and multiple sites that provide resources specifically for corpus researchers (see Sect. 9.5 for more details). In contrast, very new and/or niche languages (e.g. LiveCode (https://livecode.com/)), and some of the older languages with shrinking user groups (e.g. Perl), might have useful features for a particular project, but they may also lack the community support to maintain and update core packages and extension modules, or to answer questions about how to use the language in practical contexts. One common problem with very new languages, niche languages, and older languages is that they do not integrate well with other languages and might fail to adapt to the evolving nature of different operating systems. It follows that choosing a ‘popular’ language in the corpus linguistics community, such as Python or R (or a popular language designed specifically for web development, such as JavaScript), is a safe route to programming unless there is a specific reason to use one of the newer, older, or niche languages.

Representative Study 1

Egbert, J., and Biber, D. 2019. Incorporating text dispersion into keyword analyses. Corpora.

In this paper, the authors aim to determine which keyness measure best identifies words that are distinctive to the target domain(s) present in a corpus (cf. Chap. 6). To achieve this goal, they compare various keyness measures that are primarily based on the frequency of words in a corpus together with their own “text dispersion keyness” measure, which is based on the number of texts in which a word appears. In order to carry out the analysis, the authors use Python scripts to calculate a traditional keyness measure based on log-likelihood, as well as two variations, each with different minimum dispersion cut-off values. They also use a Python script to calculate their own “text dispersion keyness” measure and compare all four of these measures against the “Key keyword analysis” measure available in WordSmith Tools (Scott 2020).



In this project, the authors are working in a small team and are probably using one-off scripts that are unlikely to be extended further. Therefore, any scripting language would probably be sufficient for the purpose, with Python being an excellent choice. Using Python scripts as part of their analysis, the authors are able to quickly and accurately compare a range of possibilities for calculating keyness, including their own “text dispersion keyness” measure, which is not yet available in ready-built tools. The results from the study not only show the relative merits of using dispersion as part of a keyness measure, but also suggest an important extension that can be added to ready-built tools.

9.3 First Steps in Programming

The following case studies are designed to contextualize the previous discussion and show how programming languages can be used to carry out some of the most common and important tasks carried out by corpus linguists. Although the majority of examples presented in these case studies mimic the functionality of existing ready-built corpus tools, they will generally perform much faster, because they are designed to carry out a specific task and they remove the overhead of creating and updating an interface. It should also be noted that they can be easily extended or adapted to carry out tasks that are quite difficult or impossible to achieve with the most popular tools used in corpus linguistics today. Script 4 in Case Study 1 illustrates this point.

The programming language used in the case studies is Python. As described earlier, Python is a high-level, object-oriented, interpreted programming language that can be used to build web applications and also desktop applications for Windows, Macintosh, and Linux. It has a very large, vibrant community of core developers and users, and was ranked the second most popular programming language in the world in 2019 (StackOverflow 2019). In the survey, only JavaScript ranked higher, perhaps due to its common use for building web-based applications. Python also has a huge number of user- and company-developed extensions that add important features on top of its rich core functionality. For example, the Pandas extension allows Python to carry out advanced statistical measures that can be visualized using one of many visualization packages, such as Matplotlib or Seaborn. These features have resulted in Python becoming one of the most popular languages used in corpus linguistics work today.
Interestingly, the importance of Python in corpus linguistics work might possibly grow as it is one of the most popular languages used to develop artificial intelligence (AI) and deep learning applications due to its rich number of natural language processing and machine learning libraries.



For reference, equivalent scripts written in the R language are also provided by the second editor, St. Th. Gries, at http://www.stgries.info/research/2020_STG_Scripts4Anthony_PHCL.zip. R is a high-level, functional, interpreted language that is mainly used for building desktop applications. However, a recent add-on package to the language called Shiny makes it much easier to build simple web-apps using the language. Although R is far less popular than Python among general programmers, it has a number of strong supporters in the corpus linguistics community. Also, it has perhaps the most vibrant community of developers and users focused on corpus linguistics work. One reason for this is that, as already mentioned above, R is particularly well suited to carrying out statistical analyses and visualizing the results. The recent trend to use more advanced quantitative methods in corpus linguistics work suggests that R will grow even more in popularity. Importantly, there are extensions in both Python and R to allow programs written in one language to interact smoothly with programs written in the other (see Sect. 9.5 for more information).

In order to follow the case studies and implement the code samples presented, the following steps should first be completed:

1. Set up the target computer to run Python scripts. This is a trivial task that simply involves downloading the latest version of the software and running the installer with default settings (https://www.python.org/downloads/). Numerous tutorials are available on the Internet, but the most detailed and comprehensive explanation is provided by the Python Software Foundation (see Sect. 9.5). During the installation process, it is useful to select the “Add Python to PATH” option so that you can run scripts directly from the command prompt later (see Step 4).

2. Create a folder in a suitable place in the operating system from where Python scripts can be run (e.g. the Desktop) and give it the name “project” (or equivalent).

3. Save all scripts described in the two case studies in the project folder as plain-text files with a “.py” extension to signal that they are Python scripts (e.g. “script_1.py”, “script_2.py”, etc.).

4. Run the scripts by launching the command prompt (on Windows) or Terminal (Macintosh/Linux) and then typing “python” followed by the name of the script (separated by a space), and then hitting the Enter/Return key.
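Step 4 can be sketched as the following terminal session (the folder and script names come from Steps 2 and 3; note that on some systems the Python 3 command is `python3` rather than `python`):

```shell
# move into the "project" folder created in Step 2,
# then run the first script saved in Step 3
cd project
python script_1.py
```

If the command reports that `python` is not found, the “Add Python to PATH” option from Step 1 was probably not selected, or the `python3` command should be used instead.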

9.3.1 Case Study 1: Simple Scripts to Load, Clean, and Process Large Batches of Text Data

We will start by assuming that a very simple, UTF-8-encoded, plain-text file called “text_1.txt” needs to be processed. It contains a single line of text: “The cat sat on the mat.”



Loading a Corpus File and Showing its Contents

Script 1 is a short script that loads “text_1.txt” and shows its content in the console window (on Windows) or terminal (on Macintosh/Linux).

Script 1: Load a File and Show Its Content

1  from pathlib import Path
2  corpus_file = Path('text_1.txt')
3  file_contents = corpus_file.read_text()
4  print(file_contents)

Script 1 is just four lines long. However, it illustrates some interesting and useful features of a modern, object-oriented language. First, the script shows that the most important components of the language are stored in classes that are loaded when needed. Here, on line 1, the program imports the Python “Path” object class from the pathlib library. This is used to create “Path” objects that automatically adjust the string parameter to match the formatting rules of the operating system. The actual “Path” object is created on line 2 and given the name “corpus_file” (for convenience). Second, the script shows that objects, such as “Path”, have associated methods that can be accessed using a dot notation. As an example, on line 3, the “read_text” method of the “Path” object is called using “corpus_file.read_text()” in order to read the content of the file into memory. Then, the content of the file is printed to the screen in line 4 using the Python core “print” function.

A third interesting feature is that the user-defined variables in the script (i.e., “corpus_file” and “file_contents”) have long, meaningful names, making the script easy to read and understand. The variables could just as easily be named “cf” and “fc” as long as they did not clash with reserved names used by the Python core. However, such names would be confusing and easily forgotten, especially if the script was not used for several weeks. Importantly, the Python core object classes and functions also have long, meaningful names, which is one reason why Python is a commonly recommended first language to learn.

While Script 1 can be said to “work”, it includes some deliberate weaknesses that should be avoided when creating programs any longer than this. First, the code contains no comments (lines of code often prepended with a hash character that are ignored by the interpreter but can be read by humans). These are useful for “self-documenting” the code (i.e. adding documentation directly within the code) so that anyone reading the code later (including the original coder!) will understand its design choices. Second, the code contains no whitespace to divide up the different parts of the code; adding whitespace would improve readability. Third, the code is written as a series of commands rather than grouped together in a well-formed function (or object class). As a result, the code cannot easily be recycled and used in other scripts. This writing style also leads to confusing code that is prone to include bugs.

Script 2 shows a more useful version of Script 1, written as a function with comments and whitespace to improve readability and its likelihood of future use:

Script 2: Load a File and Show Its Content (Written as a Function)

 1  # import useful features from the Python Core
 2  from pathlib import Path  # for dealing with files on different operating systems
 3
 4  # define a function to display corpus file content
 5  def get_file_contents(file_path):
 6      # create a file path object
 7      corpus_file = Path(file_path)
 8      # read the file
 9      file_contents = corpus_file.read_text()
10      # return the content
11      return(file_contents)
12
13  # launch the function
14  file_contents = get_file_contents('text_1.txt')
15  print(file_contents)

Script 2 comprises three parts. First, the necessary object class is loaded (line 2). Second, the main function is defined (lines 5–11). Third, the function is launched with the path to the file as a parameter (line 14), and the returned result is shown (line 15). Importantly, the function “get_file_contents” is completely encapsulated (e.g. no outside parameter values are hard-coded into the function), so it can be used with any corpus file in any program. As an example, to show the contents of a file called “text_2.txt”, only the one line of code that launches the function (line 14) needs to be changed:

14  file_contents = get_file_contents('text_2.txt')

Loading a Corpus File, Cleaning it, and Showing Its Contents

For a more challenging example, imagine that a file containing HTML code needs to be processed and the embedded text shown on screen:



File Name: html_example.txt
File Contents:

<html>
<head><title>The Cat</title></head>
<body>
<h1>Chapter 1</h1>
<p>Once upon a time, a cat sat on a mat.</p>
</body>
</html>

This task would be quite troublesome to do manually as it would involve multiple copy/paste steps to create the new ‘cleaned’ file that could then be shown on the screen. Script 2, however, can be very easily extended to load the HTML file, remove all the HTML markup, and then show the text contents. The extended program is shown as Script 3 below:

Script 3: Load an HTML File, Remove the Tags, and Show the Text Contents

 1  # import useful features from the Python Core
 2  from pathlib import Path  # for dealing with files on different operating systems
 3  from bs4 import BeautifulSoup  # for parsing HTML
 4
 5  # define a function to display corpus file content
 6  def get_file_contents(file_path, file_type = 'text'):
 7      # create a file path object
 8      corpus_file = Path(file_path)
 9      # read the file
10      file_contents = corpus_file.read_text()
11      # clean the content (if necessary)
12      if file_type == 'html':
13          # create a BeautifulSoup object with an HTML parser
14          soup = BeautifulSoup(file_contents, "html.parser")
15          # parse the file and extract text
16          file_contents = soup.get_text()
17      # return the contents
18      return(file_contents)
19
20  # launch the function
21  file_contents = get_file_contents('html_example.txt', file_type = 'html')
22  print(file_contents)



The following output produced by Script 3 is shown in the command window or terminal:

The Cat
Chapter 1
Once upon a time, a cat sat on a mat.

Script 3 differs from Script 2 in only four places:

• The “BeautifulSoup” object class from the bs4 Python extension library is loaded in line 3. This library is used to parse HTML files and extract plain text.
• An additional parameter, “file_type”, is created for the function with a default value of “text”. This is introduced so that the script can handle both plain-text and HTML files (line 6).
• A conditional expression is added to the function to parse files using a “BeautifulSoup” object if they are signaled to be HTML files (lines 12–18).
• The extended function is now called with two parameters: (1) the path to the file, and (2) the file type (line 21).
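BeautifulSoup is a third-party library and must be installed separately. Where installing it is not an option, a rough equivalent for simple files can be sketched with Python’s standard-library html.parser module (the class name below is invented for this example, and this minimal version ignores the many edge cases that BeautifulSoup handles):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # called for each run of text found between tags
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)

extractor = TextExtractor()
extractor.feed('<h1>Chapter 1</h1><p>A cat sat on a mat.</p>')
print(extractor.get_text())  # Chapter 1A cat sat on a mat.
```

Note that, unlike BeautifulSoup’s get_text(), this sketch simply concatenates the text runs, so whitespace between elements may need to be added by hand.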

Loading a Web Page, Cleaning it, and Showing Its Contents

Script 3 can be easily adapted further to download an HTML webpage, clean it, and show its text contents. To do this, a “request” object from the urllib Python core library needs to be utilized. This object essentially serves as a mini Web browser, which can access servers and webpages, and download the content for later processing. The adapted script is shown as Script 4, with the statements in lines 9 and 10 serving to download and read the contents of a webpage into memory for processing. Line 19 is then used to call the new function with the specific URL address of the desired webpage.



Script 4: Download an HTML Page, Remove the Tags, and Show Its Text Content

 1  # import useful features from the Python Core
 2  from pathlib import Path  # for dealing with files on different operating systems
 3  from bs4 import BeautifulSoup  # for parsing HTML
 4  from urllib import request
 5
 6  # define a function to display the text content of a web page
 7  def get_url_contents(url):
 8      # download the HTML and read it
 9      with request.urlopen(url) as fh:
10          file_contents = fh.read()
11      # create a BeautifulSoup object with an HTML parser
12      soup = BeautifulSoup(file_contents, "html.parser")
13      # parse the file and extract text
14      file_contents = soup.get_text()
15      # return the contents
16      return(file_contents)
17
18  # launch the function
19  file_contents = get_url_contents(url = 'http://www.python.org/')
20  print(file_contents)

Loading an Entire Corpus and Showing its Contents

We can extend Script 3 in a different way to process an entire corpus. To do this, we create a new function (“show_corpus”) that finds the paths of all the corpus files in a folder (e.g. “target_corpus”), calls the “get_file_contents” function on each file path, and then prints out the output of each file (lines 21–28). Of course, this script is not very useful on its own, but it can be extended to form the foundation of a complete corpus toolkit, as described in Sect. 9.3.2.



Script 5: Load an Entire Corpus and Show its Contents

 1  # import useful features from the Python Core
 2  from pathlib import Path  # for dealing with files on different operating systems
 3  from bs4 import BeautifulSoup  # for parsing HTML
 4
 5  # define a function to display corpus file content
 6  def get_file_contents(file_path, file_type = 'text'):
 7      # create a file path object
 8      corpus_file = Path(file_path)
 9      # read the file
10      file_contents = corpus_file.read_text()
11      # clean the content (if necessary)
12      if file_type == 'html':
13          # create a BeautifulSoup object with an HTML parser
14          soup = BeautifulSoup(file_contents, "html.parser")
15          # parse the file and extract text
16          file_contents = soup.get_text()
17      # return the contents
18      return(file_contents)
19
20  # define a function to process a folder of files
21  def show_corpus(folder_name, file_type = 'text'):
22      # create a file path object
23      corpus_folder = Path(folder_name)
24      # iterate through all the files in the folder and process each one
25      for corpus_file in corpus_folder.iterdir():
26          file_content = get_file_contents(file_path = corpus_file,
27                                           file_type = file_type)
28          print(file_content)
29
30  # launch the function
31  show_corpus(folder_name = 'target_corpus')

9.3.2 Case Study 2: Scripting the Core Functions of Corpus Analysis Toolkits

For this case study, imagine that a toy corpus that comprises just three UTF-8-encoded, plain-text files needs to be processed. Each corpus file contains a single,

Table 9.2 Description of target corpus for case study 2

Folder name:     target_corpus
Folder contents: text_1.txt, text_2.txt, text_3.txt

Individual file contents:
text_1.txt  The cat sat on the mat.
text_2.txt  The cat chased the mouse.
text_3.txt  A dog barked at the cat.

short sentence and the whole corpus is stored in a folder called “target_corpus” in the project folder. The details of the “target_corpus” are given in Table 9.2. Analyzing such a small corpus can perhaps be done by hand or with a calculator. However, when developing computer programs that can analyze corpora of many thousands, millions, or even billions of words, it is often useful to test the scripts being developed with these simple examples that can be calculated exactly. This makes it possible to check if the code is running correctly and ensure that no bugs have been inadvertently introduced.
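This testing strategy can be made explicit with assert statements: work out the counts for the three toy sentences of Table 9.2 by hand, then check that the code reproduces them exactly (a sketch with the files inlined as strings to keep it self-contained; real tests would read the actual corpus files):

```python
from collections import Counter

# the three files of the toy corpus in Table 9.2, inlined for testing
toy_corpus = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "A dog barked at the cat.",
]

counts = Counter()
for text in toy_corpus:
    # a crude tokenization, good enough for this toy data
    counts.update(text.lower().replace('.', '').split())

# hand-calculated expectations for the toy corpus
assert counts['the'] == 5
assert counts['cat'] == 3
assert counts['dog'] == 1
assert sum(counts.values()) == 17
print("all frequency checks passed")
```

If a later change to the tokenization or counting logic breaks one of these expectations, the script fails immediately rather than silently producing wrong frequencies for a million-word corpus.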

Creating a Word-type Frequency List for an Entire Corpus

To create a script that produces a word-type frequency list for a set of plain-text corpus files, we can first utilize the “get_file_contents” function of Script 2 (or Script 3 if we want to process HTML files). Then, we only need to adapt the “show_corpus” function from Script 5 to process each file and count all the words in the corpus. Script 6 shows the complete program.



Script 6: Create a Word-Type Frequency List for an Entire Corpus

 1  # import useful features from the Python Core
 2  from pathlib import Path  # for dealing with files on different operating systems
 3  from collections import Counter  # for creating a token counting object
 4  from regex import findall  # for finding tokens using regular expressions
 5
 6  # define a function to display corpus file content
 7  def get_file_contents(file_path):
 8      # create a file path object
 9      corpus_file = Path(file_path)
10      # read the file
11      file_contents = corpus_file.read_text()
12      # return the content
13      return(file_contents)
14
15  # define a function to create a frequency list
16  def create_word_type_frequency_list(
17          corpus_folder_path,
18          token_definition,
19          ignore_case,
20          results_file_path):
21      # create a file path object
22      corpus_folder = Path(corpus_folder_path)
23      # create a token counting object
24      word_type_counter = Counter()
25      # iterate through all the files in the folder and process each one
26      for file_path in corpus_folder.iterdir():
27          file_contents = get_file_contents(file_path = file_path)
28          # check the case option
29          if ignore_case is True:
30              file_contents = file_contents.lower()
31          # tokenize the file
32          file_tokens = findall(pattern = token_definition, string = file_contents)
33          # loop through the tokens and add one to the type count
34          for token in file_tokens:
35              word_type_counter[token] += 1
36      # create an output list in order of most to least frequent items
37      token_frequency_list = sorted(word_type_counter.items(), key = lambda k: (-k[1], k[0]))
38      # output the list
39      with open(file = results_file_path, mode = 'w') as file_handle:
40          print("{}\t{}".format('word', 'frequency'), file = file_handle)
41          for word_type, frequency in token_frequency_list:
42              print("{}\t{}".format(word_type, frequency), file = file_handle)
43
44  # run the function
45  create_word_type_frequency_list(
46      corpus_folder_path = 'target_corpus',
47      token_definition = r'[\p{L}]+',
48      ignore_case = True,
49      results_file_path = "word_type_frequency_list.txt")



The following output produced by Script 6 is automatically saved in a file named “word_type_frequency_list.txt”:

word      frequency
the       5
cat       3
a         1
at        1
barked    1
chased    1
dog       1
mat       1
mouse     1
on        1
sat       1

Script 6 introduces two new imports: “Counter” from the Python core collections library, and “findall” from the regex Python extension library. The “Counter” object is a very fast and memory-efficient “key-value” data structure that is used to store the word types as keys and their growing frequencies as values as each corpus file is processed. The “findall” function is used to find all the tokens in each file. This function uses a widely used and powerful search language called “regular expressions”, which can define very precise search conditions based on the four core concepts listed in Table 9.3. From Table 9.3, it can be seen that the r'[\p{L}]+' regular expression used in line 47 finds one or more continuous strings of “letters” (A-Za-z for English), which is a simple but commonly used definition of a “word” in corpus linguistics

Table 9.3 Core concepts used in regular expressions

Matching with consumption (for searching and storing or replacing):
  \p{L}   any letter character
  \p{N}   any number character
  \p{P}   any punctuation character
  .       any character

Matching with non-consumption (for positioning the start and end of searches):
  \b      a boundary between ‘word’ characters (e.g. A-Z, a-z, and the underscore _ for English)

Quantifying the number of permissible results:
  ? or {0,1}   zero or one occurrence
  + or {1,}    one or more occurrences
  * or {0,}    zero or more occurrences
  {m}          m occurrences
  {m,n}        between m and n occurrences

Defining conditions:
  []   single-character alternatives (e.g. [abc] for ‘a’ OR ‘b’ OR ‘c’)
  ()   sequences of characters (e.g. (cat) for a string containing ‘cat’)
  |    OR separator for specifying alternatives (e.g. (cat)|(dog) for ‘cat’ OR ‘dog’)



work. The “r” prefix to the string definition marks it as a ‘raw’ string, which prevents Python from interpreting backslashes as escape characters and so makes the string convenient for writing regular expressions (see https://www.regular-expressions.info/ for more information).

The main “create_word_type_frequency_list” function is designed to accept four parameters: (1) the corpus location, (2) a definition of the word tokens to be counted, (3) an option to ignore case in the list (by converting all words to lowercase), and (4) a file path for the results file (lines 16–20). The function then performs five main actions. First, it defines a “corpus_folder” Path object, as we saw in the scripts from Case Study 1 (line 22). Second, it defines a “word_type_counter” Counter object, which is used to store the word types and their frequencies (line 24). Third, it iterates through all the corpus files in the directory, processing each one (lines 26–35). Fourth, it ‘flattens’ the “word_type_counter” Counter object into a simple list in which the items are ordered first by frequency (in reverse order from high to low) and then alphabetically (line 37). Fifth, it prints a header and the newly created list of word types and frequencies to the results file (lines 39–42).

At the end of the script, the “create_word_type_frequency_list” function is run with suitable parameter values (lines 45–49). Here, the regular expression [\p{L}]+ is used for the token definition, which translates as “a series of one or more characters in the Unicode ‘Letter’ character category” (https://en.wikipedia.org/wiki/Unicode_character_property). The “Letter” category is useful here as it leads to a definition of tokens that includes “A-Za-z” for English, all the characters with accents for European languages, all the characters used as ‘letters’ in Asian languages such as Japanese, Korean, and Chinese, and also the ‘letters’ of all other languages defined in the Unicode standard (but not numbers or punctuation, etc.).
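The effect of a token definition can be checked on a single sentence before running it over a whole corpus. The third-party regex module used in Script 6 supports \p{L}; the standard-library re module shown here does not, so this sketch substitutes the rough ASCII-only equivalent [A-Za-z]+ (which, unlike \p{L}, misses accented and non-Latin letters):

```python
import re

# tokenize one sentence with a simple "one or more letters" definition
text = "The cat sat on the mat."
tokens = re.findall(r'[A-Za-z]+', text.lower())
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Swapping in a different pattern (e.g. one that also allows apostrophes or hyphens) and re-running this one-liner is a quick way to see exactly which tokens a definition will produce.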
Script 6 is a very fast, efficient, and fully-featured program that can process and produce a word-type frequency list for the 1-million-word Brown corpus (Francis & Kučera 1964) in just over 1 sec on a modest computer. This is much faster than most desktop corpus analysis tools.

Creating a Key-Word-In-Context (KWIC) Concordancer

To create a script that produces a classic Key-Word-In-Context (KWIC) concordancer, again, we can utilize the “get_file_contents” function of Script 2 (or Script 3 if we want to process HTML files). We then adapt the “show_corpus” function from Script 5 to process each file and output a KWIC result for every search hit that is found. Script 7 shows the complete program.

9 Programming for Corpus Linguistics


Script 7: Create a Key-Word-In-Context (KWIC) Concordancer

 1  # import useful features from the Python Core
 2  from pathlib import Path  # for dealing with files on different operating systems
 3  from regex import finditer, sub, IGNORECASE  # for finding tokens using regular expressions
 4
 5  # define a function to display corpus file content
 6  def get_file_contents(file_path):
 7      # create a file path object
 8      corpus_file = Path(file_path)
 9      # read the file
10      file_contents = corpus_file.read_text()
11      # return the content
12      return(file_contents)
13
14  # define a function to create a kwic concordance
15  def create_kwic_concordance(
16          corpus_folder_path,
17          ignore_case,
18          search_term,
19          context_size,
20          results_file_path):
21      # create a file path object
22      corpus_folder = Path(corpus_folder_path)
23      # open file to save results
24      output_file_handle = open(file = results_file_path, mode = 'w')
25      # iterate through all the files in the folder and process each one
26      for file_path in corpus_folder.iterdir():
27          # read the file
28          file_contents = get_file_contents(file_path = file_path)
29          # check the case option
30          if ignore_case is True:
31              flags = IGNORECASE
32          else:
33              flags = 0
34          # search in the file for hits
35          for match in finditer(search_term, file_contents, flags):
36              kwic_string = ''
37              # set file positions from where to extract the kwic result
38              kwic_start_index = match.start(0) - context_size
39              kwic_end_index = match.start(0) + context_size + 1
40              # adjust the file positions if they point beyond the file
41              # and add padding if the string is too short
42              if kwic_start_index < 0:
43                  start_padding = abs(kwic_start_index)
44                  kwic_string = ' ' * start_padding
45                  kwic_start_index = 0
46              # extract the kwic result
47              kwic_string += file_contents[kwic_start_index:kwic_end_index]
48              # replace line breaks in the kwic result with spaces
49              kwic_string = sub('[\r\n]+', ' ', kwic_string)
50              # save the kwic result
51              print(kwic_string, file = output_file_handle)
52
53  # run the function
54  create_kwic_concordance(
55      corpus_folder_path = 'target_corpus',
56      ignore_case = True,
57      search_term = r'\b[\p{L}]+at\b',
58      context_size = 10,
59      results_file_path = "kwic_results.txt")
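Script 7 relies on two regex-library functions, finditer and sub. Their behaviour can be seen in isolation with Python’s standard-library re module, which offers the same calls (the example string here is illustrative, not taken from the target corpus):

```python
import re

text = 'The cat sat on the mat.'

# finditer yields match objects one at a time, so each hit can be
# processed (and written to disk) as soon as it is found
positions = [(m.start(), m.group(0)) for m in re.finditer(r'\b\w+at\b', text)]
print(positions)  # [(4, 'cat'), (8, 'sat'), (19, 'mat')]

# sub finds and replaces via a regular expression; here it collapses
# line breaks into spaces, as in the KWIC clean-up step of Script 7
print(re.sub(r'[\r\n]+', ' ', 'The cat\r\nsat on'))  # The cat sat on
```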

The following output produced by Script 7 is automatically saved in a file named “kwic_results.txt”.



[KWIC concordance lines for the hits “cat”, “sat”, and “mat”, each padded to 10 characters of context to the left and right of the node word]


Script 7 does not require the “Counter” object class from the Python Core collections library or the “findall” function from the regex Python extension library. However, it does use the “finditer” and “sub” functions of the regex library (line 3). The “finditer” function performs similarly to the “findall” function, but produces results iteratively, one-by-one, allowing them to be saved to a file immediately. The “sub” function, on the other hand, is used to find and replace (substitute) strings using a regular expression. Here, it is used to clean the KWIC results by replacing unwanted line-break characters with spaces. The import from the regex library also includes an “IGNORECASE” flag, which tells the regex library to ignore case in searches. The main “create_kwic_concordance” function is designed to accept five parameters: (1) the corpus location, (2) an option to ignore case when searching, (3) the search term, (4) a context size that defines how many characters to the left and right of the search term will be shown in the KWIC results, and (5) a file path for the results file (lines 15–20). The function then performs just three main actions. As in the previous scripts, it first defines a “corpus_folder” Path object (line 22). Next, it creates a file handle that is used to output the results as they are generated and a variable to process the case option (lines 28–33). Finally, it performs the main action of the function, i.e., locating search term hits in the corpus files (via “finditer”) and generating KWIC results for each of them (lines 35–51). At the end of the script, the “create_kwic_concordance” function is run with suitable parameter values (lines 54–59). Here, the “search_term” parameter is given as
r"\b[\p{L}]+at\b", which is a regular expression that translates to “a word-bounded string of one or more Unicode letter characters ending in ‘at’”. This leads to results that include “cat”, “mat”, and “sat”. The “context_size” parameter is set to 10, which leads to KWIC results with 10 characters of context to the left and right of the search term. Script 7 is again a very fast, efficient, and fully-featured program. It can create the nearly 70,000 KWIC results for the word “the” in the 1-million-word Brown corpus (a notoriously slow search) in under 1 sec on a modest computer. This is much faster than most desktop corpus analysis tools, which have to deal with color highlighting and other display issues. It can also search for words, phrases, or full regular expressions with case-sensitivity, and it produces hits with any amount of surrounding context desired. Perhaps surprisingly, the script comprises just 59 lines of code, and again, most of these are whitespace and comments. One limitation of the program, however, is that it does not sort the results. Adding this functionality would require an additional sorting function. Alternatively, the sorting could be carried out later in spreadsheet software such as Excel.
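Because every padded KWIC line places the node word at a fixed offset (context_size characters in), adding a sort is straightforward. The helper below, sort_kwic_lines, is a hypothetical extension, not part of Script 7; it sorts on the node word and whatever follows it, ignoring the left context.

```python
def sort_kwic_lines(kwic_lines, context_size):
    # the node word starts at index context_size in every padded KWIC line,
    # so sorting on the tail of the line sorts on the node word first
    return sorted(kwic_lines, key=lambda line: line[context_size:].lower())

kwic = [
    'he cat    sat on the ',   # node word "sat"
    '          cat sat on ',   # node word "cat"
    'at on the mat.       ',   # node word "mat"
]
for line in sort_kwic_lines(kwic, 10):
    print(line)
# lines are printed with node words in the order: cat, mat, sat
```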

Creating a “MyConc” Object-Oriented Corpus Analysis Toolkit

Script 8 recreates the functionality of Scripts 6 and 7 in a single object class called “MyConc”. As discussed earlier in this chapter, object-oriented programming offers numerous advantages over functional programming, especially for larger-scale projects. In this case, we could use the MyConc class as a foundation for a more complete corpus toolkit that could be released as an open-source project, allowing it to be used and extended by others.
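Script 8 itself is not reproduced here, but the general shape such a class can take is sketched below. The method names, the simplified tokenization, and the minimal KWIC extraction are illustrative assumptions, not necessarily those of the actual MyConc implementation.

```python
import re
from collections import Counter

class MyConc:
    """Illustrative skeleton of an object-oriented corpus toolkit."""

    def __init__(self, texts):
        self.texts = texts  # raw contents of the corpus files

    def word_frequencies(self, ignore_case=True):
        # functionality of Script 6: an ordered word-type frequency list
        counter = Counter()
        for text in self.texts:
            counter.update(re.findall(r'\w+', text.lower() if ignore_case else text))
        return sorted(counter.items(), key=lambda item: (-item[1], item[0]))

    def kwic(self, search_term, context_size=10):
        # functionality of Script 7, kept minimal (no left-padding of results)
        hits = []
        for text in self.texts:
            for m in re.finditer(search_term, text):
                start = max(0, m.start() - context_size)
                hits.append(text[start:m.end() + context_size])
        return hits

conc = MyConc(['The cat sat on the mat.'])
print(conc.word_frequencies()[0])   # ('the', 2)
print(conc.kwic(r'\bmat\b'))        # ['at on the mat.']
```

Bundling the corpus data and the analysis methods in one class means a new tool (e.g. a collocation finder) can be added as another method that reuses the same loaded texts.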

9.4 Critical Assessment and Future Directions

Learning to program with a computer language is certainly not an easy task. As with learning to use a human language, it requires study, practice, and, perhaps most importantly, a genuine need. There is also an aspect of creativity and beauty in computer language use that mimics that of human languages. Some programs may “work” but be short, abrupt, and difficult to understand. Others may be long and overly complex. This raises an aspect of programming that is often forgotten: computer programs must be understood by humans. The first human who needs to understand the code is the developer, especially when they return to the code months after the original project in order to fix a bug or add a new feature. Other humans are likely to see the code, too. If the program is written as part of a funded project, at some point the developer might leave, requiring others to take over the work. If the program is part of an open-source project, many people might want to contribute to the code. There is also a growing requirement by journals and funding agencies to make programs open access in order to facilitate the replicability and reproducibility
of research results (Branco et al. 2017). Therefore, scripts should always be clear, clean, and readable. The scripts presented in this chapter are designed to illustrate good programming habits. However, they are limited in terms of scope (e.g., only two core functions of corpus analysis toolkits are presented) and functionality (e.g., the KWIC tool does not include a sorting function). Fortunately, corpus linguists who are interested in further developing their programming skills have an abundance of learning resources available to them. There are numerous MOOCs (Massive Open Online Courses) offered online, as well as specially prepared web-based courses and tutorial guides. One notable course for Python, for example, is the tutorial offered by the w3resource team (see Sect. 9.5). The main site for asking specific questions about programming, as well as seeing code samples that have been posted in response to questions, is StackOverflow (again, see Sect. 9.5). This is a truly vital resource for anyone seriously considering entering the world of programming. It is highly likely that at some point in a corpus linguist’s career, they will need to develop custom scripts to investigate their unique research questions. As discussed here, one strategy is to write these scripts directly. However, another possibility is to work with an expert programmer. In the latter case, it is important that the language of programming does not get in the way of communicating what the task should be. Corpus linguists should avoid trying to explain to the programmer how the task should be completed, e.g., saying that they want the programmer to create a program that opens each file, tokenizes the content, and then counts the frequencies of each word. Rather, they should explain what they want, e.g., an ordered list of important words in the corpus. Through discussions, the precise meaning of “important” can be clarified, as well as the best way to order the list.
One danger when working with computer programmers is becoming overwhelmed by the amount of programming terminology that tends to appear in their conversations. To some extent, the discussions on different programming languages and the functional and object-oriented programming paradigms given in this chapter should help to demystify some of the terminology that may be used. Of course, most programmers are very willing to explain what they mean, so the corpus linguist should always ask for clarification where necessary.

9.5 Tools and Resources

Numerous tools and resources exist to help novice programmers download, install, set up, and use a programming language. The list that follows targets the Python and R programming languages, but a simple Internet search will produce resources that can fill the gaps for other languages.

Downloading, Installing, and Setting Up the Programming Language
• Getting started with Python: https://docs.python.org/. Accessed 31 January 2020.
• Getting started with R: https://www.r-project.org/. Accessed 31 January 2020.

Online Tutorials and Resources for Python
• Getting started with Python: https://docs.python.org/. Accessed 31 January 2020.
• Interactive Python tutorial: https://www.learnpython.org/. Accessed 31 January 2020.
• Python Tutorial: https://www.w3schools.com/python/. Accessed 31 January 2020.
• Learn Python the hard way (a top-rated tutorial for beginners despite the name): https://learnpythonthehardway.org/. Accessed 31 January 2020.
• Python Exercises, Practice, Solution: https://www.w3resource.com/python-exercises/. Accessed 31 January 2020.
• Natural Language Toolkit (NLTK documentation): https://www.nltk.org/. Accessed 31 January 2020.

Online Tutorials and Resources for R
• An Introduction to R: https://cran.r-project.org/doc/manuals/R-intro.pdf. Accessed 31 January 2020.
• R manuals: https://cran.r-project.org/manuals.html. Accessed 31 January 2020.
• Collostructional analysis with R: http://www.stgries.info/teaching/groningen/index.html. Accessed 31 January 2020.
• R Resources: https://www.ucl.ac.uk/ctqiax/PUBLG100/2015/resources.html. Accessed 31 January 2020.
• RStudio (with Shiny examples): https://www.rstudio.com/resources/. Accessed 31 January 2020.

Online Communities for Programming (Including Corpus Linguistics)
• StackOverflow: https://stackoverflow.com/. Accessed 31 January 2020.
• Python Software Foundation: https://docs.python.org/. Accessed 31 January 2020.
• Planet Python: https://planetpython.org/. Accessed 31 January 2020.
• StatForLing with R: https://groups.google.com/forum/#!forum/statforling-with-r. Accessed 31 January 2020.
• CorpLing with R: https://groups.google.com/forum/#!forum/corpling-with-r. Accessed 31 January 2020.

Packages to Allow Python and R to Interact with Each Other
• The “rpy2” Python package to access R scripts from Python: https://rpy2.readthedocs.io/en/latest/. Accessed 31 January 2020.
• The “reticulate” R package to access Python scripts from R: https://github.com/rstudio/reticulate. Accessed 31 January 2020.



Further Reading

Bird, S., Klein, E., and Loper, E. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. Sebastopol: O’Reilly Media, Inc. https://www.nltk.org/book/. Accessed 31 January 2020.
This book provides a comprehensive description of the Python NLTK framework, which allows even beginner programmers to easily download corpora and analyze them through KWIC concordance views, word frequency lists, and a host of other commonly used corpus tools.

Gries, S. T. 2016. Quantitative corpus linguistics with R: A practical introduction. 2nd ed. Abingdon and New York: Routledge.
This book is a revised and updated edition of Gries’ 2009 introduction to R programming in corpus linguistics, which pioneered the use of R and advanced quantitative methods in corpus linguistics research.

Desagulier, G. 2017. Corpus linguistics and statistics with R. Springer International Publishing.
This book provides another very useful introduction to the R programming language, aimed at a broad audience of applied linguists, including sociolinguists, historical linguists, computational linguists, and psycholinguists.

References

Anthony, L. (2009). Issues in the design and development of software tools for corpus studies: The case for collaboration. In P. Baker (Ed.), Contemporary corpus linguistics (pp. 87–104). London: Continuum Press.
Anthony, L. (2014). Brainstorming the future of corpus tools. http://cass.lancs.ac.uk/?p=1432. Accessed 31 Jan 2020.
Anthony, L. (2019). AntConc (Version 3.5.8) [Computer software]. Tokyo: Waseda University. https://www.antlab.sci.waseda.ac.jp/. Accessed 31 Jan 2020.
Anthony, L. (2020). AntLab tools. Tokyo: Waseda University. https://www.antlab.sci.waseda.ac.jp/software. Accessed 31 Jan 2020.
Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). CELEX2 LDC96L14. Philadelphia: Linguistic Data Consortium. https://catalog.ldc.upenn.edu/LDC96L14. Accessed 31 Jan 2020.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics. Cambridge: Cambridge University Press.
Branco, A., Cohen, K. B., Vossen, P., Ide, N., & Calzolari, N. (2017). Replicability and reproducibility of research results for human language technology: Introducing an LRE special section. Language Resources and Evaluation, 51, 1–5.
Burnard, L. (2000). The British National Corpus users reference guide. http://www.natcorp.ox.ac.uk/archive/worldURG/index.xml. Accessed 31 Jan 2020.
Chandler, B. (1989). Longman mini-concordancer [Computer software]. Harlow: Longman.
Clark, R. (1966). Computers and the Humanities, 1(3), 39.

Davies, M. (2011). Synchronic and diachronic uses of corpora. In V. Viana, S. Zyngier, & G. Barnbrook (Eds.), Perspectives on corpus linguistics: Connections & controversies (pp. 63–80). Philadelphia: John Benjamins.
Dearing, V. A. (1966). Computers and the Humanities, 1(3), 39–40.
Desagulier, G. (2017). Corpus linguistics and statistics with R. Springer.
Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analyses. Corpora, 14(1), 77–104.
Francis, W. N., & Kučera, H. (1964). Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Providence, Rhode Island: Department of Linguistics, Brown University.
Gries, S. T. (2009). What is corpus linguistics? Language and Linguistics Compass, 3, 1–17.
Gries, S. T. (2016). Quantitative corpus linguistics with R (2nd rev. & ext. ed.). London/New York: Routledge/Taylor & Francis.
Hockey, S., & Martin, J. (1987). The Oxford concordance program version 2. Literary & Linguistic Computing, 2(2), 125–131. https://doi.org/10.1093/llc/2.2.125.
Johns, T. (1986). Micro-concord: A language learner’s research tool. System, 14(2), 151–162.
Johnson, K. (2008). Quantitative methods in linguistics. Hoboken: Wiley.
Kaye, G. (1990). A corpus builder and real-time concordance browser for an IBM PC. In J. Aarts & W. Meijs (Eds.), Theory and practice in corpus linguistics (pp. 137–162). Amsterdam: Rodopi.
Moon, R. (2007). Sinclair, lexicography, and the Cobuild Project: The application of theory. International Journal of Corpus Linguistics, 12(2), 159–181.
Nesi, H., Sharpling, G., & Ganobcsik-Williams, L. (2004). Student papers across the curriculum: Designing and developing a corpus of British student writing. Computers and Composition, 21(4), 401–503.
Price, K. (1966). Computers and the Humanities, 1(3), 39.
Reed, A. (1978). CLOC [Computer software]. Birmingham: University of Birmingham.
Scott, M. (2020). WordSmith tools (Version 8.0) [Computer software]. https://lexically.net/wordsmith/. Accessed 31 Jan 2020.
Simpson, R. C., Briggs, S. L., Ovens, J., & Swales, J. M. (2002). The Michigan corpus of academic spoken English. Ann Arbor: The Regents of the University of Michigan.
Sinclair, J., Jones, S., & Daley, R. (2004). English collocation studies: The OSTI report. London: Continuum.
Smith, P. H. (1966). Computers and the Humanities, 1(3), 39.
StackOverflow. (2019). Developer survey results 2019. https://insights.stackoverflow.com/survey/2019. Accessed 31 Jan 2020.
Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge: Cambridge University Press.
Thompson, P., & Nesi, H. (2001). The British Academic Spoken English (BASE) corpus project. Language Teaching Research, 5(3), 263–264.
Tribble, C. (2015). Teaching and language corpora: Perspectives from a personal journey. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 37–62). Amsterdam: John Benjamins Publishing.
Winter, B. (2019). Statistics for linguists: An introduction using R. Abingdon/New York: Routledge.

Part III

Corpus types

Chapter 10

Diachronic Corpora

Kristin Davidse and Hendrik De Smet

Abstract In this chapter, we first consider the challenges specific to diachronic corpus compilation. These result from the uneven (or non-)availability of historical records in respect of the varieties associated with users (temporal, regional, social and individual) and contexts of use. Various ways are discussed in which these biases can be redressed. Next, we discuss issues of diachronic corpus annotation and heuristic techniques that can be used to interrogate the syntagmatic-paradigmatic organization across the whole lexicogrammatical continuum, illustrating their relevance to diachronic corpus linguistics. Finally, we consider representative studies and corpora, tools and resources and key readings. These are informed by the future directions we advocate, viz. inductive, data-driven approaches to text classification, the identification of historical lexicogrammatical patterns and meaning change, and the sharing of rich data annotation and data analysis.

10.1 Introduction

Ever since its philological beginnings, diachronic linguistics has relied on corpus data. Past lexicogrammatical patterns are not accessible through speaker intuition or experimentation, but have to be reconstructed on the basis of the written historical record. Historical linguists therefore have always had to take recourse to collections of texts or collections of quotations. However, with the advent of electronic corpora, the speed and systematicity with which diachronic – like synchronic – records can be queried has increased tremendously, opening up new research possibilities by dramatically facilitating at least some aspects of data collection. This is, in fact, a continuing trend, as historical corpora continue to grow in number and size, and as the techniques for interrogating them become both more efficient and more sophisticated. At the same time, barring corpora of very recent history,

K. Davidse · H. De Smet () KU Leuven (University of Leuven), Leuven, Belgium e-mail: [email protected]; [email protected] © Springer Nature Switzerland AG 2020 M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_10



K. Davidse and H. De Smet

diachronic data will always present us with important problems, failing to represent all the variation associated with the users and contexts of use, with especially spoken language as the perennial gap. Despite justified enthusiasm about the recent advances in diachronic corpus linguistics, such concerns must be taken into account in corpus design as well as in the development of heuristic techniques to crack the code of past language systems and to explain variation and change. In the most general terms, our plea here is one for informed use of diachronic resources. For end users to make appropriate use of a corpus it is instrumental that they understand how it has been compiled. The corollary of this is that compilers should set out their compilation procedures accessibly and explicitly. At the same time, it would be beneficial for the research community as a whole to consider ways in which historical resources can be built and enriched in a more dynamic and bottom-up way, to ensure they are maximally adaptable to specific research needs as well as being more responsive to newly gained insights (Diller et al. 2011; Nevalainen et al. 2016). The reason is that many of the problems of corpus compilation are not just theoretically uninteresting preliminaries to research, but actually reflect issues that can seriously affect data interpretation, to the point of being research-worthy in their own right.

10.2 Fundamentals

10.2.1 Issues and Challenges of Diachronic Corpus Compilation

There is one thing that all diachronic corpora have in common: the usage data they contain is at least organized along the temporal dimension, such that comparison across earlier and later manifestations of a language becomes possible. Apart from that, diachronic corpora differ widely in size, composition, scope, annotation, and the nature of their textual material. Because of the limitations of the historical data available, and because research goals are very diverse, there is no such thing as an ideal historical corpus. Specific requirements of diachronic research simply need to be met in different ways. Nevertheless, there are recurrent challenges that both compilers and end users of diachronic corpora have to confront. As diachronic corpora are typically used to study language change, and language change is generally understood to arise from and give rise to language variation, it is something of a bitter irony that one of the greatest difficulties diachronic corpora face lies precisely in capturing historical variation. This holds for both major dimensions of variation, which – following Gregory (1967) and Romaine (2000) – we will refer to here as lectal and diatypic. Lectal variation reflects “reasonably permanent characteristics of the user [italics ours]” (Gregory 1967:181), including variation that is structured temporally, regionally, socially and individually. Diatypic variation reflects “recurrent characteristics of the user’s use of language in situations

10 Diachronic Corpora


[italics ours]” (Gregory 1967:185) and depends on communicative goals, the mode of communication, and the speaker-addressee relationship. The typical challenges of compiling diachronic corpora, then, include (1) identifying the lectal and diatypic properties of texts, (2) handling the lectal and diatypic biases of the historical records and (3) circumventing impaired comparability across lectally and diatypically diverse datasets. These issues are the topics of the following sections.

Identifying the Lectal and Diatypic Properties of Texts

In addition to being organized along the temporal dimension, diachronic corpora often include information on other lectal and diatypic properties of the texts they contain. Identifying these properties, however, may be difficult for historical texts. The challenges start with the identity of historical authors, which more often than not is something of a mystery. Added to that there are the complexities of textual transmission. Think, for instance, of scribal interference in mediaeval texts or editorial interference in more recently published materials. Tellingly, when creating the Helsinki Corpus of English texts, which was the first diachronic electronic corpus of any language, its compilers found themselves forced to provide many of the older texts with multiple period labels, as the best way to indicate both a text’s (approximate) manuscript date and its creation date (Kytö 1996). In general, the older the text material, the more reasonable it is to suspend overly precise attempts at dating and locating texts. But these problems are certainly not restricted to premodern texts. In fact, due to both the ease of digital reproduction and the anonymity of the world wide web, many recent web-based corpora, such as COW or GloWbE, run into the problems that only a decade ago were mainly associated with premodern texts, in that almost nothing is known about the language users represented, including some of the most basic sociolinguistic variables, such as age, gender and linguistic background (see also Chap. 15). Of course, not all historical texts pose these difficulties, and some offer special opportunities. Letters, for instance, have several advantages as a source of historical text material. They are often precisely located in time and space, can often be unambiguously assigned to a single author, have in their addressee a clearly identifiable target audience, and may represent speakers who have otherwise left no written records. 
A corpus that has used this type of data to the best advantage is the Corpus of Early English Correspondence, compiled by Terttu Nevalainen, Helena Raumolin-Brunberg and a team of collaborators (Raumolin-Brunberg and Nevalainen 2007). Not only does the corpus contain letters from the sixteenth to eighteenth century, it also covers four major English regions, as well as giving information on the social rank of the letter writers and how they relate to their addressees. Providing this type of social background information of course poses new challenges, requiring a thorough knowledge of the social structure of past societies. The compilers eventually arrived at a very elaborate coding scheme to describe the letter writers in their corpus, using 27 different parameters, some with very open information. Parameters include, for instance, the letter writer’s gender, year of birth, occupation, rank, their father’s rank, their education, religion, place of
residence, any history of migration, their careers and social mobility, the general type of contents of their letters, and how well their letters can be authenticated. Additional information on a letter writer, if potentially relevant, is included in open comment boxes. In providing all this information, the compilers clearly chose to collect as much metadata as possible. As a result, it is to an important extent also up to the end user to interpret the complexities of historical reality. On the whole, however, detailed background information on historical texts is often impossible to come by directly. While in such cases healthy agnosticism remains a sensible default option, it is good to be aware that avenues towards possible solutions are currently being explored. Particularly in the domain of authorship, automated stylometric techniques now allow probabilistic identification. An example is the authentication of the writings of Julius Caesar by Kestemont et al. (2016). The Latin texts in question report on the military campaigns conducted by Julius Caesar. Some of the texts can be confidently attributed to Caesar himself, but there has long been controversy about others. The work by Kestemont et al. (2016) confirms that one of Caesar’s generals, Aulus Hirtius, has a good claim to the authorship of some of the writings. This type of work is of course potentially relevant to corpus compilers, who could rely on it to annotate the texts in their corpora. Moreover, while the probabilistic nature of identification may at first sight appear to be a disadvantage, in the long term it may actually liberate compilers from overly rigid and reified systems of classification. Another recent development promising new insights into lectal variation is the increasing access offered by very large corpora to individual variation.
One example is the Hansard Corpus, compiled by Jean Anderson and Marc Alexander, which contains the proceedings of the British Houses of Parliament from 1803 to 2005 and represents nearly 40,000 individual speakers. Corpora built from Parliamentary proceedings may become immensely valuable to historical linguists and sociolinguists for a number of reasons (Marx 2009). First, comparable datasets are available for other languages than English, often free of copyright. Second, especially for the more recent decades, the lives of the language users who produced the texts are mostly well-documented, and even the social relations between them are to some extent known. And third, these datasets are already intrinsically structured along linguistically relevant dimensions (by date, speaker, speaker role, party affiliation, house etc.), relieving the corpus compiler at least to some extent from the burden of imposing structure on the data. In the future, these very rich data sets may allow researchers to further disentangle the complexities of lectal variation. Because the usage of individuals can be compared directly, less recourse needs to be taken to what are potentially aprioristic and artificial classifications of language material based on authors’ or speakers’ regional provenance or social status.

Redressing Historical Bias

While classifying and contextualizing available text material will always pose difficulties, many of the problems diachronic corpora face do not come from the
texts they have, but from the texts they do not have. Depending on the historical period at issue, the historical record is patchy to a greater or lesser degree, but in practically all cases it is severely biased. The voices of the less powerful and/or less literate strata of the population are as a rule unrecorded, and the texts that do reach us are – prior to the large-scale employment of modern recording equipment – biased to the written mode. Considering that direct spoken interaction and patterns of social stratification are believed to be crucial to the emergence and transmission of linguistic variants, this is of course a frustrating situation. Moreover, this is one area where, despite their merits, current big data projects may (for now) be exacerbating the problem by their tendency to go where the data is and harvest text materials indiscriminately. Typically, it is corpora making the most of unique but limited and less accessible resources that are best placed to redress the biases in the historical record. A good example is the Corpus of Early English Correspondence, already discussed above. Another example is the data set used by Blaxter (2015), who carefully extracted the direct reported speech passages from Old Norse sagas to assess the role of speaker gender in ongoing change in a mediaeval Scandinavian setting. It is instructive here to consider in some more detail another substantial effort to create a diachronic corpus that contains – at least by proxy – socially stratified spoken interaction. The Old Bailey Corpus (OBC2.0), compiled by Magnus Huber and his team (Huber 2007; Huber et al. 2012), consists of trial proceedings from London’s Central Criminal Court, the Old Bailey, published between 1720 and 1913. Containing the published transcripts of the spoken interactions in court, it is about as close to a corpus of Late Modern spoken English as one can get, with the added bonus of a socially very diverse set of speakers. 
But, while the OBC2.0 is decidedly an exciting resource, the question remains to what extent it really represents spoken usage. In general, trial proceedings have been argued to give us some of the most reliable data on spoken usage before the advent of audio-recordings (Culpeper and Kytö 2000). Regarding the material in OBC2.0 in particular, a balanced discussion is provided by Huber (2007) (but see also Archer 2014). The speech recorded in OBC2.0 obviously does not come down to us directly but has typically been transcribed in shorthand during the court sessions, reworked afterwards into standard text, typeset, perhaps proofread and printed. The most crucial step here is probably the transition from speech to shorthand transcript. Scribes appear to have prided themselves on their ability to make verbatim records of the proceedings. At the same time, some of the tell-tale characteristics of spoken interaction (pauses, false starts, repetitions, etc.) are obviously missing from the printed records. What is most disturbing is that there may be very substantial lexical, grammatical and textual differences between records in the proceedings of the Old Bailey and records of the same trials published elsewhere. Then again, it is reassuring to find that the direct speech passages in OBC2.0 differ from the remainder of the corpus, for instance containing far more instances of contraction – much as one would expect if the direct speech passages reflect actual speech.


K. Davidse and H. De Smet

Though limited to recent periods, another way to counter the biases in the historical record is of course the compilation of diachronic corpora containing actual audio-recorded speech (see Chap. 11 for more information about spoken corpora). There are several ways to pursue this goal. First, corpora of spoken usage began to be compiled in the second half of the twentieth century. Initially intended as representative of contemporary usage, these corpora are now gradually becoming historical corpora and it is to be hoped that current and future researchers will be willing to repeat the efforts of their predecessors to create new contemporary and comparable spoken corpora. One such effort is the recent creation of the Spoken BNC 2014, whose structure echoes the spoken component of the British National Corpus, originally released in 1994 (Love et al. 2017). Second, corpus compilers have also begun to dig into existing archives of old recordings, such as oral history projects. In some cases, the time depth that can be achieved is impressive. For example, Hay and Sudbury (2005) have been able to trace the emergence of English intrusive r (as in law-r-and-order) in New Zealand English thanks to the Origins of New Zealand English Project, which contains recordings from English-speaking New Zealanders born as early as 1850. Third, the exponentially increasing amounts of recorded speech posted online in combination with the advances in speech-to-text technology mean that a diachronic corpus compiled by automated harvesting and transcribing of large amounts of freely available spoken data is literally only a matter of time.

Diachronic Comparability

The problem of gaps and biases in the historical record is further complicated by the temporal dimension of diachronic corpora. Not only is it difficult to approximate the full range of synchronic variability for a language at a given point in time, there is a further difficulty in doing so without compromising diachronic comparability. For example, given a rich historical record and a long tradition of high-quality text editions, it is perfectly possible to create a sizeable corpus of Old French, as shown by the FRANTEXT database. Yet it is impossible to create a corpus of Old French that is comparable in any straightforward way to a corpus of Present-day French. The reason is, put simply, that there is no Old French equivalent to a Present-day French newspaper, just as there is no Present-day French equivalent to an Old French epic poem. Similarly, it is possible to create a sizeable corpus of Latin, with a time-depth of over 2000 years, as shown by the 13-million-word LatinISE corpus, built by Barbara McGillivray. However, despite efforts to preserve genre balance, any diachronic comparison across this timespan has to take into account that the language used in such a corpus changes from a relatively standardized written language with more or less direct roots in a contemporary vernacular to a functionally impoverished and much more writing-dependent language used mainly in religious contexts and as a lingua franca among European scholars and scientists. More generally, because the conditions under which language is produced are themselves subject to change, it is fundamentally impossible to compare linguistic
material only along its temporal dimension. In this respect, historical linguists are always comparing apples and oranges. Arguably, all major diachronic reference corpora, though often striving to produce stratified samples of language use across time, suffer from this problem. One possible response is to ignore the issue and simply include material as exhaustively as possible. This approach prioritizes coverage of synchronic variability, to the best level achievable and with no prior assumptions made. Especially where the body of historical data is finite, disparate and severely biased, or where a corpus is to be used to study change over very long time periods, this is a perfectly defensible strategy. An example is the Dictionary of Old English Corpus (compiled by Antonette di Paolo Healey and colleagues), which exhausts all Old English texts available down even to the odd Runic inscription. Another example is the Oxford Corpus of Old Japanese (compiled by Bjarke Frellesvig and colleagues). A completely different response to the issue of comparability is to create single-genre diachronic corpora that cover relatively short time spans and draw their data from a single historical source. The OBC2.0 or the Hansard Corpus, both already discussed above, are good examples. The single-genre approach has the advantage of improving diachronic comparability, but comes at the expense of coverage, both with respect to synchronic variability and time-depth. Moreover, although diachronic comparability improves, it may still not be optimal, because genres themselves tend to be moving targets. Consider again trial proceedings, as represented in the OBC2.0. The recording of trial proceedings in the Old Bailey started outside the actual control of the court, as publishers were commercially interested in the more sensational cases and sent out their scribes to record them.
But in the course of the eighteenth century the proceedings gradually developed into official true-to-fact records. Another example is found in De Smet and Vancayzeele (2014), who show that eighteenth-century English narrative fiction contains far fewer action sequences and has longer descriptive passages than later narrative fiction. In a diachronic corpus of narrative fiction this inevitably affects the frequencies of specific grammatical patterns associated with either descriptive or more action-driven narrative passages. Especially over longer time spans, it is virtually impossible to keep genre – or, for that matter, any other diatypic parameters – constant. Besides awareness of these difficulties, a more bottom-up and data-driven approach to describing diatypic text properties may, in the long term, provide the more satisfactory solutions. Dimensions such as ‘spokenness’ or ‘formality’ can be operationalized and measured on a text-by-text basis from linguistic properties (Biber 1988). For example, Hinrichs et al. (2015) apply a number of relatively simple measures to their English corpus data to assess individual texts’ adherence to prescriptivist dogma, which they then use as a predictor in a variationist study. One such measure is the relative rate of occurrence of shall and will as future auxiliaries. To apply such methods more systematically and across long time-spans will require further research, but it at least allows researchers to position texts relative to one another and to the contemporary norm on one or more dimensions of interest. The implication is again that responsibility for interpreting the structure of a corpus
moves from the corpus compiler to the researcher. It also means that it may be necessary for corpora themselves to become the object (rather than just a means) of study.

10.2.2 Issues and Challenges of Text-Internal Annotation

Turning from the level of the texts that make up a corpus to the internal properties of those texts, perhaps the most fundamental question compilers and users of diachronic corpora must ask is to what extent they can rely on methods devised for the annotation and analysis of contemporary data in handling data from older periods. Older texts are in principle somewhat alien. Their writing conventions differ from present-day practices and, obviously, the very language they represent is different from any present-day variety. Added to this is a layer of inadvertent ‘noise’ created along the way as a corpus text travels from historical manuscript or print to digital edition. All of this complicates even the most basic analyses, including the identification of lexical items, grammatical classes and grammatical structures. Nevertheless, corpora whose texts have been annotated with lexical and grammatical information can of course be extremely valuable tools for research. The most straightforward problems are the strictly technical issues. The development of new techniques of corpus analysis is often spearheaded by research on contemporary performance data. To extend such techniques to older data requires both circumspection and additional technical know-how. For example, Schneider et al. (2016) describe the application of part-of-speech taggers created for contemporary English to older text material, comparing their performance against results for present-day data. They found that, although (surprisingly) tagging accuracy improved for nineteenth-century texts, it became progressively worse for older material. They further describe ways to improve results, including spelling normalization using VARD (Baron and Rayson 2008), combining different taggers, and modest manual intervention.
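The kind of spelling normalization that a tool like VARD automates can be sketched with a toy lookup pass. The variant table below is invented for demonstration and is not VARD's actual data or interface:

```python
# Toy spelling normalization of the kind a tool like VARD automates.
# The variant table is invented for demonstration; real systems combine
# large variant lists with fuzzy matching and context to choose modern forms.
VARIANTS = {
    "vpon": "upon",
    "haue": "have",
    "moste": "most",
    "worke": "work",
}

def normalize(tokens):
    """Replace known historical variants with their modern spellings;
    unknown tokens pass through unchanged."""
    return [VARIANTS.get(tok.lower(), tok) for tok in tokens]

print(normalize(["vpon", "the", "moste", "parte"]))
```

A real pipeline would preserve capitalization and, ideally, store the normalized forms as an additional annotation layer rather than overwriting the original spelling.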
An excellent discussion of the technical challenges of applying Natural Language Processing to historical texts (including Optical Character Recognition and spelling normalization) is offered by Piotrowski (2012), who draws examples from a variety of languages and historical corpora. Solving technical problems of course pertains to only one side of the issue. More fundamental are matters of linguistic analysis proper, which present all the problems associated with lemmatization, tagging and annotation of synchronic data to a much higher degree (see Chap. 12 for a general introduction to corpus annotation; see again Piotrowski 2012 for discussion of the various techniques, e.g. for part-of-speech tagging and syntactic parsing). For this reason, rich annotation is best seen only as a means to facilitate querying a corpus. Consider, for example, the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), developed by Ann Taylor and collaborators (Taylor et al. 2003; Taylor 2003), who used a combination of automated and manual parsing to create a 1.5-million-word syntactically annotated corpus representing Old English prose.
While exciting and impressive, it is nevertheless good to bear in mind that the syntactic annotation in YCOE is indeed a tool. That is, it is not meant as a final analysis of all the sentences in the corpus but as a means to retrieve with greater ease data that is of potential research interest and that is otherwise nearly impossible to collect. The syntactic annotation scheme is a simplified and in some ways deliberately agnostic version of generative X-bar theory. Optimal precision and recall are certainly not guaranteed (see Chap. 2 on precision and recall). As regards precision, results collected automatically from the corpus should be manually checked to make sure they actually include what the researcher is looking for. As Taylor herself points out, there may be “a strong temptation to skip this relatively time-consuming step”, but it “must be manfully resisted” (2003:200). As regards recall, solutions might be either to go through part of the corpus manually (if the search target is reasonably frequent), or to compare corpus findings to findings reported in earlier work where data has been collected manually (if the target is infrequent) (see D’hoedt 2017, who applies both methods). Finally, any effort at creating richly annotated corpora runs the risk of obscuring existing patterns in the data. It is not obvious, for instance, that Old English authors had a concept of sentences that is exactly comparable to the notion of sentence assumed by the formal theory underlying the parsing in YCOE – Old English punctuation, in any case, suggests otherwise (Fischer et al. 2017:163). In other words, in the end it is again the researcher who should always be wary of any prior assumptions and who should try to find ways to let historical data speak for themselves. At the same time, the ideal practice for corpus compilers is to strive to give end users access to original spelling, punctuation and even text layout. 
As is the case now for many online text archives, users can only benefit from being able to consult high-quality images of the original manuscripts or prints in the corpus.

10.2.3 Issues and Challenges Specific to the Analysis of Diachronic Corpora

From the design properties of corpora and their texts, we move to the actual use of diachronic corpora for research. Methods of corpus interrogation will be affected by how linguistic organization is conceived. This holds a fortiori for the complex interrogation of diachronic corpora. The tenet that lexicon and grammar form an integrated continuum of form-meaning pairing has gained ground to the point of probably forming the majority position in current corpus linguistics. In this section, we will discuss a number of heuristic techniques that can be used to interrogate the syntagmatic-paradigmatic organization across the whole lexicogrammatical continuum, and we will illustrate their potential relevance to diachronic corpus linguistics. It is generally recognized that, at the lexical end of this continuum, major methodological progress has come from Firth’s (1957) insight that the meaning
of a word and the words it frequently co-occurs with mutually influence each other. This co-occurrence manifests itself syntagmatically in the relation between a node and its collocates irrespective of their grammatical classes or relations. The larger paradigmatic organization is formed by the relations between the node and all its collocates (see also Chap. 7). Collocation-based methodology measures the degrees of attraction or repulsion between a lexical node and other individual lexical items, which are revealing of the lexicosemantics of the node, and its semantic prosody (Sinclair 1991). The drawing up of “behavioural profiles” of lexical items (Gries and Divjak 2009) can objectively identify different polysemous senses of one word, and relations of synonymy and antonymy between different words. A good diachronic illustration is Kossmann’s (2007) study of the lexicosemantic variation and change of poor and rich and related adjectives in Old and Middle English on the basis of detailed contextual and collocational analysis of large historical databases. Kossmann (2007:70ff) nuances earlier, less data-driven claims that rich meant ‘powerful’ in Old English and acquired the sense ‘wealthy’ in Middle English. She shows that rich had a polysemous structure from Old English on, which persisted beyond Middle English, all the while reflecting sociocultural changes. She shows that similar culturally evolving polysemies typify the main antonyms of rich, such as Old English þearf (‘needy’) and the Middle English loanword poor. Kung’s (2005) collocational study of the noun melancholy in British historical novels shows that its semantic prosody changed from negative in the pre-romantic period to positive in the romantic period, as reflected in a shift from mainly negative to predominantly positive collocates. Study of changing collocational patterns can also reveal essential dimensions of delexicalization and grammaticalization processes. 
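The attraction or repulsion between a node and a candidate collocate is typically quantified with an association measure such as pointwise mutual information (PMI). A minimal sketch, with invented frequencies rather than counts from any real corpus:

```python
import math

# Toy pointwise mutual information (PMI) for a node-collocate pair.
# All frequencies are invented for illustration; in practice they come
# from counts within a collocation window in the corpus.
N = 1_000_000   # corpus size in tokens
f_node = 500    # frequency of the node word
f_coll = 2_000  # frequency of the candidate collocate
f_pair = 60     # co-occurrences of node and collocate within the window

def pmi(f_pair, f_node, f_coll, N):
    """log2 of observed vs expected co-occurrence probability."""
    return math.log2((f_pair / N) / ((f_node / N) * (f_coll / N)))

print(round(pmi(f_pair, f_node, f_coll, N), 2))  # 5.91: strong attraction
```

A PMI near zero indicates chance-level co-occurrence and a negative value indicates repulsion; in practice researchers often prefer measures such as log-likelihood, which are less inflated by low-frequency collocates.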
In his study of the development of intensifiers, Lorenz (2002) noted that advanced grammaticalization, as manifested by very in English, correlates with the lifting of preferences for specific sets of collocates and semantic prosodies. In her study of the grammaticalization of expressions like heaps/piles of into quantifiers, Brems (2003:291) operationalizes delexicalization in corpus data in terms of “a gradual broadening of collocational scatter or a loosening of the collocational requirements of the MN [measure noun] via such semantico-pragmatic processes as metaphorization, metonymization, analogy, etc.”, which typically precede the actual grammatical re-analysis into a quantifier marked by changing agreement behaviour. With collostructional analysis (Stefanowitsch and Gries 2003), the focus of description moves somewhat more towards the grammatical end of the lexicogrammar: this method studies which lexemes are strongly attracted or repelled by a particular slot in a construction, i.e. occur more frequently or less frequently than expected. The earliest case study of diachronic distinctive collexeme analysis was, according to (Hilpert 2008:41), Kemmer and Hilpert (2005), which reconstructs the shifting preferences for types of lexical collocates attracted to the verbal complement of the English causative construction with make. In the earliest stages, the complement slot attracted mainly verbs referring to mechanical action such as grow. In later stages, emotional and cognitive verbs such as cry were preferred and
finally verbs depicting epistemic states such as seem. This semantic development is interpreted as an instance of progressive subjectification. Further towards the grammatical end is colligational analysis, which is now mostly implemented in Sinclair’s (1991) definition as the relation between a lexical node and grammatical categories. Changes in the co-occurrence of grammatical categories with lexical nodes may reflect grammaticalization. For instance, the development of progressive aspect meanings of be in the middle/midst of is reflected by colligational extension from nouns designating spatial extensions or temporal periods to, first, nominalizations and deverbal nouns, e.g. (1), and, secondly, verbal gerunds, e.g. (2) (Van Rompaey and Davidse 2014). (1) While you were in the middest of your sport [ . . . ] (OED, a1548) (2) [ . . . ] when you are in the middle of loving me. (CLMETEV, 1873) Firth’s (1957) original notion of colligation was more purely grammatical. It is concerned with interrelations between “elements of structure [that] [ . . . ] share a mutual expectancy in an order which is not merely a sequence” (Firth 1957:17). Colligations form the basis of “the statement of meaning at the grammatical level” (Firth 1957:13). Davidse and Van linden (2020) discuss changing colligations (in this sense) in extraposition constructions with matrix ‘it is a/no little/etc. wonder’ and related constructions with ‘there is no/little/etc. doubt’. They redefine these constructions as one macro-construction subsuming two distinct subtypes, the generally recognized instances with predicative matrices as well as ones with existential matrices, on the basis of two colligational relations. Firstly, predicative and existential matrices are distinguished from each other by the different diachronic realization of the position syntactically enclitic with the finite matrix verb. In predicative matrices, this paradigmatic distribution is zero (3), that (4), it (5). 
In existential matrices it is zero, it, there, but never *that, which according to Larsson (2014) is the characteristic distribution of existential clauses in Germanic languages. In view of this distribution and the existential meaning of be (nan) tweo in the whole historical dataset, Davidse and Van linden (2020) reject the predicative analysis of examples with it like (7) in YCOE, where nan tweo is annotated as NP-NOM-PRD.1 (3) Micele mare wundor is þæt he wolde beon mann on þisum life ‘Much greater wonder it is (lit: is) that he wanted to be a human in this life’ (YCOE, 950–1050) (4) þæt is wundor, þæt ðu swa ræðe forhæfdnisse & swa hearde habban wilt. ‘that is wonder, that you want to have fierce and harsh abstinence’ (YCOE, 850–950) (5) Full mycel wundor hit wæs þæt þæt mæden gebær cild. ‘Full great wonder it was that that maiden bore a child’ (YCOE, 1050–1150)

1 Of the 12 matrices in YCOE containing be + NP with tweo only, 10 are tagged as NP-NOM-PRD and 2 as NP-NOM.



(6) Nis ðæs ðonne nan tweo, gif suelc eaðmodnes bið mid oðrum godum ðeawum begyrded, ðæt ðæt bið beforan Godes eagum soð eaðmodness, [...] ‘There is about that then no doubt (lit: not-is no doubt), if such humility is encompassed with other good virtues, that that is true humility before God’s eyes’ (YCOE, 850–950) (7) Forðæm hit is nan tweo þæt ða goodan beoð symle waldende [ . . . ] ‘Therefore there (lit: it) is no doubt that the good ones are always powerful’ (YCOE, 850–950) (8) [ . . . ] þa næs þær nænig tweo, þæt hit nealæhte þara forðfore, þe þær gecigde wæron. ‘there was no doubt then that it drew near to the death of them who were named there’ (YCOE, 1050–1150) Secondly, predicative and existential matrices historically used the same pronouns to refer to the complement clause. In Old English, the most commonly used pronoun was demonstrative that, which occurred as subject (nominative) in predicative matrices like (4) and as adjunct (genitive) in existential matrices like (6). In a further stage, which has persisted into Present-day English, non-salient pronoun it became predominant, functioning as subject (5) in predicative matrices and as complement of a preposition like about (9) in existential matrices. (9) There is no doubt about it that he is in discomfort all the time (WB) The fact that this colligation with its changing realization (that, it) occurs in both predicative and existential matrices suggests that the so-called it-extraposition construction is part of a larger class of evolving complementation constructions, even though, because of the different matrix syntax, reference to the complement is obligatory in predicative and optional in existential matrices. Finally, at what is arguably the most distinctively grammatical end of the lexicogrammar, we find syntactic paradigms based on relations between constructions. 
Relations between constructions have been studied mainly from a variationist perspective, in which examples of variants are annotated in terms of various predictors (language-internal, language-external, information-theoretic), and processed statistically (Gries 2017). Typical language-internal parameters, often referred to as “discourse functional” (Gries 2017:9), are animacy, humanness, definiteness, givenness, etc. A representative diachronic case study is Szmrecsanyi et al.’s (2016) study of genitive variation in Late Modern English, which considers the ‘s-genitive and of-genitive, as well as noun-noun realization of the possessor-possessum relation. The study (2016:1) establishes “an overall drift towards the N-N genitive, which is preferred over other variants, when constituent noun phrases are short, possessor constituents are inanimate, and possessum constituents are thematic”. As pointed out by McGregor (1994:305), syntactic paradigms have been studied less in terms of what they can reveal about meaning and semantic change. Alternations associated with lexical verb senses allow the analyst to interrogate usage data for all the aspects associated with verb-argument semantics. Firstly, verb-specific alternations “can be used effectively to probe for linguistically pertinent aspects of verb meaning” (Levin 1993:1), i.e. to draw up linguistically-based,
rather than intuition-based, classifications of verb senses. Secondly, as argued by Laffut and Davidse (2002), the study of verb-specific alternations in corpus data can also reveal semantic selection restrictions on the arguments. They investigated the lexical sets realizing the arguments of the two types of locative verbs, spray-verbs (e.g. spray, smear, spread) and load-verbs (e.g. load, pack). They found that alternating spray-verbs have locatums designating dispersive entities (e.g. water, butter, herbs, sheet), which because of this semantic feature can be construed either as patient or oblique argument with preposition with. Alternating load-verbs have locations that are containers (e.g. box, suitcase, car boot), which semantic feature likewise motivates their being codable as either patient or oblique with in(to). In a sorting experiment involving the in/on(to)-with-alternation, Perek (2012:628) found evidence that users mentally store “constructional meaning abstracted from the meanings of the variants of the alternation”. We propose that the generalizations stored by users involve precisely such features as the ‘dispersiveness’ shared by spray-verbs and their locatums across alternations, as well as more general semantic features such as (non-)intentionality of the relation between agent and action, etc. Lemmens (1998) exploits the heuristic potential of verb-specific alternations in his diachronic study of abort. On the basis of OED-data, he reconstructs changes in its alternation paradigms, which he correlates with meaning changes of the verb and changing selection restrictions on the arguments. The verb abort came into English from Latin in the sixteenth century as an intransitive verb with the meaning “intr. Of a pregnant woman or animal: to expel an embryo or fetus from the uterus, esp. before it is viable; to suffer a spontaneous abortion or miscarriage” (OED, abort v intr 1a), as in (10).
Abort then developed metaphorical meanings, which could be construed both intransitively (11) and transitively (12). That is, these new meanings enabled the causative-inchoative alternation (Levin 1993:27f), in which the intransitive construes the ‘coming to a premature end’, while the semantic scope of the transitive also includes the cause of the premature end (which is implied by the passive in 12). (10) The pregnant woman which hath tenasmum, for the moste parte aborteth [L. abortit] (OED, 1540) (11) Hee wrote a large Discourse..which he intended to send to her Maiestie..but that death preuented him; and (he dying) that worke aborted with him. (OED, 1620) (12) It [sc. the Parliament] is aborted before it was born. (OED, 1614) In Modern English, abort acquired the meaning of ‘deliberately terminating a pregnancy’, as in (13). In this meaning, abort is a purely transitive verb that does not participate in the causative-inchoative alternation because it does not designate (the causation of) a quasi-spontaneous event. Rather, this meaning of abort expresses the intentional targeting of the action of ‘aborting’ onto the unborn child. (13) I don’t think I would abort a baby. (WB) Alternations that are not dependent on the lexical verb differ from the verb-specific ones in fundamental ways. They are not selectively but generally available to all clauses with internal constituent structure. Examples are subject-finite inversion, anteposition of non-subjects in the clause, etc. With these alternations, each syntagmatic variant is meaningful in its own right. From the perspective of functional frameworks such as Halliday (1994), these variants appear as members of mood paradigms and information structure systems. Formation of moods (e.g. declarative, interrogative) and information variants (realized by linear order and prosody) are by and large not dependent on the verbs used in clauses. A classic corpus study of this type of variation is Breivik’s (1989) reconstruction of the transition from the ‘expletiveless’ existential clauses of early Old English to existentials with it and there in Middle English. This study encapsulates the challenge of identifying and interpreting changing paradigms, in relation to changes both in the coding and positioning of subjects in declaratives and interrogatives, and in the marking of information structure. To sum up, in this section, we have discussed and illustrated a number of heuristic techniques that can be used to interrogate lexicogrammatical patterning in diachronic data from various perspectives: collocational, collostructional, colligational and variational.
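Of these techniques, collostructional analysis lends itself most directly to a code sketch: it asks whether a lemma fills a constructional slot more often than chance predicts, typically via the Fisher-Yates exact test. The sketch below, with invented frequencies, computes the expected frequency and the one-tailed hypergeometric p-value at the core of that test:

```python
from math import comb

# Sketch of collostructional association (invented frequencies, not from
# any real corpus): does lemma L occur in construction C more often than
# chance predicts?
N = 100_000   # total verb tokens in the corpus
n_C = 800     # tokens of construction C
n_L = 1_200   # tokens of lemma L overall
k = 40        # tokens of L occurring in C

expected = n_C * n_L / N   # frequency of L in C expected by chance

def hypergeom_upper_p(k, n_C, n_L, N):
    """One-tailed p-value: probability of observing k or more co-occurrences
    under the hypergeometric null (the core of the Fisher-Yates exact test)."""
    return sum(comb(n_L, i) * comb(N - n_L, n_C - i)
               for i in range(k, min(n_C, n_L) + 1)) / comb(N, n_C)

print(expected)  # 9.6: with 40 observed, lemma L is strongly attracted to C
```

Ranking all lemmas attested in the slot by this p-value (or by a log-transformed version of it) yields the collexeme rankings reported in studies such as Kemmer and Hilpert (2005).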

Representative Study 1

Perek, F., and Hilpert, M. 2017. A distributional semantic approach to the periodization of change in the productivity of constructions. International Journal of Corpus Linguistics 22:490–520.

In their study, Perek and Hilpert (2017) seek to identify patterns inherent in the data, rather than applying pre-existing classifications. They do so in two areas of diachronic study: qualitative semantic change and periodization of change. Their case studies focus on changes in the semantic range of verbs found in the ‘V the hell out of ’ construction and the ‘V one’s way’ construction through the various decades represented in the Corpus of Historical American English (COHA), for which they develop an alternative to the collostructional approach (see Sect. 10.2.3). They argue that the meaning of lexical items can best be revealed by their association with mid- to high-frequency content words, which are semantically specific and co-occur with a wide range of target words in non-random ways, and therefore “yield robust measurements of meaningful lexical associations” (Perek and Hilpert 2017:496). As the two case studies focus on verbs, they first constructed a distributional matrix for all verbs from COHA with a corpus frequency of at least 1000 to guarantee sufficient distributional data to make meaningful comparisons with other verbs. The 2532 verbs extracted were then related in a matrix to the 10,000 most frequent nouns, verbs, adjectives and adverbs in COHA, first recording their raw frequencies of co-occurrence and then converting these into positive
measures of strength of association. This matrix was then transformed and reduced into a matrix where each verb receives 300 numerical values, which constitute the verb’s high-dimensional vector, the values being understood as co-ordinates in a multidimensional space. Such a model is referred to as a vector space model, which allows the authors to precisely quantify semantic similarity between words “by similarity in their semantic vectors, whose correlation (as opposed to sheer closeness) can be quantified by standard measures such as cosine distance” (Perek and Hilpert 2017:500). For the specific case studies at hand, they then built representations of the semantic range of the verbal slot-fillers in each successive period by summing and averaging their vector values, i.e. calculating period vectors. These represent the semantic average of the lexical types, abstracting away from both token and type frequency in that period. Comparison of the period vectors shows whether the semantic range of the verbs in the constructions expanded or contracted in the various periods – to what degree and how slowly or quickly. Perek and Hilpert find that only the ‘V one’s way’ construction has undergone qualitative semantic changes. Initially it was associated mainly with verbs expressing the creation of a material path, e.g. carve, break, rip, fight, but from the 1880s it also accommodated verbs of perception, cognition and communication, e.g. smell, guess, joke, talk, expressing the creation of metaphorical paths. Perek and Hilpert then compare these results with the findings obtained by collostructional analysis, which does not filter out highly frequent and semantically neutral collocates that do not contribute much to the meaning of the construction.
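The period-vector comparison can be sketched as follows. The vectors here are tiny and invented purely for illustration (Perek and Hilpert's real vectors have 300 dimensions derived from COHA co-occurrence counts):

```python
import math

# Sketch of period vectors and cosine similarity. Each verb has a
# distributional vector; a period vector averages the vectors of the verb
# types attested in that period. Vectors are invented, 3-dimensional toys.
verb_vectors = {
    "carve": [0.9, 0.1, 0.0],
    "fight": [0.8, 0.2, 0.1],
    "guess": [0.1, 0.9, 0.3],
    "talk":  [0.0, 0.8, 0.4],
}

def period_vector(verbs):
    """Average the vectors of the verb types attested in a period."""
    dims = len(next(iter(verb_vectors.values())))
    return [sum(verb_vectors[v][d] for v in verbs) / len(verbs)
            for d in range(dims)]

def cosine(u, v):
    """Cosine similarity between two vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

early = period_vector(["carve", "fight"])                  # material-path verbs
late = period_vector(["carve", "fight", "guess", "talk"])  # plus metaphorical ones
print(round(cosine(early, late), 3))
```

A drop in cosine similarity between successive period vectors, as between `early` and `late` here, signals that the semantic range of the slot has shifted or expanded.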
The collexemes that in the early stages of the ‘V one’s way’ construction score highest for being more frequent than expected are take one’s way and find one’s way, which arguably are barely instances of the ‘V one’s way’ construction. Perek and Hilpert then take on the intrinsic periodization of changes, as opposed to relating them to language-external historical landmarks. For this, they use variability-based neighbour clustering (VNC), a variant of an agglomerative hierarchical clustering algorithm which allows only periods that are temporally adjacent to be merged. VNC was proposed as a method for inductive periodization by Gries and Hilpert (2008) in combination with collostructional analysis. Perek and Hilpert (2017) combine VNC with a distributional semantic approach to periodize on the basis of qualitative semantic change. For the ‘V the hell out of’ case study, this yields a radically different periodization than that based on quantitative change. If VNC is combined with frequency-based data, i.e. type frequency, token frequency and hapax legomena, then a sharp divide emerges for this construction between the period from the 1930s to 1970s and the period from the 1980s to 2000s, in which productivity sharply increased according to all the frequency indicators. By contrast, if VNC is combined with distributional


K. Davidse and H. De Smet

semantic representations, then each decade witnesses consistently gradual and relatively minor qualitative change, suggesting that no discrete periods of qualitative change should be distinguished for this construction.
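The defining constraint of VNC is that only temporally adjacent (clusters of) periods may be merged. A minimal sketch of this idea follows; it is a simplified stand-in for the actual algorithm of Gries and Hilpert (2008), using the population standard deviation of a merged cluster as its ‘variability’ and invented per-decade scores.

```python
import statistics

def vnc_merge_order(values):
    """Greedily merge temporally adjacent clusters, always choosing the
    adjacent pair whose merged cluster has the lowest variability
    (here: population standard deviation). Returns the sequence of
    cluster states, from one cluster per period down to a single cluster."""
    clusters = [[v] for v in values]
    history = [[tuple(c) for c in clusters]]
    while len(clusters) > 1:
        scores = [statistics.pstdev(clusters[i] + clusters[i + 1])
                  for i in range(len(clusters) - 1)]
        i = scores.index(min(scores))   # cheapest adjacent merge wins
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        history.append([tuple(c) for c in clusters])
    return history

# Invented per-decade scores with a sharp break between the 4th and 5th decade
history = vnc_merge_order([2, 3, 2, 3, 11, 12, 11])
print(history[-2])  # the two-cluster solution: early vs late periods
```

With a sharp break in the data, the two-cluster solution recovers exactly the divide between the early and late decades, which is the kind of inductive periodization the method aims at.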

Representative Study 2
Buyle, A., and De Smet, H. 2018. Meaning in a changing paradigm: The semantics of you and the pragmatics of thou. Language Sciences 68:42–55.
Buyle and De Smet’s (2018) study is based on a small but richly annotated corpus of seventeenth- and eighteenth-century comedies. The advantage of comedies is that they contain dialogic interactions within complex social settings. While the settings themselves are of course fictitious and often unrealistic, they can nevertheless reveal how the symbolic resources of the language at the time would have been used to respond to and shape social relations. To exploit this property of drama texts, Buyle and De Smet annotated all speaker-hearer dyads in their corpus for three interactional variables, describing whether the speaker has any authority over the hearer or vice versa, whether speaker and hearer are socially intimate or distant, and whether (at any one point) their relationship is conflicted or not. Using this information, Buyle and De Smet reassess the use of the Modern English address pronouns thou and you. While earlier literature suggests that you in particular had become a semantically neutral pronoun by the seventeenth century, the analysis shows that it in fact continued to associate with deferential and formal usage as long as thou was still a systemic option. As for thou, which earlier literature analyzes as a marker of negative speaker emotion, the analysis shows that the association with expressions of anger and contempt is in fact partly an artefact of the data – in that negative emotions are simply more often expressed in intimate relations – and that it is partly a result of thou’s increasing pragmatic salience, following from its dwindling frequency.
This brings the semantics of Modern English thou and you more closely into line with the classical analysis of a pronominal T/V system by Brown and Gilman (1960) and departs from alternative analyses that tended to assign an exceptional status to the English thou/you contrast. For present purposes, Buyle and De Smet’s study shows that there is insight to be gained from small and closely annotated purpose-built corpora, particularly when it comes to some of the more elusive domains of grammatical analysis such as paradigmatic meaning and interactional pragmatics in earlier stages of a language.
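The kind of evidence such dyad annotation yields can be illustrated with a toy cross-tabulation. The counts below are entirely made up, merely to show the shape of the analysis, and are in no way Buyle and De Smet’s figures.

```python
from collections import Counter

# Hypothetical annotated tokens: (pronoun, intimate relation?, conflicted?)
tokens = [("thou", True, True), ("thou", True, False), ("you", False, False),
          ("you", False, False), ("you", True, False), ("thou", True, True)]

# Cross-tabulate pronoun choice against the intimacy variable
table = Counter((pron, "intimate" if intimate else "distant")
                for pron, intimate, _ in tokens)
for (pron, relation), n in sorted(table.items()):
    print(f"{pron:4} {relation:8} {n}")
```

Tables of this kind make it possible to check, for instance, whether an apparent association between thou and anger survives once intimacy is controlled for.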



Representative Corpus 1
Base textuelle FRANTEXT sits between a text archive and a typical reference corpus that strives to represent the history of a language – in this case, French. It contains some of the oldest French texts from 950 up to the present day and, with currently about 300 million words of text, it decidedly qualifies as a large corpus. In its present version, it intends to cater to a great variety of researchers, including literary scholars and historians, as well as linguists. It started life, however, as a corpus for lexicographic research, which explains its great time-depth and wide coverage in terms of genres. Part of the corpus has been part-of-speech tagged. One striking feature is that the corpus comes with various predefined subcorpora, varying in size or in the period that is represented, so as to meet different research needs. Indeed, the corpus has such a flexible online interface that it allows the user to dynamically select a working corpus precisely tailored to their specific objectives. Another distinguishing feature is that it is open to contributions by third parties, who can submit new material for inclusion in the corpus.

Representative Corpus 2
The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) is a 1.5-million-word syntactically annotated corpus representing Old English prose, with texts dating from before 850 up to 1150. YCOE is a member of a bigger family of corpora. With the Helsinki Corpus of English Texts (Kytö 1996), YCOE shares its text categorization scheme, including periodization and text genres, as well as file naming conventions. The linguistic annotation of YCOE, including its part-of-speech tagging and syntactic annotation, is shared with other members of the family, such as the Penn-Helsinki Parsed Corpus of Middle English (Kroch and Taylor 2000) – albeit with some adjustments to deal with the specificities of Old English. Finally, YCOE has drawn on electronic editions of Old English texts that had originally been created by the Toronto Dictionary of Old English Project. As such, the corpus illustrates how corpus compilation benefits from broad collaboration networks, which has the further advantage of having led to a high degree of standardization across a substantial set of historical corpora of English. Remarkably, YCOE’s selection of texts has been primarily guided by syntactic interest, favouring longer texts of running prose. The reason is that such texts provide the richest, most varied and best contextualized evidence of the syntactic patterns of Old English. For discussion of the syntactic annotation of the corpus, see Sect. 10.2.2 above.



Representative Corpus 3
The Old Bailey Corpus (OBC2.0) consists of trial proceedings from London’s Central Criminal Court, the Old Bailey, published between 1720 and 1913. With its 24.4 million words it is a sizeable corpus, but what really sets it apart is the kind of text material it contains, with published transcripts of the spoken interactions in court. It was social historians Tim Hitchcock and Robert Shoemaker who started the process of digitizing the proceedings of the Old Bailey, annotating the texts and making them available online with a dedicated search engine (Hitchcock et al. 2012). These digitized data formed the input to the work done by Magnus Huber and his team, who took the necessary steps to prepare the data for linguistic research. A balanced subset was selected from the material, divided over roughly equally-sized 10-year subperiods. Direct speech passages were automatically identified and annotated. Information about the speakers (age, gender, social status), their role in the trial (defendant, witness, etc.) and the text itself (scribe, printer, publisher) was retrieved from the data and systematically annotated. Finally, the corpus has been part-of-speech tagged. For discussion of how well the corpus captures actual speech, see Sect. above.

10.3 Critical Assessment and Future Directions
In this chapter, we have seen how the increase in the variety and overall size of corpus data allows researchers to open up new horizons in diachronic linguistics. Looking ahead, at least two ongoing trends can be expected to continue. The first trend pertains to corpora themselves. The quantitative turn is strong, and with it the idea that bigger data are better data. Sound generalizations are increasingly expected to be based on large data sets, while at the same time, large data sets are bringing within reach the possibility of studying change in the language system of individual users over their own lifetime. Even so, there are some risks involved that critics of big data will not hesitate to point out. First, using large data sets may make it impossible for analysts to familiarize themselves in any detail with the texts that make up the empirical basis of their research. Second, for research to remain feasible it must be increasingly automated, again increasing the risk that researchers lose touch with their data. Third, a bigger data set is not necessarily more balanced or representative of historical usage. As we have seen, however, strategies are emerging to avoid some of the potential pitfalls. More bottom-up text-based approaches to text classification, for instance, can reduce the need to rely on more aprioristic classifications. Computational techniques may even begin to supply information – such as author identity – that traditional philological work could not definitively determine. Another strategy lies



not in automation but in teamwork. Complex questions, involving large data sets and requiring specialized knowledge of diverse domains such as language history, text traditions, and computational techniques, can be handled if information passes between specialists efficiently. Therefore, it is to be hoped that the future will see corpora that can support and incorporate end user input, as well as researchers making richly annotated data sets available to colleagues. The second trend pertains to the kind of research corpora are used for. As we have seen, particularly strong advances have been made with bottom-up and data-driven approaches to historical patterns and changes in these patterns, that “mak[e] the study of grammar more similar to the study of the lexicon” (Stefanowitsch and Gries 2003:210). At the same time, we have argued that other areas are in need of development, such as the corpus implementation of paradigmatic phenomena like distributions and alternations. Change cannot be studied as affecting an isolated syntagm, which moves through time unconnected to the grammatical systems surrounding it (Fischer 1994). As paradigmatic patterns operate in absentia in relation to the specific syntagms that concordances naturally extract, they have to be reconstructed by querying corpora for the different environments an element occurs in (its distribution), or for its different related patterns (alternations). Qualitative and quantitative description are as complementary here as in other domains of corpus linguistics. There is, in conclusion, still ample room for developing further creative methods to identify and interpret lexicogrammatical change in the empirical detail that only diachronic corpora can provide.
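In practice, reconstructing an alternation means running a separate query for each variant and comparing the counts. A toy sketch of this procedure follows; the four-sentence mini-corpus is invented and the surface regexes are deliberately simplistic, nothing like the part-of-speech-aware queries real studies would use.

```python
import re

corpus = ["she gave him the book", "she gave the book to him",
          "they gave us a chance", "he gave a talk to the committee"]

# Crude surface patterns for the two variants of the English dative alternation
double_object = re.compile(r"\bgave\s+(him|her|us|them|me|you)\s+(a|the)\b")
prepositional = re.compile(r"\bgave\s+(a|the)\b.*\bto\b")

counts = {"V NP NP": sum(bool(double_object.search(s)) for s in corpus),
          "V NP to NP": sum(bool(prepositional.search(s)) for s in corpus)}
print(counts)
```

Each variant requires its own query precisely because the alternation as such never surfaces in any single concordance line; it exists only in the comparison of the two result sets.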

10.4 Tools and Resources
What is useful to diachronic corpus linguists depends obviously on the languages they intend to study. Indeed, there are few resources that are truly language-independent. There are several types of resources, however, that it is good for any diachronic corpus linguist to be on the lookout for. First, there are several websites dedicated to cataloguing or collecting corpora, such as the Linguistic Data Consortium (https://www.ldc.upenn.edu/) (accessed 29 May 2019). A remarkable tool, though specific to diachronic corpora for English, is the Corpus Resource Database, which has a specialized interface to allow visitors to search for the corpora that best meet their needs (http://www.helsinki.fi/varieng/CoRD/) (accessed 29 May 2019). For corpora that have no online search interface of their own, a great variety of concordancing tools exist: see Chap. 8 for more information. Second, anyone wanting (or needing) to compile their own corpora can benefit from digitized texts in online repositories, such as Project Gutenberg (http://www.gutenberg.org/) (accessed 29 May 2019) – even though it is good to be aware that freely available online editions may not always meet scholarly editorial standards. To draw texts from repositories, an existing web crawler may be useful, or one can be programmed in Python or Perl (cf. Chap. 9). The material in online repositories may not yet



have been converted into machine-readable text. To this end, it may be necessary to acquire OCR software. ABBYY FineReader is one such package, whose OCR editor can be equipped with tailor-made dictionaries and can be trained on special character sets (though the work tends to be time-consuming) (https://www.abbyy.com/en-au/) (accessed 29 May 2019). Third, corpus compilers may choose to optimally adapt their corpus to various further research needs. It is best to be aware of the Text Encoding Initiative, which seeks to standardize XML annotation for digital text editions (https://tei-c.org/) (accessed 3 June 2019). A further interesting possibility is to adapt a corpus to use in Sketch Engine, which supports querying and offers facilities for text analysis or text mining applications (https://www.sketchengine.eu/) (accessed 29 May 2019). For many purposes, spelling normalization and lemmatization may be required. For this, language-specific tools are needed, such as (again for English) VARD2 (http://ucrel.lancs.ac.uk/vard/about/) (accessed 29 May 2019). For corpus annotation tools, see Chap. 2.
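As a toy illustration of what spelling normalization involves, the sketch below maps known historical variants to standard forms with a simple lookup table. This is nothing like VARD2’s actual statistical machinery, and the variant dictionary is invented for the example.

```python
import re

# Hypothetical variant-to-standard mapping of the kind a normalizer builds up
NORMALIZATIONS = {"loue": "love", "haue": "have", "ioy": "joy", "selfe": "self"}

def normalize(text):
    """Replace known spelling variants, leaving all other tokens unchanged."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: NORMALIZATIONS.get(m.group(0).lower(), m.group(0)),
                  text)

print(normalize("I haue great ioy"))  # prints "I have great joy"
```

Real normalizers must of course also handle capitalization, ambiguous variants (a historical form may correspond to several modern words) and variants absent from the dictionary, which is why tools like VARD2 combine lookup with statistical disambiguation.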

Further Reading
Jenset, G., and McGillivray, B. 2017. Quantitative historical linguistics. A corpus framework. Oxford: Oxford University Press.
This book reconceptualizes the newest quantitative methods of corpus linguistics for diachronic linguistics. It argues for richer annotation of historical corpora and open, reproducible research.
Meurman-Solin, A., and Tyrkkö, J. 2013. Principles and Practices for the Digital Editing and Annotation of Diachronic Data. Helsinki: VARIENG. http://www.helsinki.fi/varieng/series/volumes/index.html. Accessed 29 May 2019.
This edited volume offers an overview of current principles and practices of digital editing, from a corpus linguistic and philological perspective.
Piotrowski, M. 2012. Natural Language Processing for Historical Texts. San Rafael, CA: Morgan & Claypool. doi:10.2200/S00436ED1V01Y201207HLT017.
This book addresses the challenges posed by historical texts to the application of Natural Language Processing techniques, ranging from digitization to syntactic annotation.



References
Archer, D. (2014). Historical pragmatics: Evidence from the Old Bailey. Transactions of the Philological Society, 112, 259–277.
Baron, A., & Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of the postgraduate conference in corpus linguistics. Birmingham: Aston University.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Blaxter, T. (2015). Gender and language change in Old Norse sentential negatives. Language Variation and Change, 27, 349–375.
Brems, L. (2003). Measure noun constructions: An instance of semantically-driven grammaticalization. International Journal of Corpus Linguistics, 8, 283–312.
Breivik, L. (1989). On the causes of syntactic change in English. In L. Breivik (Ed.), Language change: Contributions to the study of its causes (pp. 29–70). Berlin: Walter de Gruyter.
Brown, R., & Gilman, A. (1960). The pronouns of power and solidarity. In T. A. Sebeok (Ed.), Style in language (pp. 253–276). Cambridge: MIT Press.
Buyle, A., & De Smet, H. (2018). Meaning in a changing paradigm: The semantics of you and the pragmatics of thou. Language Sciences, 68, 42–55.
Culpeper, J., & Kytö, M. (2000). Data in historical pragmatics. Spoken interaction (re)cast as writing. Journal of Historical Pragmatics, 1, 175–199.
D’hoedt, F. (2017). Language change in constructional networks: The development of the English secondary predicate construction. Doctoral dissertation. KU Leuven: Department of Linguistics.
Davidse, K., & Van Linden, A. (2020). Revisiting ‘it-extraposition’: The historical development of constructions with matrices (it)/(there) be + NP followed by a complement clause. In P. Núñez-Pertejo et al. (Eds.), Crossing linguistic boundaries (pp. 81–103). London: Bloomsbury.
De Smet, H., & Vancayzeele, E. (2014). Like a rolling stone: The changing use of English premodifying present participles. English Language and Linguistics, 19, 131–156.
Diller, H.-J., De Smet, H., & Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger, 19, 21–35.
Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. In J. R. Firth (Ed.), Studies in linguistic analysis (pp. 1–32). Oxford: Blackwell.
Fischer, O. (1994). The development of quasi-auxiliaries in English and changes in word order. Neophilologus, 78, 137–162.
Fischer, O., De Smet, H., & van der Wurff, W. (2017). A brief history of English syntax. Cambridge: Cambridge University Press.
Gregory, M. (1967). Aspects of varieties differentiation. Journal of Linguistics, 3, 177–274.
Gries, S. (2017). Syntactic alternation research: Taking stock and some suggestions for the future. Belgian Journal of Linguistics, 31, 8–29.
Gries, S., & Divjak, D. (2009). Behavioral profiles: A corpus-based approach to cognitive semantic analysis. In V. Evans & S. Pourcel (Eds.), New directions in cognitive linguistics (pp. 57–75). Amsterdam: Benjamins.
Gries, S., & Hilpert, M. (2008). The identification of stages in diachronic data: Variability-based neighbor clustering. Corpora, 3, 59–81.
Halliday, M. A. K. (1994). An introduction to functional grammar (2nd ed.). London: Arnold.
Hay, J., & Sudbury, A. (2005). How rhoticity became /r/-sandhi. Language, 81, 799–823.
Hilpert, M. (2008). Germanic future constructions: A usage-based approach to language change. Amsterdam: Benjamins.
Hinrichs, L., Szmrecsanyi, B., & Bohmann, A. (2015). Which-hunting and the Standard English relative clause. Language, 91, 806–836.
Hitchcock, T., Shoemaker, R., Emsley, C., Howard, S., & McLaughlin, J., et al. (2012). The Old Bailey proceedings online, 1674–1913. www.oldbaileyonline.org, version 7.0, 24 March 2012. Accessed 3 June 2019.



Huber, M. (2007). The Old Bailey proceedings, 1674–1834: Evaluating and annotating a corpus of 18th- and 19th-century spoken English. In A. Meurman-Solin & A. Nurmi (Eds.), Studies in variation, contacts and change in English, Vol. 1: Annotating variation and change. Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki. http://www.helsinki.fi/varieng/journal/volumes/01/huber/. Accessed 29 May 2019.
Huber, M., Nissel, M., Maiwald, P., & Widlitzki, B. (2012). The Old Bailey Corpus. Spoken English in the 18th and 19th centuries. www.uni-giessen.de/oldbaileycorpus. Accessed 29 May 2019.
Kemmer, S., & Hilpert, M. (2005). Constructional grammaticalization in the make-causative. Paper presented at ICHL 17, Madison, WI.
Kestemont, M., Stover, J., Koppel, M., Karsdorp, F., & Daelemans, W. (2016). Authenticating the writings of Julius Caesar. Expert Systems with Applications, 63, 86–96.
Kossmann, B. (2007). Rich and poor in the history of English: Corpus-based analyses of lexico-semantic variation and change in Old and Middle English. PhD dissertation. Albert-Ludwigs-Universität Freiburg i. Br.
Kroch, A., & Taylor, A. (2000). The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, release 4. https://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-4/index.html. Accessed 3 June 2019.
Kung, S. (2005). A diachronic study of melancholy in a British novel corpus. Manuscript. University of Birmingham. https://www.birmingham.ac.uk/Documents/college-artslaw/corpus/Intro/Unit52Melancholy.pdf. Accessed 29 May 2019.
Kytö, M. (1996). Manual to the diachronic part of the Helsinki Corpus of English texts. Coding conventions and lists of source texts. Department of English, University of Helsinki.
Laffut, A., & Davidse, K. (2002). English locative constructions: An exercise in Neo-Firthian description and dialogue with other schools. Functions of Language, 9, 169–207.
Larsson, I. (2014). Choice of non-referential subject in existential constructions and with weather verbs. Nordic Atlas of Language Structures, 1, 55–71.
Lemmens, M. (1998). Lexical perspectives on transitivity and ergativity. Causative constructions in English. Amsterdam: Benjamins.
Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago: The University of Chicago Press.
Lorenz, G. (2002). Really worthwhile or not really significant? A corpus-based approach to the delexicalization and grammaticalization of intensifiers in modern English. In I. Wischer & G. Diewald (Eds.), New reflections on grammaticalization (pp. 143–161). Amsterdam: Benjamins.
Love, R., Dembry, A., Hardie, C., Brezina, V., & McEnery, T. (2017). The spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22, 319–344.
Marx, M. (2009). Advanced information access to parliamentary debates. Journal of Digital Information, 10(6). https://journals.tdl.org/jodi/index.php/jodi/article/view/668. Accessed 30 May 2019.
McGregor, W. (1994). Review of P. Hopper and E.C. Traugott (1993), Grammaticalization. Functions of Language, 1, 304–307.
Nevalainen, T., Vartiainen, T., Säily, T., Kesäniemi, J., Dominowska, A., & Öhman, E. (2016). Language change database: A new online resource. ICAME Journal, 40, 77–94.
Perek, F. (2012). Alternation-based generalizations are stored in mental grammar: Evidence from a sorting task experiment. Cognitive Linguistics, 23, 601–635.
Perek, F., & Hilpert, M. (2017). A distributional semantic approach to the periodization of change in the productivity of constructions. International Journal of Corpus Linguistics, 22, 490–520.
Piotrowski, M. (2012). Natural language processing for historical texts. San Rafael: Morgan & Claypool. https://doi.org/10.2200/S00436ED1V01Y201207HLT017.
Raumolin-Brunberg, H., & Nevalainen, T. (2007).
Historical sociolinguistics: The corpus of early English correspondence. In J. Beal, K. Corrigan, & H. Moisl (Eds.), Creating and digitizing language corpora. Volume 2: Diachronic databases (pp. 148–171). Houndmills: Palgrave Macmillan.



Romaine, S. (2000). Language in society: An introduction to sociolinguistics. Oxford: Oxford University Press.
Schneider, G., Hundt, M., & Oppliger, R. (2016). Part-of-speech in historical corpora: Tagger evaluation and ensemble systems on ARCHER. In Proceedings of the Conference on Natural Language Processing (KONVENS) (Vol. 13, pp. 256–264).
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Stefanowitsch, A., & Gries, S. (2003). Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics, 8, 209–243.
Szmrecsanyi, B., Biber, D., Egbert, J., & Franco, K. (2016). Toward more accountability: Modeling ternary genitive variation in Late Modern English. Language Variation and Change, 28, 1–29.
Taylor, A. (2003). The York-Toronto-Helsinki Parsed Corpus of Old English Prose. YCOE Lite: A beginner’s guide. http://www-users.york.ac.uk/~lang22/YCOE/doc/annotation/YcoeLite.htm. Accessed 29 May 2019.
Taylor, A., Warner, A., Pintzuk, S., & Beths, F. (2003). The York-Toronto-Helsinki Parsed Corpus of Old English Prose. Electronic texts and manuals available from the Oxford Text Archive.
Van Rompaey, T., & Davidse, K. (2014). The different developments of progressive aspect markers be in the middle/midst of and be in the process of V-ing. In S. Hancil & E. König (Eds.), Grammaticalization: Theory and data (pp. 181–202). Amsterdam: Benjamins.

Chapter 11

Spoken Corpora

Ulrike Gut

Abstract This chapter provides a detailed introduction to the central aspects of the compilation and use of spoken corpora, which have become increasingly popular in linguistic research. It discusses the challenges associated with collecting raw data and creating annotations for spoken corpora and shows how these are determined by the specific research aims and traditions in the various fields of linguistics. A wide range of tools that can be used to annotate and search spoken corpora are presented and evaluated, and examples of different spoken corpora and their use are given, representing the myriad ways in which the analysis of spoken corpora has contributed to the description of aspects of human language use. With a view to the increasing technological advances that will meet many of the current challenges of constructing and analysing spoken corpora, the chapter discusses future challenges and desired developments with respect to the linguistic use of spoken corpora.

11.1 Introduction
Compared to written corpora, spoken corpora are still few in number and are typically much smaller. This is chiefly due to the greater costs and challenges in terms of technology and time that are connected with the compilation and annotation of spoken corpora. However, interest in spoken corpora has been on the increase in the past two decades (e.g. Kirk and Andersen 2016; Durand et al. 2014; Raso and Mello 2014; Ruhi et al. 2014), based on the growing conviction that with a corpus-based method a wide range of the properties of spoken human language and communication can be analysed in exciting new ways. Some of the earliest spoken corpora were developed for the study of the vocabulary of Australian workers (Schonell et al. 1956) and the acquisition of vocabulary in a first language (Beier et al. 1967). Nowadays, a large variety of

U. Gut
Department of English, University of Münster, Münster, Germany
e-mail: [email protected]
© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_11




corpora containing spoken language exist that have been compiled for a myriad of uses: they often form part of national reference corpora such as the BNC (British National Corpus)1 and the Czech National Corpus (Benešová et al. 2014) and they serve as a data basis for linguistic research, for example in dialectology (see Szmrecsanyi and Wolk 2011 for an overview), conversation analysis (e.g. O’Keeffe and Walsh 2012), the study of the grammar of speech (e.g. Leech 2000), pragmatics (e.g. Schauer and Adolphs 2006), phonetics and phonology (see Delais-Roussarie and Yoo 2014 for an overview), language acquisition (e.g. Gut 2009) and forensic linguistics (e.g. Cotterill 2004). Moreover, spoken corpora have been applied in language pedagogy (e.g. Carter 1998; Gut 2005) and are used for the development of dictionaries and grammars (e.g. The Longman grammar of spoken and written English; Biber et al. 1999), for the development of new technologies and commercial tools in speech engineering (see Gibbon et al. 1997) and translation (cf. Chap. 12) and for the documentation of endangered languages.2 The different research aims and applications that the corpus compilers pursued with the construction of the various spoken corpora have resulted in numerous different types of corpora that differ drastically in the type of raw data they contain, the annotations that were carried out, the data format the corpora have as well as the possibilities corpus users have to search them. There is no doubt that the advent of spoken corpora has opened up new avenues for studying spoken language properties and use that have resulted in some fundamental reshapings of linguistic theories. For example, decisive advances have been made in the corpus-based description of the grammatical features of spoken language (e.g. Biber 1988; Leech 2000), phonological processes such as apicalization in Norwegian (e.g. Kristoffersen and Simonsen 2014), the intonation of spontaneous speech (e.g.
Martin 2014), the use of discourse particles in conversations (e.g. Aijmer 2002) as well as grammatical and phonological aspects of first and second language acquisition (e.g. Dimroth 2008; Rose 2014). Yet, a number of challenges remain, especially concerning the (re-)usability of spoken corpora, the further development of standard practices in spoken corpus annotation, technological requirements for corpus archiving as well as ethical considerations in corpus compilation and dissemination.

11.2 Fundamentals
There are two largely distinct but not exclusive types of corpora that contain spoken language, usually referred to as speech corpora (or speech databases) and spoken corpora respectively. Speech corpora or databases such as the Multi-Language Conversational Telephone Speech 2011 Slavic Group database (Jones

1 http://www.natcorp.ox.ac.uk/. Accessed 22 May 2019.
2 See DOBES project http://dobes.mpi.nl/. Accessed 22 May 2019.



et al. 2016) typically contain large amounts of spoken language, often recorded under experimental conditions, and are used for industrial and technological applications such as assessing automatic speech recognition systems and developing human-machine communication or text-to-speech systems (see e.g. Gibbon et al. 1997). Spoken corpora, by contrast, are compiled for a linguistic purpose such as hypothesis testing in the study of language use and human communication, as well as for linguistic applications such as language teaching and the development of grammars and dictionaries. They typically include the corresponding sound files, but some ‘mute’ corpora that only provide access to the transcriptions also exist (e.g. ICE Philippines, a corpus of written and spoken Philippine English http://icecorpora.net/ice/icephi.htm). This chapter is concerned with typical spoken corpora rather than databases or mute corpora.

11.2.1 Raw Data and Different Types of Spoken Corpora
The types of raw data that can be found in spoken corpora differ in terms of the corpus compilers’ influence and control over the communicative context, ranging from ‘no control at all’ to ‘highly controlled data elicitation methods’. Spoken language that was produced without any involvement of the corpus compiler is often referred to as ‘authentic’ or ‘natural’ language, as it avoids the observer’s paradox, i.e. the fact that the presence of a researcher and the speakers’ awareness that they are being recorded influence the nature of the language produced (Labov 1970). Ideally, then, the corpus raw data was produced in real communicative situations and was recorded on video or audio for other purposes than including it in a linguistic corpus. The corpus compiler’s role is simply to select this already recorded and archived data to be included in the corpus. This type of raw data is typically found in so-called reference corpora that were designed to be representative of a language or language variety. For example, the various components of the International Corpus of English (ICE; Greenbaum and Nelson 1996), which constitute reference corpora for the different varieties of English spoken around the world, include broadcast discussions, broadcast interviews, broadcast talks, cross-examinations and news readings, which were not recorded specifically for the compilation of these corpora. Attempting to avoid the observer’s paradox, the compilers of the London-Lund Corpus of English (Svartvik 1990) surreptitiously recorded conversations, a practice that violates current ethical research standards but is still occasionally in use (e.g. Guirao et al. 2006). Nowadays, when raw data is recorded for a corpus, speakers are typically aware of it. In addition, all speakers should have given their formal consent for the recordings to be included in the corpus (cf. Chap. 1).
Most spoken corpora contain language productions that were purposefully elicited by the corpus compilers. By controlling the situation in which language is produced, corpus compilers increase the probability that the phenomena they are interested in are actually included in the corpus. Rare linguistic phenomena, in particular, might not occur in sufficient numbers in a corpus that contains data


U. Gut

exclusively produced in uncontrolled situations. A wide range of speaking styles with varying degrees of corpus compilers’ influence can be elicited, including unplanned and pre-planned as well as scripted and unscripted speech: In interviews carried out with speakers of different British dialects such as those recorded for the Freiburg English Dialect (FRED) corpus (Anderwald and Wagner 2007), the researchers only control the topic of the conversation. In games and story retellings, specific vocabulary can be elicited: For the HCRC Map Task corpus3 (Anderson et al. 1991), for instance, speakers played a game in which one speaker had to explain the route on a map to another speaker who could not see it on his or her map. The names of the locations on the map thus give the corpus compilers the chance to elicit certain sounds and prosodic patterns. For some corpora such as the LeaP corpus (see Sect. 11.4), speakers are asked to retell a story they have previously read or seen as a film, which again allows corpus compilers to elicit targeted lexical items or possibly even syntactic structures. The degree of control that is exerted over the communicative situation in which the raw data is produced determines the characteristics of the language that is being produced. For example, under highly controlled conditions speakers cannot choose their own words but have to produce language that is visually or aurally presented to them by the researcher. Thus, it is typically monologues rather than dialogues that are recorded in very controlled situations. In contrast, data produced in uncontrolled conditions comprises many types of spontaneously produced, unplanned, preplanned, scripted and unscripted language in monologues and dialogues. The researcher’s control over the raw data production moreover influences the degree of variation that is represented in the corpus. 
While in uncontrolled data social, situational and genre-related variation in language use is typically present, it is increasingly restricted in the different types of elicited language data. As the selection of the corpus data is determined by the intended uses of the corpus, the spoken corpora that are collected in the various linguistic subdisciplines differ sharply in terms of the raw data they contain. Spoken corpora that comprise ‘authentic’ raw data are typically compiled for the study of spoken language morphosyntax, pragmatics, discourse, conversations and sociolinguistics as they contain a breadth of different types of language (registers) and a sufficient amount of language variation. Spoken corpora that were assembled for the study of phonological and phonetic phenomena, on the other hand, tend to contain highly controlled raw data in the form of scripted monologues that ensure the occurrence of sufficient tokens of the features under investigation (they have thus been classified as peripheral corpora, e.g. by Nesselhauf 2004:128). This in turn means that the potential use of spoken corpora very much depends on the type of raw data they contain. Anderwald and Wagner (2007:47), for example, note that in the interviews the FRED corpus contains, speakers mainly talk about the past, which causes an overrepresentation of past tense verb forms and constitutes a drawback for the

3 http://www.hcrc.ed.ac.uk/maptask/. Accessed 22 May 2019.

11 Spoken Corpora


investigation of the present tense. By the same token, a corpus that consists of recordings of text passages and word lists is unsuitable for the study of grammatical phenomena, and a corpus that consists of audio recordings only cannot be used for the study of the interplay of prosody and gestures in human communication (cf. Chap. 16). Many spoken corpora contain several types of raw data, thus combining more and less controlled recording scenarios. This is true for all reference corpora, which aim to constitute representative samples of a language or language variety. Apart from written language, they contain a wide range of types of spoken language to guarantee the representation of variation across registers in this language (see e.g. the BNC). Other spoken corpora that combine raw data types include the IViE corpus (Nolan and Post 2014), which was compiled for the study of intonational variation in the British Isles and which contains read speech, retellings, Map Tasks and free conversations, and the LeaP corpus of non-native German and English, which contains word lists, read speech, retellings and interviews (see Sect. 11.4).

11.2.2 Corpus Annotation

A collection of spoken language recordings does not constitute a linguistic corpus unless linguistic annotations are added to it. The one type of annotation that all spoken corpora share is an orthographic transcription (see also Chap. 14). Whether further annotations are added to the corpus, and of what type, is again largely determined by the intended use of the corpus (see also Sect. 11.3). Some types of annotation such as orthographic, phonemic and prosodic annotations are unique to spoken corpora, while others such as part-of-speech (POS) tagging and parsing have been adapted from written corpora, which in some cases has led to specific challenges as discussed below. The process of corpus annotation is interpretative and often theory-dependent. With each annotation, corpus compilers have to take various decisions that can constrain the future use of their corpus. This is true even for such a seemingly simple annotation as an orthographic transcription.

Orthographic Transcription

All spoken corpora contain orthographic transcriptions. However, they can vary considerably in terms of the orthographic conventions chosen (some languages like English and German have different standard spellings in the different countries in which they are spoken, e.g. British English colour vs. American English color), spelling conventions concerning capitalisation rules (e.g. Ladies and Gentlemen vs. ladies and gentlemen) and hyphenation (ice-cream vs. icecream). Likewise, they differ in the transcription of communicative units such as sounds of hesitation (erm) or affirmative sounds (mhm), for which no agreed spelling standard exists. Moreover, some corpora contain transcriptions for word forms typical of spoken



but not written language, for instance contractions like wanna, while others do not, and corpora can differ in whether and how mispronunciations of words by individual speakers and non-speech events such as laughter and background noises are transcribed. It is important for corpus users to be aware of such potential differences in the orthographic transcriptions as they may cause tokens to be missed in a word-based corpus search: in a search for ‘because’, for instance, transcribed forms such as ‘coz’ and ‘cos’ will not be found. Options to avoid the time-consuming manual orthographic transcription are beginning to materialise: for example, it is possible to train speech-to-text software such as the one included in Google Docs on one’s voice and then record oneself repeating the content of the raw data file. Moreover, YouTube offers a free automatic transcription service for videos. While these systems do not work very reliably yet, especially on recordings with background noise or overlapping speech, they will no doubt improve dramatically over the next few years.
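The pitfall described above can be sidestepped by normalising variant spellings before searching. The following minimal Python sketch (the variant table is invented for illustration, not taken from any real corpus) maps transcribed forms such as ‘coz’ and ‘cos’ to a canonical form so a word-based search finds them:

```python
# Sketch: normalise orthographic variants before a word-based corpus
# search, so that a query for "because" also finds "cos" and "coz".
# The variant table below is illustrative only.

VARIANTS = {
    "cos": "because",
    "coz": "because",
    "colour": "color",  # collapse British/American spelling
}

def normalise(token: str) -> str:
    """Map a transcribed token to its canonical form."""
    return VARIANTS.get(token.lower(), token.lower())

def find_token(transcript: list[str], query: str) -> list[int]:
    """Return the positions of all tokens matching the query after normalisation."""
    query = normalise(query)
    return [i for i, tok in enumerate(transcript) if normalise(tok) == query]

transcript = "I stayed home cos it rained Because of that".split()
print(find_token(transcript, "because"))  # [3, 6]
```

A real search tool would of course need a much larger, corpus-specific variant table, documented in the corpus manual.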

POS-Tagging and Lemmatisation

Some spoken corpora have POS-tagging, which indicates the grammatical category of each transcribed word and which is essential for the quantitative analysis of the grammatical properties of spoken language (see Sect. 11.2.4; Chap. 2). A number of taggers with different tag sets, originally developed for the automatic annotation of written corpora, are in use: these include the freely available CLAWS tagger,4 which was used for tagging the BNC, and the PiTagger system (Panunzi et al. 2004), which was used for the tagging of the Italian component of the C-ORAL corpus. POS taggers tend to be less reliable for spoken data, with error rates of up to 5% (e.g. Moreno-Sandoval and Guirao 2006), because, unlike written data, it contains repetitions, repairs, false starts and hesitations (see Oostdijk 2003). Due to this and to systematic errors in the automatic tagging process itself, manual corrections have been carried out for some spoken corpora such as part of the BNC (Garside 1995). During lemmatisation, the various word forms of an inflectional paradigm are assigned to a common lemma by automatic lemmatisers such as the one used for the VOICE corpus5 or the CGN tagger6 for Dutch. This type of annotation allows corpus users to compile lists and frequencies of different lexical types rather than of individual word forms only.
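The effect of lemmatisation on frequency counts can be illustrated with a toy example. The mini-lexicon below is invented; a real lemmatiser such as the one used for VOICE or the CGN tagger works from a full lexicon and morphological rules:

```python
from collections import Counter

# Toy lemma lexicon (invented): maps inflected word forms to a lemma.
LEXICON = {
    "goes": "go", "went": "go", "gone": "go", "going": "go",
    "houses": "house",
}

def lemmatise(token: str) -> str:
    return LEXICON.get(token.lower(), token.lower())

tokens = "She goes home He went home going past the houses".split()

form_freq = Counter(t.lower() for t in tokens)      # word-form counts
lemma_freq = Counter(lemmatise(t) for t in tokens)  # lemma counts

# The form "went" occurs once, but the lemma GO occurs three times:
print(form_freq["went"], lemma_freq["go"])  # 1 3
```

This is exactly the difference between a frequency list of word forms and one of lexical types that the chapter describes.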

4 http://ucrel.lancs.ac.uk/claws/trial.html. Accessed 22 May 2019.
5 http://www.univie.ac.at/voice/. Accessed 22 May 2019.
6 https://ilk.uvt.nl/cgntagger/. Accessed 22 May 2019.




Parsing

Parsing, i.e. the automatic annotation of the syntactic structure of a language (cf. Chap. 2), is still tremendously challenging for spoken corpora. Due to characteristics of spoken language such as constructions with word order patterns not found in written language and incomplete utterances, parsers that were developed for written language usually yield poorer output when applied to spoken corpora. Yet first advances have been made: the Constraint Grammar parser PALAVRAS was successfully adapted to handle the structures of spoken language and was used for parsing the C-ORAL Brazil corpus with 95% accuracy for syntactic function assignment (Bick 2014).

Phonemic and Phonetic Transcription

Some spoken corpora that were compiled for the study of phonological and phonetic phenomena (so-called ‘phonological corpora’; see Gut and Voormann 2014) further contain phonemic or phonetic transcriptions. In a phonemic transcription, the phonological form of a word is transcribed while a phonetic transcription represents the actual pronunciation by a speaker. For a phonemic transcription, each phoneme is transcribed using either the symbols of the International Phonetic Alphabet7 or the corresponding UTF-8 encoding or, more commonly, its machine-readable equivalents SAMPA (Wells et al. 1992) or CELEX.8 Phonemic transcriptions of spoken corpora can be carried out manually or automatically: manual transcriptions often suffer from inconsistencies across and within transcribers (e.g. Gut and Bayerl 2004) and are very time-consuming, while automatic transcriptions, which can be generated by tools such as WebMAUS (Schiel 2004; see Strik and Cucchiarini 2014 for an overview) typically contain systematic errors that have to be corrected manually. Phonetic annotations can be carried out with different degrees of detail ranging from coarticulatory processes such as labialisation to articulatory details such as tongue position (see e.g. Delais-Roussarie and Post 2014 for an overview). In addition, pronunciation errors can be transcribed by indicating both the phonetic target form and its actual realisation (see Neri et al. 2006 for a methodology of pronunciation error annotation and measuring annotator agreement). Pronunciation errors are often transcribed in spoken corpora compiled for the analysis of first language acquisition (Chap. 14).
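The relationship between IPA symbols and their machine-readable SAMPA equivalents can be sketched as a simple character mapping. The table below covers only a handful of English phonemes (a full converter would handle the entire inventory, including multi-character symbols):

```python
# Minimal sketch: convert an IPA phonemic transcription to its
# machine-readable SAMPA equivalent. Only a few English phoneme
# mappings are shown for illustration.

IPA_TO_SAMPA = {
    "θ": "T", "ð": "D", "ʃ": "S", "ʒ": "Z",
    "ŋ": "N", "æ": "{", "ə": "@", "ɪ": "I",
}

def to_sampa(ipa: str) -> str:
    """Replace each IPA symbol with its SAMPA counterpart;
    symbols that are already ASCII-safe (p, t, k, s, ...) pass through."""
    return "".join(IPA_TO_SAMPA.get(ch, ch) for ch in ipa)

print(to_sampa("θɪŋ"))  # thing -> TIN
```

The point of such ASCII-safe equivalents is precisely that annotation files remain searchable and portable across tools that do not handle IPA glyphs.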

7 https://www.internationalphoneticassociation.org/content/ipa-chart. Accessed 22 May 2019.
8 https://catalog.ldc.upenn.edu/docs/LDC96L14/celex.readme.html. Accessed 22 May 2019.



Prosodic Transcription

Spoken corpora that were compiled in order to study prosodic phenomena in speech can contain various types of prosodic transcriptions. Unlike for phonemic transcription, several transcription conventions exist side by side for prosody: the most commonly used are the transcription system ToBI (Silverman et al. 1992; used, for example, in the LeaP corpus) and the transcription conventions of the British School of intonation analysis (e.g. O’Connor and Arnold 1961; Halliday 1967), which were used, for example, for the Lancaster/IBM corpus (Williams 1996). Attempts to automatically transcribe intonation exist, e.g. the INTSINT system (e.g. Hirst 2005) or Prosogram (Mertens 2004), which remove the micro-prosodic perturbations from the fundamental frequency curve and give as output a smoothed intonation curve that is assumed to be perceptually equivalent. For fine-grained analyses of intonational patterns or for the transcription of ‘challenging’ speech such as non-native or emotional speech, manual transcription of prosody is still more reliable.

Multi-layered and Time-Aligned Annotation

Spoken corpora differ not only in the number and type of annotations they contain but also in the fundamental question of how these annotations are represented. Annotations of existing spoken corpora differ in two ways: annotation layering and time-alignment. The term ‘annotation layer’ refers to the issue of whether different types of annotation are integrated together in one linear transcription or whether they are represented individually on separate layers. Many spoken corpora compiled by researchers working in the field of conversation analysis and interactional linguistics use transcription systems that integrate various linguistic aspects in one linear transcription. The GAT-2 system (Selting et al. 2009), for example, combines an orthographic, a literal and a prosodic transcription with the transcription of other events such as speaker overlap, pauses and non-verbal events. Figure 11.1 shows an example of the beginning of a dialogic conversation that has been transcribed using GAT-2. Symbols such as ‘.’ and ‘;’ represent the intonation, stress is transcribed using capital letters, speaker overlap is indicated by ‘[’ and phonetically reduced forms such as ‘ham’ (line 02, for ‘haben’ have) are used. Most modern spoken corpora, however, use multi-layered annotations, where only one type of linguistic annotation is contained per layer (or tier) (see also Chaps. 3 and 14). This has many advantages, such as the possibility of representing

01   ja:; (.) die `VIERziger genera`tiOn so;=
02   =das_s: `!WA:HN!sinnig viele die sich da ham [ `SCHEI]den lasse[n.=
                                                                    [ ja; ]

Fig. 11.1 Example of a GAT-2 transcription (from Selting et al. 2009:396). [Rough translation: S1: Yes the forties generation so that incredibly many who then filed for divorce S2: yes]



Fig. 11.2 Multi-layered annotation of the LeaP corpus

different overlapping annotations for one speech event (see Fig. 11.2, tiers 3 and 4, for example). Figure 11.2 shows the multi-layered annotation of the LeaP corpus (Gut 2012) that was carried out with Praat. The different types of annotation are each assigned to one tier: tier 1 contains the transcription of pitch levels, on tier 2, the intonation is transcribed, tier 3 contains the transcriptions of all vocalic and consonantal intervals, tier 4 the phonemic transcriptions, tier 5 the orthographic transcription and tier 6 the intonation phrases. In addition, on tiers 7 and 8, the lemmas and POS are marked. The multi-layered annotation of the LeaP corpus in Fig. 11.2 is also time-aligned. The term time-alignment refers to the technical linking of an annotation with the corresponding part of the audio or video file. On tier 5, for example, the beginning and end of each transcribed word is marked by vertical lines, as are the beginning and end of each syllable on tier 4 and the beginning and end of each intonation phrase on tier 6. The TextGrid file that is produced by Praat carries this information as time stamps that give the exact point in time in the sound file for each such annotation. Thus, it is possible to listen directly to each individual annotated word or syllable in the recording. Time-aligned spoken corpora are essential for any research on phonology or phonetics where the exact duration of linguistic units (e.g. vowels, pauses) or their further characteristics (e.g. formants for vowels, pitch height) is important. Most modern tools for spoken corpus compilation support multi-layered time-aligned annotations (see Sect. 11.4 below).
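The time-stamp model just described can be sketched as a minimal data structure. Tier names, times and labels below are invented; a real corpus would read them from a Praat TextGrid:

```python
from dataclasses import dataclass

# Minimal sketch of multi-layered, time-aligned annotation in the spirit
# of a Praat TextGrid: each tier holds labelled intervals with start/end
# time stamps (in seconds) into the sound file.

@dataclass
class Interval:
    start: float
    end: float
    label: str

tiers = {
    "words":    [Interval(0.00, 0.31, "the"), Interval(0.31, 0.90, "forties")],
    "phonemes": [Interval(0.00, 0.12, "D"), Interval(0.12, 0.31, "@"),
                 Interval(0.31, 0.45, "f"), Interval(0.45, 0.62, "O:"),
                 Interval(0.62, 0.90, "tIz")],
}

def overlapping(tiers, tier_name, start, end):
    """All intervals on one tier that overlap a time span, e.g. the
    phoneme annotations realised within a given word."""
    return [iv for iv in tiers[tier_name] if iv.start < end and iv.end > start]

word = tiers["words"][1]  # "forties", 0.31-0.90 s
print([iv.label for iv in overlapping(tiers, "phonemes", word.start, word.end)])
# ['f', 'O:', 'tIz']
```

Because every interval carries its time stamps, playing back exactly the stretch of audio belonging to one annotated word, and measuring durations of vowels or pauses, becomes a trivial lookup.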

11.2.3 Data Format and Metadata

The large number of corpus annotation tools in use means that spoken corpora also differ widely in their data format, ranging from txt files to XML formats. While nearly every tool produces its own proprietary output format, many of them have import/export functions that allow data exchange across tools. However,



the interoperability between tools is still one of the major challenges for spoken corpus use and re-use across linguistic subdisciplines as discussed in Sect. 11.3. One major problem, for example, is that linear annotations cannot be converted to multi-layered ones. The data format of the corpus also restricts the range of tools that can be used for corpus searches (see Sect. 11.2.4). Further heterogeneity across spoken corpora exists in terms of metadata. The term metadata refers to any additional information about the corpus compilers, the data collection (e.g. the procedure and date and place of recordings), the speakers that were recorded (e.g. their age, gender, regional background, further languages) and the data annotation process (e.g. the transcription conventions, tagset used, see Chap. 1). While several international initiatives have attempted to standardise metadata formats (see Broeder and van Uytvanck 2014 for an overview), its divergent use across existing spoken corpora is pronounced. As is true for any corpora, when the metadata is provided in a separate file without using standoff techniques (e.g. IDs pointing from annotations to corresponding speaker metadata), it requires great manual effort to make use of it for automatic corpus searches (see also Chap. 3).
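The standoff technique mentioned above can be sketched in a few lines: annotations carry only a speaker ID, and a separate metadata table resolves it, so that searches can be filtered automatically by speaker attributes. All records here are invented examples:

```python
# Sketch of standoff metadata: utterance records point to speaker
# metadata via IDs rather than embedding it, so automatic searches
# can filter by speaker attributes. All data below is invented.

speakers = {
    "S1": {"age": 67, "gender": "f", "region": "Scotland"},
    "S2": {"age": 23, "gender": "m", "region": "Scotland"},
}

utterances = [
    {"speaker": "S1", "text": "I didnae ken"},
    {"speaker": "S2", "text": "aye fair enough"},
]

def by_speaker_attr(utterances, speakers, attr, predicate):
    """Select utterance texts whose speaker metadata satisfies a predicate."""
    return [u["text"] for u in utterances
            if predicate(speakers[u["speaker"]][attr])]

# e.g. all utterances by speakers older than 50:
print(by_speaker_attr(utterances, speakers, "age", lambda a: a > 50))
```

When metadata instead sits in an unlinked prose document, exactly this kind of automatic filtering requires the "great manual effort" the text refers to.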

11.2.4 Corpus Search

As existing spoken corpora vary greatly in the type and number of annotations as well as their data format, possibilities for corpus search diverge significantly. In general, corpora with rich multi-layered annotations lend themselves to large-scale automated quantitative analyses and statistical exploitation (e.g. Moisl 2014), while others that contain only orthographic transcriptions can be used mainly for manual corpus inspection (see also Sect. 11.3). For KWIC (keyword-in-context) searches that are used to find specific words, phrases or tags in a corpus and that are usually followed by a qualitative analysis of the displayed hits (cf. Chap. 8), some corpus compilers offer specially designed query interfaces on their website: On the Scottish Corpus of Texts and Speech (Anderson and Corbett 2008) website,9 for example, it is possible to search for the occurrence of individual words in all or selected files of the corpus. The hits are displayed in their context and a link to the corresponding audio file is provided. Thus, it is possible to analyse how often and where speakers use Scots grammatical forms such as didnae (didn’t) as shown in Fig. 11.3. Similarly, the Nordic Dialect Corpus (Johannessen et al. 2014) can also be searched on the corpus website.10 Words searched for are displayed as concordances with links to the audio files, and it provides facilities for frequency counts, collocation analyses and statistical measures as well as visualisations of the search results as pie charts or maps.
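The core of such a KWIC search is simple enough to sketch directly. The following Python function shows each hit with n tokens of left and right context; the example sentence is invented:

```python
# Minimal KWIC (keyword-in-context) search of the kind offered by
# corpus query websites: each hit is displayed with n tokens of
# left and right context.

def kwic(tokens, keyword, n=3):
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - n):i])
            right = " ".join(tokens[i + 1:i + 1 + n])
            hits.append(f"{left} [{tok}] {right}")
    return hits

tokens = "she said she didnae want to go but he didnae listen".split()
for line in kwic(tokens, "didnae"):
    print(line)
# she said she [didnae] want to go
# go but he [didnae] listen
```

A corpus interface adds to this core the links from each hit to the time-aligned audio, which is what distinguishes spoken-corpus concordancers from their written-corpus counterparts.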

9 http://www.scottishcorpus.ac.uk/search/. Accessed 22 May 2019.
10 http://www.tekstlab.uio.no/scandiasyn/. Accessed 22 May 2019.



Fig. 11.3 First results of search for didnae in the Scottish Corpus of Texts and Speech

Large-scale quantitative analyses of spoken corpora often require programming skills: in order to run automatic calculations over the annotations of a corpus that has an XML data format, for example, a script needs to be written (see also Chap. 9). Gut and Fuchs (2017), for instance, used Praat scripts in order to calculate fluency (mean number of words per utterance and mean number of phonemes in total articulation time) in ICE Nigeria and ICE Scotland.
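The kind of fluency measure computed with such Praat scripts can be sketched from time-aligned word intervals. The intervals, phoneme count and utterance count below are invented examples, and the measures are simplified stand-ins for those of Gut and Fuchs (2017):

```python
# Hedged sketch of fluency measures over a time-aligned word tier:
# mean words per utterance and phonemes per second of articulation
# time (silent pauses excluded). All interval data is invented.

# (start, end, label) word intervals; "" marks a silent pause
words = [(0.0, 0.4, "well"), (0.4, 0.9, ""), (0.9, 1.3, "I"),
         (1.3, 1.9, "think"), (1.9, 2.5, "so")]
phoneme_count = 11    # would come from the phoneme tier (assumed here)
utterance_count = 2   # pause-delimited utterances (assumed here)

spoken = [(s, e) for s, e, lab in words if lab]
articulation_time = sum(e - s for s, e in spoken)  # pauses excluded
word_count = len(spoken)

mean_words_per_utterance = word_count / utterance_count
phonemes_per_second = phoneme_count / articulation_time

print(round(articulation_time, 2), mean_words_per_utterance)
```

The crucial ingredient is again time-alignment: without interval time stamps, articulation time (and hence any rate-based fluency measure) cannot be computed automatically.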

Representative Study 1 Aijmer, K. 2002. English discourse particles – evidence from a corpus. Amsterdam: John Benjamins. This is an example of an innovative study using the corpus-based method to gain new insights into the linguistic subdiscipline of pragmatics. Aijmer analysed the use of the English discourse particles now, oh/ah, just, sort of, actually, and some phrases such as and that sort of thing in the London-Lund Corpus and compared this to the Lancaster-Oslo/Bergen Corpus of written English and the COLT Corpus. Since these corpora are not pragmatically annotated, she used a classic KWIC search method to find these words in the corpora and carried out subsequent qualitative analyses of the use of these discourse particles and phrases. Her findings show that these discourse particles have multiple functions and differ in their use with respect to the level of formality.



Representative Study 2 Biber, D., and Staples, S. 2014. Variation in the realization of stance adverbials. In Spoken Corpora and Linguistic Studies, eds. Raso, T., and Mello, H., 271–294. Amsterdam/Philadelphia: John Benjamins. This is an example of a study based on a corpus with integrated linear annotations. To investigate the interplay between grammatical stance expressions and prosody, Biber and Staples carried out a KWIC search (using AntConc)11 of a subset of the Hong Kong Corpus of Spoken English for selected stance adverbials and subsequently manually determined their prosodic prominence and their syntactic distribution. Their findings show that the use of stance adverbials varies with both the speaker’s language background and register; less grammaticalised stance adverbials and those in utterance-initial position receive greater prosodic prominence than more grammaticalised ones and those in medial or final position.

Representative Study 3 Kohler, K. 2001. Articulatory dynamics of vowels and consonants in speech communication. Journal of the International Phonetic Association 31:1–16. This study exemplifies the use of a large time-aligned phonetically annotated corpus (Kiel Corpus of Read and Spontaneous Speech) that was automatically searched and statistically evaluated. For this, the sound files were annotated with phonological and phonetic transcriptions using a modified SAMPA and markers for secondary articulation. Subsequently, the annotations were fed into a databank (Kiel data bank; Pätzold 1997), which was searched. Kohler analysed articulatory movements of German speakers in real speech communication compared to read speech and found great variability in their schwa elision as well as nasalization and deletion of plosives, thus showing how corpora can be used to challenge existing theories and models and how they can open up new avenues for research.

11 http://www.laurenceanthony.net/software/antconc/. Accessed 22 May 2019.



Representative Corpus 1 The Spoken Dutch Corpus (Oostdijk 2000, 2002; van Eynde et al. 2000) is a multi-purpose reference corpus for Dutch and, with nearly 9 million words (800 h of recordings), represents the largest spoken corpus currently available. It was compiled between 2000 and 2003 as a resource for linguistic research on the syntactic, lexical and phonological properties of contemporary spoken Dutch. It also functions as a database for technological applications such as the development of speech recognizers and as a tool for language teaching. The raw data comprises private informal speech such as spontaneous face-to-face conversations and telephone dialogues, dialogic and monologic broadcast speech such as broadcast interviews, news readings and discussions, unscripted monologues as in classroom lessons, lectures, speeches and sermons as well as scripted read speech. The corpus contains time-aligned orthographic transcriptions, automatic lemmatization and part-of-speech tagging. For one million words, automatically created and manually verified broad phonemic transcriptions, orthographic transcripts aligned at the word level and semi-automatic syntactic annotation exist. For approximately 250,000 words of the corpus, manual prosodic annotations of prominent syllables, pauses and segmental lengthenings were carried out. The orthographic and phonemic/prosodic transcriptions are available as Praat TextGrid and XML files. POS tags, lemmatization and the syntactically annotated portion of the corpus exist as ASCII and XML files. The corpus can be searched with COREX (CORpus Exploitation), which was developed for it. The corpus is available on 33 DVDs, distributed by the Dutch Language Institute (https://ivdnt.org/downloads/taalmaterialen/tstc-cgn-annotaties), and the documentation is available at http://lands.let.ru.nl/cgn/doc_English/topics/project/pro_info.htm

Representative Corpus 2 The LeaP corpus (Gut 2005, 2012; Milde and Gut 2004), which was compiled between 2001 and 2003, is a multilingual learner corpus of non-native German and English. It was designed for the study of the phonological, lexical and syntactic properties of non-native speech and as a tool for language teaching. With extensive multi-layered and time-aligned manual annotations of intonation phrases and non-speech events, orthographic transcriptions at the word level, phonemic transcriptions using SAMPA at syllable level, transcriptions of vocalic and consonantal intervals, of intonation and pitch range as well as automatic part-of-speech annotation and lemmatization (see Fig. 11.2) but a size of only 73,841 words (12 h of recordings, including



interviews, reading passages, story retellings and word lists), it is a typical example of a small richly annotated spoken corpus. The corpus is available as both Praat TextGrid and XML files with metadata in the IMDI format (https://tla.mpi.nl/imdi-metadata/) and can be searched, for example, with XSLT scripts. The corpus and corpus manual are available for free at https://sourceforge.net/projects/leapcorpus/

Representative Corpus 3 The Michigan Corpus of Academic Spoken English (MICASE; Simpson et al. 2002; Simpson-Vlach and Leicher 2006) represents a specialised corpus, designed as a resource for linguistic research on lexical and syntactic properties of academic English and as a tool for language teaching. It was compiled between 1997 and 2002 and comprises 1.8 million words (>200 h of recordings) of lectures, classroom discussions, laboratory sections, student presentations in seminars, defences, meetings and advising sessions. It has time-aligned orthographic transcriptions that are available in an XML format. A website with a searchable interface (http://quod.lib.umich.edu/m/micase/) allows KWIC searches. A handbook is for sale at http://micase.elicorpora.info/

11.3 Critical Assessment and Future Directions

Section 11.2 has shown that the term ‘spoken corpus’ covers a wide range of fairly heterogeneous corpora of spoken language, whose types of raw data, annotations, data formats and search procedures can differ enormously depending on the intended linguistic use of the corpus as well as on the research traditions of the respective disciplines. With more linguistic subdisciplines adopting a corpus-based approach into their methodological repertoire (e.g. the newly established branch of corpus phonology [Durand et al. 2014] and the recently founded journal Corpus Pragmatics) and the increasing interest in the corpus-based exploration of multimodal aspects of human communication (see Chap. 16), there is a strong need for more spoken corpora to be compiled. In particular, the construction of corpora with video raw data seems especially desirable as only they allow researchers to study all aspects of human communication. However, the compilation of spoken corpora is still very time-consuming and cost-intensive. Beyond the many efforts to automate more types of annotation such as phonemic transcription (see Strik and Cucchiarini 2014 for an overview) and prosodic transcription (e.g. Hirst 2005; Mertens 2004), there is another avenue



to creating new opportunities for corpus-based linguistic studies that should be explored to a much greater extent in the future: the re-use of existing corpora. The re-use of spoken corpora still seems to be much rarer than that of written corpora, which is probably due to four factors: insufficient documentation, lack of standardisation in terms of annotations and data format, lack of standardised corpus search tools and lack of access. The first major obstacle for the reusability of spoken corpora is the often insufficient documentation of the corpus creation process, the type of raw data and metadata in the corpus and the annotation schemes applied. If a potential spoken corpus user interested in the grammatical variation between older and younger speakers cannot find information on the age of the speakers represented in the corpus, and if a researcher interested in the interplay between prosody and syntax in a language cannot interpret the transcription symbols used for prosody, re-use of corpora is impossible. It is therefore essential for corpus compilers to make available to future corpus users ample metadata and a corpus manual detailing the corpus compilation and annotation process. Equally, it would be of great benefit to the research community if the metadata for already existing spoken corpora could still be made available, for example for some of the older ICE corpora. Yet, even well-documented spoken corpora are often not immediately (re-)usable to the wider research community, especially if the intended linguistic use is not the original one of the corpus compilers. Thus, the syntactic or prosodic annotation of a corpus might be based on a different theoretical tradition than the one preferred by the researcher, or one type of annotation that is necessary for the current study might be missing altogether. 
Adding new annotations to a spoken corpus, however, can still constitute a major challenge simply for lack of suitable tools, although great advances have been made in the last decade in terms of the interoperability of the major tools in use for spoken corpus construction: ELAN, Praat, EXMARaLDA and ANVIL (see Sect. 11.4 below) now all have import and export functions for their respective file formats, so that it is possible to add new annotations with one of these tools to a spoken corpus that was compiled with another tool (Oliver Ehmer’s Transformer can also be used for this). Yet, the conversion of linear corpus annotations into multi-layered ones still constitutes an unsolved challenge. Moreover, many older spoken corpora could be opened up for entirely new directions of research if their annotations were time-aligned. Auran et al. (2004) were pioneers in making the Spoken English Corpus reusable for new research purposes: they time-aligned the original orthographic transcriptions with the sound files and in addition supplied an automatically generated phonemic and prosodic transcription. The now renamed Aix-Marsec corpus thus constitutes a new resource for phonological and phonetic research, like the AudioBNC.12 It is hoped that similar efforts will be undertaken in the future for the many spoken corpora that are not yet time-aligned. Apart from re-using and enriching existing spoken corpora, future corpus-based research might also increasingly make use of combining existing corpora for

12 http://www.phon.ox.ac.uk/AudioBNC. Accessed 22 May 2019.


U. Gut

linguistic hypothesis testing, in order to overcome the size limits of individual corpora or to allow diachronic studies of phenomena (see also the SPADE project https://spade.glasgow.ac.uk/ and Wittenburg et al. 2014). For example, for a study of variation and change in the use of intensifiers such as awfully and massive in spoken language, it would be worthwhile to search the BNC and the BNC2014 together, which represent British English of the 1990s and current use respectively.

The third impediment to the (re-)use of some spoken corpora is the lack of suitable automated corpus search tools. Corpora that can only be searched with specialised tools might prove inaccessible to some linguists, as discussed in Sect. 11.2.4. Some international initiatives such as CLARIN in Europe aim to overcome this challenge by providing an infrastructure for corpus users that includes easy-to-use standardised tools. Many more similar efforts are necessary to make more spoken corpora accessible to researchers from all linguistic subdisciplines.

The last challenge for the future of spoken corpora is their continued availability and accessibility. While an increasing number of corpus compilers are eager to make their spoken corpora available to the research community, technological and ethical difficulties have to be overcome, as discussed below. For corpus data stored in non-digital form such as analogue tapes (a lot of historical data has still not been digitised), every access means a loss of quality. Moreover, many older data formats will no longer be accessible in the near future. The archiving and dissemination of spoken corpora, even in digital form, thus implies the constant pressure of keeping up with technological advances. For instance, raw video data encoded in one format such as MPEG-1 will have to be regularly updated to new encoding schemes.
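The intensifier example above boils down to comparing normalised frequencies across two corpora. Below is a toy sketch of such a comparison, assuming the corpora are available as plain-text transcriptions; the word list and the two miniature "corpora" are invented stand-ins, and real queries over the BNC or BNC2014 would of course go through their own search interfaces:

```python
import re
from collections import Counter

INTENSIFIERS = {"awfully", "terribly", "really", "very", "massively"}

def intensifier_counts(text):
    """Tokenise crudely and count intensifier hits plus the total token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter(t for t in tokens if t in INTENSIFIERS)
    return hits, len(tokens)

def per_million(hits, total):
    """Normalise raw counts to frequencies per million words."""
    return {w: c / total * 1_000_000 for w, c in hits.items()}

corpus_1990s = "it was awfully cold and really very wet"      # stand-in for 1990s data
corpus_2010s = "it was really massively cold and really wet"  # stand-in for current data

for label, text in [("1990s", corpus_1990s), ("2010s", corpus_2010s)]:
    hits, total = intensifier_counts(text)
    print(label, per_million(hits, total))
```

Normalising per million words is what makes the two samples comparable despite their different sizes.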
Furthermore, new strategies for corpus dissemination are being proposed to ensure long-term access to spoken corpora. Some corpora that were made available and searchable via websites have been 'lost' due to lack of maintenance. As an alternative, corpora can be stored in external data services such as clouds, with corpus access provided, for example, by compressed media streaming. Yet another option is the storage of spoken corpora at large data centres such as the MPI Archive (Wittenburg et al. 2014), which cover the costs of data archiving and dissemination. In any case, mirroring spoken corpora at different sites seems to be a good option for their long-term preservation.

Further challenges for research based on spoken corpora are ethical issues such as privacy rights and copyright. Typically, privacy laws require corpus compilers to obtain formal written authorisation from each speaker to be recorded for the corpus and to allow the transcription, sharing and re-use of his or her data. Corpora can only be used and shared once the compilers have received the recorded speakers' written consent to the use of their data for research and to their dissemination. Moreover, privacy rights require corpus compilers to anonymise the disseminated data (cf. Chap. 1): while this is easily achieved in the transcriptions, where references to people and places can be removed, complete anonymisation of the audio files, i.e. changing the voice quality, would run counter to, and indeed preclude, many research purposes of the corpus. Legislation on copyright and privacy issues changes often and can differ widely across nations. In some national laws the speaker can withdraw his or her consent at any later point

11 Spoken Corpora


in time, which poses serious challenges for corpus dissemination. The European General Data Protection Regulation, which became enforceable in May 2018, for example, states that personal data may only be processed when the data subjects have given their consent for specific purposes. Changing legislation in these areas might pose further difficulties for corpus-based research in the future. In conclusion, both the compilation of new spoken corpora and the reuse of older ones remain exciting and challenging tasks for the future. I have no doubts, however, that they will help to provide many more important insights into human language use.
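For the transcription side of the anonymisation problem discussed above, a simple placeholder-substitution pass is often all that is needed. The sketch below assumes a manually compiled mapping from names to placeholder labels; it is illustrative only and no substitute for a careful, corpus-specific anonymisation protocol (audio anonymisation, as noted, is a much harder problem):

```python
import re

def anonymise(transcript, mapping):
    """Replace each known personal or place name with a bracketed placeholder.

    `mapping` goes from surface form to label, e.g. {"Mary": "NAME-1"}.
    Longer names are replaced first so that substrings are not clobbered.
    """
    for surface, label in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        transcript = re.sub(rf"\b{re.escape(surface)}\b", f"[{label}]", transcript)
    return transcript

line = "Mary said she met John in Manchester last week"
mapping = {"Mary": "NAME-1", "John": "NAME-2", "Manchester": "PLACE-1"}
print(anonymise(line, mapping))
# prints: [NAME-1] said she met [NAME-2] in [PLACE-1] last week
```

Keeping distinct labels (NAME-1, NAME-2, ...) rather than a single generic placeholder preserves referent identity across the transcript, which many analyses depend on.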

11.4 Tools and Resources

Tools for multi-layered time-aligned annotation and search of spoken corpora:

CLAN http://dali.talkbank.org/clan/ (accessed 22 May 2019): allows complex queries and provides built-in metrics of syntactic complexity (e.g. mean length of utterance); very popular in language acquisition research

FOLKER http://agd.ids-mannheim.de/folker.shtml (accessed 22 May 2019): tool for the time-aligned transcription of spoken corpora using the GAT-2 transcription system; popular in conversation analysis

Pacx http://pacx.sourceforge.net/ (accessed 22 May 2019) (Gut 2011): platform for the multi-layered time-aligned annotation and search of spoken corpora in XML, based on Eclipse and using ELAN for the multi-layered annotation of videos

ANVIL http://www.anvil-software.org/ (accessed 22 May 2019) (Kipp 2014): popular in gesture research

ELAN https://tla.mpi.nl/tools/tla-tools/elan/ (accessed 22 May 2019) (Sloetjes 2014): popular in language documentation

EXMARaLDA http://exmaralda.org/de/ (accessed 22 May 2019) (Schmidt & Wörner 2014): popular in conversation and discourse analysis

EMU http://emu.sourceforge.net/ (accessed 22 May 2019) (John and Bombien 2014): especially suited to phonetic research; tools for the creation, manipulation and analysis of speech databases; includes an R interface

Praat http://praat.org (accessed 22 May 2019) (Boersma 2014): provides an environment for running perception experiments and speech synthesis; offers a scripting facility for automatic acoustic analyses (see Brinckmann 2014 for a detailed introduction)

Phon https://www.phon.ca/phon-manual/misc/Welcome.html (accessed 22 May 2019) (Rose and MacWhinney 2014): developed for child language corpora; allows the specification of target pronunciation and actual realisation; in-built query system; popular in first language acquisition research



Further Reading

Durand, J., Gut, U., and Kristoffersen, G. (eds.) 2014. The Oxford Handbook of Corpus Phonology. Oxford: Oxford University Press.
The most comprehensive volume on spoken corpora to date; it covers all aspects of the construction, use and archiving of spoken corpora, with a focus on phonological corpora. It contains chapters on innovative approaches to phonological corpus compilation, corpus annotation, corpus searching and archiving, and exemplifies the use of phonological corpora in various linguistic fields ranging from phonology to dialectology and language acquisition. Furthermore, it contains descriptions of existing phonological corpora and presents a wide range of popular tools for spoken corpus compilation, annotation, searching and archiving.

Raso, T., and Mello, H. (eds.) 2014. Spoken Corpora and Linguistic Studies. Amsterdam/Philadelphia: John Benjamins.
This volume contains a comprehensive collection of chapters discussing cutting-edge issues in spoken corpus compilation and annotation, and presents examples of research based on spoken corpora. The individual articles show how the exploitation of richly annotated and text-to-tone aligned spoken corpora can yield new insights into the syntax of speech and the use of prosody in human interactions.

Ruhi, Ş., Haugh, M., Schmidt, T., and Wörner, K. (eds.) 2014. Best Practices for Spoken Corpora in Linguistic Research. Newcastle upon Tyne: Cambridge Scholars Publishing.
This collection of papers focuses on questions of standards for the construction, annotation, searching, archiving and sharing of spoken corpora used in conversation analysis, sociolinguistics, discourse analysis and pragmatics. The individual contributions discuss these issues and illustrate current practices in corpus design, data collection and annotation, as well as strategies for corpus dissemination and for increasing the interoperability between tools.

References

Aijmer, K. (2002). English discourse particles – Evidence from a corpus. Amsterdam: John Benjamins.
Anderson, A., Bader, M., Bard, E., Boyle, E., Doherty, G. M., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H. S., & Weinert, R. (1991). The HCRC map task corpus. Language & Speech, 34, 351–366.
Anderson, W., & Corbett, J. (2008). The Scottish corpus of texts and speech – A user's guide. Scottish Language, 27, 19–41.
Anderwald, L., & Wagner, S. (2007). FRED – The Freiburg English dialect corpus. In J. Beal, K. Corrigan, & H. Moisl (Eds.), Creating and digitizing language corpora. Volume 1: Synchronic corpora (pp. 35–53). London: Palgrave Macmillan.



Auran, C., Bouzon, C., & Hirst, D. (2004). The Aix-Marsec project: An evaluative database of spoken British English. In Proceedings of speech prosody 2004. Nara.
Beier, E., Starkweather, J., & Miller, D. (1967). Analysis of word frequencies in spoken language of children. Language and Speech, 10, 217–227.
Benešová, L., Waclawičová, M., & Křen, M. (2014). Building a data repository of spontaneous spoken Czech. In Ş. Ruhi, M. Haugh, T. Schmidt, & K. Wörner (Eds.), Best practices for spoken corpora in linguistic research (pp. 128–141). Newcastle upon Tyne: Cambridge Scholars Publishing.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The Longman grammar of spoken and written English. London: Longman.
Biber, D., & Staples, S. (2014). Variation in the realization of stance adverbials. In T. Raso & H. Mello (Eds.), Spoken corpora and linguistic studies (pp. 271–294). Amsterdam/Philadelphia: John Benjamins.
Bick, E. (2014). The grammatical annotation of speech corpora. In T. Raso & H. Mello (Eds.), Spoken corpora and linguistic studies (pp. 105–128). Amsterdam/Philadelphia: John Benjamins.
Boersma, P. (2014). The use of Praat in corpus research. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 342–360). Oxford: Oxford University Press.
Brinckmann, C. (2014). Praat scripting. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 361–379). Oxford: Oxford University Press.
Broeder, D., & van Uytvanck, D. (2014). Metadata formats. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 150–165). Oxford: Oxford University Press.
Carter, R. (1998). Orders of reality: CANCODE, communication and culture. English Language Teaching Journal, 52, 43–56.
Cotterill, J. (2004). Collocation, connotation and courtroom semantics: Lawyer's control of witness testimony through lexical negotiation. Applied Linguistics, 25(4), 513–537.
Delais-Roussarie, E., & Yoo, H. (2014). Corpus research in phonetics and phonology: Methodological and formal considerations. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 193–213). Oxford: Oxford University Press.
Delais-Roussarie, E., & Post, B. (2014). Corpus annotation: Methodology and transcription systems. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 46–88). Oxford: Oxford University Press.
Dimroth, C. (2008). Age effects on the process of L2 acquisition? Evidence from the acquisition of negation and finiteness in L2 German. Language Learning, 58, 117–150.
Durand, J., Gut, U., & Kristoffersen, G. (Eds.). (2014). The Oxford handbook of corpus phonology. Oxford: Oxford University Press.
Garside, R. (1995). Grammatical tagging of the spoken part of the British National Corpus: A progress report. In G. Leech, G. Myers, & J. Thomas (Eds.), Spoken English on computer: Transcription, mark-up and application (pp. 161–167). London: Longman.
Gibbon, D., Moore, R., & Winski, R. (1997). Handbook of standards and resources for spoken language systems. Berlin: Mouton de Gruyter.
Greenbaum, S., & Nelson, G. (1996). The International Corpus of English (ICE) project. World Englishes, 15(1), 3–15.
Guirao, J., Moreno-Sandoval, A., González Ledesma, A., de la Madrid, G., & Alcántara, M. (2006). Relating linguistic units to socio-contextual information in a spontaneous speech corpus of Spanish. In A. Wilson, D. Archer, & P. Rayson (Eds.), Corpus linguistics around the world (pp. 101–113). Amsterdam/New York: Rodopi.
Gut, U. (2005). Corpus-based pronunciation training. In Proceedings of phonetics teaching and learning conference, London.
Gut, U. (2009). Non-native speech. A corpus-based analysis of the phonetic and phonological properties of L2 English and L2 German. Frankfurt: Peter Lang.



Gut, U. (2011). Language documentation and archiving with Pacx, an XML-based tool for corpus creation and management. In N. David (Ed.), Workshop on language documentation and archiving (pp. 21–25). London.
Gut, U. (2012). The LeaP corpus. A multilingual corpus of spoken learner German and learner English. In T. Schmidt & K. Wörner (Eds.), Multilingual corpora and multilingual corpus analysis (pp. 3–23). Amsterdam: John Benjamins.
Gut, U., & Bayerl, P. (2004). Measuring the reliability of manual annotations of speech corpora. Proceedings of Speech Prosody 2004, 565–568.
Gut, U., & Voormann, H. (2014). Corpus design. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 13–26). Oxford: Oxford University Press.
Gut, U., & Fuchs, R. (2017). Exploring speaker fluency with phonologically annotated ICE corpora. World Englishes. https://doi.org/10.1111/weng.12278.
Halliday, M. (1967). Intonation and grammar in British English. The Hague: Mouton.
Hirst, D. J. (2005). Form and function in the representation of speech prosody. Speech Communication, 46, 334–347.
Johannessen, J., Vangsnes, O., Priestley, J., & Hagen, K. (2014). A multilingual speech corpus of North-Germanic languages. In T. Raso & H. Mello (Eds.), Spoken corpora and linguistic studies (pp. 69–83). Amsterdam/Philadelphia: John Benjamins.
John, T., & Bombien, L. (2014). EMU. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 321–341). Oxford: Oxford University Press.
Jones, K., Graff, D., Walker, K., & Strassel, S. (2016). Multi-language conversational telephone speech 2011 – Slavic group LDC2016S11. Web download. Philadelphia: Linguistic Data Consortium. https://catalog.ldc.upenn.edu/LDC2016S11. Accessed 22 May 2019.
Kipp, M. (2014). ANVIL: The video annotation research tool. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 420–436). Oxford: Oxford University Press.
Kirk, J., & Andersen, G. (2016). Compilation, transcription, markup and annotation of spoken corpora. Special issue of International Journal of Corpus Linguistics, 21(3).
Kohler, K. (2001). Articulatory dynamics of vowels and consonants in speech communication. Journal of the International Phonetic Association, 31, 1–16.
Kristoffersen, G., & Simonsen, H. (2014). A corpus-based study of apicalization of /s/ before /l/ in Oslo Norwegian. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 214–239). Oxford: Oxford University Press.
Labov, W. (1970). The study of language in its social context. Studium Generale, 23, 66–84.
Leech, G. (2000). Grammars of spoken English: New outcomes of corpus-oriented research. Language Learning, 50(4), 675–724.
Martin, P. (2014). Speech and corpora: How spontaneous speech analysis changed our point of view on some linguistic facts. The case of sentence intonation in French. In T. Raso & H. Mello (Eds.), Spoken corpora and linguistic studies (pp. 191–209). Amsterdam/Philadelphia: John Benjamins.
Mertens, P. (2004). The prosogram: Semi-automatic transcription of prosody based on a tonal perception model. In Proceedings of speech prosody 2004 (pp. 549–552). Nara, Japan.
Milde, J.-T., & Gut, U. (2004). TASX – eine XML-basierte Umgebung für die Erstellung und Auswertung sprachlicher Korpora. In A. Mehler & H. Lobin (Eds.), Automatische Textanalyse: Systeme und Methoden zur Annotation und Analyse natürlichsprachlicher Texte (pp. 249–264). Wiesbaden: Verlag für Sozialwissenschaften.
Moisl, H. (2014). Statistical corpus exploitation. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 110–132). Oxford: Oxford University Press.
Moreno-Sandoval, A., & Guirao, M. (2006). Morpho-syntactic tagging of the Spanish C-ORAL-ROM corpus: Methodology, tools and evaluation. In Y. Kawaguchi, S. Zaima, & T. Takagaki (Eds.), Spoken language corpus and linguistic informatics (pp. 199–218). Amsterdam: John Benjamins.



Neri, A., Cucchiarini, C., & Strik, H. (2006). Selecting segmental errors in non-native Dutch for optimal pronunciation training. International Review of Applied Linguistics in Language Teaching, 44, 354–404.
Nesselhauf, N. (2004). Learner corpora and their potential in language teaching. In J. Sinclair (Ed.), How to use corpora in language teaching (pp. 125–152). Amsterdam: Benjamins.
Nolan, F., & Post, B. (2014). The IViE corpus. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 475–485). Oxford: Oxford University Press.
O'Connor, J. D., & Arnold, G. (1961). Intonation of colloquial English. London: Longman.
O'Keeffe, A., & Walsh, S. (2012). Applying corpus linguistics and conversation analysis in the investigation of small group teaching in higher education. Corpus Linguistics and Linguistic Theory, 8(1), 159–181.
Oostdijk, N. (2000). The spoken Dutch corpus. Overview and first evaluation. In M. Gavrilidou, G. Carayannis, S. Markantonatou, S. Piperidis, & G. Stainhaouer (Eds.), Proceedings of the second international conference on Language Resources and Evaluation (LREC) (Vol. 2, pp. 887–894).
Oostdijk, N. (2002). The design of the spoken Dutch corpus. In P. Peters, P. Collins, & A. Smith (Eds.), New frontiers of corpus research (pp. 105–112). Amsterdam: Rodopi.
Oostdijk, N. (2003). Normalization and disfluencies in spoken language data. In S. Granger & S. Petch-Tyson (Eds.), Extending the scope of corpus-based research. New applications, new challenges (pp. 59–70). Amsterdam: Rodopi.
Pätzold, M. (1997). KielDat – Data bank utilities for the Kiel Corpus. Arbeitsberichte des Instituts für Phonetik der Universität Kiel, 32, 117–126.
Panunzi, A., Picchi, E., & Moneglia, M. (2004). Using PiTagger for lemmatization and PoS tagging of a spontaneous speech corpus: C-Oral-Rom Italian. In M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, & R. Silva (Eds.), Proceedings of the 4th LREC conference (pp. 563–566).
Raso, T., & Mello, H. (Eds.). (2014). Spoken corpora and linguistic studies. Amsterdam/Philadelphia: John Benjamins.
Rose, Y. (2014). Corpus-based investigations of child phonological development. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 265–285). Oxford: Oxford University Press.
Rose, Y., & MacWhinney, B. (2014). The PhonBank project. Data and software-assisted methods for the study of phonology and phonological development. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 380–401). Oxford: Oxford University Press.
Ruhi, Ş., Haugh, M., Schmidt, T., & Wörner, K. (Eds.). (2014). Best practices for spoken corpora in linguistic research. Newcastle upon Tyne: Cambridge Scholars Publishing.
Schauer, G., & Adolphs, S. (2006). Expressions of gratitude in corpus and DCT data: Vocabulary, formulaic sequences, and pedagogy. System, 34, 119–134.
Schiel, F. (2004). MAUS goes iterative. In Proceedings of the IV. international conference on language resources and evaluation (pp. 1015–1018). University of Lisbon.
Schmidt, T., & Wörner, K. (2014). EXMARaLDA. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 402–419). Oxford: Oxford University Press.
Schonell, F., Meddleton, I., Shaw, B., Routh, M., Popham, D., Gill, G., Mackrell, G., & Stephens, C. (1956). A study of the oral vocabulary of adults. Brisbane/London: University of Queensland Press/University of London Press.
Selting, M., Auer, P., Barth-Weingarten, D., Bergmann, J., Bergmann, P., Birkner, K., Couper-Kuhlen, E., Deppermann, A., Gilles, P., Günthner, S., Hartung, M., Kern, F., Mertzlufft, C., Meyer, C., Morek, M., Oberzaucher, F., Peters, J., Quasthoff, U., Schütte, W., Stukenbrock, A., & Uhmann, S. (2009). Gesprächsanalytisches Transkriptionssystem 2 (GAT 2). Gesprächsforschung, 10, 353–402.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Pierrehumbert, J., & Hirschberg, J. (1992). ToBI: A standard for labeling English prosody. In Proceedings of the second international conference on spoken language processing (Vol. 2, pp. 867–870). Banff, Canada.



Simpson, R. C., Briggs, S. L., Ovens, J., & Swales, J. M. (2002). The Michigan corpus of academic spoken English. Ann Arbor: The Regents of the University of Michigan.
Simpson-Vlach, R., & Leicher, S. (2006). The MICASE handbook: A resource for users of the Michigan corpus of academic spoken English. University of Michigan Press/ELT.
Sloetjes, H. (2014). ELAN: Multimedia annotation application. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 305–320). Oxford: Oxford University Press.
Strik, H., & Cucchiarini, C. (2014). On automatic phonological transcription of speech corpora. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 89–109). Oxford: Oxford University Press.
Svartvik, J. (Ed.). (1990). The London corpus of spoken English: Description and research (Lund studies in English, Vol. 82). Lund: Lund University Press.
Szmrecsanyi, B., & Wolk, C. (2011). Holistic corpus-based dialectology. Revista Brasileira de Linguística Aplicada, 11(2). https://doi.org/10.1590/S1984-63982011000200011. Accessed 22 May 2019.
Van Eynde, F., Zavrel, J., & Daelemans, W. (2000). Lemmatisation and morphosyntactic annotation for the spoken Dutch corpus. In P. Monachesi (Ed.), Computational linguistics in the Netherlands 1999. Selected papers from the tenth CLIN meeting (pp. 53–62). Utrecht Institute of Linguistics OTS.
Wells, J., Barry, W., Grice, M., Fourcin, A., & Gibbon, D. (1992). Standard computer-compatible transcription. Technical report No. SAM Stage Report Sen.3 SAM UCL-037.
Williams, B. (1996). The formulation of an intonation transcription system for British English. In A. Wichmann, P. Alderson, & G. Knowles (Eds.), Working with speech (pp. 38–57). London: Longmans.
Wittenburg, P., Trilsbeek, P., & Wittenburg, F. (2014). Corpus archiving and dissemination. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 133–149). Oxford: Oxford University Press.

Chapter 12

Parallel Corpora

Marie-Aude Lefer

Abstract This chapter gives an overview of parallel corpora, i.e. corpora containing source texts in a given language, aligned with their translations in another language. More specifically, it focuses on directional corpora, i.e. parallel corpora where the source and target languages are clearly identified. These types of corpora are widely used in contrastive linguistics and translation studies. The chapter first outlines the key features of parallel corpora (they typically contain written texts translated by expert translators working into their native language) and describes the main methods of parallel corpus analysis, including the combined use of parallel and comparable corpora. It then examines the major challenges that are linked with the design and analysis of parallel corpora, such as text availability, metadata collection, bitext alignment, and multilingual linguistic annotation, on the one hand, and data scarcity, interpretation of the results and infelicitous translations, on the other. Finally, the chapter shows how these challenges can be overcome, most notably by compiling balanced, richly-documented parallel corpora and by cross-fertilizing insights from cross-linguistic research and natural language processing.

12.1 Introduction

This chapter gives an overview of parallel corpora, which are widely used in corpus-based cross-linguistic research (here understood as an umbrella term for contrastive linguistics and translation studies) and natural language processing. Parallel corpora (also called translation corpora) contain source texts in a given language (the source language, henceforth SL), aligned with their translations in another language (the target language, henceforth TL). It is important to point out from the outset that the term parallel corpus is to some extent ambiguous, because it is sometimes used to refer to comparable original texts in two or more languages, especially texts that

M.-A. Lefer
Université catholique de Louvain, Centre for English Corpus Linguistics, Louvain-la-Neuve, Belgium
e-mail: [email protected]
© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_12




belong to comparable genres or text types and deal with similar topics (e.g. Italian and German newspaper articles about migration or English and Portuguese medical research articles). Here, the term will only be used to refer to collections of source texts and their translations. The compilation of parallel corpora started in the 1990s. Progress has been rather slow, compared with monolingual corpus collection initiatives, but in recent years we have witnessed a boom in the collection of parallel corpora, which are increasingly larger and multilingual. Parallel corpora are highly valuable resources to investigate cross-linguistic contrasts (differences between linguistic systems) and translation-related phenomena, such as translation properties (features of translated language). They can also be used for a wide range of applications, such as bilingual lexicography, foreign language teaching, translator training, terminology extraction, computer-aided translation, machine translation and other natural language processing tasks (e.g. word sense disambiguation and cross-lingual information retrieval). This chapter is mainly concerned with the design and analysis of parallel corpora in the two fields of corpus-based contrastive linguistics and corpus-based translation studies. Contrastive linguistics (or contrastive analysis) is a linguistic discipline that is concerned with the systematic comparison of two or more languages, so as to describe their similarities and differences. Corpus-based contrastive linguistics was first pioneered by Stig Johansson in the 1990s and has been thriving ever since. Corpus-based translation studies is one of the leading paradigms in Descriptive Translation Studies (Toury 2012). 
This field also emerged in the 1990s, under the impetus of Mona Baker, and relies on corpus linguistic tools and methods to elucidate translated text (in particular, the linguistic features that set translated language apart from other forms of language production) (cf. Kruger et al. 2011; De Sutter et al. 2017). Contrastive linguistics and translation studies, which both make intensive use of parallel corpora, are quite close, as demonstrated by edited volumes such as Granger et al. (2003) and the biennial Using Corpora in Contrastive and Translation Studies conference series (e.g. Xiao 2010).
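Both fields presuppose that source and target texts have been aligned, typically at sentence level. The toy sketch below flags suspect 1:1 sentence pairs with a naive character-length-ratio heuristic; real aligners in the length-based tradition of Gale and Church are far more sophisticated (they also handle 1:2, 2:1 and 0:1 links), and the example sentences are invented:

```python
def check_one_to_one(src_sents, tgt_sents, max_ratio=1.8):
    """Pair sentences positionally and flag pairs whose character lengths
    diverge too much -- a crude hint that the 1:1 assumption is wrong."""
    pairs = []
    for s, t in zip(src_sents, tgt_sents):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        pairs.append((s, t, ratio <= max_ratio))
    return pairs

src = ["The cat sat on the mat.", "It rained."]
tgt = ["Le chat était assis sur le tapis.", "Il pleuvait."]
for s, t, plausible in check_one_to_one(src, tgt):
    print("OK" if plausible else "??", s, "<->", t)
```

The length-based intuition is simply that a translation rarely differs wildly in length from its source sentence, so large length mismatches usually signal alignment errors.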

12.2 Fundamentals

12.2.1 Types of Parallel Corpora

Parallel corpora can be of many different types. They can be bilingual (one SL and one TL), such as the English-Norwegian Parallel Corpus (ENPC; Johansson 2007), or multilingual (more than one SL and/or TL), such as the Oslo Multilingual Corpus, which is fully trilingual (English, Norwegian and German), with some texts available in Dutch, French and Portuguese as well (ibid., 18–19). Other multilingual parallel corpora include the Slavic parallel corpus ParaSol (Waldenfels 2011) and InterCorp (Čermák and Rosen 2012). A further distinction is made between monodirectional corpora, when only one translation direction is represented (SLX > TLY, e.g. English > Chinese), and bidirectional (or reciprocal) corpora, when both translation directions are included (SLX > TLY and SLY > TLX, e.g. English > Chinese and Chinese > English). The ENPC, for example, is bidirectional (from English to Norwegian, and vice versa). Most parallel corpora contain published translations (with some exceptions, e.g. when translations are specifically commissioned for a particular corpus compilation project). In most cases, only one translation of each source text is included. However, there are also multiple translation corpora, which include several translations of the same source text in a given TL. Such corpora make it possible to compare the translation solutions used by various translators rendering the same source text.

12.2.2 Main Characteristics of Parallel Corpora

The majority of parallel corpora used in contrastive linguistics and translation studies are characterized by two key features. First, the source and target languages are clearly identified. In other words, the translation direction is known (from LanguageX to LanguageY or from LanguageY to LanguageX). In cross-linguistic research, it is of paramount importance to know, for instance, whether a given text was translated from Spanish into German or vice versa, because corpus studies have shown that translation direction influences translation choices, and hence the linguistic make-up of translated text (e.g. Dupont and Zufferey 2017). Second, only direct translation is included, i.e. no pivot (intermediary, mediating) language is used between the source and target languages. In texts produced by the European Union (EU), for example, English has been systematically used as a pivot language since the early 2000s. In practical terms, this means that a text originally written in, say, Slovenian or Dutch is first translated into English. The English version is then translated into the other official languages of the EU. In other words, English acts as a pivot language and most target texts originating from EU institutions are in fact translations of translations (see Assis Rosa et al. 2017 on the issue of indirect translation). Parallel corpora that display these two features (known as translation direction and translation directness) will be referred to as directional parallel corpora in this chapter (a term borrowed from Cartoni and Meyer 2012). Parallel corpora whose translation direction is unknown and/or where a pivot language has been used will be called non-directional. Examples of the latter type include the Europarl corpus (Koehn 2005), the Eur-Lex corpus (Baisa et al. 2016) and the United Nations Parallel Corpus (Ziemski et al. 2016).
It is important to bear in mind, however, that the distinction between directional and non-directional parallel corpora is not always clear-cut. In some parallel corpora, both types of parallel texts are included. For example, the Dutch Parallel Corpus (DPC; Macken et al. 2011), which is largely directional, contains some indirect, EU translations. Directional parallel corpora typically (i) contain written texts (ii) translated by expert translators (iii) working into their native language (L1), and (iv) cover a



rather limited number of text types or genres. Each of these typical features will be discussed in turn: (i) Directional parallel corpora mainly cover written translation (e.g. the ENPC), to the detriment of other translation modalities, such as interpreting and audiovisual translation. In recent years, however, efforts have been made to include other forms of translation. A case in point is the compilation of several parallel corpora of simultaneous interpreting (see Russo et al. 2018 for an overview of corpus-based interpreting studies). In these corpora, the main source of data (speeches and their interpreted versions) is the European Parliament (Bernardini et al. 2018). An example of one such European Parliament interpreting corpus is the fully trilingual English-Italian-Spanish European Parliament Interpreting Corpus (Russo et al. 2006). Recent developments also include the compilation of intermodal parallel corpora, i.e. corpora representing several translation modalities (e.g. written translation and simultaneous interpreting), such as the European Parliament Translation and Interpreting Corpus (EPTIC; Ferraresi and Bernardini 2019). EPTIC features two main components: (i) simultaneous interpreting: transcripts of speeches delivered at the European Parliament plenary sittings and transcripts of the simultaneous interpretations of these speeches, and (ii) written translation: the verbatim reports of the plenary sittings, as officially published on the European Parliament website, alongside the official translations of these verbatim reports (the Europarl corpus is also based on this written material, see Representative Corpus 2 below). Parallel corpora of sign interpreting (e.g. Meurant et al. 2016) and audiovisual translation modalities (such as subtitling, dubbing, and film audio description; cf. Baños et al. 2013) have also been collected recently. 
Some of these parallel corpora are multimodal, in the sense that they contain different modes, such as language, image, sound and music (e.g. Jimenez Hurtado and Soler Gallego 2013; Chap. 16).

(ii) In general, parallel corpora include target texts translated (or assumed to have been translated) by professional and/or expert translators (it must be stressed, however, that limited metadata on translators’ status have been collected to date; see Sect. 12.2.4). In some cases, the translators’ status is rather unclear (e.g. in translated news items, found in several parallel corpora, from Le Monde Diplomatique, a French monthly newspaper with more than 30 international editions, in 20+ languages1). Other translators’ profiles are also represented, albeit less frequently, such as non-professional, volunteer translators, as in the TED Talks WIT3 corpus (Web Inventory of Transcribed and Translated Talks; Cettolo et al. 2012). Aside from professional and volunteer translators, some parallel corpora, called learner translation corpora (LTC), contain translations produced by foreign language learners or trainee translators, i.e. novices (see also Chap. 13). The first LTC emerged in the early 2000s (Uzar 2002; Bowker

1 https://www.monde-diplomatique.fr/diplo/int/. Accessed 22 May 2019.

12 Parallel Corpora


and Bennison 2003) and were followed by several similar initiatives, such as the MeLLANGE corpus (Castagnoli et al. 2011), the English-Catalan UPF LTC (Espunya 2014), the Russian Learner Translator Corpus (Kutuzov and Kunilovskaya 2014), and the Multilingual Student Translation corpus (Granger and Lefer 2020). The vast majority of directional parallel corpora contain translations produced by human translators (in some cases, with the help of computer-aided translation tools). Recently, however, translation scholars have started to include machine-translated texts alongside human-translated texts, with a view to uncovering the linguistic traits that differentiate machine translation from human translation (computer-aided or otherwise) (e.g. Lapshinova-Koltunski 2017).

(iii) Directional parallel corpora tend to be restricted to L1 translation (i.e. when the translation is carried out into the translator’s native language), except in the case of some LTC, which contain L2 (inverse, reverse) translation as well, or corpora representing language pairs for which L2 translation is common practice (e.g. Finnish to English) (see Beeby Lonsdale 2009 on directionality practices).

(iv) Most directional, balanced parallel corpora used in contrastive linguistics and translation studies are restricted to a couple of genres or text types, mainly fictional prose (e.g. the English-Portuguese COMPARA; Frankenberg-Garcia and Santos 2003; the “core” part of InterCorp), news (news items and opinion articles published in newspapers and magazines) and/or non-fiction, such as popular science texts (e.g. the ENPC; the English-French Poitiers-Louvain Échange de Corpus Informatisés PLECI2; the French-Slovenian FraSloK parallel corpus, Mezeg 2010; and the English-Spanish ACTRES parallel corpus, Izquierdo et al. 2008). A handful of directional parallel corpora cover a wider range of text types. Examples include the DPC for the language pairs Dutch-English and Dutch-French (Macken et al.
2011) and the CroCo corpus for German-English (Hansen-Schirra et al. 2012), with five and ten text types represented, respectively.

The directional parallel corpora featuring the four characteristics outlined above are relatively modest in size compared with monolingual reference corpora commonly used in corpus linguistics (they usually contain a few million words). This is even more striking for parallel corpora of interpreted language, in view of the many hurdles inherent in transcribing spoken data (Bernardini et al. 2018; Chap. 11). If more parallel data are needed, and provided translation direction and directness are not considered to be of particular relevance, researchers can turn to several non-directional parallel corpora (mainly of legislative and administrative texts) that are much larger than the parallel corpora discussed so far. These mega corpora are used widely in natural language processing, for example for data-driven machine translation. However, it is important to bear in mind that (i) in these corpora,

2 https://uclouvain.be/en/research-institutes/ilc/cecl/pleci.html. Accessed 22 May 2019.



translation direction is often unknown (i.e. the source and target languages are not clearly identified), and (ii) in many instances, the translation relationship between the parallel texts for a given language pair is indirect (either the translation is done through an intermediary, pivot language, or the parallel texts in a given pair are both translations from another, third language). Generally speaking, non-directional parallel corpus data should be treated with caution. While their use makes sense in natural language processing research, it remains to be seen whether they can yield reliable insights into cross-linguistic differences.

12.2.3 Methods of Analysis in Cross-Linguistic Research

Parallel corpora are widely used in corpus-based contrastive linguistics and translation studies and they are starting to emerge as a useful source of data in typology as well (Levshina 2016). As pointed out by Johansson (2007:3), most contrastive scholars “have either explicitly or implicitly made use of translation as a means of establishing cross-linguistic relationships. [...] As translation shows what elements may be associated across languages, it is fruitful to base a contrastive study on a comparison of original texts and their translations”. In other words, parallel corpora can be used to study cross-linguistic correspondences (e.g. between lexical items, lexico-syntactic patterns or grammatical structures). The corpus methods used to achieve that goal are similar to those applied in monolingual corpus linguistics, such as concordances (Chap. 8) and co-occurrence data (Chap. 7). Figure 12.1 provides a sample of bilingual concordances for the English phrase no kidding and its Italian equivalents in a corpus of subtitled films and series (OpenSubtitles2011, available in Sketch Engine; see Sect. 12.4).

Fig. 12.1 English no kidding and its Italian translation equivalents in the OpenSubtitles2011 corpus (OPUS2, Sketch Engine, Lexical Computing Ltd)

A cursory glance at Fig. 12.1 shows that Italian equivalents include non scherzo (‘I am not kidding’), sul serio (‘seriously’) and davvero (‘really’). The detailed analysis of English-Italian equivalences found in the corpus can act as a springboard for an in-depth contrastive analysis (e.g. what are the discursive and pragmatic functions of no kidding in scripted spoken English and which equivalent expressions are used in Italian to fulfill these functions?). Bilingual concordances are also widely used in translation studies to investigate the translation procedures used to render specific items (e.g. lexical innovations, proper names, culture-specific elements). For instance, on the basis of an Italian-to-German parallel corpus of tourist brochures, it is possible to determine whether translators adapt SL culture-bound items (e.g. macchiato, caffè latte) or whether they keep them in their translation (perhaps with an explanatory note), which reflects more general translation strategies towards domestication and foreignization.

Figure 12.2 shows a sample of a bilingual Word Sketch, i.e. a summary of the grammatical and collocational behaviors of equivalent words, for English sustainability and its French equivalent durabilité in parliamentary proceedings (Europarl). The bilingual Word Sketch makes it possible, among other things, to detect equivalent verbal collocates of the English and French nouns under scrutiny, such as jeopardize/menacer and ensure/assurer. This kind of co-occurrence analysis is particularly helpful for contrastive phraseology, applied translation studies (e.g. to raise trainee translators’ awareness of phraseological equivalence) and bilingual lexicography.

Fig. 12.2 Sample of a bilingual Word Sketch for English sustainability and its French equivalent durabilité (Europarl7, Sketch Engine, Lexical Computing Ltd)

In the two examples mentioned above, we started with a given SL item (no kidding, sustainability) and examined its translation equivalents in the TL (Italian and French, respectively), i.e. going from source to target. Interestingly, this source-to-target approach is also used in monolingual corpus linguistics to examine the semantic, discursive and pragmatic features of source-language items (Noël 2003). For example, Aijmer and Simon-Vandenbergen (2003) examine the meanings and functions of the English discourse particle well on the basis of its Swedish and Dutch translation equivalents in a parallel corpus of fictional texts.

An alternative method is to start off from a given item or structure in translated texts and examine its corresponding source-text items or structures, i.e. from target to source. Taking the same example as above, this would entail analyzing all occurrences of sul serio in Italian subtitles and identifying the English source items that have triggered their use. This target-to-source approach is quite common in translation studies. Delaere and De Sutter (2017), for example, rely on an English-to-Dutch parallel corpus to find out whether the English loanwords found in translated Dutch stem from their corresponding trigger words in the English source texts. Naturally, these two approaches (source to target and target to source) can be combined if a more comprehensive picture of cross-linguistic correspondences is required. Indeed, many new insights can be gained by investigating a given item or structure in both source and target texts, so as to find out how it is commonly translated and which items in the other language have triggered its use in translation (e.g. Zufferey and Cartoni 2012).

It is also possible, on the basis of parallel corpora, to work out what Altenberg has termed mutual correspondence (or mutual translatability), i.e. “the frequency with which different (grammatical, semantic and lexical) expressions are translated into each other” (Altenberg 1999:254).
Mutual correspondence is calculated as follows, with At and Bt corresponding to the frequencies of the compared items A and B in the target texts (t), and As and Bs to their frequencies in the source texts (s):

mutual correspondence = ((At + Bt) × 100) / (As + Bs)

If, say, a lexical item A is always translated with an item B, and vice versa, then items A and B have a mutual correspondence of 100%. If, on the contrary, A and B are never translated with each other, they display a mutual correspondence of 0%. In other words, this index makes it possible to assess the extent to which items are equivalent across languages: “the higher the mutual correspondence value is, the greater the equivalence between the compared items is likely to be” (Altenberg and Granger 2002:18). For example, Dupont and Zufferey (2017) find that in samples of 200 occurrences extracted from Europarl, the adverb pair however/cependant displays a mutual correspondence of 57% (however > cependant: 87/200, cependant > however: 140/200), while the however/toutefois pair has a lower correspondence score of 49% (however > toutefois: 80/200, toutefois > however: 114/200):

however/cependant = ((87 + 140) × 100) / (200 + 200) ≈ 57%

however/toutefois = ((80 + 114) × 100) / (200 + 200) ≈ 49%
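For readers who prefer a computational check, Altenberg’s measure can be sketched in a few lines of Python, applied to Dupont and Zufferey’s figures (the function and argument names below are our own, not part of any published implementation):

```python
def mutual_correspondence(a_t, b_t, a_s, b_s):
    """Altenberg's mutual correspondence (MC), as a percentage.

    a_t: how often A occurs as a translation of B (target texts)
    b_t: how often B occurs as a translation of A (target texts)
    a_s: occurrences of A in the source-text sample
    b_s: occurrences of B in the source-text sample
    """
    return (a_t + b_t) * 100 / (a_s + b_s)

# Dupont and Zufferey's (2017) Europarl samples (200 occurrences per item):
print(mutual_correspondence(87, 140, 200, 200))  # 56.75, reported as 57%
print(mutual_correspondence(80, 114, 200, 200))  # 48.5, reported as 49%
```

The 100% and 0% limiting cases described above fall out directly: if A and B are always translated with each other, the target-side counts equal the source-side counts and the ratio is 100.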

The scores tend to indicate that in parliamentary proceedings, the however/cependant cross-linguistic equivalence is somewhat stronger than for however and toutefois.

So far, we have outlined different methods of parallel corpus analysis (from source to target, from target to source, mutual correspondence). However, it should be stressed that several types of corpora can be combined to reveal and disentangle cross-linguistic contrasts and translation-related phenomena (Bernardini 2011; Johansson 2007; Halverson 2015). Two types of corpora are commonly used in cross-linguistic research in combination with parallel corpora: (i) bilingual/multilingual comparable corpora and (ii) monolingual comparable corpora. Their combined use with parallel corpora will be discussed in turn.

Bilingual (or multilingual) comparable corpora are “collections of original [i.e. non-translated] texts in the languages compared” (Johansson 2007:5). The texts are strictly matched by criteria such as register, genre, text type, domain, subject matter, intended audience, time of publication, and size. Examples include KIAP, a comparable corpus of research articles in Norwegian, English, and French (Fløttum et al. 2013) and the Multilingual Editorial Corpus, a comparable corpus of newspaper editorials in English, Dutch, French, and Swedish.3 Bilingual and multilingual comparable corpora usefully complement parallel corpora in that a given phenomenon can be studied cross-linguistically on the basis of comparable original texts, i.e. texts displaying no trace of source-language or source-text influence, unlike translations in parallel corpora. Corpus studies combining both types of corpora can start either with the bilingual/multilingual comparable analysis, before turning to the parallel corpus analysis, or the other way around, depending on the research questions to be tackled (see Johansson 2007 for more details).
Interestingly, bilingual comparable and parallel corpora can be combined in the same corpus framework, namely bidirectional parallel corpora whose two translation directions are truly comparable in terms of size, text types, etc. As shown in Fig. 12.3, for example, the ENPC can function both as a bidirectional parallel corpus (English originals > Norwegian translations and Norwegian originals > English translations; see black arrows) and as a bilingual comparable corpus (English originals and Norwegian originals; see white double arrow).

Fig. 12.3 The model for the ENPC (based on Johansson 2007:11)

Numerous parallel corpora are based on the ENPC model, such as the English-Swedish Parallel Corpus (ESPC), COMPARA and PLECI. However, the main problem of the bidirectional ENPC model is that the selection of texts to be included in the corpus is limited to genres that are commonly translated in both directions (see Johansson 2007:12 on this issue). In other words, the number of genres and texts that can be included in the corpus is often limited (e.g. only fiction and non-fiction texts in the ENPC). As a result, to improve representativeness, the comparable, original components of bidirectional parallel corpora need to be supplemented with larger, multi-genre (reference) monolingual corpora of the languages investigated (see Fig. 12.4).

Fig. 12.4 The model for the ENPC, with additional reference monolingual corpora

3 https://uclouvain.be/en/research-institutes/ilc/cecl/mult-ed.html. Accessed 22 May 2019.



Parallel corpora can also be combined with monolingual comparable corpora, which include comparable translated and non-translated (i.e. original) texts in a given language (e.g. novels originally written in English alongside novels translated into English from a variety of source languages; see, for example, the ten-million-word Translational English Corpus4). Monolingual comparable corpora of translated and original texts are widely used in translation studies, with a view to identifying the major distinguishing features of translated language, when compared with original language production (the so-called translation universals, or translation features/properties, such as simplification, normalization and increased explicitness; cf. Baker 1993, 1995). Parallel corpora, when combined with monolingual comparable corpora, are used to check for source-text and/or source-language influence. Cappelle and Loock (2013), for example, use parallel corpus data to find out whether the under-representation of existential there in English translated from French (as compared with non-translated English) stems from SL (French) interference. Parallel and monolingual comparable corpora can be integrated within the same overall corpus framework, as shown in Fig. 12.5.

Fig. 12.5 The model for a monolingual-comparable-cum-parallel corpus

12.2.4 Issues and Methodological Challenges

Issues and Challenges Specific to the Design of Parallel Corpora

This section presents an overview of some of the main challenges specific to the design of parallel corpora (for a detailed discussion of more general issues, such as representativeness and balance, copyright clearance,5 and text encoding, see Chap. 1).

The first issue is text availability. As mentioned above, parallel corpora, especially bidirectional ones, tend to be modest in size and are often restricted to a small number of text types. One of the reasons for this is that for any given language pair (LX and LY), there is often some kind of asymmetry or imbalance between the two translation directions (LX > LY and LY > LX). This imbalance can take several forms: either there are simply fewer texts translated in one direction than in the other (especially when the language pair involves a less “central”, or more “peripheral”, language), or certain text types are only (or more frequently) translated in one of the two directions. For example, as noted by Frankenberg-Garcia & Santos (2003:75) in relation to the translation of tourist brochures for the English-Portuguese pair:

[t]ourist brochures in Portuguese translation are practically non-existent: Portuguese-speaking tourists abroad are expected to get by in other, more widely known languages. In contrast, almost all material destined to be read by tourists in Portuguese-speaking countries comes with an English translation.

4 https://www.alc.manchester.ac.uk/translation-and-intercultural-studies/research/projects/translational-english-corpus-tec/. Accessed 22 May 2019.

To sum up, “translations are heavily biased towards certain genres, but these biases are rarely symmetrical for any language pair” (Mauranen 2005:74). In addition, some widely translated text types may be hard to obtain, for obvious confidentiality reasons specific to translation projects carried out by translation agencies and freelance translators (e.g. legal texts or texts translated for internal use only). Finally, there are language pairs for which there are very few parallel texts available (cf. for example, Singh et al. 2000 on building an English-Punjabi parallel corpus). To compensate for data scarcity, there have been a number of initiatives since the early 2000s (Resnik and Smith 2003) aiming to create mainly non-directional parallel corpora by crawling sites across the web (Chap. 15).

Obtaining detailed metadata is another challenge facing anyone wishing to compile a parallel corpus. In this respect, parallel corpora are clearly lagging behind compared with other corpus types, such as learner corpora, which are more richly documented (Chap. 13). Ideally, the following metadata should be collected (this list is non-exhaustive):

• Source text and target text: author(s)/translator(s), publisher, register, genre, text type, domain, format, mode, intended audience, communicative purpose, publication status, publication date, etc.
• Translation direction, including SL and TL (and their varieties)
• Translation directness: use of a pivot language or not

5 Unsurprisingly, it is far from easy to obtain copyright clearance for texts to be included in parallel corpora. For this reason, many parallel corpora are not publicly available (e.g. ENPC, PLECI, Raf Salkie’s INTERSECT, P-ACTRES, CroCo).



• Translation directionality: L2 > L1 translation, L1 > L2 translation, L2 > L2 translation, etc.
• Translator: translator’s status (professional, volunteer/amateur, student, etc.), translator’s occupation, gender, nationality, country of residence, translation expertise (expert vs. novice), translation experience (which can be measured in many different ways, e.g. number of years’ experience), language background (native and foreign languages), etc.
• Translation task: use of computer-aided translation tools (translation memories, terminological databases) and other tools and resources (dictionaries, forums, corpora, etc.), use of a translation brief (set of translation instructions, including, for instance, use of a specific style guide or in-house terminology), fee per word/line/hour, deadline/time constraints, etc.
• Revision/editorial intervention: self- and other-revision, types of revision (e.g. copyediting, monolingual vs. bilingual revision), etc.

It is also important to stress here that the concepts of source language and source text are becoming increasingly blurred. In today’s world, some “source” documents are simultaneously drafted in several languages. In multilingual translation projects, there are also cases where there is no single “source” text, as translators translate a given text while accessing some of its already available translations (e.g. when confronted with an ambiguous passage).

Third, there is the issue of alignment, i.e. the process of matching corresponding segments in source and target texts (see Tiedemann 2011). Software tools can be used to align parallel texts automatically at paragraph, sentence and word level (see, for instance, Hunalign, Varga et al. 2007; GIZA++, Och and Ney 2003; fast_align, Dyer et al. 2013). Most directional corpora are aligned at sentence level.
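Sentence-level alignment of this kind is typically driven by segment lengths, in the spirit of Gale and Church's classic approach. The sketch below is a deliberately minimal dynamic-programming aligner over character lengths, allowing 1:1, 1:2 and 2:1 links; the cost function and the 0.3 penalty for merge/split links are illustrative assumptions of ours, not the parameters of Hunalign or any other published tool:

```python
def length_cost(src_len, tgt_len):
    # Translated segments tend to have roughly proportional lengths,
    # so we penalise the relative difference in character counts.
    if src_len == 0 and tgt_len == 0:
        return 0.0
    return abs(src_len - tgt_len) / ((src_len + tgt_len) / 2)

def align(src_sents, tgt_sents, link_penalty=0.3):
    """Align two sentence lists with 1:1, 1:2 and 2:1 links (dynamic programming)."""
    n, m = len(src_sents), len(tgt_sents)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            # Candidate links: (source sentences consumed, target sentences consumed, extra cost).
            for di, dj, extra in ((1, 1, 0.0), (1, 2, link_penalty), (2, 1, link_penalty)):
                if i + di <= n and j + dj <= m:
                    s = sum(len(x) for x in src_sents[i:i + di])
                    t = sum(len(x) for x in tgt_sents[j:j + dj])
                    c = best[i][j] + length_cost(s, t) + extra
                    if c < best[i + di][j + dj]:
                        best[i + di][j + dj] = c
                        back[i + di][j + dj] = (di, dj)
    # Trace the optimal path back to recover the alignment links.
    links, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        links.append((tuple(src_sents[i - di:i]), tuple(tgt_sents[j - dj:j])))
        i, j = i - di, j - dj
    links.reverse()
    return links
```

Run on a merging example (two French sentences, one English sentence), this returns a single 2:1 link. Production aligners add further evidence on top of length-based scoring, such as punctuation, cognates and bilingual dictionary lookups.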
Different sources of information can be used to match sentences across parallel texts, such as sentence length (in words or characters, normalized by text length), word length, punctuation (e.g. quotation marks), and lexical anchors (e.g. cognates). Some aligners also rely on bilingual dictionaries. Sentence alignment is not a straightforward task, as translators often merge or split sentences when producing the target text. This is referred to as 2:1 and 1:2 alignment links, respectively (see examples in Table 12.1).

Table 12.1 Splitting and merging source-text sentences in translation

1:1 alignment
Source: Hachez les feuilles de coriandre et mélangez au gingembre.
Target: Chop the coriander leaves and mix with the ginger.

Splitting (1:2 alignment)
Source: Hachez les feuilles de coriandre et mélangez au gingembre.
Target: Chop the coriander leaves. Mix with the ginger.

Merging (2:1 alignment)
Source: Râpez le gingembre. Coupez les feuilles de coriandre et mélangez au gingembre.
Target: Grate the ginger, then chop the coriander leaves and mix with the ginger.

As pointed out by Macken et al. (2011:380), “[t]he performance of the individual alignment tools varies for different types of texts and language pairs and in order to guarantee high quality alignments, a manual verification step is needed”. A good option is to use a tool that combines automatic sentence alignment and manual post-alignment correction options, such as the open-source desktop application InterText editor (Vondřička 2014) or the Hypal interface (Obrusnik 2014). One way of reducing this manual editing step is to combine the output of several aligners, as done for the DPC, where the corpus compilers combined the output of three aligners. The alignment links that were present in the output of at least two aligners were considered as reliable alignment links. All the other links were then checked manually (this shows that manual editing of automatically aligned texts is essential, even when the output of several aligners is combined). Aligners typically generate the following types of XML output: (i) one source-text file, one target-text file and one link file (linking up the source- and target-text segments), (ii) one source-text file and one target-text file, containing the same number of segments, or (iii) a TMX (Translation Memory eXchange) file.

Finally, yet another major challenge relating to the compilation of parallel corpora (or any other type of multilingual corpus) is multilingual linguistic annotation (e.g. lemmatization, morphosyntactic annotation, syntactic parsing, semantic tagging; Chap. 2). Johansson (2007:306) rightly argues that “[t]o go beyond surface forms, we need linguistically annotated corpora that allow more sophisticated studies”. However, multilingual annotation raises the following key questions, which echo the more general “universality vs. diversity” debate in linguistics (see, for example, Evans and Levinson 2009):

If corpora are annotated independently for each language, to what extent is the analysis comparable? If they are provided with some kind of language-neutral annotation (for parts of speech, syntax, etc.), to what extent do we miss language-specific characteristics? (Johansson 2007:306)

At present, no definite answers have been found to these questions. As a matter of fact, issues related to multilingual annotation (e.g. whether it should be language-specific or language-neutral, or, more generally, how cross-linguistic comparability can be achieved) have received relatively little attention in contrastive linguistics and translation studies (one notable exception is Neumann 2013). The language-specific and language-neutral approaches are both used in parallel corpora, the former being more common. In the language-specific approach, researchers rely either on separate annotation tools (one per language involved) or on one single tool that is available for several languages, such as the TreeTagger (Schmid 1994) or FreeLing (Padró and Stanilovsky 2012) POS taggers. However, it is important to bear in mind that in these multilingual annotation tools, (i) the annotation systems are not designed to be cross-linguistically comparable: some tags are language-specific (e.g. the RP tag used for English adverbial particles) while, unsurprisingly, “shared” tags display language-specific features (e.g. the TreeTagger JJ tag used for English adjectives does not correspond fully to what the French ADJ tag covers), and (ii) precision and recall ratios (Chap. 2) differ across languages (e.g. for the TreeTagger, they tend to be higher for English than for French). These two factors can potentially jeopardize the contrastive comparability of annotated multilingual data. Great care should therefore be taken when analyzing annotated data in cross-linguistic research (see, for example, Neumann 2013 and Evert and Neumann 2017 on the English-German language pair). An interesting language-neutral approach, suggested in Rosen (2010), consists in using an abstract, interlingual hierarchy of linguistic categories mapped to language-specific tags. In the same vein, some researchers have proposed “universal” tagsets, which include tags that accommodate language-specific parts-of-speech (see, for example, Benko 2016; the MULTEXT-East project,6 with its harmonized morphosyntactic annotation system for 16 languages; Erjavec’s SPOOK specifications,7 with harmonized tagsets for English, French, German, Italian, and Slovenian). The multilingual annotation of existing parallel corpora is still very basic, being mostly limited to lemmatization and POS tagging. Syntactic annotation will probably become more standard in years to come, given recent advances in multilingual parsing (e.g. Bojar et al. 2012; Volk et al. 2015; Augustinus et al. 2016 on parallel treebanks; see also the Universal Dependencies project8).
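To illustrate what a language-neutral mapping can look like in practice, here is a toy harmonization layer that maps tool-specific tags onto a shared inventory (UPOS-style labels). The mapping tables are deliberately tiny, illustrative fragments of our own, not the full TreeTagger tagsets or any official conversion table:

```python
# Illustrative fragments of tool-specific tagsets mapped onto a shared,
# language-neutral inventory (UPOS-style labels).
ENGLISH_TAGS_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN", "JJ": "ADJ", "RB": "ADV",
    "VB": "VERB", "RP": "PART",  # RP: the English-specific adverbial particle tag
}
FRENCH_TAGS_TO_UPOS = {
    "NOM": "NOUN", "ADJ": "ADJ", "ADV": "ADV", "VER:pres": "VERB",
}

def harmonize(tagged_sentence, mapping):
    """Map (token, tool_tag) pairs onto the shared inventory.

    Tags with no language-neutral counterpart are kept but flagged with an
    'X:' prefix, so language-specific residue remains visible for analysis.
    """
    return [(token, mapping.get(tag, "X:" + tag)) for token, tag in tagged_sentence]
```

With such a layer, an English (token, "NNS") pair and a French (token, "NOM") pair both come out as NOUN, making frequency counts comparable at the harmonized level, while the flagged residue shows exactly where the language-specific categories resist neutralization.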

Issues and Challenges Specific to the Analysis of Parallel Corpora

Clearly, compared with monolingual corpora, parallel corpora are lagging behind in terms of size (representativeness is also an issue, as small corpora tend to represent relatively few authors and translators/interpreters). Low-frequency linguistic phenomena may be hard to analyze on the basis of parallel corpora, for sheer lack of sufficient data that would allow reliable generalizations. Researchers in contrastive linguistics and translation studies are therefore often forced to combine several parallel corpora to extract a reasonable amount of data, but this approach raises a number of problems. One is that several confounding variables may be intertwined in the various corpora used, which in turn hinders the interpretability of the results.

In Lefer and Grabar (2015), for instance, we relied on two parallel corpora, i.e. verbatim reports of parliamentary debates (Europarl) and interlingual subtitles of oral presentations (TED Talks), so as to investigate the translation of rather infrequent lexical items, namely evaluative prefixes (e.g. over- and super-). We found marked and seemingly insightful differences between the translation procedures used in Europarl and TED Talks but were forced to recognize that it was impossible to assess to what extent the observed differences were due to source-text genre (parliamentary debates vs. oral presentations), translation modality (written translation vs. subtitling) or translator expertise (professional translators vs. non-professional volunteers) or, for that matter, a combination of some or all of these factors.

Another issue, also directly related to the interpretability of the results, is the cross-linguistic comparability (or lack thereof) of genres and text types in bidirectional parallel corpora (such as the ENPC, the DPC and CroCo) (see Neumann 2013). Matching genres or text types cross-linguistically is “by no means straightforward” (Johansson 2007:12). We may indeed wonder whether the observed differences reflect genuine cross-linguistic contrasts and/or translation-specific features or whether they are due to fundamental cross-linguistic differences between supposedly similar genres or text types (e.g. research articles or newspaper opinion articles) (cf. Fløttum et al. 2013 on medical research articles in Norwegian). This question cannot be overlooked.

It is also worth pointing out that most parallel corpora are poorly meta-documented (source and target texts and languages, translator, translation task, editorial intervention, etc.), which, unfortunately, can lead researchers to jump to hasty conclusions as regards both cross-linguistic contrasts (“this pattern is due to differences between the two language systems under scrutiny”) and features of translated language (“this is inherent in the translation process”).

One final point to be made in this section is that parallel corpora (even those whose texts have all been translated by highly-skilled professionals) contain infelicities and even translation errors (to err is human, after all). Researchers may therefore feel uncomfortable with some of the data extracted from parallel corpora. Rather than sweeping erroneous items under the carpet, when in doubt it is probably safer to acknowledge these seemingly infelicitous or erroneous data explicitly.

6 http://nl.ijs.si/ME/V4/msd/html/index.html. Accessed 22 May 2019.
7 http://nl.ijs.si/spook/msd/html-en/. Accessed 22 May 2019.
8 http://universaldependencies.org/. Accessed 22 May 2019.
Moreover, looking on the bright side, these infelicities and errors can prove to be highly valuable in applied fields such as bilingual lexicography, foreign language teaching or translator training. In Granger and Lefer (2016), we suggest using them to devise corpus-based exercises, such as the detection and correction of erroneous translations or the translation of sentences containing error-prone items.

Representative Study 1

Dupont, M., and Zufferey, S. 2017. Methodological issues in the use of directional parallel corpora. A case study of English and French concessive connectives. International Journal of Corpus Linguistics 22(2):270–297.

In their study, Dupont & Zufferey make an important methodological contribution to the field of corpus-based contrastive linguistics by examining three factors that can potentially affect the nature of the cross-linguistic correspondences found in parallel corpora: register, translation direction (continued)

12 Parallel Corpora


and translator expertise. More specifically, they compare three registers (news, parliamentary proceedings, and TED Talks) in two translation directions (from English into French, and vice versa), examining three types of translator expertise (they compare professional, semi-professional and amateur translators). Their study is particularly innovative in that relatively few contrastive corpus studies to date have taken into consideration these influencing factors (especially translation direction and translator expertise), focusing almost exclusively on the source and target linguistic systems under scrutiny. By assuming that the correspondences extracted from parallel corpora are mainly (or solely) due to similarities and differences between the source and target languages, researchers fail to acknowledge the inherently multidimensional nature of translation. In this study, Dupont & Zufferey investigate the translation equivalences between English and French adverbial connectives expressing concession (e.g. yet, however, nonetheless) across three parallel corpora (PLECI news, Europarl Direct and the TED Talk Corpus). Their results indicate that translation choices (and hence, observed cross-linguistic correspondences) depend on all three factors investigated.

Representative Study 2
Delaere, I., and De Sutter, G. 2017. Variability of English Loanword Use in Belgian Dutch Translations: Measuring the Effect of Source Language, Register, and Editorial Intervention. In Empirical Translation Studies: New Methodological and Theoretical Traditions, eds. De Sutter, G., Lefer, M.-A., and Delaere, I., 81–112. Berlin/Boston: De Gruyter Mouton.
Delaere & De Sutter’s study is situated in the field of corpus-based translation studies. The authors explore three factors that can impact on the linguistic traits of translated language, namely source-language influence, register, and editorial intervention (i.e. revision). They do so through an analysis of English loanwords (vs. their endogenous variants) in translated and original Belgian Dutch (e.g. research & development vs. onderzoek en ontwikkeling). Loanword use is related to a widely investigated topic in translation studies, viz. the normalization hypothesis, which states that translated text is more standard than non-translated text. The starting-point hypothesis of Delaere & De Sutter’s study is that overall, translators make more use of endogenous lexemes (a conservative option compared with the use of loanwords) than


M.-A. Lefer

do non-translators (writers). Relying on the Dutch Parallel Corpus, the authors combine two approaches in their study: monolingual comparable (Dutch translated from English and French, alongside original Dutch) and parallel (English to Dutch). As is often the case in corpus-based translation studies, parallel data are used with a view to identifying the source-text items/structures that have triggered the use of a given item/structure in the translations (in this case, the presence of a trigger term in the English source texts, such as unit, job, or team). The authors apply multivariate statistics (profile-based correspondence analysis and logistic regression analysis) to measure the effect of the three factors investigated on the variability of English loanword use. The logistic regression analysis reveals that the effect of register is so strong that it cancels out the effect of source language. Their study convincingly illustrates the need to adopt multifactorial research designs in corpus-based translation studies, as these make it possible to go beyond the monofactorial designs where, typically, only the “translation status” variable is considered (translated vs. non-translated).
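The logic of such a multifactorial design can be illustrated with a small sketch. The code below fits a two-predictor logistic regression by batch gradient descent on invented toy counts (NOT the DPC data used by Delaere & De Sutter, who also rely on profile-based correspondence analysis): the binary outcome is whether a loanword or its endogenous variant is chosen, and the predictors are register and source language. In the toy data, register drives the choice while source language is balanced and uninformative, mirroring the finding that the register effect can cancel out the source-language effect.

```python
import math

def fit_logistic(rows, lr=1.5, iters=5000):
    """Fit a two-predictor logistic regression by batch gradient descent.
    Each row is (register, source_language, loanword_used), all coded 0/1."""
    w_reg = w_src = bias = 0.0
    n = len(rows)
    for _ in range(iters):
        g_reg = g_src = g_b = 0.0
        for reg, src, y in rows:
            # predicted probability of a loanword for this observation
            p = 1.0 / (1.0 + math.exp(-(bias + w_reg * reg + w_src * src)))
            err = p - y
            g_reg += err * reg
            g_src += err * src
            g_b += err
        # average-gradient update
        w_reg -= lr * g_reg / n
        w_src -= lr * g_src / n
        bias -= lr * g_b / n
    return w_reg, w_src, bias

# Invented toy observations: register (0 = administrative, 1 = journalistic),
# source language (0 = French, 1 = English), outcome (1 = loanword used).
# Loanwords are frequent in register 1 regardless of source language.
data = (
    [(1, 1, 1)] * 4 + [(1, 1, 0)] * 1 + [(1, 0, 1)] * 4 + [(1, 0, 0)] * 1 +
    [(0, 1, 1)] * 1 + [(0, 1, 0)] * 4 + [(0, 0, 1)] * 1 + [(0, 0, 0)] * 4
)

w_reg, w_src, bias = fit_logistic(data)
odds_ratio_register = math.exp(w_reg)  # effect of register on the odds of a loanword
```

On this toy data the register coefficient converges towards log(16) ≈ 2.77 while the source-language coefficient stays near zero, which is exactly the kind of asymmetry a monofactorial design (translated vs. non-translated only) would miss.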

Representative Corpus 1 The Dutch Parallel Corpus (Macken et al. 2011) is a ten-million-word bidirectional Dutch-French and Dutch-English parallel corpus (Dutch being the central language). The DPC includes five text types: administrative texts (e.g. proceedings of parliamentary debates, minutes of meetings, and annual reports), instructive texts (e.g. manuals), literature (e.g. novels, essays, and biographies), journalistic texts (news reporting articles and comment articles) and texts for external communication purposes (e.g. press releases and scientific texts). The DPC also features rich metadata, such as publisher, translation direction, author or translator of the text, domain, keywords and intended audience. The corpus is fully aligned at sentence level and is lemmatized and part-of-speech tagged. Unlike many similar corpora, the DPC is available to the research community, thanks to its full copyright clearance.

Representative Corpus 2
To date, Europarl (Koehn 2005) is one of the few parallel corpora to have been used widely in both corpus-based contrastive/translation studies and natural language processing. It contains the proceedings (verbatim reports) of the European Parliament sessions in 21 languages. Its seventh version,



released in 2012 by Koehn, includes data from 1996 to 2011 and amounts to 600+ million words. Europarl contains two types of European Parliament official reports, viz. written-up versions of spontaneous, impromptu speeches and edited versions of prepared (written-to-be-spoken) speeches. Europarl files contain some metadata tags, such as the speaker’s name and the language in which the speech was originally delivered. The main problem, however, is that in part of the corpus, LANGUAGE tags are either missing or inconsistent across corpus files. To solve this problem, Cartoni and Meyer (2012) homogenized LANGUAGE tags across all corpus files. Thanks to this approach, they were able to extract directional Europarl subcorpora, i.e. subcorpora where the source and target languages are clearly identified (see https://www.idiap.ch/dataset/europarl-direct).
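The extraction step can be sketched as follows. The snippet assumes a simplified, Europarl-style markup in which each speaker turn opens with a <SPEAKER ...> tag that may carry a LANGUAGE="XX" attribute giving the original language of the speech; the actual corpus files and the Europarl Direct procedure are richer than this, so the sketch is illustrative only.

```python
import re

def directional_subcorpus(raw_text, source_lang):
    """Return the text lines of speeches originally delivered in source_lang,
    skipping turns whose LANGUAGE tag is missing (original language unknown)."""
    current = None  # original language of the current speaker turn
    kept = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line.startswith("<SPEAKER"):
            m = re.search(r'LANGUAGE="([A-Z]{2})"', line)
            current = m.group(1) if m else None
        elif line and not line.startswith("<"):
            if current == source_lang:
                kept.append(line)
    return kept

# Hypothetical input illustrating present, differing, and missing LANGUAGE tags.
sample = """<CHAPTER ID=1>
<SPEAKER ID=1 LANGUAGE="EN" NAME="Speaker A">
This speech was originally delivered in English.
<SPEAKER ID=2 LANGUAGE="FR" NAME="Speaker B">
Ce discours a ete prononce en francais.
<SPEAKER ID=3 NAME="Speaker C">
No LANGUAGE tag here, so the original language is unknown.
"""

english_only = directional_subcorpus(sample, "EN")
```

Running the paired target-language file through the same turn boundaries would then yield a directional subcorpus where the translation direction is known for every segment kept.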

12.3 Critical Assessment and Future Directions

As shown above, anyone wishing to design and compile a directional parallel corpus faces a number of key issues, such as parallel text availability (especially in terms of text-type variety), access to source text-, translator- and translation task-related metadata, automatic sentence alignment, and linguistic annotation. Relying on existing parallel corpus resources poses its own challenges as well, as present-day parallel corpora tend to be quite small and/or poorly meta-documented and typically cover relatively few text types. Notwithstanding these issues and challenges, parallel corpus research to date has yielded invaluable empirical insights into cross-linguistic contrasts and translation. There are many hopes and expectations for tomorrow’s parallel corpora, and headway can be made in three ways in the not too distant future. The first

9 The practice of translating the European Parliament proceedings into all EU languages ceased in the second half of 2011. The verbatim reports of the plenary sittings are still made available on the European Parliament website, but the written-up versions of the speeches are only published in the languages in which the speeches were delivered.
10 In this respect, it is important to stress that English is increasingly used as a lingua franca at the European Parliament. In other words, some of the speeches originally delivered in English are in fact given by non-native speakers of English (the same holds, albeit to a lesser extent, for other languages, such as French). This is not a trivial issue, as recent research indicates that the use of English as a Lingua Franca can have a considerable impact on translators’ (and interpreters’) outputs (see Albl-Mikasa (2017) for an overview of English as a Lingua Franca in translation and interpreting).



two are related to the design of new parallel corpora, while the third is concerned with a rapprochement between natural language processing and cross-linguistic studies. First, it is high time we started collecting richer metadata, notably in terms of SL/TL, source and target texts, translator, translation task, and editorial intervention. This will make it possible to adopt multifactorial research designs and use advanced quantitative methods in contrastive linguistics and translation studies much more systematically, thereby furthering our understanding of cross-linguistic contrasts and of the translation product in general. Second, whenever possible, we should go beyond the inclusion of translated novels, news, and international organizations’ legal and administrative texts, and strive for the inclusion of more genres and text types, especially those that are dominant in today’s translation market, to which corpus compilers have had limited access to date, for obvious reasons of confidentiality and/or copyright clearance. This also entails compiling corpora representing different translation modalities (e.g. audiovisual translation, interpreting) and translation methods, such as computer-aided translation and post-editing of machine-translated output, as translation from scratch is increasingly rare today (one notable exception is literary translation). Including different versions of the same translation would also prove rewarding (e.g. draft, unedited, and edited versions of the translated text). Finally, we need to cross-fertilize insights from natural language processing and corpus-based cross-linguistic studies. This “bridging the gap” can go both ways. On the one hand, cross-linguistic research should take full advantage of recent advances in natural language processing, for tasks such as automatic alignment and multilingual annotation.
Significant progress has been made in recent years in these areas, but parallel corpora, especially those compiled by research teams of corpus linguists, have not yet fully benefited from these new developments. At present, for instance, very few parallel corpora are syntactically parsed or semantically annotated. On the other hand, natural language processing researchers involved in parallel corpus compilation projects could try to document, whenever possible, meta-information that is of paramount importance to contrastive linguists and translation scholars, such as translation direction (from LX to LY, or vice versa) and directness (use of a pivot language or not). In turn, taking this meta-information into account may very well help significantly improve the overall performance of data-driven machine translation systems and other tools relying on data extracted from parallel corpora. Even though it is quite difficult to predict future developments with any certainty, especially in view of the fact that translation practices are changing dramatically (e.g. human post-editing of machine-translated texts is increasingly common in the translation industry), it is safe to say that compiling and analyzing parallel corpora will prove to be an exciting and rewarding enterprise for many years to come.



12.4 Tools and Resources

12.4.1 Query Tools

Sketch Engine by Lexical Computing Ltd. is undoubtedly the most powerful tool available to linguists, translation scholars, and lexicographers to analyze bilingual and multilingual parallel corpora. The Sketch Engine interface offers powerful functionality, such as bilingual Word Sketches and automatic bilingual terminology extraction. It contains several ready-to-use sentence-aligned, lemmatized, and POS-tagged parallel corpora, such as DGT-Translation Memory, Eur-Lex, Europarl7 and OPUS2. It is also possible to upload your own parallel corpora in various formats (including XML-based formats used in the translation industry, such as TMX (Translation Memory eXchange) and XLIFF (XML Localization Interchange File Format)), and exploit them in Sketch Engine. A simpler version of the tool, NoSketch Engine, is freely available to the research community (https://nlp.fi.muni.cz/trac/noske) (accessed 22 May 2019). There are also a number of multilingual parallel concordancers specifically designed for the extraction of data from parallel corpora, such as:
• Anthony’s AntPConc (available from: http://www.laurenceanthony.net/software/antpconc/) (accessed 22 May 2019), a freely available parallel corpus analysis toolkit for concordancing and text analysis using line-break aligned, UTF-8 encoded text files.
• Barlow’s ParaConc (http://www.athel.com/para.html) (accessed 22 May 2019), a multilingual concordancer with the following functionality: semi-automatic alignment of parallel texts, parallel searches, automatic identification of translation candidates (called Hot Words) and collocate extraction.
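The core operation behind all of these tools, retrieving aligned sentence pairs whose source side matches a query, can be sketched in a few lines. The snippet below reads a TMX document (the XML translation-memory interchange format mentioned above, in which each translation unit <tu> holds language-specific variants <tuv> with the text in a <seg> element) and runs a minimal parallel concordance search; it is an illustrative sketch, not the internals of any of the concordancers named here.

```python
import xml.etree.ElementTree as ET

# TMX marks the language of each variant with the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(tmx_string):
    """Parse a TMX document into a list of {language_code: segment} dicts,
    one dict per translation unit."""
    root = ET.fromstring(tmx_string)
    units = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")  # older TMX used plain lang=
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext())
        units.append(segs)
    return units

def parallel_concordance(units, query, src="en", tgt="fr"):
    """Return (source, target) sentence pairs whose source side contains query."""
    return [(u[src], u[tgt]) for u in units
            if src in u and tgt in u and query.lower() in u[src].lower()]

# A hypothetical two-unit TMX document for illustration.
SAMPLE_TMX = """<tmx version="1.4"><body>
<tu><tuv xml:lang="en"><seg>However, the committee disagreed.</seg></tuv>
<tuv xml:lang="fr"><seg>Toutefois, la commission n'etait pas d'accord.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>The report was adopted.</seg></tuv>
<tuv xml:lang="fr"><seg>Le rapport a ete adopte.</seg></tuv></tu>
</body></tmx>"""

units = read_tmx(SAMPLE_TMX)
hits = parallel_concordance(units, "however")
```

A query for a concessive connective such as "however" thus returns the aligned French sentence, which is exactly the kind of cross-linguistic correspondence data discussed throughout this chapter.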

12.4.2 Resources

• OPUS project (Tiedemann 2012, 2016), a large collection of freely available parallel corpora: its current version covers 200 languages and language variants and contains over 28 billion tokens, and the collection is constantly growing, in terms of both coverage and size. Compared with other non-directional parallel corpora, OPUS has two major advantages: (i) rather than being restricted to administrative and legal texts (mainly EU and UN), it covers a relatively wide range of other genres and text types, such as user-contributed movie and TV show subtitles, software localization, and multilingual wikis; (ii) a number of poorly-resourced and non-EU language pairs are well represented (albeit often through an indirect translation relationship; e.g. in the LX-LY language pair, the two languages LX and LY are both translations from the source language LZ). http://opus.nlpl.eu/ (accessed 22 May 2019).



• ParaCrawl (Web-Scale Parallel Corpora for Official European Languages): parallel corpora for various languages paired with English, created by crawling websites. https://paracrawl.eu/index.html (accessed 22 May 2019).
• CLARIN’s Key Resource Families – parallel corpora (Fišer et al. 2018): many parallel corpora can be downloaded from the CLARIN webpage. https://www.clarin.eu/resource-families/parallel-corpora (accessed 22 May 2019).

12.4.3 Surveys of Available Parallel Corpora

A large number of parallel corpora have been mentioned or discussed in this chapter, but it was outside the scope of the present overview to list all available parallel corpora. As a matter of fact, there is as yet no up-to-date digital database documenting all existing parallel corpora (be they bilingual or multilingual, directional or non-directional, developed for cross-linguistic research and/or natural language processing). However, there are some promising initiatives in this direction, such as Mikhailov and Cooper’s (2016) survey, the “Universal Catalogue” of the European Language Resources Association (ELRA) (http://www.elra.info/en/catalogues/universal-catalogue/) (accessed 22 May 2019), CLARIN’s overview of parallel corpora (https://www.clarin.eu/resource-families/parallel-corpora) (accessed 22 May 2019), and the TransBank project (https://transbank.info/) (accessed 22 May 2019).

Further Reading

Johansson, S. 2007. Seeing through Multilingual Corpora. On the use of corpora in contrastive studies. Amsterdam/Philadelphia: John Benjamins.
Johansson’s monograph is a must-read for anyone interested in corpus-based contrastive linguistics. The book provides a highly readable introduction to corpus design and use in contrastive linguistics. It also offers a range of exemplary case studies contrasting lexis, syntax, and discourse on the basis of parallel corpus data.

Mikhailov, M., and Cooper, R. 2016. Corpus Linguistics for Translation and Contrastive Studies. A guide for research. London/New York: Routledge.
In this accessible guide for research, Mikhailov & Cooper provide detailed information on parallel corpus compilation and describe a wide range of search procedures that are commonly used in corpus-based contrastive and translation studies. The book also offers a useful survey of some of the available parallel corpora.



Zanettin, F. 2012. Translation-Driven Corpora. Corpus Resources for Descriptive and Applied Translation Studies. London/New York: Routledge.
Zanettin’s coursebook is a practical introduction to descriptive and applied corpus-based translation studies. In addition to providing clear background information on the study of translation features in the field, it offers a wealth of useful information on translation-driven (including parallel) corpus design, encoding, annotation, and analysis. Each chapter is enriched with insightful case studies and hands-on tasks.

References

Aijmer, K., & Simon-Vandenbergen, A.-M. (2003). The discourse particle well and its equivalents in Swedish and Dutch. Linguistics, 41(6), 1123–1161.
Albl-Mikasa, M. (2017). ELF and translation/interpreting. In J. Jenkins, W. Baker, & M. Dewey (Eds.), The Routledge handbook of English as a Lingua Franca (pp. 369–384). London/New York: Routledge.
Altenberg, B. (1999). Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In H. Hasselgård & S. Oksefjell (Eds.), Out of corpora. Studies in honour of Stig Johansson (pp. 249–268). Amsterdam: Rodopi.
Altenberg, B., & Granger, S. (2002). Recent trends in cross-linguistic lexical studies. In B. Altenberg & S. Granger (Eds.), Lexis in contrast. Corpus-based approaches (pp. 3–48). Amsterdam/Philadelphia: John Benjamins.
Assis Rosa, A., Pięta, H., & Bueno Maia, R. (2017). Theoretical, methodological and terminological issues regarding indirect translation: An overview. Translation Studies, 10(2), 113–132.
Augustinus, L., Vandeghinste, V., & Vanallemeersch, T. (2016). Poly-GrETEL: Cross-lingual example-based querying of syntactic constructions. In Proceedings of the tenth international conference on language resources and evaluation (LREC 2016) (pp. 3549–3554). European Language Resources Association (ELRA).
Baisa, V., Michelfeit, J., Medveď, M., & Jakubíček, M. (2016). European Union language resources in Sketch Engine. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16). European Language Resources Association (ELRA).
Baker, M. (1993). Corpus linguistics and translation studies. Implications and applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology. In honour of John Sinclair (pp. 233–250). Amsterdam: John Benjamins.
Baker, M. (1995). Corpora in translation studies: An overview and some suggestions for future research. Target, 7(2), 223–243.
Baños, R., Bruti, S., & Zanotti, S. (Eds.). (2013).
Corpus linguistics and audiovisual translation: In search of an integrated approach. Special issue of Perspectives, 21(4).
Beeby Lonsdale, A. (2009). Directionality. In M. Baker & G. Saldanha (Eds.), Routledge encyclopedia of translation studies (pp. 84–88). Abingdon: Routledge.
Benko, V. (2016). Two years of Aranea: Increasing counts and tuning the pipeline. In Proceedings of the 10th international conference on language resources and evaluation (LREC’16) (pp. 4245–4248). European Language Resources Association (ELRA).
Bernardini, S. (2011). Monolingual comparable corpora and parallel corpora in the search for features of translated language. SYNAPS, 26, 2–13.
Bernardini, S., Ferraresi, A., Russo, M., Collard, C., & Defrancq, B. (2018). Building interpreting and intermodal corpora: A how-to for a formidable task. In M. Russo, C. Bendazzoli, & B. Defrancq (Eds.), Making way in corpus-based interpreting studies (pp. 21–42). Springer.
Bojar, O., Žabokrtský, Z., Dušek, O., Galuščáková, P., Majliš, M., Mareček, D., Maršík, J., Novák, M., Popel, M., & Tamchyna, A. (2012). The joy of parallelism with CzEng 1.0. In Proceedings



of the 8th international conference on language resources and evaluation (LREC-2012) (pp. 3921–3928). European Language Resources Association (ELRA).
Bowker, L., & Bennison, P. (2003). Student translation archive and student translation tracking system. Design, development and application. In F. Zanettin, S. Bernardini, & D. Stewart (Eds.), Corpora in translator education (pp. 103–117). Manchester: St. Jerome Publishing.
Cappelle, B., & Loock, R. (2013). Is there interference of usage constraints? A frequency study of existential there is and its French equivalent il y a in translated vs. non-translated texts. Target, 25(2), 252–275.
Cartoni, B., & Meyer, T. (2012). Extracting directional and comparable corpora from a multilingual corpus for translation studies. In Proceedings of the 8th international conference on language resources and evaluation (LREC-2012) (pp. 2132–2137). European Language Resources Association (ELRA).
Castagnoli, S., Ciobanu, D., Kübler, N., Kunz, K., & Volanschi, A. (2011). Designing a learner translator corpus for training purposes. In N. Kübler (Ed.), Corpora, language, teaching, and resources: From theory to practice (pp. 221–248). Bern: Peter Lang.
Čermák, F., & Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 13(3), 411–427.
Cettolo, M., Girardi, C., & Federico, M. (2012). WIT3: Web inventory of transcribed and translated talks. Proceedings of EAMT, 261–268.
De Sutter, G., Lefer, M.-A., & Delaere, I. (Eds.). (2017). Empirical translation studies: New methodological and theoretical traditions. Berlin/Boston: De Gruyter Mouton.
Delaere, I., & De Sutter, G. (2017). Variability of English loanword use in Belgian Dutch translations: Measuring the effect of source language, register, and editorial intervention. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 81–112).
Berlin/Boston: De Gruyter Mouton.
Dupont, M., & Zufferey, S. (2017). Methodological issues in the use of directional parallel corpora. A case study of English and French concessive connectives. International Journal of Corpus Linguistics, 22(2), 270–297.
Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM model 2. Proceedings of NAACL-HLT, 2013, 644–648.
Espunya, A. (2014). The UPF learner translation corpus as a resource for translator training. Language Resources and Evaluation, 48(1), 33–43.
Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32, 429–492.
Evert, S., & Neumann, S. (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 47–80). Berlin/Boston: De Gruyter Mouton.
Ferraresi, A., & Bernardini, S. (2019). Building EPTIC: A many-sided, multi-purpose corpus of EU parliament proceedings. In M. T. S. Nieto & I. Doval (Eds.), Parallel corpora: Creation and application. Amsterdam/Philadelphia: John Benjamins.
Fišer, D., Lenardič, J., & Erjavec, T. (2018). CLARIN’s key resource families. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (pp. 1320–1325).
Fløttum, K., Dahl, T., Didriksen, A. A., & Gjesdal, A. M. (2013). KIAP – Reflections on a complex corpus. Bergen Language and Linguistics Studies, 3(1), 137–150.
Frankenberg-Garcia, A., & Santos, D. (2003). Introducing COMPARA: The Portuguese-English parallel corpus. In F. Zanettin, S. Bernardini, & D. Stewart (Eds.), Corpora in translator education (pp. 71–87). Manchester: St. Jerome Publishing.
Granger, S., & Lefer, M.-A. (2016).
From general to learners’ bilingual dictionaries: Towards a more effective fulfillment of advanced learners’ phraseological needs. International Journal of Lexicography, 29(3), 279–295. Granger, S., & Lefer, M.-A. (2020). The multilingual student translation corpus: A resource for translation teaching and research. Language Resources and Evaluation: Online First.



Granger, S., Lerot, J., & Petch-Tyson, S. (Eds.). (2003). Corpus-based approaches to contrastive linguistics and translation studies. Amsterdam/New York: Rodopi.
Halverson, S. L. (2015). The status of contrastive data in translation studies. Across Languages and Cultures, 16(2), 163–185.
Hansen-Schirra, S., Neumann, S., & Steiner, E. (2012). Cross-linguistic corpora for the study of translations. Insights from the language pair English-German. Berlin: De Gruyter.
Izquierdo, M., Hofland, K., & Reigem, Ø. (2008). The ACTRES parallel corpus: An English–Spanish translation corpus. Corpora, 3(1), 31–41.
Jiménez Hurtado, C., & Soler Gallego, S. (2013). Multimodality, translation and accessibility: A corpus-based study of audio description. Perspectives, 21(4), 577–594.
Johansson, S. (2007). Seeing through multilingual corpora. On the use of corpora in contrastive studies. Amsterdam/Philadelphia: John Benjamins.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings of MT Summit X, 79–86.
Kruger, A., Wallmach, K., & Munday, J. (Eds.). (2011). Corpus-based translation studies. Research and applications. London/New York: Bloomsbury.
Kutuzov, A., & Kunilovskaya, M. (2014). Russian learner translator corpus. Design, research potential and applications. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech and dialogue. TSD 2014 (pp. 315–323). Springer.
Lapshinova-Koltunski, E. (2017). Exploratory analysis of dimensions influencing variation in translation. The case of text register and translation method. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 207–234). Berlin/Boston: De Gruyter Mouton.
Lefer, M.-A., & Grabar, N. (2015). Super-creative and over-bureaucratic: A cross-genre corpus-based study on the use and translation of evaluative prefixation in TED talks and EU parliamentary debates. Across Languages and Cultures, 16(2), 187–208.
Levshina, N. (2016). Verbs of letting in Germanic and Romance languages: A quantitative investigation based on a parallel corpus of film subtitles. Languages in Contrast, 16(1), 84–117.
Macken, L., De Clercq, O., & Paulussen, H. (2011). Dutch parallel corpus: A balanced copyright-cleared parallel corpus. Meta, 56(2), 374–390.
Mauranen, A. (2005). Contrasting languages and varieties with translational corpora. Languages in Contrast, 5(1), 73–92.
Meurant, L., Gobert, M., & Cleve, A. (2016). Modelling a parallel corpus of French and French Belgian sign language. In Proceedings of the 10th edition of the language resources and evaluation conference (LREC 2016) (pp. 4236–4240).
Mezeg, A. (2010). Compiling and using a French-Slovenian parallel corpus. In R. Xiao (Ed.), Proceedings of the international symposium on using corpora in contrastive and translation studies (UCCTS 2010) (pp. 1–27). Ormskirk: Edge Hill University.
Mikhailov, M., & Cooper, R. (2016). Corpus linguistics for translation and contrastive studies. A guide for research. London/New York: Routledge.
Neumann, S. (2013). Contrastive register variation. A quantitative approach to the comparison of English and German. Berlin/Boston: De Gruyter Mouton.
Noël, D. (2003). Translations as evidence for semantics: An illustration. Linguistics, 41(4), 757–785.
Obrusnik, A. (2014). Hypal: A user-friendly tool for automatic parallel text alignment and error tagging. In Eleventh international conference teaching and language corpora, Lancaster, 20–23 July 2014 (pp. 67–69).
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Padró, L., & Stanilovsky, E. (2012). FreeLing 3.0: Towards wider multilinguality. In Proceedings of the language resources and evaluation conference (LREC-2012). European Language Resources Association (ELRA).



Resnik, P., & Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.
Rosen, A. (2010). Mediating between incompatible tagsets. NEALT Proceedings Series, 10, 53–62.
Russo, M., Bendazzoli, C., & Sandrelli, A. (2006). Looking for lexical patterns in a trilingual corpus of source and interpreted speeches: Extended analysis of EPIC. Forum, 4(1), 221–254.
Russo, M., Bendazzoli, C., & Defrancq, B. (Eds.). (2018). Making way in corpus-based interpreting studies. Springer.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing.
Singh, S., McEnery, T., & Baker, P. (2000). Building a parallel corpus of English/Panjabi. In J. Véronis (Ed.), Parallel text processing: Alignment and use of translation corpora (pp. 335–346). Kluwer Academic Publishers.
Tiedemann, J. (2011). Bitext alignment. Morgan & Claypool Publishers.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th international conference on language resources and evaluation (LREC’2012) (pp. 2214–2218).
Tiedemann, J. (2016). OPUS – Parallel corpora for everyone. Baltic Journal of Modern Computing, 4(2), 384.
Toury, G. (2012). Descriptive translation studies – And beyond. Amsterdam/Philadelphia: John Benjamins.
Uzar, R. S. (2002). A corpus methodology for analysing translation. Cadernos de Tradução, 9(1), 235–263.
Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., & Trón, V. (2007). Parallel corpora for medium density languages. In N. Nicolov, K. Bontcheva, G. Angelova, & R. Mitkov (Eds.), Recent advances in natural language processing IV: Selected papers from RANLP 2005 (pp. 247–258). Amsterdam & Philadelphia: John Benjamins.
Volk, M., Göhring, A., Rios, A., Marek, T., & Samuelsson, Y. (2015). SMULTRON (version 4.0) – The Stockholm MULtilingual parallel TReebank.
An English-French-German-Quechua-Spanish-Swedish parallel treebank with sub-sentential alignments. Institute of Computational Linguistics, University of Zurich.
Vondřička, P. (2014). Aligning parallel texts with InterText. In Proceedings of the ninth international conference on language resources and evaluation (LREC 2014) (pp. 1875–1879).
Waldenfels, R. V. (2011). Recent developments in ParaSol: Breadth for depth and XSLT based web concordancing with CWB. In D. Majchráková & R. Garabík (Eds.), Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Modra, Slovakia, 20–21 October 2011 (pp. 156–162). Bratislava: Tribun EU.
Xiao, R. (Ed.). (2010). Using corpora in contrastive and translation studies. Newcastle upon Tyne: Cambridge Scholars Publishing.
Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Nations parallel corpus v1.0. Language Resources and Evaluation (LREC’16).
Zufferey, S., & Cartoni, B. (2012). English and French causal connectives in contrast. Languages in Contrast, 12(2), 232–250.

Chapter 13

Learner Corpora Gaëtanelle Gilquin

Abstract This chapter deals with learner corpora, that is, collections of (spoken and/or written) texts produced by learners of a language. It describes their main characteristics, with particular emphasis on those that are distinctive of learner corpora. Special types of corpora are introduced, such as longitudinal learner corpora or local learner corpora. The issues of the metadata accompanying learner corpora and the annotation of learner corpora are also discussed, and the challenges they involve are highlighted. Several methods of analysis designed to deal with learner corpora are presented, including Contrastive Interlanguage Analysis, Computer-aided Error Analysis and the Integrated Contrastive Model. The development of the field of learner corpus research is sketched, and possible future directions are examined, in terms of the size of learner corpora, their diversity, or the techniques of compilation and analysis. The chapter also features representative corpus-based studies of learner language, representative learner corpora, tools and resources related to learner corpora, and annotated references for further reading.

13.1 Introduction

Learner corpora are corpora representing written and/or spoken ‘interlanguage’, that is, language produced by learners of a language. Typically, the term covers both foreign language and second language situations, that is, respectively, situations in which the target language has no official function in the country and is essentially confined to the classroom (and, possibly, international communication), and situations in which the target language is learned by immigrants in a country where it is the dominant native language. It is normally not used to refer to corpora of child language, which are made up of data produced by children acquiring their first language (see Chap. 14), nor corpora of institutionalized second-language varieties,

G. Gilquin
Université catholique de Louvain, Centre for English Corpus Linguistics, Louvain-la-Neuve, Belgium

© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_13




which are collected in countries that have the target language as an official, though not native, language (cf. ‘New Englishes’ like those represented in the International Corpus of English), although their data may also reflect a process of learning or acquisition. While the first corpora were compiled in the 1960s, it took some 30 years before the first learner corpora started to be collected, both in the academic world (International Corpus of Learner English (ICLE)) and in the publishing world (Longman Learners’ Corpus). Initially, they were corpora of written learner English, keyboarded from handwritten texts. Gradually, however, learner corpora representing other languages as well as spoken learner corpora made their appearance, while written learner corpora were increasingly compiled directly from electronic sources, which facilitated the compilation process. The nature of learner language made it necessary to rethink and adapt some of the general principles of corpus data collection and analysis. This led, among other things, to the creation of new types of corpora, like longitudinal corpora representing different stages in the language learning process, to the collection of new types of metadata, such as information about the learner’s mother tongue and exposure to the target language, and to the use of new methods to annotate or query the corpus, for example to deal with the errors found in learner corpora. These specificities, and others, will be considered in Sect. 13.2.

13.2 Fundamentals

13.2.1 Types of Learner Corpora

Like other corpora, learner corpora can include written, spoken and/or multimodal data; they can be small or large; and they can represent any (combination of) languages. The ‘Learner Corpora around the World’ resource (see Sect. 13.4) reveals that the majority of learner corpora are made up of written data, and that these data often correspond to learner English. Other types of corpora, however, including spoken learner corpora and corpora representing other target languages, are becoming more widely available. As for size, many of the learner corpora listed in the ‘Learner Corpora around the World’ resource are under one million words, with some of them not even reaching 100,000 words and a couple just containing some 10,000 words. It is likely that among those learner corpora that are not listed but exist ‘out there’, most can be counted in tens of thousands rather than in millions of words. Yet, there are also learner corpora that are much larger, especially those that have continued to grow over the years (like the Longman Learners’ Corpus, which now comprises ten million words) and those that come out of the testing/assessment world, such as EFCAMDAT (Geertzen et al. 2014) or TOEFL11 (Blanchard et al. 2013).



One of the defining features of corpora is that they should be made up of authentic texts. This concept of authenticity, however, tends to be problematic in the case of learner corpora. Learner language, most of the time, is not produced purely for communicative purposes, but as part of some pedagogical activity, to practise one’s language skills. Writing an argumentative essay or role-playing with a classmate, for example, may be natural tasks in the classroom, but they are not authentic in the sense of being “gathered from the genuine communications of people going about their normal business” (Sinclair 1996). Our understanding of the concept of authenticity must therefore be adapted to the context of learner corpora and encompass tasks that would not be described as natural in other contexts. It must also be acknowledged that some learner corpora will be more “peripheral” (Nesselhauf 2004:128), as is the case of spoken learner corpora like the Giessen-Long Beach Chaplin Corpus (Jucker et al. 2003) which are elicited on the basis of a picture or a movie and thus include data of a more constrained nature. Another, related feature of learner language is that it usually does not cover the whole spectrum of genres that is characteristic of native varieties. Because its use tends to be associated with educational settings, there are certain genres that are difficult to capture or simply do not exist in the target language. Having a spontaneous conversation with a friend, for example, is more likely to occur in the mother tongue (L1) than in the target language (L2). As a result, most learner corpora represent one of a limited number of genres, including argumentative essays, academic writing, narratives and interviews. One type of learner corpus that is worth singling out, because it is specific to varieties that are in the process of being learned or acquired, is the longitudinal learner corpus. 
In such a corpus, data are collected from the same subjects at different time intervals, so as to reflect the development of their language skills over time. Belz and Vyatkina (2005), for example, use longitudinal data from the Telecollaborative Learner Corpus of English and German (Telekorp) to study the development of German modal particles over a period of 9 weeks, with one data collection point every week. Most longitudinal learner corpora, however, are less ‘dense’, in that they include data collected at longer intervals, sometimes only once or twice a year (cf. LONGDALE, the Longitudinal Database of Learner English; Meunier 2016). Note that non-longitudinal learner corpora can sometimes also be used to investigate the development of learner language. Thus, if a learner corpus contains data produced by distinct learners from different proficiency levels, like the National Institute of Information and Communications Technology Japanese Learner English (NICT JLE) Corpus (Izumi et al. 2004), it is possible to identify developmental patterns by comparing subcorpora representing different levels, even if all the data were collected at a single point in time. Such learner corpora are called ‘quasi-longitudinal’ corpora and, because they are easier to collect than longitudinal corpora, they have often been used to study interlanguage development.



13.2.2 Metadata

Given the “inherent heterogeneity of learner output” (Granger 1998:177), it is crucial that information about the data included in a learner corpus should be available. Learner corpora tend to be characterized by a large amount of such metadata. These metadata can have to do with the text itself (genre, length, conditions in which the task took place, etc.), but they can also concern the learners: what is their mother tongue? how old are they? how long have they been learning the target language? what kind of exposure to the target language have they received? do they know any other languages? etc. Usually, some of these variables are controlled for in the very design of the corpus, in the sense that the corpus only includes data corresponding to a certain value, e.g. only written essays (in ICLE) or only native speakers of English learning Spanish (in the Spanish Learner Language Oral Corpora (SPLLOC)). For the variables that are not controlled for during the compilation of the corpus, it is often possible for users to find information that enables them either to use a subset of the data meeting specific criteria (e.g. only texts written in exam conditions) or to examine the distribution of the results according to these variables (e.g. percentage of a given linguistic phenomenon among male vs female learners). Using the Multilingual Platform for European Reference Levels: Interlanguage Exploration in Context (MERLIN),1 for example, one can select a number of criteria, like the task (essay, email, picture description, etc.), the learner’s mother tongue, his/her age, gender or proficiency level according to the Common European Framework of Reference for Languages (CEFR), in order to define a subcorpus and then restrict the search to this subcorpus.
Figure 13.1 is a screenshot from the MERLIN website that shows the selection of a subcorpus made up of data produced by French-speaking learners of Italian with an A2 CEFR level (test and overall rating) and aged between 30 and 59. The ICLE interface (Granger et al. 2009) also allows users to define a subcorpus according to certain criteria. In addition, it makes it possible to visualize, in the form of tables and graphs, the distribution of the results according to all the other variables encoded in the metadata. Figure 13.2 is a screenshot from the ICLE interface that represents the output of a search for the word informations in the ICLE data produced by learners with Chinese (or Chinese-Cantonese/Chinese-Mandarin) as their mother tongue (ICLE-CH). More particularly, the graph shows the distribution of the results according to the time available to write the essay and indicates that the incorrect pluralization of information is more frequent in timed than in untimed essays. Despite the wealth of metadata that accompany most learner corpora and despite the facilities that some of these corpora provide to access them, it must be recognized that metadata are not used to their full potential in learner corpus research. One variable that is regularly taken into account is that of the learner’s L1 background (e.g. Golden et al. 2017, based on ASK, the Norsk andrespråkskorpus), which makes

1 http://merlin-platform.eu/. Accessed 22 May 2019.



Fig. 13.1 Selection of a subcorpus on the MERLIN platform (criteria: target language = Italian; mother tongue = French; CEFR level of test = A2; overall CEFR rating = A2; age = 30–59). (Source: http://merlin-platform.eu/)

[Bar chart: occurrences distribution for selected profiles, relative frequencies per 100,000 words: No Timing 0.1; Timed 2.8; Unknown 0.6]
Fig. 13.2 Relative frequency of informations in ICLE-CH according to time available. (Source: Granger et al. 2009)

it possible to identify probable cases of transfer from the L1. Sometimes it is another variable that is investigated, for example exposure to the target language through a stay abroad (Gilquin 2016) or presence of a native or non-native interlocutor (Crossley and McNamara 2012). Studies that examine the possible impact of several variables, on the other hand, are relatively rare, although such studies can offer important insights into the factors that are likely to affect learner language. The problem with this type of approach is that, because of the relatively limited size of most learner corpora, selecting many variables may result in a very small subset of



data (see Callies 2015:52), which, in effect, may make any kind of generalization impossible.
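In practice, metadata-driven subcorpus selection and the per-variable breakdown of results amount to filtering records and normalising counts. A minimal sketch in Python, with invented field names and counts (real corpora and query interfaces define their own metadata schemas):

```python
from collections import defaultdict

# Hypothetical metadata records, one per essay; field names and counts are
# invented for illustration (real learner corpora define their own schemas).
essays = [
    {"id": "e1", "timing": "timed",   "tokens": 620, "hits": 2},
    {"id": "e2", "timing": "untimed", "tokens": 540, "hits": 0},
    {"id": "e3", "timing": "timed",   "tokens": 480, "hits": 1},
    {"id": "e4", "timing": "unknown", "tokens": 700, "hits": 1},
]

def subcorpus(records, **criteria):
    """Keep only the records whose metadata match every criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

def rel_freq_per_100k(records):
    """Relative frequency of the search hits per 100,000 words."""
    tokens = sum(r["tokens"] for r in records)
    hits = sum(r["hits"] for r in records)
    return 100_000 * hits / tokens if tokens else 0.0

# Distribution of the results across one metadata variable.
by_timing = defaultdict(list)
for r in essays:
    by_timing[r["timing"]].append(r)
for timing, recs in sorted(by_timing.items()):
    print(timing, round(rel_freq_per_100k(recs), 1))
# timed 272.7, unknown 142.9, untimed 0.0
```

Chaining several criteria in `subcorpus` makes the matching sample shrink quickly, which is precisely the small-subset problem noted above when many variables are selected at once.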

13.2.3 Annotation

Learner corpora can be enriched by means of the same types of annotation as all other corpora, including part-of-speech (POS) tagging, parsing, semantic annotation, pragmatic annotation and, for spoken learner corpora, phonetic and prosodic annotation (see Chaps. 2 and 11). One issue to bear in mind, however, is that, with very few exceptions, the tools that one has to rely on to annotate learner corpora automatically are tools that have been designed to deal with native data. Applying them to non-native data may therefore cause certain difficulties. For POS tagging, for example, the many spelling errors found in written learner corpora have been shown to lower the accuracy of POS taggers (de Haan 2000; Van Rooy and Schäfer 2002). As for parsing, punctuation and spelling errors in written learner corpora have the highest impact according to Huang et al. (2018), and in spoken learner corpora Caines and Buttery (2014) have demonstrated that disfluencies and (formal and idiomatic) errors can lead to a 25% decrease in the success rate of the parser. However, while tools and formats of annotation specifically designed for learner data would of course be desirable (as suggested by Díaz-Negrillo et al. (2010) for POS tagging), it must be underlined that some attempts to automatically annotate learner corpora with off-the-shelf tools have been quite successful. Granger et al. (2009:16), for example, report accuracy rates between 95% and 99.1% for the POS tagging of ICLE. A first attempt at POS tagging the Louvain International Database of Spoken English Interlanguage (LINDSEI; Gilquin et al. 2010) revealed an accuracy rate of 92% (Gilquin 2017). As for parsing, it seems to be more affected by the nature of learner language than POS tagging (see Huang et al. 2018). However, Geertzen et al.
(2014:247) note that the parser they used actually scored slightly better on EFCAMDAT, a written learner corpus, than on the Wall Street Journal corpus (89–92% for EFCAMDAT, to be compared with 84–87% for the Wall Street Journal). These reasonably good accuracy rates – given the non-native nature of the corpora – may be explained by the fact that the errors and disfluencies found in learner language are compensated by the relatively simple structure of the sentences which learners tend to produce (see Meunier (1998) on POS tagging and Huang et al. (2018) on parsing). Another possible explanation is that most learner corpora represent university-level interlanguage (like ICLE and LINDSEI) and that such data are arguably easier to deal with for a POS tagger or parser than data produced at a lower proficiency level. Geertzen et al. (2014:248) point out that the accuracy rate of the parser was higher on the more advanced EFCAMDAT data, although “the effect seem[ed] small”. Next to these automated methods of annotation, learner corpora can also be annotated manually. While a full annotation of the corpus may not be feasible (nor, in fact, desirable), one type of annotation that may be particularly useful is problem-oriented tagging (de Haan 1984). This tagging



Fig. 13.3 Example of an error-tagged sentence in Falko (FalkoEssayL1v2.0: dhw015_2007_06) as displayed on the ANNIS platform

is geared towards a specific research question and consists in annotating only those items that are of direct relevance to the research question. Spoelman’s (2013) study of partitive case-marked noun phrases in learner Finnish, for instance, involved tagging instances of this phenomenon, depending on the category they represented. Such tagging then opens the way to automatic treatment of the annotated corpus. Besides these types of annotation that are common to all corpora, there is one that is typical of learner corpora (and also child-language corpora, see Chap. 14), namely error tagging, which consists in the annotation of the errors found in a corpus (syntactic errors, unusual collocations, mispronunciations, etc.). The Fehlerannotiertes Lernerkorpus (‘error annotated learner corpus’, Falko), for instance, is an error-tagged corpus of learner German. The annotation of errors is usually accompanied by a correction (the ‘target hypothesis’) as well as a tag indicating the category of the error (e.g. spelling error, error in verb morphology, complementation error). Figure 13.3 shows an error-tagged sentence from Falko, as retrieved from the ANNIS platform.2 Falko uses a multi-layer standoff architecture, in which each layer represents an independent level of annotation (see also Chap. 3). The ‘tok’ (= token) layer shows the original sentence as produced by the learner. ‘ZH1’ provides a corrected version of the sentence (ZH = Zielhypothese ‘target hypothesis’), with ‘ZH1Diff’ highlighting the differences between the original and corrected versions, and ‘ZH1lemma’ and ‘ZH1pos’ corresponding, respectively, to a lemmatized and POS-tagged version of the sentence. In this case, the learner has mistakenly used the article (‘ART’) der instead of the correct form die, an error which involves a changed token (‘CHA’) in the target hypothesis. 
Note that the multi-layer architecture of the corpus allows for enough flexibility to encode competing target hypotheses (Reznicek et al. 2013). In Falko, the step of attributing an ‘edit tag’ to the error (change, insertion, deletion, etc.) can

2 https://korpling.german.hu-berlin.de/falko-suche/. Accessed 22 May 2019.



be automated by comparing the learner text and the (manually encoded) target hypothesis/hypotheses. In learner corpus research, attempts have also been made to automate the process of error detection itself, although this is usually restricted to specific phenomena, for example preposition errors (De Felice and Pulman 2009), article errors (Rozovskaya and Roth 2010) or spelling errors (Rayson and Baron 2011). Most of the time, however, the whole error tagging procedure is done manually, a time-consuming task that can be facilitated by the use of an error editor like the Université Catholique de Louvain Error Editor (UCLEE; see Dagneaux et al. 1998 and Sect. 13.4). Once a learner corpus has been error tagged, it becomes possible to automatically extract instances of erroneous usage, which, as will be described in the next section, lies at the basis of one of the methods of analysis that have been developed to deal with learner corpora.
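The automated attribution of edit tags by comparing the learner text with a target hypothesis can be sketched as a token-level sequence alignment. The CHA/INS/DEL labels follow the scheme described above for Falko, but the alignment itself is a generic `difflib` illustration, not Falko's actual implementation:

```python
import difflib

# Sketch of the automated 'edit tag' step: align the learner's tokens with
# the (manually written) target hypothesis and label each difference.
def edit_tags(learner_tokens, target_tokens):
    sm = difflib.SequenceMatcher(a=learner_tokens, b=target_tokens)
    tags = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":      # changed token(s)
            tags.append(("CHA", learner_tokens[i1:i2], target_tokens[j1:j2]))
        elif op == "delete":     # token(s) present in the learner text only
            tags.append(("DEL", learner_tokens[i1:i2], []))
        elif op == "insert":     # token(s) added in the target hypothesis
            tags.append(("INS", [], target_tokens[j1:j2]))
    return tags

# The der/die article error discussed above yields a single changed token.
learner = "ich habe der Frau gesehen".split()
target = "ich habe die Frau gesehen".split()
print(edit_tags(learner, target))  # → [('CHA', ['der'], ['die'])]
```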

13.2.4 Methods of Analysis

In addition to the application of well-established corpus linguistic methods, like the use of concordances (Chap. 8), frequency lists (Chap. 4) or collocations (Chap. 7), a number of techniques have been developed to deal specifically with learner corpora. Among these, we can mention Computer-aided Error Analysis (Dagneaux et al. 1998), Contrastive Interlanguage Analysis (Granger 1996) and the Integrated Contrastive Model (Granger 1996; Gilquin 2000/2001). Computer-aided Error Analysis (or CEA) relies on the use of an error-tagged learner corpus (cf. Sect. 13.2.3). Through error tagging, errors are identified and categorized according to a taxonomy, such as that developed by Dagneaux et al. (2008) to error tag ICLE. These error tagging systems are usually hierarchical, distinguishing for example between grammar, lexis, lexico-grammar and style at a high level of annotation, and then making further distinctions within each of these categories, for example grammatical errors related to nouns, pronouns or verbs, and within grammatical verb errors, those having to do with number, tense, voice, etc. This hierarchy is reflected in Dagneaux et al.’s (2008) tagset: grammatical errors are indicated by the letter ‘G’, grammatical verb errors by ‘GV’, and grammatical errors in verb tense by ‘GVT’. Such tags make it very easy to automatically retrieve all the annotated errors in a certain category (e.g. all the complementation errors) or all the occurrences of a word representing a certain type of error (e.g. all the cases where the verb enjoy is used with an erroneous complement). These errors are the focus of analysis of CEA, as was the case in traditional error analysis (see James 1998).
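Because such tagsets are hierarchical by prefix (‘G’, ‘GV’, ‘GVT’), retrieving all errors in a category reduces to prefix matching on the tags. A sketch over an invented inline annotation format (real error-tagged corpora each use their own markup; only the tag names G/GV/GVT are taken from the tagset described above):

```python
import re

# Invented inline format "(TAG) erroneous $correction$", for illustration only.
text = ("Last year I (GVT) go $went$ to London and (GA) a $the$ weather "
        "was nice, although I (XVCO) enjoyed to walk $enjoyed walking$ .")

TAG = re.compile(r"\((?P<tag>[A-Z]+)\)\s+(?P<err>.+?)\s+\$(?P<corr>.+?)\$")

def errors_in_category(annotated_text, prefix):
    """All annotated errors whose tag starts with the given prefix."""
    return [(m["tag"], m["err"], m["corr"])
            for m in TAG.finditer(annotated_text)
            if m["tag"].startswith(prefix)]

print(errors_in_category(text, "G"))   # all grammatical errors (GVT and GA)
print(errors_in_category(text, "GV"))  # grammatical verb errors only
```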
Unlike traditional error analysis, however, CEA allows the linguist to examine the errors in context, to consider correct uses along with incorrect uses, and to easily quantify the results (percentage of incorrect uses out of all uses or relative frequency of the error per 10,000 words, for instance). Contrastive Interlanguage Analysis (CIA) consists of two types of comparison: a comparison of learner language with native language and a comparison between different learner varieties (Granger 2009:18). These two types of comparison should

13 Learner Corpora


preferably be combined with each other, but they can also be drawn separately. The comparison between native and learner language lies at the basis of a majority of the studies in learner corpus research (Flowerdew 2015:469). Such a comparison helps identify non-standard forms (cf. CEA), but also, importantly, instances of ‘overuse’ and ‘underuse’ (or ‘overrepresentation’ and ‘underrepresentation’, see Granger 2015:19). These terms, which are not meant as being evaluative but purely descriptive, refer to cases in which a given linguistic phenomenon (word, construction, function, etc.) is used significantly more or significantly less in the learner corpus than in a comparable native corpus, as indicated by a measure of statistical significance. The study of over- and underuse has been a real eye-opener in learner corpus research, because it has shown that the foreign-soundingness of learner language, especially at advanced levels of proficiency, is to be attributed as much (or perhaps even more) to differences in the frequency of use as to downright errors (Granger 2004:132). The second type of comparison in CIA involves comparing different learner varieties, most notably varieties produced by learners from different L1 backgrounds. Such a comparison helps detect possible traces of transfer from the mother tongue: if a feature is only found among learners from a specific L1 population, say Italian learners of French, it might be a sign that it is the result of crosslinguistic influence, that is, interference from the L1 (Italian) on the L2 (French) (see Jarvis and Pavlenko 2008, on crosslinguistic influence, and Osborne 2015, on its link with learner corpus research). 
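The over-/underuse comparison described above hinges on a measure of statistical significance applied to two frequencies and two corpus sizes. One measure widely used in corpus linguistics for this purpose, though not prescribed by the chapter, is the log-likelihood ratio; a self-contained sketch with hypothetical counts:

```python
import math

def log_likelihood(freq1, total1, freq2, total2):
    """Log-likelihood ratio (G2) for a word occurring freq1 times in a
    corpus of total1 words and freq2 times in a corpus of total2 words."""
    expected1 = total1 * (freq1 + freq2) / (total1 + total2)
    expected2 = total2 * (freq1 + freq2) / (total1 + total2)
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts: 140 hits in 200,000 learner words versus
# 90 hits in 300,000 native (reference) words.
g2 = log_likelihood(140, 200_000, 90, 300_000)
overused = 140 / 200_000 > 90 / 300_000
# G2 above 3.84 is significant at p < 0.05 (1 degree of freedom)
print(round(g2, 2), "overuse" if overused else "underuse")
```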
The learner varieties that are compared with each other could however differ along another dimension, which could be any of the variables encoded in the corpus metadata (comparison of foreign and second language learners, of male and female learners, of learners who have spent some or no time in a target language country, etc.). Recently, a revised version of CIA, called CIA2, has been proposed by Granger (2015). Among its major developments, we can mention the fact that this revised model no longer advocates the exclusive use of native language as a reference point against which to compare learner varieties. Instead, it promotes the comparison of “interlanguage varieties” against one or several “reference language varieties” which, in the case of English, could include, in addition to native English, New Englishes (like Hong Kong English or Singapore English) and English as a Lingua Franca (i.e. English as used by competent L2 users). CIA2 also includes an explicit reference to a number of variables (diatypic, dialectal, task and learner variables), thus encouraging researchers to take these into account in the application of the model. The Integrated Contrastive Model (ICM) is partly based on CIA, but it also integrates a contrastive analysis (CA), comparing the target language and the mother tongue thanks to comparable or parallel corpora (cf. Chap. 12). The model aims to predict possible cases of negative transfer (when the CA shows the target language and the mother tongue to differ in a certain respect) and seeks to explain problematic uses – misuse, overuse, underuse – in the learner corpus (by checking whether they could be due to discrepancies between the target language and the mother tongue). It thus has both predictive and diagnostic power. By combining careful analyses of learner, native and bilingual corpora, the model avoids the trap of misattributing



certain phenomena to transfer simply because intuition seems to suggest that this is a plausible interpretation. Liu and Shaw (2001:179), for example, claim that the frequent use of the causative constructions make sb/sth feel and make sb/sth become by Chinese learners of English “may be attributable to L1 interference” because such sequences “have word for word translational equivalents in Chinese”. However, such a claim would require a thorough contrastive analysis of English and Chinese to confirm the equivalence between the English and the Chinese constructions. Moreover, a study of causative constructions in different varieties of learner English has demonstrated that the overuse of make sb/sth feel and make sb/sth become is in fact characteristic of several other L1 populations of learners (Gilquin 2012), which suggests that Liu and Shaw’s (2001) results do not point to a case of transfer (or at least not only), but a more general tendency. The last few years have witnessed a general refinement of the methods of analysis in learner corpus research. One major change is the increasingly prominent role of statistics in the field. While statistical significance testing has almost always been part of learner corpus studies, through the notions of over- and underuse, criticism has recently been voiced against this type of monofactorial statistics. Gries and Deshors (2014), for example, argue that, instead of comparing overall frequencies in learner and native corpora, researchers should look at the linguistic contexts in which an item is used – or not – by learners and native speakers, as determined by a multifactorial analysis involving a variety of morpho-syntactic and semantic features. Statistics also help researchers go beyond the typical global approach of corpus linguistics (studying corpora as wholes), by taking corpus/learner variation into account through statistical techniques such as Wilcoxon tests (e.g. Paquot 2014) or linear modelling (e.g. 
Meunier and Littré 2013). By adopting this more individual type of approach, learner corpus research is following the general quantitative trend in corpus linguistics as well as theories in second language acquisition (SLA) research like the Dynamic Systems Theory, which focuses on “individual developmental paths” (de Bot et al. 2007:14). The link with theoretical frameworks, incidentally, is another way in which learner corpus research has evolved over the last few years. More and more learner corpus studies nowadays are grounded in SLA theories (see Myles 2015) or usage-based theories like cognitive linguistics (see De Knop and Meunier 2015), which gives such studies a more solid background and helps improve their explanatory power. Finally, methodological refinement in learner corpus research also comes from its rapprochement with the field of natural language processing, which has provided powerful tools and techniques for the automated analysis of large datasets (see Meurers 2015).
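The individual-learner perspective mentioned above compares distributions of per-learner rates rather than pooled corpus counts. As an illustration, here is a self-contained Wilcoxon rank-sum (Mann-Whitney U) statistic over hypothetical per-learner rates; a real analysis would use a statistics package and derive a p-value:

```python
def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic for sample_a (equivalent to the Wilcoxon
    rank-sum test), computed from average ranks so that ties are handled.
    A full test would compare U to its null distribution for a p-value."""
    combined = sorted(sample_a + sample_b)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        average_rank = (i + 1 + j) / 2   # mean of rank positions i+1 .. j
        for k in range(i, j):
            ranks[combined[k]] = average_rank
        i = j
    rank_sum_a = sum(ranks[value] for value in sample_a)
    return rank_sum_a - len(sample_a) * (len(sample_a) + 1) / 2

# Hypothetical per-learner rates (per 1,000 words) of some feature in
# two groups of learners.
group1 = [4.1, 5.0, 6.2, 7.3]
group2 = [1.2, 2.0, 2.8, 3.5]
print(mann_whitney_u(group1, group2))  # → 16.0 (complete separation)
```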



Representative Study 1

Altenberg, B., and Granger, S. 2001. The grammatical and lexical patterning of MAKE in native and non-native student writing. Applied Linguistics 22(2):173–194.

This study of the grammatical and lexical patterning of the high-frequency verb make among French- and Swedish-speaking learners of English seeks to test a number of hypotheses from the literature, e.g. the idea that a core verb like make is safe to use (hence not error-prone) or the contradictory claims that high-frequency verbs tend to be underused/overused by learners. It uses the French and Swedish components of ICLE, as well as a comparable native English corpus, the Louvain Corpus of Native English Essays (LOCNESS). The article provides a good overview of some of the techniques that can be applied to learner corpus data, including a comparison of the overall frequency of make in the three (sub)corpora, an examination of the distribution of its main semantic uses, a phraseological analysis of the collocates of the verb, and a syntactic and error analysis of its causative uses. In addition, the potential role of the mother tongue is examined, and some possible cases of transfer are highlighted, as well as strategies that appear to be common to the two groups of learners (e.g. a “decompositional” strategy which results in constructions like make the family live instead of support the family). Interestingly, the article also discusses the methodological issue of how accurate and useful an automatic extraction of collocates is. More generally, it demonstrates the benefits of combining an automatic and manual analysis, as well as a quantitative and qualitative approach.

Representative Study 2

Lüdeling, A., Hirschmann, H., and Shadrova, A. 2017. Linguistic models, acquisition theories, and learner corpora: Morphological productivity in SLA research exemplified by complex verbs in German. Language Learning 67(S1):96–129.

This study focuses on German as a foreign language, and how advanced learners acquire morphological productivity for German complex verbs, that is, prefix verbs (like verstehen ‘to understand’) and particle verbs (like aufstehen ‘to get up’). Looking at the treatment of morphological productivity in different acquisition models, including generative and usage-based models, the authors put forward a number of hypotheses, which are then tested against a learner corpus. The corpus is Falko (see Sect. 13.2.3) and its



L1 equivalent. The study combines Contrastive Interlanguage Analysis and Computer-aided Error Analysis. First, it compares the frequency and uses of complex verbs in learner and native German. Second, it relies on the error tagging of Falko to identify grammatical and ungrammatical uses of complex verbs and to determine error types. The results show that learners tend to underuse prefix verbs and, especially, particle verbs, and that the variance between individual learners is greater than that between individual native speakers. Learners also appear to use complex verbs productively, although the new forms they produce sometimes result in errors. The paper illustrates some of the latest developments in learner corpus research, such as a solid grounding in theories and a combined aggregate and individual approach. It also makes the interesting methodological point that, through corpus annotation, categorization of the data can be made explicit and available to other researchers.

Representative Study 3

Alexopoulou, T., Geertzen, J., Korhonen, A., and Meurers, D. 2015. Exploring big educational learner corpora for SLA research: Perspectives on relative clauses. International Journal of Learner Corpus Research 1(1):96–129.

This study is based on one of the large learner corpora coming out of the testing/assessment world (see Sect. 13.2.1), namely EFCAMDAT, the EF Cambridge Open Language Database. EFCAMDAT is made up of 33 million words, representing 85,000 learners and spanning 16 proficiency levels. Although the corpus includes longitudinal data for certain individual learners, this study adopts an aggregate approach, considering each proficiency level as a ‘section’, but with the acknowledgment that “combining the cross-sectional perspective with an analysis of individual learner variation is a necessary next step” (p. 126). The paper investigates the development of learners’ use of relative clauses. Like Lüdeling et al. (2017), it is grounded in theories of (second language) acquisition. In addition, it illustrates the rapprochement between learner corpus research and natural language processing (NLP), since it makes use of NLP tools and techniques to automatically extract relative clauses from a “big data” resource and to analyze their uses. The study reveals that very few relative clauses are found before Level 4, that their frequency increases until Level 6 and that it then remains more or less stable, with a peak at Level 11. The results show some limited effect of learners’ nationalities (in terms of the types of relative clauses that are used) and a strong task effect.

13 Learner Corpora


This focus on tasks echoes Granger’s (2015) recommendation to take this kind of variable into account (see Sect. 13.2.4). However, other variables that are equally important in learner corpus research cannot be investigated in EFCAMDAT because of the relative lack of metadata about learners (information about their L1, for example, is so far not available but has to be approximated through nationality and country of residence). This shows that, at the moment, learner corpus size may still come at the expense of rich metadata.
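The automatic extraction of relative clauses from parsed data, as described for EFCAMDAT above, can be illustrated with a minimal sketch. It assumes sentences already dependency-parsed into (form, head, relation) triples and uses the Universal Dependencies label `acl:relcl` for relative-clause modifiers; the hand-built toy parse and the normalisation per 10,000 words are illustrative assumptions, not the study's actual pipeline.

```python
def count_relative_clauses(parsed_sentences):
    """Count tokens attached to their head by a relative-clause relation."""
    return sum(
        1
        for sentence in parsed_sentences
        for (_form, _head, deprel) in sentence
        if deprel == "acl:relcl"
    )

def per_10k_words(parsed_sentences):
    """Relative-clause frequency normalised per 10,000 tokens."""
    n_tokens = sum(len(s) for s in parsed_sentences)
    if n_tokens == 0:
        return 0.0
    return 10_000 * count_relative_clauses(parsed_sentences) / n_tokens

# Toy parse: (form, head index, dependency relation) triples for
# "The book that I read was good." -- "read" heads a relative clause on "book".
toy = [[
    ("The", 2, "det"), ("book", 7, "nsubj"), ("that", 5, "obj"),
    ("I", 5, "nsubj"), ("read", 2, "acl:relcl"), ("was", 7, "cop"),
    ("good", 0, "root"), (".", 7, "punct"),
]]
print(count_relative_clauses(toy))  # 1
print(per_10k_words(toy))           # 1250.0
```

In a real study, the triples would come from a parser run over the whole corpus, and the normalised counts would be compared across proficiency levels.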

Representative Corpus 1
International Corpus of Learner English (ICLE; Granger et al. 2009)
One of the first learner corpora to have been compiled, ICLE is a mono-L2 and multi-L1 corpus, in that it contains data from a single target language, English, produced by (high-intermediate to advanced) learners from different L1 backgrounds. It is a written learner corpus made up of argumentative (and some literary) essays written by university students under different conditions (exam or not, timed or untimed, access to reference tools or not). It is accompanied by rich metadata which can be queried through the interface that comes with the released version of the corpus. In its current version, it contains 3.7 million words, representing 16 L1 backgrounds. The whole corpus has been POS tagged.

Representative Corpus 2
Corpus Escrito del Español L2 (CEDEL2; http://cedel2.learnercorpora.com/)
CEDEL2, directed by Cristóbal Lozano, is a mono-L2 and multi-L1 learner corpus, made up of Spanish learner data produced by speakers of various L1s. It includes texts written by learners of all proficiency levels, from beginners to advanced learners. The texts were collected via a web application, together with detailed metadata. Unlike many learner corpora which fail to include precise information about learners’ proficiency levels (see Sect. 13.3), CEDEL2 provides, for each learner, the result of an independent and standardized placement test which the participants also took online. The corpus currently includes over one million words. It comes with native Spanish corpora built according to the same design criteria, which can be used for L1-L2 comparisons.


G. Gilquin

Representative Corpus 3
Parallèle Oral en Langue Étrangère (PAROLE; Hilton et al. 2008)
PAROLE is a multi-L2 and multi-L1 spoken learner corpus, which represents L2 Italian, French and English speech produced by learners from various L1 backgrounds and proficiency levels. It also contains some data produced by L1 speakers. The data were collected through five oral production tasks, which correspond to varying degrees of naturalness. Next to the usual type of information (learner’s L1, knowledge of other languages, etc.), the metadata include, for each learner, measures of L2 proficiency, phonological memory, grammatical inferencing and motivation. PAROLE is a speech (or speaking) learner corpus, which means that, unlike so-called mute spoken learner corpora, it comes with sound files. The data have been transcribed according to the CHILDES system (see Sect. 13.3) and the transcriptions have been time-aligned with the sound files (see Chap. 11 on time-alignment).

13.3 Critical Assessment and Future Directions

Over the last few years, learner corpora have grown in number, size and diversity. Written learner corpora are already quite numerous and large. In the near future, we should see the release of more and bigger spoken learner corpora, like the (still growing) Trinity Lancaster Corpus (Gablasova et al. 2017). In this respect, it is to be hoped that the developments in speech recognition will one day make it possible to automatically create reliable transcriptions based on recordings of learner language. In Zechner et al. (2009), the authors tested the reliability of a speech recognizer that they had trained on non-native spoken English produced by learners from a wide range of L1 backgrounds and proficiency levels. The result was that about one word in two was (wholly or partly) incorrectly transcribed. Although progress has been made in the meantime, Higgins et al. (2015:593) still acknowledge that the performance of speech recognizers “can degrade substantially when they are presented with non-native speech”.

Another possible development is that the learner corpora of the future will be mega databases (rather than corpora in the strict sense), bringing together data produced by the same learners in different contexts, with different degrees of monitoring (thus including some constrained data, even perhaps of an experimental nature, in addition to the more naturalistic data), at different stages in their learning process and in different languages, including their mother tongue. The last-mentioned type of data, L1 data to be compared with L2 data from the same subjects, can help distinguish between linguistic behaviours that are typical of a person, regardless of whether s/he is using his/her mother tongue or a non-native language



(e.g. a slow speech rate), and those that the person only displays when using the L2. García Lecumberri et al. (2017), for instance, have compiled a bi-directional corpus made up of speech produced by English and Spanish native speakers in both their L1 and their L2, and they show how such a corpus can open up new possibilities for the study of learner language. More and more learner corpora nowadays come with an equivalent L1 corpus representing the target language (cf. CEDEL2 and PAROLE). This is a welcome development, as it makes it possible to carry out contrastive interlanguage analyses on the basis of fully comparable data. Such target language data are likely to be included in the mega databases of the future. What would also be desirable is input data, which should strive to represent the language that learners get exposed to, so that correlations between input and output can be measured. While in the past learners’ input has been approximated by means of textbook corpora (cf. Römer 2004), it is clear that, especially in the case of an international language like English, learners’ input is no longer limited to textbooks, even in foreign language situations, and that additional sources of exposure to the target language should therefore be taken into account. At the same time as we should witness an exponential growth in the size of learner corpora/databases, we should also observe the creation of new types of learner corpora, some of which have already started to be collected. The PROCEED corpus (Process Corpus of English in Education),3 for example, is a ‘process learner corpus’ which aims to reflect the whole of the writing process among language learners. It does so by combining screencast and keystroke logging and by examining at a micro-level the different steps leading to the final product (see Gilquin Forthcoming). Multimodal learner corpora (see Chap. 16) like the Multimedia Adult ESL Learner Corpus (MAELC; Reder et al. 
2003) are likely to become more common, as well as translation learner corpora (corpora of texts translated by non-native students/translator trainees) like the MeLLANGE Learner Translator Corpus (Castagnoli et al. 2011) or the Multilingual Student Translation (MUST) corpus (see Chap. 12). More generally, it seems as if the new generation of learner corpora will be characterized by a higher degree of diversification than is currently the case: more (target and first) languages will be represented, more proficiency levels (including young learners, as in the International Corpus of Crosslinguistic Interlanguage; Tono 2012), more tasks, etc. The use of web applications to collect learner corpus data (cf. CEDEL2) will also make it possible to include the production of a wider range of non-native populations, and in particular learners outside universities, where, for reasons of convenience, many participants so far have been recruited. In addition to an expansion and diversification of learner corpora, we can also expect these corpora to come with more additional information than ever before, in the form of metadata and annotation. Starting with metadata, although learner corpora have included a large variety of them from the very beginning, there is

3 https://uclouvain.be/en/research-institutes/ilc/cecl/proceed.html. Accessed 22 May 2019.



also a growing recognition that these may not be enough to reflect the complexity of the second language acquisition process. Limiting target language exposure to the ‘time abroad’ factor, for example, means neglecting other possible sources of exposure like the Internet, TV series or songs, all of which have become omnipresent in the lives of many young people. Proficiency is another case in point. While typically it has been evaluated on the basis of external criteria such as age or number of years of English instruction, scholars like Pendar and Chapelle (2008) have demonstrated that these may only give a very rough approximation of a learner’s actual proficiency, which speaks in favour of having the participants take a placement test as part of the data collection procedure (cf. CEDEL2) and/or having the corpus data rated according to a scale like the CEFR. More cognitive measures are also likely to be added in the future, as is the case in PAROLE or in the Secondary-level COrpus Of Learner English (SCooLE), which relies on a whole battery of psychometric tests measuring verbal comprehension, reasoning, perseverance, anxiety and many others (see Möller 2017).

In terms of annotation, we can expect learner corpora to be POS tagged, parsed and/or error tagged more systematically (to cite only the main types of annotation mentioned in Sect. 13.2.3), which should be easier once adequate tools have been designed or adapted to deal with learner language more accurately. As with other types of corpora (see, e.g., spoken corpora in Chap. 11), standardization will become even more important as metadata and annotation keep being added. A project like the Child Language Data Exchange System (CHILDES)4 has contributed to the standardization of child language corpora by proposing a common format for transcription, POS tagging, etc.
Although some learner corpora have adopted this system too, like PAROLE or the French Learner Language Oral Corpora (FLLOC),5 they are relatively rare, and there is currently no corresponding system for learner corpora which could ensure the same degree of standardization. The availability of more, more diverse, bigger and more richly annotated learner corpora will have an impact on the way we conduct learner corpus research. Ellis et al. (2015), for example, call for “more longitudinal studies based on dense data”. This will involve, first, the compilation of bigger and denser longitudinal learner corpora. Once these corpora have been collected, appropriate techniques will have to be developed to automate the analysis of individual developmental trajectories in large datasets (see Hokamura 2018 for a step in this direction, based on a set of 20 data collection points but limited to two learners). With such techniques, it will become possible to investigate much larger populations of learners than is currently the case and thus achieve a higher degree of reliability. It can also be hoped that new and better resources will attract more users. In particular, teachers should be encouraged not only to use learner corpora, but also to collect data produced by their own students, in the form of ‘local learner corpora’ (Seidlhofer 2002). With more and more teachers receiving some training in corpus linguistics, we can expect

4 https://childes.talkbank.org/. Accessed 22 May 2019.
5 www.flloc.soton.ac.uk/. Accessed 22 May 2019.



that an increasingly large number of them will want to apply the methods of learner corpus research in their classrooms, thus bringing learner corpora closer to those who, ultimately, should benefit from their exploitation, namely learners.
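The recognizer reliability discussed earlier in this section ("about one word in two" wrongly transcribed; Zechner et al. 2009) is conventionally quantified as word error rate (WER): the Levenshtein edit distance between reference and hypothesis transcripts, divided by the length of the reference. A minimal sketch (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference length,
    computed with standard Levenshtein alignment over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# "About one word in two" wrongly transcribed corresponds to a WER near 0.5:
print(word_error_rate("she goes to school every day",
                      "she go to the school every"))  # 0.5
```

A WER of 0.5 does not mean exactly every second word is wrong; insertions and deletions also count, so it is the overall edit load per reference word.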

13.4 Tools and Resources

Learner Corpus Bibliography: this bibliography is made up of references in the field of learner corpus research. The bibliography can be found on the CECL website (https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpus-bibliography.html) (accessed 22 May 2019). A searchable version is accessible to members of the Learner Corpus Association in the form of a Zotero collection.

Learner Corpora around the World (https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html) (accessed 22 May 2019): this website, which is regularly updated, contains a list of learner corpora, together with their main characteristics (target language, first language, medium, text/task type, proficiency level, size) as well as information about whether (and how) they can be accessed.

Université Catholique de Louvain Error Editor (UCLEE; Hutchinson 1996): this program facilitates error tagging thanks to a drop-down menu that makes it possible to select an error tag. It also facilitates the insertion of a corrected form. A new version of the software is currently in preparation.

Compleat Lexical Tutor (Lextutor; http://www.lextutor.ca) (accessed 22 May 2019): this website, created by Tom Cobb, is mainly aimed at teachers and learners (of English, but also some other languages like French). However, among the many tools it offers, some will be useful to researchers working with learner corpora. VocabProfile, in particular, can analyze (small) learner corpora according to vocabulary frequency bands, making it possible to check whether, say, learners of English tend to rely heavily on the 1000 most frequent words of the English language.
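The frequency-band profiling that VocabProfile performs can be approximated with a small sketch: each running word of a learner text is assigned to the first frequency band whose word list contains it, and the proportions are reported. The band lists below are tiny invented stand-ins for real 1,000-word frequency bands, and `vocab_profile` is a hypothetical helper, not Lextutor's actual implementation.

```python
def vocab_profile(text, bands):
    """Proportion of running words falling into each frequency band.
    `bands` maps band names to word sets, checked in order; any word
    matching no band is counted as 'off-list'."""
    words = text.lower().split()
    counts = {name: 0 for name in bands}
    counts["off-list"] = 0
    for w in words:
        for name, wordlist in bands.items():
            if w in wordlist:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    total = len(words)
    return {name: n / total for name, n in counts.items()}

# Tiny stand-in band lists (real profiles use 1,000-word frequency bands):
bands = {
    "1k": {"the", "i", "to", "my", "very", "like"},
    "2k": {"homework", "teacher"},
}
profile = vocab_profile("i like my teacher very much", bands)
print(profile)
```

A profile heavily skewed towards the first band would support the kind of claim mentioned above, namely that learners rely largely on the most frequent words of the language.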

Further Reading

Granger, S. 2012. How to use foreign and second language learner corpora. In Research methods in second language acquisition: A practical guide, eds. Mackey, A., and Gass, S.M., 7–29. Chichester: Blackwell Publishing. After briefly introducing learner corpora, this paper clearly presents the different stages that can be involved in a learner corpus study: choice of a methodological approach, selection and/or compilation of a learner corpus, data annotation, data extraction, data analysis, data interpretation and pedagogical implementation.



Díaz-Negrillo, A., Ballier, N., and Thompson, P., eds. 2013. Automatic treatment and analysis of learner corpus data. Amsterdam: John Benjamins. This edited volume covers many important methodological issues related to learner corpora, such as the question of interoperability, multi-layer error annotation, automatic error detection and correction, or the use of statistics in learner corpus research.

Granger, S., Gilquin, G., and Meunier, F., eds. 2015. The Cambridge handbook of learner corpus research. Cambridge: Cambridge University Press. This handbook provides a comprehensive overview of the different facets of learner corpus research, including the design of learner corpora, the methods that can be applied to study them, their use to investigate various aspects of language, and the link between learner corpus research and second language acquisition, language teaching and natural language processing.

References

Alexopoulou, T., Geertzen, J., Korhonen, A., & Meurers, D. (2015). Exploring big educational learner corpora for SLA research: Perspectives on relative clauses. International Journal of Learner Corpus Research, 1(1), 96–129. Altenberg, B., & Granger, S. (2001). The grammatical and lexical patterning of MAKE in native and non-native student writing. Applied Linguistics, 22(2), 173–194. Belz, J., & Vyatkina, N. (2005). Learner corpus analysis and the development of L2 pragmatic competence in networked inter-cultural language study: The case of German modal particles. The Canadian Modern Language Review/La revue canadienne des langues vivantes, 62(1), 17–48. Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2013). TOEFL11: A corpus of non-native English. Princeton: Educational Testing Service. Caines, A., & Buttery, P. (2014). The effect of disfluencies and learner errors on the parsing of spoken learner language. In First joint workshop on statistical parsing of morphologically rich languages and syntactic analysis of non-canonical languages (pp. 74–81). Dublin, Ireland, August 23–29, 2014. Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–55). Cambridge: Cambridge University Press. Castagnoli, S., Ciobanu, D., Kunz, K., Kübler, N., & Volanschi, A. (2011). Designing a learner translator corpus for training purposes. In N. Kübler (Ed.), Corpora, language, teaching, and resources: From theory to practice (pp. 221–248). Bern: Peter Lang. Crossley, S. A., & McNamara, D. S. (2012). Interlanguage talk: A computational analysis of nonnative speakers’ lexical production and exposure. In P. M. McCarthy & C. Boonthum-Denecke (Eds.), Applied natural language processing: Identification, investigation and resolution (pp. 425–437). Hershey: IGI Global. Dagneaux, E., Denness, S., & Granger, S. (1998). Computer-aided error analysis.
System, 26(2), 163–174. Dagneaux, E., Denness, S., Granger, S., Meunier, F., Neff, J. A., & Thewissen, J. (2008). Error tagging manual version 1.3. Louvain-la-Neuve: Centre for English Corpus Linguistics.



de Bot, K., Lowie, W., & Verspoor, M. (2007). A Dynamic Systems Theory approach to second language acquisition. Bilingualism: Language and Cognition, 10(1), 7–21. De Felice, R., & Pulman, S. (2009). Automatic detection of preposition errors in learner writing. CALICO Journal, 26(3), 512–528. de Haan, P. (1984). Problem-oriented tagging of English corpus data. In J. Aarts & W. Meijs (Eds.), Corpus linguistics: Recent developments in the use of computer corpora (pp. 123–139). Amsterdam: Rodopi. de Haan, P. (2000). Tagging non-native English with the TOSCA–ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus linguistics and linguistic theory (pp. 69–79). Amsterdam: Rodopi. De Knop, S., & Meunier, F. (2015). The ‘learner corpus research, cognitive linguistics and second language acquisition’ nexus: A SWOT analysis. Corpus Linguistics and Linguistic Theory, 11(1), 1–18. Díaz-Negrillo, A., Meurers, D., Valera, S., & Wunsch, H. (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum, 36(1–2), 139–154. Ellis, N. C., Simpson-Vlach, R., Römer, U., O’Donnell, M. B., & Wulff, S. (2015). Learner corpora and formulaic language in second language acquisition research. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 357–378). Cambridge: Cambridge University Press. Flowerdew, L. (2015). Learner corpora and language for academic and specific purposes. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 465–484). Cambridge: Cambridge University Press. Gablasova, D., Brezina, V., McEnery, T., & Boyd, E. (2017). Epistemic stance in spoken L2 English: The effect of task and speaker style. Applied Linguistics, 38(5), 613–637. García Lecumberri, M. L., Cooke, M., & Wester, M. (2017). A bi-directional task-based corpus of learners’ conversational speech. International Journal of Learner Corpus Research, 3(2), 175–195. 
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R. T. I. Miller, K. Martin, C. M. Eddington, A. Henery, N. Marcos Miguel, A. M. Tseng, A. Tuninetti, & D. Walter (Eds.), Selected proceedings of the 2012 second language research forum: Building bridges between disciplines (pp. 240–254). Somerville: Cascadilla Proceedings Project. Gilquin, G. (2000/2001). The integrated contrastive model: Spicing up your data. Languages in Contrast, 3(1), 95–123. Gilquin, G. (2012). Lexical infelicity in English causative constructions. Comparing native and learner collostructions. In J. Leino & R. V. Waldenfels (Eds.), Analytical causatives. From ‘give’ and ‘come’ to ‘let’ and ‘make’ (pp. 41–63). München: Lincom Europa. Gilquin, G. (2016). Discourse markers in L2 English: From classroom to naturalistic input. In O. Timofeeva, A.-C. Gardner, A. Honkapohja, & S. Chevalier (Eds.), New approaches to English linguistics: Building bridges (pp. 213–249). Amsterdam: John Benjamins. Gilquin, G. (2017). POS tagging a spoken learner corpus: Testing accuracy testing. Paper presented at the 4th Learner Corpus Research Conference, Bolzano/Bozen, Italy, 5–7 October 2017. Gilquin, G. (Forthcoming). Hic sunt dracones: Exploring some terra incognita in learner corpus research. In A. Čermáková & M. Malá (Eds.), Variation in time and space: Observing the world through corpora. Berlin: De Gruyter. Gilquin, G., De Cock, S., & Granger, S. (2010). Louvain International Database of Spoken English Interlanguage. Louvain-la-Neuve: Presses universitaires de Louvain. Golden, A., Jarvis, S., & Tenfjord, K. (2017). Crosslinguistic influence and distinctive patterns of language learning: Findings and insights from a learner corpus. Bristol: Multilingual Matters. Granger, S. (1996). From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In K. Aijmer, B.
Altenberg, & M. Johansson (Eds.), Languages in contrast. Text-based cross-linguistic studies (pp. 37–51). Lund: Lund University Press. Granger, S. (1998). The computer learner corpus: A testbed for electronic EFL tools. In J. Nerbonne (Ed.), Linguistic databases (pp. 175–188). Stanford: CSLI Publications.



Granger, S. (2004). Computer learner corpus research: Current status and future prospects. In U. Connor & T. Upton (Eds.), Applied corpus linguistics: A multidimensional perspective (pp. 123–145). Amsterdam: Rodopi. Granger, S. (2009). The contribution of learner corpora to second language acquisition and foreign language teaching: A critical evaluation. In K. Aijmer (Ed.), Corpora and language teaching (pp. 13–32). Amsterdam: John Benjamins. Granger, S. (2015). Contrastive interlanguage analysis: A reappraisal. International Journal of Learner Corpus Research, 1(1), 7–24. Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (2009). The International Corpus of Learner English (Handbook and CD-ROM. Version 2). Louvain-la-Neuve: Presses universitaires de Louvain. Gries, S. T., & Deshors, S. C. (2014). Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136. Higgins, D., Ramineni, C., & Zechner, K. (2015). Learner corpora and automated scoring. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 587–604). Cambridge: Cambridge University Press. Hilton, H., Osborne, J., Derive, M. -J., Succo, N., O’Donnell, J., Billard, S., & Rutigliano-Daspet, S. (2008). Corpus PAROLE (Parallèle Oral en Langue Étrangère). Architecture du corpus & conventions de transcription. Chambéry: Laboratoire LLS – Équipe Langages, Université de Savoie. http://archive.sfl.cnrs.fr/sites/sfl/IMG/pdf/PAROLE_manual.pdf. Accessed 22 May 2019. Hokamura, M. (2018). The dynamics of complexity, accuracy, and fluency: A longitudinal case study of Japanese learners’ English writing. JALT Journal, 40(1), 23–46. Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28–54. Hutchinson, J. (1996). Université Catholique de Louvain Error Editor. 
Louvain-la-Neuve: Centre for English Corpus Linguistics, Université catholique de Louvain. Izumi, E., Uchimoto, K., & Isahara, H. (2004). The NICT JLE Corpus: Exploiting the language learners’ speech database for research and education. International Journal of the Computer, the Internet and Management, 12(2), 119–125. James, C. (1998). Errors in language learning and use: Exploring error analysis. London/New York: Longman. Jarvis, S., & Pavlenko, A. (2008). Crosslinguistic influence in language and cognition. New York/London: Routledge. Jucker, A. H., Smith, S. W., & Lüdge, T. (2003). Interactive aspects of vagueness in conversation. Journal of Pragmatics, 35(12), 1737–1769. Liu, E. T. K., & Shaw, P. M. (2001). Investigating learner vocabulary: A possible approach to looking at EFL/ESL learners’ qualitative knowledge of the word. International Review of Applied Linguistics in Language Teaching, 39(3), 171–194. Lüdeling, A., Hirschmann, H., & Shadrova, A. (2017). Linguistic models, acquisition theories, and learner corpora: Morphological productivity in SLA research exemplified by complex verbs in German. Language Learning, 67(S1), 96–129. Meunier, F. (1998). Computer tools for interlanguage analysis: A critical approach. In S. Granger (Ed.), Learner English on computer (pp. 19–37). London/New York: Addison Wesley Longman. Meunier, F. (2016). Introduction to the LONGDALE Project. In E. Castello, K. Ackerley, & F. Coccetta (Eds.), Studies in learner corpus linguistics. Research and applications for foreign language teaching and assessment (pp. 123–126). Berlin: Peter Lang. Meunier, F., & Littré, D. (2013). Tracking learners’ progress: Adopting a dual ‘corpus cum experimental data’ approach. The Modern Language Journal, 97(S1), 61–76. Meurers, D. (2015). Learner corpora and natural language processing. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 537–566). Cambridge: Cambridge University Press.



Möller, V. (2017). Language acquisition in CLIL and non-CLIL settings: Learner corpus and experimental evidence on passive constructions. Amsterdam: John Benjamins. Myles, F. (2015). Second language acquisition theory and learner corpus research. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 309–331). Cambridge: Cambridge University Press. Nesselhauf, N. (2004). Learner corpora and their potential in language teaching. In J. Sinclair (Ed.), How to use corpora in language teaching (pp. 125–152). Amsterdam: John Benjamins. Osborne, J. (2015). Transfer and learner corpus research. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 333–356). Cambridge: Cambridge University Press. Paquot, M. (2014). Cross-linguistic influence and formulaic language: Recurrent word sequences in French learner writing. In L. Roberts, I. Vedder, & J. H. Hulstijn (Eds.), EUROSLA Yearbook 14 (pp. 240–261). Amsterdam: John Benjamins. Pendar, N., & Chapelle, C. A. (2008). Investigating the promise of learner corpora: Methodological issues. CALICO Journal, 25(2), 189–206. Rayson, P., & Baron, A. (2011). Automatic error tagging of spelling mistakes in learner corpora. In F. Meunier, S. De Cock, G. Gilquin, & M. Paquot (Eds.), A taste for corpora: In honour of Sylviane Granger (pp. 109–126). Amsterdam: John Benjamins. Reder, S., Harris, K., & Setzler, K. (2003). The Multimedia Adult ESL Learner Corpus. TESOL Quarterly, 37(3), 546–557. Reznicek, M., Lüdeling, A., & Hirschmann, H. (2013). Competing target hypotheses in the Falko corpus: A flexible multi-layer corpus architecture. In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner corpus data (pp. 101–124). Amsterdam: John Benjamins. Römer, U. (2004). Comparing real and ideal language learner input: The use of an EFL textbook corpus in corpus linguistics and language teaching. In G. Aston, S. 
Bernardini, & D. Stewart (Eds.), Corpora and language learners (pp. 152–168). Amsterdam: John Benjamins. Rozovskaya, A., & Roth, D. (2010). Training paradigms for correcting errors in grammar and usage. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 154–162). Los Angeles: Association for Computational Linguistics. Seidlhofer, B. (2002). Pedagogy and local learner corpora: Working with learning-driven data. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language teaching (pp. 213–234). Amsterdam: John Benjamins. Sinclair, J. (1996). Preliminary recommendations on corpus typology (Technical report, EAGLES (Expert Advisory Group on Language Engineering Standards). www.ilc.cnr.it/EAGLES96/ corpustyp/corpustyp.html. Accessed 22 May 2019. Spoelman, M. (2013). The (under)use of partitive objects in Estonian, German and Dutch learners of Finnish. In S. Granger, G. Gilquin, & F. Meunier (Eds.), Twenty years of learner corpus research: Looking back, moving ahead (pp. 423–433). Louvain-la-Neuve: Presses universitaires de Louvain. Tono, Y. (2012). International Corpus of Crosslinguistic Interlanguage: Project overview and a case study on the acquisition of new verb co-occurrence patterns. In Y. Tono, Y. Kawaguchi, & M. Minegishi (Eds.), Developmental and crosslinguistic perspectives in learner corpus research (pp. 27–46). Amsterdam: John Benjamins. Van Rooy, B., & Schäfer, L. (2002). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20, 325–335. Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10), 883–895.

Chapter 14

Child-Language Corpora

Sabine Stoll and Robert Schikowski

Abstract Together with experiments, the main method in the study of child language development is the analysis of behavioral changes of individual children over time. For this purpose, recordings of infants’ and children’s naturalistic interactions in a variety of languages spoken in different cultural contexts are key. Language development corpora are either cross-sectional or longitudinal collections of conversations recorded at predetermined intervals and annotated on various linguistic and multi-modal levels. Here, we discuss the advantages and disadvantages of cross-sectional and longitudinal studies, but the focus of this chapter will be on longitudinal studies, which are the main tool in corpus analyses of child language. We focus on challenges of corpus design, including sampling issues such as the number and choice of participants, amount of recordings, recording situations, transcription, linguistic and multi-modal annotations and ethical considerations.

14.1 Introduction

Corpora in language development research are collections of naturalistic interactions of children and their surrounding environment. They usually comprise several subcorpora corresponding to different target children. The underlying purpose of developmental corpora is to learn about children’s proficiency and to understand how they use language in their natural environment. Thus, corpora allow the researcher to find out what children do in natural interaction, in contrast to experiments, which test what children can do. Ideally, a language development

S. Stoll
Department of Comparative Language Science & Center for the Interdisciplinary Study of Language Evolution (ISLE), University of Zurich, Zurich, Switzerland & NCCR Evolving Language, Swiss National Science Foundation Agreement #51NF40_180888
e-mail: [email protected]

R. Schikowski
ETH Zurich, Zurich, Switzerland
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_14



S. Stoll and R. Schikowski

corpus presents an ecologically valid and representative picture of the linguistic development of language learners. To capture how children learn language, the changes in their development of vocabulary, grammar (e.g. morphology, syntactic constructions) and pragmatic understanding are monitored over a predefined time window. Thus, developmental corpora are necessarily time series data, whose internal structure is important. To model the child’s proficiency at a specific point in time or over a period of time, a session-by-session comparison of the child’s and surrounding adults’ constructions is key. The raw data in developmental corpora are audio-visual recordings, which are transformed into transcriptions, either phonetic or, more usually, orthographic (see also Chap. 11). In a second step these transcriptions are then enriched with further annotations.

Two main types of developmental corpora are used in the field: cross-sectional and longitudinal corpora. In cross-sectional corpora, specific age points of interest are identified and a number of children with the respective ages are recorded. Each child is recorded individually in either naturalistic or semi-structured contexts, depending on the purpose of the study. In this type of corpus, development is inferred via between-group comparisons, i.e. by averaging over a large number of participants, each of whom is recorded only at one point in their development. In longitudinal corpora, on the other hand, the development of individual children is estimated based on temporally ordered samples of the same child. Each child is recorded at regular intervals at a number of consecutive time points, usually stretching over several months or even years. Thus, longitudinal corpora portray the individual development of a few children and thereby capture individual differences in developmental curves, rather than inferring a general but averaged profile of productivity as is the case in cross-sectional studies.
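The session-by-session comparison of child and surrounding adult speech described above can be sketched as follows; the utterance records (session index, speaker role, presence of a target construction) are invented toy data, not drawn from any actual corpus.

```python
def construction_rate_by_session(utterances, role):
    """Per-session relative frequency (per 100 utterances) of a target
    construction for one speaker role ('child' or 'adult')."""
    totals, hits = {}, {}
    for session, speaker_role, has_construction in utterances:
        if speaker_role != role:
            continue
        totals[session] = totals.get(session, 0) + 1
        hits[session] = hits.get(session, 0) + (1 if has_construction else 0)
    return {s: 100 * hits[s] / totals[s] for s in sorted(totals)}

# Invented toy data: (session index, speaker role, construction present?)
data = [
    (1, "child", False), (1, "child", False), (1, "adult", True), (1, "adult", True),
    (2, "child", True),  (2, "child", False), (2, "adult", True), (2, "adult", False),
]
print(construction_rate_by_session(data, "child"))  # {1: 0.0, 2: 50.0}
print(construction_rate_by_session(data, "adult"))  # {1: 100.0, 2: 50.0}
```

Placing the child and adult curves side by side over ordered sessions gives exactly the kind of time-series view of emerging productivity that the paragraph above describes.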
Since recording over several years is time-consuming, a staggered design combining several longitudinal studies of children covering different age spans is often applied (cf. e.g. the Chintang corpus in Box 1). In this design, several children are observed over a predetermined time span, but recordings start and end at different ages for different groups of children. Instead of recording, for instance, two children over two full years, four children of two different ages (e.g. two 2-year-olds and two 3-year-olds) are recorded over one year so that the same age spans are covered.

Cross-sectional and longitudinal studies both have advantages and disadvantages. Cross-sectional studies require large numbers of children at each interval because the individual variation across children is huge (e.g. Bates et al. 1995). The fewer the children, the more influence potential outliers or extreme cases with significantly different developmental curves can have. Longitudinal data, by contrast, portray real developmental curves of individual children, but little information about the effective variation across children at different ages can be extracted. Thus, longitudinal studies usually remain case studies. Ideally, the two approaches complement each other. In the following we focus on longitudinal designs, which are more prevalent in research on language development, but many issues such as annotation layers or metadata are relevant for both corpus types.
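To illustrate the arithmetic of such a staggered design, the covered age spans can be computed as follows (a minimal sketch with invented values, not the actual design of the Chintang corpus):

```python
# Sketch of staggered-design age coverage: each group of children is recorded
# for the same duration, but groups start at different ages, so together the
# groups span a longer developmental window than any single group.
def covered_spans(start_ages, duration):
    """Return the (start, end) age span each group contributes."""
    return [(a, a + duration) for a in start_ages]

def total_span(spans):
    """Overall developmental window covered by all groups together."""
    return (min(s for s, _ in spans), max(e for _, e in spans))

# Two 2-year-olds and two 3-year-olds, each recorded for one year:
spans = covered_spans(start_ages=[2.0, 3.0], duration=1.0)
print(spans)              # [(2.0, 3.0), (3.0, 4.0)]
print(total_span(spans))  # (2.0, 4.0) -- the same span as one child over two years
```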

14 Child-Language Corpora


14.2 Fundamentals

Longitudinal corpora imply regular recordings over at least several months, transcriptions of each utterance (ideally not only of the target child, but of all interlocutors), and a multitude of different annotation levels, depending on the questions of the respective project. This makes the development of longitudinal corpora a logistically difficult, time-consuming, and ultimately very expensive endeavor. The ultimate design goal therefore is to create corpora that are sustainable and suitable for quantitative analyses of a wide range of topics rather than focusing on a single research question. In the following we present the main steps and layers in corpus design and their respective challenges.

14.2.1 Recording and Contextual Setting

The raw data of longitudinal corpora are audio-visual recordings of natural interactions between a target child and her surrounding communicative environment. Video is an important feature not only because the speech of young children is often difficult to interpret without context but also because language development is multi-modal and involves speech that interacts tightly with gesture and gaze (see Chap. 16). Most corpora nowadays include at least some video recordings. A major difference to other spoken corpora is the strict recording scheme. Regular and evenly spaced recordings document how the child's language develops by adapting more and more to the adult target.

The naturalness of the recording contexts of existing corpora varies widely. Some researchers prefer to include only the main caretaker and the target child in a play situation. This setting facilitates transcription but sacrifices ecological validity. It is known that the recording context has a strong influence on the type of constructions used, both in child-directed speech and by the children themselves. Studies that strive for a recording situation that is as natural as possible, and thus do not filter for participants or situations, are more representative because they allow better generalization from the recordings to the rest of the child's linguistic encounters. It is worthwhile to invest in more complex transcriptions that include the typical participants of the child's natural environment in order to avoid biases in the characterization of child-surrounding speech.

To supplement information about speech, researchers have started to use child-mounted cameras to capture the child's immediate environment. The main goal of this type of recording is to link the characteristics of the speech of the child to her immediate linguistic and extra-linguistic environment.
The visual context is important because the joint attentional frame in conversations, as well as the types of responses and word-to-world mappings, is highly relevant for word learning (e.g. Tomasello 2003; Kuhl 2004). Another approach is to record with cameras on a tripod, which makes it possible to capture a wider range of the environment, including the child herself. This approach has the big advantage that the extra-linguistic context and the actions and reactions of the interactional partners can be included in the analysis.

14.2.2 Subject Sampling

The number of children is one of the most important decisions in the design of a study, having the potential both to make data statistically (more) robust and to increase the amount of work put into a corpus ad infinitum. Demuth (1996) recommends three to four children as the minimum, and this seems to be the number of children most projects opt for, both small-scale and large-scale. While recent corpora tend to include more children, there is (to the best of our knowledge) not a single publication on how to determine the ideal number of children in longitudinal studies. Two children are obviously better than one, because one child might happen to be either precocious or a late starter, but two children will not remedy this problem either. With more children, the question is not whether their number improves the reliability of the data but rather to what degree it does. This point is crucial considering how additional target children multiply the amount of time and money required to compile a corpus with minimum annotation standards. Numbers of 80 children per community and corpus, as studied in the 0–5 Program (see Representative Corpus 2), are a role model of corpus development but go far beyond the possibilities of most research groups, especially if non-WEIRD (Western, Educated, Industrialized, Rich, Democratic; Henrich et al. 2010) societies are involved.

The extensive variation in development calls for statistical methods that are suitable for corpora consisting of several case studies. A major factor in the analysis of these corpora is individual variation: children of the same age usually vary enormously in their linguistic competence (Bates et al. 1988; Fenson et al. 1994; Lieven 2006). As a consequence, age is a very unreliable predictor of development. Stoll and Gries (2009) propose an incremental method of data analysis that addresses variation up front instead of averaging it out by pooling data of different children.
In this approach the individual trajectories of each child are first compared to those of their surrounding adults, who are the benchmark for linguistic proficiency. Only in a second step are the curves of the children compared and similarities in their development analyzed. Variation is conditioned by a multitude of factors, which the selection of participants aims to control for as far as possible. As a consequence, most developmental studies feature target children of different sexes. However, the differences between the sexes reported in the literature are small (Hyde and Linn 1988; Fenson et al. 1994), which means they will frequently be outweighed by individual variation given the small numbers of target children that are common in developmental studies. By contrast, a variable that is definitely worth controlling for is socio-economic background, since there is large variation in the input to children of different SES groups (e.g. Hart and Risley 1995; Rowe 2008, 2012). Fernald et al. (2013) have shown that already at 18 months there is an enormous gap in processing abilities between infants from high- and low-SES families. The conditioning variable for these developmental differences is the amount and quality of input children receive in the first years of their life (Rowe 2012).

14.2.3 Size of Corpora and Recording Intervals

Probably the strongest limiting factor for the results and the conclusions that can be drawn from a corpus is the amount of recordings. Ideally one would record all the speech of the child and her environment. This approach was taken in the Speechome project (Roy et al. 2006, 2009), in which nearly all linguistic encounters of one child were recorded from birth to age 3. Recordings were made at the home of the child, all rooms of which were wired. Comprehensive as this approach is, it is not feasible for most projects, not only because the research logistics are challenging but also because of the restricted usability of the data due to ethical and privacy issues. In all but the Speechome context we deal with snapshots of the child's daily encounters and thus with samples of varying sizes, which serve as the basis for our extrapolation to the overall linguistic ecology of a child.

The type of constructions and vocabulary used by children varies with extra-linguistic contexts and activities. To ensure a faithful assessment of the child's abilities, recordings should ideally include a variety of daily situations. In a recent effort this corpus scheme has been extended to daylong recordings (cf. VanDam et al. 2016, http://www.darcle.org) to obtain deeper insights into daily activities and linguistic interactions (see http://homebank.talkbank.org). Daylong recordings of the target child and her surrounding environment are often conducted with LENA devices (LENA Research Foundation, Boulder, Colorado, United States). LENA devices audio-record and simultaneously analyze the utterances produced by the child and her surroundings. They provide automatic estimates of the number of adult words surrounding the child, the number of turns, and the number of child vocalizations. This is very valuable information for estimating the quantity and type of input a child is exposed to.
However, the underlying algorithm has so far been validated only for English, French, and Mandarin. Further, in audio-only recordings it can be challenging to match the situational context to the respective utterances. If complemented by video, LENA recordings can provide great insights into a large range of situations, provided that all the data is transcribed. This, however, can quickly become a challenge with large numbers of participants and long recording sessions.

Sample size and sampling regimes have an enormous influence on the estimation of the child's development (Tomasello and Stahl 2004; Malvern and Richards 1997; Rowland and Fletcher 2006). They are therefore already relevant in the design phase, as well as later in the analysis, in order to avoid severe biases in the results. Tomasello and Stahl (2004:102) estimate that one to two hours of recordings of a single child per month capture no more than 1–2% of her speech. As Tomasello and Stahl point out, the main problem with this is a delay in the detection of rare items. Further, errors, which are a window into developmental strategies, often occur in rare constructions. If infrequent constructions are underrepresented in the sample, errors in frequent constructions become oversampled. Thus, small samples result in both an over- and undersampling of errors (Rowland and Fletcher 2006).

In a recent study, Bergelson and colleagues (Bergelson et al. 2018) further found that hour-long video recordings may lead to very different estimates of the vocabulary that children hear than daylong audio recordings: the input in the hour-long video recordings was comparatively much greater and much more varied. This is an extremely important result, which confirms conjectures about the relevance of the situational context of recordings with respect to input distributions (Stoll 2016). Bergelson and colleagues show that hour-long video recordings capture a special situation, which is presumably not representative of the language a child hears and uses on average over the day. In these hour-long recordings caretakers play with the children and hence provide a much denser input than during other activities typical of the rest of the day. This shows that recording intervals and lengths of recordings can have dramatic consequences for claims about development and productivity if delays are not projected based on underlying frequency distributions. Productivity is one of the most relevant measures in language development research, as it captures the child's acquired competence to use language like a mature native speaker. However, the underlying frequency distributions may likewise not be easily retrievable from corpora that are not dense enough.
To avoid this, the following issues are key:

• The recall1 and the hit rate2 for a phenomenon X are a function of the frequency of X and sampling density.
• The same is true for estimations of the frequency and the onset age of X.
• Small samples combined with low-frequency X result in unreliable data. Obvious as this may sound, "small" and "rare" are relative terms. For instance, even a standard sampling density of one hour per week will barely succeed in capturing a not-so-rare X with seven expected instances per week.
• The more frequent X, the steeper the hit-rate gain induced by increased sampling density. Note that this also entails that the more frequent X, the earlier additional increases in sampling density cease to add significantly to hit rates.

An additional problem stressed by Rowland et al. (2008) is that small samples may not adequately reflect the distribution of X. Not only may rare X not be observed at all; short-lived X, such as an error that a child produces only for a short time (e.g. the past tense form goed instead of went), may likewise be missed, and distributions over time may appear randomly distorted. This in fact holds more generally for the distribution of lexical items.

1 The recall is the proportion of all instances of a phenomenon that is covered by a sample.
2 The hit rate is the probability of finding at least one instance of a phenomenon in a sample.



It is worth noting that pooling data in such cases does not necessarily help, because it creates new problems. High-frequency items thrown together with low-frequency items will dominate the pool, which may have serious consequences for linguistic interpretation. For instance, English-speaking children produce more errors for rare irregular verbs than for frequent ones, but this fact might be obscured by pooled data because specific errors by individual children might be misidentified as rare in the larger sample (Maratsos 2000; Rowland et al. 2008). Pooled data give more weight to verbs with high token frequency, with the result of overrepresenting high-frequency tokens (Rowland and Fletcher 2006:9).

The consequences of these observations for researchers planning to compile or analyze a developmental corpus are serious. Researchers should no longer rely on their intuitions when estimating how robust their data are but should back them up with more reliable quantitative considerations. In the following some more concrete suggestions are given. A central notion is sampling coverage, which can be defined as the proportion of the data in question that is captured by the recording scheme. For instance, if we take one week as our reference interval and follow Tomasello and Stahl (2004) and Rowland et al. (2008) in assuming that a child is awake for roughly 10 hours per day, this gives us 70 hours of potential interactions per week. If we record 2 hours per week, the sampling coverage will be 2/70 ≈ 0.03. To calculate the probability of capturing linguistic phenomena, we may moreover assume that they follow a Poisson distribution.3 Below some formulas are listed that make it easy to estimate, for instance, sampling coverage and the probability of capturing a phenomenon. They are derived from the basic terms and calculations presented in Tomasello and Stahl (2004).
• Given a sampling coverage c and an estimated absolute frequency Λ of X per interval, what will be the average catch λ of X per interval?

λ = c · Λ

For instance, if we record 2 hours per week and continue to assume that a child talks about 70 hours per week, the sampling coverage will be 2/70 ≈ 0.03. A phenomenon that occurs 10 times per week is then expected to be observed 0.03 · 10 = 0.3 times per week (in other words, not a single time: on average, a full month will pass until the first instance is observed).

3 This is in fact a coarse oversimplification, since the Poisson distribution assumes that events occur at a constant rate and that earlier events cannot influence the probability of later ones. Both assumptions are obviously very problematic for linguistic interactions, so Tomasello and Stahl's (2004) approach should be considered an approximation and a first step in raising consciousness about sampling issues. These issues need to be tackled by future research on linguistic sampling techniques.



• Given an average catch λ (calculated from c and Λ as above), how high is the probability of capturing at least one X (i.e. the hit rate r)?

r = 1 − P(n = 0) = 1 − (λ^0 · e^−λ) / 0! = 1 − e^−λ

For instance, let us assume we want to observe an X that we estimate to occur 80 times per day or 560 times per week, and we record half an hour per week, so the sampling coverage is 0.5/70 ≈ 0.01 and the average catch is (0.5/70) · 560 = 4. Then the probability of capturing at least one X is 1 − e^−4 ≈ 0.98, i.e. very good in spite of the low density.

• Given an estimated frequency Λ of X and a number λ of X to be captured per interval on average, what should be the sampling coverage c, and what is the number of hours h to be recorded per interval (given the total number of hours spoken by the child per interval, H)?

c = λ / Λ

h = c · H = (λ / Λ) · H

For instance, if we estimate that X occurs 500 times per week and would like to capture 10 instances on average, the required sampling coverage is 10/500 = 0.02. In other words, the number of hours we should record per week is (10/500) · 70 = 1.4 (i.e. roughly one and a half hours).

• Given an estimated frequency Λ of X and a desired hit rate r, what should be the sampling coverage c and the hours h to be recorded per interval?

c = −logₑ(1 − r) / Λ

h = c · H = (−logₑ(1 − r) / Λ) · H

For instance, if X is as rare as only occurring 5 times per week and we want to have a 99%4 probability of catching at least one X in an average recording week, the sampling density should be −logₑ(1 − 0.99) / 5 ≈ 0.92. The number of hours to be recorded per week is then 0.92 · 70 ≈ 65. In other words, our goal is unrealistic. The last two formulas are especially important because they also allow researchers to derive a recording scheme directly from their research interests. Even if the precise frequency of the phenomenon of interest is not known, estimating

4 While 1.0 may seem to be the most desirable value here, a perfect hit rate requires infinitely dense sampling. In other words, we can never be perfectly sure to capture at least one instance even if we sample everything, because the population itself does not guarantee the occurrence of events.



it and calculating the required sampling density on that basis is still much more objective than following a gut feeling or simply choosing the density that is currently most common.

It is worth mentioning that the concepts of sampling density and coverage used above conflate two aspects of sampling which are of great relevance for theory and practice, viz. sampling intervals (e.g. one month between samples) and durations of recordings (e.g. 2 hours per sample). If we were interested in a phenomenon that occurs about 20 times per hour (1400 times per week, 5600 times per month) and wanted to capture 25 instances per week (100 instances per month), the required sampling densities are (25/1400) · 70 = 1.25 hours per week or (100/5600) · 280 = 5 hours per month. This makes a practical difference: recording e.g. a 5-hour sample within a predefined week of the month (the sample can be subdivided into several recordings of different lengths within this week) will on average be easier to accomplish than recording a 1-hour sample every week, which demands a high degree of discipline from both the recording assistants and the families. In addition, increasing both sampling intervals and durations has the advantage of increasing the hit rate and improving the sampling density per point in time. In other words, while we lose some granularity in this approach (developments can only be observed in larger steps), we gain confidence in our observations. To sum up, this implies that a sampling regime of 4–5 hours within a predefined week per month is preferable to a sampling regime of 1 hour per week, even though the same number of hours is recorded.
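As a practical aid, the calculations above can be collected in a short script (a sketch following the terms of Tomasello and Stahl (2004) as presented in this section; the function names are our own):

```python
import math

# H: total hours the child speaks per interval; h: hours recorded per interval;
# Lambda: estimated frequency of phenomenon X per interval; lam: average catch.

def coverage(h, H=70):
    """Sampling coverage c = h / H."""
    return h / H

def average_catch(c, Lambda):
    """Expected number of captured instances per interval: lambda = c * Lambda."""
    return c * Lambda

def hit_rate(lam):
    """Probability of capturing at least one instance (Poisson): r = 1 - e^-lambda."""
    return 1 - math.exp(-lam)

def coverage_for_catch(lam, Lambda):
    """Coverage needed to capture lam instances per interval on average: c = lam / Lambda."""
    return lam / Lambda

def coverage_for_hit_rate(r, Lambda):
    """Coverage needed to reach hit rate r: c = -ln(1 - r) / Lambda."""
    return -math.log(1 - r) / Lambda

# Worked examples from the text (weekly interval, H = 70 hours):
c = coverage(2)                                      # 2 h/week -> c ~ 0.029
print(round(average_catch(c, 10), 2))                # 0.29 catches/week for Lambda = 10
print(round(hit_rate(4), 2))                         # 0.98 for an average catch of 4
print(round(coverage_for_catch(10, 500) * 70, 1))    # 1.4 h/week to catch 10 of 500
print(round(coverage_for_hit_rate(0.99, 5) * 70))    # 64 (the text's 0.92 * 70 ~ 65
                                                     # uses a rounded coverage): unrealistic
```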

14.2.4 Transcription

Although transcription may seem to be the precondition for all kinds of data annotation, it can in fact itself be viewed as a kind of annotation or "data reduction" (Brown 1993:154). As Ochs (1979, with a focus on language development research) has shown, transcriptions have a deep impact on how we interpret data and what kinds of analyses are possible based on them. Thus, this task should ideally be treated on a par with other annotation tasks. Transcription can be split up into several subtasks, viz. segmentation (detection of time stretches that contain speech), speaker assignment (connecting segments to speaker metadata), and transcription proper (putting text on segments; see also Chap. 11).

An important linguistic parameter of transcriptions is phonetic granularity. Most developmental studies (apart from studies on phonological development) do not require a high level of phonetic precision. Instead, a simple phonological or orthographic transcription is sufficient. Even when research questions make it necessary to transcribe data phonetically, it is helpful to have an additional tier for coarser transcriptions, which represent less variation and are therefore easier to search. Coarse phonological transcriptions may also serve as a surrogate for full orthographic normalization when the latter is not feasible.

Transcription is often the bottleneck of corpus development. In the case of under-researched languages, large amounts of data paired with resource pressures often make it seem hard or even impossible to transcribe all data at once. Kelly et al. (2015:299) therefore suggest transcribing only "potentially relevant stretches", and this is indeed common practice in many corpora. The problem with this practice is that it runs the risk of creating highly biased data where we only find what we want to find. Calculating frequencies or estimating ages of onset based on such data is impossible. Another option to reduce transcription efforts is to transcribe only the children's utterances. While this does not create any statistical problems, it makes it impossible to take child-surrounding speech into consideration. As a consequence there is no way to compare the development of the child to the behavior of mature native speakers, and this is what naturalistic corpora are all about. Thus, there is no way around transcribing all data in each recording session. In order to be able to spot developments even when the corpus is not (yet) fully transcribed, it is useful to transcribe data according to a systematic "zoom-in" pattern, where one starts with the first and the last recording of a child, then transcribes the recording in the middle between the two, then the two recordings in the middle of the stretches thus defined, and so on.

Moving beyond these basic questions, there are a number of transcription problems which are specific to developmental corpora. An excellent overview of these can be found in MacWhinney (2000), so below only a short summary and some recommendations are given.

• The phonology of child language is different from adult language.
For researchers interested in phonological development, a simple orthographic transcription therefore will not suffice – they will need an additional tier for phonetic transcriptions. But even researchers interested in other aspects of language face the problem that children frequently produce deviant forms (as compared to adult language) such as [ˈsuː] for shoe or [ˈpʌmpɪn] for pumpkin. Transcribing only the actual form often makes it difficult to understand what was said. Transcribing only the target form, on the other hand, obscures the fact that the child produced an alternate form. There are basically two ways to solve this dilemma, depending on research interests and available resources. The maximal solution is to transcribe both levels on independent tiers.5 The economical alternative is to transcribe only actual forms and to recycle another tier (e.g. morphological analysis) for implicitly specifying the associated target forms. The disadvantage of this latter approach is that it is no longer

5 It is not advisable to transcribe actual and target forms on a single tier as is done in CHAT, since this will often have negative effects on processability and convertibility.



possible to distinguish between children's mistakes or deviant forms and other cases where the additional tier serves normalization.
• The distinction between actual and target forms is closely connected to error coding, which can, however, also span multiple words and give more precise information regarding the kind of error made. Like all semantic layers that are logically independent of transcription proper, error codes, too, should be specified as an independent layer (e.g. an additional tier) if a researcher is interested in them.
• It is often desirable to link utterances to stretches on the timeline. This makes it easy to locate utterances in a video in order to review their context and to correct or add annotations. Time links also have theoretical applications in language development, where they can e.g. serve as the basis for calculating input frequencies based on real time. Researchers wishing to include time links in their corpus are advised to use software that creates them as part of the transcription process, such as ELAN (https://tla.mpi.nl/tools/tla-tools/elan/) or CHILDES (https://childes.talkbank.org).
• The addressee of utterances is of special interest for language development research because there are important differences between child-directed and adult-directed speech. Thus, it can be useful to code for addressees on a separate tier.
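The "zoom-in" transcription pattern described in this section can be sketched as a small function (an illustrative implementation of our own, not part of any transcription tool):

```python
# "Zoom-in" transcription order: transcribe the first and last recordings,
# then the middle one, then the middles of the remaining stretches, and so on,
# so that development can be spotted before the corpus is fully transcribed.
def zoom_in_order(n):
    """Return the indices 0..n-1 of n recordings in zoom-in transcription order."""
    if n == 0:
        return []
    if n == 1:
        return [0]
    order = [0, n - 1]
    seen = {0, n - 1}
    queue = [(0, n - 1)]            # stretches still to be subdivided
    while queue:
        lo, hi = queue.pop(0)       # breadth-first: coarse steps before fine ones
        mid = (lo + hi) // 2
        if mid not in seen:
            order.append(mid)
            seen.add(mid)
        if mid - lo > 1:
            queue.append((lo, mid))
        if hi - mid > 1:
            queue.append((mid, hi))
    return order

print(zoom_in_order(9))  # [0, 8, 4, 2, 6, 1, 3, 5, 7]
```

Every recording is eventually scheduled, but the early prefix of the order already spans the whole developmental window at increasing granularity.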

14.2.5 Metadata

Any corpus relies on metadata to correlate the speech of the child and her surroundings with social variables. The most important metadata are the recording date; the identity, birth date, and sex of the participants; and the role a participant takes in a given recording (a cover term for kinship relations such as "mother" and functional roles such as "recorder"). Many more fields are provided for in the multitude of XML metadata profiles contained in the CMDI6 Component Registry at https://catalog.clarin.eu/ds/ComponentRegistry, which is maintained by the European Union's CLARIN network (https://www.clarin.eu/). Two CMDI profiles that are widely used in documentary linguistics (the field of linguistics providing comprehensive descriptions of the linguistic practices of individual speech communities) and that are also suitable for developmental research are the IMDI and ELDP profiles.

All widespread metadata standards bundle metadata belonging to different levels (such as sessions or participants) in a single physical file for each session. While this has the advantage of creating neat file systems, it also has at least one severe disadvantage. Participant metadata are logically independent of session metadata – for instance, the name and sex of a participant do not change with every recording. Nevertheless, the design of the mentioned standards provides that if a child appears

6 Component Metadata Infrastructure.



in 50 sessions, all her metadata are repeated 50 times (once in the metadata file for each session). Our experience shows that this inevitably leads to data corruption in the form of inconsistencies. For instance, one and the same child may appear with slightly altered versions of his name ("Thomas", "Tommy"), which will make him look like two different participants to automatic processing. Participant metadata are commonly linked to the participants' utterances via short participant IDs that may be numeric or alphabetical. This both makes it easier to type in the speaker of an utterance during transcription and largely anonymizes the data (given that the metadata are stored in a different file). This is especially relevant if the data are shared more widely.
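To illustrate, a simple consistency check over session-wise participant metadata might look as follows (a hypothetical sketch: the session structure and field names are invented and do not correspond to the IMDI or ELDP profiles):

```python
# Participant metadata repeated in every session file can drift apart over time.
# This sketch groups the records by participant ID and flags fields whose
# values differ across sessions (e.g. "Thomas" vs. "Tommy").
def inconsistencies(sessions):
    """Map participant IDs to fields with conflicting values across sessions."""
    by_id = {}
    for session in sessions:
        for p in session["participants"]:
            by_id.setdefault(p["id"], []).append(p)
    problems = {}
    for pid, records in by_id.items():
        for field in ("name", "sex", "birthdate"):
            values = {r[field] for r in records}
            if len(values) > 1:
                problems.setdefault(pid, {})[field] = sorted(values)
    return problems

sessions = [
    {"participants": [{"id": "CHI1", "name": "Thomas", "sex": "m", "birthdate": "2004-05-01"}]},
    {"participants": [{"id": "CHI1", "name": "Tommy", "sex": "m", "birthdate": "2004-05-01"}]},
]
print(inconsistencies(sessions))  # {'CHI1': {'name': ['Thomas', 'Tommy']}}
```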

14.2.6 Further Annotations

Most questions require further annotations to make the information contained in the data more explicit. Such annotations are the basis for automated quantitative analysis. These annotation steps are often referred to as "tagging" or "coding" in the literature.

One of the most common options for further annotation is utterance-level translations, which are a prerequisite for cross-linguistic research. Without translations, project-external researchers who are not familiar with the language will not be able to use the data. While such researchers might rely on glosses instead (for which see below), glossing is in turn greatly facilitated by utterance-level translations and in many cases only becomes possible through them (e.g. when glosses are done by student assistants, who might not be as familiar with the language as the researchers). This also concerns corpora of languages with strong institutional support: in a globalized world it no longer seems fair to assume that everybody speaks a handful of European languages, for which translations would therefore be futile.

Studies with a focus on the development of grammar rely on grammatical annotations, especially lemmatization, parts of speech, and interlinear glossing (cf. below). There are two main glossing standards used in language development corpora. The standard implemented in the CHILDES database (the CHAT tier %mor:) ignores the function of lexical morphemes and the shape of grammatical morphemes and conflates what remains on a single tier. Thus, the CHAT version of the Chintang (Sino-Tibetan, Nepal) utterance Pakpak cokkota would look as follows:

(1)  *MOT: Pakpak cokkota.
     %mor: pakpak ca-3P-IND.NPST-IPFV
     %eng: 'He's munching away at it.'

While this saves space, this glossing style also has several disadvantages. First, it does not follow the principle of separating distinct semantic layers (such as

14 Child-Language Corpora


segments vs. morpheme functions) in the annotation syntax, thus making this format harder to read, process, and convert than others. Second, a researcher who does not speak Chintang will not know what pakpak and ca mean. Even an utterance translation will not help with longer utterances, where it becomes increasingly hard to identify words and morphemes of the object language with the elements of the translation metalanguage. Such types of glosses are thus less reusable and sustainable. Further, this format does not allow to search for morphemes in the sense of formmeaning pairs. For instance, a search for the segment ca paired with the gloss ‘eat’ or a search for the form -u paired with the function ‘3P’ will unambiguously select two morphemes with a very low possibility of confusion with other morphemes. However, such searches are only possible in formats with complete form and function tiers. In the CHAT format one might search for ca, but this will include homophones and exclude synonyms. Similarly, a search for the gloss 3P may easily yield other morphemes with the same function even when we are only looking for the one with the underlying shape -u. Another common option are true interlinear glosses, which can be seen as a combination of segmentation into morphemes and morpheme-level translations. This type of glosses is also common in general linguistic publications and specifies both the shape and function of all morphemes. Below an example from the Chintang corpus is given (in its native format, Toolbox). The first tier is an orthographic transcription tier similar to the CHAT example. The second tier segments words into morphemes, given in their underlying form. The third tier assigns a gloss to every morpheme, which is a metalanguage translation in the case of lexical items and a standardized label in the case of grammatical morphemes. The fourth tier contains the English translation of tier 1. (2)

\tx Pakpak cokkota.
\mph pakpak ca-u-kV-ta
\mgl without.stopping eat-3[s]P-IND.NPST[3sA]-IPFV
\eng ‘He’s munching away at it.’
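To make concrete why such form–meaning searches require complete form and function tiers, here is a minimal sketch (hypothetical code, not part of Toolbox or CLAN) that parses a record like (2) into aligned morpheme–gloss pairs and then selects the form ca with the gloss ‘eat’:

```python
# Hypothetical sketch: parse a Toolbox-style record into form-meaning pairs.
# Tier markers follow example (2) above; everything else is invented.
record = """\\tx Pakpak cokkota.
\\mph pakpak ca-u-kV-ta
\\mgl without.stopping eat-3[s]P-IND.NPST[3sA]-IPFV
\\eng 'He's munching away at it.'"""

def parse_record(text):
    """Read backslash-coded tiers into a dict mapping marker -> content."""
    tiers = {}
    for line in text.splitlines():
        marker, _, content = line.partition(" ")
        tiers[marker.lstrip("\\")] = content
    return tiers

def morpheme_pairs(tiers):
    """Align the \\mph and \\mgl tiers word by word, then morpheme by morpheme."""
    pairs = []
    for form_word, gloss_word in zip(tiers["mph"].split(), tiers["mgl"].split()):
        pairs.extend(zip(form_word.split("-"), gloss_word.split("-")))
    return pairs

pairs = morpheme_pairs(parse_record(record))
# Searching for the form "ca" paired with the gloss "eat" selects exactly one
# morpheme, excluding homophones (other ca) and synonyms (other 'eat' forms):
hits = [p for p in pairs if p == ("ca", "eat")]
```

With a CHAT-style gloss, only one of the two tiers is available, so the same query would include homophones or miss synonyms, as discussed above.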

Morpheme-based glosses are a precondition for any search for basic semantic units without knowing the language. When glosses correspond to a standard such as the Leipzig Glossing Rules (https://www.eva.mpg.de/lingua/resources/glossing-rules.php) or the Universal Dependencies framework (http://universaldependencies.org/u/feat/index.html), it becomes easy to compare different corpora and languages. The relation between morphemes and glosses is in general much more straightforward than that between utterances and translations, making it ideal for more objective analyses and for searching for specific elements (cf. Demuth 1996:21). A corpus with good interlinear glosses may even make utterance translations unnecessary.

Another common semantic layer is parts of speech (POS). POS annotations are crucial for analyses of grammatical development (see Chap. 2). While glosses are highly useful even without POS tags, the reverse is hardly true because POS tags


S. Stoll and R. Schikowski

alone do not make it possible to search for specific functions or morphemes. Thus, POS tagging should be done simultaneously with or after glossing. The combination of the two layers has many applications in developmental research, making POS tags another recommendable tier. As always, it is advisable to keep POS tags and glosses on distinct tiers.

Beyond the basic layers just discussed, there is a plethora of further possibilities that cannot all be listed here. Common examples include the annotation of semantic roles or other properties linked to reference, syntactic dependencies, speech acts, errors on various levels, or actions accompanying speech (gaze and pointing). (3) shows an example of syntactic dependency annotations in a Hebrew corpus, created by an automatic parser (Gretz et al. 2015:114; dummy speaker inserted). The relevant tier is %gra:, with dependency structures represented as triplets: the first element is the index of the token, the second is the index of that token’s head, and the third indicates the relation of the token.
(3)

*CHI: Panī loP rocē ṭipōt
%mor: pro:person|num:sg neg part|num:sg n|num:pl
%gra: 1|3|Aagr 2|3|Mneg 3|0|Root 4|3|Anonagr
%eng: ‘I don’t want drops.’
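The triplet notation is easy to process automatically. The short sketch below (hypothetical code, not a CHILDES utility) unpacks a %gra tier like the one in (3) into (index, head, relation) tuples:

```python
# Hypothetical sketch: decode %gra dependency triplets (index|head|relation).
gra = "1|3|Aagr 2|3|Mneg 3|0|Root 4|3|Anonagr"

def parse_gra(tier):
    """Return one (index, head, relation) tuple per token; head 0 marks the root."""
    triples = []
    for token in tier.split():
        idx, head, rel = token.split("|")
        triples.append((int(idx), int(head), rel))
    return triples

deps = parse_gra(gra)
roots = [idx for idx, head, _ in deps if head == 0]        # the root token
dependents = [idx for idx, head, _ in deps if head == 3]   # tokens headed by 3
```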

Figure 14.1 shows an annotation tier for pointing behavior (index finger points, hand points etc.) accompanying the speech of the interlocutors taken from the longitudinal corpus of Russian language development (Stoll et al. 2019). The format is ELAN.

14.2.7 Ethical Considerations

Language acquisition data are highly sensitive for several reasons. Children are an especially vulnerable population and are not considered capable of giving informed consent in many legislations. Recordings are typically taken in intimate settings, and if more participants than just mother–child dyads are present, they will, once they have gotten used to the camera, start talking about anything, including topics that may be considered taboo by a given society. Moreover, longitudinal designs and developmental questions require tracking participants and collecting metadata such as full names, addresses, and birth dates, which makes the raw data quite the opposite of anonymous. Ethical considerations should therefore be an integral part of language acquisition research. Besides obtaining ethics clearance from institutional reviewing boards and/or funding agencies, researchers need to have sufficient knowledge of the socio-



Fig. 14.1 Pointing annotations in the Stoll Corpus

cultural context and take the time to explain to the participating families in detail what the research implies. In communities speaking underdocumented languages, involving the community itself may be an additional concern, and data protection is of special importance since these communities are often tightly knit.

These high demands often stand in direct opposition to long-standing calls for more exchange and interoperability of language acquisition data, which have recently been refueled by the open access and open data movements. Publishing language acquisition data without taking any measures to protect subjects is ethically highly problematic. Researchers must therefore be aware of data protection techniques such as pseudonymization (also known as “coding”), anonymization, aggregation, and possibly encryption. The organizational design of databases is just as important and should cover aspects such as user roles, access levels, and long-term use. Direct collaboration with data owners helps to build trust and provides the additional advantage of access to rich knowledge of the cultural context.
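As an illustration of the first of these techniques, the sketch below (hypothetical code; the names, codes, and function are invented) pseudonymizes a transcript line by consistently replacing each personal name with an opaque code. In practice, the name-to-code mapping would be stored separately under restricted access:

```python
import re

def pseudonymize(text, names):
    """Replace each listed name with a stable code; return new text and mapping."""
    mapping = {name: f"PERSON_{i + 1:03d}" for i, name in enumerate(names)}
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping

out, mapping = pseudonymize("Anna gives Boris the ball. Anna laughs.",
                            ["Anna", "Boris"])
# out == "PERSON_001 gives PERSON_002 the ball. PERSON_001 laughs."
```

Real pseudonymization additionally has to cover inflected name forms, nicknames, place names, and names occurring in the audio itself, which is why it is usually combined with manual checking.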



Representative Study 1

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M., and Lyons, T. 1991. Early vocabulary growth: Relation to language input and gender. Developmental Psychology, 27(2), 236–248.

One frequent topic of developmental studies is the development of basic vocabulary. Studies in this area most often rely on transcriptions without considering grammar. In order to understand the enormous variation that children show in this regard, innate predispositions and environmental variables need to be tested in a large number of children. In a large-scale study, Huttenlocher et al. (1991) analyzed the vocabulary growth of 11 children from 14 to 26 months in their natural environment. The corpus consists of two subgroups: five children were recorded from 14 months onwards for 5 hours per month, and six children from 16 months onwards for 3 hours per month. This type of data and sampling allowed Huttenlocher and colleagues to test which variables (including the sampling regime) are mainly responsible for vocabulary growth, while at the same time controlling for how much the target child and the caregivers talk overall. This is important for calculating the chance of over- or underestimating the vocabulary size of a child. The authors could show that the relative talkativeness of the mothers is stable over time, but that an increase in the amount of speech can be explained as an adaptation to the age of the child. Between-child models focusing on the relation of group, exposure, and gender found that vocabulary acceleration in Group 1, with its higher recording density, was 45% greater than in Group 2. This shows impressively how strongly our results depend on our sampling decisions. If group effects are controlled for, the role of exposure is positive and significant, with girls accelerating faster than boys, i.e. gender is a significant variable.

Representative Study 2

Abbot-Smith, K. and Behrens, H. 2006. How known constructions influence the acquisition of other constructions: the German passive and future constructions. Cognitive Science 30, 995–1026.

Abbot-Smith and Behrens (2006) illustrate in an impressive high-density study of one German child, age 2;0–5;0 (see Representative Corpus 1), how constructions interact in learning. They show that two German passive constructions and the German future, which involve similar auxiliary constructions, are learned in a piecemeal fashion at different paces. A main insight of this study is that construction learning is not an isolated process: previously



learned constructions have an influence on later, similar constructions, i.e. they either support or impede their development. The authors show that the learning of the German passive construction built with the auxiliary sein is supported by previously acquired copula constructions built with the same auxiliary. The other passive construction, built with the auxiliary werden, was learned much later and was not supported by the productivity of prior related constructions. In the analysis of the German future tense, which is also built with the auxiliary werden, they found that a semantically similar construction delayed the acquisition of the future tense. Such results can only be obtained through dense corpus studies, which allow us to detect relatively rare phenomena such as passive constructions.

Representative Corpus 1

The Thomas Corpus (Lieven et al. 2009, https://childes.talkbank.org/access/Eng-UK/Thomas.html), compiled by the Max Planck Child Study Center at the University of Manchester, and the Leo Corpus, compiled at the Max Planck Institute for Evolutionary Anthropology, Leipzig (Behrens 2006, http://childes.talkbank.org/access/German/Leo.html), are both dense longitudinal corpora consisting mainly of interactions of the target child with one caregiver. The corpora contain data from one English-learning child (aged 2;0 to 4;11) and one German-learning child (aged 1;11 to 4;11). The corpora are especially notable for their high sampling density: for the third year of the children’s life, five 60-minute recordings were taken every week (four audio, one audio and video), and the later phases consisted of one hour of recording per month. This allows views into language development at an unprecedented resolution. The corpora are fully lemmatized and contain annotations for morphology and parts of speech.

Representative Corpus 2

The Language 0–5 Corpus (https://www.lucid.ac.uk/what-we-do/research/language-0-5-project/) is currently being built by Caroline Rowland and colleagues at the University of Liverpool. What makes this corpus remarkable is its broad scope: it follows 80 English-learning children longitudinally from age 0;6 to 5;0, combining naturalistic recordings, questionnaires, and experiments, thus building the most comprehensive picture of inter-individual variation in one language that language development research has seen so far. This corpus has the potential to change the field of developmental research radically by allowing sound generalizations from the sample to the population.



Representative Corpus 3

The Chintang Language Corpus (Stoll et al. 2012, 2019, http://www.clrp.uzh.ch), built by Sabine Stoll, Elena Lieven, and colleagues, is an example of a corpus that strikes a balance between the requirements of sample size and sampling density and that has dealt with some of the typical problems of studying underresourced languages. It contains data from six children (two each in three rough age ranges: 0;6 to 1;10, 2;0 to 3;5, and 3;0 to 4;3) learning Chintang, a highly polysynthetic, endangered Sino-Tibetan language spoken in a small village in Eastern Nepal. The recordings mainly took place outside on the veranda, where the children played and the adults sat around. The corpus respects the natural context in which the children grow up and includes transcriptions of all the speech of the people surrounding the child. For each child, 4 hours were recorded per month within a single week. The corpus is richly annotated, e.g. with translations, Leipzig Glossing Rules-style interlinear glosses, and POS tags for all participants.

14.3 Critical Assessment and Future Directions

The main goal of language development research is to understand how language can be learned by children. For this we need to know how children and the people surrounding them use language in their natural environment. Observational corpora are the best tool for estimating how distributions in the input relate to learning. However, they come at great cost: the logistics of compiling such corpora are extremely demanding, and therefore most corpora include only a small number of children with limited recordings. As a consequence, one of the major impediments to progress in our field has been the quantity of data we have at our disposal. Quantity here refers both to (i) the size of individual corpora (e.g., number of participants, length and interval of recordings) and (ii) the number (and type) of languages we have data for. Why is this relevant? Two issues are key here.

Corpus size is relevant because we need reliable frequency distributions of a large range of structures and constructions to estimate developmental growth curves. This requires dense corpora such as the Max Planck dense databases. So far, this type of data is only available for a small number of participants in two languages. A main challenge is to extend this approach to a wider number of participants, as pioneered in the Language 0–5 project presented above. Since high-density recordings are very demanding both for the people recorded and for the researchers, this might not be a viable avenue, at least for a wider range of languages. A potential alternative might be to combine individual high-density studies with large-scale cross-sectional studies.



The number of languages is relevant because understanding language development requires understanding how children can learn any language. However, the corpora that are currently available are still heavily biased towards Indo-European languages spoken in Europe. At least one study of language development is available for only about 2% of the languages of the world, and such a study often concerns a single feature of the language under investigation (Stoll 2009; Lieven and Stoll 2010). As a result, we only know about the development of a tiny number of features in a tiny number of languages. A complicating fact is that our sample of “Standard Average European” languages is in fact rather exotic and unusual from a typological point of view (Dahl 1990). Thus, what language development research has been doing so far is looking at how languages that could be termed “the odd ones out” are learned, rather than making claims about how language is learned independently of individual structural properties. To remedy this problem we need more data from a large variety of typologically diverse languages. This issue has been extensively discussed in typological research, where it is widely acknowledged that statements about language are only possible when a representative sample of languages is considered.

There is another good reason besides sampling issues why language development research is in dire need of more diverse corpora. As pointed out by Bowerman (1993, 2011), it is frequently assumed that typology and language development are linked in a principled way: the most frequent structures in typology could also be the ones that are easiest to learn, and vice versa (Moran et al. 2017). This, too, can only be proven on the basis of more, and more diverse, data. Recently a new sampling approach has been proposed focusing on diversity in grammatical structures (Stoll and Bickel 2013), resulting in a database (http://www.acqdiv.uzh.
ch/en.html) of longitudinal data from five groups of grammatically maximally diverse languages (Stoll and Bickel 2013; Moran et al. 2016). This database allows us to simulate maximum structural diversity in the languages of the world and is therefore an ideal testing ground for general learning mechanisms and input structures.

To appreciate the role of different structural features in development, we need to be able to compare structures of different languages. Thus, while a large amount of data is already available, it is by no means trivial to explore it in a uniform way. Corpora usually have their own idiosyncratic study designs, including the number and age of children involved, the sampling regime, and the type of contexts in which recordings are taken, let alone the different conventions used for annotations. Comparability is sometimes complicated by rather technical matters such as different choices of file formats, corpus formats and syntax, and coding schemes. CHILDES and the CHAT format, as well as widespread markup languages such as XML, have made a great contribution to unification but still leave room for enormous variation, even in areas that are of great interest to many researchers, such as word meaning, morphology, or syntax. Thus, more standardization, also linking to standards existing in other corners of corpus linguistics, seems an indispensable step towards exploiting the full range of developmental data that we have. A recent effort to unify glossing and initiate collaborative comparative research was undertaken in the above-mentioned cross-linguistic database ACQDIV, composed of longitudinal



corpora from typologically diverse languages. The database features unified glosses, POS tags, and other variables that are relevant for comparative studies (for a description of the design see Moran et al. 2016; Jancso et al. 2020).

To sum up, the two issues of quantity are tightly intertwined. In short, we need more data for more diverse languages. To achieve this goal, we need massive improvements in automatic transcription and automatic interlinear glossing. To include a wider set of languages in our samples, we will need to focus more on fieldwork on less well-known languages. For this, specialists in language development should ideally team up with linguists specialized in these languages. For remote and undescribed languages, however, it is usually impossible to conduct high-density studies, let alone to record more than a handful of children and subsequently transcribe and annotate the data. In addition, fieldwork in culturally diverse settings necessitates a plethora of ethical considerations related to language attitudes, language policies, and privacy issues, which need to be resolved in collaboration between communities and fieldworkers (Stoll 2016). As of now, the field is also still waiting for innovative ways of reconciling high ethical standards with the demand for open access. Possibilities include implementing a sophisticated, multi-layered system of access; making sensitive data publicly available in an aggregated format that does not allow inferences about associations between individual utterances and speakers; or fully encrypting the data so that structure is preserved but meaning (including names) is lost. Thus, the major challenge for observational language development research is to overcome these impediments and to introduce big data approaches. Only then will we be able to conduct large-scale meta-analyses as pioneered by Bergmann et al. (2018). These are developments that will hopefully take place in the not-too-distant future.
For now, as we see it, the most pressing step to advance the field of corpus-based language development research is to strengthen our statistical methods to allow for sound generalizations from the small corpora that are at our disposal.

14.4 Tools and Resources

One of the most prominent and important platforms for data sharing and corpus tools in language development research is CHILDES (MacWhinney 2000). CHILDES is part of the TalkBank system, a database for studying and sharing conversational data (http://childes.talkbank.org). The database currently contains corpora from over 40 languages. The size, depth, design, and format of the corpora vary widely, ranging from first language development and bilingual corpora to clinical corpora. CHILDES also provides a number of tools for corpus development, including transcription and annotation programs. Another recent approach to hosting corpora and initiating collaborative research is the ACQDIV project (http://www.acqdiv.uzh.ch). The project features a database of corpora from maximally diverse and often endangered languages.



Further Reading

Behrens, H. (ed.) 2008. Corpora in Language Acquisition Research: Finding Structure in Data (= Trends in Language Acquisition Research 6 (TiLAR)). Amsterdam: Benjamins.

This collective volume edited by Heike Behrens is a highly valuable collection on various topics related to language development corpora, among them methods for corpus building and the history of this genre.

Meakins, F., Green, J., Turpin, M. 2018. Understanding linguistic fieldwork. Routledge.

This comprehensive volume on fieldwork techniques provides an in-depth practical introduction to fieldwork on small and/or endangered languages. It also includes a thorough chapter with ample and very useful practical instructions for building corpora of language development in such environments.

Acknowledgments The research leading to these results received funding from the European Union’s Seventh Framework Programme (FP7/2007–2013), ERC Consolidator Grant ACQDIV (grant agreement no. 615988, PI Sabine Stoll).

References

Abbot-Smith, K., & Behrens, H. (2006). How known constructions influence the acquisition of other constructions: The German passive and future constructions. Cognitive Science, 30, 995–1026.
Bates, E., Bretherton, I., & Snyder, L. S. (1988). From first words to grammar: Individual differences and dissociable mechanisms. Cambridge: Cambridge University Press.
Bates, E., Dale, P. S., & Thal, D. (1995). Individual differences and their implications for theories of language development. In P. Fletcher & B. MacWhinney (Eds.), Handbook of child language (pp. 96–151). Oxford: Basil Blackwell.
Behrens, H. (2006). The input–output relationship in first language acquisition. Language and Cognitive Processes, 21, 2–24.
Bergelson, E., Amatuni, A., Dailey, S., Koorathota, S., & Tor, S. (2018). Day by day, hour by hour: Naturalistic language input to infants. Developmental Science, 22(3), 1–10.
Bergmann, C., Tsuji, S., Piccinini, P. E., Lewis, M. L., Braginsky, M., Frank, M. C., & Cristia, A. (2018). Promoting replicability in developmental research through meta-analyses: Insights from language acquisition research. Child Development, 89(6), 1996–2009.
Bowerman, M. (1993). Typological perspectives on language acquisition. In E. V. Clark (Ed.), The Proceedings of the Twenty-Fifth Annual Child Language Research Forum (pp. 7–15). Stanford: Center for the Study of Language and Information.
Bowerman, M. (2011). Linguistic typology and first language acquisition. In J. J. Song (Ed.), The Oxford handbook of linguistic typology (pp. 591–617). Oxford: Oxford University Press.
Brown, P. (1993). The role of shape in the acquisition of Tzeltal (Mayan) locatives. In E. Clark (Ed.), The Proceedings of the 25th Annual Child Language Research Forum (pp. 211–220). New York: Cambridge University Press.
Dahl, Ö. (1990). Standard Average European as an exotic language. In J. Bechert, G. Bernini, & C. Buridant (Eds.), Towards a typology of European languages (pp. 3–9). Berlin: Mouton de Gruyter.



Demuth, K. (1996). Collecting spontaneous production data. In J. D. Villiers, C. McKee, & H. S. Cairns (Eds.), Methods for assessing children’s syntax (pp. 3–22). Cambridge, MA: MIT Press.
Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., Pethick, S. J., Tomasello, M., Mervis, C. B., & Stiles, J. (1994). Variability in early communicative development. Monographs of the Society for Research in Child Development, 59(5), i–185. New Jersey: Wiley.
Fernald, A., Marchman, V. A., & Weisleder, A. (2013). SES differences in language processing skill and vocabulary are evident at 18 months. Developmental Science, 16(2), 234–248.
Gretz, S., Itai, A., MacWhinney, B., Nir, B., & Wintner, S. (2015). Parsing Hebrew CHILDES transcripts. Language Resources and Evaluation, 49(1), 107–145.
Hart, B., & Risley, T. R. (1995). Meaningful differences in the everyday experience of young American children. Baltimore: Paul Brookes Publishing.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. Nature, 466(7302), 29.
Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M., & Lyons, T. (1991). Early vocabulary growth: Relation to language input and gender. Developmental Psychology, 27(2), 236–248.
Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104(1), 53–69.
Jancso, A., Moran, S., & Stoll, S. (2020). The ACQDIV corpus database and aggregation pipeline. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 156–165.
Kelly, B. F., Forshaw, W., Nordlinger, R., & Wigglesworth, G. (2015). Linguistic diversity in first language acquisition research: Moving beyond the challenges. First Language, 35(4–5), 286–304.
Kuhl, P. K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5(11), 831–843.
Lieven, E. (2006). Variation in first language acquisition. In K. Brown (Ed.), Encyclopedia of language & linguistics (2nd ed., pp. 350–354). Amsterdam: Elsevier.
Lieven, E., Salomo, D., & Tomasello, M. (2009). Two-year-old children’s production of multiword utterances: A usage-based analysis. Cognitive Linguistics, 20(3), 481–507.
Lieven, E. V. M., & Stoll, S. (2010). Language. In M. H. Bornstein (Ed.), Handbook of cultural developmental science (pp. 143–160). New York: Psychology Press.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk. Mahwah: Lawrence Erlbaum Associates.
Malvern, D. D., & Richards, B. J. (1997). A new measure of lexical diversity. In A. Ryan & A. Wray (Eds.), Evolving models of language (pp. 58–71). Clevedon: Multilingual Matters.
Maratsos, M. P. (2000). More overregularizations after all: New data and discussion on Marcus, Pinker, Ullman, Hollander, Rosen and Xu. Journal of Child Language, 27, 183–212.
Moran, S., Sauppe, S., Lester, N., & Stoll, S. (2017). Worldwide frequency distribution of phoneme types predicts their order of acquisition. Paper presented at the 51st Annual Meeting of the Societas Linguistica Europaea (SLE), 29 Aug–1 Sept 2017.
Moran, S., Schikowski, R., Pajović, D., Hysi, C., & Stoll, S. (2016). The ACQDIV database: Min(d)ing the ambient language. In N. Calzolari, K. Choukri, T. Declerck, M. Grobelnik, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 23–28). Paris: ELRA.
Ochs, E. (1979). Transcription as theory. In E. Ochs & B. Schieffelin (Eds.), Developmental pragmatics (pp. 43–71). New York: Academic.
Rowe, M. L. (2008). Child-directed speech: Relation to socioeconomic status, knowledge of child development and child vocabulary skill. Journal of Child Language, 35(1), 185–205.
Rowe, M. L. (2012). A longitudinal investigation of the role of quantity and quality of child-directed speech in vocabulary development. Child Development, 83(5), 1762–1774.
Rowland, C. F., & Fletcher, S. L. (2006). The effect of sampling on estimates of lexical specificity and error rates. Journal of Child Language, 33(4), 859–877.



Rowland, C. F., Fletcher, S. L., & Freudenthal, D. (2008). How big is big enough? Assessing the reliability of data from naturalistic samples. In H. Behrens (Ed.), Corpora in language acquisition research: History, methods, perspectives (pp. 1–24). Amsterdam: John Benjamins.
Roy, B. C., Frank, M. C., & Roy, D. (2009). Exploring word learning in a high-density longitudinal corpus. In N. Taatgen & H. van Rijn (Eds.), Proceedings of the Thirty-First Annual Conference of the Cognitive Science Society, July 29–August 1, 2009, Vrije Universiteit, Amsterdam, Netherlands.
Roy, D., Patel, R., DeCamp, P., Kubat, R., Fleischman, M., Roy, B., Mavridis, N., Tellex, S., Salata, A., Guinness, J., et al. (2006). The Human Speechome Project. Lecture Notes in Computer Science, 4211, 192–196.
Stoll, S. (2009). Crosslinguistic approaches to language acquisition. In E. Bavin (Ed.), The Cambridge handbook of child language (pp. 89–104). Cambridge: Cambridge University Press.
Stoll, S. (2016). Studying language acquisition in different linguistic and cultural settings. In N. Bonvillain (Ed.), The Routledge handbook of linguistic anthropology (pp. 140–158). New York: Routledge.
Stoll, S., & Bickel, B. (2013). Capturing diversity in language acquisition research. In B. Bickel, L. A. Grenoble, D. A. Peterson, & A. Timberlake (Eds.), Language typology and historical contingency: Studies in honor of Johanna Nichols (pp. 195–260). Amsterdam: Benjamins Publishing Company.
Stoll, S., & Gries, S. (2009). How to measure development in corpora? An association-strength approach to characterizing development in corpora. Journal of Child Language, 36, 1075–1090.
Stoll, S., Bickel, B., Lieven, E., Banjade, G., Bhatta, T. N., Gaenszle, M., Paudyal, N. P., Pettigrew, J., Rai, I. P., Rai, M., & Rai, N. K. (2012). Nouns and verbs in Chintang: Children’s usage and surrounding adult speech. Journal of Child Language, 39, 284–321.
Stoll, S., Lieven, E., Banjade, G., Bhatta, T. N., Gaenszle, M., Paudyal, N. P., Rai, M., Rai, N. K., Rai, I. P., Zakharko, T., Schikowski, R., & Bickel, B. (2019). Audiovisual corpus on the acquisition of Chintang by six children. Electronic resource.
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.
Tomasello, M., & Stahl, D. (2004). Sampling children’s spontaneous speech: How much is enough? Journal of Child Language, 31, 101–121.
VanDam, M., Warlaumont, A. S., Bergelson, E., Cristia, A., Soderstrom, M., De Palma, P., & MacWhinney, B. (2016). HomeBank: An online repository of daylong child-centered audio recordings. Seminars in Speech and Language, 37(2), 128–142. New York: Thieme Medical Publishers.

Chapter 15

Web Corpora

Andrew Kehoe

Abstract This chapter explores the increasingly important role of the web in corpus linguistic research. It describes the two main approaches adopted in the field, which have been termed ‘web as corpus’ and ‘web for corpus’. The former approach attempts to extract linguistic examples directly from the web using standard search engines like Google or other more specialist tools, while the latter uses the web as a source of texts for the building of off-line corpora. The chapter examines the pitfalls of the entry-level ‘web as corpus’ approach before going on to describe in detail the steps involved in using the ‘web for corpus’ approach to build bespoke corpora by downloading data from the web. Through a series of examples from leading research in the field, the chapter examines the significant new methodological challenges the web presents for linguistic study. The overall aim is to outline ways in which these challenges can be overcome through careful selection of data and use of appropriate software tools.

15.1 Introduction

Over the past two decades the internet has become an increasingly pervasive part of our lives, leading to significant changes in well-established ways of carrying out everyday tasks, from catching up with the news and keeping in touch with friends to grocery shopping and job hunting. The latest statistics suggest that there are over 4 billion internet users worldwide, with a growth rate of over 1000% since 2000.1 It should come as no surprise then that, during the same period, the web has had a major impact on well-established ways of doing things in the field of corpus linguistic research too. Part of this impact has involved the increased availability of

1 http://www.internetworldstats.com/stats.htm. Accessed 22 May 2019.

A. Kehoe () Birmingham City University, Birmingham, UK e-mail: [email protected] © Springer Nature Switzerland AG 2020 M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_15




existing resources, with several of the corpora discussed in previous chapters now available online. For example, the British National Corpus (BNC) is searchable on the Brigham Young University website,2 meaning it is no longer essential for researchers to install the corpus and associated software on their own local computers. The BNC is not a ‘web corpus’ as such but it is a corpus that is now available to search via the web. What this chapter is more interested in, however, is the growing use of the web itself as a linguistic resource. As the chapter will go on to explain, this research area has diversified in recent years to include a wide range of different activities, but what they have in common is the use of linguistic data from the web, either in place of data from standard corpora like the BNC or to supplement it. The key benefits of the web over such corpora are its size and the fact that it is constantly updated with new texts and, thus, examples of the latest language use. Even a 100-million-word corpus like the BNC is too small for some purposes, such as lexicographic and collocational research. Most words in the BNC occur fewer than 50 times, which makes it difficult to draw firm conclusions about their meaning (Kilgarriff and Grefenstette 2003). Pomikálek et al. (2009) give specific examples: the verb hector with only 37 occurrences and the noun heebie-jeebies with none. At the time of writing, a Google search for heebie-jeebies returns over 970,000 hits, which would seem to offer exciting new possibilities for corpus linguistic research. However, as this chapter will go on to illustrate, the web also presents us with a substantial set of new challenges. Some of these are methodological and can be overcome through the use of appropriate techniques and software tools, whereas others require us to rethink the fundamental principles of corpus linguistics as a discipline.

15.2 Fundamentals

There are two main approaches to the use of web data in corpus linguistic research, which have been termed ‘web as corpus’ and ‘web for corpus’ (de Schryver 2002). Each has several variants which are described in turn below. It is worth noting here that, although ‘web as corpus’ and ‘web for corpus’ are distinct approaches, the former is sometimes used as an umbrella term for the whole research area, for example in the book title Web As Corpus: Theory and Practice (Gatto 2014) or in the name of the ACL SIGWAC: Special Interest Group on Web As Corpus.3 This reflects the fact that the ‘web as corpus’ approach was the first of the two to emerge in the late 1990s.

2 http://corpus.byu.edu/bnc/. Accessed 22 May 2019.
3 http://www.sigwac.org.uk/. Accessed 22 May 2019.

15 Web Corpora


15.2.1 Web as Corpus

This ‘entry level’ approach uses commercial search engines such as Google to access the textual content of the web. When considering the term ‘web as corpus’, the first question we must ask is whether the web can actually be classed as a corpus according to the criteria set out in Chap. 1. On the surface there are similarities between conventional corpora and the web, which have led some researchers to refer to the latter as a ‘cybercorpus’ (Brekke 2000:227) or ‘supercorpus’ (Bergh 2005:26). Like the corpora discussed in previous chapters, the web contains large quantities of textual data that can be explored through dedicated search interfaces. However, if we consider the issue in more depth it becomes clear that the web does not meet several of the key defining criteria of a corpus. Sinclair (2005) offers a succinct summary:

The World Wide Web is not a corpus, because its dimensions are unknown and constantly changing, and because it has not been designed from a linguistic perspective. At present it is quite mysterious, because the search engines [ . . . ] are all different, none of them are comprehensive, and it is not at all clear what population is being sampled.

Sinclair’s first objection relates to corpus size. From the 1 million word Brown corpus of the 1960s to the 100 million word BNC of the 1990s, the resources used by researchers in the field have conventionally been of known (usually finite) size. The web, in contrast, is indeed ‘quite mysterious’ (Sinclair 2005). Over a decade after Sinclair made this statement, although we know that the web has grown in size by many orders of magnitude, we are still no closer to knowing exactly how large it is. Search engine companies such as Google release relevant information publicly from time to time (e.g. Brin and Page 1998) but, for commercial reasons, it is usually in their interests to remain as mysterious as possible. We do know that the web is much larger than any conventional corpus. Early estimates by linguists took the ‘hit’ counts returned by search engines for specific words then used the frequencies of those words in conventional corpora to extrapolate the size of the web (Bergh et al. 1998). Researchers in other fields tend to measure the size of web collections in terms of number of pages rather than number of words, but their methods are no more advanced. For example, recent research in the field of ‘Webometrics’ used an almost identical technique to Bergh et al. (1998), estimating the size of the web to be just under 50 billion pages (van den Bosch et al. 2016).4 However, estimates vary wildly, with researchers at Google reporting in March 2013 that their software

4 See the website by the same authors for latest estimates: http://www.worldwidewebsize.com/. Accessed 22 May 2019.



was aware of 30 trillion pages;5 a figure that had risen to 130 trillion by November 2016.6 Sinclair’s second objection relates to the composition of the web. When we conduct a search using Google we are given no indication of the status of the matching texts it returns. We do not know whether a text has been through a careful editing process or whether it contains spontaneous thoughts. Often we cannot determine when a text was published and whether it has been edited since (Kehoe 2006). It can be difficult to discover the intended purpose or audience of a text, and whether it was written by a native-speaker. Sometimes we are given no indication of the author’s identity at all, and it may even be the case that the text was generated or translated automatically by a computer. It is only in the last few years that linguists have begun to answer these questions by developing techniques to analyse web content on a large scale (Biber et al. 2015 – see Representative Study 1) but there is still work to be done. An example will illustrate the limitations we face if we attempt to treat the web as a corpus using conventional search engines. Figure 15.1 shows the results of a Google search for the phrase genius idea, designed to investigate a change in the meaning of genius that has taken place too recently to be found in standard reference corpora like the BNC (most recent text from 1994). According to Rundell (2009), this is a noun ‘hovering on the brink of adjective-hood’, used in contexts where ingenious would previously have been found. The immediate problem we face is that web search engines do not allow grammatical searches so we cannot specify that we want to see only instances of genius used as an adjective. We therefore search for a phrase made up of genius followed by the word idea, which we might expect to collocate with genius in its adjectival sense (we could also try words such as plan, response, touch or move). Figure 15.1 was generated in late 2016.
If we were to re-run the Google search the results would probably look quite different. A second major problem with accessing the web as a corpus through commercial search engines is that results change frequently as the search engines update their indexes of web content. This has a significant effect on the reproducibility of findings, which is an important consideration in all scientific research. There is a delicate balance here: as linguists we want to keep pace with the latest developments in language use but we also want to retain control of exactly what appears in our corpus at any given point in time. There are other, more specific problems in Fig. 15.1 too. The first is that Google ignores case, meaning that some of the ‘matches’ for genius idea are capitalised proper nouns (e.g. ‘Genius Idea Studios, LLC’). Case-sensitivity is vital in linguistic

5 https://search.googleblog.com/2013/03/billions-of-times-day-in-blink-of-eye.html. Accessed 22 May 2019. This refers to the number of pages the Google software was aware of at that time, not the number of pages actually held in its index.
6 https://searchengineland.com/googles-search-indexes-hits-130-trillion-pages-documents-263378. Accessed 22 May 2019. This information has since been removed from the Google website and updated figures are no longer provided.



Fig. 15.1 Google search results for the phrase genius idea

search but there is no way to specify it in Google. Secondly, Google shows only one match from each website within a limited context (the concordance span in corpus linguistic terms). The only way to extract a full set of examples within a usable context is to click on each link in turn, locate each match manually and copy it to a file.



Finally, although Google claims to have available ‘About 818,000 results’ in Fig. 15.1, there is no way to view all of these through its web interface. We also need to bear in mind that this ‘hit’ count represents the number of pages in the Google index containing the search term, not the actual frequency of the term in the index. Some researchers (e.g. Keller and Lapata 2003; Schmied 2006) have attempted to use Google hit counts in linguistic research but the results must be treated with caution as we cannot be sure exactly what the figures reported by Google represent. Indeed, a study by Rayson et al. (2012:29) demonstrated that Google hit counts ‘are often highly misleading, inflating results drastically’ and are thus a poor indication of the number of matching pages, let alone of actual search term frequency. Since the late 1990s when linguists began to turn to the web as a corpus, search engines have actually become less advanced in terms of the options they offer. Full regular expression search has never been possible but useful features for linguistic study such as wildcards and the ‘NEAR’ operator (for finding one word in close proximity to another) have gradually been removed over the years. Commercial search engines are geared towards information retrieval rather than the extraction of linguistic data. What they do, they do increasingly well but they are less than ideal for linguistic research. This is why several tools have been developed to ‘piggyback’ on commercial search engines and add layers of refinement specifically for linguistic study, including KWiCFinder (Fletcher 2004), WebCONC (Hüning 2001) and, the only one still operational at the time of writing, WebCorp (Kehoe and Renouf 2002). WebCorp Live, as it is now known, operates by sending the search term to a commercial search engine like Google, extracting the ‘hit’ URLs from the results page, and then accessing each URL directly to gather a full set of matches.
These are then presented in the familiar Key Word in Context (KWIC) format, which can be sorted alphabetically or by date. WebCorp also post-processes search engine results to offer case-sensitivity and pattern matching options. An extract of WebCorp Live output for genius idea (case-sensitive and restricted to UK broadsheet newspaper websites) is shown in Fig. 15.2. Although WebCorp Live offers advantages over direct use of commercial search engines (and is still widely used for this reason), it does not solve the underlying problems of the web as corpus approach. Quantitative analysis – one of the core activities of corpus linguistic research – is not possible as we do not know the total size of the web ‘corpus’ held on the search engines’ servers. It is also unclear exactly how the search engines decide upon ‘relevant’ matches for a query. Google results are selected and sorted by a proprietary measure of relevance (PageRank: Brin and Page 1998), which is altered regularly in unpredictable and undocumented ways. This problem has become more acute in recent years following the introduction of personalised search results based on factors such as geographic location and previous web activity. A final, more practical problem is that search engines are becoming increasingly difficult to use in linguistic study at all. WebCorp Live originally used a process known as ‘web scraping’: the extraction of useful information from the HTML code of a web page, in this case the ‘hit’ URLs from the Google results page and examples



Fig. 15.2 WebCorp Live output for genius idea

of the search term from each of the ‘hit’ pages. Other linguists have used similar techniques, writing scripts in programming languages such as Perl, R, and Python, but this is no longer possible as search engines now block access from software other than recognised web browsers such as Internet Explorer and Chrome. WebCorp Live now uses the Application Programming Interfaces (APIs) provided by search engine companies to give controlled access to their indexes. Unfortunately, in the last few years many popular search engines have either begun to charge a fee for API access (e.g. Bing) or have shut down their APIs entirely (Google). This has further reduced the usefulness of search engines for linguistic research. Few researchers would now claim that the web is a corpus in any meaningful sense, but the web as corpus approach can still be fruitful for certain kinds of research and it is particularly useful for introducing newcomers to the field. Perhaps a more suitable term for the activities described in this section is ‘web as corpus surrogate’ (Bernardini et al. 2006). This reflects the fact that, although we are aware the web is not a corpus in a conventional sense, it may be the ‘next best thing’ when no other suitable corpora are available and researchers lack the necessary expertise to build them. One final point to be made in this section is that there has been a growth in recent years in research outside the field of corpus linguistics which uses web data and standard web search tools to answer what are essentially linguistic questions. So-called ‘culturomics’ (Michel et al. 2011) is the process of mining large digital collections to find words and phrases likely to represent important cultural phenomena, for example the fact that the word slavery “peaked during the civil war



(early 1860s) and then again during the civil rights movement (1955-1968)” (Michel et al. 2011:177). The corpus surrogate of choice in this field is often the Google Books archive, which presents a number of problems both in terms of accuracy of digitisation and flexibility of the search interface. A related concept is the rather ill-defined ‘big data’, which has also become a buzz phrase in recent years but which does not appear to offer anything new to corpus linguists.
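Before leaving the ‘web as corpus’ approach, the case-sensitive Key Word in Context (KWIC) presentation that tools such as WebCorp Live layer on top of raw page text can be illustrated with a minimal routine. This is a sketch of the general technique only, not WebCorp code; the sample text, span width and search phrase are invented for the example.

```python
import re

def kwic(text, phrase, span=30):
    """Case-sensitive KWIC lines: (left context, match, right context)."""
    results = []
    for m in re.finditer(re.escape(phrase), text):
        left = text[max(0, m.start() - span):m.start()]
        right = text[m.end():m.end() + span]
        results.append((left.strip(), m.group(), right.strip()))
    return results

sample = ("It was a genius idea to search the web. "
          "Genius Idea Studios is a company, but a genius idea it is not.")
for left, match, right in kwic(sample, "genius idea"):
    print(f"{left:>35} | {match} | {right}")
```

Unlike a raw Google search, the match here is case-sensitive, so the capitalised proper noun ‘Genius Idea Studios’ is excluded from the lower-case query.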

15.2.2 Web for Corpus

The second major strand of web-based linguistic study has seen researchers attempt to overcome the limitations described above. This web for corpus approach has also been referred to as ‘web as corpus shop’ (Bernardini et al. 2006), meaning that the web becomes a store from which particular texts can be downloaded and used to build large off-line corpora. The key advantage over the ‘web as corpus surrogate’ approach is that we have control of what goes into our corpora, making it possible to design a corpus that represents something meaningful. Importantly, we also know exactly how large our corpus is – making quantitative analyses possible – and we can carry out more advanced linguistic processing, such as part-of-speech tagging. Mair (2012) refers to this as a shift from ‘opportunistic’ to ‘systematic’ use of the web as a corpus. However, although the web for corpus approach is considerably more systematic than the web as corpus surrogate approach described above, it is not fully comparable with the conventional corpus compilation process (see Chap. 1). The key issue here is one of representativeness, which ‘means that the study of a corpus (or combination of corpora) can stand proxy for the study of some entire language or variety of a language’ (Leech 2007:135). As Leech points out, this ‘holy grail’ was achievable in conventional corpora because their compilers had a clear idea of the total population, or ‘textual universe’ (Leech 2007:135), from which they were sampling. For instance, the compilers of the Brown corpus (Kucera and Nelson Francis 1967) used library catalogues to sample texts published in the US in 1961. When the textual universe is the web, things are not quite so straightforward. In fact, early web navigational aids such as the Yahoo! Directory attempted to impose library-style hierarchical classification systems on web texts, with human editors employed to curate content by subject.
However, this approach was not scalable to the increasingly vast web and such directories have now largely been replaced by keyword-based search engines like Google. This means we are still largely dependent on such search engines as gatekeepers to the textual content of the web. However, the ‘web for corpus’ approach utilises search engines in a different way, using them at the initial stage of corpus building only. This approach was popularised by Baroni and Bernardini (2004) with their BootCaT tool but it has been used by several other researchers since. BootCaT is available as a user-friendly but rather limited front-end or as a series of command line scripts for more advanced users. We will describe the front-end



here to introduce the general principles. The first step in the process is to supply a list of ‘seed’ words from which the corpus will be grown by the software. The type of corpus required will determine what seeds should be chosen. The BootCaT manual gives a simple example for the building of a domain-specific corpus on dogs, with the seeds dog, Fido, food hygiene, leash, breeds, and pet. The software takes these seeds and combines them into tuples with a length of three, e.g. Fido leash dog. All possible unique tuples are generated from the seeds supplied and each in turn is then sent to Bing using its API (previously Google). The assumption is that it is possible to build a corpus covering a particular domain (in this case dogs) by using a commercial search engine to find web pages containing words likely to occur in that domain. As an initial step, BootCaT fetches 10 hits from Bing for each tuple then downloads and processes the corresponding web pages to build a corpus in the form of a text file. Although this example is rather basic, the same underlying principle has been used to build much larger reference corpora, by the BootCaT team and by other researchers. Sharoff (2006a, b) built ‘BNC-like’ 100-million-word corpora of English, Chinese, German, Romanian, Ukrainian and Russian from the web. Later, Baroni et al. (2009) built corpora of 1 billion words each for English, German and Italian, while Kilgarriff et al. (2010) built corpora of up to 100 million words each for Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. More recently, the COW project (Schäfer and Bildhauer 2012) has built ‘gigaword’ corpora using a similar approach. The web has been used to construct corpora of World Englishes too, the best known example being the 1.9 billion word Corpus of Global Web-based English (GloWbE) which includes sub-corpora from 20 countries (Davies and Fuchs 2015; see Representative Corpus 1).
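The tuple-generation step described above can be sketched in a few lines. This is an illustration of the principle only, not BootCaT’s own code; the seed list is the one from the BootCaT manual example quoted above.

```python
from itertools import combinations

# The six seeds from the BootCaT manual's dogs example
seeds = ["dog", "Fido", "food hygiene", "leash", "breeds", "pet"]

# Every unique three-seed combination becomes one search-engine query
tuples = list(combinations(seeds, 3))
queries = [" ".join(t) for t in tuples]

print(len(queries))  # C(6, 3) = 20 queries
print(queries[0])
```

With 10 hits fetched per query, even this toy example yields up to 200 candidate pages for the corpus.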
When building general reference corpora from the web we need to choose seed words that are likely to appear across a wide range of topics, but there has been some debate about exactly what constitute good seeds for a corpus of this type. Sharoff (2006a) used a word frequency list from the BNC, relying on random combinations of four non-grammatical words (e.g. work room hand possible) to ensure that search engine matches were not overly biased towards particular topics. Ciaramita & Baroni (2006:153) experimented with sets of words from the Brown corpus, concluding that the best seeds are ‘words that are neither too frequent nor too rare’. Kilgarriff et al. (2010) are a little more specific, ignoring the top 1000 high frequency words in the language and using the next 5000 as seeds. Whatever seeds are chosen, this is only the first step in the process of building a corpus from the web. In all but the most basic examples, it is likely that the researcher will want to expand the corpus beyond the initial set of seeds. There are two main ways of achieving this: (i) through the addition of further seeds, or (ii) by using a web crawler. For the first option, the more advanced BootCaT command line scripts add another stage where further seeds are extracted by comparing the initial corpus with an existing more general corpus, e.g. the basic dogs corpus may include key words such as canine, labrador, and barks which can be combined and sent to the search engine to extract further hits. Hence, the full BootCaT approach is one of ‘bootstrapping’ or iterative refinement.
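The frequency-band heuristic of Kilgarriff et al. (skip the top 1000 most frequent words, take the next 5000 as seeds) is straightforward to express given a frequency-ranked word list; the toy ranked list below stands in for a real one derived from a reference corpus such as the BNC.

```python
def select_seeds(ranked_words, skip=1000, take=5000):
    """Drop the `skip` most frequent words from a list ranked by
    descending corpus frequency; return the next `take` as seeds."""
    return ranked_words[skip:skip + take]

# Toy frequency-ranked list standing in for a real frequency list
ranked = [f"word{i}" for i in range(10000)]
seeds = select_seeds(ranked)
print(len(seeds), seeds[0], seeds[-1])  # 5000 word1000 word5999
```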



This approach is suitable for the building of domain-specific corpora but the second approach, the use of a web crawler, is more appropriate for the building of general reference corpora. In simple terms, crawlers (sometimes known as spiders) start with an initial list of URLs, download the corresponding documents, extract hyperlinks from these documents, then add these to the list of URLs to be crawled. In theory this process could run indefinitely but crawls run for corpus-building purposes tend to be restricted to a fixed time period. The most popular crawler in web for corpus research is the open-source Heritrix system (used by Kehoe and Gee 2007 amongst others). However, there are alternatives available, including the command line tool Wget (used by Kilgarriff et al. 2010) and HTTrack (used by Biber et al. 2015). Another option is SpiderLing (Suchomel and Pomikálek 2012), a crawler designed specifically for the building of linguistic corpora. Whichever tool is chosen, it is important to crawl the web in a responsible manner, observing the robots exclusion standard. This allows website owners to specify (usually in a file called ‘robots.txt’) which parts of a site should not be crawled. It is also important that crawling takes place as slowly as the name suggests. A crawler should not be configured to download multiple pages from the same website simultaneously or in quick succession as this may have an impact on access speeds for standard users of the site. When the crawl is eventually complete, several other steps are usually carried out to ‘clean up’ the downloaded web documents before they are added to a corpus. The user-friendly BootCaT front-end carries out some of these tasks automatically but most researchers opt to use dedicated tools for these individual tasks to retain more control over the process.
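In outline, the crawl loop just described (download, extract links, enqueue, all while obeying robots.txt and pausing between requests) can be sketched with the Python standard library alone. This is a toy illustration, not a substitute for Heritrix or SpiderLing: the `fetch` function is supplied by the caller (e.g. a wrapper around urllib.request), and the delay and page limit are arbitrary values chosen for the example.

```python
import time
import urllib.parse
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urllib.parse.urljoin(self.base_url, value))

def crawl(start_url, fetch, max_pages=100, delay=2.0):
    """Breadth-first crawl: download a page, harvest its links, enqueue them."""
    robots = urllib.robotparser.RobotFileParser(
        urllib.parse.urljoin(start_url, "/robots.txt"))
    robots.read()  # raises if robots.txt is unreachable; a real crawler handles this
    queue, seen, pages = deque([start_url]), set(), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen or not robots.can_fetch("*", url):
            continue  # skip already-seen and robots-disallowed URLs
        seen.add(url)
        pages[url] = html = fetch(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        queue.extend(extractor.links)
        time.sleep(delay)  # crawl slowly: one request at a time, well spaced
    return pages
```

The single queue and the fixed delay enforce the one-request-at-a-time politeness the text recommends; production crawlers add per-host scheduling, error handling and URL canonicalisation on top of this basic loop.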
There are a number of options available for each task in the corpus building ‘pipeline’, as outlined below:

Boilerplate Removal
The term ‘boilerplate’ is commonly used to refer to features such as navigation menus, copyright notices and advertising banners which do not form part of the main textual content of a web page. As such features are often repeated verbatim across multiple pages on a single website, it is desirable to remove them to prevent the words they contain from becoming unnaturally frequent in the resulting corpus. The boilerplate removal process usually goes hand in hand with the stripping of HTML markup code from web documents as HTML can offer clues about what is boilerplate and what is the main textual content. Several web for corpus projects (e.g. Ciaramita and Baroni 2006; Kehoe and Gee 2007) have used or adapted the Body Text Extraction (BTE) algorithm for boilerplate removal. An increasingly popular alternative is the linguistically-aware jusText tool, and a further option is texrex, which performs boilerplate removal as well as several other text ‘clean up’ tasks. It is also possible to remove only HTML code without considering boilerplate by using open-source tools such as html2text. Similar tools are available for the extraction of text from PDF and Word documents, formats which are far less likely to contain boilerplate. It is also worth mentioning Apache Tika here: a library designed to extract text from a wide variety of document formats.

Document Filtering
After HTML code and boilerplate have been removed, it is often useful to apply additional filters to the downloaded documents. The exact

choice of filters is dependent on the intended nature of the corpus and research aims, but size and language filters are amongst the most common. Size filters are designed to remove very short and very long documents from the corpus. The assumption is that very short documents are unlikely to contain much connected prose, while very long documents are likely to skew the corpus unnaturally. Cavaglia and Kilgarriff (2001) and Ide et al. (2002) specify the minimum acceptable size of a web document as 2000 tokens. This may be overly strict, however, as Kehoe and Gee (2007) found that less than 4% of HTML documents in a large web crawl met the 2000 token threshold. Ciaramita and Baroni (2006) apply an additional filter, which states that for a text to be included at least 25% of its tokens must come from the top 200 most frequent tokens in the Brown corpus. This is intended partly as a spam filter and partly as a way of ensuring that all texts in their corpus are in English. Separate language detection libraries are also available in many programming languages.

Duplicate Removal
The nature of the web means that not all documents are unique. In some cases, multiple URLs will point to the same document, while in others the same document will appear on multiple websites (mirror sites and archives). The latter also happens with content released by news agencies, which is reused on many sites worldwide. As a result, any web-derived corpus is likely to include documents that are either exact duplicates of one another or are very similar. Like boilerplate, such documents can skew word frequency counts so it is desirable to remove them from the corpus. It is possible to do this by writing software that produces a ‘fingerprint’ (or hash) for each document and then compares these to find similarities. Software libraries are available in several programming languages to help with this, including the Text::DeDuper module in Perl.
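A minimal version of the fingerprinting idea just mentioned hashes overlapping word n-grams (‘shingles’) and compares documents by Jaccard overlap. The shingle length, similarity threshold and sample texts below are arbitrary choices for illustration; dedicated tools implement considerably more refined versions of the same idea.

```python
import hashlib

def shingles(text, n=5):
    """Hashed word n-grams ('shingles') as a document fingerprint."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1)))
    return {hashlib.md5(g.encode()).hexdigest() for g in grams}

def jaccard(a, b):
    """Set overlap: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(doc1, doc2, threshold=0.8):
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold

d1 = "the cat sat on the mat and looked out of the window at the rain"
d2 = "the cat sat on the mat and looked out of the window at the snow"
d3 = "corpus linguistics makes extensive use of web data these days"
print(near_duplicates(d1, d2))  # True: only the final word differs
print(near_duplicates(d1, d3))  # False: unrelated texts
```

In practice the threshold answers one of the calibration questions raised below: how similar two documents must be before one of them is discarded.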
An easier alternative is to use an ‘off-the-shelf’ tool for the automatic detection of duplicate and near-duplicate texts, such as Onion (ONe Instance ONly). After these steps have been performed, the corpus can then be tokenised, lemmatised and part-of-speech tagged using the standard approaches described in Chap. 2. One thing to bear in mind when dealing with web data is that it can be rather ‘noisy’. For instance, it may lack punctuation or whitespace and there may be a higher proportion of non-standard spellings than in conventional texts. These factors may have an impact on the accuracy of corpus annotation since the linguistic models underpinning off-the-shelf annotation tools are usually derived from standard written language. For example, most popular part-of-speech taggers are trained on newspaper text, a register which tends to follow fairly strict style guides and contain few errors. Giesbrecht and Evert (2009) offer a useful case study on this topic, examining the effectiveness of five part-of-speech taggers on a web-derived corpus of German. They find that tagging accuracy drops from the advertised 97% to below 93% when applied to web data. The accuracy of such taggers can be improved through additional text ‘clean-up’ or the modification of tagging rules, but problems encountered in applying other annotation methods (e.g. dependency parsing) to web data are more difficult to resolve and few researchers have attempted to do so.
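A trivial example of the kind of pre-tagging text ‘clean-up’ mentioned above is rule-based normalisation of missing whitespace and exaggerated character runs. The two rules below are invented for illustration and are far cruder than anything a published pipeline would use.

```python
import re

def normalise(text):
    """Toy clean-up rules of the kind applied to noisy web text before tagging."""
    # Reinstate missing whitespace after sentence-final punctuation
    text = re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
    # Cap runs of a repeated character at three (e.g. "sooooooo" -> "sooo")
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    return text

print(normalise("This is sooooooo cool.Next sentence."))
# This is sooo cool. Next sentence.
```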



It is important to stress that the decisions made at each stage of the web corpus building process will have a significant impact on the resulting corpus, in terms of size but also in terms of composition. Activities such as boilerplate stripping, deduplication and additional filtering can remove a considerable proportion of the documents retrieved through web crawling. In an extreme case, Schäfer and Bildhauer (2013) found that post-processing eliminated over 94% of the web documents they had downloaded (with over 70% of these being very short documents). This is worth bearing in mind when aiming for a specific corpus size as it may be necessary to download significantly more documents than initially expected. It is also important to monitor all stages of the pipeline carefully to ensure that each annotation and filtering tool is having the intended effect. Relevant questions to consider are likely to include what exactly counts as boilerplate, how similar two documents must be to be considered (near) duplicates, and whether testing for duplicates should occur at sentence, paragraph or whole document level. The approach to crawling described so far is one designed to maximise the number of documents downloaded and, thus, maximise the size of the final corpus. In recent years most compilers of web corpora have worked on the assumption that maximising corpus size is likely to result in a more representative corpus, even if we have no reliable way of measuring this (see Representative Study 2 for a discussion of the related topic of balance in web corpora). Some researchers have gone further still, dismissing the notion that representativeness is achievable or important in web corpora. Schäfer & Bildhauer (2012:486) adopt ‘a more or less agnostic position towards the question of representativeness’, while Kilgarriff & Grefenstette (2003:11) argue that the web ‘is not representative of anything else. But nor are other corpora, in any well understood sense’. 
An alternative approach to large indiscriminate crawls is to focus on specific websites, thus limiting the size of the textual universe and increasing the chances of building a representative corpus. For example, some newspapers make archives of articles from previous years available online. These can be downloaded by pointing the crawler to the homepage and instructing it to follow links within the site only, or through the use of an API if the newspaper makes one available. An advantage of this focussed (or ‘scoped’) approach to crawling is that there is unlikely to be much spam or duplication of content and all documents are likely to be in a similar format, making boilerplate removal more straightforward. With news texts it is also much easier to determine publication date: something which can be very difficult to do on the web in general (see Representative Corpus 1 for information on the NOW corpus). Blogs offer similar advantages when it comes to boilerplate removal and text dating (see Representative Corpus 2 for more information on the Birmingham Blog Corpus) and are one variety of so-called ‘Web 2.0’ content, which is being used increasingly in linguistic research. Blogs are similar to conventional web texts in terms of layout and in the fact that they can be located using commercial search engines. This is not true of all Web 2.0 texts though, and corpus linguists have had to devise new methods to access the language of social media sites. There has always been a so-called ‘deep web’ (Bergman 2001) of information beyond the reach of standard search engines in password-protected archives and databases. However,

15 Web Corpora


this hidden world below the surface of the web has increased in size substantially in recent years with the growth of social media sites such as Facebook and Twitter which are not fully indexed by Google. In general, it is very difficult to download textual content from Facebook as it is necessary to be logged in and ‘friends’ with the person who published the content in order to access it. Facebook Pages dedicated to specific topics and interests are the exception as these can be accessed without logging in (though Facebook’s terms and conditions should always be observed). Twitter is easier to access for crawling purposes and it is now being used more widely in linguistic studies as a result. Early studies tended to download tweets manually but there are now a number of automated tools available. Options include the FireAnt package designed specifically for corpus linguists, as well as general purpose software such as IFTTT, Import.io, and TAGS. The last of these, standing for Twitter Archiving Google Sheet, automatically downloads tweets matching particular search parameters into a spreadsheet which can subsequently be used to build a corpus. Whether existing corpus methods are entirely suitable for the analysis of texts of 280 characters or fewer is a separate question requiring further research, but it is certainly possible to build Twitter corpora and conduct interesting linguistic analyses of them (e.g. Page 2014; Huang et al. 2016). Building a large-scale, publicly-accessible Twitter corpus is a bigger challenge, and previous attempts such as the Edinburgh Twitter Corpus (Petrović et al. 2010) have met with opposition from Twitter itself. As can be seen from the above discussion, the ‘web for corpus’ approach is rather more complex than the previously described ‘web as corpus surrogate’ approach.
The building of bespoke corpora from the web usually requires considerable technical skill and, thus, may not be an option for all corpus linguists, especially beginners. Fortunately, there are other options. Several teams of researchers have already built large web-derived corpora and made them available to download or search online (see Sect. 15.4). There is also the commercial Sketch Engine tool, which is now available free of charge to academic users within the EU (2018–2022). Sketch Engine includes a range of pre-loaded corpora, including the TenTen web corpora, and makes these available through a novel search interface. In addition, Sketch Engine provides user-friendly tools that allow linguists to build their own web-derived corpora (based on the BootCaT technology discussed in Sect. 15.2.2).

A. Kehoe

Representative Study 1
Resnik, P., and Smith, N.A. 2003. The Web as a Parallel Corpus. Computational Linguistics 29(3):349–380.
Until relatively recently, corpus-based studies tended to focus almost exclusively on the English language, and on British and American English in particular. The web offers new possibilities for the study of World Englishes and of other languages for which no BNC-style reference corpora are available. One specific area where the web has had a transformational impact is the building of parallel corpora for use in multilingual natural language processing. Work by Resnik & Smith on the STRAND system was pioneering in this field. Parallel corpora are pairs of corpora in two different languages where the texts in one are translations of those in the other (also called bitexts; cf. Chap. 12). STRAND was designed to mine the web to find candidate texts automatically, ‘based on the insight that translated Web pages tend quite strongly to exhibit parallel structure, permitting them to be identified even without looking at content’ (Resnik and Smith 2003:350). The original version of STRAND relied on the AltaVista search engine but, mirroring the wider shift from ‘web as corpus’ to ‘web for corpus’ in the field, later versions added a bespoke web crawler. One technique used successfully to find bitexts was to search for words such as ‘english’, ‘french’, ‘anglais’ and ‘français’ in anchor text: the clickable text within hyperlinks. Another was to compare URLs in the Internet Archive,7 e.g. matching a URL containing ‘/english/’ with an otherwise identical one containing ‘/arabic/’. Resnik & Smith used bilingual speakers to assess the quality of the parallel corpora extracted by STRAND, reporting 100% precision and 68.6% recall for English-French web pages. More recent research has built on the foundations established by the STRAND project. For example, San Vicente and Manterola (2012) used anchor text, URL matching and HTML structure to build Basque-English, Spanish-English and Portuguese-English parallel corpora. Interestingly, they chose not to remove boilerplate as they found that navigation menus contain useful parallel information.
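The URL-matching heuristic can be illustrated in a few lines of code. This is a simplified sketch of the general idea, not STRAND's actual implementation; the language-marker list and the example URLs are illustrative.

```python
# Language markers to look for in URL paths (an illustrative, non-exhaustive list)
LANG_MARKERS = ["english", "french", "arabic", "anglais", "francais"]

def candidate_bitext_pairs(urls):
    """Pair URLs that are identical except for a language segment,
    e.g. .../english/about.html with .../arabic/about.html."""
    templates = {}  # language-neutral template -> first URL seen with it
    pairs = []
    for url in urls:
        for lang in LANG_MARKERS:
            if f"/{lang}/" in url:
                # Replace the language segment with a placeholder so that
                # parallel pages collapse onto the same key
                key = url.replace(f"/{lang}/", "/<LANG>/")
                if key in templates and templates[key] != url:
                    pairs.append((templates[key], url))
                else:
                    templates[key] = url
                break
    return pairs

urls = [
    "http://example.org/english/about.html",
    "http://example.org/arabic/about.html",
    "http://example.org/english/contact.html",
]
print(candidate_bitext_pairs(urls))
# → [('http://example.org/english/about.html', 'http://example.org/arabic/about.html')]
```

Candidate pairs found this way would still need content-based verification, which is exactly the role Resnik & Smith's structural comparison and bilingual assessors played.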

7 http://archive.org/. Accessed 22 May 2019.

15 Web Corpora

Representative Study 2
Biber, D., Egbert, J., and Davies, M. 2015. Exploring the composition of the searchable web: a corpus-based taxonomy of web registers. Corpora 10(1):11–45.
The exact composition of the searchable web is something we know surprisingly little about as a research community, making it difficult to assess the representativeness of our web-derived corpora. Related to this is the notion of a balanced corpus: one where ‘the size of its subcorpora (representing particular genres or registers) is proportional to the relative frequency of occurrence of those genres in the language’s textual universe as a whole’ (Leech 2007:136; cf. also Chap. 1). The major challenge here is that the categorisation of documents by register in a web-scale collection must be done automatically, yet this is not possible unless we first have a list of the possible categories. Biber et al. set out to develop such a list by extracting a random sample of 48,571 documents from GloWbE (see Representative Corpus 1) and using volunteers to classify them. Instead of taking a pre-defined set of registers, the project asked volunteers to identify ‘situational characteristics’, e.g. whether a written text was interactive or non-interactive, and whether non-interactive texts were designed to narrate events, explain information, express opinion, etc. The key finding was that the largest share (31.2%) of web documents belongs to the Narrative register, with a further 29% classified as ‘Hybrid’ (Narration+Information, Narration+Opinion, etc.). Within the Narrative register, over half the documents were classified as ‘News report/blog’. Some researchers have criticised the approach adopted by Biber et al., pointing out that they remain heavily reliant on Google as gatekeeper to the web because the documents analysed come from a corpus seeded by Google queries (Schäfer 2016). Despite this limitation, however, Biber et al. are still able to offer important new insights on web content, concluding that ‘the most common registers found on the web are not those typically analysed in corpora of published written texts’ and ‘although the most widely analysed registers from published writing can be found on the web, they are typically rare in comparison to other web registers’ (Biber et al. 2015:26). The output from their analysis is available in the form of the CORE corpus (see Representative Corpus 1).

Representative Corpus 1
BYU corpora: COCA, GloWbE, CORE and NOW
The Corpus of Contemporary American English (COCA), Corpus of Global Web-based English (GloWbE), Corpus of Online Registers of English (CORE), and News On the Web (NOW) corpus are four in a series of corpora released by Mark Davies. COCA contains 20 million words of texts each year since 1990, split evenly between five genres: spoken, fiction, popular magazines, newspapers, and academic journals. It will be noted that these genres are very similar to those found in pre-web corpora such as the BNC, and therein lies one of the limitations of COCA. Although all texts were downloaded from the web using the ‘web for corpus’ approach, COCA was designed in such a way that it excludes web-native genres like blogs and social media. COCA is a web corpus in a very loose sense only. It is intended as a monitor corpus, yet it does not contain examples of the latest trends in language use, which tend to be found in blogs and other less formal text types. This issue is addressed in the 1.9 billion word GloWbE and 50 million word CORE corpora. GloWbE was constructed using ‘web for corpus’ techniques, seeded through search engine queries. As explained in Davies and Fuchs (2015), the most common 3-grams were extracted from COCA (‘is not the’, etc.) and submitted to Google, with 80–100 hits downloaded for each 3-gram. Unlike in COCA, the texts in GloWbE are not restricted to American English and come from 20 different English-speaking countries. Around 60% of them are from blogs which, according to Davies and Fuchs (2015), makes GloWbE comparable with corpora in the International Corpus of English (ICE) family (which have a 60/40% split between speech and writing). Of course, GloWbE is much less carefully designed than the ICE corpora, and it remains to be determined whether blogs are truly comparable with speech, but GloWbE certainly offers a size advantage over the ICE corpora. The CORE corpus was derived from GloWbE as part of the project undertaken by Biber et al. (2015) discussed in Representative Study 2. Meanwhile, the NOW corpus has recency as its main focus, containing over 6 billion words of newspaper text from 2010 to the present, with around 10,000 new articles added every day. This is thus a good example of the kind of focused crawling discussed in Sect. 15.2.2.
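The first step of this seeding procedure, extracting high-frequency 3-grams from an existing corpus to use as search-engine queries, can be sketched with the Python standard library. This is a minimal illustration of the idea, not the actual GloWbE pipeline.

```python
import re
from collections import Counter

def top_ngrams(text, n=3, k=10):
    """Return the k most frequent word n-grams in a text, as query strings,
    in the spirit of the seeding step described by Davies and Fuchs (2015)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(k)]

sample = "it is not the end because it is not the start"
print(top_ngrams(sample, n=3, k=2))  # → ['it is not', 'is not the']
```

In the real procedure each of these query strings would then be submitted to a search engine and the top 80–100 result pages downloaded and cleaned.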

Representative Corpus 2
Birmingham Blog Corpus
The Birmingham Blog Corpus (BBC) is a freely-searchable 630 million word collection downloaded from various blog hosting sites (Kehoe and Gee 2012). Of particular interest is the sub-corpus from WordPress and Blogger, which includes 95 million words of blog posts and a further 86 million words of reader comments on those posts. It was possible to separate posts and comments in this way because, although blogs on WordPress and Blogger are written by a wide range of people and cover many different topics, they make use of a small number of pre-defined templates. It was therefore relatively easy to identify the post and each individual comment during the crawling process, without the need for complex boilerplate removal techniques. The crawling for the WordPress and Blogger sub-corpus of the BBC took place without relying on commercial search engines. Instead, lists of popular blogs were taken from the hosting sites themselves and used as the initial set of blogs to crawl. When each post on each blog was processed, links to other WordPress and Blogger blogs were extracted and added to the crawling list, thus widening the scope of the corpus. Linguistic research based on the BBC has demonstrated the value of blog data in pragmatic analyses of online interaction (e.g. Lutzky and Kehoe 2017 on apologies). Such research has also identified a shift in use of the blog format away from its original ‘online diary’ focus, meaning that it is now possible to capture a wide variety of topics and communicative intentions in a corpus made up exclusively of blogs.
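The crawl-widening procedure described above amounts to a breadth-first traversal restricted to a whitelist of hosts. The sketch below keeps the page-fetching step abstract so it runs offline; the host names and the `fetch_links` function are illustrative, and a real crawler would also need politeness delays and robots.txt checks.

```python
from collections import deque
from urllib.parse import urlparse

# In-scope hosting platforms (illustrative, mirroring the BBC's restriction
# to two blog-hosting platforms)
IN_SCOPE_HOSTS = ("wordpress.com", "blogspot.com")

def in_scope(url):
    host = urlparse(url).netloc.lower()
    return any(host == h or host.endswith("." + h) for h in IN_SCOPE_HOSTS)

def focused_crawl(seeds, fetch_links, limit=1000):
    """Breadth-first focused crawl: start from seed blogs and widen the
    frontier with in-scope links found on each page. `fetch_links(url)`
    must return the outgoing links of a page (here supplied by the caller
    so the sketch stays runnable offline)."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if in_scope(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for real page downloads
graph = {
    "http://a.wordpress.com/": ["http://b.blogspot.com/", "http://news.example.com/"],
    "http://b.blogspot.com/": ["http://a.wordpress.com/"],
}
print(focused_crawl(["http://a.wordpress.com/"], lambda u: graph.get(u, [])))
# → ['http://a.wordpress.com/', 'http://b.blogspot.com/']
```

Because out-of-scope links (here the news site) are discarded at the frontier, the crawl stays within the chosen platforms without ever consulting a search engine.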

15.3 Critical Assessment and Future Directions

In this chapter we have examined a wide range of approaches which use the web as a linguistic resource, ranging from basic Google searches to large-scale crawling of web content. The web may not be a corpus in a conventional sense but, as we have seen, it can be a valuable corpus surrogate or, increasingly, a source of texts for the building of corpora. However, in their attempts to build large web-derived corpora, researchers continue to be hampered by a reliance on commercial search engines and a lack of detailed knowledge of the web’s structure and content. With it becoming increasingly difficult to use search engines for linguistic research, both directly and as a way of seeding crawls, we must devise new ways of accessing web content. The scoped crawling approach is one solution, allowing us to focus our attention on specific websites without relying on search engines as gatekeepers to the web. Lists of popular sites such as the Alexa web rankings may be a good starting point. Social media, particularly Twitter, is another increasingly useful source of textual data. As we have moved towards the ‘web for corpus’ approach, a new challenge has emerged concerning the legality of distributing corpora crawled from the web. The solutions adopted by corpus compilers vary, from limiting the context shown through the search interface (Mark Davies’ corpora) to releasing corpora for download with the sentences shuffled into a random order (COW corpora). This is still very much a grey area, with different laws applying across the world, and anyone planning to distribute a web-derived corpus should seek local advice. Within
the European Union, the General Data Protection Regulation (GDPR)8 introduced in 2018 was designed to standardise laws and give citizens control of their personal data. At the time of writing, the implications of GDPR for web crawling in general and for web corpus building in particular are not completely clear. It remains good practice to configure web crawlers so that they identify themselves clearly and crawl responsibly, obeying website terms of service. It is also good practice to carry out an assessment of all data being stored – in this case the corpus – to ensure that the chances of an individual person being identified are minimised. In the longer term, the ideal solution for web corpus research would be a search engine designed specifically for linguistic study: a Google-scale resource providing the search options and post-processing tools we require to extract examples of language in use from the web. There have been attempts in the past but most have proved to be unsustainable as academic funding models do not allow the continuous expansion of disk storage space required for regular web crawling or the bandwidth required to allow multiple users to carry out complex searches on a large corpus simultaneously. With increasing interest in the use of web data in linguistic research and deepening knowledge of web content (e.g. Biber et al. 2015; see also Biber and Egbert 2016, 2018 for multidimensional analyses of web registers), now may be an appropriate time for researchers to pool resources and harness the full potential of the web as a linguistic resource.
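The good-practice points above, a crawler that identifies itself and respects site owners' wishes, can be sketched with the Python standard library's robots.txt parser. The User-Agent string and the rules are illustrative; a real crawler would download robots.txt from each host rather than receive it as text.

```python
from urllib.robotparser import RobotFileParser

# A descriptive User-Agent lets site owners identify and contact the crawler
# (the name and URL here are placeholders, not a real crawler)
USER_AGENT = "ExampleCorpusBot/1.0 (+http://example.org/corpusbot)"

def allowed(robots_txt, url):
    """Check a URL against robots.txt rules before fetching. The rules are
    passed in as text so the sketch runs offline; in practice they would
    be downloaded from http://<host>/robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "http://example.org/blog/post1.html"))    # → True
print(allowed(rules, "http://example.org/private/draft.html")) # → False
```

robots.txt compliance does not by itself satisfy a site's terms of service or data-protection law, but it is the minimum technical courtesy a corpus-building crawler should implement.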

15.4 Tools and Resources

15.4.1 Web Corpora

Aranea – searchable and downloadable web-derived corpora in English, French, German, and a range of other languages: http://ucts.uniba.sk/aranea_about/. Accessed 21 May 2019.
Birmingham Blog Corpus – searchable 630 million word corpus of blog posts and reader comments downloaded from popular hosting sites: http://wse1.webcorp.org.uk/home/blogs.html. Accessed 22 May 2019.
COCA – over 500 million words of American English texts published since 1990, downloaded from and searchable via the web: http://corpus.byu.edu/coca/. Accessed 21 May 2019.
CORE – searchable 50 million word sub-set of GloWbE, categorised by register: http://corpus.byu.edu/core/. Accessed 21 May 2019.
COW – searchable and downloadable web-derived corpora in English, French, German, Spanish, Dutch and Swedish (shuffled sentences only, not full texts): http://corporafromtheweb.org/. Accessed 21 May 2019.

8 https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679. Accessed 22 May 2019.



GloWbE – searchable 1.9 billion word corpus crawled from 20 English-speaking countries: http://corpus.byu.edu/glowbe/. Accessed 21 May 2019.
Leipzig Collection – searchable web-derived corpora in English, French, German, Arabic and Russian: http://corpora.informatik.uni-leipzig.de/. Accessed 21 May 2019.
NOW – searchable corpus of newspaper text from 2010 to the present, with around 10,000 articles added daily: http://corpus.byu.edu/now/. Accessed 21 May 2019.
TenTen – a collection of web-derived corpora available through Sketch Engine covering over 30 languages, each containing at least 10 billion words: http://www.sketchengine.eu/documentation/tenten-corpora/. Accessed 21 May 2019.
WaCKy – downloadable web-derived corpora in English, French, German and Italian: http://wacky.sslmit.unibo.it/doku.php?id=corpora. Accessed 21 May 2019.
WebCorp Live – searches the web in real-time for words or phrases, adding refinement options to standard search engines; can also produce a word list and n-grams for any web text: http://www.webcorp.org.uk/. Accessed 21 May 2019.

15.4.2 Crawling and Text Processing

Alexa Web Rankings – helpful in locating popular websites as initial seeds for a crawl: http://www.alexa.com/topsites. Accessed 21 May 2019.
Apache Tika – extracts text and detects metadata from files: http://tika.apache.org/. Accessed 21 May 2019.
Body Text Extraction (BTE) – boilerplate removal tool taking into account document structure: http://www.aidanf.net/posts/bte-body-text-extraction. Accessed 21 May 2019.
BootCaT – pipeline of tools for seeding and building web corpora: http://bootcat.dipintra.it/. Accessed 21 May 2019.
FireAnt – freeware designed to build corpora from Twitter, with built-in visualisation tools (time-series, geo-location): http://www.laurenceanthony.net/software/fireant/. Accessed 21 May 2019.
Heritrix – large-scale web crawler: https://github.com/internetarchive/heritrix3/wiki. Accessed 21 May 2019.
HTTrack – downloads a single website to a local computer: https://www.httrack.com/. Accessed 21 May 2019.
IFTTT – allows use of simple conditional statements to download data from popular websites: https://ifttt.com/. Accessed 21 May 2019.
Import.io – extracts, processes and visualises data from the web: https://www.import.io/. Accessed 21 May 2019.
jusText – boilerplate removal tool: http://corpus.tools/wiki/Justext. Accessed 21 May 2019.



ONe Instance ONly (Onion) – duplicate text remover: http://corpus.tools/wiki/Onion. Accessed 21 May 2019.
Sketch Engine – a corpus search infrastructure containing a range of pre-loaded corpora (including TenTen). Includes tools for building web corpora (based on BootCaT). The open-source version NoSketch Engine has a more restricted search interface and no pre-loaded corpora: http://www.sketchengine.eu/nosketch-engine/. Accessed 21 May 2019.
TAGS – Google Sheets template for automated collection of search results from Twitter: http://tags.hawksey.info/. Accessed 21 May 2019.
Texrex – full web document cleaning tool; removes duplicates, detects boilerplate, extracts metadata (part of the COW project): http://github.com/rsling/texrex. Accessed 21 May 2019.
Text::DeDuper – duplicate text remover: https://metacpan.org/pod/Text::DeDuper. Accessed 30 September 2020.
Wget – simple HTTP download tool: https://www.gnu.org/software/wget/. Accessed 21 May 2019.
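Several of the tools above (Onion, Text::DeDuper, Texrex) remove duplicate and near-duplicate documents. A standard idea behind such tools is to compare documents via their overlapping word n-grams ('shingles'); the sketch below shows that general technique, not any particular tool's algorithm.

```python
def shingles(text, n=5):
    """The set of word n-grams ('shingles') in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def resemblance(doc_a, doc_b, n=5):
    """Jaccard overlap between two documents' shingle sets:
    1.0 for identical texts, near 0.0 for unrelated ones."""
    sa, sb = shingles(doc_a, n), shingles(doc_b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

a = "the cat sat on the mat and looked at the dog"
b = "the cat sat on the mat and stared at the dog"
print(round(resemblance(a, a), 2))  # → 1.0
print(resemblance(a, "completely different words entirely here now today ok"))  # → 0.0
```

In a corpus pipeline, documents whose resemblance exceeds a chosen threshold (often somewhere around 0.5–0.9) would be treated as near-duplicates and only one copy retained; production tools add hashing so the comparison scales to millions of documents.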

Further Reading

Schäfer, R., and Bildhauer, F. 2013. Web Corpus Construction. San Rafael: Morgan & Claypool. This book offers a fuller discussion of the technical issues involved in web crawling for linguistic purposes. It includes chapters on web structure, seed word selection and crawling, post-processing, annotation, and corpus evaluation.

Biber, D., and Egbert, J. 2018. Register Variation Online. Cambridge: CUP. Building on the work by Biber et al. (2015) discussed in Representative Study 2, this volume examines the full range of registers found on the searchable web. It explores overall patterns of register variation with a multidimensional analysis and discusses the main lexical, grammatical and situational features of each register, offering important new insights on the language of the web.

Hundt, M., Nesselhauf, N., and Biewer, C. 2007. Corpus Linguistics and the Web. Amsterdam: Rodopi. This was the first book-length publication to bring together key perspectives in web corpus research. It includes the chapter on representativeness and balance cited in Sect. 15.2.2 as well as chapters on both the new possibilities and new challenges presented by the web as/for corpus approaches.



References

Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004 (pp. 1313–1316).
Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources & Evaluation, 43, 209–226.
Bergh, G. (2005). Min(d)ing English language data on the web: What can Google tell us? ICAME Journal, 29, 25–46.
Bergh, G., Seppänen, A., & Trotta, J. (1998). Language corpora and the internet: A joint linguistic resource. In A. Renouf (Ed.), Explorations in corpus linguistics (pp. 41–54). Amsterdam: Rodopi.
Bergman, M. K. (2001). The deep web: Surfacing hidden value. Journal of Electronic Publishing, 7(1).
Bernardini, S., Baroni, M., & Evert, S. (2006). A WaCky introduction. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus (pp. 9–40). Bologna: GEDIT. http://wackybook.sslmit.unibo.it/pdfs/bernardini.pdf. Accessed 21 May 2019.
Biber, D., & Egbert, J. (2016). Register variation on the searchable web: A multi-dimensional analysis. Journal of English Linguistics, 44(2), 95–137.
Biber, D., & Egbert, J. (2018). Register variation online. Cambridge: Cambridge University Press.
Biber, D., Egbert, J., & Davies, M. (2015). Exploring the composition of the searchable web: A corpus-based taxonomy of web registers. Corpora, 10(1), 11–45.
Brekke, M. (2000). From BNC to the cybercorpus: A quantum leap into chaos? In J. Kirk (Ed.), Corpora Galore (pp. 227–247). Amsterdam: Rodopi.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), 107–117.
Cavaglia, G., & Kilgarriff, A. (2001). Corpora from the web (Information Technology Research Institute Technical Report Series, ITRI-01-06). University of Brighton. https://www.kilgarriff.co.uk/Publications/2001-CavagliaKilg-CLUK.pdf. Accessed 21 May 2019.
Ciaramita, M., & Baroni, M. (2006). Measuring web-corpus randomness. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus (pp. 127–158). Bologna: GEDIT. http://wackybook.sslmit.unibo.it/pdfs/ciaramita.pdf. Accessed 21 May 2019.
Davies, M., & Fuchs, R. (2015). Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus. English World-Wide, 36(1), 1–28.
de Schryver, G. (2002). Web for/as corpus: A perspective for the African languages. Nordic Journal of African Studies, 11(2), 266–282.
Fletcher, W. H. (2004). Making the web more useful as a source for linguistic corpora. In U. Connor & T. Upton (Eds.), Applied corpus linguistics: A multidimensional perspective (pp. 191–205). Amsterdam: Rodopi.
Gatto, M. (2014). Web as corpus: Theory and practice. London: Bloomsbury.
Giesbrecht, E., & Evert, S. (2009). Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German web as corpus. In Proceedings of the 5th Web as Corpus Workshop (WAC5). San Sebastián, Spain.
Huang, Y., Guo, D., Kasakoff, A., & Grieve, J. (2016). Understanding U.S. regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59, 244–255.
Hüning, M. (2001). WebCONC. http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi (no longer accessible).
Ide, N., Reppen, R., & Suderman, K. (2002). The American National Corpus: More than the web can provide. In Proceedings of the 3rd Language Resources and Evaluation Conference (LREC) (pp. 839–844). Paris: ELRA.
Kehoe, A. (2006). Diachronic linguistic analysis on the web using WebCorp. In A. Renouf & A. Kehoe (Eds.), The changing face of corpus linguistics (pp. 297–307). Amsterdam: Rodopi.
Kehoe, A., & Gee, M. (2007). New corpora from the web: Making web text more “text-like”. In Towards multimedia in corpus studies. Helsinki: VARIENG. http://www.helsinki.fi/varieng/series/volumes/02/kehoe_gee/. Accessed 21 May 2019.



Kehoe, A., & Gee, M. (2012). Reader comments as an aboutness indicator in online texts: Introducing the Birmingham Blog Corpus. In Aspects of corpus linguistics: Compilation, annotation, analysis. Helsinki: VARIENG. http://www.helsinki.fi/varieng/journal/volumes/12/kehoe_gee/. Accessed 21 May 2019.
Kehoe, A., & Renouf, A. (2002). WebCorp: Applying the web to linguistics and linguistics to the web. In Proceedings of the 11th International World Wide Web Conference. http://web.archive.org/web/20141206025600/http://www2002.org/CDROM/poster/67/. Accessed 21 May 2019.
Keller, F., & Lapata, M. (2003). Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3), 459–484.
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3), 333–347.
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010). A corpus factory for many languages. http://www.sketchengine.co.uk/wp-content/uploads/2015/05/Corpus_Factory_2010.pdf. Accessed 21 May 2019.
Kučera, H., & Nelson Francis, W. (1967). Computational analysis of present-day American English. Providence: Brown University Press.
Leech, G. (2007). New resources, or just better old ones? The holy grail of representativeness. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web (pp. 133–149). Amsterdam: Rodopi.
Lutzky, U., & Kehoe, A. (2017). “I apologise for my poor blogging”: Searching for apologies in the Birmingham Blog Corpus. Corpus Pragmatics, 1(1), 37–56.
Mair, C. (2012). From opportunistic to systematic use of the web as corpus: Do-support with got (to) in contemporary American English. In T. Nevalainen & E. C. Traugott (Eds.), The Oxford handbook of the history of English (pp. 245–255). Oxford: Oxford University Press.
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176–182.
Page, R. (2014). Saying “sorry”: Corporate apologies posted on Twitter. Journal of Pragmatics, 62, 30–45.
Petrović, S., Osborne, M., & Lavrenko, V. (2010). The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media (pp. 25–26).
Pomikálek, J., Rychlý, P., & Kilgarriff, A. (2009). Scaling to billion-plus word corpora. Advances in Computational Linguistics: Special Issue of Research in Computing Science, 41, 3–14.
Rayson, P., Charles, O., & Auty, I. (2012). Can Google count? Estimating search engine result consistency. In Proceedings of the Seventh Web as Corpus Workshop (WAC7) (pp. 23–30). http://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf. Accessed 21 May 2019.
Resnik, P., & Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.
Rundell, M. (2009). Genius and rubbish and other noun-like adjectives. Macmillan Dictionary Blog. http://www.macmillandictionaryblog.com/noun-like-adjectives. Accessed 21 May 2019.
San Vicente, I., & Manterola, I. (2012). PaCo2: A fully automated tool for gathering parallel corpora from the web. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC12). http://aclanthology.info/papers/L12-1085/l12-1085. Accessed 21 May 2019.
Schäfer, R. (2016). On bias-free crawling and representative web corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task (pp. 99–105).
Schäfer, R., & Bildhauer, F. (2012). Building large corpora from the web using a new efficient tool chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC) (pp. 486–493). Istanbul: ELRA.
Schäfer, R., & Bildhauer, F. (2013). Web corpus construction (Vol. 6, pp. 1–145). San Rafael: Morgan & Claypool.



Schmied, J. (2006). New ways of analysing ESL on the WWW with WebCorp and WebPhraseCount. In A. Renouf & A. Kehoe (Eds.), The changing face of corpus linguistics (pp. 309–324). Amsterdam: Rodopi.
Sharoff, S. (2006a). Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus (pp. 63–98). Bologna: GEDIT. http://wackybook.sslmit.unibo.it/pdfs/sharoff.pdf. Accessed 21 May 2019.
Sharoff, S. (2006b). Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), 435–462.
Sinclair, J. (2005). Corpus and text – basic principles, and appendix: How to build a corpus. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice. Oxford: Oxbow Books. http://ota.ox.ac.uk/documents/creating/dlc/. Accessed 21 May 2019.
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the Seventh Web as Corpus Workshop (WAC7) (pp. 39–43). http://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf. Accessed 21 May 2019.
van den Bosch, A., Bogers, T., & de Kunder, M. (2016). Estimating search engine index size variability: A 9-year longitudinal study. Scientometrics, 107, 839–856.

Chapter 16

Multimodal Corpora

Dawn Knight and Svenja Adolphs

Abstract This chapter provides an overview of current advances in multimodal corpus linguistics. It defines what multimodal corpora are, what they can be used for, how and why they are used, and outlines some of the practical and methodological challenges and concerns faced in the construction and analysis of multimodal corpora. Examples of notable corpora are presented, alongside examples of software tools for multimodal corpus development and enquiry, and the chapter ends with reflections on possible future directions for developments within this field.

16.1 Introduction

Developments in technology, particularly the ever-increasing availability of advanced capturing devices for video and audio alongside digital analysis software, have provided linguists with invaluable tools for the construction of multimodal records of human communication. These developments have proven to be particularly beneficial to the emergent field of multimodal corpus linguistic enquiry. A corpus is a principled collection of language data taken from real-life contexts. Modern corpora vary in size and scope and are used by a range of different researchers and professionals: from academics to lexicographers, textbook writers and syllabus designers; and more broadly, they could be used by potentially anyone with an interest in language. As Conrad (2002: 77) argues, “Corpus analysis can be a good tool for filling us in on the bigger picture of language”, providing users both with sufficient data for exploring specific linguistic patterns, and with the

D. Knight () School of Education, Communication and Language Sciences (ENCAP), Cardiff University, Cardiff, Wales, UK e-mail: [email protected] S. Adolphs School of English, The University of Nottingham, Nottingham, UK e-mail: [email protected] © Springer Nature Switzerland AG 2020 M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_16



D. Knight and S. Adolphs

methods of doing so: the corpus linguistic approach (see Stubbs 1996: 41). For more information on corpus design and construction, see Chap. 1. While the majority of current corpora are mono-modal in nature, comprising text-based records taken from transcribed spoken language and written data (e.g. the BNC, British National Corpus, see Burnard and Aston 1998, and COCA, the Corpus of Contemporary American English, see Davies 2008), a surge in the development of multimodal corpora has been witnessed over the past decade or so (including corpora of sign language (e.g. the British Sign Language (BSL) corpus, see Johnston and Schembri 2006) and associated systematic analysis of gesture patterning). Multimodal corpora present a significant step-change in the potential for linguistic study, moving away from textual representations of language-in-use and enabling the analysis of different modes of communication, represented in, for example, video and audio records of communication. Such records help to support the examination of the gestural, prosodic and proxemic features of talk.

16.2 Fundamentals

16.2.1 Defining Multimodality and Multimodal Corpora

Rooted in Halliday’s work on social semiotics (1978), multimodality is often conceptualised as the relationship between different ‘modes’ of communication and how they interact to develop meaning in discourse. A mode, in this context, is defined as “a socially shaped and culturally given semiotic resource for making meaning” (Kress 2010: 79). Examples of semiotic resources include texts, objects, space, colour, image, gesture, prosody, facial expressions, eye gaze and body posture. Multimodal analysis, then, concerns an examination not only of the ‘abstract’ processes utilised in discourse (i.e. bodily movement and speech, see Kress and van Leeuwen 2001), but also of the ‘media’, the physical mode(s), in which these are conveyed. A multimodal corpus is defined as “an annotated collection of coordinated content on communication channels, including speech, gaze, hand gesture and body language, and is generally based on recorded human behaviour” (Foster and Oberlander 2007: 307–308). Multimodal corpora are physical repositories within which records of these behaviours are represented, through the use of multiple forms of media; again, typically the culmination of aligned video, audio and textual representations (i.e. transcriptions) of data within a single digital interface. The catalyst behind the development of multimodal corpora is grounded in the notion that “natural language is an embodied phenomenon and that a deeper understanding of the relationship between talk and bodily actions—in particular gestures—is required if we are to develop a more coherent understanding of the collaborative organization of communication” (Adolphs 2013: 1; also see Saferstein 2004). The construction and analysis of emergent multimodal corpora aim to enable just this,

16 Multimodal Corpora


by providing the analyst with the resources which support the investigation of the interplay of different modes in the construction of meaning in discourse. This type of analysis is something that traditional mono-modal corpora, that is, corpora that reduce communication purely to a textual form, are unable to support. The majority of current multimodal corpora are forms of ‘specialised’ corpora insofar as they tend to be constructed to help answer a specific question, examine a particular discursive context and/or to meet the requirements of a particular research area or project. The VACE corpus (Chen et al. 2005), Multimodal Meeting (MM4) Corpus (Mana et al. 2007), Mission Survival Corpus (MMC1, McCowan et al. 2003), NIST Meeting Room Phase II Corpus (Garofolo et al. 2004) and AMI (Ashby et al. 2005) corpora are, for example, taken from meeting room contexts. The CUBE-G (Rehm et al. 2008), Fruit Carts Corpus (Aist et al. 2006), SaGA (Lücking et al. 2010) and the UTEP ICT (Herrera et al. 2010) corpora are all (semi-)scripted, controlled or task-based multimodal corpora, while the Nottingham Multi-Modal Corpus (NMMC; Knight et al. 2009) comprises recordings from an academic context. There currently exist no ‘general’ large-scale multimodal corpora, containing data sampled from a wide range of discursive contexts and/or socio-demographic groups, that are designed in the same vein as existing large-scale monomodal corpora. The specialist nature of multimodal corpora impacts on the potential scalability and reusability of the corpora across different projects. Reusability is an issue that is also often tied to ethical constraints that are commonly attached to multimodal corpora. While text-based corpora can be anonymised so that participants are not easily identifiable, this becomes more of an issue with corpora which contain, for example, video and audio data.
Audio files can be muted, video images can be pixelated, and avatars can be used to conceal the identities of particular individuals. However, such techniques often blur and distort the data, making it difficult to examine phonetic or prosodic patterns or certain forms of gestures, respectively. Explicit permission needs to be sought from participants not only to video and audio record data, but also to publish and/or distribute it: this is not always possible and is one of the reasons why a large-scale, widely used multimodal corpus is currently not in existence.

Some recent efforts have been made, however, to capture more naturalistic records of communication in the construction of a multimodal corpus, moving away from (semi)scripted, controlled or task-based contexts. These are generally smaller-scale corpora, particularly in comparison to their contemporary monomodal counterparts. Examples include the Multimodal Corpus Analysis in the Nordic Countries project (NOMCO, Paggio et al. 2010 – see Sect. 16.3) and the EVA corpus (Mlakar et al. 2017), both of which include more 'spontaneous' interactions, with participants asked to discuss topics of their own choosing (i.e. without specific prompts or requirements being provided). Similarly, the D64 Corpus (Oertel et al. 2010) includes 4 h of recordings from a two-day period and is a non-scripted, more spontaneous multimodal corpus, with interactions taking place in domestic settings. The developers of D64 aimed to provide "as close to ethnological observation of conversational behaviour as technological demands permit" (Oertel et al. 2010: 27), although as the corpus was constructed to support the automatic tracking and classification of gestures, participants were required to wear reflective sticky markers during the recording phase. Such markers are somewhat invasive and detrimental to the perceived naturalness of the recorded data, as they are "not only time-consuming and often uncomfortable" to wear but "can also significantly change the pattern of motion" enacted by participants (Fanelli et al. 2010: 70; also see Fisher et al. 2003). Thus, the extent to which this corpus can be deemed 'naturalistic' or truly spontaneous is limited.

D. Knight and S. Adolphs

16.2.2 Multimodality Research in Linguistics

While there is a rich tradition of multimodal research in fields such as psychology, computer science, cultural studies and anthropology, the examination of multimodal communication by linguists is comparatively underdeveloped. Some of the most significant developments in research into multimodality and multimodal text analysis in linguistics exist within the discourse analysis tradition and also in sign language research (for examples of relevant works see Tannen 1982; Baldry and Thibault 2001; Kress and van Leeuwen 2001; Lemke 2002; Scollon and Levine 2004; Ventola et al. 2004; Baldry 2004, 2005; van Leeuwen 2005; O'Halloran 2006, 2011; Royce and Bowcher 2006). Within the MDA (Multimodal Discourse Analysis) framework, analysts typically draw on systemic functional theory to examine the relationship between discourse and social practice. Examples of some of the areas where a multimodal approach is used include the examination of text layouts, colour use in texts, identity and voice, and embodiment and gesture. Raw data can include, for example, clips of video, still visual images, transcribed text, media cuttings, physical shapes and so on, in small episodes of talk or collections of data, with analysis, again, being qualitatively led and conducted at the micro level.

Work in multimodal corpus linguistics is, in contrast, a relatively new field of research, having only been established in the mid-to-late 2000s. The key difference between multimodal corpus linguistics and the multimodal analysis of corpora using MDA is that, as with traditional text-based corpus enquiry, multimodal corpus analysis includes not only detailed qualitative analyses, but also quantitative analyses of emerging patterns of language-in-use.
Gu (2006: 132) notes that multimodal corpus linguistic research tends to be rooted either in the social sciences, focusing on providing multimodal and multimedia studies of discourse, or in the computational sciences, focusing on speech engineering and corpus annotation. The former is concerned with answering specific questions about language use, while the latter deals with the development of software and tools for the construction of multimodal corpora, and with developing approaches to analyse the data systematically. As Knight (2011a: 55) notes, "few studies concentrate on both of these key concerns in great detail. This is mainly due to the fact that different types of expertise are needed to meet the requirements posed by each of these strands of research". To provide a holistic overview of the multimodal corpus landscape, the current chapter attempts to draw
on, and summarise, key ideas and considerations relevant to each of these strands of research. The Representative Study boxes provide examples of three empirical studies that focus on analysing specific linguistic phenomena using multimodal corpora: discourse markers, turn management in interaction, and spoken and non-verbal listenership behaviour. These areas of focus represent only the tip of the iceberg for multimodal corpus research, as such resources can be applied in many areas, from examining shoulder shrugging and hesitations (Jokinen and Allwood 2010) to studying emotions in interaction (Cu et al. 2010). As discussed above, the design and specific contents of multimodal corpora are often tied to the aims and objectives of a given research project, which means that the types of research question that can be addressed by analysing these corpora are similarly constrained.

16.2.3 Issues and Methodological Challenges

Few open-source and freely available multimodal corpora exist (with the AMI and NOMCO corpora perhaps being exceptions to this – see Sect. 16.3 for further details) and, as already outlined above, existing multimodal corpora are commonly specialist and bespoke. Many of the most pertinent issues and methodological challenges faced in multimodal corpus research are tied to the construction and availability of resources for further research. This dependency arises because our ability to analyse and examine complex and extensive multimodal corpora in a systematic and accurate way (particularly when integrating traditional corpus-based methods into the approach) is reliant on the availability of appropriate technology.

The construction of multimodal corpora is a time-consuming and technically complex process. Decisions regarding the design and composition of these corpora must ensure that the specific research questions being asked can be addressed using the resource that is compiled. Researchers need to consider what forms of data are to be included; what modalities are to be captured and represented; where the data is to be sourced from; what format the data is stored in (formats are not always universal and/or transferable); how it will be synchronised, transcribed and annotated; and what kinds of tools/conventions will be used to support these processes. This is because "like transcription, any camera position [or hardware/software used to capture multimodal 'data'] constitutes a theory about what is relevant within a scene—one that will have enormous consequences for what can be seen in it later—and what forms of subsequent analysis are possible" (Goodwin 1994: 607).

Achieving balance and representativeness has long been regarded as fundamental to designing reliable and verifiable corpora (see Sinclair 2005 – also refer to Chap. 1 for further discussion).
A balanced corpus is one that “usually covers a wide range of text categories which are supposed to be representative of the language or language variety under consideration” (McEnery et al. 2006: 16), while
representativeness in corpus design is best defined as "the extent to which a sample includes the full range of variability in a population", that is, "the range of texts in a language" and the "range of linguistic distributions in a language" (Biber 1993: 243). Balance and representativeness depend on the number of words per sample, the number of samples per 'text' and the number of texts per text type included in a corpus. Again, multimodal corpora are typically specialist insofar as no general, or reference, multimodal corpus exists. While each multimodal corpus may be representative of the specific context or discourse type that it has been designed to study, it is difficult to provide a list of current resources that are most representative of the field. The Representative Corpora boxes below provide some examples of better-known, available multimodal corpora.

Once collected, "managing the detail and complexity involved in annotating, analysing, searching and retrieving multimodal semantic patterns within and across complex multimodal phenomena" (O'Halloran 2011: 136) is the next challenge to be faced. As already noted, multimodal corpus analysis is essentially a mixed-methods approach, one which seeks to combine quantitative techniques with qualitative textual analyses, as utilised in conventional corpus enquiry. To support detailed quantitative analysis of phenomena in traditional corpus research, data is often marked up and annotated first, to make specific features searchable using concordancing tools. Monomodal text-only corpora in English are sometimes annotated automatically with the use of 'taggers' and 'parsers'. These can tackle morphological, grammatical, lexical, semantic and discourse-level features in talk (see Chap. 2 for further information on corpus annotation). These taggers and parsers and their associated annotation systems tend, however, to be unable to support the analysis of language in use beyond the verbal elements of discourse.
This is because the annotation of non-verbal elements of talk, including gestures, is particularly complex, as gestures are not readily rendered as 'textual units' (see Gu 2006: 130; for additional work on language and gesture, see Ladewig 2014a, b; Steen and Turner 2013; Müller et al. 2013; and Rossini 2012). So, while digital tools and methods supporting the identification, representation and analysis of spoken discourse proliferate, there is a lack of tools for marking up non-verbal elements or for integrating these with verbal elements for investigation (particularly in the field of linguistics).

There are also ongoing methodological and technical challenges regarding how best to align annotations from the different modalities in a meaningful way, to map the temporal and semiotic relationships between them. Abuczki and Ghazaleh note that there have been recent moves towards designing international standards "for annotating various features of spoken utterances, gaze movement, facial expressions, gestures, body posture and combinations of any of these features" within multimodal corpora (2013: 94). The most notable work on developing reusable, international standards for investigating language and gesture-in-use was carried out by researchers involved in the 'Natural Interaction and Multimodality' (NIMM) part of the 'International Standards for Language Engineering' (ISLE) project (Dybkjær and Ole Bernsen 2004, also refer to Wittenburg et al. 2000). The uptake of this has not, as yet, been universal, so there remains no formally agreed, conventional prescription for mark-up across the research
community. Instead, individual projects often devise their own bespoke schemes, which are designed to meet the specific requirements of their research.

A further challenge faced by researchers working on multimodal corpora is the choice of software used to analyse multimodal resources/corpora. Other chapters in this volume have cited a variety of different digital concordancing tools which support the analysis of corpora (see Chap. 8 for more information). These include AntConc (Anthony 2017), #LancsBox (Brezina et al. 2015), SketchEngine (Kilgarriff et al. 2014), WMatrix (Rayson 2009) and WordSmith Tools (Scott 2017). Although these may have some key differences regarding the basic interfaces, query tools and functionalities they offer, they also share many commonalities. We can expect, for example, to be able to search for words, lemmas, clusters/n-grams and POS; generate frequency lists (and, in some cases, word clouds); calculate keywords; calculate and examine patterns of collocation; generate Key Word in Context (KWIC) outputs; chart and visualise results; and/or even plot the incidence of individual items across individual texts and/or sub-corpora using current software. These tools enable quantitative (including statistical) analyses as a potential way in to the corpus-based examination of text(s) and provide the means for the user to access specific texts or items of interest for further, more qualitative analyses. However, each of these tools lacks provisions for querying multimodal corpora, i.e. corpora which do not simply comprise text-based representations of interaction.

Note, however, that tools that support the annotation of complex multimodal datasets are widely available. Examples such as ELAN (Wittenburg et al. 2006) and ANVIL (Kipp 2001) are discussed in more detail in Sect. 16.4 below, with a screenshot of ELAN shown in Fig. 16.1.
Fig. 16.1 A screenshot from ELAN

In this screenshot we see ELAN's capacity to synchronise and organise various forms of aligned time series data for future analysis. The main annotation window includes a video (top left), and a searchable annotation grid and viewer on the top right (here, transcripts can be viewed and searched). In this example, the first two 'tiers' of information (beneath the video) are movement outputs from left- and right-hand sensors, followed by an audio output. Beneath this is the transcribed speech from the episode, followed by the coded semiotic gestures performed by the right hand, the left hand, and then both hands together. Each of these tiers is synchronised by time and searchable via the facility in the top right corner of the figure.

There are calls for the construction of digital tools that enable corpus-based analysis of more complex, multimodal corpora; however, there are various reasons why the development of such tools is complicated. The analysis of multimodal corpora presents a whole host of technological challenges, especially with regard to the synchronisation and representation of multiple streams of information. Early developments that responded to this challenge were made with the construction of the Digital Replay System (DRS, see Brundell et al. 2008). DRS was the first (and, to date, only) software to facilitate corpus-based analyses of synchronised multimodal datasets, using lexical units and/or gestural annotations as the 'way in' to analysis. It enabled users to store, transcribe, code and align different forms of qualitative data, and supported the quantitative and corpus-based analysis of that data, including the generation of frequency lists and key-word-in-context (KWIC) concordance outputs, amongst other utilities. Unfortunately, however, this tool is no longer supported or sustained, so it is no longer available to the research community.
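The kind of query such a tool supports can be illustrated with a short sketch: a KWIC concordance over a speech tier, with co-occurring items from a gesture tier attached to each hit. The utterances, timings and gesture labels below are invented for illustration, and the interval-overlap logic is a simplification of how annotation tools align tiers.

```python
# Toy time-aligned tiers. Speech tier: (start_ms, end_ms, utterance);
# gesture tier: (start_ms, end_ms, label). All values are invented.
speech = [
    (0,    2000, "well I think that is fine"),
    (2500, 4200, "yeah well maybe we should check"),
    (4800, 6000, "okay let us move on"),
]
gestures = [
    (100,  900,  "head_nod"),
    (2400, 3000, "shoulder_shrug"),
    (5000, 5500, "head_nod"),
]

def kwic(speech_tier, gesture_tier, node, span=2):
    """Return (left, node, right, overlapping_gestures) for each hit of `node`."""
    results = []
    for start, end, utt in speech_tier:
        words = utt.split()
        for i, w in enumerate(words):
            if w == node:
                left = " ".join(words[max(0, i - span):i])
                right = " ".join(words[i + 1:i + 1 + span])
                # A gesture co-occurs if its interval overlaps the utterance.
                overlapping = [g for gs, ge, g in gesture_tier
                               if gs < end and ge > start]
                results.append((left, w, right, overlapping))
    return results

for left, node, right, gest in kwic(speech, gestures, "well"):
    print(f"{left:>15} | {node} | {right:<15} {gest}")
```

Here the node word "well" is retrieved twice, once accompanied by a head nod and once by a shoulder shrug; in a real corpus the gesture tier would come from manual coding in ELAN or ANVIL rather than a hand-written list.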

Representative Study 1

Baiat, G.E., Coler, M., Pullen, M., Tienkouw, S. and Hunyadi, L. 2013. Multimodal analysis of 'well' as a discourse marker in conversation: A pilot study. Proceedings of the 4th IEEE International Conference on Cognitive Infocommunications (CogInfoCom 2013), 283–287. Budapest, Hungary.

This paper adopts a multimodal corpus-pragmatic approach to the analysis of the relationship between the spoken discourse marker well and accompanying non-verbal behaviours. It asks whether well is more likely to be used with or without accompanying behaviours, and examines what pragmatic functions can be associated with any patterns of verbal and non-verbal co-occurrence.
As a 'pilot' study, the paper, with a corpus comprising only six recordings of circa 7 min each, problematizes the collection, annotation and analysis of multimodal corpora for research. Amongst other key findings, this initial exploration revealed that "having an averted eye gaze was correlated with the use of this marker regardless of its pragmatic function" (Baiat et al. 2013: 287).

Representative Study 2

Navarretta, C. and Paggio, P. 2013. Classifying multimodal turn management in Danish dyadic first encounters. Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), 133–146. Oslo, Norway.

This paper examines multimodal turn management in interactions between Danish speakers who have met for the first time, uncovering how individuals give, take or keep the turn through speech and non-verbal behaviours. The data used in this study is taken from the NOMCO corpus (see Representative Corpus 3), drawing on the Danish unscripted first-encounter dialogues, which comprise 12 conversational dyads of circa 5 min each, in which participants are instructed "to talk in order to get acquainted, as if they met at a party" (Navarretta and Paggio 2013: 135). The study revealed that "all kinds of head movement, facial expression and body posture are involved in turn management" and that "the turn management types frequently occurring in this corpus depend on the type of social activity in which the participants are involved" (Navarretta and Paggio 2013: 142). As with Baiat et al. (2013), this work contributes to our understanding of human-human interaction but can also be used for dialogue modelling systems.

Representative Study 3

Malisz, Z., Wlodarczak, M., Buschmeier, H., Skubisz, J., Kopp, S. and Wagner, P. 2016. The ALICO corpus: Analyzing the active listener. Language Resources and Evaluation 50(2): 411–442.

This paper contributes to research on spontaneous human-human interaction, examining the functional relationship between different modalities in interaction and focusing specifically on the role of the listener and non-verbal listenership behaviours, including head nods and patterns of gaze. The research presented is carried out using data from the ALICO corpus, which comprises 50 spontaneous storytelling dialogues carried out by dyads of German speakers, 34 of whom were female and 16 male. Head movements, gaze and other forms of non-verbal listening behaviours were annotated in conjunction with the spoken discourse to allow for an analysis of the relationship between them. The paper indicates that there is a link "between feedback form and its pragmatic function in both visual and verbal domains" (2016: 413), and that the correlation and frequency of specific listening behaviours underlined the level of attentiveness and involvement of the listener in conversation. For examples of other studies which have examined listenership behaviours, see Dittman and Llewellyn 1968; Duncan 1972; Maynard 1987; Paggio and Navarretta 2010; and Knight 2011a.

Representative Corpus 1

The AMI Meeting Corpus comprises 100 h of recordings taken from meeting room contexts. AMI includes synchronised data from audio recorders (microphones), video cameras, projectors, whiteboards and pens, together with associated transcriptions and annotations of this data. The corpus was compiled in 2005 and remains one of the largest freely accessible multimodal corpora in existence. Specific excerpts from the corpus can be downloaded from http://groups.inf.ed.ac.uk/ami/download/ and can be searched and analysed using the NITE XML Toolkit, an open-source set of tools which enables the sharing, markup and annotation, representation and analysis of complex multimodal data. The motivation behind the construction of AMI was to help develop technology that enables group interaction in meeting room contexts, although it is acknowledged that the resource "could be used for many different purposes in linguistics, organizational and social psychology, speech and language engineering, video processing, and multi-modal systems" (Carletta 2006: 3).

Representative Corpus 2

TalkBank (see: https://talkbank.org/), which includes the Child Language Data Exchange System (CHILDES – MacWhinney 2000; cf. Chap. 14), is an online repository for transcribed audio and video data that "seeks to provide a common framework for data sharing and analysis for each of the many disciplines that studies conversational interactions" (MacWhinney 2007: 8). This includes data for the examination of, for example, first and second language acquisition, child phonology, gesture use, bilingualism, aphasia and classroom interaction. There is a standardised format in which the data is stored and accessed on the site, as well as standardised guidelines for transcription, based on Conversation Analysis (CA) principles (see Sacks et al. 1974), and coding, based on CHAT (Codes for the Human Analysis of Transcripts). The software CLAN (Computerized Language Analysis) was created specifically to support the creation and analysis of data included in TalkBank. While TalkBank is extensive and is sampled from different contexts and individual projects, the resource is perhaps not a corpus in the traditional sense, because it is not a principled collection of data (more a collection of sub-sets of data). The resource is specifically designed to enable micro-level analyses of data and is not equipped with corpus query tools which would enable a more quantitative analysis of its contents.
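Even without built-in query tools, CHAT-formatted transcripts are plain text and can be queried programmatically. The sketch below reads the main (speaker) tiers of a CHAT-style transcript; the transcript itself is invented, and real CHAT files contain many more header and dependent (%-prefixed) tiers than are handled here.

```python
from collections import Counter

# An invented, minimal CHAT-style transcript: @ lines are headers,
# * lines are main speaker tiers.
chat = """\
@Participants: CHI Child, MOT Mother
*MOT: what is that ?
*CHI: a doggie .
*MOT: yes , a doggie .
"""

def utterances(text, speaker=None):
    """Return (speaker, utterance) pairs from CHAT main tiers (*XXX: lines)."""
    out = []
    for line in text.splitlines():
        if line.startswith("*"):
            spk, utt = line[1:].split(":", 1)
            if speaker is None or spk == speaker:
                out.append((spk, utt.strip()))
    return out

# Word frequencies for one speaker, ignoring punctuation tokens.
freq = Counter(w for _, u in utterances(chat, "MOT")
                 for w in u.split() if w.isalpha())
print(utterances(chat, speaker="MOT"))
print(freq)
```

Tools such as CLAN perform this kind of extraction (and much more) natively; the point of the sketch is only that the standardised format makes such quantitative processing straightforward in principle.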

Representative Corpus 3

The Multimodal Corpus Analysis in the Nordic Countries (NOMCO) project includes conversations in Swedish, Finnish, Danish, Estonian and, more recently, Maltese, which have been video recorded, transcribed and annotated (see Paggio et al. 2010). Each corpus has been constructed in a similar way, with data orthographically transcribed using the speech analysis software PRAAT (Boersma and Weenink 2009) and coded for specific forms of gesture, including head movements, body posture and facial expressions, using the video annotation tool ANVIL (Kipp 2001). The corpora were constructed to support the analysis of a range of phenomena including, for example, "turn management, feedback exchange, information packaging and the expression of emotional attitudes" (Paggio and Navarretta 2017: 463). The corpora are currently available for other researchers to use. More information can be accessed at: http://sskkii.gu.se/nomco/index.php

16.3 Critical Assessment and Future Directions

This chapter has provided an overview of the current state of the art in multimodal corpus linguistics. It has highlighted that while multimodal corpus research is gaining some momentum, there are still some areas where further development is required. Perceived shortcomings relating to the size, scope (and representativeness and reusability) and the level of availability of current multimodal corpora have been mentioned, along with some of the challenges regarding the representation and analysis of multimodal corpora.

A final challenge for current multimodal corpus research relates to more recent discussions and developments in this field. Multimodal interaction includes a range of different semiotic resources, and multimodal corpora, as already noted, have the potential to enable the researcher to study the use of language along a continuum of dynamically changing contexts. The focus on dynamic contexts has resulted in a call for the construction of more 'heterogeneous' corpora for linguistic research (see Adolphs et al. 2011). Heterogeneous corpora are those which include different forms of media 'types', from interaction in virtual environments (instant messaging, entries on personal notice boards etc.) and GPS data, to face-to-face situated discourse, phone calls, video calls, sensory data and so on. Heterogeneous corpora aim to better capture and represent aspects of the complexity and fluidity of the discursive context for the future analysis of language use (see Tennent et al. 2008; Crabtree and Rodden 2009; Knight et al. 2010; Knight 2011b and Adolphs and Carter 2013). Heterogeneous corpora also aim to provide better insight into how language is received, as well as produced, by an individual. Accounting for the effect of contextual factors on language use in a systematic way is crucial in this context. As Bazzanella acknowledges, "in real life, context is exploited both in production and in comprehension" (2002: 239; also see Sweetser and Fauconnier 1996).

As a vision for the future, Knight et al. (2010: 2) suggest that software supporting the construction and analysis of multimodal and heterogeneous corpora will include the following functionalities:

• The ability to search data and metadata in a principled and specific way (encoded and/or transcribed text-based data).
• Tools that allow for the frequency profiling of events/elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing such).
• Variability in the provisions for transcription, and the ability to represent simultaneous speech and speaker overlaps.
• New methods for drilling into the data, through mining specific relationships within and between domain(s). This may be comparable to current social networking software, mind maps or more topologically based methods.
• Graphing tools for mapping the incidence of words or events, for example, over time and for comparing sub-corpora and domain-specific characteristics.

As outlined in this chapter, multimodal corpus research is still somewhat in its infancy, and as such we can expect a step-change in our description and understanding of language based on this research. It will be interesting to see whether some of the major insights generated within monomodal, or text-based, corpus linguistics will be replicated once we are able to truly analyse multimodality at scale. This may include new insights into the collocation of multimodal units of meaning across interactions; the acquisition of speech-gesture units; and insights into the frequencies of specific multimodal units in different contexts. Advances in this area are likely to be dependent, at least in part, on the development, functionality and availability of technological resources, but also, as Knight (2011b: 409) notes, on institutional, national and international collaborative interdisciplinary and multidisciplinary research strategies and funding.
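The frequency-profiling functionality envisaged by Knight et al. can be sketched in a few lines. The event records and domain labels below are invented for illustration; in a real heterogeneous corpus they would be drawn from aligned speech, gesture, messaging and sensor streams.

```python
from collections import Counter

# Toy event records: (domain, event) pairs. All labels are invented.
events = [
    ("face-to-face", "head_nod"), ("face-to-face", "laughter"),
    ("face-to-face", "head_nod"), ("phone_call", "laughter"),
    ("messaging", "emoji"), ("messaging", "emoji"),
    ("phone_call", "backchannel"), ("face-to-face", "backchannel"),
]

def frequency_profile(records):
    """Count events per domain: {domain: Counter({event: count})}."""
    profile = {}
    for domain, event in records:
        profile.setdefault(domain, Counter())[event] += 1
    return profile

def normalised(counter, per=100):
    """Normalise raw counts to a rate per `per` events, for cross-domain comparison."""
    total = sum(counter.values())
    return {e: round(c * per / total, 1) for e, c in counter.items()}

profile = frequency_profile(events)
print(profile["face-to-face"]["head_nod"])   # raw count of head nods face-to-face
print(normalised(profile["face-to-face"]))   # rates per 100 events in that domain
```

Normalisation matters here for the same reason it does in monomodal corpus work: domains of different sizes can only be compared on rates, not raw counts.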

16.4 Tools and Resources

While freely available and widely used multimodal corpora are scarce, there are, conversely, a wide range of digital tools and resources that exist to support the construction and analysis of bespoke multimodal corpora. The most widely used tools, and the specific areas of multimodal corpus research that they support (corpus compilation, annotation and analysis), are presented below. All of the tools listed require manual transcription, annotation and analysis of data: these processes are not automated.

ANVIL

ANVIL is video analysis software built by Michael Kipp in 2000 and most recently updated in 2017 (see Kipp 2001 for more details about the key functionalities of this tool). ANVIL provides users with tools for the management, synchronisation, transcription, annotation and analysis of multimodal data (in various formats, including 3D motion-capture videos) across multiple layers (known as tiers); that is, for the construction and synchronisation of bespoke multimodal corpora. ANVIL is also equipped with a PRAAT plug-in (for details on PRAAT, see: www.fon.hum.uva.nl/praat (accessed 23 May 2019); cf. also Chap. 11), amongst other tools, supporting phonetic and other forms of linguistic analysis of multimodal corpora. Unfortunately, the current version of ANVIL is not integrated with any specific corpus enquiry tools such as concordancers or wordlist functionalities, although it does have systematic search capabilities, which operate at the tag/annotation level as well as at the lexical level.

ELAN (https://tla.mpi.nl/tools/tla-tools/elan) (accessed 23 May 2019)

The Max Planck Institute's ELAN (Wittenburg et al. 2006) is an open-source tool which enables the annotation and analysis of data across multiple 'tiers' of information.
As with ANVIL, ELAN enables users to upload and synchronise their own multimodal corpora and supports the manual frame-based annotation and analysis of multiple modes of time series data, from audio and video recordings to sensor outputs. A particular strength of ELAN is that it is integrated with a user-friendly transcription viewer and some basic corpus analytic tools which allow users to search for specific lexical items and/or codes within a specific data 'project'. As with ANVIL, ELAN is also equipped with a PRAAT plug-in (amongst other plug-ins).
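ELAN stores its tiers in an XML-based .eaf file, which makes programmatic access possible outside the tool itself. The sketch below reads a deliberately simplified, invented EAF fragment with Python's standard library; real .eaf files carry additional required elements (header, linguistic types, media descriptors) that are omitted here.

```python
import xml.etree.ElementTree as ET

# Minimal EAF-style fragment: a shared TIME_ORDER of time slots, and tiers
# whose alignable annotations reference those slots. Values are invented.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="800"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="speech">
    <ANNOTATION><ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
      <ANNOTATION_VALUE>well I think so</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
  <TIER TIER_ID="gesture_right_hand">
    <ANNOTATION><ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
      <ANNOTATION_VALUE>point</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

def read_tiers(eaf_text):
    """Return {tier_id: [(start_ms, end_ms, value), ...]} from an EAF string."""
    root = ET.fromstring(eaf_text)
    # Resolve symbolic time slots to millisecond values.
    times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((times[ann.get("TIME_SLOT_REF1")],
                         times[ann.get("TIME_SLOT_REF2")],
                         ann.findtext("ANNOTATION_VALUE")))
        tiers[tier.get("TIER_ID")] = rows
    return tiers

tiers = read_tiers(EAF)
print(tiers["speech"])              # [(0, 1200, 'well I think so')]
print(tiers["gesture_right_hand"])  # [(800, 1500, 'point')]
```

Because every tier resolves to the same shared timeline, the temporal overlap between, say, speech and hand gestures can be computed directly from the extracted intervals.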



Transana (www.transana.com) (accessed 23 May 2019)

Transana supports the qualitative analysis of video, audio and images. It enables users to integrate, transcribe, categorize and code their own data (i.e. construct their own multimodal corpus) and then search and explore it in more detail (i.e. analyse it). Transana also provides the means for converting files into a standard format, which increases the flexibility of the tool. The transcription tools within Transana are particularly user friendly, and a particular strength of this tool is that it enables real-time collaboration via the online version. As Transana is primarily a tool for qualitative analysis, its provisions for, for example, frequency-based or numerical analysis (as utilized by corpus linguists) are somewhat limited.

Further Reading

Allwood, J. 2008. Multimodal Corpora. In Corpus Linguistics: An International Handbook, eds. Lüdeling, A. and Kytö, M., 207–225. Berlin: Mouton de Gruyter.

Allwood's (2008) chapter includes a particularly detailed focus on the definition and annotation of gestures and the steps towards developing a standardised approach to such within the field of multimodal corpus linguistics. Extensive examples of what may be analysed using multimodal corpora are also offered, from the examination of communication and power, emotion, consciousness and awareness to the analysis of multimodality within and across specific types of media. Allwood also provides some more practical reflections on the possible applications of multimodal corpus research, from the development of tools to support human-human and human-machine communication to the construction of multimodal translation and interpretation systems.

Knight, D. 2011a. Multimodality and Active Listenership: A Corpus Approach. London: Continuum.

Building on some of the discussions outlined by Allwood (2008), Knight's monograph takes readers through every step of multimodal corpus design and construction and provides a worked example of multimodal corpus analysis. This analysis focuses on the examination of the use of, and the relationship between, spoken and non-verbal forms of backchanneling behaviour, and a particular sub-set of gestural forms: head nods. This investigation is undertaken by means of analysing the patterned use of specific forms and functions of backchannels within and across sentence boundaries, in 5 h of dyadic (supervision) data taken from the NMMC (Nottingham Multi-Modal Corpus). Knight discusses the validity and applicability of different existing categorisations of backchannels to multimodal corpus data, and examines the requirements for a redefinition of these in light of the findings resulting from the analyses.

16 Multimodal Corpora


Gu, Y. 2006. Multimodal text analysis: A corpus linguistic approach to situated discourse. Text and Talk 26(2): 127–167.
In a similar way to Knight, Gu concentrates on providing some guidelines for an approach to multimodal corpus analysis. This study is situated mainly within the MDA paradigm, so it focuses on small-scale corpora of situated discourse, utilizing discourse analytic methods as the initial point of departure, but discusses how such an approach can be extended with the integration of corpus methods. Gu focuses on different modalities within the analysis, beyond what the majority of multimodal corpus studies typically afford, making this work of particular relevance to the final section of this chapter: projections for the future directions of this field.

Kipp, M., Martin, J-C., Paggio, P. and Heylen, D. (Eds). 2009. Multimodal Corpora: from models of natural interaction to systems and applications. Lecture Notes in Artificial Intelligence 5509. Berlin: Springer-Verlag.
This edited collection provides a state-of-the-art survey of research into multimodal corpus construction and analysis, taken from the ‘Multimodal Corpora: From Models of Natural Interaction to Systems and Applications’ workshop held in conjunction with the 6th LREC (International Conference for Language Resources and Evaluation) conference, held in Marrakech in 2008. The volume presents research from different fields including computational linguistics, human-human interaction, signal processing, artificial intelligence, psychology and robotics, comprising six contributions from the workshop and additional invited research articles from key multimodal corpora projects, including AMI. The volume discusses a range of topics from corpus construction to analysis, so is invaluable for researchers who are perhaps new to the field and need some advice on carrying out their own work in this area.

References

Abuczki, Á., & Ghazaleh, E. B. (2013). An overview of multimodal corpora, annotation tools and schemes. Argumentum, 9, 86–98. Adolphs, S. (2013). Corpora: Multimodal. In C. A. Chapelle (Ed.), The encyclopaedia of applied linguistics. London: Blackwell. Adolphs, S., & Carter, R. (2013). Spoken corpus linguistics: From monomodal to multimodal. London: Routledge. Adolphs, S., Knight, D., & Carter, R. (2011). Capturing context for heterogeneous corpus analysis: Some first steps. International Journal of Corpus Linguistics, 16(3), 305–324. Aist, G., Allen, J., Campana, E., Galescu, L., Gomez Gallo, C.A., Stoness, S., Swift, M., & Tanenhaus, M. (2006). Software architectures for incremental understanding of human speech, 1922–1925. In: Proceedings of Interspeech 2006, Pittsburgh. Allwood, J. (2008). Multimodal Corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (pp. 207–225). Berlin: Mouton de Gruyter. Anthony, L. (2017). AntConc (Version 3.4.4) [Computer software]. Waseda University, Tokyo. Available from http://www.laurenceanthony.net. Accessed 23 May 2019.


D. Knight and S. Adolphs

Ashby, S., Bourban, S., Carletta, J., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., and Wellner, P. (2005). The AMI Meeting Corpus. In: Proceedings of measuring behavior conference, Wageningen. Baiat, G.E., Coler, M., Pullen, M., Tienkouw, S., & Hunyadi, L. (2013). Multimodal analysis of ‘well’ as a discourse marker in conversation: A pilot study. In: Proceedings of the 4th IEEE International Conference on Cognitive Infocommunications (CogInfoCom 2013), (pp. 283–287). Budapest, Hungary. Baldry, A. (2004). Phase and transition, type and instance: Patterns in media texts as seen through a multimodal concordance. In K. O’Halloran (Ed.), Multimodal discourse analysis: Systemic functional perspectives (pp. 83–108). London/New York: Continuum. Baldry, A. (2005). Multimodal corpus linguistics. In G. Thompson & S. Hunston (Eds.), System and Corpus: Exploring connections (pp. 164–183). London/New York: Equinox. Baldry, A., & Thibault, P. J. (2001). Towards multimodal corpora. In G. Aston & L. Burnard (Eds.), Corpora in the description and teaching of English, papers from the 5th ESSE Conference (pp. 87–102). Bologna: Cooperativa Libraria Universitaria Editrice Bologna. Bazzanella, C. (2002). The significance of context in comprehension: The ‘we case’. Foundations of Science, 7, 239–254. Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257. Boersma, P., & Weenink, D. (2009). Praat: doing phonetics by computer [Computer Program]. Available from http://www.praat.org. Accessed 23 May 2019. Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139–173. Brundell, P., Tennent, P., Greenhalgh, C., Knight, D., Crabtree, A., O’Malley, C., Ainsworth, S., Clarke, D., Carter, R., and Adolphs, S. (2008). 
Digital Replay System (DRS): A tool for interaction analysis. In: Paper delivered at the International Conference for the Learning Sciences 2008 (ICLS), Utrecht, The Netherlands June–July 2008. Burnard, L., & Aston, G. (1998). The BNC handbook: Exploring the British National Corpus. Edinburgh: Edinburgh University Press. Carletta, J. (2006). Announcing the AMI meeting corpus. The ELRA Newsletter, 11(1), 3–5. Chen, R. L., Rose, T., Qiao, Y., Kimbara, I., Parrill, F., Welji, H., Han, X., Tu, J., Huang, Z., Harper, M., Quek, F., Xiong, Y., McNeill, D., Tuttle, R., & Huang, T. (2005). VACE multimodal meeting corpus. In S. Renals & S. Bengio (Eds.), Proceedings of Machine Learning for Multimodal Interaction (MLMI 2005), Lecture Notes in Computer Science (Vol. 3869, pp. 40–45). Berlin/Heidelberg: Springer. Conrad, S. (2002). Corpus linguistic approaches for discourse analysis. Annual Review of Applied Linguistics, 22, 75–95. Crabtree, A., & Rodden, T. (2009). Understanding interaction in hybrid ubiquitous computing environments. Proceedings of the 8th International Conference on Mobile and Ubiquitous Media, Article No. 1, 1–10. Cambridge: ACM. Cu, J., Suarez, M., & Sta, M.A. (2010). Filipino Multimodal Emotion Database. In: Proceedings of 7th Conference on Language Resources and Evaluation (LREC-2010) (pp. 37–42). Malta, May 2010. Davies, M. (2008). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. Available online at https://corpus.byu.edu/coca/. Accessed 23 May 2019. Dittman, A., & Llewellyn, L. (1968). Relationships between vocalizations and head nods as listener responses. Journal of Personality and Social Psychology, 9, 79–84. Duncan, S. (1972). Some signals and rules for taking speaking turns in conversation. Journal of Personality and Social Psychology, 23(2), 283–292. Dybkjær, L., & Ole Bernsen, N. (2004). Recommendations for natural interactivity and multimodal annotation schemes. 
In: Proceedings of the 4th Language Resources and Evaluation Conference (LREC) 2004, Lisbon, Portugal.



Fanelli, G., Gall, J., Romsdorfer, H., Weise, T., & Van Gool, L. (2010). 3D vision technology for capturing multimodal corpora: chances and challenges. In: Proceedings of the LREC Workshop on Multimodal Corpora, Mediterranean Conference Centre (pp. 70–73). Malta, 18 May 2010. Fisher, D., Williams, M., & Andriacchi, T. (2003). The therapeutic potential for changing patterns of locomotion: an application to the acl-deficient knee. In: Proceedings of the ASME Bioengineering Conference (pp. 849–50). Miami, Florida, June 2003. Foster, M. E., & Oberlander, J. (2007). Corpus-based generation of head and eyebrow motion for an embodied conversational agent. Language Resources and Evaluation (LREC), 41(3/4), 305–323. Garofolo, J.S., Laprun, C.D., Michel, M., Stanford, V.M., & Tabassi, E. (2004). The NIST Meeting Room Pilot Corpus. In: Proceedings of the 4th Language Resources and Evaluation Conference (LREC) 2004, Lisbon, Portugal. Goodwin, C. (1994). Professional vision. American Anthropologist, 96(3), 606–633. Gu, Y. (2006). Multimodal text analysis: A corpus linguistic approach to situated discourse. Text and Talk, 26(2), 127–167. Halliday, M. A. K. (1978). Language as social semiotic: The social interpretation of language and meaning. London: Edward Arnold. Herrera, D., Novick, D., Jan, D., & Traum, D. (2010). The UTEP-ICT Cross-Cultural Multiparty Multimodal Dialog Corpus. In: Proceedings of LREC Workshop on Multimodal Corpora. May 2010, Malta. Johnston, T., & Schembri, A. (2006). Issues in the creation of a digital archive or a signed language. In L. Barwick & N. Thieberger (Eds.), Sustainable data from digital fieldwork (pp. 7–16). Sydney: Sydney University Press. Jokinen, K., & Allwood, J. (2010). Hesitation in Intercultural Communication: Some observations on Interpreting Shoulder Shrugging. In: Proceedings of the international workshop on agents in cultural context, (pp. 25–37). The First International Conference on Culture and Computing 2010 Kyoto, Japan. 
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The sketch engine: Ten years on. Lexicography, 1, 7–36. Kipp, M. (2001). ANVIL – A generic annotation tool for multimodal dialogue, 1367–1370. In: Proceedings of 7th European Conference on Speech Communication and Technology 2nd INTERSPEECH Event, Aalborg, Denmark. Knight, D. (2011a). Multimodality and active listenership: A corpus approach. London: Bloomsbury. Knight, D. (2011b). The future of multimodal corpora. Brazilian Journal of Applied Linguistics (BJAS), 11(2), 391–415. Knight, D., Evans, D., Carter, R. A., & Adolphs, S. (2009). Redrafting corpus development methodologies: Blueprints for 3rd generation multimodal, multimedia corpora. Corpora, 4(1), 1–32. Knight, D., Tennent, P., Adolphs, S., & Carter, R. (2010). Developing heterogeneous corpora using the Digital Replay System (DRS). In: Proceedings of the LREC 2010 (Language Resources Evaluation Conference) Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality. Malta, May 2010. Kress, G. (2010). Multimodality: A social semiotic approach to contemporary communication. London: Routledge. Kress, G. R., & Van Leeuwen, T. (2001). Multimodal discourse: The modes and media of contemporary communication. London: Arnold. Ladewig, S. H. (2014a). The cyclic gesture. In C. Müller, A. Cienki, E. Fricke, S. Ladewig, D. McNeil, & J. Bressem (Eds.), Body language communication volume 2: An international handbook on multimodality in human interaction (pp. 1605–1618). Berlin: De Gruyter. Ladewig, S. H. (2014b). Recurrent gestures. In C. Müller, A. Cienki, E. Fricke, S. Ladewig, D. McNeil, & J. Bressem (Eds.), Body language communication volume 2: An international handbook on multimodality in human interaction (pp. 1558–1574). Berlin: De Gruyter. Lemke, J. L. (2002). Travels in hypermodality. Visual Communication, 1(3), 299–325.



Lücking, A., Bergman, K., Hahn, F., Kopp, S., & Rieser, H. (2010). The Bielefeld Speech and Gesture Alignment Corpus (SaGA). In: Proceedings of LREC Workshop on Multimodal Corpora, 2010. Malta, May 2010. MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Mahwah: Lawrence Erlbaum Associates. MacWhinney, B. (2007). The TalkBank project. In J. Beal & K. Moisl (Eds.), Creating and digitizing language corpora: Synchronic databases (Vol. 1). Basingstoke/Hampshire: Palgrave Macmillan. Malisz, Z., Włodarczak, M., Buschmeier, H., Skubisz, J., Kopp, S., & Wagner, P. (2016). The ALICO corpus: Analyzing the active listener. Language Resources and Evaluation, 50(2), 411–442. Mana, N., Lepri, B., Chippendale, P., Cappelletti, A., Pianesi, F., Svaizer, P., & Zancanaro, M. (2007). Multimodal Corpus of Multi-Party Meetings for Automatic Social Behavior Analysis and Personality Traits Detection. In: Proceedings of the Workshop on Tagging, Mining and Retrieval of Human-Related Activity Information at ICMI’07 (pp. 9–14). Nagoya, Japan. Maynard, S. K. (1987). Interactional functions of a nonverbal sign: Head movement in Japanese dyadic casual conversation. Journal of Pragmatics, 11, 589–606. McCowan, I., Bengio, S., Gatica-Perez, D., Lathoud, G., Monay, F., Moore, D., Wellner, P., & Bourlard, H. (2003). Modelling Human Interaction in Meetings. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 748–751). Hong Kong, April 2003. McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. Mlakar, I., Kačič, Z., & Rojc, M. (2017). A Corpus for investigating the multimodal nature of multi-speaker spontaneous conversations – EVA Corpus. WSEAS Transactions on Information Science and Applications, 14, 213–226. Müller, C., Bressem, J., & Ladewig, S. H. (2013). Towards a grammar of gestures: A form-based view. In C. Müller, A. Cienki, E. Fricke, S. H. 
Ladewig, D. McNeill, & S. Teßendorf (Eds.), Body-language-communication: An international handbook on multimodality in human interaction (Vol. 1, pp. 707–733). Berlin: De Gruyter Mouton. Navarretta, C., & Paggio, P. (2013). Classifying multimodal turn management in Danish dyadic first encounters. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) (pp. 133–146). Oslo, Norway. O’Halloran, K. L. (2006). Multimodal discourse analysis – Systemic functional perspectives. London: Continuum. O’Halloran, K. L. (2011). Multimodal discourse analysis. In K. Hyland & B. Paltridge (Eds.), Companion to discourse (pp. 120–137). London/New York: Bloomsbury. Oertel, C., Cummins, F., Campbell, N., Edlund, J., & Wagner, P. (2010). D64: a corpus of richly recorded conversational interaction. In: Proceedings of the LREC Workshop on Multimodal Corpora, Mediterranean Conference Centre (pp. 27–30). Malta, 18 May 2010. Paggio, P., & Navarretta, C. (2010). Feedback in Head Gesture and Speech. In: Proceedings of 7th Conference on Language Resources and Evaluation (LREC-2010) (pp. 1–4). Malta, May. Paggio, P., & Navarretta, C. (2017). The Danish NOMCO corpus: Multimodal interaction in first acquaintance conversations. Language Resources and Evaluation, 51(2), 463–494. Paggio, P., Allwood, J., Ahlsén, E., Jokinen, K., & Navarretta, C. (2010). The NOMCO Multimodal Nordic Resource – Goals and Characteristics. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10). Valletta, Malta, May 19–21, European Language Resources Association (ELRA). Rayson, P. 2009. Wmatrix: A web-based corpus processing environment [computer software]. Computing Department, Lancaster University, Lancaster. Available from: http://ucrel.lancs.ac.uk/wmatrix/. Accessed 23 May 2019.



Rehm, M., André, E., Bee, N., Endrass, B., Wissner, M., Nakano, Y., Lipi, A. A., Nishida, T., & Huang, H.-H. (2008). Creating standardized video recordings of multimodal interactions across cultures. In M. Kipp, J.-C. Martin, P. Paggio, & D. Heylen (Eds.), Multimodal corpora from models of natural interaction to systems and applications, LNAI 5509 (pp. 138–159). New York: Springer. Rossini, N. (2012). Reinterpreting Gesture as Language. Amsterdam: IOS Press. Royce, T. D., & Bowcher, W. (2006). New directions in the analysis of multimodal discourse. London: Routledge. Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organisation of turn taking for conversation. Language, 50(4–1), 696–735. Saferstein, B. (2004). Digital technology and methodological adaptation: Text on video as a resource for analytical reflexivity. Journal of Applied Linguistics, 1(2), 197–223. Scollon, R., & Levine, P. (2004). Multimodal discourse analysis as the confluence of discourse and technology. Hillsdale: Lawrence Erlbaum Associates. Scott, M. 2017. WordSmith Tools version 7 [Computer software]. Lexical Analysis Software, Stroud. Available from: http://www.lexically.net/wordsmith/. Accessed 23 May 2019. Sinclair, J. (2005). Corpus and text – Basic principles. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1–16). Oxford: Oxbow Books. Steen, F., & Turner, M. B. (2013). Multimodal construction grammar. In M. Borknet, B. Dancygier, & J. Hinnell (Eds.), Language and the creative mind. Stanford: CSLI Publications. Stubbs, M. (1996). Text and corpus analysis: Computer-assisted studies of language and culture. Oxford: Blackwell. Sweetser, E., & Fauconnier, G. (1996). Cognitive links and domains: Basic aspects of mental space theory. In G. Fauconnier & E. Sweetser (Eds.), Spaces, worlds and grammars (pp. 1– 28). Chicago: University of Chicago Press. Tannen, D. (1982). Ethnic style in male-female conversation. In J. J. 
Gumperz (Ed.), Language and social identity (pp. 217–231). Cambridge: Cambridge University Press. Tennent, P., Crabtree, A., & Greenhalgh, C. (2008). Ethno-goggles: supporting field capture of qualitative material. In: Proceedings of the 4th Annual International e-Social Science Conference [online], University of Manchester, Manchester. van Leeuwen, T. (2005). Introducing social semiotics. London: Routledge. Ventola, E., Charles, C., & Kaltenbacher, M. (Eds.). (2004). Perspectives on multimodality. Amsterdam/Philadelphia: John Benjamins. Wittenburg, P., Broeder, D., & Sloman, B. (2000). Meta-description for language resources. EAGLES/ISLE White paper [online report], available at: http://www.mpi.nl/isle/documents/papers/white_paper_11.pdf. Accessed 23 May 2019. Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: a professional framework for multimodality research. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 1556–1559. ELRA, Genoa, Italy.

Part IV

Exploring Your Data

Chapter 17

Descriptive Statistics and Visualization with R

Magali Paquot and Tove Larsson

Abstract This chapter introduces the main functionalities of R, a free software environment for statistical computing and graphics, and its integrated development environment RStudio. It starts with a quick introduction to the basics of R and RStudio (how to install R and RStudio, and get started). Next, it focuses on data handling (how to prepare, load, check, manage and save data with R). The second part of the chapter deals with descriptive statistics (measures of central tendency, measures of dispersion and coefficients of correlation) and visualization techniques used to explore linguistic data (in the form of bar plots, mosaic plots, histograms, ecdf plots and boxplots).

17.1 Introduction

This chapter serves as an introduction to the other chapters in Parts IV and V. Its main objectives are to provide the necessary basic know-how about R (R Core Team 2019) and to introduce descriptive statistics and visualization techniques with R, with a special focus on terminology. Section 17.2 provides a short introduction to R and RStudio (RStudio Team 2017): software installation and basic commands. R and RStudio are the two software tools that will be used in the rest of the handbook to describe a set of statistical techniques available to corpus linguists. Section 17.3 reviews essential steps in data handling with R, from data preparation to data loading, managing and saving. Section 17.4 summarizes the main descriptive statistics and Sect. 17.5 introduces some of the most common data visualization techniques used to explore linguistic data. The main focus is here placed on visualization techniques that may be used to explore a dataset before any statistical test is applied; visualization of statistical results is described in the relevant following chapters. The chapter comes with a supplementary R code file that exemplifies all functions used and provides some more information about additional useful functions.

Electronic Supplementary Material The online version of this chapter (https://doi.org/10.1007/978-3-030-46216-1_17) contains supplementary material, which is available to authorized users.

M. Paquot: FNRS – Université catholique de Louvain, Centre for English Corpus Linguistics, Louvain-la-Neuve, Belgium. e-mail: [email protected]

T. Larsson: Université catholique de Louvain, Centre for English Corpus Linguistics, Louvain-la-Neuve, Belgium; Uppsala University, Uppsala, Sweden. e-mail: [email protected]

© Springer Nature Switzerland AG 2020. M. Paquot, S. Th. Gries (eds.), A Practical Handbook of Corpus Linguistics, https://doi.org/10.1007/978-3-030-46216-1_17

17.2 An Introduction to R and RStudio

17.2.1 Installing R and RStudio

R is a freely available software environment. While R has a wide variety of applications (see also Chap. 9), we will here focus on its use as a tool for statistical computing and data visualization. In order to install R, follow the steps below:

1. Go to the Comprehensive R Archive Network – CRAN website at https://cran.r-project.org/.
2. In the “Download and Install R” box, click on the link for your operating system (Windows, Mac OS or Linux) and download the Installer package or follow the instructions for your Linux distribution.
3. Follow the steps to run the installer.

Although R can be used as standalone software, we will here be using an open-source integrated development environment for R called RStudio, which provides a user-friendly way of accessing the R console and managing R code, datasets and plots. To download RStudio, follow the steps below:

1. Go to https://www.rstudio.com/.
2. Click on “download” under the RStudio column.
3. Click on the free RStudio Desktop download.
4. Click on the appropriate link for the operating system used on your computer.
5. Follow the steps to run the installer.

In the default setting, there are four panels in RStudio (see Fig. 17.1):

1. The top left panel is a code editor (Panel 1) that can be used for writing code (i.e. a sequence of instructions that a computer can process and understand) and for sending it to the console to be executed. Previously saved code files (or R scripts) can also be opened in this panel.
2. The top right panel (Panel 2) is a workspace that describes the current working environment (i.e. the datasets and variables used) and keeps track of the history; datasets can also be imported from this panel.

17 Descriptive Statistics and Visualization with R


Fig. 17.1 RStudio integrated development environment

3. The bottom left panel is the console, i.e. a text entry and display device where code is run and numerical results are displayed (Panel 3). Code can be run directly in the console or sent to the console from the code editor (see Sect. 17.2.2).
4. The bottom right panel (Panel 4) houses the help pages and displays any plots made. This panel also provides an overview of the packages (or libraries) installed (see Sect. 17.2.2).

17.2.2 Getting Started with R

An elementary understanding of how to write R code is necessary for working in R. In this subsection, an introduction to writing and running code will be presented first, followed by instructions for how to install and load packages. As will become clear, there is typically more than one way of completing a task in RStudio; for most tasks in this subsection, two different options will therefore be presented.

Writing and Running Code

To exemplify how to write code and execute commands (i.e. run code), we will here show how R can be used as a calculator. To ask R to display the answer for 31+82, the first step is to write 31+82 in the code editor (Panel 1) and press [enter]. Comments, for example specifying what the code does for future reference,



Fig. 17.2 The console

Fig. 17.3 Assigning variables

can also be entered here. This is typically done by starting each new line with one or several hash characters #; this way, R will not interpret what follows the hash symbol as code to be run. The next step is to run the code, which can be done in (at least) two different ways. The first option is to mark the relevant code and press [run] in the upper-right corner of Panel 1. The second option is to either mark the relevant code or just have the cursor somewhere in that line and then press [control] (Windows/Linux/Mac OS) or [command] (Mac OS) + [enter] on the keyboard. The code will now appear along with the answer (i.e. 113) in the console (Panel 3), as shown in Fig. 17.2. As mentioned above, the code can also be run directly in the console (the way to do this is to type the code after the prompt (the last angled bracket) and press [enter]); however, it is easier to save and edit code that is written in Panel 1. It is often useful or even necessary to create variables or store results using the assignment arrow <- (e.g. cl.order <- ...).

> mean(cl.order$LEN_SC) # the $ separates the data frame and the variable

One way not to have to type the name of the dataset every time is to attach the data frame to the R search path using the function attach(), in our case: attach(cl.order). When a dataset is attached, you can apply functions on variables (without the name of the data frame they are from) as follows:

> mean(LEN_SC)
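The difference between the two access styles can be sketched on an invented toy data frame (only the column names are borrowed from the chapter's cl.order dataset; the values are made up):

```r
# Toy data frame standing in for cl.order (invented values)
toy <- data.frame(LEN_SC = c(4, 7, 2, 9),
                  SUBORDTYPE = factor(c("caus", "temp", "caus", "temp")))

mean(toy$LEN_SC)   # explicit form: data frame $ variable -> 5.5

attach(toy)        # put the toy columns on R's search path
mean(LEN_SC)       # the bare variable name now works -> 5.5
detach(toy)        # undo attach() so the toy columns do not mask other objects
```

detach() is not discussed in this excerpt; it simply reverses attach() and is good practice once the attached columns are no longer needed.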

17.3.3 Managing and Saving Data

The cl.order data structure is a data frame, i.e. a list of variables represented as vectors and factors of various types. The numeric variables CASE, LEN_MC, LEN_SC and LENGTH_DIFF are numeric vectors (sequences of numbers) and the categorical variables ORDER, SUBORDTYPE, CONJ and MORETHAN2CL are factors. This distinction is important because vectors and factors are not manipulated in the same way in R. Table 17.1 provides a selected list of the most useful functions that take numeric vectors, factors and data frames respectively as arguments. Table 17.2 outlines some of the most useful data manipulation techniques for exploring data frames illustrated with examples from the cl.order dataset. See the accompanying R code for the generated output and Levshina (2015:397–408) for more details.

Tip 6
To know what type of object a variable is, the function is() can be used:

> is(LEN_MC)
[1] "integer" ...





Table 17.1 Data structures in R and how to manipulate them – a selected list of functions

Numeric vectors
  length()                    Check how many elements are included in a vector
  sort()                      Sort a vector in increasing order
  table()                     Tabulate a vector
Factors
  as.numeric(as.character())  Turn a factor that contains numbers into a numeric vector
  length()                    Check how many elements are included in a factor
  levels()                    Return the levels of a factor
  table()                     Tabulate a factor
Data frames
  str()                       Return the structure of the data frame
  write.table()               Export a data frame to a file
All data structures
  head()                      Return the first lines of a vector, matrix, table, data frame or function
  tail()                      Return the last lines of a vector, matrix, table, data frame or function
  rm(a)                       Remove the data structure a
  rm(list=ls(all=TRUE))       Clear memory of all data
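To make the table concrete, the following sketch applies a few of the listed functions to small invented objects (not taken from the chapter's dataset):

```r
# A few of the Table 17.1 functions on toy objects (invented data)
v <- c(12, 5, 5, 30)                    # a numeric vector
length(v)                               # 4
sort(v)                                 # 5 5 12 30
table(v)                                # frequency of each value

f <- factor(c("caus", "temp", "caus"))  # a factor
levels(f)                               # "caus" "temp"
table(f)                                # caus: 2, temp: 1

df <- data.frame(LEN = v, TYPE = factor(c("a", "a", "b", "b")))
str(df)                                 # structure: 4 obs. of 2 variables
head(df, 2)                             # the first two rows
```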

Table 17.2 The most useful data manipulation techniques to handle data frames

  cl.order[c("CONJ", "ORDER")]       Select columns CONJ and ORDER
  cl.order[c(7, 20, 48), ]           Select observations or cases (i.e. rows) 7, 20 and 48
  cl.order[SUBORDTYPE == "caus", ]   Select observations which meet specific criteria, here: the subordinate clause is causal
  cl.order[order(LENGTH_DIFF), ]     Sort the data frame by LENGTH_DIFF in ascending order
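The same idioms can be tried out on an invented miniature data frame (column names borrowed from cl.order, values made up); because this toy frame has not been attach()ed, the condition and sorting columns are addressed explicitly with toy$:

```r
# The four subsetting idioms of Table 17.2 on a toy data frame (invented data)
toy <- data.frame(CONJ        = c("weil", "als", "bevor", "nachdem"),
                  ORDER       = c("mc-sc", "sc-mc", "mc-sc", "mc-sc"),
                  SUBORDTYPE  = c("caus", "temp", "temp", "temp"),
                  LENGTH_DIFF = c(3, -2, 0, 5))

toy[c("CONJ", "ORDER")]          # select two columns
toy[c(1, 3), ]                   # select rows 1 and 3
toy[toy$SUBORDTYPE == "caus", ]  # rows meeting a condition
toy[order(toy$LENGTH_DIFF), ]    # rows sorted by LENGTH_DIFF, ascending
```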

17.4 Descriptive Statistics

The focus of this section will be on the description of numeric variables. See the supplementary R file for how to explore categorical data with the functions summary(), table() and prop.table().

17.4.1 Measures of Central Tendency

The most straightforward measures of central tendency are the mean and the median. Each provides a different type of information about a distribution of values. The mean is simply the sum of all the values for a given variable, divided by the number of values for that variable. The median is defined as the midpoint in a set of ordered values; it is also known as the 50th percentile (or second quartile) because it is the point below which 50% of the cases in the distribution fall. Other percentiles are useful too, such as the 25th percentile (1/4 of all ranked values are below this value; also



known as first quartile) and the 75th percentile (3/4 of all ranked values are below this value; third quartile). A useful R command to get the above-mentioned statistics at once is summary(). It was already used above to summarize the whole data frame but it can also be used to summarize individual variables:

> summary(LEN_SC)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000     ...     ...     ...     ...  36.000
One is, however, often not interested in the mean or median for a given independent variable, but instead in the central tendencies at each level of the independent variable. For example, we may want to determine whether the two types of subordinate clause differ with regard to their mean word length. To do so, the function tapply() can be used as follows:

> tapply(LEN_SC, SUBORDTYPE, mean)
caus temp
 ...  ...
The function tapply() has three arguments. The first is a vector or factor (here, LEN_SC) to which we want to apply a function as found in the third argument (here, mean). The second argument is a vector or factor that specifies the grouping of values from the first vector/factor to which the function is applied.
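A self-contained sketch of the same call on invented data makes the role of the three arguments visible:

```r
# tapply(values, grouping, function) on invented toy data
len  <- c(4, 2, 6, 7, 9, 8)                # first argument: the values
type <- factor(c("caus", "caus", "caus",
                 "temp", "temp", "temp"))  # second argument: the grouping
tapply(len, type, mean)                    # third argument: the function
# caus: (4 + 2 + 6) / 3 = 4    temp: (7 + 9 + 8) / 3 = 8
```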

Tip 7
The function table() can also be used to crosstabulate and characterize two or more variables and their relation:

> table(LEN_SC, SUBORDTYPE)
      SUBORDTYPE
LEN_SC caus temp
     2  ...  ...
   ...  ...  ...
17.4.2 Measures of Dispersion

Measures of central tendency should never be reported without some corresponding measure of dispersion. Without a measure of dispersion, it is not possible to know how good the measure of central tendency is at summarizing the data. An example from Gries (2013:119–120) will serve to illustrate this. Figure 17.6 shows the monthly temperatures of Town 1 and Town 2 with mean annual temperatures of 10.25 and 9.83 respectively. A quick glance at Fig. 17.6 shows that the mean of the temperatures for Town 2 (9.83) summarizes the central tendency of Town 2 much better than the mean of the temperatures for Town 1 does for Town 1: the values of Town 1 vary much more widely around the mean with a minimum value of −12 and a maximum value of 23.

Fig. 17.6 Why measures of central tendency are not enough. (Gries 2013:120)



Thus, Gries (2013:119) recommends that the following measures of dispersion always be provided with measures of central tendency:

• the interquartile range or quantiles for the median for interval/ratio-scaled data that (a) do not approximate the normal distribution (see Tip 10), or (b) exhibit outliers, i.e. values that are numerically distant from most of the other data points in a dataset (see Sect. 17.5.5 for a means to visualize outliers);
• the standard deviation or the variance for normally distributed interval/ratio-scaled data.

The interquartile range is the difference between the third (i.e. 75%) quartile and the first (i.e. 25%) quartile. The more the first quartile and the third quartile differ from one another, the more heterogeneous or dispersed the data are. To compute the interquartile range with R, the function IQR() can be used:

> IQR(LEN_SC)
[1] 7
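The relation between IQR() and quantile() can be checked directly on an invented vector:

```r
# IQR is simply Q3 minus Q1 (toy vector, invented data)
x  <- c(2, 3, 3, 5, 7, 9, 20)
q1 <- unname(quantile(x, 0.25))  # first quartile: 3
q3 <- unname(quantile(x, 0.75))  # third quartile: 8
q3 - q1                          # 5
IQR(x)                           # 5 -- the same value in one call
```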

The standard deviation, sd, of a distribution with n elements is obtained by computing the difference of each data point to the mean, squaring these differences, summing them up, and after dividing the sum by n−1, taking its square root. In R, all that is required is the sd() function:

> sd(LENGTH_DIFF)
[1] 6.851885
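The verbal recipe can be spelled out step by step on an invented vector and checked against sd():

```r
# The sd recipe spelled out on a toy vector (invented data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))  # deviations -> squared ->
                                                   # summed -> / (n-1) -> sqrt
manual_sd                                          # about 2.138
sd(x)                                              # the same value
```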

See Gries (2013:120–125) for other measures of dispersion such as the average deviation and the variation coefficient.

Measures of central tendency and dispersion can be reported in a table (especially if there are many of them) or in text. Using LENGTH_DIFF as an example, measures of central tendency and dispersion are typically reported as follows: “The mean score was −0.1 with a standard deviation of 6.85”. They can also be summarized as (M = −0.1, SD = 6.85), in which M is the mean and SD is the standard deviation. The Publication Manual of the American Psychological Association (American Psychological Association 2010) also illustrates the use of the abbreviated format −0.1 (6.85):

Means (with standard deviations in parentheses) for Trials 1 through 4 were 2.43 (0.50), 2.59 (1.21), 2.68 (0.39), and 2.86 (0.12), respectively. (p. 117)

To report the median and interquartile range, the following format can be used: Mdn = 0, IQR = −4–3, where the IQR is more usefully reported as a range (reporting Q1 and Q3) rather than as a single value.
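The figures needed for these report formats can all be pulled from a vector in one short block; the data below are a hypothetical stand-in for LENGTH_DIFF, not the actual cl.order values:

```r
# Hypothetical stand-in for LENGTH_DIFF
x <- c(-9, -4, -2, 0, 1, 3, 8)

median(x)                   # Mdn, here 0
quantile(x, c(0.25, 0.75))  # Q1 and Q3, for reporting the IQR as a range
round(mean(x), 2)           # M, for the (M = ..., SD = ...) format
round(sd(x), 2)             # SD
```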

17 Descriptive Statistics and Visualization with R


17.4.3 Coefficients of Correlation

Coefficients of correlation are typically used to investigate the relationship between two numeric variables (but there are also coefficients of correlation used to investigate the relationship between a numeric variable and a categorical variable). A correlation between the numeric variables X and Y is positive if the values of both variables increase and decrease together; the correlation is negative if the values of X increase when the values of Y decrease, or vice versa. Correlation coefficients range from −1 (perfect negative correlation) to 1 (perfect positive correlation), with the plus or minus sign indicating the direction of the correlation and the value reflecting its strength. To interpret a coefficient's value, as a very coarse rule of thumb, see which of the values in Fig. 17.7 the correlation is closest to (e.g. Rumsey 2016).

Pearson’s product-moment coefficient r is probably the most frequently used correlation coefficient, but its use is best restricted to interval-scaled (or ratio-scaled) variables that are both approximately normally distributed. It requires the relationship between the variables to be monotonic (i.e. an increase in the values of X is accompanied by an increase in the values of Y, and a decrease in X by a decrease in Y) and linear (i.e. the values of Y increase or decrease at the same rate as the values of X); it is also very sensitive to the presence of outliers. When these conditions are not met, another coefficient such as Kendall’s τ (‘tau’) is preferred. This correlation coefficient is based only on the ranks of the variable values and is therefore more suited to ordinal variables; it is also less sensitive to outliers. To compute a correlation coefficient with R, the cor() function is used as follows:

> cor(LEN_MC, LEN_SC, method="kendall")
[1] 0.1012148

Kendall’s τ is selected here because the assumptions for using Pearson’s r are not met: the relationship between the variables is neither monotonic nor linear, and the two variables are not normally distributed (see Fig. 17.11 for a histogram of LEN_MC). Using the guidelines provided in Fig. 17.7, it can be seen that there is almost no relationship between the length of the main clause and the length of the subordinate clause in the cl.order dataset. The relationship between variables could be investigated more closely: we could try to predict the values of a dependent variable on the basis of the values of one or more independent variables and their interactions. This method is called linear regression and is discussed in Chap. 21.
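The difference between the two coefficients can be seen on a toy monotonic-but-nonlinear relationship; the vectors below are constructed for illustration and are not from the cl.order data:

```r
x <- 1:10
y <- x^2   # strictly increasing in x, but not linear

cor(x, y, method = "pearson")  # high, but below 1: nonlinearity is penalized
cor(x, y, method = "kendall")  # exactly 1: the ranks of x and y agree perfectly
```

Because Kendall's τ looks only at ranks, any strictly increasing relationship yields τ = 1, whereas Pearson's r reaches 1 only when the relationship is perfectly linear.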



Fig. 17.7 Correlation coefficients and their interpretation

17.5 Data Visualization

Visualization is an essential step in data exploration: it is a helpful way of getting an overview of the data (e.g. the type of distribution represented in the dataset), identifying areas that need attention (e.g. the presence of outliers or many zero values), and checking whether the assumptions of the selected statistical test are met (see more particularly Chaps. 20–23). As will become clear from the subsequent chapters, data visualization can also provide a means of presenting results in a reader-friendly manner.

R offers many different ways of visualizing data, and this section will present a selection of commonly used graphs. Section 17.5.1 covers barplots and how they can be customized; Sect. 17.5.2 introduces a version of barplots for two variables, namely mosaic plots. In these two sections, barplots and mosaic plots will be created using the function plot(). This function is a good one to start with, as it chooses what type of graph to use based on the kind(s) of variables to be plotted. The subsequent sections cover graphs that can be used for numeric variables to display dispersion and central tendency, namely histograms (Sect. 17.5.3), ecdf plots (Sect. 17.5.4), and boxplots (Sect. 17.5.5). The code file that accompanies this chapter also includes code for making other common graphs, such as scatterplots.

17.5.1 Barplots

The plot() function will here be used to create a barplot displaying the frequencies of two of the nominal/categorical variables from the cl.order data frame, namely ORDER and CONJ. To graphically present the frequency differences between the two levels of ORDER, namely main clause followed by subordinate clause (mc-sc) and subordinate clause followed by main clause (sc-mc), type plot(ORDER) in the code editor and run the code. The frequencies of ORDER will be plotted as a barplot (Fig. 17.8). The graph will appear in Panel 4, in