Applications of pattern-driven methods in corpus linguistics 9789027200136, 9027200130

737 76 19MB

English Pages 0 [323] Year 2018

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Applications of pattern-driven methods in corpus linguistics
 9789027200136, 9027200130

Citation preview

Applications of Pattern-driven Methods in Corpus Linguistics edited by Joanna Kopaczyk Jukka Tyrkkö

Studies in Corpus Linguistics

82 JOHN BENJAMINS PUBLISHING COMPANY

Applications of Pattern-driven Methods in Corpus Linguistics

Studies in Corpus Linguistics (SCL) issn 1388-0373

SCL focuses on the use of corpora throughout language study, the development of a quantitative approach to linguistics, the design and use of new tools for processing language texts, and the theoretical implications of a data-rich discipline. For an overview of all books published in this series, please see http://benjamins.com/catalog/books/scl

General Editor Ute Römer

Georgia State University

Advisory Board Laurence Anthony

Susan Hunston

Antti Arppe

Michaela Mahlberg

Michael Barlow

Anna Mauranen

Monika Bednarek

Andrea Sand

Tony Berber Sardinha

Benedikt Szmrecsanyi

Douglas Biber

Elena Tognini-Bonelli

Marina Bondi

Yukio Tono

Jonathan Culpeper

Martin Warren

Sylviane Granger

Stefanie Wulff

Waseda University

University of Alberta University of Auckland University of Sydney Catholic University of São Paulo Northern Arizona University University of Modena and Reggio Emilia Lancaster University University of Louvain

University of Birmingham University of Birmingham University of Helsinki University of Trier Catholic University of Leuven The Tuscan Word Centre/The University of Siena Tokyo University of Foreign Studies The Hong Kong Polytechnic University University of Florida

Stefan Th. Gries

University of California, Santa Barbara

Volume 82 Applications of Pattern-driven Methods in Corpus Linguistics Edited by Joanna Kopaczyk and Jukka Tyrkkö

Applications of Pattern-driven Methods in Corpus Linguistics Edited by

Joanna Kopaczyk University of Glasgow

Jukka Tyrkkö Linnaeus University Växjö

John Benjamins Publishing Company Amsterdam / Philadelphia

8

TM

The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ansi z39.48-1984.

Cover design: Françoise Berserik Cover illustration from original painting Random Order by Lorenzo Pezzatini, Florence, 1996.

doi 10.1075/scl.82 Cataloging-in-Publication Data available from Library of Congress: lccn 2017045531 (print) / 2017050052 (e-book) isbn 978 90 272 0013 6 (Hb) isbn 978 90 272 6456 5 (e-book)

© 2018 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Company · https://benjamins.com

Table of contents Acknowledgements chapter 1 Present applications and future directions in pattern-driven approaches to corpus linguistics Jukka Tyrkkö & Joanna Kopaczyk

vii

1

Part I.  Methodological explorations chapter 2 From lexical bundles to surprisal and language models: Measuring the idiom principle in native and learner language Gerold Schneider & Gintarė Grigonytė chapter 3 Fine-tuning lexical bundles: A methodological reflection in the context of describing drug-drug interactions Łukasz Grabowski chapter 4 Lexical obsolescence and loss in English: 1700–2000 Ondřej Tichý

15

57

81

Part II.  Patterns in utilitarian texts chapter 5 Constance and variability: Using PoS-grams to find phraseologies in the language of newspapers Antonio Pinna & David Brett chapter 6 Between corpus-based and corpus-driven approaches to textual recurrence: Exploring semantic sequences in judicial discourse Stanisław Goźdź-Roszkowski

107

131

 Applications of Pattern-driven Methods in Corpus Linguistics

chapter 7 Lexical bundles in Early Modern and ­Present-day English Acts of Parliament Anu Lehto

159

Part III.  Patterns in online texts chapter 8 Lexical bundles in Wikipedia articles and related texts: Exploring disciplinary variation Turo Hiltunen chapter 9 Join us for this: Lexical bundles and repetition in email marketing texts Joe McVeigh chapter 10 I don’t want to and don’t get me wrong: Lexical bundles as a window to ­subjectivity and intersubjectivity in American blogs Federica Barbieri chapter 11 Blogging around the world: Universal and localised patterns in Online Englishes Joanna Kopaczyk & Jukka Tyrkkö Index

189

213

251

277 311

Acknowledgements The idea for a volume introducing the concept of a pattern-driven approach in corpus linguistics grew out of our fascination with frequency-based, non a-priori methods of querying large corpora, which in the last two decades have enabled linguistic assessments and discoveries that would not have been possible otherwise. Such methods are typically described as corpus-driven, and it was under that banner that we organised a special session on lexical bundles at the European Society for the Study of English conference in Košice in 2014. The response from the audience and the participants was very encouraging and there was a clear need for a volume dedicated to showcasing new developments in corpus-driven methods and applications. The conference was a springboard for what are now much more refined studies in this book, and additional chapters were contributed by scholars who have also been inspired by the field of corpus-driven linguistics. All chapters have benefitted from cross-reading by other contributors to the volume as well as from expert comments and suggestions by our select panel of external reviewers: Marc Alexander, Viviana Cortes, Philip Durkin, Bethany Gray, Andreas H. Jucker, Maria Kuteeva, Magdalena Leitner, Caroline Tagg, Richard J. Whitt and Christopher Williams. We would like to thank everybody for raising the bar in a constructive, friendly and timely fashion and especially our authors for patiently going through revisions. During the course of the editorial process we realized that the contributions have something in common besides using corpus-driven methods in novel ways and with new materials. They all engage with the methodology by departing from a strict definition of corpus-driven, theory-free research; instead, while relying on frequency-driven data-mining, they introduce categorisations, models and thresholds that help the researcher identify emerging patterns of language use that shed new light on the linguistic question of interest to them. This is, essentially, a patterndriven approach to the study of language. We would like to thank Ute Römer, the Series Editor, who pointed us towards this realisation and accepted the title for publication, as well as Kees Vaes and his team for seeing the project through production. The authors would also like to acknowledge the support of the institutions where they were based during the course of this book’s editorial process: Joanna Kopaczyk is indebted to Adam Mickiewicz University in Poznań and the University of Edinburgh, and Jukka Tyrkkö would like to thank the School of Language, Translation and Literary Studies and the Institute for Advanced Social Research (IASR), both at the University of Tampere.

chapter 1

Present applications and future directions in pattern-driven approaches to corpus linguistics Jukka Tyrkkö & Joanna Kopaczyk

Linnaeus University and University of Glasgow

1.  Corpus linguistics today Following several decades as a pioneering new methodology that required access to expensive equipment and specialised skills rarely taught to linguists, corpus linguistics emerged into the mainstream in the 1990s and soon became one of the predominant approaches to the study of language. Over the past two decades, corpus methods have become commonplace in most fields of linguistics and practically invaluable in some, such as variationist studies. By affording researchers accurate and exhaustive access to large sets of language data, corpus methods have, among other things, made it possible to identify statistically significant and meaningfully large synchronic and diachronic differences, sharpened our understanding of the importance of metadata and provided new evidence of the systematic and often predictable nature of linguistic processes. In the beginning, corpus linguistic inquiries focussed primarily on predefined linguistic entities, using corpora for evidence about their frequencies and distribution, and for finding illustrative examples suitable for closer examination. Given the singular importance of the primary data, corpus linguists have traditionally emphasized that corpora should only include evidence drawn from authentic texts or speech situations, that the methods of sampling should ensure that the corpus includes only texts that are truly representative of the target population, and that any findings should be verifiable and the methods of analyses replicable. However, as more and more primary data has become available through digitized and borndigital textual resources, the sizes of corpora have grown almost exponentially. Today, numerous textual repositories, websites and social media ­networks offer

doi 10.1075/scl.82.01tyr © 2018 John Benjamins Publishing Company



Jukka Tyrkkö & Joanna Kopaczyk

linguistic evidence in volumes that would have been unthinkable only a few years ago. While this offers exciting new opportunities, the expansion of data also comes at a price and linguists increasingly find themselves face-to-face with the fact that data needs to be turned into information before it leads to new knowledge. Unlike the small corpora compiled carefully by philologically oriented teams of researchers, the so-called mega-corpora and corpus-like repositories are  – for all their undeniable worth – often only minimally curated, which can lead to systematic errors in the analyses, while the metadata is too scarce to allow in-depth inquries into reasons behind the phenomena. Indeed, perhaps ironically, very large corpora tend to yield too much raw data for manual verification or processing. Although linguistic datasets only rarely amount to what is now often called Big Data, the analysis of high frequency phenomena in large corpora requires, or is at least made much more manageable, by the use of a wide variety of computational and statistical methods; these also increasingly serve as common ground for corpus linguistics, computational linguistics and information science. Importantly, these large samples of language also lend themselves particularly well to pattern-based explorations, which have the potential of revealing previously unobserved trends and tendencies. 2.  Pattern-driven research into language This collection of articles focuses on methodological developments and their applications in what we will call pattern-driven linguistic research. We situate this methodological approach conceptually between corpus-based and corpusdriven approaches, arguing that the fundamentally data-focused nature of pattern retrieval and analysis goes beyond the traditional corpus-based research while, at the same time, the term corpus-driven ought to be reserved for approaches that are truly theory neutral – which pattern-based analysis often is not. The crucial, and occasionally hotly contested, difference between corpusbased and corpus-driven methods is well-known within the research community. In the former the starting point is usually a small number of pre-identified linguistic features based on theoretical assumptions and earlier findings, which are examined using evidence drawn from corpora, while the latter make few, if any, such assumptions and instead employ a bottom-up approach that allows us to capture patterns in language from a neutral and unbiased perspective. While the corpus-based approach is knowledge-based in the sense that it builds on and seeks to expand and develop our pre-existing understanding of language use, the defining motivation behind the corpus-driven approach is that such reliance on ­pre-existing theories and pre-conceived classifications, for instance word classes



Chapter 1.  Pattern-driven approaches to corpus linguistics

or syntactical units, may miss important features or even reinforce fundamentally false dichotomies. The two schools of thought differ even more when it comes to the meaning and interpretations of the observed phenomena: where one makes claims primarily about language as observed in the texts of the corpora (performance), and by extension about the communities that produced them, the other suggests that corpus evidence can be used as a window into the inner workings of how the human mind processes language (competence). A seminal work in the field of corpus-driven analysis is Tognini-Bonelli (2001), which layed the foundations for much of the current understanding of the theoretical landscape of corpusbased and corpus-driven approaches in linguistics. Emphasizing “the integrity of the data as a whole” (2001: 84) and the need to approach corpus data in a comprehensive manner, Tognini-Bonelli explains that linguistic categories emerge from recurrent patterns of language use embedded in context. Although in its purest form a corpus-driven study would take as a premise that language data should be approached from an entirely theory-free perspective in the Firthian tradition, most self-described corpus-driven studies do, in fact, accept at least some a priori assumptions, such as the concepts of a word and word boundaries, while many go further and subscribe to more theoretical concepts such as lemmas and even word classes (for discussion, see Teubert 2005; McEnery et al. 2006; Biber 2009; Gries 2010; Meyer 2014). Furthermore, many, if not most, studies self-described as corpus-driven fall short of the ideal of an exhaustive analysis of the primary data by adopting frequency and range thresholds to keep the amount of variation manageable (­McEnery et  al. 2006: 8). This is particularly pertinent to the present-day situation when computational power is constantly increasing and we find ourselves working with corpora that are hundreds, sometimes thousands of times larger in volume than those used only a decade or two ago. Although corpus-based methods have been and continue to be the backbone of corpus linguistics, corpus-driven approaches have an appreciably long history of their own. One of the earliest examples of the corpus-driven research paradigm is pattern grammar, developed by Francis, Hunston and Manning in Collins COBUILD’s Grammar Patterns 1: Verbs (1996) and Grammar Patterns 2: Nouns and Adjectives (1998); see also Hunston and Francis (2000). By querying large corpora for the phraseological and grammatical patterns in which lexical items occur, pattern grammar introduced the idea that corpora could be used to gain entirely new perspectives on the relationship between words and grammatical patterns. Around the same time, Tognini-Bonelli’s monograph was a programmatic call for a new subdiscipline in corpus linguistics which could validate datadriven approaches to language. Arguing that linguistic categories should arise from the data, she promoted “the central concept of a functionally complete unit of meaning” (2001: 179, emphasis original). Yet another noteworthy example of





Jukka Tyrkkö & Joanna Kopaczyk

c­ orpus-driven analysis is Linear Unit Grammar, introduced in Sinclair and Mauranen (2006), which approaches spoken language from a discourse-functional perspective and dismisses conventional a  ­priori assumptions about parts of speech or syntax-level units. However, even in the most influential approaches born out of the corpusdriven framework, there are often formal assumptions about the existence of emergent patterns or types of patterns that corpus-driven data analysis will help to reveal. We therefore suggest that in order to alleviate the caveats that so often qualify a corpus-driven approach, it would be more useful and more accurate to talk about pattern-driven approaches, defined either as subset within the larger theoretical framework proposed by Tognini-Bonelli (2001) or as an intermediate step between corpus-based and corpus-driven methods. After all, on closer inspection the common ground of most corpus-driven methodologies relies on the identification of patterns, sequences, and lexico-grammatical structural units or, in short, on repetition. As Stubbs (2002: 221) noted, “the first task of corpus linguistics is to describe what is usual and typical”, and when it comes to identifying typical phenomena beyond the frequencies of predefined items, some variation of pattern extraction will almost always be necessary. Indeed, scholars interested in registerspecific, formulaic, fixed and otherwise strongly associated features of language use often adopt essentially pattern-driven approaches and ask research questions that are best answered with data generated using corpus-driven methods: from lexical bundles, alternatively defined in literature as word clusters (Scott 1997), n-grams (Fletcher 2003), recurrent word chains (Stubbs & Barth 2003), contiguous formulaic strings (Conklin & Schmitt 2008) or lexical clusters (Taavitsainen & Pahta (eds) 2010), through to concgrams (Cheng 2007; Greaves & Warren 2007), skipgrams (Wilks 2005), phrase frames (Fletcher 2006) and POS-grams. The most prominent corpus-driven method is the n-gram methodology, which reveals patterns of recurrent lexical items or other linguistic units in text. Widely used in a variety of fields ranging from computer science to computational linguistics, n-gram analysis has been around in corpus linguistics for about 15 years; some of the most influential first studies include Altenberg (1998) and the monumental Longman Grammar of Spoken and Written English by Biber et  al. (1999). Various ways of sequence retrieval have been proposed (based on frequency, strength of association measures such as MI-scores, psycholinguistic salience, entropy measures, or a mixture of such criteria). However, despite the growing wealth of literature on the topic, there are still many vexing questions to do with the comparability of research results across various corpus-driven studies. As Gries and Mukherjee put it, the procedure is “recent enough for the field not to have yet accepted standards on how to generate, explore, quantify, and study n-grams” (2010: 521).



Chapter 1.  Pattern-driven approaches to corpus linguistics

Applications of corpus-driven methods to answer pattern-driven questions in present-day data have been most successful so far in the following areas: –– corpus grammars of contemporary English (Biber et  al. 1999; Hunston & Francis 2000; Carter & McCarthy 2006) –– identifying typical features of academic discourse, comparing students’ writing to professional writing (Cortes 2004), identifying multi-word sequences for pedagogic purposes, machine translation, language processing inquiries, etc. (Mahlberg et al. 2009), native speakers vs learners (Ädel & Erman 2012) –– comparative genre analysis (Stubbs & Barth 2003), comparative analysis of disciplines in English for Academic Purposes (Hyland 2008) –– specialised discourse features (structural and functional): political discourse (Partington & Morley 2004), legal discourse (Goźdź-Roszkowski 2011), scientific discourse (Salazar 2014), pharmaceutical discourse (Grabowski ­ 2015), tourism (Fuster-Márquez 2014) –– psycholinguistic research: holistic storing and processing of lexical bundles (Tremblay et al. 2011) Historical texts have also been subjected to pattern-driven queries focussing on the following research questions: –– orality features in Early Modern drama and Early Modern trials (Culpeper & Kytö 2002, 2010) –– linguistic stability in 19th–20th c. letters, science, history and trials (Kytö, Rydén & Smitterberg 2006) –– textual standardization in early legal discourse (Kopaczyk 2013) –– lexical variation in Shakespeare’s plays (Culpeper 2011) –– genre characteristics in a variety of 20th-c. texts in the Swiss German Corpus (Bürki 2010); readability in Luther’s 1545 translation of the Bible as opposed to Hoffnung für Alle (2002) (Shrefler 2011) This broad  – and growing  – range of applications shows that pattern-driven methodologies provide a timely addition to the linguistic toolbox, especially in the context of pattern recognition, retrieval and analysis. Since they are quickly becoming a part of the so-called ordinary working linguist’s toolkit, more focus is needed on the strengths and possible shortcomings of these methods. We believe that it is particularly important to make these methods and their application to a variety of linguistic problems accessible to linguists who are more familiar with corpus-based methods. Consequently, the chapters in this volume showcase p ­ attern-driven approaches and their applicability to syntactic, phraseological, pragmatic, and genre-related studies within the broad framework of corpus





Jukka Tyrkkö & Joanna Kopaczyk

l­inguistics, using both large and small corpora and addressing a wide variety of research questions. 3.  Book overview Taking pattern recognition and interpretation as its unifying theme, the volume starts with cutting-edge methodological explorations in corpus-driven linguistics, and then focusses on applications of corpus-driven methods into the patterndriven analyses of utilitarian texts and online texts. The first of these two broadly conceived textual domains has been frequently subjected to corpus-driven methods in search of formulaic patterns in, for instance, legal or medical discourse. The second domain is emerging as a prolific area for the study of formulaicity, situated at the crossroads between spontaneous, perhaps orality-related written texts, and more carefully planned written genres rendered online. Our hope is that the readers can draw inspiration from these new methodological proposals, expand their understanding of patterns in utilitarian texts, as revealed by pattern-driven approaches, and examine the application of relevant corpus-driven methods to the study of online language use and the newest communicative media. The first chapter in Part I, by Gerold Schneider and Gintare Grigonytė, develops a novel way of analyzing the formulaicity of recurrent sequences in texts. It has been argued in the past that native speakers perform better when it comes to the balance between formulaicity (Sinclair’s idiom principle) and expressiveness (his syntax principle) than language learners. Schneider and Grigonytė take lexical bundles, the well-known type of sequences identified by corpus-driven methods, as a departure point in their quest to establish a reliable method of measuring formulaicity. The authors try out various measures such as numeric frequency, Observed/Expected measure and T-score collocation, to establish the ‘bundleness’ in essays written by American students and contrast it with essays written by Japanese students with an advanced level of English. Recognizing the need to introduce a psycholonguistic aspect to measuring formulaicity, the authors introduce surprisal, an information-theoretic measure of formulaicity based on reader expectations and text entropy. It turns out that advanced learners’ writing shows lower surprisal than the essays of native speakers, but, interestingly, it contains more collocations. To investigate further syntactic implications, Schneider and Grigonytė apply surprisal to POS-tagging and compare how the parser performs on the corrected and uncorrected versions of non-native students’ essays. The innovative and exploratory character of the chapter sets forth new directions in investigating and comparing formulaicity across corpora. The next chapter in the methodological section, authored by Łukasz Grabowski, refines the lexical bundle methodology by addressing the question



Chapter 1.  Pattern-driven approaches to corpus linguistics

of overlapping bundles, which often cause problems for linguists working with n-grams. Grabowski looks at pharmaceutical texts written in English where recurrent phraseology and repetitive chunks become important in the process of comprehension, essential to both native and non-native medical professionals. After rehearsing various approaches to identifying and interpreting lexical bundles in texts, the author focusses on transitional probability and ‘coverage’, new methods of collapsing overlapping bundles into single sequence types. This approach should allow researchers to limit the number of bundle types for a more efficient functional analysis, similarly to various types of bundle sampling, which Grabowski also discusses in his chapter. In Chapter 4, Ondřej Tichý proposes a new, corpus-driven method of tracing lexical obsolescence. The procedure consists in identifying the best candidates for obsolescence through trawling n-gram data, in this case uni- and trigrams, drawn from the largest available set of lexical data for English between 1700 and 2000, the Google Books corpus. Tichý reviews approaches and definitions related to the loss of vocabulary as well as problems inherent in using the Google material, before presenting descriptive n-gram statistics per decade and narrowing the discussion down to lower frequency bands, where good candidates for obsolescence can be found. After careful consideration of the material left in these residual frequency bands and discussing technical issues stemming from corpus characteristics, the author is able to pinpoint words and multi-word units which were subject to lexical obsolescence over time. The chapter identifies new directions for the development of lexicographic resources and diachronic study of vocabulary in general. Part  II concentrates on patterns in utilitarian texts, as revealed by corpusdriven methods. In Chapter  5, Antonio Pinna and David Brett show the extent of repetition on the level of grammatical structure in newspaper discourse. Using a 1-million-word representative corpus of newspaper genres, based on the ­Guardian, the authors look for repetitive grammatical frames, part-of-speech grams (POS-grams) and look at their semantic correlates. Pinna and Brett draw POS-grams from the newspaper corpus and establish comparisons with a reference corpus (the BNC). They arrive at striking genre inventories of recurrent grammatical frames for the subgenres of Travel, Crime and Obituaries and reveal shared preferences for structural patterns among them. The authors discuss the variable content of most prominent POS-grams and draw attention to semantic groupings characteristic of each subgenre. The next chapter, by Stanisław Goźdź-Roszkowski, concentrates on patterns emerging in legal discourse, with a focus on judicial opinions. This legal genre relies on evaluation and proposition, whose specific linguistic ingredients may not be comprehensively assessed through corpus-based queries alone. Rather, Goźdź-Roszkowski argues, pattern-driven methodology helps to identify recurrent semantic sequences which then illuminate the most crucial i­ ngredients





Jukka Tyrkkö & Joanna Kopaczyk

of discourse. The author zooms in on semantic sequences to do with stance. Various patterns related to the status of a judicial proposition can be triggered by specific lexico-grammatical sequences, e.g. assumption that [N+that] … is incorect. The study gives a detailed functional account of patterns deriving from status-­ indicating nouns, with broader implications for analysing discourse-specific formulaicity. In Chapter 7, Anu Lehto continues the theme of legal discourse and offers a diachronic comparison of lexical bundles in English acts of parliament. Starting with a general numeric overview of n-grams of various lengths in the Corpus of Early Modern English Statutes and in UK parliamentary acts from 2015, the study focusses on 3-gram patterns. Lehto compares and contrasts the most frequent repetitive chunks in both collections and draws attention to similar levels of recurrent 3-gram types, regardless of the period. Interesting grammatical and functional preferences are revealed, for instance in the format of dependent clauses or in the choice of passive voice. These pattern-driven findings enhance and complement previous grammaticalisation and syntactic change assessments. There are also striking diachronic differences in the functions of recurring 3-grams, especially in terms of referential and textual bundles, which reveals new insights into the history of legislative writing. Part III comprises chapters drawing on a range of online genres. Chapter 8, by Turo Hiltunen, makes informative and important comparisons between ­Wikipedia articles, student essays and research articles across three disciplines representing different styles of academic writing: medicine, economics and literary criticism. Wikipedia often serves as a source for academic work, at least preliminary, and features prominently in student learning; on the other hand, Wikipedia entries are increasingly written by specialist on the respective topics. Setting the discussion against the theoretical concept of expert performances, Hiltunen analyses overlaps between inventories of lexical bundles (3- and 4-grams) first within Wikipedia, and then across the three genres. It is illuminating to see the extent of pattern stability across the subject areas in the online encyclopedia but also to pinpoint the ingredients that are absent from Wikipedia but recurrent, and therefore expected, in research writing at both levels of proficiency. Another pervasive online genre, email, is explored by Joe McVeigh. In order to reveal specific strategies used in email marketing directed at lawyers, McVeigh compares patterns emerging in marketing emails aimed at the legal profession against those extracted from legal blogs and legal case decisions, two genres also encountered by lawyers in their daily practice. The author offers a quantitative assessment of repetitiveness across the genres on the basis of lexical bundle methodology, paying attention to such problematic issues as cut-off points and bundle overlap. In qualitative terms, it turns out that the language of marketing is more



Chapter 1.  Pattern-driven approaches to corpus linguistics

formulaic and comes across as relying on a set of templates, regardless of the fact that marketing language is often believed to be associated with linguistic innovation and creativity. In Chapter 10, Federica Barbieri looks at subjectivity and intersubjectivity in American blogs. She starts by discussing blogs as an online genre associated with specific communities of practice, discourse communities, or themes, and identifies self-presentation as its pivotal feature. This brings Barbieri to the analysis of recurrent patterns, established by means of the lexical bundle methodology, with special focus on stance-related repetitive strings. Another group of functionally important patterns are those related to narration. The author also analyses person reference as a window onto (inter)subjectivity as well as the general structural characteristics of lexical bundles in blogs. The discussion reveals the complexity of blogs as a genre which is a product of its communicative environment and purpose. Continuing the discussion of patterns in blogs, Joanna Kopaczyk and Jukka Tyrkkö close the volume with a study of World Englishes based on the corpus of Global Web-based English (Davies 2014). Extracting and analysing n-grams from blog data representing nineteen different regions of the English-speaking world, they provide an exploratory pattern-driven perspective on World Englishes in a web-based genre. An important bottom-up discovery is the areal grouping of blogs on the basis of high-frequency 3-gram patterns. These groupings, which suggest that web-based Englishes exhibit regional variation instead of showing a levelling effect that could be attributed to ‘Internet English’, can be taken as new evidence in the debate surrounding various conceptualizations of World ­Englishes. The ­chapter also showcases various tools for exploring shared and unshared patterns in texts and degrees of similarity between different bodies of data: from coefficients, through hierarchical clustering to network graphs. 4.  Pattern-driven linguistics: Future directions The chapters in this volume showcase the flexibility and wide applicability of pattern-­driven methods to the study of an exciting range of linguistic questions. As demonstrated above, the contributions go beyond the core concerns raised by the repetitiveness of lexical patterns in texts. The main theoretical objective of the volume is to suggest that the bottom-up approach is not only a fancy device made possible by increasing computational power, but that it has the potential to change our perception and understanding of how language is constructed and how it functions in specific contexts. However, while pattern-driven methods are a crucially important tool that many more linguists would benefit from experimenting with, we wish to argue that in many cases the best insights may be gained by ­combining



 Jukka Tyrkkö & Joanna Kopaczyk

such methods with previously established theoretical understanding. To that end, all the chapters explain in detail how the theoretical model is constructed and how it fits in with the proposed pattern-oriented research questions, and then apply the methods in practice to authentic and previously unexplored corpus data. Each chapter thus provides a demonstration of a selected pattern-driven method and also provides new and exciting findings in its own field. The emphasis on pattern-driven approaches allows us to find a counter-­ balance to the more popular corpus-based linguistic inquiries without spuriously implying a theory-neutral starting point of corpus-driven analyses, or qualifying them with a range of caveats. While only a few of the chapters could be described as rigidly corpus-driven in the purest sense, they all apply methods that rely on querying corpora for patterns that were essentially not predetermined. Moreover, the retrieval of these patterns is not the endpoint of the study, but rather a new starting point for identifying broader underlying tendencies, which in turn can lead to new perspectives on the respective research questions. Essentially, based on the underlying corpus-driven framework, we arrive at a pattern-driven approach to language use in context. The benefits of the methods discussed in this volume are extensive, ranging from exploratory to exhaustive analysis, and from applications that address system-wide questions to those focused on register-specific phenomena. It is our hope that the volume inspires our fellow corpus linguists to examine linguistic phenomena from the pattern-driven perspective.

References Ädel, Annelie & Erman, Britt. 2012. Recurrent word combinations in academic writing by native and non-native speakers of English: A lexical bundles approach. English for Specific Purposes 31: 81–92.  doi: 10.1016/j.esp.2011.08.004 Altenberg, Bengt. 1998. On the phraseology of spoken English: The evidence of recurrent wordcombinations. In Phraseology. Theory, Analysis and Applications, Anthony P. Cowie (ed.), 101–122. Oxford: Clarendon Press. Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward. 1999. Longman Grammar of Spoken and Written English. London: Longman. Biber, Douglas. 2009. Corpus-based and corpus-driven analyses of language variation and use. In The Oxford Handbook of Linguistic Analysis, Bernd Heine & Heiko Narrog (eds). Oxford: OUP.  doi: 10.1093/oxfordhb/9780199544004.013.0008 Buerki, A. 2010. All sorts of change: A preliminary typology of change in multi-word sequences in the Swiss Text Corpus. Presented at FLaRN 2010, Paderborn, Germany. Carter, Ronald & McCarthy, Michael. 2006. Cambridge Grammar of English. A Comprehensive Guide. Spoken and Written English Grammar and Usage. Cambridge: CUP. Cheng, Winnie. 2007. Concgramming: A corpus-driven approach to learning the phraseology of discipline-specific texts. CORELL: Computer Resources for Language Learning 1: 22–35.



Chapter 1.  Pattern-driven approaches to corpus linguistics

Conklin, Kathy & Schmitt, Norbert. 2008. Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers? Applied Linguistics 29(1): 72–89.  doi: 10.1093/applin/amm022 Cortes, Viviana. 2004. Lexical bundles in published and student writing in history and biology. English for Specific Purposes 23(4): 397–423.  doi: 10.1016/j.esp.2003.12.001 Culpeper, Jonathan & Kytö, Merja. 2002. Lexical bundles in Early Modern English: A window into the speech-related language of the past. In Sounds, Words, Texts, Change. Selected Papers from the Eleventh International Conference on English Historical Linguistics (11 ICEHL) [Current Issues in Linguistic Theory 224], Teresa Fanego, Belén Méndez-Naya & Elena Seoane (eds), 45–63. Amsterdam: John Benjamins. Culpeper, Jonathan & Kytö, Merja. 2010. Early Modern English Dialogues: Spoken Interaction as Writing. Cambridge: CUP. Culpeper, Jonathan. 2011. A new kind of dictionary for Shakespeare’s plays: An immodest proposal. In Stylistics and Shakespeare’s Language: Transdisciplinary Approaches, Mireille ­Ravassat & Jonathan Culpeper (eds), 58–83. London: Continuum. Davies, Mark. 2014. Corpus of Global Web-Based English: 1.9 billion words from speakers in 20 countries. Fletcher, William H. 2003/2004. Phrases in English. (1 May 2017). Fletcher, William H. 2006. Concordancing the web: Promise and problems, tools and techniques. In Corpus Linguistics and the Web [Language and Computers 59], Marianne Hundt, Nadja Nesselhauf & Carolin Biewer (eds), 25–45. Leiden: Brill. Francis, Gill, Hunston, Susan & Manning, Elizabeth. 1996. Collins COBUILD Grammar Patterns, 1: Verbs. London: HarperCollins. Francis, Gill, Hunston, Susan & Manning, Elizabeth. 1998. Collins COBUILD Grammar Patterns, 1: Nouns and Adjectives. London HarperCollins. Fuster-Márquez, Miguel. 2014. Lexical bundles and phrase frames in the language of hotel websites. English Text Construction 7(1): 84–121.  doi: 10.1075/etc.7.1.04fus Goźdź-Roszkowski, Stanisław. 2011. Patterns of Linguistic Variation in American Legal English. Frankfurt: Peter Lang. Grabowski, Łukasz. 2015. Keywords and lexical bundles within English pharmaceutical discourse: A corpus-driven description. English for Specific Purposes 38: 23–33.

doi: 10.1016/j.esp.2014.10.004

Greaves, Christopher & Warren, Martin. 2007. Concgramming: A computer-driven approach to learning the phraseology of English. ReCALL Journal 17(3): 287–306. doi: 10.1017/S0958344007000432 Gries, Stefan T. & Mukherjee, Joybrato. 2010. Lexical gravity across varieties of English: An ICEbased study of N-Grams in Asian Englishes. International Journal of Corpus Linguistics 15(4): 520–548.  doi: 10.1075/ijcl.15.4.04gri Hunston, Susan & Francis, Gill. 2000. Pattern Grammar: A Corpus-driven Approach to the L ­ exical Grammar of English [Studies in Corpus Linguistics 4]. Amsterdam: John ­Benjamins.

doi: 10.1075/scl.4

Hyland, Ken. 2008. As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes 27: 4–21.  doi: 10.1016/j.esp.2007.06.001 Kopaczyk, Joanna. 2013. The Legal Language of Scottish Burghs. Standardisation and Lexical Bundles, 1380–1560. Oxford: OUP. Kytö, Merja, Rydén, Mats & Smitterberg, Erik. 2006. Nineteenth-century English: Stability and Change. Cambridge: CUP.  doi: 10.1017/CBO9780511486944



 Jukka Tyrkkö & Joanna Kopaczyk Mahlberg, Michaela, González-Díaz, Victorina & Smith, Catherine (eds). 2009. Proceedings of the Corpus Linguistics Conference, Liverpool 20–23 July 2009. (1 May 2017). McEnery, Tony, Xiao, Richard & Tono, Yukio. 2006. Corpus-based Language Studies: An Advanced Resource Book. London: Taylor and Francis. Meyer, Charles F. 2014. Corpus-based and corpus-driven approaches to linguistic analysis: one and the same? In Developments in English: Expanding Electronic Evidence, Irma Taavitsainen, Merja Kytö, Claudia Claridge & Jeremy Smith (eds), 14–28. Cambridge: CUP.

doi: 10.1017/CBO9781139833882.004

Partington, Alan & Morley, John. 2004. From frequency to ideology: Investigating word and cluster/bundle frequency in political debate. In Practical Applications in Language and Computers. PALC 2003, Barbara Lewandowska-Tomaszczyk (ed.), 179–192. Frankfurt: Peter Lang. Salazar, Danica. 2014. Lexical Bundles in Native and Non-native Scientific Writing [Studies in Corpus Linguistics 65]. Amsterdam: John Benjamins.  doi: 10.1075/scl.65 Scott, Michael. 1997. PC analysis of key words – and key key words. System 25(1): 1–13. Shrefler, Nathan. 2011. Lexical bundles and German bibles. Literary and Linguistic Computing 26(1): 89–106.  doi: 10.1093/llc/fqq014 Sinclair, John McH. & Mauranen, Anna. 2006. Linear Unit Grammar: Integrating Speech and Writing [Studies in Corpus Linguistics 25]. Amsterdam: John Benjamins.  doi: 10.1075/scl.25 Stubbs, Michael. 2002. Two quantitative methods of studying phraseology in English. International Journal of Corpus Linguistics 7(2): 215–44.  doi: 10.1075/ijcl.7.2.04stu Stubbs, Michael & Barth, Isabel. 2003. Using recurrent phrases and text-type discriminators: A quantitative method and some findings. Functions of Language 10(1): 61–104.

doi: 10.1075/fol.10.1.04stu

Taavitsainen, Irma & Pahta, Päivi (eds). 2010. Early Modern English Medical Texts: Corpus Description and Studies. Amsterdam: John Benjamins.  doi: 10.1075/z.160 Teubert, Wolfgang. 2005. My version of corpus linguistics. International Journal of Corpus ­Linguistics 10(1): 1–13.  doi: 10.1075/ijcl.10.1.01teu Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work [Studies in Corpus Linguistics 6]. Amsterdam: John Benjamins.  doi: 10.1075/scl.6 Tremblay, Antoine, Derwing, Bruce, Libben, Gary & Westbury, Chris. 2011. Processing advantages of lexical bundles: Evidence from self-paced reading and sentence recall tasks. ­Language Learning 61(2): 569–613.  doi: 10.1111/j.1467-9922.2010.00622.x Wilks, Yorick. 2005. REVEAL: The notion of anomalous texts in a very large corpus. Tuscan Word Centre International Workshop. Certosa di Pontignano, Tuscany, Italy, 31 June–3 July.

part i

Methodological explorations

chapter 2

From lexical bundles to surprisal and language models Measuring the idiom principle in native and learner language Gerold Schneider & Gintarė Grigonytė

University of Zurich and University of Konstanz / University of Stockholm We exploit the information theoretic measure of surprisal to analyze the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles, then we argue that abstracting to surprisal as an informationtheoretic measure of lexical bundleness, formulaicity and non-creativity is an appropriate measure for the idiom principle, as it expresses reader expectations and text entropy. As strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use, and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder (1983)’s hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s openchoice principle), which can be measured with Levy and Jaeger (2007)’s uniform information density (UID) which is a principle of minimizing comprehension difficulty. Our goal to abstract away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language. Keywords:  formulaicity; learner’s language; language processing; collocation; part-of-speech tagging; syntactic parsing

doi 10.1075/scl.82.02sch © 2018 John Benjamins Publishing Company

 Gerold Schneider & Gintarė Grigonytė

1.  Introduction Lexical bundles have been used to describe and detect lexical, phraseological and syntactic patterns (e.g. Biber & Barbieri 2007; Biber 2009; Kopaczyk 2012). They signify fixed practices of language use, testify to Sinclair (1991)’s idiom principle, and have been used to measure formulaicity, complexity and (non-)creativity. The idiom principle postulates that texts are largely composed of multi-word entities (MWE), formulaic expressions which “constitute single choices” (­Sinclair 1991: 110) in the mental lexicon, and that free combinations of lexical items are rather the exception than the rule. Many scholars (e.g. Pawley & Syder 1983; ­Sinclair 1991) have observed that if individual words were the building blocks of language and syntactic rules could freely combine them, the amount of creative language and novel combinations would be far greater than what we see. ­Altenberg and Tapper (1998) estimate that up to 80% of the words in a corpus are part of a recurring sequence. Erman and Warren (2000) estimate that the number of multiword composites, so called prefabs, is around 55%, supporting the idea that a fluent native text is constructed according to the idiom principle. Fluency is even more critical to processing time of the speech, which translates to even higher prevalence of collocations in spoken corpora. Biber et al. (1999) and Leech (2000) compare spoken and written corpora and report the proportions of collocations being higher in spoken language. Lexical bundles are formed as a direct result of the idiom principle. The stronger the idiom principle is, the more words tend to occur in fixed sequences only, language would be completely formulaic if the idiom principle were the only force. The phenomenon that words form bundles has been explained on psycholinguistic grounds. Pawley and Syder (1983: 192) suggest that using frequent and thus familiar sequences and collocations can minimize the “clause internal encoding work to be done” and therefore provides more time for “planning larger units of discourse”. Planning prevails for larger units such as utterances: “language users tend to generate the most probable utterance for a given meaning on the basis of the frequencies of utterance representations” (Ellis & Frey 2009: 476). Pawley and Syder (1983) also hypothesize that native speakers are more experienced in finding a balance between the idiom principle and the open-choice principle. We are thus looking for methods to measure this difference, starting with lexical bundles in Section 4. We then argue that we need on the one hand an additional generic measure of the amount of bundling (Section 5), and on the other hand investigating bundles and type-token distributions inside a selected syntactic frame can provide answers (Section 6). In order to step from selected frames to all syntactic frames, we experiment with a pre-terminal model of syntax – part-of-speech tags – in ­Section 7, and a full syntactic parser in Section 8.



Chapter 2.  From lexical bundles to surprisal and language models 

It is increasingly accepted that sequences of words form the basic building blocks of discourse. Psycholinguistically, using MWEs benefits language users in a twofold way: they allow speakers to attain the needed level of fluency and listeners and readers the ease and speed of understanding required under processing pressures. This plays a crucial role in noisy environments, where our expectations about the continuation of the conversation, also called priming (Hoey 2005), help us to interpret and fill gaps. From the point of message comprehension, predicting patterns in situations is crucial for understanding (Nattinger 1980) as predicted by Shannon’s (1951) noisy channel. Frequency provides a strong pattern of analysis and according to Ellis (2002), both perception and production is governed by frequency of previously perceived utterance analysis: “Comprehenders tend to perceive the most probable syntactic and semantic analyses of a new utterance on the basis of frequencies of previously perceived utterance analyses. Language users tend to produce the most probable utterance for a given meaning on the basis of frequencies of utterance representations” (Ellis 2002: 145). Ellis, Frey and Jalkanen (2009) reported on a psycholinguistic study on the lexical decision task whether two given strings were words or not. The processing was shown to be sensitive to the pattern of collocation usage. Native speakers were faster to decide in cases when words occurred in collocations in contrast to grammatically correct two-word sequences. The authors conclude that experience of high frequency collocations in usage and the speed of perception were related. We outline the build-up of our argument sequence and sections in the following. The simplest measure of repetition is absolute frequency. One can count how often sequences of n = 2, 3, 4, 5 etc. words, i.e. n-grams, occur in a large text collection, in our case the British National Corpus and learner corpora introduced in Section 3, and report the counts sorted by descending frequency. We illustrate this method, which is the classical lexical bundles approach, in Section 4.1. These frequency-based methods are, for instance, described in Biber et al. (1999). Frequency as a measure of lexical bundles has been criticized, e.g. McEnery et  al. (2006: 208–220) points out that they often fail to report the strongest lexical associations, and instead collocation measures (like Mutual Information (MI), log-likelihood, T-score, etc.) came to be used to measure formulaicity, see e.g. Cheng et al. (2009). Bartsch and Evert (2014) report relatively low performance of pure frequency for collocation detection. Biber (2009: 286–290) raises three criticisms on the use of the MI collocation score. His first argument is that the MI score brings rare collocations to the top, which often includes idioms. Idioms are typically also subsumed under formulaic language, and similar processing advantages apply across all MWEs, as SiyanovaChanturia and Martinez (2014) summarize:

 Gerold Schneider & Gintarė Grigonytė

Critically, the above studies differ greatly with respect to the specific type of MWEs investigated (idioms, collocations, binomials, lexical bundles – MWEs varying in their figurativeness, literality, compositionality, length, and frequency). Despite this heterogeneity, all of the above studies strongly suggest that the human brain is highly sensitive to frequency and predictability information ­encoded in phrasal units. Siyanova-Chanturia & Martinez (2014: 10)

An example of a rare collocation is the prototypical idiom “kick the bucket”. It occurs only 8 times in the 100-million-word British National Corpus (BNC). The MI metric reports it as a top hit (e.g. rank 8 with an observation window of 1–3 to the right), while the T-score misses it (rank 43 with the same observation ­window). In fact, there are various collocation measures, typically the family of significance test measures such as T-test or chi-square, which have a general bias towards frequent collocations, or those which aim to have neither bias, such as log likelihood, and those which have a bias towards rare collocations, such as MI.1 In this paper we illustrate the measure O/E (which delivers the same rankings as MI) and T-score in Section 4.2 (and in Section 6). In terms of the quote of ­Siyanova-Chanturia and Martinez (2014: 10), there are methods which prioritize frequency, such as lexical bundles or T-score collocations, and others that prioritize predictability, such as MI collocations, and it is difficult to find a balance. Lexical bundles explicitly and deliberately focus on frequency (e.g. Biber 2009: 281), and while they intend to express collocation they do not aim to express idiomaticity (Biber et al. 1999: 990). A focus on frequency entails boosting sequences that contain function words and are multi-word function units, as the most frequent words are generally function words. For the aim of investigating stylistics this is appropriate, but if also content-centred sequences and rare collocations should be included – such as for our research question of whether language learners use more or less idiomatic language – it may be desirable to aim for a more balanced measure which pays equal tribute to the psycholinguistic factor of predictability. We thus suggest (in Section 5) to use a measure which has two components: one based on frequency, and one based on predictability. Biber’s second argument is that the MI statistics does not take the order of words into consideration. To address this criticism, Section 5 introduces directed word transition probabilities, in particular surprisal (Levy & Jaeger 2007) at the word surface.2 Given a word sequence [w1 w2], the probability of w2 given w1,

.  There is a large variety of collocation measures with different characteristics, see Evert (2009) or Pecina (2009) for detailed overviews. .  Directional probabilities are also included in associative collocation measure Δp proposed by Gries (2013).



Chapter 2.  From lexical bundles to surprisal and language models 

p(w1|w2), is not equal to p(w2|w1). These conditional, directed probabilities are a part of the definition of surprisal. In order to consider word order, we also use fixed positions inside syntactic frames (for example, verb-PP slots have a sequence of ). We use this approach for the description of overused and underused collocations in Section 6 (see also ­Lehmann & Schneider 2011). Biber’s third argument is that multi-word formulaic sequences are often discontinuous. Generally speaking, syntactic approaches address this issue. Our use of fixed positions inside syntactic structures, as in Section 6, is thus one answer to address this issue. A second answer is that we also use 3-grams as building blocks for the calculation of surprisal, which catches mild forms of discontinuity. Finally, we also give an outline of syntactic surprisal. We further think that lexical bundles (LBs) are an extreme form of expected continuation (Hoey 2005), only showing the most extreme sequences, the famous tip of the iceberg, while losing more gradient instances, occurrences further down in ranked lists. We argue that for this reason a general, gradient measure of how much a text, document, register or speaker group tends to exhibit lexical bundles and use formulaic language would be desirable, abstracting from few individual bundles to a general measure which permits the quantification of the overall bundle use in the texts, for which we will use the metaphorical term “bundleness”. We suggest to use the information-theoretic measure of surprisal as a general and versatile measure of lexical bundleness, formulaicity and non-creativity. We argue that it is an appropriate measure for the idiom principle, as it expresses reader expectations and text entropy.3 Information Theory and entropy was introduced by Shannon (1951): the more probable and thus expected a word is in its context, the less information it carries, the more easily interpretable and redundant it is. As such redundant, strongly or gradiently less formulaic, idiomatic expressions prevail on all levels, we argue for moving from individual bundles to more abstract measures of bundleness, such as surprisal. Surprisal uses an information-theoretic model of language, at the lexical surface level, and can be seen as an informationtheoretic model of processing the language (Levy & Jaeger 2007). Shannon’s information theory is also well-known for its noisy channel model: whenever noise distorts a small amount of the signal, it can be re-interpreted correctly due to the redundancy of the signal, i.e. a listener’s strong expectations of the continuation of the conversation. If the communication is so dense that redundancy (in the form of LB on the morphosyntactic and world knowledge on the

.  Entropy is informally the amount of unpredictability. For a formal defintion, see e.g. Gries (2010)

 Gerold Schneider & Gintarė Grigonytė

s­ emantic level) is absent, misunderstanding a single word may lead to a breakdown in understanding. Very dense communication corresponds to unmet or unclear reader expectations and thus high surprisal, as we will see. While LBs allow us to detect contexts of high redundancy, surprisal allows us to measure medium redundancy and also low redundancy, i.e. the other side of the coin. The tug-of-war between the idiom principle (which LB partly catches) and the openchoice principle can partly be recast as a tug-of-war between low surprisal and high surprisal and the struggle for a balance between the two. Sinclair (1991)’s open choice or syntax principle does of course not lead to surprising word sequences per se, but if syntactic constraints are strong enough then they can give rise to very rare word sequences, although their processing by human parsers does not necessarily pose problems. In other words: a model of processing also needs to give the open-choice principle due reference. The sequence of arguments from Section 5 to 8 brings an additional layer of discussion; from the idiom principle towards including the open-choice principle. In order to approximate the open-choice principle better, we also experiment with language processing models at higher levels, such as the word-class level and the syntax level. After using plain word sequences (Section  5) we progress to more complex morphosyntactic sequences in the form of a part-of-speech tagger (Section 7) and then hierarchical syntactic representations in the form of a syntactic parser (Section 8). In order to compare to conditions in which speakers have less expert knowledge of language use, we employ features of language learner production such as surprisal (Section 5), fixed positions inside syntactic frames (­Section 6), and language models (Section 7 and 8) to uncover differences between: a. spoken and written learner language (L2), across selected written genres; b. L2 across different proficiency levels; c. learner language as compared with native language (L1).

2.  Related research Related research falls into two major categories, namely formulaicity in written and spoken production of language learners, and the assessment of L2 language learning. As for the first field of related research, several studies investigating language usage of L2 learners on the basis of substantial corpora of production data have been published since the 1990s (e.g. Altenberg & Tapper 1998 (for S­ wedish); Lorenz 1999; Granger 2009). In Granger and Tyson (1996) and Altenberg and Tapper (1998) quantitative corpus studies show that learners tend to overuse and underuse adverbial connectors in terms of frequencies. In Granger (2009) the



Chapter 2.  From lexical bundles to surprisal and language models 

same pattern of misuse is observed for lexical phrases, collocations and active/ passive verb constructions. Nesselhauf (2003) analyses the written production of advanced language learners’ usage within three types of collocations: free-combinations, collocations (various degree of restriction) and idioms. The study reports that most mistakes in production of collocations occur between combinations of medium degree of restriction, and the lowest rate of mistakes is typical for combinations with a high degree of restriction like fail an exam/test. Nesselhauf argues that the latter “are more often acquired and produced as wholes”, whereas the combinations of medium degree of restriction “are more creatively combined by learners” (Nesselhauf 2003: 233). The study of Ellis et al. (2008) on spoken language processing and production tasks was formulated on the basis of corpus-derived features like length, frequency and mutual information (MI). The authors conclude that for native speakers it is the formulaicity (i.e. idiomaticity and fluency of language that was measured by MI) that determines the processability, whereas for L2 learners it is the n-gram frequency that governs the processing. Erman (2009) investigates differences between L1 and L2 written production, according to their findings the proportion between collocations (of verb+noun) compared to free combinations (of verb+noun and adj+noun) amount to 60.2% and 54.9% for the group of native speakers and 39.8% and 45% for the group of language learners. Erman concludes that the idiom principle is the default principle in language production for learners and native speakers, although learners produce fewer collocations compared to the native group (Erman 2009: 342). Concerning the second field of related research, assessment of L2 production, Bonk (2000) shows that collocation use is a testable phenomenon in discriminating among L2 proficiency. The proposed collocation test was found to be reliably correlating to TOEFL scores and ESL teachers’ proficiency rankings.4 Read and Nation (2006) analyse whether the use of formulaic language varies according to the candidate’s band score level. IELTS exam oral production of candidates rated at bands 4, 6 and 8 is analysed. The authors find that “the sophistication in vocabulary of high-proficiency candidates was characterised by the fluent use of various formulaic expressions, often composed of high-frequency words, perhaps more so than any noticeable amount of low-frequency words in their speech. Conversely, there was little obvious use of formulaic language among Band 4 candidates” (2006: 207). Similarly, Kennedy and Thorp (2007) analyse a corpus of IELTS exam written essays and investigated collocation use across proficiency levels (­corresponding to

.  Significance of 0.83, Spearman’s rank correlation at 0.05 level.

 Gerold Schneider & Gintarė Grigonytė

bands 4, 6 and 8). They found that collocation use is more prevalent with writing scores that fall into band 8 than in 4 and 6. Ohlrogge’s (2009) research focuses on various types of formulaic language used by intermediate level learners in EFL written proficiency test and investigates to what extent formulaic sequences occur across high- and low-scoring essays. Ohlrogge (2009) reports a correlation of 0.90 found between a grade level and the use of idioms/collocations. These results indicate that high-scoring essays have significantly and consistently more formulaic language sequences than low-scoring ones. The above approaches are mainly based on a descriptive linguistic approach. Sections 7 and 8, which apply a language model to L1 and L2 data, measure if L2 fits the model less well. This approach follows the tradition of statistical anomaly and outlier detection (Aggarval 2013). If a model is trained on L1 speakers, it can be expected that L2 speakers fit less well, that they are more often outliers, because they produce more ungrammatical sentences, and because, according to Pawley and Syder (1983), they master the subtleties of formulaic language less well. Approaches that are related to ours are Keller (2003) who shows that parsers deliver low scores on certain types of ungrammatical material, and Keller (2010) who argues that parsers can be used as a psycholinguistic model of a native speaker. 3.  Materials We use the written part of the British National Corpus (BNC, Aston & Burnard 1998) to describe differences between genres. We rely on the BNC genre categorisation provided by Lee (2001), which is similar to the one used in the ICE corpora, and more fine-grained than the official BNC categorization. For example, it makes a distinction between academic and non-academic texts in pure science and in applied science. In order to investigate if Pawley and Syder’s (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle) we measure surprisal on native language and on learner language material. We use two Japanese learner corpora, the Japanese Learner English Corpus (JLE, Version 4.1, 2012), and the Corpus of English Essays Written by Asian University Students (CEEAUS). The error-corrected JLE corpus (Izumi et al., 2005, NICT 2012) consists of spoken data from exam interviews. Errors are annotated, and suggested corrections by professional teachers provided, which gives a parallel corpus of learner and (almost) native data, abstracting away from other sources of linguistics difference, such as semantics, topics, or genre. The corpus contains 1,281 exams



Chapter 2.  From lexical bundles to surprisal and language models 

totaling 1.2 million tokens, and 9 levels of speaker proficiency. We will contrast the original utterances with the error-corrected utterances in the following. In order to investigate written data, we also use CEEAUS5 (Ishikawa 2009); 2012 Version. It is not error-corrected, but the subject matter of this corpus is tightly controlled, fine-grained learner levels are provided in the metadata, and native speaker essays on the same topic are included. We have used one of the two essay topics of the corpus (“should students have a part-time job”). Size and buildup of the corpus is illustrated in Table  1. It consists of learner language components, among others the Corpus of English of Japanese University Students (CEEJUS), and of a contrastive native language component, the Corpus of ­English Native American Students (CEENAS). We will contrast different learner levels among the Japanese L2 writers, and the Japanese L2 writers with the Native (L1) writers. Table 1.  Features of CEEJUS (Japanese learners) and CEENAS (English native speakers) subsets of the CEEAUS corpus (December 2012 Version) # Essays (2 topics)

CEEJUS (L2)

CEENAS (L1)

770 (4 proficiency levels)

    146

Tokens

169,654

37,173

Types

  4,800

 3,797

Lemmas

  3,602

 2,884

4.  From frequencies to collocations We first describe raw frequency as a method to locate bundles, then we use collocation statistics to detect differences in lexical bundles between genres. In Section 6 we use collocation statistics to measure which bundles are found particularly often in L2 production. 4.1  Frequency as measure of lexical bundleness Frequency is the simplest measure of routinisation and repetition. One can count how often sequences of 2, 3, 4, 5 etc. words occur in a large text collection, and report the counts in reverse order. In Table 2, we list the most frequent 4-word sequences, so-called 4-grams, from the written part of the BNC.

.  Nowadays part of the ICNALE project.

 Gerold Schneider & Gintarė Grigonytė

Table 2.  Top most frequent 4-grams, BNC written part Rank

Frequency

4-gram

 1

4812

at the same time

 2

4712

for the first time

 3

4467

as a result of

 4

4177

on the other hand

 5

4124

. There is a

 6

3905

. It is a

 7

3221

the Secretary of State

 8

2758

in the form of

 9

2743

on the basis of

10

1579

I would like to

This method has the advantage that it is easy to calculate and the characteristic that it prioritizes very frequent bundles. It is also difficult to compare across d ­ ifferent corpora, as the frequencies depend on the corpus size, and there is no obvious method to compare ranks. A measure of collocation strength can solve both of these problems. 4.2  Collocation measures: O/E and T-score There is a large variety of collocation measures with different characteristics. Pecina (2009) describes 82 different collocation measures. We use two popular collocation measures in the following: O/E and T-score. O/E is a simple information-­theoretic measure (Shannon 1951) and delivers the same rankings as the equally popular mutual information (MI) measure. It tends to be susceptible to coincidences in the corpus and over-represent rare events due to its information-theoretic base. The T-score measure is based on significance testing and has the opposite characteristic: it over-represents frequent collocations, even if it does so less radically than pure frequency counts. There are two main factors for the decision if observed differences are significant: either they are very large (effect size) or they are based on so many observations that random fluctuations have become small. The latter reason gives a boost to frequent collocations. 4.2.1  Method O/E literally stands for Observed over Expected: it calculates the probability of words in combination as observed in the actual corpus, divided by the ratio of expected independent probability if all the words in the corpus were randomly distributed. When applied to combinations of two words, such as bi-grams, if x is



Chapter 2.  From lexical bundles to surprisal and language models 

the first word, y the second word in the combination, and N the size, O/E can be calculated as follows. The independent probability of generating x is its frequency in the corpus divided by corpus size; and for y analogously. The probability of x and y in combination, in other words the observed value (O), is the frequency of x and y in combination (e.g. the first word in the bigram is x, the second y) divided by the corpus size. p(x) =

f(x) N

;

p(y) =

f(y) N

;

p(x, y) = 0 =

f(x, y) N

If co-occurrence of x and y is due to chance, i.e. if there is no collocational force, then the independent probability of seeing both Expected (E) and Observed (O, the joint probability of seeing the combination) are roughly equal: O = p(x, y) ≅ p(x) ∙ p(y) = E O/E, Observed divided by Expected, is then: f(x, y) p(x, y) f(x, y) · N · N f(x, y) · N 0 N = = = = E p(x) · p(y) f(x) f(y) f(x) · f(y) · N f(x) · f(y) · N N

In our application in this section, we do not use random distribution of independent words as Expected value (we will do so in Section 6), but the random distribution of the lexical bundle in a large reference corpus. For example, we compare a specific genre of the BNC to the entire BNC, as follows. f(x, y, ...)genre · NBNC 0 = E f(x, y, ...)BNC · Ngenre

In addition to the fact that collocation measures aim at representing attracting forces between words, this formulation also allows us to apply O/E to word sequences f(x,y, …) of arbitrary length. Researchers interested in lexical bundles have repeatedly pointed out that “studies of collocations give primacy to frequency and two-word relationships” (Conrad & Biber 2004: 57). In such research, typically, “sequences of two words are not included since many of them are word associations that do not have a distinct discourse-level function” (Conrad & Biber 2004: 58). Our suggested measure does not report the absolute collocation strength, but the relative collocation strength within a genre compared to the collocation strength in the entire BNC, or in other words, an information-theoretic measure of overrepresentation in the genre or speaker group under investigation.

 Gerold Schneider & Gintarė Grigonytė

4.2.2  Results We have applied the O/E method to N-grams of length 2, 3, 4, and 5, using the subdivision of the BNC into genres provided by Lee (2001) into the following 8 genres: –– –– –– –– –– –– –– ––

Arts Applied Science Commerce and Finance Belief and Thought Leisure Natural and pure sciences Social Science World affairs

We give examples of the 20 top-ranked 4-grams,6 according to O/E, from Belief and Thought in Table 3, and Leisure in Table 4. We have used a heuristically chosen frequency filter of f > = 50 to filter out rare collocations, which is a standard procedure in O/E to eliminate corpus coincidences. Table 3.  Top-ranked 4-grams by O/E, BNC topic Belief and God Rank

O/E

Frequency

4-gram

 1

21.19

 67

of the Created God

 2

21.19

 60

a “god”

 3

18.97

 77

of the Holy Spirit

 4

18.84

 56

in the New Testament

 5

16.95

 60

the ordination of women

 6

15.14

 55

of the New Testament

 7

 5.17

 60

in the life of

 8

 5.16

105

the Church of England

 9

 4.63

 52

of the Church of

10

 3.96

 51

the authority of the

11

 3.91

 87

the life of the

12

 3.63

 66

there can be no

13

 3.47

 51

would seem to be

14

 3.45

 67

in the sense of

.  For the interpretation of the results in Tables 3–7 we use functional labels used in Biber, Conrad & Cortes (2004).



Chapter 2.  From lexical bundles to surprisal and language models 

Table 3.  (Continued) Rank

O/E

Frequency

4-gram

15

 3.02

 71

does not mean that

16

 3.02

 53

is to be found

17

 2.97

 60

of the nature of

18

 2.80

 54

that is to say

19

 2.60

 80

in the sense that

20

 2.57

 55

in a state of

Table 3 contains many genre-specific multi-word key concepts (Created God, Holy Spirit, New Testament, Church of England, ordination of women), stance expressions (e.g. there can be no, would seem to be, does not mean that), and some discourse organisers (e.g. in the sense of, that is to say, in the sense that). There are also a few referring expressions (e.g. of the nature of, in a state of). Table 4.  Top-ranked 4-grams by O/E, BNC topic Leisure Rank

O/E

Frequency

4-gram

 1

7.08

 76

+0000 (GMT)

 2

7.08

 62

and silver anniversary couples

 3

7.08

292

Supplements per person per

 4

7.08

 70

Preheat the oven to

 5

7.08

 70

(see p. 225)

 6

7.08

 69

The price includes dinner

 7

7.08

 60

includes dinner (or

 8

7.08

 60

dinner (or lunch

 9

7.08

 60

(or lunch)

10

7.08

 56

room (on request

11

7.08

 56

Single room (on

12

7.08

 52

receive a bottle of

13

6.98

 74

(on request)

14

6.98

 71

bottle of sparkling wine

15

6.95

330

per person per night

16

6.88

 71

a bottle of sparkling

17

6.83

 54

the oil in a

18

6.81

 51

ml (1 tsp

19

6.81

 51

5 ml (1

20

6.53

 60

(1 tsp)

 Gerold Schneider & Gintarė Grigonytė

The genre of Leisure is dominated by cooking recipes and holiday brochures. Many of the phrases are repeated so often that even a relatively high frequency filter (f > = 50) lets them pass. The top 12 4-grams are found exclusively in texts from the genre Leisure, and thus all have the same O/E measure. The list in Table 4 indeed mainly lists rare collocations, and the majority does not have a discourse function nor do they seem to be single items in the mental lexicon. One of the collocation measures which gives precedence to frequent collocations is the T-score. T-score allows us to address the first of the three criticisms on using collocation measures by Biber (2009: 286–290), that MI score (or the functionally identical measure of O/E) bring rare collocations to the top. We have used a formulation of the T-score in terms of O and E, given in Evert (2009). The 20 top-ranked 4-grams, according to T-score, from Belief and Thought are given in Table 5, Leisure in Table 6, and Natural and pure Sciences in Table 7. Table 5.  Top-ranked 4-grams by T-score, BNC topic Belief and Thought Rank

T-score

Frequency

4-gram

 1

243.2

264

the end of the

 2

205.8

217

at the same time

 3

165.5

184

at the end of

 4

165.5

173

the way in which

 5

134.8

146

in the case of

 6

127.4

138

is one of the

 7

123.8

132

on the other hand

 8

122.9

135

the rest of the

 9

119.5

125

that there is a

10

110.2

119

to be able to

11

105.9

113

in terms of the

12

104.9

112

a great deal of

13

103.0

105

the Church of England

14

102.3

116

for the first time

15

101.6

111

as well as the

16

101.4

112

on the basis of

17

100.5

105

to be found in

18

 99.1

110

On the other hand

19

 92.1

 98

in the light of

20

 84.6

 87

the life of the



Chapter 2.  From lexical bundles to surprisal and language models 

Table 5 lists some multi-word key concepts (Church of England), some discourse organisers (as well as, on the other hand, at the same time) and many referential expressions (e.g. the end of, the case of, a great deal of). The high proportion of referential expressions indicates a formal style (e.g. Biber, Conrad & Cortes 2004). Table 6.  Top-ranked 4-grams by T-score, BNC topic Leisure Rank

T-score

Frequency

4-gram

 1

969.0

1001

the end of the

 2

766.4

 793

at the end of

 3

614.9

 630

one of the most

 4

580.9

 599

for the first time

 5

580.3

 589

the top of the

 6

564.6

 580

is one of the

 7

518.8

 537

the rest of the

 8

415.5

 439

at the same time

 9

350.7

 359

I do n’t think

10

331.9

 339

the back of the

11

327.3

 330

per person per night

12

311.1

 317

the edge of the

13

302.1

 311

I do n’t know

14

289.5

 292

Supplements per person per

15

288.2

 297

the centre of the

16

272.7

 283

in the middle of

17

257.6

 275

to be able to

18

251.7

 265

was one of the

19

250.3

 257

the bottom of the

20

243.4

 254

the middle of the

The top-ranked 4-grams from the topic of Leisure contain several stance expressions (e.g. I do n’t think, I do n’t know), which, as Biber, Conrad & Cortes (2004: 384) show in their comparison of University registers, is typical for less formal styles (e.g. conversation) and evaluative style (e.g. classroom teaching). Leisure also contains many topological reference expressions (e.g. top of, end of, edge of, back of, middle of). Table  7 shows that Natural and pure Sciences is dominated by referential expressions as claimed by, e.g., Conrad & Biber (2004: 68): “The majority of the

 Gerold Schneider & Gintarė Grigonytė

Table 7.  Top-ranked 4-grams by T-score, BNC topic Natural and pure Sciences Rank

T-score

Frequency

4-gram

 1

266.4

291

the end of the

 2

216.3

231

as a result of

 3

209.5

230

at the end of

 4

207.7

219

in the case of

 5

191.2

201

on the basis of

 6

188.4

191

in the presence of

 7

186.5

197

in the form of

 8

175.7

187

is one of the

 9

173.0

174

data not shown)

10

169.1

178

a wide range of

11

166.2

184

per cent of the

12

163.0

164

( data not shown

13

161.4

166

in the absence of

14

151.2

167

at the same time

15

150.6

153

the surface of the

16

147.7

159

on the other hand

17

141.3

149

in terms of the

18

140.8

148

it is possible to

19

137.3

144

In the case of

20

137.2

146

the nature of the

common four-word bundles in academic prose are referential expressions.” We also find the subgroups which they describe, such as intangible framing attributes (rows = ranks 2, 4, 5, 6, 13), multi-functional reference (rows 1, 3, 15), identification (rows 8, 10), quantity specification (row 11) or time reference (row 14). Other groups are rare, for example topic elaboration as discourse organisation (row 16) or impersonal stance expressions (row 18). Conrad and Biber (2004) address the question whether different registers tend to use different sets and different classes of lexical bundles. They investigate the extreme opposites of conversation and academic prose, our investigation here indicates that subtle differences can be found across all registers. Conrad and Biber (2004) also set out to investigate how frequent lexical bundles are in the compared two registers. They observe that “conversation has a few bundles with very high frequencies” (Conrad & Biber 2004: 61), but academic



Chapter 2.  From lexical bundles to surprisal and language models 

prose also has a high proportion of words which participate in lexical bundles. According to their measure, 28% of words in conversation occur within 3- and 4-word lexical bundles, while in academic prose it is still 20%. Obtaining such measures of “bundleness” require one to sum over long lists of lexical bundles. We thus wondered if there could be a more direct abstraction from individual bundles to a measure of “bundleness”, which would have the characteristic of obviating such summations. We present a suitable measure in Section 5. We also applied the above methods of detecting n-grams and collocations in learner language. Table 8, for example, shows the top-ranked 4-grams by T-score, comparing the NICT JLE corpus to the BNC. Table 8.  Top-ranked 4-grams by T-score, NICT JLE corpus Rank

T-score

Frequency

4-gram

 1

68.9

69

I do n’t know

 2

38.9

39

One day last week

 3

38.9

39

I ‘d like to

 4

37.9

38

I do n’t have

 5

22.9

23

how can I say

 6

19.9

20

I went to the

 7

19.9

20

I do n’t like

 8

19.5

20

And there is a

 9

18.9

19

I ‘m going to

10

14.9

15

O K One day

While the results such as Table 8 showed us that learner language is relatively simple, and the learners feel challenged (“I don’t know”, “how can I say”), and that the interview situation influences the bundles, our research question whether learners use more or fewer bundles is very difficult to answer and would require very much interpretation. This finding further supports that a generic measure of the amount of bundling is necessary. We are going to introduce one in the following. In Section 6, we will see that using collocations inside selected syntactic frames also allows us to partly answer our questions. 5.  Surprisal as a measure of bundleness The psycholinguistic underlying force of lexical bundles is that speakers and listeners expect the continuation of the conversation to such a degree that they can

 Gerold Schneider & Gintarė Grigonytė

retrieve a multi-word sequence as a single item from the mental lexicon. Entropy is so low that re-analysis, due to the occurrence of an unexpected word in the continuation, is hardly ever necessary. Lexical bundles are an extreme case, the famous tip of the iceberg, of expected continuation and priming, as described by Hoey (2005). It would be desirable to have a measure which, in addition to showing the tip of the iceberg, measures the entire gradience from bundle to free creativity, or from the perspective of syntax, offers a gradient operationalization of chunk boundaries used in theories such as Linear Unit Grammar (Sinclair & Mauranen 2006). 5.1  Method An information-theoretic measure of expected continuation could thus give us a measure of bundleness. We have suggested in Section 1 that surprisal (Levy & Jaeger 2007) is a good candidate for such a measure. It calculates the probability of the following word given the n previous words. It is an information-theoretic measure, and it has the following desired characteristics: –– It directly delivers a “bundleness” value without needing to detect and sum over individual bundles. –– It measures the entire gradience from idiom to rarity, i.e. much more than the tip of the iceberg, which the discovery of lexical bundles shows us. Surprisal can theoretically take the entire previous context of a discourse into consideration; for practical purposes however, going back more than a few words is hardly useful and, as corpora are limited, inevitably leads to an unsurmountable sparse data problem. We will use bi-gram and tri-gram surprisal in the following discussion. 2– gram surprisal = log

3– gram surprisal = log

1

1 + log p(W1) p(W2 | W1) 1

1 1 + log + log p(W1) p(W2 | W1) p(W3 | W1W2)

In other words, surprisal is the logarithmic version of the probability of seeing word w1 linearly combined with the probability of the transition to the next word, w2. The probability p(w1), the so-called prior, is based on frequency, while the transitions, e.g. p(w2| w1) express predictability. Surprisal is an information theoretic measure; it measures how many bits of information the continuation of the conversation contains. Information Theory goes back to Shannon (1951):



Chapter 2.  From lexical bundles to surprisal and language models 

the more probable and thus expected a word is in its context, the less information is carries, the more redundant it is, the more easily it can be repaired if noise distorts the signal. For human communications, some of the insights of information theory are equally valid. In contexts where surprisal is very low, speakers just use the trodden paths of lexical bundles without conveying any information to listeners, the conversation stays largely predictable and redundant (and probably boring). In contexts where surprisal is very high, listeners are given much new information, but the continuation of the conversation is very hard to predict and listeners will be challenged, and missing a single word (for example due to noise) can lead to ambiguity and misunderstandings. Particularly in spoken language, one sometimes misses or mishears words and can only interpret the meaning based on the context. In such environments, a certain amount of redundancy is a prerequisite for successful communication. One consequence of Information Theory is Zipf ’s observation that frequent words are shorter (analogous to the fact that compression algorithms give the shortest sequences to highly frequent patterns), which led to his famous Principle of Least Effort, which says that human behavior strives to minimize “the person’s average rate of work-expenditure over time” (Zipf 1949: 6). The need for expressivity, i.e. to transmit as much information as possible in as few words as necessary, is thus in a constant tug-of-war with efficiency, the need to produce an utterance which fulfills the expectations of listeners and proceeds without major hesitations and overly long pauses. Levy and Jaeger (2007) hypothesize that for successful communication, areas of very high and very low surprisal are avoided as far as possible in successful communication. “[S]peakers may be managing the amount of information per amount of linguistic signal (henceforth information density), so as to avoid peaks and troughs in information density” Jaeger (2010: 24). They postulate the principle of Uniform Information Density (UID) and state that “it can be seen as minimizing comprehension difficulty” (Levy & Jaeger 2007: 850). They show that it holds on the level of syntactic reduction, where that complementizers can are rendered as zero-forms preferably in non-ambiguous contexts. We hypothesize that this holds in more syntactic environments, as a trend even in all syntactic environments. We will test this hypothesis by using a tagger in Section 7 and a parser in Section 8. 5.2  Results In the following, we test UID using bi-gram and tri-gram surprisal. First, we test how much it holds in spoken and written genres, then we test if native speakers follow it better than language learners.

 Gerold Schneider & Gintarė Grigonytė

Levy and Jaeger’s (2007) UID predicts that in order to ensure comprehension we avoid zones of high surprisal, while due to the fact that we want to convey information we equally avoid low surprisal. From a psycholinguistic perspective, we form sentences in the tug-of-war between formulaic but expected expressions, and semantically dense but unexpected language. In information-theoretic terms, we need to find an appropriate balance: do not (over-)load the conversation with too much information, and neither (under-)load with too little. We thus expect to see a Gaussian distribution of surprisal. 14000 12000

Frequency

10000 8000 6000 4000 2000 0

6

8

10

12 Bigram surprisal

14

16

18

Figure 1.  Distribution of Bigram surprisal in BNC spoken demographic (diagonally striped bars) and pure science (vertically striped bars)

Figure  1 shows that UID holds surprisingly well in spoken language, where an approximately Gaussian distribution with a mode of around 10 emerges, but much less in dense written registers, such as pure science, where the needs of information compression are particularly high (Biber 2003), which leads to a distribution in which surpisal is most frequent. After a plateau around 14, bigrams with higher surprisal are even more frequent, partly due to rare technical noun terms. These findings may also indicate that UID is partly more of a planning help under time constraints than an aid to help readers and listeners in the comprehension task. 5.3  Bundleness of spoken L2 compared to corrected L2 In this section we analyze spoken learner production and compare it to corrected learner production. The corrections of utterances have been made by language teachers and even though not exactly being L1, they are in most cases close to



Chapter 2.  From lexical bundles to surprisal and language models 

L1-like. We analyse bundleness effect on corrected L2 vs. L2 by applying the same method we use for L2 vs. L1. In terms UID, we expect corrected bundles to have lower surprisal. According to Pawley and Syder (1983), native speakers know best how to play the game of fixedness vs. expressiveness. “[N]ative speakers do not exercise the creative potential of syntactic rules to anything like their full extent, and that, indeed, if they did do so they would not be accepted as exhibiting native like control of the language” (Pawley and Syder 1983: 193). So we expect Learner English to show violations of the UID and thus see evidence of higher surprisal, as we have in the case of pure science in the BNC (although there the reasons are different: information compression in BNC scientific, but non-native expressions and ungrammatical utterances in Learner English). We use the error-corrected Japanese Learner English Corpus NICT corpus to compare trigram surprisal between original (diagonal stripes) and corrected (horizontal stripes) utterances. As we saw in Figure 1, surprisal largely depends on genre and topic, which we completely control by using the NICT corpus, which has original and corrected utterance in parallel. We use only those utterances from the corpus in which a correction has been made. Results are given in Figure 2. 14000 12000

Frequency

10000 8000 6000 4000 2000 0

25

30

35 40 45 Trigram surprisal in JLE

50

55

Figure 2.  Distribution of trigram surprisal in corrected (horizontal stripes) and uncorrected (diagonal stripes) JLE corpus

Figure 2 reveals higher frequency of low surprisal n-grams and lower frequency of high surprisal n-grams for the corrected utterances when compared to the original n-grams. As our surface n-gram surprisal model is based on L1 data (BNC),

 Gerold Schneider & Gintarė Grigonytė

low surprisal values of n-grams are interpreted to have more native like language features, whereas very unusual or even unseen n-grams, which have high surprisal values, contain unusual word combinations, morphological and syntactic errors. 5.4  Bundleness of written L2 compared to L1 An equivalent procedure is used to investigate written Learner English in terms of violations of the UID. We use the CEEAUS corpus to compare trigram surprisal between L2 (diagonal stripes) and L1 (horizontal stripes) utterances (see Figure 3). 3500 3000

Frequency

2500 3000 1500 1000 500 0

20

30

Trigram surprisal

40

50

Figure 3.  Distribution of trigram surprisal of L1 (horizontal stripes) and L2 (diagonal stripes) writers in the CEEAUS corpus

L2 Written production clearly shows the underuse of low surprisal n-grams if compared to native production and a slight overuse of high surprisal n-grams. In the written genre of student essays, to which CEEAUS belongs, information compression is not higher than in the spoken JLE NICT corpus – both form a Gaussian distribution and thus largely abide to UID. We have also measured surprisal across learner levels but we obtained less clear results. In particular, advanced learners seem to show lower surprisal than native speakers. We explore the reasons in the next two sections. One reason could be that learners have been claimed to overuse the most frequent prefabricated structures (Granger 2009). We will follow this trace in Section 6. Another reason could be that the vocabulary of learners is smaller, and that we need a form of surprisal which uses a morphosyntactically more appropriate language processing model than lexical sequences. We take up this trace in Section 7.



Chapter 2.  From lexical bundles to surprisal and language models 

6.  Collocations as non-adjacent relations in a syntactic frame Granger (2009) states that learners use fewer prefabricated structures than nativespeakers, but at the same time the most frequent prefabricated structures are overused. The latter may lead to lower surprisal in the utterance of language learners, opposing the general trend seen in the previous section. Erman (2009) claims the same for collocations in general, and points out that investigating fixed bundles runs the risk of missing those collocations which are flexible and nonadjacent. “What sets collocations apart from idioms is that many […] allow members to be varied, frequently depending on pragmatic factors and the situation at hand. Furthermore, they suffer few syntactic constraints” (Erman 2009: 328). Biber’s (2009: 286–290) third criticism of using collocations for measuring bundles is that multi-word formulaic sequences are often discontinuous. To address this issue, we measure collocations in the frame of a syntactic relation, which may by definition be between non-contiguous words. We have introduced collocation measures in Section 4.2.1. Using collocation measures inside a syntactic frame, we have extracted collocations from large corpora (e.g. Lehmann & Schneider (2011) for verb-PP constructions and Ronan & Schneider (2015) for light verbs). Seretan (2011) shows that syntax-based collocation extraction performs better than observation windows, Bartsch and Evert (2014) give detailed evaluations, confirming that both precision and recall are consistently higher in approaches using syntactic dependency relations compared to surface approaches such as observation windows. For testing Erman’s (2009) claims, we investigate how L2 speakers use verb-PP relations. Verb-PP relations are particularly interesting as they are often non-­adjacent which means that we profit from syntactic parsing (see (1) below), as they include the important subclass of phrasal verbs (see (2) below), and as they are morphologically unfixed and can be freely modified.

(1) But, it is not pleasant to concentrate all their energies on a part time job  (CEEAUS, level L).



(2) If they don’t have a part-time job they depend on their parents to have the money.  (CEEAUS, level L)

For this investigation, we compared frequencies of verb-PP constructions (including adjective-PP structures and verbal particles) in CEEAUS. As the following counts are based on type/token ratios and are thus affected by corpus size (see e.g. Malvern et al. 2004), we had to use portions of the same size from each learner level. The native speaker part contains 1063 verb-PP structures, so we used the first 1063 verb-PP occurrences from each learner level. As the low-level data contained slightly fewer than 1063 verb-PP occurrences, we combined it together with a small amount of middle-level data, and refer to it as L+M.

 Gerold Schneider & Gintarė Grigonytė

The 11 most frequently used verb-PP construction types by low-level learners (L+M) are listed in Table 9. They can be compared to the 11 most frequent ones in the native speaker corpus in Table 10, and to semi-upper learners (S) in Table 11. In Tables 9 to 11, F signifies token frequency, V the verb (or adjective) to which the PP is attached, P the preposition or particle, and N the noun in the PP. The third last column (fraction) gives the percentage for the frequency of the type, the second last column gives the cumulative percentage, showing how many of the verbPP occurrences are covered by the list from the top to this row. The last column is Zipf ’s constant, i.e. rank*frequency, which according to Zipf ’s law tends to stay quite constant across the entire list. We come back to this point later. Table 9.  Most frequent L+M learner verb-PP constructions F

V

P

N

99

important

for

student

21

agree

with

statement

10

agree

with

opinion

 8

play

with

friend

 7

important

for

us

 7

depend

on

parent

 6

work

in

 6

study

in

 6

go

to

college

 6

be

For

example

 5

work

as

job

collo?

fraction

L+M cumulative

Zipf ’s C

0.0931

0.0931

99

yes

0.0197

0.1129

42

yes

0.0094

0.1223

30

0.0075

0.1298

32

0.0065

0.1364

35

0.0065

0.1430

42

society

0.0056

0.1486

42

college

0.0056

0.1543

48

0.0056

0.1599

54

0.0056

0.1656

60

0.0047

0.1703

55

fraction

Native cumulative

Zipf ’s C

yes

yes

Table 10.  Most frequent native speaker verb-PP constructions F

V

P

N

collo?

31

important

for

student

0.0291

0.0292

31

11

agree

with

statement

yes

0.0103

0.0395

22

 6

focus

on

study

yes

0.0056

0.0452

18

 5

go

to

college

0.0047

0.0499

20

 5

concentrate

on

study

0.0047

0.0546

25

 5

be

For

student

0.0047

0.0593

30

 5

be

in

school

0.0047

0.0640

35

 4

go

to

school

0.0037

0.0677

32

 4

disagree

with

statement

0.0037

0.0715

36

 3

work

in

restaurant

0.0028

0.0743

30

 3

work

with

other

0.0028

0.0771

33

yes

yes



Chapter 2.  From lexical bundles to surprisal and language models 

Table 11.  Most frequent S learner verb-PP constructions F

V

P

N

76

important

for

student

15

agree

with

opinion

14

agree

with

statement

12

work

in

society

12

be

Of

course

10

play

with

 9

important

for

 9

be

For

example

 8

good

for

 8

go

to

 7

important

for

collo?

fraction

S cumulative

Zipf ’s C

0.0714

0.0714

76

yes

0.0141

0.0856

30

yes

0.0131

0.0987

42

0.0112

0.1100

48

0.0112

0.1213

60

friend

0.0094

0.1307

60

them

0.0084

0.1392

63

0.0084

0.1476

72

student

0.0075

0.1552

72

college

0.0075

0.1627

80

us

0.0065

0.1693

77

yes

yes

Such lists can be expected to have a Zipfian distribution (Zipf 1965): rank*frequency is quite constant, which also entails the few top ranked items are very frequent, but most are very rare; the lower half of the full list of types is typically made up of singletons (frequency = 1). The most frequent type, important for student, has 99 tokens in the low-lever learner essay part of the corpus, it thus covers 9.3% of all verb-PP constructions (second last column), Zipf ’s constant is much too high here, too much of the mass clusters at the very top. In the native speaker corpus, the most frequent type (which is also important for student only occurs 31 times and thus makes up 2.9% of all verb-PP tokens. The top 10 types cover 17.0% of all tokens in the low-level learner corpus L+M), but only 7.7% in the native data. The 1063 tokens have only 347 types in the low-level learner corpus, compared to 929 types in the native data, and 752 in the semi-upper learner data. This striking difference is visualized in Figure  4 for the top 100 types (the horizontal axis indicates the rank, the vertical axis the cumulative coverage). We have also included the semi-upper learners (S). The results indicate that indeed low-level learners use fewer verb-PP constructions, and particularly overuse the few most frequent ones. The semi-upper learners (S) seem to pattern almost like the low-level leaners. But we need to bear in mind that these usage patterns do not distinguish between collocations and fully compositional verb-PP constructions. Therefore, Figure 4 should be interpreted as a picture of the ‘vocabulary’ richness, of the richness of the inventory of verb-PP constructions, and possibly also mirrors the variety of opinions on the topic. In order to assess the situation concerning collocations, we have annotated the 100 most frequent types on whether the verb-PP construction is compositional

 Gerold Schneider & Gintarė Grigonytė 0.45

L+M cumulative S cumulative Native cumulative

0.4

Cumulative coverage

0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 Rank

Figure 4.  Cumulative coverage of the top 100 verb-PP types. The horizontal axis is the rank (Tables 8 to 11 give the 11 top ranks), the vertical axis is the cumulative fraction, i.e. the coverage until this rank

or a collocation, i.e. partly non-compositional (column collo? in Tables 9–11). In the L+M learner list, 26 types are collocations, 38 in the S list, and 42 in the native list. It seems to be the case that learners use fewer prefabs, as Erman (2009) and Granger (2009) stated. If we only count the non-compositional verb-PP structures, the pattern given in Figure 5 emerges. Looking at the few top collocations, both L+M and S overuse them, but then L+M has fewer prefabs. Towards the end of the 100 verb-PP construction types, Native has caught up and partly overtaken L+M. The group using most collocation tokens, however, is S, semi-upper learners. This fact can be interpreted in two ways: either they use those collocations which are mastered reasonably well by speakers of this level more than native speakers, or the fact that their general vocabulary is more limited than the one of native speakers restricts compositional uses. The Zipfian behavior of S is considerably different from both Native and L+M, as we can see if we compare Figures 4 to 6. L+M starts off with very high counts, and then drops to low frequencies very fast. Native starts off with low counts, and drops off even faster. L+M starts off with relatively high counts, and – this is the surprising part – stays high for quite a long time. Zipf ’s law states that in such lists, frequency * rank (where rank is the row number in such a sorted list) should stay approximately constant. We plot the fraction (percentage) in Figure 5, and the Zipf constant in Figure 6. While it indeed stays relatively constant, S shows a clear ‘belly’ from rows 3–15 in both data representations. By rank 50, all lists have reached f = 2 and are thus not very interesting, so we have plotted only until rank 50.



Chapter 2.  From lexical bundles to surprisal and language models  0.16

L+M cumulative S cumulative Native cumulative

0.14

Cumulative coverage

0.12 0.1 0.8 0.06 0.04 0.02 0

1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 Rank

Figure 5.  Cumulative coverage of the top 100 verb-PP types, when only counting collocations. The horizontal axis is the rank (Tables 9 to 11 give the 11 top ranks), the vertical axis is the cumulative fraction, i.e. the coverage until this rank

Fraction = Non-cumulative Coverage

0.05

L+M S Native

0.04

0.03

0.02

0.01

0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 Rank

Figure 6.  Non-cumulative coverage of verb-PP constructions. The horizontal axis is the rank (Tables 9 to 11 give the 11 top ranks), the vertical axis is the cumulative fraction, i.e. the coverage until this rank

 Gerold Schneider & Gintarė Grigonytė

The indications that high-level L2 speakers use considerably more idiomatic expressions than lower level speakers is in line with Ohlrogge (2009), who shows that “the use of collocations and idioms has been shown to be strongly associated with higher proficiency students. The higher the writing proficiency grade obtained, the more likely candidates were to use these linguistic features” ­(Ohlrogge 2009: 383). Nevertheless, the fact they use even more collocations than native speakers was unexpected. The effect is likely to be a professional style, but with a slightly lower information load, e.g. without using compression techniques and metaphorical allusions as native speakers might. In this section, we have investigated one particular construction at the syntactic level. The fact that the most frequent collocations are so frequent is on the one hand due to the fact that language learner tend to overuse the most central collocations, on the other hand it may also indicate that their vocabulary, including their repository of non-compositional constructions, is smaller. We have tried to counteract this danger by splitting the data into idioms and fully compositional constructions. Still, the danger remains that the influence of lexis is too strong. We thus would like to use a model at a higher level. The next higher level in a syntactic representation are the pre-terminals, the POS tags, a level at the morphologysyntax interface which is still dominated be the idiom principle but abstracts away from lexis to word-class sequences. 7.  Part-of-Speech tagging model We have introduced and applied surprisal (Levy & Jaeger 2007) at the word sequence level in Section 5. Similarly, we could also measure surprisal at the level of POS tag sequences. We suspect, however, that this would not be a reliable measure, for the following reason: if a really surprising sequence occurs, the automatic tagger is likely to assign wrong tags to some of the words. It is probably more revealing to measure how surprised and confused the language processing model of the tagger is. 7.1  Method We are using a model of word classes (pre-terminals) and words (terminals) in interaction. Instead of using surprisal at the part-of-speech tag level, we used the model fit of the tagger as a measure of surprise and confusion of the tagger. To achieve this, we employ the confidence which taggers emit in addition to the most likely tag. Such an approach uses the model fit as a signal, i.e. it reports how confidently a model can make predictions. Areas of low confidence for word class assignment typically indicate low model fit, high entropy, lack of formulaicity, in other words unexpected sequences and therefore conflicts



Chapter 2.  From lexical bundles to surprisal and language models 

with the idiom principle as far as it is represented in the language processing model of the tagger. Several authors report such a correlation between part-ofspeech sequences and reading times (Frank & Bod 2011; Fossum & Levy 2012), so the assumption that proficient speakers and hearers of a language use abstract knowledge at this level is reasonable. 7.2  Results We apply the Tree-Tagger (Schmid 1994), which is trained on the Penn Treebank (Marcus et al. 1993), to the original and the corrected texts of NICT JLE. The mean probability of the top reading is 96.8% for the original, and 97.1% for the corrected text. We show two original sentences and their corrected counterparts, as illustrative examples, in (3) and (4). The words are given in the first line, the POS tag in the second, and the tagger confidence for the POS tag in the third line, in bold where the original and the corrected versions differ. (3) ORIGINAL But not much of complication . CC RB JJ IN NN SENT 0.99 1 0.885 1 1 1 CORRECTED But not many complications . CC RB JJ NN SENT 0.99 1 0.999 1 1 (4) ORIGINAL But only thing we were wondering was where were the CC JJ NN PP VBD VVG VBD WRB VBD DT 0.994 0.907 1 1 1 1 1 1 1 1 Japanese people . JJ NNS SENT 0.994 1 1 CORRECTED But the only thing we were wondering was where CC DT JJ NN PP VBD VVG VBD WRB 0.999 1 0.988 1 1 1 1 1 1 Japanese people were . JJ NNS VBD SENT 0.995 1 1 1

If we look at the distribution of the confidence probabilities, p = 1, i.e. full confidence of the tagger, is far the most frequent value, but it is more frequent in the

 Gerold Schneider & Gintarė Grigonytė

corrected text: 54.7% of the words in the original material get a POS tag with p = 1, compared to 55.6% in the corrected material. The original material has higher occurrences of all levels of probability scores 500 relations from original and the corresponding corrected sentences in the NICT JLE corpus

Figure 8 shows that the error rate has significantly decreased on corrected text, it has almost halved. As expected, the parser performs less well on the original learner utterances, which contain many errors, but better on the same learner utterances after many of the mistakes have been corrected. Illustrative examples are given in Figures 9 and 10. The mistagging of play as a noun in Figure 9 by the automatic parser does not mean that a human parser, who has far more resources to avoid misunderstandings, would also make such an error. But the fact that the tagger, which is trained on large amounts of realworld context data, suggests a noun here, indicates that the interpretation as a



Chapter 2.  From lexical bundles to surprisal and language models 

verb is surprising in this context based on previous experience, and human readers or listeners may show slightly increased processing load or minimal delays in comprehension. Table 12 shows more examples of original learner utterances and their corrected counterparts. We can see that the parser’s analyses of the corrected sentences are considerably better, which confirms our hypothesis that the model, which was trained on native speaker language, makes more accurate predictions on the corrected utterances than on the original utterances.

subj

sentobj

subj

I_PRP I I_PRP PRP 1

think_VBP think think_VBP VBP 2

subj

they_PRP they they_PRP PRP 3

obj

play_VB play play_VB VB 4

sentobj

subj

I_PRP I I_PRP PRP 1

baseball_NN baseball baseball_NN NN 5

think_VBP think think_VBP VBP 2

they_PRP they they_PRP PRP 3

obj

play_VB play play_VB VB 4

baseball_NN baseball baseball_NN NN 5

Figure 9.  Example of original (above) and corrected (below) learner utterance from the NICT JLE corpus

 Gerold Schneider & Gintarė Grigonytė

subj

bridge

conj

modrel

modpp

obj prep

the_DT, man NN man man_NN NN 2

And_CC And And_CC CC 1

is_VBZ, asking VBG ask asking VBG VBG 3

whaLWP what what_WP WP 4

subj

subj

name_NN of_IN name of name_NN of_IN NN IN 5 6

the DT, wine NN wine wine NN NN 7

it_PRP is_VBZ it be it_PRP is_VBZ VBZ PRP 8 9

sentobj

conj

obj

subj

modpp prep

And_CC And And_CC CC 1

the_DT, man NN man man_NN NN 2

is_VBZ, asking VBG ask asking VBG VBG 3

what_WP what what_WP WP 4

the_DT, name_NN name name_NN NN 5

of_IN of of_IN IN 6

the DT, wine NN wine wine NN NN 7

is_VBZ be is_VBZ VBZ 8

Figure 10.  Parser output for an original and its corresponding corrected L2 sentences from the NICT JLE corpus

8.3  Parser model fit Our second hypothesis is that L2 utterances, particularly those produced by lowlevel speakers, do not fit the processing model very well. This applies equally to the human listener and to the computational L1 based parser model, which we use as language processing model. In the case of the computational parser model, the less fitting L2 utterances lead to lower automatic parser scores, indicating ambiguity, potential ungrammaticality and less native-like language command. Millar (2011) showed that non-native like idioms by L2 speakers lead to higher processing load for human parsers. Keller (2003) showed that ungrammatical structures lead to consistently lower parser scores of automatic parsers and thus suggested (Keller 2010) that parsers can be used as a psycholinguistic model of a native speaker. Probabilitybased scores of automatic parsers, which are originally intended for disambiguation and ranking of parsing candidates can be used as measures of surprise and L2 model fit to a model trained on L1. A higher parser score indicates that: –– the utterance matches the expectation of an L1-based language processing model, and a particular syntactic parse,



Chapter 2.  From lexical bundles to surprisal and language models 

–– the lexical items, as they are used in combination in the corrected utterances more strongly point to a certain analysis than the uncorrected utterances; A low parser score indicates that: –– the utterance is unexpected by the model, –– the parser cannot map it well to any known syntactic analysis; The examples  (1)–(4) are given again in Table  12, this time comparing parser scores of original and corrected version from the NICT JLE corpus. In the last sentence, the corrected version obtains a lower score, but this is partly due to the fact that scores depend on sentence length. Table 12.  Parser scores obtained from original and corrected NICT JLE sentences Version (1) (2) (3)

(4)

Sentence

Score

ORIG

Usually , I go to the library , and I rent these books .

 5054.31

CORR

Usually , I go to the library , and I borrow these books .

 8956.83

ORIG

For example , at summer , I can enjoy the sea and breeze .

 7186.86

CORR

For example , in summer , I can enjoy the sea and breeze .

 8965.99

ORIG

so I will go to the Shibuya three o ‘ clock , nannda , before Hachikomae .

  176.172

CORR

so I will go to Shibuya at three o ‘ clock , nannda , in front of Hachikomae .

12787.4

ORIG

The computer game is very violence in today , but I do n’t like it .

 6570.44

CORR

Computer games are very violent today , but I do n’t like them .

  161.753

We have analyzed and compared parser scores between original and corrected sentences by sentence length (sentence length is measured in chunks) in Figure 11. 9.  Conclusions and outlook In this paper, we have set out to measure the idiom principle in Learner English, in particular if Pawley and Syder’s (1983)’s hypothesis holds, if surprisal can be used as a measure of bundleness, and if we can also approximate a measure of the open-choice principle. We have first extracted lexical bundles from BNC genres, using frequency and collocation as measures. We defended the use of collocation measures by indicating that many collocation are rare. Then we suggested the use of surprisal (Levy & Jaeger 2007) as a general and gradient measure, abstracting away from individual bundles

 Gerold Schneider & Gintarė Grigonytė 4.5E+10

Corrected Original

4E+10

Parser score

3.5E+10 3E+10 2.5E+10 2E+10 1.5E+10 1E+10 5E+09 0

1–5

5–10

10–15

15–20

20–25

25–30

Sentence length

Figure 11.  Parser score by sentence length, measured in chunks, comparing original and corrected utterances

to generic bundleness. We observed differences between BNC spoken and written which indicate that Uniform Information Density (UID) may partly be a planning help in language production under time constraints and a noisy oral channel as medium, rather than a generic help to understand. While it holds in spoken language, in the compressed genre (Conrad & Biber 2004) of scientific writing it does not hold when measured at the word level. We then compared surprisal between original utterances of L2 speakers and the corrected utterances, and between L2 and L1 speakers. The results confirm Pawley and Syder’s (1983) claim that L1 speakers know best how to play the game of fixedness and expressiveness. When comparing L2 learner levels we obtained less clear results. We could show in Section 6, for the verb-preposition constructions, that L2 speakers on the one hand overuse bundles in the form of the most frequent prefabricated forms while underusing rarer ones, as Granger (2009) had suggested. Our third goal was to move on from lexical sequences to higher but equally psycholinguistically adequate levels of abstraction. This procedure first led to the use of a morphosyntactic sequence-based model, a POS tagger, which offers a model for word classes (pre-terminals) and words (terminals) in interaction. Instead of using surprisal, we used the model fit of the tagger as a measure



Chapter 2.  From lexical bundles to surprisal and language models 

of surprise and confusion of the tagger. The results again confirm Pawley and Syder (1983) and can be seen as an avoidance of ambiguity strategy in line with UID, Sinclair’s idiom principle places its heavy performance constraints on all the creative but rare options which language competence offers. The scientific genre has higher tagger confidence than spoken language, partly because spoken language contains false starts etc., partly because the tagger was trained on written language. Higher levels of abstraction also need to include a model which steps up from sequences to syntactic hierarchy, to include Sinclair’s open choice principle as far as it applies. We thus outline the use of a syntactic parser as a language processing model in the final section. The hypotheses that parsers have lower performance and confidence scores on original versus corrected utterances are confirmed. Our current paper is in many ways a pilot study, showing new approaches to cognitive linguistics. We have suggested surprisal as a measure of the idiom principle, a syntactic parser as a model of the open-choice principle, and a POS tagger in between: a model of pre-terminal sequences. Each of these three describes a level at which listener expectations need to be met up to a point to ensure that Shannon’s channel does not become too noisy. Applications for word-sequences exist already, for example in the research of grammatical error correction (Ng et al. 2014). We envisage cognitive linguistics applications, for example by using parsers as cognitive processing models (Keller 2010): for successful communication, we avoid formulations leading to high entropy. Further, we would like to elaborate the question whether increased entropy generally correlates with increased ambiguity for the human reader, as Millar (2011) has shown for learner idioms. We also intend to continue our research by combining our approaches, and by correlating our metrics to psycholinguistic metrics from self-paced reading and eye-tracking experiments, for example using the publicly accessible data in Frank et al. (2013).

References

Aggarwal, Charu C. 2013. Outlier Analysis. Dordrecht: Kluwer.  doi: 10.1007/978-1-4614-6396-2
Altenberg, Bengt & Tapper, Marie. 1998. The use of adverbial connectors in advanced Swedish learners' written English. In Learner English on Computer, Sylviane Granger (ed.), 80–93. London: Addison Wesley Longman.
Aston, Guy & Burnard, Lou. 1998. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh: EUP.
Bartsch, Sabine & Evert, Stefan. 2014. Towards a Firthian notion of collocation. In Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern [OPAL – Online publizierte Arbeiten zur Linguistik 2/2014], Andrea Abel & Lothar Lemnitzer (eds), 48–61. Mannheim: Institut für Deutsche Sprache.
Biber, Douglas. 2003. Compressed noun-phrase structures in newspaper discourse: The competing demands of popularization vs. economy. In New Media Language, Jean Aitchison & Diana Lewis (eds), 169–181. London: Routledge.
Biber, Douglas. 2009. A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing. International Journal of Corpus Linguistics 14(3): 275–311.  doi: 10.1075/ijcl.14.3.08bib
Biber, Douglas & Barbieri, Federica. 2007. Lexical bundles in university spoken and written registers. English for Specific Purposes 26: 263–286.  doi: 10.1016/j.esp.2006.08.003
Biber, Douglas, Conrad, Susan & Cortes, Viviana. 2004. If you look at…: Lexical bundles in university teaching and textbooks. Applied Linguistics 25: 371–405.  doi: 10.1093/applin/25.3.371
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward. 1999. Longman Grammar of Spoken and Written English. London: Longman.
Bonk, William J. 2000. Testing ESL learners' knowledge of collocations. Urbana IL: Clearinghouse.
Cheng, Winnie, Greaves, Chris, Sinclair, John McH. & Warren, Martin. 2009. Uncovering the extent of the phraseological tendency: Towards a systematic analysis of concgrams. Applied Linguistics 30(2): 236–252.  doi: 10.1093/applin/amn039
Conrad, Susan & Biber, Douglas. 2004. The frequency and use of lexical bundles in conversation and academic prose. Lexicographica 20: 56–71.
Ellis, Nick C. 2002. Frequency effects in language processing. Studies in Second Language Acquisition 24(2): 143–188.
Ellis, Nick C. & Frey, Eric. 2009. The psycholinguistic reality of collocation and semantic prosody (2): Affective priming. Formulaic Language 2: 473–497.  doi: 10.1075/tsl.83.13ell
Ellis, Nick C., Frey, Eric & Jalkanen, Isaac. 2009. The psycholinguistic reality of collocation and semantic prosody (1): Lexical access. In Exploring the Lexis-Grammar Interface [Studies in Corpus Linguistics 35], Ute Römer & Rainer Schulze (eds), 89–114. Amsterdam: John Benjamins.  doi: 10.1075/scl.35.07ell
Ellis, Nick C., Simpson-Vlach, Rita & Maynard, Carson. 2008. Formulaic language in native and second language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly 42(3): 375–396.  doi: 10.1002/j.1545-7249.2008.tb00137.x
Erman, Britt. 2009. Formulaic language from a learner perspective: What the learner needs to know. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 323–346. Amsterdam: John Benjamins.  doi: 10.1075/tsl.83.05erm
Erman, Britt & Warren, Beatrice. 2000. The idiom principle and the open choice principle. TEXT 20(1): 29–62.
Evert, Stefan. 2009. Corpora and collocations. In Corpus Linguistics: An International Handbook, Anke Lüdeling & Merja Kytö (eds), 1212–1248. Berlin: Mouton de Gruyter.  doi: 10.1515/9783110213881.2.1212
Fossum, Victoria & Levy, Roger. 2012. Sequential vs. hierarchical models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012), Roger Levy & David Reitter (eds), 61–69. Montreal: Association for Computational Linguistics.
Frank, Stefan L. & Bod, Rens. 2011. Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science 22(6): 829–834.  doi: 10.1177/0956797611409589
Frank, Stefan L., Fernandez Monsalve, Irene, Thompson, Robin L. & Vigliocco, Gabriella. 2013. Reading-time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods 45: 1182–1190.  doi: 10.3758/s13428-012-0313-y
Gildea, Daniel. 2001. Corpus variation and parser performance. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP), 167–202. Pittsburgh PA.
Granger, Sylviane. 2009. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis, and Applications, Anthony P. Cowie (ed.), 185–204. Tokyo: Kurosio.
Granger, Sylviane & Tyson, Stephanie. 1996. Connector usage in the English essay writing of native and non-native EFL speakers of English. World Englishes 15(1): 17–27.  doi: 10.1111/j.1467-971X.1996.tb00089.x
Gries, Stefan Th. 2010. Useful statistics for corpus linguistics. In A Mosaic of Corpus Linguistics: Selected Approaches, Aquilino Sánchez & Moisés Almela (eds), 269–291. Frankfurt: Peter Lang.
Gries, Stefan Th. 2013. 50-something years of work on collocations: What is or should be next… International Journal of Corpus Linguistics 18(1): 137–166. Special issue Current Issues in Phraseology, Sebastian Hoffmann, Bettina Fischer-Starcke & Andrea Sand (eds).  doi: 10.1075/ijcl.18.1.09gri
Hoey, Michael. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.  doi: 10.4324/9780203327630
Ishikawa, Shin. 2009. Vocabulary in interlanguage: A study on corpus of English essays written by Asian university students (CEEAUS). In Phraseology, Corpus Linguistics and Lexicography: Papers from Phraseology 2009 in Japan, Katsumasa Yagi & Takaaki Kanzaki (eds), 87–100. Nishinomiya: Kwansei Gakuin University Press.
Izumi, Emi, Uchimoto, Kiyotaka & Isahara, Hitoshi. 2005. Error annotation for corpus of Japanese learner English. In Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC 2005).
Jaeger, T. Florian. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology 61(1): 23–62.  doi: 10.1016/j.cogpsych.2010.02.002
Keller, Frank. 2003. A probabilistic parser as a model of global processing difficulty. In Proceedings of the 25th Annual Conference of the Cognitive Science Society, Richard Alterman & David Kirsh (eds), 646–651. Boston MA: Cognitive Science Society.
Keller, Frank. 2010. Cognitively plausible models of human language processing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Short Papers, 11–16 July, 60–67. Uppsala: Uppsala University.
Kennedy, Chris & Thorp, Dilys. 2007. A corpus investigation of linguistic responses to an IELTS Academic Writing task. In IELTS Collected Papers: Research in Speaking and Writing Assessment, Linda Taylor & Peter Falvey (eds), 316–378. Cambridge: CUP.
Kopaczyk, Joanna. 2012. Applications of the lexical bundles method in historical corpus research. In Corpus Data across Languages and Disciplines, Piotr Pęzik (ed.), 83–95. Frankfurt: Peter Lang.
Lee, David Y. W. 2001. Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology 5(3): 37–72.
Leech, Geoffrey. 2000. Grammars of spoken English: New outcomes of corpus-oriented research. Language Learning 50(4): 675–724.  doi: 10.1111/0023-8333.00143
Lehmann, Hans Martin & Schneider, Gerold. 2011. A large-scale investigation of verb-attached prepositional phrases. In Studies in Variation, Contacts and Change in English, Vol. 6: Methodological and Historical Dimensions of Corpus Linguistics, Sebastian Hoffmann, Paul Rayson & Geoffrey Leech (eds). Helsinki: VARIENG.
Levy, Roger & Jaeger, T. Florian. 2007. Speakers optimize information density through syntactic reduction. In Advances in Neural Information Processing Systems (NIPS) 19, Bernhard Schölkopf, John Platt & Thomas Hoffman (eds), 849–856. Cambridge MA: The MIT Press.
Lorenz, Gunter R. 1999. Adjective Intensification – Learners versus Native Speakers: A Corpus Study of Argumentative Writing. Amsterdam: Rodopi.
Malvern, David D., Richards, Brian J., Chipere, Ngoni & Durán, Pilar. 2004. Lexical Diversity and Language Development. Houndmills: Palgrave Macmillan.  doi: 10.1057/9780230511804
Marcus, Mitchell P., Santorini, Beatrice & Marcinkiewicz, Mary Ann. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19: 313–330.
McEnery, Tony, Xiao, Richard & Tono, Yukio. 2006. Corpus-based Language Studies: An Advanced Resource Book [Routledge Applied Linguistics Series]. London: Routledge.
Millar, Neil. 2011. The processing of malformed learner collocations. Applied Linguistics 32(2): 129–148.  doi: 10.1093/applin/amq035
Nattinger, James R. 1980. A lexical phrase-grammar for ESL. TESOL Quarterly 14(3): 337–344.  doi: 10.2307/3586598
Nesselhauf, Nadja. 2003. The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics 24(2): 223–242.  doi: 10.1093/applin/24.2.223
Ng, Hwee Tou, Wu, Siew Mei, Briscoe, Ted, Hadiwinoto, Christian, Hendy Susanto, Raymond & Bryant, Christopher (eds). 2014. Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task. Baltimore MD: Association for Computational Linguistics.  doi: 10.3115/v1/W14-17
NICT. 2012. Japanese Learner English Corpus (JLE, Version 4.1, 2012).
Ohlrogge, Aaron. 2009. Formulaic expressions in intermediate EFL writing assessment. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 375–386. Amsterdam: John Benjamins.  doi: 10.1075/tsl.83.07ohl
Pawley, Andrew & Hodgetts Syder, Frances. 1983. Two puzzles for linguistic theory: Native-like selection and native-like fluency. In Language and Communication, Jack C. Richards & Richard W. Schmidt (eds), 191–226. London: Longman.
Pecina, Pavel. 2009. Lexical Association Measures: Collocation Extraction [Studies in Computational and Theoretical Linguistics 4]. Prague: Institute of Formal and Applied Linguistics, Charles University in Prague.
Read, John & Nation, Paul. 2006. An investigation of the lexical dimension of the IELTS speaking test. In IELTS Research Reports, Vol. 6, Petronella McGovern & Steve Walsh (eds). IELTS Australia and British Council.
Ronan, Patricia & Schneider, Gerold. 2015. Determining light verb constructions in contemporary British and Irish English. International Journal of Corpus Linguistics 20(3): 326–354.  doi: 10.1075/ijcl.20.3.03ron
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing. Manchester.
Schneider, Gerold. 2008. Hybrid Long-Distance Functional Dependency Parsing. PhD dissertation, University of Zurich.
Seretan, Violeta. 2011. Syntax-Based Collocation Extraction. Dordrecht: Springer.  doi: 10.1007/978-94-007-0134-2
Shannon, Claude E. 1951. Prediction and entropy of printed English. The Bell System Technical Journal 30: 50–64.  doi: 10.1002/j.1538-7305.1951.tb01366.x
Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: OUP.
Sinclair, John McH. & Mauranen, Anna. 2006. Linear Unit Grammar: Integrating Speech and Writing [Studies in Corpus Linguistics 25]. Amsterdam: John Benjamins.  doi: 10.1075/scl.25
Siyanova-Chanturia, Anna & Martinez, Ron. 2014. The Idiom Principle revisited. Applied Linguistics 36(5): 549–569.
Zipf, George Kingsley. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. London: Addison-Wesley.
Zipf, George Kingsley. 1965. The Psycho-Biology of Language: An Introduction to Dynamic Philology. Cambridge MA: The MIT Press.

chapter 3

Fine-tuning lexical bundles
A methodological reflection in the context of describing drug-drug interactions

Łukasz Grabowski
University of Opole

This chapter has two major aims. First, it attempts to extend earlier research on recurrent phraseologies used in the pharmaceutical field (Grabowski 2015) by exploring the use, distribution and functions of lexical bundles found in English texts describing drug-drug interactions. Conducted from an applied perspective, the study uses 300 text samples extracted from the DrugDDI Corpus, originally collected in the DrugBank database (Segura-Bedmar et al. 2010). Apart from presenting new descriptive data, the second aim of the chapter is to reflect on the ways lexical bundles have typically been explored across different text types and genres. The problems discussed in the chapter concern the methods used to deal with structurally incomplete bundles, filter out overlapping bundles, and select, for the purposes of qualitative analyses, a representative sample of bundles other than the most frequent ones. The chapter is therefore meant to help researchers fine-tune the methodologies used to explore lexical bundles depending on the specificity of the research material, the research questions and the scope of the analysis.

Keywords: corpus-driven approach; lexical bundles; pharmaceutical texts; drug-drug interactions

1.  Introduction

It is common knowledge that when searching for recurrent sequences of words, corpus linguists study authentic texts rather than the systems of natural languages. In fact, corpus linguists, notably phraseologists, lexicographers, and specialists in SLA or FLA, capitalize on the data found in texts and attempt generalizations as to which data items constitute so-called language units, that is, components of the lexicon of a natural language, to be further entered in a dictionary, used in teaching or applied otherwise. Hence, the search for multi-word units in texts reveals two types of data: language units (jednostki języka) and language products (produkty językowe), a distinction originally proposed by Bogusławski (1976), a Polish linguist, lexicographer and philosopher of language.1 The former constitute semantically indivisible arrangements of diacritical elements reproduced by language users as ready-made single items in the process of text production (Grochowski 1981: 31); they also constitute the building blocks of language products (Grochowski 1981: 34), which are free combinations or syntagmatic associations of language units. Using corpus-driven methodology to study texts, corpus linguists usually explore recurrent sequences of words, be they contiguous or non-contiguous, which are typically language products or, at best, only potential language units. In recent years, corpus linguistic research has revealed many types of such language products (e.g. n-grams, clusters, lexical bundles, phrase frames),2 which are typically non-phrasal in structure and which are not readily available form-and-meaning mappings.3 That said, corpus linguists studying phraseology are keenly interested in frequent and statistically significant multi-word patterns in which particular words occur (Moon 2007: 1046).4

Lexical bundles are one type of recurrent multi-word unit found in texts. Defined as sequences of three or more word forms that recur frequently in natural discourse in a variety of spoken and written contexts (e.g. I don't think, as a result, the nature of the), they constitute a unit of linguistic analysis first proposed by Biber, Johansson, Leech, Conrad and Finegan (1999: 989–991). In practice, studies of lexical bundles foreground inconspicuous, perceptually non-salient multi-word sequences that occur with high frequency in texts.

1.  Bogusławski (1976: 357) poses a central question, namely which of the utterances found in texts are actually language units and which are not, being parts of other language units or combinations of language units [the original reads: co w masie wypowiedzeń, z którymi mamy do czynienia w tekstach, jest jednostką języka, a co nią nie jest (będąc bądź częścią jakiejś jednostki, bądź połączeniem jednostek)]. In the same paper, Bogusławski (1976: 359–362) describes a substitutive test (test substytucyjny) that, among other tests proposed in his later papers, may be used to distinguish between language products and language units.
2.  Bednarek (2014: 58) argues that n-grams (including clusters, chains, lexical bundles etc.) constitute recurring syntagmatic combinations of words automatically generated by computer software, which means that "they do not necessarily have grammatical, semantic or pragmatic status" in the way that idioms, proverbs, sayings, clichés, catchphrases etc. do, all of which occur with relatively low frequencies in texts.
3.  According to Kopaczyk (2013: 54), "the reasons for that should be sought in pragmatics and discourse structure, as well as in language processing".
4.  As early as 1989, Bogusławski argued that phraseologisms are in fact "word combinations with significant frequency" (frekwencyjnie istotne ciągi wyrazów) (1989: 13–14).




Since lexical bundles are "combinations of words that in fact recur most commonly in a given register" (Biber, Johansson, Leech, Conrad & Finegan 1999: 992), it is no surprise that they constitute important building blocks of specialist discourses (e.g. academic or legal), as illustrated by a number of studies (e.g. Biber 2006; Biber & Barbieri 2007; Biber, Conrad & Cortes 2004; Breeze 2013; Goźdź-Roszkowski 2011; Hyland 2008; Kopaczyk 2012, 2013). However, studies focusing on the description of recurrent linguistic patterns in pharmaceutical texts are either scarce (Grabowski 2015) or dispersed as fragments of larger studies on medical, biomedical or scientific discourse (e.g. Gledhill 2000; Salazar 2011, 2014).

In view of the above, this chapter has two main aims. First, it attempts to provide a preliminary description of the use, distribution and discourse functions of lexical bundles found in English pharmaceutical texts describing drug-drug interactions, that is, situations "whereby either the pharmacokinetics or the pharmacodynamics of one drug is altered by another" (Rowland 2008: 1), so that one drug affects the action of another.5 In fact, drug-drug interactions account for 6 to 30% of all adverse reactions, which is why they constitute a particularly significant problem in clinical practice (Ionescu & Caira 2005: 296). Consequently, it is essential that professionals, that is, researchers and practitioners in the pharmaceutical field (e.g. pharmacists, hospital pharmacists, laboratory technicians), notably non-native speakers of English, be familiar with the recurrent multi-word units used to describe drug-drug interactions. With this rationale in mind, the study presented in this chapter extends earlier research on lexical bundles across other English pharmaceutical text types (Grabowski 2015).

However, apart from presenting new descriptive data on the recurrent phraseologies used in the pharmaceutical domain, the second aim of this chapter is to reflect on the ways lexical bundles have typically been explored so far across various text types and genres. The issues addressed in this study pertain to the methods used to deal with structurally incomplete or overlapping bundles, and to select, from the multitudinous bundles identified in a data-driven way, a representative sample for further qualitative analyses. These problems are discussed later in the chapter using specific examples and a case study. That is why this contribution is primarily intended to help researchers fine-tune the methods used to explore lexical bundles depending on the specificity of the research material, the research questions and the scope of the analysis.

5.  According to Gallicano and Drusano (2005: 3), the most commonly encountered or perceived interactions occur between two drugs. However, one may note a growing interest in the study of drug-drug interactions because of the rise in polypharmacy, that is, taking multiple drugs together in the course of a day (Huang, Lesko & Temple 2008: 665).


2.  Methodology: What we know about and usually do with lexical bundles

As proposed by Biber et al. (1999), the criteria used to extract lexical bundles from texts are orthographic and distributional. More specifically, these criteria refer to the length (in running words) of a lexical bundle, a frequency cut-off point6 (usually a normalized frequency of occurrence per 1 million words), and the number of texts in which a contiguous sequence of words, uninterrupted by punctuation marks, must occur (typically 3–5 texts representing a given register). Kopaczyk (2013: 155) refers to this last criterion as the "token-to-file ratio". In practice, the parameters of the three criteria have been further modified by researchers. For example, Hyland (2008: 8) treats as lexical bundles only those uninterrupted sequences of words which appear in at least 10% of the texts representing a given register (rather than in 3–5 texts from a given register); Chen and Baker (2010) exclude from their analyses those bundles which are highly context-dependent or contain proper names, the reason being that such bundles typically inflate quantitative results (Chen & Baker 2010: 33); and Granger (2014) and Grabowski (2014) show that the criteria used to identify bundles in texts written in languages other than English should be further modified in view of typological and systemic differences.

In practice, however, the conventional criteria used to extract lexical bundles from texts are often not sufficient to identify relevant items for a particular applied purpose (e.g. teaching a foreign language, translation practice or dictionary compilation). In the context of English language teaching, Simpson-Vlach and Ellis (2010: 490–491) propose a method for deriving pedagogically useful formulas using a combination of quantitative and qualitative criteria, such as corpus statistics (frequency information and measures of the strength of association between words, such as the MI-score7 and the LL-statistic8), psycholinguistic processing metrics and instructor insights (i.e. rating lexical bundles in terms of their perceived level of formulaicity, cohesive meaning or function, as well as pedagogical relevance).9 To give another example, Salazar (2011) employs ten additional syntactic and semantic criteria in order to obtain a more refined and pedagogically useful list of 3–6 word lexical bundles for teaching scientific writing in English (Salazar 2011: 48–50). In fact, it is difficult to expect that language learners would easily learn and use formulaic sequences of words with no clear functional roles (Appel & Trofimovich 2015: 4). More precisely, Appel and Trofimovich (2015: 4) argue that since lexical bundles often have no clear meanings or functions, one should not take their pedagogical utility for granted. This shows that the usual criteria employed to identify lexical bundles may be sufficient for descriptive purposes, yet additional criteria may be required to identify the bundles relevant for specific applied purposes.

6.  See Kopaczyk (2013: 153) for an overview of frequency thresholds employed in selected studies of lexical bundles. Also, Cortes (2015: 205) explains certain problems related to the normalization of frequency data across large and small corpora.
7.  Mutual information score (MI-score) is a measure of collocational strength "computed by dividing the observed frequency of the co-occurring word in the defined span for the search string by the expected frequency of the co-occurring word in that span, and then taking the logarithm to the base of 2 of the result" (McEnery, Xiao & Tono 2006: 56). As a rule, the higher the score, the stronger the link between the two words; in practice, collocations with high MI-scores often include combinations of low-frequency words (McEnery et al. 2006: 56).
8.  Simpson-Vlach and Ellis (2010: 492) used the LL-statistic (that is, the log-likelihood statistic) to compare the frequencies of recurrent multi-word units across the corpora under study. The LL-statistic is a measure of statistical significance that does not assume normally distributed data; it uses the asymptotic distribution of the generalized likelihood ratio (Dunning 1993: 6). This non-parametric test enables one to conduct comparisons between corpora of different sizes, particularly those consisting of smaller volumes of text than is necessary for conventional tests based on an assumed normal distribution; thanks to this, LL is reliable even with very low frequencies, that is, lower than 5 (Dunning 1993: 6; Rayson & Garside 2000: 2).
9.  This results in a metric called "formula teaching worth (FTW)" (Simpson-Vlach & Ellis 2010: 495–496).

Once extracted from texts, lexical bundles display a number of features that make them distinct phraseologies. Although Biber et al. (1999: 991) note that lexical bundles are commonly parts of noun phrases and prepositional phrases, they are typically incomplete structural units, falling into several structural types or bordering on two or three structural types (e.g. I don't know why, the nature of the). Also, Kopaczyk (2012: 5; 2013: 54 & 63) notes that lexical bundles are often either smaller than a phrase (notably, short bundles consisting of three or four words) or larger than a phrase (indicating complementation patterns of phrases). In a similar vein, Stubbs and Barth (2003: 81) argue that some lexical bundles, referred to in their study as "chains", are not complete syntactic units, yet they may contain one; some strongly predict a complete syntactic unit; and some are not necessarily pre-constructed. In fact, the proportion of structurally complete bundles is highly variable across registers and genres. For example, only 15% of lexical bundles in conversations are complete structural units, while in academic prose this figure is even lower, a mere 5%, with most bundles being parts of longer noun phrases or prepositional phrases (Biber et al. 1999: 995).

Depending on their composition, lexical bundles are either multi-word collocations or multi-word formulaic sequences (Biber 2009: 286–290). Typically represented by technical terms, the former are composed of content words only, are strongly associated statistically (i.e. they have high MI-scores) and occur with relatively low frequencies in texts, for example selective serotonin reuptake inhibitors, drug laboratory test interactions. Conversely, multi-word formulaic sequences consist of both function and content words, have low MI-scores and relatively high frequencies (Biber 2009: 289), for example the concomitant use of, can be minimized by. One may put forward the hypothesis that the more specialized the text type, the more multi-word collocations are found in it.

Lexical bundles "serve basic discourse functions [in texts] related to the expression of stance, discourse organization, and referential framing" (Biber & Barbieri 2007: 265). The bundles' specific functions and meanings typically differ across registers, text types or genres, depending on their communicative functions, target audience and other situational factors (Biber 2006: 174). That is why the functions of lexical bundles in texts are often register- or domain-specific; hence, it is difficult to develop a functional typology of formulaic sequences that is compact and, at the same time, specific enough to be applicable across corpora representing various text types, genres or domains of language use (Wray & Perkins 2000: 8). As demonstrated by many studies (e.g. Biber 2006; Goźdź-Roszkowski 2011; Hyland 2008), it has now become customary to tailor typologies of discourse or textual functions of multi-word units to specific research materials in order to capture the bundles' more fine-grained meanings and functions in specialist texts.

From a psycholinguistic perspective, lexical bundles, like n-grams, constitute only an intermediate form of representation as regards their status in the mental lexicon of language users (Rieger 2001: 171, cited in Stubbs & Barth 2003: 81). This means that lexical bundles represent "surface evidence of psycholinguistic units which are exploited in producing and interpreting fluent language use" (Stubbs & Barth 2003: 81). In a similar vein, Simpson-Vlach and Ellis (2010: 490) claim that "the fact that a formula is above a certain frequency threshold and distributional range does not necessarily imply either psycholinguistic salience or pedagogical relevance". For example, Schmitt, Grandage and Adolphs (2004) conducted a study aimed at testing the psycholinguistic validity of clusters, another label used with reference to recurrent n-grams, by studying the degree to which these recurrent contiguous sequences of words are stored in memory as single wholes. The results revealed that "frequency of occurrence is not closely related to whether a cluster is stored in the mind as a whole or not" and that "semantic and functional transparency does have a role to play in determining whether a recurrent cluster becomes stored in the mind" (Schmitt et al. 2004: 139).10

10.  In an attempt to offer a conceptual clarification, Myles and Cordier (2017: 10) propose a distinction between speaker-external formulaic sequences, that is, what language users consider to be formulaic in texts "outside the speaker" (because of formal, pragmatic or distributional properties), and speaker-internal formulaic sequences, that is, psycholinguistic units stored as single wholes by language users.

Also, Adolphs (2006: 58) notes that some recurrent sequences may be more meaningful than others, which may reflect the nature of the individual data sets from which they are extracted on the basis of their high frequency. All this means that recurrent lexical bundles, identified using a data-driven approach, are primarily textual units rather than units of the lexicon (or language system), and, consequently, they cast more light on text and discourse organization than on the composition of one's mental lexicon. In short, lexical bundles are primarily usage-based rather than system-based; more often than not, they are language products rather than language units, applying the division proposed by Bogusławski (1976).

3.  Lexical bundles approach: Is there any area for improvement?

Looking at various studies aimed at the exploration of lexical bundles, one may arrive at two contrasting observations. On the one hand, some researchers tend to replicate the methodologies used in earlier studies on lexical bundles to ensure that their results are comparable and compatible with each other. On the other hand, since there is no ideal methodology, it is often necessary to re-engineer and fine-tune the research methods in order to provide answers to specific research questions. In the case of lexical bundles, some of the methodological challenges, notably when undertaking applied research, concern the choice of methods to filter out overlapping or structurally incomplete bundles, or to select a representative sample of bundles for further qualitative analyses. These and other issues, pertaining to the subtle nature of lexical bundles, are discussed below.

3.1  How to deal with structurally incomplete and/or overlapping lexical bundles?

As mentioned earlier, the majority of lexical bundles constitute incomplete structural (syntactic) units, a situation that is not conducive to aligning the bundles' form with specific meanings or discourse functions. In other words, lexical bundles are usually not "self-contained", in the sense that they do not constitute readily available form-and-meaning mappings. To overcome this obstacle, one may attempt to determine whether a structurally incomplete bundle is a fragment of a longer, structurally complete bundle, and only then attempt to align its form with a discourse function.11 Liu (2012: 27) proposes yet another solution, namely describing structurally incomplete lexical bundles as multi-word constructions; for example, the structurally incomplete bundle this is the is formally described as a longer abstracted construction "this is + det + noun phrase" (Liu 2012: 27). Consequently, since multi-word constructions are more "self-contained" in terms of their meaning, it should be easier to align them with specific discourse functions.

11.  It is then necessary to present this information explicitly in the course of qualitative analyses (e.g. the bundle you have any problems is a fragment of the longer bundle if you have any problems, which may be readily aligned with the discursive function of introducing conditions).

This brings us to another, closely related issue, namely that longer lexical bundles frequently include shorter ones. This implies, as noted by Biber et al. (1999: 993), that the former are commonly formed through extension of or combination with the latter, for example if you have → if you have any → if you have any problems → if you have any problems with.12 Consequently, lexical bundles, notably ones used with similar frequencies, overlap with one another, and it is often difficult to specify the bundles' boundaries (e.g. if you have any, you have any problems, have any problems with), a situation that is particularly problematic for functional analyses.13 Researchers therefore usually modify the frequency threshold depending on the orthographic length of potential bundles. As a rule, the shorter the n-gram, the higher its frequency in texts, a phenomenon that may be attributed to the economy of language use. In a similar vein, the shorter the orthographic length of an n-gram, the more n-gram types are found in a text or corpus (Cortes 2015: 204; Kopaczyk 2013: 154). That is why the frequency threshold for longer bundles should be lower than that for shorter ones. It is then possible to manually filter out those shorter bundles that are fragments of longer ones. Such a solution is used, for example, by Chen and Baker (2010), who exclude shorter bundles overlapping with longer ones from the numerical counts and further qualitative analyses. Wood and Appel (2014: 5) propose yet another solution, namely "condensing overlapping structures"; for example, two overlapping 4-word lexical bundles, at the end of and the end of the, could be presented as one condensed sequence, such as (at) the end of (the). In the same study, Wood and Appel (2014: 5) argue that formally similar bundles of the same length could also be presented as shorter bundles with variable slots in the initial or final position, for example as a result [the/of] (ibid.). The latter proposal resembles the concept of phrase frames, defined by Fletcher (2002–2007) as a "set of variants of an n-gram identical except for one word".

12.  Such bundles are often neighbours or near-neighbours on the frequency list.
13.  The same point is raised by Kopaczyk (2013: 157), who proposes a conceptual clarification with respect to the issue of overlapping lexical bundles. More specifically, Kopaczyk (2013: 156–157) introduces two labels, namely "syntagmatic overlap" (a situation in which a given bundle includes a fragment of a preceding bundle) and "paradigmatic overlap" (a situation in which a longer bundle includes a shorter one).

Another approach to specifying the bundles' boundaries is to introduce additional frequency thresholds or other metrics developed to measure associations between words. Although Biber (2009) showed that the MI-score constitutes an unreliable measure of the formulaicity of word sequences (as it fails to measure the likelihood of co-occurrence of words in a particular word order),14 one may try employing a directional measure of word association called "transitional probability" (Appel & Trofimovich 2015: 10–11). Designed specifically to locate utterance boundaries and tested on a sample of 100 four-word items extracted from the BNC, transitional probability is intended to help predict accurate sequence completion and, consequently, "reduce the incidence of overlapping, incomplete, and overly extended structures identified as FSs [formulaic sequences]" (Appel & Trofimovich 2015: 6). In practice, the calculation boils down to dividing the frequency of a longer n-gram by the frequency of each of its two shorter components; the lower the score, called either a backward or a forward transitional probability (BTP and FTP respectively), the more probable it is that the shorter sequence of words is a "complete" one and, hence, more functionally salient (Appel & Trofimovich 2015: 11). For example, in the sample of the DrugDDI Corpus (Segura-Bedmar, Martinez & de Pablo-Sanchez 2010) used in this study, I found the following overlapping sequences of words: had no effect on the (23 occurrences in 17 texts), had no effect on __ (38 occurrences in 25 texts), and __ no effect on the (34 occurrences in 26 texts). This results in a BTP score of 0.676 (23/34) and an FTP score of 0.605 (23/38), which means that the sequence had no effect on is the more "complete" one in the corpus.15 Although promising, the metric has not been tested in a comprehensive manner so far, that is, using smaller corpora with texts restricted with respect to genre, register or specialist domain (Appel & Trofimovich 2015: 15–16).

14.  In short, Biber (2009: 289–290) revealed that the MI-score sidesteps very frequent lexical bundles consisting of high-frequency function words; in such cases, a low MI-score translates into a higher probability that these lexical bundles co-occur by chance, while in reality they are strongly formulaic (e.g. in the case of) (Biber 2009: 290).
15.  A lower transitional probability score means that a word, in either initial or final position, is only loosely associated with a given n-gram (Appel & Trofimovich 2015: 11).

Finally, an approach to identifying properly fragmented n-grams, based on the concept of "coverage", is proposed by Forsyth (2015a, 2015b). In that approach, coverage is a binary category, which means that it is irrelevant how many n-grams, previously generated for each text in a given corpus, cover a given text sequence; what counts is whether the text sequence is covered or not, and "based on that, the proportion of covered vs. uncovered characters for each text file is calculated and then the character coverage for each text category is aggregated" (2015b: 13–14).16

16.  Although similar to the "Serial Cascading Algorithm" proposed earlier by O'Donnell (2011: 149–153) to generate adjusted frequency lists of n-grams, Forsyth (2015b: 25) notes that his method "is simpler and has no fixed upper limit on the length of the sequences produced".

Using the "formulex" method implemented in the Formulib software (Forsyth 2015a), written in Python 3.4, I generated the list of n-grams with the highest coverage in the sample of 300 texts extracted from the DrugDDI Corpus used in this study. The top-20 n-grams, arranged by coverage, are presented in Table 1.

Table 1.  Coverage by frequent n-grams in the sample of DrugDDI Corpus (Segura-Bedmar et al. 2010)

No.   Coverage (in %)   Raw frequency   No. of char.   No. of tokens   N-gram
 1.   0.4272            136             29             3               concomitant administration of
 2.   0.2045             93             20             3               co administration of
 3.   0.1492             57             24             3               plasma concentrations of
 4.   0.1428             62             21             3               in patients receiving
 5.   0.1353             38             33             4               drug laboratory test interactions
 6.   0.1231             56             20             3               should be considered
 7.   0.1214             61             18             3               concomitant use of
 8.   0.1175             66             16             3               in patients with
 9.   0.1173             80             13             3               the effect of
10.   0.1126             43             24             3               drug interaction studies
11.   0.1036             43             22             4               it is recommended that
12.   0.1018             54             17             3               the metabolism of
13.   0.1002             33             28             3               monoamine oxidase inhibitors
14.   0.0975             49             18             3               is not recommended
15.   0.0900             43             19             3               in combination with
16.   0.0895             57             14             3               the effects of
17.   0.0872             49             16             3               plasma levels of
18.   0.0836             42             18             4               in the presence of
19.   0.0819             23             33             4               the concomitant administration of
20.   0.0817             78              9             3               mg kg day [mg/kg/day]
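Before turning to the data, the binary notion of coverage quoted above can be illustrated with a simplified sketch; Forsyth's actual formulex algorithm additionally selects mutually exclusive sequences, so this is only an approximation of the coverage statistic, with names of my own choosing:

```python
def character_coverage(text, ngrams):
    """Proportion of characters in `text` covered by at least one of
    the given word sequences; coverage is binary, so a character that
    is matched several times still counts only once."""
    covered = [False] * len(text)
    for gram in ngrams:
        start = text.find(gram)
        while start != -1:
            for i in range(start, start + len(gram)):
                covered[i] = True
            start = text.find(gram, start + 1)
    return sum(covered) / len(text) if text else 0.0

sample = "co administration of x and co administration of y"
print(character_coverage(sample, ["co administration of"]))
```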

The data in Table 1 show that the n-gram with the highest coverage in the study corpus is concomitant administration of. In other words, 0.4272 per cent of the entire number of characters in the corpus are repetitions of that three-word sequence. Looking at Table 1, one might arrive at the incorrect conclusion that the 136 occurrences of the 3-gram concomitant administration of overlap with the 23 occurrences of the 4-gram the concomitant administration of (rows 1 and 19 in the table). In fact, however, the former, shorter sequence was not embedded in the latter, longer sequence on those 136 occasions. More specifically, the sequence concomitant administration of occurs 218 times in the study corpus. Such a method, whereby "the sequences are mutually exclusive" and "longer prefabricated phrases [are prevented] from being swamped by the elements of which they are composed" (Forsyth 2015b: 17), enables one to specify more precise boundaries of recurrent strings of words, some of them being potential lexical bundles. In order to ascertain which n-grams constitute proper lexical bundles, one may apply specific range and frequency thresholds to the output of the Formulib package, which is a list of non-overlapping n-grams ranked by coverage. This makes it possible to check the original lists of lexical bundles, identified using the three traditional criteria, against the lists of formulas generated using the "formulex" method (Forsyth 2015a), a procedure that may ultimately result in a refined list of lexical bundles of various lengths (Grabowski & Jukneviciene 2016).

3.2  How to select a representative sample of bundles from a corpus?

Another challenge in the study of bundles concerns the choice of a representative sample for qualitative functional analyses, notably if the application of the orthographic and distributional criteria has resulted in a multitude of bundles. In some studies (e.g. Biber 2006; Biber et al. 2004; Hyland 2008), researchers explore all the bundles identified in the course of the study; in other studies (e.g. Goźdź-Roszkowski 2011; Grabowski 2015), a sample of the most frequent bundles (e.g. the top 50 by frequency) is analyzed qualitatively. Neither solution is devoid of problems, however. In the former scenario, notably if one explores a corpus with hundreds of bundles, manual qualitative analyses become extremely labor-intensive and time-consuming. In fact, the research procedure then boils down to a close reading of hundreds of concordance lines, usually conducted by two or more researchers to ensure a high degree of inter-rater reliability. The latter scenario is also questionable, notably if one explores highly repetitive and clichéd text types or genres. For example, it may happen that overlapping bundles (e.g. if you have any, you have any problems, have any problems with or had no effect on, no effect on the) occur in texts with similar frequencies and hence are neighbours or near-neighbours on the frequency list. Also, the functions of the most frequent lexical bundles may not be representative of the total population of bundles in a corpus, which means that any extrapolation of the results could be construed as speculative.

To overcome these problems, that is, to select a representative sample of bundles more objectively, it is possible to apply either stratified sampling or systematic sampling, methods described in greater detail in, among others, Oakes (1998: 10), Rowntree (2000: 26–27), Babbie (2013: 226) or Canning (2013: 33–36). In the former scenario, it is possible to divide the lexical bundles into a number of frequency bands (e.g. with normalized frequencies of 100 or more, 99–70 and 69–40) and then to select – either at random or systematically – the same number of bundles from each frequency band.17 Employing systematic sampling, one might select bundles which occur at regular intervals on the frequency list, that is, every nth bundle. For example, if the total number of bundles is 350 and the sample to be explored qualitatively is to contain 50 or 25 items, then one should select every 7th or 14th bundle respectively. Importantly, the starting point can be chosen at random (e.g. somewhere in the middle of the frequency list), yet after reaching the end of the frequency list one should continue from its beginning to ensure that the intended sample indeed consists of 50 or 25 bundles. Both procedures are sketched in code below.

All in all, the aim of testing various sampling methods is twofold. First, it makes it possible to explore the impact of different sampling methods on the overview of discourse functions performed by the entire set of lexical bundles, notably if one is confronted with a number of lexical bundles so high that their functional, concordance-based manual analysis is bound to be time-consuming and labour-intensive. Second, the use of either stratified or systematic sampling may help ensure that the selection of bundles is more representative of the entire range of bundles found in the corpus, rather than limited to the most frequent items only. This assumption will be verified in the small-scale case study described below, and the implications for functional analyses of lexical bundles will be discussed afterwards in greater detail.

17.  The specific number of bundles to be selected from a given frequency band could be proportional to the total number of bundles in the band; another option is to select the same number of the most frequent bundles in each band.

4.  Corpus and context: Lexical bundles describing drug-drug interactions

4.1  Corpus description

As mentioned earlier, in this case study an attempt is made to explore the use, distribution and discourse functions of lexical bundles in a sample of the DrugDDI Corpus, a collection of 988 texts describing drug-drug interactions, originally collected in the DrugBank database (Segura-Bedmar et al. 2010). Compiled at the Computer Science Department of University Carlos III of Madrid, the DrugDDI Corpus was employed as a benchmark for the testing and evaluation of various information extraction techniques used to automatically acquire information on drug-drug interactions from texts in the biomedical domain (Segura-Bedmar et al. 2010: 2). The corpus sample used in this study consists of 300 texts with 138,988 word tokens in total. More precisely, the texts were selected on the basis of their size, that is, the 300 longest texts (out of 988) describing drug-drug interactions were subjected to the analysis. Although the total size of the study corpus is well below the conventional threshold of 1 million words used in many studies on lexical bundles (Cortes 2015: 205), it is considered sufficient in view of the highly patterned and specialized text type under scrutiny. This also accords with the claim made by Koester (2010: 67), who argues that smaller corpora are more suitable for identifying the connections between linguistic patterning and specialized contexts of use.

4.2  Procedure and analysis

An inductive corpus-driven approach is used in this study, so that neither grammatical categories nor syntactic structures have "a priori status in the analysis" (Biber 2009: 278; Tognini-Bonelli 2001: 87). The study focuses on 4-word lexical bundles, since these have a more readily recognizable range of structures and functions than 3-word and 5-word bundles (Chen & Baker 2010: 32; Hyland 2008: 8). Using WordSmith Tools 5.0 (Scott 2008), I identified 203 4-word lexical bundles that occur in the corpus more than 40 times per million words18 ('pmw' for short) in at least 9 texts, that is, in 3% of all the texts in the corpus. The distribution of the bundles across three frequency bands is presented in Table 2.

Table 2.  Distribution of lexical bundles across frequency bands

Frequency band (per million words, pmw)      Number of bundle types
Top-frequency (more than 200 pmw)             29
Medium-frequency (199–100 pmw)                67
Bottom-frequency (fewer than 100 pmw)        107

18.  The same threshold (that is, 40 occurrences pmw) was used by, among others, Jukneviciene (2009), Bernardini, Ferraresi and Gaspari (2010), Goźdź-Roszkowski (2011) and Gaspari (2013). In this study, 40 occurrences pmw equal 5.52 occurrences (raw frequency) in the sample of the DrugDDI Corpus under scrutiny. However, taking into consideration the distribution criterion (3% of all texts), the lexical bundle ranked last (203rd) on the list has a raw frequency of 9 occurrences. This value is relatively high, given the small size of the study corpus. By definition, lexical bundles should be "the most frequently occurring sequences of words" in a register (Biber 2006: 134). Hence, if one analyzes a small corpus consisting of clichéd specialized texts, it is justified to use an even more conservative frequency threshold. This accords with the claim made by Cortes, who argues that "the frequency of individual lexical bundles becomes higher as the corpus becomes more focused or restricted" (2013: 42).
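The identification step itself can be sketched as follows. This is a minimal illustration of the frequency and dispersion criteria described above, assuming pre-tokenized texts; it ignores the punctuation breaks that bundle extraction tools such as WordSmith Tools take into account, and all names are mine:

```python
from collections import Counter, defaultdict

def extract_bundles(texts, n=4, min_pmw=40, min_texts=9):
    """Return n-word sequences occurring more than min_pmw times per
    million words and dispersed over at least min_texts texts.

    texts -- list of token lists, one per corpus text"""
    freq = Counter()
    dispersion = defaultdict(set)
    total_tokens = 0
    for idx, tokens in enumerate(texts):
        total_tokens += len(tokens)
        for j in range(len(tokens) - n + 1):
            gram = tuple(tokens[j:j + n])
            freq[gram] += 1
            dispersion[gram].add(idx)
    return {gram: count for gram, count in freq.items()
            if count * 1_000_000 / total_tokens > min_pmw
            and len(dispersion[gram]) >= min_texts}
```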


In the next stage, the bundles were explored qualitatively in terms of their discourse functions. To that end, the decision was made to capitalize on insights from the functional typology originally developed by Hyland (2008), who explored the functions of lexical bundles across academic text types (research articles, PhD theses and MA/MSc theses) representing four distinct disciplines, namely electrical engineering, biology, business studies and applied linguistics. In short, Hyland (2008: 13–14) divided bundles into three major functional groups: research-oriented bundles (in this study called "referential" bundles), text-oriented bundles, and participant-oriented bundles (in this study called "stance/evaluation" bundles).19 Hence, in this study referential bundles (R) refer to various properties (pharmacological, pharmacokinetic etc.) of medicines that may cause drug-drug interactions, most of them being topic-related bundles (e.g. the clinical significance of, drug laboratory test interactions, Cmax and AUC of, the plasma concentrations of, metabolized by the cytochrome). Text-oriented bundles (T) help organize and convey research results or specialist knowledge on drug-drug interactions; they include research-related bundles20 (e.g. did not affect the, did not influence the, no significant effect on, a significant increase in, increase the risk of, have/has been reported to, have been reports of, has not been established, has been shown to, studies have shown that), framing signals (e.g. in combination with other, in the presence of, in the absence of, with a history of, with any of the) and condition bundles (when such drugs are, when these drugs are). Finally, stance/evaluation bundles (S) help express attitudes, value judgments or assessments of information on drug-drug interactions, for example it is recommended that, caution should be exercised/used, should be observed closely, (should) be closely monitored (for), would be expected to, (should) be used with caution, should not be taken, (it) is not known (whether), may need to be.

19.  In this study, labels such as "referential" and "stance/evaluation" bundles are modeled on the typology used by Biber, Conrad and Cortes (2004).
20.  In this study, research-related bundles (used to structure information and present research results) have been treated as a sub-category of textual bundles.

4.3  Results

In order to provide a more comprehensive description of the bundles describing drug-drug interactions, three different samples of 25 bundles each were selected from among the 203 lexical bundles identified in this study for qualitative analyses. More specifically, Sample 1 includes the 25 most frequent bundles (see Table 3); Sample 2 includes 25 bundles selected by means of stratified random sampling proportional to the number of bundles in each frequency band (see Table 4);21 and Sample 3 includes 25 bundles selected by means of systematic sampling, that is, every 8th bundle on the list (see Table 5). In addition, the entire set of 203 lexical bundles was analyzed qualitatively.

Table 3.  Sample 1 with the 25 most frequent bundles

No.   Lexical bundle                       Normalized frequency (pmw)   No. of texts   General and specific discourse function
 1    did not affect the                   352                          29             T / Research-related
 2    a single dose of                     345                          35             R / Topic-related
 3    on the pharmacokinetics of           338                          32             R / Topic-related
 4    the concomitant use of               338                          38             R / Topic-related
 5    it is recommended that               330                          30             S / Recommendation
 6    the concomitant administration of    330                          34             R / Topic-related
 7    drug laboratory test interactions    323                          45             R / Topic-related
 8    in the presence of                   302                          26             T / Framing signals
 9    the patient should be                294                          22             S / Recommendation
10    has not been studied                 287                          24             T / Research-related
11    caution should be exercised          280                          34             S / Recommendation
12    had no effect on                     273                          25             T / Research-related
13    caution should be used               258                          34             S / Recommendation
14    has been reported to                 258                          28             T / Research-related
15    have been reported in                258                          35             T / Research-related
16    no effect on the                     244                          26             T / Research-related
17    affect the pharmacokinetics of       237                          22             R / Topic-related
18    should be observed closely           237                          23             S / Recommendation
19    the clinical significance of         237                          24             R / Topic-related
20    did not alter the                    230                          20             T / Research-related
21    increase the risk of                 230                          29             T / Research-related
22    should not be used                   230                          26             S / Recommendation
23    been reported in patients            223                          27             T / Research-related
24    should be closely monitored          223                          26             S / Recommendation
25    there have been reports              223                          26             T / Research-related

21.  This means that 14% (4 bundles) represent the top-frequency band; 33% (8 bundles) the medium-frequency band; and 52% (13 bundles) the bottom-frequency band. In each band, the bundles were selected at random.


Table 4.  Sample 2 with 25 bundles selected through stratified random sampling

No.   Lexical bundle                       Normalized frequency (pmw)   No. of texts   General and specific discourse function
 1    it is recommended that               330                          30             S / Recommendation
 2    had no effect on                     273                          25             T / Research-related
 3    the clinical significance of         237                          24             R / Topic-related
 4    have been reports of                 215                          25             T / Research-related
 5    inhibit the metabolism of            179                          18             R / Process-related
 6    has not been established             165                          20             T / Research-related
 7    has been shown to                    143                          20             T / Research-related
 8    have not been studied                129                          13             T / Research-related
 9    is administered concomitantly with   122                          17             R / Topic-related
10    on the metabolism of                 115                          11             R / Topic-related
11    is not known whether                 107                          14             S / Attitude
12    drugs metabolized by the             100                          12             R / Topic-related
13    and its active metabolite             93                          12             R / Topic-related
14    no significant effect on              93                          10             T / Research-related
15    highly bound to plasma                86                          11             R / Topic-related
16    be administered with caution          79                          10             S / Recommendation
17    may be potentiated by                 79                          11             S / Epistemic stance
18    the antihypertensive effect of        79                          10             R / Topic-related
19    doses than usually prescribed         71                           9             R / Topic-related
20    of renal prostaglandin synthesis      71                          10             R / Topic-related
21    be expected to have                   64                           9             S / Epistemic stance
22    enhance the effects of                64                           9             T / Research-related
23    in vitro studies have                 64                           9             R / Topic-related
24    the oral clearance of                 64                           9             R / Topic-related
25    with a history of                     64                           9             T / Framing signals




Table 5.  Sample 3 with 25 bundles selected through systematic sampling

No.   Lexical bundle                       Normalized frequency (pmw)   No. of texts   General and specific discourse function
 1    drug laboratory test interactions    323                          45             R / Topic-related
 2    have been reported in                258                          35             T / Research-related
 3    been reported in patients            223                          27             T / Research-related
 4    patient should be observed           194                          17             S / Recommendation
 5    when such drugs are                  172                           9             T / Condition
 6    alter the pharmacokinetics of        143                          16             R / Topic-related
 7    other drugs metabolized by           129                          15             R / Topic-related
 8    reported in patients receiving       122                          15             R / Topic-related
 9    should be administered with          115                          10             S / Recommendation
10    risk of lithium toxicity             107                          15             R / Topic-related
11    may result in a                      100                          11             S / Epistemic stance
12    drugs are administered to             93                          11             R / Topic-related
13    of sirolimus oral solution            93                          13             R / Topic-related
14    in the absence of                     86                           9             T / Framing signals
15    be closely observed for               79                          10             S / Recommendation
16    mg twice daily for                    79                           9             R / Measurement and temporal marker
17    the combined use of                   79                          10             T / Research-related
18    have been observed with               71                          10             T / Research-related
19    or the other drug                     71                          10             R / Topic-related
20    be minimized by either                64                           9             S / Epistemic stance
21    ergot toxicity characterized by       64                           9             R / Topic-related
22    inhibits the metabolism of            64                           9             R / Topic-related
23    renal clearance of lithium            64                           9             R / Topic-related
24    there were no clinically              64                           9             R / Topic-related
25    with any of the                       64                           9             T / Framing signals


The comparison of the results across the three samples revealed certain differences in the dominant discourse functions (Table 6). More specifically, text-oriented bundles (11) are the most numerous among the 25 most frequent bundles in Sample 1; stance/evaluation bundles are also more numerous in Sample 1 (7) than in Samples 2 and 3 (with 5 stance bundles in each). This preliminary finding suggests that a sample of the most frequent bundles in a corpus may give a different picture of the dominant discourse functions than bundles selected by other sampling methods. This hypothesis, however, needs to be tested further on corpora with texts representing multiple text types or genres.

Table 6.  Discourse functions of bundles across three samples

Discourse functions of bundles   Sample 1   Sample 2   Sample 3   Entire set
Referential (R)                   7 (28%)   12 (48%)   13 (52%)   110 (54%)
Text-oriented (T)                11 (44%)    8 (32%)    7 (28%)    50 (25%)
Stance/evaluation (S)             7 (28%)    5 (20%)    5 (20%)    43 (21%)

One may also note that the discourse functions of the bundles selected through stratified random sampling (Sample 2) and systematic sampling (Sample 3) are similar. More specifically, in both samples approximately 50% of the lexical bundles (12 and 13 items respectively) perform referential functions, referring to key properties of drugs and medicines relevant to the development of drug-drug interactions, followed by text-oriented bundles (around 30%, that is, 8 and 7 items respectively) and stance bundles (20%, that is, 5 items in both samples). Also, the distribution of discourse functions among the bundles selected for the qualitative analysis by means of systematic sampling (Sample 3) turned out to be the most similar to the distribution across the entire set of 203 lexical bundles identified in the course of the analysis. Interestingly, the discourse functions of the most frequent bundles (Sample 1) were the least similar to those of the entire range of lexical bundles in the corpus under scrutiny. Finally, the results showed that most of the bundles (54%) that occur with various frequencies (high, medium or low) in the sample of the DrugDDI corpus perform domain-specific referential functions, namely conveying and structuring information relevant to the specialist field of drug-drug interaction research.22 In the future, one may attempt to explore whether this finding is applicable to other text types and genres, including non-specialist ones.

.  This finding corresponds with the main communicative function of these texts, that is, presenting information on drug-drug interactions to specialists in the pharmaceutical field.




5.  Discussion of findings

The results revealed a number of similarities and differences across the bundles found in the three samples extracted from the study corpus. While the majority of the bundles in Sample 1 (the 25 most frequent bundles) were found to perform text-oriented functions, the most numerous groups of bundles in Sample 2 (25 bundles selected through stratified proportional random sampling) and Sample 3 (25 bundles selected through systematic sampling) perform referential functions, describing those properties of drugs or medicines that cause drug-drug interactions. All in all, the study showed that, depending on the sampling method, one may get different insights into the discourse functions performed by lexical bundles typical of a given text type or genre. However, the results revealed that systematic sampling provided the most accurate overview of the discourse functions performed by the bundles found in the sample of the DrugDDI corpus under scrutiny.

It is worth emphasizing that the application of three different sampling methods for selecting a representative inventory of bundles was intended to ensure that lexical bundles with lower frequencies in texts would also be included in the qualitative functional analyses. In fact, more than half (107) of the total number of 203 bundles identified in the study corpus come from the bottom-frequency band (with normalized frequencies of 99–40 pmw), as shown in Figure 1.

[Figure 1: line chart of the normalized and raw frequency of the 203 lexical bundles, ranked by frequency, with the top-frequency and medium-frequency thresholds marked.]

Figure 1.  Frequency distribution of lexical bundles (ranked by frequency) in the study corpus
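The three selection strategies compared above can be made concrete in a few lines of Python. The following is a minimal sketch rather than the script actually used in the study: the helper names are invented, and the band boundaries for the stratified draw (other than the 107-item bottom band mentioned above) are illustrative guesses.

    import random

    def top_frequency_sample(ranked, k=25):
        # Sample 1: the k most frequent bundles (ranked is sorted by
        # descending normalized frequency).
        return ranked[:k]

    def stratified_random_sample(ranked, band_ends, quotas):
        # Sample 2: draw a proportional random quota from each frequency
        # band, e.g. 4 top + 8 medium + 13 bottom = 25 bundles.
        sample, start = [], 0
        for end, quota in zip(band_ends, quotas):
            sample += random.sample(ranked[start:end], quota)
            start = end
        return sample

    def systematic_sample(ranked, k=25):
        # Sample 3: take every n-th bundle down the ranked list.
        step = max(1, len(ranked) // k)
        return ranked[::step][:k]

    # Illustrative call on the 203-bundle inventory; only the bottom band
    # (107 items) is documented above, so the cuts at ranks 29/96 are guesses.
    # sample2 = stratified_random_sample(bundles, band_ends=[29, 96, 203],
    #                                    quotas=[4, 8, 13])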


Thus, the approach presented in this chapter enables one to qualitatively explore bundles other than the most frequent ones, which translates into a more comprehensive phraseological description. This may become important when the frequencies of the bundles in a corpus do not follow a steeply declining curve: in such a situation, the top-frequency bundles do not constitute a majority of the total number of bundles in the corpus. This, however, depends on the specificity of the research material (a text type or genre) used in a study as well as on the corresponding frequency and distributional thresholds specified by researchers.

6.  Conclusions

Designed to provide a preliminary description of the use, distribution and discourse functions of lexical bundles found in pharmaceutical texts describing drug-drug interactions, the corpus-driven study presented in this chapter was primarily intended as an opportunity to reflect on the methodologies used to explore lexical bundles. The methodological proposals concerned, among other issues, dealing with structurally incomplete bundles, filtering out overlapping bundles and, most importantly, selecting a representative sample of bundles for further qualitative analyses; the last two issues were addressed in greater detail in small-scale case studies, and certain solutions to both problems were presented. The discussion showed that there is still room for fine-tuning the lexical bundles methodology. An overview of methodological issues also revealed that the criteria and parameters set to extract lexical bundles may vary depending on research purposes (descriptive or applied) and on the specificity of the research material. To sum up, awareness of the opportunities and limitations of particular quantitative and qualitative research methods is a sine qua non for research on lexical bundles to flourish. As a matter of fact, identifying gaps or flaws in the tools or methodologies may help researchers avoid the same problems in the future. This means that if researchers want to obtain ever more fine-grained results and distinctions, the methods used to explore lexical bundles should be treated flexibly rather than rigidly. This may become particularly relevant in the future,23 when more research on lexical bundles may be conducted on texts written in languages other than English.24

.  See Wood (2015: 166) for an overview of other future challenges in lexical bundles research.

.  In fact, the lexical bundles methodology has been developed using English language data, which is why its application to texts written in other languages (inflectional or agglutinative ones) may pose a number of further challenges. For example, see Grabowski (2014) for an exploration of lexical bundles in Polish patient information leaflets.

Acknowledgements

I wish to cordially thank the Editors and Reviewers of this volume for their helpful and constructive comments on an earlier draft of this chapter. I would also like to thank Dr Phillip W. Matthews and Dr Tomasz Gadzina for proofreading the manuscript.

References

Adolphs, Svenja. 2006. Introducing Electronic Text Analysis: A Practical Guide for Language and Literary Studies. London: Routledge.
Appel, Randy & Trofimovich, Pavel. 2015. Transitional probability predicts native and nonnative use of formulaic sequences. International Journal of Applied Linguistics. Advance online publication: 29 Jan 2015. doi: 10.1111/ijal.12100
Babbie, Earl. 2013. The Basics of Social Research. Belmont MA: Wadsworth, Cengage Learning.
Bednarek, Monika. 2014. 'Who are you and why are you following us?' Wh-questions and communicative context in television dialogue. In Discourse in Context: Contemporary Applied Linguistics, Vol. 3, John Flowerdew (ed.), 49–70. London: Bloomsbury.
Bernardini, Silvia, Ferraresi, Adriano & Gaspari, Federico. 2010. Institutional academic English in the European context: A web-as-corpus approach to comparing native and non-native language. In Professional English in the European Context: The EHEA Challenge, Angeles Linde Lopez & Rosalia Crespo Jimenez (eds), 27–53. Bern: Peter Lang.
Biber, Douglas. 2006. University Language: A Corpus-based Study of Spoken and Written Registers [Studies in Corpus Linguistics 23]. Amsterdam: John Benjamins. doi: 10.1075/scl.23
Biber, Douglas. 2009. A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing. International Journal of Corpus Linguistics 14(3): 275–311. doi: 10.1075/ijcl.14.3.08bib
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward. 1999. The Longman Grammar of Spoken and Written English. London: Longman.
Biber, Douglas, Conrad, Susan & Cortes, Viviana. 2004. If you look at…: Lexical bundles in university teaching and textbooks. Applied Linguistics 25(3): 371–405. doi: 10.1093/applin/25.3.371
Biber, Douglas & Barbieri, Federica. 2007. Lexical bundles in university spoken and written registers. English for Specific Purposes 26: 263–286.
Bogusławski, Andrzej. 1976. O zasadach rejestracji jednostek języka. Poradnik Językowy 8: 356–364.
Bogusławski, Andrzej. 1989. Uwagi o pracy nad frazeologią. In Studia z polskiej leksykografii współczesnej, Vol. 3, Zygmunt Saloni (ed.), 13–29. Białystok: Wydawnictwo Uniwersytetu w Białymstoku.
Breeze, Ruth. 2013. Lexical bundles across four legal genres. International Journal of Corpus Linguistics 18(2): 229–253. doi: 10.1075/ijcl.18.2.03bre
Canning, John. 2013. An Introduction to Statistics for Students in the Humanities. (16 December 2014).
Chen, Yu-Hua & Baker, Paul. 2010. Lexical bundles in L1 and L2 academic writing. Language Learning and Technology 14(2): 30–49. (10 December 2014).
Cortes, Viviana. 2013. The purpose of this study is to: Connecting lexical bundles and moves in research article introductions. Journal of English for Academic Purposes 12(1): 33–43. doi: 10.1016/j.jeap.2012.11.002
Cortes, Viviana. 2015. Situating lexical bundles in the formulaic language spectrum. In Corpus-based Research in Applied Linguistics: Studies in Honor of Doug Biber [Studies in Corpus Linguistics 66], Viviana Cortes & Eniko Csomay (eds), 197–216. Amsterdam: John Benjamins.
DrugDDI Corpus. (5 December 2014).
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1): 61–74.
Fletcher, William. 2002–2007. KfNgram. Annapolis MD: USNA. (20 November 2011).
Forsyth, Richard. 2015a. Formulib: Formulaic Language Software Library. (30 November 2015).
Forsyth, Richard. 2015b. Formulib: Formulaic Language Software Library. User notes. (2 November 2015).
Gallicano, Keith & Drusano, George. 2005. Introduction to drug interactions. In Drug Interactions in Infectious Diseases, Stephen Piscitelli & Keith Rodvold (eds), 1–11. Totowa: Humana Press. doi: 10.1385/1-59259-924-9:001
Gaspari, Federico. 2013. A phraseological comparison of international news agency reports published online: Lexical bundles in the English-language output of ANSA, Adnkronos, Reuters and UPI. Studies in Variation, Contacts and Change in English, Vol. 13. (14 April 2016).
Gledhill, Christopher. 2000. Collocations in Science Writing. Tübingen: Gunter Narr.
Goźdź-Roszkowski, Stanisław. 2011. Patterns of Linguistic Variation in American Legal English: A Corpus-Based Study. Frankfurt: Peter Lang.
Grabowski, Łukasz. 2014. On lexical bundles in Polish patient information leaflets: A corpus-driven study. Studies in Polish Linguistics 9(1): 21–43. doi: 10.4467/23005920SPL.14.002.2186
Grabowski, Łukasz. 2015. Keywords and lexical bundles within English pharmaceutical discourse: A corpus-driven description. English for Specific Purposes 38: 23–33. doi: 10.1016/j.esp.2014.10.004
Grabowski, Łukasz & Jukneviciene, Rita. 2016. Towards a refined inventory of lexical bundles: An experiment in the Formulex method. Kalbu Studijos/Studies About Languages 29: 58–73.
Granger, Sylviane. 2014. A lexical bundle approach to comparing languages: Stems in English and French. In Genre- and Register-related Discourse Features in Contrast, Marie-Aude Lefer & Svetlana Vogeleer (eds). Special issue of Languages in Contrast 14(1): 58–72.
Grochowski, Maciej. 1981. O wyróżnianiu jednostek opisu semantyki leksykalnej. Studia Minora Facultatis Philosophicae Universitatis Brunensis 29: 31–37. (8 December 2015).
Huang, Shiew-Mei, Lesko, Lawrence & Temple, Robert. 2008. An integrated approach to assessing drug-drug interactions: A regulatory perspective. In Drug-Drug Interactions, 2nd edn, A. David Rodrigues (ed.), 665–685. New York NY: Informa Healthcare.
Hyland, Ken. 2008. As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes 27: 4–21. doi: 10.1016/j.esp.2007.06.001
Ionescu, Corina & Caira, Mino. 2005. Drug Metabolism: Current Concepts. Dordrecht: Springer. doi: 10.1007/1-4020-4142-X
Jukneviciene, Rita. 2009. Lexical bundles in learner language: Lithuanian learners vs. native speakers. Kalbotyra 61(3): 61–71.
Koester, Almut. 2010. Building small specialized corpora. In The Routledge Handbook of Corpus Linguistics, Michael McCarthy & Anne O'Keeffe (eds), 66–79. London: Routledge. doi: 10.4324/9780203856949.ch6
Kopaczyk, Joanna. 2012. Long lexical bundles and standardisation in historical legal texts. Studia Anglica Posnaniensia: International Review of English Studies 47(2–3): 3–25. doi: 10.2478/v10121-012-0001-0
Kopaczyk, Joanna. 2013. The Legal Language of Scottish Burghs (1380–1560). Oxford: OUP. doi: 10.1093/acprof:oso/9780199945153.001.0001
Liu, Dilin. 2012. The most frequently-used multi-word constructions in academic written English: A multi-corpus study. English for Specific Purposes 31: 25–35. doi: 10.1016/j.esp.2011.07.002
McEnery, Tony, Xiao, Richard & Tono, Yukio. 2006. Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge.
Moon, Rosamund. 2007. Corpus linguistic aspects of phraseology. In Phraseologie: Ein internationales Handbuch zeitgenössischer Forschung, Vol. 2, Harald Burger (ed.), 1045–1059. Berlin: Walter de Gruyter.
Myles, Florence & Cordier, Caroline. 2017. Formulaic sequence(FS) cannot be an umbrella term in SLA: Focusing on psycholinguistic FSs and their identification. Studies in Second Language Acquisition 39(1): 3–28. doi: 10.1017/S027226311600036X
Oakes, Michael. 1998. Statistics for Corpus Linguistics. Edinburgh: EUP.
O'Donnell, Matthew Brook. 2011. The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal 35: 135–169. (9 December 2015).
Rayson, Paul & Garside, Roger. 2000. Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora, Vol. 9 (WCC '00), 1–6. Stroudsburg PA: Association for Computational Linguistics. (10 December 2015).
Rieger, Burghard. 2001. Computing granular word meanings. In Computing with Words, Paul Wang (ed.), 147–208. New York NY: Wiley. (Cited in Stubbs & Barth 2003: 81.)
Rowland, Malcolm. 2008. Introducing pharmacokinetic and pharmacodynamic concepts. In Drug-Drug Interactions, 2nd edn, A. David Rodrigues (ed.), 1–29. New York NY: Informa Healthcare.
Rowntree, Derek. 2000. Statistics Without Tears: An Introduction for Non-Mathematicians. London: Penguin Books.
Salazar, Danica. 2011. Lexical Bundles in Scientific English: A Corpus-Based Study of Native and Non-native Writing. PhD dissertation, University of Barcelona. (10 March 2013).
Salazar, Danica. 2014. Lexical Bundles in Native and Non-native Scientific Writing [Studies in Corpus Linguistics 65]. Amsterdam: John Benjamins. doi: 10.1075/scl.65
Schmitt, Norbert, Grandage, Sarah & Adolphs, Svenja. 2004. Are corpus-derived recurrent clusters psycholinguistically valid? In Formulaic Sequences: Acquisition, Processing, and Use [Language Learning & Language Teaching 9], Norbert Schmitt (ed.), 127–151. Amsterdam: John Benjamins. doi: 10.1075/lllt.9.08sch
Scott, Mike. 2008. WordSmith Tools 5.0. Liverpool: Lexical Analysis Software.
Segura-Bedmar, Isabel, Martinez, Paloma & de Pablo-Sanchez, Cesar. 2010. Extracting drug-drug interactions from biomedical texts. BMC Bioinformatics 11(Suppl 5): P9. doi: 10.1186/1471-2105-11-S5-P9
Simpson-Vlach, Rita & Ellis, Nick. 2010. An academic formulas list: New methods in phraseology research. Applied Linguistics 31(4): 487–512. doi: 10.1093/applin/amp058
Stubbs, Michael & Barth, Isabel. 2003. Using recurrent phrases as text-type discriminators: A quantitative method and some findings. Functions of Language 10(1): 65–108. doi: 10.1075/fol.10.1.04stu
The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. (10 January 2015).
Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work [Studies in Corpus Linguistics 6]. Amsterdam: John Benjamins. doi: 10.1075/scl.6
Wood, David & Appel, Randy. 2014. Multi-word constructions in first year business and engineering university textbooks and EAP textbooks. Journal of English for Academic Purposes 15: 1–13. doi: 10.1016/j.jeap.2014.03.002
Wood, David. 2015. Fundamentals of Formulaic Language: An Introduction. London: Bloomsbury.
Wray, Alison & Perkins, Michael. 2000. The functions of formulaic language: An integrated model. Language and Communication 20: 1–28. doi: 10.1016/S0271-5309(99)00015-4

chapter 4

Lexical obsolescence and loss in English: 1700–2000

Ondřej Tichý

Charles University in Prague

This paper explores a new methodology for extracting forms that were once common but are now obsolete from large corpora. It proceeds from the relatively under-researched problem of lexical mortality, or obsolescence in general, to the formulation of two closely related procedures for querying the n-gram data of the Google Books project in order to identify the best candidates for words and lexical expressions that may have become lost or obsolete in the course of the last three centuries, from the Late Modern era to Present-day English (1700–2000). After describing the techniques used to process big uni- and trigram data, this chapter offers a selective analysis of the results and proposes ways in which the methodology may be of help to corpus linguists as well as historical lexicographers.

Keywords: lexicology; corpus linguistics; diachronic linguistics; obsolescence; n-grams; lexical bundles; Late Modern English; Google Books

1.  Introduction

The topic of lexical obsolescence and mortality in the history of English is relatively under-researched,1 especially when compared to studies of neologisms and lexical innovation in general. Nonetheless, understanding how words leave a language should be of no less import and interest than understanding the processes of word-formation and borrowing. The paucity of research in the area, especially research using corpus methodology, may be attributed, at least to a degree, to the inherent difficulty of tracing the decline of an already hard-to-detect phenomenon.

.  Few studies on the topic have been published; cf. Trench (1871) or Coleman (1990). The only corpus-based study so far seems to be Petersen, Tenenbaum, Havlin & Stanley (2012), which, however, focuses on overall trends of lexical mortality and word-birth.

doi 10.1075/scl.82.04tic © 2018 John Benjamins Publishing Company


Traditionally, lexical loss in English has been regarded as largely restricted to the transition from Old to Middle English, that is, to the period during which profound changes, both language-internal (i.e. typological re-shaping) and language-external (i.e. due to language contact and political and cultural changes), led to a large-scale loss of the native lexicon, often accompanied by borrowed foreign replacements. The scope of the changes and the paucity of Early Middle English textual resources both called for a close-to-the-text philological approach to this problem and at the same time precluded the use of corpus-driven quantitative methods (Čermák 2008). In contrast, later periods in the history of English are usually perceived in terms of lexical expansion, especially through borrowing: Late Middle English is marked by its absorption of French vocabulary related to culture and the arts; Early Modern English is shaped by an influx of classical scientific terminology and the diverse vocabulary of exploration and colonization. Similarly, the period from Late Modern English to Present-day English is characterized both by the growth of English into a global language and by the continuous influx of foreign borrowings which fuel the publication of ever-growing dictionaries. This flow of new words makes the identification of lexical loss something of a challenge, to say the least. Yet there are now large amounts of textual material available that invite quantitative corpus-driven research.

1.1  Research questions

This paper will address the methodological suitability of using corpora to study lexical loss. It will focus on whether lexical obsolescence is limited to specific periods of linguistic upheaval, or whether it can be observed, given sufficient data, throughout the entire history of the language. More specifically, the paper will focus on how we can study lexical loss using corpora, and it will attempt to determine how large the corpora need to be in order to identify obsolescence effectively.

1.2  Theoretical problems and practical definitions

Before proceeding any further, it is important to note a more fundamental reason why relatively little has been done in the field of lexical obsolescence. From a practical point of view, it is much simpler to prove that something exists (or has come into existence) than to prove that something does not exist (or has ceased to exist). To prove something requires evidence, and while evidence of existence is easily observable, evidence of non-existence is not – or, as the aphorism goes: “absence of evidence is not evidence of absence”.2 In other words, finding no

.  A similar aphorism, “absence of proof is not proof of absence”, has been attributed to a number of people, though the attributions are doubtful.




evidence of a word is not the same as proving its non-existence. However, proving the non-existence of a phenomenon is possible if we can agree on the premise of the underlying inductive argument or, more generally, on the validity of inductive reasoning (Hales 2005). In corpus linguistics, for example, proving that something does not exist is seemingly simple: if a form does not appear in a corpus, it does not exist in the language the corpus represents. But unlike an attestation proving the existence of a form, proving anything of linguistic significance by the non-appearance of a form would require the corpus in question to be a comprehensive representation of a given language or variety. A truly comprehensive and representative corpus of a living language is, however, a practical impossibility. Moreover, corpora often contain tokens that we would not necessarily consider “valid” or “grammatical” members of the lexicon in a given language: for example, typos, foreign words or forms used only in meta-language. The practical upshot of the preceding argument is that all the examples discussed below as obsolete are shown to be obsolete only in relation to the language represented by the corpus.

Another problem lies in the definition of lexical obsolescence. By the strictest definition, a form is lost when it no longer exists in a given language. Such a definition is impracticable for the purpose of this paper, as a mere comment on a lost form would presently revive it. Other definitions may consider a form lost when it is no longer used except in meta-language, for example “the meaning of the modern English verb to take was in Old English mostly expressed by the verb niman”, where niman is only used meta-linguistically. Some dictionaries label non-current forms (if they list them at all) as obsolete. While dictionaries differ in the precise delimitation of the term, a common practice, according to Jackson (2002), is to label all forms lost by 1755 (the date of the publication of Johnson’s Dictionary) as obsolete. If the concept was lost (i.e. is no longer used) together with the form (as e.g. in the case of cervelliere ‘a close-fitting helmet’), lexicographers usually label the usage of such forms as historical. If a form is used, but only to produce a deliberately old-fashioned effect, it may be labelled archaic, while rare is an umbrella term for any word not in normal use (Jackson 2002: 113–114).

It is clear that a precise definition of a lost form is difficult to provide. For the purposes of this paper, the above term obsolete seems to serve well, except that no one specific diachronic cut-off point will be given and obsolete may not be systematically distinguished from historical forms. In other words, obsolete here means lost, that is, from a corpus-linguistic perspective, either not present in the data or indistinguishable (in frequency) from errors. Since the paper is specifically interested in lexical loss as a process and not so much in obsolescence as a state, it will attempt to establish a methodology for identifying formerly common forms (see Section 3.1 below) that have fallen out of use. In the following two sections, I will discuss both the data source necessary for the analysis and the methodology proposed for its application – making a distinction

 Ondřej Tichý

between single and multi-word expressions. I will also briefly mention the techniques useful for processing very large linguistic datasets. The proposed methodology will supply a limited number of prospective candidates for further analysis. Several indicators useful for the analysis of multi-word expressions will be noted, and a number of the most interesting examples will be discussed in greater detail.

2.  The corpus and its problems

The choice of the Google Books project as a source of linguistic data stems mostly from its size, availability and diachronic breadth. Since the goal of proving the non-existence of a form directly influences the choice of research data, while the methodology required is determined by the character of the data, the reasons for selecting Google Books will be discussed in more detail along with the methodology in Section 3.

The composition of the Google Books project is defined by the aims and methods of its construction. The aim is “to digitally scan every book in the world” (Google Books History 2009) and the method is first scanning at least one copy of every book in the world’s largest libraries and later getting all new books directly from their authors and publishers in electronic form. Linguistic representativeness is a complex concept to keep in mind, and it was not a concern of the creators of the project. As a result, the data cover printed production well, but offer much less coverage of non-printed production. Nor is the dataset representative in terms of impact or reception: books printed in the millions have the same status as books printed in only a few dozen copies. In this sense, some genres are over-represented (science, law) and some under-represented (fiction). It is beyond the scope of this paper to discuss exactly what the deficiencies of this dataset are in terms of representativeness, but it is important to note that deficiencies are bound to exist.

The deficiencies also stem from the fact that the same book often appears multiple times in the dataset – as an exact copy or in similar editions. Books are also not infrequently misdated, usually because a reprint was published long after the first edition and the initial edition is missing from the dataset.

Another problem with the data can be summarized as “junk”. The problems with OCR3 are noted in more detail below, but generally speaking, plenty of the data would normally have been removed from a linguistic corpus, or would be

.  OCR stands for Optical Character Recognition, which is an automated algorithm that turns digital images (e.g. scanned pages) into editable texts.




at least appropriately tagged, so that it could be avoided in linguistic research. Among such problematic items are texts in foreign languages or texts that are not part of the actual content of the scanned books. It can be argued that such problems are overcome by the sheer size of the dataset, but enough “junk” has been detected in the results to make manual analysis necessary, precluding fully automatic analyses.4 As an example of this kind of problem (in addition to the problems mentioned below), the trigram with the highest collocability in the resulting set of candidates for obsolescence was date stamped below, the prominence of which in the data is explained by its usual context, which clearly points to the source of Google’s data: “This book is due on the last date stamped below.”

2.1  The n-grams

I have so far mostly refrained from using the term corpus when referring to the chosen dataset, since it does not exist, or at least is not publicly accessible, in the form of a typical linguistic corpus. Instead, primarily for licensing reasons, the data are publicly available only in n-gram format. Specifically, the public dataset consists of n-gram strings (from unigrams up to 5-grams) representing types (as opposed to tokens) derived from the Google Books project, their respective raw frequencies for a given year and the number of books in which the n-gram in question was attested in that year (see Table 1). This limits the methodology severely.

Table 1.  Google Ngrams data format

Year

Frequency

Books

as appears by

1768

86

48

The only variables that can be taken into account are the n-gram frequencies, the number of books in which they are attested and their respective development over time. This is especially constraining when working with multi-word expressions, since it prevents a number of co-occurrence measures from being calculated. Further analysis of the data is also limited by the absence of a wider context – there is nothing similar to a concordance line. Only the standard interface of the Google Books project can be used and the accessibility of a particular context, therefore, depends on the licensing of the particular books containing the given n-gram.

.  That is, not all the results can be expected to be actual examples of lexical obsolescence, manual pruning is necessary.


3.  Methodology

3.1  Data requirements

The primary requirement of the data when researching lexical obsolescence is comprehensiveness: to be able to focus on the lowest frequency levels in a meaningful way, as large a dataset as possible is needed. To understand how large a dataset is required to obtain reliable results, the relative frequencies of words on the very “fringes” of the English lexicon need to be established. I say “words” here, rather than “forms”, since the discourse of word frequencies and terms such as rare words, core, basic or extended vocabularies is relatively well established (see Milton & Donzelli 2013; Aitchison 1987; or Kilgarriff 2015), while only the high-frequency end of the spectrum has so far been discussed in the case of multi-word expressions like lexical bundles (Biber 1999: 992–995).

The issue of core, basic or elementary vocabulary has been quite extensively researched, mostly in connection with language acquisition (Milton & Donzelli 2013), but it has little bearing on this research. The average and maximum vocabulary of an adult native speaker of English has also been widely debated, and it is of more interest here, since such vocabulary may well be considered common.5 While the differences in the estimates are often due to the unclear definition of the key term word, the reasonable maximum number of English words known to any one single person, based on both Aitchison (1987) and the results of the popular online testing site (TestYourVocab.com n.d.), may be set for the purposes of this paper at 40–100,000 words.6 The number of words (entries and forms) in the largest English dictionaries is below 500,000. This assessment seems to agree with Kilgarriff (2015: 33), who describes forms below the threshold of the most common 500,000 as mostly unintelligible: “half the items no longer even look like English words, but are compounded from obscure forms, typos, words glued together and other junk”. Kilgarriff used the 12-billion-word corpus enTenTen, and the forms at the 500,000-word threshold had absolute frequencies of 10–100. If one is to observe developments around this threshold, a corpus containing at least 100 million to 1 billion words per diachronic period/sub-corpus (i.e. at least 1/100, or ideally 1/10, of Kilgarriff’s data) is necessary.

.  As stated in the previous section, formerly common forms that have become obsolete are the focus of this paper.

.  The wide range is due mainly to the differences in the methodologies used to derive the estimates; a critical discussion of these is beyond the scope of this paper.




Currently, the largest available source of textual data is the Google Books project and the n-gram data derived from it (Michel et al. 2011).7 The 2012 version of the English dataset comprises ca. 468 billion tokens. Table 2 shows the structure of English vocabulary as described above, with frequencies per million (ppm) derived from the Google unigram data. By way of example: if sorted by descending frequency, the 100,000th n-gram has an approximate frequency of ca. 0.15 ppm, which also agrees with Kilgarriff’s data.

Table 2.  Structure of English vocabulary according to frequency

Type of vocab.      Num. of words    Freq. ppm
core                200–300          >200
basic               2,000–3,000      >30
native active       10,000           >5
native passive      40–100,000       >1 – >0.15
dictionary words    500,000          >0.03
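The cut-offs in Table 2 translate directly into a simple band classifier. A minimal sketch (the band names follow Table 2; the 1 ppm and 0.03 ppm boundaries are the ones that matter most for the procedure developed below):

    # Frequency cut-offs in words per million, mirroring Table 2.
    BANDS = [
        (200.0, "core"),
        (30.0, "basic"),
        (5.0, "native active"),
        (0.15, "native passive"),
        (0.03, "dictionary words"),
    ]

    def vocab_band(freq_ppm):
        for threshold, name in BANDS:
            if freq_ppm > threshold:
                return name
        return "junk / potentially lost"

    print(vocab_band(250.0))  # -> core
    print(vocab_band(0.02))   # -> junk / potentially lost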

Google Ngrams also fulfil the second major requirement, diachronic scope, since they theoretically encompass printed texts from 1500 to 2008. If the unigram data are divided into decades, each decade after 1800 contains over a billion tokens, and all but the first two decades of the eighteenth century contain over 100 million tokens. There are both theoretical and practical reasons to limit the present research to the diachronic range of 1700–2000. The theoretical motivation is determined by the established periodization of the history of English (i.e. 1700 is usually considered the end of the Early Modern English period and the beginning of the Late Modern English period); the practical necessity is dictated by the paucity and low quality of the data for the earlier period (all the decades of the sixteenth and seventeenth centuries amount to less than 64 million tokens altogether; see also Table 3 and Section 3.2).

3.2  Word obsolescence

It has been noted in the introduction that the definition of obsolescence is problematic and that the focus here will be on forms that were once common but have been lost over time.

.  While the paper uses the 2012 dataset, the Google Books Ngram project page is to be cited using the 2011 article that describes the 2009 dataset.


Table 3.  Google Ngrams data composition according to decades

Decade    N-grams in millions    Books
1700s          49                   574
1710s          59                   855
1720s         100                 1 081
1730s         107                 1 306
1740s         114                 1 575
1750s         174                 2 084
1760s         191                 2 240
1770s         214                 2 686
1780s         286                 3 329
1790s         468                 5 133
1800s       1 122                10 200
1810s       1 693                14 041
1820s       2 631                21 312
1830s       3 413                27 361
1840s       4 282                32 017
1850s       5 866                44 039
1860s       5 055                41 469
1870s       6 169                52 135
1880s       8 160                71 538
1890s       9 798                89 365
1900s      13 469               128 383
1910s      12 565               134 433
1920s      11 988               126 038
1930s      11 556               118 173
1940s      11 654               117 557
1950s      17 684               171 509
1960s      31 951               324 447
1970s      41 979               438 891
1980s      54 317               535 854
1990s      82 492               781 581
Totals for 1700–2000: 339 604 861 737 n-grams (tokens); 3 301 206 books

Using the frequencies given in Table 2, it is now possible to be more specific about what to consider common and what to consider lost. For our purposes, words are unigram forms as found in the Google Books datasets; I will not attempt any linguistic definition of word here. Suffice it to say that the unigrams are mostly what words are intuitively expected to be: strings delimited by spaces or punctuation on both sides. Only a relatively small number of unigrams are not words by this definition, but turn out to be residues of the parsing process by which the data were composed: punctuation, contractions or non-lexical material. The tokens are non-lemmatised and part-of-speech (POS) tagged, which means that their number is larger than it would have been without the POS tagging (i.e. the same forms belonging to different parts of speech are distinguished in the data). However, in the case of Present-day English, and especially in the lower-frequency bands, the difference in rank with or without POS tagging is for our purposes quite negligible. Using the frequency bands specified in Table 2, it is possible to postulate that common words are those with a frequency over 1 ppm or, in other terms, that common words are the 40,000 most frequent words in a given decade.




By the same token, it can be postulated that all words with a frequency below 0.03 ppm (i.e. in the frequency band characterised mostly by errors, non-words or “junk”) are potentially lost.8 Note that not all the words in this frequency band are necessarily considered to be lost: some highly technical or specialised vocabulary may, for example, be found in this frequency band and yet be considered neither lost nor obsolete in any sense. See Figure 1 to compare items in the low-frequency band, such as typos, historical or meta-language items, and obsolete words. However, since the focus is only on vocabulary that used to be common, such highly specialised items will either not appear in the analysis at all or, if they do, their radical decrease in frequency should substantiate their inclusion in the category of lost words (e.g. in cases where they were replaced or their originally prominent denotation was lost).

[Figure 1: line chart of relative frequency (RF ppm, 0–0.025) over the years 1980–1999 for the low-frequency unigrams habban, heora, drihten + dryhten, beft and flagitious.]

Figure 1.  Low frequency unigrams in the last decades9

.  Petersen, Tenenbaum, Havlin and Stanley (2012) used a similar threshold of 0.05, though with a different methodology.

.  While most of the calculations here use adjusted frequency (see below), the graphs are given in relative frequency, which can be compared to other corpora and is intuitively more meaningful. habban and drihten are high-frequency Old English forms that may safely be assumed to be lost by the twentieth century; beft is an OCR error, but one typical of the eighteenth century, because the long s has not been used since (note that common contemporary misspellings such as existance are about 5 times as frequent); flagitious is an example of a form that was once common, but has been lost.
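The two frequency boundaries established above are all that is needed to shortlist candidate forms; the formal procedure is spelled out in the next paragraph. A minimal sketch, assuming that per-decade relative frequencies in ppm have already been computed for every unigram (the data layout and names are illustrative):

    COMMON_PPM = 1.0  # roughly the 40,000 most frequent forms of a decade
    LOST_PPM = 0.03   # the level of errors, non-words and other "junk"

    def obsolescence_candidates(rf_by_decade, last_decade="1990s"):
        # rf_by_decade: {form: {decade label: relative frequency in ppm}}.
        # A candidate was common in at least one decade but sits at
        # "junk" level in the last decade.
        return [
            form
            for form, series in rf_by_decade.items()
            if max(series.values()) >= COMMON_PPM
            and series.get(last_decade, 0.0) < LOST_PPM
        ]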


Finally, it would be impractical to use the years as they are recorded in the dataset as the basic periods by which the frequencies are calculated. The shortest periods of time for which there is enough data (i.e. over 100 million words) are decades (with the exception of the first two decades of the eighteenth century, see Table 3). Longer periods would unnecessarily obscure the tendencies to be analysed. The basic procedure follows from the definitions above: relative frequencies (RF = absolute frequency divided by the total size of the sub-corpus, that is, a decade) are calculated for all the words attested in a given decade, and it is then established which words were at the same time:

1. among the 40,000 most frequent in at least one of the decades, and
2. below 0.03 ppm in the last decade.

The last decade is admittedly an arbitrary choice. It could have been a longer period, similar, for example, to the periods over which words have to be evidenced before being admitted as an entry by dictionaries like the OED. Note that the focus of the paper is on lexical loss, but a similar methodology could be employed to investigate oscillations or neologisms.

3.3  Pruning and sorting the results

A cursory look at the resulting dataset of ca. 20,000 words reveals (1) that a large portion of these is unusable (see below) and (2) that some of the forms are better candidates for further analysis than others. I have therefore decided to remove some of the unusable items from the results and also to devise a measure of lexical obsolescence that would help me rank the more interesting candidates for further analysis. I call this measure the Obsolescence Index (OI), its parameters being the maximum relative frequency of a form over time and its relative frequency in the last decade (LRF). The greater the difference between the former and the latter, the higher the OI of the form in question (see below for a detailed description and formula). The unusable forms were of three basic types:

a.  Proper names

Certain proper names that had been very popular at one time but were quickly forgotten afterwards crop up prominently in the results. (1) Dunallan is a character from a novel by Grace Kennedy, whose novels “of a decidedly religious caste, were very popular in her day, though now when the age has become more liberal, they have lost most of their interest and are very little read” – says The Encyclopedia Americana of 1922. The characteristic frequency curve of such items has a sharp




spike (see Dunallan in Figure 2). Unlike OCR errors, proper names are not as easily removed from the results, since the spikes can be recurrent (reflecting, e.g., re-editions of popular books) and can appear in any decade. Capitalisation is of no use here either, since it is hardly an unambiguous sign of proper names. To rank them lower than other lexical items, I have decided to take into account their average relative frequency in the 50 years preceding and the 50 years following their peak frequency. This movable average around the given form’s maximum frequency strongly favours forms that had been common for a relatively long time and whose decline was not unusually rapid.

b.  OCR errors

These comprise almost three quarters of the data, and most of them are due to the long s (ſ) of eighteenth-century prints being recognized as the letter f. In turn, this causes forms like (2) fingular (see Figure 2) to turn up in the data as prime examples of lexical mortality. Since the use of the long s declined in English print very rapidly between the 1790s and the 1810s, it has been possible to remove this type of OCR error from the results relatively easily: all the forms whose frequency declined during those three decades by more than 99.99% have been removed. Using this method, I have failed to detect only a few OCR errors in several hundred manually analysed results, and of the first hundred errors removed, which were manually double-checked, not a single one was a false positive.10

c.  Variety-specific forms

A number of forms come up in the results because they happened to appear very prominently in a limited number of books, which can be enough in the earliest decades of the eighteenth century to make them register among the 40,000 most frequent forms overall. Apart from proper names, these are mostly highly specialized terms or regional forms, such as (3) quhilk (see Figure 2). To reverse their effect and boost the OI of forms that are more evenly spread in the data, the number of books in which a form was attested can be taken into account as well. The usual dispersion formulas for adjusting the frequency by splitting the corpus into evenly long pieces (Gries 2008) cannot be used because there is no real corpus to split, but the relative frequency can simply be multiplied by the number of books to derive a measure called here adjusted frequency (AF).11

.  This might be a good method for the Google Books project to improve their data.

.  Note that the Google Books dataset only contains n-grams that occurred in at least 40 books.
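The pruning devices described under (b) and (c) are mechanical enough to sketch. A minimal version, again assuming per-decade relative frequencies; the 99.99% cut-off and the AF definition follow the text above:

    def is_long_s_artifact(rf, drop=0.9999):
        # (b) Long-s OCR errors: the form's frequency collapses by more
        # than 99.99% between the 1790s and the 1810s, when the long s
        # disappeared from English print.
        before, after = rf.get("1790s", 0.0), rf.get("1810s", 0.0)
        return before > 0 and (before - after) / before > drop

    def adjusted_frequency(rf_ppm, n_books):
        # (c) Adjusted frequency (AF): relative frequency multiplied by
        # the number of attesting books, demoting forms concentrated in
        # a handful of volumes.
        return rf_ppm * n_books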


The resulting formula for OI uses a moving average of 11 decades (i being the decade of maximum AF) and is a product of some tinkering. For example, simply multiplying the maximum RF by the inverse of LRF gives too great an impact to forms with an extremely low LRF, while it actually matters relatively little whether the LRF is 0.001 or 0.00001 (i.e. very low or extremely low). I have therefore decided to use a logarithm of the inverse value of LRF multiplied by a quite arbitrary parameter p = 10.12 In the end, the index favours forms that had been relatively frequent and well distributed among texts (i.e. common) for some time and then declined slowly, but fell well below 0.03 ppm in the last decade.

OI = \frac{AF_{i-5} + AF_{i-4} + \dots + AF_{i+4} + AF_{i+5}}{11} \times \log\left(p \cdot \frac{1}{LRF}\right)

The forms just described are well exemplified by the case of (4) flagitious (see Figure 2). A more detailed analysis and discussion of such forms will be carried out after the methodology for observing obsolescence in multi-word expressions has been described.

[Figure 2: line chart of relative frequency (RF ppm, 0–4.5) by decade (1700s–1990s) for DUNALLAN_NOUN, FINGULAR_ADJ, QUHILK_NOUN and FLAGITIOUS_ADJ, with the decade of maximum RF (dmax), the ±50-year window and the moving average marked.]

Figure 2.  Comparison of frequency trends in “junk” results: (1) dunallan, (2) fingular, (3) quhilk, and in a valid result: (4) flagitious; dmax stands for the decade of maximum RF

.  While the logarithm lessens the impact of extreme LRF values on the OI, the parameter makes the overall impact of LRF greater, whatever the actual LRF value.
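Putting the pieces together, the OI can be computed in a few lines. The sketch below follows the formula above; note that reading “a logarithm of the inverse value of LRF multiplied by p” as log(p/LRF) is the interpretation adopted here, and the floor placed on LRF is a practical guard not mentioned in the text:

    import math

    def obsolescence_index(af_series, lrf, p=10.0):
        # af_series: adjusted frequencies in chronological decade order.
        # 11-decade moving average centred on the decade of maximum AF
        # (truncated at the edges of the observed period).
        i = max(range(len(af_series)), key=af_series.__getitem__)
        window = af_series[max(0, i - 5):i + 6]
        moving_avg = sum(window) / 11
        lrf = max(lrf, 1e-12)  # guard against division by zero
        return moving_avg * math.log(p / lrf)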




3.4  Obsolescence of multi-word expressions

A major difference between words and multi-word expressions, when dealing with their frequencies, is that one can rely both on a considerable body of literature and on linguistic intuition when terms applicable to single words, like common or core vocabulary, are discussed. In contrast, one can hardly rely on intuition at all when concepts like core multi-word expressions, or their frequencies in general, are considered. The literature in this sense deals mostly with collocations and lexical bundles. Biber (1999: 990) characterizes a lexical bundle somewhat arbitrarily as “a recurring sequence of three or more words” that occurs “at least ten times per million words”. Biber (1999: 40) also works with collocational associations, whose strength he defines as “the probability of observing two words together compared to the probability of observing each word independently” and measures by means of (Pointwise) Mutual Information (MI). Both frequency and MI can be used to identify “common” trigrams comparable to the common words described above, but it is more difficult to identify the lower frequency band that would be comparable to the “lost” unigrams. Following Biber’s example, and for practical reasons discussed below, I focus on trigram data rather than on shorter bigrams or longer 4- or 5-grams.

A trigram which drops below the threshold of a lexical bundle or of a salient collocation is not simply obsolete or lost. It has lost some of its currency or expressive power (perhaps by a small margin only, given the arbitrariness of the lexical bundle threshold), but I am only interested here in those expressions that have lost their currency to a highly significant degree. Lacking any meaningful threshold for the lower-frequency band under which an expression could be considered a prospective candidate for loss, I have decided simply to remove from the dataset those trigrams (a) that were never frequent and “meaningful” enough to be considered both lexical bundles and collocations at the same time (see Figure 3); and (b) whose AF in the last decade was above the level of genuine (i.e. not OCR-induced) misspelling. The results were then sorted by their OI, which ensures that expressions at the top of the list were at one point in time both fairly common and “meaningful” and have meanwhile become marginal or lost. The maximum MI and the MI in the final decade were also considered during the analysis described in Section 4 below. The actual threshold of AF in the last decade was set to 1 × 10−15 (anything above that was removed from the results), but most items analysed in Section 4 go several orders of magnitude lower than that. For comparison, the AF of a relatively common misspelling such as the existance of is 2.7 × 10−16.

The MI was calculated as the logarithm of the RF of a particular trigram divided by the product of the RFs of each element of the trigram. To


normalize the value, I have followed Bouma’s method (2009) of dividing the MI by the negative logarithm of the trigram’s RF:

MI = \log\left(\frac{RF_{x,y,z}}{RF_x \times RF_y \times RF_z}\right) \Big/ \left(-\log RF_{x,y,z}\right)

But while in Bouma’s case the data were essentially bigrams (or collocations in the traditional sense) and the value was normalized to [−1, +1], in this case the data are trigrams and the normalization therefore results in values between [−1, +2], with negative values standing for a negative associative bond between the elements of the trigram, 0 standing for completely uncorrelated elements (i.e. no association) and 2 standing for an exclusive (maximum) association. Note that when I refer to the MI of the trigrams below, I refer to its normalised value. The actual threshold of the MI was set quite arbitrarily to 0.4, based on preliminary samples of the data and on the MI distribution across the dataset (see Figure 3).

[Figure 3: histogram of the number of trigrams by MI value (from 2 down to −0.6), with the cut-off point at MI = 0.4 marked.]

Figure 3.  Mutual Information (MI) distribution in the dataset and the minimum threshold
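The normalized MI and the pruning thresholds can be sketched as follows. The relative frequencies are assumed to be proportions (not ppm), and applying the bundle and MI tests to the trigram's best decade is a simplification of the procedure described above:

    import math

    def normalized_trigram_mi(rf_xyz, rf_x, rf_y, rf_z):
        # Bouma-style normalization for trigrams: PMI divided by -log of
        # the trigram's own relative frequency; values fall in [-1, +2].
        pmi = math.log(rf_xyz / (rf_x * rf_y * rf_z))
        return pmi / -math.log(rf_xyz)

    def keep_trigram(best_rf_pmw, best_mi, last_decade_af,
                     bundle_floor=10.0, mi_floor=0.4, af_ceiling=1e-15):
        # A trigram stays in the candidate set only if it was at some
        # point frequent enough to be a lexical bundle (>= 10 pmw) and
        # cohesive enough as a collocation (MI >= 0.4), and its
        # last-decade AF has dropped below the misspelling-level ceiling.
        return (best_rf_pmw >= bundle_floor and best_mi >= mi_floor
                and last_decade_af < af_ceiling)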

3.5  Technicalities

The major practical hurdles when working with the Google Books n-gram datasets are their size and the inadequacy of Google’s Ngram Viewer interface for any advanced linguistic research. The datasets have to be downloaded and processed by individual researchers. Another problem is that the data are not available as an




actual corpus of texts but in a format similar to frequency lists (described above). Some of the datasets (American English, British English and Spanish) can be queried through a web-based interface by Mark Davies (2012) that adds some features useful for a corpus linguist, but it (understandably) still lacks the traditional KWIC (key word in context) view, and it cannot query the complete dataset for English; hence it was not used in this project.

For the purposes of this paper, the simplest method of querying the data seemed to be the Structured Query Language (SQL), used in different varieties by most computer database engines. For processing the unigram data, we have used a relational database. When imported into the database, the unigrams amounted to almost 60 GB of data. Processing a database of this size on a relatively powerful PC took several days. The trigram data amount to almost 280 GB and are correspondingly more difficult to process – analysing the trigrams on a local PC was therefore not feasible. I considered a number of hosted solutions, including several scientific infrastructures, but in the end I chose the Google Cloud for several reasons. First, scientific infrastructures usually require lengthy application procedures and are not suited to small-scale, short-lived humanities projects. Second, securing and configuring a dedicated hosted solution that would be powerful enough seemed inefficient (complicated and expensive) when needed for a period of weeks only. Third, I considered several commercial cloud solutions, including Google Cloud, Amazon Web Services and Microsoft Azure. A special advantage of the former two was that they already feature Google Ngrams as public datasets, so the arduous task of downloading, assembling, uploading and importing the data could be avoided.

In terms of technology, Amazon offers a standard (i.e. built on common open-source technologies) solution for processing large amounts of data through its Elastic MapReduce (EMR) service. Google in turn offers its relatively new BigQuery service based on Dremel. Both of these services can perform similar tasks – processing and querying large amounts of data using SQL much faster than anything run on a single, albeit powerful, machine. While Amazon’s EMR is a better-documented standard, it requires more administrative expertise and more time to set up, and it is in the end slower, at least in the tasks necessitated by this methodology. Google’s BigQuery is simpler to use and faster, but since it is a new technology, the documentation is sparse and the SQL variant it uses features some surprising and undocumented differences from, for example, the widely used MySQL variant. However, Google’s technicians were fast to reply and ready to help with any problems. In the end, using BigQuery, most of the queries required by this methodology took only a couple of seconds to complete. For example, assembling all the results into a pivot table suited for manual analysis (computationally the most demanding task of this project) took about 20 seconds.
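By way of illustration, the kind of per-decade aggregation described above can be expressed as a single BigQuery query, here issued through the google-cloud-bigquery Python client. The project, dataset, table and column names are placeholders, not the identifiers actually used in this study, and the SQL may need adjusting to the variant BigQuery accepts:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder table and column names; the public Ngrams tables may
    # be organized differently.
    sql = """
        SELECT ngram,
               DIV(year, 10) * 10 AS decade,
               SUM(match_count) AS freq,
               SUM(volume_count) AS books
        FROM `my-project.ngrams.eng_all_3gram`
        WHERE year BETWEEN 1700 AND 1999
        GROUP BY ngram, decade
    """

    for row in client.query(sql).result():
        print(row.ngram, row.decade, row.freq, row.books)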


During the whole process, both the unigram and the trigram data were slightly modified: all the n-grams strings were turned into lowercase and all non-literal characters were removed. Owing to the fact that most of the punctuation and contraction was already removed by Google when assembling the data, the remaining non-literal characters would, for linguistic purposes, be unreliable guides at best and would in turn make any further processing much more difficult. 4.  Analysis and discussion of the results Since the paper is mainly focused on methodology, the analysis of the data resulting from the processing and pruning described above will be selective, exploratory and in that sense preliminary, rather than systematic and comprehensive. I will intentionally only highlight data that I find useful in showing how this method can be applied and used by future researchers. The manual analysis proceeded from a pivot table with one unigram or trigram per row, AF values for each decade and a number of indicators like OI, MI and maximum/minimum AF/RF in separate columns of each row. By sorting the pivot tables according to OI and by filtering the indicators, I was easily able to highlight the best candidates for further analysis. lngram upon this account upon the account as appears by upon that account appears by the what hath been upon account of am apt to i am apt from hence it put in execution upon this head the like nature it doth not not be improper and from hence

trigram             note  err?  OI        max_d  max ARF  min ARF  min_d  diff  max MI  last MI  MI diff
upon this account   ls    1     2.67E-09  170    3E-10    4E-16    198    6     0.63    0.25     0.39
upon the account    ls          2.66E-09  170    4E-10    7E-16    198    6     0.45    0.09     0.37
as appears by       s           2.02E-09  170    2E-10    1E-15    198    6     0.61    0.15     0.46
upon that account   ls          1.53E-09  170    2E-10    7E-16    197    6     0.51    0.19     0.32
appears by the      s           1.30E-09  170    1E-10    3E-15    198    5     0.41    0.01     0.39
what hath been      m           1.28E-09  170    2E-10    7E-16    198    6     0.70    0.36     0.34
upon account of     ls          1.17E-09  173    1E-10    6E-16    198    6     0.42    0.12     0.30
am apt to           ?           1.03E-09  170    1E-10    3E-15    198    5     0.65    0.35     0.29
i am apt            ?           9.53E-10  170    1E-10    3E-15    198    5     0.78    0.47     0.32
from hence it       ls          9.44E-10  173    1E-10    2E-16    197    6     0.61    0.19     0.42
put in execution    s           8.45E-10  173    9E-11    2E-15    198    5     0.74    0.34     0.40
upon this head      s           8.02E-10  173    1E-10    2E-16    198    6     0.65    0.21     0.45
the like nature     s           7.43E-10  171    1E-10    2E-16    198    6     0.46    0.00     0.45
it doth not         m           7.17E-10  171    8E-11    1E-15    198    5     0.54    0.36     0.18
not be improper     s           6.90E-10  172    7E-11    9E-16    198    5     0.64    0.31     0.34
and from hence      s           6.19E-10  172    7E-11    3E-16    197    6     0.52    0.12     0.41

ARF by decade (170 = 1700s … 200 = 2000s; the decades 175–195 are omitted, see note 13):

trigram             170       171       172       173       174       196       197       198       199       200
upon this account   3.24E-10  2.09E-10  2.10E-10  1.52E-10  7.54E-11  4.72E-15  2.19E-15  7.92E-16  3.64E-16  1.90E-15
upon the account    3.80E-10  3.28E-10  1.42E-10  1.17E-10  3.78E-11  4.16E-15  2.88E-15  1.25E-15  7.38E-16  2.54E-15
as appears by       2.00E-10  1.40E-10  1.54E-10  1.01E-10  1.29E-10  1.70E-14  1.36E-14  2.17E-15  1.41E-15  6.37E-15
upon that account   1.74E-10  1.38E-10  8.85E-11  9.00E-11  6.63E-11  5.59E-15  2.74E-15  7.18E-16  8.37E-16  2.51E-15
appears by the      1.19E-10  8.20E-11  9.96E-11  7.01E-11  8.54E-11  2.12E-14  1.83E-14  4.16E-15  3.37E-15  7.36E-15
what hath been      1.67E-10  7.92E-11  6.14E-11  1.05E-10  3.78E-11  5.50E-15  3.09E-15  9.65E-16  6.72E-16  3.75E-15
upon account of     9.03E-11  8.54E-11  1.16E-10  1.39E-10  1.07E-10  4.93E-15  1.96E-15  8.29E-16  6.40E-16  2.20E-15
am apt to           1.27E-10  4.87E-11  8.96E-11  5.64E-11  4.56E-11  1.69E-14  8.62E-15  4.24E-15  2.75E-15  8.47E-15
i am apt            1.09E-10  4.43E-11  8.63E-11  5.69E-11  4.23E-11  1.62E-14  7.74E-15  3.74E-15  2.59E-15  7.49E-15
from hence it       6.34E-11  9.43E-11  1.04E-10  1.06E-10  7.82E-11  2.27E-15  8.85E-16  2.23E-16  3.26E-16  1.08E-15
put in execution    5.06E-11  7.39E-11  6.21E-11  8.69E-11  7.61E-11  1.48E-14  9.92E-15  2.11E-15  1.79E-15  7.94E-15
upon this head      5.93E-11  3.73E-11  8.61E-11  1.32E-10  5.81E-11  2.20E-15  1.18E-15  2.73E-16  1.56E-16  1.74E-15
the like nature     8.38E-11  9.72E-11  3.94E-11  4.91E-11  2.43E-11  4.21E-15  1.81E-15  7.16E-16  2.25E-16  1.36E-15
it doth not         4.25E-11  8.31E-11  4.25E-11  6.22E-11  4.63E-11  9.82E-15  4.26E-15  1.48E-15  1.26E-15  6.91E-15
not be improper     1.44E-11  3.11E-11  7.00E-11  5.21E-11  6.10E-11  6.68E-15  6.27E-15  1.51E-15  9.08E-16  2.60E-15
and from hence      4.48E-11  3.16E-11  7.37E-11  6.72E-11  5.42E-11  2.55E-15  1.44E-15  3.29E-16  6.15E-16  1.38E-15

Figure 4.  The pivot table used to select the best candidates for further analysis13

13.  The columns are: trigram, notes for analysis, OI, decade of maximum ARF, maximum ARF, minimum ARF, decade of minimum ARF, difference between minimum and maximum ARF, maximum MI, MI of the last decade, difference between minimum and maximum MI, and ARF by decade (the decades from the 1750s to the 1950s were cut to fit the table into this format).

4.1  Unigrams

Words that remained after the pruning of errors, typos and "junk" can be divided into two basic categories: (a) those forms that have been replaced in their function by a single observable form; and (b) those forms that do not have an apparent substitute. The forms that have a clear substitute were usually either replaced by a formally related word or belong to a specialised terminology that is often well structured – largely because otherwise we would not have been able to spot the substitute. In the case of replacement by a related form, the loss of the form in question is part of a larger process: spelling standardization, morphological analogy or change in word-formation strategies (see Figure 5 for an example of a loss through replacement). Examples of the standardization of spelling are (5) recal (replaced by recall) or (6) oxigenated (replaced by oxygenated, but note that in turn (7) oxyd was replaced by oxide). Morphological analogy can be exemplified by the loss of (8) shew and (9) shewn (in favour of show, shown); (10) properest (partly in favour of most proper, but partly disused due to the decline of any superlative of proper); and (11) commotions or (12) dissentions (previously mostly plural, now usually singular and uncountable). Changes in word-formation strategies are most noticeable in relatively new items like (13) proteid (replaced by protein), (14) acetous (replaced by acidic) or (15) sulphuret (replaced by sulphide). It is clear that, in the eighteenth and nineteenth centuries, scientific terminology, especially in chemistry and geology, was being formed through a kind of natural selection process, rather than being determined by an appointed institution or authority. Another common tendency seems to be the replacement of native affixes with foreign ones (which could possibly be part of a larger process of dehybridization). Examples of the process include (16) visiter (replaced by visitor); (17) unfrequent (replaced by infrequent); (18) rivalship (replaced by rivalry); and (19) profaneness (replaced by profanity).14 Conversion may be replacing derivation in cases like (20) inconveniency (replaced by inconvenience), but more data would be needed to show whether this is a general trend, since there are counterexamples at hand like (21) falses (replaced by falsehoods). Prescriptive etymologizing tendencies are at play in word-formation when "incorrect" forms like (22) cotemporary are replaced by etymologically "correct" forms like (23) contemporary. Some forms also switch affixes for no apparent reason, like (24) emblematical (replaced by emblematic).

14.  The decline of profaneness may also, at least in part, be explained on phonetic grounds (the relative difficulty of its articulation).


Replacement of forms like (25) intituled (by entitled) or (26) lenity (by leniency) can historically be grouped with word-formation, though synchronically it should be classified as replacement by a doublet/cognate form. Replacement of a lexeme by a non-related form is best exemplified by specialised terminology, such as (27) azotic (replaced by nitric). (28) Muriatic is an example half-way to the second category, since it is not replaced by a single form, but indirectly by constructions containing the word brine, such as pertaining to brine.


Figure 5.  Frequency development of profanity and profaneness (RF in ppm, by decade, 1800–2000) as an example of a replacement pattern
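For readers who wish to reproduce plots like Figure 5, the sketch below converts absolute decade frequencies into RF ppm and plots two competing forms. All counts are invented placeholders, not the actual corpus values.

```python
# Illustration of how a replacement plot is derived: AF per decade is divided
# by the decade's total token count and scaled to parts per million (RF ppm).
# The token totals and word counts below are made-up placeholders.
import matplotlib.pyplot as plt

decade_totals = {1800: 6e8, 1900: 9e9, 2000: 2e10}   # tokens per decade (dummy)
af = {
    "profaneness": {1800: 900, 1900: 1_500, 2000: 2_000},
    "profanity":   {1800: 150, 1900: 9_000, 2000: 40_000},
}

def rf_ppm(counts, totals):
    """Relative frequency per million words, decade by decade."""
    return {d: counts.get(d, 0) / totals[d] * 1e6 for d in sorted(totals)}

for word, counts in af.items():
    series = rf_ppm(counts, decade_totals)
    plt.plot(list(series), list(series.values()), label=word)
plt.xlabel("decade")
plt.ylabel("RF ppm")
plt.legend()
plt.show()
```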

The second category, where no apparent replacement can be traced, is perhaps more interesting, since it exemplifies what we most readily associate with lexical mortality – the loss of a word without any apparent compensation. A word may disappear as part of a derivational family, so that (29) reprobate and (30) reprobated are lost as a verb and a participial adjective, while the noun and adjective reprobate remain. Or the whole derivational family may be lost, as in the cases of flagitious ('deeply criminal, wicked', together with flagitiously, flagitiousness etc.) or (31) animadversion ('the turning of the mind to something', together with animadverse, animadversive etc.). Another reason why this category is of special interest is that the non-existence of an immediate substitute opens the question of the potential motivation and circumstances leading to the obsolescence.




In the case of flagitious, examining some of the examples in Google Books, one can hardly miss the pervasively religious context or, more specifically, the markedly anti-Catholic discourse in which the word often appears. This appears to be borne out by correlating the frequency of flagitious with a more obviously discourse-specific word such as papist (see Figure 6).


Figure 6.  Frequency development of flagitious and papist (RF in ppm, by decade, 1800–2000)

4.2  Trigrams

The analysis of the trigrams followed a similar method to that of the unigrams, but an additional indicator was used: the MI. This proved to be a salient indicator, which again divided the results into two categories. The first is characterized by a smaller (if any) decrease in MI (between the maximum MI and the MI in the last decade), indicating that not only has the AF of the trigram itself decreased, but so has the AF of at least one of its elements, since only a low-frequency member of the trigram can bring about a high MI in such a low-frequency trigram.15 In other words, the obsolescence of a trigram is in such cases caused by the obsolescence of one or more of its members. From the viewpoint of multi-word expressions, this is not especially interesting, since the obsolescence of the individual members is better observed through an analysis of the unigram dataset. Examples of this category are trigrams like (32) it doth not or (33) by vertue of, where the causes are obviously doth and vertue, respectively. Characterised by a large decrease in MI, the second category is more interesting, because the decrease in MI clearly signals that a once salient collocation is becoming less salient, though its members may still be quite frequent outside the trigram.

15.  A fact that disqualifies MI from being used for comparison of strength of collocation in low-frequency items.
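The chapter does not spell out the exact MI formula. One plausible reading, sketched below, is a normalized pointwise mutual information in the spirit of Bouma (2009, see References) extended to three members so that the maximum is 1; whether this matches the indicator actually computed is an assumption.

```python
# A hedged sketch of a trigram MI: normalized pointwise mutual information,
# assuming (not verified against the chapter) a three-member extension of
# Bouma's (2009) bigram NPMI.
import math

def trigram_npmi(p_xyz, p_x, p_y, p_z):
    pmi = math.log(p_xyz / (p_x * p_y * p_z))   # association vs. independence
    return pmi / (-2 * math.log(p_xyz))         # equals 1 when the three words
                                                # only ever occur together

# A single rare member inflates the score even when the trigram itself is
# vanishingly rare, which is the behaviour noted in footnote 15:
print(trigram_npmi(1e-9, 1e-9, 1e-2, 1e-2))  # rare first member: ~0.22
print(trigram_npmi(1e-9, 1e-2, 1e-2, 1e-2))  # all members common: negative
```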

As in the case of the unigrams, the loss of several of the example trigrams may be explained and exemplified as a replacement by one or more immediate substitutes. For ought I (34) shows that exploring trigrams may often point to 4-grams, since the salient construction in this case is for ought I know. This had been replaced by a spelling variant, for aught I know,16 only to be replaced shortly afterwards by for all I know (see Figure 7).


Figure 7.  Frequency development of the 4-grams for ought I know, for aught I know and for all I know (RF in ppm, by decade, 1740–2000)

Exploring trigrams may also point to interesting bigrams, as in the case of (35) a right line, where the salient construction is right line – a technical term that had long been in competition with, and was eventually replaced by, straight line. In a similar fashion, the following two examples were replaced by analogous forms with just one member of the trigram changed: (36) the like nature came to be replaced by the same nature, and (37) after what manner gave way to in what manner. In both cases, even the substituting constructions now seem to be losing frequency and are in danger of being replaced themselves (the latter construction mostly by how). The last two examples are of special interest because there appears to be no obvious substitute causing their disappearance.

16.  The competition of ought and aught goes all the way back to Old English (OED Online 2015).




In the case of (38) there needs no (as in "For where no sacrifice is, there needs no priest"), the cause may be a general decline in impersonal constructions and their replacement either by a nominal prepositional construction such as there is no need for or by a construction with an overt subject. The final example, (39) as appears by (as in "That the complainants have not, as appears by their bill, made out or stated any right or title in them"), like in fact many others, is highly discourse-specific: it is typical of legal documents and theological treatises. The disappearance of as appears by seems to be partly due to a decline in the latter genre; in terms of frequency, it is one of the best examples I have found of a very common expression becoming extremely scarce (from 13 ppm in the 1710s to 0.02 ppm in the 1990s), even though legal English is still quite well represented in the corpus.

4.3  Future research

Future improvements of this methodology may start from a further and more in-depth analysis of its results, preferably on more traditionally constructed corpora (e.g. the Hansard Corpus or the Early English Books Online material). Another direction might be to apply the methodology to other periods of English, though there may not be enough digitised material prior to the fifteenth century (see Section 3.1), and the material prior to the eighteenth century may need lemmatisation/standardisation in preparation for this kind of processing, or the methodology may need to be adapted to some degree. Lastly, it would be desirable to see the method applied to other languages, though again, in the case of inflectional languages, the data would need to be lemmatised.

5.  Conclusions

While easily comparable data reflecting lexical loss in other periods of the history of English are not available at the moment, it seems that the answer to the first research question is that the phenomenon of lexical loss is not limited to periods of significant linguistic upheaval (such as Early Middle English). On the other hand, no losses were discovered in the period 1700–2000 comparable in systemic and structural centrality and consequence to the changes of the Early Middle English period.17

17.  Examples of such losses may be the plurals of personal pronouns in h- or such core members of the lexicon as niman or drihten.


The results presented in the previous sections also provide an answer to the methodological questions: "How can we study lexical loss using corpora, and how large do the corpora need to be?" I consider the method presented here to be valid and useful in discovering examples of lexical obsolescence and loss. How accurate and comprehensive the method is with regard to always locating important items cannot be answered at this moment and should be subject to further testing and/or comparison with other methods.

To conclude, the method can help researchers extract interesting, but otherwise easy-to-overlook, data via a corpus-driven methodology focused not only on lexical loss, but also on the standardization of spelling and on changes in inflectional and word-formation (mainly derivational) morphology and syntax, especially in the case of multi-word expressions. The corpus-driven aspect of the methodology may help to better define lexicographical terminology in the area of lexical obsolescence and loss, as well as improve the consistency with which the terms are applied in lexicographical resources – in the same way traditional corpus methodologies help to improve the consistency and verifiability of lexicographical terminology in the area of high-frequency words and English Language Teaching (e.g. core and basic vocabulary). In fact, during the review process of this paper, the OED implemented a word frequency indicator based on the same dataset – the Google Ngrams – possibly derived by a methodology not too dissimilar to the one presented here (OED Online: Key to frequency 2015). Assessment of the methodology introduced by the OED is difficult at present, since not much detail about it is provided. On the one hand, it is definitely a step that this paper would endorse; on the other, putting words like argentiferous and badass into the same frequency band may be warranted by their frequencies, but the data from which these frequencies are derived cannot justifiably be seen as representing typical modern English usage (see Section 2). Finally, the present methodology should help us understand, and spur interest in, lexical loss in English and other languages.

References

Aitchison, Jean. 1987. Words in the Mind: An Introduction to the Mental Lexicon. New York NY: Blackwell.
Biber, Douglas. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.
Bouma, Gerlof. 2009. Normalized (pointwise) mutual information in collocation extraction. Proceedings of the GSCL Conference, 31–40.
Coleman, Robert. 1990. The assessment of lexical mortality and replacement between old and modern English. In Papers from the 5th International Conference on English Historical Linguistics [Current Issues in Linguistic Theory 65], Sylvia M. Adamson, Vivien A. Law, Nigel Vincent & Susan Wright (eds), 69–86. Amsterdam: John Benjamins. doi: 10.1075/cilt.65.08col
Čermák, Jan. 2008. Ælfric's homilies and incipient typological change in the 12th century English word-formation. Acta Universitatis Philologica: Prague Studies in English XXV(1): 109–115.
Davies, Mark. 2012. Google Books corpus. (1 February 2016).
Google Books History. 2009. (29 November 2015).
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4): 403–437. doi: 10.1075/ijcl.13.4.02gri
Hales, Steven D. 2005. Thinking tools: You can prove a negative. Think 4(10): 109–112. (30 January 2016). doi: 10.1017/S1477175600001287
Jackson, Howard. 2002. Lexicography: An Introduction. New York NY: Routledge. doi: 10.4324/9780203467282
Kilgarriff, Adam. 2015. How many words are there? In The Oxford Handbook of the Word, John R. Taylor (ed.), 29–37. Oxford: OUP.
Michel, Jean-Baptiste, Shen, Yuan Kui, Aiden, Aviva Presser, Veres, Adrian, Gray, Matthew K., Pickett, Joseph P., Hoiberg, Dale et al. 2011. Quantitative analysis of culture using millions of digitized books. Science 331(6014): 176–182. (29 November 2015). doi: 10.1126/science.1199644
Milton, James & Donzelli, Giovanna. 2013. The lexicon. In The Cambridge Handbook of Second Language Acquisition, Julia Herschensohn & Martha Young-Scholten (eds), 441–460. Cambridge: CUP. doi: 10.1017/CBO9781139051729.027
OED Online: Key to frequency. 2015. OED Online. (4 February 2016).
Petersen, Alexander M., Tenenbaum, Joel, Havlin, Shlomo & Stanley, H. Eugene. 2012. Statistical laws governing fluctuations in word use from word birth to word death. Scientific Reports 2. (29 November 2015). doi: 10.1038/srep00313
TestYourVocab.com. (29 November 2015).
Trench, Richard Chenevix. 1871. English, Past and Present. New York NY: Charles Scribner and Company.

part ii

Patterns in utilitarian texts

chapter 5

Constance and variability
Using PoS-grams to find phraseologies in the language of newspapers
Antonio Pinna & David Brett
Università degli Studi di Sassari

This paper describes the use of a corpus-driven methodology, the retrieval of part-of-speech-grams (PoS-grams), which is extremely effective for the discovery of phraseologies that might otherwise remain hidden. The PoS-gram is a string of part-of-speech categories (Stubbs 2007: 91), the tokens of which are strings of words that have been annotated with these PoS tags. A list of PoS-grams retrieved from a sample corpus can be compared with that from a reference corpus. Statistically significant items are further analysed to identify recurrent patterns and potential phraseologies. The utility of PoS-grams will be illustrated by way of analysis of a one-million-token corpus composed of texts from ten sections of The Guardian, the Sassari Newspaper Article Corpus (SNAC).

Keywords:  PoS-grams; phraseology; journalism; corpus-driven

1.  Introduction

This article investigates the potential of a hitherto largely unexplored methodology for the retrieval of multi-word sequences containing a certain amount of variation on the semantic rather than the syntactic plane. The methodology in question is the retrieval of part-of-speech grams (or PoS-grams), which provides chains of word forms, all corresponding to specific concatenations of PoS categories. The sub-registers to be examined are different sections of a British daily newspaper, The Guardian. Our research question focuses on whether some PoS-grams are present in particular sections of the newspaper in a statistically significant manner, and if so, whether the syntactic regularity is indicative of phraseologies typical of the sub-register.


Corpus linguistic studies have traditionally privileged the investigation of a specific type of Multi-Word Unit (MWU) model, one which is variously known as the n-gram (e.g. Stubbs 2007), chain (e.g. Stubbs & Barth 2003), lexical bundle (e.g. Biber et al. 1999: 987–1024) or word cluster (e.g. Carter & McCarthy 2006: 828–837). This is a recurrent, continuous sequence of word forms. The most commonly studied form is that composed of four word forms (e.g. Biber & Barbieri 2007; Biber et al. 2004; Hyland 2008), as bundles of this length are usually more frequent than longer strings and, at the same time, have a wider assortment of readily recognizable functions than shorter sequences. The studies by Biber et al. (2004) and Biber and Barbieri (2007) are particularly important as they identify four main functional roles played by lexical bundles in university registers: discourse organization, reference, stance and interaction management. Discourse organizers link prior and forthcoming portions of text; referential bundles identify an entity or a particularly relevant attribute of an entity; stance expressions convey speaker attitude; finally, interactive bundles are typically used to mark politeness or reported speech. These functions have been shown to provide a means of differentiation between spoken and written university registers. In particular, Biber and Barbieri (2007: 273, 279) show that stance expressions are most frequent in oral registers (e.g. classroom teaching and class management), while referential bundles are more common in written ones (e.g. institutional writing and textbooks).

However, while the extraction of n-grams is a highly useful tool for the identification of linguistic patterns, for some uses its focus on identical, rather than very similar, strings may lead to the exclusion of considerable quantities of important information: "n-gram searches are only helpful in finding instances of collocation that are strictly contiguous in sequence. The result is that many instances of word association may be overlooked, and that collocations that typically occur in non-contiguous sequences (i.e. AB, ACB) risk going undiscovered" (Cheng et al. 2006: 412). The first step on the road to uncovering similar, rather than identical, strings is constituted by the skip-gram, which identifies word forms that are repeatedly present within a certain span, for example, A B, A * B, A * * B etc. However, as Cheng et al. (2006) note, the variation allowed for concerns only constituency, and not position-based variation. Therefore, instances of B A will not be summed with those of the other patterns listed above. In order to enact such a measure, one must avail of the concgram, which has been defined as "all of the permutations of constituency variation and positional variation generated by the association of two or more words" (Cheng et al. 2006: 414; see also Cheng et al. 2009; Greaves & Warren 2010).




Stubbs (2007) describes various methods for the extraction of what he calls "routine phraseologies"; apart from n-grams, he discusses two main procedures that take into account variability in one form or another.1 The first of these is the phrase-frame (or p-frame), which is "an n-gram with one variable slot" (Stubbs 2007: 90). He provides the example of plays a * part in, in which the variable slot may be occupied by a large number of items from the same semantic set: large, significant, big, major, vital, essential, key, central, full, great and prominent.2

Stubbs (2007: 94–95) also illustrates the procedure that is the main focus of the current work: the PoS-gram, a string of part-of-speech categories. The examples given concern sequences with a length of five, yet clearly this length can be varied according to the needs of the research project in hand. Stubbs (2007: 94) lists some of the most frequent PoS-grams in the BNC and provides examples for each: these are described as being "parts of nominal and prepositional phrases, which express spatial, chronological and logical relations". One of the research questions of the present work is whether this observation, based on a large, well-balanced general corpus, applies equally to specific registers, or whether variation in the repertoires of PoS-grams from register to register may point towards the over- or underuse of certain syntactic categories.

At this point we would like to provide a brief example of what PoS-grams are and of their potential in uncovering variability. Using the Phrases in English resource (see footnote 1), we extracted data concerning the PoS-gram PRP AT0 NN1 PRF DPS NN1.3 There are 417 different types of this PoS-gram in the database, corresponding to a total of 2987 tokens; its frequency is therefore approximately 30 per million words. The PoS-gram in question constitutes a potentially complete syntactic unit composed of a prepositional phrase containing a noun phrase post-modified by another noun phrase.

1.  Stubbs' work is based largely on information extracted from William Fletcher's http://phrasesinenglish.org/, which allows queries to be made of a database of items extracted from the second or World Edition of the 100-million-word British National Corpus (BNC). The types of -grams involved include n-grams, phrase-frames, PoS-grams and Char-grams. The database provides a list of all the PoS-grams in the BNC with a minimum frequency of 3 instances.

2.  The concept of the phrase frame was originally developed by William Fletcher (2002–2007), who launched the software kfNgram, a program dedicated to the extraction of n-grams and phrase frames. More recently, Biber (2009) and Gray and Biber (2013) carried out analyses of what are essentially p-frames, which they describe as "recurrent four-word continuous and discontinuous patterns" (Gray & Biber 2013: 109), but in this case allowing for more than one variable slot in each n-gram. These works investigate variability within multiword units using two corpora: one of American English conversation and the other of academic prose.

3.  Preposition + Article + Singular noun + of + Possessive determiner form + Singular noun. See Appendix B for a full list of the CLAWS5 tags.


A brief glance at the types is sufficient to ascertain that the sixth slot (NN1 – singular noun) is frequently (roughly 50% of the time) occupied by words indicating body parts, such as head (33), neck (22), hand (22), mouth (13), eye (10), throat (8), tongue (5), back (4), heart (4) and stomach (4), or words in any case related to the person, such as mind (17) and voice (6). We therefore ordered the results by the sixth slot and then by the first slot (PRP – preposition). This revealed repeated instances in which there is a series of highly similar phrases, each of which differs only in one or two slots. These highly similar phrases can be condensed into the formulae displayed in Figures 1a, 1b and 1c.

(a) with the { back | flat | palm | heel } of { his | her | your | my } hand

(b) out of the { corner | side } of { her | his } { eye | mouth }

(c) with a { shake | nod | toss | jerk } of { his | her } head

Figure 1.  Formulae condensed from types of the PoS-gram PRP AT0 NN1 PRF DPS NN1

The individual elements are arranged so that a string of the topmost elements corresponds to the most frequent type: with the back of his hand has 54 tokens out of a total of 111; out of the corner of her eye, 47 tokens out of 142; and with a shake of his head, 7 out of 28. Note that out of is counted as a single preposition. The previous analysis has shown how useful this methodology may be in uncovering potential multiword units. In the examples above, three collocational patterns have been identified that allow more than one variable slot where their constituent elements belong to restricted semantic areas. In the case of Figure 1a, the items in the first variable slot denote parts of a human hand and are thus in a meronymic relationship with hand. A similar relationship is at work in Figure 1b between the items in the first variable slot (corner, side) and those in the last (eye, mouth), while in Figure 1c the nominalized actions in the first variable slot are related to head by representing the symbolic relay of some unspoken message, typically specified in the co-text. Examples (1)–(4) illustrate the latter point.






(1) Peter dismissed this with a shake of his head.



(2) The mother motioned him to go with a shake of her head.



(3) He indicated a swing-seat, but with a shake of her head she refused to sit.



(4) […] offered one to Estabrook, who declined with a shake of his head.

Frequency of occurrence, a clear internal relationship among their components, and functional roles within their co-texts mark the phraseological nature of these patterns, the identification of which can be obtained only by means of a method, such as the one outlined here, that can account for paradigmatic variability within their constituent slots.4 Despite its considerable potential for the extraction of phraseologies, very few studies can be found in the corpus linguistics literature that make use of the PoS-gram procedure. One notable example, within the realm of historical linguistics, is Morley and Sift's (2006) study of directive speech acts in Late Middle English sermons. These authors do not apply the PoS-gram procedure rigidly, but rather avail of a sort of hybrid between the PoS-gram and the p-frame, allowing up to two wildcards in each sequence, due to the "inconsistency of grammar and syntax in Middle English" (Morley & Sift 2006: 103). On the other hand, the technique, and variations thereof, has received a certain amount of attention in the field of information retrieval. For example, Spiccia, Augello, and Pilato (2015) describe a method in which PoS-grams are used to make Italian-language automatic text completion applications more efficient; D'hondt, Verberne, Weber, Koster, and Boves (2012) use PoS-filtered skip-grams to aid the classification of patents in English. Finally, within the field of Natural Language Processing, Reyes and Rosso (2012) use PoS-grams of variable length (2 to 7) in combination with other features to identify key components that enable the automatic detection of irony in a corpus of customer reviews.

4.  The collocational frameworks illustrated in Figure 1 highlight the tendency for meaning and syntax to be closely associated and are reminiscent of Hunston and Francis's (2000) notion of Pattern Grammar, where words characterized by semantic similarity occurring in a given syntactic structure typically fall into similar functional or topical categories, thereby contributing to the meaning of the entire structure. Though with some notable exceptions, as is the case with patterns characterized by extraposed it or existential there, their grammar patterns are however limited in length, with usually no more than three syntactic items, and broader in terms of the classes of meanings associated with each pattern. Our methodology aims to uncover longer phraseological stretches with more definite functional or pragmatic meanings. For example, the collocational frameworks shown in Figure 1 may all be subsumed under the grammar pattern "poss N" in Francis, Hunston, and Manning's (1998: 59–80) list of noun-based patterns. However, nouns with the same pattern in their list may belong to 64 different meaning groups, none as specific as those foregrounded in Figure 1.
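Returning to the slot analysis illustrated in Figure 1, the following sketch shows how the fillers of a chosen slot can be tallied across the word-form types of a PoS-gram to expose restricted semantic sets; the type list and frequencies are a small invented sample, not the full 417 types.

```python
# Sketch of slot analysis over PoS-gram types: count which word forms fill a
# given slot, weighted by token frequency. Types and counts are illustrative.
from collections import Counter

types = {
    "with the back of his hand": 54,
    "with the palm of her hand": 12,
    "with the heel of his hand": 8,
    "with a shake of his head": 7,
    "with a nod of her head": 4,
}

def slot_fillers(types, slot):
    fillers = Counter()
    for phrase, freq in types.items():
        fillers[phrase.split()[slot]] += freq
    return fillers

print(slot_fillers(types, slot=5))  # sixth slot: Counter({'hand': 74, 'head': 11})
print(slot_fillers(types, slot=2))  # third slot: back/palm/heel vs. shake/nod
```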


In short, the PoS-gram is greatly underused as an analytical procedure, and extensive studies comparing the relative frequency of PoS sequences in different registers and their most common phraseologies have yet to be undertaken. In our study, we focus on newspaper writing to verify whether there is indeed variability in the types and frequencies of PoS-grams across its different sub-registers. The main factor conditioning our choice of register was that the different sections of newspapers provide considerable variety in subject matter and communicative purpose, hence constituting highly suitable material for the type of analysis we intended to conduct. Biber and Conrad (2009) provide a corpus-based analysis of the main situational and linguistic characteristics of newspaper writing with respect to academic prose. The linguistic features highlighted include the pervasiveness of noun phrases, which display a marked tendency towards pre- and post-modification. Biber and Conrad connect this particular linguistic feature with the communicative purpose of newspaper writing: that of being precise and concise. Newspaper writing is considered to be a general register, of which several sub-registers are identified: "newspapers have articles identified as 'news analysis', sports reports, editorials, letters to the editor, and movie and restaurant reviews. These sub-registers differ in their particular communicative purposes, and so we would predict that there will be corresponding linguistic differences" (Biber & Conrad 2009: 124). For example, the communicative purpose of a news report is reporting facts; editorials, on the other hand, provide readers with opinions and interpretations of current affairs. Therefore, in the former: "[t]here are no opinions overtly expressed, no suggestions for next steps, no discussion of hypothetical situations or possibilities for the future. Correspondingly, modals and conditionals are absent" (Biber & Conrad 2009: 126). There is therefore good reason to expect that such differences in communicative purpose may be reflected in the texts in the form of variation in the relative proportions of syntactic constructions, a feature that can be highlighted by the PoS-gram method.5 In our study, we consider texts from ten different sections of a newspaper corpus compiled at the University of Sassari: the Sassari Newspaper Article Corpus (SNAC). Some of the sections correspond to news reports (Crime, Banking, World, Politics) and others deal mainly with evaluation and opinion, and are closer to the sub-register of the editorial (Travel, Film, Education, Football). Hence, there is difference not only in communicative purpose, but also clearly in topic. By way of the identification of PoS-grams that are statistically significant in particular sections of the SNAC, we aim to shed new light on characteristic phraseologies of sub-registers in newspaper writing, and on their functions.

5.  Further investigation of the linguistic features of newspaper language drawing on corpora is provided by Bednarek (2008) and Bednarek and Caple (2012). The latter analyses lexical and syntactic features such as noun phrases, verbs and adverbials that characterise the register.




2.  Materials and methods

The corpus taken into examination was the Sassari Newspaper Article Corpus (henceforth SNAC), collected by the authors at the University of Sassari, Italy. This is a one-million-token corpus composed of texts downloaded from 10 different sections of the online version of the well-known British newspaper. The sections are Travel, (UK) Crime, Football, Banking, Politics, Education, Obituaries, Technology, World News and Films (details are provided in Appendix A). The texts were then tagged for part-of-speech (PoS) using the online CLAWS tool. The tagset used was C5 and the output style was set to Vertical.6

The reference corpus chosen was the BNC. Although larger corpora are now available, the BNC was deemed sufficiently large and varied to be representative of general spoken and written usage. Furthermore, it is composed of samples of British English, thus being diatopically coherent with the SNAC, and is tagged with the same tagset, thus allowing direct comparison. Tailor-made perl scripts were then used to form PoS-grams starting from each token of the texts. Initially, the length of the PoS-grams was set to 4; however, this resulted in an unmanageable quantity of data. Thereafter, the length was gradually increased up to 8, at which point the results were too small in number to allow in-depth analysis. Therefore, we settled on 6 as an ideal medium that provides both sufficient data and a long enough span to permit the identification of specific functions. The PoS-grams that featured tags for punctuation (PU*), sentence boundaries (SENT) or elements judged to be unclear (UNC) were discarded.

The PoS-grams obtained were then quantified and compared with a database of PoS-grams retrieved from the BNC in the following way: for each section of the SNAC, the PoS-grams extracted with a frequency greater than or equal to ten were tallied with those from the BNC. The chi-square test was then applied to identify the PoS-grams that correlated positively with the sample section; only the results with p < 0.01 (χ² ≥ 6.63) were saved for further study.7 This procedure resulted in large quantities of results. For example, the results for Travel, Crime and Obituaries are shown in Table 1.

6.  See the online CLAWS documentation for a description of the C5 tagset.

7.  The procedure followed is that described by Baron, Rayson, and Archer (2009). The test involved one degree of freedom, and the frequencies of the PoS-grams were compared with the total tokens of the two corpora, there being a negligible difference between the corpus size and the possible number of PoS-grams that it could contain (corpus size − PoS-gram length + 1).
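A compact sketch of this pipeline, forming 6-grams of C5 tags and testing them against a reference corpus, is given below. The original implementation used tailor-made perl scripts; this Python rendering and its helper names are an illustrative reconstruction under the assumptions stated in the comments, not the authors' code.

```python
# Illustrative reconstruction: build PoS-grams of length 6 from a C5 tag
# sequence, discard those containing punctuation/sentence-boundary/unclear
# tags, and keep PoS-grams that are significantly overrepresented against the
# reference corpus (chi-square >= 6.63, i.e. p < 0.01 with one degree of
# freedom, following Baron, Rayson & Archer 2009).
from collections import Counter

SKIP_PREFIXES = ("PU", "SENT", "UNC")

def pos_grams(tags, n=6):
    """Count n-grams over a list of C5 tags (one tag per token)."""
    grams = Counter()
    for i in range(len(tags) - n + 1):
        window = tags[i:i + n]
        if any(t.startswith(p) for t in window for p in SKIP_PREFIXES):
            continue
        grams[" ".join(window)] += 1
    return grams

def chi_square(freq_sample, size_sample, freq_ref, size_ref):
    """2x2 chi-square of a PoS-gram's frequencies against the corpus sizes."""
    total_freq = freq_sample + freq_ref
    total_size = size_sample + size_ref
    exp_sample = size_sample * total_freq / total_size
    exp_ref = size_ref * total_freq / total_size
    return ((freq_sample - exp_sample) ** 2 / exp_sample
            + (freq_ref - exp_ref) ** 2 / exp_ref)

# A PoS-gram is kept if its sample frequency is >= 10 and
# chi_square(freq_sample, size_sample, freq_ref, size_ref) >= 6.63.
```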


Table 1.  Numerical data relating to the relevant (p