Corpus Linguistics for Pragmatics: A Guide for Research 1138718785, 9781138718784


English Pages 220 [221] Year 2018


Corpus Linguistics for Pragmatics

Corpus Linguistics for Pragmatics provides a practical and comprehensive introduction to the growing field of corpus pragmatics. Taking a hands-on approach to showcase the applications of corpora in the exploration of core topics within pragmatics, this book:

• covers six key areas of corpus-pragmatic research, including speech acts, deixis, pragmatic markers, evaluation, conversational structure, and multimodality;
• demonstrates the use of freely available corpora, corpus interfaces, and corpus analysis tools to conduct original pragmatic analyses;
• is accompanied by an e-resource which hosts multimodal data sets for additional exercises.

Featuring case studies and practical tasks within each chapter, Corpus Linguistics for Pragmatics is an essential guide for students and researchers studying or conducting their own corpus-based research in pragmatics.

Christoph Rühlemann lectures in the Department of English and American Studies at Philipps-University Marburg, Germany.

Routledge Corpus Linguistics Guides provide accessible and practical introductions to using corpus-linguistic methods in key sub-fields within linguistics. Corpus linguistics is one of the most dynamic and rapidly developing areas in the field of language studies, and use of corpora is an important part of modern linguistic research. Books in this series provide the ideal guide for students and researchers using corpus data for research and study in a variety of subject areas.

SERIES CONSULTANT: RONALD CARTER
Ronald Carter is Research Professor of Modern English Language in the School of English at the University of Nottingham, UK. He is the co-series editor of the Routledge Applied Linguistics, Routledge Introductions to Applied Linguistics, and Routledge English Language Introductions series.

SERIES CONSULTANT: MICHAEL McCARTHY
Michael McCarthy is Emeritus Professor of Applied Linguistics at the University of Nottingham, UK; Adjunct Professor of Applied Linguistics at the University of Limerick, Ireland; and Visiting Professor in Applied Linguistics at Newcastle University, UK. He is co-editor of the Routledge Handbook of Corpus Linguistics and editor of the Routledge Domains of Discourse series.

SERIES CONSULTANT: ANNE O’KEEFFE
Anne O’Keeffe is Senior Lecturer in Applied Linguistics and Director of the Inter-Varietal Applied Corpus Studies (IVACS) Research Centre at Mary Immaculate College, University of Limerick, Ireland. She is co-editor of the Routledge Handbook of Corpus Linguistics and co-editor of Routledge Applied Corpus Linguistics series.

OTHER TITLES IN THIS SERIES

Corpus Linguistics for Grammar Christian Jones and Daniel Waller

Corpus Linguistics for ELT Ivor Timmis

Corpus Linguistics for Translation and Contrastive Studies Mikhail Mikhailov and Robert Cooper

Corpus Linguistics for Vocabulary Pawel Szudarski

Corpus Linguistics for Pragmatics Christoph Rühlemann

More information about this series can be found at

Corpus Linguistics for Pragmatics

A guide for research

Christoph Rühlemann

First published 2019 by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
711 Third Avenue, New York, NY 10017

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2019 Christoph Rühlemann

The right of Christoph Rühlemann to be identified as author of this work has been asserted by him in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
Names: Ruhlemann, Christoph, author.
Title: Corpus linguistics for pragmatics : a guide for research / Christoph Ruhlemann.
Description: Abingdon, Oxon ; New York, NY : Routledge, 2019. | Series: Routledge corpus linguistics guides | Includes bibliographical references and index.
Identifiers: LCCN 2018017407 | ISBN 9781138718746 (hardback) | ISBN 9781138718784 (paperback) | ISBN 9780429451072 (e-book)
Subjects: LCSH: Corpora (Linguistics) | Pragmatics.
Classification: LCC P99.4.P72 R835 2019 | DDC 401/.45—dc23
LC record available at

ISBN: 978-1-138-71874-6 (hbk)
ISBN: 978-1-138-71878-4 (pbk)
ISBN: 978-0-429-45107-2 (ebk)

Typeset in Times New Roman by Apex CoVantage, LLC

Visit the eResources:

To my sons Lionel and Ricardo


Contents

List of figures
List of tables
Acknowledgements


CL and pragmatics – an introduction
1.1 Corpora and corpus linguistics
1.2 Pragmatics
1.3 Corpus pragmatics
1.4 Chapter structure
1.5 A note on BNC transcripts and BNCweb
1.6 How to get registered for BNCweb
1.7 Working with BNCweb


CL and speech acts
2.1 Introduction
  2.1.1 Structure of speech acts
  2.1.2 Performative/constative dichotomy
  2.1.3 Form-function mismatch
  2.1.4 Searle’s (1976) taxonomy of illocutionary acts
  2.1.5 Indirect speech acts
  2.1.6 What motivates indirect speech acts?
2.2 Focus: Corpus research on the speech act expression ‘Why don’t you’
2.3 Task: Exploring Why not + V speech acts in BNCweb
2.4 Further exercises
  2.4.1 Different speech acts performed by the same utterance: the case of “oh I don’t know”
  2.4.2 Comparing speech act expressions: ‘Can I’ vs. ‘Can you’-formatted speech acts
  2.4.3 Comparing speech acts: Ferguson/Missouri, August 9, 2014




CL and deixis
3.1 Introduction
  3.1.1 Deixis and reference
  3.1.2 The deictic origo
  3.1.3 Deictic projection
  3.1.4 Deictic fields
3.2 Focus on social deixis and short-term diachronic change
3.3 Task: Deictic projection in the use of constructed dialog
3.4 Further exercises
  3.4.1 Deixis and reference patterns of the definite article ‘the’
  3.4.2 Deictic proximity manipulation in ‘wondered/was wondering if’-formatted requests
  3.4.3 Deictic anchoring



CL and pragmatic markers
4.1 Introduction
  4.1.1 Keyness and frequency in conversation
  4.1.2 Functions
  4.1.3 Positioning
4.2 Focus on acoustic properties of ‘well’
4.3 Task: ‘BE like’ in COCA
4.4 Further exercises
  4.4.1 Diachronic change in the use of pragmatic marker ‘well’ in journalistic writing
  4.4.2 Canonical ordering in clusters of pragmatic markers
  4.4.3 ‘Well’ in news broadcasts



CL and evaluation
5.1 Introduction
  5.1.1 Pervasiveness of evaluation
  5.1.2 Evaluation in storytelling
5.2 Focus on evaluative prosody
5.3 Task: Functions of ‘tails’
5.4 Further exercises
  5.4.1 Investigating evaluative prosodies of ‘BUILD up’
  5.4.2 Exploring ‘good’ synonyms and ‘bad’ synonyms
  5.4.3 Evaluation in storytelling




CL and conversational structure
6.1 Introduction
  6.1.1 Turn
    Turn preface
    Turn-constructional unit (TCU) and transition-relevance place (TRP)
    Transition space
  6.1.2 Sequence
  6.1.3 Preference
6.2 Focus on backchannels in storytelling sequences
6.3 Task: Turn openers and turn prefaces
6.4 Further exercises
  6.4.1 Co-constructed turns
  6.4.2 Delayed responses
  6.4.3 Overlapped tag questions



CL and multimodality
7.1 Introduction
7.2 Focus on multimodality in storytelling
7.3 Task: Climacto-telic crescendo: the role of intensity in climax projection
7.4 Further exercises
  7.4.1 Mimicry in conversation
  7.4.2 Gazing away: the role of non-participant-directed gaze in storytelling



Concluding remarks




Figures

2.1 Restricting the range of spoken texts to the demographically sampled subcorpus in BNCweb
3.1 Unanchored deictic reference
3.2 Left panel: percentage uses of deontic and, respectively, epistemic ‘must’ in TIME Corpus according to Millar (2009); right panel: frequencies per million words of ‘must’ (in either sense); shaded gray is the WW2 period from 1939 to 1945
3.3 Frequencies per million words of select ‘responses’, ‘hesitators’, and ‘pragmatic markers’ in the TIME corpus
4.1 Distribution of ‘well’ and any other words across the nine word slots in nine-word turns (additional 60 durations of ‘well’ performing a quote-marker function in the Narrative Corpus not included)
4.2 Durations of ‘well’ by six different functions
4.3 Querying COCA for quotative ‘BE like’
4.4 Frequencies per million words of predicative-‘well’ in the TIME corpus
6.1 Turn lengths in 40,000 turn sample from the conversational subcorpus of the BNC
6.2 Pre-starts and post-completers in random sample of 1,000 ten-word turns from conversational subcorpus of the BNC: pre-starts solid lines, TCU white, post-completers dotted lines
6.3 Turn structure
6.4 Proportions of backchannel (BC) overlap durations against the durations of the turn in the clear in the Narrative Corpus (N = 223 turns; durations measured in Audacity)
6.5 Overlap duration as a function of turn-in-the-clear duration (N = 223 turns; durations measured in Audacity)
6.6 Backchannel (BC) response time in the Narrative Corpus
6.7 AntConc screenshot
7.1 Myers’ hand gesture while saying “He was nominated”
7.2 Still 1 X0.1 X↓0.4 X0.1 uhm well
7.3 Still 2 XL1.3 when we came back (0.3)
7.4 Still 3 X→1.3 X0.1 (0.3) u::hm (0.2) the
7.5 Still 4 XR1.1 day after I arrived
7.6 Still 5 X↓0.3 (0.6)
7.7 Still 6 XR1.6 uhm his best friend was getting married
7.8 Still 7 X0.3 XL1.5 and he was [his best] man
7.9 Gazes to Rico and Lio by story component in storytelling sequence “Virginia Tech”; gazes to Lio represented by full dots; gazes to Rico represented by empty dots; dotted line: regression line indicating the overall trend of the participant-directed gaze durations
7.10 Intensity in “Drained canal”; solid line: piecewise regression segments; dotted line: break point
7.11 Screenshot of the Objects and Sound windows in Praat
7.12 Gaze directions by Sandra in “Virginia Tech”


Tables

1.1 Twelve concordance lines for the verbal lemma INFER from the BNC
1.2 Top six most frequent verbal collocates of the verbal lemma INFER (L3-R3)
2.1 Functional profiles of Suggestion-WDY v. Question-WDY
2.2 Selected concordance lines illustrating forms of SAY to introduce reported Suggestion-WDY
2.3 Left (L1-L3) collocates of WDY in the BNC-C (ordered by collocate frequency)
2.4 Selected concordance lines of WDY followed by ‘just’
3.1 Layout of coding sheet for “Women problems”
3.2 Percentage use of reference patterns in four registers (according to Biber et al. 1999: 266)
3.3 Select concordance lines for ‘was wondering if’ and ‘wondered if’ from the BNC
4.1 Top 20 keywords in demographically sampled spoken subcorpus (C) against the whole of the written component (W) of the BNC
4.2 Utterance-initial words in the spoken component of the BNC
5.1 Frequency list of adjectives in the conversational subcorpus of the BNC
5.2 Ten instances of phrasal verb ‘SET in’ from the BNC
5.3 Top ten nouns collocating with ‘BREAK out’ (L1-L3)


A monograph is credited to only one author but in fact many people deserve credit for having helped the author directly or indirectly. This book is no exception. The first two people I am grateful to are Mike McCarthy (the current co-editor of the ‘Corpus Guides’ series) and Ron Carter (the previous co-editor of the series) who invited me to write this book when I least expected it but most needed an uplift. I also owe thanks to Mike and his now co-editor, Anne O’Keeffe, for their forgiving review of the manuscript. Thanks also go to my former students at Paderborn University who had to ‘sit through’ (cf. Chapter 5 for the implications of this expression) my early attempts at teaching corpus pragmatics. I am also indebted to colleagues for their assistance with corpus queries and syntax, most notably Sebastian Hoffmann of Trier University, who helped with the CQP syntax, and Mark Davies, the creator of the Brigham Young suite of corpora, who provided tips for queries in COCA. Also, I am grateful to Elliott Hoey at University of Basel, who gave me permission to use his photograph of what he nicely called ‘adventures in deixis’ in Chapter 3. Moreover, the editorial and production team at Routledge deserves a great thank you for their utter professionalism, which secured a smooth sailing from manuscript to book. Finally, thanks are due to my wife Andrea. After a good quarter century of marriage we still talk to each other. That talk, embedded in its deep intimate background, provides arguably the richest resource a linguist can dream of for getting closer to unlocking the mysteries of speech act and implicature. With that much assistance from outside, the only thing I can lay claim to as being entirely my own are the errors, omissions, and weaknesses of this book.

Chapter 1

CL and pragmatics – an introduction

1.1 Corpora and corpus linguistics

It has become somewhat fashionable in linguistics and related disciplines to assert that one’s research is based on a corpus. Sometimes, though, the term ‘corpus’ refers to “simply an electronically stored, searchable collection of texts” (Jones & Waller 2015: 5). Such a collection is, strictly speaking, not a corpus (Biber 1993). A corpus is defined by a number of criteria. It is typically a large computerized collection of texts ranging from, say, 100,000 words to trillions of words (more on corpus size later). It contains naturally occurring language rather than ‘edited’ language. It is most often annotated in some form, be it part-of-speech (PoS) tagging or some other type of markup (see below in this section). Most importantly, it is, or aims to be, representative of a language or language variety. This last point is critical. A language (variety) as a whole – termed ‘population’ in statistics – will always exceed the bounds of any corpus, whatever its size; one necessarily has to content oneself with a sample of that population. If the aim is to make valid generalizations from the sample to the population, the sample should be ‘representative’ – that is, it should include “the full range of variability in a population” (Biber 1993: 243). The variability of language, however, is a nightmare: it is not only infinite in its potential to create and integrate new forms (newly coined words, unheard-of sentences, unusual uses of words and sentences, etc.) but also infinite in its historical dimension (it existed before the sampling and will likely exist thereafter), and it is infinite in its social variation (different social groups talk differently, different social situations require different talk, etc.). Clearly, ‘the full range of variability’ can never be established with complete confidence.
The quest for representativeness thus resembles, as Leech (2007) noted, the quest for the holy grail – an ideal that will never be reached. It is nonetheless an ideal worth pursuing and one which has been pursued. For example, the conversational subcorpus of the British National Corpus (BNC), from which most of the illustrative examples in this book are drawn and on which the bulk of the exercises are based, has been constructed with this aim in mind. The constructors deployed ‘demographic sampling’, a sampling approach well known in sociological research. In this approach, “[r]epresentativeness is achieved by a spread of language producers
in terms of age, gender, social group, and region, and recording their language output over a set period of time” (Crowdy 1995: 225). Thus, the roughly 4.2 million words assembled, transcribed, and annotated in that subcorpus reflect, or ‘represent’, the language use in conversational interaction by roughly balanced cross sections of young and old speakers, men and women, blue-collar and white-collar workers, and so on. The effort to achieve representativeness may be the key reason why “no spoken corpus since the Spoken BNC1994 has equalled its utility for research” (Love et al. 2017: 324). Its representativeness also distinguishes the BNC from a number of other spoken corpora, including the recently created successor corpus, the BNC2014, a large spoken corpus of 11.5 million words; its creators employed an opportunistic approach to data collection: the priority in collecting the data “seems to have been to collect as much data as possible and to accept the consequent imbalances in the corpus across the demographic categories” (Love et al. 2017: 326).

Corpus linguistics, henceforth CL, is a relatively recent method in linguistics. The first electronic corpus was the Brown Corpus, a one-million-word corpus compiled in the 1960s aiming to represent a range of written genres (Francis & Kučera 1964). Computer technology has since made quantum leaps, facilitating the creation of more, bigger, and more diverse linguistic corpora. A non-exhaustive list of corpus types includes (i) general corpora, aiming to reflect a language in its entirety (e.g., the Cambridge International Corpus, which has led to the creation of Carter & McCarthy’s [2006] corpus-based reference grammar; cf. also Biber et al.’s [1999] seminal corpus grammar based on the 40 million word Longman Spoken and Written Corpus); (ii) specialized corpora tailored to a specific variety of the language, for example, the Michigan Corpus of Academic Spoken English (MICASE), capturing spoken language in academic situations (e.g., Maynard & Leicher 2007); (iii) dynamic corpora, which are updated regularly (e.g., the Corpus of Contemporary American English (COCA) (Davies 2009)); (iv) learner corpora targeting language produced by non-native speakers, for example, the International Corpus of Learner English (ICLE) containing essays written by French, Swedish, and German learners of English; (v) comparable corpora, for example, the family of International Corpus of English (ICE) corpora, which each consist of one million words sampled from different regional varieties of English; and finally, (vi) multimodal corpora, containing not only transcriptions of speech but also records of nonverbal behavior (cf. Chapter 7).

A further distinction is between raw text corpora (such as the web-as-corpus) and annotated corpora. Corpus annotation refers to “the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data” (Garside et al. 1997: 2; emphasis in original). By far the most widely used type of annotation is part-of-speech (PoS) tagging, an automatic process whereby each word token is assigned to a grammatical word class depending on the co-text in which it occurs. A small number of corpora, such as the ICE-GB, the British component of the ICE family, are ‘parsed’, that is, automatically
segmented “into constituents, such as clauses and phrases” (Hunston 2002: 19). Even smaller is the number of corpora with phonetic, semantic, discourse, and pragmatic annotation; the latter is found, for example, in MICASE (Maynard & Leicher 2007), the Narrative Corpus (Rühlemann & O’Donnell 2012), and SPICE Ireland (Kallen & Kirk 2012). Pragmatic annotation is mostly implemented manually; but see, for example, Weisser (2015) for an attempt at semiautomatic annotation. Manual annotation has the advantage that complex non-surface phenomena can be captured reliably but the disadvantage of being resource-intensive and therefore feasible only in small, specialized corpora.

The main benefit of annotation (of any kind) is that it frees the researcher from having to search for surface forms and instead allows searching for (more abstract) patterns. PoS tagging, for example, allows you to search for lexico-grammatical patterns, such as lemmas, that is, any morphological realization of a head word. To illustrate, let us assume you are interested in the verb INFER (as will become more obvious later, this choice is not without reason: how we infer meanings in contexts is the very stuff of pragmatics). If you are interested in how the verb INFER as such is used, rather than in one specific form, it would be cumbersome to perform separate searches for each possible form: ‘infer’, ‘infers’, ‘inferring’, and ‘inferred’. Instead, by using information stored in the tag, you can search a corpus for any and all instantiations of the lemma. Annotation really becomes powerful when searching for pattern combinations. For example, the simplest form of pattern combination is to compute collocates. That is, to stick to INFER, you can search for word forms or lemmas frequently co-occurring with the verb (see below in this section). Or, in SPICE Ireland, where, inter alia, prosody, speech acts, and pragmatic markers are marked up, you can search for co-occurrence patterns of certain speech acts with certain markers spoken with a certain tone. Markup also includes meta-information related, for example, to speakers’ social characteristics. Thus, linguistic patterns or pattern combinations can be searched for, targeted to certain social groupings. For example, in the Narrative Corpus, which has annotation for quotatives, constructed dialog, and participant role, you can look for direct speech that is introduced by a specific quotative verb (such as ‘said’ or ‘goes’), that starts with a specific word/lemma/PoS-tag, contains a specific word/lemma/PoS-tag combination, and that is used as a response to another speaker’s storytelling by working class women in their sixties located in the Midlands.

Corpora thus make it possible to address questions that, without recourse to corpora, cannot even be asked. It is therefore no surprise that corpora have seen applications in a wide range of linguistic disciplines, including lexicography, grammar, discourse analysis, sociolinguistics, language teaching, literary studies, translation studies, forensics, and pragmatics (see McCarthy & O’Keeffe 2010 for an overview). It is also no surprise that some observers speak of a ‘corpus revolution’ (e.g., Crystal 2003: 448). The revolution is made possible by the ever-growing processing power of modern computers enabling researchers to scour ever larger and ever more complex data and to see “patterns emerge that could not be seen
before” (Tognini Bonelli 2010: 18). The impact of the revolution has been felt most dramatically in the study of what Sinclair (1991) termed the ‘idiom principle’, demonstrating that lexis and grammar interact in fundamental ways and calling into question the long-held categorical distinction between grammar and lexis (Sinclair 2000).

As noted already, a defining feature of corpora is scale. Corpora range from relatively small, specialized corpora with fewer than a million words to mega-corpora of more than a billion words (e.g., the Cambridge International Corpus) to the web-as-corpus, which has trillions of words (and counting) (e.g., Hundt et al. 2007). Corpus-linguistic methodology is adapted to ‘big data’. The favored methodology is ‘vertical reading’ (Tognini Bonelli 2010); it can be applied to data of any size. The most typical incarnation of the vertical-reading methodology is the key word in context (KWIC) method, also referred to as concordance line display. Corpus software, instructed to search for a specific item, ‘drills’ through all texts in the corpus searching for that item, yanks out any occurrence of the searched-for item, and displays it in the center of the concordance line along with limited amounts of co-text to either side. For illustration, consider randomly selected concordance lines for the verbal lemma INFER from the BNC:

Table 1.1  Twelve concordance lines for the verbal lemma INFER from the BNC

 1         off!’ and he done so’. As can be  inferred  from the above account there is no
 2      know that p implies q, allows us to  infer     that a does not know that p. It seems
 3      of diatoms. The authors’ ability to  infer     water chemistry quantitatively from
 4  difficulty in assuming that they could  infer     confused thinking from the observation
 5  case studies, in prospect it has to be  inferred  by those most closely connected with
 6          employment and that it could be  inferred  that the defendants had used those
 7          were to ask each what he or she  infers    from the term enrichment in this
 8        many years, may have been able to  infer     what kinds of buildings may have stood
 9         consent may be either express or  inferred  from a course of dealing. Evidence of
10      a different course, the court would  infer     that he had no good reason and that he
11      Polybius and Panaetius. This can be  inferred  from the line which Diodorus takes in
12      they did when they were expected to  infer     failure. Dominant experimenters can
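The KWIC procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way BNCweb or any other corpus tool works internally: the three input fragments are copied from Table 1.1, the tokenizer is a crude regular expression, and the forms of INFER are listed by hand because the sketch has no lemma or PoS annotation. The names `kwic` and `format_kwic` are hypothetical helpers introduced for this example.

```python
import re

# Three fragments copied from Table 1.1; a real study would read corpus files instead.
FRAGMENTS = [
    "As can be inferred from the above account there is no",
    "the court would infer that he had no good reason and that he",
    "what he or she infers from the term enrichment in this",
]

# Surface forms of the lemma INFER, listed by hand because this sketch
# has no lemma or PoS annotation to rely on (an assumption of the example).
INFER_FORMS = {"infer", "infers", "inferring", "inferred"}


def kwic(texts, node_forms, width=4):
    """Return (left co-text, node, right co-text) for every hit of a node form."""
    hits = []
    for text in texts:
        tokens = re.findall(r"\w+", text)  # crude tokenizer: runs of word characters
        for i, token in enumerate(tokens):
            if token.lower() in node_forms:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


def format_kwic(hits):
    """Right-align the left co-text so that the nodes line up vertically."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  {node}  {right}" for left, node, right in hits]


concordance = kwic(FRAGMENTS, INFER_FORMS)
for line in format_kwic(concordance):
    print(line)
```

Aligning the node in a fixed column is what makes the display readable ‘vertically’: the eye can run down the node column and scan the co-text on either side for recurrent patterns.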


What the researcher can do with a concordance is scan it “for the repeated patterns present in the context of the node” (Tognini Bonelli 2010: 19). For instance, note just two such patterns. One obvious pattern is that the node, in this case forms of INFER, is repeatedly followed by the preposition ‘from’ (concordance lines 1, 4, 7, 9, 11), that is, the head of a prepositional phrase indicating the kind of evidence on which the inference was made. Another pattern is that INFER is preceded by expressions of modality, mostly (semi-)modal verbs (concordance lines 1, 4–6, 8–11). This observation could be taken to suggest that INFER, as a cognitive process of concluding from some sort of evidence, is fraught with uncertainty. This hypothesis can be tested by yet another vertical-reading method: collocation analysis. That is, corpus software is instructed to drill, within a given ‘span’ or ‘window’ of, say, three words on the left and three words on the right of the node, through all texts containing the node, record which words occur within the window, compute how often they do, and display the co-occurrence as a frequency table on the screen. The top six verbal collocates of INFER in the BNC are given in Table 1.2, ordered by their log-likelihood value:

Table 1.2  Top six most frequent verbal collocates of the verbal lemma INFER (L3-R3)

Rank  Verbal collocate  Freq in whole corpus  Expected collocate freq  Observed collocate freq  Log-likelihood
1     be                649884                24.83                    197                      478.8024
2     can               231452                8.843                    132                      470.9763
3     may               112397                4.294                    42                       116.4873
4     might             59026                 2.255                    22                       60.8323
5     could             159818                6.106                    23                       27.2867
6     is                990191                37.831                   72                       24.6076
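The counting step of a collocation analysis, recording which words fall within a window of three words to the left and right of the node, can be sketched as follows. This is a toy illustration, not the procedure BNCweb implements: the four fragments are copied from Table 1.1, the function name `window_collocates` is hypothetical, and the sketch counts all word forms in the window, whereas Table 1.2 is restricted to verbal collocates identified through PoS tags.

```python
from collections import Counter
import re

# Four fragments copied from Table 1.1 (toy data, not a corpus sample).
FRAGMENTS = [
    "As can be inferred from the above account there is no",
    "employment and that it could be inferred that the defendants had used those",
    "Polybius and Panaetius. This can be inferred from the line which Diodorus takes in",
    "difficulty in assuming that they could infer confused thinking from the observation",
]

# Hand-listed forms of the lemma INFER (assumption: no lemma annotation available).
INFER_FORMS = {"infer", "infers", "inferring", "inferred"}


def window_collocates(texts, node_forms, left=3, right=3):
    """Count the word forms occurring within the L/R window around each node hit."""
    counts = Counter()
    for text in texts:
        tokens = [t.lower() for t in re.findall(r"\w+", text)]
        for i, token in enumerate(tokens):
            if token in node_forms:
                counts.update(tokens[max(0, i - left):i])      # up to `left` words before the node
                counts.update(tokens[i + 1:i + 1 + right])     # up to `right` words after the node
    return counts


collocate_counts = window_collocates(FRAGMENTS, INFER_FORMS)
for word, freq in collocate_counts.most_common(5):
    print(word, freq)
```

The raw counts obtained in this way correspond to the ‘observed collocate frequency’ column; on their own they say nothing yet about how strongly a collocate is associated with the node.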

The collocation analysis fully confirms our hypothesis: INFER is typically preceded by modal verbs evaluating the inference in question as, ultimately, uncertain: as a possible conclusion, not an inevitable one (see Chapter 5 for more on modality and evaluation). One more technical aspect is worth pointing out in Table 1.2. Considering the form ‘is’ ranked sixth, the observed collocate frequency, shown in column five, is 72 – hence greater than the observed collocate frequency of ‘may’ (42), ‘might’ (22), and ‘could’ (23). Why is ‘is’ ranked lower than collocates that accompany the node less frequently? The reason is that ‘is’, with almost a million tokens, is far more frequent in the corpus as a whole. The statistical odds that it will co-occur with the node by chance are therefore greater than for ‘may’, ‘might’, and ‘could’, each of which is far less common. As a result, the association between ‘is’ and INFER, shown in the log-likelihood column, is weaker than for the modal verbs. (Log-likelihood is one of several measures of collocational strength; see Hoffmann et al. [2008: Chapter 8] for an accessible description.)
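The contrast between raw frequency and association strength can be checked directly against the figures in Table 1.2. The sketch below does not recompute log-likelihood, which would require the node frequency and corpus size; instead it uses the ratio of observed to expected frequency, a much simpler stand-in chosen for this illustration only. All numbers are copied from Table 1.2; the variable names are hypothetical.

```python
# Rows copied from Table 1.2:
# (collocate, freq in whole corpus, expected freq, observed freq, log-likelihood)
TABLE_1_2 = [
    ("be", 649884, 24.83, 197, 478.8024),
    ("can", 231452, 8.843, 132, 470.9763),
    ("may", 112397, 4.294, 42, 116.4873),
    ("might", 59026, 2.255, 22, 60.8323),
    ("could", 159818, 6.106, 23, 27.2867),
    ("is", 990191, 37.831, 72, 24.6076),
]

# Ranking by raw observed co-occurrence frequency (column five).
by_observed = [row[0] for row in sorted(TABLE_1_2, key=lambda r: r[3], reverse=True)]

# Ranking by the log-likelihood values reported in the table.
by_log_likelihood = [row[0] for row in sorted(TABLE_1_2, key=lambda r: r[4], reverse=True)]

# Observed/expected ratio: how many times more often the collocate appears
# near the node than chance alone would predict (a simplification, not log-likelihood).
ratios = {word: observed / expected for word, _, expected, observed, _ in TABLE_1_2}

print("by observed frequency:", by_observed)
print("by log-likelihood:   ", by_log_likelihood)
for word, ratio in ratios.items():
    print(f"{word:>5}: observed/expected = {ratio:.1f}")
```

On raw frequency ‘is’ comes third, yet it has the lowest observed/expected ratio of the six collocates, which is in line with its last place in the log-likelihood ranking. The two measures need not agree in every detail; the point is only that both correct for the overall frequency of the collocate.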


As can be gleaned from this discussion, CL involves working with frequencies and statistics; indeed, as pointedly asserted by Gries (2009: 11), “strictly speaking at least, the only thing corpora can provide is information on frequencies”. On this view, corpus linguistics is essentially a quantitative discipline. The contrast with pragmatics, essentially a qualitative discipline, could hardly be starker.

1.2 Pragmatics

In the 1980s, Leech (1983: 1) wrote:

    The subject of ‘pragmatics’ is very familiar in linguistics today. Fifteen years ago it was mentioned by linguists rarely, if at all. In those far-seeming days, pragmatics tended to be treated as a rag-bag into which recalcitrant data could be conveniently stuffed, and where it could be equally conveniently forgotten.

Pragmatics does not only deal with recalcitrant data. It represents a recalcitrant discipline in itself, as it incorporates the most recalcitrant influence on language and meaning: the speaker. Speakers do not normally talk to themselves, so taking the speaker into account requires taking the hearer into account as well (cf. Bublitz & Norrick 2011: 4). Speakers and hearers do not talk in vacuo with one another, so taking the speaker and the hearer into account requires taking the larger context in which they talk into account as well: the situation. Pragmatics, then, is “concerned with meaning in relation to a speech situation” (Leech 1983: 15; original emphasis), or, simply, with “how language is used in communication” (Leech 1983: 1). The notion of communication casts the net wide, indeed far wider than the confines of language, as successful communication can be much more than the words uttered (as in indirect speech acts) or even without any words (as in sign language or nonverbal pointing). The kind of meaning pragmatics is concerned with is, then, sharply distinguished from the two other core ‘dimensions of semiosis’ that Morris (1938) distinguished – syntax (the relation of signs to one another) and semantics (the relation of signs to the objects they denote). All three dimensions deal with meaning but foreground different aspects of it. Syntax looks into the interaction of grammatical meanings such as tense, aspect, number, and so on, that create well-formed sentences. Semantics is concerned with meaning as residing in words, phrases, and sentences in abstraction from their use in context.
Pragmatics is interested in the creation and interpretation of meaning in situations. The notion of situation in which utterances are produced and processed is a Pandora’s box containing a large number of ‘messy’ contextual variables. A non-exhaustive list includes the sequential context (the utterances that went before an utterance and that the utterance is a response to, and also the utterances that can be expected to follow), the activity context (the recognizable activity the speaker and the hearer are engaged in), the spatiotemporal context (coding time and place as well as receiving time and place of the utterance), the multimodal context (the speaker’s bodily conduct into which the utterance is
integrated), the intentional context (what the speaker intends to say in making an utterance, which may often not be read off the surface structure of the utterance), the emotive context (the speaker’s involvement with the entity the utterance is about), the epistemic context (the almost infinite range of the speaker’s and the hearer’s knowledge), and the social context (the power or role relationship that holds between speaker and hearer). Importantly, the context, in all or any of its facets, as ‘con-text’, may not be manifest in what is said. Although absent from the linguistic message, the context still influences how the message is processed as a communication-in-context. For example, if you receive an invitation for dinner “tonight”, you will infer that the invitation is for the evening of the same day as when the invitation was made (cf. Chapter 3). If the president of the United States informs an FBI director that he hopes the FBI director “can let this go”, with “this” referring to an ongoing investigation into the president’s possible collusion with a foreign power, it is hard not to interpret this statement as intended to influence the FBI director (cf. Chapter 2). If you propose marriage to your partner, and the response is delayed, you will interpret the gap as foreshadowing trouble (cf. Chapter 6). If you ask someone a question, the response will be “faster when the question has a gestural component than when it does not” (Holler et al. 2017; cf. Chapter 7). If a white police officer approaches two African-Americans walking on the street shouting at them, “Why don’t you guys walk on the sidewalk?”, the increased intensity may easily block interpreting the utterance as a suggestion and instead indicate the interpretation as an aggressive command (cf. Chapter 2). Pragmatics is, thus, concerned with how what is said relates to what is not said but communicated anyway through the context. 
As defined by Mey, pragmatics is “the art of the analysis of the unsaid” (Mey 1991: 245) or, as Yule noted, “the study of how more gets communicated than is said” (Yule 1996: 3).

1.3  Corpus pragmatics

Pragmatic research, concerned with the interplay of the said and the unsaid, has traditionally been strictly qualitative, based on careful horizontal reading of (very) small amounts of texts in their contexts. Since CL typically works vertically and with big data, it is not surprising that pragmatics and CL were for a long time regarded “as parallel but often mutually exclusive” (Romero-Trillo 2008: 2). In recent years, however, corpus linguists and pragmaticists have discovered common ground, paving the way for the advent of the new field of corpus pragmatics, as evidenced in the publication of a number of edited collections (e.g., Felder et al. 2011; Taavitsainen et al. 2014; Aijmer & Rühlemann 2015) and a new journal, aptly titled Corpus Pragmatics. Corpus pragmatics makes use of the best of two worlds: the vertical-reading methodology of CL (instructing computer software to plough through myriads of text samples in search of occurrences of a target item) integrated into the horizontal-reading methodology of pragmatics (weighing and interpreting individual occurrences within their contextual environments).

The two complementary methodologies can be integrated in two complementary approaches to data analysis: form-to-function and function-to-form. The form-to-function approach is based on the observation that while the unsaid is often not expressly encoded in the said, there are still ‘footprints’ of it – indices pointing to what is unsaid. These indices may be of any semiotic variety – verbal, vocal, or gestural (cf. Chapter 7). Researchers can use them to ‘hook’ out potential instances of the said-unsaid interplay. For example, the pragmatic marker ‘well’ used utterance-initially often acts as a ‘warning particle’ (Levinson 2013: 108) projecting a response that is in some way in disagreement with the course of action suggested by the prior utterance (cf. Chapter 4). A  researcher can search for utterance-initial occurrences of ‘well’, define a manageable random subset, discard unwanted hits (‘noise’), and investigate, for example, what (sub-)types of the speech act of disagreement ‘well’-prefaced utterances perform. The form-to-function approach is probably the most widely used approach in corpus pragmatics. One of its downsides is the lack of ‘recall’. For example, disagreement can be expressed without any ‘well’-prefacing. Thus, while a search for utterance-initial ‘well’ may achieve very high ‘precision’, effectively retrieving all instances of disagreements prefaced by ‘well’ (as well as noise), it may perform poorly in terms of ‘recall’ – all the disagreements without ‘well’ are overlooked (for a discussion of precision and recall, see Hoffmann et al. [2008: 77–79]). The function-to-form approach takes the inverse direction, starting from a function and investigating the forms used to perform it. 
This approach underlies, for example, Garcia McAllister (2015): the author used a bottom-up method by performing “a line-by-line reading of the corpus conversations to identify speech acts within Searle’s speech act categories (i.e., directives, commissives, expressive, etc.) as they occurred in context” (Garcia McAllister 2015: 34). Based on this methodology, subcorpora for different speech act functions can be defined and searched for lexico-grammatical and other contextual association patterns. Another example of the function-to-form approach is the Narrative Corpus: all the texts in the corpus were horizontally read to identify and annotate instances of constructed dialog. Thus, the corpus offers a tag for the function ‘constructed dialog’. Corpus users can invoke the function by its tag and examine how the function is realized, inquiring, for example, whether the lexical inventory of constructed dialog differs from the inventory of non-constructed dialog, or whether constructed dialog that is introduced by a quotative verb differs from constructed dialog not introduced by such a verb, and so on (cf. Chapter 3). Obviously, the function-to-form approach is far from perfect too, its major disadvantage being that it is resource-costly and therefore amenable to small corpora only. Whatever the approach, and however imperfect, corpus pragmatics does cut new paths into the jungle of human communication, illuminating some of the complex ways in which we, as speakers, entangle the said with the unsaid and how we, as listeners, disentangle the two.
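
As a rough illustration of the form-to-function workflow described in this section, the following Python sketch searches a handful of invented utterances for utterance-initial ‘well’, draws a random subset for manual inspection, and computes precision and recall against hand-assigned labels. The utterances, the labels, and the names UTTERANCES, INITIAL_WELL, search, and precision_recall are all hypothetical. Precision and recall are computed with the standard information-retrieval definitions, which is an assumption about the sense intended here. The sketch makes no claim about how BNCweb or any other corpus tool works internally.

```python
import random
import re

# Toy data: (utterance, hand-assigned label "is this a disagreement?").
# Both the utterances and the labels are invented for illustration only.
UTTERANCES = [
    ("Well, I don't think that's right.", True),
    ("Well, we could go on Friday.", False),   # 'well' without disagreement: noise
    ("No, that's not what happened.", True),   # disagreement without 'well'
    ("Well I'm not so sure about that.", True),
    ("She sings very well.", False),           # adverb, not utterance-initial
    ("I disagree, actually.", True),
    ("Yes, lovely.", False),
]

# Form-based query: 'well' as the first word of the utterance.
INITIAL_WELL = re.compile(r"^well\b", re.IGNORECASE)


def search(utterances):
    """Return the (text, label) pairs matched by the form-based query."""
    return [(text, gold) for text, gold in utterances if INITIAL_WELL.search(text)]


def precision_recall(hits, utterances):
    """Precision: share of hits that are disagreements.
    Recall: share of all disagreements that the query retrieved."""
    true_hits = sum(1 for _, gold in hits if gold)
    all_targets = sum(1 for _, gold in utterances if gold)
    precision = true_hits / len(hits) if hits else 0.0
    recall = true_hits / all_targets if all_targets else 0.0
    return precision, recall


hits = search(UTTERANCES)

# Draw a manageable random subset for manual inspection (seeded for repeatability).
rng = random.Random(42)
sample = rng.sample(hits, k=min(2, len(hits)))

precision, recall = precision_recall(hits, UTTERANCES)
print(f"hits={len(hits)} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the query returns three hits, two of which are disagreements, while two further disagreements lack ‘well’ and are therefore not retrieved. In real research the labels would of course come from the analyst's horizontal reading of each hit in context, not from a pre-labelled list.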

1.4  Chapter structure

This volume aims to provide an accessible, practical guide to corpus pragmatics for undergraduate and postgraduate students. Some chapters will also provide food for thought for seasoned pragmaticists unfamiliar with the corpus method. Taking a hands-on approach, the book will devote large sections to practical applications. The book is also accompanied by a companion website where data for the practical assignments can be accessed. The book is divided into eight chapters; Chapters 1 and 8 introduce and, respectively, round off the volume while Chapters 2–7 examine applications of CL to core pragmatic areas of research. These latter chapters share the same fourfold structure:

• Introduction: The first subsection provides an introduction aiming both to elucidate the pragmatic concept(s) in question and to survey existing corpus-linguistic work in the area.
• Focus: The second subsection aims to explore and illustrate in good detail one specific research question in the area.
• Task: The third subsection is devoted to a practical task to be carried out by the reader; the relevant research background as well as the research question(s) to be addressed will be explained in good detail and the methodology to be used will be carefully described; some tasks will be based on corpora or corpus interfaces freely available on the internet, such as BNCweb and the corpora of the BYU suite; others will be based on specially prepared data which can be accessed via the companion website.
• Further Exercises: The fourth subsection contains brief descriptions of further tasks; again, if necessary, data to be used will be made available on the companion website.

1.5  A note on BNC transcripts and BNCweb

Most of the examples used in this book for illustration are taken from the British National Corpus (BNC), a 100-million-word corpus from the 1990s, which is probably the most widely used corpus resource worldwide. More specifically, most examples are from a subcorpus of the BNC, the so-called ‘demographically-sampled’ subcorpus, consisting of informal conversation (Crowdy 1995). The reason why the book relies so heavily on this one resource is not only its “great utility and longevity in linguistic research” (Love et al. 2017: 322). There are three more specific reasons.

First, unlike most other general corpora, the audio files from which the corpus transcriptions were made have been made available in the Audio BNC (Coleman et al. 2012).1 They can now be accessed for free online. The benefits of having access to the audio files for pragmatic research cannot be overstated. As will be argued throughout the book, but specifically in Chapter 7, human communication is multimodal, drawing in intricate ways not only on verbal but also on vocal and gestural semiotic resources. While the gestural resources are still out of reach in the BNC, the vocal resources speakers deploy can now be ‘heard’ and their contribution to how the speakers in the BNC communicate verbally can be assessed and appreciated.

Second, based on the audio files, the BNC transcripts can be critically examined for transcription errors and omissions, and corrected accordingly. What is more, the transcripts can be enriched by adding paralinguistic details that are salient in the interaction based on the auditory evidence, including, for example, modulations in voice quality, changes in intensity, shifts in pitch, variations in speed of delivery, and so on. Further, characteristics of timing and sequencing can be determined: pauses can be (re-)measured, latchings can be observed, and overlap can be ascertained. In other words, the availability of the audio files facilitates a transcription that by far exceeds the original ‘enhanced orthographic’ transcription (Crowdy 1994: 25) of the BNC and that is much more in line with the kind of transcription common in Conversation Analysis: a transcription that follows Jeffersonian standards and conventions (e.g., Jefferson 2004) and is “detailed enough to facilitate the analyst’s quest to discover and describe orderly practices of social action in interaction” (Hepburn & Bolden 2013: 58). The bulk of the examples presented in this book represent such Jeffersonian re-transcriptions of the original orthographic BNC transcripts (indicated by “corrected transcription”; in the few cases where no such indication exists, the example was taken from written texts, or from spoken texts for which no audio is available). A glossary of the transcription symbols used is given in the Appendix to this chapter.

The third reason is intimately related to the second. The audio files can be accessed via BNCweb (Hoffmann et al. 2008). BNCweb is a free online interface for the BNC that reconciles user-friendliness with an amazing richness of corpus-linguistic functionality. Also, BNCweb allows users to perform queries, from simple to highly sophisticated, and to inspect hits in the context of extended transcripts while listening to them. BNCweb thus represents the perfect resource for the practical assignments that are a key component of this book.

1.6  How to get registered for BNCweb

Registration for BNCweb is quick and easy:

1  Go to the BNCweb registration page. Alternatively, use a web browser to search for “registration BNCweb”.
2  Under First time users on the left click on Register for an account on the right.
3  Fill in the required information; then click on Register.
4  After registration, you will receive an email to confirm your input. The email includes a link; click it to complete your registration.
5  Keep a record of your BNCweb access details so you can retrieve them easily!

1.7  Working with BNCweb

Care has been taken in this book to describe the steps involved in the practical assignments in great detail and with sufficient clarity. Where this attempt may have failed or, more ideally, where readers wish to pursue their own research projects in BNCweb, there are two resources that may be of help. First, note the link on the BNCweb starting page to the Simple Query Syntax help. This is a concise summary of the Simple Query Syntax, addressing all major syntactic elements and giving illustrative examples. Second, readers are referred to Hoffmann et al.’s (2008) immensely useful book on BNCweb, which not only describes the BNC and the functionalities of BNCweb in very good detail but also provides an accessible overall introduction to CL.


Glossary of transcription conventions

Category              CA symbol       Description
Sequential aspects    [ ]             overlapping speech
                      =               one turn latched on to next turn with less-than-usual or no gap at all
Temporal aspects      (.) or (1.2)    short or longer pause
Phonological aspects