Corpus Applications in Applied Linguistics 9781441107800, 9781472541611, 9781441184382

Corpus linguistics is one of the most exciting approaches to studies in applied linguistics today. From its quantitative


English Pages [273] Year 2012


Table of contents :
Title Page
Copyright
Contents
Contributors
Part One: Corpora in Applied Linguistics
Chapter 1: Introduction
References
Part Two: Corpora and Institutional Uses of Language
Chapter 2: Professional Communication and Corpus Linguistics
Background
A Methodology for Analyzing Professional Corpora
A Study of Construction Industry Discourse
Directions for Teaching and Training
Conclusion
Notes
References
Chapter 3: Corpora and Academic Discourse
Academic Corpora Research: Directions and Findings
Genres are Structured to be Persuasive
Written and Spoken Academic Language Vary
Academic Conventions Differ Across Language Groups
Texts are Discipline Specific
Academic Persuasion Involves Interpersonal Negotiations
An Illustrative Example: Gender in Book Reviews
Conclusion
References
Chapter 4: Corpora and Workplace Discourse
Introduction
Overview of Relevant Corpora
Lexical Features
Phraseology: Frequent Chunks
Pragmatics Features
Summary and Conclusion
Future Directions
Notes
References
Part Three: Corpora in Applied Linguistics Domains
Chapter 5: Corpora and Translation Studies
Research and Translator Training
Corpora and Descriptive Studies of Translation
Corpus Methods in Translation Studies
Types of Corpora
Descriptive Research Endeavours
Current Descriptive Research
Translator Training
The Relationship Between Description and Pedagogy
References
Chapter 6: Some Aspects of the Use of Corpora in Forensic Linguistics
Case Background
Sampling Issues
Was There More Than One Anonymous Author?
Author Profiling by Corpus
Known and Questioned Registers
The Occurrence of Identical Expressions
Authorship Conclusion Based on Corpus Study
The Use of Corpora in Legal Language Studies
The Evolution of Legal Terminology
Conclusion
References
Cases cited
Chapter 7: Corpora and Gender Studies
Introduction
Usage
Representation
Metrosexual
Critical Considerations
References
Chapter 8: Corpora and Media Studies
Introduction: Getting Started
What can a Corpus do for the Study of Media Discourse?
Keywords
Word Frequency Lists
Concordances
Conclusion
References
Part Four: Corpora in New Spheres of Study
Chapter 9: Corpora and English as a Lingua Franca
Corpora and Linguistic Varieties
ELF and Availability of Data
ELF Corpora
Methodological Challenges
ELF Findings: Forms and Functions
Applied Linguistic Implications
References
Online Resources
Chapter 10: Texting and Corpora
Introduction
Overview of Text Message Corpora
Corpora and Respelling
Extending the Focus: Beyond Respelling
Challenges Posed by Text Message Corpora
Conclusion
Notes
References
Corpora Freely Available
To Buy
Chapter 11: A Conceptual Model for Segmenting and Annotating a Documentary Photograph Corpus (DPC)
Introduction: What is a Documentary Photograph Corpus and Why?
Human-Computer Image Processing Compared
Segmenting and Annotating Documentary Photographs
Corpus Linguistics And Ontology: A Theoretical Exploration
Conclusion: Application Prospective
Notes
References
Part Five: Corpora, Language Learning and Pedagogy
Chapter 12: Learner Corpora and Second Language Acquisition
Introduction
Learner Corpus Research: Developments and Contributions
Learner Language as a Dynamic and Independent Language System
Research Implications
An Illustrative Study: Development of Phraseological Competence
Prospects and Challenges
References
Chapter 13: Corpora in the Classroom: An Applied Linguistic Perspective
Introduction
Concept of the Lexical Approach
Vocabulary Approaches
Functional Approaches
Genre-Based Approaches
Relationship Between DDL and Language Learning Theories
Case Study: Integrating Corpus Consultation in a Report-Writing Course
Conclusion and Future Pathways
References
Chapter 14: Corpora and Materials Design
Introduction: The Emergence of Corpus-Informed Language Teaching
Unearthing the Evidence: Initial Analysis of Corpora
Pedagogical Mediation of Conversational Data
Modelling the Data: The Basics of Conversation
Conclusion
Notes
References
Afterword: The Problems of Applied Linguistics
References
Subject Index
Author Index


Corpus Applications in Applied Linguistics

Also available from Continuum

Contemporary Corpus Linguistics (edited by Paul Baker)
Corpus-Based Approaches to English Language Teaching (edited by Begona Belles-Fortuno, Maria Lluisa Gea-Valor and Mari Carmen Campoy)
Corpus Linguistics in Literary Analysis (Bettina Fischer-Starcke)
Corpus Linguistics and World Englishes (Vivian de Klerk)
Corpus Stylistics in Principles and Practice (Yufang Ho)
Text, Discourse and Corpora (Michaela Mahlberg, Michael Stubbs, Wolfgang Teubert and Michael Hoey)

Corpus Applications in Applied Linguistics

Edited by Ken Hyland, Chau Meng Huat and Michael Handford

Continuum International Publishing Group
The Tower Building, 11 York Road, London SE1 7NX
80 Maiden Lane, Suite 704, New York NY 10038
www.continuumbooks.com

© Ken Hyland, Chau Meng Huat, Michael Handford and Contributors 2012

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: HB: 978-1-4411-0780-0
e-ISBN: 978-1-4411-8438-2

Library of Congress Cataloging-in-Publication Data
Corpus applications in applied linguistics/edited by Ken Hyland, Chau Meng Huat, and Michael Handford.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4411-0780-0 -- ISBN 978-1-4411-8438-2 (ebook (pdf))
1. Applied linguistics. 2. Corpora (Linguistics) I. Hyland, Ken. II. Chau, Meng Huat. III. Handford, Michael, 1969-
P129.C65 2011
418--dc23
2011035036

Typeset by Deanta Global Publishing Services, Chennai, India



Contents

Contributors

Part One: Corpora in Applied Linguistics
1. Introduction, by Ken Hyland, Chau Meng Huat and Michael Handford

Part Two: Corpora and Institutional Uses of Language
2. Professional Communication and Corpus Linguistics, by Michael Handford
3. Corpora and Academic Discourse, by Ken Hyland
4. Corpora and Workplace Discourse, by Almut Koester

Part Three: Corpora in Applied Linguistics Domains
5. Corpora and Translation Studies, by Sara Laviosa
6. Some Aspects of the Use of Corpora in Forensic Linguistics, by John Olsson
7. Corpora and Gender Studies, by Paul Baker
8. Corpora and Media Studies, by Anne O’Keeffe

Part Four: Corpora in New Spheres of Study
9. Corpora and English as a Lingua Franca, by Barbara Seidlhofer
10. Texting and Corpora, by Caroline Tagg
11. A Conceptual Model for Segmenting and Annotating a Documentary Photograph Corpus (DPC), by Gu Yueguo

Part Five: Corpora, Language Learning and Pedagogy
12. Learner Corpora and Second Language Acquisition, by Chau Meng Huat
13. Corpora in the Classroom: An Applied Linguistic Perspective, by Lynne Flowerdew
14. Corpora and Materials Design, by Michael McCarthy and Jeanne McCarten

Afterword: The Problems of Applied Linguistics, by Susan Hunston

Subject Index

Author Index



Contributors

Paul Baker, Reader in Corpus Based Discourse Studies (University of Lancaster, UK)
Chau Meng Huat, Fellow (University of Malaya, Malaysia)
Lynne Flowerdew, Senior Lecturer (Hong Kong University of Science and Technology, China)
Michael Handford, Associate Professor in English Language and Intercultural Communication (Tokyo University, Japan)
Susan Hunston, Professor of English Language and Head of School of English, Drama and American & Canadian Studies (University of Birmingham, UK)
Ken Hyland, Professor of Applied Linguistics and Director of the Centre for Applied English Studies (The University of Hong Kong, Hong Kong)
Almut Koester, Senior Lecturer (University of Birmingham, UK)
Sara Laviosa, Senior Lecturer (Università di Bari, Italy)
Jeanne McCarten, Co-author of bestseller Touchstone (Cambridge University Press)
Michael McCarthy, Emeritus Professor of Applied Linguistics (University of Nottingham, UK)
Anne O’Keeffe, Senior Lecturer (University of Limerick, Ireland)
John Olsson, Academic Associate (Bangor University, Wales) and Director of the Forensic Linguistics Institute
Barbara Seidlhofer, Professor of English and Applied Linguistics and Head of Department of English Studies (University of Vienna, Austria)
Caroline Tagg, Lecturer in Applied Linguistics (University of Birmingham, UK)
Gu Yueguo, Research Professor of Linguistics and Head of the Contemporary Linguistics Department (Chinese Academy of Social Sciences, China)


Part One

Corpora in Applied Linguistics


Chapter 1

Introduction Ken Hyland, Chau Meng Huat and Michael Handford

The impact of corpora in the field of applied linguistics over the past twenty years has been enormous, transforming both how we understand and how we study language across a range of different areas. Corpora bring an empirical dimension to language studies, which was not possible before, or not possible to the same extent, allowing researchers to replace intuitions, strengthen interpretations, reinforce claims and generally talk about language with greater confidence. While corpora themselves do not provide new information about language, the ability to draw on large samples of naturally occurring data allows analysts to offer enhanced descriptions of language and how it works, bring previously unnoticed features to ‘our attention’ and extend theory-building based on a growing awareness of lexical and grammatical patterning. The pace of this change has also been considerable, accelerated by the growing accessibility and processing capabilities of computers, such that researchers are now able to routinely refer to corpora as a methodological technique without the need of elaborate justification or recourse to specialists. The present volume, Corpus Applications in Applied Linguistics, aims to capture something of these changes, focusing on some of the most exciting and recent developments that corpus research has brought to applied linguistics. Comprising a collection of specially commissioned chapters by leading researchers in a range of fields, the volume covers the more traditional areas of applied linguistics, such as SLA, professional discourse and the development of teaching materials, and extends to more emergent areas, such as texting, image corpora and English as a lingua franca. 
Our intention, through these chapters, is to show something of the importance and diversity of corpus research to applied linguistics, illustrating its methodologies, particularly its value to approaches as diverse as critical discourse analysis, pragmatics and organizational studies, and suggesting

how corpus data can illuminate our understanding of particular, multifaceted contexts. The volume consists of thirteen chapters and an Afterword. Each chapter addresses a specific topic and presents a synthesis of both past and current research on the topic. Wherever relevant, a study or series of studies is provided to illustrate the points raised in the chapter. The thirteen chapters are further divided into four sections. 1. Corpora and institutional uses of language 2. Corpora in Applied Linguistics domains 3. Corpora in new spheres of study 4. Corpora, language learning and pedagogy. The first section, Corpora and institutional uses of language, comprises three chapters, which explore corpus approaches to institutional and professional discourse. It begins with Chapter 2 by Michael Handford, who considers the role of corpora to inform research and pedagogy/training in professional communication contexts. Handford shows how corpora meet the growing need for authentic materials to assist professionals in developing their spoken discourse skills for business meetings. Example exercises are presented from a corpus-informed textbook for business professionals and students as well as results from an on-going research project that explores communication in the international construction industry. One important finding is that second language (L2) professional interactions may not, at least at the lexical level, be very different after all from native speaker professional interactions. This, as Handford notes, supports Firth’s (2009) call for more corpora of professional international communication. In Chapter 3, Ken Hyland highlights how corpus studies have brought a quantitative dimension to the study of academic discourse and made a considerable contribution to our understanding of different disciplinary practices. 
He first reviews relevant findings grouped under five main headings before illustrating a corpus study of how gender and discipline influence how writers use evaluative features in book reviews. A central argument throughout the chapter is that a range of features have been found to occur and behave in dissimilar ways in different disciplinary environments, indicating that academic discourse is more of a myriad of texts differing across contexts than a single monolithic entity. This, in turn, underlines the importance of community, context and purpose in writing which, as Hyland observes, has helped inform EAP course design and teaching.



The last chapter of the first section is by Almut Koester, who looks at corpus research in workplace discourse. Workplace discourse is a subset of professional discourse, which deals with intra-organizational or internal communication between employees within the same company. Koester illustrates how corpus analysis of workplace discourse based on lexis, chunks and pragmatic features exhibits distinctive characteristics of institutional discourse, which distinguish it from everyday conversation. She further notes that while there is also a high frequency of interactive and interpersonal features found in workplace interactions, the interpersonal dimension here is exploited to maintain and reinforce workplace relationships, hence distinct in a number of ways from social or intimate interactions. The next section, Corpora in Applied Linguistics domains, begins with Sara Laviosa’s chapter on corpora and translation studies. She first reviews how corpora were introduced into translation studies in the early 1990s before explaining corpus methods in translation studies, the types of corpora used and how corpus studies of translation have contributed to the advancement of Descriptive Translation Studies. Laviosa also suggests that the concerns of descriptive corpus studies of translation have varied considerably since the turn of the millennium, with increasing attention paid to the regularities in translational practices, the particulars of culture-specific phenomena and the investigation of Anglicisms. She further considers examples of applied research, showing how the insights of relevant corpus studies have been fruitfully explored in a student-centred pedagogical environment, before concluding with predictions about the way in which corpora are likely to influence the study, the teaching and the practice of translation in the future. 
John Olsson, in Chapter 6, examines how three areas of forensic linguistics  – investigations into authorship, the development of legal language and the exploration of forensic social history – have benefited from the use of corpora. In highlighting the work of Coulthard (1994) in the Derek Bentley case to show the pioneering nature of corpus studies in the forensic context, Olsson also discusses a case study to illustrate a corpus approach to authorship investigation. In this study, he drew on several corpora (including the use of Google) and studied a number of rare phrases that occurred across (general and target) texts. He concludes the chapter by noting how increasingly larger corpora and the added benefit of internet databases allow for meaningful searches into language use which, in turn, is important to explore such issues as the extent to which idiolect plays a role in authorship identification. In Chapter 7, Paul Baker considers how a corpus approach can benefit gender-based research. He begins by noting and discussing two strands of

research in gender studies: one concerned with gendered usage of language, which involves answering research questions comparing male and female speech, while the other looks at research on linguistic representation of gender, which involves issues surrounding discourse and ideology. The case study presented in the chapter focuses on the latter. In the study, Baker shows how a range of corpus tools can be used to uncover representations of metrosexuals in British newspapers, which reveals contradictions in representations and supports Pompper’s (2010) finding that men find the concept confusing and frustrating. He then concludes with a critical examination of potential problems with a corpus approach while maintaining that corpora have a great deal to offer gender research. Anne O’Keeffe, in Chapter 8, concludes the second section by focusing on the use of corpora in media studies. She observes how most studies of media discourse up until the year 2000 are based on small amounts of data that did not need much recording, transcription or scanning time, and notes that although assembling spoken corpora of media discourse remains a challenging task, it is now relatively easy to collect and build a corpus of media data such as newspapers. Drawing on the BBC 1 Panorama television interview by Martin Bashir of Diana, Princess of Wales, and several other corpora, O’Keeffe illustrates how the use of a range of corpus techniques can shed light on the study of media discourse, before concluding the chapter by suggesting the immense potential corpora have in the study of language in the media, both quantitatively and qualitatively. The next three chapters address Corpora in new spheres of study and consider areas encompassing English as a lingua franca (ELF), texting and still images in the form of documentary photographs. Barbara Seidlhofer, in Chapter 9, begins the section by examining the bilateral relationship between corpora and ELF. 
In Seidlhofer’s words, the ‘methodology of corpus linguistics can contribute to an understanding of ELF by providing the means for describing the form it takes. But equally the nature of ELF as a use of language can prompt an appraisal of this methodology as an approach to linguistic description’. She discusses how the empirical study of ELF as a linguistic resource, employed as a common means of communication among speakers of ‘different linguacultural backgrounds’, is beginning to be promoted by the availability of corpora such as VOICE and ELFA. She further considers methodological challenges and findings that relate to ELF before concluding the chapter with implications for applied linguistic research and practice. In Chapter 10, Caroline Tagg focuses on corpus research into text messaging. She first reviews this new area of research, situating the



discussion within the general field of computer-mediated discourse. She then moves on to consider what Sebba (2007) has termed ‘respelling’, which refers to the spelling of words in unconventional ways indexing social identities, as well as other features of the language of text messaging. Throughout, Tagg shows how the study of text message corpora can uncover meaningful patterns in the way words are spelt and provide evidence of creative language use. There are, of course, challenges in corpus-based text message research such as message length, corpus size, identification of stable word forms and handling of multilingual data, and these are all considered before Tagg concludes by discussing future developments in corpus studies of text messaging. In Chapter 11, the final chapter in this section, Gu Yueguo describes another new research area in the field – the study of image corpora. In the chapter, Gu presents a conceptual model for segmenting and annotating an image corpus comprising several million documentary photographs, offering both a methodology for doing so and an argument for the benefits of extending corpora beyond texts. As Gu points out, images of various types such as satellite images, CT and fMRI images in the hospital, traffic surveillance images and CCTV images for security in supermarkets have been an asset for humankind, and issues for their effective and efficient retrieval and maintenance have become increasingly important. Mobile phones, iPads and PDAs also have image sending capabilities which contribute to the growing importance of digital images in our lives. More immediate pedagogic implications have also been noted, ranging from enhancing vocabulary learning to forming part of teaching aids for implementing classroom tasks. The fourth and final section of the volume consists of three chapters addressing Corpora, language learning and pedagogy.
Chau Meng Huat, in Chapter 12, highlights the contribution of learner corpora to the study of L2 acquisition and the challenges to be addressed in corpus studies of learner language. Chau observes how, in the light of emergent views of learner language as both a dynamic and independent language system, a range of issues involving research agenda, methods and assumptions are in need of reconsideration. He also illustrates how a corpus approach to studying learner language can uncover key processes underlying L2 development, before concluding the chapter by noting positive trends of development in learner corpus research and pointing out challenges in researching L2 learning in the complex, changing worlds of language learners. Lynne Flowerdew, in Chapter 13, considers the use of corpora in the classroom. Following Römer (2010), she distinguishes between direct and

indirect corpus applications and focuses on the former in this chapter. In reviewing how different hands-on corpus activities and concordance-based worksheets have been used in the classroom, Flowerdew guides the reader through phraseological, functional-notional and genre-based approaches to instruction. She then discusses the value of data-driven learning in relation to sociocultural, cognition-based and constructivist theories of learning. She moves on to illustrate how a course on report writing has drawn on theories of language and language learning in its delivery using a corpus approach before considering future pathways in direct applications of corpora in the classroom. The final chapter by Michael McCarthy and Jeanne McCarten focuses on corpora and materials design. McCarthy and McCarten observe how initial negative or suspicious views of using native speaker corpora to inform teaching materials have changed in the first decade of this century, with an increasing number of products for language learning now advertising themselves as corpus-informed. They illustrate one such project in the chapter – the creation of the four-level adult North American English course, Touchstone. They show how the fully corpus-informed syllabus initially involves extensive research using various analysis tools before the mediation process begins to transform spoken corpus evidence into interesting and relevant materials for learners. Overall the volume strongly communicates the vibrancy and diversity of the use of corpora in applied linguistics. While the authors do not speak as a group, a recurrent theme in their work is that corpus approaches will, in tandem with other methods, continue to play a vital role in applied linguistic research. 
The discussion presented in the various chapters also offers openings into understanding such familiar notions as language use and language learning in new ways made possible by a heightened sensitivity to multiple contexts of interpretations in applied linguistic endeavours. The volume closes with an Afterword by Susan Hunston who seeks to draw together trailing threads and make sense of some of the connections the chapters raise between corpus analyses and applied linguistics. In that chapter, she addresses what our authors have to say about the ways corpus studies embody contexts and writes about the difficulties of collecting and analyzing corpora, about methodologies and the role of interpretation and about the application of corpora to real-world problems. Most importantly, however, she makes the point that the chapters in this volume contribute to a more general re-interpretation of applied linguistics itself. One important connection is that corpus studies help move applied linguistics beyond a ‘discipline that solves real-world problems’ to a discipline that ‘expands our



understanding of the world mediated through language’. We hope that readers, through this book, are able to share some of the excitement of this enterprise.

References

Firth, A. (2009). Doing not being a foreign language learner: English as a lingua franca in the workplace and (some) implications for SLA. International Review of Applied Linguistics in Language Teaching, 47 (1): 127–56.
Römer, U. (2010). Using general and specialized corpora in English language teaching: Past, present and future. In M. Campoy-Cubillo, B. Bellés-Fortuño & M. Gea-Valor (eds.), Corpus-based Approaches to English Language Teaching, pp. 18–38. London: Continuum.
Sebba, M. (2007). Spelling and Society: The Culture and Politics of Orthography around the World. Cambridge: Cambridge University Press.


Part Two

Corpora and Institutional Uses of Language


Chapter 2

Professional Communication and Corpus Linguistics Michael Handford

Background

This chapter will explore how corpora of professional discourse can inform the two main branches of applied linguistics: research into language use and pedagogy/training. This will involve outlining a methodology that has been used with different corpora of professional communication, followed by a presentation of results from a study of communication in the international construction industry. Finally, the chapter will briefly discuss teaching and training implications of corpus findings and will present example exercises from a corpus-informed textbook aimed at business professionals and students. First though, we can ask, ‘What is professional discourse?’. Koester’s chapter in this collection explores ‘workplace discourse’, that is, communication between employees within the same company or, in other words, intra-organizational or internal communication. Professional discourse is a broader category: it includes external (inter-organizational) as well as internal discourse, and also communication between professional people and lay people (such as doctor–patient communication). Professional communication is contrasted here with academic discourse, although both are types of institutional discourse and as such are often contrasted with everyday communication (Holmes, 2000; Heritage, 1997). As is generally the case in other areas of linguistics, most of the research on professional discourse has explored written texts, for example, Bhatia (1993, 2004) and Bargiela-Chiappini and Nickerson (1999) as well as collections that include academic and professional studies, such as Christie and Martin (1997) and Candlin and Hyland (1999), although none of these collections are specifically concerned with corpus linguistics. Connor and

Upton (2004), however, is a collection of chapters that utilize corpus linguistic methods to analyze both spoken and written professional discourse. More recently, researchers at the Hong Kong Polytechnic University have compiled several online corpora of professional written texts (http://rcpce.engl.polyu.edu.hk/default.htm; Cheng, 2010). There has also been interdisciplinary research in professional written health communication that employs corpus linguistic techniques at Nottingham University (e.g. Harvey et al., 2008). The corpus in question comprises requests from adolescents to a UK-based website that provides health advice for young people. Using the BNC as a reference corpus, several keywords were identified within the healthcare corpus and then categorized according to theme. Several insights were provided by expert informants, indicating the importance of contextual understanding in interpreting findings from professional corpora. The research has also fed back into the professional health community through the publication of the findings in a health journal (Harvey et al., 2008). Research into spoken professional discourse, on the other hand, is particularly difficult to conduct, mainly because of the issues in obtaining permission to record authentic data and then the demands of transcribing that data. Nevertheless, two notable collections are Drew and Heritage (1992) and Sarangi and Roberts (1999). Specifically, corpus-informed work on professional spoken discourse includes a growing body of work on internal and external business communications (Nelson, 2000; Koester, 2004a, 2006, 2010; McCarthy and Handford, 2004; Poncini, 2007; Handford, 2010a; Handford and Koester, 2010). In terms of teaching and training materials that deal with spoken professional discourse, there is a dearth of materials that use authentic texts (Bargiela-Chiappini et al., 2007), let alone insights from corpus linguistics.
Some exceptions are The Language of Work (Koester, 2004b), Grammar for Business (McCarthy et al., 2009) and the Business Advantage series (Handford et al., 2011; Koester et al., 2011; Lisboa and Handford, 2012), an extract from which will be explored in the final section.

A Methodology for Analyzing Professional Corpora

Flowerdew (2005) argues that corpus linguistics is moving in two directions: the development of larger and larger mega-corpora (for instance the 500-billion-word Google corpus of books: http://ngrams.googlelabs.com/) and an increasing number of specialized corpora. The former are inevitably highly decontextualized, and are useful for quantitative comparisons and studies. For example, according to the Google corpus, the idiomatic



expression have an issue with was not used before the 1880s and is now over eight times more frequent compared to just thirty years ago. Specialized corpora, which are generally less than a million words and sometimes much smaller, tend to benefit from contextual information to allow for interpretation of the genres in question through a combination of quantitative and qualitative approaches (Handford, 2010b). Contextual information is important because professional discourse occurs among groups of people who work together, often over extended periods of time. Such discourse is characterized by particular knowledge and practices, which may not be immediately evident to the researcher. For instance, one logistics director in a multinational pharmaceutical company whose meeting was recorded for the CANBEC corpus and was then interviewed several times (see Handford, 2010a: 64–65) stated that logistics managers need two types of knowledge: content knowledge of the products, equipment, etc. used in logistics, and process knowledge of the way things are done. While the former tends to be shared across different pharmaceutical organizations, process knowledge may be quite different in different organizations. Practices make up processes and are the shared social actions that members of a particular community employ, usually automatically (Bhatia, 2004), to get things done. As an analyst, working to develop an understanding of professional practices and processes is therefore invaluable in understanding what is going on in the discourse, and combining emic and etic interpretations can lead to a much richer analysis than that a merely decontextualized approach would allow. 
Methodologically, this means that corpus linguistics can be combined with discourse analysis (see Baker, 2006) and other approaches such as intercultural analysis (Spencer-Oatey and Franklin, 2009) and linguistic anthropology (Ochs, 1996) to understand the recurrent language and practices in the specific professional context. This section will outline one such methodology, and the following section will explore the findings it provides. The data, a sub-corpus of interactions between engineers and foremen of various nationalities in the construction industry in Hong Kong, is part of an on-going project to develop a corpus of professional international English spoken communication (see Handford and Matous, 2011). The Hong Kong data (hereafter HK data) is made up of audio and video recordings, interviews with the participants, expert-informant insights and the researchers’ field notes. The quantitative step in the methodology is to produce statistically significant ‘keywords’ (Scott, 1999), that is, items with significantly greater or lesser than random frequency compared to some norm (in the form of a


larger reference corpus). In this case, the two spoken reference corpora are the 2.7 million-word social and intimate sub-corpus of CANCODE (the Cambridge and Nottingham Corpus of Discourse in English), hereafter referred to as SOCINT, and the 900,000-word meeting sub-corpus of CANBEC (the Cambridge and Nottingham Business English Corpus). The rationale for using these corpora is that, unlike large publicly accessible corpora such as the BNC (British National Corpus), they are purely spoken, thus providing a more focused comparison with the spoken HK data; in addition, SOCINT is one of the largest corpora of social and intimate (everyday) communication, and CANBEC is the largest corpus of authentic unscripted business interaction to date. Both corpora have been developed by Nottingham University and Cambridge University Press and are copyrighted by Cambridge University Press. The corpus directors are Professors Ronald Carter and Michael McCarthy. For further details on the corpora, see McCarthy (1998) and Handford (2010a), respectively, or visit www.cambridge.org/corpus. By comparing the HK data to these two spoken corpora at the lexicogrammatical level, we can begin to see how our data compares to everyday talk and business-meeting talk. Once quantitative searches have been conducted, selected items can be studied in longer extracts. Employing quantitative techniques allows us to achieve a level of objectivity and replicability that is arguably lacking from many discourse analyses (Baker, 2006). It is worth noting that frequent items are not important per se: unusual items can have considerable force, frequent items can be uninteresting and the absence of something can be extremely important. Nevertheless, as a first step, quantitative analyses of fully transcribed complete encounters allow for the texts to be prised open with an attractive degree of interpretative objectivity.
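The keyword comparison described above can be illustrated with a minimal sketch. It assumes Dunning-style log-likelihood as the keyness statistic, which is one measure commonly used for this purpose; the chapter does not state which test was applied, so this is an illustration of the general technique and not a reconstruction of the study's procedure. The function names, the threshold of 3.84 (the 5% critical value for one degree of freedom) and the toy token lists are all assumptions introduced here.

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) for one item observed in a study and a reference corpus."""
    combined = freq_study + freq_ref
    total = size_study + size_ref
    # Expected counts if the item were equally likely in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Return (item, G2, sign) for items whose keyness reaches the threshold.

    '+' marks items relatively more frequent in the study corpus,
    '-' marks items relatively less frequent (negative keywords).
    """
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    results = []
    for item in set(study) | set(ref):
        g2 = log_likelihood(study[item], n_study, ref[item], n_ref)
        if g2 >= threshold:
            sign = "+" if study[item] / n_study > ref[item] / n_ref else "-"
            results.append((item, round(g2, 2), sign))
    return sorted(results, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Invented toy data: 'need' is relatively far more frequent in the study corpus.
    study_corpus = ["need"] * 20 + ["the"] * 80
    reference_corpus = ["need"] * 5 + ["the"] * 195
    for row in keywords(study_corpus, reference_corpus):
        print(row)
```

In practice the token lists would come from the transcribed interactions and the reference corpus, and items that pass the threshold would then be examined in context, since statistical keyness only flags candidates for interpretation.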
An example

The methodology used here has been more fully developed and explored in Handford’s study of the CANBEC corpus of business meetings (2010a, Chapter 2) and is outlined in the following steps; the analysis is cyclical, moving between the text and the context to understand the meanings, goals, practices, identities and structures the participants have constructed through their talk (Gee, 2005).

Step 1: collect and transcribe relevant textual and ethnographic data
Step 2: pinpoint potentially important lexicogrammatical items and linguistic/paralinguistic features
Step 3: understand the situated meaning (Gee, 2005) that the chosen item or feature invokes in its specific context
Step 4: infer the practices, goals, socially situated identities and social structures the participants orient to through the discourse this item occurs in
Step 5: develop teaching or training materials based on the above insights

The first step is self-explanatory. The second step pinpoints potentially important items and features, such as single words, clusters, turn-taking features and non-verbal behaviour. This second step will be demonstrated further in the following section, as it is probably of most interest to corpus linguists. Steps three and four of the methodology move beyond quantitative and descriptive analyses and involve interpreting the corpus data through reference to the various sources collected during the study, such as interviews with the participants and discussions with expert informants as well as the extracts featuring the items. Evaluation of the data is the eventual goal, which can inform training of the inexperienced engineers as well as allow for a critical understanding of the context we have researched. As such, the analysis moves from description to interpretation and finally towards evaluation, which Candlin asserts should be the aim of professional discourse analysis (Candlin, 2002). While discourse analysis has been criticized for reductively arguing that an item is important because the researcher says it is important (see Widdowson, 1998), corpora allow for such criticisms to be avoided. This is because of the corpus linguistic tenet that frequency and recurrence are important in language (Stubbs, 2007) and the corpus can show which items recur. In this methodology, items are potentially important if

1. they are frequent
2. they are statistically significant (compared to some norm)
3. they are stylistically salient/culturally key
4. they have been shown to be important in other, related studies

The first two types of analysis can be conducted quantitatively using corpus software, whereas pinpointing the last two involves looking for pre-determined items and features (Stubbs, 2001; Tribble, 2002), such as metaphors or turn organization (see Handford and Koester, 2010). The challenge of working with professional discourse is to understand the link between the text and the context (Charles, 1996), thereby understanding what actions, or practices, people are performing with


language. The argument here is that linguistic items and features have particular meanings in specific situations because they invoke recognizable practices, and corpus linguistics can be used to pinpoint them. In other words, an item in a particular context has a situated meaning that reflexively constitutes particular discursive, professional and/or social practices. For instance, in Extract 1, a senior engineer (‘Jimmy’, male, from Hong Kong) is advising his colleague (‘Alie’, male, from Sierra Leone) about the need to check the site more regularly. In a follow-up interview, Jimmy said that more checking was needed because the position of a big tree, which could not be cut down, and a retaining wall meant that progress with that part of the project was difficult.

Extract 1
Jimmy: so I maybe . . . err . . . um . . . you should go there from time to time
Alie: yah
Jimmy: maybe once a week or something you need
Alie: yah
Jimmy: to get familiar to
Alie: yah I see
Jimmy: there’s ongoing changes yeah?

As we will see below, the deontic modal verb need is statistically significant in the HK data when compared to CANCODE and CANBEC, meaning that it is more typical of the HK interactions than of everyday communication or business meetings. A practice-based interpretation of this is outlined in Figure 2.1, with the arrow indicating the widening social context when one moves further from the localized text (Bhatia, 2004). It may be worth saying a little more about practices and corpora. Practices, particularly repeated discursive practices, can be pinpointed through the use of specialized corpora because recurrent items invoke particular

Figure 2.1
Discourse: Being an engineer
Social practice: Managing projects
Professional practice: Ensuring on-site changes are checked and controlled
Discursive practice: Advising
Text: ...you need to get familiar...




discursive practices. For instance, metaphors are often used for evaluation in decision-making discourse. Discursive practices in a professional context are thus the local, goal-driven actions that members of a professional community use to constrain and enable the unfolding discourse through recognized language. Drawing on the work in discourse analysis by Bhatia (2004) and Gee (2005; Gee et al., 1996), this methodology combines social, professional and discursive practices with Gee’s notion of Discourses (bundles of social practices ‘composed of ways of talking, listening, reading, writing, acting, interacting, believing, valuing and using tools and objects, in particular settings and at specific times, so as to display or recognize a particular social identity’ (Gee et al., 1996: 10)) to show how these layers of context relate to the text. This enables us to infer how the interlocutors address their transactional and interpersonal goals through language. The importance of practices to meaningful communication cannot be overstated: meaning is impossible without practices (Gee, 2005: 8), and communities of speakers, such as engineers, can share meaning because they interpret practices and their communicative realization in shared, recognizable ways.

A Study of Construction Industry Discourse

This section will outline some of the corpus findings of the HK data. As mentioned earlier, I am in the process of developing a corpus of international spoken professional communication, with the support of a grant from the Japanese government (see Note 1), and construction-industry discourse will form a significant part of the corpus. There are two reasons for this, one academic and the other practical. First, despite the fact that the construction industry accounts for around 10% of global GDP, there is very little research on spoken interactions in this field (there is, however, powerful research on written engineering and construction discourse coming from the Research Centre for Professional Communication in English at The Hong Kong Polytechnic University). Second, it is a type of professional data I have access to: when making a professional spoken corpus, having personal contacts is often the only way to gain recording permission. The findings discussed in this section draw partly on an article by Handford and Matous (2011), to which the reader is directed for further reference. Dr. Petr Matous is a lecturer in the Department of Civil Engineering at the University of Tokyo, and such interdisciplinary collaboration with a researcher and his colleagues from


the professional field itself has helped develop richer insights than would have otherwise been possible. It also means that there are channels to feed the research findings back to those who could practically benefit from them. After an initial visit to several potential construction sites of large international projects, Handford and Matous spent a week on construction sites for a large tunnel project in Hong Kong, collecting data in the form of audio and video recordings of on-site interactions and interviews with all the key employees on the project. These employees were mainly from Japan, Hong Kong, Sierra Leone and the UK. The main focus of our research was the communication between the Japanese engineers managing the day-to-day work on the project sites and the Hong Kong foremen who managed the contracted staff, although we also recorded interactions between other nationalities. We were also able to discuss the data with expert informants, for instance, a PhD student in Matous’s department who had been working for 20 years for the Japanese construction company heading the project, including several years in Hong Kong. After transcribing all the on-site interactions, we had a small sub-corpus of approximately 13,000 words. Using WordSmith Tools, keyword searches were conducted, using two reference corpora: CANCODE (the Cambridge and Nottingham Corpus of Discourse in English) and CANBEC (the Cambridge and Nottingham Business English Corpus). Furthermore, lists were compiled of the most frequent clusters. In an earlier study comparing business meeting discourse (CANBEC) to everyday spoken discourse (CANCODE) (Handford, 2010a), several interpersonal features were key. These included certain pronouns, back channels, modal verbs, indirect expressions and evaluative nouns. Interestingly, several of the same categories were also key when the HK data was compared to SOCINT, with some of the same items appearing, such as we, need (to), problem and so (see Table 2.1).
Table 2.1  HK vs. SOCINT interpersonal/textual keywords (frequency as a percentage of all words in the HK data; see Note 2)
Back channels: hmm (1.26), yah (0.48), okay (1.07), hai (0.18)
Modal verbs: cannot (0.19), need (0.51), will (0.58)
Pronouns: we (2.57)
Conjunctions: so (1.79), because (0.64)
Hedges, fillers: um (0.18), uh (0.16), eeh (0.14)
Markers: okay (1.07), alright (0.21)
Evaluative nouns: problem (0.18)
Place deictics: this (2.55), here (1.18)




Table 2.2  HK vs. CANBEC interpersonal/textual keywords
Back channels: hmm (1.26), hai (0.18), okay (1.07), yes (0.52), yah (0.48)
Modal verbs: need (0.51), cannot (0.19), will (0.58)
Pronouns: we (2.57)
Conjunctions: so (1.79), because (0.64)
Hedges, fillers: maybe (0.25), um (0.18), uh (0.16)
Markers: okay (1.07), alright (0.21), so (1.79), now (0.51)
Place deictics: this (2.55), here (1.18)

Table 2.3  Categorization of keyword nouns
Site nouns: slope (0.22), boulder (0.18)
Organizational abbreviations: JV (0.05), QS (0.12), DMP (0.11)
Engineering process nouns: excavation (0.05), bricking (0.04), calculations (0.04)
Engineering object nouns: bolt (0.08), concrete (0.07), backhoe (0.09), site (0.13)
General occupation nouns: access (0.24), location (0.09), subcontractor (0.04), area (0.15)
Evaluative nouns: problem (0.18)

When keyword comparisons were run between the HK data and CANBEC, there were some differences from the SOCINT comparison, such as problem not being key against CANBEC. Nevertheless, there was a surprising degree of similarity between the two lists, both in terms of the categories and the actual items (see Table 2.2). One interesting finding was the appearance of place deictics on both keyword lists. By checking the video recordings and our notes, we could also see that such deictics are often accompanied by gestures and the action of drawing pictures. Clarifying seemed to be the discursive practice that was most often invoked by such communication. As might be expected, there were also several nouns that were statistically significant when the HK data was compared with the reference corpora, given that nouns communicate the content of the discourse. A selection of these has been categorized in Table 2.3; some nouns are related to the industry (such as concrete) and the type of project (slope, excavation) whereas others are more generic. Cluster lists were made for the 2-, 3- and 4-word clusters from the HK interactions, and the top ten 2- and 3-word clusters are reproduced in Table 2.4. The frequency of each item is again placed in brackets. Lists of 5-, 6- or 7-word clusters are not provided because the data contained no clusters longer than four words. This might be because of the small amount of data, or perhaps because all the speakers were L2 users of English and had not acquired such items, or, if they had, did not use them. The most frequent clusters are categorized by type in Table 2.5.


Table 2.4  The most frequent clusters in the HK data

Rank  2 word            3 word
1     this is (0.63)    no need to (0.12)
2     and then (0.53)   and then we (0.09)
3     this one (0.53)   this is the (0.08)
4     I think (0.42)    this one is (0.08)
5     need to (0.36)    we need to (0.08)
6     we can (0.23)     I don’t know (0.07)
7     yeah yeah (0.22)  I think this (0.06)
8     have to (0.20)    put it here (0.06)
9     if we (0.17)      add one more (0.05)
10    of the (0.16)     something like that (0.05)

Table 2.5  Categorization of clusters
Deictic expressions: this is, this one, put it here
Hedging: I think, I don’t know
Deontic modals: (we/I) need to, no need to, we can, have to
Vague language: something like that
If expressions: if we
Problem-decision expressions: I agree with you, to make a decision
Combinations (hedging + potential face-threatening act): I think this is er you need to
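Cluster lists of the kind shown in Tables 2.4 and 2.5 can be approximated with a short sketch. It counts contiguous word sequences within each utterance (so that clusters do not run across speaker turns) and reports each cluster's frequency as a percentage of all words in the data, following the convention stated in Note 2. The function names, the lower-casing and whitespace tokenization, and the toy utterances are assumptions made for illustration; they are not the settings used in the study.

```python
from collections import Counter


def clusters(utterances, n):
    """Count contiguous n-word clusters, never crossing utterance boundaries."""
    counts = Counter()
    for utterance in utterances:
        tokens = utterance.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def top_clusters(utterances, n, k=10, min_count=2):
    """Return up to k clusters as (cluster, raw count, % of all words in the data)."""
    total_words = sum(len(u.split()) for u in utterances)
    ranked = clusters(utterances, n).most_common(k)
    return [
        (cluster, count, round(100 * count / total_words, 2))
        for cluster, count in ranked
        if count >= min_count  # a sequence must recur to count as a cluster
    ]


if __name__ == "__main__":
    # Invented toy utterances, loosely modelled on items listed in Table 2.4.
    toy_utterances = [
        "this is the slope",
        "we need to check this one",
        "no need to check",
        "we need to put it here",
    ]
    for size in (2, 3):
        print(size, top_clusters(toy_utterances, size))
```

Requiring a minimum count of two reflects the idea that only recurrent sequences are of interest; with a corpus of around 13,000 words, longer sequences may simply not recur often enough to appear in such a list.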

When analyzing these findings, a high level of overlap with the language and discursive practices found in the CANBEC corpus was evident. While more data is required before drawing any strong conclusions, this does suggest that English L2-L2 professional communication may share many lexical and discursive practices with L1-L1 professional interactions. Given that so much professional discourse is concerned with problem-solving and negotiating relationships and power, this is perhaps not so surprising. At present, these findings, along with longer extracts and some of the interview data, are being used with postgraduate engineering students (Japanese and non-Japanese) to raise awareness of selected aspects of authentic international professional communication, and we are planning to conduct training sessions with Japanese engineers in construction companies later this year. These sessions include raising awareness of the different types of content knowledge as evidenced by the range of nouns, the types of problems that arise on a day-to-day basis on site and the importance of paying attention to interpersonal issues, both at the lexicogrammatical level (e.g. modal verb choices) and in terms of making relationships with others.




Intercultural relationship building on this project was pinpointed as a concern by a senior Hong Kong engineer we interviewed, and, during our time at the site, we noticed that the Japanese engineers very much stuck together and did not socialize with others – hence we have no recordings of the Japanese and non-Japanese eating together at lunch or dinner or socializing. When we interviewed the Japanese about this, they said that they never eat with the foreign colleagues, but always drive off to a restaurant with the other Japanese staff. Issues like this can be used as ‘critical incidents’ for training purposes and also to highlight the limitations of corpus approaches that only explore the recorded texts: the absence of something can be very telling. Corpus linguistics is predicated on the notion that occurrence and recurrence are important (Stubbs, 2007), yet, here, the absence of something is important. Corpus analysts therefore need to be cautious in their interpretations and prepared to temper their conclusions, especially in complex professional contexts. This involves moving beyond what corpora reveal about linguistic patterns of preference to understand the choices people make, or reject, from the alternatives that the context makes available.

Directions for Teaching and Training

As mentioned above, insights and evidence from studies of authentic professional communication, corpus linguistic or otherwise, have not made much impact on teaching and training materials. Indeed, as Hoey notes (2005: 186), teaching materials in general often pay no attention to the lexical language that is common in particular situations, leading to ‘unnatural output’. Bargiela-Chiappini et al. (2007: 9), reviewing business teaching materials, argue that a ‘mismatch continues to exist between teaching materials and real language’. Handford (2010a: 251–252) argues that the contrived and potentially inappropriate language examples and concepts to be found in certain textbooks and training manuals may actually be detrimental to learners’ professional careers. An alternative approach is to base materials on the language and practices pinpointed in authentic discourse and teach the practices as skills (see Nonaka, 1994, on teaching tacit practices as explicit skills). This section demonstrates such an approach by discussing material that teaches the discursive practice of evaluating through metaphors and idioms. The degree to which L2–L2 interactions resemble those in L1–L1 contexts has already been touched on. Even though CANBEC is a corpus of primarily native-speakers’ English, as noted above, deontic choices in the HK data


were similar: need was a keyword and must was not. Furthermore, a manual reading of the HK data showed that metaphors and idioms were used by the L2 engineers in the same way as L1 users of English employ them, that is, to make a comment or to evaluate (Cameron, 2003). For instance, in Extract 2, the foreman (TT) and engineer (Gouda) are talking about how to deal with a box that is repeatedly being vandalized.

Extract 2
TT: you know if he want to damage it even where we pick where we where we where we where we where we aaah heavy steel
Gouda: Hmm
TT: also and (3) but we we we fixed the box on [today’s] err he he he EVEN he want to play with [you] also when he [trouble] just give him trouble back that’s all
Gouda: Hmm

Here we see the foreman using the metaphors play with you and give him trouble back to comment on the behaviour of the vandal and to judge the best way to respond. According to Prodromou (2008), the appropriate use of metaphorical and idiomatic language by L2 users is very unusual, but in the HK data and in CANBEC, there are many instances of L2 users successfully employing this versatile linguistic resource. Indeed, given the prevalence of metaphors and idioms in business meetings and the importance of evaluation and commenting in professional discourse (Handford, 2010a; Handford and Koester, 2010), any L2 user who needs to comment, evaluate and make decisions at work may be at a disadvantage if they cannot use them. Because of this, in the Business Advantage series, metaphors and idioms are taught and discussed in several units, and an example exercise is presented below. The Business Advantage series consistently uses authentic and corpus texts.
All listening and reading texts are taken from authentic business books, magazines, interviews with business professionals or academics, or the 1-billion-word Cambridge English Corpus (CEC; www.cambridge.org/corpus), which includes CANBEC, CANCODE and the Wolverhampton written business corpus as well as several other corpora. Furthermore, the language activities use examples from the above business corpora in the CEC. In each book, there is a ‘business skills’ strand, which teaches several of the most important ‘skills’ as evidenced through research on professional communication, such as using narratives in job interviews (the importance of which is discussed by Roberts and Campbell, 2005), developing relationships with potential clients as well as colleagues, dealing with conflict situations (see Handford and Koester, 2010), international team-building or using tactical summaries (see Charles and Charles, 1999). The listening extracts for these skills are generally based on CANBEC recordings, providing the learner with examples of expert performances of the skill in question.

[Example exercise from Business Advantage, Upper Intermediate (Handford et al., 2011: 43)]

The example exercise is from the Upper Intermediate level of the course (Handford et al., 2011: 43), from a unit that explores decision making. The listening extract, which the listening task refers to, was taken from an inter-organizational CANBEC meeting between two IT companies: an IT sales company and their ISP. The metaphors and idioms in the Language activity are all mined from CANBEC, and the activities are intended to move beyond the typical textbook approach of ‘here are some metaphors and idioms, go and learn them’ and to raise awareness of where in the discourse they are used, why they are used and who they are typically used by. Question #5 also encourages the learners to consider the possible censure of using such items inappropriately, by raising the all-important professional issue of power. Such exercises are intended to develop a ‘critical space’ where learners can reflect on and sometimes critique the norms and practices of their present or future professional environments.

Conclusion

This chapter has explored how corpora of professional communication can inform both research and pedagogy/training. A methodology was demonstrated that can, provided the researchers have sufficient contextual information, be used to unpack transcribed interactions. The methodology shows the potential and the limits of a quantitative approach and the importance of combining corpus techniques with those of other approaches. The third section of this chapter outlined findings from an on-going research project in professional communication in the construction industry, and the final section demonstrated how a corpus can be used to develop educational materials. Several themes and issues were raised in the chapter, such as the extent to which L2 professional interactions compare to L1 professional interactions. There is often an assumption that much L2-L2 communication will differ considerably from L1-L1 communication, but here, it seems that at the lexical and discursive practice levels there are many similarities. Such findings support Firth’s (2009) call for more corpora of authentic professional international communication.




Notes

1 This is a three-year grant provided by the Japan Society for the Promotion of Science.
2 This refers to the frequency of the item as a percentage of all words in the HK data.



References

Baker, P. (2006). Using Corpora in Discourse Analysis. London: Continuum.
Bargiela-Chiappini, F. and Nickerson, C. (eds) (1999). Writing Business: Genres, Media and Discourses. Harlow: Longman.
Bargiela-Chiappini, F., Nickerson, C. and Planken, B. (2007). Business Discourse. Basingstoke: Palgrave.
Bhatia, V. (1993). Analysing Genre: Language Use in Professional Settings. Harlow: Longman.
Bhatia, V. (2004). Worlds of Written Discourse. London: Continuum.
Cameron, L. (2003). Metaphor in Educational Discourse. London and New York: Continuum.
Candlin, C. (2002). Research and Practice in Professional Discourse. Hong Kong: City University of Hong Kong Press.
Candlin, C. and Hyland, K. (eds) (1999). Writing: Texts, Processes and Practices. London: Longman.
Charles, M. (1996). Business negotiations: Interdependence between discourse and the business relationship. English for Specific Purposes, 15: 19–36.
Charles, M. and Charles, D. (1999). Sales negotiations: Bargaining through tactical summaries. In Hewings, M. and Nickerson, C. (eds), Business English: Research into Practice. London: Longman.
Cheng, W. (2010). Using a specialised corpus of engineering English for ESP practice. Asian EFL Journal, (Special Edition): 43–56.
Christie, F. and Martin, J. (eds) (1997). Genre and Institutions: Social Processes in the Workplace and School. London: Continuum.
Connor, U. and Upton, T. (eds) (2004). Discourse in the Professions: Perspectives from Corpus Linguistics. Amsterdam: John Benjamins.
Drew, P. and Heritage, J. (eds) (1992). Talk at Work. Cambridge: Cambridge University Press.
Firth, A. (2009). Doing not being a foreign language learner: English as a lingua franca in the workplace and (some) implications for SLA. International Review of Applied Linguistics in Language Teaching, 47(1): 127–56.
Flowerdew, L. (2005). An integration of corpus-based and genre-based approaches to text analysis in EAP/ESP: Countering criticisms against corpus-based methodologies. English for Specific Purposes, 24: 321–32.
Gee, J. P. (2005). An Introduction to Discourse Analysis. Abingdon: Routledge.
Gee, J. P., Hull, G. and Lankshear, C. (1996). The New Work Order. London: Allen and Unwin.


Handford, M. (2010a). The Language of Business Meetings. Cambridge: Cambridge University Press.
Handford, M. (2010b). What corpora have to tell us about specialised genres. In McCarthy, M. and O’Keeffe, A. (eds), The Routledge Handbook of Corpus Linguistics. Abingdon: Routledge, pp. 255–69.
Handford, M. and Koester, A. (2010). “It’s not rocket science”: Metaphors and idioms in conflictual business meetings. Text and Talk, 30: 27–51.
Handford, M. and Matous, P. (2011). Lexicogrammar in the international construction industry: A corpus-based case study of Japanese–Hong-Kongese on-site interactions in English. English for Specific Purposes, doi:10.1016/j.esp.2010.12.002.
Handford, M., Lisboa, M., Koester, A. and Pitt, A. (2011). Business Advantage: Theory, Practice, Skills (Upper Intermediate). Cambridge: Cambridge University Press.
Harvey, K., Churchill, D., Crawford, P., Brown, B., Mullany, L., Macfarlane, A. and McPherson, A. (2008). Health communication and adolescents: What do their emails tell us? Family Practice, 25: 1–8.
Heritage, J. (1997). Conversation analysis and institutional talk. In Silverman, D. (ed.), Qualitative Research: Theory, Method and Practice. London: Sage, pp. 161–82.
Hoey, M. (2005). Lexical Priming. Abingdon: Routledge.
Holmes, J. (2000). Doing collegiality and keeping control at work: Small talk in government departments. In Coupland, J. (ed.), Smalltalk. London: Longman, pp. 32–61.
Koester, A. (2004a). Relational sequences in workplace genres. Journal of Pragmatics, 36: 1405–28.
Koester, A. (2004b). The Language of Work. London: Routledge.
Koester, A. (2006). Investigating Workplace Discourse. London: Routledge.
Koester, A. (2010). Workplace Discourse. London: Continuum.
Koester, A., Pitt, A., Handford, M. and Lisboa, M. (2011). Business Advantage: Theory, Practice, Skills (Intermediate). Cambridge: Cambridge University Press.
Lisboa, M. and Handford, M. (2012). Business Advantage: Theory, Practice, Skills (Advanced). Cambridge: Cambridge University Press.
McCarthy, M. (1998). Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press.
McCarthy, M. and Handford, M. (2004). “Invisible to us”: A preliminary corpus-based study of spoken business English. In Connor, U. and Upton, T. (eds), Discourse in the Professions: Perspectives from Corpus Linguistics. Amsterdam: John Benjamins, pp. 167–202.
McCarthy, M., McCarten, J., Clark, D. and Clark, R. (2009). Grammar for Business. Cambridge: Cambridge University Press.
Nelson, M. (2000). A corpus-based study of business English and business English teaching materials. Unpublished Ph.D. Thesis, University of Manchester, Manchester.
Nonaka, I. (1994). A dynamic theory of organizational knowledge creation. Organization Science, 5(1): 14–37.
Ochs, E. (1996). Linguistic resources for socializing humanity. In Gumperz, J. (ed.), Rethinking Linguistic Relativity. Cambridge: Cambridge University Press, pp. 407–37.
Poncini, G. (2007). Discursive Strategies in Multicultural Business Meetings. Bern: Peter Lang.



Professional Communication and Corpus Linguistics

29

Prodromou, L. (2008). English as a Lingua Franca: A Corpus-Based Analysis. London: Continuum. Roberts, C. and Campbell, S. (2005). Fitting stories into boxes: Rhetorical and contextual constraints on candidates’ performances in British job interviews. Journal of Applied Linguistics, 2(1): 45–73. Sarangi, S. and Roberts, C. (eds) (1999). Talk, Work and Institutional Order. Berlin: Mouton de Gruyter. Scott, M. (1999). Wordsmith Tools Version 3. Oxford: Oxford University Press. Spencer-Oatey, H. and Franklin, P.(2009). Intercultural interaction : a multidisciplinary approach to intercultural communication. Basingstoke: Palgrave Macmillan. Stubbs, M. (2001). On inference theories and code theories: Corpus evidence for semantic schemas. Text, 21(3): 437–65. — (2007). On texts, corpora and models of language. In Hoey, M., Mahlberg, M., Stubbs, M., and Teubert, W. (eds), Text, Discourse and Corpora. London: Continuum, pp. 163–90. Tribble, C. (2002). Corpora and corpus analysis: New windows on academic writing. In Flowerdew, J. (ed.), Academic Discourse. London: Longman, pp. 131–49. Widdowson, H. (1998). Review article: The theory and practice of critical discourse analysis. Applied Linguistics, 19: 136–51.

Chapter 3

Corpora and Academic Discourse

Ken Hyland

Academic Corpora Research: Directions and Findings

It is difficult to imagine a domain of applied linguistics where corpus studies have had a greater influence than in the description of academic discourse. In pre-corpus days, studies of academic discourse typically focused on a few texts or on the situated actions of writers, and although these kinds of studies are still helpful today, corpus studies have brought a quantitative dimension to the picture. By dematerializing texts and approaching them as a package of specific linguistic features employed by a group of users, corpora provide language data that represent a speaker’s experience of language in a particular domain. Corpus analysis therefore offers evidence of the typical patterning of academic texts. It is a method that focuses on community practices and the ways members of particular disciplines understand the world and talk about these understandings (e.g. Biber, 2006; Swales, 2004). As a result, corpus studies are particularly valuable in providing insights into language use and variation across disciplines, genres and languages, which can contribute to the teaching of academic discourse to students and researchers (Hyland, 2009a). Corpus analysts are, however, sensitive to criticisms that they treat texts as artifacts abstracted from real contexts. As a result, they have increasingly added a focus on ‘action’ to balance the focus on ‘language’, seeking to ‘rematerialize’ these features to understand how and why writers make the choices they do when they write. This kind of qualitative interpretation is generally undertaken through interviews with text users, grounding patterns of text meanings in the conscious choices of writers and readers. Consequently, corpus research in academic settings has adopted a range of approaches and foci, being combined with interviews and focus groups, exploring expert and learner corpora and being used both to test hypotheses (corpus-based research) and to discover new insights (corpus-driven research).




Corpus studies demonstrate that the discourses of the academy are not a single monolithic entity, but a myriad of texts that differ across contexts. These findings can be listed under five main headings.

1. That academic genres are persuasive and structured to secure readers’ agreement;
2. That there are variations in spoken and written academic genres;
3. That language groups have different ways of expressing ideas and structuring arguments;
4. That ways of producing agreement represent discipline-specific preferences;
5. That academic persuasion involves interpersonal negotiations as much as convincing ideas.

I will discuss these briefly in the following paragraphs, before illustrating a corpus approach to academic discourse through a study of gender and discipline variation of evaluation in book reviews.

Genres are Structured to be Persuasive

All academic texts are designed to persuade readers of something, and, to achieve this purpose, academics draw on the same repertoire of linguistic resources for each genre again and again. This is, in part, because writing is a practice based on expectations and writers try to anticipate their readers’ background knowledge, processing needs and rhetorical preferences. By focusing on particular genres, corpus research has highlighted the specific features of those genres and the reasons why they are used. Corpus analyses have been particularly productive in identifying the lexicogrammatical regularities that give particular genres their structural identity. Research, for example, has described rhetorical moves in grant proposals (Connor and Upton, 2004), dissertation acknowledgements (Hyland, 2004b) and student presentations (Basturkmen, 2002). Corpus studies have also sought to describe how individual moves are constructed in various genres, with introduction (Ozturk, 2007) and result sections (Bruce, 2009) of articles receiving particular attention. This research demonstrates the distinctive differences between academic genres where particular purposes and audiences lead writers to employ very different rhetorical choices. Table 3.1, for example, shows the different frequencies of selected features in 240 research articles and 56 textbooks.

Table 3.1  Selected features in research articles and textbooks per 1,000 words

                        Hedges   Self-mention   Citation   Transitions
Research Articles        15.1        3.9           6.9        12.8
University Textbooks      8.1        1.6           1.7        24.9

These variations show how genres reflect the different purposes of their writers. The greater use of hedging underlines the need for caution and opening arguments to accommodate readers’ alternative positions in the research papers compared with the authorized certainties of the textbook, while the removal of citation in textbooks shows how statements are presented as facts rather than claims grounded in the literature. The greater use of self-mention in articles points to the personal stake that writers invest in their arguments and their desire to gain credit for claims. The higher frequency of transitions, which are conjunctions and other linking signals, in the textbooks reflects the fact that writers need to make connections far more explicit for readers with less topic knowledge (Hyland, 2004a). Corpus studies have also identified a range of key features that have previously gone largely unnoticed, or at least untaught. I am thinking here of ‘attended and unattended this’ (Swales, 2005), evaluative that (Hyland and Tse, 2005) and the use of this and these as pronouns versus determiners (Gray and Cortes, 2011). Recently, the availability of online spoken academic corpora such as the Michigan Corpus of Academic Spoken English (MICASE) and the British Academic Spoken English Corpus (BASE) has allowed analysts to study spoken academic language in settings such as lectures (Thompson, 2006), office hours (Reinhardt, 2010) and doctoral defences (Swales, 2004).

Written and Spoken Academic Language Vary

Corpus studies have shown how the discourses of the academy are not a single monolithic entity, but a myriad of varieties that differ enormously across contexts. Biber’s (2006) work, for instance, confirms the importance of mode in the language choices made by speakers and writers, with greater nominalization, impersonalization and lexical density in written compared with spoken texts. Lectures, for example, contain many features of conversation and comprise a series of relatively short clauses, while ‘university textbooks rely heavily on complex phrasal syntax rather than clausal syntax’ (Biber, 2006: 5). The two genres are complex in different ways, but both can present considerable difficulties for students (Reppen, 2004). Despite these contrasts, however, there are also similarities. Like written texts, for instance, lectures are heavily hedged, particularly in humanities and social science disciplines (Poos and Simpson, 2002), while lecturers also like to heavily signpost their presentations. Here, framing constructions (‘now I wanna spend a little bit of time talking about natural selection’) and what Swales and Malczewski (2001) call ‘new episode flags’ (OK, now, right), which mark shifts in the discourse, help guide students through a lecture. Swales (2001) and Simpson (2004) also note the high frequency of formulaic expressions in lectures, which function to manage the discourse and highlight key information, much like in written academic genres (Hyland, 2008). Impromptu definitions are particularly common, for example, and Flowerdew (1992) found that there was a definition about every 2 minutes as lecturers introduced terms on the fly, as in these examples from MICASE.

(1) What is a false reference blank? okay that is any reference blank that that is incorrectly designed and it happens in research all the time

    If the cation is a hydrogen ion H-plus, then we’ll be calling it an acid if it’s got O-H-minus we call it a base, and if it’s got oxygen O-two-minus, we call it an acid-anhydride.

This is similar to research articles (Hyland, 2007).

(2) These have rested on assumptions that power is ‘zero-sum’, that is, a finite resource which people cannot share. (sociology article)

    One of the most significant, which has received wide attention, is the formation of ‘arrow shape’ voids in the central axis of the extrudate, known as ‘central bursting’ or ‘chevroning’. (Mech E article)

These similarities with written academic texts are a consequence of the need to construe experience in ways that encourage learners to think about content not just as facts, but as systems. It is a language suited to talking about thinking itself and about sets of complex relationships. This involves presenting material in certain conventional ways that require, in part, speakers and writers to explicitly label their discourse structure and direction, highlight key points and rework utterances to offer a reformulation or concrete instance of what they have said.


Academic Conventions Differ Across Language Groups

Corpus research into cultural variations in academic writing is a minor industry. Culture is a controversial term and needs to be handled with care, but if we understand it as a historically transmitted and systematic network of meanings, we can see that it is inextricably bound up with language. This means that cultures influence our intuitions about language and expectations about appropriacy, audience and ways of organizing ideas and structuring arguments. The corpus-based contrastive rhetoric literature suggests that the schemata of L2 and L1 writers differ and influence how they write in English. These conclusions have been supported over the past decade by a range of corpus studies in different genres and by different first-language writers (e.g. Loi, 2010; Moreno and Suárez, 2008). Much of this work has focused on student genres and has identified a range of different features in first- and second-language writing in English, particularly the ways in which writers incorporate material into their writing, how they orientate to readers through attention-getting devices and estimates of reader knowledge, and differences in the use of overt linguistic features (e.g. Hinkel, 2002). A great deal of recent comparative research, however, has focused on different perceptions of interpersonal appropriacy and self-representation in academic writing in different languages. Molino (2010), comparing English and Italian linguistics articles, for example, found that while impersonal forms were used at similar frequencies, first-person pronouns were far more frequent in the English texts. Similarly, English and Spanish writers inhabit the text in distinct ways with far more self-mention evident in the English texts, suggesting that self-representation is not homogeneous across these written cultures (Dueñas, 2007; Sheldon, 2009). As Dueñas (2007) observes,

    The appropriate degree of authorial presence in a written academic text seems to be determined by the communicative purpose of the text produced, the different participants in the communicative event and the disciplinary community within which it is written as well as by the cultural context in which it is produced and distributed.

Such differences therefore indicate that rhetorical choices are not only influenced by the discipline to which the authors belong but also by the specific cultural context in which texts are used.




Texts are Discipline Specific

A fourth area in which corpus studies have contributed to our understanding of academic discourse is in revealing the extent of disciplinary variation in writing and, to a lesser degree, speaking. Individuals use language as members of social groups and the ways that they write essays, theses and articles and give seminars, conference presentations and lectures function to frame problems and understand issues in ways specific to their disciplines (Hyland, 2004a, 2009a). Corpus studies therefore put some flesh on Geertz’s (1983) view that knowledge and writing depend on the actions of members of local communities. Communities provide the context within which we learn to communicate and interpret each other’s talk, gradually acquiring the specialized discourse competencies to participate as members. While discipline is something of a contested term, corpus studies allow us to say something of how disciplines are created and maintained, along with the knowledge they establish, through the routine rhetorical preferences of academics working in collaboration as members of social communities. Corpus studies reveal the ways that persuasion is accomplished through the epistemic conventions of a discipline or what counts as appropriate evidence and argument (Hyland, 2009a). They show, for example, considerable variation in writers’ choices of academic lexis. Hyland and Tse (2007) found that the so-called ‘universal’ sub-technical items from the Academic Word List vary enormously in different genres across disciplines in terms of range, frequency, collocation and meaning. Thompson (2006) likewise found variations in the distribution of the top 20 content words in the BASE lecture corpus with 60% of occurrences of ‘economy’ in the social sciences, 65% of ‘vary’ in the physical sciences and 50% of ‘respond’ in life sciences.

Corpus analysis also reveals that the frequency and use of citations differ according to context, influenced by the ways particular disciplines see the world and carry out research. Citations are about twice as frequent in the soft disciplines, where much greater prominence is given to the cited author through the use of integral structures where the cited author is in subject position (Hyland, 2004a). This difference reflects the fact that scientists produce public knowledge in a linear way through cumulative growth, so that research emerges from earlier problems and writers do not need to report research with extensive referencing. Readers are often familiar with the earlier work and so do not need citation to be reminded of it. In the humanities and social sciences, on the other hand, the literature is more dispersed and the readership more heterogeneous, so writers cannot pre-suppose a shared context but have to build one far more through citation. There are also disciplinary differences in the extent to which writers choose to intrude into their texts through the use of hedges such as possible, might, and likely. These devices withhold complete commitment to a proposition, implying that a claim is based on plausible reasoning rather than certain knowledge, while opening a discursive space for readers to dispute interpretations. Again, they are twice as common in humanities and social science papers as in hard sciences, as there is less control of variables, more diversity of research outcomes and fewer clear bases for accepting claims in the soft disciplines, which means writers cannot report research with the same confidence of shared assumptions. They must recognize alternative voices and be far more sensitive to other opinions, expressing their arguments more cautiously by using more hedges. In the hard sciences, on the other hand, positivist epistemologies mean that the authority of the individual is subordinated to the authority of the text and facts are meant to ‘speak for themselves’. Greater weight is put on the methods, procedures and equipment used than the argument, and writers often disguise their interpretative activities behind linguistic objectivity, using fewer hedges and preferring modals over cognitive verbs when they do hedge as these more easily combine with inanimate subjects.

Academic Persuasion Involves Interpersonal Negotiations

Finally, corpus research has informed our ideas about the role of interpersonal features in academic persuasion or how writers and speakers construct an appropriate authorial self and negotiate participant relationships. The considerable use of self-mention in research articles (Hyland, 2001), abstracts (Bondi and Silver, 2004) and doctoral theses (Hyland, 2005), for example, suggests that academic writing is less self-evidently objective and impersonal than once believed. This is because academics do not simply produce texts that plausibly represent an external reality, but use language to acknowledge, construct and negotiate social relations, and this is achieved through a writer–reader dialogue, which situates both their research and themselves. As this view gains greater currency, considerable attention has turned to the features that help realize this interpersonal and evaluative dimension of academic texts. Genres such as undergraduate lectures (Morell, 2004), conference monologues (Webber, 2005) and book reviews (Hyland and Diani, 2009) have been explored from this perspective. Several frameworks have been proposed to analyze interactions in this way, with several corpus studies exploring the use of metadiscourse in academic genres. Metadiscourse is self-reflective linguistic material referring to the evolving text and to the writer and imagined reader of that text. It is based on a view of writing as social engagement and reveals the ways writers signal their attitude towards both their content and the audience of the text. One model distinguishes Interactive resources, which allow the writer to manage the information flow, organizing the discourse to anticipate readers’ knowledge and reflect the writer’s assessment of what needs to be made explicit, and Interactional resources, which refer to the participants of the interaction and the writer’s efforts to establish a suitable relationship to the data, arguments and audience. Table 3.2 shows the main features of metadiscourse (Hyland, 2005; Hyland and Tse, 2004).

Table 3.2  A model of metadiscourse in academic texts

Category                  Function                                               Examples

Interactive Resources     Help to guide reader through the text
Transitions               express semantic relation between main clauses         in addition/but/thus/and
Frame markers             refer to discourse acts, sequences or text stages      finally/to conclude/my purpose here is to
Endophorics               refer to information in other parts of the text        noted above/see Fig/in Section 2
Evidentials               refer to source of information from other texts        according to X/(Y, 1990)/Z states
Code glosses              help readers grasp meanings of ideational material     namely/eg/such as/in other words

Interactional Resources   Involve the reader in the argument
Hedges                    withhold writer’s full commitment to proposition       might/perhaps/possible/about
Boosters                  emphasize force or writer’s certainty in proposition   in fact/definitely/it is clear that
Attitude markers          express writer’s attitude to proposition               unfortunately/I agree/surprisingly
Engagement markers        explicitly refer to or build relationship with reader  consider/note that/you can see that
Self-mentions             explicit reference to author(s)                        I/we/my/our
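As a rough illustration of how categories like those in Table 3.2 might be counted in a text, the sketch below searches for a handful of the example items and reports hits per 1,000 words. It is only a minimal sketch: the `MARKERS` lexicon, the function names and the tokenization rule are assumptions made for illustration, not the procedure used in the studies cited, and any real analysis would need each hit checked in context (for instance, `I agree` is counted here both as an attitude marker and as a self-mention).

```python
import re

# Hypothetical mini-lexicon: a few example items per interactional category,
# taken from the Examples column of Table 3.2. A real study would use a far
# longer list and verify every hit manually in its context.
MARKERS = {
    "hedges": ["might", "perhaps", "possible"],
    "boosters": ["in fact", "definitely", "it is clear that"],
    "attitude markers": ["unfortunately", "i agree", "surprisingly"],
    "engagement markers": ["consider", "note that", "you can see that"],
    "self-mentions": ["i", "we", "my", "our"],
}


def count_words(text):
    """Count word tokens (runs of letters and apostrophes): a crude tokenizer."""
    return len(re.findall(r"[A-Za-z']+", text))


def marker_rates(text, markers=MARKERS):
    """Return occurrences per 1,000 words for each category in `markers`."""
    lowered = text.lower()
    total_words = count_words(text)
    rates = {}
    for category, items in markers.items():
        hits = 0
        for item in items:
            # Whole-word (or whole-phrase) matches only.
            pattern = r"\b" + re.escape(item) + r"\b"
            hits += len(re.findall(pattern, lowered))
        rates[category] = 1000 * hits / total_words if total_words else 0.0
    return rates


if __name__ == "__main__":
    sample = "Perhaps we might consider this. In fact, I agree."
    for category, rate in marker_rates(sample).items():
        print(f"{category}: {rate:.1f} per 1,000 words")
```

Normalizing to a rate per 1,000 words is what allows texts or sub-corpora of different sizes to be compared, which is how the frequencies in this chapter's tables are reported.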


Frequency information indicates the importance of metadiscourse to writers in a range of different genres including research article abstracts (Gillaerts and Van de Velde, 2010), postgraduate theses (Hyland, 2004b), undergraduate textbooks (Hyland, 2004a), lectures (Thompson, 2003) and book reviews (Tse and Hyland, 2008). In theses, for example, students used one metadiscourse signal every 21 words in a four million word corpus, with 73% of all cases in the PhD texts. This reflects both the need for more interactive devices to structure much longer texts to help readers through the arguments and a more sophisticated approach to language, suggesting writers’ attempts to present themselves as competent academics familiar with the practices of their fields.
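To relate the thesis figures above to the per-1,000-word rates used elsewhere in the chapter, a back-of-the-envelope conversion can be sketched as follows. The inputs are the rounded figures quoted in the text (one signal every 21 words, a four million word corpus, 73% in the PhD texts), so the outputs are approximations only; the variable and function names are illustrative.

```python
def per_thousand(words_per_signal):
    """Convert 'one signal every N words' into a rate per 1,000 words."""
    return 1000 / words_per_signal


# Rounded figures as quoted in the text; results are therefore approximate.
corpus_words = 4_000_000
words_per_signal = 21
phd_share = 0.73

rate = per_thousand(words_per_signal)             # signals per 1,000 words
approx_signals = corpus_words / words_per_signal  # implied total number of signals
approx_phd_signals = phd_share * approx_signals   # implied signals in the PhD texts

print(f"about {rate:.1f} signals per 1,000 words")
print(f"about {approx_signals:,.0f} signals in total")
print(f"about {approx_phd_signals:,.0f} of them in the PhD texts")
```

On these rounded inputs the rate works out at a little under 48 signals per 1,000 words.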

An Illustrative Example: Gender in Book Reviews

As an illustration of corpus research, I want to consider a study that challenges the idea that academic writing is an agonistic and aggressive discourse, which reflects masculine values of competition, rationality and objectivity (e.g. Frey, 1990). This is said to privilege male values and disadvantage women, who tend to favour a more conciliatory interaction style (Kirsch, 1993). Such a view, however, draws on a male–female stereotype, which tends to minimize gender intra-group differences and inter-group overlaps, neglecting the context of local meanings and the ways that gender is shaped by language use, rather than the other way around. With evidence for gender-preferential discourses in academic contexts mixed, Tse and Hyland (2008) decided to explore the issue by examining a corpus of 56 academic book reviews in leading journals in philosophy and biology, together with interviews with reviewers and editors in those fields. This genre seemed an excellent place to explore gender preferences as it is a public forum where community members critique and engage with others’ ideas (Hyland, 2004a). We structured the corpus with fourteen reviews by males in each discipline (seven books with male authors and seven with female authors) and fourteen reviews by females in each discipline (again, seven books with male authors and seven with female authors). These texts produced a corpus of 61,000 words. We computer-searched these sub-corpora for explicit features of evaluation, employing interactional metadiscourse markers as features that express a writer’s personal stance and assessment of material (see above). The textual analyses were supplemented by semi-structured interviews with three reviewers and one editor from each discipline. These focused on the



participants’ views on evaluation, their understandings of disciplinary constraints and opinions on argument and gender.

Table 3.3  Evaluation categories by gender (per 1000 words)

                      All Reviewers        Female              Male
                      Females   Males      Rev F    Rev M      Rev F    Rev M
Engagement Markers     13.2     15.3       15.1     10.9       15.9     14.7
Hedges                  9.8     11.0       10.9      8.5       11.1     10.8
Attitude Markers        9.1      8.5        8.9      9.2        9.3      7.8
Boosters                7.0      9.0        6.7      7.3        9.0      9.0
Totals                 38.1     43.8       41.6     35.9       45.3     42.3

The results highlight how the visible presence of the writer and reader plays a key role in this discourse, with about 50 forms per review. Table 3.3 shows that men and women make use of broadly similar features, with males employing more evaluative devices in every category but attitude markers. Analysis of the sub-corpora, presented in the four right-hand columns, suggests that both males and females employ more features when reviewing female authors. The importance of creating a shared evaluative context through engagement markers and hedges is evident in all the reviews. Engagement markers such as imperatives, inclusive pronouns and questions, for instance, help underline an appeal to scholarly solidarity and communal understandings (3); while hedges serve to tone down the author’s judgemental authority (4).

(3) Notice that since non-substances must mention substances in their definitions, they are not primary things on the criteria of Z 4. This is a useful point to keep in mind, when we try to understand why Aristotle says that non-substances do not have essences strictly speaking. (Female Phil)

    How do we know when biologists have succeeded – at what? Keller organizes her book around the thesis that there is no simple answer to these questions. (Female Bio)

(4) Some readers might yet be surprised by some features of this project: John Stuart Mill looms larger here than Aristotle, and the neo-Kantian flavor is strong. (Female Phil)

    There are also some samples (Spain, Hungarian Gypsies, Zulus) that appear to display no sex difference in 2D:4D, which suggests that in some populations the ratios do not vary with prenatal androgen exposure. (Male Bio)

Generally, both male and female writers employed similar patterns of evaluation in their texts and conducted their interactions with readers in similar ways. It seems, then, that the discursive construction of an appropriate authorial identity in this genre involves an interpersonal stance with relatively high frequencies of attitudinal lexis, hedged evaluation and explicit engagement with readers. Perhaps surprisingly, males used 15 per cent more evaluative features overall, with hedges, boosters and engagement markers particularly prominent in their writing. These are all features of a personal and engaging style more usually associated with female discourse. Boosters (items used to reinforce arguments and express conviction) represent the widest gender differences in the corpus and were mainly used by females to intensify praise, fully committing themselves to their positive evaluations of a book.

(5) In an extremely readable and coherent style, Thomas integrates biology, politics, social concerns, and economics [. . .] (Female Bio)

    Wedin’s interpretations are extremely detailed and dense, and they contain many subtle points, which are worthy of admiration and thought. (Female Phil)

Males, in contrast, more often used boosters to underpin their confidence in their judgements, particularly framing these evaluations with a personal pronoun to stamp a personal authority on their views.

(6) This was particularly noticeable in the areas with which I was most familiar, where I also found occasional unsupported statements that I believe were incorrect. (Male Bio)

    I have absolutely no doubt that Whitford’s understanding of these seemingly simple, but amazingly complex systems is among the most comprehensive in the world community of desert ecologists. (Male Bio)
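The overall comparison reported above can be checked against the all-reviewers figures in Table 3.3 with a short calculation. This is only an illustrative sketch: the dictionary simply transcribes the two 'All Reviewers' columns of the table, and the function name `percent_more` is an assumption introduced here.

```python
# (female, male) rates per 1,000 words, transcribed from the
# 'All Reviewers' columns of Table 3.3.
table_3_3 = {
    "engagement markers": (13.2, 15.3),
    "hedges": (9.8, 11.0),
    "attitude markers": (9.1, 8.5),
    "boosters": (7.0, 9.0),
    "totals": (38.1, 43.8),
}


def percent_more(female, male):
    """Percentage by which the male rate exceeds the female rate."""
    return 100 * (male - female) / female


for category, (female, male) in table_3_3.items():
    print(f"{category}: {percent_more(female, male):+.1f}%")
```

On the totals row this gives a figure of roughly 15 per cent, in line with the text. Measured in relative terms, boosters show the largest gap of the four categories, while attitude markers are the only category with a negative value, that is, the only one used more by female reviewers.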




The only feature more common in the females’ reviews was expressions of attitude, particularly regarding the overall coverage, argument and value of the reviewed book and the experience, reputation and ability of its writer.

(7) Rosalind Hursthouse has written an excellent book, in which she develops a neo-Aristotelian virtue ethics that [. . .] (Female Phil)

    I found this book rather repetitive, and Barlow’s knowledge of evolutionary theory a little naive. (Female Bio)

These evaluative expressions were common and indicate how reviewers judge target texts according to community-based standards. There appear, then, to be some gender differences, but the male writers’ greater willingness to make bold statements, boost their arguments and generally take a more confident and uncompromising line might be a result of seniority, rather than gender. In fact, several informants commented on this dominance.

    Yes scientists are mainly male . . . the imbalance is even greater when you go up the ladder. . . . It’s hard because part of being confident depends on how you’re perceived. You know, many people think women are not as good in writing that kind of ‘factual’ report. I know this perception is wrong but it affects how you see and present yourself. (Female Bio interview)

    Unfortunately there is a huge gender imbalance in professional philosophy. The observation that men use more ‘I’ and are more assertive may be due to the hierarchical thing that the women feel that they have to be more careful or less assertive and this has to do with masculine aggressivity. (Male Phil interview)

Few women hold chairs in these fields, oversee research labs or sit on important committees, so language preferences may be related to a male dominant culture in the academy and the dominance of men in higher career positions. More important than gender in influencing writers’ language decisions, it seems, however, is discipline.
Table 3.4 shows how male and female reviewers used evaluation by discipline, with differences in each feature in the final column.

Table 3.4  Gender use of metadiscourse in individual disciplines (per 1000 words)

                      Philosophy                  Biology
Gender                Female   Male    F/M        Female   Male    F/M
Engagement Markers     14.7    17.2    2.5          9.3    10.1    0.8
Hedges                 11.0    11.4    0.4          6.9     9.8    2.9
Attitude Markers        8.8     8.3    0.5          9.8     9.1    0.7
Boosters                7.3     9.6    2.3          6.1     7.4    1.3
Totals                 41.8    46.5    4.7         32.1    36.4    4.3

It is clear from this admittedly relatively small data set that

language uses by males and females show greater similarity within each discipline and more variations between the disciplines. Clearly, engagement, hedging and boosting are all evaluative features more common in philosophy than biology, reflecting the more visible role of the agent in the soft knowledge fields. In research writing, rhetorical practices are closely related to the ways the disciplines create knowledge, with philosophers taking a more explicitly involved and interpretative approach compared with the biologist’s greater reliance on shared background understandings and proven methods (Hyland, 2004a). These rhetorical conventions are also apparent in review genres, with the scientists exhibiting a relatively greater preference for impersonality. The greater use of engagement markers and boosters by the male reviewers is the most obvious gender difference in philosophy and, on closer analysis, this seems to relate to very different ways of conceptualizing arguments in this discipline. The reviews written by male philosophers are conspicuous in the extent to which they are invested with an explicitly committed and engaged stance. This is a personal and intrusive discourse that presents an individual voice proclaiming ideas and meeting objections head-on with conviction, as in this example. (8) This is almost certainly too weak an account, for it fails to rule out chance coincidences of judgment and motivation. This problem may be avoidable by appeal to a more complex pattern of possibilities, however I believe that any moderated account of the disposition to exercise control will run into a serious difficulty, as follows. (Male Phil) This directly challenging mode of philosophical debate assumes that the best way to investigate ideas is to subject them to extreme opposition. In contrast, many female philosophers employed a very different rhetorical style, encoding praise and criticism of a target book as an outcome of logical




reasoning. This approach emphasizes respect for values of rationality and logic by careful exemplification and coherent analysis. Clearly, both approaches represent appropriate ways of arguing in philosophy, and both are available to men and women, but the analyses show a clear gender preference for these choices. In biology, gender differences in the use of evaluative features are narrower, with only hedges employed more by one group. This variation may be related to both gender preferences and to the different roles that reviews serve in biology. In philosophy, books are the main vehicle for advancing scholarship and presenting original research, whereas biology is a more fast-moving and competitive environment, so journal articles are the key means of reporting knowledge. Books, therefore, mainly assemble already accredited knowledge for students, and so reviewers tend to focus on visible production and stylistic features rather than fine points of argument. Again, it was the male reviewers who were more likely to adopt a more explicit evaluation of the authors’ ideas and methods, encouraging rhetorical choices of personal involvement. Many of these more critical and evaluative reviews, moreover, were of books written by other men, with concordance patterns suggesting that male reviewers used far more hedges in reviewing the books of other males, toning down the strength of their evaluations.

(9) Although he wanders freely through an eclectic selection of images and ideas, he could perhaps have done more to explore new ground. (M-review-M Bio)

It was a bit of a tedious read due to the level of detail, the somewhat artificial breakdown of topics by chapter, and the nature of models in general. Because of the amount of detailed discussion it was often difficult to see the forest for the trees. (M-review-M Bio)

The fact that female biology reviewers were less critical meant that they used far fewer hedges, but they also used them in a different way, often to redress the impact of the book’s argument on readers, rather than the impact of the criticism on the author.

(10) I expect some readers would enjoy the multidisciplinary treatment [. . .] (F-review-F Bio)


So although readers might be horrified by what they see [. . .] 

(F-review-M Bio)

Overall, while only a small-scale study, this corpus exploration of evaluation in book reviews supports current research which suggests that there is no one-to-one relation between gender and language. This analysis gives greater space for individual agency and challenges perspectives that stress an ‘essentialist’ view of gender: a view that reifies language behaviour and gives a fixed identity to groups.

Conclusion

Corpus studies have made a considerable contribution to our understanding of academic discourse and revealed many of the ways that writers in different disciplines, genres and languages represent themselves, their work and their readers. In particular, they have shown that a range of features occur and behave in dissimilar ways in different disciplinary environments and underlined the importance of community, context and purpose in writing, which has helped inform EAP course design and teaching. It is this observation about addressing students’ target needs that helps clarify future directions for research. Quite clearly, we need more descriptions of the genres of specific disciplines that students need to write and read. Reports, essays, articles, critiques, presentations, case notes, lectures and so on differ across disciplines, and the knowledge of their structures and salient features can demystify them for learners. As corpus research into academic genres continues to grow, we can therefore anticipate an ever-increasing broadening of studies beyond texts to the talk and contexts that surround their production and use, beyond the verbal to the visual and beyond tertiary to school and professional contexts. Corpus studies will, in tandem with other methods, have a continuing and important role to play in this endeavour.

References

Basturkmen, H. (2002). Negotiating meaning in seminar-type discussion and EAP. English for Specific Purposes, 21: 233–42.
Biber, D. (2006). University Language: A Corpus-based Study of Spoken and Written Registers. Amsterdam: John Benjamins.




Bondi, M. and Silver, M. (2004). Textual voices: A cross disciplinary study of attribution in academic discourse. In Anderson, L. and Bamford, J. (eds), Evaluation in Oral and Written Discourse. Rome: Officina Edizioni, pp. 117–36.
Bruce, I. (2009). Results sections in sociology and organic chemistry articles: A genre analysis. English for Specific Purposes, 28(2): 105–24.
Connor, U. and Upton, T. (2004). The genre of grant proposals: A corpus linguistic study. In Connor, U. and Upton, T. (eds), Discourse in the Professions. Amsterdam: John Benjamins, pp. 235–55.
Dueñas, P. M. (2007). ‘I/we focus on. . .’: A cross-cultural analysis of self-mentions in business management research articles. Journal of English for Academic Purposes, 6(2): 143–62.
Flowerdew, J. (1992). Definitions in science lectures. Applied Linguistics, 13(2): 202–21.
Frey, O. (1990). Beyond literary Darwinism: Women’s voices and critical discourse. College English, LII(5): 507–26.
Geertz, C. (1983). Local Knowledge: Further Essays in Interpretive Anthropology. New York: Basic Books.
Gillaerts, P. and Van de Velde, F. (2010). Interactional metadiscourse in research article abstracts. Journal of English for Academic Purposes, 9(2): 128–39.
Gray, B. and Cortes, V. (2011). Perception vs. evidence: An analysis of this and these in academic prose. English for Specific Purposes, 30(1): 31–43.
Hinkel, E. (2002). Second Language Writers’ Text. Mahwah, NJ: Lawrence Erlbaum.
Hyland, K. (2001). Humble servants of the discipline? Self-mention in research articles. English for Specific Purposes, 20(3): 207–26.
— (2004a). Disciplinary Discourses. Ann Arbor: University of Michigan Press.
— (2004b). Graduates’ gratitude: The generic structure of dissertation acknowledgements. English for Specific Purposes, 23(3): 303–24.
— (2005). Metadiscourse. London: Continuum.
— (2007). Applying a gloss: Exemplifying and reformulating in academic discourse. Applied Linguistics, 28: 266–85.
— (2008). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27(1): 4–21.
— (2009). Academic Discourse. London: Continuum.
Hyland, K. and Diani, G. (eds) (2009). Academic Evaluation: Review Genres in University Settings. London: Palgrave-MacMillan.
Hyland, K. and Tse, P. (2004). Metadiscourse in academic writing: A reappraisal. Applied Linguistics, 25(2): 156–77.
— (2005). Hooking the reader: A corpus study of evaluative that in abstracts. English for Specific Purposes, 24(2): 123–39.
— (2007). Is there an “academic vocabulary”? TESOL Quarterly, 41(2): 235–54.
Kirsch, G. (1993). Women Writing in the Academy: Audience, Authority, and Transformation. Carbondale and Edwardsville: Southern Illinois University Press.
Loi, C. K. (2010). Research article introductions in Chinese and English: A comparative genre-based study. Journal of English for Academic Purposes, 9(4): 267–79.
Molino, A. (2010). Personal and impersonal authorial references: A contrastive study of English and Italian linguistics research articles. Journal of English for Academic Purposes, 9(2): 86–101.


Morell, T. (2004). Interactive lecture discourse for university EFL students. English for Specific Purposes, 23(3): 325–38.
Moreno, A. and Suárez, L. (2008). A study of critical attitude across English and Spanish academic book reviews. Journal of English for Academic Purposes, 7(1): 15–26.
Ozturk, I. (2007). The textual organization of research article introductions in applied linguistics: Variability within a single discipline. English for Specific Purposes, 26(1): 25–38.
Poos, D. and Simpson, R. (2002). Cross-disciplinary comparisons of hedging: Some findings from the Michigan Corpus of Academic Spoken English. In Reppen, R., Fitzmaurice, S. and Biber, D. (eds), Using Corpora to Explore Linguistic Variation. Amsterdam: John Benjamins, pp. 3–23.
Reinhardt, J. (2010). Directives in office hour consultations: A corpus-informed investigation of learner and expert usage. English for Specific Purposes, 29(2): 94–107.
Reppen, R. (2004). Academic language: An exploration of university classroom and textbook language. In Connor, U. and Upton, T. (eds), Discourse in the Professions. Amsterdam: John Benjamins, pp. 65–86.
Sheldon, E. (2009). From one I to another: Discursive construction of self-representation in English and Castilian Spanish research articles. English for Specific Purposes, 28(4): 251–65.
Simpson, R. (2004). Stylistic features of academic speech. In Connor, U. and Upton, T. (eds), Discourse in the Professions. Amsterdam: John Benjamins, pp. 37–64.
Swales, J. (2001). Metatalk in American academic talk: The case of point and thing. Journal of English Linguistics, 29(1): 34–54.
— (2004). Research Genres. New York: Cambridge University Press.
— (2005). Attended and unattended “this” in academic writing: A long and unfinished story. ESP Malaysia, 11: 1–15.
Swales, J. and Malczewski, B. (2001). Discourse management and new episode flags in MICASE. In Simpson, R. and Swales, J. (eds), Corpus Linguistics in North America. Ann Arbor, MI: University of Michigan Press, pp. 145–64.
Thompson, P. (2006). A corpus perspective on the lexis of lectures, with a focus on economics lectures. In Hyland, K. and Bondi, M. (eds), Academic Discourse Across Disciplines. Bern: Peter Lang.
Thompson, S. (2003). Text-structuring metadiscourse, intonation and the signalling of organisation in academic lectures. Journal of English for Academic Purposes, 2(1): 5–20.
Tse, P. and Hyland, K. (2008). ‘Robot Kung fu’: Gender and the performance of a professional identity. Journal of Pragmatics, 40(7): 1232–48.
Webber, P. (2005). Interactive features in medical conference monologue. English for Specific Purposes, 24(2): 157–81.

Chapter 4

Corpora and Workplace Discourse¹

Almut Koester

Introduction

The distinctive characteristics of workplace discourse have been explored through a variety of methodological approaches, including conversation analysis (CA) (e.g. Drew and Heritage, 1992), genre analysis (e.g. Bhatia, 1993) and social constructionist approaches (e.g. Holmes and Stubbe, 2003). However, despite the proliferation of research activity in the field of business, professional and workplace discourse, until recently, few studies have used corpus linguistic methods to analyze the nature of workplace and business discourse (McCarthy and Handford, 2004). Recent corpus-driven studies² (e.g. Nelson, 2000a, 2000b, 2006; Cheng, 2004, 2007; Handford, 2007, 2010) provide some fascinating insights into some of the distinctive features of workplace and business language. The most obvious contribution of corpus analytical methods to the study of a variety of registers is in providing information about the most frequent lexical items and collocations. However, corpus methods have also been used to explore a range of features that have usually been examined using exclusively qualitative methods. These include pragmatic features, such as speech acts and politeness markers (Cheng and Warren, 2005, 2006), interactive features, such as interruptions and question tags (Cheng and Warren, 2007), and genre as well as prosody (Cheng, 2004; Warren, 2004). This chapter reviews relevant findings from corpus studies of workplace discourse, which generally combine qualitative with quantitative methods to explore features of lexis, phraseology and pragmatics (including interactive features).


Overview of Relevant Corpora

While many large corpora contain sub-corpora of more specific varieties, including professional and business genres, most studies of more specialized varieties or genres have been carried out using relatively small collections of texts. This is also the case for workplace or business corpora, and it has been argued that smaller specialized corpora are in many ways more suitable for studying specific registers or genres, as they are carefully targeted and set up to reflect the contextual features of the genre, such as information about the setting and the participants (Flowerdew, 2004; O’Keeffe et al., 2007). This section provides an overview of the most important workplace or business corpora to be drawn on in this chapter.

The first such corpus of note to be compiled was Mike Nelson’s Business English Corpus, BEC (Nelson, 2000a). It contains 1 million words of spoken and written business data, with slightly more written than spoken texts. Nelson makes a useful distinction between ‘language about’ business and ‘language doing’ business. Language used for talking about business is from texts such as business books, newspapers and interviews; whereas language used for actually doing business can be found, for example, in emails, reports, meetings, negotiations and phone calls. While BEC focuses specifically on ‘business discourse’, and therefore is not necessarily representative of ‘workplace discourse’ in general, it does cover a very important area within workplace discourse. Also, as the first corpus of its kind, the work carried out by Nelson is pioneering and provides a kind of model of how a corpus of workplace discourse can be compiled and analyzed.

More recently, the Cambridge and Nottingham Business English Corpus (CANBEC) has been compiled, consisting exclusively of spoken language doing business (Handford, 2007, 2010). This 1-million-word corpus consists mainly of business meetings recorded mostly in the UK in a large variety of business sectors, including the pharmaceutical industry, information technology and manufacturing sectors. The majority (three in every four) of the meetings are company-internal meetings, and therefore involve ‘workplace discourse’, whereas the remaining external meetings between companies are more specifically concerned with ‘business discourse’.

A wider range of workplace genres was targeted in the 262,000-word business sub-corpus of the Hong Kong Corpus of Spoken English (HKCSE), which consists of 30 hours of meetings, service encounters, workplace presentations, job interviews, telephone conversations and informal office talk recorded in Hong Kong (see Research Centre for Professional Communication




in English Website). Much of the corpus consists of company-internal interactions, and therefore it provides a good coverage of workplace discourse in general. The speakers are Hong Kong Chinese, native speakers of English or speakers with mother tongues other than Chinese, and settings include a wide range of workplace contexts, including the service industry, for example, hotels (see Warren, 2004). In discussing what corpora can tell us about workplace discourse, I will be drawing mainly on the three corpora described previously (BEC, CANBEC, HKCSE), as these are the largest business and workplace corpora to date, but I will also refer to some smaller and more specialized workplace corpora that have also yielded relevant findings.

Lexical Features

Frequent words

One of the benefits of corpus software is that it can quickly count things that would be extremely painstaking, or even impossible, to count manually. Therefore, a distinctive feature of specialized corpora that can easily be discovered using corpus analytical methods is the most frequent words in the corpus and by extension in that particular register or genre. However, such knowledge of frequency is meaningless without some kind of benchmark against which it can be measured. The benchmark against which ‘special’ types of discourse (e.g. institutional or academic) have usually been compared is ‘ordinary conversation’. This was the approach used by McCarthy and Handford (2004) in a study analyzing a sub-corpus of 250,000 words from CANBEC, in which they posed the question: ‘To what extent is SBE [Spoken Business English] like or unlike everyday informal conversation?’ (p. 172). The CANBEC sub-corpus was compared to a sub-corpus of the Cambridge and Nottingham Corpus of Discourse in English (CANCODE) consisting of social and family conversations and also to the academic sub-corpus from CANCODE (see also O’Keeffe et al., 2007: 204–216). Handford (2007, 2010) examines word frequency in the whole 1-million-word CANBEC Corpus and compares this to the combined Socializing and Intimate sub-corpora (SOCINT) from CANCODE. Both the two-way and three-way comparisons are revealing of the ways in which spoken workplace discourse is different from – and also similar to – everyday language. The first step was to create raw frequency lists, which, at first sight, look quite similar across the different corpora, as the most frequent words are always grammatical ones, for example ‘the’, ‘I’ and ‘and’. But even in such


lists of highly frequent words, some differences emerge. ‘I’ is the most frequent word in the conversational sub-corpus, whereas ‘the’ is at the top of the list for CANBEC and the academic corpus (O’Keeffe et al., 2007: 207). This difference reflects the more personal orientation of casual conversation compared to the more institutional and objective nature of business and academic discourse. The order of frequency of the pronouns is also interesting: ‘you’ occurs more frequently than ‘I’ in both CANBEC and the academic sub-corpus, and ‘we’ is much more frequent in CANBEC than in the other two corpora (ibid.). This highlights the important role of the group, team or organization in business, while the increased use of ‘you’ may reflect the importance of directives and requests in academic and business discourse.

Keywords

The distinctive lexis of a genre or register comes out much more clearly using a keywords list. Keywords (Scott, 1999) are those words whose frequency is unusually high in the genre compared to their normal frequency in the language. Corpus software, such as Wordsmith Tools (ibid.), generally has a keywords tool, which can establish the keywords in a corpus or text compared to a benchmark corpus (usually a more general one). The software analyzes word lists from both corpora (the specialized corpus and the ‘reference’ corpus) and compares the relative frequency of the words. Those words that have a high frequency in both corpora (i.e. grammatical items, such as articles and prepositions) will therefore not come out as key, but instead lexical words that are distinctive of the genre will be foregrounded. Keywords were identified for CANBEC based on a comparison to the SOCINT sub-corpus from CANCODE (Handford, 2007, 2010). Table 4.1 shows the top 50 keywords in CANBEC.
In contrast to the raw frequency list, the keyword list contains a number of lexical words that we would expect to have a high frequency in business language compared to everyday language, such as ‘customer(s)’, ‘order(s)’, ‘meeting’, ‘sales’ and so forth. Interestingly, some grammatical words still appear near the top of the list, for example, ‘we’, ‘the’, ‘if’, ‘us’, ‘so’. This indicates that even though these words have a high frequency in the language in general, they are, nevertheless, still unusually frequent in spoken business discourse. One reason for the high frequency of some of these keywords is that many of them occur as part of recurring phrases or ‘chunks’, as we shall see. A number of backchannels and fillers are also in the list (‘okay’, ‘hmm’, ‘er’), which highlight the interactive nature of workplace meetings. What is also striking is the position of the semi-modal ‘need’ in the list being the tenth




Table 4.1  Top 50 keywords in CANBEC

 N   WORD          FREQ.
 1   WE            12,078
 2   WE’VE          2,752
 3   OKAY           2,951
 4   WE’RE          2,376
 5   HMM              527
 6   THE           32,032
 7   CRANE            460
 8   CUSTOMER         495
 9   LIFT             653
10   NEED           1,812
11   CRANES           377
12   ORDER            560
13   MEETING          594
14   SALES            380
15   THOUSAND         731
16   HUNDRED          944
17   ORDERS           345
18   IF             5,362
19   WHICH          2,101
20   WILL           1,776
21   CUSTOMERS        307
22   PER              452
23   PRICE            395
24   MAIL             281
25   BUSINESS         522
26   LIFTS            239
27   IS             8,660
28   MONTH            503
29   WE’LL          1,085
30   STOCK            289
31   ISSUE            287
32   PRODUCT          224
33   CENT             329
34   PROBLEM          660
35   FOR            6,210
36   US             1,418
37   SERVER           161
38   SO             7,983
39   VEHICLE          191
40   ER             8,059
41   POINT            738
42   TYRE             159
43   LIST             344
44   COMPANY          527
45   INFORMATION      339
46   SYSTEM           272
47   TERMS            277
48   CELLAR           144
49   TWO            2,366
50   TO            18,403

(adapted from Handford, 2007, p. 180, extracted from CANBEC, © Cambridge University Press, reproduced with permission)

position and the presence of the words, ‘issue’ and ‘problem’. Even such minimal information as provided by a keyword list affords an intriguing glimpse into the world of spoken business discourse: there seems to be more emphasis on the group, i.e. the company or organization (‘we’, ‘us’), than the individual; discourse markers like ‘if’ and ‘so’ seem to point to the importance of discussing, hypothesizing and negotiating; and resolving problems seems to be a key activity (‘need’, ‘issue’, ‘problem’).

Nelson (2000a, 2000b) produced frequency and keyword lists for the Business English Corpus (BEC) using the 2-million-word BNC sampler corpus (a subset of the British National Corpus), consisting of ‘general English’, as a reference corpus. In the list of the 100 most ‘key’ words in BEC, lexical business words are much more in evidence than in CANBEC, showing words such as ‘business’, ‘market’, ‘customer’, ‘supplier’, ‘sale’, ‘management’ in the top 10 (see Nelson, 2000a, ‘The 100 most “key” words in the Business English Corpus’). This can be explained by the fact that BEC includes written as well as spoken discourse (therefore, interactive features are less in evidence) and is composed not only of ‘language doing’ business (as is the case for CANBEC), but also ‘language about’ business. When language is used to do business, the business world forms a part of the assumed shared knowledge between the participants, but is not always referred to explicitly; whereas in talking about business (e.g. in interviews or magazines), the business world is referred to explicitly with business lexis, such as ‘market’, ‘export’, ‘merger’. The keywords in BEC fall into five semantic categories: ‘people in business’, ‘business activities’, ‘business actions’, ‘business descriptions’ and ‘business events and entities’ (Nelson, 2006: 222). Table 4.2 shows examples of keywords in each of these categories.
As with CANBEC, even just looking at keywords points to some interesting general tendencies in the lexis. Nelson (2000a) notes that the key lexis is overtly positive in nature (e.g. ‘achieve’, ‘improve’, ‘best’ and ‘successful’) with very few negative words. It is also dynamic and action-orientated (e.g. ‘manage’, ‘operate’, ‘export’) and clearly non-emotive: adjectives like ‘nice’,




Table 4.2  Examples of top 50 keywords in BEC in 5 semantic categories

People in      Business       Business    Business        Business Events
Business       Activities     Actions     Descriptions    and Entities
customer       business       sell        high            sale
manager        investment     manage      big             merger
supplier       delivery       achieve     international   export
distributor    development    improve     successful      performance
staff          payment        operate     best            market

(adapted from Nelson, 2000a)

‘lovely’ and ‘terrible’ are negatively key, meaning they occur less frequently than is the norm. Nelson also observes that most of the adjectives refer to things, such as products and companies (e.g. ‘big banks’, ‘international company’), rather than to people. One interesting difference between the keywords in BEC and CANBEC is that while in BEC there are few keywords with a negative meaning or connotation, in CANBEC, ‘problem’ and ‘issue’ are in the top 50 keywords. This again highlights the difference between ‘language about’ and ‘language doing’ business: in talking or writing about business (usually for public consumption), the emphasis will often be on successes and positive developments, whereas when actually engaged in doing business, the focus is often on problem-solving of some kind. A keyword list is clearly limited in what it can tell us about workplace or business discourse as a whole. Nevertheless, both Nelson’s and Handford’s work on business corpora show that frequency and keyword lists provide a kind of window onto discourse, pointing to potential areas of interest to explore further, both in terms of the collocations and phrases these words enter into and the type of pragmatic and discoursal function they perform, as discussed in the next section.
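The keyword comparison described in this section can be sketched in a few lines of code. The sketch below is illustrative only: it assumes log-likelihood as the keyness statistic (one commonly used option; the studies cited may have used different settings), the function names are hypothetical, and the two tiny 'corpora' are invented stand-ins for a business corpus and a reference corpus. A signed score is used so that words which are relatively rarer in the study corpus come out as 'negatively key'.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """Log-likelihood for a word occurring a times in a study corpus of
    n1 tokens and b times in a reference corpus of n2 tokens."""
    expected_study = n1 * (a + b) / (n1 + n2)
    expected_ref = n2 * (a + b) / (n1 + n2)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total


def keyness_table(study_tokens, reference_tokens):
    """Return (word, study_freq, ref_freq, score) rows, most key first.

    The score is signed: positive when the word is relatively more
    frequent in the study corpus, negative when it is relatively less
    frequent ('negatively key').
    """
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    n1, n2 = len(study_tokens), len(reference_tokens)
    rows = []
    for word in set(study) | set(ref):
        a, b = study[word], ref[word]
        score = log_likelihood(a, b, n1, n2)
        if a / n1 < b / n2:
            score = -score
        rows.append((word, a, b, score))
    # Sort by descending score, then alphabetically so ties are stable.
    return sorted(rows, key=lambda r: (-r[3], r[0]))


# Invented toy data, not drawn from BEC, CANBEC or CANCODE.
business = "we need to check the order for the customer we need the price".split()
everyday = "i think the weather is lovely and the garden is nice i think".split()

for word, a, b, score in keyness_table(business, everyday):
    print(f"{word:10} {a:3} {b:3} {score:7.2f}")
```

Even on this toy data the pattern discussed above appears: 'we' and 'need' head the list, 'the' is frequent in both samples and so scores close to zero, and the everyday words sit at the bottom with negative scores.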

Phraseology: Frequent Chunks

Many of the keywords in CANBEC and BEC frequently occur in particular phrasal strings or ‘chunks’ of two or more words, which partly accounts for


their high frequency. For example, ‘we need to’ is a very high-frequency chunk in BEC, and in CANBEC, ‘need’ frequently occurs in the following combinations of two to five words (Handford, 2007, 2010):

• ‘need to’
• ‘we need to’
• ‘we need to do’
• ‘I think we need to’

Such recurring strings of words have been given different labels: ‘lexical bundles’ (Biber et al., 1999), ‘formulaic sequences’ (Wray, 2002; Schmitt, 2004), ‘chunks’ (De Cock, 2000; O’Keeffe et al., 2007) and ‘clusters’ (Handford, 2007, 2010). Chunks (the term that will be used here) can be identified using corpus software by setting a minimal frequency, such as ten occurrences per million words (Biber et al., 1999; Handford, 2010), to determine whether or not a string qualifies as a chunk. A computer can only identify strings of words, but cannot decide whether or not these strings are meaningful in any way. It will therefore also pick out syntactically incomplete strings, such as ‘a bit of’ or ‘at the end of’, as well as chunks that have ‘semantic unity and syntactic integrity’ (O’Keeffe et al., 2007: 64), for example, ‘at the end of the day’. O’Keeffe et al. (ibid.: 70–71) argue that many such syntactic fragments nevertheless have ‘pragmatic integrity’ and should therefore be considered meaningful chunks. They often perform particular pragmatic or discourse functions. For example, ‘a bit of’ is routinely used as a ‘frame’ to downtone utterances, e.g. ‘a bit of a mess/problem/nuisance’ (ibid.).

A number of corpus studies of specialized genres have shown that such chunks often perform specific pragmatic functions and therefore play a key role within the genre (Oakey, 2002; Simpson, 2004; Handford, 2007, 2010; O’Keeffe et al., 2007). McCarthy and Handford (2004) and O’Keeffe et al. (2007) compared chunks and their pragmatic functions in spoken academic and business discourse and found that chunks in business discourse were often used to express functions such as hedging, purposive vagueness, referring to collective goals and politeness. Handford (2010) examined the most frequent chunks (or ‘clusters’) of two to six words, and categorized these into two broad functional categories: discourse marking and interactional functions. Discourse marking categories include such functions as focusing, linking and summarizing; interactional categories include a wide variety of functions, such as checking understanding, hedging, hypothesizing and evaluating. Table 4.3 shows the four most frequent two-, three- and four-word chunks found in CANBEC.




Table 4.3  The four most frequent two-, three- and four-word chunks in CANBEC

Two-word chunks    Three-word chunks    Four-word chunks
you know           I don’t know         at the end of
I think            a lot of             the end of the
of the             at the moment        have a look at
I mean             we need to           a bit of a

(extracted from CANBEC, © Cambridge University Press, reproduced with permission)

One reason why studying chunks is important in identifying the characteristics of workplace language is their sheer ubiquity and frequency. A number of chunks in CANBEC have a higher frequency than some very common single words (ibid.). The chunks with the highest frequency, ‘you know’ and ‘I think’, are more frequent than ‘really’ and ‘work’; and longer, less frequent chunks like ‘I don’t know’ and ‘at the end of the day’ still occur more frequently than the business nouns ‘industry’ and ‘profit’.

It is interesting to compare the chunks identified by Handford (ibid.) in CANBEC with their frequency and use in another workplace corpus, the Hong Kong Corpus of Spoken English (HKCSE-bus³). All these chunks also occurred at least ten times per 1 million words in HKCSE-bus and therefore qualify as chunks (according to Handford’s cut-off point)⁴; however, on the whole, there was quite a disparity in the relative frequencies of the items in the two corpora. The only chunks that had broadly similar frequencies were ‘I think’, ‘I mean’, ‘I don’t know’ and ‘at the end of’. A more detailed examination of the contexts of use of the various chunks in both corpora would be necessary to explain this disparity in frequency, but a number of possible reasons include

• the speaker composition of the corpus: most participants in CANBEC are UK native speakers, whereas HKCSE is composed of one-third Hong Kong Chinese speakers;
• the different genres included in the corpora: CANBEC is composed mainly of business meetings, whereas HKCSE-bus contains a greater variety of workplace genres, such as job interviews and presentations. If chunks perform specific functions within a genre, their frequency and function are likely to vary from genre to genre.

A more detailed comparison of one three-word chunk, ‘so I think’, which was very frequent and had similar frequencies in both corpora, is quite revealing. In CANBEC, ‘so I think’ is used most commonly for summarizing, but also for explaining, elaborating and disagreeing (Handford, 2010),


occurring more frequently in ‘external meetings’ (between companies) than in ‘internal meetings’ (between colleagues), where it is frequently used as a ‘tactical summary’ (Charles and Charles, 1999) to push the discussion in a direction favourable to the speaker.

In the HKCSE-bus, ‘so I think’ is used quite differently. Many of the 40 instances of the chunk occur in job placement interviews in Hong Kong hotels with students on a BA programme in Hotel Management (Warren, 2006). Very broadly, it still has a summarizing function, due largely to the discourse marking function of ‘so’ (Schiffrin, 1987), but both interviewers and interviewees use it in specific ways within the job interview. Interviewers use the chunk to perform a broader range of functions than interviewees, which is perhaps a reflection of their greater power in the interview, for example, i) to move to the next stage of the interview, ii) to summarize the benefits of the placement for the interviewee or iii) to highlight some key aspect of the job. In Example 4.1, the latter function is illustrated.

Example 4.1

It’s not like another shop or bank it closes so you can’t (.) but basically we are open all the time
a: mm
B: so I think in our industry you when you’re committed*
a: **mm
B: you will find that hours can be quite long
a: mm mm

Interviewees seem to use this chunk primarily to provide an ‘upshot’ of their answer to a question posed by the interviewer. In a number of instances, this involves summarizing their suitability for – or their reasons for – wanting the job:

Example 4.2

a: and to see er their er how to deal with the guest when the guest have er when the guest is demanding *and so I think it’s very interesting for me
b: **mm

Here, the interviewee has been explaining why she would like to work in the front office of the hotel for her placement (Warren, 2006).

The analysis of chunks in both CANBEC and HKCSE shows that identifying high-frequency chunks in a specific register such as workplace discourse not only provides a more complete picture of the lexicogrammatical characteristics of the register, but also can lead to a discovery of important pragmatic functions performed by these chunks in workplace interactions.



High-frequency chunks can perform quite specific functions within a genre and thus constitute a kind of ‘generic fingerprint’ (Farr, 2007).
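The kind of chunk count discussed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tokenizer, the function names and the toy transcript are invented here and are not the procedure used to analyse CANBEC or HKCSE. It counts every three-word sequence and normalizes a raw count to a rate per million words, so that corpora of different sizes can be compared.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def chunk_counts(tokens, n=3):
    """Count every contiguous n-word sequence (chunk) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def per_million(count, corpus_size):
    """Normalize a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_size

# Hypothetical toy transcript; it is not taken from either corpus.
sample = "so I think we need to go so I think it is fine and I think we need to stop"
tokens = tokenize(sample)
counts = chunk_counts(tokens)

so_i_think = counts[("so", "i", "think")]
print(so_i_think, per_million(so_i_think, len(tokens)))
```

A raw count on its own says little when corpora differ in size, which is why the normalized figure is the one to compare. Deciding what function a chunk performs still requires reading each instance in context, as in the examples above.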

Pragmatics Features

Something as bare as a frequency or keyword list can provide a ‘systematic point of entry’ (Adolphs et al., 2004: 14) into the data. Concordance searches for keyword collocations can then be carried out, and examination of the contexts in which these collocations occur may then lead on to exploring their pragmatic functions. Such an approach is used by Cheng (2004) to study checking out discourse in recordings of hotel interactions from HKCSE-bus. A frequency list generated with corpus software showed that the lexical item ‘minibar’ was unexpectedly frequent and occurred in all the checking out interactions. The next step was to create concordances for the most frequent items, which revealed that the word ‘minibar’ was used exclusively by hotel staff and never by a guest. This is perhaps not surprising, as there is a clear role distinction between the participants in service encounters like hotel front desk interactions. A qualitative examination of the discourse context of ‘minibar’, ranging from a study of politeness features and intonation patterns to the positioning of the lexical item within the structural organization of the discourse, showed that the hotel’s corporate message of ‘customer care’ was frequently at odds with what actually happened in checking out discourses. For example, transcribing the data using Brazil’s (1985, 1997) discourse intonation model showed that most of the questions to the guests regarding the minibar had rising intonation:

Example 4.3
b: //HAVE you got a minibar KEY//
B: //I wasn’t GIven one//
(Cheng, 2004: 150)

In the discourse intonation model, rising tone indicates an assumption of shared knowledge, whereas falling tone marks the utterance as communicating new information. By using rising tone, the receptionist projects an assumption that the answer will be ‘yes’; moreover the use of a rise, instead of a fall-rise, adds ‘insistence’ or ‘forcefulness’ to the utterance, as this tone tends to be used by dominant speakers (Brazil, 1997: 86–98). Cheng (ibid.: 153) notes that the choice of rising tone is probably not appropriate in this context, which could explain the fact that the guest’s negative responses tended to be quite emphatic.
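The two corpus steps in this study, a frequency list followed by concordance lines that record who is speaking, can be illustrated with a short Python sketch. It is a hedged illustration only: the function names are assumptions, the turns are invented rather than taken from HKCSE-bus, and the corpus software actually used is not reproduced here.

```python
from collections import Counter

def frequency_list(turns):
    """Return (word, count) pairs over all turns, most frequent first."""
    words = [w for _, text in turns for w in text.lower().split()]
    return Counter(words).most_common()

def concordance(turns, node, width=3):
    """Return (speaker, left context, node, right context) for each occurrence of node."""
    hits = []
    for speaker, text in turns:
        words = text.lower().split()
        for i, word in enumerate(words):
            if word == node:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                hits.append((speaker, left, word, right))
    return hits

# Invented checking-out turns, labelled by speaker role (not corpus data).
turns = [
    ("staff", "did you take anything from the minibar"),
    ("guest", "no nothing at all"),
    ("staff", "have you got a minibar key"),
    ("guest", "i was not given one"),
]

hits = concordance(turns, "minibar")
speakers = {speaker for speaker, *_ in hits}
print(frequency_list(turns)[:3])
print(hits)
print(speakers)
```

Keeping the speaker label on every concordance line is what makes a pattern such as staff-only use of a word visible. The intonation and politeness analysis that follows is qualitative and lies outside what a script like this can show.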

Another workplace study that takes the same approach, starting with a keyword list and concordances, leading to an investigation of pragmatic features of the discourse, is from the domain of health care. Adolphs et al. (2004) used this approach to examine a small corpus (totalling approximately 60,000 words) of telephone calls to NHS Direct, a telephone advice service provided by the National Health Service in the UK. The telephone calls were made by researchers, posing as members of the public, to nurses and health advisors working for NHS Direct who did not know which of the calls were made by the researchers. Once medical terminology had been removed from the keyword list, the remaining lexical items fell into the following categories: negatives, imperatives, pronouns, vague language, affirmations/positive backchannels and directives. A qualitative examination of items in these categories in the phone call transcripts showed that many of them performed interpersonal functions, which were important in eliciting symptoms from the callers and giving them advice. Moreover, they often related to particular phases of the interaction. At the beginning of the phone call, the personal pronouns ‘you’ and ‘your’ are frequently used to secure the caller’s involvement. The use of backchannel responses such as ‘okay’ and ‘right’ is significantly higher in the corpus compared to a reference corpus of general spoken English (CANCODE), signalling active listenership on the part of the health professionals. Nurses and advisors often use modal items, such as ‘may’ and ‘can’ as politeness markers, introducing ‘optionality into the conversation and thus give the appearance of allowing the patient to make their own decision as to whether or not to follow the advice’ (ibid.: 18).

The vague expression ‘or anything’ occurs frequently, and seems to have the function of inviting patients to add their own description to the proposed symptoms, for example:

Example 4.4
N: And so there’s no swelling anywhere to your face or anything?
(Adolphs et al., 2004: 19)

Once symptoms have been elicited, the interaction moves to the advice-giving stage. Here the health professionals tend to de-personalize the advice by referring to external sources of authority, such as the British Medical Association’s guidelines. The words ‘advise’, ‘advice’, ‘suggest’ are key, and one reason for this seems to be their use at this stage of the interaction for impersonal constructions such as ‘they’ll be able to advise/suggest . . .’. The phone calls usually finish with a ‘convergence coda’, in which the advisors or nurses



summarize the discussion and try to secure assent from the caller to adopt a suggested course of action. Adolphs et al. (ibid.) conclude that this small-scale study illustrates ways in which corpus linguistic methods, which have had little application in studies of health care communication, can complement and enhance findings arrived at through discourse and conversation analyses.

These two studies are excellent examples of combining quantitative with qualitative methods to analyze the characteristics of workplace discourse in a way that is corpus-driven, where the initial quantitative findings, obtained using corpus methods (such as keywords and concordancing), reveal features of the data that are then further explored using qualitative methods. Corpus findings can also be revealing of key characteristics of a genre and can lead to a mapping of its ‘generic fingerprint’ (Farr, 2007). The main lexicogrammatical features can be identified, and these can also provide clues to the discourse structure of the genre. Certain keywords may cluster in particular parts of an interaction or text and can thus be indicative of the staging of the genre, as, for example, in the NHS Direct telephone interactions (Adolphs et al., ibid.). More examples of studies that have used corpus methods to investigate workplace genres can be found in Koester (2010).
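Keyword analysis of the kind used in both studies rests on comparing a word's frequency in a specialized corpus with its frequency in a reference corpus. The Python sketch below uses the log-likelihood statistic, one measure commonly used for keyness. Which statistic the studies above applied is not stated in this chapter, so the choice is an assumption, as are the function names and the invented counts.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) for one word in a study corpus versus a reference corpus."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def is_positively_key(freq_study, size_study, freq_ref, size_ref, threshold=3.84):
    """True if the word is relatively more frequent in the study corpus and G2 exceeds
    the threshold (3.84 is the chi-square critical value for p < 0.05 at one degree of freedom)."""
    more_frequent = freq_study / size_study > freq_ref / size_ref
    return more_frequent and log_likelihood(freq_study, size_study, freq_ref, size_ref) > threshold

# Invented counts: a backchannel in a 60,000-word study corpus
# against a 5,000,000-word reference corpus.
print(log_likelihood(600, 60_000, 5_000, 5_000_000))
print(is_positively_key(600, 60_000, 5_000, 5_000_000))
```

A word that is relatively less frequent in the study corpus but still exceeds the threshold would be negatively key. In either case the statistic only flags candidates, and the pragmatic interpretation depends on the concordance lines.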

Summary and Conclusion

This chapter set out to answer the question: What are the distinctive characteristics of workplace discourse? McCarthy and Handford’s (2004) corpus-based comparison of spoken business English (SBE) with everyday conversation led them to conclude that SBE is indeed different from everyday conversation and that it is ‘an institutional form of talk’ (p. 187). The studies discussed in this chapter, all of which involved applying corpus analysis to various levels of workplace discourse (lexis, chunks and genre), contribute to confirming that workplace discourse is indeed ‘institutional’ and exhibits the three distinctive characteristics of institutional discourse which, according to Drew and Heritage (1992: 22), distinguish it from everyday conversation:

1. goal orientation
2. special and particular constraints
3. special inferential frameworks and procedures

The focus on institutional goals is reflected in the frequent and key lexis in business and workplace corpora, for example, in the unusually high frequency of the personal pronoun ‘we’ in CANBEC or the semantic

categories (e.g. ‘business activities’) that make up the keywords in BEC (Nelson, 2006). Goal orientation is also reflected in Nelson’s finding from keyword analysis that business language is action-oriented. This is also reflected at other levels of discourse, for example, in the highly frequent chunks using deontic modals (e.g. ‘we need’/‘we need to’/‘we need to do’), which show speakers discussing necessary or desirable actions (see Note 5).

Special and particular constraints are reflected, for example, in the finding that certain lexical items have a high frequency or are ‘key words’, while others are negatively key compared to everyday usage. Such ‘constraints’ may also mean that the available linguistic and interactional choices are limited on the basis of the institutional or discursive roles of participants. For example, in the job placement interviews in HKCSE-bus, interviewers used the chunk ‘so I think’ to perform a greater range of functions than interviewees.

Corpus analysis can also help to reveal the special inferential frameworks in operation in different workplace contexts, for example, in identifying frequent collocations and chunks and their specific pragmatic functions within an institutional environment or genre. For example, we saw that the chunk ‘so I think’ performed different functions in CANBEC meetings compared to job placement interviews in HKCSE-bus.

While corpus analysis, on the one hand, confirms the institutional character of workplace discourse, the high frequency of interactive and interpersonal features, such as backchannel responses, vague language and various politeness markers, also clearly highlights the importance of interpersonal aspects of workplace interactions. Such interpersonal features can perform key functions within particular workplace practices (e.g. using backchannels to secure the caller’s involvement in the NHS Direct phone calls), and they characterize workplace discourse as much as lexical items and collocations, more typically associated with particular domains of work. However, the interpersonal dimension here is distinct in a number of ways from social or intimate interactions due to the asymmetry often found in workplace discourse, which is another key characteristic of institutional discourse in general (Heritage, 1997). Because of the power difference and special role relationships in many workplace interactions, politeness strategies involving, for example, hedging and vague language play an important role in maintaining and reinforcing workplace relationships. This phenomenon is evident in a number of the studies and examples presented in this chapter, for instance, in the advice given to callers in the NHS Direct data. An example of not using such strategies appropriately was found in the checking out discourse of hotel staff in Hong Kong. Corpus methods also reveal very clearly that such politeness strategies are not evenly



distributed in all workplace interactions, but vary according to the particular role relationships and the genre being performed. So, for example, strategies showing ‘listenership’ (e.g. ‘mmhm’, ‘okay’) were particularly prominent in the NHS Direct calls. Perhaps the most important lesson from corpus analysis is that workplace discourse varies in line with the goals, role relationships and activities of the participants. But this variation is not random: by combining quantitative and qualitative methods, it can be traced to the specific practices within each workplace community. As demonstrated in this chapter, it is a combination of quantitative corpus methods with more qualitative methods, such as discourse and genre analysis, which can provide the richest and most differentiated account of workplace interactions.

Future Directions

Corpus research in workplace discourse is currently developing in a number of different ways. One direction is to focus on increasingly specialized types of professional or occupational discourse. For example, a number of specialist written professional corpora ranging between approximately 5 and 9 million words have recently been compiled at the Hong Kong Polytechnic University: the Corpus of Surveying and Construction Engineering, Financial Services Corpus and Engineering Corpus (see Research Centre for Professional Communication in English website). An example of research in specialist spoken corpora is the Health Communication Corpus Project at the University of Nottingham (Adolphs et al., 2007); however, collecting and compiling spoken workplace corpora will continue to pose challenges of access and transcription. Websites and online communication offer a further rich vein of enquiry to be explored; for example, a corpus of emails sent to an online health forum is currently being investigated as part of the Nottingham Health Communication Project (Harvey and Adolphs, 2011). Finally, findings from workplace and business corpora have implications for the teaching of English for Business and Occupational Purposes and are beginning to make their way into some teaching material (see Handford, 2010: 245–261).

Notes

1 Adapted from Koester, A. 2010, Workplace Discourse, London: Continuum, Chapter 3: ‘What can a corpus tell us about workplace discourse’, pp. 45–69.
2 In a ‘corpus-driven’ approach, the theories developed derive from the corpus data, which means ‘recurrent patterns and frequency distributions are expected to form the basic evidence for linguistic categories’ (Tognini-Bonelli, 2001: 84). This can be contrasted to a ‘corpus-based’ approach, where a corpus is drawn on to ‘test or exemplify theories’, which were not themselves derived from the corpus (Tognini-Bonelli, 2001: 65).
3 See Research Centre for Professional Communication in English website.
4 As the sub-corpus is only about a quarter of the size of CANBEC (262,000 as opposed to 1 million), frequencies were normalized to a million words in order to allow comparison.
5 Deontic modals are highly frequent in CANBEC (Handford, 2010) and other workplace corpora, and decision-making and problem-solving are key activities.

References

Adolphs, S., Atkins, S. and Harvey, K. (2007). Caught between professional requirements and interpersonal needs. In Cutting, J. (ed.), Vague Language Explored. Houndmills, Basingstoke: Palgrave Macmillan, pp. 62–78.
Adolphs, S., Crawford, P., Brown, B., Sahota, O. and Carter, R. A. (2004). Applying corpus linguistics in a health care context. International Journal of Applied Linguistics, 1(1), 44–9.
Bhatia, V. K. (1993). Analysing Genre: Language Use in Professional Settings. London: Longman.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999). Longman Grammar of Spoken and Written English. Harlow: Pearson Education.
Brazil, D. (1985). The Communicative Value of Intonation in English. Birmingham: English Language Research.
Brazil, D. (1997). The Communicative Role of Intonation in English. Cambridge: Cambridge University Press.
Charles, M. and Charles, D. (1999). Sales negotiations: Bargaining through tactical summaries. In Hewings, M. and Nickerson, C. (eds), Business English Research into Practice. London: Longman, pp. 71–82.
Cheng, W. (2004). //did you TOOK// from the miniBAR//: What is the practical relevance of a corpus-driven language study to practitioners in Hong Kong’s hotel industry? In Connor, U. and Upton, T. (eds), pp. 141–66.
Cheng, W. (2007). The use of vague language across spoken genres in an intercultural Hong Kong corpus. In Cutting, J. (ed.), Vague Language Explored. Houndmills, Basingstoke: Palgrave Macmillan, pp. 161–81.
Cheng, W. and Warren, M. (2005). //well I have a DIFferent//THINking you know//: A corpus-driven study of disagreement in Hong Kong business discourse. In Bargiela-Chiappini, F. and Gotti, M. (eds), Asian Business Discourse(s). Frankfurt am Main: Peter Lang, pp. 241–70.
Cheng, W. and Warren, M. (2006). “I would say be very careful of . . .”: Opine markers in an intercultural business corpus of spoken English. In Bamford, J. and Bondi, M. (eds), Managing Interaction in Professional Discourse. Intercultural and Interdiscoursal Perspectives. Rome: Officina Edizioni, pp. 46–58.
Cheng, W. and Warren, M. (2007). Checking understandings in an intercultural corpus of spoken English. Language Awareness, 16(3), 190–207.



Connor, U. and Upton, T. (eds) (2004). Discourse in the Professions: Perspectives from Corpus Linguistics. Amsterdam: John Benjamins.
De Cock, S. (2000). Repetitive phrasal chunkiness and advanced EFL speech and writing. In Mair, C. and Hundt, M. (eds), Corpus Linguistics and Linguistic Theory. Papers from ICAME 20 1999. Amsterdam: Rodopi, pp. 51–68.
Drew, P. and Heritage, J. (eds) (1992). Talk at Work. Cambridge: Cambridge University Press.
Farr, F. (2007). Spoken language analysis as an aid to reflective practice in language teacher education: Using a specialized corpus to establish a generic fingerprint. In Campoy, M. C. and Luzón, M. J. (eds), Spoken Corpora in Applied Linguistics, pp. 235–58.
Flowerdew, L. (2004). The argument for using English specialized corpora to understand academic and professional settings. In Connor, U. and Upton, T. A. (eds), Discourse in the Professions: Perspectives from Corpus Linguistics. Amsterdam: John Benjamins, pp. 11–33.
Handford, M. (2007). The genre of the business meeting: A corpus-based study. Unpublished PhD thesis, University of Nottingham, School of English Studies.
Handford, M. (2010). The Language of Business Meetings. Cambridge: Cambridge University Press.
Harvey, K. and Adolphs, S. (2011). Discourse and healthcare. In Handford, M. and Gee, J. P. (eds), Routledge Handbook of Discourse Analysis. London: Routledge.
Heritage, J. (1997). Conversation analysis and institutional talk. In Silverman, D. (ed.), Qualitative Research: Theory, Method and Practice. London: Sage, pp. 161–82.
Holmes, J. and Stubbe, M. (2003). Power and Politeness in the Workplace. London: Pearson Education.
Koester, A. (2010). Workplace Discourse. London: Continuum.
McCarthy, M. and Handford, M. (2004). ‘Invisible to us’: A preliminary corpus-based study of spoken business English. In Connor, U. and Upton, T. A. (eds), Discourse in the Professions. Amsterdam: John Benjamins, pp. 167–201.
Nelson, M. (2000a). Mike Nelson’s Business English Lexis Site (Retrieved 14 September 2009 from http://users.utu.fi/micnel/business_english_lexis_site.htm).
Nelson, M. (2000b). A corpus-based study of the lexis of business English and business English teaching materials. Unpublished Thesis, University of Manchester, Manchester (Retrieved 15 October 2009 from http://users.utu.fi/micnel/business_english_lexis_site.htm).
Nelson, M. (2006). Semantic associations in business English: A corpus-based analysis. English for Specific Purposes, 25: 217–34.
O’Keeffe, A., McCarthy, M. J. and Carter, R. A. (2007). From Corpus to Classroom. Cambridge: Cambridge University Press.
Oakey, D. (2002). Formulaic language in English academic writing: A corpus-based study of the formal and functional variation of a lexical phrase in different disciplines. In Reppen, R., Fitzmaurice, S. and Biber, D. (eds), Using Corpora to Explore Linguistic Variation. Philadelphia, PA: John Benjamins, pp. 111–29.
Research Centre for Professional Communication in English Website at the Hong Kong Polytechnic University (Retrieved 17 November 2008 from http://langbank.engl.polyu.edu.hk/RCPCE/).

Schiffrin, D. (1987). Discourse Markers. Cambridge: Cambridge University Press.
Schmitt, N. (ed.) (2004). Formulaic Sequences. Amsterdam: John Benjamins.
Scott, M. (1999). Wordsmith Tools, Version 3 (corpus analytical software suite). Oxford: Oxford University Press.
Simpson, R. (2004). Formulaic expressions in academic speech in discourse. In Connor, U. and Upton, T. (eds), Discourse in the Professions: Perspectives from Corpus Linguistics. Amsterdam: John Benjamins, pp. 37–64.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins.
Warren, M. (2004). // so what have YOU been WORKing on Recently//: Compiling a specialized corpus of spoken business English. In Connor, U. and Upton, T. (eds), pp. 115–40.
Warren, M. (2006). The role of discourse intonation in lexical cohesion. International Journal of Corpus Linguistics, 11(3), 169–87.
Wray, A. (2002). Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.

Part Three

Corpora in Applied Linguistics Domains

Chapter 5

Corpora and Translation Studies

Sara Laviosa

Research and Translator Training

Nowadays, corpora are widely utilized in the empirical study of translation, translator training and professional translation. This chapter traces the development of corpus use in translation studies from its advent to the present day and ends with some predictions about the way in which corpora are likely to influence the study, teaching and practice of translation in the future.

Corpora and Descriptive Studies of Translation

Corpora were introduced into translation studies in the early 1990s, when corpus linguistics was making inroads into many applied fields of inquiry, and descriptive studies of translation were expanding. At the time, it was envisaged that corpus-linguistic analytical methods together with corpus-design principles would provide suitable tools for unveiling the nature of translation as a mediated communicative event (Baker, 1993). Crucially, corpus-based methods of enquiry were considered to be largely compatible with the discovery and justification procedures that were being elaborated within the empirical and target-oriented approach to translation studies known as Descriptive Translation Studies. In this paradigm, translations are studied as target language texts in their own right rather than being viewed as derivative texts that are to reproduce and represent the originals by establishing a relationship of equivalence based on a-priori criteria. Hence, equivalence is conceived in terms of balance between invariance and transformation across the source and target texts, a balance that is determined by historical and cultural factors reflected in norms. Here, norm refers to ‘[a] regularity in behaviour, together with the common knowledge about and the mutual expectations concerning the way in which members

of a group or community ought to behave in certain types of situation’ (Hermans, 1999: 163). Within a descriptive perspective, translations are generally investigated using a methodology involving an inductive and helical progression from observable translational phenomena to the non-observable and culturally determined norms that govern translators’ choices (Toury, 1995: 36–39). Therefore, the object of study is translation embedded in a dynamic network of relations of a linguistic, textual, literary and socio-cultural nature.

Corpus Methods in Translation Studies

The first phase of the research methodology elaborated within Descriptive Translation Studies involves the selection of a corpus of translated texts and the examination of their acceptability in the target language, that is to say, the extent to which translations adhere to the linguistic and cultural norms prevailing in the target language for a particular text genre. The investigation may be carried out with a collection of texts produced by a particular translator or by comparing different translations of the same source text either synchronically or diachronically. The second phase starts with the identification of the source text(s) and proceeds by mapping the segments of each target text onto the source text’s counterparts to determine target–source relationships, translation problems, translation solutions and shifts. In the third phase, these relationships become the basis of first-level generalizations about the norm underlying the concrete way in which equivalence is realized. At each stage of this gradual discovery of facts about the nature of translation, hypotheses are formulated on the basis of empirical descriptions and are verified through further procedures that are applied to an expanding corpus, aiming to achieve increasingly higher levels of generalization.

An example of a corpus study of norms is the investigation of Spanish short stories by Gabriel García Márquez and their English translations. Drawing on systemic functional linguistics, cultural studies and reception theory, a comparative analysis of the target and source texts vis-à-vis English and Spanish reference corpora revealed that the initial norm characterizing the translator’s choices was orientated towards acceptability, that is, a tendency to adhere to the target language patterns. Evidence of the operation of this norm is provided by the shifting to sentence-initial position of circumstantial adjuncts such as since day break (desde el amanecer), all at once (de pronto) and to her surprise (de pronto) (Munday, 1997).



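Figure 5.1. Types of corpora [tree diagram: corpus types are either bi/multilingual or monolingual comparable; bi/multilingual corpora are either parallel or comparable; parallel corpora are either unidirectional or bidirectional]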

Types of Corpora

Different types of corpora are used to investigate the product and process of translation, as shown in Figure 5.1. A bi-/multilingual corpus consists of texts produced in two or more languages. A monolingual comparable corpus is made up of two collections of texts in one language. One collection comprises translations from one source language or a variety of source languages, the other includes original texts of similar composition to the translational component. A bi-/multilingual corpus can be either parallel or comparable. A bilingual parallel corpus consists of one or more texts in language A and its/their translations in language B, while a bilingual comparable corpus consists of two collections of original texts, which are generally similar with regard to text genre, topic, time span and communicative function. A bilingual unidirectional parallel corpus is made up of one or more texts in language A and its/their translation(s) in language B, while a bilingual bidirectional parallel corpus consists also of one or more texts in language B and its/their translation(s) in language A.

Descriptive Research Endeavours

Throughout the 1990s, corpus studies of translation contributed to the advancement of Descriptive Translation Studies. A major research endeavour was the quest for translation universals, i.e. linguistic features that typically occur in translated rather than original texts and are thought to be independent of the influence of the specific language pairs involved in the process of translation (Baker, 1993: 243). Corpus studies of universals focused initially on

- simplification, defined as ‘the process and/or result of making do with less words’ (Blum-Kulka and Levenston, 1983: 119);
- explicitation, i.e. ‘an observed cohesive explicitness from [source language] to [target language] texts regardless of the increase, traceable to differences between the linguistic and textual systems involved’ (Blum-Kulka, 1986: 19), this being known as the Explicitation Hypothesis;
- normalization, defined as ‘the tendency to conform to patterns and practices which are typical of the target language, even to the point of exaggerating them’ (Baker, 1996: 176–177).

The universal of simplification

Simplification was investigated with a monolingual comparable corpus consisting of English narrative and newspaper texts translated from a variety of source languages and original texts of similar composition. The findings showed that, largely independent of the influence of the source language, translators tend to use a relatively lower proportion of lexical words versus grammatical words and a relatively higher proportion of high-frequency versus low-frequency words. Translated texts are also characterized by greater repetition and less variety of the words most frequently used in the corpus (Laviosa-Braithwaite, 1996).
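The measures behind these findings can be approximated with simple counts. The Python sketch below is an illustration only. The function-word list is a tiny stand-in for a proper inventory of grammatical words, the two sample texts are invented, and the function names are assumptions; none of this reproduces the study's actual operationalization. It computes the proportion of lexical words, the type-token ratio as a rough index of lexical variety, and the share of the text taken up by its most frequent word forms.

```python
from collections import Counter

# A tiny illustrative set of grammatical (function) words; a real study would need a fuller list.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "it", "that", "on"}

def lexical_density(tokens):
    """Proportion of tokens that are lexical (content) words."""
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(lexical) / len(tokens)

def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def top_word_coverage(tokens, top_n=3):
    """Proportion of all tokens accounted for by the top_n most frequent word forms."""
    counts = Counter(tokens)
    return sum(count for _, count in counts.most_common(top_n)) / len(tokens)

# Invented samples standing in for a translational and a non-translational text.
translated = "the man went to the house and the man saw the house".split()
original = "the stranger wandered to a cottage and glimpsed its crooked chimney".split()

for label, tokens in (("translated", translated), ("original", original)):
    print(label, lexical_density(tokens), type_token_ratio(tokens), top_word_coverage(tokens))
```

Because the type-token ratio is sensitive to text length, comparisons of this kind are only meaningful across samples of similar size or with a standardized variant of the measure.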

The universal of explicitation

The Explicitation Hypothesis was tested with a corpus of literary translations drawn from the bidirectional English-Norwegian Parallel Corpus (ENPC), compiled in the department of English and American Studies of the University of Oslo under the direction of Stig Johansson. The explicitating shifts, involving the addition and specification of lexical and grammatical items, were found to outnumber the implicitating shifts in both directions of translation. In addition to the process of interpretation inherent in translation, various factors were considered to explain the phenomenon of explicitation, for example, the stylistic preferences of the source and target languages, their systemic differences and the culture-bound translation norms (Øverås, 1998). The examples below illustrate the addition of a grammatical item (1; 3) and of a lexical item (2; 4) in English and Norwegian translations (Øverås, 1998: 575–579).



1. The definitive face that begins to emerge with adolescence was long, slender and tenderly responsive beneath thick-browed eyes ringed with dark skin as if in physical manifestation of deep thought.
   long, slender is translated as langt og smalt (Literally: long and slender);
2. It lived in yet another shadow, being equidistant from the Mendip Mast and Glastonbury Tor.
   it lived is translated as den levde sitt liv (Literally: it lived its life);
3. jeg var på vei inn, han på vei ut (Literally: I was on the way in, he was on the way out) is translated as I was on the way in and he was on the way out;
4. Den hvite mannen knipser (Literally: the white man clicks) is translated as the white man clicks his camera.

The following examples show two types of specification: substitution of a grammatical item (5; 6) and of a lexical item (8) as well as the expansion of a lexical item (7) in English and Norwegian translations (Øverås, 1998: 575–579):

5. They were supposed to stay at the beach a week, but neither of them had the heart for it and they decided to come back early.
   and they decided is translated as så de bestemte seg (Literally: so they decided);
6. Her companion hesitated, looked at her, then leaned back and released the rear door.
   looked at her is translated as så på piken (Literally: looked at the girl);
7. bestemor Gælion (Literally: grandmother Gælion) is translated as his grandmother Gælion;
8. ble hellig (Literally: became holy) is translated as was made holy.

The universal of normalization

Lexical creativity and lexical normalization were investigated with a parallel corpus of contemporary German literary texts and their English translations (GEPCOLT). Three sets of creative lexis were identified in the German subcorpus: creative word forms selected from an initial list of hapax legomena (word forms that occur only once in the corpus); creative forms specific to a particular writer and creative author-specific collocations. The results showed that 44% of creative hapax legomena and 16% of creative collocations were normalized. For example, the unusual compound Aber-glaubigkeit,

meaning ‘superstitions’, creatively hyphenated by the writer (Natascha Wodin) to give emphasis to aber, meaning ‘but’, is normalized with the conventional word superstitions. An example of a creative translation is the rendering of the compounds Annaaugen, Annabrüste and Annarock. These are formed with Anna (the name of one of the characters in Elfriede Jelinek’s novel Die Ausgesperrten) and Augen, ‘eyes’; Brüste, ‘breasts’; Rock, ‘skirt’ and are translated as Annaeyes, Annabreasts and Annaskirt, respectively. On the whole, lexical normalization was found to be a feature of translation, but on most occasions it did not take place. The extent to which creative lexis is normalized appears to be influenced by how the translator sees his/her brief and by the systemic resources of the source language. For instance, creative lexis linked to the derivational possibilities offered by German, or those that involve puns might be particularly difficult to render in the target language (Kenny, 1999).
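The first step described above, drawing up an initial list of hapax legomena from which creative forms are then selected by hand, can be sketched as follows. The tokenizing pattern, the function name and the sample sentence are assumptions made for illustration, and the sentence is invented rather than quoted from GEPCOLT.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return, in sorted order, the word forms that occur exactly once in the text."""
    # Match runs of letters (including accented ones), allowing internal hyphens.
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

# Invented sample sentence reusing two of the compounds mentioned above.
sample = "Anna sah Annaaugen und Anna sah Annabrüste"
print(hapax_legomena(sample))
```

A list of this kind is only a starting point: most word forms that occur once in a corpus are entirely conventional, so the candidates still have to be screened manually to decide which of them count as creative.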

Current Descriptive Research

The quest for universals

Since the turn of the millennium, the concerns of descriptive corpus studies of translation have varied considerably. Based on a growing number of languages, the search for universals has revealed different aspects of the specificity of the language of translation. This sometimes shows lexical and structural traces of language contact, while at other times it may underrepresent features unique to the target language, over-represent elements that are less common in the target language or display a tendency towards conventional target language use (Mauranen, 2008; Mauranen and Kujamäki, 2004).

Culture-specific phenomena

In addition to the study of regularities in translational behaviour, scholars have focused on the particulars of culture-specific phenomena. For example, the style of literary translators has been examined by considering both linguistic choices and the extra-linguistic aspects of the process of translation such as the relative status of the source and target languages, the distance between source and target cultures, the translator’s choices regarding themes and literary genres as well as the professional status of the translator (Baker, 2000).



Another corpus study that takes into consideration the specificity of the cultural context to explain the nature of translation is the systemic functional analysis of an article by Gabriel García Márquez and three English translations published in the Guardian and New York Times newspapers as well as by a Cuban group, Granma International. Important shifts in the translation process were unearthed and accounted for by linking the linguistic features of the source and target texts to their relative sociocultural contexts. For instance, the anti-USA feelings expressed in the original were relayed differently in the target texts, and, in the New York Times, they were largely omitted (Munday, 2002).

The investigation of Anglicisms

Corpus studies of translation have also contributed, in the last few years, to the investigation of Anglicisms. This research enterprise is particularly relevant in Europe, where the need to harmonize a national with a transnational identity is intimately related to the promotion of multilingualism, cultural diversity, mutual intelligibility and cultural unity (see Anderman and Rogers, 2005). The general assumption underlying these studies is that, in the wake of global English, translation is a mediator of language change induced by English source texts, as a result of the process known as ‘negative transfer’, that is ‘deviations from normal, codified practices of the target system’ (Toury, 1995: 275). Negative transfer is regarded as more readily tolerated ‘when translation is carried out from a “major” or highly prestigious language/culture, especially if the target language/culture is “minor”, or “weak” in any other sense’ (Toury, 1995: 278). Translation-based corpus studies of the Anglicization of European languages have revealed considerable variation across target languages, domain-specific discourses, text types and different types of Anglicisms (cf. Anderman and Rogers, 2005). The findings have prompted a reconsideration of the role of translation as one of the main means of importing new linguistic trends vis-à-vis other forms of language contact. The evidence suggests, in fact, that even though translated texts tend to reflect the linguistic patterns displayed by their sources, they are not the only form of frequent language contact and cannot be regarded as the enemy within pure, isolated, self-contained languages (Mauranen, 2008: 45). The significance of these findings lies in their shedding light on the similarity and difference between translation and other forms of bilingual communication.

Translator Training

At the end of the 1990s, translation studies scholars began to develop a corpus-based pedagogy, which was inspired by the innovative Data-Driven Learning approach to foreign language teaching and was aimed specifically at developing translator skills and capacities. In the light of the successful use of corpora in LSP teaching at an advanced level, the goal was to devise a student-centred teaching method, where trainee translators would be encouraged to identify problem areas, suggest descriptive hypotheses and then test them in co-operation with their tutor who assumes the role of facilitator in the learning process. More specifically, bilingual comparable corpora of specialized texts were compiled and analyzed in order to acquire a) content knowledge about a specific subject field; b) textual knowledge concerning the overall structure of specialized texts and c) linguistic knowledge about the typical use of terminology and general words. The analytical tools were frequency lists and keyword in context (KWIC) concordance lines, through which selected lexical items were investigated in order to improve the understanding of terms as well as to identify the most accurate equivalents at the level of lexis and discourse units (Gavioli and Zanettin, 2000). Similar techniques were used in an experimental study, which tested the effectiveness of using a target language monolingual corpus of specialized texts as an aid to improve translation quality. A group of fourth-year undergraduate English students at Dublin City University carried out two translations from French into English of two semi-specialized texts on optical scanners. One translation was completed with the use of a bilingual dictionary and other conventional reference material; the other was carried out with a bilingual dictionary and a 1.4 million word specialized English corpus of articles on optical scanners.
The findings revealed that the corpus-aided translations were of a higher quality with respect to subject field understanding (sensibilité aux nuances was accurately translated as whatever their sensitivity to colour); correct term choice (vitre was properly translated as either glass platen or scan bed) and idiomatic expression (photodiodes sensibles à la lumière was translated fluently as light-sensitive photodiodes or photosensitive diodes) (Bowker, 1998). It was also demonstrated that translation into the L2 could be improved with the use of a target-language reference corpus. The British National Corpus, for example, proved to be highly effective for translating tourist brochures from Italian into English. In fact, Italian trainee translators were able to produce natural-sounding collocations (i.e. grand tour of the city and road with panoramic views) by examining the frequency of occurrence and concordance lines of
assumed target language equivalents of the source language noun phrases gran giro della città and strada panoramica (Stewart, 2000). Nowadays, corpora are being increasingly used in translator training, where they constitute valuable resources for retrieving and examining lexical, terminological, phraseological, syntactical and stylistic equivalents. They are also employed to acquire subject-specific knowledge, to evaluate translation quality and as essential components of computer-aided translation technology (Bowker, 2001; Koby and Baer, 2003).
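
Frequency lists and KWIC concordance lines of the kind mentioned above can be produced with a few lines of code. The Python sketch below is illustrative only: it does not reproduce any of the concordancing tools used in the studies cited, and the tokenizer, the function names, the context width and the sample sentences about scanners are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word forms (hypothetical tokenizer)."""
    return re.findall(r"[^\W_]+(?:[-'’][^\W_]+)*", text.lower())


def frequency_list(tokens, top=10):
    """Return the `top` most frequent word forms with their counts."""
    return Counter(tokens).most_common(top)


def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample standing in for a specialized corpus.
sample = ("The scanner has a glass platen. Place the page on the platen "
          "and close the lid before the scanner starts.")
tokens = tokenize(sample)

print(frequency_list(tokens, top=3))
for left, node, right in kwic(tokens, "platen", width=3):
    # Right-align the left context so that the node words line up vertically.
    print(f"{left:>20} | {node} | {right}")
```

Aligning the node word in a central column is what allows recurrent patterns to the left and right of a term to be spotted by eye, which is how trainee translators would typically read such output.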

The Relationship Between Description and Pedagogy

Description and pedagogy have become increasingly interrelated in recent years. Prominent examples are the pedagogical applications of descriptive research into universals and studies of Anglicisms.

The Unique Items Hypothesis

The Unique Items Hypothesis states that target-language-specific elements, which do not have direct equivalents in other languages and, in particular, in the source language, tend to be under-represented in translated texts. This hypothesis, supported by previous empirical evidence, was used in the translation classroom to raise student awareness of what the translation process entails. In the first phase of this experimental study, thirty-six students were asked to back-translate into Finnish the German and English translations of a Finnish original text created ad hoc on the topic of driving in Finland, which included several language-specific items with no straightforward equivalents in either German or English. In the second phase of the experimental design, the students’ translations were compared with the students’ use of original Finnish as revealed by a cloze test designed to elicit ‘unique items’. The findings confirmed the Unique Items Hypothesis. In their translations, in fact, students tended to overlook unique items and opt for straightforward lexical or dictionary equivalents, even when target-language-specific items were part of their lexical repertoire, as revealed by the results of the cloze test (Kujamäki, 2004).

Simplification and explicitation

In another experimental investigation, the universals of simplification and explicitation were tested as possible indicators of translation quality.
Specialized English-Italian translations carried out by advanced trainee translators were first compared with the English source texts as regards overall length, number of sentences, average sentence length, standardized type/token ratio and lexical density (calculated as the ratio of grammar to content words). The results of this initial analysis were then compared with the assessment grades given by the evaluators. In the last stage, the translated texts were compared with comparable originals drawn from the Italian reference corpus CORIS (Corpus dell’Italiano Scritto). The Italian translations were generally found to be longer and to have fewer and longer sentences than the English originals; their standardized type/token ratio was higher and their lexical density lower. Simplification and explicitation were largely confirmed by comparisons made with source texts and comparable originals. The analysis of the relation between universals and translation quality assessment showed that higher-scoring translations had a higher level of syntactic explicitness and a lower level of lexical simplification compared with lower-scoring translations (Scarpa, 2006). If these findings were corroborated by evidence obtained from other language pairs, they could have important pedagogical implications. Simplification and explicitation could, in fact, be considered not only as processes inherent in the act of translating, but also as criteria for assessing the quality of the translation product.

Studying Anglicisms in the translation classroom

Another example of how descriptive insights and pedagogical concerns have been harmonized in translator training can be seen in a recent study of Anglicisms. The investigation was performed in collaboration with the students attending a 60-credit postgraduate course in specialized translation from English to Italian.
As part of the module devoted to the language of business and economics, a teaching unit was designed to address the problems posed by polysemic lexical Anglicisms. The learning objectives were a) to become familiar with corpora as one of the computer-aided translation tools and resources available to the professional translator; b) to discover the initial and textual-linguistic norms underlying the translation of polysemic lexical Anglicisms; c) to identify the Italian native equivalents of well-established English loan words.

Corpus data

The sources of data consisted of: 1) a monolingual comparable corpus comprising 71 translated and non-translated articles published in the Italian
weekly magazine Economy; 2) a parallel corpus of 71 English articles from The Economist and their translations in Economy. In the first phase, the lexical Anglicisms contained in the translational sub-corpus were identified. An Anglicism is understood here as ‘a word or idiom that is recognizably English in its form (spelling, pronunciation, morphology or at least one of the three), but is accepted as an item of the vocabulary of the receptor language’ (Görlach, 2003: 1). The most frequent Anglicism was business, a well-established English loan word, having been introduced in the Italian lexicon in 1895.

Results

Out of nearly 60,000 running words, 37 occurrences were retrieved. The analysis of the KWIC concordance lines revealed five meanings of business:

I. the work of producing or buying and selling goods or services for money.
II. high-profile area of business where more than one company operates.
III. a. a highly profitable business activity undertaken by a company; b. investment, deal or transaction made by a company.
IV. a large organization that provides services or that makes or sells goods.
V. volume of business.

Interestingly, students were able to identify a wider range of meanings, compared with the descriptions provided by Italian dictionaries, which focus on meanings II, IIIb and IV. Moreover, for each of the above senses, the collocations, colligations and semantic associations were identified as follows.

I. Business occurs with words that refer to other human activities (turismo e business), the geographical place where business is carried out and the people of different nationalities that are in business; it forms compounds (business hub, area business, segmento business) and one multi-word-unit (aree consumer e business).
II. Business occurs with nouns that identify a particular business sector and the position gained in the market, nouns referring to the major players that operate in or impact on it, adjectives describing its qualities such as diversity, profitability or importance.
III. Business occurs with words that refer to the company undertaking a particular business activity and to the type of activity undertaken,
verbs referring to the changes undergone by a business, adjectives and nouns describing its main features such as novelty, solidity or volatility; it forms one compound (core business).
IV. Business occurs with words referring to the people owning or running a company or to the way in which a company organizes its activities; it forms compounds (business model, business manager, business partner) and one multi-word-unit (modello di business).
V. Business occurs with words that refer to the monetary value (or turnover) of a company or business sector: Al 73enne Ecclestone è rimasto il 25% del gruppo che gestisce il business da 800 milioni di dollari. ‘73-year-old Ecclestone still owns 25% of the group that runs the 800 million dollar business’. Per contrastare il cambiamento del clima anche gli ambientalisti riscoprono il nucleare. Un business da 125 miliardi di dollari. ‘To counteract climate change, even environmentalists are rediscovering nuclear power. A 125 billion dollar business’.

Next, business was examined in the comparable sub-corpus of non-translated articles. The number of occurrences was nearly double, 74 against 34. This suggests a tendency towards normalization (or conservatism) in translation vis-à-vis the original language. The meanings of business were the same, except for II and V, which also referred to illegal business activities. This finding can be explained by the influence of the source texts, which do not include examples of business with a negative connotation. The analysis of the KWIC concordance lines revealed similarities and differences (the latter are indicated in single quotation marks):

I. Business occurs with words that refer to other spheres of human activity (business & società, sport e business, musica & business, business & genetica), the geographical place where business is carried out and the people of different nationalities that are in business; it forms compounds (business information, clienti business, utenti business) and one multi-word-unit (aree di business). ‘Business is used in creative collocations’ (Il business resta in porto ‘business stays in the port’. Titanic del business ‘the Titanic of business’. Il business non è l’unico quadrante su cui far girare le lancette della vita ‘business isn’t the only dial on which the hands of life turn’).
II. Business occurs with nouns that identify a particular sector and the position gained in the market, nouns referring to the major players that operate in or impact on it, adjectives describing its qualities such as diversity, profitability or importance. ‘Business forms compounds’ (business travel, social business, business online). ‘Business refers to illegal sectors’ (il business dei falsi ‘the business of counterfeits’; i business si chiamano droga, prostituzione, racket ‘drugs, prostitution and rackets are known as businesses’). ‘Business is used in creative collocations’ (un business che si chiama sconto ‘a business that is called discount’; un business duro come il teak ‘a business as hard as teak’).
III. Business occurs with words that refer to the company undertaking a particular business activity and to the type of activity undertaken, verbs referring to the changes undergone by a business activity, adjectives and nouns describing its main features such as novelty. Business is described in terms of its importance, competitiveness, credibility or profitability. ‘Business forms various compounds’ (core business, business continuity, business case, business plan), ‘it strongly collocates with’ possibilità, opportunità, occasioni, fare ‘do’; ‘it is used in creative collocations and puns’ (il business in una cannuccia ‘business in a straw’; ora faccio business col cuore ‘now I’m doing business with my heart’; un Tornado di business ‘a business tornado’; ho più di un business per capello ‘I’ve got more than one business in my hair’).
IV. Business occurs with words referring to the people owning or running a company or to the way in which a company organizes its activities; it forms multi-word-units (business unit, unità di business, business development manager). ‘Business is used in creative collocations and puns’ (un business fatto di nuvole ‘a business made of clouds’; l’Enav e quel business che è caduto dal cielo ‘Enav and the business that fell from the sky’; il business lievita alla luce del sole ‘business rises in the light of the sun’).
V. Business occurs with words that refer to the monetary value (or turnover) of a company or business sector; ‘it is used in creative collocations’ (Quel business da 2,5 milioni di sacchi di caffè ‘that 2.5 million sacks of coffee business’); ‘it refers to illegal activities’ (Il business [dei falsi] vale almeno 7 miliardi di euro all’anno ‘the business [of counterfeits] is worth at least 7 billion euros a year’; Cibo Nostro. La Mafia nell’alimentare. Quasi 20 miliardi di incassi per la criminalità organizzata: tanto vale oggi il business mafioso nell’agroalimentare, nelle sue varie declinazioni ‘Our Food. The Mafia in the food sector. Almost 20 billion euros worth of takings for organized crime: that’s how much the Mafia business is worth in the agriculture and food sectors in its various ramifications’).
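
The collocational profiles summarized above were obtained by reading concordance lines. A first, rough approximation of that step can be sketched by counting the word forms that occur within a small window around the node word. In the Python sketch below, the span of two words, the tokenizer and the function names are assumptions; the three lines are taken from the Italian examples quoted in this section. Colligations and semantic associations are not captured by such counts and would still require manual analysis.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word forms (hypothetical tokenizer)."""
    return re.findall(r"[^\W_]+(?:[-'’][^\W_]+)*", text.lower())


def collocates(lines, node, span=2):
    """Count word forms found within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i, token in enumerate(tokens):
            if token == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts.update(window)
    return counts


# Concordance lines quoted in the text above.
lines = [
    "il gruppo che gestisce il business da 800 milioni di dollari",
    "Un business da 125 miliardi di dollari",
    "Il business resta in porto",
]
profile = collocates(lines, "business", span=2)
print(profile.most_common(5))
```

Even on three lines, the preposition da recurs immediately to the right of business, in keeping with the association with monetary value described under meaning V.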

These results show, first of all, that the collocations, semantic associations and colligations of one sense of business differ from those of its other senses. Secondly, these patterns reveal similarities across translational and non-translational Italian. Thirdly, the translations seem to have resisted the influence of English to a degree by limiting the use of business. The next phase involved mapping the Italian target texts onto the English source texts to identify the native Italian equivalents of business for each of its five meanings:

I. il mondo degli affari, gli affari, affari, l’attività, attività commerciali
II. il settore, l’industria, le industrie
III. un’attività commerciale, l’attività, le attività delle aziende
IV. un’azienda
V. generating the business → cedendo i prestiti ‘relinquishing loans’

The initial norm that can be inferred from these findings is a tendency towards acceptability in the target language, as evidenced by the absence of business in creative collocations or puns in translational Italian. The textual-linguistic norm consists of a preference for native Italian equivalents. The final phase of the analysis involved the examination of the lexicogrammatical patterns of business across the donor and the receptor language. Unlike in English, business in Italian can have pejorative overtones when it conveys meanings II and V. Moreover, while in English business refers to a small or medium enterprise, in Italian it is used to refer to a large company (meaning IV). In Italian, business is sometimes used in creative expressions, which usually appear in the article title or subtitle, thus acting as an attention-getting device. Summing up, the examples of applied research surveyed in this section show how the insights of descriptive corpus studies of translation have been fruitfully projected onto various teaching contexts, where they have been further explored in a student-centred pedagogical environment. It is reasonable to affirm that their aim has been ‘to set norms in a more or less conscious way’ without intending ‘to account either for possibilities and likelihoods or for facts of actual behaviour’, as envisioned by Toury (1995: 19) in his conceptual map of translation studies.

Future directions

A consistent trend displayed by corpus studies at the turn of the century is their growing interdisciplinarity, as testified by the themes addressed by the
biennial international conference series, Using Corpora in Contrastive and Translation Studies (UCCTS). The first conference was held at Zhejiang University in Hangzhou, China, on 25–27 September 2008 (Xiao, 2010a). The second, jointly organized by Edge Hill University, the University of Bologna and Beijing Foreign Studies University, took place at Edge Hill University in Ormskirk, UK, on 27–29 July 2010 (Xiao, 2010b). This international initiative has brought to light four main concerns of corpus research: the nature of translation and interpreting; contrastive language studies; lexicography and terminology and computer-aided translation technology. It can safely be affirmed that corpora have contributed significantly to pushing the study of translation towards empiricism, interdisciplinarity and multilinguality. Thanks to the advent of corpora, strong links have been forged between translation studies and multilingual corpus linguistics. This effective partnership will continue to enable us ‘to learn not only about individual languages and their relationships, about translation and foreign-language acquisition, but also about language in general’ (Johansson, 2007: 316). The search for translation universals has received a great boost, thanks to the availability of corpora in many different languages. This area of inquiry will benefit from seeing the full potential inherent in the specificity of translation as a bilingual processing situation and an important form of interlingual and intercultural communication that can contribute significantly to our understanding of other forms of language contact, such as the borrowing of lexical or structural patterns across languages, ELF and foreign language learning (Mauranen, 2008). 
Also, translator training programmes are increasingly responding to the exigencies of today’s technologized language industry by incorporating a range of tools in their curricular design, of which corpora are an essential component (Koby and Baer, 2003). Postgraduate programmes that have joined the EMT network (European Master in Translation) are a case in point. They are required to provide training in the effective use of electronic corpora, translation memories, terminology databases, voice recognition software, machine-translation systems and the internet so that students can acquire information mining skills and technological competences (http://www.ec.europa.eu/emt). It is therefore envisioned that the cooperation between corpus linguistics, translation studies and computer science will grow even stronger in the years to come.

References

Anderman, G. and Rogers, M. (eds) (2005). In and Out of English: For Better, for Worse? Clevedon: Multilingual Matters.
Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In Baker, M., Francis, G., and Tognini-Bonelli, E. (eds), Text and Technology: In Honour of John Sinclair. Amsterdam and Philadelphia, PA: John Benjamins, pp. 233–50.
Baker, M. (1996). Corpus-based translation studies: The challenges that lie ahead. In Somers, H. (ed.), LSP and Translation: Studies in Language Engineering in Honour of Juan C. Sager. Amsterdam: John Benjamins, pp. 175–86.
Baker, M. (2000). Towards a methodology for investigating the style of a literary translator. Target, 12(2): 241–66.
Blum-Kulka, S. (1986). Shifts of cohesion and coherence in translation. In House, J. and Blum-Kulka, S. (eds), Inter-lingual and Inter-cultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies. Tübingen: Gunter Narr, pp. 17–35.
Blum-Kulka, S. and Levenston, E. A. (1983). Universals of lexical simplification. In Faerch, C. and Kasper, G. (eds), Strategies in Interlanguage Communication. London: Longman, pp. 119–39.
Bowker, L. (1998). Using specialized monolingual native-language corpora as a translation resource: A pilot study. Meta, 43(4): 631–51 (Retrieved 23 February 2011 from http://www.erudit.org/revue/meta/1998/v43/n4/).
Bowker, L. (2001). Towards a methodology for a corpus-based approach to translation evaluation. Meta, 46(2): 345–64 (Retrieved 23 February 2011 from http://www.erudit.org/revue/meta/2001/v46/n2/).
Gavioli, L. and Zanettin, F. (2000). I corpora bilingui nell’apprendimento della traduzione: Riflessioni su un’esperienza pedagogica. In Bernardini, S. and Zanettin, F. (eds), I Corpora Nella Didattica Della Traduzione. Corpus Use and Learning to Translate. Bologna: CLUEB, pp. 61–80.
Görlach, M. (2003). English Words Abroad. Amsterdam and Philadelphia, PA: John Benjamins.
Hermans, T. (1999). Translation in Systems: Descriptive and System-Oriented Approaches Explained. Manchester: St. Jerome.
Johansson, S. (2007). Seeing Through Multilingual Corpora. Amsterdam and Philadelphia, PA: John Benjamins.
Kenny, D. (1999). Norms and creativity: Lexis in translated text. Unpublished PhD thesis, Centre for Translation and Intercultural Studies (CTIS), University of Manchester.
Koby, G. S. and Baer, B. J. (2003). Task-based instruction and the new technology. Training translators for the modern language industry. In Baer, B. J. and Koby, G. S. (eds), Beyond the Ivory Tower: Rethinking Translation Pedagogy. Amsterdam and Philadelphia, PA: John Benjamins, pp. 211–27.
Kujamäki, P. (2004). What happens to ‘unique items’ in learners’ translations? ‘Theories’ and ‘concepts’ as a challenge for novices’ views on ‘good translation’. In Mauranen, A. and Kujamäki, P. (eds), Translation Universals. Do They Exist? Amsterdam and Philadelphia, PA: John Benjamins, pp. 187–204.
Laviosa-Braithwaite, S. (1996). The English Comparable Corpus (ECC): A resource and a methodology for the empirical study of translation. Unpublished PhD thesis, Centre for Translation and Intercultural Studies (CTIS), University of Manchester.
Mauranen, A. (2008). Universal tendencies in translation. In Anderman, G. and Rogers, M. (eds), Incorporating Corpora: The Linguist and the Translator. Clevedon: Multilingual Matters, pp. 32–48.
Mauranen, A. and Kujamäki, P. (eds) (2004). Translation Universals. Do They Exist? Amsterdam and Philadelphia, PA: John Benjamins.
Munday, J. (1997). Systems in translation: A computer-assisted systemic approach to the analysis of the translations of García Márquez. Unpublished PhD thesis, Department of Modern Languages, University of Bradford.
Munday, J. (2002). Systems in translation: A systematic model for descriptive translation studies. In Hermans, T. (ed.), Crosscultural Transgressions. Research Models in Translation II: Historical and Ideological Issues. Manchester: St. Jerome, pp. 76–92.
Øverås, L. (1998). In search of the third code: An investigation of norms in literary translation. Meta, 43(4): 571–88 (Retrieved 23 February 2011 from http://www.erudit.org/revue/meta/1998/v43/n4/).
Scarpa, F. (2006). Corpus-based specialist-translation quality assessment: A study using parallel and comparable corpora in English and Italian. In Gotti, M. and Šarčević, S. (eds), Insights into Specialised Translation. Bern: Peter Lang, pp. 155–72.
Stewart, D. (2000). Conventionality, creativity, and translated text: The implications of electronic corpora in translation. In Olohan, M. (ed.), Intercultural Faultlines. Research Models in Translation Studies 1: Textual and Cognitive Aspects. Manchester: St. Jerome, pp. 73–91.
Toury, G. (1995). Descriptive Translation Studies and Beyond. Amsterdam and Philadelphia, PA: John Benjamins.
Xiao, R. (ed.) (2010a). Using Corpora in Contrastive and Translation Studies. Newcastle upon Tyne: Cambridge Scholars Publishing.
Xiao, R. (ed.) (2010b). Using Corpora in Contrastive and Translation Studies 2010 Conference (UCCTS2010), Edge Hill University, Ormskirk, 27–29 July. (Retrieved 23 February 2011 from http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings/).

Chapter 6

Some Aspects of the Use of Corpora in Forensic Linguistics
John Olsson

It is arguable that corpus linguistics has had some of its greatest successes in the field of forensic linguistics. By being able to calculate relative occurrences of words and phrases in a wide variety of text types and corpora, the forensic linguist is often able, in a court of law, to give an opinion on whether an individual’s use of language is rare or not. In a typical forensic scenario, the analyst has access to a handful of texts of questioned authorship and several texts of one or more candidates. These are divided into their sub-corpora in order to undertake the relevant searches. In this chapter, the aim is to guide the reader towards a general view of the kinds of applications to which resources can be put, rather than to offer specific methodological or statistical solutions. Building on the pioneering work of the early 1990s, the chapter brings readers up to date with recent developments in the field. In addition to the use of corpora in forensic authorship inquiries, the chapter will also stray briefly into the intersection of corpus linguistics and legal language. The use of corpora in forensic linguistics has seen considerable development from the pioneering work of Coulthard with regard to the Bentley case, in which an abnormally high use of ‘then’ was found by comparing a short 570-word police statement with a large language corpus (Coulthard, 1994). The same corpus was used to note the peculiar position of ‘then’ (e.g. ‘I then’ instead of ‘then I ...’). As a result of that study, linguists have become ever more aware of the significance and influence of specialist or institutional registers. In addition, the work by Coulthard showed the presence of multiple registers within one text, a point which does not seem to have been taken up by corpus linguists generally, although scholars of poetry have long found the topic of interest (see Brinton, 2009: 31) in the context of the innate plasticity of language.
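
The kind of comparison described for the Bentley case rests on two simple measures: the relative frequency of a word and the frequency of the word sequences in which it occurs. A minimal Python sketch of both follows. The sample sentence is invented for illustration and is not taken from the statement in the case; the tokenizer and the function names are likewise assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word forms (hypothetical tokenizer)."""
    return re.findall(r"[^\W_]+(?:[-'’][^\W_]+)*", text.lower())


def per_thousand(tokens, word):
    """Relative frequency of `word`, expressed per 1,000 running words."""
    return 1000 * tokens.count(word) / len(tokens)


def ngram_count(tokens, ngram):
    """Number of times the word sequence `ngram` occurs in `tokens`."""
    n = len(ngram)
    return sum(
        1 for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == tuple(ngram)
    )


# Invented sample text, NOT the statement from the case.
statement = "I then ran out. I then saw him and then I left."
tokens = tokenize(statement)

print(per_thousand(tokens, "then"))
print(ngram_count(tokens, ("i", "then")), ngram_count(tokens, ("then", "i")))
```

In practice, the same figures would be computed for a reference corpus as well, since normalizing per 1,000 words is what makes a short questioned text comparable with a much larger body of language.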



Strangely, the concerns of poetry scholars are not that far removed from the work of forensic linguists when working on authorship problems. While the popular expectation is that all authors express themselves in a unique ‘style’ and that the key characteristic of such a style is its consistency, in the real world of forensic inquiries, it is the flexible, malleable way in which authors use language that often comes as a surprise. Thus, courts sometimes struggle with the idea that authors routinely present a number of different text types, or texts of different genres, and hence that a belief in stylistic consistency within an individual’s oeuvre cannot always be sustained. Multiple styles can even occur within one text, a phenomenon not always appreciated by corpus linguists more familiar with the predictable lexical and semantic structures of academic texts, although the idea is very familiar to teachers of English for Special Purposes (ESP). To illustrate how this phenomenon can be handled in forensic work, in which a corpus approach is desirable, I will begin by giving an example from a recent case, although any one of a number of cases could have been chosen for this purpose. As will be seen, rare and formal words and expressions occur side by side with expressions at the colloquial end of the register.

Case Background

In this case, the task was to discover the probable identity of an individual or individuals who had written a number of letters to an actress. The letters claimed that the actress was being spied upon by the authorities because of her political activities. At the same time, it happened that the police were conducting another criminal investigation involving the actress. Although these allegations of political spying might seem far-fetched in a modern democracy, it is of course the style of language of the letters, and not their supposed political content, which is of importance to the analyst.

Sampling Issues

For reasons unknown to this author, police officers considered that the actress might have written the letters herself in order to distract them from the main investigation, although another individual had also come under suspicion. The texts in the case presented a particular sampling challenge for two reasons. First, the anonymous texts were mostly emails, but there were several letters among them. Hence, we have two text types. In an ideal situation, all of the questioned and known texts would be of the same type. Second, the known texts are also of different types, consisting of emails, letters and oral discourse from a court.

The court discourse is in transcribed form, and we cannot always be sure that the stenographer is completely accurate. While such features as speaker pauses and fillers are not particularly relevant when comparing written with spoken language, the concern was that key lexemes that occurred in both the spoken and written data might have been mistranscribed. However, training in court stenography nowadays includes an awareness of linguistic factors and an increasing number of stenographers are in fact linguistics graduates (see Chapman et al., 2001).

It also has to be said that an entire court transcript – the present one is over 500,000 words in length (of which the suspect’s contribution was approximately 40,000 words) – cannot really be conceived of as a single ‘text’, but rather as a series of contributions to a particular form of spoken discourse in an institutional setting. Further, many of these contributions are not spontaneous: they are often elicited, either by questioning or by the context of the court. This can affect the validity of a speaker’s response, in that we cannot be sure that the words selected to answer a question in examination or cross-examination necessarily reflect that individual’s preferred choice of language. The existence of the court transcript meant that, in addition to the problems of differing text types and of the accuracy of the oral transcript, the known material being tested comprised a much larger body of language than the anonymous texts.

Was There More Than One Anonymous Author?

Since the letters purported to be written by different authors, all of them apparently aware of the actress’s predicament, it was important to establish whether the anonymous texts were in fact of different authorships. It can happen in such an investigation that there is more than one anonymous author. In this connection, there were interesting examples of cohesion across the texts. For example, in one of the earliest texts to be received, the author wrote:




Anonymous mail excerpt 1
Your every move is being monitored.

In a later text the following was observed.

Anonymous mail excerpt 2
Your every move is still being monitored.

The frequency of the first phrase above was measured on an internet search engine and 152 instances of it were found (see Note 1). There were no instances of the second phrase. There were no instances of either phrase in the British National Corpus (BNC). In the 450 million-word Bank of English (BoE) corpus at Birmingham, there were three instances of ‘move’, ‘still’ and ‘monitored’ occurring together. It seemed that the presence of this phrase could indicate commonality of genre, and possibly authorship, across the two anonymous texts. The main environments of the phrase appear to relate to privacy issues and scientific matters.

The presence of the word ‘still’ in the second excerpt was noted, indicating a possible instance of textual cohesion (see Halliday and Hasan, 1976: 242). On checking the second text, it seemed that this could be an instance of internal cohesion (within the same text), or an instance of external cohesion (relative to the previous text). In other words, ‘still’ could refer to an earlier part within the text (where the writer refers to ‘devices’ that allegedly ‘bugged’ the actress’s car) or it could refer to the previous text where the phrase ‘your every move is being monitored’ occurs, with ‘still’ implying a continuation of that monitoring. The latter would suggest the likelihood of an awareness of the first text by the writer of the second text, especially given the rarity of the phrase. Even if the former applies, the differences between the wordings are minor and, given the relative rarity of the phrase, the similarity between the two instances seems to be significant. Earlier research suggests that identical strings of six or more words are correspondingly less likely than shorter strings to occur independently of each other (Olsson, 2004: 112).
Coulthard and Johnson made a similar observation some years later in relation to another case (2008: 198). The target phrase ‘your every move is [still] being monitored’ is extremely rare.

There were also references in several of the texts to a vehicle being tampered with or bugged. Each reference begins with ‘Believe me your car . . .’ and is followed by references to some kind of monitoring or ‘tracing’. It is thought that what was meant was ‘tracking’, since to ‘trace’ usually refers to a missing person (see below for expansion of this point). In the context of a device designed to keep a vehicle under constant surveillance by remote means, the word ‘tracking’ is much more usual, and therefore likely, than ‘tracing’ – which was found across both known and questioned texts. In both known and questioned samples, the phrase ‘Believe me your car’ was used in conjunction with the word ‘trace’, which occurred several times as either a collocate of ‘monitor’ or in the same sentence as ‘monitor’:

Known mail excerpt 1
I am under constant monitoring and my calls and movements are being recorded or traced

Anonymous mail excerpt 3
Your new phone is also under constant monitoring. You are being traced via your car

When this pattern was searched for on the internet, as the combination of ‘under constant monitoring’ and ‘traced’, only 181 instances of it were found (see Note 2). This combination, therefore, found across both known and questioned texts, is exceptionally rare. There are further comments on ‘under constant monitoring’, and other phrases, later in this chapter.

As indicated above, there is also the possibility that in each case the author means ‘tracked’ instead of ‘traced’. In the first example, the word ‘traced’ appears to apply to ‘movements’ while ‘recorded’ appears to apply to ‘calls’. However, it seems possible that ‘traced’ in relation to ‘movements’ would only apply in the case of a missing person. In the BoE, there are only 15 instances of ‘car’ co-occurring with ‘traced’, and these refer to a vehicle being traced to its owner. Again, this would suggest that ‘tracked’ rather than ‘traced’ was intended, both in the known and the anonymous texts. Hence, it seems much more likely that the actress’s movements are, allegedly, being ‘tracked’, that is, she is being followed (either literally or by use of technology).
Similarly, in the second example, cars are followed (via satellite positioning technologies), and since the activity is given in the present tense (‘are being traced’), it is more likely that this is a tracking procedure rather than a tracing procedure.

In order to check that this intuition was correct, further analysis of the occurrence of ‘trace’ and ‘track’ in the BoE was undertaken. This revealed that ‘trace’ collocates with missing or unknown people, events and things: for example, lost relatives, family history, witnesses, criminals, etc. Also, something (some result or event) is usually ‘traced back’ to a previously unknown source. Something or someone can also sink or disappear ‘without trace’. However, when collocating with ‘movements’, ‘trace’ almost always refers to the last or final movements of a person before their death or disappearance: thus, again, the semantic prosody concerns the uncertain or unknown past of that person, rather than their physical location. By contrast, ‘track’ is often synonymous with ‘monitor’. The most common collocation is ‘to track down’. ‘Track’ collocating with types of movement, unlike ‘trace’, does imply the importance of physical whereabouts.

For these reasons, it seems that we have a common lacuna across both the known and questioned texts, involving an item that turns out to be very rare. While it is possible that two authors can, independently of each other, present a shared lacuna of this nature, its presence does add weight to the possibility of common authorship.
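
The proximity issue raised by the search for ‘under constant monitoring’ together with ‘traced’ (where the two items may lie far apart in a document) can be illustrated with a short Python sketch. This is not the procedure used in the case: the function names, the simple tokenizer and the window size are assumptions made purely for illustration, and the only data are the two excerpts quoted above.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cooccurrences(text, phrase, target, window=10):
    """Return start positions of `phrase` that have `target` within `window` tokens.

    `phrase` may contain several words; `target` is a single word form.
    The window is counted in tokens on either side of the phrase.
    """
    tokens = tokenize(text)
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            if target in left or target in right:
                hits.append(i)
    return hits


known = ("I am under constant monitoring and my calls and movements "
         "are being recorded or traced")
questioned = ("Your new phone is also under constant monitoring. "
              "You are being traced via your car")

for label, text in (("known", known), ("questioned", questioned)):
    print(label, cooccurrences(text, "under constant monitoring", "traced"))
```

Narrowing the `window` argument is one way of excluding hits in which the phrase and the word are too far apart to be meaningfully related, which is the concern expressed about the raw internet count.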

Author Profiling by Corpus

An interesting and potentially important question for forensic linguists is how an author can be socially profiled. To what extent can we tell something about an author’s gender, age group, education, social class and occupation from their writings? In the investigation under discussion, prior to working as an actress, the candidate author had worked for some years as a journalist with an interest in political reporting. In this connection, a word of some interest in both the known and questioned texts was the word ‘garnishee’. Readers may be aware that it means to seize a person’s bank assets. It is used in reference to an alleged friend of the actress, and occurs in the phrase ‘garnisheeing of funds’. Most of the top internet search entries for this word relate to financial matters and refer to a process by which creditors retrieve money from debtors. There were fewer than 7,000 entries for the word ‘garnisheeing’ on the internet search engine and this most likely puts it in the rarest 1 per cent of words in usage. Further evidence of its rarity is that it occurs only three times in the BoE.

The word was found several times in the actress’s earlier writings as a journalist, and it was evident that she had been familiar with terms such as ‘bank garnisheeing’, ‘garnisheeing of funds’, etc. Although it is difficult to assess the importance of this word occurring in both the known and questioned texts, I suggest that even among people who have suffered garnisheeing, given the word’s rarity, it is possible that relatively few would be aware of the word, and even fewer would actually use it. It seems possible that the use of this word may indicate some important shared life experiences on the part of the anonymous writer and the actress.
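
Raw counts such as these are easier to compare across corpora of different sizes when they are normalized, conventionally to occurrences per million words. The following Python sketch shows the arithmetic using only figures quoted in this chapter (three instances of the word, and 15 instances of ‘car’ with ‘traced’, in the 450 million-word BoE); the function name is an assumption for illustration and the normalization itself is not something the case report states was carried out.

```python
def per_million(raw_count, corpus_size_words):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size_words


BOE_WORDS = 450_000_000  # size of the Bank of English as quoted in this chapter

# Raw BoE counts quoted in the text.
boe_counts = {
    "garnisheeing": 3,
    "car + traced": 15,
}

for item, count in boe_counts.items():
    print(f"{item}: {per_million(count, BOE_WORDS):.4f} per million words")
```

Normalized figures of this kind should still be read with caution when the raw counts are very small, since a handful of additional instances would change them substantially.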

Known and Questioned Registers

Close examination of the inquiry texts appeared to suggest that the actress and the questioned writer use a similar linguistic register and that this register is marked in particular ways. It can be described as a mixed register, combining excessive formality with the use of colloquial expressions. At the formal end of the register, we find several abstract material process verbs (see Halliday, 1994: 110–111) used in ways that are to some extent marked.

Thus, across the known texts we see the word ‘secure’ being used as a verb. The first thing to note about the word ‘secure’ is that in everyday usage it is more common as an adjective than as a verb. Secondly, there are two relevant meanings of ‘secure’ when used as a verb: to make a place, object or situation secure and, on the other hand, to obtain. The Cobuild dictionary cites ‘secure’ in the sense of ‘obtain’ as a formal use (Sinclair, 1995: 1498). In the known texts, we have a number of uses of ‘secure’ in this sense. It occurs in relation to ‘documents’, ‘affidavits’, an ‘account’ (in the sense of ‘narrative’ or ‘story’), a ‘purchase’, etc. We find the same formal use in the questioned texts, where ‘secure’ is applied to ‘stories’, ‘statements’ and ‘payments’. Note that ‘affidavit’ (known) is a synonym for ‘statement’ (questioned) and ‘account’ (known) is a synonym for ‘story’ (questioned). In both questioned and known texts ‘secure’ is usually being used instead of the more common ‘obtain’ or the even more common ‘get’. Thus, ‘secure’ (in the sense of ‘obtain’), by virtue of its relative rarity when compared to ‘get’, ‘acquire’, etc., is a marked form.

We also see the verb ‘conduct’ used across a variety of processes in the known texts, including ‘war’, ‘an affair’, ‘an inquiry’, ‘a case’ and ‘an investigation’, while the questioned texts have ‘conduct’ in terms of ‘affair’. Of these, only ‘conduct’ as applied to ‘affair’ is unusual, but, as with the previous example, it is a marked form when compared to ‘have’ (i.e. to ‘have an affair’) and it is common to both known and questioned texts. In the BoE, to ‘have’ an affair is 12 times more common than to ‘conduct’ one.




We also see the word ‘deploy’: in the known texts it is applied to ‘strategy’ (‘they have deployed the “as much muck as possible” strategy’) and in the questioned texts to ‘dirty tricks’ (‘they are deploying dirty tricks against you’). Again, ‘deployed’ is a marked form when compared to alternatives such as ‘use’ (‘use a strategy’) or ‘apply’ (‘apply a strategy’). It is also possible that ‘deploy’, with respect to both ‘strategy’ and ‘dirty tricks’, is a malapropism for ‘employ’. This would suggest the possibility of a common malapropism across both known and questioned texts. Internet searches reveal that ‘employ’ (in different forms, i.e. ‘employ’, ‘employed’, ‘employing’) is much more common than ‘deploy’ when related to either ‘strategy’ or ‘dirty tricks’. This is also evidenced in the BoE, where we find only one instance of ‘deploy’ with ‘strategy’, whereas in the same corpus ‘employ’ with ‘strategy’ occurs 11 times and ‘use’ with ‘strategy’ occurs 153 times. If this is a common malapropism, as seems to be the case, then it appears to strengthen the argument for common authorship across the known and questioned texts.

In the questioned text, we see the word ‘levelled’ used in terms of ‘accusation’, while in the known text the same verb is used in terms of ‘allegation’. In this context, ‘levelled’ is a marked form when compared with the much more frequent verb ‘make’, i.e. to ‘make’ an allegation or to ‘make’ an accusation. Again, we find support for these observations in the BoE, where ‘level’ with ‘accusation(s)’ occurs 22 times, as opposed to ‘make’ with ‘accusation(s)’, which occurs 270 times, while ‘level’ with ‘allegation(s)’ occurs only three times, with ‘make’ with ‘allegation(s)’ occurring 494 times.

Aside from these examples of formality, we also find common expressions of a highly colloquial nature across the known and questioned texts, such as to ‘do somebody in’ and ‘to get somebody’, with very similar meanings in both. Examples of informality in the anonymous texts include ‘you are to be fried’, ‘numpties’, ‘earn their poppy’, ‘mate’, ‘info’, etc., while in the known texts we have such colloquial expressions as ‘wouldn’t lose too much sleep over that’, ‘info’, ‘scum bags’, ‘scum of the earth’, etc. Common to some of the known and questioned texts is a high level of passives and nominalizations, which are also indicative of similarity in register.

The Occurrence of Identical Expressions

In addition to a shared marked lexicon, there are also several identical phrases across the known and questioned texts, which I suggest may be indicative of common authorship. I searched for three phrases, using both the internet search engine Google and the BNC of 100 million words (Table 6.1). Although some of these have been discussed earlier, the aim here is to test the relative rarity of identical expressions across several corpora.

Table 6.1  Identical words and expressions across known and questioned texts

| Known | Questioned | Google frequency | BNC | Notes |
| --- | --- | --- | --- | --- |
| under constant monitoring (Known Text 7) | under constant monitoring (Anonymous Text 8) | 9,503 (‘constantly monitoring’: 165,000; ‘being constantly monitored’: 17,800) | No instances | Much rarer than ‘constantly monitoring’ and rarer than ‘being constantly monitored’ (2/3) |
| conducting an affair | conducting an affair | 6,180 (‘having an affair’: 312,000) | No instances | Approximately 200 times rarer than ‘having an affair’ |
| character assassinations | character assassinations | 42,300 for the plural form (singular form: 530,000) | No instances | About 1/10 the frequency of ‘character assassination’ (i.e. singular noun) |

Of four possible uses, namely ‘under constant monitoring’, ‘constantly monitoring’, ‘being constantly monitored’ and ‘monitoring constantly’, the version found in both the known and questioned texts comprises just over 5% of these populations. We also note the co-occurrence of four lexemes across known and questioned texts in these strings: (Known) ‘I am under constant monitoring and that my calls and movements are being recorded or traced’ and (Questioned) ‘your movements are being monitored and phone calls recorded’. Of the two versions ‘conducting an affair’ and ‘having an affair’, the version found in both the known and questioned texts comprises just under 2%. Similarly, the plural form ‘character assassinations’ accounts for just 7% of the two options.

Note that the 100 million-word corpus held at Oxford University, the BNC, gives no instances of any of the above phrases. It thus seems that these are rare expressions that are highly marked, and I submit that their collective rarity is suggestive of common authorship across the known and questioned texts. The BoE is a much larger corpus than the BNC and, perhaps unsurprisingly, instances of ‘character assassinations’ did occur in the BoE. However, the plural form occurs less than 1/12 as often as the singular form, ‘character assassination’.
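
The percentages quoted in this section can be checked with a few lines of Python. The sketch below takes the Google frequencies reported in Table 6.1 and computes the share of each target phrase among the variants for which a count is given; the function and variable names are assumptions for illustration. Note that no count is reported for the fourth variant ‘monitoring constantly’, so the first figure is computed over three variants only and need not match the quoted ‘just over 5%’ exactly.

```python
def share(target_count, alternative_counts):
    """Proportion of all counted variants accounted for by the target phrase."""
    total = target_count + sum(alternative_counts)
    return target_count / total


# Google frequencies as reported in Table 6.1: (target count, counts of alternatives).
table_6_1 = {
    "under constant monitoring": (9_503, [165_000, 17_800]),
    "conducting an affair": (6_180, [312_000]),
    "character assassinations": (42_300, [530_000]),
}

for phrase, (target, alternatives) in table_6_1.items():
    print(f"{phrase}: {share(target, alternatives):.1%} of counted variants")
```

As with any figures obtained from an internet search engine, the inputs are approximate, so the resulting shares are best treated as orders of magnitude rather than precise measurements.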




Authorship Conclusion Based on Corpus Study

From the corpus studies mentioned above, it seemed inescapable that there was commonality of authorship across the anonymous letters and the actress’s known letters. A number of rare phrases occurred across both subcorpora and, critically, at least two lexical anomalies, both very rare, were shared. In the above tests, several corpora were used in order to ‘triangulate’ results. While some have disputed that an internet search engine can be accurately described as a corpus, Coulthard and Johnson suggest that there are virtues in using a readily accessible utility such as an internet search engine (in their case, Google): ‘Google was used, rather than a corpus such as the BoE or the BNC, on the grounds that it was accessible to the layperson, for whom the argument was designed – they could go home and test the claims for themselves as indeed can you’ (Coulthard and Johnson, 2008: 197). In the present instance, a number of different corpora have been used in order to check whether they support each other with regard to their individual findings. This appears to be the case, and Coulthard and Johnson’s use of Google as a corpus for this purpose seems vindicated, provided that safeguards are in place to interpret the results accurately.
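
A first step in locating phrases that occur across both subcorpora is to list the word sequences the two sets of texts have in common, which can then be checked for rarity in reference corpora. The following Python sketch shows one simple way of doing this with word n-grams. It is an illustrative assumption rather than the method used in the case, the function names are hypothetical, and the only data are the excerpts quoted earlier in this chapter.

```python
import re


def word_ngrams(text, n):
    """Return the set of n-word sequences in `text` (lower-cased, punctuation ignored)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(known, questioned, n=3):
    """List the n-word sequences that occur in both texts, in alphabetical order."""
    common = word_ngrams(known, n) & word_ngrams(questioned, n)
    return sorted(" ".join(gram) for gram in common)


known = ("I am under constant monitoring and my calls and movements "
         "are being recorded or traced")
questioned = ("Your new phone is also under constant monitoring. "
              "You are being traced via your car")

print(shared_ngrams(known, questioned, n=3))
```

Shared sequences found in this way are only candidates: as the discussion above makes clear, their evidential weight depends on how rare they turn out to be in large reference corpora.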

The Use of Corpora in Legal Language Studies

Corpora have also been used in the study of legal language, mostly with a view to ascertaining whether lay members of the public, such as jurors, are able to understand judicial directions and summaries. For example, Heffer (2006) noted jury confusion surrounding the phrase ‘beyond reasonable doubt’. He observed that members of the legal profession consider the phrase relatively easy to understand, based on their perception that it is a phrase in common use. However, he recorded only 25 instances of ‘beyond reasonable doubt’ in the BNC, which suggests that, among speakers in ordinary settings, it is in fact a very rare phrase. Other studies into the language of jury instruction include Tiersma (1999) and Dumas (2000).

Another aspect of legal language is the discourse of the courtroom. Perhaps the seminal work in this regard is Janet Cotterill’s study of the language of the OJ Simpson trial. In one part of her study, Cotterill focuses on the closing speeches of defence and prosecution counsel and underlines the importance of using metaphor, for example, the descriptions of OJ Simpson as a ‘ticking time-bomb’. As Cotterill notes, the aim of using metaphor is to ‘create a powerful conceptual and ideological framework for the jury, and in this way lawyers are able to influence the perception of both witnesses and their evidence. The use of alternative representations permits a coercive reconstruction of “reality” in the crime story, in terms which may activate powerful schemata in the minds of the jury, thereby achieving a good degree of fit in terms of narrative typification’ (Cotterill, 2003: 201).

The point here about narrative typification is interesting, because it ties in with Heffer’s analysis of lawyer language (as it is actually used, as opposed to how it is popularly perceived) as a tension between the construction of a crime as narrative and attempts at detached legal argumentation. Lawyers will use language to implement either of these rhetorical strategies at crucial moments in cross-examination. However, in closing speeches a different strategy emerges, as evidenced by Cotterill’s findings. In the case of prosecution counsel, this is the use of highly charged metaphors designed to make the crime real and to tie the defendant in with the crime. Of course, defence counsel also uses metaphor, but with a different aim in mind, namely to dissociate the defendant from the crime. An example is defence attorney Johnnie Cochran’s famous ‘If the glove doesn’t fit, you must acquit’, which distorts a well-known idiom (‘if the shoe fits . . .’) in order to bolster the defence position for acquittal. Metaphor, with its ability to take well-known concepts and reclothe them in memorable language, is highly effective for this purpose.

As we can expect, when corpus searches of metaphors are undertaken, we find that typical examples are of much higher frequency than the types of phrasing found in judges’ summaries to juries. The main point, however, is not their relative frequency, but the adept way in which prosecution attorneys attach particular versions of metaphors to defendants, and how these, in turn, are countered by defence attorneys.

The Evolution of Legal Terminology

However, there is one area of legal language for which corpus studies may not be so well equipped, and that is the way in which meaning evolves through judicial decisions. Consider the concept of mens rea and the word ‘intention’ as it is used in English law. The concept of mens rea (the presence of the ‘guilty mind’) differs in application across different crimes. Thus, for example, the mens rea for criminal damage under English law is recklessness or intention.




Here, intention is subjectively tested. That is to say, the jury must attempt to arrive at what was in the defendant’s mind when s/he committed the crime: what did s/he intend to do? On the other hand, under English common law, murder is defined in terms of the intention to kill or to cause grievous bodily harm. The defendant must have intended serious harm or death to the victim. However, getting at what ‘intention’ can mean in English law may not be as straightforward as consulting a corpus, because the word ‘intention’ has, as a result of case law, evolved away from its common meaning. Hence, while the ‘ordinary meaning’ of intention (referred to by UK lawyers as ‘direct intention’), that is, the defendant acting in order to bring about a result, will suffice for most cases, it does not suffice for those cases in which it is clear that the defendant did not consciously intend to cause serious harm or death. This form of intention is referred to by lawyers as indirect or oblique.

As will now be outlined, for over 20 years, between 1975 and 1996, the Court of Appeal and the House of Lords grappled with defining this form of intention. This does have a bearing on the use of corpus linguistics, because it is clear that in many instances (the definition of ‘intention’ being just one example) the judiciary will evolve the meaning of a word or phrase through case law. Ultimately, the judge’s use of the word ‘intention’ may be significantly different from that with which most jury members will be familiar.

It should first be clarified that the common law definition of murder is the unlawful killing of another with malice aforethought. Many consider that this implies a degree of premeditation. However, mens rea in murder does not consider issues such as premeditation. In fact, intention in murder does not include either premeditation or even motive.
The common law definition, therefore, has very little bearing on what actually happens in trials in which the question of intention is examined on a daily basis. Moreover, as will be seen, there may be a danger that the everyday meaning of ‘intention’ has to some extent been lost within the court system. As a result, it remains to be seen whether corpus linguistics can develop strategies in order to become aware of trends such as these.

The sequence of cases begins with Hyam (1975; see Note 3), in which, at appeal, intention was equated with awareness on the part of the defendant of a high probability that the act in question would cause serious harm or death. In effect, an equation was made between intention and foresight of consequences. In a later case, Moloney (1985), it was decided that foresight of consequences should not be equated with intention, though foresight could be evidence of intention. In that case, Lord Bridge suggested that if a result (i.e. death of the victim) was a ‘natural consequence’ of an act, and if the defendant was aware of this, then intention could be inferred.

In the following year, 1986, another ‘oblique intention’ case, Hancock and Shankland, came before the House of Lords. While Lord Bridge’s view of ‘natural consequence’ (see above) was generally approved, the Lords felt that it did not take into account the degree of probability of a consequence occurring. From this case, a kind of formula for intention emerged. It was stated that if the probability of a consequence was high, then the probability of foresight of that consequence was high, and if the probability of such foresight was high, then it was correspondingly more probable that the defendant’s intention could be inferred.

While the House of Lords (whose judicial functions have since passed to the UK Supreme Court) may have thought that this formula resolved the problem, several cases after Hancock and Shankland showed that it had not. Hence, the matter of indirect intention again surfaced in Nedrick (1986). Lord Lane suggested that if a result was virtually certain and the defendant was aware of this, then intention could be inferred. This direction, which came to be known as the Nedrick direction on oblique or indirect intention, was well received by legal academics and professionals alike.

However, the matter was not entirely resolved, because it came up again in Woollin (1998). At Woollin’s trial the judge had not used the Nedrick direction of ‘virtual certainty’ but had stated that if the jury felt certain that the defendant was aware that his actions carried a ‘substantial risk’ of injury or death to the victim, then they could infer intention from this. The Court of Appeal had ruled that the trial judge’s test of ‘substantial risk’ was not a misdirection, because other evidence had been taken into account in order to derive the defendant’s mental state at the time of the offence.
However, the matter was taken to the Lords on a point of public importance, namely whether, in cases of indirect intention, there was an obligation to give the Nedrick direction. The House of Lords ruled that in such cases the Nedrick direction on intention must be given. However, they substituted ‘find’ for ‘infer’. They also pointed out that juries were to base their finding on all the evidence, not just on evidence of intention (or lack of it). Although some have criticized the flexibility that juries have (the actual direction states that the jury is entitled to find, rather than has to find or is obliged to find), the higher courts have always insisted on the importance of this flexibility.

As can be seen from the above example, the concept of intention has undergone considerable refinement with respect to the mens rea for the offence of murder. Even so, the controversy surrounding the intention required for this offence is far from over. The Law Commission has published several opinions on the desirability of changing the law with respect to homicide offences, and has also given several opinions on the need to formulate a definition of intention. Its 2005 consultation paper ‘A new homicide act for England and Wales?’ initially proposed a statutory definition of intention where ‘a person acts intentionally with respect to a result if he acts either to bring it about, or acts knowing that it will be virtually certain to occur, or acts knowing that it would be virtually certain to occur were he to succeed in his purpose of causing some other result’. However, this suggestion of a statutory definition was withdrawn when legal experts complained that it would fetter the jury, and that directions on these matters were best left to judges. Hence, following further consultation, the Law Commission opted in favour of simply codifying the existing common law on intention. So far, the indications are that this may not happen, though the current UK government is reviewing homicide offences at the time of writing.

The point about this excursus into intention is that it relates to directions given to juries. The concern must always be, as Heffer ably outlines in his 2006 paper, whether the language in which such concepts are explained by judges is sufficiently clear to juries. A general corpus would not be able to clarify what has happened in case law. It is only by studying the case law that one can develop an understanding of how meanings evolve in legal language. Corpus linguists are nothing if not creative and, doubtless, will in time find ways of tracking the way in which legal meanings change and, sometimes, evolve. There is always, however, the danger, especially in common law systems, that words can become gradually divorced from their everyday meanings as judges grapple, not always successfully, with the task of interpreting the law.

Conclusion

In this chapter, instances of how corpus linguistic insights can be applied to forensic situations have been given. In the earlier part of this chapter, the work of Coulthard in the Derek Bentley case was cited to illustrate the pioneering nature of corpus studies in the forensic context. With corpora now much larger than in the early 1990s, and with the added benefit of massive internet databases such as Google, we are now able to conduct meaningful searches into language use in our bid to understand the extent to which idiolect may play a role in authorship identification.


In addition, as casework expands, forensic linguists have been expanding their own corpora of specialist text types, including ransom demands, suicide notes and mobile phone texts. With more and more linguistics graduates now studying forensic applications of linguistics, for example at Bangor University in Wales and Aston University in England, the use of these and other corpora will grow in the coming years, thus cementing the role of forensic linguistics as a viable forensic science and a valuable contributor to applied linguistics generally. [At this point I would like to record my gratitude to Dominik Vaijn, a PhD student at the University of Birmingham, for his assistance in tracking down a number of words and expressions in the BoE].

Notes
1 On the Google internet search engine, on 10th May, 2008. For obvious reasons, frequency counts obtained from internet search engines need to be treated with caution.
2 In many of the internet instances, there is a considerable distance between the words in the core phrase ‘under constant monitoring’ and the word ‘trace’. In some cases, the core phrase and the word are not even in the same article or paragraph. This implies that the number of instances found, 181, can be reduced even further.
3 See References section for full case citation.





References
Brinton, I. (2009). Contemporary Poetry: Poets and Poetry Since 1990. Cambridge: Cambridge University Press.
Chapman, M., Davida, G. I. and Rennhard, M. (2001). A practical and effective approach to large-scale automated linguistic steganography. In Information Security Conference (ISC 01), Vol. 2200 of Lecture Notes in Computer Science, Springer, Heidelberg.
Cotterill, J. (2003). Language and Power in Court: A Linguistic Analysis of the O. J. Simpson Trial. Houndmills: Palgrave.
Coulthard, M. and Johnson, A. (2008). An Introduction to Forensic Linguistics: Language in Evidence. London: Routledge.
Coulthard, R. M. (1994). ‘On the use of corpora in the analysis of forensic texts’, Forensic Linguistics: the International Journal of Speech, Language and the Law, 1, i, 27–43.
Dumas, B. (2000). U. S. pattern jury instructions: Problems and proposals. Forensic Linguistics, 7(1): 49–71.
Halliday, M. A. K. (1994). An Introduction to Functional Grammar (2nd edn). London: Edward Arnold.
Halliday, M. A. K. and Hasan, R. (1976). Cohesion in English. London: Longman.
Heffer, C. (2006). Beyond ‘reasonable doubt’: The criminal standard of proof instruction as communicative act. International Journal of Speech Language and the Law, 13(2): 159–88.
Olsson, J. (2004). Forensic Linguistics: An Introduction to Language, Crime and the Law. London: Continuum.
Sinclair, J. (1995). Collins COBUILD English Dictionary. 2nd ed. London: HarperCollins.
Tiersma, P. (1999). Legal Language. Chicago, IL: University of Chicago Press.

Cases cited
Hyam v DPP (1975) HL.
R v Hancock and Shankland [1986] 2 WLR 257.
R v Moloney [1985] 2 WLR 648.
R v Nedrick (Ransford Delroy) (1986) 8 Cr. App. R. (S.) 179.
R v Woollin [1999] 1 A.C. 82; [1998] 3 W.L.R. 382; [1998] 4 All E.R. 103.

Chapter 7

Corpora and Gender Studies Paul Baker

Introduction
This chapter focuses on ways that corpus linguistics approaches can be helpful to researchers who want to focus on gender. A corpus (Latin for body, plural corpora) is a ‘body’ of naturally occurring language data, which is meant to be representative of a particular language (or genre of language). Corpus linguistics is thus an approach to linguistic analysis that uses large bodies of computerized texts along with specialized computer software, enabling the fast and accurate identification of language patterns that would otherwise remain hidden from human view (see McEnery and Wilson, 2001 for more information). This chapter describes two ways that corpus research can benefit gender research: first, addressing questions concerned with gendered usage of language, and second, looking at research on linguistic representation of gender. This is followed by a small case study, which shows how corpus tools can be used to uncover representations of metrosexuals (a relatively new label given to a certain type of man) in British newspapers. Finally, I consider potential limitations to this approach.

Usage
One strand of research that uses corpora to examine gender is concerned with gendered usage of language and involves answering research questions that aim to compare male and female speech. Such research links back to statements made by early language and gender researchers, who were often concerned with charting ways that the sexes supposedly differed, with an initial focus on the ways by which males dominated females. Lakoff (1975), for example, noted that women were more likely to use polite forms than men, including tag questions, hyper-correct pronunciation and grammar and the avoidance of slang and taboo words. While Lakoff acknowledged that her claims were based on introspection, they helped to spark a long-standing debate about language and gender. Quantitative research that used relatively small numbers of speakers found some support for some of Lakoff’s claims (e.g. Zimmerman and West, 1975; Eakins and Eakins, 1978; Fishman, 1978; Pearson, 1981; Beck, 1988; Lesch, 1994; Meyers et al., 1997). Fishman’s (1980) conceptualization of women using talk to carry out ‘interactional shitwork’ and Spender’s (1980) analysis of social structures, including language itself, as producing and reflecting sexism (including the denial of women’s voices in many contexts) were also important contributions towards the ‘male dominance’ model of gendered language use.
A related body of research also noted that men and women used language differently, although the focus was not on dominance, but on the difference itself. The main proponent of such work is Tannen (1990), who argued that males and females grow up in different cultures, developing ‘genderlects’ which result in a form of cross-cultural miscommunication in adult heterosexual couples. Other researchers have also noted sex differences in language use, for example, Aries and Johnson (1983) and Seidler (1989). However, Tannen’s work has been called ‘non-engaged and apolitical’ (Troemel-Plotz, 1991) while meta-studies since the 1990s (Wilkins and Andersen, 1991; Dindia and Allen, 1992; Canary and Hause, 1993; Hyde, 2005) have tended to find little evidence for sex differences.
It is certainly the case that some research on sex differences in language can be criticized for using introspective methods (which can result in selective bias as people recall what is salient or confirms pre-existing beliefs), for making claims that are partially based on fictional sources (Tannen uses illustrative examples from plays and short stories), or for using small amounts of data that sample only a few people over a short period of time. Additionally, quantitative research on gendered usage of language can often be based upon testing a specific set of hypotheses. So researchers may select a set of linguistic features in advance (interruptions, use of the word mauve etc.) and then attempt to count them across a transcription. Such an approach might tell us about sex variation relating to that feature, but we must rely on the researcher having a priori identified the most relevant features to examine. There may be other, more interesting phenomena that are more revealing of difference (or similarity for that matter), but if the researcher does not know to look for them, such features will never be uncovered.
An approach which uses corpora and corpus methods is thus useful in a number of ways. A large corpus where language has been marked for speaker or writer sex, and containing speech from a wide range of people, will allow us to make claims about gendered usage that will be much more robust and generalizable to a wider population than if small numbers of people were examined. While such a corpus can be queried in order to test existing hypotheses (a ‘corpus-based’ approach), other procedures can be applied, which will automatically highlight what is different in terms of frequency or saliency across different types of speakers (a ‘corpus-driven’ approach). An example of this would be to derive a list of keywords – words that are statistically more frequent in one corpus (or sub-corpus) when compared against another. Computer software quickly compares all words within both corpora and produces lists of those which have such relatively large differences in frequency (using a test like chi-squared or log-likelihood). Such procedures therefore allow analysts to take a somewhat naïve approach, rather than being led by pre-existing hypotheses (although it is also reasonable to use a mixture of both approaches as a form of triangulation). For example, Rayson et al. (1997) used automatic procedures to identify words that were statistically more frequent in male or in female speech across four million words of speech from the British National Corpus (BNC), collected in the early 1990s. Some of their findings supported earlier claims about gender differences or seemed to confirm stereotypes (males used more taboo words and colloquial language while females used more family terms). Interestingly, the researchers also found that female speakers tended to take more turns and talk longer than male speakers. A more hypothesis-driven approach to the BNC data was taken by Schmid (2003), who examined frequencies of a number of pre-chosen semantic domains. He found that females tended to use more words connected to colours, home, food, the body and people, whereas males used more words from work, computing, sports and public affairs.
Schmid (2003: 212–213) warns that such differences should not be seen as ‘essential’, however, but instead more related to the types of roles and situations that males and females are more likely to be involved in.
The studies above have tended to focus on lexical differences, which are usually relatively easy to investigate and require little additional mark-up or tagging of the corpus. Corpus software such as WordSmith can also quickly uncover differences in the frequencies of longer fixed stretches of words, sometimes called lexical bundles (such as do you think). Additionally, corpora that have been ‘tagged’ with additional linguistic information can uncover other types of differences. Such tagging is normally carried out automatically by computer software, and is generally highly accurate (although texts which feature unfamiliar words or grammatical structures may contain errors). Two common forms of linguistic tagging involve attaching tags to words or phrases, based on grammatical class (e.g. singular common noun, gradable adjective, preposition etc.) or semantic categories (which may be based on thesaurus classifications). Such grammatically or semantically tagged corpora can uncover other types of differences, which may not be apparent if individual words are considered. Schmid and Fauth (2003), for example, found that females tended to use third person pronouns, indefinite pronouns, predicative adjectives and intensive adverbs more than males, whereas the reverse was true for definite articles and nominalizations. Additionally, Kaur (2009) found that in tagged corpora of British and Malaysian children’s writing, boys tended to use words from semantic domains related to sports, violence and adventure whereas girls used more words related to relationships and social interaction. For the English language at least, the availability of automatic taggers allows for analysis of gendered usage beyond merely lexical variation, although more complex phenomena such as interruptions or ‘conflict’ may be difficult to tag with computer software and may need to be hand-tagged.
A potential practical problem in carrying out studies of spoken usage is that the corpora can be time consuming and expensive to build. The make-up of the BNC (90% written, 10% spoken) is a telling indicator of how much more difficult it is to record and transcribe spoken texts. The written texts in the BNC are also tagged according to the gender of the author (where known), although analysts have tended to find speech differences more interesting to work on, and published writing can be subject to editing processes making the notion of authorship identification somewhat complex.
Ethical considerations can also hamper the collection of large amounts of spoken data for insertion into corpora (some of the ethical practices used to collect the spoken data of the BNC, such as only requiring the people who carried the tape recorders to sign permission forms, rather than every person whose speech was recorded, would now be considered problematic). As well as the large amount of work involved in manually transcribing spoken data (which can contain background noise and overlapping conversations), demographic information about speakers also needs to be collected and annotated accordingly. Perhaps as a result of this, corpus builders are turning to other sources. For example, the 400-million-word Corpus of Contemporary American English, built by Mark Davies, contains a sizeable spoken section, which is mainly composed of transcripts of speech from American television and radio stations like Fox News and NPR. Some of this speech may be scripted while other parts appear to be more ‘naturalistic’. Another option involves using computer-mediated contexts, such as Thelwall’s (2008) study of swearing and gender among users of the social networking site MySpace (interestingly, Thelwall found little difference in ‘strong swearing’ among young males and females). The advantage of internet-based data is that it already exists in electronic form, although care needs to be taken when assuming that a person who has posted a message actually possesses the identity they claim to have.

Representation
A second approach to the study of language and gender has the advantage of being able to circumnavigate some of the problems associated with usage studies that were described in the previous section. This approach is not necessarily as interested in the gender identities of the speakers (although this information can be relevant) and more reliance can be placed on easy-to-collect written or computer-mediated data (particularly news sources, but also fictional, political, legal, scientific or academic texts). Rather than asking ‘How do males and females use language?’, we could instead ask a related question: ‘How is language used to represent males and females?’. Such a question asks us to consider issues surrounding discourse and ideology and requires us to approach a corpus in a different way. Work by Sunderland (2004) on the concept of gendered discourse is relevant here. Whereas Parker (1992: 5) defines discourse as a ‘system of statements which constructs an object’, Sunderland (2004: 6) refers to discourses as ‘ways of seeing the world, often with reference to relations of power’. Sunderland (2004: 28) argues that, ‘People do not . . . recognise a discourse . . . in any straightforward way . . . Not only is it not identified or named, and is not self-evident or visible as a discrete chunk of a given text, it can never be “there” in its entirety. What is there are certain linguistic features: “marks on a page”, words spoken or even people’s memories of previous conversations . . . which – if sufficient and coherent – may suggest that they are “traces” of a particular discourse.’ The process of identifying, naming or evaluating a discourse is thus a somewhat subjective form of interpretation and can depend on the author’s own political stance.
A ‘higher-order’ gendered discourse, such as ‘gender differences’ (or the view that males and females are different to each other) could also contain numerous more specific discourses, such as ‘women as domestic’ or ‘men need sex’. We may label certain discourses as ‘damaging’ or ‘empowering’, depending on a variety of perspectives.




Many of the examples of gendered discourses uncovered by Sunderland (2004) involve the analysis of single texts or small collections of texts (newspaper articles, books, parenting magazines, classroom talk, etc.). While such an approach is often required for a detailed qualitative analysis, another way of identifying gendered discourses could be to build a corpus and use computer software in order to identify repetitive linguistic patterns that relate to gender in some way. For example, in Baker (2006) I carried out analyses of the words bachelor and spinster in the BNC, examining their most frequent collocates (words that tend to co-occur) and also looking at concordances (tables which show every instance of a word within its immediate context) of the words in order to identify common patterns or ‘prosodies’ around them. With several hundred instances of these words (and their plurals) to consider, this allowed me to identify linguistic traces which were suggestive of ‘majority’ discourses, because they were so frequent. For example, bachelor appeared to have two discourses associated with it – one which was indexed via collocates like eligible and references to memories of happy bachelor days. This discourse seemed to imply that bachelorhood (for young men) was a pleasant experience. However, another set of collocates framed older men who were bachelors more negatively, with older bachelors implied to be in need of looking after as well as being lonely, somewhat odd or victims of crime. Spinsters were also often represented as elderly, as well as being connected to widows (another type of older single woman). Spinsters were also often represented via two related discourses – as sex-starved and unattractive. However, these two discourses were often evoked in order to be resisted – as some writers referred to them as stereotypes.
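The concordance view mentioned above (every instance of a word shown within its immediate context) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `kwic`, the character-based context width and the sample sentences are all invented for the example, and the text is not BNC data.

```python
import re

def kwic(text, node, width=40):
    """Return one keyword-in-context line per occurrence of `node` (or its plural)."""
    lines = []
    # Whole-word, case-insensitive match; an optional plural "s" is allowed.
    pattern = re.compile(r"\b" + re.escape(node) + r"s?\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the node words line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Invented sample text, for illustration only (hypothetical, not corpus data).
sample = ("He was an eligible bachelor in those days. "
          "Two bachelors shared the flat. The spinster lived next door.")

for line in kwic(sample, "bachelor", width=25):
    print(line)
```

Reading down the aligned column of node words is what makes recurring patterns to the left and right of the search term easy to spot.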
Interestingly, the analysis did not find equivalent examples of ‘eligible’ spinsters enjoying a youthful, carefree existence before settling down, as was the case with bachelors. From reading concordance lines, we get the impression that society places a high premium on marriage, that people who do not marry need to be ‘explained’ and are often viewed negatively and that young males (but not young females) are allowed to enjoy a period of being single (as long as it does not last).
An approach that focused on grammatical relationships, also using the BNC, was that of Pearce (2008), who examined collocates of the gendered terms man and woman. Pearce used the corpus analysis tool Sketch Engine in order to identify whether verb collocates positioned men or women as subjects or objects. This sophisticated approach identified that women were more frequently the subject of verbs which positioned them as being annoying (annoy, berate, cluck, flaunt, fuss, nag, patronize, presume, urge and wail) whereas they were objects of words which implied that they were being used sexually (bed, ravish, sexualize, shag). On the other hand, men were the subject of many violence words (bludgeon, mutilate, oppress, pounce, ransack, rape and strangle) and the object of seduction (bewitch, captivate, charm, enthral, entice and flatter).
Another set of studies has focused on frequency of usage of gendered words, rather than collocates. Such studies are generally concerned with uncovering evidence for sexism in societies, by showing how words that refer to males tend to be both more frequent and more various than those that refer to females. In another study I examined frequencies of gender-marked pronouns, nouns and terms of address in four equal sized corpora of written British English from 1931, 1961, 1991 and 2006 (Baker, 2010). I found that while most of the male terms tended to be more frequent than the female terms (except for boy/girl), the difference appeared to be shrinking over time. I also found that the change seemed to be due more to people not using male terms, rather than them using more female terms. For example, the term of address Mr had decreased considerably between 1931 and 2006, even though the equivalent female term Ms had only been marginally taken up.
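Because the four corpora in that study were of equal size, raw frequencies could be compared directly. Where corpora differ in size, counts are usually normalized first, for instance to occurrences per million words. The sketch below shows that arithmetic only; the function name `per_million`, the simple tokenizer and the two one-line "corpora" are assumptions made up for illustration and have nothing to do with the actual 1931 to 2006 data.

```python
import re
from collections import Counter

def per_million(text, terms):
    """Return the frequency of each term per million tokens of `text`."""
    # Crude tokenizer (an assumption): lower-cased runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {term: counts[term.lower()] * 1_000_000 / len(tokens) for term in terms}

# Invented one-line "corpora", for illustration only.
older = "Mr Smith met Mr Jones and Mrs Brown"
newer = "Ms Smith met Mr Jones and Anna Brown"

for label, text in [("older", older), ("newer", newer)]:
    print(label, per_million(text, ["Mr", "Ms"]))
```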

Metrosexual
As a more detailed example of how corpus linguistics techniques can be used to investigate research questions around gender and language, in this section, I present a step-by-step description of a short analysis based around the discursive representation of the term metrosexual in the British press. Simpson (2002) defines a metrosexual as an image-conscious man who spends considerable resources on appearance and lifestyle. Aldrich (2004: 1733) notes that he is a heterosexual who is ‘nevertheless in touch with his feminine side’, although Coad (2008) suggests that others may consider him to be gay or bisexual. Pompper’s focus group study (2010) concluded that men feel anxiety, confusion and frustration with regard to metrosexuality. In order to identify discourses around metrosexuality, it was decided to analyze how the concept was represented in the media.
First, a corpus of newspaper articles was built. The term metrosexual was entered into the searchable internet database Nexis UK (the database also considered the plural form metrosexuals when carrying out the search). In order to limit the amount of data elicited, only articles from the national British press from the years 2009 and 2010 were considered. This resulted in 385 citations of the term, occurring across 355 articles (about 250,000 words of text in total). The articles were saved in text-only form and then loaded into WordSmith 5, a suite of corpus analysis tools, which allow users to work with text in any language. There are a number of procedures that can be carried out on a corpus. This analysis shows very brief examples of three of the most common ones: keywords, collocations and concordance analyses; each provides a different type of focus, ranging from the most wide-ranging to the most detailed analyses.

Keywords
Keywords are words that occur in a corpus or text more often than we would expect them to occur, when compared against a (usually larger) reference corpus, which acts as a standard reference for normal frequencies of words. Keywords therefore reveal something of the ‘aboutness’ of a particular corpus. For the purposes of this analysis, the corpus of news articles was compared against the BE06 Corpus, a 1 million word reference corpus of standard written British English, which contained excerpts of texts published in or around 2006. This was felt to be a good reference corpus as it reasonably matched the data in terms of national variety of English and time period. WordSmith was used to create word lists of each corpus (lists of all of the words in a corpus, along with their frequencies), and the two word lists were then compared against each other using the Keywords function, with WordSmith’s default settings for statistical significance. Each word in the corpus is assigned a ‘keyness’ value, which is based on how confident WordSmith is in determining whether its frequency in the newspaper corpus is high, compared to the reference corpus. The 100 words with the highest keyness values were then grouped into categories of ‘aboutness’ as shown in Table 7.1 below.
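The comparison of two word lists described above can be illustrated with a short Python sketch. It is not WordSmith's implementation and does not reproduce its default settings: it assumes the two-term log-likelihood statistic in the form widely used in corpus linguistics, the names `log_likelihood` and `keywords` are hypothetical, and the two token lists are invented stand-ins for the newspaper corpus and the reference corpus.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen `a` times in a study corpus of `c` tokens
    and `b` times in a reference corpus of `d` tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total

def keywords(study_tokens, reference_tokens, top=10):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        # Keep only 'positive' keywords: higher relative frequency in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Invented toy data, for illustration only.
study = "the metrosexual man wears cream and the metrosexual look".split()
reference = "the man and the dog saw the cat and the bird".split()

for word, score in keywords(study, reference, top=5):
    print(f"{word:<12}{score:6.2f}")
```

A word such as the, which is frequent in both lists, drops out, while a word concentrated in the study list rises to the top; that is the sense in which keywords indicate ‘aboutness’.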
This categorization was a subjective process and emerged when considering the set of words as a whole, rather than referring to a preassigned categorization scheme. The scheme itself changed over the process of categorization. For example, a category called ‘People’ was initially used to group all names of people, such as the surnames Beckham and Cameron. However, it was found that some names of people referred to sporting figures, while others referred to politicians. Therefore, this category was divided into sub-categories, and words like Tory and party, which also referred to politics, were added to the politics category.

Table 7.1  Keywords in the metrosexual corpus
Category | Keywords
Gender and sexuality | metrosexual, men, man, male, women, metrosexuals, gay, masculinity, macho, feminine, masculine, blokes, real, manly, beefcake
Fashion and appearances | fashion, look, style, grooming, wear, moisturizer, wearing, shirt, cream, designer, hairy, facial, Gandy, tan, looks, hair, spa, creams, smart, shave, products, trousers, chest, pictured, trend, skin, dress, clothes
Sport | Beckham, (David), Ronaldo, Federer, Murray, Tim, Christiano, Wimbledon, Nadal, Rooney, final, slam, footballers, Manchester, rugby, player, Wayne
Politics | Cameron, (David), party, Tory, Mandelson, Clegg, Lib, coalition, conservative, Dems, Gordon, Nick, Miliband, election, recession
Media | magazine, comedy, Kate, Gladwell, presenter, Corden, Heathcliff, Mongrels, Twitter, TV, Ricky
Grammatical words | I, all, has, a, who, bit, is, really, me, his
Verbs | says, like, think, love

The context of usage of some words was not always immediately clear, so concordance
analyses were used to check whether the categorizations were accurate. For example, cream referred to face cream and shaving cream, so was placed in the ‘fashion’ category. The word David referred to the footballer David Beckham 51 times and the Prime Minister David Cameron 36 times, so this word was put in two categories.
Table 7.1 reveals the main contexts within which metrosexuals are discussed in the British news. Along with terms that relate to gender, there are many keywords relating to fashion, particularly clothing, grooming and hair and skincare. Additionally, metrosexuals seem to be discussed in articles about sports-players, politicians and the media. One direction would be to focus on analyzing some of these words in more detail, by conducting concordance analyses of them. For example, a concordance analysis of the keyword recession reveals that it sometimes occurs in contexts which blame the global recession for men needing to consider their appearance. The following example shows an expanded concordance line in order to show how recession and metrosexuals relate to each other:

The recession has seen a rise in male executives taking drastic action over their appearance, according to a magazine for high net worth individuals. Dr Daniel Sister, of Beauty Works West, a specialist in aesthetic medicine and non-surgical procedures, told Spear’s magazine: “Men now account for 30 per cent of my work. Because of the economic crisis, men are looking for jobs or trying to keep them, so they want to look good and not too tired or worried.” His male clients used to be gay men, then metrosexuals. “Now it’s “ordinary” men too,” Dr Sister said. (The Independent, 4 November, 2009)

In the above article, qualities first associated with being metrosexual are extended to other men, who are represented as competitive and having well-paid jobs. So the article suggests that caring about appearance could be viewed as congruent with traditional masculinity because the men who engage in it are doing it for work-oriented reasons within a competitive market-place. The article also implies that taking care of one’s appearance is not normally associated with being masculine, and it is interesting to note the graded distinction in the article between gay men, then metrosexuals, then ‘ordinary’ men. This categorization places gay men and ‘ordinary’ men at opposite extremes of a continuum (an implicature here is that gay men are not ‘ordinary’). The category of metrosexual is also implied not to be ‘ordinary’, but is instead categorized as somewhere between gay and ‘ordinary’. Interestingly, other articles refer to the recession in order to argue that metrosexuals are now unpopular.

Social psychologist Prof Geoffrey Beattie, of Manchester University, says: “When the economic times get tough men often refocus on core masculine values. In previous recessions men found other ways to reassert their masculinity outside of the world of work. Similarly, in this recession men are changing their lifestyle, patterns of consumption and reassessing who they really are. “The metrosexual man was concerned with how he looked and was not afraid to admit it. In 2010 we’re seeing more men worrying about jobs and unemployment. Suddenly the man interested in his appearance is less relevant.” (The Express, 14 October, 2010)

Thus, the analysis of the keyword recession has helped to uncover a contradiction in the data, with both decreases and increases in ‘metrosexual’ behaviour being attributed to the recession.

Collocates
Another approach is to focus on the word metrosexual (or related words) and consider words which tend to commonly occur near or next to that word – collocates. WordSmith offers several ways to calculate collocation.


One way involves simply counting the number of times that two words occur near each other (within, say, 5 words of each other) and then presenting this information in a table. This can be useful, although it tends to give precedence to grammatical words that are already very frequent in all language, such as and, the, of etc. These words also occur in many other contexts, so they do not reveal much about exclusive relationships. Another way is to consider word pairs that tend to occur together often and apart rarely. WordSmith offers several ways of calculating this, one of which is Mutual Information or MI. Each word in the corpus is compared against every other word, and those which are calculated as having a high MI score are presented in a list (normally an MI score of 3 or more is viewed as constituting an important collocational relationship). Table 7.2 shows the collocates of metrosexual, all of which have an MI score of 3 or more.
As with the table of keywords, the list of collocates is only the start of the analysis. We would need to examine each collocational pair in detail in order to explain why this relationship exists and what it is used to achieve across the corpus. For example, the pair metrosexual-but appears to often be used to differentiate between unacceptable and acceptable male behaviour. So, in the following Times article, footballer Moritz Volz describes how he took an online test to see if he was a metrosexual and then writes about his ‘limits’:

I guess that makes me officially metrosexual. But I do have my limits. You won’t find me – or any footballer for that matter – applying Superdrug’s new Guyliner or Manscara. (The Times, 8 December, 2008)

The following example also uses but to signify a boundary between what is acceptable (‘being a bit metrosexual’) and what is not (wearing women’s trainers):

THERE’S nothing wrong with blokes taking care of their appearance and being a bit metrosexual these days. But ASTON MERRYGOLD from JLS might have taken things a step too far – admitting a preference for WOMEN’S trainers. He’s a little fella so I can understand smaller sizes. But the wee man much prefers “the colours”.

Although articles about metrosexuality do not refer to transvestism explicitly, it is interesting how this article appears to problematize men who wear women’s clothing.




Table 7.2  Collocates of metrosexual

Collocate    MI score    Number of times word occurs near metrosexual
Nelson       8.16         5
Fox          8.11        10
inner        7.72         5
concerned    7.21         5
Posh         7.06         5
Beckham      5.99         8
you’re       5.48         6
he’s         5.47         9
male         5.44        11
David        5.41        10
look         5.37        18
man          5.27        25
again        5.25         5
against      5.15         5
want         4.25         5
new          4.18        10
being        4.10         6
type         3.99         7
has          3.93        19
who          3.66        13
only         3.57         5
your         3.52         5
some         3.51         5
but          3.33         8
are          3.31        15
like         3.24         8
men          3.02         7
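The MI calculation described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming one common formulation, MI = log2(observed / expected), with expected = f(node) × f(collocate) × span / corpus size. WordSmith's own calculation may differ in its details, and the function name mi_collocates, its parameters and the toy corpus are invented for the example.

```python
import math
from collections import Counter


def mi_collocates(tokens, node, window=5, min_mi=3.0):
    """Return (collocate, MI score, co-occurrence count) rows for one node word.

    Assumed formula: MI = log2(observed / expected), where
    expected = f(node) * f(collocate) * span / corpus size.
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Count every token within `window` words to the left or right.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                cooc[tokens[j]] += 1

    span = 2 * window  # number of slots examined around each node occurrence
    rows = []
    for word, observed in cooc.items():
        expected = freq[node] * freq[word] * span / n
        mi = math.log2(observed / expected)
        if mi >= min_mi:
            rows.append((word, round(mi, 2), observed))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical toy corpus: one rare neighbour and one very frequent word.
toy = ["metrosexual", "footballer"] + ["the"] * 98
print(mi_collocates(toy, "metrosexual"))  # [('footballer', 3.32, 1)]
```

In this toy run the frequent word the co-occurs with the node four times yet falls below the threshold, while the rare neighbour passes it. This mirrors the point made above that raw co-occurrence counts favour grammatical words, whereas MI rewards more exclusive pairings.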

Concordances

The final type of analysis involves examining concordance lines of a particular term (such as metrosexual) and identifying discourse prosodies (Stubbs, 2001), or stretches of language which have a similar meaning (often evaluative). For example, a concordance search of metrosexual found a number of constructions, as shown in the samples of concordance lines below (the relevant prosodies are shown in bold print).


Metrosexual as vain and body-focused

Physical preening and   metrosexual   habits such as body-waxing earned him
Big Baby, and the shallow   metrosexual   Ken doll (pictured), the same as
male stereotypes. From the preening   metrosexual   to the commitment-phobic bachelor, the
And the   metrosexual   is just as precious about his pretty locks as the vainest girl.
Man cleavage!   Metrosexual   men are desperate to show off their hairy chests in lowcut tops But
when you see hair nestling like a headless squirrel on your beloved’s chest you know you have a man in your bed. Not a   metrosexual   Man. Grr. I actually feel sorry for men who don’t have chest hair.
Athena poster of the fit-looking   metrosexual   guy caressing the baby, that’s for sure
plucked, tweaked and moisturized as the   metrosexual   Swiss. Some did not like his bicep-bulge celebrations in London and New York.

Metrosexual as middle-class/liberal

Among the sensitive,   metrosexual   Liberal democrat circles I moved in
The fancy-pants London-based   metrosexual   lily-livered media. “We totally despite the
smart, metropolitan, faintly   metrosexual   and not completely to be trusted
Political correctness and latte-sipping   metrosexual   cosmopolitans, its continuing commitment

Metrosexual as feminine/inadequate

hopeless epicurean and an all-round   metrosexual   mincer out of touch with the reality of life
physical shortcomings of this scrawny   metrosexual   male and the wimpy hysteria of modern
But potter about like a softy   metrosexual   and you’ll burn 25% less fuel.
The tiresome culture of the   metrosexual   and his eternal whining

Metrosexuals tend to be portrayed in mocking terms, as vain, feminine and politically liberal – all characteristics which emphasize ‘softness’ in different ways. Some of these representations overlap within a single concordance line. For example, saying that a metrosexual is ‘just as precious about his pretty locks as the vainest girl’ implies that he is both vain and feminine.




Additionally, it is interesting to note further contradictions in representations. For example, in the ‘vain’ concordance table, some metrosexual men are described as vain because they want to show off their hairy chests, whereas in another concordance line it is implied that metrosexuals do not have chest hair (because they shave their chests). Also in the ‘vain’ concordance table there are constructions of metrosexuals who are ‘preening’, ‘physically fit’ (with reference to a famous Athena poster of a muscular man) and engage in ‘bicep-bulge celebrations’. However, in the feminine/inadequate concordance table, metrosexuals are described as ‘scrawny’ and ‘wimpy’. These negative yet contradictory media representations of a metrosexual (muscular yet vain vs. soft and scrawny) perhaps help to explain Pompper’s (2010) focus group finding that men find the concept confusing and frustrating.

Due to space limitations, this is not a complete analysis of representations of metrosexual, but it should demonstrate how the different methodological tools available to corpus linguists can uncover interesting patterns in the data. In the final section of this chapter, I consider potential problems with the corpus approach.

Critical Considerations

A potential criticism aimed at research on gendered usage is that it in itself seems to contribute towards a ‘gender differences’ discourse. In separating humankind into two groups based purely on sex, and then comparing the groups against each other, we are encouraged to focus on differences between the sexes, rather than considering intra-sex difference or similarities between males and females. We conflate a multiplicity of identity traits and individual experiences into an over-simplified binary.

It could be argued that some of the procedures that are available within corpus software packages actually guide analysts in certain directions. It is very easy, for example, to derive a list of keywords within WordSmith, but not so easy to find out how to produce a list of words that are actually very similar in relative frequency in two sets of language data. A keyword list directs our attention to what is different, even though there may be many more words that are used with equal frequency by males and females. Humans tend to find difference fascinating, and we also have a tendency to seize upon difference and perhaps place too much focus on it. For example, the study described by Rayson et al. (1997) noted that males use more swear words than females. However, this simple statement implies something else: females also swear. In fact, in the BNC, females say the word shit more than males. We need to guard against viewing a statistical difference as an absolute or binary difference. If differences do exist, they are often a matter of gradience.

Additionally, as Harrington (2008) warns, corpus analysts also need to take dispersion into account. A gender difference may actually be the result of a small number of people from a particular group using a linguistic feature much more than anyone else and skewing the data. Again, such a phenomenon points to an intra-group difference, which is harder to take into account if we are in a male/female differences mindset.

Yet in other ways, a corpus process may actually obscure difference. Consider a hypothetical situation whereby males and females both use the word football with equal frequency within corpus data. This word would therefore not be a keyword and may never be identified by corpus analysis software as ‘interesting’. However, a detailed analysis of context could show that males tend to use the word to talk about playing football, whereas females use the word in a wider range of ways (such as talking about celebrity gossip surrounding footballers and their wives). Frequency of usage is therefore not the only indicator of ‘difference’.

Another potential issue involves interpretation of concordance lines. It may often appear easy to identify a discourse prosody by examining a few lines and considering the immediate context that a word appears in. However, there may be cases where the immediate context results in an incorrect interpretation. For example, someone may describe a metrosexual as ‘preening’, but this may be so that they can then refute this construction and argue for a more positive representation. Yet this would only be apparent if more context were considered.

Additionally, concordance lines may not be helpful in eliciting other types of information relevant to representation, such as perspectivation (Reisigl and Wodak, 2001: 81). So it may be worth considering whether one person is afforded 300 words to put across their point of view, and is quoted at the beginning of an article, while a person with an opposing view is only attributed 15 words of quoted speech, which appear towards the end. For these reasons, I would argue that combining corpus approaches with other methods which consider longer stretches of text in more detail would be most rewarding (see Baker et al., 2008). With those points noted, however, corpus approaches offer a unique way of considering data (especially large amounts of data), can alert analysts to linguistic patterns that may be otherwise difficult to see, can provide a focus or ‘way in’ to the data and help to reduce researcher bias. They therefore have a great deal to offer to gender-based research.
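Two of the checks raised in this section, finding words of similar relative frequency in two datasets and checking how widely a feature is dispersed across speakers, are easy to script even where a software package does not offer them directly. The following is a minimal sketch under stated assumptions: the function names, the 10% tolerance, the minimum count of 5 and the toy data are all hypothetical choices made for illustration, not settings taken from any corpus tool.

```python
from collections import Counter


def similar_frequency_words(tokens_a, tokens_b, tolerance=0.1, min_count=5):
    """Words whose relative frequencies in two datasets are nearly equal.

    A word qualifies if it occurs at least `min_count` times in each dataset
    and its two relative frequencies differ by no more than `tolerance`,
    expressed as a proportion of the larger of the two.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    similar = []
    for word in counts_a.keys() & counts_b.keys():
        if counts_a[word] < min_count or counts_b[word] < min_count:
            continue
        rel_a = counts_a[word] / len(tokens_a)
        rel_b = counts_b[word] / len(tokens_b)
        if abs(rel_a - rel_b) / max(rel_a, rel_b) <= tolerance:
            similar.append(word)
    return sorted(similar)


def speaker_dispersion(tokens_by_speaker, word):
    """Proportion of speakers who use `word` at least once."""
    users = sum(1 for tokens in tokens_by_speaker.values() if word in tokens)
    return users / len(tokens_by_speaker)


# Hypothetical data: two groups of different size using 'football' equally often.
group_a = ["football"] * 5 + ["x"] * 95
group_b = ["football"] * 10 + ["y"] * 190
print(similar_frequency_words(group_a, group_b))  # ['football']

# Hypothetical data: one speaker accounts for every use of a feature.
speakers = {"s1": ["feature"] * 20, "s2": ["no"], "s3": ["yes"], "s4": ["ok"]}
print(speaker_dispersion(speakers, "feature"))  # 0.25
```

A low dispersion value alongside a high group frequency would suggest that an apparent group difference is driven by a few individuals, which is the situation Harrington (2008) warns about.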




References

Aldrich, R. (2004). Homosexuality and the city: An historical overview. Urban Studies, 41: 1719–37.
Aries, E. and Johnson, F. (1983). Close friendship in adulthood: Conversational content between same-sex friends. Sex Roles: A Journal of Research, 9(12): 1183–96.
Baker, P. (2006). Using Corpora for Discourse Analysis. London: Continuum.
Baker, P. (2010). Will Ms ever be as frequent as Mr? Gender and Language, 4(1): 125–49.
Baker, P., Gabrielatos, C., Khosravinik, M., Krzyzanowski, M., McEnery, T. and Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse and Society, 19(3): 273–306.
Beck, A. T. (1988). Love is Never Enough. New York: Harper and Row.
Canary, D. J. and Hause, K. S. (1993). Is there any reason to research sex differences in communication? Communication Quarterly, 41: 129–44.
Coad, D. (2008). The Metrosexual: Gender, Sexuality, and Sport. Albany, NY: SUNY Press.
Dindia, K. and Allen, M. (1992). Sex differences in self-disclosure: A meta-analysis. Psychological Bulletin, 112(1): 106–24.
Eakins, B. and Eakins, R. (1978). Sex Differences in Human Communication. Boston, MA: Houghton Mifflin.
Fishman, P. M. (1978). Interaction: The work women do. Social Problems, 25: 397–406.
Fishman, P. M. (1980). Conversational insecurity. In Giles, H., Robinson, W. P., and Smith, P. M. (eds), Language: Social Psychological Perspectives. New York: Pergamon Press, pp. 127–32.
Harrington, K. (2008). Perpetuating difference? Corpus linguistics and the gendering of reported dialogue. In Harrington, K., Litosseliti, L., Sauntson, H., and Sunderland, J. (eds), Gender and Language Research Methodologies. Basingstoke: Palgrave Macmillan, pp. 85–102.
Hyde, J. (2005). The gender similarities hypothesis. American Psychologist, 60(6): 581–92.
Kaur, S. (2009). A corpus-driven contrastive study of girls’ and boys’ use of vocabulary in their writing in Malaysia and the United Kingdom. Unpublished PhD thesis, Lancaster University.
Lakoff, R. (1975). Language and Woman’s Place. New York: Harper and Row.
Lesch, C. L. (1994). Observing theory in practice: Sustaining consciousness in a coven. In Frey, L. (ed.), Group Communication in Context: Studies of Natural Groups. Hillsdale, NJ: Lawrence Erlbaum, pp. 57–82.
McEnery, T. and Wilson, A. (2001). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Meyers, R. A., Brashers, D. E., Winston, L. and Grob, L. (1997). Sex differences and group argument: A theoretical framework and empirical investigation. Communication Studies, 48: 19–41.
Parker, I. (1992). Discourse Dynamics: Critical Analysis for Social and Individual Psychology. London: Routledge.
Pearce, M. (2008). Investigating the collocational behaviour of MAN and WOMAN in the British National Corpus using Sketch Engine. Corpora, 3(1): 1–29.
Pearson, S. S. (1981). Rhetoric and organizational change: New applications of feminine style. In Forisha, B. L. and Goldman, B. H. (eds), Outsiders on the Inside: Women and Organizations. Englewood Cliffs, NJ: Prentice-Hall, pp. 55–74.
Pompper, D. (2010). Masculinities, the metrosexual, and media images: Across dimensions of age and ethnicity. Sex Roles, 63(9/10): 682–96.
Rayson, P., Leech, G. and Hodges, M. (1997). Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2(1): 133–52.
Reisigl, M. and Wodak, R. (2001). Discourse and Discrimination: Rhetorics of Racism and Antisemitism. London: Routledge.
Schmid, H.-J. (2003). Do women and men really live in different cultures? Evidence from the BNC. In Lewandowska-Tomaszczyk, B. and Melia, P. J. (eds), Lodz Studies in Language 8: Corpus Linguistics by the Lune. Frankfurt: Peter Lang, pp. 185–221.
Schmid, H.-J. and Fauth, J. (2003). Women’s and men’s style: Fact or fiction? New grammatical evidence. Paper presented at the Corpus Linguistics Conference, Lancaster, March 2003.
Seidler, J. (1989). Rediscovering Masculinity: Reason, Language and Sexuality. New York: Routledge.
Simpson, M. (2002). Meet the metrosexual: He’s well dressed, narcissistic and obsessed with butts. But don’t call him gay. Salon (Retrieved October 3, 2010 from http://archive.salon.com/ent/feature/2002/07/22/metrosexual/).
Spender, D. (1980). Man Made Language. London: Routledge.
Stubbs, M. (2001). Words and Phrases. Oxford: Blackwell.
Sunderland, J. (2004). Gendered Discourses. London: Palgrave.
Tannen, D. (1990). You Just Don’t Understand: Women and Men in Conversation. London: Virago.
Thelwall, M. (2008). Fk yea I swear: Cursing and gender in MySpace. Corpora, 3(1): 83–108.
Troemel-Plotz, S. (1991). Review essay: Selling the apolitical. Discourse and Society, 2(4): 489–502.
Wilkins, B. M. and Andersen, P. A. (1991). Gender differences and similarities in management communication: A meta-analysis. Management Communication Quarterly, 5: 6–35.
Zimmerman, D. and West, C. (1975). Sex roles, interruptions and silences in conversation. In Thorne, B. and Henley, N. (eds), Language and Sex: Difference and Dominance. Rowley, MA: Newbury House, pp. 105–29.

Chapter 8

Corpora and Media Studies

Anne O’Keeffe

Introduction: Getting Started

Traditionally, studies in media discourse have been divided into those that focus on spoken media (mostly radio genres) and those that focus on written media (mostly newspapers). Studies in spoken media discourse were largely covered by conversation analysts (Conversation Analysis, see Hutchby, 1991, 1996), while written media discourse was more likely to be explored within a critical framework (Critical Discourse Analysis, see Fairclough, 1995a, 1995b, 2000). Considering the prevalence of media discourse in everyday life, the number of studies based on it as a whole, over the years, is lower than one would expect. Reasons for this probably lie in the difficulty of gathering data. In the case of spoken data, it has to be recorded and transcribed, a time-consuming and laborious task. In the case of written discourse, before the advent of the internet, the data needed to be scanned (and checked) or keyed into a computer. It is no coincidence, therefore, that most studies of media discourse up until the year 2000 or so, whether spoken or written, focused on small amounts of data that did not need much recording, transcription or scanning time.

For corpus linguists, accessibility of data is always a key issue and, despite all of the technological advances, the drudgery of transcription has not really gone away in relation to the assembling of spoken media data into a corpus. It is relatively easy now to build a corpus of newspapers. With interfaces such as Lexis-Nexis, one can quite readily assemble a large research corpus (obviously within the bounds of any copyright restrictions that may prevail). Even on a small scale, newspaper stories can generally be gathered quickly online. When it comes to the spoken word, however, while it is infinitely easier to get access to radio or television material online, the scourge of transcription still prevails. This has inhibited the growth of studies in relation to spoken media genres.

Despite this, there are a number of ways in which spoken data can be gathered online for research purposes. Some news corporations provide transcripts of political interviews and many avid fans painstakingly transcribe interviews with their hero and post them online. For example, if you enter an internet search query for the famous name plus the word ‘interview’, you will usually get a number of sites where interviews have been transcribed. These will, almost without exception, have been cleaned up in terms of real-time discourse features such as hesitations, false starts and repetitions, but generally they will not have been altered beyond recognition. It is usually easy to access a recording of the actual interview, so one can then add the features that have been cleaned up and check the authenticity of the transcript.

Obviously, in building a corpus of media discourse, the same principles prevail in relation to representativeness. A corpus needs to have a principled underpinning. A collection of texts is not necessarily a corpus that can produce reliable results. It has to be built around a solid design matrix. In the case of newspaper studies, the corpus needs to have parameters of time (from X date/year to Y date/year, with a rationale as to why these dates were chosen). The newspaper corpus needs to consider newspaper type (broadsheet versus tabloid, national versus regional and so on). Other considerations include whether the corpus contains only news reports or includes editorials. Will it focus on one particular topic? Many considerations need to be taken into account.

In relation to spoken data, there are different considerations. In trying to represent spoken media interactions, O’Keeffe (2006) put together a small corpus that was centred around persona type, that is, the data was collected on the basis of trying to represent the voices and identities within the interactions of:

1. political persona (people in the political sphere; radio and television political interviews made up this sub-category)
2. public persona (people who were not in the political sphere but who were well-known in the public sphere; television chat shows formed a major part of this sub-category)
3. private persona (people who phoned radio programmes but who were not known in the public sphere, that is, they were using their private sphere identities; radio phone-ins made up most of this sub-category).

This allowed for the gathering of data around three different interaction types. Not all media discourse is based on political or news interviews, nor is it all television chat shows or radio phone-ins. All needed to be represented in equal measure.

What can a Corpus do for the Study of Media Discourse?

The core functions of corpus software, namely the generation of key word and word frequency lists and concordances, can offer much to the analysis of a corpus of spoken or written media discourse. In this section, we will survey their main applications and use concrete examples from media interviews.

Keywords

Keywords are not necessarily the most frequent words in your corpus; rather, they are the most unusually frequent words. It is very easy to calculate keywords using corpus software such as WordSmith Tools (Scott, 2008). Essentially, the software compares the word frequency list of the text or corpus that you are focusing on with that of some other, larger ‘reference corpus’. The choice of reference corpus can have an influence on the results. For example, if we take an internationally recognizable interview and do keyword calculations against different reference corpora, we will see varying results. The interview that we will focus on is the BBC 1 Panorama television interview by Martin Bashir of Diana, Princess of Wales (broadcast November 1995). The transcript is readily available on the internet, as is the actual television interview; see http://www.bbc.co.uk/politics97/diana/panorama.html.

First, we compare this interview with a corpus of media interviews from O’Keeffe (2006), which is made up of 271,553 words from 29 political interviews (93,180), 46 interviews on TV chat shows and radio involving known or public persona (89,225) and 17 interviews from radio phone-ins involving unknown or private persona (89,148). Data is drawn from international English-speaking media sources including from the UK, USA, Canada, Australia and Ireland. The list of key words in this case is relatively short, 27 in all (Table 8.1). Predictably, the names of the interviewer and the interviewee are the top items. These are in fact marking the speakers in the interview. In each table, these can be ignored for the most part.

Table 8.1  Keywords of Bashir-Diana Panorama interview with Media corpus (O’Keeffe, 2006)

1  Diana      7  marriage    13  difficult   19  were           25  media
2  Bashir     8  husband     14  William     20  yourself       26  depression
3  did        9  had         15  royal       21  because        27  husband’s
4  was       10  uh          16  my          22  relationship
5  Wales     11  monarchy    17  role        23  your
6  prince    12  bulimia     18  queen       24  children

When we generate the key words of the same interview against a different reference corpus, we get more and different results. This time, we use a reference corpus that is very unrelated to media interviews, namely an academic corpus of English, the Limerick-Belfast corpus of Academic Spoken English (LIBEL). This is made up of lectures, tutorials, seminars and presentations. In this case, 500,000 words were used from LIBEL. In all, there were a total of 94 key words from this list which were statistically ‘key’, that is, occurring with unusual frequency when compared with academic lectures and seminars. Table 8.1, the key words from the comparison with the media corpus, contained only 27 words in all. Because the academic data is so different in genre, it results in a broader range of key items, including common first and second person pronouns (I, I’m, my, myself, yourself, me), high frequency verbs and verb forms (was, did, didn’t, wasn’t, loved, think), pronoun-verb combinations (I’ve, you’re), everyday-seeming nouns (divorce, husband, marriage, people’s, husband’s) and so on. Of course, these all relate to more private sphere domains of reference: ‘you – I’, relationships, marriage, problems like bulimia, marriage breakdowns, all of which would not normally be talked about in the more referential world of academia, as these short extracts from the Panorama interview in question and the data from LIBEL illustrate.

1. Extract from interview between Martin Bashir and Diana, Princess of Wales, broadcast November 1995

BASHIR: What effect did the depression have on your marriage?
DIANA: Well, it gave everybody a wonderful new label - Diana’s unstable and Diana’s mentally unbalanced. And unfortunately that seems to have stuck on and off over the years.
BASHIR: Are you saying that that label stuck within your marriage?
DIANA: I think people used it and it stuck, yes.

2. Extract from a literature lecture (LIBEL)

. . . So, according to David Lloyd’s writings we tease out concepts of national identity through literature. Germany, for example, has a great philosophical tradition. The strange thing about Ireland is that we do not have a great philosophical tradition. We have a great literary tradition. Germany has a great philosophical tradition. The French in the nineteen-sixties and seventies produced a terrific theoretical body of work. In Ireland largely we have used literature to sculpt concepts of national identity.

Let us compare the Panorama Bashir-Diana interview with two further datasets. Both of these have in common with the interview the fact that they involve more reference within the ‘I–you’ domain and they refer more to everyday worlds of relationships, and so on. These are a corpus of the sitcom programme Friends (based on Malveira Orfanó, 2010) and, secondly, the Limerick Corpus of Irish English (LCIE), a one million-word corpus of everyday spoken Irish English (see Farr et al., 2004).

As Tables 8.3 and 8.4 illustrate, and in common with Table 8.2, the spread of key words from the Bashir-Diana interview is broad, much broader than when you compare it with the corpus of media interviews (Table 8.1). This tells us that if you do a key word analysis using a reference corpus that is very like the test corpus, you will get more concentrated (and fewer) key word forms. If you take a corpus that is very different to the test corpus, as in the case of LIBEL, you will get a very diverse range of key words including some that you might not expect (e.g. all of the first and second person pronouns and high frequency verbs). If you take a general corpus, representing how English is generally used (in this case LCIE), then you will still get a lot of key words but they will be less disparate. If you look at the results in Tables 8.3 and 8.4, you see that they have a lot in common with Tables 8.1 and 8.2, but they have a broader spread. Many of the noun forms are common to all four tables, for example, husband, bulimia and monarchy. Interestingly, divorce is in Tables 8.2 and 8.4 but not in Table 8.1, where the key words are generated by comparison with other media interviews. This suggests that divorce is not an uncommon reference in media interviews.

Table 8.2  Keywords of Bashir-Diana Panorama interview with LIBEL (top 50 of 94)

 1  Diana      11  I’ve        21  people’s    31  husband’s      41  feel
 2  Bashir     12  it’s        22  bulimia     32  couldn’t       42  loved
 3  was        13  me          23  you’re      33  divorce        43  think
 4  I          14  uh          24  queen       34  that’s         44  never
 5  don’t      15  yes         25  William     35  people         45  wasn’t
 6  husband    16  didn’t      26  were        36  difficult      46  Mr.
 7  my         17  had         27  because     37  public         47  princess
 8  I’m        18  prince      28  monarchy    38  there’s        48  royal
 9  did        19  I’d         29  myself      39  yourself       49  pressures
10  Wales      20  marriage    30  role        40  relationship   50  albeit


Table 8.3  Keywords of Bashir Panorama interview with Friends sitcom corpus (50 in total)

 1  Diana      11  media       21  William        31  people’s     41  get
 2  Bashir     12  and         22  Obviously      32  bulimia      42  can
 3  was        13  were        23  Royal          33  queen        43  not
 4  very       14  prince      24  Country        34  effect       44  look
 5  people     15  role        25  Monarchy       35  yourself     45  is
 6  husband    16  difficult   26  As             36  interest     46  know
 7  had        17  children    27  relationship   37  being        47  you
 8  of         18  marriage    28  The            38  felt         48  just
 9  public     19  because     29  In             39  think        49  no
10  Wales      20  that        30  Did            40  depression   50  oh

Table 8.4  Keywords of Bashir Panorama interview with LCIE as reference corpus (top 50 of 89 keywords)

 1  Bashir         11  bulimia     21  Uh           31  feel        41  pressures
 2  Diana          12  role        22  My           32  your        42  separation
 3  husband        13  difficult   23  people’s     33  princess    43  attention
 4  marriage       14  royal       24  Very         34  was         44  engagements
 5  Wales          15  William     25  Yes          35  effect      45  daunted
 6  prince         16  because     26  husband’s    36  book        46  were
 7  relationship   17  public      27  Divorce      37  children    47  future
 8  media          18  queen       28  depression   38  obviously   48  enormous
 9  monarchy       19  had         29  Yourself     39  did         49  knowledge
10  people         20  that        30  Albeit       40  being       50  duties
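The comparison of a text against a reference corpus, as carried out for Tables 8.1 to 8.4, can be sketched as follows. This is a minimal illustration that assumes the log-likelihood statistic, one measure commonly used for keyness. It is not a reproduction of the WordSmith Tools procedure. The function names, the threshold of 6.63 (the chi-square critical value for p < 0.01 with one degree of freedom) and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a study corpus of
    c tokens and b times in a reference corpus of d tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a:
        total += a * math.log(a / expected_study)
    if b:
        total += b * math.log(b / expected_ref)
    return 2 * total


def keywords(study_tokens, reference_tokens, threshold=6.63):
    """Words that are unusually frequent in the study text (positive keywords)."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    rows = []
    for word, a in study.items():
        b = ref[word]
        if a / c <= b / d:
            continue  # not relatively more frequent in the study text
        score = log_likelihood(a, b, c, d)
        if score >= threshold:
            rows.append((word, round(score, 2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical toy data: 'the' has the same relative frequency in both,
# so only 'marriage' emerges as key.
interview = ["marriage"] * 10 + ["the"] * 90
reference = ["the"] * 900 + ["lecture"] * 100
print(keywords(interview, reference))
```

Swapping in a different reference list changes b and d for every word, and therefore which words pass the threshold. That is the effect seen when the same interview is compared with the media corpus, LIBEL, Friends and LCIE.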

Let us look at the other instruments to hand, namely, word frequency lists and concordance lines.

Word Frequency Lists

Word frequency lists can be easily generated by corpus software and they simply give the rank-ordered frequency of all of the words, or ‘types’, in the whole of your corpus. It is very often illuminating to compare where in the rank order a word is in comparison to some other baselines. For example, in Table 8.5, we put the top 20 most frequent forms in the Bashir-Diana Panorama interview, the media, LIBEL, Friends sitcom and LCIE corpora.

Table 8.5  Top 20 words in Bashir-Diana Panorama interview, media corpus, LIBEL, Friends sitcom corpus and LCIE everyday conversations sub-corpus

 N   Bashir-Diana   Media corpus   LIBEL   Friends   LCIE everyday conversations
 1   I              the            the     I         the
 2   the            and            you     you       I
 3   and            I              and     the       and
 4   to             to             to      to        you
 5   you            a              of      a         to
 6   that           of             that    and       it
 7   a              you            a       it        a
 8   was            that           in      that      yeah
 9   it             in             it      what      that
10   of             it             I       oh        in
11   Diana          was            is      is        of
12   Bashir         is             s       no        was
13   in             on             so      okay      like
14   but            have           we      know      is
15   my             we             what    my        know
16   what           but            amm     this      he
17   do             for            this    of        on
18   did            they           there   yeah      no
19   had            this           okay    me        they
20   people         be             have    do        but

Based on this we can make a number of observations:

• Comparing the I–you domain on the datasets is a good starting point. In conversation, these two pronouns are usually very high in the frequency list and this is a marker of the interactive nature of conversation. In Table 8.5, we can see that I and you are high ranking in all datasets but in the academic lecture data, we see that there is a lot more you reference than I.

• Anomalies always need to be checked out by looking at concordance lines to try to get to the bottom of why certain words have floated to the top, so to speak. In the following section, we will follow up on the following words that are high-frequency in the Bashir-Diana interview but not to the same degree in the other datasets: was, my, what, do, did, had, people.
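A rank-ordered frequency list of the kind shown in Table 8.5 can be sketched in a few lines. This is a minimal illustration only: the tokeniser is a deliberately simple assumption (lower-cased alphabetic strings, with internal apostrophes kept so that forms such as didn't stay whole), and the function names and sample sentence are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic word forms (I'm, didn't, ...)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def frequency_list(text, top=20):
    """Return the `top` most frequent word forms with their counts."""
    return Counter(tokenize(text)).most_common(top)


def rank_of(word, text):
    """1-based rank of `word` in the frequency list, or None if it is absent."""
    ranked = [form for form, _ in Counter(tokenize(text)).most_common()]
    return ranked.index(word) + 1 if word in ranked else None


# Hypothetical sample text.
sample = "I was there. I was not. It was."
print(frequency_list(sample, top=2))  # [('was', 3), ('i', 2)]
print(rank_of("was", sample))         # 1
```

Running rank_of for the same word over several corpora gives the kind of cross-dataset comparison made above, for instance the position of was in the interview against its position elsewhere.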

Concordances Sinclair (1991: 32) tells us that ‘a concordance is a collection of the occurrences of a word-form, each in its own textual environment’ and that ‘in its simplest form it is an index. Each word-form is indexed and a reference is given to the place of occurrence in a text.’ A set of concordance lines then is in essence a list of occurrences of a search word, where all of the occurrences of that word, throughout the corpus, or text, will be collated into one place and indexed back to their position within the dataset. The search word is usually referred to as the node word and it normally appears at the centre of the concordance lines. A spin-off benefit of having all of the node words aligned down the centre of the page is that the eye can pick out

124

Corpus Applications in Applied Linguistics

salient patterns that emerge to the left and right of the node. These can usually be computed by the corpus software by way of follow up but very often the researcher will form hypotheses on scanning the patterns in the concordance itself. Using the concordance function then, let us look at one of the anomalies that we have identified from the word frequency lists in Table 8.5, namely was, by way of example of the type of research process that would follow: Was The appearance of the past tense verb form was in the top 20 items in the interview begs explanation. The only other dataset where we find it on Table 8.5 is in LCIE, that is casual conversation where it is used within narratives. It is worth exploring whether it is being put to a similar use in this interview. Here is an extract of the concordance lines for was using Wordsmith Tools. At first, there seems to be no order to this stream of seemingly truncated lines but any one line can be clicked to bring the researcher back to the source file. In addition, as referred to above, when looking at a search word in context, one needs to look at the patterns that form to its left and right. 7 s of questions - and I hoped I was able to reassure them. Bu 9 s agendas changed overnight. I was now separated wife of the 10 e of Wales, I was a problem, I was a liability (seen as), an 11 wife of the Prince of Wales, I was a problem, I was a liabil 12 their attitude towards me. It was, you know, if we are goin 13 rpose was behind it? DIANA: It was to make the public change 14 d more cards than I would - it was very much a poker game, c 16 nuisance phone calls? DIANA: I was reputed to have made 300 18 adulterous relationship, which was not true. BASHIR: Have yo 19 presses his affection for you. Was that transcript accurate? 20 ional press? 
DIANA: No, but it was done to harm me in a seri 21 t time I’d experienced what it was like to be outside the ne 22 in a serious manner, and that was the first time I’d experi 23 heard it on the radio, and it was just very, very sad. Real 24 Andrew Morton’s book about you was published. Did you ever m 25 her ‘annus horribilis’, and it was in that year that Andrew 26 saw the distress that my life was in, and they felt it was 27 , I did. BASHIR: Why? DIANA: I was at the end of my tether. 28 life was in, and they felt it was a supportive thing to hel 29 od team in public; albeit what was going on in private, we w 30 uired. BASHIR: Do you think it was accepted that one could l 31 mily? DIANA: I think everybody was very anxious because they 32 A: No, because again the media was very interested about our

Figure 8.1  A sample from the concordance lines for was from Bashir – Diana interview
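
The concordance display described above can be approximated in a few lines of code. What follows is a minimal sketch of a keyword-in-context (KWIC) search, not a description of how WordSmith Tools works internally; the function name kwic, the width parameter and the two-sentence sample are assumptions made purely for illustration.

```python
import re


def kwic(text, node, width=30):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    # \b stops 'was' from matching inside longer words such as 'wash'
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", flags=re.IGNORECASE)
    for match in pattern.finditer(text):
        start, end = match.start(), match.end()
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        # Right-align the left context so that the node words form a column
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


# Hypothetical sample built from two utterances quoted in Figure 8.1
sample = ("BASHIR: Why? DIANA: I was at the end of my tether. "
          "I heard it on the radio, and it was just very, very sad.")

for line in kwic(sample, "was", width=20):
    print(line)
```

As in Figure 8.1, each line is truncated at a fixed number of characters on either side of the node, which is why the edges of a concordance line often cut through words.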




Table 8.6  Plot of collocates of node word was using WordSmith Tools

N   Word     L5  L4  L3  L2  L1  Centre  R1  R2  R3  R4  R5
1   WAS       1   6   2   2   0     181   0   2   2   6   1
2   I         8   6   7   2  43       0   0   4   3   2   5
3   A         5   3   6   1   0       0  29   3   3   7   7
4   IT        2   0   1   1  45       0   2   2   3   2   4
5   THE      10   3   4  12   0       0  11   4   5   6   6
6   AND       5   7   4  13   2       0   0   4   4   5   3
7   THAT      3   2   7   6  10       0   2   4   2   6   5
8   TO        2   4   5   1   0       0   3  13   8   5   5
9   OF        0   2   4   2   0       0   0   2   7   5   3
10  DIANA     0   5  10   6   1       0   0   0   1   0   0
11  IN        5   2   0   1   0       0   4   2   4   2   3
12  BECAUSE   1   1   1   5   0       0   1   3   2   8   0
13  WHAT      1   1   2   1  13       0   1   0   1   1   0
14  BUT       1   1   3   9   0       0   1   1   2   1   2
15  YOU       4   3   3   1   1       0   1   1   1   4   1
16  MY        0   2   2   4   0       0   5   1   0   3   3
17  BASHIR    2   1   1  10   1       0   0   0   0   0   0
18  VERY      0   0   0   0   0       0   9   4   1   0   0
19  YOUR      1   0   0   2   0       0   5   1   0   2   2
20  OUT       3   1   0   0   0       0   2   2   3   1   0
21  WITH      1   1   3   0   0       0   0   1   2   3   0
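
A positional collocate count of the kind laid out in Table 8.6 can be sketched as follows. This is an illustrative reconstruction under simple assumptions (whitespace tokenization, no sentence or speaker boundaries), so its counts need not match those of any particular concordancer; the function names and the toy sentence are hypothetical. The last lines recompute, from the L1 and R1 counts in Table 8.6, the proportions of the 181 occurrences of was that are taken up by its most frequent immediate neighbours.

```python
from collections import Counter, defaultdict


def positional_collocates(tokens, node, span=5):
    """Count the words found at positions L5..L1 and R1..R5 around `node`."""
    table = defaultdict(Counter)  # word -> Counter({"L1": n, "R1": n, ...})
    lowered = [t.lower() for t in tokens]
    node = node.lower()
    for i, tok in enumerate(lowered):
        if tok != node:
            continue
        for offset in range(1, span + 1):
            if i - offset >= 0:
                table[lowered[i - offset]][f"L{offset}"] += 1
            if i + offset < len(lowered):
                table[lowered[i + offset]][f"R{offset}"] += 1
    return table


def share(count, node_total):
    """Percentage of node occurrences, rounded to a whole number."""
    return round(100 * count / node_total)


# Toy input, assumed for illustration only
tokens = "I was a problem I was a liability and it was very sad".split()
table = positional_collocates(tokens, "was")
print(table["i"]["L1"], table["a"]["R1"], table["very"]["R1"])

# Counts taken from the L1 and R1 columns of Table 8.6 (181 occurrences of was)
print(share(43 + 45, 181))      # I + it immediately before was
print(share(29 + 11 + 9, 181))  # a + the + very immediately after was
print(share(13, 181))           # what immediately before was
```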

Most concordancing software will also have the facility to compute the collocates for you. Table 8.6 shows the breakdown of the words that go before and after was. From Table 8.6, we can see that

• Pronouns I and it account for 49% of all the words that come immediately before was
• A, the and very account for 27% of all the words that come immediately after was
• What was accounts for 7% of all of the patterns to the left of the node word was.

From each of the above respectively, we can say that

• There is a lot of representation of how the interviewee felt and how the interviewee represents situations in the past and how the interviewee felt she was represented. She cleverly merges all of these into very emotive language centring around the first person pronoun. This in turn allows the interview to be set up as a haunting self-revelation of how hard it was to be ‘me’, Diana. This is telling of an interview that seeks to reveal what it was like to be Diana, Princess of Wales:

I was a different person.
I was concerned I was a fat, chubby. . .
I was a problem, I was a liability. . .
I was a problem, fullstop.
I was absolutely devastated
But I was actually crying out
I was again unstable, sick,
I was almost an embarrassment
I was ashamed because I could
I was at the end of my tether
I was compelled to go out and do my engagements
. . . as far as I was concerned I was a fat,
I was constantly tired, exhausted. . .
I was crying out for help. . .
I was desperate.
I was in love with him.
I was now separated wife. . .

Equally, patterns with it was show a negative portrayal of what it was like to be Diana, Princess of Wales:

And in our private life it was obviously turbulent.
I heard it on the radio, and it was just very, very sad.
It was already difficult, . . .
We struggled a bit with it, it was very difficult;
Well, there were three of us in this marriage, so it was a bit crowded.
. . .it was very distressing for me
. . .it was so cruel.
. . .I felt it was unfair, because I want
. . .it was isolating. . .
It was a challenge. . .

• The patterns to the right of the node word all point to noun and adjective phrases: a, the and very. Again negativity prevails in the portrayal of life as a Princess:

Was + noun phrases with a:

a problem
a liability
a basket-case
a fat, chubby, 20-year-old
a challenge
a lot of anxiety
a pretty dull subject
a bit of a difficult time




 1   in a serious manner, and that was the first time I’d experi
 2   to share that load, because I was the one who was always pi
 3  y, yeah. BASHIR: Why? DIANA: I was the separated wife of the
 4   from about 1989 I think. What was the nature of your relati
 5  confuse the enemy. BASHIR: Who was the enemy? DIANA: Well, t
 6  as the cause? DIANA: The cause was the situation where my hu
 7  d to be smoothed. BASHIR: What was the family’s reaction to
 8  ught. The most daunting aspect was the media attention, beca
 9  epression? DIANA: Well maybe I was the first person ever to
10  se. DIANA: Uh,uh. BASHIR: What was the cause? DIANA: The cau
11  on a hanger: they decided that was the problem - Diana was u

Figure 8.2  All of the instances of was  the in the Bashir – Diana interview

Was + noun phrases with the

Of the 11 instances of was followed immediately by the, we see from Figure 8.2 that four are interrogative forms in questions from Bashir (What was. . .? and Who was. . .?). We will return to the interrogative forms used by the interviewer below. The remaining seven are uttered by Diana, Princess of Wales. Again they all feed into the autobiography of the unhappy princess (. . .it was the first time I’d experienced what it was like to be outside of the net; . . .I was the one always pictured. . .; I was the separated wife. . ., etc.).

Was + very

The other main pattern in terms of words that come after was is was + very (see Figure 8.3). All of these are uttered by Diana and are largely indicative of very being used as a modifier (in this case an intensifier) of an adjective or a noun phrase. Again, they largely relate to the portrayal of the unhappy nature of the life and relationships of the interviewee:

1  d more cards than I would - it was very much a poker game, c
2  A: No, because again the media was very interested about our
3   I was in love with him. But I was very let down. BASHIR: Ho
4  f fantasy in that book, and it was very distressing for me t
5  mily? DIANA: I think everybody was very anxious because they
6  eople initially? DIANA: Yes, I was very daunted because as f
7   is the bigger the drop. And I was very aware of that. BASHI
8  We struggled a bit with it, it was very difficult; and then
9  t like to fulfil? DIANA: No, I was very confused by which ar

Figure 8.3  All of the instances of was  very in the Bashir – Diana interview

What and questions

We noted above that what was accounts for 7% of all of the patterns to the left of the node word was (see Figure 8.4). Of the 13 occurrences of the pattern, nine are uttered by the interviewer. Eight of these are wh-interrogatives with what. Wh-interrogatives are prototypical interrogative clauses. They are very much indicative of prepared (and pre-approved) questions as opposed to questions which arise out of the previous response.

 1  very busy stopping me. BASHIR: What was your reaction when n
 2  orehand, and explained to them what was happening. And they
 3  ry good team in public; albeit what was going on in private,
 4   you, from about 1989 I think. What was the nature of your r
 5  e they were able to understand what was coming out, and I wa
 6  ngth from to continue? BASHIR: What was your reaction to you
 7  es, a number of times. BASHIR: What was said? DIANA: Well, i
 8  e they’re coming from. BASHIR: What was your husband’s react
 9  needed to be smoothed. BASHIR: What was the family’s reactio
10  ng before you became pregnant. What was your reaction when y
11  e cause. DIANA: Uh,uh. BASHIR: What was the cause? DIANA: Th
12  he fridge. It was a symptom of what was going on in my marri
13   it. BASHIR: Did he understand what was behind the physical

Figure 8.4  All of the instances of what + was in the Bashir – Diana interview

Compare the following question and answer sequences: the first comprises two pre-prepared and planned questions, whereas in the second the interviewer’s follow-up questions arise in an ad hoc manner, based on the response to the first question:

3. Bashir – Diana, Princess of Wales interview

BASHIR: What was the family’s reaction to your post-natal depression?
DIANA: Well maybe I was the first person ever to be in this family who ever had a depression or was ever openly tearful. And obviously that was daunting, because if you’ve never seen it before how do you support it?
BASHIR: What effect did the depression have on your marriage?

4. Jeremy Paxman interviewing author JK Rowling on BBC Newsnight 18 June 2003. Full transcript available at: http://news.bbc.co.uk/1/hi/programmes/newsnight/3004594.stm

JEREMY PAXMAN: Do you think success has changed you?
JK ROWLING: Yes.
JEREMY PAXMAN: In what way?
JK ROWLING: I don’t feel like quite such a waste of space anymore.
JEREMY PAXMAN: You didn’t really feel a waste of space?
JK ROWLING: I totally felt a waste of space. I was lousy. . ..

The more prototypical the question forms, the more formal and pre-scripted they appear. The more ad hoc and less prototypical the question forms, the less pre-scripted they appear. In the Paxman – Rowling interview, there were prototypical wh- questions, but there was also a range of other question forms, such as tag questions, declarative questions and double questions. Based on O’Keeffe (2005, 2006), the comparison of question forms makes for an interesting study of how question forms can vary. In my study I analyzed 100 randomly chosen questions in each of a number of interviews, including Bashir-Diana and Paxman-Rowling as well as data from the TV chat show Parkinson and a BBC Newsnight interview between Jeremy Paxman and the Prime Minister of Britain at the time, Tony Blair. The range of question types found in the analysis is presented in Table 8.7.

Table 8.7  Range of question types across 500 questions across five media interviews (O’Keeffe, 2005, 2006)

Question type   Example
Yes/no          Is it true that you figure it’s associated with all sorts of seedy things like venereal diseases or prostitution or that kind of thing?
Wh-             What age is he ah Breda?
Declarative     You won’t be seeing the match this weekend?
Double          How did you know? Did the bush telegraph tell you?
Tag             Eh that’s the point isn’t it?
Alternative     And in terms of changing a climate or an atmosphere ah within the course and within the community within society do you believe it’s a legislative requirement or ah a debate requirement?

Table 8.8, which is based on the analysis of the different interviews, presents the results found in relation to the question types across 100 randomly chosen questions in each interview.

Table 8.8  Results of analysis of question forms across four TV interviews

Question type   Bashir–Diana   Parkinson (TV chat show)   Paxman–Blair   Paxman–JK Rowling
Yes/no          25             23                         30             40
Wh-             40             16                         24             18
Declarative     31             39                         42             20
Double           3             12                          3             15
Tag              0             10                          0              7
Alternative      1              0                          1              0

From Table 8.8, we can see that the majority of the questions in the Bashir interview fall within the range of the prototypical question forms: yes/no, Wh- and declarative questions. This is also the case in the Paxman-Blair interview that took place on 6th February 2003 (full transcript: http://news.bbc.co.uk/1/hi/programmes/newsnight/2732979.stm). In both cases, 96% of all questions asked fall within these three conventional types. This suggests pre-scripting and careful planning of the interviews. By contrast, these three main question types account for 78% of all the question forms in both the Parkinson and Paxman-Rowling interviews. In these less formal interviews, there are more ad hoc questions formed, which have a more conversational style. In particular, we see the use of double questions and tag questions, both of which are much more hedged in nature compared to the more prototypical question forms (which are more face-threatening):

5. Double questions from Paxman-Rowling interview

Jeremy Paxman: And what about the money? A lot of people when they suddenly make a lot of money, feel guilty about it. Do you feel guilt?

6. Tag question from Parkinson chat show (interview with Lily Savage)

Michael Parkinson: Well this is your comeback but it’s also your final television appearance as Lily Savage, isn’t it?

25 December 2004. Full transcripts available at http://parkinson.tangozebra.com/guest_transcript.phtml?guest_id=55
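
The proportion of prototypical question forms discussed above can be recomputed directly from the counts in Table 8.8. The sketch below is only an illustration: it assumes the row order Yes/no, Wh-, Declarative, Double, Tag, Alternative when reading the table, and the dictionary layout and function name are hypothetical.

```python
# Counts per 100 questions, read from Table 8.8 (row assignment is an assumption)
counts = {
    "Bashir-Diana": {"Yes/no": 25, "Wh-": 40, "Declarative": 31,
                     "Double": 3, "Tag": 0, "Alternative": 1},
    "Parkinson": {"Yes/no": 23, "Wh-": 16, "Declarative": 39,
                  "Double": 12, "Tag": 10, "Alternative": 0},
    "Paxman-Blair": {"Yes/no": 30, "Wh-": 24, "Declarative": 42,
                     "Double": 3, "Tag": 0, "Alternative": 1},
    "Paxman-Rowling": {"Yes/no": 40, "Wh-": 18, "Declarative": 20,
                       "Double": 15, "Tag": 7, "Alternative": 0},
}

PROTOTYPICAL = ("Yes/no", "Wh-", "Declarative")


def prototypical_share(row):
    """Percentage of questions that are yes/no, wh- or declarative."""
    total = sum(row.values())
    return 100 * sum(row[q] for q in PROTOTYPICAL) / total


shares = {name: prototypical_share(row) for name, row in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.0f}%")
```

On the counts as given in Table 8.8, the three prototypical types sum to 96 of the 100 questions in the Bashir–Diana and Paxman–Blair interviews and to 78 of the 100 questions in the Parkinson and Paxman–Rowling interviews.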

Conclusion

This chapter has surveyed the core applications of corpus linguistics to the study of media discourse. It has discussed the setting up of a representative corpus and has exemplified the core corpus functions of key word analysis, word frequency lists and concordancing. These are just the basics of the applications of corpus linguistics to the study of media discourse. As such, the basic application of CL to the study of media texts, whether spoken or written, is to provide handy and quick quantitative analyses that would otherwise be difficult and labour-intensive, and to provide a facility for automated indexed searches of the text. In other words, the job of the analysis still remains with the analyst. This chapter has not discussed the wider application of CL in the study of media discourse, that is, when it is used in tandem with other analytical approaches and models. There is a growing body of work where CL has been used to mine results within broader analytical frameworks. Corpus results on their own describe and quantify the data itself, but in order to take these further for the study of media discourse, they need a broader framework. Conversation analysis has provided this framework for most studies of spoken media discourse and Critical Discourse Analysis (CDA) has been the main paradigm for the analysis of written media texts.




In terms of marrying CDA with CL, O’Halloran (2010) provides an excellent example. He shows how CL can be a powerful complementary tool to CDA when he examines a set of texts over a six-week period in the British popular tabloid newspaper The Sun on the topic of the European Union (EU) expansion on 1st May 2004. The corpus he built consists of The Sun texts in the six weeks prior to 1st May, which contain the cultural keywords: ‘(im)migration’, ‘(im)migrant(s)’, ‘EU’ and ‘European’. O’Halloran is able to show, in a convincing and powerful way, how the language and ideology were intertwined in that period. For example, key words such as ‘high unemployment’, ‘impoverished’ and ‘poor’ were linked to ‘Eastern European’ and were tied up with the presupposition that EU enlargement would mean that migrants would be a drain on social services, etc.

In essence, the potential for the use of corpora in the study of language in the media is immense. It allows the analysis of much larger amounts of data in both a quantitative and qualitative way, as I hope to have illustrated in the brief analyses presented here. However, in order to come to broader conclusions beyond describing how media texts differ from other genres, one needs to work with other discourse frameworks such as CA or CDA.

References

Fairclough, N. (1995a). Media Discourse. London: Arnold.
Fairclough, N. (1995b). Critical Discourse Analysis. London: Longman.
Fairclough, N. (2000). New Labour, New Language. London: Routledge & Kegan Paul.
Farr, F., Murphy, B. and O’Keeffe, A. (2004). The Limerick Corpus of Irish English: Design, description and application. Teanga, 21: 5–29.
Hutchby, I. (1991). The organisation of talk on talk radio. In Scannell, P. (ed.), Broadcast Talk. London: Sage, 119–37.
Hutchby, I. (1996). Power in discourse: The case of arguments on a British talk radio show. Discourse and Society, 7(4): 481–97.
Malveira Orfanó, B. (2010). The representation of spoken language: A corpus-based study of sitcom discourse. Unpublished PhD thesis, Mary Immaculate College, University of Limerick, Ireland.
O’Halloran, K. (2010). How to use corpus linguistics in the study of media discourse. In O’Keeffe, A. and McCarthy, M. J. (eds), The Routledge Handbook of Corpus Linguistics. London: Routledge, 563–77.
O’Keeffe, A. (2005). “You’ve a daughter yourself?”: A corpus-based look at question forms in an Irish radio phone-in. In Schneider, K. P. and Barron, A. (eds), The Pragmatics of Irish English. Berlin: Mouton de Gruyter, 339–66.
O’Keeffe, A. (2006). Investigating Media Discourse. London: Routledge.
Scott, M. (2008). WordSmith Tools Version 5. Liverpool: Lexical Analysis Software.
Sinclair, J. McH. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.


Part Four

Corpora in New Spheres of Study


Chapter 9

Corpora and English as a Lingua Franca

Barbara Seidlhofer

The conjunction and in the title of this chapter (as in the titles of other chapters) implies some kind of relationship. In the case of English as a lingua franca, ELF, (I cannot, of course, speak for the other chapters) this relationship is bilateral. The methodology of corpus linguistics can contribute to an understanding of ELF by providing the means for describing the form it takes. But equally the nature of ELF as a use of language can prompt an appraisal of this methodology as an approach to linguistic description. In what follows, to echo the words of J. F. Kennedy, I shall be asking not only what corpus linguistics can do for ELF but what ELF can do for corpus linguistics and subsequently, and consequently, what both can do, together and apart, for applied linguistics.

Corpora and Linguistic Varieties

Applied linguistic research is obviously dependent on linguistic description of one kind or another and linguistic descriptions are in their turn dependent on different kinds of data. The kind of data compiled in corpora derives from actually attested language use. The idea of basing description on such observed third-person data is, of course, not new, as witness the extensive work on word frequency carried out by Thorndike, Palmer and others in the first half of the last century, interestingly enough with language pedagogy in mind (see Thorndike, 1921; Faucett et al., 1936 and for a general account Bongers, 1947). What is new is that electronic technology now enormously extends the scope and precision of the observation so that contemporary computerized corpora can provide a vastly more comprehensive and exact account of language that has been actually produced. But such production is, of course, only one source of language data. Before the advent of the computer, linguistic descriptions were generally based on data from two other sources.


One was first-person introspection, where there is a reliance on the linguist’s own intuition as a representative native speaker of the language concerned. Another source was some other representative native speaker, a second-person informant from whom the data was elicited (see Widdowson, 2009: 194–195). As has frequently been pointed out, neither introspection nor elicitation can be relied on to provide an accurate account of language that is actually produced by its users. Only observation based on (computerized) corpora of third-person data can do that. The use of English as a lingua franca (ELF) yields a vast amount of naturally occurring language data, which can obviously be captured in and retrieved from a computer corpus like any other. There are, however, differences, and these are of interest since they bring into sharp relief a number of issues of more general significance about the nature of language descriptions and their relationship with applied linguistics. To begin with, we need to note that other corpora generally deal with established languages or varieties conceived of as relatively stable entities. Thus the language compiled in corpora of English such as the British National Corpus (BNC) or the Corpus of Contemporary American English (COCA) is defined as what its native speakers produce, more specifically what has been produced by so-called educated and predominantly British and American (etc.) native speakers of the language. Corpora of different varieties of ‘World Englishes’ have also been compiled, especially in post-colonial settings where English is either a majority first language or an official additional language (like most of the 23 corpora of the International Corpus of English (ICE), partly completed and partly under construction). But again the scope of these corpora is primarily defined by reference to particular speech communities residing in certain countries or regions. 
The assumption here is that the usage of the speakers belonging to these speech communities represents different varieties of the language – British English, East African English, Singapore English and so on (see Kortmann and Schneider, 2008).

ELF and Availability of Data

This identification of a language, associated with a particular and relatively well-defined community of its users, does not apply to the essentially extraterritorial use of English as a lingua franca for international/intercultural communication. ELF users do not constitute a speech community in the conventional sense in that its speakers are not ‘born into’ it – so ELF has no native speakers but its use comes about when required by the communicative situation. It is not a language or variety as such but a linguistic resource, which is drawn on as a common means of communication chosen by speakers from different linguacultural backgrounds. Those who use it include speakers of English as a native language (ENL) as well, as these obviously also take part in ELF interactions across linguistic boundaries, but the ethnographic and sociolinguistic reality today is that the vast majority of ELF users have first languages other than English.

‘Mainstream’ corpora assume the existence of particular languages and provide information about how they are variously manifested in performance so that different patterns of lexical and grammatical usage can be identified by reference to a pre-supposed notional code. On the basis of this, comparisons can be made between what people (first or second persons) think they know of their language and what they actually do with it. So corpus data can be used, and frequently are used, to reveal the inadequacies of intuition and elicitation, but this still presupposes that there are conventional norms that users subscribe to even though they cannot be relied on to say just what they are. But since ELF has no native speakers, there can be no appeal to first-person intuition or elicitation from second-person informants: observed third-person data are the only data available. In this respect ELF can be seen as similar to certain genres that also get acquired through secondary socialization, for example, English for specific purposes – but the latter (so far at least) require adherence to the norms of Standard English.

The prevailing assumption that would seem to inform other language corpora, then, is that language performance must be described with reference to some code or usage conventions that define a language or a variety, which are known, if only imperfectly, by the community of its speakers. 
Here perhaps is the main reason why ELF has hitherto not been regarded as worthy of serious consideration. For there is no ‘speech community’ of ELF speakers and no code and usage conventions that they are intuitively aware of and which define their usage as a variety of English. This being so, their linguistic behaviour has generally been seen as simply defective performance, a failure to learn the proper language. It is not surprising, therefore, that in spite of its spread and significance as an international means of communication, ELF had remained almost entirely undescribed until recently. It is curiously paradoxical that when the electronic means for description had become readily available, it should have taken the very technological development that is both a cause and consequence of the global spread of English so long to be applied to the investigation of ELF.


This situation, however, has been changing. Corpora are beginning to appear and there is a gradual recognition that these represent an important development in language description, not only because ELF figures so centrally and significantly as a means of communication in a globalized world and it would be absurd to ignore it, but also because the very process of describing it raises issues about the analysis and interpretation of observed language data that corpus linguistics has not always felt it necessary to address. The need for descriptions of ELF followed naturally from an awareness of its significance. In the late 1990s a lively and controversial debate began about the disregard of the global reality of ELF, prompted in part by the promotion of native-speaker English as the exclusive ‘real’ language. The case was made for studying ELF not, as was the established view, as evidence of defective learning but as a natural variable use of language in its own right, which called for a rethinking of established sociolinguistic ideas (e.g. Seidlhofer, 2001; Widdowson, 2003). Alongside arguments for the conceptualization of ELF, a number of small-scale empirical studies were published, one of the most significant of which was Jenkins (2000). Although the focus of this book is on phonology, the empirically-based discussion of issues of international intelligibility, of relevant sociolinguistic and socio-psychological research and its delineation of applied linguistic implications for teaching fuelled demand for a broader empirical basis for the description of ELF interactions.

ELF Corpora

Subsequently, the case for corpus-based descriptions of ELF to remedy its neglect was sufficiently recognized to attract funding, and since then, empirical research on ELF usage has been gathering momentum. In addition to a growing number of published and unpublished PhD theses on various aspects of ELF based on ‘personal’ corpora that are not in the public domain, there are now two corpora explicitly designed for the investigation of ELF usage: VOICE, the Vienna-Oxford International Corpus of English (VOICE, 2009), and ELFA, the Helsinki-based corpus of English as a Lingua Franca in Academic Settings (cf. Mauranen and Ranta, 2008).

VOICE and ELFA each comprise about one million words of transcriptions of recordings of spoken interactions among speakers of many different first languages carried out through ELF. For researchers interested in language variation and language change – obviously a key concern of the study of ELF – a focus on the spoken medium is essential, as it is in speech that variability in language is most readily discernible. Since the primary interest of ELF research is to further the understanding of how the language develops when used for intercultural communication, predominantly by non-native speakers across the boundaries of primary speech communities, it stands to reason that the first ELF corpora should be focused on capturing how ELF is spoken rather than written. The interactants’ negotiation of meaning in real time, in spontaneous, non-scripted speech events is relieved of the self-monitoring pressure of writing, and represents the use of what Labov refers to as the vernacular, where attention is paid to communicative content rather than to linguistic forms themselves (Labov, 1984: 29). In addition, when the speech events are highly interactive, researchers can also gain some measure of insight into how mutual understanding among the interlocutors is co-constructed (Seidlhofer, 2001: 147), though, for reasons I shall come to later, gaining such insights is far from straightforward.

The speech events captured in both VOICE and ELFA are naturally occurring; this means that the interactions they record were not specially arranged, nor were they elicited, but they would have happened anyway, whether the researchers were there to record them or not. There are no speech events with speakers who all share a first language, nor are there English language lessons. Both corpora contain transcriptions of complete speech events rather than samples or extracts of texts, which means that they contain transcripts of interactions of varying length, ranging from a few minutes up to several hours. 
In the case of VOICE at least, the decision to sample complete speech events was motivated by the nature of ELF interactions and the purpose in describing them, and here we come to the issues raised earlier about what is distinctive about ELF as a use of language and how corpus methodology can be employed in its description. What is essential about ELF is that it is the communicative exploitation of available linguistic resources; hence what is of primary importance is not the linguistic forms that are produced but their pragmatic function. So it is clearly not enough to retrieve tokens of words or word clusters or grammatical structures dissociated from the circumstances that motivated their occurrence. The inclusion of complete speech events, together with supporting information about participants and contexts, provides for the possibility of going beyond a simple quantitative analysis of linguistic forms towards a qualitative interpretation of their functional significance. This question of the relationship between the analysis and interpretation of data is central to an understanding of what a corpus can contribute to an understanding of ELF, and it also raises issues about the methodology of corpus linguistics in general that I mentioned earlier, and which I shall come back to presently.

But to return to the two corpora. They differ most significantly in their domain coverage. ELFA focuses on the academic use of ELF, whereas VOICE covers three domains: educational, leisure and professional (the latter subdivided into business, organizational and research/science). This difference of coverage provides for the possibility of undertaking small-scale comparisons of findings to investigate whether certain patterns are likely to be domain-specific or specific to ELF usage more generally. In terms of first languages represented, both cover a wide range, but VOICE has a primary (but not exclusive) European focus. However, the work of VOICE has been extended in that it has provided the methodological basis for another, complementary ELF corpus in the ‘ASEAN + 3’ region (ACE), being prepared by a project consortium with its centre in Hong Kong (cf. Kirkpatrick, 2010). Another significant difference between ELFA and VOICE (at the time of writing, 2011) is that while ELFA, according to the project website, ‘will be available to researchers on request’, presumably on-site, VOICE is in the public domain: VOICE Online 1.0 was released for free internet access in 2009, and the release of the XML corpus followed in May 2011. This open access policy, which VOICE (unlike many other corpora) has adopted as a matter of principle, is especially important for applied linguistics in that it enables researchers to draw on the resources of the corpus as appropriate to their particular enquiries and to check on findings that other researchers have derived from the data.

Methodological Challenges

I come now to the question of the methodological challenges posed by the compilation of VOICE and what their significance might be for corpus methodology in general. These challenges, for the reasons I have already touched upon earlier, are comparable in some ways to the recording of an unknown language. On the one hand, the purpose of VOICE is to provide a basis for analyzing ELF and since this is a use of English that has hitherto not been described, let alone codified, it would have been counterproductive to rely on taken-for-granted assumptions based on encoded varieties of English. On the other hand, if spoken ELF was to be converted into consistent, computer-readable transcripts, these had to be modelled on some existing orthographic system. But if neither British nor American spelling is actually suggested by the speakers’ pronunciation, which one should be used when transcribing their speech? The decision was therefore taken to adopt neither entirely British nor entirely American spelling, but to introduce a (clearly defined) degree of fusion of both. The intention behind this was to give the ELF speakers in the corpus their own voice, so to speak, to achieve a certain symbolic degree of dissociation from either British or American English, and to reflect the intrinsic hybridity of ELF talk – while still retaining the consistency that computer-readability and retrievability require.

As for mark-up, particular tags absent or optional in other transcription conventions are employed frequently in VOICE because observation of the ELF interactions suggested that certain features have a high functional load. For instance, utterances spoken laughingly are put between ‘<@> </@>’ tags, and special tags are used for speaking onomatopoetically, for marking non-English speech and for pronunciation variations and coinages:

you know i can’t <pvc> pronunciate {pronounce} </pvc> well (VOICE, 2009: LEcon229)

A precise description of all mark-up and spelling conventions is available on the project website, and the rationale of the decisions taken is described in detail in Breiteneder et al. (2006). The operationalization of these conventions posed a challenge since transcribers had to be trained to override their natural inclination to report what they supposed was said from an ENL viewpoint instead of recording what was actually said. It is important to recognize that while issues like these were, and have to be, operationally dealt with, they remain endemic to the whole undertaking of compiling not only an ELF corpus but the corpus of any language use. 
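
Mark-up of the kind described above is what makes such features retrievable by machine. The following is a minimal sketch of how tagged pronunciation variations and coinages might be pulled out of a transcript line; the regular expression is an assumption based solely on the single example quoted above (angle-bracket pvc tags with an optional gloss in curly brackets), and the function name is hypothetical. It is not part of the VOICE project's own tooling.

```python
import re

# Assumed pattern: <pvc> form {optional gloss} </pvc>
PVC = re.compile(
    r"<pvc>\s*(?P<form>[^<{]+?)\s*(?:\{(?P<gloss>[^}]*)\})?\s*</pvc>"
)


def pvc_items(utterance):
    """Return (form, gloss) pairs for every pvc-tagged item in an utterance."""
    return [(m.group("form"), m.group("gloss")) for m in PVC.finditer(utterance)]


utterance = "you know i can't <pvc> pronunciate {pronounce} </pvc> well"
print(pvc_items(utterance))
```

Items tagged without a gloss would be returned with None in the second position, so the same function can serve for coinages that have no obvious conventional counterpart.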
There will always be a tension between the great potential of a corpus for getting at textual facts on the one hand and its considerable limitations when dealing with language use ‘in the real world’ on the other (cf. Cook, 1995). This is why, from the very beginning of work on ELF description, it was regarded as crucial that the design of VOICE should strive to counteract some of the inevitable drawbacks of a corpus (and a relatively small one at that) by laying particular emphasis on factors that support corpus users in their attempts to develop an emic understanding of the speech events in the corpus, what the purpose of the interaction is, what the relationships are between the speakers – in short, ‘to figure out what the devil they think they are up to’ (Geertz, 1983: 58). As I have already indicated, the design features of VOICE that support this approach are the selection of complete speech events, the provision of abundant contextual information and a

very high degree of interactivity rather than monologues. In combination, these features enable corpus users, to some extent at least, to infer how participants negotiate meaning and jointly construct their interactions by drawing on their shared communicative resource. In addition, the original recordings of a number of speech events have been made accessible online as well, and part-of-speech tagging – a particularly complex matter in the case of ELF – is being researched and will be implemented in a new release of VOICE. Nevertheless, severe limitations remain as to what can be captured in, and gleaned from, such a corpus; or, indeed, any corpus. The accumulation of observed linguistic data and their quantitative analysis are relatively straightforward procedures, at least when they can be related to established codes and norms of usage as is the case with most current corpora, and so long as the methodology of corpus descriptions is confined to these procedures, all is well. Problems arise, however, when these textual data, these decontextualized linguistic products, are interpreted as evidence of contextualized discourse processes. On the one hand, quantitative text analysis as such is of limited interest if we want to find out not only what language people use but how they use it. On the other hand, the qualitative interpretation of corpus findings is, as I have indicated, beset with difficulties, which means that we need to be tentative in our inferences. But the making of such inferences is absolutely central in ELF research. With conventional corpora there may be a case for the quantitative analysis of linguistic forms in its own right, as a way of documenting and comparing patterns of usage in different languages and varieties. But for a better understanding of how ELF works as communication, quantitative analysis on its own is of very limited value.

ELF Findings: Forms and Functions

So the descriptive work on ELF has not focused on linguistic forms as such, but on inferring what communication processes these forms might point us to, on trying to get at what these forms are symptomatic of – how ELF users exploit and shape the underlying resources of the language to achieve their communicative outcomes. Particularly over the past few years, numerous corpus-based studies of ELF talk have become available. Clearly, some time had to elapse for the arguments in early publications claiming ‘conceptual space’ for ELF as a phenomenon in its own right to be taken on board, at least in some quarters. But by now, the field has in fact rapidly developed



into an exceedingly vibrant one: many MA and PhD theses have been completed and are under way at universities all over the world, a series of annual ELF conferences is in its fourth year (Helsinki 2008, Southampton 2009, Vienna 2010, Hong Kong 2011, Istanbul 2012), the number of books and articles on ELF has risen exponentially, and the first dedicated international journal, the Journal of English as a Lingua Franca, is about to be launched (volume 1, 2012, published by Mouton de Gruyter). The bulk of this has happened in the span of just three or four years, and it is also during these years that the first substantial collections of empirical studies have been published: House (2009), Mauranen and Ranta (2009), Archibald et al. (2011) and Björkman (2011). The foci of recent descriptive studies cover a wide spectrum, but there are certain tendencies and leitmotifs that can be discerned as emerging from the results of much of the empirical ELF work published to date. At the most general level, these findings shed light on the pragmatic/strategic and accommodative skills of ELF speakers and on how the variable linguistic forms observed in ELF interactions line up with communicative functions in ways that differ from familiar form-function relations codified in standard ENL. From the point of view of corpus linguistics, the most pertinent examples will be of a lexicogrammatical kind – which, however, still tend to be limited in scope due to the lack of part-of-speech-tagged corpora. Nevertheless, as a larger number of investigations becomes available, an understanding is beginning to emerge of certain lexical and grammatical choices ELF speakers make, and of the way they creatively shape this linguistic resource to fulfil their communicative goals. This concerns innovations in verb morphology (e.g. Breiteneder, 2009; Pitzl et al., 2008; Cogo and Dewey, forthcoming), tense and aspect (Ranta, 2006; Dorn, 2010), the noun phrase (e.g. 
Dewey, 2007a, b; Björkman, 2008; Seidlhofer, 2011), tag questions (e.g. Hülmbauer, 2007, 2009) and collocational and colligational patterns (e.g. Dewey, 2007b; Pitzl, 2011; Seidlhofer, 2009a, 2011). In addition, there are studies of new lexical items (Pitzl et al., 2008) and code-mixing/code-switching (Klimpfinger, 2009; Hülmbauer, 2007). What is important about all these studies however is not that they investigate linguistic forms and patterns for their own sake, but that they engage with the question as to what motivates these forms and patterns and how ELF users exploit the resources of the language to negotiate meanings and relationships, to achieve their communicative outcomes (cf. Seidlhofer, 2009b). The primary goal is thus to understand what the variability of ELF tells us about the communicative and interpersonal functions these observed forms and patterns serve. This is why most descriptive ELF studies,

even when focused on specific linguistic features such as repetition (Lichtkoppler, 2007) or silences (Böhringer, 2009), ultimately shed light on larger-scale, pragmatic processes and functions such as accommodation strategies (e.g. Jenkins, 2000; Cogo, 2007; Seidlhofer, 2009a), indication and resolution of non-understanding (Pitzl, 2010; Watterson, 2008), prevention or resolution of misunderstanding (Mauranen, 2006; Kaur, 2009), establishing rapport and solidarity, expressing identity and building an interculture (e.g. Brkinjac, 2009; Gundacker, 2010; Kordon, 2006; Pölzl and Seidlhofer, 2006; Pullin-Stark, 2009; Thompson, 2006).

Applied Linguistic Implications

ELF corpora consist of data produced when speakers of different languages draw on the formal resources of English to communicate with each other in naturally occurring contexts and their purpose is to provide evidence of how they do it. They are therefore quite different from corpora that have been compiled of learner English (see Granger, 2008). Though there is the superficial similarity that both will record linguistic forms that do not conform to the norms of Standard English, in the case of a learner corpus its very purpose is to document the nonconformities, predicated on the assumption that they constitute errors that learners will wish to correct. In contrast, it is central to an understanding of ELF that nonconformities, far from being in need of correction, represent the effective communicative exploitation of linguistic resources. From an ELF perspective, they point us to the processes by which interlocutors from diverse first languages appropriate and adapt English for their own needs. As I pointed out earlier, conventional corpora work on the assumption that whatever language people produce must be attributable to, or associated with, a language or a language variety and if it is not, it is marked as abnormal or erroneous. Learner corpora share the same assumption. But since research based on ELF corpora shows very clearly that people are capable of using this ‘erroneous’ English very effectively on a global scale, the question arises as to whether we are using the right criteria for deciding what language it is appropriate to teach and how learner achievement is to be evaluated (see also Chau, Chapter 12, this volume). The criteria currently invoked are based on the doctrine of authenticity, which holds that the language to be prescribed for learning should correspond with the native speaker language that corpora can now record with such precision. It has become common practice to criticize English



teaching textbooks on the grounds that they fail to represent the ‘real’ English that corpora are now able to describe. Thus Römer, for example, asks: ‘Do we teach our pupils authentic English, i.e. do we confront them with the same type of English they are likely to be confronted with in natural communicative situations?’ (Römer, 2004: 152–153). She then compares textbook English with corpus findings and concludes, not surprisingly, that we do not. But then why should we? For the assumption here is that the only ‘natural communicative situations’ that learners are likely to confront are those involving native speakers, which ignores the situations, which learners are probably more likely to confront, where nonnative speakers naturally communicate by using ELF, often without any ENL speakers present. There is an implication, furthermore, that a confrontation with ‘authentic’ forms that natives have produced will somehow enable learners to use them to communicative effect. But of course an ability to replicate linguistic forms does not at all imply the ability to replicate the contextual conditions that gave rise to these forms in the first place. Language corpora represent an innovation in language description but they have been used (in the case of English at any rate) to consolidate a conception of language pedagogy that is not innovative at all, namely that the objective of learning is essentially to replicate the linguistic behaviour of its native speakers. Though lip service is paid to the importance of communication, this is assumed to necessarily involve conformity to the standard code and to the conventions of native speaker usage. We find evidence of this state of affairs in descriptions of learning goals, for instance in the Common European Framework of Reference for Languages, as well as in the norms reflected in English language tests, even those purporting to test ‘international’ English (see McNamara, 2011). 
The findings that emerge from ELF corpora lead to very different pedagogic implications. When learners of the language put their learnt language to use, they are clearly capable of communicating without conformity. Linguistically ‘incompetent’ though they may be by reference to the norms imposed by teaching and testing, they have a strategic capability for making effective communicative use of the linguistic resources at their disposal. This suggests that the objectives for language learning might be revised to focus attention not on the production of language forms that conform to the norms of native speaker competence and conventions of usage but on the communicative process itself, dissociated from such conformity, whereby learners can develop a capability for exploiting the potential of the language beyond (mere) linguistic proficiency (for further discussion, see Seidlhofer, 2011: Chapter 8).

The findings that are emerging from research based on ELF corpora are also relevant for other areas of applied linguistics apart from language pedagogy. If we want to understand how intercultural communication works in a globalized world and how it may be improved we need to revisit and revise established ideas about what constitutes a language or a community and how people actually draw on linguistic resources to communicate with each other in the real world. We need, in other words, to understand how ELF is used. Such an understanding is a prerequisite for engaging with the contemporary world and coping with ‘natural communicative situations’ such as translation and interpretation (e.g. Hewson, 2009), peacekeeping operations, conflict resolution and asylum seeker interrogations (see Guido, 2008) where ELF is the only means of communication available, and where its use may well be, quite literally, a matter of life and death. If applied linguistics really is to be, as its proponents generally claim it to be, ‘the theoretical and empirical investigation of real world problems in which language is the central issue’ (Brumfit, 1997: 86), then the importance of ELF research and the corpora on which it is based can hardly be exaggerated.

References

Archibald, A., Cogo, A. and Jenkins, J. (eds) (2011). Latest Trends in ELF Research. Newcastle upon Tyne: Cambridge Scholars Press. Björkman, B. (2008). English as the lingua franca of engineering: The morphosyntax of academic speech events. Nordic Journal of English Studies, 7(3): 103–22. Björkman, B. (ed.) (2011). The pragmatics of English as a lingua franca in the international university. Journal of Pragmatics, 43: Special issue. Böhringer, H. (2009). The Sound of Silence: Silent and Filled Pauses in English as a Lingua Franca Business Interaction. Saarbrücken: VDM Verlag. Bongers, H. (1947). The History and Principles of Vocabulary Control. As it Affects the Teaching of Foreign Languages in General and of English in Particular. Woerden: WOCOPI. Breiteneder, A. (2009). English as a lingua franca in Europe: An empirical perspective. World Englishes, 28(2): 256–69. Breiteneder, A., Pitzl, M.-L., Majewski, S. and Klimpfinger, T. (2006). VOICE recording – Methodological challenges in the compilation of a corpus of spoken ELF. Nordic Journal of English Studies, 5(2): 161–88. Brkinjac, T. (2009). Humour in English as a Lingua Franca: ELF Interactions. Saarbrücken: VDM Verlag. Brumfit, C. J. (1997). How applied linguistics is the same as any other science. International Journal of Applied Linguistics, 7(1): 86. Cogo, A. (2007). Intercultural communication in English as a lingua franca: A case study. PhD. Thesis, King’s College London.



Cogo, A. and Dewey, M. (2012). Analysing English as a Lingua Franca. A Corpus-Driven Investigation. London: Continuum. Cook, G. (1995). Theoretical issues: Transcribing the untranscribable. In Leech, G., Myers, G. and Thomas, J. (eds), Spoken English on Computer. Transcription, Mark-up and Application. New York, NY: Longman, pp. 35–53. Dewey, M. (2007a). English as a lingua franca and globalisation: An interconnected perspective. International Journal of Applied Linguistics, 17: 332–54. Dewey, M. (2007b). English as a lingua franca: An empirical study of innovation in lexis and grammar. PhD. Thesis, King’s College London. Dorn, N. (2010). Exploring –ing: The progressive in English as a lingua franca. MA Thesis, University of Vienna. Faucett, L., Palmer, H. E., Thorndike, E. L. and West, M. (1936). Interim Report on Vocabulary Selection for the Teaching of English as a Foreign Language. London: P.S. King & Son. Geertz, C. (1983). Local Knowledge. Further Essays in Interpretive Anthropology. New York: Basic Books. Granger, S. (2008). Learner corpora. In Lüdeling, A. and Kytö, M. (eds), Corpus Linguistics. An International Handbook. Berlin: Walter de Gruyter, pp. 259–75. Guido, M. G. (2008). English as a Lingua Franca in Cross-Cultural Immigration Domains. Frankfurt am Main: Peter Lang. Gundacker, J. (2010). English as a Lingua Franca Between Couples: Motivations and Limitations. Saarbrücken: VDM Verlag. Hewson, L. (2009). Brave new globalized world? Translation studies and English as a lingua franca. Revue Française de Linguistique Appliquée, XIV: 109–20. House, J. (ed.) (2009). The pragmatics of English as a lingua franca. Intercultural Pragmatics, 6(2): Special issue. Hülmbauer, C. (2007). ‘You moved, aren’t?’ The relationship between lexicogrammatical correctness and communicative effectiveness in English as a lingua franca. Vienna English Working PaperS, 16(2): 3–35. http://www.univie.ac.at/Anglistik/Views_0702.pdf. Hülmbauer, C. (2009). ‘We don’t take the right way. We just take the way that we think you will understand’ – The shifting relationship between correctness and effectiveness in ELF. In Mauranen, A. and Ranta, E. (eds), English as a Lingua Franca: Studies and Findings. Newcastle upon Tyne: Cambridge Scholars Publishing, pp. 323–47. Jenkins, J. (2000). The Phonology of English as an International Language. New Models, New Norms, New Goals. Oxford: Oxford University Press. Jenkins, J., Cogo, A. and Dewey, M. (2011). Review of developments in research into English as a lingua franca. Language Teaching, 44: 281–315. Kaur, J. (2009). Pre-empting problems of understanding in English as a lingua franca. In Mauranen, A. and Ranta, E. (eds), English as a Lingua Franca: Studies and Findings. Newcastle upon Tyne: Cambridge Scholars Press, pp. 107–23. Kirkpatrick, A. (2010). English as a Lingua Franca in ASEAN: A Multilingual Model. Hong Kong: Hong Kong University Press. Klimpfinger, T. (2009). ‘She’s mixing the two languages together’: Forms and functions of code switching in ELF. In Mauranen, A. and Ranta, E. (eds), English as

a Lingua Franca: Studies and Findings. Newcastle upon Tyne: Cambridge Scholars Press, pp. 348–71. Kordon, K. (2006). You are very good. Establishing rapport in English as a lingua franca: The case of agreement tokens. Vienna English Working PaperS, 15(2): 58–82. http://www.univie.ac.at/Anglistik/views0602.pdf. Kortmann, B. and Schneider, E. W. (eds) (2008). Varieties of English (4 vols.). Berlin: Mouton de Gruyter. Labov, W. (1984). Field methods of the project on linguistic change and variation. In Baugh, J. and Scherzer, J. (eds), Language in Use. New York: Englewood Cliffs, pp. 28–53. Lichtkoppler, J. (2007). ‘Male. Male.’ – ‘Male?’ – ‘The sex is male’. The role of repetition in English as a lingua franca conversations. Vienna English Working PaperS, 16(1): 39–65. http://www.univie.ac.at/Anglistik/views_0701.pdf. Mauranen, A. (2006). Signaling and preventing misunderstanding in English as lingua franca communication. International Journal of the Sociology of Language, 177: 123–50. Mauranen, A. and Ranta, E. (2008). English as an academic lingua franca – The ELFA Project. Nordic Journal of English Studies, 7(3): 199–202. — (eds) (2009). English as a Lingua Franca: Studies and Findings. Newcastle upon Tyne: Cambridge Scholars Publishing. McNamara, T. (2011). Managing Learning: Authority and Language Assessment. Language Teaching. Cambridge: Cambridge University Press. Pitzl, M.-L. (2010). English as a Lingua Franca in International Business: Resolving Miscommunication and Reaching Shared Understanding. Saarbrücken: VDM. — (2011). Creativity in English as a lingua franca: Idiom and metaphor. PhD. Thesis, University of Vienna. Pitzl, M.-L., Breiteneder, A. and Klimpfinger, T. (2008). A world of words: Processes of lexical innovation in VOICE. Vienna English Working PaperS, 17(2): 21–46. http://www.univie.ac.at/Anglistik/views_0802.pdf. Pölzl, U. and Seidlhofer, B. (2006). In and on their own terms: The ‘Habitat Factor’ in English as a lingua franca interactions. 
International Journal of the Sociology of Language, 177: 151–76. Pullin-Stark, P. (2009). ‘No joke – this is serious!’ Power, solidarity and humour in business English as a lingua franca (BELF) meetings. In Mauranen, A. and Ranta, E. (eds), English as a Lingua Franca: Studies and Findings. Newcastle upon Tyne: Cambridge Scholars Press, pp. 152–77. Ranta, E. (2006). The ‘attractive’ progressive: Why use the –ing form in English as a lingua franca? Nordic Journal of English Studies, 5: 95–116. Römer, U. (2004). Comparing real and ideal language learning input: The use of an EFL textbook in corpus linguistics and language teaching. In Aston, G., Bernardini, S., and Stewart, D. (eds), Corpora and Language Learners. Amsterdam: John Benjamins, pp. 152–68. Seidlhofer, B. (2001). Closing a conceptual gap: The case for a description of English as a lingua franca. International Journal of Applied Linguistics, 11: 133–58. — (2009a). Accommodation and the idiom principle in English as a lingua franca. Intercultural Pragmatics, 6(2): 195–215.



Seidlhofer, B. (2009b). ELF findings: Form and function. In Mauranen, A. and Ranta, E. (eds), English as a Lingua Franca: Studies and Findings. Newcastle upon Tyne: Cambridge Scholars Publishing, pp. 37–59. Seidlhofer, B. (2011). Understanding English as a Lingua Franca. Oxford: Oxford University Press. Thompson, A. (2006). English in context in an East Asian intercultural workplace. PhD. Thesis, Ontario Institute for Studies in Education, University of Toronto. Thorndike, E. (1921). The Teacher’s Word Book. New York: Teachers’ College, Columbia University. VOICE. (2009). The Vienna-Oxford International Corpus of English (version 1.0 online). Director: Barbara Seidlhofer; Researchers: Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Marie-Luise Pitzl. http://voice.univie.ac.at (28 February 2011). Watterson, M. (2008). Repair of non-understanding in English in international communication. World Englishes, 27: 378–406. Widdowson, H. G. (2003). Defining Issues in English Language Teaching. Oxford: Oxford University Press. Widdowson, H. G. (2009). The linguistic perspective. In Knapp, K. and Seidlhofer, B. (eds), Handbook of Foreign Language Communication and Learning (Handbooks of Applied Linguistics, Volume 6). Berlin and New York: Mouton de Gruyter, pp. 193–218.

Online Resources

British National Corpus, http://corpus.byu.edu/bnc/ (28 February 2011).
Corpus of Contemporary American English, http://corpus.byu.edu/coca/ (28 February 2011).
ELFA (English as a Lingua Franca in Academic settings), http://www.helsinki.fi/englanti/elfa/elfacorpus.html (28 February 2011).
International Corpus of English (ICE), http://ice-corpora.net/ice/index.htm (28 February 2011).
Investigating English in Asia: Building the Asian Corpus of English (ACE), Research Centre into Language Education and Acquisition in Multilingual Societies, http://www.ied.edu.hk/rcleams/view.php?secid227 (28 February 2011).
SELF (Studying in English as a Lingua Franca), http://www.helsinki.fi/englanti/elfa/self.html (28 February 2011).
VOICE (Vienna-Oxford International Corpus of English), http://www.univie.ac.at/voice/ (28 February 2010).

Chapter 10

Texting and Corpora

Caroline Tagg

Introduction

The impact of text messaging on society has been profound, changing communication patterns and sparking public debate about the implications for language and literacy. The perceived need to counter negative media portrayals of texting forms the starting-point of much linguistic research (Anis, 2007; Dürscheid and Stark, 2011; Thurlow, 2003), and the fact that linguists see themselves as entering a public debate is illustrated in the title of David Crystal’s popular book, Txtng: The Gr8 Db8 (2008). Perhaps unsurprisingly, the disapproving stance taken by the media towards texting tends to be based on extreme or ‘fictionalised’ cases of highly unconventional spelling practices (Thurlow, 2006). One of the aims of this chapter is to show how corpus-based research can enable linguists to challenge common assumptions around texting and to explore actual texting practices. Corpora reveal meaningful patterns in the way words are spelt, as well as evidence of language play and creative code-switching, which serve as sources of individual expression.

This chapter starts with an overview of text message corpora, the insights gleaned from them with respect to spelling and other forms of expressive creativity, and the challenges they raise for corpus methods: how can text messages best be transferred into computer-readable format? Should spelling be standardized or variation preserved? How effectively can multilingual messages be handled by existing software? I draw on a number of studies to illustrate these points, including my own research. The chapter concludes by speculating on future developments in texting practices and the likely direction of linguistic research.

A brief note on terminology before we proceed: this chapter uses various terms to describe the medium it explores: text messaging, texting and SMS (Short Message Service). Each communiqué is described as a text message,



and is not shortened to ‘text’ due to the general meaning of this term in linguistics. The language of texting has been called ‘Txt’ (Thurlow, 2003). Although the term risks perpetuating erroneous notions that text messages are inevitably abbreviated, I use it throughout the chapter for its convenience and clarity and because of the widespread use of the term in the literature.

Overview of Text Message Corpora

Research into text messaging can be situated within the field of computer-mediated discourse (CMD), to use Herring’s (2007) term. Much early research into CMD was concerned with where it sat on the writing-speech interface (e.g. Baron, 1998). As researchers moved away from the notion of one homogeneous ‘electronic mode’, more sophisticated data analyses took into account the various social and technological features that define each domain or ‘technological mode’ (Herring, 2007). Text messaging is one such mode, sharing with other modes features such as the abbreviation strategies adopted by CMD users to overcome time and/or space constraints (Hård af Segerstad, 2002) while differing in other respects. Characteristics particular to texting include its use in coordinating activities (Ling and Yttri, 2002) and thus its integration with other forms of interaction such as face-to-face encounters as well as its particularly intimate nature. Text messages tend to be sent from mobile phones (rather than computers) straight into the recipient’s pocket; and, unlike some chat rooms, they are generally sent between people who already know each other well (Ling and Yttri, 2002). Of course, the lines between text messaging and other modes blur and shift as technology and practices change; for example, people can increasingly access the internet from their phones and send short messages through online social network sites.

For corpus researchers, one important distinction between modes is the ease with which data can be collected. Difficulties in collecting text messages arise from their private nature, the reluctance with which people may part with them and the ethical issues raised when they agree to do so, and because transferring them to computer-readable format can be tedious and time-consuming. In contrast, public online sites like Twitter are easily accessible by researchers (e.g. Baron et al., 2011).
Perhaps in part for these reasons, linguistic research into text messaging has by most accounts been slow to get off the ground (e.g. Thurlow, 2003), although the first decade of the twenty-first century has seen an increase in empirical studies. Initial data collections were made by social scientists

engaged in understanding the communicative practices of mobile technology and the extent to which new forms of social interaction were being produced. Kasesniemi and Rautiainen (2002), for example, draw on a corpus of 8000 text messages collected in Finland in the 1990s. Their general observations regarding language anticipate later linguistic findings. For example, they identify a variety of styles which they suggest indicates an awareness of the audience. Text messages were more or less abbreviated depending on who the intended recipient was: teenagers wrote more formal text messages to teachers and parents, but used ‘slang, plays on local dialects, puns and insider vocabulary’ with close friends, while girls claimed to use a more direct style in text messages to boys than to other girls (Kasesniemi and Rautiainen, 2002: 185). The observation that texters have choices in how they text, dependent not only on technological constraints but also on interpersonal considerations, is one that runs through linguistic research.

Corpus-based linguistic research into text messaging can be categorized in various ways. In Tagg (2009), I drew distinctions according to the method of data collection. A handful of studies collected data through broad appeals to the public, asking participants to enter text messages, along with personal data, into a website (Grant, 2010) or to forward text messages to a provided number (Dürscheid and Stark, 2011; Fairon and Paumier, 2005). (The latter method overcomes the possibility of transcription error.) Other studies target smaller, delimited populations, such as students in a particular class (Thurlow, 2003), often combining text message collections with other data sets (interviews, questionnaires and so on) in richly layered research. An alternative source of data has been found among the researcher’s own texting network (Deumert and Masinyana, 2008). This is a method I adopted in my own study (Tagg, 2009).
By recruiting 16 participants, and asking them to transcribe and submit all the messages they sent and those which they received within a particular time period, I received messages written by a loose extended network of over 200 texters. These smaller-scale methods have the advantage of ensuring the authenticity of each text message, although at the expense of the breadth obtained online. Many studies in fact combine methods on the basis that, given difficulties in obtaining data, the best approach is a pragmatic and opportunistic one (Hård af Segerstad, 2002). An alternative categorization to that of data collection, as suggested by Dürscheid and Stark (2011), is between corpora that have been made publicly available (this may often depend on ethical considerations) and those which are not. A list of available corpora is included at the end of this chapter.



Choice of method depends, of course, upon the purpose and scope of research. For forensic linguists, the research focus is author attribution. Tim Grant’s corpus, comprising text messages collected online as well as those given to him by the police, serves as a reference corpus against which he can determine an individual’s characteristic practices in order to attribute authorship to text messages involved in criminal cases. A number of other general corpora have been collected with the aim of making a resource available for linguistic description and comparison, such as the ‘SMS4science’ project initiated in Belgium (Fairon and Paumier, 2006), which involves corpora collected in several countries. Another goal of corpus-based research has been the improvement of mobile technology: for example, 10,000 text messages collected by the National University of Singapore (NUS SMS Corpus) were used by How and Kan (2005) in their attempt to increase typing speeds through rearrangements of the phone keyboard (this corpus has since grown to 60,000 text messages, as documented by Chen and Kan, 2011). The objectives of both those working to improve mobile technology and those working with general corpora require breadth and variety of textual data, and so they adopt web-based collection procedures for at least part of their collection process. At the time of writing, the SMS4science corpora are at various stages of construction and have yet to be exploited, while the Singapore corpus does not feature strongly in the text messaging literature. Given the less grand aims of their corpus construction, the projects that have so far yielded the most interesting results are smaller projects that seek to document the local practices of a narrower population.
Data-based studies of local practices have explored a number of features, which can be categorized into three groupings suggested by Thurlow and Poff (2012): language change (linguistic forms and orthographic practices), discourse (online conversational practices) and multilingualism. More specifically, these features have included communicative functions (Thurlow, 2003), sequential organization (Hutchby and Tanna, 2008), grammar (Fuchs and Vuković, 2008) and code-switching (Al-Khatib and Sabbah, 2008). However, the main focus across such studies is what Sebba (2007) calls ‘respelling’: the spelling of words in unconventional ways that index social identities.

Corpora and Respelling

The use of corpora in texting studies has provided, in Thurlow’s (2003: 4) words, ‘a way of rendering more empirical populist claims about the language of text messaging’. Such studies have directly refuted claims made


Corpus Applications in Applied Linguistics

in the media: Grinter and Eldridge (2003), for example, find few of the long, complex abbreviated phrases reported in the press (such as ROFL and TTYL8R) and point out that it is common words that tend to be shortened:

Instead of being long challenging phrases offered by dictionaries, the teenagers recorded shortening simple words such as tomorrow and weekend, which often appeared in messages discussing plans. Other commonly shortened words included school, football, Internet, lessons, and homework. (Grinter and Eldridge, 2003: 7–8)

One of the first and most influential data-based analyses of local respelling practices was carried out by Thurlow (2003) in the UK. The data used in his study comprises 554 messages gathered from 135 first-year undergraduates (with a mean age of 19 and overwhelmingly British), who transcribed their own messages. Apart from exploring why people texted, Thurlow was concerned with how and to what extent texters experimented with language. As well as concluding that spelling was not as varied as generally thought, Thurlow found that the motivations underlying respelling also differed from expectations. Despite the fact that 82% of participants claimed to use abbreviations, Thurlow (2003) found only 1401 examples of abbreviation in his data (around 3 per message, or less than 20% of message content) and few letter or number homophones. There were instead frequent ‘accent stylizations or phonological approximations’ such as novern for northern, ‘onomatopoeic, exclamatory spellings’ such as haha and WOOHOO, and words such as wakey wakey designed to add ‘prosodic impact’. These suggest the influence of interactive demands as well as, or in some cases rather than, physical constraints. Thurlow (2003: 17) posits that ‘brevity and speed’ is just one motivation among three social ‘maxims’ which encourage variation in spelling:

1. ‘brevity and speed’ (seen in lexical abbreviation including letter-number homophones [2, 4, c, u]; and the minimal use of capitalization, punctuation and spacing);
2. ‘paralinguistic restitution’ (such as the use of capitals to indicate emphasis or loudness, or multiple punctuation, which ‘seeks to redress the apparent loss of such socio-emotional or prosodic features as stress and intonation’ (p. 17));
3. ‘phonological approximation’ (i.e. attempts to capture informal speech, which ‘engenders the kind of playful, informal register appropriate to . . . text-messaging’ (p. 18)).
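Some of the surface features named in the first two maxims are regular enough to be flagged automatically, which may help when pre-sorting a corpus for manual coding. The sketch below is illustrative only and is not Thurlow's method; the function name, the patterns and the label strings are hypothetical, and phonological approximation is left out because it cannot be recognized from surface form alone.

```python
import re

# Hypothetical surface cues, chosen for illustration only.
LETTER_NUMBER_HOMOPHONES = {"2", "4", "c", "u"}   # the examples listed under maxim 1
ELONGATION = re.compile(r"([A-Za-z])\1{2,}")      # a letter repeated three or more times
MULTIPLE_PUNCTUATION = re.compile(r"[!?]{2,}")    # runs such as "!!" or "?!"


def surface_cues(token: str) -> set[str]:
    """Return the maxims that a token's surface form might point to."""
    cues = set()
    core = token.strip("!?.,")
    if core.lower() in LETTER_NUMBER_HOMOPHONES:
        cues.add("brevity and speed")
    # Capitals used for emphasis: more than one letter, all upper case.
    shouted = len(core) > 1 and core.isalpha() and core.isupper()
    if shouted or ELONGATION.search(core) or MULTIPLE_PUNCTUATION.search(token):
        cues.add("paralinguistic restitution")
    return cues


for token in ["u", "WOOHOO", "what??", "tomorrow"]:
    print(token, sorted(surface_cues(token)))
```

A heuristic of this kind over-generates (an acronym such as OK would be flagged as emphatic capitals), so its output is best treated as a list of candidates for a human coder to confirm.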




Many respellings in fact fulfil more than one maxim: goin, for example, fulfils the need for brevity and speed as well as approximating spoken conversation. In other cases, the latter principles can override the concern with abbreviation. For example, although forms such as nooooo! or nope fulfil the maxims of paralinguistic restitution and phonological approximation, respectively, both require more key presses than no, particularly where the form is not recognized by predictive text devices (inbuilt devices that enable some phones to predict and in some cases complete the word being typed). In other words, Thurlow’s corpus-based study suggests that variation in spelling is not simply a response to the technological constraints, but a creative and interactive resource for the expression of social identity.

Tagg et al. (2010) show how respellings in text messaging can be seen to form meaningful patterns. The corpus used in their study was a subset of CorTxt, the corpus I compiled (Tagg, 2009) and it was processed using VARD2 (Baron and Rayson, 2009), an interactive piece of software which analyzes and standardizes spelling. A second tool, DICER (Discovery and Investigation of Character Edit Rules) (Baron et al., 2009), was then used to extract the letter replacement rules which VARD uses to transform the respelt form to the standard equivalent. This information was used to manually group the respelling rules into categories (Table 10.1). By far the largest category is letter homophones, mainly because of the frequency of u (see Table 10.1). The figure reflects the fact that you is the most frequent word in CorTxt, with 7884 occurrences, 3043 of which are spelt u (Tagg, 2009: 135).1 It is evident from the table that a large part of the respelling in texting affects very frequently occurring functional words. These findings extend Grinter and Eldridge’s (2003) observation that commonly used words tend to be respelt.
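To give a sense of what a letter replacement rule looks like, the sketch below derives character-level edits from pairs of respelt and standard forms. It is a stand-in built on Python's standard difflib module, not DICER's own procedure, which is not described here; the function name edit_rules and the short list of pairs (forms taken from Table 10.1) are assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_rules(respelt: str, standard: str) -> list[tuple[str, str, str]]:
    """Return (operation, respelt_substring, standard_substring) for each difference."""
    matcher = SequenceMatcher(a=respelt, b=standard, autojunk=False)
    return [
        (op, respelt[i1:i2], standard[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Hypothetical respelt/standard pairs, using forms listed in Table 10.1.
pairs = [("u", "you"), ("luv", "love"), ("gud", "good"), ("txt", "text"), ("hav", "have")]

# Tally how often each character-level rule is needed across the pairs.
rule_counts = Counter(rule for respelt, standard in pairs for rule in edit_rules(respelt, standard))

for rule, count in rule_counts.most_common():
    print(rule, count)
```

Rules that recur across many pairs, such as the insertion of a missing e, are the kind of pattern that can then be grouped by hand into categories like clipping or consonant writing.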
At first glance, the respellings may seem primarily motivated by Thurlow’s maxim of brevity and speed. For example, representing you with either u or ya requires fewer letters. Tagg et al. (2010) speculate that the frequency of clipping (where the beginning or end of the word is clipped) may be attributed to the fact that, for people using predictive texting, clipping is a faster, easier abbreviation strategy than, say, consonant writing. (It is possible to start typing the word and to stop once the predictive texting device has correctly ‘predicted’ the word.)2 However, the very fact that texters can choose between u or ya points to motivations other than brevity and speed. In this case, the choice of ya (rather than you or u) represents an informal, spoken


Table 10.1  Categories of respelling in the CorTxt sub-set

Category                        Examples                                            Frequency (%)
Letter homophones               u, r, ur, c, b                                      1040 (29.91%)
Number homophones               person2die, 2gether, up4that, in2hospital, 2nite    476 (13.69%)
Clippings                       tomo, tho, v, prob, hav                             414 (11.91%)
Apostrophe omission             wots, im, il, its, thats                            367 (10.56%)
Eye dialect                     bak, luv, wots, gud                                 243 (6.99%)
Colloquial contractions         lookin, av, cos, n, whaddya                         232 (6.67%)
Spacing                         Thankyou, ur, u2, aswell, Ohdear, sleep4aweek       232 (6.67%)
Consonant writing               txt, msg, lv, wld, pls                              130 (3.74%)
Mistyping                       your (for you’re), definately, adn, menas           61 (1.75%)
Double-letter reduction         stil, wory, spel, I’l, 2moro, ul (for you’ll)       43 (1.24%)
Misspelling                     your (for you’re), definately                       32 (0.92%)
Unclear                         ur (for your), tomoz (for tomorrow)                 31 (0.89%)
Other abbreviations             happng, checkd, 2morw                               29 (0.83%)
Possible regional respellings   summat, summort, sumfing, dis                       28 (0.81%)
Predictive texting ‘mistake’    in (for go), he (for if)                            6 (0.17%)
Visual morphemes                I’m@my; Lunch@12                                    5 (0.14%)
No category assigned            ----                                                108 (3.11%)

version of you, while the communicative meaning of u rests on the visual impact of the word. With colloquial contractions, texters combine the maxim of brevity and speed with that of phonological representation; that is, they capture informal spoken forms. Within the eye dialect category, respelt forms such as gud and wots do not reflect fast, informal spoken words but represent a more common sound-symbol relationship (e.g. o for the vowel sound in what). The effect of these can be to suggest an uneducated, ignorant speaker (Preston, 2000). Tagg et al. (2010) speculate that, in SMS, the motivation for their use may lie in conveying a certain kind of identity (rebellious, careless and unconventional) rather than an attempt to reduce characters. If colloquial contractions and eye dialect can be described as indexical, the implication is that homophones, clippings and consonant writing may similarly index identities, albeit in a way less directly associated with speech.

Researchers have collected corpora in communities across the globe, including France (Anis, 2007), Sweden (Hård af Segerstad, 2003), South Africa (Deumert and Masinyana, 2008) and Jordan (Al-Khatib and Sabbah, 2008). Most reveal similarities with the UK findings in terms of their respellings, suggesting that local respelling practices can be seen as realizations of a




wider global phenomenon. At the same time, distinct practices arise in particular texting communities. My British corpus, CorTxt, contains various regional respellings of something: summat, sumfing, summort. Deumert and Masinyana (2008) analyzed 312 text messages collected from 22 isiXhosa speakers in South Africa aged between 18 and 27 and proficient to varying degrees in English. Of the SMSes written in English, Deumert and Masinyana found more spelling variation than had been identified in Thurlow’s UK corpus. They also found a number of local forms, as in the following text message.

Brur its 2bed one matras my darling is going 2 put me in shid in church. My money i have save have been decrease due 2 da Aunt Mayoly’s funeral, and miner problst. So da case is coming very soon 3months preg. I’ll c then. Sharp. (Mzwakhe) (Deumert and Masinyana, 2008: 126)

Alongside the global features of Txt (2 for to; c for see and the African American form da for the), local forms identified by the researchers include brur for brother (from Afrikaans broer), sharp for cheers or see you and shid for shit, where the d represents the voicing of the final consonant. Another aspect of the local content is 2bed one matras, which refers to a South African joke. As Anis (2007: 110) argues, Txt is not a fixed standard but emerges from the contextual features facing each texter: ‘[t]he realisation of these processes is highly variable, reflecting the personalized and often playful nature of the private exchanges’.

Furthermore, in text messages written wholly or in part in isiXhosa, Deumert and Masinyana noticed that the local language was spelt conventionally and was not abbreviated even where this increased the cost of sending the text message. Participants in the study, when asked about this, were adamant that isiXhosa could not be manipulated like English. The researchers put this down to the desire of isiXhosa speakers to uphold the standards of their local language and its association in their minds with cultural tradition and purity. What this corpus-based study shows is that the motivations identified by Thurlow do not necessarily apply for every language and in every situation.

Extending the Focus: Beyond Respelling

As alluded to above, one strength of some data-based research has been to enable researchers to look beyond the focus on spelling to other, perhaps


under-researched or less obvious, features. One of these is the multilingual nature of much text messaging. Data collected in various locales reveal the degree to which code-switching takes place through text messaging (Al-Khatib and Sabbah, 2008; Deumert and Masinyana, 2008). Many of the text messages in Deumert and Masinyana’s study included both English and isiXhosa, as in the following.

usis’Bety ukuba if ubecinga ukuza kule-weekend, angezi. (‘sis’ Bety that if she was thinking of coming this weekend, she mustn’t.) (Deumert and Masinyana, 2008: 137)

These switches are seen as identity markers and important in self-expression, even at the cost of time, space or money. Deumert and Masinyana (2008), for example, report that participants would turn off the predictive texting function in order to write in two languages. The researchers identified different motivations for the use of the two languages: English was (as we shall see further below) playful and light-hearted, while isiXhosa was used to ‘create emotional authenticity when more serious sentiments are at stake’ (p. 138) as well as in the expression of traditional beliefs and values. The isiXhosa word –mphanga, for example, is used in relation to the death of a family member and carries ‘shades of grey and sadness’ which an English translation would not (p. 138).

As the above study shows, the focus in some studies is shifting from how individual words are respelt to explore the wider role these play in the creative manipulation of discourse. Deumert and Masinyana go on to describe two different voices adopted by the South African texters. The first, which they call a ‘mock-Elizabethan’ voice, involves two lovers who play with intertextual references to Romeo and Juliet. Zoleka shows her awareness of the bathos they create by marking her switch to Txt with a self-conscious Ahem.

Wherefore out thou. . . Ahem. . . Where are u? (Zoleka)
Is this an sms from my oh-so-dear Julietta I see before me? I am here at last, my luv! What shalt thou cookest for thy sweet Romeo whenfore he visitest u? (Thami) (Deumert and Masinyana, 2008: 128)

In the second example, the reference is to rhythm and blues lyrics, as in sex me up below. This time, the authors point out, the lyrics do not stand out from the surrounding text, which uses respellings associated with African American speech: dis, da.




Baby da way u have sex me up 2day I am sorry 2 tell u dis it waz realy amazing u re da realy Husband I can c I luv u. (Nobesuthu) (Deumert and Masinyana, 2008: 129)

In my study of CorTxt, I looked at the wordplay in which that particular community engaged (Tagg, 2009; Tagg, in press, 2012). The study drew in part on research into artful language use in spoken conversation (Carter, 2004) and in part on my noticing, in the process of exploiting CorTxt, that respelling was only one aspect of the playful nature of Txt. Examples could not be retrieved automatically but, by reading through a sub-set of text messages, I found creative practices that ranged from structural parallelism within one turn to wordplay that stretched over an exchange.

Am watching house – very entertaining – am getting the whole hugh laurie thing – even with the stick – indeed especially with the stick.

This is in reference to the American medical drama House, in which British actor Hugh Laurie plays a doctor with a limp and walking stick. In Tagg (in press, 2012), I suggest that the parallelism of even with the stick and especially with the stick emphasizes the new information (especially). The intensifier indeed emphasizes the structural similarity of the two phrases and highlights the way in which the effect of the second part builds on or exploits the first. Although it appears to be set up as such, this should not be described as a case of self-correction. Instead, especially with the stick adds to even with the stick to create the overall intended meaning. Although the use of repetition in Txt may at first glance seem unlikely given the limited character allowance, such creative manipulation of structural forms occurs throughout the corpus. The following exchange shows how language play can be co-constructed across turns.

A: Sorry, left phone upstairs. OK, might be hectic but would be all my birds with one fell swoop. It’s a date.
B: Can help u swoop by picking u up from wherever u n other birds r meeting if u want. X
A: That would be great. We’ll be at the Guild. Could meet on Bristol road or somewhere will get in touch over weekend. Our plans take flight! Have a good week x

In this exchange, two friends (aged around 30) are arranging to meet. In the first turn, A combines two expressions: the idiom ‘hit two birds with one stone’ and ‘with [or in] one fell swoop’. Whether this is accidental or a deliberate attempt to amuse is unclear. However, the two are semantically quite closely linked: firstly, in that both refer to an arc – the throw of the stone


and the trajectory of the swoop; and secondly, in that a bird may be said to swoop down for prey. What is interesting here is not simply that the texter combines these idioms, but that the second texter picks up on and extends both. In Can help u swoop, B further manipulates ‘one fell swoop’, which she integrates into her sentence by changing the word class (from noun to verb); and she picks up on A’s use of birds to refer to the other people that A is meeting. In the third turn A further extends the ‘bird’ and the ‘arc’ metaphor with Our plans take flight! There is a sense that, by paying such close attention to each other’s wording, the texters highlight their engagement with each other and with the interaction; they show that they are listening. At the same time, the language play also tells us a lot about the people who are texting, or how they want to be perceived: clever, amusing, literate people who enjoy a close relationship. In other words, as with the spelling choices, these creative patterns also play a role in marking social identity and relationships.

These creative practices are speech-like, in that similar features have been found in spoken language (Carter, 2004). However, I also found practices which may be particular to the electronic medium. One example is the playful reference to features of the medium, such as the predictive texting device.

if you text on your way to cup stop that should work. And that should be BUS
Thanks NAME270 and NAME56! Or bomb and date as my phone wanted to say! X

The apparent corrections made above should again not be seen simply as corrections, but as attempts to share either amusement or frustration at the technology’s failure. The effect of these comments relies on the apparent choice by the texters to comment on what they perceive as a mistake rather than going back to change it (a choice unfortunately not available in speech!).

In this section, we have seen how data-based analyses of local texting practices have served to challenge popular conceptions of texting by revealing how users consciously and creatively draw on various language resources in their expression of social or community identities. As we shall see in the next section, however, the collection and investigation of text message corpora is not without its difficulties.

Challenges Posed by Text Message Corpora

Although collections of text messages have proved useful in challenging popular conceptions of the new medium, the studies discussed in this




chapter raise questions about the use of text message corpora in applied linguistics. Some of the issues are connected to message length and corpus size. Text message corpus size tends to be restricted on the one hand because of the short message length and on the other hand because of difficulties in data collection. When the average text message is around 17 words long (Tagg, 2009), even a corpus which, like CorTxt, comprises over 11,000 text messages will hold fewer than 200,000 words. One of the aims of text message corpora has been to empirically test the nature and degree of spelling variation, but we may have to wait for studies drawing on larger general corpora before statistically significant conclusions can be drawn.

The importance of size in corpora is a contentious topic but, depending on the purpose and nature of the data, size is not necessarily a problem. Indeed, the trend towards larger and larger corpora such as the Oxford English Corpus (over 2 billion words) is matched by parallel moves towards smaller, specialized corpora such as the text message corpora discussed here. Nonetheless, some of these ‘corpora’ comprise a handful of text messages, and the extent to which they can be considered corpora in the linguistic sense of the word is debatable. At what point does a collection of text messages become a corpus? Of course, this question cannot be answered solely with reference to size: the extent to which the corpus is structured for linguistic analysis, for example, is a central consideration.

To what extent is it possible to conduct corpus studies on text message data? In Tagg (2007; 2009), I used a word frequency list as a starting point to explore the startling frequencies of the and a in CorTxt – the, which in any corpus tends to be by far the most frequent word and around 3 times more frequent than a, was in CorTxt slightly less frequent than a. I concluded that, as well as by the omission of the and the speech-like nature of Txt, the reversed frequencies could be explained in part by the frequent phrases in which a, in particular, occurred. However, the nature of text messaging challenges the premises underlying word and word form frequency. Word forms are typically defined within corpus linguistics as a string of letters separated from other word forms by spaces (Sinclair, 1991: 28). How can stable word forms be identified in Txt, given the possibility that spacing may be omitted (resulting, for example, in alphanumeric sequences such as sleep4aweek) and that the same word may be respelt, resulting in different strings of letters? The answer may be that text message corpora, as with much historical data, require standardization before being processed using corpus analysis tools. Although, as discussed above, tools such as VARD are being trained on text message data, their use raises further questions.


If, as studies suggest, respellings index social identity, what is lost in a standardized corpus? If text messaging is part of a general move towards unregulated online writing, does this demand responses other than blanket standardization?

Finally, studies so far have shown that in many contexts text messages display a great deal of code-switching, as do synchronous online interactions (Danet and Herring, 2007). Analysis of Thai-English Instant Messaging and Facebook exchanges reveals a complex mix of languages, which suggests that switches cannot easily be predicted (Seargeant and Tagg, 2011). To what extent can current corpus tools handle such multilingual data? Given increasing user engagement with synchronous and often multilingual online interaction (including Twitter and social network sites such as Facebook), the above are issues that corpus linguists will increasingly have to tackle as they seek to explore these emerging practices.

One of the methodological advantages of text messaging over computer-mediated communication may be that it is easier to obtain biographical data than with online communication, which is often anonymous and which can be obtained (from public sites) without consulting participants (Thurlow and Poff, 2012). This data allows studies to consider gender, age and other social factors. Similar analysis of Twitter, for example, may be possible only with users whom the site has validated and who tend to be celebrities (Baron et al., 2011).
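The word-form problem raised in this section can be made concrete with a small sketch. It counts whitespace-delimited word forms before and after a toy standardization step; the messages and the STANDARD_FORMS lexicon are invented for illustration and are not drawn from CorTxt or from VARD.

```python
from collections import Counter

# Hypothetical respelling lexicon for illustration; these are not VARD's rules.
STANDARD_FORMS = {"u": "you", "ya": "you", "luv": "love", "da": "the"}


def word_forms(message: str) -> list[str]:
    """Treat every whitespace-delimited string as one word form."""
    return message.lower().split()


def standardize(tokens: list[str]) -> list[str]:
    """Replace known respellings with their standard equivalents."""
    return [STANDARD_FORMS.get(token, token) for token in tokens]


# Invented messages; only sleep4aweek is taken from the discussion above.
messages = ["I luv u", "where are u", "see ya", "love you", "sleep4aweek"]

raw = Counter(t for m in messages for t in word_forms(m))
standardized = Counter(t for m in messages for t in standardize(word_forms(m)))

print(raw["you"], raw["u"], raw["ya"])   # occurrences of you are split across three forms
print(standardized["you"])               # and merged after standardization
print(standardized["sleep4aweek"])       # unspaced sequences remain a single word form
```

Standardization merges the variants into one frequency count, but the record of which respelling each texter chose disappears with it. One possible response, assumed here rather than taken from the studies cited, is to store the original and standardized forms side by side so that both kinds of query remain available.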

Conclusion

How this research area develops is dependent both on the object of study and the field itself. Nobody expected text messaging to take off as it did – and nobody knows exactly where future advances will take us. Not only are developments in communications and mobile technology rapid, but the ways in which these are exploited can shift in unexpected directions. As of 2011, some observations can be made. First, the exchanging of short, text-based messages shows no sign of abating, despite the ability to send video or other multimodal messages; present trends suggest there is no reason to presume that we will stop texting any time soon. Second, the way in which texting relates to other forms of communication may be shifting, given the increasing use of social network sites (from which users can send short messages) and the fact that internet-based sites can be accessed through mobile phones. What may become interesting is the notion of choice – how users select one technology over another – and, with the understanding




that communication practices tend to emerge through interaction, increasingly fine-grained analysis of electronic contexts.

The extent to which text messaging is studied, and how studies are carried out, may depend in part on the availability of data. Software for downloading text messages onto a computer is at present not widely available, and rapid changes in mobile technology quickly render what software is available obsolete (Tim Grant, pers. comm.). It is likely that studies into online communication through public sites will grow rather more rapidly; many studies, for example, are taking advantage of the wealth of publicly available Twitter data (Honeycutt and Herring, 2009; Baron et al., 2011). However, given the similarities, we can assume that challenges and methods being explored in relation to this online data will also be relevant to text messaging.

Notes

1 In comparison, I is generally more frequent than you in spoken corpora, while in written corpora neither is at the top of the frequency list.
2 Respelling practices may be different among people, like smartphone users, who don’t use predictive texting.



References

Al-Khatib, M. and Sabbah, E. H. (2008). Language choice in mobile text messages among Jordanian university students. Sky Journal of Linguistics, 21: 37–65.
Anis, J. (2007). Neography: Unconventional spelling in French SMS text messages. In Danet, B. and Herring, S. C. (eds), The Multilingual Internet: Language, Culture, and Communication Online. Oxford: Oxford University Press, pp. 87–116.
Baron, A. and Rayson, P. (2009). Automatic standardisation of texts containing spelling variation: How much training data do you need? In Proceedings of Corpus Linguistics 2009. University of Liverpool, Liverpool.
Baron, A., Rayson, P. and Archer, D. (2009). Automatic standardization of spelling for historical text mining. In Proceedings of Digital Humanities 2009. University of Maryland, College Park, MD.
Baron, A., Tagg, C., Rayson, P., Greenwood, P., Walkerdine, J. and Rashid, A. (2011). Using verifiable author data: gender and spelling differences in Twitter and SMS. Paper presented at ICAME 32, Oslo, 1–5 June 2011.
Baron, N. (1998). Letters by phone or speech by other means: The linguistics of email. Language and Communication, 18: 133–70.
Carter, R. (2004). Language and Creativity: The Art of Common Talk. London: Routledge.
Chen, T. and Kan, M.-Y. (2011). Creating a live, public short message service corpus: the NUS SMS corpus. Submitted to Language Resources and Evaluation (Retrieved 6 January 2012 from http://arxiv.org/abs/1112.2468v1).
Crystal, D. (2008). Txtng: The Gr8 Db8. Oxford: Oxford University Press.


Danet, B. and Herring, S. (2007). Introduction: Welcome to the multilingual internet. In Danet, B. and Herring, S. (eds), The Multilingual Internet: Language, Culture, and Communication Online. Oxford: Oxford University Press, pp. 3–39.
Deumert, A. and Masinyana, S. O. (2008). Mobile language choices – The use of English and isiXhosa in text messages (SMS): evidence from a bilingual South African sample. English World-Wide, 29(2): 117–47.
Dürscheid, C. and Stark, E. (2011). SMS4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland. In Thurlow, C. and Mroczek, K. (eds), Digital Discourse: Language in the New Media. Oxford: Oxford University Press, pp. 299–320.
Fairon, C. and Paumier, S. (2006). A translated corpus of 30,000 French SMS. In Proceedings of LREC 2006, Geneva.
Fuchs, M. Ž. and Vuković, N. T. (2008). Communication technologies and their influence on language: reshuffling tenses in Croatian SMS text messaging. Linguistics (Jezikoslovlje), 9(1–2): 109–22.
Grant, T. (2010). Txt 4n6: Idiolect free authorship analysis? In Coulthard, M. and Johnson, A. (eds), Routledge Handbook of Forensic Linguistics. Abingdon: Routledge.
Grinter, R. E. and Eldridge, M. (2003). Wan2talk? Everyday text messaging. ACM Conference on Human Factors in Computing Systems (CHI) (Retrieved 5 November 2009 from http://www.cc.gatech.edu/~beki/c24.pdf).
Hård af Segerstad, Y. (2002). Use and adaptation of the written language to the conditions of computer-mediated communication. PhD thesis, Department of Linguistics, Göteborg University (Retrieved 5 November 2009 from http://www.ling.gu.se/%7eylvah/dokument/ylva_diss.pdf).
Herring, S. (2007). A faceted classification scheme for computer-mediated discourse. Language@Internet 1 (Retrieved 25 March 2010 from http://www.languageatinternet.de/articles/2007/761/index_html).
Honeycutt, C. and Herring, S. C. (2009). Beyond microblogging: Conversation and collaboration via Twitter. In Proceedings of the Forty-second Hawai’i International Conference on System Sciences (HICSS-42). Los Alamitos, CA: IEEE Press.
How, Y. and Kan, M.-Y. (2005). Optimizing predictive text entry for short message service on mobile phones. In Proceedings of HCII (Human Computer Interaction International), Las Vegas, Nevada, USA, 2005. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hutchby, I. and Tanna, V. (2008). Aspects of sequential organization in text message exchange. Discourse and Communication, 2(2): 143–64.
Kasesniemi, E. and Rautiainen, P. (2002). Mobile culture of children and teenagers in Finland. In Katz, J. and Aakhus, M. (eds), Perpetual Contact: Mobile Communication, Private Talk, Public Performance. Cambridge: Cambridge University Press, pp. 170–92.
Ling, R. and Yttri, B. (2002). Hyper-coordination via mobile phones in Norway. In Katz, J. and Aakhus, M. (eds), Perpetual Contact: Mobile Communication, Private Talk, Public Performance. Cambridge: Cambridge University Press, pp. 139–69.
Preston, D. (2000). ‘Mowr and mowr bayud spellin’: Confessions of a sociolinguist. Journal of Sociolinguistics, 4(4): 615–21.




Seargeant, P. and Tagg, C. (2011). English on the internet and a ‘post-varieties’ approach to language. World Englishes, 30(4): 496–514.
Sebba, M. (2007). Spelling and Society: The Culture and Politics of Orthography Around the World. Cambridge: Cambridge University Press.
Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.
Tagg, C. (2007). All a bit mind-boggling: Some observations on the frequency of a and the in text messaging. In Corpus Linguistics Conference 2007, University of Birmingham, July 2007.
Tagg, C. (2009). A corpus linguistics study of SMS text messaging. PhD thesis, Department of English, University of Birmingham (Retrieved 5 November 2009 from http://etheses.bham.ac.uk/253/1/Author09PhD.pdf).
Tagg, C. (2012). The Discourse of Text Messaging: Analysis of SMS Communication. London: Continuum.
Tagg, C., Baron, A. and Rayson, P. (2010). ‘I didn’t spel that wrong did i. Oops’: Analysis and standardisation of SMS spelling variation. In ICAME 31 Abstracts, Gießen, Germany, pp. 108–9.
Thurlow, C. (2003). Generation Txt? Exposing the sociolinguistics of young people’s text-messaging. Discourse Analysis Online 1(1) (Retrieved 5 November 2009 from http://extra.shu.ac.uk/daol/articles/v1/n1/a3/thurlow2002003-paper.html).
Thurlow, C. (2006). From statistical panic to moral panic: The metadiscursive construction and popular exaggeration of new media language in the print media. Journal of Computer-Mediated Communication, 11: 667–701.
Thurlow, C. and Poff, M. (2012). Text messaging. In Herring, S. C., Stein, D. and Virtanen, T. (eds), Handbook of the Pragmatics of CMC. Berlin and New York: Mouton de Gruyter.

Corpora Freely Available CorTxt (11,067 SMS in English, UK): available on request to the chapter author. NUS SMS Corpus (11,179 SMS in English and 4651 in Chinese, as of March 2011, National University of Singapore): available http://wing.comp.nus.edu.sg:8080/ SMSCorpus/. SMS Korpus (c. 1500 SMS in German, Germany): available http://www.mediensprache.net/de/networx/docs/networx-22.asp 13/07/2010.

To Buy
SMS4science (30,000 text messages, Belgium, French), see: http://www.sms4science.org/?qen.

Chapter 11

A Conceptual Model for Segmenting and Annotating a Documentary Photograph Corpus (DPC)
Gu Yueguo
3-M Learning Research Lab, Tongji University¹

Introduction: What is a Documentary Photograph Corpus and Why?

This chapter deals with a corpus of still images rather than a corpus of orthographic texts. Specifically, it is concerned with how to segment and annotate an image corpus comprising several million documentary photographs stored in digital form. Image data processing and retrieval is, first of all, theoretically interesting: mainstream corpus linguistics has been primarily concerned with orthographic text corpora, so the processing of an image corpus poses a serious challenge. It is also practically significant. Satellite images of the Earth and space, CT and fMRI images in hospitals, meteorological images for weather forecasting, traffic surveillance images, CCTV images for security in supermarkets, images in news agencies, etc. have become a huge asset for mankind, and all cry out for effective and efficient retrieval and maintenance. Furthermore, mobile phones, iPads, PDAs, etc. are all capable of receiving and sending images. The present project, like many other similar projects, also has a practical application in its initial design. The annotated images will be distributed through the internet and the intranet (i.e. LAN system) to mobile phones, PDAs and desktops. There is a long-term goal of integrating the images with existing written documents to reconstruct an image-based history. Since image processing is relatively new, and we shall unavoidably use some technical terminology that might make the text difficult to follow, it may be helpful, by way of background preparation, to start



Annotating a Photographic Corpus


Figure 11.1  The author in a Tibetan costume

by comparing it with the human segmentation and annotation of alphanumeric and image texts. To segment a text means to slice it into a series of segments. Take the name of this chapter’s author (Gu Yueguo) as an example. It is a text, though quite a short one, and an alphanumeric one too. Our purpose in segmenting it is to let the computer know that it is not an ordinary alphanumeric text but a personal name. This leads to the rule of segmentation: segment the text into two chunks, one for the family name and the other for other names. So we set up two segments: Gu and Yueguo. We then annotate the two segments by attaching to them labels that can be ‘understood’ by the computer. Using XML syntax, we have: <family_name>Gu</family_name> <other_name>Yueguo</other_name>. Up to this point the original alphanumeric text has been properly segmented and annotated. Now here is an image text (as shown in Figure 11.1) about the same author. Our task remains the same, that is, to segment this text into segments and annotate them so that the computer can ‘understand’ that the image shows the person whose name is Gu Yueguo. How can we do it? Obviously we cannot proceed the way we did with the alphanumeric text above. We need to set up the boundaries that circumscribe the segments, and a set of descriptive labels to be attached to the segments, so that the computer can ‘understand’ what the segments are about. It may have become clear to readers that there are two major differences between an alphanumeric text and an image text. First, the


former is linear, consisting of discrete letters, but the latter is not. The image text is two-dimensional, consisting of colours, shapes, distance, brightness vs. darkness, etc. Second, the latter displays far more information than the former, as is readily attested by the fact that a passport must carry the bearer’s photograph, properly taken and stamped. So much for the background preparation. Computer technology was initially designed for crunching alphanumeric data. Digitized images create a new data type, which traditional computer technology was not designed for. Moreover, as mentioned earlier, an image is two-dimensional, and thus takes far more storage space than a plain orthographic text. This means that it also consumes far more computing power than its counterpart. Nowadays word-based processing, text retrieval, concordances, word frequency statistics, etc. can be carried out with a few clicks of a mouse. In the case of an image corpus, however, even with state-of-the-art technology, it is still a formidable task to process it automatically and retrieve what is specifically wanted. Flickner et al. (1997: 8) write:

At an artificial intelligence conference several years ago, a challenge was issued to the audience to write a program that would identify all the dogs pictured in a children’s book, a task most 3-year-olds can easily accomplish. Nobody in the audience accepted the challenge, and this remains an open problem.

Although progress has been made in multimedia processing in the last decade (e.g. Djeraba, 2003; Dickinson et al., 2009; Tao et al., 2009), the challenge remains no less formidable. At present, what is realistically achievable is the formation of partnerships between human and machine to tackle the problem.
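The name example given earlier in this introduction can be sketched in a few lines of code. This is only an illustration of the segmentation rule (family name first, other names second) and of tagging the two segments; the wrapper element `name` and the function names are our own assumptions and are not part of the project’s tools.

```python
import xml.etree.ElementTree as ET


def segment_name(full_name: str) -> tuple[str, str]:
    """Split a name into (family name, other names).

    Assumes the family name comes first and is separated by a space,
    as in the example 'Gu Yueguo'.
    """
    family, _, others = full_name.strip().partition(" ")
    return family, others


def annotate_name(full_name: str) -> str:
    """Tag the two segments with family_name / other_name labels."""
    family, others = segment_name(full_name)
    root = ET.Element("name")  # hypothetical wrapper element
    ET.SubElement(root, "family_name").text = family
    ET.SubElement(root, "other_name").text = others
    return ET.tostring(root, encoding="unicode")


print(annotate_name("Gu Yueguo"))
# <name><family_name>Gu</family_name><other_name>Yueguo</other_name></name>
```

No comparably simple rule exists for the photograph in Figure 11.1, which is the point of the contrast drawn above.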
The primary objective of this chapter is to present a conceptual model of how the problem of segmentation and annotation will be dealt with in our project (note that it is by no means solved to the satisfaction of all stakeholders). The translation of the conceptual model into a data model, and its application and implementation, have to be addressed elsewhere for lack of space. In what follows, we shall first briefly review the literature by comparing image processing by the human brain/mind with image processing by the computer (Section 2). The human–machine partnership in image processing is also defined. This will prepare the way for a 3-layered model of manual segmentation and annotation of documentary photographs (Section 3). Segmentation and annotation schemes, GUI design, ontology and OWL are discussed and demonstrated. The chapter concludes by exploring




some theoretical issues associated with the project (Section 4), and a prospective application in the future (Section 5).

Human-Computer Image Processing Compared

It is an ultimate goal of artificial intelligence to simulate human intelligence, and image perception and processing by artificial intelligence certainly shares this ambition. In this section, we compare human and computer image processing in order to lay bare the differences between the two, as well as the gaps that state-of-the-art technology has to close before reaching that goal.

How the human brain perceives and interprets a photograph

Given a photograph, or an image in general, human perception of it can be analyzed as consisting of two processes. First, there is the unconscious process of taking light signals from the pupil to the retina, through the various neural pathways, and finally to the production of the perceived image. Second, there is the conscious process of image recognition, that is, the perceived image being seen as an object of a certain kind, which can be named by the perceiving brain/mind or remain anonymous to it. For many perceivers, image recognition is often accompanied by a judgemental response, for example like or dislike, friend or enemy, beautiful or ugly and so on. This response is obviously affected by the perceiver’s personal knowledge and social–cultural environment. In the literature, the unconscious process is studied by neuroscientists (e.g. Frishman, 2001; Levine, 2001; Wade and Swanston, 2001), while the conscious process is the subject matter of cognitive investigation (e.g. Solso, 1994; Kosslyn and Osherson, 1995). The judgemental aspect, on the other hand, catches the interest of those who are concerned with artistic appreciation (e.g. Barrett, 2000). The human brain/mind’s approach to a photograph or image, as discussed above, can thus be mapped out by a model involving three layers of operation: (1) the neural layer, (2) the cognitive layer and (3) the judgemental layer. We shall say no more about the first layer, as it goes beyond the scope of this chapter.
The other two layers will be dealt with in descriptive rather than in explanatory terms, that is, we take the brain/mind’s image recognition and appreciation as given and focus on the result of such operations. For example, given a photograph or image, we do not attempt to explain how the brain/mind processes it cognitively


as well as judgementally. Instead, we assume that it has done so, and proceed to describe what it has seen in the photograph or image. What is seen can be described in terms of layers too, that is, a photograph or image is analytically seen as consisting of layers of various kinds: (1) the form/structure layer, for example, layout, foreground, background, perspective, focus, etc.; (2) the content layer, for example, people, animals, scenes, plants, buildings, behaviours, captions, etc.; (3) the artistic layer, for example, composition techniques, photographic skills, special effects, etc. Figure 11.2 demonstrates graphically the way the human brain interacts with photographs in terms of the three layers outlined above. At the form/structure layer, the perspective (i.e. the left) and the significant background (i.e. the right) are illustrated. At the content layer, the human eye sees objects (i.e. the left image above), behaviour (i.e. the left image below) and people (i.e. the right image). Finally, at the artistic layer, the photograph at the centre is an image that became widely known following the Wenchuan earthquake. Its photographic skill has made the photograph extremely effective in fund-raising. The photograph on the left side has a caption that reads: 南京。1949年2月。两个中国孩子,浑然未觉历史的重大转折正从他们身边经过 (i.e. Nanjing. February 1949. The two Chinese children were totally unaware of the historical turning point

Figure 11.2  An analytic model of the human perception of images




passing them by). This caption casts the image in its historical perspective, and the photograph is thus made to tell forever untold stories. It is important to note that there is no one-to-one relationship between the 3-layered analysis of the brain/mind and the 3-layered analysis of what is seen in a photograph or image. In other words, we do not know whether the brain/mind’s perception of a photograph proceeds from the form/structure layer all the way up to the artistic layer, or starts from the content layer, or rather works from the top down. The 3-layered analysis sets three different challenges for both human and computer image processing. With sufficient and appropriate training, human segmentation and annotation of images can adequately cover the contents shown in the 3-layered analysis. Computer image processing, in contrast, with current state-of-the-art technology, can hardly deliver any of the contents automatically. Note that the hierarchy of the 3 layers is significant to both human and computer. For human intelligence, conformity and consistency in image segmentation and annotation decrease from the form/structure layer upwards. For artificial intelligence, automatic processing of the form/structure layer is much less challenging than that of the upper layers.

How the computer ‘perceives’ and ‘interprets’² a photograph

Whereas the way the brain’s neurons represent a mental image of a photograph remains largely mysterious, it is crystal clear to human developers how the computer represents it in its ‘brain’ (the computer is called ‘diannao’ 电脑, i.e. electronic brain, in Chinese). The computer can acquire a photograph via a scanner or a digital camera. The image thus acquired is displayed on the monitor through a graphics application program (popularly known as multimedia software) that generates and keeps an internal model of the image.
The graphic modelling this chapter deals with is bitmapped graphics, that is, the image, for example a photograph, is modelled by an array of pixel values. This is different from vector graphics, in which the image is stored as a mathematical description of a collection of individual lines, curves and shapes making up the image (see Chapman and Chapman, 2004: Ch. 4–5 for detailed discussion). In bitmapped graphics, what is perceived by the human eye as, say, a dog is in the computer ‘brain’ nothing but a pattern of pixel values. In linguistic terms, the computer brain has no semantics, only physical properties such as physical dots, logical pixels, coded colours and so on. Analytically, it is also helpful to think of the computer processing of images as 3-layered operations³: (1) the hardware


Figure 11.3  A demo of automatic content retrieval

layer, that is, physical pixels or dots; (2) the pixel layer, that is, logical pixels to be mapped to the physical pixels; and (3) the pattern layer, that is, brightness, colour regions, spatial layout, etc. Figure 11.3 graphically demonstrates what may be automatically extracted from a bitmapped photograph for content retrieval. At the hardware layer, the pixel patterns of the author’s nose and closed mouth are used to demonstrate the physical pixel effects. Note that the physical pixel effects vary across display devices (e.g. a high-resolution monitor, a low-resolution LCD of a mobile phone, etc.). The physical pixels are in essence the physical rendering of the logical pixels at the second layer, the pixel layer, which are arrays of pixel values. At the top layer, the pattern layer, as shown in Figure 11.3, it is now possible to detect colour patterns, shapes, grey levels, etc. automatically or semi-automatically.

Image retrieval: Human–computer partnership

In the literature that can be put under the broad category of computational linguistics, there are works bearing the title ‘content-based’ or ‘semantic-based’ image retrieval (e.g. Smeulders et al., 2000; Marques and Furht, 2002; Petković and Jonker, 2004; Zhang, 2007; see also Datta et al., 2008 for the latest survey). We have to be aware that the ‘content’ or ‘semantics’ these researchers have in mind is different from that found outside their research circle. From Lu (1999: 46) we can discern where their notion comes from:

Digital audio, image, and video are just a series of sampled values, with no obvious semantic structure. This is in contrast to alphanumeric information,




where each character is uniquely specified and identified by its ASCII code or other similar code. Based on this code, computers can search and retrieve relevant alphanumeric items from a database or document collection. To facilitate information retrieval, features or contents have to be extracted from raw multimedia data. Thus feature extraction, organization of extracted features and similarity comparison between feature descriptions are major design issues of MIRSs [multimedia information retrieval systems]. (italics added)

Two things are worth noticing. (1) Multimedia information (images being one kind) is contrasted with alphanumeric information. It is this contrast that leads to the notion of semantic structure, which alphanumeric information is assumed to have. This notion of semantic structure is totally different from that found in linguistics, and in semantics in particular. Alphanumeric information (e.g. the letter a) is said to have a semantic structure because it can be uniquely identified and retrieved. In the case of image information, no image can be uniquely identified unless it is properly annotated. (2) The notion of content refers to those features that can be extracted by the machine, that is, it corresponds to the features found at the pattern layer in Figure 11.3. As has become apparent, the machine-extracted features, the so-called content, lie outside the real-life concerns of human interaction with images. The message we want to drive home at this point is that current state-of-the-art technology in multimedia information processing in computational linguistics can only deliver low-level content/semantics. In most cases, however, image users are not interested in retrieving pixel arrays or patterns. In other words, what the computer is good at, as far as state-of-the-art technology goes, holds no attraction for human image users. They are interested in the high-level content/semantics shown in Figure 11.2.
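To make the notion of low-level ‘content’ concrete, the following minimal sketch treats a bitmap as an array of logical pixel values and derives two pattern-layer features from it. The pixel values, the 4×4 size and the function names are invented for illustration; real photographs are of course far larger.

```python
from collections import Counter

# A tiny 4x4 bitmap: each logical pixel is an (R, G, B) triple (hypothetical values).
RED, BLUE = (255, 0, 0), (0, 0, 255)
bitmap = [
    [RED, RED, BLUE, BLUE],
    [RED, RED, BLUE, BLUE],
    [RED, RED, RED, BLUE],
    [RED, RED, RED, BLUE],
]


def colour_histogram(image):
    """Count how often each colour value occurs (a pattern-layer feature)."""
    return Counter(pixel for row in image for pixel in row)


def mean_brightness(image):
    """Average of the per-pixel channel means, on a 0-255 scale."""
    pixels = [pixel for row in image for pixel in row]
    return sum(sum(p) / 3 for p in pixels) / len(pixels)


hist = colour_histogram(bitmap)
print(hist[RED], hist[BLUE])    # 10 6
print(mean_brightness(bitmap))  # 85.0
```

Features of this kind are what the machine can extract reliably; nothing in them says whether the photograph shows a dog, a person or an earthquake.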
The contrast between the two can be brought into sharper focus when the two sets of three layers discussed in 2.1 and 2.2 above are put in parallel, as shown in Figure 11.4. Flickner et al. (1997: 8) make the following comparison between humans and computers:

Perceptual organization – the process of grouping image features into meaningful objects and attaching semantic descriptions to scenes through model matching – is an unsolved problem in image understanding. Humans are much better than computers at extracting semantic


Figure 11.4  Two 3-layered models compared

descriptions from pictures. Computers, however, are better than humans at measuring properties and retaining these in long-term memories.

Since current automatic visual information retrieval cannot meet fast-growing practical demands, manual segmentation and annotation, though highly labour-intensive, is still the option we have to adopt. In other words, the human–machine partnership will remain the best practice for quite some time, until significant breakthroughs take place on the machine side. A bitmapped photograph, as shown in Figure 11.5, can be dealt with simultaneously at a range of interconnected layers. Each of the 3 layers at the bottom can generate a set of data automatically. Each of the 3 layers at the top requires manual segmentation and annotation.
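The division of labour just described can be pictured as a simple record with six slots, three filled by the machine and three by the annotator. This is a sketch of the idea only, not the project’s data model; the class name, field names, identifier and sample values are all hypothetical.

```python
from dataclasses import dataclass, field

MACHINE_LAYERS = ("hardware", "pixel", "pattern")         # generated automatically
HUMAN_LAYERS = ("form_structure", "content", "artistic")  # annotated manually


@dataclass
class PhotoRecord:
    """One photograph with data attached at each of the six layers."""
    photo_id: str
    machine: dict = field(default_factory=dict)
    human: dict = field(default_factory=dict)

    def pending_manual_layers(self):
        """Human layers that still await segmentation and annotation."""
        return [layer for layer in HUMAN_LAYERS if layer not in self.human]


record = PhotoRecord("photo_0001")  # hypothetical identifier
record.machine["pattern"] = {"dominant_colour": "grey"}
record.human["content"] = ["person", "girl"]
print(record.pending_manual_layers())  # ['form_structure', 'artistic']
```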

Segmenting and Annotating Documentary Photographs

It is time to show how we approach the segmentation and annotation of documentary photographs in bitmapped graphics. Basically it is a computer-assisted human effort. The machine will first run through the corpus to achieve what it can reliably do before the job is handed over to manual work. The latter is aided by a tool specially designed for the purpose.

Guiding principles for segmentation and annotation: The purpose and degree of granularity

As the aphorism says, a picture is worth a thousand words. In our vocabulary, an image text is information-rich. Moreover, it is possible to look at the same image from different perspectives, and the interpretations thus generated are equally valid. Elsewhere we call this fact the perspective




Figure 11.5  A model of human–machine partnership

multiordinality (Gu, 2006: 129). Bearing this in mind, the first thing we should do before starting in earnest is to articulate as succinctly as we can: For what purpose(s) do we segment and annotate the image corpus? In our case, the images we deal with are documentary photographs, and the purpose of segmentation and annotation is to enable machine-based retrieval of those image contents that are historically significant. For instance, users may want to retrieve photos that show Chairman Mao declaring the founding of the People’s Republic of China in 1949 so that the retrieved photos can be integrated with orthographic texts about the

a. Person
b. Person + girl
c. Person + girl + in a stripe vest
d. Person + girl + in a stripe vest + appearing scared
e. Person + girl + in a stripe vest + appearing scared + receiving oxygen
f. Etc.

Figure 11.6  Degrees of granularity

same event, and with Chairman Mao’s own voice performing that particular speech act. This will generate a multimodal text comprising the image, the orthographic text and the audio. As another instance, users may want to retrieve photographs showing the devastation of an earthquake as initial input for group discussions or writing assignments in a language teaching context. Thus our work focuses on two layers: the form/structure and the content (see Figure 11.2 and also Figure 11.5). It is obviously closely related to those historical documents (alphanumeric texts or otherwise) that cover more or less the same real-life events. The research purpose determines the degree of granularity in segmentation. Let me illustrate this notion by example. The little girl’s photograph cited in Figure 11.2 above can be segmented and annotated in varying degrees of granularity, as shown in Figure 11.6.

Retrieval: Fuzziness, similarity and verbal description

Human annotation involves a vocabulary used to describe what is perceived in images. This vocabulary is technically called a tagset or metalanguage. It is general practice for human annotation to adopt a natural language (i.e. Chinese in our case) as its metalanguage. As we all know, there are image features that meet the eye but go beyond words in two ways: one is that there is no verbal expression for them at all in the existing lexicon of the natural language used as the metalanguage of annotation; the other is that the expression lies beyond the annotator’s acquired vocabulary. The second difficulty can easily be overcome by incorporating a machine-readable dictionary, and a thesaurus in particular, into the database. The first, however, cannot readily be remedied. It is in this sense that the notion of fuzziness of image features is used in this chapter.
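The degrees of granularity in Figure 11.6 can be read as a cumulative list of labels, where each finer level adds one descriptor. The sketch below, which is illustrative only, shows why the chosen granularity matters for retrieval: a query can only find what the annotation has recorded. The function names are our own.

```python
# Cumulative labels a-e from Figure 11.6; each level adds one descriptor.
GRANULARITY = [
    "person",
    "girl",
    "in a stripe vest",
    "appearing scared",
    "receiving oxygen",
]


def annotate(level: int) -> list[str]:
    """Return the labels for a given degree of granularity (1 = coarsest)."""
    if not 1 <= level <= len(GRANULARITY):
        raise ValueError("level out of range")
    return GRANULARITY[:level]


def matches(annotation: list[str], query: list[str]) -> bool:
    """A photo matches when every queried label is among its labels."""
    return set(query) <= set(annotation)


coarse = annotate(2)  # person + girl
fine = annotate(5)    # all five descriptors
print(matches(coarse, ["girl"]))                    # True
print(matches(coarse, ["receiving oxygen"]))        # False
print(matches(fine, ["girl", "receiving oxygen"]))  # True
```

A coarse annotation is cheaper to produce but cannot serve a query about oxygen; whether that matters depends on the research purpose.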




It must be noted that in computational linguistics fuzziness often refers to the fact that image information is intrinsically fuzzy in comparison with alphanumeric information (recall the relevant discussion in Section 1 above). This can be clearly seen when a concordance program retrieves keywords from an alphanumeric corpus. All the instances containing the keywords being searched for will be uniquely identified and retrieved without exception. In the case of image retrieval, this cannot be achieved, for the retrieval is based on similarity rather than on exact matching. Consequently image retrieval is different from orthographic keyword retrieval. In the literature, there are several image retrieval strategies: (1) query by example (QBE for short), i.e. post or select an image as an example, and let the system retrieve images similar to the example; (2) query by colour matching, i.e. post or select a colour, and let the system match the colour; (3) query by drawing, i.e. draw a shape and let the system match the shape; (4) query by annotation, i.e. orthographic keyword searching as found in alphanumeric corpora; (5) query by interactive feedback, i.e. the system interacts with the user through a question/answer cycle before retrieving what is searched for; and (6) query by hybrid methods, i.e. a combination of methods 1–5. Our project adopts the last retrieval strategy.

Segmenting and annotating interface: GUI design

The 3-layered structure of human manual effort serves two interconnected roles: (1) it provides a way of organizing data; and (2) it focuses the annotator’s attention on a particular aspect of the image in question. These two roles feed into GUI development. For each layer there is an annotation scheme, which can be activated by the annotator. The annotation is supported by a domain-specific knowledge base constructed in OWL (i.e. Web Ontology Language, see http://www.w3.org/TR/owl-features/, and further discussion in 3.5 below).
Segmentation of a photograph is enabled by a set of segmenting tools consisting of line, curve, square, oval and polygon. The annotator makes use of these to segment the picture into still regions, that is, units for annotation. The annotation schemes mentioned above provide the annotator with a standard vocabulary for tagging the segmented units (see further discussion in 3.4 below). The design of the segmentation and annotation interface is shown in Figure 11.7.⁴
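The difference between exact keyword matching and similarity-based retrieval, and the way a hybrid query combines them, can be sketched as follows. This is not the project’s retrieval engine: histogram intersection is simply one well-known similarity measure that we assume here for illustration, and the class, tags, histograms and threshold are invented.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    photo_id: str
    tags: set        # manual annotations, used for query by annotation
    histogram: dict  # normalised colour histogram, used for query by example


def histogram_intersection(h1: dict, h2: dict) -> float:
    """Similarity in [0, 1] for two histograms whose values each sum to 1."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))


def hybrid_query(corpus, keywords: set, example: dict, threshold: float = 0.5):
    """Keep photos carrying all keywords, then rank them by similarity to the example."""
    hits = [p for p in corpus if keywords <= p.tags]  # exact matching
    scored = [(histogram_intersection(p.histogram, example), p.photo_id) for p in hits]
    return [pid for score, pid in sorted(scored, reverse=True) if score >= threshold]


corpus = [
    Photo("p1", {"earthquake", "girl"}, {"grey": 0.75, "red": 0.25}),
    Photo("p2", {"earthquake"}, {"grey": 0.25, "blue": 0.75}),
    Photo("p3", {"basketball"}, {"grey": 0.75, "red": 0.25}),
]
print(hybrid_query(corpus, {"earthquake"}, {"grey": 1.0}))  # ['p1']
```

The keyword step either matches or does not, whereas the similarity step returns a graded score that has to be cut off at some threshold; this is the sense in which image retrieval is fuzzy.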


Figure 11.7  GUI design

Annotation schemes and MPEG-7 standards

It is of course quite legitimate for each project to develop its own annotation schemes. There is a price to pay for doing so, however: data exchange between projects and across platforms becomes very difficult. Hence it is much wiser, and more beneficial for the academic community as a whole, to adopt a general standard. The annotation schemes we develop are based on MPEG-7, the international standard known as the Multimedia Content Description Interface. Chiariglione (2002: 5) succinctly captures MPEG-7 thus: ‘MPEG-7 will fulfil a key function in the forthcoming evolutionary steps of multimedia. As much as MPEG-1, MPEG-2 and MPEG-4 provided the tools through which the current abundance of audiovisual content could happen, MPEG-7 will provide the means to navigate through this wealth of content.’ The annotation schemes mentioned above are conceptually based on the MPEG-7 standard, and on image description in particular. For detailed discussion of MPEG-7, readers are advised to consult Manjunath et al. (2002), Kosch (2004) and Kim et al. (2005). Following the standard, we organize annotation schemes into two major groups: (1) image content description and (2) image content management. The image content description in turn consists of (a) image content entity, where the low-level content of images is handled, and (b) image content abstraction, where human manual annotation is heavily required to cover information such as world (see Section 3.5 on ontology below), model, summary, view and variation. The image content management, on the other hand, deals with (a) user description, (b) creation description, (c) usage




Figure 11.8  Map of image description schemes

description and (d) classification description. Figure 11.8 is a map of image description schemes. MPEG-7 specifications for the low-level content (e.g. colour, still region, spatial layout) are far more detailed than those for the high-level content (e.g. people, objects, animals). This is understandable, since the high-level content varies greatly from project to project. As far as our project is concerned, the artistic and content layers of the manual part, as shown in Figure 11.5, are the areas where extensions to MPEG-7 are required. This will not pose a serious problem, since MPEG-7 specifies a Description Definition Language (DDL) for the purpose. Note that the 6-layered structure shown in Figure 11.5 is an analytic structure, that is, our organized way of dealing with the complexity of the problem. The structure of image annotation schemes is a conceptual structure for describing the image data and for developing annotation tools. As indicated in its title, this chapter will not deal with implementation, that is, turning the conceptual structure into an application product. Annotation schemes also include a classification scheme, and provide vocabularies for descriptive purposes. As indicated in Figure 11.8, a classification scheme is used as part of image content management. We start with a generic classification scheme covering a wide range of activities on the basis of human ecology (as spelt out in Fellmann et al., 1995), which is modelled as consisting of interactive tiers: (1) the tier of nature, that is, the natural environment before being ‘invaded’ by human development; (2) the tier of habitats, that is, human occupational patterns effected on the natural environment; (3) the tier of basic living, for example, fishing, gathering, farming, mining, etc.; (4) the tier of social–cultural activities, that is, executive and management activities, arts, theatres, etc.; and (5) the political–ideological tier, that is, political parties, nation-state activities, etc.
The documentary photographs are mapped onto this human ecology, as shown in Figure 11.9.
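The organization of the annotation schemes and the five ecological tiers can be summarized as plain data structures. The sketch below is a reading aid rather than an MPEG-7 implementation: the identifiers are our own renderings of the terms in the text, and the keyword-to-tier lookup is a hypothetical stand-in for the project’s actual classification scheme.

```python
from enum import Enum


class EcologyTier(Enum):
    """The five interactive tiers of human ecology listed in the text."""
    NATURE = 1
    HABITATS = 2
    BASIC_LIVING = 3
    SOCIAL_CULTURAL = 4
    POLITICAL_IDEOLOGICAL = 5


# The two major groups of annotation schemes and their parts.
ANNOTATION_SCHEMES = {
    "image_content_description": ["content_entity", "content_abstraction"],
    "image_content_management": [
        "user_description",
        "creation_description",
        "usage_description",
        "classification_description",
    ],
}

# Hypothetical lookup built from the examples given for each tier.
ACTIVITY_TIER = {
    "fishing": EcologyTier.BASIC_LIVING,
    "farming": EcologyTier.BASIC_LIVING,
    "mining": EcologyTier.BASIC_LIVING,
    "theatre": EcologyTier.SOCIAL_CULTURAL,
    "political party": EcologyTier.POLITICAL_IDEOLOGICAL,
}


def classify(activity: str):
    """Return the tier for a known activity keyword, or None if it is unlisted."""
    return ACTIVITY_TIER.get(activity)


print(classify("fishing"))   # EcologyTier.BASIC_LIVING
print(classify("swimming"))  # None
```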


Figure 11.9  Human ecology and mapping of documentary photographs

Vocabularies for descriptive purposes can be based on two kinds of thesaurus: (1) a thesaurus of lexical items classified according to domains of knowledge or life activities and (2) a thesaurus of synonyms and antonyms. For our project we have used the Classified Dictionary of Modern Chinese (《现代汉语分类词典》) by Dong (1998) and A Thesaurus of Synonyms (《同义词词林》) by Mei et al. (1996). These two are helpful in providing a classified vocabulary whose items serve as potential candidates for image descriptors. When the annotator annotates an image using the GUI (see Figure 11.7 above), s/he is prompted with a pull-down list of vocabulary to choose from. In practice, the annotator is encouraged to provide feedback on the vocabulary and to suggest revisions or alternatives if the prompts are found inappropriate.

Ontology, knowledge base and OWL

A documentary photograph is a snapshot of a real-life event. In other words, it is part of a particular domain of knowledge and experience. Segmentation and annotation, however, treat a documentary photograph as an independent object. Thus there is a knowledge gap between the segmented and annotated photograph and the event from which it was initially taken. Human perceivers can bridge the gap through personal direct or indirect




experiences, whereas the computer ‘brain’ cannot do so without being explicitly programmed for the purpose. This is where ontology, knowledge base and OWL are brought in. The term ontology is used here in the sense found in AI. It refers to a formal, explicit description of concepts in a domain of knowledge. Knowledge representation is one of the fundamental research areas of AI (see Russell and Norvig, 2003; Cuena et al., 2004). More recently, it has been borrowed extensively in the development of the semantic web. The research literature is growing extremely fast, for example Motta et al. (2005), Walton (2007) and Thuraisingham (2008), to name but a few. The W3C has set up various committees to work on standards and recommendations in order to coordinate international research efforts. RDF and OWL, which this chapter adopts, are two languages on the W3C’s recommended list, and are widely used by researchers for ontology construction. As Walton (2007: 6) points out, ‘Designing a formal ontology is a difficult task. Even a simple ontology, such as our Camera example, can take a lot of effort to produce.’ This remark is quickly appreciated by anyone who has tried their hand at it. Moreover, given a domain of knowledge, there is no one correct way of constructing an ontology for it; there is a range of possible ways. The choice largely depends on the purpose for which the ontology is to be constructed. Our project requires the construction of ontologies for various domains of knowledge. In this chapter, we report our effort for only one domain, i.e. sports. We conceptualize this domain of discourse using a set of 7 generic concepts: CompetitiveGames, PlayBehavior, PlayFacility, PlaySpace, PlayRules, PlayTime and PlayTeam.⁵ These 7 concepts are regarded as necessary conditions for anything to be a sports game. For each sports game, for example a basketball game, there are 7 corresponding conditions specific to the game.
In the language of OWL, the 7 generic concepts are superclasses, and the 7 corresponding conditions for the basketball game are subclasses of those superclasses. All sports games can be conceptualized and modelled in this way. We have used Protégé (4.1 beta version) to construct a skeleton ontology for sports, with a basketball game as a test case. Some sample coding is shown below:

Sample coding of classes:

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#BallGames -->
    <owl:Class rdf:about="&Ontology_basketball_v2;BallGames">
        <rdfs:subClassOf rdf:resource="&Ontology_basketball_v2;CompetitiveGames"/>
    </owl:Class>

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#Basketball -->
    <owl:Class rdf:about="&Ontology_basketball_v2;Basketball">
        <rdfs:subClassOf rdf:resource="&Ontology_basketball_v2;BasketballFacility"/>
        <owl:disjointWith rdf:resource="&Ontology_basketball_v2;BasketballNet"/>
    </owl:Class>

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#BasketballBehavior -->
    <owl:Class rdf:about="&Ontology_basketball_v2;BasketballBehavior">
        <rdfs:subClassOf rdf:resource="&Ontology_basketball_v2;PlayBehavior"/>
    </owl:Class>

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#BasketballCoach -->
    <owl:Class rdf:about="&Ontology_basketball_v2;BasketballCoach">
        <rdfs:subClassOf rdf:resource="&Ontology_basketball_v2;KeyMembers"/>
    </owl:Class>

Sample coding of object properties:

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#hasBasketballTime -->
    <owl:ObjectProperty rdf:about="&Ontology_basketball_v2;hasBasketballTime">
        <rdfs:domain rdf:resource="&Ontology_basketball_v2;BasketballGame"/>
        <rdfs:range rdf:resource="&Ontology_basketball_v2;BasketballTime"/>
        <rdfs:subPropertyOf rdf:resource="&Ontology_basketball_v2;hasPlayTime"/>
    </owl:ObjectProperty>

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#hasCountryOfOrigin -->
    <owl:ObjectProperty rdf:about="&Ontology_basketball_v2;hasCountryOfOrigin"/>

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#hasPlayBehavior -->
    <owl:ObjectProperty rdf:about="&Ontology_basketball_v2;hasPlayBehavior">
        <rdfs:range rdf:resource="&Ontology_basketball_v2;Sports"/>
        <rdfs:domain rdf:resource="&Ontology_basketball_v2;Sports"/>
    </owl:ObjectProperty>

    <!-- http://www.semanticweb.org/ontologies/2011/0/Ontology_basketball_v2.owl#hasPlayFacility -->
    <owl:ObjectProperty rdf:about="&Ontology_basketball_v2;hasPlayFacility">
        <rdfs:range rdf:resource="&Ontology_basketball_v2;Sports"/>
        <rdfs:domain rdf:resource="&Ontology_basketball_v2;Sports"/>
    </owl:ObjectProperty>

Figure 11.10 is an OntoGraf view of the skeleton ontology generated automatically by Protégé.
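The superclass/subclass idea behind the sample coding can be illustrated outside OWL with a minimal Python sketch. It is not part of the project's toolchain: the table below is a hand-written mirror of a few subclass links from the sample coding, the link from BasketballGame to BallGames is an assumption made only to show a two-step chain, and the function names are hypothetical.

```python
# Hand-written mirror of a few subclass links; not generated from the OWL file.
SUBCLASS_OF = {
    "BallGames": "CompetitiveGames",
    "BasketballBehavior": "PlayBehavior",
    "BasketballCoach": "KeyMembers",
    # Assumed link, added only to illustrate a two-step chain.
    "BasketballGame": "BallGames",
}


def superclasses(name, links=SUBCLASS_OF):
    """Return the chain of superclasses above `name`, nearest first."""
    chain = []
    seen = {name}
    while name in links:
        name = links[name]
        if name in seen:  # guard against accidental cycles in the table
            break
        seen.add(name)
        chain.append(name)
    return chain


def is_subclass_of(child, parent, links=SUBCLASS_OF):
    """True if `parent` is reachable from `child` by following subclass links."""
    return parent in superclasses(child, links)


print(superclasses("BasketballGame"))
```

Following the links transitively is what allows a query phrased in terms of a generic concept to reach game-specific classes.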

Corpus Linguistics and Ontology: A Theoretical Exploration

Segmentation/annotation plus ontology: A critical evaluation

The methodology of using segmentation/annotation buttressed with ontology in still image processing and retrieval is not a fresh contribution of this chapter. It is already found in Wielemaker et al. (2003), Zhang (2006) and Ferecatu et al. (2008). These studies, including the present chapter, are however exploratory at best. This is because, although the method of segmentation and annotation has been well tested in corpus linguistics, ontology construction is still at a formative stage. It is, however, extremely promising to combine the two. Segmentation and annotation are fundamentally instance-based: it is like segmenting and annotating separate trees and plants in a forest. Ontology is concept-based: it is like mapping the forest in terms of tree and plant types and its general patterns. The two are obviously complementary.

Figure 11.10  A skeleton ontology for sports (using Protégé)




Integrating segmentation/annotation and ontology

The integration of segmentation/annotation with ontology is by no means straightforward. With a small corpus the integration seems to be much easier. MINDSWAP Research Lab has developed an application named ‘PhotoStuff’ (see http://www.mindswap.org/2003/PhotoStuff), in which segmentation/annotation and ontology are dealt with together. Ferecatu et al. (2008), on the other hand, adopt a different approach, in which segmentation and annotation of images are handled in a database. Our project is in line with Ferecatu, Boujemaa and Crucianu, but instead of adopting an external ontology, we would construct a library of ontologies for major domains of knowledge, as MINDSWAP Research Lab attempts to do. Database technology is well tested, and is hence much safer to use with a corpus of several million photographs. Ontologies constitute an abstract knowledge layer on top of the segmented/annotated corpus/database. Technically speaking, ontologies and the corpus/database can both adopt RDF/XML syntax. Moreover, ontology and annotation are both semantic in nature. So the real challenge in integrating the two is how to integrate verbal text coding with bitmapped pixel coding. For in-depth discussions of multimedia data integration, readers are advised to consult Maragos et al. (2008).

As mentioned above, our project relies on manual segmentation and annotation, which is obviously extremely labour-intensive. Readers may wonder whether it is wise to do so. Why not explore ‘intelligent’ multimedia processing techniques, as reported in the literature? One of our difficulties in exploring this alternative lies in the fact that the documentary photograph corpus covers a time span of about a hundred years, and the image quality varies enormously. It would be very difficult to write an intelligent program to segment and annotate such a corpus automatically. As shown in Figure 11.5, there is a partnership between the human and the machine. The human segmentation and annotation efforts will be learnt by the machine through a machine learning mechanism. In other words, the more human effort is put into segmentation and annotation, the more intelligent the machine will become as a result of learning. It is hoped that the initial labour-intensive input will be rewarded eventually.

Conclusion: Application Prospective

With digital image technology becoming part of everyday life, academics, industrialists and businessmen all see the importance of image processing. In applied linguistics, audio and video materials have long been appreciated for their instructional value. Still images, and documentary photographs in particular, are seriously undervalued. I would like to mention three areas where an image corpus may be found extremely valuable. First, vocabulary learning and teaching will be enhanced enormously when words are learnt together with images, as demonstrated in Mayer (2001). A segmented and annotated image corpus will be extremely handy for both learners and teachers. Second, images will serve well as initial inputs for descriptive practice, group discussions, writing assignments, etc. A segmented and annotated image corpus will enable the user to construct tasks in a far more systematic and coherent manner. Third, the preparation of illustrated textbooks and dictionaries will be accelerated beyond measure with the help of a corpus of this kind. Try it yourself and I’ll compensate you should you regret having done so!

Notes

1 This project is partially funded by the Multimodal, Multimedia and MultiEnvironment Learning Research Lab jointly run by Tongji University and Beijing Foreign Studies University.
2 A robot fitted with a camera sensor can literally perceive an image in its environment and recognize it as a particular object (i.e. interpret it). Empowering a robot with such a capability constitutes a special area of study known as computer vision (see Forsyth and Ponce, 2003). The sense of perceiving and interpreting used in this chapter, however, is rather metaphorical.
3 The magic number 3 is a bit arbitrary. It is chosen for the sake of easy comparison between the human and computer brains.
4 This interface has no originality of its own. It is a sort of standard interface found in similar applications.
5 We have adopted the naming conventions recommended by Protégé (v. 4.1 beta): an initial capital for class names and an initial lower-case letter for object property names.









References

Barrett, T. (2000). Criticizing Photographs: An Introduction to Understanding Images, 3rd edn. Boston, MA: McGraw Hill.
Chapman, N. and Chapman, J. (2004). Digital Multimedia, 2nd edn. Chichester: Wiley.
Chiariglione, L. (2002). Introduction to MPEG-7: Multimedia content description. In Manjunath, B. S. et al. (eds), pp. 3–6.
Cuena, J., Demazeau, Y., Serrano, A. G. and Treur, J. (eds) (2004). Knowledge Engineering and Agent Technology. Amsterdam: IOS Press.
Datta, R., Joshi, D., Li, J. and Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2): 5:1–5:60.
Dickinson, S. J., Leonardis, A., Schiele, B. and Tarr, M. J. (eds) (2009). Object Categorization: Computer and Human Vision Perspectives. Cambridge: Cambridge University Press.
Djeraba, C. (ed.) (2003). Multimedia Mining: A Highway to Intelligent Multimedia Documents. Boston, MA: Kluwer Academic Publishers.
Dong, D. (1998). Classified Dictionary of Modern Chinese (《现代汉语分类词典》). Shanghai: Shanghai Lexicography Publisher.
Fellmann, J., Getis, A. and Getis, J. (1995). Human Geography. Dubuque, IA: Wm. C. Brown Communications, Inc.
Ferecatu, M., Boujemaa, N. and Crucianu, M. (2008). Interactive image retrieval using a hybrid visual and conceptual content representation. In Maragos, P., Potamianos, A. and Gros, P. (eds), pp. 221–40.
Flickner, M. et al. (1997). Query by image and video content: The QBIC™ system. In Maybury, M. T. (ed.), pp. 7–22.
Forsyth, D. A. and Ponce, J. (2003). Computer Vision: A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.
Frishman, L. (2001). Basic visual processes. In Goldstein, B. E. (ed.), pp. 53–91.
Gu, Y. (2006). Multimodal text analysis: A corpus linguistic approach to situated discourse. Text and Talk, 26(2): 127–67.
Kim, H.-G., Moreau, N. and Sikora, T. (2005). MPEG-7 Audio and Beyond. West Sussex: John Wiley & Sons, Ltd.
Kosch, H. (2004). Distributed Multimedia Database Technologies Supported by MPEG-7 and MPEG-21. Boca Raton, FL: CRC Press.
Kosslyn, S. M. and Osherson, D. N. (eds) (1995). Visual Cognition: An Invitation to Cognitive Science. Cambridge, MA: The MIT Press.
Levine, M. W. (2001). Principles of neural processing. In Goldstein, B. E. (ed.), pp. 24–52.
Lu, G. (1999). Multimedia Database Management Systems. Boston, MA: Artech House.
Manjunath, B. S., Salembier, P. and Sikora, T. (eds) (2002). Introduction to MPEG-7: Multimedia Content Description Interface. West Sussex: John Wiley & Sons, Ltd.
Maragos, P., Potamianos, A. and Gros, P. (eds) (2008). Multimodal Processing and Interaction: Audio, Video, Text. New York: Springer.
Marques, O. and Furht, B. (2002). Content-Based Image and Video Retrieval. Boston, MA: Kluwer Academic Publishers.
Mayer, R. E. (2001). Multimedia Learning. Cambridge: Cambridge University Press.
Mei et al. (1996). A Thesaurus of Synonyms (《同义词词林》). Shanghai: Shanghai Encyclopedia Publisher.
Motta, E., Shadbolt, N., Stutt, A. and Gibbins, N. (eds) (2005). Engineering Knowledge in the Age of the Semantic Web. Berlin: Springer.
Petković, M. and Jonker, W. (2004). Content-Based Video Retrieval: A Database Perspective. Boston, MA: Kluwer Academic Publishers.
Russell, S. J. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach, 2nd edn. Upper Saddle River, NJ: Pearson Education, Inc.
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A. and Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12): 1349–80.
Solso, R. L. (1994). Cognition and the Visual Arts. Cambridge, MA: The MIT Press.
Tao, C., Xu, D. and Li, X. (eds) (2009). Semantic Mining Technologies for Multimedia Databases. Hershey, PA: Information Science Reference.
Thuraisingham, B. (2008). Building Trustworthy Semantic Webs. New York: Auerbach Publications.
Wade, N. J. and Swanston, M. T. (2001). Visual Perception: An Introduction, 2nd edn. East Sussex: Psychology Press Ltd.
Walton, C. D. (2007). Agency and the Semantic Web. Oxford: Oxford University Press.
Wielemaker, J., Schreiber, A. Th. and Wielinga, B. J. (2003). Supporting semantic image annotation search. In Handschuh, S. and Staab, S. (eds), pp. 147–55.
Zhang, Y.-J. (ed.) (2006). Advances in Image and Video Segmentation. Hershey, PA: IRM Press.
Zhang, Y.-J. (ed.) (2007). Semantic-Based Visual Information Retrieval. Hershey, PA: IRM Press.

Part Five

Corpora, Language Learning and Pedagogy


Chapter 12

Learner Corpora and Second Language Acquisition

Chau Meng Huat

Introduction

Learner corpus research has in recent years emerged as an area of distinctive – and to some extent, fashionable – study in applied linguistics. It can be traced back to work produced and published in the early 1990s, most notably by Milton and Tsang (1993) and Granger (1993). This specific area of research, like other domains of corpus linguistics, values the size of data collected, spoken or written, analyzed using computer-assisted techniques. While ‘large’ can be beautiful, there are challenges that remain to be addressed in learner corpus research in the study of second language (L2) acquisition.

This chapter considers the contribution of learner corpus research to enhancing our understanding of L2 use and development, and discusses some challenges in need of close attention for further meaningful research and practice. To do so, I examine the extent to which learner corpus studies have addressed central areas of concern in second language acquisition (SLA). A discussion follows of current reconceptualizations of learner language critical for reflection, before a study is presented to illustrate how a corpus approach to exploring learner language can reveal interesting observations about the dynamic process of L2 use and development.

Learner Corpus Research: Developments and Contributions

Over the past 40 years of exponential growth in the field, five central areas have captured the attention of most SLA researchers (Ortega, 2007). They are:

1. The nature of L2 knowledge and cognition;
2. The nature of learner language;
3. The role of the first language (L1);
4. Contributions of the linguistic environment; and
5. The role of instruction.

Each of these five areas relates to one or more of a set of ten ‘observed phenomena’ based on well-established findings in SLA (see VanPatten and Williams, 2007). Area (1), for example, relates to the observation that a good deal of L2 learning happens incidentally, while Area (5) highlights that there are limits on the effects of instruction on L2 learning. Within the five key areas, learner corpus research is making its mark most evidently on the broad areas of (2) the nature of learner language and (3) the role of the L1.

As far as the nature of learner language is concerned, learner corpus studies have been most productive in examining and revealing, both quantitatively and qualitatively, the characteristics of the language of advanced learners. Many of these studies were conducted using Contrastive Interlanguage Analysis (CIA), which involves comparing the language performance of learners with that of native speakers, or comparing learners of different L1s. A good example of a CIA-based study is Granger (1998b), who compared the use of prefabricated patterns such as amplifiers ending in -ly (e.g. deeply moved) in a corpus of native speaker English writing and in a corpus of writing by advanced French-speaking learners of English from the International Corpus of Learner English (ICLE) project. She observed that the learners produced too few native-like prefabs and overused those they did produce. Ringbom (1998), De Cock et al. (1998) and Petch-Tyson (1998) are three other studies, all published in Granger (1998a), which compared advanced learner language with native speaker language. Ringbom, like Granger (1998b), found that learners tend to overuse a small range of high-frequency, ‘general’ vocabulary items such as people, things and get. This pattern of ‘fewer items repeated more’ in learner language was also observed in De Cock et al.’s study of vagueness tags (e.g. and stuff like that) based on university admission interviews. Petch-Tyson, on the other hand, studied signals of writer/reader visibility by comparing the presence and extent of spoken language features (e.g. the use of first and second person pronouns) in learner writing and in native speaker writing, and found that learner writing tends to resemble ‘talk written down’. These three studies, examining learner language at word, phrase and discourse level respectively, were replicated by Cobb (2003) based on comparable corpora developed in Canada. Similar findings were observed. Reviewing these findings, Granger (2008) notes the following:

    The learner corpus studies conducted over the last decade or so have given us a much more accurate picture of advanced EFL interlanguage. This appears clearly from a study by Cobb (2003) who replicated three European learner corpus studies with Canadian data and found a high degree of similarity. (Granger, 2008: 269)

Another important observation that has emerged from learner corpus research, as Gilquin et al. (2007) point out, is that some of the linguistic features of learner language are shared by learners from a wide range of L1 backgrounds while others are restricted to one particular learner population. The features shared across language backgrounds can be argued to relate to the observed phenomenon of the developmental nature of learner language (Area 2). The features unique to a specific learner population, on the other hand, may be accounted for by the influence of the L1 on the L2 (Area 3). Two corpus studies which contribute to the latter area are briefly discussed below.

The first is Nesselhauf (2005), who examined verb + noun collocations in a detailed study of 318 essays by advanced German learners of English. She found that congruence (i.e. word-for-word equivalence of a collocation in the learners’ L1 and the L2) is the factor that correlates most strongly with collocation difficulty, followed by the degree of restriction of a collocation. The second study is by Paquot (2008), who considered the potential influence of the L1 on learners’ production of multiword units in five sub-corpora of the ICLE. These units concern the phraseology of vocabulary items which serve the function of exemplification in academic writing (e.g. for example, illustrate, exemplify). Using both CIA and contrastive analysis, she compared the use of these items between learner language of different populations and also within native writing, and found that the L1 has strong effects on the multiword units studied.

This selective review indicates how learner corpus research has enriched existing descriptions of learner language, shedding new light particularly on advanced language use. It also shows that this area of study is beginning to address the role of the L1 in L2 learning. It is important to point out that there are also attempts at hypothesis testing on a quantitative basis in learner corpus research, considering such topics as the development of L2 morphological categories in the verb system (Housen, 2002), using a cross-sectional design (see Barlow, 2005). Further studies involving learner-native speaker comparisons have also revealed how learner language diverges from the expert or native speaker norm across a range of linguistic and rhetorical features, with insights gained to inform teaching and curriculum design (e.g. Flowerdew, 1998; Hinkel, 2002) as well as the development of pedagogical tools such as dictionaries (Howarth, 1996; Turton and Heaton, 1996; Rundell and Granger, 2007).
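The kind of learner versus native speaker comparison that CIA involves can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any study reviewed above: the counts are invented, the function names are hypothetical, and the log-likelihood statistic is one common choice in corpus linguistics rather than a measure this chapter specifies.

```python
import math


def per_million(count, corpus_size):
    """Normalised frequency: occurrences per million words."""
    return count * 1_000_000 / corpus_size


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood (G2) for one item; 0 * ln(0) is taken as 0."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def compare(item, learner, native):
    """`learner` and `native` are (count, corpus_size) pairs for `item`."""
    learner_pm = per_million(*learner)
    native_pm = per_million(*native)
    if learner_pm > native_pm:
        direction = "overuse"
    elif learner_pm < native_pm:
        direction = "underuse"
    else:
        direction = "same"
    return {
        "item": item,
        "learner_pm": learner_pm,
        "native_pm": native_pm,
        "direction": direction,
        "g2": log_likelihood(learner[0], learner[1], native[0], native[1]),
    }


# Invented counts, purely for illustration.
result = compare("people", learner=(300, 100_000), native=(150, 100_000))
print(result["direction"], round(result["g2"], 2))
```

Normalising to a per-million rate makes corpora of different sizes comparable, while the statistic gives a rough indication of whether a difference in rates is larger than chance variation would suggest.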

Learner Language as a Dynamic and Independent Language System

The earlier review suggests that learner corpus research is expanding as a vibrant area of study, whose findings appear to have direct implications for SLA and applications in language education. To gain a fuller understanding of L2 use and development through a corpus approach, it seems clear though that there remains much more to explore and work on. To this end, critical engagement with related disciplines such as SLA, bilingualism and multilingualism is necessary, for theoretical insights, empirical work and recommendations. A number of major themes have emerged and featured in exchanges in these disciplines in recent years, and two are discussed here for their relevance to addressing emergent concerns in the study of learner language.

The first is about dynamic language use and development. The field of applied linguistics in general has moved from treating language as a static object to considering it as a dynamic system and language development as a dynamic process (see, e.g. the recent special issues of Applied Linguistics, 2006; The Modern Language Journal, 2008; and Language Learning, 2009). The addition of -ing to grammar in the title of Larsen-Freeman’s (2003) book, Teaching Language: From Grammar to Grammaring, for example, also suggests how the dynamic process of language use has recently been emphasized and highlighted. Language, seen from this perspective, grows, changes and organizes itself in an organic way, and this holds for both the language of the native speaker and the L2 of the learner (Larsen-Freeman, 2006; see also de Bot et al., 2005). Accepting a dynamic view of L2 development also carries with it the methodological implication that the process of L2 learning is best captured over time (i.e. longitudinally).

The notion of dynamism, or dynamicity, is particularly useful to take into account in our attempt to enrich current corpus-based descriptions, analysis and understanding of the nature of learner language. It underlies a range of central processes observed in L2 learning such as simplification, overgeneralization, restructuring and U-shaped behaviour (see, e.g. Ortega, 2009), which can be exemplified in the acquisition of da (‘the’) by Ge, a naturalistic learner of English (see Huebner, 1983). In this longitudinal study, Ge’s initial use of da showed restricted functional use but with seemingly high levels of accuracy. At a later stage, da was observed to ‘flood’ (Huebner’s description) all the noun phrases Ge produced and to occur with few functional constraints, an indication of overgeneralization. Restructuring continued, and towards the end of the study a U-shaped development was noted: Ge’s pattern of use of da had become very similar to proficient language use. We will look shortly at an illustrative study of how a corpus approach to analyzing learner language uncovers similar, illuminating processes underlying L2 development.

The second theme relates to the idea that learner language constitutes a linguistic system in its own right rather than being a deficit version of that of an idealized monolingual native speaker. This understanding can be traced back to the seminal work of Corder (1967) and Selinker (1972), and was reinforced by Bley-Vroman (1983), who argues that research which sets the target language norm as a yardstick for measuring learner success is susceptible to the ‘comparative fallacy’. Cook (1991) has long observed that L2 acquisition is in many respects different from L1 acquisition. The L2 user possesses a more complex overall language system, which includes two languages, and neither the L1 nor the L2 in a bilingual is isomorphic with the language in a monolingual of either language. It makes little sense, in the study of L2 acquisition, to compare the L2 with a monolingual native speaker norm:

    The child is learning their first language and has no other; the L2 learner is learning a second language when they already have a first. Why should the mastery of a second language be measured against the mastery of a first? L2 learners are only failures if they are measured against something they are not and never can be – monolingual native speakers. . . . To call what the vast majority of L2 users achieve ‘failure’ is to accept that the only valid view of the world is that of the monolingual. . . . Only in a monolingual universe is a multi-competent person a failure for not speaking like a monolingual. (Cook, 2010: 154)

There is a clear need for overcoming the bias towards a monolingual native speaker construct in L2 research and for rejecting a deficit view of learner language. This is, as Ortega (2010) notes, for good science and for ethical accountability. There is also the awareness, arising from recent work on English as a lingua franca, that ‘correctness’ does not have to be commensurate with the native speaker norm and that nonconformities can represent effective communicative exploitation of linguistic resources (see, e.g. Seidlhofer, Chapter 9, this volume).

Research Implications

All this invites a re-evaluation of our practices and forces those of us involved in learner corpus research, and in other domains of L2 research more widely, to redefine our agenda, methods, assumptions and expectations. To capture dynamic L2 use and development meaningfully, approaches to collecting and analyzing learner language might be reconsidered. A collection of learner data based on a snapshot in time may be able to provide a picture of a particular state of learner language. Collections of data from the same learner or group of learners, gathered at several points in time, have the extra advantage of illuminating what each of the several states of learner language looks like and how it relates to the others, and how and to what extent the various sub-systems of learner language interact and change over time. This latter, longitudinal approach, when compared to the former, cross-sectional approach, makes possible more meaningful explanations of the dynamically changing and self-organizing nature of learner language (see also Ortega and Byrnes, 2008).

The reconceptualization of learner language as a linguistic system independent of a monolingual norm also invites reconsideration of what counts as comparable corpus data. When the attainment of bilingual/multilingual competence is considered fundamentally different from the attainment of monolingual competence, this raises the immediate issue of whether a corpus of language samples collected from monolinguals of a given target language remains appropriate for direct comparison with a learner corpus. Some researchers have suggested the use of data contributed by comparable mature bilinguals (Ortega, 2007) or multilinguals (note how labels such as ‘mature’ or even ‘expert speaker’ (Rampton, 1990) have emerged as part of a new terminology reflecting a commitment to diminishing the bias towards the native speaker ideal). Of course, developing and analyzing a longitudinal learner corpus, assembled at several points in time and with different data sets to be compared, also offers a way to address the concern. Crucially, a corpus approach of a longitudinal nature allows for the internal comparisons that are necessary to uncover the nature and processes of L2 learning.

A final note before we turn to the next section. The study of learner language as both a dynamic and independent language system can be seen to be closely linked to and supported by a longitudinal methodology. This is perhaps even necessary. In her recent review as well as ‘endorsement’ of the first major text devoted to addressing longitudinal research and advanced L2 capacities in the field (Ortega and Byrnes, 2008), Lightbown observes that the best way to understand L2 development is through longitudinal research. Perhaps unsurprisingly, it has also been noted that because such research is so difficult, few studies have looked at more than one or two learners for more than one or two years. For practical reasons, an alternative, in the absence of longitudinal learner data, is to devise a pseudo-longitudinal study. In research of this design,

    . . . samples of learner language are collected from groups of learners of different proficiency levels at a single point in time. A longitudinal picture can be then constructed by comparing the devices used by the different groups ranked according to their proficiency. (Ellis and Barkhuizen, 2005: 97)

In other words, by dividing subjects into groups of learners representing successive, clearly defined levels of proficiency, a pseudo-longitudinal study is formed and changes in language use can be observed which correspond to some extent with changes that take place over time. The following section provides an example of a pseudo-longitudinal study.
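The logic of a pseudo-longitudinal design can be sketched in Python as follows. This is only an illustration under stated assumptions: the three level labels are those used for the groups in the study reported in the next section, but the toy essays, the tokenization by whitespace and the function name are invented for the example.

```python
from collections import defaultdict

# Proficiency bands in developmental order (an illustrative assumption).
LEVELS = ["Beginning", "Intermediate", "High Intermediate"]


def rate_by_level(essays, feature):
    """Occurrences of `feature` per 1,000 tokens in each proficiency group.

    `essays` is an iterable of (level, tokens) pairs collected at a single
    point in time; `tokens` is a list of word forms.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, tokens in essays:
        hits[level] += tokens.count(feature)
        totals[level] += len(tokens)
    # Report groups in developmental order, skipping any with no data.
    return [(level, 1000 * hits[level] / totals[level])
            for level in LEVELS if totals[level]]


# Invented toy data, not drawn from any corpus.
toy_essays = [
    ("Beginning", "he help the girl and help her".split()),
    ("Intermediate", "he saved the girl and thanked the boys".split()),
    ("High Intermediate", "they heard someone and saved her quickly".split()),
]
profile = rate_by_level(toy_essays, "saved")
print(profile)
```

Reading the resulting rates from the lowest to the highest band gives the constructed ‘longitudinal picture’; it stands in for, but does not replace, data from the same learners over time.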

An Illustrative Study: Development of Phraseological Competence

The study to be discussed illustrates, through a corpus approach, many of the issues noted earlier concerning the key processes observed in L2 development. The main research issue explored is: How might learners go about developing phraseological competence? This particular piece of research investigated how word combinations develop and come to be regular patterns of L2 use as the learner’s proficiency level increases. One area of focus of the study, to be considered here, was topic-related verb + noun sequences.

The participants of the study were 301 Malaysian Secondary One learners of English (13-year-olds), drawn from a larger population of learners whose language data had been collected for a corpus project (Abd. Samad et al., 2002). For this subset of the corpus, each learner, without access to a dictionary, wrote an essay based on a series of pictures illustrating how a near-drowning character was saved. This produced a collection of 301 essays totalling 49,496 words. The essays were then scored manually by two independent readers using a holistic grading rubric employed in Malaysian secondary schools with 13- to 15-year-olds. The essays were categorized into one of three proficiency groups (or ‘corpora’) based on the scoring: Beginning, Intermediate and High Intermediate.

The Beginning Learner corpus was first analyzed for key verbs, termed topic-related verbs in the study, which form part of a keyword list generated using the Keywords Extractor (at http://www.lextutor.ca/keywords). These are verbs that occur far more frequently, in proportion, than they do in a general reference corpus. The reference corpus used was the Brown corpus, which consists of 500 texts of about 2,000 words each, sampled from 15 genres. A list of 55 verb forms (e.g. save and saved are considered two verb forms) was identified from the analysis at a keyness cut-off of 10 (i.e. all these verb forms were at least 10 times more numerous in the Beginning Learner corpus than in the Brown corpus, calculated on a per-million basis). The list was then used to explore all the verb + noun sequences that occur in the three learner corpora to determine whether and to what extent such sequences vary across proficiency levels. A close study of the concordance lines of the 55 verbs followed, based on emergence criteria which considered only those sequences that had occurred at least four times and in four different learner essays within each corpus (cf. Pallotti, 2007).

The criteria yielded a total of 385 individual instances of verb + noun sequences across the three corpora, further categorized into eight verb-family groups of sequences, FISH, HEAR, HELP, PICK, PLUCK, SAVE, SHOUT and THANK, as presented in Table 12.1. Generally the two more advanced groups of learners, Intermediate and High Intermediate, were found to be more similar in performance, especially in terms of the broad categories of verb + noun sequences they employed as language resources to express and convey central ideas in the narrative task.
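The keyness cut-off and the emergence criteria described above can be sketched in Python. This is a hedged illustration and not the Keywords Extractor's actual implementation: the function names are hypothetical, the treatment of forms absent from the reference corpus is an assumption, and no real corpus counts are used.

```python
from collections import Counter


def keyness_ratio(count_focus, size_focus, count_ref, size_ref):
    """Per-million frequency in the focus corpus divided by the per-million
    frequency in the reference corpus.

    Assumption: a form absent from the reference corpus gets an infinite
    ratio; how the original tool treats such cases is not stated.
    """
    focus_pm = count_focus * 1_000_000 / size_focus
    ref_pm = count_ref * 1_000_000 / size_ref
    if ref_pm == 0:
        return float("inf") if focus_pm > 0 else 0.0
    return focus_pm / ref_pm


def key_forms(focus_counts, size_focus, ref_counts, size_ref, cutoff=10):
    """Forms whose keyness ratio reaches the cut-off (10 in the study)."""
    return sorted(
        form for form, n in focus_counts.items()
        if keyness_ratio(n, size_focus, ref_counts.get(form, 0), size_ref) >= cutoff
    )


def emerged(occurrences, min_tokens=4, min_essays=4):
    """Sequences attested at least `min_tokens` times and in at least
    `min_essays` different essays within one corpus.

    `occurrences` is an iterable of (sequence, essay_id) pairs.
    """
    tokens = Counter()
    essays = {}
    for sequence, essay_id in occurrences:
        tokens[sequence] += 1
        essays.setdefault(sequence, set()).add(essay_id)
    return sorted(
        s for s in tokens
        if tokens[s] >= min_tokens and len(essays[s]) >= min_essays
    )
```

A usage example with invented numbers: a form occurring 20 times in a 10,000-word learner corpus and 100 times in a 1,000,000-word reference corpus has rates of 2,000 and 100 per million, a ratio of 20, and so passes a cut-off of 10. The second criterion in `emerged` prevents a single essay that repeats one sequence from counting as evidence of emergence.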
These two groups produced and shared the same set of six groups of sequences: HEAR, HELP, PICK, PLUCK, SAVE and THANK. They differed from the Beginning learners, who showed specific use of FISH and SHOUT sequences to form verb + noun sequences. Closer observation also reveals that both the Intermediate and High Intermediate learners used twice as many SAVE sequences as did the Beginning learners, and produced PLUCK and HEAR sequences that were absent in the language of the latter learner group. The Beginning learners, on the other hand, were found to rely heavily on HELP sequences. This could be viewed as compensatory for the proportionally lower percentage




Table 12.1  Distribution of verb + noun sequences by proficiency level (%)

Verb + noun sequences   Examples                             Beginning   Intermediate   High Intermediate
FISH                    fishing fish                         6           0              0
HEAR                    heard someone, heard the shouted     0           7              9
HELP                    help me, help the girl               30          14             4
PICK                    picking flowers, pick some flowers   13          17             18
PLUCK                   pluck some flowers                   0           2              4
SAVE                    saved the girl, saving her           20          40             40
SHOUT                   shouted help                         8           0              0
THANK                   thank you, thanked the boys          23          20             25
Total                                                        100         100            100

of SAVE sequences used. It could also be that at this stage of development they tended to use a more general verb (i.e. HELP), which can express a wider range of meanings, including ‘to save’:

(1)  Feliz also jump into the river to help her. (SMHK-P-f1-15)
     They swam to my friend and swam to help her. (SAM-P-f1-23)
     Mamat quickly jump into the river to help the girl. (SMMA-P-f1-18)

A functional analysis (see Ellis and Barkhuizen, 2005) was further performed in the study in which the sequences identified earlier were examined using the general framework of Halliday’s (1994) Material, Mental and Verbal Processes. A high reliance on material process sequences (e.g. FISH, HELP, PICK, PLUCK and SAVE sequences), for example, would suggest the narrator’s focus on the ‘doings’ and ‘happenings’ – the events or types of action taking place in the story. Employing mental process sequences (e.g. HEAR sequences), on the other hand, could be argued to point to the narrator’s sensitivity to engage the reader to understand and appreciate the mental processes of the character(s) in the story – what is felt, thought or perceived, with the character(s) being depicted as having been ‘endowed with consciousness’ (Halliday, 1994: 114). When verbal process sequences (e.g. SHOUT and THANK sequences) are used, some form of


Table 12.2  Distribution of different functional process-type sequences (%)

Function   Realization Devices (Sequences)   Beginning   Intermediate   High Intermediate
Material   FISH, HELP, PICK, PLUCK, SAVE     68.75       72.94          66.02
Mental     HEAR                              0           6.88           8.74
Verbal     THANK, SHOUT                      31.25       20.18          25.24

communication or symbolic activity is realized and the (fictional) characters are depicted as being involved in some process of communication. The distributions of different process-type sequences that occurred in each corpus are presented in Table 12.2.

A developmental pattern can be observed in the learners’ use of verb + noun sequences. The initial period or stage of development (as reflected in the Beginning Learner corpus) was characterized by a heavy use of material and verbal process sequences and an absence of mental process sequences – choices that suggest an emphasis on the doings and happenings in narration. At a later stage (in the Intermediate Learner corpus), while material and verbal process sequences were still the dominant strings of words, mental process sequences made an appearance. Later still (in the High Intermediate Learner corpus), the increased use of verbal and mental process sequences became much more evident, reflecting the learners’ increased awareness of different narrative discoursal demands or expectations at this stage, and their preference and orientations towards constructing and addressing (new) complexities to fulfil the expectations in narrative discourse in general.

These general observations were explored in greater detail through an analysis of the specific sequences that have emerged in the three learner corpora, which represent different points of language development. For reasons of space, we focus only on the SAVE sequences, although it may be worth pointing out that the pattern of development to be noted in the use of the SAVE sequences holds for other sequences as well (e.g. the HEAR and THANK sequences). As can be seen from Table 12.3, which shows the distribution of forms used in relation to function, learners’ language development is characterized by periods of development which seem to begin with a restricted set of sequences.
For example, at the initial stage, their use of SAVE sequences shows an exclusive use of the structurally ‘simple’ pattern of verb + noun sequences (i.e. save her, save the girl) and heavy use of the present simple and bare infinitive for the narrative task, as follows:




Table 12.3  Distribution of forms in relation to the functional realization of SAVE sequences across corpora

Function: Material process
Realization Devices (Sequences): SAVE sequences

Beginning:          save the girl; save her
Intermediate:       save the girl; saved the girl; saved her; saving her; saving her life; save her; saved her life; saved Lina
High Intermediate:  save the girl; saving her; save her; saving her life; saved the girl

(2)  I go to save her in water. (SAM-P-f1-26)
     He jump into the water and save the girl. (SMTI-P-f1-35)
     They run to place for save the girl. (SMMA-P-f1-22)

Although limited in range, these initial forms of SAVE sequences could be seen to serve as the basic communicative resources the learners had at their disposal. At a later stage, however, a wider range of linguistic forms was noted. In fact, there was a ‘flood’ of sequences observed (see earlier discussion on Ge’s L2 development). The initial, structurally simple sequences were still used, but now additionally with an extended pattern emerging of verb + noun + from + (verb)-ing (as in save the girl from drowning). Notice that the learner language at this stage also exhibited emergent, but sometimes invariable, use of both past and present tenses:

(3)  I tried very hard to save her. (SMSAB-P-f1-11)
     Abu, one of them tried to be brave and he jump into the water to save the girl. (SMPM-P-f1-19)
     Immediately, I jumped into the river to save the girl from drowning in the river . . . . (SMART-P-f1-33)

Later still, both the basic pattern and the extended pattern of verb + noun + from + (verb)-ing were observed to occur in an overall narrative discourse which was characterized by, among other things, a more productive and consistent use of (past) tenses:

(4)  She thanked Ahmad for saving her life. (SMTA-P-f1-15)
     Amran jumped into the river to save the girl. (SMTA-P-f1-30)
     . . . without thinking, Harvey jumped into the river and saved the girl from drowning. (SMTA-P-f1-20)
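The contrast between the basic SAVE pattern and the extended pattern of verb + noun + from + (verb)-ing could be approximated with a simple pattern match over learner sentences. The sketch below is a rough heuristic rather than the procedure used in the study: the regular expressions, the limit of three words between the verb and from, and the function name classify are all assumptions made for illustration, and any sentence containing a form of SAVE without the extended pattern is simply labelled basic.

```python
import re

# Any inflected form of SAVE (save, saves, saved, saving).
SAVE_FORM = re.compile(r"\bsav(?:e|es|ed|ing)\b", re.IGNORECASE)

# SAVE + one to three words (a rough stand-in for the noun phrase)
# + "from" + a word ending in -ing.
EXTENDED = re.compile(
    r"\bsav(?:e|es|ed|ing)\s+(?:\w+\s+){1,3}?from\s+\w+ing\b",
    re.IGNORECASE,
)


def classify(sentence):
    """Label a sentence 'extended', 'basic', or None (no SAVE form found)."""
    if EXTENDED.search(sentence):
        return "extended"
    if SAVE_FORM.search(sentence):
        return "basic"
    return None


# Sentences taken from examples (1), (3) and (4) in the chapter.
examples = [
    "I tried very hard to save her.",
    "Immediately, I jumped into the river to save the girl from drowning in the river",
    "She thanked Ahmad for saving her life.",
    "Harvey jumped into the river and saved the girl from drowning.",
    "Mamat quickly jump into the river to help the girl.",
]
labels = [classify(s) for s in examples]
```

A surface match of this kind cannot tell a noun phrase from other material between the verb and from, so in practice its hits would still need to be checked against the concordance lines.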


The results, illustrated through the use of the SAVE sequences here, highlight how the various forms of verb + noun sequences, emerging as functional realizations of process-type sequences, gradually shift over different points in L2 development. Initially the SAVE sequences, for instance, emerge only in two structurally simple forms. The present simple, and sometimes the bare infinitive, dominate language production at this stage. Later, a far wider range of forms emerges, resulting in a ‘flood’ of sequences. Past tenses make an appearance but there are few morphological constraints at this point of language use. At yet another stage, a complex but systematic and refined pattern of sequences emerges. That is, the sequences emerging at this point of development show a greater performance consistency of language use in form and function, spanning morphological and immediate discoursal concerns. Overall the findings of this corpus study suggest that in the development of phraseological competence, learners apparently begin with ‘a basic communicative resource of clusters’, then move on to producing ‘a flood of clusters’ before developing ‘a systematically refined set of clusters’, or ‘a re-organized system of clusters’, all of which are reflective of fundamental L2 processes such as restructuring taking place. It was also observed that functionally learners’ preference and orientations towards addressing complexities in language use gradually change in the course of language development. The study supports current research which suggests that L2 development is a dynamic process and highlights that it is both instructive and illuminating to examine learner language in its own right.
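The functional shift summarized here can be cross-checked by grouping the verb-family percentages of Table 12.1 into the Material, Mental and Verbal process types of Table 12.2. The sketch below does this with the rounded figures printed in Table 12.1, so its totals agree with Table 12.2 only to the nearest whole number. The variable and function names are illustrative assumptions.

```python
# Rounded percentages from Table 12.1, in the order
# (Beginning, Intermediate, High Intermediate).
TABLE_12_1 = {
    "FISH": (6, 0, 0),
    "HEAR": (0, 7, 9),
    "HELP": (30, 14, 4),
    "PICK": (13, 17, 18),
    "PLUCK": (0, 2, 4),
    "SAVE": (20, 40, 40),
    "SHOUT": (8, 0, 0),
    "THANK": (23, 20, 25),
}

# Grouping of verb families by process type, as given in Table 12.2.
PROCESS_TYPES = {
    "Material": ("FISH", "HELP", "PICK", "PLUCK", "SAVE"),
    "Mental": ("HEAR",),
    "Verbal": ("THANK", "SHOUT"),
}


def aggregate(table, process_types):
    """Sum the per-family percentages within each process type, per level."""
    return {
        process: tuple(sum(table[family][level] for family in families)
                       for level in range(3))
        for process, families in process_types.items()
    }


totals = aggregate(TABLE_12_1, PROCESS_TYPES)
# totals["Material"] is (69, 73, 66); Table 12.2 reports 68.75, 72.94, 66.02.
```

The mental process share rises from 0 to 7 to 9 across the three levels, which is the developmental pattern described in the text.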

Prospects and Challenges

A short chapter of this type can only present a selective overview of developments in learner corpus research, provide a partial account of the study and scratch the surface of the potential contribution this area of research can make to the study of L2 acquisition. It should be obvious though that the future of learner corpus research looks distinctly promising. There are, for example, already several notable learner corpus initiatives taking place:

1. The Longitudinal Database of Learner English (http://www.uclouvain.be/en-cecl-longdale.html), which comprises learner data of a range of genres;
2. French Learner Language Oral Corpora (http://www.flloc.soton.ac.uk/), which offer a wealth of learner data based on different starting age groups, allowing for the study of age effects on L2 learning;




3. The Multimedia Adult English Learner Corpus (http://www.labschool.pdx.edu/maelc_access.html), which covers over 3600 hours of video-recorded classroom learner interaction and can be exploited to explore many of the issues noted earlier in the chapter about the five central areas of concern in SLA (although challenges remain to extract relevant data from this valuable but complex learner resource; see Conrad, 2006);
4. The Cambridge English Profile Corpus (http://www.englishprofile.org/), a kind of ever-expanding ‘monitor’ learner corpus to be used for describing levels of proficiency in English; and
5. The British Sign Language Learner Corpus (http://www.bris.ac.uk/deaf/english/research/active/active05.html), a project currently in progress, which points to the need, as well as an opportunity, for an expanded research agenda in the foreseeable future to study, on a quantificational basis, the nature and processes of sign language learning.

An increasing range of sophisticated tools and corpus analysis systems, such as CHILDES (see Myles, 2005), CQPweb (http://cqpweb.lancs.ac.uk), GOLD (Lu, 2010) and the Sketch Engine (http://www.sketchengine.co.uk/), has also been developed for researchers, and there is a further encouraging move towards SLA model construction (e.g. Tono, 2009). Overall, learner corpus research can be seen to be consciously and demonstrably spreading its wings to explore and address both disciplinary and real-world concerns.

In noting some observable positive trends, it is perhaps also important to consider challenges ahead. If a major goal of learner corpus research is to improve our understanding and explanations of L2 learning, an immediate challenge is the need to account for individual differences – one of the central themes in SLA research (e.g. Dörnyei, 2005). Whether and to what extent this contradicts the very nature of corpus research (a quantitative approach, focusing on groups, not on individuals?) can be a contested issue in itself. If individual variation is considered in learner corpus research, however, opportunities for methodological innovation, or even disciplinary renewal or reconceptualization, open up. It may be worth noting though that the increasingly common use of mixed-methods approaches (e.g. Schmitt, 2010) represents an attractive alternative in the study of individual developmental paths.

There is also the issue of when and on what basis learners are considered ‘learners’ and not ‘users’ of language (and vice versa), and how this relates to approaches to research, research applications and classroom practice. Do learners, as opposed to language users, ‘learn’ by definition, and so have, by implication, an expected, desired end-state target to meet? If so, how does this


expectation resonate with an emergent view, which sees language development as constant, organic growth (e.g. Ellis and Larsen-Freeman, 2006), and what would make an appropriate ‘target’? If learners are viewed instead as language users ‘in their own right’, what are the status and expectations of learners in the classroom? Further analysis and exploration of the scope of the issue may lead to a meaningful reshaping of the research landscape and relevant practices. In any case, the concern for reconsidering our approach to conceptualizing the learner and language learning is genuine, and any such serious consideration needs to take into account the essential interplay between contexts and purposes for research and for the classroom, which drives possible, different realizations of roles and expectations of learners, in relation to their complex, changing environments of language use and language learning. Finally, in the current milieu of technological advancements, we can no longer assume that L2 learning takes place only in the ‘traditional’ classroom. The increasing use of, as well as increased dependence on, the internet and mobile technologies in the lives of young people for such purposes as learning and social networking, calls for refining our research agenda and views of learning sites. Just as classroom researchers have recently been urged ‘to track learners’ L2 experiences in a more unified and consistent fashion, both inside and outside the classroom’ (Mitchell, 2009: 698), so too do we need to consider an expanded set of approaches and scope of investigation. Indeed, the changing worlds, both personal and social, of our learners present us with changing demands for our research and practice (cf. Block, 2003; Chau and Kerry, 2008). It will remain an open question for some time how far down the path of challenge and change L2 researchers can and will travel, but it seems clear that when a game changes, the players on the field must change. 
We may also note, however, that difficulties and complexities in bringing about change in research activity abound, and that we are more likely to succeed by proceeding in phases. Nevertheless, the expected benefits of change are immense. Through concerted and sustained research efforts, we are likely to advance our understanding and explanations of L2 learning in multifaceted contexts and to gain satisfaction from the enterprise.

References

Abd Samad, A., Hassan, F., Mukundan, J., Kamarudin, G., Syd Abd Rahman, S. Z., Md. Rashid, J., and Vethamani, M. E. (2002). The English of Malaysian School Students (EMAS) Corpus. Serdang: UPM Press.
Barlow, M. (2005). Computer-based analyses of learner language. In Ellis, R. and Barkhuizen, G. (eds), pp. 335–69.




Bley-Vroman, R. (1983). The comparative fallacy in interlanguage studies: The case of systematicity. Language Learning, 33: 1–17.
Block, D. (2003). The Social Turn in Second Language Acquisition. Edinburgh: Edinburgh University Press.
Chau, M. H. and Kerry, T. (eds) (2008). International Perspectives on Education. London: Continuum.
Cobb, T. (2003). Analysing late interlanguage with learner corpora: Quebec replications of three European studies. The Canadian Modern Language Review, 59: 393–423.
Conrad, S. (2006). Challenges for English corpus linguistics in second language acquisition research. In Kawaguchi, Y., Zaima, S., and Takagaki, T. (eds), Spoken Language Corpus and Linguistic Informatics. Amsterdam: John Benjamins, pp. 67–88.
Cook, V. (1991). The poverty-of-the-stimulus argument and multi-competence. Second Language Research, 7: 103–17.
— (2010). The relationship between first and second language acquisition revisited. In Macaro, E. (ed.), The Continuum Companion to Second Language Acquisition. London: Continuum, pp. 137–57.
Corder, S. P. (1967). The significance of learners’ errors. International Review of Applied Linguistics in Language Teaching, 5: 161–9.
de Bot, K., Lowie, W. and Verspoor, M. (2005). Second Language Acquisition: An Advanced Resource Book. London: Routledge.
De Cock, S., Granger, S., Leech, G. and McEnery, T. (1998). An automated approach to the phrasicon of EFL learners. In Granger, S. (ed.), pp. 67–79.
Dörnyei, Z. (2005). The Psychology of the Language Learner: Individual Differences in Second Language Acquisition. Mahwah, NJ: Lawrence Erlbaum.
Ellis, N. C. and Larsen-Freeman, D. (eds) (2006). Language emergence: Implications for applied linguistics. Special issue. Applied Linguistics, 27(4).
Ellis, R. and Barkhuizen, G. (2005). Analysing Learner Language. Oxford: Oxford University Press.
Flowerdew, L. (1998). Integrating expert and interlanguage computer corpora findings on causality: Discoveries for teachers and students. ESP Journal, 17: 329–45.
Gilquin, G., Granger, S. and Paquot, M. (2007). Learner corpora: The missing link in EAP pedagogy. Journal of English for Academic Purposes, 6: 319–35.
Granger, S. (1993). The International Corpus of Learner English. In Aarts, J., de Haan, P. and Oostdijk, N. (eds), English Language Corpora: Design, Analysis and Exploitation. Amsterdam: Rodopi, pp. 57–69.
— (ed.) (1998a). Learner English on Computer. London: Longman.
— (1998b). Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Cowie, A. (ed.), Phraseology: Theory, Analysis and Applications. Oxford: Oxford University Press, pp. 145–60.
— (2008). Learner corpora. In Lüdeling, A. and Kytö, M. (eds), Corpus Linguistics: An International Handbook. Volume 1. Berlin: Mouton de Gruyter, pp. 259–75.
Halliday, M. A. K. (1994). An Introduction to Functional Grammar (2nd ed.). London: Arnold.
Hinkel, E. (2002). Second Language Writers’ Text: Linguistic and Rhetorical Features. Mahwah, NJ: Lawrence Erlbaum.


Housen, A. (2002). A corpus-based study of the L2-acquisition of the English verb system. In Granger, S., Hung, J., and Petch-Tyson, S. (eds), Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: Benjamins, pp. 77–117.
Howarth, P. (1996). Phraseology in English Academic Writing. Tübingen: Max Niemeyer Verlag.
Huebner, T. (1983). A Longitudinal Analysis of the Acquisition of English. Ann Arbor: Karoma.
Larsen-Freeman, D. (2003). Teaching Language: From Grammar to Grammaring. Boston, MA: Heinle/Cengage.
— (2006). The emergence of complexity, fluency, and accuracy in the oral and written production of five Chinese learners of English. Applied Linguistics, 27: 590–619.
Lu, X. (2010). What can corpus software reveal about language development? In O’Keeffe, A. and McCarthy, M. (eds), The Routledge Handbook of Corpus Linguistics. London: Routledge, pp. 184–93.
Milton, J. and Tsang, E. S. C. (1993). A corpus-based study of logical connectors in EFL students’ writing. In Pemberton, R. and Tsang, E. S. C. (eds), Lexis in Studies. Hong Kong: Hong Kong University Press, pp. 215–46.
Mitchell, R. (2009). Current trends in classroom research. In Long, M. and Doughty, C. (eds), The Handbook of Language Teaching. Malden, MA: Wiley-Blackwell, pp. 675–705.
Myles, F. (2005). Interlanguage corpora and second language acquisition research. Second Language Research, 21: 373–91.
Nesselhauf, N. (2005). Collocations in a Learner Corpus. Amsterdam: John Benjamins.
Ortega, L. (2007). Second language learning explained? SLA across nine contemporary theories. In VanPatten, B. and Williams, J. (eds), pp. 225–50.
— (2009). Understanding Second Language Acquisition. London: Hodder Education.
— (2010). The bilingual turn in SLA. Plenary delivered at the Annual Conference of the American Association for Applied Linguistics, Atlanta, GA, 6–9 March.
Ortega, L. and Byrnes, H. (eds) (2008). The Longitudinal Study of Advanced L2 Capacities. New York: Routledge.
Pallotti, G. (2007). An operational definition of the emergence criterion. Applied Linguistics, 28: 361–82.
Paquot, M. (2008). Exemplification in learner writing: A cross-linguistic perspective. In Meunier, F. and Granger, S. (eds), Phraseology in Foreign Language Learning and Teaching. Amsterdam: Benjamins, pp. 101–19.
Petch-Tyson, S. (1998). Writer/reader visibility in EFL written discourse. In Granger, S. (ed.), pp. 107–18.
Rampton, M. B. H. (1990). Displacing the ‘native speaker’: Expertise, affiliation, and inheritance. ELT Journal, 44: 97–101.
Ringbom, H. (1998). Vocabulary frequencies in advanced learner English: A cross-linguistic approach. In Granger, S. (ed.), pp. 41–52.
Rundell, M. and Granger, S. (2007). From corpora to confidence. English Teaching Professional, 50: 15–18.
Schmitt, N. (2010). Researching Vocabulary: A Vocabulary Research Manual. Basingstoke: Palgrave.




Selinker, L. (1972). Interlanguage. International Review of Applied Linguistics, 10: 209–31.
Tono, Y. (2009). Integrating learner corpus analysis into a probabilistic model of second language acquisition. In Baker, P. (ed.), Contemporary Corpus Linguistics. London: Continuum, pp. 185–203.
Turton, N. D. and Heaton, J. B. (1996). Longman Dictionary of Common Errors. London: Longman.
VanPatten, B. and Williams, J. (eds) (2007). Theories in Second Language Acquisition. Mahwah, NJ: Lawrence Erlbaum.

Chapter 13

Corpora in the Classroom: An Applied Linguistic Perspective

Lynne Flowerdew

Introduction

The use of corpora in the classroom, initially for ELT and ESP teaching, can be traced back to the pioneering work of Tim Johns in the early 1990s (Johns, 1991) and the publication of Tribble and Jones’ textbook Concordancing in the Classroom in 1990. Since then, corpora have been used for a number of other purposes, including the teaching of literature, the teaching of translation, teacher education and assessment (see Flowerdew, 2011 for further discussion of these areas and also O’Keeffe and McCarthy, 2010).

The use of corpora has made significant inroads into all these classroom applications for a number of reasons. First, corpora are composed of naturally occurring, attested data, thus exposing students to authentic examples rather than invented ones. Second, corpus-based pedagogy can be considered an ideal conduit for promoting the phraseological approach to language learning, which captures phenomena such as collocations, colligations and semantic prosody (see Hunston, 2002). Such aspects may not be easy to glean from dictionaries and grammars as they are found at the lexis-grammar interface, but can be easily observed in corpora through the co-textual environment. Corpus-based pedagogy also promotes an inductive approach to language learning, requiring learners to extrapolate grammar rules from scrutinizing multiple concordance lines. Corpora are suitable for autonomous learning as, once trained in search strategies, students can work on their own language queries and on corpora related to their field of enquiry.

Figure 13.1 below summarizes the various pedagogic applications to which corpora have been put. Around the same time as the use of corpora in data-driven learning, that is ‘direct applications’, as described in the activities by




[Figure 13.1  Types of pedagogical corpus applications (Römer, 2010: 19). The diagram divides pedagogical corpus applications into two branches: indirect applications (hands-on for researchers and material writers), comprising effects on the teaching syllabus and effects on reference works and teaching materials; and direct applications (hands-on for learners and teachers; data-driven learning, DDL), comprising teacher–corpus interaction and learner–corpus interaction.]

Johns (1991) and Tribble and Jones (1990), ‘indirect applications’ also began to make their mark with the findings from the 20-million-word COBUILD corpus used to inform dictionaries and grammars (cf. Sinclair, 1987, 1990) and teaching syllabi (cf. Willis and Willis, 1988). This chapter focuses on direct applications and consists of two main sections. In the first section, I review how the direct applications of corpora have been used in the classroom, both in the form of hands-on activities and worksheets based on concordance output, to enhance language learning of key vocabulary, functions and genre from a phraseological perspective. In the second section, I discuss data-driven learning (DDL) in relation to specific theories of language learning. The chapter begins with an overview of the principles underlying the lexical, i.e. phraseological, approach to pedagogy, which is at the heart of many of the classroom applications of corpora.

Concept of the Lexical Approach

For Sinclair (1999), the lexical item has primacy over the grammar, and it is this concept of a ‘lexicalised grammar’ which inspired the ELT lexical syllabus designed around frequency lists and their most important recurrent patterns (Willis, 1990). However, this does not mean to say that grammar is neglected, as the main lexical patterns incorporate the main grammatical forms, as noted by Flowerdew (2009), who explicates further:


For the COBUILD course for the first level the most frequent 700 words were selected from the corpus, these words accounting, according to Willis (1990: vi), for around 70% of all English text. The underlying principle of the lexical syllabus is frequency. Sinclair and Renouf (1988) argue that the most frequent words typically have a range of uses and that many of these uses are typically not covered in beginners’ courses. They give the example of the word make. This word most typically occurs in patterns such as make decisions, make discoveries, make arrangements. These abstract uses are more frequent than the concrete use, as in make a cake. (J. Flowerdew, 2009: 338)

This interest in the lexical approach has also extended to EAP/ESP teaching. While corpora have proved invaluable for identifying the salient lexis characteristic of different EAP/ESP genres for use in pedagogy, there is some debate over whether there exists a generic set of vocabulary items common across a variety of disciplines.
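The frequency principle behind the lexical syllabus, that a small set of very frequent words accounts for a large share of running text, can be made concrete with a short coverage calculation. The sketch below is illustrative only: the tokenizer is a deliberately crude assumption, the sample sentence is invented, and the resulting figures say nothing about the 700-word or 70% values cited above, which would require a large corpus to check.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic word forms (a crude assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def coverage(tokens, top_n):
    """Share of running words accounted for by the top_n most frequent forms."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Invented sample text, used only to show the calculation.
text = "The cat sat on the mat and the dog sat on the cat"
tokens = tokenize(text)

top1 = coverage(tokens, 1)   # share covered by the single most frequent form
top4 = coverage(tokens, 4)   # share covered by the four most frequent forms
```

Run over a large general corpus with top_n set to 700, the same function would give the kind of coverage figure on which the selection of words for the first-level course was based.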

Vocabulary Approaches

Corpus-based research on different academic genres has reignited the debate on the ‘common-core hypothesis’ (Bloor and Bloor, 1986). Proponents of this approach maintain that there exists a common set of linguistic structures and vocabulary that will be found across a range of academic texts, regardless of discipline and genre. Coxhead’s (2000) corpus-based research of word families and vocabulary items in a 3.5 million word corpus composed of written academic texts drawn from the disciplines of Arts, Commerce, Law and Science seems to support the hypothesis with its extraction of common vocabulary items forming an Academic Word List (AWL). However, Hyland and Tse’s (2007) corpus data point to a discipline-based specific lexical repertoire. Other ESP studies, mostly of research articles (e.g. Chen and Ge, 2007), in which wordlists from specialist disciplines are compared with Coxhead’s academic wordlist, also favour sets of discipline-specific lexis (see Flowerdew, 2011 for a summary of these). It is to be noted, however, that all the studies arguing for a disciplinary specific lexis for EAP are based solely on frequency counts of some kind. Studies by Ellis et al. (2008) take into account not only frequency information but also pedagogic relevance as seen through teachers’ eyes and psycholinguistic salience, to arrive at a pedagogic list of formulaic sequences (i.e. strings of 3-word sequences, e.g. in terms of, in order to) for teaching




academic speech and writing. This triangulated work supports Coxhead’s AWL (as does Paquot’s (2010) research), thus lending weight to the common-core hypothesis and also echoing the sentiments expressed by Widdowson (2003), among others, that pedagogic relevance may not be circumscribed solely by frequencies found in a corpus. Perhaps a key point to note here is that all these studies rely on different specialized corpora compiled by the researchers, so it is not surprising that the findings differ. Another consideration to take account of is that when lexical bundles, i.e. three- or four-word n-grams, are the focus of investigation rather than individual lexical items, then a case can be made for the disciplinary specific nature of corpora (cf. Hyland, 2008).

Corpus-based pedagogy has targeted various kinds of lexis occurring in specialized corpora. For example, Mudraya’s (2006) materials based on a two-million-word corpus of engineering textbooks targeted vocabulary specifically of a sub-technical nature. Mudraya has noted that this type of vocabulary (i.e. those items such as current, solution, tension, which have one sense in general English, but are used in a different sense in technical English) is problematic for students. She proposes a set of queries based around solution on the grounds that this word occurs, in its general sense, both as a high frequency word family and as a frequent sub-technical item. Students are presented with concordance output of carefully selected examples of solution and in one exercise are asked to identify, for example, the following: those adjectives used with solution (1) in the general sense and (2) in the technical (chemical) sense and then asked to underline those adjectives that can be used with both senses of solution as a means to highlight collocational sensitivities.
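The kind of concordance output on which an exercise like this relies can be produced with a few lines of code. The sketch below builds key-word-in-context (KWIC) lines for a node word and counts the words immediately to its left, which is where adjectives modifying solution would typically appear. It is an illustration under stated assumptions and not Mudraya's materials: the sample sentences are invented, the tokenizer ignores sentence boundaries, and no part-of-speech tagging is done, so the left-hand words are only candidate adjectives.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic word forms (a crude assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def left_collocates(tokens, node):
    """Count the word immediately preceding each hit of `node`."""
    return Counter(tokens[i - 1] for i, token in enumerate(tokens)
                   if token == node and i > 0)


# Invented sample sentences mixing the general and the chemical sense.
sample = (
    "We need a practical solution to the problem. "
    "Add the aqueous solution to the flask. "
    "A dilute solution was prepared. "
    "No practical solution was found."
)
tokens = tokenize(sample)
lines = kwic(tokens, "solution", width=2)
collocates = left_collocates(tokens, "solution")

for left, node, right in lines:
    print(f"{left:>20}  {node}  {right}")
```

Sorting or grouping the lines by the left-hand word gives students a compact view of which modifiers occur with each sense, which is the observation the exercise is designed to prompt.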
Jones and Schmitt (2010) devised corpus-based materials for discipline-specific vocabulary and phrases that were found to occur in academic seminars related to Language and Gender, International Law and Entrepreneurship. While the focus of Mudraya’s activities was on semi-technical vocabulary, Jones and Schmitt targeted three other types of vocabulary to be covered: technical, for example, a regional body; general, for example, take the initiative; and colloquial, for example, gut feeling. In fact, their ‘general’ classification seems to have some overlap with semi-technical vocabulary as their item ‘venture’ from the Entrepreneurship lectures would fit Mudraya’s classification. That the materials devised by Jones and Schmitt were based not only on corpus statistical analysis but also on lecturer and student input follows the method adopted by Ellis et al. (2008) for defining the items to be taught.

Needless to say, vocabulary items can also be viewed as part of a functional unit. Speech acts commonly found in academic seminar discussions were also covered in Jones and Schmitt’s materials; for example, the general item


‘good point’ could be viewed as a form of agreement. Those corpus-based pedagogic activities taking a more functional perspective are discussed below.

Functional Approaches

The notional-functional syllabus (see Wilkins, 1976), associated with the communicative language teaching approach, has also had an influence on the design of corpus-based materials. One key endeavour in the production of corpus-based materials to aid students with academic writing of a general nature is that by Thurstun and Candlin (1998). In this corpus-derived material the lexicogrammar is introduced according to its specific rhetorical function (e.g. referring to the literature, reporting the research of others). Within each broad function, each key word (e.g. claim, identify) is then examined within the following chain of activities (Thurstun and Candlin, 1998: 272):

• LOOK at concordances for the key term and words surrounding it, thinking of meaning.
• FAMILIARIZE yourself with the patterns of language surrounding the key term by referring to the concordances as you complete the tasks.
• PRACTISE key terms without referring to the concordances.
• CREATE your own piece of writing using the terms studied to fulfil a particular function of academic writing.

Corpus-based instructional material for English for general academic purposes has also been produced by Charles (2007) and Bloch (2009). Like Thurstun and Candlin, Charles (2007: 296) also targets key rhetorical functions, in this case, the combinatorial function of defending your work against criticism, a two-part pattern: ‘anticipated criticism → defence and its realization using signals of apparent concession, contrast and justification’. Another feature of Charles’ materials is that she approaches these functions by first using a top-down approach, providing students with a suite of worksheet activities to sensitize them to the extended discourse properties of this rhetorical function. She then supplements these with a more bottom-up level approach by having students search the corpus to identify typical lexicogrammatical patterns realizing these functions. Meanwhile, Bloch (2009) describes a user-friendly program for teaching the use of reporting verbs in academic writing in corpora compiled in-house to meet the needs of specific learners. The interface presents users with only a limited number of



Corpora in the Classroom: An Applied Linguistic Perspective

213

hits for each query and a limited number of criteria for querying the database, namely integral/non-integral; indicative/informative; writer/author; attitude towards claim; strength of claim, categories devised from Bloch’s research on the use of reporting verbs from a rhetorical perspective. The functional–notional tradition is at the heart of Coccetta’s (2011) corpus-based materials focusing on spoken communication in general ELT. Unlike the materials described above, those of Coccetta are based on multimodal approaches to corpus analysis, which take a systemic-functional orientation to determine how different semiotic resources (language, gaze, gesture etc.) interact to create meaning. Coccetta explains how an online multimodal concordancer incorporating a search engine allows students to find and isolate sequences in a corpus sharing the same characteristics by means of a functionally tagged corpus. For example, to see if the function of ‘declining an offer’ occurs in a sub-corpus relating to requests, invitations and offers and to find the linguistic forms realizing this function, students choose a parameter from a drop-down menu to retrieve the relevant concordance lines exemplified below (Table 13.1). Each concordance line has access to the film clip which provides non-linguistic information drawing on semiotic resources such as gesture, posture, gaze and facial expressions. Table 13.1  Concordance results for the ‘declining an offer’ function (Coccetta, 2011: 131). 1. 2. 3. 4. 5. 6. 7.

No thanks. No thanks. No thanks. I mean, that – that water’s been there for ages. No. No thanks. I’m not – I’m not hungry. Uh, no thanks. I’ve already had one thanks.
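The retrieval step described above, in which a student selects a function and receives the matching concordance lines together with their film clips, can be pictured with a minimal Python sketch. It is not Coccetta’s concordancer: the `Utterance` record, the `clip` identifiers and the tiny sub-corpus are all hypothetical and invented here for illustration, with the two declining utterances merely echoing wording that appears in Table 13.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One functionally tagged utterance (hypothetical record layout)."""
    text: str
    function: str  # functional tag, e.g. "declining an offer"
    clip: str      # made-up reference to the matching film clip


# Invented toy sub-corpus of requests, invitations and offers.
SUB_CORPUS = [
    Utterance("Would you like some water?", "making an offer", "clip_01"),
    Utterance("No thanks.", "declining an offer", "clip_01"),
    Utterance("Do you want a sandwich?", "making an offer", "clip_02"),
    Utterance("Uh, no thanks.", "declining an offer", "clip_02"),
    Utterance("Yes please.", "accepting an offer", "clip_03"),
]


def concordance_by_function(corpus, function):
    """Return every utterance carrying the chosen functional tag."""
    return [u for u in corpus if u.function == function]


# Show each hit with the clip a student could open for gesture and gaze.
for hit in concordance_by_function(SUB_CORPUS, "declining an offer"):
    print(f"{hit.clip}: {hit.text}")
```

Searching on the tag rather than on a word form is what lets differently worded refusals surface in the same result list, which a purely lexical query would miss.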

Functions are also embodied in the Swalesian (2004) approach to genre through the concept of ‘move structures’ denoting the communicative rhetorical purpose of a particular section of a written text, the focus of the following section.

Genre-Based Approaches

Several pedagogic applications of corpora for writing approach the corpus consultation from a Swalesian genre-based perspective (see Handford (2010) for an overview of the advantages in using corpora for genre analysis). For example, Bhatia et al. (2004) propose concordance-based activities for one genre of legal English, the problem-question genre written by students within academic settings. As deductive reasoning plays a major role in this highly specialized genre, various hedging devices are used for different rhetorical purposes in different moves. One suggestion by Bhatia et al. (2004) is to have students identify and classify hedges by function, form or grammar according to move structure using worksheets based on concordance output (see Weber (2001) for another account of a concordance- and genre-based approach to formal legal essay writing). However, as legal discourse is such a complex discourse area, corpus-based methodologies may not completely align themselves with legal writing tasks and institutional practices. Bhatia et al. (2004: 224) have underscored the complexity of legal discourse, pointing out that a number of academic and professional genres in law appear to be ‘dynamically embedded in one another’. In view of this, they caution that one has to go beyond the immediate textual concordance lines and look at discursive and institutional concerns and constraints to fully interpret and by extension become a skilled writer of these highly specialized genres. Likewise, Hafner and Candlin (2007), in their report on the use of legal corpora by university students, have also noted a tension between professional discourse practices and the phraseological approach associated with corpus-driven learning. However, they see this as a tension to be exploited, arguing that ‘continuing lifelong learners still need to be able to focus and reflect on the functional lexical phrases that constitute the essence of the texture of the documents they are composing’ (p. 312).
Swales (2002) contrasted the ‘fragmented’ world of corpus linguistics, with its tendency to adopt a somewhat bottom-up, atomistic approach to text, with the more ‘integrated’ world of ESP material design, with its focus on more genre-based top-down analysis of macro-level features. However, the tasks proposed by Bhatia et al. and others incorporate this more top-down approach, as called for by Durrant and Mathews-Aydınlı (2011). This more macro-perspective has also been made possible by advances in software; for example, the Michigan Corpus of Upper-level Student Papers (MICUSP) permits searches according to various textual features, such as definitions and the Problem-Solution pattern.

While pedagogic endeavours have been classified under one of the three approaches (lexical, functional and genre-based), in reality many combine all three, as in the EAP course for doctoral students described in Lee and Swales (2006). Those grammar points found to be particularly troublesome for advanced-level students were targeted in online searches, such as adjective pre-modification and [bare participles ing/ed] (e.g. when might phenomena observed occur as opposed to observed phenomena?). A key feature of these materials is that the search instructions ask students to examine frequencies, an aspect that is neglected in many pedagogic applications. Another observation is that the materials are built around hypothesis-testing, for example:

Grammar (for V-ing structures) Worksheet had a typical pattern of: (1) Guess and rank the frequencies, guess some exemplar word forms, and then check what the corpus says; (2) Check some given sentences and correct where necessary; (3) For each pair of sentences, which alternative is better and why? Aimed at reinforcing the useful pattern ADJ/N/V for V-ing (e.g. appropriate for modeling…; devices for storing…; suited for studying…). Also asked participants to rank the three word classes in terms of which was most likely to trigger the for verb-ing structure in academic writing. (Answer: N, ADJ, V). (Lee and Swales, 2006: 65)

The materials reviewed in this section clearly subscribe to the inductive approach to learning, which is inherent in DDL activities. However, they also incorporate elements of a deductive approach, as students are prompted by the teacher to examine specific language points, which could be viewed as a kind of ‘pedagogic mediation’ (Widdowson, 2000). Johansson comments thus:

Is the use of corpora to be grouped with the explicit or implicit method? The term ‘data-driven’ learning suggests that it is an inductive approach and therefore comparable with the implicit method, though the emphasis is on gaining insight rather than establishing habits, and in this sense it is mentalistic. I believe that the dichotomy explicit-implicit is far too simple. In the case of corpora in language teaching, I would favour a guided inductive approach or a combination of an inductive and deductive approach where the elements of explanation and corpus use are tailored to the needs of the student. (Johansson, 2009: 41–42)

Johansson’s notion of a ‘guided inductive approach’ is explicated further in the following section on the relationship between DDL and various language learning theories, which can be seen as contributing to the more deductive aspect of classroom applications of corpora.
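The ‘guess, then check what the corpus says’ step in the Lee and Swales worksheet amounts to counting which word class precedes for followed by a V-ing form. A minimal Python sketch of such a check follows, offered as an illustration only: the tag labels (`N`, `ADJ`, `V`, `PREP`, `VING`), the function name and the hand-tagged phrases are assumptions made here, not part of Lee and Swales’s materials, and the counts the toy data yields carry no empirical weight.

```python
from collections import Counter

# Hand-tagged toy phrases as (word, tag) pairs; invented for illustration.
TAGGED_PHRASES = [
    [("devices", "N"), ("for", "PREP"), ("storing", "VING")],
    [("methods", "N"), ("for", "PREP"), ("measuring", "VING")],
    [("appropriate", "ADJ"), ("for", "PREP"), ("modeling", "VING")],
    [("suited", "V"), ("for", "PREP"), ("studying", "VING")],
    [("waiting", "VING"), ("for", "PREP"), ("results", "N")],  # no match
]


def count_triggers(phrases):
    """Count the word class of the item directly before 'for' + V-ing."""
    counts = Counter()
    for phrase in phrases:
        # Slide a three-token window over each phrase.
        triples = zip(phrase, phrase[1:], phrase[2:])
        for (_, tag1), (word2, _), (_, tag3) in triples:
            is_pattern = word2.lower() == "for" and tag3 == "VING"
            if is_pattern and tag1 in {"ADJ", "N", "V"}:
                counts[tag1] += 1
    return counts


# Print the word classes from most to least frequent trigger.
for tag, freq in count_triggers(TAGGED_PHRASES).most_common():
    print(tag, freq)
```

Run over a part-of-speech-tagged academic corpus rather than five toy phrases, the same count would give students the ranking against which to compare their own guesses.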


Relationship Between DDL and Language Learning Theories

The ‘noticing’ hypothesis discussed in second language acquisition (SLA) studies (see Swain, 1998) underpins many corpus activities. The principle underlying this cognitive concept is that learners’ acquisition of linguistic input is more likely to increase if their attention is drawn to salient linguistic features. Corpus-based materials highlighting recurrent phrases would thus seem to be an ideal means for enhancing learners’ input via ‘noticing’, leading to uptake. While Lee and Swales (2006) do not make specific reference to SLA in their article, in effect, the concept of noticing is intrinsic to their corpus-based activities requiring students to focus on the frequency counts of particular structures. One corpus initiative in which explicit reference is made to the noticing hypothesis is that by Johns et al. (2008). In a project on teaching literature at the secondary school level in Taiwan, Johns et al. note that their course is underpinned by the Noticing Theory with a conscious focus of attention on form and meaning.

In an innovative project using corpora with L1 primary school children, the concept of ‘noticing’ was approached through colour-coding of word classes (Sealey and Thompson, 2007). Sealey and Thompson used the CLAWS tag-set for the BNC to tag for different parts of speech using 40 texts from the BNC classified as having been written for a child audience, i.e. imaginative fiction; these were then colour-coded. The colour-coded concordance output stimulated discussion on which word classes children were sure about and which ones they were unsure about (Sealey and Thompson, 2007: 215–216).

Extract 3: School B, Session 5
BB1 [489]: what are the brown ones?
BB1 [493]: the brown ones look like cut
GB1 [495]: like kiss
GB1 [497]: and went, got, went
GB2 [498]: I’m definitely sure they’re verbs I think.
Function words (grammatical or non-lexical words) emerged as a broad category of ‘dull’ words, which children had more difficulty with.

Meanwhile, Kennedy and Miceli’s (2010) approach to corpus work entails two kinds of ‘noticing’ activities designed to match the abilities of their students. Kennedy and Miceli note the high demands in terms of language proficiency, observation and inductive reasoning that a purely inductive approach such as that proposed by Bernardini (2004), i.e. ‘the learner-as-researcher’, would make on students. As their students are intermediate-level learners of Italian and not advanced translation students like Bernardini’s, they propose a more guided inductive approach through apprenticeship training. Their apprenticeship model, using a 500,000-word corpus of contemporary Italian to aid students with personal writing of a creative nature, consists of a ‘pattern-hunting’ phase followed by a ‘pattern-defining’ phase, which encourages ‘noticing’ of various features. For example, when writing about their sense of personal space for an autobiography, students were first prompted to come up with some key words for pattern-hunting, with many suggesting the common term spazio. This not only turned up ideas and expressions, for example, rubare spazio (take space), but also triggered further searches on words encountered in the concordance lines, for example, percorso (path). Other pattern-hunting techniques included browsing through whole texts on the basis of the title and text-type and scrutinizing frequency lists for common word combinations. Kennedy and Miceli’s approach would thus seem to mediate the deductive–inductive continuum, with students exposed to both ‘noticing’ and browsing modes, with the browsing mode paralleling Bernardini’s concept of the ‘learner-as-traveller’ engaged in serendipitous learning.

Both Bernardini’s concept of the ‘learner-as-traveller’ and Kennedy and Miceli’s task on personal writing with corpus searches in Italian for ‘space’ and ‘path’ implicitly touch on conceptual metaphor. Conceptual metaphor theory holds that metaphor operates at the level of thinking and has primacy over language (the most widely cited conceptual metaphor is that life is seen as a journey, as in ‘He hasn’t found his direction yet in life’).
Of interest here is that Kim (2011) proposes a meaning-making curriculum for second language writing underpinned by conceptual metaphor theory. As a counterpoint to the view of learners as deficient language users, a perspective that tends to be reinforced in much learner corpus work with its focus on error tagging, Kim advocates reconceptualizing learners as already competent meaning-making writers. To this end, Kim proposes corpus tasks modelled around conceptual metaphors, which encourage learners to engage in corpus searches as a conceptual, meaning-making activity rather than from a purely linguistic standpoint. For example, one such concept-based query would be to examine the nouns (e.g. ‘evidence’) associated with the item ‘converging’, as a means to decoding the underlying conceptual metaphor.

A few corpus-based endeavours have been inspired by Vygotskyan sociocultural theories of co-constructing knowledge through collaborative dialogue and negotiation, helping learners by means of ‘assisted performance’ to enhance learner agency. With particular reference to vocabulary learning, O’Keeffe et al. (2007: 55) discuss the advantages of using corpora for promoting ‘learner agency’, stating that this involves learners being trained to operate independently to develop a set of skills and strategies for processing and using new vocabulary, pointing out that depth of knowledge, i.e. building an integrated lexicon on a particular topic, can be more valuable than breadth. While Chau (2003) does not explicitly refer to ‘learner agency’ in the set of topic-related vocabulary patterns he directs students to compile from genre-specific corpora, the tasks would seem to embody this notion. Based on a corpus of 11 texts on English as an International Language, one of Chau’s tasks requires students to make a list of words/phrases connected with the reasons for the need to be proficient in English (also see Cobb, 1999, who demonstrates that hands-on concordancing promotes both breadth and depth of lexical acquisition).

Corpus-based activities are also a prime way to promote constructivist learning; this approach holds that learners create their own learning through linking new knowledge to their existing knowledge. As Widmann et al. (2011: 168) point out, ‘the more possible starting points a corpus offers for exploitation, the more likely it is that there exists an appropriate starting point for a specific learner’. It is this concept that is behind many of the recent search engine interfaces specifically designed with learners in mind.
Widmann et al.’s search tool for use with spoken language corpora in seven European languages, consisting of interviews with teenagers, has four types of search modes: browse mode (gives a quick summary of each interview); section search mode (allows users to zoom in on topic-specific sections and on all other annotation categories); co-occurrence mode (allows searching for several words within a certain span) and word search mode (allows word-pattern searches). Other corpus programs offering different pathways to searching the corpus include The Chemnitz Internet Grammar (Schmied, 2006), which allows students to work either deductively, accessing grammar rules, or inductively through searches for grammar patterns, thereby accommodating different learning styles.
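A search of the kind the co-occurrence mode described above performs, finding items that occur near one another within a given span, can be sketched in a few lines of Python. This is an illustration of the general technique, not the implementation of Widmann et al.’s tool: the function name, the `span` parameter and the one-line toy transcript are all assumptions introduced here.

```python
def cooccurrences(tokens, first, second, span=5):
    """Return (i, j) index pairs where `second` lies within `span`
    tokens either side of an occurrence of `first`."""
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        start = max(0, i - span)
        end = min(len(tokens), i + span + 1)
        for j in range(start, end):
            if j != i and tokens[j] == second:
                hits.append((i, j))
    return hits


# Invented toy transcript, lower-cased and split on whitespace.
tokens = ("i like to play football with my friends "
          "after school and we play music too").split()

# 'play' occurs at positions 3 and 12, 'friends' at position 7.
print(cooccurrences(tokens, "play", "friends", span=4))
print(cooccurrences(tokens, "play", "friends", span=5))
```

Widening the span from 4 to 5 tokens brings the second occurrence of play into range, which shows how the span setting controls how loosely two words may be associated and still count as a hit.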

Case Study: Integrating Corpus Consultation in a Report-Writing Course

This section describes a corpus-informed course, using in-house compiled corpora, primarily a 7-million-word sub-corpus of reports (see Milton, 1999), to exemplify how the course designer has attempted to integrate some of the concepts on language and language learning outlined above in the design and delivery of a 15-week report-writing module (Flowerdew, 2008, 2009).

Students need strategy training in corpus consultation. Various strategies were introduced to students in the report-writing module throughout the 15-week semester. Strategy training was incorporated in the actual writing task and not conducted as a separate training session, as is often the case in corpus-based instruction. Another feature of the program was that corpus consultation was integrated systematically, at three stages of the writing process (the first two stages in class), rather than done on an ad hoc basis. In Stage 1, students were required to consult the corpus for teacher-identified errors on their outline. In Stage 2, after the students themselves had identified errors in their project report Introductions, they carried out corpus searches for their queries. Stage 3 required students to generate their own queries while writing up their final project reports. Corpus instruction and training thus progressed from teacher-directed ‘convergent’ tasks towards more ‘divergent’ ones, where students took on a more autonomous role in deciding on the language points for which to consult a corpus (Leech, 1997).

During Stages 1 and 2 of the corpus consultation sessions, following the tenets of the sociocultural approach to learning outlined previously, students were divided into groups, with weaker students intentionally grouped with more proficient ones to foster collaborative dialogue through ‘assisted performance’ for formulating searches and discussion of corpus output. While the aim was to adopt an inductive approach, in reality, Johansson’s notion of a ‘guided inductive approach’ was often adopted as students encountered difficulties in interpreting some of the concordance output.
One such query concerned the use of ergative verbs, which required prompting by the teacher to direct students to the subject of a sentence in order to decide whether the active or passive form should be used (e.g. With a very crowded schedule, students’ level of motivation was decreased/has decreased). As far as the search items were concerned, in the main, student searches concentrated on individual lexical or grammatical items. However, the ‘noticing’ hypothesis was employed to encourage students to examine the phraseological, functional and genre-based properties of these items, together with a scrutiny of their frequency counts. For example, in looking up whether to use ‘focus’ in the active or passive in the sentence ‘This project focuses on/is focused on…’, one student queried why there was such a high frequency of the verb ‘focus’ in the present perfect tense. Closer inspection of the concordance lines looking at the wider context revealed these to have the function of indicating a gap in the research, e.g. ‘much of the cross-cultural work to date, however, has focused on East Asian versus Anglo comparisons with little attention paid to…’ (Flowerdew, 2009).

As the sub-corpus of reports was also linked to an in-house grammar-based guide, it was possible to have students toggle between the two resources. In fact, on occasions, it proved more efficient to look up certain items (e.g. the difference between ‘to be familiar with’ vs. ‘familiarize with’) in the grammar guide rather than having students examine the corpus data. Moreover, the availability of a corpus linked to a grammar guide has the advantage of accommodating different learning styles, as mentioned previously. As noted in Flowerdew (2008), field-dependent students who thrive in cooperative, interactive settings may benefit from corpus-based pedagogy, whereas field-independent learners may not take to this inductive approach to grammar, preferring instruction emphasizing rules.

Conclusion and Future Pathways

The first part of this chapter has shown how ‘corpora in the classroom’ have been of value for pedagogy with a focus on phraseological, functional-notional and genre-based approaches to instruction. The second part has exemplified how corpora can also be an ideal means for equipping students with strategies and skills embodied in SLA, sociocultural, cognition-based and constructivist theories of learning. A case study is included to illustrate how a course on report writing making use of corpora has drawn on theories of language and language learning from the applied linguistics literature in its delivery.

However, in spite of these affordances to learning noted in this chapter, corpora have not been exploited nearly as much in direct applications as they have in indirect applications (Stubbs, 2004). Why should this be so? Two key considerations relate to authentication and simplification. As Widdowson (2000) has argued, corpus output presents decontextualized data as the original context is not transposed with the text, resulting in a loss of communicative intent and sociocultural purpose. However, as pointed out by Gavioli and Aston (2001), it is not so much a question of whether corpora are divorced from their original setting, but rather ‘whether their use can create conditions that will enable learners to engage in real discourse, authenticating it on their terms’ (p. 240). Authenticating the use of corpora can be accomplished by the ‘guided-inductive approach’ discussed previously, involving some kind of pedagogic mediation to meet the specific needs of students. Pedagogic processing of corpus data can also involve simplification of various kinds, as unedited corpus data might be too difficult for students of lower ability. This can be done through software that filters the corpus data in terms of difficulty or by simplification of downloaded corpus texts (see Flowerdew, 2011).

It can be seen that more sophisticated corpus tasks involving guided induction and recent advances in software have enabled the pedagogic processing of corpus data to suit learners’ needs. Multimodal corpora are now coming on stream. However, several relatively unexplored areas remain ripe for further investigation and development. The first consideration is that corpora have overwhelmingly been used in tertiary settings; very few initiatives are reported in using corpora at the secondary school level. Another area for further development is the use of corpora for teaching languages besides English. The question of the role of English as a lingua franca in pedagogy has also been the subject of debate in corpus pedagogy (Mauranen et al., 2010). Perhaps one of the most pressing issues, though, relates to the effectiveness of corpus-based pedagogy; more empirical studies along the lines of the one conducted by Boulton (2010) are required to assess learners’ performance, thereby providing statistical evidence for the efficacy of learning through corpora.

This chapter has reviewed corpus-based pedagogies designed to enhance the acquisition of vocabulary, functions and genre, well-researched areas in the applied linguistics literature. It has also been argued that DDL is an ideal methodology to equip students with skills and strategies expounded in various language learning theories. It is hoped that future initiatives will continue to draw on this synergy between applied linguistics and corpus pedagogy.

References

Bernardini, S. (2004). Corpora in the classroom: An overview and some reflections on future developments. In Sinclair, J. (ed.), How to Use Corpora in Language Teaching. Amsterdam: John Benjamins, pp. 15–36.
Bhatia, V. K., Langton, N. and Lung, J. (2004). Legal discourse: Opportunities and threats for corpus linguistics. In Connor, U. and Upton, T. (eds), Discourse in the Professions. Amsterdam: John Benjamins, pp. 203–31.
Bloch, J. (2009). The design of an online concordancing program for teaching about reporting verbs. Language Learning & Technology, 13(1): 59–78.
Bloor, M. and Bloor, T. (1986). Language for specific purposes: Practice and theory. In CLCS Occasional Papers. Dublin: Centre for Language & Communication Studies, Trinity College.


Boulton, A. (2010). Data-driven learning: Taking the computer out of the equation. Language Learning, 60(3): 534–72.
Charles, M. (2007). Reconciling top-down and bottom-up approaches to graduate writing: Using a corpus to teach rhetorical functions. Journal of English for Academic Purposes, 6(4): 289–302.
Chau, M. H. (2003). Contextualizing language learning: The role of a topic- and genre-specific pedagogic corpus. TESL Reporter, 36(2): 42–54.
Chen, Q. and Ge, C. (2007). A corpus-based lexical study on frequency and distribution of Coxhead’s AWL word families in medical research articles. English for Specific Purposes, 26(4): 502–14.
Cobb, T. (1999). Breadth and depth of lexical acquisition with hands-on concordancing. Computer Assisted Language Learning, 12(4): 345–60.
Coccetta, F. (2011). Multimodal functional-notional concordancing. In Frankenberg-Garcia, A. et al. (eds), pp. 121–38.
Coxhead, A. (2000). A new academic wordlist. TESOL Quarterly, 34(2): 213–38.
Durrant, P. and Mathews-Aydınlı, J. (2011). A function-first approach to identifying formulaic language in academic writing. English for Specific Purposes, 30(1): 58–72.
Ellis, N., Simpson-Vlach, R. and Maynard, C. (2008). Formulaic language in native and second language speakers: Psycholinguistics, corpus linguistics and TESOL. TESOL Quarterly, 42(3): 375–96.
Flowerdew, J. (2009). Corpora in language teaching. In Long, M. and Doughty, C. (eds), The Handbook of Language Teaching. London: Wiley-Blackwell, pp. 327–50.
Flowerdew, L. (2008). Corpus linguistics for academic literacies mediated through discussion activities. In Belcher, D. and Hirvela, A. (eds), The Oral-Literate Connection. Perspectives on L2 Speaking, Writing, and Other Media Interactions. Ann Arbor, MI: University of Michigan Press, pp. 268–87.
Flowerdew, L. (2009). Applying corpus linguistics to pedagogy: A critical evaluation. International Journal of Corpus Linguistics, 14(3): 393–417.
Flowerdew, L. (2011). Corpora and Language Education. London: Palgrave Macmillan.
Frankenberg-Garcia, A., Flowerdew, L. and Aston, G. (eds) (2011). New Trends in Corpora and Language Learning. London: Continuum.
Gavioli, L. and Aston, G. (2001). ‘Enriching reality’: Language corpora and language pedagogy. ELT Journal, 55(3): 238–46.
Hafner, C. and Candlin, C. (2007). Corpus tools as an affordance to learning in professional legal education. Journal of English for Academic Purposes, 6(4): 303–18.
Handford, M. (2010). What can a corpus tell us about specialist genres? In O’Keeffe, A. and McCarthy, M. (eds), pp. 255–69.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Hyland, K. (2008). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27(1): 4–21.
Hyland, K. and Tse, P. (2007). Is there an “academic vocabulary”? TESOL Quarterly, 41(2): 235–53.
Johansson, S. (2009). Some thoughts on corpora and second-language acquisition. In Aijmer, K. (ed.), Corpora and Language Teaching. Amsterdam: John Benjamins, pp. 33–44.




Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. English Language Research Journal, 4: 1–16.
Johns, T., Lee, H.-C. and Wang, L. (2008). Integrating corpus-based CALL programs in teaching English through children’s literature. Computer Assisted Language Learning, 21(5): 483–506.
Jones, M. and Schmitt, N. (2010). Developing materials for discipline-specific vocabulary and phrases in academic seminars. In Harwood, N. (ed.), English Language Teaching Materials. Theory and Practice. Cambridge: Cambridge University Press, pp. 225–50.
Kennedy, C. and Miceli, T. (2010). Corpus-assisted creative writing: Introducing intermediate Italian learners to a corpus as a reference resource. Language Learning & Technology, 14(1): 28–44.
Kim, S. (2011). Metaphors in literacy development: A cognitive linguistic and multimodal perspective. Paper presented at AAAL Conference, Chicago, IL, 27 March.
Lee, D. and Swales, J. (2006). A corpus-based EAP course for NNS doctoral students: Moving from available specialized corpora to self-compiled corpora. English for Specific Purposes, 25(1): 56–75.
Leech, G. (1997). Teaching and language corpora: A convergence. In Wichmann, A. et al. (eds), Teaching and Language Corpora. London: Longman, pp. 1–23.
Mauranen, A., Hynninen, N. and Ranta, E. (2010). English as an academic lingua franca: The ELFA project. English for Specific Purposes, 29(3): 183–90.
Milton, J. (1999). Lexical thickets and electronic gateways: Making texts accessible by novice writers. In Candlin, C. and Hyland, K. (eds), Writing: Texts, Processes and Practices. London: Longman, pp. 221–43.
Mudraya, O. (2006). Engineering English: A lexical frequency instructional model. English for Specific Purposes, 25(2): 235–56.
O’Keeffe, A. and McCarthy, M. (eds) (2010). The Routledge Handbook of Corpus Linguistics. London: Routledge.
O’Keeffe, A., McCarthy, M. and Carter, R. (2007). From Corpus to Classroom. Cambridge: Cambridge University Press.
Paquot, M. (2010). Academic Vocabulary in Learner Writing. London: Continuum.
Römer, U. (2010). Using general and specialized corpora in English language teaching: Past, present and future. In Campoy-Cubillo, M., Bellés-Fortuño, B. and Gea-Valor, M. (eds), Corpus-based Approaches to English Language Teaching. London: Continuum, pp. 18–38.
Schmied, J. (2006). Corpus linguistics and grammar learning: Tutor vs. learner perspectives. In Braun, S., Kohn, K. and Mukherjee, J. (eds), Corpus Technology and Language Pedagogy. Frankfurt am Main: Peter Lang, pp. 25–47.
Sealey, A. and Thompson, P. (2007). Corpus, concordance, classification: Young learners in the L1 classroom. Language Awareness, 16(3): 208–23.
Sinclair, J. McH. (ed.) (1987). Collins COBUILD Dictionary. London: HarperCollins.
Sinclair, J. McH. (ed.) (1990). Collins COBUILD English Grammar. London: HarperCollins.
Sinclair, J. McH. (1999). The lexical item. In Weigand, E. (ed.), Contrastive Lexical Semantics. Amsterdam: John Benjamins, pp. 1–24.
Sinclair, J. McH. and Renouf, A. (1988). A lexical syllabus for language learning. In Carter, R. and McCarthy, M. (eds), Vocabulary and Language Teaching. London: Longman, pp. 140–60.


Stubbs, M. (2004). Language corpora. In Davies, A. and Elder, C. (eds), The Handbook of Applied Linguistics. Malden, MA: Blackwell, pp. 106–32.
Swain, M. (1998). Focus on form through conscious reflection. In Doughty, C. and Williams, J. (eds), Focus on Form in Classroom Second Language Acquisition. Cambridge: Cambridge University Press, pp. 64–81.
Swales, J. (2002). Integrated and fragmented worlds: EAP materials and corpus linguistics. In Flowerdew, J. (ed.), Academic Discourse. London: Longman, pp. 150–64.
Swales, J. (2004). Research Genres. Cambridge: Cambridge University Press.
Thurstun, J. and Candlin, C. (1998). Concordancing and the teaching of vocabulary of academic English. English for Specific Purposes, 17(3): 267–80.
Tribble, C. and Jones, C. (1990). Concordances in the Classroom. London: Longman.
Weber, J.-J. (2001). A concordance- and genre-informed approach to ESP essay writing. ELT Journal, 55(1): 14–20.
Widdowson, H. G. (2000). On the limitations of linguistics applied. Applied Linguistics, 21(1): 3–25.
Widdowson, H. G. (2003). Defining Issues in English Language Teaching. Oxford: Oxford University Press.
Widmann, J., Kohn, K. and Ziai, R. (2011). The SACODEYL search tool – Exploiting corpora for language learning purposes. In Frankenberg-Garcia, A. et al. (eds), pp. 167–78.
Wilkins, D. (1976). Notional Syllabuses: A Taxonomy and Its Relevance to Foreign Language Curriculum Development. Oxford: Oxford University Press.
Willis, D. (1990). The Lexical Syllabus: A New Approach to Language Teaching. London: HarperCollins.
Willis, D. and Willis, J. (1988). Collins COBUILD English Course. Glasgow: Collins Cobuild.

Chapter 14

Corpora and Materials Design

Michael McCarthy and Jeanne McCarten

Introduction: The Emergence of Corpus-Informed Language Teaching

The last ten years have seen a revolution in attitudes towards corpora, especially spoken corpora. Chapters in O’Keeffe and McCarthy (2010) testify to the interest of applied linguists in exploiting corpora in areas as diverse as syllabus design (Walsh, 2010), data-driven learning, where corpora become the raw material for language learning (Chambers, 2010), the teaching of writing (Flowerdew, 2010) and assessment (Barker, 2010). These developments come some quarter of a century after the visionary first steps by John Sinclair and his COBUILD dictionary team at the University of Birmingham, UK. Initial scepticism towards the idea of a dictionary written from tabula rasa soon gave way to a wave of excitement amid the global success of the corpus-based COBUILD (1987) learner’s dictionary and its spin-offs. This was followed by reference grammars for language learners based on corpus evidence (e.g. COBUILD, 1990; Carter and McCarthy, 2006).

The first major attempt to produce a general English language course informed by corpus research was the Collins COBUILD English Language Course (Willis and Willis, 1987–1988). The course met with mixed success. Many were inspired by its novel approach to words and structures, where in the beginner-level of the course, the most frequent 700 words and their most common patterns were taught, with the next 800 words and their patterns taught at Level 2 of the course, and so on (see Willis, 1990). On the other hand, others found its ways of describing and teaching items an unfamiliar challenge.

Meanwhile, in the 1990s, the corpus revolution consolidated its footholds in the language teaching profession, though often accompanied by testy debates between its defenders and those sceptical of the value of corpora in
the language classroom. Widdowson (2000) challenged the notion that corpus extracts in teaching materials are, in any meaningful sense, more ‘real’ than invented language. On the other hand, the shortcomings of scripted dialogues as models for learning were identified by Carter (1998), Burns et  al. (2001), McCarthy and O’Keeffe (2004), among others. The debate seemed particularly heated over the role of conversational corpora (especially native-speaker corpora) as a model for learners to aspire to. Multiple factors played into the arguments. These included the growing international role of English and the feeling that English should no longer be seen as the property of its native users, the difficulty of presenting natural spoken language data to learners with its context-embedded understandings that often made it impenetrable to the external observer, the feeling of insecurity generated by the challenge of teaching apparently anomalous grammatical structures associated with casual conversation and a general unease about the informality of everyday talk. Notwithstanding, the applied linguistics literature during the period abounds with investigations and insights into everyday spoken language that caused many to question whether ‘safe’ teaching models based largely on written language and received wisdom gave learners the tools to interact with other users of English (Scotton and Bernsten, 1988; McCarthy, 1998; see also references mentioned above). This position was especially relevant in a global, mobile society increasingly attuned to spoken communication and the rise of largely informal electronic and internet-based means of communication. 
However, it was not till the late 1990s that enough spoken data had been gathered in major corpus projects such as The British National Corpus, the Cambridge English Corpus (formerly Cambridge International Corpus, www.cambridge.org/corpus) and the American National Corpus, among others, to underpin a general multi-strand language course that could truly exploit spoken, as well as written, evidence. In the rest of this chapter, we focus mainly on one project involving us as co-authors, which we present as an example of the challenges of producing spoken corpus-based materials that are usable, useful and non-threatening to teachers and learners alike (see Note 1). We focus on the spoken strand of the course, as it is there that, we hope, the most innovative elements reside, and it is there that answers to the criticisms of the debates mentioned above were cautiously attempted. We leave the reader to judge whether the results were successful, and we have no doubt that much work remains to be done in the future in areas relating to the description and teaching of speaking, as well as to the exploitation of corpora in this kind of large-scale publishing project.




Unearthing the Evidence: Initial Analysis of Corpora

Word frequency lists

Of the tools that corpus-analytical software can offer, word frequency lists are the easiest to produce and exploit. The vocabulary materials published by McCarthy and O’Dell (1999–2008) illustrate the types of material that are facilitated by such lists. The main source for those particular vocabulary syllabuses was the Cambridge English Corpus (hereafter CEC), which now exceeds a billion words, along with the Cambridge Learner Corpus (hereafter CLC), a collection of several hundred thousand written examination papers from Cambridge ESOL, recently boosted by the addition of large amounts of spoken data from oral examinations and other English language-learning contexts. The CLC, a great proportion of which has been error-coded (enabling searches for incorrect and rater-corrected forms in student writing, linked to metadata about the learners), has informed a number of ELT products, including dictionaries, the vocabulary syllabus of a general course (Redston and Cunningham, 2005) and examination preparation materials (Capel and Sharp, 2008). Other corpus-based word lists, such as the Academic Word List (AWL: see Coxhead, 2000), have informed more specialized learning materials (e.g. Schmitt and Schmitt, 2005). Word frequency lists are an invaluable resource for the creation of common core and more advanced and specialized vocabulary syllabuses (see McCarthy, 1999; O’Keeffe et al., 2007, Chapter 2). At beginner and elementary levels, they can assist the materials writer in identifying which members of large lexical sets to prioritize: for example, which of two or three near-synonymous expressions is more frequent (e.g. start or begin; make a trip, take a trip or go on a trip), or which are the most frequent collocates of common delexical verbs such as get, take, make and do (McCarthy and O’Keeffe, in press, and see below). However, even here, matters are not so simple.
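By way of illustration, a word frequency list of the kind described above, and a frequency comparison of two near-synonyms, might be sketched as follows. This is a minimal sketch and not the software used in the project: the tokenizer is deliberately crude, the function names are our own, and the sample text is invented rather than drawn from the CEC.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def compare_candidates(text, candidates):
    """Return the raw frequency of each candidate word, e.g. a set of near-synonyms."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in candidates}


# Invented toy sample, NOT real corpus data.
SAMPLE = ("We start at nine. Did you start the car? "
          "Let's begin. I start work early and they start late.")

if __name__ == "__main__":
    for word, count in frequency_list(SAMPLE)[:3]:
        print(word, count)
    print(compare_candidates(SAMPLE, ["start", "begin"]))
```

On a real corpus the same counts would normally be normalized (for example, per million words) before lists from corpora of different sizes are compared.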
In comparing spoken and written frequency lists separately, Buttery and McCarthy (2011) show that, while some 65% of all the lexical items are found in both lists, 35% are unique to either the spoken or written data, thus militating against an over-simplification of using a ‘master’ list of frequent words: a good many words characterize and reveal the separate and distinct ‘fingerprints’ of the spoken and written corpora, respectively. An equal note of caution is sounded vis-à-vis lists for specialized courses by Hyland and Tse (2007). Further issues raised by an over-simplified approach to frequency lists include the uneven distribution of words that are related semantically or which cluster into psychologically valid sets. For example, the most frequent ten colour terms display widely differing frequencies (red is almost four times more frequent than brown or pink in the spoken British English segment of the CEC), so these terms might be mathematically pigeon-holed in different frequency bands. Only four of the seven days of the week feature in the top 1,000 words in the North American English conversational segment of the CEC, but teachers and learners would consider it absurd to separate out three days of the week for teaching at higher levels. McCarten and McCarthy (2010) observe also that native-speaker corpora derived from the naturalistic environments of day-to-day life often do not contain the kinds of language which non-native users may need. They remark that few large-scale spoken corpora include data recorded in language-teaching classrooms (pace Walsh, 2006), where classroom items (board, highlighter pen), common procedures (underline, get into pairs) and linguistic terminology (adverb, collocation, etc.) would feature more frequently than in a corpus collected outside the classroom.

Keyword lists

More sophisticated features of corpus-analytical software can help overcome some of the vagaries of raw frequency lists. Keyword analysis, in which words are ranked not according to frequency but according to the statistical significance of their occurrence (or non-occurrence) in particular corpora, can provide a more accurate fingerprint of a particular set of data. Such information can be invaluable to the materials writer. For example, based on the one-million-word CANBEC spoken business English corpus (see Handford, 2010 for a full description of the corpus), the six most frequent -ly adverbs are actually, really, only, probably, obviously and basically. Actually, really, only and probably are shared with the top six -ly adverbs in a frequency list derived from a one-million-word socializing sub-corpus of the CANCODE corpus.
So, on the basis of frequency alone, there seems to be little here to distinguish business English from ordinary conversation. A keyword analysis, however, gives a different top six -ly adverbs: effectively, currently, basically, hopefully, obviously and originally. The keyword adverbs seem to yield more readily the flavour of business interactions, even though such interactions obviously will share many features with other types of talk. Thus the materials writer can generate materials more narrowly focused in the target domain(s) with, hopefully, a higher degree of efficiency in the teaching and learning of specialized language types (see McCarthy et al., 2009 for examples of corpus-derived business English material).
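The difference between a frequency ranking and a keyword ranking can be illustrated with a short sketch. We assume here the log-likelihood statistic, which is widely used for keyness in corpus software; the chapter does not say which statistic produced the lists above, so this choice, the function names and the toy token lists are all assumptions made for illustration only.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of a word occurring a times in a target corpus
    of c tokens and b times in a reference corpus of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_target)
    if b > 0:
        score += b * math.log(b / expected_reference)
    return 2 * score


def keywords(target_tokens, reference_tokens, top_n=5):
    """Rank words that are relatively more frequent in the target corpus by keyness."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Invented toy data, NOT figures from CANBEC or CANCODE.
TARGET = ["actually"] * 5 + ["effectively"] * 5 + ["the"] * 10
REFERENCE = ["actually"] * 5 + ["lovely"] * 5 + ["the"] * 10

if __name__ == "__main__":
    print(keywords(TARGET, REFERENCE))
```

In the toy data, actually is as frequent as effectively in the target list, but only effectively emerges as a keyword, because actually is equally frequent in the reference list.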




Collocation statistics

A further capability offered by corpus software is the generation of collocation statistics. Collocation is, by definition, a probabilistic phenomenon based on likelihood of co-occurrence (Halliday, 1966; Sinclair, 1966). In this way, relatively simple computer algorithms provide hard information on statistically significant co-occurrences of words which, although part of the expert-user’s competence, are not readily amenable to introspection. Furthermore, by performing the same operations on learner data, we can assess whether collocational competence is developing adequately alongside other aspects of proficiency. McCarthy and O’Keeffe (2011), using the spoken segment of the CLC, show how a set of common delexical verbs (get, go, do, have, take and make) display different collocational patterns in three different datasets. They compared native-speaker everyday conversation with spoken academic data, and learner data taken from oral examinations (see Note 2). They observe that the actual distribution of the delexical verbs is not greatly different between the three datasets, especially in the case of go, do, take and make. The verb get, nonetheless, showed a certain discrepancy, with learners appearing to use it much less frequently than the native speakers of the other two datasets. Of equal interest were the different collocations of the delexical verbs in the three datasets. The learner data yielded collocations for the verb get such as get married, get a job, get a chance, reflecting the typical personally oriented topics of oral examinations, while the academic data had collocations such as get rid of, get an idea, get a sense. Such outcomes naturally need to be treated with caution, and corpus data of this type offer both opportunities and challenges.
The learner data provides a window into the discourse of the oral examinations and can thus nicely inform exam preparation material; it perhaps also provides a glimpse of what may have been gaps in pedagogical input, with insufficient attention being paid to ubiquitous, polysemous items such as get, especially if the syllabus is strongly written-biased. The picture is by no means simple but is one that frequently presents itself to the materials writer immersed in corpus data. The dilemma spotlights the fact that it is the mediation process that will ultimately determine success. One thing is clear: good mediation is assisted by good data, preferably triangulated with other data from a variety of contexts. In the case of the learner spoken data on collocation, we probably cannot make any general claims about learner competence in this area without robust data from contexts other than oral interviews, to include non-examination classroom performances and out-of-class naturalistic data, as well as student writing across a range of genres, including free writing opportunities (Nesselhauf, 2003, 2005; see also Chau, Chapter 12, this volume).
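The kind of ‘relatively simple computer algorithm’ referred to in this section can be sketched as follows. The sketch ranks the words found within a short span to the right of a node word by pointwise mutual information (PMI). This is only one of several association measures in common use, the simplified formula ignores the span width, and the measure, the function name and the toy token list are our assumptions, not the procedure used in the studies cited.

```python
import math
from collections import Counter


def collocates(tokens, node, span=2, min_count=2):
    """Rank words occurring within `span` tokens to the right of `node`.

    Returns (word, joint_frequency, pmi) triples, highest PMI first.
    PMI here is log2(joint * N / (freq(node) * freq(word))), a common
    simplification that does not adjust for the width of the span.
    """
    total = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            for other in tokens[i + 1:i + 1 + span]:
                pair_freq[other] += 1
    results = []
    for word, joint in pair_freq.items():
        if joint < min_count:
            continue  # very rare pairs give unreliable scores
        pmi = math.log2(joint * total / (word_freq[node] * word_freq[word]))
        results.append((word, joint, pmi))
    results.sort(key=lambda row: (-row[2], -row[1], row[0]))
    return results


# Invented toy data echoing the learner collocations mentioned above.
TOKENS = ("get a job get married get a chance get a job "
          "i want a job get married soon").split()

if __name__ == "__main__":
    for word, joint, pmi in collocates(TOKENS, "get"):
        print(word, joint, round(pmi, 2))
```

Running the same function over separate learner, conversational and academic token lists would give the kind of side-by-side comparison described above, subject to the same cautions about the contexts from which the data come.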


Chunk lists

A final type of statistical output from corpus software is the counting of recurrent chunks of varying extents (e.g. 3-word chunks, 4-word chunks). It is no longer controversial to observe that meaning is created not only in single words but in the regular recurrence of particular strings of words which, over time, acquire pragmatically specialized meanings. Such recurrent strings have been variously called ‘multi-word units’, ‘clusters’, ‘chunks’, ‘lexical phrases’, ‘formulaic sequences’ or ‘lexical bundles’ (Nattinger and DeCarrico, 1992; Biber et al., 1999; Schmitt, 2004; O’Keeffe et al., 2007). The most frequently recurring chunks consist of between two and four words (with a much lower frequency of chunks of up to six or seven words) and are characterized, in conversation, by familiar expressions such as you know, I mean, in a way, things like that, on the other hand, you know what I mean. Statistical evidence of the frequency of such chunks in everyday talk has been one of the major contributions of corpus linguistics to our understanding of conversation and to how pragmatic meaning is realized in units greater than the single word, the so-called ‘idiom principle’ (Sinclair, 1991, 2003, 2004). Lists of chunks typically contain a mix of fragments that are often devoid of unified meaning (e.g. the high frequency of combinations such as in the, was a, no I do, it was a) along with fully integrated items such as at the moment, oh my God, a couple of. These latter provide a lexicon as useful as (and often more frequent than) the single-word lists that commonly form the basis of the vocabulary syllabus of a language course. As with all corpus analysis, finer-grained results can be obtained by building corpora of specialized language.
For example, the chunk in terms of is an extremely high-frequency item in both academic and business English, while the pragmatically specialized meaning of the chunk going forward (meaning ‘as from now’) emerges as a fingerprint of business English (McCarthy et al., 2009).
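The counting of recurrent chunks described in this section amounts to counting n-grams. A minimal sketch, assuming that each utterance is a plain string and that chunks should not run across utterance boundaries (both assumptions are ours, as are the function name and the invented utterances):

```python
from collections import Counter


def chunk_list(utterances, n, min_count=2):
    """Count recurrent n-word chunks within utterances.

    Returns (chunk, count) pairs, most frequent first, ties alphabetical.
    """
    counts = Counter()
    for utterance in utterances:
        tokens = utterance.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    ranked = [(chunk, c) for chunk, c in counts.items() if c >= min_count]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Invented toy utterances, NOT corpus extracts.
UTTERANCES = [
    "you know what i mean",
    "i mean it was a long day",
    "you know it was a bit late",
    "do you know what i mean",
]

if __name__ == "__main__":
    print(chunk_list(UTTERANCES, 2))
    print(chunk_list(UTTERANCES, 4))
```

Even in this toy output, pragmatically integrated items (you know, i mean) sit alongside fragments (was a, what i), which is why the raw list cannot be used without interpretation.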

Pedagogical Mediation of Conversational Data

The issue of mediation touched on in Section 2, above, raises the question of what it is we are actually modelling for language learners when we exploit the output of corpus analyses. We have suggested that this output can be both illuminating and potentially misleading. The mediation process, to be truly effective, must begin with the question of what the data tell us about language in context, as evidenced in the texts or conversations that make up the corpus.




We mentioned earlier that conversational data present particularly acute challenges, not least because they may provoke strong reactions. Over and above the arguments about native-user dominance, critics have rightly pointed out that conversational data are highly context-dependent and commonly possessed of a great degree of unspoken shared understandings. For the outsider, conversation (especially among friends and intimates) can be impenetrable when seen in transcribed form. We accept these caveats but wish to offer here what we feel are some basics of conversational behaviour, which corpus analysis can tease out, which can be observed with a certain degree of objectivity and which could be said to be a set of human values of interaction rather than culture-bound or native-user-bound forms. In the simplest of terms, it is not so much what the speakers say in the corpus as what they do to create and sustain successful spoken interactions. For example, effecting topic-change in a conversation or showing that one is listening actively can be seen as acts of doing rather than of saying.

As we have suggested, the statistical output of corpus analysis is the starting point in the design of the speaking syllabus. A conversational corpus, preferably compiled to reflect a wide range of language users and contexts of use, is the raw material. This may range in size from several hundred thousand words for more specialized types of language, to multi-million word spoken corpora such as the 10 million words of the British National Corpus or the spoken segment of the American National Corpus. An ability to create sub-corpora (for example, of service encounters or of dinner-table conversations) offers the possibility of finer-grained analyses. The 5-million-word CANCODE corpus of spoken British English is one such corpus (see McCarthy, 1998, for an account of its sub-components). Conversational frequency lists largely confirm the broadest expectations.
McCarten and McCarthy (2010) note that conversation, unsurprisingly, reveals more informal and colloquial vocabulary (kids, get, yeah, gosh, um), while, as in written corpora, a large proportion of the most frequent words consist of grammatical–functional words (articles, pronouns, prepositions, etc.), which form the bulk of the first hundred items. However, as Buttery and McCarthy (2011) show, about a third of the words in the spoken and written frequency lists they used exhibit properties distinct to one list or the other. Some words occur counter-intuitively in extremely high ranks in spoken frequency lists: polysyllabic adverbs such as apparently, absolutely and definitely all occur in the top 700 words of the CANCODE spoken corpus but do not feature in the top 1200 of an equivalent-sized general written corpus. Indeed, absolutely is four times more frequent in the spoken than in the written data. The explanation is found in the interactive roles the adverbs play in conversation, with the high frequency of absolutely, for example, largely accounted for by its use as a response token, where it operates as a stronger, more involved version of yes (or, with the addition of not, as a strong version of no). Meanwhile, all three words also perform functions in relation to modality, tempering or reinforcing degrees of certainty or necessity. The three adverbs, in addition to a number of other adjectives and adverbs (e.g. good, great, fine, certainly, right) operate as members of a finite set of recurring freestanding items used to react to incoming talk. The members of the set may vary between varieties of English (McCarthy, 2002), and it is by no means axiomatic that any set identified as characteristic of a particular group or sub-group of native users will be the most useful, appropriate or meaningful for learners. What matters in the final analysis is the observation of engaged response (or ‘listenership’ as it has been termed elsewhere – see O’Keeffe et al., 2007) as a normal, natural and pervasive human conversational phenomenon. It is this insight that the materials writer carries over into the material. In this particular case, the morphology of the response tokens as adjectives or adverbs is less important than their discourse function as interpersonal responses and so they earn a place in the speaking syllabus rather than in the grammatical or vocabulary syllabus.

A second example of the mediation process is how to interpret turn-opening phenomena. This aspect of corpus research has involved the counting not just of spoken words but of transcription codes themselves. Tao (2003) looked at those items immediately following new speaker turn-markers (typically diamond brackets enclosing a speaker designation, such as S1, S2 or $01, $02, etc.). He found a consistency in how speakers open their turns, with turn-opening items predominantly consisting of free-standing items which attended to what the speaker had just heard. Applying Tao’s analysis to the spoken data of the CANCODE corpus, we find the most common turn-openers to include responsive items such as yeah, mm, oh, right, well and others of similar type (McCarthy, 2010). Not only can this be seen as an extension of the listenership principle already discussed, but we can also see in the turn-opening process the creation of conversational linking from turn to turn, which contributes to that ‘flow’ in talk that McCarthy (ibid.) refers to as confluence – the quality that makes a conversation ‘fluent’. Fluency is located not solely in the domain of the individual speaker but is a product of the conversation itself. Once again, we see the central role of the interpretative, mediating process in the use of corpus output in the creation of materials. Without such mediation, corpus-informed materials design is a mere game of numbers.
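The counting of turn-opening items described above can be sketched in a few lines. We assume a hypothetical transcript format in which each turn begins with a speaker code in diamond brackets (such as <S1> or <$01>) followed by the speaker's words. The regular expression, the function name and the sample transcript are our own illustrative assumptions and do not reproduce the CANCODE transcription scheme or Tao's procedure.

```python
import re
from collections import Counter

# A speaker code in diamond brackets (e.g. <S1>, <$01>), then the first word of the turn.
TURN_OPENER = re.compile(r"<\$?[A-Za-z]*\d+>\s*([A-Za-z']+)")


def turn_openers(transcript):
    """Count the first word of every turn that starts with a word token."""
    return Counter(match.group(1).lower() for match in TURN_OPENER.finditer(transcript))


# Invented toy transcript, NOT a corpus extract.
TRANSCRIPT = """<S1> so are you coming tonight
<S2> yeah I think so
<S1> right well we leave at six
<S2> oh okay
<S1> yeah no rush
"""

if __name__ == "__main__":
    for word, count in turn_openers(TRANSCRIPT).most_common():
        print(word, count)
```

The count is only the starting point: as argued above, deciding what the turn-openers are doing in the conversation is the work of interpretation, not of the software.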




Modelling the Data: The Basics of Conversation

In the coursebook project, which we report in this chapter, the challenge was to sustain a speaking syllabus over the several levels of the course and to engineer it so that, by the completion of the course, the learner had covered the most important elements of a useful speaking repertoire. There needed, therefore, to be not just a list of items to be taught but also some macrolevel understanding of how the parts fitted together into a coherent whole. We have attempted to show in Section 3, above, that individual items from frequency lists offer a window into general principles of conversational behaviour, the doing, not saying, principle. As the research progressed, an overarching structure gradually emerged from the mass of statistical output, which offered four major categories to guide the syllabus:

1. Organizing your own talk
2. Taking account of other speakers
3. Listenership
4. Organizing the conversation as a whole

All four together were seen as contributing to fluency. Here we present brief examples of how each of the four categories worked in practice.

Organizing your own talk

We would include here items that realize conversational strategies such as self-correcting, maintaining the floor in a conversation, topic management, digression and returning to the main thread of the talk. Organizational discourse markers such as well, so, I mean, on the other hand, the thing is, highly frequent in both the single-word and chunks lists, come into service, enabling speakers to construct coherent turns. An instance of the thing is is shown in Example 1 (see Note 3).

Example 1: The speaker talks about whether she and her husband will be able to attend an oversubscribed event that someone else has organized.
Before he hung up he said he said ‘I’m pretty sure that you’ll be able to go and Peter will be able to go’. He said ‘I’m pretty sure. But I’ll let you know. I have to let you know because I don’t have any responses yet’. But see the thing is when he said he’d overbooked I didn’t know how much he overbooked so I don’t know how many people have to decline in order to fit another person.


Here is an extract from the low intermediate-level course book which teaches the thing is and related expressions for focusing and discourse organization. The presentation box contains the following elements:

• explanation of the use of the thing is expressions
• an extract from a previously studied conversation (with a photo of the speaker)
• a ‘factoid’ about the relative frequencies of related expressions in the corpus.

The practice activity that follows (Figure 14.1) is a familiar ‘Fill the gap’ format. The items are then used as the basis for controlled and freer, personalized practice.

Figure 14.1  From McCarthy, McCarten and Sandiford 2006a © Cambridge University Press 2006. Reprinted with kind permission.

Taking account of other speakers

Speakers adapt their talk in relation to their listeners. A variety of items facilitate and signal such adaptations, ranging from politeness phenomena aimed at protecting both the positive and negative face of participants (Brown and Levinson, 1987), through the casting of various levels of formality, to checks that the listener is following and projections of the listener’s potential response(s). Hedging and boosting and the placement of stance and attitudinal expressions positioning the speaker and aligning with the positions of other speakers come within this category. As before, the corpus frequency lists bear evidence of the items within the repertoire, but it is the behaviour itself which is fundamental and this has to be mediated in whatever way best suits the target learner population. Issues such as relevant and interesting topics of conversation, the register-sensitivity of individual items and the generalizability and usefulness of items all mediate between the corpus and the course book and may involve choices that transform the original data while remaining faithful to the basic observations.

An important feature of sensitivity to another speaker is the degree to which speakers can assume information to constitute new or shared knowledge, or the extent to which the listener possesses a similar world view. The chunks You see and You know epitomize the divide between assumed new versus old information, respectively. In Example 2 below, the speaker gives what he believes is new information for the listener, who is a stranger.

Example 2: Two people are discussing how they use credit cards.
A: So you’re somebody that will pay off your bills right away.
B: Yes, definitely. You see, if I’d don’t have the money in my checking account, I will not charge it, even if I have the charge. I have to have that money before I’ll charge it.

In contrast, the speaker in Example 3 can be confident that her listener does not need to be told that dogs can create dirt in the house.

Example 3: Someone talks about the problem of having a dog in the house.
I never wanted, my husband wanted indoor dogs, because he’d grown up with them. But I can’t stand it indoors. I don’t, they shed, and, you know, the mud, and, And the scratches.

Among the most frequent items in the chunks lists from informal spoken corpora are vague expressions such as or something like that, that sort of thing, or whatever, and things like that, and all that kind of stuff. They project an assumption that the listener can fill in the items in a category, which are only obliquely referred to, hence the term vague category marker (VCM) for such expressions (Evison et al., 2007).
Speakers use these chunks to avoid overexplicitness; indeed it would be absurd and impossible to list all items in every category referred to in a conversation (this would be true of example 4, below). The VCMs perform a basic conversational function of projecting a shared world with the listener. However, this type of expression is an example of the sort of language that might be considered as too colloquial, lazy or ‘sloppy’. The mediation process thus requires embedding them in contexts that are normal, natural and transparent, and where their use is indispensable, as well as in activities that look familiar and non-threatening.


Figure 14.2  From McCarthy, McCarten and Sandiford 2005b, © Cambridge University Press 2005. Reprinted with kind permission.

Example 4: Someone describing a friend.
We both, um, liked to read some of the same kinds of stuff. Science fiction and things like that so yeah.

Exploring the meaning of strategic language in realistic contexts is an important step along the path of learning to use it productively. One means of doing this is demonstrated in the exercise shown in Figure 14.2, where students choose two possible meanings of the vague expressions in the exchanges before they ask the questions and give their own answers.

Listenership

The role of listenership outlined in Section 3 is one where the exploitation of the corpus has knock-on effects in the mediation process with outcomes in both the nature of the syllabus and in the teaching methodology, illustrating the interrelatedness of the steps that begin with the corpus and end in the classroom. We suggested that a particular repertoire of items enables the listener to show engagement and involvement with the speaker by responding with more than yes and no, that items such as great, fine, absolutely and definitely suggest more than a passive recipient role. We also suggested that turn-opening items provided a link with what had just come before, a further extension of the listener’s acknowledgement of and engagement with the speaker. This theme of supportive fellow-speakers (Bublitz, 1988) has perhaps received less than due attention in the general area of ‘listening skills’, and listening syllabuses tend to be dominated by tasks relating to comprehensibility and strategies for extracting meaning from sound. O’Keeffe et al. (2007, Chapter 7) contrast listenership as a means of showing engagement with what is said with ‘listening comprehension’, for example, informational answers to comprehension questions, true/false statements, gap filling, etc. While such tasks are undoubtedly valid as a check on the learner’s comprehension, the corpus data suggest that, in human interaction, comprehension is shown by appropriate reaction. Mediating the real-world phenomenon in a learning environment suggests a shift in emphasis from solely testing the accuracy of information intake to activities that facilitate appropriate response.

In the activity below (Figure 14.3), listening comprehension and listenership activities sit side by side. In part A, students listen for gist in a traditional comprehension exercise. Part B has students write an appropriate response to each speaker to show understanding and engagement, using previously learnt expressions with That must be + adjective or You must be + adjective as a listenership activity. Students can then suggest further ideas for responses to develop the activity further.

Managing the conversation as a whole

This final heading covers the ways speakers exercise the power of management, the ability to open and close conversations, to change their direction, to ‘pause and rewind’ them (metaphorically speaking) and generally to feel in control of the discourse. Such phenomena relate to the conversation as a complete artefact and involve the materials designer in examining lengthy stretches of talk, albeit originally prompted by an individual item or items in the frequency lists. These items are likely to include markers such as conversation openers or topic-shifters (so, right, yeah-no, now) and conversation-closers (well, anyway, all right, right-then), reverting to earlier topics (as you say, as you were saying, going back to what you were saying), interrupting and restarting conversations after a break (hold on a sec; where were we?). Illustrative in this respect are the chunks speaking of and talking of, both of which predominantly function strategically to raise a topic linked to a previous one.

Figure 14.3  Extract from McCarthy, McCarten and Sandiford (2006a) © Cambridge University Press 2006. Reprinted with kind permission


Conversation management functions are often the prerogative of the more fluent or native speaker. Enabling learners to take ownership of conversations so that they can exercise these management skills thus brings us back again to the need to mediate the corpus insights with interesting and relevant topics, personalized activities and meaningful contexts in which to practise.
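One way a materials designer might begin to examine chunks such as speaking of and talking of in their surrounding talk is with a simple concordance. The following is a minimal sketch in Python, not the procedure used for the course described in this chapter: the function name kwic, the context width and the sample sentences are all invented for illustration.

```python
import re

def kwic(text, phrase, width=30):
    """Return keyword-in-context lines for every occurrence of `phrase`."""
    lines = []
    # Whole-phrase, case-insensitive match so "Speaking of" is also found.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context and left-align the right context
        # so that the search phrase lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample for illustration only; not taken from any corpus.
sample = (
    "We went hiking last weekend. Speaking of weekends, are you free on Saturday? "
    "Anyway, I should get going. Oh, talking of Saturday, where were we?"
)

for chunk in ("speaking of", "talking of"):
    for line in kwic(sample, chunk):
        print(line)
```

Reading the left-hand context of each line is what would show whether the chunk links the new topic to a previous one; the code only retrieves the lines, and the functional judgement remains with the analyst.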

Conclusion

The first decade of the twenty-first century offered developments in corpus compilation and analysis, which facilitated the creation of a corpus-informed English language course. The availability of large conversational corpora made it possible for general courses to be constructed in a way that all their elements, not just the written elements, benefit from corpus insights. Spoken data not only put spoken–written differences into relief, they also have knock-on effects on the grammar and lexicon (see McCarten and McCarthy, 2010 for examples), the concept of fluency, the methodology of teaching elements such as listening and the notion of what it means to perform effectively in a conversation. Corpus-analytical software has come a long way since the 1980s and can now produce a variety of statistical output. However, we have attempted to show throughout this chapter that the output of corpus analysis is where the mediation process begins. The issues of mediation are far from simple but can be rewarding and can lead to innovative and, we hope, creative, effective, enjoyable and productive teaching materials that remain faithful to the human contexts in which they originate.

Notes

1 The project which is the main inspiration of this chapter is the creation of the four-level adult North American English course, Touchstone (McCarthy et al., 2005a, b, 2006a, b).
2 The data were kindly made available for the English Profile Project by Cambridge ESOL. For full details of the English Profile Project, see its website.
3 Examples 1–5 taken from the Cambridge English Corpus, North American English conversation, © Cambridge University Press. Reprinted with kind permission.





References

Barker, F. (2010). How can corpora be used in language testing? In O’Keeffe, A. and McCarthy, M. J. (eds), The Routledge Handbook of Corpus Linguistics. Abingdon, Oxon: Routledge, pp. 633–45.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999). Longman Grammar of Spoken and Written English. London: Longman.
Brown, P. and Levinson, S. (1987). Politeness: Some Universals in Language Usage. Cambridge: Cambridge University Press.
Bublitz, W. (1988). Supportive Fellow-Speakers and Cooperative Conversations. Amsterdam: John Benjamins.
Burns, A., Joyce, H. and Gollin, S. (2001). ‘I See What You Mean’: Using Spoken Discourse in the Classroom. Sydney: National Centre for English Language Teaching and Research.
Buttery, P. and McCarthy, M. J. (2011). Lexis in spoken discourse. In Gee, J. and Handford, M. (eds), Routledge Handbook of Discourse Analysis. Abingdon, Oxon: Routledge.
Capel, A. and Sharp, W. (2008). Objective First Certificate, 2nd edition Student’s Book. Cambridge: Cambridge University Press.
Carter, R. A. (1998). Orders of reality: CANCODE, communication and culture. ELTJ, 52: 43–56.
Carter, R. A. and McCarthy, M. J. (2006). Cambridge Grammar of English. Cambridge: Cambridge University Press.
Chambers, A. (2010). What is data-driven learning? In O’Keeffe, A. and McCarthy, M. J. (eds), The Routledge Handbook of Corpus Linguistics. Abingdon, Oxon: Routledge, pp. 345–58.
COBUILD. (1987). Collins COBUILD English Language Dictionary. London: Collins.
COBUILD. (1990). Collins COBUILD English Grammar. London: Collins.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2): 213–38.
Evison, J., McCarthy, M. J. and O’Keeffe, A. (2007). ‘Looking out for love and all the rest of it’: Vague category markers as shared social space. In Cutting, J. (ed.), Vague Language Explored. Basingstoke: Palgrave Macmillan, pp. 138–57.
Flowerdew, L. (2010). Using corpora for writing instruction. In O’Keeffe, A. and McCarthy, M. J. (eds), The Routledge Handbook of Corpus Linguistics. Abingdon, Oxon: Routledge, pp. 444–57.
Halliday, M. A. K. (1966). Lexis as a linguistic level. In Bazell, C. E., Catford, J. C., Halliday, M. A. K. and Robins, R. H. (eds), In Memory of J R Firth. London: Longman, pp. 148–62.
Handford, M. (2010). The Language of Business Meetings. Cambridge: Cambridge University Press.
Hyland, K. and Tse, P. (2007). Is there an ‘academic vocabulary’? TESOL Quarterly, 41(2): 235–54.
McCarten, J. and McCarthy, M. J. (2010). Bridging the gap between corpus and course book: The case of conversation strategies. In Mishan, F. and Chambers, A. (eds), Perspectives on Language Learning Materials Development. Oxford: Peter Lang, pp. 11–32.
McCarthy, M. J. (1998). Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press.
McCarthy, M. J. (1999). What constitutes a basic vocabulary for spoken communication? SELL (University of Valencia, Spain), 1: 233–49.
McCarthy, M. J. (2002). Good listenership made plain: British and American non-minimal response tokens in everyday conversation. In Reppen, R., Fitzmaurice, S. and Biber, D. (eds), Using Corpora to Explore Linguistic Variation. Amsterdam: John Benjamins, pp. 49–71.
McCarthy, M. J. (2010). Spoken fluency revisited. English Profile Journal, 1 (online at http://journals.cambridge.org/action/displayJournal?jidEPJ).
McCarthy, M. J. and O’Dell, F. (1999). English Vocabulary in Use. Elementary. Cambridge: Cambridge University Press.
McCarthy, M. J. and O’Dell, F. (2004). English Phrasal Verbs in Use. Intermediate Level. Cambridge: Cambridge University Press.
McCarthy, M. J. and O’Dell, F. (2005). English Collocations in Use. Cambridge: Cambridge University Press.
McCarthy, M. J. and O’Dell, F. (2008). Academic English in Use. Cambridge: Cambridge University Press.
McCarthy, M. J. and O’Keeffe, A. (2004). Research in the teaching of speaking. Annual Review of Applied Linguistics, 24: 26–43.
McCarthy, M. J. and O’Keeffe, A. (2011). Competencies explored and exposed: Grammar, lexis, communication and the notion of levels. In Eisenmann, M. and Summer, T. (eds), Basic Issues in EFL: Teaching and Learning. Heidelberg: Universitaetsverlag Winter.
McCarthy, M. J., McCarten, J., Clark, D. and Clark, R. (2009). Grammar for Business. Cambridge: Cambridge University Press.
McCarthy, M. J., McCarten, J. and Sandiford, H. (2005a). Touchstone Student’s Book 1. Cambridge: Cambridge University Press.
McCarthy, M. J., McCarten, J. and Sandiford, H. (2005b). Touchstone Student’s Book 2. Cambridge: Cambridge University Press.
McCarthy, M. J., McCarten, J. and Sandiford, H. (2006a). Touchstone Student’s Book 3. Cambridge: Cambridge University Press.
McCarthy, M. J., McCarten, J. and Sandiford, H. (2006b). Touchstone Student’s Book 4. Cambridge: Cambridge University Press.
Nattinger, J. and DeCarrico, J. (1992). Lexical Phrases and Language Teaching. Oxford: Oxford University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24(2): 223–42.
Nesselhauf, N. (2005). Collocations in a Learner Corpus. Amsterdam: John Benjamins.
O’Keeffe, A. and McCarthy, M. J. (2010). The Routledge Handbook of Corpus Linguistics. Abingdon, Oxon: Routledge.
O’Keeffe, A., McCarthy, M. J. and Carter, R. A. (2007). From Corpus to Classroom. Cambridge: Cambridge University Press.
Redston, C. and Cunningham, G. (2005). Face2face Elementary Student’s Book. Cambridge: Cambridge University Press.
Schmitt, D. and Schmitt, N. (2005). Focus on Vocabulary: Mastering the Academic Word List. White Plains, NY: Longman.
Schmitt, N. (ed.) (2004). Formulaic Sequences. Amsterdam: John Benjamins.
Scotton, C. M. and Bernsten, J. (1988). Natural conversations as a model for textbook dialogue. Applied Linguistics, 9(4): 372–84.
Sinclair, J. McH. (1966). Beginning the study of lexis. In Bazell, C. E., Catford, J. C., Halliday, M. A. K. and Robins, R. H. (eds), In Memory of J R Firth. London: Longman, pp. 410–30.
Sinclair, J. McH. (1991). Corpus, Concordance and Collocation. Oxford: Oxford University Press.
Sinclair, J. McH. (2003). Reading Concordances. London: Longman.
Sinclair, J. McH. (2004). Trust the Text: Language, Corpus and Discourse. London: Routledge.
Tao, H. (2003). Turn initiators in spoken English: A corpus-based approach to interaction and grammar. In Leistyna, P. and Meyer, C. F. (eds), Language and Computers. Corpus Analysis: Language Structure and Language Use. Amsterdam: Rodopi, pp. 187–207.
Walsh, S. (2006). Investigating Classroom Discourse. Abingdon, Oxon: Routledge.
Walsh, S. (2010). What features of spoken and written corpora can be exploited in creating language teaching materials and syllabuses? In O’Keeffe, A. and McCarthy, M. J. (eds), The Routledge Handbook of Corpus Linguistics. Abingdon, Oxon: Routledge, pp. 333–44.
Widdowson, H. G. (2000). On the limitations of linguistics applied. Applied Linguistics, 21(1): 3–25.
Willis, D. (1990). The Lexical Syllabus, A New Approach to Language Teaching. London and Glasgow: Collins.
Willis, D. and Willis, J. (1987–8). The Cobuild English Course. London and Glasgow: Collins.

Afterword: The Problems of Applied Linguistics

Susan Hunston

Applied Linguistics is often defined as the study of language placed at the service of solving real-world problems. While this definition (conveniently for an applied linguist) puts Applied Linguistics in the position of a superhero, energetically tackling humanity’s woes, it does also imply a rather negative view: the world is a sorry place, and applied linguists work endlessly to address an ever-increasing mountain of problems. This book illustrates the very wide range of issues that currently come under the heading of Applied Linguistics; as well as language teaching, translation and forensic applications, there are questions connected with English used in academic discourse and in business environments, with English used as a lingua franca and in the medium of texting, with English used to construe relationships in the media and to reflect on gendered identities. That is a lot of problems. However, the very variety suggests that, rather than to find a solution for every ill, the role of Applied Linguistics may be to expand opportunities for understanding and intervening in a world construed and reflected in language. Because this book is about Corpus Linguistics as well as Applied Linguistics, the world in it appears a little strange. In fact, it consists entirely of texts. People may fight, vote, drive cars and eat noodles; perhaps, more pertinently, they may think, learn and hold opinions. But to the corpus linguist, the only interesting thing that they do is produce texts: whether as written or spoken text or (as in Gu’s chapter) as visual images. But because this book is about Applied Linguistics as well as Corpus Linguistics, the studies do not stop at the description of those texts; rather, they move back towards the contexts of the texts’ production and consumption. In doing so, they locate the linguistic study firmly within a broader social context. Handford refers to linguistic items constituting ‘discursive, professional and/or social practices’. 
Hyland points out that theories about knowledge as socially constructed by academic discourse communities can be fleshed out by associating linguistic features with those communities and thus




explaining the mechanisms by which that knowledge is constructed. In these and other cases, the writers explicitly counter the view that Corpus Linguistics takes an impoverished, decontextualized view of texts. Laviosa, for example, points out that the translations in the corpora she describes have been produced as part of a ‘dynamic network of relations’. Chau’s work is embedded in a view of the social context of language learning that rejects the primacy of the native speaker. For Seidlhofer, the difficult task of compiling a corpus itself encourages the exploration of the concept of ELF: with no agreed sense of what English as a Lingua Franca consists of, selecting texts that represent it poses an epistemological problem; and with no agreed standard, issues such as spelling conventions in transcription raise ideological as well as practical questions. In the chapters of this book, there is a very honest reflection on the difficulties involved in collecting and investigating texts. Throughout the book, there is both an interaction and a tension between the demands of Corpus Linguistics and Applied Linguistics. This is seen clearly in the number of chapters that refer to the difficulties of collecting texts that will fulfil the demands of the corpus design as well as address Applied Linguistics questions. Olsson, for example, points out that the texts available to him in a forensic linguistics case were far from ideal from a Corpus Linguistics point of view. In a different context, Tagg points to the difficulty of collecting large quantities of text messages that are, by definition, both extremely short and very private. The specific nature of the corpus required in each study limits the availability of suitable material and therefore the size of the corpus. Handford, for example, in one study uses a corpus of only 13,000 words. Koester cites a corpus of 60,000 words.
For each of these researchers, the applied research question restricts the kind of text that can be sought, while the corpus research question demands more and more extensive texts or a different balance of texts than can easily be supplied. This restriction demands that we reconsider what is meant by ‘a corpus’: not necessarily a very large body of language but a body of language that is subjected to particular kinds of investigations. A corpus is defined not by what it contains but by how it is used. The nature of a corpus is called into question more radically by Gu, who asks how a collection of visual images might be processed to make it usable as a corpus. Doing so, he argues, involves re-presenting a non-linear image as a series of linear segments, each of which adds an annotation to the image. Turning once again to text corpora, it is noticeable in this book how much common ground there is in the methods of investigation. The use of word frequency lists, keyword lists, collocations, concordances and word


clusters is common in most of the chapters in this collection. Chau’s chapter points out how a fairly simple comparison of the clusters surrounding a single frequent word (save) can demonstrate the development of sophisticated language use in learners of different levels of competence. His decision not to compare the learner corpus with a native speaker reference is based on the views of second language acquisition rather than on the practicalities of corpus research: he agrees with Ortega (2010) that the target for learners is the language of learners at a more advanced level and not the language of native speakers. In most cases, however, a specialized corpus is compared with a more general one, in order to establish differentials in frequency. This is mentioned as a standard research procedure by Handford and Koester among others. O’Keeffe, however, problematizes the concept of keywords by pointing out that the question of what constitutes the general (‘reference’) corpus is crucial and is not to be taken for granted. She compares the keyword lists produced by her specialized corpus with a number of different reference corpora, and the differences in the lists so produced remind us that the choice of comparable corpora is something that requires a great deal of thought. Keywords are not absolute but are relative to the corpora chosen. Corpus comparison raises other issues. Both Hyland and Baker tackle an area where comparison between corpora carries huge social significance: the comparison of texts produced by women and by men. Both writers note that to compartmentalize ‘men’s language’ and ‘women’s language’ in order to differentiate between them is in itself highly controversial. This controversy usefully throws into question the whole practice of comparing corpora to identify areas of maximum difference. 
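The observation above that keywords are relative to the reference corpus can be illustrated with a small sketch. It uses log-likelihood, one commonly used keyness statistic; nothing here is claimed to be the procedure followed in O’Keeffe’s chapter, and the three miniature ‘corpora’, the function names and the cut-off of three words are all invented for the example.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness score: a, b = word frequency in the study and reference
    corpus; c, d = total tokens in the study and reference corpus."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    score = 0.0
    if a > 0:
        score += a * math.log(a / e1)
    if b > 0:
        score += b * math.log(b / e2)
    return 2 * score

def keywords(study, reference, top=3):
    """Words relatively more frequent in `study` than in `reference`,
    ranked by log-likelihood (ties broken alphabetically)."""
    s = Counter(study.lower().split())
    r = Counter(reference.lower().split())
    c, d = sum(s.values()), sum(r.values())
    scored = [(log_likelihood(a, r[w], c, d), w)
              for w, a in s.items() if a / c > r[w] / d]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top]]

# Invented miniature corpora, for illustration only.
study = "the caller said the show was great and the caller thanked the host"
reference_a = "the meeting was long and the manager said the budget was tight"
reference_b = "the caller asked the host and the host thanked the caller"

print(keywords(study, reference_a))
print(keywords(study, reference_b))
```

With these toy texts, caller is key when the study text is set against the first reference text but not against the second, where it is just as common. The same study corpus therefore yields a different keyword list for each reference corpus.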
Hyland, for example, talks of gender stereotypes as ways of exaggerating male and female differences, while Baker comments wryly that ‘humans tend to find difference fascinating’. Both writers make the point that, in many cases and certainly in the comparison of language produced by women and by men, differences are far outweighed by similarities. This is something that studies prioritizing differentials can be prone to overlook. Baker points out that differences are only a matter of ‘gradience’ rather than of binary choice. There are two sides of the methodological coin here. Calculations of relative frequency are so important precisely because they pick up the ‘gradience’ differences rather than the binary ones. When two individual texts are compared, actual differences may be masked because it is the relative frequency of a feature and not its absence or presence in any one text that construes distinctiveness. When whole corpora are compared, these subtle distinctions of frequency may be observed. On the other side




of the coin, focusing on difference alone may exaggerate its significance, as Baker says, and may replicate what happens when stereotypes are formed; taking this further and flipping the coin again, however, such studies may actually be used to explain how those stereotypes come about (Bang and Hunston, 2008). To take an example used by Baker, there may be a stereotype that, in English, men swear and women do not. Observing that some swearwords are somewhat more frequent (but only somewhat more frequent) in men’s language than in women’s language serves not only to counteract the stereotype as fact (women do swear, and they use some swearwords more than men do) but also to account for it as myth: the proportional differences in language encountered by individuals over time tend to be interpreted as a binary difference. Corpus studies, of course, move beyond keywords. In most cases, in fact, the calculation of keywords serves as the starting point for a more detailed study of individual words and phrases. Often, these are the entry point for studies that isolate text segments and proceed to investigate them as individual extended sequences of naturally occurring text. For McCarthy and McCarten, for example, quantitative information relating to word lists, relative frequency and collocation is used to isolate important phrases for further investigation, but the real interest in their study is how those phrases are used in the individual texts in which they occur. Similarly, Koester demonstrates how various researchers move from identifying a significantly frequent word or phrase to identifying the particular discourse context in which the word or phrase occurs and its pragmatic value in that place. The corpus search tools are used to seek out recurring sequences of text that turn out to be similar both in form – they contain the target phrase – and in function. 
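The contrast drawn above between binary and ‘gradience’ differences can be made concrete with a short sketch. The data are invented toy texts (each a list of 100 tokens with a filler token "w"), the group labels are arbitrary and the function names are assumptions made for the example; no real frequencies are implied.

```python
def presence(texts, word):
    """Binary view: proportion of texts in which `word` occurs at least once."""
    return sum(1 for tokens in texts if word in tokens) / len(texts)

def per_thousand(texts, word):
    """Gradient view: occurrences of `word` per 1,000 tokens across all texts."""
    total_tokens = sum(len(tokens) for tokens in texts)
    hits = sum(tokens.count(word) for tokens in texts)
    return 1000 * hits / total_tokens

# Invented toy data: two groups of two 100-token texts each.
group_a = [["damn"] * 2 + ["w"] * 98, ["damn"] + ["w"] * 99]
group_b = [["damn"] + ["w"] * 99, ["damn"] + ["w"] * 99]

for name, texts in (("A", group_a), ("B", group_b)):
    print(name, presence(texts, "damn"), per_thousand(texts, "damn"))
```

On the binary measure the two groups are identical, since every text contains the word; only the normalized rate separates them, and then only by degree. This is the sense in which a proportional difference can be misread as a categorical one.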
Seidlhofer notes that while formal features might be used as initial search terms, the connection between these and ‘what communication processes these forms might point us to’ is the subject of inference on the part of the researcher rather than of observation. This is a useful reminder of the interpretative work undertaken by the researcher; it also serves to warn us that statements on the lines of ‘this 3-word phrase is the fourth most frequent in this corpus’ and those on the lines of ‘this phrase is used to manage a movement between two topics’ are not the same kind of thing at all. It is known that identifying the function in some categories such as discourse markers is subjective and highly interpretative (e.g. Huang, 2010), but Seidlhofer extends this observation of imprecision to the interpretation of pragmatic function in general. The chapters mentioned above demonstrate one possible form of interaction between the corpus and the individual texts that make it up, but


others are exemplified in this book. Tagg, for example, acknowledges that to discover creative language use in her corpus of text messages she had to treat the corpus as a series of texts and simply read them one by one. Olsson’s approach to text and corpus is different, again: he treats the suspect texts he is working with as texts taken in isolation, but uses a general corpus of English to test out his hypotheses about what constitutes unusual language use in those texts. I suggested at the beginning of this Afterword that the studies in this book, among others, inspire a re-envisaging of Applied Linguistics as a discipline that expands our understanding of the world mediated through language as well as a discipline that solves problems. Although this is true, we should not overlook the practical applications of much of the research reported in this book. Flowerdew’s chapter, for example, focuses specifically on the way that corpus research is used in language teaching contexts, ranging over a large number of approaches and examples. She distinguishes between the influence of corpus-based language descriptions on syllabuses and materials and the use of corpora by learners themselves. Handford reports that studies of business discourse have been used both to describe target interactions and as the source of materials for teaching. Similarly, Laviosa notes a dual use of corpora in translator education: insights into the practice of translation derived from corpus studies of translated texts inform materials for training translators; and the trainees themselves may be invited to consult parallel or comparable corpora to develop their own expertise. McCarthy and McCarten focus on a more specific issue raised by attempting to use corpus-based materials in the classroom: how to represent naturally occurring spoken interaction in language teaching materials. 
They note that it is not so much the recognizable features of conversation – hesitations, false starts and so on – that cause the problem, but the difficulty of ‘overhearing’ a brief snatch of conversation when the previous hours, maybe years, of interaction between the participants is crucial to understanding. The teaching materials must somehow mediate between that ‘overhearing’ and the experience of the learner. In one sense, this problem appears not to be related to corpus research as such but to the use of any naturally occurring examples in teaching materials. On the other hand, it is the investigation of the patterning of interaction through corpora that reveals the importance of recurring words and phrases in managing not just the local organization of an individual conversation but a much broader sweep of interaction and the relationships it construes. It therefore provides a warrant for including such texts in classroom materials.
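The recurring phrases referred to here are typically retrieved as ranked clusters, the kind of output that supports a statement such as ‘this 3-word phrase is the fourth most frequent in this corpus’. A minimal sketch follows; the function name ranked_clusters and the sample utterance are invented, and the ranking says nothing about what a cluster does in interaction.

```python
from collections import Counter

def ranked_clusters(tokens, n=3, containing=None):
    """Rank n-word clusters by frequency, optionally keeping only
    those that contain a given word."""
    # Build overlapping n-word sequences by zipping shifted copies of the list.
    grams = Counter(zip(*(tokens[i:] for i in range(n))))
    if containing is not None:
        grams = Counter({g: c for g, c in grams.items() if containing in g})
    return grams.most_common()

# Invented utterance, not corpus data.
tokens = "you know what i mean i mean you know what i mean".split()

for rank, (cluster, count) in enumerate(
        ranked_clusters(tokens, containing="mean"), start=1):
    print(rank, " ".join(cluster), count)
```

The list is only a starting point: deciding that a cluster manages topic, signals listenership or closes a conversation still requires reading it in the texts where it occurs.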




Ten years ago, I published a book entitled Corpora in Applied Linguistics (Hunston, 2002). Corpus Linguistics had already been around a long time, but it was still the case, I think, that there were some Applied Linguists for whom it was a relatively novel concept. It was certainly possible to debate at the time whether the use of corpora was a useful or a disruptive adjunct to research in Applied Linguistics. What is remarkable in the intervening decade is the extent to which corpus investigation methods have ‘infiltrated’ Applied Linguistics, to the extent that papers at Applied Linguistics conferences casually mention corpus research as an integral part of their methodology without either justification or extensive explanation. Corpus Linguistics itself has entered into a debate about its own identity and its relationship with other views of what language is and how it can be studied (see the papers in Worlock Pope, 2010). This book is valuable in maintaining the distinction between Applied Linguistics and Corpus Linguistics, thereby usefully holding in balance the two perspectives on texts. For Applied Linguistics, each individual text is a site of the negotiation of relationships, power, meaning and influence. For Corpus Linguistics, each individual text is a small part of a much larger pattern, with the pattern observable only when all texts are put together and subjected to manipulation through corpus investigation software. However, those patterns themselves are then fed back into the interpretation of single texts, focusing attention on features that would otherwise pass unnoticed. As the chapters in this book demonstrate, it is important to hold the two perspectives separate so that one can be seen to inform the other.

References

Bang, M. and Hunston, S. (2008). The corpus and the stereotype: A research tale. Paper presented at the 41st Annual Meeting of the British Association for Applied Linguistics. Swansea University, Swansea.
Huang, L. (2010). Localisation of English: Discourse markers oh and I think in a non-native variety of English. Paper presented at the 43rd Annual Meeting of the British Association for Applied Linguistics. University of Aberdeen, Aberdeen.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Ortega, L. (2010). The bilingual turn in SLA. Plenary delivered at the Annual Conference of the American Association for Applied Linguistics, Atlanta, GA.
Worlock Pope, C. (ed.) (2010). ‘The Bootcamp Discourse and Beyond’. Special issue of the International Journal of Corpus Linguistics, 15(3).


Subject Index

academic discourse,
  cultural variations  34
  disciplinary variation  35–6
  interpersonal features  36–8
  persuasive genres  31–2
  written vs. spoken language  32–3
Academic Word List (AWL),
  vocabulary items  210
  word frequency lists  227
Anglicisms  73, 76
autonomous learning,
  corpus-based pedagogy  208
  strategy training, corpus consultation  219
BEC see Business English Corpus (BEC)
bilingualism  194
British Sign Language Learner Corpus  203
Business English Corpus (BEC),
  keyword lists  52–3
  keywords  60
  shared knowledge  51
Cambridge and Nottingham Business English Corpus (CANBEC),
  business  48
  chunks  55–6
  deontic modals  62n. 6
  frequency lists  49
  Handford’s study  16–18
  HK data  21
  vs. HKCSE  55, 60
  keyword lists  228
  keywords  50–2
  knowledge, logistics managers  15
  listening task  26
Cambridge English Profile Corpus  203

Cambridge Learner Corpus (CLC),
  collocation statistics  229
  word frequency lists  227
CANBEC see Cambridge and Nottingham Business English Corpus (CANBEC)
chunks  5, 59–60
  CANBEC  54–6
  conversation  237
  HKCSE-bus  55–6
  keywords  50
  lists  230
  organizational discourse markers  233
  segmentation  167
classroom applications,
  genre-based perspective  213–16
  lexical items  209–10
  notional-functional syllabus  212–13
  vocabulary approaches  210–12
clusters  17
  categorization  22
  discourse marking and interactional functions  54
  ELF  139
  HK data  20–2
  phraseological competence  202
  recurring chunks  230
  text corpora  244
collocations  35, 53, 107, 211
  business  78–80
  calculating ways  109–10
  concordance searches  57
  congruence  193
  corpus analytical methods  47
  corpus-based pedagogy  208
  creative lexis  71
  ELF users  143
  gender discourse  109–11


  identified methods  77–8
  Italian trainee translators  74
  keywords  245
  pragmatic functions  57
  quantitative information  245
  special inferential frameworks  60
  statistics  229
common-core hypothesis,
  vocabulary  210–11
  word frequency lists  227
comparable corpus data  196
comparative fallacy  195
complex/complexities  7, 40, 42
  abbreviated phrases  154
  code-switching, text messages  162
  ELF  142
  learner corpora  200
  lectures and university textbooks  32
  legal discourse  214
  L2 user  195
  professional contexts  23
  sequences  202
  taggers  103
  written academic texts  33
conceptual metaphor  217–18
constructivist learning  218
Contrastive Interlanguage Analysis (CIA)  192
conversation management  238
conversational data  230–1
corpus-based research  62n. 3, 102
  academic discourse  30, 34
  COBUILD dictionary team  225
  ELF  138, 142
  language teaching contexts  246
  learner language  194
  pedagogy,
    constructivist learning  218
    field-dependent vs. field-independent learners  220
    inductive approach  208
    instructional materials  212–13
    legal discourse  214
    lexis  211
    noticing hypothesis  216
    notional-functional syllabus  212
    phraseological approach  208

    strategy training, corpus consultation  219
    translation studies  74
    vocabulary  210–11
  spoken business English (SBE)  59
  text messaging  7, 150, 157
    categorization  152
    mobile technology  153
    respellings  155
  translation studies  67
  word frequency lists  227
corpus compilation  238
corpus consultation  219
corpus-driven approach,
  frequent words  102
  legal discourse  214
  workplace and business language  47, 59, 62
data-driven learning (DDL)  8, 225
  language learning theories  216–18
  translation  74
deficit version  195
delexical verbs  227, 229
descriptive translation studies,
  Anglicisms  73
  culture-specific phenomena  72
  universals,
    explicitation  70–1
    normalization  71–2
    simplification  70
discourse intonation model  57
discourse markers  52, 233–4, 245
discourse prosody  114
documentary photograph corpus (DPC),
  alphanumeric text vs. image text  168
  human ecology  180
  image processing,
    computer  171–2
    human brain  169–71
    human–machine partnership  174–5
    image retrieval  172–3
  segmentation and annotation,
    annotation schemes and MPEG-7 standards  178–80
    degree of granularity  176
    GUI design  177–8




    image retrieval  176–7
    knowledge base, ontology and OWL  180–3
    MINDSWAP Research Lab  185
    ontology  183–5
    purpose  175–6
dynamism/dynamicity  194
ELF see English as a lingua franca (ELF)
emergent concerns,
  corpus-informed language teaching  225–6
  learner language  194, 201, 204
English as a Lingua Franca (ELF),
  applied linguistic implications  144–6
  data availability  136–8
  forms and functions  142–4
  linguistic varieties  136
  methodological challenges  140–2
  VOICE and ELFA  138–40
English as a Lingua Franca in Academic Settings (ELFA)  138–40
forensic linguistics,
  anonymous texts  86–8
  authorship, styles  85
  identical expressions  91–2
  known texts  86, 88–91
  questioned texts  91
French Learner Language Oral Corpora  202
frequency lists,
  browsing modes, pattern-hunting  217
  CANBEC  49–50
  common core  227
  conversation  231, 233–4, 237
  lexical approach  209
  media discourse  122–3
  text corpora  243
  translation  74
  vocabulary syllabus  227
functional analysis,
  phraseological competence  199
  translation  73
gender,
  book reviews  38–44
  usage of language  100–4

gender differences,
  boosters  40
  disciplines  42–3
  potential criticism  113
  representation  104
  usage of language  102
gendered discourse, metrosexual,
  collocation  109–11
  concordances  111–13
  keywords  107–9
  potential problems  113–14
  representation  104–6
genre-based approach,
  noticing hypothesis  220
  pedagogic applications  213–16
HK data see Hong Kong (HK) data
Hong Kong Corpus of Spoken English (HKCSE)
  vs. CANBEC  55–6
  chunks  55
  pragmatic functions  56
  special and particular constraints  60
  special inferential frameworks  60
  workplace genres  48
Hong Kong (HK) data  15
  vs. CANBEC  20
  construction-industry discourse  19
  frequent clusters  22
  vs. SOCINT  20
independent language system  7, 194–6
individual variation  203
inductive-deductive approach,
  apprenticeship training  217
  corpus-based pedagogy  68, 208
  corpus consultation  219
  data-driven learning  215
  grammar guide  220
  specialized genre  214
intention concept  94–7
internal comparisons  196
interpretation of data  139
keyword lists,
  BEC  52–3


  comparison  244
  keyword analysis  228
  place deictics  21
  text corpora  243
keywords  15, 59, 102, 110
  BEC  60
  Beginning Learner corpus  198
  BNC  14
  calculation  245
  concept  244
  fuzziness  177
  gender discourse  107–9
  HK vs. CANBEC  21
  HK vs. SOCINT  20
  media discourse  119–22, 131
  metrosexual corpus  108
  WordSmith  113
  workplace discourse  50–3
language learning,
  COBUILD dictionary team  225
  DDL  216–18
  ELF corpora  145
  learners  203–4
  phraseological approach  208–9
  social context  243
  translation  81
  vocabulary syllabus  227
learner corpus research,
  challenges  202–4
  nature of learner language  192–3
  phraseological competence  197–202
  research implications  196–7
  role of first language  193–4
  SLA,
    bilingualism and multilingualism  195–6
    dynamism  194–5
    learner vs. user  195
legal language studies,
  concept of mens rea  94–7
  courtroom discourse  93–4
  jury instruction  93
lexical bundles,
  pedagogic relevance  211
  recurrent strings  54, 230
  WordSmith  102

Limerick-Belfast corpus of Academic Spoken English (LIBEL)  120 Limerick Corpus of Irish English (LCIE)  121 listenership, health professionals  58 NHS direct phone calls  61 speaking syllabus  233, 236–7 turn-openers  232 longitudinal approach, learner language, vs. cross-sectional approach  196 dynamic process  195 pseudo-longitudinal study  197 Longitudinal Database of Learner English  202 media discourse, concordances  123–4 Critical Discourse Analysis (CDA)  130–1 interaction types, spoken media  118 key words  119–22 newspapers  117 question types, interviews  129 spoken vs. written discourse  117 word frequency lists  122–3 modelling conversation, listenership  236–7 management  237–8 organizational discourse markers  233–4 other speakers, taking account of  234–6 multilingualism, Anglicisms  73 learner corpus research  194 text messaging  153 Multimedia Adult English Learner Corpus  203 multimodal approach, functional–notional tradition  213 image  176 pedagogic processing  221 text messaging  162 mutual information (MI)  110 noticing hypothesis, L2 acquisition  216 phraseological approach  220 notional-functional syllabus  212–13




oral examinations  227, 229 pedagogic mediation, conversational data, frequency lists  231–2 turn-opening items  232 and description, Anglicisms  76 simplification and explicitation  75–6 Unique Item Hypothesis  75 inductive approach  215, 221 phraseological approaches, corpus-based pedagogy  208–9 corpus-driven learning  214 instruction  8 noticing hypothesis  220 translator training  75 phraseological competence  197, 202 professional communication, analyzing methodology  14–19 construction-industry study  19–23 spoken professional discourse  14 teaching and training materials  23–6 written texts  13–14 report writing module  219–20 representation  125 cultural variations, academic writing  34 gendered discourses  6, 100, 104–6 legal language  94 metrosexuals  112–14 ontology  181 respellings  156 restructuring  194–5, 202 second language (L2) acquisition/ development/learning  201 bilingualism and multilingualism  195–6 dynamic view  194 dynamism  194–5 individual variation  203 vs. L1 acquisition  195 L1 role  196 learner corpus initiatives  202–3 longitudinal research  197 noticing hypothesis  216 phraseological competence  202 traditional classroom  204


segmentation and annotation, DPC annotation schemes and MPEG-7 standards  178–80 degree of granularity  176 GUI design  177–8 image retrieval  176–7 MINDSWAP Research Lab  185 ontology, integration  185 knowledge base and OWL  180–3 sample coding  181–3 skeleton diagram, sports  184 purpose  175–6 semantic domains  102–3 sexism  101, 106 shared knowledge, BEC  51 discourse intonation model  57 speaker–listener interaction  235 social contexts  18, 242–3 social worlds of learners  7, 204 sociocultural theory  8, 218, 220 speaking syllabus  231–3 stereotype, bachelor and spinster  105 frequent words  102 gender, in book reviews  38 men’s and women’s languages  244–5 sub-technical vocabulary  211 tagging  217 common forms  103 ELF  142 lexical bundles  102 segmentation and annotation  167 text messaging, categorization, data collections  151–2 challenges  160–2 computer mediated discourse  151 forensic linguists  153 language play  159–60 mobile technology  153 multilingual nature  158 respelling, brevity and speed  154–5 categories  156 local and global practices  156–7


top-down/bottom-up approach  212, 214 translator training  74–5 turn-openers  232 Unique Item Hypothesis  75 vague language, chunks lists  235 interactive and interpersonal features  60 pragmatic features, workplace study  58 vague category marker (VCM)  235–6

Vienna-Oxford International Corpus of English (VOICE)  138–40 workplace discourse, chunks  53–7 constraints  60 goal orientation  59–60 lexical features, frequent words  49–50 keywords  50–3 pragmatics features  57–9 special inferential frameworks  60

Author Index

Abd. Samad, A.  197 Adolphs, S.  57–8, 61 Aldrich, R.  106 Al-Khatib, M.  153 Allen, M.  101 Anderman, G.  81 Andersen, P. A.  101 Anis, J.  158, 164–5 Archibald, A.  151 Aries, E.  101 Aston, G.  221 Atkins, S.  61 Baer, B. J.  75, 81 Baker, M.  67, 69–70, 72 Baker, P.  15–16, 105, 114 Bang, M.  114 Bargiela-Chiappini, F.  14, 23 Barker, F.  105 Barkhuizen, G.  199 Barlow, M.  193 Baron, A.  151, 155–6, 162–3 Baron, N.  151 Barrett, T.  169 Basturkmen, H.  31 Beck, A. T.  101 Bernardini, S.  217 Bernsten, J.  226 Bhatia, V.  13, 18–19, 47 Bhatia, V. K.  13, 18–19, 47, 214 Biber, D.  30, 32, 54, 230 Björkman, B.  143 Bley-Vroman, R.  195 Bloch, J.  212 Block, D.  204 Bloor, M.  210 Bloor, T.  210 Blum-Kulka, S.  70

Böhringer, H.  144 Bondi, M.  36 Bongers, H.  135 Boujemaa, N.  183, 185 Boulton, A.  221 Bowker, L.  74–5 Brashers, D. E.  101 Brazil, D.  57 Breiteneder, A.  141–3 Brinton, I.  84 Brkinjac, T.  144 Brown, B.  57–8, 61 Brown, P.  234 Bruce, I.  31 Brumfit, C. J.  146 Bublitz, W.  236 Burns, A.  226 Buttery, P.  227, 231 Byrnes, H.  196–7 Cameron, L.  24 Campbell, S.  24 Canary, D. J.  101 Candlin, C.  13, 17, 212, 214 Capel, A.  227 Carter, R.  159–60, 218 Carter, R. A.  50, 54, 57–8, 218, 225, 227, 232, 236 Chambers, A.  225 Chapman, J.  171 Chapman, M.  86, 171 Charles, D.  56 Charles, M.  17, 26, 56, 112 Chau, M. H.  204, 218 Chen, Q.  210 Chen, T.  153 Cheng, W.  14, 47, 57 Chiariglione, L.  178


Christie, F.  13 Churchill, D.  61 Clark, D.  14, 228, 230 Clark, R.  14, 228, 230 Coad, D.  106 Cobb, T.  192–3, 218 Coccetta, F.  213 Cogo, A.  143–4, 151 Connor, U.  13–14, 31 Conrad, S.  54, 203, 230 Cook, G.  141 Cook, V.  195 Corder, S. P.  195 Cortes, V.  32 Cotterill, J.  94 Coulthard, R. M.  5, 84 Coxhead, A.  210, 227 Crawford, P.  57–8, 61 Crucianu, M.  183, 185 Crystal, D.  150 Cuena, J.  181 Cunningham, G.  227

Danet, B.  162 Datta, R.  172 Davida, G. I.  86 de Bot, K.  194 De Cock, S.  54, 192 DeCarrico, J.  230 Demazeau, Y.  181 Dewey, M.  143–4 Diani, G.  37 Dickinson, S. J.  168 Dindia, K.  101 Djeraba, C.  168 Dorn, N.  143 Dörnyei, Z.  203 Drew, P.  14, 47, 59 Dueñas, P. M.  34 Dumas, B.  93 Durrant, P.  214 Dürscheid, C.  150, 152

Eakins, B.  101 Eakins, R.  101 Eldridge, M.  154–5 Ellis, N.  210–11 Ellis, N. C.  204 Ellis, R.  199 Evison, J.  235

Fairclough, N.  117 Fairon, C.  152–3 Farr, F.  57, 59, 121 Faucett, L.  135 Fauth, J.  103 Fellmann, J.  179 Ferecatu, M.  183, 185 Finnegan, E.  54, 230 Firth, A.  4, 26 Fishman, P. M.  101 Flickner, M.  168, 173 Flowerdew, J.  209–10, 219–20 Flowerdew, L.  14, 48, 194, 208, 210, 219–21, 225 Forsyth, D. A.  186 Franklin, P.  15 Frey, O.  38 Frishman, L.  169 Fuchs, M. Ž.  153

Gabrielatos, C.  114 Gavioli, L.  74, 221 Ge, C.  210 Gee, J. P.  16–17, 19 Geertz, C.  35, 141 Getis, A.  179 Getis, J.  179 Gibbins, N.  181 Gillaerts, P.  38 Gilquin, G.  193 Gollin, S.  226 Görlach, M.  77 Granger, S.  144, 191–4 Grant, T.  152 Gray, B.  32 Greenwood, P.  151, 162–3 Grinter, R. E.  154–5 Grob, L.  101 Gros, P.  185 Guido, M. G.  146 Gundacker, J.  144 Gupta, A.  172

Hafner, C.  214 Halliday, M. A. K.  87, 90, 199, 229

Handford, M.  14–17, 19–20, 23–4, 26, 47–50, 52–5, 59, 61–2, 213–14, 228 Hård af Segerstad, Y.  151–2, 156 Harrington, K.  114 Harvey, K.  14, 61 Hasan, R.  87 Hassan, F.  197 Hause, K. S.  101 Heaton, J. B.  194 Heffer, C.  93 Heritage, J.  13–14, 47, 59–60 Hermans, T.  68 Herring, S.  151, 162 Herring, S. C.  163 Hewson, L.  146 Hinkel, E.  34, 194 Hodges, M.  102, 113 Hoey, M.  23 Holmes, J.  13, 47 Honeycutt, C.  163 House, J.  143 Housen, A.  193 How, Y.  153 Howarth, P.  194 Huang, L.  245 Huebner, T.  195 Hull, G.  19 Hülmbauer, C.  143 Hunston, S.  114, 208, 247 Hutchby, I.  117, 153 Hyde, J.  101 Hyland, K.  13, 30–3, 35–8, 42, 210–11, 227 Hynninen, N.  221 Jain, R.  172 Jenkins, J.  138, 144 Johannson, S.  54, 230 Johansson, S.  81, 215 Johns, T.  208–9, 216 Johnson, A.  93 Johnson, F.  101 Jones, C.  208–9 Jones, M.  211 Jonker, W.  172 Joshi, D.  172 Joyce, H.  226

Kamarudin, G.  197 Kan, M. Y.  153 Kasesniemi, E.  152 Kaur, J.  144 Kaur, S.  103 Kennedy, C.  217 Kenny, D.  72 Kerry, T.  204 Khosravinik, M.  114 Kim, H. G.  178 Kim, S.  217 Kirkpatrick, A.  140 Kirsch, G.  38 Klimpfinger, T.  141, 143 Koby, G. S.  75, 81 Koester, A.  14, 17, 24, 26, 59 Kohn, K.  218 Kordon, K.  144 Kortmann, B.  136 Kosch, H.  178 Kosslyn, S. M.  169 Krzyzanowski, M.  114 Kujamäki, P.  72, 75 Labov, W.  139 Lakoff, R.  100 Langton, N.  13, 18–19, 47, 214 Lankshear, C.  19 Larsen-Freeman, D.  194, 204 Laviosa-Braithwaite, S.  70 Lee, D.  214–16 Lee, H. C.  216 Leech, G.  54, 102, 113, 192, 219, 230 Leonardis, A.  168 Lesch, C. L.  101 Levenston, E. A.  70 Levine, M. W.  169 Levinson, S.  234 Li, J.  172 Li, X.  168 Lichtkoppler, J.  144 Ling, R.  151 Lisboa, M.  14, 26 Loi, C. K.  34 Lowie, W.  194 Lu, G.  172 Lu, X.  203 Lung, J.  13, 18–19, 47, 214


Macfarlane, A.  61 Majewski, S.  141 Malczewski, B.  33 Malveira Orfanó, B.  121 Maragos, P.  185 Martin, J.  13 Mathews-Aydınlı, J.  214 Matous, P.  15, 19 Mauranen, A.  72–3, 81, 138, 143–4, 221 Mayer, R. E.  186 Maynard, C.  210–11 McCarten, J.  14, 228, 230–1, 234, 236–8 McCarthy, M.  14, 16, 49, 54, 59, 218 McCarthy, M. J.  50, 54, 218, 225–32, 235–8 McEnery, T.  100, 114, 192 McNamara, T.  145 McPherson, A.  61 Md. Rashid, J.  197 Meyers, R. A.  101 Miceli, T.  217 Milton, J.  191, 219 Mitchell, R.  204 Molino, A.  34 Moreau, N.  178 Morell, T.  36 Moreno, A.  34 Motta, E.  181 Mudraya, O.  211 Mukundan, J.  197 Mullany, L.  61 Munday, J.  68, 73 Murphy, B.  121 Myles, F.  203 Nattinger, J.  230 Nelson, M.  14, 47–8, 52–3, 60 Nesselhauf, N.  193, 229 Nickerson, C.  14, 23 Nonaka, I.  23 Norvig, P.  181 O’Dell, F.  227 O’Halloran, K.  131 O’Keeffe, A.  48–50, 54, 118–21, 129, 208, 218, 225–7, 229–30, 232, 235–6

Oakey, D.  54 Ochs, E.  15 Olsson, J.  87 Ortega, L.  191, 194–7, 244 Osherson, D. N.  169 Ozturk, I.  31 Pallotti, G.  198 Palmer, H. E.  135 Paquot, M.  193, 198, 211 Parker, I.  104 Paumier, S.  152–3 Pearce, M.  105 Pearson, S. S.  101 Petch-Tyson, S.  192 Pitt, A.  14, 26 Pitzl, M. L.  141, 143–4 Planken, B.  14, 23 Poff, M.  153 Pölzl, U.  144 Pompper, D.  6, 106, 113 Ponce, J.  186 Poncini, G.  14 Poos, D.  33 Potamianos, A.  185 Preston, D.  156 Prodromou, L.  24 Pullin-Stark, P.  144 Rampton, M. B. H.  196 Ranta, E.  138, 143, 221 Rashid, A.  151, 162–3 Rashid, J.  197 Rautiainen, P.  152 Rayson, P.  102, 113, 151, 155–6, 162–3 Redston, C.  227 Reinhardt, J.  32 Reisigl, M.  114 Rennhard, M.  86 Renouf, A.  210 Reppen, R.  33 Ringbom, H.  192 Roberts, C.  14, 24 Rogers, M.  81 Römer, U.  7, 145, 209 Rundell, M.  194 Russell, S. J.  181




Sabbah, E. H.  153 Sahota, O.  57–8 Sandiford, H.  234, 236–7 Santini, S.  172 Sarangi, S.  14 Scarpa, F.  76 Schiele, B.  168 Schiffrin, D.  56 Schmid, H. J.  102–3 Schmied, J.  218 Schmitt, D.  227 Schmitt, N.  54, 203, 211, 227, 230 Schneider, E. W.  136 Scott, M.  15, 50, 119 Scotton, C. M.  226 Sealey, A.  216 Seargeant, P.  62 Sebba, M.  7, 153 Seidler, J.  101 Seidlhofer, B.  138–9, 143–5 Selinker, L.  195 Serrano, A. G.  181 Shadbolt, N.  181 Sharp, W.  227 Sheldon, E.  34 Sikora, T.  178 Silver, M.  36 Simpson, M.  106 Simpson, R.  33, 54 Simpson-Vlach, R.  210–11 Sinclair, J.  90, 123, 161, 230 Sinclair, J. McH.  123, 161, 209–10, 229–30 Smeulders, A. W. M.  172 Solso, R. L.  169 Spencer-Oatey, H.  15 Spender, D.  101 Stark, E.  150, 152 Stewart, D.  75 Stubbe, M.  47 Stubbs, M.  17, 23, 111, 220 Stutt, A.  181 Suárez, L.  34 Sunderland, J.  104–5 Swain, M.  216 Swales, J.  30, 33–4, 214–16 Swanston, M. T.  169 Syd. Abd. Rahman, S. Z.  197


Tagg, C.  62, 151–2, 155–6, 159, 161–3 Tanna, V.  153 Tannen, D.  101 Tao, C.  168, 232 Tarr, M. J.  168 Thelwall, M.  104 Thompson, A.  32, 35, 144 Thompson, P.  32, 35, 216 Thompson, S.  38 Thorndike, E. L.  135 Thuraisingham, B.  181 Thurlow, C.  150–4 Thurstun, J.  212 Tiersma, P.  93 Tognini-Bonelli, E.  62 Tono, Y.  203 Toury, G.  68, 73, 80 Treur, J.  181 Tribble, C.  17, 208–9 Troemel-Plotz, S.  101 Tsang, E. S. C.  191 Tse, P.  32, 35, 37–8, 210, 227 Turton, N. D.  194 Upton, T.  13–14, 31 Van de Velde, F.  38 VanPatten, B.  192 Verspoor, M.  194 Vethamani, M. E.  197 Vuković, N. T.  153 Wade, N. J.  169 Walkerdine, J.  151, 162–3 Walsh, S.  225, 228 Walton, C. D.  181 Wang, J. Z.  172 Wang, L.  216 Warren, M.  47, 56 Watterson, M.  144 Webber, P.  36–7 West, C.  101 West, M.  135 Widdowson, H.  17 Widdowson, H. G.  136, 138, 211, 215, 220 Widmann, J.  218 Wilkins, B. M.  101

Wilkins, D.  212 Williams, J.  192 Willis, D.  209–10, 225 Willis, J.  209, 225 Wilson, A.  100 Winston, L.  101 Wodak, R.  114 Worlock Pope, C.  247 Worring, M.  172 Wray, A.  54

Xiao, R.  81 Xu, D.  168 Yttri, B.  151 Zanettin, F.  74 Zhang, Y. J.  172, 183 Ziai, R.  218 Zimmerman, D.  101


