Corpora for Language Learning 9781040009277, 9781032537214, 9781032537221, 9781003413301


English · [288] pages · 2024


Table of contents :
Cover
Half Title
Endorsements
Title
Copyright
Dedication
Contents
List of figures
List of contributors
1 Bridging the research-practice divide for data-driven learning: introduction to the volume
2 Addressing the challenges of data-driven learning through corpus tool design – in conversation with Laurence Anthony
3 Exploring multimodal corpora in the classroom from a multidimensional perspective
4 Data-driven learning: in conversation with Alex Boulton
5 Data-driven learning, second language acquisition and languages other than English: connecting the dots
6 DDL support for academic English collocations
7 Extensive reading and viewing for vocabulary growth: corpus insights
8 Transforming language teaching with corpora – bridging research and practice
9 Practical use of corpora in teaching argumentation
10 On the perils of linguistically opaque measures and methods: toward increased transparency and linguistic interpretability
11 Why we need open science and open education to bridge the corpus research–practice gap
12 Corpora in L2 vocabulary assessment
13 Corpus-based language pedagogy for pre-service and in-service teachers: theory, practice, and research
14 The applications of data-driven learning, corpus-based phraseology and phraseography in the development of collocational and phraseological competence
15 Data-driven learning in informal contexts? Embracing broad data-driven learning (BDDL) research
16 Using a data-driven learning approach to teach English for Academic Purposes and improve higher education internationalisation
17 Exploring the interface between corpus linguistics and English for Academic Purposes: what, how and why?
Index


CORPORA FOR LANGUAGE LEARNING

This volume presents a diverse range of expertise and practical advice on corpus-assisted language learning, bridging the gap between corpus research and actual classroom practice. Grounded in expert discussions and interviews, the book offers an extensive exploration into the intricacies of corpus-based language pedagogy, addressing its challenges, benefits, and potential drawbacks while demonstrating the power of data-driven learning (DDL) tools, including AntConc, WordSmith Tools, and CorpusMate. The book navigates the complexities of integrating DDL into mainstream educational systems, showcasing real-world applications for teaching. The authors bring together cutting-edge, international perspectives on this topic in dialogue with those using such techniques in their classroom practice. Both a rigorous academic resource and a hands-on guide for practitioners, this book is recommended reading for educators, researchers, or anyone wanting to upskill themselves in learning to harness the power of data for language pedagogy in primary, secondary, tertiary, or other professional contexts.

Peter Crosthwaite is Associate Professor of Applied Linguistics at the University of Queensland, Australia. He is an expert in corpus linguistics, data-driven learning, and computer-assisted language learning. He has published more than 50 articles in top journals and is the editor-in-chief of Australian Review of Applied Linguistics (from 2024).

“This book is a treasure trove of perspectives and purposeful chapters. Whether you are interested in researching data-driven learning or in using it in your classroom, this book will intrigue you. Not only does it curate a range of leading contributors in a volume that addresses research and practice, it goes far beyond by connecting the voice of the teacher, the researcher and the learner in the form of reflective discussions linked to each chapter.”
Anne O’Keeffe, Mary Immaculate College, University of Limerick, Ireland

“Corpora for Language Learning: Bridging the Research-Practice Divide represents a major new step in research on the use of corpus data in language learning and teaching, and I am happy to endorse it for publication. After the initial excitement generated by early publications in the 1980s, it soon became clear that the applications of real-life language data in the language classroom were largely limited to learners in higher education who were fortunate to have researchers in this area as their teachers. This book aims to change that narrow perspective, and to broaden this area of research to include voices from all stakeholders involved. The authors focus on enabling teachers and learners to have direct access to corpora and the software necessary to analyse them without substantial training, making it easy for teachers to integrate them in their classes and for learners to use them both inside and outside the classroom. This book thus has the potential to appeal not only to researchers in this field but also to those involved in language teacher education and to teachers. It is very likely to be widely cited as an important turning point in DDL research.”
Angela Chambers, Professor Emerita of Applied Languages, University of Limerick, Ireland

CORPORA FOR LANGUAGE LEARNING Bridging the Research-Practice Divide

Edited by Peter Crosthwaite

Designed cover image: Getty

First published 2024
by Routledge
4 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

and by Routledge
605 Third Avenue, New York, NY 10158

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2024 selection and editorial matter, Peter Crosthwaite; individual chapters, the contributors

The right of Peter Crosthwaite to be identified as the author of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-1-032-53721-4 (hbk)
ISBN: 978-1-032-53722-1 (pbk)
ISBN: 978-1-003-41330-1 (ebk)

DOI: 10.4324/9781003413301

Typeset in Times New Roman by Apex CoVantage, LLC

This book is dedicated to all the presenters and attendees of the “International perspectives on corpus technology for language learning” seminar series held online at the height of COVID-19. In particular, I wish to thank Dr. Paula Pinto and Dr. Adriane Orenha-Ottaiano of São Paulo State University (UNESP), Brazil, for choosing to work with me on applying for the grant that made this all possible, and for their continued support throughout the seminar series and beyond.

CONTENTS

List of figures  x
List of contributors  xiii

1 Bridging the research-practice divide for data-driven learning: introduction to the volume
Peter Crosthwaite  1

2 Addressing the challenges of data-driven learning through corpus tool design – in conversation with Laurence Anthony
Laurence Anthony  9

3 Exploring multimodal corpora in the classroom from a multidimensional perspective
Tony Berber Sardinha  25

4 Data-driven learning: in conversation with Alex Boulton
Alex Boulton  43

5 Data-driven learning, second language acquisition and languages other than English: connecting the dots
Luciana Forti  59

6 DDL support for academic English collocations
Ana Frankenberg-Garcia  71

7 Extensive reading and viewing for vocabulary growth: corpus insights
Clarence Green  84

8 Transforming language teaching with corpora – bridging research and practice
Reka R. Jablonkai  93

9 Practical use of corpora in teaching argumentation
Tatyana Karpenko-Seccombe  111

10 On the perils of linguistically opaque measures and methods: toward increased transparency and linguistic interpretability
Tove Larsson and Douglas Biber  131

11 Why we need open science and open education to bridge the corpus research–practice gap
Elen Le Foll  142

12 Corpora in L2 vocabulary assessment
Agnieszka Leńko-Szymańska  157

13 Corpus-based language pedagogy for pre-service and in-service teachers: theory, practice, and research
Qing Ma  174

14 The applications of data-driven learning, corpus-based phraseology and phraseography in the development of collocational and phraseological competence
Adriane Orenha-Ottaiano  187

15 Data-driven learning in informal contexts? Embracing broad data-driven learning (BDDL) research
Pascual Pérez-Paredes  211

16 Using a data-driven learning approach to teach English for Academic Purposes and improve higher education internationalisation
Paula Tavares Pinto  227

17 Exploring the interface between corpus linguistics and English for Academic Purposes: what, how and why?
Vander Viana  240

Index  261

FIGURES

1.1 Attendees and speakers for the International Perspectives on Corpus Technology for Language Learning Seminar Series 2022–2023  3
2.1 AntConc downloads (2004–2021)  10
2.2 AntConc KWIC concordancer results for the word ‘process’. The words are ordered by ‘KWIC pattern’ frequency  12
2.3 The scientific research process  14
2.4.1 Three highlighted Polypharmacy definitions taken from AntConc  23
3.1 Output screen for the Google Vision web demo  27
3.2 Image accompanying the message in Example 2  33
3.2.1 A screenshot from the multimodal corpus analysis carried out by students during the media literacy project  39
3.3.1 Perceptions of student participants towards benefits of multimodality  41
5.1 DDL for English, LOTEs except Italian, and Italian over 30 years  62
5.2 Experimental vs. control conditions  64
5.3 Experimental vs. control conditions with reference to congruency  65
5.4 Experimental vs. control conditions with reference to congruency and frequency  66
6.1 ColloCaid prototype displaying interactive menus in default In-Text View  73
6.2 ColloCaid prototype displaying interactive menus in Tree View  74
6.3 Verb/noun disambiguation prompt for issue in ColloCaid In-Text View  77
6.4 Collocates for activity in ColloCaid Tree View  77
8.1 Pedagogic collocational profile of the lemma CRITERION (based on Jablonkai, 2011)  96
8.2 Comparison of the collocational patterns of the lemma LAY in a general and discipline-specific corpus (based on Jablonkai, 2011)  97
8.3 Concordance lines for a LOOK activity (based on Jablonkai, 2011)  98
8.4 Example of a FAMILIARISE activity with the lemma CRITERION (based on Jablonkai, 2011)  99
8.5 Example of a FAMILIARISE activity with the lemma IMPLEMENT (based on Jablonkai, 2011)  100
8.6 Example of a FAMILIARISE activity with lexical bundles (based on Jablonkai, 2011)  101
9.1 Adjectives used with argument in SkELL  113
9.2 Adjectives used with argument in Lextutor  113
9.3 Adjectives used with argument in Lextutor. Search argument is in Academic General  114
9.4 Search this argument is in Lextutor (BAWE)  115
9.5 Multiple words search in Lextutor with associated word this in Academic Abstracts  116
9.6 Results of multiple words search in Lextutor with associated word this  117
9.7 Claim in association with evidence (10 words both sides, Lextutor, Academic General)  118
9.8 Evidence + infinitive (Lextutor, BAWE)  119
9.9 Argument in MICUSP  120
9.10 Problem in MICUSP  121
9.11 Hypothesis in MICUSP  122
9.12 Adjectives collocating with problem in Lextutor, Academic General  123
9.13 Hypothesis in Word sketch in SkELL  123
9.14 Refute + hypothesis in Examples in SkELL  124
9.15 Phrases containing hypothesis (Lextutor)  124
9.16 Use of argument by discipline (MICUSP)  125
9.17 Use of problem by discipline (MICUSP)  126
9.18 Use of hypothesis by discipline in reports and research papers (MICUSP)  126
12.1 Types of vocabulary tests (Leńko-Szymańska, 2020)  159
12.2 A user-defined vocabulary list generated by English Vocabulary Profile  161
12.3 A list of 50 multi-item terms generated from a custom-built urology corpus  162
12.4 Manual assignment of CEFR levels to the words in a text in Text Inspector  165
13.1 A two-step CBLP training framework  177
13.2 Alignment of the four-stage CBLP lesson design model with the SLA model  178
14.1 Collocations taxonomy  192
14.2 Instructions for Activity 1 (Part A)  193
14.3 Screenshot of the PLATCOL prototype – morphosyntactic structures  194
14.4 Screenshot of the PLATCOL prototype – adjectival collocations of the base “role” and examples of “pivotal role”  195
14.5 Activity 1 (Part B)  196
14.6 Activity 1 (Part C)  196
14.7 Instructions for Activity 2 and gap fill  197
14.8 Screenshot of the PLATCOL prototype – verbal collocations of the base ‘role’  198
14.9 Screenshot of the PLATCOL prototype – search for the entry Papel and its definitions  199
14.10 Screenshot of the PLATCOL prototype – morphosyntactic structure Papel + Adjective and examples of the collocation papel crucial  200
14.11 Screenshot of the PLATCOL prototype – morphosyntactic structure ‘adjective + papel’  201
14.12 Instructions for Activity 3  202
14.13 Collocations with the structures ‘adjective + papel’, ‘papel + adjective’  202
16.1 Screenshot of AntCorGen  230
16.2 Screenshot of multiword terms of SDGs in Sketch Engine  230
16.3 Concordance lines with the search term “ocean health”  231
16.4 Extended concordance line with the search term “ocean health”  232
16.5 Questions taken from an expanded concordance line  232
16.6 Extended concordance line with the search term “ocean health” and the answer to the guided questions previously presented  232
16.7 The CorpusMate platform  233
16.8 Concordance lines with the search word “ocean” + word  234
16.9 Concordance lines with the search term “ocean pollution”  234
16.10 Concordance lines with the search term “ocean master”  235
17.1 Distribution of ‘inconspicuous’ across components in COCA  245
17.2 Distribution of ‘remain* + inconspicuous’ in COCA  245
17.3 Distribution of ‘hinterland’ in the BNC  246
17.4 Concordance lines for ‘hinterland’ in the BNC  247
17.5 Two-word sequences containing ‘utter’ followed by a noun in the academic component of COCA  248
17.6 Sequences containing ‘statistics’ followed by a verb in the academic component of COCA  249
17.7 Concordance lines for ‘statistics reports’ in the academic component of COCA  250
17.8 Concordance lines for ‘statistics report’ in the academic component of COCA  250

CONTRIBUTORS

Laurence Anthony is Professor of Applied Linguistics at the Faculty of Science and Engineering, Waseda University, Japan. His main research interests are in corpus linguistics, educational technology, and English for Specific Purposes (ESP) program design and teaching methodologies.

Tony Berber Sardinha is Professor at the Pontifical Catholic University of São Paulo (PUCSP), Brazil. His interests include corpus linguistics, multidimensional analysis, artificial intelligence, metaphor, and applied linguistics.

Douglas Biber is Regents’ Professor of Applied Linguistics at Northern Arizona University. His recent research focuses primarily on linguistic studies of register variation and grammar.

Alex Boulton is Professor of English and Applied Linguistics at the ATILF, University of Lorraine and CNRS (France). His research interests focus on data-driven learning, the use of corpus tools, and techniques for language learning and teaching, with several syntheses and meta-analyses on this in recent years.

Luciana Forti is a researcher and lecturer based at the University for Foreigners of Perugia, Italy. Her current research interests include data-driven learning and phraseological complexity in the context of Italian L2 studies.

Ana Frankenberg-Garcia is Associate Professor of Translation Studies and Programme Leader of the MA in Translation at the University of Surrey. Her research focuses on corpus linguistics applied to assisted writing, lexicography, and translation.


Clarence Green is Assistant Professor of English Language Education at the University of Hong Kong. His areas of expertise include vocabulary, corpus linguistics, literacy, and reading psychology.

Reka R. Jablonkai is an award-winning Associate Professor in Education and Applied Linguistics at the University of Bath, UK. She conducts research into language corpora in teaching and corpus-based discourse analysis, was lead editor of The Routledge Handbook of Corpora and English Language Teaching and Learning (2022), and is an active member of the CorpusCALL SIG of EuroCALL.

Tatyana Karpenko-Seccombe (UK) has been teaching academic writing to international students and researchers for over 25 years. She has used corpora and concordancers extensively in her teaching and is currently working on a new edition of her resource book, Academic Writing with Corpora (Routledge, 2020).

Tove Larsson is Assistant Professor of Applied Linguistics at Northern Arizona University. Her research interests include corpus linguistics, L2 writing, and research methods.

Elen Le Foll is a post-doctoral researcher and lecturer at the University of Cologne (Germany). She is particularly interested in quantitative corpus linguistics methods and applications of corpus research in language teaching and learning.

Agnieszka Leńko-Szymańska is Assistant Professor at the Institute of Applied Linguistics, University of Warsaw. Her research interests lie primarily in psycholinguistics, second language acquisition and corpus linguistics, especially issues related to lexis and phraseology in these fields.

Qing Ma is Professor at the Education University of Hong Kong, Hong Kong SAR. Her research interests include second language vocabulary acquisition, corpus linguistics, corpus-based language pedagogy, computer-assisted language learning and mobile-assisted language learning.

Adriane Orenha-Ottaiano is a lecturer at São Paulo State University (UNESP) and holds a Postdoctoral Fellowship in Translation from the Université de Montréal and a PhD in Linguistic Studies (UNESP).

Pascual Pérez-Paredes is Professor in Applied Linguistics and Linguistics at the Universidad de Murcia, and former Lecturer in Research in Second Language Education at the University of Cambridge. His main research interests are the use of corpus linguistics methods in applied linguistics, corpora and digital resources in language education, and corpus-assisted discourse analysis.


Paula Tavares Pinto is Professor of Applied Linguistics at São Paulo State University (UNESP), Brazil. She has coordinated and worked collaboratively on several national and international projects.

Vander Viana is a senior lecturer at the University of Edinburgh, UK. He has a range of teaching/research interests in English language, Applied Linguistics and Education (e.g., corpus linguistics, data-driven learning, TESOL, EAP, language teacher education and academic/pedagogical discourse analysis).

List of discussants

Wesley Henrique Acorinti is an undergraduate student of Portuguese and English, a teaching assistant and a research assistant at the Federal University of Rio Grande do Sul in Brazil.

Chyara Alvarado López has a degree in tourism from the Autonomous University of the State of Mexico and is an educational researcher. Her research focuses on the importance of open science journals in research.

Alisha Biler is Assistant Professor of English and Linguistics at Boyce College in Louisville, KY, USA. Her research focuses on applications of corpus linguistics to second language writing and reading.

Alessandra Cacciato is a PhD student in Applied Linguistics at the University of Lorraine, ATILF. Her research focuses on mobile-assisted language learning and data-driven learning for Italian with pre-tertiary learners.

Alejandro Curado Fuentes is a senior lecturer teaching English and Applied Linguistics at the University of Extremadura, Spain.

Khaoula Daoudi is a PhD researcher at the School of Modern Languages and Applied Linguistics at the University of Limerick in Ireland. Her research focuses on the practical application of corpus linguistics in the classroom, materials design in ESP, and teacher training and professional development.

Kevin F. Gerigk is a doctoral researcher and associate lecturer at Lancaster University, and a research associate at Aston University. His research interests are corpus linguistics, EFL materials design, DDL, and critical discourse analysis.

Sylvia Goetze Wake is an English teacher and Director of the University of Lausanne Language Centre in Switzerland. Her research revolves around English for Specific Purposes, with an interest in DIY corpora for doctoral writers.


Larissa Goulart is Assistant Professor of Linguistics at Montclair State University, New Jersey, in the United States. Her research focuses on undergraduate student writing, register variation, and the applications of corpora in teaching.

Hung Tan Ha is currently a PhD student in Applied Linguistics at Victoria University of Wellington, New Zealand. His research interests involve L2 reading comprehension and vocabulary assessment.

Handoko is Assistant Professor of Linguistics at Universitas Andalas, Padang, in Indonesia. His research focuses on language competence, cyberpragmatics, ecolinguistics, and the application of corpora in teaching.

Elizabeth Hanks is a PhD student at Northern Arizona University (United States) whose research uses corpus methods to study vocabulary and pragmatics.

Adrian Jan Zasina is Assistant Professor in Linguistics and Head of the Institute of Czech and Deaf Studies at Charles University, Czech Republic. His main research interests include data-driven learning, corpus linguistics, parallel corpora, the grammar of contemporary Czech and Polish, language variability, and corpus-assisted discourse studies.

Vanessa Joy Z. Judith is Associate Professor V in the College of Education at Carlos Hilado Memorial State University in Talisay City, Negros Occidental, Philippines. She has published various research works in a number of peer-reviewed international journals.

Baramee Kheovichai is Assistant Professor of English in the Department of English, Faculty of Arts, Silpakorn University, Thailand. He has conducted various research projects in the fields of corpus linguistics, discourse analysis, and metaphor.

Ranwa Khorsheed is based at the Arab International University, Damascus, Syria. They are an instructor of academic writing and research in English and a certified sworn translator.

Keisuke Koyama is a language teacher at Shibuya Makuhari Senior and Junior High School, Japan. He is interested in corpus-based language teaching, collocation, and second language vocabulary acquisition.

Nataliia Lazebna is a habilitated Doctor of Science (DSc) in linguistics (Germanic languages), Associate Professor, and researcher in the TEFL Department at Julius-Maximilian University of Würzburg (Germany).


Liang Li is a researcher in the Te Kura Toi School of Arts, University of Waikato, New Zealand. Her research areas include corpus linguistics, L2 academic writing, and computer-assisted language learning.

Mengyang Li holds an MA TESOL from the University of Queensland and works as an English teacher at Shanghai SIPO Polytechnic in China.

Ning Liu is Associate Professor in the College of Global Talents at the Beijing Institute of Technology (Zhuhai) in China, where he mainly teaches academic English and interdisciplinary courses.

Lee McCallum is a lecturer in Assessment and Feedback at Edinburgh Napier University in Scotland, UK. Her research interests include corpus linguistics and writing assessment.

Luthf Muhyiddin is an Arabic linguist fascinated by teaching Arabic to Indonesian learners. He is always interested in learning more about corpus linguistics and its wide-ranging research and applications.

Mariam Orkodashvili is a researcher at Georgian American University, Georgia. She works on the issues of multimodality, corpus-mediated discourse analysis and language acquisition, and neuroplasticity.

Heloísa Orsi Koch Delgado is a visiting lecturer at La Salle University and a postdoctoral researcher at the Federal University of Rio Grande do Sul (UFRGS), Brazil. In recent years, she has researched the topic of Plain Language in healthcare.

Ali Şükrü Özbay’s main areas of interest as a researcher include corpus linguistics and learner corpora, academic writing and error analysis, and phraseology. He is currently engaged in EAP research as part of a multidisciplinary team of researchers.

Zeynep Özdem-Ertürk is an English instructor at Niğde Ömer Halisdemir University, Niğde, in Türkiye. Her research focuses on lexical and corpus studies.

Mustafa Özer is pursuing his doctoral degree in applied linguistics at Erciyes University in Kayseri, Türkiye. He currently works as a language instructor at Abdullah Gül University in the same city.

Mariangela Picciuolo is a PhD student of English Linguistics at the University of Bologna, Italy. Her research focuses on English-Medium Instruction (EMI), corpus-assisted multimodal discourse analysis, and computer-mediated communication.


José Luis Poblete is Adjunct Professor of English Linguistics and English as a Foreign Language at Universidad de Santiago, Chile. His research interests include the use of corpora and data-driven learning in the EFL classroom and B language teaching in Translation Studies. He is also a podcaster at ELT in Chile.

Muhammad Rudy is an English language lecturer in the Pharmacy Department of Malahayati University in Lampung Province, Indonesia. He is interested in corpora for teaching and materials development, English for Specific Purposes, and teaching techniques.

Danang Satria Nugraha is a PhD candidate at the University of Szeged, Hungary, and also serves as a Linguistics lecturer at Sanata Dharma University in Indonesia. His research centres on language phenomena in Indonesia, with a particular emphasis on theoretical linguistics.

Boonyakorn Siengsanoh is an English language teacher at Chulalongkorn University Language Institute in Bangkok, Thailand. Her research interests include L2 writing, lexical collocations, corpus-based ESP collocation and vocabulary lists, and corpus linguistics.

Vicki Jade Stevenson is an EAP tutor at the Centre for Academic Language and Development at Bristol University. She is currently a student on the MA TESOL Leadership and Management at the University of Portsmouth.

Shuyi Amelia Sun is a PhD researcher in applied linguistics in the School of Foreign Language Education at Jilin University, China. She has been actively researching and publishing in the fields of English for Academic Purposes, corpus linguistics, quantitative linguistics, and second language writing.

Paweł Szudarski is Assistant Professor at the University of Nottingham, UK. He works in the areas of applied linguistics, corpus linguistics, and TESOL, focusing in particular on the acquisition of vocabulary and phraseology by second/foreign language learners.

Carolina Tavares de Carvalho is a PhD student in applied linguistics at UNESP – São Paulo State University. She collaborates on projects related to corpus-based teaching, the UN Sustainable Development Goals, and partnership initiatives between UNESP and The University of Queensland.

Daniela Terenzi is an ESP practitioner who teaches English for Aeronautical Maintenance and Engineering at the Federal Institute of Education, Science and Technology of São Paulo in Brazil.


Le Thanh Thao is a language lecturer at the School of Foreign Languages, Can Tho University, situated in the Mekong Delta region of Vietnam. His academic interests span language studies, classroom-centric research, and educational policy frameworks.

Thi Thanh Thuy Tran is a full-time lecturer in English as a Foreign Language at the Language Institute, Van Lang University in Ho Chi Minh City, Vietnam. Her research focuses on teaching materials design and assessment, discourse analysis, and the applications of corpora in teaching.

Buse Uzuner has an MA degree in Applied Linguistics and is currently pursuing a PhD in the same field. Their main areas of interest cover corpus linguistics, learner language, data-driven learning, and academic writing.

Claudia Wunderlich is Full Professor of English for Specific Purposes and German as a Foreign Language at the Technical University of Applied Sciences Würzburg-Schweinfurt, Germany. Her research focuses on vocabulary in languages for specific and professional purposes in EMI settings and on applying those findings and corpora in teaching.

Yiyang Yang, a PhD candidate at the University of Science and Technology Beijing (USTB), China, is an early career researcher in the field of computer-assisted language learning and cognitive education.

Magdalena Zinkgräf is a language teacher and researcher at Universidad Nacional del Comahue, Argentina. Her research interests include the acquisition of formulaic sequences by Spanish-speaking EFL learners, corpus linguistics, and academic vocabulary learning.

1 BRIDGING THE RESEARCH-PRACTICE DIVIDE FOR DATA-DRIVEN LEARNING Introduction to the volume Peter Crosthwaite

Introduction

Corpus linguistics has played an important role in language teaching for decades. Whether through the ‘indirect’ application of corpora in the creation of dictionaries, wordlists and teaching materials, or the ‘direct’ application of corpora in the language classroom (also known as data-driven learning, DDL, Johns, 1991), the chances are that if you are learning a second language, you will, at some point, have been exposed to teaching or materials influenced by corpus findings. Accordingly, the wealth of published corpus linguistics and DDL research of this nature has increased significantly over the past 30 years (Boulton & Vyatkina, 2021; Crosthwaite et al., 2022), and continues to be a highly productive area of applied linguistics research. However, no one is under the illusion that corpora (at least, in their direct DDL application) have finally entered mainstream classroom pedagogy. We are still some way off from concordancing software being an accepted part of the digital pencil case of CALL tools employed by teachers and language learners, a situation perhaps made even more precarious by the recent introduction of generative artificial intelligence (AI) software (e.g., ChatGPT) that may potentially overcome several long-established difficulties with DDL research, including limited corpus data and clunky user-unfriendly corpus tool interfaces (Zappavigna, 2023; Lin, 2023; Crosthwaite & Baisa, 2023). In recognition, and in a vitally important article titled ‘Towards the corpus revolution? Bridging the research – practice gap for DDL’, Chambers (2019) emphasised that despite positive reported results on language learning, learner engagement and classroom integration, normalisation of DDL into everyday

DOI: 10.4324/9781003413301-1

2

Peter Crosthwaite

practice had not materialised. Chambers points out most teachers might not be aware of the existence of concordancers nor be familiar with their use, and that the majority of empirical studies involving learners using corpora are small-scale, conducted by only a limited number of researchers. These researchers (including a fair number of us featured in this volume!) tend to have advanced skills in corpus creation and consultation currently lacking in most practicing teachers, and utilise corpora or tools they themselves have created which may not always be publicly available. Pedagogical corpora specifically designed for DDL are still too few and far between, and while researchers devote much time and effort to teaching, the end goal is still one primarily focused on research with little follow-up once that research is published. To bridge this gap, Chambers recommends all stakeholders, e.g., teachers, teacher educators, researchers, major publishers and students come together to champion the corpus cause (where suitable), despite the varying needs of each. Of course, Chambers was neither the first nor the last to state the obvious, with several leading DDL researchers also noting the general failure of DDL to penetrate mainstream pedagogical practice. Boulton (2009) was one of the first to note the ‘reasonable fears’ of teachers involved in DDL, requiring ‘rational reassurance’ intended to demystify DDL to those would end up using it. Pérez-Paredes (2022) called for further theorisation underpinning DDL together with improved syllabus integration and increased contribution from language teachers if DDL was to stay relevant in the years to come. Crosthwaite and Boulton (2023) claim that DDL research as a whole is only covering well-trod paths in a zombie-like manner rather than breaking substantial new ground outside the ivory towers of higher education. 
Yet, Chambers’s call to bring together disparate voices, including corpus linguists, teachers, teacher-researchers, teacher trainers and the learners themselves in order to bridge this gap has still gone largely unheeded. That, therefore, leads us to the present volume.

International perspectives on bridging the gap

Of course, one of the main reasons why this call has gone unheeded in the last few years is the unprecedented COVID-19 pandemic, which shut down a great many language classrooms around the world not long after the publication of Chambers’s article. During the pandemic, researchers, teachers and learners scrambled to survive, both figuratively and literally, under remote learning conditions. Relevant conferences and grants were cancelled, and educators and educational stakeholders were in little mood to waste time and resources on new pedagogies. What momentum we had as a field was in danger of being lost, and could quite easily have been. However, one of the unforeseen benefits of the switch to remote learning was the improvement in video conferencing technology (e.g., Zoom), such that we
now had the ability to connect with hundreds of people from around the world in real time. Following on from an initial pre-COVID collaborative grant between myself at the University of Queensland, Australia and Drs. Paula Pinto and Adriane Orenha-Ottaiano at São Paulo State University, Brazil, and following an excellent series of online talks in applied linguistics led by Tony Berber Sardinha, São Paulo Catholic University, Brazil, we decided to make a concerted effort to keep the DDL ball rolling by creating a new, DDL-specific online seminar series titled International Perspectives on Corpus Technology for Language Learning. Featuring 30 international speakers, this was, at the time of writing, the largest-ever DDL-focused seminar series, with over 4,500 participants spanning 112 countries during its initial three-semester run (Figure 1.1). One of the most significant outcomes of this series was the diversity of our audience, including established researchers, early career scholars, teachers, graduate students and language learners. The ability for so many people to collaboratively discuss and reflect on DDL in real time, no matter where you were from or the

FIGURE 1.1 Attendees and speakers for the International Perspectives on Corpus Technology for Language Learning Seminar Series 2022–2023

contexts you were working or learning in, was a first outside of higher education and published research. In short, the series had worked, in some respects, to begin to bridge the research-practice divide discussed earlier. At the series’ conclusion (and while we have not ruled out restarting the series in the near future), it was time to take stock and work out a way to document our experiences for dissemination. At the same time, we recognised that a ‘traditional’ edited volume featuring only the voices of established researchers producing ‘traditional’, lengthy, research-focused chapters would do little to help bridge the research-practice gap, instead only continuing to omit the valuable voices of those we are trying to reach. In other words, an alternative – perhaps risky – but necessary approach needed to be taken. In response, the present volume is something of an experiment to create the most accessible, most practical and most engaging DDL resource from the point of view of both established researchers and those using their research in the field. Each ‘main’ contribution by one of the series’ speakers is punctuated by several ‘discussion’ sections featuring teachers, researchers and language learners who were influenced or impacted by their work. These do not take the form of typical research articles, but instead are personal reflections on research-informed practice, written for a general audience and intended to promote discussion and reflexivity. The ‘main’ contributions are also significantly shorter than standard chapters so as to increase their readability, and several take the form of digestible ‘interview’-style conversations between the editor and the contributor(s) themselves, an approach also recently seen in Pérez-Paredes and Mark’s excellent edited DDL volume Beyond Concordance Lines (2021).
In short, this book is both aimed at, and co-produced by, the very people we have been trying to reach all along, that is, you – the (hopeful) DDL enthusiast.

Contributions

Laurence Anthony’s chapter discusses the development and application of the ever-popular AntConc corpus analysis toolkit in DDL classrooms. Originally designed as a simple concordancer, AntConc has transformed into a comprehensive tool that facilitates language teaching and learning. Anthony explains how future versions aim to incorporate large language models, including ChatGPT, to provide more profound insights. Discussants of Anthony’s chapter are Lee McCallum, Yiyang Yang, Baramee Kheovichai and Muhammad Rudy.

Tony Berber Sardinha delves into the potential of analysing multimodal corpora in classrooms from a multidimensional perspective. He highlights the traditional focus on textual content in linguistic research, despite the inherently multimodal nature of language use. Through analysing a corpus of social media posts, Berber Sardinha suggests practical classroom applications and underscores the
growing significance of AI tools in corpus linguistic research. Discussants are Mariangela Picciuolo, Mariam Orkodashvili and Danang Satria Nugraha.

Alex Boulton discusses the evolution and principles of DDL, emphasising its benefits in enhancing active learning through investigation of real-world language examples. He addresses the challenges and potential drawbacks of DDL, suggesting the need for collaborative research and more diverse study designs to better understand its impact. Boulton also underscores the significance of teacher involvement, the role of evolving technology, and the importance of integrating DDL into broader pedagogical strategies. Discussants are Alessandra Cacciato, Le Thanh Thao, Adrian Jan Zasina and Kevin F. Gerigk.

Luciana Forti explores the intersection between DDL and second language acquisition (SLA), with a particular focus on languages other than English (LOTEs). She underscores the need to explore how DDL can assist in the learning of LOTEs, using a case study on Italian as a second language to illustrate how DDL can be integrated to enhance language acquisition. Forti argues for a more comprehensive integration of DDL practices in LOTEs and highlights the potential benefits and challenges of such integration, emphasising the role of teachers as agents of change in this pedagogical shift. Discussants are Luthfi Muhyiddin and Zeynep Özdem-Ertürk.

Ana Frankenberg-Garcia’s chapter discusses the challenges faced by students and researchers unfamiliar with academic English collocations. She introduces ColloCaid, a DDL text-editing tool designed to assist writers in improving their use of academic English collocations. The chapter highlights the tool’s application in teaching, supported by user studies, and its potential for future integration with mainstream text editors. Discussants are Liang Li and Magdalena Zinkgräf.

Clarence Green examines the role of extensive reading (ER) and extensive viewing (EV) in vocabulary acquisition, emphasising the insights gained from corpus linguistics. The chapter showcases how corpus research can inform teachers about the potential of ER and EV in enhancing vocabulary comprehension, with studies suggesting the need for around 9,000 word families for proficient reading in English. Green proposes using corpus data to model and predict the outcomes of ER/EV interventions, advocating for a data-driven approach in educational decisions to ensure efficient and effective vocabulary learning. Green’s discussant is Elizabeth Hanks.

Reka R. Jablonkai’s chapter explores the transformative potential of corpora in language education. Despite early enthusiasm for the ‘corpus revolution’, there remains a gap between corpus research and practical teaching applications. Through various case studies in English for Specific Purposes (ESP), Jablonkai highlights three primary teaching approaches based on the degree of student access to corpus data and underscores the need for integration of lessons from both CALL (computer-assisted language learning) and TEL
(technology-enhanced learning) research to design effective courses. Discussants are Claudia Wunderlich, Ali Sükrü Özbay and Thi Thanh Thuy Tran.

Tatyana Karpenko-Seccombe delves into the utilisation of corpora in teaching argumentation techniques, specifically emphasising patterns like claim-support and problem-solution. She presents a range of practical activities tailored for advanced language learners, highlighting the use of free online corpus tools like Lextutor Concordancer, SkELL and MICUSP. The chapter underscores the importance of DDL in language instruction, pushing for a broader application beyond vocabulary and grammar to incorporate the nuances of academic writing and argumentation. Discussants are Ning Liu and Sylvia Goetze Wake.

Tove Larsson and Douglas Biber discuss the crucial importance of linguistic interpretability in research. Balancing statistical techniques with linguistic interpretability is vital, and they advocate for methods that prioritise clear linguistic understanding. They emphasise the need for clear operational definitions of linguistic features and caution against relying solely on automated tools without evaluating their outputs. Discussants are Ranwa Khorsheed and Keisuke Koyama.

Elen Le Foll emphasises the critical role of open science and open education in bridging the gap between applied corpus linguistic research and its practical implementation in language learning and teaching. By advocating for the accessibility and sustainability of research findings and dissemination projects, Le Foll highlights the benefits of transparent, collaborative and inclusive scientific practices. Drawing from her experience, she underscores the potential of open educational resources (OERs) in promoting inclusive, equitable and effective language teaching approaches. Discussants are Heloísa Orsi Koch Delgado, Vicki Jade Stevenson and Chyara Alvarado López.
Agnieszka Leńko-Szymańska’s chapter outlines how corpora play a pivotal role in the assessment of second language (L2) proficiency, especially in the domain of vocabulary. Yet, despite their extensive use in research, corpora’s full potential is yet to be harnessed in language education and certification. Leńko-Szymańska argues for the growth of corpus applications for L2 lexical assessment in the future. Discussants are Mengyang Li, Alisha Biler and Hung Tan Ha.

Qing Ma’s chapter probes corpus-based language pedagogy (CBLP) as a progressive methodology for language instruction, emphasising its relevance for both pre-service and in-service teachers. This approach merges linguistic theories with practical teaching applications, using corpus technology to optimise learning results. The chapter underscores the importance of equipping educators with the tools and knowledge to integrate CBLP into their teaching methods, aiming to transform the landscape of language education. Discussants are Vanessa Joy Z. Judith and Khaoula Daoudi.

Adriane Orenha-Ottaiano examines the significance of corpus-based phraseology and phraseography in foreign language teaching and learning. The chapter emphasises the importance of collocational and phraseological competence
for fluency in a foreign language, detailing the potential of the Multilingual Collocations Dictionary Platform as a tool for teaching these competencies. This lexicographical platform, developed under the theoretical framework of corpus-based phraseology and phraseography, offers corpus-driven activities aimed at improving English and Portuguese language learners’ collocational skills. Discussants are Boonyakorn Siengsanoh, Buse Uzuner and Paweł Szudarski.

Pascual Pérez-Paredes reveals the potential of broad data-driven learning (BDDL) in informal language learning contexts, emphasising its importance beyond traditional, institutional settings. The author distinguishes between ‘prototypical DDL’, which is tailored for formal educational environments, and BDDL, which integrates a variety of digital tools and resources, such as AI tools, online dictionaries and apps for self-directed learning. Pérez-Paredes advocates for a shift in research focus towards BDDL to capture the evolving nature of digital-era language learning, encompassing informal and non-formal learning settings where learners can engage autonomously with digital content and tools. Discussants are Alejandro Curado Fuentes, Daniela Terenzi and Carolina Tavares de Carvalho.

Paula Tavares Pinto explores using a DDL approach to teach English for Academic Purposes (EAP) in the context of the United Nations’ Sustainable Development Goals (SDGs). Through the use of specialised corpora such as the SDG Corpus, educators can create engaging activities that allow students to analyse authentic language patterns in research papers, thus enhancing their reading comprehension and linguistic skills. The chapter emphasises the importance of English as a global academic language and showcases how tailored EAP materials and activities can better support Brazilian students in achieving academic and international success. Discussants are Luana Viana dos Santos, Handoko and Nataliia Lazebna.
Vander Viana delves into the relationship between corpus linguistics (CL) and English for Academic Purposes (EAP), highlighting the benefits of this intersection for EAP professionals and students. Structured around the themes of ‘what’, ‘how’ and ‘why’, the chapter sheds light on CL’s contributions to EAP. Viana’s contribution emphasises how DDL can improve written discourse in academic contexts, presents real-world case studies and provides practical insights on conducting CL-EAP research. Discussants are Shuyi Amelia Sun, Mustafa Özer, Wesley Henrique Acorinti and Larissa Goulart.

Reference list

Boulton, A. (2009). Data-driven learning: Reasonable fears and rational reassurance. Indian Journal of Applied Linguistics, 35, 81–106.

Boulton, A., & Vyatkina, N. (2021). Thirty years of data-driven learning: Taking stock and charting new directions over time. Language Learning & Technology, 25(3), 66–89. http://hdl.handle.net/10125/73450


Chambers, A. (2019). Towards the corpus revolution? Bridging the research–practice gap. Language Teaching, 52(4), 460–475.

Crosthwaite, P., & Baisa, V. (2023). Generative AI and the end of corpus-assisted data-driven learning? Not so fast! Applied Corpus Linguistics, 3(3), 100066.

Crosthwaite, P., & Boulton, A. (2023). DDL is dead? Long live DDL! Expanding the boundaries of data-driven learning. In H. Tyne, M. Bilger, L. Buscail, M. Leray, N. Curry & C. Pérez-Sabater (Eds.), Discovering language: Learning and affordance. Peter Lang.

Crosthwaite, P., Ningrum, S., & Schweinberger, M. (2022). Research trends in corpus linguistics: A bibliometric analysis of two decades of Scopus-indexed corpus linguistics research in arts and humanities. International Journal of Corpus Linguistics, 28(3), 344–377.

Johns, T. F. (1991). Should you be persuaded: Two samples of data-driven learning materials. ELR Journal, 4, 1–16.

Lin, P. (2023). ChatGPT: Friend or foe (to corpus linguists)? Applied Corpus Linguistics, 3(3), 100065.

Pérez-Paredes, P. (2022). A systematic review of the uses and spread of corpora and data-driven learning in CALL research during 2011–2015. Computer Assisted Language Learning, 35(1–2), 36–61.

Pérez-Paredes, P., & Mark, G. (Eds.). (2021). Beyond concordance lines: Corpora in language education. John Benjamins Publishing Company.

Zappavigna, M. (2023). Hack your corpus analysis: How AI can assist corpus linguists deal with messy social media data. Applied Corpus Linguistics, 3(3), 100067.

2 ADDRESSING THE CHALLENGES OF DATA-DRIVEN LEARNING THROUGH CORPUS TOOL DESIGN – IN CONVERSATION WITH LAURENCE ANTHONY

Laurence Anthony

DOI: 10.4324/9781003413301-2

Crosthwaite: Can you explain the evolution of the AntConc software and how it has been used in the classroom?

Anthony: AntConc[1] has a rather odd beginning. As part of my PhD on applications of artificial intelligence (AI) in applied linguistics, I aimed to develop an AI agent that could automatically analyze the discourse structure of texts following the Swales (1990) Create-A-Research-Space (CARS) model. As part of this project, I needed to refine my programming skills to develop an easy-to-use graphical interface for the software so that it could be used in practical settings by nontechnical learners, instructors, and researchers. I anticipated (correctly) that the software would be quite complex, so I decided to hone my programming skills by building a much simpler software tool . . . a concordancer.

As it so happened, just after finishing the tool, Judy Noguchi, a friend and well-known English for Specific Purposes (ESP) expert in Japan, contacted me about her need for a concordancer for a new STEM (science, technology, engineering, and mathematics) technical writing course that she was teaching in a Linux-based computer lab. The course would be using a data-driven learning (DDL) approach, with a concordancer being the main analytical tool that learners would use. At the time, all the popular concordance tools, such as WordSmith Tools[2] and MonoConc Pro,[3] could only run on Windows-based machines, so there was a need for a new tool. Fortunately, my simple concordancer was written in Perl, which allowed it to be compiled for any operating system, including Windows, Linux, and even Macintosh computers. I immediately released the tool to Judy and also put it on my website, and AntConc was born. The name comes from the ‘Ant’ of Anthony and the ‘Conc’ of concordancer.

After releasing AntConc 1.0 in April of 2002, I began adding new tools and features in response to requests and feedback from Judy and several other practicing
instructors. With the release of AntConc 2.0 in August of 2002, the toolkit contained the original concordancer together with tools for dispersion plot, cluster/n-gram, file, word list, and keyword list analyses. AntConc 3.0 was released in December of 2004, and this included native support for MacOS. Then, for the next 15 years, the general functionality and core tools of AntConc were incrementally improved based on my own needs for the software (as I also taught STEM-based technical writing using a DDL approach), but also feedback from other ESP/EAP (English for Academic Purposes) instructors. I also got feedback from a growing number of users from other domains, including Japanese language instructors, translators, and traditional corpus linguists, who were looking for free, multiplatform alternatives to the popular commercial corpus tools of the time. The corpus linguists’ group is interesting because they had two quite different purposes for adopting the tool. On the one hand, they seemed to like using AntConc for their own research. But they also found it useful when introducing corpus linguistics methods to learners as part of an applied linguistics course or program. Then, in December of 2021, I released AntConc 4.0, which marked a major change in the way AntConc was developed. AntConc 4.0 is built in Python, and all searches are powered by an indexed SQLite database. These changes have given me many new opportunities to improve and expand the software.

Crosthwaite: What do you think are the reasons for AntConc’s popularity in the DDL classroom?

Anthony: Since the release of version 1.0, AntConc has experienced a massive growth in usage among instructors and learners. This is seen in Figure 2.1, which shows the number of direct downloads from my website between 2003 and 2020. Over this time, I’ve tried to improve AntConc in various ways, giving it faster speeds, more

FIGURE 2.1 AntConc downloads (2004–2021)

statistics, and greater flexibility in how corpora are processed and results displayed. But I suspect the main reasons for its rapid growth in popularity have been a few key design choices.

First, it is very easy to find, set up, and use, as it is available in both installer and portable versions, with the portable version being able to run from a USB stick and only needing a single click to start. Second, it is very safe and reliable. AntConc is code-signed and notarized by Microsoft and Apple, meaning that it will not generate any security warnings on install. It is also reviewed positively by dozens of virus detection sites. Third, it is constantly evolving, with new tools, features, and bug fixes being regularly added. Fourth, it is very fast and offers a comprehensive set of analytical tools, including KWIC, Plot, File, Cluster, N-gram, Collocate, Word (list), and Keyword (list) tools, which are the core tools used by most corpus linguists.

Finally, and perhaps most importantly, it is designed for a broad user base. For example, it can run on any operating system, so instructors do not have to worry about which sort of computer lab they are assigned or what computers learners bring to the classroom. It is also built from the ground up to support Unicode, making it useful for the analysis of both European and Asian languages, like Japanese and Chinese. It also supports both left-to-right (LTR) and right-to-left (RTL) languages, so it works well with Arabic and Hebrew texts. Finally, it supports regular expressions (regex), allowing it to be used for very powerful text searches across any language.

Crosthwaite: What are some specific ways that AntConc can be used to support language teaching and learning?

Anthony: AntConc is very much a general-purpose corpus analysis toolkit, but because of its roots as a support tool for in-class teaching using a DDL approach, many of its tools and features are designed to support language teaching and learning.
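[Editor’s note: as a rough illustration of the kind of regex-powered KWIC search a concordancer performs (this is not AntConc’s actual code; the function names and the three-sentence toy corpus are invented for this sketch), the core idea can be written in a few lines of Python:]

```python
import re
from collections import Counter

def kwic(texts, pattern, width=40):
    """Collect KWIC (KeyWord In Context) hits for a regex pattern,
    returning (left context, node word, right context) tuples."""
    hits = []
    for text in texts:
        for m in re.finditer(pattern, text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((left, m.group(), right))
    return hits

def sort_by_right_word(hits):
    """Order hits by the frequency of the first word to the right of
    the node: a simplified version of 'KWIC pattern' frequency sorting."""
    def first_right(hit):
        words = hit[2].split()
        return words[0].lower() if words else ""
    freq = Counter(first_right(h) for h in hits)
    return sorted(hits, key=lambda h: (-freq[first_right(h)], first_right(h)))

corpus = ["The process of learning is slow.",
          "Each process of review takes time.",
          "A process can fail."]
for left, node, right in sort_by_right_word(kwic(corpus, r"\bprocess\b")):
    print(f"{left:>10} [{node}]{right}")
```

[On a real corpus, the hits sharing the most frequent right-hand neighbour surface first, so the most common pattern is the first thing a learner sees rather than being buried in an alphabetical listing.]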
When adopting a DDL approach, the first questions facing either the instructor or the learner are what corpus should be used and how it can be accessed or created. AntConc is not able to determine which corpus should be used in the classroom, but it comes with an online repository of existing corpora, making it a one-click step to access existing corpora. It also offers a very simple way for both instructors and learners to create, store, and share their own corpora through its ‘Corpus Manager’ tool. This makes it possible for instructors to prepare custom corpora for the classroom (e.g., previous assignment submissions, practice TOEIC papers, published research articles, transcripts of spoken conversations, and so on) and share them with learners. Because these corpora will be added to the learners’ own corpus repository, they can then access these corpora with a single click. Learners can also create their own corpora, for example, from files they collect from the Internet, or even their own writing in the form of course submissions.

A common starting point for DDL is identifying words and phrases in a target corpus that may prove interesting, unusual, or difficult for the learners. AntConc offers multiple ways of achieving such a goal. First, the ‘Word list tool’ can generate a
word list of all the words in a target corpus and rank these in various ways, e.g., by frequency, range (dispersion), or alphabetically. Second, the ‘Keyword list tool’ can generate so-called keywords in a target corpus, by comparing the words in that corpus with those in a reference corpus. AntConc offers many ways to statistically compare the word sets, using both frequency- and dispersion-based statistical methods as well as effect size measures and combinations of the two. Third, the ‘N-gram tool’ can automatically identify all contiguous word patterns of a certain size, which is useful for identifying common idiomatic expressions, set phrases, and even potential cases of plagiarism in learner writing. The ‘N-gram tool’ can also generate these same results for certain minimum frequency and range thresholds, which are sometimes referred to as lexical bundles (Biber et al., 2004), as well as for open-slot N-grams, which are sometimes referred to as positional frames or p-frames (Römer, 2010).

The core activity in most DDL classrooms is using a concordancer to look at word and phrase usage. Through concordancing, learners can identify the common words that surround target words or phrases, formulate rules for how these words and phrases should be used, and even hypothesize about the subtle meanings and semantic prosody of the words and phrases. An example is shown in Figure 2.2 for the word ‘process’ from the academic subcorpus of the AmE06 one-million-word corpus of general American English (Potts & Baker, 2012). Unlike most other tools, the concordancer in AntConc is quite unique in that it can order the results of a concordance search alphabetically (as found in almost all other tools) but also, more usefully, by ‘KWIC pattern’ frequency (Anthony, 2018, 2022), i.e., the frequency of the patterns in the results. For example, the most frequent word to the

FIGURE 2.2 AntConc KWIC concordancer results for the word ‘process’. The words are ordered by ‘KWIC pattern’ frequency

right of the word ‘process’ in the academic subcorpus of AmE06 is ‘of’, so the results for ‘process of’ are shown first in the AntConc concordancer, as seen in Figure 2.2. If a POS-tagged corpus is used, the AntConc concordancer can also be used to show the most frequent POS-tag patterns (e.g., verb + preposition patterns) or even combinations of different tagging layers (e.g., the most frequent adjectives that precede the word ‘result’ in research papers).

One of the main challenges in traditional DDL learning is managing data overload and helping learners discover salient patterns from within a huge number of results. As I described earlier, AntConc immediately helps overcome both problems by presenting concordance results in a paged format, ordered by KWIC pattern frequency. As a result, learners do not need to scroll through a massive set of results trying to remember what they have seen. On the contrary, the first page of results always shows the most frequent pattern(s), which often equates to the most salient pattern(s). For the other tools in the AntConc toolkit, learners can easily choose minimum values for frequency and range, and when statistics are used, the tools will use default thresholds for p-values and effect size measures that align with best practices in the field. As an example, keywords in AntConc are by default calculated using the log-likelihood measure, with a critical value of p < 0.05 with a Bonferroni correction.

Another challenge that faces instructors hoping to adopt a DDL approach in the classroom is avoiding the technical challenges of using computers in the classroom, especially if the learners are not very tech savvy. AntConc supports DDL in this regard in a number of ways. As I have already described, it can be installed on any operating system or launched from a USB stick, so instructors do not have to worry about the types of computers that learners bring to the classroom.
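[Editor’s note: a minimal sketch of the log-likelihood keyword statistic mentioned above. The function names, toy data, and fixed cutoff are invented for this example; AntConc additionally offers dispersion-based measures and effect sizes, and applies the Bonferroni correction by scaling the p-value threshold by the number of comparisons.]

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Dunning's log-likelihood for a word occurring a times in a
    target corpus of c tokens and b times in a reference corpus of
    d tokens (0 * log 0 treated as 0)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target_tokens, reference_tokens, critical=3.84):
    """Rank words whose target-corpus frequency differs significantly
    from the reference corpus. 3.84 is the chi-squared cutoff for
    p < .05 (df = 1); a Bonferroni correction raises this cutoff by
    testing at .05 divided by the number of words compared."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(tf.values()), sum(rf.values())
    scored = ((w, log_likelihood(tf[w], rf[w], c, d)) for w in tf)
    return sorted(((w, s) for w, s in scored if s >= critical),
                  key=lambda ws: -ws[1])

target = ["corpus"] * 10 + ["the"] * 10
reference = ["the"] * 20
print(keywords(target, reference))  # 'corpus' scores as key; 'the' does not
```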
The software is also code-signed and notarized, so no security warning will be generated. More importantly, the software is designed to look and feel like a familiar Internet browser, with tabs for the different tool pages and a simple search bar that is always visible on the screen. As with Internet browsers, all of the complex or advanced settings are hidden behind menu bar options, and the results are also simplified, with more advanced result columns needing to be activated before they are displayed. One feature that is particularly important in a classroom is the ability to change the font size and color scheme of the entire AntConc display across all tools. This makes it easy for instructors to display AntConc on a central screen in the classroom and walk learners through the steps of an activity as they work independently or in groups. Also, the ability for instructors to create corpora that can then be distributed to learners and launched with one click cannot be overstated. From my experience, the greatest challenge in DDL for learners is simply finding the target texts and loading them into the concordancer. The in-built corpus repository of AntConc greatly simplifies this task.

Crosthwaite: What are some best practices for using AntConc in the classroom?

FIGURE 2.3 The scientific research process

Anthony: As I explain to my own learners, DDL is essentially a replication of the scientific research process applied to language and carried out by learners in the classroom (see Figure 2.3). Viewed in this way, it is perhaps best to consider AntConc as being useful at the data collection and analysis stage (i.e., Step 5). For any true discovery to take place in the DDL classroom, the instructor or learner first needs to choose a target area of investigation. This could be the vocabulary of the TOEIC, common expressions used when writing the methods section of a research paper, or the identification of common errors in learner writing. Next, it is important to review the literature in that target area, identify a gap in that knowledge, and design an experiment to fill that gap. The review of literature (i.e., a presentation of known knowledge) might already be covered in the course textbook, but it could also easily be done by the instructor in the form of slides or readings. Instructors might also encourage learners to review existing material, observe target language settings, or speak to their research supervisors about target language usage.

Identifying the gap in knowledge is where things get interesting. For example, the course textbook might cover language patterns in general English
but ignore the subtle differences found in specific disciplines of STEM. Indeed, learners might be able to make some genuine new discoveries about language use by focusing on their own specific field or language interests. Even when the existing knowledge is vast, there will always be a gap in knowledge when it comes to the individual learner’s own language uses. So, learners can choose to focus on their individual strengths and weaknesses and make discoveries there. With guidance from the teacher, they are then in a good position to design their ‘experiment’, deciding which corpus, tools, and searches to use. I feel that AntConc becomes especially useful at the next stage. Instructors and/ or learners can locate a suitable corpus and add it to AntConc, or even create their own custom corpus of target texts within AntConc and add it to their repository. Next, I would recommend that the instructor displays AntConc on a main classroom screen (enlarging the font size as necessary) and walk through the basic uses of the toolkit, explaining how it can be used for data analysis. Usually, I start by showing learners how AntConc can be used to identify ‘important’ words through the Word list and Keyword list tools. Next, I show the learners *how* these words are used in context with the KWIC tool, emphasizing that the display ranks the results by the most frequent patterns of usage. Then, I show the learners how to identify *where* these words are used (i.e., in which texts and in which positions) using the Plot tool. I usually then quickly cover the Cluster, N-gram, and Collocates tools, illustrating how these tools can be used to give insight on contiguous and non-contiguous phrase patterns. Once the learners understand the basic operations of AntConc, I would recommend that the instructor give them clear instructions on how to formulate some basic searches of their own data with results that the instructor can generally predict. 
Learners can then be given time to go through these tasks on their own or in small groups as the instructor walks around the room, checking that there are no obvious technical problems. The instructor may also choose to bring the whole class together to show some example results, which helps to reassure the learners that they are using AntConc correctly. I would also recommend that instructors encourage learners to go beyond the basic analyses and learn to 'play' with the software, trying out other searches and tools, and seeing what they find. Importantly, I always remind learners to think about *why* the results appear as they do, and *how* they might apply the findings for their own current and future language purposes. This process maps to the final stage of the scientific research process, where the researcher discusses results and formulates their conclusions in terms of the research question being asked.

Crosthwaite: Are there any limitations or challenges to using AntConc for language teaching and learning, and how can they be addressed?

AntConc is a desktop software tool, which brings with it certain advantages but also some clear limitations. In terms of advantages, the software can be used without an

Internet connection, which is important for many school environments around the world, where the Internet may be unstable or nonexistent. It also processes all data locally. This means it can be used with sensitive data, such as learner assignments, interview data, high-stakes examinations, and legal texts. But the desktop nature of AntConc is also perhaps its biggest limitation. Users need to download the tool and install it on their computers, which can be a challenge for novice computer users. In some school environments, computers will be managed centrally by an IT department, and so again, downloading and installing the software on these machines can be a challenge, requiring the instructor to submit a request for the software to be installed. Another challenge to using AntConc is that it only has immediate access to corpora in its online repository or available on the user's computer. If an instructor or learner wants access to big general English corpora, such as the British National Corpus (BNC)[4] or the Corpus of Contemporary American English (COCA),[5] or niche corpora readily available on the web, AntConc will not be a good choice. Web services such as CQPWeb,[6] English-Corpora.org,[7] and Sketch Engine[8] are popular because of the wide range of big corpora that they give ready access to. One other important limitation of AntConc is that it cannot be extended by users themselves. If a particular tool or statistical measure is not part of AntConc, the only way for instructors and learners to add that tool or measure is by submitting a request to the developer (me), e.g., as a member of the AntLab Patreon community[9] or in the AntConc discussion group.[10] I do regularly update AntConc based on user requests and feedback, but instructors who are considering adding it to their courses should plan on using only the features that it currently offers.

Crosthwaite: What would you like to implement in AntConc V5?
How would this align with new developments, such as ChatGPT?

In software development, a major version update signals a break in compatibility with the current version, so major updates tend to be rare. For this reason, I have not even started thinking about how I would break compatibility when going from AntConc 4.x to 5.x. Much more common are feature updates, which add new functionality to the system but do not break the way it currently works. Since the release of AntConc 4.0 in December 2021, I have already released two such updates, with AntConc 4.1 released in July 2022 and AntConc 4.2 in August 2023. AntConc 4.3 is already in the works. In this new release, one of the major feature updates will be adding direct access to various commercial and open-source large language models (LLMs), including ChatGPT.[11] I see LLMs playing a pivotal role in corpus linguistics research, as they can provide deep and nuanced views of language use in a variety of domains and registers without the need for users to load specific corpora. In essence, LLMs encapsulate a huge range of massive corpora within the embeddings that they are built on, and these corpora can be probed by the user through prompting in natural language.

However, one problem with LLMs is that their outputs can be difficult to evaluate because it is not always clear what data informs the responses. We also know that LLMs can 'hallucinate' (i.e., generate erroneous outputs that may look authentic but are in fact completely fictitious). As a comical example, the following are the responses of various open-source LLMs to the question 'Who is the developer of AntConc?':

vicuna-13b (https://vicuna.lmsys.org/)
AntConc is a corpus analysis tool developed by Mark Davies. It is designed to help researchers in natural language processing (NLP) analyze large corpora of text data. AntConc is a free, open-source tool that is widely used in the NLP community. It is available for Windows, Mac, and Linux.

koala (https://bair.berkeley.edu/blog/2023/04/03/koala/)
AntConc is a tool for corpus linguistics developed by the University of Liverpool. It was created by a team of linguists led by Professor John Sinclair. AntConc is designed to help researchers analyze and process large amounts of text data, and is often used in the field of corpus linguistics, which involves studying the structure, meaning, and use of language in a particular context.

alpaca-13b (https://crfm.stanford.edu/2023/03/13/alpaca.html)
The developer of AntConc is David Robinson, a professor in the Department of Linguistics at the University of Texas at Austin.

llama-13b (https://arxiv.org/abs/2302.13971)
AntConc is developed by Ann Taylor, a Professor of Linguistics at the University of British Columbia. Ann Taylor is a leading researcher in the field of corpus linguistics and computational linguistics.

So, to partially address these problems, I plan to also connect LLMs to the output of the various core tools of AntConc, allowing users to probe traditional corpus results and gain new insights on them. For example, a learner might choose to generate a frequency-ordered word list for a target corpus.
Then, through LLM integration, they can ask AntConc to offer insights on that word list, for example, giving them a set of 'themes' that the words cover, or providing a list of words with positive or negative sentiment. Similarly, the output of the KWIC tool can be sent to the LLM, allowing learners, instructors, and perhaps most importantly, researchers to query the results and discover completely new patterns of usage. The possibilities are quite endless. Of course, the risk of 'hallucinations' is still present, but this phenomenon can be greatly reduced through prompt engineering,[12] and the insights offered by the LLM can easily be checked because the user will have a direct link back to the target corpus.

Reference list

Anthony, L. (2018). Visualization in corpus-based discourse studies. In C. Taylor & A. Marchi (Eds.), Corpus approaches to discourse: A critical review. Routledge.

Anthony, L. (2022). What can corpus software do? In A. O'Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics. Routledge.
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at . . .: Lexical bundles in university teaching and textbooks. Applied Linguistics, 25(3), 371–405.
Potts, A., & Baker, P. (2012). Does semantic tagging identify cultural change in British and American English? International Journal of Corpus Linguistics, 17(3), 295–324.
Römer, U. (2010). Establishing the phraseological profile of a text type: The construction of meaning in academic book reviews. English Text Construction, 3(1), 95–119.
Swales, J. M. (1990). Genre analysis: English in academic and research settings. Cambridge University Press.

Website links

1. AntConc. www.laurenceanthony.net/software/AntConc/
2. WordSmith Tools. www.lexically.net/wordsmith/
3. MonoConc Pro. https://monoconc.com/
4. British National Corpus. www.natcorp.ox.ac.uk/
5. Corpus of Contemporary American English. www.english-corpora.org/coca/
6. CQPWeb. https://cqpweb.lancs.ac.uk/
7. English-Corpora.org. www.english-corpora.org/corpora.asp
8. Sketch Engine. www.sketchengine.eu/
9. AntLab Patreon Community. www.patreon.com/antlab
10. AntConc Discussion Group. https://groups.google.com/g/AntConc
11. ChatGPT. https://chat.openai.com/
12. Prompt Engineering Guide. www.promptingguide.ai/

2.1 Addressing the challenges of data-driven learning through corpus-tool design

Lee McCallum

Data-driven learning (DDL) involves students learning how to create, search, analyse, and interpret general and specialised corpora (Anthony, 2016). A fundamental tenet of DDL is that learners act as detectives, investigating and interpreting language patterns with teacher support (Johns, 1991). Although DDL is becoming part of a language teacher's repertoire, its success can be hindered by many practical challenges, and the detective role can therefore be fraught with difficulties in practice. Perhaps the most striking difficulty lies in successfully training learners to notice the intended language patterns. In my own experience, learners are often overwhelmed by the amount of language data captured in large general corpora. The same applies to supposedly smaller, more specialised corpora of disciplinary discourse, which are increasing in size

because of the computational ease with which large collections of texts can be amassed today. In the latest version of AntConc (Version 4.0), Anthony addresses the challenge of noticing appropriate language patterns. Enhanced noticing is achieved through AntConc's design and the improved functionality of its features. Early versions of AntConc (e.g., Version 1.0) contained only a simple concordancer. Versions 2.0–2.6 introduced an enhanced concordancer, plot, cluster/n-gram, and keyword search functions. Versions 3.0–3.5 focused on enhanced operating-system support (e.g., versions were produced to run on Linux and Microsoft Windows) and introduced a focus on collocates in addition to the existing concordancer, plot, cluster/n-gram, and keyword search functions. When teaching academic voice, I have used these functions to explain nuanced meaning in reporting verbs and their impact on argument strength. The latest version of AntConc enhances functionality with keywords. The design of the interface promotes explicit noticing of patterns because there is now a 'split' window for studying KWIC (KeyWord in Context) results, which shows the words to the right and left of the search word (node), with the node itself appearing centrally on the screen in a highlighted colour. This allows learners to 'zoom in' on the surrounding contexts of the node and to consider key grammatical structures which may precede or follow it. It also allows learners to consider different semantic uses of a particular node. The KWIC feature further promotes noticing via its sorting and filtering options. For example, learners can sort patterns using the 'sort' function according to frequency (most frequent patterns appear first). Learners can also sort according to 'range', meaning they can consider the number of texts a node appears in. This is useful for drawing their attention to, or away from, idiosyncratic uses in a single text.
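The split KWIC display described above is straightforward to emulate, which can help teachers show learners what a concordancer is doing under the hood. The sketch below is a toy illustration with an invented mini-corpus, not AntConc's code: it centres the node word and sorts the lines on the first word of the right context, one of the sort options discussed:

```python
import re

def kwic(texts, node, width=25):
    """Minimal KeyWord-in-Context display: centre the node, show `width`
    characters of left and right context, and sort on the first word to
    the right of the node (cf. a 1R concordance sort)."""
    lines = []
    for text in texts:
        for m in re.finditer(rf"\b{re.escape(node)}\b", text, re.I):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            lines.append(((right.split() or [""])[0].lower(), left, m.group(), right))
    lines.sort()  # group identical right-hand patterns together
    return [f"{l:>{width}} | {n} | {r}" for _, l, n, r in lines]

corpus = ["The committee will report its findings.",
          "Newspapers report that the committee has met.",
          "Each group must report back to the class."]
for line in kwic(corpus, "report"):
    print(line)
```

Sorting on the word after the node makes recurrent patterns (e.g., report + *back*/*that*) line up vertically, which is exactly the noticing effect the split window is designed to produce.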
Overall, the latest version of AntConc has a lot of potential to facilitate DDL in the future. However, I would also reinforce to teachers using AntConc that they need to maximise its features by carefully selecting appropriate corpora and language patterns to focus on. Teachers also need to encourage students to critique the language patterns that are illuminated, providing learners with critical questions that help them focus on patterns and raise their awareness of how language functions.

Reference list

Anthony, L. (2016). Introducing corpora and corpus tools into the technical writing classroom through Data-Driven Learning (DDL). In J. Flowerdew & T. Costley (Eds.), Discipline-specific writing (pp. 176–194). Routledge.
Johns, T. F. (1991). Should you be persuaded: Two samples of data-driven learning materials. ELR Journal, 4, 1–16.

2.2 Improving participants' experiences of DDL through AntConc 4

Yiyang Yang (Camellia)

As an early-career researcher, I personally find Laurence Anthony's AntConc 4 very useful for my research participants and, potentially, for my students, who are EFL learners in China. My participants, being first-time users of corpora, often face difficulties in analyzing corpus data and in applying the detected language patterns to their own contexts. While unfamiliarity and a lack of corpus query skills are contributing factors, another significant factor is the large cognitive load required (Gilmore, 2009). In follow-up interviews, participants also mentioned feeling overwhelmed by the volume of corpus output after a search. The search results (i.e., concordance lines) were displayed randomly, making it challenging to detect target language patterns. As a DDL researcher and foreign language teacher, I find it imperative to address these issues and find a way to help learners extract meaningful information from corpora. Before learning about the new features of AntConc 4, I had considered introducing more hands-on corpus query activities and providing step-by-step guidance for analyzing corpus output. Such measures would hopefully familiarize learners with the process and enhance their analytical skills, enabling them to sift through search results more effectively. While these activities and training are still valuable in writing classrooms, AntConc 4 offers solutions to these enduring challenges from a technical standpoint. At the word level, AntConc 4 allows users to search for target expressions ranked by hit frequency. This enables learners to identify the most prominent linguistic patterns of word usage based on frequency.
This feature is a revolutionary change, helping learners focus on the most evident patterns amid the vast corpus results, streamlining the corpus-based error correction process and quickly acclimating learners to this new approach, all while reducing their cognitive load. At the collocation level, AntConc 4’s cluster tool displays multi-word clusters along with their hit numbers, assisting learners in choosing useful cluster and chunk patterns based on frequency. Despite this useful functionality, one participant in my group noted that the context length provided by the KWIC (KeyWord in Context) from Sketch Engine was often too brief. For instance, when searching for the phrase “on the one hand”, the counterpart phrase “on the other hand” was sometimes missing, leading to confusion. Access to the full context of the collected text in AntConc 4 addresses this issue, reducing misunderstandings during corpus-based correction activities. Overall, the adoption of AntConc 4 and its innovative features promises to significantly improve the application of DDL in language classrooms.
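The frequency-ranked cluster display described above can also be illustrated in code. The sketch below is not AntConc's implementation — the example sentences are invented — but it shows the underlying idea: count the multi-word clusters containing a node word and rank them by hit frequency:

```python
from collections import Counter
import re

def clusters(texts, node, size=3):
    """Count `size`-word clusters containing the node word, ranked by
    frequency (cf. the hit numbers in a Cluster tool display)."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for i in range(len(words) - size + 1):
            gram = words[i:i + size]
            if node in gram:
                counts[" ".join(gram)] += 1
    return counts.most_common()

sentences = ["On the one hand, prices rose.",
             "On the one hand, demand fell.",
             "Hand in hand with inflation came unrest."]
print(clusters(sentences, "hand"))  # 'the one hand' is the top cluster
```

Ranking by frequency is what lets learners pick out recurrent chunks such as "the one hand" while idiosyncratic combinations sink to the bottom of the list.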

Reference list

Gilmore, A. (2009). Using online corpora to develop students' writing skills. ELT Journal, 63(4), 363–372. https://doi.org/10.1093/elt/ccn056

2.3 More than just tools
Corpus linguistics as a platform to promote learning and growth

Baramee Kheovichai

Data-driven learning is an approach in which learners play the role of researchers (Johns, 1991). Learners explore language data, analyze it, and arrive at the patterns and principles underlying language use. In so doing, learners play an active role in the learning process. I am an English lecturer teaching corpus linguistics to English majors. In my corpus linguistics classes, students have sometimes felt disengaged when attending practical corpus linguistics workshops, feeling that the workshops focused too much on technical aspects and not enough on English language learning. Anthony's work has sparked new ideas for engaging my students, as the new version of AntConc 4 can better facilitate data-driven learning. The design of the new AntConc 4 provides learners with tools to better manipulate data output, making patterns easier to observe. What is more, AntConc 4 comes with two preinstalled corpora representing general English, helping students acquire everyday vocabulary. To implement this idea in my corpus linguistics classes, instead of simply giving workshops on how to use different functions of the software, I assign students language problems to be solved using the software. By way of illustration, in one session, I asked students to compare the words 'refuse' and 'deny', which tend to be confusing as they have a similar translation in Thai, the students' native language. The students then used AntConc to explore patterns through concordance lines and make observations about the meaning and grammatical structures associated with these words. As the concordance lines can be arranged according to the frequency of the co-occurring words, the patterns and collocations are easier to observe. Therefore, students not only learn how to use the KWIC function to generate concordance lines but also the differences between 'refuse' and 'deny'.
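The 'refuse'/'deny' comparison task can be mirrored programmatically, which may help students see what a frequency-sorted concordance view surfaces. The sketch below uses toy sentences invented for illustration (not real corpus data) and simply counts the word immediately following each verb:

```python
from collections import Counter
import re

def right_collocates(texts, node):
    """Count the word immediately following each occurrence of `node`,
    mimicking what a concordance sorted on the first right-hand word
    makes easy to see."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for i, w in enumerate(words[:-1]):
            if w == node:
                counts[words[i + 1]] += 1
    return counts

corpus = ["She refused to answer.", "They refused to comply.",
          "He denied the allegation.", "She denied any involvement."]
print(right_collocates(corpus, "refused"))  # 'to' follows 'refused'
print(right_collocates(corpus, "denied"))   # 'the', 'any' follow 'denied'
```

Even on this tiny invented sample, the counts expose the grammatical contrast the students discovered: refuse + to-infinitive versus deny + noun phrase.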
Consequently, students are more engaged with the learning process. Corpus linguistics can feel theoretical and technical at times, especially for language learners. Yet, it does not have to be. When data-driven learning is incorporated into corpus linguistics workshops, students immediately see the practicality

of corpus use and how they benefit from the experience. Thus, incorporating data-driven learning in my class motivates students not only to learn about corpus analysis software but also about the English language. Consequently, it is crucial to see corpus linguistics as a platform for students to learn, explore, and grow. By conducting data-driven learning in this manner, corpus linguistics can be more engaging, meaningful, and fulfilling for both teachers and students.

Reference list

Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. In T. Johns & P. King (Eds.), Classroom concordancing. English Language Research Journal, 4, 1–16.

2.4 Utilizing AntConc software on DDL practice to teach pharmaceutical vocabulary

Muhammad Rudy

The instruction of English for pharmacy students within the Indonesian setting, particularly at Malahayati University, is heavily centered on developing reading proficiency. The students are directed toward comprehending English texts, such as abstracts. Nation and Waring (1997) indicate that mastery of an English text requires a student to comprehend at least 95% of the words within the text, which equates to encountering no more than one unknown word in every twenty running words. Unfortunately, drawing from my teaching experience, my students encounter difficulties in reading abstracts, potentially attributable to limited lexical familiarity. This is a prevalent problem among university students in Indonesia (Hartono & Prima, 2021). During data-driven learning (DDL) practice, I instructed my students to copy abstracts from academic pharmaceutical journals. Each student contributed an abstract from a targeted article. The abstracts were saved in .txt format and uploaded to cloud-based storage. A total of 83 abstracts were collected. Once the collection was complete, the students could download the abstracts as the corpus source. Following the construction of the corpus, the students collaborated in groups to construct their word lists using the AntConc 4.0 software (Anthony, 2022). Throughout the process, students were guided to compare word lists: a comparative analysis was conducted between the students' word lists and the more comprehensive list available in AntConc. Ultimately, the students compiled a word list that closely mirrored the vocabulary found in pharmaceutical text readings. The students were then guided to utilize the KeyWord in Context (KWIC) feature of the AntConc software. This enabled them to autonomously craft a ready-to-use

FIGURE 2.4.1 Three highlighted Polypharmacy definitions taken from AntConc

dictionary from the KWIC results. An interesting observation was made during this process, especially regarding the term "Polypharmacy." The students found that the definition provided in the KWIC was clearer than the one from Google Translate. Within the KWIC, the term "Polypharmacy" appeared with three distinct definitions: the regular intake of five or more medications; the simultaneous prescription of two or more medications; and an individual's use of multiple medications, as shown in Figure 2.4.1. In contrast, Google Translate defined "polypharmacy" as the simultaneous use of multiple drugs to treat a single ailment or condition. The students rarely use the term "simultaneous" as presented in Google Translate. They noted that the KWIC output offered a variety of word definitions and enhanced their understanding of both meaning and application. As Lai and Chen (2015) highlight in their study of Taiwanese EFL students, learners tend to use a concordance tool to clarify word meaning and usage rather than an online bilingual dictionary. Therefore, in the Indonesian context, particularly at Malahayati University, while pharmacy students encounter challenges in reading English abstracts due to lexical unfamiliarity, the use of DDL and tools like AntConc's KWIC feature can significantly enhance their comprehension, as demonstrated by a clearer understanding of terms like "Polypharmacy" compared to standard online translations.

Reference list

Anthony, L. (2022). AntConc (Version 4.2.0) [Computer software]. Waseda University. www.laurenceanthony.net/software
Hartono, D. A., & Prima, S. A. B. (2021). The correlation between Indonesian university students' receptive vocabulary knowledge and their reading comprehension level. Indonesian Journal of Applied Linguistics, 11(1), 21–29.

Lai, S.-L., & Chen, H.-J. H. (2015). Dictionaries vs concordancers: Actual practice of the two different tools in EFL writing. Computer Assisted Language Learning, 28(4), 341–363. https://doi.org/10.1080/09588221.2013.839567
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage and word lists. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp. 6–19). Cambridge University Press.

3 EXPLORING MULTIMODAL CORPORA IN THE CLASSROOM FROM A MULTIDIMENSIONAL PERSPECTIVE

Tony Berber Sardinha

Introduction

Although language use commonly involves various modes of communication, the prevailing focus in corpus linguistics has been a single mode: the written language. This bias is, to some extent, a consequence of the technical limitation of character-based computer input, which restricts the incorporation of non-character-based sources like speech and images. This limitation arises from the need to transform such multimodal content into textual format to enable corpus analysis. Historically, the manual transcription of speech and the manual annotation of visual features have been the sole methods available, resulting in non-written corpora being smaller and rarer than their written counterparts. This scenario stands in contrast to the landscape of English language teaching, where a tradition of resorting to multiple modes of communication has long existed. For instance, approaches such as the audio-visual method and the communicative approach have relied on videos, still images, audio recordings, and gestures – in conjunction with textual sources – as materials for language instruction. Yet, corpus-based approaches to language teaching have, in turn, been primarily text-based because of their reliance on written corpora. Recent advances in technology have opened up new opportunities for multimodal corpus development (Todd et al., 2023), possibly bridging the gap between the classroom and multimodal corpora. Speech-to-text technology, for instance, has made the automatic transcription of videos and ongoing talk possible, thereby facilitating the creation of spoken corpora. In addition, computer vision technology, which deconstructs static and dynamic images into discrete elements and subsequently assigns content labels to these components, has made it possible for users

DOI: 10.4324/9781003413301-3

to retrieve visual content from sources such as hand-held devices and the Web through text-based queries. The chapter reports on the collection and analysis of a corpus of social media posts (Berber Sardinha, 2022a, 2022c) from the conservative social media outlet GETTR (http://gettr.com), centering on messages related to climate change. The corpus consists of the textual messages and their corresponding still images. The corpus was analyzed using multimodal multidimensional analysis (Berber Sardinha, 2022b), which comprises the analysis of the text and the visual components separately, with the goal of unveiling the major discourses emerging from the texts and the major visual characteristics present in the images, followed by the joint analysis of both modes. This particular social network was chosen because the data come from a project with a goal of determining the underlying ideologies sustaining right-wing communication around climate change. Applying the results of the corpus analysis outlined in this chapter can help develop critical thinking skills around the topic through multimodal work.

Collecting a multimodal corpus of social media

In order to collect a multimodal corpus from social media effectively, specialized software is essential. This software facilitates the retrieval of a substantial volume of posts, accomplished either through “scraping” online sources or by establishing connections to these sources via API (application programming interface) access. “Scraping” social media platforms refers to using automated scripts to extract data such as posts, comments, and images from the platform web pages. This method involves simulating human browsing behavior to collect information. On the other hand, API access involves using the interface provided by the platform to request and retrieve data in a structured manner. Data collection tools are continuously being introduced or evolving to adapt to the dynamic landscape of social media. As a result, the most reliable source for discovering available tools, as well as learning their installation and utilization, is the internet – in particular, platforms dedicated to the data science community, such as GitHub. Social media platforms might present obstacles to batch downloading posts, ranging from technical hurdles to paywalls. Legal limitations governing access to social media content also exist, and researchers should familiarize themselves with the laws governing access. The current corpus was collected using the Python tool gogettr. We ran searches using different search terms, such as climate change, global warming, and AGW (short for anthropogenic global warming) and saved the output from all of these searches to a single JSON (JavaScript Object Notation) file, which contained the text posted and its related metadata in a structured format. Specialized scripts were developed to clean up the JSON file, remove duplicates, and retain only the relevant text, image, and metadata information. The images were downloaded using the curl tool, which saved each image as a separate file.
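The clean-up step is worth illustrating concretely. The field names below are hypothetical — GETTR's actual JSON schema differs, and the chapter's own scripts are not reproduced here — but the sketch shows the general shape of such a script: parse the saved JSON, drop duplicate posts by ID, and keep only the text and image information:

```python
import json

def clean_posts(raw_json):
    """Deduplicate posts by ID and keep only the fields needed for the
    corpus. Field names ('id', 'text', 'image_url') are invented for
    illustration, not GETTR's real schema."""
    seen, cleaned = set(), []
    for post in json.loads(raw_json):
        pid = post.get("id")
        if pid in seen or not post.get("text"):
            continue  # skip duplicates and empty posts
        seen.add(pid)
        cleaned.append({"id": pid,
                        "text": post["text"].strip(),
                        "image_url": post.get("image_url")})
    return cleaned

# A tiny synthetic JSON dump with one duplicate post
raw = json.dumps([
    {"id": "1", "text": "Climate change is a hoax ", "image_url": "https://example.com/a.jpg"},
    {"id": "1", "text": "Climate change is a hoax ", "image_url": "https://example.com/a.jpg"},
    {"id": "2", "text": "AGW alarmism again", "image_url": None},
])
posts = clean_posts(raw)
print(len(posts))  # duplicate removed -> 2
```

The retained image URLs can then be passed to a downloader such as curl, as described above, with each image saved as a separate file.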

Annotating a multimodal corpus

The text messages, images, and emojis were annotated separately. The messages were tagged for part-of-speech and lemmatized with TreeTagger (Schmid, 1994) to enable the retrieval of the lemma forms of content words (nouns, adjectives, and verbs). The images were annotated using Google Cloud Vision’s labels option, which provided text labels for the visual constituents of the images (Berber Sardinha, 2022b; Christiansen et al., 2020; Collins & Baker, 2023). The emojis were converted to a text format through the demoji Python library, which provided the text descriptors for each emoji (Berber Sardinha, 2022a). For instance, the tool converted the emoji 😃 to “grinning face with big eyes,” which is its official Unicode CLDR (Common Locale Data Repository) short name. Google Cloud Vision can be accessed through both a web demo (https://cloud.google.com/vision) and an API. Because the web demo requires manual input, API access is generally preferred for corpus construction, as input and retrieval take place through programming. Both options return roughly the same output; therefore, researchers can use the web demo to gain a general overview of the tool’s capabilities. Figure 3.1 exemplifies the application of the web demo in annotating an image sourced from the corpus. Once the image is uploaded by the user, it returns a series of annotations. Under the Labels tab, the tool identifies some of

FIGURE 3.1 Output screen for the Google Vision web demo
the individual visual components of the picture, such as chin, jaw, and suit. Other tabs provide different information. For instance, the Objects tab identifies specific features – namely, person, tie, and glasses. Meanwhile, the Safe Search tab evaluates content appropriateness, manipulation, and safety, categorizing the image as “very likely” to qualify as a “spoof.”

Analyzing a multimodal corpus from a multidimensional perspective

As previously mentioned, the analysis was conducted utilizing a multimodal multidimensional analysis (Berber Sardinha, 2022b), which comprises a lexical multidimensional analysis (Berber Sardinha & Fitzsimmons-Doolan, 2024), a visual multidimensional analysis (Berber Sardinha et al., 2023), and a canonical correlation analysis (Delfino et al., 2023; Kauffmann & Berber Sardinha, 2021; Mayer, 2018). These are extensions of multidimensional analysis, introduced by Biber (1988) with the aim of modeling register variation through corpora. Unlike a regular multidimensional analysis, which is concerned with the functional description of register variation, a lexical multidimensional analysis, such as the one reported here, is centered on the description of discourses, ideologies, themes, and other lexis-based constructs in corpora (Berber Sardinha, 2017, 2021, 2023; Berber Sardinha & Fitzsimmons-Doolan, 2024; Clarke et al., 2021, 2022; Fitzsimmons-Doolan, 2014, 2019, 2023). A visual multidimensional analysis seeks to determine the predominant image patterns in visual datasets by considering the correlated sets of constituents shared in the images. A multimodal multidimensional analysis concerns itself with matching the lexical and visual dimensions derived from these respective analyses into groups of dimensions that share similar patterns of covariation. The multimodal dimensions correspond to the multimodal choices put into play to convey specific discourses. A statistical procedure known as canonical correlation is used to detect the dimensions from one particular mode that align with the dimensions from another mode. This process allows for the identification of the systematic relationships among different modes of communication in a multimodal corpus. Table 3.1 presents the dimensions derived from the lexical and visual analyses, while the resulting multimodal dimensions are shown in Table 3.2. 
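Canonical correlation itself is standard enough to sketch. The code below is a generic textbook implementation run on synthetic score matrices — it is not the authors' actual analysis pipeline, and the data are random stand-ins for the lexical and visual dimension scores — but it shows what the procedure detects: the correlations between the best-aligned linear combinations of the two sets of dimensions:

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between column sets X (n x p) and Y (n x q),
    computed as the singular values of the whitened cross-covariance matrix."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Sxx = X.T @ X / (n - 1)
    Syy = Y.T @ Y / (n - 1)
    Sxy = X.T @ Y / (n - 1)
    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)  # sorted high to low

rng = np.random.default_rng(0)
lexical = rng.normal(size=(200, 5))   # stand-in lexical dimension scores
visual = rng.normal(size=(200, 4))    # stand-in visual dimension scores
visual[:, 0] = lexical[:, 0] + 0.1 * visual[:, 0]  # one strongly aligned pair
r = canonical_correlations(lexical, visual)
print(np.round(r, 2))  # first canonical correlation is close to 1
```

A high first canonical correlation, as contrived here, is what signals that a dimension from one mode systematically co-varies with a dimension from the other — the basis for the multimodal dimensions reported in the chapter.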
The design of the corpus analyzed in this study is outlined in Tables 3.3 and 3.4. Due to downloading problems, a subset of images (N=63) was not successfully retrieved, resulting in a smaller number of files in the image corpus. The analysis returned four multimodal dimensions, but within the scope of this chapter, we focus on only two of them: multimodal dimensions 1 and 2. To narrow the focus further, only a few of the lexical and visual dimensions therein will be discussed – namely, the positive pole of lexical dimension 2 (a pole here signifies one of the “ends” of a dimension, with no inherent value judgment attached to the term), the negative pole of lexical dimension 3, the negative pole of visual dimension 6, and the positive pole of visual dimension 1 (see Table 3.2).
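For readers who wish to experiment with the statistical side of the method, the dimension-matching step can be sketched in a few lines of code. This is a minimal illustration only, not the study's actual pipeline: the dimension scores are randomly generated, and the array sizes and variable names are assumptions. Canonical correlations are computed here via the standard QR-plus-SVD route, where the singular values of the whitened cross-product are the canonical correlations between the two blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-text dimension scores (sizes and values invented):
# 5 lexical and 6 visual dimension scores for 200 text-image pairs.
X = rng.standard_normal((200, 5))   # lexical dimension scores
Y = rng.standard_normal((200, 6))   # visual dimension scores

# Canonical correlation via QR + SVD: the singular values of
# qx.T @ qy are the canonical correlations between the two blocks.
qx, _ = np.linalg.qr(X - X.mean(axis=0))
qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
u, s, vt = np.linalg.svd(qx.T @ qy)

# Four canonical dimensions, mirroring the four multimodal
# dimensions the analysis returned (Table 3.2).
canonical_corrs = s[:4]

# "Canonical loadings": correlations of the original lexical
# dimension scores with the canonical variates, analogous to the
# loadings reported in Table 3.2.
variates = qx @ u[:, :4]
lex_loadings = np.corrcoef(X.T, variates.T)[:5, 5:]
print(canonical_corrs.round(3))
print(lex_loadings.shape)
```

In the real analysis, the rows would be the texts and images of the corpus and the columns their scores on the lexical and visual dimensions; the loadings then indicate which dimensions from each mode covary on each multimodal dimension.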

Exploring multimodal corpora in the classroom


TABLE 3.1 Lexical and Visual Dimensions

Lexical Dimensions
1 Climate Change Manipulation Narrative vs. Climate Change’s Political Motives and CCP Allegations
2 Framing Climate Activism as Alarmism vs. Critique of Progressive Measures and Corporate Power
3 Climate Collusion Critique (CCP) vs. Anti-CCP Sentiment and Climate Activism Hypocrisy, Contradiction, and Inconsistency
4 Amplified Anti-CCP Media Influence
5 Business and Climate Advocacy Doubt vs. Rejecting Human Influence and Corporate Responsibility

Visual Dimensions
1 Natural Environments, Gatherings, and Events vs. Media Interactions: Critical Conferences, Television Anchors, and Guest Appearances
2 Television Clips, Web Content, Political Conferences, Property, and Electric Cars vs. Sarcastic Memes, Animated Cartoons, and Captured Screenshots Mocking the Woke Movement and Questioning Climate Change Advocacy
3 Scenes of Winter, Ice, and Snow vs. Focus on the Facial Expressions of Political Figures, Activists, and Fictional Characters
4 Graphs, Newspaper Articles, Textual Content, Battlefields, and Scenery
5 Aircraft, Trucks, Automobiles, Snowfall, Individuals, and Pets vs. American Television News and Guest Commentaries
6 Influencers, Radio Hosts, Elite Cars, Private Jets, Yachts vs. Experts, Government and Party Officials, Military Officers, General Public, Graphs, Memes, Text Samples

TABLE 3.2 Multimodal Dimensions

Multimodal Dimension 1
  Lexical dimensions (canonical loadings): 2 (.5263), 3 (.3601), 4 (–.6354)
  Visual dimensions (canonical loadings): 2 (–.4502), 3 (.3754), 6 (–.6715)

Multimodal Dimension 2
  Lexical dimensions (canonical loadings): 2 (.4472), 3 (–.5666), 5 (–.7941)
  Visual dimensions (canonical loadings): 1 (.5423), 2 (–.3417), 4 (.7004), 5 (.7037), 6 (.5498)


Tony Berber Sardinha

TABLE 3.3 Written Corpus

Year    Texts   Words     Mean Length   Std. Deviation
2021       63     1,278          20.3             16.2
2022    1,853    40,433          21.8             17.3
2023    3,193    79,529          24.9             18.9
Total   5,109   121,240          23.7             18.3

TABLE 3.4 Visual Corpus

Year    Images   Visual Labels   Mean Length   Std. Deviation
2021        62           1,447          23.3              8.5
2022     1,827          40,390          22.1              9.7
2023     3,157          67,759          21.5              9.6
Total    5,046         109,596          21.7              9.7
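The per-year summaries in Tables 3.3 and 3.4 (mean length is words, or visual labels, per text, with the standard deviation) can be reproduced from raw per-text token counts. A minimal sketch with invented counts, purely to show the computation:

```python
from statistics import mean, stdev

# Invented per-text token counts grouped by year; the real corpus
# figures are the ones reported in Tables 3.3 and 3.4.
lengths_by_year = {
    2021: [12, 25, 31, 18, 22],
    2022: [20, 17, 35, 28],
}

# One summary row per year: number of texts, total words,
# mean length, and standard deviation of length.
for year, lengths in lengths_by_year.items():
    print(year, "texts:", len(lengths), "words:", sum(lengths),
          "mean:", round(mean(lengths), 1), "sd:", round(stdev(lengths), 1))
```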

Let’s begin with multimodal dimension 1. Labeled “Framing Climate Activism as Alarmism,” the positive pole of lexical dimension 2 encapsulates a recurring motif within climate change skeptic discussions: the depiction of climate change as an exaggerated scare tactic that is full of inconsistencies. This perspective characterizes climate change warnings as a sensationalized narrative that predicts catastrophic consequences unless specific actions are pursued. Comprising a large set of lexical components (N=72), it mirrors an intricate web of discourse surrounding the issue. Table 3.5 presents the factorial pattern for this dimension, with items organized by factor weight in descending order. The lexical items ending with “_e” correspond to the text labels of emojis.

To illustrate this discourse, consider Example 1, which comprises 17 lexical items loading on this pole. By asserting that “PANIC SELLS,” the post employs a sensationalized approach to grab attention and garner support. The text further provides historical context, referencing a perspective from the early 20th century that paradoxically highlighted the potential benefits of global warming. This instance underscores the evolving and sometimes inconsistent narratives associated with climate change. In addition, the text cites a newspaper article that questions a warming trend, casting doubt on the urgency of climate change.

TABLE 3.5 Loadings for the Positive Pole of Lexical Dimension 2

Dimension: Lexical 2, positive pole
Loadings (15 of 72): temperature (.96), earth (.88), scientist (.85), warm (.84), weather (.82), warming (.82), dioxide (.81), atmosphere (.80), decade (.80), hot (.79), ice (.79), age (.79), science (.77), heat (.76), lie (.75)


Example 1

5/x Scientists seeking funding and journalists seeking an audience agree: PANIC SELLS. Here’s the continued list – an amazing chronology of the last 120 years of scare-mongering on climate
• 1938 – Global warming, caused by man heating the planet with carbon dioxide, “is likely to prove beneficial to mankind in several ways, besides the provision of heat and power.” – Quarterly Journal of the Royal Meteorological Society
• 1938 – “Experts puzzle over 20 year mercury rise . . . Chicago is in the front rank of thousands of cities throughout the world which have been affected by a mysterious trend toward warmer climate in the last two decades” – Chicago Tribune
#ClimateScam

The visual content associated with this particular post is the meme previously shown in Figure 3.1, which exemplifies the negative pole of visual dimension 6 (Experts, Government and Party Officials, Military Officers, General Public, Graphs, Memes, Text Samples; Table 3.6). It portrays a purported expert dismissing the gravity of climate change as “hysteria.” Moreover, the image critiques the younger generation, insinuating that climate change activism is merely a passing trend for the youth, who lack the experience to judge the accuracy of predictions. In so doing, it elevates older individuals as knowledgeable while belittling the younger generation as gullible and easily swayed by a fabricated narrative.

Moving on to multimodal dimension 2, we look at the co-occurrence of the negative pole of lexical dimension 3 (Anti-CCP Sentiment and Climate Activism Hypocrisy, Contradiction, and Inconsistency) with the positive pole of visual dimension 1 (Natural Environments, Gatherings, and Events). Tables 3.7 and 3.8 present the loadings for these dimensions.

TABLE 3.6 Loadings for the Negative Pole of Visual Dimension 6

Dimension: Visual 6, negative pole
Loadings (15 of 42): coat (–.64), uniform (–.62), line (–.61), darkness (–.60), television program (–.60), ice cap (–.59), official (–.59), plant (–.56), performing arts (–.56), sea (–.54), slope (–.54), shelf (–.52), snow (–.52), circle (–.48), grass (–.48)

TABLE 3.7 Loadings for the Negative Pole of Lexical Dimension 3

Dimension: Lexical 3, negative pole
Loadings (15 of 50): wayne (–.67), suicidal (–.62), climatique (–.62), nicole (–.60), incidence (–.59), knocked_out_face_e (–.59), palestine (–.55), wreak (–.55), havoc (–.55), thunberg (–.52), ridiculous (–.49), la (–.49), ohio (–.49), greta (–.48), icebound (–.47) . . .


TABLE 3.8 Loadings for the Positive Pole of Visual Dimension 1

Dimension: Visual 1, positive pole
Loadings (15 of 62): birds-eye view (1.20), mist (1.14), watercraft (1.10), estate (1.09), military uniform (1.08), sports sedan (1.07), operating system (1.05), coast (1.04), narrow-body aircraft (1.00), automotive design (.98), outer space (.97), luggage and bags (.96), naval architecture (.94), tread (.94), hearing (.92)

Example 2 illustrates lexical dimension 3 – in particular, allegations of inconsistency on the part of climate activism supporters. The emojis carry different meanings supported by specific discourses. The facepalming emoji functions as a reaction of disappointment and incredulity toward the terminology shift, thereby corresponding to the discourse of contradiction and inconsistency. Overall, the short ironic message and the accompanying emoji indicate an inclination to downplay or question the urgency of climate change concerns.

Example 2:

Global warming, oh sorry climate change. 🤦

The image shared with this post appears in Figure 3.2. It includes a textual excerpt accompanied by a picture of a naval vessel recognized as an icebreaker. The image suggests that the boat was trapped in extensive ice, an unusual scenario for that specific time of year, thereby contradicting the prevailing notion of the Arctic ice cap diminishing in size. The image includes four distinct visual elements that loaded on the visual dimension: naval architecture, ocean, sea ice, and watercraft.

FIGURE 3.2 Image accompanying the message in Example 2

Using lexical, visual, and multimodal dimensions in the classroom

Lexical and visual dimensions are powerful constructs that synthesize the major textual and visual components of communication. However, conducting a multidimensional analysis demands a substantial level of computational and statistical skill. Consequently, it is impractical to expect teachers to conduct a comprehensive multidimensional analysis independently, solely for instructional purposes. Nonetheless, it is possible to explore the results of existing lexical, visual, or multimodal multidimensional analyses for ideas around which one can design classroom activities. In this section, we present some possible entry points for turning dimensions such as those presented here into a valuable pedagogical tool.

First and foremost, when designing activities that involve students collecting and/or reading social media posts, it is crucial for teachers to exercise caution due to the potential for students to encounter offensive or inappropriate content. Furthermore, in the case of the data presented here, it is essential to approach these dimensions critically: students need to be introduced to the context in which the messages are produced and be made aware of the ideologies that sustain the conservative view of climate change expressed in the posts, which might conflict with their own perspectives. In addition, it is essential to introduce students to the dimensions in a straightforward manner, focusing on what these dimensions reveal without resorting to technical jargon. For example, teachers might refer to each lexical dimension as “how people talk about” particular ideas, visual dimensions as “the way things look,” and multimodal dimensions as “how messages and visuals go together.”


The activities outlined next primarily revolve around project-based work, where students typically gather texts or images, analyze them, hold discussions, interact with peers, and present their findings to the class. However, teachers are encouraged to think creatively about how to incorporate the dimensions into their teaching, adapting to their specific educational aims.

One entry point into the multimodal content of social media revolves around emoji characters. Although more than 3,000 emojis exist, the dimensions present a subset of the 30 most relevant ones in this particular context, thereby providing a much more manageable and focused selection for teaching compared to the whole emoji catalogue. Furthermore, the dimensions show that emoji use is patterned around the expression of particular discourses. For instance, the “pouting face” 😡 and “clown face” 🤡 emojis co-occur on the positive pole of dimension 2 as indexes of the “framing climate change as alarmism” discourse. Possible tasks include teachers prompting students to discuss the use of these emojis in social media and text message exchanges. Students can also compare the standard emoji meanings found in emoji dictionaries with the senses the emojis acquire in actual messages.

When using these particular lexical dimensions in activities, teachers should work with students from a critical perspective, exploring rhetorical strategies such as framing, emotional appeal, exaggeration, and the use of stereotypes. The teacher could ask students to categorize posts based on the dimensions. Conversely, students might search social media platforms for groups of dimension words. In addition, the lexical dimensions can be employed as gateways for vocabulary instruction. Students can use online corpora to investigate the usage patterns of these words, comparing and contrasting their usage in social media and in the online corpora.
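The categorization task can also be supported with a few lines of code that count dimension words and emojis in a set of posts. A rough sketch, with invented posts and a small word list drawn from Table 3.5; the tokenization and the emoji heuristic are deliberately simplistic illustrations, not part of the study's method:

```python
from collections import Counter

# A handful of invented posts standing in for a student-collected sample.
posts = [
    "The earth keeps warming, scientists say 😡",
    "Another ice age scare? What a 🤡 show",
    "Hot weather is just weather",
]

# A few items from the positive pole of lexical dimension 2 (Table 3.5).
dimension_words = {"temperature", "earth", "scientist", "warm",
                   "weather", "warming", "hot", "ice", "age", "heat"}

hits = Counter()
for post in posts:
    for token in post.lower().split():
        token = token.strip(".,?!")
        if token in dimension_words:
            hits[token] += 1
        # Rough emoji check: single characters above the emoji block.
        elif len(token) == 1 and ord(token) > 0x1F000:
            hits[token] += 1

# The most frequent items suggest which discourse a post indexes.
print(hits.most_common())
```

Students could then sort posts by how many dimension-2 items they contain, or inspect which emojis cluster with which word sets.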
Similarly, when dealing with the visual dimensions presented here, it is important to assume a critical perspective in the classroom by discussing the historical and cultural contexts that might shape the images, as well as the ideologies, beliefs, and events that could lead to the creation and sharing of such content. Students can use web search engines to hunt for images featuring the visual attributes associated with each specific dimension. They can then analyze the meanings present within these images and share with the class various instances demonstrating how these individual elements manifest in pictures, considering whether the images conform to or conflict with their own viewpoints.

Conclusion

This chapter has illustrated how it is possible to collect a multimodal corpus comprising text and images from social media and analyze it using the multidimensional analytical framework. The analysis offers insights into the discourses, ideologies, and visual content that shape social media conversations. The analysis herein outlined some of the dimensions corresponding to the major discourses associated with climate change, describing how conservative ideology


is systematically fleshed out using words and visual elements. Based on these discourses, suggestions for teaching activities have been offered, aimed at exploring the vocabulary, emojis, and visual elements comprising the dimensions from a critical perspective.

Looking ahead, although tools for tagging written and transcribed spoken corpora for linguistic features have long existed, tools for automatically tagging visual content have only recently been developed. Despite current limitations (Berber Sardinha, 2022b; Collins & Baker, 2023), artificial intelligence technology permits the annotation of visual files and the handling of the annotated files as a regular text-based corpus, thereby allowing for the collection of large corpora of visual material (Christiansen et al., 2020). In summary, the convergence of textual and visual analysis as shown here not only advances our understanding of complex social media discourses but also underscores the potential of emerging AI-driven tools for corpus linguistic research.

I want to thank the editor and discussants for their work and insights. I also want to thank the following funders: São Paulo Research Foundation (FAPESP), Grant #2022/05848-7; National Council for Scientific and Technological Development (CNPq), Grant #310140/2021-8.

Reference list

Berber Sardinha, T. (2017). Lexical priming and register variation. In M. Pace-Sigge & K. Patterson (Eds.), Lexical priming: Applications and advances (pp. 190–230). John Benjamins.
Berber Sardinha, T. (2021). Discourse of academia from a multi-dimensional perspective. In E. Friginal & J. Hardy (Eds.), The Routledge handbook of corpus approaches to discourse analysis (pp. 298–318). Routledge.
Berber Sardinha, T. (2022a). Corpus linguistics and the study of social media: A case study using multi-dimensional analysis. In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (2nd ed., pp. 656–674). Routledge.
Berber Sardinha, T. (2022b). Patterns of lexis, patterns of text, patterns of images. Talk presented at the Pathways to Textuality Meeting in Honor of Michael Hoey, Eastern Finland University.
Berber Sardinha, T. (2022c). A text typology of social media. Register Studies, 4(2), 138–170.
Berber Sardinha, T. (2023). Corpus linguistics and historiography: Finding the major discourses in the first 50 years of TESOL Quarterly. Journal of Research Design and Statistics in Linguistics and Communication Science, 7(1), 69–90.
Berber Sardinha, T., & Fitzsimmons-Doolan, S. (2024). Lexical multidimensional analysis. Cambridge University Press.
Berber Sardinha, T., Romeiro, Y. T. D., Marcondes, L. N. L., Whiteman, M., Silva, C. S. D., Gerciano, N. P., Kauffmann, C., Pinto, P. T., Dutra, D., Delfino, M. C. N., Bocorny, A., & Sarmento, S. (2023). The coronavirus infodemic: A multidimensional, discourse-based perspective. Corpus Linguistics 2023 International Conference, Lancaster University, UK.
Biber, D. (1988). Variation across speech and writing. Cambridge University Press.
Christiansen, A., Dance, W., & Wild, A. (2020). Constructing corpora from images and text: An introduction to visual constituent analysis. In S. Rüdiger & D. Dayter (Eds.), Corpus approaches to social media (pp. 149–174). John Benjamins.


Clarke, I., Brookes, G., & McEnery, T. (2022). Keywords through time: Tracking changes in press discourse of Islam. International Journal of Corpus Linguistics, 27(4), 399–427.
Clarke, I., McEnery, T., & Brookes, G. (2021). Multiple correspondence analysis, newspaper discourse and subregister: A case study of discourses of Islam in the British press. Register Studies, 3(1), 144–171.
Collins, L., & Baker, P. (2023). Creating and analysing a multimodal corpus of news texts with Google Cloud Vision’s automatic image tagger. Applied Corpus Linguistics, 3(1), 100043.
Delfino, M. C. N., Berber Sardinha, T., & Collentine, J. G. (2023). Dimensões de variação lexical e acústica na música popular em inglês: um estudo baseado em corpus [Lexical and acoustic dimensions of variation in popular music in English: A corpus-based study]. Cadernos de Estudos Linguísticos, 65, e023025. https://periodicos.sbu.unicamp.br/ojs/index.php/cel/article/view/8671801/33021
Fitzsimmons-Doolan, S. (2014). Using lexical variables to identify language ideologies in a policy corpus. Corpora, 9(1), 57–82.
Fitzsimmons-Doolan, S. (2019). Language ideologies of institutional language policy: Exploring variability by language policy register. Language Policy, 18(2), 169–189.
Fitzsimmons-Doolan, S. (2023). 21st century ideological discourses about US migrant education that transcend registers. Corpora, 18(2), 143–173.
Kauffmann, C., & Berber Sardinha, T. (2021). Brazilian Portuguese literary style. In E. Friginal & J. Hardy (Eds.), The Routledge handbook of corpus approaches to discourse analysis (pp. 354–375). Routledge.
Mayer, C. (2018). O que e como escrevemos na Web: Um estudo multidimensional de variação de registro em língua inglesa [What and how we write on the Web: A multidimensional study of register variation in English] [Unpublished PhD dissertation, Pontifical Catholic University of Sao Paulo].
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing, 12.
Todd, R. W., Vungthong, S., Trakulkasemsuk, W., Pojanapunya, P., & Towns, S. G. (2023). Love is blue: Designing and using a multimodal corpus analysis tool. Applied Corpus Linguistics, 3(1), 100042.

3.1 Navigating challenges and opportunities
Incorporating multimodal analysis into corpus linguistics for social media research

Danang Satria Nugraha

Incorporating multimodal analysis into corpus linguistics has gained prominence recently, especially in the study of social media data. Text, images, videos, and other nonverbal features are among the communication channels that can occur in a given setting and that multimodal analysis studies. This method


recognizes that social media platforms are highly participatory and offer users many options for expressing themselves beyond text-based communication. Corpus linguists working with social media data quickly grasp the importance of analyzing both the textual content and the accompanying visual or audiovisual elements, as these multimodal elements are essential for creating meaning and encouraging conversation in online settings.

One key advantage of multimodal corpus linguistics for social media research is its capacity to capture the complexity and variety of online communication. Social media platforms provide users with many tools and features for self-expression, including emoticons, images, memes, GIFs, hashtags, and even live videos. By including these multimodal aspects in their studies, researchers can better understand how meaning is created and conveyed within social media discourses. For a more thorough analysis of social media phenomena, corpus linguistics can be combined with visual and audiovisual data. For instance, taking into account both textual and visual clues may be helpful for analyzing the spread of misinformation or the creation of online communities, as in identifying the macro discourses in a corpus of social media (Twitter) posts around climate change from a multimodal perspective.

However, working with multimodal data in corpus linguistics presents several challenges. One of the main difficulties lies in developing appropriate methodologies and tools to collect, annotate, and analyze multimodal corpora effectively. Manual annotation of multimodal data can be time-consuming and resource-intensive, requiring researchers to adapt existing annotation schemes or develop new ones to account for the multimodal nature of the data. Additionally, interpreting and analyzing multimodal data necessitates interdisciplinary collaborations.
Corpus linguists may need to collaborate with experts in visual communication, semiotics, or data visualization to fully understand and interpret the multimodal aspects of social media texts. Such collaborations can enrich research and contribute to a more holistic understanding of social media communication.

In conclusion, following Berber Sardinha’s suggestion, going multimodal in corpus linguistics, particularly in social media research, offers exciting opportunities for studying online communication in a more comprehensive and nuanced way. By incorporating visual and audiovisual elements into the analysis, researchers (e.g., Feldman & Kinoshita, 2018; Nugraha, 2022) can gain valuable insights into the meaning-making processes and social dynamics within online communities. These studies took a distinctive approach to analyzing political discourse on social or mass media in particular political circumstances. By examining text and visual elements, researchers identified sentiment, framing, and engagement strategies used by different political groups. This work shows how multimodal analysis can uncover insights missed by focusing solely on text. However, working with multimodal data requires the development of appropriate methodologies and interdisciplinary collaborations to overcome the associated challenges.


Reference list

Feldman, O., & Kinoshita, K. (2018). Political communicators and control in political interviews in Japanese television: A comparative study and the effect of culture. In The psychology of political communicators: How politicians, culture, and the media construct and shape public discourse. https://doi.org/10.4324/9780429487897
Nugraha, D. S. (2022). On silent laughter: The political humour depicted in Indonesian cartoons. VELES (Voices of English Language Education Society), 6(1), 283–298. https://doi.org/10.29408/veles.v6i1.5022

3.2 A corpus-assisted multimodal discourse analysis approach to media literacy

Mariangela Picciuolo

The advent of Web 2.0 and the use of digital media have revolutionized our communication practices. We are now “endowed with the ability to consume, produce and disseminate media messages often involving multimodal representations” (Lim et al., 2011, p. 1), which incorporate not only text but also images and sound. Although young media users are often said to “be highly attuned to the functional aspects of the multimodal, multi-media environment . . . their ability to consume media in a critical and discerning fashion may be wanting” (Lim et al., 2011, p. 10). To give an example: a global survey conducted by Hickman et al. (2021, p. 863) demonstrated that children and young people are particularly prone to “climate anxiety” – a general feeling of “distress relating to the climate and ecological crises”. In this respect, several studies (León et al., 2022; McClellan & Davis, 2023) have shown that visual and linguistic communication of climate change (CC) on both traditional and digital media tends to systematically engender fear, apathy, and denial among young people, which in turn promote inaction and social disengagement. To address this issue, “it is critical that training in media literacy, and multimodal literacy be incorporated into the formal curriculum as early as possible, particularly since these children would already have grown up in an environment where multimodal representation is a given” (Lim et al., 2011, p. 11). One way to do this is to use corpus-based approaches to the analysis of multimodal media texts. Drawing upon Berber Sardinha’s methodology, I carried out a media literacy project in which students were encouraged to try out computer vision tools to compare and discuss media representations of CC.
One of the tasks was to build up a multimodal corpus of captioned images about CC, which students collected from the internet and then stored on a collaborative Google Sheet. Students annotated the images using the Google Cloud Vision API, while resorting to the UCREL semantic analysis system (USAS; Rayson, 2002) to semantically tag the caption text. Students reported their results on the same Google Sheet. Finally, they created a frequency list by quantifying the number of occurrences of each label, for both images and text.

FIGURE 3.2.1 A screenshot from the multimodal corpus analysis carried out by students during the media literacy project

This task allowed them to identify some typical multimodal patterns in the media representation of CC, as well as differences in the way CC is represented visually and verbally. For example, although the image in Figure 3.1 represents one of the prominent causes of CC, it does not show any local connection, nor even people, whose agency is not represented. Similarly, although the captioned text legitimised CC as a science through numbers, figures, and the authoritative voice of a scientist, its modal expressions, causal verbs, and future prepositions establish a context of vagueness that reinforces a “wait and see” approach (McClellan & Davis, 2023, p. 4) and further discourages immediate action. This study shows that adopting a corpus-based approach to the analysis of multimodal media texts can provide students with a practical tool to exercise their agency by challenging and reflecting upon multimodal media(ted) discourses.

Reference list

Hickman, C., Marks, E., Pihkala, P., Clayton, S., Lewandowski, R. E., Mayall, E. E., Wray, B., Mellor, C., & van Susteren, L. (2021). Climate anxiety in children and young people


and their beliefs about government responses to climate change: A global survey. Lancet Planet Health, 5(12), e863–e873. https://doi.org/10.1016/S2542-5196(21)00278-3
León, B., Negredo, S., & Erviti, M. C. (2022). Social engagement with climate change: Principles for effective visual representation on social media. Climate Policy, 22(8), 976–992. https://doi.org/10.1080/14693062.2022.2077292
Lim, S. S., Nekmat, E., & Nahar, S. N. (2011). The implications of multimodality for media literacy. In K. O’Halloran & B. Smith (Eds.), Multimodal studies – Exploring issues and domains (pp. 169–183). Routledge.
McClellan, E. D., & Davis, K. (2023). ‘Managing’ inaction and public disengagement with climate change: (Re)considering the role of climate change discourse in compulsory education. Javnost – The Public, 30(3), 356–376. https://doi.org/10.1080/13183222.2023.2198940
Rayson, P. (2002, February). USAS: UCREL semantic analysis system. Invited talk at Daito Bunka University, Tokyo, Japan. www.lancaster.ac.uk/staff/rayson/publications/tokyo2002/

3.3 Multimodal communication and corpus analysis in a language class

Mariam Orkodashvili

The term multimodality refers to the integrated use of different semiotic resources (e.g. language, image, sound and music) in texts and communicative events. A multimodal type of interaction involves using modalities to direct the receiver’s attention to a particular aspect of a multimodal display. Moreover, a message that is encoded simultaneously both visually and verbally is more likely to remain in the minds of more people (Ferrara & Hodge, 2018; Finnegan, 2014), and visual, pictorial, photo or graphic messages make verbal communication more effective and easier to recall (Paivio, 1971). The interaction-dominant nature of multimodal communication enables learners of a language to master speaking, reading, writing and listening skills more efficiently. Drawing corpora from verbal-visual discourses also helps learners enrich their vocabulary (Franconeri, 2021; Franconeri et al., 2021; Zacks & Tversky, 1999). In multimodal communication, the sum is much more than its parts. Communicants build their interactions using verbal and visual messages and signs (Perlman et al., 2015). This makes the narration of a story easier and faster. Using Dr Berber Sardinha’s method of multimodality (30 November 2021 webinar: going multimodal in corpus linguistics), I asked my students to narrate ongoing stories, or specially selected topics, to each other, using both verbal and visual means.


This strategy made the classes livelier, and topic vocabularies and lexical-phraseological units more memorable for my students. In addition, based on Dr Berber Sardinha’s findings, I used the multimodal communication strategy in my class. The students were asked to look through social media posts on current social, science and cultural issues and to compile corpora of verbal and visual (graphic, pictorial, illustrative, photo) means of describing, delivering, analysing and commenting on the information. Furthermore, the students used Google Cloud Vision to annotate the images in social media and tag them semantically. The result of the activity was a higher ability to memorise the information, to understand it and to transfer the message to other communicants. The second important observation the students made was the increased usage of numbers, graphs, visuals, photos and pictures in social media, which made communication more illustrative, fact-based, reliable and memorable.

Thereafter, I conducted a survey among my students on their attitudes towards multimodal communication and their perceptions of the benefits of verbal-visual methods of learning. The most frequent benefits the students named were the supremacy of visuals and pictures in memorizing vocabulary and recalling discourses; easier attention grabbing through visual means; and an easier and more efficient way of telling a story through multimodal means (Figure 3.3.1).

To conclude, leveraging multimodal communication strategies in classroom settings, particularly those inspired by Dr Berber Sardinha, results in enhanced memorization, deeper understanding and more effective storytelling, as substantiated by student interactions with social media and their positive perceptions of verbal-visual learning methods.

[Pie chart: supremacy of visuals 45%, attention grabbing 35%, telling a story 20%]

FIGURE 3.3.1 Perceptions of student participants towards benefits of multimodality


Reference list

Ferrara, L., & Hodge, G. (2018). Language as description, indication, and depiction. Frontiers in Psychology, 9, 1–15. https://doi.org/10.3389/fpsyg.2018.00716
Finnegan, R. (2014). Communicating: The multiple modes of human communication (2nd ed.). Routledge.
Franconeri, S. L. (2021). Three perceptual tools for seeing and understanding visualized data. Current Directions in Psychological Science, 30, 367–375.
Franconeri, S. L., Padilla, L. M., Shah, P., Zacks, J. M., & Hullman, J. (2021). The science of visual data communication: What works. Psychological Science in the Public Interest, 22, 110–161.
Paivio, A. (1971). Imagery and verbal processes. Holt, Rinehart and Winston.
Perlman, M., Clark, N., & Johansson Falck, M. (2015). Iconic prosody in story reading. Cognitive Science, 39, 1348–1368. https://doi.org/10.1111/cogs.12190
Zacks, J., & Tversky, B. (1999). Bars and lines: A study of graphic communication. Memory & Cognition, 27, 1073–1079. https://doi.org/10.3758/BF03201236

4 DATA-DRIVEN LEARNING
In conversation with Alex Boulton

Alex Boulton

1.

Could you provide examples of how DDL can give learners more direct access to data?

The basic idea behind data-driven learning (DDL) is that instead of being taught (by teachers, textbooks, online courses, etc.), learners take on a more active role. They do this not by learning 'rules' but by looking at how language is actually used. This is typically in a relatively large collection of texts that we call a corpus, which might be enormous – billions of words aiming to represent the entire English language, for example – or much smaller, for learners with specific needs, such as academic writing within their discipline. The important point is that the texts should be representative of what they're interested in. The tool we use to search a corpus is called a concordancer, whose main function is to find occurrences of the word or phrase you're looking for and show them in context so you can see for yourself how they're used. Concordancers can also show how frequent a word is, what other words it occurs with, which part of the corpus it occurs most frequently in, and much more besides. So, learners don't get 'answers' to their questions; they see language in use and work out the answers for themselves. It's kind of a game really: they have to think quite hard to come to their own conclusions, but that's conducive to learning compared to 'being taught'.

The idea came about in the 1980s and 1990s, when the most usual contact with the language was in the classroom. Now of course with the internet it may feel less relevant, but the fundamental idea is still valid – looking at lots of examples to arrive at your own conclusions. In fact, today, many people do exactly this type of thing with search engines such as Google. DDL just helps do it better, giving it a more respectable face, with a less random collection of texts and a tool that's actually designed for formulating language queries and presenting the information
appropriately, as opposed to repurposing a general tool that's designed to retrieve information content.

To give a simple example, do you say a book by JK Rowling or a book of JK Rowling? In French, you'd tend to say de, which is often equivalent to of, so it's not obvious for my students. It's the kind of thing that's difficult to find in a dictionary, grammar book or usage manual, and you'll probably forget if the teacher simply tells you. Or, what's the difference between big and large? Dictionary definitions are pretty circular on this, defining each in terms of the other. Yes, they are often synonymous, but with a corpus you can look for differences too. Among other things, big tends to be relatively more common in speech while large is more frequent in academic texts; also, big is more likely to describe things which are subjectively important or idiomatic, such as big brother, big idea or big mistake, while large refers to things that are measurable, such as large extent, large amounts or large portions. Which is not to say that the others are impossible, just less usual.

DDL takes us away from 'what's possible' with grammar rules to what is more normal usage. Language isn't just a question of correct/incorrect but of what's normal and appropriate in different linguistic and non-linguistic contexts. Frequency of words and patterns can be a very useful guide in helping to decide what's worth learning at this point in your trajectory as a learner, for things that interest you as an individual.

2.

What are some of the potential drawbacks of using DDL in language learning and teaching?

Nothing’s perfect! In other words, no technique, method or approach is 100% effective for 100% of learners in 100% of contexts 100% of the time. Nor for 100% of teachers either, and teachers are essential for change in teaching cultures, but we’ll come back to that. If the search for the ‘universal best method’ has largely been abandoned, the question is: at what point can you justify new practice? If, say, 10% of students really take to it, and it transforms their learning practices, is that enough? In practical terms, many of the early technological barriers have diminished as more tools and corpora become available, and as they become more user-friendly and relevant. Of course, it’s all far faster and more easily accessible now that so many people have their own devices and internet connections. Even the concept of searching for information using computers wasn’t always obvious – one early DDL paper mentioned students who had never touched a keyboard before, which is difficult to imagine today. Regarding drawbacks, on a more pedagogical level, a first encounter with DDL may seem time-consuming and inefficient. And it is, if you think merely in terms of direct learning of specific target items. But if you go beyond that, DDL provides a new way of thinking about language, not as learning and applying rules but noticing patterns in context, an exemplar-driven approach in line with
usage-based theories of language acquisition. Not what is possible, but learning about what is 'normal use'. Rule-learning is a demanding intellectual process, and applying rules is even more difficult. It's also fairly artificial, since language use generally operates at the level of chunks, i.e., groups of words that go together and that you understand or produce as a single block, not building them up or breaking them down every time into their smallest elements. With DDL, i.e., learning by example from patterns and chunks and generalising from the specific to the general, you are exploiting a natural evolutionary trait that we draw on every day. In addition to increased awareness of how language actually works, learners can lose some of their dependence on teachers, dictionaries, etc., and seek their own answers to their own questions. And even, perhaps, become 'better learners'.

Another barrier may be research itself, which is often perceived as irrelevant to 'real' teachers and learners, though that's not specific to DDL. In applied linguistics more widely, it's pretty rare to set up a teaching situation to prove that something doesn't work (i.e., the null hypothesis). It wouldn't be very ethical for a start! But researchers also have things they are enthusiastic about and want to find support for, to create positive outcomes at least on some level. You can see this in various syntheses of DDL, e.g., my meta-analysis with Tom Cobb (2017) where we pooled the quantitative learning outcomes from 64 separate studies and found the overall result to be very good – this is in the top 25% of all meta-analyses carried out in the field of second language acquisition. Combining evidence from multiple sources gives more weight to the findings. More than that, it worked pretty much everywhere, though there clearly is an element of 'publication bias' with a number of outliers.
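The pooling step behind a meta-analysis can be illustrated with a toy calculation. The effect sizes and variances below are invented, not taken from Boulton and Cobb (2017); the sketch shows only the standard inverse-variance (fixed-effect) weighting, in Python.

```python
import math

def pool_fixed(effects):
    """Fixed-effect meta-analysis: inverse-variance weighted mean of
    (effect size d, variance) pairs, with a 95% confidence interval."""
    weights = [1 / v for _, v in effects]
    d_pooled = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return d_pooled, (d_pooled - 1.96 * se, d_pooled + 1.96 * se)

# Hypothetical (d, variance) pairs from four imaginary DDL studies
studies = [(0.8, 0.10), (1.2, 0.25), (0.5, 0.05), (1.0, 0.20)]
d, (lo, hi) = pool_fixed(studies)
print(f"pooled d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# prints: pooled d = 0.71, 95% CI [0.40, 1.03]
```

Note how the low-variance (typically larger-sample) study at d = 0.5 pulls the pooled estimate below the simple average of 0.875: weighting by precision is what makes pooled evidence more informative than averaging study results directly.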
More recently, with Nina Vyatkina (2021), we coded 489 individual studies but found a lot of similar ground being covered again and again, especially with English as the target language in a university context. It would be nice to go beyond that, although work is starting to appear (e.g., Crosthwaite, 2019).

3.

What types of studies are needed to determine the effectiveness of DDL on language learning outcomes?

Following on from the previous discussion, surveys show quite a lot of studies covering similar ground, reinventing the wheel, if you like. That's another reason why you need syntheses, to identify areas that are underexplored. For DDL, we could extend its reach to other target languages, with a wider range of learner profiles, and not just in university contexts. The concept of 'corpus' could be revisited to include greater variety and originality – learner corpora, graded readers, websites, course texts, novels, TV scripts, social media and so on. It's comparatively easy these days to build your own corpus, specific to your own or your learners' needs, for use with free tools such as AntConc.1

I'd also be delighted to see more collaborative studies between teachers and researchers, or colleagues with different groups, different learners, in different disciplines or institutions or even countries working together to compare their results. Other comparative studies could include
different corpora or tools, different amounts of guidance or scaffolding, different implementations of DDL, and so on. It would be nice just to get out of the rut of a single teacher/researcher working with their own class as an experimental group. Mostly we have researchers interested in their own students, beginning with an idea of what they want to do and then wondering how you can get a publication out of it, rather than thinking in terms of research questions from the outset – what you really want to know and how you could go about exploring that. Surveys can give ideas for this.

In general terms, I'd like to see more theoretical input into research objectives, as Anne O'Keeffe has argued (2020). In particular, when you look at the alleged underpinnings of DDL, let's explore those directly rather than just offering them as post-hoc explanations for why a study turned out the way it did. Does DDL actually foster autonomy? How can we test that? Does it actually lead to greater language awareness? Do learners get better at noticing patterns? Do they continue with the tools and techniques outside class, or after a course has finished? How, what, where? Does it help with becoming 'better learners'? For this type of thing, we need to get a lot more creative about our research designs and methodologies. Wouldn't that be nice!

4.

What are some of the limitations of research on DDL, and how can they be addressed in future studies?

Quite a few negative or leading questions here! The limitations of DDL research probably aren't specific to DDL. All kinds of research practices in applied linguistics are, to be honest, a bit dodgy. We've already mentioned the lack of initial research questions underpinning many studies. The researcher as teacher has obvious drawbacks too, with enthusiasm between teaching control and experimental groups as an obvious confounding factor. If it doesn't work as well as you think it could, it's tempting not to publish the results (the file-drawer problem), or to present only certain results (p-hacking), or maybe to run the study again after tweaking it. All kinds of research and publication bias are present.

There are also issues of ecology versus rigour in exploring specific variables, the dynamic nature of language and language learning itself, the selection of experimental and control groups, sample sizes and poor statistical analyses – the list goes on. Which is ironic in a way, with so much of our methodology and even terminology coming from psychology, which in turn gets it from medicine: control group, subjects, intervention, treatment – as if language teaching were a cure for some kind of illness! Research in applied linguistics is a very human enterprise, despite all our attempts to improve the science.

Reporting is another problem: if a study is conducted over a semester, what does that mean? Fifteen weeks with five 3-hour classes per week entirely given over to the stated aim, or maybe 15 minutes once every two weeks over eight weeks? A huge variation. And then proficiency, how do you report that reliably? What did the learners actually do? It's really important to provide the materials, the instruments, the full data set, but it's surprising in our climate of open science how many researchers don't even consider this.

5.

How can teachers and researchers collaborate to expand the sample sizes and explore different research questions related to DDL?

Certainly, collaborative research would be a major step forward, not only between researchers but also between researchers and teachers. Everyone has a different perspective, and that's really valuable to bring to research. Often, though, the two sit a bit uncomfortably together. Some teachers think of researchers in their ivory towers with their big ideas divorced from reality – come and teach a 20-hour week with my students, you'll see! Researchers may also lack respect for teachers, which is ironic when you think about it since most researchers are teachers too. But it's not symmetrical: most researchers teach, but most teachers don't do research in a formal sense (though every day in class is an experiment). Then again, there are differences in culture: most DDL research is situated within applied linguistics, which is largely confined to university contexts and less within secondary or primary education (not everyone will agree with me on this, but still).

In any case, collaboration has to happen. Researchers shouldn't expect teachers to come to them; we have to reach out. In the same vein, something I strongly believe is that we shouldn't expect our learners to come to DDL, making the effort and the compromises to become little corpus linguistic researchers: we have to adapt the tools, corpora and techniques to meet their needs, interests and abilities. Or, at least, be prepared to meet half-way, even if it means compromising on some things. I don't mean abandon all principles, but be pragmatic.

Most of the research studies with teachers have involved in-service training, usually workshops or relatively small-scale attempts and local initiatives. There are MOOCs,2 but I've no idea how many regular participants are full-time teachers in school contexts. This is fine as far as it goes, but to reach more than a dozen people at a time, we need something more integrated into initial teacher training in particular.
There are plenty of pressures on trainee teachers to master the elements they need for their exams; it's not really on to ask them to invest time and energy in something that doesn't count towards their qualification. So, until that changes, doing DDL with teachers will be difficult. The question should perhaps then be: how can we convince the decision-makers that this type of approach should be included in pre- and in-service training?

6.

How do you see DDL fitting into a broader pedagogical approach to language learning? Can it be used in conjunction with other methods, or is it best used as a standalone approach?

Tim Johns is often called the 'father' of DDL, though he wasn't the first to see the potential of corpora in language learning and teaching. He did coin the term DDL, though, or at least steal it from computer science. I like to think he was being deliberately provocative: 'data' and 'driven' are fairly unsexy terms, not obvious terminology choices to win people over. This reminds me of Bax's consideration
of ‘normalisation’ (e.g., Chambers & Bax, 2006): that technology is most effective when its use is so obvious that nobody thinks about it anymore. This is like Google, so ubiquitous as to be invisible. Many learners are indeed just googling their language queries and seeing what comes up; and if you think of that as a kind of DDL that-dare-not-speak-its-name, that’s normalisation. Much – if not most – DDL research presents fairly one-off situations, where an individual teacher-researcher uses the approach with one or two classes over a short period. Fine, but it will be difficult for any approach to really take off under these conditions, if it’s seen as gimmicky and out on the margins. It’s not a panacea to all learning problems. Sometimes dictionaries and other things are more appropriate, and I don’t think anyone would say DDL should be used exclusively. So, rather than presenting DDL as something revolutionary, I’d like to see it better integrated into regular teaching over time as a normal practice itself and, indeed, not just in language classes. Whenever you’re working with quantities of text, it’s important to point out how corpus tools and techniques can come in useful. I’d love to see collaboration between teachers of different disciplines on a range of uses in this respect. I mention Johns, because back in 1993, he wrote that “One of the most striking aspects of the [DDL] methodology is the extent to which it mirrors the languageteaching Zeitgeist”, with “a central emphasis on authentic language use, on the development of discovery methods in the learner”, as well as “the possibility of restoring ‘grammar’ to a central position in language teaching”. Admittedly, he went on to say that “DDL is in danger of becoming too fashionable for its own good”, so, not an entirely accurate prophet. But a few points on this. First, the ‘zeitgeist’ hasn’t changed that much since the 1990s, if you compare it to the three decades before that. 
The current philosophy is still very much into discovery, constructivism, individualisation and so on, all of which are more than compatible with DDL. As I said before, though, it would be nice if researchers explored specific theoretical implications of DDL explicitly, rather than relying on them as a justification after the fact. Second, 'discovery': this is a heuristic approach which aligns well with constructivism. Third, 'language use', not language knowledge, i.e., what you can do with language in communicative situations, rather than how much you know about language. Many people had similar experiences to mine: after a decade of learning a foreign language (French), not being able to understand what people were saying, or to string together a coherent contribution to a conversation. DDL sits well with usage-based theories, as I said earlier, learning from examples rather than from rules.

As for the 'central position' of grammar, I'd prefer to rephrase that as the central place of language. For learners of my generation, language was the only thing, with no thought for socio-cultural communication or interaction. Today it's rather the other way round, and you read so many papers in language teaching where you think: but where's the language? These are the
learners’ representations and reactions, but did they learn anything, what can they do now in a language that they couldn’t before? 7.

In your opinion, what are the key technological advancements needed to improve the effectiveness of DDL? Are there any current technological limitations that are hindering its adoption?

I’ve recently been looking at how DDL is presented visually in research papers: the authors provide screenshots of what their learners see, things they presumably think are representative of their conception of DDL. What strikes me is how little things have changed fundamentally in 30 years. Prettier, certainly, and much faster, but a lot of it boils down to seeing a word or phrase in the middle of a set of concordance lines. The book by Pascual Pérez-Paredes and Geraldine Mark (2021) shows a willingness to go beyond that. A surprising number of recent studies use print-outs: this may suggest that technology is more of a problem than I let on earlier, or it might be a question of selecting relevant lines to speed things up and avoid the danger of data overload. Again, I’d like to see the technology come to the learners rather than asking them to do all the running. A number of tools aim in this direction, with much simplified interfaces: SkELL,3 Linggle,4 ColloCaid5 all spring to mind. We could go further, looking at what technology learners use most, and how. DDL has to become more accessible on smartphones via mobile-assisted language learning. Certainly, it has to be free, stable, available on their own devices and not limited to institutional hardware or paying services over a limited period. I’d also like to see more exploration of DDL ‘light’, i.e., practices that are not entirely dissimilar to DDL, but which are more familiar and accessible. Google and other search engines, of course, and how the snippets can be used as input samples, kind of like the traditional corpus concordance. There’s also Linguee6 and Reverso Context,7 which involve human (not machine) translations of text so you can input a word or phrase in your own language and see how other people have translated it in the past. Not all the translations are equally good, but they are often better than machine translations, and the range of options gives you a choice. 
YouGlish8 is an interesting tool for many different languages: as with DDL, type in your word or phrase and see it in multiple contexts, but here in videos, a few seconds before and after. Skipping from one context to the next, you can see the item in the subtitles with visual support, in different communicative situations, with different accents, and so on. Spoken DDL!

Some researchers won't be convinced by all of this, since we aren't dealing with a 'corpus' in the traditional sense (the collections aren't aiming to be 'representative'). Personally, I don't think we should be too dogmatic or ideological about such things: from a pedagogical perspective, the procedures are really quite close to DDL, and learners are probably doing things like this already.
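The display described here, a word or phrase in the middle of a set of concordance lines, is simple enough to sketch. This toy key-word-in-context (KWIC) generator is an illustration only, not the implementation behind any of the tools mentioned; the sample text is invented.

```python
import re

def kwic(text, query, width=30, max_lines=5):
    """Return KWIC (key word in context) lines for `query` in `text`:
    each match centred, with `width` characters of context either side."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(query) + r"\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
        if len(lines) == max_lines:
            break
    return lines

sample = ("It was a big mistake to order such a large amount. "
          "A big idea needs a large team, and a big budget helps.")
for line in kwic(sample, "big"):
    print(line)
```

A real concordancer adds sorting on the left or right context, frequency counts and collocate statistics, but the core presentation is just this alignment of contexts around the query item.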


8.

You mentioned that most DDL research has focused on written skills. Are there any promising new avenues for DDL research that could address other language skills such as speaking and listening?

Most ‘spoken’ DDL involves transcripts of speech, which means you can use the same tools and techniques as with other written corpora. With the BNClab9 you can visualise how different words and phrases are used by different populations, depending on age, gender, socio-economic background, region, etc. Still, this probably won’t be entirely satisfactory in a communicative classroom with the emphasis on spoken interaction – we want sound and video. There are a few tools around: I mentioned YouGlish earlier, which sources videos from YouTube along with their subtitles, available in English and several other languages. A similar principle drives PlayPhrase,10 though with films and TV series rather than YouTube. The nice thing about that is seeing actors you recognise in more interactive situations (fewer long monologues), but the dialogues are of course scripted, and unfortunately you can only see five occurrences, after which you have to pay or delete your browsing history and start again. There’s also the TED Corpus Search Engine11 of spoken presentations, perhaps a bit less intuitive but it does show you a concordance for your query item, then you can click on one to go to the video. This underlines the difference with written text: you can see multiple examples in a concordance on the same page, but you can’t watch lots of videos all at the same time. Hence the back-and-forth between concordances and videos, which is also at the heart of FLEURON12 for international students coming to France, short videos collected by colleagues in my research group. Either search for a word to see a concordance and then select a video, or watch a video and click on a word to go to a new concordance. If DDL has mainly been used with written rather than spoken skills, it’s also associated with productive skills, especially writing, sometimes translation, occasionally speaking. Less so the receptive skills of reading and listening. 
Is that because DDL isn't relevant there, or that it's difficult to research, or simply that we're influenced by existing studies and forget to take a step back and think of new things? Tim Johns and colleagues (2008) did use DDL to help secondary students with reading a novel (there's nothing that says it has to be a 'corpus'; it's the corpus tools and techniques that define DDL).

At the moment I'm involved in a large national project called Lexhnology13 to see if DDL can help with reading US Supreme Court judgements that have been annotated for moves and steps. So, rather than a bottom-up approach where you teach the words and grammar and assume students will be able to use that to create meaning out of a text, we begin with the text and see if a DDL approach to understanding at the discourse level will help with learning the language. It's very early days and I have to admit I have absolutely no idea how this will work or what it will look like, but it'll be fun finding out . . .

9.

How do you see the use of DDL evolving in the future? Do you think it will become more widely adopted in language teaching, or will it remain a niche approach?

What do you mean, "remain a niche approach"? Seriously, if lots of people are using Google and other search tools for the internet or for individual web pages or documents, then it's really already out there. You could think of this as DDL hiding in plain sight. I'd love to see someone actually try to survey such practices.

I was asked recently how I thought ChatGPT might influence DDL. Short answer: I have no idea. Like any technology, used intelligently it could be amazing, at least as a reference resource in improving text. But for language learning as such, beware of any tool that does everything for you. Just as we know from experience that 'correcting' students' work can go in one ear and out the other, learners need to think about it for that essential depth of cognitive processing. I also have a colleague, Henry Tyne, who recently told me that one of his doctoral students has found a workaround with ChatGPT to input quite large numbers of texts and ask questions to get very similar answers to what you would find from corpus queries, but in plain language and with no corpus software involved. I need to speak to him more about this and try it for myself.

10.

What role do you see teachers playing in the implementation of DDL? Are there any specific teacher training programs or pedagogical approaches that could help them incorporate DDL more effectively into their classrooms?

Apart from the learners themselves, teachers are clearly the most important people in language learning. If they're not on board, no innovation will work: an unconvinced teacher will not convince their students. It might be nice to think of a bottom-up grass-roots revolution (What do we want? DDL! When do we want it? Now!), but that's unlikely. A top-down approach via training programmes would certainly work, but I'm not holding my breath on that either. So, within the current paradigm, we are really looking more at the individual level, though that of course keeps DDL in its niche (ooh, a nice interlinguistic pun, since niche in French is a dog kennel).

As I say, teachers already have a lot on their plate, and it's not surprising many are unwilling to invest time and energy in something perceived as marginal, and with a relatively steep learning curve. I'm the same: there are so many websites and apps around, I think I'd love to do that, but don't have time to explore them all myself, let alone use them in class. It's also a bit of a challenge in many teaching cultures where teachers are expected to have all the answers. Allowing learners to interact directly with data can be a direct threat to that, though I'd say that nobody can know everything, and that admitting so will help learners understand that language is not just a set of grammar rules plus vocabulary and
pronunciation. Language is fuzzy. When my students ask me questions that I don't know the answer to, I might look in a dictionary for some things, but as often as not it's a corpus I turn to. Using corpora can give greater credibility to the language taught – it's not just me or the textbook, but rather hundreds, thousands of real people actually using language like that.

I don't think there's a one-size-fits-all way to implement DDL. Teachers can use it for their own development, as a source of examples for more authentic materials and activities. Hands-on, they might simply help their students use familiar tools like Google more effectively; and for some, it might stop there. For others, that could be a way in to more prototypical DDL, either with a large online corpus (e.g., the Corpus of Contemporary American English),14 or to create a smaller corpus that addresses specific local needs. Often you won't really know until you try: my first attempts at DDL used paper-based materials because I was sceptical my students would be able to deal with it hands-on. When I did get them to use a concordancer, I was surprised at how quickly they took to it, at least the basics. It's easy to underestimate your students.

So, my advice to any teachers would be: stick to a small number of relatively simple, user-friendly tools; choose relevant corpora or other text types; experiment yourself before class (you can usually find tutorials on YouTube for websites and software); and as with any technology, always have a plan B. Perhaps most important, don't feel pressured or constrained by what other people do, but take it where you want to go, how it suits you, your learners and your particular context. Give it a go!

Notes

1 www.laurenceanthony.net/software/antconc
2 https://wp.lancs.ac.uk/corpussummerschools/corpus-linguistics-mooc
3 https://skell.sketchengine.eu
4 https://linggle.com
5 www.collocaid.uk
6 www.linguee.fr
7 https://context.reverso.net/traduction
8 https://youglish.com
9 http://corpora.lancs.ac.uk/bnclab/search
10 https://playphrase.me/#/search
11 https://yohasebe.com/tcse
12 https://fleuron.atilf.fr
13 https://lexhnology.hypotheses.org
14 www.english-corpora.org

Reference list

Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta-analysis. Language Learning, 67(2), 348–393. https://doi.org/10.1111/lang.12224
Boulton, A., & Vyatkina, N. (2021). Thirty years of data-driven learning: Taking stock and charting new directions. Language Learning & Technology, 25(3), 66–89. www.lltjournal.org/item/10125-73450
Chambers, A., & Bax, S. (2006). Making CALL work: Towards normalisation. System, 34(4), 465–479. https://doi.org/10.1016/j.system.2006.08.001
Crosthwaite, P. (Ed.). (2019). Data-driven learning for the next generation: Corpora and DDL for pre-tertiary learners. Routledge.
Johns, T. (1993). Data-driven learning: An update. TELL&CALL, 2, 4–10.
Johns, T., Lee, H. C., & Wang, L. (2008). Integrating corpus-based CALL programs and teaching English through children's literature. Computer Assisted Language Learning, 21(5), 483–506. https://doi.org/10.1080/09588220802448006
O'Keeffe, A. (2020). Data-driven learning – a call for a broader research gaze. Language Teaching, 54(2), 259–272. https://doi.org/10.1017/S0261444820000245
Pérez-Paredes, P., & Mark, G. (Eds.). (2021). Beyond the concordance: Corpora in language education. John Benjamins. https://doi.org/10.1075/scl.102

4.1 Research opportunities in the field of corpus linguistics in Vietnam
A researcher's view

Le Thanh Thao

The arrival and devastation of COVID-19 brought significant loss and deepened anxiety about the future of humanity in an increasingly uncertain world. Many researchers, myself included, were bewildered about how to continue their work in such an unpredictable climate. Given the immense impact of COVID-19 on society, many of us struggled to steer our academic pursuits. Deeply committed to advancing our research, we explored various avenues for insights and strategies. On this journey, I was introduced to the field of corpus linguistics (CL), and especially its significance for international perspectives on language learning technology. The distinguished contributions of CL experts to this area offered invaluable insights, with further understanding coming from experts in related domains, highlighting the potential of data-driven learning (DDL). My comprehension deepened as I delved into the intricacies of data collection and delineated a research path in CL, which spotlighted the extensive opportunities inherent in the field of language studies.

In the context of Vietnam, an examination of the existing body of CL research reveals significant limitations. Scholars within the country appear to face challenges in harnessing and optimizing available data effectively. The methodologies employed often lack contemporary relevance, and the topics pursued within the domain of CL do not consistently align with practical applications or offer significant contributions. Given these observations, there is an evident and urgent need to augment and refine the expertise of researchers in this particular milieu.

Influenced by Prof. Boulton's work, I developed a deep appreciation for the role of DDL in enhancing research methodologies within the realm of CL. Recognizing its potential relevance to the Vietnamese context, I embarked on a project to determine how DDL could be leveraged to make a substantive impact and bolster research quality in CL within Vietnam. As a tangible instance of this engagement with DDL, I conducted a systematic review. The process involved familiarizing myself with the rigorous procedures of systematic reviews, with methodologies for ensuring scientific and logical rigor, and with strategies for articulating findings. This review proved instrumental in highlighting potential research gaps, offering a roadmap for interested scholars to identify and address unexplored areas in the field.

Returning to the constraints on CL research in Vietnam discussed above, I can acknowledge that my initial anxieties about developing research in this field have eased. Nonetheless, the path forward appears riddled with complexities. One notable challenge in Vietnam is limited access to studies that are not open access. Often, individual researchers bear the financial burden of procuring publications such as scientific journals or books, a scenario not feasible for all. Moreover, enhancing research capacity remains a pressing concern for many Vietnamese scholars. As emerging researchers, they are eager to foster international collaborations. Such partnerships could provide researchers from nations grappling with research challenges, including but not limited to Vietnam, with opportunities to engage with and learn from global experts.
Collaborative endeavors of this nature hold the promise of forging meaningful impacts on future societal landscapes.

4.2 A user-friendly resource for DDL
YouGlish

Alessandra Cacciato

Alex Boulton's discussion covers publications describing empirical studies on data-driven learning (DDL), showing that the vast majority come to positive conclusions about the impact that the use of corpora may have on language learning. At the same time, these studies reveal several challenges, such as the lack of a detailed approach to students' corpus training, of longitudinal work, and of exploration "outside university classrooms: in autonomous contexts, in secondary education, in language schools, or in the workplace for 'genuine' life-long needs" (Boulton, 2017, p. 485).


Moreover, outside English as a target language, the tools and techniques of corpus linguistics remain rather marginal in teaching. By adopting the broader definition of DDL as "using the tools and techniques of corpus linguistics for pedagogical purposes" (Gilquin & Granger, 2010, p. 359), we, as teachers and practitioners, can start thinking outside the box when using DDL and transform it from a one-off practice into a regular and widespread learning approach. To reach this goal, more user-friendly resources are needed. For example, tools such as YouGlish, available at https://youglish.com/, can be used for a form of oral DDL. It is a free website designed to help users improve their pronunciation by automatically extracting segments containing a target word or phrase from relevant YouTube videos, along with a transcript and phonetic help. While it was created as an online pronunciation resource with a focus on the development of listening and speaking skills, it integrates an audiovisual representation of the word or expression in context. It can also be integrated into other platforms or websites via the Application Programming Interface (API) provided. In my own path to developing a useful DDL resource, I adopted this tool to widen the possibilities offered by this approach for autonomous learning in secondary schools. Strictly speaking, YouGlish is not a corpus; however, it allows learners to discover linguistic patterns, to make hypotheses, and to find answers to language questions hands-on as "a researcher" (Johns, 1991, p. 2). It empowers learners to explore authentic language and to exercise considerable autonomy throughout their learning process, and it stimulates their curiosity to embrace and use DDL for their own learning purposes outside the classroom.
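As a small illustration of how YouGlish can be folded into materials, its public search pages appear to follow a predictable URL pattern, so worksheet links for target items can be generated programmatically. This is a hedged sketch only: the /pronounce/&lt;query&gt;/&lt;language&gt; path is an observed convention of the site, not a documented API, and may change.

```python
from urllib.parse import quote

def youglish_url(query: str, language: str = "english") -> str:
    """Build a link to YouGlish's spoken examples for a word or phrase.

    Assumes the /pronounce/<query>/<language> URL scheme, which is an
    observed convention of the public site rather than a documented API.
    """
    return f"https://youglish.com/pronounce/{quote(query)}/{language}"

# A teacher could generate direct links for a week's target phrases:
for phrase in ["take a photo", "make a stand"]:
    print(youglish_url(phrase))
```

For deeper integration (e.g., embedding the player in a course page), the official widget API mentioned in the text would be the more robust route.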
In practice, the students I have presented it to have reported that the variety of videos on offer is enriching and offers food for thought within a learning experience that goes beyond mere understanding of the language item, encouraging them to reflect on the connection between it and different contexts, people and accents. To conclude, user-friendly and stimulating tools like YouGlish open up a wide range of possibilities to meet the needs of learners, maximize the benefits of DDL and unleash its potential, leading to a more spontaneous and accessible version of this approach.

Reference list

Boulton, A. (2017). Research timeline: Corpora in language teaching and learning. Language Teaching, 50(4), 483–506. https://doi.org/10.1017/S0261444817000167
Gilquin, G., & Granger, S. (2010). How can data-driven learning be used in language teaching? In A. O'Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 359–370). Routledge. https://doi.org/10.4324/9780203856949
Johns, T. (1991). From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. In T. Johns & P. King (Eds.), Classroom concordancing. English Language Research Journal, 4, 27–45.


4.3 Corpus linguistics for translation studies in Chile
Perceptions from the perspectives of undergraduate students

Kevin F. Gerigk and José Luis Poblete

In this short discussion, we address a few of the findings from Boulton's chapter. Particularly interesting was his finding that DDL-related research seems to be rather Eurocentric, with a particular scarcity of publications from Latin America. Whilst interest in DDL studies in Latin America is growing (see Singer Contreras, 2016; Singer et al., 2021 for studies from Chile), traditional views of language teaching may still create a barrier to DDL methods. Research on the efficacy of DDL-informed teaching is rich (for example, Lin, 2015), but little previous work has focused on how Chilean students perceive this approach. In response, a collaboration between a Chilean and a UK university was established, aiming to further develop corpus-driven DDL in an undergraduate translation studies programme in Chile. In this collaboration, we were interested in two main aspects: 1) promoting corpus-driven DDL as a methodology for language teaching; and 2) exposing students to naturally occurring language via corpus analysis software, as opposed to contrived coursebook language. During the semester, our students learnt to use the freeware packages AntConc for concordancing and #LancsBox for the investigation of collocation networks and keyword analyses. As students became more proficient, their final assessment was an autonomous corpus linguistics investigation of a topic of their own choosing. Alongside this, an investigative study design was developed to capture our students' perceptions of the course contents and of their own development as L2 learners and translators. What is novel here is that our research did not focus on the actual performance of the students, i.e., we did not evaluate test scores.
Instead, we were interested in the students' perceptions and evaluation of the course. We tried to 'push the envelope', as Boulton suggests, and as a result the one-semester-long module, running twice a week, started from a needs analysis. The needs analysis captured students' expectations regarding autonomy, motivation and areas of interest, as well as their perceived proficiency in using corpus methods. After the module, students responded to a questionnaire (11 Likert-scale and 14 open-answer items), which elicited student perceptions and evaluation of the input regarding corpus methods for their translation studies. We were particularly interested in students' self-assessment of their corpus proficiency and in any change in language proficiency as perceived by the students themselves. This helped us to


identify and evaluate how Chilean university students perceive corpus-driven DDL. The results have been positive throughout but have also opened an avenue for future improvement of the module for the teacher-researchers. The results have led to a second implementation of this collaboration, for which we have taken the results on board and made improvements in line with the students' feedback and perceptions. Perhaps the greatest improvement for this second iteration of the course was an increase in hands-on time for the students, in which they were able to operationalise the input they received during the lectures. We have addressed several issues discussed by Boulton: we have added to the body of research from Latin America, whilst at the same time shifting the focus back onto the students. In this endeavour, we have been able to establish an international research and teaching collaboration from which our students and the wider DDL community, particularly in Latin America, can benefit.

Reference list

Lin, M. H. (2015). Effects of corpus-aided language learning in the EFL grammar classroom: A case study of students' learning attitudes and teachers' perceptions in Taiwan. TESOL Quarterly, 50(4), 871–893.
Singer, N., Poblete, J. L., & Velozo, C. (2021). Design, implementation and evaluation of a data-driven learning didactic unit based on an online series corpus. Literatura y Lingüística, 43, 391–420.
Singer Contreras, N. D. (2016). A proposal for language teaching in translator training programmes using data-driven learning in a task-based approach. International Journal of English Language and Translation Studies, 4(2), 155–167.

4.4 Adoption of the DDL method through corpus-based exercises

Adrian Jan Zasina

I have been employing the DDL method since 2013, initially as a student and later as a teacher. Alex's chapter provided numerous insights into incorporating DDL activities into teaching Czech as a foreign language. One of the standout points for me was the statistics highlighting the number of studies on the use of corpora in language teaching (Boulton & Cobb, 2017). This led me to ponder how far that research is reflected in standard classroom activities. I have observed that Czech teachers often hesitate to incorporate corpora into their language teaching. Yet, when presented with the right tools, specifically ready-to-use exercises, they become more inclined to integrate DDL activities into their lessons. Consequently, I directed my efforts towards providing teachers with appropriate corpus-based


exercises as part of my project. Drawing from my experiences with students and a thorough analysis of learner errors in a learner corpus, I designed a set of corpus-based exercises. When these were juxtaposed with traditional teaching methods (Zasina, 2022a), it was evident that the customized exercises effectively addressed student needs and reinforced the material they had learned. The subsequent phase involved showcasing the utility of DDL to educators. Through numerous seminars, I discerned their apprehension towards corpus data. Yet, upon presenting my exercises, they began to appreciate the potency of DDL in their classrooms. This realization culminated in a corpus workbook encompassing over a hundred ready-to-use exercises (Zasina, 2022b). This methodology has notably fostered the adoption of DDL activities among a broader audience. Furthermore, there has been a surge in the interest teachers show not only in using language corpora but also in utilizing user-friendly corpus tools, such as the tools for Czech available at www.korpus.cz/. These tools not only facilitate easier interaction with corpus data but also enrich classroom experiences. As Boulton highlights, contemporary tools have demystified working with corpus data, introducing features that enhance classroom appeal and foster independent learning among students. Notably, students often do not require a grounding in corpus linguistics, especially when their instructors adeptly use captivating tools that seamlessly integrate corpus data. Boulton also advocates for the progression of DDL through collaboration, not just among peers but also with educators. Echoing this sentiment, I am committed to rendering the DDL method more approachable for Czech teachers and learners across all proficiency levels. As Boulton argued in his 2009 paper, while DDL can be applied even at lower proficiency levels, the bulk of the research is slanted towards advanced learners.
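The step from concordance data to ready-to-use exercises can be sketched in a few lines of code. The example below is purely illustrative (the sentences and target item are invented, and real worksheet preparation involves far more curation than this): it simply blanks out word forms sharing a target stem to yield a printable gap-fill item.

```python
import re

def gap_fill(concordance_lines, stem):
    """Blank out word forms beginning with `stem` in each concordance
    line, producing items for a printable gap-fill exercise. Crude
    stem matching, but serviceable for richly inflected languages."""
    pattern = re.compile(rf"\b{re.escape(stem)}\w*", re.IGNORECASE)
    return [pattern.sub("_____", line) for line in concordance_lines]

lines = [
    "Could you take a photo of us at the finish line?",
    "She shared the photos from the workshop online.",
]
for item in gap_fill(lines, "photo"):
    print(item)
```

In a real workbook, the concordance lines would of course come from a corpus query rather than a hand-typed list, and the answer key would be retained alongside each blanked form.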
My renewed mission is to delve into the nuances of non-native Czech in relation to a learner’s L1 and proficiency, aiming to discern patterns and variations stemming from these variables. I am confident this exploration will yield invaluable insights, enhancing our understanding of learner language and refining our approach to teaching Czech more effectively. Reference list Boulton, A. (2009). Testing the limits of data-driven learning: Language proficiency and training. ReCALL, 21(1), 37–54. Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta-analysis. Language Learning, 67(2), 348–393. https://doi.org/10.1111/lang.12224 Zasina, A. J. (2022a). Corpus approach to teaching Czech as a foreign language in university courses. Bohemistyka, XXII(3), 435–462. https://doi.org/10.14746/bo.2022.3.7 Zasina, A. J. (2022b). Designing a corpus workbook for students of Czech as a foreign language. Studie z Aplikované Lingvistiky – Studies in Applied Linguistics, 13(2), 125–132.

5 DATA-DRIVEN LEARNING, SECOND LANGUAGE ACQUISITION AND LANGUAGES OTHER THAN ENGLISH
Connecting the dots

Luciana Forti

1. Introduction

The overall purpose of data-driven learning (DDL) is to help language learners in ways that other approaches are unable to. By relying on resources and techniques based on corpus linguistics, DDL provides learners with the opportunity to interact with authentic language data and discover linguistic patterns emerging from use. But what do language learners need help with specifically? Can DDL provide such help in ways that other approaches cannot? And can it do so in the context of languages other than English (LOTEs)? To address these questions, we must turn to second language acquisition (SLA) as a whole, DDL pedagogical principles, and LOTEs specificities. We must also consider the overlaps between these three areas and ‘connect the dots’. In this chapter, I seek to do so by means of a case study on L2 Italian phraseology. By identifying the linguistic variables pertaining to the learning aims, based on the available research evidence, the study attempts to show how we can determine whether DDL can help learners in ways that other approaches cannot. The chapter is structured as follows. Section 2 explores the connection between DDL and SLA, with respect to phraseology. Section 3 explores the connection between DDL and LOTEs, with an emphasis on Italian. Section 4 presents and briefly discusses a case study on L2 Italian and the development of phraseological competence through DDL. Section 5 concludes the chapter with some final remarks.
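Since concordance lines are the raw material of the DDL activities discussed throughout this chapter, it may help to see how minimal the underlying mechanism is. The sketch below is illustrative only (not a tool used in the study reported here): it produces a basic keyword-in-context (KWIC) display from running text.

```python
def kwic(text: str, keyword: str, width: int = 25) -> list[str]:
    """Return keyword-in-context rows: each occurrence of `keyword`
    bracketed, with up to `width` characters of co-text per side."""
    rows, lower, start = [], text.lower(), 0
    key = keyword.lower()
    while (i := lower.find(key, start)) != -1:
        left = text[max(0, i - width):i]
        right = text[i + len(key):i + len(key) + width]
        rows.append(f"{left:>{width}} [{text[i:i + len(key)]}] {right}")
        start = i + len(key)
    return rows

sample = ("Learners make hypotheses, test them against the data, "
          "and make generalisations from recurring patterns.")
for row in kwic(sample, "make"):
    print(row)
```

Aligning the keyword down the centre of the page is what lets learners spot recurring patterns to its left and right, which is precisely the discovery process DDL builds on.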

DOI: 10.4324/9781003413301-5


2. Connecting DDL and SLA: focus on phraseology

In the introduction, I formulated three questions. This section addresses the first two, which concern what language learners need help with the most, and whether DDL can provide this help more effectively than other approaches. In order to answer the first question, we must turn to the empirical evidence accumulated by studies conducted in the broader field of SLA. We must also turn to even broader fields related to how language works beyond acquisitional matters, such as general linguistics. By pulling these areas together we will be able to identify what really counts in language learning. One major achievement of corpus linguistics is to have provided solid empirical insight into the formulaic nature of language.1 We now know that most language production is based on multiword chunks (Altenberg, 1998; Erman & Warren, 2000) and that knowledge of these chunks considerably helps language processing, acquisition and use (Henriksen, 2013). We also know, however, that multiword units can be particularly challenging for learners, even at advanced levels of proficiency. For example, semantic opacity (i.e., when a word is not used in its literal meaning) has been identified as potentially challenging (Altenberg & Granger, 2001). Combinations such as to make a stand contain a delexicalised verb, which may challenge learners in their comprehension of the meaning of the overall combination. Frequency is another property that has been much discussed in the literature. While highly frequent phraseological units may have positive effects in terms of the emergence of form-meaning mappings in the learner’s mental lexicon, highly infrequent ones may pose challenges. Infrequency may also be associated with semantic opacity (e.g., to see red), so the two properties combined may challenge learners even more. Another property that may cause issues is that of congruency. 
Word combinations sharing the same meaning between two languages are considered congruent if they are made of words linked to the same literal meaning. For example, the English to take a photo is congruent with the Italian fare una foto, because the verbs selected by photo/foto are linked to the same literal meaning. On the contrary, both combinations are incongruent with the French prendre une photo, as prendre is linked to a different literal meaning (i.e., 'to take'). These are cases in which there is no parallel use in the language being considered. Several studies have found that incongruency, especially when combined with other properties such as semantic opacity, can pose problems for language learners (Wolter & Yamashita, 2018). We can thus answer the first question – what language learners need – by recognising the importance of phraseology. Let us now address the second question – whether DDL can effectively help – and see where in fact it could. We will refer specifically to concordance-based DDL, where learners work with sentences containing multiple occurrences of a word or word combination. To help learners with issues related to semantic opacity, DDL can provide concordances with numerous examples, containing both literal and non-literal uses of a combination, which learners (alone or together with their peers) can distinguish thanks to the contextual meaning provided by the sentences (e.g., differences between raccontare una storia, 'to tell a story', and raccontare storie, 'to lie'). Infrequent combinations, which may also be semantically opaque to some extent, can likewise be addressed effectively through DDL activities. Through the multiple examples provided by concordance lines, learners can infer meaning from the context emerging from the sentences, thus overcoming the potential limitations of infrequency (e.g., essere al verde; literal meaning: 'to be at the green'; actual meaning: 'to have no money'). As for incongruency, this too can be addressed through DDL activities, which can help learners develop awareness of differences with their L1. This can help them avoid using non-typical combinations (e.g., dare un sorriso instead of fare un sorriso, 'to smile'; incongruent for L1 Chinese learners). All in all, concordance-based DDL activities, matched with guided-discovery teaching techniques, have the potential to make the challenges identified in SLA in relation to the development of phraseological competence less problematic for learners.

3. Connecting DDL and LOTEs: focus on Italian

In this section, we focus on the third question formulated in the introduction: whether DDL can help learners even if they are learning a language other than English. Compared to the other two questions, this may sound more like a rhetorical question: the obvious answer would be affirmative, but to what extent has this been the case in the history of DDL? One solid source of knowledge we can use to answer this question is the 30-year review published by Boulton and Vyatkina (2021). The review collects studies on DDL published from 1989 to 2019, focusing solely on empirical work, that is, "any publication that provides some type of evaluation of DDL" (Boulton & Vyatkina, 2021, p. 68). A total of 489 studies were collected and coded according to a wide range of variables, including target language. We used the dataset linked to the Boulton and Vyatkina review (freely available through Iris2) to obtain a graphic overview of the distribution of LOTEs studies over the years with respect to empirical DDL research overall (see Figure 5.1). Studies focusing on more than one target language were counted only once (e.g., Amador Moreno et al., 2006). An exception was made for cases in which one of the target languages was English (e.g., Chambers, 2005): in these cases, given the aim of the graph, we counted the study twice. As can be seen, empirical evaluations of DDL were published sparingly from 1989 to 1996, and they concerned English only. The very first empirical evaluation of DDL for a LOTE was published in 1998 and focused on French (Bowker, 1998). The situation does not seem to be improving over the years: as can


FIGURE 5.1 DDL for English, LOTEs except Italian, and Italian over 30 years

be seen in Figure 5.1, while the number of studies published per year in empirical DDL research has steadily risen, the number concerning LOTEs has risen only very slightly (average number of studies per year over 10-year spans: 1.9 between 1998 and 2007; 2.5 between 2008 and 2017). We must wait until 2001 to see the first empirical DDL study focusing on Italian, published by Kennedy and Miceli (2001). According to the studies collected by Boulton and Vyatkina, over 30 years the number of studies dealing with empirical evaluations of DDL for Italian totals only 9. If we consider 9-year time spans starting from 2001, we see that on average there has been very little improvement (0.4 studies per year between 2001 and 2009; 0.5 between 2010 and 2019). The review by Boulton and Vyatkina, however, considers only publications in English. If we include publications in languages other than English, and in Italian specifically, and broaden the inclusion criteria to also consider studies that do not contain an evaluation of DDL, the picture slightly improves. We found a total of 26 studies, spanning 30 years, dealing with corpus use in Italian L2 pedagogy (Forti, 2023, pp. 28–31). However, most of this work is rather descriptive and introductory, though enthusiastically supportive of DDL. Many scholars have, in fact, warmly welcomed DDL and presented ways in which corpora can be used in Italian teaching settings, but empirical studies investigating the effects produced by DDL and the processes involved in DDL activities are still very rare.

4. Merging DDL, SLA and LOTEs: a case study on Italian

Aims and research questions

This section synthesises and selects work stemming from a larger project aimed at evaluating the effects of DDL in the context of Italian L2 language courses tailored for Chinese L1 speakers. In designing the study, special attention was given to the selection of the learning aims and to the empirical evidence available in relation to such learning aims. The selected general aim was that of collocations, so the empirical evidence available in relation to collocation learning was taken into account. The research questions were numerous and related both to effects produced on language competence and to those produced on learner attitudes. In this chapter, we restrict our focus to the former and cover the following research questions specifically:

RQ1: How does DDL influence the learning of collocations overall?
RQ2: What effect does DDL have on learning congruent and incongruent collocations?
RQ3: Is the effect modulated by frequency?

Method

The study was based on a controlled, longitudinal, between-groups design. A phraseological competence test built for the purposes of the study was administered at 4-week intervals over 12 weeks. A single 1-hour lesson a week was taught for 8 weeks across 8 classes of pre-intermediate Chinese learners of Italian aged between 18 and 27. The classes were randomly assigned either to the experimental condition (i.e., classes exposed to DDL materials on the selected learning aims) or to the control condition (i.e., classes exposed to non-DDL materials on the same learning aims). The cohort of participants comprised 123 learners, 61 in the experimental classes and 62 in the control classes. We focused on verb-noun collocations, selecting them from both the LOCCLI (Spina & Siyanova-Chanturia, 2018) and the PEC (Spina, 2014), a corpus of L1 Chinese learners of Italian and a reference corpus of Italian respectively. Each week, 8 collocations were presented in the context of a thematically coherent lesson. DDL activities were based on paper materials with printed concordance lines. These consisted of concordance-based pattern-hunting, matching and gap-fills. The control activities, on the other hand, consisted of single-sentence gap-fills, single split-sentence matching and single-sentence error correction. Congruency was determined on the basis of evaluations conducted by two separate L1 Chinese speakers with an advanced proficiency level in Italian. Phrasal frequency for each collocation was established by counting the total number of occurrences of the collocation in an Italian reference corpus. The last data collection point was administered 4 weeks after the end of the lessons, in order to analyse retention rates. The data was analysed through mixed-effects modelling (Linck & Cunnings, 2015).

Findings

The first research question looks at DDL effects broadly, in terms of how DDL affects collocation learning overall. The findings indicate U-shaped learning curves in both groups over time, with no significant differences (see Figure 5.2). The fact that the richer input typical of DDL does not necessarily lead to overall better language gains in comparison to a non-DDL approach could be explained by the duration of the intervention and the absence of preliminary training sessions. In terms of duration, it has been found that interventions of 10 sessions or more tend to lead to better results (Lee et al., 2019, p. 745). In our case, organisational reasons prevented us from having a more extended intervention. Longer interventions should be considered in future studies. There have also been suggestions to provide learners with preliminary training in using corpora and working with concordance lines. In our case, gaining familiarity with a corpus interface does not apply, as we used solely paper-based materials, so the preliminary training would concern gaining familiarity with concordances. While it has been shown that preliminary training on DDL is, in general, not a major influencing factor in DDL outcomes (Lee et al., 2019, p. 745), specific training focused on concordance work rather than on the software used to interrogate a corpus could be useful.

FIGURE 5.2 Experimental vs. control conditions

In the second research question, congruency is included as a variable. We observe that both conditions exhibit U-shaped learning patterns. Furthermore, incongruent collocations are learned significantly better in both groups, and the best retention rates (i.e., differences between times c and d) are observed for incongruent collocations in the experimental condition (see Figure 5.3). The fact that incongruent collocations are retained better in a DDL setting may be because engaging in DDL activities can lead to longer-lasting effects in terms of the robustness of learning. This may be triggered by the repeated encounters with multiple examples containing the same combination, and by the heavier cognitive load involved in DDL compared to traditional activities.

FIGURE 5.3 Experimental vs. control conditions with reference to congruency

The picture changes again if we add phrasal frequency (see Figure 5.4). We found that the interaction between phrasal frequency and L2 congruency is significant. Furthermore, as phrasal frequency increases, predicted probabilities for accuracy decrease for congruent collocations and increase for incongruent collocations. In other words, an increased frequency of occurrence in the input is more beneficial for incongruent collocations than for congruent ones, with no significant differences between the two conditions.

FIGURE 5.4 Experimental vs. control conditions with reference to congruency and frequency

5. Conclusions

This chapter aimed at connecting the dots between DDL, SLA and LOTEs. The emerging picture is multifaceted. By considering the connections between DDL principles and practices and SLA empirical evidence, we are in a position to develop a more nuanced view of DDL effects. We can explore the ways in which DDL can meet the actual needs of learners, as demonstrated by SLA research. We believe that by addressing actual learner needs, thus looking beyond the domain of DDL research and referring to all the sources of knowledge we have to determine

DDL, SLA and LOTEs


which ones these are, we can help spread DDL practices in contexts other than those addressed for research purposes. Research designs comparing DDL practices to more traditional ones can be key in clarifying the specificities of DDL. This can be pivotal when presenting DDL to language teachers, who remain the main agents of change when it comes to spreading the use of corpora in educational settings.

The connections between DDL and LOTEs must also be tighter. While the pedagogical principles on which DDL is founded can be (and have already been) applied to LOTEs, discussion of the implications of DDL for LOTEs is still very much under-developed. Such a discussion is needed, not only with reference to the corpus-based resources that are available for LOTEs, but also with reference to the specific linguistic characteristics of the languages involved, which can have a significant impact on the identification of learning aims. LOTEs can be characterised by a more complex morphology than English, as in the case of Italian. This can result in learning needs that are different from those identified on the basis of research on English L2 learning. The case for expanding the scope of DDL to include LOTEs thus extends to SLA studies as well, which also lack adequate focus on LOTEs. It is my hope that by highlighting the missing connections between DDL, SLA and LOTEs, new research can be conducted to help bridge the research/teaching gap and make DDL more relevant to learners and teachers of any language.

Notes

1 In this paper, the terms phraseology, formulaicity and multiword units are used interchangeably to refer to the tendency of words to co-occur. For a more detailed treatment of the notion and associated terminology, please refer to Espinal and Mateu (2019).
2 www.iris-database.org/details/wiFVJ-jee8y

Reference list

Altenberg, B. (1998). On the phraseology of spoken English: The evidence of recurrent word-combinations. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 101–122). Oxford University Press.
Altenberg, B., & Granger, S. (2001). The grammatical and lexical patterning of MAKE in native and non-native student writing. Applied Linguistics, 22(2), 173–195.
Amador Moreno, C. P., O'Riordan, S., & Chambers, A. (2006). Integrating a corpus of classroom discourse in language teacher education: The case of discourse markers. ReCALL, 18(1), 83–84.
Boulton, A., & Vyatkina, N. (2021). Thirty years of data-driven learning: Taking stock and charting new directions. Language Learning and Technology, 25(3), 66–89.
Bowker, L. (1998). Using specialised monolingual native-language corpora as a translation resource: A pilot study. Meta, 43(4), 631–651.
Chambers, A. (2005). Integrating corpus consultation in language studies. Language Learning and Technology, 9(2), 111–125.


Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text – Interdisciplinary Journal for the Study of Discourse, 20(1), 29–62.
Espinal, M. T., & Mateu, J. (2019). Idioms and phraseology. In M. T. Espinal & J. Mateu (Eds.), Oxford research encyclopedia of linguistics. Oxford University Press.
Forti, L. (2023). Corpus use in Italian language pedagogy: Exploring the effects of data-driven learning. Routledge.
Henriksen, B. (2013). Research on L2 learners' collocational competence and development – a progress report. In C. Bardel, C. Lindqvist, & B. Laufer (Eds.), L2 vocabulary acquisition, knowledge and use: New perspectives on assessment and corpus analysis. Eurosla Monographs Series, 2 (Vol. 2, pp. 29–56). EUROSLA.
Kennedy, C., & Miceli, T. (2001). An evaluation of intermediate students' approaches to corpus investigation. Language Learning & Technology, 5(3), 77–90.
Lee, H., Warschauer, M., & Lee, J. H. (2019). The effects of corpus use on second language vocabulary learning: A multilevel meta-analysis. Applied Linguistics, 40(5), 721–753.
Linck, J. A., & Cunnings, I. (2015). The utility and application of mixed-effects models in second language research: Mixed-effects models. Language Learning, 65(S1), 185–207.
Spina, S. (2014). Il Perugia Corpus: Una risorsa di riferimento per l'italiano. Composizione, annotazione e valutazione. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-It 2014 & the Fourth International Workshop EVALITA 2014 (Vol. 1, pp. 354–359).
Spina, S., & Siyanova-Chanturia, A. (2018). The longitudinal corpus of Chinese learners of Italian (LOCCLI). Poster presented at the 13th Teaching and Language Corpora Conference.
Wolter, B., & Yamashita, J. (2018). Word frequency, collocational frequency, L1 congruency, and proficiency in L2 collocational processing: What accounts for L2 performance? Studies in Second Language Acquisition, 40(2), 395–416.

5.1 DDL and productive vocabulary knowledge improvement

Dr. Zeynep Özdem-Ertürk

Luciana Forti's chapter has provided a concise definition of data-driven learning (DDL) and "DDL effects-oriented research", discussing the difficulties faced by learners in the domain of phraseology. Semantic opacity, infrequency of exposure, and incongruency were identified as key challenges for language learners in this chapter, which sought to investigate the impact of DDL on the enhancement of phraseological competence within the context of L2 Italian. Forti's study used paper-based DDL exercises as a means to enhance learners' understanding and proficiency in phraseology. Paper-based DDL activities can be developed to suit the technical support available or the proficiency level of the learners. Forti's study has inspired the development of an action research (AR) project aimed at incorporating direct corpus use into the learning process


of English major students, with the goal of enhancing their productive vocabulary knowledge. This presented an opportunity to implement a modification aimed at fostering greater student engagement and autonomy within the learning process. Initially, the participants underwent receptive and productive vocabulary assessments in order to identify the target words. The objective was to select words that the participants were familiar with receptively but not productively. Subsequently, the participants received instruction on how to conduct word searches in the reference corpus. Afterwards, they were required to compose three short stories using the target words, drawing on a systematic exploration of the corpus. They were given extensive background knowledge about the words through the use of a linguistically rich context. Instead of utilizing paper-based DDL materials or formal vocabulary instruction in the classroom, the objective of this study was to examine the effects of integrating direct corpus use for students majoring in English. By being exposed to a diverse range of linguistic input, participants were given the opportunity to autonomously develop the skills needed to comprehend new words and collocations. The examination of the short stories revealed that a significant number of usage errors were attributable to incorrect collocations. Over the course of the tasks, there was a noticeable improvement in performance, as shown by a decrease in the frequency of errors. Furthermore, participants' perceptions of corpus use were also investigated through reflections and semi-structured interviews, in addition to the assessment of vocabulary improvement in their short stories. The findings indicated that participants initially encountered challenges in using the corpus. However, their attitude shifted from negative to positive as they progressed from the first to the last assignment. I found the process of conducting this study to be quite stimulating. It was gratifying to observe that I successfully facilitated the acquisition of new words and fostered favourable shifts in the learners' perspectives. The forthcoming publication of this study will take the form of a book chapter.

5.2 DDL-based learning of Arabic syntax for Indonesian learners

Luthf Muhyiddin

Research on second language acquisition (SLA) and data-driven learning (DDL) has significantly contributed to our understanding of language learning and teaching. DDL emphasizes using authentic language data to enhance learners' language skills, whereas SLA delves into the intricate processes that underlie language acquisition. Incorporating SLA insights into DDL-focused research can further enrich


language pedagogy, offering a broader understanding of how learners engage with language data. In this context, I discuss how I have implemented Forti's advice in my Arabic teaching for non-native speakers. According to Forti (2023), DDL can be applied to language pedagogy in various ways, whether through direct or indirect approaches. Concurrently, Arabic corpus research has thrived, leading to the availability of corpus data for both Classical Arabic and Modern Standard Arabic (MSA). Regrettably, the use of Arabic corpora in teaching environments remains relatively unpopular (Zaki, 2016; Zaki et al., 2021). Nonetheless, corpus data can be applied both directly and indirectly in teaching (Bernardini, 2004). Hence, I have ventured into using corpora to teach Arabic syntax, particularly idiomatic structures and phrases.

Utilizing an Arabic corpus in teaching presents its own set of challenges. One significant challenge is that teachers must be familiar with information about individual words in order to guide learners in structuring sentences. Furthermore, educators can employ various software systems, known as 'concordancers', to analyze word groupings in sentences. As Meyer (2002) notes, these tools serve several purposes: (a) executing simple word searches; (b) presenting results in a unique format; and (c) computing frequency information for the searched item. These tools also provide data on word frequency, word lists, KWIC (KeyWord in Context), and collocations. For instance, when using ArabiCorpus to analyze the word Aljyll (الجليل), it was found to occur 3,313 times out of a total of 100,000 words. The word most frequently associated with it was the particle Fii (في), which appeared 855 times. As this data indicates, the word Aljyll is typically paired with the particle Fii in sentences.

Integrating SLA insights into DDL-focused research is promising for the evolution of language pedagogy. O'Keeffe and Farr (2003, p. 412) aptly concluded that "the more teachers know about corpora and how to utilize them, the better equipped they will be to evaluate corpus-based materials objectively."

Reference list

Bernardini, S. (2004). Corpora in the classroom. An overview and some reflections on future developments. In J. Sinclair (Ed.), How to use corpora in language teaching (pp. 15–36). Benjamins.
Forti, L. (2023). Corpus use in Italian language pedagogy: Exploring the effects of data-driven learning. Routledge.
Meyer, C. (2002). English corpus linguistics: An introduction. Cambridge University Press.
O'Keeffe, A., & Farr, F. (2003). Using language corpora in initial teacher education: Pedagogic issues and practical applications. TESOL Quarterly, 37(3), 389–418.
Zaki, M. (2016). Corpus-based teaching in the Arabic classroom: Theoretical and practical perspectives. International Journal of Applied Linguistics, 27. https://doi.org/10.1111/ijal.12159
Zaki, M., Wilmsen, D., & Abdulrahim, D. (2021). The utility of Arabic corpus linguistics. In K. Ryding & D. Wilmsen (Eds.), The Cambridge handbook of Arabic linguistics. Cambridge handbooks in language and linguistics (pp. 473–504). Cambridge University Press.
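The three concordancer functions Meyer describes in the commentary above, simple word searches, KWIC display, and frequency counts, are easy to picture in code. The following sketch is purely illustrative: the helper names `kwic` and `frequency` and the toy corpus are this commentary's own, not part of ArabiCorpus or any other tool mentioned.

```python
import re
from collections import Counter

def kwic(text, word, width=30):
    """Return KeyWord-in-Context lines: each hit with `width` characters of context."""
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(word), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword forms a vertical column.
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

def frequency(text):
    """Simple frequency list of lower-cased word forms."""
    return Counter(re.findall(r"\b\w+\b", text.lower()))

corpus = ("The teacher searched the corpus. "
          "Learners explored the corpus data and the corpus tools.")

print(len(kwic(corpus, "corpus")))   # prints 3 (number of hits)
print(frequency(corpus)["corpus"])   # prints 3 (frequency of the search item)
```

Real concordancers add tokenization, lemmatization, sorting and collocation statistics on top of this, but the core KWIC logic amounts to slicing a window of characters around each match.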

6 DDL SUPPORT FOR ACADEMIC ENGLISH COLLOCATIONS

Ana Frankenberg-Garcia

Introduction

Students and researchers who are not accustomed to writing about their work in English often face difficulties in recalling academic English collocations, i.e., prefabricated combinations of words that enhance the idiomaticity and fluency of texts (Sinclair, 1991), such as collect data, change dramatically, entirely appropriate, and so on. This has long been recognized not only for users of English as an additional language (Palmer, 1933; Szudarski, 2023), but also more recently for first-language English writers who have had less exposure to written academic English (Hyland, 2016; Frankenberg-Garcia, 2018). There is also a difference between understanding and using collocations. Even when fully able to process collocations during reading, writers may not be able to remember to use them in text production. Indeed, collocation recall – which is crucial for writing fluent, idiomatic texts – has been shown to be one of the later components of vocabulary acquisition (González-Fernández & Schmitt, 2020). Insufficient collocation proficiency can have a negative impact on essays, dissertations, theses, research articles, applications for funding, and other written academic outputs, especially for writers whose careers have developed outside an English-speaking medium (Flowerdew, 2019). For example, compare the readability of fine use of collocations with effective use of collocations. The former is generally perceived as being harder to process because fine is not a word readers of academic English texts expect to see in the context of use, whereas effective use is a common collocation that reads as a single prefabricated item. Despite the existence of excellent resources that can be consulted to improve academic writing – for example, textbooks, dictionaries, corpora, and even recent AI tools – users of academic English may be unaware of where or how to look up collocations, or they

DOI: 10.4324/9781003413301-6


may simply not notice that their lexical choices could be refined. Moreover, pausing to look up a collocation can be disruptive to text production, especially during cognitively demanding tasks such as academic writing (Frankenberg-Garcia, 2020). ColloCaid is a proof-of-concept solution designed to help users of academic English access collocation suggestions as they write.1 It draws on the principle of data-driven learning (DDL) pioneered by Johns (1991), whereby corpus-driven collocation suggestions are shown to writers in a way that encourages them to process the suggestions inductively. Evolving from early printed DDL concordances and direct corpus consultation, ColloCaid brings relevant corpus data (collocations and concordances) directly to digital writing environments (Frankenberg-Garcia, 2020). Underpinning ColloCaid is a large lexical database of general academic English collocations that has been integrated into an open-access online text editor. Instead of explaining collocations, the tool nudges writers to consult collocation prompts relevant to their own texts and to navigate interactive menus of collocation suggestions enriched with concordances showing collocations in context. By bringing this directly to writers without them having to consult external resources, ColloCaid can help users convey their intended meanings more efficiently, enhance the vocabulary and readability of their texts, expand their collocation repertoire, and even offset collocation avoidance or reassess misconceptions about what constitutes valid collocations (Pinto et al., 2023). This chapter will briefly explain how ColloCaid was conceived, describe how it works, and exemplify how it can be used in independent writing activities and in teaching academic English writing. It will conclude with a summary of user studies and future developments.

Developing an academic collocation writing assistance tool

The first step in the development of ColloCaid was the compilation of a lexical database covering core collocations used across disciplines in general academic English that could be useful to a wide range of people with less practice of academic English writing (Frankenberg-Garcia et al., 2019). The idea was to employ frequent academic words as prompts to help writers enhance their use of collocates surrounding those words. Rather than starting from scratch, the database compilation made extensive use of existing academic word lists and corpora. Lemmas featured in the Academic Keyword List (Paquot, 2010), the Academic Collocation List (Ackermann & Chen, 2013), and the student-focused subset of the Academic Vocabulary List (Gardner & Davies, 2014) compiled by Durrant (2016) were used as a foundation for selecting frequently occurring general academic English words. After refining this selection according to the criteria detailed in Frankenberg-Garcia et al. (2021), a total of 572 high-frequency and collocationally productive general academic English nouns, verbs, and adjectives were researched in two corpora of expert academic English – the Oxford Corpus of Academic English (Lea,


2014) and the written component of the Pearson International Corpus of Academic English (Ackermann et al., 2011). Over 32,000 academic collocation suggestions that met the collocation strength criteria set out in the study were integrated into the ColloCaid database.2 Also included in the database are concordances exhibiting nearly 30,000 authentic examples of collocations in use that were curated from the same corpora. The next step was to develop a prototype that would enable writers to consult the ColloCaid database directly from a text editor. An open-source online editor – TinyMCE3 – was adapted for this purpose so as to facilitate user access and experiments without any software installation. After initial feedback from experts in lexicography and academic writing, a number of academic writing workshops featuring the tool helped to further refine the integration of the lexical database with the text editor. Users can write directly on the ColloCaid editor or copy a previously written text onto it. The ColloCaid tab on the editor allows users to hide and unhide collocation prompts. These prompts are also automatically activated when users press the space bar. When collocation prompts are turned on, the editor underlines (in a green dotted line) the 572 general academic words in the database for which collocation suggestions are available. Figure 6.1 shows that there are collocation suggestions available for tool, develop, general, and advantage. Prompts can be ignored if writers do not wish to look up collocations, or they can be followed up through a series of interactive menus. The first menu simply lists different types of lexical collocations pertaining to an academic word (e.g., verb + noun, noun + adjective, and so on), prioritizing strong, high-frequency collocates. Relevant grammatical collocations are also presented. In Figure 6.1, for example, it can be seen that take is a strong verbal collocate and selective is a

FIGURE 6.1 ColloCaid prototype displaying interactive menus in default In-Text View


strong adjectival collocate of advantage. Prepositional collocates like advantage of and advantage over are also shown. The system is interactive and incremental because if writers find the information they need then and there, they can simply click on the collocation to insert it into their texts and carry on writing without opening any further menus. If the collocations shown in the first menu are not exactly what writers need, they can click on a collocation type to open a second menu with further collocations of the same type. For example, the second menu in Figure 6.1 lists seven adjectival collocations for advantage, all of which are enhanced with concordances showing how to use those collocations in context. Again, if users have found the adjective they were looking for, they can click to insert it automatically and get back to their writing. If not, they can click on More to open a sidebar with further adjectival collocation suggestions. Users can change their viewing settings from the default In-Text View, which uses context menus, to Tree View, which displays the same interactive menus on a sidebar if preferred. This is particularly recommended when working with large fonts or small screens so that in-text context menus are not superimposed over the text written on the editor (see Figure 6.2). It is important to note that ColloCaid does not check spelling or grammar, nor does it have many of the other functions that users expect to see in text editors. It is simply a proof-of-concept tool that provides academic English collocation suggestions. ColloCaid developers do not store or have access to the texts written by users. Texts will be lost when the ColloCaid browser window is closed, so writers must download a copy of their texts or copy them onto their regular text editors for saving.4

FIGURE 6.2 ColloCaid prototype displaying interactive menus in Tree View


Using ColloCaid

ColloCaid was conceived primarily to be used independently by writers, but it can also be used to teach collocations in general and written academic English collocations in particular. When used autonomously, ColloCaid can work simply as an aide-memoire for academic writers who cannot think of a good word to use in a given syntagmatic slot. For example, what verb can be used in the context of The target was ____ last year? ColloCaid underlines the academic word target as a prompt and, when followed up, provides a list of verbs that can be used in the context, such as meet, reach, achieve, miss, set, and so on. Note that while the first three of these verbs can be said to be near-synonyms, the last two have entirely different meanings. It is up to ColloCaid users to pick the collocation that matches their intended meanings. ColloCaid does not provide dictionary definitions to help discriminate between different collocates, as its aim is not to teach new words, but rather to help writers activate vocabulary that they are already familiar with. This can include collocations writers have used before but simply cannot recall, or collocations that they have previously seen when reading but not yet used in their writing. To further encourage writers to start using collocations they are less confident about, ColloCaid also displays concordance lines showing the collocation in context. The tool thus encourages writers to expand their productive vocabulary, in what González-Fernández and Schmitt (2020) have referred to as an advanced stage of vocabulary acquisition. Another way of using ColloCaid independently is by copying a draft of a text that writers are working on onto the editor. When activating collocation suggestions (as seen earlier, this can be done by simply hitting the space bar), the tool highlights all words for which there are collocation suggestions, and writers can use them as prompts to refine the vocabulary of initial drafts. 
In the classroom, ColloCaid can be used at a very basic level simply to raise awareness of grammatical collocations. Learners can be asked to copy a set of gapped sentences with missing prepositions, like examples (a–c) below, onto the ColloCaid text editor and to fill in the gaps before activating the tool to compare their answers with the prepositions it suggests.

a. They need to raise more awareness . . . . . . . . . . the problem.
b. There was a sharp increase . . . . . . . . . . prices.
c. This chapter consists . . . . . . . . . . three sections.

An important first step to prepare for the exercise is to raise awareness of the collocation node by helping learners understand which word governs or acts as a trigger for the preposition required in the gap. In the examples given, the nodes are respectively awareness, increase, and consists. When checking whether the prepositions learners initially supplied match the preposition suggestions given in ColloCaid,


it is recommended to draw attention to any relevant language transfer errors like *awareness to, *increase of, and *consists in, and to point out that ColloCaid does not identify errors. ColloCaid only shows the correct collocations awareness of, increase in, and consists of. It is up to the user to interpret negative evidence, i.e., the absence of suggestions like *awareness to and *consists in, or to notice the different environments in which increase of and increase in are used in a DDL approach that displays concordance lines showing the target collocation in context. However, users who have subscribed to error-correction tools like Grammarly5 will notice that these work alongside the ColloCaid editor, correcting spelling, grammar, repeated words, and unwanted extra spaces in texts. Gap-filling exercises can also be used to teach lexical collocations. In example (d), learners need to supply an adverb that collocates with adopted; in (e), they need to fill the gap with an adjective that collocates with issue; and in (f), they need to provide a verb that collocates with activities.

d. The new measure was . . . . . . . . . adopted.
e. This is a . . . . . . . . . issue in the development of new research.
f. The experimenters had to . . . . . . . . . . the activities of each group.

Lexical collocation exercises can be considered more advanced, since several collocation options are provided for each gap. As mentioned earlier, ColloCaid does not offer dictionary definitions to help discriminate between them. For example, in (d), the top eight adverbial collocates supplied for adopted are widely, formally, finally, universally, readily, subsequently, quickly, and increasingly. Learners are therefore encouraged to reflect not just on grammatical accuracy, but also on meaning. If needed, learners can click further to see concordances for each collocation, which can help them discriminate between the options in a DDL approach.
In (e), the first eight collocation suggestions given for adjective + issue are key, important, major, critical, related, complex, central, and difficult. Before reaching them, however, there is an extra language awareness step to be taught, since learners must first select whether they wish to pursue collocations for issue as a verb or as a noun (Figure 6.3); only after selecting noun will they be provided with suitable adjectives. In (f), there is an additional layer of complexity to be taught, because learners are shown verbs that collocate with activities as the sentence subject (e.g., activity generate) and verbs that collocate with activities as the sentence object (e.g., undertake activity). Therefore, before selecting a semantically fitting collocation, learners need to select which type of collocation is appropriate (see Figure 6.4). Going over these non-trivial issues with learners in the classroom can help them develop the language awareness needed to optimize DDL and become better independent users of ColloCaid. Although there is no space here to provide suggestions for further classroom activities using ColloCaid, more examples are given in Pinto et al. (2023).


FIGURE 6.3 Verb/noun disambiguation prompt for issue in ColloCaid In-Text View

FIGURE 6.4 Collocates for activity in ColloCaid Tree View


User studies

As explained earlier, ColloCaid was initially trialled with users as a means of obtaining feedback on tool development. At a very early stage, ColloCaid version 0.1 was presented to users of academic English in Poland and France, and preliminary feedback served to make several improvements to the text editor integration. More advanced versions of the tool were then integrated into a series of four workshops for academic English writers (63 university professors and postgraduate students writing about their work in English) and 52 English teachers at two Brazilian


universities. As reported by Frankenberg-Garcia et al. (2022), in feedback collected immediately after the workshops, both writers and English teachers declared that they were very likely to continue using the tool for writing and teaching writing. In delayed feedback collected one year later, 54% of the writers and 93% of the teachers had indeed continued to use the tool. These figures are quite encouraging. They suggest that the teachers welcomed the tool and that the writers who did continue to use it may have become more aware of collocations and the need to look them up. This is especially relevant considering that writers tend to overestimate their knowledge of collocations (Laufer, 2011; Frankenberg-Garcia, 2018; Pinto et al., 2021). Indeed, one participant in the Brazilian workshops explained: "I like the fact that it gives me combinations of words that sound more formal and academic than the ones I would think [sic] myself". In another study, Rees (2021) compared the use of ColloCaid with other self-selected lexical reference tools (e.g., online dictionaries, search engines, machine translation, and so on) in terms of how it was perceived when measured against the NASA Task Load Index (Hart & Staveland, 1988). A group of 106 students at a Spanish university were asked to fill in gapped sentences designed to elicit a range of lexical and grammatical collocations in the two conditions. ColloCaid significantly outperformed the other tools in terms of the participants' perceptions of mental, physical, and temporal demands, as well as in terms of the effort required and the degree of frustration experienced. Rees (2021) concluded that ColloCaid offers significant overall advantages over other collocation resources. The most recent user study evaluated the effect of using ColloCaid in more detail, beyond self-reporting, by examining actual revisions made by academic writers using the tool.
Michta and Frankenberg-Garcia (2023) asked 27 English philology students at a Polish university to use ColloCaid to revise an approximately 600-word excerpt from their final-year BA or MA dissertations before submitting them to their supervisors. The researchers found that the tool offered good vocabulary coverage, providing collocation suggestions for, on average, 13.6 lemmas in each excerpt. The participants followed up on 54% of the suggestions offered by ColloCaid and did not act on 46% of the cues given. This is interesting because it suggests that the participants exercised critical judgement in using the tool. The main reasons for not following up a collocation prompt were that the participants felt they did not need a collocate or that the collocation they had used was already appropriate. When they did revise collocations, it was mainly because they wanted to improve the vocabulary of their texts, even though they did not think their initial words were incorrect. It is also encouraging that, according to an external rater, over 50% of the revisions resulted in better use of collocations, against 14% of changes leading to a negative outcome (the remaining changes either could not be evaluated because of lack of context or were unnecessary because a good collocation was replaced with an equally good one). Finally, information about who the users of the ColloCaid prototype are can be gleaned from the number and profile of its registered users. The latest figures (from


August 2023) show over 8,000 registrations from all over the world. The majority of users (88%) use ColloCaid for writing and revising, but there is a growing number of users employing the tool for teaching (8%). Users are mostly master's, PhD, and undergraduate students (31%, 29%, and 18% respectively), but there is also a sizable number of professors, lecturers, and research fellows registered to use ColloCaid (12%). The language background of users also varies considerably. While it is not unexpected that 83% write in English as an additional language, the fact that 27% of registered users are first-language users of English gives a good indication that ColloCaid can indeed benefit them too.

Conclusion and future developments

Although there are valuable resources available to help writers with collocations, ColloCaid differs in that it assists academic English users in accessing collocation suggestions while they write, in a way that promotes DDL. However, ColloCaid has a large potential for impact that has yet to be explored. The main limitation of our proof-of-concept prototype is that it does not offer certain powerful features that writers have come to expect from commercially developed mainstream text editors. Therefore, future plans include converting the prototype and its built-in lexical database into a plug-in that can be integrated with mainstream text editors. In this way, we aim to bring DDL support for academic English collocations directly to the text editors that writers normally use.

Acknowledgements

ColloCaid was funded by the Arts and Humanities Research Council (AHRC), ref. AH/P003508/1.

Notes

1 ColloCaid is free to use and can be accessed at www.collocaid.uk/.
2 The logDice statistic was used to assess collocation strength, leading to a far greater selection of collocations than the 2,469 featured in the Academic Collocations List (Ackermann & Chen, 2013). A sample of the ColloCaid lexical database can be downloaded from https://doi.org/10.6084/m9.figshare.13028207.
3 See www.tiny.cloud/
4 For a short video showing the basic features of the ColloCaid editor, see www.collocaid.uk/what-are/#video.
5 See www.grammarly.com/

Reference list

Ackermann, K., & Chen, Y.-H. (2013). Developing the academic collocations list (ACL) – A corpus-driven and expert-judged approach. Journal of English for Academic Purposes, 12, 235–247.


Ana Frankenberg-Garcia

Ackermann, K., de Jong, J. H. A. L., Kilgarriff, A., & Tugwell, D. (2011). The Pearson International Corpus of Academic English (PICAE). https://www.birmingham.ac.uk/documents/college-artslaw/corpus/conference-archives/2011/Paper-47.pdf
Durrant, P. (2016). To what extent is the academic vocabulary list relevant to university student writing? English for Specific Purposes, 43, 49–61. https://doi.org/10.1016/j.esp.2016.01.004
Flowerdew, J. (2019). The linguistic disadvantage of scholars who write in English as an additional language: Myth or reality. Language Teaching, 52(2), 249–260. https://doi.org/10.1017/S0261444819000041
Frankenberg-Garcia, A. (2018). Investigating the collocations available to EAP writers. Journal of English for Academic Purposes, 35, 93–104.
Frankenberg-Garcia, A. (2020). Combining user needs, lexicographic data and digital writing environments. Language Teaching, 53(1), 29–43.
Frankenberg-Garcia, A., Lew, R., Roberts, J. C., Rees, G. P., & Sharma, N. (2019). Developing a writing assistant to help EAP writers with collocations in real time. ReCALL, 31(1), 23–39.
Frankenberg-Garcia, A., Pinto, P. T., Bocorny, A. E., & Sarmento, S. (2022). Corpus-aided EAP writing workshops to support international scholarly publication. Applied Corpus Linguistics, 2(3), 1–13. https://doi.org/10.1016/j.acorp.2022.100029
Frankenberg-Garcia, A., Rees, G. P., & Lew, R. (2021). Slipping through the cracks in e-lexicography. International Journal of Lexicography, 34(2), 206–234. https://doi-org.surrey.idm.oclc.org/10.1093/ijl/ecaa022
Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327. https://doi.org/10.1093/applin/amt015
González-Fernández, B., & Schmitt, N. (2020). Word knowledge: Exploring the relationships and order of acquisition of vocabulary knowledge components. Applied Linguistics, 41(4), 481–505. https://doi-org.surrey.idm.oclc.org/10.1093/applin/amy057
Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Human Mental Workload, 1(3), 139–183. https://doi.org/10.1016/S0166-4115(08)62386-9
Hyland, K. (2016). Academic publishing and the myth of linguistic injustice. Journal of Second Language Writing, 31, 58–69. https://doi.org/10.1016/j.jslw.2016.01.005
Johns, T. (1991). Should you be persuaded: Two samples of data-driven learning materials. English Language Research Journal, 4, 1–16.
Laufer, B. (2011). The contribution of dictionary use to the production and retention of collocations in a second language. International Journal of Lexicography, 24(1), 29–49. https://doi-org.surrey.idm.oclc.org/10.1093/ijl/ecq039
Lea, D. (2014). Making a learner's dictionary of academic English. In Proceedings of the XVI EURALEX international congress: The user in focus (pp. 181–190).
Michta, T., & Frankenberg-Garcia, A. (2023). The impact of invisible lexicography on the self-revision of academic English collocations. In M. Medved (Ed.), Electronic lexicography in the 21st century (eLex 2023, pp. 18–19). Book of Abstracts.
Palmer, H. E. (1933). Second interim report on English collocations. Kaitakusha.
Paquot, M. (2010). Academic vocabulary in learner writing: From extraction to analysis. Continuum.
Pinto, P. T., Crosthwaite, P., de Carvalho, C. T., Spinelli, F., Serpa, T., Garcia, W., & Ottaiano, A. O. (Eds.). (2023). Using language data to learn about language: A teachers' guide to classroom corpus use. University of Queensland. https://doi.org/10.14264/3bbe92d

Rees, G. P. (2021). Measuring user workload in e-lexicography with the NASA task load index. In I. Kosem & M. Cukr (Eds.), Electronic lexicography in the 21st century (eLex 2019): Post-editing lexicography (pp. 54–55). Book of Abstracts.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press.
Szudarski, P. (2023). Collocations, corpora and language learning. Cambridge University Press.

6.1 Collocation assistance in academic writing environments
Magdalena Zinkgräf

Collocation has long been established as an obstacle for foreign language learning. Frankenberg-Garcia's ColloCaid project addresses this challenge in the realm of academic English writing through an open-source editor which provides in-text collocation suggestions based on lexicographic data from the largest compilation of academic English collocations. ColloCaid presents a number of clear advantages, such as scaffolded access to stronger collocations, both lexical and grammatical, for nodes differentiated by meaning; collocates arranged in decreasing order of frequency; and optional, simplified concordances illustrating each. However, trialling this useful tool on an essay by a university student in Argentina reveals some caveats that need to be acknowledged. First, non-expert writers may feel compelled to accept suggestions in the immediate vicinity of certain typical academic words unquestioningly, regardless of whether these collocates are already present in their text. Moreover, since ColloCaid does not detect or question the appropriateness of the text surrounding the node, suggestions may be offered that are not grammatically necessary, or even possible, in those contexts. On a larger scale, ColloCaid operates under the assumption that our thoughts start mainly from objects and nouns, which are the nodes it targets. Although possibly true for native speakers of a language, in foreign language academic writing the wrong choice of an L2 node based on learners' L1 might trigger non-standard word combinations, which could go undetected by the writing assistant. Furthermore, ColloCaid claims to help learners "convey their intended meanings more effectively" (Frankenberg-Garcia, 2020).
The collocation prompts and concordances in the interface may not necessarily reflect the meanings intended by less proficient academic English writers, who may be unaware of the typicalities of the academic corpus compiled. In this regard, the effectiveness of ColloCaid seems to depend on writers' proficiency in the foreign language, specifically in academic contexts: they need to have chosen the correct node for their meanings and to be able to discern the pros and cons of selecting any of the collocates suggested. To avoid such uncritical acceptance and probable mistakes, prior training in the use of this text editor might be essential.

In the academic English courses I teach in Argentina, concordance line analysis and corpus searches are essential in the exploration of text types like abstracts, academic biographies and research articles, though not without hindrances, as enumerated in Frankenberg-Garcia's chapter: navigating user-unfriendly resources, building the query and interpreting the data become major obstacles for these academic writers. Despite the learning gains derived from these activities (O'Keeffe, 2021) as input for academic writing, attempts at inductive DDL through corpus consultation in this context often leave learners with an overload of information, which either distracts them away from the node and its typical collocates in the raw data or restricts their perspective to only adjacent words. With some training in its use and in the range and scope of the informed decisions to be made, learners in these academic writing courses could obtain systematic and targeted collocation assistance given ColloCaid's multiple advantages.

Reference list

Frankenberg-Garcia, A. (2020). Combining user needs, lexicographic data and digital writing environments. Language Teaching, 53(1), 29–43.
O'Keeffe, A. (2021). Data-driven learning: A call for a broader research gaze. Language Teaching, 54(2), 259–272. https://doi.org/10.1017/S0261444820000245

6.2 Corpus tools for collocation learning
Benefits and challenges
Liang Li

Collocation knowledge is important for language users, as it not only helps them process information efficiently and produce texts in chunks, but also enhances the accuracy and fluency of the texts produced. However, it poses a particular challenge to second language (L2) speakers due to the interference of their native language collocations, limited exposure to L2 contexts, and the blurry boundaries of collocations (Frankenberg-Garcia et al., 2019; Nesselhauf, 2005). To facilitate collocation learning, a variety of corpus tools have been developed, including the Learning Collocations collection in FLAX (Witten et al., 2013), Word Sketch in SkELL (Baisa & Suchomel, 2014), part-of-speech tags in Linggle (Boisson et al., 2013), and Pattern Finder in CorpusMate (Crosthwaite & Baisa, in press). Each tool comes with its strengths and weaknesses. FLAX, for instance, offers an extensive range of collocations, but its sentence examples often feature low-frequency words and complex sentence structures. SkELL, on the other hand, provides more comprehensible examples but presents only up to 15 collocations of each type. In contrast, ColloCaid (Frankenberg-Garcia et al., 2019) provides real-time academic English collocation suggestions based on written texts.

This tool was introduced to my L2 Chinese students at a New Zealand university, as a firm grasp of academic English collocations is crucial for their study. The students found the interface of ColloCaid appealing, as it mirrors popular document editors like Microsoft Word and Google Docs. They appreciated the collocation prompts, the manageable array of collocation options, the brief and understandable sentence fragments, and the handy click-to-apply function. However, some students found the provided collocations to be contextually irrelevant, leading to a time-consuming and sometimes fruitless search, although the searching process undoubtedly boosts their collocation learning. The students suggested an auto-correction feature for their collocations, particularly given the proliferation of AI-powered writing tools like Grammarly, Writefull, and Quillbot. Overall, they expressed interest in using this tool in their future writing and expect to see more advanced features. The needs of these learners could well propel the tool's development into its next stage, e.g., the incorporation of AI technology into corpus tools.

Reference list

Baisa, V., & Suchomel, V. (2014). SkELL: Web interface for English language learning. In Eighth workshop on recent advances in Slavonic natural language processing (pp. 63–70). NLP Consulting.
Boisson, J., Kao, T., Wu, J., Yen, T., & Chang, J. S. (2013). Linggle: A web-scale linguistic search engine for words in context. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics: System demonstrations (pp. 139–144). Association for Computational Linguistics.
Crosthwaite, P., & Baisa, V. (in press). Introducing a new, child-friendly corpus tool for data-driven learning: CorpusMate. International Journal of Corpus Linguistics.
Frankenberg-Garcia, A., Lew, R., Roberts, J. C., Rees, G. P., & Sharma, N. (2019). Developing a writing assistant to help EAP writers with collocations in real time. ReCALL, 31(1), 23–39. https://doi.org/10.1017/S0958344018000150
Nesselhauf, N. (2005). Collocations in a learner corpus. John Benjamins Publishing Company.
Witten, I. H., Wu, S., Li, L., & Whisler, J. L. (2013). The book of FLAX: A new approach to computer-assisted language learning (2nd ed.). University of Waikato. http://flax.nzdl.org/BOOK_OF_FLAX/BookofFLAX%202up.pdf

7 EXTENSIVE READING AND VIEWING FOR VOCABULARY GROWTH
Corpus insights
Clarence Green

Introduction

The theme of this chapter is that we need to use corpora as a form of data science to inform educational decisions about extensive reading (ER) and related pedagogical approaches to vocabulary acquisition, such as extensive viewing (EV). ER/EV involves wide reading or viewing of comprehensible, typically self-selected input, in large volumes over time (Nation, 2014). The key idea of this chapter is that insights can be gained from using a corpus to model possible real-world outcomes of ER/EV. Corpus research can provide essential information on multiple problems in vocabulary research and teaching, including: How much vocabulary do students need? What type of vocabulary? What contribution might ER/EV make?

Our current 'best guess' for the vocabulary size required for general reading proficiency in English is about 9000 word families (Schmitt et al., 2017). This figure rests on two research findings. The first is that the most frequent 9000 word families of English cover more than 95% of the words in corpora representing the language (Nation, 2014). The second is that quasi-experimental data indicate that, for students to comprehend a text, they need to know at least 95%, and optimally 98%, of the words on the page (Hu & Nation, 2000). Taken together, these findings suggest 9000 word families is probably sufficient for comprehension of most English input. The picture is somewhat more complex: comprehension and vocabulary knowledge are more correlational than percentage thresholds might suggest (Schmitt et al., 2011), and the relationship is mediated by other variables such as background knowledge and topic familiarity. Nevertheless, this is a useful framework for language education that provides vocabulary size learning goals and helps calibrate learners' proficiency levels with potentially comprehensible text (Hiebert, 2020). Corpus research combined with the experimental work therefore narrows down the answer to 'how much vocabulary we need'.

DOI: 10.4324/9781003413301-7

Corpus methods can also tell us where this vocabulary might come from, such as how far a learner might get towards their vocabulary size goals through ER or EV. Before coming to how corpora can do this, it is helpful to describe some essential characteristics of ER/EV. These are popular approaches to developing vocabulary, in which second-language vocabulary is incidentally learned as a by-product of reading comprehensible input. The theoretical framework is essentially that unknown words are acquired when repeatedly encountered in input a learner understands, because the message and known surrounding words provide sufficient support to learn a new word's form, meaning and use. So, with ER, students are encouraged to read a large volume of print (typically fiction) that they find interesting or enjoyable and choose for themselves. This makes the input both comprehensible and 'compelling' (Krashen, 1989), and vocabulary learning occurs naturally as a by-product of multiple exposures over time. Extensive viewing and extensive listening (EL) share the same ideas but employ different modalities. EL can focus on improving listening skills as well as vocabulary. EV offers multimodal input, in which vocabulary information comes from multiple information sources. For example, new words may be seen in the context of some action, and learners can also simultaneously hear the phonological form of the word and see its print form by turning on subtitles (Renandya & Iswandari, 2022). Since ER/EV involves large volumes of text (including subtitle transcripts) and repeated exposure to vocabulary in that input, corpus research can help us understand its potential because it can model the input. Data analytics can be conducted on a corpus of fiction, movies, TED talks, podcasts etc. as a proxy for ER/EV input.
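A lexical coverage analysis of such a proxy corpus can be sketched in a few lines. The following is a minimal, illustrative Python version; real profilers such as AntWordProfiler work with curated word-family lists, whereas the tokeniser, toy text and tiny family mapping below are invented for illustration only.

```python
import re

def lexical_coverage(text, known_families, family_of=lambda w: w):
    """Percentage of running words (tokens) in `text` that belong to a
    known word family -- the core calculation of a vocabulary profiling
    (lexical coverage) study."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    # A token is 'covered' if its word family is in the known set.
    covered = sum(1 for t in tokens if family_of(t) in known_families)
    return 100 * covered / len(tokens)

# Toy illustration: a 'vocabulary size' of three word families and a
# hand-made mapping from inflected forms to family headwords.
known = {"the", "cat", "sit"}
families = {"cats": "cat", "sat": "sit"}
text = "The cat sat on the mat"
coverage = lexical_coverage(text, known, lambda w: families.get(w, w))
print(round(coverage, 2))  # 66.67 -- 4 of the 6 tokens are covered
```

At scale, and with principled frequency-ranked family lists, the same calculation yields statements of the kind discussed below, such as which vocabulary size gives 95% or 98% coverage of a given text.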
Analysis can tell us how many words are needed to cover these kinds of input or, put another way, whether they might constitute comprehensible input for learners of certain vocabulary sizes. Such studies are called vocabulary profiling or lexical coverage studies (Green, 2020). For example, the vocabulary of a corpus of movie scripts is covered by approximately 5000 word families (Nation, 2014), so a reasonable conclusion is that movies are comprehensible input for learners with this vocabulary size; and if a learner's proficiency goal is to watch English-language movies, then the target amount of vocabulary is known and can inform the curriculum. However, not all movies will be comprehensible with 5000 words, as there are likely to be different lexical demands between, for example, Fast and Furious and Schindler's List. Lexical profile studies can further show us this variation and provide information on how many movies (or books, talks, podcasts etc.) might be comprehensible, and for which student vocabulary sizes.

In addition, corpus research can guide us on whether ER/EV might provide enough exposure to vocabulary for incidental learning. As with coverage and comprehension, no number is precise when it comes to how many encounters are needed to learn a new word. Uchihara et al. (2019) report that form-meaning mappings might be possible after 2–4 exposures, and word associations after 10–17. Interestingly, the numbers do not vary much between first- and second-language educational research: Hiebert (2020) reports 10 for elementary school in the US; Nation (2014) suggests 12 for adult L2 learners; the US Department of Education suggests 17 (Kamil et al., 2008). The estimates are broadly similar, which makes sense because theoretically the range of exposures cannot be too high: if hundreds of exposures were needed, it would be impossible to learn a language, first or second. The analysis of a corpus designed to represent ER/EV input provides information on which vocabulary occurs at the hypothesised rates required for learning.

The use of a corpus for modelling ER/EV: example studies
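The example studies discussed in this section share the same basic machinery: count how often target words occur in a proxy corpus, check what share of the targets meets a hypothesised exposure threshold, and convert corpus size into reading time. A minimal Python sketch of that machinery follows; the toy corpus, word list and function names are invented for illustration and are not taken from any of the studies.

```python
import re
from collections import Counter

def exposures(corpus_text, targets):
    """Occurrences of each target word in the corpus -- the proxy for
    how often a reader would encounter it."""
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))
    return {w: counts[w] for w in targets}

def share_at_threshold(exposure_counts, threshold):
    """Proportion of targets met at least `threshold` times -- a proxy
    for the share with a chance of incidental learning."""
    met = sum(1 for n in exposure_counts.values() if n >= threshold)
    return met / len(exposure_counts)

def days_to_read(corpus_words, wpm=150, minutes_per_day=60):
    """Days needed to read the whole corpus under a reading-habit
    parameter (e.g. 1 hour a day at 150 wpm)."""
    return corpus_words / (wpm * minutes_per_day)

# Toy 'corpus' standing in for millions of words of fiction.
corpus = "analyse " * 12 + "concept " * 5 + "data " * 30
counts = exposures(corpus, ["analyse", "concept", "data", "theory"])
print(counts)                          # {'analyse': 12, 'concept': 5, 'data': 30, 'theory': 0}
print(share_at_threshold(counts, 10))  # 0.5: 'analyse' and 'data' reach 10+ exposures
print(days_to_read(3_000_000))         # ~333 days to read a 3-million-word corpus
```

Scaled up to real corpora and real word lists, this is essentially the computation behind the models discussed next.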

The previous section outlined issues that corpus linguistics can address. This section illustrates, through example studies, how this kind of research has been conducted and what can be learned. The use of a corpus for modelling ER/EV, then using the model outcomes to inform pedagogical decision-making, is a form of data science. Data science is an increasingly popular interdisciplinary field that focuses on 'data mining' (i.e., analytics) of large volumes of digital data. It typically aims to build models that simulate real-world environments using simplified parameters, with the model outcome used to inform real-world decisions. The real world we are concerned with is language education, and more specifically, in this chapter, the value of ER/EV as a vocabulary pedagogy.

Let us first consider the possible contribution of ER to academic vocabulary. McQuillan (2019) was interested in the possibility that ER of juvenile fiction might provide input from which academic vocabulary could be learned. In a corpus of 22 novels (Harry Potter, Twilight etc.), he computed occurrences of Coxhead's (2000) Academic Word List (AWL), asking how much of the AWL was repeated 10, 12 or 25 times – rates that might correlate with incidental learning. He set some model parameters, i.e., reading 1 hour a day at 150 words per minute (wpm). The model showed that 85% of the AWL occurred, and quite a lot of it at learnable rates: e.g., 20–37% within 1 year of reading 1 hour a day at 150 wpm. This supports a teacher having confidence in the ER of juvenile fiction to support academic vocabulary learning. Green (2022a) looked broadly at both ER and EV, and at two types of academic vocabulary: the general academic AWL (Coxhead, 2000) and the discipline-specific Secondary Vocabulary Lists (SVL) (Green & Lambert, 2018).
The rationale was the same as McQuillan's (2019): since there are curriculum time constraints, corpus insights can help determine which words might not require teaching, by virtue of being widely available in ER/EV. Green (2022a) created a corpus of fiction (21,226,768 words) and television/movie scripts (33,045,952 words) from the Corpus of Contemporary American English (Davies, 2010) as the proxy for ER/EV. He computed exposure to the target vocabulary, then modelled the time it would take to meet targets at different reading rates (100, 260, 350 wpm), viewing rates

TABLE 7.1 Input and Time Estimates for Curriculum Planning (Green, 2022a)

Word               12 x Input Required (Words)    @ 100 wpm (Reading Hours)

AWL: ER
APPARENT           136,287                        22.71
INVESTIGATE        137,985                        23.00
REVEAL             142,861                        23.81
EVIDENT            145,389                        24.23
APPARENT           136,287                        22.71

SVL (Biology): ER
INTENSITY          740,469                        123.41
HEARTBEAT          742,627                        123.77
MUSCULAR           751,390                        125.23
INSECT             758,099                        126.35
(80, 140, 200 wpm) and exposure rates (6, 12, 20 times). He also produced a tool for teachers (a spreadsheet with embedded Excel formulas) that estimates the curriculum time needed to meet specific academic vocabulary. A sample of the outcome of this study is given in Table 7.1.

Overall, Green (2022a) found that specialised and general academic vocabulary occur rather extensively in ER/EV. All AWL words occurred in the ER/EV proxy corpus. In under 3 years of reading at 45–60 minutes a day, 98.5% occurred at least 12 times reading at 260 wpm, and 93% at 100 wpm. EV takes longer but with similar results. For disciplinary vocabulary, 40–60% had sufficient encounters for a chance of incidental learning. As Table 7.1 shows, even for a slow reader at 100 wpm, ER would provide sufficient exposure to AWL targets such as apparent, investigate and reveal in 22–24 hours of reading. Even biology-related words such as muscular, intensity and heartbeat occur at least 12 times in around 125 hours of reading.

The previous discussion has focussed on the contribution ER/EV might make to a specific type of vocabulary, but teachers also want to know how far ER might take a student towards the general vocabulary goal of 9000 words. This question also has theoretical importance. For example, Krashen (1989) argues that incidental learning is both more effective and efficient than explicit instruction, and that ER might "do the entire job" (p. 448), or most of it. Corpus insights can provide valuable information here. In terms of modelling, a few things need to be shown. First, enough books must exist with 95–98% vocabulary coverage for vocabulary sizes from beginner to proficient, demonstrating that input "at the right level" (Nation, 2014, p. 7) is possible. Second, in these books, vocabulary up to the 9000-word target needs to occur at rates adequate for incidental learning. Third, an input and time parameter is needed to evaluate whether the number of words, and the time to read them, is 'reasonable'; e.g., a model showing 30 years of reading would suggest learning vocabulary this way is unreasonable.
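The input-and-time parameter is plain arithmetic. A short Python sketch (the function names are mine, not Green's) reproduces reading-hours figures of the kind reported in Table 7.1 and shows how a 'reasonableness' check in years can be derived from the same numbers.

```python
def reading_hours(input_words, wpm):
    """Hours of reading needed to get through a given volume of input."""
    return input_words / (wpm * 60)

def years_at(input_words, wpm, minutes_per_day):
    """Calendar years to cover that input at a daily reading habit."""
    return reading_hours(input_words, wpm) / (minutes_per_day / 60) / 365

# Table 7.1: 'apparent' needs 136,287 running words of input for
# 12 encounters; at 100 wpm that is 22.71 hours of reading.
print(round(reading_hours(136_287, 100), 2))  # 22.71
# Biology 'intensity' needs 740,469 words of input.
print(round(reading_hours(740_469, 100), 2))  # 123.41
# Reasonableness check on a hypothetical 16-million-word requirement,
# read 1 hour a day at 150 wpm:
print(round(years_at(16_000_000, 150, 60), 1))  # 4.9 -- feasible
```

Arithmetic of this kind is what a spreadsheet tool like Green's (2022a) embeds for teachers.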


If modelling cannot show this, the idea that ER can do most of the job is a falsified hypothesis without real-world testing. Cobb (2007) was the first to build a model of ER, using a learning threshold of 6 exposures to randomly sampled words from the first 1000, second 1000 and third 1000 most frequent words of English. He computed repetitions in 179,000-word samples of the Brown corpus (fiction, press, academic). For words in the first 1000, most met the exposure parameter set for incidental learning, but only one word did from the second 1000, and none from the third 1000. Running a similar model on 300,000 words of fiction, Cobb (2007) found only 469 of the third 1000 words were repeated 6 or more times. For teachers, he concluded, this information suggested the "extreme unlikelihood of developing an adequate L2 reading lexicon through reading alone, even in highly favourable circumstances" (p. 38).

Nation (2014) extended this modelling using 25 novels an L2 reader might read on their way towards a 9000-word vocabulary, e.g., Animal Farm, The Great Gatsby and (the somewhat unlikely) Ulysses. He used the BNC-COCA level lists to represent the first 1000 to 9000 word families of English and computed how long and how much input was needed to meet the majority of them on average 12 times, simulating reading at 150 and 200 wpm. Nation's (2014) model suggested, amongst other things, that if learners read all 25 novels in the corpus (approximately 3 million words), exposures to the required 9000 words were potentially enough for incidental learning. He suggests that readers might acquire up to 1000 words per year. Also of interest to teachers is the lexical coverage information. Nation (2014) reports that few novels were covered by mid-range vocabulary sizes, suggesting authentic fiction might provide "poor conditions for reading and incidental vocabulary learning" (Nation, 2014, p. 13), indicating graded readers are essential for beginner and mid-range vocabulary sizes. Nevertheless, Nation (2014) interprets his model as a "largely positive study of the opportunities for learning" (p. 13).

McQuillan (2016) used a more contemporary ER input proxy: 14 novels such as Twilight and The Hunger Games. He computed lexical coverage using AntWordProfiler (Anthony, 2014) and found many were readable (95–98% coverage) with mid-frequency vocabulary sizes; for example, The Hunger Games reached 95% coverage with only the first 4000 word families. So perhaps graded readers are only necessary at the start, after which progress can be made through self-selected, appropriately levelled popular fiction. Re-modelling Nation's (2014) figures at 1 hour a day and 12 repetitions, in the books available for the different vocabulary sizes up to the 9000-word target, McQuillan (2016) concludes that learners could find "sufficient input each step of the way, all at Nation's recommended 98% vocabulary coverage . . . [and] a little over three years of reading takes readers all the way to the 9,000-word family level" (p. 72).

Green (2022b) brought these insights together in a larger corpus study. He extracted 4989 books from COCA, processed them with AntWordProfiler, found how many were potentially comprehensible at 95% coverage (using each
BNC-COCA list from the 1000 to the 9000 level as a proxy), and then created subcorpora 'at the right level' for readers with different vocabulary sizes. He then computed, in these subcorpora, the time to reach 12 encounters of words from subsequent levels at 150 wpm, 60 minutes per day. This evaluated the possibility of ER providing a pathway through incidental learning to larger vocabulary sizes. For example, a reader with a 4000-word vocabulary reading comprehensible input with 95% lexical coverage would need exposure to the majority of the 5000-level words in that input, within a reasonable time frame. This needed to be true for all vocabulary sizes if progression to 9000 words was to be possible in the real world. This is what Green (2022b) found. Assuming access to a wide range of books, teachers should have some confidence in the availability of comprehensible input and exposure to vocabulary targets in that input. To develop a 9000-word vocabulary through ER, Green's (2022b) model estimated 6.3 years of reading 1 hour a day, or 3.5 years reading 2 hours a day. This is somewhere between McQuillan's (2016) and Nation's (2014) models, and, with regard to Cobb (2007), the model showed that words beyond the first 3000 word families do occur at possible rates for incidental learning.

What should we do next?

A general point is that corpus linguistics studies such as those described in this chapter are excellent ways to provide insights into possible educational outcomes and to inform teachers of what may work and what might not. Such studies should be considered more often in planning and conducting intervention studies. Ideally, it makes sense to check a corpus-derived model, when possible, before running any experimental intervention: interventions are resource intensive, both in time and money, and it is better to avoid wasting either. It is also better not to subject human participants to pointless research if a corpus model can determine in advance that something has no value. For example, if Dang (2019) had found her corpus of medical television scripts did not contain medical vocabulary, there would be no justification for putting medical students through an intervention study to test medical vocabulary learning through EV of medical television. However, since Dang (2019) found a reasonable possibility of this being beneficial input, the model justifies trying a real-world intervention.

For teachers, information gained from the kinds of corpus research in this chapter needs to be considered in the context of their practice. For example, how often and how inclined are one's students to read? What access to books do they have? If they prefer television or movies, perhaps EV would be preferable. These kinds of practical classroom considerations should be incorporated into any application of research.

A few things of value for future research and practice should be done next. Corpus studies that take a modelling approach to support intervention research designs should continue. There should be more professional discussion about whether corpus data needs consideration before an intervention; in my view, whenever possible, models should come first, for ethical reasons. In future studies, more
complex models need to be built with improved ecological validity, incorporating word- and learner-related variables and better representations of ER/EV input. For example, Green (2022b) used COCA, which is a large representation of contemporary fiction, but it is a corpus of first chapters, so it does not quite represent ER input in the real world. A study of ER based on a pool of complete texts, closely calibrated with students' reading input, would enhance ecological validity and provide clearer insights on ER for vocabulary growth.

Reference list

Anthony, L. (2014). AntWordProfiler (Version 1.4.1) [Computer software]. Waseda University.
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning & Technology, 11(3), 38–63.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34, 213–238.
Dang, T. N. Y. (2019). The potential for learning specialized vocabulary of university lectures and seminars through watching discipline-related TV programs: Insights from medical corpora. TESOL Quarterly, 1–24.
Davies, M. (2010). The Corpus of Contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing, 25(4), 447–464.
Green, C. (2020). Extensive reading and viewing as input for academic vocabulary: A large-scale vocabulary profile coverage study of students' reading and writing across multiple secondary school subjects. Lingua, 239, 102838.
Green, C. (2022a). Computing curriculum time and input for incidentally learning academic vocabulary. Language Learning & Technology, 26(1).
Green, C. (2022b). Extensive reading for a 9,000-word vocabulary: Evidence from corpus modeling. Reading in a Foreign Language, 34(2).
Green, C., & Lambert, J. (2018). Advancing disciplinary literacy through English for academic purposes: Discipline-specific wordlists, collocations and word families for eight secondary subjects. Journal of English for Academic Purposes, 35, 105–115.
Hiebert, E. H. (2020). The core vocabulary: The foundation of proficient comprehension. The Reading Teacher, 73(6), 757–768.
Hu, M., & Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403–430.
Kamil, M. L., Borman, G. D., Dole, J., Kral, C. C., Salinger, T., & Torgesen, J. (2008). Improving adolescent literacy: Effective classroom and intervention practices. IES practice guide. NCEE 2008–4027. National Center for Education Evaluation and Regional Assistance.
Krashen, S. (1989). We acquire vocabulary and spelling by reading: Additional evidence for the input hypothesis. The Modern Language Journal, 73(4), 440–464.
McQuillan, J. (2016). What can readers read after graded readers? Reading in a Foreign Language, 28(1), 63–78.
McQuillan, J. (2019). Where do we get our academic vocabulary? Comparing the efficiency of direct instruction and free voluntary reading. The Reading Matrix: An International Online Journal, 19(1), 129–138.
Nation, P. (2014). How much input do you need to learn the most frequent 9,000 words? Reading in a Foreign Language, 26(2), 1–16.

Renandya, W. A., & Iswandari, Y. (2022). Extensive reading. In H. Mohebbi & C. Coombe (Eds.), Research questions in language education and applied linguistics. Springer.
Schmitt, N., Cobb, T., Horst, M., & Schmitt, D. (2017). How much vocabulary is needed to use English? Replication of van Zeeland & Schmitt (2012), Nation (2006) and Cobb (2007). Language Teaching, 50(2), 212–226.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in a text and reading comprehension. The Modern Language Journal, 95(1), 26–43.
Uchihara, T., Webb, S., & Yanagisawa, A. (2019). The effects of repetition on incidental vocabulary learning: A meta-analysis of correlational studies. Language Learning, 69(3), 559–599.

7.1 Implementing extensive reading and viewing in language courses: challenges and possible solutions

Elizabeth Hanks

Green discusses how corpus research has explored the potential of extensive reading and viewing to enhance vocabulary skills. I have implemented both extensive reading and listening programs in my classes at an intensive English program, but I (like many others) have had to address several challenges in order to harness their benefits. I discuss some possible solutions here.

First, students often do not feel motivated to read. In response to this challenge, I have tried to make level-appropriate reading content more accessible. I keep graded readers, comic books, and magazines in my classroom and provide them to students when they complete tasks (such as tests) early. After reading in class, students are sometimes surprised to find that they enjoyed reading: not only because the story is interesting but also because they are proud of themselves for having understood the text. I encourage students to borrow the book and return it once they have finished. Of course, this practice does not instantaneously transform students into book lovers, but it has led to more frequent reading – if not outside of class, then certainly during class time.

While some learners are not motivated to read, others may not be motivated to engage in extensive listening. There is again no easy solution to this, but listening logs have been a helpful way to engage students in extensive viewing. Students keep a listening log in which they record watching something in English for at least 25 minutes every day, taking brief notes about (a) what happened in the show and (b) any new words they heard. Students then discuss their shows in groups. The conversations are usually lively, with students eager to hear about updates on their classmates' favorite shows. These conversations often also result in more extensive listening, as students begin watching their friends' shows as well.

Another challenge of extensive listening is that graded listening materials are relatively rare compared to graded readers, so it can be difficult to find appropriate viewing materials for students, particularly those at beginner to low-intermediate levels. To alleviate this issue, at the beginning of each semester my students and I compile a list of YouTube channels that contain long-form listening content suitable for their level. This gives students opportunities to share interesting videos with each other while also producing a rich resource for them to draw from in their classwork and beyond.

Other challenges and possible solutions exist beyond those I have mentioned here, but these have been the most pressing issues in my own teaching experience. Although encouraging students to engage in extensive reading and listening comes with inherent challenges, I believe the potential benefits Green discusses make our efforts to alleviate them worthwhile.

8 TRANSFORMING LANGUAGE TEACHING WITH CORPORA – BRIDGING RESEARCH AND PRACTICE

Reka R. Jablonkai

1. Introduction

At the emergence of corpus approaches in language teaching, researchers in the field claimed that these approaches had the potential to transform English language teaching (Leech, 1997). Conrad (2000), for example, argued that corpus linguistics would potentially revolutionise grammar teaching. Ever since Tim Johns (1991) proposed sample tasks using corpus data in language teaching in the 1990s, researchers have explored how the results of corpus-based research can be integrated into teaching practice. Research has also established that using corpora in tandem with a data-driven approach to teaching and learning benefits language learners in several ways (Boulton & Cobb, 2017; Boulton & Vyatkina, 2021).

Based on the level of a language learner's direct access to corpus data, three main approaches to corpus-based pedagogy have developed over the last three decades: (1) corpus-informed teaching, (2) integrated corpus-supported teaching and learning, and (3) self-directed data-driven learning (DDL) (Jablonkai & Csomay, 2022). In their meta-analysis of experimental and quasi-experimental studies, Boulton and Cobb (2017) found that a DDL approach can be especially effective when learners directly consult specialised and purpose-built pedagogic corpora rather than large public corpora, and particularly for vocabulary and lexicogrammar. Corpus-based pedagogies and DDL have also often been linked with a focus-on-form approach (Long, 1991), a usage-based model of language acquisition (Ellis et al., 2016) and task-based learning. Furthermore, DDL has often been presented as a form of corpus-aided discovery learning (CADL) where corpora are used to scaffold and mediate the learning process (Huang, 2011). Using concordance lines in a DDL classroom has been linked to Schmidt's Noticing Hypothesis (1990)

DOI: 10.4324/9781003413301-8

where the learner advances in their language abilities as they are guided to consciously notice language patterns in the input.

However, despite these positive research results, recent studies have highlighted that corpus-based approaches have not been taken up by textbook and materials designers (Granger, 2015; Paquot, 2018), nor have corpus-based pedagogies been widely used in language classrooms (Chambers, 2019; Meunier, 2020). The most frequently cited reasons for this are related to the technical complexity of corpus consultation, the time needed to learn specific corpus analysis tools and to interpret corpus data, and learners feeling overwhelmed by the amount of data and the exposure to real, authentic language (e.g., Jablonkai & Čebron, 2021). The aim of this chapter is to demonstrate how to utilise all three approaches to corpus-based pedagogy in language teaching, in the hope of bridging the persistent gap between research and practice.

2. Approaches to corpus-based pedagogy

The majority of the ever-growing body of research on the use of corpora in language teaching has concentrated on areas such as providing descriptions of language use across various registers and disciplines (Jablonkai & Csomay, 2022); exploring the effectiveness of corpus-based approaches to teaching, especially DDL (e.g., Boulton & Cobb, 2017); and reporting on student perspectives on using corpora (e.g., Boulton & Vyatkina, 2021). Although recent years have witnessed an increase in publications that focus on corpus-based activities for teaching practice (e.g., Pinto et al., 2023; Le Foll, 2021; Viana, 2023), there is still little literature on the pedagogy of using corpus data for and in language education. In what follows, the three main approaches to corpus-based pedagogy will be discussed and exemplified by tasks focusing on English for Specific Purposes (ESP) contexts.

2.1 Corpus-informed teaching

Students do not have direct access to corpora in the case of corpus-informed teaching. It is researchers, materials designers, and teachers who access and analyse corpora and make pedagogical decisions about the contents and sequencing of syllabi and teaching materials in terms of lexical, grammatical, or discourse aspects based on the results of corpus analysis (Römer, 2011). While there has been considerable success in compiling reference materials such as dictionaries and grammar books using corpus data, the integration of corpus findings into coursebooks remains relatively scarce (Granger, 2015; Paquot, 2018). A couple of general coursebook series (notably Touchstone, McCarthy et al., 2005; and Grammar and Beyond, Reppen, 2012) and a few ESP coursebooks (for example, an EU English coursebook, Trebits & Fischer, 2013; and a Business English coursebook, Handford
et al., 2011) stand as exceptions that have embraced corpus-informed elements and systematically used the results of analyses of general and specialised corpora for teaching purposes.

One burgeoning aspect of corpus-informed teaching has been the creation of single-word and multi-word wordlists. General and discipline-specific wordlists, collocation lists (Green, 2022; Jablonkai, 2020), and multi-word lists, i.e., lists of lexical bundles, have been compiled to inform the design of materials and syllabi (Szudarski, 2022; Cortes, 2022). The pedagogic usefulness of such lists could, however, be enhanced by demonstrating their presentation and application in teaching practice. One way to present the results of corpus-based vocabulary analyses in teaching materials is to compile pedagogic collocational profiles of relevant words. As can be seen in Figure 8.1, such collocational profiles provide learners with information on the productive usage of a word by listing relevant collocates grouped according to the grammatical relations they form with that word. In addition, such a profile also gives insight into the semantic patterns of the particular discourse it represents, in this case discourse about the European Union (EU). In ESP contexts, presenting learners with comparisons of the collocational patterns of the same words in general and specialised corpora, as illustrated in Figure 8.2, can also raise their awareness of the distinctive lexical patterns of a specific professional discourse and provide them with guidance on how to discern nuances of word meaning (Jablonkai, 2020).

Corpus-informed single- and multi-word lists can also be used to create lexical profiles of individual texts, and such lists thus play a pivotal role in selecting suitable texts for language learners.
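To make concrete what such a lexical profile involves, the sketch below (illustrative Python, not from the chapter; the tiny wordlist is a toy stand-in for, e.g., an academic word list) computes what proportion of a text's running words is covered by a given wordlist:

```python
import re
from collections import Counter

def lexical_profile(text: str, wordlist: set[str]) -> dict:
    """Proportion of a text's running words (tokens) covered by a wordlist."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    covered = sum(n for word, n in counts.items() if word in wordlist)
    total = sum(counts.values())
    return {
        "tokens": total,          # running words
        "types": len(counts),     # distinct word forms
        "coverage": covered / total if total else 0.0,
    }

# Toy example: a minimal "wordlist" and a one-sentence "text"
wordlist = {"criteria", "policy", "analysis", "the", "of", "and", "a"}
profile = lexical_profile("The analysis of the policy criteria was long.", wordlist)
print(profile)
```

Dedicated profilers such as AntWordProfiler work with word families and multiple frequency bands rather than a single flat list, but the coverage figure at the heart of a lexical profile is computed in essentially this way.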
Such profiles can be used by teachers to characterise texts in terms of the breadth and depth of their vocabulary and help them make informed decisions about a text's appropriateness for teaching specific proficiency levels or disciplinary language use (Szudarski, 2022). Jablonkai (2009), for example, compared lexical bundles in online news texts and official EU texts, found considerable differences between the two registers, and suggested that official EU texts are more appropriate for discipline-specific pedagogic purposes.

2.2 Integrated corpus-supported teaching

Corpus-supported teaching is often integrated into classroom pedagogy as DDL. In the case of DDL, learners become 'researchers' as they have direct access to corpus data (Johns, 1997). In a DDL pedagogic approach, corpus data can be provided in two forms. In hands-off or paper-based DDL, learners are given printouts of selected concordance lines or other types of corpus data. In hands-on DDL, they consult a corpus using various corpus analysis tools. Studies on the effectiveness of DDL found that it enhances learners' language awareness, confidence, and autonomous learning (Jablonkai & Csomay, 2022).
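Hands-on tools differ, but the core operation behind every concordancer, the keyword-in-context (KWIC) display, is straightforward. The sketch below is a minimal illustration (the sample sentences are invented, echoing the 'criteria' examples later in the chapter), not the implementation of any particular tool:

```python
import re

def kwic(text: str, keyword: str, window: int = 4) -> list[str]:
    """Return keyword-in-context lines: `window` words either side of each hit."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left co-text so keywords line up in a column
            lines.append(f"{left:>35}  [{tok}]  {right}")
    return lines

sample = ("Candidate countries must meet the accession criteria. "
          "The eligibility criteria were revised, and transparent "
          "criteria for membership apply.")
for line in kwic(sample, "criteria"):
    print(line)
```

Aligning the search word in a central column is what lets learners scan vertically for recurring patterns, which is precisely the noticing behaviour DDL tasks are designed to prompt.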

FIGURE 8.1 Pedagogic collocational profile of the lemma CRITERION (based on Jablonkai, 2011)

FIGURE 8.2 Comparison of the collocational patterns of the lemma LAY in a general and discipline-specific corpus (based on Jablonkai, 2011)

However, the integration and pedagogic considerations of DDL and corpus use are rarely discussed in these studies (Pérez-Paredes, 2019). Only a handful of authors make recommendations for specific ways of integrating, implementing, and sequencing DDL activities (e.g., Johns, 1997; Thurstun & Candlin, 1998; Timmis & Templeton, 2023). Johns suggested three steps, namely identification, classification, and generalisation (sometimes also termed research, practice, and improvisation). Following these steps, students first explore the corpus data to identify a language problem to focus on. They then classify the patterns they have deduced from the corpus examples and finally formulate rules based on the established patterns. A similar sequence is encapsulated in the 'three Is' (Illustration: looking at corpus data; Interaction: discussing observations; Induction: putting forward a rule to explain the linguistic feature) proposed by Carter and McCarthy (1995), originally for spoken grammar, which had minimal coverage in textbook materials. Thurstun and Candlin (1998) developed these steps
further and presented the following chain of activities for focussing on individual words for academic writing. Learners:

LOOK at concordance lines of the new vocabulary item and words surrounding it, thinking of meaning;
FAMILIARISE [themselves] with the patterns of language surrounding the word by referring to the concordances as [they] complete the tasks;
PRACTISE words without referring to concordances;
CREATE their own piece of writing using the words studied to fulfil particular functions in writing, speaking (p. 272).

Following this chain of activities, learners can look at concordance lines by directly accessing corpora or by looking at printouts of concordance lines, as presented in Figure 8.3. The instructions for such LOOK activities can be very open, directing learners' attention to patterns of the word forms in texts. For example: "Study the concordance lines and list the words (nouns, verbs, adjectives, adverbs) used with the different forms of the word 'criterion'. How does the meaning of the word 'criterion' change, and what are its frequent forms?" As a next step, learners can work

FIGURE 8.3 Concordance lines for a LOOK activity (based on Jablonkai, 2011)

FIGURE 8.4 Example of a FAMILIARISE activity with the lemma CRITERION (based on Jablonkai, 2011)

on FAMILIARISE activities that help them practise the patterns the word is part of. These can be gap-fill exercises, as in Figures 8.4 and 8.5. In Figure 8.4, learners are given concordance lines with frequent collocates of the word 'criteria' blanked out. Based on the LOOK activity, learners are asked to identify the missing words.1 Another example of a FAMILIARISE activity is given in Figure 8.5, where learners fill in the gaps in sentences selected from the corpus and focus on one particular grammatical relation (verb + adverb). Such activities can also be designed to practise multi-word units, for example, lexical bundles. Figure 8.6 exemplifies an activity where learners focus on the discourse functions of specific lexical bundles by exploring concordance lines. The final two types of activities in the chain proposed by Thurstun and Candlin (1998) do not require consulting corpora, as learners PRACTISE the use of words when paraphrasing passages and CREATE their own writing. These activities are very

FIGURE 8.5 Example of a FAMILIARISE activity with the lemma IMPLEMENT (based on Jablonkai, 2011)

much teacher-directed tasks, as they are prepared by the teacher in advance, and there is little room for flexibility in class as learners complete the tasks with guidance and scaffolding from the teacher. Timmis and Templeton (2022) propose a more flexible approach that takes learners' characteristics and needs into account. In line with the zone of proximal development (Vygotsky, 1978), the introduction of elements of corpus-based pedagogy should be incremental, as learners become ready for it. In their flexible framework for DDL practice, Timmis and Templeton (2023) suggest that in the first stage learners be familiarised with DDL techniques and with how they can infer meaning from concordance lines and interpret frequent words in a text by looking at, for example, word clouds. In the second stage, learners are introduced to user-friendly corpus tools and reference corpora. In this stage, searches are prompted by learners' needs, for example, by a student error or question. In the final stage, once learners are fully equipped with the technical skills and corpus literacy, they might create and search their own corpora.
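That final stage, in which learners compile and search their own corpora, can be technically modest. A minimal sketch (illustrative only; a real DIY corpus would be a folder of texts the learner has collected for their discipline) builds a frequency list from a folder of plain-text files:

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def build_frequency_list(folder: Path, top_n: int = 10) -> list[tuple[str, int]]:
    """Compile a DIY corpus from the .txt files in `folder` and rank its word forms."""
    counts: Counter[str] = Counter()
    for path in sorted(folder.glob("*.txt")):
        tokens = re.findall(r"[a-z]+", path.read_text(encoding="utf-8").lower())
        counts.update(tokens)
    return counts.most_common(top_n)

# Toy DIY corpus: two short "texts" written to a temporary folder
corpus_dir = Path(tempfile.mkdtemp())
(corpus_dir / "text1.txt").write_text("The criteria must be transparent and fair.")
(corpus_dir / "text2.txt").write_text("Transparent criteria support fair decisions.")
print(build_frequency_list(corpus_dir, top_n=3))
```

A frequency list of this kind is often a learner's first point of entry into their own corpus, showing at a glance which words dominate their chosen texts before they move on to concordance searches.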

FIGURE 8.6 Example of a FAMILIARISE activity with lexical bundles (based on Jablonkai, 2011)

Pedagogic models such as these are essential for the pedagogic application of corpora. Although several practical collections of corpus activities, tasks, and lesson plans have been published recently (e.g., Le Foll, 2021; Pinto et al., 2023; Viana, 2023), further research that yields pedagogic strategies and frameworks for the systematic pedagogic application of corpora is still needed. This could include the development of a corpus task taxonomy, principles of course design and lesson planning, and a list of basic corpus query and corpus data interpretation skills (see Ma, this volume).

2.3 Self-directed data-driven learning

The third main approach to corpus-based pedagogy is self-directed DDL, in which learners are encouraged to consult corpora autonomously, often outside the language classroom. Early in the history of DDL, it was seen as an approach with great potential for self-directed learning, and learner autonomy is also often emphasised as a benefit and goal of DDL (Charles, 2022). Self-directed DDL can involve the autonomous use of large, publicly available corpora and
the creation of bespoke DIY corpora for specific, individual purposes. Studies that investigated the autonomous use of corpora for language learning found that such a pedagogic approach can be suitable for doctoral and undergraduate students alike (e.g., Jablonkai & Čebron, 2021; Lee & Swales, 2006). They also revealed that through engagement with corpus compilation, learners developed a better understanding of corpus structure and the potential scope of investigations, improved their interpretation of corpus output, and achieved a deeper understanding of how language works and how meaning is made through language. Charles (2022) argued that this aspect of corpus-based pedagogy is very much under-researched, and except for self-directed DDL for writing, there is little literature on autonomous corpus-based learning for language skills such as reading and speaking. This approach is also technically the most challenging and requires training for both teachers and learners. Research into the cognitive processes of corpus consultation (e.g., identifying search types, inferencing, forming hypotheses) and into effective strategies for pedagogic corpus searching would also be needed to further exploit the potential of self-directed DDL.

3. Lessons from DDL, CALL, and TEL research

Technical complexity and the time it takes to learn the necessary basic corpus literacy skills are among the most often mentioned challenges of applying any of the three approaches to corpus-based pedagogy (Jablonkai & Csomay, 2022). Such challenges should not come as a surprise when we look at the literature on computer-assisted language learning (CALL) and technology-enhanced learning (TEL) (Hendry & Sheepy, 2022; Hubbard, 2013). Usability, often linked to cognitive load theory, is a critical factor in designing user-friendly technology for learning. Hendry and Sheepy (2022) analysed three widely used corpus tools and found that their interfaces and presentation of content could be further developed to make these tools easier to use. Such usability analyses of corpus tools can help teachers select the most appropriate corpus software for their pedagogical contexts. Recent years have also witnessed the development of more user-friendly corpus tools (e.g., CorpusMate, https://corpusmate.com/; Versatext, https://versatext.versatile.pub/; COCA, www.english-corpora.org/coca/; SkELL, https://skell.sketchengine.eu) that might be appropriate for a wider range of pedagogical environments. Alternatives to corpus tools such as Google Search, Google Translate, and, very recently, generative AI applications like ChatGPT also have some potential advantages. Although these alternatives may offer better usability, to date corpus tools still seem to provide more reliable, authentic, and context-specific materials, concordance lines, and frequency lists (Crosthwaite & Baisa, 2023; Lin, 2023).

In addition to the development of corpus tools, the compilation of pedagogic, teaching-oriented corpora needs to be advanced in order to accelerate the
transformation of language teaching. As most corpora to date have been designed and created for research purposes, graded, topic-, field-, and genre-specific corpora, together with pedagogic multimodal corpora, are needed to provide an array of corpus resources for applying each of the three approaches to corpus-based pedagogy in a wide range of pedagogic contexts.

Lessons from the DDL and CALL literature also point towards some guiding principles for course design and task sequencing when applying a corpus-based pedagogy, including:

1 Introducing students to basic concepts of corpus-based learning: corpus search and the interpretation of corpus data;
2 Including components that focus on technical skills;
3 Including both direct and indirect corpus use, and guiding and scaffolding corpus consultation;
4 Allowing sufficient time and repetition for hands-on corpus querying in class;
5 Including the compilation and analysis of a DIY corpus;
6 Promoting autonomous corpus use outside of the classroom;
7 Contextualising the demonstration of corpus resources and tools; and
8 Including components of reflection (based on Jablonkai & Čebron, 2021).

4. Conclusions

This chapter has discussed and exemplified the three main approaches to corpus-based pedagogy in order to bridge the gap between research and practice in language teaching and learning. By offering a practical overview, this discussion contributes to the broader acceptance and normalisation of corpus-based pedagogies, as emphasised by Pérez-Paredes (2019). Looking ahead, the following aspects deserve more attention. From the teacher's perspective, the development of theoretically underpinned models for selecting and integrating elements of corpus-based pedagogies is vital, together with identifying specific training needs to overcome the technical challenges and pedagogical issues associated with this approach. From the learners' perspective, more user-friendly tools and a diverse array of corpora, aligned with different proficiency levels, age groups, and disciplines, could make corpus-based pedagogies more engaging and relevant to a broader spectrum of learners. Equally important is the learning perspective, which necessitates further investigation to unravel the precise contributions of a corpus-based approach within the learning process and how these can be translated into pedagogic practice. Although the specific activities in this chapter were drawn from a corpus-based ESP study, it is hoped that the recommended principles, considerations, tasks, and strategies for integrating and implementing corpus-based pedagogy in teaching practice will serve as useful examples for all language teaching contexts.

Note

1 Key to the task in Figure 8.4: 1 transparent, 2 eligibility, 3 membership, 4 satisfy/meet.

Reference list

Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta-analysis. Language Learning, 67(2), 348–393. https://doi.org/10.1111/lang.12224
Boulton, A., & Vyatkina, N. (2021). Thirty years of data-driven learning: Taking stock and charting new directions over time. Language Learning & Technology, 25(3), 66–89.
Carter, R., & McCarthy, M. (1995). Grammar and the spoken language. Applied Linguistics, 16(2), 141–158. https://doi.org/10.1093/applin/16.2.141
Chambers, A. (2019). Towards the corpus revolution? Bridging the research-practice gap. Language Teaching, 52(4), 460–475. https://doi.org/10.1017/S0261444819000089
Charles, M. (2022). Corpora and autonomous language learning. In R. R. Jablonkai & E. Csomay (Eds.), The Routledge handbook of corpora and English language teaching and learning. Routledge.
Conrad, S. (2000). Will corpus linguistics revolutionize grammar teaching in the 21st century? TESOL Quarterly, 34(3), 548–560.
Cortes, V. (2022). Lexical bundles in EAP. In R. R. Jablonkai & E. Csomay (Eds.), The Routledge handbook of corpora and English language teaching and learning. Routledge.
Crosthwaite, P., & Baisa, V. (2023). Generative AI and the end of corpus-assisted data-driven learning? Not so fast! Applied Corpus Linguistics, 3, 100066. https://doi.org/10.1016/j.acorp.2023.100066
Ellis, N. C., Römer, U., & O'Donnell, M. B. (2016). Constructions and usage-based approaches to language acquisition. Language Learning, 66, 23–44.
Granger, S. (2015). The contribution of learner corpora to reference and instructional materials design. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 486–510). Cambridge University Press.
Green, C. (2022). Corpora for teaching collocations in ESP. In R. R. Jablonkai & E. Csomay (Eds.), The Routledge handbook of corpora and English language teaching and learning. Routledge.
Handford, M., Lisboa, M., Koester, A., & Pitt, A. (2011). Business advantage. Cambridge University Press.
Hendry, C., & Sheepy, E. (2022). Evaluating corpus analysis tools for the classroom. In R. R. Jablonkai & E. Csomay (Eds.), The Routledge handbook of corpora and English language teaching and learning. Routledge.
Huang, L. (2011). Corpus-aided language learning. ELT Journal, 65(4), 481–484.
Hubbard, P. (2013). Making a case for learner training in technology enhanced language learning environments. CALICO Journal, 30(2), 163–178.
Jablonkai, R. R. (2009). "In the light of": Corpus-based analysis of lexical bundles in two EU-related registers. WoPaLP, 3, 1–27.
Jablonkai, R. R. (2011). A corpus-linguistic investigation into the lexis of written English EU discourse: An ESP pedagogic perspective [Unpublished PhD thesis, Eötvös Loránd University, Budapest].
Jablonkai, R. R. (2020). Leveraging professional wordlists for productive vocabulary knowledge. ESP Today, 8(1), 2–24.
Jablonkai, R. R., & Čebron, N. (2021). Undergraduate students' responses to a corpus-based ESP course with DIY corpora. In M. Charles & A. Frankenberg-Garcia (Eds.), Corpora in ESP/EAP writing instruction (pp. 100–120). Routledge. https://doi.org/10.4324/9781003001966-5-8
Jablonkai, R. R., & Csomay, E. (Eds.). (2022). The Routledge handbook of corpora and English language teaching and learning. Routledge.
Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. In T. Johns & P. King (Eds.), Classroom concordancing. English Language Research Journal, 4, 1–16.
Johns, T. (1997). Contexts: The background, development and trialling of a concordance-based CALL program. In A. Wichmann, S. Fligelstone, T. McEnery & G. Knowles (Eds.), Teaching and language corpora (pp. 100–115). Addison Wesley Longman.
Lee, D., & Swales, J. (2006). A corpus-based EAP course for NNS doctoral students: Moving from available corpora to self-compiled corpora. English for Specific Purposes, 25(1), 56–75.
Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone & T. McEnery (Eds.), Teaching and language corpora (pp. 1–23). Longman.
Le Foll, E. (Ed.). (2021). Creating corpus-informed materials for the English as a foreign language classroom: A step-by-step guide for (trainee) teachers using online resources. https://pressbooks.pub/elenlefoll/
Lin, P. (2023). ChatGPT: Friend or foe (to corpus linguistics)? Applied Corpus Linguistics, 3, 100065. https://doi.org/10.1016/j.acorp.2023.100065
Long, M. H. (1991). Focus on form: A design feature in language teaching methodology. In K. de Bot, R. Ginsberg & C. Kramsch (Eds.), Foreign language research in cross-cultural perspective (pp. 39–52). Benjamins.
McCarthy, M., McCarten, J., & Sandiford, H. (2005). Touchstone student book 1. Cambridge University Press.
Meunier, F. (2020). Data-driven learning: From classroom scaffolding to sustainable practices. EL.LE, 8(2), 423–434.
Paquot, M. (2018). Corpus research for language teaching and learning. In A. Phakiti, P. De Costa, L. Plonsky & S. Starfield (Eds.), The Palgrave handbook of applied linguistics research methodology (pp. 359–374). Palgrave Macmillan.
Pérez-Paredes, P. (2019). A systematic review of the uses and spread of corpora and data-driven learning in CALL research during 2011–2015. Computer Assisted Language Learning. https://doi.org/10.1080/09588221.2019.1667832
Pinto, P. T., Crosthwaite, P., Tavares de Carvalho, C., Spinelli, F., Serpa, T., Garcia, W., & Ottaiano, A. O. (2023). Using language data to learn about language: A teachers' guide to classroom corpus use. https://uq.pressbooks.pub/using-language-data/
Reppen, R. (2012). Grammar and beyond level 1 student's book. Cambridge University Press.
Römer, U. (2011). Corpus research applications in second language teaching. Annual Review of Applied Linguistics, 31, 205–225.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129–158.
Szudarski, P. (2022). Corpora and teaching vocabulary and phraseology. In R. R. Jablonkai & E. Csomay (Eds.), The Routledge handbook of corpora and English language teaching and learning. Routledge.
Thurstun, J., & Candlin, C. (1998). Concordancing and the teaching of the vocabulary of academic English. English for Specific Purposes, 17(3), 267–280.

Timmis, I., & Templeton, J. (2022). DDL for English language teaching in perspective. In R. R. Jablonkai & E. Csomay (Eds.), The Routledge handbook of corpora and English language teaching and learning. Routledge.
Timmis, I., & Templeton, J. (2023). A flexible framework for DDL in practice. In K. Harrington & P. Ronan (Eds.), Demystifying corpus linguistics for English language teaching. Palgrave Macmillan.
Trebits, A., & Fischer, M. (2013). EU English – Using English in EU contexts. Klett.
Viana, V. (2023). Teaching English with corpora: A resource book. Routledge.
Vygotsky, L. (1978). Mind in society: The development of higher psychological processes. Harvard University Press.

8.1 DDL and ESP in Germany

Claudia Wunderlich

Reka R. Jablonkai’s chapter highlights the discrepancy between corpus linguistics, which has long been recognized as a highly relevant methodology for language teaching, and the lack of actual use of corpora by instructors, even at the tertiary level. This certainly seems to be true in my field of teaching in Germany, despite corpus methods being recommended here too, including by leading organizations for language teaching at the tertiary level (cf., for example, Taylor Torsello et al., 2009). This applies to all three areas outlined by Reka: 1) corpus-informed teaching, 2) integrated corpus-supported teaching and learning, and 3) self-directed data-driven learning (DDL). Creating corpus-informed course materials and doing hands-on DDL in the classroom are far from mainstream in Germany. Teaching English for Specific Academic Purposes (ESAP), English for Professional Purposes (EPP), and particularly English for engineering purposes hardly seems to involve corpora and corpus-informed materials tailored to students’ specific needs. For many teaching professionals, it may appear too daunting and time-consuming an endeavour. Therefore, Reka’s contribution, with the methods and tasks she outlines, provides much-needed information and inspiration to fill the gap. This is highly relevant for the more practically oriented type of universities in Germany. Since I teach at the only technical university of applied sciences in Germany offering bilingual English Medium of Instruction (EMI) courses – so-called TWINs – where my focus is on industrial engineering, this is of particular significance for our institution. How DIY corpora can be compiled and exploited, for example for lists of words, collocations, and chunks, has been widely described in the literature since Coxhead’s (2000) seminal work. Yet, how


to create a corpus-informed ESP/EPP course, including corpus-based tasks and chains of activities that also direct students towards autonomous learning, is a different matter. Thus, Reka’s account of how her corpus study on EU English informed the creation of a coursebook can serve as a model and starting point for developing my own materials. All this, together with the contents of her recent Routledge handbook co-edited with Eniko Csomay (2022), and suggestions by others in this volume for making corpus use more appealing to students – and, I would like to add, to colleagues alike – are likely to be game-changers. Overall, it adds considerable impetus to my own project of creating awareness of the multiple affordances of corpora in teaching ESP/EPP, forming a network with other professionals in higher education, and compiling specific corpora and materials for engineering purposes as outlined by Reka. All this will greatly contribute to bridging that theory–practice gap in Germany.

Reference list
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238.
Jablonkai, R. R., & Csomay, E. (Eds.). (2022). The Routledge handbook of corpora and English language teaching and learning. Taylor & Francis.
Taylor Torsello, C., Ackerley, K., Costello, E., Dalziel, F., Gesuato, S., & Opas-Hänninen, L. (2009). Corpus work. In J. Fischer, M. T. Musacchio, & A. Standring (Eds.), Exploiting internet case studies and simulation projects for language teaching and learning: EXPLICS; a handbook (pp. 25–36). Cuvillier.

8.2 Teachers’ prepared handouts as a part of scaffolding in data-driven learning

Thi Thanh Thuy Tran

Writing is a crucial skill in English, and attaining lexical and grammatical accuracy is essential when learning the language. However, many non-native English speakers struggle to produce pieces free of significant lexical or grammatical errors. Overly aggressive correction can undermine students’ confidence and motivation to learn. As such, rather than directly correcting learners or providing explicit examples, an increasing number of teachers are guiding students to develop discovery strategies. In the context of this discussion, Vietnamese students attending my English for Academic Purposes course aim to enhance their English skills, particularly


in academic writing. They engage in corpus-informed activities to address two significant errors evident in their work. Their English proficiency is at the B2 (Upper-Intermediate) level according to the CEFR. This level equips them for data-driven learning (DDL), which is best suited to advanced learners filling knowledge gaps rather than laying foundational knowledge (Hunston, 2022, p. 171). I have noticed that identifying which errors to address is challenging for learners. It often requires experienced instructors to judge the linguistic and educational validity of items based on query results from trusted corpus sites. Pedagogically, a specific cut-off is vital to determine which phrases to include or exclude, and such decisions stem from complex statistical evaluations. Hence, teacher preparedness is essential when crafting worksheets that help students identify and understand prominent errors. Furthermore, printed DDL materials are more accessible to students lacking computer access or training in corpus usage. Consequently, well-designed handouts simplify the learning process, promoting practices like observation, familiarization, and inductive learning (Schmidt, 1990). For students to successfully observe and notice desired language patterns, I also see instructor scaffolding as beneficial. Activities that use corpus data, such as pre-selected concordance lines highlighting commonly confused patterns, can be invaluable. These lines can be used for gap-filling tasks, grouping node words by their parts of speech, understanding lexical bundle functions, or identifying usage patterns. Engaging in these and other structured tasks allows students to recognize recurrent patterns and explore word and phrase usage autonomously, fostering a discovery-based learning approach with teacher-prepared materials (Römer, 2009). Incorporating corpus tools and materials holds vast potential in English as a Second Language classrooms.
However, I acknowledge this approach demands significant teacher commitment, dedication, and creativity in producing well-crafted handouts and instructions. These resources aid learners in guiding their cognitive processes. This underscores the need for more research on training teachers and equipping them with hands-on experience to create materials that foster DDL and student–corpus interactions.

Reference list
Hunston, S. (2022). Corpora in applied linguistics. Cambridge University Press.
Römer, U. (2009). Corpus research and practice: What help do teachers need and what can we offer? In K. Aijmer (Ed.), Corpora and language teaching (pp. 83–98). John Benjamins.
Schmidt, R. W. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129–158. https://doi.org/10.1093/applin/11.2.129


8.3 Towards the use of corpora in language teaching: rising awareness in slow but gradual motion!

Ali Şükrü Özbay

From Reka’s account of corpus transformation in language teaching, I derived many implications. The most influential of these, which seems to have already gained wide consensus in academia, is that corpus linguistics as a methodology offers a myriad of revolutionary opportunities for various scientific fields in their quest to cross the “devil’s bridge”. It is unsurprising that corpus linguistics has captivated diverse fields, given its potential to be integrated into classroom learning, promoting inductive and discovery-based learning through consciousness-raising DDL activities. Frankly, resistance to this change appears irrational to all involved parties. One significant factor that influenced my classroom teaching, especially in my course “Scientific Article Writing in English”, was the awareness of prevalent lexical co-occurrences in any text. As an EFL and EAP teacher, it made me realize that I needed to re-evaluate my English teaching methods. Relying on introducing a new set of rules each time students face challenges in essay writing is not effective. Instead, incorporating more corpus-informed practices, or the “corpus revolution”, can bridge this gap, emphasizing the importance of consulting the corpus in essay writing. Another influencing factor was the critical importance of frequency data in analysing and evaluating certain words and lexical structures. The associated collocational preferences in texts merit immediate attention and research. Studying these collocational structures became an essential component of my classroom teaching, focusing on introducing new words, word combinations, and lexical bundles. Reka’s chapter reaffirmed that the potential benefits of using corpora directly or indirectly in language classrooms outweigh the challenges in numerous ways.
Through Reka’s explanation of the three primary approaches to teaching with corpora, it is evident that corpus research and tools can highlight the phraseological nature of the English language, especially the semantic prosodic and priming features. Like Reka, I, an EAP teacher who frequently uses corpora in the classroom, have adopted the philosophy of “same task, different corpora” (Charles, 2015). I’ve developed in-class corpus-based materials and a series of activities (e.g., concordance activities) for improvisation. I also encouraged my students to compile


their own corpora, tailored to their specific fields of interest, following a needs analysis process. The goal was to foster autonomous learning, enhance their awareness of lexical patterns, and help them identify their deficiencies in this area. The outcome included various academic corpora from several engineering disciplines. These corpora focused on frequency-based investigations of phraseological properties, including lexical, terminological, formulaic, and collocational analyses with references to clusters, lexical bundles, N-gram structures, phrasal complexities, and their semantic prosodic features. The feedback for the course was largely positive, with students expressing favourable views on the use of corpora in article writing. Reka also emphasized that exploring the combinatory nature of the English language aids both teachers and learners. This is in line with the sentiment of the British linguist John Firth, who stated, “you shall know a word by the company it keeps” (Firth, 1957, p. 11).

Reference list
Charles, M. (2015). Same task, different corpus: The role of personal corpora in EAP classes. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 129–154). John Benjamins.
Firth, J. R. (1957). A synopsis of linguistic theory. In Studies in linguistic analysis. Blackwell.

9 PRACTICAL USE OF CORPORA IN TEACHING ARGUMENTATION Tatyana Karpenko-Seccombe

Introduction

The value of data-driven learning (DDL) in language teaching has been aptly summarised by Tom Cobb, who argued that ‘no one can learn everything there is to know about a language from being told; at some point, the learning depends on observing how the language works and drawing one’s own conclusions from observations’ (2010, p. 16). Although corpus-based learning is increasingly becoming an established methodology in language classrooms, the emphasis is still placed on so-called low-level phenomena, such as vocabulary and grammar. The potential of DDL to address the rhetorical and discoursal aspects of academic writing is often overlooked (Boulton, 2007; Charles, 2011; Hunston, 2002, p. 184). Despite a significant body of academic work dedicated to the analysis of corpus methods for teaching academic discourse (Biber et al., 2007; Charles et al., 2009; Poole, 2016; Flowerdew, 2003; Charles, 2011), the tendency to prefer ‘low-level’ to ‘high-level’ linguistic processes still exists, as was noted by Boulton (2007), Boulton et al. (2012) and later by Crosthwaite et al. (2019) and Karpenko-Seccombe (2023). Another important consideration is the insufficient practical support for teachers interested in incorporating corpus consultations into regular teaching practice: ready-made teaching materials are in limited supply, given the time-consuming nature of preparing DDL activities (Vyatkina & Boulton, 2017, p. 2; see also Karpenko-Seccombe, 2018). This chapter attempts to address these gaps by offering several practical suggestions for introducing corpus consultations into lessons concerned with developing an argument. Effective argumentation is an essential skill for students mastering academic discourse. However, existing research indicates that learner writers often


lack a clear understanding of what argumentation is, which supports the case for explicit teaching of argumentation (Wingate, 2012, p. 153). The use of corpus consultation can assist in the explicit teaching of argument by drawing students’ attention to its linguistic manifestations. This chapter considers ways of using corpora in teaching argumentation, in particular claim-support and problem-solution patterns, evaluative language and appropriate strength of claim. It suggests a series of practical activities which can be integrated into a variety of lessons. Wherever possible, the chapter also points to disciplinary variations in presenting an argument. The activities suggested here are targeted at upper-intermediate and advanced second language learners; most can be used across multiple disciplines. The tasks are based on the use of three free online corpus tools: Lextutor Concordancer (Cobb), SkELL (Baisa & Suchomel, 2014) and MICUSP (University of Michigan, 2009). These were chosen because they present data in three different ways: SkELL offers students quick answers to queries about collocations and gives examples of usage in the form of sentences; Lextutor presents corpus results as concordance lines and gives an opportunity to see the larger context of two or three paragraphs; and MICUSP allows users to see examples in the context of a paragraph as well as the complete text of a paper or essay.

1. Presenting arguments

The first set of activities is aimed at introducing students to argumentation-specific vocabulary. Several tasks can be used for pre-writing preparation and as support during the writing process. The screenshots can also be used in creating printed paper material.

Task 1. Students use SkELL (Word Sketch, see Figure 9.1) and Lextutor (BAWE and/or General Academic) to explore adjectival collocates of argument. In Lextutor, students can run two searches: on argument, sorting results to the left (Figure 9.2), and on argument is, sorting results by two words to the right (Figure 9.3). Students can work in groups:
a) one group is asked to think of as many adjectives as possible that can be used with the noun argument while another group uses corpus tools. The results are later compared.
b) each group uses a different corpus tool and the groups compare findings.
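For teachers who want to preview what such a collocate search returns, the underlying idea is easy to sketch in code. The following Python fragment counts the words immediately to the left of a node word, which is roughly what a left-sorted Lextutor search or a SkELL Word Sketch column surfaces; the mini-corpus is invented for illustration and is not drawn from any of the tools named above.

```python
# Counting left-hand collocates of a node word -- roughly what a
# left-sorted concordance or a Word Sketch column surfaces.
# The mini-corpus is invented for demonstration purposes.
from collections import Counter

corpus = (
    "This is a compelling argument . "
    "The author presents a weak argument . "
    "She advances a compelling argument here . "
    "A flawed argument undermines the essay ."
)

tokens = corpus.lower().split()

# The word immediately to the left of each occurrence of "argument".
left_collocates = Counter(
    tokens[i - 1] for i, tok in enumerate(tokens) if tok == "argument" and i > 0
)

print(left_collocates.most_common())  # "compelling" appears twice
```

On a real corpus, such raw counts would normally be filtered by part of speech and weighted by an association measure, which is what dedicated tools do behind the scenes.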

FIGURE 9.1 Adjectives used with argument in SkELL

FIGURE 9.2 Adjectives used with argument in Lextutor

These results can be complemented with exploration of concordance lines in Lextutor using the KeyWord in Context (KWIC) search and sorting the results to the left of the keyword (Figure 9.2).
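The KWIC display itself rests on a simple procedure: find each occurrence of the keyword, keep a window of context on either side, and sort the lines by the word next to the keyword. A minimal sketch follows, with an invented token list; this illustrates the idea, not Lextutor's actual implementation.

```python
# Minimal KWIC (KeyWord in Context) concordancer with left-sorting.
# The token list is invented; this is a sketch, not Lextutor's code.
def kwic(tokens, keyword, width=3):
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    # Sort by the word immediately to the left of the keyword,
    # as in a "sort left 1" concordance view.
    return sorted(lines, key=lambda ln: ln[0][-1].lower() if ln[0] else "")

tokens = ("We reject the weak argument ; a strong argument "
          "needs evidence ; this flawed argument fails").split()

for left, kw, right in kwic(tokens, "argument"):
    print(" ".join(left).rjust(20), kw, " ".join(right))
```

Sorting on the word one slot to the left groups together lines with the same pre-modifier, which is exactly what makes adjectival collocates easy to spot in the display.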

FIGURE 9.3 Adjectives used with argument in Lextutor. Search argument is in Academic General

Task 2. After completing the searches, students can discuss the examples from these searches in groups:
a) the groups can decide which adjectives characterise an argument as positive or negative and then compare their results;
b) one group could look for positive, and the other for negative, adjectives.
The teacher monitors the search and facilitates discussion. By the end of the session, students should have a clear idea about the strength and usage of evaluative adjectives used with argument and their role in expressing the author’s stance.
Follow-up task. Students are asked to use adjectives from corpora in a piece of writing evaluating arguments identified in the sources they are reading. This task can serve as preparation for writing a critical literature review.

The next task, based on a similar search, is aimed at recognising the critical approach that writers take in discussing other scholars’ arguments.

FIGURE 9.4 Search this argument is in Lextutor (BAWE)

Task 3. Ask students to search Lextutor (BAWE or Academic General), putting this argument is in the search window (Figure 9.4), and to sort the results to the right. Prompt them to consider the concordance lines and find examples of critical writing where the writer:
1 Shows the development of the argument
2 Shows confidence in the argument
3 Evaluates the argument positively
4 Points to the flaws in the argument
5 Provides an explanation of the argument

Follow-up task. Ask students to think of three sources that they have read recently and summarise the main argument in one sentence. Then invite them to add another sentence evaluating the argument using phrases found in the corpus.

2. Introducing claims

The process of argumentation commonly starts with a claim, followed by the evidence to support this claim, and a conclusion. A claim is often presented as a clear and assertive statement that conveys the main idea of the author. Presenting a claim (also known as a thesis statement) is often signalled by sentences beginning with This study/This research/This dissertation/This thesis/This work. Such introductory phrases are likely to be found in abstracts, and, therefore, students can be directed


to the Lextutor corpus of Academic Abstracts. However, not all the sentences starting with This study/research/dissertation/thesis/work are claims – they may also contain descriptive statements of the content of work or research topic. The next task is based on a search for study/research/dissertation/thesis/work in association with This (Figures 9.5–9.6).
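The search described here, several alternative node words plus an associated word, can be mimicked with a simple regular expression, which may help teachers pre-screen their own collections of abstracts for claim-like openings. The abstract snippets below are invented for illustration.

```python
# Mimicking a multi-word alternation search with an associated word:
# dissertation|thesis|study|paper|work preceded by "This".
# The abstract snippets are invented for demonstration.
import re

pattern = re.compile(r"\bThis (dissertation|thesis|study|paper|work)\b")

abstracts = [
    "This thesis argues that liberalisation implies a continuing link.",
    "This dissertation reports on a laboratory experiment.",
    "Results are discussed in the final chapter of this work.",
]

# Case-sensitive "This" keeps only sentence-initial signals, so the
# lower-case "this work" in the third snippet is deliberately skipped.
hits = [m.group(0) for text in abstracts for m in pattern.finditer(text)]
print(hits)
```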

Task 4. In Lextutor, put the words dissertation | thesis | study | paper | work, separated by a vertical bar, in the search window. Run a search in association with This (see Figure 9.5). Next, introduce the concept and characteristic features of a claim as opposed to a research topic, for example:
1 a claim shows the author’s position
2 a claim is what other scholars can agree and disagree with – it is not a descriptive statement of the research topic
3 a claim is concise and specific
Show the group examples from concordance lines (Examples 1–2 below). Ask students to scroll down the search results (Figure 9.6) to find examples of claims and descriptive statements. If students find it difficult to identify a claim solely from a concordance line, encourage them to investigate the larger context by clicking on the keyword (Examples 3–5).

These examples should help students identify the difference. In (1) the writer introduces a claim: (1) This thesis argues that liberalisation implies a continuing link between government and industry, whereas in (2) a description of the content of the work is more likely: (2) This dissertation reports on a laboratory experiment.

FIGURE 9.5 Multiple words search in Lextutor with associated word this in Academic Abstracts

FIGURE 9.6 Results of multiple words search in Lextutor with associated word this

A broader context helps students distinguish between descriptive (3) and argumentative (4) statements, or notice a mixture of both (5): (3) This study examines William Notman’s portraits of children, taken between 1856 and 1891 in the Montreal studio. (4) The first chapter of this dissertation assesses the extent to which the construction of new limited access highways has contributed to centre city population decline. (5) This thesis, firstly, describes the traditional settlement patterns of the Garawa people of northern Australia. Secondly, it develops methodological and theoretical approaches to the study of hunter-gatherer settlement patterns. A follow-up task can be set to connect corpus searches with the students’ own writing. Follow-up task. Using words and phrases learned from corpus searches, students are asked to write two sentences connected with the topic they are working on: one describing the topic or content of work, and the other introducing their central claim.

3. Providing evidence

In argumentative writing, writers are expected to provide evidence in support of the validity of their claim. There are several ways in which corpus searches can help students construct robust evidence-based arguments, for example, searches in Lextutor involving the words claim or argument in association with evidence,


support or proof (see Figure 9.7 for an example of the search claim + evidence). These searches suggest to the students that expert writers make sure that their claims are supported, and demonstrate what language writers use to achieve this.
Task 5. Ask students to run a search on the word claim in association with evidence and find words and expressions related to providing evidence for claims (Figure 9.7). Have students record words and phrases which can be useful in their writing process, e.g., to bolster the claim, to provide positive evidence in favour, to obtain direct evidence, absence of formal evidence, criticise the empirical evidence, evidence cited in support of a claim.
Follow-up task. Encourage students to use the vocabulary learned from the searches in their own piece of writing presenting and supporting their claims or assessing support provided by other authors.

A search on evidence + an infinitive (Figure 9.8) reveals a pattern that has several functions:
1 it can be used to provide a precise explanation of the purpose of the presented piece of evidence (evidence to show, evidence to prove, evidence to indicate);
2 it can also serve to support a claim (we have sufficient evidence to conclude, I will show evidence to confirm);
3 coupled with no/not, it can also be used to question other writers’ evidence (she offers no evidence to attest this claim, there is no reliable evidence to say that); or
4 it can be used to strengthen one’s own claim by refuting possible objections in counter-argumentation (there is no clear evidence to dismiss, there is certainly little evidence to dispute).
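Teachers preparing handouts around this pattern can harvest candidate examples from their own texts with a short regular-expression search; the sketch below, which uses invented sentences, simply pulls out the verb that follows evidence to.

```python
# Harvesting the "evidence + to-infinitive" pattern with a regex.
# The sentences are invented for demonstration.
import re

sentences = [
    "She offers no evidence to attest this claim.",
    "We have sufficient evidence to conclude that the effect is real.",
    "There is little evidence to dispute the main finding.",
]

pattern = re.compile(r"\bevidence to (\w+)")
verbs = [m.group(1) for s in sentences for m in pattern.finditer(s)]
print(verbs)  # the verbs following "evidence to"
```

The extracted verbs can then be sorted by the functions listed above (explaining, supporting, questioning, refuting) when building a classification task for students.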

FIGURE 9.7 Claim in association with evidence (10 words both sides, Lextutor, Academic General)

FIGURE 9.8 Evidence + infinitive (Lextutor, BAWE)

Task 6. Ask students to identify the functions of infinitives following evidence (Figure 9.8).

In the next task, students are asked to explore recurrent patterns of verbs and adjectives collocating with evidence. This task can be followed by further discussion of the role of evaluative verbs and adjectives in indicating the strength of evidence.

Task 7. Ask students to run a search on evidence using SkELL (Word Sketch) or Lextutor (Academic General and BAWE) and then:
a) explore adjectives used with evidence. Ask students to work in groups and to classify adjectives according to their evaluative function into strong (e.g., convincing, definitive, incontrovertible, corroborative), weak (e.g., limited, inadequate, fragmentary, meagre) or neutral (e.g., experimental, factual, additional). Monitor their work and discuss cases where they disagree.
b) compare verbs used with evidence as a subject (e.g., reveals, proves, supports, seems to suggest), comment on their strength and classify them as either assertive or tentative. Ask students to reflect on how strength of expression works in academic writing.


4. Disciplinary conventions

The ways of presenting an argument are inherently connected to the conventions existing in different academic disciplines. For example, academic writers may frame their argument as finding a solution to a problem rather than employing the claim-support structure discussed earlier. Problem-solution could be used in case studies, argumentative essays and reports, or may underlie the structure of a research paper or thesis. Sometimes authors also prefer to present their line of argument as a hypothesis that is subsequently tested in the course of further investigation. To explore the disciplinary conventions in presenting arguments, students can be asked to compare the frequency of distribution of the words argument, claim, problem, solution, hypothesis in corpora for different subject areas using the corpus tool MICUSP, which offers 16 discipline-specific sub-corpora. The corpus searches presented next are useful for teaching students in multi-discipline groups because they indicate preferences for a particular type of argumentation in different disciplines: presenting an argument, testing a hypothesis or solving a problem.

Task 8. Ask students to use MICUSP and to put the word argument in the search window. The search results (see the chart in Figure 9.9) show the distribution of the word by discipline. Next, ask students to run similar searches on the words problem (Figure 9.10) and hypothesis (Figure 9.11). Draw students’ attention to the fact that these searches display different dynamics of word usage. This task guides students towards understanding discipline-specific ways of presenting an argument and the preferences specific to their own subject area.
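When comparing word frequencies across sub-corpora of different sizes, raw counts are misleading, so frequencies are usually normalised, for example per 10,000 words. A small Python sketch of this calculation follows; all counts and corpus sizes here are invented, not MICUSP figures.

```python
# Normalised frequency per 10,000 words across sub-corpora of
# different sizes. All counts and corpus sizes are invented.
raw_counts = {"Philosophy": 310, "Economics": 190,
              "Mechanical Engineering": 12}
corpus_sizes = {"Philosophy": 250_000, "Economics": 220_000,
                "Mechanical Engineering": 180_000}

per_10k = {
    disc: round(raw_counts[disc] / corpus_sizes[disc] * 10_000, 2)
    for disc in raw_counts
}

for disc, rate in sorted(per_10k.items(), key=lambda kv: -kv[1]):
    print(f"{disc}: {rate} occurrences per 10,000 words")
```

Normalisation of this kind is what allows the by-discipline comparisons discussed in this section to be read as genuine differences in usage rather than artefacts of sub-corpus size.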

FIGURE 9.9 Argument in MICUSP

FIGURE 9.10 Problem in MICUSP

The results of the search in Figure 9.9 show that the top three disciplines in the MICUSP corpus for most frequent use of the word argument are Philosophy, Economics and Political Science; in the fields of Industrial and Operational Engineering, Mechanical Engineering, Civil and Environmental Engineering argument is used least frequently. A search on problem (Figure 9.10) in all available disciplinary sub-corpora shows that the use of problem is generally high across all disciplines. While highest of all in Industrial and Operational Engineering, it is also significant in Economics and Philosophy. This may suggest that presenting an argument is preferred in humanities and solving a problem is more typical in sciences. A search on hypothesis in the MICUSP corpus, however, shows that this is the preferred way of presenting an argument in Biology, Linguistics and Psychology (Figure 9.11). Similar to the tasks on word usage and collocations of the words argument and evidence (Tasks 1, 2, 7), the following tasks involve the words problem and hypothesis (Tasks 9 and 10):


FIGURE 9.11 Hypothesis in MICUSP

Task 9. Have students do a search on the word problem. In Lextutor they need to sort the results to the left (Figure 9.12), and in SkELL they need to focus on adjectives and modifiers of problem. By scrolling through the search results, students are asked to find alternatives to bad, small, growing or lasting problem (e.g., bad – acute, severe, serious; small – minor, limited, minimal; growing – increasing, escalating; lasting – recurrent, continuing, pervasive). Ask students to reflect on the difference between concordance results and results from thesauri. Similarly, students can examine adjectives used with solution in Lextutor and/or SkELL and grade the adjectives as strong or cautious according to the degree of the writer’s confidence in the solution. Draw students’ attention to the way the variation in strength of an adjective expresses the writer’s stance.
Follow-up task. The examples of strong (critical, convincing, credible, definitive, successful) and cautious (acceptable, suitable, tentative, workable) adjectives can be integrated into students’ writing of a critical survey of existing solutions to a particular problem.

The ways in which academic writers talk about hypotheses can also be explored through Lextutor and SkELL searches. SkELL searches are demonstrated in Figures 9.13 and 9.14.

FIGURE 9.12 Adjectives collocating with problem in Lextutor, Academic General

Task 10. In SkELL (Word Sketch), by clicking on a collocate, students can examine sentences in which hypothesis is used together with one of the collocates, as in Figure 9.14, hypothesis + refute.

Students can also search Lextutor to investigate phrases containing hypothesis (Figure 9.15), record them and use them in sentences related to their own research.

FIGURE 9.13 Hypothesis in Word Sketch in SkELL


FIGURE 9.14 Refute + hypothesis in Examples in SkELL

FIGURE 9.15 Phrases containing hypothesis (Lextutor)

5. Observing argumentation in larger texts

MICUSP offers a unique opportunity for exploring examples of the development of argumentation, the problem-solution rhetorical pattern, and the presenting and testing of a hypothesis in the context of a whole paper or essay. Below are examples of suggested texts for classroom discussion, narrowed down by subject area, paper type or textual features. Students can find texts containing the word argument by putting argument in the search box and choosing ‘Argumentative Essay’ in the ‘Paper Type’ section (see Figure 9.16). The full text of an essay can be seen by clicking on ‘Paper ID’, which gives students an opportunity to observe the development of an argument (the paragraphs containing the search term are highlighted in the text).

FIGURE 9.16 Use of argument by discipline (MICUSP)

Task 11. An essay in economics (ECO.G0.03.1) offers a discussion of several arguments presented by different scholars. The author then refutes these arguments and proposes his own. Ask students to discuss the essay, paying attention to the language used by the author to refute previous arguments and advance his own. Similarly, a search on claim in ‘Argumentative essays’ can also provide material for group discussion. For example, in an essay in English (ENG.G0.04.1) the author discusses and evaluates claims from earlier research. Ask students to identify linguistic means used by the author to critically assess claims made in the literature.

To investigate the way the problem-solution pattern is presented throughout the text of a paper, one may put problem or solution in the search window and choose from the ‘Textual Features’ drop-down menu the option ‘Problem-solution pattern’ (Figure 9.17).

Task 12. Take as an example a paper in Industrial & Operations Engineering (IOE.G2.03.1) and ask students to find and discuss extracts related to the problem-solution pattern.

Longer examples of developing and supporting a hypothesis can be found by running a search on hypothesis in the ‘Paper Type’ sections of Research Papers and Reports, in which the occurrences of hypothesis are most frequent, 51% and 35% respectively (Figure 9.18).


FIGURE 9.17 Use of problem by discipline (MICUSP)

FIGURE 9.18 Use of hypothesis by discipline in reports and research papers (MICUSP)

Task 13. Discuss with students a Biology paper (BIO.G0.04.2) which demonstrates the entire process of suggesting a hypothesis, experimentally testing it and rejecting it in the conclusion. Ask students working in groups to identify the linguistic signposts and evaluative vocabulary used by the author in the course of discussing this hypothesis.

Conclusion

To sum up, this chapter has briefly outlined ways in which corpus materials can be helpful to learners in developing evidence-based arguments, analysing and evaluating arguments and evidence, taking a stance and engaging in academic debate.

Practical use of corpora in teaching argumentation


The value of corpus consultations lies in raising learners’ awareness of the language used in argumentation and of its disciplinary variations. The tasks suggested here are aimed at supporting students in observing natural academic language and the way it functions in discourse, in order to help them make informed language choices in writing.

Reference list

Biber, D., Connor, U., & Upton, T. (2007). Discourse on the move: Using corpus analysis to describe discourse structure. John Benjamins.
Boulton, A. (2007). DDL is in the details . . . and in the big themes. In Corpus Linguistics Conference: CL2007, hal-00326994. Retrieved June 20, 2023, from https://hal.archives-ouvertes.fr/hal-00326994
Boulton, A., Carter-Thomas, S., & Rowley-Jolivet, E. (Eds.). (2012). Corpus-informed research and learning in ESP: Issues and applications (Vol. 52). John Benjamins.
Charles, M. (2011). Using hands-on concordancing to teach rhetorical functions: Evaluation and implications for EAP writing classes. In A. Frankenberg-Garcia, L. Flowerdew, & G. Aston (Eds.), New trends in corpora and language learning (pp. 81–104). Continuum.
Charles, M., Pecorari, D., & Hunston, S. (2009). Introduction: Exploring the interface between corpus linguistics and discourse analysis. In M. Charles, D. Pecorari, & S. Hunston (Eds.), Academic writing: At the interface of corpus and discourse (pp. 1–10). Continuum.
Cobb, T. (2010). Instructional uses of linguistic technologies. Procedia – Social and Behavioral Sciences, 3, 14–23.
Crosthwaite, P., Wong, L. L., & Cheung, J. (2019). Characterising postgraduate students’ corpus query and usage patterns for disciplinary data-driven learning. ReCALL, 31(3), 255–275.
Flowerdew, L. (2003). A combined corpus and systemic-functional analysis of the problem-solution pattern in a student and professional corpus of technical writing. TESOL Quarterly, 37(3), 489–511.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge University Press.
Karpenko-Seccombe, T. (2018). Practical concordancing for upper-intermediate and advanced academic writing: Ready-to-use teaching and learning materials. Journal of English for Academic Purposes, 36, 135–141.
Karpenko-Seccombe, T. (2023). Data-driven learning: Aiming at the bigger picture. Nordic Journal of English Studies, 22(1).
Poole, R. (2016). A corpus-aided approach for the teaching and learning of rhetoric in an undergraduate composition course for L2 writers. Journal of English for Academic Purposes, 21, 99–109.
Vyatkina, N., & Boulton, A. (2017). Corpora in language teaching and learning. Language Learning and Technology, 21(3), 66–89.
Wingate, U. (2012). “Argument!” Helping students understand what essay writing is about. Journal of English for Academic Purposes, 11(2), 145–154.

Corpus tools and corpora

Cobb, T. Lextutor concordancer. lextutor.ca/conc/eng/


MICUSP: Michigan Corpus of Upper-level Student Papers (2009). The Regents of the University of Michigan. micusp.elicorpora.info/
BAWE (used in Lextutor): Nesi, Gardner, Thompson & Wickens (2008). British Academic Written English corpus, developed at the Universities of Warwick, Reading and Oxford Brookes. coventry.ac.uk/bawe
SkELL (Sketch Engine for Language Learning): Baisa, V., & Suchomel, V. (2014). SkELL – Web interface for English language learning. In Eighth workshop on recent advances in Slavonic natural language processing (pp. 63–70). Tribun EU. ISSN 2336-4289. https://skell.sketchengine.co.uk/run.cgi/skell

9.1 Corpora and concordancers in teaching rhetorical functions

Practical material for both university teachers and doctoral students

Sylvia Goetze Wake

How can ordinary teachers feel more comfortable with using corpora? The lack of resources for practitioners who are not necessarily corpus researchers may be one reason so many teachers simply avoid this area. Yet, Dr Tatyana Karpenko-Seccombe has made her materials widely available and transparent: her examples, tasks and exercises were first shared online and then grew into a book (2020). These practical ideas opened up possibilities in our language centre, and her resources were a key contribution in exposing our academic writing teachers to corpus searches.

In her chapter, Dr Karpenko-Seccombe explains how she uses corpus results to illustrate patterns in academic writing. This creates interest in corpora, piquing the curiosity of students. She demonstrates searches using BNC, MICUSP and Lextutor (BAWE corpus), but in our context, we applied her exercises using AntConc (Anthony, 2022). Doctoral writers used their own discipline-specific corpora in a one-day seminar our language teachers developed, exploring corpus results of writing from their own field. We started with simple exercises on formal and informal language use (such as ‘really’ or ‘interesting’) as a springboard to discover real language use. In terms of rhetorical functions, students explored hedging and boosting language as well as expressions such as ‘it is * that’. To focus on different parts of the paper, and inspired by Dr Karpenko-Seccombe’s example of introductions, our teachers had doctoral writers explore the research gap using the KWIC searches ‘little research’, ‘less attention’ and ‘few researchers/studies’. Going further, we suggested searches on ‘method’ and ‘methodology’ in that particular section.


Dr Karpenko-Seccombe ends her chapter by discussing the problem-solution rhetorical pattern (Swales & Feak, 2009) and its prevalence, for example, in engineering research papers. Looking at adjectives to the left of the word ‘solution’ was presented as a lead-in to evaluating strength of claim in a practical classroom exercise. Using activities inspired by these tasks, our course ‘Editing One’s Writing Using Corpora’ combined corpus and discourse approaches and thus exposed more specific phraseology than can be found in more general academic lists. But even more, the ongoing dance between theory and practice in these corpus examples enticed the participants to explore, check, and discover, rather than avoid, which is perhaps the goal of data-driven learning.

This work had a positive impact on professional development in our centre. The practical, hands-on approach to corpus application at both the lexical and rhetorical levels provided content ideas for searches relevant to our context in the user-friendly AntConc interface. Using these resources contributed to teacher confidence and engagement with corpora and subsequently served as a base for these same teachers to create and launch the corpus-based editing course for doctoral students at our institution.

Reference list

Anthony, L. (2022). AntConc (Version 4.1.4) [Computer Software]. www.laurenceanthony.net/software
Karpenko-Seccombe, T. (2020). Academic writing with corpora: A resource book for data-driven learning. Routledge.
Swales, J., & Feak, C. (2009). Academic writing for graduate students: Essential tasks and skills. University of Michigan Press.

9.2 Exploring further the practical use of corpora and concordancers in teaching rhetorical functions

Ning Liu

Academic writing instructors may often find a gap between course materials and students’ learning needs. In a course at my current institution, I teach students how to use academic writing as a technique for documenting their research processes and disseminating their research outputs. McCormack and Slaght’s (2018) book, which connects academic writing with research skills, is used as compulsory course material. Aside from teaching essential research skills, I guide students in analyzing the genre moves and steps in the sample texts provided in the book. Despite the many qualities of the textbook, these sample texts are not always associated with the same discipline as the projects initiated by the students, and thus the genre moves and steps realized in these texts are not fully applicable to the students’ writing. Consequently, corpus-based genre analysis (e.g., Lu et al., 2021) is incorporated into my pedagogy to resolve this disparity.

Corpus-based genre analysis suggests that writing instructors guide their students to identify recurrent linguistic patterns in corpus examples in order to determine the genres. In both her chapter and her recent book (Karpenko-Seccombe, 2021), Dr. Karpenko-Seccombe has provided abundant examples and resources for implementing this pedagogy. She suggests that instructors focus on a key word that realizes the rhetorical purpose of the genre move or step, and then expand the search to the collocations surrounding this key word. Instructors can establish multiple search phases: for example, from the phrase “it is important to”, to the verb collocations following the phrase, to the reasons statement that typically surrounds the importance statement. Additionally, she recommends that instructors guide their students to consider and discuss why the text authors make such typical language choices at every phase.

This has enabled me to modify my pedagogy. In a typical session, after explaining the genre moves and steps of a text type as presented in the aforementioned textbook, I encourage students to search for the same text type in their own disciplines in MICUSP and analyze the sentences that realize a similar generic purpose. As a first step, they need to locate the key words and their usage patterns that correspond to the rhetorical functions. Following this, they search for alternative language patterns in their discipline-specific sub-corpora, using the wildcard search function in the BNC or the Word Sketch function in SkELL. Finally, they identify the remaining language components of the genre move or step in question, and discuss why the author includes them.
This endeavor has bridged the aforementioned gap and strengthened the students’ genre awareness.

Reference list

Karpenko-Seccombe, T. (2021). Academic writing with corpora: A resource book for data-driven learning. Routledge.
Lu, X., Casal, E., & Liu, Y. (2021). Corpus-based genre analysis. In H. Mohebbi & C. Coombe (Eds.), Research questions in language education and applied linguistics: A reference guide. Springer Nature.
McCormack, J., & Slaght, J. (2018). Extended writing & research skills: English for academic study (3rd ed.). Foreign Language Teaching and Research Press.

10 ON THE PERILS OF LINGUISTICALLY OPAQUE MEASURES AND METHODS

Toward increased transparency and linguistic interpretability

Tove Larsson and Douglas Biber

1. Introduction

In studies whose primary goal is the linguistic description of language features, we have to do our utmost to ensure that we measure and analyze said features in a way that is linguistically interpretable. A measure is considered linguistically interpretable (a) “when its scale and values represent a real-world language phenomenon that can be understood and explained” (Egbert et al., 2020, p. 24) and (b) when all variables have clear operational definitions. The present chapter covers a discussion between Tove Larsson and Douglas Biber at Northern Arizona University, during which they argue in favor of using linguistically interpretable measures and techniques. Specifically, the chapter covers four potentially problematic issues in this context, along with a conclusion. Each subsection starts with a brief description of or background to the issue, followed by answers to the questions posed.

2. Issue #1: Analyzing a concrete linguistic feature with no description of what is included or excluded

Like specialists in all disciplines, linguists share a technical vocabulary as part of their everyday discourse. We use terms like noun, verb, nominalization, subordinate clause, and relative clause as if they had obvious meanings understood by all of our colleagues. In a normal conversation, this assumption usually causes no problems. However, carrying out a quantitative corpus analysis is an entirely different enterprise, because every word and every structure in the corpus must be evaluated to decide whether it fits the definition of the target feature. That is, the final result of a quantitative corpus analysis is a number, suggesting that the analyst has identified the exact number of times that the target linguistic feature occurs. Surprisingly, though, such analyses are often not based on a detailed operational definition of the target feature.

DOI: 10.4324/9781003413301-10

What are some issues that may arise if we attempt to analyze a linguistic feature with no description of what is included or excluded?

[DB:] The importance of operational definitions for everyday linguistic terms is not obvious until we try to carry out a quantitative analysis of particular linguistic features. For example, take a moment to count the verbs, nouns, and nominalizations in the following text excerpt:

There is growing evidence that stress may contribute to physical illness such as cardiovascular disease. Discuss with your doctor how stress management may be used to support more conventional forms of treatment. Stress can also develop into an anxiety disorder, including panic disorder, which is a condition where a wave of sudden panic overtakes the person for no apparent reason. In addition, phobias can be a challenge, leading to a fear of encountering specific situations.

How many verbs did you find? Your answer will depend on your operational definitions. For example, did you count the modals may and can as verbs? What about the auxiliary be? And what did you decide about the four -ing words (growing, including, leading, encountering)? Did you treat all four words in the same way because they have the same form? Or did you treat them differently, based on their different syntactic functions?

Similar issues arise in any precise count of ‘nouns’. For example, did you count nominalizations like management as nouns? What about words that occur before another noun, such as stress management and anxiety disorder? And what about -ing words embedded in a prepositional phrase, like encountering?

There are even more operational challenges for deciding what to count as a nominalization. Should we include words that might have been converted from a verb to a noun (e.g., stress, challenge)? Or should we only include words that have a derivational suffix (e.g., illness, management, treatment)? And should we include all words that appear to have a derivational suffix? Or should we be able to identify the specific root that a nominalization is derived from? For example, what did you decide about the words evidence, anxiety, disorder, condition, situations? These words all appear to have derivational suffixes that can be used to form a nominalization from a verb/adjective. But, for many of these words, it is difficult to identify what that underlying verb/adjective would have been.

Questions like these can make a huge difference for the quantitative results of a corpus analysis. For example, we could operationally define ‘nominalization’ to include all words that have one of the following derivational suffixes: -ness, -ment, -ence, -iety, -er, -tion. In this case, there would be 9 nominalizations in this text excerpt (evidence, illness, management, treatment, anxiety, disorder(×2), condition, situations), which would correspond to an extremely high rate of 120 nominalizations per 1,000 words. In contrast, we could adopt a stricter operational definition, with the requirement that a ‘nominalization’ has both a derivational suffix and an obvious root that it is derived from. In this case, we might exclude the words disorder(×2), condition, and situations, and as a result end up with a much lower rate of 67 nominalizations per 1,000 words.

Of course, even seemingly precise operational definitions can still be problematic to apply in practice. For example, speakers of present-day English clearly do not think of the noun condition as if it were derived from a verb like condite (even though this word came from a verb historically). But what about the noun situation? In this case, English has a corresponding verb – situate – even though most speakers probably would not think of situation as being derived from situate.

As a grammarian, I would be happy to give my preferred answers to all of the previous questions. But that is not the main point here. Rather, the issue here is what we should all be doing as quantitative corpus linguists: providing the detailed operational definitions that we applied when we analyzed a corpus. The peril is that corpus studies have the appearance of being based on careful scientific analyses, resulting in precise quantitative findings, when in fact we have not even critically thought about the linguistic basis of those findings. For example, we can easily obtain results like nominalizations that occur 22.78 times per 1,000 words in one text, versus 42.33 times per 1,000 words in another text. But if we are not able to specify what the term ‘nominalization’ means, then those precise quantitative findings are useless.
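As a quick illustration of how much the operational definition drives the resulting rate, the arithmetic above can be sketched in a few lines. The word lists simply hard-code the two definitions just discussed, applied to the 75-word excerpt; this is a sketch of the normalization, not a tagger:

```python
# Rate of a feature per 1,000 words of running text, under two
# operational definitions of 'nominalization' for the 75-word excerpt.

def rate_per_1000(count, total_words):
    """Normalized rate: occurrences per 1,000 words."""
    return count * 1000 / total_words

TOTAL_WORDS = 75  # length of the text excerpt

# Broad definition: any word carrying one of the listed suffixes
broad = ["evidence", "illness", "management", "treatment", "anxiety",
         "disorder", "disorder", "condition", "situations"]

# Strict definition: suffix plus an identifiable root
# (excludes disorder x2, condition, situations)
strict = ["evidence", "illness", "management", "treatment", "anxiety"]

print(round(rate_per_1000(len(broad), TOTAL_WORDS)))   # 120
print(round(rate_per_1000(len(strict), TOTAL_WORDS)))  # 67
```

The point is that the code is trivial; the linguistic decision encoded in the word lists is what determines whether the output is 120 or 67.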
Here, we focus on basic linguistic constructs, like nouns and verbs. But this same challenge applies to every linguistic analysis of a corpus, including more complex grammatical features (e.g., relative clauses, adverbials) as well as lexical and phraseological features (e.g., key words, lexical bundles, collocations). The bottom line is that it is easy in corpus-based research to give the appearance of precise scientific findings – but it is much more challenging, and more important, to ensure that those quantitative results are actually based on meaningful and replicable linguistic analyses.

3. Issue #2: Using measures with no clear understanding of what linguistic characteristics were analyzed

Automated tools will provide numeric output and give the appearance that a measure is trustworthy and well defined. Taking this at face value is problematic if the measure is not actually linguistically interpretable. For example, the measure complex nominals from Lu’s (2017) L2 complexity tagger confounds a large number of structurally and syntactically distinct grammatical features (see Biber et al., 2020; Egbert et al., 2020: Section 5.2). It is not linguistically interpretable; yet studies that have the stated goal of linguistic description use this measure as if it were.

How can we assess the extent to which a tool provides sufficiently informative, or linguistically interpretable, output, and why does this matter?

[TL:] There are plenty of tools out there that tool developers have generously made freely available. Many of them are fantastic and do a really good job of providing informative, linguistically interpretable output. However, we have to be critical users of tools – just because a measure has been included in a tool and maybe also has been used in previous studies does not mean that we should use it in our study. The key is to ask ourselves ‘does this measure represent a well-defined linguistic phenomenon?’ and only use that measure if the answer is ‘yes’.

Let us take grammatical complexity as an example, and I will share an experience I had, in the hopes that it will help illustrate why this is important and make this more concrete. In a study I did with a colleague, the answer to that question was ‘no’ for many of the measures we used for part of that study. The fact that we went ahead anyway got us into a lot of unnecessary trouble and ultimately led us to use a different tool that provided more detailed information.

Take complex nominals as an example. It is described as covering structures such as possessives, prepositional phrases, relative clauses, participles, complement clauses controlled by verbs, and infinitives when found in subject position. In our study, we found that this measure was important for explaining register differences, but what had we really found out here? That one or several of these structures matter – but which ones? In fact, to complicate matters further, we had an even larger problem on our hands: The measure we used was actually complex nominals per T-unit, which adds two further layers of opaqueness. A T-unit is one main clause plus any subordinate clause attached to it.
As such, it conflates structural and syntactic characteristics. For example, a T-unit can be a sentence with one main clause and extensive phrasal embedding, or a sentence with one main clause and multiple dependent clauses; these are very different, but they both get the same value. We thus have two measures here that are linguistically opaque: complex nominals and T-units. And not only that, but we had to try to interpret a ratio-based measure: complex nominals per T-unit, meaning complex nominals in the numerator and T-units in the denominator. What this means is that a large value can be due to either the numerator or the denominator – or both.

In sum, when we wanted to know what specific linguistic features were useful for explaining register differences, the measure complex nominals per T-unit could not help us. This measure does not correspond to a real-world language phenomenon; in fact, it confounds and obscures a large number of real-world language phenomena, without giving the analyst any way to tease them apart. So, trying to interpret results that stem from linguistically opaque measures is a cumbersome and, in many cases, as in ours, even a futile process.
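A minimal sketch makes the opacity of such a ratio concrete. The counts below are hypothetical, invented purely to show that structurally very different texts can receive identical values:

```python
# Hypothetical counts illustrating why a ratio measure is opaque:
# the analyst cannot tell the numerator apart from the denominator.

def cn_per_t_unit(complex_nominals, t_units):
    """Complex nominals per T-unit (a ratio-based measure)."""
    return complex_nominals / t_units

# Text A: heavy phrasal embedding, few T-units (hypothetical)
a = cn_per_t_unit(complex_nominals=12, t_units=4)
# Text B: many dependent clauses, twice the T-units (hypothetical)
b = cn_per_t_unit(complex_nominals=24, t_units=8)

print(a, b, a == b)  # 3.0 3.0 True -- identical values, different texts
```

Both texts score 3.0, so the measure erases exactly the structural differences a linguistic description would need to report.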

4. Issue #3: Using quantitative counts with no access to the annotated text to see what was counted

Many of our studies report results from automated tools, including taggers. That is, as a field, we often rely on tools that only provide numeric output from measures defined and determined by the tool developer. This is problematic if we cannot assess the accuracy of the output.

When we use taggers, why is it important to have access to the annotated text to see what was counted?

[DB:] Imagine being a geologist who was trained to look at a rock and figure out the composition of that rock (e.g., how much quartz, feldspar, mica). Now imagine that this same geologist has been given a new job: to do the same thing, but to generalize the results to the entire Sierra Nevada mountain range – i.e., determine the mineral composition of all the rocks in the mountain range. Clearly, the geologist would need new tools, as trying to look at each individual rock with a microscope is not going to work.

This is kind of what happened to linguistics when it made the transition to corpus linguistics. Traditionally, linguists were trained to study individual words/sentences/texts and to figure out things about the structure and function of the language in that entity. Then, all of a sudden, linguists were given corpora and realized that they could make generalizations about that massive collection of language. However, the old tools did not work so well – it is unrealistic to examine every word/sentence/text in a corpus that consists of hundreds of millions of words! So, for years, we would hear talks at conferences from corpus linguists saying how the major methodological problem for the field was the lack of computational tools for automated analysis. Luckily, computational linguists with strong programming skills developed such tools, including morphological analyzers, part-of-speech taggers, and syntactic parsers.
These powerful tools take raw corpus texts as input and produce annotated texts as output, with each word in the corpus coded for morphological/ grammatical/syntactic characteristics. But this was still not sufficient help for the everyday corpus linguist. It is analogous to computer scientists developing a tool that automatically codes every rock in a mountain range for its mineral composition. Our traditional geologist would then be able to go into the field, pick up a rock, and see that it had been coded. But without additional computational skills or tools, that geologist is still far away from being able to describe the mineral compositions of all rocks in the mountain range.

136

Tove Larsson and Douglas Biber

So, to address this second need, other linguists with strong programming skills developed additional computational tools that take a corpus that has been automatically processed by a tagger/parser as input and then produce counts of linguistic features as output (tools like TAASSC, Kyle, 2016; #LancsBox, Brezina et al., 2020). This meant that everyday corpus linguists finally got the kind of result that they craved: quantitative counts telling us how often different linguistic entities occur in the corpus.

One seemingly nice thing about these automated tools is that we do not need to worry about the problems discussed in Section 2. Or, at least we can stick our head in the sand and pretend that the problem does not exist. For example, if the automated tool gives us the answer to how many ‘nominalizations’ occur in our corpus, why would we need to worry about the operational definition of ‘nominalization’? And so, the job of the everyday corpus linguist has been reduced to being a proficient button-pusher. Linguistic analysis skills are no longer needed because a corpus linguist does not even need to look at the coding produced by taggers/parsers. And computational skills are not needed because the automated analysis tools have already been developed. All that is left is to collect a corpus and use the tool to produce quantitative results – and to then present this process as if one has done a research study.

Ok, maybe I am exaggerating. But I do not think I am far off. Let me emphasize that I am not criticizing the automated analysis tools or the highly skilled researchers who developed those tools. Rather, the problem lies with the end users. For me, a corpus linguist needs to be a linguist – and it is the responsibility of that linguist to ensure that the linguistic analysis underlying any automatic count is accurate and meaningful.
In addition, I would argue that most corpus linguists should develop proficiency in the computational skills required to analyze linguistic patterns in a corpus, but that is a different topic. Doing linguistics with the output from tools requires evaluation of the linguistic accuracy of the automatic coding; that is, we have to assess – and be able to assess – the output of tools and the underlying assumptions made. Many automated analysis tools actually produce output that enables such evaluation: in addition to counts of the target linguistic features, they also produce the annotated texts with the codes in them. Although it might not be trivial, it is usually possible to inspect those annotated texts to evaluate the output.

So, let me jump to my bottom-line recommendations:

1 If an automated tool produces annotated output texts (e.g., showing part-of-speech tags and syntactic parse codes), then it is the responsibility of the end-user to ensure the accuracy of the automatic tagging/parsing.
2 If instead an automated tool does not produce annotated texts that permit the end-user to verify the linguistic accuracy of quantitative results, then the end-user should simply not use that tool!
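One concrete way to follow the first recommendation is to inspect the annotated text rather than accepting the bare count. The sketch below assumes tagger output in the common word_TAG convention; the tagged sample is a hypothetical stand-in for real tool output:

```python
# Sketch: verifying an automatic count against the annotated text itself.
# The word_TAG string below stands in for real tagger output (hypothetical).
tagged = ("There_EX is_VBZ growing_VBG evidence_NN that_IN "
          "stress_NN may_MD contribute_VB")

# Split each token into its word and its tag
tokens = [t.rsplit("_", 1) for t in tagged.split()]

# Do not just take the number -- also list WHAT was counted,
# so a human linguist can check the tagging decisions.
nouns = [word for word, tag in tokens if tag.startswith("NN")]
print(len(nouns), nouns)  # 2 ['evidence', 'stress']
```

Printing the matched words alongside the count is what makes the result auditable: a tagging error (say, a gerund mis-tagged as NN) would be visible immediately, whereas a bare frequency would hide it.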

5. Issue #4: Making choices that are ideal from the perspective of a statistical technique but that are not ideal from the perspective of linguistic interpretability

Once we have our (reliable) counts, we often use statistical techniques of some sort to be able to generalize our findings beyond the corpus sample we used. During this step in the analysis, we sometimes find ourselves in a situation where we need to change our data or data structure to meet the assumptions of statistical techniques. An oft-discussed example is dichotomizing continuous data to be able to use a technique that is well known in a field, such as ANOVA. Another example is performing nonlinear transformations to meet assumptions of normality of residuals for multiple regression. However, making adjustments to our data in these ways may come at the cost of linguistic interpretability. Ideally, we would want to balance both statistics and linguistics, but if that is not possible, we do not want to sacrifice the linguistic interpretability of the results.

How can we balance the sometimes-conflicting pressures of (a) meeting assumptions of statistical techniques with (b) obtaining linguistically interpretable results?

[TL:] Observational language data tend not to be numeric. Our data tend to be nominal: If we are interested in looking at how often a certain adverb is placed clause-initially, we can count the instances and report a number, but any number will be based on language data. If we report three instances of clause-initial importantly, our data look like this under the hood: importantly, we . . . , importantly, the study . . . , importantly, they. . . . This means that any time we report frequencies, we are abstracting away from language. By extension, even a simple mean is a little problematic if you think about it, in that a mean of 5.3 instances of clause-initial importantly per text technically does not correspond to a real-life language phenomenon, because adverbs cannot be fractioned. Issues of linguistic interpretability often get exacerbated when we want to use more advanced statistical techniques.
While we should of course try our best to meet the assumptions of statistical tests, there are times when language data just do not behave like other kinds of data. For example, they tend not to be normally distributed. So, what do we do? As a rule of thumb, and not surprisingly, I would argue in favor of making sure our results are interpretable to ourselves and our audience. I might be a bit extreme in this respect, but I am not a big fan of nonlinear transformations from this perspective. If you have log-transformed your data because it enabled you to meet the assumptions of a given test, can you and your audience really interpret the results? What does it mean, for linear regression, to say something like ‘all else held constant, for every one-unit increase in months spent abroad, we see a log(1.3) increase in filled pauses’? You can of course help your audience interpret it by doing the math, but simply presenting the regression output will likely not prove particularly helpful.

Another option would be to use a technique whose assumptions your data can actually meet. There are many techniques out there – less advanced and more advanced – whose assumptions do not entice us to choose solid statistics over solid linguistics. For example, conditional inference trees and random forests are underused techniques in our field that we may want to look into more. These do not have assumptions of the sort we are used to from multiple regression. So, in short, I do not really think we should balance meeting assumptions of statistical tests against obtaining linguistically interpretable results: we should always aim for linguistically interpretable results, even if it means choosing a different method.
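For readers who do want to ‘do the math’ for their audience: with a log-transformed outcome, a regression slope can be back-transformed into a multiplicative change on the original scale, which is the standard interpretation of a log-linear model. The coefficient value below is hypothetical, chosen only for illustration:

```python
# Back-transforming a slope from a regression with a log-transformed
# outcome into something an audience can interpret directly.
import math

b = 0.26  # hypothetical slope for 'months abroad' on log(filled pauses)

# On the original scale, each one-unit increase in the predictor
# multiplies the expected outcome by exp(b).
multiplier = math.exp(b)
pct_change = (multiplier - 1) * 100
print(f"each extra month abroad: x{multiplier:.2f}, i.e. +{pct_change:.0f}%")
```

Even with the back-transformation done, the reader is still interpreting a multiplicative effect rather than raw counts, which illustrates the chapter’s point: the transformation buys statistical convenience at a real cost in directness.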

6. Summing up

What are some suggested ways forward to ensure linguistically interpretable results?

[TL:] There are several ways in which we could go – and have gone – astray in terms of the issues discussed in this chapter. However, I think increased awareness of the importance of linguistic interpretability is a good first step in the right direction. The choices we make in this regard, and the resulting conclusions we can draw from our findings, affect not only us but also consumers of research: other researchers, teachers, and the general public wishing to learn from our results. The aim should be to hold ourselves and others to as high a standard as possible, to make sure that all variables have clear operational definitions and that we use techniques that are directly interpretable relative to the linguistic research questions of interest. Concretely, in some cases, this might mean opting for a less complex analysis or technique than was originally chosen to make sure that the results are interpretable. As a field, we are making good progress, and I am excited to see what the future of the field will look like.

Reference list

Biber, D., Gray, B., Staples, S., & Egbert, J. (2020). Investigating grammatical complexity in L2 English writing research: Linguistic description versus predictive measurement. Journal of English for Academic Purposes, 46, 100869.
Brezina, V., Weill-Tessier, P., & McEnery, A. (2020). #LancsBox v. 5.x [Software]. http://corpora.lancs.ac.uk/lancsbox
Egbert, J., Larsson, T., & Biber, D. (2020). Doing linguistics with a corpus: Methodological considerations for the everyday user. Cambridge Elements in Corpus Linguistics. Cambridge University Press.
Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication [Doctoral dissertation]. http://scholarworks.gsu.edu/alesl_diss/35


Lu, X. (2017). Automated measurement of syntactic complexity in corpus-based L2 writing research and implications for writing assessment. Language Testing, 34(4), 493–511.

10.1 Call for transparent measures without oversimplification

Keisuke Koyama

In actual classroom contexts, transparent measures are key to assessment. For example, suppose students’ writing is assessed on the correct use of complex nominals per T-unit. Should a complex nominal used once in each of two T-units be scored as highly as a complex nominal used twice in one T-unit? If so, what is an intuitively reasonable explanation? And does the difficulty not depend on which nominal is used? Pre-modifiers, for example, are common in Japanese (Abekawa & Okumura, 2005), so for second language English learners in Japan, pre-modifiers are likely easier than post-modifiers. It would then make more sense for teachers to award higher scores for the use of post-modifiers. The problem here is that the definition of “complex nominals per T-unit” is too simplified and may cover too wide a range of situations in which different complex nominals are used. Worryingly, this problem arises not only for complex nominals per T-unit but also for other measures. If Student A uses more sophisticated, low-frequency phrases than Student B, should A receive higher scores? What is the objective threshold for low frequency? If the phrases are not rare in specialised corpora on the topic of the writing task, should we still consider them low frequency? In my teaching context, to describe the performance of each student compared to the population as a whole, I report the mean, standard deviation, and coefficient of variation (standard deviation/mean) for each question. However, what do these values actually mean? Consider a scenario in which one grammar question has a mean of 10 and a standard deviation of 1, while another grammar question has a mean of 2 and a standard deviation of 1. The latter has a coefficient of variation five times higher than the former, but what does this factor of five signify?
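The coefficient-of-variation comparison in this example can be reproduced in a few lines of Python (the score lists below are invented so that both questions have a standard deviation of 1, with means of 10 and 2, matching the worked example):

```python
# Sketch of the coefficient-of-variation (CV = SD / mean) example.
# The score lists are invented to match the worked example: both
# have a sample standard deviation of 1, with means of 10 and 2.
from statistics import mean, stdev

def coefficient_of_variation(scores):
    return stdev(scores) / mean(scores)

question_1 = [9, 9, 10, 11, 11]   # mean 10, SD 1
question_2 = [1, 1, 2, 3, 3]      # mean 2, SD 1

cv1 = coefficient_of_variation(question_1)  # 0.1
cv2 = coefficient_of_variation(question_2)  # 0.5, five times cv1
print(cv1, cv2)
```

The arithmetic is transparent, but, as argued here, the number 5 by itself says nothing about what linguistic knowledge the two questions actually tap.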
Does the lower coefficient of variation (or the higher mean) indicate that students have a better understanding of the former grammar compared to the latter? It may just be that the vocabulary used with the grammar in the former question was easier than the vocabulary used with the grammar in the other. Mean and standard deviation oversimplify the linguistic information provided by individual grammar questions. Additionally, how the grammar is presented matters. For instance, if the former question requires students to work on a fill-in-the-blank exercise, while the latter requires producing a complete sentence using the target grammar, how can we
interpret the difference in the required linguistic skills between these two grammar questions based on measures such as means and standard deviations? Without properly defining the linguistic knowledge required for each grammar question, it becomes exceedingly challenging to interpret the difference between them. Teachers, including myself, should always bear in mind that overly simplified measures may lose linguistically interpretable information.

Reference list

Abekawa, T., & Okumura, M. (2005, October). Corpus-based analysis of Japanese relative clause constructions. In International conference on natural language processing (pp. 46–57). Springer Berlin Heidelberg.

10.2 The reliability of using a parallel or a comparable corpus in translation training sessions

Ranwa Khorsheed

Larsson and Biber’s chapter draws our attention to the importance of testing and validating the measures used when assessing the reliability of parallel or comparable corpora as tools for assisting students on English/Arabic translation courses, especially when translating cultural or literary texts. This is important in my context of Syria, as most participants have never used any kind of corpus before. McEnery et al. (2006) define a parallel corpus as a compilation of source texts combined with their translations into one or more target languages. In addition, a corpus may contain both an A-to-B and a B-to-A translation, showing translation in two directions. Clear examples of such bidirectional corpora are those compiled from texts produced by world organisations and institutions such as the EU and the UN. McEnery and Xiao (2003), on the other hand, define a comparable corpus as a compilation of original L1 texts that each discuss the same topic, provided that the compilation follows certain “sampling techniques” and achieves a certain “balance and representativeness”. As for how parallelism is implemented, Vinay and Darbelnet (1958) affirmed the possibility of transferring a message, including its smallest elements, from a source language into a target language by using “parallel categories”: parallel linguistic structures as well as parallel concepts, thus fulfilling a “metalinguistic parallelism”. However, this procedure may rely on opaque methods of linguistic
measurement or analysis, meaning that our designated means of measurement might not match our expectations and would therefore fall short of fulfilling a lesson’s (assessment/teaching) aims. Additionally, while the source and translated texts in a parallel corpus are useful for exploring how the same content is expressed in two languages, they alone are a poor basis for cross-linguistic contrasts because translations (L2 texts) cannot avoid the effect of translationese (McEnery et al., 2006). An example is the English proverb “Don’t put all your eggs in one basket”. This proverb has no parallel in the Arabic language. It is most likely to be translated literally as "لا تضع البيض كله في سلة واحدة". Speakers of Arabic would probably not find any cultural or linguistic parallel structure for this proverb; they have to fully comprehend its meaning from the source text. Also, a corpus may be tagged or marked up for parts of speech and other main linguistic features, but it is difficult to tag a parallel corpus semantically for cultural or literary aspects of the source and target languages. This is because there are no dedicated tools or markup for, say, the gaps or lacunae in the target language (TL), which must be filled by corresponding elements so that the overall impression is the same for the two messages. Such a lacuna occurs when certain stylistic effects cannot be transposed into the target language without upsetting the syntactic order or even the lexis. Consequently, a translator resorts to various techniques to compensate for it. Some syntax-related techniques can be traced and marked with a corpus tagger, while others are difficult to mark or trace because of their nature. As there is no ready-annotated or marked-up corpus for modulation, and since the students are not experts in annotation, even if I were to manually mark or tag a corpus, it would be time consuming.
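To make the notion of a sentence-aligned parallel corpus concrete, here is a minimal Python sketch of what such a resource can look like in machine-readable form. The data structure, sentence pairs, and helper function are invented for this example; real parallel corpora use standard exchange formats such as TMX:

```python
# Minimal sketch of a sentence-aligned English/Arabic parallel corpus.
# The sentence pairs and helper function are invented for illustration.
parallel_corpus = [
    {"en": "Don't put all your eggs in one basket.",
     "ar": "لا تضع البيض كله في سلة واحدة."},
    {"en": "Knowledge is power.",
     "ar": "المعرفة قوة."},
]

def aligned_hits(corpus, query, src="en", tgt="ar"):
    """Return target-language sentences aligned with source-side hits."""
    return [pair[tgt] for pair in corpus
            if query.lower() in pair[src].lower()]

print(aligned_hits(parallel_corpus, "eggs"))
```

A bidirectional corpus of the kind described above would simply add pairs translated in the other direction and query them with src="ar", tgt="en"; what such a structure cannot easily encode is precisely the cultural or stylistic information discussed here.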
Moreover, corpora that need extensive manual annotation (semantic or pragmatic) are necessarily small because of the manual labour that annotation demands. In conclusion, using a parallel or comparable corpus in translation courses would bring several benefits as long as the corpora involved meet the required criteria of sampling, balance, and representativeness. If this can be done, it may better enable learners to focus on particular linguistic features that arise during the translation process.

Reference list

McEnery, T., & Xiao, R. (2003). The Lancaster Corpus of Mandarin Chinese. https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/default.htm
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. Routledge.
Vinay, J.-P., & Darbelnet, J. (1958/1995). Stylistique comparée du français et de l’anglais: Méthode de traduction. Didier. Translated and edited by J. C. Sager and M. J. Hamel in 1995 as Comparative stylistics of French and English: A methodology for translation. John Benjamins.

11 Why we need open science and open education to bridge the corpus research–practice gap

Elen Le Foll

What do you mean by open science and open education, and why do you think that these concepts are important in the context of using corpora for language learning?

Open science and open education practices are all about ensuring that scientific knowledge is both rigorous and transparent, as well as accessible, collaborative, and inclusive. In practical terms, it is about sharing not only the results of academic research and teaching interventions, but also research data, testing instruments, methods, and educational resources openly – for the benefit of all. In the context of language learning using corpora, we have seen time and time again that, although the concept of data-driven learning (DDL) has been around for nearly half a century now, there remains a wide gap between DDL research and what is actually happening in classrooms around the world. The number of academic publications on the potential of corpora for language learning appears to be ever-growing, but at the same time, uptake by teachers and learners remains very slow. There is no denying that there are many reasons for this, but I believe that one which has largely been ignored so far concerns the accessibility of our research. The good news is that, unlike some of the other reasons, this is one that we can easily act upon.

What can those of us involved in corpus linguistics and DDL research do to make our research more accessible?

DOI: 10.4324/9781003413301-11

The reality today is that most academic research is hidden behind paywalls. These paywalls represent the most obvious barrier to the accessibility of our research for student teachers (especially those studying in the Global South) and teaching
practitioners worldwide. If we are serious about wanting to share knowledge, we should be prioritising non-profit open-access publication outlets and, when that is not feasible, uploading pre- and/or post-prints of our work on open-science repositories.1 We should also consider exploring alternative forms of publications such as blog posts, podcasts, videos, and other online resources to communicate our research more effectively to non-academic audiences. At the same time, if we want our research to contribute to generating cumulative knowledge about corpus linguistics for language teaching and learning, we must ensure that our methods are fully transparent so that others can reproduce our results (e.g., to check the robustness of published results using slightly different statistical methods), as well as replicate them in different contexts (e.g., in different countries, with learners of different proficiency levels and/or with different L1s and L2s) without having to always “reinvent the wheel”.2 This entails sharing not just the results of our research but also the tools, methods, and materials used in any study. The infrastructure to do so already exists. In addition to general research repositories such as the Open Science Framework (https://osf.io/) and Zenodo (https://zenodo.org/), there are a number of great initiatives for sharing and archiving linguistics research–related resources, for example: the IRIS database (https://www.iris-database.org/) and the Tromsø Repository of Language and Linguistics (https://trolling.uit.no/). As a discipline, I believe that our research could be both more efficient and effective if we engaged more in open science and open education practices.

How can open education practices contribute to making corpus linguistics more accessible for language teaching and learning?

Open education is about teaching and learning with resources that are either in the public domain or licensed in such a way that everyone can engage in the so-called 5R activities. The five Rs are: retain, reuse, revise, remix, and redistribute. This means that open educational resources (OERs) are freely accessible to all and can be adapted by anyone, for example, by updating, expanding, or translating them. These new, modified versions can then also be shared with the community. When we teach corpus literacy to (future) language teachers, it is important that we think carefully about the sustainability of what we teach: Can the activities we teach realistically be used in the classroom? Will our students continue to have access to these corpus resources once the course/training is over? Can teachers readily adapt these corpus-based materials to their students’ needs? We are lucky in that there are more and more open corpora and freely available corpus resources out there (see König et al., 2021). But we need to make a conscious choice and effort to promote these in our teaching practice and, as researchers, contribute to them as much as we can.


Can you provide some examples of how corpus linguistics can be used to develop teaching materials that align with open education principles?

The earliest example that I’m aware of is Tim Johns’ collection of Kibbitzers, which were created in the late 1990s and are a remarkable example of an OER avant la lettre. These short DDL activities are still available as PDF worksheets on https://lexically.net/TimJohns/. Each worksheet focuses on a real-life language question, mostly taken from English for Academic Purposes (EAP) classes. The materials feature selected concordance lines and provide guidance as to how to interpret these to solve each language issue. As such, they are an example of paper-based materials for DDL (see Boulton, 2010). Since then, a number of projects have sought to integrate computer-based corpus literacy with language learning activities. Excellent recent initiatives that show how corpus linguistics can be used to develop teaching materials aligned with open education principles include the Corpus for Schools platform and its associated teaching materials for English L1 and L2 (Gablasova et al., 2018), BAWE Quicklinks for English for Academic Purposes (Vincent & Nesi, 2018), the Corpus-Aided Platform for Language Teachers (Ma, 2018), and the Incorporating Corpora platform for German L2 (Vyatkina, 2020). Together with some of my students, I also published an OER entitled “Creating Corpus-Informed Materials for the English as a Foreign Language Classroom: A Step-by-Step Guide for (Trainee) Teachers Using Online Resources”, available as an e-book on https://pressbooks.pub/elenlefoll/ (Le Foll, 2021). It not only provides ready-made corpus-based lesson plans, activities, and worksheets, but also explains to (pre-service) EFL teachers how to create their own corpus-informed materials. In total, there are 16 chapters divided into four sections providing lesson ideas for 1) primary and lower secondary school, 2) upper secondary school, 3) content and language integrated learning (CLIL) at secondary school, and 4) vocational education and English for Specific Purposes (ESP).
This resource is one of the few that provides concrete examples of how to use corpora for language teaching and learning in pre-tertiary education in open access. Making this resource available as an OER aligns with open education principles in that it allows anyone not only to access it but also to adapt and update it, hence also contributing to the sustainability of the project.

Can you share some insights from your project on creating an OER together with trainee teachers? How did open science and open education principles guide your approach?

From the very beginning of the project, I decided to adopt an “OER-enabled pedagogy” (Wiley & Hilton, 2018, p. 134). A key element of OER-enabled pedagogy is the concept of setting “renewable assignments” (Wiley, 2013). In contrast to
“disposable assignments” that primarily serve to assess whether students have met the course objectives and are frequently discarded once they have been marked because they have no value beyond assessment, “renewable assignments” not only support the learning of the individuals that produce them, but also result in new or improved OERs that can benefit a wider community of learners. Examples of OER-enabled pedagogy projects include the creation and revision of anthologies, textbooks, and online courses, and the creation and editing of Wikipedia articles, blogs, and instructional videos (for many inspiring examples, see, e.g., Mays, 2017; Wiley & Hilton, 2018). In this project (see Le Foll, in press), I decided that the renewable assignment would consist of writing a chapter for an (at first, hypothetical) OER textbook aimed at showing (trainee) teachers how to create their own corpus-informed teaching materials for the EFL classroom. The project took place at Osnabrück University (Germany), where I taught three iterations of a semester-long class on corpus linguistics for language teaching to M.Ed. students training to become EFL teachers for primary, secondary, and vocational education. Prior to starting the course, students had little to no previous knowledge of corpus linguistics. Over the course of just one semester, they learnt about corpora, online corpus tools, and the benefits of using corpora for language learning. They then applied this newly acquired knowledge to develop their own corpus-based TEFL materials. Although some students chose to write their chapters individually, they were encouraged to co-write their chapters in groups of 2–4. Collaboration was a core aspect of the course: students gave each other feedback on all stages of their materials and chapter development.
At the end of the semester, they wrote a term paper in the form of an OER textbook chapter, in which they presented their materials and how these could be integrated in a lesson plan, and explained to other (student) teachers exactly how they created these materials using online corpora and corpus tools (Le Foll, in press). I made it very clear that the eventual publication of their chapter in the OER was entirely optional and that only original, high-quality contributions would be considered for publication. In this project, OER-enabled pedagogy proved to be an effective framework for helping to bridge the corpus research–practice gap in pre-service teacher training. In terms of its limitations, it’s worth stressing that considerable time and energy were invested in the process of revising the chapters prior to publication. This happened during the semester breaks in both my own and the student authors’ free time. An unforeseen limitation in terms of resources was that the online publication platform that I had chosen (https://pressbooks.com) changed its business model during the project. After this change, it was no longer possible to add or modify chapters within Pressbooks without paying for an additional subscription. It’s difficult to empirically evaluate the success of such a resource, but from anecdotal feedback from colleagues, I know that the OER is being used
in teacher training at many higher education institutions on all five continents. Looking back, it seems remarkable that, as a result of a one-semester course, corpus novices became proficient users of corpora and DDL multipliers with international reach!

In your opinion, what are the greatest challenges to creating and using open education resources about using corpora for language learning?

Before I discuss the challenges, I would like to stress that creating, using, and adapting OERs is a very positive and rewarding activity! In fact, after I published this OER that I co-wrote with my students, a number of colleagues launched similar initiatives – some of which are now online (e.g., Goulart & Veloso, 2023; Pinto et al., 2023). So, although there are challenges, these projects and the ones that I mentioned earlier are all proof that they can be overcome! One challenge for hands-on corpus-based activities is certainly that there are still too few corpora that are suitable for language teaching and learning and that are freely available (or available only for a limited number of queries, after registration, etc.). This is particularly true for languages other than English. Some are available for free, but their online interfaces are not particularly user-friendly, undergo frequent changes that make it difficult to create sustainable teaching resources, and/or contain language that is not always suitable for younger learners. I was particularly keen to develop an OER on corpus-informed teaching materials together with trainee teachers because, in the process of working with corpora for the first time, my students came across challenges that I, as a corpus linguist, had not necessarily envisaged. The result is that, in their chapter contributions to the OER, the student authors describe, in simple terms and avoiding academic jargon, how they overcame these challenges. We hope that these explanations will prove useful to others – both (trainee) teachers, as well as academics who teach corpus literacy in pre- and in-service teacher training. As corpus linguists, we should also not underestimate the level of technical literacy required to use corpora for language teaching and learning.
Even though today’s student teachers and many practising teachers may consider themselves “digital natives”, this does not mean that they possess the digital and data literacy skills necessary to make use of corpus resources. Finally, another limitation resides in the well-documented difficulty of making OERs known to the wider teaching community (for a recent survey, see Marín et al., 2022). In this endeavour, the academic community on Twitter was of great help. My posts on the OER project were widely retweeted by the #CorpusLinguistics and #TEFL communities. Sadly, however, only a small proportion of potential users can be reached in this way. I also presented the project at academic conferences,
and it has been mentioned in several online seminars and courses. But, for all the positive resonance on these platforms, reaching non-academic audiences remains a challenge – even when applying open education principles.

What strategies can educators use to teach technical literacy skills to learners who may have limited experience with corpus linguistics and related technologies?

I personally think that it is important to start where learners are. To achieve this, educators can ask their students to solve some pertinent language issues and observe how they go about these tasks. Which resources are they aware of? Which ones do they use most frequently? In what order? How are they querying these resources? For a start, not all tasks are best solved with a corpus, so, for some problems, they may already be using more appropriate resources. Next, we must make learners aware of the strengths and limitations of the resources that they are using (in my experience, most often Google, machine translation and, increasingly, ChatGPT and other AI-based tools) and, when meaningful, introduce them to corpora and corpus tools as an additional set of resources. Here, I believe that it is crucial to place emphasis on using accessible corpus resources to solve concrete language issues that are of immediate interest to the learners. In terms of teaching students the technicalities of using corpus tools, I have found that, in addition to going through practical examples in class, short tutorial videos that students can watch and re-watch at their own pace are very effective. For example, Laurence Anthony, who develops the freeware corpus tool AntConc, has an excellent series of video tutorials (https://youtu.be/_GSlwIO5QZE). Nowadays, creating such short video tutorials is quick and easy and, for those of us involved in teaching, sharing our tutorials in the spirit of open education is only a matter of a few clicks. I must admit that when I uploaded the videos that I created for my students on how to use english-corpora.org and Sketch Engine (https://youtu.be/kDRpyJSE_6s), I did not envisage so many students and educators using them! Alternatively, renewable assignments in the spirit of OER-enabled pedagogy could revolve around students creating their own tutorials.
Once these have been checked for quality, they can also serve as OERs (provided, of course, that the students consent to sharing their work with the wider community). Another strategy that I rely on a lot when teaching the technical aspects of corpus literacy is the power of collaboration. Most of the tasks that I set my students are set up as group work. I believe that this is important because learning to use corpora for language learning and teaching involves a complex set of skills. Hence, groups tend to fare better than individuals. Ultimately, I think that we should view teaching corpus literacy as a way to develop broader, transferable “life skills” that include digital and data literacy, basic statistical understanding, and critical thinking (see Le Foll, accepted).


You discuss the “technical aspects of corpus literacy”. What are the other subcomponents of corpus literacy, and how can educators go about teaching those following open education principles?

Callies (2016, p. 395) identified four subcomponents of corpus literacy: 1) understanding basic concepts in corpus linguistics, 2) searching corpora and analysing corpus data by means of corpus software tools, 3) interpreting corpus data, and 4) using corpus output to generate teaching material and activities. For both educators and learners, it is very tempting to focus on the second, technical subcomponent of corpus literacy. However, without a sound understanding of the basic concepts of corpus linguistics and without guidance and practice in interpreting the results of corpus queries, the potential of corpora for language teaching and learning cannot be unlocked. That said, covering all four subcomponents of corpus literacy within more or less strict institutional constraints is no mean feat. Personally, I have found that OER-enabled pedagogy has been very beneficial in helping me to achieve this with my students. What is certain is that corpus tools and online corpus platforms will continue to change and evolve, so that, if we want our teaching to be sustainable, we must ensure that we are teaching the principles and applications of corpus linguistics rather than “where to click”.

How can open science and OERs promote more inclusive and equitable language teaching and research practices?

Using corpora, we can expose language learners to a much broader range of language varieties and registers than typically featured in textbooks and other commercially published teaching materials. For instance, there are more and more open corpora of varieties of spoken language, film and TV subtitles, and web registers that can help educators to explore situational language variation, something which is underrepresented in school textbooks and can create a genuine disconnect between curricular and extra-curricular language learning (Le Foll, 2022). We now also have access to some large open corpora featuring a range of language varieties that can help educators to broaden their teaching horizons beyond the traditionally taught normative varieties. That said, whilst it is now much easier to create corpus-based materials and DDL activities to explore Cuban Spanish, Nigerian English or Swiss German, the number of languages for which this is possible currently remains very limited. Open science and open education practices promote accessibility, transparency, and collaboration, making scientific knowledge and educational resources available to all. As corpus linguists, we believe that these practices can empower language teachers and learners. Paradoxically, however, I have found that the use of
corpora can provoke a lot of insecurities among (pre-service) language teachers. Many of my students expect and want to be able to deliver clear-cut answers to language questions, e.g., what is correct: the kind of things or the kinds of things?3 For some, working with corpora is the first time that they are directly confronted with language variation. The realisation that the correct vs. incorrect dichotomies on which they have so far relied do not hold in all contexts can be quite troubling. If we want teachers to adopt corpora in their teaching practice, we need to address this (see Le Foll, accepted). As I mentioned earlier, with often very little class time, it is tempting to focus on the technical aspects of corpora and corpus tools, but we must dedicate enough time to interpreting corpus results and discussing their implications for language teaching and learning. Corpus linguistics can also be used to analyse and address social justice issues, including language-related bias and discrimination. For example, one of my students contributed an OER chapter demonstrating how to create a subcorpus of the News on the Web (NOW) corpus to create corpus-based materials for the secondary level on the topic of Black Lives Matter (https://pressbooks.pub/elenlefoll/chapter/meise-reckefuss/).

How do you recommend that people get started with open science and open education?

I think that the process starts with a reflection on our personal motivation for doing research in this field and on the long-term impact of our research and teaching on language learning and teaching practices. To those keen to start implementing open science and open education practices, I would say: Start by talking with your colleagues, then exchanging with the wider community; start by sharing pre- and post-prints of your papers, then move on to materials, code, data, and ideas. Some people seem to think that you either do or don’t do open science and open education. I see them more as ideals to strive for, where the journey is more important than the actual destination. And in keeping with that metaphor, it’s worth remembering that even a journey of a thousand miles begins with a single step.

Notes

1 A very useful resource to find out more about the legal aspects of sharing pre- and post-prints of articles and monographs published with commercial publishers is: https://v2.sherpa.ac.uk/romeo/.
2 I am not the first to use this expression in this context; see Braun (2007, p. 309).
3 For the curious, a quick search on english-corpora.org/coca reveals that the string kind of things occurs 1,380 times in the Corpus of Contemporary American English (COCA), whilst kinds of things is found 4,034 times. The combination of kind of + plural noun occurs 16,075 times, whilst a search for kinds of + plural noun returns 26,006 occurrences.


Reference list

Boulton, A. (2010). Data-driven learning: Taking the computer out of the equation. Language Learning, 60(3), 534–572. https://doi.org/10.1111/j.1467-9922.2010.00566.x
Braun, S. (2007). Integrating corpus work into secondary education: From data-driven learning to needs-driven corpora. ReCALL, 19(3), 307–328. https://doi.org/10.1017/S0958344007000535
Callies, M. (2016). Towards corpus literacy in foreign language teacher education: Using corpora to examine the variability of reporting verbs in English. In R. Kreyer, S. Schaub & B. Güldenring (Eds.), Angewandte Linguistik in Schule und Hochschule (pp. 391–415). Peter Lang. https://doi.org/10.3726/978-3-653-05953-3
Gablasova, D., Brezina, V., McEnery, T., Meyerhoff, M., Reichelt, S., Arnold, T., & Cheung, C. (2018). Corpus for schools: Teaching English language with corpus linguistics. http://wp.lancs.ac.uk/corpusforschools/
Goulart, L., & Veloso, I. (Eds.). (2023). Corpora in English language teaching. https://pressbooks.pub/testbook123/
König, A., Frey, J.-C., & Stemle, E. W. (2021). Exploring reusability and reproducibility for a research infrastructure for L1 and L2 learner corpora. Information, 12(5), 199. https://doi.org/10.3390/info12050199
Le Foll, E. (2021). Creating corpus-informed materials for the English as a foreign language classroom: A step-by-step guide for (trainee) teachers using online resources (3rd ed.). https://pressbooks.pub/elenlefoll
Le Foll, E. (2022). Textbook English: A corpus-based analysis of the language of EFL textbooks used in secondary schools in France, Germany and Spain [PhD thesis, Osnabrück University]. https://doi.org/10.48693/278
Le Foll, E. (accepted). “To me, authenticity means credibility and correctness”: Encouraging pre-service teachers to re-evaluate their understanding of ‘authentic English’ using data-driven learning. In C. Blume (Ed.), Multiliteracies-aligned teaching and learning in digitally-mediated second language teacher education. Routledge.
Le Foll, E. (in press). “Opening up” corpus linguistics: An open education approach to developing corpus literacy among pre-service language teachers. Journal of Second Language Teacher Education, 2(2).
Ma, Q. (2018). The corpus-aided platform for language teachers (CAP). https://corpus.eduhk.hk/cap/
Marín, V. I., Zawacki-Richter, O., Aydin, C. H., Bedenlier, S., Bond, M., Bozkurt, A., Conrad, D., Jung, I., Kondakci, Y., Prinsloo, P., Roberts, J., Veletsianos, G., Xiao, J., & Zhang, J. (2022). Faculty perceptions, awareness and use of open educational resources for teaching and learning in higher education: A cross-comparative analysis. Research and Practice in Technology Enhanced Learning, 17(1), 11. https://doi.org/10.1186/s41039-022-00185-z
Mays, E. (Ed.). (2017). A guide to making open textbooks with students. The Rebus Community for Open Textbook Creation. https://press.rebus.community/makingopentextbookswithstudents/
Pinto, P. T., Crosthwaite, P., Tavares de Carvalho, C., Spinelli, F., Serpa, T., Garcia, W., & Orenha Ottaiano, A. (2023). Using language data to learn about language: A teachers’ guide to classroom corpus use. The University of Queensland. https://doi.org/10.14264/3bbe92d


Vincent, B., & Nesi, H. (2018). The BAWE quicklinks project: A new DDL resource for university students. Lidil. Revue de Linguistique et de Didactique des Langues, 58, Article 58. https://doi.org/10.4000/lidil.5306
Vyatkina, N. (Ed.). (2020). Incorporating corpora: Using corpora to teach German to English-speaking learners [Online instructional materials]. https://corpora.ku.edu
Wiley, D. (2013). What is open pedagogy? [Open content]. Improving Learning. https://opencontent.org/blog/archives/2975
Wiley, D., & Hilton, J. L. (2018). Defining OER-enabled pedagogy. The International Review of Research in Open and Distributed Learning, 19(4). https://doi.org/10.19173/irrodl.v19i4.3601

11.1 The importance of open science journals in research

Chyara Alvarado López

We have opportunities to address the lack of open-access resources in the future; it is crucial to advocate not only for training and workshops on the subject but also for understanding the importance of licenses when accessing materials. We need to emphasize that the main advantages of these resources are transparency and the continued potential for collaboration in the development of other resources that benefit academic learning and research.

New technologies have promoted the use of digital resources in educational programs. In the realm of research, this offers a vast opportunity owing to the ease of accessing information. From an academic standpoint, this affects research at higher levels because, in many instances, fees are required to access certain materials. This dynamic could shift if open-access platforms were developed, allowing individuals to consult and utilize resources for research and knowledge sharing.

There are consistent opinions regarding access to resources for research, especially in distance learning modalities. Rushby (2013) states that open educational resources are teaching, learning, or research materials that can be utilized either under the guidance of an instructor or in a self-directed manner. In essence, open access to resources provides materials from around the world, enabling the recognition of various paradigms on a single topic and facilitating feedback. This has enabled developing countries, such as Mexico, to make significant strides at the scientific and academic levels. For higher-level students, this can serve as a motivational tool, sparking their interest in the scientific domain. Encouraging participation in open-resource workshops allows the academic community to recognize policies related to information access, including the publication of scientific resources (Galina & Giménez, 2008).
Through open scientific resources, educational challenges have been addressed, inspiring students from the moment they share relevant materials to when they



excel in research, writing, and collaborative article creation. Le Foll emphasizes that the prime advantages of these resources are transparency and the ongoing potential for collaboration in developing other resources that benefit academic learning and research.

To address the lack of open-access resources, it is crucial to advocate not only for training and workshops on the subject but also for understanding the significance of licenses when accessing materials. This will further evolve the landscape of knowledge sharing. A debate has arisen which lauds open access but also critiques the methods by which knowledge is disseminated, as well as the motivations behind knowledge creation. A pertinent question to consider is: if MOOC courses are available, why not extend the same approach to bibliographic and reference resources?

Reference list

Galina, I., & Giménez, J. (2008). An overview of the development of open access journals and repositories in Mexico. www.ucl.ac.uk/~uczciga/texts/OAMexicoGalina_Gimenez.pdf
Rushby, N. J. (2013). Data sharing. British Journal of Educational Technology, 44(5), 675–676.

11.2 Plain language summaries

A contribution of simple language to improve health communication and set standards for open science

Heloísa Orsi Koch Delgado

This chapter covers Plain Language Summaries (PLS) as a contribution to refining health communication and as a helpful reference point for open science (OS). It is based on Le Foll’s (2022) research on the replication crisis in science, caused, among other factors, by low-quality published studies and costly processing charges. It also describes PLS’s contribution to making research more tangible to the scientific community, since many high-standard OS journals offer summaries written in simple language. Some guidelines for using this type of language, together with excerpts contrasting less objective and more concise writings, are also provided.

According to Le Foll’s chapter, concerns about a replication crisis across scientific fields have stimulated a movement for OS, encouraging and requiring


researchers to publish their raw data and analysis code. The problem has been caused primarily by the lack of access to information on previous studies, data manipulation, poor research practices, and flawed statistical analysis. Le Foll adds that other reasons include the inaccessibility of research due to expensive article processing charges and reports locked behind paywalls.

My contribution relates OS to PLS for health research. PLS is a science communication tool that allows researchers to reach a wider audience by using easy-to-understand language for people outside a specific scientific circle. Simply put, a PLS should explain, above all, why and how the research was done and what the key findings were. Studies show that such summaries have become valuable tools for making research more accessible, contributing to research transparency. OS journal databases like Elsevier and the Cochrane Library now provide PLS on their home pages. This initiative helps bridge the gap between science communication and its stakeholders, conferring numerous benefits on society, such as effective doctor–patient communication, research improvement, and wiser investment.

As for PLS research, Maurer et al. (2021) developed and tested a standard summary template with patients to verify their understanding of the information provided. The corpus comprised 272 PLSs covering cardiovascular disease, cancer, and mental health topics. Regarding lessons learned, the authors mention consistency across summaries and balance between plain language and content precision. Readability-measuring software scored the summaries at the equivalent of school grade 8, meaning they were reasonably accessible for the participants to read.

Writing in plain language involves:

• Understanding the audience.
• Avoiding technical terms (if needed, explaining them).
• Keeping ideas succinct (using short sentences and breaking long ones into two or three).
• Focusing on a single idea in each paragraph.
• Using direct sentence order and the active voice.
• Testing the content with laypeople.

As an example of PLS use, summary excerpts from the Cochrane Library site are shown next, written in scientific language (SL) and plain language (PL).

(SL) We identified six eligible RCTs1 published in English . . . , which together included 1702 participants. . . . Most participants . . . where the type of dementia was reported had a diagnosis of Alzheimer’s disease (AD; n = 1002, 58.9% of the whole sample, 81.2% of the participants for whom the specific diagnosis was reported). (Kudlicka et al., 2023).


(PL) “We found six studies. They involved 1702 people. . . . Alzheimer’s disease was the most common dementia diagnosis (59% of all participants, 82% of participants with the specific diagnosis reported)”.

The strategies used were more familiar words, direct sentence order, shorter sentences, and the elimination of unnecessary words.

To conclude, studies combining PL(S) and health communication can benefit all levels of scientific activity and narrow the science–practice gap in OS. While the former assists in disseminating knowledge to citizens of all educational stages, the latter promotes the accessibility of scientific research, which can be developed through collaborative networks. The OS movement has achieved recognition in the scientific community, indicating a favorable disposition to continue to promote accessibility, research transparency, and the reliability and reproducibility of results in the years to come, minimizing the effects of the crisis.

Note

1 Randomized Controlled Trials.

Reference list

Kudlicka, A., Martyr, A., Bahar-Fuchs, A., Sabates, J., Woods, B., & Clare, L. (2023). Cognitive rehabilitation for people with mild to moderate dementia. Cochrane Database of Systematic Reviews, 6, Article CD013388. Retrieved July 6, 2023, from https://doi.org/10.1002/14651858.CD013388.pub2
Le Foll, E. (2022). Why we need open science and open education to bridge the corpus research–practice gap. International Perspectives on Corpus Technology for Language Learning, The University of Queensland. Seminar session.
Maurer, M., Siegel, J. E., Firminger, K. B., Lowers, J., Dutta, T., & Chang, J. S. (2021). Lessons learned from developing plain language summaries of research studies. HLRP: Health Literacy Research and Practice, 5(2), e155. Retrieved July 6, 2023, from https://doi.org/10.3928/24748307-20210524-01

11.3 Facilitating student engagement in replication studies through open science and open research

Vicki Jade Stevenson

As both an MA student and EAP practitioner, I am inspired by Le Foll’s advocacy of open science and open education as instruments for bridging the corpus


research–practice gap, demonstrating their potential in facing this and other areas of contemporary debate, such as accessibility and equality. The dialogue surrounding open science as a solution to the replication crisis not only questions the currency of scientific and academic procedures but also illustrates the influential role of replication studies in these discussions.

This potential encouraged me to conduct a replication study as part of my MA assessment. However, I had to make some adjustments to the original study because the corpora mentioned in the paper were not accessible. This situation raised concerns about data accessibility, a topic that was discussed in the seminar. Nonetheless, I was able to find a substantial amount of learner corpora freely available online, and the process outlined in the original research paper was accessible and provided satisfactory results. This experience was purposeful and authentic, enabling me to learn more about corpus analysis and tools than in other corpus linguistics courses I had participated in.

The benefits of this type of research can empower students, allowing a hands-on approach as an introduction to the field. As Tschichold (2023) suggests, this can be more valuable for a student’s first research project than looking for gaps in research. It can be particularly useful for EAP students transitioning into academic studies in their chosen country, who may lack the necessary tools to conduct research during the early stages of academic study. As an EAP practitioner, I acknowledge the value of students undertaking replication studies early in their academic trajectory, as it familiarises them with research processes and the identification of research trends. It can also enhance critical thinking skills by enabling students to verify the findings of previous research and identify any potential limitations. There is also an argument for including these practices in undergraduate programmes.
In their research, Jekel et al. (2020) found that the use of replication studies in undergraduate programmes was both meaningful and useful for students and instructors, allowing students to contribute to the area of science and instructors to fulfil both academic and teaching obligations more effectively. One issue, however, is that Le Foll suggests open science can afford greater accessibility to individuals, yet most of those in question are already in a privileged position where a large amount of research is readily available to them. This prompts the question of whether open science can truly close the corpus research– practice gap or predominantly benefits the already privileged. Nevertheless, open education may potentially extend the scope of corpus research and DDL into the reach of those beyond academic institutions. It is thus encouraging to observe the emergence of open resources on globally accessible platforms such as YouTube. Similarly, many researchers (e.g., Le Foll, 2021) produce materials which are usable across various educational levels, yet, for me, the critical challenge lies in engaging an audience willing to incorporate these resources into their practices.


References

Jekel, M., Fiedler, S., Allstadt Torras, R. C., Mischkowski, D., Dorrough, A. R., & Glöckner, A. (2020). How to teach open science principles in the undergraduate curriculum – the Hagen cumulative science project. Psychology Learning & Teaching, 19, 106–191.
Le Foll, E. (2021, January 15). Creating corpus-informed materials for the English as a foreign language classroom. Pressbooks. https://pressbooks.pub/elenlefoll/
Tschichold, C. (2023). Replication in CALL. ReCALL, 35(2), 139–142. https://doi.org/10.1017/S0958344023000083

12 CORPORA IN L2 VOCABULARY ASSESSMENT

Agnieszka Leńko-Szymańska

DOI: 10.4324/9781003413301-12

Introduction

Much like the versatile applications of language corpora in the fields of second language acquisition and pedagogy, the significance of large linguistic databanks in the context of assessing second language (L2) proficiency is multifaceted. These applications encompass a wide array of domains, ranging from empirical research into language use in order to gain deeper insights into the nature of language proficiency, to the development of more robust and valid language assessment instruments. Furthermore, language instructors employ corpus-derived information to probe into their students’ linguistic competence, while students themselves leverage these resources for self-assessment, progress tracking, or test preparation. The aim of this chapter is to provide a comprehensive overview of these diverse applications specifically in the assessment of L2 vocabulary.

Lexical competence and lexical proficiency

Before delving into the applications of corpora in L2 vocabulary assessment,1 it is essential to establish a solid foundation by clarifying some fundamental concepts within this field – in particular, the place of lexis in language testing. To effectively evaluate the language proficiency of L2 learners, it is important to gain a comprehensive understanding of its nature and establish clear definitions. Traditionally, language proficiency was commonly equated with language knowledge, predominantly encompassing aspects of grammar and vocabulary. Additionally, less emphasized elements like pronunciation and cultural awareness were also considered. More recently, the concept of communicative competence was introduced, and was further refined as comprising grammatical, sociolinguistic,


and strategic competencies (Canale & Swain, 1980). Building on this foundation, Bachman (1990) introduced the notion of communicative language ability, which incorporated strategic competence and language competence, with the latter encompassing grammatical, textual, illocutionary, and sociolinguistic competences. In Bachman’s framework, language competence represented declarative knowledge (referring to knowledge about language), while strategic competence denoted procedural knowledge (or knowledge of how to use language effectively in practical communication). Despite vocabulary being included in modern models of communicative competence within the context of grammatical competence, its role in these models has been somewhat marginalized. However, L2 teachers and students continue to recognize vocabulary as a pivotal element of language acquisition. Similarly, several second language acquisition (SLA) researchers have also emphasized the centrality of lexis in language learning and usage. This perspective has been further explored by Leńko-Szymańska (2020), resulting in a model of lexical proficiency where lexical competence forms the foundation for all other language competences and interacts with fundamental language processes such as listening, reading, writing, and speaking, as well as general cognitive strategies. This theoretical model finds support in numerous studies that have demonstrated the correlation between lexical command and various language skills, including reading and listening comprehension, writing ability, and oral proficiency. Furthermore, it has been established that lexical proficiency is closely tied to an individual’s overall proficiency level (see Leńko-Szymańska, 2020 for an overview of these studies). 
This belief in the pivotal role of vocabulary is also reflected in common practices, where language educators and SLA researchers frequently employ vocabulary tests such as the Vocabulary Size Test (Nation & Beglar, 2007), Dialang2 (Alderson, 2005), or the Lexical Test for Advanced Learners of English – LexTale3 (Lemhöfer & Broersma, 2012) as placement tools for language programs or selection instruments for study participants.

Types and uses of vocabulary tests

Vocabulary assessment can be categorized along two distinct dimensions. The first dimension pertains to the aspect of language proficiency targeted by an evaluation instrument. A test may assess either an L2 learner’s knowledge of vocabulary or their use of vocabulary in communication. Vocabulary knowledge can be further divided into receptive and productive knowledge. Receptive knowledge involves a learner’s ability to comprehend the meaning of a word when presented with its form, while productive knowledge involves the capacity to produce the correct form of a word in response to its meaning. These aspects of vocabulary are typically evaluated using discrete-item tasks, such as multiple-choice, matching, gap-filling, or translation tasks. These tasks can involve presenting target words either in isolation or within the context of sentences or short texts. On the other hand, vocabulary assessment within this dimension can encompass the ability to comprehend words


in listening or reading tasks, as well as the ability to produce words in writing or speaking tasks. In these cases, vocabulary is not assessed directly but is integrated into broader constructs, such as the comprehension of spoken discourse or the production of extended written text.

The second dimension in the classification of lexical tests revolves around the scope of items covered in the assessment. A test can focus on specific lexical items or aim to evaluate the entire lexicon. In the former scenario, the assessment exclusively targets the knowledge or use of particular words, with no inferences made regarding other lexical items not included in the test. Conversely, in the latter case, the tested items are carefully selected to represent larger categories of words, and the test results are extrapolated to encompass these broader word categories. This two-dimensional taxonomy is represented in Figure 12.1.

Vocabulary tests are an integral part of the process of language instruction and learning. They are developed and administered by L2 instructors to monitor their students’ progress and achievement in acquiring newly introduced lexical items, whether from classroom instruction or pedagogical materials. Additionally, these tests serve as valuable diagnostic tools for language instructors to pinpoint specific words requiring further instruction or revision. In high-stakes standardized certification exams, vocabulary assessment is also conducted, but it typically occurs indirectly, embedded in tasks that evaluate broader competencies and skills. Conversely, in the field of research on SLA, instruments designed to gauge L2 learners’ lexical proficiency enjoy significant popularity. These vocabulary tests are employed to measure the progression of lexical proficiency over time or following the implementation of specific teaching methods. Corpora and corpus-based tools can facilitate the development and scoring of diverse types of lexical tests.
The following section will delve into the various ways in which language databanks contribute to the field of L2 lexical assessment.

FIGURE 12.1 Types of vocabulary tests, cross-classified by aspect (knowledge vs. ability) and scope (entire lexicon vs. specific items) (Leńko-Szymańska, 2020)


Corpora and vocabulary knowledge tests

As depicted in Figure 12.1, one category of lexical tests is designed to assess the knowledge of specific vocabulary items. These items can encompass a variety of words introduced in the classroom, although instructors may seek a more systematic approach when selecting vocabulary for assessment. In this regard, numerous readily available pedagogical word lists in various languages serve as valuable sources for test content. Many of these word lists have been compiled using rigorous procedures, typically involving the utilization of carefully designed corpora and the application of at least two specific selection criteria: a word’s frequency within a corpus, and its presence in a range of corpus texts.

Among the most popular lists of this kind is the Academic Word List4 (Coxhead, 2000), comprising 570 word families frequently found in diverse academic domains. A comprehensive report outlines the content of the academic text corpus used for compiling this inventory, as well as the precise criteria employed in selecting the words for inclusion in the list. More recently, Oxford Learner’s Dictionaries Online have made available their lists of 3000 and 5000 core vocabulary items (Oxford 3000 and Oxford 5000), along with an inventory of 650 common phrases (Oxford Phrase List) and the Oxford Phrasal Academic Lexicon, containing 2500 items.5 These lists were crafted through consultation with several corpora, including large reference databanks such as the Oxford English Corpus and the Oxford Corpus of Academic English, in addition to a specially curated corpus of Secondary and Adult English courses published by Oxford University Press.

Another valuable vocabulary resource with immense potential for L2 lexical assessment is the English Vocabulary Profile6 (Capel, 2010, 2012), derived from the analysis of the Cambridge Learner Corpus. This database contains over 7000 items, which have been attested to be used by L2 learners at different proficiency levels.
What sets this inventory apart is the assignment of different Common European Framework of Reference (CEFR) proficiency levels, ranging from A1 to C2 (Council of Europe, 2001), not to the lexical items themselves but to their distinct meanings. For instance, the word “degree” is classified as an A2 word when referring to temperature, a B1 word with the meaning of qualification, a B2 word when denoting an amount, and a C1 word within the context of completing an academic program. Furthermore, each meaning is linked to specific topics, such as describing objects or education. The search engine allows users to generate customized lists, filtered by part of speech, proficiency level, and topic. For example, a teacher can generate a list of adjectives classified at the C1 level and linked to the topic of describing people’s personalities (see Figure 12.2). Such a search yielded a list of 52 relevant items.

Finally, a test developer can compile their own inventories from which they can further select lexical items for inclusion in a test. This approach becomes particularly valuable in the realm of Language for Special Purposes (LSP), where readily available inventories are often scarce. Corpora serve as a highly effective tool for

FIGURE 12.2 A user-defined vocabulary list generated by English Vocabulary Profile

this purpose, facilitating the swift and systematic compilation of terminological databases. A test developer can compile a corpus of texts that students have encountered during a course or a program. Alternatively, they can construct a corpus by collecting relevant texts from the internet through web-crawling software. Corpus management platforms like Sketch Engine (Kilgarriff et al., 2014) enhance this process by enabling the annotation of self-compiled corpora with part-of-speech tags. Consequently, a list of frequently occurring nouns, verbs, adjectives, adverbs, or lexical bundles derived from such a database serves as a valuable source of items to be targeted in a language test. Keyword analysis also stands out as an excellent tool for automatically extracting terminology from a corpus. Additionally, Sketch Engine offers a feature for extracting key multiword terms, exemplified by Figure 12.3, which showcases a multi-item term list automatically generated from a self-compiled corpus related to urology.

In addition to evaluating an L2 learner’s knowledge of individual words, lexical tests also serve the purpose of estimating the overall size of a learner’s lexicon. In terms of their format, these assessment instruments do not differ significantly from the achievement or diagnostic tests described in the preceding paragraph. They can encompass various task types such as multiple-choice, matching, gap-filling, or translation. Another format, not commonly employed for specific-word tests but frequently utilized for estimating lexicon size, involves yes/no tasks. In these tasks, test-takers simply indicate the words they believe they know. To mitigate overoptimistic responses, a portion of the test items are pseudowords,

FIGURE 12.3 A list of 50 multi-item terms generated from a custom-built urology corpus

and specialized algorithms are employed to adjust the final scores, considering L2 learners’ response styles.

The development of such instruments relies heavily on corpus-generated frequency lists. These tests operate on the assumption that the order in which L2 learners acquire lexical items loosely follows the words’ frequency ranks. Consequently, these lists are divided into bands, typically containing 1000 or 500 words each. Representative samples of words from each band are then chosen for inclusion in the test. For test creators, the specific words targeted in the test are of secondary importance, as long as they belong to the designated frequency bands. The final estimate is extrapolated from the scores obtained for each band. For instance, if a learner demonstrates familiarity with 70% of the test items from a particular band, they are expected to be acquainted with 70% of all the words within that band. These values are then summed across consecutive frequency bands.

Examples of tests designed to estimate an L2 learner’s lexicon size include the Eurocentres Vocabulary Size Test (Meara & Jones, 1988), the Vocabulary Levels Test (both productive and receptive versions) (Nation, 1990), X_Lex, Y_Lex (Meara & Milton, 2003), the Vocabulary Size Test (Nation & Beglar, 2007), and the Computer Adaptive Test of Size and Strength (Aviad-Levitzky et al., 2019). These tests are readily accessible online through Compleat Lexical Tutor,7 a website offering an array of resources and tools for corpus-based vocabulary assessment and instruction. One noteworthy tool available on this site, particularly relevant to vocabulary size assessment, is the Random Item Generator (Cobb, n.d.). This tool randomly selects a desired number of items from specified frequency bands and supplements them with corresponding pseudowords, enabling the creation of multiple test versions and the repetition of the assessment procedure at a later time.
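The band-based extrapolation described above can be sketched in a few lines of code. The following is a minimal illustration only, not the scoring algorithm of any particular published test: the band size, the sample scores, and the simple hits-minus-false-alarms correction for yes/no tasks are all assumptions made for the example.

```python
# Illustrative sketch of vocabulary size estimation from frequency-band samples.
# Band size and scores are invented; real tests (and their pseudoword
# corrections) use their own, more sophisticated procedures.

def estimate_vocab_size(band_scores, band_size=1000):
    """band_scores: proportion of sampled words known in each frequency band.
    A learner knowing 70% of a band's sample is assumed to know 70% of
    all words in that band; band estimates are then summed."""
    return sum(p * band_size for p in band_scores)

def corrected_yes_rate(hits, real_items, false_alarms, pseudo_items):
    """Toy correction for overoptimistic yes/no responses: subtract the
    false-alarm rate observed on pseudowords from the hit rate."""
    h = hits / real_items
    f = false_alarms / pseudo_items
    return max(0.0, h - f)

# A learner knows 90% of band 1, 70% of band 2, 40% of band 3:
print(estimate_vocab_size([0.9, 0.7, 0.4]))  # 2000.0

# Yes/no task: 35 "yes" on 40 real words, 4 "yes" on 20 pseudowords:
print(round(corrected_yes_rate(35, 40, 4, 20), 3))  # 0.675
```

The correction shown is only one of several possible adjustments; the tests cited above each define their own.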


As mentioned earlier, while assessing word knowledge can be conducted without embedding words in context, it is common for target items to be contextualized within language tests. In this regard, corpora prove highly beneficial by providing a multitude of context sentences or longer text passages for test developers to choose from when creating contextualized vocabulary tests. Corpus citations offer the advantage of being authentic and abundant, making them particularly valuable for crafting specialized vocabulary tests. Furthermore, corpus tools often offer an additional feature for automatically selecting the most suitable context sentences for use in dictionaries, pedagogical materials, and assessment instruments. One such tool, known as GDEX (Good Dictionary Examples) (Kilgarriff et al., 2008), is part of Sketch Engine. GDEX evaluates citations of a search phrase based on criteria such as sentence length, use of complex (less frequent) vocabulary, presence of controversial topics, provision of sufficient context, and references pointing beyond the sentence (e.g., pronouns or demonstratives), among other factors. The tool retrieves the user-defined number of sentences and ranks them according to their scores, reflecting how well they align with the GDEX criteria. While some post-editing may be necessary before incorporating these sentences into a lexical test, the use of GDEX significantly enhances the efficiency of test creation.

Corpora and vocabulary use metrics

In widely recognized proficiency exams like Cambridge English Qualifications (CEQ),8 IELTS9 or TOEFL,10 the assessment of vocabulary is embedded in the evaluation of an L2 learner’s language skills. The lexical complexity of reading and listening passages plays a significant role in determining their suitability for inclusion in these exams. Additionally, reading and listening comprehension tasks often gauge the test-taker’s understanding of specific words in context. When it comes to evaluating productive skills, such as writing and speaking, the use of vocabulary is among the criteria considered by raters during the grading process. For instance, in the CEQ B2 English exam, the writing assessment rubric includes the following descriptor related to vocabulary use: “Uses a range of everyday vocabulary appropriately, with occasional inappropriate use of less common lexis” (Cambridge Assessment, 2021, p. 35).

Up to this point, corpora have not been extensively utilized to facilitate vocabulary assessment in proficiency exams. Yet, efforts are underway to make the assessment of the lexical complexity of a text more rigorous and provide clearer definitions for potentially ambiguous terms like “everyday vocabulary,” “less common lexis,” or “appropriate/inappropriate use.” Currently, the interpretation of these terms in specific contexts relies on the intuition of raters. One notable endeavor in this direction was the Criterial Features project (Hawkins & Filipović, 2012), which aimed to identify distinct grammatical and lexical properties of English characteristic of individual CEFR proficiency levels. This study was grounded in the analysis of the 45-million-word Cambridge Learner Corpus (CLC), which comprises exam scripts


Agnieszka Leńko-Szymańska

written by learners of varying proficiency levels worldwide. While the project did not provide an exhaustive list of lexical properties corresponding to each CEFR level, which could offer more precise guidance for vocabulary-related descriptors, it did lead to the creation of the English Vocabulary Profile database, as described earlier.

While corpora have not yet been fully integrated into the domain of lexical assessment in established and widely used proficiency tests that serve practical purposes like academic program admissions or job recruitment, they have been extensively employed to support research on L2 vocabulary acquisition and use. In research settings, the evaluation of L2 vocabulary use relies on the application of two types of automatically computed measures: lexical sophistication and lexical diversity. Lexical sophistication is a measure that assesses the degree to which an (L2) text incorporates advanced vocabulary. Similar to language tests that evaluate the breadth of an L2 learner's vocabulary knowledge, advanced vocabulary in lexical sophistication metrics is defined as lexical items and phrases that occur less frequently in a reference corpus. In the literature, two categories of lexical sophistication metrics have been proposed. The list-based indices calculate the proportions of a text's vocabulary items falling into different frequency bands or other user-defined inventories, often derived from corpus-based information. For instance, Laufer and Nation (1995) introduced the Lexical Frequency Profile, which characterizes a text's lexical content by four complementary percentages representing items from the first and second 1000 most frequent words in English, an inventory of academic vocabulary (a precursor to the Academic Word List), and items not found in the previous three lists. This last category encompasses infrequent words, as well as lexical errors such as spelling mistakes and newly coined items.
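The logic of a list-based index such as the Lexical Frequency Profile can be illustrated with a short sketch. The band lists below are tiny invented stand-ins for the real 1,000-word frequency bands and the academic inventory, so the proportions are purely illustrative:

```python
# Toy Lexical Frequency Profile (after Laufer & Nation, 1995).
# The band lists below are tiny invented stand-ins, NOT the real
# 1,000-word frequency bands or the academic inventory.
FIRST_1000 = {"the", "a", "of", "in", "is", "are", "for", "safety", "car"}
SECOND_1000 = {"feature", "protection", "event"}
ACADEMIC = {"vehicle", "occupant", "analyse"}

def lexical_frequency_profile(tokens):
    """Return the proportion of tokens falling into each band."""
    counts = {"first_1000": 0, "second_1000": 0, "academic": 0, "off_list": 0}
    for token in tokens:
        word = token.lower()
        if word in FIRST_1000:
            counts["first_1000"] += 1
        elif word in SECOND_1000:
            counts["second_1000"] += 1
        elif word in ACADEMIC:
            counts["academic"] += 1
        else:
            # Infrequent words, spelling errors, and coinages all land here.
            counts["off_list"] += 1
    total = len(tokens)
    return {band: n / total for band, n in counts.items()}

text = "Airbags are a safety feature for the protection of vehicle occupants"
profile = lexical_frequency_profile(text.split())
```

A full profiler such as VocabProfiler applies the same membership test with complete, lemmatized lists; note that the naive matching above sends "occupants" off-list even though "occupant" is listed, which is why real tools work with word families or lemmas.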
Some studies have employed more complex lexical profiles based on a broader range of frequency bands. Although these intricate profiles can be more challenging to interpret, they offer a more detailed insight into an L2 learner's vocabulary use and effectively distinguish infrequent words from errors. In recent studies (e.g., Leńko-Szymańska, 2015; Owen et al., 2021), lexical profiles have included proportions of words and phrases assigned to different CEFR levels, as discussed earlier. Several readily available tools, including Range (Nation, n.d.), AntWordProfiler (Anthony, n.d.), and VocabProfiler (Cobb, n.d.), streamline the automatic computation of such lexical profiles for a batch of texts, making the process quick and convenient. These tools offer various frequency lists derived from corpora, and the first two programs also allow users to incorporate custom lists, enabling the generation of user-defined reports on the lexical content of analyzed texts.

Another valuable tool, Text Inspector11 (Owen et al., 2021), generates multiple frequency-based descriptions for individual texts or text batches. Additionally, it produces profiles based on information extracted from the English Vocabulary Profile (EVP). This program assigns a CEFR level to each word in a text. However, as the EVP database associates levels with different word meanings rather than specific lexical forms, Text Inspector always selects the lowest level for each word. Users have the

Corpora in L2 vocabulary assessment

FIGURE 12.4 Manual assignment of CEFR levels to the words in a text in Text Inspector

flexibility to review and manually adjust the assigned levels for every word in a text by selecting the relevant meaning from a drop-down menu, thereby updating the profile (see Figure 12.4). It is important to note that, to date, Natural Language Processing methods for sense disambiguation have not been implemented in the program to fully automate the assignment of levels to individual words and the generation of CEFR-based profiles.

Mean-based lexical sophistication measures are represented by a single value, generated by matching each word in a text with its frequency in a reference corpus, followed by calculating the mean frequency for all the words (Crossley et al., 2013). Lower values indicate a higher level of lexical sophistication in the text. This method of describing vocabulary use is considered more precise, as it considers actual frequencies rather than frequency bands. To mitigate the impact of sporadic, very rare words, especially in shorter texts, a logarithmic transformation of word frequency values can be applied before calculating the mean. Another potential modification of this measure involves computing the mean frequency for content words exclusively. Automatic computation of mean-based lexical sophistication indices using different reference corpora can be achieved with the assistance of programs such as Coh-Metrix12 (McNamara & Graesser, 2011), TAALES13 (Kyle & Crossley, 2015), Lexical Complexity Analyzer14 (Lu, 2012), and Text Inspector.

Measures of lexical diversity gauge the extent to which the vocabulary used in an (L2) text is varied. A range of formulas have been proposed over the years to accurately capture this aspect, like the Root Type/Token Ratio, or voc-D (see Leńko-Szymańska, 2020 for an overview). Most of these formulas are mathematical


transformations involving two values: the number of running words in a text (tokens) and the number of unique word forms (types) or unique words (lemmas), with the latter recommended for highly inflectional languages like French or Polish. While reference corpora are not necessary for calculating various lexical diversity measures, corpus tools can readily provide the required data. Typically, programs designed for generating mean-based lexical sophistication metrics also compute lexical diversity indices.15

Two recently introduced measures (Durrant & Durrant, 2022) aim to assess the register appropriateness of a text's lexicon by indicating its alignment with specific registers. These metrics are grounded in frequency lists which have been compiled separately for various registers found in a reference corpus, like academic, fiction, or magazine registers. For each content word in a text, multiple register scores are computed. Each score represents the proportion of a word's frequency within a given register compared to its total frequency in the entire corpus. Subsequently, an overall score for each register is assigned to a text. This overall score is calculated as the mean score for that particular register, considering all content words within the text. Moreover, register weights can be generated for individual words, reflecting their contribution to the register-specific vocabulary in the text. These weights are computed by multiplying word frequencies in a given register by their register scores. The two measures were employed in a study by Durrant and Durrant (2022) to trace vocabulary development in the L1 writing of school children in England. Although the new metrics do not constitute a fully developed framework for operationalizing the construct of a text's register appropriateness for measurement purposes, they represent a significant step in this direction.
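In code, the three families of measures discussed above reduce to a few lines of arithmetic. The sketch below uses invented frequencies in place of a real reference corpus; the function names and numbers are my own, not taken from any of the tools cited in this chapter:

```python
import math

# Invented reference-corpus frequencies; a real implementation would draw
# these from a large corpus, and per-register counts from its subcorpora.
CORPUS_FREQ = {"the": 60000, "car": 900, "collision": 40, "airbag": 12}
REGISTER_FREQ = {
    "academic": {"the": 20000, "car": 100, "collision": 30, "airbag": 10},
    "fiction":  {"the": 40000, "car": 800, "collision": 10, "airbag": 2},
}

def mean_log_sophistication(tokens):
    """Mean log frequency; lower values signal more sophisticated vocabulary."""
    logs = [math.log(CORPUS_FREQ.get(t, 1)) for t in tokens]  # unseen word -> log 1 = 0
    return sum(logs) / len(logs)

def root_ttr(tokens):
    """Root Type/Token Ratio: unique forms over the square root of text length."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def register_score(tokens, register):
    """Mean share of each word's total frequency that falls in the register."""
    shares = [REGISTER_FREQ[register].get(t, 0) / CORPUS_FREQ[t]
              for t in tokens if t in CORPUS_FREQ]
    return sum(shares) / len(shares)

tokens = ["the", "car", "collision", "airbag", "the", "car"]
```

In this toy data the academic and fiction shares of each word sum to one because the two registers partition every word's total frequency; with more registers, the shares would be spread across all of them, exactly as in the Durrant and Durrant (2022) scores described above.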
Extensive research, based on large corpora of both native and learner production, written and spoken, has attempted to model vocabulary use by both L1 and L2 learners applying the various measures discussed in this chapter (see Leńko-Szymańska, 2020; Durrant et al., 2021 for overviews of these studies). The results have demonstrated that these metrics effectively capture the development of L1 and L2 learners' lexical proficiency along a continuum. However, so far, none of the indices or their combinations have reliably been shown to discriminate between different proficiency levels. Interestingly, Text Inspector, which generates a total of 110 different lexical metrics, also includes a feature called Scorecard that provides an estimate of a text's CEFR level. Nonetheless, to date, no study has been conducted to validate this estimate.

The indices and tools employed for evaluating the lexical sophistication, diversity, and register appropriateness of texts and spoken discourse produced by L2 learners can also be utilized to gauge the lexical complexity of reading or listening passages intended for comprehension tasks. It is worth noting that Text Inspector requires identifying the mode of the analyzed text (writing, reading, or listening) before generating its numerous lexical metrics.


Conclusion

Corpora play a crucial role in the development of increasingly valid and robust instruments for assessing L2 learners' vocabulary knowledge and use. Various types of databanks, including reference corpora, LSP corpora, and learner corpora, have been employed for this purpose. However, while corpus-based vocabulary tests and metrics have gained significant traction in SLA research, the domains of language education and certification have been somewhat hesitant to leverage the advantages offered by corpora and corpus tools. Nevertheless, given the growing demand for language tests that are valid, reliable, user-friendly (i.e., cost-effective, readily available, and efficiently scored), and a renewed recognition of the importance of vocabulary in language acquisition and use, it can be anticipated that corpora will increasingly inform L2 lexical assessment in the future.

Notes

1 The terms assessment and testing are not fully synonymous, yet, for the purposes of this chapter, they will be used interchangeably.
2 https://dialangweb.lancaster.ac.uk/
3 www.lextale.com/whatislextale.html
4 www.wgtn.ac.nz/lals/resources/academicwordlist
5 www.oxfordlearnersdictionaries.com/about/wordlists/
6 www.englishprofile.org/wordlists
7 www.lextutor.ca/
8 www.cambridgeenglish.org/exams-and-tests/qualifications/
9 www.ielts.org/
10 www.ets.org/toefl.html
11 https://textinspector.com/
12 http://tool.cohmetrix.com/
13 www.linguisticanalysistools.org/taales.html
14 https://sites.psu.edu/xxl13/lca/
15 In the case of TAALES, the computations are performed by a separate program from the same suite called TAALED (Kyle et al., 2021, www.linguisticanalysistools.org/taaled.html).

Reference list

Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. Continuum.
Anthony, L. (n.d.). AntWordProfiler [Computer software]. www.laurenceanthony.net/software/antwordprofiler/
Aviad-Levitzky, T., Laufer, B., & Goldstein, Z. (2019). The new computer adaptive test of size and strength (CATSS): Development and validation. Language Assessment Quarterly, 16(3), 345–368. https://doi.org/10.1080/15434303.2019.1649409
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.

168

Agnieszka Leńko-Szymańska

Cambridge Assessment. (2021). B2 First. Handbook for teachers. Cambridge Assessment English. www.cambridgeenglish.org/Images/167791-b2-first-handbook.pdf
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47. https://doi.org/10.1093/applin/I.1.1
Capel, A. (2010). A1 – B2 vocabulary: Insights and issues arising from the English Profile Wordlists project. English Profile Journal, 1, e3. https://doi.org/10.1017/S2041536210000048
Capel, A. (2012). Completing the English vocabulary profile: C1 and C2 vocabulary. English Profile Journal, 3, 1–14. https://doi.org/10.1017/S2041536212000013
Cobb, T. (n.d.). Random item generator [Computer software]. www.lextutor.ca/rand/bnc_coca/
Cobb, T. (n.d.). VocabProfiler [Computer software]. www.lextutor.ca/vp/
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge University Press.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. https://doi.org/10.2307/3587951
Crossley, S. A., Cobb, T., & McNamara, D. S. (2013). Comparing count-based and band-based indices of word frequency: Implications for active vocabulary research and pedagogical applications. System, 41(4), 965–981. https://doi.org/10.1016/j.system.2013.08.002
Durrant, P., Brenchley, M., & McCallum, L. (2021). Understanding development and proficiency in writing: Quantitative corpus linguistic approaches. Cambridge University Press. https://doi.org/10.1017/9781108770101
Durrant, P., & Durrant, A. (2022). Appropriateness as an aspect of lexical richness: What do quantitative measures tell us about children's writing? Assessing Writing, 51, 100596. https://doi.org/10.1016/j.asw.2021.100596
Hawkins, J. A., & Filipović, L. (2012). Criterial features in L2 English: Specifying the reference levels of the Common European Framework. Cambridge University Press.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1, 7–36. https://doi.org/10.1007/s40607-014-0009-9
Kilgarriff, A., Husák, M., McAdam, K., Rundell, M., & Rychlý, P. (2008). GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the 13th EURALEX international congress (pp. 425–432). www.sketchengine.eu/wp-content/uploads/2015/05/GDEX_Automatically_finding_2008.pdf
Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757–786. https://doi.org/10.1002/tesq.194
Kyle, K., Crossley, S. A., & Jarvis, S. (2021). Assessing the validity of lexical diversity indices using direct judgements. Language Assessment Quarterly, 18(2), 154–170. https://doi.org/10.1080/15434303.2020.1844205
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3), 307–322. https://doi.org/10.1093/applin/16.3.307
Lemhöfer, K., & Broersma, M. (2012). Introducing LexTALE: A quick and valid lexical test for advanced learners of English. Behavior Research Methods, 44(2), 325–343. https://doi.org/10.3758/s13428-011-0146-0
Leńko-Szymańska, A. (2015). The English vocabulary profile as a benchmark for assigning levels to learner corpus data. In M. Callies & S. Goetz (Eds.), Learner corpora in language testing and assessment (pp. 115–140). John Benjamins. https://doi.org/10.1075/scl.70.05len


Leńko-Szymańska, A. (2020). Defining and assessing lexical proficiency (1st ed.). Routledge.
Lu, X. (2012). Lexical Complexity Analyzer [Computer software]. www.personal.psu.edu/xxl13/downloads/lca.html
McNamara, D. S., & Graesser, A. C. (2011). Coh-Metrix: An automated tool for theoretical and applied natural language processing. In P. M. McCarthy & C. Boonthum (Eds.), Applied natural language processing: Identification, investigation, and resolution (pp. 188–205). IGI Global. https://doi.org/10.4018/978-1-60960-741-8.ch011
Meara, P., & Jones, G. (1988). Vocabulary size as a placement indicator. In P. Grunwell (Ed.), Applied linguistics in society (pp. 80–87). CILT. http://eric.ed.gov/?id=ED350829
Meara, P., & Milton, J. (2003). X_Lex: The Swansea Vocabulary Levels Test [Computer software]. Express.
Nation, I. S. P. (1990). Teaching and learning vocabulary. Heinle and Heinle.
Nation, I. S. P. (n.d.). Range [Computer software]. www.victoria.ac.nz/lals/about/staff/paul-nation
Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7), 9–13, Appendix.
Owen, N., Shrestha, P., & Bax, S. (2021). Researching lexical thresholds and lexical profiles across the Common European Framework of Reference for Languages (CEFR) levels assessed in the Aptis test. The British Council.

12.1 Utilizing corpora-informed tools in the classroom for evidence-based L2 vocabulary assessment

Alisha Biler

Assessing the vocabulary of L2 learners is both methodologically and practically difficult in the classroom. Teachers are charged with selecting relevant and appropriately challenging words for their students, often based primarily on their intuition. They also need to make sure that vocabulary assessments reflect words in real-world use, or as they occur in "non-test environments," which is a time-consuming task (Taylor & Barker, 2008; Bachman & Palmer, 1996). Agnieszka Leńko-Szymańska's chapter provides language teachers with a helpful framework for how corpus linguistics can address these concerns. I would like to focus on the various corpora-informed tools highlighted in that chapter and how use of such tools results in more efficient, reliable, and evidence-based assessment instruments for classroom teachers.

One such tool is the English Vocabulary Profile Online (www.englishprofile.org/wordlists), which is a free tool to help teachers select words for study. This tool allows users to browse vocabulary items by CEFR level based on topic or part of speech. For instance, I was able to identify a number of B2-level words for my students related to the topic of our reading, politics, and thus have greater assurance that the words I was providing to my class were appropriately challenging and relevant to the topic.


For English for Specific Purposes classes, Leńko-Szymańska also demonstrates the abilities of Sketch Engine (www.sketchengine.eu) to create specialized word lists. Using this subscription-based tool (with certain corpora available for query for free), teachers are able to focus on words or multi-word expressions that are frequent in or specific to the students' discipline to better prepare them to encounter textbooks and conversations in their field. Sketch Engine can also provide a number of example sentences from which teachers can create test items with authentically rich context. Of particular benefit is Sketch Engine's "good dictionary example" (GDEX) tool, which allows teachers to isolate only the clearest of example sentences to use for testing. Having contextually rich and authentic example sentences available to copy or modify for my own assessments saves a great deal of time and improves test validity tremendously.

The final tool I would like to discuss is Text Inspector (www.textinspector.com), a helpful site that provides a more holistic evaluation of a learner's vocabulary knowledge. Text Inspector, which also requires a subscription, allows teachers to upload student writing for automatic analysis of vocabulary use (including the student's use of infrequent words and how varied their vocabulary usage is) to arrive at a CEFR level for the student. When used appropriately, these automatized tools allow for an objective measurement to serve as a baseline for understanding a learner's vocabulary proficiency level.

All L2 teachers can benefit from learning more about how corpora and corpus-informed tools can better support vocabulary assessment. By incorporating these tools into my course design and assessment creation, I am able to rely on evidence-based practice and maximize my limited preparation time.

Reference list

Bachman, L., & Palmer, A. (1996). Language testing in practice. Oxford University Press.
Taylor, L., & Barker, F. (2008). Using corpora for language assessment. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education, Vol. 7: Language testing and assessment (pp. 241–254). Springer. https://doi.org/10.1007/978-0-387-30424-3_179

12.2 Creating a word list for a paragraph of an automobile textbook

Mengyang Li

New energy cars have been highly advocated by the Chinese government in recent years (Cheng & Xiong, 2023). This has prompted some vocational colleges to open a new major called 'New Energy Vehicle Technology' to cultivate technical people


textbooks for automobile majors may not provide word lists to emphasize which words students need to learn. This also makes it difficult for teachers to teach and assess. One feasible way to solve this issue is to utilize a linguistic corpus. Hence, I will share how I use different corpora to create a word list for one paragraph of an automobile book to assist students in learning academic and technical vocabulary. The material I share is selected from a paragraph introducing airbags (see the following excerpt). Two corpus-based resources are employed for the next two sequential steps: the Academic Word List created by Averil Coxhead in 2000, and CorpusMate (Crosthwaite & Baisa, to appear).

Airbags are a safety feature in cars for the protection of vehicle occupants in the event of a collision. The first recorded airbag patent in the U.S. was in 1951. (Han & Jiang, 2022)

In the first step, the paragraph is analysed through VocabProfiler to discover whether there are any academic words in the text. The analysis shows that 'feature', 'vehicle', and 'occupant' are academic words. However, words like 'patent', 'collision', and 'airbag' are off-list words. This does not mean these words are not useful for students. It may mean they can be used in other ways, for example, as technical words.

In the second step, the three off-list words are analysed via CorpusMate, which contains 40,000 scientific papers that can be used to analyse the frequency of technical words within each topic. The CorpusMate search shows that the distribution of these three words 'patent', 'collision', and 'airbag' in engineering falls into the 'more likely to be used' category.
Even though the automobile major is part of the engineering programme, considering some common situations in the automobile industry (e.g., the use of airbags in a car, an accident in which two cars crash into each other), these three words are still selected as technical words. Hence, the use of the two resources provides evidence for teachers to establish a reliable word list to teach and assess students' academic and technical vocabulary knowledge. However, when using the results from the corpus, we also need to be cautious about the situations in which these words could be used, which may affect whether we should include them.
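The two-step screening described above can be approximated in a short script. Here ACADEMIC_WORDS is a tiny invented stand-in for Coxhead's Academic Word List, ENGINEERING_COUNTS plays the role of CorpusMate's per-topic frequency data, and the frequency threshold is arbitrary:

```python
# Step 1 stand-in: ACADEMIC_WORDS plays the role of Coxhead's (2000)
# Academic Word List; anything else is treated as off-list.
ACADEMIC_WORDS = {"feature", "vehicle", "occupant"}

# Step 2 stand-in: invented engineering-topic counts in place of
# CorpusMate's per-topic frequency data.
ENGINEERING_COUNTS = {"patent": 350, "collision": 280, "airbag": 90, "the": 5}

def build_word_list(candidate_words, min_count=50):
    """Sort candidates into academic words and discipline-specific technical words."""
    word_list = {"academic": [], "technical": []}
    for word in candidate_words:
        if word in ACADEMIC_WORDS:
            word_list["academic"].append(word)
        elif ENGINEERING_COUNTS.get(word, 0) >= min_count:
            # An off-list word frequent enough in the discipline's texts
            # is kept as a technical word; the threshold is arbitrary here.
            word_list["technical"].append(word)
    return word_list

words = ["feature", "vehicle", "occupant", "patent", "collision", "airbag"]
word_list = build_word_list(words)
```

As the caution above notes, such a script can only propose candidates; the final decision about whether an off-list word belongs on the technical list still rests with the teacher.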

Reference list

Cheng, Q., & Xiong, Y. (2023). Low-carbon sustainable development driven by new energy vehicle pilot projects in China: Effects, mechanisms, and spatial spillovers. Sustainable Development, 1–22. https://doi.org/10.1002/sd.2715
Han, E., & Jiang, T. (2022). Automobile professional English. Northeastern University Press.

172

Hung Tan Ha

12.3 Corpus-driven assessment – a new standard for language testing

Hung Tan Ha

Agnieszka Leńko-Szymańska's chapter delves into a fascinating yet often neglected subject in second language testing. Armed with deep theoretical insights, she equips second language teachers with answers to the "Why," "What," and "How" of using corpora in vocabulary assessment.

"Why should vocabulary be tested?" This question frequently arises among English as a second/foreign language (ESL/EFL) teachers in Vietnam, where the national curriculum primarily emphasizes grammar (Vu & Peters, 2021). Leńko-Szymańska astutely underscores the notable lack of vocabulary assessment in proficiency tests and, more broadly, in educational contexts. This is surprising given vocabulary's significant role in second language research. In her chapter, she further reinforces the critical importance of vocabulary in second language education, accentuating the need for vocabulary testing. Such an emphasis could catalyse a pedagogical shift towards prioritizing vocabulary in second language teaching, learning, and testing, potentially influencing English instruction and evaluation in EFL nations like Vietnam.

The next concern that arises is, "What words should be tested?" In Vietnam, vocabulary sections in most English exams, including the National High School Graduation Examination (NHSGE), primarily assess students' familiarity with words present in the curriculum. This approach gauges the vocabulary students glean from textbooks rather than their broader mental lexicon. Leńko-Szymańska presents word lists derived from corpora that mirror the frequency of words in authentic English. Assessing students based on these word lists provides a glimpse into their grasp of specific lexical subsets, such as academic English. This method could have a beneficial washback effect, enlightening teachers about their students' learning objectives, progress, and the effectiveness of their instruction.
Furthermore, Leńko-Szymańska’s discussion provides a roadmap for teachers on how to incorporate corpora into language testing. Beyond introducing vocabulary tests that harness word-frequency lists, she elucidates the potential of corpora in evaluating students’ writing. Historically, writing assessments have been the domain of human evaluators, inherently bringing subjectivity. Leńko-Szymańska illustrates how software rooted in corpora can gauge the diversity, sophistication, accuracy, and productivity of writing. Such tools not only alleviate the burden on human markers but also yield results that can be standardized across various educational settings – a boon for research.


In essence, the universal characteristics of corpora could lay the foundation for standardized writing evaluation, potentially setting new benchmarks to enhance the impartiality of both writing and speaking assessments.

Reference list

Vu, D. V., & Peters, E. (2021). Vocabulary in English language learning, teaching, and testing in Vietnam: A review. Education Sciences, 11(9), 563. https://doi.org/10.3390/educsci11090563

13 CORPUS-BASED LANGUAGE PEDAGOGY FOR PRE-SERVICE AND IN-SERVICE TEACHERS

Theory, practice, and research

Qing Ma

Introduction

Since the 1980s, corpora have been widely used in all fields of language studies and have significantly advanced our understanding of language use and patterns. In the classic data-driven learning (DDL) approach expounded by Tim Johns (1991, 1994), language learners can directly work with corpora with minimum help from the ‘middleman’ (language teachers). However, a gap exists between theory and language teachers’ actual practice. Some questions remain unanswered: Why are teachers reluctant to adopt corpus technology in classroom teaching? What role can teachers play in helping learners to use corpus technology? What basic skills do teachers need to work with and teach through corpus technology? This chapter seeks to answer these questions and introduces a new corpus-based language pedagogy (CBLP) to pre-service and in-service teachers. CBLP is formulated on a simple rationale: teachers need both subject knowledge (e.g., corpus linguistics and corpus technology) and pedagogical knowledge (language pedagogy) to teach effectively. This chapter highlights some of the essential skills for these two requirements and how these skills can be developed for teachers/student teachers. Efforts are made to provide a theoretical underpinning to support this new, interdisciplinary language pedagogy. Some research evidence is also provided to show how CBLP emerges as a new research area.

From DDL to corpus technology: a shift from students only to both students and teachers

DOI: 10.4324/9781003413301-13

With the advent of corpus linguistics, corpus technology has found its way into almost all branches of language and linguistic analysis, including grammar,


lexicography, semantics, pragmatics, language variation/change, translation studies, and contrastive and discourse analysis (McEnery & Xiao, 2011). Since as early as the 1990s, researchers started exploring the pedagogical potential of corpus linguistics/technology in language learning and teaching. The most well-known approach, advocated by Johns (1991), is data-driven learning (DDL), where language learners act as researchers or detectives by observing and analysing language examples through corpus searches. Johns (1994, p. 297) stated: 'What distinguishes the DDL approach is the attempt to cut out the middleman [teacher] as far as possible and to give [students] direct access to the data so that the learner can take part in building up his or her own profiles of meanings and uses'. Boulton (2011, p. 572) later further defined DDL as 'the hands-on use of authentic corpus data (concordances) by advanced, sophisticated foreign or second language learners in higher education for inductive, self-directed language learning of advanced usage'.

At the early stages of DDL, corpus linguists or educators seemed to primarily focus on language learners with good second language (L2) proficiency, assuming that advanced and mature university students could quickly self-develop their skills of navigating corpus technology to facilitate their language learning. A neglect of the middleman – namely, language teachers – is clearly seen in corpus educators' early research agenda. Since corpus researchers primarily offer direct corpus training to students (but not teachers), this means that ordinary language teachers usually possess limited knowledge about DDL. Recently, in helping teachers learn about DDL and apply corpus technology directly in their classroom teaching, Ma et al. (2022a, p. 2) added pedagogical potential to corpus technology and defined it as 'the use and application of technology associated with corpus linguistics and corpora for language learning and teaching'.
This new definition of corpus technology aims to extend the benefits of DDL not only for language learners but also for language teachers.

Corpus technology for language learning and teaching

Boulton and Vyatkina's (2021) meta-analysis of 489 studies highlights the highly positive impact of corpus technology on learners. Corpus technology boosts various linguistic skills, such as vocabulary (Ackerley, 2017; Lee et al., 2019; Liontou, 2019; Ma & Mei, 2021), grammar (Smart, 2014; Crosthwaite & Steeple, 2022), academic writing (Crosthwaite, 2016), and rhetorical skills (Charles, 2011; Poole, 2016). Hyland and Hyland (2006) underscored the importance of corpus-based feedback in academic writing. However, some researchers have advocated for further exploration of the theoretical underpinnings of corpus use in language learning (Flowerdew, 2015; O'Keeffe, 2021; Tribble, 2015).

Leech (1997) identified two categories of corpus applications in language teaching: direct application during classroom instruction, and indirect application for teacher development and resource creation. Many studies have reaffirmed the benefits of both these applications (Leńko-Szymańska & Boulton, 2015; McEnery & Xiao, 2011; Boulton & Cobb, 2017; Boulton & Vyatkina, 2021; Lee et al., 2019;

176

Qing Ma

Pérez-Paredes, 2019). Nevertheless, Chambers (2019) stated that only corpus specialists, not regular teachers, use this technology directly in classrooms. Boulton and Vyatkina (2021) noted that more than 90% of corpus research focuses on higher education, overlooking lower proficiency L2 learners in primary and secondary education. Additionally, evidence supporting the direct use of corpora in the professional development of in-service and pre-service teachers, particularly in primary or secondary education, remains scant (Callies, 2019; Leńko-Szymańska, 2017; Abdel Latiff, 2021; Ma et al., 2023; Naismith, 2017; Pérez-Paredes, 2022; Schmidt, 2022). Thus, regular teachers in pre-tertiary settings often lack the skills or knowledge to work with corpus technology.

Theory for formalising corpora use in language teaching

The extant literature reveals a paradox. Despite positive learner outcomes reported from many corpus studies, many schoolteachers lack sufficient corpus knowledge and skills to adopt corpus technology in classroom teaching. A key question remains: How could corpora and corpus technology successfully find their way into ordinary teachers' classroom use?

I suggest formalising corpus technology use in language teaching by illustrating a recently established language pedagogy – CBLP (Ma et al., 2022a). CBLP aims to significantly enhance corpus training in teacher education programmes (for pre-service teachers) and teacher professional development (for in-service teachers), drawing teachers' attention and encouraging them to incorporate corpus technology directly into classroom teaching. To construct this language pedagogy, we need to draw on theories from different disciplines, including corpus linguistics, second language acquisition (SLA), education and socio-cultural learning.

From a corpus linguistics perspective, we need to understand what corpus skills teachers need to learn to operate corpus technology (e.g., essential corpus search functions). In this regard, we borrow the idea of corpus literacy proposed by Mukherjee (2006), who outlined four components: (1) a basic understanding of corpora, (2) what one can (and cannot) do with a corpus, (3) how to analyse corpus data, and (4) how to summarise language use/patterns from corpus data. To acquire these four dimensions of knowledge and skills, teachers need to conduct hands-on corpus searches and gain familiarity with common corpus search functions available from popular corpus websites/tools (BNC, COCA, Lextutor, AntConc, etc.).

Teachers also need to master the pedagogical skills for teaching with corpus technology, which is termed by Ma et al. (2022a, p. 2732) as CBLP, defined as 'the ability to use technology of corpus linguistics to facilitate language teaching in a classroom context'.
The coinage of this term was inspired by Shulman's pedagogical content knowledge (PCK), which cuts across content knowledge (knowledge of the subject of teaching, such as math, Chinese, or English) and pedagogical knowledge. In this sense, CBLP integrates corpus linguistics with language pedagogy. Because corpus linguistics entails the use of corpus technology, CBLP can also be considered a type of technological pedagogical content knowledge (TPACK) – that is, it combines corpus technology with language pedagogy. In line with Shulman's pedagogical reasoning, teachers' development of CBLP involves five iterative stages: (1) understanding the purposes of corpus use, corpus linguistics, and target students' learning needs/difficulties (developing corpus literacy); (2) leveraging corpus literacy to design corpus-based teaching materials; (3) conducting CBLP lessons with appropriate instructional strategies; (4) evaluating CBLP teaching; and (5) self-reflecting on CBLP teaching. This process shows the importance of learning about corpus linguistics/technology, experiencing corpus-based lesson design, conducting classroom teaching, and reflecting on and evaluating both the teachers' learning and their students' learning outcomes.

The use of corpus technology has been endorsed as an effective language-learning approach based on empirical evidence (Boulton & Cobb, 2017; Boulton & Vyatkina, 2021), but this approach suffers from a lack of theoretical underpinning (Flowerdew, 2015; O'Keeffe, 2021; Tribble, 2015). Following the advice of O'Keeffe (2021), we draw relevant theories from various disciplines, including corpus linguistics, SLA, education, and socio-cultural learning, to lay a solid theoretical basis for integrating corpus technology into regular teachers' classroom teaching. In this regard, Ma et al. (2022a) proposed a two-step corpus training framework to help teachers gain both corpus literacy and CBLP. The first step focuses on helping teachers gain basic corpus literacy (an introduction to corpus linguistics and free online corpora); the second step focuses on helping teachers deepen their understanding of corpus literacy and design CBLP lessons collaboratively in an online learning community.
In addition to incorporating corpus linguistics and technology, this two-step training framework integrates theories from experiential learning (Kolb, 1984), socio-cultural learning (Vygotsky, 1978), and community learning (Wenger, 1998). Figure 13.1 illustrates this two-step CBLP training framework.

FIGURE 13.1 A two-step CBLP training framework

FIGURE 13.2 Alignment of the four-stage CBLP lesson design model with the SLA model

In addition, drawing on theories from SLA, inductive learning, and socio-cultural learning, Ma et al. (2022a) proposed a four-stage CBLP lesson design model as scaffolding to facilitate teachers' initial CBLP design practice: (1) identifying language gaps (e.g., testing students' knowledge); (2) observing and analysing language (e.g., hands-on corpus searches by students, looking for language patterns); (3) summarising language use patterns (e.g., inductive discovery by students); and (4) practising using the language (e.g., output exercises, communication, and homework). These four stages of CBLP lesson design are aligned with the SLA model proposed by Gass et al. (2020), as shown in Figure 13.2.

Practice for providing CBLP for pre-service and in-service language teachers

Using the two-step CBLP training framework and the four-stage CBLP lesson design model, our corpus team at the Education University of Hong Kong has, since 2017, conducted more than 60 CBLP workshops/seminars for more than 400 schools/universities and 5,000 pre- and in-service teachers in Hong Kong, mainland China, and other countries. We aim to help participants gain the skills to work with corpus technology and the pedagogical skills to teach with corpora. We created rich opportunities for them to work collaboratively and co-construct knowledge, to reinforce their knowledge of corpus technology and their ability to design corpus-based learning materials, and to improve their skills in teaching with corpus technology. We also regularly organised corpus-based lesson competitions to motivate our pre- and in-service teacher participants to sustain their interest in improving their CBLP competence and adopting CBLP in classroom teaching. We guided them in adapting and implementing their lesson plans in authentic classroom teaching in primary, secondary, and tertiary settings. To fully support pre- and in-service teachers in deepening their learning after our workshop training, we developed a dedicated Corpus-based Learning Platform for Language Teachers (CAP) (https://corpus.eduhk.hk/cap/) with self-learning resources, including 65 short training videos and 68 concrete lesson plans and worksheets covering various language skills (vocabulary, grammar, pronunciation, translation, reading, and writing) for students at the primary, secondary, and tertiary levels. So far, our CAP website has attracted visitors from more than 85 countries/regions and recorded an average of 590 visits per month.

We evaluated our CBLP training based on survey and interview data; the results confirmed that our endeavours have significantly benefitted both pre-service and in-service teachers. Three themes emerged regarding pre-service teachers' improvement after CBLP training: (1) enhancement of corpus skills, (2) improvement of collaborative learning, and (3) benefits from various CBLP resources. The following excerpts support these themes:

I have never heard about corpus tools before. The workshop materials and competition offered me with a new path to put theory into practice.

What impressed me most is receiving and giving comments on lesson designs by other teachers and the corpus team, which promotes teaching reflection and motivates us to make improvement and progress.

Regarding in-service teachers, our CBLP training helped them to (1) enhance their teaching confidence and (2) conduct continuous teaching reflection, as shown by the following excerpts:

[H]elped me be more confident about incorporating corpus-related activities into my own teaching.

[Using corpus] I can guide students in finding the answer to their questions, then they can interact with peers and reflect [on the language] and finally produce the target language.

Advancing CBLP: initiatives in establishing a novel research domain

In addition to providing CBLP training for pre-service and in-service teachers, we conducted CBLP-related research to support the continuous growth of this new research area. In the study by Ma et al. (2023), we addressed the pivotal role of corpus linguistics in language learning and the emerging demand for its integration into teacher education programmes. The central question concerned the nature of the knowledge and skills that such training should target. Building on the work of Mukherjee (2006) and Callies (2019), we proposed a five-component theoretical framework to measure teachers' perceived corpus literacy and its subskills: understanding, search, analysis, and the recognition of the advantages and limitations of corpora. We also hypothesised a relation between teachers' corpus literacy and their intention to incorporate corpora into their teaching. The study involved 183 pre- and in-service teachers who received corpus training to enhance their corpus literacy. Subsequently, they completed a survey evaluating their corpus literacy and intentions to use corpora in classroom teaching using Likert-scale items and open-ended questions. Confirmatory factor analysis suggested that a hierarchical factor structure for corpus literacy, underpinned by the five subskills, best fit the data. Structural equation modelling revealed a positive correlation between corpus literacy and the intention to incorporate corpora into classroom teaching. Our key findings underline the importance of all five subskills, with an emphasis on the development of corpus search and analytical skills, deemed the cornerstone of corpus literacy training for teachers. Further, raising teachers' awareness of the limitations of corpus use is crucial in corpus literacy training. This knowledge illuminates the challenges that teachers face, aiding corpus linguists and developers in improving related tools to meet teachers' needs. It also enables educators to design effective instructional activities to overcome these limitations and enhances teachers' metacognition for effective CBLP teaching.

In another study (Ma et al., 2022a), we empirically tested the effectiveness of the two-step CBLP training framework (Figure 13.1) with a group of pre-service TESOL teachers, employing a mixed-methods design. Our findings revealed that the majority of participants achieved a high level of corpus literacy, as evidenced by self-reported survey data. The participants also demonstrated initial competence in employing CBLP, as shown by the quality of their lesson designs and the content analyses of the lessons and interview data.
The results underscore the efficacy of the two-step training framework and support the conceptual distinction between corpus literacy and CBLP. The study further offers insights into scaffolding teachers in CBLP training and into addressing their students' vocabulary needs related to negative mother-tongue transfer using corpus resources. This research highlights the key factors influencing teachers' implementation of effective CBLP in school settings: teachers should provide clear guidance for hands-on corpus searches, balance corpus resources with concurrent lesson activities, and ensure that lessons offer ample opportunities for students to use the target language in contextualised, meaningful, and authentic ways.

Further, in the work by Ma et al. (2022b), we employed a case study approach to delve into the complex process of developing in-service teachers' CBLP in classroom teaching. We examined how two experienced university English teachers incorporated CBLP into real-world teaching scenarios. Data were collected through CBLP lesson materials, pre- and post-lesson interviews, and lesson observations. The key findings indicate that the development of CBLP is underpinned by five components of teacher knowledge and practice: (1) knowledge of the English language, (2) understanding of corpus technology, (3) pedagogical knowledge, (4) contextual knowledge, and (5) continual learning and practice. Despite notable differences in the teachers' approaches, both exhibited significant improvement in CBLP after extensive learning and practice. One teacher's development was primarily driven by his strong understanding of corpus technology, whereas the other relied heavily on firm pedagogical knowledge. Ensuring the smooth execution of CBLP lessons necessitates using corpora that offer free access and resolving any technical difficulties students might face. Teachers are also encouraged to weave topic-specific theories into their lesson designs; for instance, integrating genre-based writing theory into CBLP activities can help improve students' writing skills. Beyond concordance activities, the inclusion of interactive, student-centred tasks can effectively engage students. Finally, the development of CBLP among teachers can be further facilitated through self-learning and peer support, fostering an environment of ongoing practice and learning. These insights yield practical implications for facilitating the development of CBLP among language teachers, thus enhancing their teaching effectiveness.

Implications and conclusion

This chapter has presented corpus literacy and CBLP as two theoretically distinct yet interconnected concepts. Corpus literacy comprises teachers' basic skills in understanding corpus linguistics and using corpus technology for language learning and analysis. CBLP builds on corpus literacy, constituting the pedagogical skills for integrating corpus technology into classroom teaching. The new language pedagogy, CBLP, draws upon theories from various disciplines, attesting to its multidisciplinary nature and potential applicability in different educational contexts (from primary to tertiary settings). The chapter illustrated a two-step CBLP training framework (Ma et al., 2022a) involving online collaboration that has shown promise in developing both pre-service and in-service teachers' CBLP. This framework can potentially democratise access to CBLP training, thereby enhancing its reach and effectiveness.

The development of user-friendly corpus technology is a critical step in this direction, as teachers and students are more likely to adopt and use technology that is intuitive and easy to navigate. Moreover, CBLP lessons need to be engaging and attractive to learners. The design of CBLP teaching materials and activities should be student-centred, interactive, and relevant to learners' contexts and interests: the more engaging the lessons, the more likely students are to be motivated to learn and apply the skills they acquire. We also emphasise the pivotal role of teachers in transmitting corpus knowledge to students. Given the heavy workload of most teachers, finding ways to motivate them to adopt CBLP lessons is crucial. These could involve offering professional development opportunities, showcasing success stories of CBLP implementation, or providing supportive resources to ease the transition to this new pedagogical approach.
In conclusion, although the adoption and implementation of CBLP are challenging, its potential benefits for language learning and teaching are substantial.

As pre-service and in-service teachers continue to refine their understanding and application of CBLP, they will be better equipped to harness the power of corpus technology to enhance language teaching and learning.

Acknowledgements

This work was supported by the Research Grants Council of the Government of the Hong Kong SAR as part of a GRF project (GRF 18600123); it was also supported by the CRAC project (Ref: 04A32) at the Education University of Hong Kong.

References

Abdel Latif, M. M. M. (2021). Corpus literacy instruction in language teacher education: Investigating Arab EFL student teachers' immediate beliefs and long-term practices. ReCALL, 33(1), 34–48. https://doi.org/10.1017/S0958344020000129
Ackerley, K. (2017). Effects of corpus-based instruction on phraseology in learner English. Language Learning & Technology, 21(3), 195–216.
Boulton, A. (2011). Data-driven learning: The perpetual enigma. In S. Goźdź-Roszkowski (Ed.), Explorations across languages and corpora (pp. 563–580). Peter Lang.
Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta-analysis. Language Learning, 67(2), 348–393.
Boulton, A., & Vyatkina, N. (2021). Thirty years of data-driven learning: Taking stock and charting new directions over time. Language Learning & Technology, 25(3), 66–89. http://hdl.handle.net/10125/73450
Callies, M. (2019). Integrating corpus literacy into language teacher education. In S. Götz & J. Mukherjee (Eds.), Learner corpora and language teaching (pp. 245–263). John Benjamins.
Chambers, A. (2019). Towards the corpus revolution? Bridging the research–practice gap. Language Teaching, 52(4), 460–475.
Charles, M. (2011). Using hands-on concordancing to teach rhetorical functions: Evaluation and implications for EAP writing classes. New trends in corpora and language learning, 26–43.
Crosthwaite, P. (2016). A longitudinal multidimensional analysis of EAP writing: Determining EAP course effectiveness. Journal of English for Academic Purposes, 22, 166–178.
Crosthwaite, P., & Steeples, B. (2022). Data-driven learning with younger learners: Exploring corpus-assisted development of the passive voice for science writing with female secondary school students. Computer Assisted Language Learning, 1–32.
Flowerdew, L. (2015). Data-driven learning and language learning theories. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 15–36). John Benjamins.
Gass, S. M., Behney, J., & Plonsky, L. (2020). Second language acquisition: An introductory course (5th ed.). Routledge. https://doi.org/10.4324/9781315181752
Hyland, K., & Hyland, F. (2006). Feedback on second language students' writing. Language Teaching, 39(2), 83–101.
Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. In T. Johns & P. King (Eds.), Classroom concordancing (Vol. 4, pp. 1–16). ELR.
Johns, T. (1994). From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. In T. Odlin (Ed.), Perspectives on pedagogical grammar (pp. 293–313). Cambridge University Press.
Kolb, D. A. (1984). Experiential learning: Experience as the source of learning and development. Prentice-Hall.
Lee, H., Warschauer, M., & Lee, J. H. (2019). The effects of corpus use on second language vocabulary learning: A multilevel meta-analysis. Applied Linguistics, 40(5), 721–753.
Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 1–23). Longman.
Leńko-Szymańska, A. (2017). Training teachers in data-driven learning: Tackling the challenge. Language Learning & Technology, 21(3), 217–241.
Leńko-Szymańska, A., & Boulton, A. (2015). Introduction: Data-driven learning in language pedagogy. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 1–14). John Benjamins.
Liontou, T. (2019). The effect of data-driven learning activities on young EFL learners' processing of English idioms. In P. Crosthwaite (Ed.), Data-driven learning for the next generation (pp. 208–227). Routledge.
Ma, Q., Chiu, M. M., Lin, S., & Mendoza, N. B. (2023). Teachers' perceived corpus literacy and their intention to integrate corpora into classroom teaching: A survey study. ReCALL, 35(1), 19–39.
Ma, Q., & Mei, F. (2021). Review of corpus tools for vocabulary teaching and learning. Journal of China Computer-Assisted Language Learning, 1(1), 177–190.
Ma, Q., Tang, J., & Lin, S. (2022a). The development of corpus-based language pedagogy for TESOL teachers: A two-step training approach facilitated by online collaboration. Computer Assisted Language Learning, 35(9), 2731–2760.
Ma, Q., Yuan, R., Cheung, L. M. E., & Yang, J. (2022b). Teacher paths for developing corpus-based language pedagogy: A case study. Computer Assisted Language Learning, 1–32.
McEnery, T., & Xiao, R. (2011). What corpora can offer in language teaching and learning. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (Vol. 2, pp. 364–380). Taylor & Francis.
Mukherjee, J. (2006). Corpus linguistics and language pedagogy: The state of the art – and beyond. In S. Braun, K. Kohn, & J. Mukherjee (Eds.), Corpus technology and language pedagogy: New resources, new tools, new methods (pp. 5–24). Peter Lang.
Naismith, B. (2017). Integrating corpus tools on intensive CELTA courses. ELT Journal, 71(3), 273–283.
O'Keeffe, A. (2021). Data-driven learning – a call for a broader research gaze. Language Teaching, 54(2), 259–272.
Pérez-Paredes, P. (2019). The pedagogic advantage of teenage corpora for secondary school learners. In P. Crosthwaite (Ed.), Data-driven learning for the next generation: Corpora and DDL for pre-tertiary learners (pp. 67–87). Routledge.
Pérez-Paredes, P. (2022). A systematic review of the uses and spread of corpora and data-driven learning in CALL research during 2011–2015. Computer Assisted Language Learning, 35(1–2), 36–61. https://doi.org/10.1080/09588221.2019.1667832
Poole, R. (2016). A corpus-aided approach for the teaching and learning of rhetoric in an undergraduate composition course for L2 writers. Journal of English for Academic Purposes, 21, 99–109.
Schmidt, N. (2022). Unpacking second language writing teacher knowledge through corpus-based pedagogy training. ReCALL. Advance online publication. https://doi.org/10.1017/S0958344022000106
Smart, J. (2014). The role of guided induction in paper-based data-driven learning. ReCALL, 26(2), 184–201.
Tribble, C. (2015). Teaching and language corpora. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 37–62). John Benjamins.
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Harvard University Press.
Wenger, E. (1998). Communities of practice: Learning as a social system. Systems Thinker, 9(5), 2–3.

13.1 Corpus-based language pedagogy: enhancing pre-service teachers' pedagogical competence towards differentiated instruction

Vanessa Joy Z. Judith

Corpus-based approaches to language teaching focus on the actual usage and authentic occurrences of language as it is written and used by native speakers in various situations. Corpus-based language pedagogy is the ability to integrate corpus linguistics technology into classroom language pedagogy to facilitate language teaching (Ma & Mei, 2021). Corpus tools have enabled linguistic researchers and teachers to investigate actual usage and the characteristics of particular genres to improve syllabus design and derive more effective classroom exercises. Corpus findings have led to innovations in textbooks, dictionaries, teaching materials, primary research, classroom activities, and syllabus design, among others (Lessard-Clouston & Chang, 2014).

This approach is instrumental in conducting the internship programme at my state university in the Philippines, where it guides the integration of technology into the teaching and learning process during the training of pre-service teachers. Awareness of corpus linguistics technology assists these pre-service teachers in selecting relevant, contextualized, and localized learning materials that adequately address students' different learning styles. This awareness eventually leads to the integration of differentiated instruction in the conduct of lessons, whereby students' interest, readiness, and learning profile are considered (Geel et al., 2019), especially when designing CBLP lesson activities. Through this, the learners, under the guidance and care of these teachers, may fully achieve their required learning competencies. However, in preparing for corpus linguistics technology, these pre-service teachers need to consider how to achieve differentiated instruction in line with students' different learning preferences. This consideration must address the diversity of the learners in their classrooms and the need for teaching practices that can motivate these students to be productive and successful citizens in a changing local and global environment. Pre-service teachers must also carefully consider the relevance and availability of any technology to be utilized in a lesson, as is the case in developing countries.

In conclusion, Qing Ma's chapter on corpus-based language pedagogy for pre-service and in-service teachers is very relevant for guiding pre-service teachers to achieve their full potential during deployment. The two-step CBLP training framework enables pre-service teachers to gain, first, substantial knowledge of corpora, which eventually facilitates the acquisition of the pedagogical skills necessary to achieve competence in designing corpus-based teaching materials responsive to the learning environment. The use of an interactive method in the training process enriches their learning, drawing on the shared experiences and insights of the participants. Consequently, the four steps in designing CBLP as outlined by Professor Ma give pre-service teachers the opportunity to be grounded in the particular context of their students, which should also consider the technology available to them, as a starting point, while leaving ample room for creativity. Lastly, by going through the five stages of CBLP, these teachers are afforded a tool for self-learning and the furtherance of their pedagogical skills in language teaching, thereby becoming effective companions in enabling students to become self-reliant in achieving the desired competencies.

Reference list

Geel, M. V., Keuning, T., Frèrejean, J., Dolmans, D., Merriënboer, J. V., & Visscher, A. J. (2019). Capturing the complexity of differentiated instruction. School Effectiveness and School Improvement, 30(1), 51–67. https://doi.org/10.1080/09243453.2018.1539013
Lessard-Clouston, M., & Chang, T. (2014). Corpora and English language teaching: Pedagogy and practical applications for data-driven learning. TESL Reporter, 47(1–2), 1–20.
Ma, Q., & Mei, F. (2021). Review of corpus tools for vocabulary teaching and learning. Journal of China Computer-Assisted Language Learning, 1(1), 177–190. https://doi.org/10.1515/jccall-2021-2008

13.2 Reflecting on CBLP

Khaoula Daoudi

Qing Ma's chapter provides an insightful overview of the use of corpora in language education, focusing on direct application, in which learners activate their discovery skills and inductive learning. Despite the promising potential of corpora in the classroom, their use is still minimal for several reasons, one of which is lack of training. To address this, Ma developed a two-step training approach through a framework called corpus-based language pedagogy (CBLP), which combines corpus linguistics and language pedagogy.

Ma's work has been incredibly enlightening. As a novice researcher in the field of corpus linguistics and language pedagogy, I was able to get a comprehensive picture of the situation of corpus inclusion in the classroom and the challenges that hinder its implementation. This information gave me a clear idea of the gap existing in the literature, which helped me to narrow the focus of my current Ph.D. research, in which I developed and delivered a short-term teacher training course on the use of corpora in the teaching of specialised lexis. Previous research highlighted a number of issues and challenges associated with corpus use. For example, working with corpora can be quite challenging at first; thus, I attempted to simplify the content of the workshop by creating step-by-step materials that combine both theoretical and practical aspects of corpora. Moreover, an issue that Ma highlighted was the technical problems associated with corpus use. This calls for the provision of user-friendly concordancers such as The Compleat Lexical Tutor and WebCorp, along with accessible step-by-step worksheets that support their use.

The post-course questionnaire showed that teachers' overall perceptions of the course were very positive; they all found the training interesting and informative. They also declared that they gained a reasonable understanding of what a corpus is and how to use it for pedagogical purposes. Participants also highlighted that using corpus tools can have a substantial impact on their teaching practices, especially in increasing learners' independent learning skills, leading towards autonomy. For this reason, they favoured the direct use of corpora, in which learners access a corpus and discover language themselves. The lack of technical resources, however, was a prominent reason that could deter teachers from using corpora in the future. Subsequently, an interview was held two months after the intervention to give teachers time to absorb the information delivered to them and practise using the tools if they wanted to. Some teachers even reported positive experiences of using corpora with their learners, highlighting the advantage it could have for learners' vocabulary repertoire.

To conclude, Ma's work has been very informative in shaping my thoughts on the practical applications of data-driven learning (DDL) in the classroom. It provoked pertinent questions and offered valuable recommendations that shaped my understanding of the application of DDL in the classroom and eventually helped in the design of my research intervention.

14 THE APPLICATIONS OF DATA-DRIVEN LEARNING, CORPUS-BASED PHRASEOLOGY AND PHRASEOGRAPHY IN THE DEVELOPMENT OF COLLOCATIONAL AND PHRASEOLOGICAL COMPETENCE

Adriane Orenha-Ottaiano

1. Introduction

In the scope of this project, collocations are viewed under two distinct approaches: the statistically oriented approach and the phraseological approach. Under a statistically oriented approach, collocations are considered to be frequent word combinations whose co-occurrence within a certain distance of each other is statistically higher than expected in comparison with words randomly combined in a specific language (Barfield & Gyllstad, 2009; Nesselhauf, 2005; Sinclair, 1991; among others). However, as Teubert (2004, p. 188) noted, being statistically significant is not enough to identify a combination of words as a collocation: 'They also have to be semantically relevant. They have to have a meaning of their own, a meaning that isn't obvious from the meaning of the parts they are composed of'. For this reason, it is important to describe collocations under a phraseological approach as well, whereby collocations are defined as pervasive, recurrent and conventionalized combinations consisting of a base and a collocate (Hausmann, 1989), which are lexically and/or syntactically fixed to a certain degree and which can be said to have a special meaning only in combination with the base (Alonso-Ramos, 1994; Corpas Pastor, 1996; Hausmann, 1989; Heylen & Maxwell, 1994; Orenha-Ottaiano, 2020, 2021a; Pamies, 2019; Penadés Martínez, 2017; Torner & Bernal, 2017).

In this chapter, I aim to discuss the advances and relevance of corpus-based phraseology and phraseography, as well as their applications in the pedagogical practice of foreign language teachers. In training courses for foreign language teachers, there has been increasing interest in learning phraseological units, mainly collocations, especially those more frequently used by speakers of a specific foreign language. Thus, the need for corpus-based pedagogical materials and teaching resources has become greater in this context. I will also introduce some aspects of the development of a platform of collocation dictionaries, namely the Multilingual Collocations Dictionary Platform, and propose corpus-based activities for teaching collocations based on the data-driven learning (DDL) approach. These activities are proposed with the objective of highlighting the relevance of collocation dictionaries in helping learners achieve natural and fluent language use and master language proficiency. The purpose of creating and applying such activities is the development of collocational awareness and competence in teacher training courses and in the classroom.

DOI: 10.4324/9781003413301-14

2. Corpus-based phraseology and phraseography and the development of learners' phraseological and collocational competence

Phraseology, a research field within linguistics aimed at investigating recurrent lexical combinations, has grown significantly in the past few decades and has become a major field in the context of foreign language teaching and learning. It investigates phraseological units as a single, definable unit, as well as employs a specific scientific and descriptive method of analysis. Due to the advancement of corpus linguistics, a new field, corpus-based phraseology, was born and is gaining more space in foreign language teaching (Corpas Pastor & Mitkov, 2022; Granger et al., 2015; Corpas-Pastor et al., 2019; Meunier & Granger, 2008; Reppen, 2009; Orenha-Ottaiano, 2013, 2017, 2020, 2021), greatly contributing to the development of research aimed at helping learners achieve phraseological and collocational competence. Corpus-based phraseology also investigates recurrent lexical combinations; however, this investigation is carried out through the use of and access to corpora and computational tools for lexical/ phraseological analysis. The use of corpora in phraseological research has become vital and absolutely necessary due to its great impact on the area of foreign language learning and teaching and due to empirical evidence of how words or chunks of words behave and are used, whose results used to be more restricted in traditional phraseology. Corpus-based phraseology focuses on new types of evidence, new ways to describe language, new ways to organize data (the evidence) and new theoretical conceptions of the language, thus demanding the development of new tools and resources. Research in corpus-based phraseology has shown that the lexico-grammatical knowledge of learners is not enough if there is no perception of the natural and conventional phraseology aspect of the language. (Meunier & Granger, 2008; Orenha-Ottaiano, 2004, 2017, 2021a; Sinclair, 1991). In order to achieve fluency in the language, it is necessary for learners to develop phraseological competence. 
The development of collocational and phraseological competence

Phraseological competence can be defined as the ability to be aware of, identify, understand and, mainly, produce and use the most frequent and recurrent phraseological units in a particular language, units which are also conventionalized by a particular culture and community in specific contexts. Nevertheless, besides learning general phraseological units, it is also necessary to learn a more specific type of word combination which is ubiquitous, present in all types of texts and text genres: the collocation. Jackendoff (1995) observed that there are as many fixed expressions as there are single lexical items in the dictionary, and other authors, such as Mel'čuk, argue that fixed or multi-word expressions far outnumber single lexical items (Corpas Pastor, 1996, p. 7); collocations are one such type of combination.

Why is it so important to learn collocations?

• To use words more accurately;
• To sound more natural when speaking or writing;
• To vary your oral and written production, offering alternative ways of saying something that might be more 'colorful', expressive or more accurate (O'Dell & McCarthy, 2017); and
• To exploit and master a wide range of language rather than repeating everyday words like very, good or nice: for example, instead of saying 'very cold', use options which may be more accurate or 'colorful', such as 'freezing cold' or 'bitterly cold'; replace 'It's very, very hot', as some learners put it, with 'It's boiling hot', 'It's unbearably hot' or 'It's scorching hot'.

Taking that into account, and considering that 'language that is collocationally rich is also more precise' (McIntosh et al., 2009, p. v), teachers should help learners develop their collocational competence as well. Collocational competence can be understood as the ability to be aware of, identify, understand and, mainly, produce and use the most frequent and recurrent collocations in a particular language, collocations which are also conventionalized by a particular culture and community in specific contexts.
By developing such competence, learners reduce the effort of processing these combinations, of expressing their thoughts and of having to assemble them word by word whenever they produce discourse. In this respect, Pawley and Syder (2000) and Nesselhauf (2005) claim that psycholinguistic evidence shows that the human brain is much better at memorizing than at processing; hence, having a large stock of prefabricated units at one's disposal may reduce processing effort and, therefore, enhance language fluency (Orenha-Ottaiano, 2020, p. 61). Promoting collocational competence may thus help learners approach a native-like level of language proficiency.

Besides corpus-based phraseology, it is also worth highlighting the relevance of corpus-based phraseography to foreign language teaching. Since its interface with corpus linguistics, phraseographical methodology and practice have been conducted through the use and exploration of corpora and lexical analysis tools, on the basis of frequency data and statistical measures, with the aim of offering greater reliability and ensuring that the phraseologisms included in dictionaries are relevant to their users. Large corpora provide phraseographers with more evidence to decide what to include and, most importantly, what to leave out (adapted from Hanks, 2012, p. 63, on electronic lexicography). According to Hanks (2012), the evidence drawn from corpora also contributes to 'the endless tasks of improving the accuracy of explanations and identifying the pragmatic uses of words and phrases that had been largely neglected in traditional dictionaries'.

Regarding dictionary compilation, corpus-based phraseography has brought changes in the following aspects:

• Corpus conception: a large collection of texts which enables lexicographers or phraseographers to extract a greater number of naturally occurring multi-word units;
• Corpus compilation, now in automatic mode;
• Corpora size and greater data storage capacity;
• Software for corpus annotation (lemmatizers, taggers, parsers etc.);
• Research methodology, with lexical analysis tools and software (WordSmith Tools, AntConc, Sketch Engine etc.);
• Ways of observing and analyzing data;
• Phraseographical treatment, due to theoretical changes arising from the interface with corpus linguistics;
• Phraseographical decisions; and
• The very posture, tasks and role of the lexicographer/phraseographer: selecting and synthesizing information to validate choices and 'edit', which, according to Rundell (2012), demands greater skill and in-depth knowledge of the language.

Even though we live in a digital world and there has been an increasing development of different online tools (Google, Google Translate, Linguee, DeepL, Reverso etc.) and writing assistants (Grammarly, Collocaid etc.)
that may have seemed to compete with or even replace traditional dictionaries, the use of dictionaries in the 21st century can still be considered important and prevalent. Even ChatGPT, which at first sight seems to be a threat to publishers and lexicographers, can bring strengths and advantages to the area, turning into a powerful ally for corpus-based lexicography and phraseography. For dictionary makers, ChatGPT may be able to provide a wide range of linguistic information, such as definitions, example sentences, synonyms, antonyms or other linguistic features. Nevertheless, it is worth highlighting that it cannot be used without human expertise, so the lexicographer or phraseographer will inevitably continue to exist.

3. The significant role of collocation dictionaries to enhance language proficiency

Dictionaries can be used as learning tools and regarded as a potential source of help for learners seeking to enhance language proficiency. Orenha-Ottaiano and Olimpio (2021, p. 123) acknowledge the useful role of collocation dictionaries in raising, for example, pre- and in-service teachers' awareness of collocations and recognize them as a valuable pedagogical tool. The authors also claim that:

'[I]n order to get the most out of them, professors should raise pre-service teachers' awareness of collocations and in-service teachers should recognize their relevance to FL proficiency and fluency development'. This way, they will be able to better identify the collocational patterns they are looking up and enhance retention. (Orenha-Ottaiano & Olimpio, 2021, p. 123)

On top of that, the authors (2021, pp. 123–124) highlight the relevance of training students to consult collocation dictionaries regularly, referring to Neshkovska (2018), who likewise points out that collocation dictionaries can be a significant tool when it comes to mastering collocations. Neshkovska (2018) also argues that FL learners should be encouraged to look up collocations in collocation dictionaries on a regular basis, not only in class but also outside the classroom environment. Moreover, teachers are advised to develop activities based on collocation dictionaries.

In this chapter, we claim that best practices for incorporating dictionary activities into language courses should be encouraged by integrating resources and exercises that motivate students to access a dictionary to solve them. This practice may have a significant impact on the learning process and on the development of collocational competence. The next section briefly presents the online corpus-based Multilingual Collocations Dictionary Platform and provides some examples of DDL-based activities, guided by a phraseological view, that can be done by accessing the platform.

4. The Multilingual Collocations Dictionary Platform: incorporating dictionary activities into foreign language learning

The Multilingual Collocations Dictionary Platform (PLATCOL)1 is the result of the funded project 'A phraseographical methodology and model for an online corpus-based Multilingual Collocations Dictionary Platform' (FAPESP Proc nr. 2020/01783–2). PLATCOL is a multilingual platform made up of dictionaries in five languages: Portuguese, English, Spanish, French and Chinese, forming different language pairs. In the near future, more languages will be added and more language pairs will be formed.

PLATCOL aims to fulfill users' needs regarding language encoding and, as such, is considered a production dictionary. Besides helping users produce more authentic texts, it also has the purpose of developing users' collocational competence, which is intrinsically connected with fluency: the wider a learner's repertoire of collocations, the greater the fluency they can achieve.

Its methodology combines automatic methods for extracting candidate collocations (Garcia et al., 2019a) with careful post-editing performed by lexicographers. The automatic approaches take advantage of NLP tools to annotate large corpora with lemmas, PoS-tags and dependency relations in the five languages (English, French, Portuguese, Spanish and Chinese). Using these data, statistical measures are applied, and distributional semantics strategies select the candidates (Garcia et al., 2019b) and retrieve corpus-based examples (Kilgarriff et al., 2014).

To help the target audience use collocations more precisely and productively, the dictionaries address all types of collocations. They cover the various morphosyntactic structures of collocations that fit into the following taxonomy for English: verbal, noun, adjectival and adverbial. To understand the taxonomy and morphosyntactic structures more clearly, see the structures of each type of collocation and their examples in Figure 14.1. The examples of collocations and collocation types in this chapter were taken from our database and will be displayed on the end-user platform of the dictionary:

FIGURE 14.1 Collocations taxonomy

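As a rough illustration of the candidate-extraction step described above (statistical association measures applied to dependency-annotated word pairs), the sketch below ranks verb-object pairs by pointwise mutual information (PMI). The function name and toy data are invented for illustration only; PLATCOL's actual pipeline (Garcia et al., 2019a) compares several association measures over large parsed corpora in five languages.

```python
import math
from collections import Counter

def pmi_collocations(pairs, min_count=2):
    """Rank (head, dependent) pairs by pointwise mutual information.

    `pairs` stands in for verb-object tuples pulled from a
    dependency-parsed corpus.
    """
    pair_counts = Counter(pairs)
    head_counts = Counter(h for h, _ in pairs)
    dep_counts = Counter(d for _, d in pairs)
    n = len(pairs)
    scored = []
    for (h, d), c in pair_counts.items():
        if c < min_count:
            continue  # discard unreliable low-frequency pairs
        # PMI = log2( p(pair) / (p(head) * p(dependent)) )
        pmi = math.log2((c / n) / ((head_counts[h] / n) * (dep_counts[d] / n)))
        scored.append(((h, d), round(pmi, 2)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy verb-object pairs standing in for parsed corpus output.
pairs = [
    ("play", "role"), ("play", "role"), ("play", "role"),
    ("make", "mistake"), ("make", "mistake"),
    ("make", "role"), ("play", "game"), ("take", "role"),
]
ranking = pmi_collocations(pairs)
print(ranking)
```

Raw frequency alone would rank 'play role' first; PMI instead rewards pairs whose members rarely occur apart, which is why lexicographic post-editing of several measures, as in PLATCOL, remains necessary.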

It is also important to highlight that PLATCOL aims to be customized for different target audiences according to their needs. It is specially designed for foreign language learners, foreign language teachers, learner and professional translators, materials developers, lexicographers and researchers, or any other audience interested in learning collocations in greater depth.

5. Collocational activities in English and Portuguese using PLATCOL

In this section, we show examples of activities that can be done by accessing PLATCOL. Two of them were prepared for learners of English and one for learners of Portuguese as a foreign language. They can also be adapted for learner translators, as PLATCOL can be used as a monolingual, bilingual or even multilingual dictionary. In this way, these activities are intended for foreign language learners, teachers and learner translators. The activities were planned based on the DDL approach, so all examples taken from the platform are corpus-based and were automatically extracted from the corpora compiled specifically for this research.

The first activity in English focuses on the base 'role', a highly frequent word in English. Besides, as a wide range of verbs and adjectives can collocate with it, and as these verbs may differ considerably from those in learners' mother tongue, we consider it a very productive word.
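Because the activities follow a DDL approach, learners typically work with concordance lines rather than isolated examples. A minimal key-word-in-context (KWIC) display for the base 'role' can be sketched as follows; the function and sample sentence are invented for illustration and are not drawn from PLATCOL's corpora.

```python
def kwic(tokens, keyword, width=4):
    """Show each occurrence of `keyword` with `width` words of co-text."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left co-text so keywords line up in a column.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

text = ("Teachers play a pivotal role in class , and corpora play "
        "an increasingly crucial role in language teaching .")
for line in kwic(text.split(), "role"):
    print(line)
```

Aligning the node word in a column, as concordancers such as AntConc do, lets learners scan the left co-text and notice recurring collocates ('pivotal role', 'crucial role') inductively.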

FIGURE 14.2 Instructions for Activity 1 (Part A)

The following figures are mere illustrations of what learners will find when they go to PLATCOL and access the collocation dictionaries. Learners will note that they can choose from the different meanings of the entry 'role' and, depending on the exercise they are doing, will see different collocations on the screen to choose from.

FIGURE 14.3 Screenshot of the PLATCOL prototype – morphosyntactic structures

FIGURE 14.4 Screenshot of the PLATCOL prototype – adjectival collocations of the base 'role' and examples of 'pivotal role'

FIGURE 14.5 Activity 1 (Part B)

FIGURE 14.6 Activity 1 (Part C)

FIGURE 14.7 Instructions for Activity 2 and gap fill

The following figure shows a screenshot of the verbal collocations for meaning 2 of the entry 'role':

FIGURE 14.8 Screenshot of the PLATCOL prototype – verbal collocations of the base 'role'

FIGURE 14.9 Screenshot of the PLATCOL prototype – search for the entry 'papel' and its definitions

For the activities in Portuguese, we chose the equivalent of 'role': 'papel'. The morphosyntactic structures for the entry 'papel' are 'adjective + papel', 'papel + adjective', 'verb + papel' and 'papel + verb'.

FIGURE 14.10 Screenshot of the PLATCOL prototype – morphosyntactic structure 'papel + adjective' and examples of the collocation 'papel crucial'

FIGURE 14.11 Screenshot of the PLATCOL prototype – morphosyntactic structure 'adjective + papel'

The number of collocations extracted from the Portuguese corpus was lower (85 verbal and adjectival collocations) than the number extracted from the English corpus (139 verbal and adjectival collocations). As learners will notice, the specificity of the adjectives in English also differs from that of the Portuguese ones. When it comes to the order of noun phrase constituents, Portuguese shows a certain flexibility, allowing more freedom of placement for adjectives (preposed or postposed) than English, where attributive adjectives always precede the noun. However, teachers of Portuguese as a foreign language should draw students' attention to the fact that adjective placement in Portuguese may have implications. Unlike in English, preposed adjectives may signal two different situations. If the adjective is preposed, it can be used to place emphasis on a certain noun or to give it a more 'subjective' value. In other situations, preposed adjectives may even change the meaning of the whole sentence, as can be seen in the following example:

Ele é um oficial alto. (= He is a tall official)
Ele é um alto oficial. (= He is an official with a high rank)

Because of this, we propose an activity to draw learners' attention to this fact while they are learning collocations.

FIGURE 14.12 Instructions for Activity 3

FIGURE 14.13 Collocations with the structures 'adjective + papel' and 'papel + adjective'


As can be seen from the proposed activities, encouraging learners to explore collocations in a collocation dictionary, especially in PLATCOL, will raise their awareness and enable them to produce these combinations more naturally. The activities proposed here are just a few examples of how PLATCOL and other dictionaries can be used as a learning strategy to help foreign language learners gain collocational competence.

Concluding remarks

In this chapter, we underlined the relevance of corpus-based phraseology and phraseography for foreign language teaching and learning, focusing on the fact that learning a wider repertoire of phraseological units or collocations enhances phraseological and collocational competence and therefore supports the acquisition of fluency in a foreign language. Under the scope of the research project 'A phraseographical methodology and model for an online corpus-based Multilingual Collocations Dictionary Platform', we addressed the usefulness of collocation dictionaries in the development of collocational competence and hence in the mastery of the language. We raised issues that highlight the advantages of using a collocation dictionary platform as a learning resource and a complementary aid for foreign language learners. Moreover, the platform can be regarded as a pedagogical tool for language teachers, who will be able to count on one more resource for developing dictionary-based activities, helping learners improve their understanding of a language and their ability to communicate effectively.

Note

1 PLATCOL will be available at www.platcol.com from July 2024.

Reference list

Alonso-Ramos, M. (1994). Hacia una definición del concepto de colocación: de J. R. Firth a I. A. Mel'čuk. Revista de Lexicografía, 1, 9–28.
Barfield, A., & Gyllstad, H. (Eds.). (2009). Researching collocations in another language: Multiple interpretations. Palgrave Macmillan.
Corpas Pastor, G. (1996). Manual de fraseología española. Gredos.
Corpas Pastor, G., R. Mitkov, M. Kunilovskaya, M. A. Losey, S. Moze, P. O. (Eds.). (2019). Computational and corpus-based phraseology: Third International Conference, Europhras 2019, Malaga, Spain, Proceedings. https://doi.org/10.1007/978-3-030-30135-4
Corpas Pastor, G., & Mitkov, R. (Eds.). (2022). Computational and corpus-based phraseology (EUROPHRAS 2022, Lecture Notes in Computer Science, Vol. 13528). Springer. https://doi.org/10.1007/978-3-031-15925-1_2
Garcia, M., Salido, M. G., & Ramos, M. A. (2019a). A comparison of statistical association measures for identifying dependency-based collocations in various languages. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) (pp. 49–59).
Garcia, M., Salido, M. G., Sotelo, S., Mosqueira, E., & Ramos, M. A. (2019b). Pay attention when you pay the bills: A multilingual corpus with dependency-based and semantic annotation of collocations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4012–4019).
Granger, S., Gilquin, G., & Meunier, F. (Eds.). (2015). The Cambridge handbook of learner corpus research. Cambridge University Press.
Hanks, P. (2012). The corpus revolution in lexicography. International Journal of Lexicography, 25(4), 398–436. https://doi.org/10.1093/ijl/ecs026
Hausmann, F. J. (1989). Le dictionnaire de collocations. In O. Reichmann, H. E. Wiegand & L. Zgusta (Eds.), Wörterbücher: Ein internationales Handbuch zur Lexikographie. Dictionaries. Dictionnaires (pp. 1010–1019). De Gruyter.
Heylen, D., & Maxwell, K. (1994). Lexical functions and the translation of collocations. In Proceedings of the International Conference on Computational Linguistics (pp. 298–305), Kyoto, Japan.
Jackendoff, R. (1995). The boundaries of the lexicon. In M. Everaert, E. J. van der Linden, A. Schenk & R. Schreuder (Eds.), Idioms: Structural and psychological perspectives (pp. 133–169). Lawrence Erlbaum.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1, 7–36. www.sketchengine.eu
McIntosh, C., Francis, B., & Poole, R. (2009). Oxford collocations dictionary for students of English. Oxford University Press.
Meunier, F., & Granger, S. (Eds.). (2008). Phraseology: An interdisciplinary perspective. John Benjamins.
Neshkovska, S. (2018). What do advanced ESL/EFL students need to know to overcome 'collocational' hurdles? Thesis, 7(2), 53–74. AAB College.
Nesselhauf, N. (2005). Collocations in a learner corpus. John Benjamins. https://doi.org/10.1075/scl.14
O'Dell, F., & McCarthy, M. (2017). English collocations in use: Intermediate. Cambridge University Press.
Orenha-Ottaiano, A. (2004). A compilação de um glossário bilíngue de colocações, na área de jornalismo de Negócios, baseado em corpus comparável [Master's thesis, Universidade de São Paulo, Brazil].
Orenha-Ottaiano, A. (2013). The proposal of an electronic bilingual dictionary based on corpora. In O. M. Karpova (Ed.), Life beyond dictionaries: Proceedings of X anniversary international school on lexicography (pp. 405–408).
Orenha-Ottaiano, A. (2017). The compilation of an online corpus-based bilingual collocations dictionary: Motivations, obstacles and achievements. In I. Kosem, C. Tiberius, M. Jakubíček, J. Kallas, S. Krek & V. Baisa (Eds.), Proceedings of eLex 2017 – Electronic lexicography in the 21st century: Lexicography from scratch (pp. 458–473). Lexical Computing CZ, s.r.o.
Orenha-Ottaiano, A. (2020). The creation of an online English collocations platform to help develop collocational competence. Phrasis: Rivista di studi fraseologici e paremiologici, 1, 59–81.
Orenha-Ottaiano, A. (2021). Escollas colocacionais a partir dun corpus de estudantes de tradución e a importancia do desenvolvemento da competencia colocacional. Cadernos de Fraseoloxía Galega, 21, 35–64.
Orenha-Ottaiano, A., & Olimpio, M. E. O. S. (2021). A corpus-based platform of multilingual collocations dictionaries (PLATCOL): Some lexicographical aspects aiming at pre- and in-service teachers. In Proceedings of the International Conference 'Corpus Linguistics 2021' (pp. 122–132).
Pamies, A. (2019). La fraseología a través de su terminología. In J. J. Martín Ríos (Ed.), Estudios lingüísticos y culturales sobre China (pp. 105–134). Comares.
Pawley, A., & Syder, F. (2000). The one clause at a time hypothesis. In H. Riggenbach (Ed.), Perspectives on fluency (pp. 163–191). University of Michigan Press.
Penadés Martínez, I. (2017). Arbitrariedad y motivación en las colocaciones. RLA, 55(2), 121–142.
Reppen, R. (2009). Exploring L1 and L2 writing development through collocations: A corpus-based look. In A. Barfield & H. Gyllstad (Eds.), Researching collocations in another language. Palgrave Macmillan. https://doi.org/10.1057/9780230245327_4
Rundell, M. (2012). The road to automated lexicography: An editor's viewpoint. In S. Granger & M. Paquot (Eds.), Electronic lexicography (pp. 15–30). Oxford University Press.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford University Press.
Teubert, W. (2004). Units of meaning, parallel corpora, and their implications for language teaching. In U. Connor & T. Upton (Eds.), Applied corpus linguistics: A multidimensional perspective (pp. 171–189). Rodopi.
Torner, S., & Bernal, E. (Eds.). (2017). Collocations and other lexical combinations in Spanish. Routledge. https://doi.org/10.4324/9781315455259

Web tools

DeepL. www.deepl.com/en/translator
Google Translate. https://translate.google.com/
Linguee. www.linguee.com/
OpenAI. (2023). ChatGPT (February 13 version) [Large language model]. https://chat.openai.com
Reverso. https://context.reverso.net/traducao/

14.1 Exploring the pedagogical potentials of corpora and DDL to enhance collocational and phraseological skills

Buse Uzuner

The phraseological dimension of corpus-based studies offers fresh ways to depict language and linguistic evidence, organizing that evidence within new theoretical frameworks grounded in corpus methodology. Among phraseologisms, which include idioms, proverbs, phrasal verbs and collocations, broader phraseological knowledge such as collocations aids non-native learners in fluently conveying their messages in English (Nesselhauf, 2005). Mastering multi-word units, especially collocations, is vital for foreign language learners, allowing them to sound more authentic and expand their linguistic range (Orenha-Ottaiano, 2012).

Phraseography, in turn, is an applied field of phraseology: it describes phraseological units via corpus-based frequency and statistical analyses. Larger corpora offer richer linguistic evidence for pinpointing and evaluating phraseological units such as collocations, and corpus-based phraseography aids in crafting dictionaries tailored to them. Both corpus-based phraseology and phraseography play pivotal roles in developing pedagogical materials for foreign language instruction and translation. Notably, lexicographical resources inductively heighten students' awareness of collocations. As such, collocation dictionaries serve as valuable tools, guiding foreign language learners in using collocations both inside and outside the classroom.

Following Orenha-Ottaiano's chapter describing an online corpus-based multilingual collocations dictionary platform targeting collocation teaching, I contemplated how to adapt this methodology to the Turkish context, helping Turkish foreign language learners autonomously master collocations, a challenging aspect of language acquisition, particularly for Turkish learners. The chapter's content resonates with my master's thesis, which used an online pattern dictionary (Hanks, 2013) to contrast English verb-noun patterns in learner language from three distinct L1 backgrounds: Turkish, Norwegian and Japanese. Learners should be encouraged to use multi-word expressions such as collocations, and corpus-based classroom activities are one of the most effective ways to do so.
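One such classroom activity, examining frequent n-grams in a small do-it-yourself corpus, can be sketched minimally as follows. The function name and toy corpus are invented for illustration; in practice learners would compile their own texts and use a tool such as AntConc.

```python
from collections import Counter

def top_ngrams(text, n=3, k=2):
    """Return the k most frequent word n-grams in a small DIY corpus."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams).most_common(k)

# Toy 'corpus'; real DDL work would load learner-compiled texts.
corpus = "students play a key role in class and teachers play a key role too"
print(top_ngrams(corpus, n=3, k=2))
```

Even on a toy text, recurring chunks such as 'play a key' and 'a key role' surface immediately, which is exactly the noticing effect DDL activities aim to produce.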
Both direct and indirect uses of corpus methodology can not only contribute to learners' collocational knowledge but also enhance learner autonomy. Classroom activities can be interest-based, built on learners' preferences, or need-based. As I have been teaching English to various learner groups, from time to time I reconsider my teaching methods to fill gaps in the teaching–learning process. DDL activities include compiling corpora and examining n-grams, frequently used collocations and their phraseological features in context. Applying these activities can raise learners' awareness of culture-specific words and help them differentiate the meanings of specific words in specific contexts. This enables learners to forge a connection between the collocational networks of their native and target languages. The outcomes of applying corpora in my English teaching showed that learners had positive attitudes, as they were not only learning the language but also exploring its linguistic structures.

Reference list

Hanks, P. (2013). Lexical analysis: Norms and exploitations. MIT Press. https://doi.org/10.7551/mitpress/9780262018579.001.0001
Nesselhauf, N. (2005). Collocations in a learner corpus. John Benjamins. https://doi.org/10.1075/scl.14
Orenha-Ottaiano, A. (2012). English collocations extracted from a corpus of university learners and its contribution to a language teaching pedagogy. Acta Scientiarum, 34(2), 241–251. https://doi.org/10.4025/actascilangcult.v34i2.17130

14.2 In pursuit of collocational and phraseological competence

Boonyakorn Siengsanoh

As a teacher, I have always stressed the importance of collocations and phraseology and explicitly communicated that to my students. The courses I teach include fundamental English courses for first-year university students, essay- and business email-writing courses, speaking and presentation courses, and English proficiency test preparation courses. Despite the variation in course content, one major learning outcome the students are expected to achieve upon completing these courses is strong productive skills and, for me, that is where collocational and phraseological competence comes into play.

Oftentimes, what I notice is that although most of the students know quite a lot of individual words and are well equipped with grammatical knowledge, they get stuck at the intermediate plateau, where they do not seem to make any perceptible progress in language use and fluency and, to use Lewis's (2000) terms, they 'need a breakthrough' (p. 14). To illustrate, they seem to lack the ability to use lexical items they already know within a wide range of collocations or formulaic sequences. I strongly believe that such ability would allow them to speak and write English in a more natural, precise and concise way.

Well aware of that problem, I have established a new set of priorities for my classes. Promoting student autonomy and fluency has taken priority over lecture-based grammar teaching. Also, instead of only going through the course content as specified in the syllabi, I now make sure to devote some class time to promoting the study skills the students need to improve their writing and speaking on their own.
That’s why I usually do the following: 1 In the first week of the course, I deliver a short lecture on collocations and multiword units, explaining what they are and why they are important for language use and fluency development. I also mention the potential impact of L1 interference on collocation acquisition. Besides, whenever possible, I train my students to notice those chunks by themselves in the reading texts or transcripts they work on. 2 Right from the very beginning of the course, I recommend some useful tools that can help them develop collocational and phraseological competence on

208

Paweł Szudarski

their own. These include books on collocations and online monolingual dictionaries, such as Oxford Dictionary, Cambridge Dictionary, Collins Dictionary and OZDIC English Collocation Dictionary. The selection of a dictionary is very important, as some dictionaries provide more detailed information about and more examples of collocations than others. 3 I show the students how to use some basic functions on corpus tools to solve some language problems on their own. These problems are, for example, nearsynonyms, grammatical or lexical collocations, high-frequency words or multiword units in a particular field of study. 4 To increase collocation recalls and recognition, collocations in reading texts are typographically enhanced (i.e. underlined, highlighted or bolded). In addition, when developing vocabulary exercises and tests, I ensure that the students are prompted to produce multi-word units or collocations that they learn instead of just one word. To conclude, the key to developing EFL students’ collocational and phraseological competence is to heighten their awareness about word combinations. In doing so, teachers should ensure students have vital study skills that allow them to acquire such lexical knowledge by themselves. Reference list Lewis, M. (Ed.). (2000). Teaching collocation: Further developments in the lexical approach. Language Teaching Publications.

14.3 Applications of corpora in studying and enhancing second language collocational and phraseological knowledge

Paweł Szudarski

Abstract

This chapter discusses the applications of corpora in studying and enhancing second language (L2) collocational and phraseological knowledge. Based in the area of corpus-based phraseology, it uses Orenha-Ottaiano's work as a starting point to consider the key role of corpus analysis and evidence in examining word partnerships, such as collocations or other kinds of phraseological patterning. Focusing in
particular on L2 learning and use, the chapter explains how the advent of corpora as databases of naturally occurring language has provided new ways of analysing collocation use across a wide range of speakers, settings and communicative situations. The chapter also points to corpus-based tools that enable cross-lingual research into phraseology at the interface of linguistic description, lexicography and language education.

It is abundantly clear that the last two decades have produced an impressive body of new corpus-based research, providing a wealth of insights into a wide range of language features and phenomena. Collocations, or word partnerships, are a good example of such phenomena, with corpora providing rich data on the use of various types of collocational pairs (e.g., 'powerful car', 'strong coffee', 'make a mistake') by different groups of speakers (e.g., speakers of English as a first language, L1, vs. users and learners of English as a second language, L2). Such corpus-based analyses of collocations, and of collocational competence more broadly, are the focus of Orenha-Ottaiano's chapter, which provides a useful overview of empirical work in this area, centring on the usefulness of corpora, corpus evidence and data-driven learning (DDL) in raising collocational awareness, promoting L2 users' knowledge of collocations, producing collocation dictionaries and training translators.

One of the main points Orenha-Ottaiano stresses concerns the importance of corpus evidence in studying collocations and other kinds of word combinations. As discussed in detail in Szudarski (2023), corpora have been instrumental in revealing the phraseological nature of language and have assisted with the identification of collocational partnerships as they occur across different types of linguistic material, including L2 learner language and translation texts.
Orenha-Ottaiano’s work convincingly shows how corpus-based analysis helps to delve into the intricacies of phraseological patterning and the use of collocations as an important feature of naturally occurring language. With collocations being realised differently across languages, types of discourse and communicative situations, corpora have opened up a number of possibilities for evidence-based descriptions of collocational competence, as well as for training language professionals (teachers, translators, materials writers). Another key message is a strong emphasis on the multiple ways or modes in which collocational and phraseological information is presented in resources such as dictionaries or online databases. My response is that we need more research in the area of phraseography, a field of applied phraseology looking into the preparation and compilation of dictionaries, glossaries and similar resources. Orenha-Ottaiano’s example of PLATCOL (www.institucional.grupogbd.com/dicionario/sobre?locale=en), a platform of multilingual collocation dictionaries, usefully demonstrates how corpus-derived information enables us to compare the use of collocations across different contexts and languages. It is also encouraging to see research, such as Wojciechowska (2023) or Dziemianko (2014), that examines the most effective ways of presenting phraseological and collocational information as a way of enhancing the learning of such knowledge by L2 users.

In summary, corpus-based examinations of collocations, phraseology and different types of word combinations have become an important topic in corpus linguistics, and in applied linguistics more broadly. Summaries such as Orenha-Ottaiano’s chapter constitute an important contribution in this area, helping to navigate the complexities of this kind of linguistic analysis and inspiring new research and practice in lexicography and language education.

Paweł Szudarski

Reference list
Dziemianko, A. (2014). On the presentation and placement of collocations in monolingual English learners’ dictionaries: Insights into encoding and retention. International Journal of Lexicography, 27(3), 259–279. https://doi.org/10.1093/ijl/ecu012
Szudarski, P. (2023). Collocations, corpora and language learning. Cambridge University Press. https://doi.org/10.1017/9781108992602
Wojciechowska, S. (2023). Hand in hand or separate ways: Navigation devices and nesting of metonymic BODY PART multiword expressions in monolingual English learners’ dictionaries. International Journal of Lexicography, 1–21. https://doi.org/10.1093/ijl/ecad007

15 DATA-DRIVEN LEARNING IN INFORMAL CONTEXTS? EMBRACING BROAD DATA-DRIVEN LEARNING (BDDL) RESEARCH

Pascual Pérez-Paredes

Introduction

Data-driven learning and the use of corpora and corpus linguistics tools (Pérez-Paredes, 2010) have had a modest impact on the broader language learning field (Chambers, 2019; Boulton, 2021; O’Keeffe, 2021). Boulton (2021) has noted that there is an imperative for future research to examine alternative ways to approach DDL, particularly outside university classrooms, and Crosthwaite and Boulton (2023) argue that there is plenty of scope for innovation if DDL research and practice become less dogmatic in terms of the resources used for their implementation. In this chapter, I argue that it is necessary to pursue an analysis of DDL practices in the broader language learning context (Pérez-Paredes & Mark, 2022), particularly in informal contexts outside the university classroom. To do this, it is essential to expand the ecological research model that has dominated DDL research so far, and which has thoroughly examined higher education (HE) contexts.

I will use the term “prototypical DDL” (Boulton, 2015; Boulton & Pérez-Paredes, forthcoming) to refer to DDL that is designed by an expert in corpus linguistics and which takes place in the context of instructed second language acquisition (SLA) as part of a module or an official programme, typically in a higher education institution (HEI). The term “broad DDL” (BDDL) will be used to refer to pedagogical natural language processing resources (P-NLPRs) for language learning (see Pérez-Paredes et al., 2018). BDDL makes use of a wide range of existing resources such as online dictionaries, text analysis and text processing tools, vocabulary-oriented websites and apps, translation services, and artificial intelligence (AI) tools for language learning across a variety of contexts, including self-directed uses. It also involves the use of informal language learning against the backdrop of digital learning, characterized by a new ecology of reading and writing, multitasking and the emergence of a new literate social formation (Pérez-Paredes & Zhang, 2022) where communication processes are transitioning towards “dialogic interactions [less] subject to the power of institutions to set standards of knowledge, procedure, and truth based on their control of written texts” (Gee & Hayes, 2011, p. 125). P-NLPRs have the potential to foster autonomy, personalization, induction and authenticity and may offer an alternative to prototypical DDL corpora when engaging with BDDL (Pérez-Paredes et al., 2018, 2019).

DOI: 10.4324/9781003413301-15

“Prototypical DDL” as the dominating research ecology

The popularization of DDL took place in the 1990s. During the previous decade, computing in educational contexts underwent a major transformation as language teachers gained access to personal computers, either in their schools or at home. It was access to personal, more affordable computers that allowed language teachers to explore a whole range of new possibilities for instruction (Higgins, 1983). As early as the mid-1980s, Tim Johns had developed concordancing software that ran on a Spectrum computer, one of the most popular home computers before PCs became widespread (Davies, 2002). A few years later, Johns (1990, 1991) proposed using DDL as a way of providing language learners with linguistic data with which to deepen their learning through direct, unmediated access to concordance lines from a corpus. The determining factor behind the interest in using corpora in language teaching was access to new developments in computing and data processing, which made it easier for researchers and language teachers in universities to exploit these resources in their own teaching and with their own learners. Throughout the years, concordance lines have been conceptualized by language researchers as a proxy for real language use, facilitating language learners’ understanding of language patterns in a corpus. This principle has remained intact for over 30 years. Some of the early views on DDL include the notion that concordance lines are richer and more revealing than textbooks (Johns, 1986), as they show all the occurrences of a word and list the contexts where it occurs in the corpus. In this early conceptualization, we can already note the potential for DDL to empower language learners and foster their autonomy in language learning. For Boulton (2021, p. 9), through DDL learners can query language data “to detect recurring patterns and thus find answers to their own questions ‘in the wild’ rather than being dependent on teachers or published resources”.
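The concordance display at the heart of Johns’s proposal, often called key word in context (KWIC), is conceptually simple: every occurrence of a search word is listed with a slice of its left and right context, aligned on the key word. A minimal sketch in Python (an illustration only, not a reconstruction of Johns’s Spectrum program or any published concordancer):

```python
import re

def kwic(text, keyword, width=25):
    """Return every occurrence of `keyword` with `width` characters of
    left and right context, aligned on the key word as in a concordancer."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}}  {m.group()}  {right}")
    return lines

# invented mini-text for illustration
sample = ("Learners who make a mistake with collocations may say strong car. "
          "A powerful car is conventional, and so is strong coffee.")
for line in kwic(sample, "car"):
    print(line)
```

Scanning the aligned left-hand contexts is what lets a learner notice, inductively, which collocates a word prefers, which is the pattern-detection work DDL asks learners to do.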
Research, however, has shown that very rarely have language learners’ own drive and motivations mobilized their will to use DDL, and only occasionally have language learners’ uptake of corpus query been followed up beyond standard delayed-treatment questionnaires or interviews (an example of positive uptake is Charles [2014] after one year; while Pérez-Paredes and Sánchez Hernández [2018] showed negative evidence of autonomous corpus use two years after a DDL experiment).

Since the early days, DDL has been driven by researchers’ interest in bringing corpora and explicit linguistic analysis to the language classroom (McCarthy et al., 2021). DDL has been examined extensively in tertiary instructed contexts (Boulton & Cobb, 2017; Chambers, 2019), resulting in what Pérez-Paredes (2010) conceptualized as the possibilities scenario, that is, classroom-based practices where corpus linguistics methods and corpora are brought to the language classroom.1 This scenario is arguably a type of DDL ecology (Pérez-Paredes, 2023) where learners’ interactions with corpus data in instructed settings are central. This is a top-down approach where achievement in DDL is measured through tasks designed and implemented by researchers in either between-groups or within-groups research designs (Boulton & Cobb, 2017). This has impacted the use and spread of DDL in different ways (Boulton & Cobb, 2017; Boulton, 2021; Pérez-Paredes, 2022; Dong et al., 2023), including (1) the fact that a small number of representative corpora and query interfaces have been widely used and researched in most teaching contexts (e.g. COCA and the BNC); (2) the need for students’ specific training on how to conduct and interpret corpus queries (Pérez-Paredes et al., 2011, 2012; Geluso & Yamaguchi, 2014); and (3) the need for teacher training on corpus management and exploitation (Poole, 2020). Dong et al. (2023) have argued that recent transformative research in DDL should include larger-scale classrooms and further investigation of the role of variables in DDL. It is therefore remarkable that informal learning is not included in the equation, at least not explicitly. Despite its potential to unleash individual, autonomous learning, informal language settings where adult learners can self-initiate language learning episodes have rarely been studied.
They are generally represented in DDL research (Pérez-Paredes, 2022) as passive stakeholders who do not show agency (Leung & Scarino, 2016) across language learning contexts (Pérez-Paredes & Mark, 2021), engaging with corpus uses in semi-autonomous ways as they follow the directions provided by researchers (Pérez-Paredes, 2023). In the context of this chapter, I use the term “informal learning” quite freely to refer to both informal learning and non-formal learning as conceptualized in Conole and Pérez-Paredes (2017, p. 46):

Informal learning corresponds to everyday life, i.e., activities whose main objective is not educational. It is not structured, and it is non-intentional and does not lead to certification. Non-formal learning is learning which is embedded in planned activities not always explicitly designated as learning (in terms of learning objectives, learning time or learning support), but which contain an important learning element. It is intentional from the learner’s point of view . . . adult non-formal learning refers to the self-directed intentional learning that takes place in people’s leisure time outside of both the workplace and the curriculum of formal learning activities.

As suggested by the authors, non-formal learning can be offered by different types of organizations not officially recognized as learning institutions. YouTube channels run by experts and users across hundreds of thousands of topics are an example.

Researching informal language learning: bringing broad data-driven learning to the fore

We need to push the boundaries of DDL praxis and research outside the classroom if we are to gain a more comprehensive view of the contributions of DDL to language learning in the first half of the 21st century. The pre-eminence of researchers’ own agendas, convenient access to informants in HEIs and institutionalized practices when examining sites of DDL practice (i.e. academic writing in undergraduate and graduate programmes) may explain the lack of engagement with outside-the-classroom learning (Kukulska-Hulme & Lee, 2020). These practices ignore both informal language learning and non-formal language learning ecologies. In institutionalized practices, learners are typically instructed to follow a set of steps that will eventually train them to query corpora (see Charles, 2014) and either acquire lexico-grammatical features or, more generally, improve their writing or language awareness. These practices may eventually facilitate self-initiated uses, as in Charles (2014). It is encouraging that there is strong evidence that HE language learners, when provided with relevant data and a focus-on-form inductive approach, do engage in autonomous learning through original queries and the examination of corpus data (Crosthwaite et al., 2019).

Informal learning encompasses both intentional and unintentional learning that may happen because of self-initiated or incidental learning episodes. The examination of self-directed uses of BDDL in informal contexts and in emerging spaces for language learning shifts the focus of the DDL research ecology from institutional practices and researchers to learners and their engagement with language data. According to Godwin-Jones (2023), usage-based theories of language learning are central to new emerging learning sites as opportunities to provide input and stimuli for language learning.
Some of these uses may include online experiential language learning (Godwin-Jones, 2020, 2023), such as gamified contexts, digital games and serious games (Jackson & McNamara, 2013; Noroozi et al., 2016; Dehghanzadeh et al., 2021), social networking sites (Zainuddin & Yunus, 2022), digital narratives (Yus, 2021) or video watching apps (e.g. TikTok and the like) and web services (Nguyen & Diederich, 2023). Assuming that learners/L2 users (I prefer not to take it for granted that all L2 users consider themselves learners) engage in some sort of intentional learning while using their L2 in digital contexts, some of the many open questions that merit examination include the following:

• How do learners/L2 speakers analyse language in informal learning (including non-formal contexts)?
• How do learners/L2 speakers engage with innate domain-general cognitive abilities?
• How do learners/L2 speakers use their critical thinking about language through hypothesis formation?
• How do learners/L2 speakers formulate queries (e.g. word senses or word uses while reading or checking their Instagram feeds) and interpret the language data? What computational skills are favoured by learners?
• How do learners/L2 speakers make sense of collocational behaviour even if they lack the metalanguage of collocation and are not consulting a corpus?
• How do learners/L2 speakers interpret lexico-grammatical frequencies?
• How do learners/L2 speakers extract prototypicality in texts?
• How do learners/L2 speakers engage with the notion of “usage” while they examine language data on, for example, their mobile devices?

BDDL will arguably open up new spaces for reflection on, and understanding of, how language learners make sense of language data in the digital era (Kern, 2021). Note that I deliberately avoid using corpora in a restricted sense in this context. In BDDL, corpora are one of the many resources available to language learners (see Boulton, 2015 for a discussion of the use of the Web as a corpus). While some research has examined the use of Google as a web corpus and a concordancer (Sun, 2007; Sha, 2010; Pérez-Paredes et al., 2012; Boulton, 2015), this has mostly happened in instructed SLA contexts. The impact of other P-NLPRs in informal learning remains largely unexplored (see Crosthwaite & Boulton, 2023 for a discussion of some of these resources). In fact, user-generated activity on personal devices such as phones or tablets has the potential to inform designed activity and, most significantly, what we know about learners’ interactions with content online (Kukulska-Hulme et al., 2007). The emphasis on researchers’ own agendas and institutionalized practices has arguably resulted in some standard language-learning practices in instructed SLA receiving little attention in DDL research.
In this agenda, the researchers’ use of corpora in institutional, instructed settings defines the resources, the tasks and the learners’ engagement with DDL. Researching BDDL can increase our understanding of learners’ practices of “recognizing the fuzzy nature of authentic language use in context” (Boulton & Cobb, 2017, p. 350), practices that are sustained by digital learning ecologies and mediated by new digital literacies, understood as “constellations of symbolically mediated practices that involve various kinds of knowledge, predispositions, and skills to deal with texts in electronically mediated environments” (Kern, 2021, p. 134). Digital literacies are complex and diverse and, as pointed out by Kern (2021, p. 134), “because the knowledge, skills, and practices involved in digital literacies are so wide-ranging, no one is ever digitally literate in all possible ways”. The use of BDDL, including using the Web as a corpus, checking frequency bands in online dictionaries or, among others, looking up the collocational behaviour of words in context-aware predicted text, involves a wide range of practices, skills and cognitive processes that go beyond prototypical DDL (O’Keeffe, 2021).
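One of the BDDL practices just mentioned, checking frequency bands in online dictionaries, rests on simple arithmetic over corpus counts. A hedged sketch using the Zipf scale (log10 of a word’s frequency per billion tokens); the band labels and thresholds below are illustrative only, not those of any particular dictionary:

```python
import math

def zipf_value(freq, corpus_size):
    """Zipf value = log10 of the word's frequency per billion tokens:
    roughly 1-2 for very rare words, 6-7 for words like 'the'."""
    return math.log10(freq / corpus_size * 1_000_000_000)

def frequency_band(word, counts, corpus_size):
    """Map a word to a coarse band, as learner dictionaries often do.
    Thresholds here are invented for illustration."""
    if word not in counts:
        return "not attested"
    z = zipf_value(counts[word], corpus_size)
    if z >= 5:
        return "high frequency"
    if z >= 4:
        return "mid frequency"
    return "lower frequency"

counts = {"the": 50_000, "coffee": 12, "serendipity": 1}  # toy counts
size = 1_000_000  # one-million-token toy corpus
```

What a dictionary’s frequency marker compresses into a single symbol is exactly this kind of corpus-derived estimate, which is why such markers are a low-threshold entry point into data-driven reasoning about language.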

There are at least three areas that will benefit from an examination of BDDL practices in informal learning:

1 The exploration of new sites of language learning engagement;
2 New opportunities to increase our understanding of the cognitive processes involved in statistical language learning; and
3 Study and analysis of the role of new corpora in informal settings.

The study of new sites of engagement necessarily involves a reconceptualization of DDL and the extension of the concept of prototypical DDL. Crosthwaite (2021) has recently suggested that DDL needs to integrate new information sourcing and be open to innovation, and I have myself argued, in the 2020 online TaLC plenary, for the need for a new DDL agenda that places a wider range of Learning scenarios and Learners at the core. An example in this direction is Crosthwaite (2020), who has implemented online training for DDL-led error detection and correction. For Godwin-Jones (2020), some of the areas where self-directed uses are relevant include under-resourced communities, language learning for migrants and refugees and language learning for interactive (fan) fiction. Godwin-Jones (2020) has also highlighted the rise of informal language learning through leisure-time film and TV viewing. To the best of my knowledge, DDL researchers have not explored these sites of engagement. This may be due either to a lack of ad-hoc resources or to a very strict, reductionist perspective on DDL practices.
In relatively under-explored sites of language learning, such as AI-driven L2 uses (An et al., 2023), digital narratives (Yus, 2021), self-directed mobile-assisted language learning outside the classroom (Zhang & Pérez-Paredes, 2021) or fan fiction communities where L2 users worldwide read and write mostly in English, we can find enhanced interactivity, new digital literacies and multiple modes of user involvement that necessarily require the type of statistical language learning that is supported by usage-based language learning theories. For Perruchet and Pacton (2006), statistical learning is a form of implicit learning in which learners are not given any instructions regarding learning. They are expected to use their domain-general cognitive abilities to learn from exposure to the instances found in texts, extract statistical regularities from the language input and apply them to L2 grammars (Hudson Kam & Newport, 2005). BDDL can offer a range of tools that favour the emergence of metalinguistic awareness in informal learning, which in turn favours the extraction and categorization of L2 features in the input. In a very recent paper, Goldberg (in press) notes that while human brains represent the world imperfectly with limited resources, humans generalize from familiar to related cases via interpolation and induction. This process has very rarely been researched in DDL, and it has the potential to inform us about how learning happens in digital learning contexts.

Other usage-based areas that can be of interest in informal language learning include contingency, reliability of form-function mapping and statistical association (Ellis et al., 2015). Besides, insights from computational thinking research can complement statistical learning research by outlining the types of strategies in place when learners solve real-life problems, including, among many others, heuristics, lateral thinking, abstract thinking, recursivity, iteration, trial-and-error methods, the use of problem-solving patterns, metacognition or creativity (see Pérez-Paredes & Zapata-Ros, 2018 for a detailed account of these strategies and Kafai and Proctor [2022] for an account of the notion of computational literacies). The low impact of computational thinking skills on DDL research is surprising, particularly when one examines the body of research in other learning fields where numeracy and abstraction are involved.

As for the role of new and old corpora in BDDL, existing corpora are expected to adapt their interfaces to mobile devices and to provide APIs that link with a wider range of services across the Web. Similarly, individualized, user-generated corpora and language datasets are expected to become easier to create. Měchura (2017) has already developed the technology to generate personalized dictionaries by extracting data from the Web. The spread of AI web services and their integration with existing technologies for language learning will eliminate the need for complex coding and computational skills to solve complex language analysis problems, which in a way will have an impact similar to the spread of personal computers in the 1980s and 1990s. Crosthwaite and Baisa (2023) have recently shown that AI is expected to have an impact on the way language learners and educators interact with language data. Analysing the complexities and features (Ballance, 2016, 2017) involved in non-corpus-based concordancing on websites like Linguee is also a major area of future research.

Moving BDDL forward

Digital language learning ecologies shape L2 use today. Kern (2021, p. 135) has noted that “digital-era language learners” are presented with a vast array of resources on the Internet:

These include reference materials (e.g., dictionaries, grammars, encyclopedias), tutorials (on a variety of linguistic and cultural topics), artistic productions (e.g., music, films, exhibits), day-to-day information and vocabulary (through news sites, business websites, websites on any interest under the sun), and, perhaps most significant, engagement with other speakers of the language through social media, forums, telecollaborative exchanges, and so forth.

For Kukulska-Hulme and Lee (2020, p. 170), outside-the-classroom learning represents social spaces that offer a variety of affordances for language learning. For these authors, these new opportunities “will not be fully realized unless we make efforts to propose and try out new designs for learning in these settings”. It is urgent that DDL researchers take action and rethink how to integrate informal language learners/L2 speakers in BDDL research. Otherwise, DDL may remain restricted to (some) university language classrooms.

The task ahead for BDDL researchers is formidable. As pointed out by Soyoof et al. (2023), “only a few studies have addressed the agency and digital literacy dimension of CALL in recent years”. Part of the new research to be carried out is technology- and resource-driven, but part of the effort is learning-driven. By placing the emphasis on Learning, we may bring to the mix a wider range of statistical learning and computational thinking skills and competencies that can enrich DDL research ecologies where formal learning and institutions do not exclusively shape language learning. Pérez-Paredes and Zhang (2022, p. 21) have suggested that, even at the end of the first quarter of the 21st century, governments and educational institutions retain almost absolute power when deciding what counts as competence, and institutional ideologies about language learning determine how learners frame their experiences of instructed language learning and how artefacts can possibly mediate the use of technology for language learning. While instructed, formal language learning continues to be central to language learners’ experiences, new sites of learning and technologies sometimes emerge unexpectedly (e.g. the impact of ChatGPT at the end of 2022 was surprising, and it is probably too soon to evaluate its impact on language education). Informal learning “places the learners’ lives and their digital and material experiences as the central point for their language learning [favouring] a usage-driven and user-centered L2 pedagogy” (Pérez-Paredes & Zhang, 2022, p. 22). As Yus (2021) put it when discussing the use of mobiles and digital narratives, users who are capable of reaping the full rewards of digital literacies will be granted a more active role in engaging with other language users and in language learning.
As for DDL as a field, O’Keeffe (2021) has noted that a more pedagogically aware and SLA-focused approach to researching DDL is expected in the forthcoming years. BDDL practices can remove some of the rigidity in terms of resources and learning scenarios that has characterized prototypical DDL so far, allowing educators and language learners to pursue individualization and adaptation to informal learning.

Note
1 94% of the research carried out in the 2011–2015 period and published in Q1 journals happens in HEIs (Pérez-Paredes, 2022).

Reference list
An, X., Chai, C. S., Li, Y., Zhou, Y., & Yang, B. (2023). Modeling students’ perceptions of artificial intelligence assisted language learning. Computer Assisted Language Learning, online ahead of print.
Ballance, O. J. (2016). Analysing concordancing: A simple or multifaceted construct? Computer Assisted Language Learning, 29(7), 1205–1219.
Ballance, O. J. (2017). Pedagogical models of concordance use: Correlations between concordance user preferences. Computer Assisted Language Learning, 30(3–4), 259–283.
Boulton, A. (2015). Applying data-driven learning to the web. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 267–295). John Benjamins.
Boulton, A. (2021). Research in data-driven learning. In P. Pérez-Paredes & G. Mark (Eds.), Beyond concordance lines. Corpora in language education (pp. 9–34). John Benjamins.
Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta-analysis. Language Learning, 67(2), 348–393.
Boulton, A., & Pérez-Paredes, P. (forthcoming). Data-driven language learning. In R. Hampel & U. Stickler (Eds.), Bloomsbury handbook of language learning and technology. Bloomsbury.
Chambers, A. (2019). Towards the corpus revolution? Bridging the research–practice gap. Language Teaching, 52(4), 460–475.
Charles, M. (2014). Getting the corpus habit: EAP students’ long-term use of personal corpora. English for Specific Purposes, 35, 30–40.
Conole, G., & Pérez-Paredes, P. (2017). Adult language learning in informal settings and the role of mobile learning. In S. Yu, M. Ally & A. Tsinakos (Eds.), Mobile and ubiquitous learning. An international handbook (pp. 45–58). Springer.
Crosthwaite, P. (2020). Taking DDL online: Designing, implementing and evaluating a SPOC on data-driven learning for tertiary L2 writing. Australian Review of Applied Linguistics, 43(2), 169–195.
Crosthwaite, P. (2021). DDL is dead? International Perspectives on Corpora for Education Seminar. https://youtu.be/SRlaoJ8faWU?si=oamlX-5_2Ak9x5gq
Crosthwaite, P., & Baisa, V. (2023). Generative AI and the end of corpus-assisted data-driven learning? Not so fast! Applied Corpus Linguistics, 3(3), 100066.
Crosthwaite, P., & Boulton, A. (2023). DDL is dead? Long live DDL! Expanding the boundaries of data-driven learning. In H. Tyne, M. Bilger, L. Buscail, M. Leray, N. Curry & C. Pérez-Sabater (Eds.), Discovering language: Learning and affordance. Peter Lang.
Crosthwaite, P., Wong, L. L., & Cheung, J. (2019). Characterising postgraduate students’ corpus query and usage patterns for disciplinary data-driven learning. ReCALL, 31(3), 255–275.
Davies, G. (2002). ICT and modern foreign languages: Learning opportunities and training needs. International Journal of English Studies, 2(1), 1–18.
Dehghanzadeh, H., Fardanesh, H., Hatami, J., Talaee, E., & Noroozi, O. (2021). Using gamification to support learning English as a second language: A systematic review. Computer Assisted Language Learning, 34(7), 934–957.
Dong, J., Zhao, Y., & Buckingham, L. (2023). Charting the landscape of data-driven learning using a bibliometric analysis. ReCALL, 35(3), 339–355.
Ellis, N. C., O’Donnell, M. B., & Römer, U. (2015). Usage-based language learning. In B. MacWhinney & W. O’Grady (Eds.), The handbook of language emergence (pp. 163–180). John Wiley.
Gee, J., & Hayes, E. (2011). Language and learning in the digital age. Routledge.
Geluso, J., & Yamaguchi, A. (2014). Discovering formulaic language through data-driven learning: Student attitudes and efficacy. ReCALL, 26(2), 225–242.
Godwin-Jones, R. (2020). Future directions in informal language learning. In M. Dressman & R. W. Sadler (Eds.), The handbook of informal language learning (pp. 457–470). John Wiley.
Godwin-Jones, R. (2023). Emerging spaces for language learning: AI bots, ambient intelligence, and the metaverse. Language Learning & Technology, 27(2), 6–27. Goldberg, A. E. (in press). A Chat about constructionist approaches and LLMs. Constructions and Frames. Higgins, J. (1983). Computer assisted language learning. Language Teaching, 16(2), 102–114. Hudson Kam, C. L., & Newport, E. L. (2005). Regularizing unpredictable variation: The roles of adult and child learners in language formation and change. Language Learning and Development, 1(2), 151–195. Jackson, G. T., & McNamara, D. S. (2013). Motivation and performance in a game based intelligent tutoring system. Journal of Educational Psychology, 105(4), 1036–1049. Johns, T. (1986). Micro-concord: A language learner’s research tool. System, 14(2), 151–162. Johns, T. (1990). From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. CALL Austria, 10, 14–34. Johns, T. (1991). Should you be persuaded: Two samples of data-driven learning materials. In T. Johns & P. King (Eds.), “Classroom concordancing”. English Language Research Journal, 4, 1–16. Kafai, Y. B., & Proctor, C. (2022). A revaluation of computational thinking in K – 12 education: Moving toward computational literacies. Educational Researcher, 51(2), 146–151. Kern, R. (2021). Twenty-five years of digital literacies in CALL. Language Learning & Technology, 25(3), 132–150. Kukulska-Hulme, A., & Lee, H. (2020). Mobile collaboration for language learning and cultural learning. In M. Dressman & R. W. Sadler (Eds.), The handbook of informal language learning (pp. 169–180). John Wiley. Kukulska-Hulme, A., Traxler, J., & Pettit, J. (2007). Designed and user-generated activity in the mobile age. Journal of Learning Design, 2(1), 52–65. Leung, C., & Scarino, A. (2016). Reconceptualizing the nature of goals and outcomes in language/s education. The Modern Language Journal, 100(S1), 81–95. 
McCarthy, M., McEnery, T., Mark, G., & Pérez-Paredes, P. (2021). Looking back on 25 years of TaLC: In conversation with Profs Mike McCarthy and Tony McEnery. In P. Pérez-Paredes & G. Mark (Eds.), Beyond concordance lines. Corpora in language education (pp. 57–74). John Benjamins. Měchura, M. B. (2017). Introducing Lexonomy: An open-source dictionary writing and publishing system. In Electronic lexicography in the 21st century: Lexicography from scratch. Proceedings of the eLex 2017 conference, 19–21 September 2017, Leiden. Nguyen, H., & Diederich, M. (2023). Facilitating knowledge construction in informal learning: A study of TikTok scientific, educational videos. Computers & Education, 104896. Noroozi, O., McAlister, S., & Mulder, M. (2016). Impacts of a digital dialogue game and epistemic beliefs on argumentative discourse and willingness to argue. The International Review of Research in Open and Distributed Learning, 17(3), 208–230. https://doi. org/10.19173/irrodl.v17i3.2297 O’Keeffe, A. (2021). Data-driven learning, theories of learning and second language acquisition. In search of intersections. In P. Pérez-Paredes & G. Mark (Eds.), Beyond concordance lines. Corpora in language education (pp. 35–56). John Benjamins. Pérez-Paredes, P. (2010). Corpus linguistics and language education in perspective: Appropriation and the possibilities scenario. In T. Harris & M. Moreno Jaén (Eds.), Corpus linguistics in language teaching (pp. 53–73). Peter Lang. Pérez-Paredes, P. (2022). A systematic review of the uses and spread of corpora and datadriven learning in CALL research during 2011–2015. Computer Assisted Language Learning, 35(1–2), 36–61.

Pérez-Paredes, P. (2023). How learners use corpora. In R. R. Jablonkai & E. Csomay (Eds.), The Routledge handbook of corpora and English language teaching and learning (pp. 390–405). Routledge. Pérez-Paredes, P., & Mark, G. (Eds.). (2021). Beyond concordance lines: Corpora in language education. John Benjamins. Pérez-Paredes, P., & Mark, G. (2022). What can corpora tell us about language learning? In M. McCarthy & A. O’Keeffe (Eds.), The Routledge handbook of corpus linguistics (2nd ed., pp. 313–327). Routledge. Pérez-Paredes, P., Ordoñana, C., & Aguado-Jiménez, P. (2018). Language teachers’ perceptions on the use of OER language processing technologies in MALL. Computer Assisted Language Learning, 31(5–6), 522–545. Pérez-Paredes, P., Ordoñana, C., Van de Vyver, J., Meurice, A., Aguado-Jiménez, P., Conole, G., & Sánchez-Hernández, P. (2019). Mobile data-driven language learning: Affordances and learners’ perception. System, 84, 145–159. Pérez-Paredes, P., & Sánchez Hernández, P. (2018). Uptake of corpus tools in the Spanish Higher Education context: A mixed-methods study. Research in Corpus Linguistics, 6, 51–66. Pérez-Paredes, P., Sánchez-Tornel, M., & Alcaraz Calero, J. M. (2012). Learners’ search patterns during corpus-based focus-on-form activities. International Journal of Corpus Linguistics, 17, 483–516. Pérez-Paredes, P., Sánchez-Tornel, M., Alcaraz Calero, J. M., & Aguado-Jiménez, P. (2011). Tracking learners’ actual uses of corpora: Guided vs non-guided corpus consultation. Computer Assisted Language Learning, 24(3), 233–253. Pérez-Paredes, P., & Zapata-Ros, M. (2018). El pensamiento computacional, análisis de una competencia clave. CreateSpace Independent Publishing. Pérez-Paredes, P., & Zhang, D. (2022). Mobile assisted language learning: Scope, praxis and theory. Porta Linguarum, IV, 11–25. Perruchet, P., & Pacton, S. (2006). Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10(5), 233–238.
Poole, R. (2020). “Corpus can be tricky”: Revisiting teacher attitudes towards corpus-aided language learning and teaching. Computer Assisted Language Learning, 35(3), 1–22. Sha, G. (2010). Using Google as a super corpus to drive written language learning: A comparison with the British National Corpus. Computer Assisted Language Learning, 23(5), 377–393. Soyoof, A., Reynolds, B. L., Vazquez-Calvo, B., & McLay, K. (2023). Informal digital learning of English (IDLE): A scoping review of what has been done and a look towards what is to come. Computer Assisted Language Learning, 36(4), 608–640. Sun, Y.-C. (2007). Learner perceptions of a concordancing tool for academic writing. Computer Assisted Language Learning, 20(4), 323–343. Yus, F. (2021). Smartphone communication: Interactions in the app ecosystem. Routledge. Zainuddin, F. N., & Yunus, M. M. (2022). Sustaining formal and informal English language learning through Social Networking Sites (SNS): A systematic review (2018–2022). Sustainability, 14(17), 10852. Zhang, D., & Pérez-Paredes, P. (2021). Chinese postgraduate EFL learners’ self-directed use of mobile English learning resources. Computer Assisted Language Learning, 34(8), 1128–1153.

15.1 Exploring language learning through varied perspectives on corpus technology

Carolina Tavares de Carvalho

Language teaching has undoubtedly undergone significant transformations over the years, propelled by the integration of technology into the classroom environment. This shift has been particularly pronounced in Brazil, where the educational landscape witnessed a remarkable surge in technological adoption during the COVID-19 lockdown. Faced with unprecedented challenges, educators had to explore innovative alternatives to ensure seamless learning experiences for their students. This included, among other efforts, introducing educators to data-driven learning (DDL). By harnessing authentic language usage data, DDL offers a powerful tool for enhancing language instruction. The alignment of the seminar with the Brazilian educational ethos is evident in the country’s emphasis on student-centered education. In the past three years, significant efforts have been made to bolster teacher training programs. Additionally, the government has introduced guidelines promoting the integration of technology into teaching methods, and substantial investments have been made in providing tablets, computers, and internet access to all public-school students. However, it is important to note that these initiatives, although commendable, have not yet fully reached all students, and many educators still require additional training to effectively harness technology’s potential in the classroom. Brazil’s educational framework places a strong emphasis on digital literacy and the cultivation of skills such as critical thinking, adaptability, and self-direction among its students. Professor Pérez-Paredes’s chapter delves into the utilization of apps and websites within the context of DDL, shedding light on innovative methods for language learning.
Building upon the concepts illuminated in his chapter, I engaged in creating a vocabulary activity that interwove language pattern hunting with the quiz website Kahoot. Using Versatext, I introduced students to a movie review, with a specific focus on the word classes prevalent in this genre, notably adverbs and adjectives. The activity unfolded as a hands-off exercise: students read the review and employed a color-coded highlighting technique to identify and distinguish adverbs and adjectives. This process aimed not only to heighten their awareness of common linguistic elements in movie reviews but also to encourage them to recognize recurring word patterns occurring both before and after each highlighted word. This analytical approach fostered a more profound understanding of the syntactic and semantic roles played by these linguistic constructs. Furthermore, to reinforce their newfound knowledge, students were provided with a link to access a Kahoot quiz. This quiz, crafted around a different movie

review, challenged students to actively apply their understanding by inserting adverbs and adjectives in context. This practical exercise not only enhanced language acquisition but also nurtured critical thinking skills and promoted syntactic pattern recognition. This activity, based on movie reviews, exemplified the practical application of the insights gained from Pascual’s chapter. By leveraging language patterns present in authentic contexts, I fostered an engaging learning environment that not only enriched vocabulary but also honed analytical skills. As I continue to explore innovative methods including DDL, I am reminded that the synergy between traditional teaching philosophies and contemporary advancements holds the key to nurturing a generation of linguistically adept and technologically savvy learners. This has reaffirmed that education is a dynamic voyage, where adaptation and innovation propel us toward more effective and impactful teaching methodologies.
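The color-coding step described above can be mimicked in a few lines of code. The sketch below is illustrative only: it uses a crude "-ly" suffix heuristic for adverb candidates and a small invented adjective list rather than a real part-of-speech tagger, so it is a toy version of what a tool like Versatext does.

```python
# Toy sketch of the color-coded highlighting step: mark adverb candidates
# with *...* and listed adjectives with [...]. The suffix heuristic and the
# small adjective list are illustrative assumptions, not a real POS tagger
# (e.g., "family" would be wrongly flagged as an adverb).

ADJECTIVES = {"stunning", "predictable", "brilliant", "slow", "memorable"}

def highlight(text):
    """Wrap adverb candidates in *...* and listed adjectives in [...]."""
    marked = []
    for word in text.split():
        bare = word.strip(".,!?").lower()
        if bare.endswith("ly"):          # crude adverb heuristic
            marked.append(f"*{word}*")
        elif bare in ADJECTIVES:
            marked.append(f"[{word}]")
        else:
            marked.append(word)
    return " ".join(marked)

print(highlight("The visually stunning film moves slowly toward a predictable ending."))
# The *visually* [stunning] film moves *slowly* toward a [predictable] ending.
```

Once the highlighted text is on screen, learners can scan the words immediately before and after each marked item, which is exactly the pattern-noticing step the classroom activity targets.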

15.2 DDL in informal contexts: possible post-class challenges for learners of Aviation English

Daniela Terenzi

Pascual Pérez-Paredes’s chapter adeptly connects DDL practices with contemporary concerns about learners’ post-class attitudes and behaviours. It offers insights into computational thinking skills and technology utilization while also igniting new avenues for practice and research. With over 15 years of experience teaching English for Aviation Maintenance, I found data-driven learning (DDL) to be a potential panacea for the unique challenges I faced. However, this introduction to a (then) new methodology also brought forth a plethora of questions. Teachers of English for Specific Purposes (ESP) encounter a range of challenges. These include understanding learners’ needs, gaps, and desires (Hutchinson & Waters, 1987); grappling with genres and vocabulary outside their primary field of expertise, especially if their educational background is unrelated; and a shortage of specialized teaching resources. A prime example is teaching English for Aircraft Maintenance. Prospective technicians are expected to familiarize themselves with language pertaining to two principal certifications (Airframe and Powerplant) and various aircraft sections. DDL has been demonstrated to be effective for general language learning (Boulton & Cobb, 2017), and this holds true in ESP contexts. Given that an Aircraft Maintenance Technician’s primary requirement is to comprehend written instructions for diverse tasks, understanding key terms and prevalent linguistic patterns in documents such as maintenance manuals becomes crucial. DDL emerges as an apt methodology for this objective.

Within a structured learning environment, utilizing corpora under the guidance of an experienced teacher enables learners to discern consistent patterns in the data relevant to them (Boulton, 2010). Examples include distinguishing between “electric” and “electrical” as adjectives (as in “electric pump” and “electrical connectors”) or understanding the correct employment of verbs like “to do” and “to make” in specific contexts, such as “do a visual inspection” and “make a report”. This approach not only boosts students’ autonomy and engagement in ESP classes but also inspires students who are not linguistics majors to undertake research as an extracurricular activity. As these learners transition into their professional roles, DDL techniques may become even more instrumental. They will need to navigate countless written documents, understanding specific vocabulary and grammar, especially since maintenance encompasses varied tasks across different aircraft sections. However, informal learning presents challenges for researchers (O’Keeffe, 2021). Issues such as self-mediation, potential loss of oversight regarding students’ corpus usage, and the difficulty of assessing learners’ autonomy during self-guided technology use become paramount. These aspects are not only pivotal but also challenging to study, yet investigating them would advance DDL’s adoption in independent learning scenarios. Since mobile technology may enable learners to explore corpora anywhere and anytime, learning experiences can be autonomous, especially for those who no longer have a teacher to guide them, but they can also be demanding, frustrating, and complex. For this reason, research is undoubtedly needed. However, as in the context of teaching English for Aircraft Maintenance, the use of DDL in ESP courses can contribute significantly to the language learning process and lead to positive outcomes.
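The kind of pattern discernment described above, such as “do a visual inspection” versus “make a report”, can be sketched computationally. The four manual-style sentences below are invented stand-ins for a real maintenance-manual corpus; the counting logic is the sort of frequency evidence a concordancer surfaces for learners.

```python
# Sketch of checking "do" vs "make" preferences against a toy corpus.
# The sentences are invented for illustration; a real class would query
# actual maintenance manuals instead.

corpus = [
    "Do a visual inspection of the electric pump before each flight.",
    "Do a visual inspection of all electrical connectors.",
    "Make a report if any electrical connectors show corrosion.",
    "Make a report after you do a visual inspection.",
]

def collocation_counts(verbs, noun):
    """Count how often each verb occurs in the frame '<verb> a <noun>'."""
    text = " ".join(corpus).lower()
    return {verb: text.count(f"{verb} a {noun}") for verb in verbs}

print(collocation_counts(("do", "make"), "visual inspection"))  # {'do': 3, 'make': 0}
print(collocation_counts(("do", "make"), "report"))             # {'do': 0, 'make': 2}
```

The asymmetry in the counts is precisely what a learner consulting a corpus is meant to notice: the data, not the teacher, shows which verb each noun prefers.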

Reference list

Boulton, A. (2010). Data-driven learning: Taking the computer out of the equation. Language Learning, 60(3), 534–572. Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta-analysis. Language Learning, 67(2), 348–393. https://doi.org/10.1111/lang.12224 Hutchinson, T., & Waters, A. (1987). English for specific purposes: A learning-centred approach. Cambridge University Press. O’Keeffe, A. (2021). Data-driven learning, theories of learning and second language acquisition: In search of intersections. In P. Pérez-Paredes & G. Mark (Eds.), Beyond concordance lines: Corpora in language education (pp. 35–55). John Benjamins.

15.3 Reconsidering Pérez-Paredes’s informal learning contexts of DDL

Alejandro Curado Fuentes

In this chapter, I will be commenting on some of Pascual Pérez-Paredes’s ideas about the less investigated informal learning contexts of DDL. As a language instructor, I recognize the point put forward by Pascual about the need to expand DDL beyond the classroom. I can see how students’ use of mobile devices/apps may provide them with enriching DDL opportunities, or how simple and familiar interfaces for everyday internet searches may be used successfully for DDL. I agree that a pedagogic focus on DDL-related skills (e.g., noticing and discovery), rather than on the features of concordance interfaces, is a good strategy. The question is how to approach these skills so that students are motivated and empowered to use corpora. Pascual refers to the idea of user-generated queries, which entail a pedagogic objective of self-learning awareness. The key is that students eventually realize the benefits of DDL for their own personal learning practices and continue using DDL on their own. But how can we lead students towards this objective? In my own teaching experience, a suitable scenario for autonomous DDL can be ESP (English for Specific Purposes) courses that integrate authentic reading material. In line with a communicative focus on learners’ needs in their target academic/professional communities (Belcher, 2006), I think that DDL can be integrated if students appreciate and recognize it as an approach that fits in with their perceived learning needs. The process is not easy, but having learners critically reflect on corpus consultation as a potential ally for their academic/professional advancement can be a significant step. According to Pascual, critical thinking, seldom explored in DDL, can aid learners’ work with corpora. I could not agree more. A key aspect here, in my view, is that tasks are scaffolded in a way that adapts to learners’ expectations.
In fact, ESP students are often put off by web-based corpus interfaces and/or text output that fail to meet these expectations (Curado Fuentes, 2023). As task mediators, teachers can therefore prioritize concepts rather than computer/digital tools, so that learners grasp the notion first and then move on to experiment with tools that suit these notions. An example can be the exploration of keywords in texts related to learners’ studies/professions. I have found it effective to introduce the notion of keywords first by discussing why/how keywords are salient in these specialized texts, and then to let students manually locate keywords in specific texts (e.g., in business news) in pairs/groups. Learners can then feel motivated to investigate corpus-driven keywords in ad-hoc texts, comparing keywords and sharing patterns. It is

important to choose a simple, easy-to-understand tool for keyword search and analysis (for example, Versatext in this case). This approach can contribute to incorporating DDL within the scope of learners’ targeted digital literacies.

Reference list

Belcher, D. (2006). English for specific purposes: Teaching to perceived needs and imagined futures in worlds of work, study, and everyday life. TESOL Quarterly, 40, 133–156. https://doi.org/10.2307/40264514 Curado Fuentes, A. (2023). Corpus affordances in foreign language reading comprehension. In K. Harrington & P. Ronan (Eds.), Demystifying corpus linguistics for English language teaching (pp. 99–119). Palgrave Macmillan. https://doi.org/10.1007/978-3-031-11220-1_6

16 USING A DATA-DRIVEN LEARNING APPROACH TO TEACH ENGLISH FOR ACADEMIC PURPOSES AND IMPROVE HIGHER EDUCATION INTERNATIONALISATION

Paula Tavares Pinto

1. Introduction

In this chapter, I discuss the use of corpora and a data-driven learning (DDL) approach in English for Academic Purposes (EAP) classes to support internationalisation in reading comprehension classes. Corpora are extensive collections of electronic texts that learners can interact with and analyse to gain insights into language usage patterns. By utilising corpora, learners are encouraged to engage in data observation, develop hypotheses, establish rules based on linguistic patterns (inductive approach), and validate the grammatical rules presented in textbooks (deductive approach). This educational perspective is known as data-driven learning and has gained significant recognition in computer-assisted language learning (CALL) research, as evidenced by studies conducted by Boulton (2010), Crosthwaite (2020), Frankenberg-Garcia et al. (2022), Pérez-Paredes (2020), and others. Strategic plans for internationalisation have frequently been used as guidelines for the expansion of higher education institutions in many countries. According to De Wit et al. (2015), internationalisation is

the process of integrating an international, intercultural or global dimension into the purpose, functions and delivery of post-secondary education, in order to enhance the quality of education and research for all students and staff and to make a meaningful contribution to society. (p. 29)

Part of these strategic plans has been the development of linguistic skills for effective communication, research collaboration, and embracing digital advancements.

DOI: 10.4324/9781003413301-16

Although many universities have already built various international partnerships, it is crucial to recognise the significance of foreign languages, especially English as a lingua franca, in actively engaging students and academics in international research. Some recent actions aimed at internationalisation have been reported at a multi-campus institution in the state of São Paulo in order to improve students’ and staff’s English proficiency level (Pinto et al., 2021). In this case, the kind of English required is, most of the time, what we call English for Academic Purposes (EAP), which differs from general English in terms of vocabulary usage, sentence complexity, and discourse organisation (Hyland & Shaw, 2016; Charles, 2018; Frankenberg-Garcia et al., 2022). In Brazil, several efforts have been made towards addressing the specific EAP needs of Brazilian university students. One of the best-known actions nationwide is the Languages without Borders (LwB) program (Abreu e Lima et al., 2016), which is “an impressively extensive initiative aimed at bridging the gap between the general English taught in schools and language institutes and the EAP skills required by Brazilians to study at English-medium universities” (Frankenberg-Garcia et al., 2022). In support of the LwB program, several initiatives have addressed the EAP needs of Brazilian students. For example, Goulart da Silva (2017) compiled an electronic corpus, known as the BrAWE corpus, consisting of 380 essays written by Brazilian students at British universities, which was used to investigate their writing regarding EAP standards. Matte and Sarmento (2018) analysed the BrAWE corpus and observed that Brazilian university students tend to overuse additive connectors. On the other hand, Tavares Pinto et al. (2021) compiled a 906,035-word corpus of research papers written by Brazilian authors and found that there are more underused academic collocations similar to Portuguese than overused ones.
Such studies contribute to the production of EAP materials tailored to the Brazilian context. In this sense, although we find plenty of academic resources in EAP textbooks (e.g. Swales & Feak, 2012), there is still a major demand for pedagogical materials on EAP that are specially designed for EAP users from specific first-language backgrounds, such as Brazilian students who speak Portuguese. I will now discuss how a specialised corpus constructed from research papers on the UN’s Sustainable Development Goals has been used as a source for several EAP teaching activities. I will also show how an open-source corpus can be used to expand the topics discussed in EAP classes in higher education institutions.

2. The UN’s Sustainable Development Goals

The United Nations presented the Sustainable Development Goals (SDGs) in 2015, and they have been used as guidelines for several countries with the aim to help governments deal with global challenges and promote a sustainable future for the

entire planet by the year 2030. These goals consist of 17 interconnected themes, including poverty eradication, climate action, and peace and justice. Since the beginning, researchers worldwide have focused their studies on finding solutions to SDG-related issues. This research mobilisation is part of a decade-long effort involving global, local, and individual actions, including academia and other stakeholders, to drive the necessary transformations. Several major world events have recently emphasised the importance of following the 17 SDGs, especially the COVID-19 pandemic and, more recently, the war between Russia and Ukraine. The crisis brought to light the disparities in the distribution of resources and catalysed the race to develop a vaccine against COVID-19. Consequently, a plethora of scientific publications and studies have emerged across diverse disciplines such as health, sociology, economics, chemistry, mathematics, and linguistics, aiming to address these issues. English has become the predominant language for scientific research (Jenkins, 2014), requiring researchers to produce and present their work in English for publication in top international journals, presentation at global conferences, and even when applying for research funding. Following several studies on corpora and EAP in Brazil (Dayrell & Aluísio, 2008; Frankenberg-Garcia et al., 2022; Pinto et al., 2023; among others), our aim in this chapter was to analyse keywords from a collection of research papers to be used as teaching material for undergraduate and graduate students.

3. Development

This section presents the methodological steps taken to: (i) compile a specialised corpus focusing on the SDGs; (ii) elaborate a hands-off activity based on keywords in context; and (iii) elaborate a hands-on activity based on keywords with CorpusMate.

3.1 Compiling an SDG corpus

In order to compile our corpus we used AntCorGen (Anthony, 2019), a tool that quickly compiles specialised corpora from research papers published by PLOS, a non-profit, open-access multidisciplinary publisher. For this study, we wanted to analyse articles that discussed the SDGs in all areas. To do so, we entered the term “Sustainable Development Goal” as a query word in AntCorGen and set the tool to extract 400 articles from PLOS. Since we wanted to have mostly written material, we selected the articles’ abstracts, introductions, materials & methods, results & discussions, and conclusions, as we can see in Figure 16.1. As a result, we compiled the SDG corpus, which contains 2,037,311 types. This corpus was uploaded to Sketch Engine (Kilgarriff et al., 2014) in order to be used as source material for EAP activities.

FIGURE 16.1 Screenshot of AntCorGen

3.2 Exploring the SDG corpus

The next step was generating a list of keywords, that is, items which are statistically more relevant in a specialised corpus than in a corpus of general English, in this case the English Web 2020 corpus (enTenTen20). We selected the list of multiword terms, as shown in Figure 16.2.

FIGURE 16.2 Screenshot of multiword terms of SDGs in Sketch Engine¹

As shown in Figure 16.2, the terms in the list reflect the topics discussed in the research papers regarding the SDGs, such as “maternal health”, “fishing mortality”, “food insecurity”, and “global health”. By clicking on a term, the user can access its concordance lines, that is, segments from different texts in which the term is inserted; this makes it possible to read different contexts and get a general idea of how the term was discussed in different articles by different authors. After examining these contexts, we selected the term “ocean health”, related to SDG 14 “Life below Water”, to be used in one of the EAP classes as a reading comprehension exercise. In Figure 16.3 we can see some concordance lines with “ocean health” as the main term.
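To make “statistically more relevant” concrete, the sketch below computes a keyness score in the spirit of Sketch Engine’s “simple maths” measure: frequencies per million words in the focus corpus are compared against a reference corpus, with a smoothing constant. All frequencies and corpus sizes here are invented for illustration.

```python
# Sketch of a keyness calculation: terms frequent in the focus (SDG) corpus
# but rare in the reference corpus score high; evenly distributed words
# score near 1. All counts below are invented for illustration.

def keyness(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """Smoothed ratio of frequencies per million words (focus vs reference)."""
    fpm_focus = freq_focus / size_focus * 1_000_000
    fpm_ref = freq_ref / size_ref * 1_000_000
    return (fpm_focus + n) / (fpm_ref + n)

# invented counts: (occurrences in SDG corpus, occurrences in reference corpus)
terms = {
    "maternal health": (420, 1500),
    "food insecurity": (310, 900),
    "the": (120_000, 2_100_000_000),
}
SDG_SIZE = 2_000_000          # tokens in the focus corpus (illustrative)
REF_SIZE = 38_000_000_000     # tokens in the reference corpus (illustrative)

for term, (f_focus, f_ref) in sorted(
        terms.items(),
        key=lambda kv: keyness(kv[1][0], SDG_SIZE, kv[1][1], REF_SIZE),
        reverse=True):
    print(f"{term}: {keyness(f_focus, SDG_SIZE, f_ref, REF_SIZE):.1f}")
```

Specialised multiword terms rise to the top of such a ranking, while function words like “the”, which are common everywhere, score close to 1 and drop out, which is why the keyword list in Figure 16.2 is dominated by topical terms.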

FIGURE 16.3 Concordance lines with the search term “ocean health”

We observed that one of the texts referred to ocean health in Brazil, which we thought would be an interesting topic to discuss with our Brazilian undergraduate and graduate students. By reading a set of concordance lines, besides practising reading strategies, students would be able to observe the language used in authentic research papers and would become more familiar with this academic genre.

3.3 Producing a “hands-off” activity for an EAP class

In Sketch Engine, the concordance lines can be expanded so that the user can read a sentence, a paragraph, or even the whole text if they wish to. This possibility has been explored by different researchers in our research group, since we can teach functional words, content words, and terms in context and even analyse the discourse used by different authors. One example of an expanded concordance line can be observed in Figure 16.4, which shows what was discussed in one article from the SDG corpus. After reading the excerpt in Figure 16.4, we generated some questions around the term “ocean health”, previously selected from the keywords list, to be used in a reading comprehension class. The activity shows students how they can extract information from the text by reading the context in which the keyword in focus occurs. In Figure 16.5 we see five questions that guide the reading comprehension work on the keyword.
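Concordance lines of this kind are key-word-in-context (KWIC) displays, and the underlying mechanism can be sketched in a few lines. The sample sentences below are invented stand-ins for the SDG corpus; the window width and formatting are arbitrary choices.

```python
# Minimal key-word-in-context (KWIC) sketch: produce concordance lines for a
# search term, with the term centred between its left and right contexts.
# The sample sentences are invented stand-ins for the SDG corpus.
import re

texts = [
    "Monitoring ocean health is central to SDG 14, Life below Water.",
    "Coastal communities in Brazil depend on ocean health for food security.",
    "Indicators of ocean health include biodiversity and water quality.",
]

def kwic(term, docs, width=30):
    """Return one concordance line per occurrence of term in docs."""
    lines = []
    for doc in docs:
        for m in re.finditer(re.escape(term), doc, re.IGNORECASE):
            left = doc[max(0, m.start() - width):m.start()]
            right = doc[m.end():m.end() + width]
            lines.append(f"{left:>{width}} | {m.group():^12} | {right}")
    return lines

for line in kwic("ocean health", texts):
    print(line)
```

Expanding a concordance line, as described above, simply means widening the context window, up to the whole document, around the same match position.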

FIGURE 16.4 Extended concordance line with the search term “ocean health”

FIGURE 16.5 Questions taken from an expanded concordance line

FIGURE 16.6 Extended concordance line with the search term “ocean health” and the answer to the guided questions previously presented

The first four questions will be answered by reading the text, as shown in Figure 16.6. The last question will be answered by students based on their own reality, and they will be able to use the new vocabulary they have just encountered. The previous activity is considered a “hands-off” DDL exercise since it can be created by EAP teachers and taken to class as a handout or projected on a screen. In this case, the corpus was pre-compiled by a researcher; however, activities can also be created with students using online corpora, e.g. CorpusMate (Crosthwaite & Baisa, 2024), described in the next section.

3.4 Producing a “hands-on” activity for an EAP class

One way to encourage students to learn more about the topic they are reading is to have them access an online corpus tool and read concordance lines themselves. If it is the first time students have access to corpus information displayed in concordance lines, it is important that the EAP teacher guide them so that they can explore the online platform in various ways, as we can see in Figure 16.7, in which a student can choose to access either written or oral language. CorpusMate (50,298,976 tokens) is an open platform specially created for disciplinary learners of English as a Second Language (ESL), which contains written and oral texts from 20 different areas, such as law, health and medicine, chemistry, business, and economics, among others. It can be used by undergraduate and graduate students in EAP classes since its content is carefully chosen from reliable sources such as BAWE, TED Talks, Simple English Wikipedia, BBC Teach, Elsevier, and BNC 2014 Spoken.

FIGURE 16.7 The CorpusMate platform

In Figure 16.8 we can see concordance lines with the combinations of “ocean” + word. EAP teachers can encourage students to point out new information about a similar topic they have been discussing based on the reading of expanded concordance lines. In terms of grammar and syntax, students will be able to contrast the structure of their own language with English. In this case, they could notice that “ocean” as a modifier of “tsunami” and “seaquake” occupies a different position than in Portuguese, for example, in which we would say “Terremoto Oceânico” or “Tsunami Oceânico”.

FIGURE 16.8 Concordance lines with the search word “ocean” + word

In terms of discourse, reading concordance lines will give students the chance to compare what they know about “ocean health” with what has been discussed in different contexts on the same theme, as shown in Figure 16.9 on “ocean pollution”.

FIGURE 16.9 Concordance lines with the search term “ocean pollution”

The definition of “ocean pollution” may seem simple to a native speaker of English; however, considering that we were focusing on students who had a Basic (A2) to Low-Intermediate (B1) level of English, by reading this definition a student will learn how the lexicon involving that term is used. From this point on, he/she may feel more confident to reproduce it in other texts such as abstracts or research reports, for example. As an element of surprise, students may also feel encouraged to read more about unfamiliar usages, such as the term “Ocean Master”, in Figure 16.10. By reading the excerpt, we learn about Aquaman’s half-brother, the “Ocean Master”, and the other characters of the story, as well as the actors and actresses who are part of the movie. Although some might consider this text not useful in an EAP class, students who have watched the movie will enjoy reading about it and might

FIGURE 16.10 Concordance lines with the search term “ocean master”

learn new vocabulary and sentence structure in a different genre, such as a movie review.

Final considerations

Data-driven learning (DDL) is a valuable approach in language education that benefits both EAP instructors and students. Combined with the SDGs, DDL supports personalised learning, enables access to a wide range of language resources, and prepares students for the demands of a globalised world, which is of great importance in the Brazilian context as it increases our students’ competitiveness in the global economy. By emphasising the advantages of DDL in this way, English instructors in Brazil can encourage their students to develop the necessary language skills while promoting a more engaging and effective learning environment. The activities presented in this chapter have been inspired by our research group, whose investigations aim at developing teaching activities in different contexts, including other higher education institutions in Brazil. Some of the results from these studies can be seen in a recent publication entitled “Using language data to learn about language: A teachers’ guide to classroom corpus use” (Pinto et al., 2023), where the user will find DDL teaching and learning activities both for classes with technology and internet access (hands-on activities) and for traditional classes without such resources (hands-off activities). The activities have been used in English and Spanish classes in Brazil and have also been tested by language instructors in different countries. So far, we have been able to observe which DDL tools are more effective in secondary education and which ones work best for higher education students. More studies have been conducted focusing on teacher training in the use of these tools and on how best to introduce a DDL approach into different curricula.

Acknowledgements

The author would like to acknowledge the research funding from CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico (nº 307287/2021–1PQ2) and The São Paulo Research Foundation, FAPESP (nº 2022/05908–0).

Note

1 https://app.sketchengine.eu/

Reference list

Abreu e Lima, D., Moraes Filho, W., Barbosa, W., & Blum, A. (2016). O programa Inglês sem Fronteiras e a política de incentivo à internacionalização do ensino superior brasileiro. In S. Sarmento, D. Abreu e Lima & W. Moraes Filho (Eds.), Do Inglês sem Fronteiras ao Idioma sem Fronteiras. Editora UFMG. Anthony, L. (2019). AntCorGen (Version 1.1.2). Waseda University. www.laurenceanthony.net/software/antcorgen/ Boulton, A. (2010). Data-driven learning: Taking the computer out of the equation. Language Learning, 60(3), 534–572. Charles, M. (2018, June 8). Starting out with DDL: How do students use their own do-it-yourself corpus? Keynote presentation at the 2nd BAAL corpus linguistics SIG event – New directions in DDL, Centre for Academic Writing, Coventry University. Crosthwaite, P. (2020). Taking DDL online: Designing, implementing and evaluating a SPOC on data-driven learning for tertiary L2 writing. Australian Review of Applied Linguistics, 43(2), 169–195. Crosthwaite, P., & Baisa, V. (2024). Introducing a new, child-friendly corpus tool for data-driven learning: CorpusMate. International Journal of Corpus Linguistics, online ahead of print. Dayrell, C., & Aluísio, S. M. (2008). Using a comparable corpus to investigate lexical patterning in English abstracts written by non-native speakers. LREC 2008 Workshop on Comparable Corpora, 61–66. De Wit, H., Hunter, F., Howard, L., & Egron-Polak, E. (2015). Internationalisation of higher education. European Parliament, Directorate-General for Internal Policies, Policy Department B: Structural and Cohesion Policies, Culture and Education. http://www.europarl.europa.eu/RegData/etudes/STUD/2015/540370/IPOL_STU(2015)540370_EN.pdf Frankenberg-Garcia, A., Pinto, P. T., Bocorny, A. E. P., & Sarmento, S. (2022). Corpus-aided EAP writing workshops to support international scholarly publication. Applied Corpus Linguistics, 2(3). Goulart da Silva, L. (2017). Compilation of a Brazilian academic written English corpus. E-Scrita, 8(2), 32–47. Hyland, K., & Shaw, P. (2016). The Routledge handbook of English for academic purposes. Routledge. Jenkins, J. (2014). English as a lingua franca in the international university. Routledge. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1, 7–36. Matte, M., & Sarmento, S. (2018). A corpus-based study of connectors in student academic writing. English for Specific Purposes World, 55(20), 1–21. Pérez-Paredes, P. (2020). Corpus linguistics for education: A guide for research. Routledge. Pinto, P. T., Crosthwaite, P., de Carvalho, C. T., Spinelli, F., Serpa, T., Garcia, W., & Ottaiano, A. O. (Eds.). (2023). Using language data to learn about language: A teachers’ guide to classroom corpus use. University of Queensland. https://doi.org/10.14264/3bbe92d Pinto, P. T., de Moraes Garcia, D. N., & dos Santos, D. C. (2021). Ações de internacionalização para o ensino e a pesquisa na área de línguas. Fórum Linguístico, 18(1), 5618–5630. Swales, J., & Feak, C. (2012). Academic writing for graduate students: Essential skills and tasks. University of Michigan Press.


16.1 Enhancing environmental awareness in the classroom through data-driven learning

Handoko

I was impressed by Paula Tavares Pinto’s chapter on “Bringing the UN’s Sustainable Development Goals and data-driven learning to the classroom”. It convinced me to use the same approach with my English-speaking class, and I have begun experimenting with data-driven language learning based on environmental sustainability. At a time when global issues such as climate change, poverty, and inequality continue to pose grave threats, connecting our education to these crucial issues is not merely a matter of choice, but a moral necessity (United Nations Development Programme, 2007). By integrating the Sustainable Development Goals (SDGs) into the classroom, students are able to comprehend the urgency of these problems and recognize their role in developing sustainable solutions (Vasconcelos et al., 2022). Using a data-driven approach, students may learn English in a more engaging and efficient manner (Zare & Aqajani Delavar, 2022). Teachers can better accommodate their students’ diverse learning styles and demands when they use data to guide their lessons. Working with corpus data and collocations in the context of environmental issues supports students’ English learning. Students are exposed to the authentic use of language in context by interacting with materials such as scientific reports, news articles, and multimedia resources related to environmental issues. This exposure helps students improve their English skills, including reading, listening, speaking, and writing, while discussing environmental issues. Furthermore, when students work with real data, statistics, and case studies, they develop a holistic comprehension of the world’s problems (Kozhevnikova, 2014). This teaching method helps students develop empathy and a sense of urgency, encouraging them to participate in protecting the planet.
Discussions concerning deforestation, carbon emissions, plastic pollution, and biodiversity loss are more than just language exercises; they also serve as a catalyst for serious dialogue and action. Data-driven language learning on environmental sustainability makes learning relevant to students’ lives and fosters civic responsibility. In my experience of implementing data-driven language learning in English-speaking lessons, especially in the context of environmental issues, students respond positively in terms of both language skills and environmental awareness. They are able to identify and use various linguistic terms related to environmental issues. In speaking practice, they are also able to formulate data and facts and present them orally. They can also express their empathy and awareness of various environmental problems in their speech. Data-driven language learning based on environmental sustainability, in my opinion, is a powerful technique that can help students learn English more


effectively and become more engaged in environmental issues. In line with the UN’s Sustainable Development Goals, as students improve in their language learning, they also contribute to a more sustainable and compassionate society. This unique approach makes education a powerful resource to enhance human and world futures.

Reference list

Kozhevnikova, E. (2014). Exposing students to authentic materials as a way to increase students’ language proficiency and cultural awareness. Procedia – Social and Behavioral Sciences, 116, 4462–4466. https://doi.org/10.1016/j.sbspro.2014.01.967
United Nations Development Programme. (2007). Fighting climate change: Human solidarity in a divided world. In Human development report 2007/2008 (pp. 1–18). Palgrave Macmillan. https://doi.org/10.1057/9780230598508_1
Vasconcelos, C., Silva, J., Calheiros, C. S. C., Mikusiński, G., Iwińska, K., Skaltsa, I. G., & Krakowska, K. (2022). Teaching sustainable development goals to university students: A cross-country case-based study. Sustainability, 14(3), 1593. https://doi.org/10.3390/su14031593
Zare, J., & Aqajani Delavar, K. (2022). Enhancing English learning materials with data-driven learning: A mixed-methods study of task motivation. Journal of Multilingual and Multicultural Development, 1–17. https://doi.org/10.1080/01434632.2022.2134881

16.2 UN Sustainable Development Goals in a data-driven transnational learning environment of TEFL students

Nataliia Lazebna

This discussion focuses on the UN Sustainable Development Goals (SDGs) in a data-driven transnational learning environment for TEFL students. Following the approaches of Dr. Paula Tavares Pinto concerning the dissemination of the UN SDGs through in-class hands-on and hands-off activities among secondary students, related practices were implemented during hybrid-format sessions for TEFL students from Julius-Maximilians University of Würzburg (Germany) and Zaporizhzhia National Polytechnic University (Ukraine). The digital meta-framework of learning represents a modern and powerful toolset for teachers, learners, researchers, and all global citizens (Kester, 2023). TEFL teachers are mediators of the global issues often considered in a modern hybrid-format learning environment, while TEFL students are often involved in data-driven learning (DDL) contexts with critical and creative backgrounds. Group


work of our students focused on the implementation of the UN SDGs while learning thematically related collocations and colligations. Students could identify speech patterns in the context of the UN SDGs (Biermann et al., 2017). For example, the interdisciplinary use of Voyant Tools (https://voyanttools.org/) to analyse UN SDGs texts automatically enables learners to highlight keywords and phrases within a specific discursive environment (Miller, 2018). Also, by building concordances and “playing” with intertextuality using the Brazilian Academic Corpus of English (BrACE), the learners delved into the data and found lexical triggers for further semantic analysis in combination with the UN SDGs and the parts of speech used within the defined thematic context. Following these practices, during hybrid-format sessions, TEFL students from Germany and Ukraine used the Sketch Engine and COCA text corpora to investigate the contexts of keywords and find the meaning of new words in texts thematically related to the UN SDGs. Further, in groups, the students drew their associations on an interactive whiteboard (https://lucidpark.com) in the form of conceptual mapping and different types of graphic mediation. Multimodal representations of students’ term definitions revealed their creative potential, intensified collaboration, and individual information processing. The hyperlinking dynamics of specific scientific terminology corpora foster both the research and technical competence of language learners. Following the contextual representation of keywords found in the studied texts, the students developed their critical thinking and found background information on the UN SDGs. The linguistic competence of TEFL students is enriched by info-mining competence. TEFL students practiced automatic collocation modification and text analysis. Hybrid sessions involved TEFL students from Germany and Ukraine in diverse digital modes of human–machine communication with a focus on the UN SDGs.
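The keyword-highlighting step described above can also be sketched offline. Below is a minimal, illustrative Python sketch (not Voyant Tools itself) that ranks target-corpus words by Dunning’s log-likelihood against a reference corpus; the toy corpora and the function names are invented for illustration only.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Dunning's G2 for a word occurring a times in a target corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(target_tokens, reference_tokens, top_n=5):
    """Rank target-corpus words by their keyness against the reference."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(t.values()), sum(r.values())
    scored = {w: log_likelihood(t[w], r.get(w, 0), c, d) for w in t}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

# Toy corpora: an SDG-themed text vs. a general reference sample.
target = ("sustainable development goals climate action climate "
          "energy poverty climate sustainable").split()
reference = ("the cat sat on the mat and the dog slept on the rug "
             "development of the plot was slow").split()
print(keywords(target, reference, top_n=3))
```

Words frequent in the target but rare in the reference (here ‘climate’, ‘sustainable’) rise to the top, while words shared by both corpora (here ‘development’) are down-weighted.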
During the following seminars, we intend to apply data mapping and distant reading practices, thus integrating digital humanities approaches during TEFL classes to foster transcultural, digital, and linguistic competences of students. Moving from UN SDGs text and keyword analysis, a future focus is on humanities data surrounding TEFL students, for example, geospatial placement and analysis of toponyms from English-language literature, authentic narrative mapping, geocoded storytelling, and other research activities. Reference list Biermann, F., Kanie, N., & Kim, R. E. (2017). Global governance by goal-setting: The novel approach of the UN sustainable development goals. Current Opinion in Environmental Sustainability, 26, 26–31. Kester, K. (2023). Global citizenship education and peace education: Toward a postcritical praxis. Educational Philosophy and Theory, 55(1), 45–56. Miller, A. (2018). Text mining digital humanities projects: Assessing content analysis capabilities of voyant tools. Journal of Web Librarianship, 12(3), 169–197.

17 EXPLORING THE INTERFACE BETWEEN CORPUS LINGUISTICS AND ENGLISH FOR ACADEMIC PURPOSES
What, how and why?

Vander Viana

What has Corpus Linguistics contributed to English for Academic Purposes?

Corpus Linguistics (CL) has been instrumental in the compilation of academic vocabulary lists such as Coxhead’s (2000) academic word list. This list contains a total of 570 word families, that is, the root of a word and its related forms (e.g., the word family for ‘constitute’ consists of ‘constituencies,’ ‘constituency,’ ‘constituent,’ ‘constituents,’ ‘constituted,’ ‘constitutes,’ ‘constituting,’ ‘constitution,’ ‘constitutions,’ ‘constitutional,’ ‘constitutionally,’ ‘constitutive’ and ‘unconstitutional’). These word families are divided into ten sublists. Table 17.1 illustrates the 60 word families included in Sublist 1.

TABLE 17.1 Word families in Sublist 1 in Coxhead’s (2000) academic word list

#   Word family    #   Word family    #   Word family
1   analyse        21  establish      41  occur
2   approach       22  estimate       42  percent
3   area           23  evident        43  period
4   assess         24  export         44  policy
5   assume         25  factor         45  principle
6   authority      26  finance        46  proceed
7   available      27  formula        47  process
8   benefit        28  function       48  require
9   concept        29  identify       49  research
10  consist        30  income         50  respond
11  constitute     31  indicate       51  role
12  context        32  individual     52  section
13  contract       33  interpret      53  sector
14  create         34  involve        54  significant
15  data           35  issue          55  similar
16  define         36  labor          56  source
17  derive         37  legal          57  specific
18  distribute     38  legislate      58  structure
19  economy        39  major          59  theory
20  environment    40  method         60  vary

DOI: 10.4324/9781003413301-17


Coxhead’s (2000) academic word list has been influential in the field of English for Academic Purposes (EAP) over the years and has been used as the basis for several teaching materials (cf. Coxhead, 2011; see also Ackermann & Chen, 2013).
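The word-family idea can be made concrete with a few lines of code. Here is a minimal Python sketch using only the ‘constitute’ family quoted above; the function name and the sample sentence are invented for illustration.

```python
# The 'constitute' family as listed in Coxhead (2000), quoted in the text.
CONSTITUTE_FAMILY = {
    "constitute", "constituencies", "constituency", "constituent",
    "constituents", "constituted", "constitutes", "constituting",
    "constitution", "constitutions", "constitutional",
    "constitutionally", "constitutive", "unconstitutional",
}

# Map each related form back to its headword (here, one family only;
# a full implementation would cover all 570 families).
FORM_TO_HEADWORD = {form: "constitute" for form in CONSTITUTE_FAMILY}

def family_hits(text):
    """Return the word-family headwords matched by the tokens of `text`."""
    tokens = text.lower().split()
    return [FORM_TO_HEADWORD[t] for t in tokens if t in FORM_TO_HEADWORD]

sentence = "The new constitution was ruled unconstitutional by constituents"
print(family_hits(sentence))  # ['constitute', 'constitute', 'constitute']
```

Counting hits against all families in this way is, in essence, how coverage of a vocabulary list in a text is measured.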

Case Study A How would you regard the use of ‘inconspicuous’ in a doctoral thesis (cf. the example provided next)? Is such a word characteristically employed in academic discourse or is it more frequently found in another type of discourse? “As usual, I was sitting in the back corner of the room trying to remain inconspicuous.” (Gaynor, 2018, p. 115)
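One empirical way to approach a question like Case Study A is to compare the word’s normalized frequency across registers. The following is a minimal Python sketch with invented raw counts (not the actual COCA figures), showing the per-million-words calculation that makes registers of different sizes comparable.

```python
# Hypothetical raw counts and register sizes in tokens; illustrative
# only, not real corpus figures.
registers = {
    "fiction":  {"hits": 420, "size": 120_000_000},
    "magazine": {"hits": 310, "size": 127_000_000},
    "academic": {"hits": 35,  "size": 121_000_000},
}

def per_million(hits, size):
    """Normalized frequency: occurrences per million words."""
    return 1_000_000 * hits / size

# Rank registers by normalized frequency, highest first.
for name, r in sorted(registers.items(),
                      key=lambda kv: per_million(kv[1]["hits"],
                                                 kv[1]["size"]),
                      reverse=True):
    print(f"{name:9s} {per_million(r['hits'], r['size']):.2f} per million")
```

Normalization matters because raw counts from registers of unequal size cannot be compared directly.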

However, the available literature on EAP has questioned the usefulness of general academic vocabulary lists. Hyland and Tse (2007, pp. 249–250), for instance, note that

because academic knowledge is embedded in processes of argument and consensus-making, it will always be particular to specific disciplines and the socially agreed on ways of discussing problems in those disciplines. The fact that writing actually helps to create disciplines, rather than being just another aspect of what goes on in them, is a serious challenge to identifying uniformities in academic language use that apply in the same ways to all disciplines.

To address the issue with general academic vocabulary lists, several discipline-specific vocabulary lists have been compiled. These lists consider the ways through which members of a given disciplinary community communicate their research findings within said communities. Some of the disciplines which have been investigated in this manner include engineering (Todd, 2017), finance (Tongpoon-Patanasorn, 2018) and medicine (Lei & Liu, 2016). Discipline-specific vocabulary lists point out that academic knowledge dissemination is very much shaped by discipline-specific ontologies and epistemologies. For example, Table 17.2 shows the differences in the top ten words used in Applied Linguistics (Khani & Tazik, 2013, p. 216) and in Chemistry (Valipouri & Nassaji, 2013, p. 254). The table reveals that Applied Linguistics and Chemistry draw on a completely divergent set of concepts: ‘text’ and ‘discourse’ vs. ‘spectrum’ and ‘temperature,’ respectively. This is unsurprising given the nature of these two


TABLE 17.2 Top Ten Words in Applied Linguistics and Chemistry

#    Applied Linguistics   Chemistry
1    ‘research’            USE
2    ‘text’                SHOW
3    ‘data’                REACT
4    ‘task’                RESULT
5    ‘participation’       SOLVE
6    ‘discourse’           SPECTRUM
7    ‘items’               CAN
8    ‘context’             FORM
9    ‘texts’               TEMPERATURE
10   ‘classroom’           HIGH

disciplines. However, even a simple abridged frequency list already uncovers the different ways in which knowledge is conveyed in these two disciplines. In addition to words which are primarily used as nouns, the list of top ten words in Chemistry contains words which are either verbs (e.g., ‘react’ and ‘solve’) or which can function as verbs (e.g., ‘use,’ ‘show,’ ‘result,’ ‘can’ and ‘form’). It is equally relevant to examine disciplinary variation across neighboring disciplines. An example is Viana’s (2012) investigation of doctoral theses in the fields of English Language and English Literature, two disciplines which generally belong to the same administrative academic division (e.g., ‘School of English’ or ‘School of Languages’). The analysis reveals a more patterned prose in the theses dedicated to the study of English Language. These theses are characterized by their methodological drive and their use of textual reference. On the other hand, theses in English Literature employ a more varied lexicon, foreground personal experiences and contain numerous instances of personal reference.
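An abridged frequency list of the kind just mentioned can be produced with a few lines of code. The following minimal Python sketch runs over invented discipline samples; the stopword list and texts are illustrative only.

```python
from collections import Counter

# A deliberately tiny stopword list for the toy examples below.
STOPWORDS = {"the", "of", "a", "in", "and", "to", "is", "on"}

def top_words(text, n=3):
    """Abridged frequency list: the n most frequent content words."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [w for w, _ in Counter(tokens).most_common(n)]

applied_linguistics = ("the text and the discourse in the text show "
                       "discourse features of the classroom text")
chemistry = ("the spectrum of the compound and the temperature of the "
             "reaction affect the spectrum and the temperature")

print(top_words(applied_linguistics))
print(top_words(chemistry))
```

Even at this toy scale, the two lists diverge in the same way Table 17.2 does for real disciplinary corpora.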

Case Study B If you were the EAP professional supporting the writer of a doctoral thesis in the field of language education, what advice would you give in relation to the use of ‘hinterland’ in the following excerpt? “Context here covers a multitude of variables, from the socio-economic hinterland surrounding each school, to the English language proficiency of individual teachers.” (Gaynor, 2018, p. 104)
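Questions such as Case Study B are usually answered by reading concordance lines for the word in question. Here is a minimal Python sketch of a KWIC (Key Word In Context) display over an invented sample sentence; corpus interfaces implement the same idea at scale.

```python
def kwic(tokens, node, width=3):
    """Key Word In Context: show `width` tokens of co-text per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30} | {tok} | {right}")
    return lines

# Invented sentence echoing the kinds of uses found in the BNC.
text = ("Trade moved through the Balkan hinterland before the ports of "
        "the Dalmatian hinterland were closed").split()
for line in kwic(text, "hinterland"):
    print(line)
```

Aligning the node word in a column, as concordancers do, makes recurrent left- and right-hand patterns (such as preceding adjectives) easy to spot.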


TABLE 17.3 Comparison Between the Academic Word List and the Academic Collocation List

Academic Word List    Academic Collocation List
available             become, currently, freely, make,
  availability          publicly, readily, widely + available
  unavailable         available + data, evidence,
                        information, resources

Another limitation of vocabulary lists is that they generally focus on individual words. They seem to overlook the fact that much of our language production consists of multi-word sequences acquired through reading or listening to texts. This is what Sinclair (1991) describes as the idiom principle as opposed to the open-choice principle (i.e., when words are chosen individually on a one-to-one basis). To address this gap, some researchers have compiled academic lists containing, for example, collocations (Ackermann & Chen, 2013), that is, the statistically recurrent use of two or more words within a short space of one another in a text (Sinclair, 1991). Table 17.3 exemplifies the differences between the academic word list (Coxhead, 2000) and the academic collocation list (Ackermann & Chen, 2013) in relation to the word ‘available.’ The academic word list focuses on individual words which have been frequently used in academic discourse. Under the main entry (e.g., ‘available’), all relevant related forms are listed (e.g., ‘availability’ and ‘unavailable’). The academic collocation list, on the other hand, prioritizes the words which are used in the vicinity of one another. Instead of simply identifying ‘available’ as a frequent word in academic discourse, the list indicates what adverbs (e.g., ‘currently’), nouns (e.g., ‘data’) and verbs (e.g., ‘become’) are usually used together with ‘available,’ thus foregrounding patterns in language use and facilitating learners’ uptake of two-word sequences.

Case Study C The following sentence was written by a senior academic, who is a successful user of English, in a reference letter for a student who was applying for admission to a doctoral program. Can you identify an instance of an unusual adjective + noun collocation? If so, how would you rephrase it?
“Throughout his Master’s studies and our subsequent collaboration, X has conducted himself with utter professionalism and has exhibited commitment, organizational skills and efciency.”
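Collocation in Sinclair’s (1991) sense, recurrent co-occurrence within a short span, can be approximated by counting words in a window around a node word. The following minimal Python sketch runs over an invented sample; the window size and text are illustrative only.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words co-occurring with `node` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

text = ("data are freely available online and more data are freely "
        "available every year")
counts = collocates(text.split(), "available")
print(counts.most_common(3))
```

Real collocation lists add a statistical filter (e.g., log-likelihood or MI over these window counts) so that frequent-but-chance co-occurrences are screened out.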


In addition to vocabulary, CL can also greatly contribute to our grammatical knowledge in EAP. For instance, a common myth that has been debunked by corpus researchers is the belief that subordination is a feature of academic writing. Based on extensive empirical evidence, Biber et al. (1999, p. 93) show that, compared to other word classes (e.g., adjectives, nouns, verbs), subordinators are overall rare. The researchers also report that the use of subordinators is more frequent in spoken registers – e.g., conversation – than in written registers – e.g., academic prose and news reportage.2 Corpus evidence has helped to show that clausal complexity is a feature of spoken discourse, whereas phrasal complexity is a feature of written discourse, including written academic discourse.
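Register comparisons of the kind reported by Biber et al. (1999) rest on normalized counts of grammatical features. Here is a minimal Python sketch of the underlying calculation, hits per 1,000 words, over two invented mini-samples; the subordinator list is abridged for illustration.

```python
# Abridged, illustrative list of subordinators.
SUBORDINATORS = {"because", "although", "if", "when", "while", "since"}

def per_thousand(tokens, targets):
    """Normalized rate: hits per 1,000 tokens."""
    hits = sum(1 for t in tokens if t in targets)
    return 1000 * hits / len(tokens)

# Invented mini-samples standing in for register subcorpora.
conversation = ("i left because it was late and if you want we can go "
                "when you are ready although it might rain").split()
academic = ("the results indicate a significant correlation between "
            "the two variables across all sampled registers").split()

print(per_thousand(conversation, SUBORDINATORS))
print(per_thousand(academic, SUBORDINATORS))
```

At corpus scale, the same calculation over tagged registers is what grounds the finding that subordination clusters in conversation rather than in academic prose.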

Case Study D Does the word ‘statistics’ require a verb in the singular or in the plural? For example, would you accept the use of ‘inferential statistics draw’ in the following example taken from an academic book, or would you change it to ‘inferential statistics draws’? “On the other hand, inferential statistics draw on the data available to make evidence-based conclusions that go beyond the dataset.” (Viana & O’Boyle, 2022, p. 130)
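With a POS-tagged corpus, a question like Case Study D can be answered by extracting the verbs that follow ‘statistics.’ The following minimal Python sketch uses a hand-tagged toy sample; the tags and sentences are invented for illustration, not COCA data.

```python
# Hand-tagged toy tokens (word, POS): a stand-in for a tagged corpus.
tagged = [
    ("inferential", "ADJ"), ("statistics", "NOUN"), ("draw", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("data", "NOUN"), (".", "PUNCT"),
    ("descriptive", "ADJ"), ("statistics", "NOUN"), ("summarize", "VERB"),
    ("the", "DET"), ("sample", "NOUN"), (".", "PUNCT"),
]

def verbs_after(tagged_tokens, node):
    """Collect verbs immediately following occurrences of `node`."""
    return [w2
            for (w1, _), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:])
            if w1 == node and p2 == "VERB"]

print(verbs_after(tagged, "statistics"))  # ['draw', 'summarize']
```

Inspecting the extracted verb forms (plural ‘draw’ vs. singular ‘draws’) then settles the agreement question empirically rather than by intuition.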

In sum, CL has made an enormous contribution to the development of EAP in relation to both vocabulary and grammar, as well as English for general and for specific academic purposes. The following section demystifies the use of CL by showing how free online corpora may be used to solve EAP-related questions and, as a result, advance one’s knowledge of EAP.

How can EAP professionals and students conduct corpus investigations?

The free availability of online corpora and their accessibility through numerous devices (e.g., tablets and laptops) have made it possible for everyone with access to the Internet to investigate language patterns by themselves. The questions posed in Case Studies A–D in the previous section can be easily solved with a few clicks if one knows what one is looking for, how to conduct relevant searches and how to interpret the empirical results. General corpora – that is, textual collections aimed at representing an entire language – are useful for EAP, as they generally allow for the comparative investigation of search words/phrases within that language. This type of investigation sheds light on register variation and helps to uncover the linguistic features of academic


discourse. As a matter of fact, register variation is at the heart of Case Study A, where ‘inconspicuous’ was used in a doctoral thesis. A search for such a word in the Corpus of Contemporary American English (COCA – Davies, 2008), which contains more than one billion words of American English, shows that ‘inconspicuous’ is most frequently found in fiction and magazines (see Figure 17.1). The infrequent use of ‘inconspicuous’ in academic discourse becomes even clearer when the same verb + adjective combination – i.e., all forms of the verb ‘remain’ followed by the adjective ‘inconspicuous’ – is searched for in COCA. Figure 17.2, which contains the results of such a search, reveals that there are no instances of this verb + adjective combination in the academic prose included in COCA, which totals nearly 121 million words of texts from peer-reviewed journals. For this reason, the use of ‘inconspicuous’ in a doctoral thesis seems unusual. This option might have been deliberate, as in the writer’s attempt to blur the boundaries between academic and fictional discourse. However, the use of such a word in doctoral theses seems to go against the expectations that readers would possibly have of academic discourse. The investigation of vocabulary use is not limited to the register in which the search item is most frequently found. Some online corpora contain search

FIGURE 17.1 Distribution of ‘inconspicuous’ across components in COCA

FIGURE 17.2 Distribution of ‘remain* + inconspicuous’ in COCA


functionalities which enable users to drill down into disciplines or macro-disciplinary groups. This is the case, for instance, of the British National Corpus (BNC) made available by Davies (2004), which totals 100 million words. BNC users can investigate the occurrence of a word, phrase or grammatical feature across seven registers (i.e., academic, fiction, magazine, miscellaneous, newspaper, non-academic and spoken) as well as six (macro)disciplinary groups (i.e., engineering, humanities, law, medicine, natural sciences and social sciences). Case Study B focused on the use of ‘hinterland’ in academic discourse. In contrast to what was observed for ‘inconspicuous,’ ‘hinterland’ is commonly used in academic discourse, as shown in Figure 17.3. The BNC additionally shows that, from a (macro)disciplinary perspective, most instances of this word come from the humanities. A close analysis of the concordance lines for ‘hinterland’ in the BNC reveals that this word is primarily used with spatial meaning in academic texts in the field of history. Figure 17.4 contains ten concordance lines for this word in the BNC. In several cases, the focal noun is preceded by an adjective indicating the location or ownership of such hinterland (see Lines 01–08 and 10). The use of ‘hinterland’ in the excerpt reproduced in Case Study B seems unusual in a doctoral thesis in the field of language education. While the sentence may be understood by those readers who are familiar with the meaning of ‘hinterland,’ they might expect to read about ‘socio-economic background,’ ‘socio-economic factors,’ ‘socio-economic characteristics’ or ‘socio-economic context,’ to cite a few examples. The choice of ‘socio-economic hinterland’ might have been deliberate, but this is likely to require a bit more processing time from readers since it breaks

FIGURE 17.3 Distribution of ‘hinterland’ in the BNC


01 p its overland trade with the Balkan hinterland, as this activity did not bring it
02 wer which had moved into the Bosnian hinterland, and, armed with a papal dispensati
03 of his royal presence in the British hinterland of Bernicia. It is uncertain what w
04 nj and Knin in the central Dalmatian hinterland. The frontier settlement agreed to
05 . Further inland, from the Dalmatian hinterland, in Bosnia and the Sandžak, strong c
06 tas circulated widely in the Mercian hinterland and some reason to suggest in the l
07 driatic ports and their Ottoman-held hinterland which was to bring them so much wea
08 r Mallia, or it may have had its own hinterland on the boundary of the two major te
09  strongly imply control over a rural hinterland, although the precise nature of tha
10 e, but the Turkish occupation of the hinterland was an ever present menace. In the

FIGURE 17.4 Concordance lines for ‘hinterland’ in the BNC

up the usual pattern that they have come across in their previous experience of academic discourse in this field. In addition to investigating isolated words, online corpora make it easy for users to identify patterns in language use. Case Study C focused on one specific collocational pattern, namely, adjective + noun. The original academic reference letter employed the phrase ‘utter professionalism’ to refer to someone’s skills in a positive way. This specific adjective + noun combination is extremely infrequent, as shown in Table 17.4.

TABLE 17.4 Frequencies of ‘Utter,’ ‘Professionalism’ and ‘Utter Professionalism’ Across Corpora

Corpus                                                           Size (words)   ‘Utter’   ‘Professionalism’   ‘Utter professionalism’
British National Corpus (BNC – Davies, 2004)                     100 million        646                 487                         1
Corpus of Contemporary American English (COCA – Davies, 2008-)   1 billion        7,698               2,766                         0
Corpus of Global Web-based English (GloWbE – Davies, 2013)       1.9 billion     22,468              10,622                         7
The Intelligent Web-based Corpus (iWeb – Davies, 2018)           14 billion      69,038              81,489                        16
News on the Web Corpus (NOW – Davies, 2016-)                     18 billion      90,328              91,132                        53

Admittedly, not all the corpora included in Table 17.4 contain


academic discourse (e.g., NOW contains texts published in newspapers and magazines available online). However, the large size of these corpora helps to show that the reduced occurrence of ‘utter professionalism’ is not due to lack of data. Even though both words appear separately in all corpora with considerable frequencies, their combined use does not seem to be recurrent. A search for ‘utter’ in the academic component of COCA reveals its semantic prosody, that is, a “consistent aura of meaning with which a form is imbued by its collocates” (Louw, 1993, p. 157). As can be seen in Figure 17.5, the adjective ‘utter’ is followed by nouns which express negative meaning (e.g., ‘lack,’ ‘destruction,’ ‘disregard,’ ‘failure’ and ‘indifference’). For this reason, the use of ‘utter’ in an academic reference letter may cause an interpretation issue in that readers are likely to expect some negative evaluation of the doctoral applicant, but this is not the originally intended meaning. The positive nature of the reference becomes clear in the final part of the sentence (“has exhibited commitment, organizational skills and efficiency”) as well as in the remainder of the reference letter. A search for two-word sequences containing any adjective followed by the word ‘professionalism’ in COCA reveals some more natural sequences that could have been used in the original academic reference letter reported in Case Study C: ‘utmost professionalism’ or ‘great professionalism’. So far, the present section has considered vocabulary investigations. However, grammar can also be examined through corpora. This is especially the case if

FIGURE 17.5 Two-word sequences containing ‘utter’ followed by a noun in the academic component of COCA

FIGURE 17.6 Sequences containing ‘statistics’ followed by a verb in the academic component of COCA

corpora have been tagged, that is, the words have their respective grammatical classes identified. Tagged corpora allow for the (semi-)automatic exploration of grammar, thus providing empirical evidence to answer the question raised in Case Study D – i.e., whether a verb in the singular or plural form should follow ‘statistics.’ Figure 17.6 shows the results of the search for ‘statistics’ followed by a verb in the academic component of COCA. As can be seen in Lines 01–03, 06, 12 and 19, ‘statistics’ is primarily followed by a verb in the plural form (i.e. ‘show,’ ‘indicate,’ ‘suggest,’ ‘reveal,’ ‘provide’ and ‘report’). The only exception in Figure 17.6 concerns the use of ‘statistics reports’ in Line 14. However, the concordance lines (see Figure 17.7) show that most of these instances do not correspond to the use of ‘statistics’ as the subject of ‘reports.’ In Lines 01, 03 and 04, the head of the noun phrase is a singular noun (i.e., ‘center’ or ‘bureau’), and ‘statistics’ appears in a prepositional phrase, functioning as a post-nominal modifier. Lines 02 and 05 reveal that automatic tagging is not always entirely accurate: these two instances of ‘reports’ would be better categorized as nouns since they are part of a longer noun phrase which identifies two different publications. By way of comparison, Figure 17.8 shows some of the concordance lines for ‘statistics report,’ which appear in Line 19 in Figure 17.6. All the cases refer to the use of ‘statistics’ as the subject of the verb ‘report.’ The present section has focused on how professionals and students can make use of general corpora to support their teaching and/or learning EAP. However,


01 he National Center for Education Statistics reports that 20 percent of U.S. public
02 ates, 2001-2007 (National Health Statistics Reports No. 48). Retrieved from http: *
03 r control. The Bureau of Justice Statistics reports that some 186,000 black males b
04 ited States, the Bureau of Labor Statistics reports that 33% of the civilian workfo
05 nt statistics and National Vital Statistics Reports at www.cdc.gov/nhs/data). The M

FIGURE 17.7 Concordance lines for ‘statistics reports’ in the academic component of COCA

01 ccess. # * Diversity: The latest statistics report that just 12 percent of libraria
02 reases, even though the official statistics report a large jump in prices. # To gua
03 ncy. # World Health Organization statistics report that mosquitoes spread about 4 m
04 ndividuals who have ADHD. Recent statistics report that nearly 4.3 million children
05 onal law of asylum. Indeed, U.S. statistics report that most applicants reach this

FIGURE 17.8 Concordance lines for ‘statistics report’ in the academic component of COCA

their exploration can be further supported by exploring specialized written academic corpora, such as the ones listed here:
• Corpus of Research Articles 2007 (http://rcpce.engl.polyu.edu.hk/RACorpus/), which contains 5.6 million words spanning 39 disciplines.
• Corpus of Journal Articles 2014 (http://rcpce.engl.polyu.edu.hk/cja2014/default.htm), a collection of approximately 6 million words of articles from 38 disciplines.
• British Academic Written English Corpus (https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Fbawe2), a 6.5-million-word corpus of written assignments from 30 disciplines produced by undergraduate and Master’s students in the United Kingdom.
• Michigan Corpus of Upper-Level Student Papers (https://micusp.elicorpora.info/), which totals 2.6 million words of assignments written by final-year undergraduate students as well as graduate students from 16 disciplines at an American university.
• Written English as a Lingua Franca in Academic Settings Corpus (www.helsinki.fi/en/researchgroups/english-as-a-lingua-franca-in-academic-settings/research/wrelfa-corpus), which contains PhD examiner reports, research blogs and unedited research papers, all of which are written by users of English as a lingua franca.
EAP professionals and students can also engage in the compilation of written academic corpora to suit their own specific needs. Viana and O’Boyle (2022) explain

Exploring the interface between CL and EAP


that there are numerous levels of corpus specialization and provide a thorough list of 25 potential criteria that can be used as a guide for specialized academic corpus compilation. Their proposed list contains criteria that have been discussed in the previous section, such as medium (i.e., the present chapter focuses on written discourse) and field (e.g., the different disciplines within academic discourse), as well as other criteria like text production circumstances (e.g., written texts produced with or without time constraints) and source (e.g., articles from a single journal vs. from a variety of journals).

Why should EAP professionals and students consider exploring corpora themselves?

One of the strongest reasons supporting EAP professionals’ and students’ engagement in corpus investigations is the amount of data available at their fingertips (see, for instance, the size of the online corpora reported in Table 17.4). While EAP professionals and students were, in the past, limited by the input provided by the (usually print) materials to which they had access (e.g., dictionaries, grammars, textbooks), this is no longer the case. As Viana (2023a, p. 104) notes, “[n]owadays there is hardly any shortage of data to investigate language use.”

The popularization of technological tools to investigate corpora is another reason supporting EAP professionals’ and students’ corpus exploration. Over the years, the hardware needed for such exploration has become increasingly simple and less expensive. Mainframe computers are no longer a must for corpus research, which may be undertaken in a few seconds using a smartphone (see Viana, 2023a). The use of technology in CL means that EAP professionals and students do not need to waste their time on mechanical tasks, such as the identification, counting and/or sorting of words or word sequences. Instead, they can devote their time to “higher-order analytical and cognitive tasks such as the recognition of the academic functions performed by linguistic forms in their co-texts (i.e., their language environment) and contexts (i.e., the circumstances of language production)” (Viana & O’Boyle, 2022, p. 51).

CL provides EAP professionals and students with an empirical way to learn about academic discourse and to resolve their language doubts autonomously. It supports their distancing from intuitive impressions of language use and provides them with clear evidence of actual language use. This empirical approach is most welcome: it empowers all EAP professionals and students – irrespective of their nationality and/or language background – to engage in discussions on academic discourse in English. There is no need to rely on the intuition of the so-called native speaker of English, nor is the EAP professional the ‘oracle’ of language wisdom in the classroom. Corpus engagement therefore has the potential to change the power dynamics within EAP by equipping professionals and students, especially (but not limited to) those for whom English is not a first language, with the means to take an active role and to challenge long-held beliefs about academic discourse.
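The “mechanical tasks” mentioned above (identifying, counting and sorting words) are precisely what corpus software automates. A minimal sketch of such a task in Python, purely for illustration (no actual corpus tool is claimed to work exactly this way):

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Naive tokenization on lowercase word characters, then a sorted frequency count."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(top_n)

# Invented example text
sample = ("The corpus provides evidence of language use; the evidence "
          "supports claims about language and about use of language.")
print(word_frequencies(sample, 3))
```

Even this toy example frees the analyst from counting by hand, leaving their attention for the interpretive questions of why a word is frequent and what functions it performs.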


Vander Viana

Conclusion

The present chapter has focused on the contributions that CL can make to EAP and, more specifically, written academic discourse. However, there are features (e.g., ellipses) which are more difficult to investigate using corpora (for such a discussion, see Viana & O’Boyle, 2022, pp. 55–57). It must be explicitly acknowledged that CL is not the panacea for all EAP questions. As Hunston (2002, p. 20) aptly puts it, corpora “are invaluable for doing what they do, and what they do not do must be done in another way.” It is therefore important to understand how CL can be an ally to EAP professionals and students in their socialization journeys into academic discourse. This way, they will be able to reap the full benefits of corpus exploration.

While CL has extensively contributed to our understanding of EAP, its influence on EAP practice has been less substantial. New materials have been designed to support the pedagogical application of CL, such as the resource books by Karpenko-Seccombe (2021), which contains several tasks to support students’ EAP learning, and by Viana (2023b), which contains numerous ready-made lesson plans indicating how corpora and corpus tools may be successfully used in the language classroom. Given the multiplicity of contexts in which EAP is practiced, it is expected that an increasing number of context-sensitive teaching and learning materials which draw on CL will be designed by (and hopefully made available for) the EAP community in the years to come.

Notes

1 The results from Khani and Tazik (2013) and Valipouri and Nassaji (2013) are juxtaposed in Table 17.2 for illustrative purposes even though different units of analysis were used in these studies. The list of most frequent words in Applied Linguistics contains the top types (i.e., the different words), whereas the Chemistry-specific list contains the most frequent word families.
2 Fiction is perhaps an exception, as its use of subordinators is closer to conversation than to the other written registers. This is explained by the fact that fiction contains features of spoken discourse (e.g., direct speech).

3 A similar result is found in COCA: 190 out of the 441 instances of ‘hinterland’ are found in academic discourse. This word is more frequent in this type of discourse than in all the other seven registers in COCA (i.e., blog, fiction, magazine, news, spoken, TV/movies and Web).

Reference list

Ackermann, K., & Chen, Y.-H. (2013). Developing the academic collocation list (ACL): A corpus-driven and expert-judged approach. Journal of English for Academic Purposes, 12(4), 235–247. https://doi.org/10.1016/j.jeap.2013.08.002
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Longman.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. https://doi.org/10.2307/3587951



Coxhead, A. (2011). The academic word list 10 years on: Research and teaching implications. TESOL Quarterly, 45(2), 355–362. https://doi.org/10.5054/tq.2011.254528
Davies, M. (2004). British national corpus (from Oxford University Press). www.english-corpora.org/bnc/
Davies, M. (2008). The corpus of contemporary American English (COCA). www.english-corpora.org/coca/
Davies, M. (2013). Corpus of global web-based English (GloWbE). www.english-corpora.org/glowbe/
Davies, M. (2016). Corpus of news on the web (NOW). www.english-corpora.org/now/
Davies, M. (2018). The iWeb corpus. www.english-corpora.org/iWeb/
Gaynor, B. (2018). From language policy to pedagogic practice: Elementary school English education in Japan [Doctoral thesis, University of Stirling]. STORRE: Stirling Online Research Repository. https://dspace.stir.ac.uk/handle/1893/29671
Hunston, S. (2002). Corpora in applied linguistics (1st ed.). Cambridge University Press. https://doi.org/10.1017/CBO9781139524773
Hyland, K., & Tse, P. (2007). Is there an “academic vocabulary”? TESOL Quarterly, 41(2), 235–253. https://doi.org/10.1002/j.1545-7249.2007.tb00058.x
Karpenko-Seccombe, T. (2021). Academic writing with corpora: A resource book for data-driven learning. Routledge. https://doi.org/10.4324/9780429059926
Khani, R., & Tazik, K. (2013). Towards the development of an academic word list for applied linguistics research articles. RELC Journal, 44(2), 209–232. https://doi.org/10.1177/0033688213488432
Lei, L., & Liu, D. (2016). A new medical academic word list: A corpus-based study with enhanced methodology. Journal of English for Academic Purposes, 22, 42–53. https://doi.org/10.1016/j.jeap.2016.01.008
Louw, B. (1993). Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies. In M. Baker, G. Francis & E. Tognini-Bonelli (Eds.), Text and technology: In honour of John Sinclair (pp. 157–176). John Benjamins.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press.
Todd, R. W. (2017). An opaque engineering word list: Which words should a teacher focus on? English for Specific Purposes, 44, 31–39. https://doi.org/10.1016/j.esp.2016.08.003
Tongpoon-Patanasorn, A. (2018). Developing a frequent technical words list for finance: A hybrid approach. English for Specific Purposes, 51, 45–54. https://doi.org/10.1016/j.esp.2018.03.002
Valipouri, L., & Nassaji, H. (2013). A corpus-based study of academic vocabulary in chemistry research articles. Journal of English for Academic Purposes, 12(4), 248–263. https://doi.org/10.1016/j.jeap.2013.07.001
Viana, V. (2012). Disciplinary variation in academic writing: A corpus study of PhD theses in English language and literature [Doctoral thesis, Queen’s University Belfast]. Queen’s University Belfast Research Portal. https://pure.qub.ac.uk/files/175061318/Paula_Viana_Disciplinary_variation_70156638.pdf
Viana, V. (2023a). Corpus linguistics in English language teacher education during the COVID-19 pandemic: Exploring opportunities and addressing challenges. In K. Sadeghi & M. Thomas (Eds.), Second language teacher professional development: Technological innovations for post-emergency teacher education (pp. 101–124). Palgrave Macmillan. https://doi.org/10.1007/978-3-031-12070-1_6
Viana, V. (2023b). Teaching English with corpora: A resource book. Routledge. https://doi.org/10.4324/b22833
Viana, V., & O’Boyle, A. (2022). Corpus linguistics for English for academic purposes. Routledge. https://doi.org/10.4324/9781003245988



17.1 EAP through the lens of corpus linguistics
Specifying the focus

Mustafa Özer

Corpus linguistics (CL) has gained pedagogical prominence thanks to the emergence of techniques and tools that can benefit English for Academic Purposes (EAP) practitioners. Drawing on methodologies in CL, EAP is now a growing field of research which holds the potential to revolutionise the way we teach English to potential members of the various global scientific communities publishing their research in English. CL tools and corpora can be employed to map the disciplinary linguistic conventions employed by senior researchers setting the discoursal trends, such as authorial presence. However, compliance with one strand of ‘academese’, the English used by academics, does not usually guarantee compliance with all others, especially at the lexical level. Despite some convergence, there is substantial variation among disciplines (Hyland, 2008), making them linguistically diverse at varying levels. In this respect, CL methodologies can be employed to provide concrete evidence of this inter-disciplinary divergence in language use. From an EAP teaching standpoint, evidence from corpus analyses can help create a more empirical and grounded basis for syllabus and materials design.

Vocabulary-wise, although “a universal academic vocabulary” (Hyland & Tse, 2007, p. 247) is believed by some to exist, exploration of disciplinary corpora reveals idiosyncratic patterns specific to disciplines. In fact, there is now a growing body of research assembling specialized vocabulary lists mapping disciplinary lexical common cores (e.g., Khani & Tazik, 2013; Lei & Liu, 2016) rather than a universal academic “lexical common core” (Stein, 2017) that allegedly applies to any field of study. CL also facilitates the development of well-tailored syllabi addressing the needs of particular EAP cohorts.
While the Web is often used as a reference tool for linguistic inquiry by both EAP learners and teachers, the results displayed by a browser may contain miscellaneous entries from irrelevant web authors. A corpus, however, is a compilation of targeted linguistic data based on stringent criteria and is, therefore, much more reliable. Furthermore, corpora are highly structured and hold linguistic data suited to the quantitative analysis of language phenomena. A careful inspection of these phenomena through CL methods yields concrete and reliable evidence about them, unlike the Web. Harvesting this unadulterated evidence and transforming it into a multidimensional teachable medium can boost the effectiveness of EAP teaching for any discipline, register, or genre. Once trained in using this medium, learners stand a higher chance of working autonomously.



This autonomy, fostered through hands-on corpus referencing sessions (data-driven learning; DDL), can translate into life-long learning habits. That being said, CL-literate EAP practitioners are likely more capable of designing a better-endowed exploration of content for discipline-specific EAP, which has implications for teacher education and professional development practices as well. To sum up, CL-based EAP practice offers clear advantages: it is a technical and reliable means of documenting linguistic variation across genres. Data from CL research can thus be used to fuel EAP syllabi with a disciplinary focus and improve learner achievement.

Reference list

Hyland, K. (2008). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27(1), 4–21. https://doi.org/10.1016/j.esp.2007.06.001
Hyland, K., & Tse, P. (2007). Is there an “academic vocabulary”? TESOL Quarterly, 41(2), 235–253.
Khani, R., & Tazik, K. (2013). Towards the development of an academic word list for applied linguistics research articles. RELC Journal, 44(2), 209–232. https://doi.org/10.1177/0033688213488432
Lei, L., & Liu, D. (2016). A new medical academic word list: A corpus-based study with enhanced methodology. Journal of English for Academic Purposes, 22, 42–53. https://doi.org/10.1016/j.jeap.2016.01.008
Stein, G. (2017). Some thoughts on the issue of core vocabularies: A response to Vaclav Brezina and Dana Gablasova: ‘Is there a core general vocabulary?’ Introducing the new general service list. Applied Linguistics, 38(5), 759–763. https://doi.org/10.1093/applin/amw027

17.2 English for Academic Purposes meets corpus linguistics
A multifaceted view from Brazil

Wesley Henrique Acorinti

This discussion section explores the impacts of Dr. Vander Viana’s chapter on me as an undergraduate student, a research assistant and a teaching assistant in Brazil. I found Viana’s chapter relevant for independently leveraging corpus tools to improve my writing. As a research assistant, I deepened my understanding of the significance of studying academic discourse. As an English for Academic Purposes (EAP) instructor, Viana’s insights led me to adjust the curriculum and materials to cater to students from different disciplines. Furthermore, the chapter inspired me



to organize a workshop to introduce peers to the potential of corpus linguistics in EAP. Viana’s work broadened my understanding of corpus linguistics and directly inspired changes in my EAP teaching and research, as well as indirectly in those of my peers.

Regarding the practical applications of corpus linguistics (CL) in EAP, Viana emphasizes how CL can uncover variation in language use across genres, a task that would be challenging to accomplish through other methods. This perspective was particularly valuable to me as an EAP instructor who lacks direct exposure to my students’ academic contexts and practices, and whose intuition about language use in their disciplines is therefore unreliable. It prompted me to reconsider my classroom practices, resulting in changes to the curriculum and materials design in the courses I teach in order to cater to the needs of students in different disciplines.

Inspired, I conducted a hands-on workshop for my peers, introducing them to the potential of CL in EAP. This workshop sparked continuous discussions, fostering a collaborative learning environment and further deepening our understanding of language use in different academic contexts. On this occasion, my focus was on exploring disciplinary variation and the potential of corpus analysis using #LancsBox 6 (Brezina et al., 2021). I guided my peers through several stages of using the software, including downloading and installing it and creating and loading their own corpora. Additionally, I referenced Viana’s chapter to establish a theoretical foundation that connects CL to our teaching. I emphasized that CL can do much more than generate vocabulary lists by analyzing the frequencies of individual words within corpora of texts from our students’ respective disciplines. For instance, by examining collocations, we can learn not only how words commonly co-occur but also uncover meaningful semantic associations between words (Gablasova et al., 2017).
This is particularly valuable in EAP classes for students whose disciplines are unfamiliar to us. The workshop enabled my peers to gain an overview of the possibilities offered by CL, providing them with an open space to discuss any challenges they might encounter.

Viana concludes his chapter by focusing on the key impacts of CL in EAP, all of which resonate with my undergraduate student life. These include advocating for the application of CL principles, such as automating data investigation, prioritizing evidence and performance, and utilizing data to make well-informed pedagogical decisions and foster learner autonomy. One practical example is evident as I write this discussion segment, where I independently leverage CL tools to ensure that my writing aligns with the discourse community of applied linguists to which I aspire to belong. Overall, the what, how and why brought by Viana have broadened my understanding of CL. This approach potentially brings a methodological shift to EAP teaching and research, challenges traditional views of academic language and improves instruction by leveraging the power of language data for students, researchers and educators.
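The collocation analysis discussed in this section ultimately rests on simple co-occurrence counts: how often other words appear within a span of the node word. The toy sketch below illustrates the underlying idea of windowed co-occurrence counting in Python; it is not how #LancsBox computes its statistics, and in practice association measures such as MI or log-likelihood would then be calculated from these raw counts:

```python
from collections import Counter

def collocates(tokens, node, window=5, top_n=5):
    """Count tokens co-occurring with `node` within +/- `window` positions."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts.most_common(top_n)

# Invented mini-corpus for illustration
tokens = ("we conducted a corpus analysis and a corpus study ; "
          "the corpus analysis showed clear patterns").split()
print(collocates(tokens, "corpus", window=2))
```

Even such raw counts already surface the kind of recurrent company a word keeps (here, *corpus* + *analysis*), which is the starting point for discussing semantic associations with students.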



Reference list

Brezina, V., Weill-Tessier, P., & McEnery, A. (2021). #LancsBox (Version 6) [Software]. http://corpora.lancs.ac.uk/lancsbox/index.php
Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(S1), 130–154. https://doi.org/10.1111/lang.12226

17.3 Personal viewpoints on the implications of the CL-EAP interface

Shuyi Amelia Sun

The enlightening chapter by Dr. Vander Viana critically explores the interface between corpus linguistics (CL) and English for Academic Purposes (EAP), discussing what EAP investigations can gain from CL approaches, how such investigations could be conducted, and why the CL-EAP interface is worth highlighting. Considering that academic fields (e.g., academic publishing and tertiary education) are “highly institutionalized, routinized and gate-kept” (Viana & O’Boyle, 2022, p. 4), corpus-informed analyses of empirically gathered data can evidence authentic language use in a particular EAP context and supply the EAP community with first-hand knowledge of linguistic and discoursal practices, alongside the ways communication takes place in different disciplines and genres (Hyland, 2009).

As a novice researcher, I have gained valuable insights into the rationale for addressing EAP issues with CL tools and techniques, which contribute to my PhD research project, i.e., corpus-informed studies of English for Academic/Specific Purposes. The evidence-based demonstrations of the CL-EAP interface also fostered my reflection on possible extensions of the ongoing project, especially with respect to compiling academic vocabulary lists and identifying lexical bundles, which may further uncover academic language use at the lexical and formulaic levels.

In addition, from the perspective of language learners, advanced (under-)graduate students can benefit from exploring the CL-EAP interface by seeking to develop corpus-assisted language learning skills, metalinguistic knowledge, and (disciplinary) language awareness, which can develop their abilities to produce quality written work and engage with the EAP community.
Especially for those students without a systematic account of EAP practices, learning to consult and interpret corpus data can promote language acquisition, genre awareness, and learner autonomy, as well as contribute to an understanding of disciplinary specificity (Crosthwaite et al., 2019).



For practitioners in China, it is beneficial to bring corpora into EAP classrooms since corpus data can provide evidence for research-led teaching, assist in designing discipline-specific supplementary materials, and help develop specific teaching practices for educators’ and students’ engagement in particular academic events. Corpus-informed language teaching and learning have the potential to raise awareness of linguistic strategies and discursive features through the exploration of corpus data (e.g., Sun & Crosthwaite, 2022). However, it is noteworthy that data-driven learning (DDL) – one of the most promising applications of CL – has not spread widely in China despite its great potential for language learning curricula (Crosthwaite, 2020). We need to do more to incorporate DDL into China’s EAP classrooms so as to further enhance learners’ capacity in information and communications technology (ICT) and advance the ways literacy is viewed in the current EFL context.

Reference list

Crosthwaite, P. (2020). Data-driven learning and younger learners: Introduction to the volume. In P. Crosthwaite (Ed.), Data-driven learning for the next generation: Corpora and DDL for pre-tertiary learners (pp. 1–10). Routledge. https://doi.org/10.4324/9780429425899
Crosthwaite, P., Wong, L. L. C., & Cheung, J. (2019). Characterising postgraduate students’ corpus query and usage patterns for disciplinary data-driven learning. ReCALL, 31(3), 255–275. https://doi.org/10.1017/S0958344019000077
Hyland, K. (2009). Academic discourse: English in a global context. Continuum.
Sun, S. A., & Crosthwaite, P. (2022). “The findings might not be generalizable”: Investigating negation in the limitations sections of PhD theses across disciplines. Journal of English for Academic Purposes, 59, 101155. https://doi.org/10.1016/j.jeap.2022.101155
Viana, V., & O’Boyle, A. (2022). Corpus linguistics for English for academic purposes. Routledge. https://doi.org/10.4324/9781003245988

17.4 Exploring language patterns
Integrating corpus linguistics in the pedagogical grammar class

Larissa Goulart

Vander Viana’s chapter delves into the impact of corpus linguistics developments on English for Academic Purposes (EAP) instruction. Viana highlights the importance of incorporating corpus linguistics in teacher training and applied linguistics programs beyond corpus linguistics and CALL courses. This discussion section illustrates how I implemented the use of online corpus tools within a pedagogical grammar



course. Students in the course were asked to produce a corpus-based lesson plan as part of the course assignments. The goal of this activity was to familiarize pre-service educators with the potential of corpus tools in the English language classroom.

In his chapter, Viana explores the influence of technological advances, particularly corpus linguistics resources, on the teaching of EAP and the development of EAP pedagogical materials. This emphasis on the applications of corpus research to materials development has made a significant impact on how I teach corpus linguistics to my graduate and undergraduate students. Viana builds on the work of several corpus linguists who have also considered the applications of corpus research to language teaching (e.g., Friginal, 2018; Reppen, 2010).

From a teacher training perspective, I find Viana’s approach of illustrating how corpus-based language analysis can be used in language teaching helpful for undergraduate students. In my experience, undergraduate linguistics students sometimes have difficulty understanding how corpora can be applied to teaching. Viana’s chapter helps introduce students to several ways that corpora can be applied in the classroom, showcasing examples of how CL has already been integrated into teaching materials, such as vocabulary lists, textbooks, and dictionaries.

The impact of Viana’s chapter on my teaching also extends to the assignments I now integrate in my classes. Viana gives an example of how he uses corpora with his graduate students, illustrating the need to incorporate corpus linguistics modules across different courses (i.e., not confined only to corpus linguistics or computer-assisted language learning [CALL] courses). For example, in my pedagogical grammar course, students develop a corpus-based lesson plan focusing on a grammatical aspect discussed in class, which could be easily transferred into a viable ESL class.
As part of this course, students learn how to use online corpus tools (e.g., COCA, SkELL). Based on the topics discussed in class, students then decide on a topic to focus on for their lesson plan (e.g., the differences between do and make; prepositions of time, etc.). I then guide them through different corpus searches in class. Some of the guiding questions I give to students are: How would you describe the patterns of use encountered in the corpus? What words occur to the right and to the left of your search term? How does frequency of use change across different sections of the corpus? How do your findings compare to the explanation about this topic in traditional ESL textbooks?

After guided searches and independent work, students then work on developing a lesson plan in three parts. Part 1 explains the steps of the corpus analysis in a way that could be replicated in the language classroom. Part 2 gives teachers a summary of the findings. Part 3 consists of activities that could be implemented in the classroom. Some examples of lesson plans from this course focus on the collocates of the verbs do and make, the modifiers of have and be, and the use of modal verbs in requests.
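One of the guiding questions above, how frequency of use changes across different sections of the corpus, requires normalizing raw counts, since corpus sections differ in size. A small Python sketch of per-million-word normalization; the counts and section sizes below are invented for illustration, not actual COCA data:

```python
def per_million(hits, corpus_size):
    """Normalize a raw hit count to a per-million-word rate so that
    sections of different sizes can be compared directly."""
    return hits / corpus_size * 1_000_000

# Hypothetical raw counts for a search term in two sections of a corpus
sections = {"academic": (120, 120_000_000), "spoken": (480, 125_000_000)}
for name, (hits, size) in sections.items():
    print(name, round(per_million(hits, size), 2))
```

Students comparing 120 hits with 480 hits might conclude little without this step; the normalized rates (here 1.0 vs. 3.84 per million words) make the cross-section comparison meaningful.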



number of resources for teacher trainers who are looking for activities to implement in the classroom that will encourage students to use corpora in their own language lessons.

Reference list

Friginal, E. (2018). Corpus linguistics for English teachers: Tools, online resources, and classroom activities. Routledge.
Reppen, R. (2010). Using corpora in the language classroom. Cambridge University Press.

INDEX

ability 3, 13, 17, 38, 41, 158, 159, 176, 178, 184, 188, 189, 203, 207 abstracts 22, 23, 80–82, 115, 116, 229, 234, 236 academic xiv–xix, 5–7, 10, 12, 13, 18, 19, 22, 43, 44, 53, 71–75, 77–83, 86–90, 98, 105–108, 110–112, 114–116, 118–120, 122, 123, 126–130, 138, 142, 144, 146, 151, 152, 155, 160, 164, 166, 168, 171, 172, 175, 182, 183, 214, 221, 225, 227, 228, 231, 236, 239–258 accessibility 6, 54, 142, 148, 154, 155, 244 acquisition xiv, xvi–xix, 5, 24, 45, 59, 60, 68, 69, 71, 75, 80, 84, 93, 104, 157, 158, 164, 167, 176, 182, 185, 203, 206, 207, 211, 220, 223, 224, 257 activity 12, 13, 41, 76, 77, 98–101, 146, 154, 193, 196, 197, 202, 215, 220, 222–224, 229, 231–233, 247, 259 adjective 73, 74, 76, 122, 132, 200–202, 243, 245–248 advanced 2, 6, 13, 58, 60, 63, 75–77, 83, 102, 108, 112, 127, 137, 138, 141, 158, 164, 168, 174, 175, 204, 257 advantages 15, 37, 78, 81, 82, 102, 151, 152, 167, 179, 190, 203, 235 affordances 107, 110, 182–184, 217, 219, 221, 226 agency 39, 213, 218 analysis xii–xvii, xix, 4, 7, 8, 11, 14, 15, 17, 18, 22, 25, 26, 28, 32, 34–40, 56, 58, 67, 64, 68, 80, 82, 85, 86, 94, 95, 102,

104, 110, 111, 127, 130–133, 135–138, 140, 141, 149, 150, 153, 155, 160, 161, 163, 170, 171, 174, 175, 176, 179–182, 188–190, 206, 208–211, 213, 214, 216, 217, 219, 226, 227, 229, 231, 239, 240, 242, 246, 252, 254, 256, 259 annotated 27, 35, 39, 50, 135, 136, 141 annotation 25, 35, 37, 141, 161, 190, 204 AntConc 4, 9–23, 45, 52, 56, 128, 129, 147, 176, 190 AntWordProfiler 88, 90, 164, 167 applications xiv–xvii, xix, 4–6, 9, 20, 23, 26, 27, 35, 54, 55, 67, 68, 70, 71, 89, 94, 95, 101, 102, 105, 127, 129, 148, 157, 160, 164, 168, 175, 182, 185–187, 206, 208, 223, 252, 256, 258, 259 approaches 5, 6, 17, 25, 35, 36, 38, 51, 59, 60, 68, 70, 84, 85, 93, 94, 102–104, 109, 117, 129, 168, 180, 184, 187, 192, 220, 221, 238, 239, 257 argumentation 6, 111, 112, 115, 120, 124, 127 assessment xvi, xvii, xix, 6, 56, 68, 139, 141, 145, 155, 157–164, 167–170, 172 authentic 7, 17, 48, 52, 55, 59, 69, 73, 88, 94, 102, 150, 155, 163, 170, 172, 175, 178, 180, 184, 192, 206, 215, 222, 223, 225, 231, 237–239, 257 automated 6, 26, 133, 135, 136, 139, 169, 205 autonomous xv, 54–56, 95, 101–104, 107, 110, 212–214, 224, 225, 251



background 50, 69, 84, 105, 131, 223, 239, 246, 251 BDDL 7, 211, 214–218 benefits 2, 5–7, 30, 41, 55, 82, 91–93, 109, 141, 145, 153, 155, 175, 179, 181, 225, 235, 252 Berber Sardinha, T. xii, 3, 4, 25–28, 35, 36, 41 BNC 16, 128, 130, 168, 176, 213, 233, 246, 247, 253 Boulton, A. xii, 1, 2, 5, 7, 8, 43, 52–58, 61, 62, 67, 93, 94, 104, 110, 111, 127, 144, 150, 175–177, 182–184, 211–213, 215, 219, 223, 224, 227, 236 CEFR 108, 160, 163–166, 169, 170 challenges 5, 9, 13, 15, 18, 20, 23, 36, 37, 53, 54, 60, 61, 68–70, 82, 91, 92, 102, 103, 109, 132, 146, 151, 180, 186, 222–224, 228, 253, 256 China xvii–xix, 20, 171, 178, 183, 185, 205, 258, 11, 61, 63, 68, 83, 141, 170, 176, 191, 192, 221 class 15, 22, 34, 40, 41, 46, 47, 51, 52, 91, 100, 103, 145, 147, 149, 169, 191, 207, 231–234, 237, 258, 259 classroom xv, xviii, 1, 4, 9–11, 13–15, 19, 22, 25, 32, 34, 41, 43, 50, 55, 57, 58, 67, 69, 70, 75, 76, 80, 89–91, 93, 95, 101, 103–106, 109, 124, 129, 139, 143–145, 150, 156, 159, 160, 169, 174–178, 180, 181, 183–186, 188, 191, 206, 211, 213, 214, 220, 222, 225, 235–237, 242, 251, 252, 256, 259, 260 Cobb, T. 45, 52, 57, 58, 88–91, 93, 94, 104, 111, 112, 127, 162, 164, 168, 175, 177, 182, 213, 215, 219, 223, 224 COCA 16, 18, 88, 90, 102, 149, 168, 176, 213, 239, 245, 247–250, 252, 253, 259 cognitive xix, 20, 42, 51, 65, 102, 108, 154, 158, 214–216, 221, 251 collaboration 47, 48, 56–58, 145, 147, 148, 151, 152, 181, 183, 220, 227, 239, 243 ColloCaid 5, 52, 72–79, 81, 83, 190 collocation xvi, xviii, 20, 56, 63, 64, 71–76, 78, 79, 81–83, 95, 187, 188, 191–193, 200, 203, 205–209, 215, 239, 243, 252, 253 communication xvii, 25, 26, 28, 32, 35–38, 40–42, 48, 152–154, 158, 178, 212, 221, 227, 239, 257 community 16–18, 26, 57, 143, 145–147, 149–152, 154, 177, 189, 241, 252, 256, 257

competence xvi, 6, 59, 61, 63, 68, 157, 158, 178, 180, 184, 185, 187–189, 191, 192, 203, 204, 207–209, 218, 239 comprehensible 83–85, 88, 89 comprehension xvi, 5, 7, 23, 53, 60, 84, 85, 90, 91, 158, 159, 163, 166, 226, 227, 231, 237 computational 17, 19, 32, 68, 83, 135, 136, 188, 203, 204, 215, 217, 218, 220, 223 computer 8, 9, 11, 16, 23–25, 38, 47, 53, 90, 105, 108, 129, 135, 150, 162, 167–169, 182, 183, 203, 212, 218–221, 224, 225, 236 concordance 4, 8, 9, 12, 13, 20, 21, 23, 49, 50, 53, 61, 63–65, 75, 76, 81, 82, 93, 95, 98–100, 102, 108, 109, 112, 113, 115, 116, 122, 144, 181, 205, 212, 219–221, 224, 225, 230–235, 246, 247, 249, 250, 253 concordancer 4, 6, 9, 10, 12, 13, 19, 43, 52, 112, 127 concordancing 1, 12, 22, 55, 56, 105, 127, 182, 183, 212, 217, 219–221 conversation 9, 37, 43, 48, 131, 220, 244, 252 corpus xii–xix, 1–13, 15–22, 25–28, 30, 34–40, 43–45, 47–60, 62–65, 67–70, 72, 73, 80–91, 93–95, 97, 99–112, 115–117, 120, 121, 126–133, 135–138, 140–150, 153–155, 160–169, 171, 174–186, 188–190, 201, 204–215, 219–222, 224–226, 228–231, 233, 235–237, 239, 240, 244–247, 250–260 corpus-aided 57, 80, 93, 104, 127, 144, 150, 183, 221, 236 corpus-based xiv, xvi, xviii, 6, 7, 17, 20, 25, 36, 38, 39, 53, 57, 58, 67, 70, 93–95, 100–105, 107, 109, 111, 129, 130, 133, 139–141, 143–146, 148–150, 159, 162, 164, 167, 174, 175, 177, 178, 182–193, 203–206, 208–210, 221, 236, 253, 255, 259 corpus-driven 7, 56, 57, 72, 79, 172, 225, 252 courses xvii, 6, 16, 43, 58, 63, 82, 91, 106, 140, 141, 145, 147, 152, 155, 160, 183, 187, 188, 191, 207, 224, 225, 256, 258, 259 Coxhead, A. 86, 90, 107, 160, 168, 171, 241, 243, 252, 253 critical xv, 6, 13, 17, 19, 26, 29, 34, 35, 38, 76, 78, 102, 109, 114, 115, 122, 134, 147, 155, 172, 181, 215, 222, 223, 225, 238, 239

Index

Crosthwaite, P. 1, 2, 8–11, 13, 15, 16, 45, 53, 80, 82, 83, 102, 104, 105, 111, 127, 150, 171, 175, 182, 183, 211, 214–217, 219, 227, 233, 236, 257, 258 Csomay, E. 93–95, 102, 104–107, 221 curriculum 38, 85–87, 90, 156, 172, 213, 255, 256 data-driven xii, xv, xvi, xviii, xix, 1, 5, 7–9, 18, 19, 21, 22, 43, 53–55, 57–59, 67–70, 72, 80, 82, 83, 93, 101, 104–108, 110, 111, 127, 129, 130, 142, 150, 174, 175, 182–188, 209, 211, 214, 219–224, 227, 235–238, 253, 255, 258 database 10, 72, 73, 79, 143, 154, 160, 161, 164, 192 DDL xv, 1–14, 18–20, 22, 23, 43–72, 76, 79, 82, 93–95, 97, 100–103, 106, 108, 109, 111, 127, 142, 144, 146, 148, 151, 155, 174, 175, 183, 186, 188, 193, 205, 206, 209, 211–219, 222–227, 232, 235, 236, 238, 255, 258 description 24, 28, 42, 116, 131, 132, 134, 138, 206, 209 design xii, xv, xix, 6, 9, 11, 14, 15, 18, 19, 21, 28, 32, 35, 56, 57, 63, 95, 101, 103–105, 170, 177, 178, 180, 181, 184, 186, 220, 254, 256 development xv, xviii, 4, 7, 16, 25, 35, 37, 48, 52, 54–56, 59, 61, 68, 72, 76, 77, 80, 83, 100–103, 105, 106, 115, 124, 129, 138, 145, 151, 152, 157, 159, 162, 166–168, 171, 175–177, 180–184, 187–191, 203, 205, 207, 220, 227–229, 237–239, 244, 253–255, 259 dimension 28, 30–34, 158, 159, 205, 218, 227 disciplines 15, 45, 48, 72, 94, 103, 110, 112, 120, 121, 130, 131, 176, 177, 181, 229, 241, 242, 246, 250, 251, 254–258 discourse xiv–xvii, xix, 7, 9, 17, 18, 30, 32, 34–38, 40, 50, 67, 68, 94, 95, 99, 104, 111, 127, 129, 131, 159, 166, 175, 189, 209, 220, 228, 231, 234, 241–248, 251, 252, 255, 256, 258 EAP xv, xvii, xviii, 7, 10, 80, 83, 104, 105, 109, 110, 127, 144, 154, 155, 182, 219, 227–229, 231–236, 241, 242, 244, 249–252, 254–259 education xiv–xvi, xviii, xix, 2, 4–6, 8, 36, 38, 40, 47, 53, 54, 67, 70, 84, 86, 90, 91, 94, 107, 130, 142–150, 154, 155, 160, 167, 170, 172, 173, 175–179, 182, 185,
209–211, 218–224, 227, 228, 235–239, 242, 246, 250, 253, 255, 257 educational xii, xv, xix, 2, 5–7, 34, 67, 84, 86, 89, 108, 142, 148, 150–152, 154, 155, 172, 181, 212, 213, 218, 220, 222, 223, 227, 239 educators 2, 6, 7, 58, 70, 147, 148, 158, 175, 180, 217, 218, 222, 256, 258, 259 effectiveness 45, 49, 81, 94, 95, 172, 180–182, 185, 254 empirical 2, 54, 60–63, 66, 80, 118, 157, 177, 188, 209, 244, 249, 251, 254 ESP xii, xv, xviii, 5, 9, 10, 80, 94, 95, 103–107, 127, 144, 223–225, 253, 255 essay 81, 109, 112, 124, 125, 127, 207 evidence 45, 59, 60, 63, 66, 67, 76, 90, 115, 117–119, 121, 126, 132, 133, 171, 174, 176, 177, 188–190, 205, 206, 208, 209, 212, 214, 243, 244, 249, 251, 254, 256–258 evidence-based 117, 126, 169, 170, 209, 244, 257 exercises 57, 58, 68, 76, 99, 128, 178, 184, 191, 193, 208, 237 exposure 68, 71, 82, 85–89, 94, 216, 237, 256 feedback xvii, 9, 10, 16, 57, 73, 77, 78, 110, 145, 151, 175, 182 fluency 7, 71, 82, 188, 189, 191, 192, 203, 205, 207 focus-on-form 93, 214, 221 foreign xviii, xix, 6, 7, 20, 48, 57, 58, 81, 90, 105, 130, 144, 150, 156, 167, 172, 175, 187–189, 191, 193, 201, 203, 206, 219, 226, 228 Frankenberg-Garcia, A. xii, 5, 71, 72, 78, 80, 82, 83, 105, 127, 227–229, 236 frequency 12, 13, 19–21, 39, 44, 60, 63, 65, 66, 68–70, 81, 102, 109, 120, 139, 160, 162, 164–166, 168, 171, 172, 189, 206, 215, 242, 247, 257, 259 function 19, 21, 43, 83, 119, 130, 135, 240, 242 Gablasova, D. 144, 150, 255–257 genre 18, 129, 130, 222, 231, 235, 254, 257 global xvii, 7, 26, 30–32, 38, 40, 54, 142, 169, 185, 227–230, 235, 237–239, 247, 253, 254, 258 grammar xii, xvi, 6, 44, 48, 50, 51, 55, 57, 74, 76, 93, 94, 97, 104, 105, 111, 139, 140, 157, 172, 174, 175, 179, 182, 207, 220, 224, 233, 244, 248, 249, 252, 258, 259
grammatical 19, 21, 67, 73, 75, 76, 78, 81, 94, 95, 99, 107, 133–135, 138, 157, 158, 163, 207, 208, 227, 244, 246, 249, 259 handout 55, 182, 220, 232 hands-off 95, 222, 229, 232, 238 hands-on 20, 52, 55, 57, 95, 103, 106, 108, 127, 129, 146, 155, 175, 176, 178, 180, 182, 229, 238, 255, 256 health 40, 152–154, 229–234, 250 Hunston, S. 108, 111, 127, 252, 253 hypothesis 45, 88, 90, 93, 120–126, 205, 215 ideologies 26, 28, 33, 34, 36, 218 idiomatic 12, 44, 70, 71 implementation 6, 51, 57, 97, 159, 180, 181, 186, 211, 239 implications 40, 48, 67, 109, 127, 139, 149, 168, 181, 182, 201, 205, 253, 255, 257 improvement 57, 62, 68, 69, 153, 179, 180, 185 incidental 85–89, 91, 214 instruction xvii, 6, 22, 25, 34, 69, 87, 90, 105, 106, 159, 162, 172, 175, 182, 184, 185, 206, 212, 222, 256, 258 instructor xvi, xvii, 11, 14–16, 108, 151, 225, 255, 256, 9–11, 13–17, 58, 106, 108, 129, 130, 155, 157, 159, 160, 235 instruments 46, 142, 154, 157–159, 161–163, 167, 169 integration 1, 2, 5, 17, 73, 77, 94, 97, 179, 184, 217, 222 interactive 50, 72–74, 181, 185, 216, 239 interdisciplinary xvii, 37, 68, 86, 174, 204, 239 interpretable 131, 133, 134, 137, 138, 140 investigation 5, 14, 56, 68, 102–104, 120, 169, 188, 213, 242, 244, 245, 256 Jablonkai, R. R. xiv, 5, 93–107, 221 Karpenko-Seccombe, T. xiv, 6, 111, 127–130, 252, 253 keyword 10, 11, 15, 19, 20, 22, 56, 70, 72, 113, 116, 161, 226, 231, 239, 12, 13, 19, 36, 225, 229–231, 239 #Lancsbox 56, 136, 138, 256, 257 languages xv, xvi, xix, 5, 11, 45, 49, 50, 59–62, 67, 140, 141, 146, 148, 153, 160, 166, 168, 169, 182, 191, 192, 203, 206, 209, 219, 228, 242

Larsson, T. xiv, 6, 131, 138, 140 lemma 27, 96, 97, 99, 100 Leńko-Szymańska, A. xiv, 110, 157–159, 164–166, 168–170, 172, 175, 176, 182–184 lexical xvii, xviii, 6, 12, 18, 22, 23, 28–36, 67, 72, 73, 76, 78, 79, 81, 85, 88, 89, 94, 95, 99, 101, 104, 107–110, 129, 133, 157–169, 172, 186, 188–190, 204–208, 236, 239, 254, 255, 257 lexicography xii, 73, 80, 81, 168, 175, 190, 204, 205, 209, 210, 220, 236 lexicon 60, 88, 159–162, 166, 172, 204, 234, 242 lexis xiv, 35, 104, 141, 157, 158, 163, 186 limitations 15, 26, 35, 46, 49, 53, 61, 145, 147, 155, 179, 180, 258 Linggle 49, 52, 82, 83 linguistics xii–xix, 1, 3, 5–10, 16–18, 21–23, 25, 35–37, 40, 45–47, 53, 55, 56, 58–60, 67, 68, 70, 80, 83, 86, 89, 91, 93, 104–106, 108, 109, 121, 127, 130, 135–138, 142–145, 147–150, 168, 169, 174–177, 179, 181, 183, 184, 186, 188–190, 204, 205, 210, 211, 213, 219–221, 226, 229, 236, 240–242, 252–260 literacy xiv, 38–40, 90, 100, 102, 143, 144, 146–148, 150, 154, 176, 177, 179–183, 218, 222, 258 longitudinal 54, 63, 68, 182 materials xv, xix, 1, 7, 8, 19, 25, 46, 52, 63, 64, 69, 70, 80, 92, 94, 95, 97, 102, 104–109, 111, 126–129, 143–146, 148–152, 155, 156, 159, 163, 177–181, 184–186, 188, 206, 209, 217, 220, 228, 229, 237, 238, 241, 251, 252, 255, 256, 258, 259 McCarthy, M. 18, 24, 35, 55, 94, 97, 104, 105, 169, 189, 204, 213, 220, 221 McEnery, T. 36, 105, 138, 140, 141, 150, 175, 183, 220, 257 media 4, 8, 26, 29, 33–41, 45, 217 meta-analysis 45, 52, 58, 68, 91, 93, 104, 175, 182, 183, 219, 224 metacognition 180, 217 metaphor xii, xvi, 149 methodology 6, 38, 46, 48, 56, 58, 69, 105, 106, 109, 111, 128, 141, 189–192, 203, 205, 206, 223, 253, 255 methods xiv, xvi, 6, 10, 12, 14, 25, 36, 41, 47, 48, 56, 58, 85, 106, 109, 111, 131, 140, 142, 143, 152, 159, 165, 168, 183, 192, 213, 217, 222, 223, 229, 254, 256
MICUSP 6, 112, 120–122, 124–126, 128, 130, 250 models 4, 16, 68, 86, 87, 89, 90, 101, 103, 158, 219 monolingual 67, 193, 208, 210 morphosyntactic 192, 194, 200, 201 multi-word 20, 60, 95, 99, 170, 189, 190, 206–208, 243 multidimensional xii, 4, 25, 26, 28, 32, 34–36, 182, 205, 254 multilingual 7, 188, 191, 193, 203–206, 209, 238 multimodal xvii, 4, 25–34, 36–41, 85, 103, 239 native 10, 21, 67, 81, 82, 166, 184, 206, 234, 251 nominalization 131–133, 136 nominals 133, 134, 139 normalisation 1, 48, 53, 103 noun 73, 76, 77, 112, 131–133, 149, 192, 201, 202, 243, 246–249 objective 69, 139, 152, 170, 188, 213, 223, 225 observation 23, 41, 108, 227 Orenha-Ottaiano, A. xiv, 3, 6, 187–189, 191, 204–207, 209 paper-based 52, 64, 68, 69, 95, 144, 184 Paquot, M. 72, 80, 94, 105, 205 part-of-speech 27, 36, 82, 135, 136, 161 participants 3, 20, 41, 47, 63, 69, 78, 89, 129, 140, 153, 154, 158, 178, 180, 185, 186 pattern 12, 13, 30, 82, 118, 124, 125, 127, 129, 206, 222, 223, 247 pedagogic 70, 93, 95–97, 101–104, 183, 203, 225, 253 pedagogical xv, 2, 5, 33, 44, 47, 49, 51, 55, 59, 67, 84, 86, 94, 102, 103, 159, 160, 163, 168, 172, 174–178, 180–182, 184–188, 191, 205, 206, 211, 219, 228, 252, 254, 256, 258, 259 pedagogy xiv, 1, 6, 24, 62, 68, 70, 86, 93–95, 100–103, 130, 144, 145, 147, 148, 151, 157, 174, 176, 177, 181, 183–186, 207, 218 perceptions 41, 56, 57, 69, 78, 150, 186, 218, 221 Pérez-Paredes, P. xiv, 2, 4, 7, 8, 49, 53, 97, 103, 105, 176, 183, 211–213, 215–221, 224, 227, 236
perspective 4, 25, 28, 30, 34, 35, 37, 47, 49, 82, 103–106, 137, 158, 176, 204, 205, 216, 220, 227, 246, 256, 257, 259 phraseography 6, 7, 187–190, 206, 209 phraseological xii, 6, 18, 60, 61, 68, 110, 133, 187–189, 191, 203, 205–210 PLATCOL 191–195, 198–201, 203, 205, 209 Portuguese xv, 7, 36, 191–193, 200, 201, 228, 233 post-modifiers 139 practical xv, 4–7, 9, 18, 21, 39, 44, 54, 70, 89, 101, 103, 111, 112, 127–129, 142, 147, 158, 164, 181, 185, 186, 208, 223, 256 pre-service 6, 144, 145, 149, 150, 174, 176, 178–182, 184, 185, 191, 259 proficiency 6, 22, 46, 56, 58, 60, 63, 68, 71, 81, 84, 85, 95, 103, 108, 136, 143, 157–160, 163, 164, 166–170, 172, 175, 176, 188, 189, 191, 203, 207, 228, 238, 242 prosody 12, 42, 248 prototype 73, 74, 78, 79, 194, 195, 198–201 Python 10, 26, 27 quantitative xiv, xviii, 45, 131–133, 135, 136, 168, 254 queries 26, 43, 48, 51, 112, 146, 148, 213–215, 225 reflection 57, 103, 149, 179, 215, 257 register xii, xvi, 28, 35, 36, 134, 166, 244, 245, 254 regression 137, 138 reliability 140, 154, 190, 216 replication 14, 91, 152, 154–156 researcher xii–xix, 15, 17, 20, 46, 54, 55, 186, 220, 233, 257 research-practice 1, 4, 104, 155 resource xiv, 4, 51, 54, 55, 67, 89, 92, 106, 129, 130, 141, 144, 145, 149, 151, 160, 175, 203, 238, 252, 253 rhetorical 34, 61, 111, 124, 127–130, 175, 182 role 1, 5, 6, 16, 21, 40, 43, 51, 54, 95, 105, 108, 110, 114, 119, 155, 158, 163, 167, 172, 174, 179, 181, 184, 190, 191, 193, 195, 198, 200, 208, 213, 216–219, 237, 240, 251 scaffolded 81, 225 school xv–xix, 16, 47, 86, 90, 144, 148, 153, 166, 172, 180, 182, 183, 185, 204, 242, 253
search 12, 13, 18–20, 26, 28, 34, 43, 44, 49–52, 78, 83, 100, 102, 103, 113–122, 124, 125, 130, 149, 160, 163, 171, 176, 178–180, 199, 220, 221, 224, 226, 231, 232, 234, 235, 244, 245, 248, 249, 259 secondary 47, 50, 54, 55, 86, 90, 144, 145, 149, 150, 160, 162, 176, 178, 179, 182, 183, 235, 238 self-directed 7, 93, 101, 102, 106, 151, 175, 211, 213, 214, 216, 221 semantic 12, 18, 19, 39, 40, 60, 68, 95, 109, 110, 141, 204, 222, 239, 248, 253, 256 sentence 63, 76, 82, 83, 115, 134, 135, 139, 153, 154, 163, 202, 228, 231, 235, 243, 246, 248 significant 3, 20, 53, 54, 64–67, 69, 70, 78, 107–109, 111, 121, 151, 159, 163, 166, 167, 172, 180, 187, 191, 217, 222, 225, 227, 240, 259 SKELL 6, 49, 52, 82, 83, 102, 112, 113, 119, 122–124, 128, 130, 259 sketchengine 18, 52, 102, 128, 168, 170, 204, 229, 235, 250 SLA 5, 59–61, 63, 66, 67, 69, 70, 158, 159, 167, 176–178, 211, 215 social 4, 8, 26, 33–38, 40, 41, 45, 127, 149, 184, 212, 214, 217, 221, 238, 246 software 1, 9, 10, 13, 15, 16, 18, 21–23, 26, 51, 52, 56, 65, 70, 73, 90, 102, 129, 138, 148, 153, 161, 167–169, 172, 190, 212, 236, 256, 257 speakers 3, 4, 63, 70, 81, 82, 107, 133, 141, 184, 187, 209, 214, 215, 217, 218, 236 statistics 11, 13, 35, 57, 137, 138, 237, 244, 249, 250 strategies 5, 34, 37, 41, 53, 54, 101–103, 107, 147, 153, 154, 158, 177, 192, 217, 231, 258 structures 19, 21, 70, 83, 109, 110, 134, 140, 192, 194, 200, 202, 206 student xv–xviii, 5, 22, 41, 56–58, 67, 69, 80, 81, 85, 87, 94, 100, 105, 127, 128, 139, 142, 145, 146, 154, 170, 174, 182, 207, 219, 233, 234, 236, 243, 250, 255, 256 study xvi, 5, 23, 28, 35–39, 42, 46, 56, 57, 59, 61–63, 67–69, 73, 78, 83, 87–90, 98, 103, 107, 115–117, 130, 134–137, 143, 155, 158, 163, 166, 169, 179, 180, 183, 207, 208, 216, 220, 221, 224, 226, 228, 229, 236, 238, 241–249, 253–255 sustainable xviii, 7, 105, 146, 148, 171, 228, 229, 237–239

syntactic 132, 134–136, 138, 139, 141, 222, 223 system 9, 11, 13, 16, 19, 32, 39, 40, 53, 74, 83, 168, 184, 220, 221 systematic 8, 28, 54, 69, 82, 101, 105, 154, 160, 161, 183, 219–221, 257 Szudarski, P. xviii, 7, 71, 81, 95, 105, 208–210 TaLC 216, 220 tasks 15, 34, 38, 69, 72, 91, 93, 94, 98, 100, 101, 103, 106–108, 112, 121, 127–129, 147, 158, 159, 161, 163, 166, 181, 190, 213, 215, 223–225, 236, 251, 252 teach 7, 22, 47, 50, 70, 75, 76, 82, 106, 127, 129, 143, 146, 147, 151, 156, 171, 174, 178, 182, 188, 207, 227, 231, 233, 254, 256, 259 teachers 1–6, 19, 22, 32–34, 43–45, 47, 48, 51, 52, 55, 57, 58, 67, 70, 77, 78, 80, 87–89, 94, 95, 102, 105, 107, 108, 110, 111, 128, 129, 138–140, 142–146, 148–150, 158, 168–172, 174–187, 189, 191, 193, 201, 203, 205, 208, 209, 212, 221, 223, 225, 232, 233, 235–238, 242, 254, 259, 260 technical xix, 9, 10, 13, 15, 19–21, 25, 26, 33, 68, 94, 100, 102, 103, 106, 127, 131, 146–149, 153, 170, 171, 181, 186, 239, 253, 255 techniques xii, xviii, 6, 46–48, 50, 55, 59, 61, 100, 131, 137, 138, 140, 141, 224, 254, 257 technology xii, xvii–xix, 2, 3, 5–7, 9, 25, 35, 48, 49, 51–53, 67, 68, 83, 90, 102, 104, 127, 150, 152, 154, 170, 174–178, 180–185, 217–220, 222–224, 235, 251, 253, 258 terminology 32, 46, 47, 67, 161, 239 tests 91, 137, 138, 158–164, 167, 172, 208 textbooks 18, 43, 71, 145, 148, 150, 170–172, 184, 212, 227, 228, 251, 259 texts 9, 11, 13, 15, 16, 19, 22, 26, 30, 34, 36–40, 43–45, 51, 71, 72, 74, 76, 78, 82, 83, 90, 95, 98, 109, 124, 129, 130, 135, 136, 140, 141, 152, 158, 160, 161, 164–166, 189, 190, 192, 207–209, 212, 215, 216, 225, 227, 230, 231, 233, 234, 239, 242, 243, 245, 246, 248, 251, 256 textual 4, 25, 26, 29, 32, 35, 37, 124, 125, 158, 242, 244
theory 67, 102, 110, 129, 174, 176, 179, 181, 208, 221, 239, 240 thesis 104, 115–117, 120, 150, 204, 206, 241, 242, 245, 246, 253 tool 1, 4, 5, 7, 9–13, 15–17, 20, 23, 26, 27, 33, 36, 39, 43, 44, 49, 51, 55, 72–75, 77–79, 81–83, 87, 120, 134–136, 147, 151, 153, 160–164, 167, 169, 170, 185, 191, 203, 220–222, 226, 229, 230, 233, 236, 254 topic xvii, 26, 41, 56, 84, 103, 116, 117, 136, 139, 140, 149, 151, 155, 160, 169, 171, 210, 231, 233, 259 traditional 4, 7, 10, 13, 17, 38, 49, 56, 58, 65, 67, 117, 135, 188, 190, 223, 235, 256, 259 translation xii, xiv, xviii, 21, 50, 56, 57, 67, 78, 140, 141, 147, 158, 161, 175, 179, 204, 206, 209, 211 undergraduate xv, xvi, 56, 79, 102, 105, 127, 155, 156, 183, 214, 229, 231, 233, 250, 255, 256, 259 units 41, 60, 67, 99, 187–190, 203, 205–208, 252 university xii–xix, 3, 17, 18, 22–24, 35, 36, 40, 45, 47, 54, 56–58, 67, 68, 70, 77, 78, 80, 81, 83, 90, 104–106, 108, 112, 127–129, 131, 138, 145, 150, 151, 154, 160, 167, 168, 170, 171, 175, 178, 180, 182, 184, 204, 205, 207, 210, 211, 218, 224, 228, 236, 238, 250, 253, 260 usage 10, 12, 14, 15, 17, 20, 23, 34, 41, 44, 68, 69, 95, 108, 112, 114, 120, 121, 127, 130, 158, 170, 175, 180, 184, 215, 219, 222, 224, 227, 228, 258
usage-based 45, 48, 93, 104, 138, 214, 216, 219 user-friendly 44, 52, 54, 55, 58, 100, 102, 103, 129, 146, 167, 181, 186 variation xii, xvi, 28, 35, 36, 46, 85, 122, 139, 148, 149, 175, 207, 220, 242, 244, 245, 253–256 verb 13, 60, 73, 75–77, 99, 130–133, 200, 244, 245, 249 Versatext 102, 222, 226 Viana, V. xv, 7, 94, 101, 106, 240, 244, 250–253, 256–259 visual 25–35, 37, 38, 40–42, 49, 224 vocabulary xiv, xvi, xviii, xix, 5, 6, 14, 21–24, 34, 35, 40, 41, 51, 55, 68, 69, 71, 72, 75, 78, 80, 84–91, 93, 95, 98, 104, 105, 111, 112, 118, 126, 131, 139, 157–173, 175, 179, 180, 182, 183, 185, 186, 208, 217, 220, 222–224, 228, 232, 235, 240, 241, 243–245, 248, 253–257, 259 Vyatkina, N. 1, 7, 45, 53, 61, 62, 67, 93, 94, 104, 111, 127, 144, 151, 175–177, 182 web-based 225, 247, 253 wordlists 1, 90, 95, 104, 167–169 workshops 21, 47, 73, 77, 78, 80, 151, 152, 178, 236 writers xv, 5, 71–75, 77–83, 111, 114, 117, 118, 120, 122, 127, 128, 183, 209 YouGlish 50, 52, 54, 55 YouTube 50, 52, 55, 92, 155, 213 Zoom 2, 19