Extending the Scope of Corpus-Based Translation Studies 9781350143258, 9781350143289, 9781350143265

With the rapid growth of corpus-based translation studies (CBTS) over recent years, this book offers a timely overview

English Pages [289] Year 2022

Table of contents :
Half Title
Series Page
Title Page
Copyright Page
Contents
Illustrations
Contributors
Introduction
Part I: Corpus-based translation studies: Current challenges and future perspectives
Chapter 1: Corpus-based translation and interpreting studies: A forward-looking review
Chapter 2: Expanding the reach of corpus-based translation studies: The opportunities that lie ahead
Part II: Recent methodological and theoretical developments in CBTS
Chapter 3: Translation as constrained communication: Principles, concepts and methods
Chapter 4: On the use of multiple methods in empirical translation studies: A combined corpus and experimental analysis of subject identifiability in English and German
Part III: Corpus-based empirical studies
Chapter 5: Syntactic properties of constrained English: A corpus-driven approach
Chapter 6: Grammatical metaphor in translation: A corpus-based investigation of nominal of-constructions
Chapter 7: Detecting normalization and shining-through in novice and professional translations
Part IV: Corpus use in translator training
Chapter 8: Translation quality in student specialized translation: The impact of corpus use
Chapter 9: Using comparable corpora for translating and post-editing complex noun phrases in specialized texts: Insights from English-to-French specialized translation
Index

Extending the Scope of Corpus-Based Translation Studies

BLOOMSBURY ADVANCES IN TRANSLATION SERIES
Series Editor: Jeremy Munday, Centre for Translation Studies, University of Leeds, UK

Bloomsbury Advances in Translation publishes cutting-edge research in the fields of translation studies. This field has grown in importance in the modern, globalized world, with international translation between languages a daily occurrence. Research into the practices, processes and theory of translation is essential and this series aims to showcase the best in international academic and professional output.

A full list of titles in the series can be found at: https://www.bloomsbury.com/uk/series/bloomsbury-advances-in-translation

Recent titles in the series include:
Celebrity Translation in British Theatre, Robert Stock
Collaborative Translation, edited by Anthony Cordingley and Céline Frigau Manning
Genetic Translation Studies, edited by Ariadne Nunes, Joana Moura and Marta Pacheco Pinto
Institutional Translation for International Governance, Fernando Prieto Ramos
Intercultural Crisis Communication, edited by Federico M. Federici and Christophe Declercq
Sociologies of Poetry Translation, edited by Jacob Blakesley
Systemic Functional Linguistics and Translation Studies, edited by Mira Kim, Jeremy Munday, Zhenhua Wang and Pin Wang
Telling the Story of Translation, Judith Woodsworth
The Pragmatic Translator, Massimiliano Morini
Translating Holocaust Lives, edited by Jean Boase-Beier, Peter Davies, Andrea Hammel and Marion Winters
Translating in Town, edited by Lieven D’hulst and Kaisa Koskinen

Extending the Scope of Corpus-Based Translation Studies Edited by Sylviane Granger and Marie-Aude Lefer

BLOOMSBURY ACADEMIC
Bloomsbury Publishing Plc
50 Bedford Square, London, WC1B 3DP, UK
1385 Broadway, New York, NY 10018, USA
29 Earlsfort Terrace, Dublin 2, Ireland

BLOOMSBURY, BLOOMSBURY ACADEMIC and the Diana logo are trademarks of Bloomsbury Publishing Plc

First published in Great Britain 2022

Copyright © Sylviane Granger, Marie-Aude Lefer and Contributors, 2022

Sylviane Granger and Marie-Aude Lefer have asserted their right under the Copyright, Designs and Patents Act, 1988, to be identified as Editors of this work.

Cover Illustration © hellena13/shutterstock

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers.

Bloomsbury Publishing Plc does not have any control over, or responsibility for, any third-party websites referred to or in this book. All internet addresses given in this book were correct at the time of going to press. The author and publisher regret any inconvenience caused if addresses have changed or sites have ceased to exist, but can accept no responsibility for any such changes.

A catalogue record for this book is available from the British Library.
A catalog record for this book is available from the Library of Congress.

ISBN:
HB: 978-1-3501-4325-8
ePDF: 978-1-3501-4326-5
eBook: 978-1-3501-4327-2

Series: Bloomsbury Advances in Translation

Typeset by Deanta Global Publishing Services, Chennai, India

To find out more about our authors and books visit www.bloomsbury.com and sign up for our newsletters.

Contents

List of Illustrations
List of Contributors
Introduction  Sylviane Granger and Marie-Aude Lefer

Part I  Corpus-based translation studies: Current challenges and future perspectives
1  Corpus-based translation and interpreting studies: A forward-looking review  Sylviane Granger and Marie-Aude Lefer
2  Expanding the reach of corpus-based translation studies: The opportunities that lie ahead  Federico Gaspari

Part II  Recent methodological and theoretical developments in CBTS
3  Translation as constrained communication: Principles, concepts and methods  Haidee Kotze
4  On the use of multiple methods in empirical translation studies: A combined corpus and experimental analysis of subject identifiability in English and German  Stella Neumann, Jonas Freiwald and Arndt Heilmann

Part III  Corpus-based empirical studies
5  Syntactic properties of constrained English: A corpus-driven approach  Ilmari Ivaska, Adriano Ferraresi and Silvia Bernardini
6  Grammatical metaphor in translation: A corpus-based investigation of nominal of-constructions  Arndt Heilmann, Tatiana Serbina, Jonas Freiwald and Stella Neumann
7  Detecting normalization and shining-through in novice and professional translations  Ekaterina Lapshinova-Koltunski

Part IV  Corpus use in translator training
8  Translation quality in student specialized translation: The impact of corpus use  Heidi Verplaetse
9  Using comparable corpora for translating and post-editing complex noun phrases in specialized texts: Insights from English-to-French specialized translation  Natalie Kübler, Alexandra Mestivier and Mojca Pecman

Index

Illustrations

Figures
1.1 Proportion of corpus-based articles per journal
1.2 Breakdown of empirical, methodological-theoretical and applied studies in the dataset (n = 186)
1.3 Breakdown of applied studies in the dataset (n = 35)
1.4 Linguistic focus of the empirical studies (n = 122)
1.5 Corpus types in empirical studies (n = 122)
1.6 Written registers in empirical studies (n = 92)
1.7 Use of statistics in empirical studies (n = 122)
3.1 Variable importance plot for all three subcorpora combined
3.2 Variable importance plot for each individual varietal subcorpus
3.3 Conditional inference tree for all three corpora combined
5.1 Tree visualization of the sentence I like their curries and enjoy their signature chicken tikka
5.2 Scree plot of eigenvalues for dimensions
5.3 Mean scores and standard deviations for Dimension 1
5.4 Mean scores and standard deviations for Dimension 2
5.5 Mean scores and standard deviations for Dimension 3
5.6 Dimension 1: Results from train and test data
5.7 Dimension 2: Results from train and test data
5.8 Dimension 3: Results from train and test data
7.1 Accuracy as % for the GO register classification
7.2 Accuracy as % for the first classification scenario (normalization)
7.3 Accuracy as % for the EO register classification
7.4 Accuracy as % for the second classification scenario (shining-through)
8.1 AntConc Keyword in Context (KWIC)
8.2 AntConc Fileview for extended context of search terms
8.3 Target text compound nouns of domain-specific search terms through Advanced Search in AntConc
8.4 AntConc function searching different terms simultaneously in context
8.5 Acceptability error subclassification for the present study
8.6 Adequacy error subclassification for the present study
8.7 Proportions of acceptability error subtypes with MOC and Linguee
8.8 Proportions of adequacy error subtypes with MOC and Linguee
9.1 Concordance of solution stable in Geosciences, the EPS comparable corpus developed in-house
9.2 Evaluation steps of our corpus-based specialized translation training methodology
9.3 First principle molecular dynamics (FPMD) in the English EPS corpus
9.4 Dynamique moléculaire premier principe in the French EPS corpus
9.5 Simulations de/en/par dynamique moléculaire in the French EPS corpus
9.6 An example of classroom activity: ST and MT errors when translating complex NPs
9.7 Concordance of hydrous minerals in the English EPS corpus
9.8 Concordance of minéral hydraté in the English EPS corpus

Tables
1.1 Dataset before and after Manual Filtering
3.1 Corpus Composition
3.2 An Extended Constraint Matrix
4.1 Change in Subject Length in Translations from English to German Given as the Delta in Average Length of the Translated Subject and in Percentage Points
4.2 Product Summaries for the Frequencies of Identifiable and Non-identifiable Subjects
4.3 Target Identifiability by Source Identifiability and Target Position (Experimental Product Data)
4.4 Target Identifiability by Source Identifiability and Target Position (Corpus Data)
4.5 Translation Strategies for Identifiable and Non-identifiable Subjects
4.6 Total Reading Time by Identifiability
4.7 Translation Duration by Identifiability
4.8 Total Reading Time of the Subject by Target Text Identifiability and Subject Position
4.9 Translation Duration by Target Text Identifiability and Subject Position
4.10 Multiple Comparisons of the Effect of Theme Position on Translation Duration
5.1 Number of Words per Component, with Data Provenance
5.2 POS Dependency Bigrams Loading onto Dimension 1
5.3 POS Dependency Bigrams Loading onto Dimension 2
5.4 POS Dependency Bigrams Loading onto Dimension 3
6.1 Of-constructions by Semantic Type
6.2 Literal Translation of Grammatical Metaphoricity and De-metaphorization by Semantic Category
6.3 Results of the Binomial Mixed Regression Model
7.1 Feature Set Representing Language Conventions
7.2 Register Size in the Subcorpora of the Dataset Used (Tokens)
7.3 CQP Query for General Nouns
7.4 Differences in Normalization Effect between Professional and Student Translations (Measured by Classification Accuracy)
7.5 Differences in Shining-through Effect between Professional and Student Translations (Measured by Classification Accuracy)
7.6 Normalization and Shining-through in Professional Translations (Measured by Classification Accuracy)
7.7 Normalization and Shining-through in Student Translations (Measured by Classification Accuracy)
7.8 Normalization and Shining-through in Professional Translations (Measured by F-measure)
7.9 Normalization and Shining-through in Student Translations (Measured by F-measure)
7.10 Translationese Effects Observed for Both Translation Varieties (Measured by F-measure)
8.1 Corpus of Popular Scientific Medical Translations
8.2 Total Errors with MOC versus Linguee
8.3 Total Acceptability and Adequacy Errors
8.4 Breakdown of Acceptability and Adequacy Errors in MOC versus Linguee
9.1 Size of Annotated Corpora of Students’ Translations, Number of Annotations and of Feedback Notes
9.2 Error Analysis Conducted during the Initial Phase of Experiments

Contributors

Silvia Bernardini is Professor of English linguistics and translation at the Department of Interpreting and Translation of the University of Bologna, where she teaches translation from English into Italian and corpus linguistics. She has published widely on corpus use in translator education and for translation practice and research. Her research interests include the investigation of the points of contact between translation and interpreting, and between translation and non-native writing, seen as instances of bilingual language use.

Adriano Ferraresi is Associate Professor at the Department of Translation and Interpreting of the University of Bologna, where he is also a member of the Board of the Doctorate in Translation, Interpreting and Intercultural Studies. He got a doctorate in English for Special Purposes from the University of Naples ‘Federico II’. His main research interests are in the field of construction of linguistic resources, and corpus-based investigations of phraseology and terminology in native, translated and lingua franca varieties of English.

Jonas Freiwald is a research assistant at the English Linguistics department of the RWTH Aachen University. His research interests include corpus-based translation studies and contrastive linguistics with a focus on word order variations. His most recent publications include papers on process- and product-based translation research in New Empirical Perspectives on Translation and Interpreting and Lingua.

Federico Gaspari is Associate Professor of English language and translation at the University for Foreigners ‘Dante Alighieri’ of Reggio Calabria, where he is Director of the University Language Centre. He has a visiting affiliation to the ADAPT Centre of Dublin City University, where he collaborates on international research projects devoted to language and translation technologies. In the past he held teaching and research positions at the Universities of Manchester, Salford, Bologna (Forlì campus) and Macerata. His main research and teaching interests focus on machine translation, corpus linguistics, descriptive and variationist English linguistics, and applied translation studies.

Sylviane Granger is Emerita Professor of English language and linguistics at the University of Louvain. She has initiated several international projects in learner corpus research and corpus-based cross-linguistic studies, two fields which she views as closely interrelated. Her current research interests focus on corpus-based and corpus-driven phraseology, its implications for linguistic theory and methodology, and its applications to foreign language learning and teaching, and translation studies. Her recent publications include Translating and Comparing Languages: Corpus-based Insights (2020) co-edited with M.-A. Lefer and Perspectives on the L2 Phrasicon: The View from Learner Corpora (2021).

Arndt Heilmann is a researcher at RWTH Aachen University and has been working within the framework of the TRICKLET project. His major research interests are bilingual language processing, cognitive linguistics and syntax. He relies on computer linguistic and psycholinguistic approaches to help uncover mental processes and states during language processing. He completed his PhD thesis on the effects of syntactic source text complexity on translation.

Ilmari Ivaska is Assistant Professor at the Department of Finnish and Finno-Ugric Languages at the University of Turku. After getting his doctorate in Turku, Ivaska has worked as a postdoctoral research fellow at the University of Bologna and a visiting lecturer at the University of Washington. Ivaska is an applied linguist with research interests revolving around corpus methodologies as well as quantitative contrastive and cross-linguistic research designs, especially in the context of non-native and mediated language varieties.

Haidee Kotze is Professor and Chair of Translation Studies in the Department of Languages, Literature and Communication at Utrecht University. She is Editor-in-chief of the journal Target: International Journal of Translation Studies, and Co-editor of the book series Translation, Interpreting and Transfer. Her research explores the complexities of linguistic, social, institutional and ideological aspects of language contact (including translation) in multilingual settings. Her most recent work is at the interface of linguistics and digital humanities, and focuses on modelling language change in parliamentary discourse across varieties of English.

Natalie Kübler is Full Professor at the University of Paris. She was one of the first to introduce a corpus-driven approach into specialized translation and terminology training in France in 1999. She is Director of the CLILLAC-ARP, which is a research lab in corpus linguistics and experimental linguistics. In 2015, she created the PERL, in which an inter-university team works on online language teaching resources and research in applied linguistics. She is currently working on translation and post-editing learner corpora and on the linguistic evaluation of specialized Neural Machine Translation output.

Ekaterina Lapshinova-Koltunski is a senior lecturer at the Department of Language Science and Technology of Saarland University. She holds a degree in translation from the Volgograd State University, a PhD in computational linguistics from the University of Stuttgart and a teaching permission habilitation in Corpus Linguistics and Translation from Saarland University. She has been working in translator training and has been involved in various projects involving collection, annotation and analyses of various (translation) corpora. Her major research interests lie in linguistic phenomena in multilingual texts, and she has published numerous studies in this area.

Marie-Aude Lefer is Associate Professor of translation studies and English–French translation at the University of Louvain, where she teaches corpus-based translation studies and financial translation. Her main research interests lie in parallel corpus design, corpus-based translation studies, learner translation corpus research and English–French contrastive linguistics. She is Co-editor-in-chief of the journal Languages in Contrast.

Alexandra Mestivier is Assistant Professor in English linguistics at the University of Paris, where she teaches courses in corpus linguistics, terminology, specialized translation and introduction to CAT tools. Her main research interests are the use of corpora for terminological and phraseological analysis in specialized languages as well as in specialized translation, the integration of machine translation in the translation process, translation error annotation and the development of corpus-based teaching resources for specialized translation.

Stella Neumann is Professor of English linguistics at RWTH Aachen University. Her research interests focus on how variation shapes language in terms of register variation across languages and varieties and empirical modelling of translation in comparison with other types of language use. Her publications include a book on corpus-based translation research (Cross-Linguistic Corpora for the Study of Translations, 2012) together with Silvia Hansen-Schirra and Erich Steiner and a monograph on register (Contrastive Register Variation, 2013).

Mojca Pecman is Professor in Translation Studies at the University of Paris, where she teaches and conducts research on terminology, phraseology, discourse analysis and specialized translation. She is in charge of a project on the creation of terminological and phraseological resources within ARTES (Aide à la Rédaction de Textes Scientifiques) online multilingual and multidomain database for improving specialized language-related research and studies. Her publications include Langue et construction de connaisSENSes. Énergie lexico-discursive et potentiel sémiotique des sciences (2018).

Tatiana Serbina is a postdoctoral researcher in English linguistics at RWTH Aachen University. Her PhD thesis, entitled A Construction Grammar Approach to the Analysis of Translation Shifts: A Corpus-Based Study, applied the usage-based theory of construction grammar to the area of empirical translation studies. Her further research interests include contrastive linguistics and machine learning approaches to linguistic analysis. Her most recent publication is a paper co-authored by Mario Bisiada and Stella Neumann, which applies the Principal Component Analysis to the comparison of translation manuscripts and edited translations.

Heidi Verplaetse is Associate Professor at the Faculty of Arts of KU Leuven, Campus Antwerp. She has an academic background in linguistics, with a PhD on the topic of modality. Her current teaching includes translation and writing of medical, scientific, business and journalistic texts. She has published in the field of translation quality assessment related to the use of CAT tools and specialized translation, among other topics and domains. She is also interested in students’ translation processes and directionality in translation, including L2 translation. She is a member of the Research Group Quantitative Lexicology and Variational Linguistics.

Introduction Sylviane Granger and Marie-Aude Lefer

The field of corpus-based translation studies (CBTS) relies on corpus-linguistic methods and tools to analyse electronic corpora of authentic translations. Its central objective is to identify the defining features of translation as a form of interlingual communication, with a view to developing an empirically valid model of translation. Since its emergence in the early 1990s under the impetus of Mona Baker (1993), CBTS has grown to become a recognized area of research in translation studies. This volume offers an overview of the field and presents a variety of fresh perspectives provided by leading experts.

The majority of the studies featured were originally presented at the Using Corpora in Contrastive and Translation Studies conference (UCCTS 2018), which took place in Louvain-la-Neuve (Belgium) in September 2018. This was the fifth edition of a biennial conference series initiated by Richard Xiao (cf. Xiao 2010), who played a pioneering role in the interface between corpus-based contrastive and translation studies. The conference provided the stimulus for three publications: a proceedings volume entitled Translating and Comparing Languages: Corpus-based Insights (Granger and Lefer 2020a), a special issue of the journal Languages in Contrast, focusing on the complementary contribution of comparable and parallel corpora to cross-linguistic studies (Granger and Lefer 2020b), and the current volume, which aims to showcase some of the latest trends in corpus-based translation studies.

This publication starts with an overview of current challenges and future perspectives (Part I). The following chapters cover a wide spectrum of topics and approaches, from recent methodological and theoretical developments (Part II) through innovative corpus-based empirical studies (Part III) to practical applications in translator training (Part IV).
With a view to allowing researchers to delve deeper into areas in which they have a particular interest, the chapters in Parts II and III contain a list of recommended key readings, together with a short summary.

The chapters all have in common that they extend the scope of CBTS in multiple ways. From a theoretical and methodological perspective, this volume testifies to a genuine ‘process of tearing down the interdisciplinary walls’ (De Sutter and Kruger 2018: 55), giving ample evidence of the ‘de-isolation of CBTS’ (De Sutter and Kruger 2018). It emerges clearly that CBTS now makes widespread use of theoretical insights and research methods borrowed from neighbouring disciplines, such as translation process research, linguistic theory, contrastive linguistics, variational linguistics, contact linguistics, second-language acquisition and psycholinguistics (cf. Kruger and Van Rooy 2016, Halverson 2017, Kruger and De Sutter 2018, De Sutter and Lefer 2020). Methodological advances are also evidenced by the meticulous description of the corpora analysed, the use of sophisticated annotation systems (whether fully automatic or computer-aided) and the reliance on advanced quantitative methods based on robust multivariate statistics. These features reflect a huge leap in methodological rigour in CBTS (cf. De Sutter et al. 2012), which in turn makes it possible to obtain much more solid insights into translation and other forms of bilingualism-influenced language varieties. Another noteworthy feature is that the chapters included in this volume examine lesser-studied forms of translation, such as student translation and post-editing, and explore under-researched semantic and syntactic aspects of translated language, taking key variables into consideration, such as source language and register.

In addition to showcasing recent methodological and theoretical developments in CBTS, this volume also reports on concrete classroom experiments on the use of corpora in translator training. Interestingly, the corpus component is twofold here, as corpus-based approaches to translation quality are adopted to assess the impact of corpus use by trainee translators.
Even though electronic corpora are now widely used in translation curricula worldwide, empirical translation studies reporting on corpus-oriented teaching practices are rather rare. This is especially striking in recent edited volumes, where corpus applications, if discussed at all, are often limited to terminology and bilingual lexicography (see, e.g. Xiao 2010, Kruger et al. 2011, De Sutter et al. 2017).

Although this volume highlights several of the most recent trends in the field, it by no means provides a fully comprehensive picture. In particular, one important research strand, that of corpus-based interpreting studies, is under-represented. This is due to the fact that corpus-based studies are currently less widespread in interpreting studies than in translation studies, as indicated by the small number of interpreting studies presented at the UCCTS 2018 conference (10 per cent).

Part I opens with a survey of corpus approaches to translation and interpreting by the two editors. The aim of the chapter is to provide a quantitative and qualitative overview of the current state of corpus-based research, identify dominant trends and potential weaknesses, and suggest recommendations for further development of the field. The survey is based on scientific articles published in English in twelve scientific journals between 2012 and 2019. It was restricted to those articles that met the conditions of bona fide corpus studies in terms of data type and analytical techniques used. The first stage of the survey revealed that corpus approaches account for 11 per cent of the articles published in the eight-year period. A detailed investigation of the empirical studies, which were in the majority compared to methodology- and theory-oriented and applied studies, uncovered valuable information on the dominant research foci, corpus types and sizes, languages, modality and registers, as well as corpus techniques and statistical testing. Among other findings, the survey highlights a clear dominance of parallel corpora over monolingual comparable corpora, a strong focus on lexis and terminology, and a continued interest in translation features. It also shows a clear underuse of advanced corpus-linguistic and statistical techniques, a weakness that should be addressed in future research.

In Chapter 2, Federico Gaspari takes stock of the progress made in CBTS in the last three decades with a view to identifying gaps and challenges and suggesting ways of expanding the reach of the field in future years. He starts by discussing two areas where CBTS has significantly contributed to translation theory and corpus methodology: translation universals (e.g. explicitation) and their branching out as mediation universals, and translation directionality. He then discusses a key technological development for CBTS, namely the abundance of authentic digital translation data for many language pairs (e.g. social media, localized websites and smartphone apps). In his view, it is now up to CBTS researchers to tame the digital data and devise new methods to tap into these potential corpus resources.
As also shown in the chapter, CBTS can further extend its reach by examining other forms of translation than the ones traditionally included in parallel corpora, such as collaborative translation, amateur translation, translation crowdsourcing and video game localization. This is, according to Gaspari, an absolute priority if the discipline is to continue to thrive and remain relevant and meaningful in today’s world.

Part II covers recent methodological and theoretical trends in CBTS. It focuses on two key issues: the combined use of corpus data and experimental data and the adoption of the constrained-language framework in CBTS. In empirical translation studies, a distinction is traditionally made between product- and process-based research. So far, due to its focus on translational products (i.e. existing translations), CBTS has been mainly product-oriented. By contrast, process-oriented translation research, which makes use of experimental data derived from keystroke logging and eye-tracking, focuses on the cognitive aspects of translation. There have been several initiatives aimed at the rapprochement of these two research strands in translation studies (Halverson 2017, Vandevoorde et al. 2020), following a similar development in usage-based linguistics (see Ellis and Simpson-Vlach 2009, Gilquin and Gries 2009, Schönefeld 2011). Another recent key advancement in translation methodology and theory is the application of the constrained-language framework to translation and interpreting. The construct of constrained language refers to ‘the language produced in communicative contexts characterised by particularly conspicuous constraints’ (Kruger and Van Rooy 2016: 27), such as bilingual language activation. The framework aims to identify the commonalities and differences between constrained-language varieties (e.g. translation and non-native writing) and native varieties.

Chapter 3, written by Haidee Kotze, offers explicit theorization of the constrained-language framework, which has recently been gaining ground in corpus-based translation and interpreting studies. In the first part of this chapter, the author discusses the rationale of the framework, defines its key constructs (the notions of constraint and varioversal) and sets out its core theoretical principles. The variationist and multivariate method needed to apply the constrained-language framework is then illustrated in a corpus study on the omission of the complementizer that in three varieties of English (English translated from Afrikaans, British English and South African English). While the varieties investigated are similar with respect to three constraint dimensions (register, proficiency and task expertise), they differ in the constraints of language activation (bilingual vs monolingual) and text production (dependent vs independent).
The results of the random forest analysis and conditional inference tree modelling show that there are subtle, but significant, differences across varieties. While similar conventionality- and complexity-related factors are found to condition that-omission in the three varieties, corpus data show that translators opt for explicit that more often than writers in some specific registers and contexts.

Chapter 4, by Stella Neumann, Jonas Freiwald and Arndt Heilmann, deals with a key methodological development in corpus-based translation studies, namely the use of multiple methods. After an insightful discussion of how corpus, eye-tracking and keystroke logging data can be combined, the mixed-methods approach to translation is illustrated with a case study on subject identifiability in English and German. Identifiability refers to whether subjects contain given (or inferable) information or not. Taking as a starting point cross-linguistic differences in word order between English and German, the authors examine the role of subject identifiability on the positional shifts observed in translation. Their integrated analysis of corpus data, experimental product data and behavioural data confirms their initial hypothesis, namely that English non-identifiable subjects are more prone to translation shifts than identifiable subjects and that their translation is more effortful, possibly because they trigger a wider range of translation options from which to choose. More generally, it also emerges from the study that despite the undisputed value of experimental data to understand the translation process, corpus data, thanks to their authenticity and high ecological validity, provide access to translational products in ways that cannot be equalled by experimental product data.

Part III features three corpus-based empirical studies. Unlike the majority of CBTS studies, which mostly rely on fairly simple corpus-linguistic techniques and basic frequency counts, the three studies all use cutting-edge data extraction techniques, annotation systems and quantitative methods to provide innovative insights into translational products, taking into account a wide array of factors alongside the translated versus non-translated status of the texts under scrutiny.

The first chapter in Part III, Chapter 5, co-authored by Ilmari Ivaska, Adriano Ferraresi and Silvia Bernardini, reports on a corpus-driven study within the constrained-language framework. It aims to assess the degree of commonality of constrained varieties that involve bilingual language activation, namely English translated from Italian and German (dependent text production) and non-native English writing by native speakers of Italian and German (independent text production), against a benchmark of native non-translated English.
The focus of the study is on part-of-speech dependency bigrams, extracted automatically from syntactically parsed data, a data type that is still rarely used in translation research. The study further innovates in controlling for both constraining language (Italian and German) and register (argumentative writing, political speeches and tourism-related communication), thus making it possible to disentangle cross-linguistic from register-dependent differences. Based on a method involving random forest-based keyness analysis and multidimensional analysis, the study brings out a complex interaction of constrainedness and register effects. Features shared by the constrained varieties include a tendency to rely on post-nominal modification and common nouns with determiners. Register variation proves to have a significant impact across varieties, although constrained varieties appear to be less register-sensitive than unconstrained ones.

Chapter 6, co-authored by Arndt Heilmann, Tatiana Serbina, Jonas Freiwald and Stella Neumann, deals with grammatical metaphor in translation. In systemic functional linguistics (SFL), grammatical metaphor refers to a mismatch between meanings and their lexico-grammatical realizations. In this chapter, the authors set out to identify the causes of translation shifts from grammatically metaphorical of-constructions in English (e.g. the creation of complex objects) to more congruent, that is, more explicit, renderings in German (e.g. by means of verbs). To do so, a range of source-language variables are examined, such as the semantic category and the context and co-text of the nominal of-constructions (phrase type and complexity) in two registers, namely popular scientific writing and tourism brochures. Relying on a richly annotated corpus dataset and regression modelling, the authors show that the variables involved in the analysis do not influence translation shifts (except the semantic category of engagement). This suggests that target language variables, such as idiomaticity, need to be considered in future studies to gain deeper insights into the phenomenon of de-metaphorization in translation.

The last chapter (Chapter 7) in Part III, by Ekaterina Lapshinova-Koltunski, examines the two phenomena of normalization and shining-through, that is, adherence to target-language versus source-language conventions, in both professional and student translations. Starting out from a large set of lexico-grammatical features derived from variational linguistics and SFL, the author uses state-of-the-art automatic text classification techniques (supervised machine learning) to trace normalization and shining-through in a multiregister English-to-German parallel corpus. Language conventions in English and German are modelled on the basis of comparable texts originally authored in the two languages.
The results indicate that normalization and shining-through patterns do not vary across levels of translation experience, which goes against the author’s initial hypothesis that novice translators should display lower register sensitivity than professionals. However, clear register-specific trends emerge from the analysis, namely normalization in fiction, popular science and political speeches, and shining-through in political essays, instruction manuals and tourism brochures. In future research, lexical features will need to be added to the current lexico-grammatical feature set to better characterize variation at the level of translation experience.

Translator training, the topic of Part IV, has been one of the objectives of CBTS from the very beginning. Zanettin (1998: 12), for example, suggested that ‘[t]ranslation activities based on bilingual corpora can be integrated into the curriculum of trainee translators and provide a means of learning about
aspects of language that are otherwise not easily detectable’. While the potential of corpora for translator training is universally recognized by CBTS researchers, it is rarely assessed in authentic pedagogical practices. The two studies in Part IV steer clear of this weakness and provide detailed reports of classroom experiments. The experiment reported by Heidi Verplaetse in Chapter 8 aims to investigate the quality of English-to-Dutch popular scientific translations, carried out by Dutch-speaking translation students in different classroom conditions. More particularly, the study aims to assess the relative impact of the use of a domain-specific monolingual target language corpus (MOC) versus the general bilingual concordancer Linguee. The student translations were fully annotated for errors, distinguishing between acceptability errors, which are independent of the source text, and adequacy errors, which are related to transfer from source to target text. The author hypothesized that MOC use would result in a lower number of acceptability errors and a higher number of adequacy errors and that reverse results would be obtained for translations with Linguee. While the results for adequacy errors were confirmed, those related to acceptability were disconfirmed, the MOC condition resulting in a higher acceptability error rate, a finding that calls into question the positive impact of monolingual corpus use reported in some earlier studies. An analysis of error subtypes yields a wealth of interesting findings, some of which, such as the lower proportion of lexical errors in the MOC condition, show that the debate on the use of monolingual versus bilingual resources is far from settled. 
Chapter 9, co-authored by Natalie Kübler, Alexandra Mestivier and Mojca Pecman, also reports on a classroom experiment but differs from the preceding chapter in focusing, not on the students’ overall performance, but on one specific difficulty they are faced with, that is, the translation of complex noun phrases (CNPs) in specialized texts. The data consists of English-to-French translations produced by French-speaking translation students, in which all errors affecting CNPs have been annotated with the help of a fine-grained typology. The aim of the study is to compare students’ output under two task conditions: with the sole help of bilingual dictionaries and term-bases versus a much wider range of resources, in particular two specialized English and French comparable corpora that students had learned to compile and query. A detailed analysis of the errors showed no difference in the number of errors produced with and without corpus, thereby casting further doubt on the all-round benefit derived from corpus use and pointing to the need for further research. An additional innovative task performed by the students provides valuable insights into their skills in

post-editing machine-translated CNPs. The chapter ends with an overview of remedial classroom activities. In her introduction to a special issue of Meta, Sara Laviosa (1998: 474) highlighted the potential of the corpus-based approach to translation studies to evolve ‘through theoretical elaboration and empirical realization, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation’. Over twenty years later, her prescience has been amply borne out. The studies in the current volume provide evidence of a field that is growing and maturing in all its facets – theory, methodology, description and applications – and are a sign that it will continue to do so in the future.

References

Baker, M. (1993), ‘Corpus Linguistics and Translation Studies. Implications and Applications’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology: In Honour of John Sinclair, 233–50, Amsterdam and Philadelphia: Benjamins.
De Sutter, G., Goethals, P., Leuschner, T. and S. Vandepitte (2012), ‘Towards Methodologically More Rigorous Corpus-Based Translation Studies’, Across Languages and Cultures, 13 (2): 137–43.
De Sutter, G. and H. Kruger (2018), ‘Disentangling the Motivations Underlying Syntactic Explicitation in Contact Varieties: A MuPDAR Analysis of That vs. Zero Complementation’, in S. Granger, M.-A. Lefer and L. Penha-Marion (eds), Book of Abstracts. Using Corpora in Contrastive and Translation Studies Conference (5th edition), CECL Papers 1, 55–7, Louvain-la-Neuve: Centre for English Corpus Linguistics/Université catholique de Louvain.
De Sutter, G. and M.-A. Lefer (2020), ‘On the Need for a New Research Agenda for Corpus-based Translation Studies: A Multi-methodological, Multifactorial and Interdisciplinary Approach’, Perspectives, 28 (1): 1–23.
De Sutter, G., Lefer, M.-A. and I. Delaere, eds (2017), Empirical Translation Studies: New Methodological and Theoretical Traditions, Berlin: De Gruyter Mouton.
Ellis, N. C. and R. Simpson-Vlach (2009), ‘Formulaic Language in Native Speakers: Triangulating Psycholinguistics, Corpus Linguistics, and Education’, Corpus Linguistics and Linguistic Theory, 5 (1): 61–78.
Gilquin, G. and S. T. Gries (2009), ‘Corpora and Experimental Methods: A State-of-the-Art Review’, Corpus Linguistics and Linguistic Theory, 5 (1): 1–26.
Granger, S. and M.-A. Lefer, eds (2020a), Translating and Comparing Languages: Corpus-based Insights, Louvain-la-Neuve: Presses Universitaires de Louvain.
Granger, S. and M.-A. Lefer, eds (2020b), The Complementary Contribution of Comparable and Parallel Corpora to Crosslinguistic Studies, Special issue of Languages in Contrast, 20 (2).
Halverson, S. L. (2017), ‘Gravitational Pull in Translation. Testing a Revised Model’, in G. De Sutter, M.-A. Lefer and I. Delaere (eds), Empirical Translation Studies: New Methodological and Theoretical Traditions, 9–45, Berlin: De Gruyter Mouton.
Kotze, H. (2020), ‘Converging What and How to Find Out Why: An Outlook on Empirical Translation Studies’, in L. Vandevoorde, J. Daems and B. Defrancq (eds), New Empirical Perspectives on Translation and Interpreting, 333–70, Abingdon: Routledge.
Kruger, A., Wallmach, K. and J. Munday, eds (2011), Corpus-Based Translation Studies: Research and Applications, London: Bloomsbury.
Kruger, H. and B. van Rooy (2016), ‘Constrained Language: A Multidimensional Analysis of Translated English and a Non-Native Indigenised Variety of English’, English World-Wide, 37 (1): 26–57.
Kruger, H. and G. De Sutter (2018), ‘Alternations in Contact and Non-Contact Varieties: Reconceptualising That-Omission in Translated and Non-Translated English Using the MuPDAR Approach’, Translation, Cognition & Behavior, 1 (2): 251–90.
Laviosa, S. (1998), ‘The Corpus-based Approach: A New Paradigm in Translation Studies’, Meta, 43 (4): 474–9.
Schönefeld, D., ed. (2011), Converging Evidence: Methodological and Theoretical Issues for Linguistic Research, Amsterdam and Philadelphia: Benjamins.
Vandevoorde, L., Daems, J. and B. Defrancq, eds (2020), New Empirical Perspectives on Translation and Interpreting, Abingdon: Routledge.
Xiao, R., ed. (2010), Using Corpora in Contrastive and Translation Studies, Newcastle upon Tyne: Cambridge Scholars Publishing.
Zanettin, F. (1998), ‘Bilingual Comparable Corpora and the Training of Translators’, Meta, 43 (4): 616–30.

Part I

Corpus-based translation studies: Current challenges and future perspectives

1

Corpus-based translation and interpreting studies: A forward-looking review

Sylviane Granger and Marie-Aude Lefer

1 Introduction

The origin of corpus-based translation studies (CBTS) can be traced back to an article entitled ‘Corpus Linguistics and Translation Studies: Implications and Applications’ (1993). In that groundbreaking paper, Mona Baker convincingly argues against the view that translated texts are unworthy of academic enquiry and advocates applying the methods and techniques of corpus linguistics to translated texts. A few years later, Shlesinger (1998) made a similar proposal for interpreting. Since then the field of corpus-based translation and interpreting studies has greatly expanded and matured. Over a quarter of a century later it seems worthwhile to take stock of the most recent developments. The aim of our study is to carry out a thorough review, both quantitative and qualitative, of recent corpus-based studies of translation and interpreting with a view to describing their key characteristics in terms of data, methods and research foci, identifying potential gaps and suggesting avenues for future research.

The three most recent surveys of translation and interpreting studies (Candel-Mora and Vargas-Sierra 2013, Zanettin et al. 2015 and van Doorslaer and Gambier 2015) differ from ours in a number of ways, the main one being that all three rely on large bibliographic databases and only analyse the bibliometric records, that is, the titles, keywords and abstracts, without delving into the full texts of the publications. Our survey is more limited in terms of the number and types of publications, but goes beyond the quantitative, mostly automatic investigation of bibliometric records in order to provide more qualitative insights thanks to an in-depth manual exploration of the full texts of the publications.

There are also differences of scope. The surveys by Zanettin et al. (2015) and van Doorslaer and Gambier (2015) cover much more ground than ours. Their aim is to identify the main subfields and research foci of translation studies as a whole, and they therefore provide only limited information on corpus-based studies. However, some interesting findings emerge. Zanettin et al. used the analytic categories in the Translation Studies Abstracts Online1 (TSA) database to identify the most popular subfields of translation and interpreting studies. The results show that the three most popular categories are literary translation, translation theory and intercultural studies. CBTS is only to be found among the next five largest categories, but seems to display an upward trend from 1996 to 2011. As corpus and parallel corpus also appear as keywords in a corpus of 16,000 abstracts compiled by the authors, they conclude that ‘[i]n terms of methodologies, the impact of linguistic corpora is noticeable and is a trend that is clearly here to stay’ (p. 20).

The second survey, by van Doorslaer and Gambier (2015), makes use of the online Translation Studies Bibliography2 (TSB), which currently contains over 30,000 annotated records. An analysis of the authors’ academic affiliations provides useful information on the geographical spread of translation and interpreting research. Thanks to the extended list of searchable keywords in the TSB conceptual map, the authors are also able to identify the main topical foci (e.g. literary translation, terminology, teaching) of specific journals and to highlight differences in correlation with the language of publication. Unfortunately, the study fails to reveal any information on corpus-based studies, probably because the keyword analysis is limited to the five most frequent keywords in seven journals.

Unlike the preceding two surveys, Candel-Mora and Vargas-Sierra (2013) focus specifically on CBTS and pursue objectives similar to ours.
Their aim is ‘to analyze with data the consolidation of corpus methods in translation and to specify which issues are under research and the features that characterize these studies’ (p. 317). Unlike our survey, however, their study relies on bibliometric records from two translation databases (Bibliography of Interpreting and Translation3 and Translation Studies Abstracts Online). This allows them to provide a wide panorama of the field, based on a large number of publications (389).4 As regards the languages represented in the corpora, the survey shows that 40% of the bibliographic records that specify the language refer to a corpus of English or include English in the language pair investigated. The second most represented language, Spanish, falls way behind (13%). The types of corpora used are predominantly parallel corpora

(58%), followed by comparable corpora (27%) and a combination of parallel and comparable corpora (15%). The survey also shows that specialized translations are far more numerous (69%) than literary translations (31%). Another interesting finding is that CBTS studies are published in similar proportions as book chapters (45%) and as journal articles (40%). Although the survey provides some useful information on corpus-based translation studies, the bibliometric method on which it is based has its limitations, the main one being that the bibliometric records, because they only contain the title, keywords and words in the abstracts, fail to provide information on many key features of CBTS. For example, only 109 out of 389 records (28%) specify the type of corpus used, and only 11, a mere 3%, refer to the size of the corpus, which considerably reduces the reliability of the conclusions drawn in respect of these aspects. As regards corpus orientation (research, teaching or professional), the authors acknowledge that ‘this parameter cannot be interpreted appropriately without carrying out an in-depth study of the publications’ (p. 324). Alongside bibliometric analyses, which ‘have the ability to offer factual, quantitatively based, but sometimes also broader views on tendencies in a discipline’ (van Doorslaer and Gambier 2015: 317), we believe there is scope for a survey of corpus-based translation and interpreting studies which relies on manual, in-depth exploration of the actual texts of the publications, thereby allowing for the investigation of qualitative information which is not captured – and indeed in some cases cannot be captured – by an analysis of bibliometric records. Our survey is based on a relatively small number of scientific articles, but as these studies cover the most recent years (2012–19), they provide a worthwhile snapshot of the latest trends in the field and point the way forward for further research. The chapter is structured as follows. 
The first two sections specify the scope of the survey (Section 2) and the method used to extract the survey data (Section 3). The next sections present the results of the survey. Section 4 subcategorizes the corpus-based studies in terms of three main types of corpus orientation, which are analysed in the following sections: methodology- and theory-oriented studies in Section 5, applied studies in Section 6 and empirical studies in Section 7. The dominant category, that of empirical studies, is further explored along three main axes: linguistic focus and translation features, corpus design, and corpus techniques and statistical testing. Section 8 sums up the main findings of the survey and makes some recommendations on desirable developments in the field.

2  Delineating the scope of the survey

In the framework of the present survey, it is essential to establish what qualifies as a bona fide corpus-based study. Our starting point is Baker’s 1993 paper, which sums up the key features of CBTS:

This paper explores the impact that the availability of corpora is likely to have on the study of translation as an empirical phenomenon. It argues that the techniques and methodology developed in the field of corpus linguistics will have a direct impact on the emerging discipline of translation studies, particularly with respect to its theoretical and descriptive branches. The nature of this impact is discussed in some detail and brief reference is made to some of the applications of corpus techniques in the applied branch of the discipline. (p. 233; our underlining)

The first key point is that CBTS is an empirical approach to translation, situated within descriptive translation studies: ‘Through the 1970s and beyond, descriptive translation studies (DTS) foregrounded description of what translation was and is, removing from dominance previous approaches that were more concerned with prescribing what translation should be. Corpus-based studies in translation are clearly aligned with the descriptive perspective’ (Olohan 2004: 10). For Laviosa (2011: 14), ‘[t]he strong links forged in those years between Corpus Linguistics and DTS thanks to a set of common concerns stemming from an empirical perspective is (. . .) one of the keys if not the key to the success story of CTS [corpus-based translation studies]’. Second, Baker clearly underlines that CBTS is an offshoot of corpus linguistics, from which it borrows its specific techniques and methods. These involve both ‘basic text processing operations’ (Baker 1995: 226), such as concordancing and word frequency profiling, and more sophisticated techniques, such as automatic annotation and extraction of keywords, collocations, colligations and word clusters (Zanettin 2012). Finally, Baker underlines the three main objectives of CBTS: to contribute to the theoretical, descriptive and applied branches of translation studies. In the article, Baker further specifies the scope and objectives of CBTS. The term corpus is key in CBTS, and Baker insists on its ambiguity in translation studies: ‘although the words corpus and corpora are beginning to figure prominently in the literature on translation, they do not refer to the same kind of corpora that we tend to talk about in linguistics’ (1993: 241). In a later article she returns to this issue and specifies what exactly is meant by corpus in CBTS: ‘any collection of running texts (as opposed to examples/sentences), held in electronic form

and analysable automatically or semi-automatically (rather than manually)’ (Baker 1995: 225). The way we need to understand the term corpus in CBTS is therefore quite different from the way it is regularly used in translation studies, namely to refer to ‘fairly small collections of text which are not held in electronic form and which are therefore searched manually’ (Baker 1995). Baker also insists on the key role played by methodology in CBTS, and more particularly by the ‘new software tools’ and ‘new and sophisticated methodologies’ (Baker 1993: 248) borrowed from corpus linguistics, which can help counter ‘the heavy reliance on introspective methods in translation studies’ (Baker 1993: 240).

Large size is also a key characteristic of corpora. Sinclair (1996) lists ‘quantity’ as a default value of corpora: ‘A corpus is assumed to contain a large number of words. The whole point of assembling a corpus is to gather data in quantity.’ According to Baker (1993: 237), the fact that translation scholars can now study large numbers of texts of the same type ‘is precisely where corpus work comes into its own’, as it ‘enables the discipline to shed its longstanding obsession with the idea of studying individual instances in isolation (one translation compared to one source text at a time)’.

In the light of this description we decided to limit our survey to translation and interpreting studies that rely on machine-readable corpora, that is, electronic collections of texts analysed with the help of (semi-)automatic corpus-linguistic techniques. The corpora can be monolingual or bilingual/multilingual, comparable or parallel. Size was not selected as a defining criterion as it is a relative notion. In addition, as rightly pointed out by Fernandes (2006: 88), although large size is one of the attributes of corpora, in the context of translation and interpreting studies, corpora are often relatively small and ‘the issue of corpus size (. . .) becomes a relative one in the sense that qualitative aspects sometimes may be more relevant than quantitative ones’.

3  Survey dataset: Extraction and general overview

The survey is based on scientific articles written in the years 2012–19 in the following twelve journals: Across Languages and Cultures, Babel, Interpreting, inTRAlinea, Journal of Specialised Translation, Meta, Perspectives, Target, The Interpreter and Translator Trainer, Journal of Translation and Technical Communication Research (trans-kom), Translation & Interpreting and Translation and Interpreting Studies. The selection was made on the basis of two criteria: the journals had to be peer-reviewed, and we needed to have direct access to the
full texts in our academic environment. The survey is synchronic because the limited period covered (eight years) is not a realistic basis on which to carry out a diachronic study. However, some passing remarks on evolutionary trends will be made where appropriate. The filtering of corpus-based studies involved the following steps:

● Automatic extraction of the articles written in English5 that contain the word corpus or corpora in the title, abstract and/or keywords; full-text search if the two words are absent from these sections;
● Manual filtering: Rejection of the articles that do not fit our defining criteria, that is, use of data in electronic format and of corpus analytic methods – from the most basic to the most sophisticated.
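
The automatic extraction step can be sketched roughly as follows. This is only an illustration: the chapter does not describe the tooling actually used, so the `Article` record layout, the function names and the regular expression are all assumptions made for the example. The second step, manual filtering, is a human judgement and is deliberately left outside the code.

```python
import re
from dataclasses import dataclass, field

# Matches "corpus" or "corpora" as whole words, case-insensitively.
# It also matches inside hyphenated forms such as "corpus-based",
# but not inside unrelated words such as "corporate".
CORPUS_TERM = re.compile(r"\bcorp(?:us|ora)\b", re.IGNORECASE)


@dataclass
class Article:
    # Hypothetical record layout; the survey does not specify its data format.
    title: str
    abstract: str
    keywords: list = field(default_factory=list)
    full_text: str = ""


def mentions_corpus(article: Article) -> bool:
    """Search title, abstract and keywords first; fall back to the full text."""
    metadata = " ".join([article.title, article.abstract, *article.keywords])
    if CORPUS_TERM.search(metadata):
        return True
    return bool(CORPUS_TERM.search(article.full_text))


def unfiltered_dataset(articles):
    """Return the candidate articles that still require manual filtering."""
    return [a for a in articles if mentions_corpus(a)]
```

A keyword match of this kind only yields candidates: as the manual filtering step makes clear, an article that uses the word corpus does not necessarily rely on electronic data and corpus analytic methods.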

As pointed out in Section 2, the presence of corpus/corpora is not enough to qualify articles as bona fide corpus-based translation and interpreting studies. As a result, the texts of the 265 articles containing the words corpus or corpora were scanned in order to ensure that the data were in electronic format and the analysis relied on corpus analytic methods. Table 1.1 gives an overview of the dataset before and after this manual filtering step. A comparison of the sizes of the unfiltered (265) and filtered (186) datasets shows that a search that relies solely on the presence of the terms corpus or corpora generates a non-negligible number of studies (79, i.e. 30%)6 that do not in fact belong in corpus-based translation and interpreting studies and were therefore excluded from the survey. This somewhat unexpected finding reduces the reliability of surveys such as Candel-Mora and Vargas-Sierra (2013) for CBTS and Liao and Lei (2017) for corpus linguistics that omit the manual filtering step and, as a result, fail to take into account the ambiguity of the word corpus. Approximately one third of the excluded studies proved to be based on data in paper rather than electronic format. Others did in fact make use of an electronic corpus (e.g. news articles downloaded from newspaper websites) but relied on purely manual methods to analyse the data, regularly using the ‘corpus’ as a ‘repository of examples’ (Tognini-Bonelli 2001: 10) to illustrate one or another phenomenon. Overall, these studies fall within the Descriptive Translation Studies paradigm (and are indeed in several instances explicitly described as such by the authors) rather than corpus-based translation and interpreting studies. Table 1.1 shows that the genuine corpus studies only represent 11% of the total number of articles in English. This result ties in with Zanettin et al.’s (2015: 12) bibliometric survey, which shows that corpus-based studies accounted for c. 7% in 2011.
The fact that this percentage is higher than that established by the authors for 1997 (c. 3%), coupled with our own average proportion of 11% for the 2012–19 period, suggests that corpus-based translation and interpreting studies are experiencing an upward trend that reflects the growth of corpus linguistics studies in general (Liao and Lei 2017: 4). It is important to point out, however, that there are marked differences between the journals. As is clearly apparent from Figure 1.1, two journals display higher proportions: 29% in the case of Across Languages and Cultures and 18% in that of inTRAlinea, a far cry from the much lower proportions in Translation and Interpreting Studies and Translation & Interpreting (6%), and The Interpreter and Translator Trainer (5%). In other words, some journals seem to be more corpus-oriented than others. These differences cannot be attributed to differences in the journals’ overall scope, as not a single descriptive section available on the respective websites contains a reference to corpora, corpus linguistics or corpus-based approaches to translation and interpreting.

Table 1.1  Dataset before and after Manual Filtering

Journal                                   Articles      Articles in English     Corpus-based articles
                                          in English    with corpus/corpora     in English
                                                        (unfiltered dataset)    (filtered dataset)
Across Languages and Cultures             92            34                      27
Babel                                     183           33                      16
Interpreting                              80            12                      8
inTRAlinea                                123           26                      22
Journal of Specialised Translation        164           19                      12
Meta                                      132           27                      20
Perspectives                              303           54                      36
Target                                    128           23                      18
The Interpreter and Translator Trainer    142           11                      7
trans-kom                                 42            5                       4
Translation & Interpreting                117           10                      7
Translation and Interpreting Studies      145           11                      9
Total                                     1651 (100%)   265 (16%)               186 (11%)

Figure 1.1  Proportion of corpus-based articles per journal.
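
The percentages quoted in this section can be recomputed from the counts in Table 1.1. The short sketch below is offered only as a check on the arithmetic: the counts are copied from the table, while the variable names and the choice to round to the nearest whole percent are assumptions made for the illustration.

```python
# Counts copied from Table 1.1, per journal:
# (articles in English, unfiltered dataset, filtered dataset).
TABLE_1_1 = {
    "Across Languages and Cultures": (92, 34, 27),
    "Babel": (183, 33, 16),
    "Interpreting": (80, 12, 8),
    "inTRAlinea": (123, 26, 22),
    "Journal of Specialised Translation": (164, 19, 12),
    "Meta": (132, 27, 20),
    "Perspectives": (303, 54, 36),
    "Target": (128, 23, 18),
    "The Interpreter and Translator Trainer": (142, 11, 7),
    "trans-kom": (42, 5, 4),
    "Translation & Interpreting": (117, 10, 7),
    "Translation and Interpreting Studies": (145, 11, 9),
}


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


# Column totals: all articles, unfiltered dataset, filtered dataset.
total_articles, total_unfiltered, total_filtered = (
    sum(column) for column in zip(*TABLE_1_1.values())
)

# Proportion of corpus-based (filtered) articles in each journal.
corpus_share = {
    journal: percent(filtered, articles)
    for journal, (articles, _unfiltered, filtered) in TABLE_1_1.items()
}

# Articles rejected during manual filtering, and their share of the candidates.
excluded = total_unfiltered - total_filtered
excluded_share = percent(excluded, total_unfiltered)
```

Under these assumptions the totals come out at 1651, 265 and 186 articles, the excluded articles number 79 (30% of the unfiltered dataset), and the per-journal shares match the figures cited in the text, for instance 29% for Across Languages and Cultures and 5% for The Interpreter and Translator Trainer.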

4  Corpus study orientation We have classified the articles included in the database into three main categories, according to their main research focus and objectives: methodologicaltheoretical, empirical and applied. This categorization corresponds quite closely to the three branches of translation studies outlined in Holmes (1988[2000]), where a distinction is made between theoretical, descriptive and applied translation studies. For Holmes, descriptive translation studies ‘describe[s] the phenomena of translating and translation(s) as they manifest themselves in the world of our experience’ (1988[2000]: 176). It is the branch of translation studies where the empirical phenomena under investigation hold centre stage. Theoretical translation studies ‘evolve[s] principles, theories, and models which will serve to explain and predict what translating and translations are and will be’ (1988 [2000]: 178). The third branch of translation studies in Holmes’s model, applied translation studies, covers several areas, such as teaching and translation aids (in both translator training and professional practice). In his seminal paper, Holmes does not discuss research methods directly, but he acknowledges the crucial importance of the methodological and meta-theoretical dimension of translation studies, ‘concerning itself with problems of what methods and models can best be used in research in the various branches of the discipline (how translation theories, for instance, can be formed for greatest validity, or what analytic methods can best be used to

achieve the most objective and meaningful descriptive results)’ (1988 [2000]: 183). In the present survey, the empirical category includes corpus studies devoted to specific linguistic phenomena (e.g. grammatical, lexical) and translation features (e.g. explicitation, normalization). The methodological-theoretical category subsumes three main types of contribution: (i) calls for methodological and theoretical advancement, such as proposals for the adoption of methods and theories borrowed from neighbouring disciplines; (ii) literature reviews and overviews; and (iii) descriptions of new corpora and corpus tools for translation and interpreting studies. The applied category covers four major types of corpus application in translation and interpreting studies, namely corpus use in (i) translator and interpreter training, (ii) professional practice (language industry), (iii) translation quality assessment and (iv) machine translation. It is important to recognize that the three main categories – empirical, methodological-theoretical and applied – are not watertight. For example, although their focus is primarily empirical, a number of studies in the dataset discuss – in more or less detail – the (possible) implications of their descriptive results for theory, methodology or practice. In the same vein, some studies in which methodology and theory hold centre stage include empirical case studies whose purpose is to illustrate the potential for theoretical development or the application of a particular corpus-based method in translation and interpreting studies. Similarly, some applied studies include empirical case studies showcasing practical corpus applications. Despite the relative porousness of the categories, an in-depth analysis of the articles allowed us to assign each study to a single category. 
As can be seen in Figure 1.2, empirical studies take up the lion’s share, as they account for two thirds of the articles in the dataset, with applied studies and methodological-theoretical contributions lagging far behind (19% and 15%, respectively).

5  Methodology- and theory-oriented studies

Methodological-theoretical contributions represent a sixth of the articles under scrutiny. Even though they are admittedly far less numerous than empirical studies, they testify to the growing maturity of the field in terms of both methodological and theoretical advancement. Methodology- and theory-oriented papers in the dataset are evenly distributed across the three subcategories outlined above (new methods/theories, literature reviews, new corpora/tools).

Figure 1.2  Breakdown of empirical, methodological-theoretical and applied studies in the dataset (n = 186).

First, we find articles that aim to introduce new methodological and theoretical approaches or concepts, often borrowed from neighbouring disciplines, such as monolingual corpus linguistics or contact linguistics. The methodological aspects concern corpus compilation (e.g. compilation of multimodal corpora), corpus annotation (e.g. annotation of speech acts for discourse analysis), corpus data extraction (e.g. semantic relations in terminology) and data triangulation (e.g. combining corpus and experimental data). Theoretical contributions range from presentations of general frameworks (e.g. cognitive approaches to translation) to discussions of specific constructs (e.g. explicitation). Second, a few articles offer literature reviews of the field or of corpus use in specific areas of translation and interpreting studies. Contributions that are relatively broad in scope focus on some of the evolving contours of corpus-based translation and interpreting studies, for instance, in terms of the forms of interlingual translation that are typically investigated (vs the ones that are emerging) or the quantitative methods used in the field. Articles that are more focused in their scope deal with specific areas, such as corpus use in terminology, literary translation or news translation, placing particular emphasis on the added value of cross-fertilization between corpus-based translation and interpreting studies and disciplines such as corpus-assisted discourse studies, digital humanities or stylometry. The third

category deals with new corpora and corpus tools for translation and interpreting studies. In our dataset, we mainly find descriptions of interpreting corpora (e.g. signed language interpreting, telephone interpreting) and audiovisual translation (e.g. dubbing), which is in line with the forms of interlingual mediation that have recently entered the field (cf. Defrancq et al. 2015). Some of these corpus descriptions include aspects related to annotation (e.g. annotation of turns in telephone interpreting), but generally speaking, corpus annotation is rather infrequently addressed in the methodology-oriented papers.

6  Applied studies

Around a fifth of the articles published in the selected journals deal with corpus use in various applied areas of translation and interpreting studies, which shows that corpora also have their place in the applied branch of the field. Figure 1.3 shows that the most prominent of these areas is translator and interpreter training. The publications related to translator and interpreter education reflect a rich array of didactic approaches that all rely on the use of electronic corpora in the translation or interpreting classroom (cf. Zanettin et al. 2003, Loock 2016). In other words, these studies show how corpora can be used as translation or interpreting tools. Three quarters of the papers in this category deal with written translation, both general and

Figure 1.3  Breakdown of applied studies in the dataset (n = 35).

specialized (e.g. legal or scientific translation), into the trainee translators’ L1 or L2. A wide range of corpus types are used: target language monolingual corpora (whether reference, web or specialized), parallel corpora made up of professional translations, learner translation corpora (i.e. parallel corpora comprising student translations, whether error-annotated or not), bilingual comparable corpora and combinations thereof. The main pedagogical objectives of corpus use are related to awareness-raising, decision-making while translating (e.g. to solve translation problems) and revision (editing). Some contributions cover a wide range of phenomena while others have a specific linguistic focus, mostly lexis, phraseology and terminology (e.g. reporting verbs). Audiovisual translation is also represented, though far less than written translation (two papers). The remaining quarter of the articles in this category deal with the use of corpora in interpreter training, whether for simultaneous, consecutive or dialogue interpreting tasks. The focus here is on the use of corpora for interpreting task preparation (e.g. creation of corpus-based bilingual glossaries or term-bases) and for materials design (e.g. the use of spoken corpora as source speeches for interpreting exercises). As shown in Figure 1.3, the remaining applied articles account for a quarter of the category and are equally distributed across corpus use in professional practice, quality assessment and machine translation. All three categories are very marginal in the dataset. The use of electronic corpora in the language industry is hardly discussed in our dataset (see Candel-Mora and Vargas-Sierra 2013: 324 for a similar observation), but the few contributions in this area mostly deal with terminology and how corpora can be used to develop terminological resources for professional translators and other language professionals. 
Our survey also seems to indicate that quality assessment is rarely investigated with corpus methods. Studies on the use of corpora in machine translation are much more widespread, but they are typically published in Natural Language Processing publication outlets and in journals specifically dedicated to machine translation.

7  Empirical studies

The present section offers an overview of the empirical studies included in the dataset. Each empirical study was analysed in terms of its research focus and corpus design, and the corpus techniques and statistical tests used.

7.1  Research focus

A key aspect of research syntheses consists in determining the topical issues that dominate the field under review. To achieve this aim, the 122 empirical studies in our survey data were examined in order to identify their main linguistic focus as well as the attention given to translation features.

7.1.1  Linguistic focus

Empirical studies can be categorized according to the language features they investigate either as the direct focus of the study (e.g. modals) or as a way of assessing the validity of a specific translation feature (explicitation, normalization, etc.). We have grouped these language features into seven broad categories: lexis and terminology (L&T), grammar (G), morphology (M), semantics (S), discourse and pragmatics (D&P), speech-specific (SP) and mixed (MX). Figure 1.4 shows the breakdown of the categories in the 122 empirical studies in our survey. The most populated category is that of lexis and terminology, which encompasses single words and multi-word expressions as well as single terms and multi-terms. Although phraseology is currently regarded as a category in its own right, we have included it in the L&T category because the dividing line between some of the categories, notably compounds and collocations, is very difficult to draw, especially when it comes to specialized vocabulary. Two

Figure 1.4  Linguistic focus of the empirical studies (n = 122).

factors can explain the dominance of the lexical category: first, lexis in the wide sense has always been at the forefront of translation studies; and, second, it is the aspect of language that is the most amenable to corpus techniques, a factor that contributes to the popularity of lexical studies in corpus linguistics, as shown by Gilquin and Gries’s (2009: 10) survey. Terms (e.g. business terms) account for 40% of the lexical items investigated; the other studies focus on general vocabulary – either single words (e.g. between) or one specific category of words (e.g. phrasal verbs) – or general measures of lexical richness, in particular lexical variation and lexical density. Grammar, the second most represented category, includes a wide range of grammatical and syntactic phenomena, some of which (passives, modals, nominalizations) recur in the dataset. The D&P category is dominated by discourse-oriented studies, with cohesion, and more particularly the use of connectors, as the main object of investigation. Pragmatics, which is mainly associated with speech, a medium that is in a minority in our dataset compared to writing (see Section 7.2.3), is limited to a handful of studies on (im)politeness. Semantics, which is not easy to approach using corpus techniques, is a relatively minor category limited to the analysis of a few topics such as metaphors and the expression of manner-of-motion. The SP category, which groups studies focused on speech-specific features (pauses, speech rate, hesitations, interactional non-renditions), is also scantily represented in the data. Morphology ranks last, with only four studies focused on derivational affixes. It is interesting that the low frequency of the SP and M categories is not specific to corpus-based translation and interpreting studies as, together with pragmatics, phonology and morphology are among the least frequent categories in Gilquin and Gries’s (2009) survey of corpus linguistics. 
While most studies fall squarely into one well-defined linguistic domain, a sizeable number embrace two or more domains. These studies, which have been grouped into the MX category, aim to provide a general profile of different varieties of translated and/or interpreted language or use a range of language features to operationalize translation features.

7.1.2  Translation features

The identification of ‘universal features of translation’, that is, ‘features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems’ (Baker 1993: 243), was one of the key objectives of early CBTS. At first, the four main features, established by comparing corpora of translated and non-translated texts, were

considered to be simplification, explicitation, normalization and levelling out (Baker 1996: 176–7). Since then, other features have been added and the very notion of universality has been called into question, leading researchers to abandon the term ‘translation universal’ in favour of the more realistic ‘translation feature’. Twenty-five years on, it seems worthwhile to assess the place occupied by this aspect of translation. With a view to identifying the studies that have translation features as their main focus, we scanned the titles and keywords of the studies in our dataset for the following words: translation universal, translation feature, features of translated language, explicitation, implicitation, normalization, standardization, simplification, levelling out, unique items (hypothesis) and convergence.7 The results show that translation features remain a strong research strand in current corpus studies, as 29% (35/122) of the studies were extracted on the basis of this criterion. It is important to bear in mind, however, that this percentage does not take into account the many studies that refer to translation features in the analysis (typically, when interpreting their results) but do not highlight them explicitly in the title and/or keywords. Explicitation is by far the most-researched feature, either as the sole focus of the study or alongside other features. Normalization/standardization and simplification are also popular, the other features (levelling out/convergence, unique items) trailing far behind. A wide diversity of linguistic phenomena are used to operationalize the translation features. For example, the rate of explicitation is established on the basis of the use of connectors, modals, passives, collocations and manner-of-motion verbs and omission of the conjunction that in English. 
In several instances, translation features are assessed on the basis of a mixture of words, phrases and structures (the MX category) rather than a single linguistic phenomenon.

7.2  Corpus design

This section provides an overview of the corpus designs of the empirical studies under scrutiny, focusing on four aspects: corpus types, corpus size, registers and languages.

7.2.1  Corpus types

We have distinguished between three main categories of corpus type: parallel corpora, monolingual comparable corpora and mixed corpora (combinations of parallel and comparable corpora). As shown in Figure 1.5, parallel corpora (PARA) are much more frequently used than the other corpus types, with more

Figure 1.5  Corpus types in empirical studies (n = 122).

than half of the empirical studies relying on parallel corpus data. They are followed by monolingual comparable corpora (CMONO, 25%) and mixed corpora (MIX, 16%). Other corpus types are rarely used (6%). The clear dominance of parallel corpora over monolingual comparable corpora sheds light on the central position of source text-target text (ST-TT) comparisons in the field. While this is in sharp contrast with Baker’s (1995: 233) programmatic call for a shift away from ST-TT comparisons to comparisons of translation with original text production, it reflects a concern voiced quite early in the field, namely that translation products cannot be fully understood if they are cut off from their ST (see e.g. Stewart 2000 and Kenny 2005 on the ‘target-orientedness’ of Baker’s and other early corpus-based work). Interestingly, the figures presented here are very similar to the ones obtained by Candel-Mora and Vargas-Sierra (2013) in their bibliometric-based survey covering approximately fifteen years, which seems to indicate that parallel corpora have dominated the field for quite some time. Back in 2005, however, Kenny (2005: 155) insisted that the use of parallel corpora had been strikingly limited in the first decade of CBTS research. A larger-scale survey would be needed to track this development more precisely. Parallel studies are typically monodirectional and restricted to one language pair (e.g. English>Spanish). There are also a few monodirectional studies that

examine different language pairs (e.g. French>English and French>Dutch) or make use of several same-language versions of the STs, such as unedited and edited versions of the same texts and different translations of the same novel. We find very few cases of bidirectional studies, that is, studies where the two translation or interpreting directions are analysed (e.g. English>Italian and Italian>English). This is particularly regrettable, as the investigation of the two directions makes it possible to go some way towards disentangling source-language (SL) influence from more general translation features. Bidirectional parallel corpora have shown their worth in contrastive linguistics, where it is common practice to examine the two directions in order to identify cross-linguistic correspondences. In the empirical studies of our dataset, researchers typically rely on parallel corpora to examine the procedures used to translate given ST items or structures (e.g. culture-specific items, passives) and the explicitation/implicitation of certain ST phenomena, such as connectors (explicitation is the only translation feature to often be approached from a parallel perspective). Monolingual comparable corpora, which account for a quarter of the empirical studies under investigation, are often made up of two subcorpora, namely translated/interpreted language and non-translated/non-interpreted (i.e. original) language. In c. 75% of these studies, the texts are translated (or interpreted) from a single source language (e.g. Arabic translated from English). Less commonly, the texts/speeches have been translated/interpreted from several SLs, which makes it possible to study the effect of SL influence. Monolingual comparable corpora are mostly used to examine translation features (e.g. normalization, simplification, increased explicitness, unique items), in line with Baker’s research agenda for the field. 
Around a sixth of the empirical studies rely on a combination of parallel and comparable corpora. The ways of combining the two types of corpus are manifold, depending, for instance, on whether it is the parallel or the comparable perspective that holds centre stage. The most common approach in the dataset at hand is the combined use of a monodirectional parallel corpus and a comparable corpus of target language original texts. Typically, the former is the core of the study, while the latter is used to check whether the trends identified in the TTs of the parallel corpus diverge from the ones found in a comparable set of original texts (e.g. the frequency of the passive voice). Other corpus types are far less frequent. They include multilingual comparable corpora (representing two or more original languages), which are analysed with reference to translation-related objectives (e.g. cross-linguistic register description), and monolingual corpora (e.g. audio-description). More generally,

irrespective of corpus types, we see that the corpora used in the empirical studies mainly contain professional L1 translations/interpretations (exceptions include learner translation corpora and corpora of Chinese>English retour interpreting).

7.2.2  Corpus size

Unsurprisingly, a close inspection of corpus size reveals that the parallel and comparable corpora used in the empirical studies are rather small by today’s standards in mainstream corpus linguistics. For monolingual comparable studies, we find that approximately half of the corpora are smaller than one million tokens in total (all subcorpora considered), while the other half are larger than one million tokens. Parallel corpora tend to be much smaller in size. The survey shows that c. 40% of the parallel corpora contain fewer than 100,000 ST tokens and another 40% between 100,000 and 1 million ST tokens. This is in part due to parallel corpora of interpreting, which are smaller on account of the many hurdles inherent in transcribing speech (Bernardini et al. 2018). The relatively small size of the corpora used is not necessarily problematic, as some analyses do not require large corpora, for example because they focus on high-frequency phenomena or involve the manual coding of numerous variables. In this regard, an interesting methodological solution, found in a few empirical studies in our dataset, is the use of large general reference corpora to compensate for the relatively small size of the corpora used. For example, phraseological units, such as collocations, can be identified in translated texts on the basis of the statistical association scores they display in large reference corpora. It is important to point out that size figures could not be retrieved for all the empirical studies. In 19% of the cases, corpus size is not mentioned or is not provided in tokens (but rather in number of texts or length in minutes). It seems that De Sutter et al.’s (2012: 137) methodological call for research papers in the field to ‘provide a meticulous overview of the corpus materials used’ has not yet been fully heeded.
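
The reference-corpus solution described in this section can be illustrated with a minimal Python sketch. It is not drawn from any of the surveyed studies: the choice of pointwise mutual information (PMI) as the association score, the restriction to adjacent word pairs, the function names and the cut-off value are all assumptions made for the purpose of illustration.

```python
import math
from collections import Counter

def pmi_scores(reference_tokens):
    """Map each adjacent word pair in the reference corpus to its PMI.

    PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ), estimated from raw counts.
    """
    unigrams = Counter(reference_tokens)
    bigrams = Counter(zip(reference_tokens, reference_tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), freq in bigrams.items():
        p_pair = freq / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

def collocation_candidates(translated_tokens, scores, threshold=3.0):
    """Return the word pairs attested in the (small) translated text whose
    PMI in the (large) reference corpus reaches the threshold.

    The default threshold is an illustrative cut-off, not a standard value.
    """
    pairs = set(zip(translated_tokens, translated_tokens[1:]))
    return sorted(p for p in pairs if scores.get(p, float("-inf")) >= threshold)
```

The point of the design is that the scores are estimated once on the large reference corpus and merely looked up for the pairs found in the translated texts, so that the small size of the translation corpus does not distort the association measure.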

7.2.3  Modality and registers

Three quarters of the empirical studies examine translation (whether written or audiovisual). Interpreting accounts for a little over a fifth of the empirical studies. A handful of studies are intermodal: they compare written translation and simultaneous interpreting. These figures point to a noticeable breakthrough by corpus-based interpreting studies in recent years (cf. Russo et al. 2018).

Interpreting studies in the dataset examine a wide range of interpreting contexts, such as government press conferences, parliamentary debates and court proceedings. Even though most studies deal with simultaneous interpreting, other forms of interpreting are investigated as well, such as dialogue interpreting. The spoken corpora used in the studies mostly comprise transcripts of spoken data, following various transcription conventions, and, less frequently, verbatim reports of parliamentary debates. The survey further shows that four main types of written registers are investigated: literature (LIT), specialized registers (SPEC), audiovisual registers (AVT) and news (NEWS). Literature and specialized registers rank first, followed by audiovisual registers and news (see Figure 1.6). The sizeable proportion of empirical studies devoted to literary translation is in line with Zanettin et al. (2015), who found that literary translation is one of the top three most-researched topics in translation and interpreting studies. As can be seen in Figure 1.6, multi-register studies (MULTI) are also widespread in the dataset. While the LIT category focuses almost exclusively on a single register, namely novels, SPEC is very fragmented, and covers registers as diverse as legal texts (e.g. treaties), reports emanating from international institutions and popular science articles. The AVT category is also quite diversified, with studies devoted to subtitling, dubbing, voice-over and audio-description. Movies and sitcoms figure prominently in the dataset, whatever the type of AVT. AVT corpora mainly contain textual data, but we also find some studies relying on

Figure 1.6  Written registers in empirical studies (n = 92).

multimodal and multimedia AVT corpora. Surprisingly, news translation is not frequently investigated on its own in our dataset, despite the wide availability of translated news items nowadays. MULTI studies, though quite numerous, are rather difficult to characterize. Some of them offer cross-register analyses. These analyses, which are based on register-stratified corpora such as the Dutch Parallel Corpus (Macken et al. 2011) and P-ACTRES (Izquierdo et al. 2008), reflect the recent interest in register variation. It is important to note, however, that not all MULTI studies analyse registers contrastively. Rather, in such studies, registers are simply combined (e.g. to make up for lack of sufficient data) and studied as a unified whole.

7.2.4  Languages

A total of twenty-three languages are investigated in the empirical studies, of which two thirds are European languages (exceptions include Arabic, Hebrew, Russian and Thai). The survey reveals a clear English-centric perspective, as 72% of the studies deal with English, either on its own or in combination with other languages. In this respect, we agree with Vandevoorde and De Sutter (2019: 1) that ‘this hegemony of English raises fundamental issues about the nature and relevance of research questions and theoretical concepts, the stability of research findings and the appropriateness of methodologies that are primarily tailored towards the investigation of the English language’. English is followed, far behind, by French, Spanish, Dutch, Chinese, Italian and German, each of which accounts for between 18% and 11% of the studies. As can be seen, Chinese is the only non-European language that features in this top list (it is mainly examined in interpreting studies in our dataset).

7.3  Corpus techniques and statistical testing

7.3.1  Corpus techniques

As pointed out in Section 2, a study only qualifies as a bona fide corpus study if it relies on the automated techniques developed within the framework of corpus linguistics. This part of the survey, which required a minute scanning of the 122 studies, proved to be particularly arduous as the information was often incomplete and tended to be scattered rather than included in a separate section devoted to data and methodology. Several studies merely reported that the occurrences of the phenomenon under scrutiny were ‘extracted’ or ‘identified’ without providing any information on the way they were extracted and/or

processed. The in-depth exploration of the texts proved to be well worth the effort, however, as it allowed some interesting trends to emerge. The main finding is that the majority of the studies (c. 60%) rely solely on the ‘basic text processing operations’ that were described by Baker (1995: 226) in the early days of CBTS, over twenty-five years ago, that is, frequency and concordancing. In some studies, the only aspect that is computed automatically is the number of words in the corpora used, the bulk of the study being carried out manually. Although this approach makes minimal use of the electronic nature of the data, it is still quite valuable in that it makes it possible to compare the frequency of linguistic phenomena across corpora (e.g. in translated and non-translated language), using both raw and relative frequencies. Most studies, however, go one step further and make use of the word list and concordancing functionalities offered by text analysis software. Word lists are particularly useful as they provide the frequency of all words in the corpus data used and make it possible to compute lexical variation indices (type/token ratio and standardized type/token ratio) automatically. Concordancing lives up to its reputation as ‘the corpus analyst’s stock-in-trade’ (Baker 1995: 226), the two most popular programs being the monolingual tools WordSmith Tools (WST) (Scott 2016) and AntConc (Anthony 2019). Bilingual concordancers such as ParaConc (Barlow 2008) are much less frequently used. In many cases, however, the authors analyse concordance lines but provide no indication of the program used. Of the forty-five studies that contain an explicit reference to one of the above-mentioned three programs, a deplorably high number (fifteen) fail to include a bibliographic reference. Some 40% of the studies make use of more advanced techniques. We have classified as advanced all the techniques that go beyond word-form-based extraction and concordancing. 
Two main categories of technique emerge: automatic annotation and automatic extraction of keywords and phraseological units. The most popular types of annotation in the survey are lemmatization and part-of-speech (POS) tagging. The first relieves the researcher of the burden of extracting the inflected forms of one and the same lemma and is a necessary step for computing the lexical density of texts. The second allows researchers to carry out extractions focused on a whole word category (e.g. modal verbs) and relieves them of another burden, that of disambiguating homonyms (e.g. the noun can vs the auxiliary can). The automatic extraction of keywords and phraseological units relies on frequency, co-occurrence and recurrence indices and can be performed automatically by programs such as WST and AntConc. Several studies in the survey use this method to extract collocates, lexical bundles, keywords and key clusters. A number of studies make use of the Corpus Workbench,8 whose central

component is a powerful query processor that makes it possible to query large corpora with linguistic annotations. Surprisingly, only two studies make use of Sketch Engine, which contains very powerful functionalities for translation research and allows translation scholars to upload their own corpora (Kilgarriff et al. 2014). Several studies focused on speech make use of EXMARaLDA (Schmidt and Wörner 2009), a system for working with oral corpora which includes a transcription and annotation tool and a query and analysis tool. While the majority of the studies rely on independent software programs, a non-negligible number base their analysis on corpora such as P-ACTRES (Izquierdo et al. 2008), which come with their own interface supporting basic and complex queries on word forms, lemmas, POS tags and phrases. The results show that the majority of the corpus-based translation and interpreting studies in our survey do not exploit the full potential offered by the electronic nature of the corpus. They tend to rely on a limited set of basic corpus techniques and thereby fail to display one of the key characteristics of corpus-based studies, that is, the fact that they ‘make extensive use of computers for analysis’ (our underlining) (Biber et al. 1998: 4). Resources exist for researchers who would like to exploit corpus techniques more intensively. In particular, Zanettin’s (2012) and Mikhailov and Cooper’s (2016) volumes are excellent sources of inspiration. There are signs that the situation is changing, however. A breakdown of the two approaches – simple versus advanced – per year shows that 55% of the studies using advanced techniques were published in the last two years covered by the survey (2018 and 2019), while the percentage of those relying on more basic methods is only half as great (27%). 
This said, it should be stressed that sophisticated corpus techniques are not required and indeed are not even practicable for many types of study, particularly those focused on aspects of language – semantic, functional or cultural – that are very hard to handle automatically.
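
To make the 'basic' end of this spectrum concrete, the following Python sketch shows how relative frequencies and the lexical variation indices mentioned above (type/token ratio and standardized type/token ratio) might be computed. It is an illustration only and does not reproduce the implementation of WordSmith Tools, AntConc or any surveyed study: the naive tokenizer, the function names and the default chunk size of 1,000 tokens are assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    A deliberately naive tokenizer (assumption): real studies would use a
    language-specific tokenizer and document its conventions.
    """
    return re.findall(r"\w+", text.lower())

def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def standardized_ttr(tokens, chunk_size=1000):
    """Mean type/token ratio over consecutive full chunks of chunk_size tokens.

    Averaging over equal-sized chunks makes texts of different lengths
    comparable; a trailing partial chunk is ignored.
    """
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:  # text shorter than one chunk: fall back to the plain ratio
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)

def relative_frequency(tokens, word, per=1000):
    """Occurrences of word per `per` tokens, for comparison across corpora."""
    return tokens.count(word) * per / len(tokens) if tokens else 0.0
```

Because the plain type/token ratio falls as texts get longer, the standardized variant is the safer choice when translated and non-translated subcorpora differ in size.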

7.3.2  Statistical testing

In addition to corpus techniques, we examined whether statistical tests were used in the 122 empirical studies. To do so, we relied on the distinction between descriptive statistics, that is, statistics that describe and summarize datasets, such as frequency counts, and inferential statistics, that is, statistics that make it possible to infer whether a trend observed in a dataset is representative of the whole population sampled. We found that 55% of the papers rely on descriptive statistics only (mostly in the form of relative frequencies), without recourse to inferential statistics (see Figure 1.7). The other studies rely on inferential statistics. Among those, we find that a majority of studies make


Figure 1.7  Use of statistics in empirical studies (n = 122).

use of monofactorial tests (whether parametric or non-parametric), typically based on contingency tables or mean rank orders. Examples of such tests include the chi-squared test, the t-test, ANOVA (Analysis of Variance), the Mann-Whitney U test and Pearson’s correlation. More elaborate statistical methods – namely multivariate exploratory techniques (e.g. correspondence analysis) and multivariate tests (e.g. regression modelling, inference trees and random forests) – are used in a small number of empirical studies, most of them published between 2017 and 2019. While the use of elaborate statistical testing and advanced quantitative methods is a most welcome development in corpus-based translation and interpreting studies (cf. De Sutter et al. 2012), it is also important to warn against an excessive drift in focus from linguistic description to statistical analysis. In their survey, Larsson et al. (forthcoming) find that the steady increase in the use of advanced statistical methods in corpus linguistics is coupled with a decreased focus on linguistic description. They therefore advocate striking a balance between these two central aspects of corpus work.
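To make the distinction between descriptive and inferential statistics concrete, the following minimal Python sketch first computes proportions (a descriptive step) and then runs a chi-squared test on a 2 × 2 contingency table (a monofactorial inferential test). The counts, the connective scenario and the function name `chi_squared_2x2` are invented for illustration and do not come from any study in the survey.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) and p-value
    for a 2x2 table. Assumes every row and column total is non-zero."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: sentences with / without an explicit connective.
table = [[30, 70],   # translated texts
         [10, 90]]   # non-translated texts

# Descriptive step: proportion of sentences with a connective per subcorpus.
proportions = [row[0] / sum(row) for row in table]

# Inferential step: is the difference likely to hold beyond this sample?
statistic, p_value = chi_squared_2x2(table)
print(proportions)          # [0.3, 0.1]
print(round(statistic, 2))  # 12.5
print(p_value < 0.001)      # True
```

A study that stops at the first `print` is descriptive in the sense used above; the test adds an estimate of how likely such a difference would be if the two subcorpora did not really differ. Multifactorial methods such as regression go further by modelling several predictors at once.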

8  Conclusion and outlook

Research surveys are ideal instruments with which to take stock of academic fields with a view to identifying current and emerging trends, assessing both


strengths and weaknesses, and suggesting directions for future developments. Surveys come in many shapes and forms. One popular type relies on bibliometric data available in online bibliographies, in particular titles, abstracts and keywords. The advantage of this method is that it gives access to a wide range of factual information (number of publications per year, types of publication, range of publication languages, etc.) from a large number of studies published in several formats (books, book chapters, journal articles, proceedings volumes) in a wide range of languages. The disadvantage is that the information that can be extracted from abstracts, titles and keywords is too limited to provide in-depth qualitative insights into the field under scrutiny. The method we decided to use for our survey of corpus-based translation and interpreting studies relies on a minute exploration of the full texts of 186 recent journal articles written in English.9 It is a very time-consuming method, which precludes investigating a very large dataset. However, as we hope the results of our survey demonstrate, it compensates for this weakness by providing a rich picture of theoretical, methodological and descriptive aspects of the current status of the field. The two approaches are therefore clearly complementary. The data extraction stage of the survey showed that journal articles that meet the requirements of corpus studies (in terms of data type and corpus techniques) account for 11% of our initial dataset. The breakdown per journal, however, showed that some journals were more corpus-oriented than others. In addition, the survey brought out a number of key trends in present-day corpus-based studies, testifying to recent developments in the field while also highlighting areas where progress has been relatively modest.
First, an analysis of the overall corpus orientation of the studies into three main categories – empirical, methodological-theoretical and applied – showed that empirical studies accounted for two thirds of the studies. In view of the descriptive, product-oriented slant of corpus linguistics, this is not particularly surprising. What came as something of a surprise, however, and can be seen to testify to the growing maturity of the field, is that one third of the studies went beyond description to tackle methodological and theoretical aspects and concrete applications. Second, a detailed scanning of the linguistic focus of each empirical study showed that the dominant category was that of lexis and terminology, followed by grammar, discourse and pragmatics, together with a mixed category comprising more than one linguistic domain. Semantic, speech-related and morphological features turned out to be less popular. Translation features (in particular, explicitation) proved to remain a popular subject of investigation, in line with Baker’s research agenda. Third, the analysis of the corpus designs of the empirical studies showed


that parallel corpora are used twice as frequently as monolingual comparable corpora, contrary to Baker’s (1995) call to move away from ST-TT comparisons. Corpora used in the field were found to represent a wide range of written and spoken registers, with a clear overrepresentation of English (either as a source or as a target language). Fourth, methodology- and theory-oriented studies proved to be quite diverse, ranging from descriptions of new corpora, literature reviews and calls for the use of more advanced quantitative methods to the application of particular theoretical constructs or models, fostering cross-fertilization with neighbouring disciplines. Finally, applied studies appeared to be mostly geared towards corpus use in translator and interpreter training, while other applied areas, such as corpus use in professional practice or translation quality assessment, were found to be rarely explored. One of the survey’s most important findings concerns the use of corpus techniques and statistics. The analysis showed that the majority of the empirical studies relied on fairly basic techniques (frequency, concordancing), which were promoted in Baker’s early papers. More advanced techniques were found to be less frequently used, the dominant types being automatic lemmatization and POS-tagging and techniques to extract keywords and phraseological units automatically. The survey also revealed that most studies rely on simple descriptive statistics (such as relative frequencies) or monovariate inferential statistics, although advanced corpus techniques and elaborate statistical testing have recently started to gain momentum. The picture drawn by our survey is only partial as it is limited to journal articles written in English and therefore leaves out many relevant publications written in other languages and published in other formats. 
In spite of these limitations, the study offers a useful survey of the field and allows us to formulate a few forward-looking suggestions. First, there is a need to build new, large corpora for translation and interpreting studies, especially bidirectional parallel corpora. As things stand, the field tends to rely on small ad hoc corpora, very few of which are available to the research community. The new corpora should comprise several registers and involve many languages so as to curb the current dominance of English. There are promising initiatives in this direction, such as the TransBank project.10 Second, care should be taken to provide a detailed description of corpus data and methodology – corpus type, corpus size, data extraction, selection and annotation, etc. – and to ensure that the information is grouped in one dedicated section rather than scattered across various sections. Third, future studies should aim to exploit the full potential of corpus techniques rather than limiting themselves to frequency profiling and concordancing. In addition, although this


is admittedly not in the hands of researchers, the field of corpus-based translation and interpreting studies could be greatly boosted if translation and interpreting journals were to give more visibility to corpus approaches in the range of topics listed in the description of the scope of the journal. Corpus-based translation and interpreting studies is still a relatively young research field. It is therefore only natural that some aspects of it have not yet attained full maturity. However, the fact that activity is thriving on all fronts – empirical, theoretical, methodological and applied – is a strong sign that the field will continue to progress unabatedly in the future.

Acknowledgements

We would like to thank Gert De Sutter for his valuable feedback on the first draft of this chapter. Any remaining shortcomings are, however, our own. Thanks also go to Thomas Simon for his help with the extraction of a preliminary version of the dataset.

Notes

1 The TSA content has been merged with the Translation Studies Bibliography (TSB).
2 https://benjamins.com/online/tsb.
3 http://dti.ua.es/en/bitra/introduction.html.
4 The authors do not specify the dates of these publications.
5 According to Zanettin et al. (2015), 74% of the papers included in TSA were originally written in English. Five of the journals included in our survey publish articles in languages other than English (Babel, inTRAlinea, Journal of Specialised Translation, Meta and trans-kom).
6 This percentage does not include studies where the word corpus is used in its strictly literary sense of a collection of writings representing a specific author, topic, genre or period. These studies were excluded without being quantified.
7 While some authors include interference (or source-language influence) in the list of translation features, we follow Baker (1993), who explicitly excludes it from her definition of ‘translation universals’.
8 http://cwb.sourceforge.net/index.php.
9 We recognize that the restriction to studies written in English is a major limitation. It stems from our method of analysis which requires careful reading of the full texts, a task that we cannot undertake in languages we do not master. It should in no way be taken as a lack of recognition of the value of articles written in other languages. Researchers using bibliometric measures can include publications in many languages because they only rely on the abstracts which are written in English.
10 https://transbank.info/.

References

Anthony, L. (2019), AntConc, Tokyo: Waseda University.
Baker, M. (1993), ‘Corpus Linguistics and Translation Studies. Implications and Applications’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology: In Honour of John Sinclair, 233–50, Amsterdam and Philadelphia: Benjamins.
Baker, M. (1995), ‘Corpora in Translation Studies: An Overview and Some Suggestions for Future Research’, Target, 7 (2): 223–43.
Baker, M. (1996), ‘Corpus-based Translation Studies: The Challenges That Lie Ahead’, in H. Somers (ed.), Terminology, LSP and Translation: Studies in Language Engineering in Honour of Juan C. Sager, 175–86, Amsterdam: Benjamins.
Barlow, M. (2008), ‘Parallel Texts and Corpus-Based Contrastive Analysis’, in M. de los Ángeles Gómez González, J. L. Mackenzie and E. M. González Álvarez (eds), Current Trends in Contrastive Linguistics. Functional and Cognitive Perspectives, 101–21, Amsterdam: Benjamins.
Bernardini, S., Ferraresi, A., Russo, M., Collard, C. and B. Defrancq (2018), ‘Building Interpreting and Intermodal Corpora: A How-to for a Formidable Task’, in M. Russo, C. Bendazzoli and B. Defrancq (eds), Making Way in Corpus-Based Interpreting Studies, 21–42, Singapore: Springer.
Biber, D., Conrad, S. and R. Reppen (1998), Corpus Linguistics. Investigating Language Structure and Use, Cambridge: Cambridge University Press.
Candel-Mora, M.A. and C. Vargas-Sierra (2013), ‘An Analysis of Research Production in Corpus Linguistics Applied to Translation’, Procedia, 95: 317–24.
Defrancq, B., De Clerck, B. and G. De Sutter (2015), ‘Corpus-based Translation Studies: Across Genres, Methods and Disciplines’, Across Languages and Cultures, 16 (2): 157–62.
De Sutter, G., Goethals, P., Leuschner, T. and S. Vandepitte (2012), ‘Towards Methodologically More Rigorous Corpus-Based Translation Studies’, Across Languages and Cultures, 13 (2): 137–43.
Fernandes, L. (2006), ‘Corpora in Translation Studies: Revisiting Baker’s Typology’, Fragmentos, 30: 87–95.
Gilquin, G. and S. Th. Gries (2009), ‘Corpora and Experimental Methods: A State-of-the-Art Review’, Corpus Linguistics and Linguistic Theory, 5 (1): 1–26.


Holmes, J. S. (2000 [1988]), ‘The Name and Nature of Translation Studies’, in L. Venuti (ed.), The Translation Studies Reader, 180–92, London and New York: Routledge.
Izquierdo, M., Hofland, K. and Ø. Reigem (2008), ‘The ACTRES Parallel Corpus: An English-Spanish Translation Corpus’, Corpora, 3 (3): 1–41.
Kenny, D. (2005), ‘Parallel Corpora and Translation Studies: Old Questions, New Perspectives? Reporting That in Gepcolt: A Case Study’, in G. Barnbrook, P. Danielsson and M. Mahlberg (eds), Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora, 154–65, London and New York: Continuum.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P. and V. Suchomel (2014), ‘The Sketch Engine: Ten Years on’, Lexicography, 1: 7–36.
Larsson, T., Egbert, J. and D. Biber (forthcoming), ‘On the Status of Statistical Reporting versus Linguistic Description in Corpus Linguistics: A Ten-year Perspective’, Corpora, 17 (1).
Laviosa, S. (2011), ‘Corpus-based Translation Studies: Where Does it Come from? Where Is it Going?’, in A. Kruger, K. Wallmach and J. Munday (eds), Corpus-Based Translation Studies: Research and Applications, 13–32, London and New York: Bloomsbury.
Liao, S. and L. Lei (2017), ‘What We Talk about When We Talk about Corpus: A Bibliometric Analysis of Corpus-related Research in Linguistics (2000–2015)’, Glottometrics, 38: 1–20.
Loock, R. (2016), La Traductologie de Corpus, Villeneuve d’Ascq: Presses universitaires du Septentrion.
Macken, L., De Clercq, O. and H. Paulussen (2011), ‘Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus’, Meta, 56 (2): 374–90.
Mikhailov, M. and R. Cooper (2016), Corpus Linguistics for Translation and Contrastive Studies. A Guide for Research, London and New York: Routledge.
Olohan, M. (2004), Introducing Corpora in Translation Studies, London and New York: Routledge.
Russo, M., C. Bendazzoli and B. Defrancq, eds (2018), Making Way in Corpus-Based Interpreting Studies, Singapore: Springer.
Schmidt, T. and K. Wörner (2009), ‘EXMARaLDA – Creating, Analysing and Sharing Spoken Language Corpora for Pragmatic Research’, Pragmatics, 19 (4): 565–82.
Shlesinger, M. (1998), ‘Corpus-based Interpreting Studies as an Offshoot of Corpus-Based Translation Studies’, Meta, 43 (4): 486–93.
Scott, M. (2016), WordSmith Tools Version 7, Stroud: Lexical Analysis Software.
Sinclair, J. (1996), EAGLES. Preliminary Recommendations on Corpus Typology. http://www.ilc.cnr.it/EAGLES96/corpustyp/corpustyp.html
Stewart, D. (2000), ‘Poor Relations and Black Sheep in Translation Studies’, Target, 12 (2): 205–28.
Tognini-Bonelli, E. (2001), Corpus Linguistics at Work, Amsterdam and Atlanta: Benjamins.

Vandevoorde, L. and G. De Sutter (2019), ‘Empirical Translation Studies in a Monolinguistic World: Theoretical and Methodological Challenges’, Workshop Description, EST Congress 2019, Stellenbosch University, South Africa, 9–13 September.
van Doorslaer, L. and Y. Gambier (2015), ‘Measuring Relationships in Translation Studies. On Affiliations and Keyword Frequencies in the Translation Studies Bibliography’, Perspectives: Studies in Translatology, 23 (2): 305–19.
Zanettin, F. (2012), Translation-Driven Corpora. Corpus Resources for Descriptive and Applied Translation Studies, Manchester: St. Jerome Publishing.
Zanettin, F., Bernardini, S. and D. Stewart (2003), Corpora in Translator Education, London: Routledge.
Zanettin, F., Saldanha, G. and S.-A. Harding (2015), ‘Sketching Landscapes in Translation Studies: A Bibliographic Study’, Perspectives: Studies in Translatology, doi:10.1080/0907676X.2015.1010551


2

Expanding the reach of corpus-based translation studies: The opportunities that lie ahead1

Federico Gaspari

1  Introduction: Today’s corpora despite translation studies

Corpus-based translation studies (CBTS) was pioneered when the increasing and progressively cheaper storage capabilities of computers, coupled with more user-friendly text analysis software, made it feasible to save large samples of electronic (translated) texts and search through them quickly and efficiently (Baker, 1993; Laviosa, 2004). The valiant attempts at introducing new corpus-based and corpus-driven (Tognini-Bonelli, 2001) empirical approaches in the broadly theoretically dominated discipline of translation studies were met with some prolonged resistance. However, the groundbreaking efforts eventually paid off, and corpus-based projects are now firmly established at the heart of translation studies of today and, most certainly, also of the future. While initially CBTS fed off methodological insights and operational procedures that had gradually matured in corpus linguistics, theoretically there was a need for altogether new departures, which were provided by the audacious vision that paved the way for subsequent developments, as evidenced by the chapters in this book. Some three decades after Baker’s (1993) inspiring seminal paper that sparked serious and sustained interest in the use and application of corpora among translation scholars, many of the relevant challenges subsequently identified by Baker (1996) have been successfully dealt with, but inevitably some others remain as yet to be addressed, at least in part, and new ones have been presented to the research and scholarly community. This chapter intends to take stock of the progress in CBTS to date, assess the main challenges that appear to have been overcome as well as those that still have to be fully met and draw attention to the


ones that are on the horizon, with a view to extending the scope of CBTS even further, as the title of this timely collection suggests, with a particular focus on the opportunities for research that lie ahead. When CBTS was programmatically launched by Baker almost thirty years ago, corpus resources were severely lacking: texts in electronic format were still hard to come by, language datasets were difficult and slow to design, collect, assemble and make accessible for scholarly pursuits. These obstacles were obviously compounded for the corpora that were needed to advance the study of translation from a product-oriented perspective (Baker, 1995; Baker, 1996; Tymoczko, 1998), in the wake of the significant and long-lasting impact of polysystem theory and descriptive translation studies (Even-Zohar, 1990; Toury, 1995; Baker, 1999; Laviosa, 2002; Laviosa, 2004). This difficulty was largely due to the technical and operational challenges involved in the compilation and analysis of translation-oriented corpora (Zanettin, 2012 provides a comprehensive overview of the relevant issues). Fast-forward to the present day, the situation is strikingly reversed in terms of the availability of electronic and digital resources, vis-à-vis their linguistic and translation-oriented conceptualization and analysis: all the people who use the internet and most forms of current technology (including smartphone-based apps), not only academics and scholars, are constantly surrounded by linguistic data and translations in digital format. Borrowing House’s (1981) well-entrenched concepts, all technology users are presented on a regular basis with both overt translations (e.g. the speech or interview given by a high-profile international political leader published by an online newspaper or blog) and covert translations (e.g. the terms and conditions to subscribe to an online service via a localized website or app-operated system).
This unprecedented – and unimaginable until about a mere decade ago – abundance of authentic digital translation data is a welcome difference compared to the early days when CBTS was taking its first steps. However, the theory and methodology currently run the risk of lagging behind, as they have to play catch-up with the ubiquitous availability of distributed translated language that is waiting to be analysed in novel ways, to produce scholarship and insights that keep pace with the relentless, dynamic developments in how people communicate across languages via translation today, on multiple platforms and in very diverse and fragmented scenarios (see e.g. Gaspari, 2013 and Caimotto and Gaspari, 2018 for a discussion of some of the relevant issues for news translation in the online environment). The large-scale availability on the Web of candidate corpus resources relevant to translation studies has been creatively exploited by a number of projects


since the early work by Baroni and Bernardini (2006a) and Baroni et al. (2009), but today and for the foreseeable future this remains an exciting challenge for CBTS. Unlike thirty, or even fifteen years ago, now and for the coming decades empirical linguists and translation scholars will have at their disposal generous and constantly refreshed sources of translation data with tremendous potential, waiting to be mapped and systematically tapped into. This is a situation where endless corpora of translated text exist and are available despite, so to speak, translation studies as such, and it is the scholars’ job to devise creative and effective ways to exploit this vast pool of potential corpus resources. The ever-increasing volume of digital text needs to be tamed and organized so that it is usable and suitable for research. Key concepts such as corpus size, balance and representativeness come under severe pressure when designing and conducting CBTS projects in the exciting, but very wild, online scenario. Fundamental contributions to corpus methodology and design such as Atkins et al. (1992) and Biber (1993) still provide valuable guidelines for the challenges currently faced by CBTS, but they cannot provide all the answers, and significant methodological and operational ingenuity is required. Empirical translation scholars once hungry for translated texts that were not available digitally are now inundated with possibilities everywhere they look, but the methodological and theoretical scope of CBTS needs to be accordingly extended and strengthened for the discipline to be able to take steps in this direction. One crucially important example that stands out as a testament to the impactful nature of this scenario is that of the translation industry, in all its many forms across the world, spanning more or less structured markets and traditional as well as innovative fields.
O’Hagan’s (1996) visionary and influential book heralded the profound impact that the internet and ubiquitous digital international communication could have on professional translation. A quarter of a century later, many of the developments envisaged then have turned out to be surprisingly accurate and long-lasting, while others have petered out or not left a noticeable mark on the translation industry and profession. However, unlike in the mid-1990s, today virtually any professional translation activity happens digitally, adding technical layers to the strictly interlingual and textual labour, for example, in the form of desktop publishing processes, preparation and transmission of translations to digital platforms via content management systems and so on. To name just a few examples, virtually all literary masterpieces, marketing copy, technical instruction manuals and financial reports that get translated today – regardless of the source and target languages – are invariably produced


and delivered in digital format. This is something that the pioneers of CBTS could only dream of at a time of floppy disks with limited storage capacity, cranky printers, faxes with crease-prone paper and less-than-reliable optical character recognition devices to convert printed sources into electronic format (Sinclair, 1991; Leech, 1992; Louw, 1993; Flowerdew, 2012: 36ff; McEnery and Hardie, 2013). In addition, computer-assisted translation tools have long been the norm for the translation of technical and specialized texts (Heyn, 1998 is an early example of a study in this area). More recently, neural machine translation (MT) has made so much progress that after Toral and Way’s (2014: 174) somewhat provocative argument that using MT for literature would represent an attack on ‘the last bastion of human translation’, today this proposition is no longer as far-fetched as it would have appeared a few years ago (see also Chapter 9 by Kübler, Mestivier and Pecman in this volume). All these concurrent developments that seem to display an accelerating pace and are most probably going to continue to characterize translations from and into any language across the world for a long time deserve careful consideration as we ponder the evolution of CBTS. Starting from a brief overview of areas and issues for which CBTS has already provided a substantial contribution, and of some key challenges that seem to remain open, the rest of the chapter will zoom in on the opportunities that still lie ahead thanks to this exciting scenario. An attempt will be made to map out the outstanding challenges and the key issues that need to be confronted to expand the reach of CBTS going forward, keeping in mind the lessons learned over the remarkable developments of the last three decades in the field.
The discussion will show the connections between strands of independent (but theoretically, methodologically or operationally related) research and will emphasize where convergence can be identified, or seems worth pursuing, to encourage the further advancement of CBTS. Similar attention will be paid to unmotivated disconnects and especially gaps in the field. In this way, we hope to indicate where there is potential for further development on the basis of what has already been achieved.

2  Taking stock of progress in CBTS to move forward

Fillmore’s (1992) polarized and amusing depiction of the (physical as well as theoretical) posture of the villain ‘armchair linguist’ and the imaginary but plausible skirmishes with the hero corpus linguist was subsequently revisited several times, for example, by Mahlberg (2005: 15), Partington (2008) and Stewart


(2010: 167). Fillmore’s scene epitomized the tension due to the longer tradition and stronger institutional and academic support enjoyed by theoretically motivated linguistics with respect to the fresher, but less institutionally embedded, empirical approaches to language study. The present volume and a sizeable body of other publications (e.g. Kruger et al., 2011; Fantinuoli and Zanettin, 2015; De Sutter et al., 2017a; Vandevoorde et al., 2020) testify that Fillmore’s scene would be an overdramatized representation of today’s relationship between what we might call ‘armchair translation theorists’ and corpus-based translation scholars. Corpus-based translation researchers are thriving, and, as the title of this book suggests, there is the potential – and, one may add, the need – to further extend the scope and expand the reach of CBTS, partly through cross-fertilization with more traditional theoretical frameworks and complementary research methods, an argument that is well illustrated in Chapter 4 by Neumann, Freiwald and Heilmann in this volume. For reasons of space, this section can only briefly review some of the key areas that scholars have been able to explore thanks to CBTS, without any ambition to be exhaustive or to discuss the few hand-picked examples in the level of detail that they would deserve and that would do justice to the work of the scholars involved. The arbitrarily selected major topics that are discussed in this section as illustrations of the contribution of CBTS to translation theory and methodology are translation (and mediation) universals, and in particular explicitation, and directionality. Going beyond the strict opposition conjured up by Fillmore’s anecdote for comic effect, this discussion shows that CBTS has unified a variety of approaches under the umbrella of a bottom-up framework that has arguably produced the most exciting and groundbreaking advances in translation studies since the 1990s. 
One of the first major topics that emerged thanks to the analytical potential afforded by corpus-based investigations and that has been looked at by several scholars over a long period of time that almost exactly coincides with the existence of CBTS itself is that of translation universals. In addition to the comprehensive references included at the end of Kotze’s chapter in this volume, other key studies that have contributed to the debate on translation universals include the interesting ‘unique item hypothesis’ (Chesterman, 2007) and a range of other studies such as Blum-Kulka (1996); Klaudy (1996); Olohan (2001); House (2008); Gaspari and Bernardini (2010); Xiao (2010) and Bernardini (2011). Over time, some scholars have suggested that the notion of ‘translation universals’ is somewhat too narrow and limited, and have put forward the more comprehensive concept of ‘mediation universals’ (e.g. Ulrych and Murphy, 2008;


Gaspari and Bernardini, 2009; Ulrych, 2009; Bisiada, 2017; see also Chapter 5 by Ivaska et al. in this volume). While this is a thoroughly investigated area in CBTS, as one would expect in bottom-up product-oriented descriptive studies, reaching conclusive unifying findings has proven elusive so far, due to the complexity of the phenomena to be mapped and related to each other across language pairs, text types, genres, registers and data collection conditions (cf. Lapshinova-Koltunski’s Chapter 7, this volume). Innovative approaches such as the one convincingly proposed by Kotze in this volume can add the missing methodological and explanatory power to promote more accuracy and granularity in the study of the phenomena related to translation (and mediation) universals going forward. Frankenberg-Garcia (2009) explores the link between translation, text length and explicitation, claiming its relevance for pedagogic purposes. She makes an interesting connection between this long-standing strand of CBTS research, translator training and a process-oriented emphasis, which otherwise seems to have been largely overlooked (one notable and interesting exception being the chapter by Neumann, Freiwald and Heilmann in this volume, which lays emphasis on the importance of connecting length and directionality with regard to the phenomena they consider). Another important contribution in this area is Castagnoli (2011), which analyses the use of connectives in a multiple learner translation corpus, that is, a parallel corpus with several translations by different students for the same source text, with the aim of uncovering regularities and variations in the translations, keeping the source invariant. One conclusion of this interesting study is that when several or most student translators adopt the same rendition, explicitation features appear to be linked with concurrent normalization, that is, closer adherence to target language norms. 
In addition, the virtues of the novel multiple translation corpus methodology are extolled as an effective means to gain new insights into the relevant phenomena. Frankenberg-Garcia (2009) and Castagnoli (2011) are two interesting studies that have started to investigate the issue of explicitation within the universal hypothesis from new angles and to propose new methodological set-ups, whose potential does not seem to have been adequately considered in subsequent CBTS work. Molés-Cases (2019) emphasizes the role of source and target language typology to correctly investigate translation universals, and focuses on the two complementary perspectives of process-oriented implicitation and product-oriented explicitation. The study focuses on German to Spanish (i.e. from a Germanic to a Romance language) and puts forward a convincing argument that linguistic typology is to be factored into researching translation universals,
especially in intertypological settings. One suggestion that can be made for a more powerful methodological set-up involves controlling for the typological variable by considering typologically matching and different source and target languages across varieties of genres, text types and registers, and checking for the regularities that emerge, in order to systematically build a solid picture going from the bottom up (cf. Ivaska et al., this volume). In addition, the profiles of the translators responsible for producing the target texts when researching explicitation in translation are also deemed crucial (see Fattah, 2010 for work on this issue); one shortcoming in previous related work is that of frequently relying on student translators: while the insights they provide are certainly valuable, for example, to inform research-based teaching practices and improve pedagogic materials, there is broad consensus that learners and students cannot be taken as fully reliable proxies of professional and experienced translators (Bowker, 1998: 635–6; De Sutter et al., 2017b; Lapshinova-Koltunski, this volume). This is another interesting instance where there seem to be ample opportunities to further develop CBTS in relation to the long-standing focus on explicitation, especially now that, as noted in Section 1, virtually all professional translation activities are carried out electronically. At least a brief mention should also be made of the substantial body of CBTS research in a neighbouring area, namely that concerning ‘translationese’ (e.g. Mauranen, 1999; Tirkkonen-Condit, 2002; Puurtinen, 2003; Bernardini and Baroni, 2005; Baroni and Bernardini, 2006b; Rabinovich and Wintner, 2015; Kunilovskaya and Lapshinova-Koltunski, 2019), the ‘third code’ (e.g. Frawley, 1984; Loock, 2013; Granger, 2018) and ‘shining through’ (e.g.
Teich, 2003; Hansen-Schirra, 2011; Neumann, 2013; Lapshinova-Koltunski, 2015; Cappelle and Loock, 2017), as these studies share similar objects of research and related methodologies. Lapshinova-Koltunski’s contribution to this volume is an interesting study that looks at normalization and shining-through, in the tension between the adoption of source- versus target language norms, considering student as well as professional translations. Interestingly, as noted for Molés-Cases (2019) who investigates implicitation and explicitation as candidate features of translation universals, Cappelle and Loock (2017) also focus on the role of typological similarities and differences between the source and target language to investigate the phenomenon of ‘shining through’, thus showing some interesting convergence in this respect that points to further promising linguistically motivated typologically oriented research in CBTS. This brief selective review of some of the studies and approaches that have been unified within the bottom-up framework of CBTS is completed by the issue
of directionality, which is chosen as it plays a key role in researching translation (and mediation) universals, and especially explicitation. The issue of translation directionality, that is, into the L1 versus into the L2, has been widely debated, especially in translator training, with different positions and suggestions concerning its actual pedagogic usefulness (e.g. Campbell, 1998; Adab, 2005; Pokorn, 2005; Stewart, 2008; Pokorn et al., 2020). Evert and Neumann (2017), for instance, investigate the ‘shining through’ effect as a function of directionality between English and German, using a sophisticated methodology that involves multivariate analysis, visualization and lightly supervised machine learning techniques, and discuss whether ‘shining through’ might in fact be considered ‘a universal feature of translation’ (Evert and Neumann 2017: 70, emphasis in the original). Similarly, Dupont and Zufferey (2017) compare some specific translation equivalents between English and French based on the three dimensions of directionality, text register and level of translator expertise, thus linking some of the separate strands of research discussed above, showing that the interplay between these dimensions does affect cross-linguistic equivalences. The relationship between translation directionality and translator expertise (on the cline between novice and expert) is addressed with respect to explicitation and simplification by Penha-Marion et al. (2020), who present an ongoing empirical project that exploits the power of data triangulation. Their study is important also because it proposes an annotation taxonomy to conveniently code explicitation, implicitation and simplification phenomena in translated text; it is to be hoped that the proposed taxonomy will be used extensively and tested on a wide range of language pairs and text types, so that new insights are gathered within a consistent unifying framework. 
It should be noted that while Evert and Neumann (2017) and Dupont and Zufferey (2017) use directionality in the sense of language pair directionality (i.e. DE>L1 EN vs EN>L1 DE and EN>L1 FR vs FR>L1 EN), Penha-Marion et al. (2020) examine L1/L2 directionality (i.e. translations into the L1 vs into the L2). Finally, to conclude this cursory overview of directionality, the picture is extended to the efforts focused on interpreting, as shown, for example, by Monti et al. (2005) who analyse a parallel trilingual (English, Italian and Spanish) part-of-speech-tagged corpus of original speeches delivered at the European Parliament and their simultaneous interpretations. They examine in particular frequent lexical patterns and morpho-syntactic structures that occur across all possible combinations and directions of simultaneous interpreting represented in the corpus, with a special interest in the typological variable, that is, when interpreting between a Germanic and a Romance language, as opposed to
within the Romance family, again laying emphasis on the relevance of the cross-typological dimension. Although for reasons of space this chapter is more focused on translation, some other important contributions from interpreting scholars that certainly extend and enrich the scope of CBTS are discussed in the next section.

3  Directions for future progress in CBTS: Breaking new ground

So far, CBTS has mostly – and understandably – focused on phenomena and issues within the remit of what we might call traditional translation, typically adopting well-established methodological approaches, by making use of parallel, comparable and monolingual comparable corpora. As noted above with regard to Frankenberg-Garcia (2009) and Castagnoli (2011), there have been limited attempts to challenge these strict categories that have by and large defined, but also to some extent limited, the remit of CBTS. Efforts to overcome the methodological confines of the discipline such as the one put forward by Gaspari (2015), who proposes an innovative hybrid comparallel approach, seem to have remained largely isolated, a sign that there is some resistance to innovation and experimentation with novel methodological approaches. However, as argued in Section 1, there is a pressing need for CBTS to seriously chase up translation where and how it actually happens every day all around those who make even minimal use of digital technologies: not only on the internet accessed through a computer, but also via smartphone-based apps, streaming TV on portable devices and so on with user features and viewer services of all kinds powered by translation. The data is there, much easier to acquire, process and analyse than thirty years ago at the dawn of CBTS, and if scholars are to track and understand where translation is going and how it will evolve, that is the arena that they need to look at as their next big challenge. Just as the online environment has transitioned to the Web 2.0, paraphrasing the title of O’Hagan’s (2016) talk, we could call for the advent of CBTS 2.0. In my view, too little progress has been made too slowly to tap into this vast source of opportunities, which calls for some bold initiative. In particular, it seems imperative for CBTS to leave its hard-earned comfort zone.
It is good for disciplines to carve out their own niche and establish their comfort zone, and to do so in a period of less than thirty years is nothing short of remarkable, a huge testament to the vision of the pioneers and to the
determination of those who have worked in this area so far. The achievements are particularly striking in a bottom-up field of research such as CBTS, which is unified in its many ramifications and inevitable fragmentation, due to the different language pairs, genres, text types, registers and so on that have been the objects of investigation over time. However, for a discipline to continue to thrive, it is also essential and healthy for it to leave its comfort zone. In the face of the new environments and circumstances in which digital translations are requested, produced, circulated and consumed, CBTS may end up lagging behind. The evidence of this is that any review of recent work in CBTS (e.g. Granger and Lefer’s Chapter 1, this volume) does not cover projects or efforts devoted to nontraditional (but now mainstream, in many respects) developments of translation in the digital world. To counter this obvious huge gap, I feel that such scenarios need to be carefully examined and accounted for in CBTS, if we want it to remain relevant and meaningful in the future, without losing its impact on translation studies as well as beyond. Social media are a case in point (cf. Desjardins, 2017). While, of course, posts and comments on Facebook, Twitter, Instagram and similar online platforms may not be regarded as legitimate translation material in the traditional sense by some, there are several cross-linguistic exchanges and multilingual accounts on social media that cater for a worldwide user base through translation, for example, those of global celebrities, international institutions and organizations, large-scale events attracting viewers from various countries and language backgrounds. 
Applied linguists have looked at large corpora of social media data in a number of languages for multiple purposes, ranging from sentiment analysis to information mining, via the realization of hate speech online, but I am not aware of comparable large-scale corpus-based projects and investigations systematically looking at the role played on social media platforms by translation, and at the dynamics that it is subjected to as well as those that it generates in such technologically rich environments. On a related note, due to the boost given by social media networks and online collaboration platforms, amateur translation and translation crowdsourcing represent areas that have surprisingly and, in my view, regrettably received very limited interest from CBTS. Translation crowdsourcing and collaborative and volunteer translation have been investigated within translation studies (e.g. McDonough Dolmaya, 2011 and 2012; Sutherlin, 2013; Olohan, 2014), with interesting and original corpus-based approaches (e.g. Jiménez-Crespo, 2013 and 2017). However, overall corpus-based contributions do not seem commensurate with the scale of these phenomena and the research opportunities
that they offer. The same observations could be made for other non-professional and non-traditional (no doubt, for some, unorthodox and illegitimate) forms of translation and mediation, including child language brokering (Antonini, 2010) in migratory contexts and various types of amateur translation for recreational or entertainment purposes such as fansubbing (Diaz Cintas and Muñoz Sánchez, 2006). There seem to be clear gaps here with regard to the contribution that CBTS could make to these and other similar areas of research, and it is difficult to fully explain the lack of engagement and interest so far. CBTS will hopefully make it a priority in the future to extend its scope and reach into these and other areas that have been largely neglected so far. In contrast, applications of corpora to the study of well-established forms of audiovisual and multimedia translation such as subtitling and dubbing are definitely more common and advanced (e.g. Heiss and Soffritti, 2008; Valentini and Linardi, 2009; Baños et al., 2013; Freddi, 2013; Chaume, 2018; Pavesi, 2019; Bruti, 2020), with however comparatively less attention paid, for instance, to film audio-description (e.g. Jimenez Hurtado and Soler Gallego, 2013). Finally, other relatively new, but widely practised, forms of translation such as game localization also seem to have been largely neglected by CBTS (see O’Hagan and Chandler, 2016; Mangiron, 2018: 125). One interesting methodological dimension that appears to be particularly promising, but still in need of being properly considered in mainstream CBTS, is the exciting new prospect of corpus-based translation process studies. 
As noted in Section 1, today translations – whether professional or otherwise – are invariably produced, manipulated and eventually delivered in digital format, including all the interventions occurring in the intermediate steps between the first draft produced by the translator and the final product that is eventually delivered to and read by the receiver. Research in translation studies has delved into the power relations, dynamics and technical interventions that play a role in this series of passages from a process-oriented perspective, considering both revision and editing of translations done by humans (e.g. Künzli, 2007a and 2007b; Martin, 2007; Mossop, 2007; Robert, 2008; Murphy, 2012; Scocchera, 2017) and more recently post-editing of MT output (e.g. Koponen et al., 2019; Nitzke, 2019; Koponen et al., 2021; see also Chapter 9 by Kübler, Mestivier and Pecman in this volume). However, Alves and Couto Vale (2017: 90) argue that ‘research on translation process data from the perspective of corpus linguistics is still quite incipient’, and since all versions of a translation from the first draft through to its final published or circulated version exist electronically, there is ample opportunity for CBTS to apply its methodologies and techniques in the process-oriented investigation of editing, revision and post-editing. This would
provide new insights into widespread practices in professional translation that also have significant pedagogic importance (Scocchera, 2017). In Section 2 a quick reference was made to the corpus-based study of directionality in simultaneous interpreting by Monti et al. (2005). As noted by the volume editors in their Introduction, on the whole corpus-based interpreting studies remains marginal in this book, even though overall its relevance seems to be rightly, if slowly, increasing (see e.g. Russo et al., 2018; Vandevoorde et al., 2020). Fortunately, the interest in using corpora to study interpreting in a range of settings is rapidly gaining ground, also with pedagogical applications in mind, including for example, within community telephone-based (Castagnoli and Niemants, 2018), dialogue (Davitti, 2019) and video-enabled remote interpreting (Davitti and Braun, 2020). In addition, more attention is being paid to the rich multimodal dimension of interpreting-mediated communication. Clearly, corpus-based projects in this broad area need to grapple with several specific technical and implementation challenges that range from collecting multi-channel data to their analysis and the description of the findings, as all these steps present extra layers of complexity compared to investigating corpora of exclusively written language (see e.g. Adolphs and Knight, 2010; Bernardini et al., 2018). While this may explain the comparatively slower progress to date, also in the light of recent promising advances such as those briefly reviewed above, it is hoped that efforts in corpus-based interpreting studies, including its multimodal components, will be prioritized in the near future. Room should also be made for sign language interpreting in CBTS, which seems long overdue in light of great developments such as those described in Napier (2009) and Turner and Best (2017).
Concluding this discussion of the possible future directions that CBTS, including interpreting as well as innovative forms of translation, could take to remain relevant going forward, it seems to me that one clear lesson from the past thirty years whose impact needs to be kept relevant concerns the importance of corpora in the training and education of translators and interpreters (cf. Verplaetse and Kübler et al., this volume). However, in my view, there is a need to go beyond formal vocational and academic contexts (e.g. Zanettin et al., 2003; Beeby et al., 2009), to also include the life-long learning and post-accreditation professional development of experienced practitioners (see e.g. Bernardini, 2008; Gallego-Hernández, 2015). In the ongoing effort to extend the impact and reach of CBTS in the future, one issue that seems to deserve increased attention is that of ensuring buy-in from practising translators and interpreters and
their professional associations with regard to the importance and relevance of corpora. While anecdotal evidence suggests that by and large, this still remains an uphill struggle, as new generations of academically trained translators and interpreters start their careers, the reluctance to recognize corpora as important aids for language professionals may gradually subside.

4  Conclusion: The opportunities that still lie ahead for CBTS

In spite of the many unknowns concerning the future developments of translation and the ways in which CBTS will continue to advance, we have highlighted some areas that seem to deserve careful consideration with a view to addressing key outstanding challenges, filling some still existing gaps, and not losing touch with the current and foreseeable evolution of translation. We have argued that one danger is that while translation is accelerating and diversifying its role in online and digital communication, CBTS may run the risk of lagging behind and losing sight of such innovations for fear of venturing out of its hard-earned comfort zone, which would make it gradually irrelevant. To avoid this risk, the discipline has to be bold in tackling new issues, including in unprecedented scenarios, by analysing data that is out there but is clearly difficult to tame for robust research. This involves keeping an open mind about the need to extend current methodologies and come up with altogether new approaches, as the objects of enquiry of CBTS necessarily change, at least to some extent. The opportunities are there, waiting to be seized. In pursuing this forward-looking agenda, the discipline would do well to extend its scope so as to encompass as many languages as possible, including the so-called minority and less-resourced languages. For understandable reasons that are hard to criticize, the first thirty years in CBTS have been broadly dominated by a very sharp focus on English and a relatively small group of other dominant and influential languages. There are partly technical reasons why so far CBTS has broadly confined itself to investigations focusing on a relatively small number of (mostly elite) languages with significant international circulation, including, for instance, the sheer availability of datasets and corpora in those languages.
Probably the attraction of these languages in terms of receiving funding for research projects, publishing work and advancing one’s career has determined a de facto monopoly that has largely excluded less powerful languages. As CBTS considers its future directions, foregrounding the ethical dimension seems to be of paramount importance. It is hoped that the increasingly digital nature of
translation everywhere will enable researchers to tap into the vast datasets that are available for a very wide range of languages and language combinations, so as to unlock the full potential of truly inclusive CBTS that is relevant to all language communities and translation scenarios, overcoming the pragmatic elitism of the past. This chapter has provided a high-level overview of topics and phenomena that over the last three decades have at the same time made the fortune of CBTS and represented significant contributions to theoretical and applied translation studies as a whole, focusing for illustrative purposes on translation and mediation universals, explicitation and directionality. The discussion has foregrounded theoretical, methodological and operational concerns, especially drawing comparisons and parallels between independent strands of research where commonalities can be observed, and there seems to be a need to encourage further convergence. Far from adopting a purely celebratory approach, which would not have been entirely inappropriate when taking stock of the progress made in CBTS, the chapter has zoomed in on the several gaps and open challenges that still remain to be confronted, indicating the ones that appear to be in need of the most urgent attention for the continued growth of the discipline. To conclude, as we pay tribute to those who have paved the way with their vision and acknowledge the contributions of those who have sustained the progress of CBTS since the early 1990s, it is a shared responsibility of the community to swiftly take the next steps in the right direction for the discipline. This chapter has hopefully provided useful food for thought and indicated some of the most promising and important avenues for future development, so that the opportunities that still lie ahead for CBTS can be seized and put to good use.
It is by grasping these and other similar opportunities that the scope and the reach of CBTS can be extended, to the benefit of both those within and outside the community.

Acknowledgements

I am very grateful to the editors of the volume for the opportunity to contribute this chapter, for their support and encouragement during its preparation and for their constructive comments on a preliminary draft. All remaining errors are my sole responsibility.

Note

1 The title of this chapter contains a deliberate tribute to Baker’s (1996) groundbreaking and inspiring work.

References

Adab, B. (2005), ‘Translating into a Second Language: Can We, Should We?’, in G. Anderman and M. Rogers (eds), In and Out of English, 227–41, Clevedon: Multilingual Matters.
Adolphs, S. and D. Knight (2010), ‘Building a Spoken Corpus: What Are the Basics?’, in A. O’Keeffe and M. McCarthy (eds), The Routledge Handbook of Corpus Linguistics, 38–52, Oxford: Routledge.
Alves, F. and D. Couto Vale (2017), ‘On Drafting and Revision in Translation: A Corpus Linguistics Oriented Analysis of Translation Process Data’, in S. Hansen-Schirra, S. Neumann and O. Čulo (eds), Annotation, Exploitation and Evaluation of Parallel Corpora, 89–110, Berlin: Language Science Press.
Antonini, R. (2010), ‘The Study of Child Language Brokering: Past, Current and Emerging Research’, MediAzioni, 10: 1–23. Http://mediazioni.sitlec.unibo.it.
Atkins, S., Clear, J. and N. Ostler (1992), ‘Corpus Design Criteria’, Literary and Linguistic Computing, 7 (1): 1–16.
Baker, M. (1993), ‘Corpus Linguistics and Translation Studies. Implications and Applications’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology: In Honour of John Sinclair, 233–50, Amsterdam: John Benjamins.
Baker, M. (1995), ‘Corpora in Translation Studies: An Overview and Some Suggestions for Future Research’, Target, 7 (2): 223–43.
Baker, M. (1996), ‘Corpus-Based Translation Studies: The Challenges That Lie Ahead’, in H. Somers (ed.), Terminology, LSP and Translation: Studies in Language Engineering in Honour of Juan C. Sager, 175–86, Amsterdam: John Benjamins.
Baker, M. (1999), ‘The Role of Corpora in Investigating the Linguistic Behaviour of Professional Translators’, International Journal of Corpus Linguistics, 4 (2): 281–98.
Baños, R., Bruti, S. and S. Zanotti (2013), ‘Corpus Linguistics and Audiovisual Translation: In Search of an Integrated Approach’, Perspectives, 21 (4): 483–90.
Baroni, M. and S. Bernardini, eds (2006a), WaCky! Working Papers on the Web as Corpus, Bologna: GEDIT.
Baroni, M. and S. Bernardini (2006b), ‘A New Approach to the Study of Translationese: Machine-Learning the Difference Between Original and Translated Text’, Literary and Linguistic Computing, 21 (3): 259–74.
Baroni, M., Bernardini, S., Ferraresi, A., and E. Zanchetta (2009), ‘The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora’, Language Resources and Evaluation Journal, 43 (3): 209–26.
Beeby, A., Rodríguez-Inés, P. and P. Sánchez-Gijón, eds (2009), Corpus Use and Translating: Corpus Use for Learning to Translate and Learning Corpus Use to Translate, Amsterdam: John Benjamins.
Bernardini, S. (2008), ‘Web and Corpora for Language Professionals: Where Are We (Going)?’, in M. Hédiard (ed.), Linguistica dei Corpora. Strumenti e Applicazioni, 149–70, Cassino: Edizioni Dell’Università.
Bernardini, S. (2011), ‘Monolingual Comparable Corpora and Parallel Corpora in the Search for Features of Translated Language’, SYNAPS, 26: 2–13.
Bernardini, S. and M. Baroni (2005), ‘Spotting Translationese: A Corpus-Driven Approach Using Support Vector Machines’, in P. Danielsson (ed.), Proceedings of Corpus Linguistics 2005, 1–12, University of Birmingham.
Bernardini, S., Ferraresi, A., Russo, M., Collard, C. and B. Defrancq (2018), ‘Building Interpreting and Intermodal Corpora: A How-to for a Formidable Task’, in M. Russo, C. Bendazzoli and B. Defrancq (eds), Making Way in Corpus-based Interpreting Studies, 21–42, Berlin: Springer.
Biber, D. (1993), ‘Representativeness in Corpus Design’, Literary and Linguistic Computing, 8 (4): 243–57.
Bisiada, M. (2017), ‘Universals of Editing and Translation’, in S. Hansen-Schirra, O. Czulo and S. Hofmann (eds), Empirical Modelling of Translation and Interpreting, 241–75, Berlin: Language Science Press.
Blum-Kulka, S. (1996), ‘Shifts of Cohesion and Coherence in Translation’, in J. House and S. Blum-Kulka (eds), Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, 17–35, Tübingen: Narr.
Bowker, L. (1998), ‘Using Specialized Monolingual Native-Language Corpora as a Translation Resource: A Pilot Study’, Meta, 43 (4): 631–51.
Bruti, S. (2020), ‘Corpus Approaches and Audiovisual Translation’, in Ł. Bogucki and M. Deckert (eds), The Palgrave Handbook of Audiovisual Translation, 381–95, London: Palgrave Macmillan.
Caimotto, M. C. and F. Gaspari (2018), ‘Corpus-based Study of News Translation: Challenges and Possibilities’, Across Languages and Cultures, 19 (2): 205–20.
Campbell, S. (1998), Translation into the Second Language, London: Routledge.
Cappelle, B. and R. Loock (2017), ‘Typological Differences Shining Through: The Case of Phrasal Verbs in Translated English’, in G. De Sutter, M.A. Lefer and I. Delaere (eds), Empirical Translation Studies. New Theoretical and Methodological Traditions, 235–64, Berlin: Mouton de Gruyter.
Castagnoli, S. (2011), ‘Exploring Variation and Regularities in Translation With Multiple Translation Corpora’, Rassegna Italiana di Linguistica Applicata, 12: 311–32.
Castagnoli, S. and N. Niemants (2018), ‘Corpora Worth Creating: A Pilot Study on Telephone Interpreting’, inTRAlinea, 20: 1–13. www.intralinea.org/archive/article/2315.
Chaume, F. (2018), ‘An Overview of Audiovisual Translation: Four Methodological Turns in a Mature Discipline’, Journal of Audiovisual Translation, 1 (1): 40–63.
Chesterman, A. (2007), ‘What is a Unique Item?’, in Y. Gambier, M. Shlesinger and R. Stolze (eds), Doubts and Directions in Translation Studies. Selected Contributions from the EST Congress, Lisbon 2004, 3–13, Amsterdam: John Benjamins.
Davitti, E. (2019), ‘Methodological Explorations of Interpreter-Mediated Interaction: Novel Insights from Multimodal Analysis’, Qualitative Research, 19 (1): 7–29.
Davitti, E. and S. Braun (2020), ‘Analysing Interactional Phenomena in Video Remote Interpreting in Collaborative Settings: Implications for Interpreter Education’, The Interpreter and Translator Trainer, 14 (3): 279–302.
De Sutter, G., M.A. Lefer and I. Delaere, eds (2017a), Empirical Translation Studies: New Methodological and Theoretical Traditions, Berlin: Mouton de Gruyter.
De Sutter, G., Cappelle, B., De Clercq, O., Loock, R. and L. Plevoets (2017b), ‘Towards a Corpus-Based, Statistical Approach to Translation Quality: Measuring and Visualizing Linguistic Deviance in Student Translations’, Linguistica Antverpiensia, New Series: Themes in Translation Studies, 16: 25–39.
Desjardins, R. (2017), Translation and Social Media. In Theory, in Training and in Professional Practice, London: Palgrave Macmillan.
Diaz-Cintas, J. and P. Muñoz Sánchez (2006), ‘Fansubs: Audiovisual Translation in an Amateur Environment’, JoSTrans: The Journal of Specialised Translation, 6: 37–52. www.jostrans.org/issue06/art_diaz_munoz.pdf.
Dupont, M. and S. Zufferey (2017), ‘Methodological Issues in the Use of Directional Parallel Corpora: A Case Study of English and French Concessive Connectives’, International Journal of Corpus Linguistics, 22 (2): 270–97.
Even-Zohar, I. (1990), ‘The Position of Translated Literature Within the Literary Polysystem’, Poetics Today, 11: 45–51.
Evert, S. and S. Neumann (2017), ‘The Impact of Translation Direction on Characteristics of Translated Texts: A Multivariate Analysis for English and German’, in G. De Sutter, M.A. Lefer and I. Delaere (eds), Empirical Translation Studies: New Methodological and Theoretical Traditions, 47–80, Berlin: Mouton de Gruyter.
Fantinuoli, C. and F. Zanettin, eds (2015), New Directions in Corpus-based Translation Studies, Berlin: Language Science Press.
Fattah, A. (2010), A Corpus-based Study of Conjunctive Explicitation in Arabic Translated and Non-Translated Texts Written by the Same Translators/Authors. PhD Thesis, Faculty of Humanities, University of Manchester, UK.
Fillmore, C. J. (1992), ‘“Corpus Linguistics” vs. “Computer-aided Armchair Linguistics”’, in J. Svartvik (ed.), Directions in Corpus Linguistics: Proceedings from a 1991 Nobel Symposium on Corpus Linguistics, 13–38, Berlin: Mouton de Gruyter.
Flowerdew, L. (2012), Corpora and Language Education, London: Palgrave Macmillan.
Frankenberg-Garcia, A. (2009), ‘Are Translations Longer Than Source Texts? A Corpus-based Study of Explicitation’, in A. Beeby, P. Rodríguez-Inés and P. Sánchez-Gijón (eds), Corpus Use and Translating: Corpus Use for Learning to Translate and Learning Corpus Use to Translate, 47–58, Amsterdam: John Benjamins.
Frawley, W. (1984), ‘Prolegomena to a Theory of Translation’, in W. Frawley (ed.), Translation: Literary, Linguistic, and Philosophical Perspectives, 159–75, London and Toronto: Associated University Presses.
Freddi, M. (2013), ‘Constructing a Corpus of Translated Films: A Corpus View of Dubbing’, Perspectives, 21 (4): 491–503.
Gallego-Hernández, D. (2015), ‘The Use of Corpora as Translation Resources: A Study Based on a Survey of Spanish Professional Translators’, Perspectives, 23 (3): 375–91.
Gaspari, F. (2013), ‘A Phraseological Comparison of International News Agency Reports Published Online: Lexical Bundles in the English-Language Output of ANSA, Adnkronos, Reuters and UPI’, in M. Huber and J. Mukherjee (eds), Corpus Linguistics and Variation in English: Focus on Non-Native Englishes. Proceedings of ICAME 31. VARIENG: Studies in Variation, Contacts and Change in English. Vol. 13. Https://varieng.helsinki.fi/series/volumes/13/gaspari/.
Gaspari, F. (2015), ‘Exploring Expo Milano 2015: A Cross-linguistic Comparison of Food-Related Phraseology in Translation Using a Comparallel Corpus Approach’, The Translator, 21 (3): 327–49.
Gaspari, F. and S. Bernardini (2009), ‘Revisiting the Notion of Translation Universals Through L2 Written Production: Theoretical and Methodological Issues’, in S. Cavagnoli, E. Di Giovanni and R. Merlini (eds), La Ricerca Nella Comunicazione Interlinguistica: Modelli Teorici e Metodologici, 202–16, Milano: Franco Angeli.
Gaspari, F. and S. Bernardini (2010), ‘Comparing Non-Native and Translated Language: Monolingual Comparable Corpora with a Twist’, in R. Xiao (ed.), Using Corpora in Contrastive and Translation Studies, 215–34, Newcastle: Cambridge Scholars Publishing.
Granger, S. (2018), ‘Tracking the Third Code: A Cross-linguistic Corpus-driven Approach to Metadiscursive Markers’, in A. Čermáková and M. Mahlberg (eds), The Corpus Linguistics Discourse: In Honour of Wolfgang Teubert, 185–204, Amsterdam: John Benjamins.
Hansen-Schirra, S. (2011), ‘Between Normalization and Shining-through: Specific Properties of English-German Translations and Their Influence on the Target Language’, in S. Kranich, V. Becher, S. Höder and J. House (eds), Multilingual Discourse Production: Diachronic and Synchronic Perspectives, 133–62, Amsterdam: John Benjamins.
Heiss, C. and M. Soffritti (2008), ‘Forlixt 1 – The Forlì Corpus of Screen Translation: Exploring Microstructures’, in D. Chiaro, C. Heiss and C. Bucaria (eds), Between Text and Image: Updating Research in Screen Translation, 51–62, Amsterdam: John Benjamins.
Heyn, M. (1998), ‘Translation Memories: Insights and Prospects’, in L. Bowker, M. Cronin, D. Kenny and J. Pearson (eds), Unity in Diversity? Current Trends in Translation Studies, 123–36, Manchester: St. Jerome Publishing.

House, J. (1981), A Model for Translation Quality Assessment, 2nd edn, Tübingen: Narr.
House, J. (2008), ‘Beyond Intervention: Universals in Translation?’, Trans-Kom, 1 (1): 6–19.
Jiménez-Crespo, M. A. (2013), ‘Crowdsourcing, Corpus Use, and the Search for Translation Naturalness: A Comparable Corpus Study of Facebook and Non-Translated Social Networking Sites’, Translation and Interpreting Studies, 8 (1): 23–49.
Jiménez-Crespo, M. A. (2017), Crowdsourcing and Online Collaborative Translations: Expanding the Limits of Translation Studies, Amsterdam: John Benjamins.
Jimenez Hurtado, C. and S. Soler Gallego (2013), ‘Multimodality, Translation and Accessibility: A Corpus-based Study of Audio Description’, Perspectives, 21 (4): 577–94.
Klaudy, K. (1996), ‘Back-translation as a Tool for Detecting Explicitation Strategies in Translation’, in K. Klaudy, J. Lambert and A. Sohár (eds), Translation Studies in Hungary, 99–114, Budapest: Scholastica.
Koponen, M., Mossop, B., Robert, I. S. and G. Scocchera, eds (2021), Translation Revision and Post-Editing: Industry Practices and Cognitive Processes, London: Routledge.
Koponen, M., L. Salmi and M. Nikulin (2019), ‘A Product and Process Analysis of Post-editor Corrections on Neural, Statistical and Rule-Based Machine Translation Output’, Machine Translation, 33: 61–90.
Kruger, A., Wallmach, K. and J. Munday, eds (2011), Corpus-based Translation Studies: Research and Applications, London: Continuum.
Kunilovskaya, M. and E. Lapshinova-Koltunski (2019), ‘Translationese Features as Indicators of Quality in English-Russian Human Translation’, in I. Temnikova, C. Orasan, G. Corpas Pastor and R. Mitkov (eds), Proceedings of the 2nd HiT-IT 2019. Varna, Bulgaria, 5–6 September 2019, 47–56, Shoumen: Incoma.
Künzli, A. (2007a), ‘Translation Revision – A Study of the Performance of Ten Professional Translators Revising a Legal Text’, in Y. Gambier, M. Shlesinger and R. Stolze (eds), Doubts and Directions in Translation Studies. Selected Contributions from the EST Congress, Lisbon 2004, 115–26, Amsterdam: John Benjamins.
Künzli, A. (2007b), ‘The Ethical Dimension of Translation Revision. An Empirical Study’, JoSTrans: The Journal of Specialised Translation, 8: 42–56. https://jostrans.org/issue08/art_kunzli.pdf.
Lapshinova-Koltunski, E. (2015), ‘Variation in Translation: Evidence from Corpora’, in C. Fantinuoli and F. Zanettin (eds), New Directions in Corpus-based Translation Studies, 93–114, Berlin: Language Science Press.
Laviosa, S. (2002), Corpus-based Translation Studies: Theory, Findings, Applications, Amsterdam: Rodopi.
Laviosa, S. (2004), ‘Corpus-based Translation Studies: Where Does it Come from? Where is it Going?’, Language Matters, 35 (1): 6–27.
Leech, G. (1992), ‘Corpora and Theories of Linguistic Performance’, in J. Svartvik (ed.), Directions in Corpus Linguistics: Proceedings from a 1991 Nobel Symposium on Corpus Linguistics, 105–22, Berlin: Mouton de Gruyter.

Loock, R. (2013), ‘Close Encounters of the Third Code: Quantitative vs. Qualitative Analyses in Corpus-based Translation Studies’, Belgian Journal of Linguistics, 27 (1): 61–86.
Louw, W. E. (1993), ‘Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic Prosodies’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology: In Honour of John Sinclair, 152–76, Amsterdam: John Benjamins.
Mahlberg, M. (2005), English General Nouns: A Corpus Theoretical Approach, Amsterdam: John Benjamins.
Mangiron, C. (2018), ‘Game on! Burning Issues in Game Localisation’, Journal of Audiovisual Translation, 1 (1): 122–38.
Martin, T. (2007), ‘Managing Risks and Resources: A Down-to-earth View of Revision’, JoSTrans: The Journal of Specialised Translation, 8: 57–63. www.jostrans.org/issue08/art_martin.php.
Mauranen, A. (1999), ‘Will “translationese” Ruin a Contrastive Study?’, Languages in Contrast, 2 (2): 161–85.
McDonough Dolmaya, J. (2011), ‘The Ethics of Crowdsourcing’, O’Hagan, M. (ed.) Linguistica Antverpiensia: Special Issue on Translation as a Social Activity, 10: 97–110.
McDonough Dolmaya, J. (2012), ‘Analyzing the Crowdsourcing Model and Its Impact on Public Perceptions of Translation’, The Translator, 18 (2): 167–91.
McEnery, T. and A. Hardie (2013), ‘The History of Corpus Linguistics’, in K. Allan (ed.), The Oxford Handbook of the History of Linguistics, 727–46, Oxford: Oxford University Press.
Molés-Cases, T. (2019), ‘Why Typology Matters: A Corpus-based Study of Explicitation and Implicitation of Manner-of-motion in Narrative Texts’, Perspectives, 27 (6): 890–907.
Monti, C., Bendazzoli, C., Sandrelli, A. and M. Russo (2005), ‘Studying Directionality in Simultaneous Interpreting Through an Electronic Corpus: EPIC (European Parliament Interpreting Corpus)’, Meta, 50 (4). www.erudit.org/fr/revues/meta/2005-v50-n4-meta1024/019850ar/.
Mossop, B. (2007), Revising and Editing for Translators, 2nd edn, Manchester: St. Jerome Publishing.
Murphy, A. C. (2012), Editing Specialized Texts in English: A Corpus-assisted Analysis, 2nd edn, Milano: LED.
Napier, J., ed. (2009), International Perspectives on Sign Language Interpreter Education, Washington, DC: Gallaudet University Press.
Neumann, S. (2013), Contrastive Register Variation. A Quantitative Approach to the Comparison of English and German, Berlin: Mouton de Gruyter.
Nitzke, J. (2019), Problem Solving Activities in Post-editing and Translation from Scratch: A Multi-method Study, Berlin: Language Science Press.
O’Hagan, M. (1996), The Coming Industry of Teletranslation: Overcoming Communication Barriers Through Telecommunication, Clevedon: Multilingual Matters.

O’Hagan, M. (2016), ‘Translation Studies 2.0 – How to Study Illegal and Unethical Translation in Dynamic Digital Environments’, in Keynote at the 3rd International Conference on Non-professional Interpreting and Translation – NPIT3, Zurich University of Applied Sciences, 5–7 May.
O’Hagan, M. and H. Chandler (2016), ‘Game Localization Research and Translation Studies: Loss and Gain Under an Interdisciplinary Lens’, in Y. Gambier and L. van Doorslaer (eds), Border Crossings: Translation Studies and Other Disciplines, 309–30, Amsterdam: John Benjamins.
Olohan, M. (2001), ‘Spelling Out the Optionals in Translation: A Corpus Study’, UCREL Technical Papers, 13: 423–32.
Olohan, M. (2014), ‘Why Do You Translate? Motivation to Volunteer and TED Translation’, Translation Studies, 7 (1): 17–33.
Partington, A. (2008), ‘The Armchair and the Machine: Corpus-assisted Discourse Research’, in C. T. Torsello, K. Ackerley and E. Castello (eds), Corpora for University Language Teachers, 95–118, Bern: Peter Lang.
Pavesi, M. (2019), ‘Corpus-based Audiovisual Translation Studies: Ample Room for Development’, in L. Pérez-González (ed.), The Routledge Handbook of Audiovisual Translation Studies, 315–33, London: Routledge.
Penha-Marion, L. A., G. Gilquin and M. A. Lefer (2020), ‘Annotating Translation Properties for the Study of Directionality and Expertise’, in S. Granger and M. A. Lefer (eds), Translating and Comparing Languages: Corpus-based Insights. Selected Proceedings of the Fifth Using Corpora in Contrastive and Translation Studies Conference, 61–79, Louvain-la-Neuve: Presses Universitaires de Louvain.
Pokorn, N. (2005), Challenging the Traditional Axioms: Translation into a Non-mother Tongue, Amsterdam: John Benjamins.
Pokorn, N., Blake, J., Reindl, D. and A. Pisanski Peterlin (2020), ‘The Influence of Directionality on the Quality of Translation Output in Educational Settings’, The Interpreter and Translator Trainer, 14 (1): 58–78.
Puurtinen, T. (2003), ‘Genre-specific Features of Translationese? Linguistic Differences Between Translated and Non-translated Finnish Children’s Literature’, Literary and Linguistic Computing, 18 (4): 389–406.
Rabinovich, E. and S. Wintner (2015), ‘Unsupervised Identification of Translationese’, Transactions of the Association for Computational Linguistics, 3: 419–32.
Robert, I. (2008), ‘Translation Revision Procedures: An Explorative Study’, in P. Boulogne (ed.), Translation and Its Others. Selected Papers of the CETRA Research Seminar in Translation Studies 2007. 1–25. www.arts.kuleuven.be/cetra/papers/files/robert.pdf.
Russo, M., Bendazzoli, C. and B. Defrancq, eds (2018), Making Way in Corpus-based Interpreting Studies, Berlin: Springer.
Scocchera, G. (2017), La Revisione della Traduzione Editoriale dall’inglese all’italiano: Ricerca, Professione, Formazione, Roma: Aracne.
Sinclair, J. (1991), Corpus Concordance Collocation, Oxford: Oxford University Press.

Stewart, D. (2008), ‘Vocational Translation Training into a Foreign Language’, inTRAlinea, 10. www.intralinea.org/archive/article/1646.
Stewart, D. (2010), Semantic Prosody: A Critical Evaluation, London: Routledge.
Sutherlin, G. (2013), ‘A Voice in the Crowd: Broader Implications for Crowdsourcing Translation During Crisis’, Journal of Information Science, 39 (3): 397–409.
Teich, E. (2003), Cross-Linguistic Variation in System and Text. A Methodology for the Investigation of Translations and Comparable Texts, Berlin: Mouton de Gruyter.
Tirkkonen-Condit, S. (2002), ‘Translationese – a Myth or an Empirical Fact? A Study into the Linguistic Identifiability of Translated Language’, Target, 14 (2): 207–20.
Tognini-Bonelli, E. (2001), Corpus Linguistics at Work, Amsterdam: John Benjamins.
Toral, A. and A. Way (2014), ‘Is Machine Translation Ready for Literature?’, in Proceedings of Translating and the Computer 36. London, 27–28 November 2014, 174–6, London: Asling International Society for Advancement in Language Technology.
Toury, G. (1995), Descriptive Translation Studies and Beyond, Amsterdam: John Benjamins.
Turner, G. H. and B. Best (2017), ‘From Defensive Interpreting to Effective Professional Practices’, in M. Biagini, M. S. Boyd and C. Monacelli (eds), The Changing Role of the Interpreter: Contextualising Norms, Ethics and Quality Standards, 102–20, London: Routledge.
Tymoczko, M. (1998), ‘Computerized Corpora and the Future of Translation Studies’, Meta, 43 (4): 653–9.
Ulrych, M. (2009), ‘Translating and Editing as Mediated Discourse: Focus on the Recipient’, in R. Dimitriu and M. Shlesinger (eds), Translators and Their Readers. In Homage to Eugene A. Nida, 219–34, Brussels: Editions du Hasard.
Ulrych, M. and A. Murphy (2008), ‘Descriptive Translation Studies and the Use of Corpora: Investigating Mediation Universals’, in C. T. Torsello, K. Ackerley and E. Castello (eds), Corpora for University Language Teachers, 141–66, Bern: Peter Lang.
Valentini, C. and S. Linardi (2009), ‘Forlixt 1: A Multimedia Database for AVT Research’, inTRAlinea. www.intralinea.org/specials/article/1715.
Vandevoorde, L., Daems, J. and B. Defrancq (eds) (2020), New Empirical Perspectives on Translation and Interpreting, Abingdon: Routledge.
Xiao, R., ed. (2010), Using Corpora in Contrastive and Translation Studies, Newcastle Upon Tyne: Cambridge Scholars Publishing.
Zanettin, F. (2012), Translation-Driven Corpora. Corpus Resources for Descriptive and Applied Translation Studies, Manchester: St. Jerome Publishing.
Zanettin, F., Bernardini, S. and D. Stewart, eds (2003), Corpora in Translator Education, London: Routledge.

Part II

Recent methodological and theoretical developments in CBTS

3

Translation as constrained communication: Principles, concepts and methods

Haidee Kotze

1  Introduction: The concept of ‘constraint’ and its relevance for translation

Language production is always constrained in various ways. The architecture of human physiology and cognition imposes fundamental, universal constraints on the range of linguistic possibility, while the cooperative and intersubjective dynamics of human interaction (in specific communicative settings and sociocultural contexts) also mould linguistic usage. Beyond these internal (cognitive-physiological) and external (social-interactive) functional constraints, tacit or explicit socially sanctioned standards or conventions for linguistic usage form an additional layer of constraint. These physical, cognitive, communicative and social constraints interact in complex ways to produce and condition linguistic variation.

The fundamental notion that constraints of various kinds influence language arises across diverse areas of linguistic enquiry and theoretical traditions. For example, at the formal, abstract and system-oriented end of the theoretical continuum, optimality theory (a generative linguistic theory focused primarily on phonology) proposes that ‘surface forms of language reflect resolutions of conflicts between competing demands or constraints’ (Kager 1999: xi; see also Prince and Smolensky 2004). At the functional and usage-based end of the theoretical continuum, variationist linguistics (in its various forms, including sociolinguistically oriented and cognitively oriented approaches) focuses on how cognitive factors, language-internal grammatical factors and language-external social factors shape and constrain linguistic decision-making (see, e.g. Geeraerts 2018; Pütz et al. 2014; Tagliamonte 2012).

Recognition of the fundamental constrainedness of human language and communication, and attempts to model the effects of diverse constraints, thus cut across even radically different theoretical paradigms and approaches in linguistics. In translation studies, too, the notion of constraint arises in divergent areas and approaches. The Translation Studies Bibliography (https://benjamins.com/online/tsb/) lists more than 1,700 contributions with the search word ‘constrain*’ in the title, abstract or keywords. Constrainedness may well be one of the ‘memes’ of translation (see Chesterman 1997). In the influential work of Toury (1995, 2012) the notion of constraint encompasses cognitive and social factors, and is closely aligned with the concept of norms, where norms are seen as ‘constraints on freedom of action’ (Chesterman 1997: 78). Toury (1995: 54) explains the relationship between constraints and norms as follows:

In its socio-cultural dimension, translation can be described as subject to constraints of various types and varying degree. These extend far beyond the source text, the systemic differences between the languages and textual traditions involved in the act, or even the possibilities and limitations of the cognitive apparatus of the translator as a necessary mediator. In fact, cognition itself is influenced, probably even modified by socio-cultural factors . . . . In terms of their potency, socio-cultural constraints have been described along a scale anchored between two extremes: general, relatively absolute rules on the one hand, and pure idiosyncrasies on the other. Between these two poles lies a vast middle-ground occupied by intersubjective factors commonly designated norms.

The cognitive and social constraints that operate on translation are proposed to result in a distinct language variety, a ‘third code’ (Frawley 2000 [1984]): the constraints that affect the translation process in probabilistic ways leave a perceptible linguistic ‘fingerprint’ on the translated text. Studying the third code of translation as a variety in its own right, usually in comparison with a ‘reference’ variety of non-translated texts in the same language, has been the focus of much corpus-based translation research (see Kotze 2019 for an overview, and further discussion in Section 2). In tandem with this focus on modelling the unique features of translation as a target language variety in its own right, a related strand of research has raised the possibility that these features are not unique to translations only but are evident in a larger set of varieties characterized by diverse communicative constraints. These constraints include, among others, discourse production under conditions of bi- or multilingual language activation, or in language contact settings; and
the relaying or ‘mediation’ of an existing message. In this view, the features that typify translations are reframed more broadly as features or ‘universals’ of language mediation, language contact, bi- or multilingual discourse production, or constrained communication. While this idea recurs throughout the history of translation studies – it is present, for example, in House and Blum-Kulka (1986), drawing together second-language acquisition and translation, and explored as a conceptual proposition by Chesterman (2004) – it has only recently started to receive more sustained theoretical and empirical attention. Kolehmainen et al. (2014) compare the conceptualization of and evidence for ‘interlingual reduction’ (i.e. ‘the reduction or the lower frequency of target language linguistic items or patterns not shared by both of the languages involved in the language contact situation’ (Kolehmainen et al. 2014: 4)) in translation studies, contact linguistics and second-language acquisition research. Granger (2015: 20) proposes the term ‘crosslingual varieties’ to capture the proposed similarities between language varieties that ‘have specific characteristics due to the interplay of two or more languages’, including, for English, learner Englishes, translated English, second-language varieties of English and English as a lingua franca (ELF). Importantly for the argument in this chapter, this notion has been articulated in an explicitly constraint-based framework by Lanstyák and Heltai (2012), which has been further developed and given empirical impetus by, among others, Kruger and van Rooy (2016a). Kruger and van Rooy (2016a) argue that contact-influenced varieties (including translation), apart from their obvious differences, demonstrate some similar features, which may be linked to both the cognitive and social constraints of communication in settings of language contact. 
Various empirical studies have taken up the methodological challenge of modelling the similarities and differences between translation and other constrained varieties, using increasingly sophisticated comparative methods to identify the cognitive and social constraints that are shared among constrained varieties, and those that are unique to a variety (see also Ivaska et al. this volume). Empirical work in this area is highly interdisciplinary, comparing, for example, editing and translation (Bisiada 2017; Kruger 2012), translation and second-language, non-native or learner writing (De Sutter and Lefer 2020; Gaspari and Bernardini 2010; Granger 2018; Ivaska and Bernardini 2020; Ivaska et al. this volume; Kruger and De Sutter 2018; Kruger and van Rooy 2016a, 2016b), translation and interpreting (Ferraresi et al. 2018; Shlesinger and Ordan 2012), and interpreting and non-native speech (Kajzer-Wietrzny 2018, 2021).

Against this background, this chapter has three main aims. The first is to provide an overview of the rationale for the constrained-language framework, and for viewing translation within this framework (Section 2). The second is to further develop a more comprehensive theorization of the constraint construct. In Section 3 I set out a programmatic account of key theoretical assumptions for modelling the recurrent features of different forms of constrained communication conceptually and empirically as sets of overlapping ‘varioversals’, reframing and extending the (typological) concept of varioversals put forward by Szmrecsanyi and Kortmann (2009). The key theoretical proposal is that different varieties of a language are conditioned by language-internal and language-external, and cognitive and social constraints that combine in shaping usage in a probabilistic way. Drawing on work from usage-based linguistics, variationist linguistics, probabilistic grammar and comparative linguistics, in particular as set out in Grafmiller et al. (2018), I propose three core theoretical principles for investigating constrained varieties (Section 3.1), and set out five overarching and interacting constraint dimensions, enabling the modelling of similarities and differences between varieties (Section 3.2). The last aim of this chapter is to outline and illustrate the variationist, multifactorial and interdisciplinary corpus-linguistic method needed for investigating the effects of various constraints on different forms of language production, and to model the interaction of constraints. I reflect on how these methods, combined with the theoretical assumptions of a coherent explanatory framework of constrained communication, may assist in unscrambling some of the motivations behind the features that typify constrained varieties. 
I do this by presenting, in Section 4, a case study on that-omission in three varieties of English (translated English, a high-contact L1 variety of English, and a non-contact L1 variety of English). The case study reanalyses the same dataset used in Kruger and De Sutter (2018) and Kruger (2019), using a state-of-the-art multifactorial method, random forests analysis combined with conditional inference tree modelling. The aim is to demonstrate how these methods can help us understand how the linguistic choices of translators are constrained in similar and different ways compared to those of other language users.

2  Not just translations: From translation ‘universals’ to shared cross-varietal constraints

Corpus-linguistic as well as computational research has yielded substantial evidence that translated texts demonstrate linguistic patterns that systematically
distinguish them from non-translated texts in the same language, with the most typical trends described as (a) increased explicitness of lexico-grammatical encoding, (b) a preference for comparably more conventional, conservative or standard usage, and (c) cross-linguistic influence, priming or transfer (see Kotze 2019 for more detailed discussion and references). While simplification is often included in this list of trends, empirical evidence in support of simplification is less consistent. These differences between translated and non-translated language are primarily distributional, or quantitative, in nature, rather than qualitative. In other words, what makes translated language different in one way or another from non-translated language are subtle, but nevertheless systematic, differences in linguistic patterning. These include the over- and underuse of linguistic features in relation to some kind of reference variety; differences in conditioning factors for the use of linguistic features; or differences in combinatorial patterns across larger sets of features. As already indicated, these differences are used in support of the argument that translation is a variety in its own right, an assumption that has driven the interest in investigating the so-called ‘universals’, ‘recurrent features’, ‘typicalities’ or ‘probabilistic tendencies’ of translated language. De Sutter and Lefer (2020) argue that this strand of research has consumed a disproportionate amount of attention in corpus-based translation studies, which they ascribe to a selective focus on the possibilities of corpus linguistics for translation studies outlined in the early papers on the topic by Baker (e.g., 1993, 1995) (see also Chapter 1 by Granger and Lefer in this volume). 
In brief, the imagination of researchers was so captured by the compelling notion of ‘translation universals’ that many of the other avenues of investigation proposed by Baker faded into the background: ‘As a consequence, what was truly essential in Baker (1993) – the exploration of corpus-linguistic methods of scrutinizing translational products in order to find the “principles that govern translational behaviour and the constraints under which it operates” (p. 235) – has not yet been explored to the fullest extent’ (De Sutter and Lefer 2020: 2; my emphasis). This view reflects one important ‘blinker’ of studies of the features of translated language: the lack of consideration of multiple constraints that play a role in conditioning translational choices. While theorizations of the nature of translated language have always emphasized the complex conditioning of translational choices (e.g., Toury 2004), for a long time many corpus-focused researchers paid theoretical lip service to this notion of the complex probabilistic conditioning of the features of translated language, and chose to focus on one contrast only: that between translations and non-translations. This, however,
is fast changing, and the probabilistic conditioning of the recurrent features of translated language across dimensions such as register, source language, translator background, translation method and software, directionality and language status differences is now well documented (see Kotze 2019 for more detailed discussion and examples). Another ‘blinker’ of much corpus-based translation research is its inward-looking focus. Adopting the notion of ‘universals’ offered the promise of connecting with other areas of corpus-based research: translation studies is not unique in seeking to generalize the nature of the variety it investigates (and proposing and testing explanatory hypotheses to account for these generalizations). Corpus linguistics as methodology has been very important to linguistic research across a range of fields as it expanded in the last half century, including the study of language contact and bilingualism in fields such as learner Englishes, varieties of English (or World Englishes) and ELF. These are distinct fields, but, like corpus-based translation studies, they are interested in varieties of language that are, in some way or another, shaped by the bilingualism of their producers. But this promise, by and large, remained in the background of (early) corpus-based translation studies. The focus on ‘universals’ of translation turned the focus inward, to translation only, and comparisons of translated language with other kinds of bilingualism-influenced varieties remained a comparatively marginal area of investigation. However, in recent years, there has been an increasing awareness of similarities in research questions and methodologies. In most of the areas highlighted above a key question has been whether there are distinct patterns of language use that characterize the variety in question.
The notion of some kind of ‘universal’ – be it angloversals, vernacular universals, varioversals, translation universals or universals of second-language learning – crops up across these research areas (see, e.g. Mauranen and Kujamäki 2004; Szmrecsanyi and Kortmann 2009; more detailed discussion follows in Section 3.1). For all these contact varieties, a key question is the interplay between the influence of the contact language (variously called transfer, interference or cross-linguistic influence) and other factors of the bilingualism-influenced communication situation that lead to patterns of linguistic use that cannot be directly ascribed to the effects of cross-linguistic influence. Similar features identified across some of these varieties include increased analyticity or explicitness of lexico-grammatical encoding, increased linguistic conservatism, and simplification (see Kolehmainen et al. 2015; Kruger and van Rooy 2016a for overviews).
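
The over- and underuse of a linguistic feature relative to a reference variety, described earlier in this section, is usually established by comparing normalized frequencies across corpora. The following minimal Python sketch illustrates one common way of doing so, a log-likelihood (G²) comparison of a single feature in two corpora. It is an illustration only and not a method used in this chapter: the function names, the corpus sizes and all counts are hypothetical.

```python
import math


def per_million(count, corpus_size):
    """Normalize a raw frequency to occurrences per million tokens."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 statistic for one feature observed in two corpora."""
    total = count_a + count_b
    # Expected counts if the feature were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def direction(count_a, size_a, count_b, size_b):
    """Label corpus A as over- or underusing the feature relative to corpus B."""
    rate_a = per_million(count_a, size_a)
    rate_b = per_million(count_b, size_b)
    if rate_a > rate_b:
        return "overuse"
    if rate_a < rate_b:
        return "underuse"
    return "no difference"


# Hypothetical counts: a feature in a translated and a reference corpus.
translated = (1200, 1_000_000)  # (occurrences, corpus size in tokens)
reference = (900, 1_000_000)

g2 = log_likelihood(*translated, *reference)
print(direction(*translated, *reference), round(g2, 2))
```

A G² value above 3.84 is conventionally read as significant at p < 0.05 (one degree of freedom). An aggregate contrast of this kind shows only that a feature is over- or underused; it says nothing about the factors that condition its use, which is why the multifactorial methods discussed in this chapter are needed.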

This is some of the rationale for the increasing interest in studying different kinds of varieties influenced by language contact or bilingualism together, in order to generalize some of the factors that shape these kinds of language use, under the rubric of constrained language or communication (Kruger and van Rooy 2016a; Lanstyák and Heltai 2012). However, as highlighted in Section 1, to date there has been limited explicit theorization of this proposal. The next section proposes the first steps in this respect.

3  Theorizing constrained communication

3.1  Varieties, constraints and probabilistic conditioning: Three position statements for modelling constrained varieties

In this chapter, I extend and adapt the notion of ‘varioversals’ proposed by Szmrecsanyi and Kortmann (2009) to the more general framework of constrained communication. Szmrecsanyi and Kortmann (2009: 33) depart from the principle that ‘there are different reasons that languages, or varieties of a given language, should exhibit the same linguistic features’. Subsequently, they propose a list of ‘-versals’ (framed from a typological perspective), in a descending order of generality:

1. Genuine universals (e.g. all languages have vowels)
2. Typoversals (features that are common to languages of a specific typological type)
3. Phyloversals (features shared by a family of genetically related languages)
4. Areoversals (features shared by languages in geographical proximity)
5. Vernacular universals (features common to spoken vernaculars)
6. Angloversals (or Francoversals, etc.) (features common in vernacular varieties of a specific language)
7. Varioversals (features recurrent in language varieties with a similar sociohistory, historical depth and mode of acquisition, e.g. second-language varieties of English)

In the context of the model of constrained communication proposed here, the closest ‘level’ of generalization is the lowest level of ‘varioversals’, although, clearly, the typological framework in which Szmrecsanyi and Kortmann (2009) use the concept is different from the way in which I use it here. In this chapter, I use ‘varioversals’ in a (socio)cognitive-functionalist rather than typological
paradigm, to refer to recurrent linguistic patterning in varieties that are probabilistically constrained in similar (but also different) ways by competing constraints that derive from, for example, the production context, bilingual activation and proficiency (or acquisition) profiles. The framework I propose thus centres on the idea that different varieties of language are conditioned by language-internal and language-external, and cognitive and social constraints that combine in shaping a probabilistic grammar. A recent special issue of Glossa (2018) explores some of these issues (focusing on different languages and varieties of English). The introductory article of this special issue (Grafmiller et al. 2018) outlines many of the core ideas of comparative, variationist, probabilistic grammar, which I adapt here to the model of constrained language. The approach draws on usage-based linguistics, variationist linguistics, comparative linguistics and sociolinguistics. The usage-based view (see Bybee 2010) holds that linguistic structure emerges from experience with language: grammar is a system of generalizations built on general cognitive processes and experience with language in interactive settings. Variationist linguistics highlights that variation between different ways of saying the same thing is ‘sensitive to multiple and sometimes competing constraints which influence linguistic choice-making in subtle, probabilistic ways’ (Grafmiller et al. 2018: 1). Accounts of probabilistic grammar argue that these probabilistic patterns in experience form the foundation of grammar (Grafmiller et al. 2018: 2; see also Bresnan and Ford 2010). The comparative dimension introduces the question of how these sets of constraints that affect choices are the same, or different, across different varieties of the same language, and thus whether different varieties have different probabilistic grammars. 
A first implication, thus, is that as groups of individual users’ exposure to particular linguistic settings and usages varies, these probabilistic constraints vary, their grammars vary and the aggregate of individual grammars contributes to a variety-specific probabilistic grammar (as individual-level behaviour leads to population-level language patterns; Grafmiller et al. 2018: 3). Based on this, in the context of constrained language, the first principle I propose is:

(1) Different varieties of constrained-language production demonstrate differences in the probabilistic conditioning of linguistic choices or in the combinations of choices, as a consequence of different constraints in operation.

A second important point is that these constraints are both language-internal and language-external, and both cognitive and social. In translation studies (for example) there have long existed competing cognitive and social explanations for the nature of translated language (see Halverson 2003; Malmkjær 2005). Some have argued that the features of translated language are the consequence of the (cognitively demanding) bilingual language processing that intrinsically characterizes translation, which opens the door not only to cross-linguistic influence, transfer or interference effects but also to other features (like increased explicitness of lexico-grammatical encoding) that are linked to processing strain (see Kruger and van Rooy 2016a). In contrast, others have proposed that the features arise from norms that play a role in translation, particularly in relation to risk-avoidant behaviour in environments that are communicatively complex and uncertain (see Pym 2015). The probabilistic model can account for both these kinds of factors. Grafmiller et al. (2018: 3) highlight that there are inherent, universal biases in language structure, but that there is ‘gradient, experience-driven variability within the context of universal constraints on the range of possible variation’. At the same time, ‘social meaning and socially conditioned variation . . . is entirely compatible with – even predicted by – probabilistic grammar models. Community-specific social forces, e.g. language attitudes or stylistic preferences, undoubtedly shape biases in individual speakers’ production and comprehension . . . . The resulting patterns are in turn reflected in specific forms’ distributions across different social groups/contexts’ (Grafmiller et al. 2018: 3).
This leads to the second position point for a model of constrained language:

(2) Variation in the strength of conditioning factors (variables) across differently constrained varieties indexes the effects of different constraints, and may be used to infer the relative influence and interaction of cognitive and social (or sociocognitive) constraints.

It is important, however, to consider that some of these factors are more invariant than others. For example, as suggested in Section 1, there is a universality to the general cognitive architecture that underpins language, and one might assume a great degree of similarity across probabilistic grammars. This, indeed, is the case: the influence of some (cognitive) factors demonstrates a great deal of stability across varieties, in terms of the direction of their influence. For example, Szmrecsanyi et al. (2016), in their study of three alternations (genitive, dative and particle placement) in four varieties of English, show that ‘wherever we look in our data, longer constituents follow shorter constituents’ (Szmrecsanyi et al. 2016: 132). In principle, these end-weight effects influence linguistic choices in the same
way across different varieties of English, because the same cognitive processing constraints apply to all humans. Variation in the strength of these constraints does occur, but can do so only where processing capacity allows space for such variation. In other words, the differences between how constraints affect different varieties are not qualitative (different constraints, or different directions of influence) but quantitative (in the strength of the influence). This is the source of variability:

Subtle variation in the types and frequencies of constructions will lead to gradient, yet detectable differences in the strength of different factors’ influence on speakers’ syntactic choices . . . The variation in the use of specific constructions may be driven by stylistic preferences among registers or speakers, by situational forces such as language/dialect contact, by cognitive pressures related to language processing. (Grafmiller et al. 2018: 3)

This leads to the third position point:

(3) The differences among constrained varieties are subtle/covert rather than overt; quantitative/distributional rather than qualitative. The interplay of different constraints results in different effect sizes, and different interactions between predictors, rather than different predictors altogether, or different directions of effects.

I thus propose that comprehensive, rigorous and theoretically justifiable comparisons among language varieties with shared and distinct constraint dimensions may allow us to identify to what constraints particular features of translated language (and other contact or constrained varieties) may be ascribed, thus allowing us to identify ‘varioversals’ understood in the sense set out above – features that characterize particular (subsets of) varieties, and that may be seen as the result of shared constraints. This should be done, of course, without losing sight of the differences between varieties.
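The third position statement can be made concrete with a regression-style sketch. The logistic form below is an illustrative assumption only; it is not the model fitted in this chapter, which relies on random forests and conditional inference trees. The symbols x_j (conditioning factors), v (variety), beta_j, gamma_v and delta_jv are introduced here purely for illustration.

```latex
% Illustrative sketch only: one possible formalisation of position statement (3).
% v indexes varieties; j indexes conditioning factors x_j.
\[
\operatorname{logit} P(\textit{that} \mid \mathbf{x}, v)
  = \beta_0 + \gamma_v + \sum_{j} \left( \beta_j + \delta_{jv} \right) x_j
\]
% Read against this form, position statement (3) amounts to two expectations:
%   (i)  the same factors x_j are relevant in every variety v;
%   (ii) variety-specific adjustments change effect sizes, not directions:
\[
\operatorname{sign}\left( \beta_j + \delta_{jv} \right) = \operatorname{sign}\left( \beta_j \right)
\quad \text{for all } j \text{ and } v,
\qquad \text{while } \left| \beta_j + \delta_{jv} \right| \text{ may differ across } v .
\]
```

Interactions between predictors could be added to the same sketch as product terms, with the same expectation that their size, rather than their presence or direction, varies by variety.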

3.2  A constraint model

Kotze (2019) proposes five macro-level dimensions of constraint to be considered (conceived of as continua rather than as binaries), while acknowledging that this set of constraints is not exhaustive. The first constraint dimension is Language Activation (monolingual – bilingual). Bilingual language activation has two obvious consequences. First, it opens the door to cross-linguistic influence, priming or transfer, at both lexical and structural levels (see e.g. Loebell and Bock 2003; Maier et al. 2017; Travis et al. 2017). Second, due to processes of selection, switching and
inhibitory control in a cognitive environment where languages are in competition, bilingual language processing is a more effortful cognitive environment inducing higher processing costs (Costa and Sebastián-Gallés 2014). The more limited cognitive resources that result have a range of potential consequences for language production (e.g. restricting lexical range and grammatical complexity, prompting increased syntactic explicitness, causing decreased sensitivity to factors like style or register). The typological relation of the two languages in question, as well as the directionality of the influence, also obviously plays a role in this dimension (e.g. in translation (as conventionally practised) the influence is from the L2 to the L1; in learner language or second-language varieties the influence is from the L1 to the L2). Furthermore, cross-linguistic influence is mediated in complex ways by sociocultural factors, like language status. A second constraint dimension is Modality and Register (spoken – written – multimodal). A range of constraint effects come into play in this dimension. For example, different modalities impose different time and processing constraints (spoken language modalities like interpreting have more time and processing constraints; multimodal modalities (like subtitling) are constrained by time factors and the presence of multiple semiotic channels). Different registers and genres are constrained by different physical restrictions (e.g. of space, or of time, in the form of deadlines) and stylistic expectations. The third constraint dimension is Text Production (independent – dependent). This dimension reflects whether the text production is independent or dependent on (i.e. derived from) a previous text. It may also be framed as the difference between unmediated and mediated text production. In the dependent or mediated condition, a prior text delimits and shapes the production.
Translation, interpreting and editing, for example, are dependent/mediated text production (in various ways constrained by a prior text), whereas original writing and speech are independent/unmediated. Fourth, there is the Proficiency constraint, which reflects where on the cline of proficiency a text producer falls, ranging from a (near)-native or highly proficient (advanced) user, to a learner with intermediate or low proficiency (native/highly proficient user – learner). The fifth constraint dimension proposed by Kotze (2019) is Task Expertise (expert – non-expert). This dimension is distinct from language proficiency, as such, and rather signals the degree of expertise in, or familiarity with, the specific language production task in question (e.g. writing an essay, translation, interpreting, editing). To illustrate: a person might have native-language proficiency, but no expertise in or experience of, for example, writing an academic essay; conversely, one might be a learner of a particular language, but with plenty of familiarity with

and expertise in how to write an academic essay. Task expertise may range from no expertise whatsoever, to novice, to intermediate, to full professional expertise. Kotze (2019) explains in more detail how varieties can be compared by creating a comparative constraint matrix, which shows which constraints are shared, and which are distinct, among a set of varieties. Where differences between varieties emerge, these may then be ascribed to the particular dimensions where varieties are differently constrained. In addition to these macro-level constraints, it should be kept in mind that there are a variety of other ‘micro-level’ constraints that may play a role. These may be closely associated with the individual, and the immediate context of text production (e.g. instructions or briefs given to the text producer), and while they may not be amenable to a broad conceptual model, they are nevertheless important, and should be accounted for by using statistical methods that factor in the individual as a source of variation, for example, mixed-effects modelling (see Section 5). Constraints cannot be investigated independently of one another, since they are interwoven and interact in complex ways. This theoretical model thus demands a methodological correlate able to account for the complex interactions between constraints, and the way in which constraints play out in different varieties. Multifactorial methods (also called multivariate or multidimensional methods) enable one to investigate the effects of a large number of independent or predictor variables on an outcome variable, in the context of all other variables – and also allow for disentangling interactions between independent variables. These methods are beginning to make inroads into translation studies (see De Sutter and Lefer 2020 for discussion), but their use is by no means widespread in the discipline. 
The following section presents a case study using the that/zero alternation, to illustrate how one multifactorial method not thus far applied in translation studies, random forests analysis, may be used to investigate constrained varieties. As already pointed out, this case study uses the same dataset as in Kruger and De Sutter (2018) and Kruger (2019), but reanalyses the data using a different method.

4  Multifactorial methods to investigate the interplay of constraints: A case study using the that/zero alternation

4.1  Existing research on the that/zero alternation in translation studies

The case study presented here focuses on similarities and differences between two different types of contact variety: English translations with Afrikaans as
source language, and written native South African English (a contact-influenced native variety of English that has existed in a contact situation with Afrikaans for more than 200 years, and which may thus be characterized as a high-contact L1 variety; see Kruger and van Rooy 2018). These contact varieties are compared with a non-translated, non-contact variety of written English (British English). The feature in question is the optional complementizer that in English, which has been used in several studies as an operationalization of the increased explicitness of translated and interpreted English, as well as non-native varieties of English (see De Sutter and Lefer 2020; Kajzer-Wietrzny 2018; Kruger 2019; Kruger and De Sutter 2018; Kruger and van Rooy 2016b, 2020; Olohan and Baker 2000; Wulff et al. 2014, 2018). The English alternation is illustrated in Example (1):1

(1) (a) . . . and it is said that motions will be made to rescind these appointments. (British English, newswriting)
    (b) . . . and it is said Ø motions will be made to rescind these appointments.

Afrikaans has a similar optionality (see Example (2)), although it differs from English in having different word orders in the complement clause with and without dat (the complement clause with dat has verb-final dependent word order; the complement clause without the complementizer (i.e. zero) has the verb-second word order of an independent main clause).

(2) (a) Hy het erken dat hy alkohol gedrink het. (Afrikaans, newswriting)
        He have.AUX admit.PST.PTCP COMP he alcohol PST.PTCP-drink have.AUX
        ‘He admitted that he had drunk alcohol.’
    (b) Hy het erken Ø hy het alkohol gedrink.
        He have.AUX admit.PST.PTCP Ø he have.AUX alcohol PST.PTCP-drink
        ‘He admitted Ø he had drunk alcohol.’

There is a large degree of overlap between English and Afrikaans in frequency and usage patterns of omission – but also some differences.
In general, Afrikaans has a higher prevalence of complementizer omission in written registers than English does, and there are also differences in preferences in particular registers (see Kruger and van Rooy 2016a; van Rooy and Kruger 2016 for more detailed discussion). In corpus-based translation studies, interest in the that alternation as an operationalization of increased explicitness can be traced back to Olohan and

Baker (2000), who investigated complement clauses following the verbs SAY and TELL, comparing the frequency of that and zero after these verbs in the Translational English Corpus (TEC) and the British National Corpus (BNC). They find that the complementizer is proportionally more frequent in the translational corpus, and (cautiously) interpret this as evidence for unconscious processes of explicitation in translation. Several researchers have pointed out the limitations of this study – for example, the fact that neither register nor the influence of source languages is considered (Becher 2010). This study thus reflects the typical shortcomings of earlier corpus-based work highlighted by De Sutter and Lefer (2020). In initial extensions of this work, researchers addressed some of these limitations by adding predictors (e.g. register or translation expertise) and extending the sets of verbs analysed (e.g. Kruger and van Rooy 2012; Redelinghuys and Kruger 2015). However, these studies still did not take account of a very important feature of the that/zero alternation well studied in variationist linguistics – namely that, beyond these external factors that may influence the choice, the choice is also probabilistically conditioned by the interplay between a complex set of language-internal (grammatical and discourse) factors (see, e.g. Tagliamonte and Smith 2005; Torres Cacoullos and Walker 2009; Wulff et al. 2014, 2018, and more detailed discussion in Kruger 2019). These factors include predictors linked to complexity, as well as conventionality (discussed in more detail in Section 4.2.2). The most recent work in this area (Kruger and De Sutter 2018; Kruger 2019; De Sutter and Lefer 2020) leverages this feature of the that/zero alternation. In doing so, these contributions aim to problematize the results gained from monofactorial (or bivariate) studies in earlier analyses, and argue the case for the importance of a multifactorial approach. 
They also aim to use the complex effects of complexity-related and conventionality-related conditioning variables on the choice between that/zero to test explanations for the increased explicitness of translated texts, and to compare different constrained varieties (e.g. translation, L2 writing, contact-influenced L1 writing) to one another, within the constrained-language framework. Different methods have been used in these studies; for example, Kruger (2019) makes use of generalized linear models (GLMs) and conditional inference tree modelling, whereas Kruger and De Sutter (2018) apply a new method in variationist research, namely the Multifactorial Prediction and Deviation Analysis (MuPDAR) method developed by Gries and Deshors (2014, 2020). The analysis that follows extends this work, by introducing an additional method, namely random forests analysis, combined with conditional inference tree modelling.
These new methods have several advantages over more established methods like GLMs: they are well suited to ‘messy’ data with empty cells and non-normal distributions, which corpus data often are; they are more accurate in determining the contribution of each predictor even in the presence of complex interactions (see Strobl et al. 2009b); and they offer much more accessible formats for interpretation.

4.2  Method

4.2.1  Corpus composition and constraint dimensions

The two non-translated English corpora used in this study are the written published registers of the International Corpus of English (see http://ice-corpora.net/ice/), for Great Britain (BR) and South Africa (SA), respectively. Six registers are included: academic writing, fiction, instructional writing (a combination of administrative texts and instructional ‘hobby’ texts), persuasive writing (e.g. editorials), popular non-fiction and newswriting. The translation corpus (TRANS) was constructed to be as comparable as possible with the ICE design, and contains published texts translated from Afrikaans by professional translators in South Africa, in approximately the same publication timeframe as the ICE-corpora (from the 1990s onwards). Table 3.1 shows the composition of the corpus used in this study. How does this corpus reflect particular sociocognitive constraints, and how is it potentially useful in investigating how different constraints play out differently in different varieties? Drawing up a constraint matrix (see Kotze 2019) shows that the varieties as represented in the three subcorpora are similar in three constraint dimensions: Modality and Register (the subcorpora are all of written texts, in the same registers), Proficiency (all users have high levels of proficiency) and Task Expertise (it can be assumed that all users are experts at writing and translating texts, given that these are published texts).

Table 3.1  Corpus Composition

Corpus   Academic   Fiction   Instructional   Persuasive   Popular   Reportage   Total
BR         72,851    35,236          42,210       20,759    56,300      27,103   254,459
SA         33,866    35,024          28,297        8,747    49,031      31,100   186,065
TRANS     117,208   150,634          72,703       73,123   126,225      59,373   599,266

There are differences in two constraint dimensions. First, there is a difference in the dimension of Language Activation. The translated subcorpus is clearly highly constrained by immediate bilingual language activation of both English and Afrikaans (as source language), whereas for the British English subcorpus this constraint does not exist. The most obvious potential consequence of the strong bilingual activation of the translation subcorpus is cross-linguistic influence effects (from Afrikaans, with its high rate of complementizer omission) – which would logically lead to a higher incidence of zero – although other less obvious consequences may also be anticipated, as already discussed in Section 2. For the native South African English subcorpus, the Language Activation constraint is more ambiguous: native speakers of South African English are unlikely to have Afrikaans as a second language activated when writing in English (or would have it activated to a small degree; see Kruger and van Rooy 2020 for a discussion of the bilingualism of native speakers of South African English). However, the effects of long-term language contact between Afrikaans and English in South Africa have been demonstrated in unique features of native South African English, which reflect the influence of Afrikaans (Kruger and van Rooy 2020). Among these features is a tendency towards a higher omission ratio of that in written South African English compared to British English, as a consequence of contact with Afrikaans, which has a very high omission ratio for the corresponding complementizer dat (Kruger and van Rooy 2020). Second, there is a straightforward difference in the constraint dimension of Text Production: the translation corpus reflects dependent (or mediated) text production, while the other two corpora reflect independent (or non-mediated) text production. 
This constraint matrix can thus be used to infer why potential differences in the conditioning of the that/zero alternation arise across the three corpora: where such differences emerge, they are likely to be the consequence of the differences in the constraints of Language Activation and Text Production.
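The comparison described above can be sketched as a small data structure. This is a minimal illustration, assuming a plain dictionary encoding: the dimension names follow the text, but the cell labels (for example "ambiguous (long-term contact)" for South African English) and the function names are shorthand invented for this sketch, not part of Kotze's (2019) model.

```python
# Hypothetical encoding of the comparative constraint matrix for the three
# subcorpora. Cell labels are shorthand chosen for this sketch.
CONSTRAINT_MATRIX = {
    "BR": {
        "Language Activation": "monolingual",
        "Modality and Register": "written, same registers",
        "Text Production": "independent",
        "Proficiency": "high",
        "Task Expertise": "expert",
    },
    "SA": {
        "Language Activation": "ambiguous (long-term contact)",
        "Modality and Register": "written, same registers",
        "Text Production": "independent",
        "Proficiency": "high",
        "Task Expertise": "expert",
    },
    "TRANS": {
        "Language Activation": "bilingual",
        "Modality and Register": "written, same registers",
        "Text Production": "dependent",
        "Proficiency": "high",
        "Task Expertise": "expert",
    },
}


def split_dimensions(matrix):
    """Return (shared, distinct) dimension names across all varieties."""
    varieties = list(matrix)
    shared, distinct = [], []
    for dimension in matrix[varieties[0]]:
        values = {matrix[variety][dimension] for variety in varieties}
        (shared if len(values) == 1 else distinct).append(dimension)
    return shared, distinct


def differing_dimensions(matrix, variety_a, variety_b):
    """List the dimensions on which two varieties are differently constrained."""
    return [
        dimension
        for dimension, value in matrix[variety_a].items()
        if matrix[variety_b][dimension] != value
    ]


shared, distinct = split_dimensions(CONSTRAINT_MATRIX)
```

With this encoding, `distinct` holds the two dimensions to which differences in the conditioning of the alternation would be ascribed, and `differing_dimensions(CONSTRAINT_MATRIX, "SA", "TRANS")` narrows the comparison to one pair of varieties.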

4.2.2  Data extraction and coding

In order to extract complement clauses (with or without that), the lists of private, public and suasive complement-taking verbs2 from Quirk et al. (1985: 1180–3) were used as search terms. Concordance lines were extracted using WordSmith Tools 6 (Scott 2013). The concordance was manually cleaned to identify complement clauses introduced by these verbs. The dataset used in this chapter consists of approximately 4,000 observations of complement
clauses introduced by 120 different verb lemmas (TRANS = 2,108, BR = 979, SA = 796). These cases were subsequently manually annotated. The dependent (or outcome) variable is the presence or absence of that. To distinguish the three varieties, Corpus was included as a predictor (with three levels, TRANS, BR and SA). The other predictor variables were grouped into two sets: complexity-related predictors and conventionality-related predictors. These predictors, the shorthand used for them in the analysis (indicated in small caps) and the levels of each predictor are summarized below.

Conventionality-related predictors (reflecting that the preference for that/zero is linked by convention to particular contexts; for example, zero is more prevalent in more informal registers, and with more frequent verbs)

1. Register (Register): The register of the text from which each case was drawn was coded, using the following factor levels: Academic, Creative, Instructional, Persuasive, Popular, Reportage.
2. The frequency of the main verb in the declarative complement clause construction (LemmaConstrFreq100kLOG): This predictor is based on the frequency of each individual lemma as main-clause verb in the declarative complement clause construction, normalized to frequency per 100,000 words for each corpus. It therefore reflects not only absolute verb frequency but also the frequency of the verb within the complement construction. This variable was log transformed, since the values were not normally distributed.
3. Underlying semantics (SemClass) of the main-clause verb, with the levels Private, Public and Suasive (based on Quirk et al. 1985: 1180–3).
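The normalization and log transformation behind LemmaConstrFreq100kLOG can be sketched as follows. This is a minimal illustration, not the author's script: the function name is hypothetical, the corpus sizes are the totals from Table 3.1, and the natural logarithm is an assumption, since the text does not state which base was used.

```python
import math

# Corpus sizes taken from the totals in Table 3.1.
CORPUS_SIZE = {"BR": 254_459, "SA": 186_065, "TRANS": 599_266}


def lemma_constr_freq_100k_log(raw_count, corpus):
    """Log of a lemma's per-100,000-word frequency in the complement construction.

    raw_count is the number of times the lemma occurs as main-clause verb
    in the declarative complement clause construction in the given corpus.
    """
    if raw_count <= 0:
        raise ValueError("raw_count must be positive: the log of zero is undefined")
    per_100k = raw_count / CORPUS_SIZE[corpus] * 100_000
    # Natural log is an assumption; the log base is not given in the text.
    return math.log(per_100k)


# Invented raw count, used only to show the effect of corpus size.
value_sa = lemma_constr_freq_100k_log(300, "SA")
value_trans = lemma_constr_freq_100k_log(300, "TRANS")
```

Because the subcorpora differ considerably in size, the same raw count yields a higher normalized value in the smaller South African subcorpus than in the translation subcorpus, which is the reason for normalizing before comparing varieties.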

Complexity-related predictors (reflecting that there is a cognitive processing-related constraint on the preference for that/zero in the particular context, with more complex grammatical environments prompting a higher likelihood of that)

4. Subject (Subject): The subject of the main clause, with the following levels: Clause, Zero, Pronoun, Noun, Non-referential It, Relative Pronoun, WH-word.
5. The distance between the main-clause verb and the onset of the complement clause (MCVerbToCCLengthLog): This measures the presence of intervening material between the main verb and onset of the complement clause. It is a continuous variable of the count of characters from the main-clause verb to the onset of the complement clause, log transformed.
6. The tense and modality (TenseModality) of the main clause: Tense and modality combined was coded with four levels: Present, Past, Modal, Non-finite.
7. The aspect of the main clause (Aspect): With three levels (Simple, Progressive, Perfect), this reflects another dimension of grammatical complexity.
8. The polarity of the main clause (Polarity): Polarity (Positive, Negative) was also coded, since negation introduces increased grammatical complexity into the main clause.
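The coding scheme summarized in the two lists above can be represented as a validated record type. The sketch below is hypothetical: the factor levels are those listed in the text, but the Python field names, the `Observation` class and the example record are invented for illustration and do not reproduce the actual annotation format or any row of the dataset.

```python
from dataclasses import dataclass

# Factor levels as listed in the text; the dictionary keys are this sketch's own names.
LEVELS = {
    "complementizer": {"that", "zero"},
    "corpus": {"TRANS", "BR", "SA"},
    "register": {"Academic", "Creative", "Instructional", "Persuasive",
                 "Popular", "Reportage"},
    "sem_class": {"Private", "Public", "Suasive"},
    "subject": {"Clause", "Zero", "Pronoun", "Noun", "Non-referential It",
                "Relative Pronoun", "WH-word"},
    "tense_modality": {"Present", "Past", "Modal", "Non-finite"},
    "aspect": {"Simple", "Progressive", "Perfect"},
    "polarity": {"Positive", "Negative"},
}


@dataclass(frozen=True)
class Observation:
    """One annotated complement clause (hypothetical record layout)."""
    complementizer: str                 # outcome variable
    corpus: str
    register: str
    lemma_constr_freq_100k_log: float   # continuous, log transformed
    sem_class: str
    subject: str
    mc_verb_to_cc_length_log: float     # continuous, log transformed
    tense_modality: str
    aspect: str
    polarity: str

    def __post_init__(self):
        # Reject any categorical value that is not a documented factor level.
        for name, allowed in LEVELS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


# An invented record, not a row from the dataset.
example = Observation(
    complementizer="that", corpus="BR", register="Reportage",
    lemma_constr_freq_100k_log=3.2, sem_class="Public", subject="Noun",
    mc_verb_to_cc_length_log=0.0, tense_modality="Present",
    aspect="Simple", polarity="Positive",
)
```

Validating levels at construction time is one way to catch annotation typos before modelling; the two continuous predictors are left unchecked here because the text gives no bounds for them.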

4.2.3  Statistical analysis

The first step of the analysis was carried out using random forests analysis – a new method in the statistical toolkit of corpus-based translation studies, but one that has made significant inroads into variationist linguistics alongside conditional inference trees (see Tagliamonte and Baayen 2012). Random forests analysis is an ensemble-based statistical learning technique for non-parametric regression and classification using recursive partitioning (see Strobl et al. 2009a, 2009b). In ensemble methods a ‘voting’ (or averaging) process over a randomly generated ensemble of conditional inference trees (see below), rather than a single tree, is used for classification. Random forests rank the individual independent variables according to their explanatory importance in conditioning the response variable given all other independent variables: the higher a variable’s importance score, the stronger its impact on the response variable. In what follows, I apply random forests models as implemented in the R package ‘randomForest’ (Liaw and Wiener 2002). In the second step of the analysis, I use conditional inference tree analysis, as implemented in the R package ‘partykit’ (see Hothorn and Zeileis 2015). Conditional inference trees are a method for classification (in the case of categorical variables) and regression (in the case of continuous variables), using binary recursive partitioning. Levshina (2015: 292) describes how the process works:

In the first step, the algorithm tests if any of the predictor variables are associated with the outcome variable, and chooses the predictor that has the strongest effect on the outcome. In the second step, it makes a binary split on this variable to divide the dataset into two subsets (based on different criteria for categorical and continuous variables). The algorithm then repeats these two steps for each subset, until there are no significant associations left.
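The two-step procedure in the quotation can be illustrated with a deliberately simplified sketch. It is not the partykit algorithm, which rests on a permutation-test framework and handles multi-level and continuous predictors; here, as a stated simplification, all predictors are binary, association is measured with a Pearson chi-square statistic, and splitting stops when no statistic exceeds the df = 1 critical value at alpha = 0.05. The predictor names and all counts are invented.

```python
from collections import Counter

CRITICAL_CHI2_DF1 = 3.841  # chi-square critical value for df = 1, alpha = 0.05


def chi2_binary(rows, predictor, outcome="that"):
    """Pearson chi-square statistic for a 2x2 table (binary predictor x binary outcome)."""
    counts = Counter((bool(r[predictor]), bool(r[outcome])) for r in rows)
    a, b = counts[(True, True)], counts[(True, False)]
    c, d = counts[(False, True)], counts[(False, False)]
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:  # an empty margin: no association can be tested
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def grow_tree(rows, predictors, outcome="that"):
    """Step 1: pick the strongest predictor; step 2: split; then recurse."""
    scores = {p: chi2_binary(rows, p, outcome) for p in predictors}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < CRITICAL_CHI2_DF1:
        # No significant association left: return a leaf with the share of 'that'.
        share = sum(bool(r[outcome]) for r in rows) / len(rows)
        return {"leaf": True, "n": len(rows), "p_that": share}
    remaining = [p for p in predictors if p != best]
    return {
        "leaf": False,
        "split": best,
        "yes": grow_tree([r for r in rows if r[best]], remaining, outcome),
        "no": grow_tree([r for r in rows if not r[best]], remaining, outcome),
    }


def make_rows(formal, pronoun, n_that, n_zero):
    """Build invented observations for one cell of the toy design."""
    base = {"formal_register": formal, "pronoun_subject": pronoun}
    return ([dict(base, that=True) for _ in range(n_that)]
            + [dict(base, that=False) for _ in range(n_zero)])


# Invented counts, chosen only to make the partitioning visible.
toy_rows = (
    make_rows(True, True, 18, 2)
    + make_rows(True, False, 19, 1)
    + make_rows(False, True, 4, 16)
    + make_rows(False, False, 14, 6)
)
tree = grow_tree(toy_rows, ["formal_register", "pronoun_subject"])
```

On these invented counts the sketch splits first on `formal_register` and then, within the less formal subset only, on `pronoun_subject`; the formal subset becomes a leaf because no remaining association passes the threshold. A forest would repeat this kind of tree-growing over many resampled datasets and let the trees vote.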

4.3  Findings and discussion

The variable importance plot (Figure 3.1) produced by the random forests analysis using all predictor variables provides an indication of the relative importance of each predictor variable. The variable importance plot immediately foregrounds that translators and the authors of non-translated texts (be they South African or British) generally are influenced by the same factors in the choice between that and zero: the two strongest conditioning variables, by far, are the frequency of the main-clause verb and the register. This is followed by the other mostly grammatical variables. However, the variety (Corpus) is highlighted as a relevant predictor variable midway down in the hierarchy – more important than the effects of the distance between the main-clause verb and the start of the complement clause; and the semantic class, aspect and polarity of the main-clause verb. In other words, by and large the two contact-influenced varieties and the non-contact-influenced variety are similar in terms of the conditioning of the choice between that and zero – though some differences are observable.

Figure 3.1  Variable importance plot for all three subcorpora combined.3

This is entirely in line with the third position point outlined in Section 3.1: we expect more similarity than difference among constrained varieties, as translators aim to approximate the usage patterns of target users, and contact- and non-contact-influenced varieties converge on generally the same kind of grammatical conditioning, with differences of a subtle (but systematic) nature. This random forests analysis, however, does not immediately show us where these subtle differences occur: there are differences between the subcorpora (Corpus is a significant predictor), but it is not clear what these are. Neither does random forests analysis show us exactly what the effects of the different predictor variables are, or how exactly they interact. To answer these questions, there are a number of further possibilities, such as investigating partial dependence plots for the random forest, constructing separate random forests for each variety or using the important variables from the random forests analysis to construct a conditional inference tree. In what follows, for reasons of space I consider only the second and third of these options. Figure 3.2 shows the outcome of running a separate random forests analysis for each of the subcorpora, in order to determine similarities and

Figure 3.2  Variable importance plot for each individual varietal subcorpus.

differences in the strength and roles of the predictor variables in conditioning the alternation, in the three varieties. As the plots show, the predictor variables occur roughly in the same order of strength, again highlighting the similarities in the conditioning of the that/zero alternation. For British English, however, register is of (marginally) more importance than the frequency of the lemma in the complement-clause construction, unlike in the other two varieties, and the two predictors have a very similar strength. This is potentially a suggestion of a levelling out of register differences in the alternation, in the two contact (South African) varieties. Also notable is the exaggerated strength of two conventionality-related predictors in the translation subcorpus (note the very different scale for this subcorpus). This potentially suggests that the choice between that and zero is much more strongly influenced by conventionality-related predictors in translated texts. The individual random forests thus provide some possible additional insights suggestive of slightly different configurations of constraints across the three varieties, but much of this is still speculative. A  conditional inference tree (see Figure 3.3) offers a more readily interpretable output.4 A detailed analysis of the conditional inference tree is not possible within the scope of this chapter.5 Here, I will highlight only a few points pertinent to the argument. First, the analysis clearly shows that, by and large, similar factors condition the choice between that and zero in the three varieties (as predicted by the third position point outlined in Section 3.1). 
The conditional inference tree echoes the findings from the random forests analyses in demonstrating the very strong conditioning effect of register and lemma frequency, followed by other complexity variables (the subject, and the distance between the main-clause verb and the onset of the complement clause), as well as the conventionality-related variable of the semantic class of the main-clause verb. The variety (Corpus) does, however, appear as a conditioning variable lower down in the tree (node 17), and demonstrates that it is translations that are conditioned differently from the two non-translated varieties. However, this difference is evident only under very particular conditions. The left branch of node 1, split on Register, and containing academic, instructional, persuasive and popular writing reflects an overall higher preference for that, and in this part of the tree, conditioning variables play the same role for the three varieties. The right branch of node 1 contains the registers of fiction and newswriting, where zero is generally more common. It is only in these registers where there is a significant difference between translations and non-translated

Figure 3.3  Conditional inference tree for all three corpora combined.


texts, and only in very particular grammatical environments: where the complement clause occurs with a verb that is not one of the high-frequency verbs say, think, know and mean (the only verbs with a frequency above the cut-off point; split on node 15), and where the subject is not a pronoun (split on node 17), translations tend to use that more frequently while the two non-translated subcorpora have a relatively higher frequency of zero. In other words: where zero is relatively unambiguously preferred in English (in grammatically less complex environments with pronoun subjects combined with the high-frequency verbs say, think, know and mean in fiction and newswriting), translators follow this preference. However, in more ambiguous contexts (any other verbs, any other subjects) in these registers, writers of non-translated texts feel free to opt for zero, whereas translators are more likely to opt for that. Why would this be? Or to rephrase this in terms of the constrained communication model: What are the constraint differences for the three varieties that may account for this finding? As discussed above, the translated subcorpus is differently constrained from the other two subcorpora in two dimensions. The first is the Text Production dimension: the conditions of dependent text production clearly prompt translators to be more conservative, retaining that in circumstances where producers of original texts would omit it. The second is the Language Activation constraint: translated texts reflect text production under direct bilingual language activation, a constraint absent from the other two subcorpora (even though the South African English subcorpus reflects a variety influenced by language contact).
The cognitive demand of bilingual language activation thus appears to ‘overrule’ possible cross-linguistic influence from the source language, Afrikaans, which has a high frequency of complementizer omission, and leads instead to a propensity for increased explicitness in translation, even when the strongly activated source-language influence (and conditions of dependent text production) is likely to prompt the zero form.
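Note 3 defines the importance measure of the random forests as the mean decrease in the Gini coefficient. As a minimal sketch of what this quantity captures, the following Python fragment computes the Gini impurity of a node and the decrease obtained from a single binary split. The ten observations and the two predictors (formal_register, pronoun_subject) are invented purely for illustration and are not drawn from the corpora analysed in this chapter; a random forest aggregates such decreases over all splits on a predictor and averages them across trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(labels, goes_left):
    """Impurity decrease when a node is split into two children.

    goes_left[i] is True if observation i falls in the left child.
    """
    left = [y for y, flag in zip(labels, goes_left) if flag]
    right = [y for y, flag in zip(labels, goes_left) if not flag]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


# Invented toy data: five clauses with 'that', five with 'zero'.
labels = ["that"] * 5 + ["zero"] * 5

# Hypothetical predictor that separates the two outcomes fairly well.
formal_register = [True, True, True, True, False, True, False, False, False, False]
# Hypothetical predictor that separates them only weakly.
pronoun_subject = [True, True, True, False, False, True, True, False, False, False]

print(f"{gini(labels):.2f}")                          # 0.50
print(f"{gini_decrease(labels, formal_register):.2f}")  # 0.18
print(f"{gini_decrease(labels, pronoun_subject):.2f}")  # 0.02
```

In this toy example the first predictor yields the larger decrease, which is the sense in which a predictor with a higher value is 'more important': splitting on it produces purer nodes.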

5 Conclusion

This chapter has attempted to synthesize a theoretical and methodological proposal for extending the scope of corpus-based translation studies within a paradigm of constrained communication. In this approach, translation is seen as one among a larger set of constrained varieties, and it is argued that the systematic comparison of such varieties along particular constraint dimensions has the potential to illuminate unique sociocognitive aspects of language and text processing in translation, against the background of similarities with other constrained varieties. In this, a multivarietal approach is essential, as is a multivariate statistical method. In this chapter, I have set out three theoretical position points for a study of constrained communication, proposed five constraint dimensions and provided an example case study of how these may be implemented in practice. In doing so, I hope to have illustrated how perspectives and methods from usage-based, variationist and comparative linguistics may be integrated in moving corpus-based translation studies beyond one-dimensional comparisons of translated and non-translated language, while, at the same time, opening up possibilities for better understanding why translated language demonstrates the features that it does.

The analysis presented in this chapter first reiterates that differently constrained varieties demonstrate more similarities than differences, and that where differences do exist, they are subtle, distributional, and quantitative in nature (rather than categorical and qualitative). However, despite their subtlety, these differences are significant and systematic, and can be used to differentiate constrained varieties from one another, and to suggest different cognitive and social constraint configurations and effects (in line with the first and second position statements in Section 3.1). The comparison in this chapter, however, is limited to just three varieties reflecting small differences in constraints, and in order to develop a more comprehensive understanding of the similarities and differences among different constrained varieties in determining ‘varioversals’ and their functional underpinnings, more extensive comparisons involving more varieties and reflecting more complex constraint interactions will no doubt be illuminating (as, e.g.
in Ivaska and Bernardini 2020, Ivaska et al. this volume). For example, the study reported on in this chapter could be extended by adding (proficient) second-language English writing, learner English and learner translations, thus yielding a constraint matrix as in Table 3.2 (where the three varieties used in this study are in the first three columns). It should be noted that the short descriptions of constraints inevitably mask the variability that is to be expected in reality; these descriptions are to be interpreted as general characterizations of the constraint. Where variability is likely to be particularly pronounced, this is indicated in the matrix. In addition, it is essential also to move away from single-feature approaches (another ‘blinker’ of corpus-based translation studies) towards multivariable approaches (as, e.g. in Ivaska and Bernardini 2020, Ivaska et al. this volume;

Table 3.2  An Extended Constraint Matrix

| Constraint dimension | Writing in a high-contact L1 English variety | Professional translation into English | Non-contact English writing | Writing in an L2 English variety | Learner translation into English | Learner English writing |
|---|---|---|---|---|---|---|
| Language activation | Monolingual/weak or historical bilingual L2→L1 | Bilingual (strong) L2→L1 (or L1→L2) | Monolingual | Bilingual (weak to strong; variability likely) L1→L2 | Bilingual (strong): L2→L1 (or L1→L2) | Bilingual (medium to strong): L1→L2 |
| Modality | Written | Written | Written | Written | Written | Written |
| Text production | Unmediated | Mediated | Unmediated | Unmediated | Mediated | Unmediated |
| Proficiency | Proficient user | Proficient user | Proficient user | Proficient user (variability likely) | Proficient user (L1 translation)/learner (L2 translation); variability likely | Learner; variability likely |
| Task expertise | Expert | Expert | Expert | Expert | Non-expert (variability likely) | Non-expert (variability likely) |


Kruger and van Rooy 2018) to consider whether the differences between constrained varieties remain localized in smaller grammatical patterns, or whether these actually combine to create larger perceptible differences at the level of functionally related constellations of features.
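A constraint matrix of the kind shown in Table 3.2 lends itself to a simple data structure in which varieties can be compared dimension by dimension. The following Python sketch, offered only as an illustration, transcribes the three varieties used in this study and lists the dimensions on which any two of them differ. The shortened cell labels and the function name differing_dimensions are assumptions made for the example, and the sketch treats each constraint as a single categorical value, which deliberately ignores the variability discussed above.

```python
# First three columns of Table 3.2, with cell values shortened for comparison.
# Mapping: variety -> constraint dimension -> (simplified) value.
CONSTRAINTS = {
    "high-contact L1 English writing": {
        "language activation": "monolingual / weak or historical bilingual",
        "modality": "written",
        "text production": "unmediated",
        "proficiency": "proficient user",
        "task expertise": "expert",
    },
    "professional translation into English": {
        "language activation": "bilingual (strong)",
        "modality": "written",
        "text production": "mediated",
        "proficiency": "proficient user",
        "task expertise": "expert",
    },
    "non-contact English writing": {
        "language activation": "monolingual",
        "modality": "written",
        "text production": "unmediated",
        "proficiency": "proficient user",
        "task expertise": "expert",
    },
}


def differing_dimensions(variety_a, variety_b):
    """Return the constraint dimensions on which two varieties differ."""
    a, b = CONSTRAINTS[variety_a], CONSTRAINTS[variety_b]
    return [dimension for dimension in a if a[dimension] != b[dimension]]


print(differing_dimensions("professional translation into English",
                           "non-contact English writing"))
# ['language activation', 'text production']
```

On this encoding, professional translation differs from each of the two non-translated varieties on language activation and text production, the two dimensions identified in the case study; adding the remaining columns of the table would extend the same comparison to learner and L2 varieties.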

Notes

1 Examples (1) and (2) are taken from Kruger and Van Rooy (2020).
2 The lists provided by Quirk et al. (1985) are not exhaustive, but comprehensively cover the three important semantic domains of complement-taking verbs. Private and public verbs are both factual verbs. Private verbs express intellectual states and beliefs; they are ‘private’ because not observable (ibid., 1181). Public verbs are speech act verbs that typically introduce indirect statements (ibid.). Suasive verbs are a category of verbs that include the meaning of persuading, and that ‘can be followed by a that-clause either with putative should. . . . or with the mandative subjunctive’ (ibid., 1182).
3 The measure of importance is the mean decrease in the Gini coefficient, which is a measure of how much each variable contributes to the homogeneity (or purity) of the nodes in the forest. The higher the value, the more important the variable. For a detailed explanation of how it is calculated, see https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html.
4 For the sake of limiting complexity of interactions, the depth of the tree is restricted to four levels.
5 See Kruger (2019) for an example of how a full analysis can be carried out; note that in that article the contrast is only between translated texts and British English texts.

Suggested key readings

Lanstyák, I. and P. Heltai (2012), ‘Universals in Language Contact and Translation’, Across Languages and Cultures, 13 (1): 99–121.
This paper is a largely conceptual proposal for the notion of constrained communication, which has informed much subsequent theorization.

Kruger, H. and B. van Rooy (2016a), ‘Constrained Language: A Multidimensional Analysis of Translated English and a Non-native Indigenised Variety of English’, English World-Wide, 37 (1): 26–57.
This paper develops the idea of constrained communication in the context of an empirical, quantitative corpus-based comparison between English translated from Afrikaans, writing in a non-native indigenized variety of English (East African English), and written British English, using the multidimensional method of Biber (1988).

Kotze, H. (2019), ‘Converging What and How to Find Out Why: An Outlook On Empirical Translation Studies’, in L. Vandevoorde, J. Daems and B. Defrancq (eds), New Empirical Perspectives on Translation and Interpreting, 333–71, London: Routledge.
This chapter is an attempt to place the constrained communication paradigm within broader developments in empirical translation studies, and is the first attempt to formalize specific constraint dimensions.

De Sutter, G. and M.-A. Lefer (2020), ‘On the Need for a New Research Agenda for Corpus-Based Translation Studies: A Multi-Methodological, Multifactorial and Interdisciplinary Approach’, Perspectives, 28 (1): 1–23.
De Sutter and Lefer engage with the need for conceptual and methodological interdisciplinarity in their paper that goes back to the ‘roots’ of corpus-based translation studies, arguing for a revaluation of the original corpus-based programme proposed by Baker. In addition, it represents the most cogent argument thus far made for the importance of multifactorial methods in corpus-based translation studies.

Kruger, H. and G. De Sutter (2018), ‘Alternations in Contact and Non-Contact Varieties: Reconceptualising That-Omission in Translated and Non-Translated English Using the MuPDAR Approach’, Translation, Cognition and Behavior, 1 (2): 251–90.
Ivaska, I. and S. Bernardini (2020), ‘Constrained Language Use in Finnish: A Corpus-driven Approach’, Nordic Journal of Linguistics, 43 (1): 33–57.
These two papers are representative of the state of the art of the implementation of the constrained-language paradigm and the use of contemporary multifactorial methods – the former corpus-based, and the latter corpus-driven.
Kruger and De Sutter (2018) (alongside De Sutter and Lefer 2020) may be usefully read as a counterpoint to Olohan and Baker (2000) to demonstrate how far the discipline has come in twenty years in investigating the same phenomenon of that-omission.

References

Baker, M. (1993), ‘Corpus Linguistics and Translation Studies: Implications and Applications’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology: In Honour of John Sinclair, 233–50, Amsterdam: John Benjamins.
Baker, M. (1995), ‘Corpora in Translation Studies: An Overview and Some Suggestions for Future Research’, Target, 7 (2): 223–43.
Becher, V. (2010), ‘Abandoning the Notion of “Translation-Inherent” Explicitation: Against a Dogma of Translation Studies’, Across Languages and Cultures, 11 (1): 1–28.


Bisiada, M. (2017), ‘Universals of Editing and Translation’, in S. Hansen-Schirra, O. Czulo and S. Hofmann (eds), Empirical Modelling of Translation and Interpreting, 241–75, Berlin: Language Science Press.
Bresnan, J. and M. Ford (2010), ‘Predicting Syntax: Processing Dative Constructions in American and Australian Varieties of English’, Language, 86 (1): 168–213.
Bybee, J. (2010), Language, Usage and Cognition, Cambridge: Cambridge University Press.
Chesterman, A. (1997), Memes of Translation: The Spread of Ideas in Translation Theory, Amsterdam: John Benjamins.
Chesterman, A. (2004), ‘Hypotheses About Translation Universals’, in G. Hansen, K. Malmkjær and D. Gile (eds), Claims, Changes and Challenges in Translation Studies: Selected Contributions from the EST Congress, Copenhagen 2001, 1–13, Amsterdam: John Benjamins.
Costa, A. and N. Sebastián-Gallés (2014), ‘How Does the Bilingual Experience Sculpt the Brain?’, Nature Reviews Neuroscience, 15: 336–45.
De Sutter, G. and M.-A. Lefer (2020), ‘On the Need for a New Research Agenda for Corpus-Based Translation Studies: A Multi-Methodological, Multifactorial and Interdisciplinary Approach’, Perspectives, 28 (1): 1–23.
Ferraresi, A., Bernardini, S., Petrović, M. M. and M.-A. Lefer (2018), ‘Simplified or not Simplified? The Different Guises of Mediated English at the European Parliament’, Meta, 63 (3): 717–38.
Frawley, W. (2000 [1984]), ‘Prolegomenon to a Theory of Translation’, in L. Venuti (ed.), The Translation Studies Reader, 250–63, London: Routledge.
Gaspari, F. and S. Bernardini (2010), ‘Comparing Non-native and Translated Language: Monolingual Comparable Corpora With a Twist’, in R. Xiao (ed.), Using Corpora in Contrastive and Translation Studies, 215–34, Newcastle: Cambridge Scholars Publishing.
Geeraerts, D. (2018), Ten Lectures on Cognitive Sociolinguistics, Leiden: Brill.
Grafmiller, J., Szmrecsanyi, B., Röthlisberger, M. and B. Heller (2018), ‘General Introduction: A Comparative Perspective on Probabilistic Variation in Grammar’, Glossa: A Journal of General Linguistics, 3 (1): 1–10.
Granger, S. (2015), ‘Contrastive Interlanguage Analysis: A Reappraisal’, International Journal of Learner Corpus Research, 1 (1): 7–24.
Granger, S. (2018), ‘Tracking the Third Code: A Cross-Linguistic Corpus-Driven Approach to Metadiscursive Markers’, in A. Čermáková and M. Mahlberg (eds), The Corpus Linguistics Discourse: In Honour of Wolfgang Teubert, 185–204, Amsterdam: John Benjamins.
Gries, S. Th. and S. Deshors (2014), ‘Using Regressions to Explore Deviations Between Corpus Data and a Standard/Target: Two Suggestions’, Corpora, 9 (1): 109–36.
Gries, S. Th. and S. Deshors (2020), ‘There’s More to Alternations Than the Main Diagonal of a 2×2 Confusion Matrix: Improvements of MuPDAR and Other Classificatory Alternation Studies’, ICAME Journal, 44: 69–96.


Halverson, S. L. (2003), ‘The Cognitive Basis of Translation Universals’, Target, 15 (2): 197–241.
Hothorn, T. and A. Zeileis (2015), ‘Partykit: A Modular Toolkit for Recursive Partytioning in R’, Journal of Machine Learning Research, 16: 3905–9. http://jmlr.org/papers/v16/hothorn15a.html
House, J. and S. Blum-Kulka (1986), Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, Tübingen: Gunter Narr Verlag.
Ivaska, I. and S. Bernardini (2020), ‘Constrained Language Use in Finnish: A Corpus-driven Approach’, Nordic Journal of Linguistics, 43 (1): 33–57.
Kager, R. (1999), Optimality Theory, Cambridge: Cambridge University Press.
Kajzer-Wietrzny, M. (2018), ‘Interpretese vs. Non-Native Language Use: The Case of Optional That’, in M. Russo, C. Bendazzoli and B. Defrancq (eds), Making Way in Corpus-based Interpreting Studies, 97–113, Singapore: Springer.
Kajzer-Wietrzny, M. (2021), ‘An Intermodal Approach to Cohesion in Constrained and Unconstrained Language’, Target. https://doi.org/10.1075/target.19186.kaj
Kolehmainen, L., Meriläinen, L. and H. Riionheimo (2014), ‘Interlingual Reduction: Evidence from Language Contacts, Translation and Second Language Acquisition’, in H. Paulasto, L. Meriläinen, H. Riionheimo and M. Kok (eds), Language Contacts at the Crossroads of Disciplines, 3–32, Cambridge: Cambridge Scholars.
Kotze, H. (2019), ‘Converging What and How to Find Out Why: An Outlook On Empirical Translation Studies’, in L. Vandevoorde, J. Daems and B. Defrancq (eds), New Empirical Perspectives on Translation and Interpreting, 333–71, London: Routledge.
Kruger, H. (2012), ‘A Corpus-Based Study of the Mediation Effect in Translated and Edited Language’, Target, 24 (2): 355–88.
Kruger, H. (2019), ‘That Again: A Multivariate Analysis of the Factors Conditioning Syntactic Explicitness in Translated English’, Across Languages and Cultures, 20 (1): 1–33.
Kruger, H. and G. De Sutter (2018), ‘Alternations in Contact and Non-Contact Varieties: Reconceptualising That-Omission in Translated and Non-Translated English Using the MuPDAR Approach’, Translation, Cognition and Behavior, 1 (2): 251–90.
Kruger, H. and B. van Rooy (2012), ‘Register and the Features of Translated Language’, Across Languages and Cultures, 13 (1): 33–65.
Kruger, H. and B. van Rooy (2016a), ‘Constrained Language: A Multidimensional Analysis of Translated English and a Non-native Indigenised Variety of English’, English World-Wide, 37 (1): 26–57.
Kruger, H. and B. van Rooy (2016b), ‘Syntactic and Pragmatic Transfer Effects in Reported-Speech Constructions in Three Contact Varieties of English Influenced by Afrikaans’, Language Sciences, 56: 118–31.
Kruger, H. and B. van Rooy (2018), ‘Register Variation in Written Contact Varieties of English: A Multidimensional Analysis’, English World-Wide, 39 (2): 214–42.


Kruger, H. and B. van Rooy (2020), ‘A Multifactorial Analysis of Contact-Induced Change in Speech Reporting in Written White South African English (WSAfE)’, English Language and Linguistics, 24 (1): 179–209.
Lanstyák, I. and P. Heltai (2012), ‘Universals in Language Contact and Translation’, Across Languages and Cultures, 13 (1): 99–121.
Levshina, N. (2015), How to Do Linguistics with R: Data Exploration and Statistical Analysis, Amsterdam: John Benjamins.
Liaw, A. and M. Wiener (2002), ‘Classification and Regression by Random Forest’, R News, 2 (3): 18–22.
Loebell, H. and K. Bock (2003), ‘Structural Priming Across Languages’, Linguistics, 41: 791–824.
Maier, R. M., Pickering, M. J. and R. J. Hartsuiker (2017), ‘Does Translation Involve Structural Priming?’, The Quarterly Journal of Experimental Psychology, 70 (8): 1575–89.
Malmkjær, K. (2005), ‘Norms and Nature in Translation Studies’, Synaps, 16: 13–19.
Mauranen, A. and P. Kujamäki (eds) (2004), Translation Universals: Do They Exist?, Amsterdam: John Benjamins.
Olohan, M. and M. Baker (2000), ‘Reporting That in Translated English: Evidence for Subconscious Processes of Explicitation?’, Across Languages and Cultures, 1 (2): 141–58.
Prince, A. and P. Smolensky (2004), Optimality Theory: Constraint Interaction in Generative Grammar, Oxford: Blackwell.
Pütz, M., Robinson, J. A. and M. Reif (eds) (2014), Cognitive Sociolinguistics: Social and Cultural Variation in Cognition and Language Use, Amsterdam: John Benjamins.
Pym, A. (2015), ‘Translating as Risk Management’, Journal of Pragmatics, 85: 67–80.
Quirk, R., Greenbaum, S., Leech, G. and J. Svartvik (1985), A Comprehensive Grammar of the English Language, London: Longman.
Redelinghuys, K. and H. Kruger (2015), ‘Using the Features of Translated Language to Investigate Translation Expertise: A Corpus-based Study’, International Journal of Corpus Linguistics, 20 (3): 293–325.
Scott, M. (2013), Wordsmith Tools 6, Liverpool: Lexical Analysis Software. http://www.lexically.net/wordsmith/version6/index/html.
Shlesinger, M. and N. Ordan (2012), ‘More Spoken Or More Translated? Exploring a Known Unknown of Simultaneous Interpreting’, Target, 24 (1): 43–60.
Strobl, C., Hothorn, T. and A. Zeileis (2009a), ‘“Party on!” A New, Conditional Variable-Importance Measure for Random Forests Available in the Party Package’, The R Journal, 1 (2): 14–17.
Strobl, C., Malley, J. and G. Tutz (2009b), ‘An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests’, Psychological Methods, 14 (4): 323–48.


Szmrecsanyi, B., Grafmiller, J., Heller, B. and M. Röthlisberger (2016), ‘Around the World in Three Alternations: Modeling Syntactic Variation in Varieties of English’, English World-Wide, 37 (2): 109–37.
Szmrecsanyi, B. and B. Kortmann (2009), ‘Vernacular Universals and Angloversals in a Typological Perspective’, in M. Filppula, J. Klemola and H. Paulasto (eds), Vernacular Universals and Language Contact: Evidence from Varieties of English and Beyond, 33–53, New York: Routledge.
Tagliamonte, S. A. (2012), Variationist Sociolinguistics: Change, Observation, Interpretation, Malden, MA: Wiley-Blackwell.
Tagliamonte, S. A. and R. H. Baayen (2012), ‘Models, Forests, and Trees of York English: Was/were Variation as a Case Study for Statistical Practice’, Language Variation and Change, 24: 135–78.
Tagliamonte, S. A. and J. Smith (2005), ‘No Momentary Fancy! The Zero “Complementizer” in English Dialects’, English Language and Linguistics, 9 (2): 289–309.
Torres Cacoullos, R. and J. A. Walker (2009), ‘On the Persistence of Grammar in Discourse Formulas: A Variationist Study of That’, Linguistics, 47 (1): 1–43.
Toury, G. (1995), Descriptive Translation Studies – and Beyond, Amsterdam: John Benjamins.
Toury, G. (2004), ‘Probabilistic Explanations in Translation Studies: Welcome as They Are, Would They Qualify as Universals?’, in A. Mauranen and P. Kujamäki (eds), Translation Universals: Do They Exist?, 15–32, Amsterdam: John Benjamins.
Toury, G. (2012), Descriptive Translation Studies – and Beyond, revised edn, Amsterdam: John Benjamins.
Travis, C. E., Torres Cacoullos, R. and E. Kidd (2017), ‘Cross-Language Priming: A View from Bilingual Speech’, Bilingualism: Language and Cognition, 20 (2): 283–98.
Van Rooy, B. and H. Kruger (2016), ‘Faktore wat die Weglating van die Afrikaanse onderskikker dat bepaal’, Tydskrif vir Geesteswetenskappe / Journal for Humanities, 56 (1): 102–16.
Wulff, S., Gries, S. Th. and N. A. Lester (2018), ‘Optional That in Complementation by German and Spanish Learners’, in A. Tyler, L. Huan and H. Jan (eds), What Is Applied Cognitive Linguistics? Answers from Current SLA Research, 99–120, Berlin: De Gruyter Mouton.
Wulff, S., Lester, N. and M. T. Martínez García (2014), ‘That-variation in German and Spanish L2 English’, Language and Cognition, 6 (2): 271–99.

4

On the use of multiple methods in empirical translation studies
A combined corpus and experimental analysis of subject identifiability in English and German

Stella Neumann, Jonas Freiwald and Arndt Heilmann

1 Introduction

The last decades have seen a large number of corpus studies which have discussed notable linguistic differences in translated texts as compared to non-translated texts (recent examples include Delaere et al. 2012; Hansen-Schirra et al. 2012). Such findings are corroborated by computational studies which continually report high accuracies in classifying translated versus non-translated texts (e.g. Volansky et al. 2015). Translations must therefore contain some distinctive linguistic properties which make them easy to spot by machine learning techniques, despite the inconsistent findings of frequency-based corpus studies, which have led to a more general scepticism towards the idea of characteristic linguistic properties of translations (see e.g. Becher 2010). Arguably, these inconsistent findings can be explained by the incomparability of individual studies, which control for factors such as register, language pair and directionality rather than accounting for their influence using a rigorous empirical methodology. Rather than conceiving of translation properties as shoebox categories which are either in operation or not, the properties can be viewed as probabilistic tendencies. In this view, the various explanatory factors modulate the strength of the tendency. Such a conceptualization is also compatible with the more general conceptualization of language as a probabilistic system (see Toury 2004).


Corpus analyses using advanced, rigorous methodologies such as Serbina (2015) and De Sutter and Lefer (2020) shed light on the various factors that influence the translation product. Possible explanations of patterns observed in products range from contrastive differences to claims about risk aversion and cognitive demand. Such explanations are difficult to test with a corpus methodology. Cognitive explanations in particular can only be tested indirectly in corpora and are often addressed through the process-based analysis of behavioural data, typically involving eye-tracking and intermediate versions of the emerging translation in the form of keystroke logging. Corpus studies represent observational approaches that provide access to authentic language produced by different language users, but lack information about the process which leads to a particular text/translation. Experimental studies, by contrast, involve deliberate manipulation of some factor (the independent variable) to test a predicted effect on some other phenomenon (the dependent variable). Such laboratory experiments allow the researcher to control and observe the measurement and outcome in potentially minute detail – at the expense of authenticity, that is, ecological validity. Neither perspective alone will lead to a holistic understanding of translation, whereas a combination of corpus studies with process-based research may increase explanatory power (Hansen-Schirra and Nitzke 2020). Experimental data can also be analysed as a corpus. This was, to the best of our knowledge, first introduced by Alves and Magalhães (2004), who describe using keystroke logging data as a corpus. Similarly, the CRITT TPR database (Carl et al. 2016) represents data from combined eye-tracking and keystroke logging experiments in a database that also provides access to the resulting translations. 
Most recently, this approach has been exploited by Heilmann (2021), who integrates behavioural data from the CRITT TPR database with rich linguistic annotation of the texts produced in the experiments. In this way, any given data point (e.g. a clause or a text) is characterized both by behavioural and linguistic variables. This approach will also be employed in this chapter.

Against this background, this chapter aims at demonstrating how combined observational and experimental research can enrich each other to further our understanding of translation. To this end, we will draw upon our research in the TRICKLET project.1 TRICKLET uses a combination of corpus data, eye-tracking and keystroke logging data to contribute to an empirically based model of translation. The remainder of the chapter is organized as follows. Section 2 gives an overview of how combinations of methods can be characterized. This forms the basis for Section 3, in which we discuss our own previous multi-methods studies. Leading on from these studies, we report in Section 4 a new analysis of previously collected material shedding light on how the identifiability of subjects can explain choices made in the process. Identifiable subjects refer to subjects containing information (e.g. a referent) that is known or can be derived from the context. The chapter concludes with a discussion of the merits of such multi-methods studies, drawing some more general conclusions for empirical translation studies (Section 5).

2  Characterizing combinations of methods

Combinations of different empirical methods have been present in the social sciences for several decades, as has been the reflection on such combinations. A general notion capturing the combination of theories, data and methods frequently used in this context is triangulation. Alves (2003: vii) introduced the notion to translation studies, describing it as applying ‘several instruments of data gathering and analysis’, inspired by its origin in navigation in taking ‘several location points to establish one’s position’. The purpose of applying triangulation can be described as increasing ‘the validity, strength, and interpretative potential of a study, decreas[ing] investigator biases, and provid[ing] multiple perspectives’ (Thurmond 2001: 253). In keeping with terms commonly used in the social sciences, combinations of different types of observational and experimental methods frequently used in empirical translation studies are referred to as multi-methods approaches in this chapter. Even without the integration of an observational component, process-based translation research often involves combining methods such as eye-tracking and keystroke logging. Although both are geared towards understanding cognitive aspects of the translation process, they yield data regarding different aspects of human behaviour in different modalities: eye gaze while reading and interactions with the computer (i.e. movements of the hands) while typing. However, given the far-reaching integration of these methods, researchers reporting such combined studies do not necessarily describe their study as multi-methods research. In fact, there does not seem to be a clear dividing line between single and multiple methods. Does searching a corpus for examples or for inspiration for potential stimuli in a laboratory experiment already represent using the corpus method? What then counts as a method? Let us attempt a definition:

A method is the systematic procedure of collecting, organizing and studying information.

In this definition, ‘systematic procedure’ captures the methodological aspect, whereas ‘collection, organization and study of information’ refers to doing research in the most general sense; in empirical research, this information is acquired by observation or experimentation. Based on this definition, only studies encompassing all steps of the research process for each method are clear cases of multi-methods research. Based on a content analysis of research articles using a mixed-methods design, Bryman (2006) identifies sixteen rationales for mixing qualitative and quantitative methods. Some of these are relevant for the type of multi-methods research discussed here and will help us to characterize the studies discussed in the following sections. One group of rationales concerns complementation of methods: ‘enhancement’ captures augmenting the findings obtained with the help of one method with another; ‘completeness’ refers to achieving a more comprehensive account of an area, whereas ‘explanation’ enriches this comprehensive account by an interpretive component; lastly, ‘offset’ introduces an assessment of each method in that the rationale involves making up for weaknesses of one method with the help of the strengths of another. Bryman’s rationales of ‘triangulation’ and ‘credibility’ capture the strategy to lend greater validity to each method with the help of the combination and enhancing the integrity of findings and thus address aspects of corroboration. Rationales relating to a subordinated role of methods in the design or the discussion stage of the research process are also of relevance to multi-methods research in translation studies.
They capture ‘instrument development’, that is, testing an instrument with the help of another method; ‘sampling’, that is, using another method for collecting cases; ‘context’, that is, providing a qualitative contextual understanding for quantitative findings; and ‘illustration’, that is, the use of qualitative examples to make numerical results accessible. As Halverson (2017a: 198) argues, not only the motivation for combining methods but also the actual relationship between the methods needs to be clarified in order to facilitate innovation and rigour. Following Halverson (2017a, citing Creswell 2009), we add the categories of timing (in sequence or concurrently), weighting (dominance of one method or equal weight) and nature of the integration (one method is embedded in another, one method is connected to another or the methods are integrated), leaving out Creswell’s fourth category concerning the role of theory as there is no variation in our studies.2

3  Multi-methods studies in the TRICKLET project

The TRICKLET project is a large-scale project combining corpus research with eye-tracking and keystroke logging experiments. In a sense, the entire project is a sequential multi-methods endeavour, where each subsequent study contributes to complementing and corroborating previous studies. In pursuit of this overall goal, a number of individual studies have been carried out, which adopted a multi-methods design at the level of the individual study (Serbina et al. 2017; Heilmann et al. 2020; Freiwald et al. 2020). All three studies are geared towards analysing the relationship between cognitive demand linked to specific linguistic constructions and observed irregularities in the translation product. The following discussion focuses on the interplay between the corpus-based and the integrated experimental methods. An aspect leading to further integration of the different methods is the analysis of products elicited in translation experiments using the corpus method. In what follows, we will restrict the notion of corpus results/data to the analysis of authentic data and use the notion of experimental product results/data to refer to the corpus-based analysis of texts elicited in our experiments.

The study reported by Serbina et al. (2017) was the first from our group drawing on a full-fledged multi-methods design. It aimed at disentangling processes of changes in word class between nouns and verbs in the language pair English–German with the help of a translation experiment. It included a preliminary assessment of an additional corpus analysis and thus sequentially complemented corpus-based with experimental findings. The corpus analysis provided additional information and the interpretation of the experimental component dominated.
Since the research design also included concurrent analysis of exemplary intermediate versions of the experimental product reconstructed from the keystroke logging data, it involved the integration of methods. Serbina et al.’s (2017) analysis of behavioural measures showed that shifts from verbs to nouns and nouns to verbs are similarly effortful to process. Thus, changing a word class during translation seems to require more effort in general. Heilmann et al. (2020) investigated the effect of contrastive differences on the strong preference of German for animate subjects and found that the

looser constraint on subject animacy in English affected translation choices in the translation direction English–German. An inanimate subject such as this dining table in Example 1 (subjects in bold here and henceforth) combined with a verb that implies agentivity of the subject will be perceived as an unwanted personification in German and is likely to be changed by making it an adverbial, as in the example, and/or by changing the verb. In this process, subjecthood is assigned to an animate element in the clause (12 Personen).

(1) This dining table seats 12 people.
    An diesem Esstisch passen 12 Personen.
    ‘At this dining table fit 12 persons.’

The aim of the study was to complement and corroborate connected corpus findings on the contribution of subject agentivity to changes in overall clause patterns (Serbina 2015) and more specific theme choices with the help of a process-based analysis. In addition to this sequential timing, the study involved a marginal concurrent and embedded corpus component to determine the adequate structure of stimuli in the experiment. Despite an overall corroboration of the translators’ tendency to apply changes in cases of non-agentive subjects, strategies of participants in the experiment deviated from those observed in the corpus analyses. Moreover, the study did not find any indications of more effort involved in the processing of inanimate agents in subject position. The authors interpreted this as a sign of automatization of translation shifts. This was an unexpected result that led to the closer investigation of the role of automatization of translation processes in a subsequent study (Freiwald et al. 2020). In both studies discussed so far, patterns in the experimental product were comparable with corpus-based patterns, suggesting that translational changes are linked to target language preferences. However, Heilmann et al.
(2020) also showed clear differences in translation strategy between corpus and experimental product data. The most common kind of change found in the corpus analyses involved changing the inanimate subject to another clause element like an adverbial, while translating the rest of the clause without a change in response to the mismatch between the semantics of the inanimate subject and a volitional verb implying an animate argument. By contrast, participants tended to use a verb whose semantics agreed with the inanimacy of the subject. To the best of our knowledge, such differences between corpus findings and experiments have not been addressed in empirical translation studies. One possible explanation is that they are caused by the experimental situation, as in such an artificial setting, translators may prefer more literal renderings in terms of adherence to linear

precedence of clause elements. More generally, we tend to give preference to the ecologically more valid corpus findings as the processes observed in the highly controlled experimental setting are bound to be affected by the observer’s paradox and the participants’ general desire to contribute to scientific progress. The following comparison between translations produced in the usual working environment and those produced in a laboratory environment will show whether the different production contexts may result in systematic differences in translational choices in running text.

4  Integrating corpus-linguistic and behavioural data: A study of theme and identifiability

4.1  Motivation, key constructs and previous work

In what follows, we will demonstrate the tight integration of corpus-based and experimental methods in a study aiming at pinpointing the effort of responding to contrastive differences between English and German regarding semantic and positional constraints on subjects during translating and its effect on linguistic patterns in translations. After introducing the two main linguistic concepts under investigation, namely theme and identifiability, and motivating the study with Freiwald et al.’s (2020) results, the data and method will be presented in the next subsection. Section 4.3 includes the product results, followed by the process results in Section 4.4. These results will then be discussed in Section 4.5. Our understanding of theme is based on the systemic functional approach, where it is defined as follows:

    The Theme is the element that serves as the point of departure of the message; it is that which locates and orients the clause within its context. The speaker chooses the Theme as his or her point of departure to guide the addressee in developing an interpretation of the message. (Halliday and Matthiessen 2014: 89)

The functional definition of theme as the point of departure of the message is arguably language-independent, but realization may differ between languages. In English and in German, it is realized through position. Constituents positioned early in the clause open up and contextualize the clause, on the basis of which the rest of the clause, the rheme, can be developed. However, the two languages differ in constraints relevant for the position of the subject. According to Halliday

and Matthiessen (2014: 91), theme in English extends up to and includes the first element that has experiential meaning. Experiential elements contribute to the meaning of the clause as a representation of an event, which includes processes, participants and circumstantial information. Given the strong constraint on the position of the subject before the finite verb in English, the most natural realization of experiential meaning by the subject is in an early clause position. This characterization cannot simply be transferred to German due to the strong constraint on the position of the finite verb as the second element in German independent declaratives. To account for the otherwise fairly flexible word order in German, the clause is typically conceptualized as topological fields (see Müller 2015), which are separated by the verbal unit. The first two fields, the forefield and the midfield, are separated by the finite verb. Given the finite verb-second constraint, the forefield is typically restricted to a single element, but with only very few formal or functional restrictions on this element. Possible elements include, of course, the subject (see Example 2a) but also other experiential elements like circumstantial adverbials (see Example 2b) and even single textual elements (see Example 2c). The midfield can contain a multitude of elements. According to Steiner and Teich (2004: 172–3), the forefield realizes theme; any element positioned before the finite verb is considered thematic, thus essentially making the two notions synonymous.

(2a) Forefield    Finite verb   Midfield                                 Lexical verb
     Ich          habe          gestern stundenlang einen Kuchen         gebacken.
     I            have          yesterday for hours a cake               baked

(2b) Forefield    Finite verb   Midfield                                 Lexical verb
     Gestern      habe          ich stundenlang einen Kuchen             gebacken.
     yesterday    have          I for hours a cake                       baked

(2c) Forefield    Finite verb   Midfield                                 Lexical verb
     Allerdings   habe          ich gestern stundenlang einen Kuchen     gebacken.
     however      have          I yesterday for hours a cake             baked
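The field analysis of Examples 2a–c can be made concrete with a small data structure. The following Python sketch is purely illustrative and not part of the study's tooling: the class `Clause`, the function `subject_position` and the three position labels are hypothetical names introduced here, and the segmentation into constituents is assumed to be done by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clause:
    """A German independent declarative split into topological fields."""
    forefield: List[str]     # constituents before the finite verb
    finite_verb: str         # second position (verb-second constraint)
    midfield: List[str]      # constituents between finite and lexical verb
    lexical_verb: str = ""   # clause-final non-finite verb, if any


def subject_position(clause: Clause, subject: str) -> str:
    """Return the field position of the given subject constituent."""
    target = subject.casefold()
    forefield = [c.casefold() for c in clause.forefield]
    midfield = [c.casefold() for c in clause.midfield]
    if target in forefield:
        return "forefield"
    if midfield and midfield[0] == target:
        return "first midfield position"
    if target in midfield:
        return "later midfield position"
    raise ValueError(f"subject {subject!r} not found in forefield or midfield")


# Examples (2a)-(2c) from the text, segmented by constituent.
ex_2a = Clause(["Ich"], "habe", ["gestern", "stundenlang", "einen Kuchen"], "gebacken")
ex_2b = Clause(["Gestern"], "habe", ["ich", "stundenlang", "einen Kuchen"], "gebacken")
ex_2c = Clause(["Allerdings"], "habe",
               ["ich", "gestern", "stundenlang", "einen Kuchen"], "gebacken")

for label, clause in [("2a", ex_2a), ("2b", ex_2b), ("2c", ex_2c)]:
    print(label, subject_position(clause, "ich"))
```

In this representation the subject is thematic only in 2a; in 2b and 2c it occupies the first midfield position because another element fills the forefield.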

While the German forefield can be filled with almost any kind of constituent, the midfield is subject to very specific order restrictions based, among other things, on grammatical function and length but also on information status and particularly identifiability. Identifiability refers to information that is known or inferable from the context. English and German draw on the same linguistic resources

to signal the same kinds of identifiability, namely mentally identifiable referents through definite articles and proper nouns (Kunz 2010: 54ff.), as well as situational and textual identifiability through demonstrative and possessive determiners and personal and demonstrative pronouns (Kunz 2010: 61ff.). The unmarked midfield order is identifiable before non-identifiable elements and subjects before objects and adverbials (Müller 2015). Therefore, the very first midfield position following the finite verb is the default place for identifiable noun phrases functioning as subjects – if the subject does not occur in the forefield, which is still its most frequent position. If a subject is non-identifiable and positioned in the midfield, it is more likely to occupy a later position in the clause, especially if the midfield contains identifiable objects (see Examples 3a–b; objects underlined).

(3a) Gestern habe ich ihm ein Geschenk gegeben.
     ‘Yesterday have I him a present given.’
(3b) Gestern hat ihm ein Fremder ein Geschenk gegeben.
     ‘Yesterday has him a stranger a present given.’

Freiwald et al. (2020) conducted an experiment, which aimed at analysing degrees of automatization in German-to-English professional translations of popular scientific texts in relation to theme and subject length. The stimuli in this study were subjects of varying length in post-verbal position, thus forcing the participants to move the subject before the verb in their translation. The necessary change of the subject and finite verb order was assumed to be the least effortful, most automatic translation, and deviating from this strategy should result in increased cognitive effort. Moreover, short subjects should show more deviation from the automatized translation compared to longer subjects, after a previous corpus analysis had suggested a general increase in subject length in translations from German to English.
While a difference in automatization was observed, subject length did not affect the translation process or product results (Freiwald et al. 2020: 198). To some extent, this can be attributed to the design of the study, which gave priority to the experimental part and only embedded the corpus analysis. In order to further elucidate the reasons for the increase in subject length in translations from German to English reported by Freiwald et al. (2020), we conducted a more elaborate preparatory analysis of the popular science register in the CroCo Corpus (see Section 4.2), which forms the starting point of the present study. This can be classified as a sequential study at the project level to account for unexpected results. To this end, we count subject length in the

Table 4.1  Change in Subject Length in Translations from English to German Given as the Delta in Average Length of the Translated Subject and in Percentage Points

                  Pronouns           Proper nouns       Definite noun phrases   Indefinite noun phrases
Subject length    Delta       %      Delta       %      Delta       %           Delta       %
1 word            +1.36    +136      +0.89     +89                              +2.65    +265
2–4 words         −0.25a     −7      +1.41     +62      +0.70     +27           +1.28     +46
5+ words          −5.33     −64      −1.07      −9      −2.20     −23           −2.63     −30

a Cells in italics are based on fewer than ten data points.

number of words (not accounting for differences in compound spelling) across types of nominal elements (see Table 4.1). The analysis shows that one-word subjects generally increase in length, while the length of subjects exceeding four words is reduced. A more differentiated picture emerges when comparing the types of nominal elements. Except for very long subjects, indefinite noun phrases display a clear tendency to increase length in translation. This is less pronounced for pronouns, proper nouns and definite noun phrases, that is, for the types of nominal elements which capture identifiable referents. The corpus analysis thus suggests that an increase in length is dependent not only on the length of the original subject but also on its form and particularly on identifiability. Since the stimuli subjects in Freiwald et al. (2020) varied neither in form nor in identifiability, it is not surprising that the length of the subjects did not affect their results. The relationship between subject identifiability and clause position represents a contrastive difference between English and German and is thus particularly relevant in English-to-German translation. In English, most subjects are the theme and also identifiable, but there is no strict constraint on non-identifiable referents in subject position. Likewise, in German, there is no restriction on subject identifiability if the subject is thematic. If the subject is placed after the finite verb, though, its information status is highly relevant for its midfield positioning. Non-identifiable subjects can thus constitute difficulties in translations from English into German. If the non-identifiable subject is the only theme element in English, its thematic status can be preserved in the German translation by simply placing it in the forefield (Example 4). 
If the non-identifiable subject theme in English is accompanied by another textual or interpersonal thematic element (Halliday and Matthiessen 2014: 105ff.), only one of these theme elements can occupy the forefield in the German translation. Thus, the translator is forced to choose one of five main options3 to

resolve this translation problem, as illustrated by Example 5, in which the English clause starts with a stance adverbial as an interpersonal theme followed by the subject as the experiential theme: (1) remove the other theme element from the early position and use the non-identifiable subject as theme (Example 5a); (2) keep the first theme element in forefield position and move the subject to a late midfield position (Example 5b); (3) keep the first theme element in forefield position and make the German subject identifiable (Example 5c); (4) keep the first theme element in forefield position and place the subject in a marked early midfield position (Example 5d); (5) change the subject entirely, for example, by making the object the subject of the passive alternation (Example 5e).

(4) Scientists believe that this atmospheric cycle may include the raining of liquid hydrocarbons.
    Wissenschaftler vermuten, daß es in diesem atmosphärischen Kreislauf flüssige Kohlenwasserstoffe regnet.
    ‘Scientists believe that it in this atmospheric cycle liquid hydrocarbons rains.’

(5) Strangely, a man in black wrote this on the wall.
    (a) Ein Mann in schwarz hat das auf die Wand geschrieben.
        ‘A man in black has this on the wall written.’
    (b) Komischerweise hat das ein Mann in schwarz auf die Wand geschrieben.
        ‘Strangely has this a man in black on the wall written.’
    (c) Komischerweise hat der Mann in schwarz das auf die Wand geschrieben.
        ‘Strangely has the man in black this on the wall written.’
    (d) Komischerweise hat ein Mann in schwarz das auf die Wand geschrieben.
        ‘Strangely has a man in black this on the wall written.’
    (e) Komischerweise wurde das auf die Wand geschrieben.
        ‘Strangely was this on the wall written.’

Freiwald et al.’s (2020) inconclusive results, together with our corpus findings indicating identifiability-related constraints, hence motivate further testing of the translation of different kinds of subjects.
As length by itself was not shown to be a good single predictor of translation change, we will focus on more detailed aspects of the subject, namely identifiability, position and thematic status. We hypothesize that non-identifiable subjects in English will not only lead to more translation shifts regarding identifiability in the product but also require more

cognitive effort during the translation process due to the number of available translation strategies. It is unlikely that all of these possible strategies are equally probable. Niemietz et al. (2017) have shown that German translators try to preserve the theme order of the original as much as possible, which makes Examples 5c and 5d the most likely translation strategies. We study translations in terms of translation shifts (changes of the lexico-grammatical categories or functions of the source text; ST) similarly to Catford (1965). Technically, all five translation strategies in Example 5 constitute a kind of translation shift. However, given the amount of data, we only focus on changes to the subject itself regarding its identifiability and reference. Thus, 5c and 5e represent translation shifts with respect to these aspects, while Example 4 as well as 5a, 5b and 5d do not. To this end, the present study will further integrate the corpus approach with an experimental approach (eye-tracking and keystroke logging) giving it equal weight. Product and process analyses complement each other to gain comprehensive insight into which theme and identifiability choices translators make and how effortful these choices are.

4.2  Material, method and annotation

The corpus data comes from the CroCo Corpus, a bidirectional translation corpus of English and German original and translated texts. The corpus consists of one million words from eight different registers: political essays, fictional texts, instruction manuals, popular scientific writings, letters to shareholders, prepared speeches, tourism brochures and websites (for a detailed characterization of the corpus, see Hansen-Schirra et al. 2012). Our analysis concentrates on English original declarative clauses and their aligned German translations in the register of popular scientific writing (POPSCI; ST tokens: 35,148; TT tokens: 33,603) as the one most comparable with the registers used in the process-based studies (see below).

In order to test the effects of subject identifiability on the translation process, we use existing data from four process-based studies collated by Heilmann (2021).4 The experiments recorded behavioural measures and translations of short English expository texts (four to eleven sentences) reflecting popular scientific writing, news and handbook articles. Like the texts from the CroCo Corpus, the experimental data can be characterized as expository writing. Even though some of the texts cannot be classified as popular scientific writing specifically, we believe that the shared expository goal of the texts makes them similar enough to the CroCo data to permit meaningful comparison. The four studies combined include data from seventy-three participants (twenty-nine trained translators, thirteen translation students, thirty-one untrained participants). Due to the disparities of experience, we control for effects of experience statistically in the process-based analyses. Holding the effects of experience constant makes the experiment data more comparable to the corpus data, which does not contain any information on the translators’ experience. The keystroke logging data in the experiments was collected with Translog 2006 and Translog-II. Gaze behaviour was recorded with Tobii TX300 eye-trackers.

Subject identifiability was annotated manually. Identifiable subjects represent either given information or information that can be inferred from the context. They are identified as noun phrases that contain a definite article (e.g. the cells), a demonstrative or possessive determiner (e.g. this evidence, its instruments) or are proper nouns. Textually evoked subjects realized as personal and demonstrative pronouns are also labelled as identifiable. Nominal groups with an indefinite article or no article at all are labelled as non-identifiable subjects (e.g. a mission, introns). Non-referential subjects occurring in existential clauses, cleft and pseudo-cleft constructions and embedded clauses acting as subjects are categorized as other. The subjects in translated German are also annotated for position in the clause distinguishing between forefield and midfield subjects, which are further differentiated for position immediately after the finite verb and any later midfield position. Translation strategies are annotated for subjects that (1) remained unchanged both formally and functionally (Example 6), (2) were changed formally but retained their identifiability and reference (Example 7), (3) underwent a change in identifiability but retained their reference (Example 8) and (4) lost their original reference (Example 9).

(6) Cassini also produced a remarkable set of images of Jupiter [. . .].
    Cassini lieferte auch bemerkenswerte Bilder von Jupiter [. . .]. (E2G_POPSCI_001)
    ‘Cassini produced also remarkable pictures of Jupiter [. . .].’

(7) All those regions talk to the reward pathway by releasing the neurotransmitter glutamate.
    Ihre Anweisungen signalisieren sie dem Belohnungssystem durch den Botenstoff Glutamat. (E2G_POPSCI_003)
    ‘Their order signal they [referring to the regions] to the reward system through the neurotransmitter glutamate.’

(8) Americans today choose among more options in more parts of life than ever before.

    Heutzutage können wir unter mehr Angeboten auswählen als je zuvor [. . .]. (E2G_POPSCI_005)
    ‘These days can we between more options choose than ever before [. . .].’

(9) Geologists agree that the risk of arsenic in deep aquifers is low [. . .].
    Nach einmütiger Einschätzung der Geologen dürfte das untere Vorkommen kaum von Arsen verunreinigt sein. (E2G_POPSCI_002)
    ‘According to unanimous assessment of the geologists should the deeper occurrence barely of arsenic polluted be.’

We analysed how identifiability of the ST subject and its position affected the identifiability status of the translated subjects. To this end, we ran binomial generalized mixed regression models with the corpus- and experiment-based product data. We followed up on this analysis with two linear mixed regression models to predict the behavioural measures of reading time and translation duration. These models enabled us to see how translators react to identifiability of the ST subject. In a subsequent behavioural analysis, we focused on the effects of the actual translation choices. We analysed reading and typing responses with respect to the choice of an identifiable or non-identifiable subject at different subject positions. We used the lme4 package for linear mixed modelling (Bates et al. 2015) in R (R Core Team 2020). The R package lmerTest (Kuznetsova et al. 2015) was used for significance testing. All linear models were checked for multicollinearity, skew and kurtosis of residuals. Multicollinearity is a potentially detrimental condition in which two or more predictors are correlated among each other and with the dependent variable. This increases the standard errors of the variables and can lead to false-negative results. Residuals represent the difference between the model’s estimated values and the observed values. Several assumptions have to be met in linear mixed regression modelling to make the results trustworthy.
Residuals have to be scattered equally across the range of observed values (homoskedasticity) and their distribution should be normal, that is, bell shaped (which amounts to a skewness of 0 and kurtosis of 3). An absolute skewness above 2 and a kurtosis above 7 were taken as indications of a severe deviation from the normality assumption regarding model residuals (Kim 2013). Data points whose standardized residual exceeded three standard deviations in absolute value were identified as outliers. If the removal of outliers changed the statistical significance of a predictor in the model or resulted in a sign change, we report both results for reasons of transparency.
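The residual checks described above can be illustrated with a short Python sketch. It is not the authors' procedure, which was carried out in R; the function names, the toy residuals and the use of the residuals' own standard deviation for standardization are assumptions made for illustration only. The thresholds (absolute skewness of 2, kurtosis of 7, three standard deviations) are those named in the text.

```python
import math
from typing import List, Sequence


def central_moment(values: Sequence[float], order: int) -> float:
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values: Sequence[float]) -> float:
    """Moment-based skewness; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values: Sequence[float]) -> float:
    """Moment-based kurtosis; 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def severe_non_normality(residuals: Sequence[float],
                         max_abs_skew: float = 2.0,
                         max_kurtosis: float = 7.0) -> bool:
    """True if skewness or kurtosis exceeds the thresholds named in the text."""
    return (abs(skewness(residuals)) > max_abs_skew
            or kurtosis(residuals) > max_kurtosis)


def outlier_indices(residuals: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Indices of residuals more than `threshold` SDs from the mean.

    Assumption: residuals are standardized by their own population SD.
    """
    mean = sum(residuals) / len(residuals)
    sd = math.sqrt(central_moment(residuals, 2))
    return [i for i, r in enumerate(residuals) if abs((r - mean) / sd) > threshold]


# Hypothetical residuals: twenty small values and one extreme value.
toy_residuals = [0.1, -0.1] * 10 + [5.0]
print(severe_non_normality(toy_residuals))
print(outlier_indices(toy_residuals))
```

A model would then be refitted without the flagged data points and both sets of results reported if significance or the sign of an estimate changes.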

Table 4.2  Product Summaries for the Frequencies of Identifiable and Non-identifiable Subjects

                                            Corpus data                                 Experimental product data
                                            English originals     German translations   English originals     German translations
                                               Count          %      Count          %      Count          %      Count          %
Subjects in total                                969                   969                 1,700                 1,700
  Identifiable                                   537       55.4        572       59.0        900       52.9        963       56.7
  Non-identifiable                               354       36.5        305       31.4        800       47.1        599       35.2
  Other                                           78        8.1         92        9.5          0        0.0        138        8.1
Pre-verbal subjects                              969                   534                 1,700                 1,314
  Identifiable                                   537       60.3        317       65.1        900       52.9        801       61.0
  Non-identifiable                               354       39.7        170       34.9        800       47.1        513       39.0
Post-verbal subjects                               0                   435                     0                   322
  Identifiable                                     0                   255       65.4          0                   162       65.3
  Non-identifiable                                 0                   135       34.6          0                    86       34.7
Subjects in immediate post-verbal position         0                   347                     0                   254
  Identifiable                                     0                   240       69.2          0                   137       63.1
  Non-identifiable                                 0                   107       30.8          0                    80       36.9
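The relation between the counts and percentages in Table 4.2, and the percentage-point comparisons that can be drawn from them, can be sketched as follows. This Python example is only an illustration: the helper names are hypothetical, and the choice of base (all subjects for the overall rows, identifiable plus non-identifiable subjects for the positional rows) is an assumption inferred from the printed figures, which agree with the recomputed shares to within rounding.

```python
def share(count: int, base: int) -> float:
    """Percentage of `count` relative to `base`."""
    return 100.0 * count / base


def pp_difference(pct_a: float, pct_b: float) -> float:
    """Difference between two printed percentages, in percentage points."""
    return round(pct_a - pct_b, 1)


# (identifiable subjects, assumed base, percentage printed in Table 4.2)
rows = {
    "corpus, English originals": (537, 969, 55.4),
    "corpus, German translations": (572, 969, 59.0),
    "experiment, English originals": (900, 1700, 52.9),
    "experiment, German translations": (963, 1700, 56.7),
    "corpus, German post-verbal": (255, 255 + 135, 65.4),
    "corpus, German immediately post-verbal": (240, 240 + 107, 69.2),
}

for label, (count, base, printed) in rows.items():
    recomputed = share(count, base)
    print(f"{label}: recomputed {recomputed:.2f}, printed {printed}")

# Corpus vs. experiment, English originals and German translations.
print(pp_difference(55.4, 52.9))
print(pp_difference(59.0, 56.7))
# Immediately post-verbal vs. all post-verbal identifiable subjects (corpus).
print(pp_difference(69.2, 65.4))
```

Taking differences between the printed, already rounded percentages rather than the unrounded shares is a design choice made here so that the results can be compared directly with figures quoted from the table.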

4.3  Results of the corpus-based and experimental product data

Table 4.2 shows the distribution of subjects and their identifiability status in the corpus and the experimental product data. In total, the corpus data includes 969 subjects. The 1,700 subjects analysed in the experimental product data include multiple translations of the same text by different translators; that is, they stem from only 130 unique grammatical subjects. Similarly, the CroCo Corpus data consists of about a hundred subjects per text translated by the same translator. Thus, the descriptive values for the product data in Table 4.2 will likely display a sample-based bias. We therefore used linear mixed regression modelling for inferential testing, factoring out random variation introduced by repeated measures. A fixed effects structure that controls for confounding from other sources helps to measure the independent effect of interest with higher accuracy.

As summarized in Table 4.2, corpus data and experimental product data show the same tendencies regarding the distribution of identifiability between originals and translations as well as identifiability between pre- and post-verbal position in German. Subjects in German translations are more likely to be identifiable than in English originals by 3.6 percentage points in the corpus data and 3.8 percentage points in the experimental product data. Moreover, the distribution of identifiable and non-identifiable subjects in German does not change noticeably between the pre- and post-verbal position. However, the proportion of identifiable subjects is lower in the experimental product data compared to the corpus data. This is true for both English originals (2.5 percentage points) and German translations (2.3 percentage points) as well as for all German clause positions. This difference may be caused by the text material.
The corpus data consists of longer texts, whereas most of the texts used in the laboratory experiment are limited to three or four sentences. Identifiability is often established by referencing information previously introduced in the text. It is therefore unsurprising that a shorter text contains fewer identifiable subjects. In the German corpus data, identifiable subjects are more likely to immediately follow the finite verb compared to all post-verbal identifiable subjects (difference: 3.8 percentage points). This is again due to information sequencing in the German midfield (Müller 2015). If an identifiable element is in post-verbal position, it is much more likely to be positioned close to the finite verb and in front of non-identifiable elements. This is why the number of identifiable subjects increases immediately after the finite verb while non-identifiable subjects assume a variety

of positions within the German midfield. This is, however, not true in the experimental product data, where the relative frequency of identifiable subjects immediately following the finite verb is lower than that of all post-verbal identifiable subjects (difference: 2.2 percentage points). To assess how source item identifiability interacts with target item identifiability, we ran two generalized mixed regression models, one for the corpus data and one for the experimental product data, and predicted target item identifiability by source item identifiability. The response variable Identifiability is a binary variable with the levels identifiable and non-identifiable. As explained above, we included random effects for the repeated measures of Source Text, Participant and Item (the unique subjects under investigation). We also included the following control variables:

1. SubjPos: position of the target text subject as either thematic (theme), the first element after the finite verb (FirstPosAfterFin) or any other later position (AnywhereElse).
2. TextSTLen: source text subject length in characters.
3. Study: variance introduced by the four studies is accounted for by a categorical predictor (only relevant for the experimental product data).
4. Experience: translation experience may affect linguistic decisions and is modelled as the number of years participants worked professionally as translators (available only for the experimental product data).

Neither model showed signs of overdispersion. There was a strong and statistically significant effect of source item identifiability on target item identifiability in the experimental product data as well as the corpus data. Identifiability status is kept in translation.
In the experimental product data, there was no significant effect Table 4.3  Target Identifiability by Source Identifiability and Target Position (Experimental Product Data) IdentifiabilityTT (Intercept) IdentifiabilityST non-identifiable SubjPosFirstPosAfterFin SubjPosAfterFiniteNewTheme Experience StudyPB08 StudySG12 StudyTM16 TextSTLen

Estimate

Std. error

z value

p(>|z|)

−8.326 10.261 1.03 2.081 −0.051 0.893 −0.573 −0.49 −0.542

1.427 1.214 1.184 1.25 0.238 1.042 0.472 0.602 0.406

−5.832 8.453 0.87 1.665 −0.215 0.857 −1.215 −0.814 −1.335