Multilingual Digital Humanities (Digital Research in the Arts and Humanities)
ISBN: 1032491949, 9781032491943


English | 296 pages [246] | 2023


Table of contents:
Cover
Half Title
Series
Title
Copyright
Contents
List of Figures
List of Tables
List of Contributors
Acknowledgements
Introduction
Part I Multilingual/Multicultural Theory and Practice
1 A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian
2 Diversifying Digital Biodiversity Knowledge: A Latin American Multilingual Perspective on the Biodiversity Heritage Library
3 Applications and Developments of NLP Resources for Text Processing in Indian Languages: Shared Multilingual Corpora Building and Pre-trained Models
Part II Pedagogy
4 Doing Digital Humanities in the Modern Languages Classroom
5 Digital Learning Environments for SLA: Learning Analytics and the Construction of Knowledge
6 Pedagogy and Praxis in Libraries: Natural Language Processing for Non-English Texts
7 Bridging the Gap Between Digital Humanities and Natural Language Processing: A Pedagogical Imperative for Humanistic NLP
Part III Language Models
8 Linguistic Injustice in Multilingual Technologies: The TenTen Corpus Family as a Case Study
9 Typological Challenges for the Application of Multilingual Language Models in the Digital Humanities
10 Data Scarcity and Methodological Limitations in Multilingual Analysis of News Articles Published in Brazil
Part IV Methods and Infrastructure
11 Multilingual Interfaces for All? Localisation Strategies in Proyecto Humboldt Digital
12 Towards Multilingually Enabled Digital Knowledge Infrastructures: A Qualitative Survey Analysis
13 Digital Approaches to Multilingual Text Analysis: The Dictionnaire de la langue franque and Its Morphology as Hybrid Data in the Past
Index

Citation preview

Multilingual Digital Humanities

Multilingual Digital Humanities explores the impact of monolingualism—especially Anglocentrism—on digital practices in the humanities and social sciences. The volume explores a wide range of applied contexts, such as digital linguistic injustice, critical digital literacy, digital learning, digital publishing, low-resourced, minoritised or endangered languages in a digital space, and multilingual historical intertextuality. These discussions are situated within wider work on language technologies, language documentation, and international (in particular European) language-based infrastructure creation. Drawing on both primary and secondary research, this four-part book features 13 diverse case studies of infrastructural projects, pedagogical resources, computational models, interface building, and publishing initiatives in a range of languages, including Arabic, French, Russian, Portuguese, Italian, German, Spanish, Bengali, Hindi, Malayalam, and Tamil. All the debates are contextualised within a wider cultural frame, thus bridging the gap between the linguistic focus of the multilingual initiatives and wider discussion of cultural criticism in DH. Multilingual Digital Humanities recognises the digital as a culturally situated and organic multilingual entity embedding past, present, and future worlds, which reacts to and impacts on institutional and methodological frameworks for knowledge creation. It is essential reading for students, scholars, and practitioners working in digital humanities and digital studies.

Lorella Viola is Research Associate in Linguistics and Digital Humanities at the Centre for Contemporary and Digital History, University of Luxembourg. She researches the implications of the digital for the conceptualisation of digital objects, digital practices, and digital knowledge production with a focus on heritage, material culture, and preservation. She also investigates the relationship between language, media, and society and develops critical, data-driven methodologies for digital humanities and digital heritage.

Paul Spence is Reader in the Department of Digital Humanities (DDH) at King’s College London, UK. His background is in modern languages, digital pedagogy, digital publishing, and structured knowledge representation. His research has recently focused on interactions between languages, multilingualism, linguistic diversity, and digital practice. He researches digital transformations in how we engage with languages, while also analysing the power of language to disrupt digital monolingualism in knowledge infrastructures, methods, and data.

Digital Research in the Arts and Humanities
Founding Series Editors: Marilyn Deegan, Lorna Hughes and Harold Short
Current Series Editors: Lorna Hughes, Nirmala Menon, Andrew Prescott, Isabel Galina Russell, Harold Short and Ray Siemens

Digital technologies are increasingly important to arts and humanities research, expanding the horizons of research methods in all aspects of data capture, investigation, analysis, modelling, presentation and dissemination. This important series covers a wide range of disciplines with each volume focusing on a particular area, identifying the ways in which technology impacts on specific subjects. The aim is to provide an authoritative reflection of the ‘state of the art’ in technology-enhanced research methods. The series is critical reading for those already engaged in the digital humanities, and of wider interest to all arts and humanities scholars.

The following list includes only the most recent titles to publish within the series. A list of the full catalogue of titles is available at: www.routledge.com/Digital-Research-in-the-Arts-and-Humanities/book-series/DRAH

Transformative Digital Humanities: Challenges and Opportunities
Edited by Mary McAleer Balkun and Marta Mestrovic Deyrup

Medieval Manuscripts in the Digital Age
Edited by Benjamin Albritton, Georgia Henley and Elaine Treharne

Access and Control in Digital Humanities
Edited by Shane Hawkins

Information and Knowledge Organisation in Digital Humanities: Global Perspectives
Edited by Koraljka Golub and Ying-Hsang Liu

Networks and the Spread of Ideas in the Past: Strong Ties, Innovation and Knowledge Exchange
Edited by Anna Collar

Multilingual Digital Humanities
Edited by Lorella Viola and Paul Spence

For more information about this series, please visit: www.routledge.com/Digital-Research-in-the-Arts-and-Humanities/book-series/DRAH

Multilingual Digital Humanities Edited by Lorella Viola and Paul Spence

First published 2024 by Routledge
4 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
605 Third Avenue, New York, NY 10158

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2024 selection and editorial matter, Lorella Viola and Paul Spence; individual chapters, the contributors

The right of Lorella Viola and Paul Spence to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-1-032-49194-3 (hbk)
ISBN: 978-1-032-49417-3 (pbk)
ISBN: 978-1-003-39369-6 (ebk)

DOI: 10.4324/9781003393696

Typeset in Times New Roman by Apex CoVantage, LLC

Contents

List of Figures  viii
List of Tables  x
List of Contributors  xi
Acknowledgements  xiii

Introduction  1
Paul Spence and Lorella Viola

Part I  Multilingual/Multicultural Theory and Practice  15

1  A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian  17
Jennifer Isasi, Riva Quiroga, Nabeel Siddiqui, Joana Vieira Paulino, and Alex Wermer-Colan

2  Diversifying Digital Biodiversity Knowledge: A Latin American Multilingual Perspective on the Biodiversity Heritage Library  31
Lidia Ponce de la Vega

3  Applications and Developments of NLP Resources for Text Processing in Indian Languages: Shared Multilingual Corpora Building and Pre-trained Models  48
Justy Joseph, Lalithsriram SR, and Nirmala Menon

Part II  Pedagogy  59

4  Doing Digital Humanities in the Modern Languages Classroom  61
Susanna Allés-Torrent

5  Digital Learning Environments for SLA: Learning Analytics and the Construction of Knowledge  83
Alice Gasparini

6  Pedagogy and Praxis in Libraries: Natural Language Processing for Non-English Texts  103
Ian Goodale

7  Bridging the Gap Between Digital Humanities and Natural Language Processing: A Pedagogical Imperative for Humanistic NLP  114
Toma Tasovac, Nick Budak, Natalia Ermolaev, Andrew Janco, and David Lassner

Part III  Language Models  127

8  Linguistic Injustice in Multilingual Technologies: The TenTen Corpus Family as a Case Study  129
David Bordonaba-Plou and Laila M. Jreis-Navarro

9  Typological Challenges for the Application of Multilingual Language Models in the Digital Humanities  145
Marcell Richard Fekete, Johannes Bjerva, and Lisa Beinborn

10  Data Scarcity and Methodological Limitations in Multilingual Analysis of News Articles Published in Brazil  165
Caio Mello

Part IV  Methods and Infrastructure  181

11  Multilingual Interfaces for All? Localisation Strategies in Proyecto Humboldt Digital  183
Antonio Rojas Castro

12  Towards Multilingually Enabled Digital Knowledge Infrastructures: A Qualitative Survey Analysis  197
Alíz Horváth, Cornelis van Lit, Cosima Wagner, and David Joseph Wrisley

13  Digital Approaches to Multilingual Text Analysis: The Dictionnaire de la langue franque and Its Morphology as Hybrid Data in the Past  213
Josh Brown

Index  230

Figures

1.1  Total visits  21
1.2  Visits ranking  21
1.3  Americas  22
1.4  Brazil Portugal  26
2.1  Language representation in BHL China (left), BHL Singapore (middle), and BHL Africa (right)  34
2.2  Language representation in BHL SciELO (left) and BHL México (right)  35
2.3  Traffic shares by country to BHL, CONABIO, and Bioteca. Comparative data from Similarweb, April–June 2020  36
2.4  Co-occurrence network for Subset 1. Generated on KHCoder3 with data from BHL as of July 2021  41
2.5  Selected subjects and co-occurrence network for Subset 2. Generated on KHCoder3 with data from BHL as of July 2021  43
2.6  Word frequencies in subject lists in Subset 1 (left) and Subset 2 (right). Generated on KHCoder3 with data from BHL as of July 2021  44
3.1  Lexical similarities  54
3.2  Evolution of Indic scripts  56
3.3  Indic scripts  57
5.1  WordPress’ homepage  92
5.2  WordPress’ homepage heatmap  92
5.3  Moodle’s homepage  93
5.4  Moodle’s homepage heatmap  94
5.5  Main interactions recorded in the WordPress-based learning environment  95
5.6  Main interactions recorded in the Moodle-based learning environment  97
6.1  One of the Jupyter notebooks for the Non-English NLP Tutorial, with instructional text and code instructing users on how to perform named entity recognition and part-of-speech tagging on their non-English text  107
6.2  The GitHub repository of the Analyzing Péguy project, showing how the code and folders are organized  110
8.1  Distribution of topics in the enTenTen20 and arTenTen12 corpus  135
8.2  TenTen Corpus Family relative size  135
9.1  Language models cluster linguistic elements according to meaning using co-occurrence information from shared contexts: here the two meanings of ‘trunk’, trunk1 for nose of an elephant and trunk2 for swimming shorts, and words that may appear in their immediate contexts  147
9.2  Illustration of a language adapter encapsulated between two mBERT layers. The output of the bottom layer passes through the language adapter before it reaches the top layer. During fine-tuning, mBERT layers are frozen, and only the language adapter weights are updated  151
10.1  Network visualisation of entities generated using Spacy combined with Wikifier. Entities are connected according to how many times they are mentioned together in the same news article. This is therefore a narrative representation of the entities. The more an entity is mentioned, the bigger its node is  171
10.2  Variation of sentiment in the 464 news titles originally published in Portuguese in comparison with their translated version in English  173
10.3  Depiction of Shapley values for the original sentence. On the left side are the words classified as negative and on the right side, those classified as positive. The darker the word, the stronger the sentiment score  175
10.4  Depiction of Shapley values for the sentence replacing ‘great’ with ‘big’  175
10.5  Depiction of Shapley values for the sentence replacing ‘great’ with ‘huge’  175
10.6  Depiction of Shapley values for the sentence replacing ‘legacy’ with ‘impact’  176
11.1  ediarum.PROHD.edit graphical user interface  190
11.2  The GUI of ‘Digital Dossier: Alexander von Humboldt and Cuba (1800–1830)’  192
12.1  A screenshot of the introduction to our 2020 survey  200
12.2  This bubble graph illustrates the frequency of languages and scripts used by the respondents in question #1 of the survey. Visualisation in Tableau by Horváth  201
13.1  Title page of the 1830 Dictionnaire, printed in Marseille. Bibliothèque nationale de France  218
13.2  The first page of the Dictionnaire showing French and MLF. Bibliothèque nationale de France  219

Tables

3.1  Languages supported by NLP libraries in Indian languages  50
3.2  Tasks performed by the NLP tools  51
3.3  Word sharing from proto-language  52
3.4  Loan words across language families  53
3.5  Morphological complexity  55
5.1  The syllabus designed for the units  89
5.2  Number of total visits tracked for WordPress and Moodle  91
7.1  Kanbun NER component performance  124
8.1  Word Sketch analysis results  138
12.1  Summary of tasks and problems mentioned by respondents in the 2020 survey  202
12.2  Mapping potential problem spaces with sets of non-Latin script languages  208
13.1  First ten words of Italian vocabulary with stemmed forms generated by Snowball  222
13.2  d-suffixes of Romance language stemmers  223

Contributors

Susanna Allés-Torrent, University of Miami, USA
Lisa Beinborn, Vrije Universiteit Amsterdam, the Netherlands
Johannes Bjerva, Vrije Universiteit Amsterdam, the Netherlands
David Bordonaba-Plou, Universidad Complutense de Madrid, Spain
Josh Brown, University of Western Australia, Australia
Nick Budak, Princeton University, Stanford University Libraries, Whitman College
Natalia Ermolaev, Center for Digital Humanities at Princeton, USA
Marcell Richard Fekete, Vrije Universiteit Amsterdam, the Netherlands
Alice Gasparini, Università G. D’Annunzio, Pescara, Italy
Ian Goodale, University of Texas at Austin, USA
Alíz Horváth, Eötvös Loránd University, Hungary
Jennifer Isasi, Pennsylvania State University, USA
Andrew Janco, Haverford College, USA
Justy Joseph, Indian Institute of Technology, India
Laila M. Jreis-Navarro, Universidad de Zaragoza, Spain
David Lassner, Technische Universität Berlin, Germany
Cornelis van Lit, Utrecht University, the Netherlands
Caio Mello, Centre for Advanced Studies, CAIS College, Germany
Nirmala Menon, Indian Institute of Technology, India
Joana Vieira Paulino, Universidade Nova de Lisboa, Portugal
Riva Quiroga, Catholic University of Chile, Chile
Antonio Rojas Castro, Berlin-Brandenburgische Akademie der Wissenschaften, Germany
Nabeel Siddiqui, Susquehanna University, USA
Lalithsriram SR, Indian Institute of Technology, India
Toma Tasovac, DARIAH-EU, Europe
Lidia Ponce de la Vega, Environment and Sustainability Program (ENSP), William & Mary, USA
Cosima Wagner, Freie Universität Berlin, Germany
Alex Wermer-Colan, Temple University, USA
David Joseph Wrisley, NYU Abu Dhabi, UAE

Acknowledgements

The editors are grateful to Quinn Dombrowski and Naomi Wells for their support and insights, as they are to all of their fellow colleagues in the numerous multilingual networks they are involved in. They are also immensely grateful to the authors for their contributions and their participation in the peer-to-peer review process.

Introduction
Paul Spence and Lorella Viola

Why a Book on Multilingual DH?

As we write this introduction, the topic of multilingualism in digital humanities (DH) seems to be gaining momentum. In recent years, numerous initiatives have been launched to address linguistic bias in the field and in digital scholarship more generally. Multilingual projects as well as community and special interest groups have proliferated, and in an international context where English is now accepted as a de facto lingua franca, space has increasingly been given over to debates in or about other languages, including at conferences and workshops. And yet, despite greater awareness around issues of power in digital practices (Viola 2023; O’Sullivan 2022; Lee 2020; Williams 2019; McPherson 2019; Earhart 2019; Risam 2015), archival biases, silences in the archives (Noble 2019; Mandell 2019), and lack of language diversity within the context of digitisation and digital scholarship (Spence and Brandao 2021; Viola 2021; Fiormonte 2012; Galina Russell 2014), geo-culturally peripheral work and minority voices continue to remain largely excluded from the wider scholarly conversation. DH still looks very much biased towards English.

Significant indications of change and efforts towards inclusivity in digital practices include the emergence of multilingual and localised digital content and platforms (such as the Programming Historian and Proyecto Humboldt Digital, respectively, Chapter 1 and Chapter 11 in this volume), increasing attention to machine translation across languages (e.g. Chapter 8), multilingual natural language processing resources (e.g. Chapter 3 and Chapter 9), and infrastructures (e.g. Chapter 12). Such endeavours contribute greatly to bridging the cultural and language gap and to promoting inclusivity in digital spaces.

However, even if DH positions itself institutionally as global, still ‘there can be little doubt that most instruments of global knowledge production strongly favour first English, and then a small number of high resource languages’ (Spence and Brandao 2021, 11). The Anglocentrism of academic practices and scholarship is of course a long-standing issue within the academic debate, but it is only more recently, with the increasingly prominent role of technology in every aspect of knowledge production, that this issue has come more sharply into focus. As the digital landscape is unequivocally shaped by tech giants from English-speaking nations, particularly the United States, through search engines, software, tools, and social media platforms, the imposition of Anglocentric digital practices continues to be perpetuated, influencing cultural norms and standards, creating a template against which other cultures are often measured, and resulting in a homogenisation of cultural expressions that ultimately dilutes linguistic and cultural diversity. Such concentration of power can lead to the devaluation or marginalisation of non-Anglophone cultural practices and perspectives, in turn reinforcing a power dynamic in which English-speaking cultures are seen as more legitimate, influential, and authoritative.

Unsurprisingly, it is not just scientific knowledge that is affected by Anglocentrism; wider aspects of academic practice bear the consequences of this linguistic and cultural polarisation too, including material labour, time, money, and career development (Hohti and Truman 2021; Leloup and White 2021). Computational resources available for languages other than English remain, on the whole, scarce, and this becomes far more acute for languages at the lower-population (or lower-resourced) end of the spectrum. Such Anglophone-centricity acts as a barrier to researchers, teachers, and curators who work on, or teach with, materials in languages other than English. This includes academia, libraries, museums, and heritage and cultural institutions more widely. Indeed, the comparative lack of computational resources in other languages often dictates which tasks can be performed, with which tools, and through which platforms (Viola 2023). Moreover, even when adaptations for other languages may be possible, identifying which changes should be implemented and, perhaps more importantly, understanding the impacts that these may have is often unclear. This typically means that one must inevitably make methodological compromises which are less than ideal (ibid.). This English bias often effectively amounts to an unspoken monocultural imposition predicated on a wide range of factors, including a lack of stable tools for working with non-Latin scripts and a lack of awareness of DH scholarship from non-Anglophone countries.

This edited volume explores the impact of Anglocentrism on digital practices, including its contemporary and future relevance and its cultural impact, with a special focus on the humanities and social sciences. It situates this discussion within wider work on language technologies, language documentation, and international (in particular European) language-based infrastructure creation, but does so within a wider cultural frame, reflecting the multidisciplinary nature of DH. In this way, it bridges a gap between the linguistic focus of such initiatives and wider discussion of cultural criticism in DH. The chapters in this collection significantly contribute to a growing interest in a more globally representative and multilingual scholarly communications ecosystem in current academic debates. As they bring together, advance, and reflect on recent work on the social and cultural relevance of multilingualism for scholarship, pedagogy, and public engagement around digital resources, methods, platforms, infrastructures, and computational tools, these contributions crucially intersect with a number of different areas and disciplines exploring questions of monoculturalism, monolingualism, linguistic injustice, privilege and divides, and cultural and linguistic biases in academic practices and beyond.

This four-part book explores a range of theoretical reflections on digital transdisciplinary approaches to multilingual and digital research and its intercultural implications. Drawing on both primary and secondary research, it features 13 case studies of infrastructural projects, pedagogical resources, computational models, interface building, and publishing initiatives, drawn from a range of languages—Arabic, French, Russian, Portuguese, Italian, German, Spanish, Bengali, Hindi, Malayalam, and Tamil—and exploring a wide range of applied contexts, such as digital linguistic injustice, critical digital literacy, digital learning, digital publishing, low-resourced, minoritised, or endangered languages in a digital space, and multilingual historical intertextuality. With its rich and diverse range of perspectives, languages, and approaches, this book answers the urgent need to recognise the digital not as something which is ‘inanimate’ but as a culturally situated and organic entity embedding past, present, and future worlds (Cameron 2021), which reacts to and impacts on institutional and methodological frameworks for knowledge creation, and therefore bears important consequences for how we carry out research about culture and society (Viola 2023).

We are aware of the contradiction of proposing a book in English to counteract the predominant bias towards English. We argue, however, that this book will in fact disrupt DH monolingualism by raising the visibility of DH theory and practice in and about languages other than English and by providing other-than-English perspectives in this English-centric DH landscape. It does so by presenting work on the development and application of multilingual and multicultural digital methods and infrastructures; research into the nature and implications of studying diverse forms and processes of multilingual and multicultural research in the digital space in the context of translation studies, transborder and transcultural studies, and language pedagogy and education; multilingual challenges in the application of NLP in humanities and social sciences research; opportunities and risks for low-resourced, minoritised, or endangered languages in a digital space; and critical reflections about the contours and geolinguistically framed definitions of ‘digital’, especially with reference to multilingualism in DH practices.

The Emergence of Multilingual DH as a Focus

The digital humanities have always had a close relationship to academic fields studying languages and cultures, and while they have not always been as visible as their anglophone counterparts, global digital humanities communities functioning in languages other than English have enjoyed a long and rich history in many parts of the world, such as Japan or Italy. There have also been historic attempts to promote multilingualism in DH, such as the work of ADHO’s Multilingualism and Multiculturalism Committee (MLMC) in offering translations of its annual conference call for papers. However, despite the frequent reference to ‘cultural diversity’ and similar constructions in conference themes and project summaries over the years, the move to focus explicitly on multilingual DH, either as a strategic challenge or as part of a focused research agenda, is relatively recent. Much of the current focus on multilingualism in the digital humanities stems from a wider ‘diversity turn’ which gathered steam in the early 2010s and included impassioned calls for greater global representation in the field, whether in the governance of its professional associations, its digital methods, or its attention to language communities around the world.

4 Paul Spence and Lorella Viola Domenico Fiormonte’s 2012 essay ‘Towards a Cultural Critique of the Digital Humanities’ challenged us to reflect on whether there is ‘a non Anglo-American Digital Humanities’ and argued that, despite energetic efforts to internationalise, if we analysed the governance of most DH associations and organisations at that time, it would be easy to conclude that the field consisted of ‘a solid Anglo-American stem onto which several individuals of mostly European countries are grafted’. Since then, numerous contributions have expanded the debate around the global representation in DH and digital scholarship more broadly (Galina Russell 2014; Gil and Ortega 2016; Fiormonte, Chaudhuri, and Ricaurte 2022), while initiatives such as Global Outlook DH1 have sought to disrupt hegemonic dynamics in the field through agile interventions around translation, global field mapping exercises (‘Around DH’), or minimal computing. Multilingualism is one key element of global representation but has not always received due attention. Scholars such as Maffi have drawn parallels between biological, cultural, and linguistic diversity (Maffi 2005), and here we argue that multilingual diversity is an important proxy for knowledge diversity in digital scholarship, and indeed the world at large. We will now turn to summarising a few of the principal multilingual DH advocacy initiatives and their relationship to wider drives for digital multilingualism. One thing to note is that there has not been significant discussion in the digital humanities regarding what multilingualism actually means, whether in the literature or in practice. In fact, the term represents a ‘diverse and divergent historical constellation of practice’ (Gramling 2021, 9–10), which ‘[d]espite its terminological vulnerabilities and its technological instrumentalizations . . . is usefully able to encompass complex, divergent, and sometimes opposing experiences and ideas’. 
Indeed, the effect which machine learning and other forms of digital transformation are having on the concept of multilingualism, as reified in emerging knowledge infrastructures, seems like a topic ripe for research in a field like the digital humanities. We do not have space to go into more depth here on the semantic and ideological differences between multilingualism, plurilingualism, bilingualism, translingualism, language diversity, and other such terms (including ‘multilinguality’ introduced in Chapter 12), but for the editors of this volume the term ‘multilingualism’ represents a linguistic and cultural discourse space, which extends way beyond the notion of simply using or supporting different languages, and draws from fields as distinct as linguistics and cultural studies. At the same time, while emphasising the cultural dimension in creating the conditions for language diversity in digital spaces, it also foregrounds the role of languages (in the plural) in promoting global diversity, in contrast to the language indifferent tenor of some debates about geocultural diversity within knowledge practices, which DH sometimes replicates. It is also worth acknowledging that the digital humanities have arrived relatively late to debates around digital multilingualism. In the field of modern languages (at least in an anglophone setting) the focus of digital research has largely been on pedagogy, as evidenced in journals on Computer-Assisted Language Learning (CALL) or Technology-Enhanced Language Learning (TELL). However, in

Introduction 5 the last ten years, there has been some important work in digital cultural studies through initiatives such as the (now inactive) Digital Latin American Cultures Network.2 Following on from this, scholars such as Pitman and Taylor have explored the boundary between modern languages and digital humanities to foreground the potential it contains for greater collaboration and a deeper ‘understanding of cultural and linguistic specificity [which] can help us to understand better—and even problematise—some of the assumptions around the globalising nature of digital technologies’ (2017). In other fields such as sociolinguistics, research on digital multilingualism has a long history, with landmark publications such as Danet and Herring’s agendasetting volume on multilingual computer-mediated communication (Danet and Herring 2007) or Carmen Lee’s exploration of multilingualism online (Lee 2016), which provided new methodological and theoretical models for studying the relationship between multilingual subject, technology, and evolving communication practices. Meanwhile, several guidelines and toolkits have been developed to promote language diversity in recent years aimed both at language communities and at digital practitioners: from the Whose Knowledge? report documenting the ‘State of the Internet’s Languages’3 to the ‘Digital Language Diversity Project’, aimed at giving greater agency to Europe’s regional and minority languages in digital spaces (The Digital Language Diversity Project 2019). In ‘Towards Language Sensitivity and Diversity in the Digital Humanities’, Spence and Brandao proposed that we articulate a framework for ‘languages-sensitive’ DH research building on wider frameworks and infrastructures for language diversity in digital communication (Spence and Brandao 2021). 
Since then, a European consortium has launched an ambitious ‘agenda and roadmap for achieving full digital language equality in Europe by 2030’.4 Key deliverables of that project have included a set of White Papers on technology support for European languages and technical ‘deep dives’ by industry partners into key strategic areas of development for language technology. While not instantly transferable outside of the European context, in bringing together so many research and industry partners, the project presented a useful model for strategising and implementing multilingual transformation in digital infrastructure. While some digital humanities researchers are involved in these initiatives, the discussion about multilingualism in DH is largely disconnected from literature in other fields of digital study, and more generally research into digital multilingualism can be heavily siloed in different fields. Recognising this, the Disrupting Digital Monolingualism event in 2020 sought to bring together different traditions of research and digital practice around four areas: (1) linguistic and geocultural diversity in digital knowledge infrastructures; (2) working with multilingual methods and data; (3) transcultural and translingual approaches to digital study; and (4) artificial intelligence, machine learning and natural language processing (NLP) in language worlds. As the report on that event noted: discussion around digital multilingualism is often ‘fragmented (and sometimes marginalised)’; we need greater access to training, services, data, and tools to promote multilingualism; and people interested in the topic need ‘more opportunities to engage with each other’ and to

‘promote alliances’ (Spence 2021). This volume represents our concrete effort to bridge some of these disciplinary challenges.

The emergence of overtly multilingual initiatives in DH is recent, but we are aware of at least five which explicitly use the ‘multilingual DH’ term: the loose multilingual DH network set up by Quinn Dombrowski in 2019, and the more recent Special (or Community) Interest Groups set up within DH professional associations such as ADHO, DARIAH,5 the DHd (the German-language association),6 and the UK-Ireland DH Association.7 These are still early days, and each group has a slightly different focus, but they all coalesce around the notion of language inclusivity and opposition to anglophone dominance in scholarly communications. What perhaps makes them distinct from other fields promoting language diversity is that they are far more heterogeneous in makeup (which makes it harder to find commonality but provides for richer interdisciplinary encounters) and that they tend to have a strong connection between critiquing digital monolingualism and designing/building alternatives. The digital humanities is not unique in this notion of critical practice, of course, but its field of operation tends to be more applied than some digital fields, focusing on social and cultural production, and it often aligns closely with digital activism and the notion of citizen science/scholarship. In multilingual terms, this has important implications because the field is uniquely placed to transform how language communities are represented in digital methods, tools, research infrastructures, and dissemination. Another important feature of multilingual DH research is its attention to advanced computational methods and tools, in particular multilingual natural language processing (NLP) (three chapters in this book address this).
It is no accident in this regard that one of the first things the multilingual DH network did, for example, was to assemble a list of resources for using NLP in different languages (currently 22).8 The co-ordinator of this resource (Dombrowski) also co-facilitated a theme group on ‘artificial intelligence, machine learning and NLP in language worlds’ at the Disrupting Digital Monolingualism workshop, which connected computational humanities to wider audiences working with AI and NLP (including industry and language communities). This group identified a diverse set of challenges, including the lack of exposure of NLP systems to different [language] typologies, ‘a deficit in understanding about the wider issues facing world language use when developing NLP’, and the heavy emphasis on benchmarks and validation systems which favour research on/using NLP for high-resource languages (Joshi et al. 2019), and in particular English (theme group conclusions are summarised in the event report; Spence 2021). Again, this is an area where digital humanities can play an important role, as witnessed by the ‘New Languages for NLP: Building Linguistic Diversity in the Digital Humanities’ initiative described in Chapter 7.9 One major challenge for geolinguistic diversity in the digital humanities is how to address computational obstacles for lower-resourced languages, while giving due credit to research which may not be considered as computationally innovative but which can lead to transformative applications in different language community contexts.

Structure of the Book

This edited volume draws together a multitude of voices, traditionally fragmented across multiple disciplinary fields. The origin of this project can be traced back to our involvement in multilingual DH, ‘a loosely-organized international network of scholars using digital humanities tools and methods on languages other than English’ set up by Quinn Dombrowski, which aims to ‘rais[e] the visibility of scholarship in and about many languages’.10 Following on from a meeting of the network, Lorella Viola put out a call for fellow editors for an edited volume on ‘multilingual DH’ in June 2021, to which Paul Spence responded.11 We would like to acknowledge the important contribution of these and many other discussions in the community in laying the groundwork for the book proposal which led to the volume you are reading now.

The book consists of four parts organised around four macro-themes: Part I includes contributions on multilingualism theory and practice, Part II is concerned with multilingual pedagogy, Part III presents work on multilingual language models, and finally Part IV showcases initiatives on multilingual methods and infrastructure.

The first part of the book explores how multilingual theory and practice have been realised in digital humanities initiatives in the past few years. Across the three chapters, the authors draw together theory and practice, and cultural and linguistic perspectives, to provide a compelling introduction to the opportunities and challenges for multilingual digital humanities today. The opening chapter by Isasi, Quiroga, Siddiqui, Paulino, and Wermer-Colan examines the multilingual and multicultural model provided by the Programming Historian (PH) for publishing methods-centred digital humanities tutorials. PH is a well-known open-access platform for self-directed critical digital literacy acquisition grounded in computational approaches to humanities research questions.
The authors explore how PH has, in the past few years, moved to developing lessons in three other languages (Spanish, French, and Portuguese), addressing the challenges of developing a global audience based on an active community of practice within an open and sustainable socio-technical framework. They offer an important model for how to embed language communities with diverse geocultural profiles in multilingual initiatives from the start.

The second chapter in this section, by Ponce de la Vega, evaluates the dynamics of multilingual and multicultural representation, taking the Biodiversity Heritage Library—a key scientific, social, and humanities resource—as a case study to contrast Global South and North perspectives in heritage sector practices. Using web analytics and metadata analysis, the author compares audience behaviour on the historic BHL site to behaviour on local partnership-driven nodes, making a distinction between the concept of inclusion as presence (the mere addition of language support) and decolonial inclusion, a community-driven approach to diversifying epistemic representation in digital knowledge production.

Finally, the chapter by Justy Joseph, Lalithsriram SR, and Nirmala Menon examines the challenges of carrying out digital humanities research using NLP in Indian

languages. As the authors demonstrate, India is a useful case study, given its hyperdiversity in languages and their particular social and linguistic features. Nevertheless, Anglocentrism affects DH in the country in numerous ways—from the linguistic traditions of scholarship in the field there, to the resource bias towards English in NLP development and the difficulties in transferring methods to Indic languages. As a way out of this, they survey existing NLP support for Indian languages and explore the challenges in consolidating and advancing the relatively under-developed Indian language NLP infrastructure, making the case for a multilingual ‘workbench’ which would allow for effective cross-lingual transfer and resource sharing, using a collaborative and community-driven approach.

Part II gives visibility to pedagogy, an area which the digital humanities as a whole has not, until recently, addressed well. And while there have been significant advances in DH pedagogy in recent years, there is a notable lack of work on the pedagogy of multilingual methods and little attention from DH to the relatively advanced field of digitally mediated language pedagogies. This section begins with a chapter by Allés-Torrent, which builds on a growing debate in the modern languages community on the relationship between digital multilingualism and language education, broadly understood. The author explores how the digital humanities operate in modern languages, how the two fields pose interesting challenges for each other, and what this dialogue can achieve in terms of building linguistically and culturally diverse digital pedagogies which further both DH and language learning. Building on the ‘DHML’ (digital humanities—modern languages) framework proposed by Pitman and Taylor (2017), she evaluates efforts to develop critical digital literacies informed by the social and cultural contexts for digital methods and objects.
The fifth chapter, by Gasparini, extends the theme with a critical review of learning analytics in digital learning environments for Second Language Acquisition. There is little research on the linguistic and cultural settings for learning interaction with digital tools, the author argues, and the use of empirical methods to understand user behaviour and engagement with different interfaces has applications well beyond language education. Using tracking systems designed to better understand learner behaviour in virtual worlds, her contribution aims to improve our understanding of learner contexts and the concept of space when engaging with digital learning platforms, an approach transferable more generally to the evidence-based study of user behaviour in diverse linguistic and cultural settings.

The following chapter, by Goodale, addresses a different branch of DH pedagogy—namely, learning resources for computational text analysis applied to non-English languages. Noting the important role of libraries in fostering digital humanities methods in some institutions, the author describes initiatives at UT Austin Libraries designed to promote the use of NLP and computational text analysis methods for non-English materials through code repositories and practical applications/case studies. The author describes how these resources face different challenges in different languages and summarises various ways in which learners can adopt open libraries of NLP code such as those used in these learning resources.

In the final chapter of this section, Tasovac, Budak, Ermolaev, Janco and Lassner argue that we need to move beyond simply acknowledging the structural and representational bias in NLP tools and methods. At the same time, they recognise the difficulty in taking tangible steps to resolve this, both because humanists will not, broadly speaking, be able to develop their own machine-learning algorithms, and because NLP experts cannot be expected to focus exclusively on humanities datasets and problems. They propose a better understanding by humanists of the opportunities and risks which machine learning presents, combined with the development of humanities-facing skill sets and workflows with which researchers can autonomously develop their own language models and NLP usage scenarios. Describing their experience in developing the ‘New Languages for NLP’ workshops, which engaged with ten language teams, they analyse the challenges they faced across languages as diverse as Kannada, Old Chinese, Ottoman Turkish, Quechua, Yiddish, and Yoruba, and describe a pedagogical model developed to simplify the process of using linguistic data for new languages in NLP.

Part III of the book is concerned with modelling, which, despite being a core area of interest in the digital humanities since its inception, has still received relatively sparse attention in terms of languages-aware modelling practices. In examining linguistic injustice in scholarly communication systems, the chapter by Bordonaba-Plou and Jreis-Navarro advocates for a better understanding of the underlying linguistic dynamics of the digital systems we use, even where a multilingual philosophy is apparently applied. They reflect on how the scarcity of research data in languages other than English may impoverish knowledge production in the digital humanities and identify the imprecision of linguistic analysis tools in these purportedly ‘global’ environments.
Using their own cross-linguistic study of colour terms in a Spanish and an Arabic corpus, they highlight the limitations imposed by inconsistent support across languages in digital tools.

In the following chapter (Chapter 9), Fekete, Bjerva, and Beinborn analyse typological challenges for the application of multilingual language models in DH. While computational methods for language processing are now more accessible to digital humanities research, thanks to easily customisable neural language models, these are still strongly bound to English approaches, and it is particularly difficult to apply them to low-resourced languages. The development of multilingual models such as multilingual BERT aims to mitigate these limitations, but the authors explore the design parameters behind such universalist approaches and examine how a better understanding of linguistic representation in multilingual models can make them more effective across languages with different conditions such as morphological structure or segmentation.

The last chapter in the section, by Mello, presents a case study analysing the concept of Olympic legacy as represented by the British and Brazilian media in their coverage of the 2012 and 2016 Olympics. The author focuses on the challenges of combining qualitative and quantitative methods of close and distant reading for cross-cultural, cross-lingual, and temporally based research, focusing on data availability, data structure, and NLP support (in particular for Sentiment Analysis), and concludes with best practices for multilingual analysis in the digital humanities.


The fourth and final section surveys the specific methods-related and infrastructural challenges facing the digital humanities. The chapter by Rojas Castro begins the section by investigating the internationalisation and localisation strategies of a major digital project (Proyecto Humboldt Digital). The project involves the creation of a digital archive of multilingual texts in Spanish, French, English, and German, and the author identifies the limitations of DH tools in providing multilingual support for these kinds of archives.

The following contribution presents the results of a survey designed to identify the requirements for multilingually enabled digital knowledge infrastructures. Building on work carried out at the Disrupting Digital Monolingualism workshop at King’s College London in 2020, the authors argue that multilingual workflows are more complex than previously realised by digital humanists and that improving them requires a set of formalisable requirements, particularly for non-Latin-script languages. Through their survey, they aim to provide language- or language-group-specific ‘requirement profiles’ which can be applied to digital infrastructures and, in so doing, improve multilingual awareness, skillsets, and professional development.

The final chapter, by Brown, addresses the question of what ‘multilingual’ and ‘text’ mean in situations of language contact that render both terms ambiguous. The author points out the difficulty of assigning any particular ‘word’ to any one linguistic variety in a categorical way and discusses the consequences of code-mixing for computational analysis. In textual materials with widespread inter/intra-linguistic variation, it is difficult to unequivocally categorise items in computational analysis, and the chapter concludes with reflections on what it means to represent linguistic forms in a digital space.
Conclusion

It is impossible to do full justice to the 7,000 languages in the world in an edited volume, and while we made considerable efforts to be inclusive—for example, by actively seeking scholarship from the Global South, offering balanced coverage of languages across different continents, and giving attention to lower-resourced languages—we were not always successful. This no doubt reflects our positionality as two European authors with our own networks and epistemological biases, but it is also a reflection of the very global knowledge dynamics to which we attempt to draw attention here. We were not able, for example, to attract contributions from (or about) African digital humanities, despite some exciting developments on that continent, including the Network for Digital Humanities in Africa,12 Masakhane (a ‘grassroots NLP community for Africa, by Africans’),13 and the work of the South African Centre for Digital Language Resources (SADiLaR).14 Similarly, while much progress has been made to improve access to minoritised languages in recent years, and researchers such as Thieberger have articulated the importance of making ‘more information about the world’s small languages more freely available’ as part of the drive for global diversity in the digital humanities (Thieberger 2017), we were not able to engage with lower-resourced languages to the extent we would have liked.

We have already recognised the contradictions in writing about multilingual DH in English. We argue that there is value in advocating for multilingualism in English, but there is also a risk that we simply perpetuate the very centre-periphery/higher-lower-resource language dynamics we aim to dismantle. While attention rightly focuses on English as the dominant language of the internet and digital systems more broadly, it is important to also be aware of other language hierarchies, and the dynamics they entail. There is a very real danger that multilingual efforts merely favour the top 10–20 (mostly European) languages in terms of digital support unless we pay attention to these factors. Another topic beyond the scope of this book is the fact that a significant number of the world’s languages have no writing system or little written representation. Advances in multimodal modelling and in speech-to-text technology offer some hope here, but this is another key multilingual challenge for the traditionally text-dominant digital humanities.

These are some of the many challenges which lie ahead for what we might call ‘disruptive multilingualism’ in the digital humanities. What we have aimed to do in this book is to draw attention to the breadth of work on multilingualism in the field, the different sets of socio-technical challenges it involves, and its impact on global knowledge dynamics more generally. Mirroring the wide-ranging epistemological traditions in which digitally mediated multilingual research takes place, the chapters in this book have addressed a wide set of issues and themes in digital multilingualism: how language models are constructed, representation systems for cultural heritage, internationalisation/localisation for digital interfaces, how to combat the large-language bias in NLP research, and the kind of critical digital literacies needed to make progress in this area.
Hence, the book provides an effective way for researchers, practitioners, and graduate students to gain a foundation in the key concepts of the field. This volume assumes high contemporary relevance, especially in relation to the ubiquitous impact of digital technologies on post-COVID humanity. The exponential digital acceleration brought about by the COVID-19 pandemic provides a privileged viewpoint from which to reflect on the current biases in digital research practices, including how approaches to digitally mediated knowledge creation (which DH embodies and fosters) are changing. The main conclusions of this book therefore also serve as a platform from which the implications of monolingualism and monoculturalism in digital scholarship are discussed, explored, and shared in a critical and interdisciplinary way.

The chapters in this volume collectively counterbalance the pivotal role played by Anglocentrism in shaping digital practices and contribute to reducing the barriers non-English speakers often face in accessing digital spaces, ultimately enhancing opportunities for engagement and expression. The rich diversity of perspectives offered by this volume thus concretely responds to our professional and moral obligation as scholars of the digital to impact on institutional and methodological frameworks for knowledge creation, to raise awareness of the limitations of Anglocentric perspectives and the need for diverse voices, and to push for a more inclusive digital future. By challenging the dominance of an English-language digital landscape, this multilingual DH volume also contributes to a burgeoning international network that promotes linguistic


diversity, cultural exchange, and equitable access to digital resources. The collaborations in this volume embrace diverse narratives that amplify voices from marginalised communities, fostering intercultural dialogue and contributing to dismantling the monocultural paradigm. As such, by advocating for inclusive technologies and fostering international collaborations, one of the main contributions of this book is to signal a shift towards inclusivity, linguistic diversity, and cultural pluralism in digital practices and to pave the way for a more equitable and culturally rich digital future.

The digital humanities is well placed to make a vital contribution to digital multilingualism, both as a research topic and as a strategic imperative in the design and application of scholarly research tools or infrastructures and beyond. There is still much exciting work to be done for multilingual DH. We feel privileged to be part of the growing international network which is making disruptive multilingual change happen; we believe that this volume helps to cement its important role in future DH literature.

Notes

1 www.globaloutlookdh.org/
2 https://latamcyber.wordpress.com/
3 https://internetlanguages.org/en/
4 https://european-language-equality.eu/
5 www.dariah.eu/activities/working-groups/multilingual-dh/
6 https://dig-hum.de/ag-multilingual-dh
7 https://digitalhumanities-uk-ie.org/community-interest-groups/multilingual-dh/
8 https://github.com/multilingual-dh/nlp-resources
9 https://newnlp.princeton.edu/
10 http://multilingualdh.org/en/
11 www.c2dh.uni.lu/news/cfp-chapters-solicited-edited-volume-multilingual-dh
12 https://dhafrica.blog/
13 www.masakhane.io/
14 https://sadilar.org

References

Cameron, Fiona. 2021. The Future of Digital Data, Heritage and Curation in a More-than-Human World. Abingdon, Oxon and New York: Routledge.
Danet, Brenda, and Susan C. Herring, eds. 2007. The Multilingual Internet: Language, Culture, and Communication Online. 1st ed. Oxford University Press.
Earhart, Amy E. 2019. ‘Can Information Be Unfettered? Race and the New Digital Humanities Canon’. In Debates in the Digital Humanities, edited by Matthew K. Gold. Minneapolis: University of Minnesota Press.
Fiormonte, Domenico. 2012. ‘Zu einer Kulturkritik der digitalen Geisteswissenschaften [Towards a Cultural Critique of the Digital Humanities]’. Historical Social Research 37 (3): 59–76. https://doi.org/10.12759/HSR.37.2012.3.59-76.
Fiormonte, Domenico, Sukanta Chaudhuri, and Paola Ricaurte, eds. 2022. Global Debates in the Digital Humanities. Minneapolis: University of Minnesota Press.
Galina Russell, I. 2014. ‘Geographical and Linguistic Diversity in the Digital Humanities’. Literary and Linguistic Computing 29 (3): 307–316. https://doi.org/10.1093/llc/fqu005.

Gil, Alex, and Élika Ortega. 2016. ‘Global Outlooks in Digital Humanities: Multilingual Practices and Minimal Computing’. In Doing Digital Humanities. Milton Park and New York: Routledge.
Gramling, David. 2021. The Invention of Multilingualism. Key Topics in Applied Linguistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108780667.
Hohti, Riikka, and Sarah E. Truman. 2021. ‘Anglocentrism in the Academy: On Linguistic Privilege, Mastery and Hoito’. Reconceptualizing Educational Research Methodology 12 (2). https://doi.org/10.7577/rerm.4675.
Joshi, Pratik, Christian Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. 2019. ‘Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities’. arXiv. https://doi.org/10.48550/arXiv.1912.03457.
Lee, Benjamin. 2020. ‘Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset’. Digital Humanities Quarterly 15 (4). https://doi.org/10.17613/K9GT-6685.
Lee, Carmen. 2016. Multilingualism Online. London: Routledge. https://doi.org/10.4324/9781315707211.
Leloup, Pieter, and Adam White. 2021. ‘Questioning Anglocentrism in Plural Policing Studies: Private Security Regulation in Belgium and the United Kingdom’. European Journal of Criminology, May. https://doi.org/10.1177/14773708211014853.
Maffi, Luisa. 2005. ‘Linguistic, Cultural, and Biological Diversity’. Annual Review of Anthropology 34 (1): 599–617. https://doi.org/10.1146/annurev.anthro.34.081804.120437.
Mandell, Laura. 2019. ‘Gender and Cultural Analytics: Finding or Making Stereotypes?’ In Debates in the Digital Humanities, edited by Matthew K. Gold and Lauren Klein. Minneapolis: University of Minnesota Press.
McPherson, Tara. 2019. ‘Why Are the Digital Humanities So White? Or Thinking the Histories of Race and Computation’. In Debates in the Digital Humanities, edited by Matthew K. Gold. Minneapolis: University of Minnesota Press.
Noble, Safiya Umoja. 2019. ‘Toward a Critical Black Digital Humanities’. In Debates in the Digital Humanities, edited by Matthew K. Gold and Lauren Klein. Minneapolis: University of Minnesota Press. https://dhdebates.gc.cuny.edu/read/untitledf2acf72c-a469-49d8-be35-67f9ac1e3a60/section/5aafe7fe-db7e-4ec1-935f-09d8028a2687#ch02.
O’Sullivan, James, ed. 2022. The Bloomsbury Handbook to the Digital Humanities. Bloomsbury Handbooks. New York: Bloomsbury Academic.
Pitman, Thea, and Claire Taylor. 2017. ‘Where’s the ML in DH? And Where’s the DH in ML? The Relationship Between Modern Languages and Digital Humanities, and an Argument for a Critical DHML’. Digital Humanities Quarterly 11 (1).
Risam, Roopika. 2015. ‘Beyond the Margins: Intersectionality and the Digital Humanities’. Digital Humanities Quarterly 9 (2).
Spence, Paul. 2021. Disrupting Digital Monolingualism: A Report on Multilingualism in Digital Theory and Practice. London: Language Acts & Worldmaking Project. https://doi.org/10.5281/zenodo.5743283.
Spence, Paul, and Renata Brandao. 2021. ‘Towards Language Sensitivity and Diversity in the Digital Humanities’. Digital Studies/Le Champ Numérique 11 (1). https://doi.org/10.16995/dscn.8098.
Thieberger, Nick. 2017. ‘What Remains to Be Done—Exposing Invisible Collections in the Other 7,000 Languages and Why It Is a DH Enterprise’. Digital Scholarship in the Humanities 32 (2): 423–434. https://doi.org/10.1093/llc/fqw006.


Viola, Lorella. 2021. ‘ChroniclItaly and ChroniclItaly 2.0: Digital Heritage to Access Narratives of Migration’. International Journal of Humanities and Arts Computing 15 (1–2): 170–185. https://doi.org/10.3366/ijhac.2021.0268.
Viola, Lorella. 2023. The Humanities in the Digital: Beyond Critical Digital Humanities. S.l.: Palgrave Macmillan.
Williams, Lyneise. 2019. ‘What Computational Archival Science Can Learn From Art History and Material Culture Studies’. In 2019 IEEE International Conference on Big Data (Big Data), 3153–3155. Los Angeles, CA: IEEE. https://doi.org/10.1109/BigData47090.2019.9006527.

Part I

Multilingual/Multicultural Theory and Practice

1 A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian

Jennifer Isasi, Riva Quiroga, Nabeel Siddiqui, Joana Vieira Paulino, and Alex Wermer-Colan

1.1. Introduction

Programming Historian (PH) is an online open-access journal with open peer review, devoted to computational methodologies for the humanities and social sciences. Despite the journal being exclusively available in English at first, it became apparent early in its history that a sizable portion of the audience was from non-Anglophone countries and from the Global South. Consequently, PH began to expand to other languages. As the only multilingual journal dedicated to digital methodologies that, at the same time, showcases digital humanities (DH) research and pedagogy in action, we present here the development of our multilingual and multicultural publishing methods to provide a model for digital publishing and DH.

This chapter proceeds in four key sections. First, we provide a review of the study of language diversity in DH. Second, we follow with an introduction to PH to clarify its goals. Third, we explain how we developed guidelines to address a global audience that has changed over time. Finally, we discuss the challenges (and opportunities) of dealing with linguistic variation within a single language across multiple cultures. As a conclusion, we summarize our core philosophy and offer three main takeaways that can serve as a guide for others seeking to expand linguistic diversity.

1.2. The Issue of Multilingualism in DH

Despite DH often being cast as an inclusive “big tent,” the majority of scholarship in the field still revolves around the Northern Anglophone world. This has resulted in English becoming the lingua franca for scholarship, teaching, and infrastructure development (Fiormonte, 2012, 2021). For instance, major publications such as Digital Scholarship in the Humanities and Digital Humanities Quarterly are predominantly Anglophone—although they have made some strides with special language issues1—and, as noted by Cecily Raynor, Cultural Analytics has not published any articles in other languages since its founding (Raynor, 2021). As Horváth et al.
demonstrate in their chapter, those who use other languages—either for communication or as the language of the materials they do research on—often struggle to access and contribute to the field. Likewise, research and perspectives from non-English-speaking cultures continue to be underrepresented, sometimes even within other major DH groups, such as the Spanish-speaking one (Isasi and Rio Riande, 2022).2

The lack of materials in languages other than English has been a well-known issue for more than a decade. In her article “Geographical and Linguistic Diversity in the Digital Humanities,” Isabel Galina Russell asked us to think reflectively about who “we” are as scholars, and to take the role of language seriously (Galina Russell, 2014). In the same vein, as Quinn Dombrowski and Patrick Burns note, “Language should be everyone’s concern in the humanities, although it is often rendered invisible for scholars who work on Anglophone materials and live in Anglophone countries” (Dombrowski and Burns, 2023). Similar to the ways that the power of whiteness stems from its “unmarked category against which difference is constructed . . . [that never has to] acknowledge its role as an organizing principle in social and cultural relations” (Lipsitz, 1995, p. 369), scholars rarely acknowledge the power of English in the academy, or even the fact that they are writing about the English language, per the #BenderRule (Bender, 2019).

Attempts have been made to increase diversity. The Alliance of Digital Humanities Organizations (ADHO), arguably the most influential organization in the field, has had a standing committee on multilingualism and multiculturalism since 2005. In their first discussion paper, the committee noted:

Successful handling of multilingualism and multiculturalism is crucial to our credibility as an international scholarly organization, and also to our ability to attract the membership (not just in numbers, but in breadth of constituency) that the organization will need in order to thrive. (Burr, 2006, p. 2)

Despite these lofty goals, the committee’s main activity has been to translate the call for papers for ADHO’s annual conference into its other four official languages (Spanish, French, German, and Italian) or others, regardless of the host country (Estill et al., 2022). This has resulted in little actual change. For example, although the 2014 CFP was in 23 languages, only approximately 4% of papers were presented in French, German, Spanish, and Italian, while the remaining 96% of works were presented in English (Grandjean, 2014). Individually, many of ADHO’s constituent organizations work in languages such as Japanese, French, or German, and many are also regularly in contact with other local language communities. However, this linguistic diversity does not carry over to ADHO’s annual conference or other global efforts, leading numerous organizers of the conference to conclude that the “practice of multilingualism has varied widely [but], unfortunately, has not yielded the diversity results it has been expected to” (Estill et al., 2022).

As we see it, there are at least two key barriers to true multilingual DH: attitudes and technologies. The first, unfortunately, involves cultural attitudes and perceptions held by scholars. For instance, Global Outlook: Digital Humanities, an ADHO special interest group, created the “DH Translation Toolkit” with an overview of best practices for translating materials (Rio Riande et al., 2016),

stemming from the DH Whispers campaign launched by Alex Gil and Élika Ortega at DH2014 in Lausanne, Switzerland. There, individuals could wear pins stating “I whisper (language)” to signal their willingness to translate during presentations (Ortega, 2014). While the organizers felt that the experiment was a success, there were also reports that rooms began to empty when non-English presentations took place, a particularly egregious example of the invisible barriers to expanding geolinguistic diversity (Estill et al., 2022). This Anglocentric cultural attitude is also reflected in the common notion that DH is predominantly practiced in Western Europe and North America, perpetuating a “center-periphery” model in which scholars in the Global South are seen as outliers (Risam, 2016; Fiormonte and Sordi, 2019). This framing is evident in Melissa Terras’ map of DH centers: with 114 centers in 24 countries, she noted that the visualization provided evidence for DH as a field with a unique willingness to embrace international collaboration (Terras, 2012). Detractors countered, however, that the map concealed research taking place unattached to centers, since research in the Global South often lacks institutional support and thus escapes maps such as Terras’.3 And although Gil et al. in 2018 and Dombrowski in 2020 tried to map DH work on a global scale, areas of the Global South are rarely included as members of the DH community (O’Donnell et al., 2015).

The second major issue confronting multilingual DH is technical. As the name implies, DH relies on pre-existing digital technologies that themselves perpetuate English as a dominant language. As Ramsey Nasser notes in the overview of his programming language قلب, programming languages are almost exclusively based on the ASCII character set, which encodes only Latin characters; this bias underlies most of our digital platforms (Nasser, n.d.).
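The ASCII limitation can be made concrete with a minimal Python sketch of our own (an illustration, not drawn from Nasser’s work): ASCII defines only 128 code points, so it simply has no representation for Arabic script, while Unicode encodings such as UTF-8 cover all scripts.

```python
# Illustrative sketch: ASCII cannot represent Arabic script, such as the
# name of Nasser's programming language, قلب ("qalb").
word = "قلب"  # three Arabic letters: qaf, lam, ba

try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8, a Unicode encoding, handles the word without trouble; each of
# these Arabic letters takes two bytes.
utf8_bytes = word.encode("utf-8")

print(ascii_ok)         # False: ASCII has no representation for the word
print(len(word))        # 3 characters
print(len(utf8_bytes))  # 6 bytes in UTF-8
```

Because bias of this kind sits at the encoding layer, it propagates upward into file formats, protocols, and tools built on top of it.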
Moreover, both programming syntax and vocabulary are usually based on English, and the accompanying documentation is also typically available only in English. The rise of “global” networking technologies has only exacerbated these issues. While there are close to 7,000 languages in the world, 98% of the content of the Internet is in only 12 languages, with English taking up close to half of the space (Trevino, 2020). Likewise, although 75% of individuals navigating the Web are from the Global South, most servers and knowledge infrastructures are in English-speaking areas. The “Whose Knowledge?” campaign, a feminist anti-colonial group, succinctly writes that “the internet’s platforms, policies, and protocols are created for and decided by the ‘local’ context of the United States, making the ‘local’ of the United States the (far too often unquestioned) ‘global’ of the rest of the world” (Whose Knowledge?, 2021). And although many languages, including indigenous ones, are spoken in the United States—with New York City being “the most linguistically diverse urban center in the world” (NYC Languages)—the “local” context functions in English. It is thus not surprising that many of the tools that form the backbone of DH research fail to support other languages. As Horváth et al. demonstrate, many in DH cannot work in “smaller” languages or languages that use different scripts, and have to find a workaround or give up altogether.

Together, these cultural and technical factors have limited the visibility of DH research to mainly Northern Anglophone material. The consequences for DH are

particularly serious in the field of pedagogy, as there is a scarcity of teaching and learning tools that accommodate multilingual and multicultural realities. As Dombrowski points out, “almost all hands-on DH pedagogy is shaped by the assumption that all students will be applying these skills to English-language materials” (2023). In fact, Alves (2021) sees an advantage in having materials in one’s native language, because students can then focus more easily on the systematic and critical use of the methods and tools. In this chapter, we give an overview of the challenges and pitfalls we have faced as members of the Programming Historian editorial team in developing multilingual pedagogical materials, which demonstrate the possibility of practicing and encouraging multilingual approaches.

1.3. Brief Introduction of the History and Goals of PH

William J. Turkel and Alan MacEachern launched Programming Historian (PH) in 2008 as an introductory resource for learning Python. In 2012, PH expanded its editorial team and thematic scope to become an open-access, peer-reviewed scholarly journal of methodologies for digital scholarship. This version of PH established the journal’s model of intake, open peer review, and publication, with its web infrastructure hosted and managed on GitHub.4 In 2016, PH added a Spanish-language team that began publishing translations of English-language lessons in 2017 under the title Programming Historian en español; original lessons followed a year later. Soon after, a French-language team joined the editorial board to launch Programming Historian en français in 2019. This was followed by a Portuguese-speaking team in 2020, which launched Programming Historian em português the following year. At the time of reviewing this chapter, there are 100 lessons available in English, 58 in Spanish, 25 in French, and 36 in Portuguese—and around 40 more lessons under review.

In order to understand the diverse identities of PH readers, we examine data on user visits to our website gathered by Google Analytics. Although this approach has its limitations (e.g., the use of a VPN prevents us from identifying a user’s location), it is helpful for identifying general trends in our audience. At present, we only have data starting in mid-2012 and, for the purposes of this chapter, we captured data until the end of 2021. During this time, we have seen a clear trend in audience participation: there has been a steady increase in traffic since 2016, with the highest peak in 2020, at more than four million total visits (Figure 1.1)—most likely due to the switch to remote and then hybrid research and teaching during the COVID-19 pandemic.
This reflects an increasing interest in and need for learning digital skills in all areas of life, but also the impact of our broader offering of methods and approaches for DH in a global context. Although PH’s audience throughout these years has come from almost every country in the world, it is worth pointing out that we have observed the growth of a “real” global audience as we have added languages to our collection. There has been a noticeable increase in the audience from countries where Spanish is the most spoken language since early 2017.5 Whereas DHers in Colombia, Chile, or Argentina (to name a few) barely consulted the lessons in the journal before 2016, they rapidly became part


Figure 1.1 A graph showing the increase in total visits per year, from 2016 to 2021. Source: Prepared by the authors.

Figure 1.2 A graph showing the change in ranking of visits for selected countries per year, from 2016 to 2021. Source: Prepared by the authors.

of the Top 12 in the following years (Figure 1.2). By 2019, visits from Spain and Mexico took third and fourth place, respectively, displacing the United Kingdom—one of the main audiences since the beginning—to fifth place (Figure 1.2).


Moreover, when we look at the countries of the Americas, where there is a very apparent North–South divide in general but also in DH specifically (Crymble and Afanador-Llach, 2021; Fiormonte and Rio Riande, 2022; Risam, 2018; etc.), the increase in demand for our lessons in languages other than English is even more recognizable. If we examine the visits from Latin American and Caribbean countries in aggregate and compare them to the combined numbers from the United States and Canada, we notice that in both 2019 and 2021 the visits from the South surpassed those from the North (Figure 1.3). And whereas most visits from the United States and Canada go to the English-language journal, visits from Latin America and the Caribbean are spread across all four journals.6

1.4. Global Audience Guidelines and Translation Work

Since our audiences are in a wide variety of contexts, we as a team have faced many challenges due to our inability to speak directly to most of our readers. Consequently, we have directed our efforts to globalizing the content of our lessons so that it is accessible to any geography or community as much as possible, which, at the same time, facilitates the later translation of each lesson. This effort grew out of the realization that some early texts either used vocabulary specific to the authors’ regions, lacked contextualization for the datasets used in the pedagogical activity, or assumed that readers could understand every technical word.

Figure 1.3 A graph with the visits for two groups in the Americas: Latin America and the Caribbean versus the USA and Canada, from 2012 to 2021. Source: Prepared by the authors.

We pay particular attention to the rise of the Spanish-speaking countries in the ranks of visits because their addition led to the creation of novel guidelines for writing for a global audience. For the first time, PH started to consider that its readers were located not just in the USA, the UK, and Canada but anywhere else. These new users were clearly seeking these resources in order to learn digital skills applicable to their own contexts and, for that reason, we had to adapt. We now apply these guidelines to all the PH versions, as explained next, and also direct colleagues who are starting to pursue global projects to them.

Little by little, the team developed a nine-step guide addressing several issues that we found crucial to expanding accessibility. We summarize these guidelines next in three main groups.

First, we ask authors to envision their readers outside of their main academic or work circle, and beyond the borders of their country or region. To make this easier, we provide clear examples of things to avoid, such as cultural or historical references without context. That is, when authors mention people, institutions, or historical details, they should offer more than the name given in their own country or region. Similarly, they should avoid jargon, jokes, and wordplay, as well as trademarks or brand names. This improves readability, as users will not need to double-check information or second-guess a reference while reading.

Second, we ask authors to choose methods and tools with readers of different languages in mind. In general, the journals request lessons on tools that provide multilingual documentation whenever possible, and lessons also need to include references in multiple languages for further reading if possible. And for lessons on textual analysis, we ask that the tool or method support as many character sets as possible in order to provide robust results in languages other than English—when this is not feasible, the editorial team considers how best to address the issue and, in most cases, a note is included at the beginning of the lesson.
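The character-set concern can be illustrated with a small Python sketch of our own (not taken from any PH lesson): a tokenizer built on an ASCII-only pattern silently mangles even a closely related Latin-script language such as Portuguese, whereas a Unicode-aware pattern does not.

```python
import re

text = "A canção começou"  # Portuguese, with the accented letters ç and ã

# An ASCII-only character class splits words at every accented letter,
# producing meaningless fragments.
ascii_tokens = re.findall(r"[A-Za-z]+", text)

# On Python 3 str objects, \w matches Unicode word characters by default,
# so accented words survive intact.
unicode_tokens = re.findall(r"\w+", text)

print(ascii_tokens)    # ['A', 'can', 'o', 'come', 'ou']
print(unicode_tokens)  # ['A', 'canção', 'começou']
```

For non-Latin scripts, such as Arabic or Devanagari, the ASCII-only pattern would return nothing at all, which is why the guideline asks tool authors to verify character-set support before a lesson is accepted.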
Finally, cognizant of the global disparity in computing resources, PH lessons that require “relatively substantial computing resources, [must] include an alert warning after the Table of Contents to inform the readers.”7 This alert has to be very specific about the requirements needed to complete the lesson (e.g., “You need at least 8GB of RAM to finish this lesson”). This sets realistic expectations for our readers, who can consider their technological context before even attempting to learn a tool or method.

Although these strategies for contextual information, multilingual methods and tools, and awareness of technical capacity are mainly geared toward ensuring the readability of individual lessons, they are also of paramount importance for the translation work in which the journal is so invested. In fact, the current four versions of PH look for translations that consider, at their core, their own language-reading audiences, regardless of the language in which the original lesson was published. As such, per the Translator Guidelines,8 translating the textual body of the lesson is not enough in most cases. Where possible, we ask translators to edit or add to the programming code when it is part of a lesson, and to translate or explain technical terminology. They need to switch the software interface to the target language or explain its terms. Images, their titles, captions, and alt-text also need to be adapted. And, most importantly, translators need to find and add references relevant to the new target audience, linking to a Wikipedia article in the corresponding language or pointing to software documentation in that language. All these changes are only feasible if the author of the original lesson includes the relevant information.

As a result, and as argued by Isasi and Rojas Castro (in this volume), we see three different types or levels of translation, with the third one being, usually, the

desirable and most useful for each target audience. On a first level, translators carry out only linguistic changes in lexicon and syntax (body text only), when these are sufficient to cover what is, usually, a simple method or step in a process. On a second level, translators make expressive changes, adding contextual information, notes, extra bibliographic information, or material partially adapted to the target language. And on a third level, translators produce a complete localization, in which everything is adapted, changed, or extended to meet the needs of the target audience, including datasets (Isasi and Rojas Castro, 2021, p. 619). Ultimately, the journal is not interested in publishing translated lessons that cannot be applied or taught in the target language—a difficulty only heightened when teaching in a second language, as Allés-Torrent argues (see her chapter in this volume).

Nonetheless, writing lessons or translating them is not as straightforward as it might seem, even if one follows the previously mentioned guidelines and strategies. The team has also had to consider the cultural diversity and linguistic varieties within each target language. This is best exemplified by the decisions made by the Portuguese-speaking team, illustrated in the following section.

1.5. Strategies to Account for Linguistic Variation Within One Language With Multiple Cultures

Following our own diversity policy,9 the Portuguese-language journal is led by both Brazilian and Portuguese members, who come from academic and non-academic institutions as well as from different fields of study. When translating, editing, or developing original lessons, the editorial team does not assign these activities based on whether someone is from Portugal, Brazil, or any other Lusophone country. When it comes to reviewing lessons, however, the job is done by reviewers from different countries, in order to account for the varieties of Portuguese in their different contexts.

To manage this variety, early in its onboarding process this team decided to develop a glossary to use during the translation of the journal as well as in any editorial process.10 We first identified linguistic differences and the meanings of some common words in DH, in order to adopt those that can be understood by audiences on either side of the Atlantic Ocean. For example, the team chose the Brazilian term for “computer mouse” because it uses the same word as in English, instead of the Portuguese term “rato”; in the case of “database,” we adopted the version from Portugal, “base de dados,” instead of the Brazilian “banco de dados.” What started as a simple shared document evolved into a Wiki page11 on the PH repository on GitHub, to make the information more sustainable and easier to access, not only for people involved in the journal but for the growing humanidades digitais community (Alves, 2016a, 2021; Pimenta and Alves, 2021). Just as developing the author and translator guidelines took a long period and several iterations, this task took time to be discussed, analyzed, and settled. However, it now ensures that the lessons in Portuguese, whether originals or translations, are homogeneous in the approaches they adopt and the concepts they use. At the same time, having a list of keyword concepts constantly

used in the journal helps this team because, like the rest of the editorial board, it is composed not of translation experts or professionals but of teachers, researchers, and students.12 Students in particular might not have had much exposure to more than one variety of Portuguese, so clear guidelines are of utmost importance for them. This uniformity—together with the global audience efforts—ensures that Portuguese-language lessons can be created, translated, reviewed, and used by students and early career researchers, who are in fact very involved in all steps of the process (Alves, 2021). Most notably, students are currently using these lessons in the History degree at the Faculdade de Ciências Sociais e Humanas of the Universidade Nova de Lisboa, Portugal, and in the School of Social Sciences FGV CPDOC of the Fundação Getúlio Vargas in Rio de Janeiro, Brazil. Some of them are helping to translate those lessons, engaging with the project and enriching their digital skills.13 This has been a good way to evaluate whether the lessons are understood by learners of digital methods from both Portugal and Brazil, and to adjust them accordingly during the review process; students note when instructions are not perfectly clear and need to be rewritten, or when a step is missing.14 We should note that the Portuguese-speaking team prioritized the translation of lessons covering methods and tools needed in their specific contexts: the Python programming language series as well as lessons about R, Omeka, and spatial analysis tools (Alves, 2021, pp. 19–20). All in all, students enhance their digital skills or learn new ones and, at the same time, this scheme of adding students to the editorial process enables them to engage with the larger, international PH and DH community very early in their careers.
Thanks to this collaborative effort, the team has already been able to translate, review, and publish almost 40 lessons, a record of publications among the PH journals. Besides the satisfactory results obtained in piloting the translated lessons in classrooms in two countries, our audience data show that this effort is succeeding: visits from Portugal, and especially Brazil, have repeatedly doubled or tripled over the last few years (Figure 1.4).15 Understandably, visits from both countries increased first in 2018—coinciding with the launch of the Spanish-language journal and a writing workshop in Bogotá, Colombia, to which the Lusophone community was also invited (Crymble and Afanador-Llach, 2018)—and later with the launch of the PT journal at the end of 2020 (Alves and Isasi, 2021).

It is clear to us that paying attention to linguistic and cultural varieties within a single language is of high importance in the work we perform daily. Doing so can avoid misunderstandings or mistakes within lessons, contributing further to the readability and translatability of each lesson. By following the authorial and translation strategies mentioned earlier, together with the adoption of a set of common terms, the Portuguese-language team aims to offer lessons that can be understood by its own global audience. This is done by adopting easily understandable language and, where it proves possible, adapting the examples and the data presented to readers’ realities and contexts. At the same time, it is a prime example for the rest of the PH editorial board and, indeed, for any DH project that wants to communicate with a multilingual and multicultural audience.


Figure 1.4 A graph with the total visits by users in Portugal and Brazil, from 2012 to 2021. Source: Prepared by the authors.

1.6. Conclusion: Three Main Takeaways to Serve as a Guide for DHers

Despite DH now having a relatively long history in academia, the dominant centers of research and pedagogy have not been able to communicate with, or include in the conversation, many colleagues in the “peripheries” of the world. Programming Historian is striving to break with this reality and has become a highly inclusive resource for many. As a community of practitioners of different, regional DHs, we have developed a series of strategies that we always follow and that we want to offer to the rest of the Northern DH community if it wants, as it claims, to reach a global DH reality.

First, be cognizant of the different cultural and material realities of the world in which DH is practiced, and provide contextual information as well as a clear statement of the technology needed to replicate a project. Second, use translation to reach those who simply do not have the privilege of learning English as a second language but have to live with the reality of it being a lingua franca in the digital sphere. Finally, in original materials or translations, also take into account that a language is not a monolith; create strategies to address the linguistic varieties within a language group so that the whole group can understand the materials and, most importantly, can also participate in the conversation.

We know that these strategies are incomplete, and we are aware of our limitations in working in four major European languages that use the same script. However, given our experience and our successful reach and engagement with a global

audience that only continues to grow, we are already working toward including other languages, either in new journals or within lessons.16 All in all, we consider our strategies a set of minimum practices for achieving a multilingual DH reality.

Notes

1 To date, DHQ has published one special issue in Spanish (12.1, 2018) and one in Portuguese (14.2, 2020).
2 According to Alves (2021), DH is now picking up momentum in the Portuguese-speaking world, with conferences, new labs, projects, and, mainly, an increase in publications (eight volumes on DH between 2018 and 2020), a situation different from that of up to the mid-2010s (Alves, 2016b).
3 For example, Isasi and del Rio Riande’s map of author affiliations demonstrates the many places where DH is practiced in the Spanish-speaking world (Isasi and del Rio Riande, 2022, p. 137).
4 To read about our infrastructure and its challenges, consult Lincoln et al. (2022).
5 Programming Historian en español was launched on March 5, 2017. See the blogpost by Afanador-Llach (2017).
6 We acknowledge that we are only working with and quantifying the languages of the Americas that are spoken by a majority, not by everyone.
7 “Author Guidelines” by the editorial team: https://programminghistorian.org/en/author-guidelines.
8 “Translator Guidelines” by the editorial team: https://programminghistorian.org/en/translator-guidelines.
9 You can read about it in “About the Programming Historian”: https://programminghistorian.org/en/about#diversity-policy.
10 To this day, it is the only team that has developed a glossary of this kind, while the other teams take a more relaxed approach to issues arising from linguistic varieties within one language. In some cases, the Spanish-language team uses the glossary available for the Spanish-language collaborative translation of R para Ciencia de Datos (Alfaro et al., 2019).
11 “Achieving Sustainability: Agreed terminology PH em português”: https://perma.cc/VWR6-GQB2
12 Most have a dominant command of English but also good knowledge of Spanish and French. This ensures a faster exchange of ideas and translations from the other journals into the Portuguese-language version.
13 Both the Portuguese and Brazilian teams received funding to engage with their audiences. They had six fellowships, in September 2021 and September 2022, at the Institute of Contemporary History (NOVA FCSH/IN2PAST), funded by the Portuguese Foundation for Science and Technology. In addition, at FGV, they work on “Literacia Digital: modelando competências digitais para humanistas e cientistas sociais” (in collaboration with the Digital Humanities Lab of the Institute of Contemporary History, NOVA FCSH/IN2PAST).
14 During the translation of the lesson “Installing Omeka,” students identified some steps that had not been considered in the original version but were needed for them to be able to follow the tutorial.
15 Total visits from other countries where Portuguese is an official language are not included because they remain below a low threshold that, moreover, does not change over time.
16 See, for example: https://doi.org/10.46430/phfr0023.

References

Afanador-Llach, M. J. (2017, March 5). ¡Bienvenidos a The Programming Historian en español! Programming Historian Blog. https://programminghistorian.org/posts/lanzamiento-PH-espanol

Alves, D. (2011, March 24–27). História 2.0: oportunidades e desafios de um olhar digital sobre o passado. Paper presented at VI Encontro Nacional de Estudantes de História. Universidade de Lisboa. http://dhhistory.hypotheses.org/34 Alves, D. (2016a). As Humanidades Digitais como uma comunidade de práticas dentro do formalismo académico: dos exemplos internacionais ao caso português. Ler História, 69, 91–103. https://lerhistoria.revues.org/2496 Alves, D. (2016b), Humanidades Digitais e Investigação Histórica em Portugal: perspectiva e discurso (1979–2015). Práticas da História. Journal on Theory, Historiography and Uses of the Past, 1(2), 89–116. https://praticasdahistoria.pt/article/view/23302/17342 Alves, D. (2021). Ensinar Humanidades Digitais Sem as Humanidades Digitais: Um Olhar a partir das licenciaturas em História. Revista EducaOnline, 15(3), 5–26. https:// revistaeducaonline.eba.ufrj.br/edições-anteriores/2021-2/ensinar-humanidades-digitaissem-as-humanidades-digitais-um-olhar-a-partir Alves, D., & Isasi, J. (2021, January 29). Publicação do Programming Historian em português. Programming Historian. https://programminghistorian.org/posts/launch-portuguese Bender, Emily M. (2019, September). The #BenderRule: On Naming the Languages We Study and Why It Matters. The Gradient. Burr, E. (2006). Discussion Paper: Internationalization, Multi-Lingual & Multi-Cultural Agenda (p. 5) [ADHO—Standing Committee on Multi-lingualism & Multi-culturalism]. ADHO. Crymble, A., & Afanador-Llach, M. J. (2018, August 8). Writing Workshop Report Bogotá, Colombia. Programming Historian. https://programminghistorian.org/posts/bogotaworkshop-report Crymble, A., & Afanador-Llach, M. J. (2021). Digital History: The Globally Unequal Promise of Digital Tools for History: UK and Colombia’s Case Study. In A. Nye & J. Clark (Eds.), Teaching History for the Contemporary World: Tensions, Challenges and Classroom Experiences in Higher Education (pp. 85–98). Springer. Dombrowski, Q. (2020). 
AROUND DH 2020: Digital Humanities Projects in Languages Other Than English or from Minoritized Groups Worldwide. http://arounddh.org Dombrowski, Q. (2023). Bringing Languages Into the DH Classroom. In Brian Croxall & Diane K. Jakacki (Eds.), What We Teach When We Teach DH: Digital Humanities in the Classroom. University of Minnesota Press. Dombrowski, Q., & Burns, P. (2023). Language Is Not a Default Setting: Countering DH’s English Problem. In M. K. Gold (Ed.), Debates in the Digital Humanities 2023. University of Minnesota Press. Estill, L., Guiliano, J., Ortega, É., Terras, M., Verhoeven, D., & Layne-Worthey, G. (2022). The Circus We Deserve? A Front Row Look at the Organization of the Annual Academic Conference for the Digital Humanities. Digital Humanities Quarterly, 16(4). Fiormonte, D. (2012). Towards a Cultural Critique of the Digital Humanities. Historical Social Research/Historische Sozialforschung, 37(3 (141)), 59–76. Fiormonte, D. (2021). Taxation Against Overrepresentation?: The Consequences of Monolingualism for Digital Humanities. In D. Kim & A. Koh (Eds.), Alternative Historiographies of the Digital Humanities (pp. 333–376). Punctum Books. Fiormonte, D., & Rio Riande, G. del. (2022). The Peripheries and Epistemic Margins of Digital Humanities. In J. O’Sullivan (Ed.), The Bloomsbury Handbook to the Digital Humanities. Bloomsbury Publishing. Fiormonte, D., and Sordi, P. (2019). Humanidades Digitales del Sur y GAFAM. Para una geopolítica del conocimiento digital. Liinc em Revista, 15(1), Article 1. https://doi.org/10. 18617/liinc.v15i1.4730

Multicultural Digital Scholarship Methods Publishing 29 Galina Russell, I. (2014). Geographical and Linguistic Diversity in the Digital Humanities. Literary and Linguistic Computing, 29(3), 307–316. https://doi.org/10.1093/llc/ fqu005 Gil, A., Cordell, R., Risam, R., Bordalejo, B., Arthur, P., & Pimentel, M. (2018). Around DH in 80 Days [Map]. www.arounddh.org/ Grandjean, M. (2014). Le rêve du multilinguisme dans la science: L’exemple (malheureux) du colloque #DH2014 | Martin Grandjean [Personal]. Martin Grandjean. www. martingrandjean.ch/multilinguisme-dans-la-science-dh2014/ Isasi, J., & Rio Riande, G. del. (2022). ¿En qué lengua citamos cuando escribimos sobre Humanidades Digitales? Revista De Humanidades Digitales, 7, 127–143. https://doi. org/10.5944/rhd.vol.7.2022.36280 Isasi, J., & Rojas Castro,A. (2021). ¿Sin equivalencia? Una reflexión sobre la traducción al español de recursos educativos abiertos. Hispania, 104(4), 613–624. https://doi.org/10.1353/hpn. 2021.0130 Lincoln, M., Isasi, J., Melton, S., & Laramée, F. D. (2022). Relocating Complexity: The Programming Historian and Multilingual Static Site Generation. Digital Humanities Quarterly, 16(2). www.digitalhumanities.org/dhq/vol/16/2/000585/000585.html Lipsitz, G. (1995). The Possessive Investment in Whiteness: Racialized Social Democracy and the “White” Problem in American Studies. American Quarterly, 47(3), 369. https:// doi.org/10.2307/2713291 Nasser, R. (n.d.). Ramsey Nasser—‫[ بلق‬Personal]. Ramsey Nasser —. https://nas.sr/%D9% 82%D9%84%D8%A8/ O’Donnell, D. P., Walter, K. L., Gil, A., & Fraistat, N. (2015). Only Connect: The Globalization of the Digital Humanities. In S. Schreibman, R. Siemens, & J. Unsworth (Eds.), A New Companion to Digital Humanities (pp. 493–510). John Wiley & Sons, Ltd. https:// doi.org/10.1002/9781118680605.ch34 Ortega, É. (2014, July 21). Whispering/Translating During DH2014: Five Things We Learned [Personal]. Élika Ortega. 
https://web.archive.org/web/20150623204053/http://elikaortega.net:80/2014/07/21/dhwhisperer/
Perlin, R., Kaufman, D., Lampel, J., Daurio, M., Turin, M., & Craig, S. (Eds.). Languages of New York City (digital version) [Map]. New York: Endangered Language Alliance. http://languagemap.nyc
Pimenta, R., & Alves, D. (2021). Humanidades digitais e o mundo lusófono [Digital humanities and the Lusophone world]. FGV Editora. https://editora.fgv.br/produto/humanidades-digitais-e-o-mundo-lusofono-3625
Priani Saisó, E., & González-Blanco García, E. (Eds.). (2018). Spanish Language Special Issue. Digital Humanities Quarterly, 12(1). www.digitalhumanities.org/dhq/vol/12/1/index.html
Raynor, C. (2021). Embedding Multilingualism. Cultural Analytics. https://culturalanalytics.org/post/1157-embedding-multilingualism
Raynor, C., & Ferla, L. (Eds.). (2020). Portuguese Language Special Issue. Digital Humanities Quarterly, 14(2). www.digitalhumanities.org/dhq/vol/14/2/index.html
Rio Riande, G. del, Gil, A., O'Donnell, D., & Ortega, É. (2016). The Translation Toolkit. https://go-dh.github.io/translation-toolkit/
Risam, R. (2016). Diasporizing the Digital Humanities: Displacing the Center and Periphery. International Journal of E-Politics (IJEP), 7(3), 65–78. https://doi.org/10.4018/IJEP.2016070105
Risam, R. (2018). New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy. Northwestern University Press. https://nupress.northwestern.edu/9780810138858/new-digital-worlds

R para Ciencia de Datos [R for data science] (M. Alfaro, M. Alonso, F. Álvarez, Z. Bazurto, Y. Bellini, J. Benítez, M. P. Caldas, E. Campitelli, & R. Quiroga, Trans.). (2019). https://es.r4ds.hadley.nz/
Terras, M. (2012, January 20). Infographic: Quantifying Digital Humanities. Melissa Terras. https://melissaterras.org/2012/01/20/infographic-quantifying-digital-humanities/
Trevino, M. T. (2020, April 14). The Many Languages Missing From the Internet. BBC. www.bbc.com/future/article/20200414-the-many-lanuages-still-missing-from-the-internet
Whose Knowledge? (2021). Whose Knowledge?: Re-Imagining and Re-Designing the Internet to Be for and By Us All. https://whoseknowledge.org/wp-content/uploads/2021/04/WK-Prospectus-2021.pdf

2 Diversifying Digital Biodiversity Knowledge: A Latin American Multilingual Perspective on the Biodiversity Heritage Library

Lidia Ponce de la Vega

The Biodiversity Heritage Library, or BHL, is described on its homepage (www.biodiversitylibrary.org) as a resource that "improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community." Collaborative knowledge production, open access, and globality are thus key concepts in the conception and development of this project and position BHL within a larger context of similar online and DH initiatives anchored in the promotion of global open access to knowledge.

Since its foundation in 2005, BHL has become a fundamental resource for scientific, social, and humanities research. The Library's blog entries featuring user researchers around the globe,1 numerous awards such as the 2010 John Thackray Medal (The Society for the History of Natural History), and BHL's standing as an authoritative site (with a Domain Rating of 80 on Ahrefs and 71 on Moz2) attest to BHL's outstanding position in networks of biodiversity knowledge production, decision-making, and multidisciplinary research. Thus, BHL's presence in such networks can potentially "infor[m] government, key stakeholders, policy and models of care" (Ryder et al. 256), fundamental instances through which BHL communicates and transfers knowledge.

By highlighting the intrinsic connection between biodiversity and humanity, BHL documents the cultural values we attribute to biodiversity. Its collections exemplify the ways through which technological advances impact our relationships with biodiversity (Gaikwad and Chavan 1), in this case, through the digital dissemination of knowledge. Considering that most of our interactions with nature are technologically mediated (Heise 75; Sinclair and Posthumus 371), BHL becomes a place for human users to virtually interact with biodiversity and establish relationships with nonhuman subjects.
Given this role of BHL, it becomes fundamental to ask which human users can utilize BHL and, therefore, establish interspecies relationships through its materials. In other words, the practices of representation and participation that occur at different levels of BHL determine who can meaningfully engage with and within biodiversity through its platform. Thus, BHL engages in systems of inclusion and exclusion heavily informed by power relationships, including issues of linguistic injustice and misrepresentation.

DOI: 10.4324/9781003393696-4


My goal in this chapter is to identify, acknowledge, and devise strategies to counteract colonial biases present in BHL from a multilingual and multicultural standpoint. Through these analyses, I seek to contribute to the ongoing dialogue around the (im)possibilities for truly decolonial multilingual digital spaces and studies. By focusing on the case of Latin America and through a mixed-methods approach that includes web analytic data and critical metadata analyses, I explore the coloniality and decolonial potentiality of BHL's partnerships, audiences, and language and Indigenous representation to inform other similar archival and heritage projects, as well as similar multilingual and multicultural initiatives. The lessons to be learned from the victories and limitations of BHL can help understand broader archival and representation practices applicable to digital and DH projects aiming for the creation of multilingual global open-access platforms. While many of these projects, including BHL, adhere to a global approach in their practices of representation and outreach, they often fall short in contending with the colonial dynamics around epistemic standards, a shortcoming that can only be addressed by developing strategies that restructure and reappropriate Global-North-centric and Anglocentric paradigms.

In this context, the case study presented in this chapter highlights the complexities of working digitally in/with multiple languages, a topic shared by many of the chapters in the present volume. For instance, as discussed by Jennifer Isasi and co-authors, digital humanities projects and organizations, such as the Alliance of Digital Humanities Organizations (ADHO),3 often seek to incorporate languages other than English but remain in the realm of what I call inclusion as presence.
As I discuss later in this chapter, inclusion as presence occurs when the mere, often superficial, incorporation of other languages is seen as a multilingual effort, without tackling the actual systemic roots of linguistic under- and misrepresentation. Thus, my analyses of BHL from the perspective of multilingualism and Latin America are a call for digital/DH researchers, practitioners, and archivists to analyze the foundations of the very systems to which they seek to apply multilingual practices. Such a goal must not become a mask, disguising the historical, geopolitical, and cultural bases of linguistic injustice. Instead, multilingualism must participate in a deeper decolonial restructuring of digital and non-digital systems of knowledge and representation.

2.1. Toward Multilingualism: BHL's Global Nodes

A key component of the decolonization of BHL is multilingual and multicultural representation. The original member institutions of BHL were exclusively located in the United States and the United Kingdom (Pilsk et al. 137–138), centering BHL around the United States and Europe and around English as the language of biodiverse knowledge. BHL's initial so-called globality overlooked the Global South, despite many of the collections of these institutions pertaining, precisely, to biodiversity in and from the Global South. In other words, the ownership, curation, online presence, and production of knowledge about biodiversity from the Global South were in the hands of hegemonic Anglophone institutions of the Global North. These originally limited affiliations notwithstanding, BHL has gradually advanced its goal of becoming a more global open-access repository through partnerships.

Since 2009, BHL has established global nodes with institutions throughout the world that contribute biodiversity-related materials to its collections. Among these nodes, an important example from the perspective of language representation is BHL México (Kalfatovic and Rinaldo 6), which began in 2015 as a partnership between BHL and several Mexican institutions led by the Comisión Nacional para el Conocimiento y Uso de la Biodiversidad (CONABIO), a government agency in charge of the protection of biodiversity in Mexico and the dissemination of plural knowledges, with an emphasis on Indigenous cultures. Seeking to promote access and plurality, one of BHL México's objectives was to increase the availability of materials in Spanish in BHL (Tapia Tinajero and Guzmán Vera 56). Before BHL México, BHL included 600 texts about Mexico, only 44 of which were in Spanish (ibid.). The rest were produced outside of Mexico, mainly in the United States and Europe: Mexican biodiversity was the object of study, but Mexico was not the site of production of that knowledge. In contrast, the majority of BHL México's records are in Spanish and published in Mexico. Therefore, through their partnership, CONABIO reaches out to global audiences, and BHL opens spaces for the incorporation of multiple languages and sites of knowledge production.

Moreover, BHL México's texts are currently grouped under BHL's collection Publicaciones en español, to date the only listed collection on BHL built around a specific language and not presented in English. With this and other similar moves, BHL seems to be expanding its collection and services to become a multilingual platform and diversify access. Nevertheless, BHL continues to be heavily dominated by English and centered in the Global North.
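Collection-level language distributions like those reported for BHL México are simple aggregations over exported title metadata. The sketch below illustrates the computation in Python; the rows, collection names, and field layout are invented stand-ins, not BHL's actual export schema:

```python
from collections import Counter

# Invented stand-in rows for a title-metadata export: (collection, language).
records = [
    ("BHL SciELO", "Portuguese"), ("BHL SciELO", "Portuguese"),
    ("BHL SciELO", "English"),
    ("BHL México", "Spanish"), ("BHL México", "Spanish"),
]

def language_shares(rows, collection):
    """Percentage of a collection's records per language of publication."""
    langs = Counter(lang for coll, lang in rows if coll == collection)
    total = sum(langs.values())
    return {lang: round(100 * n / total, 1) for lang, n in langs.items()}

print(language_shares(records, "BHL SciELO"))  # → {'Portuguese': 66.7, 'English': 33.3}
```

The same grouping, run per global node, yields figures directly comparable to the percentages discussed in this section.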
The majority of the 156,956 materials present in BHL's catalog as of June 2020 were in English, even for nodes such as BHL China, BHL Africa, and BHL Singapore (Figure 2.1). In contrast, BHL's Latin American nodes seem to focus more on linguistic representation. In their respective collections, BHL México and BHL SciELO (the node for Brazil) favor each country's most spoken language—Spanish and Portuguese, respectively. As of June 2020, 84% of BHL SciELO's collection was in Portuguese and over 99% of BHL México's was in Spanish, with only two materials out of 1,172 in languages other than Spanish (Figure 2.2). While these figures highlight the value of partnerships in strengthening the presence of languages other than English in BHL, these five collections together reveal a heavily Anglocentric catalog and an appalling absence of non-hegemonic languages, such as Indigenous languages. Decolonization is a necessary political reorientation of knowledge production (TallBear), meaning that BHL can only decolonize its collections by reversing traditional epistemic hierarchies, in this case, beyond the Anglocentrism of online spaces.

2.2. Why Language Matters: Audience Engagement

Multilingual representation in BHL ties to diverse access for global communities and must translate into the meaningful participation of diverse users through the inclusion of diverse linguistic communities. It is essential to consider the geopolitics


Figure 2.1 Language representation in BHL China (left), BHL Singapore (middle), and BHL Africa (right).


Figure 2.2 Language representation in BHL SciELO (left) and BHL México (right).

of access to BHL to understand who can engage in bio-diverse knowledge production through the Library and to what extent multilingualism can further diversify access. In this regard, web analytic data collected through custom comparative reports generated on Similarweb (www.similarweb.com) provide an interesting panorama of how audiences engage with biodiversity-related content, in this case, from the perspective of Latin America and, particularly, Mexico. A comparative analysis of traffic shares by country to BHL (biodiversitylibrary. org), CONABIO (biodiversidad.gob.mx), and CONABIO’s library, Bioteca, (bioteca.biodiversidad.gob.mx) reveals important patterns of access and allows for a more nuanced understanding of the linguistic and cultural diversity of actual and potential users. Traffic shares by country refer to the percentage of cumulative traffic that comes from a specific country to each selected website (Similarweb). While factors such as country-specific restrictions or individuals’ use of virtual private networks (VPNs) can obstruct these data, traffic shares by country illuminate the geopolitics of audience access, a matter particularly relevant for BHL’s multilingualism. The country with the most traffic to BHL, CONABIO, and Bioteca between April and June 2020 was, not surprisingly, Mexico, with a cumulative share of 47% (Figure 2.3). What is interesting, though, is that BHL has a considerably lower percentage of visits from Mexico when compared to the other websites (1.9%), suggesting that Mexican audiences prefer local websites when engaging with online biodiversity knowledge, while BHL has a low comparative penetration in the country, possibly due to cultural proximity and language familiarity. Since CONABIO seems to have a consolidated status in Mexico, international audiences might be the primary target of the BHL México collection, meaning that BHL’s representation


Figure 2.3 Traffic shares by country to BHL, CONABIO, and Bioteca. Comparative data from Similarweb, April–June 2020.

of Mexican bio-diverse knowledge has a more global outlook. Thus, the decolonization of BHL becomes even more fundamental, as the Library might be an international window into Mexican bio-diverse knowledge production. The necessary ethical stance vis-à-vis Mexico that such a position entails is all the more crucial given BHL's significantly higher traffic numbers in other countries, especially outside Latin America. US audiences register access to all three websites, but BHL has a considerably higher traffic share at 91.2%. Similarly, Germany, France, Australia, and the UK register higher traffic to BHL (over 95%) and no traffic to Bioteca, indicating that audiences engage with CONABIO through its main website but not so much through its library, which houses its core knowledge production. These data thus evidence the prominence of BHL in the Global North and highlight the responsibility of BHL regarding its representation of Global South knowledges and biodiversity vis-à-vis its international audiences. Although open access and BHL's global outlook point to a trans-geographical digital environment, a clear ethical stance concerning its politics of linguistic and cultural representation is fundamental in the decolonization of its collections and the diversification of audiences.

Mexico, the United States, Colombia, Peru, and Spain are the only countries on the list that register traffic to all three websites. These countries are also among the six countries with the largest number of native speakers of Spanish in the world (Instituto Cervantes 7–9). Considering that Spanish is the third most used language on the Internet (50), these figures reiterate the importance of language for promoting diversified access, as well as the relevance of BHL México and the collection Publicaciones en español. The case of Brazil further attests to the centrality of multilingualism in increasing traffic to biodiversity-related websites.
Brazil is not only the country with the highest percentage of traffic to BHL (99%) and the only listed Latin American country where audiences do not prefer CONABIO but also the only listed Latin American country where Spanish is not the most spoken language. Thus, the main language of these websites seems to influence traffic shares by country significantly, even when geographical commonality exists.4 While BHL has considerably higher penetration among audiences in Europe and the United States, countries in Latin America seem to prefer CONABIO, a pattern explained, on the one hand, by the importance of Spanish in these countries and, on the other, by the strong bond between these countries concerning biodiversity. For instance, CONABIO receives 68.1% of traffic from Colombia and 65.3% from Peru, a preference arguably informed by language but also by international relations. Colombia, Peru, and Mexico belong to Conservation International's group of megadiverse countries (Comisión Nacional para el Conocimiento y Uso de la Biodiversidad, 'México megadiverso') and to the Group of Like-Minded Megadiverse Countries (LMMC) (Benítez-Díaz), composed of nations in Latin America, Asia, and Africa that share goals concerning sovereignty, equity, natural heritage, Indigenous cultures, and biodiversity-related policy (Department of Environmental Affairs South Africa). Furthermore, in 2016, Colombia, Mexico, Peru, and Chile, members of the Alianza del Pacífico, signed a declaration of shared commitment to sustainable development, emphasizing biodiversity conservation (SEMARNAT

México et al.). Thus, the prolific relationship between Mexico, Colombia, and Peru around topics of biodiversity has been continuous since the early 2000s, and the networks of bio-diverse collaboration that originate from this relationship, along with the availability of materials in Spanish, could explain the interest in Mexican biodiversity-related websites.

Nonetheless, the arguments about the value of language are not necessarily true for all countries. While Spain and the United States register access to all websites and are two of the countries with the largest number of Spanish speakers in the world, they show a considerably greater traffic share to BHL. Here, the complex panorama of traffic by country moves beyond language and into the sphere of knowledge production, thereby underscoring the importance of the BHL–CONABIO partnership. The case of Spain is particularly illuminating. While language can influence Spain's interest in Mexican websites, Spanish audiences seem closer to European preferences—Spain is the only Hispanic country where BHL has the highest traffic share (71.4%). Although less stark than in other European countries, traffic trends in Spain confirm BHL's status among European audiences and perhaps hint at a less robust relationship between Spain and Latin America around biodiversity, despite linguistic commonality. This, in turn, could be symptomatic of a less robust Global South–Global North collaboration regarding biodiversity-related topics.

In sum, the cases of Brazil and Spain emphasize the need for a strategy that considers plural linguistic and geographical representation and widens the networks of online bio-diverse knowledges beyond the constraints of hegemonic languages and geocultural barriers.
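The comparative measure used throughout this section is straightforward to reproduce: a country's traffic share for a given site is that site's visits divided by the country's visits across all compared sites. A minimal sketch follows; the raw visit counts are invented for illustration (Similarweb reports only the resulting percentages), chosen so that the shares match two figures quoted above:

```python
# Hypothetical per-country visit counts to the three compared sites.
# Counts are invented; only the resulting percentage shares are meaningful.
visits = {
    "Mexico": {"BHL": 1_900, "CONABIO": 88_000, "Bioteca": 10_100},
    "Spain":  {"BHL": 5_000, "CONABIO": 1_500,  "Bioteca": 500},
}

def traffic_shares(by_site):
    """Share of a country's cumulative traffic that goes to each site."""
    total = sum(by_site.values())
    return {site: round(100 * n / total, 1) for site, n in by_site.items()}

for country, counts in visits.items():
    print(country, traffic_shares(counts))
# Mexico's BHL share works out to 1.9% and Spain's to 71.4%,
# matching the percentages discussed in the text.
```

Because the measure normalizes within each country, it compares site preferences rather than absolute audience sizes, which is why Mexico can dominate cumulative traffic while still registering a low share for BHL.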
Multilingualism and diverse linguistic representation are but the first steps of what should be a deeper restructuring of the networks of bio-diverse knowledge that transcends the hierarchical dichotomy between Latin America and Europe and Northern America. In undertaking decolonizing strategies for its archive, BHL could then promote more meaningful Global South–Global North biodiverse interactions, another possible advantage of projects such as BHL México that could lead to a truly decolonizing multilingual library.

2.3. The Latin American Linguistic Panorama: Spanish and Indigenous Languages

For BHL, multilingualism and multiculturalism should aim at counteracting the hegemony of English and the Global North in the production of bio-diverse knowledge from a decolonial standpoint. BHL Africa, for example, should not be simply a node located on the African continent but one that incorporates multilingual and multicultural knowledge production originated in Africa by diverse and plural communities. Decolonization is not akin to inclusion (TallBear); having partners in the Global South without questioning the colonial nature of the materials and archival practices that they, as well as BHL, reproduce, is insufficient. Even where apparent diversification exists—for example, in the inclusion of languages other than English—BHL is still subject to the Eurocentrism of archives (Risam 47), with an absence, for instance, of materials in non-hegemonic languages. Even global nodes such as BHL China, BHL Africa, BHL Singapore, BHL SciELO, and

BHL México lack materials in Indigenous languages, despite the cultural richness of these regions and countries. In the case of Mexico, the absence of Indigenous languages is even more relevant, considering that CONABIO pursues the re-valorization of Indigenous knowledge and sustainable practices (Comisión Nacional para el Conocimiento y Uso de la Biodiversidad, Estrategia Nacional Sobre Biodiversidad de México 19). CONABIO is thus in a privileged position to advocate for the incorporation of Indigenous knowledges and languages in BHL, with the added potential of increasing CONABIO's agency concerning the inclusion, re-positioning, and decolonization of Mexican bio-diverse epistemologies in BHL's catalog.

Nevertheless, it is fundamental to acknowledge the advantages of multilingual inclusivity, even in the case of Spanish. Despite the absence of Indigenous languages in BHL, highlighting the importance of other languages—even if they are still hegemonic—is at least a move away from English privilege. Emphasizing knowledge production in Spanish can counteract the online dominance of English as "the lingua franca of imperialism, knowledge work, and global capitalism" (Cushman 121). In the case of BHL, having a considerable number of materials in Spanish is a valuable move away from the digital hegemony of English and an opening to non-Anglophone communities. Furthermore, promoting BHL's selection of texts in Spanish is a potential step forward in providing access to Indigenous communities, especially in Latin America. In this region, out of 522 identified Indigenous peoples, 99 have lost their language and speak either Spanish or Portuguese (UNICEF España), while the rest are mostly bilingual (speakers of an Indigenous language plus Spanish or Portuguese) (López). Thus, increasing the online presence of materials in these two languages could potentially mean increased access for certain Indigenous communities.
The situation is, however, much more complex, particularly considering three fundamental factors: the digital divide, Spanish privilege, and Indigenous participation. Regarding the digital divide, in Mexico, for example, only 67% of the population above six years old has access to the Internet (Asociación de Internet.mx). The Southern and South Eastern regions—Campeche, Tabasco, Yucatán, Chiapas, Guerrero, and Oaxaca—have the lowest percentage of access to the Internet in the country (ibid.) and also register the highest rates of poverty (Consejo Nacional de Evaluación de la Política de Desarrollo Social), the largest concentrations of Indigenous communities (Comisión Nacional para el Desarrollo de los Pueblos Indígenas and Programa de las Naciones Unidas para el Desarrollo), and illiteracy rates above the national average (Instituto Nacional de Estadística y Geografía). Therefore, although it is possible to focus on the multilingual inclusion of Indigenous cultural products, these communities are still greatly disfavored in terms of access to the necessary infrastructure to benefit from so-called global open access.

Another fundamental aspect of the inclusion of Spanish in BHL and other archives relates to linguistic privilege. In Latin America, Spanish continues to be the so-called language of progress, which endangers Indigenous languages (López 90–92). This situation relates to an imbalanced appreciation of bilingualism, considered positive when it refers to Spanish and another hegemonic language—particularly English—and negative when it refers to Spanish and an Indigenous

language (94). Thus, while increasing the online presence of Spanish is a move away from Anglophone digital neoimperialism, it simultaneously contributes to the consolidation of Spanish as the language of knowledge in Latin America.

Along with the complexity of providing access for Indigenous users and Spanish privilege, it is key to consider the existing marginalizing epistemic and governing apparatus. A frequent question in this regard relates to whether Indigenous peoples would want to share their knowledge with repositories such as BHL. This is, however, not the real issue but a symptom of an exclusionary mode of thinking about partnerships. BHL should offer avenues for engagement and contributions to Indigenous users, scholars, and researchers. Like any other institutional or nation-State partner of the Library, Indigenous individuals and communities can then decide whether to share materials, artifacts, and knowledges. Additionally, rethinking infrastructure—for example, by opening avenues for the incorporation of multimedia forms of knowledge or using TK labels (Local Contexts)—would transform BHL into a more suitable platform for Indigenous contributions.

The goal is not to collect and incorporate materials concerning Indigenous peoples or languages as an effort to increase their presence in terms of numbers. This practice, which I call inclusion as presence, is insufficient and commodifies Indigenous knowledges (Simpson 199) by integrating marginalized epistemologies into otherwise colonial systems without questioning the structure of the systems themselves. This attitude conceals and consolidates the lack of infrastructure for non-Western modes of storytelling and knowing (e.g., in the case of BHL, by focusing on Western taxonomy and scientific names). Moreover, inclusion as presence can lead to what David Bordonaba-Plou and Laila M.
Jreis-Navarro call "the paradox of Anglocentric multilingualism." In their chapter in the present volume, the authors explain this paradox as the result of so-called multilingual projects that remain on a superficial level in terms of diversification, thus consolidating English privilege. In contrast, what I call decolonial inclusion requires a restructuring of epistemic systems that opens avenues for meaningful engagement, decision-making, and self-determination for Indigenous users and contributors beyond the tokenization of Indigenous artifacts and languages. Once more, while multilingualism is fundamental for archives, disregarding the geopolitics and colonial history of cultures, languages, and systems of knowledge can lead to a reinforcement of hierarchical structures of power. Although the decision-making process is not easy, online repositories must address these issues, especially regarding online epistemic practices (Christen 209).

Ideally, a site of knowledge production such as BHL includes all languages, particularly those that have historically been most oppressed, such as Indigenous languages. While the inclusion of all languages might seem utopian, BHL can undertake actions to initiate such a process. For instance, it could prioritize spaces for multimedia and nonwritten knowledges by following the example of projects such as the Cherokee Stories and Songs DVD archive (Cushman 122–125); establish collaborative and citizen-science projects that promote the engagement of Indigenous peoples, such as the Plateau Peoples' Web Portal (Christen 194–207) or The Zuni Collaborative Catalog (Boast and Enote 107–108); and reach out to Indigenous scholars who can work as intermediaries between institutions and Indigenous communities, such as

in the Academy for Maori Research and Scholarship (Te Mata o te Tau) (Durie 310). Perhaps these strategies will not lead to the incorporation of all languages in BHL, but they can lead to crucial changes that promote the necessary decolonial inclusion of marginalized linguistic and epistemic communities.

2.4. Subject Versus Object: Indigenous Languages in BHL

Distinguishing between what I call inclusion as presence and decolonial inclusion is a key component of the decolonization of BHL. As revealed by a critical examination of patterns of representation through metadata categories, the representation of Indigenous peoples in the Library requires careful attention to the binaries that BHL's collections perpetuate, which manifest, for instance, as a hierarchy between Indigenous and European languages.

For this critical metadata analysis, I employed BHL's open-access data files (creator.txt, subject.txt, and title.txt) as of July 1, 2021.5 Through custom (subject-based) SQL queries on Microsoft Access, I extracted metadata from these files for records that included at least one of the following subjects: Indians of Central America, Indians of Mexico, Indians of South America, and Indians of the West Indies.6 This process of data extraction resulted in 1,052 records7 (Subset 1), representing approximately 0.6% of BHL's entire collection, which included 166,339 titles as of July 2021. Afterward, I extracted the 159 subject lists8 of these records and, treating each subject list as a document, generated a co-occurrence network of their terms on KHCoder3 (khcoder.net). Co-occurrence networks are graph structures that visualize relationships between words based on the frequency with which they occur together, thus showing which other subjects appear in materials that contain the selected subjects.

Figure 2.4 Co-occurrence network for Subset 1. Generated on KHCoder3 with data from BHL as of July 2021.

The resulting co-occurrence network for Subset 1, for instance, shows a considerable presence of topics related to Indigenous languages (Figure 2.4). Subgraph 08 includes the words language, linguistics, glossary, and vocabulary in a significant relationship with terms such as indian, Tupi, and Mapuche, the latter two referring to two Indigenous groups of the region we call Latin America. Similarly, the term language(s) appears 22 times, always preceded by an adjective referring to an Indigenous language: Carib, Maya, Nahuatl, Tupi, Mapuche, Chiquito, Mayoruna, etc. Although the homogenizing nature of the terms included in Subset 1 could explain some of these trends, they resonate with patterns observed when considering more specific subjects.

When working with subject lists of materials that include the names of specific Indigenous groups roughly located throughout the region we call Latin America (Subset 2), the interest in Indigenous languages resurfaces. This subset contains 135 records9 (46 title IDs10), representing roughly 0.08% of BHL's collection as of July 2021. Similar to Subset 1, the word language appears 22 times in Subset 2, always accompanied by the same adjective that qualifies the subject referring to the Indigenous group included in the subset. For example, subject lists that include the topic Carib language always include the topic Carib Indians.
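The two steps of this method—subject-based filtering, then co-occurrence counting over each record's subject list—can be approximated in plain Python. The rows, subjects, and counts below are invented for illustration; BHL's actual subject.txt is a tab-separated export, and KHCoder additionally weights network edges with association measures rather than raw pair counts:

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for BHL's subject export: (title_id, subject) pairs (invented).
rows = [
    (1, "Indians of Mexico"), (1, "Nahuatl language"), (1, "Description and travel"),
    (2, "Indians of South America"), (2, "Carib language"), (2, "Carib Indians"),
    (3, "Ornithology"), (3, "Birds"),
]

SEEDS = {"Indians of Central America", "Indians of Mexico",
         "Indians of South America", "Indians of the West Indies"}

# Group subjects by record, then keep records containing at least one seed
# subject -- the same filter as the chapter's subject-based SQL query.
subjects_by_title = {}
for title_id, subject in rows:
    subjects_by_title.setdefault(title_id, set()).add(subject)

subset = {tid: subs for tid, subs in subjects_by_title.items() if subs & SEEDS}

# Co-occurrence counts: each record's subject list is one "document", and
# every unordered pair of subjects within it co-occurs once.
cooc = Counter()
for subs in subset.values():
    for a, b in combinations(sorted(subs), 2):
        cooc[(a, b)] += 1

print(len(subset))  # → 2 (records matching a seed subject)
print(cooc.most_common(3))
```

Pair counts like these are what the co-occurrence network visualizes: each subject becomes a node and each counted pair an edge, so pairings such as Carib language/Carib Indians surface directly from the tallies.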
Arguably, this characteristic of subject lists indicates the coloniality of metadata and of these records: the interest in Indigenous peoples is not exclusively linguistic, as the topic Carib language alone should be sufficient to indicate the linguistic nature of a record. Furthermore, similar to Subset 1, in the most intricate subgraph in the co-occurrence network for Subset 2 (subgraph 01), language-related words such as vocabulary, glossary, language, and linguistics co-occur alongside topics such as Description and travel and natural history, which often refer to colonization and colonial expeditions (Figure 2.5).

What is perhaps most surprising in both subsets is the almost complete absence of biodiversity-related topics. In Subset 2, the most frequent subject referring to nonhuman subjects is Ornithology, with six occurrences, while in Subset 1, it is Plants, with only three (Figure 2.6). Word frequencies then suggest that, in subsets concerning the Indigenous peoples of what we call Latin America, the focus of attention is not necessarily biodiversity but Indigenous peoples, who, to a certain extent, "substitute" for nonhuman species as objects of study. These patterns highlight the intertwined (hi)stories of colonial oppression of the Indigenous and nonhuman Others, still communicated through BHL's catalog.

Moreover, Subset 2 is fully dominated by the Global North in terms of knowledge production and by English in terms of language of publication: all records in this subset were published in Europe and the United States, and more than half of the subset is in English. In sum, Subsets 1 and 2 reveal a stark dichotomy between European and Indigenous languages, as well as colonial approaches to Indigenous peoples, which BHL continues to foster.

Diversifying Digital Biodiversity Knowledge 43

Figure 2.5 Selected subjects and co-occurrence network for Subset 2. Generated on KHCoder3 with data from BHL as of July 2021.

44

Lidia Ponce de la Vega

Figure 2.6 Word frequencies in subject lists in Subset 1 (left) and Subset 2 (right). Generated on KHCoder3 with data from BHL as of July 2021.

The case of Indigenous languages in BHL shows the damaging results of what I call inclusion as presence. While Indigenous languages are there, they only appear objectified and tokenized. In contrast, through the decolonial inclusion of Indigenous languages in BHL, for which I argued in previous sections, diverse systems of knowledge could “entwine in equal partnerships, where one system is not superior over the other” (Ryder et al. 257). Engaging Indigenous peoples and epistemic systems as a key component of biodiversity-related knowledge production can promote more trustworthy attitudes toward Western scientific knowledge among Indigenous peoples (Durie 305) and, perhaps, open paths for greater engagement of Indigenous users with BHL. These decolonial strategies could then lead to the diversification of human histories with and within biodiversity and allow for more equitable interspecies relationships that include plural human cultural and linguistic communities.

As shown throughout this chapter, BHL requires further diversification of its collections and partners to ensure the equitable representation of otherwise marginalized linguistic and cultural communities. The case of BHL illuminates the dangers of partial approaches to multilingualism and global open access, a lesson applicable to a myriad of multilingual digital platforms and DH projects. BHL and other repositories that align with its goals of globality and openness should address core issues of linguistic injustice to (re)construct and (re)contextualize knowledge production from a plural perspective. The methodology and case studies presented here can serve as a model for digital projects to identify key areas where a decolonizing multilingual restructuring might be necessary. The opposition between inclusion as presence and decolonial inclusion can serve as a guiding framework in the pursuit of truly decolonial multilingual digital environments and studies. Overall, perhaps the greatest lesson lies in understanding the decolonization of digital and DH practices as vital in the path toward the decolonization of knowledge, especially concerning oppressed and silenced (hi)stories beyond superficial approaches to multilingualism and multiculturalism.

Notes

1 Entries under this category are labeled “user stories” on BHL’s blog and include three blog posts resulting from my personal collaboration with BHL: https://blog.biodiversitylibrary.org/category/user-stories. The data and analyses in my blog posts with BHL are the bases for this chapter.
2 Domain Rating (DR) is a measure of “the strength of a website’s backlink profile . . . on a 100-point scale” that considers a website’s inbound and external links and their authority (Ahrefs). Although DR is relative (i.e., comparative), it continues to be an important measure of a site’s authority: the higher the DR, the more authority the site holds.
3 As discussed by the authors, ADHO has mostly focused on translation efforts for its conferences and website. In fact, I have participated in these (admittedly partial) multilingual efforts as the Spanish translator of the DH2022 and DH2023 Calls for Papers and in ADHO’s website translation team (2021–2022).
4 Although beyond the scope of this chapter, BHL’s higher penetration in Brazil could be related to BHL SciELO.
5 BHL’s data (updated monthly) is available at www.biodiversitylibrary.org/data/.
6 I began some of the analyses in this section during an internship with BHL in 2021 under the supervision of Program Director Martin Kalfatovic, who suggested the method described here.
7 The extracted data and subject lists are available in the files IP-G2.csv and IP-G2-SL.csv respectively on my personal GitHub: https://github.com/LidiaPV/BHL-analyses.
8 The difference between the number of records and subject lists is due to titles under the same ID sharing the same subject list. Title IDs conflate numbers, volumes, or issues of periodicals into a single shared ID with a single subject list. The greater the difference between the number of records and lists (as in this case), the greater the presence of periodicals in the subset.
9 The extracted data in this subset are available in the file IP-S.xlsx on BHL’s GitHub: https://github.com/gbhl/bhl-us-data-sets/tree/master/Metadata-LatinAmerica/Data%20files/Blog%20post%201/Metadata%20subsets and in the file IP-S.csv on my personal GitHub. The extracted subject lists are available in the file IP-S-SL.csv, also on my personal GitHub (see note 7).
10 See note 8.


References

Ahrefs. ‘What Is Domain Rating (DR)?’. Ahrefs Help Center, https://help.ahrefs.com/en/articles/1409408-what-is-domain-rating-dr. Accessed 15 Sept. 2021.
Asociación de Internet.mx. 14o Estudio Sobre Los Hábitos de Los Usuarios de Internet En México 2018, 17 May 2018, www.asociaciondeinternet.mx/estudios/antiguos.
Benítez-Díaz, Hesiquio. The Like-Minded Megadiverse Countries Group (LMMC). SBSTTA 17, Montreal, Canada, www.environment.gov.za/sites/default/files/docs/mexicopresentation_lmmc.pdf.
Boast, Robin, and Jim Enote. ‘Virtual Repatriation: It Is Neither Virtual Nor Repatriation’. Heritage in the Context of Globalization: Europe and the Americas, edited by Peter Biehl and Christopher Prescott, Springer Science+Business Media, 2013, pp. 103–113.
Christen, Kimberly. ‘Opening Archives: Respectful Repatriation’. The American Archivist, vol. 74, no. 1, 2011, pp. 185–210.
Comisión Nacional para el Conocimiento y Uso de la Biodiversidad (CONABIO). Estrategia Nacional Sobre Biodiversidad de México (ENBioMex) y Plan de Acción 2016–2030. Comisión Nacional para el Conocimiento y Uso de la Biodiversidad, 2016.
Comisión Nacional para el Conocimiento y Uso de la Biodiversidad (CONABIO). ‘México megadiverso’. Biodiversidad Mexicana, www.biodiversidad.gob.mx/pais/quees. Accessed 20 July 2020.
Comisión Nacional para el Desarrollo de los Pueblos Indígenas, and Programa de las Naciones Unidas para el Desarrollo. Regiones Indígenas de México. Edited by Enrique Serrano Carreto. Comisión Nacional para el Desarrollo de los Pueblos Indígenas, 2006.
Consejo Nacional de Evaluación de la Política de Desarrollo Social. ‘Informes de Pobreza y Evaluación de Las Entidades Federativas 2020’. CONEVAL—Entidades Federativas, 2020, www.coneval.org.mx/coordinacion/entidades/Paginas/Informes_Pobreza_Evaluacion_2020.aspx.
Cushman, Ellen. ‘Wampum, Sequoyan, and Story: Decolonizing the Digital Archive’. College English, Special Issue: The Digital Humanities and Historiography in Rhetoric and Composition, vol. 76, no. 2, Nov. 2013, pp. 115–135.
Department of Environmental Affairs South Africa. ‘Group of Like-Minded Megadiverse Countries (LMMC)’. Environment, Forestry, and Fisheries. Republic of South Africa, www.environment.gov.za/likeminded_megadiversecountries_lmmc. Accessed 20 July 2020.
Durie, Mason. ‘Indigenous Knowledge Within a Global Knowledge System’. Higher Education Policy, vol. 18, 2005, pp. 301–312.
Gaikwad, Jitendra, and Vishwas Chavan. ‘Open Access and Biodiversity Conservation: Challenges and Potentials for the Developing World’. Data Science Journal, vol. 5, June 2006, pp. 1–17.
Heise, Ursula. ‘From Extinction to Electronics: Dead Frogs, Live Dinosaurs, and Electric Sheep’. Zoontologies: The Question of the Animal, edited by Cary Wolfe, University of Minnesota Press, 2003, pp. 59–81.
Instituto Cervantes. El Español: Una Lengua Viva. Informe 2019, 2019, www.cervantes.es/imagenes/File/espanol_lengua_viva_2019.pdf.
Instituto Nacional de Estadística y Geografía. ‘Analfabetismo’. INEGI, 2020, www.cuentame.inegi.org.mx/poblacion/analfabeta.aspx.
Kalfatovic, Martin, and Constance Rinaldo. ‘Enabling Progress in Global Biodiversity Research: The Biodiversity Heritage Library’. Enabling Progress, the Proceedings of the Eighth Shanghai International Library Forum, Shanghai Scientific and Technological Literature Press, 2016, pp. 406–418.
Local Contexts. ‘Traditional Knowledge (TK) Labels’. Local Contexts, https://localcontexts.org/tk-labels/. Accessed 27 June 2020.
López, Luis Enrique. ‘Pueblos, Culturas y Lenguas Indígenas En América Latina’. Atlas Sociolingüístico de Pueblos Indígenas En América Latina, UNICEF and FUNPROEIB Andes, 2009, pp. 19–99.
Pilsk, Suzanne, et al. ‘The Biodiversity Heritage Library: Advancing Metadata Practices in a Collaborative Digital Library’. Journal of Library Metadata, vol. 10, no. 2–3, 2010, pp. 136–155, https://doi.org/10.1080/19386389.2010.506400.
Risam, Roopika. ‘Colonial Violence and the Postcolonial Digital Archive’. New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy, Northwestern University Press, 2018, pp. 47–64, https://muse.jhu.edu/book/62714.
Ryder, Courtney, et al. ‘Indigenous Research Methodology. Weaving a Research Interface’. International Journal of Social Research Methodology, vol. 23, no. 3, 2020, pp. 255–267, https://doi.org/10.1080/13645579.2019.1669923.
SEMARNAT México, et al. Declaración de Los Ministros de Ambiente de La Alianza Del Pacífico Hacia Una Plataforma de Crecimiento Verde (Cartagena de Indias, 30 de Marzo 2016). Alianza del Pacífico, 30 Mar. 2016, https://alianzapacifico.net/wp-content/uploads/Declaracion-Ministros-de-Ambiente-de-la-AP.pdf.
Simpson, Leanne Betasamosake. As We Have Always Done: Indigenous Freedom Through Radical Resistance. University of Minnesota Press, 2017, https://doi.org/10.5749/j.ctt1pwt77c.
Sinclair, Stéfan, and Stephanie Posthumus. ‘Digital? Environmental: Humanities’. The Routledge Companion to the Environmental Humanities, edited by Ursula Heise et al., Routledge, 2017, pp. 369–377.
The Society for the History of Natural History. ‘The John Thackray Medal 2010’. Awards, Honours and Medals, 2010, https://shnh.org.uk/awards-honours-medals/shnh-book-prize/john-thackray-medal-2010/.
TallBear, Kim. Diversity V. Decolonization in the Academy, a Conversation with Kim TallBear. Department of English, Dawson College, February 2021.
Tapia Tinajero, María del Socorro, and Rosa María Guzmán Vera. ‘La Biodiversity Heritage Library (BHL), Un Proyecto Colaborativo Con El Instituto de Biología Para La Preservación Digital Del Acervo Histórico’. Conectando Los Saberes de Bibliotecas, Archivos y Museos (BAM) En Torno a La Preservación de Documentos Analógicos y de Origen Digital, edited by Perla Olivia Rodríguez Reséndiz and María Teresa Fernández Bajón, UNAM, Instituto de Investigaciones Bibliotecológicas y de la Información, 2019, pp. 47–61.
UNICEF España. UNICEF Presenta El Atlas Sociolingüístico de Pueblos Indígenas En América Latina, UNICEF, www.unicef.es/prensa/unicef-presenta-el-atlas-sociolinguistico-de-pueblos-indigenas-en-america-latina. Accessed 29 June 2020.

3 Applications and Developments of NLP Resources for Text Processing in Indian Languages: Shared Multilingual Corpora Building and Pre-trained Models

Justy Joseph, Lalithsriram SR, and Nirmala Menon

3.1. Introduction

India has over 1,600 dialects, with 122 major languages spoken by more than a billion people. The Eighth Schedule of the Indian constitution recognizes 22 languages as official scheduled languages. Owing to the country’s linguistic diversity, English is widely used in official and judicial communication at the central and state levels; additionally, each state/territory has its own recognized official language. India has broadly four language families, namely Indo-Aryan, Dravidian, Tibeto-Burman, and Austro-Asiatic. Of these four, the first two are the most widespread. The Indo-Aryan languages include Hindi, Punjabi, Bengali, Konkani, Gujarati, Marathi, and others, while the Dravidian languages include Tamil, Telugu, Kannada, Malayalam, and so on. Moreover, the Indian subcontinent comprises a fair share of tribal languages from Northeast states like Meghalaya, Mizoram, and Arunachal Pradesh, belonging to the Austro-Asiatic, Tibeto-Burman, and Sino-Tibetan linguistic groups. India has the second-largest number of internet users after China: 40% of the country’s population has access to the internet, with 692 million active users, of whom nearly half (around 351 million) are from rural areas. However, the majority of the content generated and consumed by internet users is available in English, irrespective of their language preferences. According to a study by KPMG in India and Google (Indian Languages: Defining India’s Internet), almost 70% of internet users in India face challenges in using English keyboards, which in turn affects access to digital payments, online government services, digital entertainment, social media platforms, digital news, and other facilities. Indian languages, though widely spoken, still have a poorly developed natural language processing (NLP) ecosystem (Harish & Rangan, 2020).
NLP applications, including translation, transliteration, entity identification, information extraction and categorization, and varied text-processing tools, are therefore of huge significance for Indian-language adoption on the internet. Whereas there is a plethora of resources and systems available for high-resource languages (HRLs) like English, Chinese, Spanish, German, and French, the unavailability of NLP resources for numerous low-resource languages (LRLs), a category under which many Indian languages fall, is considered one of the major limitations of today’s NLP research (Hirschberg & Manning, 2015). Modelling and adapting entirely from the NLP infrastructure built for English would not benefit NLP for Indian languages. The intricate linguistic properties of Indian languages, such as subject-object-verb word order, morphological richness, and free word order, make them an important area for NLP research, which can help improve the understanding of human languages in general. Technological development and practical application of Indian languages in digital humanities (DH) in India is at a nascent stage. However, there is growing attention towards multilingual DH and an intention to facilitate DH projects in Indian languages by highlighting the challenges faced, particularly with regard to NLP. But the proportion of scholars actively engaged in multilingual DH research and in the development of related infrastructure, including language corpora and tools, is minimal in Indian DH. This chapter surveys the existing NLP packages for Indian languages: their architecture, workflow, usages, and limitations in terms of handling the linguistic features of Indian languages, adaptability, cross-lingual information retrieval, and standardization and evaluation procedures. Siripragada et al. (2020) propose the possibility of sentence-aligned shared parallel corpora for NLP tasks in related Indian languages, that is, languages that are from the same family or that share similarities in script, morphology, or lexicon, specifically across ten languages of Indian origin (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) plus English. Here, we also investigate the possibility of creating multilingual shared corpora for various NLP tasks for related languages, analysing the lexical, morphological, and scriptural similarities (Dhar, 2012), that is, the similarity of written text, of major Indian languages (Tamil, Malayalam, Kannada, Telugu, Hindi, Punjabi, Gujarati, Odia, and Bengali) belonging to different language families. This chapter is divided into two sections: the first outlines the existing NLP packages, their architecture, and their functioning, and the second proposes the workbench for a shared multilingual corpus.

3.2. NLP for Indian Languages: The Status Quo

The availability of pre-trained models (PTMs) is critical to building a good NLP infrastructure, as they reduce the labour, costs, and infrastructural requirements for data collection, preparation, and hosting needed to develop a custom model. Studying the availability and accuracy of PTMs is thus essential for understanding NLP for Indian languages. PTMs in NLP are deep learning models trained on large-scale datasets to perform specific NLP tasks. PTMs with large language corpora are widely available on the internet for many languages and can be easily loaded in popular NLP frameworks such as PyTorch, TensorFlow, and Keras. Real-world examples of using PTMs in NLP include Named Entity Recognition (NER) (Liu et al., 2021), Sentiment Analysis (Zhou et al., 2020), Machine Translation (Liu et al., 2020), Text Summarization (Khandelwal et al., 2019), Speech Recognition (Baevski et al., 2019), and Content Moderation (Zhu et al., 2021). However, for many Indian languages with millions of speakers, there is still large-scale unavailability of pre-trained language models trained on an adequate corpus. Popular NLP libraries like spaCy, NLTK, and TextBlob lack Indic language support, which creates a barrier to working with these languages (Arora, 2020). The major NLP packages available for Indian languages are:

• Indic NLP Library (iNLP): a Python-based NLP library covering many Indian languages, which share linguistic, phonological, and syntactic similarities. The library includes functionalities such as text normalization, script information, tokenization, word segmentation, script conversion, romanization, indicization, transliteration, and translation.
• iNLTK: an open-source NLP library with pre-trained deep models for performing downstream tasks like text classification. It supports data augmentation, text similarity, sentence embeddings, word embeddings, tokenization, and text generation. Arora (2020) notes that iNLTK currently aims to include code-mixed languages like Hinglish (Hindi and English), Manglish (Malayalam and English), and Tanglish (Tamil and English).
• StanfordNLP: a Python-based NLP library from the Stanford research group that supports around 53 languages in total, including two Indian languages, Hindi and Urdu. Notable features of the StanfordNLP library are textual analytics, multi-word token expansion, lemmatization, parts-of-speech (POS) tagging, morphological feature tagging, and dependency parsing.

Table 3.1 shows the languages supported by the aforementioned NLP packages, and Table 3.2 provides a comparison of the overall tasks performed by them. Indic NLP works with the majority of languages, whereas users rely on StanfordNLP for most tasks, though only for two languages.
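Much of the “script information” functionality such libraries expose reduces to Unicode block lookups. A minimal sketch in plain Python, assuming only the Unicode standard’s block boundaries (the function name and behaviour are illustrative, not iNLP’s actual API):

```python
# Unicode block ranges for a few major Indic scripts (per the Unicode standard).
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(text):
    """Return the Indic script whose block covers the most characters in text,
    or None if no character falls in a listed block."""
    counts = {}
    for ch in text:
        cp = ord(ch)
        for script, (lo, hi) in INDIC_BLOCKS.items():
            if lo <= cp <= hi:
                counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None

print(detect_script("नमस्ते"))   # Devanagari
print(detect_script("வணக்கம்"))  # Tamil
```

Because most major Indic scripts occupy disjoint Unicode blocks, even this crude per-character counting is usually enough to route a text to the right script-specific pipeline.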
Table 3.1 Languages supported by NLP libraries in Indian languages

• Indic NLP Library (iNLP): Indo-Aryan languages such as Assamese, Bengali, Gujarati, Hindi/Urdu, Marathi, Nepali, Odia, Punjabi, Sindhi, Sinhala, Sanskrit, and Konkani; Dravidian languages such as Kannada, Malayalam, Telugu, and Tamil; and English.
• iNLTK: Hindi, Bengali, Gujarati, Malayalam, Marathi, Tamil, Punjabi, Kannada, Oriya, Sanskrit, Nepali, and Urdu.
• StanfordNLP: Hindi and Urdu.

Table 3.2 Tasks performed by the NLP tools

Tasks compared across iNLP, iNLTK, and StanfordNLP: tokenization; part-of-speech (POS) tagging; dependency parsing; stemming and lemmatization; multi-word token expansion; named entity recognition (NER); word segmentation; text normalization; sentence embeddings; romanization; indicization; and translation and transliteration.

The progress on NLP for Indian languages has been constrained by the unavailability of some essential NLP models, which in turn results from a lack of datasets/corpora, evaluation parameters (standards), and consolidated efforts. The anglophone centricity of digital humanities, and of Indian DH specifically, is an impediment to scholars who work to expand DH engagement beyond English-centric scholarship. When devising methodologies, tools, primary sources, and standards in English, there should be conscious consideration of multilingual users and researchers. A functional approach towards the translation of resources can also be helpful. The creation of datasets for Indian languages requires a collaborative and community-based approach, bringing together the necessary technical and linguistic skills that are currently absent in Indian DH, specifically in studies on NLP. There should also be policy-oriented changes in Indian academia specific to DH that recognize such much-needed efforts. The large-scale morphological variation of Indian languages, compared with high-resource languages, is another major reason for this gap. Studying the lexical and morphological features of Indian languages can help create models that better handle their linguistic features and, in turn, aid in developing better multilingual models that can understand and transfer knowledge across multiple languages.

3.3. Utilizing Relatedness of Indian Languages for Corpora Building

The progress of NLP in Indian languages is impeded by the unavailability of large linguistic corpora and of evaluation benchmarks. A large-scale monolingual corpus allows the development of pre-trained language models and deep contextualized word embeddings as precursors of modern NLP, while standard evaluation parameters allow effective standardization across various tasks, ensuring efficiency. An alternative approach is to build a shared multilingual corpus for a group of related languages. Previous works (Kakwani et al., 2020; Tan & Le, 2019; Khanuja et al., 2021; Reid et al., 2021) have shown the benefits of pre-trained language models as well as NMT models that cater to a set of related languages. Related languages are those that exhibit lexical and structural similarities on account of sharing a common ancestry, as in the case of the Slavic and Indo-Aryan languages, or of being in contact for a long period of time (Bhattacharyya et al., 2016). Languages in prolonged contact, even when they have different ancestry or belong to different language groups, show convergence of linguistic properties, and this may lead to the formation of linguistic areas (Thomason, 2000), like the linguistic areas of the Indian subcontinent (Emeneau, 1956) or the Standard Average European linguistic area (Haspelmath, 2001). This section briefly outlines the relatedness of the languages of the Indian subcontinent in terms of lexicon, morphology, and script, and how these similarities can be utilized to devise a shared multilingual corpus for varied NLP tasks.

3.4. Lexical Similarity of Indian Languages

Lexical similarity in linguistics is a measure of the degree to which the word sets of two languages are similar. Linguists have used lexical similarity to evaluate the degree of genetic relationship between languages since the 1960s (Teeter, 1963). When two languages are inferred to be lexically similar, they may share many words with similar forms (spelling and pronunciation) and meanings: cognates, lateral borrowings, or loan words. There are contested definitions of the terms ‘cognates’ and ‘loan words’ in linguistics. This study defines cognates as sets of words of similar nature in different languages that have been inherited in direct descent from an etymological ancestor in a common parent language (Crystal, 2011). Loan words are defined as words that are partially or completely assimilated from one language (the donor language) into another, the recipient language (Jespersen, 1964). Table 3.3 shows how a group of languages descending from the same family inherits words from the proto-language of that family; in most cases, there is a noticeable sound change, according to the linguistic features of each language, as a word travels from the proto-language.

Table 3.3 Word sharing from proto-language

Indo-Aryan
English   Sanskrit            Hindi          Punjabi   Gujarati    Marathi        Odia      Bengali
Hunger    Bubuksha/kshudhā    Bhūkh          Pukh      Bhukh       Bhukh          Bhoka     Khide
Bread     Rotika              Chapāti/Rotī   Roti      Pau rotla   Chapati/Poli   Pauruti   (Pau-)ruti
Fish      Maysya              Machhli        Machhi    Māchhli     Masa           Macha     Machh

Dravidian
English   Tamil    Malayalam   Kannada   Telugu
Fruit     Pazham   Pazha       Phala     Pandu/Phalan
Ten       Pathu    Path        Hattu     Padi

Loaning of words and the formation of cognates occur not only between languages of the same lineage but also between languages of different families. Table 3.4 shows examples of loanwords exchanged between the Indo-Aryan and Dravidian language families. It can be observed that retroflex sounds (consonant sounds produced with the tip of the tongue curled back towards the hard palate), echo words (a form of reduplication), and quotative verbs are borrowed into Indo-Aryan languages from Dravidian languages, while most of the borrowings from Indo-Aryan into Dravidian languages are from Sanskrit.

Table 3.4 Loan words across language families

Indo-Aryan words in Dravidian languages:
Meaning   Word      Language    Loanword
Wheel     Cakram    Tamil       Cakkaram
Water     Jalam     Malayalam   Jalam
Horse     Ashvah    Kannada     Ashva
Fish      Matsyah   Telugu      Matsyalu

Dravidian words in Indo-Aryan languages:
Meaning    Word      Language   Loanword
Hoe        Kuddala   Hindi      Kuddāla
Agarwood   Agaru     Punjabi    Agar/agarbatti
Cardamom   elakki    Gujarati   elchee
Mix        Kalisu    Marathi    Kālav
Kid        Pilla     Odia       Pilaa/Pilata
Love       Premam    Bengali    Prem

To quantify the lexical similarity between Indic languages (Punjabi, Hindi, Gujarati, Bengali, Marathi, Telugu, Tamil, and Malayalam), we used the Longest Common Subsequence Ratio (LCSR) (Melamed, 1995), since lexical similarity stems from sub-word level correspondences and word order is the same for most related languages, as inferred from previous studies (Kunchukuttan et al., 2018). The LCSR of two strings is the length of their longest common subsequence divided by the length of the longer string. For a pair of languages, we compute the average LCSR, at the character level, between every pair of corresponding sentences in the training corpus. The corpus used is the ILCI (Indian Languages Corpora Initiative) corpus, a multilingual parallel annotated corpus in 12 major Indian languages, including English, with standard PoS annotation. The corpora comprise over 600,000 sentences, each with an average of 16 words (9,600,000 annotated words), from the health and tourism domains. Figure 3.1 shows the results of the lexical similarity analysis. The findings indicate that the Indo-Aryan languages display higher similarities among themselves than the Dravidian languages do. Hindi shows the highest similarity with Punjabi (67.9%) and over 50% similarity with Gujarati, Bengali, and Marathi. Gujarati also shows over 50% similarity with Punjabi and Marathi. Among the Dravidian languages, Telugu, Tamil, and Malayalam display a similar range of similarities, while Telugu has higher similarity with the Indo-Aryan languages than the other Dravidian languages do.

Figure 3.1 Lexical similarities.
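The character-level LCSR used for this analysis can be sketched with a standard dynamic-programming LCS. The word pair below is a romanized example from Table 3.4; the function names are my own:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """Longest Common Subsequence Ratio: |LCS| / length of the longer string."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

# Sanskrit 'cakram' vs its Tamil loanword 'cakkaram' (romanized).
print(round(lcsr("cakram", "cakkaram"), 3))  # 0.75
```

Averaging lcsr over all corresponding sentence pairs of a parallel corpus, at the character level, yields the language-pair percentages reported in Figure 3.1.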

3.5. Morphological Similarity

Languages are considered morphologically rich when they exhibit a high degree of inflection, compounding, or derivation. In these languages, words are often modified to mark tense, mood, gender, number, and/or modality, while morphologically poorer languages usually rely on word order and syntax to convey different meanings. Richer languages have more subtle word forms to convey different meanings precisely, and this leads to complexity in the language and a wider vocabulary. Languages like Hebrew, Turkish, and the Dravidian languages are considered morphologically richer than languages such as English or Mandarin. Most Indian languages follow subject-object-verb (SOV) as the canonical word order, though some also exhibit free word order. Dravidian languages are far more agglutinative in nature and have more morphological diversity than Indo-Aryan languages. Table 3.5 shows the morphological complexity of Indian languages, calculated using the type-token ratio (TTR) indicator, where TTR = total no. of types / total no. of tokens. Since Indian languages are highly inflectional, a realistic NLP system should have a morphological sub-system for parsing surface forms of words into their constituent morphemes, which in turn helps achieve a reduction in lexical redundancy and compactness of lexical representation. Though Indian languages in general, and Dravidian languages in particular, are morphologically complex, their shared word order, similar case-marking systems, and relatively similar inflectional patterns can be leveraged for higher performance in NLP tasks. A shared corpus can thus reduce data sparsity.
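The TTR indicator defined above is a one-line computation once a corpus is tokenized. The sketch below uses naive whitespace tokenization on an invented English sample; the figures in Table 3.5 come from the authors’ corpora:

```python
def type_token_ratio(tokens):
    """TTR = number of distinct word forms (types) / total tokens.

    Higher values indicate greater morphological variety: agglutinative
    languages generate many distinct surface forms, inflating the type count.
    """
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Illustrative sample: 8 tokens, 6 types ("the" repeats three times).
tokens = "the dog chased the cat and the mouse".split()
print(round(type_token_ratio(tokens), 3))  # 0.75
```

Note that TTR falls as corpus size grows (common words repeat), so the ratios in Table 3.5 are only comparable to the extent that the underlying corpora are of similar size.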

Table 3.5 Morphological complexity

Languages   No. of Types   No. of Tokens   Type-Token Ratio (TTR)
Telugu      130,536        945,863         0.1380
Tamil       127,605        1,035,458       0.1232
Malayalam   128,660        400,000         0.834
Hindi       58,441         1,524,199       0.0383
Punjabi     64,337         1,548,743       0.0415
Gujarati    99,926         1,272,594       0.0785
Bengali     90,000         156,000         0.057

3.6. Similarity of Scripts

All major Indic scripts are derived from the Brahmi script, as shown in Figure 3.2. The arrangement of vowels and consonants in Indic scripts has a largely phonetic basis. In certain cases, the same script is used for multiple languages: the Devanagari script is used for Sanskrit, Hindi, Marathi, Konkani, Sindhi, etc., and the Bangla script is used for Assamese and Manipuri. In certain other cases, multiple scripts are used for the same language: Gurmukhi and Shahmukhi are used for Punjabi, and Devanagari and Perso-Arabic are used for Sindhi. Indian languages have a largely overlapping character set with different visual renderings, as shown in Figure 3.3. The traditional ordering of characters (varnamala) is also the same, with dependent (maatras) and independent vowels. The languages display a similar way of combining consonants and vowels, with the syllable (akshara) as the fundamental organizing principle.

Figure 3.2 Evolution of Indic scripts.

3.7. Conclusion

Language relatedness can be effectively used in the creation of shared datasets and can help improve translation between related languages via sub-word level transformations that translate cognates and loan words. Languages that show above 50% lexical, morphological, and scriptural similarity, such as Gujarati, Bengali, Marathi, Punjabi, and Hindi, will benefit significantly from such an extensive corpus. Similarly, a shared corpus for Tamil, Malayalam, Kannada, and Telugu will contribute to the building of NLP resources for Dravidian languages, and languages like Telugu and Sanskrit will benefit equally from a combined Indo-Aryan and Dravidian corpus.

Figure 3.3 Indic scripts.

References

Arora, Gaurav. “iNLTK: Natural Language Toolkit for Indic Languages.” arXiv preprint arXiv:2009.12534, 2020.
Baevski, Alexei, Michael Auli, and Abdelrahman Mohamed. “Effectiveness of Self-Supervised Pre-Training for Speech Recognition.” arXiv preprint arXiv:1911.03912, 2019.
Bhattacharyya, Pushpak, Mitesh M. Khapra, and Anoop Kunchukuttan. “Statistical Machine Translation Between Related Languages.” In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts. Association for Computational Linguistics, 2016, pp. 17–20.
Crystal, David. “Cognate.” In A Dictionary of Linguistics and Phonetics, 6th ed. Blackwell Publishing, 2011, pp. 104, 418.
Dhar, Sachchidanand. Similarity of Scripts, 2012. https://doi.org/10.13140/2.1.1225.2168.
Emeneau, M.B. “India as a Linguistic Area.” Language 32, no. 1 (1956): 3–16. https://doi.org/10.2307/410649.
Harish, B.S., and R. Kasturi Rangan. “A Comprehensive Survey on Indian Regional Language Processing.” SN Applied Sciences 2, no. 7 (June 12, 2020). https://doi.org/10.1007/s42452-020-2983-x.
Haspelmath, M. “The European Linguistic Area: Standard Average European.” In Language Typology and Language Universals: An International Handbook. Walter de Gruyter, 2001.
Hirschberg, Julia, and Christopher D. Manning. “Advances in Natural Language Processing.” Science 349, no. 6245 (July 17, 2015): 261–266. https://doi.org/10.1126/science.aaa8685.
Kakwani, Divyanshu, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-Trained Multilingual Language Models for Indian Languages.” In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4948–4961. Association for Computational Linguistics, 2020.
Khandelwal, Urvashi, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. “Sample Efficient Text Summarization Using a Single Pre-Trained Transformer.” arXiv preprint arXiv:1905.08836, 2019.
Khanuja, Simran, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja K Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, and Partha Talukdar. “MuRIL: Multilingual Representations for Indian Languages.” arXiv preprint arXiv:2103.10730, 2021.
Kunchukuttan, Anoop, Pratik Mehta, and Pushpak Bhattacharyya. “The IIT Bombay English-Hindi Parallel Corpus.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), 2018.
Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. “Multilingual Denoising Pre-Training for Neural Machine Translation.” Transactions of the Association for Computational Linguistics 8 (December 2020): 726–742. https://doi.org/10.1162/tacl_a_00343.
Liu, Zihan, et al. “CrossNER: Evaluating Cross-Domain Named Entity Recognition.” In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 15, 2021. arXiv:2012.04373.
Reid, Machel, Junjie Hu, Graham Neubig, and Yutaka Matsuo. “AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1306–1320. Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021.
Melamed, I.D.

58

Justy Joseph, Lalithsriram SR, and Nirmala Menon

Languages.” In Findings of the Association for Computational Linguistics. EMNLP, 2020, pp. 4948–4961, Online. Association for Computational Linguistics. Khandelwal, Urvashi, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. “Sample Efficient Text Summarization Using a Single Pre-Trained Transformer.” CoRR abs/1905.08836 (2019). https://doi.org/http://arxiv.org/abs/1905.08836. Khanuja, Simran, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja K Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, and Partha Talukdar. MuRIL: Multilingual Representations for Indian Languages, 2021. https://paperswithcode. com/paper/seegull-a-stereotype-benchmark-with-broad-geo. Kunchukuttan, Anoop, Pratik Mehta, and Pushpak Bhattacharyya. “The IIT Bombay English-Hindi parallel corpus.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). European Languages Resources Association (ELRA), 2018. Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. “Multilingual Denoising Pre-Training for Neural Machine Translation.” Transactions of the Association for Computational Linguistics 8 (December 2020): 726–742. https://doi.org/10.1162/tacl_a_00343. Liu, Zihan, et al. “Crossner: Evaluating Cross-Domain Named Entity Recognition.” In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 15, 2021. arXiv:2012.04373v2 [cs.CL]. Machel, Reid, Hu Junjie, Neubig Graham, and Matsuo Yutaka. “AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 1306–1320. Melamed, I.D. 
“Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons.” In Proceedings of the 3rd Workshop on Very Large Corpora (WVLC3). Boston, 1995. Otto, Jespersen. Language. Norton Library, 1964, p. 208. ISBN 978-0-393-00229-4. Siripragada, Shashank, Jerin Philip, Vinay P. Namboodiri, and C V Jawahar. “A Multilingual Parallel Corpora Collection Effort for Indian Languages.” In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, 2020, pp. 3743–3751. Tan, Mingxing, and Quoc V. Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” CoRR abs/1905.11946 (2019). https://doi.org/10.48550/arXiv. 1905.11946. Teeter, Karl V. “Lexicostatistics and Genetic Relationship.” Language 39, no. 4 (October 1963): 638. https://doi.org/10.2307/411959. Thomason, S.G. “Linguistic Areas and Language History.” In Languages in Contact. Brill, 2000. Zhou, Jie, et al. “Sentix: A Sentiment-Aware Pre-Trained Model for Cross-Domain Sentiment Analysis.” In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 2020. 10.18653/v1/ 2020.coling-main.49 Zhu, Wanzheng, et al. “Self-Supervised Euphemism Detection and Identification for Content Moderation.” In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021.
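The sub-word-level idea mentioned above can be illustrated with a minimal sketch. Most major Indic scripts occupy parallel 128-codepoint Unicode blocks, so a fixed codepoint offset maps a character in one script to its counterpart in a related script, which makes cognates directly comparable across scripts. The block start values below come from the Unicode standard; the function itself is our own simplified illustration (production toolkits such as the Indic NLP Library implement a more careful version that handles script-specific exceptions).

```python
# Indic scripts occupy parallel 128-codepoint Unicode blocks, so a fixed
# offset maps a character in one script to its counterpart in a related one.
# Simplified, illustrative sketch only; real converters handle exceptions.
BLOCK_START = {
    "devanagari": 0x0900,  # Hindi, Marathi, Sanskrit
    "bengali":    0x0980,
    "gurmukhi":   0x0A00,  # Punjabi
    "gujarati":   0x0A80,
    "tamil":      0x0B80,
    "telugu":     0x0C00,
    "kannada":    0x0C80,
    "malayalam":  0x0D00,
}

def transliterate(text: str, src: str, tgt: str) -> str:
    """Map characters of `text` from the src script block to the tgt block."""
    offset = BLOCK_START[tgt] - BLOCK_START[src]
    out = []
    for ch in text:
        cp = ord(ch)
        if BLOCK_START[src] <= cp < BLOCK_START[src] + 0x80:
            out.append(chr(cp + offset))
        else:
            out.append(ch)  # leave punctuation, digits, Latin, etc. untouched
    return "".join(out)

print(transliterate("नमस्ते", "devanagari", "gujarati"))  # નમસ્તે
```

Because the mapping is a pure codepoint shift, it is trivially reversible, which is one reason such script-level normalization is cheap to apply when building shared corpora for related languages.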

Part II

Pedagogy

4

Doing Digital Humanities in the Modern Languages Classroom

Susanna Allés-Torrent

4.1. Introduction

This chapter is situated at the intersection of digital humanities (DH), digital pedagogy, and modern languages (ML) courses at the undergraduate level in a US university. Although I analyze the nature and implications of studying and practicing DH within an ML context from a Global North institutional and pedagogical standpoint, most of the observations might still apply more broadly to the teaching of foreign languages and cultures in other contexts. My contribution addresses different aspects of this interconnectedness and how both fields enrich each other, with one clear idea in mind: whereas DH provides new digital approaches to study language, literature, and culture, ML is more invested in studying and revamping linguistic and cultural diversity. In the past decades, the digital humanities has pervaded almost all other humanistic disciplines and communities of research and practice, bringing forward exciting new scholarly approaches. This symbiosis of DH with other fields has arrived in some cases more naturally and faster than in others (literary studies, history, archaeology, and lexicography, among others), and hybrid sets of digital competences have been integrated into their theory and praxis. Just to name a few examples, many communities have delineated their digital identity by framing their discipline within the DH realm, such as Critical DH (Liu 2012), Black DH (Gallon 2016), or Postcolonial DH (Risam 2018). Modern languages studies, reshaped in the cultural turn that occurred in the 1990s, is not an exception to this trend. Currently, the junction of ML with DH covers a large range of digital manifestations, from digital culture, media studies, e-literature, and new forms of digital expression, to holistic projects for recovering archives silenced for centuries, or digital methodologies such as the development and improvement of natural language processing (NLP) techniques.
The need for NLP resources in languages other than English emerges clearly in Joseph and Menon, Goodale, and Tasovac's chapters in the present volume. Meanwhile, and as an integral part of ML, language teaching has played an important role by placing language instruction within clear areas of action (Communication, Culture, Connections, Comparisons, Communities)1 and offering new approaches to the integration of technology in and outside the classroom to enhance meaningful learning, such as Computer-Assisted Language Learning (CALL), Technology-Enhanced Language Learning (TELL), or Mobile-Assisted Language Learning (MALL).

DOI: 10.4324/9781003393696-7

The relationship between ML and DH, although still in the process of being defined, can be used to revamp concrete values, such as the role that multilingualism can play within DH (Galina 2014; Fiormonte 2016; Mahony 2018). Some scholars contend that ML studies' intrinsic linguistic and cultural diversity can enrich the study of DH, in the same way that the practice of DH has already enriched ML (Taylor and Thornton 2017; Ortega 2019). This interconnection led Pitman and Taylor (2017) to propose the label of "critical DHML." In this sense, the digital turn of ML studies goes beyond the mere introduction of technology into our classrooms or the adoption of digital tools as mere instruments in our research; it also includes critical cultural studies as part of the DH arena. The relationship between DH and ML (from now on DHML) goes even beyond the creation of a digital product conceived as a strategic initiative to boost research. Instead, DHML aims to grasp how the digital can be accessed through a cultural and linguistic perspective and how the study of foreign cultures can offer specific values for understanding the digital from multiple linguistic, social, cultural, and literary perspectives (Allés-Torrent et al. 2021). This critical intervention that ML studies brings to the table, highlighting the central role of multilingualism and the relationship of DH to language, has not been totally absent within DH, and there have been different initiatives for a multilingual DH.2 However, the field has sometimes been criticized for its linguistic bias and Anglocentric guise (Galina 2013; Dacos 2016; Fiormonte 2016).
Moreover, Spence and Brandão (2021: 3) stress the fact that DH has been more attentive to its geographical representation than to geolinguistic diversity, and that DH scholarship could benefit from a deeper scholarly engagement with the implications of multilingualism and geocultural diversity, which are an essential part of ML research. This alleged bias against linguistic diversity, as Fiormonte (2014) remarks, will inevitably impoverish the field of the humanities at large, because the creation of methodologies and technological standards is never neutral with respect to linguistic and cultural differences. As Roopika Risam has pointed out, we should question the dominant forms of knowledge production and be aware that "digital humanities practitioners" have usually worked within a "context of a politics of knowledge that has not been hospitable to those outside the dominant cultures of the Global North" (Risam 2018: 5). More than a decade has gone by since Kirschenbaum (2010) explained the reasons behind the early adoption of digital methods in English Departments. Similar motivations led Foreign or Modern Language Departments to incorporate other fields, such as gender studies, Africana studies, postcolonial studies, or Latino studies, and ultimately the digital humanities. Gradually, the introduction of DH in these departments has taken place, but it has been slower than in English. In the case of graduate education, Nora Benedict (2021) has exemplified the limitations and constraints that graduate students normally face when they engage with and learn about digital humanities outside an English department (e.g., a lack of understanding and of perceived scholarly worthiness of DH), as well as limited institutional resources and infrastructure.

The reasons behind this delay are multiple; some of them are obvious, whereas others are more complex. First, although we are all aware of the "need to understand other cultures and languages" and "to communicate with and comprehend other parts of the world" (MLA 2007), modern languages in general have been treated as secondary to other disciplines, such as English. This secondariness comes not only from the higher echelons of the administration but also from our student body. Consider that most students are not majoring in Spanish: they pursue a Spanish minor, they simply wish to keep studying Spanish, or, at most, they are double majors. Second, the majority of ML departments have been focused on language teaching, and the majority of students enrol in these departments to fulfil a language requirement or to improve a foreign language. Therefore, the main, and obvious, consequence of this situation is a different student population: whereas in English or History students are native speakers, in the ML classroom they are second-language learners or heritage language learners. This fact has probably hindered ML departments from straightforwardly replicating the developments of their peers in English or History. Third, the US higher education system, which nowadays usually amalgamates all Languages Other Than English (LOTE) into one single foreign or modern languages unit, establishes a large internal divide in the governance structure of ML teaching, which is organized in a two-tiered system. On the one hand, there are those faculty teaching the language curriculum, usually instructors in non-tenure-track positions, and, on the other, those teaching the upper-division culture and literature courses, within the so-called tenure track.
In the first case, language courses are the ones with the highest enrolments and demand, and generally they leave little room for innovation because they adhere to an already established curriculum. Perhaps for this reason, instruction has been especially focused on integrating technology and flipping the classroom. Only recently has Melinda Cro (2020) proposed concrete and creative ways of conceiving a DH-inflected pedagogy in the L2 classroom. While promoting language acquisition, Cro intends to take advantage of the digital "to facilitate collaborative and cooperative modes of learning while considering reflectively the act of building and making in a digital frame" (2020: 3). In the second case, literature and culture courses, usually taught by tenure-track faculty, are freed from the burden of pre-established curricula and are the ones better suited to introduce DH tools and methods, or to be conceived as straightforward DH courses. It is this last scenario to which my experiences discussed next relate. However, drawing the line where L2 instruction finishes and advanced content courses start is not always easy. Students who arrive at upper-division courses do not always fully master the target language, which is why, as we will see, even in advanced content courses, we feel inclined to adopt approaches belonging to language instruction. In an ideal setting, we need to imagine a pedagogical experience that integrates translingual competence (that is, improving the acquisition of functional language abilities through DH) and transcultural competences,3 working more effectively toward the integration of DH methods


and tools to study foreign history, geography, culture, and literature, while at the same time promoting the skills to interpret all kinds of texts and media. Moreover, and connected to this concern with the role of technology within language teaching and with how to transform the L2 classroom into a meaningful experience, Román Mendoza in many of her publications takes further the concept of "critical pedagogy" proposed by Paulo Freire and talks about digital critical pedagogy, whose main goal should be to make visible and fight against the negative effects that technological ideologies have on education, as well as to facilitate the appropriate use of technology in education to highlight and act against social injustice (Román Mendoza 2023: 19). In this chapter, I propose some resources and case studies emerging from my personal experience, and I explore the concept of critical digital literacies when adopting a DH approach situated in the context of ML education. To better situate my analysis, by critical digital literacies I mean Hinrichsen and Coombs' (2013) framework based on five basic resources: Decoding, Meaning making, Using, Analyzing, and Persona. These literacies, applied to a DH context, should lead students to decode the structures and conventions of digital media and artifacts; to identify and interpret the variety of digital outcomes; to use digital tools to perform certain tasks and analyze digitized material through informed judgments; and, finally, to create and be aware of one's online persona and interactions. This list is far from exhaustive, but it will serve to articulate some of the case scenarios that I propose. Also, as these two authors argue, we can devise two senses of a "critical dimension," one internal and another external to the digital. By internal we refer to faculties of analysis and judgement as applied to the content, usage and artefacts of the technology.
The external meaning relates to a position regarding the development, effects and social relations bound in technology. This position is more associated with historical and cultural analyses and operates in a wider field of technology than computers. It is concerned particularly with how meaning is constructed, by whom and for what purposes and in this sense representational and communicational modes are central. (Hinrichsen and Coombs 2013: 4)

In the case of the DHML classroom, this double dimension is especially relevant, as we work toward the adoption and appraisal of technology while also considering how the digital is communicated and particularly shaped by communities in a foreign culture. On top of these variables, when teaching digital humanities courses, we might also keep in mind a set of well-known values proposed by Lisa Spiro a decade ago: Openness, Collaboration, Collegiality and Connectedness, Diversity, and Experimentation.4 This holistic framework of competencies can help us design new courses and frame our learning objectives in the undergraduate classroom.

4.2. Pedagogical Resources for DHML

At the core of DH, there has been a genuine and growing interest in digital pedagogy. Several edited collections and book chapters have made significant contributions to the concept of critical pedagogy within DH (Hirsch 2012; Brier 2012; Beetham and Sharpe 2013; Gardiner and Musto 2015; Battershill and Ross 2017). There are journals specialized in digital pedagogies: Hybrid Pedagogy; The Journal of Interactive Technology and Pedagogy; and Kairos: A Journal of Rhetoric, Technology, and Pedagogy are only three examples. For DH undergraduate education, Digital Humanities Quarterly published a special issue including many interesting contributions focusing on students' agency, digital literacies, and issues of scale (Murphy and Smith 2017). Among the essays, two contributions dealt in particular with Hispanic studies and the Spanish-speaking world: Saum-Pascual (2017) showcased the use of electronic literature in the Spanish classroom and how digital and cultural literacies come together to enrich literary studies and DH; Álvarez and Peña (2017) moved the attention from an Anglophone context to their teaching experience with History majors at the National Autonomous University of Mexico (UNAM). In this volume, these two contributions align well with Ponce de la Vega's chapter, where a fruitful dialogue between ML and decolonial practices is discussed. Nevertheless, in mainstream scholarship, examples belonging to ML digital pedagogies are rare, and they remain within Anglophone contexts. In most cases, as Melinda Cro puts it, "the authors tend to assume the reader is either working in an anglophone environment or is teaching either in English or through the digital humanities" (Cro 2020: 4). The relative lack of pedagogical materials and computational resources that could contribute to bringing DH into the ML classroom has recently been mitigated by some resources that help to partially overcome this situation.
Listing them all is outside the scope of this chapter, but some are worth mentioning. Without doubt, the best resources are those offering open-access materials, namely in the form of step-by-step tutorials, which enable autonomous and self-directed learning depending on the interests of students. This type of online resource has been called by Benedict a "type of à la carte approach to training [that] encourages a kind of slow learning that, in turn, promotes deep, long-term knowledge production" (2021: 575). A unique resource that appeared in recent years within the field of ML is Modern Languages Open (MLO), a "peer-reviewed platform for the open access publication of research from across the modern languages to a global audience." Its Special Collections section includes "Critical Digital Pedagogies in Modern Languages—a Tutorial Collection,"5 which offers a set of educational materials to develop critical digital literacies for the study and research of modern languages and cultures (Spence and Brandão 2020). A group of seven tutorials offers a fertile field from which to harvest new ideas to bring to the ML classroom. Examples range from experimentation with poetry and video editing, interaction with AI chats, and online presence, to the use of platforms such as Scalar or Recogito. Another key resource is the multilingual offering published by the Programming Historian (PH), represented in this volume by the contribution written by Isasi


et al. Most of their tutorials, originally written in English, have already been translated into Spanish and French, with the recent addition of a Portuguese version. Additionally, they have started including original content in these languages, which extends the variety of valuable resources for our ML students. However, Isasi and Rojas Castro, in charge of the Spanish version of PH, have recently underlined the challenges of translating these types of pedagogical materials and the need to adapt them in terms of content, code, and context.6 The goal of PH, similarly to MLO, is to envision a global audience and write tutorials for multilingual readers. Therefore, authors try to avoid specific cultural references, jokes, wordplay, or complicated language that could later complicate the translator's task. In short, the publication of these types of educational materials with open licenses, designed to be translated into different languages, accessed remotely, and reused, is invaluable. Some years ago, the European consortium DARIAH released #dariahTeach, "a platform for Open Educational Resources (OER) for Digital Arts and Humanities."7 This resource has 15 tutorials, of which six are in translation (Spanish, French, Hungarian, and Greek). For Spanish, we find an introduction to DH, a tutorial on the Text Encoding Initiative (TEI), and a third one on multimodal literacies. Although these tutorials are still few in number, they represent a good way to serve and promote educational materials in languages other than English. Special cases are those resources conceived to overcome the lack of materials in specific areas in Spanish. For instance, within the Hispanic context, and especially in Spain, the Text Encoding Initiative remains a topic of high interest, due to its long tradition in philology and scholarly editing practice.
Therefore, the Text Technologies Hub (TTHub) has been working toward the creation of introductory materials to learn encoding practices and the Text Encoding Initiative, especially conceived for a Hispanic audience. On the platform, there are not only several tutorials to learn the TEI but also a set of useful resources, such as a catalogue of concrete projects and real case scenarios of TEI encodings.8 Although it is not conceived with multilingual intent, one of the most useful resources remains Digital Pedagogy in the Humanities, published on MLA Commons (Frost et al. 2020). The site offers real learning case scenarios adopting digital approaches and technologies. In most cases, they work with Anglophone materials, but in general they can also be transferred to the ML classroom. The exception within this collection remains Ana Oskoz's contribution, which focuses specifically on language learning. Drawing upon the communicative dimension of language, rather than the study of grammar, her contribution focuses on different ways to include sociolinguistic and cultural elements while enhancing learners' writing, reading, speaking, and listening skills; helping them in their cultural understanding; aiding them in their comprehension of cross-cultural pragmatics; and even developing their own identity as language learners, which builds their agency in their own language learning process. (Oskoz 2020)

Oskoz's contribution showcases how teaching in ML departments also involves language-based area studies approaches, as shown earlier. This quick, and obviously incomplete, overview of available learning resources reminds us of at least two more challenges that remain to be resolved to facilitate the alignment of DH and ML learning. On the one hand, the community should work toward the creation of datasets and digital data for the humanities at large to be used in our ML classrooms, as discussed in this volume in the chapters authored by Goodale and Tasovac et al. In the case of literary texts in Spanish, datasets remain rare. On the other hand, and at least in Spanish, we have a dire need for more introductory materials, such as Johanna Drucker's Introduction to Digital Humanities, a comprehensive companion for undergraduate students compiled during her teaching at UCLA and now also published as a textbook by Routledge (Drucker 2021).

4.3. Doing DHML: Case Scenarios for Critical Digital Literacies

I now wish to showcase different scenarios where DHML and critical digital literacies take center stage. The cases proposed emerge from two different courses that were conceived with different goals. One of them was a project-based course where the whole class, as a working team, had to deliver a collaborative project, while the second one was envisioned as an introduction to the digital humanities. Both were upper-level undergraduate culture courses taught in Spanish. Therefore, the cases below should not be considered a sequenced syllabus of an individual course but a group of ideas for developing critical digital literacies in students' educational and research practice, alongside their cultural and linguistic knowledge in Spanish. These cases, as Murphy and Smith argue, aim at being an account of a pedagogical practice that can "underscore the way in which our lecture halls, classrooms, and labs are the locations in which we see our theoretical concepts made manifest" (Murphy and Smith 2017).

4.3.1. Collaboration and Digital Scholarship in Project-Based Learning

The first experience deals with a course on literary and cultural topics in the Spanish-speaking world.9 During Fall 2017, I proposed to study the figure of the Spanish poet Federico García Lorca. The main goals of the course were twofold: on the one hand, students had to study in detail the cultural context of early-20th-century Spain and get to know one of Spain's best-known poets, killed at the very beginning of the Civil War (1936). On the other hand, they had to use the materials held at the University of Miami's Richter Library and autonomously build a digital artifact. The collection Federico García Lorca's Papers comprises photographs and letters written by and addressed to Lorca during his stay in La Habana (Cuba) while he was living in New York, an experience that shaped him in a particular way. The students entitled the final project "Lorca: Descubriendo las Américas." This project involved the conception and implementation of a digital edition of the poet's correspondence, a better understanding of his life and


travels by gathering data and creating visualizations, as well as a new appreciation of Lorca's literary works through literary analysis.

Text Encoding as Close Reading and Situated Practice

The core idea of this course was to recover a collection of analogue materials, composed of photographs and correspondence by Federico García Lorca, held at our home institution's library, the Richter Library (University of Miami). With this process of recovery, students could highlight a little-known story of one of the most emblematic Spanish poets. After students became familiar with the historical context, they started to analyze the content of the collection and to consider the best way to communicate those materials to a broader, general audience in a digital format. To that end, it was essential to expose students to similar digital editions and textual archives and to offer a glimpse of what textual studies and editorial theory mean. Students also started to decode and analyze this particular type of digital scholarly text: navigation, textual and editorial digital conventions, presentation and connection of images and texts, as well as their design on the screen (Hinrichsen and Coombs 2013: 7). Afterward, students undertook the creation of a conceptual model for the primary source at stake, in this case, the postcards and letters written by García Lorca. Overall, students took this opportunity to reflect on data structures, relying on the concept of classification and the idea of text. Text encoding, as a pedagogical activity, constitutes a comprehensive and hermeneutic situated practice (Burnard 2001). Students assumed the labour of transcribing, curating, editing, formatting, standardizing, and delivering a clean version of the text at hand. Amanda Gailey insisted that digital editing serves to accomplish certain teaching goals: "students must pay careful, consistent attention to the text; they learn to understand the cultural record as malleable; they feel a clear sense of purpose, audience, and expertise when writing; they leave with transferable technical skills" (Gailey 2014: 191).
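To make the encoding workflow concrete, here is a minimal, illustrative sketch of how a letter's basic structure might be marked up in TEI-style XML using Python's standard library. The element selection is a simplified subset of the TEI vocabulary, and the helper function and sample metadata are our own illustrative assumptions, not the structure of the course's actual edition.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

def q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"

def encode_letter(sender, recipient, place, date, body_text):
    """Return a schematic TEI-style encoding of one letter."""
    tei = ET.Element(q("TEI"))
    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    div = ET.SubElement(body, q("div"), {"type": "letter"})
    opener = ET.SubElement(div, q("opener"))
    dateline = ET.SubElement(opener, q("dateline"))
    ET.SubElement(dateline, q("placeName")).text = place
    ET.SubElement(dateline, q("date"), {"when": date})
    ET.SubElement(opener, q("salute")).text = recipient
    ET.SubElement(div, q("p")).text = body_text
    closer = ET.SubElement(div, q("closer"))
    ET.SubElement(closer, q("signed")).text = sender
    return ET.tostring(tei, encoding="unicode")

# Invented sample data, for illustration only.
xml = encode_letter("Federico", "Queridos padres", "La Habana", "1930-04-05",
                    "Os escribo desde Cuba ...")
print(xml)
```

Even in miniature, the exercise shows why encoding is hermeneutic: choosing to model `placeName`, `date`, or `signed` as distinct elements is itself an interpretive decision about what in the document deserves to be preserved and queried.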
Students identified and encoded elements such as literary genre, textual structures, sender, recipients, dates, places, and names. Text encoding is also one of the best ways to raise awareness of diachronic linguistic change: by performing the manual labour of marking up a text, students can work with old scripts (paleographically) and non-standard forms of writing. Additionally, as Singer argued:

[T]he practice of encoding effectively engages the encoder in a determinative act of reading. . . . Their decisions about which structures and information to tag very much determine what they value in a document—or at least what should be preserved and codified for other readers. (Singer 2013)

On this point, Gailey says:

[T]ext encoding, is fundamentally an exercise in close reading. I have never seen the duped youth in my charge exhibit as much careful attention to

textual detail as they do when they have been asked to create a digital edition of a text. (Gailey 2014: 196)

Also, text encoding can help to identify already-known or new literary terminology, not only in the students' own language but, in this case, in a second one. The idea of the course was to situate language study in a concrete cultural and historical frame, within a humanistic and digital learning environment. I sought to incorporate cultural inquiry, literary topics, and digital humanities methodologies. I also wanted my students to work collaboratively and gain a sense of the complexity required to carry out a digital project, in this case, in the final form of a website. Finally, this type of collaborative project-based instruction gives students some room to reconsider hierarchy within higher education, collaboration, and the value of labour. A further benefit of this approach to text encoding is that students perceive the method as part of a bigger intellectual endeavour. They realize the many variables that come into play in a digital project, the challenges of conceiving and building a digital infrastructure, and the tasks involved in project management. For this course, I used the materials mentioned in Section 4.2, namely those related to text encoding published in the Text Technologies Hub, since they are written in Spanish and offer useful markup examples in action.

Textual Data as Sources for Mapping and Timelines

As part of the same course on Federico García Lorca, students devised a way to take advantage of the information found in the encoded texts. In the case of Lorca's travels and life experiences, students conceived a chronology of his life and a map showing his travels across the Atlantic. This was an opportunity to become familiar with different geographic areas of the Spanish-speaking world. In fact, Lorca moved from Spain, where he was born, to other European cities he visited (Paris, London, Oxford), until he arrived in the Americas: he disembarked in New York, then visited Florida and Miami, and finally arrived in Cuba. To map and explain Lorca's travels, students used StoryMapJS,10 an interactive web-based map and storytelling platform. This popular tool has a friendly interface to pin georeferenced coordinates, upload or link online media, and add headlines as well as narrative sections. The result can be very informative, as students work with original images and practice their writing skills. For the timeline, and to draw a general overview of Lorca's main biographical events, students adopted TimelineJS,11 which offers a pre-built spreadsheet template. Students populated the rows as needed, filling the columns with dates, headlines, texts, media, and captions, and they could even customize it with different colours to group the types of events. This tool is especially useful for appraising the modelling of data structures and the labour of data input. Both digital outcomes (the timeline and the interactive map) were integrated into the students' collaborative website under the sections "Chronology" and "Travels."12
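The data-input labour described above amounts to populating a small tabular dataset. As a sketch, the script below generates a CSV using a subset of the column names from the TimelineJS spreadsheet template (the full template includes additional media-related columns, so the current template should be checked before importing). The selected events are a condensed, illustrative slice of Lorca's biography.

```python
import csv
import io

# Subset of the TimelineJS spreadsheet template columns (illustrative).
FIELDS = ["Year", "Month", "Day", "Headline", "Text", "Group"]

# A condensed, illustrative selection of biographical events.
events = [
    (1898, 6, 5,  "Nacimiento", "Federico García Lorca nace en Fuente Vaqueros (Granada).", "España"),
    (1929, 6, "", "Llegada a Nueva York", "Lorca se instala en la Universidad de Columbia.", "Américas"),
    (1930, 3, "", "Viaje a Cuba", "Lorca llega a La Habana tras su estancia neoyorquina.", "Américas"),
    (1936, 8, "", "Muerte", "Lorca es asesinado al comienzo de la Guerra Civil.", "España"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(FIELDS)
writer.writerows(events)
print(buf.getvalue())
```

Using the `Group` column to separate the Spanish and American periods mirrors the students' use of colour-coding to group event types in the published timeline.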


Susanna Allés-Torrent

The process of curating the data, visualizing them, and emending the artifacts within a bigger narrative provided a suitable space to think critically about digital resources, underscoring the importance of code and, especially, escaping the notion of a black box by inferring what happened between input and output. Among the pedagogical resources that can help to plan activities such as the two mentioned earlier, we can recall several tutorials from the resources mentioned in Section 4.2. Hayes and Partlow (2022) offer a detailed lesson on StoryMap JS that students can lean on to undertake their project. Although dealing with a different platform, del Rio Riande and Vitale (2020) present Recogito13 to recover geo-annotations produced on the texts, which can then be plotted on a digital map. Additionally, and centred on the creation of maps, Sinton (2020) offers a generous list of tools and platforms for creating maps in the undergraduate classroom.

Close Reading and Text Analysis

The opportunities for adopting text mining and text analysis techniques are manifold, although specialized datasets connected with the content taught in the ML classroom rarely exist. In this course centred on García Lorca, one of the learning outcomes was to locate the author and his context historically, as well as to engage with his literary production. One of the best ways to do so appeared to be the close reading and text analysis approach. This part of the course took almost three weeks, and it could certainly have been extended had a deeper exploration and more refined results been desirable. The main goal was to identify the topics and symbols in García Lorca’s literary and theatrical production. For this purpose, I curated an anthology of relevant texts, recovered from an open-access website,14 which we read and analyzed through a close literary reading. To facilitate this analysis, I handed students a table with three columns: the first contained frequent topics found in Lorca’s literary works (e.g., the tragic destiny of life, frustration, love and sex, death, time, social injustices); the second listed some features of the author’s style (e.g., fusion of high and popular culture, metaphors, Volksgeist); and the third contained a list of symbols used by Lorca (e.g., moon, water, blood, horse, metals). After some sessions analyzing poems and some theatrical works, students had to use text analysis tools of increasing difficulty: AntConc, Voyant, and RStudio. First, students uploaded the corpus in AntConc and appraised the logic behind the creation of a word list, the concept of tokens versus unique words, keywords in context (KWIC), concordance lists and sorting, and cluster plots, among other features.15 Afterward, we moved to the online platform Voyant Tools.16 We started by discussing the notion of stopwords, which can open onto a larger linguistic and grammatical discussion.
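The core operations students appraised in AntConc — tokenization, the token-versus-type distinction, frequency lists, and KWIC — can be sketched in a few lines of Python. This is a hypothetical classroom illustration, not AntConc's own implementation, and the tokenizer is deliberately naive:

```python
import re
from collections import Counter

def tokenize(text):
    """Naive lowercase word tokenizer for Spanish text (illustration only)."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

def kwic(tokens, keyword, window=3):
    """Keywords in context: each hit with `window` tokens of left/right context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

verse = "Verde que te quiero verde. Verde viento. Verdes ramas."  # Romancero gitano
tokens = tokenize(verse)
freqs = Counter(tokens)       # the word list: each word with its frequency
n_tokens = len(tokens)        # running tokens
n_types = len(set(tokens))    # unique words (types)
```

Running `kwic(tokens, "verde")` returns each occurrence of the keyword with its surrounding context, the same view a sorted concordance list gives in AntConc, and comparing `n_tokens` with `n_types` makes the token/type distinction concrete.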
Then, they made different word clouds for the Poema del Cante Jondo, Romancero Gitano, and Poeta en Nueva York. At that point, students needed to distinguish and classify specific words and contexts by drawing upon the table of topics, style, and symbols that I had provided. Once again, and for both cases, the Programming Historian offers essential lessons that students can follow autonomously on AntConc (Froehlich 2015, translated into Spanish) and Voyant (Gutiérrez De la Torre 2019, written originally in Spanish). Finally, the last exercise consisted of exploring the corpus with RStudio through a cluster analysis, trying to infer why and how the software created specific groups. These clusters reflected some trends based on the genre of the texts, their topics, and the period during which they were drafted.17 These experiential learning activities served specifically to demonstrate how literary hypotheses can be recognized, confirmed, and expanded by digital methods. At the same time, students developed the ability to deploy text analysis tools critically and efficiently to explore and make meaning of literary texts (Hinrichsen and Coombs 2013: 7–11).

A Minimal Infrastructure for Collaboration

To execute the digital archive, I proposed a group of open-source technologies and web standards, as well as a set of practices already consolidated in the DH community. Texts were encoded in XML-TEI and transformed, through a stylesheet written in XSLT (Extensible Stylesheet Language Transformations), into displayable HTML (Hypertext Markup Language) connected to CSS (Cascading Style Sheets). The platform used to conduct the collaborative work was GitHub, and the hosting service used to publish was GitHub Pages.18 The idea behind this approach was for students to learn how to work with open repositories, static websites, web standards, and open-source technologies, and to be encouraged to work autonomously, without depending on external platforms or servers. Some years ago, Amanda Gailey highlighted that project-based courses, namely in the case of collaborative digital editions, were “labor-intensive, time-consuming, and sometimes extremely frustrating to beginners,” because they depend on owning “reliable computers, . . . a classroom license for software (I use Oxygen XML Editor) that costs several hundred dollars,” as well as having access to a server with certain technological specifications, which can also be costly (Gailey 2014: 191). The adoption of a minimal approach in the classroom can potentially resolve most of these hurdles, since, although the learning curve is steep, we use open-source technologies that promote control and governance over the data, stimulate autonomous work, and allow research reproducibility. At the same time, we contribute to a more open, equitable, and global digital humanities (Allés-Torrent and del Rio Riande 2023).19

The Cultural Dimension of Digital Humanities

The second course was Digital Literacies Through Cultural and Literary Topics in Spanish,20 a new advanced-content course conceived to explore digital research methods and tools applied to literary and cultural studies in Spanish. I taught this course for the first time in Fall 2021, and I strove to offer a general introduction to the digital humanities while leaving ample room for hands-on practice.21


Digital Humanities or Humanidades Digitales

In a course conceived as an introduction to DH, a first contact with the concept and definition of DHML consists of focusing on the DH community in the target language. The idea is to become aware of the multilingual dimension of the field, first at an international level and then at the level of the Spanish-speaking world. Moreover, Spanish is not a monolithic language, a fact reflected in its geographical and geolinguistic diversity. The importance of presenting the diversity of digital humanities in Spanish can be shown, for instance, by displaying a map of the different Spanish-language DH associations. Thus, students can reflect in geographical and cultural terms and become familiar with different geographies. The class can feature the work done by the Spanish and Latin American associations: Asociación Argentina de Humanidades Digitales (AAHD), Humanidades Digitales Hispánicas (HDH), Red de Humanidades Digitales (RedHD, México), Red Colombiana de Humanidades Digitales, and Humanidades Digitales en Cuba. Students can pick their favourite project in each one of these areas and present it to the class. To this end, students can use a diverse array of educational materials, such as, in the case of the AAHD (2021), a series of podcasts posing five questions to different scholars and practitioners. Featuring heterogeneous research groups and pedagogical initiatives should give students the idea that, although English remains a global language, diversity is synonymous with richness. The field of Humanidades Digitales is still defining itself, emerging with a set of needs and interests distinct from the mainstream DH of the Global North (del Rio Riande and Fiormonte 2021).
Such an exercise should offer ground for a conversation on geolocation issues (Spain and Latin America) and on how different cultural and academic backgrounds come into being, but also on how language, in this case Spanish, plays a central role as a builder and communicator of knowledge. It is also an excellent moment to make the case for the co-existence of Spanish, as a language, with other minority or endangered languages (e.g., other Romance languages or Basque in Spain, or Indigenous languages in Latin America), depending on the region. Therefore, by showcasing specific projects, students perceive the diversity of the current DH landscape and how language and different geocultural perspectives can deliver different types of outputs (e.g., which kinds of DH projects are most characteristic of certain areas), and they are challenged to compare how languages (in this case Spanish and neighbouring languages) can shape digital outcomes.22

What’s in a DH Project?

A second activity consists of introducing a collection of DH projects and proposing an analytical rubric.23 A good starting point is global initiatives, such as Around DH, or a self-made list of works dealing with the Hispanic world. In some cases, projects can be paired with informative readings about how DH projects are built (Posner 2013; Gutiérrez 2016). By starting with the outcome, students decode digital artifacts and develop familiarity with the structures and conventions of the DH field, attending both to the operational cultural context and to the online presentation.

The key to this activity is to give a clear template from which students can start to deconstruct and identify the elements that “contribute to the meanings, uses and messages in digital products and communications” (Hinrichsen and Coombs 2013: 11). This rubric can integrate different parts. For example, with my students, I used a template elaborated by Kristen Mapes (2022) for her course, Introduction to DH, which I distributed to the students in Spanish translation. This template is divided into three main parts. The first focuses on the project background and goals: students detail partner institutions, funding (if any), credits and the different roles within the project, and the timeframe. Students also envision the potential audiences and explain the research goals. The second part is centred on questions of presentation. Here, students determine the presentation format and go beyond the concept of “website” by identifying whether the project is a digital archive, a curated collection, a map, a timeline, a digital edition, etc. They also evaluate the narrative of the context within which the project is situated, for example, to make sure that the project offers sufficient content to understand a concrete historical context. At this point, students can also consider questions of accessibility. The third part of the template asks students to assess the project’s materials and data. Students identify the types of materials, for example, specifying the primary sources and their origin (e.g., newspapers hosted at the Library of Congress). They are also asked to detect data formats, so as to become conversant with metadata languages in cultural institutions and scholarly projects. This information can be challenging to obtain and recognize, especially at the very beginning of the course, but it is worth the analysis.
Finally, students can weigh the advantages and limitations of the projects. Once the template was completed, students were asked to write a formal text describing the project and including all the aforementioned aspects, as well as to present their results in an oral presentation. Students should be able to select, make judgments, and draw conclusions about a DH project. With this activity, we incentivize analytic and decoding skills while students investigate, explain, and reflect on DHML projects.

Understanding Data in the Humanities

Another key step in any DHML course is to introduce students to appraising data in the humanities. Some readings and exercises might come in handy to this end (Sperberg-McQueen 2016; Allés-Torrent 2019), while lessons from the Programming Historian available in Spanish might serve to start the conversation about what data is. An introduction to the bash command line (Milligan and Baker 2014) serves as a way to approach the functioning of operating systems: navigating through the file system and stepping away from graphical interfaces represent, most of the time in this type of course, a novelty that not only deepens students’ understanding of their machines but also shapes the way they navigate, work, and organize information. Likewise, a first overview of how a web page works and of the code underneath, basically HTML (Hypertext Markup Language), provides a new perspective on their interaction with web browsers (Turkel and Crymble 2012).
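The command-line concepts from that lesson — the working directory, listing files, moving through the file system — can also be rehearsed programmatically. A minimal Python sketch (directory and file names are illustrative) showing the scripted analogues of pwd, cd, and ls:

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway directory to explore, echoing the kind of corpus
# folder students navigate in the bash lesson (names are illustrative).
root = Path(tempfile.mkdtemp())
(root / "corpus").mkdir()
(root / "corpus" / "romancero_gitano.txt").write_text(
    "Verde que te quiero verde.", encoding="utf-8")
(root / "corpus" / "poeta_en_nueva_york.txt").write_text(
    "Aurora de Nueva York.", encoding="utf-8")

os.chdir(root)                      # the scripted analogue of `cd`
cwd = Path.cwd()                    # ... of `pwd`
texts = sorted(p.name for p in (cwd / "corpus").glob("*.txt"))  # ... of `ls corpus/*.txt`
```

Stepping from the graphical file manager to either of these views — the shell or a script — is what makes the file system visible as a navigable structure rather than a black box.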


However, even before any technical reading, a helpful strategy is to start with a narrative text of broader cultural breadth that connects students with the target cultural and linguistic context. One of the most iconic examples is Jorge Luis Borges’s “The Analytical Language of John Wilkins” (1952), which inspired one of Foucault’s main works, The Order of Things (1966). Borges, by inventing a completely nonsensical categorization, gives students the possibility to discuss classification systems and their utility. The discourse can then move from the literary arena to the more concrete concepts of digital data. With this reading in mind, students are more open to analyzing the code behind their web browsers and how basic languages such as HTML work. In fact, as has been extensively discussed, most of our students use technology on a daily basis and interact with digital social media, but they are unaware of the code behind the tools they use or of the underlying algorithmic mechanisms.

Text Mining and Distant Reading

Another activity I proposed for this course, in line with the previous text analysis with AntConc and Voyant, consisted of exploring a contemporary corpus about COVID-19 on Twitter through a distant reading approach. To this end, we used an online corpus curated by the Digital Narratives of Covid-19 project.24 This exploration was more sophisticated than the previous one, since students used a Python script to recover results.25 The project offers a corpus of tweets published between April 2020 and October 2021, organized by date and region (Argentina, Colombia, Ecuador, Spain, Mexico, Perú, and Florida). Students selected an area and a topic and had to gather and create their own collection of tweets on which to conduct research.26 Then, students adapted the Python scripts and processed their own collections. The results they obtained were basically the most commonly used words, bigrams, unique words, etc. This activity served to deploy their analytical skills, questioning the provenance, the purpose, and the meaningfulness of those terms in a very concrete social context, that of the COVID-19 pandemic. Students detected the algorithmic mechanism behind the script and reflected critically on the results they obtained by interpreting a very specific setting delimited by time and space. At the end, students produced a written essay integrating their results into a comprehensive narrative, in which they balanced qualitative and quantitative techniques and considered the results and the usefulness of this data-driven approach for conducting socio-linguistic research.27

Building a Digital Persona

Another activity that I implement very often in my DHML and DH-inflected courses is the creation of a personal site, often in the form of a public blog. Besides acquiring technical skills to create a digital artifact and understanding the value of the code and design behind websites, students learn by doing. Usually, students use GitHub and GitHub Pages and draft all materials in markdown. The site is conceived as a presentation of themselves in which students adopt a public and open discourse, awakening an awareness of their digital identities. They thereby develop a “sensitivity to the issues of reputation, identity and membership” within a particular scholarly digital context (Hinrichsen and Coombs 2013: 12). Most of the course assignments are conceived to be published as blog posts on their sites, so that students document their research and learning progress. Blog posts can serve as spaces not only to talk about ‘code’ but also to draft their technobiography, so that students can reflect on how they feel about the role of technologies in their lived experiences (Lee 2017: 126–129). The personal site also provides a setting in which to practice their written expression, adapting it to a communicative mode and thinking of a potential online readership. Finally, it is a way to be creative and to keep working on their site after they have finished the course. However, on this last point, a digital portfolio has proven to be more useful at the graduate level, where students prepare for the professional market and, therefore, feel more motivated to bring their academic sketches online.

4.4. Lessons Learned and Closing

In this final section, I want to bring forward some considerations one might face when teaching an advanced-content course in a foreign language while seeking to introduce critical digital literacies. As shown, access to the digital through cultural and linguistic perspectives is variegated and full of opportunities; however, there are some challenges that should be taken into consideration. At the undergraduate level, most students have traditionally been exposed solely to language, culture, and literature courses, and they are taking a DH course for the first time.
Therefore, instructors need to explain and justify the work being done (more than in other, traditional types of courses), present the course goals and learning outcomes in detail, and explain beforehand, for example, the type of complementary work to be carried out outside the classroom, since much of the time students are expected not only to read or write but to complete full online tutorials that require practice and are, consequently, time-consuming. All activities and assignments need to be very well scaffolded, and it is better to think in terms of quality rather than quantity. For instance, in my Lorca course, the processes that led to the timeline and the storymap were too similar, both in data gathering and in outcome, and in this sense it would have been better to concentrate on only one of them. We need to reappraise the idea, introduced years ago by Prensky (2001), that our students are digital natives and some of us “digital immigrants.” The fact that they use technology does not mean that they grasp the nature and implications of technology. As Cro pointed out, “instructors tend to conceive of students as digital natives without regard for or awareness of the learners’ actual experience with the digital” (Cro 2020). Students might be considered digital natives in the sense that they operate in digital environments such as social media, but their knowledge of core technology is very limited. For instance, they do not know how the internet works or what is behind a website. Only a few of them arrive in the classroom having taken a computer science course (e.g., Java, Python). Therefore, it is essential that students comprehend the concept of data, its types, and the idea of classification.


Also, in addition to being exposed to digital work for the first time, with no previous experience of DH, students take the course in a foreign language, and they can feel doubly challenged as a result. This linguistic divide forces us to insist on concepts and contents that need to be situated and critically appraised in the target language and culture. Sometimes, students do not have the required vocabulary in the target language, and concepts are harder to acquire; therefore, the teaching fluctuates between languages and can require more time. One strategy could be to prepare more materials in Spanish adapted to students’ linguistic level. Most departments do not have DH programs or an established, articulated sequence of DH courses at the undergraduate (and graduate) level. This means that students normally have no previous exposure to DH, nor any DH follow-up or continuation in their curriculum, which hinders a rational step-by-step progression through language levels and DH topics (both from more basic to more advanced). The lack of continuity can certainly be a discouraging factor. In any DHML course, it is challenging to find the right balance between language, culture, and digital learning. Each unit should ideally be structured not around the technology used but around the cultural or literary topic. One of the factors that can destabilize this balance within DH pedagogy across languages and literatures is precisely the lack of materials adapted to the students’ linguistic level. The case of Spanish is not the only one lacking examples and resources adapted to the ML classroom, as multiple chapters in this volume demonstrate (see, for instance, Mello). In my personal experience, the most rewarding courses are those conceived as project-based collaborations.
If the course is conceived as a personal project, the results can sometimes be brilliant, as in the case of one student who edited her parents’ correspondence from each side of the US-Mexico border.28 But this is more the exception than the rule. In collaborative projects, students engage in a collective endeavour in which each of them is committed to a role. Besides the Lorca project explained earlier, this was also the case in a previous experience that took place in fall 2016 at Columbia University, where students created a minimal edition under the title MiniLazarillo (2016), whose main outcome was a small-scale digital scholarly edition of the anonymous 16th-century Lazarillo de Tormes. However, because of the labour involved, especially in data collection, this type of experiential course requires “a complex process involving design and analysis that may exceed the scope of the traditional term project” (Cro 2020). In the case of the hands-on Lorca editing project, students would have benefited, for instance, from having more time devoted to the concept of scholarly editing and to resolving text encoding issues. It is also central to integrate the idea of “critical making” and the value of creating something, including digital artifacts, conceiving it as an “intellectual endeavor” (Warwick 2019). Finally, and especially in the context of data-driven projects, it is worthwhile to dig into questions of knowledge creation and the implications of technological choices. Students from the Global North are sometimes unaware of the consequences of digital ecosystems and infrastructures or of the inequities in digital access and literacies. As Price suggested, “uneven access to resources, which greatly affects who can do digital humanities and where, is not an issue we have had to address in such stark terms in other types of literary or humanistic scholarship” (2011: 12). Values such as open access and open source need to be part of our conversations and of the praxis that we teach. In fact, ethical practices such as the ones proposed by minimal computing provide the opportunity to reflect on what it means to conduct digital work “under some set of significant constraints of hardware, software, education, network capacity, power, or other factors.”29 We should also keep in mind Román Mendoza’s observation that many technology companies are entering the educational industry with the intention of profiting from the technologization of education and the datafication of learning, rather than with the idea of improving learning, or the world, through learning (Román Mendoza 2023: 12–13). As this scholar predicts, large companies such as Google, Amazon, Facebook/Meta, Apple, and Microsoft will continue to use their power to introduce into the educational sphere their global and homogenizing technologies, which are not designed for language teaching and learning (22). Therefore, we should cultivate the value of using certain types of technologies, privileging community-driven, open-source, and open-access resources over which students can achieve autonomy and control (Allés-Torrent and del Rio Riande 2023).
At the end of the two courses presented here, students admitted having enrolled with no idea of DH or, at most, with a vague interest in the digital world (other reasons included that the instructor was already known, the need to fulfil a requirement or to finish majors/minors, or simply that the time of the course fit into their schedule). However, they all considered the experience rewarding. In general, students did not expect to learn about “code,” but they all agreed that they would have liked to learn more and to delve deeper into hands-on activities and into topics such as TEI or other types of metadata. As for the language of the course, they acknowledged the challenge of learning new digital literacies in Spanish, but in general they did not consider it a drawback. To close, I will only add that these new modes of engagement with modern languages content, drawing upon digital humanities approaches, methodologies, and tools, are productive pedagogical avenues to promote not only critical digital literacies but also spaces to engage creatively with specific languages, cultures, histories, and communities.

Notes

1 For a full overview of the “World-Readiness Standards for Learning Languages,” see www.actfl.org/educator-resources/world-readiness-standards-for-learning-languages.
2 Some of these examples are Multilingual DH, http://multilingualdh.org/en/, which showcases resources for doing digital humanities in different languages, and Global Outlook::Digital Humanities, www.globaloutlookdh.org/, which promotes collaboration among DHers worldwide.
3 For the idea of “translingual” and “transcultural competence,” see MLA (2007).


4 These are the values proposed in the well-known article by Lisa Spiro (2012). Similar values were proposed by the program Demystifying Digital Humanities (2012, http://dmdh.org/): Auto-didacticism, Do It Yourself, Ad hoc, Iterative development, Process and product, Failure as valuable, Collaboration, Open source, Public Scholarship, Public Peer Review, Dissemination, and Transparency.
5 www.modernlanguagesopen.org/collections/special/dml-tutorials/
6 In their article, Isasi and Rojas Castro (2021: 618–620) propose an interesting classification of the translation strategies used in PH into three different levels.
7 https://teach.dariah.eu/
8 https://tthub.io/
9 This course is taught in the Department of Modern Languages and Literatures at the University of Miami, under the title Spanish 322: Cultural Topics, https://bulletin.miami.edu/courses-az/spa/. The syllabus used in Fall 2017 is available at http://susannalles.github.io/teaching/SPA322/, and the digital result produced by the students at https://susannalles.github.io/teaching/SPA322/Web_Lorca/
10 StoryMapJS is available at: https://storymap.knightlab.com
11 TimelineJS at https://timeline.knightlab.com/
12 Available in the site, https://susannalles.github.io/teaching/SPA322/Web_Lorca/index.html, under the tabs “Cronología” and “Viajes.”
13 Recogito is an open-access platform to edit texts in XML-TEI that allows georeferencing: https://recogito.pelagios.org/
14 Texts were recovered from the website https://federicogarcialorca.net/ through the wget command tool. I transformed the HTML files into plain text and uploaded a folder to the shared repository of the course on GitHub.
15 The exercise can be found at: http://susannalles.github.io/teaching/SPA322/class_12_AntConc
16 The exercise with Voyant is available at http://susannalles.github.io/teaching/SPA322/class_14_Voyant.html
17 Available at: http://susannalles.github.io/teaching/SPA322/class_13_RStudio.html
18 See Visconti (2016) for a tutorial on GitHub and GitHub Pages.
19 Minimal computing is a working group of GO::DH, https://go-dh.github.io/mincomp/
20 Again, the information about the course Spanish 410 can be found in the university Bulletin, https://bulletin.miami.edu/courses-az/spa/
21 The course syllabus is available online: https://susannalles.github.io/SPA410/
22 In this regard, Spence and Brandao highlight: “little research [explores] how online/digital (and hybrid) pedagogies are mediated by languages and distinct geocultural perspectives—discussion is predominantly shaped around how ‘digital’ can transform ‘languages’, rather than the other way around” (2020: 4).
23 Both the RedHD (2020) and the HDH (2021) published a list of best practices in DH.
24 Available at: https://covid.dh.miami.edu/
25 Rubrics for essay 3: https://susannalles.github.io/SPA410/ejercicios/ensayo-3.html and for essay 4 at: https://susannalles.github.io/SPA410/ejercicios/ensayo-4.html
26 The Digital Narratives of COVID-19 project offered a graphic interface to create a collection of tweets by area, language, and date: https://covid.dh.miami.edu/get/
27 An example of the result can be seen at: https://miladvh.github.io/2021/12/08/ensayo4.html
28 The project is still online at http://kimberlyflores.github.io/CARTAS/ (2016).
29 See their website, https://go-dh.github.io/mincomp/about/

Bibliography

Allés-Torrent, Susanna. 2019. “Sobre la complejidad de los datos en Humanidades, o cómo traducir las ideas a datos.” Revista de Humanidades Digitales 4: 1–28.

Allés-Torrent, Susanna, Megan J. Myers, and Élika Ortega. 2021. “New Dialogues in Spanish and Portuguese Studies: Pedagogical and Theoretical Perspectives from the Digital Humanities.” Hispania 104 (4): 535–541. https://doi.org/10.1353/hpn.2021.0124.
Allés-Torrent, Susanna, and Gimena del Rio Riande. 2023, forthcoming. “Autonomía y control: Minimal Computing como propuesta pedagógica para las Humanidades Digitales.” In Enseñanza de literatura en español con Humanidades Digitales, edited by Rocio Orduña Casanova, Beatriz Tardío, and Antonio Huertas Morales. Bern: Peter Lang.
Álvarez Sánchez, Adriana, and Miriam Peña Pimentel. 2017. “DH for History Students: A Case Study at the Facultad de Filosofía y Letras (National Autonomous University of Mexico).” Digital Humanities Quarterly 11 (3).
Asociación Argentina de Humanidades Digitales (AAHD). 2021. PodcastHD. Cinco preguntas a humanistas digitales. www.youtube.com/playlist?list=PLjxMH6JwH1oPoytLMaxhmbCx1IFpVj7ux
Battershill, Claire, and Shawna Ross. 2017. Using Digital Humanities in the Classroom: A Practical Introduction for Teachers, Lecturers, and Students. New York: Bloomsbury.
Beetham, Helen, and Rhona Sharpe. 2013. Rethinking Pedagogy for a Digital Age: Designing for 21st Century Learning. 2nd ed. New York: Routledge.
Benedict, Nora. 2021. “The Risks and Rewards of Implementing Digital Humanities Methodologies in Modern Language Graduate Research.” Hispania 104 (4): 571–582. https://doi.org/10.1353/hpn.2021.0127.
Borges, Jorge Luis. 1952. Otras inquisiciones [The Analytical Language of John Wilkins]. Buenos Aires: Sur.
Brier, Stephen. 2012. “Where’s the Pedagogy? The Role of Teaching and Learning in the Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew K. Gold. Minneapolis: University of Minnesota Press.
Burnard, Lou. 2001. “On the Hermeneutic Implications of Text Encoding.” In New Media and the Humanities: Research and Applications, edited by Domenico Fiormonte and Jonathan Usher, 31–38. Oxford: Humanities Computing Unit.
Cro, Melinda. 2020. Integrating the Digital Humanities Into the Second Language Classroom: A Practical Guide. Washington, DC: Georgetown University Press.
Dacos, Marin. 2016. “La Stratégie du Sauna Finlandais: Les Frontières des Digital Humanities.” Digital Studies/Le Champ Numérique 6 (2). https://doi.org/10.16995/dscn.41.
Del Rio Riande, Gimena, and Domenico Fiormonte. 2021. “Una vez más sobre los sures de las Digital Humanities.” Acervo 35 (1): 1–15.
Del Rio Riande, Gimena, and Valeria Vitale. 2020. “Recogito-in-a-Box: From Annotation to Digital Edition.” Modern Languages Open 1. https://doi.org/10.3828/mlo.v0i0.299.
Drucker, Johanna. 2021. The Digital Humanities Coursebook: An Introduction to Digital Methods for Research and Scholarship. New York: Routledge.
Fiormonte, Domenico. 2014. “Digital Humanities From a Global Perspective.” Laboratorio dell’ISPF 11.
Fiormonte, Domenico. 2016. “Toward a Cultural Critique of Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew K. Gold and Lauren F. Klein. Minneapolis: University of Minnesota Press.
Foucault, Michel. 1966. Les mots et les choses [The Order of Things]. Paris: Éditions Gallimard.
Freire, Paulo. 2005. Pedagogía del oprimido. Translated by Jorge Mellado. 2nd ed. México: Siglo XXI.
Froehlich, Heather. 2015. “Corpus Analysis With AntConc.” Programming Historian 4. https://doi.org/10.46430/phen0043.
Frost Davis, Rebecca, Matthew K. Gold, Katherine D. Harris, and Jentery Sayers, eds. 2020. Digital Pedagogy in the Humanities: Concepts, Models, and Experiments. New York: Modern Language Association. https://digitalpedagogy.hcommons.org/.
Gailey, Amanda. 2014. “Teaching Attentive Reading and Motivated Writing through Digital Editing.” CEA Critic 76 (2): 191–199.
Galina Russell, Isabel. 2013. “Is There Anybody Out There? Building a Global Digital Humanities Community.” Blog Colectivo de la Red de Humanidades Digitales de México (blog). July 19. http://humanidadesdigitales.net/blog/2013/07/19/is-there-anybody-out-there-building-a-global-digital-humanities-community/.
Galina Russell, Isabel. 2014. “Geographical and Linguistic Diversity in the Digital Humanities.” Literary and Linguistic Computing 29 (3): 307–316. https://doi.org/10.1093/llc/fqu005.
Gallon, Kim. 2016. “Making a Case for the Black Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew K. Gold. Minneapolis, MN: University of Minnesota Press. http://dhdebates.gc.cuny.edu/debates/text/55.
Gardiner, Eileen, and Ronald G. Musto. 2015. The Digital Humanities: A Primer for Students and Scholars. New York: Cambridge University Press.
Gutiérrez De la Torre, Silvia. 2016. “¿Cómo lo hicieron? Herramientas para las Humanidades Digitales.” Personal Site (blog). October 5. https://silviaegt.wordpress.com/2016/10/05/como-lo-hicieron/.
Gutiérrez De la Torre, Silvia. 2019. “Análisis de corpus con Voyant Tools.” Programming Historian 3. https://doi.org/10.46430/phes0043.
Hayes, Erica Y., and Mia Partlow. 2022. “Displaying a Georeferenced Map in KnightLab’s StoryMap JS.” Programming Historian 11. https://doi.org/10.46430/phen0098.
Hinrichsen, Juliet, and Antony Coombs. 2013. “The Five Resources of Critical Digital Literacy: A Framework for Curriculum Integration.” Research in Learning Technology 21. https://doi.org/10.3402/rlt.v21.21334.
Hirsch, Brett D., ed. 2012.

Frost Davis, Rebecca, Matthew K. Gold, Katherine D. Harris, and Jentery Sayers, eds. 2020. Digital Pedagogy in the Humanities: Concepts, Models, and Experiments. New York: Modern Language Association. https://digitalpedagogy.hcommons.org/.
Gailey, Amanda. 2014. “Teaching Attentive Reading and Motivated Writing through Digital Editing.” CEA Critic 76 (2): 191–199.
Galina Russell, Isabel. 2013. “Is There Anybody Out There? Building a Global Digital Humanities Community.” Blog Colectivo de la Red de Humanidades Digitales de México (blog). July 19. http://humanidadesdigitales.net/blog/2013/07/19/is-there-anybody-out-there-building-a-global-digital-humanities-community/.
Galina Russell, Isabel. 2014. “Geographical and Linguistic Diversity in the Digital Humanities.” Literary and Linguistic Computing 29 (3): 307–316. https://doi.org/10.1093/llc/fqu005.
Gallon, Kim. 2016. “Making a Case for the Black Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew K. Gold. Minneapolis, MN: University of Minnesota Press. http://dhdebates.gc.cuny.edu/debates/text/55.
Gardiner, Eileen, and Ronald G. Musto. 2015. The Digital Humanities: A Primer for Students and Scholars. New York: Cambridge University Press.
Gutiérrez De la Torre, Silvia. 2016. “¿Cómo lo hicieron? Herramientas para las Humanidades Digitales.” Personal Site (blog). October 5. https://silviaegt.wordpress.com/2016/10/05/como-lo-hicieron/.
Gutiérrez De la Torre, Silvia. 2019. “Análisis de corpus con Voyant Tools.” Programming Historian 3. https://doi.org/10.46430/phes0043.
Hayes, Erica Y., and Mia Partlow. 2022. “Displaying a Georeferenced Map in KnightLab’s StoryMap JS.” Programming Historian 11. https://doi.org/10.46430/phen0098.
Hinrichsen, Juliet, and Antony Coombs. 2013. “The Five Resources of Critical Digital Literacy: A Framework for Curriculum Integration.” Research in Learning Technology 21. https://doi.org/10.3402/rlt.v21.21334.
Hirsch, Brett D., ed. 2012. Digital Humanities Pedagogy: Practices, Principles and Politics. Cambridge: Open Book Publishers. https://doi.org/10.11647/OBP.0024.
Humanidades Digitales Hispánicas. 2021. Documento de recomendaciones para la evaluación de la investigación llevada a cabo en el ámbito de las Humanidades Digitales. https://humanidadesdigitaleshispanicas.es/documento-de-recomendaciones-para-la-evaluacion-de-la-investigacion-llevada-a-cabo-en-el-ambito-de-las-humanidades-digitales/.
Isasi, Jennifer, and Antonio Rojas Castro. 2021. “¿Sin equivalencia? Una reflexión sobre la traducción al español de recursos educativos abiertos.” Hispania 104 (4): 613–624. https://doi.org/10.1353/hpn.2021.0130.
Kirschenbaum, Matthew. 2010. “What Is Digital Humanities and What’s It Doing in English Departments?” ADE Bulletin 150: 55–61. www.doi.org/10.1632/ade.150.55.
Lee, Carmen. 2017. Multilingualism Online. London and New York: Routledge.
Liu, Alan. 2012. “Where Is Cultural Criticism in the Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew K. Gold, 490–509. Minneapolis: University of Minnesota Press.
Mahony, Simon. 2018. “Cultural Diversity and the Digital Humanities.” Fudan Journal of the Humanities and Social Sciences 11 (3): 371–388. https://doi.org/10.1007/s40647-018-0216-0.
Mapes, Kristen. 2022. Introduction to Digital Humanities. https://introdh2022.commons.msu.edu/.
Milligan, Ian, and James Baker. 2014. “Introduction to the Bash Command Line.” Programming Historian 3. https://doi.org/10.46430/phen0037.

MLA Ad Hoc Committee on Foreign Languages. 2007. Foreign Languages and Higher Education: New Structures for a Changed World. www.mla.org/Resources/Research/Surveys-Reports-and-Other-Documents/Teaching-Enrollments-and-Programs/Foreign-Languages-and-Higher-Education-New-Structures-for-a-Changed-World.
Murphy, Emily C., and Shannon R. Smith. 2017. “Imagining the DH Undergraduate: Special Issue in Undergraduate Education in DH. Introduction.” Digital Humanities Quarterly 11 (3).
The National Standards Collaborative Board. 2015. World-Readiness Standards for Learning Languages. 4th ed. Alexandria, VA: Author.
Ortega, Élika. 2019. “Zonas de Contacto: A Digital Humanities Ecology of Knowledges.” In Debates in the Digital Humanities, edited by Matthew K. Gold and Lauren F. Klein. Minneapolis: University of Minnesota Press.
Oskoz, Ana. 2020. “Language Learning.” In Digital Pedagogy in the Humanities: Concepts, Models, and Experiments, edited by Rebecca Frost Davis et al. New York: Modern Language Association. https://digitalpedagogy.hcommons.org/keyword/Language-Learning.
Pitman, Thea, and Claire Taylor. 2017. “Where’s the ML in DH? And Where’s the DH in ML? The Relationship Between Modern Languages and Digital Humanities, and an Argument for a Critical DHML.” Digital Humanities Quarterly 11 (1). www.digitalhumanities.org/dhq/vol/11/1/000287/000287.html.
Posner, Miriam. 2013. “How Did They Make That?” Personal Site (blog). August 29. https://miriamposner.com/blog/how-did-they-make-that/.
Prensky, Marc. 2001. “Digital Natives, Digital Immigrants.” On the Horizon 9 (5): 1–6.
Price, Kenneth M. 2011. “Collaborative Work and the Conditions for American Literary Scholarship in a Digital Age.” In The American Literature Scholar in the Digital Age, edited by Amy E. Earhart and Andrew Jewell, 9–26. Ann Arbor: University of Michigan Press.
Red de Humanidades Digitales. 2020. “Guía de buenas prácticas para la elaboración y evaluación de proyectos de Humanidades Digitales y checklist.” http://humanidadesdigitales.net/guia-de-buenas-practicas-para-la-elaboracion-y-evaluacion-de-proyectos-de-humanidades-digitales-y-checklist/.
Risam, Roopika. 2018. New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy. Evanston, IL: Northwestern University Press.
Román Mendoza, Esperanza. 2023. “Pedagogía digital crítica y aprendizaje de lenguas. Viejos y nuevos retos tras la pandemia.” In Inovação e tecnologia no ensino de línguas: pedagogias, práticas e recursos digitais, edited by Isabelle Simões Marques et al., 9–25. Universidade Aberta. https://doi.org/10.34627/uab.cc.23.
Saum-Pascual, Alex. 2017. “Teaching Electronic Literature as Digital Humanities: A Proposal.” Digital Humanities Quarterly 11 (3).
Singer, Kate. 2013. “Digital Close Reading: TEI for Teaching Poetic Vocabularies.” The Journal of Interactive Technology and Pedagogy 3. https://jitp.commons.gc.cuny.edu/digital-close-reading-tei-for-teaching-poetic-vocabularies/.
Sinton, Diana S. 2020. “Mapping.” In Digital Pedagogy in the Humanities: Concepts, Models, and Experiments, edited by Rebecca Frost Davis et al. New York: Modern Language Association. https://digitalpedagogy.mla.hcommons.org/keywords/mapping/.
Spence, Paul, and Renata Brandão. 2020. “Critical Digital Pedagogies in Modern Languages—a Tutorial Collection.” Modern Languages Open 1. https://doi.org/10.3828/mlo.v0i0.343.


Spence, Paul, and Renata Brandão. 2021. “Towards Language Sensitivity and Diversity in the Digital Humanities.” Digital Studies/Le Champ Numérique 11 (1). https://doi.org/10.16995/dscn.8098.
Sperberg-McQueen, C. Michael. 2016. “Classification and Its Structures.” In A New Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, 377–393. 2nd ed. Hoboken, NJ: John Wiley & Sons.
Spiro, Lisa. 2012. “This Is Why We Fight: Defining the Values of the Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew K. Gold, 16–34. Minneapolis: University of Minnesota Press.
Taylor, Claire, and Niamh Thornton. 2017. “Modern Languages and the Digital: The Shape of the Discipline.” Modern Languages Open. https://doi.org/10.3828/mlo.v0i0.156.
Turkel, William J., and Adam Crymble. 2012. “Understanding Web Pages and HTML.” Programming Historian 1. https://doi.org/10.46430/phen0018.
Visconti, Amanda. 2016. “Building a Static Website with Jekyll and GitHub Pages.” Programming Historian 5. https://doi.org/10.46430/phen0048.
Warwick, Claire. 2019. “‘They Also Serve’: What DH Might Learn About Controversy and Service From Disciplinary Analogies.” In Debates in the Digital Humanities, edited by Matthew K. Gold and Lauren F. Klein. Minneapolis: University of Minnesota Press.

5

Digital Learning Environments for SLA: Learning Analytics and the Construction of Knowledge

Alice Gasparini

5.1. Introduction

The years of the COVID-19 pandemic contributed to an exponential digital acceleration. There was an urgent need to move from a face-to-face educational experience to a screen-mediated one. Everybody quickly became familiar with a huge number of new digital tools, platforms, and software. Teaching and studying became virtual activities that took place in virtual classrooms, and we all built new habits and new ways of teaching a second language (Allés-Torrent 2019). This chapter focuses on digital learning spaces where Italian was taught as a second language, presenting an Italian perspective on language teaching in digital environments. It starts from a brief analysis of the concept of space in relation to human beings; the focus then shifts to physical learning spaces and to online ones. As the ecology-of-learning perspective holds, learning happens through interaction with the environment (van Lier 2004). Learning spaces influence the act of learning, and at the same time they are influenced and shaped by the students who act in them. Within instructional spaces, students build their own paths and their own space, and they orient and organize their perceptions and actions, constructing their own knowledge (Damşa et al. 2019, 2078). A better understanding of this process can help teachers become aware of the way students live and experience the learning platform, and therefore of the learning process itself. With these ideas in mind, the chapter explores the interactions recorded within two different learning environments in which digital resources for Italian as L2 are implemented. The analysis mainly considers quantitative data from tracking software that registered the students’ movements and behavior in the two linguistic spaces. The collected data help us understand how learners move and, therefore, how they interpret the linguistic environments and construct their own space and knowledge.
The first part of the chapter illustrates the theoretical background: the concept of space, its role in learning environments, and learning analytics. The second part illustrates the study and the results achieved. The last section offers a general reflection on the work and on the role of digital language teachers.

DOI: 10.4324/9781003393696-8


5.2. Space and Learning Environments

Reflection on the role of space in human life dates back to ancient times; we find references to it even in Greek philosophy. Many modern scholars have dealt with the topic (Massey 2005; Soja 1989; Tuan 1977), but only in recent times did historians and philosophers start to really consider its relevance and its role in human life. During the 1980s, both modernist and postmodernist theories of space agreed on considering space a form of social “production”: space is analyzed in relation to social relations (Gregory and Urry 1985). For centuries, space was a synonym of emptiness, “a container or stage of human activity” (Low 2016, 15); later on, thanks to the social sciences, the Newtonian idea of space bearing “sensible objects” (Smith 1991, 15) spread and became the ground for the ecological view of space. Gibson (1979) contributed to strengthening this view, adding a further detail: space matters because it is substantial and physical. Space is a physical entity with a tight relationship to sociality. According to Phil Benson, “[S]ocial space consists of the physical space that is produced within the social relations of production of an era, a space that also represents that era” (2021, 24). Humans construct their spaces, private and social, giving them both individual and social meaning. Spaces become environments when individuals have carved them out and shaped them. Everybody participates in the construction of space (Gibson 1979; Low 2016). We will explore this concept in more depth later, when analyzing virtual learning environments. From an educational perspective, the learning process and students’ and teachers’ agency are seen as a network of interdependencies among all the components present in the setting, at different levels: social, but also physical and symbolic (van Lier 2004). Learning environments are therefore spaces carved out and shaped by the individuals who act in them.
Space influences the learning event; it is not merely a setting, but something that shapes the learning process. This is true not only for classrooms and schools, where formal learning happens, but for all kinds of spaces, as in informal learning. The construction-of-place perspective is most helpful, perhaps, as a reminder that language learning environments are products not only of impersonal global forces but also of individual agency (Benson 2021, 14).

5.2.1. The Virtual Learning Spaces

The aforementioned ideas also apply to online space and virtual learning environments. Scholars agree that the online world is a space in itself, with its own features and characteristics and a solid bond to physical environments. According to Dourish:

[T]he technologically mediated world does not stand apart from the physical world within which it is embedded; rather, it provides a new set of ways for that physical world to be understood and appropriated. Technological mediation supports and conditions the emergence of new cultural practices, not by creating a distinct sphere of practice but by opening up new forms of practice within the everyday world, reflecting and conditioning the emergence of new forms of environmental knowing. (2006, 304)

Technology enhances spatial, physical space, creating bridges between different environments and locations (Ciolfi 2015, 159). Online space constitutes an “enlargement” of physical space, inspired and built by the latter. The metaphor between physical and virtual space is supported by several studies. Erikson (1993) suggested that spatial environments are a metaphor for interface design, which is exactly the field of interest of this chapter. The interface is the place of interaction between users and virtual space. When designers build an interface, they “impose” their vision on their users. Learning environments are often planned and developed by teachers and instructional designers, while learners actually use them, moving inside these places, re-interpreting them, giving them new meanings, and reconfiguring them (Damşa et al. 2019, 2078). Physical space is a social and collective construct: people give spaces their own individual and social meaning, reinterpreting them according to their necessities and needs. The same process happens in virtual spaces. Indications about physical spaces are commonly collected through observation and questionnaires. But when it comes to online environments, how do researchers collect the necessary data? They do so through “virtual observation” with tracking software, or analytics, employed to acknowledge learners’ agency in online learning spaces and on the internet in general.

5.3. Web Analytics, Learning Analytics, and Language Teaching

Web analytics and learning analytics are quite common and widespread terms. Web analytics is defined as “the measurement, collection, analysis, and reporting of internet data for understanding and optimizing Web usage” (Järvinen and Karjaluoto 2015, 118).
The analytical data collected are described under the umbrella term Big Data, which generally refers to the huge amount of analytical data produced through the navigation and network activities of employees, users, or customers. These data provide crucial information concerning the interaction between humans and a system’s interface (Rojas Castro 2013): useful details for improving the user experience and making it more effective, efficient, and enjoyable. The interface can be considered the “entrance door” through which learners access the virtual classrooms or environments; it is a crucial element that needs to be easy to find, understand, and use. Analytical data collected in educational settings take the name of learning analytics (often called LA), or “the measurement, collection, analysis and reporting of data about learners in their context, for purposes of understanding and optimizing learning and the environments in which it occurs” (Long and Siemens 2011, 34).
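Metrics of this kind are simple aggregations over logged page-view events. The following sketch is purely illustrative (it is not Matomo’s implementation; the event log, visitor identifiers, and page names are invented for the example) and shows how figures such as total visits, unique visitors, returning visitors, and the most viewed resource can be derived from raw tracking data:

```python
from collections import Counter

# Hypothetical page-view log: (visitor_cookie, page) pairs in the order the
# tracking software recorded them. Real analytics tools store far richer
# events (timestamps, device, referrer, geographical provenance, ...).
events = [
    ("u1", "/unit/bar"),
    ("u1", "/unit/lavoro"),
    ("u2", "/unit/bar"),
    ("u3", "/unit/casa"),
    ("u2", "/unit/bar"),  # u2 comes back to the same page
]

views_per_visitor = Counter(visitor for visitor, _ in events)

total_visits = len(events)                # every recorded page view
unique_visitors = len(views_per_visitor)  # distinct cookies
returning = sum(1 for n in views_per_visitor.values() if n > 1)
most_viewed = Counter(page for _, page in events).most_common(1)[0]

print(total_visits, unique_visitors, returning, most_viewed)
# 5 3 2 ('/unit/bar', 3)
```

The caveat discussed later in the chapter applies here as well: a cookie identifies a browser rather than a person, so one learner connecting from two devices counts as two “unique visitors.”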


LA are generally collected by user-behavior monitoring tools, through external software connected to the site to be tracked, as in the case of the study presented here. They are based on the collection of statistics, which show how users and students move within the site or the learning environment in general, which content they interact with most, and whether they encounter obstacles in navigation. The data collected by the tracking software, together with qualitative information about the learners’ (or users’) feelings and perceptions of the experience of using the e-learning system or website, help to produce a complete assessment of the environment. This generally entails indications about the relationship between the users and the system (for example, whether they found it easy to use), an observation of their behavior inside the online systems obtained through the analytics, and interviews and questionnaires to explore learners’ perceptions and feelings about their learning experiences. Observing students’ behavior in digital language learning contexts was a widely applied practice in language teaching long before the concepts of web analytics or LA developed. The practice was conspicuously encouraged during the 1990s (Debski 2003), with emphasis on the interaction between humans and machines, or between humans and humans, in software, programs, or CD-ROM courses offering linguistic material. Studies progressively shifted their focus to different aspects, such as dynamics between students, the negotiation of meanings, learner interactions, and specific system functions. The amount of research dedicated to data generated by learners interacting with digital resources has increased over the past years, due to the gradual rise of the internet and the usage of mobile devices, which facilitate and empower connection to networks (Ferguson 2014, 2019; Papathoma et al. 2021). Relevant examples also come from linguistic research.
Viberg, Wasson, and Kukulska-Hulme (2020) elaborated a framework that focuses on the use of LA for language learning, especially designed for mobile learning. Scholars from the University of Hawaii dedicated the February 2021 special issue of the journal Language Learning & Technology to the challenges and opportunities offered by Big Data for language learning (Reinders and Lan 2021). Thus, LA are useful for teachers not only to monitor teaching results but also to “observe” the learners from an external perspective, comparable to that of the teacher in a real classroom; LA are also essential for registering the relationship between learner and input, and for making the analysis of the system structure and the materials more effective and successful.

5.4. The Study

5.4.1. Selected Learning Environments

As stated before, the systems under consideration, namely Moodle and WordPress, come with different features. They can be traced back to the most common types of digital spaces, the LMS (Learning Management System) and the CMS (Content Management System). WordPress is a typical expression of Web 2.0.1 Moodle, specifically, is the most widespread Learning Management System in the world,2 employed by educational institutions and by private and public companies for distance learning; it is a platform designed to contain the training event in all its functions. According to the estimates published on its website, Moodle is used by 207 million users in 246 countries. Its success might be attributable to the completeness of the functions it offers, in addition to the possibility of using it in several languages.3 Moodle exhibits a particularly well-defined and “ready-to-use” structure. It is already set up and designed to carry out distance learning experiences, as it provides spaces for courses and materials, exercises, collaborative activities, and tasks. Third-party applications may be added to the platform, such as the H5P4 authoring software employed for the creation of interactive linguistic content. In addition, Moodle offers a capillary system that enables the monitoring of learners’ presence within the platform, as well as of their teaching results. The linguistic activities added were built using Moodle’s native functionalities and the aforementioned H5P authoring software. WordPress,5 on the other hand, is a Content Management System. This implies that it was not designed for didactic purposes but rather as software for the creation of websites; in fact, it is estimated that about 35% of websites use WordPress as the core basis for the implementation of their own space on the network. It is a flexible piece of software, typically chosen for the creation of the pages of governmental institutions, large and small companies, or personal blogs. It is an open-source system, which offers free designs and basic layouts as well as more complex paid ones.
It is possible to increase its basic functionalities by adding further applications (free or paid), achieving a higher degree of customization. While Moodle is an already set-up platform, WordPress offers teachers and learning designers the chance to build a learning environment based on their specific needs and requirements. The H5P software constitutes an example of an extra application typically employed, which provides the option to build interactive linguistic content. Research about digital learning environments is moving from the LMS to more flexible instructive spaces, as illustrated in a report on research issued by Educause:6 “what is clear is that LMS has been highly successful in enabling the administration of learning but less so in enabling learning itself” (Brown et al. 2015, 2). The post-LMS scenario entails the integration of different applications and software with a high degree of customization and personalization (Baker 2017).

5.4.2. Language Learning Path: Scorci Italiani

The same learning path for the Italian language, Scorci Italiani (lit. Italian Glimpses), has been included in both the aforementioned systems. As already mentioned, it is organized into five main units: Bar, Lavoro (work), Dottore (doctor), Casa (home), and Università (university). The units aim to cover the main linguistic domains and communication contexts the learners will get involved with. A further extra-section


called Canzoni (songs), has a dual objective: it introduces a glimpse of Italian life, society, and music, and it reviews the grammatical structures previously dealt with across the five units. The syllabus designed for each unit refers to levels A2 and B1 of expertise in the Italian language (Table 5.1). The materials are designed for independent study; nevertheless, they can also be used by teachers as a first contact with the Italian language or with the grammar content. The sections are short teaching units, each centered on a textual input with a specific format. Since each section is totally self-consistent and the reference text input changes, learners can freely choose the order in which to start a given activity. Completing each section requires from 15 to 45 minutes; hence, the sections are conceived as short self-learning lessons of Italian language and culture, a daily refresher.

5.4.3. Methodology

Data Collection Tools

The study presented here is part of a PhD project focusing on the analysis of the usability of the two linguistic learning environments illustrated earlier. Usability assessment and evaluation is a very common, but still complex, practice, which requires tools through which to evaluate both objective and subjective aspects. The data collection tools included a tracking software for the collection of quantitative data, complemented by two questionnaires created through Google Forms: the first aimed at defining the informants’ profile; the second had the purpose of providing a final evaluation of the two environments. Furthermore, some observation sessions were held to bring out reflections concerning the students’ use of the learning systems, followed by interviews regarding their level of satisfaction and their perception of the effectiveness and pedagogical usability of the systems and of the linguistic resources proposed. These interviews were carried out in person. The different tools have contributed to highlighting both objective and subjective dimensions of system usability and of the linguistic resources. All the tools introduced and employed in the present research are among the most common and preferred in the process of usability evaluation. In this chapter, we will especially consider the quantitative segment, because it is more relevant for the topic we are interested in. The software used for tracking the behavior of learners within the study environments is Matomo.7 The software collects analytical data that can be evaluated in Excel sheets or directly processed by the system into easily readable graphical representations. The program is aimed at the collection of analytical data, with the main purpose of optimizing the usability of websites in general; it is not specific to e-learning systems.
Thus, software such as Matomo was selected because it permits the observation of both systems with comparable data, and because it guarantees a safer protection of

Table 5.1 The syllabus designed for the units

Unit: Bar (Coffee shop)
- Communicative functions: Espressioni di cortesia per chiedere informazioni e ordinare (ask for information and order).
- Vocabulary: Lessico delle bevande (vocabulary of food and drinks).
- Grammar: Revisione del passato prossimo e imperfetto (review of passato prossimo and imperfetto).
- Cultural content: Il bar nei diversi momenti della giornata, la pausa pranzo (the cultural relevance of the coffee shop in Italy).

Unit: Lavoro (Work)
- Communicative functions: Scrivere il proprio curriculum vitae; descrivere il carattere (write a CV; describe one’s personality).
- Vocabulary: Comprensione degli annunci di lavoro, lessico e formule usate (finding a job: vocabulary and useful expressions).
- Grammar: Condizionale presente (present conditional); aggettivi a due e quattro uscite (adjectives).
- Cultural content: Il centro per l’impiego (the job center).

Unit: Dottore (Doctor)
- Communicative functions: Esprimere sintomi di malattie e malesseri (talk about symptoms and illnesses); esprimere la necessità: bisogna, volerci, servire (structures to express “necessity,” “usefulness,” and “time duration”).
- Vocabulary: Lessico del corpo umano (human body vocabulary).
- Grammar: Imperativo positivo e negativo, formale e informale (negative, positive, formal, and informal imperative).
- Cultural content: Il servizio sanitario nazionale (the National Health System).

Unit: Casa (Home)
- Communicative functions: Descrivere una casa; esprimere i propri gusti: verbo piacere (describe a house and express one’s tastes).
- Vocabulary: Comprensione degli annunci immobiliari, lessico e formule usate (finding a house: vocabulary related to houses).
- Grammar: Congiuntivo presente (present subjunctive); pronomi indiretti e combinati (pronouns).
- Cultural content: I cambiamenti della famiglia e il rapporto con la casa (changes related to the family and the house).

Unit: Università (University)
- Communicative functions: Presentarsi, fare conoscenza, parlare di sé (talk about yourself).
- Vocabulary: Lessico dell’università (vocabulary of the university); verbi pronominali e lessico colloquiale (slang and pronominal verbs).
- Grammar: Concordanza dei tempi passati, trapassato prossimo (past tense agreement); uso dei connettivi (use of logical connectors).
- Cultural content: Come cambia la società italiana, gli immigrati di seconda generazione (changes in Italian society).


personal data. Matomo is highly respectful of European law on personal data: personal information is owned by the users and properly stored and protected. The metrics taken into account for the objectives of the study refer to visits and users: total visits and accesses, unique and recurring visits, informants’ geographical provenance, and devices employed. Some statistics record learner behavior, and they are specifically intended to track the most viewed pages and resources, the most commonly taken paths within the learning environments, and the channels through which learners reach the virtual systems.

The Participants

The data collection process carried out in praesentia involved 20 students attending Italian language courses for foreigners at the University Language Center of the G. D’Annunzio University of Pescara between January and February 2020, right before the university closed because of the global health emergency. The data obtained from in-person observations are fewer in number than those collected virtually. The questionnaires (profiling, and usability and engagement) were voluntarily completed by 136 informants, a small number compared to the number of people who actually interacted with the learning environments. The visits recorded by the tracking software amount to about 3,000, which, however, is not exactly the number of users who entered the systems: the total includes both single and recurring visits made by the same user. The tracking software attributes cookies to users, which, however, change if they connect from a different device. Matomo offers single-user measurement only by month and not over longer time spans, as in the case of the present experimentation. The online experimentation lasted seven months and was carried out between February and September 2020. The channels selected for the dissemination of the research and of the learning environments were the main mailing lists of Italian teachers in Italy and around the world, specifically ItalianoL2 e molto altro, AATI,8 and Italian Studies,9 as well as social networks such as LinkedIn and Facebook, especially profiles dedicated to Italian language learning, for both teachers and students. The informants are mostly teachers and students of Italian as a second or foreign language, from universities, schools, and institutions in general where Italian language teaching is provided. About 2,500 participants were involved in the online experiment. They were all informed about the data collection, and the two learning systems were created in full compliance with the GDPR.

5.5. Data and Results

The present section illustrates the data collected and the results obtained from their analysis. These are a limited part of the quantitative data collected throughout the study; only those in line with the topic of the chapter are considered. As mentioned earlier, the project aimed to compare two learning environments in which some digital linguistic resources were implemented. The systems differ in

the general objective they were created for, and this is also reflected in their structure and in the possibilities they offer to users. The data considered were generated by the tracking software employed to record the users' behavior. The statistics considered regard the general number of visits, the students' preferred paths, and the main interactions.

5.5.1. Number of Visits

The first figures presented here describe the number of visits (Table 5.1), which reports the visits recorded for both learning environments. As shown in Table 5.1, users leaned toward the WordPress-based environment (2,427 visits) rather than the Moodle-based one (494 visits). This is partial and very general information, but it signals the direction of the users' flow and their choices, and so needs to be taken into consideration. This number contains both unique and recurring visits: the students who visit the environments only once and those who decide to return. Both figures are relevant, because new visits indicate a flow of new users exploring the systems, while recurring visits suggest the learners' interest in trying new activities or discovering other parts of the sites. The reason behind the choice may lie in students' curiosity to try a new system. Moodle is a very popular LMS, employed by a great number of universities and organizations, and it is easy to imagine that many students and teachers already know and use it. This is especially true given the timing of the dissemination of the research, which started in February 2020, the worst period of the health emergency, when many countries were transitioning from face-to-face to distance education.
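To make the distinction between unique and recurring visits concrete, the split can be sketched as follows. This is an illustrative sketch with invented data, not Matomo's actual implementation; Matomo derives the visitor IDs from its cookies, which, as noted earlier, change when a user switches device:

```python
from collections import Counter

def split_visits(visit_log):
    """Given a list of visitor IDs (one entry per visit), return
    (new_visits, recurring_visits). A visitor's first visit counts
    as new; every later visit by the same ID counts as recurring."""
    per_visitor = Counter(visit_log)
    new = len(per_visitor)                       # one first visit per visitor
    recurring = sum(per_visitor.values()) - new  # everything beyond the first
    return new, recurring

# Illustrative log: visitor "a" came three times, "b" once, "c" twice.
print(split_visits(["a", "b", "a", "c", "a", "c"]))  # -> (3, 3)
```

Because a device change produces a new cookie, the same person can appear under several IDs, which is why the total of roughly 3,000 visits cannot simply be read as a number of users.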

5.5.2. The Users' Preferred Paths: Homepages' Heatmaps

WordPress

The relationship between virtual spaces and users starts from the "front door": the homepages of the two learning systems, which appear even before the login or registration pages. We now consider user movement around the two homepages. We will first briefly describe their structure (Figures 5.1 and 5.2), because it influences the way users navigate them and, more generally, their overall experience of the construction of knowledge. As depicted in Figure 5.1, WordPress looks like a standard website. The central section of the page displays a text in which the focus of the research and the digital resources are explained. The WordPress homepage contains the description of the course, Scorci Italiani. It is composed of five units: Bar (coffee shop), Lavoro

Table 5.1 Number of total visits tracked for WordPress and Moodle

                    WordPress (CMS)    Moodle (LMS)
  Number of visits  2,427              494


Alice Gasparini

Figure 5.1 WordPress’ homepage.

Figure 5.2 WordPress’ homepage heatmap.

(work), Dottore (Doctor), Casa (Home), and Università (University). The menu shows six labels, from left to right: Scorci (the units), Abilità (Abilities), Competenze (Competences), Livelli (Levels), Contatti (Contacts), and Accedi o Registrati (Login or Register). The Abilities menu includes Comprensione (oral and written Comprehension) and Produzione (oral and written Production). Competenze includes, instead, Comunicazione (Communication), Vocabolario (Vocabulary),

Cultura (Culture), and Grammatica (Grammar). The linguistic level of the course is A2–B1.10 The activities last from 10 to 30 minutes. There is an extra unit, called Canzoni (Songs), where students can find linguistic activities related to Italian songs. Figure 5.2, on the other hand, represents WordPress' homepage heatmap, which shows the most clicked areas and contents using colorful spots. At a first, quick look, students can find different ways to access the website, different entrance doors: for example, they can click on the menu to reach the contents or follow the links contained in the text. As depicted in the figure, the biggest spots are located on A2 (the page containing the A2 level activities), B1, Comprensione (the activities focusing on oral and written comprehension), Comunicazione (the activities focusing on oral and written communication), and Grammatica (grammar activities). The graphic representation suggests that students prefer to use the links contained in the written text rather than the upper menu. We can see a smaller spot on Scorci (the units) and on Canzoni (Songs, the activities containing a song). The students decide to access by clicking on the linguistic levels or on the abilities they are willing to improve. We chose to analyze only the homepage's heatmap, but the tracking software produces a heatmap for every page of the website. Therefore, it is possible to have a comprehensive image of the students' movements on every single page, which is very useful for understanding which contents and paths they visualize more and which they decide not to visit.
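A heatmap like the one in Figure 5.2 is essentially an aggregation of click coordinates. The sketch below illustrates the underlying binning with invented coordinates; it is not Matomo's actual implementation, only a minimal model of how "spots" emerge from click data:

```python
from collections import Counter

def click_heatmap(clicks, cell=50):
    """Bin page click coordinates (x, y in pixels) into a grid of
    `cell`-pixel squares and count clicks per square; the hottest
    squares correspond to the biggest spots on the rendered heatmap."""
    grid = Counter((x // cell, y // cell) for x, y in clicks)
    return grid.most_common()

# Illustrative clicks: three clustered on one menu item, one elsewhere.
clicks = [(410, 120), (415, 130), (420, 125), (900, 300)]
print(click_heatmap(clicks))  # -> [((8, 2), 3), ((18, 6), 1)]
```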

Moodle

Figure 5.3 depicts the structure of Moodle's homepage. The central section of the homepage contains a short text with the list of units' names: Bar, Lavoro, Dottore, Casa, Università. The activities last from 10 to 30

Figure 5.3 Moodle’s homepage.


minutes. There is an extra unit, called Canzoni (Songs), where students can find linguistic activities related to Italian songs. In Moodle the units correspond to different courses, which implies that students need to exit from one course and enter another if they are interested in exploring the units. They cannot "jump" from one unit to another, or from one activity to another; once they have entered a unit, Bar, for example, they can see all the activities related to that unit. If they are interested in the following one, they exit from the first and access the second. Another big difference from WordPress' homepage is the absence of the categories Abilità (abilities), Competenze (competences), and Livelli (levels). What does this imply for users? Different degrees of flexibility and freedom in the users' movements and in the construction of their own learning space. In Moodle we see only one available entrance door: the units. WordPress allows users to access linguistic materials from multiple points: the units, Scorci, or the categories listed in the menu, like Abilità (abilities), Competenze (competences), and Livelli (levels). The students can explore the digital resources either through the units or through the labels listed above, giving everyone the chance to build and personalize their own path through the site. Moodle is a complex platform covering various learning-related functions, where personalization and flexibility have limited space. As for the CMS, Moodle's homepage heatmap (Figure 5.4) will be considered to better understand the navigation criteria employed by the students. As we can see from Figure 5.4, the most clicked content is the first in the list, Bar (the unit Coffeeshop), and the least clicked is Università, the last one. The navigation

Figure 5.4 Moodle’s homepage heatmap.

follows the structure imposed by the designer, because there is less chance to build a personalized structure and organization of contents. These indications are confirmed and supported by other statistics recorded by the tracking software, which will be illustrated in the next paragraph.

5.5.3. Main Interactions

WordPress

The following data are called Interactions (Figure 5.5). They combine the most clicked and visited nodes (pages), which in the picture are the bigger rectangles. The thin lines depict the movements of the users' flow. These graphic representations show the preferred paths and the most visualized pages. Three main interactions are represented in the figure; the analysis focuses especially on Interactions 1 and 2. Interaction 1 is composed of seven main nodes:

• Scorci Italiani (the homepage)
• Canzoni (the category page containing all the activities related to the songs)
• Registrati (the registration page)
• A2 (the activities related to the linguistic level)
• Cosa hai fatto ieri? (an input text about the past tense, passato prossimo)
• Eri bellissima (a song-based activity)
• La gatta (a song-based activity)

Figure 5.5 Main interactions recorded in the WordPress-based learning environment.

Interaction 1 counts 1,843 visits in total, of which 1,412 continued and 431 left the website. At the top left, the biggest rectangle is the homepage, where the number 639 indicates the number of visits; five main flows start from it, going to the homepage, the category page Canzoni, A2, and B1. Interaction 2 illustrates where the continuing 1,412 visits spread; the lines depict the users' flows going from one page to another. The first four nodes of Interaction 2 are the same as in Interaction 1:

• Scorci Italiani (the homepage)
• Canzoni (the category page containing all the activities related to the songs)
• Registrati (the registration page)
• A2 (the activities related to the linguistic level)
• Eri Bellissima (another song-based activity)
• B1 (the activities related to this linguistic level)
• Comprensione (activities focusing on written and oral comprehension)
• La gatta (a song-based activity)
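Flow diagrams of this kind can be derived from ordered page-view logs by counting transitions between consecutive pages. A minimal sketch with invented session data (illustrative only, not the algorithm Matomo uses):

```python
from collections import Counter

def transitions(sessions):
    """Count page-to-page transitions across visit sessions, where each
    session is the ordered list of pages a user viewed. The heaviest
    transitions correspond to the thickest lines in a flow diagram."""
    flows = Counter()
    for pages in sessions:
        flows.update(zip(pages, pages[1:]))
    return flows

# Illustrative sessions loosely modeled on the navigation described above.
sessions = [
    ["home", "Canzoni", "La gatta"],
    ["home", "A2"],
    ["home", "Canzoni"],
]
print(transitions(sessions).most_common(1))  # -> [(('home', 'Canzoni'), 2)]
```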

These data reveal how students move inside the WordPress system. An interesting indication emerges: the absence of the unit pages (Bar, Casa, Lavoro, Dottore, or Università) and the presence of category pages like Canzoni, the linguistic levels A2 and B1, or skills like Comprensione. The presence of the homepage and the registration page is quite predictable, because the first is the landing page of the site and users need to register if they want to enter. The occurrence of the category page Canzoni is more interesting, because it reveals the students' interest in this specific cultural aspect. This is also true for single musical activities like Il mondo che vorrei and Eri Bellissima, and for the linguistic level categories A2 and B1. The students seem to navigate following their interest in musical activities and linguistic levels. They do not randomly explore units; they follow their interests, curiosity, and linguistic needs.

Moodle

We will now consider the same data for the second learning system, Moodle (Figure 5.6); as for WordPress, our attention will focus especially on Interactions 1 and 2. Interaction 1 combines four main nodes:

• /m/login/index.php (the login page)
• /m/login/confirm.php (the login confirmation page)
• /m (the landing page of the platform, which users see before registration or login)
• m/enrol/index.php?id=3 (the enrolment page for the unit Bar)

Figure 5.6 Main interactions recorded in the Moodle-based learning environment.

The top left rectangle representing the homepage's visits is much bigger than the others, which means that the students had only one path and "entrance door" to the platform. The following contents are other management pages, like the login confirmation page and the homepage; only the last node corresponds to a learning resource, the unit Bar. This information confirms the complexity of Moodle's structure, where users' movements are less free and flexible: there is one main entrance door and few paths to follow. The presence of many management pages translates into more time and clicks for the students to get to the linguistic content. These data also confirm that Bar is the most visualized unit, as seen and explained in the previous paragraph. If we compare this with the WordPress version, we can underline some differences: the number of total visits is lower, 473 versus 1,843; there are more exits than continuing visits; and Interaction 2 combines fewer nodes than in WordPress.


The main users' flows leaving the top left rectangle (the login page) go toward:

• /m/login/index.php (the login page)
• /m (the landing page of the platform, which users see before registration or login)
• /m/login/signup.php (the registration page)
• /my/my (the personal profile page)
• m/enrol/index.php?id=3 (the enrolment page for the unit Bar)
• /m/login/confirm.php (the login confirmation page)
• /m/course/view.php (the list of the courses or units)
• /m/mod/quiz/review.php (one quiz activity)
• others

The nodes contained in Interaction 2 include some management pages and some learning-related content, like the enrolment page for the unit Bar, the list of the learning units, and the quiz.

5.6. Discussion

The data illustrated earlier outline the movements of students within the two systems. They give useful information about how the students shape their learning space through their movements. The heatmaps and the interaction analysis highlighted the most visualized contents and the paths taken by the users. Both measures express students' linguistic interests and needs, revealing what they would like to learn and what they feel is useful to fulfill their communicative demands. As mentioned before, the most visited environment is WordPress, which appears to give users a more flexible experience in terms of navigation and freedom of movement. The CMS offers learners the chance to choose how to explore the resources. The data collected helped us discover that students prefer to access the resources through category pages, which group the linguistic activities under a common label, for example, the linguistic level. It is interesting to underline that the song category received many visits, suggesting a curiosity toward multimodal resources combining different modes like music, written words, and images. On the other hand, Moodle does not seem so flexible, and the students navigate the linguistic materials following the order imposed by the designer. Besides, the LMS has more administrative pages, and the students take more time to reach the learning sections. The structure of the two virtual spaces had a clear influence on the way the students moved and explored them. Places devoted to learning, like all physical or virtual places, are never neutral (Mulcahy 2018), and the students adapt and find their own way to experience them. As happened in the study illustrated here, the students were able to express their needs through their movements in the environments, despite the limitations imposed by both structures.

5.7. Conclusion

The system interface is the site of interaction between users and virtual learning spaces. According to Erickson, designers can achieve effective interfaces only by observing how physical spaces support and promote interactions between people, because "spatial constraints can be powerful ways of generating activities" (1993). This concept was further elaborated by the designer Donald Norman11 and the engineer Jakob Nielsen.12 They both work on the idea of usability13 and user experience applied to any kind of system. The observation of people's movements in both physical and virtual spaces is a key point in designing human-centered interfaces able to evolve according to learners' and, more generally, human needs. As Mulcahy points out, "[S]paces are themselves agents for change. Changed spaces will change practice" (2018). The movements recorded by the monitoring software provided useful information about how the students employed both system interfaces, about their needs, and about possible data-driven modifications. As described in the chapter, the learners preferred to explore the sites using explicit labels, like linguistic levels (A2, B1) or skills (comprehension and production), rather than starting from the units named after common places like coffee shop, university, and work. These last labels turned out to be too vague for them; they were not self-explanatory, as A2 or Comprehension are. An optimization of the interface would imply a different information architecture: linguistic resources should be restructured into categories with explicit names to allow quicker and more aware navigation. The interface's possible reorganization points to the importance of flexibility in a learning system, which appears to be a very important feature, as confirmed by the analysis carried out in the study.
Without the monitoring software that recorded the data generated by learner movements, we would not really know how the students use the learning systems, and we would not be able to make informed changes and optimize the instructive spaces. This information allows the designer to know how the students really use the environments and to make data-driven changes that can meet their communicative and linguistic needs. In face-to-face education, teachers monitor the students and how they carry out a task in order to evaluate whether or not the task has achieved the expected results. The teachers test and evaluate the lesson plan structure and the order of the content they propose. In online learning this cannot happen, and a monitoring system like Matomo becomes "the teachers' or instructional designers' eyes." In online learning spaces, tutors are generally not present, and the monitoring software presented in this study serves to observe movements within the "virtual" classroom. This is why research on learning analytics is so relevant in online training. If learning spaces influence the act of learning and are at the same time influenced and shaped by the students who act in them (Damşa et al. 2019), there is a strong and urgent need to understand this dynamic in order to create better conditions for learning in any instructive space.
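Reports like those used throughout this chapter can also be pulled programmatically from Matomo's HTTP Reporting API, which helps when designers want to automate such data-driven checks. A minimal sketch with the Python standard library: the server URL and site ID below are hypothetical, while the query parameters and report names follow Matomo's documented Reporting API:

```python
from urllib.parse import urlencode

def matomo_report_url(base_url, method, site_id, period, date, token="anonymous"):
    """Build a request URL for Matomo's HTTP Reporting API.

    `method` names the report, e.g. "VisitsSummary.get" for visit counts
    or "UserCountry.getCountry" for visitors' geographical provenance.
    """
    params = {
        "module": "API",
        "method": method,
        "idSite": site_id,
        "period": period,   # "day", "week", "month", or "range"
        "date": date,       # e.g. "2020-02-01,2020-09-30" with period="range"
        "format": "JSON",
        "token_auth": token,
    }
    return f"{base_url}/index.php?{urlencode(params)}"

# Hypothetical server: total visits over the seven-month experiment.
url = matomo_report_url("https://analytics.example.org", "VisitsSummary.get",
                        site_id=1, period="range", date="2020-02-01,2020-09-30")
print(url)
```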


This raises a number of questions about the language teacher's role in digital education. As demonstrated elsewhere in this book, transdisciplinary competences appear to be indispensable to digital research and to foreign language instructors operating in digital environments. Their teaching expertise needs to be supported by technical knowledge, in order to select suitable tools, create digital activities, and evaluate and assess both the students' performance and the chosen systems. As illustrated in the study, data analysis skills are also decisive in interpreting the data collected by the monitoring system and in improving instructors' work and the experience of their learners.

Notes

1 By Web 2.0 we refer to the new network in which sharing and collaboration are fundamental features. The real change is represented by the innovative possibility for users to produce materials and resources without possessing special computer skills.
2 Data are available on the following website: https://stats.moodle.org/
3 For further information, see www.hubkengroup.com/resources/moodle-beneficialeducation-sector
4 H5P is an app and a set of devices aimed at online teaching, made as web pages (the name refers to the "HTML5 Package" formula). It includes a set of packages, called "content types," of three kinds: utilities, such as a device for recording audio files; educational activities, such as multiple choice and cloze questions; and interactive contents (Fallani 2017). Url: https://h5p.org/
5 WordPress is one of the most popular CMSs; more than 40% of global websites use WordPress technology. Available data here: https://wordpress.com/it/
6 Educause is a nonprofit association whose mission is to advance higher education through the use of information technology. For more information, see www.educause.edu.
7 Matomo is an open-source web analytics platform, developed by a team of international developers, that runs on a PHP/MySQL webserver. It tracks online visits to one or more websites and displays reports on these visits for analysis. For further information, see the website: https://matomo.org/
8 The American Association of Teachers of Italian is an association of Italian teachers in North America, the United States, and Canada.
9 This is a newsletter aimed at the spread of Italian studies at a British and, more generally, European and worldwide level. The two mailing lists, the American and the British one, are connected.
10 CEFR, the Common European Framework of Reference, is an international standard for describing language ability. See www.coe.int/en/web/common-european-frameworkreference-languages/level-descriptions
11 Donald Norman is an American cognitive psychologist and engineer. He worked as a designer at Apple and founded the Nielsen Norman Group together with Jakob Nielsen.
12 Jakob Nielsen is universally recognized as the leading usability expert. He has an IT background, with a PhD in Interface Design and Computer Science from the Technical University of Denmark. Together with Donald Norman, he founded the consulting firm Nielsen Norman Group, which deals with usability and UX.
13 "Usability is an attribute which measures how easy to use the user-interface is. It encompasses the system effectiveness, pleasantness, ease of use, learning and error tolerance" (Nielsen 1994).

References

Allés-Torrent, Susanna. "Sobre la complejidad de los datos en Humanidades, o cómo traducir las ideas a datos." Revista de Humanidades Digitales 4 (2019), 1–28.
Baker, John. "Updating the Next Generation Digital Learning Environment", Educause Review 52, no. 4 (2017), https://er.educause.edu/articles/2017/7/updating-the-nextgeneration-digital-learning-environment-for-better-student-learning-outcomes (Accessed April 18, 2023).
Benson, Phil. Language Learning Environments: Spatial Perspectives on SLA. Bristol: Multilingual Matters, 2021.
Brown, Malcolm, Dehoney, Joanne and Millichamp, Nancy. "The Next Generation Digital Learning Environment: A Report on Research", 2015, https://library.educause.edu/resources/2015/4/the-next-generation-digital-learning-environment-a-report-on-research (Accessed April 18, 2023).
Ciolfi, Luigina. "Space and Place in Digital Technology Research: A Theoretical Overview", in The SAGE Handbook of Digital Technology Research, edited by Sarah Price, Carey Jewitt and Barry Brown, 159–173. London: Sage, 2015.
Damşa, Crina, Nerland, Monika and Andreadakis, Zacharias E. "An Ecological Perspective on Learner-Constructed Learning Spaces", British Journal of Educational Technology 50, no. 5 (2019), 2075–2089, https://doi.org/10.1111/bjet.12855.
Debski, Robert. "Analysis of Research in CALL (1980–2000) with a Reflection on CALL as an Academic Discipline", ReCALL 15, no. 2 (November 2003), 177–188.
Dourish, Paul. "Re-Space-ing Place: Place and Space Ten Years On", Proceedings of ACM Conference on Computer-Supported Cooperative Work (CSCW 2006), ACM, November 2006, 299–308.
Erickson, Thomas. "From Interface to Interplace: The Spatial Environment as Medium for Interaction", Proceedings of the Conference on Spatial Information Theory, Springer-Verlag, 1993, 391–405.
Fallani, Gerardo. "Il Web come piattaforma. L'e-learning oltre i recinti tecnologici", InSegno.
Italiano L2 in classe, Siena: Becarelli Editore, 2017, 1–2.
Ferguson, Rebecca. "Learning Analytics: Fattori trainanti, sviluppi e sfide", TD Tecnologie Didattiche 22, no. 3 (2014), 138–147.
Ferguson, Rebecca, Clow, Doug, Griffiths, Dai and Brasher, Andrew. "Moving Forward With Learning Analytics: Expert Views", Journal of Learning Analytics 6, no. 3 (2019), 43–59.
Gibson, James J. The Ecological Approach to Visual Perception. Boston, MA: Houghton Mifflin Company, 1979.
Gregory, Derek and Urry, John, eds. Social Relations and Spatial Structures. Basingstoke: Macmillan, 1985.
Järvinen, Joel and Karjaluoto, Heikki. "The Use of Web Analytics for Digital Marketing Performance Measurement", Industrial Marketing Management 50 (2015), 117–127, https://doi.org/10.1016/J.INDMARMAN.2015.04.009.
Long, Phil and Siemens, George. "Penetrating the Fog: Analytics in Learning and Education", Educause Review 46, no. 5 (2011), 31–40.
Low, Setha. Spatializing Culture: The Ethnography of Space and Place. New York: Routledge, 2016.
Massey, Doreen. For Space. London: Sage, 2005.
Mulcahy, Dianne. "Assembling Spaces of Learning 'in' Museums and Schools: A Practice-Based Sociomaterial Perspective", in Spaces of Teaching and Learning: Integrating
Perspectives on Research and Practice, edited by Robert A. Ellis and Peter Goodyear, 13–30. Singapore: Springer, 2018, https://doi.org/10.1007/978-981-10-7155-3_2.
Nielsen, Jakob. Usability Engineering. New York: Morgan Kaufmann, 1994.
Papathoma, Tina, Ferguson, Rebecca and Vogiatzis, Dimitrios. "Accessible Learning, Accessible Analytics: A Virtual Evidence Café", Companion Proceedings 11th International Conference on Learning Analytics & Knowledge (LAK21), 2021, 331–335, http://oro.open.ac.uk/75926/ (Accessed September 14, 2023).
Reinders, Hayo and Lan, Yu-Ju, eds. Special Issue: Big Data in Language Education & Research 25, no. 1 (February 2021), University of Hawaii National Foreign Language Resource Center; Center for Language & Technology, www.lltjournal.org/collection/col_10125_73416.
Rojas Castro, Antonio. "Las Humanidades Digitales: principios, valores y prácticas", Janus: estudios sobre el Siglo de Oro no. 2 (2013), 74–99.
Smith, Neil. Uneven Development: Nature, Capital and the Production of Space. Oxford: Blackwell, 1991.
Soja, Edward. Postmodern Geographies: The Reassertion of Space in Critical Social Theory. London: Verso, 1989.
Tuan, Yi-Fu. Space and Place: The Perspective of Experience. Minneapolis, MN: University of Minnesota Press, 1977.
Van Lier, Leo. The Ecology and Semiotics of Language Learning: A Sociocultural Perspective. Boston, MA: Kluwer Academic Publishers, 2004.
Viberg, Olga, Wasson, Barbara and Kukulska-Hulme, Agnes. "Mobile-Assisted Language Learning Through Learning Analytics for Self-Regulated Learning (MALLAS): A Conceptual Framework", Australasian Journal of Educational Technology 36, no. 6 (2020), 34–52, https://doi.org/10.14742/ajet.6494.

6

Pedagogy and Praxis in Libraries: Natural Language Processing for Non-English Texts

Ian Goodale

There is a notable lack of pedagogical resources for computational text analysis of non-English languages. While materials for teaching and learning about digital humanities are plentiful in English, similar resources for acquiring skills in applying digital methods to non-English texts are relatively rare (Dombrowski 2020). Moreover, using existing tools on non-English languages often presents difficulties due to those tools' focus on English texts (McDonough, Moncla, and van de Camp 2019). Difficulties with non-English NLP include correctly parsing textual elements (e.g., splitting a text into words), working with special characters not commonly present in English text, and detecting and processing language-specific characteristics (such as detecting word stress in a Slavic language). Many of these characteristics are not covered by commonly used programmatic tools, which tend to focus on English. Libraries are often positioned at the intersection of the digital humanities, instruction, and public outreach, and are thus in a unique position to advocate for greater multilingual representation in DH (Kear and Joranson 2018). This chapter will address how projects at the University of Texas at Austin Libraries aimed to remedy both of these deficits in existing digital humanities resources. I will discuss two digital humanities projects I created, which are distinct but work together to accomplish a shared pedagogical goal. First, I created a repository of code annotated with explanatory notes, hosted on GitHub alongside materials for further reference and reading, covering both practical and theoretical aspects of natural language processing (NLP) and computational text analysis. I used this repository as the basis for a workshop on NLP for non-English materials, as well as the bedrock for the second project, which consists of practical applications of the code's methods to various corpora of non-English literature.
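The parsing problems mentioned above are easy to reproduce. In the minimal illustration below (my own example, not drawn from the repositories discussed in this chapter), a naive ASCII-based tokenizer drops Cyrillic words entirely, while a Unicode-aware pattern keeps them:

```python
import re

# Opening line of Pushkin's Eugene Onegin.
text = "Мой дядя самых честных правил, когда не в шутку занемог..."

# Naive pattern: only ASCII letters, so it finds nothing in a Russian sentence.
ascii_tokens = re.findall(r"[A-Za-z]+", text)

# Unicode-aware pattern: \w matches Cyrillic letters in Python 3.
unicode_tokens = re.findall(r"\w+", text)

print(ascii_tokens)         # -> []
print(len(unicode_tokens))  # -> 10
```

Even this fix only handles word segmentation; characteristics like stress placement require language-specific resources that general-purpose tools rarely provide.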
These projects complement the workshop materials and serve as case studies for real-world applications of the technologies. These practical repositories of code analyzing non-English corpora (in French and Russian) are also available on GitHub. The projects aim to represent a diversity of both literary genres and alphabets, incorporating both the Latin and Cyrillic scripts. This chapter will address how these repositories contribute to the educational mission discussed earlier and how the approaches they take to their texts and

DOI: 10.4324/9781003393696-9


analysis can contribute to the broader field of non-English digital humanities. It will also show how the educational and practical missions of the materials intersect and enhance one another, and discuss the complications, challenges, and successes of seeking to popularize and create meaningful contributions to non-English NLP materials.

6.1. Survey of Existing Materials

The current academic literature surrounding digital humanities methodologies for non-English texts, particularly as they relate to natural language processing in non-English languages, reflects the scarcity both of pedagogical materials for learning and teaching non-English digital humanities and of DH projects undertaken on non-English materials. As Quinn Dombrowski notes in her presentation What's a 'word': Multilingual DH and the English Default:

If you're an Anglophone scholar in an Anglophone country, you live in a sort of magical bubble where you may never run across friction in your research workflows, because things just . . . work. . . . Words are such a basic concept in our tools and methods. . . . They all are based on some assumption of what a "word" looks like—particularly, that it's separated by spaces, and mostly invariant in form. This is a terrible assumption if you're working on many languages. (2020)

For scholars and practitioners in the West who work with non-English materials as part of their DH practice, such assumptions about the nature of the texts they are working with are common roadblocks to furthering their research, particularly when attempting to learn the tools they need to conduct their work. The often vastly different mechanics of non-English languages (such as noun declensions and tonal information) are rarely addressed in pedagogical materials aimed at students of DH in the West, leaving students to learn technologies and workflows designed to work with English rather than with their own chosen languages (McDonough, Moncla, and van de Camp 2019). The Anglocentrism of DH practice in the West is further highlighted by the lack of published DH academic literature in languages other than English.
As noted by Qing Wang in her bibliometric survey of digital humanities research published between January 1968 and March 2017, 88.67 percent of the data found in the Web of Science database was published in English. Of the 16 languages encountered, the next closest to English was Spanish, with only 3.74 percent of the literature (2018). While there are some DH journals publishing in languages other than English, such as the Francophone Humanités numériques (Tessier and Bourgatte 2022), this domination of the academic publishing landscape further highlights how materials on the digital humanities are far more widely available in English than in other languages, reinforcing the existing Anglophone status quo both in the West and globally. As Spence and

Pedagogy and Praxis in Libraries 105 Brandao note, there is a need for the digital humanities to more actively address “challenges around global dynamics of digital multilingualism, transcultural exchange and geodiversity” (2021), and the trend in the academic literature on DH to primarily publish in English only reinforces this point. The Anglocentrism of the field and its literature can be viewed not only in technical terms—with the literature and pedagogical materials being inaccessible or not fully appropriate for those working with non-English texts—but in terms of expanding the possibilities and attitudes of the institutions in which DH thrives. While North America has the most DH centers in the world, with Europe close behind, there have been mapping projects done that “highlighted the divisions that exist within the field by illuminating (and obliterating) different sets of centres and scholars” (Tessier and Bourgatte 2022). Gimena del Rio Riande highlights the questions of power and inclusivity surrounding digital humanities infrastructures, especially as they relate to those who do not operate within North America, noting that “open, collaborative, and decentralized infrastructures seem to be the best tools for building community setups and knowledge without monopolies” (2021). While the Alliance of Digital Humanities Organizations has many member organizations from varying countries and is a strong advocate for a truly inclusive and multilingual digital humanities landscape (2023), there is still much work to be done to make DH truly inclusive and international. As Fiormonte notes, “[E]ven though so much effort has been expended in making existing DH associations and organizations more international, the impression remains the same: a solid AngloAmerican stem onto which several individuals of mostly European countries are grafted” (Fiormonte 2016). 
These divisions might be bridged, in part, by a more linguistically inclusive approach to pedagogy within DH centers, advocating for an explicitly globally minded approach to language processing and DH methodologies that intentionally pushes back against existing Anglocentric cultures and pedagogical materials. This approach can help to create a cultural shift among practitioners, students, and teachers of DH both within the academy (DH centers included) and outside of it, increasing the opportunity for collaboration and communication with other DH practitioners and centers internationally. As Galina Russell notes:

[A] forefront issue for inclusiveness is language. In this sense, DH is currently a reflection of the way academia works in general with English as the predominant language and with a few countries having a far larger representation and research output. (2014)

This issue of inclusivity is being addressed by the international DH community through diverse means, including writing and project work. My own experience with a previous project, my PyGallica Python wrapper (Goodale 2022b), is a small example of how creating multilingual DH tools—and pedagogical materials supporting them—has allowed for greater international collaboration.

Ian Goodale

Staff and researchers at the National Library of France and the University of Paris have contacted me about the tool and the tutorials I created, which were made available as open-access materials with a copyleft license through my GitHub account. Creating and publishing these materials helped forge this connection with researchers working outside of North America that I would not otherwise have made. That connection, in turn, has broadened my university’s reach and contacts outside of North America and has helped to highlight the importance of non-English DH as it can be practiced and advocated for within the university’s libraries. Scholars writing about decolonizing DH—broadening the field to move away from the Western-centric paradigms of the digital humanities—can illuminate ways in which the inclusion of multilingual DH can be viewed as a step toward an opening of the academy and its DH centers more generally, treating the inclusion of non-English languages not just as a step forward technically but as an advance in fostering greater social awareness and inclusion. Kajsa Hallberg Adu writes about designing assignments for students using DH pedagogy that

empower students with knowledge and tools that give them agency, moving away from a passive absorption of texts written in the Global North, modelling inclusive lifelong learning built around our local experience, celebrating and using diversity rather than problematizing it, hence decolonizing the classroom. (2021)

I believe that broadening the DH literature and practice in the Global North—particularly in the Anglophone world where the majority of DH centers are located—can help to push for greater social justice and decolonization within the academy and the practice of DH.
Widening our linguistic reach and explicitly advocating for researchers and students working with non-English languages can be a crucial role we fill in a larger struggle to make our curricula and DH practice truly global and inclusive, moving away from existing entrenched structures of thought that confine us to Anglocentric work and toward a thoughtfully global DH. The Anglocentric perspective in DH might also be understood as linked to upholding existing canons of literature frequently studied and highlighted by DH. As Laura Estill writes, “[D]espite the potential for democratization or canon expansion, digital projects too often reify canon, even when they attempt to subvert it” (2019). This tendency does not apply only to projects in English, of course, and it is one that the non-English NLP project at UT attempts to address and change. Natalia Zorrilla writes of the “importance of the innovative projects that have been recently created in the field of the digital humanities, which aim to mitigate and to counter” the exclusion of modern women philosophers from the canonical status afforded to other thinkers and writers (2022). DH is uniquely poised to make clear, widely accessible projects that can be easily and widely disseminated, openly edited, and reproduced by others. The digital media we work in as DH practitioners are increasingly easy to self-publish and distribute and thus can have a broader reach both in terms of the audiences the projects reach and in terms of the geographical regions in which the projects can be viewed. By choosing to work with both non-English and less widely studied and canonized authors, we can advocate for a cultural shift that moves away from largely privileging Anglophone and canonical authors. Other chapters in this volume also address the need for practical efforts to broaden the inclusivity of DH. Isasi et al.’s chapter “A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian” takes the example of the Programming Historian, a well-known open resource for self-directed digital literacy education, and explores how lessons on the site developed in three non-English languages (Spanish, French, and Portuguese) helped speak to a broader, global audience. Tasovac et al.’s chapter “Bridging the Gap Between Digital Humanities and Natural Language Processing: A Pedagogical Imperative for Humanistic NLP” argues in favor of taking practical action to advance the diversity of DH, taking an understanding of the structural and representational biases in NLP tools and methods and applying it to solutions that can make practical inroads in making NLP more linguistically inclusive. My own work builds on similar themes, applying practical approaches to DH and NLP instruction while attempting to build up a larger and more inclusive community around DH practice.

6.2. Overview of the Non-English NLP Tutorial
The Non-English NLP Tutorial and its related projects—NLP on Khlebnikov, NLP on Daumal, and Analyzing Péguy—aim to contribute to the broadening of DH pedagogical and practical materials beyond English and beyond the canon, making NLP methods and practical results from implementing these methods widely accessible and reproducible by others (Goodale 2021a, 2021b, 2021c). These projects present practical applications of text analysis on one Russian and two French authors, including both the products of the analysis (such as visualizations and trained NLP models) and the code used to produce them.

Figure 6.1 One of the Jupyter notebooks for the Non-English NLP Tutorial, with instructional text and code instructing users on how to perform named entity recognition and part-of-speech tagging on their non-English text.

Working within the university libraries was a very important component of this project. Libraries on many university campuses in North America serve as campus hubs of DH (Kear and Joranson 2018), and the UT-Austin Libraries have likewise been instrumental in promoting DH at the University of Texas at Austin. The UT-Austin Libraries take many proactive steps to promote and coordinate DH projects and events on campus, such as managing a listserv; maintaining a LibGuide on projects, events, and other DH-related resources and happenings on campus; and organizing multiple series of talks and events focusing on DH at the university (Guzman 2022). The UT-Austin Collections Portal also features many digitized collections coordinated by librarians and staff (UT Austin Libraries 2022). UT-Austin librarians are frequent collaborators on or managers of DH projects on campus and are at the forefront of planning and publicizing many of the DH events that occur at the university. This institutional support for DH is underscored by the use of UT-Austin Libraries spaces on campus for DH pedagogy and workshops, which often serve as a focal point from which existing DH efforts are promoted and hosted in the Libraries. The educational and public service mission of the Libraries is thus intertwined with their work to promote and foster a community of DH practitioners on campus. This work takes many forms—from interactions between patrons and librarians to managing large-scale projects over the course of many years—but is always tied to the mission of public outreach and support.
The Non-English NLP Tutorial (Goodale 2022a) and its accompanying repositories are an example of this hybrid of practical work that is also geared toward public interaction, collaboration, and pedagogy. The tutorial aspect of the project was a close collaboration with Graduate Student Assistant Madeline Goebel. While I wrote the majority of the code for the project, Madeline used the project as a way to deepen her knowledge of Python and contribute some code to the final repository. She also researched and compiled the bibliography that is featured in this portion of the project and helped with testing and debugging the code. This close collaboration with graduate students is an essential part of the UT librarians’ work on DH projects, and efforts are always made to ensure that these graduate students are able to work in ways that are beneficial to them and that they are credited for their labor. The code in the first part of the project is available as Jupyter notebooks, allowing users to follow along and learn from the code either in the browser or on their own Jupyter server. The code walks users through the process of cleaning a text, tokenization, named entity recognition and lemmatization, sentiment analysis, visualizations, and topic modeling. French texts are used as examples for the code, but we took care to point out how the tutorial can be adapted for use with other non-English languages. The technologies chosen for the text processing in the tutorial—NLTK, spaCy, and Stanza—were chosen for the ease with which their workflows and pipelines can be adapted for use with many languages other than English. Indeed, spaCy supports 65 languages in addition to multi-language texts (spaCy 2022), with similar levels of support in Stanza (2022). NLTK, while not a pipeline-based framework like Stanza and spaCy, also supports use with multiple languages (NLTK 2022). All of these libraries are free and open source, so they are openly available to anyone with access to a computer and the internet. This reduces barriers to accessing, learning from, and modifying our materials, as there are no paywalls to lock out users unable to afford paid software.

The content is hosted on GitHub to allow for ease of access and collaboration. GitHub is the main hub for open-source software online, with over 200 million repositories and over 80 million developers using the site (GitHub 2022). Publishing our code there allows it to be openly accessed in an environment that is well established and well understood by existing developers, and that is free and easy to learn for those who have not yet become familiar with GitHub or software development. This also allows for easy integration into workshops and other pedagogical environments, as students can be introduced to the GitHub site and learn how to navigate a repository in the process of learning about our specific project. Madeline also created a LibGuide to promote the project in another format, mirroring the content available on GitHub and linking to the GitHub project page (Goebel 2022). This further broadens the scope of users the project reaches, contributing to the pedagogical mission of the work.

The second aspect of the project, consisting of the three repositories of code, serves as a practical example of how to process different types of text using the methods outlined in the tutorial portion of the project.
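In the spirit of the tutorial’s opening steps, text cleaning and tokenization can be sketched with the standard library alone (the tutorial itself uses NLTK, spaCy, and Stanza; the tiny stopword list below is a placeholder for their much larger per-language lists):

```python
import re
import string

# A toy French stopword list for illustration only; the tutorial draws on
# NLTK's and spaCy's much larger, per-language stopword lists.
STOPWORDS = {"le", "la", "les", "de", "des", "et", "un", "une"}

def clean_and_tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    return [t for t in tokens if t and t not in STOPWORDS]

print(clean_and_tokenize("Le poète et la montagne : une lecture des textes."))
# -> ['poète', 'montagne', 'lecture', 'textes']
```

Swapping this sketch for spaCy’s French pipeline replaces the whitespace split with linguistically informed tokenization and adds lemmas, part-of-speech tags, and named entities.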
The repository on the author René Daumal continues the tutorial’s emphasis on French prose, although, like the tutorial, it can be easily adapted for other languages with minor adjustments. The repository focusing on the poetry of Charles Péguy concentrates on French poetry, adjusting various parts of the code to handle segmentation of the text by the author’s intended line breaks rather than by parsing prose. The third repository, which focuses on works by the Russian author Velimir Khlebnikov, uses Russian text as a vehicle to highlight how the code can be adjusted to handle non-Roman scripts. While the code examples strive to be as language-agnostic as possible, the nature of the languages analyzed does mean that the code would work best for other languages that read left to right. The code reads the texts it analyzes using UTF-8, a variable-length character encoding standard commonly used for processing text, which allows the scripts to be used with the many languages that can be written in this encoding. Of course, the code will only work for languages that use an alphabet system at least somewhat similar to those of French and Russian; character-based writing systems would require a separate approach. Overall, these repositories are intended both to highlight these pedagogical focuses, serving as practical examples that students can emulate and draw from in their own work, and to act as potential launching points for future computational research on these authors. The results of the code are not merely examples for teaching, then, but have value in their own right as computational data on authors often overlooked by DH methodologies.
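The point about UTF-8 being variable-length is easy to verify from Python itself; this short illustration (the filename is a hypothetical example, not a file from the repositories) shows the encoding at work across scripts:

```python
# UTF-8 is variable-length: ASCII letters encode to 1 byte, accented
# Latin and Cyrillic letters to 2, and many Chinese characters to 3.
for ch in ["e", "é", "ж", "图"]:
    print(ch, len(ch.encode("utf-8")))

# Reading a text file the same way, regardless of script
# (the filename here is a hypothetical example):
# with open("corpus.txt", encoding="utf-8") as f:
#     text = f.read()
```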


Figure 6.2 The GitHub repository of the Analyzing Péguy project, showing how the code and folders are organized.

All of these additional repositories apply the ideas from the tutorial in a hands-on, thorough manner that both provides useful resources for practitioners of NLP and serves as an example of how the tutorial’s code might be expanded and altered to suit individual texts. Pickle files containing tokenized and pre-processed versions of the texts were created using spaCy, allowing users to download and begin working with the linguistic features of the works without writing the code or processing the texts themselves. (Pickle files are used to serialize Python object structures—in this case linguistic data—allowing those structures to be loaded back into a Python object from the file.) This linguistic data could prove very useful as corpora for researchers and students. Other processes were also applied to the texts, with the example code made available in the repositories. Code included in the repositories shows how to pre-process the texts by removing stopwords and punctuation and tokenizing the texts by lines, sentences, and words. It also contains examples of how to process the text using spaCy and Stanza (with two different spaCy models applied for greater nuance), allowing users to compare and contrast the results of each library and model. Furthermore, the repositories contain unique features suited to the individual texts and authors being analyzed. The Lix and Rix readability tests were applied to the corpus of works by Péguy, as was a custom algorithm I wrote that detects the number of syllables in each line of his poetry. For the repository focusing on Daumal, a demonstration of how to create a network map of the characters is included, allowing users to map how characters intersect and interact across the chapters of his novel Mount Analogue. The repository focusing on Khlebnikov contains an algorithm to detect where the stresses fall in his poetry, applying a Russian-specific NLP methodology to his work that can help elucidate the rhythms and musicality of his writing. This focus on the particular stress patterns of the Russian language complements the incorporation of code that handles the particularities of the Cyrillic alphabet, providing an example of a DH project that works outside the boundaries of languages that use the Latin alphabet.

The repository focusing on the work of Péguy is a particularly in-depth example of how these projects can serve as examples for further work on non-English texts. The Lix and Rix readability tests mentioned earlier are language-agnostic and were applied using custom algorithms written in Python that can be replicated and applied to texts in other non-English languages. The algorithms are also alphabet-agnostic and can work on any language whose characters can be written in the UTF-8 format. The other custom algorithm in the repository, designed to count the syllables in a given line of Péguy’s poems, can likewise be applied to other French texts to determine syllable lengths of a given subdivision of the text (e.g., sentences, lines, or paragraphs). This algorithm works by comparing the words to a large list of words with pre-counted syllables and, if a word is not found in that list, by using an algorithm to guess the number of syllables in the word. This repository, like the others that deal with works by Khlebnikov and Daumal, aims to serve as both a product of DH and NLP work and a pedagogical tool that can teach others as they work with non-English textual data. All the repositories contain numerous examples of the code’s end products, such as the visualizations and analysis data gathered from running computational models and custom algorithms on the texts. The raw analysis data from these code processes are also included in CSV format, including author-specific data and the information used to create the visualizations.
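The Lix and Rix formulas are standard and genuinely language-agnostic, and the dictionary-plus-fallback approach to syllable counting can be sketched briefly. The code below is an illustrative reimplementation, not the repository’s own: it treats words longer than six letters as “long,” and its tiny syllable dictionary and vowel-group fallback stand in for the large pre-counted word list described above.

```python
import re

def lix(text):
    """Lix readability: words per sentence plus percentage of long words (>6 letters)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

def rix(text):
    """Rix readability: long words (>6 letters) per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    long_words = [w for w in re.findall(r"\w+", text) if len(w) > 6]
    return len(long_words) / len(sentences)

# Placeholder dictionary of pre-counted syllables; the repository uses a
# much larger French word list.
SYLLABLES = {"heureux": 2, "les": 1, "champs": 1}

def count_syllables(word):
    """Dictionary lookup first; fall back to counting vowel groups as a guess."""
    word = word.lower()
    if word in SYLLABLES:
        return SYLLABLES[word]
    return max(1, len(re.findall(r"[aeiouyàâéèêëîïôùûü]+", word)))

print(rix("Le chat dort. Les grandes montagnes dominent."))  # 1.5
```

Because both tests operate only on sentence boundaries and word lengths, they run unchanged on any UTF-8 text whose words are space-delimited.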
The image files generated by the visualization code are also provided for users to review and use in their own work. This project combines the pedagogical and practical components of DH and aims to contribute to the multilingual DH landscape by promoting a greater embrace of non-English DH projects and methodologies, educating students of DH on how to use non-English texts in their own work, and creating code products that are openly available online. Creating more resources for multilingual pedagogical materials can aid not only in broadening the reach of the work done in the classroom for students and researchers who want to work with materials other than English but also in extending the reach of these materials to a global audience working with non-English texts. This, in turn, increases the potential for collaboration between institutions around the globe. Multilingual research projects using these methods have the potential to contribute greatly not just to practical DH deliverables but to the expansion of the Anglocentric research landscape as a whole, working toward a more inclusive and global community of researchers. By attempting to broaden the reach of institutionally supported research and DH work, we can, as a community of DH practitioners working within our respective institutional cultures, aim to make the culture of DH more inclusive, just, and truly international in scope.

6.3. Conclusion

The digital humanities is a field with great potential to effect change in both its scholarship and its culture. The preponderance of open-access tools, the ease with which such tools can be shared, distributed, and collaborated on, and the way technology is built into the fabric of DH all make a more deeply collaborative culture attainable through the work of DH practitioners. This type of culture can be created by advocating for greater multilingual representation and practice within the projects, centers, and pedagogy of DH practitioners. Increasing non-English diversity and representation within DH is practical and beneficial on multiple levels. It can increase engagement with underrepresented literatures and cultures within the digital humanities literature, can broaden outreach in a pedagogical setting to students and researchers who work with such languages, and can generally expand the culture of both DH and the academies and centers in which it is often practiced, widening the scope and inclusivity of the work studied and accomplished by DH workers and scholars. Pedagogy and practice can be united through teaching and practical projects to help achieve these goals. DH may still be Anglocentric in many ways, but we can work to broaden its culture and practice through our own teaching, research, and practical projects. A multilingual, open, and inclusive future is ours to create.

References

“ADHO Homepage.” Alliance of Digital Humanities Organizations. Accessed April 6, 2023. https://adho.org/.
del Rio Riande, Gimena. “Digital Humanities and Visible and Invisible Infrastructures.” In Global Debates in the Digital Humanities. Minneapolis: University of Minnesota Press, 2022. muse.jhu.edu/book/100081.
Dombrowski, Quinn. “What’s a ‘Word’: Multilingual DH and the English Default,” October 15, 2020. https://quinndombrowski.com/blog/2020/10/15/whats-word-multilingual-dh-and-english-default/.
Estill, Laura. “Digital Humanities’ Shakespeare Problem.” Humanities 8, no. 1 (March 4, 2019): 45. https://doi.org/10.3390/h8010045.
Fiormonte, Domenico. “Toward a Cultural Critique of Digital Humanities.” In Debates in the Digital Humanities 2016. Minneapolis, MN: University of Minnesota Press, 2016.
GitHub. “Build Software Better, Together.” Accessed June 30, 2022. www.github.com.
Goebel, Madeline. “LibGuides: Natural Language Processing for Non-English Text: Home.” Accessed June 30, 2022. https://guides.lib.utexas.edu/c.php?g=1107463&p=8074740.
Goodale, Ian. “Analyzing-Péguy.” HTML, December 30, 2021a. https://github.com/ian-nai/Analyzing-Peguy.
Goodale, Ian. “NLP-on-Daumal.” HTML, October 6, 2021b. https://github.com/ian-nai/NLP-on-Daumal.
Goodale, Ian. “NLP-on-Khlebnikov.” HTML, October 14, 2021c. https://github.com/ian-nai/NLP-on-Khlebnikov.
Goodale, Ian. “Ian-Nai/Non-English-NLP-Tutorial: Tutorials on Processing and Analyzing Text Using NLP Methods.” Accessed June 30, 2022a. https://github.com/ian-nai/Non-English-NLP-Tutorial.
Goodale, Ian. “PyGallica.” Python, June 9, 2022b. https://github.com/ian-nai/PyGallica.
Guzman, Allyssa. “LibGuides: Digital Humanities Tools and Resources: DH at UT Austin.” Accessed June 30, 2022. https://guides.lib.utexas.edu/digitalhumanities/dh-at-ut-austin.
Hallberg Adu, Kajsa. “The Promise of Digital Humanities Pedagogy: Decolonizing a Diverse Classroom in Ghana.” Digital Scholarship in the Humanities 36, no. Supplement_1 (June 25, 2021): i37–42. https://doi.org/10.1093/llc/fqaa023.
Kear, Robin, and Kate Joranson. “A View on Libraries and Librarians in Digital Humanities.” In Digital Humanities, Libraries, and Partnerships, edited by Robin Kear and Kate Joranson, xix–xxiii. Chandos Publishing, 2018. https://doi.org/10.1016/B978-0-08-102023-4.00023-9.
McDonough, Katherine, Ludovic Moncla, and Matje van de Camp. “Named Entity Recognition Goes to Old Regime France: Geographic Text Analysis for Early Modern French Corpora.” International Journal of Geographical Information Science 33, no. 12 (December 2019): 2498–2522. https://doi.org/10.1080/13658816.2019.1620235.
NLTK. “Nltk/Nltk: NLTK Source.” Accessed June 30, 2022. https://github.com/nltk/nltk.
Russell, Galina. “Geographical and Linguistic Diversity in the Digital Humanities.” Literary and Linguistic Computing 29, no. 3 (2014): 307–316. https://doi.org/10.1093/llc/fqu005.
spaCy. “Models & Languages · spaCy Usage Documentation.” Accessed June 30, 2022. https://spacy.io/usage/models.
Spence, Paul Joseph, and Renata Brandao. “Towards Language Sensitivity and Diversity in the Digital Humanities.” Digital Studies/Le Champ Numérique (DSCN) Open Issue 2021 11, no. 1 (2021). https://doi.org/10.16995/dscn.8098.
Stanza. “Available Models & Languages.” Accessed June 30, 2022. https://stanfordnlp.github.io/stanza/available_models.html.
Tessier, Laurent, and Michaël Bourgatte. “Introduction. Enseigner et apprendre les humanités numériques: des savoirs recomposés?” [Introduction. Teaching and Learning the Digital Humanities: Recomposed Knowledge?] Humanités numériques, no. 5 (June 1, 2022). https://doi.org/10.4000/revuehn.2999.
“University of Texas Libraries Collections.” Accessed June 30, 2022. https://collections.lib.utexas.edu/.
Wang, Qing. “Distribution Features and Intellectual Structures of Digital Humanities: A Bibliometric Analysis.” Journal of Documentation 74, no. 1 (2018): 223–246. https://doi.org/10.1108/JD-05-2017-0076.
Zorrilla, Natalia. “The Exclusion of Early Modern Women Philosophers from the Canon: Causes and Counteractive Strategies from the Digital Humanities.” Hypatia 37, no. 1 (2022): 177–186. https://doi.org/10.1017/hyp.2021.67.

7 Bridging the Gap Between Digital Humanities and Natural Language Processing: A Pedagogical Imperative for Humanistic NLP

Toma Tasovac, Nick Budak, Natalia Ermolaev, Andrew Janco, and David Lassner

It can be argued that one of the great contributions of digital humanities (DH) as a field has been the creation of a vast digital library of highly curated textual editions. These meticulously—and mostly manually—annotated resources, often encoded following the Guidelines of the Text Encoding Initiative (TEI), have emerged from and mirror many essential aspects of the philological tradition as a theoretical and methodological framework for the critical analysis and interpretation of texts. This tradition emphasises the importance of studying texts in their historical and cultural contexts, and the need to consider factors such as the transmission and history of a text, variations in manuscript witnesses, and the influence of earlier texts on a work. While replete with bibliographic, structural, literary, and historical information about texts and their variants, authorial and editorial interventions, contexts and chains of transmission, the majority of TEI-encoded texts are still not necessarily annotated linguistically: they are rarely lemmatised; they rarely include information on parts of speech, let alone syntactic relations. This is, we would claim, both understandable and somewhat curious at the same time. It is understandable because humanists are usually not trained as linguists, and because of the sheer labour required to provide linguistic annotations. Even if the object of research, or the subject of a digital edition, is a single novel or a manuscript, most humanities researchers will not have the capacity to manually annotate linguistic properties for each and every word in their texts. Yet this is also a rather curious disciplinary blindspot for text-based digital humanities because linguistically annotated texts can vastly improve our understanding of and engagement with any kind of text, regardless of the genre, historical period in which it was produced or the medium in which it was conceived. 
NLP methods ask us to foreground language and contemplate how we, as humans, read texts by having to explain our process to machines. Linguistic annotations can lead to better search and retrieval in digital editions; more meaningful, semantically informed statistical data analysis; and overall better pattern recognition. Furthermore, advances in NLP have made adding annotation to text significantly easier, especially at scale.

DOI: 10.4324/9781003393696-10

Today, NLP can be used for automatically (or semi-automatically) tagging parts of speech, determining sentence-level syntactic structure (dependency parsing), identifying names (named entity recognition, NER), and performing sentiment analysis, all with relative ease. A key challenge remains overcoming this disciplinary blindspot. Most digital humanists currently lack the means to transform NLP models into useful research instruments. As David Bamman has documented, models trained on newspaper text are significantly less effective at reading Shakespeare or Dante (Bamman 2017). This is not a failure of the existing models but rather evidence that humanists need to craft the NLP tools they use in research. This is especially true for those who work with data sources in multiple or non-dominant languages, time periods, domains, cultures, or scripts. As is well known, NLP tools that are designed for a handful of languages (English being the dominant one) leave out entire cultures and communities, which has troubling social, economic, and cultural implications. If humanities computing follows suit and only produces research based on dominant, well-resourced languages, we will find ourselves perpetuating structural and representational biases in our research. So the question facing digital humanities is twofold: what can we do to make NLP tools and assets better suited to humanists, and, at the same time, how can the humanistic perspective help overcome the Anglocentric status quo of NLP more generally?
This chapter describes how we addressed both of these questions through the pedagogical design of a series of workshops under the banner of the National Endowment for the Humanities (NEH) Institute for Advanced Topics in the Digital Humanities, ‘New Languages for NLP: Building Linguistic Diversity in the Digital Humanities’.1 We knew, on the one hand, that we couldn’t expect the average humanist to retrain as a computer scientist and develop their own machine-learning methods and tools from scratch. On the other hand, we also knew it would be unrealistic to expect NLP developers to create bespoke solutions for the plethora of humanistic datasets and domains. If we want to enable humanities research on domain-specific (for instance, literary) datasets and/or language varieties (historical or dialectal) that have no robust NLP support, we must provide scholars with comprehensive learning opportunities: the conceptual foundation for engaging with NLP tools, as well as the practical skills and workflows to independently collect and annotate data, train and evaluate language models, and experiment with transfer learning in their own work. This is easier said than done because, for one, most humanists come to quantitative, statistical, and machine-learning methods with a healthy dose of scepticism: humanists are trained to recognise the importance of context, that meaning is irreducibly complex, that texts are often inherently contradictory, and that there is no such thing as an ideology-free space. Humanistic approaches to text, knowledge, and ethics seem at odds with the monolithic ‘black-box’ nature of many emerging technologies (see Bender et al. 2021; Bamman 2017; del Rio Riande 2022; Klein 2022). This obvious disciplinary and epistemological gap does not, however, require humanists to abandon their traditional heuristics in order
to apply NLP methods. On the contrary, we believe that humanists should be encouraged to apply their humanistic lens to NLP—to interrogate its complexity and grapple with its limitations—while accepting NLP as an ancillary, yet essential, methodology that can facilitate the processing and enrichment of texts for the benefit of qualitative research. Our approach was based on the concept of Humanistic NLP, which we define as an area of applied, translational research aimed at developing theories, tools, and processes that enable the meaningful use of NLP methods by humanities scholars in specific research scenarios. To make Humanistic NLP accessible to a diverse group of students with different methodological and technological backgrounds, we also established four high-level principles of teaching Humanistic NLP, which guided our pedagogical choices throughout the workshops:

1. highlight the useful role that linguistics, in general, and corpus linguistics and NLP, in particular, can play as ancillary methods in computationally aided humanistic research;
2. focus on applied NLP workflows (corpus creation, annotation, model training, and evaluation) rather than machine-learning theory as the primary learning objective;
3. contextualise NLP workflows against the background of traditional humanistic practices (annotation as the cultural practice of textual enrichment, concordancing, lexicography, etc.); and
4. prioritise ethical data curation and management as a core component of humanistic engagement with NLP.

In the following sections, we will analyse how these high-level principles translated into day-to-day pedagogical practice, what epistemological and technical challenges we encountered along the way, and what practical solutions we implemented in order to make our students’ learning process as effective as possible.
We realise that substantive discussion across the disciplinary boundaries of the humanities and data and computer science may at times seem out of reach. But Humanistic NLP gives us a chance to carve out a practical, pedagogical approach to transdisciplinary collaboration, one that builds on humanists’ expertise in text and language while keeping an eye on the benefits of NLP methods and best practices for humanities research.

7.1. Institute Background

The New Languages for NLP project was launched in September 2020. Our institutional home was the Center for Digital Humanities at Princeton University, and we partnered with DARIAH (the European Digital Research Infrastructure for the Arts and Humanities) and the US Library of Congress. We assembled a team of instructors who were experienced in computational humanities, machine learning, and project and data management to design and run our 12-month program.

Our aim was to draw in humanists from a variety of fields, working on a broad range of under-resourced languages or specific domains, and help them overcome the barriers to using NLP in their research. By employing NLP tools and methods, they would advance their own research agendas while also increasing linguistic diversity in the field of NLP by contributing datasets and models in new languages.

When we put out our call for proposals in January 2021, we were overwhelmed by approximately 90 applications from scholars all over the world. After a rigorous selection process, we chose 11 teams working on the following languages: Classical Arabic, Efik, Kanbun, Kannada, Old Chinese, Nineteenth-century Russian, Ottoman Turkish, Quechua, Tigrinya, Yiddish, and Yoruba. The scholars came from 16 institutions, from fields such as comparative literature, history, literary studies, and linguistics, and were at various stages of their careers (graduate students, postdocs, junior faculty, senior faculty, professional technologists). Technical experience was not required. We held a five-day workshop focused on annotation (June 2021), a five-day workshop focused on model training (January 2022), and a two-day symposium where each team presented their results (May 2022).2 Throughout the 12 months of the project, the entire group met monthly for two hours for help sessions and to track progress. Due to the ongoing COVID-19 pandemic, most workshop sessions were held virtually, except for the final symposium in May 2022, which was held onsite at Princeton.

From start to finish, the pedagogical design of the New Languages for NLP Institute was rooted in the central tenets of humanistic inquiry—the importance of contextualisation, the practice of close, careful analysis, and critical thinking—as well as the values we hold dear in digital humanities, those of collaboration and communication.
Our intention was not to run a simple ‘skills workshop’ or to teach ‘efficiency’ or ‘optimisation’ as a way of achieving goals. Rather, we tried to integrate reflection, attention, and critical analysis into each step of our process. We started our Institute with a seminar-style discussion of some key topics in NLP and the humanities, with a special focus on issues of ethics, bias, and fairness. Such conversations must be central to any work with AI, and raising the issues early helped our participants frame their subsequent NLP work with a critical eye. For example, ‘Lessons from Archives’ by Eun Seo Jo and Timnit Gebru (2020) highlighted how data collection and annotation are at the core of ensuring fair representation and grappling with bias; the famous ‘On the Dangers of Stochastic Parrots’ article (Bender et al. 2021) served as an opportunity to discuss how the humanistic view may help forge a path toward mitigating the risks posed by emerging technologies such as LLMs. We invited three leading digital humanists—Gimena del Rio Riande, Lauren Klein, and Ted Underwood—to share their thoughts and asked two of the Parrots co-authors—Angelina McMillan-Major and Meg Mitchell—to respond during a live-streamed Zoom roundtable in late October 2021.3

One of our main Institute goals was to give our participants confidence in their knowledge of NLP and machine learning so that they felt empowered to make
informed choices in each step of their project. For most, NLP was a new methodology; indeed, some had never done any computer programming at all. Focusing on NLP workflows—that is, the sequences of steps that describe how to perform a particular task within the research data lifecycle (see Barbot et al. 2022)—allowed us to contextualise some of these ‘new’ methods within a much longer trajectory of inquiry into the cultural record. While our participants worked with a very specific type of technology-enabled approach to corpus building and annotation, we framed it as a practice deeply ingrained in human culture for centuries, from marking systems in ancient grammatical treatises on Greek dialects, through the elaborate systems of glosses in mediaeval codices, all the way to the scholarly apparatus of modern print texts. While annotation contexts and technologies have changed over time, the need to enrich text remains constant, because every generation, regardless of the medium at its disposal, struggles with the inherent ambiguity, complexity, and constant transformation of the language(s) we use.

7.2. Focus on Annotation

After introducing workshop participants to the principles of linguistic annotation for NLP, we brought them up to speed with the Universal Dependencies (UD) Treebank project,4 which maintains collections of annotated sentences in which words are labelled with their grammatical relationships to each other. UD treebanks are designed to be consistent across different languages and are frequently used in NLP tasks such as parsing and morphosyntactic analysis. They are often created by large teams of linguists from several research universities. UD treebanks exist for many languages (English, Spanish, French, German, Chinese, and Arabic, among others); however, not all languages have one, and some languages have only a limited number of treebanks available. The absence of an existing UD treebank presents a significant barrier to scholars working with historical and under-resourced languages. The question we asked was whether it would be possible to train models for new languages in the absence of existing linguistic data. For some of our teams, existing resources of high-quality linguistically annotated text that help propel NLP work (such as the UD treebanks) proved to be an extremely helpful starting point. But others, whose languages are not represented in these resources, had to create their corpora from scratch. One of the working hypotheses of our Institute was that, with the right set of tools, a small team of scholars could annotate a sufficient amount of text for model training. A key constraint here was time: annotation is extremely time-consuming. We experimented with two approaches to automate or semi-automate the annotation process: bulk labelling and auto-recommenders. Bulk labelling is based on the simple idea that we save significant labour by reducing repetition: we identify the most frequently occurring tokens in the corpus and annotate each of them once.
That data can then be used to update every occurrence of that token in the corpus. This method allowed our participants to annotate a significant portion of the corpus (50–70%) with annotations for the 50–100 most common tokens. To facilitate this work, we created a web application called Cadet and a Jupyter notebook version with the same code. Cadet’s output is a tokenised version of the text with annotations in the CoNLL-U (Conference on Computational Natural Language Learning—Universal Dependencies) format, which is used to represent dependency trees. This data could then be loaded into INCEpTION, a collaborative annotation tool developed by the Ubiquitous Knowledge Processing Lab at the Technical University of Darmstadt, which allowed us to visualise the data and to add or correct annotations. Several of our language teams used INCEpTION’s auto-recommender capabilities: a suggested value is shown to the annotator, and, if it is correct, they can approve and add the annotation with a single click. This greatly simplified the annotation process and reduced work time. The Ottoman Turkish team, for example, was able to use an existing Stanza (Stanford NLP) language model for modern Turkish and annotated five full books (78,000 tokens) for part of speech, dependencies, lemmas, and named entities. The INCEpTION interface allowed us to visually correct and validate auto-suggestions and bulk annotations. While more aggressive transfer learning methods are possible, we chose to rely on our annotators’ knowledge of the language in an effort to reduce the number of machine errors.
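The bulk-labelling idea can be sketched in a few lines of standard-library Python. This is not Cadet’s actual code: the toy corpus, the tag assignments, and the simplified three-column output (a subset of the ten CoNLL-U columns) are all hypothetical.

```python
from collections import Counter

# A toy corpus of pre-tokenised sentences (hypothetical data).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "by", "the", "door"],
]

# 1. Identify the most frequently occurring tokens in the corpus.
freq = Counter(tok for sent in corpus for tok in sent)

# 2. Annotate each frequent type once (hand-picked part-of-speech tags).
manual_tags = {"the": "DET", "sat": "VERB"}

# 3. Propagate each annotation to every occurrence of that token.
annotated = [[(tok, manual_tags.get(tok)) for tok in sent] for sent in corpus]

# 4. Estimate how much of the corpus the bulk labels now cover.
coverage = sum(freq[t] for t in manual_tags) / sum(freq.values())
print(f"coverage: {coverage:.0%}")  # 6 of 12 tokens -> 50%

# 5. Emit a simplified CoNLL-U-style line (ID, FORM, UPOS) per token.
for i, (tok, tag) in enumerate(annotated[0], start=1):
    print(f"{i}\t{tok}\t{tag or '_'}")
```

In the actual workflow, the propagated labels were then reviewed and corrected in INCEpTION rather than trusted blindly.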

7.3. Model Training and Evaluation

While many humanists are familiar with NLTK (Natural Language Toolkit), we chose spaCy for our model training purposes. spaCy is designed to be fast and efficient when working with large amounts of text. It is optimised for production use and provides pre-trained models for a variety of languages and tasks, including named entity recognition, part-of-speech tagging, and dependency parsing. It is also designed to be easy to use, with a simple and consistent API. spaCy has been a strong voice for simplicity and robustness in a quickly evolving field where the code bases of novel methods are often released as proofs of concept without long-term maintenance planning or interoperability with other methods. As a framework, spaCy facilitates the use of both rule-based processing and statistical models. This approach favours the interpretability of results and does not require significant computational resources. The core developers of spaCy maintain tools that facilitate fine-tuning models for specific projects. In contradistinction to ‘theoretical’ or ‘fundamental’ NLP, which focuses on the development of algorithms and models, as well as the study of the underlying linguistic and computational principles that govern human language, spaCy espouses an ‘applied NLP’ approach, which focuses on specific tasks, iterative improvement, and practical outcomes.5 This model fits nicely with our priorities and values as a project.6

We relied heavily on spaCy projects7 as reproducible experiment templates. In a project file, we stored information on the source(s) of our data, how that data is processed before running an experiment, and all the hyperparameter settings and configurations for model training. This approach gave workshop participants’ experiments an important element of structure and a clear course of action. The step-by-step process offered participants an easy, practical ‘on-ramp’ to model training that did not require a theoretical background in machine learning, while still requiring them to participate in the process. The main trade-off with this approach is that many design choices are made on the user’s behalf. We addressed this gap with high-level discussions of statistical language models and the architecture of the spaCy framework. While the backgrounds and prior learning of participants varied, our approach enabled them either to experiment on their own or to seek out the information and partners they needed to complete a task.
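As an illustration of what such a project file records, here is a minimal sketch of a spaCy `project.yml`. The file names, variables, and commands are placeholders rather than the actual configuration used by any of our teams:

```yaml
title: "Hypothetical new-language pipeline"
vars:
  lang: "xx"                        # placeholder language code
assets:
  - dest: "assets/train.conllu"     # annotated data exported from INCEpTION
    description: "CoNLL-U training annotations"
commands:
  - name: "convert"
    help: "Convert CoNLL-U annotations to spaCy's binary format"
    script:
      - "python -m spacy convert assets/train.conllu corpus/"
  - name: "train"
    help: "Train the pipeline with the hyperparameters in config.cfg"
    script:
      - "python -m spacy train config.cfg --output training/"
workflows:
  all:
    - convert
    - train
```

Running `python -m spacy project run all` then executes the whole workflow, so an experiment can be re-run, or handed to another team member, without any manual steps.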

7.4. Challenges and Limits

Our efforts to promote the development of NLP tools for humanists working with so-called ‘new languages’ met with a broad array of challenges, which coalesced into three general categories: problems with labour, problems with text, and problems with tools. They also formed a sort of linear hierarchy: one cannot even approach the challenges inherent in fine-tuning a statistical model without training data on which to base the model, and that training data, in turn, will never materialise without the labour needed to curate it. The hierarchy is thus also pyramidal: the lowest layer, made up of the most pressing and unavoidable class of challenges, is occupied by problems of labour. One cannot help but be struck by the contrast between this structure and the discourse that characterises NLP more broadly, where attention is overwhelmingly (though not exclusively) concentrated on problems with tools: namely, large language models and how to optimise them.8 The reality is that researchers, especially humanists, who wish to engage in NLP on a ‘new language’ will spend the greater part of their time addressing prior questions: how to acquire annotators, texts, and a corpus of training data. Shifting focus away from machine-learning theory and onto these workflow questions helped the Institute address the challenges faced by all of its teams.

Humanists and linguists, the primary participants in our group, faced some labour challenges even before joining us. Engagement with NLP in the humanities is tempered by the overall level of institutional support available and the likelihood that the institution will recognise these techniques as useful ancillary methods for humanistic research. For non-tenure-track faculty and graduate students housed in the humanities, in particular, neither of these is guaranteed.
Researchers are often asked to wear many hats: ‘software developer’ as they configure an NLP library, ‘linguist’ as they apply formal taxonomies of grammar and syntax, ‘manager’ as they harmonise the potentially conflicting work of annotators. In most cases, the training necessary to serve in these roles must come from outside the institution—if it is available at all. Even where strong support exists, traditional publishing imperatives can make it difficult to share research openly: some of our participants elected to keep their repositories private on GitHub because publication of their data or model would have had implications for their impending dissertation or monograph.9

Labour-associated problems often involved annotators, that is, the participants who generated the training data for our new models. For some of our languages, like Yoruba and Efik, the best annotators were native speakers living in countries where the language is spoken; dispensing money from a private university in the United States to foreign nationals proved difficult even when funds were specifically allocated for this purpose. Moreover, cultural or language barriers between researchers and their annotators sometimes impeded research: one team found that annotators prioritised completeness over accuracy, prompting a conversation about research methodology. Socioeconomic factors also played a role: our annotation platform was delivered via a web interface, and some annotators had limited access to the internet.10 Others had no access to flat-rate internet providers and were charged specifically for data use on our project. A product that supported offline annotation would have helped some of our teams. Teams working on languages with no living native speakers, such as the historical forms Classical Arabic and Old Chinese, often found that the researchers themselves were the only suitable annotators, which compounded their labour burden.

Though labour was an omnipresent concern for our groups, most were able to progress at least to the stage of curating a corpus of training data. At this level of the pyramid, concerns about texts become paramount; the first task is to find a corpus of machine-readable text in the target language. Focusing on so-called new languages implies a certain level of difficulty at this stage; we attempted to head off potential issues via a rigorous selection process that asked teams to come to us with a digitised or readily digitisable corpus in hand prior to the beginning of the Institute.
Even extant digitised text was no guarantee of usable data: as the political situation in the Horn of Africa morphed over the course of the grant, our Tigrinya language team was informed in no uncertain terms by their government contacts that they were no longer allowed to use the corpus they had assumed would form the basis of their research. For other teams that did acquire digitised text, small corpora could make it difficult to focus the model on the research task; sometimes scraping data from the public web was the only reliable way to obtain the volume needed to make the models function well.11 Small datasets were not always a hindrance, however: the Russian-language team set out to curate a corpus focused on Dostoevsky’s novels, and succeeded, achieving a level of domain-specificity that is not usually found outside the humanities.

Our text corpora were rarely ‘born-digital’, and the digitisation process brought with it a new host of problems. Optical Character Recognition (OCR) is itself a chicken-and-egg problem for ‘new languages’: training the NLP model requires reliable data, but the generation of that data needs a sufficiently robust OCR model. The quality of the recognised text varies widely based on the source material, how well the images were captured, and the software being used—high-quality photos, high-quality software, and labour for correcting the (never perfect) output are all concerns. Our Kannada language team ultimately could not find an OCR solution that adequately digitised text from the inscriptions that formed their corpus. Even born-digital texts that were not subject to OCR often included undesirable artefacts like HTML tags and tokens from non-target languages like English, which are relatively common in web-scraped data. For some of our teams, even representing the data digitally was complicated: the Unicode standard is relatively comprehensive
but lacks the ability to represent some rare characters encountered in Chinese and Japanese texts, for example.12

Creating a model for a ‘new language’ is not a well-supported pathway in most NLP libraries, and, as a result, our teams met challenges with most of the tools we used during the Institute. Most documentation for NLP tools focuses on deploying and fine-tuning an existing language model for a domain-specific purpose; deviating from this pattern is usually reserved for confident software developers with a strong understanding of a library’s internals. Existing biases toward Western languages written in Latin scripts are replicated in NLP tools, and teams working with languages written right to left sometimes encountered unexpected results when annotating. These tools are also built atop commonly held concepts like ‘word’, ‘sentence’, and ‘paragraph’, which break down when applied to historical languages like Old Chinese and Classical Arabic that may not be segmented by punctuation or whitespace. Other languages, like Quechua, feature such morphological complexity that existing annotation solutions invariably erase or mask the full richness of the text. Similarly, NLP libraries usually provide components for a selection of well-described tasks: part-of-speech annotation, named entity recognition, etc.13 Research questions that rely on data that does not fit neatly into these categories (for example, phonology) increase the support burden on research teams.

As an NEH-funded project, we designed our instructional materials and workflow to use free and open-source software. With a research licence, we made use of the Prodigy annotation tool to quickly test ideas and annotation workflows; however, given the cost of a licence, we did not include it in our workshop materials or online course.
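The segmentation problem is easy to demonstrate with nothing more than Python’s built-in string handling; the sentences below are illustrative examples, not drawn from our teams’ corpora:

```python
# Whitespace tokenisation, which many NLP tools assume, works for English...
english = "the master said learning without thought is labour lost"
print(english.split())        # nine separate tokens

# ...but a sentence written without spaces or punctuation (here, Chinese)
# comes back as a single undifferentiated 'token'.
unsegmented = "学而不思则罔"
print(unsegmented.split())    # ['学而不思则罔']
```

Any pipeline stage that depends on word boundaries (part-of-speech tagging, dependency parsing) therefore needs a language-specific segmenter before it can even begin.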
Open-source tools that cover the same ground as their commercial competitors usually do so by shifting much of the burden onto the researcher, who must provide computing infrastructure and install and configure the tool, as was the case with INCEpTION. Virtually no tools offer an end-to-end solution that minimises the required technical knowledge; at some point, someone will be writing code. When the tools used are complicated or expensive, or the data is not publicly available, the replicability of research suffers. In an effort to address this, we supported our teams in hosting and describing their data on GitHub as a best practice.14

7.5. Outcomes

On the most practical level, the goal of our project was to contribute annotated data and new models for languages or domains that did not have any. We also hoped that, in the process, our researchers would advance their own humanities research problems. Overall, we wanted our participants to come away with an understanding not only of machine learning but also of the humanistic values of contextualisation, close reading, critical thinking, collaboration, and communication, an understanding that would make them confident, informed contributors to the conversations about the role of AI in our research, our society, and our lives, and about the place of humanistic values in those conversations. On a pedagogical level, we, as an instructional team, plan to expand our curriculum from a one-off programme into an online, asynchronous course, to be published as a freely available open resource on DARIAH-Campus.

And finally, on an overarching level, we wanted to test and refine the idea that Humanistic NLP in fact provides distinct added value to NLP. How do we measure our success?

Each team worked from a template repository that we provided15 and that included folders for the original texts, the text annotations, and the trained models. All but one of the teams were able to create a corpus of texts for the project. The presence of an existing corpus of machine-readable text was one of the strongest indicators of success. With a significant number of existing texts, the Classical Arabic, Kanbun, and other teams were able to focus their efforts on curation and annotation. Of the 11 teams, ten were able to create linguistic data in the CoNLL-U or CoNLL (NER) formats. The Quechua team had specific research goals and created both simple dramaturgical annotations and detailed morphosyntactic annotations of each word. The amount of annotated text data varied significantly between the teams. The Kanbun and Old Chinese teams were able to leverage existing trained models to generate automatic annotations, as well as auto-recommender systems in INCEpTION. Such steps to automate or semi-automate the annotation process account for the large difference between teams with substantial data (80,000 to 180,000 annotated words) and more modest contributions (11,000–18,000 words). The Ottoman Turkish team used a modern Turkish model from Stanza to annotate 76,000 words. It is also notable that the teams that focused specifically on named entity annotation were able to create significantly more annotated data than teams that also annotated for part of speech. To put the reusability of these annotated datasets in perspective: the UD treebanks comprise 138 languages with a median of around 21,500 tokens per language.
In this year-long project, the teams were able to contribute data for a preliminary set of seven new languages with a median of around 28,500 tokens. While work is ongoing and several teams must wait to publish their data because of dissertation work or a forthcoming book, the repositories nonetheless provide an important first glimpse of the teams’ outcomes. The creation of a text corpus is itself a significant project outcome, and the forthcoming dissertations, articles, and essays will be a testament to the impact of this experience on their research.

To date, the Institute teams have been able to publish four trained models. The Yoruba team, for example, trained components for morphology (0.92 accuracy) and part-of-speech tagging (0.94 accuracy) with notable success; lemma prediction remained an area for future growth (0.47 accuracy). The Yiddish team trained a model specifically for named entity recognition (NER) of people and places and achieved an accuracy rate of 0.97. The Kanbun team also focused on named entity recognition and created a model for an ancient language with notable success in identifying works of art. With more data and further training, the organisation, person, and location recognition scores could be improved. While the differences between these scores might seem discouraging at first, they actually highlight our point that engagement with annotation and model training provides DH scholars with actionable insights into the NLP models behind their research. If person or location entities are essential to a project, we can assess whether the model’s performance is sufficient and fine-tune it on relevant materials.

Table 7.1 Kanbun NER component performance

                     F-score        Precision      Recall
Overall NER          0.8231560239   0.8482469911   0.7995067818
WORK_OF_ART          0.9162981163   0.8950400000   0.938590604
ORG (organisation)   0.320610687    0.5384615385   0.2282608696
PERSON               0.5627118644   0.6424148607   0.5006031363
LOC (location)       0.1084337349   0.75           0.0584415584
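The F-score column in Table 7.1 is the harmonic mean of the corresponding precision and recall values, which can be verified directly:

```python
def f_score(precision: float, recall: float) -> float:
    """F1: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Overall NER row from Table 7.1.
print(round(f_score(0.8482469911, 0.7995067818), 4))  # 0.8232

# The LOC row shows why both values matter: high precision (0.75) combined
# with very low recall (0.058) still yields a low F-score, because the
# harmonic mean is dominated by the weaker of the two values.
print(round(f_score(0.75, 0.0584415584), 4))  # 0.1084
```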

Humanistic NLP is less focused on the mathematical details of the learning methods and more on careful and informed data collection and preparation, an area where conventional NLP is often lacking. Reporting incremental improvements on benchmark datasets yields a false sense of confidence, whereas Humanistic NLP acknowledges the incompleteness and uncertainty of data, annotation, and model prediction right from the beginning (see also Jo and Gebru 2020; Gengnagel 2022).

Apart from the aforementioned outcomes, the Institute served as a platform for creating new scholarly communities with broader reach. Two participants have recently established a DARIAH Working Group for Multilingual DH, which will help the wider community coalesce around these challenges and seek new opportunities moving forward.

7.6. Conclusion

The ‘New Languages for NLP’ Institute showed that, with a sufficient amount of training, humanists can assemble and annotate corpora and train new models for languages or language varieties that have very little NLP support. While one such workshop series is not enough to transform the whole field, our approach and our outcomes give us hope that Humanistic NLP is an attractive framework to consider against the backdrop of existing disciplinary divisions and the lack of proper NLP support for many languages and language varieties.

As digital humanities scholars, we cannot rely on systems that we do not understand or that have been trained for tasks vastly different from academic research. This is especially true of our language models. We cannot cross our fingers and hope that a model from Google or Meta will know Ottoman Turkish or Classical Arabic. We need to learn how to create the tools that can support our research needs. Critique of existing solutions is a valuable and essential skill, and we can build on that critique to create solutions that meet the needs of humanities researchers and the languages of their research. If we fail to do this, DH scholarship will remain stuck in a reactive mode, decrying that tools made for other purposes do not meet the needs of academic scholarship. The potential capabilities of ‘foundation’ models can point us to an emerging space in which digital humanists are actively engaged in the curation and processing of texts, annotation work, and the training of models for tasks that are relevant to DH analysis and scholarship.

Future work will be needed on multiple fronts, including (a) championing the use of NLP methods as a fundamental contribution to the practices of textual enrichment in the context of traditional humanistic disciplines, and (b) demonstrating the importance of annotated collections of humanistic texts as a foundation for training more diverse or more language- and domain-specific NLP models. At an even higher level of abstraction, digital humanities may turn out to be a most fruitful terrain for exploring both the opportunities and challenges of bringing philological and linguistic approaches to understanding and processing texts closer together.

Notes

1 https://newnlp.princeton.edu/languages/
2 The detailed schedules for each day can be seen on our website. Originally, the idea had been to bring participants to Princeton for all sessions, but with the COVID-19 pandemic we were only able to meet in person once. The pedagogical shift from in-person to remote teaching was a substantive change that our instructional group had to adjust for.
3 See https://cdh.princeton.edu/updates/join-us-for-a-roundtable-on-machine-predictions-and-synthetic-text/
4 See https://universaldependencies.org/
5 See Ines Montani, “Applied NLP Thinking: How to Translate Problems into Solutions”, https://explosion.ai/blog/applied-nlp-thinking
6 Our use of spaCy built on an existing relationship with the spaCy developers at Explosion AI and demonstrates this company’s interest in working with the digital humanities community. In 2019, Ines Montani, one of the company founders, spoke at the Alliance of Digital Humanities Organizations (ADHO) annual conference in Utrecht, NL, and helped identify the potential of a partnership between the open-source community that maintains spaCy and digital humanities domain experts with significant experience creating annotated texts, in TEI in particular.
7 https://spacy.io/usage/projects
8 It is precisely this attempt to sidestep problems of labour and focus on the tool that characterises the modern focus on language models powered by large, indiscriminately harvested datasets: “prevailing data practices tend to abstract away the human labour, subjective judgments and biases, and contingent contexts involved in dataset production” (Paullada et al. 2021).
9 See https://helda.helsinki.fi/bitstream/handle/10227/647/bjork.pdf?sequence=1&isAllowed=y
10 We used INCEpTION (https://inception-project.github.io/) because it was free, open-source, and supported multiple annotators per project across a variety of annotation tasks. The software requires a server on which it is deployed, and to which annotators connect over the internet to do their work.
11 The process used here was quite different from the indiscriminate crawling used to form the datasets that power large language models. Researchers usually chose one or a handful of individual websites or pages, scraped content from those pages, then hand-edited it to extract text for their corpus.
12 The “oracle bone script” of ancient China has been proposed for addition to the Unicode standard (see www.unicode.org/L2/L2015/15280-n4687-oracle-bone.pdf) but remains difficult to represent digitally. Some vendors create bespoke digital “toolkits” for researchers who work with scripts like these, but there is little interoperability between these specialised and often unmaintained tools.

126

Toma Tasovac et al.

13 The relatively recent addition of spaCy’s SpanCategorizer is an interesting counterpoint to this trend. The tool allows users to train a component that identifies arbitrary, potentially overlapping “spans” of text, which can support NER-like usage as well as a wide variety of other specialised use cases. See https://spacy.io/api/spancategorizer
14 This description was facilitated by a README template that asked the participants to write a curation rationale for their data, inspired by Jo and Gebru (2020). See https://github.com/New-Languages-for-NLP/repo-template
15 See https://github.com/New-Languages-for-NLP/repo-template

References

Bamman, David. “Natural Language Processing for the Long Tail.” 2017, https://dh2017.adho.org/abstracts/408/408.pdf.
Barbot, Laure, Edward J. Gray, and Klaus Illmayer. “The SSH Open Marketplace: Workflow Curation Sprint.” 2022, https://doi.org/10.5281/zenodo.6375077.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021: 610–623, http://dx.doi.org/10.1145/3442188.3445922.
Gengnagel, Tessa. “Digital Humanities, or: The Broken Record of Everything.” 2022, https://doi.org/10.5281/zenodo.6241218.
Jo, Eun Seo, and Timnit Gebru. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, http://doi.org/10.1145/3351095.3372829.
Klein, Lauren. “Are Large Language Models Our Limit Case?” Startwords 3 (2022), https://startwords.cdh.princeton.edu/issues/3/llm-limit-case/.
Paullada, Amandalynne, Inioluwa Deborah Raji, Emily Bender, Emily Denton, and Alex Hanna. “Data and Its (Dis)contents: A Survey of Dataset Development and Use in Machine Learning Research.” Patterns 2(11) (2021), https://doi.org/10.1016/j.patter.2021.100336.
Rio Riande, Gimena del. “On Spanish-Speaking Parrots.” Startwords 3 (2022), https://startwords.cdh.princeton.edu/issues/3/on-spanish-parrots/.

Part III

Language Models

8

Linguistic Injustice in Multilingual Technologies
The TenTen Corpus Family as a Case Study*

David Bordonaba-Plou and Laila M. Jreis-Navarro

8.1. Introduction

When thinking about injustice, the most common examples that come to mind are racial or gender discrimination, economic inequality, labor exploitation, or unequal access to basic services, health, or education. However, there are other types of injustice that, if not adequately conceptualized, would not have been considered cases of injustice. For example, “testimonial injustice” (Fricker, 2007) refers to those cases in which someone receives a credibility deficit because of prejudice, for example, when someone’s testimony is disregarded because of his or her accent. Similarly, “discursive injustice” (Kukla, 2014) alludes to those situations in which members of a disadvantaged group cannot consistently perform a particular type of speech act, for example, when a woman’s orders at work are interpreted as requests by her male subordinates. Both types of injustice have been identified and analyzed only after considerable time and effort. Along the same lines, establishing English as the lingua franca of academia has generated another type of injustice whose proper conceptualization has required several years of reflection and analysis: the so-called linguistic injustice. Linguistic injustice refers to the difficulties that a linguistic community must deal with when a language different from its mother tongue becomes indispensable for performing important tasks. English has long been the language of knowledge in virtually all academic disciplines. Non-native English speakers must reach an adequate level of proficiency to disseminate their work, from international journals to conference presentations. As Pool (1991, p. 497) highlights, there are many different reasons to establish one language as the official language of a linguistic community.
From Pool’s list, the following can be singled out as the most basic principles underpinning the establishment of English in academia: “uniformity (favoring only one language),” “stability (favoring existing language rights and statuses),” “universality (favoring languages known by outsiders),” or “prestige (recognizing already-high-status languages).” However, establishing English as the most common form of disseminating research in academia has led to linguistic injustice. On the one hand, the cost of acquiring the level of expertise necessary to publish scientific papers is enormous. On the other hand, work by non-native speakers is less likely to be published than that by native English speakers.

DOI: 10.4324/9781003393696-12


The predominance of English and the difficulties that poor command of the language can pose have been a relevant concern in digital humanities (DH): “The dominance of the English language is . . . a barrier to inclusivity to all who do not read or speak English” (Mahony, 2018, p. 384). Similarly, Galina (2014) argues that making DH more linguistically inclusive requires that we become aware of the disadvantages experienced by non-native speakers in disseminating and communicating their scientific work. When discussing the consequences and possible solutions to the English-speaking bias in DH, the literature has focused on the following problems: (i) the lack of translations of research outputs (Galina, 2013); (ii) issues of connectivity, for example, the so-called digital divide (Galina, 2014, p. 314; Mahony, 2018, p. 385); (iii) the unavailability of data sources in languages other than English to quantify DH (Galina, 2014, p. 310); (iv) problems with digital standards such as Unicode and TEI (Fiormonte, 2012, pp. 67–69; Mahony, 2018, p. 374); or (v) the need to increase the number of sources of textual information in languages other than English (Galina, 2014, p. 314). In this sense, multilingual DH critiques have to date centered on the deficiencies of non-English language resources. However, as we will show, multilingual DH criticism should address not only the number and diversity of resources but also the level of accuracy of digital analytical tools. The aim of this chapter is twofold. First, to distinguish a phenomenon, which we will label “the paradox of Anglocentric multilingualism,” that can produce a new type of linguistic injustice. This paradox arises when a multilingual philosophy is pursued in constructing complex systems of analysis in the digital environment (digital platforms, ontologies), yet these systems nevertheless advantage the study of English over other languages.
For instance, this linguistic injustice is instantiated when researchers cannot find reliable empirical data in languages other than English to (dis)prove a research hypothesis. The injustice derives from the poor level of precision in the output of non-English language technologies in linguistic analysis. Second, to contend that multilingual DH should address the different challenges posed by this paradox. Multilingual DH needs to deal with the deficiencies of tools’ performance, in addition to those of the language resources, because this disadvantage makes it difficult for any cross-linguistic study to provide reliable empirical data for (dis)proving linguistic intuitions. This makes studying the English language more profitable in academic terms, because research results are more robust and are reached more quickly than in other languages. To illustrate some of the potential problems derived from the paradox, this work will detail the difficulties we faced in a cross-linguistic study of the semantic behavior of the color term “white” (Bordonaba-Plou & Jreis-Navarro, 2023), when using the Arabic corpus arTenTen (Arts et al., 2014) and the Spanish corpus esTenTen (Kilgarriff & Renau, 2013) in the corpus tool Sketch Engine. We will also conduct an additional study in which we examine whether the deficiencies found in the work mentioned earlier reappear for another color term, in this case, the term “red.” We will study the different performances of the tool in Arabic and Spanish, compared to English, to signal the weaknesses of this tool in a multilingual arena, enabling its improvement and enriching the critical and inclusive framework of multilingual DH.

The plan for the chapter is as follows. Section 8.2 analyzes the different senses of linguistic injustice characterized in the literature, the economic and the publication senses; it then presents a new sense of linguistic injustice, the technological sense, produced by the paradox of Anglocentric multilingualism. Section 8.3 introduces the methods and materials used in this work. Section 8.4 explains the analyses performed, showing how the paradox produces linguistic injustice in the technological sense. Section 8.5 discusses the analyses’ results, highlighting their relevance for cross-linguistic studies and defending the need for DH to take up this challenge.

8.2. Linguistic Injustice Revisited

As mentioned in the Introduction, this chapter aims to distinguish a phenomenon, the paradox of Anglocentric multilingualism, which can produce instances of a new type of linguistic injustice. Linguistic injustice sets out the idea that second-language learners are at a disadvantage when using the new language. These disadvantages manifest themselves in a number of different ways, for example, in a lower probability of publishing articles in prestigious journals. The literature points to several possible explanations for this. One of them is the increasing number of submissions, which makes it impossible for editors to review every work worth publishing, so reviewers become more and more demanding about the level of English that a paper must exhibit to be publishable (see Meneghini & Packer, 2007, p. 114). Another is that reviewers evaluate articles written by non-native English speakers (NoNES) more negatively than those by native English speakers (NES). A different understanding of the term relates to the resources that NoNES must invest in developing an acceptable competence in the new language. Along these lines, two different senses of linguistic injustice can be distinguished in the literature: the economic and the publication senses. In the remainder of this section, we will present these two senses of linguistic injustice; then, we will explain the nuances of the new sense of linguistic injustice that we distinguish in this work, the technological sense. The economic sense of linguistic injustice is set out in Van Parijs (2002). The work argues that, although someone can voluntarily choose to learn another language and thus become a non-native speaker, certain injustices can derive from their reduced competence in mastering the new language. As he says: “unequal linguistic equipment can be the source of major interindividual injustices” (Van Parijs, 2002, p. 60).
Then, he applies different principles of distributive justice to analyze the interactions between linguistic communities to propose different models of linguistic justice. The study concludes that, whenever asymmetric bilingualism is involved, for example, the use of English in academia, the dominant native linguistic community should bear part of the cost of learning the language for all those who, belonging to the dominated non-native linguistic community, have decided to learn that language. As one might think, this does not happen in academia since, in general, it is the English learners themselves who must bear the costs. These costs are the material and temporal resources needed to achieve


the necessary mastery to be able to read, write, and speak in English, and other costs like, for example, translation or proofreading costs.1 It also has a personal cost, as it hinders integration into an academia that is increasingly understood as an international network. In summary, the economic sense of linguistic injustice relates to the distribution of the costs derived from the acquisition of a second language and the possible policies that can be adopted to make the distribution of costs fairer when a linguistic community learns a second language. In other words, a person is a victim of linguistic injustice in the economic sense when they have to learn a language that is not their own in order to be in the same situation or at the same level as other people (e.g., to have the same possibilities of promotion in a job), and this results in costs that people who do not have to acquire such a language do not have to assume. The publication sense of linguistic injustice is related to the “intrinsic difference in the probability of having the paper accepted between NES and NoNES due to ‘presentation’ adequacy” (Clavero, 2010, p. 552).2 The existence of linguistic injustice in this sense has been pointed out in academic disciplines such as biology (Primack & Marrs, 2008; Clavero, 2010), medicine (Man et al., 2004; Ross et al., 2006), or philosophy (Schliesser, 2018; Yen & Hung, 2019), and in many countries around the globe: Taiwan and Mexico (Hanauer et al., 2019), Korea (Cho, 2009), Japan (Gosden, 1996), Poland (Duszak & Lewkowicz, 2008), Italy (Guardiano et al., 2007), or Spain (Moreno et al., 2012). Several measures have been proposed to combat linguistic injustice in this regard.
For example, Meneghini and Packer (2007) propose improving the translation process, supporting journals that publish multilingual versions of articles, or improving the visibility of national journals to increase the impact of the papers of those authors who do not write in English. Although the phenomenon of linguistic injustice has been well documented in different languages, cultures, and scientific disciplines, some have disputed its existence. Hyland (2016) argues that linguistic injustice is a myth because it is based on two problematic assumptions. First, the NoNES/NES divide, that is, the supposition that we can draw a clear-cut distinction between native and non-native speakers. In this sense, Hyland highlights that being a native speaker is a matter of degree and, since scientific writing for publication is a skill that both NoNES and NES must acquire, many NES also experience difficulties writing scientific articles in English. Therefore, some NoNES may develop better scientific writing skills than many NES. Second, the primacy of language, that is, the assumption that language is the main barrier, which overlooks the existence of other barriers besides the linguistic one, for example, the lack of financial resources or research networks. In other words, researchers from less-developed countries must overcome several material barriers that, at times, are much more significant obstacles than the lack of English proficiency. Finally, Hyland (ibid.) emphasizes that the reviews of NoNES and NES papers tend to focus on research aspects rather than on presentation issues. Although we agree with Hyland on some points, for example, that NES must also deal with specific difficulties related to scientific writing, his defense that linguistic injustice is a myth is flawed. First, although NES must deal with difficulties

related to the acquisition of scientific writing skills, all the difficulties NES must face are difficulties that NoNES also must face, and they have to do so in a language that is not their own. Along the same lines, Politzer-Ahles et al. (2016, p. 4) say that “the fact that writing is difficult for native English-speaking scholars does not entail that it is not even more difficult for non-native English-speaking scholars.” In fact, Hyland implicitly recognizes this point when he says that “the need for a certain proficiency in a foreign language inevitably creates an added burden for authors” [emphasis added] (Hyland, 2016, p. 61). Second, the evidence that Hyland provides to show that reviewers are not linguistically biased is inadequate to prove that linguistic injustice is a myth. He mainly focuses on interviews with editors and a careful study of reviewers’ comments. However, Hyland forgets that, although editors and reviewers may state that they are not linguistically biased when doing a review or rejecting a paper, in fact, they may well be. As Alcoff (2012, p. 132) points out, “prejudice can operate quite effectively even when it runs counter to a person’s own consciously held values and commitments.” Our consciously held mental states may diverge from our unconscious mental associations (see Johnson, 2020). In other words, there is a difference between explicit biases, which we are aware of, and implicit biases, which we are not. From this point of view, someone with an implicit bias against the elderly will unconsciously associate certain characteristics with them, such as that they tend to repeat themselves, they only like old music, or they are bad with technology. In the same way, reviewers might explicitly state that they have no bias against articles written by non-native speakers; however, at the implicit level, they might have a linguistic bias when evaluating such articles.
In this sense, an article’s negative evaluation apparently based on research issues may in fact be based on presentation aspects. In summary, the publication sense of linguistic injustice refers to the idea that NoNES are less likely to see their papers published because of their lower English proficiency. As Hyland suggests, it is possible that some NES also experience difficulties, but it seems plausible that the majority of NoNES experience more difficulties than the majority of NES because acquiring English proficiency is an additional burden they must overcome. Be that as it may, we defend in this work a new sense of linguistic injustice, which is related to the previous ones but has its own features and consequences: the technological sense of linguistic injustice. This new sense also places a burden on NoNES users of language technologies in academia: if not on the publication rate, then certainly on the research output. For NoNES to be on an equal footing with NES researchers, they need to adapt to their native languages tools whose design is based on the idiosyncrasies of the English language, which means forcing a given structure onto the intrinsic differences of multilingualism. Not only does this slow the process of achieving efficient tools; it might also eventually prove a fruitless effort. The technological sense is related to a lower possibility of finding evidence when using digital analysis tools in languages other than English. This phenomenon, which we have labeled the “paradox of Anglocentric multilingualism,” occurs when a multilingual philosophy is pursued to build digital analysis systems, but these systems advantage the study of English over other languages. For instance, consider a researcher studying color terms in Arabic


and a researcher studying color terms in English. Suppose that both researchers are using Sketch Engine’s corpora to examine color terms in context, but the first researcher gets less data of worse quality. In this sense, the first researcher will suffer from linguistic injustice in the technological sense, because the tool does not provide results as evidentially strong as those obtained when the same phenomenon is analyzed in English. Therefore, one line of research will lack some of the scientific arguments needed to prove or disprove its initial hypothesis compared to the other.

8.3. Methodology

To illustrate how the paradox of Anglocentric multilingualism can produce linguistic injustice in the technological sense, we will present the obstacles we faced in an investigation of the color term “white” (see Bordonaba-Plou & Jreis-Navarro, 2023) using Sketch Engine. Sketch Engine is a corpus tool with text analysis applications (Kilgarriff et al., 2014); it “contains 600 ready-to-use corpora in 90+ languages, each having a size of up to 60 billion words to provide a truly representative sample of language.”3 Using this tool, we will conduct a comparative analysis of the quantity and quality of the available evidence for exploring color terms in Arabic, Spanish, and English corpora. We will focus the analyses on two types of issues: the lists of collocates and the Part-of-Speech (POS) tagging. To do the analyses, we have used three corpora from the TenTen Corpus Family. The TenTen Corpus Family (Jakubíček et al., 2013) is a set of web-based corpora compiled following the methodology described in Sharoff (2006). The same web crawling methods were used to compile the WaCky corpora (Baroni et al., 2009), three of the first very large (>1 billion words) web-based corpora that can be considered the precursors of the TenTen Corpus Family. The WaCky corpora contain a wide variety of different types of text, including ‘pre-web’ texts in electronic format, “spanning from sermons to recipes, from technical manuals to short stories, and ideally including transcripts of spoken language as well,” and texts belonging to web-based genres like “personal pages, blogs, or postings in forums” (Baroni et al., 2009, pp. 212–213).
Therefore, the corpora from the TenTen Family also contain a rich diversity of texts, since they have been compiled using the same methods (see Figure 8.1 for a representation of the different texts comprising each corpus).4 From the TenTen Corpus Family, we have used enTenTen20 (compiled from a total of 78,373,887 websites), arTenTen12 (compiled from a total of 14,546,247 websites), and esTenTen18 (compiled from a total of 59,415,971 websites). The English Web 2020 (enTenTen20), with a total word count of 38,149,437,411 words, has several previous versions, the oldest being the 2008 version with 2,759,340,513 words. The Spanish Web 2018 (esTenTen18) (see Kilgarriff & Renau, 2013) has a total word count of 16,951,839,897 words, and it has a previous version from 2011 with 9,497,213,009 words. The Arabic Web 2012 (arTenTen12) has a total word count of 7,475,624,779 words tagged with the Stanford tagger, of which a 115-million-word chunk has been tagged with the MADA tagger (see Arts et al., 2014). For relative sizes within the TenTen Corpus Family, see Figure 8.2.
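As a quick sanity check, the relative sizes visualized in Figure 8.2 can be recomputed from the word counts just listed. A minimal sketch (the corpus names and counts are those reported above; the percentages are our own arithmetic):

```python
# Word counts reported above for the three TenTen corpora used in this study.
corpora = {
    "enTenTen20": 38_149_437_411,
    "esTenTen18": 16_951_839_897,
    "arTenTen12": 7_475_624_779,
}

total = sum(corpora.values())  # 62,576,902,087 words combined
relative = {name: round(100 * words / total, 1) for name, words in corpora.items()}
# English accounts for roughly 61% of the combined word count,
# Spanish for roughly 27%, and Arabic for roughly 12%.
```

The imbalance itself anticipates the argument of this chapter: the English corpus alone is larger than the Spanish and Arabic corpora combined.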


Figure 8.1 Distribution of topics in the enTenTen20 and arTenTen12 corpus.

Figure 8.2 TenTen Corpus Family relative size.

We will address two essential aspects of the tool: the lists of collocates and the accuracy of the POS tagging. First, let us see some details of the Word Sketch tool. According to the Sketch Engine website, the Word Sketch tool processes the word’s collocates and other words in its surroundings. It can be used as a one-page summary of the word’s grammatical and collocational

behavior. The results are organized into categories, called grammatical relations, such as words that serve as an object of the verb, words that serve as a subject of the verb, words that modify the word etc.5 Kilgarriff and Tugwell (2002) describe the Word Sketch creation process. First, it is necessary to use a Part-of-Speech (POS) tagged corpus (the first corpus used was the BNC). Second, a parser comparing regular expressions looking for POS tags processes the corpus to find instances of quintuples of elements with the following form: {Rel, Word1, Word2, Prep, Pos}, “where Rel is a relation, Word1 is the lemma of the word for which Rel holds, Word2 is the lemma of the other open-class word involved, Prep is the preposition or particle involved, and Pos is the position of Word1 in the corpus” (Kilgarriff & Tugwell, 2002, pp. 127–128), where Word2 and Prep may receive null values. A total of 26 different relations are reported, including unary relations, for example, possessed, as in “my bank,” or plural, as in “the banks”; binary relations, for example, subject, as in “the bank refused,” or object, as in “climb the bank”; and one trinary relation, prepositional complement or modifier, as in “the bank of the river.” In this sense, the quintuple for “bank manager” would be {modifies, bank, manager, null, x}, and the quintuple for “the bank borrows money” would be {subject-of, bank, borrow, null, x}. Then, the salience of collocates is calculated with a combination of MI and log frequency. In the end, the tool presents the user with a list of relations composed of different collocates sorted by salience.6 In this way, the user can study the linguistic behavior of the searched term by inspecting the lists associated with the different relations.
As Kilgarriff and Tugwell (2002) point out, this is one of the most significant purposes of the Word Sketch tool, that is, to provide the researcher with an easy-to-access and easy-to-use tool that allows her to quickly and comprehensively visualize the intricacies associated with the meaning and usage of the terms she is interested in. However, this varies depending on the language since, as we will show, the lists of collocates provided by the Word Sketch tool are of differing types and usefulness depending on the language considered. For example, a search for rojo “red” in Spanish (esTenTen) provides lists like those in English (enTenTen), that is, lists based on a functional perspective: a list of the most frequent modifiers of rojo, a list of the nouns that rojo most frequently modifies, or a list of the terms that most frequently appear in the construction rojo y/o “red and/or.” However, in Arabic (arTenTen), the researcher has fewer lists available, and those only reflect grammatical categories and the collocation position (left or right), for example, noun-left, that is, nouns appearing to the left of the node (i.e., the searched word), or verb-right, that is, verbs appearing to the right of the node. This shortcoming makes it difficult for the tool to provide a complete perspective on the linguistic behavior of the term. This could be achieved if, as in English and Spanish, we could “produce a collocate list for subjects, another for objects, and so forth” (Kilgarriff & Tugwell, 2002, p. 127) in Arabic.
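As a rough, hypothetical sketch (not Sketch Engine’s actual implementation), the quintuple extraction and relation-based grouping described above can be modelled as follows. The quintuple fields and the toy examples (“bank manager,” “the bank borrows”) come from Kilgarriff and Tugwell; the salience function shown is logDice, the measure used by current versions of Word Sketch (the original system combined MI and log frequency), and all frequencies are invented:

```python
import math
from collections import namedtuple, defaultdict

# Hypothetical encoding of the quintuples described by Kilgarriff and
# Tugwell (2002): {Rel, Word1, Word2, Prep, Pos}.
Quintuple = namedtuple("Quintuple", ["rel", "word1", "word2", "prep", "pos"])

matches = [  # toy instances mirroring the examples in the text
    Quintuple("modifies", "bank", "manager", None, 17),
    Quintuple("subject-of", "bank", "refuse", None, 42),
    Quintuple("subject-of", "bank", "borrow", None, 58),
]

def group_by_relation(quintuples):
    """Group collocates under their grammatical relation, as a Word Sketch does."""
    sketch = defaultdict(list)
    for q in quintuples:
        sketch[q.rel].append(q.word2)
    return dict(sketch)

def log_dice(f_xy, f_x, f_y):
    """logDice salience: 14 + log2(2*f(x,y) / (f(x) + f(y))).

    14 is the theoretical maximum (the two words never occur apart);
    halving the collocation's relative frequency costs one point.
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

sketch = group_by_relation(matches)
# sketch["subject-of"] now contains "refuse" and "borrow"; within each list,
# collocates would then be ranked by salience, e.g. with toy frequencies:
score = log_dice(f_xy=100, f_x=1000, f_y=1000)  # ≈ 10.68
```

The point of the functional relations is visible even in this toy: the sketch separates what the bank *does* from what modifies it, which is exactly the perspective unavailable for Arabic in the positional (noun-left, verb-right) lists.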

To perform the analysis, we checked the first ten collocates of each list (sorted from highest to lowest logDice),7 determining how many collocates do not correspond to the category defined by the list. In summary, these inaccuracies imply that statistical scores such as logDice cannot be used with total confidence. This poses a problem because confidence in the reliability of statistical significance scores is of critical importance. The Word Sketch tool, a tool that displays collocations in a compact way, employs logDice when displaying the lists summarizing the collocational behavior of the searched term.

8.4. Analyses

The paradox of Anglocentric multilingualism can produce linguistic injustice in the technological sense. For example, when building digital tools, a multilingual perspective may be pursued because the tool allows research questions to be pursued in different languages in the same arena: the TenTen Corpus Family. However, the scope of the analyses that can be carried out varies from one language to another, with English on top (see Mello in this volume for another work on the limitations of algorithms for NLP in languages other than English). In this sense, an investigation in a language other than English may be negatively affected by how restricted the analyses can be. We faced these difficulties when investigating the meaning and use of color terms in a previous work. In Bordonaba-Plou and Jreis-Navarro (2023), we examine brightness, a dimension relevant to understanding color quality. To this end, we adopted a cross-linguistic perspective and examined how a specific color term, “white,” interacts with brightness modifiers in two different languages, Arabic and Spanish. We conducted a collocation and a KWIC analysis.
However, we encountered certain difficulties related to the usefulness of the collocate lists and the accuracy of the POS tagging, which made gathering and analyzing enough data to test our hypotheses more complicated. Specifically, we ascertained a decreasing level of usefulness and accuracy in the quality of the search output, with the results in Arabic being less useful and accurate than the results in Spanish. This made us wonder whether the data quality would be even better in English than in Spanish. Here, we want to examine the quality of the output for three different languages: English, Arabic, and Spanish. In the remainder of this section, we present the results of another analysis we have performed using another color term, “red.” In this way, we will be able to measure to what extent the aforementioned differences in the quality of the output are also present for another color term. Let us see the results of the first analysis using the Word Sketch tool, the analysis of the lists of collocates (see Table 8.1). In English, when we search for the term “red” in the enTenTen20 corpus, we obtain 3,675,423 hits as adjectives, 778,037 as nouns, and 13,196 as verbs. The tool displays nine lists, going by the following names: modifiers of “red,” nouns modified by “red,” “red” and/or ..., prepositional phrases [including red], infinitive object of “red,” verbs complemented by “red,” verbs before “red,” subject of “be red,” usage patterns [involving red].


Table 8.1 Word Sketch analysis results

                                           English        Spanish        Arabic
                                                                         Stanford       MADA
No. of collocates lists                    9              8              7              3
Functional lists                           Yes            Yes            No             Yes
No. of collocates lists with
accurate POS tagging                       7/9 (77.7%)    6/8 (75%)      0/7 (0%)       3/3 (100%)
No. of collocates with accurate
POS tagging (top 10 collocates)            80/90 (88.8%)  66/80 (82.5%)  29/70 (41.4%)  30/30 (100%)
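The per-language accuracy percentages in the last row can be recomputed from the raw counts (correct collocates out of collocates inspected). A minimal sketch of our own arithmetic; note that figures such as 88.8% for 80/90 suggest the reported percentages are truncated rather than rounded to one decimal, which the code mimics:

```python
import math

# Raw counts behind the "collocates with accurate POS tagging" row:
# (correct collocates, collocates inspected) per tagger/language.
counts = {
    "English":           (80, 90),
    "Spanish":           (66, 80),
    "Arabic (Stanford)": (29, 70),
    "Arabic (MADA)":     (30, 30),
}

def pct(ok, n):
    """Percentage truncated (not rounded) to one decimal place."""
    return math.floor(1000 * ok / n) / 10

accuracy = {lang: pct(ok, n) for lang, (ok, n) in counts.items()}
# {"English": 88.8, "Spanish": 82.5, "Arabic (Stanford)": 41.4, "Arabic (MADA)": 100.0}
```

The gap between 88.8% for English and 41.4% for the large Stanford-tagged Arabic corpus is the quantitative core of the technological injustice claimed in this chapter.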

Let us now see the output of the tool in Spanish. We search for the term rojo in the esTenTen18 corpus. We obtain 2,064,613 hits as an adjective and 326,470 as a noun. The tool provides eight lists: modifiers of rojo, rojo y/o, prepositional phrases, subjects of ser rojo “be red,” verbs before rojo, nouns modified by rojo, subjects of estar rojo “be red,” usage patterns. In Arabic, a basic search for aḥmar “red” in the Stanford-tagged corpus produces 114,258 results; the tool does not specify whether these are hits as an adjective, a noun, or a verb. The tool provides the following nine lists: verb_left, verb_right, noun_left, noun_right, adjective_left, adjective_right, nextleft, nextright, conj [for conjunction]. However, a basic search for aḥmar in the MADA-tagged corpus produces 18,539 hits. The tool provides the following lists: construct-state and/or adjective-of. We now move on to the results of the second analysis, the analysis of the accuracy of POS tagging in each language (see Table 8.1). First, let us see the results for English. An examination of the items in each list reveals the following. First, in the “red” and/or ... list, there are two collocates that do not belong to the category defined by the list: (i) “bright” as in “bright red hair,” which is also present in the modifiers of red list; (ii) “dark” as in “dark red carpet,” which should be in the modifiers of red list. Second, in the infinitive objects list, eight of the first ten collocates are wrong. The only two collocates that belong to the category infinitive objects of “red” are “to symbolize/symbolise,” as in “red to symbolize the blood,” and “to signify,” as in “red to signify negative numbers.” However, if we expand the results of the list, we get more items that do belong to the category, for example, “to indicate,” “to match,” “to represent,” or “to show.” The results in Spanish are slightly worse than those obtained in English.
First, in the modifiers of rojo list, only four collocates out of ten (predominantemente “predominantly,” intensamente “intensely,” completamente “completely,” preferiblemente “preferably”) are correct. The others are wrong, for example, alerta “alert,” as in alerta roja “red alert,” and atún “tuna,” as in atún rojo “bluefin tuna,” both of which also appear in the nouns modified by rojo list. There are also several stray terms, such as “getty” and “supra,” and proper names, such as “Javier” and “Jaime.” Second, in the verbs before rojo list, two collocates out of ten (poner “turn” and tornar “become”) are correct. The others are wrong, for example, the spurious verbs paular (from the proper name “Paula”) and rodrigar (from the proper name “Rodrigo,” as in Rodrigo Rojas), or circular “circle,” as in el círculo rojo “the red circle.”

In contrast, the results in Arabic are less accurate. First, in the corpus tagged with MADA, the data are scarce, since the total corpus size is 115 million words, but the collocates in the three lists are correct. However, in the corpus tagged with the Stanford tagger, which, given its size, is the one supposed to represent the language, we detected important errors in the collocates of the lists. In the verb_left list, four out of ten collocates are correct. The first wrong collocate is “FROM,” which is not even an Arabic word; the second and third are fāqiʿ “bright” and qātim “dark,” terms that function as modifiers; the fourth is bāhit “pale,” another modifier, but listed without the first letter of the word (the first letter is read as a preposition, leaving āhit); the seventh is fātiḥ “light,” another modifier, which, as with bāhit, appears in the list without its first letter. In Arabic, possessive pronouns as well as some prepositions and conjunctions are attached to the word, for example, to nouns or verbs. The aforementioned errors occur because Arabic is an unsupported language in the TenTen Corpus Family and the Sketch Engine (the chunk tagged with MADA is not big enough to be representative of the language).8 Having only partial support means that the part-of-speech (POS) tagger does not cover cliticization in smart search. Cliticization is the most important phenomenon in Arabic morphology and a major challenge in computational analysis (Habash, 2010). Clitics are harder to parse in Arabic because there are no segmentation marks, as there are with the apostrophe in English.
One token in Arabic can include all the following information, with orthographic changes included: proclitics (question or future particles, conjunctions, prepositions, the definite article), the stem (the core), plus affixes (conjugation patterns) and enclitics (pronouns). The tagger therefore fails to recognize the clitics, generating two types of errors. On the one hand, the tagger recognizes the first letter of a word as a clitic, as in the verb_left list cases: for example, it reads bāhit “pale” as b “with/in” plus āhit, which is a nonword. On the other hand, as we will see later, the tagger may fail to recognize actual clitics in words, for example, a preposition, a future mark, or a pronoun, and thus treats the same word with different affixes as different collocates. Other tools report the same problem.9 In the verb_right list, three out of ten results are correct. The wrong collocates are modifiers or nouns such as nabīḏ “wine,” and some of them appear without the first letter, as in the verb_left list. In the noun_left list, five out of ten are correct. Among the wrong results, we find modifiers such as dākin “muddy” and the conjunction w “and,” which also appears in the conjunctions list. In the noun_right list, six out of ten collocates are correct. The wrong results are related to the POS tagging problem mentioned above: for example, the noun lawn “color” appears three times as three distinct collocates, as the bare noun but also as lawnu-hā “her color” and bi-lawn “with a color.” In the adj_left list, three out of ten collocates are correct. The adjective fāqiʿ “bright” appears twice because the tagger fails to identify the indeterminate accusative case mark (fatḥa tanwīn), the -an in fāqiʿ-an. The list also includes the names of other colors, as in aḥmar w-aḫdar “red and green.” In the adj_right list, zero out of ten collocates are correct. In Arabic, the adjective comes after the noun, and Arabic is a right-to-left language, so the adj_right list is complete nonsense from the start. In the conjunctions list, eight out of ten collocates are correct. The errors are due to the POS tagging problem mentioned above, in which the tagger identifies the first letter of a word as a conjunction. For example, in the expression aḥmar wardī “pink red,” the tagger takes the w of wardī to be a conjunction and ends up identifying ardī, a nonword, as a collocate of the list. We set aside the next_left and next_right lists because they are composed of collocates that already appeared in the lists discussed above.

8.5. Conclusion

This study has shown how, using corpora belonging to the TenTen Corpus Family, a digital tool such as Sketch Engine puts forward a multilingual philosophy that, in practice, disadvantages languages other than English. In doing so, it produces a type of linguistic injustice identified as “the paradox of Anglocentric multilingualism.” The tool’s lack of precision when dealing with languages such as Spanish, but especially Arabic,10 makes it difficult to obtain reliable empirical data to support research hypotheses in these languages. The case studies presented, which examine color terms in English, Spanish, and Arabic, highlight the inequalities and risks involved in academic work focused on non-English languages, particularly when the work makes use of natural language processing tools developed with English as a baseline. In many cases, the multilingual perspective is implemented merely through translations that do not take into account the idiosyncrasies and complexities of other languages and their cultural backgrounds. In others, the use of specific tools is restricted, maintaining an imbalance in the quantity and quality of the available data, as in arTenTen, where the bulk of the corpus is analyzed with the Stanford tagger while only a small sample is analyzed with the MADA tagger.11 Moreover, in view of the results generated by the Word Sketch tool, it is evident that the corpus data for different languages have varying degrees of refinement. Cross-linguistic studies should be considered central to the multilingual digital humanities, requiring balanced tools that move beyond quick and one-dimensional solutions. Achieving solid research results in one language and culture without establishing solid comparisons with others undermines the reliability of scientific knowledge itself.
The multilingual perspective in DH does not solely affect disciplines such as philology or linguistics but humanistic thought itself, which risks being reduced to mere equivalences of terms without intercultural dialogue. ELEXIS (European Lexicographic Infrastructure), the Horizon 2020 project12 that for a time gave many European institutions access to Sketch Engine,13 framed its main objective as follows: “Reliable and accurate information on word meaning and usage is crucial in today’s multilingual and multicultural society.”14 Despite this, the present study has shown that we still need more equitable approaches and that the DH philosophy should not be reduced to an instrumentalist perspective of technical problem-solving. It should have at its core the willingness to build digital infrastructure on robust ethical grounds. In this sense, we need digital environments and tools with a genuinely multilingual perspective, guaranteeing similar data quantities and qualities for the different languages and using NLP tools specific to each language and with similar performance. Large language models (LLMs), such as multilingual BERT, outperform state-of-the-art POS tagging approaches, even in the case of Arabic (Saidi et al., 2021), but a fair way to work across languages is still to come, and the promise of “greener NLP” is far away (Mielke et al., 2021, p. 15). In this scenario, a plausible course of action from a digital humanist perspective is to provide an ethical framework for consensus between different linguistic traditions, one that goes beyond extrapolating and translating from English tools and models.15 Multilingual digital humanities need to work toward providing human-annotated and supervised data to improve existing linguistically motivated tools for non-English languages, thus securing a multilingual pathway across them. This perspective would be a constructive way to tackle the linguistic injustice that hinders the credibility and competitiveness of cross-cultural work.

Notes

* A previous version of this work was presented at DH2022, held in Tokyo from 25 to 29 July 2022 (https://dh2022.dhii.asia/dh2022bookofabsts.pdf).
1 In this sense, different measures have been proposed to alleviate this asymmetrical distribution of costs, for example, that journals and publishers bear the costs of correcting the manuscripts of NoNES, or the creation of associations that help researchers with proofreading. However, these measures are far from being implemented.
2 Empirical studies show that the most common difficulties experienced by NoNES are linguistic issues such as sentence structure, grammar, or vocabulary (Shaw, 1991; Cho, 2009); text processing (Moreno et al., 2012); genre conventions (Englander, 2009); or the composing processes of the papers’ structure (Flowerdew, 2004; Guardiano et al., 2007).
3 www.sketchengine.eu/.
4 The data for the esTenTen18 corpus do not appear because the distribution of the corpus does not consider the types of text but rather the varieties of Spanish. However, since the corpus has been compiled with the same methodology as the other two corpora, it can be assumed that the diversity of texts is also present in this case.
5 www.sketchengine.eu/guide/word-sketch-collocations-and-word-combinations/.
6 Specifically, the tool displays, for each searched word, lists of Rel-Word2 pairs and Rel-Word2-Prep triples.
7 The logDice score expresses “the tendency of two words to co-occur relative to the frequency of these words in the corpus” (Gablasova et al., 2017, p. 164).
8 The Sketch Engine’s Word Sketch Grammar (WSG), that is, the rules determining the grammatical relations that generate the lists of the Word Sketch tool, is language-dependent and cannot be shared across languages. As the Sketch Engine’s website reports, “corpora in unsupported languages can make use of a universal WSG which provides only basic statistics of words surrounding the keywords ignoring the grammar of the language.” For more information, see the entry Word Sketch Grammar in the Sketch Engine’s glossary (https://sketchengine.eu/guide/glossary/).

9 http://corpora.lancs.ac.uk/lancsbox/localise.php.
10 A higher level of accuracy in Arabic can be achieved by using the MADA-tagged arTenTen instead of the Stanford-tagged arTenTen. However, this comes at a cost: we sacrifice size in exchange for accuracy.
11 www.sketchengine.eu/arabic-mada-system-tagset/.
12 https://cordis.europa.eu/project/id/731015.
13 www.sketchengine.eu/access-after-elexis/.
14 https://elex.is/objectives/.
15 In this sense, this study links with Lidia Ponce de la Vega’s contribution to the present volume on multilingual DH, as both address different types of injustices that occur through cultural and digital products in connection with the lack of linguistic diversity, and both agree that multilingualism is achieved not merely by bringing in a multitude of languages but by changing the very structures of the platforms that promote this idea from a universalistic and Anglocentric perspective.

References

Alcoff, L. M. (2012). Epistemic Identities. Episteme, 7(2), 128–137. https://doi.org/10.3366/E1742360010000869.
Arts, T., Belinkov, Y., Habash, N., Kilgarriff, A., & Suchomel, V. (2014). arTenTen: Arabic Corpus and Word Sketches. Journal of King Saud University—Computer and Information Sciences, 26(4), 357–371. https://doi.org/10.1016/j.jksuci.2014.06.009.
Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43, 209–226.
Bordonaba-Plou, D., & Jreis-Navarro, L. M. (2023). A Cross-linguistic Study of Color Terms in Arabic and Spanish. In D. Bordonaba-Plou (Ed.), Experimental Philosophy of Language: Perspectives, Methods and Prospects (pp. 151–170). Springer. https://doi.org/10.1007/978-3-031-28908-8_8.
Cho, D. W. (2009). Science Journal Paper Writing in an EFL Context: The Case of Korea. English for Specific Purposes, 28, 230–239. https://doi.org/10.1016/j.esp.2009.06.002.
Clavero, M. (2010). “Awkward Wording. Rephrase”: Linguistic Injustice in Ecological Journals. Trends in Ecology and Evolution, 25(10), 552–553. https://doi.org/10.1016/j.tree.2010.07.001.
Duszak, A., & Lewkowicz, J. (2008). Publishing Academic Texts in English: A Polish Perspective. Journal of English for Academic Purposes, 7, 108–120. https://doi.org/10.1016/j.jeap.2008.03.001.
Englander, K. (2009). Transformation of the Identities of Nonnative English-Speaking Scientists as a Consequence of the Social Construction of Revision. Journal of Language, Identity and Education, 8, 35–53. https://doi.org/10.1080/15348450802619979.
Fiormonte, D. (2012). Towards a Cultural Critique of the Digital Humanities. Historical Social Research, 37(3), 59–76. https://doi.org/10.12759/HSR.37.2012.3.59-76.
Flowerdew, J. (2004). Attitudes of Journal Editors to Nonnative Speaker Contributions. TESOL Quarterly, 35(1), 121–150. https://doi.org/10.2307/3587862.
Fricker, M. (2007). Epistemic Injustice: Power and the Ethics of Knowing. Oxford University Press.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in Corpus-Based Language Learning Research: Identifying, Comparing, and Interpreting the Evidence. Language Learning, 67(S1), 155–179. https://doi.org/10.1111/lang.12225.

Galina, I. (2013, July 19). Is There Anybody Out There? Building a Global Digital Humanities Community. Red Humanidades Digitales. http://humanidadesdigitales.net/blog/2013/07/19/is-there-anybody-out-there-building-a-global-digital-humanities-community/.
Galina, I. (2014). Geographical and Linguistic Diversity in the Digital Humanities. Literary and Linguistic Computing, 29(3), 307–316. https://doi.org/10.1093/llc/fqu005.
Gosden, H. (1996). Verbal Reports of Japanese Novices’ Research Writing Practices in English. Journal of Second Language Writing, 5(2), 109–128. https://doi.org/10.1016/S1060-3743(96)90021-1.
Guardiano, C., Favilla, M. E., & Calaresu, E. (2007). Stereotypes About English as the Language of Science. AILA Review, 20, 28–52. https://doi.org/10.1075/aila.20.05gua.
Habash, N. Y. (2010). Introduction to Arabic Natural Language Processing. Morgan & Claypool.
Hanauer, D. I., Sheridan, C. L., & Englander, K. (2019). Linguistic Injustice in the Writing of Research Articles in English as a Second Language: Data from Taiwanese and Mexican Researchers. Written Communication, 36(1), 136–154. https://doi.org/10.1177/0741088318804821.
Hyland, K. (2016). Academic Publishing and the Myth of Linguistic Injustice. Journal of Second Language Writing, 31, 58–69. https://doi.org/10.1016/j.jslw.2016.01.005.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen Corpus Family. In 7th International Corpus Linguistics Conference CL (pp. 125–127). https://ucrel.lancs.ac.uk/cl2013/.
Johnson, G. M. (2020). The Psychology of Bias: From Data to Theory. In E. Beeghly & A. Madva (Eds.), An Introduction to Implicit Bias: Knowledge, Justice, and the Social Mind (pp. 20–40). Routledge.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten Years on. Lexicography, 1(1), 7–36. https://doi.org/10.1007/s40607-014-0009-9.
Kilgarriff, A., & Renau, I. (2013). esTenTen, a Vast Web Corpus of Peninsular and American Spanish. Procedia—Social and Behavioral Sciences, 95, 12–19. https://doi.org/10.1016/j.sbspro.2013.10.617.
Kilgarriff, A., & Tugwell, D. (2002). Sketching Words. In M. H. Corréard (Ed.), Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins (pp. 125–137). EURALEX.
Kukla, R. (2014). Performative Force, Convention, and Discursive Injustice. Hypatia, 29(2), 440–457. https://doi.org/10.1111/j.1527-2001.2012.01316.x.
Mahony, S. (2018). Cultural Diversity and the Digital Humanities. Fudan Journal of the Humanities and Social Sciences, 11, 371–388. https://doi.org/10.1007/s40647-018-0216-0.
Man, J. P., Weinkauf, J. G., Tsang, M., & Sin, D. D. (2004). Why Do Some Countries Publish More Than Others? An International Comparison of Research Funding, English Proficiency and Publication Output in Highly Ranked General Medical Journals. European Journal of Epidemiology, 19, 811–817. https://doi.org/10.1023/B:EJEP.0000036571.00320.b8.
Meneghini, R., & Packer, A. L. (2007). Is There Science Beyond English? Initiatives to Increase the Quality and Visibility of Non-English Publications Might Help to Break Down Language Barriers in Scientific Communication. EMBO Reports, 8(2), 112–116. https://doi.org/10.1038/sj.embor.7400906.
Mielke, S. J., Alyafeai, Z., Salesky, E., Raffel, C., Dey, M., Gallé, M., Raja, A., Si, C., Lee, W. Y., Sagot, B., & Tan, S. (2021). Between Words and Characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. Computing Research Repository (arXiv). https://doi.org/10.48550/arXiv.2112.10508.
Moreno, A. I., Rey-Rocha, J., Burgess, S., López-Navarro, I., & Sachdev, I. (2012). Spanish Researchers’ Perceived Difficulty Writing Research Articles for English-Medium Journals: The Impact of Proficiency in English Versus Publication Experience. Ibérica, 24, 157–183.
Politzer-Ahles, S., Holliday, J. J., Girolamo, T., Spychalska, M., & Berkson, K. H. (2016). Is Linguistic Injustice a Myth? A Response to Hyland (2016). Journal of Second Language Writing, 34, 3–8. https://doi.org/10.1016/j.jslw.2016.09.003.
Pool, J. (1991). The Official Language Problem. American Political Science Review, 85(2), 495–514. https://doi.org/10.2307/1963171.
Primack, R. B., & Marrs, R. (2008). Bias in the Review Process. Biological Conservation, 141, 2919–2920.
Ross, J. S., Gross, C. P., Desai, M. M., Hong, Y., Grant, A. O., & Daniels, S. R. (2006). Effect of Blinded Peer Review on Abstract Acceptance. The Journal of the American Medical Association, 295, 1675–1680.
Saidi, R., Jarray, F., & Mansour, M. (2021). A BERT-Based Approach for Arabic POS Tagging. In I. Rojas, G. Joya, & A. Catala (Eds.), Advances in Computational Intelligence: 16th International Work-Conference on Artificial Neural Networks, IWANN 2021, Virtual Event, June 16–18, 2021, Proceedings, Part I (pp. 311–321). Springer.
Schliesser, E. (2018). On Philosophical Translator-Advocates and Linguistic Injustice. Philosophical Papers, 47(1), 93–121. https://doi.org/10.1080/05568641.2018.1429740.
Sharoff, S. (2006). Creating General-Purpose Corpora Using Automated Search Engine Queries. In M. Baroni & S. Bernardini (Eds.), WaCky! Working Papers on the Web as Corpus (pp. 63–98). Gedit.
Shaw, P. (1991). Science Research Students’ Composing Processes. English for Specific Purposes, 10(3), 189–206. https://doi.org/10.1016/0889-4906(91)90024-Q.
Van Parijs, P. (2002). Linguistic Justice. Politics, Philosophy and Economics, 1(1), 59–74. https://doi.org/10.1177/1470594X02001001003.
Yen, C. P., & Hung, T. W. (2019). New Data on the Linguistic Diversity of Authorship in Philosophy Journals. Erkenntnis, 84, 953–974. https://doi.org/10.1007/s10670-018-9989-4.

9 Typological Challenges for the Application of Multilingual Language Models in the Digital Humanities

Marcell Richard Fekete, Johannes Bjerva, and Lisa Beinborn

9.1. Introduction

Textual data is at the heart of many research areas in digital humanities (DH). The interdisciplinary mindset of DH researchers gives the field its characteristic openness and invites the adoption of methodological innovation. Natural language processing (NLP) allows for more extensive analysis of large corpora (Suissa et al. 2022), although the lack of tools available for languages other than English hinders progress; performance drops even for resource-rich languages such as Spanish and Portuguese. However, new developments in neural language models offer exciting possibilities, also for resource-poor languages. For instance, multilingual language models—trained on many languages simultaneously—can use linguistic similarities to promote information sharing across languages, but there are still obstacles to overcome. Besides data imbalances, it is also vital to account for the inherent challenges when working with languages that drastically differ from English in their global status or linguistic (typological) properties. In this chapter, we address three main challenges when applying multilingual language models and discuss promising research directions and concrete actions that DH researchers may take. These challenges can be loosely grouped into aspects related to the availability of data in a language (Challenge 1), vocabulary (Challenge 2), and linguistic structure (Challenge 3), and they are further compounded by cross-cultural and pragmatic factors.

Challenge 1: Disparity in terms of data availability, especially in a cross-lingual perspective considering languages spoken outside the Global North.
Challenge 2: Unequal coverage of lexical items across the vocabulary of languages, especially when not written with the Latin alphabet.1
Challenge 3: Subpar performance for languages that cannot be placed in the same linguistic categories as resource-rich Indo-European languages.
We argue that multilingual DH researchers may benefit from NLP, and we aim to increase the accessibility of this technology to a broader audience. Nonetheless, it is crucial to acknowledge the difficulties that arise in multilingual contexts. Furthermore, it is important to recognise that expert knowledge and manually created tools and resources cannot be replaced by NLP tools. Our suggestion echoes the chapter by Tasovac and colleagues in this volume: we propose closer collaboration between DH and NLP researchers to address the diversity of languages and promote fairness in the application of NLP across linguistic communities.

DOI: 10.4324/9781003393696-13

9.2. Developments in Language Modelling

Our primary mode of communication is natural language, allowing us to convey subtle nuances in meaning while adjusting our message to suit specific audiences and contexts. Due to the constantly evolving processes of ambiguity and variation in language use, a static, rule-based mapping from word forms to meaning is insufficient (de Saussure 1983). Language models aim to capture the semantic and syntactic richness of language in a computationally processable format. They operate by identifying co-occurrence patterns of linguistic elements in large text corpora and learn to transform input sequences, such as sentences, into high-dimensional vector representations that can be manipulated using computational tools. The relative positions of linguistic elements in the vector space approximate their semantic similarities. This approximation of meaning has significantly improved progress in NLP and boosted performance across a wide range of tasks, especially those requiring fine-grained semantic information, including clustering texts by topic, identifying stylistic similarities between authors, and extracting domain-specific terminology (Zhu et al. 2021; Benzebouchi et al. 2018). Recent developments in language modelling have not yet propagated to DH, mostly due to technical boundaries (Suissa et al. 2022). Models are trained on terabytes of data, and their performance is optimised using sophisticated engineering tricks; this is difficult to replicate without advanced computational resources and technical expertise. Furthermore, the underlying decision processes of these models are unintuitive since they rely on complex computations involving high-dimensional matrix transformations (Underwood 2019). Previous NLP methods had additional limitations that hindered their adoption in DH:

1 Vectors cannot adequately represent contextual information (Peters et al. 2018).
2 Datasets in DH are often too small and too domain-specific to be processed with neural models (Suissa et al. 2022).
3 Language models do not work well for language families outside Indo-European or, more strictly, for non-English languages (Joshi et al. 2020).

In this section, we address these past problems and illustrate how progress in language modelling can at least partially alleviate these obstacles, opening new opportunities for multilingual DH researchers.

Figure 9.1 Language models cluster linguistic elements according to meaning using co-occurrence information from shared contexts: here the two meanings of ‘trunk’, trunk1 for the nose of an elephant and trunk2 for swimming shorts, and words that may appear in their immediate contexts.

9.2.1. Integrating Linguistic Context

Language models use data-driven techniques to learn vector representations based on patterns like co-occurrence in contexts, following the distributional hypothesis that words appearing in similar contexts have similar meanings (Firth 1957; Harris 1954). Earlier models created static, invariant word representations, but this approach lacked sophistication and limited the analysis of semantic differences, which is crucial for most DH scenarios. More advanced techniques now create dynamic representations that change according to the linguistic context, capturing meaning more accurately. Recent advancements in transformers have facilitated the development of more context-aware models with dynamic representations, such as BERT (Devlin et al. 2019) and GPT-4 (OpenAI 2023).2 While prior techniques process input sequences in sequential order, transformers process their input in parallel. Layer by layer, the representations of the individual input tokens incorporate information about other tokens that can be relevant to their meaning or their role in the sentence. This allows for better capturing of phenomena such as co-reference, even across long distances, see (1).

(1) Thomas went to the cinema to see the new film on Sunday, and he met his friend there.
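The distributional idea behind these representations can be sketched in a few lines: represent each occurrence of an ambiguous word (such as ‘trunk’ from Figure 9.1) by the bag of words in its context window, and compare occurrences with cosine similarity. The sentences and window size below are illustrative assumptions, not examples from the chapter:

```python
# Toy sketch of the distributional hypothesis: an occurrence of a word
# is represented by the words around it, so occurrences sharing context
# words end up closer in vector space. Real models learn dense vectors
# from huge corpora; this uses raw counts from three invented sentences.
from collections import Counter
import math

def context_vector(tokens, target, window=3):
    """Bag-of-words vector of the words within `window` positions of `target`."""
    i = tokens.index(target)
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return Counter(ctx)

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda c: math.sqrt(sum(n * n for n in c.values()))
    return dot / (norm(u) * norm(v))

s1 = "the elephant raised its trunk to drink from the river".split()
s2 = "the elephant dipped its trunk into the water".split()
s3 = "he packed his swimming trunk and a towel for the beach".split()

v1, v2, v3 = (context_vector(s, "trunk") for s in (s1, s2, s3))
# The two elephant occurrences share context words ('elephant', 'its'),
# so they are more similar to each other than to the swimwear use:
assert cosine(v1, v2) > cosine(v1, v3)
```

Contextualised models generalise this idea: instead of raw counts over a fixed window, each layer of a transformer re-computes a token's vector from the whole input sequence.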

The contextualised representations that transformers create also allow us to distinguish between different usages of a word, see the examples in (2a) and (2b).

(2a) The elephant with the broken tusk stood on the riverbank, idly hanging its trunk.
(2b) He was resting on the sand of the beach wearing his colourful trunk.

The capability of transformers to detect nuanced distinctions in language and meaning can aid in identifying literary events (Sims et al. 2019). It also increases the robustness of named entity recognition when applied to multilingual historical data in English, German, and French (Todorov and Colavizza 2020). These dynamic, context-incorporating representations are also useful in literary analysis, for example, to examine the sentiment surrounding a specific character: Gollum’s role and the attitudes towards him shift significantly throughout the Lord of the Rings trilogy, necessitating different representations at various points in the story.

9.2.2. Transfer Across Domains and Tasks

Transformer-based language models are notoriously data-hungry, whereas DH researchers typically use datasets at smaller scales. Such datasets are often more complex and challenging than the cleaned and normalised training and benchmark datasets used in NLP; this complexity stems from spelling variation in historical texts, errors inherent in OCR, and so on. This forms a bottleneck that affects the performance of these models when trained from scratch. Furthermore, annotated data in specialised domains and specific languages may be difficult to come by, since annotation is labour-intensive and typically requires expert knowledge. However, these models are more than capable of leveraging information from sources unrelated to the target task, thus reducing training data requirements. The prevailing framework in NLP is made up of two steps: pretraining and fine-tuning. During the pretraining phase, the language model is created via a self-supervised training objective that does not rely on annotations. Currently, the most common pretraining objective is called masked language modelling and is based on the cloze task, illustrated in (3).

(3) Today, I went to the ___ and I bought bread and sugar.

During pretraining, language models learn semantic and morphosyntactic patterns to assign higher probabilities to ‘store’ or ‘supermarket’ and lower probabilities to words that do not fit the context, such as ‘freedom’ and ‘red’. The model is optimised by applying this masking task to massive amounts of data to derive more robust patterns over diverse contexts. These learned patterns can later be reused to quickly adapt the model to downstream tasks such as word sense disambiguation, author and genre identification, or entity linking. The process of adapting a language model to target scenarios using a small number of training examples is known as the fine-tuning step. This technique, also called transfer learning, involves transferring learned patterns to a target task, domain, or language. There are now numerous pretrained language models available, as well as tutorials that show how to fine-tune them using limited computing resources and domain-specific datasets.3 Relevant examples for DH include Nguyen et al. (2020), who fine-tuned BERT for detecting OCR errors, and Jiang et al. (2021), who increased the robustness of the model on texts with OCR errors to improve domain classification. In literature analysis, Ben (2020) successfully used a South Korean pretrained model to analyse North Korean poetry despite differences in style. Transformer-based language models have also been used to analyse semantic change, even with fewer than 1,000 relevant instances in the training set (Fonteyn 2020). Martinc et al. (2020) demonstrated the detection of diachronic semantic shifts by fine-tuning on longitudinal corpora spanning eight years. In the legal field, language models have been fine-tuned for identifying violations of human rights articles and protocols, predicting case importance, and mining legal arguments (Chalkidis et al. 2019; Chalkidis et al. 2020; Poudyal et al. 2020; Zhang et al. 2021).

9.2.3. Multilingual Models

The quality and quantity of pretraining data affect language model performance. Indo-European colonial languages, especially English, dominate in terms of online presence and resources, leaving languages such as Yoruba insufficiently represented (Ruder et al. 2022), even though these languages are spoken by tens of millions of people and are used on the web. Languages that are less established fall behind even more (Joshi et al. 2020). Moreover, despite overwhelming evidence that computational approaches that work well for English do not necessarily scale well to other languages, experimental evaluation on English is often presented as a universal finding (Bender 2011). Multilingual models have emerged as an alternative, trained on concatenated datasets covering a hundred or more languages. Despite this linguistically naïve approach, popular models like mBERT generalise information across languages surprisingly well (Pires et al. 2019; Wu and Dredze 2019). This generalisation is called cross-lingual transfer, and it is important for modelling translation and code-switching. It allows models to leverage information from resource-rich languages even without explicit typological information (Beinborn and Choenni 2020; Rama et al. 2020), benefitting languages with less training data (also see the chapter in this volume by Joseph and Menon). Language-specific fine-tuning using little data can even allow model adaptation to languages not seen during pretraining (Pfeiffer, Vulić et al. 2020; Rust et al. 2021). This can be a useful tool for DH researchers working with texts from rare or understudied languages (see Section 9.3.1).

9.3. Typological Challenges

While multilingual models open up promising opportunities for DH researchers towards more diverse, cross-lingual analyses, we need to pay attention to typological challenges, in particular when working with resource-poor languages. In the following section, we describe three important challenges, stemming from typological and cross-lingual variation, for the application of language models in non-English scenarios. We also introduce existing and ongoing research trends to address them and propose concrete actions that DH researchers can take to improve the state of affairs in multilingual DH.

9.3.1. Challenge 1: Cross-lingual Differences in Data Availability

The amount of available NLP resources is far from equal across languages. While this cross-lingual disparity is mostly caused by economic conditions, it correlates with typological differences. The historical dominance of former colonial languages has led to a well-documented Indo-European bias in the field of NLP.4 While we address increasingly challenging tasks for English, the majority of language communities in the world remain without adequate tools for rudimentary linguistic analysis (Joshi et al. 2020). Lack of data prevents successful pretraining of monolingual models and is reflected in the poor downstream performance of multilingual models, for example, on named entity recognition, part-of-speech tagging, and dependency parsing (Wu and Dredze 2020).

Addressing the data imbalance between languages is far from an easy task. Recent attempts towards higher linguistic diversity aim at generating datasets for resource-poor languages by translating existing datasets (Hu et al. 2020). This process is costly and difficult to scale (Lin et al. 2021; Ruder et al. 2021) and does not account for cross-cultural differences that are lost in translation. For example, cultural traditions such as Thanksgiving cannot necessarily be assumed to be shared knowledge outside the United States (Hershcovich et al. 2022). Language-specific constraints also affect how speakers choose to express themselves. In particular, a factually correct translation might elicit different associations than the original due to the phenomenon of semantic drift (Beinborn and Choenni 2020). Recent initiatives aim at quantifying cross-cultural variation (Muthukrishna et al. 2020; Awad et al. 2018) and developing culture-sensitive datasets (Liu et al. 2021).

9.3.1.1. NLP Research

NLP research focuses on methodological solutions to overcome the limitations caused by data disparity, aiming to use available data as efficiently as possible. The pretraining and fine-tuning paradigm provides a first step to reducing

data requirements (see Section 9.2) and makes it possible to adapt already pretrained language models to new languages using only small data quantities (Pfeiffer, Rücklé et al. 2020). A full parameter update is not only computationally expensive, but it can also lead to over-fitting: the model may lose its capability to generalise across languages by focusing too much on the specific target language. Adapters are a recent alternative: small modules injected between the transformer layers of a pretrained language model (Houlsby et al. 2019; Pfeiffer, Vulić et al. 2020; Üstün et al. 2020); see Figure 9.2. During fine-tuning, the original model weights remain fixed and only the adapter weights (corresponding to less than 3% of all model parameters) are optimised (Pfeiffer, Rücklé et al. 2020). This leads to a very efficient fine-tuning procedure that can nevertheless outperform computationally heavier approaches, as it still integrates the pretraining knowledge (Bapna and Firat 2019; Pfeiffer, Rücklé et al. 2020; Pfeiffer et al. 2021). Authors have achieved performance close to or

Figure 9.2 Illustration of a language adapter encapsulated between two mBERT (transformer model) layers. The output of the bottom layer passes through the language adapter before it reaches the top layer. During fine-tuning, mBERT layers are frozen, and only the language adapter weights are updated.


surpassing fine-tuning with less than 10,000 training instances (Pfeiffer, Rücklé et al. 2020), and adapters have been shown to have a significantly faster training process (Rücklé et al. 2021). As adapters are modular components, they can be efficiently exchanged. Adapters trained for the same model are compatible with each other and can easily be swapped or combined to enable explicit cross-lingual transfer.

As an illustration of this framework, consider the following scenario. A DH researcher works with Frisian manuscripts from the 16th to 18th centuries. As Frisian corpora are relatively scarce, it might be useful to first train a task-specific entity recognition adapter on Dutch data. In combination with an adapter for Frisian, a language model could exploit cross-lingual transfer to project its knowledge about Dutch entities across languages. This modular and swappable characteristic of adapters enables adapter sharing between different researchers to facilitate reusability. For example, the platform AdapterHub provides a large variety of task- and language-specific adapters that are freely accessible (Pfeiffer, Rücklé et al. 2020).5
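The bottleneck computation sketched in Figure 9.2 can be illustrated in a few lines of plain Python. This is a minimal numerical sketch, not the actual mBERT adapter implementation: the dimensions, weights, and input below are invented for illustration, and in a real adapter the projection weights are learned during fine-tuning.

```python
# Minimal sketch of an adapter bottleneck: project the hidden state down,
# apply a non-linearity, project back up, and add a residual connection.
# All values below are invented toy numbers, not trained weights.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def adapter(hidden, w_down, w_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual."""
    bottleneck = relu(matvec(w_down, hidden))
    projected = matvec(w_up, bottleneck)
    return [h + p for h, p in zip(hidden, projected)]

# Toy example: hidden size 4, bottleneck size 2. Only w_down and w_up would
# be updated during fine-tuning; the transformer layers stay frozen.
hidden_state = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.1, 0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0, 0.1]]   # 2 x 4 down-projection
w_up = [[0.5, 0.0],
        [0.0, 0.5],
        [0.5, 0.5],
        [0.0, 0.0]]               # 4 x 2 up-projection

output = adapter(hidden_state, w_down, w_up)
print([round(x, 6) for x in output])
```

At realistic dimensions (for example, hidden size 768 and a bottleneck of a few dozen units), the two projections amount to only a small fraction of a transformer layer's parameters, which is why adapter fine-tuning updates so few weights.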

9.3.1.2. DH Expertise

DH researchers can contribute valuable domain-specific knowledge to identify conceptually related tasks and transfer potential across languages. The extensive application of freely available language models6 in DH scenarios will provide important information on cross-lingual differences and can speed up the development of more data-efficient and linguistically more plausible methods. Another important contribution from the DH is the increasing availability of digitised datasets (Davidson 2008; Spiro 2012) that pose more complex methodological challenges than the often simplified benchmark tasks used in NLP. Furthermore, guidelines such as the FAIR principles7 and initiatives such as CLARIAH8 can contribute to an open sharing of datasets that can benefit other researchers. In doing so, however, it is important to maintain cultural sensitivity and to explore technological possibilities within ethical boundaries (Hershcovich et al. 2022).

9.3.2. Challenge 2: Lexical Aspects

Multilingual models typically employ a shared vocabulary for all languages, but this might not represent all languages well. Common algorithms are based on frequency, favouring languages like English that have shallow morphology and plenty of data available. This can cause lower performance for languages where the overlap with the model vocabulary is low (Rust et al. 2021; Ács et al. 2021). The choice of tokenisation approach can further reinforce cross-lingual differences. We discuss word-based, character-based, and subword-based tokenisation approaches and their shortcomings.
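The effect of low vocabulary overlap can be made concrete with a toy computation. The model vocabulary and the word-type samples below are invented for illustration; a real measurement would use the actual tokeniser vocabulary and corpus word types.

```python
# Toy illustration of vocabulary overlap between a (hypothetical) shared
# model vocabulary and the word types of two languages. All token sets are
# invented; real studies measure this against the actual tokeniser vocabulary.

model_vocab = {"the", "house", "water", "glass", "es", "sun", "screen"}

# Invented word-type samples: one well-covered, one poorly covered language.
language_a = {"the", "house", "water", "glass"}
language_b = {"omuzi", "amanzi", "house", "enju"}

def overlap_ratio(language_types, vocab):
    """Share of a language's word types that appear verbatim in the vocabulary."""
    return len(language_types & vocab) / len(language_types)

print(overlap_ratio(language_a, model_vocab))  # high overlap
print(overlap_ratio(language_b, model_vocab))  # low overlap
```

A language in the situation of `language_b` would see most of its words broken into many small, low-frequency pieces, which is one mechanism behind the performance gap described above.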

Word-Based Tokenisation

Word-based processing can be problematic, and not only because the definition of what a word is might differ from language to language. Splitting on words—for instance, by splitting on whitespace in languages that use it—would not be sufficient to build a model vocabulary, since such a vocabulary could not handle new words that did not appear in the training corpus. This issue is even more significant for languages with a more complex morphology than English, where the same stem might appear in a multitude of constellations with different affixes. This is related to another issue with word-based tokenisation: it cannot model compositionality. Word-based tokenisation could not capture the relation between word forms connected by either inflectional (see 4a) or derivational morphology (see 4b).

(4a)
• dog
• dogs

(4b)
• happy
• happiness
• unhappiness

Character-Based Tokenisation

If we want to be completely agnostic about linguistic structure, we could instead decide to split input sentences into individual characters. The number of characters used in each language is much smaller than the vocabulary size. This allows us to represent most vocabulary items in a language with a much smaller model vocabulary than if we used individual words. Yet characters in themselves do not express meaning. Language models may struggle without explicit guidance as to what a meaning-bearing unit in the language might be, making learning even more challenging (Bjerva et al. 2016; Plank et al. 2016; Wu et al. 2016).

Subword Tokenisation

Subword tokenisation is an alternative to word- and character-based methods. It uses statistical heuristics to create a language model vocabulary whose items can correspond to independent lexical or functional items, segments of these items, or single characters.
For instance, in the mBERT vocabulary, ‘screen’ is a standalone token, ‘glasses’ is split into the stem ‘glass’ and the suffix ‘es’, and the word ‘sunglasses’ is segmented into ‘sung’, ‘lasse’, and ‘s’, lacking any linguistic motivation. However, this approach still allows related word forms to be connected while also covering novel vocabulary items. Additionally, subword tokenisation avoids the

disadvantages of character-based methods by isolating longer, more meaningful segments of language.

Subword tokenisation typically merges symbols in a language based on frequency information, starting from individual characters and progressively building longer sequences. Algorithms like Byte-Pair Encoding establish the most common symbol pairs in the training corpus, merge them into new symbols, and add these to the model vocabulary while retaining the original symbols. The algorithm ends when a pre-determined number of merges has taken place or a total symbol count in the vocabulary is reached.

Subword tokenisation algorithms have advantages, but relying on statistical heuristics rather than linguistic principles limits vocabulary coverage for many languages, impacting representation quality. For mBERT, a shared multilingual vocabulary of 115,000 subword tokens represents the vocabulary of 104 languages. Resource-rich languages will be better represented in this vocabulary due to an imbalance in training data. Additionally, the overlap between the model vocabulary and the vocabulary of resource-poor languages tends to consist of low-frequency lexical items (Wu and Dredze 2020). Non-Latin script languages have even worse coverage, as most resource-rich languages use the Latin script. Additionally, since the total number of Unicode characters from which subword tokens can be built is in the hundreds of thousands, certain characters, such as Northern Sámi 'ŧ', do not enter the model vocabulary. Tokens that cannot be represented due to missing characters are replaced with a placeholder 'unknown' token. Predictably, a higher proportion of these 'unknown' tokens correlates with lower model performance (Pfeiffer et al. 2021).
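The Byte-Pair Encoding merge loop described above can be sketched in a few lines of Python. The corpus and the number of merges are toy values for illustration; production BPE implementations operate on frequency-weighted word counts over large corpora.

```python
from collections import Counter

# Minimal sketch of Byte-Pair Encoding vocabulary construction: count symbol
# pairs, merge the most frequent pair into a new symbol, repeat. The word
# list and merge count below are toy values.

def learn_bpe(words, num_merges):
    """Return the vocabulary built by `num_merges` BPE merge operations."""
    # Each word starts as a sequence of single characters.
    sequences = [list(word) for word in words]
    vocab = {ch for seq in sequences for ch in seq}
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        (a, b), _count = pair_counts.most_common(1)[0]
        merged = a + b
        vocab.add(merged)  # original symbols are retained alongside the merge
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i : i + 2] = [merged]
                else:
                    i += 1
    return vocab

words = ["low", "lower", "lowest", "new", "newer"]
vocab = learn_bpe(words, num_merges=3)
print(sorted(vocab))
```

Note that the merges are driven purely by pair frequency: nothing in the procedure knows whether a merged symbol such as 'lo' or 'low' corresponds to a morpheme, which is exactly the weakness discussed below.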
Moreover, there is no guarantee that vocabulary items created via statistically driven subword tokenisation algorithms are linguistically meaningful, or that the same morphemes are tokenised consistently. Due to these issues, learning good representations can be challenging even for languages that otherwise have good vocabulary coverage. Consider the examples generated with the subword tokeniser of the mBERT model in (6a) and (6b).

(6a)
• screen (screen)
• sun × sc × reen (sunscreen)

(6b)
• glass × es (glasses)
• sung × lasse × s (sunglasses)

Even though the compounds sunscreen and sunglasses contain the nouns screen and glasses, respectively, these nouns might be broken down in multiple different ways depending on their context. When appearing on their own, the word screen is represented by a single subword, while the word glasses is split into a stem and a suffix. However, as soon as these nouns are paired up with the noun sun in a compound word, the 'facade' of linguistically motivated tokenisation breaks down. These examples

demonstrate how a data-driven subword tokeniser may obscure morpheme identities even for comparatively morphologically simple resource-rich languages such as English. This property of subword tokenisers has a negative impact on how well the language model may learn from distributional patterns (Ács et al. 2021).

Both subword and character-based tokenisation methods are only capable of capturing concatenative morphology such as affixation. This means that they struggle to capture stem alterations that appear even in Indo-European languages, such as those in (7a), and non-concatenative morphology widespread, for instance, in Semitic languages (see 7b).

(7a)
• goose → geese (English)
• Mutter → Mütter (German; mother—mothers)

(7b)
• kitāb → kutub (Arabic; book—books)

9.3.2.1. NLP Research

Subword tokenisation can represent vocabulary more effectively than word- or character-based tokenisation. It enables the model to learn from statistical patterns and, given sufficient data across languages, to generate a relatively unbiased vocabulary. Nonetheless, it retains an Indo-European bias, as it primarily encodes subwords from languages using the Latin script. Although these representations can generalise across languages (Artetxe et al. 2020), there is a lot to be gained from addressing this issue. As a solution to this weakness, NLP research has found that learning new subword embeddings dedicated to the target language can improve performance (Pfeiffer et al. 2021). For instance, using a tokeniser tailored to the target language can significantly improve performance compared to relying on standard multilingual tokenisation (Rust et al. 2021).

9.3.2.2. DH Expertise

When they have access to ample monolingual data, DH researchers might also attempt to adapt multilingual models using monolingually trained subword tokenisers, as described earlier. However, morpheme-based tokenisation has the potential to create a more linguistically motivated model vocabulary, which might encourage model learning. Morpheme-based tokenisation builds a vocabulary by splitting words on morpheme boundaries. This approach has improved performance in target languages, as demonstrated by Mohseni and Tebbifakhr (2019), who split Persian words into their constituent lemmas and affixes before feeding them into a custom BERT model for named entity recognition.
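A morpheme-boundary tokeniser of this kind can be sketched with a hand-written morpheme lexicon. The lexicon below is invented for illustration; an approach like Mohseni and Tebbifakhr's derives the segments from a full morphological analysis instead. Note that the allomorph 'happi' must be listed alongside 'happy', which hints at why real systems need morphological analysers rather than plain string lookup.

```python
# Sketch of morpheme-based tokenisation over a tiny, hand-written morpheme
# lexicon (invented for illustration; real systems use morphological analysers).
MORPHEMES = {"un", "happy", "happi", "ness", "sun", "glass", "es"}

def segment(word, lexicon):
    """Greedy longest-match segmentation into known morphemes.

    Returns the list of morphemes, or None if the word cannot be fully
    segmented with the given lexicon.
    """
    if word == "":
        return []
    for end in range(len(word), 0, -1):  # try the longest prefix first
        if word[:end] in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [word[:end]] + rest
    return None

print(segment("sunglasses", MORPHEMES))   # ['sun', 'glass', 'es']
print(segment("unhappiness", MORPHEMES))  # ['un', 'happi', 'ness']
```

Compare the output for sunglasses with the linguistically unmotivated mBERT segmentation 'sung', 'lasse', 's' shown in (6b).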


More recently, Nzeyimana and Niyongabo Rubungo (2022) took a different approach, proposing a two-tier BERT architecture to capture morphological compositionality more explicitly. The first tier uses a morphological analyser to encode morphological information, and the second tier transfers this information to create sentence-level representations. Their model outperformed other models on a series of tasks—including news categorisation, sentiment analysis, question-answering, co-reference resolution, and more—in Kinyarwanda, a morphologically complex Bantu language spoken in Rwanda.

There are, however, several reasons why transformer-based language models use subword tokenisation instead of morpheme-based tokenisation. First, morpheme-based tokenisation requires linguistic expertise and a well-performing morphological parser to generate vocabulary items. This makes the approach less scalable for multilingual models that are meant to cover 100+ languages, or for resource-poor languages lacking these tools. This state of affairs is not helped by the fact that the people who build massively multilingual models are typically engineers without in-depth knowledge of the individual languages. Morphological parsers also introduce a processing overhead for new input sequences and run into ambiguous analyses, especially in morphologically rich languages. Consider the examples in (8).

(8)
• dob-om 'my drum' (Hungarian)
• dob-om 'I am throwing (it)' (Hungarian)

Nevertheless, we argue that in settings where one has linguistic expertise and is focusing on a small set of languages, morpheme-based tokenisation might contribute to significant improvements in language model performance. Better collaboration between DH and NLP experts is needed to realise the potential that lies in these solutions.

9.3.3. Challenge 3: Linguistic Categorisation

Cross-lingual transfer is a crucial advantage of multilingual models, allowing them to leverage information between languages and deliver better performance for resource-poor languages (see also Section 9.2.3). Linguistic categorisation could provide insights into the factors influencing this transfer. Previous studies investigate cross-lingual transfer by pretraining or fine-tuning a language model on a source language and then testing its performance on a target language. High performance on the target language may indicate strong cross-lingual transfer. Similarity in terms of syntactic features is one factor that seems to correlate with this transfer (Pires et al. 2019; K et al. 2020; Ahmad et al. 2021). Studies have also examined the impact of pragmatic factors, such as the proportion of multiword expressions that can be literally translated between two languages (Sun et al. 2021). An example of such an expression is given in (9a), while (9b) illustrates a case where literal translation is not possible. The authors found a

positive correlation between transfer and the number of shared idiomatic multiword expressions, especially for tasks where pragmatics is important, such as sentiment analysis.

(9a)
• like father like son (English)
• tel père tel fils (French)

(9b)
• like father like son (English)
• nem esett messze az alma a fájától (Hungarian)
• not fell far the apple the tree-from (gloss)
• 'like father like son'

Various research efforts have been concentrating on improving language model performance for typologically diverse languages, but addressing this state of affairs remains one of the most difficult open challenges in multilingual NLP.

9.3.3.1. NLP Research

Within the field of NLP, there are broadly two approaches to improving language model knowledge of various languages using linguistic categorisation. The first research direction encourages so-called hard parameter sharing (Üstün et al. 2020) by making an informed selection of source and target languages to facilitate cross-lingual transfer. Approaches typically use ranking algorithms based on various features, including syntactic, phonological, and morphological similarity, as well as language family membership. Improvements have been shown for tasks such as machine translation, part-of-speech tagging, and dependency parsing (Lin et al. 2019; Zhou and Waibel 2021; Mi et al. 2021).

The second research direction comprises approaches that directly encode typological information within the processing pipeline of the language model: this is called 'soft' transfer, as it relies on facilitating automatic parameter sharing by the language model directly (Üstün et al. 2020). This usually involves using typological information from resources such as URIEL (Littell et al. 2017), which converts typological properties into vectors that can be computationally processed. A simple approach is to append these typological vectors to the input. Kaing et al. (2021), for instance, find that this approach facilitates cross-lingual transfer—and hence improved performance—on a series of typologically diverse languages, such as German, Korean, Burmese, Japanese, and Hungarian, in the task of syntactic parsing. A more complex approach involves using typological vectors to generate adapter modules for language model adaptation (Üstün et al. 2020, 2022; Ansell et al. 2021). This approach has been shown to lead to increased performance on tasks such as part-of-speech tagging, dependency parsing, morphological tagging, and named entity recognition in languages such as Arabic, Basque, Finnish, Japanese, and Turkish, among others (Üstün et al. 2022).
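The first direction, informed source-language selection, can be sketched as a similarity ranking over typological feature vectors in the spirit of URIEL/lang2vec. The binary feature vectors below are invented for illustration; real rankers such as that of Lin et al. (2019) combine typological similarity with data size and other criteria.

```python
import math

# Sketch of ranking candidate source languages for cross-lingual transfer by
# cosine similarity of typological feature vectors. The binary vectors below
# are invented toy values, not actual URIEL/lang2vec features.

FEATURES = {  # 1 = feature present, 0 = absent (hypothetical values)
    "dutch":     [1, 1, 0, 1, 0, 1],
    "german":    [1, 1, 0, 1, 1, 1],
    "english":   [1, 0, 0, 1, 0, 0],
    "hungarian": [0, 0, 1, 0, 1, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_sources(target, candidates):
    """Order candidate source languages by typological similarity to the target."""
    scores = {lang: cosine(FEATURES[target], FEATURES[lang]) for lang in candidates}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_sources("dutch", ["german", "english", "hungarian"]))
```

Under these toy features, German ranks above English and Hungarian as a transfer source for Dutch, mirroring the intuition that typologically closer languages tend to transfer better.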

9.3.3.2. DH Expertise

Integrating linguistic categorisation into multilingual models is still an open research area in which only limited recommendations can be made. Since the direct integration of linguistic categorisation may require advanced technical resources, it is more productive to focus on a better selection of source and target languages to facilitate successful model adaptation. Ranking approaches might be useful for this, but they would benefit from the input of DH experts. An alternative is to use the technique recommended in Section 9.3.2 and retrain the input layer of a monolingual language model of a language typologically similar to the resource-poor target language. de Vries et al. (2021), for instance, show that adapting a monolingual Dutch language model in this way to Gronings and West Frisian may improve performance even with small amounts of training data available. For DH researchers with linguistic expertise, we recommend working more closely with NLP experts to draw attention to linguistic features characteristic of the language or languages in focus (Adebara and Abdul-Mageed 2022).

9.4. Conclusion

Progress in language modelling makes it easier for DH researchers to apply NLP methods in a multilingual setting. However, current models are biased due to differences between languages in terms of global position, cultural context, and typological properties. This may affect their ability to deal with various languages. In this chapter, we identified central challenges to the application of multilingual language models relating to data imbalance (Challenge 1), vocabulary (Challenge 2), and broader linguistic categorisation (Challenge 3). We suggest addressing these challenges through more data-efficient fine-tuning methods, tailoring the tokeniser and input layer of models, and careful selection of languages for cross-lingual transfer. Additionally, we encourage closer collaboration between NLP specialists and DH experts who can provide specialised knowledge and an awareness of distinctive features of their target languages. This could result in more effective methodological solutions that take linguistic variety into consideration and capture more intricate patterns and relationships, which are crucial for DH tasks. This would enable synergetic developments in both fields.

Notes

1 A challenge shared by DH workflows; see the chapter by Horvath and colleagues in this volume.
2 GPT-4 also drives ChatGPT.
3 We recommend the tutorials provided by Hugging Face (https://huggingface.co/docs/transformers/training) and those created by Simple Transformers, for instance https://simpletransformers.ai/docs/classification-specifics/.
4 This imbalance is also addressed by the chapter by Bordonaba-Plou and Jreis-Navarro in this volume.

5 https://adapterhub.ml
6 https://huggingface.co/models
7 www.go-fair.org/fair-principles/
8 www.clariah.nl

Bibliography Ács, Judit, Dániel Lévai, Dávid Márk Nemeskey, and András Kornai. 2021. “Evaluating Contextualized Language Models for Hungarian.” arXiv. http://arxiv.org/abs/2102.10848. Adebara, Ife, and Muhammad Abdul-Mageed. 2022. “Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go.” In Proceedings of the 60th Annual Meeting of the ACL (Volume 1: Long Papers), 3814–3841. Dublin, Ireland: ACL. https:// doi.org/10.18653/v1/2022.acl-long.265. Ahmad, Wasi, Haoran Li, Kai-Wei Chang, and Yashar Mehdad. 2021. “Syntax-Augmented Multilingual BERT for Cross-Lingual Transfer.” In Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 4538–4554. Online: ACL. https://doi.org/10.18653/ v1/2021.acl-long.350. Ansell, Alan, Edoardo Maria Ponti, Jonas Pfeiffer, Sebastian Ruder, Goran Glavaš, Ivan Vulić, and Anna Korhonen. 2021. “MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer.” In Findings of the ACL: EMNLP 2021, 4762–4781. Punta Cana, Dominican Republic: ACL. https://doi.org/10.18653/v1/2021.findings-emnlp.410. Artetxe, Mikel, Sebastian Ruder, and Dani Yogatama. 2020. “On the Cross-Lingual Transferability of Monolingual Representations.” In Proceedings of the 58th Annual Meeting of the ACL, 4623–4637. Online: ACL. https://doi.org/10.18653/v1/2020.acl-main.421. Awad, Edmond, Sohan Dsouza, Richard Kim, Jonathan Schulz, Joseph Henrich, Azim Shariff, Jean-François Bonnefon, and Iyad Rahwan. 2018. “The Moral Machine Experiment.” Nature 563 (7729): 59–64. https://doi.org/10.1038/s41586-018-0637-6. Bapna, Ankur, and Orhan Firat. 2019. “Simple, Scalable Adaptation for Neural Machine Translation.” In Proceedings of the 2019 Conference on EMNLP and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1538–1548. Hong Kong, China: ACL. https://doi.org/10.18653/v1/D19-1165. Beinborn, Lisa, and Rochelle Choenni. 
2020. “Semantic Drift in Multilingual Representations.” Computational Linguistics 46 (3): 571–603. https://doi.org/10.1162/coli_a_00382. Ben. 2020. “Language Models & Literary Clichés: Analyzing North Korean Poetry with BERT.” Digital NK. https://digitalnk.com/blog/2020/10/01/language-models-literary-cliches-analyzing-north-korean-poetry-with-bert/. Bender, Emily M. 2011. “On Achieving and Evaluating Language-Independence in NLP.” Linguistic Issues in Language Technology 6 (October). https://doi.org/10.33011/lilt.v6i.1239. Benzebouchi, Nacer Eddine, Nabiha Azizi, Monther Aldwairi, and Nadir Farah. 2018. “Multi-Classifier System for Authorship Verification Task Using Word Embeddings.” In 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), 1–6. https://doi.org/10.1109/ICNLSP.2018.8374391. Bjerva, Johannes, Barbara Plank, and Johan Bos. 2016. “Semantic Tagging with Deep Residual Networks.” In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan. Chalkidis, Ilias, Ion Androutsopoulos, and Nikolaos Aletras. 2019. “Neural Legal Judgment Prediction in English.” In Proceedings of the 57th Annual Meeting of the ACL, 4317–4323. Florence, Italy: ACL. https://doi.org/10.18653/v1/P19-1424.


Chalkidis, Ilias, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. “LEGAL-BERT: The Muppets Straight Out of Law School.” In Findings of the ACL: EMNLP 2020, 2898–2904. Online: ACL. https://doi.org/10.18653/v1/2020.findings-emnlp.261. Davidson, Cathy N. 2008. “Humanities 2.0: Promise, Perils, Predictions.” PMLA/Publications of the Modern Language Association of America 123 (3): 707–717. https://doi.org/10.1632/pmla.2008.123.3.707. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the NAACL: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, MN: ACL. https://doi.org/10.18653/v1/N19-1423. Firth, J. R. 1957. Papers in Linguistics 1934–1951. London: Oxford University Press. Fonteyn, Lauren. 2020. “What About Grammar? Using BERT Embeddings to Explore Functional-Semantic Shifts of Semi-Lexical and Grammatical Constructions.” In Proceedings of the Workshop on Computational Humanities Research (CHR 2020). Harris, Zellig S. 1954. “Distributional Structure.” WORD 10 (2–3): 146–162. https://doi.org/10.1080/00437956.1954.11659520. Hershcovich, Daniel, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, et al. 2022. “Challenges and Strategies in Cross-Cultural NLP.” In Proceedings of the 60th Annual Meeting of the ACL (Volume 1: Long Papers), 6997–7013. Dublin, Ireland: ACL. https://doi.org/10.18653/v1/2022.acl-long.482. Houlsby, Neil, Andrei Giurgiu, Stanisław Jastrzebski, and Bruna Morrone. 2019. “Parameter-Efficient Transfer Learning for NLP.” In Proceedings of the 36th International Conference on Machine Learning. Long Beach, CA: PMLR. Hu, Junjie, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020.
“XTREME: A Massively Multilingual Multi-Task Benchmark for Evaluating Cross-Lingual Generalisation.” In Proceedings of the 37th International Conference on Machine Learning, 4411–4421. PMLR. https://proceedings.mlr.press/v119/hu20b. html. Jiang, Ming, Yuerong Hu, Glen Worthey, Ryan C. Dubnicek, Ted Underwood, and J. Stephen Downie. 2021. “Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts.” CHR 14. Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. “The State and Fate of Linguistic Diversity and Inclusion in the NLP World.” In Proceedings of the 58th Annual Meeting of the ACL, 6282–6293. Online: ACL. https://doi. org/10.18653/v1/2020.acl-main.560. K, Karthikeyan, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. “Cross-Lingual Ability of Multilingual BERT: An Empirical Study.” In International Conference on Learning Representations. https://openreview.net/forum?id=HJeT3yrtDr. Kaing, Hour, Chenchen Ding, Katsuhito Sudoh, Masao Utiyama, Eiichiro Sumita, and Satoshi Nakamura. 2021. “Multi-Source Cross-Lingual Constituency Parsing.” In Proceedings of the 18th International Conference on Natural Language Processing (ICON), 341–346. National Institute of Technology Silchar, Silchar, India: NLP Association of India (NLPAI). https://aclanthology.org/2021.icon-main.41. Lin, Bill Yuchen, Seyeon Lee, Xiaoyang Qiao, and Xiang Ren. 2021. “Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning.” In Proceedings of the 59th Annual Meeting of the ACL and the 11th

International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1274–1287. Online: ACL. https://doi.org/10.18653/v1/2021.acl-long.102. Lin, Yu-Hsiang, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, et al. 2019. “Choosing Transfer Languages for Cross-Lingual Learning.” In Proceedings of the 57th Annual Meeting of the ACL, 3125–3135. Florence, Italy: ACL. https://doi.org/10.18653/v1/P19-1301. Littell, Patrick, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. “URIEL and Lang2vec: Representing Languages as Typological, Geographical, and Phylogenetic Vectors.” In Proceedings of the 15th Conference of the European Chapter of the ACL: Volume 2, Short Papers, 8–14. Valencia, Spain: ACL. https://aclanthology.org/E17-2002. Liu, Fangyu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. 2021. “Visually Grounded Reasoning Across Languages and Cultures.” In Proceedings of the 2021 Conference on EMNLP, 10467–10485. Online and Punta Cana, Dominican Republic: ACL. https://doi.org/10.18653/v1/2021.emnlp-main.818. Martinc, Matej, Petra Kralj Novak, and Senja Pollak. 2020. “Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift.” In Proceedings of the Twelfth Language Resources and Evaluation Conference, 4811–4819. Marseille, France: European Language Resources Association. https://aclanthology.org/2020.lrec-1.592. Mi, Chenggang, Shaolin Zhu, Yi Fan, and Lei Xie. 2021. “Incorporating Typological Features into Language Selection for Multilingual Neural Machine Translation.” In Web and Big Data, edited by Leong Hou U, Marc Spaniol, Yasushi Sakurai, and Junying Chen, 12858:348–457. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-85896-4_27. Mohseni, Mahdi, and Amirhossein Tebbifakhr. 2019.
“MorphoBERT: A Persian NER System with BERT and Morphological Analysis.” In Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) Co-Located with ICNLSP 2019 — Short Papers, 23–30. Trento, Italy: ACL. https://aclanthology. org/2019.nsurl-1.4. Muthukrishna, Michael, Adrian V. Bell, Joseph Henrich, Cameron M. Curtin, Alexander Gedranovich, Jason McInerney, and Braden Thue. 2020. “Beyond Western, Educated, Industrial, Rich, and Democratic (WEIRD) Psychology: Measuring and Mapping Scales of Cultural and Psychological Distance.” Psychological Science 31 (6): 678–701. https:// doi.org/10.1177/0956797620916782. Nguyen, Thi Tuyet Hai, Adam Jatowt, Nhu-Van Nguyen, Mickael Coustaty, and Antoine Doucet. 2020. “Neural Machine Translation with BERT for Post-OCR Error Detection and Correction.” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 333–336. Virtual Event China: ACM. https://doi.org/10.1145/3383583.3398605. Nzeyimana, Antoine, and Andre Niyongabo Rubungo. 2022. “KinyaBERT: A Morphology-Aware Kinyarwanda Language Model.” In Proceedings of the 60th Annual Meeting of the ACL (Volume 1: Long Papers), 5347–5363. Dublin, Ireland: ACL. https://doi. org/10.18653/v1/2022.acl-long.367. OpenAI. 2023. “GPT-4 Technical Report.” arXiv. http://arxiv.org/abs/2303.08774. Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” arXiv. http://arxiv.org/abs/1802.05365. Pfeiffer, Jonas, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. “AdapterFusion: Non-Destructive Task Composition for Transfer Learning.” In

162

Marcell Richard Fekete, Johannes Bjerva, and Lisa Beinborn

Proceedings of the 16th Conference of the European Chapter of the ACL: Main Volume, 487–503. Online: ACL. https://doi.org/10.18653/v1/2021.eacl-main.39. Pfeiffer, Jonas, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. “AdapterHub: A Framework for Adapting Transformers.” In Proceedings of the 2020 Conference on EMNLP: System Demonstrations, 46–54. Online: ACL. https://doi.org/10.18653/v1/2020.emnlp-demos.7. Pfeiffer, Jonas, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. “MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer.” In Proceedings of the 2020 Conference on EMNLP (EMNLP), 7654–7673. Online: ACL. https://doi. org/10.18653/v1/2020.emnlp-main.617. Pfeiffer, Jonas, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2021. “UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.” In Proceedings of the 2021 Conference on EMNLP, 10186–10203. Online and Punta Cana, Dominican Republic: ACL. https://doi.org/10.18653/v1/2021.emnlp-main.800. Pires, Telmo, Eva Schlinger, and Dan Garrette. 2019. “How Multilingual Is Multilingual BERT?” In Proceedings of the 57th Annual Meeting of the ACL, 4996–5001. Florence, Italy: ACL. https://doi.org/10.18653/v1/P19-1493. Plank, Barbara, Anders Søgaard, and Yoav Goldberg. 2016. “Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss.” In Proceedings of the 54th Annual Meeting of the ACL (Volume 2: Short Papers), 412–418. Berlin, Germany: ACL. https://doi.org/10.18653/v1/P16-2067. Poudyal, Prakash, Jaromir Savelka, Aagje Ieven, Marie Francine Moens, Teresa Goncalves, and Paulo Quaresma. 2020. “ECHR: Legal Corpus for Argument Mining.” In Proceedings of the 7th Workshop on Argument Mining, 67–75. Online: ACL. https://aclanthology. org/2020.argmining-1.8. Rama, Taraka, Lisa Beinborn, and Steffen Eger. 2020. 
“Probing Multilingual BERT for Genetic and Typological Signals.” In Proceedings of the 28th International Conference on Computational Linguistics, 1214–1228. Barcelona, Spain (Online): International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.105. Rücklé, Andreas, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. 2021. “AdapterDrop: On the Efficiency of Adapters in Transformers.” In Proceedings of the 2021 Conference on EMNLP, 7930–7946. Online and Punta Cana, Dominican Republic: ACL. https://doi.org/10.18653/v1/2021.emnlp-main.626. Ruder, Sebastian, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, et al. 2021. “XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation.” In Proceedings of the 2021 Conference on EMNLP, 10215–10245. Online and Punta Cana, Dominican Republic: ACL. https://doi.org/10.18653/v1/2021. emnlp-main.802. Ruder, Sebastian, Ivan Vulić, and Anders Søgaard. 2022. “Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold.” In Findings of the ACL: ACL 2022, 2340–2354. Dublin, Ireland: ACL. https://doi.org/10.18653/v1/2022. findings-acl.184. Rust, Phillip, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2021. “How Good Is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models.” In Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 3118–3135. Online: ACL. https://doi.org/10.18653/v1/2021.acl-long.243. Saussure, Ferdinand de. 1983. Course in General Linguistics. London: Duckworth.

Multilingual Language Models in the Digital Humanities 163 Sims, Matthew, Jong Ho Park, and David Bamman. 2019. “Literary Event Detection.” In Proceedings of the 57th Annual Meeting of the ACL, 3623–3634. Florence, Italy: ACL. https://doi.org/10.18653/v1/P19-1353. Spiro, Lisa. 2012. “‘This Is Why We Fight’: Defining the Values of the Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew K. Gold, 16–35. University of Minnesota Press. https://doi.org/10.5749/minnesota/9780816677948.003.0003. Suissa, Omri, Avshalom Elmalech, and Maayan Zhitomirsky-Geffet. 2022. “Text Analysis Using Deep Neural Networks in Digital Humanities and Information Science.” Journal of the Association for Information Science and Technology 73 (2): 268–287. https://doi. org/10.1002/asi.24544. Sun, Jimin, Hwijeen Ahn, Chan Young Park, Yulia Tsvetkov, and David R. Mortensen. 2021. “Cross-Cultural Similarity Features for Cross-Lingual Transfer Learning of Pragmatically Motivated Tasks.” In Proceedings of the 16th Conference of the European Chapter of the ACL: Main Volume, 2403–2414. Online: ACL. https://doi.org/10.18653/v1/2021. eacl-main.204. Todorov, Konstantin, and Giovanni Colavizza. 2020. “Transfer Learning for Named Entity Recognition in Historical Corpora.” CLEF (Working Notes) 12. Underwood, Ted. 2019. “Do Humanists Need BERT?” The Stone and the Shell (blog). July 15, 2019. https://tedunderwood.com/2019/07/15/do-humanists-need-bert/. Üstün, Ahmet, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. “UDapter: Language Adaptation for Truly Universal Dependency Parsing.” In Proceedings of the 2020 Conference on EMNLP (EMNLP), 2302–2315. Online: ACL. https://doi. org/10.18653/v1/2020.emnlp-main.180. Üstün, Ahmet, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2022. “UDapter: Typology-Based Language Adapters for Multilingual Dependency Parsing and Sequence Labeling.” Computational Linguistics 48 (3): 555–592. https://doi.org/10.1162/ coli_a_00443. 
Vries, Wietse de, Martijn Bartelds, Malvina Nissim, and Martijn Wieling. 2021. “Adapting Monolingual Models: Data Can Be Scarce When Language Similarity Is High.” In Findings of the ACL: ACL-IJCNLP 2021, 4901–4907. Online: ACL. https://doi.org/10.18653/ v1/2021.findings-acl.433. Wu, Shijie, and Mark Dredze. 2019. “Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT.” In Proceedings of the 2019 Conference on EMNLP and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 833–844. Hong Kong, China: ACL. https://doi.org/10.18653/v1/D19-1077. Wu, Shijie, and Mark Dredze. 2020. “Are All Languages Created Equal in Multilingual BERT?” In Proceedings of the 5th Workshop on Representation Learning for NLP, 120– 130. Online: ACL. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation.” arXiv. http://arxiv.org/ abs/1609.08144. Zhang, Gechuan, David Lillis, and Paul Nulty. 2021. “Can Domain Pre-Training Help Interdisciplinary Researchers from Data Annotation Poverty? A Case Study of Legal Argument Mining with BERT-Based Transformers.” In Proceedings of the Workshop on Natural Language Processing for Digital Humanities (NLP4DH)., 10. NLPAI. Zhou, Zhong, and Alexander Waibel. 2021. “Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Text-Based

164 Marcell Richard Fekete, Johannes Bjerva, and Lisa Beinborn Translation.” In Proceedings of the Third Workshop on Computational Typology and Multilingual NLP, 67–80. Online: ACL. https://doi.org/10.18653/v1/2021.sigtyp-1.7. Zhu, Lixing, Gabriele Pergola, Lin Gui, Deyu Zhou, and Yulan He. 2021. “Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection.” In Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1571–1582. Online: ACL. https://doi. org/10.18653/v1/2021.acl-long.125.

10 Data Scarcity and Methodological Limitations in Multilingual Analysis of News Articles Published in Brazil

Caio Mello

10.1. Introduction

Current debates on multilingual digital humanities have frequently been motivated by questions that go beyond cross-lingual analysis, addressing topics such as the hegemony of Anglophone scholarship (Fiormonte, 2012; Nilsson-Fernàndez and Dombrowski, 2022). Bordonaba-Plou and Jreis-Navarro (in this volume) cover the consequences of this predominance of the English language. More than building up barriers to accessing the knowledge produced, it excludes the diversity of socio-cultural contexts and their specific challenges. An example of this is the widespread concept of Media Literacy in contrast with Latin American Educommunication.1 The latter is rarely found in journal publications written in English, even though it has been developed since the 1960s (Mateus and Quiroz Velasco, 2017). What Educommunication offers as a contribution to this discussion, however, is the necessity of developing approaches that consider the specificities of local issues. Strongly based on the work of Paulo Freire (1968), it treats education for the use of communication and media as an emancipatory tool for fighting social inequalities.

Recent applications of natural language processing techniques to the study of social phenomena have shown considerable potential for answering humanities questions, for example by examining how machines were represented as animate in 19th-century news articles published in England (Ahnert et al., 2020). However, these techniques face significant limitations when applied to languages other than English, which restricts their use for studying subjects located outside the Anglophone sphere. A critical analysis of the current challenges and potential solutions is, therefore, necessary. Fekete and Beinborn (in this volume), for example, propose solutions to improve the performance of such algorithms, especially when applied to under-resourced languages.
This chapter stems from reflections within my PhD research in digital humanities at the School of Advanced Study. It is based on my experience in adapting digital methods developed for the English language to study content produced originally in Brazilian Portuguese. Moreover, it presents a methodology focused on a critical analysis of algorithmic performance by making use of explainable AI. To this end, I rely on a case study to discuss the potential application of these techniques, how and why they were chosen, and what barriers were encountered when adapting them to another language.

DOI: 10.4324/9781003393696-14

10.2. Case Study: The Media Coverage of the Olympic Legacy

The study I have been conducting, ‘Nationalism, internationalism and sporting identity: the London and Rio Olympics’, started in August 2019 as part of the project CLEOPATRA,2 which aims to make sense of the massive digital media coverage generated by the intense disruption in Europe over the past decade. This includes not only dramatic events like the migration crisis or major political ruptures such as Brexit but also the impact of mega sporting events like the London Olympics. I joined an interdisciplinary group of 15 researchers with backgrounds in computer science, mathematics, linguistics, history, and media studies, based in five different countries, to develop methods, datasets, and theoretical frameworks to investigate these phenomena. The highly NLP-focused approach of the project offered me a wide range of tools and techniques, introduced over training sessions, and allowed me to explore their potential for responding to humanities questions. Based on this, I proposed a cross-lingual and cross-cultural study of the media coverage of the London 2012 and Rio 2016 Olympic legacies.

But why does legacy matter? As soon as a city launches its candidature to host the Olympic Games, the question at the centre of the bidding process is: what are the benefits of hosting one of the biggest sporting events in the world? Despite being a pillar of the Olympic Movement, and regularly invoked to justify cities’—and nations’—participation in the event, the concept of legacy is not clearly defined and continues to require effort by scholars and members of the International Olympic Committee (IOC) to determine—or even get closer to—exactly what it means (Preuss, 2006; Cashman and Horne, 2013).
Gratton and Preuss (2008) describe three reasons why a positive legacy is one of the main concerns of the IOC: it prevents people from blaming the IOC after the event, it justifies the use of scarce public resources, and it encourages other cities to bid for future events. This last reason is fundamental to guaranteeing the future of the Olympic Games, which have been facing a drop in interest from candidate cities. There are many reasons why governments have decided to avoid organising the Games, among them the rise in costs, the reduction in profit due to the concentration of media licences in the hands of the IOC, and criticism from the press and the public about the failure to apply those investments for the benefit of local communities.

Although highly associated with the positive outcomes that remain after the event, Olympic legacy can also be negative. A quick Google search on the Athens Olympics will illustrate how that event transformed the Greek capital into what authors like Papanikolaou (2013) have called the ‘white elephants Olympic City’. White elephant is a common expression in journalistic texts reporting on the negative legacies of major sporting events. It usually refers to infrastructure constructed for the Games that was abandoned right after the event or remains in a precarious condition, no longer serving the purpose for which it was planned. For Leopkey and Parent (2012), white elephants are among the most representative negative aspects of Olympic legacies.

Facing this challenging environment, Olympic legacy remains crucial for the continuation of the Games. It is part of a ‘legitimating narrative’ (Poynter and MacRury, 2009, p. 315). The term ‘legacy’ became standard in Olympic discourse as it represents a balance between the promise of a Games—what is expected—and the municipal economic realities—what is delivered (Poynter and MacRury, 2012). Despite the huge investments cities make to develop and communicate their legacy plans, it is in the media discourse that the impacts of the event are discussed and reinvented. ‘The media help to “imagineer” images of host locations as “world class”—for people inside as well as outside the host city and country’ (Bairner and Molnar, 2010, p. 37). Bairner and Molnar (2010) describe the different uses and framings of the word legacy as the ‘imagineering of legacies’: the idea of a collective, mediated interpretation of the phenomenon. Thinking of legacy as media narrative is a reflection on the role of the media in the construction of legacy storytelling. Drawing on Berger and Luckmann’s (1966) account of the social construction of reality, this study analysed journalistic articles as key elements in how the concept of legacy is perceived and interpreted.

The main question addressed by this research was: how has the concept of Olympic legacy been approached by the Brazilian and British media, considering the different social, political, economic, linguistic, and cultural contexts in which the two events occurred? To answer this question, I collected news articles published online by the Brazilian and British media, in Portuguese and English, to create the corpus of analysis.
This chapter focuses on two main challenges I faced while conducting my research: data scarcity and the limitations of natural language processing algorithms for Portuguese. The first refers to the difficulties in accessing primary sources—in this case, news articles—caused either by the lack of a diverse media production in Brazil, when compared to the United Kingdom, or by the limitations of the databases available for collecting information. Here, the absence of a Brazilian web archive was the most significant limitation. The second covers the disparity between NLP resources produced for English and those available for Portuguese. In particular, I look at two specific techniques, named entity recognition and sentiment analysis, to discuss how their poor performance affected my investigation and what the possible solutions are to overcome these limitations. Before I dive into these challenges, a brief introduction to the methodology is necessary.

10.3. A Cross-cultural Approach

From the beginning, this project was intended to be a comparative study of the media coverage of the London 2012 and Rio 2016 Olympic legacies. Legacies are not simply the result of the Games, as discussed earlier, but also one of the most
important reasons for the event to exist. Often, countries use the Olympic legacy to justify investments in urban development that would otherwise be difficult to obtain—at least in such a short period of time. This ‘catalyst’ property is, however, commonly accompanied by problems such as overspending, non-fulfilment of promises, and failures in planning. Because of the amount of public money and societal expectation involved, these events attract the attention of the media, which tend to follow the development of legacy plans from the moment the host city is chosen; it is difficult, however, to identify precisely when a legacy is concluded. The legacy timeline is unclear because once the Games end with the closing ceremony, the discussion about legacy receives even more attention and remains in the media.

More than tracking the evolution of media discourses over time, it seemed a good strategy to compare two scenarios—the London and Rio Olympics—to identify similarities and differences between an event hosted for the third time3 in one of the richest cities in the world and one happening for the first time in the Global South. Questions regarding the importance the media of each city would give to infrastructure, or the different expectations about the benefits of hosting a mega event, were expected to be answered through a study that combined quantitative and qualitative research methods.

The pipeline designed for this research was divided into four main stages: data collection, filtering, cleaning, and analysis. From the first stage onwards, the issues of working with multilingual data were identified, and strategies had to be developed to overcome these challenges.

10.4. Data Scarcity

For the creation of the dataset, news articles published online covering the topic ‘Olympic legacy’ were scraped using Python packages such as Boilerpipe. Sources were selected based on the popularity of the news websites, according to the Ofcom News Consumption Survey 2020 (Ofcom, p. 56) for news published by British outlets and the Digital News Report by the Reuters Institute for the Study of Journalism (Newman et al., 2020, p. 90) for Brazilian outlets. As this study aims to capture news articles that are most ‘visible’ online, they were searched for using Google’s search engine. Time has a strong influence on Google’s ranking; therefore, to capture news articles published in the past that are no longer necessarily accessible online, we relied on the UK Web Archive via the platform SHINE, a tool for exploring data from .uk websites collected by the Internet Archive between 1996 and April 2013. Although much of the content of the UK Web Archive can only be accessed inside the British Library, SHINE is open and its data can be consulted off-site.

The lack of a national Brazilian web archive4 made it difficult to use the same procedure to find old news articles about the Rio Olympics. Even the official webpage of the Games, ‘rio2016.com.br’, was removed from the live web. Although a few snapshots of the page are available on Arquivo.pt—the Portuguese Web Archive—only one, collected in 2013, had any content on it. The others, archived in 2018, 2019, and 2020, consist of just the website homepage with the

message: ‘sorry for the inconvenience: system is under maintenance from June to July 2018 at 9am’. This means the page was abandoned for at least three years. The same page was collected by the Internet Archive, and a dive into the documents revealed that the last archived version of the fully working website was collected in January 2017. The website never returned to operation, becoming one of the many lost historical legacies of the Games. Moreover, navigating these archives is hard: we cannot simply run a script to read the multiple pages under the category ‘news’, because many pages containing news articles were never collected and stored.

Although Boilerpipe delivers good performance in scraping webpages, it does not work for all of them. To collect as many pages as possible, it was combined with other Python libraries such as Beautiful Soup (Richardson, 2007) and Newspaper3k (Ou-Yang, 2018). Still, the structure of some of the Brazilian web pages was less recognisable to these tools. While they worked for most of the British news websites, Brazilian ones like R7 could not be scraped. One possible reason is the poor accessibility of its HTML.

The corpus of articles published by the British media includes 12 outlets: BBC, Sky News, The Guardian, The Daily Mail, ITV, Buzzfeed, MSN News, The Telegraph, Yahoo News, The Independent, The Sun, and The Metro. The Brazilian corpus is composed of five outlets: UOL online, Globo, Yahoo News, Folha de S. Paulo, and O Estado de S. Paulo. There are several reasons for the discrepancy between the number of Brazilian and British outlets in the corpus. Although this chapter aims to be an overview of the challenges, and a profound analysis of this gap cannot be fully developed in this space, I will briefly describe some of the reasons.
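Before turning to those reasons, the multi-library fallback used in the collection stage can be sketched roughly as follows. This is an illustrative, standard-library-only sketch: the extractor functions are hypothetical stand-ins for the actual calls to Boilerpipe, Newspaper3k, and Beautiful Soup.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-ins for the real extractors (Boilerpipe, Newspaper3k,
# Beautiful Soup): each takes raw HTML and returns the article text, or
# raises an exception / returns None when it fails on that page structure.
Extractor = Callable[[str], Optional[str]]

def extract_with_fallback(html: str, extractors: list[Extractor]) -> Optional[str]:
    """Try each extractor in turn; return the first non-empty result."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # this extractor cannot parse the page; try the next
        if text and text.strip():
            return text.strip()
    return None  # no library recognised the page structure (as with R7)

# Toy demonstration with dummy extractors:
def always_fails(html: str) -> Optional[str]:
    raise ValueError("unrecognised page structure")

def naive_strip_tags(html: str) -> Optional[str]:
    return re.sub(r"<[^>]+>", " ", html)

html = "<html><body><p>Olympic legacy article body.</p></body></html>"
print(extract_with_fallback(html, [always_fails, naive_strip_tags]))
```

The design point is simply that each page is handed to the next library only when the previous one fails, maximising the number of recoverable articles.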
The first cause of this discrepancy is the organisational structure of the Brazilian press in comparison to that of the United Kingdom. Brazil appears in 111th position in the 2021 World Press Freedom Index (Reporters Without Borders, 2021). One of the reasons for this poor score is, according to the Reporters Without Borders document,5 the concentration of media ownership among business families with close relationships to the political class. The lack of a strong public broadcaster—such as the BBC in the UK—combined with the oligopoly of the media companies results in less diversity of publishers and, therefore, less information available to researchers interested in multiple sources. According to the Media Ownership Monitor report (2017),6 media pluralism in Brazil is at high risk on five of its ten indicators: media audience concentration, media ownership concentration, cross-media ownership concentration, ownership transparency, and political control over media funding. Less pluralism means not only fewer news articles in quantitative terms but also less diversity in the content produced by media outlets. Studying the media discourse in Brazil on the Rio Olympics therefore implies working with less data than is produced by the British press on the London Olympics. This demanded a more qualitative approach of close reading instead of relying on distant reading techniques developed for huge datasets.


10.5. Cross-linguistic Approach

I chose two widely known NLP techniques to study media narratives in the news articles collected: named entity recognition (NER) and sentiment analysis (SA). By using named entity recognition, my main goal was to identify which entities—or actors—are relevant to the Olympics discourse and to illustrate their narrative connections using networks. NER recognises semantic categories in text, assigning them a value such as ‘person’, ‘place’, or ‘date’. While it performs well in languages such as English, applying it to less resourced languages such as Portuguese remains a challenge.

To produce a network representation of the entities mentioned in the news articles, I combined different entity recognition techniques. First, I used spaCy (Honnibal and Montani, 2017) to recognise the entities and label them according to their type (e.g. person, organisation, date). However, spaCy cannot differentiate between two people with the same surname based on context. By combining spaCy with Wikifier (Brank et al., 2017), I aimed to disambiguate the entities. Wikifier7 is a semantic annotation service that links concepts to Wikipedia pages; it thereby provides each entity with a unique identifier that can be used to relate words to the correct entity. It works in most cases, with a few easily identifiable mistakes for English and significantly more issues for Portuguese. The outputs from spaCy for entity recognition in Portuguese were much poorer than those produced for English (Amaral et al., 2014). In the process of disambiguating the entities by combining spaCy outputs with Wikifier, many entities were misread. To avoid mistakes in the analysis, entities that did not correspond to the ‘vocabulary’ of the Rio Olympics were manually identified, and a search in the corpus was then performed to find out why the entity was being misinterpreted by the algorithms.
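A minimal sketch of that vocabulary check is shown below, assuming the Wikifier-linked entity labels for an article are already available as strings. The vocabulary set and the entity lists are invented examples, not the study's actual data.

```python
# Flag linked entities that fall outside a curated 'vocabulary' of the
# Rio Olympics, so they can be inspected manually against the corpus.
# This vocabulary is a hypothetical, abbreviated example.
RIO_VOCABULARY = {
    "2016 Summer Olympics", "Rio de Janeiro", "International Olympic Committee",
    "Maracanã Stadium", "Olympic Games",
}

def flag_suspect_entities(linked_entities: list[str]) -> list[str]:
    """Return entities whose Wikifier link lies outside the expected vocabulary."""
    return sorted({e for e in linked_entities if e not in RIO_VOCABULARY})

# Invented linked-entity output for one article:
entities = ["Rio de Janeiro", "CPLP Games", "Olympic Games", "Velódromo Paulistano"]
print(flag_suspect_entities(entities))  # ['CPLP Games', 'Velódromo Paulistano']
```

Each flagged entity then triggers a corpus search to discover which surface word (such as the bare jogos) caused the misattribution.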
By the ‘vocabulary’ of the Rio Olympics, I mean the set of words, organisations, entities, and government bodies that are mentioned in the media coverage of the event. The process of manually identifying mistakes is facilitated by network visualisation, where nodes can be easily recognised, as shown in Figure 10.1. The visualisation reduces the number of entities listed by displaying in larger sizes those most often mentioned together in the same documents, which helps in identifying mistakes that may be relevant to the analysis. This aspect of the method is worth noting as a qualitative intervention in the quantitative analysis: expert knowledge of the topic ‘Rio Olympics’ was essential for these mistakes to be recognised and for an approach to overcoming them to be developed.

In the network, for example, one of the central entities was CPLP Games. This entity refers to the Wikipedia page CPLP Games,8 which describes the topic as ‘a multinational multi-sport event organised by the Community of Portuguese Language Countries (CPLP)’. However, a closer look at the corpus showed that CPLP Games had been attributed to the word jogos—‘games’ in Portuguese. When the word jogos appears alone,


Figure 10.1 Network visualisation of entities generated using spaCy combined with Wikifier. Entities are connected according to how many times they are mentioned together in the same news article; this is therefore a narrative representation of the entities. The more an entity is mentioned, the bigger its node.

it is wrongly attached to CPLP Games. When jogos is mentioned together with another word, as in jogos olímpicos (Olympic Games), it is correctly attributed to Olympic Games. It is important to note that other problems may also occur: entities can be attributed to a wrong Wikidata item that is nonetheless still within the Olympics ‘vocabulary’. One example encountered was jogos olímpicos being attributed to the 2012 Summer Olympics—the London Games—when, in the specific context, it actually referred to the Rio 2016 Olympics. Likewise, velódromo (velodrome) was attributed to ‘Velódromo Paulistano’, which is located in São Paulo, not in Rio, and Olympic Stadium (estádio olímpico) was wrongly associated with ‘Estádio Olímpico Monumental’, which is in Porto Alegre, in the south of Brazil. These mistakes made the analysis of Brazilian entities in the networks impracticable in many respects.

In English, by contrast, the networks provided valuable insights into important relationships between entities, such as the close connection between Boris Johnson and the Olympic Stadium. The ex-Mayor of London was particularly interested in settling a controversy about the future of the stadium: while some public figures stood for keeping it as an athletics venue after the Games, Boris Johnson defended turning the space into a football stadium. Thanks to his intervention, the stadium was handed over to West Ham United, which still occupied the space ten years later. Although this kind of association could have been verified by close reading of the articles, it became clearer by visualising the connections between entities in the NER network. The same technique was not successful in Portuguese, as discussed earlier, which limited the possibilities of exploring the Brazilian content with this quantitative approach.
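The co-mention networks behind these visualisations can be built with a simple counting pass. The sketch below is a standard-library-only illustration (the entity lists are invented; the actual pipeline derived them from spaCy and Wikifier output): each unordered pair of entities appearing in the same article increments an edge weight, and a node grows with its total mention count.

```python
from collections import Counter
from itertools import combinations

def build_comention_network(articles: list[list[str]]):
    """Count node weights (mentions) and edge weights (co-mentions per article)."""
    nodes = Counter()
    edges = Counter()
    for entities in articles:
        unique = sorted(set(entities))
        nodes.update(unique)
        # every unordered pair of entities in the same article is an edge
        edges.update(combinations(unique, 2))
    return nodes, edges

# Invented example: entity lists extracted from three articles.
articles = [
    ["Boris Johnson", "Olympic Stadium", "West Ham United"],
    ["Boris Johnson", "Olympic Stadium"],
    ["Olympic Stadium", "West Ham United"],
]
nodes, edges = build_comention_network(articles)
print(nodes["Olympic Stadium"])                     # 3
print(edges[("Boris Johnson", "Olympic Stadium")])  # 2
```

The resulting node and edge counters can be handed to any graph library for layout; centrality and node size then fall directly out of the mention counts.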

10.6. How Sentiment Changes With Language

The second method I discuss in this chapter is sentiment analysis (SA). Also called opinion mining (Feldman, 2013), SA is a natural language processing technique for identifying emotions discursively expressed in a text (Liu, 2012). By applying SA classifiers to the news articles, my main objective was to identify how positive or negative the media coverage of the Olympic legacies of London and Rio was.

Although different techniques for working with non-English text have been developed (Dashtipour et al., 2016), many approaches rely on machine translation to transform the texts into English before applying sentiment analysis algorithms (Balahur and Turchi, 2012). By translating the source-language corpus into the target language, machine translation (Wan, 2009; Wei and Pal, 2010) has been used to address the problem of cross-lingual sentiment analysis. For this study of Brazilian news articles covering the legacy of the Rio 2016 Olympic Games, I applied both algorithms trained in Portuguese and machine translation into English. Machine translation approaches can be a problem for languages without adequate translation models, where mistakes can lead to wrong sentiment outputs. For Portuguese, studies suggest that although errors occur, automated translation is a decent approach (Araújo et al., 2020; Pereira, 2021; De Freitas and Vieira, 2015). In tests performed on this specific corpus, popular sentiment classifiers such as SenticNet, SentiStrength, and Vader presented very poor performance when applied to texts written in Portuguese, as demonstrated in previous work (Mello et al., 2022). When applied to the versions of the texts translated into English, however, the quality of the sentiment classification increased significantly. Although widely used and recommended in some cases for non-English SA (Araújo et al., 2016, 2020), machine translation affects the scores and can lead to misreadings (Dashtipour et al., 2016).
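To make this workflow concrete, here is a toy, standard-library-only sketch of a lexicon-and-rule sentiment classifier of the kind just mentioned. The tiny lexicon, the single negation rule, and the example headline are all invented stand-ins; the study itself applied classifiers such as Vader to machine-translated headlines.

```python
# Toy lexicon-and-rule scorer: look up word scores in a small dictionary
# and flip the sign after a negation word. Final scores fall in [-1, 1],
# mirroring the output range of common sentiment classifiers.
LEXICON = {"legacy": 0.3, "success": 0.8, "abandoned": -0.7, "failure": -0.8}
NEGATIONS = {"not", "no", "never"}

def score(text: str) -> float:
    total, negate = 0.0, False
    for w in text.lower().split():
        if w in NEGATIONS:
            negate = True
            continue
        if w in LEXICON:
            total += -LEXICON[w] if negate else LEXICON[w]
            negate = False
    return max(-1.0, min(1.0, total))  # clamp to [-1, 1]

def label(value: float) -> str:
    return "positive" if value > 0 else "negative" if value < 0 else "neutral"

headline_en = "olympic park abandoned no success for the legacy"
print(label(score(headline_en)))  # negative
```

Because such classifiers only look words up in a dictionary, a single translation choice (say, rendering elefante branco literally rather than idiomatically) can move a headline across the positive/negative boundary, which is exactly the risk discussed above.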
However, Balahur and Turchi (2012) evaluated the impact of machine translation from German, French, and Spanish into English, two of which are Romance languages like Portuguese, and concluded that although translation introduces noise, the result is still reliable. To test the impact of machine translation, I compared the performance of five sentiment classifiers applied to the original texts in Portuguese with the results obtained by applying them to the English translations. The texts were translated into English with the Google Translate API via the googletrans library (Han, 2020). The classifiers used are SenticNet, SentiStrength, Vader, and BERT (fine-tuned on three different datasets). They fall into three types: lexicon-based, lexicon-and-rule-based, and contextual algorithms. SentiStrength (Thelwall, 2017) is purely lexicon-based, whereas SenticNet (Cambria and Hussain, 2015) and Vader (Hutto and Gilbert, 2014) are lexicon-and-rule-based. These split the text into words (termed concepts) and look up their scores in a dictionary (Thelwall, 2014); they cannot disambiguate meanings, as they do not always take the context into account. With rule-based algorithms, syntactic rules are applied over a sentence to obtain a final score between –1 (negative) and 1 (positive). BERT9 is a machine learning model trained on a very large corpus of text such that it can encode words and sentences into vectors of real numbers while capturing the surrounding context of each word. As is done in practice, these pre-trained models are fine-tuned (re-trained) for a target task, in our case sentiment prediction. For English, we used two publicly available sentiment detection datasets, Amazon reviews (McAuley and Leskovec, 2013; Devlin et al., 2018) and tweets (Go et al., 2009), and trained two separate models, named Amazon BERT and Sent140 BERT. To use BERT for detecting sentiment in Portuguese text, we fine-tuned Portuguese BERT on a publicly available Twitter corpus (Souza et al., 2020).

For this specific test, the classifiers were applied to 464 headlines of news articles published by the Brazilian media covering the legacy of the Rio 2016 Olympic Games. Figure 10.2 shows the comparison between the labels produced by the four classifiers for the original news titles in Portuguese and for their translated versions in English; BERT trained on the Amazon and tweets datasets was compared individually with Portuguese Twitter BERT. The main purpose is to observe how often translation changes the sentiment of the sentences. The chart represents a total of 464 articles published in Portuguese, divided into three sections. The bottom part of the bars represents the number of headlines that kept the same label in both languages. The middle part shows those headlines whose sentiment flipped to the opposite polarity: from positive to negative or from negative to positive. At the top of the bars are those titles whose sentiment varied slightly, from positive or negative towards neutral, for example.
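The three-way split behind the chart can be computed with a simple tally. A minimal sketch, assuming each headline comes as a (Portuguese label, English label) pair; the sample pairs are invented for illustration.

```python
# Tally how headline labels behave under translation: kept the same label,
# flipped to the opposite polarity, or drifted (e.g. positive -> neutral).
from collections import Counter

OPPOSITE = {("positive", "negative"), ("negative", "positive")}

def compare(pairs):
    tally = Counter()
    for pt, en in pairs:
        if pt == en:
            tally["same"] += 1
        elif (pt, en) in OPPOSITE:
            tally["opposite"] += 1
        else:
            tally["slight"] += 1  # slight variation towards neutral
    return tally

pairs = [("positive", "positive"), ("positive", "negative"),
         ("negative", "neutral"), ("neutral", "neutral")]
print(compare(pairs))  # Counter({'same': 2, 'opposite': 1, 'slight': 1})
```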

10.7. Portuguese (Original) Versus English (Translation)

Figure 10.2 Variation of sentiment in the 464 news titles originally published in Portuguese in comparison with their translated version in English.

According to Figure 10.2, SenticNet and SentiStrength agreed on 72% of the labels assigned to the original and translated versions of the sentences. Vader presented slightly fewer matches, with 65% of the news headlines labelled identically. It is important to highlight the number of headlines whose sentiment flipped to the opposite label, shown in yellow: SenticNet presented the highest disagreement, around 23%, followed by Vader with 9%. Amazon BERT and Sent140 BERT presented the highest variation towards neutral results, because most outputs from Portuguese Twitter BERT were ‘neutral’. This experiment allowed the conclusion that SentiStrength presented the best equivalence between the Portuguese and English-translated versions: no change in 72% of cases, slight variation in 27%, and less than 1% of opposite results. The reason is that, as a purely lexicon-based algorithm, it outputs a similar score for words in both languages, while the others are affected by rules that take the syntax into consideration, producing different results. By looking closely at some of the sentences translated into English, some issues could be identified. The headline ‘Brasil anuncia ao COI corte de R$ 900 mi no orçamento dos Jogos’, for example, was translated as ‘Brazil announces the COI Court of R$ 900 mi in the games budget’. The word ‘corte’, however, can be translated as ‘court’ or ‘cut’; in this sentence, it should have been translated as ‘cut’. Another sentence presented the opposite mistake: the headline ‘Legado olímpico virou milhões em propina a “amigos da corte” de Cabral, diz MPF’ was translated as ‘Olympic legacy turned millions in tipping “Cut Friends” of Cabral, says MPF’, where ‘corte’ should have been translated as ‘court’. Other mistranslations occurred, such as in ‘Olimpíada é “desculpa fantástica” para mudar o Rio, diz prefeito’: its translated version, ‘Olympiad is “fantastic excuse” to change the river, says mayor’, mistakenly translated the name of the city, Rio.
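The effect of such a mistranslation on a lexicon-based score can be illustrated with a toy example: rendering ‘corte’ as ‘court’ instead of ‘cut’ lands on a different lexicon entry and erases the negative signal. The lexicon values below are invented.

```python
# The same Portuguese word, translated two ways, shifts the headline's score.
LEX = {"cut": -0.4, "court": 0.0, "budget": 0.0, "announces": 0.0}

def lexicon_score(headline):
    """Sum lexicon scores of the words; unknown words count as neutral."""
    return sum(LEX.get(w, 0.0) for w in headline.lower().split())

correct = "brazil announces cut in the games budget"
mistranslated = "brazil announces court in the games budget"
print(lexicon_score(correct), lexicon_score(mistranslated))  # -0.4 0.0
```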
These mistakes affected the sentiment scores of the news headlines and reveal the sort of issues researchers can face when using this technique to explore large amounts of automatically translated text. This highlights the need to develop more tools that work in languages other than English.

10.8. Unveiling the Black-Box: Identifying Limitations With Explainable AI

Another step taken to improve the quality of the sentiment analysis outputs was the use of explainable AI tools (Linardatos et al., 2021) such as SHAP (Lundberg and Lee, 2017), whose results indicate possible reasons why a text has been scored as negative or positive. SHAP was used on the two best-performing classifiers: Amazon BERT and Sent140 BERT. The tool uses Shapley values to depict how fairly the outcome can be distributed among the different features; the Shapley value of each feature signals its contribution to a given prediction (negative, positive, or neutral). An existing tool to quantify feature importance for BERT-based models was utilised: its input is a text and the classifier, and it returns a pictorial depiction of the Shapley values for every feature, in this case the words. The output enables easy examination of the predictions.
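SHAP approximates Shapley values for a trained model; as a self-contained illustration of the underlying idea, the sketch below computes exact Shapley values for the word features of an invented toy scorer in which ‘great’ is positive on its own but intensifies a following negative noun. Enumerating all word subsets is only feasible for short texts, but it shows how a prediction is distributed among words.

```python
# Exact Shapley values for word features of a toy sentiment model with an
# interaction term. The lexicon scores and the interaction rule are invented.
from itertools import combinations
from math import factorial

def toy_model(words):
    lex = {"great": 0.6, "debt": -0.5, "legacy": 0.2}
    score = sum(lex.get(w, 0.0) for w in words)
    # Interaction: 'great' next to 'debt' reads as 'big debt', i.e. negative.
    if "great" in words and "debt" in words:
        score -= 1.0
    return score

def shapley(words):
    """Average each word's marginal contribution over all subsets of the rest."""
    n = len(words)
    values = {}
    for w in words:
        others = [x for x in words if x != w]
        phi = 0.0
        for k in range(n):
            for s in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (toy_model(set(s) | {w}) - toy_model(set(s)))
        values[w] = round(phi, 3)
    return values

print(shapley(["great", "debt", "legacy"]))  # {'great': 0.1, 'debt': -1.0, 'legacy': 0.2}
```

Note that the Shapley values sum to the full-sentence score, so the attribution distributes the whole prediction across the words, which is what the SHAP visualisations in Figures 10.3–10.6 depict.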

Among the several news headlines analysed, one deserves a closer look: ‘Legado ambiental, a grande dívida da Olimpíada do Rio’ (Soares, 2016), published by Globo/CBN in Brazil. This headline was translated as ‘Environmental legacy, the great debt of Rio Olympics’ using Google Translate. The article, as the title suggests, discusses the government’s negligence towards the environmental promises for the Rio 2016 Olympics: as the event’s date approached, most of the legacy promises, such as the cleaning of Guanabara Bay and the Athletes’ Forest, had not been delivered. While I interpreted the headline as negative, the sentiment classifiers labelled it as highly positive. The use of SHAP clarifies the influence each word in the sentence had on the final sentiment score. Looking at Figure 10.3, it is possible to see that both ‘environmental’ and ‘debt’ were attributed a negative score; the negative score for ‘environmental’ indicates possible bias in the dataset. What seems to be pushing the final score towards positive, however, is the word ‘great’, which is being read not as ‘big’ but as ‘good’. To understand the influence of the word ‘great’ on the sentence’s sentiment classification, an experiment was performed. First, ‘great’ was replaced with ‘big’. As Figure 10.4 shows, this slightly increases the negative classification of the sentence, as a slight negative score was given to the word ‘big’. Next, ‘great’ was replaced with ‘huge’ (Figure 10.5); ‘huge debt’ received a highly negative score, significantly affecting the classification of the whole headline. A last experiment was performed by replacing the word ‘legacy’ with ‘impact’. As can be seen in Figure 10.6, ‘impact’ carries a slight negative score while ‘legacy’ carries a slight positive one; replacing ‘legacy’ with ‘impact’ pushed the sentence score towards negative.
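The substitution experiment can be scripted by re-scoring variants of the headline with one word swapped. The scorer below is an invented lexicon stand-in; with a real classifier, `toy_score` would be replaced by the model’s prediction.

```python
# Re-score single-word variants of a headline to probe the classifier's
# sensitivity to near-synonyms. Lexicon values are invented for illustration.
def toy_score(sentence):
    lex = {"great": 0.6, "big": -0.1, "huge": -0.3, "debt": -0.5,
           "legacy": 0.1, "impact": -0.1, "environmental": -0.2}
    return sum(lex.get(w, 0.0) for w in sentence.lower().split())

original = "environmental legacy the great debt of rio olympics"
swaps = [("great", "big"), ("great", "huge"), ("legacy", "impact")]

for old, new in swaps:
    variant = original.replace(old, new)
    delta = toy_score(variant) - toy_score(original)
    print(f"{old} -> {new}: change of {delta:+.1f}")
```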

Figure 10.3 Depiction of Shapley values for the original sentence. On the left side are the words classified as negative and on the right side, those classified as positive. The darker the word, the stronger the sentiment score.

Figure 10.4 Depiction of Shapley values for the sentence replacing ‘great’ with ‘big’.

Figure 10.5 Depiction of Shapley values for the sentence replacing ‘great’ with ‘huge’.

Figure 10.6 Depiction of Shapley values for the sentence replacing ‘legacy’ with ‘impact’.

What this experiment revealed is not only why this specific news headline was misclassified by the sentiment tool but also how sensitive the tool is to small variations in words and meanings. Considering that translation can already have a significant impact on meaning, sentiment analysis of translated text has to be performed carefully. The experiment was particularly useful for identifying words that carry bias in this specific context, such as ‘legacy’; replacing ‘legacy’ with ‘outcome’, for example, would be recommended to avoid a positive bias in the corpus. It is important to notice how the concept of reliability depends on the algorithms’ application to different objects of analysis. Even though tests are performed on generic or more specific datasets (Balahur and Turchi, 2012), evaluating how well the algorithms perform on my own corpus was crucial from a humanities and social sciences perspective. Understanding an algorithm’s behaviour provides the scholar with tools for better interpretation and more careful conclusions about the analysed phenomena. As Puschmann and Powell (2018) argue, the potential and limitations of sentiment analysis have frequently been misinterpreted by both scholars and the media; it is important, therefore, to delimit what it can really do in order to make good use of its numerical classifications. Choosing to work with tools, despite their limitations, adds a new layer to the biases and processes inherent in working with data (Gitelman, 2013). Acknowledging such interference, from the choice of data and corpus construction to filtering, cleaning, visualisation, and interpretation, is fundamental. What the procedures presented in this chapter offer is a structured method for questioning the role that algorithms play in the analytical pipeline.

10.9. Conclusion

Unveiling these limitations is important for reflecting on the challenges of multilingual data analysis. It goes beyond language and points to what Sibilia (2008) has called a techno-apartheid, referring to the inequalities in access and technological production between the Global North and the Global South. Bon et al. (2022) refer to this phenomenon as ‘digital coloniality’ and point out the importance of making people from these regions co-researchers, co-creators, and co-designers of technologies. Making use of technology developed by and for the Global North, or more specifically the Anglosphere, limits our mechanisms for studying social phenomena, generating geographical boundaries in the online territory as well. From the perspective of the humanities, overcoming these challenges has demanded ever more technical skill from researchers interested

in constructing better tools for analysis. However, in many cases, adapting existing algorithms and datasets is a much less costly option. As the case study in this chapter has shown, replicating methods designed for the English language on non-English corpora demands a combination of auxiliary methods. This means combining distant and close reading and, in the particular case of named entity recognition, using visualisation tools to make mistakes easier to identify. In the case of sentiment analysis, two steps were performed. First, an evaluation of the impact of machine translation on sentiment classification was conducted. Although scholars have published evaluations of cross-lingual sentiment analysis, its context sensitivity requires a deeper investigation for each study. For news articles published in Portuguese and translated into English covering the legacy of the Rio Olympics, machine translation presented mistakes that need to be considered before major conclusions are drawn from the overall sentiment classification of texts. The use of SHAP, an explainable AI tool, also proved useful for understanding the behaviour of SA algorithms and for identifying biases in their scores. The experiment of replacing words with synonyms improves the researcher’s understanding of the tool’s mechanisms, which helps compensate for the lack of resources for predicting sentiment in languages other than English. Further work on multilingual digital humanities would benefit greatly from more tools and procedures developed for critically assessing the impact of methodological choices.

While there is an indispensable focus on generating more data and software, this development needs to be accompanied by mechanisms that allow researchers to open the black-boxes and interrogate the impact of the decisions made through the process of creating, transforming, and making use of data to inform our studies.

Notes

1 Educomunicación in Spanish; Educomunicação in Portuguese.
2 Cross-lingual Event-centric Open Analytics Research Academy.
3 London hosted the Olympic Games in 1908, 1948, and 2012.
4 There are some initiatives like NUAWEB, being developed by the Federal University of Rio Grande do Sul (www.ufrgs.br/nuaweb/).
5 https://rsf.org/en/brazil
6 http://brazil.mom-gmr.org/en/
7 The code used to link the entities provided by Spacy with Wikifier can be found here: https://github.com/cleopatra-itn/news-cartography-analysis
8 https://en.wikipedia.org/wiki/CPLP_Games
9 BERT: Bidirectional Encoder Representations from Transformers is a machine learning technique for NLP developed by Google.

References

Ahnert, Ruth, et al. Living machines: A study of atypical animacy. Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 2020.

Amaral, Daniela, et al. Comparative analysis of Portuguese named entities recognition tools. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), 2014.
Araújo, Matheus, Adriano Pereira, and Fabrício Benevenuto. A comparative study of machine translation for multilingual sentence-level sentiment analysis. Information Sciences 512, 2020. https://doi.org/10.1016/j.ins.2019.10.031.
Araújo, Matheus, et al. An evaluation of machine translation for multilingual sentence-level sentiment analysis. Proceedings of the 31st Annual ACM Symposium on Applied Computing, 2016. https://homepages.dcc.ufmg.br/~fabricio/download/sac2016translation.pdf
Bairner, Alan, and Gyozo Molnar. The Politics of the Olympics: A Survey. Routledge, 2010.
Balahur, Alexandra, and Marco Turchi. Multilingual sentiment analysis using machine translation? Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, 2012. https://aclanthology.org/W12-3709
Berger, Peter L., and Thomas Luckmann. The Social Construction of Reality: A Treatise in the Sociology of Knowledge. Anchor, 1991.
Bon, Anna, et al. Decolonizing technology and society: A perspective from the global south. Perspectives on Digital Humanism, 2022. https://doi.org/10.1007/978-3-030-86144-5_9.
Brank, Janez, Gregor Leban, and Marko Grobelnik. Annotating documents with relevant Wikipedia concepts. Proceedings of SiKDD 472, 2017.
Cambria, Erik, and Amir Hussain. Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis. Springer, 2015. ISBN: 978-3-319-23654-4.
Cashman, Richard, and John Horne. Managing legacy. Managing the Olympics, 2013. https://doi.org/10.1057/9780230389588_4.
Dashtipour, Kia, et al. Multilingual sentiment analysis: State of the art and independent comparison of techniques. Cognitive Computation 8, 2016. https://doi.org/10.1007/s12559-016-9415-7.
De Freitas, Larissa A., and Renata Vieira. Exploring resources for sentiment analysis in Portuguese language. 2015 Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2015.
Devlin, Jacob, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2018. https://aclanthology.org/N19-1423/
Feldman, Ronen. Techniques and applications for sentiment analysis. Communications of the ACM, 2013. https://dl.acm.org/doi/10.1145/2436256.2436274
Fiormonte, Domenico. Towards a cultural critique of the digital humanities. Historical Social Research/Historische Sozialforschung. GESIS – Leibniz Institute for the Social Sciences, 2012.
Freire, Paulo. Pedagogia do oprimido. Fac símile digitalizado (Manuscritos). Instituto Paulo Freire, 1968.
Gitelman, Lisa, ed. Raw Data Is an Oxymoron. MIT Press, 2013.
Go, Alec, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 2009. https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
Gratton, Chris, and Holger Preuss. Maximizing Olympic impacts by building up legacies. The International Journal of the History of Sport, 2008. https://doi.org/10.1080/09523360802439023.
Han, Suhun. Googletrans 3.0.0. PyPI Library, 2020. https://pypi.org/project/googletrans/. Accessed on 20th January, 2022.

Honnibal, Matthew, and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Sentometrics Research, 2017.
Hutto, Clayton, and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). University of Michigan, Ann Arbor, 2014.
Leopkey, Becca, and Milena M. Parent. Olympic games legacy: From general benefits to sustainable long-term legacy. The International Journal of the History of Sport, 29(6), 924–943, 2012. https://doi.org/10.1080/09523367.2011.623006.
Linardatos, Pantelis, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable AI: A review of machine learning interpretability methods. Entropy, 23(1), 18, 2021.
Liu, Bing. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1–167, 2012.
Lundberg, Scott M., and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017. https://github.com/slundberg/shap. Accessed on 10th January, 2022.
Mateus, Julio-César, and María Teresa Quiroz Velasco. Educommunication: A theoretical approach of studying media in school environments. Revista Latinoamericana de Ciencias de la Comunicación, 14(26), 152–163, 2017.
McAuley, Julian, and Jure Leskovec. Hidden factors and hidden topics: Understanding rating dimensions with review text. Proceedings of the 7th ACM Conference on Recommender Systems, 2013, pp. 165–172.
Media Ownership Monitor Report, 2017. www.mom-gmr.org/en/.
Mello, Caio, Gullal S. Cheema, and Gaurish Thakkar. Combining sentiment analysis classifiers to explore multilingual news articles covering London 2012 and Rio 2016 Olympics. International Journal of Digital Humanities, 2022. https://doi.org/10.1007/s42803-022-00052-9.
Newman, Nic, et al. Reuters Institute Digital News Report. Reuters Institute for the Study of Journalism, 2020. https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2020-06/DNR_2020_FINAL.pdf.
Nilsson-Fernàndez, Pedro, and Quinn Dombrowski. Multilingual digital humanities. In The Bloomsbury Handbook to the Digital Humanities (pp. 81–90). Bloomsbury Publishing, 2022.
Ofcom. News Consumption in the UK: 2020, 2020. www.ofcom.org.uk/__data/assets/pdf_file/0013/201316/news-consumption-2020-report.pdf
Ou-Yang, Lucas. Newspaper3k: Article Scraping & Curation—Newspaper 0.0.2 Documentation, 2018. https://github.com/codelucas/newspaper. Accessed on 9th September, 2021.
Papanikolaou, Panagiota. Athens 2004. Ten years later the Olympic infrastructure, the cultural Olympiad and the ‘White Elephant’ syndrome. Journal of Power, 1(1), 9, 2013.
Pereira, Denilson Alves. A survey of sentiment analysis in the Portuguese language. Artificial Intelligence Review, 54(2), 1087–1115, 2021.
Poynter, Gavin, and Iain MacRury. Olympic Cities: 2012 and the Remaking of London. Ashgate Publishing, Ltd., 2009.
Preuss, Holger. Lasting Effects of Major Sporting Events, 2006, p. 13. http://www.idrottsforum.org/.
Puschmann, Cornelius, and Alison Powell. Turning words into consumer preferences: How sentiment analysis is framed in research and the news media. Social Media + Society, 2018. https://doi.org/10.1177/2056305118797724.

Reporters Without Borders. World Press Freedom Index, 2021. https://rsf.org/en/2021-world-press-freedom-index-journalism-vaccine-against-disinformation-blocked-more-130-countries.
Richardson, Leonard. Beautiful Soup Documentation, 2007. https://pypi.org/project/beautifulsoup4/. Accessed on 28th July, 2021.
Sibilia, Paula. O show do eu. Rio de Janeiro: Nova Fronteira, 2008, pp. 158–174.
Soares, Lucas. Environmental Legacy, the Great Debt of Rio Olympics. Globo, 2016. https://cbn.globoradio.globo.com/grandescoberturas/rio-2016/2016/07/02/LEGADO-AMBIENTAL-A-GRANDE-DIVIDA-DA-OLIMPIADA-DO-RIO.htm. Archived at https://web.archive.org/web/20220228150024/https://cbn.globoradio.globo.com/grandescoberturas/rio-2016/2016/07/02/LEGADO-AMBIENTAL-A-GRANDE-DIVIDA-DA-OLIMPIADA-DO-RIO.htm. Accessed on 10th January, 2022.
Souza, Fábio, Rodrigo Nogueira, and Roberto Lotufo. BERTimbau: Pretrained BERT models for Brazilian Portuguese. Brazilian Conference on Intelligent Systems. Springer, 2020, pp. 403–417.
Thelwall, Mike. The heart and soul of the web? Sentiment strength detection in the social web with SentiStrength. In Cyberemotions (pp. 119–134). Springer, 2017.
Thelwall, Mike, and Arvid Kappas. The role of sentiment in the social web. In Collective Emotions (pp. 375–388). Oxford University Press, 2014.
Wan, Xiaojun. Co-training for cross-lingual sentiment classification. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore, 2009, pp. 235–243.
Wei, Bin, and Christopher Pal. Cross lingual adaptation: An experiment on sentiment classifications. Proceedings of the ACL 2010 Conference Short Papers, 2010, pp. 258–262.

Part IV

Methods and Infrastructure

11 Multilingual Interfaces for All? Localisation Strategies in Proyecto Humboldt Digital

Antonio Rojas Castro

11.1. Introduction

Multilingualism is one of the features that characterise societies today. In contrast to the monolingual paradigm that identified a nation-state with one language, a multilingual paradigm has been taking hold since the 1970s, in parallel with the rise of global communication technologies, the increase in population migration, and the emergence of minority groups’ demands. The multilingual paradigm implies that ‘social and political debates are now oriented towards “governing” linguistic multiplicity and not towards imposing and managing monolingualism’ (Martín Rojo and Pujolar 2020, 15). Contemporary multilingualism can be defined as ‘the ability of societies, institutions, groups and individuals to engage, on a regular basis, with more than one language in their day-to-day lives’ (European Commission and Directorate-General for Education 2008, 6). It is often perceived as a competence that can be learned and is thus usually examined through the lens of Applied Linguistics, Second Language Acquisition, or Sociolinguistics. In reality, as Gramling (2021, 51) argues, we can envision a ‘continuum of multilingualism’ that cuts across disciplines, discourses, laws, politics, and people’s experiences. Since technology plays a very important role today, multilingualism is also present in the technological sphere in different ways. For example, in the field of internet studies, research on the digital divide has been frequent since the publication in 2003 of UNESCO’s recommendation on universal access to the internet (Leppänen and Peuronen 2012; Robinson et al. 2015; Dijk 2020). Over the past decade, there has also been much discussion in the digital humanities about ‘geographical and linguistic diversity’ (Galina Russell 2014) and the hegemony of English as lingua franca (Fiormonte, Numerico, and Tomasi 2015).
There is no doubt that there is a growing awareness of the importance of multilingualism in the field, as evidenced by the large literature on the subject (Gil and Ortega 2016; ‘Multilingualism’ 2019; Allés-Torrent and Riande 2019; Spence 2021). However, the debate in digital humanities around multilingualism is often overly theoretical; there is little work devoted to analysing the technical implementation of multilingualism in software and websites. This chapter aims to fill this gap by analysing the localisation strategies carried out in the Proyecto Humboldt Digital to create multilingual graphical user interfaces. It therefore covers only one layer of information that can be multilingual in an information system, the graphical user interface, and leaves out other layers such as data, access systems, and user interactions (Stiller n.d.). After the introduction, this chapter is structured as follows. The first section focuses on software and web localisation from a theoretical point of view; based on a review of previous literature, the terms ‘localisation’ and ‘locale’ are defined, and the GILT process is described in detail. The next section briefly introduces the Proyecto Humboldt Digital: its scope and objectives, technical requirements, and target users. The third section analyses the localisation strategy of the ediarum software, illustrating how the GILT process was put into practice to localise the graphical user interface of the editing framework. Similarly, the fourth section analyses the localisation strategy of the graphical user interface of the ‘Dossier Digital: Alexander von Humboldt and Cuba’, a website that hosts the digital editions of the project. Finally, the chapter closes with a summary of the main ideas and conclusions.

DOI: 10.4324/9781003393696-16

11.2. On Software and Web Localisation

The emergence of the personal computer from the 1980s onwards and the expansion of the web in the 1990s have brought many changes to the translation industry and to translation studies, such as computer-assisted translation (CAT) tools, machine translation (MT), and the localisation of software and websites. These technological changes have shaped the practice of translation and generated new theoretical concepts such as ‘localisation’, ‘locale’, or the acronym ‘GILT’: Globalisation, Internationalisation, Localisation, and Translation (Munday 2016). The term ‘localisation’ is commonly used in the translation industry. US companies such as Microsoft, Oracle, or Apple started trading their computer products outside the American market, for example in countries such as France, Germany, Italy, Spain, or Japan, in search of higher profits (Hall 1999; Esselink 2003). This process of product globalisation, however, required the development of new methods to translate their products for new markets. Today, ‘localisation’ refers both to the process of adapting a digital product to a different audience and to the final product. Although its origins lie in the development of computer software, localisations nowadays tend to be software, websites, video games, and smartphone operating systems or apps (Jiménez-Crespo 2013). In other words, in all cases these are digital products and therefore require the collaboration of translators and engineers during the development process. As general characteristics, localisations can vary to a certain extent in their adaptation level,1 but they share some common traits: they usually include text that is part of an information system, are accessed on screens, and are highly interactive. Defining a ‘locale’ is a bit more difficult. According to Hall (1999), a ‘locale’ is a ‘collection of all the conventions that characterise a single user community’.
These conventions cover diverse aspects such as the writing system, language (hyphenation, morphological rules, etc.), formats for dates, numbers, and measurements, monetary symbols, and help or warning messages. More recently, Jiménez-Crespo (2013) has defined a locale more comprehensively as the ‘combination of a sociological region and a language in an industrial setting’. Thus, for example, the Spanish language has several locales, such as Spanish-Spain, Spanish-Argentina, and Spanish-Mexico. The localisation process is neither simple nor isolated but consists of several phases summarised by the acronym GILT, which stands for Globalisation, Internationalisation, Localisation and Translation (Jiménez-Crespo 2013; Munday 2016):2

• Globalisation (g11n): the processes of structural change that companies and institutions undertake to facilitate internationalisation and localisation. Globalisation understood in this way is mainly organisational in nature and relates to management, such as the recruitment and training of a team of international workers with multilingual skills or the adoption of agile methods for software development to respond quickly to market trends.

• Internationalisation (i18n): developing the digital product with an international audience in mind as a potential target audience. In order to avoid technical modifications at a later stage, decisions are taken to ensure that the product is not tied to a specific culture and is independent of the language in which it was created (usually English), for example, by separating the text from the code and replacing it with identifiers and other mechanisms.3

• Localisation (l10n) and Translation: the adaptation of the product to a specific locale (language and region), that is, replacing culturally inappropriate symbols and source text with their equivalents in the target language.
The scope of this process involves activities entailing organisation, preparation of files and resources, the translation of the text and further cultural adaptations, quality control of the final product, and user testing. For this reason, localising is normally understood as an activity larger than translating. The GILT process aims to coordinate the technical development of the product and the linguistic and cultural adaptation as it involves initial preparation of the localisation project, implementation of the localisation by specialists, quality assessment, and final integration into the product. However, it can face several critiques, including quality issues, lack of consistency and integration, and excessive time and cost. Additionally, there is a current debate over whether machine translation can replace human translator entirely or not. The use of machine translation can speed up the last phase (localisation and translation), but it may not always provide the accuracy and nuance of a human translator. In general, larger companies and institutions usually have their own department and a staff of localisers who adapt the product in-house, while smaller companies and institutions may outsource the localisation to external companies (JiménezCrespo 2013). Due to market fluctuations and over time, companies which used to offer services in several languages have specialised in a single target language and have integrated IT technologies such as Translation Memories and Globalizations Management Systems (GMS) and XML-based standard formats (Esselink

2003). Finally, more recently, crowdsourcing initiatives have emerged whereby companies such as Facebook have localised their product interfaces into many languages, taking into account the terms most frequently proposed by users. Due to the origins of localisation in the computer industry, its relationship with translation theory as developed in translation studies has long been problematic. For example, Pym (2004) points out that localisation often minimises the role of translation and that the concept of ‘internationalisation’ is based on an artificial idea of equivalence. More dangerously, passing off a foreign product as one produced in a local context is risky because it can lock cultures into passive consumption. Despite these warnings, ‘domestication’ in localisation is nowadays seen as a means of awareness of, and respect for, the user. When looking back at translation theories, it is easy to realise that localisation strategies are not a complete novelty. The cultural adaptation of a text relates localisation precisely to Skopos theory (Nord 2018). Although it is traditionally assumed that the translator must be faithful to the source text, the translator can often do more than produce an equivalent text. This strategy may make sense under certain circumstances and with instructive texts. For example, a functional translation is the most appropriate when translating open educational resources such as tutorials, since these are persuasive texts that try to convince the reader to carry out a series of tasks in a specific order to achieve a result (Isasi and Rojas Castro 2021). In this context, faithfulness to the source text is not as important as the purpose or function that the translation should fulfil in each situation. As localisation has shifted its focus from software to websites, new challenges related to the digital nature of texts have emerged.
For example, Jiménez-Crespo has reflected theoretically on the importance of usability (‘ease of use’) in web localisation, concluding that using a website is not the same as reading a printed text. However, W3C initiatives on accessibility are still little known by the main stakeholders in the localisation industry (Rodríguez Vázquez and O’Brien 2017; Acharya 2019) and, as far as I know, there are still few empirical user studies investigating how localisation affects usability, accessibility, and inclusion.

11.3. The Foundations of Proyecto Humboldt Digital

Funded by the cultural preservation programme of the German Foreign Office, the Fritz Thyssen Foundation, and the Gerda Henkel Foundation, Proyecto Humboldt Digital (ProHD) is a four-year (2019–2023) international research cooperation. The project takes a transnational topic—Alexander von Humboldt’s journey to Cuba—to enable the transfer of knowledge between the Oficina del Historiador de la Ciudad de La Habana (OHCH) in Cuba and the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) in Germany. The main goal of the project is to publish online a collection of texts about Humboldt’s journey to Cuba. Since 2020, the team members have identified about 120 documents and archival records as relevant against selection criteria previously defined collaboratively. In Europe, 30 documents of interest are held in the Berlin State Library and in the Jagiellonian Library in Krakow; these documents

have been catalogued and digitised prior to ProHD and are accessible in PDF and TIFF formats in their digital libraries and institutional repositories. On the Cuban side, more than 90 documents preserved in several Cuban libraries and archives have been selected. These documents, however, are being digitised, described with metadata, and published online thanks to the task force created by ProHD. The total number of documents selected from Cuban institutions may increase after the research that will be carried out throughout 2022 and 2023 in the library of the Universidad de La Habana and the Academia de Ciencias de Cuba. The organisation of research exchanges and training activities plays an important role in the completion of the project and its future sustainability. The project was formally launched with a face-to-face meeting in Havana in November 2019, where its scope and goals were discussed at length. Despite restrictions during the COVID-19 pandemic, short stays in both Havana and Berlin have taken place, as well as monthly meetings in virtual format. During these encounters, we were able to define software requirements and to share the main issues and the progress made. We decided that all our software and web applications should be multilingual and that Spanish would be the lingua franca for internal communication. In order to conceptualise the foundations of the project, we took into account the IFLA Recommendations (IFLA 2002) and the NISO Guidelines (National Information Standards Organization 2007) to define our digitisation policy and selection criteria. The FAIR principles (Wilkinson et al.
2016) were also employed as a guide to ideate the technological requirements of the digital resources; we decided that the digitisations should be findable in a repository, accessible by both computers and people, saved in a standardised and interoperable format, and reusable thanks to the use of non-proprietary standard formats and open licences. The Principles for Digital Development (‘Principles’ n.d.) were used to define the cooperation between the two institutions, mainly considering the Cuban technological ecosystem. In this sense, the US blockade, which prevents trade between any country and Cuba, and the digital divide have been two major challenges that have hindered the completion of the project; for example, the purchase of software licences has not been feasible and, during periods of confinement, it has been more difficult to hold virtual meetings, as domestic internet access is rare on the island (Bisset Alvarez, Grossi de Carvalho, and Borsetti Gregorio Vidotti 2015). Since the project is implemented in Cuba and Germany, we decided to address our team’s needs first of all, which may vary depending on the context (Horváth et al. in this volume), without forgetting to meet the needs of our potential audience. After several meetings and discussions, we were able to conceptualise two types of target users:

• Target Users 1: the users of the software developed to create the project data, that is, the team members, editors, and researchers. These users are characterised by the following demographic variables: men and women (gender) who live in Cuba or Germany (location), speak Spanish or German (language), are information professionals or researchers on Humboldt’s legacy (education), are

aged between 20 and over 54 (age), and normally access content online using a desktop computer and a mobile phone (access).
• Target Users 2: end users outside the team. These users may find the research data online, either the digitisations or the digital editions, and can be characterised as follows: men and women (gender) who access the content from anywhere in the world (location), are English native speakers or may speak English as a lingua franca (language), have a university degree, though not necessarily related to Humboldt studies (education), are aged between 20 and over 60 (age), and normally access content online using a desktop computer and mobile devices (access).

The first target user is obviously a priority because the project aims, above all, to be useful for advancing research on Humboldt’s journey to Cuba and, therefore, it must satisfy the needs of the members of the team. However, the project also seeks to go further and promote awareness of the matter studied among a public that is not necessarily well-informed or specialised. In both cases, despite some similarities, our target users are characterised by their diversity in terms of language, location, education, age, and access to the internet. Further information on a segment of our target users was collected in April 2022. The organisation of two in-person events in Havana, a conference on digital humanities and a workshop on digital editions, allowed us to expand the impact of the project beyond the team, test our methodology, learn more about the interests of the Cuban audience, and experience first-hand the key issues and challenges faced in similar projects. Although it is too early to assess the impact of the project by looking at user traffic (Isasi et al.
in this volume), the methodologies implemented and tools used in Proyecto Humboldt Digital are being replicated, tested, and adapted in the Oficina del Historiador de la Ciudad de La Habana with other materials not related to Alexander von Humboldt. In this sense, the project has served as a pilot test and will allow other digital scholarship projects to be launched in Havana in the future without the collaboration of the German team.

11.4. ediarum, an Example of Software Localisation

Oxygen XML Author is an authoring tool designed to structure XML documents and to transform source documents into multiple output formats. In 2013, a team of digital humanists from the Berlin-Brandenburg Academy of Sciences and Humanities4 began developing ediarum, an editorial framework that is installed in Oxygen XML Author as an extension. Thus, the Graphical User Interface (GUI) can be adapted with new features to hide the XML tags and to facilitate the automatic validation of XML documents according to the project’s schema. As Dumont and Fechner state, ediarum allows for increased ‘usability’ of TEI documents (Dumont and Fechner 2014). In fact, ediarum is not just another desktop application that helps the editor to encode texts in TEI format. In a strict sense, ediarum is a ‘digital editing environment’

because it works as a ‘platform in which several tools (independent or not) work as an integrated whole to perform a number of tasks or cover the entire workflow’ (Sichani and Spadini 2018). ediarum has three main components: first, the Oxygen XML editor, in which the framework is installed; second, an XML database (preferably eXistDB); and third, the ConTeXt language, a document processor derived from LaTeX that generates PDFs from TEI documents. For almost ten years, ediarum has been successfully implemented in projects such as Alexander von Humboldt auf Reisen or Schleiermacher in Berlin 1808–1834, carried out at BBAW. The methodology and task force employed in this research centre are similar in most cases: each project counts on a computer scientist (or digital humanist) who develops a version of ediarum in collaboration with the editors. The ediarum developer adapts the basic framework to the needs of the project, after modelling the texts, and extends the graphical user interface. However, ediarum presupposes two conditions, one economic and the other linguistic. On the one hand, it requires a commercial Oxygen XML Author licence, which not all projects can afford, as the licence is not free. On the other hand, it requires advanced knowledge of German because, although the icons used are (almost) universal, the navigation menus and help windows are only available in German by default. Indeed, as Mertgens has recently pointed out, the limitations of ediarum have to do with its ‘default’ monolingual status (Mertgens 2019). The ediarum framework presupposes that you have funding and that you understand German, two conditions that fit most projects carried out at BBAW. This is, however, not always the case when carrying out collaborative work with researchers from institutions located in countries of the Global South, as in Proyecto Humboldt Digital.
Since most of our editors had no previous experience with the TEI format, we considered it a good idea to test whether ediarum would facilitate the encoding task after adapting it to the needs of the users, especially regarding the language of the GUI. After modelling the text and creating a validation schema for the project, the main task consisted of adapting ediarum to be multilingual. Thus, a two-phase adaptation strategy was carried out between 2020 and 2021: first, the code was internationalised, that is, text and code were separated, replacing textual strings with identifiers; second, a file named ‘translation.xml’ was created in a folder named ‘i18n’ within the resources of the framework; in this file, the German text strings were translated into Spanish manually and structured hierarchically by matching each text string with the identifiers found in the code. This separation of code and graphical display allows users to select the display of the text in German or Spanish according to their preference. But translating the text display of actions, menu entries, context menus, toolbars, and static CSS content is not an easy task for several reasons: the translation is performed in a list without context, which makes it difficult to find an equivalent term, as there is missing information that can only be obtained by testing the software. In addition, the translation must consider the limitations of screen space and menu tabs, as well as the cultural conventions of the target language, which vary from locale to locale.
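To illustrate, a translation file of this kind might look roughly as follows. The structure loosely follows Oxygen’s framework internationalisation conventions, but the keys and strings here are invented for illustration; the actual ediarum.PROHD.edit file may differ:

```xml
<!-- Hypothetical excerpt of an i18n/translation.xml file: each key matches
     an identifier used in the framework code, with one value per locale. -->
<translation>
    <languageList>
        <language description="German" lang="de_DE" localeID="de_DE"/>
        <language description="Spanish" lang="es_ES" localeID="es_ES"/>
    </languageList>
    <key value="menu.insert.person">
        <val lang="de_DE">Person einfügen</val>
        <val lang="es_ES">Insertar persona</val>
    </key>
    <key value="warning.invalid.document">
        <val lang="de_DE">Das Dokument ist ungültig.</val>
        <val lang="es_ES">El documento no es válido.</val>
    </key>
</translation>
```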


Figure 11.1 ediarum.PROHD.edit graphical user interface.

Following the internationalisation and localisation processes, ediarum has been tested with team members, leading to significant improvements in visualisation, functionality, and validation. Currently, the version of ediarum adapted to Proyecto Humboldt Digital—called ediarum.PROHD.edit5—facilitates the encoding of texts in TEI format for German- and Spanish-speaking editors, is well documented in both languages, and can be freely downloaded from GitHub.

11.5. The Web Localisation of the ‘Digital Dossier: Alexander von Humboldt and Cuba’

The dissemination of the digital editions has been carried out with TEI Publisher. This software tool facilitates the creation of web applications within eXistDB (Meier 2003). Originally inspired by Sebastian Rahtz’s TEI Processing Model in 2015, TEI Publisher can process TEI documents—among other formats—as input and is released under a GPLv3 licence. Additionally, the software has been built around a user community that can expand, modify, distribute, and reuse the software (‘TEI Publisher’ 2022). TEI Publisher transforms TEI documents stored and indexed in eXistDB into different publishing formats, such as HTML, PDF, and EPUB. From the point of view of the data representation model, the software only assumes that the aim of the project is to publish a collection of documents; it does not assume a specific type of edition—critical, diplomatic, modernised, etc.—or the use of a specific subset of TEI tags. This is because the software was not created to respond to the needs of a specific project but with a generalist aim in mind. The presentation of the text can be flexibly configured thanks to the rules defined in the ODD file. The administrator of the web application can write or overwrite processing instructions that select TEI elements requiring a format and behaviour.
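As a rough illustration, such a processing instruction assigns a behaviour and a rendition to a TEI element. The snippet below follows the TEI Processing Model syntax that TEI Publisher reads from the ODD, but the element and styling are arbitrary examples, not taken from the project’s actual ODD:

```xml
<!-- Hypothetical ODD customisation: render <hi> inline, in italics. -->
<elementSpec ident="hi" mode="change">
    <model behaviour="inline">
        <outputRendition>font-style: italic;</outputRendition>
    </model>
</elementSpec>
```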
All this information is passed into a few templates—that is, HTML documents with predefined structures—built by combining pieces of HTML code and JavaScript (web components) from an existing, reusable, and highly modifiable library. Since version 6.0, TEI Publisher has allowed developers to create web applications that support international audiences. The GUI is available by default in 20 languages, thanks to community contributions via Crowdin. However, in the case of Proyecto Humboldt Digital, these default locales were not enough, since we further developed and adapted the web application of our ‘Digital Dossier: Alexander von Humboldt and Cuba (1800–1830)’, creating new labels and visual signals that were not part of the default setting. Consequently, during 2021 we undertook an internationalisation and localisation strategy for the GUI of the web application. The text strings localised into English, German, and Spanish correspond to the title of the dossier, the tabs and icons in the main menu and in the visualisation bar, the fields in the metadata panel, the information found in the footer (funding and licence), the error page, and, finally, the facets that help to filter the editions by format, language, and subject.

The whole adaptation strategy was again manual6 and can be summarised as follows: first, the languages of the GUI were narrowed down to three (English, German, and Spanish); second, in the HTML templates, the text strings of the GUI were replaced by web components with a unique identifier situated in the @key attribute; and finally, three JSON files were created to store the localisations in the three languages of the project; these files contain hierarchically ordered lists of items in which the previously defined unique identifier is assigned to each of the text strings to be displayed in the GUI. In summary, the strategy is similar to the one outlined earlier for ediarum, but here there is an additional step, since the default software comes with more locales than necessary to meet our target users’ expectations and needs. The first version of the GUI localised in English, German, and Spanish was tested internally with some members of the team using desktop computers.7 In addition, five usability tests were carried out with target users to determine whether multilingualism hindered interaction with the content. Several areas for improvement were discovered related to access to the project pages, which, by default, were served in German, even though some users accessed them from Spain and had their operating systems configured in Spanish. After correcting the detected errors and implementing some changes in the graphical user interface, a total of 15 tests with target users are expected to be carried out before the end of the project using both a desktop computer and mobile devices.
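One of the three localisation files might, for instance, look like the sketch below (one file per language, each mapping the same identifiers to locale-specific strings). The keys and strings are invented for illustration and do not reproduce the project’s actual files:

```json
{
    "dossier.title": "Dossier digital: Alexander von Humboldt y Cuba (1800–1830)",
    "menu.editions": "Ediciones",
    "metadata.language": "Idioma",
    "footer.licence": "Licencia",
    "error.notfound": "Página no encontrada"
}
```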

Figure 11.2 The GUI of ‘Digital Dossier: Alexander von Humboldt and Cuba (1800–1830)’.

11.6. Conclusion

Contemporary multilingualism is a complex phenomenon involving both individual and social aspects. As it is embedded in our culture, it is logical that some of its aspects are reflected in the technological sphere. In the field of digital humanities, however, the debate has been overly theoretical for the past ten years. In contrast, this chapter has focused on the implementation of multilingual graphical user interfaces. In order to provide an appropriate theoretical framework, three main concepts have been introduced: first, the term ‘localisation’ has been defined as the adaptation of a digital product to a new audience; second, the term ‘locale’ has been described as the linguistic and cultural conventions that characterise a community of speakers; and third, the GILT process has been presented in detail as a strategy carried out by localisation companies to coordinate the technical development of a product together with its linguistic and cultural adaptation. After explaining how localisation companies undertake their mission, localisation has been linked to Skopos theory and to the notion of accessibility promoted by the W3C. Contemporary multilingualism and localisation practices make full sense when conceptualising and implementing an international research project such as Proyecto Humboldt Digital. After clarifying the main goal of the project, the FAIR principles and the Principles for Digital Development were considered to ideate the technical requirements and the target users of the project. As we have conceived them, the results of the project should be findable, accessible, interoperable, and reusable by a diverse audience in terms of gender, language, location, education, age, and internet access.
We hope that ediarum will be useful for the editors of other projects, that our editions will be accessed by their target users, and that the underlying data will be reused in other contexts, thanks to the open licences we have chosen. Given the diversity of the target users, two localisation processes were initiated to create multilingual graphical user interfaces: on the one hand, the localisation from German into Spanish of ediarum, an editorial framework that is installed in Oxygen XML Author and that helps editors to encode texts in TEI format; on the other hand, the localisation into three languages (Spanish, German, and English) of the web application created with TEI Publisher. In both cases, the localisation strategy is similar and follows the GILT process outlined in the theoretical framework. However, the localisation of the website was easier because, by default, TEI Publisher has been designed with a global community in mind. In conclusion, localising software and websites is a good strategy for making interfaces multilingual and thus meeting the needs of international audiences. When information systems, software, or websites are available in more than one language, more people can interact with them; that is, they are more usable and accessible. However, unless part of the localisation process is automated or crowdsourced, localising always implies a choice in terms of target users, since the adaptation of the products involves additional time and cost.

Notes

1 For example, Jiménez-Crespo (2013) presents different levels of localisation and cultural adaptation: enabled products (‘those in which users can write and use their own language and scripts, but the software and the accompanying help and guides appear in a different language’), localised products (‘those in which the user interface and all help files are localized, but some language-specific tools such as spell-checkers and dictionaries are not available’), and adapted products (‘those in which all linguistic tools, functionalities and content are fully adapted to the target language/locale’).
2 In the professional field, these terms are often abbreviated as ‘g11n’, ‘i18n’, and ‘l10n’. These abbreviations stand for the number of characters between the first and the last one, for example, 11 characters between the ‘g’ and ‘n’ in ‘globalisation’.
3 This vision of ‘internationalisation’ coincides to a large extent with the definition proposed by the W3C, although the latter is more recent and is limited to web development: www.w3.org/standards/webdesign/i18n
4 The team is part of The Electronic Life of the Academy (TELOTA), a department whose aim is to support digital scholarship in the institution.
5 The ediarum documentation in Spanish is accessible at www.ediarum.org/es/index.html, and it is possible to download the localised framework from GitHub: https://github.com/humboldtdigital/ediarum.PROHD.edit
6 In other words, no Translation Memories were used to translate the text strings. However, we created the XML structure of the translation file using XSLT, used DeepL to translate certain strings, and consulted German-Spanish online dictionaries.
7 The digital editions are available at https://dossierdigital.ohc.cu/ while the web application itself can be downloaded from GitHub at https://github.com/humboldtdigital/ProHD.

References

Acharya, Keshab Raj. 2019. “Usability for Social Justice: Exploring the Implementation of Localization Usability in Global North Technology in the Context of a Global South’s Country.” Journal of Technical Writing and Communication 49 (1): 6–32. https://doi.org/10.1177/0047281617735842.
Allés-Torrent, Susanna, and Gimena del Rio Riande. 2019. “The Switchover: Teaching and Learning the Text Encoding Initiative in Spanish.” Journal of the Text Encoding Initiative 12 (July). https://doi.org/10.4000/jtei.2994.
Bisset Alvarez, E., A. Grossi de Carvalho, and S. Borsetti Gregorio Vidotti. 2015. “Políticas públicas de inclusión digital: El caso de América Latina y Cuba.” Biblios 58: 42–53. www.redalyc.org/articulo.oa?id=16138590004.
Dijk, Jan van. 2020. The Digital Divide. Cambridge and Medford, MA: Polity.
Dumont, Stefan, and Martin Fechner. 2014. “Bridging the Gap: Greater Usability for TEI Encoding.” Journal of the Text Encoding Initiative 8 (December). https://doi.org/10.4000/jtei.1242.
Esselink, Bert. 2003. “The Evolution of Localization.” MultiLingual Computing & Technology 14 (5): 4–7.
European Commission, and Directorate-General for Education, Youth, Sport and Culture. 2008. High Level Group on Multilingualism: Final Report. Publications Office of the European Union.
Fiormonte, Domenico, Teresa Numerico, and Francesca Tomasi. 2015. The Digital Humanist: A Critical Inquiry (version 1). Punctum Books. https://doi.org/10.21983/P3.0120.1.00.
Galina Russell, I. 2014. “Geographical and Linguistic Diversity in the Digital Humanities.” Literary and Linguistic Computing 29 (3): 307–316. https://doi.org/10.1093/llc/fqu005.

Gil, Alex, and Élika Ortega. 2016. “Global Outlooks in Digital Humanities: Multilingual Practices and Minimal Computing.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J. Lane, and Ray Siemens, 22–33. London and New York: Routledge.
Gramling, David. 2021. The Invention of Multilingualism, 1st ed. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108780667.
Hall, Patrick A. V. 1999. “Software Internationalization Architectures for Decision Support Systems.” In Decision Support Systems for Sustainable Development, edited by Gregory Kersten, Zbigniew Mikolajuk, and Anthony Gar-On Yeh, 291–304. Boston, Dordrecht and London: Kluwer Academic Publishers.
IFLA, ed. 2002. “Directrices Para Proyectos de Digitalización.” Ministerio de Cultura. www.ifla.org/files/assets/preservation-and-conservation/publications/digitization-projects-guidelines-es.pdf.
Isasi, Jennifer, and Antonio Rojas Castro. 2021. “¿Sin Equivalencia? Una Reflexión Sobre La Traducción al Español de Recursos Educativos Abiertos.” Hispania 104 (4): 613–624.
Jiménez-Crespo, Miguel A. 2013. Translation and Web Localization. London and New York: Routledge.
Leppänen, Sipra, and Saija Peuronen. 2012. “Multilingualism and the Internet.” In The Encyclopedia of Applied Linguistics, edited by Carol A. Chapelle. Oxford: Blackwell Publishing Ltd. https://doi.org/10.1002/9781405198431.wbeal0805.
Martín Rojo, Luisa, and Joan Pujolar, eds. 2020. Claves para entender el multilingüismo contemporáneo, 1a ed. (Re)Pensar La Educación, no. 12. Barcelona and Zaragoza: Editorial UOC; Prensas de la Universidad de Zaragoza.
Meier, Wolfgang. 2003. “eXist: An Open Source Native XML Database.” In Web, Web-Services, and Database Systems, edited by Akmal B. Chaudhri, Mario Jeckle, Erhard Rahm, and Rainer Unland, 2593:169–183. Lecture Notes in Computer Science. Berlin and Heidelberg: Springer.
https://doi.org/10.1007/3-540-36560-5_13.
Mertgens, Andreas. 2019. “Ediarum: A Toolbox for Editors and Developers.” RIDE 11. https://doi.org/10.18716/RIDE.A.11.4.
“Multilingualism.” 2019. OPERAS (blog). December 2, 2019. www.operas-eu.org/special-interest-groups/multilingualism/.
Munday, Jeremy. 2016. Introducing Translation Studies: Theories and Applications, 4th ed. London and New York: Routledge.
National Information Standards Organization. 2007. “A Framework of Guidance for Building Good Digital Collections.” http://framework.niso.org/5.html.
Nord, Christiane. 2018. Translating as a Purposeful Activity: Functionalist Approaches Explained. London and New York: Routledge.
“Principles.” n.d. Principles for Digital Development (blog). Accessed December 7, 2022. https://digitalprinciples.org/principles/.
Pym, Anthony. 2004. “Localization From the Perspective of Translation Studies: Overlaps in the Digital Divide?” SCALLA Conference, Kathmandu, January. https://usuaris.tinet.cat/apym/on-line/translation/translation.html.
Robinson, Laura, Shelia R. Cotten, Hiroshi Ono, Anabel Quan-Haase, Gustavo Mesch, Wenhong Chen, Jeremy Schulz, Timothy M. Hale, and Michael J. Stern. 2015. “Digital Inequalities and Why They Matter.” Information, Communication & Society 18 (5): 569–582. https://doi.org/10.1080/1369118X.2015.1012532.
Rodríguez Vázquez, Silvia, and Sharon O’Brien. 2017. “Bringing Accessibility into the Multilingual Web Production Chain: Perceptions From the Localization Industry.” In

Universal Access in Human–Computer Interaction: Design and Development Approaches and Methods, edited by Margherita Antona and Constantine Stephanidis, 10277:238–257. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-58706-6_20.
Sichani, Anna-Maria, and Elena Spadini. 2018. “Criteria for Reviewing Tools and Environments for Digital Scholarly Editing, version 1.0.” December 2018. www.i-d-e.de/publikationen/weitereschriften/criteria-tools-version-1/.
Spence, Paul. 2021. “Disrupting Digital Monolingualism: A Report on Multilingualism in Digital Theory and Practice.” Zenodo. https://doi.org/10.5281/ZENODO.5743283.
Stiller, Juliane. n.d. “Best Practices for Multilingual Access.” Europeana Pro. Accessed October 25, 2021. https://pro.europeana.eu/post/best-practices-for-multilingual-access.
“TEI Publisher.” 2022. “Our Vision.” https://teipublisher.com/index.html#vision.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.

12 Towards Multilingually Enabled Digital Knowledge Infrastructures: A Qualitative Survey Analysis

Alíz Horváth, Cornelis van Lit, Cosima Wagner, and David Joseph Wrisley

12.1. Introduction

More than 20 years have passed, but George Steiner’s 1998 critique that ‘[c]omputers and databanks chatter in “dialects” of an Anglo-American mother tongue’ is still very relevant today when we consider the question of digital monolingualism and the power structures of the majority of research infrastructures (RIs) in the world. English-language hegemony not only makes global academic research in the digital age in languages other than English and in non-Latin scripts less discoverable, but it has also led to cultural and technological biases in the humanities disciplines. In digital humanities, this hegemony has led to numerous debates around decolonising data and diversifying digital humanities (DH) as a field more generally.1 As the digitalisation of academic infrastructures has increased and globalised, a point of special importance is the fact that digital environments for research are neither neutral nor inherently international spaces. If we have not been involved in their creation directly, they have no doubt been created for us in order to be able to carry out certain tasks. They have been designed with certain users in mind, and their creators have had different outlooks, ranging from an open-source ethos to more restrictive, profit-based models. It has been argued that the design of digital platforms often adheres to a fallacy of single-language use (Ghorbaninejad et al. 2023), that is to say, their design assumes a single-language, single-script, and single-direction mentality of its imagined user community. If platforms offer more than one language, they are often limited to a few languages—either official or de facto national languages—perhaps with the addition of some variant of English.
Of special concern for our chapter is digital scholarship in and about texts written in non-Latin scripts (NLS), and we acknowledge that the software, information systems, and infrastructure used to carry out that research can accommodate the use of NLS only to a limited extent, or sometimes not at all. It is important to underscore that while multilingualism is often used to discuss the use of multiple languages other than English, for us the most urgent need for discussion concerns the combinations of languages that DH practitioners use in their daily work (their multilinguality), especially in cases of languages and scripts that are not Latin-based.2 In this chapter, we argue that while many world languages are at least marginally represented

DOI: 10.4324/9781003393696-17

198 A. Horváth, C. van Lit, C. Wagner, and D. J. Wrisley in the DH community, the obstacles to working on shared infrastructure make DH work much harder for some in our community. There is an urgent need for pathways forward, which include, and support, practitioners of these languages. From this perspective, in this chapter we pay attention to knowledge practices in digitally transformed academic ecosystems in general, and among people who work with multiple languages and technologies in their work in particular. Using survey data, we report on some of the most common NLS languages DH practitioners work with and on. We also summarise some of the material and technical challenges of working digitally with NLS and ask our readers to become more aware of multilingual DH users who are working outside of the more visible Latinbased languages. Infrastructure is a complex, sociotechnical system (Liu 2020; Pawlicka-Deger 2021; Smithies 2017), and DH researchers are not only working with modern official languages but with combinations of languages, some of which may just be historical variants or not belonging to any one linguistic taxonomy at all.3 Whereas there are some approaches in advanced natural language processing (NLP) and also in software localisation which aim to be language-independent or language-agnostic, we do not adopt this approach; they require large amounts of textual data which our research domains do not necessarily have and often preclude expert input. We do not call on digital humanists in the various NLS communities to fix all of the problems specific to the languages in which they work by themselves in a piecemeal fashion. We also do not believe that it is the role of digital humanists to bend all of their work to fit enterprise-style platforms designed for internationalisation, nor do we believe this really to be possible with textual data. 
Instead, we encourage collective thinking through what some of the existing major challenges for multilingual DH are, and we call upon the DH community to champion their resolution at the design level, in dialogue with architects of research infrastructure as well as providers of DH research data (galleries, libraries, archives, and museums, i.e. GLAM). The health of our discipline depends on it.4 Furthermore, from our backgrounds as NLS DH practitioners and research librarians, we see the urgent need for opening a space for discussion of the best environments and of how to sustainably co-design them at our institutions. In order to begin to think through this co-design process, with a view to developing a clearer picture of the functional specificities of RIs using NLS (what we are calling here the ‘requirement picture’ or the ‘requirement profile’), our chapter discusses two stages of research we undertook during the COVID-19 pandemic. First, we highlight results from ‘Disrupting Digital Knowledge Infrastructures: A Status Quo Survey’, conducted in 2020 during the ‘Disrupting Digital Monolingualism’ conference at King’s College London.5 Second, based on the voices from our survey, we present a pragmatic thought experiment of mapping out some possible problem spaces for dealing with world languages in digital environments.

12.2. A Descriptive Survey Undertaken in 2020

We conducted a status quo survey in summer 2020 as part of a thematic group dedicated to Disrupting Digital Knowledge Infrastructures at the Disrupting Digital Monolingualism Symposium, organised by the ‘Language Acts & Worldmaking’ research group in collaboration with the ‘Cross-Language Dynamics: Reshaping Community’ project and hosted by King’s College London. The thematic group (made up of the co-authors of this paper, as well as Kiyonori Nagasaki and Pascal Belouin) focused on ‘Linguistic and geocultural diversity in digital knowledge infrastructures’ and cooperated virtually from 17 to 24 June 2020 with the goal of establishing a ‘requirement profile’ for multilingually enabled digital knowledge infrastructures (KIs). By ‘requirement profile’ we meant a list of minimal functional specifications for platforms to be accessible to the multilingual practices of different users in the pursuit of research goals. We were particularly interested in functionality enabling NLS. Our starting point was the awareness that many of us work in internationalised universities, in which researchers from around the world use different languages (historical and modern) in research, teaching, and scholarly communication. However, we share the experience that our institutional infrastructures often fall short in accommodating our linguistic and geo-cultural diversity. The initial aim of the group was to develop guidelines that might be given to library and academic infrastructure managers, software developers, and other information stakeholders. This was admittedly a large task, but one that the moment seemed to call for. We believed then, and still believe, that articulating a collective set of functional requirements for digital scholarship infrastructure is a worthwhile goal. We believe that this goal holds the promise both of enhancing the profile of multilingual DH work in general and of opening pathways for dialogue between the spheres of research and the creation and management of infrastructure in particular.
Furthermore, it provides empirical evidence for making claims when it comes to material and infrastructural requests in our institutions. Over the course of our discussions, it became clear that navigating the workings of infrastructure and the needs of researchers was not a simple task and that more fundamental questions remained unanswered in the DH community: what languages are being worked on and with? How well can digital scholarship be carried out in these languages? What are some of the impediments to doing language-specific research for some members of our community? We considered it essential to collect a set of somewhat representative voices on multilingual DH work from the community. To that end, we crafted an ad-hoc survey with the following four open-ended questions:

1. What languages (other than English) and scripts do you use in digital environments in your daily work? List them in the order of their importance to you.
2. What kinds of difficulties do you encounter when you use more than one language and/or non-Latin scripts in a digital environment?
3. What would you like to be able to do using your languages that you cannot right now? If you can, specify the kind of digital environment (repository, WordPress or other web publishing platform, library catalog, XML editor, e-journal, etc.).
4. Do you have workarounds to get a task done that would seem to be so simple but is actually not, due to the difficulties described in the second question of this survey?

200 A. Horváth, C. van Lit, C. Wagner, and D. J. Wrisley

Figure 12.1 A screenshot of the introduction to our 2020 survey.

We shared the survey on social media, with the Stanford-based Multilingual DH mailing list,6 and with other online networks and interest groups, such as the European Association for Digital Humanities, the Japanese Association for Digital Humanities, and the Digital Orientalist. We were not fully prepared for the diversity and richness of responses that we received: 52 responses in five days. The responses showed significant diversity in terms of languages, modes of engagement with digital methods and infrastructures, explanations of roadblocks and potential workarounds, as well as the articulation of immediate and more holistic goals in working with text. In the rest of this chapter, we attempt to make sense of the diversity of responses to our 2020 survey, ultimately arguing for the design of a more advanced survey instrument and its implementation in the future, over a longer period of time and with many more respondents.

12.2.1. An Initial Qualitative Analysis of the 2020 Survey Responses

The most frequently mentioned languages among our respondents were Arabic, Chinese, German, French, and Japanese, but the list also extended to others such as Sanskrit, Syriac, Persian, Thai, and Wolof. Figure 12.2 suggests that the majority of respondents use multiple languages in their work. This point cannot be stressed enough: many researchers worldwide are working not with a single NLS language but with several at the same time, often in combination with a number of Latin-script languages. The challenges posed by these users’ digital multilinguality are numerous: annotation, tagging, character recognition, encoding, the lack of proper fonts, heightened problems even with typing in specific languages (particularly in the case of certain NLS), as well as discovery issues and difficulties with transcription and directionality.

Figure 12.2 This bubble graph illustrates the frequency of languages and scripts used by the respondents in question #1 of the survey. Visualisation in Tableau by Horváth.

Table 12.1 Summary of tasks and problems mentioned by respondents in the 2020 survey

Task: Creating and processing texts in non-Latin scripts (NLS)
• Persistent problems with writing or annotating texts in right-to-left scripts, because software and applications implement directionality only for left to right
• Platforms and interfaces support only a relatively narrow range of languages

Task: Displaying characters accurately
• Limitations of Unicode regarding the variety of historical characters (e.g. premodern Chinese characters) hinder the completion and launch of digitally enhanced projects

Task: Searching library discovery platforms in NLS/with special characters
• Anglophone-centric library discovery platforms do not recognise special characters, particularly non-Latin scripts
• Naming and calendrical traditions may diverge from ‘Western’ conventions, resulting in incomplete retrieval results

Task: Using OCR (optical character recognition) tools on digitised NLS texts
• OCR tools often produce lower accuracy for non-Latin texts, or are not usable at all, particularly in the case of non-modern corpora
• Texts from underrepresented or endangered languages are difficult to digitise, resulting in less textual and cultural preservation and a danger of loss of transmission

Table 12.1 gives an overview of the tasks mentioned by respondents and the problems they encounter executing them in/with NLS. Based on these concerns, our respondents drew attention to the need for multilingually accessible online platforms, text editors, word processors, search engines, and discovery tools, which could solve, or at least mitigate, the ongoing difficulties. Furthermore, the user experience of text processing in multilingual environments could be significantly improved through more accurate OCR methods and more multilingually inclusive Unicode and NLP frameworks. The nature of the challenges that respondents articulated in the survey depends, of course, on each user’s combination of languages (i.e. their multilinguality) and their goals. Variety, however, also applies to the question of finding workarounds to achieve the desired results. Approximately two-thirds of the respondents manage to alleviate their difficulties, but their methods (and options) appear to vary heavily with their range and level of digital skills. Most users attempt to distinguish more clearly between LTR and RTL scripts in their multilingual workflows. Some rely on transliterations, whereas others insist on typing in the original script. A comparatively low number of users, presumably with high-level digital skills, articulated a willingness to create their own software, but an almost equal number of scholars would rather return to paper-based sources.
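The distinction many respondents draw between LTR and RTL material in their workflows can be partially automated from Unicode’s bidirectional character categories. The following is a minimal illustrative sketch of our own (not a tool any respondent mentioned), using only Python’s standard library:

```python
import unicodedata

def dominant_direction(text: str) -> str:
    """Classify a string as 'rtl' or 'ltr' by counting strong directional characters."""
    # 'R' and 'AL' are the strong right-to-left categories; 'L' is strong left-to-right.
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in ("R", "AL"))
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")
    return "rtl" if rtl > ltr else "ltr"

# Arabic letters carry the strong 'AL' category; accented Latin letters are still 'L'.
assert dominant_direction("مكتبة") == "rtl"
assert dominant_direction("bibliothèque") == "ltr"
```

A heuristic like this only sorts runs of text; it does not solve the display problems described above, but it can help route NLS material to the appropriate processing path in a mixed-script workflow.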

12.2.2. Survey Findings Specific to Multilinguality

Our respondents represent various languages and levels of digital skill, which suggests three levels of expertise. Some people are absolute beginners, with limited technical understanding and little skill in independently moulding the digital documents they encounter into the shape they need. Others we call ‘power-users’: people who are able to learn to work with technology and to operate complicated pieces of software with relative fluency and independence. Lastly, there are experts, who hardly need to be guided in working with multilingual texts since they have adequate experience and good control over the technical domain. This breakdown is an important aspect of grasping the work which goes on in multilingual DH. In monolingual DH practices (or perhaps even amongst Latin-script bi- or trilingual scholars), we witness the creation of many materials, manuals, and tutorials which facilitate the acquisition of skills (e.g. the Programming Historian). In multilingual NLS DH, this is far less the case, since infrastructure that would facilitate scholarly communication and education is lacking. People typically operate independently, having to work out solutions on their own. The middle group of users mentioned earlier is therefore a fascinating cohort, of whom builders and keepers of databases and other kinds of data infrastructure ought to be aware. It is worth pausing for a moment to wonder, as multilingual digital practices become more of a norm in the world, whether this difference between ‘Latin-script DH’ and ‘NLS DH’ will close or remain. We cannot answer this question at present; instead, in this section, we take a deeper dive into characterising the three groups we discovered in our survey. Respondents with ‘expert’ status are those scholars who write their own software to process and analyse texts.
They noticed that even in the most modern programming languages and ecosystems, such as Python, they encounter pieces of software that do not support RTL scripts or even non-ASCII characters. This leaves them with much work amending those packages to get them to work on their texts. These experts identified a problem that is persistent across all scripts (even Latin ones): the same character can be encoded in different ways. This requires normalisation. Normalisation is a non-trivial task, however, as it usually means loss of information, an undesirable side effect in any work with digital texts. Furthermore, whereas ready-made packages are available for normalising text in Latin script, most of the scripts we work with in the humanities require us to write custom normalisation packages. This group of experts also reported their experiments in crossing the line between native script and transliteration. They want computers to operate in both domains, so that working in one has an effect in both. This would allow searching in the native script and having transliterations of the same text appear in the search results. Since their efforts are often focused only on their own domain, abstractions of such software to other domains are all but nonexistent. Then again, this bottom-up approach of working with problems as one encounters them in the domain(s) in which one usually works is probably for the best, as there are clearly cultural (or regional or historical) quirks in scripts, as well as in dating and naming conventions. To be fair, there is a solution for such problems, called authority files, but cultures using NLS do not always have authority files publicly available. One respondent lamented that creating custom resources takes so much time that little time is left for research inquiry.

Another group which emerged from our research could be called ‘power-users’. They have a relatively advanced understanding of how computers work, and they commonly encode texts in XML or do text mining. They are able to produce new resources such as corpora, catalogues, visualisations, and databases, but they are not able to make software themselves. Indeed, some of them think that programming should not be a necessity. They commonly describe their problems in the passive voice, implying that others would need to solve them. Two issues emerge: bi-directionality, and typing and displaying native script. By bi-directionality they mean that mixed text blocks containing both LTR and RTL text should behave more predictably. Displaying and selecting such text blocks often breaks down with unintended results. This problem already occurs on the backend, when encoding in XML, as the commonly observed standard, the Text Encoding Initiative (TEI), consists of English tags (thus, Latin script). Identical issues can occur even in fully RTL text blocks, as some characters, such as punctuation, seem to have ‘weak directionality’, as one respondent emphasised. For example, all the words of a sentence may be correctly aligned right to left, but although a period is encoded after the last word, it is displayed not on the left of the sentence but on the right. In this case, the computer system has incorrectly assessed the period as an LTR character. These power-users are also best suited to answer questions about how texts ought to look digitally.
For example, one respondent astutely observed that Arabic often appears smaller than English when presented alongside it. This is caused by rendering both Arabic and Latin scripts at the same nominal font size, say size 12. In that case there is no technological error, but the result still merits adjustment. Then there is the problem of scripts beyond basic Latin, and such problems can be found remarkably close to home: take modern Turkish, which uses Latin script but adds to it a specific type of capital I (İ), which can be maddeningly hard to normalise. These power-users think such issues should not have to be solved by them. They are happy to create new resources such as digital editions and repositories, and happy to train new generations of scholars to use such resources, but they can tolerate only so many technical hurdles. Aside from such encoding issues, there are plenty of display problems. Complaints about improper implementation of Unicode standards and half-finished web fonts abound. Respondents experienced difficulties with non-ASCII characters, with punctuation in LTR text when RTL words are included, and more generally with minority scripts not being properly implemented in cross-platform applications.

Half of our respondents can be categorised as ‘beginners’. They would primarily like to use resources available for research or for academic writing. A recurring complaint is that one cannot type in the script native to one’s subject. Workarounds include using virtual keyboards and copying and pasting the right characters, that is, using the mouse pointer to click on the characters that they visually recognise as the ones they would like. This is hugely time-consuming, and clearly this group of people would be helped by proper ways to type in scripts other than Latin. Instead, at present they feel actively discouraged from doing so. In printed publications, publishers often push back on the use of such scripts. One respondent reported that even with a press focused on area studies, the number of non-Latin characters per page was to be kept to a minimum. This also applies to those working on cataloguing. Latin-script transliteration seems always to take precedence, whereas non-Latin scripts serve a secondary or ‘decorative’ purpose. The respondents vary in their tone, from anger to pragmatism. How disarmingly simple and reasonable the request to search a library catalogue directly in the original language! Yet imagine the scenario in which a local cataloguer has put in the hours to provide native-script entries, and a mere software update, data migration, or insufficient search algorithm ruins everything, because up the management chain, knowledge of (or even sympathy for) the importance of NLS is often absent. We have found in our survey that the commercial or national logic of supporting languages of common or national use does not necessarily apply to all multilingual work, and especially not to that of our NLS DH colleagues. The languages of academic work, after all, are not always collocated with the people who have historically inherited these linguistic legacies. Imagine a Sinologist specialising in the Tang dynasty, working in Singapore alongside English and Malay; not only does she need to negotiate palaeographic differences between historical and contemporary Chinese, but her work and living environment constantly pull her into different languages as well.
Using a search engine becomes a daily struggle, as auto-correct or ‘smart’ search prevents her from even executing the correct search query, let alone perusing the results. Moreover, imagine an Arabist working in a Western country, whose work demands far more philological precision in vocalising written texts than the average native speaker needs in writing. There is, in other words, a diversity of ways in which people around the world work with language, and this diversity makes it difficult to converge on common functionality in infrastructure. We may also look at this problem by distinguishing between the different kinds of platforms where such issues arise. In the case of discovery systems, such as search engines or library catalogues, a number of our respondents mentioned that multiscript searches are not available. In some cases, odd workarounds and repeated searches, drawing on knowledge of the inconsistencies in transliteration systems, are required. Furthermore, search algorithms are not ‘smart’ enough to enable robust, fuzzy searching in these languages. In addition, many users point to incorrect or inconsistent display of ‘special’ characters beyond the national-language character set of their catalogues. Often, diacritics must be left off in order to search. Legacy systems in less well-financed countries are particularly hard hit, as we noticed for Perso-Arabic script. Scholars are left to dream of seamless multilingual, multiscript searching in their catalogues. Searching the Web can be just as challenging. Search engines regularly ‘help’ users by returning only or mostly English results. This is especially so when the script is a superset of Latin script (e.g. Welsh, modern Turkish) or when one searches from a country that uses a Latin script (the search engines presumably ‘help’ by localising results). If our beginner group is not served soon, we suspect they will be tremendously turned off by technology and will grow only more wary of adopting more of it.

The case of text editors was discussed at length by our respondents and participants. Whereas some respondents insisted that with Unicode they are able to work across varied character sets and alphabets without a glitch, we suspect that this answer was given mostly by experts. Repeated comments from our respondents underscored that recommendations from the W3C and the Unicode Consortium are not fully or properly implemented in applications for the majority of personal computer users. One of the most shocking refrains of our survey was that there are almost no text editors which allow for simple bidirectional editing, rendering work with multiscript texts very difficult, especially for the DH researchers who prefer ‘plain text’, non-enterprise approaches. A notable example is TEI XML. Our respondents passed on their difficulty integrating LTR tags with RTL text, especially when using bidirectional markers in the text, such as punctuation or parentheses. In editors with two views (one for editing in HTML or Markdown and the other presentational), the rendering can give a false impression of accuracy. Persistent complaints arose among our respondents about composing text in word processors or web content editors. Writing in more than one language in the same document often results in irregular typesetting, as a font will appear regular-sized for one alphabet but too small or too big when used for another. This brings to the fore questions of parity regarding screen-based readability and accessibility. Furthermore, even when suitable fonts are available, the relative size of two fonts, particularly for bilingual and bidirectional texts, can be very different.
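Two of the recurring complaints above, the ‘weak directionality’ of punctuation and the fact that the same character can be encoded in different ways, can both be observed with nothing more than Python’s standard `unicodedata` module. A minimal sketch (it illustrates the Unicode data itself, not the behaviour of any particular editor):

```python
import unicodedata

# 'Weak' directionality: a full stop has no direction of its own, so display
# engines infer its direction from context and can misplace it after an RTL run.
assert unicodedata.bidirectional(".") == "CS"        # Common Separator: weak
assert unicodedata.bidirectional("\u0628") == "AL"   # Arabic letter beh: strong RTL
# Appending U+200F (RIGHT-TO-LEFT MARK) after the stop anchors it to the RTL run.
sentence = "\u062c\u0645\u0644\u0629." + "\u200f"

# One visible character, two encodings: normalisation reconciles them ...
assert "\u00e9" != "e\u0301"                                     # é, stored two ways
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
assert unicodedata.normalize("NFC", "\u0627\u0653") == "\u0622"  # alef + madda → آ
# ... but compatibility normalisation is lossy: the lam-alef presentation form
# folds into two plain letters, and the original form cannot be recovered.
assert unicodedata.normalize("NFKC", "\ufefb") == "\u0644\u0627"
```

Renderers that follow the Unicode bidirectional algorithm will place the anchored full stop correctly; the normalisation calls show why two visually identical strings can fail to match unless a normalisation step is applied first, and why that step can discard information.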
Whereas the ‘what you see is what you get’ (WYSIWYG) principle may render the language legible in an interface, actions such as cutting and pasting from such interfaces can produce disastrous results. Cloud-based content management systems such as WordPress, Google Sites, or other forms of social media seem to fare better, but only for the most elementary of tasks, such as reading and writing in contemporary scripts. Then again, these issues are not always about philological work with historical languages whose character set or alphabet is no longer extant; they also concern smaller languages for which language resources or commercial solutions have not been well developed. There is, in other words, a gap in the ‘enabling technologies’ which leaves authors in non-dominant languages out of the loop. Respondents bemoaned the inaccuracy of spelling-correction systems or search-completion algorithms for minority languages which share homophones with others, or more simply the inability to use these in multilingual text creation. Given the drift towards the virtualisation of platforms, but also the prioritisation of large-population languages by tech companies, many new frontiers of equity stand before us.

12.3. Mapping Problem Spaces for World Languages: Towards Requirement Profiles

Some will retort that since Unicode exists, it should contain a solution for all the documented languages of the world. Humanists provide many case studies, however, that suggest otherwise, as we have heard from our respondents. Our 2020 survey provides us with an incomplete, but thought-provoking, entryway into a variety of practices and opinions informed by multilinguality in DH. The co-authors of this paper are not linguists with expertise in all of the world’s languages; nonetheless, some patterns seemed to us to emerge: repeated problems faced by many different users across different language groups. In this section, we would like to suggest some problem spaces that might be used to group different languages together, if only tentatively, for thinking through questions of infrastructural design. The limited extent of our survey and the scope of this chapter do not allow us to provide concrete solutions for many world languages/NLS. Instead, our limited results suggest that it can be useful to bring together many different groups of people, each of which has a stake in workable digital infrastructure. Some of these groups feel the problem much more acutely, and some are more essential to implementing solutions, but all the groups have a stake and should have a say. Examples of these groups include scholars (end users, our primary group), librarians/libraries/supporting staff, DH professionals, companies selling specifically to GLAM, the larger tech industry and its standards bodies, and native speakers. From our cursory analysis of the survey results above, we would like to suggest that we begin to work pragmatically using a functional approach, rather than starting from any one user’s highly individual multilinguality, in order to inform the design of RIs. It is difficult to imagine any system that could accommodate any number of NLS situations at once. (Imagine a DH project in comparative literature that needs classical Chinese, Old Church Slavonic, and Syriac all at the same time!)
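One concrete case study of the kind our respondents supplied concerns the common catalogue workaround of leaving diacritics off in order to search. Sketched below as canonical decomposition followed by dropping combining marks (an illustrative assumption on our part, not any real catalogue’s algorithm), the workaround silently conflates Turkish’s dotted and dotless I:

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Decompose, then drop combining marks: a common 'diacritic-blind' search step."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert strip_marks("Horváth") == "Horvath"   # accent-blind matching works ...

# ... but it is lossy: Turkish dotted İ lowercases to 'i' plus a combining dot,
# so after stripping marks it becomes indistinguishable from plain 'i',
# while the distinct letter dotless 'ı' survives untouched.
assert "İ".lower() == "i\u0307"
assert strip_marks("İ".lower()) == "i"
assert strip_marks("ı") == "ı"
```

In other words, a workaround that is harmless for one linguistic community quietly merges meaningful distinctions for another, which is precisely why Unicode coverage alone does not settle the matter.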
Detractors of our approach may also respond that languages are neither fungible nor combinable. We would like to suggest, however, that some uncommon cross-language comparison might be useful in arriving at practical solutions to recurring functional issues in RIs. RIs require, after all, specification profiles which assure functionality across the system, without recourse to too many workarounds. Mapping out some of these problem spaces allows us to propose possible lenses through which to consider the functional co-presence of world languages in digital environments. Table 12.2 presents a tentative pairing of some of the languages mentioned in the survey results, forming a thought experiment, if you like, in mapping particular problem spaces for NLS languages. Each row pairs a concern about languages in digital environments with three languages sharing that problem. Of course, every problem is potentially relevant to many more than three languages, but we have kept our mapping simple to highlight the possible synergies among linguistic communities. Table 12.2 is not meant to be the final word on the subject but rather a ‘teaser’ for digital humanists to launch dialogue about multilingual RIs. Imagine, for example, a group of humanists working in Farsi and Swahili cooperating with infrastructure specialists and developers to explore potential cross-language compromise solutions. Such a collaboration might be possible, for example, in scholarly communities such as Indian Ocean Studies. Another possibility for dialogue might be among librarians belonging to a consortium where substantive holdings in Ottoman Turkish, Korean, and Arabic are represented, and where the problem of transliteration despite international standards needs to be discussed and debated.

Table 12.2 Mapping potential problem spaces with sets of non-Latin script languages7

• Languages in non-Latin scripts lacking an agreed-upon transcription system: Arabic, Korean, Ottoman Turkish
• Languages written in a script which is not read from left to right: Japanese, Syriac, Hebrew
• Languages which do not have sufficient OCR support: Chinese, Hausa, Cherokee
• Languages which are shared across political entities: Farsi, Swahili, Russian
• Languages which have changed script one or more times during the era of print culture: Kazakh, Hausa, Azeri
• Languages which share the same ‘script’ but are encoded differently in Unicode: Farsi, Urdu, Arabic
• Languages which have simplified, modern versions in addition to classical versions: Hebrew, Greek, Chinese
• Languages which are based on pictograms, or have mixtures of pictograms, ideograms, phonograms, and/or syllabaries: Chinese, Japanese, Hieroglyphs

12.4. Conclusion

The term ‘global digital humanities’ can refer both to the spread of DH methods and debates into many countries of the world and to an awareness of the interconnected worlds of archives and scholarship. In national or single-language DH circles, the term ‘global’ can create a space of ‘anything else but English’, or whatever other dominant language is in question: Spanish, German, or French. In our survey and our analysis of it, we have explored ways of talking about this ‘anything else’ where it seems most difficult: with NLS. RIs are the spaces we converge upon as researchers to get our work done, and some imagine (naively) that they must be able to contain cultural expressions from all over the world with ease (or at least with effort). Even though our survey was carried out in a very limited timeframe and during a moment of significant global disruption, the partial data that we have illustrate that the DH community and its supporting RIs severely underestimate how many obstacles there are to carrying out research beyond European languages. Thinking about RIs and the multilinguality of researchers opens the door to discussion and debate about an interconnected world of users, language, and design. We mentioned at the beginning of this chapter that some basic questions about the growing world of global digital scholarship remain unanswered. Reading through the results of our informal survey carried out in 2020, interesting new perspectives on those questions emerged, and the question of the compatibility and the design of digital environments for our different uses of language rose to the fore. We have argued here that two conceptual frameworks are especially useful for multilingual DH. The first becomes visible when we divide DH practitioners into skill levels. Beginners truly depend on the technology that is offered to them to complete tasks. Their work can already come to a halt when they simply want to search a catalogue and find that the native script is not supported. Experts can bend and shift their digital documents into whatever shape they like. Often, though, they too rely heavily on the technology that is provided to them, such as fully fledged libraries or modules within a programming language. For multilingual DH, it is especially the middle group of power-users that is unique and interesting. They have found workarounds or simply put in a lot of labour to get the result they need, despite the technology.8 The second framework can be observed when we look at different digital environments. The most fundamental environments, such as search engines and content management systems, tout support for many languages, but invariably through the lens of ‘localisation’ into state-sanctioned languages. Spelling correction, machine translation, auto-completion, or chatbot interfaces will more often than not hinder a user’s intent rather than support it. We note with regret that academic infrastructure, such as library catalogues, is predominantly built on top of these primary technologies and thereby inherits the same constraints. Practised multilinguality is therefore primarily seen outside of the digital world. Evidently, we wish to see better support and more possibilities for multilinguality within the digital world.
Elsewhere, we have developed UX personas which we hope can serve as a starting point for benchmarks for developers and product owners of academic infrastructure to think through the complexity of their user base (Horvath et al. 2023a, 2023b). We have published these user personas openly in the hope that others in the multilingual DH community might expand upon them to encompass facets of DH research which we do not fully understand, or adapt them for co-designing processes of their local RIs. In this chapter, building on our discussion of user level and platform, we suggest that the question of multilinguality in DH might best be seen in terms of problem spaces where different NLS languages fail to perform as they need to. Importantly, we hope that focusing on the functional specificities of particular languages may serve as fertile ground for future discussions, and offer tangible terms for representatives of multilingual DH to rethink the facile assumption that Unicode has fixed all of our problems. Such discussions will be challenging ones, since they will require a high-level ‘meeting of the minds’, working across and between languages in ways that computing has perhaps never tried before. We would like to reiterate one of the key points of our argument, namely that if NLS DH work is to become more practical and inclusive, it needs to develop with the continual input of digital humanists. Two key lacunae of our survey were the limited timeframe and the mode of distribution of the call to participate, which naturally led to certain languages and geographical areas not finding expression in our data. Of course, continued discussions with the users of various languages and scripts will be necessary, first, to make sure that the data we have is as complete as possible and, second, to imagine those potential commonalities that could bring users on a similar platform. We have proposed Table 12.2 as a practical starting point and inspiration to show how we conceptualise possible points of contact between languages from a DH perspective, according to how their specificities present challenges to our workflows and our digital environments. Neither siloed solutionism nor simplifying standardisation is the path forward, as they will only encourage the concentration of digital knowledge in dominant languages and in the global North. Our chapter has only begun to scratch the surface. For us to fully understand how efforts might be combined to facilitate truly inclusive, multilingual DH research, we need much more data about the user base, and it needs to be created in an openly available, reusable dataset. Design of future survey instruments will be crucial in making sure that the diversity of the respondent pool is sought and maintained, not only for languages and scripts used or for geographic distribution but also for user level and task diversity, as our survey data has emphasised. We call on the multilingual DH world and its supporting RIs to carry out a longitudinal study of this sort for the health of our discipline.

Notes

1 See Asef et al. (2018), Dombrowski (2020a, 2020b), Fiormonte (2012, 2021), Ghorbaninejad et al. (2023), Golumbia (2013), Grallert (2022), Horvath (2022), Liu (2012, 2020), Mahony (2018), Ortega (2014), Perri (2009), Spence and Brandao (2021); Wrisley (2019).
2 It is important for us to make a distinction between the multilinguality of our surveyed practitioners as it emerged in the survey and the different, official multilingualisms found in various nation-states of today’s world. By multilinguality, we mean the complex, intertwined linguistic resources of users, including the specific combinations of languages they employ in their own tasks (Aronin and Singleton 2012). This is distinct from multilingualism in our mind, which connotes a more top-down linguistic view of a state, region, or institution.
3 See also Josh Brown’s chapter ‘Digital approaches to multilingual text analysis: the Dictionnaire and its morphology as hybrid data in the past’ in this volume.
4 There is already a growing literature on the topic of ‘decolonising DH’ and towards more multilingually grounded research in DH, as well as community-building efforts like Global Outlook::Digital Humanities (GO::DH), multilingualdh.org, and other more local working groups. We do not have a concrete picture, however, of typical requirements (and roadblocks) of NLS DH researchers at different stages of their academic career, nor do we know what kind of NLS-DH-enabling KIs and services are a prerequisite for successfully ‘disrupting digital monolingualism’ (Spence 2021). The task of building infrastructures through participative user experience-based collaborations across the many NLS research specialties, as well as within and between institutions of higher education—given the number of people involved worldwide—is a daunting task. While the topics of multilingualism, infrastructure, DH, and user experience each have a significant and somewhat overlapping critical literature (Gibbs and Owens 2012; Hixson and Parrott 2021; Nielsen 2019), it is at the intersection of all four subjects that we hope to make our intervention.
5 For a detailed conference report, see Spence (2021).
6 https://mailman.stanford.edu/mailman/listinfo/multilingual-dh
7 We suggest that this table of problem spaces be submitted as a pull request for https://github.com/multilingual-dh with a small explanatory paragraph so that it can exist online

as a living document for the multilingual DH community. A number of multilingual DH initiatives seem to be in development at the time of writing this chapter. International venues such as ADHO or DARIAH seem to be fruitful sites for dialogue and forward thinking in this domain.
8 For a practical example of implementing a multilingual graphical user interface, see Antonio Rojas Castro’s chapter on ‘Multilingual Interfaces for Everyone: Internationalization and Localization Strategies in Proyecto Humboldt Digital’ in this volume.

References

Aronin, Larissa, and David Singleton. Multilingualism. Amsterdam: John Benjamins Publishing Company, 2012.
Asef, Esther, Martin Lee, and Cosima Wagner. “Workshop-Bericht Nicht-lateinische Schriften in multilingualen Umgebungen: Forschungsdaten und Digital Humanities in den Regionalstudien.” DHd-Blog: Digital Humanities im deutschsprachigen Raum, October 24, 2018. https://dhd-blog.org/?p=10669. English version: https://blogs.fu-berlin.de/bibliotheken/2019/01/18/workshop-nls2018/
Dombrowski, Quinn. “Preparing Non-English Texts for Computational Analysis.” Modern Languages Open 1, no. 45 (2020a): 1–9. https://doi.org/10.3828/mlo.v0i0.294
Dombrowski, Quinn. “What’s a ‘Word’: Multilingual DH and the English Default.” Blog, 2020b. http://quinndombrowski.com/?q=blog/2020/10/15/whats-word-multilingual-dh-and-english-default
Fiormonte, Domenico. “Towards a Cultural Critique of the Digital Humanities.” Historical Social Research/Historische Sozialforschung 37, no. 3 (2012): 59–76. www.jstor.org/stable/41636597
Fiormonte, Domenico. “Taxation Against Overrepresentation? The Consequences of Monolingualism for Digital Humanities.” In Alternative Historiographies of the Digital Humanities, edited by Dorothy Kim and Adeline Koh, 333–376. Santa Barbara: Punctum Books, 2021.
Ghorbaninejad, Masoud, Nathan P. Gibson, and David Joseph Wrisley. “Right-to-Left (RTL) Text: Digital Humanists Plus Half a Billion Users.” In Debates in the Digital Humanities, edited by Matthew K. Gold and Lauren Klein, 47–73. Minneapolis: University of Minnesota Press, 2023.
Gibbs, Fred, and Trevor Owens. “Building Better Digital Humanities Tools: Toward Broader Audiences and User-Centered Designs.” Digital Humanities Quarterly 6, no. 2 (2012). http://digitalhumanities.org:8081/dhq/vol/6/2/000136/000136.html
Golumbia, David. “Postcolonial Studies, Digital Humanities, and the Politics of Language.” Uncomputing, May 31, 2013. www.uncomputing.org/?p=241
Grallert, Till. “Open Arabic Periodical Editions: A Framework for Bootstrapped Scholarly Editions Outside the Global North.” Digital Humanities Quarterly 16, no. 2 (2022). www.digitalhumanities.org/dhq/vol/16/2/000593/000593.html
Hixson, Taylor, and Justin Parrott. “Using UX Personas to Improve Library GIS and Data Services.” Weave: Journal of Library User Experience 4, no. 2 (2021). https://doi.org/10.3998/weaveux.221
Horvath, Aliz. “Digital Brush Talk: Challenges and Potential Connections in East Asian Digital Research.” In Global Debates in the Digital Humanities, edited by Domenico Fiormonte, Sukanta Chaudhuri, and Paola Ricaurte, 127–140. Minneapolis: University of Minnesota Press, 2022.
Horvath, Aliz, Cornelis van Lit, Cosima Wagner, and David Joseph Wrisley. Six User Personas for the Multilingual DH Community (v1.0) [Data set]. DH Unbound Conference 2022. Zenodo, 2023a. https://doi.org/10.5281/zenodo.7811800
Horvath, Aliz, Cornelis van Lit, Cosima Wagner, and David Joseph Wrisley. “Centering Multilingual Users: Thinking Through UX Personas in the DH Community.” Digital Studies/Le champ numérique, 2023b, forthcoming.
Liu, Alan. “Where Is Cultural Criticism in the Digital Humanities?” In Debates in the Digital Humanities, edited by Matthew K. Gold, 490–509. Minneapolis: University of Minnesota Press, 2012.
Liu, Alan. “Toward a Diversity Stack: Digital Humanities and Diversity as Technical Problem.” PMLA/Publications of the Modern Language Association of America 135, no. 1 (2020): 130–151. https://doi.org/10.1632/pmla.2020.135.1.130
Mahony, Simon. “Cultural Diversity and the Digital Humanities.” Fudan Journal of the Humanities and Social Sciences 11, no. 3 (2018): 371–388. https://doi.org/10.1007/s40647-018-0216-0
Nielsen, Lene. Personas—User Focused Design. London: Springer, 2019.
Ortega, Élika. “Multilingualism in DH.” 2015 MLA Position Papers, Digital Edition, December 31, 2014. https://web.archive.org/web/20210424073656/www.disruptingdh.com/multilingualism-in-dh/
Pawlicka-Deger, Urszula. “Infrastructuring Digital Humanities: On Relational Infrastructure and Global Reconfiguration of the Field.” Digital Scholarship in the Humanities 37, no. 2 (2021): 534–550. https://doi.org/10.1093/llc/fqab086
Perri, Antonio. “Al di là della tecnologia, la scrittura. Il caso Unicode.” Annali dell’Università degli Studi Suor Orsola Benincasa II (2009): 725–748.
Smithies, James. The Digital Humanities and the Digital Modern. London: Palgrave Macmillan, 2017.
Spence, Paul. Disrupting Digital Monolingualism: A Report on Multilingualism in Digital Theory and Practice. London: Language Acts and Worldmaking Project, 2021. https://doi.org/10.5281/zenodo.5743283
Spence, Paul, and Renata Brandao. “Towards Language Sensitivity and Diversity in the Digital Humanities.” Digital Studies/Le champ numérique 11, no. 1 (2021). https://doi.org/10.16995/dscn.8098
Steiner, George. After Babel: Aspects of Language and Translation. Oxford: Oxford University Press, 1998.
Wrisley, David Joseph. “Enacting Open Scholarship in Transnational Contexts.” POP: Public Open Participatory (2019). https://doi.org/10.21810/pop.2019.002

13 Digital Approaches to Multilingual Text Analysis

The Dictionnaire de la langue franque and Its Morphology as Hybrid Data in the Past

Josh Brown

13.1. Introduction1

Text analysis can be seen to mark the birth of digital humanities (DH). A widely conventionalised date fixes its origins to the beginning of 1946, with Roberto Busa’s plans for the Index Thomisticus—a massive attempt to encode nearly 11 million words of Thomas Aquinas’ writings on IBM punch cards (Sula and Hill 2019; also Terras and Nyhan 2016). From these early origins, text analysis has now become a very wide research area, involving knowledge discovery in written text. Before data can be processed in any meaningful way, text must be structured through the application of natural language processing (NLP). Multilingual texts pose a particular problem in this regard, since tools from DH often need to apply analysis that is not related to a linguistic structure of text in one specific language but rather relies on methodologies common to many languages (Vanetik and Litvak 2019, 1). Much of the early focus in text analytics has been on English-language material. Since then, researchers have become more adventurous in attempting to analyse corpora that are multilingual in nature. The precise nature of what ‘multilingual’ and ‘text’ mean is far from obvious. This chapter attempts to respond to the question: How is it best to proceed when we try to analyse digitally varieties of languages that do not belong to any one linguistic taxonomy? Some focus has gone into attempts to ‘broaden out’ DH scholarship across multiple languages in recent years. Terms such as ‘multilingual digital humanities’, or variations of this syntagm, have begun to appear in various fora, both in scholarly literature (Horvath 2021; Nilsson-Fernàndez and Dombrowski 2022) and on blogs and other online spaces. Others have argued that ‘Non-English DH is not a thing’ (Dombrowski 2022). Dombrowski shows how power and scholarly communications are more problematic than is usually assumed across a variety of phenomena, including time zones, language of publication, and journals in particular languages.
In a wide-ranging address, she usefully frames multilingual DH in terms of broader questions relating to the field, such as ‘who holds the keys to power? what languages are call for proposals in?’ Many of these issues are ‘recursive problems’. Part of the aim of making terms such as these more visible has been to highlight the geographical and linguistic diversity already present in DH (Galina Russell 2014; Spence 2014; Mahony 2018; Dombrowski 2021). This chapter responds to the ongoing development of multilingual digital humanities as it relates to multilingual texts. Specifically, it complicates the question of what ‘multilingual’ and ‘text’ mean in situations of language contact that render both terms ambiguous. The chapter applies this broader framework to one specific document from the past: the Dictionnaire de la langue franque of 1830. This dictionary, written by an anonymous author in Marseille, has been described as containing the most comprehensive and complete lexical entries for a Mediterranean trade language used throughout the early modern period. The chapter then turns to look at standard language models and mixed-language data by considering the underlying assumptions of particular stemmers.

DOI: 10.4324/9781003393696-18

13.2. ‘Multilingual’ Texts in DH

What might the term ‘multilingual’ mean when we talk about ‘multilingual texts’? Given the polysemy of both these terms (multilingual and texts), their applications in situations of language contact may end up rendering them ambiguous and ultimately toothless when applying them to situations in the past. Plainly, multilingual texts can mean texts written in multiple languages. Historical texts that contain evidence of multilingualism often vary widely, including in terms of the degree of contact between languages, number of languages, linguistic typology, and others. A traditional descriptor has been to characterise the degree of multilingualism at either the inter-sentential or intra-sentential level or both. In other words, we may be dealing with textual data that combines languages at the sentence level or that code-switches languages at the word level (e.g. sea casa—sea house). In many cases of language contact, it is difficult (if not impossible) to ascribe a particular ‘word’ to any one linguistic variety in a categorical way: code-mixing between Italian and English, for example, produces forms such as fensa ‘fence’, with an English nominal stem (fens-) but Italian morphological inflection (-a). In other cases still, influences may be present at an orthographical level. This is the case in a variant such as hogy ‘oggi’ [today], where h represents a purely orthographical convention (i.e. without any phonological value), taken as evidence for a Latinising script (Brown 2015, 689). The use of grapheme -y here shows the author’s uncertainty in representing word-final vowels in a situation where no standard orthography exists. This uncertainty is one aspect which modern scholars must also confront when attempting to represent digital forms of linguistic heterogeneity.
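The categorical-attribution problem can be sketched with a toy word-level classifier. The wordlists and function below are hypothetical stand-ins invented for this example, not real lexica or an established tool; their point is that a hybrid form such as fensa simply falls through both lists:

```python
# Toy word-level language attribution. The two wordlists are deliberately
# tiny, hypothetical stand-ins for full English and Italian lexica.
ENGLISH = {"sea", "house", "fence"}
ITALIAN = {"casa", "oggi", "recinto"}

def attribute(token: str) -> str:
    """Assign a token to one language, if the wordlists allow it."""
    in_en, in_it = token in ENGLISH, token in ITALIAN
    if in_en and in_it:
        return "ambiguous"
    if in_en:
        return "English"
    if in_it:
        return "Italian"
    return "unattested"  # code-mixed forms like 'fensa' end up here

for token in ("sea", "casa", "fensa"):
    print(token, "->", attribute(token))
```

No amount of enlarging the wordlists resolves the final case: fensa is well formed in neither variety on its own, yet perfectly interpretable in the contact situation described here.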
Since words often combine multiple varieties in ways that make it difficult to distinguish one variety from the other, there is no assured way to disambiguate which language a particular word belongs to. The examples sea casa, fensa, and hogy discussed earlier, with multiple variants discernible within the same ‘word’, do not belong to any contemporary, standard language. Given that much historical language contact takes place in situations where no standard exists, variants which freely combine elements from two (or more) languages can be said to occupy a code-intermediate space.2 Part of the
reason for a lack of interest in digital representations of heterogeneous linguistic variants is that historical linguistics itself has had a strong tradition of ideology of the standard (Watts 2015). Developing tools such as annotated corpora or applying part-of-speech (POS) tags for historical language poses nontrivial problems even for languages that are historically recognised and which continue to be spoken by modern communities. This is the case with Old Catalan—an extremely low-resource language with rich inflection and frequent homographs. In cases such as these, previous work on other languages such as Middle Welsh or Classical Tibetan can allow for a semi-supervised method of manual tagging and correction. More importantly, they allow one to build up a training set in an incremental way (Meelen and Pujol i Campeny 2021; Horvath et al., this volume). These examples are just a few ways in which ‘words’ may combine linguistic variants from different varieties. This mixing can be at the level of the whole language, at different syntactic levels, or even at the root, infix, or suffix (as well as phonological and/or orthographical). Multilingualism in DH is often framed through a lens of diversity, or representativeness of the ‘number’ of standard languages used in calls for papers, conferences, publications, etc. In its worst case, this diversity is dismissed as tokenistic and can be no more than the provision of translations of particular texts. But multilingualism is also about language itself, the variety of language(s), and the diverse ways such multilingualism can be represented digitally to mirror its human expression. In short, there is so much opportunity for heterogeneity in language, whether they be the Spanishes, the Arabics, the Chineses, the Dutches, and so on. The textual forms in which such multilingualism appears may take on a variety of forms.
In short, ‘multilingualism’ can mean different things. In the same way, discerning what a ‘text’ is, is not a simple question. The exclusion of minority voices—or, in the case discussed later, ‘invisible’ ones—inevitably perpetuates selection biases. Only recently have scholars begun to attempt digital text analysis using multilingual data. These include born-digital data as well as text in physical formats.

13.3. Multilingual ‘Texts’ in DH

How is it best to proceed when we try to analyse digitally varieties of languages that do not belong to any one linguistic taxonomy? One issue is that ‘not all metadata standards are capable of encoding multilingual content in a sufficient way’ (Arnold 2019). The need for computers to be able to assign binary values complicates a vast swathe of linguistic production which cannot easily be interpreted as Italian, Swahili, French, English, etc. This is particularly true for textual materials which ostensibly contain evidence of ‘mixed’ varieties of languages, such as lingua francas. Different issues arise when dealing with non-English language material in a digital space. To some degree, these challenges come about due to a lack of resources or available digital infrastructure to represent linguistic production in an adequate way. Wagner (2020), for example, identifies several challenges for the use of non-Latin scripts in the digital space. Challenges include limitations of script reproduction (directionality, image-text connection, Unicode as gatekeeper), mapping of
characters, OCR (Optical Character Recognition) with less ground truth to build on, and missing agreed standards for transcription of non-Latin scripts and metadata (see also Horvath et al., this volume, on challenges of non-Latin script tools). Dombrowski (2020) explicitly addresses the question of preparing non-English texts for computational analysis. She notes right from the start that the methods developed for computational text analysis in English ‘tend to be developed with certain assumptions about how “words” work’. While English fits some of these assumptions quite nicely, many languages do not. These include language-specific properties of English orthography (words are separated by a space when written), morphology (minimal inflection), and others. For languages like Arabic and Quechua, as well as historical languages such as Latin and Sanskrit, repetitions of a ‘word’ may be obscured to algorithms with no understanding of grammar due to variation in number, gender, or case in which that word occurs.3 Consequently, text modification is necessary before an algorithm can count various forms as the same ‘word’. In some cases, the goal ‘is to arrive at a different kind of understanding of a text using some form of word frequency analysis’. Being able to distinguish what a ‘word’ is allows for much more interpretable computational results than if one gives the algorithm a form of the text intended for human readers. One problem which researchers often confront in textual analysis is stemming and lemmatisation. Vanetik and Litvak (2019, 7, 16) offer brief but useful comments regarding the issues involved. They provide an example of stemming performed by Python NLTK (Natural Language Toolkit) using the Porter stemming algorithm, noting that ‘most existing and available algorithms for stemming and lemmatisation are strictly language-dependent tasks’.
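The language dependence Vanetik and Litvak describe is easy to reproduce. The sketch below assumes NLTK is installed (pip install nltk); the example words are my own, chosen to echo the Italian verb forms discussed later in the chapter:

```python
# A sketch of stemmers' language dependence, using NLTK's freely
# available implementations (assumes NLTK is installed).
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()              # designed for English
italian = SnowballStemmer("italian")  # Snowball's Italian model

# The English rules handle English inflection...
print(porter.stem("running"))       # run

# ...but know nothing about Italian verb morphology, whereas the
# Italian Snowball model strips the participial suffix correctly.
print(porter.stem("abbandonata"))   # left essentially untouched
print(italian.stem("abbandonata"))  # abbandon
```

Swapping stemmers is a one-line change, but for mixed-language material there is no single right line to write: every model presumes one standard language.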
This goes for many projects in DH, which often rely on available models and projects that have been designed according to standard language criteria. While these authors point to online resources for particular stemming algorithms developed for various European languages (such as Snowball), no evaluation of particular models or their implementation is discussed. Researchers face the question of choice. Selecting a particular stemmer (Porter, Snowball, Lancaster, WordNetLemmatiser, etc.) will naturally lead to different results—and different types of results—according to the overall aim of analysis. In any case, the objective of stemming remains the same: the reduction of inflectional forms and derivational morphologies to a common base form. Languages which have a highly inflected morphology may appear in many different types of ‘texts’ from the past, including trade material exchanged via lingua francas, as discussed later. In this case, we are dealing with an ‘invisible’ language whose variety remains buried at the bottom of archives. Each intervention in this form of knowledge recovery can be seen as an additional layer of transformation. These interpretations occur at multiple levels. First, mixed multilingual data is already a transformation as the data are imagined in creative ways. Cross-fertilising two (or more) separate linguistic varieties to produce new data is already one layer of extrapolation. The assumption is that the data will be interpretable by an imagined user-speaker. Second, the written configuration of these forms occurs via processes of abstraction that do not follow any norm,
other than what is imagined as the phone-grapheme mapping in the writer’s mind. The question goes to the very heart of what a language model is but also what a language is. At each transformation, semantic and linguistic values may become lost or distorted. Similarly, Viola and Fiscarelli’s (2021) project on a repository of twentieth-century newspapers shows how enriching a digital heritage collection ultimately means sacrificing content. Any form of historical linguistic (re)production will also require critical digital literacy to ensure the greatest degree of preservation and knowledge access. Another useful approach is to consider mixed language varieties in the past precisely as ‘data’. In the case of the Dictionnaire discussed later, the very existence of the document implies that its author had conceptualised a separate linguistic variety. In other words, ‘data need to be imagined as data to exist and function as such, and the imagination of data entails an interpretative base’ (Gitelman and Jackson 2013, 3). In this case, the title of the object, its subtitle, and its textual layout betray the author’s assumptions in relaying (writing) a linguistic variety which they had envisioned as being such. Considering hybrid language from the past as data also requires us to look ‘under data to consider their root assumptions’. Put another way, ‘data need to be understood as framed and framing’ (Gitelman and Jackson 2013, 4). All linguistic production can be considered as data in this regard. In terms of the Dictionnaire, the title invented by the author describes the linguistic variety they are recording (langue franque ou petit mauresque). The subtitle indicates to future readers its teleological value and intended audience: to aid French persons in Africa (à l’usage des Français en Afrique).
In the layout, the data are framed as lexical entries, with two columns on one folio representing a one-to-one mapping from French to the Mediterranean Lingua Franca (MLF). When we consider these aspects as ‘data’, the interpretation of hybrid language from the past is likely to take on new meanings created by their end users, as occurs with other historical matter. The rest of this chapter turns to issues of stemming in data from one particular document from the past—the Dictionnaire de la langue franque ou petit mauresque. After brief background notes by way of introduction, I deal with some questions of morphology and stemmers in preparing the non-English elements for analysis.

13.4. The Dictionnaire de la langue franque ou petit mauresque4

The Dictionnaire was written in 1830 and published in Marseille, France. It has been described as recording the most comprehensive and complete lexical entries for a Mediterranean trade language used sometime around the Middle Ages and throughout the early modern period and known as MLF. Even though it is a ‘dictionary’, there are no definitions. Rather, it provides French terms in the left-hand column and corresponding terms in ‘lingua franca’ in the right-hand column. Images of the title page of the Dictionnaire, and the first page of recorded forms beginning with the letter A, are provided in Figures 13.1 and 13.2, respectively.


Figure 13.1 Title page of the 1830 Dictionnaire, printed in Marseille. Bibliothèque nationale de France. Source: https://gallica.bnf.fr/ark:/12148/bpt6k6290361w.texteImage


Figure 13.2 The first page of the Dictionnaire showing French and MLF. Bibliothèque nationale de France. Source: https://gallica.bnf.fr/ark:/12148/bpt6k6290361w.texteImage
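The two-column layout shown in Figure 13.2 can be modelled as plain key-value data. The sketch below uses two MLF entries cited later in this chapter (adesso and cascar); the dictionary literal is a hand-built illustration, not a transcription of the full source:

```python
from collections import defaultdict

# French headword -> MLF equivalent, as in the Dictionnaire's layout.
french_to_mlf = {
    "maintenant": "adesso",
    "présent": "adesso",
    "s'écrouler": "cascar",
    "glisser": "cascar",
    "tomber": "cascar",
    "couler": "cascar",
    "écouler": "cascar",
}

# Inverting the mapping surfaces the duplicates in the MLF column:
# several French headwords share a single MLF form.
mlf_to_french = defaultdict(list)
for french, mlf in french_to_mlf.items():
    mlf_to_french[mlf].append(french)

for mlf, french_terms in sorted(mlf_to_french.items()):
    print(f"{mlf}: {', '.join(french_terms)}")
```

Applied to the whole dictionary, the same inversion is what collapses the duplicated MLF column into a list of unique lexemes, as discussed in Section 13.5.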

Previous analyses have attempted to describe the linguistic make-up of MLF data contained in the Dictionnaire, and in other texts taken to contain MLF (Baglioni 2018; Nolan 2020a, 2020b; Operstein 2017, 2018, 2021). Nevertheless, much debate remains about the precise nature of its origins, whether it ever creolised
(Operstein 1998)—therefore providing evidence for a separate linguistic variety—and whether it is in fact a language at all (Brown 2022, 2023). These questions become acutely relevant when attempting to (re)present digital forms of hybrid language in the past. MLF is an extinct trade language, with a Romance lexical base, used around the Mediterranean basin. The Dictionnaire has been termed ‘the only comprehensive source’ [die einzige umfassende Quelle] by Schuchardt (quoted in Coates 1971, 25). Research on MLF is characterised by a general lack of agreement on precisely which data constitute this language variety (Selbach 2017; Brown 2017, 2022). Regardless of whether any particular variant can be said to belong to MLF—or belong to French, Italian, Spanish, etc.—MLF demonstrates a highly inflectional verb and noun morphology (Operstein 2018). In linguistic morphology, stemming is the process of reducing inflected lexemes to their word stem, base, or root. Oftentimes, researchers are interested in the morphological root of the word, in order to see the various types of inflections that a lexeme may display. Stemming is helpful for tagging or parsing purposes, or in order to group particular verb or noun morphologies together when they appear in unstructured texts. The question of stemming, therefore, is particularly useful for being able to identify all word forms which may belong to MLF, not just in the Dictionnaire but in other historical corpora too.

13.5. Error Measurement in Stemming Algorithms

Two error measurements exist in stemming algorithms. The first is overstemming; the second is understemming. In overstemming, an error occurs where two separate inflected words are stemmed to the same root but should not have been. This is similar to receiving a ‘false positive’ in statistical analyses when performing particular types of tests. On the other hand, understemming is a type of error where two separate inflected words should be stemmed to the same root, but they are not—a ‘false negative’. The Porter stemmer, for example, stems ‘universal’, ‘university’, and ‘universe’ to ‘univers’. This can be seen as a case of overstemming. Although these three words are etymologically related, their modern meanings are so different that treating them as synonyms in a search engine will reduce the relevance of the search results. On the other hand, an example of understemming is ‘alumnus’ > ‘alumnu’, ‘alumni’ > ‘alumni’, ‘alumna’ > ‘alumna’. This word, which has entered into standard English, has kept its Latin morphology, and so these near-synonyms are not conflated. Stemming algorithms attempt to minimise each type of error, often with varying results. The semi-structured nature of the data in the Dictionnaire, and the numerous duplicates it contains, also mean that stemming is a particularly useful activity to carry out on such data. Many of the MLF headwords appear numerous times for different French terms. For example, the MLF entry adesso is given as the corresponding element for both the French maintenant and présent. Similarly, MLF cascar is provided as the only entry for French s’écrouler, glisser, tomber, couler, and écouler. Other studies of contact languages have shown that, after initial mixing takes place, we are likely to see wide polymorphy in the koine pool before

Digital Approaches to Multilingual Text Analysis 221 levelling occurs (Britain 2012, 224).5 It is only in a subsequent period that multiple features may become levelled, but different semantic terms are likely to remain. In this sense, the entries recorded in the Dictionnaire do not appear to be representative of a separate variety, whether it has creolised or not. Identifying a list of stemmed entries allows for easier identification of the lexical variation inherent in MLF data. It also means that such data can be used as a diagnostic to search for evidence of MLF in other data sources. If one eliminates the duplicates in the MLF column from the 2,120 total number of entries, 1,887 unique lexemes remain. The question of deciding which stemmer to apply to such data is not obvious. This is particularly the case for mixed-language data, since models are designed with specific languages in mind. These models are usually based on standard languages. Different stemmers can produce varying results, and their outputs can vary according to the algorithm used, the lexical stock included in their databases, as well as how they deal with lower-case versus upper-case letters, apostrophes, diacritics, and accents, just to name a few. The stemmers discussed in the next section include two widely used algorithms that are freely available online, Porter Stemmer and Snowball.6 Before entering into the following discussion, a brief word is needed to introduce each stemmer and contextualise the development of how they were produced. The Porter Stemming Algorithm (or more simply, ‘Porter Stemmer’) is a process for removing the ‘commoner morphological and inflexional endings from words’.7 The Snowball algorithm is similar to the Porter Stemmer. It is also known as Porter 2. Generally speaking, its algorithm is a better version of Porter Stemmer and is more aggressive. 
Snowball is, in fact, ‘a small string processing language for creating stemming algorithms’.8 It is one of a handful of stemmers which contains a model for Italian—one of the main donor varieties to MLF (Brown 2022). In the discussion below, focus is placed on the latter of these two stemmers.

13.6. Discussion

Choosing the right model for stemming is crucial in order to achieve meaningful results. The Romance morphologies listed on the Snowball homepage describe the various forms used in creating the model, for both derivational and inflectional morphology. This model was one of the first which attempted to analyse multilingual data. Samples of vocabulary for specific languages are also available. By way of example, the first ten words listed and stemmed for Italian are provided in Table 13.1.

Table 13.1 First ten words of Italian vocabulary with stemmed forms generated by Snowball

Word            Stem
abbandonata     abbandon
abbandonate     abbandon
abbandonati     abbandon
abbandonato     abbandon
abbandonava     abbandon
abbandonerà     abbandon
abbandoneranno  abbandon
abbandonerò     abbandon
abbandono       abband
abbandonò       abbandon

As can be seen, the presence of word-final accented vowels has a drastic impact on the returned stem, leading to cases of understemming. The last two items in the table show the difficulties involved. The penultimate item, abbandono, is a first-person singular present verb (whose infinitive is abbandonare), while abbandonò is a third-person singular ‘past historic’ (passato remoto) verb with the same infinitive—but the stemmer returns different results. Nevertheless, the first four entries in the table, showing reflexes of past participles, which differ according to gender and number, return the correct morphological stem abbandon-.

Similarly, the underlying model for Porter Stemmer provides output data based on the particular model used to create the different stemmed equivalents. For example, importing NLTK, creating a stemmer instance, and running the model on the terms bat and batting will produce bat, bat:

import nltk
from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem('bat'))      # bat
print(ps.stem('batting'))  # bat

Running this script on MLF data does not return meaningful results, since these data do not form part of the library. Running the same model on contentar and contento, or commandar and commando, does not stem the entries at all—another case of understemming. The same type of result is returned even when the input data are identical except for the presence of a diacritic. This occurs with the corresponding items provided for the French entries entrepôt and dépôt, translated in the Dictionnaire with the terms déposito and deposito, respectively. Running the same script on these entries:

print(ps.stem('deposito'))
print(ps.stem('déposito'))

similarly provides no stemming of the data. This goes for all lexical roots which may contain stem extenders or present both verb and noun pairings for particular items. A case in point is escambiar and escambio, for which the term échange is provided in the ‘French’ column in the Dictionnaire. The implications of this algorithm for mixed data in Romance, whether historical or not, raise further questions about the typology used for the underlying model of derivational suffixes (called d-suffixes).9 One advantage of Snowball is that it contains a list of suffixes used in the algorithm across different Romance languages. A summary of d-suffix typologies for French, Spanish, Portuguese, and Italian is provided in

the documentation for Snowball on its homepage, and they are reproduced in Table 13.2. The first column lists the linguistic category of the particular suffix (one for adverbs, six for adjectives, 13 for nouns, one for verbs), while the second column lists equivalent forms in English.10

Table 13.2 d-suffixes of Romance language stemmers

Category   English  French    Spanish    Portug.   Italian
adverb     LY       (e)ment   (a)mente   (a)mente  (a)mente
adjective  IC       ique      ico        ico       ico
adjective  ABLE     able      able       ável      abile
adjective  IBLE               ible       ível      ibile
adjective  OUS      eux       oso        oso       oso
adjective  ENT      ent       ente       ente      ente
adjective  IVE      if        ive        ivo       ivo
noun       ANCE     ance      anza       eza       anza
noun       ISM      isme      ismo       ismo      ismo
noun       IST      iste      ista       ista      ista
noun       MENT     ment      amiento    amento    mente
noun       ATOR     ateur     ador       ador      attore
noun       ATRESS   atrice                         atrice
noun       ATION    ation     ación      ação      azione
noun       LOGY     logie     logía      logía     logia
noun       USION    usion     ución      ución     uzione
noun       ENCE     ence      encia      ência     enza
noun       ANCE     ance      ancia      ância     anza
noun       ANT      ant       ante       ante      ante
noun       ITY      ité       idad       idade     ità
verb       ATE      at        at         at        at

What is striking about the underlying pattern in this table with respect to MLF data is that the endings show only a tenuous similarity between standard language forms and the actual forms listed in the Dictionnaire. In some cases, the model may provide spurious results. This is the case for -amento, listed as a Portuguese ending (the corresponding Italian noun ending is given as -mente)—but an entry such as testamento is also standard Italian. In some cases, the suffixes in the model are present in only one of the four languages given (e.g., the adjective suffix ique is only French). In other cases, some items are common to several languages (ador, common to both Spanish and Portuguese), while others are common to the last three (ico, oso, ente, and so on).

How useful is this stemmer when applied to MLF data in the Dictionnaire, and what can we say about the validity of the model underlying the Snowball stemmer in cases of hybrid data? When it comes to nouns, the model accounts for six suffixes present in MLF. Of these, four are French: ateur (1 occurrence, sacrificateur); ousif (1, but not an adjective; the French translation is nègre, esclave); ation (6, sitouation, sommation, condanation, dispération, fortification, ration); and ant (1, diamant). There is only one for Spanish, ancia (1, bilancia ‘scale’), and one for Portuguese, amento (6, testamento ‘witness’, armamento ‘armoury’, campamento ‘encampment’, désarmamento ‘disarmament’, dgiouramento ‘pledge’, ornamento ‘decoration’). There are none for Italian.

As mentioned earlier, it is not the case that certain suffixes are mutually exclusive to the language under which they are listed. In other words, the model provides a framework for mixed-language data if and only if the suffixes contained in the data uniquely match those listed under the standard languages in Table 13.2. In other cases still, overstemming is present. This goes in particular for the one occurrence of the adverb suffix listed in the model, (a)mente (1, altramente ‘otherwise’). It is also present in the following morphologies:

adjectives: ico (1, antico, rico), which also produces the noun arsénico ‘arsenic’ and the adjective/nouns poublico ‘public’ and risico ‘risk’; oso (9 occurrences: curioso ‘curious’, fourioso ‘furious’, piloso ‘hairy’, prétzioso ‘precious’, sérioso ‘serious’, spacioso ‘spacious’, superstitzioso ‘superstitious’, vénimoso ‘poisonous’, vergognioso ‘shameful’), plus one noun, riposo ‘rest’; ente (1, proudente ‘prudent’)

nouns: ista (2, vista ‘vista’, provista ‘supply’); ante (1, levante ‘levant, east’)

In the case of ante, the model also produces brillante ‘bright’ which, while listed in the noun class, is an adjective in standard Italian—a clear case of misidentification and overstemming. Similarly, poublico can be both adjective and noun. The same goes for riposo. What the model fails to distinguish in these cases is not just the linguistic category of the terms being reported; its one-to-one mapping of suffixes onto these categories also leads to spurious results for standard language data, let alone mixed data. What is needed is a model that distinguishes different forms according to different languages, with a much broader-based morphology, and that is capable of reporting whether particular items can be identified as actual suffixes in a given language or not.
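The suffix counts reported above can be reproduced in outline by scanning headwords against per-language suffix lists. The sketch below is hypothetical: the suffix lists are abbreviated from Table 13.2, and the entries are a handful of examples quoted in this chapter rather than the full Dictionnaire data.

```python
# Hypothetical sketch: match MLF headwords against abbreviated d-suffix
# lists and record which language(s) each word could be assigned to.
D_SUFFIXES = {
    "French":     ["ateur", "ation", "ant"],
    "Spanish":    ["ancia", "anza", "ador"],
    "Portuguese": ["amento", "ador"],
    "Italian":    ["oso", "ico", "ista", "ante", "anza"],
}

# Illustrative entries taken from the chapter, not the full 2,120-entry data.
entries = ["sacrificateur", "fortification", "testamento",
           "curioso", "bilancia", "comprador"]

def suffix_hits(words, suffix_table):
    """Return {language: [words ending in one of its listed d-suffixes]}."""
    hits = {lang: [] for lang in suffix_table}
    for word in words:
        for lang, suffixes in suffix_table.items():
            if any(word.endswith(s) for s in suffixes):
                hits[lang].append(word)
    return hits

hits = suffix_hits(entries, D_SUFFIXES)
print(hits["Portuguese"])  # ['testamento', 'comprador']
print(hits["Spanish"])     # ['bilancia', 'comprador']
```

Note that comprador is claimed by both Spanish and Portuguese: the double listing of ador shows concretely why a one-to-one mapping of suffixes onto languages breaks down for hybrid data.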
Occurrences such as these arise in other language pairings: especially anza, common to both Spanish and Italian (2 occurrences, spéranza ‘hope’, miscolanza ‘mix’), and ivo, common to Portuguese and Italian (2, cativo ‘bad’, vivo ‘alive’), but also ador, which appears in the class of adjectives for both Spanish and Portuguese, with ten occurrences (balador ‘dancer’, biancador ‘laundryman’, cantador ‘singer’, caschador ‘shoe’, conspirador ‘conspirator’, comprador ‘buyer’, lavorador ‘worker’, peskador ‘fisherman’, salvador ‘saviour’, segador ‘sawyer’). As I have argued elsewhere, part of the issue in analysing mixed-language data using existing infrastructure can be explained by the fact that ‘these data either overlap across multiple taxonomies (multiple languages, datasets, etc.), or because the existing tools available for computational analysis are often designed to deal with the specifics of particular types of data in the first place’ (Brown 2023). The language

of the MLF presents mixed phenomena that do not easily fall into a pre-established taxonomy. The consequence is that the modelling of particular forms of hybridity, whether contemporary or historical, requires particular tools and further development in a digital space. Table 13.2 contains 48 unique forms. Running the model on the MLF entries from the Dictionnaire shows there to be 15 unique suffixes (14 noun; 1 adverb). In other words, even though the model may appear comprehensive when it comes to standard language data, and includes a substantial number of suffixes across four of the major standard Romance languages, only 31.25% of these forms are identified in MLF data.

13.7. Conclusion

This chapter has considered two separate but closely related notions in multilingual digital humanities. The first concerns the question of what ‘multilingual’ and ‘text’ might mean in situations of language contact that can potentially render both terms ambiguous. These terms are used in a variety of ways to refer to different forms of languages in contact across diverse media. Languages come into contact through a variety of means, and the discipline of language contact is concerned with the effects of that contact in its own right and as its own field. The task for the researcher is made more difficult when the linguistic data do not fall neatly into circumscribed linguistic categories, fall into multiple categories, or fall into none. In most cases for Western European languages, we are dealing with data that must fit anachronistically into particular glottonyms (proper names of an individual language or language family), since what these ‘languages’ refer to is often defined at a date after the time when such data are produced.11 The issue becomes more complicated given that the ‘mental habit’ of working only with English data and literature leads to a situation of ‘monolingual-Anglophone obliviousness with regard to language’ (Nilsson-Fernàndez and Dombrowski 2022, 83). This chapter has also attempted to reveal the problematic nature of using models based on standard language datasets when applied to historically mixed language. Using a historical dictionary as a test case, the so-called Dictionnaire de la langue franque, error measurement can be discerned when searching for the presence of derivational suffixes using one particular stemmer. The question of stemmer selection is in itself problematic for mixed languages. One advantage of the model discussed in this chapter is the seemingly large variety of derivational suffixes it contains for a variety of Romance languages, including French, Spanish, Portuguese, and Italian.
Nevertheless, applying that model to MLF data reveals the paucity of suffixes which it is actually able to stem. The standard languages used to create the model also lead to errors, particularly overstemming. Errors may occur when suffixes are assigned to multiple languages, but also because the model will identify certain reflexes as belonging to a particular linguistic class when they belong to another entirely. Issues arising when working on multilingual data are becoming more and more pressing as more data become available.


Overall, what is needed is further development of the currently available tools in order to work with hybrid data in the past. A greater rapprochement between the communities of NLP and DH could indeed ‘bring methodological advances to NLP, while at the same time confronting DH datasets with powerful state-of-the-art techniques’ (McGillivray et al. 2020; see also Gil and Ortega 2016; Wagner 2020). Part of the aim of this chapter has been to answer the question: how is it best to proceed when we try to analyse digitally varieties of languages that do not belong to any one linguistic taxonomy? In short, there is no single best way. The methodological decisions and justifications that are needed will likely vary according to the precise research question being posed and the desiderata of the research being conducted. What one can do is pay particular attention to the issue of multilingual variation, ensuring that parameters are set as openly as possible to take account of such variation. Even with the specific case presented here, we have seen that multiple avenues are possible for applying stemmers to historical data. There is not just one ‘word’ in which pre-established suffixes might appear; nor is there just one language to which they can be said to belong. The field is slowly developing, as other questions of standardisation, authority files, and character recognition are brought more out into the open (Arnold 2019). Possible future prospects of research for multilingual DH are wide-ranging. Further avenues are already being explored, beyond simply providing resources in languages other than English. This includes work on postcolonial digital humanities and sentiment analysis, as well as recent initiatives such as Saving Ukrainian Cultural Heritage Online, and many, many others. All this work has a focus which is not (just) on English. Some of it is not even in so-called standard language(s), or contemporary languages.
More openness on a technical and/or conceptual level is also needed. Knowledge infrastructures housed in libraries, for example, could be enhanced if a greater openness to linguistic variation were incorporated into technical solutions. Another important corollary in answering the question of how best to proceed concerns precisely (re)interrogating what the data actually are and how they are digitally (re)presented. In other words, it is important to define precisely what the model is doing and what it is searching for. Applying one of the available stemmers can be useful in many circumstances. Oftentimes models can be refined or developed further in order to provide a greater sense of definition for what the researcher is attempting to discriminate in hybrid data. In large measure, these developments are the result of researchers engaging with more culturally and linguistically diverse data. Such advances bode well for the future of multilingual digital humanities.

Notes

1 The author is grateful to the editors and the anonymous reviewers for generous feedback on this chapter.
2 Matthews (1997, 58) notes that ‘“code” itself is used by some sociolinguists effectively of any distinct variety of language’.
3 The strong focus on English source material, academic environments, and tools has had far-reaching consequences for DH more generally. I cannot address this issue here for reasons of space. However, as Cro and Kearns remark (2020, §1): ‘It is problematic that DH in the US has focused primarily and nearly exclusively on implementation in English-only environments given both the diversity of languages and experiences in the country itself, of the student populations targeted by DH in American higher ed institutions, and the purported global, public, open perspective indicative of work in DH (Gil and Ortega 2016)’. On the pedagogical issues and question of language faced by these researchers in the implementation of advanced French seminars, see Cro and Kearns (2020, §13).
4 The full title as shown on the first page of this text is Dictionnaire de la langue franque ou petit mauresque, suivi de quelques dialogues familiers et d’un vocabulaire de mots arabes les plus usuels; à l’usage des Français en Afrique.
5 Britain (2012, 224) describes levelling as ‘the eradication of marked linguistic features, marked in the sense of being in a minority in the ambient linguistic environment after the contact “event,” marked in the sense of being overtly stereotyped, or marked in the sense of being found rarely in the languages of the world and/or acquired late in first language acquisition’.
6 Some work is already being done to improve stemming for non-English languages, such as the GitHub repository of lemmatization lists in languages other than English, available at: https://github.com/michmech/lemmatization-lists
7 I quote directly from the main homepage of the Porter Stemming Algorithm, available at: https://tartarus.org/martin/PorterStemmer/.
8 https://snowballstem.org/.
9 The term i-suffix is used to denote an inflectional suffix (e.g. the addition of -ed to verbs in English). Since most Romance languages tend to have a highly inflected verb morphology, the verb ‘tends to dominate initial thinking about stemming in these languages’ (http://snowball.tartarus.org/texts/romance.html).
10 This table has 84 cells with data from four Romance varieties, with three cells left blank (presumably since no corresponding form exists in that variety). Removing the three blank cells from this count leaves 81 forms. Removing duplicates (e.g. ismo, which appears as a noun suffix in Spanish, Portuguese, and Italian) across two or more languages leaves 48 unique forms.
11 These issues are not restricted to these languages, of course. For a recent paper on word segmentation, unknown-word resolution and a morphological parsing system in Hebrew, see Goldberg and Elhadad (2013); also Sarma et al. (2022).

References

Arnold, Matthias. 2019. “Multilingual Research Projects: Challenges for Making Use of Standards, Authority Files, and Character Recognition”. ADHO DH 2019, Utrecht, The Netherlands. Workshop session: “Towards Multilingualism in Digital Humanities: Achievements, Failures and Good Practices in DH Projects with Non-Latin Scripts”. Available at: https://zenodo.org/record/5911047#.Ym8o29NBxpQ.
Baglioni, Daniele. 2018. “The Vocabulary of the Algerian Lingua Franca”. Lexicographica 33, 1, 185–205.
Britain, David. 2012. “Koineization and Cake Baking: Reflections on Methods in Dialect Contact Research”. In Methods in Contemporary Linguistics, edited by Bernhard Wälchli, Adrian Leemann and Andrea Ender, 219–238. Berlin: De Gruyter Mouton.
Brown, Joshua. 2015. “Testimonianze di una precoce toscanizzazione nelle lettere commerciali del mercante milanese Francesco Tanso (?-1398), Archivio Datini, Prato”. Forum Italicum 49, 3, 683–714.
Brown, Joshua. 2017. “Multilingual Merchants: The Trade Network of the 14th Century Tuscan Merchant Francesco di Marco Datini?”. In Merchants of Innovation. The Languages of Traders, edited by Esther-Miriam Wagner, Bettina Beinhoff and Ben Outhwaite, 235–251. Berlin: De Gruyter Mouton.
Brown, Joshua. 2022. “On the Existence of a Mediterranean Lingua Franca and the Persistence of Language Myths”. In Language Dynamics in the Early Modern Period. Volume 1, edited by Karen Bennett and Angelo Cattaneo, 169–189. London: Routledge.
Brown, Joshua. 2023. “Whose Language? Whose DH? Towards a Taxonomy of Definitional Elusiveness in the Digital Humanities”. Digital Scholarship in the Humanities 38, 2, 501–514. Available at: https://academic.oup.com/dsh/article/38/2/501/6827888.
Coates, William A. 1971. “The Lingua Franca”. In Proceedings of the Fifth Annual Kansas Linguistics Conference, edited by Frances Ingemann, 25–34. Lawrence: University of Kansas.
Cro, Melinda A. and Sara K. Kearns. 2020. “Developing a Process-Oriented, Inclusive Pedagogy: At the Intersection of Digital Humanities, Second Language Acquisition, and New Literacies”. Digital Humanities Quarterly 14, 1.
Dombrowski, Quinn. 2020. “Preparing Non-English Texts for Computational Analysis”. Modern Languages Open 45, 1, 1–9.
Dombrowski, Quinn. 2021. Humanités numériques, цифровые гуманитарные науки, デジタル・ヒューマニティーズ: History and Future of DH Linguistic Diversity. UCL Centre for Digital Humanities. Available at: www.ucl.ac.uk/digital-humanities/events/2021/apr/ucldh-online-histories-and-futures-linguistic-diversity-dh-rescheduled.
Dombrowski, Quinn. 2022. “Non-English DH Is Not a Thing”. Talk presented at online conference Digital Approaches to Multilingual Text Analysis. Recording. Available at: https://metodhology.anu.edu.au/index.php/2022/01/20/digital-approaches-to-multilingualtext-analysis/.
Galina Russell, Isabel. 2014. “Geographical and Linguistic Diversity in the Digital Humanities”. Literary and Linguistic Computing 29, 3, 307–316.
Gil, Alex and Élika Ortega. 2016. “Global Outlooks in Digital Humanities: Multilingual Practices and Minimal Computing”. In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J. Lane and Ray Siemens, 22–34. New York: Routledge.
Gitelman, Lisa and Virginia Jackson. 2013. “Introduction”. In “Raw Data” Is an Oxymoron, edited by Lisa Gitelman, 1–14. Cambridge, MA: The MIT Press.
Goldberg, Yoav and Michael Elhadad. 2013. “Word Segmentation, Unknown-Word Resolution, and Morphological Agreement in a Hebrew Parsing System”. Computational Linguistics 39, 1, 121–160.
Horvath, Aliz. 2021. “Enhancing Language Inclusivity in Digital Humanities: Towards Sensitivity and Multilingualism”. Modern Languages Open 1, 26, 1–21.
Mahony, Simon. 2018. “Cultural Diversity and the Digital Humanities”. Fudan Journal of Humanities and Social Sciences 11, 371–388.
Matthews, Peter. 1997. The Concise Oxford Dictionary of Linguistics. Oxford and New York: Oxford University Press.
McGillivray, Barbara, Thierry Poibeau and Pablo Ruiz Fabo. 2020. “Digital Humanities and Natural Language Processing: ‘Je t’aime . . . Moi non plus’”. Digital Humanities Quarterly 14, 2.
Meelen, Marieke and Afra Pujol i Campeny. 2021. “Old Catalan Morphosyntax: Developing an Annotated Corpus”. Journal of Open Humanities Data 7, 30.
Nilsson-Fernàndez, Pedro and Quinn Dombrowski. 2022. “Multilingual Digital Humanities”. In The Bloomsbury Handbook to the Digital Humanities, edited by James O’Sullivan, 83–92. London: Bloomsbury.
Nolan, Joanna. 2020a. The Elusive Case of Lingua Franca. Fact and Fiction. London: Palgrave Macmillan.
Nolan, Joanna. 2020b. “Mediterranean Lingua Franca”. In Arabic and Contact-Induced Change, edited by Christopher Lucas and Stefano Manfredi, 533–550. Berlin: Language Science Press.
Operstein, Natalie. 1998. “Was Lingua Franca Ever Creolized?”. Journal of Pidgin and Creole Languages 13, 2, 377–380.
Operstein, Natalie. 2017. “The Spanish Component in Lingua Franca”. Language Ecology 1, 2, 105–136.
Operstein, Natalie. 2018. “Inflection in Lingua Franca: From Haedo’s Topographia to the Dictionnaire de la langue franque”. Morphology 28, 2, 145–185.
Operstein, Natalie. 2021. The Lingua Franca: Contact-Induced Language Change in the Mediterranean. Cambridge: Cambridge University Press.
Sarma, Neelakshi, Ranbir Sanasam Singh and Diganta Goswami. 2022. “SwitchNet: Learning to Switch for Word-Level Language Identification in Code-Mixed Social Media Text”. Natural Language Engineering 28, 3, 337–359.
Selbach, Rachel. 2017. “On a Famous Lacuna: Lingua Franca the Mediterranean Trade Pidgin?”. In Merchants of Innovation. The Languages of Traders, edited by Esther-Miriam Wagner, Bettina Beinhoff and Ben Outhwaite, 252–271. Berlin: De Gruyter Mouton.
Spence, Paul. 2014. “Centros y fronteras: el panorama internacional de las humanidades digitales”. In Humanidades Digitales: desafíos, logros y perspectivas de futuro. Janus, Anexo 1, edited by Sagrario López Poza and Nieves Pena Sueiro, 37–61. SIELAE: Universidade da Coruña.
Sula, Chris Alen and Heather V. Hill. 2019. “The Early History of Digital Humanities: An Analysis of Computers and the Humanities (1966–2004) and Literary and Linguistic Computing (1986–2004)”. Digital Scholarship in the Humanities 34, Supplement 1, 190–206.
Terras, Melissa and Julianne Nyhan. 2016. “Father Busa’s Female Punch Card Operatives”. In Debates in DH 2015, edited by Matthew K. Gold and Lauren F. Klein, 60–65. Minneapolis, MN: University of Minnesota Press.
Vanetik, Natalia and Marina Litvak. 2019. “Multilingual Text Analysis: History, Tasks, and Challenges”. In Multilingual Text Analysis. Challenges, Models, and Approaches, edited by Marina Litvak and Natalia Vanetik, 1–29. Hackensack, NJ: World Scientific.
Viola, Lorella and Antonio Maria Fiscarelli. 2021. “From Digitised Sources to Digital Data: Behind the Scenes of (Critically) Enriching a Digital Heritage Collection”. In Proceedings of the International Conference Collect and Connect: Archives and Collections in a Digital Age, edited by Alison Weber, Maarten Heerlien, Eulàlia Gassó Miracle and Katherine Wolstencroft, 51–64. CEUR—Workshop Proceedings, vol. 2810.
Wagner, Cosima. 2020. “Challenging Research Infrastructures from a Multilingual DH Point of View—Impulses from Two Workshops on Non-Latin Scripts”. Disrupting Digital Monolingualism Workshop. Available at: https://languageacts.org/digital-mediations/event/disrupting-digital-monolingualism/ddm-workshop-videos/.
Watts, Richard. 2015. “Setting the Scene: Letters, Standards and Historical Sociolinguistics”. In Letter Writing and Language Change, edited by Anita Auer, Daniel Schreier and Richard Watts, 1–13. Cambridge: Cambridge University Press.

Index

Anglocentric multilingualism 40, 130–131, 133–134, 137, 140
Anglocentrism 1–2, 8, 11, 19, 32, 33, 40, 62, 104–106, 115, 130–131, 133–134, 137, 140
Anglophone-centricity see Anglocentrism
ASCII 19, 203–204
BERT 9, 141, 147, 149, 156, 172–174
Black DH 61
born-digital 121, 215
colonial 19, 32, 38, 40, 42, 149–150
colonial languages 40, 149–150
computer-assisted language learning (CALL) 4, 61
computer-assisted translation 184
COVID-19 11, 20, 74, 83, 117, 187, 198
critical DH 61
critical DHML 62
critical digital literacy 3, 7–8, 11, 64–65, 67, 75, 77, 217
critical pedagogy 64–65
datafication 77
decolonial practices see decolonial strategies
decolonial strategies 44
decolonisation 7, 32–33, 37–41, 44–45, 65, 106, 197
DH pedagogy 7, 20, 76, 106, 108
digital acceleration 11, 83
digital divide 39, 130, 183, 187
digital immigrants 75
digital inclusivity 1, 11–12, 26, 31, 39, 45, 105–107, 111–112, 130, 202, 209–210
digital learning spaces 8, 69, 87
digital multilingualism 4–5, 8, 11–12, 105, 200
digital multilinguality 200
digital pedagogy 2, 61, 65–66
digital standards 49–51, 62, 71, 109, 121, 130, 187, 204, 207–208
digitisation 1, 19, 121, 187–188, 202, 305
diversity turn 3
endangered languages 3, 72, 202
global language 72
Global North 32–33, 37–38, 42, 61–62, 72, 76, 106, 145, 176, 210
Global South 7, 10, 17, 19, 32, 37–38, 168, 176, 189
hegemonic languages 38–39, 183, 197
Humanistic NLP 116, 123–124
Indian DH 49–51
Indigenous languages 33, 38–42, 44, 72
knowledge production 1, 7, 9, 31, 33, 35, 37, 39–40, 42, 44–45, 62, 65, 165
language-agnostic 109, 111, 198
language diversity 1, 4–6, 17
Language Heritage 63
languages other than English see non-English languages
Latin scripts 200, 203–205
learning environments 8, 69, 83–91, 94–95, 97
lingua franca 1, 17, 26, 39, 129, 183, 187–188, 215–217
linguistic bias see linguistic injustice
linguistic diversity 4, 6, 12, 17–19, 48, 62, 72, 115, 117, 150, 183, 214
linguistic injustice 2–3, 9, 31–32, 45, 129–134, 137, 140–141
localisation 1, 10–11, 24, 72, 183–186, 188, 191–193, 198, 205, 209
low-resourced languages 3, 6, 9, 49, 118, 215
machine translation 1, 50, 157, 172, 177
modern languages/area studies 4–5, 8, 61–63, 65, 67, 77, 205
monoculturalism 2, 11–12
monolingual DH 203
multilingual awareness 9–10, 26, 63, 72, 130, 133, 158, 183, 198
multilingual graphical user interfaces 184, 193
multilingualism 1–5, 7–8, 11–12, 17–18, 32, 35, 37–38, 40, 45, 62, 105, 130–131, 133–134, 137, 140, 183, 192–193, 197–200, 202–203, 205–210, 213–216, 221, 225–226
multilinguality 4, 197, 200, 202–203, 207–209
multilingual natural language processing 1, 6, 157
named entity recognition (NER) 51, 107–108, 115, 119, 122–123, 148, 150, 155, 157, 167, 170, 177
National Endowment for the Humanities 115, 122
natural language processing (NLP) 1, 3, 5–11, 48–52, 55–56, 61, 103–104, 106–108, 110–111, 114–125, 137, 140–141, 145–146, 148, 150, 152, 155–158, 165–167, 170, 172, 198, 202, 213, 226
non-English languages 8, 103–104, 106–111, 130, 140–141, 146, 215
Non-English NLP 103–104, 106–108
non-hegemonic languages 33, 38
non-Latin scripts 2, 10, 154, 197, 199, 202, 205, 208, 215–216
part of speech 51, 107, 119, 122–123, 134, 136, 139, 150, 157, 215
pedagogy 2–4, 7–8, 17, 20, 26, 61, 63–66, 76, 105–106, 108, 112
POS see part of speech
PostColonial DH 61, 226
primacy of language 132
Programming Historian 1, 7, 17, 20, 26, 65, 70, 73, 107, 203
Proyecto Humboldt Digital 1, 10, 183–184, 186, 188–189, 191, 193
reproducibility 71
sentiment analysis 9, 49, 108, 115, 156–157, 167, 170, 172, 174, 176–177, 226
stemming 19, 51, 118, 150, 216–217, 220–222
subword tokenisation 153–156
TEI see Text Encoding Initiative
Text Encoding Initiative 66, 71, 77, 114, 130, 188–189, 191, 193, 204, 206
translation 3–4, 18, 22–26, 48, 50–51, 56, 66, 73, 130, 132, 140, 149–150, 156, 172–174, 176, 184–186, 189, 215, 223
Unicode standard 121, 130, 154, 202, 204, 206, 208–209, 215