Applying Language Technology in Humanities Research: Design, Application, and the Underlying Logic [1st ed.] 9783030464929, 9783030464936

This book presents established and state-of-the-art methods in Language Technology (including text mining, corpus linguistics, …)


English · Pages XIII, 126 [133] · Year 2020





Table of contents :
Front Matter ....Pages i-xiii
Introducing Language Technology and Humanities (Barbara McGillivray, Gábor Mihály Tóth)....Pages 1-6
Design of Text Resources and Tools (Barbara McGillivray, Gábor Mihály Tóth)....Pages 7-34
Frequency (Barbara McGillivray, Gábor Mihály Tóth)....Pages 35-46
Collocation (Barbara McGillivray, Gábor Mihály Tóth)....Pages 47-59
Word Meaning in Texts (Barbara McGillivray, Gábor Mihály Tóth)....Pages 61-79
Mining Textual Collections (Barbara McGillivray, Gábor Mihály Tóth)....Pages 81-115
The Innovative Potential of Language Technology for the Humanities (Barbara McGillivray, Gábor Mihály Tóth)....Pages 117-121
Back Matter ....Pages 123-126


Applying Language Technology in Humanities Research
Design, Application, and the Underlying Logic
Barbara McGillivray · Gábor Mihály Tóth


Barbara McGillivray Faculty of Modern and Medieval Languages University of Cambridge Cambridge, UK

Gábor Mihály Tóth Viterbi School of Engineering, Signal Analysis Lab (SAIL) University of Southern California Los Angeles, CA, USA

The Alan Turing Institute London, UK

ISBN 978-3-030-46492-9    ISBN 978-3-030-46493-6 (eBook)
https://doi.org/10.1007/978-3-030-46493-6

© The Editor(s) (if applicable) and The Author(s) 2020

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cover illustration: © Melisa Hasan

This Palgrave Macmillan imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The idea of this book goes back to the HiCor Research Network, founded and led by us together with Gard Jenset and Kerry Russell. HiCor was a research group of historians and corpus linguists at the University of Oxford active between 2012 and 2014. It was generously supported by TORCH (The Oxford Research Centre in the Humanities). In addition to organizing lectures and a workshop, HiCor also aimed to disseminate language technology among historians and, more generally, humanists. For instance, we organized several courses on Language Technology and Humanities at the Oxford DH Summer School, which inspired this book. We are grateful to Gard Jenset, who helped to shape the initial ideas underlying this book. We also thank our employers and funders for providing us with the time and funding to accomplish the project.1 We have contributed equally to the design of the book. We have joint responsibility for Chapter 1. Barbara McGillivray has primary responsibility for Chapters 2 and 5. Gábor Tóth has primary responsibility for Chapters 3, 4, 6, and 7.

Cambridge, UK
London, UK
Barbara McGillivray

Los Angeles, USA
Gábor Mihály Tóth

1 Gábor Tóth thanks the USC Shoah Foundation and the USC Viterbi School of Engineering. Barbara McGillivray was supported by The Alan Turing Institute under EPSRC grant EP/N510129/1.


Contents

1 Introducing Language Technology and Humanities .... 1
1.1 Why Language Technology for the Humanities? .... 1
1.2 Structure of the Book .... 3
References .... 6
2 Design of Text Resources and Tools .... 7
2.1 Text Resources in the Humanities .... 7
2.1.1 Text Resources and Corpora .... 9
2.1.2 Data and Metadata .... 10
2.2 Corpus Design and Creation .... 12
2.2.1 Designing a Text Resource .... 12
2.2.2 Humanities Corpora .... 14
2.3 Use Case: The Diorisis Ancient Greek Corpus .... 16
2.4 Corpus and Natural Language Processing Tools .... 20
2.4.1 Text-Processing Pipeline .... 20
2.4.2 Pre-processing and Tokenization .... 21
2.4.3 Stemming, Lemmatization, and Morphological Annotation .... 23
2.4.4 Part-of-Speech Tagging .... 25
2.4.5 Chunking and Syntactic Parsing .... 27
2.4.6 Named Entities .... 28
2.4.7 Other Annotation .... 29
2.5 Conclusion .... 31
References .... 32


3 Frequency .... 35
3.1 Concept of Frequency .... 36
3.2 Application: The "Characteristic Vocabulary" of the Moonstone by Wilkie Collins .... 39
3.3 Application: Terms with 'Turbulent History' in the Early English Books Online .... 43
3.4 Conclusion .... 46
References .... 46
4 Collocation .... 47
4.1 The Concept of Collocation .... 48
4.2 Probability of a Bigram .... 49
4.3 Observed and Expected Probability of a Bigram .... 50
4.4 Strength of Association: Pointwise Mutual Information (PMI) .... 52
4.5 Strength of Association: Log Likelihood Ratio .... 54
4.6 Application: What Residents of Modern London Complained About .... 54
4.7 Conclusion .... 58
References .... 59
5 Word Meaning in Texts .... 61
5.1 The Study of Word Meaning .... 61
5.2 Distributional Approaches to Word Meaning .... 62
5.3 Word Space Models .... 64
5.3.1 Words in Space .... 64
5.3.2 Word Embeddings .... 68
5.4 Use Case: Exploring Smell in Historical Health Reports .... 69
5.4.1 Visualizing Words in the Semantic Space .... 71
5.4.2 Measuring Distances in the Semantic Space .... 72
5.5 Use Case: Finding Semantic Change in a Web Archive .... 75
5.6 Conclusion .... 78
References .... 78
6 Mining Textual Collections .... 81
6.1 Textual Similarity, an Old Problem .... 82
6.2 How to Construct a Feature Space .... 83
6.2.1 Feature Selection .... 84
6.2.2 Feature Scoring .... 88

6.2.3 Representation as a Geometric Space .... 89
6.2.4 The Document–Term Matrix .... 90
6.2.5 Representation as a Vector Space .... 90
6.2.6 Summary .... 93
6.3 Application: Discovery of Similarity in the Anglo-Saxon Chronicle .... 93
6.3.1 Transformation of the Anglo-Saxon Chronicle into a Document Collection .... 94
6.3.2 Feature Extraction and Feature Selection .... 95
6.3.3 Construction of the Document–Term Matrix .... 96
6.3.4 Feature Scoring .... 97
6.3.5 Rendering a Feature Space Through Projection to a Lower-Dimensional Space .... 99
6.3.6 Measuring the Cosine Similarity Between Annals .... 102
6.3.7 Clustering .... 104
6.3.8 Topic Modelling .... 107
6.3.9 Topic as a Hidden Layer .... 110
6.3.10 Hierarchical Topic Modelling .... 111
6.3.11 Summary of Topic Modelling .... 113
6.4 Conclusion .... 113
References .... 114

7 The Innovative Potential of Language Technology for the Humanities .... 117
7.1 Bridging Concepts Between Humanities and Language Technology .... 117
7.2 A Critical View of Language Technology .... 121
Index .... 123

List of Figures

Fig. 3.1 Relative document frequency of lemma forsake in the EEBO subcorpus .... 45
Fig. 4.1 Changes of log likelihood ratio (window: 5 words; direction: left) between complain/complaint and dust, mouse, noise, rat, smell, smoke in the London Health Reports dataset .... 56
Fig. 4.2 Changes of log likelihood ratio (window: 5 words; direction: right) between complain/complaint and dust, mouse, noise, rat, smell, smoke in the London Health Reports dataset .... 57
Fig. 5.1 Bi-dimensional representation of the words film, movie, and quote using the coordinates from Table 5.2 .... 67
Fig. 5.2 Bi-dimensional representation of the semantic space from the London MOH reports. We have displayed the points corresponding to the top 40,000 most frequent words, and the labels of the words smell, stink, odour, perfume, table, and house .... 73
Fig. 5.3 Simplified visualization of the semantic change of the noun tweet in three semantic spaces .... 76
Fig. 6.1 Simplified representation of some car models in terms of common features .... 84
Fig. 6.2 Some novels by Wilkie Collins and their representation using library catalogue subject headings as common features (Source: The on-line catalogue of the Bodleian Library, Oxford, http://solo.bodleian.ox.ac.uk, accessed 1 January 2020) .... 85
Fig. 6.3 A simplified document collection of three English proverbs and their representation through bag of words .... 87


Fig. 6.4 Representation of three English proverbs in a feature space rendered as geometric space .... 91
Fig. 6.5 Representation of three English proverbs as document vectors .... 92
Fig. 6.6 Representation of the annals of the Anglo-Saxon Chronicle in a projected space .... 101
Fig. 6.7 Similarity matrix of annals highlighted (Group 2) in Fig. 6.6 .... 103
Fig. 6.8 Representation of the annals of the Anglo-Saxon Chronicle in a projected space with some clusters highlighted .... 106
Fig. 6.9 Hierarchical topic modelling in the Anglo-Saxon Chronicle .... 112

List of Tables

Table 2.1 Top frequency word types in Shakespeare's Hamlet .... 23
Table 2.2 Top frequency word types in Shakespeare's Hamlet after removing stop words .... 24
Table 5.1 Example of co-occurrence frequencies in a toy example consisting of four sentences containing the nouns dog and cat .... 65
Table 5.2 Example of co-occurrence frequencies for the lemmas film, movie, and quote from the British National Corpus 2014 Spoken .... 66
Table 5.3 Example of co-occurrence frequencies for the lemmas film, movie, and quote from the British National Corpus 2014 Spoken .... 67
Table 5.4 Cosine similarity measures between the word embeddings for blackberry and phone, and blackberry and raspberry, 2000 and 2013. The embeddings are from https://zenodo.org/record/3383660#.XfylShf7Sbc .... 77
Table 6.1 Summary of annals highlighted in Fig. 6.6 .... 102
Table 6.2 Key topics extracted from the Anglo-Saxon Chronicle .... 108


CHAPTER 1

Introducing Language Technology and Humanities

Abstract  This chapter outlines the relevance of language technology for the exploration and study of big textual data sets in the humanities. We also discuss the importance of understanding the logic underlying the use of language technology to resolve research problems in the humanities. Finally, we outline the three pillars of the approach we follow throughout the book: focus on application through both simplified and more complex use-case examples; discussion of both the potential and the limitations of language technology; and explanation of how to translate humanities research questions into research problems using language technology.

Keywords  Big data · Distant reading · Textual resource · Language technology · Humanities research

1.1 Why Language Technology for the Humanities?

In the last two decades, the humanities have seen an unprecedented change opening up new directions for the inquiry of human cultures and their histories: the as yet not fully explored availability of digitized humanistic texts. Thanks to the mass digitization of analogue resources preserved in libraries and archives, large textual collections, such as Google Books, Early English Books Online, and Project Gutenberg, have become available on the World Wide Web. The rise of digital humanities as a new academic field has contributed to the proliferation of research infrastructures and centres dedicated to the study and distribution of textual resources in the humanities. The mission of digital humanities projects such as the CLARIN European Research Infrastructure, DARIAH, and the ESRC Centre for Corpus Approaches to Social Science is to make textual resources not only available but also investigable for scholars. Digital humanists have proposed the method of distant reading or macroanalysis for learning from large textual resources (Jockers 2013; Moretti 2015).

Alongside a growing interest in large textual resources, there is an increasing demand from (digital) humanities researchers for quantitative and computational skills. The current offering in this space is rich, with a range of training options (including dedicated summer schools like the digital humanities training events at Oxford,1 DHSI at Victoria,2 or the European Summer School in Digital Humanities in Leipzig3) and publications (examples include Bird et al. 2009; Gries 2009; Hockey 2000; Jockers 2014; Piotrowski 2012). Nonetheless, textual resources in the humanities and beyond raise a key challenge: they are too big to be read by humans interested in analysing them. The potential lying in the exploration of large textual collections has not been fully realized; yet, it remains a key task for the current and the next generations of humanities scholars. To explore tens of thousands of books or millions of historical documents, humanities scholars inevitably need the power of computing technologies.

Among these technologies, there is one that has had and will definitely continue to have a pivotal role in the exploration of big textual resources. Language technology, which can help unlock and investigate large amounts of textual data, is a truly interdisciplinary enterprise. It is not an academic field per se; it is rather a collection of methods that deal with textual data. Language technology sits at the crossroads between corpus and computational linguistics, natural language processing and text mining, data science and data visualization. As we will demonstrate throughout this book, language technology can be used to address a great variety of research problems involved in the investigation of textual data in the humanities and beyond.

1 https://www.dhoxss.net/.
2 https://dhsi.org.
3 http://esu.culintec.de/.


1.2 Structure of the Book

This book examines research problems that are relevant for the humanities and can be addressed with the help of language technology. The first chapter demonstrates how language technology can help structure raw textual data and represent them as a resource meaningful for both humans and computers. For instance, the lyrics of thousands of popular songs are now available in plain text on the World Wide Web. But lyrics in plain text format do not distinguish the title and the refrain of a song. This is an example of unstructured data because the various components of a song are not marked in a way that computers can automatically extract them. Language technology can help detect structural components within a text such as the refrain of a song; it can also help represent a song in digital form so that different structural components are distinguished and readily available for further computational investigations. Language technology also supports word-level investigations of textuality. The lyrics of a song consist of not only structural units, but also different types of words such as nouns, verbs, and names of people. In plain text format, word-level information about lyrics is not readily usable by computing tools; for instance, it is not possible to extract all proper names from a collection of lyrics in plain text. As Chapter 2 explains, language technology helps attach different types of information to each word of a text; it also offers ways to record this information in well-established data formats. Language technology also facilitates the bottom-up exploration of textual resources and textuality. For instance, finding terms that are significant elements of a text is an important component of bottom-up explorations. We will discuss how the investigation of word frequency can support this in Chapter 3. Language technology methods can map terms closely related to a given concept in thousands of texts. This form of bottom-up exploration is discussed in Chapter 4. Language technology methods can also help in bottom-up studies of word meaning. For instance, the meaning of a concept can be investigated by drawing on a dictionary definition, but it can also be inferred from the way authors used that concept in their works. Chapter 5 examines how language technology enables this type of exploration of meaning. Finally, language technology has tools to detect patterns recurring over thousands of texts. As the proverb says, there is nothing new under the sun. Similar themes and ideas recur over texts from different historical times.
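To make the refrain example concrete, here is a toy sketch (ours, not from the book's companion repository) that treats exact line repetition as a crude structural cue for a refrain; real lyrics would call for near-duplicate matching and stanza detection:

```python
from collections import Counter

def refrain_candidate(lyrics: str) -> str:
    """Return the most repeated non-empty line of a plain-text song."""
    lines = [line.strip().lower() for line in lyrics.splitlines() if line.strip()]
    if not lines:
        return ""
    line, count = Counter(lines).most_common(1)[0]
    # A line that never repeats is not a refrain candidate at all.
    return line if count > 1 else ""

# An invented song fragment, for illustration only.
song = """Down by the river we sang
Oh happy day
The morning came too soon
Oh happy day"""

print(refrain_candidate(song))  # -> "oh happy day"
```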


However, detecting them in large textual resources is a tedious (or sometimes impossible) task for human readers. As Chapter 5 illustrates, language technology supports humans in their efforts to detect recurrence and similarity in texts. To realize the rich potential that language technology offers, humanists need to bridge two interrelated gaps. The first is the conceptual gap between humanities research problems and language technology methods. As a simple example, language technology can detect how many times a given term is used in a given set of historical sources. In more technical terms, with language technology we can study word frequency. But rarely do historians ask how many times a term occurs in their source texts. Rather, they inquire about the prevailing social concepts in a given historical time. There is a conceptual gap between word frequency and the prevailing social concepts. This simple example also sheds light upon the second gap, which lies between qualitative and quantitative approaches. The insights that language technology can deliver are very often quantitative and difficult to interpret with a qualitative framework. Bridging these gaps is a daunting task for scholars, and this publication seeks to assist them in this task. We believe that the potential of language technology can be realized if there is a clear understanding of the logic underlying it. The overall goal of this book is therefore to apply the logic of language technology to the resolution of humanistic research problems. We will attempt to convey this logic by following a didactic approach with three pillars. First, we guide you through various research procedures involved in the application of language technology. The first chapter looks at the design of language resources, the first step in the application of language technology. The following chapters study specific ­ humanities-related research problems and show how to design quantitative research procedures to address them. We believe that an understanding of how to design a research process in language technology is one of the key steps to understanding its overall logic. We do not, however, explain the technical implementation of the research procedures discussed throughout the book.4 Thanks to the development of computing tools in popular programming languages, such as Python and R, many 4 The Python implementation can be found in the following github repository: https:// github.com/toth12/language-technology-humanities.

But what is difficult to learn from on-line resources is how language technology is related to existing research goals and practices in the humanities. We draw on both complex and simple examples to illustrate this. Sometimes these illustrative examples will be simplistic; we call these 'toy examples'. Despite their simplicity, we believe that readers can grasp otherwise highly complex research problems and procedures through them.

Second, we highlight both the potential and the limitations of language technology. We believe that the logic of a technology can be understood if one is aware of what that technology can and cannot resolve. A thorough critical understanding is crucial to using technology in an innovative way.

Third, we will return again and again to the two gaps described above. Resolving the conceptual gap involves a process that is similar to translation. In order to make use of language technology, humanities researchers have to express, or more technically speaking operationalize, their research questions and problems in ways that are meaningful from the language technology perspective. This translation process will be demonstrated through various applied examples. Similarly, addressing the gap between quantitative and qualitative views needs a translation process. Highly complex mathematical procedures need humanistic analogies so that their results can inform qualitative research problems. Throughout the book we will attempt to establish such analogies. Although these might sound simplistic to readers trained in mathematics, we believe that our simple and accessible explanations will enable readers to build a more solid understanding of language technology.

With language technology playing a pivotal role in the discovery and analysis of textual data, this book offers an accessible overview of the main topics that can be considered under the umbrella term of language technology: corpus linguistics, computational linguistics, natural language processing, and text mining. Our aim is to focus on those aspects that are relevant to a readership of humanists. To keep this volume agile and easy to handle, some topics have been removed from its scope. For example, sentiment analysis is only briefly touched on, and we have not been able to cover many other important areas, including stylometrics, geospatial analysis, and authorship attribution. Space constraints also mean that many details concerning the topics covered were omitted. However, we aimed to provide the basic information needed to further explore themes that are of particular interest to readers.

References

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. Sebastopol, CA: O'Reilly.
Gries, S. T. (2009). Quantitative Corpus Linguistics with R. New York, NY and Abingdon: Routledge.
Hockey, S. (2000). Electronic Texts in the Humanities: Principles and Practice. Oxford: Oxford University Press.
Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. Champaign, IL: University of Illinois Press.
Jockers, M. L. (2014). Text Analysis with R for Students of Literature. New York, NY: Springer.
Moretti, F. (2015). Distant Reading. London: Verso.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. San Rafael, CA: Morgan and Claypool.

CHAPTER 2

Design of Text Resources and Tools

Abstract  This chapter guides the reader through the key stages of creating language resources. After explaining the difference between linguistic corpora and other text collections, the authors briefly introduce the typology of corpora created by corpus linguists and the concept of corpus annotation. Basic terminology from natural language processing (NLP) and corpus linguistics is introduced, alongside an explanation of the main components of an NLP pipeline and tools, including pre-processing, part-of-speech tagging, lemmatization, and entity extraction.

Keywords  Corpus · Text collection · Metadata · Annotation · Natural Language Processing (NLP) · Pipeline · Tool · Part-of-speech tagging · Lemmatization

2.1 Text Resources in the Humanities

This chapter will guide you through the key stages of creating and using text resources and tools. We use the term text resource to refer to collections of texts of various kinds, as well as other types of text-based resources encountered in the humanities. The key difference here is that collections typically contain running text, often organized into sections, books, and so on, while other text-based resources display text content which is not necessarily a connected piece of work. For example, the first edition of the Encyclopaedia Britannica1 was digitized by the National Library of Scotland and is organized into volumes and pages, hence we call it a text collection. On the other hand, resources like historical maps can contain textual parts which are not in the form of running text. Some of the techniques that we cover here, such as lemmatization or semantic analysis, will also be suitable for exploring such resources.

1 https://digital.nls.uk/encyclopaedia-britannica/archive/144133900.

Many text collections encountered in the humanities are not in digital form, and a large proportion of digital humanities research focuses on the process of digitizing such texts and providing them to the scholarly community. The focus of this book is on texts in digital form, as this is a precondition for the type of computer processing that we are concerned with. A special category of text collections is that of linguistic corpora, which by definition are designed with the specific purpose of studying human language. But text analysis can reveal important patterns about society and culture, well beyond questions of linguistic interest. For example, finding occurrences of place names, personal names, or mentions of events in texts, and counting those instances both independently and in relation to other entities, can help us access aspects of the content at a scale that is not reachable by close reading. Purely linguistic units like patterns of use of modal verbs (like must, should, or ought to in English) can be used to study abstract concepts such as obligation in a historical period. Therefore, the experience of corpus research, including approaches to corpus design and creation, is helpful as a way to draw rich information from texts.

Over the past decades, linguists have dedicated major efforts to creating corpora and developing ways of enriching them with annotation. These efforts have the potential to benefit other humanities disciplines. For example, if positive and negative adjectives are annotated in a novel, then we can analyse the sentiment of the text, and relate it to different characters to answer questions such as "Is character X associated with negative sentiment and how does this change throughout the novel?" In this chapter we describe a typology of linguistic corpora and some of their most useful features. We focus on how corpus design questions such as balance and representativeness may impact research outcomes. We also explain the workings of annotation, stressing the benefits it can bring to humanities scholarship, and discuss the challenges that text resources of interest to humanists pose for the design and creation phases.
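As a toy sketch of the modal-verb idea just mentioned (our example sentence, not the book's companion code), counting modal markers takes only a few lines of Python; a serious study of obligation would of course work over a full corpus and normalize by corpus size:

```python
import re

# The modal markers named above; 'ought to' is a two-word unit.
MODALS = ["must", "should", "ought to"]

def modal_counts(text: str) -> dict[str, int]:
    """Count whole-word occurrences of each modal marker, case-insensitively."""
    lowered = text.lower()
    return {
        modal: len(re.findall(rf"\b{re.escape(modal)}\b", lowered))
        for modal in MODALS
    }

# An invented sentence standing in for a historical text.
passage = "A servant must obey; a master should be just, and ought to be merciful."
print(modal_counts(passage))  # -> {'must': 1, 'should': 1, 'ought to': 1}
```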

Regarding tools to manipulate and enrich texts, we will borrow some basic terminology from the fields of natural language processing (NLP) and corpus linguistics, and explain the main concepts behind processes such as tokenization, part-of-speech tagging, lemmatization, and so on. The main focus is on how these tools are needed in humanities research and can guide the exploration and close reading of texts.

2.1.1 Text Resources and Corpora

Here we use the term text resource broadly to refer to any resource containing some text. Corpora (plural of corpus) are a special type of text resource because they are collections typically created by linguists to answer specifically linguistic questions. In fact, a branch of linguistics called corpus linguistics emerged in the second half of the twentieth century to define the characteristics of corpora, and to create and use them in linguistic studies. For an overview of the history of corpus linguistics, see McEnery and Hardie (2013). There are several definitions of what a corpus actually is. One of the most famous definitions is by Sinclair (2005): a corpus is "a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research". Along similar lines, Xiao et al. (2006) say that "a corpus is a collection of machine readable authentic texts (including transcripts of spoken data) which is sampled to be representative of a particular language or language variety". Both these definitions name the main features of corpora: electronic format, the fact that they contain naturally occurring language, and the fact that they are meant to represent a language or a part of it (technically called "variety"). For example, the corpus of the frequency lexicon of spoken Italian (Bellini and Schneider 2003–18) contains 469 texts that are transcriptions of lectures, TV and radio programmes, and spoken interactions between people. This corpus was used to create the first frequency dictionary of spoken Italian, which gives information about how frequently words are used in this language. The aim was to support scholars studying the language variety of spoken Italian. To realize this goal the corpus compilers decided to focus on four cities (Milan, Florence, Rome, and Naples) deemed representative of the broad range of features of spoken Italian.


One important difference between humanities text collections and linguistic corpora is that the former are typically not created for the purpose of linguistic studies and therefore do not aim at being representative of a language variety. For example, the Darwin Correspondence Project2 has published the full texts of more than 9000 of Charles Darwin's letters. In this project the letters have been collected into a resource available to the community of historians of science and other scholars interested in analysing Charles Darwin's views and entourage, and more broadly the creation of his scientific network, his impact on the scientific discourse of his time, and his legacy. However, the Darwin Letters are not meant to be representative of the English language or the scientific language of the time. Many of the text technology tools and terminology developed by corpus linguists are very useful when analysing and processing the texts of a non-linguistic project like the Darwin Correspondence Project, and we will describe such tools and terminology in the next sections.

2.1.2 Data and Metadata

We have seen that text resources can contain text in various forms. Let us take the example of the Hartlib Papers,3 which contain the full-text transcription (as well as facsimile images) of the manuscripts of the correspondence of Samuel Hartlib (c.1600–62), a seventeenth-century 'intelligencer' and man of science. The texts of these letters reveal interesting insights into the topics talked about in Hartlib's circle. However, for many research questions it is of critical importance to scholars to know a range of other attributes in addition to the texts of the letters, such as the library subject header, the year in which the letter was written, who wrote it, and its addressee, gender, location, and so on. This allows us to investigate, for instance, how many women corresponded with Samuel Hartlib, and whether this number changed over time. Together, all these features about the context of the text are referred to as metadata.4 Combining the text data with the metadata makes it possible to answer even more questions, such as: how did the topics of the letters change over time? Did Hartlib use a different style when addressing certain personalities? Does the length of the letters tend to change over time?

2 https://www.darwinproject.ac.uk.
3 https://www.dhi.ac.uk/hartlib/context.
4 We will follow the Oxford Dictionaries in using metadata as a mass noun and data as a plural noun.
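As a minimal illustration of combining text data with metadata (the records below are invented, not drawn from the Hartlib Papers), one could filter and count letters by sender gender and year:

```python
from collections import Counter

# Invented records; a real project would read these from a catalogue
# or from TEI headers rather than hard-code them.
letters = [
    {"year": 1640, "sender": "Dorothy Moore", "gender": "female", "text": "..."},
    {"year": 1640, "sender": "John Dury", "gender": "male", "text": "..."},
    {"year": 1652, "sender": "Katherine Ranelagh", "gender": "female", "text": "..."},
]

# How many letters from women, per year?
women_by_year = Counter(
    letter["year"] for letter in letters if letter["gender"] == "female"
)
print(sorted(women_by_year.items()))  # -> [(1640, 1), (1652, 1)]
```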

Metadata can be of different types, depending on the kind of information it provides. We follow the categorization in Burnard (2005) and distinguish between descriptive, administrative, editorial, and analytic metadata. The scope of the first two categories is the collection as a whole, while the latter ones apply to smaller text units. Descriptive metadata accesses external information about the context of the text, such as its source, date of publication, and the sociodemographics of the authors. Administrative metadata contains information about the collection itself, for example its title, its version, encoding, and so on. Editorial metadata, on the other hand, provides information about the editorial choices that the creators of the digital collection made with respect to the original text, for example regarding additions, omissions, or corrections. Finally, analytic metadata focuses on the structure of the text, for example by marking the beginning and end of sections or paragraphs.

Metadata can be encoded into text resources in various ways, either in external documentation or as part of the collections themselves. The Text Encoding Initiative (TEI) has developed detailed guidelines for the encoding of texts in digital format and it has become a widely accepted standard in the digital humanities. The TEI guidelines specify, among other things, how the metadata of a text should be displayed in what is known as the TEI header (for details see TEI Consortium 2019).

As we have said earlier, metadata combined with text data offers the widest scope for insightful ways to explore texts. Moreover, the texts themselves can be enriched via annotation to optimize the implicit linguistic information they contain and make it usable for large-scale analyses. Let us imagine that we have access to a large collection of digitized newspapers and we are interested in analysing the level of international relations exemplified in this collection. Knowing the geographical origin of each newspaper is of primary importance, but it is not sufficient because a newspaper article may talk about a location which is different from its place of publication. Hence, we would want to conduct an in-depth search of the texts to find, for example, instances of place names. This can be a very time-consuming (or sometimes impossible) process if we need to read all the articles. Without good disambiguation, we may have to ignore many instances of potentially irrelevant hits while at the same time missing a high number of relevant hits. For example, Paris is the name of the French capital but is also the name of a city in Texas, and being able to distinguish the two means that we can know whether a particular mention refers to international relationships with France or the United States. Moreover, Paris can also be a person's name, and at the same time the city can be referred to in different ways (e.g., 'the City of Lights'), so again being able to disambiguate the usages of this name in context is very useful. As noted by McEnery and Wilson (2001, p. 32), annotation makes the linguistic information in a text computationally retrievable, thus enabling a wide range of searches that can be performed in a manual, automatic, semi-automatic, or crowd-sourced way, depending on whether humans, computers, a combination of humans and computers, or groups of humans are responsible for it. For a detailed overview of linguistic annotation, see Jenset and McGillivray (2017, pp. 99 ff.). In Sect. 2.4 we will see different types of linguistic annotations and how they can be relevant to humanities research.
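Named entity recognition tools automate part of this disambiguation by labelling mentions with entity types. Below is a hedged sketch using the spaCy library (this assumes spaCy and its small English model are installed; the exact labels and spans it returns depend on the model version):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Paris hosted the summit, while Paris, Texas, stayed quiet, "
          "and Paris Hilton was elsewhere.")

# Each recognized entity carries a type label such as GPE (geopolitical
# entity) or PERSON, which separates the city from the person.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Paris GPE", "Paris Hilton PERSON"
```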

2.2 Corpus Design and Creation

This section will guide you through the decisions involved in designing and creating a corpus for humanities research. We will borrow most of the terminology from corpus linguistics, but will also discuss what should be adapted to the specific needs of humanities research. We will cover issues of availability of the data, representativeness, and their impact on the research outcomes. Finally, we will walk you through a use case, an Ancient Greek corpus built for the purpose of studying how Ancient Greek words change their meaning over time.

2.2.1 Designing a Text Resource

In humanities research it is very common to start from a question and then search for the best evidence to answer it. In other cases, we may have already identified an existing resource that we are interested in, for example an archive, and we want to use it for research. To take an example, the National Library of Scotland has recently made available the first ten editions of the Encyclopaedia Britannica in digital form. This is an impressive resource which allows us to explore a range of questions regarding the composition of the text, the differences between editions, the various topics covered in it, the relationship between text and images, and many more. But if we wanted to investigate themes that relate to the wider historical context of this work or comparisons with other encyclopaedias from different eras or geographical regions, for example, we would need to expand our evidence base to other resources. So we would find ourselves in the position of assembling a suitable corpus. In this section we will explore the steps involved in designing a text resource in a way that lets us answer the questions we care about.

Let us imagine, for example, that we are interested in analysing the spread of diseases in France in the nineteenth century. What are the best sources to address this topic? Within the limits of our time and resources, we should choose the available texts that allow us to best tackle the task in question. How can we realize this in practice? A useful starting point is to consider the features of the texts we would need to collect. Some helpful criteria for designing the resource can be borrowed from corpus linguistics, where corpora can be categorized in the following ways:

• By medium: does the corpus contain only text, speech, video material, or is it mixed?
• By size: does the corpus contain a static snapshot of a language variety (static corpus) or is it continually updated to monitor the evolution of language (monitor corpus)?
• By language: is the corpus monolingual or multilingual? If it is multilingual, have its parts been aligned (parallel corpus)?
• By time: does the corpus cover a language variety in a specific period without considering its time evolution (synchronic corpus) or does it focus on the change of a language variety over time (diachronic corpus)?
• By purpose: was the corpus built to describe the general language (like contemporary spoken English) or a special aspect of it (like the language of medical emergency reports)?

Of course, some of these criteria may co-exist, so that, for instance, we may create a collection of health reports (text) in French (monolingual) covering the nineteenth century (diachronic). But how can we make sure that our resource is good enough to investigate our question, i.e., the spread of diseases in nineteenth-century France? In the course of its history, the field of corpus linguistics has witnessed a hot debate around the topic of representativeness, and corpus linguists have developed methodologies for building balanced corpora that aim at being representative of the language under study.

These methodologies typically involve drawing up a prioritized inventory of the relevant features, for example register, region, time period, and so on; then estimating a target size for each feature; and then assembling the corpus according to these proportions. In the example about the spread of diseases in nineteenth-century France, we should make sure that the texts are sampled from different regions in such a way as to reflect the diversity of France. For instance, given the prominent role of Paris, we would expect a high proportion of reports to be from this city, but at the same time we would want to ensure that other regions are suitably represented as well. In addition to geographical provenance, we should account for other features such as text type, and make sure that the texts represent a range of different health-related texts by different roles (medical professionals, general public, scientists, etc.).

As McGillivray (2014, pp. 11–13) discusses, in spite of the efforts to build balanced corpora, representativeness remains an ideal limit, and more recently a different approach aimed at gathering the most inclusive set of texts has gained popularity. This has led to the creation of very large corpora containing as many texts as it is feasible to collect. This topic has been discussed at length outside linguistics (see, for example, Underwood 2019; Bode 2020). While not taking part in this debate here, we believe that engaging with the issue of representativeness in a critical way is useful when designing a text resource for humanities research. In the next section we will dive deeper into the features of the text resources typically considered in humanities research, and stress some key differences from linguistic corpora.
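As a minimal sketch of the target-proportion step just described (the regional shares and total below are invented for illustration only):

```python
# Hypothetical design proportions for a corpus on disease in
# nineteenth-century France; the figures are illustrative, not real.
region_shares = {"Paris": 0.40, "Lyon": 0.15, "Marseille": 0.15, "other": 0.30}
target_size = 2_000  # total number of reports we plan to include

# Turn the proportions into per-region document targets.
# Note that rounding may make the targets sum slightly off the total.
targets = {region: round(share * target_size)
           for region, share in region_shares.items()}
print(targets)  # -> {'Paris': 800, 'Lyon': 300, 'Marseille': 300, 'other': 600}
```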

2.2.2 Humanities Corpora

In this chapter we have made a distinction between humanities text resources in general on the one hand and linguistic corpora on the other. In Sect. 2.1.1 we saw that today's corpora tend to be assembled with the aim of including as many relevant texts as possible, even if this means compromising on the balance between different features. This is true especially in the case of contemporary languages, for which the Internet can provide a huge source of born-digital texts, leading to very large corpora whose size can be measured in billions of words. One example is the JSI Timestamped English corpus,5 an English corpus built from news articles gathered from their RSS feeds; it is updated daily and contains 37 billion words.

5 https://www.sketchengine.eu/jozef-stefan-institute-newsfeed-corpus/.

Such an unrestricted approach to corpus building, however, is not always applicable to the text resources employed in humanities scholarship, where a potentially complex interaction of research questions and availability of texts affects the size and shape of the resources we can create. For example, sometimes only a few texts or fragments have survived historical accidents and have found their way into the collection, meaning that creating a balanced corpus is simply not a viable option.

Three important considerations to keep in mind when building a corpus in humanities research are access, digitization, and encoding. Gaining access to a group of texts can often be anything but straightforward, requiring potentially complex issues to be negotiated, such as legal questions with third parties (who might have been responsible for the digitization, for example), and privacy or human data protection concerns. Even when we gain access to the texts, these may need to be digitized, as any subsequent computational processing of the type we talk about in this volume requires them to be in digital form. Once the texts have been digitized, or even better during the digitization step itself, the texts should be presented in such a way as to enable their effective use in research. In Sect. 2.1.2 we touched on the TEI guidelines, which provide a great basis for ensuring that digital texts are equipped with all the metadata needed to place them in their historical context. Although these topics are not the focus of this volume and therefore will not be covered in depth, we acknowledge that access, digitization, and encoding can have a significant impact on the decisions that follow in the research process. In particular, the quality of the digitization can radically affect the outcomes of quantitative analyses carried out on the texts, as shown, for example, by Hill and Hengchen (2019).

Another challenge concerns historical texts, which are often the object of study in the humanities and which require especially careful consideration. One primary reason for this is that tools and methods developed in language technology research are still mainly concerned with modern and well-established languages like English, but require special adaptation when applied to historical languages (cf. Piotrowski 2012; McGillivray 2014). Philological and interpretative issues are often of major importance and need to be accurately incorporated in the corpus design phase (cf. Meyer 2015). Furthermore, the lack of native speakers of extinct languages or old varieties of living languages means that we cannot rely on native speaker intuition for the annotation, and extra layers of checks and explicit guidelines are needed to achieve good quality results. The next section will describe a concrete use case involving a historical language, Ancient Greek.

2.3  Use Case: The Diorisis Ancient Greek Corpus In Sect. 2.2.2 we stressed some of the features of humanities corpora, including the specific challenges posed by historical texts. This section is dedicated to a case study on the Diorisis Ancient Greek corpus, which is described in detail in Vatri and McGillivray (2018) and which will give us the opportunity to explore the process of corpus design starting from the original research questions through a concrete example. The Diorisis corpus was built in the context of the “Computational models of meaning change in natural language texts” project funded by the Alan Turing Institute (McGillivray et al. 2019; Perrone et al. 2019). The interdisciplinary project team included scholars in classics, NLP, statistics, and digital humanities, who worked together for six months to begin to explore the following question: how can we identify the change in meaning of Ancient Greek words over the history of this language? The question of meaning change (or semantic change, more precisely) is relevant to a range of humanistic disciplines. In fact, a large part of humanities research involves interpreting meaning in textual sources, for example to find instances of entities or concepts based on which we can analyse historical, cultural, and social trends, or explore the connection between language and stylistic and geographical factors. Words can have many meanings and this changes over time and across registers, geography, style, etc. Let us take the example of the Ancient Greek word mus, which can mean ‘mussel’, ‘muscle’, ‘whale’, or ‘mouse’. Imagine that we are interested in medical terminology, how can we find only those texts that display the medical meaning of mus? Knowing the genre of a text will obviously help, as the medical meaning is more likely to be found in medical texts, but occurrences of the medical meaning can also be found in other texts. Historical dictionaries usually give some examples of usage of each word meaning, but do not attempt to give a full account of the literature, and using close reading methods, we would need to read and record the meaning of all words in every single text ever written, which clearly does not scale up to very large text collections. So, having access to an annotated corpus can make all the difference (McGillivray et al. 2019).

2  DESIGN OF TEXT RESOURCES AND TOOLS 


The project aimed to map the change in the meaning of words in the history of Ancient Greek from the seventh century BCE to the fifth century CE, an extremely ambitious goal. For this purpose, we had to build the largest corpus possible. In Sect. 2.1.1 we stressed the aspiration to representativeness. One of the important factors to keep in mind is the role of genre in Ancient Greek semantics, so in the corpus design phase we aimed at finding the best possible representation of Ancient Greek genres. While scoping the genre distribution of the texts, we devised a categorization into genre classes (such as Poetry, Narrative, or Technical) and subclasses (such as Bucolic, Biography, or Geography). The emphasis on "possible" is critical in this context, as we were constrained by three main factors. First, the texts that have survived historical accidents and have reached us are all we can hope to obtain for Ancient Greek. Second, as new digitization was not within the scope of the project, the number of available digital resources constituted the upper limit of what we were able to include. Third, even when digitized editions exist, they may not be free to use and distribute, so we sourced the texts from three openly available digital libraries (for details see Vatri and McGillivray 2018). The corpus consists of 820 texts and counts 10,206,421 word tokens, making it the largest corpus of its kind available today.

As is often the case in digital humanities projects, the texts came in different formats, ranging from TEI XML, to non-TEI XML, HTML, and Microsoft Word files.6 Therefore we had to allow for an initial phase of cleaning and standardization of these formats into TEI-compliant XML to allow further processing and analysis. Another important consideration was character encoding. Greek characters can pose additional challenges when it comes to encoding, and we found a range of options in the sources, from Beta Code7 to UTF-8 Unicode, to HTML hexadecimal references. Taking the example from Vatri and McGillivray (2018), the Greek character ᾆ is encoded as A)=| in Beta Code, as the code point U+1F86 in Unicode (UTF-8), and as &#x1F86; as an HTML hexadecimal reference. We converted all Greek characters to Beta Code for standardization purposes, choosing this encoding because it makes automatic processing and retrieval easier.

6 See http://teibyexample.org/modules/TBED00v00.htm?target=markuplanguages for an explanation of these terms.
7 https://www.tlg.uci.edu/encoding.


For example, if we want to find occurrences of the same word starting with or without a capital letter and match them to a digital dictionary, Beta Code makes this easy. This is because Beta Code encodes capitalization by adding an asterisk (*) to the letter character, so we can look up the capitalized and non-capitalized forms of the same word by adding or removing the asterisk. The format of the corpus was determined by further processing, aimed at identifying semantic change in a computational way. This means that, instead of one single file of running text, the corpus is organized into several text files to enable faster programmatic access to it. Moreover, the text is split into sentences, as a sentence is the unit of input for the computational model. We marked sentence boundaries as analytic metadata in the text. Below is an excerpt from the text file for the work Leucippe and Clitophon by Achilles Tatius, where we have removed the linguistic annotation on lemma and morphological analysis (see Sect. 2.4 for more details) and only included the first four words of the first sentence (hence the ellipsis).8
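To make the capitalization mechanism concrete, here is a minimal Python sketch of the asterisk trick described above (the Beta Code forms and glosses are our toy examples, not the corpus's actual lexicon):

```python
def strip_capitalization(beta: str) -> str:
    """Remove the Beta Code capitalization marker (*) before dictionary lookup."""
    return beta.replace("*", "")

# A toy lexicon keyed by non-capitalized Beta Code forms (glosses invented).
lexicon = {"a)qh=nai": "Athens", "lo/gos": "word, account"}

for token in ["*a)qh=nai", "lo/gos"]:
    print(token, "->", lexicon.get(strip_capitalization(token), "unknown"))
# *a)qh=nai -> Athens
# lo/gos -> word, account
```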