Best Practices for Spoken Corpora in Linguistic Research [1 ed.] 9781443865548, 9781443860338


English Pages 282 Year 2014


Best Practices for Spoken Corpora in Linguistic Research


Edited by

Şükriye Ruhi, Michael Haugh, Thomas Schmidt and Kai Wörner

Best Practices for Spoken Corpora in Linguistic Research, Edited by Şükriye Ruhi, Michael Haugh, Thomas Schmidt and Kai Wörner

This book first published 2014

Cambridge Scholars Publishing
12 Back Chapman Street, Newcastle upon Tyne, NE6 2XX, UK

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Copyright © 2014 by Şükriye Ruhi, Michael Haugh, Thomas Schmidt, Kai Wörner and contributors

All rights for this book reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

ISBN (10): 1-4438-6033-6, ISBN (13): 978-1-4438-6033-8

TABLE OF CONTENTS

List of Figures
List of Tables

Chapter One
Introduction: Putting Practices in Spoken Corpora into Focus
Şükriye Ruhi, Thomas Schmidt, Kai Wörner and Michael Haugh

Part 1: Case Studies on Corpora Design, Annotation and Analysis

Chapter Two
Building and Maintaining the GeWiss Corpus: Perspectives on the Construction, Sustainability and Further Enrichment of Spoken Corpora, A Showcase
Adriana Slavcheva and Cordula Meißner

Chapter Three
Constructing General and Dialectal Spoken Corpora for Language Variation Research: Two Case Studies from Turkish
Şükriye Ruhi and E. Eda Işık Taş

Chapter Four
The Corpus of Spoken Greek (CSG)
Theodossia-Soula Pavlidou, Charikleia Kapellidi and Eleni Karafoti

Chapter Five
Annotating Spoken Language
Ines Rehbein, Sören Schalowski and Heike Wiese

Chapter Six
Computational Analysis of Turn-taking Patterns in Interdisciplinary Research Meetings
Seongsook Choi and Keith Richards

Part 2: Discussions on Best Practices in Spoken Corpora

Chapter Seven
Russian Speech Corpora Framework for Linguistic Purposes
Pavel Skrelin and Daniil Kocharov

Chapter Eight
Building a Data Repository of Spontaneous Spoken Czech
Lucie Benešová, Martina Waclawičová and Michal Křen

Chapter Nine
Creating a Multimodal Corpus of Spoken World French
Oliver Ehmer and Camille Martinez

Chapter Ten
Best Practices on Long-Term Archiving of Spoken Language Data
Peter M. Fischer and Andreas Witt

Chapter Eleven
Best Practices in the Creation, Archiving and Dissemination of Speech Corpora at the Language Archive
Sebastian Drude, Paul Trilsbeek, Han Sloetjes and Daan Broeder

Chapter Twelve
Multilingual Corpora at the Hamburg Centre for Language Corpora
Hanna Hedeland, Timm Lehmberg, Thomas Schmidt and Kai Wörner

Chapter Thirteen
The Use of Ontologies as a Tool for Aggregating Spoken Corpora
Simon Musgrave, Andrea C. Schalley and Michael Haugh

Chapter Fourteen
(More) Common Ground for Processing Spoken Language Corpora?
Thomas Schmidt

Contributors
Index

LIST OF FIGURES

Figure 2-1: The online interface of the GeWiss corpus
Figure 3-1a: Transcription of the /k/-/g/ variation in STC
Figure 3-1b: Transcription of the /k/-/g/ variation in STCDC
Figure 3-2a: Transcription of /k/-/g/ variation in STC
Figure 3-2b: Transcription of /t/-/d/ variation in STCDC
Figure 3-3: A code-switch to English in STC
Figure 3-4: Marking source language pronunciation of a word in STC
Figure 3-5: Transcription of tirolli
Figure 3-6: Transcription of genneri
Figure 3-7: Transcription of benzine basmag
Figure 4-1: Initial page of the online subcorpus
Figure 4-2: Search results
Figure 4-3: The ten (10) most frequently used word tokens in the online subcorpus
Figure 4-4: Number of words per specific frequency
Figure 5-1: Multiple prefield in the Verbmobil corpus
Figure 5-2: Syntactic representation of a non-sentential unit (NSU) in KiDKo “A friend of my f of my friend”
Figure 5-3: Different syntactic analyses for self-repairs (I)
Figure 5-4: Different syntactic analyses for self-repairs (II)
Figure 5-5: Different syntactic analyses for self-repairs (III)
Figure 5-6: Different syntactic analyses for self-repairs (IV)
Figure 5-7: Syntactic representation of a self-repair in the Switchboard corpus
Figure 5-8: Representation of repetitions in the Switchboard corpus
Figure 5-9: Representation of repetitions in KiDKo: “I have to make a cheat sheet.”
Figure 5-10: Representation of repetitions in the Verbmobil corpus: “I have my scheduler here, too”
Figure 6-1: Speaker 2 as addressee, following Speaker 1
Figure 6-2: Speaker 2 given Speaker 1 uses “so”
Figure 6-3: Speaker 2 uses “so” given Speaker 1
Figure 7-1: Basic Annotation Scheme
Figure 7-2: Illustration of additional melodic annotation introduced to the corpus
Figure 8-1: Mluvka: a query result
Figure 9-1: Communicative areas covered in CIEL-F
Figure 9-2: Interface of [moca]
Figure 10-1: Example of a living archive
Figure 10-2: Functioning of a long-term archive
Figure 11-1: The Data Pyramid: a hierarchy of rising value and persistency
Figure 11-2: An ELAN annotation file displayed in ANNEX
Figure 11-3: The ISOcat data category registry
Figure 11-4: Components of a typical session
Figure 11-5: IMDI-browser view on an IMDI session in a LAT based language archive
Figure 11-6: View on a LEXUS lexicon
Figure 12-1: Decision tree for transcriptions
Figure 12-2: Partitur (score) representation on the web
Figure 13-1: Minimal set of classes that need to be asserted in the ontology
Figure 13-2: Additional classes in ontology fragment, defined via measurement property specifications
Figure 13-3: Additional classes in ontology fragment, defined via concurrent occurrence property specifications
Figure 14-1: Screenshot of the DGD2 query interface
Figure 14-2: Screenshot of HZSK corpus repository
Figure 14-3: HZSK/EXMARaLDA data model for corpus organisation
Figure 14-4: Temporal structure of a transcription represented in an AG with a partition into tiers
Figure 14-5: Overview of implementations
Figure 14-6: Three level architecture

LIST OF TABLES

Table 2-1: The size of the GeWiss corpus in recording hours
Table 3-1: Partial metadata for a file in STC
Table 4-1: Discourse Types and Number of Words
Table 5-1: Additional POS tags in KiDKo
Table 6-1: Turn-initial “so” for participants
Table 6-2: MICASE Turn-initial “so”
Table 8-1: Number of words proper in selected binary categories
Table 8-2: Number of words proper in the region of childhood residence category
Table 9-1: Excerpt of a fictitious metadata set
Table 10-1: General format suggestions
Table 13-1: Current composition of the Australian National Corpus

CHAPTER ONE

INTRODUCTION: PUTTING PRACTICES IN SPOKEN CORPORA INTO FOCUS

ŞÜKRİYE RUHİ, THOMAS SCHMIDT, KAI WÖRNER AND MICHAEL HAUGH

In an age of linguistic research where scholars are making more and more use of digital spoken corpora, a primary concern of researchers involved in the creation and sharing of language resources is to attain maximum usability, reliability and “longevity” of the resources for present and future researchers in the language sciences. Even assuming that data sharing and sustainability have been secured, the individual characteristics of a spoken corpus still strongly affect the kind of research that can be conducted with the resource. One cannot emphasize enough how decisions about the methods and techniques developed and deployed in building and sharing spoken corpora eventually affect what linguists (can) say about language represented in corpora.1 Turning our gaze toward the very practices of “creating” spoken corpora, corpus repositories and archives, and enriching the venues where these practices are described and discussed, can therefore reveal potential for moving such practices forward in at least two ways. On the one hand, it can help us to address the shared and differing demands that linguists make of spoken corpora. On the other hand, it can be a step toward the development of commonalities amongst practices in spoken corpora research. In both undertakings, however, it is important not to lose sight of the fact that spoken corpora users and developers have (and will always have) their specific research goals and may “cherish” their own way of doing things, often for very good reasons. Indeed, understanding why spoken corpora researchers do things differently is necessary to trace how they view their mission(s) in the creation of human language resources and, at a more practical level, to understand how they resolve the requirements, constraints and potentialities of their specific research spaces.

1 To offer just one obvious illustration, all spoken corpora need to make decisions on what will be represented as units in their transcription conventions. These in turn will influence, for example, quantification of speaker turns and turn lengths, and any analytic conclusions drawn from them (see Haugh (this volume), Rehbein, Schalowski, & Wiese (this volume), and Schmidt (this volume), for detailed discussions).

One of the requirements for discussing best practices is arguably documentation of the practices in spoken corpora assembly itself. The last ten or so years have seen a surge in discussion of trends and issues in corpus assemblage and processing, and in guides to good practice in spoken corpus design (e.g., Adolphs & Knight, 2010; Baude et al., 2006; Thompson, 2005; see also Lüdeling & Kytö, 2008). While guidelines undoubtedly function as springboards for researchers in the actual process of corpus creation, they do not, and cannot be expected to, display what goes on in the actual “assembly line”. As pointed out by Schmidt and Wörner (2012, pp. xi-xii), very few research outputs are devoted to documenting the process of corpus construction. We view spoken corpora construction and sharing as major research endeavors that should also be laid open to academic debate in a manner that is more visible than is currently the case in corpus linguistics. In the present volume, our take on the gap is that it is not only practices in corpus construction but also practices in spoken language processing, archiving and sharing that need to be documented and discussed, as the methods and techniques implemented in the two are mutually influential.
The present volume thus aims to function as one such venue by approaching the issue of best practices from two angles: the perspective of corpus assembling and issues of annotation and linguistic analysis, and the perspective of corpus assembling in terms of developing spoken language processing, archiving, and dissemination systems. The volume thus lends voice to scholars who are shouldering responsibilities in spoken corpora creation and sharing by listening to the rationales, decisions, and action plans that underlie their research in spoken corpora. One way to do this is to showcase actual practices for the design, creation and dissemination of spoken corpora in linguistic disciplines like conversation analysis, dialectology, sociolinguistics, pragmatics, and discourse analysis. The volume takes stock of current initiatives so that readers may see for themselves how approaches to speech data processing differ or overlap, and observe where and how a potential for coordination of efforts and standardisation exists. Placing the rationale and the description of practices at the core of the volume, we believe, will be conducive to further development of common and good practices in spoken corpora research.

The volume grew out of a half-day workshop entitled “Best Practices in Speech Corpora in Linguistic Research”, which was organized by the editors on May 21, 2012 at the Language Resources and Evaluation Conference in Istanbul.2 It comprises a selection of revised and extended versions of papers presented at the workshop, as well as contributions solicited from scholars in the field of spoken corpus research more broadly.

1. “Speech” versus “Spoken” Corpora

In the field of human language resources, and in corpus linguistics in particular, a distinction is made between speech corpora and spoken corpora even though the difference may be blurred and the two types of corpora may blend into each other in practice. It is characteristic of speech corpora to be built for “the detailed study of individual sounds and phonetic features” (Tognini-Bonelli, 2010, p. 25) and speech technology applications, major areas of application being automatic speech recognition, speech synthesis and automatic dialogue systems (Wichmann, 2008, p. 187). Speech corpora:

    do not need to be continuous texts, nor do they need to be collected in natural situations […]. Many are recorded in high-tech laboratories in order to maximise the detail on the recorded version. Speech corpora are thus part of the class of special corpora […], those which for one reason or another are not intended to be representative of the ordinary language in its characteristic use as communication. (Tognini-Bonelli, 2010, p. 25)

In contrast, spoken corpora in the sense we are using the term here are principled collections of electronically available, transcribed and annotated audio and/or video recordings of languages or language varieties that aim to represent the language as used by its speakers in naturally occurring communicative contexts (Anderson, 2010). They thus seek to function as reference points for investigating languages and language varieties (Biber, Conrad, & Reppen, 1998; McEnery & Wilson, 2001). Spoken language corpora research has been gaining momentum in the last two decades thanks to the efforts of computational linguists, speech technologists, and linguists who work with authentic spoken language resources. However, while research outputs stemming from speech corpora and scholarly work on compiling and analyzing written corpora are abundant and have informed the field of spoken corpora (see, e.g., Arısoy, Can, Parlak, Sak, & Saraçlar, 2009), spoken corpora, as noted in the previous section, have not enjoyed the same attention in research on digital human language resources.

2 http://www.corpora.uni-hamburg.de/lrec2012/index.html.

Largely in parallel to the speech technology community, linguists from a variety of fields have intensified their efforts to build or curate larger collections of spoken language data. However, while methods, tools, standards and workflows developed for corpora used in speech technology often serve as a starting point and a source of inspiration for the practices evolving in the linguistic research community, it would be an oversimplification to say that speech technology data and spoken language data in linguistic research are merely two variants of the same category of language resources. Too distinct are the scholarly traditions, the research interests and the institutional circumstances that determine the designs of the respective corpora and the practices chosen to build, use and disseminate the resulting data. As noted above, the emphasis on compiling or curating spoken corpora mainly from impromptu speech, together with the goal of having spoken corpora serve the linguistics community, makes the research enterprise distinctly different from that of speech corpora.

The volume therefore looks at spoken corpora from a decidedly linguistic perspective. It brings together linguists, tool developers and corpus specialists who develop and work with authentic spoken language corpora, and discusses their different approaches to corpus design, transcription and annotation, metadata management and data dissemination. The discussions in the various chapters of the volume are thus geared toward a better understanding of

- best practices for spoken corpora in conversation analysis, dialectology, sociolinguistics, pragmatics and discourse analysis;
- spoken corpora designs and corpus stratification schemes;
- spoken corpora designs for pluricentric languages;
- metadata descriptions of speakers and communications;
- data models, formats and workflows for spoken language data in linguistic research and possible routes to their standardisation;
- tools for manual transcription and annotation of authentic spoken data;
- use of (automatic) methods for tagging and annotating authentic spoken data in terms of phonetic, syntactic and discursive features;
- corpus management systems and dissemination platforms for spoken corpora;
- integration of spoken corpora from linguistic research into digital infrastructures; and
- legal issues in creating, using and publishing spoken corpora for linguistic research.

In the following we briefly highlight emerging issues and practices, and provide an overview of the foci of the chapters in the volume.

2. Issues in best practices for spoken corpora

In the past, research on spoken language in linguistics, especially in the fields of conversation analysis, dialectology, sociolinguistics, pragmatics and discourse analysis, has relied on audio recordings collected by the researchers for specific projects. More often than not these data were not available to other researchers. The arrival of publicly available spoken corpora is slowly changing the nature of studies conducted in these fields (see, e.g., Adolphs, 2008; Aijmer, 2002; Andersen, 2000; Ries et al., 2000; Rühlemann, 2010; Schauer & Adolphs, 2006). This development has its strengths and weaknesses (see Virtanen, 2009 for a discussion on problems in conducting discourse analysis with corpora). While previous data sources may be characterized as specialized corpora (Tognini-Bonelli, 2010), which embody rich information on interactions in a narrow range of communicative contexts, the designs and metadata of large-scale (and in particular reference) spoken corpora have been criticized for presenting impoverished and static contextual information on interactions (e.g., Rühlemann, 2010; Archer, Culpeper, & Davies, 2008). Current research on spoken corpora is presenting a number of emerging practices in metadata and annotation (tools) that attempt to remedy this state, both in terms of types of corpus construction research and in terms of annotation tools and understandings of metadata (see, e.g., Adolphs, Knight, & Carter, 2011; Schmidt & Wörner, 2009; Taylor, 2011). The issue of contextualizing spoken interaction in corpora is taken up through discussions on (metadata) annotation in this volume. Discussions on research methodologies in spoken corpora research are also addressing this issue by advocating mixed methods (Bednarek, 2011) or deployment of new computational approaches (see, e.g., Santamaría-García, 2011; Choi & Richards, this volume).


A crucial difference between what we may call traditional spoken corpora and present-day electronic corpora is their design principles. In a discussion on the impact of data on corpus linguistic analysis, Moisl (2009) makes the following critical reminder:

    Data is ontologically different from the world. The world is as it is; data is an interpretation of it for the purpose of scientific study. The weather is not the meteorologist’s data – measurements of such things as air temperature are. A text corpus is not the linguist’s data – measurements of such things as average sentence length are. (p. 876)

We agree with Moisl that data is an interpretation of the world, and underscore that corpora, too, are interpretations of the world of communication for a number of reasons. The selection of what is to be represented of the huge world of communication involves theoretical and empirical decisions on sampling procedures, each step of which is guided by choice of parameters and size considerations. As underscored by Hunston, “issues of corpus design and building take us to the heart of theories of corpus linguistics” (2008, p. 166). In other words, corpus design is theory-driven through and through, and design practices, therefore, crucially impact and delimit what corpora can tell us about the workings of language. Furthermore, the very fact that we “record” communication and then transcribe it means “slicing” data and making several abstractions and moves away from the “natural” world. These procedures too are driven by theoretical and practical decisions and call for more intense investigation and debate of existing practices.

In spite of problems concerning what representativeness and balance mean and how they are to be achieved in corpora (see Leech, 2007), the range of communication features to be represented in a corpus, such as the period of occurrence, genre, speaker biographies and the balance of genre tokens, as well as the principles for collecting recordings, are expected to be explicit and well-designed for representing variation in language along four parameters: “time, space, society and situation” (Moreno-Fernández, 2005, p. 126). These considerations translate into breadth of coverage in present-day corpora. Discussions on practices in corpus design and stratification are taken up especially in Chapters Eight and Nine in the present volume.
Another feature that distinguishes traditional linguistic corpora from electronic corpora is the practice of including metadata, that is, data that describes the language resource in terms of “contents and context to allow a better and more precise retrieval” (MetaGuide 2003, as cited in Wörner, 2012, p. 400). Accuracy, extensiveness and consistency in metadata make corpora more explicit for the research questions of the linguistic community. The sheer size of the data in current corpus methodologies, which combine quantitative and qualitative techniques, also makes metadata indispensable for corpus designs. While standardisation in spoken corpora metadata practices is desirable (Lehmberg & Wörner, 2008), it remains a challenge, and even though there are formats that are widely used, most are customized to fit the demands of individual projects (Wörner, 2012). Discussions on best practices in metadata emerge in several of the chapters in the volume.

Transcription and further linguistic annotation continue to be at the center of debate in linguistic research that deploys spoken language data (Ochs, 1979; O'Connell & Kowal, 1999). This issue is all the more central for spoken corpora for a number of reasons. Achieving a minimum of interoperability is a prime concern for digital human language resources, because researchers in linguistics operate with varying transcription conventions, and these need to be interpretable within other frameworks. Also at the ground level of conducting research with spoken corpora is the issue of segmenting utterances into units that can be quantified and annotated for linguistic features (e.g., turn and utterance boundaries, the syntactic unit of spoken grammar, discourse segments, etc.). While linguistic research on spoken interaction may continue to debate these issues, spoken corpora creators need to develop, if not what may be called standards and common practices (Schmidt, 2011), at least a set of workable conventions that populations of linguists can agree on. These issues also appear as critical points of discussion in Chapters Five, Thirteen, and Fourteen of the present volume.

Alongside standardisation in transcribing spoken language and ways of improving interoperability, annotation proper continues to hold a prominent place in academic debate on spoken corpora.
There are differences even in annotation of what might be called the basics such as POS-tagging and morphological parsing (e.g., Atwell, 2008; Schmid, 2008). In this respect Leech’s advice that annotation should be “consensual” and “theory neutral” (1993, p. 275) becomes more of an ideal to be pursued than a reality, especially in the context of conversational management and pragmatic annotation. Moving on to more sophisticated levels of annotation (the syntactic, the discoursal and the pragmatic), we observe major theoretical and empirical challenges. To give one illustration on the empirical side, annotation of pragmatic features has largely remained within the area of speech act annotation, but even here annotation techniques often rely on conventionalized linguistic sequences (see, e.g., Weisser, 2002). Chapters Five and Six in this volume address methodologies and describe implementations of further annotation at the syntactic, discoursal and pragmatic levels, while Chapter Three proposes to move pragmatic annotation in corpus building to the metadata level.

Corpora are also significantly shaping the way researchers think about and study language today (Arppe, Gilquin, Glynn, Martin, & Zeschel, 2010; Gilquin & Gries, 2009), and more and more researchers are using these data to address various research questions in theoretical and applied linguistics. Within the context of the changing nature of linguistic research, it is thus becoming even more important to disseminate findings about best practice for building and using spoken corpora to the wider community of researchers. Spoken corpora are labor-intensive and expensive digital resources, and spoken corpora designers follow diverse methodologies even though there are converging linguistic and technical approaches. A pressing concern in spoken corpora research, then, is the development of corpus construction tools and dissemination platforms that can enable researchers to archive their resources and make them available to others. This volume addresses issues related to the large-scale perspective in spoken corpora research in the chapters in Part Two.

Within this wide range of topics, by showcasing current practices in spoken corpora research in diverse languages and language varieties (namely, Czech, English, French, German, Greek, Polish, Russian and Turkish), the present volume addresses critical issues concerning maintenance, dissemination and resource creation. It thus offers readers a wide perspective on avenues toward best practices in the field.
The book brings into closer contact scholars whose specializations have often remained in relatively different streams of scientific investigation, that is, scholars whose work falls primarily in conversation analysis, pragmatics and discourse analysis but who are involved in spoken corpus compilation, on the one hand, and scholars who also specialize in linguistics but who have been intensively involved in developing various infrastructures for spoken corpora, on the other hand. As noted in the previous section, this combination of scholars brings into better relief the concerns of data providers and data curators in linguistic research.

3. Overview of the volume The present volume consists of two parts. A number of themes and pressing issues, all of which are inter-related, can be identified in the various contributions to the volume. Highlighting a common concern, we observe that common ground in annotation, which would enable comparability of data resources, sustainability and dissemination appear in

Introduction: Putting Practices in Spoken Corpora into Focus

9

the foreground in several of the chapters. This is not surprising, since comparing data both within and across languages is obviously one key research activity in linguistics.

The chapters in Part One approach spoken corpora from the angle of corpus design, spoken data transcription and annotation, as conducted in the fields of corpus linguistics and pragmatics. Part Two also discusses design issues in specific spoken corpora research but addresses computational issues: data formats, ontologies in transcription, and archiving systems in web platforms and portals. In other words, Part Two focuses more on the “infrastructures” that underlie annotation and repository developments.

In Chapter Two, “Building and maintaining the GeWiss corpus – perspectives on the construction, sustainability and further enrichment of spoken corpora. A showcase”, A. Slavcheva and C. Meißner describe the construction of GeWiss, a spoken multilingual comparable corpus consisting of academic discourse in three languages (German, English and Polish) and German as a foreign language. They discuss issues related to sustainability and annotation, and matters deriving from the publication of the corpus in different legal frameworks.

In “Constructing general and dialectal corpora for language variation research: Two case studies from Turkish”, Ş. Ruhi and E. I. Taş discuss the challenges of constructing comparable corpora for geographical and genre variation for pragmatics research in the context of the Spoken Turkish Corpus and the Spoken Turkish Cypriot Dialect Corpus. The chapter first describes design and metadata practices to achieve comparability for conducting variation research. It then details the major transcription conventions that were developed to address the display of geographical variation.

A contribution that also highlights the representation of the pragmatics of spoken language in corpora is the chapter authored by T-S. Pavlidou, C. Kapellidi and E. Karafoti, “The Corpus of Spoken Greek” (CSG). Showcasing the work being conducted in the compilation of the corpus, the authors first describe the design of CSG and argue for the need to represent the fine details of spoken interaction in the transcriptions, as is customary in the conversation analytic approach. They argue that practices in spoken corpora research need to consider the research goals of compiling a corpus in the first place when developing standardisation practices that would maintain comparability of resources to serve wider communities of researchers. As will be observed in the chapters in Part Two, standardisation, interoperability and enhancing dissemination of resources are key issues which researchers working at the “managerial” and computational level of spoken corpus resources, so to speak, are

Chapter One

attempting to improve through various text technological research and development of methods for the corpus compilation process and for corpus dissemination.

Moving on to contributions that focus on linguistic annotation of spoken corpora, I. Rehbein, S. Schalowski and H. Wiese in their chapter entitled “Syntactic annotation of spoken language” discuss the annotation of spoken language syntax in KidKo (the Kiezdeutsch-Korpus), which consists of samples of adolescent language in Berlin-Kreuzberg and Berlin-Hellersdorf. Two crucial problems they tackle are the identification and annotation of speech units and disfluencies. They point to the need to maintain “interoperability and comparability with corpora of written language” but also argue that spoken grammar makes its own demands in annotation. To meet these demands, they argue that annotation of spoken corpora should be data-driven. They therefore work on the surface-level description of language use.

Chapter Six, entitled “Computational analysis of turn-taking patterns in interdisciplinary research meetings”, addresses annotation from the user end of spoken corpora research. In this chapter, Choi and Richards describe a discourse-level annotation with R. They discuss the implementation of a tagging system for speaker contributions that may then be submitted to statistical analysis to understand social actions. The authors thereby combine the data analytic techniques of conversation analysis and discourse analysis from the perspective of corpus linguistic statistics and showcase a methodology for analyzing the meeting ground of the micro context and the discursive context in spoken corpora.

The chapters in Part Two address infrastructure issues and tool developments in the annotation, analysis, compilation and dissemination of spoken corpora. The discussion opens with the chapter by P. Skrelin and D. Kocharov, entitled “Russian speech corpora framework for linguistic purposes.” The authors discuss work that is being carried out on both Russian speech and spoken corpora developed at the Department of Phonetics, Saint Petersburg State University. They particularly address the phonetic annotation system and tools developed, with a view to enabling linguistic analysis of the sound system of the Russian language and incorporating further annotation carried out on the corpora.

In Chapters Eight and Nine we move into detailed descriptions of two spoken corpora, the first from the Slavic group and the second from Romance languages. In their contribution entitled “Building a data repository of spontaneous spoken Czech,” L. Benešová, M. Waclawičová and M. Křen present spoken corpus construction work being carried out in the framework of the Czech National Corpus, where the authors both

showcase a spoken corpus and focus on corpus design and compilation issues. They discuss the challenges of achieving representativeness and balance in a corpus that truly reflects spoken language and the challenges of collating a large-scale corpus at the national level.

English is arguably the language that has until recently been represented to a greater extent in terms of its various varieties around the world, notably by the International Corpus of English (ICE) project (Greenbaum & Nelson, 1996; http://icecorpora.net/ice/index.htm). Corpus compilation work for other languages spoken widely around the globe is therefore a welcome development for linguistic research. In Chapter Nine, O. Ehmer and C. Martinez describe the stages in the workflow and a number of standards in the creation of the Corpus International Écologique de la Langue Française – CIEL-F, a multimodal corpus of global varieties of spoken French. While descriptions of the procedures followed in the collection of recordings are typical of qualitative work in linguistics, researchers reporting on electronic spoken corpora usually do not touch upon this ground level in detail. Thus Ehmer and Martinez’s chapter is unique in this respect. Besides detailing practices and standards followed in transcription, metadata and corpus management, they also discuss how recordings were compiled for CIEL-F.

The subsequent chapters in Part Two focus on long-term archiving of spoken corpora and on issues of interoperability. In Chapter Ten, “Best practices on long-term archiving of spoken language data,” P. M. Fischer and A. Witt take up the issues of disseminating and curating corpora. With illustrations from the repository system at the Institut für Deutsche Sprache (Institute for the German Language), the authors claim that long-term availability of spoken corpora depends on “the unconditional employment of standardized methods and formats in all facets of data handling and processing.” The authors describe the workflow stages in long-term archiving, from pre-processing of resources to technical infrastructure maintenance. As is the case in the work described in the chapter by T. Schmidt, Fischer and Witt emphasize the importance of working in close cooperation with data providers and data curators in tackling issues in the workflow.

Chapter Eleven, co-authored by S. Drude, D. Broeder, P. Wittenburg and H. Sloetjes, also adopts a wide-angle perspective on spoken corpora research and presents an overview of corpus construction, dissemination and archiving practices developed at The Language Archive (TLA) at the Max Planck Institute for Psycholinguistics. The authors discuss issues of tackling resource and annotation diversity in corpus design; they focus on metadata developments within the framework of IMDI and the component

meta-data infrastructure (CMDI), and dwell upon the tools offered for corpus construction and linguistic analysis at TLA. Chapter Twelve, by H. Hedeland, T. Lehmberg, T. Schmidt and K. Wörner, demonstrates practices that have emerged in Germany. In their chapter entitled “Multilingual corpora at the Hamburg Centre for Language Corpora”, the authors first provide an overview of the development of the Hamburger Zentrum für Sprachkorpora (Hamburg Centre for Language Corpora), which grew out of multilingual, parallel, historical and sociolinguistic corpora compilation and research conducted within the framework of the Research Centre on Multilingualism (SFB 538) at Hamburg University, one of the major outputs of which is EXMARaLDA (a system for constructing and analyzing spoken corpora). The authors then discuss the infrastructure being developed and the technical assistance extended to researchers in conducting corpus research and in integrating the corpora for the wider academic community within this infrastructure. Representation of spoken language in corpora is a major topic of debate in several of the chapters in Part One. Chapter Thirteen, entitled “The use of ontologies as a tool for aggregating spoken corpora”, addresses this issue in the context of the Australian National Corpus (AusNC), which is an aggregated collection of corpora, many of which pre-date the establishment of the AusNC. The authors, S. Musgrave, A. Schalley and M. Haugh, propose tackling the problem of comparability of representations that employ different transcription conventions by discussing the development of ontologies for transcription concepts. They illustrate the development of such an ontology on three sets of spoken data in the AusNC and discuss its implications for the representation of the “theoretical commitments” of the original data contributors. The last chapter in the volume, authored by T. 
Schmidt, addresses the question as to whether and to what degree there are commonalities in the various linguistic annotation practices in spoken corpora that have emerged since Bird and Liberman’s (2001) annotation graph framework. In his chapter entitled “(More) common ground for processing speech corpora?” Schmidt offers a bird’s eye view of these issues based on comparisons between a wide range of corpora: those of the Database of Spoken German (Schmidt et al., 2013), the Hamburg Centre for Language Corpora (see Chapter Twelve), the Max Planck Institute in Nijmegen (see also Chapter Eleven), and the Talkbank/CHILDES/Phonbank family (MacWhinney, 2000; Rose, 2012). Schmidt also takes up the question of how these annotations are organized and used in corpora. With this chapter we get a bird’s eye view of recent accomplishments in

interoperability and the major challenges that lie ahead for spoken corpora to be of even more value to the linguistic community.

When asked to comment on the future of corpus linguistics, Leech (2009) highlights the following challenges:

• the still urgent need to document the large number of languages around the world in electronic corpora
• the need to enrich annotation of spoken corpora, especially in terms of their prosodic, semantic, syntactic, pragmatic and discoursal features
• the need for “breakthroughs” in automated transcription and annotation

In this way, he argues, the field can accomplish “sophisticated methods of analysis” (Leech, 2011, pp. 165-166). Writing from the end of spoken corpora creation, annotation and dissemination, while we may not have addressed the third challenge, we believe that the present volume offers a comprehensive representation of research and implementation of practices that address spoken corpora creation, annotation and dissemination. In Part One of this volume, for instance, such challenges are addressed through the use of (adapted) orthographic representation in corpora that is highlighted in Chapters Eight and Nine or the overall efforts to adhere to “theory neutrality” in linguistic annotation underpinning the work described in many of these chapters. It can also be observed in the references to the use of multiple checks in annotation and transcription, and differences in the sampling of recordings (i.e., whole or complete) made throughout the chapters in Part One. The various chapters in Part Two also address such challenges through showcasing various tools that can be deployed for corpus assembly and corpus management. In doing so, however, questions arise around whether there remains a need to develop a standardized set of best practices to avoid inadvertent cases of re-inventing the wheel, as it were. One productive way forward to ameliorate this problem is to encourage greater use and provision of open source tools.
Yet even with greater emphasis on best practices that are open source, as these chapters collectively suggest, there are nevertheless various specific problems that those building repositories and archiving systems face, particularly in relation to the markedly divergent approaches to collecting and representing spoken data. One of the challenges facing those building spoken corpora is that more often than not the corpora are required to play a dual role, both as an environment in which corpus-based research can be carried out, and as a repository that safeguards and maintains spoken data in a form that will continue to be accessible to future generations. In that respect, the rapid technological changes we continue to experience in this area offer not only significant promise, but also possible dangers for the unwary.

References

Adolphs, S. (2008). Corpus and Context. Investigating Pragmatic Functions in Spoken Discourse. Amsterdam/Philadelphia: John Benjamins.
Adolphs, S., & Knight, D. (2010). Building a spoken corpus: what are the basics? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 38-52). London/New York: Routledge.
Adolphs, S., Knight, D., & Carter, R. (2011). Capturing context for heterogeneous corpus analysis: Some first steps. International Journal of Corpus Linguistics, 16(3), 305-324.
Aijmer, K. (2002). English Discourse Particles: Evidence from a Corpus. Amsterdam/Philadelphia: John Benjamins.
Andersen, G. (2000). Pragmatic Markers and Sociolinguistic Variation: A Relevance-Theoretic Approach to the Language of Adolescents. Amsterdam/Philadelphia: John Benjamins.
Andersen, G. (2010). How to use corpus linguistics in sociolinguistics. In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 547-562). London/New York: Routledge.
Archer, D., Culpeper, J., & Davies, M. (2008). Pragmatic annotation. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Volume 1 (pp. 613-642). Berlin/New York: Walter de Gruyter.
Arppe, A., Gilquin, G., Glynn, D., Hilpert, M., & Zeschel, A. (2010). Cognitive Corpus Linguistics: five points of debate on current theory and methodology. Corpora, 5(1), 1-27.
Arısoy, E., Can, D., Parlak, S., Sak, H., & Saraçlar, M. (2009). Turkish broadcast news transcription and retrieval. IEEE Transactions on Audio, Speech, and Language Processing, 17(5), 874-883. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5071138
Atwell, E. (2008). Development of tag sets for part-of-speech tagging. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Volume 1 (pp. 501-526). Berlin/New York: Walter de Gruyter.

Baude, O., Blanche-Benveniste, C., Calas, M., Cappeau, P., Cordereix, P., Goury, L., ... Mondada, L. (2006). Corpus Oraux – Guide des Bonnes Pratiques. Paris: Presses Universitaires d’Orléans. Retrieved from http://hal.inria.fr/docs/00/35/77/06/PDF/Corpus_Oraux_guide_des_bonnes_pratiques_2006.pdf
Bednarek, M. (2011). Approaching the data of pragmatics. In W. Bublitz & N. R. Norrick (Eds.), Foundations of Pragmatics. Handbooks of Pragmatics, Vol. 1 (pp. 537-559). Berlin/Boston: De Gruyter Mouton.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1-2), 23-60. doi:10.1016/S0167-6393(00)00068-6
Bucholtz, M. (2000). The politics of transcription. Journal of Pragmatics, 32(10), 1439-1465. doi:10.1016/S0378-2166(99)00094-6
Gilquin, G., & Gries, S. Th. (2009). Corpora and experimental methods: A state-of-the-art review. Corpus Linguistics and Linguistic Theory, 5(1), 1-26.
Greenbaum, S., & Nelson, G. (1996). The International Corpus of English (ICE) Project. World Englishes, 15(1), 3-15.
Hunston, S. (2008). Collection strategies and design decisions. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Volume 1 (pp. 154-168). Berlin/New York: Walter de Gruyter.
Jenks, C. J. (2011). Transcribing Talk and Interaction: Issues in the representation of Communication Data. Amsterdam/Philadelphia: John Benjamins.
Leech, G. (2007). New resources, or just better old ones? The Holy Grail of representativeness. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus Linguistics and the Web (pp. 133-149). Amsterdam: Rodopi.
Leech, G. (2011). Principles and applications of Corpus Linguistics. In V. Viana, S. Zyngier, & G. Barnbrook (Eds.), Perspectives on Corpus Linguistics (pp. 155-170). Amsterdam/Philadelphia: John Benjamins.
Lehmberg, T., & Wörner, K. (2008). Annotation standards. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Volume 1 (pp. 484-501). Berlin/New York: Walter de Gruyter.
Lüdeling, A., & Kytö, M. (Eds.). (2008). Corpus Linguistics: An International Handbook, Volume 1. Berlin/New York: Walter de Gruyter.
McEnery, T., & Wilson, A. (2001). Corpus Linguistics. An Introduction. Edinburgh: Edinburgh University Press.

Moisl, H. (2008). Exploratory multivariate analysis. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Volume 1 (pp. 876-899). Berlin/New York: Walter de Gruyter.
Moreno-Fernández, F. (2005). Corpora of spoken Spanish language – The representativeness issue. In Y. Kawaguchi, S. Zaima, T. Takagaki, K. Shibano, & M. Usami (Eds.), Linguistic Informatics – State of the Art and the Future. The First International Conference on Linguistic Informatics (pp. 120-144). Amsterdam/Philadelphia: John Benjamins.
Ochs, E. (1979). Transcription as theory. In E. Ochs & B. B. Schieffelin (Eds.), Developmental Pragmatics (pp. 43-72). New York: Academic Press.
O’Connell, D., & Kowal, S. (2009). Transcription systems for spoken discourse. In S. D’hondt, J.-O. Östman, & J. Verschueren (Eds.), The Pragmatics of Spoken Interaction (pp. 240-255). Amsterdam/Philadelphia: John Benjamins.
Ries, K., Levin, L., Valle, L., Lavie, A., & Waibel, A. (2000). Shallow discourse genre annotation in CallHome Spanish. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 31st May-2nd June 2000. Retrieved from http://www.cs.cmu.edu/~alavie/papers/LREC-2000-Clarity.pdf
Rühlemann, C. (2010). What can a corpus tell us about pragmatics? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 288-301). London/New York: Routledge.
Santamaría-García, C. (2011). Bricolage assembling: CL, CA and DA to explore agreement. International Journal of Corpus Linguistics, 16(3), 345-370.
Schauer, G., & Adolphs, S. (2006). Expressions of gratitude in corpus and DCT data: Vocabulary, formulaic sequences, and pedagogy. System, 34, 119-134.
Schmid, H. (2008). Tokenizing and part-of-speech tagging. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Volume 1 (pp. 527-551). Berlin/New York: Walter de Gruyter.
Schmidt, T. (2011). A TEI-based approach to standardising spoken language transcription. Journal of the Text Encoding Initiative, 1. Retrieved from http://jtei.revues.org/142
Schmidt, T., & Wörner, K. (2009). EXMARaLDA – creating, analysing and sharing spoken language corpora for pragmatic research. Pragmatics, 19(4), 565-582.
Schmidt, T., & Wörner, K. (2012). Introduction. In T. Schmidt & K. Wörner (Eds.), Multilingual Corpora and Multilingual Corpus Analysis (pp. xi-2). Amsterdam/Philadelphia: John Benjamins.
Taylor, C. (2011). Negative politeness forms and impoliteness functions in institutional discourse: A corpus-assisted approach. In B. Davies, M. Haugh, & A. J. Merrison (Eds.), Situated Politeness (pp. 209-231). London: Continuum.
Thompson, P. (2005). Spoken Language Corpora. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 59-70). Oxford: Oxbow Books. Retrieved from http://ahds.ac.uk/linguisticcorpora/ [Accessed 02-01-2014].
Tognini-Bonelli, E. (2010). Theoretical overview of the evolution of corpus linguistics. In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 14-27). London/New York: Routledge.
Virtanen, T. (2009). Discourse linguistics meets corpus linguistics: theoretical and methodological issues in the troubled relationship. In A. Renouf & A. Kehoe (Eds.), Corpus Linguistics: Refinements and Reassessments (pp. 49-65). Amsterdam/New York: Rodopi.
Weisser, M. (2002). SPAACy – A semi-automated tool for annotating dialogue acts. International Journal of Corpus Linguistics, 8(1), 63-74.
Wichmann, A. (2008). Speech corpora and spoken corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Volume 1 (pp. 187-207). Berlin/New York: Walter de Gruyter.
Wörner, K. (2012). Finding the balance between strict defaults and total openness: Collecting and managing metadata for spoken language corpora with the EXMARaLDA Corpus Manager. In T. Schmidt & K. Wörner (Eds.), Multilingual Corpora and Multilingual Corpus Analysis (pp. 383-400). Amsterdam/Philadelphia: John Benjamins.
Wynne, M. (Ed.). (2005). Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books. Retrieved from http://ahds.ac.uk/linguistic-corpora/

PART 1: CASE STUDIES ON CORPORA DESIGN, ANNOTATION AND ANALYSIS

CHAPTER TWO

BUILDING AND MAINTAINING THE GEWISS CORPUS: PERSPECTIVES ON THE CONSTRUCTION, SUSTAINABILITY AND FURTHER ENRICHMENT OF SPOKEN CORPORA, A SHOWCASE

ADRIANA SLAVCHEVA AND CORDULA MEIßNER

1. GeWiss – a comparable corpus of spoken academic language

In March 2013, the GeWiss corpus was released.1 It is the first freely available comparable corpus of spoken academic language and provides the scientific community with a valuable resource of transcripts aligned to audio recordings and equipped with comprehensive metadata.2 Its two access options (i.e., via full texts and concordances) are suitable for both qualitative and quantitative research. The launch of the corpus was preceded by three and a half years of construction work. Numerous decisions had to be made regarding recording, gathering of metadata, transcription, and documentation. Issues of long-term maintenance and hosting had to be dealt with, so that the

1 The corpus is accessible via the web interface www.gewiss.uni-leipzig.de. It can be used for research and teaching after free registration.
2 There is ongoing work in creating corpora of academic discourse, e.g. the euroWiss project (cf. http://www1.slm.uni-hamburg.de/de/forschen/projekte/eurowiss.html).

Building and Maintaining the GeWiss Corpus

corpus, constructed at considerable cost in time and labour, would remain available even after the VolkswagenStiftung funding had ended.3 This paper focuses on the issues of construction and maintenance of a spoken language resource, using GeWiss as a showcase. The first part introduces the GeWiss corpus from the perspective of its construction process (1.1) and places it in the context of other existing corpora of spoken (academic) language (1.2). The second part then raises issues of general relevance for spoken language corpora.

1.1 The GeWiss corpus: construction, publication and further perspectives

The GeWiss corpus in its primary version was constructed between 2009 and 2012. Partners from Germany, Poland and the UK collaborated in a project funded by the VolkswagenStiftung to build a comparable corpus of spoken academic language (cf. Fandrych, Meißner, & Slavcheva, 2012). Each partner recorded academic speech events in their respective native language (German, Polish or English) as well as in German as L2, based on a predefined corpus design. The corpus needed to allow for contrastive analyses of the chosen genres with regard to the different languages, the different traditional discourse contexts of Germany, Poland and the UK, and the different language competences in German (L1 vs. L2). Five parameters were therefore defined to describe the corpus:

• the language used in the speech event (i.e., German, English, Polish),
• the genre of spoken academic discourse (academic paper, student presentation, oral exam),
• the language competence of the speaker in the academic language used (L1 or L2),
• the academic context that the academic language is used in (Germany, UK, Poland), and
• the professional competence of the speaker in the respective academic subject (i.e., philology) (experienced academics vs. students).

3 GeWiss was funded by the VolkswagenStiftung (Az.: II/83967) as part of its funding scheme Deutsch Plus – Wissenschaft ist mehrsprachig
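
The five design parameters listed above can be pictured as a small record describing each speech event. The following Python sketch is purely illustrative: the class names, field names and the helper `differing_parameters` are assumptions made for this example and are not taken from the GeWiss metadata set or from EXMARaLDA Coma.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Language(Enum):
    GERMAN = "German"
    ENGLISH = "English"
    POLISH = "Polish"


class Genre(Enum):
    ACADEMIC_PAPER = "academic paper"
    STUDENT_PRESENTATION = "student presentation"
    ORAL_EXAM = "oral exam"


class Competence(Enum):
    L1 = "L1"
    L2 = "L2"


class AcademicContext(Enum):
    GERMANY = "Germany"
    UK = "UK"
    POLAND = "Poland"


class Expertise(Enum):
    EXPERIENCED_ACADEMIC = "experienced academic"
    STUDENT = "student"


@dataclass(frozen=True)
class SpeechEventProfile:
    """Hypothetical record of the five design parameters (names are illustrative)."""
    language: Language
    genre: Genre
    competence: Competence
    context: AcademicContext
    expertise: Expertise


def differing_parameters(a, b):
    """Return the names of the parameters in which two profiles differ."""
    return {f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)}


# Two oral exams held in German: one by an L1 speaker in Germany,
# one by an L2 speaker in the UK.
exam_de = SpeechEventProfile(Language.GERMAN, Genre.ORAL_EXAM, Competence.L1,
                             AcademicContext.GERMANY, Expertise.STUDENT)
exam_uk = SpeechEventProfile(Language.GERMAN, Genre.ORAL_EXAM, Competence.L2,
                             AcademicContext.UK, Expertise.STUDENT)

print(sorted(differing_parameters(exam_de, exam_uk)))  # ['competence', 'context']
```

Holding all but one or two parameters constant in this way is what makes sub-corpora comparable: the two hypothetical exams above differ only in speaker competence and academic context.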

Chapter Two

In accordance with these parameters, a corpus design was created that comprised approximately 120 hours of recordings in total: 40 hours from each of the three academic contexts (half in the respective L1 and half in German as L2 in that context), with 20 hours each of monologic and dialogic speech for each academic context. To create comparable sub-corpora following these parameters, and to ensure comparability with regard to formats and conventions as well, a five-step workflow was followed (cf. Slavcheva & Meißner, 2012).

In the first step, recording opportunities were identified and participants recruited. This task proved especially challenging in the case of the oral exams. Participants had to give their written consent prior to the recording. The recording itself was conducted by research assistants, who also collected metadata describing the speech event and the speaker(s) while the recording took place.

In the second step, the recorded data was prepared for transcription (i.e., transferred to a server, cut and masked). Pseudonyms were assigned to each speaker. All personal data in the additional material collected, such as scripts, handouts or slides, were also masked.

The third step required that the metadata be entered into corpus management software. For this task, we used the Corpus Manager (Coma) of the EXMARaLDA software package (Schmidt & Wörner, 2009), which allowed for standardized archiving of the metadata in XML format (see section 2.1 below).

The fourth step, transcription, was the main part of the work. Recordings were transcribed according to the specifications of the minimal transcript of the GAT2 transcription conventions (Selting et al., 2009) using the EXMARaLDA Partitur Editor (Schmidt & Wörner, 2009). As the GAT2 conventions were mainly designed for German, adaptations had to be developed for the transcription of specific spoken language phenomena in the Polish and English data (cf. Lange, Slavcheva, Rogozińska, & Morton, in prep.). The transcription included a three-stage correction process carried out by the transcriber and two other transcribers.

In the fifth step, the recordings were checked for additional masking and the transcripts for segmentation errors. Afterwards, transcripts were linked to the speech event in the corpus manager.

Following this workflow, the GeWiss core corpus was constructed. It encompasses 126:05 hours of recordings.4 The transcripts add up to

4 In Table 2-1 the total size is given as 131:21 h, which is the sum of the recording times of the sub-corpora. The actual total recording time of 126:05 h results from subtracting the mixed student presentations of German native and non-native speakers, which are counted twice because they appear in both the German L1 and the German L2 sub-corpora.

1,273,529 tokens. There are 371 speech events in total (i.e., 58 academic papers, 89 student presentations and 224 oral exams). 238 speech events are held in German as the main language of communication, 83 are held in English, and 59 in Polish. With 462 main speakers in the recordings (i.e., presenters, examinees, examiners, lecturers conducting the seminars), the corpus offers great variability, which allows for the analysis of supraindividual features.

Academic context     | Germany   | Germany   | UK         | UK         | UK        | Poland    | Poland    | Total
Language             | German L1 | German L2 | English L1 | English L2 | German L2 | Polish L1 | German L2 |
---------------------+-----------+-----------+------------+------------+-----------+-----------+-----------+---------
Academic paper       | 10:03 h   | -         | 5:06 h     | 2:47 h     | 5:21 h    | 4:53 h    | 4:54 h    | 33:04 h
Student presentation | 8:01 h    | 8:20 h    | 2:25 h     | 2:36 h     | 5:02 h    | 4:55 h    | 4:58 h    | 36:17 h
Oral exam            | 12:07 h   | 9:01 h    | 3:32 h     | 6:57 h     | 10:22 h   | 9:59 h    | 10:02 h   | 62:00 h
Total                | 30:11 h   | 17:21 h   | 11:03 h    | 12:20 h    | 20:45 h   | 19:47 h   | 19:54 h   | 131:21 h

Table 2-1: The size of the GeWiss corpus in recording hours.

In the beta version of the GeWiss corpus released in March 2013, the full transcripts with audio files linked to them can be accessed (see Figure 2-1). In addition, comprehensive metadata describing the context of the speech event and the speakers are provided. A further access option allowing for the direct search of words and phrases via concordances will be released in autumn 2013. The web interface thus offers access for both qualitative and quantitative analyses. The beta version comprises the core corpus of GeWiss as described above.

Every step of the corpus construction was carefully documented. There is internal documentation of the criteria for the selection of recordings for the corpus among the collected speech events, the set of metadata chosen to describe the recorded speech events and speakers, and the conventions for entering these into the metadata management software Coma. The workflow from recording to corpus publication was documented as well to keep the processes reproducible and transparent. With regard to the long-term aim of becoming a reference corpus of spoken academic language, this documentation will be a valuable tool to help partners create new resources or prepare existing ones for integration into GeWiss. The documentation has already proved helpful

in the creation of two additional sub-corpora of the GeWiss corpus, which are now being integrated into the resource (see below).

The published GeWiss corpus itself is also comprehensively documented. Statistical documentation is available for each sub-corpus as well as for the entire resource (cf. Sieradz, 2013). Additionally, wordlists are retrievable for all offered sub-corpora. In addition to the statistics, a manual of the corpus interface and quick-start instructions can be found on the webpage (cf. Graefe, Lange, Sieradz, Meißner, & Slavcheva, 2013).5

Figure 2-1: The online interface of the GeWiss corpus. Access to full transcripts with aligned audio files that can be played back by clicking the black play buttons.

What are the further perspectives of GeWiss, now that the corpus has been released? The GeWiss project pursues two long-term goals. First, securing the accessibility of the resource that is expected to grow into a reference corpus of spoken academic language with academic German as its core. Second, further annotating the data of the resource in order to allow for more sophisticated search options regarding pragmatic aspects. There is already work going on in both directions: In April 2013 GeWiss was selected as a curation project of CLARIN-D6 and is funded by the Federal Ministry for Education and Research (BMBF) until March 2014. The central aim of the curation is to join the existing resources of GeWiss

5 The wordlists, the statistical description of the GeWiss corpus and its sub-corpora as well as the corpus manual are available online under www.gewiss.uni-leipzig.de > Corpus research > Documentation.
6 See http://clarin-d.de/en/discipline-specific-working-groups/88-fachwissenschaften/kurationsprojekte/256-kuration-f-ag-1-2-en.html.

and to prepare them for integration into the CLARIN infrastructure (cf. Meißner, Jettka, & Fandrych, 2013). Along with making the GeWiss resource CLARIN compatible (see section 2.1 below), there are two further resources that will be added to the already published core corpus: a corpus of about 10 hours of conference papers held in Italian, and a corpus of about five hours of student presentations held in German as L2 recorded in the Bulgarian academic context. These corpora were constructed by partners in Pisa and Sofia who applied the same principles of corpus design, recording, and transcription as were used in the construction of the core corpus. They also collected analogous sets of metadata for their corpora and used the EXMARaLDA Coma format to manage and archive the metadata as well as the corpus data.

These additional resources will strengthen the two dimensions of comparison laid out in the GeWiss corpus design. On the one hand, GeWiss as a comparable corpus of spoken academic language provides resources for comparative analyses of the use of academic German as L2 in different non-German academic contexts. The core corpus comprises German L2 data recorded in the British and the Polish context. The newly added corpus of German L2 student presentations recorded in the Bulgarian academic context provides a resource that represents the use of academic German in South-East Europe. On the other hand, GeWiss offers resources for contrastive analyses of different academic languages. Data of academic English and Polish are already contained in the core corpus. The newly integrated corpus of Italian conference papers will add a Romance language resource to the Germanic and Slavic ones. The integration of the two resources will be documented as well, so that documentation and workflows can be provided to the community as a blueprint for the integration of further resources in the future.
With the integration of these two additional corpus resources, an important step has been made towards the aim of creating a reference corpus of spoken academic language. Aside from strengthening the resources with respect to the two dimensions of comparison, the curation project will also be a step towards enhancing the search options by integrating a pragmatically annotated subcorpus. Based on analyses of conference papers taken from the GeWiss corpus, a typology of discourse commenting devices has been developed. Discourse comments are sequences where the speaker leaves the subject matter proper and comments on the structure of his presentation, on preceding or subsequent discourse phases (i.e., the initial presentation of the speaker by the moderator and the discussion at the end of the talk) or on the institutional context and shared discourse memories (i.e., other papers presented in the same or other sections of the conference) (cf. Fandrych, in prep.).7 The sub-corpus of German conference papers recorded in the German academic context was manually annotated using the typology of discourse comments. The thus enriched sub-corpus is also being integrated into the GeWiss resource during the curation phase. Users will then be given the option to query this sub-corpus directly for sequences of discourse commenting.

1.2 GeWiss in the corpus landscape

In the previous section, the core GeWiss corpus was briefly described. But how does the GeWiss corpus relate and compare to other corpora of spoken (academic) language? Freely available corpora of spoken German have only been present in the corpus landscape for a short time. GeWiss is one of the three largest multimodal corpus projects at present among the DGD2 (Database of Spoken German)8 along with the FOLK corpus (Forschungs- und Lehrkorpus) developed at the Institute of German Language (IDS Mannheim) and the corpora at the Hamburg Centre of Spoken Corpora (HZSK)9. Since GeWiss is a specialized corpus, it is reasonable to compare it to a reference corpus of spoken German. With FOLK, the Research and Teaching Corpus of Spoken German, a systematically stratified corpus of authentic spoken interaction is already available and continuously growing (cf. Deppermann & Hartung, 2012). Due to the different research rationales for the construction of both corpora, there are some considerable differences in the corpus design of GeWiss and FOLK. First, as a reference corpus of spoken German, FOLK comprises only L1 German data, while GeWiss as a comparable (German-English-Polish) and learner corpus enables both contrastive analyses of German, English and Polish spoken discourse and comparisons between L1 and L2 spoken data in German. With regard to the parameter of representativeness, FOLK aims to be systematically stratified and covers many different types and domains of dialogic spoken discourse. GeWiss, on the other hand, is a

7 The sequences of discourse commenting in spoken academic language differ from the text commenting devices that have been described for the written modality (cf. Fandrych & Graefen, 2002) with regard to the spectrum of types and the kinds of expressions used. As discourse commenting sequences especially show characteristics of orality, they are of special interest for the description of spoken academic language, which still often follows the ideal of the written form.
8 For further information on the DGD2 cf. http://dgd.ids-mannheim.de.
9 For further information on the HZSK cf. http://www.corpora.uni-hamburg.de/.


sample corpus of spoken academic language thus focusing only on a few genres of spoken academic discourse. The restriction regarding genre and discipline in the GeWiss corpus is, however, an advantage when it comes to research interests that go beyond the general lexical and morphosyntactic levels and that include pragmatic and textual dimensions. It is also a key feature for any cross-language comparison. In addition, GeWiss comprises both monologic and dialogic spoken data, while FOLK contains strictly dialogic speech. A comparison of the current corpus size of the German data in GeWiss (88 hours / 915,017 tokens) (cf. Sieradz, 2013) and FOLK (66 hours / 611,732 tokens)10 shows that GeWiss is somewhat larger.11 However, if we consider only the L1 German data, the GeWiss figure is about half that of FOLK (30 hours / 313,129 tokens). Even so, the two corpora remain of a broadly comparable size. In the corpus landscape of spoken academic German, GeWiss is unique, as it is the first and currently the only freely available corpus. With regard to English, there are three corpora of spoken academic English that the GeWiss corpus may be compared with: MICASE (Michigan Corpus of Academic Spoken English)12, BASE (British Academic Spoken English)13, and ELFA (English as a Lingua Franca in Academic Settings)14. The first two contain mainly L1 data while ELFA comprises only L2 recordings. GeWiss combines the different perspectives taken by MICASE and ELFA with regard to the usage contexts of a foreign language. While MICASE contains English L2 data produced only in a native (American) English context and ELFA contains English L2 data recorded at Finnish universities (i.e., in one specific non-English academic environment), GeWiss comprises both types of L2 data, thus allowing for a comparison of academic German used by L2 speakers of German both in a German context and in non-German academic contexts (cf. section 1.1 above).
10 For further statistical information on the corpus data in FOLK currently available through the DGD 2 see http://dgd.ids-mannheim.de.
11 Since both corpora (FOLK and GeWiss) are still growing, this comparison reflects only their currently reported corpus size. The further perspectives of GeWiss have been addressed in section 1.1. Further developments on FOLK will be available through the DGD 2.
12 For further information on MICASE cf. http://micase.elicorpora.info/aboutmicase.
13 For further information on BASE cf. http://www2.warwick.ac.uk/fac/soc/al/research/collect/base/.
14 For further information on ELFA cf. http://www.helsinki.fi/englanti/elfa/elfacorpus.html.


With regard to the range of disciplines and discourse genres covered, the GeWiss corpus is more focussed than MICASE and ELFA. The latter two both contain recordings of a wide range of spoken academic genres. In contrast, in GeWiss, the number of genres covered is confined to two (presentations and oral examinations) and the disciplinary range is very narrowly limited to philology. In this regard, GeWiss is comparable to BASE, which contains only lectures and seminars. A comparison of corpus sizes shows that GeWiss (126 hours / 1.27 million tokens) (cf. Sieradz, 2013) is somewhat smaller than MICASE (200 hours / 1.8 million tokens) and BASE (1.6 million tokens), yet, measured in tokens, a little bigger than ELFA (131 hours / 1 million tokens). However, if one compares the amount of data per genre, the differences are minimal. MICASE, for example, contains a total of 143,369 tokens originating from student presentations covering different disciplines, while GeWiss contains 31:07 hours / 330,625 tokens of this genre from a single discipline (but in three different languages), or 21:11 hours / 225,751 tokens in German (but including both native and non-native speakers of German). In short, the first version of GeWiss is comparable with other publicly available corpora of both spoken German and spoken English academic discourse. With its specific design, however, GeWiss offers a valuable database for comparative investigations of various kinds in the areas mentioned above. Taken together, the focus on comparable disciplines, topics, genres, and data size allows for various methodologically sound contrastive analyses, including an investigation of variations of academic German in different contexts and settings and possible factors determining such variations.
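
The size comparisons reported in this section become easier to follow when token counts are related to recording time. The following minimal sketch (in Python, chosen here purely for illustration and not part of the GeWiss tooling) computes tokens per recorded hour from the figures cited above. The dictionary labels and function names are introduced here, the million-token figures are only as precise as the sources report them, and BASE is omitted because no duration is given for it.

```python
# Corpus sizes as cited in the text: (hours, tokens).
# The labels are chosen here for illustration only.
CORPUS_SIZES = {
    "GeWiss (all languages)": (126, 1_270_000),
    "GeWiss (German)": (88, 915_017),
    "GeWiss (L1 German)": (30, 313_129),
    "FOLK": (66, 611_732),
    "MICASE": (200, 1_800_000),
    "ELFA": (131, 1_000_000),
}


def tokens_per_hour(hours, tokens):
    """Return the average number of tokens per recorded hour."""
    return tokens / hours


def size_ratio(corpus_a, corpus_b, sizes=CORPUS_SIZES):
    """Return the token-count ratio of corpus_a to corpus_b."""
    return sizes[corpus_a][1] / sizes[corpus_b][1]


if __name__ == "__main__":
    for name, (hours, tokens) in CORPUS_SIZES.items():
        print(f"{name}: {tokens_per_hour(hours, tokens):,.0f} tokens/hour")
    ratio = size_ratio("GeWiss (L1 German)", "FOLK")
    print(f"GeWiss L1 German / FOLK (tokens): {ratio:.2f}")
```

On these figures, the L1 German part of GeWiss amounts to roughly half of FOLK in tokens, while the number of tokens per recorded hour is of a similar order for the German data in both corpora.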

2. GeWiss as a case in point: issues of general relevance for spoken corpora

Apart from the immediate questions concerning the process of corpus construction and publication, there are further-reaching issues that a spoken-corpus project is confronted with. In the second part of this paper we want to address three of these issues: the question of sustainability and long-term preservation (2.1), the legal aspects that have to be considered (2.2), and the possibilities of further editing spoken corpus data (2.3).


2.1 Long-term preservation and sustainability

Sustainability has been an issue for the GeWiss project since the phase of corpus construction, where decisions had to be made about the software and the formats to use, how to transcribe the recordings, and how to organize and archive the data. Besides these decisions connected to the construction process, the question of how to secure long-term availability through hosting and maintenance of the resource had to be dealt with. We will address aspects relevant to both phases (i.e., during the corpus construction and afterwards). In planning the construction of the corpus, several decisions had to be made that involved questions of sustainability. With respect to the recordings, suitable data formats and recording parameters had to be chosen. In GeWiss, audio recordings were made in the uncompressed WAV format with a bit depth of 16 bit and a sampling rate of at least 44.1 kHz. Whenever we had the permission to also videotape a speech event, we used the MPEG format with high definition resolution (1920 x 1080).15 Established conventions should be used for the transcription of the data. We opted for the minimal transcript of GAT 2 (conversation analytic transcription system, cf. Selting et al., 2009), which aims to depict the actual uttered speech without forcing the transcriber to segment it into theoretically predefined categories such as utterances. This offers a way of transcribing that keeps interpretational decisions to a minimum. For the transcription, we used the EXMARaLDA Partitur Editor. This allows one to represent each speaker in a separate tier, which displays turn taking or simultaneous speaking clearly in the transcript. To manage metadata and to organize the corpus as a whole, we used the EXMARaLDA Corpus-Manager (Coma). This software allows for the integrated archiving of all data belonging to a speech event (i.e., recordings, transcripts, additional material such as slides or handouts) by hyperlinking them to that event.
By using Coma, the metadata of the corpus are archived in XML format. The EXMARaLDA software was not chosen for practical reasons alone. A key argument was that EXMARaLDA implements the current standards of the community (XML and UTF-8 as basics). It is part of the European CLARIN initiative and thereby likely to be a format that will be further developed and supported.
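
The audio recording parameters named above (uncompressed WAV, 16 bit, at least 44.1 kHz) lend themselves to an automatic check before files are archived. The sketch below is one possible way of doing this with the `wave` module from the Python standard library, which reads PCM WAV files; it is an illustration under these assumptions and not the procedure used in the GeWiss project. The function names and default thresholds are introduced here, and the test files are generated in memory rather than taken from the corpus.

```python
import io
import wave


def meets_recording_spec(wav_file, min_rate_hz=44_100, bits=16):
    """Check bit depth and sampling rate of a PCM WAV file.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as w:
        bit_depth = w.getsampwidth() * 8  # sample width is reported in bytes
        return bit_depth == bits and w.getframerate() >= min_rate_hz


def make_test_wav(rate_hz, bits=16, seconds=0.1):
    """Create a short silent mono WAV in memory (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(bits // 8)
        w.setframerate(rate_hz)
        w.writeframes(b"\x00" * (bits // 8) * int(rate_hz * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(meets_recording_spec(make_test_wav(48_000)))  # rate above the minimum
    print(meets_recording_spec(make_test_wav(22_050)))  # rate below the minimum
```

A check of this kind only inspects the file header; it cannot tell whether a file was converted from a lossy format before being saved as WAV.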

15 The parameters chosen are in line with the latest recommendations of the CLARIN initiative (cf. CLARIN recommendations for field recordings, van Uytvanck, 2012, p. 64).


A main concern of the project after the construction phase lies in the long-term availability of the resource. This can only be achieved through integration into larger infrastructures. At present the GeWiss corpus is hosted at the data processing centre of the University of Leipzig (URZL), where server space for the corpus and basic maintenance are only available for the next 5 years. Clearly, a solution for long-term hosting and maintenance is needed beyond this period. A possible option that is currently being pursued is the integration of the corpus into a larger infrastructure like that offered by the European CLARIN (Common Language Resources and Technology Infrastructure) initiative16. The goal of CLARIN is to make existing language resources accessible in the long term in standardized formats and infrastructures. At present, the integration processes this requires are being carried out in exemplary fashion by so-called “curation projects”. In March 2013, GeWiss successfully applied for a curation project in the German CLARIN-D project (see section 1.1 above). Within the curation phase, the GeWiss resource will be made CLARIN compatible, which means that formats are adapted to the standardized CLARIN versions. As GeWiss had already employed community standards by using the EXMARaLDA software, the CLARIN adaptations merely comprise the conversion of the metadata archived in the EXMARaLDA Corpus Manager into the CMDI format (Component Metadata Infrastructure) and the registration of PIDs (persistent identifiers) for the resource and its parts. By using the OAI-PMH provider (Open Archives Initiative Protocol for Metadata Harvesting), the metadata of the corpus can then be accessed and indexed even from outside the repository. Thus, by the end of the curation phase the GeWiss resource will be accessible via the VLO (Virtual Language Observatory) (cf. Wittenburg & van Uytvanck, 2012). The registration of PIDs for the sub-corpora and their components will allow for long-term identification and citability.
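
To illustrate what metadata harvesting via OAI-PMH involves, the following sketch builds a `ListIdentifiers` request and extracts record identifiers from a response. The verb, the `metadataPrefix` parameter and the XML namespace are defined by the OAI-PMH protocol; everything else is an assumption made for illustration: the endpoint URL is a placeholder, the prefix `cmdi` is assumed rather than taken from the GeWiss repository, and the sample response and its identifiers are invented. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListIdentifiers request URL."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_identifiers(response_xml):
    """Return the record identifiers contained in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Invented sample response; identifiers are placeholders, not real records.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:repository.example.org:demo-corpus-1</identifier>
      <datestamp>2013-09-22</datestamp>
    </header>
    <header>
      <identifier>oai:repository.example.org:demo-corpus-2</identifier>
      <datestamp>2013-09-22</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""

if __name__ == "__main__":
    # Placeholder endpoint; a real repository publishes its own base URL.
    print(list_identifiers_url("https://repository.example.org/oai", "cmdi"))
    print(extract_identifiers(SAMPLE_RESPONSE))
```

A harvester such as the one behind a metadata catalogue would issue requests of this kind periodically and follow up with requests for the full metadata records.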

2.2 Legal issues

A further obstacle for the sustainability of spoken language corpora is of a legal nature. Recordings of the spoken word are subject to various legal restrictions, which may differ considerably from country to country. For example, personal data privacy is very strongly protected in Germany and there is no fair use doctrine. In contrast, the U.S. allows for greater use of copyrighted work. Copyright and personal data protection directly affect the construction and the dissemination of a spoken language

16 See http://www.clarin.eu.


resource and therefore need to be precisely considered at the very beginning of the corpus project in order to ensure the future use of the recorded data and to avoid problems with legal issues. For that, written consent forms from the speakers, adhering to the legal regulations in the respective countries in which the data were collected and archived, are strongly recommended. However, due to major legal differences in national laws, this could be a considerable challenge for multinational projects like GeWiss. This is also a concern when integrating the corpus resources into international research infrastructures like CLARIN. There is great need for a unification of the EU legislation on copyright and data protection. In fact, the European Commission is already planning developments in the EU legal framework, which could positively affect the digital humanities too. In 2012, the EU Commission released a proposal for a General Data Protection Regulation that will supersede the existing Data Protection Directive 95/46/EC and thus unify data protection within the EU with a single law.17 As for the reform of the EU Copyright Directive 2001/29, with the EU initiative “Licences for Europe”18, a structured stakeholder dialogue in that field has been initiated. Since the discussion is still ongoing, legal decisions in future corpus projects should be made in conjunction with their institutional legal adviser (see also Ketzan & Schuster, 2012, p. 45). The publication of the collected additional materials in the GeWiss corpus, in particular the presentation slides, caused unexpected additional legal issues. Since various slides contained images, diagrams, etc., we needed to pay attention to copyright as well as to the personal data of the individuals in the pictures. In most cases, the speakers did not make specifications on the copyright of the images on their slides. 
Given that we did not want to ask them to assure the legal use ex post, the only possibility was to mask (i.e., delete) all legally risky images. However, this made many slides difficult to follow. In addition, for the free-of-charge use of the collected additional materials for a corpus publication, written user agreements are needed. Again, the possibility of a fair use of such content, for purposes of teaching, scholarship and

17 In contrast to EU directives, which must be transposed into national law, EU regulations become immediately enforceable as law in all member states simultaneously. For further information on the draft of the General Data Protection Regulation, cf. http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf.
18 For further information, cf. http://ec.europa.eu/licences-for-europe-dialogue/en/content/about-site.


research, would facilitate the publication of such materials in corpora and hence increase the research value of the collected spoken data.

2.3 Further editing of spoken language corpora

Finally, issues of further annotation of transcribed spoken data in terms of tagging, lemmatization and parsing are of great importance for an improved usability of spoken language corpora and need to be addressed in computational linguistics. Let us consider the case of the GeWiss corpus once again. In its beta version, the corpus consists of audio recordings of authentic spoken interactions as well as of their corresponding transcriptions produced according to the conventions of the minimal transcription level of GAT 2, a system of conventions developed more than 10 years ago by German linguists with a mainly conversation-analytic background for the specific purposes of the analysis of spoken language data (cf. Selting et al., 2009). In this paradigm, the actual wording of an utterance has to be notated as precisely as possible and transcripts of spoken interactions “should permit the notation of those phenomena which previous research has shown to be relevant for the interpretation and analysis of verbal interaction” (cf. Couper-Kuhlen & Barth-Weingarten, 2011, p. 5). For that reason, even though the transcripts are produced orthographically using the standard pronunciation as a reference norm, strong deviations from the standard pronunciation, such as dialectisms and regionalisms as well as idiosyncratic forms, are transcribed using a ‘literary transcription’ (e.g., mitnanner as an idiosyncratic form of miteinander). In addition, GAT 2 also includes conventions for indicating typical phenomena of spoken language, such as overlaps, hesitation markers, pausing, inbreaths and outbreaths, etc. The transcription system thus allows for inductive data-driven analysis, particularly in the framework of Conversation Analysis and Interactional Linguistics, yet poses considerable challenges for automatic processing and reliable querying of the transcribed data.
The lack of further corpus-linguistic annotation in terms of tagging, lemmatization and parsing, however, considerably reduces the usability of such valuable, labour- and cost-intensive corpus resources for other research communities such as quantitative corpus linguistics, where linguistic analyses are based on frequency patterns. However, this is not an issue specific to the GeWiss corpus. At present, it applies to all corpora of spoken German, which have all been collected for the purposes of either Functional Pragmatics or Conversation Analysis and Interactional Linguistics.


Since all existing automatic annotation tools are trained only on written data, the only possibility for further annotation of transcribed spoken data corpora at present is an additional step of orthographic normalization of the data. There are, in fact, some projects applying this method at present. In the FOLK project at the Institute for German Language (IDS), for example, the OrthoNormal annotation tool for the orthographic normalization of cGAT transcripts was developed (cf. Schmidt, 2012, pp. 239-240), and in the Nordic Dialect Corpus project, a semi-automatic transliterator, trained automatically for each dialect type, was used in order to output orthographic transcriptions (cf. Johannessen, Vangsnes, Priestley, & Hagen, 2012). However, the development of specific tools for the automatic processing of (transcribed) spoken data would considerably improve the usability of spoken language corpora and thus still needs to be addressed by computational linguistics. In the GeWiss project, we are currently working on this issue. One of the main objectives of the follow-up GeWiss-ESF project “Spoken academic language in the digital age”19 is to explore and develop corpus-linguistic tools for the specific purposes of the analysis of spoken language data in order to improve the usability of the GeWiss corpus as well as other spoken language corpora.
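
The idea behind orthographic normalization can be shown with a deliberately simple lookup approach: each transcribed form is kept as it is and paired with a standard-orthography form that taggers trained on written data can process. The sketch below is an illustration of this principle only and does not reproduce how OrthoNormal or the Nordic Dialect Corpus transliterator work. The entry for mitnanner comes from the example given in section 2.3; the entry for ham and all names in the code are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A transcribed form paired with its normalized form."""
    transcribed: str
    normalized: str


# Toy lexicon: "mitnanner" is the example from the text,
# "ham" (for "haben") is an additional illustrative assumption.
LEXICON = {
    "mitnanner": "miteinander",
    "ham": "haben",
}


def normalize(tokens, lexicon=LEXICON):
    """Pair every transcribed token with a normalized form.

    Tokens that are not in the lexicon are passed through unchanged,
    so the original transcription is never lost.
    """
    return [Token(t, lexicon.get(t.lower(), t)) for t in tokens]


if __name__ == "__main__":
    for token in normalize(["wir", "ham", "mitnanner", "gesprochen"]):
        print(f"{token.transcribed}\t{token.normalized}")
```

Keeping both layers side by side means that qualitative analyses can still rely on the original transcription, while quantitative queries and automatic annotation operate on the normalized layer. A realistic tool would additionally need to handle ambiguous forms, which a plain lookup table cannot resolve.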

3. Conclusion

In this chapter, we introduced the construction of the GeWiss corpus and discussed key issues concerning the long-term availability, accessibility and improved usability of the resource. Building and maintaining corpora of authentic spoken language data turned out to be a complex, labour- and cost-intensive task, which may best be accomplished within large infrastructures like CLARIN.

19 For further information, cf. https://gewiss.uni-leipzig.de/index.php?id=114.

References

Couper-Kuhlen, E., & Barth-Weingarten, D. (Translation and Adaption). (2011). A system of transcribing talk-in-interaction: GAT 2. Gesprächsforschung - Online-Zeitschrift zur verbalen Interaktion, 12, 1-51. Retrieved September 22, 2013, from: http://www.gespraechsforschung-ozs.de/fileadmin/dateien/heft2011/px-gat2-englisch.pdf.
Deppermann, A., & Hartung, M. (2011). Was gehört in ein nationales Gesprächskorpus? Kriterien, Probleme und Prioritäten der Stratifikation des Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK) am Institut für Deutsche Sprache (Mannheim). In E. Felder, M. Müller, & F. Vogel (Eds.), Korpuspragmatik. Thematische Korpora als Basis diskurslinguistischer Analysen von Texten und Gesprächen (pp. 414-450). Berlin, New York: de Gruyter.
Fandrych, C. (In preparation). Metakommentierungen in Wissenschaftlichen Vorträgen. In C. Fandrych, C. Meißner, & A. Slavcheva (Eds.), Gesprochene Wissenschaftssprache: Korpusmethodische Fragen und empirische Analysen. Heidelberg: Synchronverlag.
Fandrych, C., & Graefen, G. (2002). Text commenting devices in German and English academic articles. Multilingua, 21, 17-43.
Fandrych, C., Meißner, C., & Slavcheva, A. (2012). The GeWiss Corpus: Comparing Spoken Academic German, English and Polish. In T. Schmidt, & K. Wörner (Eds.), Multilingual corpora and multilingual corpus analysis (pp. 319-337). Amsterdam: Benjamins.
Graefe, K., Lange, D., Sieradz, M., Meißner, C., & Slavcheva, A. (2013). GeWiss - Handbuch zum Korpus. Retrieved July 26, 2013, from: https://gewiss.uni-leipzig.de/open.php?url=Handbuch.pdf.
Johannessen, J. B., Vangsnes, Ø. A., Priestley, J., & Hagen, K. (2012). A linguistics-based speech corpus. In M. Haugh, T. Schmidt, & K. Wörner (Eds.), Proceedings of the LREC-Workshop “Best Practices for Speech Corpora in Linguistic Research” 2012 (pp. 1-5). Retrieved September 22, 2013, from: http://www.corpora.uni-hamburg.de/lrec2012/Proceedings_Complete.pdf.
Ketzan, E., & Schuster, I. (2012). Access to resources and tools – technical and legal issues. In CLARIN-D User Guide (pp. 43-45). Retrieved July 26, 2013, from: http://media.dwds.de/clarin/userguide/text/multimodal_corpora.xhtml.
Lange, D., Slavcheva, A., Rogozińska, M., & Morton, R. (In preparation). GAT2 als Transkriptionskonvention für multilinguale Sprachdaten? Zur Adaption des Notationssystems im Rahmen des Projekts GeWiss. In C. Fandrych, C. Meißner, & A. Slavcheva (Eds.), Gesprochene Wissenschaftssprache: Korpusmethodische Fragen und empirische Analysen. Heidelberg: Synchronverlag.
Meißner, C., Jettka, D., & Fandrych, C. (2013). CLARIN-KP-GeWiss: Das zweite Kurationsprojekt der F-AG 1 Deutsche Philologie. CLARIN-D-Newsletter, 4, 3-8. Retrieved July 26, 2013, from: http://de.clarin.eu/images/newsletter/CLARIN-D-Newsletter-20134.pdf.
Schmidt, T. (2012). EXMARaLDA and the FOLK tools. In Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC’12), Istanbul, Turkey: European Language Resources Association (ELRA) (pp. 236-240). Retrieved September 22, 2013, from: http://www.lrec-conf.org/proceedings/lrec2012/pdf/529_Paper.pdf.
Schmidt, T., & Wörner, K. (2009). EXMARaLDA – Creating, analyzing and sharing spoken language corpora for pragmatic research. Pragmatics, 19, 565-582.
Selting, M., Auer, P., Barth-Weingarten, D., Bergmann, J., Bergmann, P., Birkner, K., … Uhmann, S. (2009). Gesprächsanalytisches Transkriptionssystem 2 (GAT 2). Gesprächsforschung - Online-Zeitschrift zur verbalen Interaktion, 10, 353-402. Retrieved July 26, 2013, from: http://www.gespraechsforschung-ozs.de/heft2009/px-gat2.pdf.
Sieradz, M. (2013). Statistische Angaben zu den GeWiss-Korpora. Retrieved July 26, 2013, from: https://gewiss.uni-leipzig.de/open.php?url=Statistische_Angaben.pdf.
Slavcheva, A., & Meißner, C. (2012). GeWiss - a Comparable Corpus of Academic German, English and Polish. In M. Haugh, T. Schmidt, & K. Wörner (Eds.), Proceedings of the LREC-Workshop “Best Practices for Speech Corpora in Linguistic Research” 2012 (pp. 7–11). Retrieved September 22, 2013, from: http://www.corpora.uni-hamburg.de/lrec2012/Proceedings_Complete.pdf.
Uytvanck, D. van. (2012). Types of Resources: Multi-modal Corpora. In CLARIN-D User Guide (pp. 61-65). Retrieved July 26, 2013, from: http://media.dwds.de/clarin/userguide/text/multimodal_corpora.xhtml.
Wittenburg, P., & van Uytvanck, D. (2012). Metadata. In CLARIN-D User Guide (pp. 13-29). Retrieved July 26, 2013, from: http://media.dwds.de/clarin/userguide/text/multimodal_corpora.xhtml.

CHAPTER THREE

CONSTRUCTING GENERAL AND DIALECTAL SPOKEN CORPORA FOR LANGUAGE VARIATION RESEARCH: TWO CASE STUDIES FROM TURKISH

ŞÜKRİYE RUHİ AND E. EDA IŞIK-TAŞ

1. Introduction

Clancy makes a useful distinction between “Variety” and “variety” in the construction of corpora (2010, pp. 80-81). He proposes “Variety” to mean geographical varieties of language (e.g., Irish English, British English, etc.), while “variety” refers to discourse genres defined by situational use (e.g., academic discourse, workplace language, etc.). This chapter concerns achieving comparability between spoken corpora of the Variety kind (Varietygeo, hereafter) that are also designed to represent variety in discourse genres (varietydg, hereafter) and ways of enhancing metadata so as to support comparative research on corpora.1 In order to portray language variation in a more fine-grained manner at the discoursal level, we argue that spoken corpora should consist of the complete recording of communicative events, in other words, that they should be discourse corpora (see also Santamaría-García, 2011), and that metadata should include socio-cultural activity types, discourse and local topics, and speech acts so as to achieve a more in-depth representation of interaction content for pragmatic variation research. The chapter develops its argument in the

1 The currently preferred term for “variety” in the sense of “genre” appears to be “register” in corpus linguistics (see, e.g., Biber & Conrad, 2009, and McEnery & Hardie, 2012), although Biber (1988) initially used the term “genre” for situational variation (as cited in Biber, 1995, p. 9). For this chapter, we follow the notation in Clancy (2010) since the discussion touches upon variation research on a number of levels: the dialectal, the sociolinguistic, and the pragmatic.


context of two corpora on contemporary dialects of Turkish, namely, the Spoken Turkish Corpus (STC - Turkish spoken in Turkey; see Ruhi, Eryılmaz, & Acar, 2012) and the Spoken Turkish Cypriot Dialect Corpus (STCDC - Turkish in Northern Cyprus). It first presents an overview of certain aspects of Varietygeo corpora features that impact variation research, and discusses the importance of developing matched genre sampling frames that can capture communication as situated interaction. Section 3 presents how and to what extent comparability is achieved in STC and STCDC. The same section also illustrates the addition of topic and speech act annotation to the metadata. This section is followed up by a brief discussion of some of the capabilities provided by EXMARaLDA (Schmidt & Wörner, 2009) for variation research. As is the case in STC and STCDC, transcription standardisation is a crucial issue for comparative research. In section 5 we illustrate the method followed in the two corpora especially in regard to orthographic representation of speech. The chapter ends with a summary of the highlights concerning the representation of spoken discourse and variation, with a focus on corpus tool desiderata and transcription standardisation so as to ensure gains in comparative research even on corpora designed for different purposes.

2. Designing Varietygeo Corpora

Spoken Varietygeo corpora have been compiled in different ways with respect to the nature of interaction types. Those with a particular interest in dialectal variation have mostly relied on elicited speech. The Freiburg English Dialect Corpus (FRED), for example, is built mainly from interviews on life histories conducted in oral history projects (Anderwald & Wagner, 2007). Others, such as the International Corpus of English (ICE; Greenbaum & Nelson, 1996), incorporate Varietiesgeo of English and samples of unprompted language in public and private settings. Corpora based on elicited speech have the advantage of allowing the researcher to guide the interaction so as to document inter-speaker variation. However, they would lose out on the capacity to explore intra-speaker variation in naturally occurring communications, unless corpora are designed to include elicited speech. Compiling corpora from elicited speech arguably has the advantage of controlling for specific lexico-grammatical features, but in line with Čermák (2009) we maintain that spoken corpora, and indeed discourse corpora, should predominantly comprise naturally occurring speech (Ruhi, Işık-Güler, Hatipoğlu, Eröz-Tuğa, & Çokal Karadaş, 2010). Thus, for spoken corpora, multi-party, spontaneous and high interaction speech are essential (Čermák, 2009, p. 118) if we are to represent the nature of discourse in real life. In this regard, elicited speech largely limits the kind of sociolinguistic and pragmatic variation analysis that can be conducted, such as accommodation processes and situation-bound form-function mappings. Inclusion of a range of genres in Varietygeo corpora is a standard implementation procedure, but a problematic feature at the discourse level concerns the common practice of selecting equal lengths of texts. This means that texts may be cut at appropriate points in the discourse or combined to form “composite[s]” from shorter interactions (e.g., ICE; Greenbaum & Nelson, 1996, p. 5). Such sampling makes it impossible to conduct cross-varietal research on genres and conversational organisation (Andersen, 2010, p. 556).2 Other corpora, however, have aimed to include complete interactions wherever feasible (e.g., the British National Corpus, BNC). Genre classification in spoken corpora is yet another problematic area (Lee, 2001) that has a bearing on variation analysis. This is partly due to the fact that spoken interaction is often fluid in terms of interaction goals. Corpus compilers have tackled classification in slightly different ways and levels of granularity. While BNC lumps casual conversation into a single category, the Cambridge and Nottingham Corpus of Discourse in English (CANCODE) differentiates such communicative settings along two axes: the relational and goal dimensions (McCarthy, 1998, p. 5). One problem with this scheme is the rigid divide it proposes for goal classification and in some cases for speaker relations.
Conversations in real life may be either task or idea oriented at different specific times (Gu, 2010) or include differing social relations within the same situated discourse as is the case when, for example, two participants in an interaction may be chatting as friends in some parts and as colleagues in other parts of the same conversation (Ruhi, forthcoming). One way of resolving the multidimensionality and fluidity of discourse in corpus files would be to include social activity types in metadata. The discussion above has raised a number of issues in spoken corpus compilation, focusing on certain design features. In the following, we highlight the employment of domain and social activity criteria for genre classification, which is supplemented with topic and speech act annotation in metadata.

2 On similar grounds, Wolfram (1997) criticizes studies in dialectology in the twentieth century that used “dialectally diagnostic lexical and phonological items collected through the direct elicitation of single instances of forms” (p. 108) as their primary data.


3. Corpus Design, Pragmatic Metadata, and Sampling Procedures in STC and STCDC

The EAGLES (Expert Advisory Group on Language Engineering Standards) Guidelines (1996) defines a comparable corpus as “one which selects similar texts in more than one language or variety” and notes that while there are questions concerning what similarity is, ICE (Greenbaum & Nelson, 1996) can be considered as an example (see http://www.ilc.cnr.it/EAGLES96/corpustyp/node21.html).3 Kenning (2010) describes comparability on the basis of similarity in corpus design and cites “the Brown Corpus of Standard American English, the Lancaster–Oslo Bergen corpus, and the Kolhapur Corpus (Indian English)” (p. 488) as examples of large comparable corpora. Besides similarity in corpus design, she further notes the significance of choices made in “mark-up and annotation” (Kenning, 2010, p. 490) in comparable corpora, and remarks that, while there is a paucity of parallel and comparable corpora, comparable corpora are especially useful for a wide range of research questions (lexical, discourse, pragmatic, cultural, etc.).

In the foregoing paragraph we have mentioned some examples of corpora that certain scholars consider comparable because there is disagreement on the matter and we wish to clarify our position. McEnery and Xiao (2007), for example, are in agreement with the experts referred to above on the importance of similarity in corpus design and sampling criteria. They also add the importance of time contiguity in corpus construction. Yet they argue that ICE cannot be an example of comparable corpora as what is being compared there are Varietiesgeo of the same language. In their understanding “all corpora, as a source for linguistic research, have ‘always been pre-eminently suited for comparative studies’ (Aarts, 1998), either intra-lingual or inter-lingual” (p. 133).
We do not consider this to constitute a strong argument against monolingual comparable corpora as linguistic and sociolinguistic features of dialects of a language constitute similar and distinct (socio)linguistic practices that are worthy of being considered as distinct linguistic, social and embodied entities. We therefore agree with scholars who treat monolingual Varietygeo corpora as comparable corpora. With these remarks we turn below to an overview of the design features of STC and STCDC. The construction of STC and STCDC as web-based corpora started at different but close times at the campuses of Middle East Technical

3 See http://www.ilc.cnr.it/EAGLES/browse.html.


University in Ankara and Güzelyurt (in 2008 and 2010, respectively; http://std.metu.edu.tr/en/; http://corpus.ncc.metu.edu.tr/?lang=en). STC is a Varietygeo corpus that includes demographic variation and different domains and genres. STCDC is also a Varietygeo corpus that includes different genres, but it has a regional dialect slant. Despite the dialectal focus of STCDC, the two corpora may be used for variation analysis especially because their sampling frames are similar and because they employ the same speaker metadata and domain and genre sampling frames, and are being constructed within the same time frames. Both corpus projects recruit recorders on a voluntary basis, and selections are based on stratified sampling methods. Even though STCDC is currently much smaller in size and narrower in varietydg, the two corpora will eventually achieve comparability even if at different scales (250,000 words in STCDC, and 1 million in STC). Thus STC and STCDC can be considered as weakly comparable corpora in terms of size (see the GeWiss corpus, Chapter Two, for a comparable corpus in the strict sense). The revised demo version of STC was published in 2010, and a “baby” version of 400,000 words consisting of recordings made between 2009 and 2012 is expected to be released toward the end of 2015. The one million target will be attained through ongoing academic research projects in the next four years. The demo version of STCDC is expected to be available on the project website in 2015 and the 250,000-word size is expected to be reached in five years. With respect to the discursive dimensions of interaction, since the two corpora aim at compiling complete conversations, analysis of pragmatic variation will be possible. Speaker metadata in STC and STCDC include standard demographic categories such as place of birth and residence, age, education, language profiles, and length of residence in different geographical locations (see Ruhi, Eröz-Tuğa, et al.
(2010) for a fuller description of the metadata features in STC). STC and STCDC employ two parameters in domain and genre classification: speaker relations and social activity type. The major domains are family, friend, family-friend, educational, service encounter, workplace, media discourse, legal, political, public, religious, research, brief encounter, and unclassified. Each domain comprises relevant genres. For instance, the workplace domain includes business interviews, appointments, shoptalk, and cultural events. Finer granularity in metadata is achieved during the corpus assignment and the various steps in the transcription stage in STC through annotation of speech acts (e.g., greeting, requesting, thanking, etc.), on the one hand, and annotation in Turkish for conversational topics (e.g., child care), speech events (e.g.,


troubles talk), and ongoing social activities (e.g., cooking), on the other hand. The twenty-one speech act labels in STC metadata build upon the generic classification system in Searle (1976) and the specific speech act labels under each class have been selected on a number of different but intertwined and graded criteria:4

- the capacity to facilitate speech act, speech event, and discourse segmentation annotation in future research on the corpus (prototypical examples: leave-taking, requests, advising)
- the degree of formulaicity (prototypical examples: thanking, requests)
- cultural specificity (prototypical examples: asking about wellbeing, well-wishes)
- the potential to be identified consistently

Some additions and modifications in labelling were made. For example, requests cover imperatives because that label was found to be more semantically transparent for non-expert transcribers. “Asking about wellbeing” and “well-wishes” were added because they are particularly cognitively and discursively salient in their relational functions in Turkish conversations in several domains and medium types. In contrast, some speech acts which have been studied extensively in pragmatics have been lumped under a class category. Compliments and compliment responses, for instance, are not noted in the metadata, but files containing either speech act are included under “other expressives”. This is because compliments show greater lexico-grammatical variety, on the one hand, but co-occur and overlap with well-wishes, on the other hand, in Turkish. Indicating speech acts in the metadata of a corpus is not a usual practice in corpus design. In STC this was adopted as a practice on theoretical and practical grounds. First, STC specifically aims to support Varietygeo and varietydg pragmatics research without constraining or misguiding researcher perspectives on the corpus data. Second, speech act annotation in the actual communication files would lead to further lag in the release of the corpus. Thus only the directive speech act class has been annotated in a select number of service encounter files for the purpose of presenting a pilot case that can be further experimented on in future work (see Ruhi, Schmidt, Wörner, & Eryılmaz, 2011), and will be published with the second version of STC.

4 Space limits do not allow for a full discussion of the speech act metadata system in this chapter, but see Ruhi, Işık-Güler, Hatipoğlu, Eröz-Tuğa, and Çokal Karadaş (2010) for the list of speech act labels.
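
The labelling decisions described above can be illustrated with a minimal Python sketch. It assumes a hypothetical mapping from what a transcriber observes to the label recorded in the file metadata; the names `METADATA_LABEL` and `metadata_labels` are illustrative and are not part of STC or its tooling. Only the groupings mentioned in the text (imperatives under requests; compliments and compliment responses under “other expressives”) are taken from the chapter.

```python
# Hypothetical mapping from an observed speech act to the label stored in
# file metadata. Only the groupings named in the chapter are included.
METADATA_LABEL = {
    "imperative": "requests",
    "compliment": "other expressives",
    "compliment response": "other expressives",
}


def metadata_labels(observed):
    """Return the sorted, de-duplicated metadata labels for one file.

    Observations without a special grouping keep their own name.
    """
    labels = {METADATA_LABEL.get(act, act) for act in observed}
    return sorted(labels)


print(metadata_labels(["imperative", "thanking", "compliment", "compliment response"]))
```

Because the labels are kept as a set per file, the sketch records only whether a speech act class occurs in a file, not how often or where, which matches the list-type character of the metadata described here.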


While speech act and topic annotation in the metadata may seem tedious and not a totally reliable way of tapping into content of speech, the cyclic workflow in corpus construction, which is summarized below, is geared toward ensuring consistency in speech act and topic annotation, too:

1. Noting of local and global topics, and the communication related activities by recorders in the communication file (e.g., studying for an exam and having dinner)
2. Checking of topics and additions during assignment of the recording to the corpus management system
3. Stages in transcription:
   a. Initial step: basic transcription of recording for verbal and non-verbal events; editing of topics and addition of speech act metadata
   b. First check: Checking the transcription for verbal and non-verbal events; editing of topics and speech act metadata
   c. Second check: Checking the transcription for verbal and non-verbal events; editing of topics and speech act metadata
(From Ruhi et al., 2011)

In the case of the speech act metadata, only the list of speech act labels and a few sample expressions for each were provided for the transcribers in the initial step. Vague as this technique may be, it does not induce transcribers into “reading” too much into the transcription file. In cases of incompatibility between the initial and the first check, the final decision is taken during the second check, which is carried out by the first author of this chapter.

Cross- and intralingual corpus linguistic research on spoken corpora has until recently concentrated on the smaller units of language such as lexico-grammatical patterns and discourse markers/particles (see Jucker, Schneider, Taavitsainen, & Breustedt (2008) and Santamaría-García (2011) for exceptions). However, pragmatics research, broadly construed here to include variation analysis, goes beyond small linguistic units to examine stretches of discourse, which are arguably difficult to visualize through concordances.
As noted by Virtanen (2009), such investigation requires reading through corpora. In this respect, along with standard corpus linguistic search tools, the marking of speech acts and topics in metadata is expected to support researchers in the identification of clustering of linguistic phenomena related to their particular research questions. To illustrate this with the partial metadata for file 061_090622_00020 in STC in Table 3-1, it is expected that the presence


of “communication problems in the family” as a topic may include argumentative sequences in the interaction and be characterized by involved styles (Caffi & Janney, 1994) with frequent changes in rate of speech and tone of voice. Indeed in a recent study that analyzed this file for (im)politeness, it was found to display frequent use of tamam “okay, alright”, which is a multi-functional pragmatic marker that can function both as a silencer and a global/local topic boundary marker in Turkish (Ruhi, 2013).

No. of speakers: 3
Date recorded: 2009-06-22T18:00:00
Domain: Conversations among family and/or relatives
Duration: 1640 secs.
Physical space: Home
Project name: ODT-STD
Relations: ZEY000073 is mother of ISA000058. ISA000058 is elder brother of CAG000125. ZEY000073 is mother of CAG000125
Speech acts: Leaves taking, Thanking, Well wishes / congratulations, Refusals (as a response to a request), Requests, Advising, Criticizing, Offering
Topics: troubles talk, life after university, playing football, communication problems in the family, planning the evening, dialectal differences in Arabic, communication and sharing with dormitory friends, the weather, looking after the baby, cooking, studying mathematics

Table 3-1: Partial metadata for a file in STC

Furthermore, assuming that particular settings likely go hand in hand with particular activities, it is possible to predict that certain speech events and speech acts will emerge in the interaction. For instance, a service encounter is a prototypical context for directives. However, “service encounter” is an umbrella genre for social practices that take place in institutional settings with varying activities that stem from differing social orders (see Schatzki (2001) on practice as performance and social orders). Shopping in a supermarket may be different from doing one’s daily shopping at the local grocery, where in the Turkish context, for instance, one is expected to engage in some amount of small talk (e.g., asking about the participant’s well-being). Thus second level annotation in metadata so as to include speech acts and topics is essentially a proposal for combining more generic metadata with more specific pragmatic annotation, and is in line with pragmatics research that emphasizes the notion of activity types


in discourse (notably, Levinson, 1979; Culpeper, Crawshaw, & Harrison, 2008; Gu, 2010). Thus marking topics and speech acts has the potential to provide clues to researchers on differences in, for example, the discourse moves, lexico-grammatical patterns, and politeness phenomena occurring in the interaction. Combining these two parameters in metadata is intended to render corpora more visible for pragmatics research. The gains from such metadata, naturally, need to be tested in future research. In the current state of STC, speech acts are annotated as a separate category from topic annotation, but topics as a super-category includes both local conversational topics and discoursal topics, which, as evident in Table 3-1, may concern the meso-level of discourse in that they are more discourse activities than topics in the strict sense. Conflating local and global topics in a single category may not be the ideal route. However, given that speech events, activities and discoursal topic categorizations are fuzzy notions,5 we submit that separating them from local topics would not add value to the annotation (speech act and topic annotation will be implemented in the STCDC Project at a later stage.). This metadata annotation design is an improvement for pragmatics research on spoken corpora, where scholars have underscored the difficulty of retrieval of speech act realizations through standard searches that more often than not rely on the extant literature on speech acts in various languages (see e.g., Jucker et al. (2008) for an investigation of compliments in BNC). Identifying speech act realizations during the corpus construction stage can enhance corpus-driven approaches to pragmatics by providing starting points for researchers on where to look for the relevant realizations in the “jungle” of corpus data. 
As has been noted by scholars who critically assess the possibility of doing corpus-oriented discourse analysis and pragmatics research (e.g., Virtanen, 2009), data retrieval and analysis can be overwhelming. We propose that there is a need to achieve a modicum of practice that garners the best of qualitative and quantitative methodological practices, and that one way to do this is to annotate a minimum of pragmatic and semantic information in corpora (see also Adolphs et al. (2011) for an implementation of activity type annotation). Below, we expand on the motivation for implementing a minimum level of speech act and topic annotation, taking up four issues regarding general corpora design, researching language in use, and the case of languages that have not been the object of pragmatics research to the extent that we observe in languages such as English and Spanish.

5 See Adolphs, Knight, and Carter (2011) for related remarks on classifying activities in corpus.
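
How list-type metadata can give researchers starting points in the “jungle” of corpus data may be sketched as follows. This is a minimal illustration in Python, not the STC corpus management system: the `FileMetadata` record and the `find_files` function are hypothetical, the first record reuses a subset of the values in Table 3-1, and the second record is invented purely for contrast.

```python
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    """Hypothetical record holding a subset of the fields in Table 3-1."""
    file_id: str
    domain: str
    speech_acts: set = field(default_factory=set)
    topics: set = field(default_factory=set)


def find_files(records, speech_act=None, topic=None):
    """Return ids of files whose metadata list the given speech act and/or topic."""
    hits = []
    for rec in records:
        if speech_act is not None and speech_act not in rec.speech_acts:
            continue
        if topic is not None and topic not in rec.topics:
            continue
        hits.append(rec.file_id)
    return hits


records = [
    FileMetadata(
        file_id="061_090622_00020",
        domain="Conversations among family and/or relatives",
        speech_acts={"Thanking", "Requests", "Advising", "Criticizing"},
        topics={"communication problems in the family", "cooking"},
    ),
    # Invented record, for contrast only.
    FileMetadata(
        file_id="example_service_encounter",
        domain="Service encounter",
        speech_acts={"Requests", "Thanking"},
        topics={"shopping"},
    ),
]

print(find_files(records, speech_act="Criticizing"))
print(find_files(records, speech_act="Requests", topic="shopping"))
```

A query of this kind only narrows down which files to read; locating and interpreting the actual realizations still requires qualitative analysis of the transcription and the recording.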


First, to our knowledge, the field of spoken corpus construction is yet to see the publication of large-scale general corpora that are annotated for speech acts with the analytic detail that is displayed in pragmatics research (i.e., full annotation of head-acts, supportive moves, and so on). As noted in Santamaría-García (2011), too, there appears to be little sharing of pragmatically (partially) annotated corpora, with the result that corpus-oriented research is limited in terms of sharing knowledge and experience in the field. Naturally, it is questionable whether general corpora construction should aim for full annotation of speech acts or other discursive phenomena that go beyond the utterance level, given that these are time-consuming, expensive enterprises, which also carry the risk of weakening annotation consensus (see Leech, 1993). Nonetheless, there is growing interest in employing general or specialized corpora for pragmatics research. We would argue that while listing speech act occurrences certainly does not replace full-scale annotation, rudimentary, list-type annotations open the way to enhancing qualitative methodologies that aim to combine them with quantitative approaches. The second issue concerns corpus compilation research. Topic and speech act annotation aids monitoring for variation in genre, tenor and the affective tone of interaction during corpus construction (Ruhi, Işık-Güler, et al., 2010). Contextual features on their own are too broad as notions to achieve representativeness in regard to the content level of language, especially in casual conversational contexts, which exhibit great diversity in conversational topics. While such annotation naturally increases corpus compilation effort, it is useful for studies in pragmatics and dialectology, where it is observed that stylistic variation is controlled by discoursal and sociopragmatic variables (see Macaulay, 2002).
The third issue is related to the fact that pragmatic phenomena (e.g., negotiation of communicative goals and relational management, and inferencing) are not solely observable as surface linguistic phenomena. Despite its limitations, speech act metadata is a viable starting point for exploring these dimensions of language variation. Finally, for the less frequently studied languages in traditional discourse and pragmatics research, there is often little if anything by way of research findings that corpus-oriented research can rely on to retrieve tokens of pragmatic phenomena at the linguistic expression level. Turkish is a typical example of such a language, where, for instance, only a handful of studies exist on directives (Ruhi, 2011). In this regard, speech act and topic metadata can aid in implementing the necessarily corpus-driven methodologies for pragmatics research in such languages. Taking stock of the issues raised concerning the inclusion of what are arguably qualitative metadata


features, we maintain that the parameters described above respond to the need for corpora that are pragmatically annotated. In the following sections we turn to corpus construction tools for transcription standardisation in the two corpora.

4. Corpus Construction Tools in STC and STCDC

STC and STCDC are transcribed orthographically with the EXMARaLDA software suite (Schmidt & Wörner, 2009), which facilitates research on variation owing to the nature of its tools. Partitur-Editor, the transcription tool, enables multiple-level annotation and files are time-aligned with audio and video files. In its simplest form, each speaker in a file is assigned two tiers: one for utterances (v-tier) and one for their annotation (c-tier) (see the bolded and grey words in Figures 3-1a and 3-1b for the annotation of dialectal and standard pronunciation). Each file also has a no-speaker tier (nn), where background events may be described (e.g., ongoing activities that impinge on conversation). The multi-modal nature of the corpora is especially supportive of variation analysis, as researchers may themselves consult the original data against possible errors in transcription, or play and watch segments as many times as necessary.

Figure 3-1a: Transcription of the /k/-/g/ variation in STC

Figure 3-1b: Transcription of the /k/-/g/ variation in STCDC
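
The tier arrangement described in this section can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the EXMARaLDA file format or API: the `Event` class and the `annotations_for` function are hypothetical, all example data are invented, and events are assumed to carry start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One time-aligned stretch on a tier (times in seconds)."""
    start: float
    end: float
    text: str


def annotations_for(v_event, c_tier):
    """Return c-tier events whose time span overlaps the given v-tier event."""
    return [c for c in c_tier if c.start < v_event.end and c.end > v_event.start]


# Invented example data: one speaker with a v-tier and a c-tier, plus the
# no-speaker tier (nn) for background events.
v_tier = [Event(0.0, 1.2, "first utterance"), Event(1.2, 2.5, "second utterance")]
c_tier = [Event(1.2, 2.5, "annotation of second utterance")]
nn_tier = [Event(0.0, 2.5, "cooking")]

for v in v_tier:
    notes = [c.text for c in annotations_for(v, c_tier)]
    print(v.text, "->", notes)
```

Keeping annotations on a separate, time-aligned tier means that an utterance and its annotation can be retrieved together or independently, which is the property that the comparison of dialectal and standard forms relies on.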

5. Transcription Standardisation in STC and STCDC

In this section, we first briefly describe commonalities and differences between STC and STCDC transcription conventions, and then illustrate how borrowings and code-shifts are transcribed and annotated in the corpora.


Transcription of language Varietiesgeo has been handled in two ways in current corpora and in disciplines such as conversation analysis and discourse analysis: either by closely following the written standard except for minor changes, or by employing varying degrees of eye dialect, which capture the “special quality and subjective phonological “feel” of speech” (Anderwald & Szmrecsanyi, 2009, p. 1136). As is current practice in general corpora, STC follows the first approach, while STCDC implements a reasonable degree of eye dialect. In STC standard orthography is employed in the v-tier, except for a limited list of words consistently pronounced in different ways from so-called careful pronunciation. For instance, the word hanımefendi “lady” is usually pronounced with the deletion of ıme. Thus, the word is written either as hanımefendi or hanfendi in the v-tier depending on speaker pronunciation (for documentation of spelling variation in STC, see Ruhi, Hatipoğlu, Eröz-Tuğa, & Işık-Güler, 2010). Prominently marked dialectal and stylistic variants of words and morphological realizations are transcribed in the c-tier (see Figure 3-1a). As stylistic variation can interleave with dialect switches, representing both the actual pronunciation and its standard form provides crucial information for investigating pragmatic variation. In STCDC, standardized dialectal orthography is employed in the v-tier and the standard written orthography of prominent words is written in the c-tier. STC and STCDC transcriptions are thus like mirror images of each other, owing to the fact that STC is designed as a general corpus while STCDC is relatively mono-dialectal and aims to display the dialectal properties of Cypriot Turkish (compare Figures 3-1a and 3-1b). STC and STCDC share the same transcription system adapted from HIAT (Ruhi, Hatipoğlu, et al., 2010).
A major gain of this procedure will be the possibility to compare dialectal and standard forms in a unitary fashion across Turkey and Northern Cyprus. STCDC will also inform standardisation in the compilation of special, dialectal computerized corpora for Anatolian dialects, as dialects in Cypriot Turkish share an extensive number of variation patterns despite variation especially in certain inflectional morphemes (Saraçoğlu, 2004). Figures 3-2a and 3-2b illustrate the transcription of the /k/-/g/ and /t/-/d/ variation in STC and STCDC, respectively.


Figure 3-2a: Transcription of /k/-/g/ variation in STC

Figure 3-2b: Transcription of /t/-/d/ variation in STCDC6

Commonality in standardisation in STC and STCDC is not restricted solely to the transcription of the word. The two corpora also employ similar conventions for annotating discursively significant paralinguistic features and non-lexical contributions to the conversation (see e.g., the transcription of laughter in Figure 3-2b, and the voice quality annotation in Figure 3-5 in Section 5.2). This methodology is expected to assist pragmatics research. In the next two sections we dwell on the transcription of borrowings and code switching in STC and borrowings in STCDC.

5.1 Transcription of loanwords and code switching in STC

Recent lexical borrowings and some amount of code-switching are present in the conversations in STC files, depending on geographical location, type of domain and genre, and the language profile of the speakers. Borrowings that have become part of the lexicon of conversational Turkish are written with their most commonly observed orthography (e.g., mail or e-mail for e-mail, and çekup for check-up). In cases where there is a (lexical) code-switch, the switch is annotated for its Turkish translation in the c-tier, along with the indication of the language. Figure 3-3 illustrates a case of a code-switch to English.

6 abiler, lit. “elder.brothers-PLU OKA” is used as a symbolic kinship term of respect/deference and intimacy by the speaker in the v-tier.

FAT070 [v]   ((0.5)) that's it
FAT070 [c]   eng: bu kadar

Figure 3-3: A code-switch to English in STC

For some lexical borrowings either the original or the Turkish pronunciation exists in the files. In such cases the word is annotated in the c-tier for type of pronunciation. There are also tokens of common and proper nouns that are pronounced with either Turkish or the original language pronunciation. In such cases the Turkish spelling of the word is written in the v-tier and annotated in the c-tier. Figure 3-4 exemplifies the pronunciation of Washington with the English pronunciation, while the Turkish pronunciation of the word has the phoneme /v/ as its onset in the first syllable.

ANI086 [v]   ((0.5)) Vaşington
ANI086 [c]   eng pro: Washington

Figure 3-4: Marking source language pronunciation of a word in STC
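
The two c-tier annotations shown in Figures 3-3 and 3-4 suggest how such annotations might be retrieved automatically. The Python sketch below is only an illustration: the pattern is an assumption generalised from these two figures (“eng: bu kadar” and “eng pro: Washington”), not the documented STC convention, and `parse_c_tier` is a hypothetical helper.

```python
import re

# Assumed shape, generalised from the two figures only:
#   "<language>: <Turkish translation>"        e.g. "eng: bu kadar"
#   "<language> pro: <source-language form>"   e.g. "eng pro: Washington"
PATTERN = re.compile(r"^(?P<lang>\w+)(?P<pro>\s+pro)?:\s*(?P<value>.+)$")


def parse_c_tier(text):
    """Split a c-tier annotation into language, kind, and value.

    Returns None when the text does not match the assumed shape.
    """
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    kind = "pronunciation" if match.group("pro") else "code-switch"
    return {"language": match.group("lang"), "kind": kind, "value": match.group("value")}


print(parse_c_tier("eng: bu kadar"))
print(parse_c_tier("eng pro: Washington"))
```

Distinguishing the two kinds of annotation in this way would allow code-switches and source-language pronunciations to be counted separately, provided the actual corpus conventions are as regular as the two examples.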

5.2 Transcription of borrowings in STCDC

Cyprus is a multi-lingual and a multi-cultural island, on which Greek Cypriots and Turkish Cypriots have co-existed as the two main speech communities for centuries. The Turkish Cypriot dialect is confined to Cyprus, which is an island geographically isolated from Turkey. However, Turkish spoken on the island has always been in intensive contact with Turkish spoken in Anatolia as well as Greek and English:

Though migrations from Turkey have led to contacts with external varieties, Cypriot Turkish was left without strong influences from Turkey over longer periods, thus preserving old characteristics and developing innovative features. As a result of immigration to North Cyprus since 1974, intensive linguistic contact with both standard Turkish and Anatolian dialects have been established. (Demir & Johanson, 2006, p. 1)

Turkish Cypriot dialect spoken by Turkish Cypriots is accepted to be an Anatolian dialect (Saraçoğlu, 2004; Gökçeoğlu, 2006), with origins dating back to the Ottoman presence in the 16th century. After the rule of the Ottoman Empire on the island between 1571 and 1878 (İmer & Çelebi, 2006), Cyprus remained as a British colony until 1960 (Gazioğlu, 1997). During


this period, English significantly affected languages spoken on the island. The language contacts (Thomason & Kaufmann, 1988) across different Anatolian dialects spoken on the island as well as Greek, English and dialects of Turkish spoken in Turkey today have shaped Turkish Cypriot dialect in a unique way to characterize its lexicon, syntax as well as phonology. Lexical, grammatical and phonological borrowing from Greek and English, commonly observed in the Turkish Cypriot dialect, sets it apart from other Anatolian dialects and standard Turkish spoken in Turkey (Demir & Johanson, 2006). For instance, as revealed in Pehlivan and Osam (2010), many vehicle related words came into the Turkish Cypriot dialect from English. Most terminology related to car parts such as giyarboks, “gearbox”, glaj, “clutch” and vayper, “wiper” (Pehlivan & Osam, 2010) are examples of such borrowings. Mida, “a kind of goat”, mannos, “dumb”, on the other hand, constitute examples of a large number of words borrowed from Greek. In Turkish Cypriot dialect, many of these borrowings are “part of the vocabulary and have become functional” (Osam, 1997) and are included in dictionaries of Turkish Cypriot dialect (e.g., Hakeri, 2003; Kabataş, 2007). The orthographic transcription conventions in STCDC aim to reflect such features in order to facilitate the investigation of dialectal variation. In the transcription of STCDC, borrowed words which have become part of the Turkish Cypriot dialect are also written in the v-tier. However, equivalents of these words in the source language are provided in the c-tier to allow cross-dialectal research across Turkey and North Cyprus. Figure 3-5 illustrates the transcription of the borrowed word tirolli, “trolley” in STCDC. The orthography of the word in the source language, which is English in this case, is provided in the c-tier.
Figure 3-6 illustrates the transcription of the word genneri, “they”. The equivalent of this word in standard Turkish is provided in the c-tier.

Figure 3-5: Transcription of tirolli

Figure 3-6: Transcription of genneri


In addition to words, certain expressions in the Turkish Cypriot Dialect are also transcribed along with their equivalents in standard Turkish. As illustrated in Figure 3-7, the equivalent of the expression benzine basmag, “to step on the gas pedal” in standard Turkish is provided in the c-tier.

Figure 3-7: Transcription of benzine basmag

6. Concluding Remarks

This paper has discussed how comparability was achieved in a general corpus and a dialectal corpus in the context of STC and STCDC. Comparable corpora utilizing a standardized system for sampling, recording, transcription and annotation potentially enable cross-linguistic analyses for researchers. We have highlighted the importance of moving toward the construction of discourse corpora for language variation research by including recordings that represent the complete interaction and by enriching metadata designs that can enhance corpus-based/corpus-driven cross-varietal research in pragmatics and related fields. The chapter has also dwelled on transcription conventions that can assist in the investigation of lexico-grammatical differences. In this regard, it has been shown that the multi-tier and time-aligned annotation afforded by EXMARaLDA is especially suited for variation research. At a more specific level, we hope that the corpus design and its implementation in STC and STCDC will move forward the field of corpus construction for Turkish and Turkic languages.

Acknowledgments

STC was supported by TÜBİTAK 108K283 between 2008 and 2011 and by METU BAP-05-03-2011-001 between March 2011 and March 2013. STCDC is being supported by METU NCC BAP-SOSY-10. We thank the reviewers for their comments, which have been instrumental in clarifying our arguments. Needless to say, neither is responsible for the present form of the paper.


References

Aarts, J. (1998). Introduction. In S. Johansson & S. Oksefjell (Eds.), Corpora and cross-linguistic research (pp. ix-xiv). Amsterdam: Rodopi.
Adolphs, S., Knight, D., & Carter, R. (2011). Capturing context for heterogeneous corpus analysis: Some first steps. International Journal of Corpus Linguistics, 16(3), 305-324.
Andersen, G. (2010). How to use corpus linguistics in sociolinguistics. In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 547-562). London/New York: Routledge.
Anderwald, L., & Szmrecsanyi, B. (2009). Corpus linguistics and dialectology. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook, vol. II (pp. 1126-1140). Berlin/New York: Walter de Gruyter.
Anderwald, L., & Wagner, S. (2007). FRED – The Freiburg English Dialect Corpus: Applying corpus-linguistic research tools to the analysis of dialect data. In J. C. Beal, K. P. Corrigan & H. L. Moisl (Eds.), Creating and digitizing language corpora, volume 1: Synchronic database (pp. 35-53). London: Palgrave Macmillan.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
Biber, D., & Conrad, S. (2009). Register, genre, and style. Cambridge: Cambridge University Press.
Caffi, C., & Janney, R. W. (1994). Towards a pragmatics of emotive communication. Journal of Pragmatics, 22, 325-373.
Čermák, F. (2009). Spoken corpora design: their constitutive parameters. International Journal of Corpus Linguistics, 14(1), 113-123.
Clancy, B. (2010). Building a corpus to represent a variety of a language. In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 547-562). London/New York: Routledge.
Culpeper, J., Crawshaw, R., & Harrison, J. (2008). “Activity types” and “discourse types”: Mediating “advice” in interactions between foreign language assistants and their supervisors in schools in France and England. Multilingua, 27, 297-324.
Demir, N., & Johanson, L. (2006). Dialect contact in North Cyprus. International Journal of the Sociology of Language, 181, 1-11.
Gazioğlu, A. C. (1997). Kıbrıs Tarihi, İngiliz Dönemi (1878-1960). Lefkoşa: Kıbrıs Araştırma ve Yayın Merkezi.

Gökçeoğlu, M. (2006). Kıbrıs Türk ağızları sözlüğü. İstanbul: Türkiye İş Bankası Kültür Yayınları.
Greenbaum, S., & Nelson, G. (1996). The International Corpus of English (ICE) Project. World Englishes, 15(1), 3-15.
Gu, Y. (2010). The activity type as interface between langue and parole, and between individual and society - An argument for trichotomy in pragmatics. Pragmatics and Society, 1, 74-101.
Hakeri, B. H. (2003). Hakeri’nin Kıbrıs Türkçesi sözlüğü. Magosa: Samtay Vakfı Yayınları.
İmer, K., & Çelebi, N. (2006). The intonation of Turkish Cypriot dialect: a contrastive and sociolinguistic interpretation. International Journal of the Sociology of Language, 181, 69-82.
Jucker, A., Schneider, G., Taavitsainen, I., & Breustedt, B. (2008). “Fishing” for compliments. Precision and recall in corpus-linguistic compliment research. In A. Jucker & I. Taavitsainen (Eds.), Speech acts in the history of English (pp. 273-294). Amsterdam/Philadelphia: Benjamins.
Kabataş, O. (2007). Kıbrıs Türkçesinin etimolojik sözlüğü. Lefkoşa: Kıbrıs Türk Yazarlar Birliği.
Kenning, M.-M. (2010). What are parallel and comparable corpora and how can we use them? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 487-500). London/New York: Routledge.
Lee, D. Y. W. (2001). Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, 5(3), 37-72.
Leech, G. (1993). Corpus annotation schemes. Literary and Linguistic Computing, 8(4), 275-281.
Levinson, S. (1979). Activity types and language. Linguistics, 17, 365-399.
Macaulay, R. (2002). You know, it depends. Journal of Pragmatics, 34, 749-767.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge: Cambridge University Press.
McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.
McEnery, T., & Xiao, Z. (2007). Parallel and comparable corpora: The state of play. In Y. Kawaguchi, T. Takagaki, N. Tomimori & Y. Tsuruga (Eds.), Corpus-based perspectives in linguistics (pp. 131-145). Amsterdam/Philadelphia: John Benjamins.
Osam, N. (1997). Dil kirlenmesine sayısal bir yaklaşım. In Dil devriminden bu yana Türkçenin görünümü (pp. 60-69). Ankara: Dil Derneği.
Pehlivan, A., & Osam, N. (2010). Vehicle-related expressions in Turkish Cypriot dialect. Bilig, 54, 231-240.
Ruhi, Ş. (2011). “LOOK” as an expression of procedural meaning in directives in Turkish: Evidence from the Spoken Turkish Corpus. Paper presented at the Sixth International Symposium on Politeness, 11-13 July, 2011, Middle East Technical University, Ankara.
Ruhi, Ş. (2013). The interactional functions of tamam in spoken Turkish. Dil ve Edebiyat Dergisi / Journal of Linguistics and Literature, 10(2), 9-32. Retrieved from http://yeniweb.mersin.edu.tr/edergi/index.php/ded/article/view/462.
Ruhi, Ş. (forthcoming). Corpus linguistic approaches to (im)politeness: Corpus metadata features and annotation parameters in spoken corpora. In D. Z. Kádár, E. Németh T. & K. Bibok (Eds.), Politeness: Interfaces. London: Equinox.
Ruhi, Ş., Eröz-Tuğa, B., Hatipoğlu, Ç., Işık-Güler, H., Acar, M. G. C., Eryılmaz, K., … Çokal Karadaş, D. (2010). Sustaining a corpus for spoken Turkish discourse: Accessibility and corpus management issues. In Proceedings of the LREC 2010 Workshop on Language Resources: From Storyboard to Sustainability and LR Lifecycle Management. Paris: ELRA, 44-47. Retrieved from http://www.lrec-conf.org/proceedings/lrec2010/workshops/W20.pdf#page=52.
Ruhi, Ş., Eryılmaz, K., & Acar, M. G. C. (2012). A platform for creating multimodal and multilingual spoken corpora for Turkic languages: Insights from the Spoken Turkish Corpus. Paper presented at the First Workshop on Language Resources and Technologies for Turkic Languages, LREC 2012, May 21 2012. Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/workshops/02.Turkic%20Languages%20Proceedings.pdf.
Ruhi, Ş., Hatipoğlu, Ç., Eröz-Tuğa, B., & Işık-Güler, H. (2010). A guideline for transcribing conversations for the construction of spoken Turkish corpora using EXMARaLDA and HIAT. ODTÜ-STD: Setmer Basımevi.
Ruhi, Ş., Işık-Güler, H., Hatipoğlu, Ç., Eröz-Tuğa, B., & Çokal Karadaş, D. (2010). Achieving representativeness through the parameters of spoken language and discursive features: the case of the Spoken Turkish Corpus. In I. Moskowich-Spiegel Fandiño, B. Crespo García, & I. Lareo Martín (Eds.), Language Windowing through Corpora. Visualización del lenguaje a través de corpus. Part II (pp. 789-799). A Coruña: Universidade da Coruña.
Ruhi, Ş., Schmidt, T., Wörner, K., & Eryılmaz, K. (2011). Annotating for precision and recall in speech act variation: The case of directives in the Spoken Turkish Corpus. Multilingual Resources and Multilingual Applications. Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011. Working Papers in Multilingualism Folge B, 96, 203-206.
Santamaría-García, C. (2011). Bricolage assembling: CL, CA and DA to explore agreement. International Journal of Corpus Linguistics, 16(3), 345-370.
Saraçoğlu, E. (2004). Kıbrıs ağzı: Sesbilgisi özellikleri, metin derlemeleri, sözlük. Lefkoşa: Ateş Matbaacılık.
Schatzki, T. (2001). Practice mind-ed orders. In T. R. Schatzki, K. Knorr Cetina & E. von Savigny (Eds.), The practice turn in contemporary theory (pp. 50-63). London/New York: Routledge.
Schmidt, T., & Wörner, K. (2009). EXMARaLDA – creating, analysing and sharing spoken language corpora for pragmatic research. Pragmatics, 19(4), 565-582.
Searle, J. (1976). A classification of illocutionary acts. Language in Society, 5, 1-23.
Thomason, S. G., & Kaufman, T. (1988). Language contact, creolization, and genetic linguistics. Berkeley CA: University of California Press.
Virtanen, T. (2009). Corpora and discourse analysis. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook, vol. II (pp. 613-642). Berlin/New York: Walter de Gruyter.
Wolfram, W. (1997). Dialect in society. In F. Coulmas (Ed.), The handbook of sociolinguistics (pp. 92-107). Malden: Blackwell Publishing.

CHAPTER FOUR

THE CORPUS OF SPOKEN GREEK (CSG)

THEODOSSIA-SOULA PAVLIDOU, CHARIKLEIA KAPELLIDI AND ELENI KARAFOTI

1. The Corpus of Spoken Greek (CSG)

While the significance of speech for the study of language has been undisputed for more than a century, and while the necessity for corpora has long been recognized, the international praxis of spoken corpora is only slowly catching up with that of written corpora. This is no coincidence: only in the last decade have spoken data become broadly available to the public via the internet (e.g., through YouTube), as a kind of counterpart to printed newspapers, books, etc. (i.e., the “traditional” ready-made resources for the compilation of corpora). Up to that point, collecting spoken data was more cost- and time-intensive, since it requires technical equipment (audio or video recorders) that is superfluous in the case of written texts.

A further difficulty in collecting spoken data concerns their “naturalness” (i.e., the attempt to overcome the observer’s paradox). Specific types of discourse, such as everyday conversations or telephone calls, are much more sensitive to the kind of recording (audio vs. video) employed than (mediated) public discourse genres (e.g., broadcast lectures, talk shows, news bulletins, etc.). This is why spoken corpora commonly comprise very little, if any, data drawn from everyday conversations or telephone calls. Moreover, access to such data is much more difficult, since participants must consent both to the recording of their interaction and to the use of the recorded material. It is therefore no surprise that spoken corpora based on naturally occurring interactions are relatively rare and small in comparison to written corpora.

In this paper, we discuss some of the problems involved in the development of naturalistic corpora by presenting the Corpus of Spoken Greek (CSG) of the Institute of Modern Greek Studies (Manolis Triandaphyllidis Foundation) at the Aristotle University of Thessaloniki. More specifically, we describe the features of this corpus, which are closely related to its aims but also to the conditions under which it was compiled.

2. Corpora of Modern Greek

This section presents a brief overview of the publicly available Modern Greek corpora, focusing mainly on those comprising spoken data. The Greek praxis on corpora reflects what has been going on internationally, as it primarily engages in the compilation of written texts. Unsurprisingly, it was written corpora that were first released for the Greek language. More specifically, around the turn of the century two large written collections became available to the public, namely the Hellenic National Corpus, developed by the Institute for Language and Speech Processing (ILSP),1 and the Corpus of Modern Greek Texts at the Portal for the Greek Language (Centre for the Greek Language).

1 For this reason the particular corpus is also known as the ILSP Corpus.

The Hellenic National Corpus contains a variety of Greek texts, the majority of which were published after 1990, and consists of 47,000,000 words. According to the corpus site (http://hnc.ilsp.gr/en/default.asp), the texts have been selected so as “to present a realistic picture” of the use of Modern Greek, meaning standard Modern Greek, since texts written in “highly idiomatic language” have been excluded from the corpus. However, the focus of this corpus on journalistic texts (61.29% of the whole) challenges its representativeness. In addition, the specific composition of the corpus (i.e., samples of exclusively written language) undermines its goal “to present a realistic picture” of the use of Modern Greek, as this picture is clearly incomplete without spoken data. It is presumably for these reasons that Goutsos (2010a, p. 30) observes that the overall design of this corpus “has been rather opportunistic–dictated by the needs of developing computational tools rather than representing the state of the language”.

The Corpus of Modern Greek Texts at the Portal for the Greek Language, on the other hand, consists of three different collections, the first two comprising texts from two daily newspapers (5,000,000 word tokens), while the third contains texts from school textbooks (2,000,000 word tokens). The collections available at the Portal for the Greek Language do not aspire to be representative, since they involve only two text types; yet they provide a valuable resource for the study of the journalistic and educational genres.

Some 10 years after the development of the corpora mentioned above, the Corpus of Greek Texts was launched. This corpus, a joint project of the Universities of Athens and Cyprus, was designed as a general or reference corpus aspiring to fulfill the criteria of representativeness with respect to “size, authenticity and proportionality” (Goutsos, 2010a, p. 33). It consists of about 27,000,000 words, a number sufficient for major applications according to Goutsos (2010a), and does not comprise texts produced under experimental conditions. As for proportionality, the Corpus of Greek Texts includes both written and spoken material, drawn from Standard Modern and Cypriot Greek, so as to provide a more comprehensive view of the Greek language (Goutsos, 2010a). Apart from mode (written-spoken) and text type, the texts of this corpus are further classified, allowing for a “varied composition of sub-corpora”, according to research needs and priorities (Goutsos, 2010b, p. 4).

Although the Corpus of Greek Texts aspires to offer a more comprehensive description of Greek, on the assumption that a national corpus should cover written and spoken language proportionally (Xiao, 2008), the contribution of spoken data to the overall composition of the corpus is small (10% of the total, namely 2,931,280 words) and comes mainly from public speeches (1,839,766 words). Ordinary conversation (i.e., naturally occurring talk under everyday conditions) makes up only a minimal part of the corpus (0.76%, i.e. 207,548 words), a fact confirming the inherent difficulties in collecting spontaneous talk, especially talk that is not produced in public. Thus, although this particular corpus takes a first step towards redressing the imbalance between spoken and written text types, it does not, after all, achieve their equal representation.

The corpora outlined above are not the only ones that comprise data from the Greek language.
There are other collections as well, either in private hands (and possibly accessible upon request) or publicly available but not searchable online. There are also more specialized corpora such as the Greek Speaking Children Corpus,2 which comprises 61 interviews with children (Gavriilidou & Kambaki-Vougioukli, 2011). This corpus is designed for the phonological analysis of speech and is available upon request. The Stephany Corpus in the CHILDES database,3 on the other hand, consists of 28 audio-recordings of four children during their daily activities and is publicly available but not searchable online.

2 The Greek Speaking Children Corpus is not accessible online, but is “freely offered for research” (Gavriilidou & Kambaki-Vougioukli, 2011, p. 565).
3 http://childes.psy.cmu.edu/browser/index.php?url=Other/Greek/Stephany/.

Within the above landscape of Greek corpora, the CSG appears to hold a distinct position, since, apart from specialized corpora, it is the only corpus composed exclusively of spoken language data, drawn moreover to a great extent from ordinary conversation. What also differentiates the CSG from the other Greek corpora is that, apart from its online version (see below), it is also available to scholars interested in qualitative research. As a matter of fact, the CSG has been primarily intended for close qualitative rather than quantitative analyses. More specifically, it has been developed as part of the research project Greek talk-in-interaction and Conversation Analysis,4 under the direction of Th.-S. Pavlidou at the Institute of Modern Greek Studies (Manolis Triandaphyllidis Foundation). The project also aims at the study of the Greek language from the perspective of Conversation Analysis, an ethnomethodologically informed approach to linguistic interaction (for the aims, principles, methodology, etc., of conversation analysis, cf. e.g., Lerner, 2004; Schegloff, 2007). It is the conversation analytic orientation of the project that prioritizes qualitative analysis and lends the CSG its particular characteristics.

4 http://ins.web.auth.gr/index.php?option=com_content&view=article&id=592&Itemid=242&lang=en.

The CSG is unique not only with respect to its aims but also with respect to its size, given the types of discourse it comprises. It consists of more than 1,750,000 words of exclusively naturalistic data, which have been transcribed and, for the most part, re-examined by different people with the aim of a better representation of the talk-in-interaction. Moreover, the CSG has been designed as a dynamic corpus in that it can be enriched with new recordings and transcriptions. The CSG did not arise out of nowhere, nor was it designed in one go. In fact, it utilized earlier data collections (and transcriptions) compiled by Pavlidou since the 1980s in various research and/or student projects, mostly without any funding. In particular, an earlier corpus of approximately 250,000 words, the GR-Speech (cf. Pavlidou, 2002), has been incorporated into the CSG. This was the basis on which the CSG has been further developed; at present it comprises different discourse types of both ordinary and institutional talk (for a more detailed presentation see Table 4-1).

3. Features of CSG

As already mentioned, the conversation analytic perspective has greatly informed the compilation of the CSG. Particular emphasis has been given to talk from everyday conversations in face-to-face interactions or over the telephone. In addition to informal conversations among friends and relatives, the CSG also includes more institutional discourse types, such as recordings of teacher-student interaction in high school classes, television news bulletins, interviews and panel discussions (cf. Table 4-1 for the size of data from each discourse type). The CSG thus differs from the other Greek corpora not only in its aims and conception but also in its make-up, since its greatest part (ca. 44%, i.e. more than 780,000 words) comes from informal (face-to-face or telephone) conversations. The CSG, as previously mentioned, is available to scholars for research purposes.5 Part of the corpus, however, can also be accessed online at http://corpus-ins.lit.auth.gr/corpus/index.html (more on this below). The following sections provide further information on the collection of data, transcription, files and metadata.

5 For the conditions pertaining to access and use of the Corpus of Spoken Greek, scholars may send an e-mail to the following address: [email protected].

3.1 Data Collection

The CSG, as mentioned, utilized earlier data collections and transcriptions. In all cases, we are dealing with naturally occurring talk that was for the most part collected by MA or PhD students for their theses or by undergraduate students during their semester work. The informal conversations were tape- or video-recorded by one of the participants. In the case of classroom interaction, it was the teachers themselves who made audio-recordings of their high school classes. The quality of the recording equipment has varied over time: in the beginning only simple tape-recorders were available, while later on digital (tape-, video-) recorders could be purchased and used.

Recordings of private conversations and classroom interactions were made openly, after the participants had been informed and had given their consent. In the case of telephone conversations, this information was sometimes given after the telephone calls had been conducted. In all cases, the persons who made the recordings were asked to erase whatever they wanted to and to hand in only those conversations/interactions they would not mind being heard or seen by others. The issue of participants’ consent, though, has been handled with growing stringency in the last 15 years, as a result of rising sensitivity in Greek society and the greater restrictiveness of the official Data Protection Authority for the safeguarding of personal data in Greece. Accordingly, the written consent form that we now use is much more comprehensive and detailed.

3.2 Transcription

Conversation analysis, as is well known, lays great emphasis on the detailed representation of spoken discourse via transcription of the recordings. This does not mean that transcripts substitute for the audio- or video-recordings, which are the only reliable data and are viewed as a “reproduction of a determinate social event” (Hutchby & Wooffitt, 2008, p. 70). Yet transcripts constitute the necessary initial step for the analysis of interaction and are a convenient tool that can be used in conjunction with the real data. For conversation analysis, transcription is not an automatic or mechanical procedure of the kind accomplished, for example, by various software packages, nor is it confined to the representation of content, as is usually done in the written form of interviews by journalists. On the contrary, the “translation” of sound into written text requires theoretical elaboration and analysis (cf. also Ochs, 1979), presupposes training, and demands repeated corrections by several people.

The recordings collected for the CSG are therefore meticulously transcribed according to the principles of conversation analysis (cf. e.g., Jefferson, 2004; Sacks, Schegloff and Jefferson, 1974; Schegloff, 2007)6 in several rounds by different people. This is basically an orthographic transcription, in which the marking of overlaps, repairs, pauses, intonational features, etc., is carried out in a relatively detailed manner. For the disambiguation of prosodic features (e.g. intonation contours), we sometimes employ the Praat software. Finally, transcriptions are produced as Word documents, using a table format that allows different columns for marking the participants’ names, the line numbers (when necessary), etc. An extract of such a transcription is presented below (with added English translation):7

6 To this we added the marking of certain sandhi phenomena (^) and dialectal features.
7 In the transcribed text we employed standard CA transcription symbols. For a detailed presentation of these transcription conventions see Jefferson (2004), Sacks et al. (1974), and Schegloff (2007).

(1) Conversation I.14.A.20.1

Dimos     […] έπαιξε όλα τα:- αυτά που ακούγαμε στα
          […] he played all the- those songs we listened to when
          δεκαοχτώ, (1.3) και μας κοιτούσε με κριτικό
          we were eighteen, (1.3) and he looked at us with a
          [μάτι. (γιατί δε τ’ ακούς αυτά.)]
          critical eye. (why don’t you listen to these.)

Afroditi  [Συγγνώμη, τα ίδια άκου[γα]με (όταν ήσασταν)]=
          Excuse me, did we listen to the same songs (when you

Yorgos    [((γελά..........................]=
          ((he laughs………..….

Afroditi  =[δεκαοχτώ? πού θες να ξέρω τι ακούγατε.] ((γελώντας))
          were) eighteen? how should I know what you listened to. ((in a laughing tone))

Yorgos    =[................................................................))]
          ................................................................))
          Δηλα[δή (όπως ............)]=
          You mean (like ............)

Dimos     [Ε: έκφραση είναι.]=
          Eh it’s just an expression.

Yorgos    =[(σ’ αυτή τη) δουλειά, τι εννοείς με κριτι]κό μάτι?
          (in this) job, what do you mean with a critical eye?

Manos     =[Α υ τ ά τα: ποιοτικά (π ο υ α κ ο ύ ν).]
          Those quality songs (they listen to).
          (.)

Afroditi  .hh Ο Δημήτρης [ας πούμε άκουγε >τι άκουγε.