Documenting Endangered Languages: Achievements and Perspectives 9783110260021, 9783110260014

The rapid decline in the world's linguistic diversity has prompted the emergence of documentary linguistics. While


English Pages 353 [364] Year 2011


Table of contents :
Preface. Ulrike Mosel’s contribution to documentary linguistics
Chapter 1. Introduction
Part I. Theoretical issues in language documentation
Chapter 2. Competing motivations for documenting endangered languages
Chapter 3. Evolving challenges in archiving and data infrastructures
Chapter 4. Comparing corpora from endangered language projects: Explorations in language typology based on original texts
Part II. Documenting language structure
Chapter 5. “Words” in Kharia – Phonological, morpho-syntactic, and “orthographical” aspects
Chapter 6. Aspect in Forest Enets and other Siberian indigenous languages – when grammaticography and lexicography meet different metalanguages
Chapter 7. Documentary linguistics and prosodic evidence for the syntax of spoken language
Chapter 8. Diphthongology meets language documentation: The Finnish experience
Chapter 9. Retelling data: Working on transcription
Part III. Documenting the lexicon
Chapter 10. The making of a multimedia encyclopaedic lexicon for and in endangered speech communities
Chapter 11. What does it take to make an ethnographic dictionary? On the treatment of fish and tree names in dictionaries of Oceanic languages
Part IV. Interaction with speech communities
Chapter 12. Language is power: The impact of fieldwork on community politics
Chapter 13. Sustaining Vurës: Making products of language documentation accessible to multiple audiences
Chapter 14. Filming with native speaker commentary
Index


Documenting Endangered Languages

Trends in Linguistics Studies and Monographs 240

Editor: Volker Gast
Founding Editor: Werner Winter
Editorial Board: Walter Bisang, Hans Henrich Hock, Heiko Narrog, Matthias Schlesewsky, Niina Ning Zhang
Editor responsible for this volume: Volker Gast

De Gruyter Mouton

Documenting Endangered Languages Achievements and Perspectives

Edited by

Geoffrey L. J. Haig, Nicole Nau, Stefan Schnell, Claudia Wegener

De Gruyter Mouton

ISBN 978-3-11-026001-4
e-ISBN 978-3-11-026002-1
ISSN 1861-4302

Library of Congress Cataloging-in-Publication Data
Documenting endangered languages : achievements and perspectives / edited by Geoffrey Haig ... [et al.].
p. cm. -- (Trends in linguistics : studies and monographs; 240)
Includes bibliographical references and index.
ISBN 978-3-11-026001-4 (alk. paper)
1. Language obsolescence. 2. Language and languages. I. Haig, Geoffrey.
P40.5.L33D63 2011
410--dc23
2011030746

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.

© 2011 Walter de Gruyter GmbH & Co. KG, Berlin/Boston
Printing: Hubert & Co. GmbH & Co. KG, Göttingen
∞ Printed on acid-free paper
Printed in Germany.
www.degruyter.com

This volume is dedicated to Ulrike Mosel, in recognition of her contribution towards documenting the world’s endangered languages.

Contents

Preface. Ulrike Mosel’s contribution to documentary linguistics
Geoffrey Haig and Nicole Nau

Chapter 1. Introduction
Geoffrey Haig, Nicole Nau, Stefan Schnell and Claudia Wegener

Part I. Theoretical issues in language documentation

Chapter 2. Competing motivations for documenting endangered languages
Frank Seifart

Chapter 3. Evolving challenges in archiving and data infrastructures
Daan Broeder, Han Sloetjes, Paul Trilsbeek, Dieter van Uytvanck, Menzo Windhouwer and Peter Wittenburg

Chapter 4. Comparing corpora from endangered language projects: Explorations in language typology based on original texts
Geoffrey Haig, Stefan Schnell and Claudia Wegener

Part II. Documenting language structure

Chapter 5. “Words” in Kharia – Phonological, morpho-syntactic, and “orthographical” aspects
John Peterson

Chapter 6. Aspect in Forest Enets and other Siberian indigenous languages – when grammaticography and lexicography meet different metalanguages
Florian Siegl

Chapter 7. Documentary linguistics and prosodic evidence for the syntax of spoken language
Candide Simard and Eva Schultze-Berndt

Chapter 8. Diphthongology meets language documentation: The Finnish experience
Klaus Geyer

Chapter 9. Retelling data: Working on transcription
Dagmar Jung and Nikolaus P. Himmelmann

Part III. Documenting the lexicon

Chapter 10. The making of a multimedia encyclopaedic lexicon for and in endangered speech communities
Gaby Cablitz

Chapter 11. What does it take to make an ethnographic dictionary? On the treatment of fish and tree names in dictionaries of Oceanic languages
Andrew Pawley

Part IV. Interaction with speech communities

Chapter 12. Language is power: The impact of fieldwork on community politics
Even Hovdhaugen and Åshild Næss

Chapter 13. Sustaining Vurës: Making products of language documentation accessible to multiple audiences
Catriona Hyslop Malau

Chapter 14. Filming with native speaker commentary
Anna Margetts

Index

Preface
Ulrike Mosel’s contribution to documentary linguistics
Geoffrey Haig and Nicole Nau

In 2011, we look back on a decade of language documentation within the Volkswagen Foundation’s DoBeS-programme. Likewise in 2011, Ulrike Mosel, one of the founding scholars behind that programme, and a driving force within it throughout, will reach official retirement age. We are all familiar with Ulrike’s disdain for the external trappings of academia; for her, they are simply diversions from the real business of documenting endangered languages. Nevertheless, we hope she will excuse us for taking a couple of pages to briefly reflect on a very distinguished career, one that has been deeply intertwined with the burgeoning enterprise of language documentation, its theory, its practice, but also with training and inspiring the minds of many linguists, among them the editors of this volume.
It was her fascination with documents of languages that had lost their speakers long ago that led Ulrike to the study of Semitic languages, Assyriology and General Linguistics at the University of Munich in 1966. Her academic education was grounded in a tradition where the study of texts was of central importance, and the principles of classical grammatical analysis were common knowledge. Her linguistic background is steeped in the philological tradition, resting on the three cornerstones of text, grammar and dictionary. At the same time, she witnessed the spread of modern linguistics through German universities, and absorbed those influences with the same insatiable curiosity that characterizes all her activities. In her PhD thesis, written on the great eighth-century Arabic grammarian Sibawayh, Ulrike identified precursors of modern structuralist thinking in his work.
After her PhD, Ulrike’s research interests took a new direction: She turned her attention away from the study of ancient texts and Semitic philology to the under-described, and mostly unwritten, languages of the Pacific region, beginning with Tolai in Papua New Guinea in the 1970’s. Working with the Tolai was Ulrike’s first experience with linguistic fieldwork (bar a brief trip to the Bedouins of Jordan, but that is another story), and as it is against her nature to do something without method, it was probably inevitable that from that time on she has contributed substantially to the methodology of fieldwork, adding many aspects that had not been properly thought through at that time; in part thanks to her efforts, the current generation of young linguists leaves for the field much better prepared than she was. Ulrike was certainly not the only scholar at the time who went into the field with some high-flung research agenda and ended up collecting stories. But she was one of the few wise enough to understand that collecting texts and actually learning the language is not something that precludes serious linguistic research, but forms an essential part of it. Thus, the publication of Tolai texts (1977) stands with equal importance beside the two later monographs she devoted to the language: Tolai and Tok Pisin (1980), where she proved the important role Tolai had played in the development of Tok Pisin lexicon and grammar, and Tolai syntax and its historical development (1984), which was her post-PhD thesis (Habilitationsschrift). Since then, Ulrike has continued to work on different languages of the Pacific region, focussing on Samoan and Teop. She is not one of those linguists who produce superficial, quick-fix analyses on isolated aspects of numerous different languages. Ulrike’s work is based on years of first-hand experience, and the study of actual usage in the speech community. It would be simply against her principles to publish on a language she was not personally intimately familiar with; one can only speculate on the beneficial effects such a principle would have on the quality of many linguistic publications, if it were more widely adhered to. In her thinking as a linguist, she was strongly influenced by the major typological schools of the time, spending several years in the inspiring research environment at the University of Cologne, where Hansjakob Seiler and his associates had founded the major centre for language typology and universals in Germany.
Around 1990, Ulrike moved to the Australian National University in Canberra, where she worked with linguists like Avery Andrews, Bob Dixon, Andy Pawley, Malcolm Ross and Anna Wierzbicka. This period, interspersed with lengthy field trips to Samoa, is a time she looks back on with great fondness, both intellectually and personally. During the Canberra years she produced with Even Hovdhaugen the monumental Samoan reference grammar, as well as numerous important contributions to Samoan syntax and semantics. Then in 1995 she was offered the chair at the Department of General and Indo-European Linguistics at the University of Kiel, where she has remained ever since. When Ulrike arrived in Germany, she brought with her the "can do!" attitude that she herself would, perhaps, attribute to the influence of her Australian years, but in fact is very much part of her own personality. She inherited a chair that had been vacant for some time, and a department that was lacking in direction, and in students. She set about rebuilding the department, introducing an emphasis on language documentation, but without neglecting foundational theoretical aspects of linguistics. The BA programme she established in collaboration with the phonetics department has been hugely successful, with new enrolments of 60–100 students annually. The shake-up involved some decisions that raised a few eyebrows in the then still very traditional Philosophische Fakultät at the University of Kiel (for example, her decision to abolish Latin as a prerequisite for an undergraduate degree in linguistics, or her regular conflicts with the faculty when it came to allowing graduates to write their theses in English). In both decisions, as in many others, she stuck to her own convictions against considerable resistance, and in doing so proved herself once again ahead of her time. Within a remarkably short time she had forged a thriving department, hosted several research projects, supervised more than a dozen PhD’s and countless MA theses, and set up highly successful and innovative new degree programmes. She never served as dean, vice-chancellor, or anything else in university politics: she just concentrated on what she does best, namely research and teaching in linguistics. International recognition of her work has come in many forms, among them an Honorary Membership in the Linguistic Society of America, awarded to her in 2007. Through it all she has retained her signature mix of enthusiasm and rigour, coupled with a very Prussian work ethic, but also a deep humanity that enables her to genuinely engage with people of all walks of life – one of the most important qualities for successful field work.
Her research commitments in these latter years have been dominated by her involvement in the DoBeS-programme, as one of its founding scholars, multiple grant-recipient, and long-standing chairperson of the Steering Committee. There is hardly an aspect of the programme that Ulrike has not impacted on over the years. We feel confident that a little thing like retirement will not significantly change that.

Publications by Ulrike Mosel

Grammars and textbooks
1992 Samoan Reference Grammar (with Even Hovdhaugen). Oslo: Scandinavian University Press.
1994 Saliba. München: Lincom.
1997 Say It In Samoan (with Ainslie So’o). (A text book for learners of Samoan as a second language). Canberra: Australian National University Press.
2000 O le Kalama o le Gagana Samoa (with Fosa Siliko, Ainslie So’o and Agafili Tuitolova’a). (A Samoan grammar for teachers). Apia, Western Samoa: Curriculum Development Unit, Department of Education.
2007 The Teop Sketch Grammar (with Yvonne Thiesen). University of Kiel. Version May 2007. http://corpus1.mpi.nl/ds/imdi_browser/?openpath=MPI533750%23 and http://www.linguistik.uni-kiel.de/Teop_Sketch_Grammar_May07.pdf

Dictionaries
1997 O le fale (with Mose Fulu). (A small monolingual Samoan dictionary on housebuilding and furniture, published by the Ministry of Youth, Sports and Culture). Apia, Western Samoa: Matagaluega Autalavou Taaloga ma Aganuu.
2001 Utugaga (with Fosa Siliko, Ainslie So’o and Agafili Tuitolova’a). (A monolingual dictionary of Samoan for primary school students of year 5–8). Apia, Western Samoa: Curriculum Development Unit, Department of Education.
2007 Teop Lexical Database (with Marcia L. Schwartz, Ruth Saovana Spriggs, Ruth Siimaa Rigamu, Jeremiah Vaabero and Naphtaly Maion). Kiel: Seminar für Allgemeine und Vergleichende Sprachwissenschaft, Christian-Albrechts-Universität. http://www.linguistik.uni-kiel.de/Teop_Lexical_Database_May07.pdf
2010 A inu. The Teop–English Dictionary of House Building (with Mark Mahaka, Enoch Horai Magum, Joyce Maion, Naphtaly Maion, Ruth Siimaa Rigamu, Ruth Saovana Spriggs, Jeremiah Vaabero, Marcia Schwartz and Yvonne Thiesen). With drawings by Neville Vitahi and photographs by Ulrike Mosel. SAVS Arbeitsberichte 6. Kiel: Seminar für Allgemeine und Vergleichende Sprachwissenschaft, Christian-Albrechts-Universität.

Text collections
1977 Tolai Texts. Kivung, Journal of the Linguistic Society of Papua New Guinea. Volume 10, Port Moresby.
2007 Amaa vahutate vaa Teapu (with Enoch Horai Magum, Joyce Maion, Jubilie Kamai, Ondria Tavagaga and Yvonne Thiesen). Illustrated by Rodney Rasin. Kiel: Seminar für Allgemeine und Vergleichende Sprachwissenschaft, Christian-Albrechts-Universität.
2009 Teop Language Corpus (with Enoch Horai Magum, Helen Magum, Shalom Magum, Jubilie Kamai, Owen Kasinory, Mark Mahaka, Joyce Maion, Naphtaly Maion, Janeth Nasin, Rodney Rasin, Ruth Siimaa Rigamu, Ruth Saovana Spriggs, Ondria Tavagaga, Jeremiah Vaabero, Neville Vitahi, Jessika Reinig, Marcia Schwartz, Yvonne Schuth (Thiesen)). http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI622803%23

On grammaticography
1975 Die syntaktische Terminologie bei Sibawaih. 2 volumes. Dissertation. München: Uni Fotodruck Frank.
1987 Inhalt und Aufbau deskriptiver Grammatiken (How to write a grammar). Arbeitspapier Nr. 4. Köln: Institut für Sprachwissenschaft, Universität zu Köln.
1980 Syntactic categories in Sibawaih’s “Kitab”. Histoire Épistémologie Langage 2(1): 27–37.
2002 Analytic and synthetic language description. In Linguistik jenseits des Strukturalismus. Akten des II. Ost-West-Kolloquiums Berlin 1998, eds. Kennosuke Ezawa, Wilfried Kürschner, Karl H. Rensch and Manfred Ringmacher, 199–208. Tübingen: Narr.
2006 Sketch grammars. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus Himmelmann and Ulrike Mosel, 301–309. Berlin: Mouton de Gruyter.
2006 Grammaticography, the art and craft of writing grammars. In Catching Language: the Standing Challenge of Grammar Writing, eds. Felix Ameka, Alan Dench and Nicholas Evans, 41–68. Berlin: Mouton de Gruyter.
2007 Early grammars of Oceanic languages (with Even Hovdhaugen). In Sprachtheorien der Neuzeit III/2: Sprachbeschreibung und Sprachunterricht, ed. Peter Schmitter, 462–478. Tübingen: Narr.

On lexicography
2002 Dictionary making in endangered language communities. In Proceedings of the International Workshop on Resources and Tools in Field Linguistics, Las Palmas 26–27 May 2002.
2004 Dictionary making in endangered speech communities. In Language Documentation and Description, Volume 2, ed. Peter K. Austin, 39–54. London: School of Oriental and African Studies.
2011 Lexicography in endangered language communities. In The Cambridge Handbook of Endangered Languages, eds. Peter K. Austin and Julia Sallabank, 337–353. Cambridge: Cambridge University Press.

On fieldwork and language documentation
2006 Essentials of Language Documentation (with Jost Gippert and Nikolaus Himmelmann, eds.). Berlin: Mouton de Gruyter.
2001 Linguistic fieldwork. In International Encyclopedia of the Social and Behavioral Sciences, Volume 13, eds. Neil J. Smelser and Paul B. Baltes, 8906–8910.
2002 Methods of language documentation in the DOBES project (with Peter Wittenburg and Arienne Dwyer). In Proceedings of the International Workshop on Resources and Tools in Field Linguistics, Las Palmas 26–27 May 2002.
2004 Inventing communicative events: Conflicts arising from the aims of language documentation. Language Archive Newsletter 1(3): 3–4.
2006 Fieldwork and community language work. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus Himmelmann and Ulrike Mosel, 67–85. Berlin: Mouton de Gruyter.
In print Morphosyntactic analysis in the field, a guide to the guides. In The Oxford Handbook of Linguistic Fieldwork, ed. Nicholas Thieberger. Oxford: Oxford University Press.

On typology and grammar of Oceanic languages
1983 Adnominal and Predicative Possessive Constructions in Melanesian Languages. Arbeiten des Kölner Universalienprojektes 50. Köln: Institut für Sprachwissenschaft, Universität zu Köln.
1999 Negation in Oceanic Languages (with Even Hovdhaugen, eds.). München: Lincom.
1982 Number, collection and mass in Tolai. In Apprehension. Das sprachliche Erfassen von Gegenständen. Teil II. Die Techniken und ihr Zusammenhang in Einzelsprachen, eds. Hansjakob Seiler and Franz-Josef Stachowiak, 123–154. Tübingen: Narr.
1987 Subject in Samoan. In A World of Language: Papers Presented to Professor S. A. Wurm on his 65th Birthday, eds. Donald C. Laycock and Werner Winter, 455–479. Canberra: Australian National University Press.
1989 On the classification of verbs and verbal clauses in Samoan. In VICAL 1: Oceanic Languages. Papers from the Fifth International Conference on Austronesian Linguistics, eds. Ray Harlow and Robin Hooper, 377–398. Auckland: Linguistic Society of New Zealand.
1991 Abstufungen der Transitivität im Palauischen. In Partizipation: Das sprachliche Erfassen von Sachverhalten, eds. Hansjakob Seiler and Waldfried Premper, 400–407. Tübingen: Narr.
1991 The continuum of verbal and nominal clauses in Samoan. In Partizipation: Das sprachliche Erfassen von Sachverhalten, eds. Hansjakob Seiler and Waldfried Premper, 138–149. Tübingen: Narr.
1991 Towards a typology of valency. In Partizipation: Das sprachliche Erfassen von Sachverhalten, eds. Hansjakob Seiler and Waldfried Premper, 240–251. Tübingen: Narr.
1991 Transitivity and reflexivity in Samoan. Australian Journal of Linguistics 11: 175–194.
1992 On nominalisation in Samoan. In The Language Game: Papers in Memory of Donald C. Laycock, eds. Tom Dutton, Malcolm Ross and Darrell T. Tryon, 263–281. Canberra: Australian National University Press.
1994 Samoan. In Comparative Austronesian Dictionary: An Introduction to Austronesian Studies, ed. Darrell T. Tryon, 943–946. Berlin: Mouton de Gruyter.
1994 Tolai. In Comparative Austronesian Dictionary: An Introduction to Austronesian Studies, ed. Darrell T. Tryon, 727–730. Berlin: Mouton de Gruyter.
1999 Towards a typology of negation in Oceanic Languages. In Negation in Oceanic Languages, eds. Even Hovdhaugen and Ulrike Mosel, 1–19. München: Lincom.
1999 Negation in Teop (with Ruth Spriggs). In Negation in Oceanic Languages, eds. Even Hovdhaugen and Ulrike Mosel, 45–56. München: Lincom.
1999 Gender in Teop (with Ruth Spriggs). In Gender in Grammar and Cognition, eds. Barbara Unterbeck and Matti Rissanen, 321–349. Berlin: Mouton de Gruyter.
2000 Aspect in Samoan. In Probleme der Interaktion von Lexik und Aspekt, ed. Walter Breu, 179–192. Tübingen: Niemeyer.
2000 Valence changing clitics and incorporated prepositions in Teop (Oceanic, Bougainville) (with Jessika Reinig). In Proceedings of AFLA VII, ed. Marian Klamer, 133–140. Amsterdam: Department of Linguistics, Vrije Universiteit.
2004 Complex predicates and juxtapositional constructions in Samoan. In Complex Predicates in Oceanic Languages. Studies in the Dynamics of Binding and Boundedness, eds. Isabelle Bril and Françoise Ozanne-Rivierre, 263–296. Berlin: Mouton de Gruyter.
2004 Demonstratives in Samoan. In Deixis in Oceanic Languages, ed. Gunter Senft, 141–174. Canberra: Pacific Linguistics.
2007 Ditransitivity and valency change in Teop – a corpus based approach. Tidsskrift for Sprogforskning, 5: 1–40.
2010 The fourth person in Teop. In A Journey through Austronesian and Papuan Linguistic and Cultural Space: Papers in Honour of Andrew K. Pawley, eds. John Bowden, Nikolaus Himmelmann and Malcolm Ross, 391–404. Canberra: Pacific Linguistics.

On Semantics
1982 Local deixis in Tolai. In Here and There: Cross-linguistic Studies in Deixis and Demonstration, eds. Jürgen Weissenborn and Wolfgang Klein, 111–132. Amsterdam: Benjamins.
1991 Time metaphors in Samoan. In Festschrift für Meinrad Scheller, ed. Walter Bisang, 149–165. Arbeiten des Seminars für Allgemeine Sprachwissenschaft der Universität Zürich 11. Zürich: Seminar für Allgemeine Sprachwissenschaft, Universität Zürich.
1991 The Samoan construction of reality. In The Currents in Pacific Linguistics. Papers on Austronesian Languages and Ethnolinguistics in Honor of George Grace, ed. Robert Blust, 293–303. Canberra: Australian National University Press.
1994 Samoan. In Semantic and Lexical Universals: Theory and Empirical Findings, eds. Cliff Goddard and Anna Wierzbicka, 331–360. Amsterdam: Benjamins.

On language contact and language change
1980 Tolai and Tok Pisin. The Influence of the Substratum on the Development of New Guinea Pidgin. Canberra: Australian National University Press.
1984 Tolai Syntax and its Historical Development. Canberra: Australian National University Press.
1979 Early language contact between Tolai, Pidgin and English in view of its sociolinguistic background (1875–1914). In Papers in Pidgin and Creole Linguistics No. 2, 163–181. Canberra: Australian National University Press.
1982 The influence of the Church Missions on the development of Tolai. In Gava’. Studies in Austronesian Languages and Cultures, Dedicated to Hans Kähler, eds. Rainer Carle, Martina Henschke, Peter W. Pink, Christel Rost and Karen Stadtlender, 155–172. Berlin: Reimer.
1982 New evidence of Samoan origin of New Guinea Tok Pisin (New Guinea Pidgin English) (with Peter Mühlhäusler). Journal of Pacific History 17(3): 166–175.
2004 Borrowing in Samoan. In Borrowing: A Pacific Perspective, eds. Jan Tent and Paul Geraghty, 215–232. Canberra: Australian National University.
To app. Analogical levelling across constructions – incorporated prepositions in Teop. In The Evolution of Syntactic Relations, eds. Christian Lehmann and Stavros Skopeteas. Berlin: Mouton de Gruyter.

Chapter 1
Introduction: Documenting endangered languages before, during, and after the DoBeS programme∗
Geoffrey Haig, Nicole Nau, Stefan Schnell and Claudia Wegener

∗ We are extremely grateful to the DoBeS-programme of the Volkswagenstiftung, who funded most of the research reported in this book, and so much more. We would also like to thank the authors for their contributions and constructive input during the entire publication process. Finally, our thanks go to the production team at Mouton and the series editor, Volker Gast, for their support and encouragement throughout.

1. Background

Scholars have been engaged in language documentation for centuries. But it is only in the past couple of decades that language documentation has emerged as an academic discipline in its own right, associated with scholarly publications, university departments, funding initiatives, and a growing repertoire of practices, theoretical diversification and sophistication, and specialist terminology. The roots of the current upsurge in language documentation run deep, and stem from several sources, too many to treat in any detail here. Today, language documentation has come of age, a fertile domain for cross-disciplinary exchange, involving linguists, anthropologists, ethno-musicologists, biologists, along with software developers and corpus linguists. The development has been remarkably rapid; a glance at the seminal article by Himmelmann (1998) suffices to confirm that much of what was envisaged in that programmatic statement has since become part and parcel of language documentation practice.
In the late 1990’s, a small group of German-based linguists began developing an agenda for a programme aimed at documenting endangered languages worldwide. That initiative came to fruition in the Volkswagen Foundation’s DoBeS-programme (Dokumentation bedrohter Sprachen), which began with a pilot phase in 2000. DoBeS went on to become the Foundation’s longest-running programme within the humanities. From the outset, DoBeS established the Max Planck Institute for Psycholinguistics in Nijmegen as the host for the central archive, also responsible for developing and implementing technical standards for the project. This generated an intensive exchange between linguists, software developers, and archivists, yielding significant improvements in annotation and metadata procedures as well as refinements of ethical and legal guidelines. It has been an ongoing process, constantly informed by the practical experience gained through some 50 documentation projects (cf. Broeder et al., this volume). Moreover, regular workshops and summer schools have provided scores of linguists, as well as many native speakers of endangered languages, with training in documentary linguistics, ensuring the initiative’s continued influence through future generations of linguists.
The DoBeS-programme has undoubtedly been one of the most successful research initiatives in the language sciences. Its impact on the emergent field of documentary linguistics can hardly be exaggerated (see Harrison, Rood, and Dwyer 2008 for a similar assessment), with long-term implications that go far beyond the field of linguistics itself. With the programme entering its closing phases in late 2011, it is fitting to take a look at some of its achievements, and some of the future challenges, through the work of some of its practitioners.
As mentioned, the modern discipline of language documentation has multiple ancestors. Previous retrospectives, e.g. Woodbury (2003), identify Franz Boas as the spiritus rector of documentary linguistics in the North American context. However, with the rise of Chomskyan linguistics in North America, the Boasian tradition of anthropologically informed documentary linguistics had lost ground there. Both Grinevald (2003) and Woodbury (2003) identify the LSA symposium in 1991, and the associated publication in Language (Hale et al. 1992), as the turning points in triggering the rehabilitation of documentary linguistics.
From a North American perspective, the developments certainly appear to be quite radical; Grinevald (2003: 52) describes in vivid terms the “incredible tension” she experienced prior to the LSA panel meeting in 1991 when the blunt facts regarding the imminent loss of much of the world’s linguistic diversity were to be presented before the assembled dignitaries of the North American linguistic scene. The perception of a “paradigm shift” can only really be appreciated when one considers the extent to which American linguistics at the time was dominated by scholars working in theories focused on an idealized conception of “grammar”. The topic of language endangerment, on the other hand, meant introducing social and political dimensions into a field that had effectively abstracted away from such considerations. But since the early 1990’s, documentary linguistics in North America has regained much of the lost ground, with numerous highly successful and innovative programmes now well-established (cf. Woodbury 2003 for further discussion). But the North American perspective is only part of the story.1
From a European perspective, on the other hand, the paradigm shift appears somewhat less fundamental. In Germany, now an important centre for language documentation, the impact of the much-vaunted Chomskyan Revolution was considerably weaker than across the Atlantic. The reasons are partly to be sought in the way linguistics is institutionalized at German universities. To this day, dedicated departments of general linguistics are only sparsely scattered across the university landscape. Linguistics has remained to a large extent the concern of individual philologies (English, Romance, German etc.). Within such departments, the text-based tradition of philology continued to be an important pillar in the academic training. Linguists within these disciplines thus never felt themselves on the defensive to the same extent that their American colleagues felt during Generative Grammar’s ascension to domination in North America. In Europe, a deep-rooted tradition in the description of little-studied languages and dialects could survive as a respectable, if marginal, niche within the philologies. Particularly in departments such as African languages, Finno-Ugric, Semitic, Turkic, or Iranian, descriptive grammars, dictionaries and text collections from under-studied languages and dialects have constituted a regular part of the scientific output for close to two centuries. Highly-respected German professors, such as the now retired Semitist Otto Jastrow, devoted much of their professional careers to field work and documenting endangered languages. Among other things, Jastrow’s work led to the establishment of the Semarch,2 a digital archive of (often highly endangered) Semitic languages at the University of Heidelberg.

1. A comprehensive treatment of the roots of language documentation lies beyond the scope of this introduction; see Himmelmann (2008) for a more balanced account. Two further antecedents of modern documentary linguistics, outside of Western Europe and North America, are also definitely worthy of mention: the linguistic fieldwork paradigm initiated by the Moscow-based linguist Alexander Kibrik and his associates in the 1970’s, which generated an enormous amount of descriptive materials on indigenous languages of the Ex-Soviet Union, and the tradition of encouraging descriptive grammars of undescribed languages as PhD theses, established at the Australian National University in the 1970’s mainly through the efforts of Bob Dixon.
2. http://www.semarch.uni-hd.de/index.php43

Geoffrey Haig, Nicole Nau, Stefan Schnell, Claudia Wegener

Nevertheless, it would be a gross oversimplification to see today’s discipline of language documentation as simply the digitalized continuation of the philological tradition of the 19th century. For a start, today’s documentary linguistics is primarily motivated by the desire to conserve (at least a record of) the world’s rapidly diminishing linguistic diversity. Prior to the 1980’s, the global dimensions of this trend had not been fully appreciated. It is this awareness that unites modern documentary linguistics world-wide, and across the boundaries of language families, creating a common platform that the individual philologies never really developed. Himmelmann (2006: 15) identifies five further innovations that distinguish contemporary documentary linguistics from its philological predecessors. Here we take up what we consider to be the three most salient, while adding a fourth that has very recently come to the fore:

1. Focus on the full range of communicative practices, rather than on selected text types with supposedly high cultural, religious or historical significance (and, one might add, most commonly from exclusively male speakers). This point is fundamental; it arises from the fact that documentary linguistics is a branch of general linguistics, rather than any particular philology. Linguistics is concerned with language, regardless of its prestige; spontaneous discourse is not merely “idle” chat, but offers important insights into the language usage of the community and is thus equally worthy of the documenter’s attention.³ Furthermore, additional research questions have arisen over the last decades, for example discourse analysis, language and gender, language acquisition, and language contact, which have led to a growing interest in a broader range of communicative events. As an example of a philologist’s approach, it is instructive to consider the work of David MacKenzie, a specialist on Iranian languages who undertook documentation of Kurdish dialects in Iraq in the 1950’s and 1960’s. In the introduction to his two-volume work (MacKenzie 1961: xvii), MacKenzie bemoans the dearth of “trustworthy informants”, i.e. speakers who use dialectally “pure” forms in their speech, and are unaffected by dominant standard languages. For MacKenzie, the lack of “trustworthy” speakers constituted a serious obstacle to the work of documentation. In the modern practice of documenting endangered languages, on the other hand, lack of speakers of a “pure” dialect is probably the norm for many projects. And in fact, interest in the dynamics of language shift and obsolescence is now a major research focus in its own right within documentary linguistics (Seifart, this volume).

³ The importance of documenting the full range of communicative practices in the community has been stressed by many authors (e.g. Himmelmann 1998; Foley 2003). In practice, however, it is notable that in most documentation projects it is still traditional monologues, rather than everyday conversational interactions, that tend to end up as fully annotated records in the archive. Part of the reason lies in the speech communities’ own assessment of what is worth preserving for posterity (cf. Mosel 2004, 2006, 2008). Partly, it may be due to the practical difficulties of recording, annotating and analysing spontaneous multi-participant discourse. In this sense, then, documentation practice lags behind documentation theory, but this is a perfectly normal state of affairs in any discipline, and with increased experience with, for example, video recording and annotations, we expect that greater amounts of conversational data will be archived.

2. Concern for long-term storage and preservation of primary data. This requires no special comment, and has been dealt with in many contributions (e.g. Trilsbeek and Wittenburg 2006). However, there is a further consideration generally wedded to the concern for long-term data preservation, namely the issue of maximal web-based accessibility. As Harrison et al. (2008: 3) put it: “Never before have extensive text collections been directly accessible to researchers worldwide.” This ease of accessibility distinguishes modern documentary linguistics from its predecessors rather sharply. But like most technical innovations, it has its downside. Global accessibility of (often sensitive) data has spawned a host of highly complex legal and ethical issues regarding access rights and protection of personal and community privacy, most of which remain unresolved (see discussion in Broeder et al., this volume).

3. Close cooperation with, and direct involvement of, the speech community. The degree to which community involvement characterizes modern language documentation, and the fact that it is now explicitly recognized and encouraged in grant proposals and evaluations, is undoubtedly a major innovation, which few would seriously question.
But it is worth recalling that even in the 1990’s, the issue of community involvement was highly controversial. At the 1993 Cologne summer school on documentary linguistics, organized by Hans-Jürgen Sasse and Nikolaus Himmelmann, the keynote speech by Colette Grinevald (then Colette Craig) was dominated by a heated discussion of this issue. Those who advocated a purely “objective scientific” approach to language documentation, eschewing any involvement in community-based efforts at language revitalization, were pitted against those who considered it part of the documenters’ responsibility to become involved in such initiatives. Since then, however, documentary linguistics, particularly in the DoBeS framework, has adopted a much more pragmatic and realistic approach. It is now widely recognized that it is simply counterproductive to insist on a dogmatic standpoint on this issue. If the speech community wishes for the documenter to participate in the speech community’s efforts to maintain its own language, this needs to be taken seriously as part of the documentation. Quite apart from the moral imperative, and despite the very real complications that may arise through the community’s active involvement (see Hovdhaugen and Næss, this volume), experience from numerous documentation projects worldwide has demonstrated that such participation is generally highly beneficial for all actors in the documentary scenario, and enhances the quality of the data in numerous ways.

4. The final point, not mentioned by Himmelmann, is the scientific potential inherent in the rapidly growing amounts of digitally archived data from typologically diverse languages. This is a point that few proponents of language documentation could have foreseen at the outset; it is a nice example of how data collection, if done properly, can generate new research fields which were simply not predictable at the time the data was collected. A recent example stems from research based on phoneme inventories of the world’s languages. When Ian Maddieson began to collate phoneme inventories from a sample of the world’s languages in the early 1980’s, his work was known primarily as a resource for language typology. Later, Maddieson contributed his data to the web-accessible World Atlas of Language Structures (WALS).⁴ Based on the phonological data in WALS, Quentin Atkinson has since formulated some far-reaching claims on the global settlement and migration patterns of Homo sapiens (Atkinson 2011).⁵ Back in the 1980’s, it is highly unlikely that Ian Maddieson could have predicted that his data on phoneme inventories might one day feed into a hypothesis regarding global human settlement patterns in prehistory. The point of this example, and numerous comparable ones, is that data collection is never “mere data collection”. The creation of structured, accessible, and rich databases raises conceptual and technical challenges (e.g. appropriate metadata formats); their resolution is itself a form of progress, and may in fact yield unforeseen theoretical advances. The DoBeS archive is thus far more than simply a “data repository”; it is a dynamic research resource that is generating exciting new perspectives for research. Language typologists and corpus linguists are beginning to appreciate these opportunities. Language typology based on language comparison through primary data – texts – rather than reference grammars (cf. Wälchli 2009) is already gaining ground, and it is only a matter of time before the considerable resources residing in archives of endangered languages will be harnessed for this kind of research (cf. Haig et al., this volume, for some initial examples). Thus the emphasis in the DoBeS-programme on “primary data”, rather than a “grammar+dictionary format” (Himmelmann 2006) finds its natural counterpart in an increasingly primary-data oriented, quantitative approach to language typology.

⁴ http://wals.info/

⁵ Atkinson’s findings have been heavily criticized, and may thus seem an unfortunate example for illustrating this point. However, the basic fact remains that data collectors simply cannot predict future advances; for them, it is a sufficient goal to ensure maximal accessibility, persistence, and good data organization so that later analyses can be carried out as economically as possible.

2. The organization of this volume

The contributions to this volume deal with aspects of language documentation, as they have emerged through the work of practitioners active in the DoBeS-programme over the last decade. The chapters are divided into four parts, which we briefly summarize below.

Part I. Theoretical issues in language documentation

The first part contains three contributions relating to theoretical issues in language documentation as a discipline. In the first one, “Competing motivations for documenting endangered languages”, Frank Seifart discusses the different motivations and expectations that different researchers bring to the enterprise of language documentation, and how conflicts between them can be resolved. He focuses on four different motivations: preservation of human cultural heritage, extending the empirical basis of linguistics, aiding a speech community in their efforts at language promotion, and documenting the effects of language contact. Seifart distinguishes the demands that each motivation places on different aspects of a documentation, taking as his framework Himmelmann’s (1998, 2006) distinction between primary data (content), and analytical apparatus of a language documentation. He notes how conflicts can be resolved by setting priorities either in the content, or the apparatus component of a language documentation, or both.

Language documentation is by nature a multi-faceted enterprise, inevitably leading to competing interests; Seifart provides a framework in which competing motivations are contextualized as a normal state of affairs, whose resolution is part of the overall challenge of language documentation.

In their contribution “Evolving challenges in archiving and data infrastructures”, Daan Broeder, Han Sloetjes, Paul Trilsbeek, Dieter van Uytvanck, Menzo Windhouwer and Peter Wittenburg trace some of the developments in language documentation technology, in which the technical group of the MPI has been centrally involved. The pace of progress has been extraordinary: data storage capacity has vastly increased, both in archives and in recording devices; portability and performance of recording devices have improved beyond measure; annotation and metadata schemes have been progressively refined. One of the most exciting prospects is the development of automatic recognition software “that react to comparatively simple patterns in media streams” as an additional layer to conventional annotations. By contrast, developments in the less technical aspects of archiving have not always matched the technical advances. In particular, the issue of access rights to archived material remains, despite 10 years of intense negotiation, far from resolved. The authors stress the dynamic nature of these developments, an evolutionary process driven on the one hand by the technical advances, yet constantly modified in accordance with the constraints of language documentation, with its eminently human aspects.

Geoffrey Haig, Stefan Schnell and Claudia Wegener discuss the potential relevance of archived data for language typology (“Comparing corpora from endangered language projects: Explorations in language typology based on original texts”). They show how archived data can be used in language comparison, enriching typological research not only by new data, but also by new methods. They argue that quantitative cross-corpus analysis is possible and yields interesting results in cases where texts of the same genre are compared, irrespective of the content. In a case study they explore the marking of core arguments in four genetically unrelated and geographically distant languages, using corpora of original spoken narrative texts from DoBeS-projects and the annotation system GRAID developed by Haig and Schnell (2011). Their investigation shows that some characteristics of discourse, such as the proportion of transitive and intransitive clauses in texts, or the well-known tendency to mark A by a pronoun or zero anaphora rather than a lexical expression, are remarkably similar across languages, while others, notably the deployment of pronouns in different syntactic functions, are more language dependent.

Part II. Documenting language structure

This part contains case studies from individual language documentations showing how language documentation practice can lead to the recognition of, and possible solutions to, issues of actual structural analysis, with considerable theoretical relevance. One of the earliest decisions that language documenters need to make concerns the nature of the orthography used in various aspects of the documentation. In Chapter 5, “‘Words’ in Kharia – Phonological, morpho-syntactic, and ‘orthographical’ aspects”, John Peterson discusses the different factors that need to be considered when fixing the boundaries of the basic orthographical unit, the word. The language concerned, Kharia, makes extensive use of clitics in inflection, which is a contributing factor to conflicting notions of “word” in the language. But the indigenous writing tradition in the dominant languages, based on the Devanagari script, is a further factor, as is the prosodic pattern typically associated with content words. Peterson presents the results of an investigation of native speakers’ perception of word boundaries, revealing that while there are areas of relatively stable judgements across different speakers, certain types of word/phrase may be orthographically segmented quite differently by different native speakers. His approach demonstrates how native speakers’ intuitions in orthography may run counter to the linguist’s analysis; both need careful consideration before decisions can be made.

It is an – often unspoken – assumption that documentary linguists should conduct their grammatical analysis in a maximally accessible, “theory-neutral” framework. In practice, however, no linguist is free of theoretical bias.
In Chapter 6, Florian Siegl shows how both the native language of the investigator, as well as a local linguistic tradition, may impact on the way grammatical descriptions are formulated (“Aspect in Forest Enets and other Siberian indigenous languages – when grammaticography and lexicography meet different metalanguages”). The case in point is the description of “aspect” in the endangered Samoyedic language Forest Enets. Siegl demonstrates that the conception of verbal aspect, as established within a long-standing grammaticographic tradition of the Russian language, has seriously distorted the analysis of a similar, though by no means analogous, category in the endangered language, both in grammars and dictionaries of this language. This study illustrates how linguistic analysis may be driven by the linguist’s pre-established conception of linguistic phenomena, and underscores the necessity for a truly data-driven approach.

Current documentations of endangered languages offer vastly improved possibilities for the study of spoken language. In particular, the range of languages documented makes it possible to extend linguistic research on prosody, which so far has been based mainly on data from a few well described languages. This point is made by Candide Simard and Eva Schultze-Berndt (“Documentary linguistics and prosodic evidence for the syntax of spoken language”), who further argue that grammatical units in spoken language cannot be defined without reference to prosodic units. They demonstrate their approach with an analysis of prosodic and syntactic units in the Australian language Jaminjung, such as afterthoughts, non-finite constructions, and discontinuous noun phrases. Prosodic evidence is sufficient to distinguish several syntactic constructions with otherwise identical surface forms. They show further that different degrees of semantic integration are iconically reflected by degrees of prosodic integration.

All languages have vowels; yet the analysis of vowel systems in documentary linguistics is seldom treated as a topic in its own right. In Chapter 8, “Diphthongology meets language documentation: The Finnish experience”, Klaus Geyer focuses on the analysis of diphthongs. Diphthongs appear to be frequently neglected or only poorly treated in language descriptions. Drawing on the well-documented case of diphthongs in Finnish, the author provides the reader with a practicable methodological toolkit for analyzing diphthongs, comprising the relevant diagnostics for their identification and classification in any language.
In Chapter 9, “Retelling data: Working on transcription”, Dagmar Jung and Nikolaus Himmelmann investigate methods in the rarely addressed, though ubiquitous task of transcribing recorded texts. Emphasizing the importance of close collaboration between linguists and speech community, they discuss a number of ways in which native speakers change or elaborate the recorded text when transferring it into a written form, for instance by editing out false starts, or by adding information felt to be highly relevant but missed by the speaker. The authors argue that a comprehensive documentation of transcription practices and of such changes and elaborations represents highly valuable data in terms of native speakers’ (meta)linguistic knowledge and our understanding of the development of written varieties of hitherto unwritten languages.

Part III. Documenting the lexicon

Dictionaries represent the most tangible product of a language documentation, the one most readily understood by laypersons, and most appreciated by the speech community themselves. It is thus no surprise that dictionary-making has probably the longest tradition of any activity in language documentation. In her contribution “The making of a multimedia encyclopaedic lexicon for and in endangered speech communities”, Gaby Cablitz explores some of the ways that recent advances in digital technology can be implemented in a documentation scenario. Specifically, Cablitz describes two major innovations: First, the LEXUS-tool for creating web-based lexica, where entries are linked to multimedia documents, permitting a much more “encyclopaedic” feel than conventional dictionary software. LEXUS includes a relational linking device, ViCoS (Visualization of Conceptual Spaces), which allows the creation of dense networks of linkages between different data types, intended to model knowledge representation more realistically. Second, she discusses web-based interactive editing of dictionary entries, in which the speech community itself takes an active role. Cablitz presents a view from the field, drawing on the experience gained during a four-year interdisciplinary project in the Marquesan and Tuamotuan speech communities in French Polynesia. As in other contributions, Cablitz’ work reveals that the innovative potential made available through advances in hard- and software needs to be tempered and adapted to the realities of the fieldwork situation, where very fundamental considerations of user-friendliness and respect for community values and relationships cannot be left out of consideration.
In Chapter 11, Andrew Pawley explores the challenges of making dictionaries of lesser described languages, in many instances the first dictionaries of the respective languages (“What does it take to make an ethnographic dictionary? On the treatment of fish and tree names in dictionaries of Oceanic languages”). Based on experience gained through compiling dictionaries of several Oceanic languages, Pawley first discusses the challenges of eliciting and determining the reference of terms in domains that require specialist expertise that the lexicographer generally does not have. He focuses on fish and tree names, two terminological domains that are large and important in most traditional Oceanic speech communities. The second problem discussed is that of defining terms from such fields, when they differ in content and structure from corresponding fields in the investigator’s language. A central question is how much, and what kinds of cultural knowledge should be included in a dictionary. Pawley concludes that general ethnographic dictionaries need the cooperation of several well-informed specialists, and that the linguist’s fluency in the target language, as well as the native-speaker collaborator’s fluency in the defining language, are of considerable importance. Above all, he stresses that compiling such a dictionary is inevitably a long-term project in its own right, which cannot readily be accommodated within the tightly constrained time frame of a typical documentation project.

Part IV. Interaction with speech communities

The fourth part of this book is devoted to the interaction of linguists with the speech community where field work is carried out. Even Hovdhaugen and Åshild Næss open the discussion by drawing attention to problems that may arise when the field work becomes an issue in local struggles of power (“Language is power: The impact of fieldwork on community politics”). They describe a situation of conflict they witnessed during their work in the Solomon Islands, which shows that even the best preparation and long-term experience in an area cannot always prevent the field worker, however unwillingly, from becoming the cause of conflicts in the community. Linguists have to be aware of the impact their work may have on local politics, and questions of the linguist’s behaviour in possible situations of dissent within the community should be given consideration when planning a language documentation project.
In Chapter 13, “Sustaining Vurës: Making products of language documentation accessible to multiple audiences”, Catriona Hyslop Malau presents her experiences with the production and presentation of a DVD documenting traditional plaiting and fishing techniques in the Vurës community, accompanied by screenings of Vurës dictionary entries of relevant specialized vocabulary. She shows how such professionally processed video material can be of significant value as an output of a language documentation project. The very fact that video material is processed professionally signals the worthiness of the community’s linguistic heritage and traditional knowledge, and encourages the speech community to uphold them. But professionally produced DVDs also enhance the visibility of the linguistic and cultural diversity of Vanuatu, and of the general enterprise of language documentation, by making it available to a wider national and global audience.

The issue of which indigenous communicative practices are to be documented has been a recurring one in the language documentation literature. But Mosel (2004) points out that the language documentation process itself may lead to the creation of previously non-existent text genres in the speech community. In her contribution “Filming with native speaker commentary”, Anna Margetts discusses one such genre, that of film commentaries supplied by native speakers to video footage taken during documentation. Such commentaries differ in discourse structure from other genres, and may include syntactic constructions that are otherwise unusual. Furthermore, the commentary may include samples of specialist vocabulary otherwise overlooked. Finally, such commentaries provide a level of immediacy and naturalness to film material, raising its acceptance within the speech community itself. In the author’s own project, the value of such commentaries was only discovered more or less in passing, and she notes in retrospect that this technique could have been fruitfully applied to several video recordings, adding to the repertoire of communicative events that are documented, and at the same time raising levels of community involvement and identification with the project.

References

Atkinson, Quentin. 2011. Phonemic diversity supports a serial founder effect model of language expansion from Africa. Science 332(6027):346–349.
Foley, William A. 2003. Genre, register and language documentation in literate and preliterate communities. In Language Documentation and Description, Volume 1, ed. Peter K. Austin, 85–98. London: School of Oriental and African Studies.
Grinevald, Colette. 2003. Speakers and documentation of endangered languages. In Language Documentation and Description, Volume 1, ed. Peter K. Austin, 52–72. London: School of Oriental and African Studies.
Haig, Geoffrey, and Stefan Schnell. 2011. Annotations using GRAID (Grammatical Relations and Animacy in Discourse). Introduction and guidelines for annotators. Version 6.0. Available at: http://vc.uni-bamberg.de/moodle/course/view.php?id=9488.
Hale, Ken, Michael Krauss, Lucille J. Watahomigie, Akira Y. Yamamoto, Colette Craig, LaVerne Masayesva Jeanne, and Nora C. England. 1992. Endangered languages. Language 68:1–42.
Harrison, K. David, David Rood, and Arienne Dwyer. 2008. A world of many voices: Editors’ introduction. In Lessons from Documented Endangered Languages, eds. K. David Harrison, David Rood, and Arienne Dwyer, 1–13. Amsterdam, Philadelphia: John Benjamins.
Himmelmann, Nikolaus P. 1998. Documentary and descriptive linguistics. Linguistics 36:161–195.
Himmelmann, Nikolaus P. 2006. Language documentation: What is it, and what is it good for. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 1–30. Berlin, New York: Mouton de Gruyter.
Himmelmann, Nikolaus P. 2008. Reproduction and preservation of linguistic knowledge: Linguistics’ response to language endangerment. Annual Review of Anthropology 37:337–350.
MacKenzie, David. 1961. Kurdish Dialect Studies, Volume 1. Oxford: Oxford University Press.
Mosel, Ulrike. 2004. Inventing communicative events: Conflicts arising from the aims of language documentation. Language Archives Newsletter 1(3):3–4.
Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike. 2008. Putting oral narratives into writing – experiences from a language documentation project in Bougainville, Papua New Guinea. Presentation at the Simposio Internacional Contacto de lenguas y documentación, Buenos Aires, CAIYT (August 2008), available online as ‘Oral and written versions of Teop legends’ at http://www.linguistik.uni-kiel.de/mosel_publikationen.htm#download (accessed 2011/02/26).
Trilsbeek, Paul, and Peter Wittenburg. 2006. Archiving challenges. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 311–335. Berlin, New York: Mouton de Gruyter.
Wälchli, Bernhard. 2009. Data reduction typology and the bimodal distribution bias. Linguistic Typology 13:77–94.
Woodbury, Anthony C. 2003. Defining documentary linguistics. In Language Documentation and Description, Volume 1, ed. Peter K. Austin, 35–51. London: School of Oriental and African Studies.

Part I. Theoretical issues in language documentation

Chapter 2 Competing motivations for documenting endangered languages∗ Frank Seifart

1. Introduction

Many aspects of language documentations are crucially shaped by the diverse motivations of the documenters. This chapter addresses a number of questions with the aim of clarifying the role of such motivations in language documentation, including:

(i) What are the motivations for documenting endangered languages?
(ii) Which requirements for language documentations stem from them?
(iii) How can these requirements be accommodated in the format of language documentations?
(iv) Where do these motivations give rise to competing documentation priorities?

With respect to the first question, four possible motivations for documenting endangered languages may be identified:¹

– documentation to preserve human cultural heritage
– documentation to enhance the empirical basis of linguistics
– documentation by and for the speech community
– documentation to study language contact

These four motivations are theoretical abstractions, which are mixed in practice. This abstraction turns out to be useful, however, since making these motivations explicit allows one to study their specific requirements for language documentations. For instance, the empirical basis of linguistics may require unedited data from spoken language, while reading material for the use in a speech community may require edited texts, cleared of repetitions, false starts, etc. These requirements shape important aspects of language documentation in crucial ways. They affect the two basic components of language documentation: the collection of primary data, on the one hand, and the apparatus, i.e. annotations, descriptive grammatical sketches, dictionaries, etc., on the other. This chapter thus also contributes to clarifying these two central concepts of documentary linguistics. These issues are discussed in this chapter with examples from documentation practices from the past 10 years, mostly within the DoBeS program of the Volkswagen Foundation. Section 2 introduces the theoretical framework of language documentation. Section 3 discusses four motivations for language documentation and their implications for the collection of primary data and the apparatus of language documentation. Section 4 summarizes some of the potentially competing requirements and Section 5 concludes this chapter.

∗ I am grateful to the editors of this volume for very useful comments that helped improve this chapter.

¹ The fourth motivation, the study of language contact, may be a less common one than the first three. However, as argued in Section 3.4 below, there are a number of good reasons to document endangered languages to study language contact, with interesting implications for documentation priorities.

2. The format of language documentation

Within documentary linguistics, the format of language documentations is typically conceived of as in Table 1 (Himmelmann 1998, 2006). A key feature of this format is the clear separation of primary data, i.e. the contents, from any descriptive or analytical statements about these data, which are placed in the apparatus of the documentation.

The first major component, the primary data, is organized into sessions, where one session is a recording of a communicative event. One such session ideally displays a unity of participants, place, and time, and may correspond to an observed event that occurred naturally, e.g. the performance of a song, as well as to a staged event, e.g. an elicitation session documenting metalinguistic knowledge. As discussed in the following section, the selection of communicative events that make up the contents of a language documentation is crucially shaped by the motivations of the documentation project.

The second major component of language documentation is an apparatus that is divided into one section that contains documents related to individual sessions and one for the documentation as a whole. The apparatus comprises all information that is necessary to access the primary data, in particular information that is necessary for someone who is not familiar with the language.


This includes annotations of primary data (per session), which in turn crucially include translations. It also includes (for the documentation as a whole) general access resources, descriptive analyses, and metadata. As we shall see further below, a number of requirements stemming from different motivations pertain to these aspects of language documentation.

Table 1. Format of language documentations (Himmelmann 1998, 2006)

Contents (primary data)
    Recordings/records (sessions) of observable linguistic behaviour and metalinguistic knowledge

Apparatus, per session
    Metadata
    Annotations: transcription; translation; further linguistic and ethnographic glossing and commentary

Apparatus, for the documentation as a whole
    Metadata
    General access resources: introduction; orthographic conventions; glossing conventions; indices; links to other resources; ...
    Descriptive analysis: ethnography; descriptive grammar; dictionary
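The two-part format in Table 1 can be pictured as a simple data structure. The following Python sketch is only an illustration under stated assumptions: the class names, field names, and file names are hypothetical and are not taken from Himmelmann's proposal or from any archive's actual schema. It shows how primary data (sessions) can be kept separate from the per-session and documentation-wide apparatus, and how one might list sessions that an outsider cannot yet access.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Session:
    """One recorded communicative event plus its per-session apparatus."""
    recording: str                                   # hypothetical file name (primary data)
    metadata: dict[str, str] = field(default_factory=dict)
    transcription: str = ""                          # annotation
    translation: str = ""                            # annotation
    commentary: list[str] = field(default_factory=list)

    def is_accessible(self) -> bool:
        # Assumption of this sketch: a session is accessible to someone who
        # does not know the language only if it is transcribed and translated.
        return bool(self.transcription and self.translation)


@dataclass
class Documentation:
    """Contents (sessions) plus the apparatus for the documentation as a whole."""
    sessions: list[Session] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)
    general_access: dict[str, str] = field(default_factory=dict)        # e.g. glossing conventions
    descriptive_analysis: dict[str, str] = field(default_factory=dict)  # e.g. sketch grammar

    def sessions_lacking_access(self) -> list[str]:
        # Recordings whose per-session apparatus is still incomplete.
        return [s.recording for s in self.sessions if not s.is_accessible()]


doc = Documentation(sessions=[
    Session("song_01.wav", transcription="(text)", translation="(text)"),
    Session("elicitation_03.wav"),
])
print(doc.sessions_lacking_access())  # ['elicitation_03.wav']
```

Keeping the annotations on the session rather than inside the recording mirrors the separation of contents and apparatus described above; a real project would store these in an archive format rather than in memory.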

3. Motivations for documenting endangered languages

The term ‘motivation’ of a language documentation is used here in a technical sense as resulting from the intended aims of a documentation, in particular its intended uses, which in turn depend heavily on the intended user groups (see also Austin 2003: 8; Woodbury 2003: 46). The following subsections identify and discuss four such motivations.[2] It should be emphasized that I wish to refrain from any judgment with respect to the (moral or other) validity of these motivations (see the useful discussion in Fast 2007), but rather focus on clarifying their respective requirements for the contents and apparatus of language documentations.

[2] The idea of examining language documentation from the perspective of underlying motivations stems from Seifart (2000: 25–48), where a similar set of motivations was discussed with little reference to actual documentation projects. The current paper modifies the concept of motivations and discusses them with respect to experience from over 10 years of language documentation practice.

3.1. Documentation to preserve human cultural heritage

The general motivation to document endangered languages discussed in this section is closely linked to a particular concept of language that may be called the ‘ethnolinguistic’ view, often attributed to Wilhelm von Humboldt (see, for instance, Zimmermann 1991: 297). This concept asserts that every language embodies a particular worldview and that language is closely interrelated with culture and culture-specific thinking, as emphasized also by Whorf (1956). Based on such a perspective on language, linguistic diversity is seen as a treasure that is of high value for humanity as a whole. This perspective on endangered languages is present in many writings aimed at raising awareness of the problem of language loss and endangered languages since the early 1990s (see, e.g. Mühlhäusler 1990; Thieberger 1990; Wurm 1991; Zimmermann 1991; Hale et al. 1992; Hale 1995; Krauss 1995; Maffi 1998; Crystal 2000; Nettle and Romaine 2000; Grenoble and Whaley 2006; for critical discussions, see Hill 2002; Errington 2003; Himmelmann 2008: 343–345).

Within an approach to preserving linguistic diversity as human cultural heritage, the notion of linguistic ecology emphasizes the importance of linguistic diversity involving the active use of multiple and diverse languages on a global level as a healthy and natural state, drawing heavily on the parallel with biological species (see Romaine 2007: 127–130 for a recent overview). To preserve this state, this approach advocates language maintenance and revitalization efforts, rather than documentations, which are sometimes viewed as “artificial environments” (Mühlhäusler 1996: 164; Romaine 2007: 127). However, products derived from language documentations, such as educational materials or movies in the endangered languages, have been used in many cases for language maintenance and revitalization efforts (see Section 3.3).
Also, the mere existence of any written, audio, or video document in the language may contribute to language maintenance by positively influencing the prestige of the language.

Language documentation in the sense of Section 2 may also contribute to preserving aspects of human cultural heritage in a different sense by preserving a long-term record of linguistic and cultural traditions and practices. This motivation to document endangered languages probably underlies all language documentation projects, although maybe to different degrees, and it has implications for the contents and apparatus of the resulting language documentation.

With respect to contents, i.e. the selection of events to be recorded and included as primary data in a language documentation, this motivation implies a priority for the documentation of the diversity of linguistically expressed cultural production such as verbal art (see, for instance, Evans 2010: 182–204), rather than, for example, elicitation of paradigms or acceptability judgments. For instance, Morey and Schöpf’s (2009) documentation focuses on the traditional songs and poetry of various languages of Upper Assam. In a similar spirit, Jung et al.’s (2010) and Burenhult and Levinson’s (2010) documentations focus on culture-specific concepts embodied in place naming and landscape terminology. Other documentation projects include foci on aspects of material culture, e.g. canoe building and fishing (Cablitz 2007, this volume), house building (Mosel et al. 2007), or weaving (Malau, this volume). All of these efforts clearly contribute to preserving aspects of human cultural heritage. Modern lexicon software, such as that discussed in Cablitz (2007, this volume), additionally allows the documentation of indigenous cultural knowledge by creating linked networks of entries incorporating encyclopedic knowledge.

With respect to the apparatus, i.e. the part of the documentation that allows access to the data, the motivation to preserve human cultural heritage implies particular care for long-term archiving and global accessibility. Besides purely technical aspects of archiving, this also involves transparent and consistent metadata that adhere to internationally agreed-upon standards.
A major contribution of the DoBeS initiative in this context is the development and establishment of the IMDI standard for metadata (http://www.mpi.nl/IMDI/), a now widely accepted cataloguing scheme for linguistic resources, which facilitates global accessibility of metadata and associated linguistic resources. For a documentation to be widely accessible, it is also crucial to use an internationally widely used language in the metadata as well as for translation and other annotation.


3.2. Documentation to enhance the empirical basis of linguistics

At least some sub-disciplines of linguistics strive for an empirical basis that contains data from as many human languages as possible. Most notable among these are typological approaches, which this section focuses on. The need for samples containing large numbers of languages stems from the aim to describe the constraints on the variability of human language in general. Claims about human language are obviously stronger the more languages the sample includes, notwithstanding methods to minimize undesired biases in limited samples (see Rijkhoff and Bakker 1998; Bickel 2008). Most typological surveys (e.g., Haspelmath et al. 2008) operate with descriptive statements, e.g. about word order or argument alignment, rather than primary data that make up the contents of a language documentation in the sense of Section 2.[3] Such descriptive statements may be placed in the apparatus of a language documentation, more specifically in the descriptive grammar, which serves in the context of the language documentation as a means to make the primary data accessible (see Mosel 2006b). Enhancing the empirical basis for linguistic typology within a language documentation thus implies analyzing primary data and formulating descriptive statements, possibly beyond what is necessary to merely access the primary data.

Typology has developed certain traditions of annotating data, which may be taken as specific requirements for another component of the apparatus of language documentations. Among these are the now widely accepted Leipzig Glossing Rules, a set of conventions for interlinear morpheme-by-morpheme glosses.[4] A further development in this spirit is the GRAID project described in Haig, Schnell, and Wegener (this volume), which enhances the empirical database for a number of core linguistic research questions by additional, standardized annotations.
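Standardized glossing conventions lend themselves to simple automatic checks. As a rough illustration, the Python sketch below tests two alignment conventions of the Leipzig Glossing Rules: the object-language line and the gloss line should contain the same number of words, and each word and its gloss should contain the same number of morpheme boundaries (hyphens, or `=` for clitics). The function name is hypothetical, and the example forms are invented placeholders, not data from any language.

```python
def check_alignment(text_line: str, gloss_line: str) -> list[str]:
    """Return a list of alignment problems; an empty list means the lines align."""
    words = text_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        # Without word-by-word alignment, morpheme counts cannot be compared.
        return [f"word count differs: {len(words)} vs {len(glosses)}"]
    problems = []
    for word, gloss in zip(words, glosses):
        for boundary in "-=":
            if word.count(boundary) != gloss.count(boundary):
                problems.append(
                    f"number of {boundary!r} boundaries differs in {word!r} / {gloss!r}"
                )
    return problems


# Invented placeholder forms, for illustration only.
print(check_alignment("kala-mi-t sona", "house-PL-LOC see"))  # []
print(check_alignment("kala-mi-t sona", "house-PL see"))
```

A check of this kind only verifies the formal alignment of the two lines; whether the glosses themselves are appropriate remains a matter of linguistic analysis.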
With respect to contents, Himmelmann (1998: 177–179) proposes selecting communicative events to be included in a language documentation that represent different points on the ‘spontaneity parameter’. This parameter distinguishes communicative events that can be expected to display different linguistic structures. For instance, the verbal system in Potawatomi conversation (high on the spontaneity parameter) is drastically different from that of Potawatomi narratives (low on the spontaneity parameter) (Buszard-Welcher 2010). The application of the spontaneity parameter to ensure a certain completeness of data from a linguistic point of view may thus be taken as a requirement for language documentations under the motivation of enhancing the empirical basis of linguistics more generally, beyond the specific requirements of, e.g., typological studies.

[3] There are recent approaches to using textual data for typological studies (Cysouw and Wälchli 2007). These approaches have so far used mostly parallel texts from translations of the Bible or popular fiction. The approach can be extended to a kind of parallel text contained in the apparatus of language documentations, namely transcriptions and their aligned translations.

[4] See http://www.eva.mpg.de/lingua/resources/glossing-rules.php.

3.3. Documentation by and for the speech community

There is by now a large literature on the involvement of the speech community in language documentation projects, which discusses the complex issues of fieldwork ethics, cooperative research, and agency of native speaker collaborators (see, for instance, Franchetto 2007; Czaykowska-Higgins 2009; Cablitz, this volume; Malau, this volume). This body of literature amply illustrates that a language documentation project may be at least partially motivated by expectations of the speech community whose language is the object of documentation. These expectations may of course vary widely from one community to another, but there are a number of general requirements that can be considered here.

With respect to contents, speech communities at advanced stages of language endangerment often express the wish to document culturally highly valued texts that are no longer transmitted to younger generations by traditional means, for example, traditional songs. Documentation by and for the speech community may further necessitate the production of versions of these recordings that are accessible to the speech community, e.g. on CDs. For instance, in the context of an Australian documentation project (Sasse et al. 2008), copies on CDs of recordings of traditional songs made as part of the documentation became very popular among younger community members. Copies on DVDs, CDs, or audio cassettes may also be used by a speech community to transmit traditional knowledge to geographically distant parts of the community where this knowledge has already become obsolete. This was successfully done with DVDs documenting Vurës cultural events involving sea worm gathering in Vanuatu (Malau, this volume) and DVDs documenting traditional Bora songs in Peru and Colombia (Seifart et al. 2009).


There is also often an expectation that a documentation project will produce written documents to be used as reading materials in the community, e.g. in community schools. If the production of such documents is based on recorded, spoken texts, these texts usually require editing. This process of editing involves the elimination of speech errors, such as false starts, repetitions, and possibly code switching (Mosel 2006a: 80). Such edited written texts may be considered part of the annotation, i.e. the apparatus. Alternatively, they may also be considered separate records of the language, different from the underlying recorded text (if it was recorded at all), i.e. as part of the primary data and thus the contents. Another reason for considering edited versions of recorded texts as part of the contents rather than the apparatus is that the editing process reveals metalinguistic intuitions by native speakers about, e.g., what constitutes a false start or an unnecessary repetition.

With respect to the apparatus, there is at least one indisputable requirement that stems from the motivation to document for and by the speech community. This is the requirement to use a local language, known to the speech community, as the metalanguage for annotation, translation, metadata, and any further text in the apparatus, and as target language in a language documentation dictionary (see Cablitz, this volume). This language is not necessarily a widely used language on a global level (see Section 3.1), but a language used locally by the speech community. Ironically, this is usually the dominant language that is displacing the endangered language that is being documented. A related requirement for the apparatus is to represent data with an orthographic system that is accepted by the speech community, rather than (or in addition to) representations of the data that are ideally suited for linguistic analyses, such as phonemic transcriptions.

3.4. Documentation to study language contact

This last motivation to be discussed here is seldom explicitly on the agenda of language documentation projects, and it is probably never the primary motivation. However, endangered languages are almost by definition in intensive contact with at least one other language, the one that speakers are shifting to, the only exception being the rare case of physical extinction of a monolingual speech community. This means that the documentation of endangered languages almost inevitably contains highly valuable data for studies on language contact, including code switching and contact-induced language change. One aspect of language contact studies that is particularly relevant here is the study of the process of language shift, i.e. the process that leads to language endangerment. This motivation goes beyond the motivation to enhance the empirical basis of linguistics in that language contact, as one aspect of cultural contact, is intrinsically an interdisciplinary field, which requires a range of linguistic as well as non-linguistic data.

The linguistic data required by the study of language contact may shape the contents of language documentations in a number of interesting and potentially conflicting ways. Firstly, any data from the contact language(s) are highly important, in addition to data from the endangered language that is the focus of documentation. This includes not only loanwords and code switching within texts that are primarily in the endangered language, but also detailed information on the local variety of the contact language(s). Secondly, it is sometimes claimed that advanced stages of language endangerment lead to structural-linguistic change such as exaggerated variation (Campbell and Muntzel 1989; Sasse 1992; Tsitsipis 1998), but these claims are mostly based on very few, often anecdotally reported, case studies. Any data documenting such deviation from the norm – which are possibly disregarded under other motivations as speech errors – are highly important to test such claims.

With respect to the annotation of texts, i.e. an aspect pertaining to the apparatus, some documentation projects have already implemented a means of enhancing the usability of language documentation for language contact studies by introducing additional annotation tiers, which specify the source language for each morpheme in the primary data (e.g. Bickel et al. 2009; Güldemann et al. 2010).
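To make the idea of a source-language tier concrete, the following Python sketch pairs each morpheme of a text with a label naming the language it comes from and computes the share of morphemes per source language. This is a sketch under stated assumptions: the function name, the labels `L1` and `L2`, and the placeholder morphemes are hypothetical and do not reproduce the annotation schemes of the projects cited above.

```python
from collections import Counter


def source_shares(morphemes: list[str], source_tier: list[str]) -> dict[str, float]:
    """Share of morphemes per source-language label on a parallel tier."""
    if len(morphemes) != len(source_tier):
        raise ValueError("the source tier needs exactly one label per morpheme")
    total = len(source_tier)
    # Counter is empty for empty input, so no division takes place in that case.
    return {label: n / total for label, n in Counter(source_tier).items()}


# Hypothetical tiers: L1 = endangered language, L2 = contact language.
morphemes = ["m1", "m2", "m3", "m4"]
source_tier = ["L1", "L1", "L2", "L1"]
print(source_shares(morphemes, source_tier))  # {'L1': 0.75, 'L2': 0.25}
```

Because the labels sit on their own tier, such counts can be computed per session or per speaker without changing the transcription itself.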
In addition to multilingual data and their annotation, most language contact studies crucially also require sociolinguistic and other extralinguistic information, which would also be part of the apparatus. Studies of language endangerment, for instance, operate with sociolinguistic descriptions of the distribution of the use of the endangered and dominant languages across different domains and an evaluation of the quality of these domains (Himmelmann 2009: 46). The ‘ethnolinguistic vitality’ approach to language endangerment, which was developed in social psychology, additionally takes into account linguistic attitudes, which are assessed by standardized questionnaires, as well as a range of other information, including demographic and geographic information (Giles, Bourhis, and Taylor 1977; Landry and Allard 1994; Edwards 2010).[5]

[5] It should be noted that the ethnolinguistic vitality framework was developed in the context of migrant communities in industrialized countries, often with languages that are far from being endangered as a whole. It would need to be extended to be applied to language endangerment settings involving indigenous languages.

4. Where motivations compete

In principle, the format of language documentations is flexible enough to incorporate all requirements mentioned in the preceding sections. For instance, edited versions of spoken discourse can be placed in the corpus of primary data in addition to unedited versions, and a translation into a locally used language can be inserted in addition to a translation into an internationally widely used language. Thus potential conflicts between competing motivations do not arise for principled reasons. However, each documentation project will have to set its priorities, given intrinsically limited resources in terms of time, labor, and finances, and these will be set according to the underlying motivations. There seem to be three broad areas where potential conflicts arising from competing motivations emerge particularly clearly and that are particularly relevant because they pertain to important and labor-intensive aspects of documentation:

(i) The use of a local language vs. an international language as the metalanguage used in the apparatus for translation, commentary, etc. The accessibility of the documentation crucially depends on the choice of its metalanguage. If the language that the speech community prefers as the metalanguage of their documentation is not one of the few internationally widely used languages, then the choice of just one language means seriously restricting the accessibility of the documentation for either the speech community or the international community. For instance, the documentation of the languages of the People of the Center (Seifart et al. 2009) in (Spanish-speaking) Peru opted for allocating its limited resources to Spanish translations, metadata, etc., facilitating access for the speech community, but making access difficult for non-Spanish speaking users. Other documentation projects (e.g. Haig et al. 2011) use only English, maximizing accessibility at an international level, but making access for members of the speech community difficult.

(ii) Naturally occurring spoken language vs. edited data. Linguistic data that are produced as naturally as possible are heralded as the most valuable for many linguistic as well as other scientific studies that make use of the full range of information present in spoken language, including prosody, speech rate, repetitions, etc. (see, e.g. Finnegan 2008). This applies similarly to language contact studies that require data on code switching and other influences from a contact language. On the other hand, data that are edited according to normative intuitions of speakers, i.e. that are cleared of speech errors, repetitions, code switching, etc., may be appropriate or even required for reading materials used in community schools, and to a lesser degree for some aspects of documenting cultural heritage and structural-linguistic studies. Producing edited versions of texts is a painstaking and time-consuming task, which few documentation projects have undertaken for more than a small number of texts. An exception is the Teop language documentation (Mosel et al. 2007), which has produced a large corpus (probably the largest) of texts edited in this way, which are archived alongside the original, unedited versions.

(iii) Specific descriptive information in the apparatus. Within a language documentation, the apparatus serves to facilitate access to the primary data, allowing for their further usability and analysis. However, a number of motivations to document endangered languages require specific descriptive information that may not be contained in such an apparatus, such as sociolinguistic descriptions of domain-specific language use or specific linguistic-structural information for typological surveys about, e.g., word order. The question is to what extent a language documentation project can be expected to provide such information rather than spending its limited resources on, for instance, enhancing the collection of primary data.
A guideline to mediate between these competing motivations might be to afford higher priority to providing information in the apparatus that cannot be deduced, however painstakingly, from the corpus of primary data at a later stage. Thus, for instance, details of word order can be determined from a large text collection after the conclusion of a documentation project, while the distribution of languages across domains cannot.

5. Summary and conclusion

This chapter discussed different intended aims and user groups of language documentation and subsumed these under four motivations for language documentation. Each motivation was analyzed in terms of its requirements for language documentations, conceived as consisting of a collection of primary data, the contents, and an apparatus. Some motivations include specific requirements on contents or apparatus, such as standardized morpheme glosses for typological studies. Other requirements are more general in nature, such as the preference for textual primary data that closely conform to normative intuitions, without speech errors or code switching. This perspective on language documentation showed a complex interaction of different types of requirements, which the format of language documentation can in principle accommodate, but where in practice priorities have to be set.

References

Austin, Peter K. 2003. Introduction. In Language Documentation and Description, Volume 1, ed. Peter K. Austin, 6–12. London: School of Oriental and African Studies.
Bickel, Balthasar. 2008. A refined sampling procedure for genealogical control. Sprachtypologie und Universalienforschung 61:221–233.
Bickel, Balthasar, Goma Banjade, Toya N. Bhatta, Martin Gaenzle, Netra P. Paudyal, Manoj Rai, Novel Kishore Rai, Ichchha Purna Rai, and Sabine Stoll, eds. 2009. Audiovisual Chintang corpus (ca. 150,000 words transcribed and translated, of which ca. 65,000 glossed and translated, plus paradigm sets and grammar sketches, ethnographic descriptions, photographs). Nijmegen, Leipzig: DoBeS, Universität Leipzig. http://www.uni-leipzig.de/~ff/cpdp/.
Burenhult, Niclas, and Stephen C. Levinson, eds. 2010. Tongues of the Semang: documenting endangered languages and indigenous knowledge among foragers of the Malay Peninsula. Nijmegen: DoBeS. http://mpi.nl/DOBES/projects/semang.
Buszard-Welcher, Laura. 2010. Lessons from Potawatomi legacy documentation. In Language Documentation: Practice and Values, eds. Lenore A. Grenoble and Louanna Furbee, 67–74. Amsterdam, Philadelphia: John Benjamins.
Cablitz, Gabriele, ed. 2007. Towards a multimedia dictionary of the Marquesan and Tuamotuan languages of French Polynesia. Nijmegen: DoBeS. http://mpi.nl/DOBES/projects/marquesan.


Campbell, Lyle, and Martha C. Muntzel. 1989. The structural consequences of language death. In Investigating Obsolescence. Studies in Language Contraction and Death, ed. Nancy C. Dorian, 181–196. Cambridge: Cambridge University Press.
Crystal, David. 2000. Language Death. Cambridge: Cambridge University Press.
Cysouw, Michael, and Bernhard Wälchli. 2007. Parallel texts: Using translational equivalents in linguistic typology. Language Typology and Universals 60:95–99. doi:10.1524/stuf.2007.60.2.95.
Czaykowska-Higgins, Ewa. 2009. Research models, community engagement, and linguistic fieldwork: Reflections on working within Canadian indigenous communities. Language Documentation & Conservation 3:15–50.
Edwards, John R. 2010. Minority Languages and Group Identity: Cases and Categories. Amsterdam, Philadelphia: John Benjamins.
Errington, Joseph. 2003. Getting language rights: The rhetorics of language endangerment and loss. American Anthropologist 105:723–732. doi:10.1525/aa.2003.105.4.723.
Evans, Nicholas. 2010. Dying Words: Endangered Languages and What They Have to Tell Us. Chichester, Malden, Oxford: Wiley-Blackwell.
Fast, Annicka. 2007. Moral incoherence in documentary linguistics: Theorizing the interventionist aspect of the field. In Proceedings of the Fifth University of Cambridge Postgraduate Conference in Language Research, eds. Naomi Hilton, Rachel Arscott, Katherine Barden, Arti Krishna, Sheena Shah, and Meg Zellers, 64–71. Cambridge: Cambridge Institute of Language Research.
Finnegan, Ruth. 2008. Data – but data from what? In Language Documentation and Description, Volume 5, ed. Peter K. Austin, 13–28. London: School of Oriental and African Studies.
Franchetto, Bruna. 2007. A comunidade indígena como agente da documentação lingüística. Revista de Estudos e Pesquisas 4:11–32.
Giles, Howard, Richard Y. Bourhis, and Donald M. Taylor. 1977. Toward a theory of language in ethnic group relations. In Language, Ethnicity and Intergroup Relations, ed. Howard Giles, 307–325. London: Academic Press.
Grenoble, Lenore A., and Lindsay J. Whaley. 2006. Saving Languages: An Introduction to Language Revitalization. Cambridge: Cambridge University Press.


Güldemann, Tom, Alena Witzlack-Makarevich, Martina Ernszt, and Sven Siegmund, eds. 2010. A text documentation of N|uu. Leipzig, London: MPI-EVA, ELDP. http://www.eva.mpg.de/lingua/research/nuu.php.
Haig, Geoffrey, Ludwig Paul, and Philip Kreyenbroek, eds. 2011. Documentation of Gorani, an endangered language of West Iran. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/gorani.
Hale, Ken. 1995. On the human value of local languages. In Proceedings of the XVth International Congress of Linguistics, eds. André Crochetière, Jean-Claude Boulanger, and Conrad Ouellon, 17–31. Quebec: Les Presses de l’Université Laval.
Hale, Ken, Michael Krauss, Lucille J. Watahomigie, Akira Y. Yamamoto, Colette Craig, LaVerne Masayesva Jeanne, and Nora C. England. 1992. Endangered languages. Language 68:1–42.
Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie, eds. 2008. The World Atlas of Language Structures Online. Munich: Max Planck Digital Library. http://wals.info/.
Hill, Jane H. 2002. “Expert rhetorics” in advocacy for endangered languages: Who is listening and what do they hear? Journal of Linguistic Anthropology 12:119–133. doi:10.1525/jlin.2002.12.2.119.
Himmelmann, Nikolaus P. 1998. Documentary and descriptive linguistics. Linguistics 36:161–195.
Himmelmann, Nikolaus P. 2006. Language documentation: What is it, and what is it good for. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 1–30. Berlin, New York: Mouton de Gruyter.
Himmelmann, Nikolaus P. 2008. Reproduction and preservation of linguistic knowledge: Linguistics’ response to language endangerment. Annual Review of Anthropology 37:337–350.
Himmelmann, Nikolaus P. 2009. Language endangerment scenarios: A case study from northern Central Sulawesi. In Endangered Languages of Austronesia, ed. Margaret Florey, 45–72. Oxford: Oxford University Press.
Jung, Dagmar, Julia Colleen Miller, Pat Moore, Gabriele Schwiertz, Carolina Pasamonik, Kate Hennessey, and Olga Lovick, eds. 2010. Beaver knowledge systems: documentation of a Canadian First Nation language from a placenames’ perspective. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/beaver.
Krauss, Michael. 1995. Language extinction catastrophe just ahead: Should linguists care? In Proceedings of the XVth International Congress of Linguistics, eds. André Crochetière, Jean-Claude Boulanger, and Conrad Ouellon, 43–48. Quebec: Les Presses de l’Université Laval.
Landry, Rodrigue, and Réal Allard, eds. 1994. Ethnolinguistic Vitality. Berlin, New York: Mouton de Gruyter.
Maffi, Luisa. 1998. Linguistic and biological diversity: The inextricable link. Terralingua Discussion Paper 3. http://www.terralingua.org/activities/DiscPapers/DiscPaper3.html.
Morey, Stephen, and Jürgen Schöpf, eds. 2009. The Traditional Songs and Poetry of Upper Assam – A Multifaceted Linguistic and Ethnographic Documentation of the Tangsa, Tai and Singpho Communities in Margherita, Northeast India. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/singpho_tai_tangsa.
Mosel, Ulrike. 2006a. Fieldwork and community language work. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike. 2006b. Sketch grammars. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 301–309. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike, Roslyn Purupuru, Jessika Reinig, Alexander Radke, Ruth Saovana Spriggs, Marcia Schwartz, and Yvonne Thiesen, eds. 2007. Teop Documentation. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/teop.
Mühlhäusler, Peter. 1990. Preserving languages or language ecologies? A top-down approach to language survival. Oceanic Linguistics 31(2):163–180.
Mühlhäusler, Peter. 1996. Linguistic Ecology. Language Change and Linguistic Imperialism in the Pacific Region. London, New York: Routledge.
Nettle, Daniel, and Suzanne Romaine. 2000. Vanishing Voices: The Extinction of the World’s Languages. Oxford: Oxford University Press.
Rijkhoff, Jan, and Dik Bakker. 1998. Language sampling. Linguistic Typology 2:263–314. doi:10.1515/lity.1998.2.3.263.
Romaine, Suzanne. 2007. Preserving endangered languages. Language and Linguistics Compass 1:115–132. doi:10.1111/j.1749-818X.2007.00004.x.
Sasse, Hans-Jürgen. 1992. Language decay and contact-induced change: Similarities and differences. In Language Death. Theoretical and Factual Explorations with Special Reference to East Africa, ed. Matthias Brenzinger, 59–80. Berlin, New York: Mouton de Gruyter.


Sasse, Hans-Jürgen, Nicholas D. Evans, Linda Barwick, Bruce Birch, Murray Garde, Joy Williams, and Janet Fletcher, eds. 2008. Iwaidja Documentation. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/iwaidja.
Seifart, Frank. 2000. Grundfragen bei der Dokumentation bedrohter Sprachen. Köln: Institut für Sprachwissenschaft der Universität zu Köln.
Seifart, Frank, Doris Fagua, Jürg Gasché, and Juan Alvaro Echeverri, eds. 2009. A multimedia documentation of the languages of the People of the Center. Online publication of transcribed and translated Bora, Ocaina, Nonuya, Resígaro, and Witoto audio and video recordings with linguistic and ethnographic annotations and descriptions. Nijmegen: DoBeS. http://corpus1.mpi.nl/qfs1/media-archive/dobes_data/Center/Info/WelcomeToCenterPeople.html.
Thieberger, Nicholas. 1990. Language maintenance: Why bother? Multilingua 9:333–358.
Tsitsipis, Lukas D. 1998. A Linguistic Anthropology of Praxis and Language Shift: Arvanitika (Albanian) and Greek in Contact. Oxford: Clarendon Press.
Whorf, Benjamin Lee. 1956. Language, Thought and Reality. Edited with an introduction by John B. Carroll. New York: John Wiley and Sons.
Woodbury, Anthony C. 2003. Defining documentary linguistics. In Language Documentation and Description, Volume 1, ed. Peter K. Austin, 35–51. London: School of Oriental and African Studies.
Wurm, Stephen A. 1991. Language death and disappearance: Causes and circumstances. In Endangered Languages, eds. Robert H. Robins and Eugenius M. Uhlenbeck, 1–18. Oxford: Berg.
Zimmermann, Klaus. 1991. ‘Babel wiederlesen’ und die Vielfalt der Sprachen fördern. Jahrbuch Preußischer Kulturbesitz 28:289–301.

Chapter 3
Evolving challenges in archiving and data infrastructures
Daan Broeder, Han Sloetjes, Paul Trilsbeek, Dieter van Uytvanck, Menzo Windhouwer and Peter Wittenburg

1. Introduction

Research in the humanities is increasingly based on data. This change in attitude and research practice is driven to a large extent by the availability of small, cheap, yet high-quality recording equipment (video cameras, audio recorders) as well as by advances in information technology (faster networks, larger data storage, greater computing power, suitable software). At some institutes, such as the Max Planck Institute for Psycholinguistics, a clear trend towards an all-digital domain that makes use of state-of-the-art technology for research purposes could already be identified in the 1990s. This change of habits was one of the reasons for the Volkswagen Foundation to establish the DoBeS program in 2000, with a clear focus on language documentation based on recordings as primary material.

The fact that more and more data is being collected poses challenges for everyone who deals with this data in one way or another. The researcher who collects the material needs to maintain a coherent administration of all the relevant contextual information surrounding the data. These “metadata” descriptions (see Section 4.2) are not just for the researcher’s own use: they should also allow others to find the data once it has been stored in an archive and to assess whether the data suits their needs. Research data archives that store more and more large data collections have to provide proper facilities and guidance so that potential users can find what they are looking for.

While technological advances have made it much easier to collect large amounts of audiovisual recordings, the automatic extraction of the relevant information from these recordings is still very difficult and therefore needs to be done manually to a large extent. This causes a discrepancy between the amount of data that is being collected and the amount of data that ends up being analyzed and used to support research hypotheses.

Data archiving and sharing is currently on the agenda in all areas of science, and the technical frameworks that are being developed are often based on the OAIS reference model (CCSDS 2002), which was originally designed for space data but can be applied more broadly. Different workflows and usage scenarios and differences in the nature of the archived data often require deviations from this abstract model, though, in particular in the case of an online archive that gives users direct access to the archived material.

2. Issues and strategies in data handling

Since digital technology quickly offered ways to create not only large amounts of primary recordings but also associated resources such as transcriptions, linguistic analyses, field notes, etc., it became obvious that new challenges were appearing on the horizon: we needed ways to take care of proper lifecycle management of the archived data. In 2000 the MPI stored about one terabyte of digitized recordings; currently the data in the online archive and the data ready to be integrated take up about 74 terabytes. Due to technological innovation we are now able to process and store lossless compressed JPEG2000 video streams, which result in files that are larger by a factor of 20 than the MPEG2 files that were our highest-quality archival copies until recently. This increase in file sizes currently results in an annual growth of the archive of about 18 terabytes; with more and more researchers switching to high-definition video cameras, we can expect another steep increase in annual growth in the near future.

In the humanities, sheer data volume is not a good indicator of the data management challenges to be solved. There are generally complex relations between the archived objects that need to be maintained in order to preserve all the knowledge about the objects. Each digitized recording is, for example, part of a hierarchy of semantically related objects. Often such objects are split into new objects for specific purposes such as presentations. Different layers of annotation of the linguistic content are created, perhaps by different annotators at different times. Derived resources such as lexica are created that relate to a collection of archived objects (see Cablitz, this volume). Several versions and transformations of many objects might be created in the course of time. It is important to store the relationships between all these objects, since in most cases context and provenance are essential for the interpretation of the objects’ content. Handling the complexity in such collections is thus a true challenge.

From a UNESCO study we know that even for tape media the preservation of the stored information has turned out to be a huge and partly insoluble problem. About 80% of the existing recordings of languages and cultures created by ethnologists, linguists etc. are highly endangered because the physical carriers are deteriorating rapidly and the material is not in the hands of specialized archives (Schüller 2004). Digital technology moves on even faster, i.e. uncurated digital data is much more endangered than traditional analog recordings. There is a great risk of losing parts of our cultural and scientific memory if we do not ensure that data formats and encodings are kept distinct from the software being used, if we do not use open standards such as XML (eXtensible Markup Language) for specifying structure, and if we do not use widely agreed and thoroughly documented encoding schemes such as UNICODE, MPEG etc.

Digital data needs to be continuously migrated, both at the carrier level and at the structure/encoding level. How can we maintain integrity and authenticity – both essential pillars for the preservation of our contents – in such a dynamic world? Migration alone will not ensure data survival, since our media are very vulnerable and our software is error-prone. Automatic copying to distinct locations according to safe protocols, making use of different software systems, is required as well to preserve our digital treasure. For DoBeS data, six copies are created automatically at three locations, and in addition selected data is returned to the locations where it was recorded.
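The integrity requirement for replicated copies can be illustrated with a simple fixity check. The following Python sketch does not reproduce the procedure actually used for DoBeS data, which is not described here; it merely assumes that the copies are reachable as local file paths and that SHA-256 is an acceptable checksum. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_divergent_copies(reference: Path, copies: list[Path]) -> list[Path]:
    """Return the copies that are missing or differ from the reference file."""
    expected = sha256_of(reference)
    return [c for c in copies if not c.is_file() or sha256_of(c) != expected]
```

Run periodically and without manual intervention, a check of this kind would flag a damaged or missing copy so that it can be restored from one of the intact ones.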
For both aspects – migration and copying – there are no simple solutions that are safe enough, and procedures involving too many manual operations will not work in the end, since the costs would be much too high for the large volumes of data that we are creating and maintaining.

2.1. The influence of the DoBeS programme

One of the great outcomes of the DoBeS program at an early stage was that a few enthusiastic researchers and technologists sat together and contributed to the specification of a flexible metadata schema and infrastructure: the ISLE Metadata Initiative (IMDI¹). It was quickly understood that metadata is the glue for maintaining the complex relationships that may exist between various objects in an archive. With the IMDI metadata infrastructure we were able not only to add descriptions to objects in order to make them retrievable, or to group them based on categories, but also to organize them into various collections. Each depositor constructs a hierarchical reference organization for their corpus, which forms the basis for all management and access permission operations, but alternative organizations are also possible. IMDI is still the basis for one of the largest online archives these days: the MPI language Archive, of which the DoBeS archive is a well-organized part.

There is currently quite some discussion about proper data management and the challenges posed by what is now also called the “Data Tsunami”. Just recently the European Commission set up a high-level expert group to produce a report entitled “Riding the wave – How Europe can gain from the rising tide of scientific data” (High Level Expert Group on Scientific Data 2010) and to come up with actions to address the volume and complexity aspects. In the US, a final report of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (Blue Ribbon Task Force on Sustainable Digital Preservation and Access 2010) and the ASIS&T Summit on Research Data showed the relevance of the data curation and preservation challenges. Looking back a decade, we can state that the DoBeS program made great contributions to addressing these questions at an early stage. Excellent solutions were found given the early stages of the debates, and contributions to the discussions about data management are still being made today.
Principles of data archiving were worked out, the need for standards was articulated, a new organization framework based on metadata descriptions was invented, the issue of appropriate creation, management, access and enrichment tools was tackled, and concrete actions were started along all these dimensions, resulting in solutions that meet most of the requirements being discussed these days.

1. http://www.mpi.nl/IMDI/

In the report of the EC high-level expert group, “trust” is indicated as one of the most fundamental principles for success. Obviously trust has many facets, but most essential is (1) that the depositors trust the archivists to take care of proper preservation, curation, and access and rights management; (2) that the archivists can rely on the quality of the data provided by the depositors; and (3) that the users can trust that they get exactly those objects they are looking for in authentic quality. The last point has led to an important shift in the MPI archivists’ view on researcher involvement in data management. Given the highly dynamic era in which the DoBeS program, and in particular its archiving part, was set up, we can state that this trust has eventually been established, even though it required some changes of attitude on both the archivists’ and the researchers’ side. The archivists, for example, had to become aware of the utmost importance that researchers attach to proper protection and presentation of their data, and the researchers had to get used to the idea of handing over their data to an online archive that has data sharing as one of its goals. It is in the nature of innovation that trust has to be continuously re-established.

3. Archive stakeholders and their needs

As indicated in Trilsbeek and Wittenburg (2006), a number of different parties typically interact with an archive, each from a different perspective and with different – sometimes conflicting – needs. Depositors require an easy way to deposit their material and to write metadata descriptions for their deposits, archivists need means to ensure consistent archive organization and data integrity, and various groups of users of an archive need easy means to navigate and access archival content. The latter group in particular poses a challenge to the developers of access tools for an archive, since it is rather heterogeneous, ranging from interested members of the general public to journalists, to people from the speech communities whose recordings are in the archive, and to linguists who may or may not have specialized knowledge about the archived material. It is almost impossible for an archive to cater for the access needs of every group of users, so it is important that it offers access to its resources in an atomic way and ideally also offers access to some basic web services to explore the archived material. In this way, different web sites or portals with different looks and levels of complexity can be developed on top of the archiving infrastructure.

Offering access to archived material and services in such a way also becomes essential if an archive wants to become part of the various “e-research infrastructures” that are being developed at the moment in projects such as CLARIN². Agreements about standards and interchange formats for data and services are needed to ensure interoperability between various archives and tool providers.

4. Long-term preservation requirements

4.1. File formats and encodings

Long-term preservation of digital data involves both the physical preservation of the digital objects and keeping these objects interpretable in the long run. File formats and encodings change over the years, up to the point where old formats can no longer be read on the common hardware and software of the day. We have seen various examples of this in the past and we will no doubt see many more in the future. Keeping archived data interpretable therefore means that an archive needs to migrate its stored files to up-to-date formats before the old ones have become obsolete. Converting from one format to another, however, often involves loss of information or the introduction of artifacts. In audiovisual formats, transcoding between two lossy formats, or even encoding a file again in the same lossy format, will introduce artifacts and loss of information. To prevent loss of information or loss of quality, the archive should use formats according to the following principles:

– for audiovisual material, use uncompressed or lossless compressed formats whenever possible
– for textual material, use Unicode character encoding and XML-based formats whenever possible
– avoid closed, proprietary formats

For textual material and audio material it is quite straightforward today to follow these guidelines. Storing uncompressed or lossless compressed video, however, still requires a lot of storage capacity by today’s standards, which is problematic for many language archives. One hour of standard-definition MJPEG2000 lossless compressed video, for example, takes up about 70 GB of storage; for high-definition video this number would be four times as high.

The role of video in language documentation is growing, since it provides a way to contextualize the spoken language and to analyze other communication channels such as gesticulation. However, the quality requirements for video are much less straightforward than for audio. Many variables play a role in the quality of the video signal, but their importance may vary depending on the purpose of the recording. The recording equipment used to acquire the video material also limits the quality that can be obtained to some extent; the resulting video quality will depend on the available budget and on the size and weight that is still practical in the recording situation. It is probably safe to assume that the price of digital storage will continue to drop at least at the same rate as it has during the past decade, so storing uncompressed or lossless compressed video will become feasible for more archives. The consumer camcorder market, on the other hand, is very hard to predict and is not driven by the needs of linguistic researchers.

2. http://www.clarin.eu

Two XML-based formats for linguistic data that were developed with the help of the DoBeS program are the EAF format for linguistic annotations and the Lexical Markup Framework (LMF) format for lexica (ISO 24613:2008). EAF is the format used by the ELAN multimedia annotation tool for storing multi-layered annotations that are time-aligned to the audio or video files. LMF is a flexible format for creating structured lexica and is used by the LEXUS lexicon tool. Both formats were designed as XML formats to allow for relatively easy conversion to other formats now and in the future.

4.2. Organization of data: metadata

When gathering and managing large amounts of data, be it in the form of analogue or digital resources, an additional layer of meta-information is indispensable.
This might seem obvious for the classic case of a library full of books, but it is even more true for a digital archive where language resources are stored as digitized recordings and text files. Specific reasons for this are:

– Digital resources are meaningless by themselves. On the lowest level they consist of bits (0 and 1). While digital storage systems themselves already provide for interpretation of the basic characteristics of the stored bit-streams, many other layers of interpretation are necessary for keeping the data useful and manageable, each layer requiring explicit, specific (metadata) information.
– There are very many ways to organize digital language resources; one organization might be more suitable for a specific archiving or research purpose than another, and fortunately the digital storage paradigm does not impose a single organization. Therefore we need the ability to impose different flexible organization models or views that match the interests of researchers or archivists. The richer the metadata available, the more possibilities there are for the end user to create these special views and explore the digital collection.

In the current landscape of digital repositories and archives, a number of specific metadata standards are prominent for the description of linguistic data. Such a standard usually specifies a set of metadata elements (sometimes called attributes) together with prescriptions for the values of these elements and prescriptions on how the metadata elements and values should be put into a text format (schema). The first of these sets, and probably the most widely used one, is Dublin Core,³ which stems from the electronic library world. Dublin Core was later extended with some linguistic specializations into the OLAC standard,⁴ which has become popular for exchanging language resource metadata between archives. Around the same time the IMDI⁵ standard was introduced and adopted by the DoBeS program. IMDI strives to allow detailed descriptions, and several so-called specialized profiles were created for specific linguistic subdomains. A suite of tools to edit and use IMDI metadata was partly developed within the context of the DoBeS program. At the time of writing (2011) a follow-up standard for IMDI, called CMDI⁶ (Component Metadata Infrastructure, cf. Broeder et al. 2010), is being worked out within the CLARIN framework. Rather than offering one single metadata schema, it tries to offer the user a set of loose components that can be combined into a tailored metadata schema.
This approach should allow for a detailed description while keeping the focus only on those metadata elements that are relevant. Apart from that, it also allows for partial re-use of existing metadata schemas and provides better mechanisms for semantic interoperability by requiring that the semantics of all metadata elements used are explicitly defined in an accepted concept registry. Using CMDI will hopefully increase metadata interoperability between linguistic research communities having different needs and traditions.

3. http://dublincore.org
4. http://www.language-archives.org/OLAC/metadata.html
5. http://www.mpi.nl/IMDI/
6. http://www.clarin.eu/cmdi

4.3. Other standards in language archiving

Both the LMF lexicon standard and the CMDI metadata standard are prime examples of a trend towards standards that can easily be adapted to the needs of a specific resource type, as used by a specific (research) community, or even of a single resource. LMF provides a core meta model, with some extensions, which can be adorned with data categories taken from a data category registry to form the actual data model for a specific LMF lexicon. CMDI uses the same approach by storing pre-defined metadata components and profiles in a component registry. These components are also annotated with links into concept registries, of which the data category registry is one, to make semantic descriptions available and shareable. Registries are thus starting to play an increasingly prominent role in standards related to archiving. The MPI develops and hosts the following registries:

– ISOcat⁷ (Kemps-Snijders et al. 2008) is the data category registry (ISO 12620:2009) for ISO TC 37, which is based on a grass-roots approach, allowing any linguist to participate in the specification and standardization of linguistic data categories.
– The CMDI component registry⁸ for CLARIN-NL.
– RELcat is a registry to store (user-specific) relationships between data categories and possibly other concept registries.

A metadata example that is already in use illustrates how these strategies support mapping from the IMDI to the Dublin Core metadata schema. The metadata profile in ISOcat has been bootstrapped with the IMDI elements, which include the /mimeType/⁹ data category. The specification of a data category can be very elaborate, including translations in multiple languages, but at least an English name and definition should be available. The /mimeType/ data category is defined as the “specification of the mime-type of the resource which is a formalized specifier for the format included or a mime-type that the tool/service accepts”.

7. http://www.isocat.org/
8. http://www.clarin.eu/cmdi/
9. http://www.isocat.org/datcat/DC-2571

In the CMDI component registry the cmdi-mimetype¹⁰ component links the MimeType element to this data category. The format¹¹ element in the Dublin Core metadata schema actually plays the same role and is defined as “the file format, physical medium, or dimensions of the resource”. In RELcat the equivalence relation between the ISOcat data category /mimeType/ and the Dublin Core format element can be specified using a simple RDF triple:

isocat:DC-2571 rel:sameAs dc:format
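How an equivalence of this kind can broaden a metadata search may be sketched in a few lines of Python. This is only an illustration and not the CLARIN search implementation: the triple is taken from the text, whereas the record layout, the function names, and the decision to treat rel:sameAs as symmetric and transitive are assumptions made for the example.

```python
from collections import defaultdict

# (subject, predicate, object) triples; this one is the triple given in the text.
RELATIONS = [
    ("isocat:DC-2571", "rel:sameAs", "dc:format"),
]


def build_equivalents(triples):
    """Map each term to the terms declared equivalent to it via rel:sameAs."""
    neighbours = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "rel:sameAs":
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)  # assumption: sameAs is symmetric
    return neighbours


def expand(term, neighbours):
    """Return the term plus everything reachable through sameAs links."""
    seen, stack = {term}, [term]
    while stack:
        for other in neighbours[stack.pop()]:
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def search(records, field, value, neighbours):
    """Return the records in which the field, or an equivalent field, has the value."""
    fields = expand(field, neighbours)
    return [r for r in records if any(r.get(f) == value for f in fields)]


# Hypothetical records: one described with the ISOcat category, two with Dublin Core.
records = [
    {"id": "a", "isocat:DC-2571": "audio/x-wav"},
    {"id": "b", "dc:format": "audio/x-wav"},
    {"id": "c", "dc:format": "text/xml"},
]
hits = search(records, "dc:format", "audio/x-wav", build_equivalents(RELATIONS))
```

With the relation in place, a query on dc:format also finds record "a", which was described with the ISOcat data category only.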

The metadata search that is currently under development in CLARIN can already exploit such semantic relationships to broaden the scope of a search. While this example zoomed in on the metadata domain, ISOcat is currently being populated with data categories for various other domains, e.g., morphosyntax and terminology, and it is expected that this will provide the same kind of flexibility for content search on resources created in these domains.

As usability, accessibility and interoperability are long-term goals of the archive, the persistence of the registries and of the links to them is a major concern. Most of these registries provide Persistent IDentifiers (PIDs) backed up by persistence strategies, which allow safe use of these identifiers in the metadata of resources or even in the resources themselves. There are various PID frameworks available. To help an archive choose among these frameworks, ISO 24619 “Persistent identification and sustainable access” (ISO 24619:2010) gives specific requirements that these frameworks should meet to make them useful for archives of linguistic resources.

To promote the (re)use of the resources stored in the archive, standards for harvesting metadata, e.g., OAI-PMH from the Open Archives Initiative, and standards on and agreements between archives about Authentication and Authorization Infrastructures are important. Large-scale infrastructure initiatives like CLARIN help to build up the federations of all involved organizations.

4.4. Versioning

When storing and archiving digital resources, an important policy decision concerns how to respond when a depositor offers a new “version” of a resource that is already present in the archive’s holdings. There can be different reasons for offering a new version: (a) The depositor has realized that the first version is simply broken or unusable, for instance because files were switched. (b) New insights make it necessary to change some annotations. (c) The format of a resource may need to be upgraded; for instance, a codec used to encode a media stream may become obsolete, requiring the resource to be replaced.

10. http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:c_1271859438106
11. http://purl.org/dc/elements/1.1/format

Depending on the archive organization and policies, it is possible to let the new version take the place of the old one in the existing network of relations with other resources and metadata. The old version may then be moved to background storage or, depending on archive policy, even deleted. Of course the relation between old and new versions needs to be stored, and users should be able to see that other versions exist. We will not go into the question of what actually makes a resource a new version of another resource. This is best left to the judgment of the depositor or caretaker of the original version.

It is, however, very important to realize that users may have created references to a resource in the archive, for instance as a link in a publication. Most users will expect that such a reference will always link to the same version, while others may want to refer to the latest version. It is important that the archive is explicit about its versioning policy in this respect. The most flexible system is one in which every reference always points to a specific resource version, while referencing the latest version is provided as a special service. However, it is known that some archives are unable to keep stable references to resources or resource collections due to legal or organizational obstacles. For instance, the legal owner of a resource might withdraw it from an archive’s holdings. In such cases the archive can only be as explicit as possible about the circumstances.

5. Open access vs. access restrictions

At the moment there is a large push towards open access to research results, not just the scientific publications but also the data that forms the basis of these publications. The Berlin declaration on Open Access to Scientific Knowledge¹² was first published and signed in 2003 by representatives of most of the German research organizations, but has meanwhile been signed by 294 scientific organizations and universities worldwide. In general there is a lot to be said in favor of making the outcome of research that has been funded with public money available to the public and to other researchers in an unrestricted manner. Giving access to the raw data on which publications are based would in principle allow anyone to verify the claims that were made and would allow the data to be reused for other analyses.

In many fields of research, however, data is collected from human subjects, in which case the privacy of these subjects needs to be taken into account. Informed consent forms are often used to explicitly regulate the rights to publish data of human subjects. Anonymization is another method to ensure that the privacy of the human subjects is respected, which in general means that all information that could be used to identify individuals is removed from the data. In the social sciences, for example, it is common practice that the names and contact information of participants in a survey or an experiment are removed before the data set is published.

Both informed consent and anonymization can be somewhat problematic in the field of documentary linguistics, though. Informed consent to making the data public on the World Wide Web would entail that the subject has a good understanding of what this implies. Masking names in texts and in audio recordings can be done, but modifying audio and video recordings up to the point where the individuals can no longer be recognized would render them useless for many linguistic purposes. The fact that recordings are made within small communities sometimes requires the researcher to protect the speakers in order to avoid conflicts within the communities. It is up to the researchers working in these communities to discuss these issues with the speakers and to make careful decisions, taking both the Open Access principles and the privacy of the speakers into consideration. Some of these issues, and possible solutions, are discussed in the following section.

12. http://oa.mpg.de/lang/en-uk/berlin-prozess/berliner-erklarung/

6. Legal and ethical issues

As indicated above, when working with data collected in small language communities one has to carefully consider the rights and privacy of the interviewed contributors. In the DoBeS program, legal and ethical considerations were an important point of discussion from the very beginning. In its second year a workshop was organized with leading European law experts to determine a proper legal basis for the DoBeS program and in particular for the online archiving ideas. The result, however, was disappointing from the practitioners’ point of view, since it was concluded that the legal situation is much too complex to give clear legal advice. The only advice the experts could come up with was to lock the material in a safe in a cellar, which was exactly the opposite of what was expected from the emerging archive – namely to be a place where authorized persons from all around the world could access and even enrich the stored data. Intensive and serious discussions afterwards led to a number of conclusions:

– It was understood that the DoBeS program should have a proper basis to guide the behavior of all persons involved: collectors, archivists and users. The result was an elaborate Code of Conduct, which was amended over the years.
– The roles of all actors in the complex system were defined and the expectations with respect to each actor were formulated. For the archivists it is the principal researcher who is responsible for specifying, for example, the access permissions. It is expected that the researcher responsible takes care of proper relationships with the communities and the interviewees and that all statements are based on informed consent. The archivist will adhere to the statements of the researcher responsible and provide access mechanisms that implement the requirements.
– The archivist declared that he does not claim copyright on the stored material. However, he needs the right to archive in order to perform his task in a responsible way. With respect to users the archivist will claim copyright on behalf of the data producers.
– It was decided not to use visible logos in the video since they might obstruct the content.
– The researcher responsible always has access permissions to all material and can set access permissions for other persons. In particular, members of the speech community should be granted the rights and abilities to access the content.
Handling legal and ethical issues at a responsible level is a serious challenge especially since communities may withdraw access permissions to certain material again although it was granted at a certain moment for culture specific reasons. Also other complicating issues may play a role requiring a high


Broeder, Sloetjes, Trilsbeek, van Uytvanck, Windhouwer and Wittenburg

degree of sensitivity of all actors involved. To cope with all kinds of unexpected events, a Linguistic Advisory Board consisting of highly respected field researchers was established that can be called upon by the archive to help solve potentially difficult questions. Over the years, when it became more obvious that more users may want to access material in the online archive, four levels of access granting were agreed upon:
Level 1: Material under this level is directly accessible via the internet;
Level 2: Material at this level requires that users register and accept the Code of Conduct;
Level 3: At this level, access is only granted to users who apply to the researcher responsible (or persons specified by him or her) and specify their usage intentions;
Level 4: Finally, there will be material that will be completely closed, except for the researcher and (some or all) members of the speech communities.
Access level specifications for archived resources may change over time for various reasons, e.g. resources could be opened up a certain number of years after a speaker has passed away, or access restrictions might be loosened after a PhD candidate in a documentation project has finished writing the thesis. The number of external people who requested access to “level 3” resources over the last years was not that high. We need to see in the future whether the regulations that are currently in place can and should be maintained as explained. Access regulations remain a highly sensitive area where the technical possibilities opened up by using web-based technologies need to be carefully balanced against the ethical and legal responsibilities which archivists and depositors have towards the speech communities. Despite almost 10 years of ongoing discussions and debate, no simple solution to this problem has yet been found.
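The four access levels can be read as a small decision procedure. The following is only a minimal sketch of that reading, not the archive's actual implementation: the class, field and function names are hypothetical, and it assumes that each level is checked independently of the others and that the researcher responsible always has access.

```python
from dataclasses import dataclass
from enum import IntEnum


class AccessLevel(IntEnum):
    OPEN = 1        # directly accessible via the internet
    REGISTERED = 2  # user registered and accepted the Code of Conduct
    ON_REQUEST = 3  # researcher responsible approved a stated usage intention
    CLOSED = 4      # researcher and (some or all) community members only


@dataclass
class User:
    # All field names are hypothetical, chosen for this illustration only.
    accepted_code_of_conduct: bool = False
    approved_by_researcher: bool = False
    is_researcher_responsible: bool = False
    is_authorized_community_member: bool = False


def may_access(level: AccessLevel, user: User) -> bool:
    """Return True if `user` may access a resource at `level`."""
    # The researcher responsible always has access to all material.
    if user.is_researcher_responsible:
        return True
    if level == AccessLevel.OPEN:
        return True
    if level == AccessLevel.REGISTERED:
        return user.accepted_code_of_conduct
    if level == AccessLevel.ON_REQUEST:
        return user.approved_by_researcher
    # Level 4: closed except for designated community members.
    return user.is_authorized_community_member
```

Because the level is stored as data on the resource rather than being built into the check, a later change (for instance opening a resource some years after a speaker has passed away) would only require updating that value.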

7. Data enrichment tools

Providing tools for tagging or annotating audio-visual media has been one of the focal points of software development at the MPI right from the start, from the Mac-only application MediaTagger, via a set of client-server based corpus visualization programs to their convergence into the stand-alone, multiplatform annotation tool ELAN. This progression of tools was paralleled by the switch from data in a proprietary format to data stored in the evolving standard XML. After a thorough makeover (in 2003), marking the transition to the 2.x versions, ELAN further developed into an application supporting multiple videos in multiple formats, providing a growing number of import and export options, with increasing editing capabilities and available both as a Java Web Start application and as a downloadable installer version. ELAN allows enriching of audio and video recordings with multilayered, structured annotations stored in EAF (ELAN Annotation Format) files, a file format that can be uploaded into the archive as a constituent of a corpus. To make inspection and exploration of data in the archive more convenient than downloading a bundle of files and opening the software, the web application ANNEX has been created. It streams (chunks of) media recordings from the archive and visualizes associated annotations, not only those stored in EAF but other formats as well, in an interface resembling that of ELAN. ANNEX is closely connected to TROVA, a search engine for structured search in annotation content. Queries can be executed in one or more corpora or parts thereof, and from any search result or hit a jump to ANNEX can be made, showing that particular annotation in that particular file. Processing multiple files simultaneously has recently become an important track of development of ELAN and it is expected that it will remain so in the years ahead. This type of operation improves productivity enormously and stimulates consistency within a (local) corpus. More generally, reducing the number of mouse clicks, keystrokes and steps that have to be performed manually will be a future goal.
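The multilayered annotation structure of an EAF file can be illustrated with a short reading routine. This is a sketch under assumptions: the sample document is hand-written and heavily simplified for illustration (real EAF files carry a header, linguistic types, dependent tiers and possibly time slots without a time value), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified EAF-like fragment (illustrative only).
SAMPLE_EAF = """\
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tiers(eaf_text):
    """Return {tier_id: [(start_ms, end_ms, value), ...]} for time-aligned annotations."""
    root = ET.fromstring(eaf_text)
    # Map each time slot ID to its offset in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            value = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, value))
        tiers[tier.get("TIER_ID")] = rows
    return tiers
```

Applying such a routine in a loop over many files is the simplest form of the multi-file processing mentioned above; a consistency check or find-and-replace would operate on the returned tuples.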
Semi-automatic annotation by pattern-recognition based software components is expected to become available for everyday language research soon.

Another data enrichment tool developed by the MPI is a flexible online lexicon tool called LEXUS (Ringersma and Kemps-Snijders 2007; http://www.lat-mpi.eu/tools/lexus). The LEXUS lexicon schema can be based on the meta-model of the LMF standard, but users actually have extensive freedom to construct a rich lexicon schema appropriate for the language to be described. Elements in this schema can be linked to the data category registry, ISOcat, and can thus have explicit, and shareable, semantics. Import tools allow loading existing lexica in various formats, e.g., MDF, into LEXUS. In principle lexica exist in a user-specific workspace. However, LEXUS allows sharing these lexica with other users, thus enabling collaboration on the development and population of a lexicon. Cablitz (this volume) gives a detailed account of the implementation of LEXUS in an actual documentation project. The LEXUS frontend has evolved over time using different browser-based technologies into the current FLEX version, which due to its use of the Adobe Flash plug-in provides a similar look and feel across a wide variety of browsers and platforms. The rendering of lexical entries has always been very flexible, allowing users to construct templates for both list and entry views. A new version of the LEXUS backend is currently close to completion; in addition to providing increased stability and performance, it will make it easier to add new output formats, e.g., a printable version of the lexicon.

Making different tools like LEXUS, ANNEX/TROVA and ELAN cooperate as seamlessly as possible is another important line of development. Separation of metadata and annotation content has its merits, but at some point they will have to come together, e.g. in a combined data-metadata search in TROVA. Some annotation editing options, especially those that are executed on multiple files (like find-and-replace in many files), make perfect sense in the context of ANNEX. The combination of ELAN and LEXUS will on the one hand allow lookup and retrieval of information from a lexicon while annotating, and on the other hand will enable the user to start building a lexicon while annotating.

8. Accessing data

8.1. Metadata searching and browsing

Access to archived resources is generally offered by means of search and browse functions for the metadata catalogue. Search functions can be implemented in various ways, e.g. as free text Google-like search across the entire metadata catalogue, as an advanced search for searching in specific metadata fields, or as a “faceted search” that allows one to narrow down search results by selecting values of a number of pre-defined fields. Searching within a metadata catalogue that makes use of a single metadata scheme is fairly straightforward. The only problem here is that there is a certain degree of variation of metadata values that actually refer to the same kind of data, if the metadata field does not require the use of a controlled vocabulary. Some kind of mapping would need to be performed in order to find all variants of the same value. The situation becomes more complex if one needs to search across different catalogues with different metadata schemes. It is hoped that the use of links to the ISOcat data category registry in metadata schemas and value sets will make cross-archive searches more manageable. As an example, archive A may use a metadata schema that contains the element “gender” for speakers, for which the values can be “F” and “M”, while archive B may use the element “sex” for basically the same concept, with the values “female” and “male”. If both metadata schemas referred to the proposed ISOcat term “HumanGender” and the values “feminine” and “masculine” (with the definitions that this relates to the gender of a person rather than grammatical gender), it would be possible to search across both archives in either terminology with an ISOcat-aware search tool.

8.2. Content searching

A more elaborate search for the actual content of the resources is required if one wants to find specific examples of language use that cannot be described in the metadata. At the moment this content search will be limited to textual resources (annotations to audio/video) but possibly in the future this could be extended to a limited set of features in the audio or video material itself. Searching for annotations can also be done in varying levels of complexity. The TROVA content search tool for example offers a simple search mode to search across the entire annotation file, it offers a “single layer” mode to search for sequences within a single annotation layer and it offers a “multiple layer” mode to search for sequences both within and between annotation layers. Content search tools can be used to find specific examples in a language corpus, but can also be used to perform statistical analyses on a corpus by finding all cases of a certain linguistic structure. Also in textual content search tools, the variation in terminology that occurs within and between archives can be an issue. Here also the ISOcat registry can play a role by allowing search tools (and users) to create mappings between different terms that actually have the same meaning.
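The term mappings mentioned here, and the gender/sex example given for metadata search, can be sketched in a few lines. This is an illustration under stated assumptions, not the interface of ISOcat or of any existing search tool: the identifiers “HumanGender”, “feminine” and “masculine” follow the example in the text, while the dictionary layout, the function names and the sample records are hypothetical.

```python
# Hypothetical per-archive mappings from local terms to shared data
# categories. Only the shared identifiers come from the example in the text.
ARCHIVE_MAPPINGS = {
    "archive_A": {
        "element": {"gender": "HumanGender"},
        "value": {"F": "feminine", "M": "masculine"},
    },
    "archive_B": {
        "element": {"sex": "HumanGender"},
        "value": {"female": "feminine", "male": "masculine"},
    },
}


def to_shared(archive, element, value):
    """Translate a local (element, value) pair into shared terms."""
    mapping = ARCHIVE_MAPPINGS[archive]
    return mapping["element"][element], mapping["value"][value]


def search(records, query_archive, element, value):
    """Find records in any archive matching a query phrased in one archive's terms."""
    wanted = to_shared(query_archive, element, value)
    hits = []
    for archive, local_element, local_value, resource in records:
        if to_shared(archive, local_element, local_value) == wanted:
            hits.append(resource)
    return hits


# Invented sample records: (archive, element, value, resource name).
RECORDS = [
    ("archive_A", "gender", "F", "A/session-01"),
    ("archive_A", "gender", "M", "A/session-02"),
    ("archive_B", "sex", "female", "B/recording-07"),
]
```

A query phrased in archive B's terms, `search(RECORDS, "archive_B", "sex", "female")`, returns `["A/session-01", "B/recording-07"]`, that is, matching resources from both archives.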


8.3. Portals

While metadata and content search tools are generally suitable for specialists to find material that they are interested in, members of a speech community or members of the general public have other requirements when accessing archived content. If the search services and archive access framework are set up in a rather generic way and can be called via standard web-service interfaces, it is possible to create an additional “layer” on top of the archive that serves a specific user group. This layer or “Portal” can have an appealing graphical design and it can direct people to certain pre-defined searches that have been set up or interesting resources that have been selected. Within the European research infrastructure projects that are currently running such as CLARIN, more and more tools are being made available as web services. To what extent these web services will be of use for certain user-specific portals remains to be seen, but at least they open up a wide range of possibilities to combine resources and services together in a web interface.

9. New challenges

Life cycle management of data can be split into three major and related phases: creation, curation/preservation and access/utilization. With respect to all three phases we will see accelerated technological innovation which on the one hand has positive effects in so far that research can make use of newest inventions and products and on the other hand has negative implications with respect to the stability of the solutions found. The trick will be to define the islands of stability in a very dynamic environment and to participate stepwise in the innovation process. This holds for the archive as well as for all software being written. In all phases of the data life cycle, the challenging ethical and legal situation needs to be taken into account. Creation Phase: The creation process will benefit from further sophistication in recording equipment, where in particular three developments will have their implications: (1) miniaturization of data storage leading to increased capacity; (2) resolution; (3) connectivity. Miniaturization will lead to continuously increasing storage capacities allowing researchers to make high-resolution recordings with portable devices. Miniaturization also will simplify field work in so far that direct annotation will be easier with help of smart and small devices demanding less power. The resolution of recording


devices will be increased so that soon high-definition video cameras can be expected in the low cost sector. Connectivity will also become better, even at remote places. This will mean that digital recordings (digital recording is now the norm in the vast majority of fieldwork contexts) can be transmitted earlier and faster. It also will mean that the possibility of downloading or accessing archive material will be improved.
Preservation/Curation: A big step for video preservation has already been taken with the recent introduction of lossless MJPEG2000. This means that we are able to store a master file from which other formats, e.g. for presentation purposes, can be generated without risking serious transformation effects. Information technology (channel bandwidth, storage capacity, CPU power) will allow us to deal with the increased data amounts. Long-term preservation is very much dependent on “safe” replication, where every operation on a data object automatically leads to a check of whether the copied instance is indeed the same as the original one. It is widely agreed now that the extensive use of externally registered persistent identifiers associated with checksum information is the only way to ensure data integrity and authenticity in distributed and thus more complex data management scenarios. The DoBeS archive is prepared to participate in such state-of-the-art archive federation scenarios, since it has been based on persistent identifiers and automatically generated checksum information for some years already. Together with the computer center in Garching (RZG) it has been actively testing a switch to a rule-based safe replication strategy based on the iRODS software, and it seems that the system can be put into operation in 2011. This will be a major step ahead also to support the open deposit service of the MPI offered to all researchers with language resources.
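The verification step behind “safe” replication can be sketched as follows. This is a minimal illustration, not the iRODS-based strategy or the archive's own procedure: the choice of SHA-256, the plain dictionary standing in for an external persistent-identifier registry, the function names and the example identifier are all assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_replicate(source, target, registry, pid):
    """Copy `source` to `target` and verify both against the registered checksum.

    `registry` stands in for an external persistent-identifier service and is
    modelled here as a plain dict {pid: checksum} (an assumption).
    """
    expected = registry[pid]
    if sha256_of(source) != expected:
        raise ValueError(f"{pid}: source no longer matches its registered checksum")
    shutil.copyfile(source, target)
    if sha256_of(target) != expected:
        Path(target).unlink()  # discard the corrupt copy
        raise ValueError(f"{pid}: replica differs from the original")
    return expected


def demo():
    """Replicate a small temporary file and report whether the copy is identical."""
    with tempfile.TemporaryDirectory() as workdir:
        source = Path(workdir) / "recording.wav"
        target = Path(workdir) / "replica.wav"
        source.write_bytes(b"example media bytes")
        pid = "hdl:example/0001"  # hypothetical persistent identifier
        registry = {pid: sha256_of(source)}
        safe_replicate(source, target, registry, pid)
        return target.read_bytes() == source.read_bytes()
```

Keeping the checksum with the identifier rather than next to the file means that any site holding a replica can check it against the same reference value.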
Also in 2011 the component based metadata tools will come into place, which offers much more flexibility for the researchers to design a metadata profile that is suitable for their resources. Interoperability will be guaranteed by making use of categories defined in the ISOcat registry. The ARBIL editor, which has now replaced the IMDI metadata editor, is already supporting this component structure and will hopefully motivate researchers to provide better metadata descriptions, since they will be the key for the application of advanced analysis tools and for generating portals designed for the special community in mind. Utilization Phase: We expect many developments in the improved utilization possibilities of the stored data as long as access is being granted and


as long as the quality of the metadata and the data is high – quality will actually be the crucial point for many advanced operations. One big concern is that the amount of recorded media streams that is not being touched (annotated in some form to make it ready for analysis) is increasing continuously, which means that much of the stored data will effectively not be of much use to anyone other than the person who collected it. A new attempt to use state-of-the-art speech and image processing technology is required that does not build on holistic stochastic models but on detectors that react to comparatively simple patterns in media streams and create annotations. There will be several of these detectors, all with different characteristics, that may also be specialized on specific quality types of recordings. The resulting lattice of annotations could be the base for linguistic evidence and theorization if there are smart tools allowing the researcher to look for specific patterns and to easily navigate in it. We can indicate a few other areas where we expect new opportunities in the coming months and years:
– Semantically based weaving of content (creating relations and navigating in the resulting conceptual spaces) is very attractive for finding linguistic evidence. However, this work is hampered by the huge effort required to create meaningful relations. Better usage of existing ontologies for automatic support in creating the relations would make this work practically feasible.
– Archive federations are being set up, metadata has been standardized, resource formats are being much more harmonized and improved tools to foster semantic gateways will make it easier to carry out cross-archive and cross-corpus related work.
– More and more tools are being turned into web services or at least support web-based interactions. Since the programming interfaces are also currently being harmonized, there is great hope that in a few years researchers will be able to combine useful algorithms into chains of operations on texts (annotations, etc.), audio and video streams and even other types of data to carry out work that currently is only possible when large scale expert knowledge is directly available.
For these advanced operations, the quality of metadata and data will be of crucial importance. Much funding is currently invested in creating infrastructures that will increase the integration and interoperability of resources and tools. CLARIN is the initiative that aims to achieve these goals in the linguistic domain. Such infrastructure work can only be achieved when we apply standardization and harmonization where possible without hampering the research progress. The DoBeS community was one of the driving forces to apply open standards and foster new standards. If this positive attitude is continued, the work on endangered languages will profit in many ways from new technological developments in the coming years.

References

Blue Ribbon Task Force on Sustainable Digital Preservation and Access. 2010. Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information. San Diego: BRTF-SDPA. Online version: http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf.
Broeder, Daan, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A data category registry- and component-based metadata framework. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, May 19–21, 2010, 43–47.
Consultative Committee for Space Data Systems. 2002. Reference Model for an Open Archival Information System (OAIS). Washington DC: CCSDS Secretariat, NASA.
High Level Expert Group on Scientific Data. 2010. Riding the Wave: How Europe Can Gain from the Rising Tide of Scientific Data. Brussels: European Commission. Online version: http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf.
ISO 12620:2009. 2009. Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources.
ISO 24613:2008. 2008. Language resource management – Lexical markup framework (LMF).
ISO 24619:2010. 2010. Language resource management – Persistent identification and sustainable access (PISA).
Kemps-Snijders, Marc, Menzo A. Windhouwer, Peter Wittenburg, and Sue Ellen Wright. 2008. ISOcat: Corralling data categories in the wild. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28–30, 2008, ed. European Language Resources Association (ELRA).
Ringersma, Jacquelijn, and Marc Kemps-Snijders. 2007. Creating multimedia dictionaries of endangered languages using LEXUS. In Proceedings of Interspeech 2007, eds. Hugo van Hamme and Rob van Son, 65–68. Baixas, France: International Speech Communication Association.
Schüller, Dietrich. 2004. Safeguarding the documentary heritage of cultural and linguistic diversity. Language Archives Newsletter 1(3):9.
Trilsbeek, Paul, and Peter Wittenburg. 2006. Archiving challenges. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 311–335. Berlin, New York: Mouton de Gruyter.

Chapter 4 Comparing corpora from endangered language projects: Explorations in language typology based on original texts∗

Geoffrey Haig, Stefan Schnell and Claudia Wegener

∗ The present study draws on the GRAID-system of morpho-syntactic annotation of connected discourse (Haig and Schnell 2011). We are grateful to Sabine Reiter and Florian Siegl for contributing data from Awetí and Forest Enets respectively. We would like to thank the audiences at conferences in Nijmegen (October 2009; October 2010) and SOAS, London (November 2009) for their feedback and comments on earlier work on this topic. We are grateful for financial support from the University of Bamberg, the University of Kiel, the Max Planck Research Group on Comparative Population Linguistics at the MPI for Evolutionary Anthropology Leipzig, and the Volkswagen Foundation’s DoBeS programme, which financed the four documentation projects from which the data for this chapter is taken. Many thanks also to Nicole Nau, Bethwyn Evans, Anna Margetts and Brigitte Pakendorf for their valuable feedback on earlier drafts of this paper. The responsibility for all remaining errors rests with the authors.

1. Introduction

The current large-scale initiatives towards documenting endangered languages have produced, among many other outcomes, an unprecedented amount of digitally archived, natural spoken discourse in a range of typologically diverse languages. This chapter investigates some of the ways that language typologists can tap into these growing resources. Our results suggest that the cross-language comparison of original narrative texts, of the type now available in abundance in many documentation projects, is a viable and promising avenue of typological investigation. Up until now, there has been some reluctance within typology with regard to direct comparisons of original texts, for reasons that are certainly understandable: the lack of comparable content, the lack of cross-cultural comparability of genres, and perhaps most tellingly, the lack of a standardized system of annotation. As a result, typologists interested in text-based (as opposed to grammar-based) typologies have

tended to shy away from large-scale comparison of original texts in favour of more closely controlled data, for example translations of a single source text (parallel text typology), or vernacular re-tellings of standardized narratives (e.g. Pear Story narratives). While these methodologies have undeniable advantages, they are also subject to certain limitations. The most obvious in a documentation context is that they are not based on speech samples from an indigenous genre, and that they are only rarely encountered in archives of documentation projects. Original text typology, on the other hand, capitalizes on resources that are generally available in documenters’ archives, and can more directly preserve the specifics of indigenous speech genres. Given the widely-accepted goals of language documentation as the preservation of indigenous language practices, an approach to typology that dovetails with these aims is surely a goal worth pursuing. In the remainder of the chapter we present the results of a quantitative typological comparison of original spoken texts from four language documentation projects. The chapter is organized as follows: in Section 2, we introduce the background assumptions to text-based typology, and discuss the relative efficacy of different methodologies. In Section 3, we outline the database for this study and the annotation system, GRAID, developed for these purposes. We show that despite the variation in content, the texts from the four languages display remarkable similarities in certain domains of discourse organization, evident even in corpora of the modest size we are dealing with. Section 4 turns to an area highly sensitive to cross-linguistic variation, that of pronoun deployment. Previous studies in text-based typology have tended to ignore pronouns, either lumping them with full NPs, or with zero anaphora. 
Our results indicate that in doing so, a potentially rich domain of typological variation is overlooked, one that only a finer-grained system of annotation will capture. The different parameters investigated here represent but a fraction of the possibilities opened up by the methodology proposed. However, we think they suffice to demonstrate the fundamental feasibility of cross-linguistic comparison of original texts, and identify some of the typological parameters amenable to this kind of research.

2. Language documentation and language typology

Language typology is traditionally based on the comparison of grammars. But grammars are not languages; they are the products of a more or less idiosyncratic interpretation of a selection of the documented facts of a language. How much and what type of data is included in a grammar, how it is collected, and how it is presented are matters left to the discretion of the author. Of course a broad consensus exists on what constitutes a good grammar, but if a given language has only one grammar, as is often the case for endangered languages, then that grammar will represent that language in typological databases, regardless of its quality. These problems are well known, but for lack of viable alternatives, comparison of reference grammars continues to be the most frequently used methodology in typology (cf. Haspelmath et al. 2008). Recently, however, an alternative methodology has been developed, sometimes referred to as “primary data typology” (Wälchli 2006, 2009). Rather than compare reference grammars of individual languages, primary data typology attempts to compare actual instances of language usage – texts – in different languages. Such an undertaking poses considerable challenges, and is of course subject to a number of limitations. Perhaps the most obvious question is: what kinds of text need to be included in a corpus to be representative of a given language? The question of representativity has long been discussed in traditional corpus linguistics (cf. Atkins, Clear, and Ostler 1992), but also in the context of language documentation (cf. Himmelmann 1998; Woodbury 2003; Seifart 2008). While there is widespread agreement on the ideal of a maximum coverage of indigenous speech events (Geertz’ proverbial ‘thick description’ (Geertz 1973)), in practice, the selection of recorded text types largely depends on preferences of the speech community and is constrained by the documenters’ limited knowledge about indigenous text genres (Mosel 2006). As for corpus size, corpus building is first of all constrained by limited resources. It is generally recognized that a corpus has to display a considerable size in order to contain data on all relevant linguistic structures.
However, even the largest corpora may lack data on some structures in the particular languages. The adequacy of a corpus’ size partly depends on the nature of the object of investigation: for an investigation of, for example, syllable structure, or the use of determiners in NPs, quite short stretches of text may be adequate to obtain a reliable picture. For an investigation of relative clauses, on the other hand, much larger text samples across different genres would be necessary. The issue of corpus size is ultimately an empirical one, that can only be satisfactorily addressed when sufficient studies have been carried out with text corpora of different sizes.


While the issues of text types and corpus size are regularly raised in connection with primary-data typology, it is often forgotten that they are equally relevant to reference grammar typology. Yet with regard to reference grammars, these problems are seldom made explicit. In fact, reference grammars of different languages are based on wildly varying corpus sizes. While some are based on just a couple of hours of recorded speech, coupled with elicitation sessions with a single speaker, others are based on 20–30 hours of recorded speech and participant observation in the speech community over many years, or written by native speakers relying on their own intuitions. Also, reference grammars are based on corpora of different language varieties and text types: some take a form of literary standard (if available) as the basis, while others document a spoken vernacular; some are based mainly on narrative texts, while others are confined to conversational data; and so on. Yet despite the gaping differences in the empirical basis of reference grammars, in typology, different grammars are treated for all practical purposes as equal. The shift towards primary data typology paves the way for a further development: features that are fed into a typology are no longer read off a check-list in an either/or fashion (e.g. language X “has” OV word order etc.). Rather, features can be assigned a quantitative value, determined by the frequency of occurrence in the text under investigation. Thus a particular text type may, for example, be found to exhibit OV-word order in 70% of the relevant constructions, and elsewhere use other word-orders. In principle grammars could – and probably should – make such variation explicit in a precisely quantified manner; yet this is not common practice in most of descriptive linguistics to date (but see Biber et al. 2004 for a notable exception). 
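Assigning a feature a quantitative value amounts to computing relative frequencies over annotated clauses. The sketch below shows the arithmetic only; the labels and counts are invented to mirror the 70% figure used as an example above and are not taken from any of the corpora discussed in this chapter, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical annotations: one word-order label per relevant clause.
CLAUSE_ORDERS = ["OV"] * 7 + ["VO"] * 2 + ["other"] * 1


def order_profile(labels):
    """Return the relative frequency of each word-order label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


profile = order_profile(CLAUSE_ORDERS)
```

Here `profile["OV"]` is 0.7, so instead of stating that the language “has” OV order, the text is described as showing OV in 70% of the relevant constructions, with the remaining 30% distributed over other orders.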
But precise statements about constructional variation in specific types of text in individual languages can lead to finer-grained typologies and to closer attention to the factors determining such variation. They open up the possibility of incorporating language-internal variation in cross-language comparison. The quantitative nature of primary data typology is an aspect that will be of central concern in the present chapter. Apart from the problem of representativity, primary data typology also faces the problem of comparability of text genres and texts of different content. It would, for example, be misleading to compare a cooking recipe text in one language with a traditional folk narrative text in another. Currently two approaches to alleviate these differences are widely used. The first compares


the translations of a single source text in different languages, so-called parallel text typology. Typical source texts are the Universal Declaration of Human Rights, the Harry Potter series (Wälchli 2006; Stolz 2007), Le petit prince (Stolz 2007), or the Bible (cf. Wälchli 2009). The advantages of this method are clear: roughly the same denotational meaning is held constant across different languages, and the language-specific forms for expressing that meaning can be compared. From a practical perspective, the texts are already available in a fairly wide sample of languages, thus reducing the typologist’s workload considerably. The main disadvantages are equally evident: the texts compared are generally samples of written language, often a highly specialized and even artificial register that may have been influenced by the source language of the translation text (cf. Wälchli 2009: Ch. II for discussion of the advantages and disadvantages of parallel text typology, and the contributions in Cysouw and Wälchli 2007 for different applications of the method). An approach that alleviates the latter problem to some extent is the use of standardized stimuli for eliciting connected texts. Well-known examples are the Pear Film developed by Wallace Chafe, or the Frog story, a picture book (Mayer 1994 [1969]). Speakers view the film, or the book, and are then requested to recount the events they have seen in their own language. In this manner, texts can be recorded in the spoken vernacular and with approximately comparable content, avoiding the pitfalls of translations. Nevertheless, it should be borne in mind that Pear Film narratives or Frog story retellings come with their own specific disadvantages: first of all, the cultural and material content of such stimuli may be quite alien to the speakers; they may lack vocabulary items for objects that occur in the stories, or even find them embarrassing or simply incomprehensible.
Second, even within the same speech community speakers differ in the extent to which they elaborate on the details of the events recounted, leading to texts of vastly different length and content. Third, unless the speakers were asked to recount the content of the film or picture booklet to another speaker who did not know the material, retelling the content of a film or a picture booklet to the researcher (who is obviously familiar with the material) is not part of the repertoire of linguistic practices of speakers of many language communities. Finally, speakers may produce quite different types of text, namely they either re-narrate the plot of the film or they describe the film as such, or mix both of these perspectives (cf. Himmelmann 1998: 187; Mosel, p. c.). So while the resultant texts may be preferable to translations on some counts, they are also far from ideal as a source of primary data.

Geoffrey Haig, Stefan Schnell and Claudia Wegener

From a documentary linguistics point of view, both translations and story-retellings are of secondary importance. As Himmelmann (1998) notes, the primary aim of language documentation is to document, as comprehensively as possible, the indigenous linguistic practices of a speech community. As mentioned, translations, or retelling of films, are often not part of those practices, but represent novel genres. There is a very real possibility that speakers, when faced with the challenge of novel speech events, may not be producing the kind of fluid routine sequences that characterize much of our actual linguistic behaviour. Of course speakers’ responses to novel speech events are of considerable interest in their own right (see Mosel 2004 and Margetts, this volume, for discussion of novel communicative events introduced by language documenters into speech communities). But in the context of a typological investigation aiming at comparison of the profiles of languages by studying actual usage, data of this nature are extremely problematic. One of the few researchers to explicitly address this point is Foley (2003), who compares the structure of a traditional indigenous narrative with the structure of a Frog-story retelling in the Papuan language Watam. Foley notes some striking structural differences, and concludes that a description based on a Frog-story retelling could quite significantly distort the language’s discourse profile. An obvious difference between such texts and many indigenous narratives, in our experience at least, is that the former are generally characterized by a paucity of clauses in the first and second person, and of direct and indirect speech. Both of these are extremely common features of traditional narratives of some regions (e.g. the Middle Eastern cultural region), so texts lacking them already differ markedly from the indigenous narrative tradition.
The obvious alternative method, which avoids the drawbacks of the two just discussed, is typology based on the comparison of original texts, i.e. texts produced by native speakers, recounting original content in an indigenous text genre. Typically such texts are monologous narratives of past events, real or mythical, often traditional tales that are part of the community’s cultural heritage, but in some cases accounts of specific events that occurred in the recent past. This method has the advantage of working with texts expressing indigenous content and representing an original linguistic practice within the speech community, hence not obliging speakers to engage in novel or unfamiliar communicative tasks. The vast majority of language documentation projects yield some recordings of spoken narrative monologues. Once transcribed and translated, these texts are a potential data source for primary data typology. The amount of such text already available in digital form can only be estimated, but the DoBeS-programme alone, with its emphasis on text-based documentation and 51 large-scale documentation projects to date,¹ has produced a massive amount of digitally archived spoken language data in various stages of annotation and analysis. It is therefore rather surprising that to date, there has been little serious attempt on the part of typologists to mine these resources. In the remainder of this chapter, we present the results of our own efforts in this direction, which focus on studying the form and function of referring expressions in original texts.

3. Quantitative approaches to typology: The GRAID initiative

3.1. Background

At least as far back as the work of Chafe, Givón and others in the 1970s, it has been known that across natural spoken discourse, languages reveal statistically stable patterns in the ways that referents are introduced into discourse, how they are referred to in subsequent stretches of discourse, and with what frequency per clause. The focus of Chafe’s research (Chafe 1979, 1980, 1987) was on the cognitive and evolutionary grounding of narrative structure, but later researchers have turned their attention to more explicitly grammatical topics, re-forging the link to typology by systematically investigating a larger cross-section of languages. One example is John Du Bois’ research on “Preferred Argument Structure” (PAS), which draws on a quantitative analysis of discourse; we will take up some aspects of his work in 3.4 below. More recently, Bickel (2003) presented a quantitative, cross-linguistic investigation of “Referential Density” (RD), the degree to which a language uses overt expressions for the referents in a given text. Superficially, it might seem that Bickel’s research is similar to the cross-linguistic investigation of “discourse pro-drop” in the generative tradition (cf. Neeleman and Szendrői 2007, 2008). However, Bickel’s Referential Density (RD) cannot be equated with “discourse pro-drop”, as the latter refers solely to subject or object deletion, while RD includes the omission of oblique arguments. More importantly, there is a fundamental difference in methodology between the two approaches, reflecting the discussion in Section 2 above: In their cross-linguistic investigation of argument-deletion, Neeleman and Szendrői treat zero-expression of arguments as a feature of grammars, rather than texts. They explicitly restrict themselves to the question of whether discourse pro-drop is “available” or not (Neeleman and Szendrői 2008: 333) in a given language, i.e. if it can occur at all, and base their decision solely on the investigation of reference grammars. Under what conditions, and with what frequency, discourse pro-drop is actually manifested in texts, lies outside the purview of their investigations. Bickel and his associates, on the other hand, approach the matter from a text-based perspective, which yields variable values (between 0 and 1) for the RD of a given language. In fact, Bickel’s measure of RD is not extractable from a reference grammar.

1. http://www.mpi.nl/DOBES/projects/

Cross-linguistic research within the text-based, quantitative approach presupposes that the texts to be compared have been annotated in a manner that is consistent across different languages. The working assumption is of course that the annotation categories are sufficiently widespread across the world’s languages and are definable in a cross-linguistically valid way. Examples of such categories might be: “subject of transitive verb”, “first person personal pronoun”, “direct object”. There is an ongoing debate about the status of such typological categories (cf. Haspelmath 2007, 2010a,b; Newmeyer 2010), but in practice, typologists have been conducting conventional cross-language research for decades based on just such categories, with considerable success. Therefore, for the time being, we continue to work on the assumption that there are at least some grammatical categories that can meaningfully be applied cross-linguistically. Despite their widespread acceptance, a standardized system of annotating such categories has never been developed. Conventional morpheme-by-morpheme glossing, perhaps the most widespread type of morpho-syntactic annotation, is of only limited use for cross-linguistic comparison.
First of all, the inventory of symbols used varies from language to language, i.e. from grammar to grammar. For example, some grammar writers will use “ACC” to refer to a morphological marker found on direct objects, while others gloss the approximately equivalent marker as “OBJ”, etc. Second, morphemic glosses vary according to how, and if, they make reference to non-expressed categories (zero exponence). Third, morpheme glosses may apply one and the same label to a morpheme, although it occurs in different distributions, for example, a person agreement marker on a possessor and a person agreement marker on a predicative verb. Finally, and most seriously, morphemic glosses render individual morphemic segments, but they provide no direct and consistent means for identifying constituents larger than words, e.g. phrases. Although there are guidelines for morphemic glossing available (e.g. the ‘Leipzig Glossing Rules’²), in practice morphemic glosses are a free-for-all area, with individual practitioners generally developing their own, often quite idiosyncratic solutions. Recently, Seifart et al. (2010) evaluated the possibilities of cross-language comparison of texts glossed with Toolbox. They found that the sole parameter that can be more or less reliably extracted from these data is the respective noun/verb ratios in the texts. Thus, the possibilities of direct comparison of existing morphologically glossed texts appear to be extremely restricted at present. And even if morphemic glossing were to be standardized to a level where cross-corpus comparison was possible, the problem of lack of identification of syntactically relevant units would remain unresolved. The obvious solution, to develop systems for annotating syntactic structures, was never seriously pursued on a large scale. Schultze-Berndt (2006) briefly mentions this possibility in connection with the annotation of small language corpora, only to dismiss it as too time-consuming and hence impracticable. However, over the last couple of years we have been developing such a system for annotating texts in typologically diverse languages, based on a small and standardized system of labels (approx. 30). The system, known as GRAID (Grammatical Relations and Animacy in Discourse), is described in detail in Haig and Schnell (2011), and we will only outline some of the more important properties here. GRAID is an annotation system for glossing connected narrative texts, resulting in an additional rich layer of data on a text corpus that contains information on the interface of discourse and syntax. It identifies the major clause constituents (predicates, arguments), and links the syntactic functions of arguments to their formal properties (e.g. whether they are pronominal, or full NPs). Additionally, GRAID includes information on the distinction between human and non-human reference of arguments,³ as this is an important parameter in shaping many aspects of grammar. GRAID has been developed through glossing actual texts in five genetically and areally diverse languages, and has proved sufficiently flexible to accommodate the attested problems to date.

2. http://www.eva.mpg.de/lingua/resources/glossing-rules.php
3. Note that the category ‘human’ includes anthropomorphized referents.
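The kind of information such an annotation layer records (syntactic function, form, and humanness of each argument) can be pictured as a small data structure. The following is only a minimal sketch: the class name `ArgumentToken`, the string codes and the toy tokens are hypothetical choices made for illustration, and they are not the symbol inventory defined in the GRAID Manual (Haig and Schnell 2011).

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical codes for illustration; NOT the GRAID symbol inventory.
FORMS = ("np", "pro", "zero")   # full NP, pronoun, zero anaphora
FUNCTIONS = ("S", "A", "P")     # intransitive subject, transitive subject, transitive object


@dataclass(frozen=True)
class ArgumentToken:
    """One annotated argument: its form, syntactic function, and humanness."""
    form: str
    function: str
    human: bool

    def __post_init__(self):
        # Reject labels outside the small closed inventories above.
        if self.form not in FORMS:
            raise ValueError(f"unknown form: {self.form!r}")
        if self.function not in FUNCTIONS:
            raise ValueError(f"unknown function: {self.function!r}")


def tally(tokens):
    """Count tokens per (function, form, human) combination."""
    return Counter((t.function, t.form, t.human) for t in tokens)


# Invented toy annotations, not corpus data.
toy_tokens = [
    ArgumentToken("np", "S", True),
    ArgumentToken("zero", "A", True),
    ArgumentToken("np", "P", False),
    ArgumentToken("pro", "A", True),
    ArgumentToken("np", "P", False),
]
toy_counts = tally(toy_tokens)
```

Keeping the label sets small and closed mirrors the idea of a standardized inventory: counts obtained this way are comparable across corpora only because every annotator draws on the same labels.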

GRAID annotations cannot be derived automatically from morphemic glosses as they target primarily clause-level constituents (e.g. subjects of intransitive clauses etc.), so they need to be undertaken manually. Although this is a time-consuming process we suggest that, given the availability of text corpora already entered into annotation tools (e.g. Toolbox), the additional work necessary to add a GRAID-annotation to the corpus remains within reasonable boundaries. In around 10–15 hours, a person with a sound background in syntactic analysis, and knowledge of the language concerned, can annotate a text of around 500 clause units that is then amenable to a variety of quantitative typological cross-corpus investigations, as we will demonstrate in the following sections.

3.2. The GRAID database

Currently, GRAID annotations have been added to monologous narrative texts from five existing language documentation projects; the data from four of these were taken as the basis for this investigation.⁴ Table 1 gives an overview of the raw data; sources for the languages concerned can be found in Appendix 5.

Table 1. Overview of GRAID data used in this study

Language   Affiliation; location               Annotator         Words   Clause units
Awetí      Tupi-Guarani, Brazil                Sabine Reiter      2289            466
Gorani     Indo-European, Iranian; West Iran   Geoffrey Haig      1835            551
Savosavo   Papuan Isolate; Solomon Islands     Claudia Wegener    3140            659
Vera’a     Austronesian, Oceanic; Vanuatu      Stefan Schnell     3006            546
Totals:                                                          10270           2222

In addition, spoken German renderings of the Pear Film have also been annotated, based on the text versions in Himmelmann (1997: 234–243). As this material belongs to a different genre, it is only occasionally mentioned for comparison, but not included in the analysis.

4. Data from the fifth language, Forest Enets (Samoyedic, cf. Siegl, this volume) were not fully available at the time of writing, hence were not included in the current analysis.

The texts from the documentation projects were selected according to the following criteria: monologous narratives recounting traditional tales or historic events that are part of the cultural heritage of the speakers. The texts had to have been recorded, transcribed, translated and analysed prior to being annotated with GRAID. In all cases, the annotator was the main investigator of the language concerned, thus ensuring the expertise was available to make the analytical decisions necessary during annotation.⁵ For some languages, more than one text was needed to reach the minimum amount of around 500 clause units. Annotation took place over the course of approx. 18 months, with close cooperation between the annotators. A workshop was held in 2010 in Bamberg where the annotators for Awetí, Gorani, Forest Enets and Vera’a conducted intensive discussion and refined the annotation practice in response to challenges raised by the individual languages. These efforts fed into the current version of the GRAID Manual (Haig and Schnell 2011), which contains the main principles for GRAID annotations, the inventory of symbols, and illustrative examples and explanations. An important feature of this research is that the data meet the general requirement of maximal accountability, because they are based on archived and freely accessible material (see Trilsbeek et al., this volume). This is one of the key advantages of working with data from language documentation projects, where data accessibility and durability are afforded maximum priority from the outset. GRAID thus profits from requirements that are already in place, by simply adding an extra tier of annotation to existing archived data. It also distinguishes the present work from much comparable work in text-based typology, for which the raw data is often not accessible. Given the bottle-neck of additional manual annotation, the issue of corpus size is obviously highly relevant. Surprisingly, it is seldom explicitly discussed in the literature on quantitative approaches to typology. An exception is the following citation from Nichols (2008: 124), discussing the size of corpora necessary to determine the types of argument structure typically associated with certain verb meanings:

A text frequency survey probably gives the most sensitive indication of a language’s overall preference, but it is labour-intensive and requires close control of genres, stylistic levels, etc. in text corpora. I estimate ... that a corpus of about 1000 clauses exclusive of those containing the verb ‘be’ that swamps all frequencies in some Indo-European languages, is enough to yield reliable information on natural frequencies of verbs ...

5. In subsequent research we will need to cross-code the data to ensure consistency and objectivity of the coding; this has not been done yet because the GRAID coding system is still being refined and improved, and the necessary guidelines for each language are still in preparation.

A brief look at published research on text-based quantitative typologies is instructive. Table 2 gives the size of corpora used in the respective studies (in clause units) for a selection of publications investigating syntactic aspects of connected discourse. With the exception of Clancy (2003), few of the studies based on spoken language that we have seen to date have a database of more than 1000 clause units (and the clause units of Clancy’s child speech may have been considerably shorter than the clause units of the adult speech in the other studies).

Table 2. Corpus size in selected text-based studies

Author                     Language(s)                 Text type                      Clause units
Du Bois (1987)             Sakapultek                  Pear Stories                            456
Bickel (2003)              Belhare, Maithili, Nepali   Pear Stories                            568
Genetti and Crain (2003)   Nepali                      Traditional narratives                  861
Clancy (2003)              Korean                      Children’s speech (1–2 yrs.)           4363
Payne (1992: 57)           Yagua                       Traditional narratives                 1156

Comparing the data in Table 1 and Table 2, it is evident that our own corpora fall within the range of those used in previously published research. Taken together, our corpus actually already represents one of the largest, fully accessible corpora of its kind currently available. One potential disadvantage of working with indigenous narrative texts is that the content of each text differs, thus presumably diminishing the degree of comparability across languages. When we first set out to compare original texts, this was obviously a matter of some concern, and it was precisely this concern that has led many researchers to adopt the parallel-text, or story-retelling, methodologies outlined above. However, our initial analysis of the indigenous texts has in fact revealed rather remarkable areas of global stability across all four languages, which are suggestive of the fact that as a text type, such monologous narratives have enough commonalities to make cross-language comparison meaningful. But we have also identified an area of considerable cross-language variation, namely in the deployment of pronouns. Here it seems that the discourse-typological profiles of the individual languages do indeed exert an influence. In the following sections, we will first outline the areas where the four languages exhibit considerable parallels, before exploring one area of high cross-language variation.

3.3. Ratios of core arguments and human expressions

In this section we present the overall distribution of core arguments as well as human expressions, as background figures for the analyses presented in Sections 3.4 and 4 below. Regarding the notion of ‘core arguments’, throughout this chapter we restrict our analysis to the properties of S (intransitive subject), A (transitive subject) and P (transitive object). The GRAID annotation system does in fact provide for additional arguments, but their inclusion will be the subject of future analysis. The respective proportions of S, A and P in the four text corpora are as follows:

Table 3. Proportion of S, A and P arguments

           S             A             P             total
Awetí      55.0% (325)   21.1% (125)   23.9% (141)   591
Gorani     46.2% (306)   27.4% (182)   26.4% (175)   663
Savosavo   51.2% (391)   24.9% (190)   24.0% (183)   764
Vera’a     52.6% (374)   23.2% (165)   24.2% (172)   711

Following the definition of A and P functions adopted in Haig and Schnell (2011), these two arguments should always co-occur in one transitive clause and hence show exactly the same proportions. The higher number of Ps in the Awetí and Vera’a data can be attributed to the presence of non-finite constructions, which may contain a P, but systematically block the expression of A. In Gorani, the slightly higher number of As results from the occasional use of an adpositional object with an otherwise transitive verb, as in English he drank of the medicine, which could be glossed with an A, but lacks a regular P. In Savosavo, it is mostly due to relative clauses where P was the relativized constituent, and which therefore could not contain any expression of P. What is, of course, behind these figures is the proportion of transitive and intransitive clauses. If we take the number of P arguments to represent the number of transitive clauses, hence including infinitival constructions in this category, the results obtained are as follows:

Table 4. Proportion of intransitive and transitive clauses

           intransitive   transitive    total
Awetí      69.7% (325)    30.3% (141)   466
Gorani     63.6% (306)    36.4% (175)   481
Savosavo   68.1% (391)    31.9% (183)   574
Vera’a     68.5% (374)    31.5% (172)   546
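The percentages in Table 4 follow directly from the raw counts, taking the number of S arguments as the number of intransitive clauses and the number of P arguments as the number of transitive clauses. As a minimal sketch of that arithmetic (the function and variable names are our own illustrative choices; only the counts come from Table 4):

```python
# Raw clause counts (intransitive, transitive) as given in Table 4.
TABLE_4 = {
    "Awetí": (325, 141),
    "Gorani": (306, 175),
    "Savosavo": (391, 183),
    "Vera’a": (374, 172),
}


def intransitive_share(intransitive, transitive):
    """Percentage of intransitive clauses among all clauses counted."""
    return 100 * intransitive / (intransitive + transitive)


# Unrounded percentage per language.
intr_shares = {lang: intransitive_share(i, t) for lang, (i, t) in TABLE_4.items()}

# Cross-language mean and range (in percentage points).
mean_share = sum(intr_shares.values()) / len(intr_shares)
share_range = max(intr_shares.values()) - min(intr_shares.values())

for lang, share in intr_shares.items():
    print(f"{lang}: {share:.1f}% intransitive")
print(f"mean {mean_share:.1f}%, range {share_range:.1f} percentage points")
```

Working from the unrounded per-language values, the mean comes out at about 67.5% and the range at about 6.1 percentage points.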

The similarities are quite remarkable: the mean proportion of intransitive clauses across the four languages is 67.5%, with a range of just 6.1 percentage points. While a corpus of four languages is obviously much too small for drawing far-reaching conclusions, these results are certainly indicative that monologous narrative texts reveal a fairly stable ratio of intransitive to transitive clauses of approximately two thirds to one third. Hence, these texts are remarkably homogenous despite their very varied contents. Interestingly, these proportions match those observed by Everett (2009: 9) for his corpus of English and Portuguese data, in which 70% and 62% of clauses were intransitive. The fact that this corpus consisted of data from a very different genre, i.e. conversational data, suggests a more fundamental relevance of this proportion. As mentioned above, GRAID annotations also note the semantic feature of humanness as a sub-aspect of the more general dimension of animacy, and the distribution of human referents in texts can therefore be investigated quantitatively with our system. The traditional stories investigated here primarily recount the actions of human beings, often of one central character. But the total number of human actors that are introduced in the different stories varies from text to text, as does the number of non-human referents – artefacts, places, objects – that occur. One might assume that the actual number of human participants that play a role in a given narrative would affect the rate of occurrence of human referents in texts, and potentially influence the distribution of referential expressions. The actual proportion of human vs. non-human arguments in core functions is provided in Table 5. As can be seen from the table, texts in all four languages figure more human than non-human discourse participants, with a mean of 69.7% of human discourse referents in the overall corpus, with a range of 11.8 percentage points. 
Table 5. Proportion of human vs. non-human arguments in core arguments

         Awetí   Gorani   Savosavo   Vera’a
+ HUM    68.4%   62.8%    74.6%      73.0%
- HUM    31.6%   37.2%    25.4%      27.0%

Again it seems that we are dealing with a relatively stable ratio, which is quite surprising given that these texts have not been controlled for content. A very preliminary conclusion is that in narrative texts, somewhere in the region of two-thirds of the core arguments refer to human participants. Whether the proportion of two thirds is genuinely stable, or merely an artefact of the small corpus remains to be seen. But it is once again indicative of the fact that traditional narratives, despite the differences in content, show similar profiles in some aspects, and are therefore in principle comparable across languages. Looking closer at the proportion of human referents in individual core functions, the languages in our corpus show a clear, and again homogenous, pattern (see Table 6).

Table 6. Proportion of human arguments among S, A and P arguments

            Awetí   Gorani   Savosavo   Vera’a
S  + HUM    70.2%   83.7%    81.1%      84.8%
   - HUM    29.8%   16.3%    18.9%      15.2%
A  + HUM    96.8%   92.7%    97.4%      93.3%
   - HUM     3.2%    7.3%     2.6%       6.7%
P  + HUM    39.0%   26.4%    37.2%      27.9%
   - HUM    61.0%   73.6%    62.8%      72.1%

Not altogether surprisingly, referential expressions denoting humans are very often found as S or A, and more rarely in P. The extremely high proportion of human expressions in A function, and the relatively low one in P function reflect a well-known tendency that can be attributed to the idea that in typical transitive events humans act upon inanimate things (Comrie 1989: 128; Næss 2007: 18), a tendency that we will return to in Section 4 below.

3.4. Testing a well-known constraint: Avoid Lexical A

As a test case for our corpus, we investigated a well-known constraint on the occurrence of noun phrases in A function, i.e. as transitive subjects. This constraint was originally noted in Du Bois (1987), using data from Sakapultek. Du Bois found that in connected discourse, S, A and P differ in their respective preference or dispreference for being filled by lexical expressions (NPs) as opposed to non-lexical expressions (zero or pronouns). The tendencies concerned are collectively referred to as ‘Preferred Argument Structure’ and have since been confirmed in a number of other languages (see Du Bois, Kumpf, and Ashby 2003). While Du Bois makes a two-way distinction between lexical expressions (NPs), and non-lexical expressions (pronouns and zero anaphora), in our data we maintain a three-way distinction in form-types: Full NP (lexical), pronoun, or zero anaphora (cf. Section 4, where we discuss the relevance of pronominal, as opposed to both lexical and zero, expressions in greater detail). Now in any given text one might expect a random distribution of lexical NPs, such that, for example, in each of S, A and P functions we find approximately one third of the total lexical NPs in a text. This turns out not to be the case: For all languages investigated to date, the proportion of NPs in the A function falls far below what could be expected from a random distribution. In the data cited in Du Bois, Kumpf, and Ashby (2003: 37), drawn from eight languages, the proportion of lexical arguments in the A function does not exceed ten percent in any of the languages, while the remaining 90% are distributed approximately evenly across S and P. What this tendency indicates is that lexical mentions are not randomly distributed among S, A and P, but tend to be split across S and P, while avoiding A. Du Bois refers to this constraint as “Avoid Lexical A”.
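The expectation of roughly one third of the lexical NPs per function can be compared with observed counts by means of a chi-square goodness-of-fit statistic. The sketch below is only an illustration under stated assumptions: the counts are invented toy numbers, not figures from any corpus, the function names are hypothetical, and the test shown is not necessarily the one applied in the studies cited. The critical value 5.991 is the standard tabulated chi-square value for two degrees of freedom at the 0.05 level.

```python
def lexical_shares(counts):
    """Share of all lexical NPs falling into each function."""
    total = sum(counts.values())
    return {function: n / total for function, n in counts.items()}


def chi_square_uniform(counts):
    """Goodness-of-fit statistic against an even split across all functions."""
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((n - expected) ** 2 / expected for n in counts.values())


# Invented toy counts of lexical NPs per function; NOT corpus figures.
toy_lexical = {"S": 45, "A": 10, "P": 45}

# Tabulated critical value: chi-square, 2 degrees of freedom, alpha = 0.05.
CRITICAL_DF2_ALPHA_05 = 5.991

toy_shares = lexical_shares(toy_lexical)
statistic = chi_square_uniform(toy_lexical)
rejects_even_split = statistic > CRITICAL_DF2_ALPHA_05
```

A statistic above the critical value means that an even split across S, A and P is an implausible description of the counts; it does not by itself show which function is responsible, so the shares should be inspected alongside it.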
Despite its cross-linguistic robustness, as yet we lack a database large and varied enough to empirically back one or the other explanation proposed in the literature (cf. Du Bois 1987; discussion in Goldberg 2004 and more recently in Everett 2009). The raw figures from our corpus are provided in Table 7, showing the distribution of lexical mentions across core arguments. The data from our corpus clearly confirm Avoid Lexical A: Nowhere does the A function account for more than 14.4% of the lexical mentions. In all cases, the value for non-lexical A arguments is significant according to chi-square tests yielding p [...]

[...] closed vowels > liquids > nasal consonants > fricatives > plosives. It should be added that, as far as vowels are concerned, nasalization increases sonority, and a voiced obstruent is more sonorous than its voiceless counterpart.

Klaus Geyer

Even if the sonority maximum most often coincides with the syllable peak, this is not necessarily the case. Regarding the parameter of sonority, which, according to Kohler (1995: 74), is somewhat “impressionistisch gewonnen” [impressionistically obtained. (Translation K.G.)], one has to bear in mind that the gradation of more or less sonorous sounds in the sonority hierarchy requires the tacit prerequisite that the sounds in question are produced with the same articulatory effort. Differences in terms of loudness, duration, and/or pitch can override sonority, or, as Schubiger (1977: 108) puts it: “Die Sonoritätsskala behält ihre volle Gültigkeit nur bei gleichbleibendem Atemdruck. Druckschwankungen können Verschiebungen bewirken.” [Only if the expiratory pressure is constant does the sonority hierarchy remain fully valid. Variations in pressure may cause a shift. (Translation K.G.)] Thus, we have to distinguish between what could be called the sonority contour and the prominence contour in a syllable. Evidence for the validity of this differentiation can be found in Lithuanian, where diphthongs possessing roughly the same initial and target vowel qualities and, consequently, an identical sonority contour, may form minimal pairs with respect to a decreasing or an increasing prominence contour, cf. áukštas ‘high’ vs. aũkštas ‘floor, storey’.⁵ In order to have a means for differentiating sonority and prominence, I suggest terming diphthongs with a decreasing prominence contour early-peak diphthongs and those with an increasing prominence contour late-peak diphthongs.⁶ If we take the fact seriously that sonority, in marked cases, is not the single controlling parameter for syllable peak formation in diphthongs, the search for this kind of mismatching sequence can be further extended. Sievers already did this in 1893, when he stated the following on the combinations of vowels with liquids and nasals:

Auch hier haben wir es hauptsächlich nur mit den einsilbigen Verbindungen zu thun. Diese sind den Verbindungen zweier Vocale vollkommen analog, nur mit der Einschränkung, dass nach den Gesetzen der Abstufung der Schallfülle ... die Liquide und Nasale in fast allen Fällen die unsilbischen Glieder der Verbindungen sind. Dass wir Gruppen wie ar, al, an, am, aŋ

5. The place of the syllable peak is marked by the accent signs tilde ˜ and acute ´. These are also used to indicate different stress types in Lithuanian.
6. A third “contour-type” of diphthongs occurring in some languages (though not in Lithuanian) might be floating diphthongs without a clearly localizable peak, as Grønnum (1998: 79) puts it (even if this is done in connection with sonority and not with prominence).

Diphthongology meets language documentation

(genauer geschrieben ar̯, al̯, an̯, am̯, aŋ̯, um die unsilbische Geltung des an zweiter Stelle stehenden Sonorlauts zu bezeichnen) nicht auch als ‘Diphthonge’ auffassen, liegt grossentheils bloss an der Gewohnheit, l, r, m, n, ŋ als ‘Consonanten’ zu bezeichnen, die mit einem ‘Vocale’ nicht eine derartig homogene Verbindung eingehen können wie zwei ‘Vocale’ unter einander. (Sievers 1893: 154)

[Also here, we are mostly dealing with monosyllabic combinations. These are completely analogous to the combinations of two vowels, with the restriction that according to the laws of gradation of sonority ... the liquids and nasals are almost always the non-syllabic members in the combinations. The reason why we do not also consider sequences like ar, al, an, am, aŋ (more exactly ar̯, al̯, an̯, am̯, aŋ̯, to mark the non-syllabicity of the sonorous sound in the second place) diphthongs is to a great extent due to the habit of calling l, r, m, n, ŋ ‘consonants’, which cannot form as homogeneous a combination with a ‘vowel’ as two ‘vowels’ can do with each other. (Translation K.G.)]

The so-called semi-diphthongs in Lithuanian – the term Semi-Diphthong was presumably coined by Kurschat (1876) – seem to constitute an example of a rather remarkable mismatch of sonority and prominence contours. These semi-diphthongs are defined as “tautosyllabic clusters ‘vowel + sonorant’” (Ambrazas 1997: 25) and come, like the diphthongs /aũ/ and /áu/ mentioned above, in two prominence contours, namely early-peak and late-peak. An early-peak sequence of vowel plus sonorant as in pažínti ‘to know’ is not at all a peculiarity and would normally not be dealt with under the heading of diphthongs. But in the case of vowel plus sonorant combinations in Lithuanian, a sonorant /l, m, n, r/ may also form the (late) syllable peak – this in spite of the preceding, much more sonorous vowel in the syllable. To make the sonorant the syllable peak, extra effort in articulation is required in the form of higher expiratory pressure, resulting in a higher volume, and, in addition, some lengthening of the sonorant; cf. the following examples for illustration, the accent sign over the letter again indicating the place of the syllable peak: añtrą kar̃tą ‘second.ACC time.ACC’, kal̃bate ‘speak.2PL’. Although the matter of Lithuanian semi-diphthongs deserves much more thorough inquiry than this somewhat cursory discussion can present here due to limitations of space, it has become clear that there is much more variation in diphthongs than one might expect. Therefore, one should be cautious with claims like the following by Sánchez Miret (1998), who says that “diphthongs like [e̯ɪ, o̯u, ə̯i, ə̯ɪ, a̯e, a̯ə] are actually unpronounceable, no matter how much one can diminish the expiratory intensity on the first element” (Sánchez Miret 1998: 44) – at least if it is not intended to be a statement about one’s personal articulatory abilities. In any case, such mismatches of sonority contour and prominence contour may be seen as marked structures. It should be added that whether one wants to call sounds or sound sequences like the Lithuanian ones (semi-)diphthongs or not, is not the crucial question. If a phenomenon like the Lithuanian semi-diphthongs with a syllable peak on the less sonorous sonorant phase does occur, any language description or documentation has to account for it. In my view, the best place to do so would be in the section on diphthongs and syllable structure in the chapter on phonology.

2.4. Summary: What is a diphthong?

In this section, the basic notions of diphthongs were discussed. It was pointed out that an articulatory movement in production leading to an auditorily perceivable change of quality is the minimum requirement for a vocoid sound to count as a diphthong; the representation of a diphthong by means of two symbols has to be understood as an abstraction inasmuch as these symbols only indicate the initial and the target positions of the diphthong and not two distinct “halves” – even if the transition from initial to target position can be a more gliding or a more sequential one. Besides reviewing the previously established distinctions of phonetic vs. phonological, mono- vs. biphonemic, and, with respect to sonority contour, falling vs. rising diphthongs, we introduced the criterion of prominence with the distinction of early-peak vs. late-peak diphthongs. The evidence for the need to distinguish sonority and prominence contours came from examples like Lithuanian, where less sonorous phonetic vowels and even sonorants like /m, n, l, r/ can form syllable peaks.

Diphthongs may turn out to be best analyzed and described as derived on the synchronic level in the sense that a phonetic diphthong appears as the result of a phonological rule such as, e.g., vocalization of certain consonants in certain positions affecting an underlying phonological non-diphthong (a vowel + consonant sequence). Diphthongs may also be derived – or: have developed in a historical sense – by sound change processes; sometimes, those diphthongs are termed primary, secondary, tertiary etc. Be the derivational relation synchronic or historical, any description and documentation will benefit from establishing and explaining systematic links like these.

3. Case study: Finnish

3.1. Why Finnish?

Finnish has a remarkably large diphthong inventory that constitutes a true challenge for a proper description – a challenge that has not been met in a completely satisfactory way to this day, at least as far as descriptive grammars or grammatical sketches are concerned. Having a closer look at the ways the diphthongs of this language are systematized and presented in grammars is, however, not only helpful for the sake of developing a proper description of Finnish diphthongs. It is also instructive for the purpose of establishing criteria for potentially relevant diphthong features in general.

To simplify quite a bit (see e.g. Abondolo 1998 for a detailed overview), Finnish phonology is, amongst other properties, characterized by rather symmetrical, highly integrated systems of vowels and consonants, length being distinctive in all vowels (monophthongs) and in almost all consonants. Word stress is on the first syllable as a rule. One main property is the presence of vowel harmony. This property is twofold: it concerns the dimensions of place (front vs. back) and of lip rounding (round vs. unround). Of the eight vowel phonemes /i, y, u, e, ö, o, ä, a/,⁷ /i/ and /e/ are neutral with respect to both of the dimensions, whereas the open vowels /a/ and /ä/ do not participate in the lip harmony. In other words, /y/ and /ö/ as well as /u/ and /o/ are the only fully harmonic vowels (the same holds true for the long counterparts, of course).

3.2. Item- and inventory-related questions

3.2.1. Finnish diphthong inventories

To start with, the bare number of diphthongs is already a matter of disagreement: some representations count 16, others 17, still others 18 diphthongs. The differences are, however, not as big as they might seem, since there is a core set of 16 diphthongs that are unequivocally accepted as such by e.g. Campbell (1995: 461), Karlsson (1992: 14; 1999: 24), Mitchell (2001: 215) and Sulkala and Karjalainen (1997: 373). These diphthongs are: /ai, ei, oi, ui, yi, äi, öi, au, ou, eu, iu, äy, öy, ie, uo, yö/. Karlsson (1992: 14), Mitchell (2001: 215) and Sulkala and Karjalainen (1997: 373) add /ey/ as a 17th diphthong, whereas the diphthong /iy/ is listed as no. 18 only by Karlsson (1992: 14) and Mitchell (2001: 215).⁸

Two main groups are generally distinguished in these representations: firstly, diphthongs ending in one of the high vowels /i, y, u/, forming a group of 13, 14, or 15 items respectively, depending on whether or not /ey/ and /iy/ are included; and secondly, a small but stable group of the three diphthongs /ie, uo, yö/. These two groups are more or less explicitly defined either with respect to the articulatory movement or with respect to the sonority contour. In accordance with articulatory movement, the first, bigger group consists of closing diphthongs, the second, smaller group of opening diphthongs. In accordance with sonority contour, the first group consists of falling, the second group of rising diphthongs. Irrespective of the exact number of items in the inventory, all of the 16, 17 or 18 diphthongs comprise in their first and second phases (i.e. in their initial and target positions, respectively) only vowel qualities that also occur in monophthongs.

7. The monophthongs of Finnish are usually systematized as follows (± long):

| | front unrounded | front rounded | back |
|---|---|---|---|
| close | i | y | u |
| mid | e | ö | o |
| open | ä | | a |

3.2.2. Diphthongs with restricted occurrence

It is obvious that it is the two diphthongs /iy, ey/ that are causing problems in identifying the exact size of the diphthong inventory. Even one and the same author (Fred Karlsson, 1992 and 1999) makes different decisions and, consequently, arrives at differing counts at different times in his analyses of the Finnish sound structure. /iy, ey/ are indeed heavily restricted in their occurrence – without, however, being suspected of being loans: /ey/ only shows up in leyhyä ‘to wave’ and in a few more words derived from the same root, even though the sequence /e/ plus /y/ itself is far from infrequent in Finnish words. In most cases, however, the two sounds are divided up into two syllables by a hiatus, e.g. in ter.ve.ys ‘health’. Regarding the second critical diphthong, the only occurrence of (tautosyllabic) /iy/ is in the place name Kiysaari. A decision has to be made here whether unique elements like the diphthongs /iy/ (restricted to one single word) and /ey/ (restricted to one single root) should or should not be included in the overall diphthong inventory of the language. In my view, they are definitely part of the sound system – but in terms of core and periphery, they are clearly part of the periphery rather than of the core.

8. Actually, the diphthong chart Mitchell provides comprises 20 positions, since, for unknown reasons, /ie/ and /ei/ are listed twice.

Another peculiarity must be mentioned here, namely that the diphthong /öi/ never occurs in roots, but can only be the result of a morphological construction involving stem-final /ö/ + (nominal) plural /i/ or (verbal) preterite or conditional /i/. So, /öi/ is peculiar not with respect to frequency of occurrence in word forms, as /ey/ and /iy/ are, but rather with respect to place of occurrence in a word form. The sequence is frequent since both nominal plural and verbal preterite and conditional formation are very common morphological constructions, and stem-final /ö/ is not really rare. Because /öi/ is quite restricted with respect to possible places of occurrence in word forms, it is, in my view, more peripheral than other diphthongs. An interesting fact, also from the viewpoint of documentary linguistics, is that the most common diphthongs in a sample of 6,700 diphthongs turned out to be /oi/ (21.4% of all diphthongs), /ai/ (16.5%), and /ei/ (15.1%) (cf. Häkkinen 1977, cited in Karlsson 1983: 90); note that these are all diphthongs that occur freely both in roots and across morpheme boundaries of the type mentioned above.

Probably the most widely discussed issue in terms of core and periphery of phonological (and other linguistic) elements is whether or not loans – here: loan diphthongs – should be considered part of the inventory of a given language. This is, actually, not a big issue in Finnish, but for the closely related Estonian language, it is reported that out of the up to 36 diphthongs, as many as 10 solely occur in loan words (Viitso 2003: 22). And for the Baltic language Lithuanian, recent presentations (Ambrazas 1997; Eckert, Bukevičiūtė, and Hinze 1994) unanimously count, beyond the genuine items, the diphthongs /eu, oi, ou/ “in words of foreign origin” (Mathiasen 1996: 30) as part of the inventory – despite the fact that Lithuanian grammaticography tends to be very conservative in an overall perspective.
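The inventory counts discussed in Sections 3.2.1 and 3.2.2 can be checked with a small sketch. It only restates the lists given in the text; the names `CORE`, `PERIPHERAL`, `split_groups` and `inventory_sizes` are hypothetical and introduced here for illustration, and the strings are assumed to use precomposed characters (ä, ö) so that each diphthong is exactly two characters long.

```python
# Core set of 16 diphthongs accepted by all sources cited in Section 3.2.1.
CORE = ["ai", "ei", "oi", "ui", "yi", "äi", "öi",
        "au", "ou", "eu", "iu", "äy", "öy",
        "ie", "uo", "yö"]

# Items with heavily restricted occurrence (Section 3.2.2):
# /ey/ is limited to one root, /iy/ to one place name.
PERIPHERAL = ["ey", "iy"]

HIGH_VOWELS = set("iyu")          # target vowels of the first, bigger group
OPENING = {"ie", "uo", "yö"}      # the small, stable second group


def split_groups(inventory):
    """Split an inventory into (high-vowel-final group, opening group)."""
    high_final = [d for d in inventory if d[-1] in HIGH_VOWELS]
    opening = [d for d in inventory if d in OPENING]
    return high_final, opening


def inventory_sizes():
    """Return (total, high-vowel-final, opening) counts per inventory variant."""
    variants = {
        "core": CORE,
        "core + ey": CORE + ["ey"],
        "core + ey + iy": CORE + PERIPHERAL,
    }
    sizes = {}
    for label, inventory in variants.items():
        high_final, opening = split_groups(inventory)
        sizes[label] = (len(inventory), len(high_final), len(opening))
    return sizes


if __name__ == "__main__":
    for label, counts in inventory_sizes().items():
        print(label, counts)
```

Keeping the peripheral items in a separate list mirrors the core/periphery distinction argued for above: the size of the inventory then depends only on which peripheral items a given description chooses to include.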

3.2.3. Sonority and prominence in Finnish diphthongs

One feature all Finnish diphthongs have in common is that the first phase of the diphthong is more prominent than the second one, i.e. all Finnish diphthongs are early-peak diphthongs. For the first – and much bigger – group mentioned above, namely the diphthongs ending in a closed vowel /i, y, u/, the decrease of prominence is to be expected, since it is matched by the decrease of sonority. But in group two, we find diphthongs that show a rising sonority contour and decreasing prominence simultaneously. It is worth mentioning that the three “mismatching” Finnish diphthongs /ie, yö, uo/ get restructured as soon as they are followed by an /i/ – which is, as mentioned above, very common in morphological constructions – in the sense that they adapt to the unmarked, prevailing pattern, avoiding triphthong formation, cf. e.g. tie ‘path’, *tie-i-tä > te-i-tä ‘path-PL-PART’; yö ‘night’, *yö-i-tä > ö-i-tä ‘night-PL-PART’. In this respect, these are unstable diphthongs.

3.3. System-related issues

3.3.1. Finnish diphthong systems

Up to now, we have said very little about the distinctive features of Finnish diphthongs and about how to fully systematize them. An examination of the existing systematizations of the Finnish diphthong inventory proves to be insightful and illustrative with respect to the difficulties caused in part by the size of the inventory, in part by the lack of a working analytic tool. In the following, I will give five examples of diphthong systems in Finnish, as they can be found in current grammars and grammatical sketches. Note that all systems given below comprise only 16 items, except that of Hakulinen et al. (2004), which includes the controversial /ey/ and /iy/.

Branch (1987: 596; also Lyovin 1997: 80) represents the Finnish diphthongs in the following way:

ei äi ui ai oi öi yi ie äy eu au ou öy yö uo iu

Karlsson (1999: 24–25) lists four groups of diphthongs:

(a) diphthongs ending in /i/ (/ei, äi, ui, ai, oi, öi, yi/)
(b) diphthongs ending in /u/ (/au, ou, eu, iu/)
(c) diphthongs ending in /y/ (/äy, öy/)
(d) the three diphthongs /ie, yö, uo/

Hakulinen et al. (2004: §21, translation K.G.) operate with three main types of diphthongs and some subdivisions:

closing diphthongs: ei öi äi oi ai | ey öy äy | eu ou au
closed diphthongs: yi ui | iy iu
opening diphthongs: ie yö uo

Groenke (1998: 133, translation K.G.) provides these four groups of diphthongs:

ei öi oi öy eu ou
äi ai äy au
3 falling: uo, yö, ie
2 de-rounding diphthongs: yi, ui
1 rounding diphthong: iu

And finally, Fromm’s (1982: 31, translation K.G.) analysis comprises the following five groups:

I ei öi äi oi ai
II öy äy
III eu ou au
IV ie yö uo
V yi ui iu

I. consisting of an open or mid-open vowel phoneme and /i/,
II. and III. consisting of an open or mid-open vowel phoneme and a closed one as demanded by vowel harmony,
IV. consisting of a closed phoneme and the corresponding mid-open one,
V. consisting of a closed phoneme and a phoneme of the same degree of openness, but contrasting in the feature of rounding.

Let us briefly comment on the diphthong systems presented above. Branch (1987) and, in an identical way, Lyovin (1997) systematize the diphthong inventory to a certain extent, as their chart suggests. However, since the criteria for systematization are not explicitly stated by the authors, one has to
deduce them from the positioning of the elements in the chart. The target vowel quality seems to play a major role for the arrangement of the columns, whereas the arrangement of the lines all in all remains cryptic. This applies in particular to the lines at the bottom, where it remains an open question why the three opening / rising diphthongs have been integrated in this position. Clearly, such attempts do not meet the basic requirements of consistency and explicitness for a phonological systematization. Karlsson (1999) is, in some sense, not so different from Branch (1987) and Lyovin (1997): he, too, uses the target vowel quality /i/, /u/ and /y/ as the main classificatory criterion. In contrast to Branch (1987) and Lyovin (1997), however, he at least explicitly mentions this criterion. However, the criteria underlying the arrangement of items within the three groups again have to be inferred somehow by the reader. Karlsson’s (1999) group 4, unlike groups 1 to 3, is a bare enumeration, not characterized by any feature. As far as any criteria are retrievable at all, these systematizations only seem to make use of the static properties of diphthongs, i.e. the positions of either the initial or, possibly, the target vowel quality. Hakulinen et al. (2004) is the only systematization comprising 18 diphthongs. The vertical tongue movement functions as the main criterion for classification here; the quality of the target vowel seems to be an additional criterion for systematizing closing diphthongs, as the respective subgroups are constituted by separating marks “|”. Unfortunately, this feature is not explicitly pointed out. Regarding the group of closed diphthongs, however, the subgrouping is carried out according to the position of the /i/, i.e. whether it is the initial or the target vowel quality in the diphthong. 
The order of diphthongs within the subgroups seems to follow, by and large, the principle of the IPA chart to put front vowels to the left and back vowels to the right. The opening diphthongs in the third group seem to be arranged in the same way. Hakulinen et al. (2004) use the dynamic criterion of vertical tongue movement (or the absence of it, respectively) as a feature for forming the three main groups, which evidently improves the analysis.

The systematization Groenke (1998) suggests in his sketch of Finnish is somewhat suspect because the feature of vertical tongue position is described using the same terms that are otherwise used for sonority contours (rising, falling).⁹ The features related to changes in lip rounding that Groenke brings into play are, however, useful ones since they address the dynamic nature of diphthongs. But whilst /iy/ only displays lip movement (rounding) in terms of articulation, both /iu/ and /ui/ imply, in addition to lip movement (rounding in /iu/ and de-rounding in /ui/), horizontal tongue movement (backing in /iu/ and fronting in /ui/). Diphthongs where only one articulatory feature changes from initial to target position, as in /iy/, but also in /yi, uo/ etc., are termed homogeneous (cf. Roca and Johnson 1999: 190–191), whereas in a heterogeneous diphthong more than one feature changes, e.g. in /ui, äy/.

The system proposed by Fromm (1982), finally, not only makes use of dynamic as well as static features, it also labels them – at least in the descriptions of the five (or possibly actually four, since II and III are treated together) groups. The gap in the chart of group II on the top left is to be filled with /ey/, which shows that /e/, in contradiction to the explanation given by Fromm, does not participate in vowel harmony; /iy/ could easily be integrated in group V.

In my view, none of these systematizations, not even the most recent one, adequately captures the distinctiveness of relevant features in Finnish diphthongs completely. Therefore, I will present a more consistent analysis in the next section.

9. It may be somewhat unclear whether this is a question of terminological confusion regarding vertical tongue movement and sonority contour, or whether it has to do with some sort of different terminological tradition in describing phonology in German. In German phonology, the terms ‘rising’ and ‘falling’ are traditionally used in the same way Groenke uses them here, for instance in several introductory textbooks on phonetics and phonology – even in most recent ones, cf. Pompino-Marschall (1995: 218–219), Graefen and Liedke (2008: 220), Altmann and Ziegenhain (2010: 47). If the latter, however, should prove to be true, this would result in an inconsistent terminology, where a diphthong, oddly enough, can be rising and falling at the same time, depending on whether the feature under discussion is articulatory movement or sonority. Therefore, I strongly prefer and recommend to strictly distinguish between rising / falling on the one hand, and closing / opening on the other hand.

3.3.2. Taking the dynamic nature of diphthongs seriously in systematization

The following account for systematizing Finnish diphthongs makes use of both dynamic and some static features as used in the aforementioned inventories and aims to develop them further. Thus, the system employs the dynamic feature of movement in vertical tongue position [± move]. Note that no movement in vertical tongue position, thus [– move], always implies a different type of movement in any diphthong, be it horizontal tongue position, lip rounding, or both – these other types of movement are not distinctive in this systematic account, but still relevant for the phonetic form of the diphthong. If the item is [+ move], the extent of a possible movement in vertical tongue position can be [+ wide] (between open and close vowel quality) or narrow (between mid and close), thus [– wide]. The very basic distinction of falling vs. rising in sonority, ascribing the feature [+ rising] to the three diphthongs /yö, ie, uo/, is also implemented, as is the vowel harmony dimension of lip rounding, where only the non-open round vowels /y, ö; u, o/ participate (see Section 3.1). Therefore, there are only four diphthongs with the feature [+ lip harm]. Lip rounding in the target vowel quality is made use of for the feature [+ round], which occurs only in diphthongs ending in /y, ö, u, o/, all others being [– round].

The other dimension of Finnish vowel harmony, namely place harmony, plays an even more important role for the system. The place harmony feature [+ front] vs. [– front] is to be understood not in an articulatory but in a phonological sense. Recall that the open vowels /ä/ and /a/ participate in place harmony (though not in lip rounding harmony) and that /i/ and /e/, though phonetically front vowels, are neutral with respect to place harmony, i.e. they can combine with both /y, ö, ä/ and /u, o, a/. In other words, a diphthong is front if either the initial or the target vowel quality is phonologically front (i.e. /y, ö, ä/), and it is back if either the initial or the target vowel quality is phonologically back (i.e. /u, o, a/); /ei/ and /ie/ are neither front nor back. The matrix of distinctive features in Finnish diphthongs is as follows:

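| | rising | front | round | lip harm | move | wide |
|---|---|---|---|---|---|---|
| iy | – | + | + | – | – | – |
| ey | – | + | + | – | + | – |
| öy | – | + | + | + | + | – |
| äy | – | + | + | – | + | + |
| yi | – | + | – | – | – | – |
| öi | – | + | – | – | + | – |
| äi | – | + | – | – | + | + |
| ei | – | – | – | – | + | – |
| ui | – | – | – | – | – | – |
| oi | – | – | – | – | + | – |
| ai | – | – | – | – | + | + |
| iu | – | – | + | – | – | – |
| eu | – | – | + | – | + | – |
| ou | – | – | + | + | + | – |
| au | – | – | + | – | + | + |
| yö | + | + | + | + | + | – |
| uo | + | – | + | + | + | – |
| ie | + | – | – | – | + | – |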

Arranged according to their binary feature values, the system displays the diphthongs in Finnish in a quite compact and integrated way. I refrain from marking the vast majority of diphthongs for [– lip harm] and [– rising] respectively, since that would affect the clarity of the representation; the positive feature values are simply indicated in boldface or italics, respectively (**[+ lip harm]**, *[+ rising]*).

| | [+ front] [+ round] | [+ front] [– round] | [– front] [– round] | [– front] [+ round] |
|---|---|---|---|---|
| [– move] | iy | yi | ui | iu |
| [+ move] [– wide] | ey | öi | oi | eu |
| [+ move] [– wide] | **öy** ***yö*** | | ei *ie* | **ou** ***uo*** |
| [+ move] [+ wide] | äy | äi | ai | au |

Translating the binary features to common category labels renders a system that looks like this (**lip harmonic**, *rising*):

| | front round | not round | back round |
|---|---|---|---|
| no movement | iy | yi ui | iu |
| narrow movement | ey | öi oi | eu |
| narrow movement | **öy** ***yö*** | ei *ie* | **ou** ***uo*** |
| wide movement | äy | äi ai | au |
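The feature values in the matrix follow mechanically from the properties of the initial and target vowels as defined in the text. The following is a minimal sketch of that derivation, not part of the original analysis: the function and constant names are hypothetical, vowel height is coded as 0 (close), 1 (mid), 2 (open), [+ rising] is taken to mean that the target vowel is more open than the initial one, and the strings are assumed to use precomposed characters.

```python
# Vowel properties as described in Sections 3.1 and 3.3.2.
HEIGHT = {"i": 0, "y": 0, "u": 0,      # close
          "e": 1, "ö": 1, "o": 1,      # mid
          "ä": 2, "a": 2}              # open
ROUNDED = set("yöuo")                  # non-open round vowels (lip harmony)
HARMONY_FRONT = set("yöä")             # phonologically front (place harmony)

DIPHTHONGS = ["iy", "ey", "öy", "äy", "yi", "öi", "äi", "ei", "ui",
              "oi", "ai", "iu", "eu", "ou", "au", "yö", "uo", "ie"]


def features(diphthong):
    """Derive the six binary features from initial and target vowel."""
    first, second = diphthong[0], diphthong[1]
    h1, h2 = HEIGHT[first], HEIGHT[second]
    return {
        # Sonority rises when the target is more open than the initial vowel.
        "rising": h2 > h1,
        # Phonologically front if either phase is /y, ö, ä/.
        "front": first in HARMONY_FRONT or second in HARMONY_FRONT,
        # Only the target vowel quality counts for [round].
        "round": second in ROUNDED,
        # Lip harmonic if both phases take part in lip harmony.
        "lip harm": first in ROUNDED and second in ROUNDED,
        # Vertical tongue movement: any difference in height.
        "move": h1 != h2,
        # Wide movement: between open and close.
        "wide": abs(h1 - h2) == 2,
    }


def with_feature(name):
    """Return the set of diphthongs that are positive for a feature."""
    return {d for d in DIPHTHONGS if features(d)[name]}


if __name__ == "__main__":
    for d in DIPHTHONGS:
        marks = ["+" if value else "–" for value in features(d).values()]
        print(d, " ".join(marks))
```

Deriving the values rather than listing them by hand makes it easy to check a matrix of this size for slips, and the same function can be rerun when a different vowel inventory is plugged in.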

Of course, the category labels – as all labels in this type of systematization – require some explanation, e.g. that front and back is not an articulatory feature, but refers to place harmony; that by round and not round only the target vowel quality is intended; or that movement refers to vertical tongue movement only.

The systematization I propose in this paper has several advantages: it is effective, since its outcome is a highly integrated system which, in addition, does not leave any unanalyzable remainder that merely gets enumerated; it is economical, since it makes use of only a small feature set (obviously only the distinctive ones); it is adequate for the language, since it relies on core characteristics of Finnish phonology, such as the vowel harmonies; and it is typologically adequate, since it clearly states that the rising early-peak diphthongs /ie, yö, uo/ are marked structures – simply because they need special features to be singled out.

4. Conclusion: The “diphthong analysis and description tool”

In this last section, I will merge the results from the general discussion of diphthongs and the findings from the specific examination of the Finnish diphthongs, thus establishing the tool for diphthong analysis and description this article aims at.

When analysing the functional phonetics of a language, one of the first observations might be whether the language contains vocoids with a changing sound quality or not – i.e. whether the language contains phonetic diphthongs. In most languages, this will be the case. These diphthongs – even before anything is known about their precise phonological status – can be described in terms of the following articulatory criteria (ordered according to decreasing probability of occurrence):

– opening vs. closing: vertical tongue movement
– fronting vs. backing: horizontal tongue movement
– (peripherising vs. centralising: combination of vertical and horizontal tongue movement)
– rounding vs. de-rounding (spreading): change in lip rounding
– nasalizing vs. de-nasalizing: change in nasality

The following two additional criteria may be useful for a further differentiation of the diphthongs in a language:

– homogeneous vs. heterogeneous: 1 vs. > 1 parameters changing
– narrow vs. wide: articulatory / perceptual distance from initial to target position / sound

Of course, already at this early stage, the important “contour criteria” may be accounted for:

– falling vs. rising: sonority contour, direction of change
– early peak vs. late peak: prominence contour, position of peak

Note that falling, early-peak diphthongs are by far the most common diphthong type, but, as was shown, prominence caused by means of extra effort in pitch and / or expiratory energy may override the sonority contour – even to the extent that sonorants may form a syllable peak. In this (probably rare) case, the next criterion has to be applied:

– purely vocoid vs. blended: type of sounds constituting the diphthong (only vocoids vs. vocoids in combination with sonorants)

The following criterion can provide a hint for a later analysis of mono-phonemic vs. bi-phonemic diphthongs, but need not do so:

– sequential vs. gliding: impression of 2 distinct phases vs. 1 complex phase

All of these criteria, which directly address the dynamic character of diphthongs, may not only be relevant for a sound description of the phonetic diphthongs in a language, they may also constitute potentially distinctive features if it turns out in the phonological analysis that at least some of the diphthongs have a functional, phonemic status (by forming minimal pairs) in the language. It goes without saying that all of the static features that the initial and/or the target sounds bear – e.g. front vs. back, closed vs. mid vs. open, round vs. spread, nasal vs. oral, etc. – might turn out to be systematically distinctive in phonological diphthongs. This also holds true for the criterion of quantity, even if it is not very likely that quantity functions as a distinctive feature in the diphthongs of a language.

If there are phonological diphthongs in a language, these criteria have to be dealt with:

– mono-phonemic vs. bi-phonemic: segmental phonological analysis
– simple vs. complex: occurrence in morphologically simple vs. in complex forms
– stable vs. unstable: changeability by morphophonological processes
– common vs. restricted: restrictions of occurrence, e.g. only in stressed syllables, in loans, etc.
– frequent vs. infrequent: frequency in lexicon words or in text words

The criteria of sonority and prominence contour come up again in phonological analysis, now with special reference to the general principles of syllable structure in the language. And, of course, mismatches of sonority and prominence contour, as they are marked structures, need a particularly careful investigation and explanation.

A criterion that is equally relevant for the phonetic and the phonological perspective is the question of underlying phonological non-diphthongal sound sequences for diphthongs in the phonetic output, or the question of stages in diachronic emergence by sound change processes, giving rise to the following features:

– derived vs. basic: synchronic generatability by phonological rules
– primary vs. secondary vs. tertiary etc.: diachronic emergence

I hope that the “diphthong analysis and description tool” will prove useful for a number of linguistic enterprises, be they within language documentation or other linguistic areas.
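As an illustration of how the articulatory criteria listed in this section could be applied, the following sketch derives the labels opening / closing, fronting / backing and rounding / de-rounding, as well as homogeneous vs. heterogeneous and narrow vs. wide, from the initial and target vowel of a diphthong. It is only a sketch under stated assumptions: the class `Vowel`, the table `VOWELS` and the function `describe` are hypothetical names, the vowel table covers the eight Finnish monophthongs in their articulatory (not harmony-based) classification, “wide” is taken to mean movement between open and close as in Section 3.3.2, and nasality and centralisation are left out because they play no role in the Finnish data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vowel:
    height: int      # 0 = close, 1 = mid, 2 = open
    back: bool       # articulatory backness
    rounded: bool


# The eight Finnish monophthongs (articulatory classification).
VOWELS = {
    "i": Vowel(0, False, False), "y": Vowel(0, False, True),
    "u": Vowel(0, True, True),   "e": Vowel(1, False, False),
    "ö": Vowel(1, False, True),  "o": Vowel(1, True, True),
    "ä": Vowel(2, False, False), "a": Vowel(2, True, False),
}


def describe(diphthong):
    """Describe the movement from initial to target position."""
    v1, v2 = VOWELS[diphthong[0]], VOWELS[diphthong[1]]
    changes = []
    if v2.height != v1.height:
        changes.append("opening" if v2.height > v1.height else "closing")
    if v2.back != v1.back:
        changes.append("backing" if v2.back else "fronting")
    if v2.rounded != v1.rounded:
        changes.append("rounding" if v2.rounded else "de-rounding")
    return {
        "changes": changes,
        # Homogeneous: exactly one parameter changes.
        "homogeneous": len(changes) == 1,
        # Wide: vertical movement between open and close.
        "wide": abs(v2.height - v1.height) == 2,
    }


if __name__ == "__main__":
    for d in ["iy", "ui", "äy", "uo"]:
        print(d, describe(d))
```

For the examples discussed in Section 3.3.1, this yields /iy/ and /uo/ as homogeneous and /ui/ and /äy/ as heterogeneous. The contour criteria and the phonological criteria would need additional information (prominence, morphological context, frequency) that cannot be read off the vowel qualities alone.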

References

Abondolo, Daniel. 1998. Finnish. In The Uralic Languages, ed. Daniel Abondolo, 149–183. London, New York: Routledge.
Altmann, Hans, and Ute Ziegenhain. 2010. Prüfungswissen Phonetik, Phonologie und Graphemik. 3rd edition. Göttingen: Vandenhoeck und Ruprecht.
Ambrazas, Vytautas. 1997. Lithuanian Grammar. Vilnius: Baltos lankos.
Andersson, Erik. 1994. Swedish. In The Germanic Languages, eds. Ekkehard König and Johan van der Auwera, 271–312. London, New York: Routledge.
Bowern, Claire. 2008. Linguistic Fieldwork: A Practical Guide. Basingstoke: Palgrave Macmillan.
Branch, Michael. 1987. Finnish. In The World’s Major Languages, ed. Bernard Comrie, 593–617. London: Croom Helm.
Braunmüller, Kurt. 1999. Die skandinavischen Sprachen im Überblick. 2nd edition. Tübingen: Francke.
Campbell, George L. 1995. Concise Compendium of the World’s Languages. London, New York: Routledge.
Catford, John C. 1977. Fundamental Problems in Phonetics. Edinburgh: Edinburgh University Press.
Eckert, Rainer, Elvira-Julia Bukevičiūtė, and Friedhelm Hinze. 1994. Die baltischen Sprachen: Eine Einführung. Leipzig: Langenscheidt Verlag Enzyklopädie.
Elert, Claes-Christian. 1995. Allmän och svensk fonetik [General and Swedish Phonetics]. 7th edition. Stockholm: Norstedts.
Fromm, Hans. 1982. Finnische Grammatik. Heidelberg: Winter.
Graefen, Gabriele, and Martina Liedke. 2008. Germanistische Sprachwissenschaft: Deutsch als Erst-, Zweit- oder Fremdsprache. Tübingen: Francke.
Groenke, Ulrich. 1998. Die Sprachenlandschaft Skandinaviens. Berlin: Weidler.
Grønnum, Nina. 1998. Fonetik og fonologi: almen og dansk [Phonetics and Phonology: General and Danish]. Copenhagen: Akademisk forlag.
Gussmann, Edmund. 2002. Phonology: Analysis and Theory. Cambridge: Cambridge University Press.
Häkkinen, Kaisa. 1977. Tilastotietoja Suomen kielen äännerakenteesta [Statistical information about the phonemic structure of Finnish]. Turku: Turun yliopisto.
Hakulinen, Auli, Maria Vilkuna, Riitta Korhonen, Vesa Koivisto, Tarja Riitta Heinonen, and Irja Alho. 2004. Iso suomen kielioppi [Comprehensive Grammar of Finnish]. Helsinki: Suomalaisen Kirjallisuuden Seura.
Karlsson, Fred. 1983. Suomen kielen äänne- ja muotorakenne [The Structure of Finnish Phonology and Morphology]. Porvoo: Söderström.
Karlsson, Fred. 1992. Finnish. In International Encyclopedia of Linguistics, 2nd vol., ed. William Bright, 14–17. Oxford: Oxford University Press.
Karlsson, Fred. 1999. Finnish: An Essential Grammar. London, New York: Routledge.
Kohler, Klaus. 1995. Einführung in die Phonetik des Deutschen. 2nd edition. Berlin: Erich Schmidt.
Kurschat, Friedrich. 1876. Grammatik der litauischen Sprache. Halle: Verlag der Buchhandlung des Waisenhauses.
Lindau, Mona, Kjell Norlin, and Jan-Olof Svantesson. 1990. Some crosslinguistic differences in diphthongs. Journal of the International Phonetic Association 20:10–14.
Lyovin, Anatole. 1997. An Introduction to the Languages of the World. Oxford: Oxford University Press.
Maddieson, Ian. 1984. Patterns of Sound. Cambridge: Cambridge University Press.
Mathiasen, Terje. 1996. A Short Grammar of Lithuanian. Columbus: Slavica.
Mitchell, Erika J. 2001. Finnish. In Facts about the World’s Languages: An Encyclopedia of the World’s Major Languages, Past and Present, eds. Jane Garry and Carl Rubino, 214–218. New York: Wilson.
Mosel, Ulrike. 1994. Saliba. München: Lincom Europa.
Mosel, Ulrike. 2006a. Fieldwork and community language work. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike. 2006b. Sketch grammars. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 301–309. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike, and Even Hovdhaugen. 1992. Samoan Reference Grammar. Oslo: Scandinavian University Press.
Pompino-Marschall, Bernd. 1995. Einführung in die Phonetik. Berlin: de Gruyter.
Ramers, Karl-Heinz, and Heinz Vater. 1992. Einführung in die Phonologie. Hürth: Gabel.
Roca, Iggy, and Wyn Johnson. 1999. A Course in Phonology. Oxford: Blackwell.
Sánchez Miret, Fernando. 1998. Some reflections on the notion of diphthong. Papers and Studies in Contrastive Linguistics 34:27–51.
Schubiger, Maria. 1977. Einführung in die Phonetik. 2nd edition. Berlin: de Gruyter.
Schultze-Berndt, Eva. 2006. Linguistic annotation. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 213–251. Berlin, New York: Mouton de Gruyter.
Sievers, Eduard. 1893. Grundzüge der Phonetik zur Einführung in das Studium der Lautlehre der indogermanischen Sprachen. Leipzig: Breitkopf und Härtel.
Sulkala, Helena, and Merja Karjalainen. 1997. Finnish. London, New York: Routledge.
Trubetzkoy, Nikolaj S. 1977[1939]. Grundzüge der Phonologie. 6th edition. Göttingen: Vandenhoeck und Ruprecht.
Vennemann, Theo. 1988. Preference Laws for Syllable Structure and the Explanation of Sound Change: With Special Reference to German, Germanic, Italian, and Latin. Berlin: Mouton de Gruyter.
Viitso, Tiit-Rein. 2003. Structure of the Estonian language: Phonology, morphology and word formation. In Estonian Language, ed. Mati Erelt, 9–92. Tallinn: Estonian Academy Publishers.

Chapter 9 Retelling data: Working on transcription∗
Dagmar Jung and Nikolaus P. Himmelmann

∗ We are very grateful to all the speakers who generously shared their knowledge of the Beaver language with us and were so patient and accommodating in dealing with our obsessions with linguistic form rather than content. Many thanks also to Geoffrey Haig, Carolina Pasamonik, Stefan Schnell and Gabriele Schwiertz for helpful comments and suggestions, and to Jessica Di Napoli for thoroughly editing English grammar and style.

1. Introduction

Transcribing narrative and conversational speech is a core activity of all linguistic fieldwork, though one of the less attractive ones. Neither linguists nor speakers are generally very keen to spend long hours on this task. Nevertheless, it is without doubt one of the most important tasks to be carried out in the field requiring close cooperation between speaker(s) and researcher(s). Given its significance, it is somewhat surprising how little attention this task receives in the literature. When transcription is mentioned, if at all, in field method books and articles, the focus is usually on phonetic aspects, i.e. questions relating to the proper representation of sounds and the distinction between broad and narrow transcription. Occasionally, there are a few additional notes on general procedure, as conveniently summarized in Chelliah and de Reuse (2011: 434–435). A notable exception here is Crowley (2007: 137–141) who discusses practicalities of transcribing a fairly large amount of narrative and conversational speech that go beyond the problems of basic procedure and properly capturing sound.

Likewise, the current chapter is exclusively concerned with the conceptual and interpersonal issues arising when working on transcriptions of continuous text, for which usually some type of practical orthography (broad transcription) will be used. We will not repeat Crowley’s very useful observations and suggestions here. But we want to emphasize the point that text transcription has to be carried out in close cooperation with native speakers, and this usually means: in the field. It may be possible for a researcher to transcribe some parts of a text recording independently at a stage when one has achieved a certain mastery of the language – enough, for example, to engage in simple conversational exchanges. However, it will almost never be possible to fully transcribe a recording, as there will always be sections where articulation is poor or very fast, major noise interferences occur, or new words or constructions are being used. Hence a serious attempt should be made to transcribe (and translate) as many recordings in the field as possible, unless it is also possible to work with native speakers at home.

Successful transcription thus essentially depends on a productive collaboration and interaction between researcher and native speaker. The productivity of this interaction may be hampered by a number of pitfalls which are the main concern of this chapter. In Section 2, we will first review some basic conceptual and practical issues which need to be carefully considered to ensure a productive transcription collaboration. Section 3 will then survey and systematize the different strategies native speakers may adopt when responding to the nontrivial challenges arising for them when engaging in transcription, focusing on aspects of (lack of) cooperation and typical changes applied to the recorded speech in the transfer to a written representation.

This chapter, thus, contributes to two interconnected topics in documentary linguistics. The first topic pertains to the idea that fieldwork should be conceived of as a cooperative learning enterprise between speaker(s) and researcher(s), as argued in detail by Mosel (2006). Here the important point is that researchers need to have a clear understanding of what kinds of demands they are putting on their collaborators and to see where the interests and goals of the two parties involved may diverge or even stand in opposition to each other. The second topic pertains to the idea that working on transcription may lead to the emergence of a new linguistic variety, as it involves the creation of a new written language.
This is particularly true in those instances where recorded texts are carefully edited for publication in a local (e.g. educational) context, a process documented, perhaps for the first time, in a rigorous way in Mosel’s work on Teop (cp. Mosel 2004, 2008). But it actually also occurs in similar, though less systematic ways in transcription, as we will document here. Both topics, incidentally, provide further arguments for the proposal that the transcription process itself should be documented as fully as practically feasible. That is, ideally it should also be recorded, as was the standard practice within the Beaver Athapaskan language documentation project in Northern Alberta, Canada, from which all data in this chapter are drawn (Jung et al. 2004–present).¹ Without such a recording, the kinds of examples we discuss here would not be available.

¹ For further information on this project and full acknowledgements see www.mpi.nl/dobes/projects/beaver.

2. Basic conceptual and practical issues

To begin with, it will be useful to keep in mind that transcription is a metalinguistic activity generally not part of the speaker’s and the linguist’s normal linguistic repertoire. That is, there is no natural linguistic behavior involving an activity that can be said to be parallel to what is involved when transcribing recordings of continuous speech. Stenography, which bears some resemblance to transcription but differs considerably in purpose and focus, is a highly specialized activity as well, requiring a great deal of training and practice. A more widespread and, in literate societies, more natural activity which may bear some, though rather remote, resemblance to transcription is note-taking in a meeting or a class, which may include the occasional literal transcription of a short segment of the contribution by a current speaker. But taking notes is primarily concerned with capturing the content of what is being said, not with providing a precise transcript of it. The ‘natural’ focus on content seen here also underlies a number of typical problems occurring in collaborative work on transcription, as is further discussed in Section 3.

The relative unnaturalness of transcription is due to two facts. First, it presupposes writing. Though literate practices are now widely found even in societies without a literary tradition, the very fact that writing is not a universal modality of linguistic communication (unlike speaking and signing) contributes to the special status of transcription. Second, and related to the first fact, transcription involves a transfer from one modality into another one. Such modality transfers are generally rare in linguistic behavior, the translation of spoken into signed language (or vice versa) probably being the only type of transfer occurring in nonliterate societies. Literate societies add reading aloud and dictation as widely used techniques in the acquisition of literacy.

As there are no natural models for transcription, it follows that transcription practices need to be established in accordance with the specific purposes a transcript is intended to serve. Or, to put this slightly differently, transcription is theory, as argued by Ochs (1979). In this classic paper, Ochs makes a number of important observations. We repeat here only the most important and relevant one for our purposes: transcription always involves reduction and thus selection. It is practically impossible to represent a given speech event in its entirety in writing. This would have to include, for example, minute details of articulation, gestures, facial expression, etc. Apart from being unfeasible, any kind of transcription trying to capture as many features of the speech event as possible would actually defeat its purpose, as it would be very difficult to parse. That is, the goal of transcription is reduction, i.e. making those aspects of a complex speech event accessible that are of major relevance to the user of the transcript.

The transcripts typically produced in language documentation and description serve the purpose of making recordings of continuous speech accessible for further analysis, including, for example, morphosyntactic analysis. For this type of analysis, phonetic detail is usually not relevant, which is the reason why a broad transcription, using a practical orthography, is generally sufficient. What is relevant, however, is the segmentation of the continuous stream of speech into chunks of different sizes (words, intonation units). This segmentation has to make use of native speaker intuition (in particular with regard to the word level, see Peterson, this volume) and involves preliminary analytical hypotheses as to the relevant chunking units (see Himmelmann (2006) for further discussion of this rarely explicitly discussed aspect of transcription). This in turn means that transcripts have to be repeatedly revised in line with the better understanding of the relevant units obtained through ongoing analysis.
A further evident consequence of the fact that transcription is a theory-dependent project not part of anyone’s basic linguistic repertoire is that transcription has to be learned and practiced, both by the researcher and the native speakers collaborating in the task. As a rule, researchers should have transcribed a recording in their own language before setting out on the task of transcribing a recording in the field. Among other things, this will help them to have a clear understanding of the demands they are placing on the native speakers they engage for collaboration in this task. Additionally, it will help them comprehend the different kinds of reactions they will encounter when searching for transcription collaborators, to which we now turn.

As with almost all other activities involved in field-based language documentation and description (cf. Mosel 2006, Seifart, this volume), native speakers will differ quite significantly in their aptitude for, and interest in, the
task of transcription. Given that it is an unnatural task, it should not come as a surprise that a very common, if not the most common, reaction is great reluctance to collaborate in this task, even from speakers who are happy to engage in other tasks such as providing example sentences for dictionary entries or grammaticality judgments, or compiling complex paradigms or taxonomies. The reasons for such reluctance include: the unwillingness or inability to listen to chunks of the recording; declaring the recorded item impossible to understand; unwillingness or inability to repeat segments of the recording in such a way that they can be written down; lack of time. The reluctance to engage in transcription may also be based on an evaluation of the recorded discourse (or a part of it) in terms of correctness (not proper language), appropriateness (that is not how one can say this in these circumstances) or irrelevance (not worthy of being committed to writing). Such evaluations, in turn, are part of, or result from, the set of basic attitudes speakers have with regard to the variety being used and they display consciousness of linguistic norms, which appear to exist in all societies, even in ones without formally established standard varieties. The underlying system of attitudes is usually very difficult to untangle and, in the case of the Beaver data discussed in this chapter, we lack independent evidence for this system. We will therefore refrain from trying to link specific examples to basic attitudes, but for almost all examples in Section 3 it will be clear that some notion of perceived linguistic value is at play also in those instances where speakers engage in transcription but apply changes to the original record. Finding good collaborators for transcription will thus often be quite challenging. 
However, it is usually not the case that in a given field situation there are very many options to choose from, and hence one has to find ways to make the best of the options available.

Perhaps the most important qualification for becoming a collaborator in transcription – apart from being a reasonably fluent speaker of the language – is the willingness and ability to learn something new, i.e. to listen to small chunks of continuous speech and repeat them so that they can be written down (or, write them down oneself). Here transcription differs from many other activities required in language documentation and description where experience and a high level of linguistic skills and insight are essential. For transcription, it may often be useful to collaborate with younger speakers able and willing to learn the task, even if they no longer have a full command of the language. Difficult and unclear
segments must then be checked with more experienced speakers, who may lack the patience or interest to collaborate on this task for more extensive time periods. It is not necessary, and in fact often not desirable, to work on transcription with the speaker who appears in the recording. While the speaker probably has a relatively clear idea of what she wanted to say, this does not mean that she is particularly good at listening to and restating precisely what was actually said. In fact, speakers involved in the recorded speech event are sometimes more likely to engage in the correcting and extension activities discussed in Section 3 than non-participating speakers. Furthermore, listening to one’s own voice in a recording can be disturbing as it sounds quite different from what one hears when speaking and because one may feel embarrassed about dysfluencies, speech errors and other kinds of imperfections in one’s own speech. As transcription is not only an unnatural but also a very time consuming and tedious task which requires practice and dedication, it is the task that is perhaps most work-like of all the activities involved in field-based language documentation and description. It is thus also the task that is best approached in a work-like arrangement, characterized by regular working hours and a salary/remuneration in line with local practices, if such arrangements are at all possible in the community. Transcription is ideally done in an office-like setting, with adequate furniture and a low noise level, so that one can fully concentrate on the listening and interpretation task. Work-like arrangements also provide for the option of independent transcription, i.e. a native speaker works by her/himself on the transcription of recordings. This, of course, presupposes that the speaker is able to handle the technical aspects of playback (ideally using a laptop, otherwise an audio player). 
It also involves some training and, crucially, regular supervision and checking. The latter are important for two reasons: first, independent transcribers, like researchers, may tend to develop ‘bad habits’ such as regularly misspelling a class of high frequency items, leaving out segments at times when they interrupt their work, etc. Regular checks may detect and arrest such developments early on. Second, and of equal, if not greater importance, independent transcribers need regular feedback and appreciation in order to keep up the motivation for good and productive work. If all these conditions are met, independent transcription, perhaps even involving two or three transcribers, is probably the most efficient and productive method for tackling this task.

The widely practiced alternative to independent transcription is collaborative transcription, where researcher and speaker together transcribe a recording, preferably with both listening to it through headsets. The discussion in the following section is based on data generated in this way. The set-up generally involved a single native speaker and a single linguist. The native speakers were all elderly people as there are no younger speakers left, sometimes working on recordings of themselves, sometimes on recordings of other speakers. Transcription was not separated from translation, so that upon hearing a short segment played by the linguist, the native speaker would start with either explaining what was being said, or with a free translation, or with repeating the segment for the linguist to transcribe. All speakers involved are bilingual in English and Beaver and most of them are also literate in English. Two speakers also have some familiarity with the orthography used for representing Beaver and are hence able to follow what the linguist was writing. While this is just one type of set-up for transcription, many of the phenomena discussed below will also be found in transcriptions produced by independently working native speaker transcribers and also whenever native speakers are involved in editing precise transcripts for publications to be used in the community (cp. Mosel 2004, 2008).

3. Retelling data: change and elaboration in transcription

As already mentioned in the previous section, in field-based language description the main goal of the linguistic researcher in the transcription process is typically to obtain a written version of what is being said in a recording which can then be used for further grammatical analysis. Given this goal, the focus is on precision: the transcript should contain all of, and only, the linguistic elements that have actually been uttered, including not only ‘small words’ of ambiguous function such as discourse particles which are easily missed, but also linguistic manifestations of the production process such as false starts, markers of hesitation, etc. The latter often have little direct relevance for propositional content and grammatical structure, but ignoring them may lead to incorrect interpretations of content or structure (these may be edited out carefully later on as part of the overall analysis process). This approach to transcription is very technical and its overall goal may not be easily understood by non-specialists.

Native speaker collaborators in the transcription process tend to have different goals and priorities which can generally be classified as attempting to improve the recording in a number of ways: to make the content clearer, to make it more appropriate for a general audience, to make it adhere to what they perceive as the proper norm or more authentic speech, and so on. Goals and priorities here depend in part on who is helping with the transcription: a transcriber who is also the recorded speaker may decide more freely on what should be edited in and out for semantic reasons or perceived mistakes in clause structure. He/she may also focus on rephrasing, expanding or repeating the text to guarantee its comprehension by the intended audience. A transcriber who is not among the speakers who appear in the recording may comment on specific lexical items and idiomatic expressions that should be changed.

In the following, examples from actual transcription sessions in a fieldwork setting illustrate these processes. They are organized into three major types: a) avoidance strategies where loose paraphrases and translations are provided instead of a precise rendering of the recording; b) editing-out strategies which lead to the removal of words and phrases; c) editing-in strategies changing elements appearing in the recording or adding completely new material. The latter two types belong more closely together in that they both relate primarily to linguistic form, while the first type is most closely related to the content being transmitted. Specific examples often involve a mixture of the three types, which cannot always be neatly separated.

3.1. Paraphrasing: avoiding word-by-word renditions of recordings

There are basically two reasons for not repeating segments of a recording word-by-word during the transcription process (leaving aside boredom or impatience): either the desire to tell more, or the desire to tell less than what is actually in the recording.

From a native speaker’s perspective, the important point in working on a recording is to understand what is being said, i.e. the core concern is with the message and not with its form, which often results in the desire to tell more or to tell it differently.² This becomes especially relevant when the speaker in a recording works on the same recording together with a researcher from a different cultural background. In this case, the speaker may be sensitive to the researcher’s need to obtain additional information in order to be able to fully understand the sense of what is being said. A typical example resulting from this state of affairs is the following: when played the segment given in (a) and asked to repeat it so that it may be written down by the researcher, the speaker volunteers the explanation given in (b). Upon the insistence of the researcher, the two words of (a) are dictated for transcription (and translated) but then immediately expanded again by a further elaboration of the story in (c):³

(1) a. dáwó˛t’yedye aadi
       what.kind.of.place 3.said
       ‘She said what kind of place.’
    b. “(There’s) willow in there, they make arrows with; but there’s a snake in there, a big one, it will kill you.”
    c. “There’s a creek in there, there’s a saskatoon willow in there, but you got to fight the snake!”
       (yaamaadzuyaaze_transcr001)

² As just mentioned and further illustrated in the following two subsections, form is of course also a major concern, in particular in those cases where the recorded form is judged to be inappropriate or incorrect. However, here the main concern is with cases of content-based avoidance.

³ The practical orthography of Northern Alberta Beaver is used. Dentals are underlined (s, z), the acute accent marks high tone.

The main concern here is clearly that there (finally) is understanding on the part of the researcher. And while this may first be perceived by the latter as not being very helpful with regard to the primary goal of getting a useful (i.e. reasonably precise) transcription, it is evident that such information will later be of great value for interpreting the narrative. Depending on speakers and communities, work on a transcript may involve several retellings of the same narrative (in the contact language) which usually include important information for its interpretation. In some way then, the transcription setup should allow for making use of such elaborations.

The opposite motivation for avoiding the word-by-word repetition needed for precise transcription, i.e. the wish to transcribe less than what has been recorded, may be due to gender-specific speech, the avoidance of sexual explicitness, and other kinds of taboos, including respecting the linguistic “ownership” of certain topics by other persons. In a traditional story that was told by a male speaker about the wolf meeting his brother-in-law, the beaver, in the intense cold of winter, it is mentioned that the wolf’s frozen testicles can be heard from far away. This situation is described with the use of an ideophone: tł’aa tł’aa aadyi ‘clack clack it sounds’. The female transcriber rendered this by saying ts’áa láádi “something about ‘a young beaver’”, without mentioning the actual meaning.

Sometimes it is simply disbelief of the content of a recording which may lead to a refusal or avoidance of transcribing a segment. In still other cases, the speaker may in fact be unable to help as the segment in question may involve a different dialect or even another language. In the latter cases, the reluctance to help with transcription will usually be made explicit, often with an accompanying explanation. In some instances of taboo, however, it may not be obvious to the researcher that the native speaker is not willing or able to engage in precise transcription. The researcher may only find out later on, once he/she has a better understanding of the language and the general cultural context, that there is a discrepancy between the original wording and the material given in transcription (and translation). Rephrasings and other kinds of modification caused by observing taboos may be kept up rather convincingly and consistently for several units.

3.2. Editing-out

There are different types of elements that tend to be edited out (or ‘overlooked’) by native speakers in the transcription process regardless of the setup of the transcription process (independent or collaborative). Perhaps the most common type concerns hesitations or false starts, i.e. verbal elements that are not actually part of the linguistic construction, do not directly contribute to its meaning, and, most importantly perhaps in the current context, are generally absent in written formats other than transcripts. One very simple reason for leaving them out in dictation is, of course, the fact that hesitations and in particular false starts are difficult to reproduce. After some practice, the researcher her/himself will usually be able to identify such hesitations and false starts and can then add them to the transcription without having to bother the native speaker collaborator with this. However, some care has to be taken in this regard as misinterpreted false starts may lead to a chain of errors, as illustrated with the following example.

(2) a. ts’ídoaa ch’u-... ch’uunézis t’aaguyihtyi˛
       child.DIM ch’u-... wolf.hide inside.3PL.3O.put.ANIMO
       ‘They put the baby into a wolfhide.’ (moose001:56)

In this example, the false start highlighted here was not initially identified by the transcribers and thus was not included in the first transcript. At the time when the transcript was being translated and the recording was listened to again for verification, the linguist insisted that there was a word missing in the translation (not having realized that it was a hesitation). Consequently, the following ‘emendation’ was carried out (which strictly speaking is a case of editing-in, but note that unlike the examples discussed in the following section, here the editing-in is triggered by the researcher):

    b. ts’ídoaa zo˛ ch’uunézis t’aaguyihtyi˛
       child.DIM only wolf.hide inside.3PL.3O.put.ANIMO
       ‘They put their only baby into a wolfhide’ (moose001transc003)

The particle zo˛ ‘only’ was inserted, as it is phonetically close, syntactically well-formed (following an NP), semantically appropriate (the couple in the story had only one child) and functionally often occurs as a discourse particle. The fact that this segment does not actually involve this particle despite the fact that it would fit very well was caught only later, upon further checking of the transcript paying close attention to the sounds and overall features of articulation.

The second type of frequently omitted elements encompasses repetitions of words or phrases characteristic of emphatic oral speech, but which are deemed redundant or superfluous for the written version, a phenomenon that is also quite widespread and has occasionally been remarked upon by other authors (including Mosel 2008).

The third type concerns elements functioning as connectors or floor-keeping particles in discourse. In the second line of the following example, the initial ˛ih was omitted in the first transcription:⁴

(3) ní˛itye, k’emoi hakui ní˛itye,
    3.lived bush cows 3.lived
    ‘They used to live here, buffalo used to live here,
    (i˛h...) wuts’e˛ alééske wúye, łéés aghwi˛lé-’éh
    (and...) from.here Eleske 3.is.called, dirt 3.made-because.of
    then it is called Eleske (=‘on the dirt’), because they make the area dusty’. (Eleske)

⁴ In the remainder of this section, segments that have been left out by native speakers in the transcription process are put in parentheses (and in bold).

As indicated here, there is usually a pause following this frequent marker of hesitation. It is frequently not included in the string dictated for transcription and typically is not even commented on – that is, for the native speaker it is as if it is not really there. Similarly, the clause-final clitic =ú that generally marks combined clauses is very often ignored or not even noticed in transcription:

(4) tsé’e wutsé madzéé’ xáátsad(=ú), datés’oé’ ní˛dyí˛to˛,
    dad very heart 3.falls.out(=PRT) 3.gun.POSS 3.took.ELOO
    ts’e˛hdzé’ dyééza
    outside 3.went
    ‘My dad became very mad, he took his gun, and went outside.’ (naabane003)

(5) tłi˛˛idyéédayi-’éh go˛o˛ łó˛ó˛dó˛ dyídyeel(=ú)
    horse.team-with over.there all 1PL.go.PL(=PRT)
    ‘We all go over there with horse teams.’ (bullrush_lake001)

This clause-final clitic can indicate a loose temporal embedding in the discourse and generally does not play a role in signifying a more specific interclausal construction.

Another class of words that are often edited out are evidentials that mark a narration as known to the speaker only via the word of others. Since there is no good translation for such particles, they are left out as ‘not important’. In the case of Beaver, these words are also considered inappropriate for the written style, partly because the European language used as target language in the translation lacks the expressed category.

(6) Yéhnuujéle naawoghanePo˛ laa (sô˛) yéhjii.
    3SG-with.go.back 3PL-plan FOC (I.think) 3SG.tell.3SG
    ‘That woman was to return back with that guy, they had planned that, he was told.’ (Sweeney_Creek, Doig River Beaver)
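Where transcripts annotated in this way are kept in electronic form, the parenthesis convention used in examples (3) to (6) can be read off mechanically. The following is a minimal Python sketch, not a description of the Beaver project's actual workflow: it assumes that every omitted segment is enclosed in a single pair of non-nested parentheses, and the function names are hypothetical.

```python
import re

# Assumption: an omitted segment is wrapped in one pair of non-nested
# parentheses, as in "xáátsad(=ú)" from example (4).
_OMITTED = re.compile(r"\(([^()]*)\)")


def recorded_form(annotated: str) -> str:
    """Keep the omitted segments: the wording as heard on the recording."""
    return _OMITTED.sub(r"\1", annotated)


def dictated_form(annotated: str) -> str:
    """Drop the omitted segments: the wording as dictated for transcription."""
    without = _OMITTED.sub("", annotated)
    return " ".join(without.split())  # tidy up whitespace left behind


def omitted_segments(annotated: str) -> list[str]:
    """List the segments that were left out, in order of occurrence."""
    return _OMITTED.findall(annotated)


if __name__ == "__main__":
    # Word forms taken from examples (4) and (5).
    for word in ["xáátsad(=ú)", "dyídyeel(=ú)"]:
        print(word, "|", recorded_form(word), "|", dictated_form(word))
```

Keeping a single annotated string and deriving both versions from it is only one possible design; it avoids storing two transcripts that could drift apart during later revisions.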

Omissions may also pertain to parts of words, as in the following example:

(7) hade tyéége ii sadaa, uzéhtso˛(-áa), madzagée ˛ihdadze
    just quiet DEM 3.sit 3.listen(DIM) 3.ear.DIM.POSS both
    náághada
    3.move
    ‘It just sat there quietly, listening a bit, its little ears were both moving.’ (day_ferry001)

This example involves the (crosslinguistically rather rare) instance of a diminutive element occurring on a verb. Here, it is not clear what exactly triggers the omission or oversight. As the other examples in this section illustrate, the most fundamental reason for editing out elements from transcription appears to be the assessment that they are irrelevant for the propositional content, as is most clearly the case for hesitations, false starts and – with some exceptions perhaps – repetitions. The difficulty of rendering the pragmatic or semantic nuances of particles and clitics in the contact language used for communicating with the researcher may also be relevant.

3.3. Changing and editing-in

Changing the wording, which often includes the insertion of further material, usually has the goal of achieving a clearer and more precise linguistic expression of the narrated event or situation. Typical examples include the use of overt nominals (in addition to pronominal markers on the verb) and additional indications of location, and the change of lexical stems (in Beaver frequently verb stems). Change of word order may also occur. Another type of modification pertains to the substitution of lexemes for stylistic or ideological reasons, a prime example being the replacement of English words with terms from the native language. The addition of a locational specification can be seen in the next two examples:5 5. This section makes use of example pairs, where the first example provides the precise transcription while the second shows the modified version.

(8)

Dagmar Jung and Nikolaus P. Himmelmann

a. ashídle=yuu łenaxi
   y.brother=and 1PL.together
   ‘My younger brother and me (were tied) together.’
b. ashídle=yuu łehjíítł’o˛ mak’edeisdii-k’e
   y.brother=and 1PL.tied.together saddle-on
   ‘My younger brother and me were tied together on a saddle.’
   (life_without)

In (8b) the location ‘on a saddle’ and a verb are added to the clause. Instead of the addition of whole phrases, postpositions may be changed to provide a more accurate depiction of the scene: (9)

a. matyééle-moi ˛iláádzis xi˛st’aas
   3.sheet-at.edge gloves 1SG.cut.out
   ‘I cut out the gloves at the edge of her (bed) cover.’
b. matyééle-k’e ˛iláádzis xi˛st’aas
   3.sheet-on gloves 1SG.cut.out
   ‘I cut out the gloves on her (bed) cover.’
   (first_gloves001)

Here, the second version is more appropriate in that it prepares the recipient better for the punch line of the episode, in which the cover ends up being accidentally cut up. A further area for emendation pertains to fine semantic distinctions that may be conveyed with the help of grammatical markings. In Beaver, one important case here is the expression of distinctions in the number of participants through suppletive stems within the realm of motion verbs. (10)

a. aláá no˛ o˛ naatł’is-de łííníí-n-í-dyil
   boat 3.crosses-LOC stop-PFV-1PL-go.PL
b. aláá no˛ o˛ naatł’is-de łííníí-n-í-t’aats
   boat 3.crosses-LOC stop-PFV-1PL-go.DU
   ‘We (pl/dl) stopped at the ferry.’
   (day_ferry001)

In the recording, the plural motion stem is used, which, strictly speaking, is not correct, since only two people were in the car. Accordingly, the dual motion stem is edited in. A similar example is seen below in (11), where again a dual stem is substituted for a plural stem. In addition, several other typical changes may be

Retelling data: Working on transcription


seen in comparing the two versions. One consists of a change in agreement, from first plural subject to third plural subject marking: (11)

a. hó˛ ty’e laa tsáá-gha k’éémoi-dze k’éé-ghí-dish-ełé
   like.that FOC beaver-for bush-to ASP-CNJ.1PL-go.PL-HAB
   ‘Like that, we used to hunt for beaver in the bush.’
b. hó˛ ty’e tsáá-ka k’éémoi-dze k’éé-gha-t’aash-ełé
   like.that beaver-for bush-to ASP-CNJ.3PL-go.DU-HAB
   ‘Like that, they used to hunt for beaver in the bush.’
   (life_without)

Further changes include the editing out of the emphatic/focus marker laa following the first phrase, and the replacement of the purposive postposition -gha with -ka. Even though both are grammatical in this context, the latter appears in the common expression ‘to go out in order to hunt an animal’. Classificatory verbs are also a locus for semantic differentiation and specification. They are frequently replaced in transcription: (12)

a. náádaagheilé’-éh k’éémoi-dze naxa-gha-dyeh-tyé
   wagon-with bush-to 1PLO-3PL-ASP.V-bring.ANIMO
   ‘They brought us with the wagon to the bush.’
b. náádaagheilé’-éh k’éémoi-dze naxa-gha-da-lé
   wagon-with bush-to 1PLO-3PL-ASP.V-bring.PLO
   ‘They brought us with the wagon to the bush.’
   (life_without)

In this example, the speaker replaced the stem for animate objects, which refers implicitly to either a single or a dual object, with the stem for handling plural objects. An even more elaborate case of semantic specification is seen in the following example. A very general verb found in the original recording (13a) is first replaced by a variant including a classificatory verb (13b) that pertains to stick-like objects (in this case the leg of a frog). The third variant in (13c) was suggested to be the one that best expresses the event depicted in the picture (from the Frog Story, Mayer 1994[1969]). The verb here specifically refers to the movement of legs: (13)

a. tyehkazi tyéék’e łige mats’ané’ ó˛ ty’e ¯
   frog in.water one 3.leg 3.be
   ‘The frog has one leg in the water.’
b. tyehkazi tyéék’e mats’ané’ łige sahto˛ ¯
   frog in.water 3.leg one 3.put.ELOO
   ‘The frog has one leg in the water.’
c. tyehkazi tyéék’e dyéh’ééts
   frog in.water 3.move.leg
   ‘The frog moves a leg in the water.’
   (frog_story001transc)

A different sort of editing is involved when phrases are replaced with phrases of an entirely different grammatical type. In the modified version, the situation is often described more explicitly than in the actual recording (note that in (14) the focus marker laa has again been omitted, together with the original overt subject noun, in the modified version, so strictly speaking this is a case of replacement and omission): (14)

a. méhz˛i laa łééyetł’o˛ ¯
   owl FOC 3.tied.3.together
   ‘The owl tied them together.’
b. ˛ihtyedi łééguyaghétł’o˛
   naked 3PL.tied.3.together
   ‘They tied them together naked.’
   (buffaloman001)

Other possible kinds of changes include the paradigm of the verb (aspectual variation) or the choice of person markers. In example (15a), the speaker uses the areal pronominal marker ghu- as a possessive prefix to refer to the story in a general sense. The transcriber, who in this case is not identical to the speaker, chooses the third person marker ma- for this construction: (15)

a. gaa ghu-lo˛ o˛
   now 3ARE-end
   ‘That’s the end (of the story)!’
b. gaa ma-lo˛ o˛
   now 3-end
   ‘That’s the end (of the story)!’
   (ghutsahgeeze_woodpecker)


The transcription of the next example resulted in a change in word order within the first noun phrase: the modifier ‘all’ is shifted from its usual place following the noun to the front.6 (16)

a. dane łó˛ ó˛ dó˛ ada’íídyii hesi˛ gutł’e ó˛ li˛
   people all 1PL.know must.be 3PL.passed.away
   ‘All the people we know, they are all gone.’
b. łó˛ ó˛ dó˛ dane ada’íídyii hesi˛ gaa matł’e ó˛ li˛
   all people 1PL.know must.be now 3.passed.away
   ‘All the people we know, they are all gone now.’
   (St.Charles_001)

Possibly, word order here is influenced by interference from English. Recall in this regard that for all the Beaver data discussed in this chapter the work was set up in such a way that transcription was performed simultaneously with translation. That is, each unit was transcribed and translated before proceeding to the next unit, and there was no fixed order between the two steps (the speaker would sometimes first volunteer a translation and then repeat the segment for transcription, or vice versa). In such a setup, interference from the target language is more likely than in a setup where all attention is concentrated on transcription. But, of course, there are many reasons why the latter procedure will often not be possible, besides the fact that it helps to have at least a rough idea of what one is transcribing.

Turning briefly now to the second major class of ‘editing-in’ modifications, code-switches or the use of non-aboriginal words often trigger significant modifications to the original recording.7 Here is a simple example: (17)

a. marten gulae, nóódyehk’azhi gulae
   marten maybe fisher maybe
b. uust’yᲠa˛ gulae, nóódyehk’azhi gulae
   marten maybe fisher maybe
   ‘Maybe a marten, maybe a fisher.’
   (day_ferry001)

6. Other changes include the editing-in of the adverb gaa ‘now’ to emphasize the difference between the narrated past and the present situation, as well as a pronominal change (marked 3pl to general 3).

7. The tendency to replace English words or phrases in the annotation with Beaver terms is more pronounced in the Northern Alberta varieties.


It may also happen that longer segments in English are translated by the transcriber. Thus, for example, the clause he hears something was rendered as wó˛ o˛ li dííts’ak in one transcription session. More interestingly perhaps, the transcriber might speak a different variety of the language than the recorded speaker. This may result in changes in the transcript, both with regard to morphological form and in the naming of traditional characters: (18)

a. go˛ o˛ dyéézhe gaa éhdyi: aséi dasbát-éh
   over.there 3.went now 3.said grandfather 1SG.hungry-with
   ni˛ka dyée-ya
   2SG.for ASP-come.SG
   ‘He went over there and said: grandfather, I came to you because I’m hungry!’
b. go˛ o˛ dyéézhe gaa éhdyi: aséi Daskutł’e, ni˛ka
   over.there 3.went now 3.said grandfather Daskutł’e 2SG.for
   dyée-zha
   ASP-come.SG
   ‘He went over there and said: grandfather Daskutł’e, I came for you.’
   (aghat’usdane002)

In the recording the singular motion stem is -ya, while the transcriber repeats it as -zha. Both belong to the paradigm of the stem ‘sg.moves’, but they are used differently in various dialects of Beaver.8 The second modification, the renaming of a character within a well-known traditional story, is explained by family tradition: “My grandfather told me about old man Daskutł’e.” The original inflected verb form in (18a), which is not a personal name, is thereby replaced by a personal name that seems to fit the particular story.

8. Historically, a so-called voice marker has been absorbed into the -zha form.

4. Conclusion

Transcription of recorded data plays a central role in field-based language documentation and description. The final product, i.e. the transcript, forms the stepping stone for a variety of further activities, including not only grammatical analysis but also the preparation of educational materials or other written resources to support language maintenance or development efforts. But, as argued here, the transcription process itself, while often tedious and disliked by all parties involved, provides valuable insights into the linguistic knowledge of speakers: insertion, omission, or change of items show the range of the linguistic repertoire and (un)acceptable variation and thus complement the evidence otherwise gathered in elicitation tasks. It may also provide important clues for our understanding of the creation of new linguistic varieties, as many of the phenomena reviewed here also occur when basic transcripts are edited for publication as written resources, resulting in the creation of a written language variety where none existed beforehand (cp. Mosel 2004, 2008).

From a scientific point of view, it is thus important to document, as much as practically feasible, the kinds of changes and elaborations discussed in the preceding section, i.e. to provide both as precise a transcript of the actual recording as possible and a record of the changes applied by native speakers in the transcription process and the motivations given for them (if any). The best way to do this is to record the transcription process as well, which, however, will often not be possible for various reasons.

There is, of course, a potential conflict here between scientific and community/speaker interests, which was already hinted at in the introduction to Section 3: What if a speaker or the community at large actually rejects (parts of) an utterance as incorrect or inappropriate, for whatever reason? Under such circumstances, which version should be made available to whom and in what form? There is no straightforward and easy answer to this question, as in all cases where conflicts arise about control and ownership in a documentation project, but layered access levels in a digital archive usually make it possible to accommodate the interests of all parties concerned.
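The recommendation to keep both the precise transcript and a record of the changes made during transcription can be illustrated with a small script. The following is only a minimal sketch, not a procedure used in the Beaver project: it assumes that both versions are available as whitespace-separated tokens, and the names `Change` and `compare_versions` are hypothetical. For readability it compares the gloss lines of (11a) and (11b); applied to glosses, it cannot detect changes that leave the gloss unchanged (such as -gha versus -ka).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Change:
    kind: str            # "omission", "insertion" or "change"
    recorded: list[str]  # tokens as found in the recording
    edited: list[str]    # tokens as given in the modified version


def compare_versions(recorded: str, edited: str) -> list[Change]:
    """List the token-level differences between two versions of one unit."""
    rec, ed = recorded.split(), edited.split()
    labels = {"delete": "omission", "insert": "insertion", "replace": "change"}
    matcher = SequenceMatcher(a=rec, b=ed, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(Change(labels[tag], rec[i1:i2], ed[j1:j2]))
    return changes


# Gloss lines of (11a) and (11b), used here in place of the vernacular lines.
recorded_11a = "like.that FOC beaver-for bush-to ASP-CNJ.1PL-go.PL-HAB"
edited_11b = "like.that beaver-for bush-to ASP-CNJ.3PL-go.DU-HAB"

for change in compare_versions(recorded_11a, edited_11b):
    print(change.kind, change.recorded, "->", change.edited)
```

A record of this kind could be stored alongside the transcript, together with any motivation the speaker gives for a change; that part would still have to be entered by hand.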

Abbreviations

1, 2, 3   first, second, third person (usually indexing the subject argument if not otherwise specified)
ANIMO     animate object
ARE       areal
ASP       aspectual
CNJ       conjugation
DEM       demonstrative
DIM       diminutive
DU        dual
ELOO      elongated object
FOC       focus
HAB       habitual
LOC       locative
O         object
PFV       perfective
PL        plural
PLO       plural objects
POSS      possessive
PRT       particle
SG        singular
V         valency

References

Chelliah, Shobhana L., and Willem de Reuse. 2011. Handbook of Descriptive Linguistics. Dordrecht: Springer.

Crowley, Terry. 2007. Field Linguistics: A Beginner’s Guide. Oxford: Oxford University Press.

Himmelmann, Nikolaus P. 2006. The challenges of segmenting spoken language. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 253–274. Berlin, New York: Mouton de Gruyter.

Jung, Dagmar, Julia Colleen Miller, Patrick Moore, Gabriele Müller (now Schwiertz), Olga Müller (now Lovick), and Carolina Pasamonik. 2004–present. DoBeS Beaver Documentation. DoBeS Archive MPI Nijmegen, http://www.mpi.nl/DOBES/.

Mayer, Mercer. 1994[1969]. Frog, Where Are You? New York: Dial Books for Young Readers.

Mosel, Ulrike. 2004. Inventing communicative events: Conflicts arising from the aims of language documentation. Language Archives Newsletter 1(3):3–4.

Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.

Mosel, Ulrike. 2008. Putting oral narratives into writing – experiences from a language documentation project in Bougainville, Papua New Guinea. Presentation at the Simposio Internacional Contacto de lenguas y documentación, Buenos Aires, CAIYT (August 2008), available online as ‘Oral and written versions of Teop legends’ at http://www.linguistik.uni-kiel.de/mosel_publikationen.htm#download (accessed 2011/02/26).

Ochs, Elinor. 1979. Transcription as theory. In Developmental Pragmatics, eds. Elinor Ochs and Bambi B. Schieffelin, 43–72. New York: Academic Press.

Part III. Documenting the lexicon

Chapter 10
The making of a multimedia encyclopaedic lexicon for and in endangered speech communities∗
Gabriele Cablitz

1. Introduction

In the past ten years a growing amount of lexicon software1 has become available to create multimedia lexica. A key area of application is the compilation of online dictionaries for speech communities of underdescribed and endangered languages. For a number of such projects (Manning, Jansz, and Indurkhya 2001; Kroskrity 2002; Albright and Hatton 2008; De Korne et al. 2009; Yang et al. 2008; Rau et al. 2009; Cablitz, Chong, and Tetahiotupa 2009) the development of a multimedia lexicon is an important step towards language documentation as a means of language maintenance and revival and the preservation of endangered linguistic, lexical and cultural knowledge.

In this chapter we report on an interdisciplinary project2 in which digital multimedia encyclopaedic lexica are created for the endangered Marquesan and Tuamotuan languages of French Polynesia with the help of the lexicon tool LEXUS.3 LEXUS is a web-based tool which has a flexible scheme of linking multimedia documents – including annotated sessions from the archive – to lexical entries as well as the possibility of creating relational links using its integrated tool ViCoS.4 The relational linking device in ViCoS is a new form of knowledge representation with which the user can create a dense, thematically organised network of lexical and cultural data in ways which are meaningful to different kinds of user groups.5 This has important implications not only for the speech communities whose languages are documented, but also for documentary linguistics. In this chapter we will discuss why this form of language documentation, i.e. the creation of multimedia lexica with LEXUS, is beneficial to both the scientific and the speech communities.

In the first part of this chapter (§2) we will briefly discuss the background and objectives of this multimedia encyclopaedic lexicon project (henceforth: MEL-project). In §3 we proceed to outline the type of lexicographic work we have undertaken in the MEL-project, discussing new aspects of lexicographic work in our lexica in more detail. This section also provides insights into the creation and representation of thematically organised networks of lexical and cultural data in ViCoS (§3.4), in particular the creation of ethnobotanical ontologies and a folk taxonomy on marine life. One major objective of the MEL-project is to motivate the speech communities to actively participate in the process of creating these multimedia lexica. The web-based editing possibilities of LEXUS and internet facilities in French Polynesia – in principle – make it possible to allow an online participation by the speech community. In §4 we will address the implications a web-based lexicon tool has for the process of lexicon creation as well as the problems which are involved with such an approach of egalitarian lexicon creation: a model of collaborative workspaces and the basic challenges of a web-based collaboration in endangered speech communities are discussed in detail.

The last section (§5) discusses why the making of dictionaries should be a key activity in a DoBeS language documentation project, or indeed for any documentation project, and why multimedia dictionaries created with LEXUS can be particularly appealing and useful for endangered speech communities as well as the wider scientific community. We will discuss the somewhat controversial role of lexicography in documentary linguistics as well as the contributions that multimedia dictionaries – in particular LEXUS – can make to documentary linguistics, showing that DoBeS-style language documentation and lexicography can complement each other. Finally, we also address the question of how far multimedia lexica – as created in LEXUS – are useful tools for language revitalization.

∗ This research has been generously supported by two DoBeS grants of the Volkswagen Foundation. I would like to thank the Marquesan and Tuamotuan speech communities for their warm welcome, support and inspiration for the project. I have in particular benefited from discussions with Tehoatahiiani Bruneau, Fasan Chong (Jean Kape), Marc Kemps-Snijders, Lucien Mataiki, Upu Mataiki, Lucien (Mimio) Puhetini (†), Jacquelijn Ringersma, Tahia Tuohe (†), Edgar Tetahiotupa, Mathias (Teaiki) Tohetiaatua, Peter Wittenburg and Claus Zinn. I would like to thank Ken Dicks for proof-reading my paper and Geoff Haig, Nicole Nau and Claudia Wegener for their helpful comments on earlier versions of this paper; any errors and inconsistencies are of course my own. Last, but not least I would like to thank Ulrike Mosel for her great inspiration of my own work, and all her enthusiasm and support over the years.

1. E.g. the Kirrkirr software (Manning, Jansz, and Indurkhya 2001), Lexique Pro and WeSay (SIL), LEXUS (Max Planck Institute for Psycholinguistics), IDD (Indiana Dictionary Database, cf. De Korne et al. 2009) among others.

2. The project was part of the DoBeS programme and was generously supported by the Volkswagen Foundation between 2006 and 2010 (http://www.mpi.nl/dobes).

3. LEXUS is currently being developed by the technical team of the Max Planck Institute for Psycholinguistics in Nijmegen (Netherlands).

4. ViCoS = Visualization of Conceptual Spaces, cf. http://www.mpi.nl/dobes/tools and cf. Zinn (2008: 890–894); the first multimedia lexicon software using the relational linking device was Kirrkirr in the late 1990s (McElvenny 2008: 160).

5. Cf. §3.4 for further details.

2. The MEL-project

2.1. Background

The MEL-project evolved from a previous DoBeS language documentation project6 of the Marquesan languages in French Polynesia. Between 2003 and 2006 a large corpus of primary and lexical data from five different Marquesan island vernaculars was compiled, analysed and annotated in close cooperation with the Marquesan speech community, and is now stored in a digital multimedia language archive housed by the Max Planck Institute for Psycholinguistics (cf. Cablitz 2010: 40–43 for details). In the course of this DoBeS documentation project we also began to build up a general lexical database and glossaries of topics which have become important foci of our documentation (breadfruit varieties, food preparation, plants, plant medicine, fish and fishing, etc.). All lexical databases are trilingual (Marquesan, French and English); the lexical entries have been extracted from recordings on the respective topics, with added information from field notes and wordlists. Due to the well-known constraints on time and money in short-term documentation projects, we mainly applied Mosel’s thematic approach of dictionary making (2004b; 2011). We created so-called mini-dictionaries (Mosel 2011: 348) by focussing on one sub-domain of a culture (e.g. fish/fishing) at one time. In this way a mini-dictionary can be completed in a relatively short period of time, which often has motivating effects on the speech community (Mosel 2004b: 45–47). Idioms, collocations and lexicalised phrases were also collected in a separate database because they constitute an important part of the linguistic competence of speakers (Pawley 1993), and they provide deeper insights into a culture than other linguistic units (Mosel 2004b: 50). The lexical entries in our databases typically contain information about parts of speech, phonological variants,7 plural forms, glosses, French and English reversal glosses for index lists, definitions, dialectal usage and register, example sentences, etymology of borrowed lexemes, synonyms, antonyms, cross-references and scientific names of things from the natural environment (e.g. fish, shellfish, birds and plants). The primary and lexical data form the basis for our on-going work on the multimedia lexicon.

6. Documentation of the Marquesan languages and culture in French Polynesia (2003–2006), cf. http://www.mpi.nl/DOBES/projects/marquesan.

2.2. Objectives

The main objective was to create multimedia lexica for the Marquesan and Tuamotuan speech communities by building up a new form of multimedia language archive in a structural frame of a lexicon with the help of LEXUS and ViCoS (thereby also providing the LEXUS/ViCoS developers with a documentation setting to further improve and refine the software). This involved the enriching of lexical entries with indigenous knowledge, multimedia extensions (images, video and audio clips) as well as annotated sessions from the archive, in order to move from a conventional dictionary with simple glosses towards an encyclopaedia or ethnographic type of lexicon (cf. Pawley 2001: 236–237). In order to do this effectively, we needed to motivate the speech communities to actively participate in the creation of these multimedia lexica, thus becoming more involved in the process of documenting their own languages. The entire process had an important beneficial side-effect: the involvement of the speech community in the creation of the multimedia lexica is a way of contributing to language maintenance and revival.8 In order to facilitate the more active participation of speech community members, they had to learn a) about the basics of lexicography and the relevant linguistic software used, and b) how to write monolingual definitions and encyclopaedic articles of vernacular words, a documentation method which is particularly useful for language maintenance. Moreover, we intended to create lexical and cultural networks in ViCoS which are based on the indigenous categorisation and organisation of relations between words and/or any element in the DoBeS-archive (e.g. sessions depicting particular cultural uses). We were hoping that an indigenous organisation of lexical and cultural data would ensure a better access of the speech communities to the documentation of their languages and cultures, with the aim of contributing to future language revitalization efforts by the speech communities. Due to the web-based properties of LEXUS, an online cooperation with the speech communities outside fieldwork periods was in theory possible (cf. above). However, while the idea was initially appealing to the speech communities, it turned out to be a complex enterprise. There are many problematic aspects connected with a web-based approach of lexicon creation in endangered speech communities, some of which are discussed in further detail below (§4).

7. Phonological variants are allolexemes, also called triplets and doublets (Elbert 1982). Doublets and triplets are different forms of the same lexeme which are used by one and the same speaker (e.g. ko’aka – ’o’aka ‘find’), i.e. one cannot observe any complementary distributional rules nor a regional demarcation (Cablitz 2006: 26).

8. We documented e.g. very specialised knowledge about plants which some of our language consultants have not talked about for many years. In many instances, the documentation helped them remember and revive traditional knowledge.

3. Lexicographic work in the MEL-project

The lexical data collected during the course of the DoBeS documentation project on Marquesan had been prepared with the MDF 4.0 database type in Toolbox. This was now imported into LEXUS by representing the same hierarchical structures as those in the corresponding Toolbox MDF database type.9 Toolbox can also convert the lexical database into an electronic version with a conventional dictionary layout of alphabetically ordered headwords which includes the two gloss languages French and English. Apart from the electronic and digital versions of the lexical databases, print-out versions are an important product of our documentation project (Mosel 2011: 340) because speech community members sometimes have limited access to electronic devices.

9. Cf. Coward and Grimes (2000) for details about the MDF (=Multi-Dictionary Formatter) database type in Toolbox.

Four aspects distinguish the lexicographic work in the MEL-project from our previous approach to dictionary making:

1) the inclusion of multimedia and annotated archive files to visualise and contextualise word meaning;
2) the inclusion of encyclopaedic information in lexical entries;
3) the use of vernacular language (i.e. Marquesan or Tuamotuan) in documenting word meaning and encyclopaedic knowledge, and
4) indigenous classification of vernacular words (i.e. folk taxonomies), and the creation of ethno-ontologies or informal ontologies (Zinn 2008: 890), i.e. knowledge spaces which are solely based on an indigenous understanding of cultural connections between elements of the lexicon.

In the following sections, the motivation behind the inclusion of these new aspects will be explained and further detailed.
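To give an idea of what a Toolbox lexical record looks like before it is imported into another tool, here is a minimal sketch in Python. It is not the LEXUS import routine. It assumes the backslash field markers commonly used in MDF (`\lx` lexeme, `\va` variant form, `\ps` part of speech, `\ge` English gloss, `\gn` national-language gloss, here French); the sample record is illustrative only, built from the doublet ko’aka – ’o’aka ‘find’, with the part of speech and the French gloss added as assumptions. The function name `parse_record` is hypothetical.

```python
import re

# One field per line: a backslash marker followed by the field value.
FIELD = re.compile(r"^\\(\w+)\s*(.*)$")


def parse_record(text: str) -> dict[str, list[str]]:
    """Collect the values of each field marker in one lexical record."""
    entry: dict[str, list[str]] = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            marker, value = match.groups()
            entry.setdefault(marker, []).append(value.strip())
    return entry


# Illustrative record; \ps and \gn values are assumptions, not project data.
SAMPLE = r"""
\lx ko’aka
\va ’o’aka
\ps v
\ge find
\gn trouver
"""

entry = parse_record(SAMPLE)
print(entry["lx"], entry["va"], entry["ge"])
```

A fuller treatment would nest sense-level fields under their sense number to mirror the hierarchical structure mentioned above; the flat dictionary here is kept deliberately simple.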

3.1. Visualisation and contextualisation of word meaning

The contextualisation and visualisation of word meaning with multimedia and annotated archive files provides the user with information on the pragmatics of lexical units as well as on cultural knowledge related to the meaning and use of the lexical units in question. Moreover, multimedia extensions can illustrate non-verbal aspects of cultural activities that are relevant for the understanding of the concepts encoded by the lexical units in question. In our lexica in LEXUS, lexical units are enriched with and linked to example sentences of annotated sessions in the archive (via time codes). The user can navigate to the exact passage where the example sentence occurs in the annotated session. From this navigation point the user can then browse through the whole session back and forth to explore the exact context of usage (see Figure 1). The linking of lexical units to annotated multimedia archive files goes beyond conventional lexicography, which focuses on the word or lexeme level. Lexical units are embedded in discourse, revealing patterns and habits associated with their usage. According to Pawley (2001: 238), “part of native command of a language involves knowing what things to say in discourse, when and why to say them and how to say them in conventional ways”. Furthermore, every language has a large repertoire of so-called speech formulas (or lexicalised phrases), i.e. “set phrases for saying things in certain contexts” (Pawley 2001: 238). As Pawley (2001: 239) notes, these tend to be neglected by grammarians and lexicographers alike. They “do not really fit the structuralist idea of a clean divide between grammar and dictionary” (Himmelmann 2006: 19). Multimedia lexicon tools such as LEXUS can go some way towards closing this gap in linguistic descriptions.
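The idea of linking a lexical unit to a passage in an annotated session via time codes can be sketched as a small data structure. This is an illustration under stated assumptions, not the LEXUS data model: the class names `ArchiveLink` and `LexicalEntry`, the session identifier and the time offsets are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveLink:
    session: str   # identifier of an annotated archive session (hypothetical)
    start_ms: int  # where the example sentence begins, in milliseconds
    end_ms: int    # where it ends, in milliseconds

    def timecode(self) -> str:
        """Format the start offset as hh:mm:ss.mmm for display or navigation."""
        seconds, ms = divmod(self.start_ms, 1000)
        minutes, seconds = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


@dataclass
class LexicalEntry:
    headword: str
    links: list[ArchiveLink] = field(default_factory=list)


# Hypothetical session name and offsets, for illustration only.
entry = LexicalEntry("heikai vaihopu")
entry.links.append(ArchiveLink("hypothetical_session", 83_500, 87_250))

for link in entry.links:
    print(entry.headword, link.session, link.timecode())
```

Storing the offsets rather than a copy of the sentence keeps the entry tied to its context: a viewer can open the session at the start offset and let the user browse backwards and forwards from there.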


Figure 1. Archive-linking: from lexical entry in LEXUS to archive session (viewed in ANNEX)

3.2. The inclusion of encyclopaedic information in lexical entries

When doing lexicography in and for endangered speech communities, the linguistic fieldworker will often find that lexicographic practice does not easily combine with linguistic theories of lexical semantics (Haiman 1980; Herbst and Klotz 2003; Pawley 2001; Mosel 2004b). Apart from the difficulty in finding adequate translations in the target languages for vernacular words of the source language – which is a general problem of bi- or multilingual lexicography (Herbst and Klotz 2003: 109) – there is also the problem that many headwords of lexical entries denote complex phenomena, procedures and concepts which are specific to the source language and culture and therefore often do not have a translation equivalent in the target language at all (Mosel 2004b: 48; Franchetto 2006: 203–206; Haviland 2006: 136–139). In order to avoid an inadequate documentation of word meaning it is therefore often necessary to provide more encyclopaedic information in definitions than a conventional bi- or multilingual dictionary would normally include. For instance, Marquesan (MQR) heikai vaihopu and mākiko are both a kind of breadfruit pudding made out of the same ingredients (very ripe breadfruit pulp and coconut milk), but the final product and the technique of preparing them differ to a great extent. In our lexicographic work, the English equivalent of ‘breadfruit pudding’ is not a satisfactory translation and we have added more information, for example, on the preparation technique to make the distinction between the two more salient. But is this kind of additional encyclopaedic information part of the word meaning of heikai vaihopu and mākiko? Pawley (2001: 231) questions whether the sparse representations of word meaning in conventional dictionaries only reflect the “arbitrary limitations in the practices of lexicographers” which arise not “from theoretical principles but from practical considerations of time and money”. Some researchers (Haiman 1980; Herbst and Klotz 2003) believe that one cannot really distinguish between dictionaries and encyclopaedias because the semantic knowledge of a word often derives from cultural knowledge, and that a meaning of a word is often a subset of cultural (or encyclopaedic) knowledge. Consequently the distinction between linguistic and cultural knowledge often has no clear boundary. Whatever the theoretical stance might be, in a context of language endangerment and the documentation of an endangered language and culture it is surely self-evident that one should integrate as much cultural knowledge connected with a certain word as possible. The question is to what extent the dictionary is the right medium of documenting this knowledge. Pawley (2001: 236–237) observes that all large general dictionaries of underdescribed languages are to some degree ethnographic: definitions of vernacular terms “include information about their significance in the culture of the speech community”. Also, practical handbooks of conventional lexicography advise lexicographers to include detailed encyclopaedic and cultural knowledge in definitions of culture-specific terms (Svensén 1993: 164; Atkins and Rundell 2008: 126–127). However, the definitions in conventional dictionaries are confined to verbal descriptions.

The advantage of a lexicon created with LEXUS is that multimedia extensions as well as archive links can provide the user with a wider and fuller linguistic as well as non-linguistic context of cultural activities, which contributes to a more thorough understanding of the culture-specific concepts in question.

3.3. The use of vernacular language in lexical entries

In §2.2 it has already been mentioned that monolingual (or vernacular) definitions of the different senses of a headword and vernacular encyclopaedic articles are an important part of our MEL-project. This kind of work encourages native speakers to express meanings of words in their own language and thereby further stimulates the expressive power of their language. Vernacular definitions of words are also an important step towards understanding indigenous word meaning because they document word meaning from the native speaker’s understanding of a word (Mosel 2004b: 48). They are a repository of authentic indigenous word meaning for the scientific community because they can clarify possible misunderstandings and misinterpretations between the researcher and the local field assistants and language consultants. Misinterpretations are not uncommon in linguistic fieldwork because the researcher and the local consultants often communicate via a contact language which is the native language of neither (Mosel 2004b: 48; Haviland 2006: 144). Even if the fieldworker communicates via the field language, there are usually very different levels of linguistic competence between the fieldworker and the language consultants which could lead to misinterpretations.

In many endangered speech communities which lack a tradition of literacy, the formulation of monolingual definitions and encyclopaedic articles poses a considerable challenge and leads to the emergence of new, written speech genres. While linguistic ecologists such as Mühlhäusler (1990, 1996) have cautioned against the imposition of literacy because it might diminish a rich oral heritage,10 such newly emerging written genres can provide fascinating insights into the expressive potential of speakers of a language (Mosel 2004a: 263, 2004c: 4). New constructions may be deployed which are rarely used in everyday conversations, undoubtedly of interest to linguists and possibly to language educators. The Marquesan field assistants indeed developed a style containing new and rarely used constructions. Verb serialisation (1, 2) and clause chaining (2) occurred much more frequently than in other (speech) genres.
For example, a wooden ring (manoni) is defined as follows:

(1) ’Akau humu=tīa ha’a=kapoipoi, to’o=tīa no te hana pafi’o ako’e’a ’ofi’o
    wood attach=PASS CAUS=round take=PASS for ART fabricate landing.net or catcher
    ‘Attached rounded wood (lit. ‘wood which has been attached and rounded’) used to fabricate landing nets (fruit harvest) and catchers (fishing)’

10. Crowley (2001: 3) has noted that literacy has been introduced for a long time in many Pacific minority languages, but it has not at all resulted in the reduction of the rich oral heritage. This can also be stated for the Marquesas.

Gabriele Cablitz

A preparation of plant medicine is described as follows:

(2) ...e katotīa t=o īa tau manamana, pī te ’au i ’una, tukituki ha’a=pe’ehu, kohi te tau manamana...
    TAM break.off ART=POSS 3.SG PL RED-branch be.full ART leaf LD top RED-pound CAUS=soft collect ART PL RED-branch
    ‘...the branches, which are full of leaves, are broken off, pounded and made soft, then collected...’

TAM particles and articles are often omitted:

(3) ...ka-kaiu tumu, kofā hunahuna, ’au firifiri ’ina kapoipoi
    RED-small trunk rib tiny leaf curly almost round
    ‘...small trunk, tiny stalk, curly leaves almost round’

There is generally a higher number of juxtapositional constructions:

(4) ...kakano ke’eke’e kaikai na te kūkū
    seed black food for ART fruit-dove
    ‘...black seeds which are food for the fruit doves’

Some NPs can contain up to four postnuclear modifiers:

(5) ...t=o īa tau pupu pua ’iki metīe ku’uhua
    ART=POSS 3.SG PL bunch flower small green yellow
    ‘...it has small, yellow-green bunches of flower (lit. ‘its bunches flower small green yellow’).’

Most of the vernacular definitions in our MEL-project denote concrete entities of the natural world (plants, fish, shellfish, birds, body-parts, etc.), cultural products (ornaments, artefacts, containers, clothing, etc.), and the preparation techniques and technologies used to produce these cultural products (fishing techniques, food preparation, plant medicine, tools, instruments, etc.). These definitions mainly describe form, appearance, habitat, texture, colour, material or course of action, etc., but they also contain encyclopaedic information and functional aspects of usage if consultants felt them to be necessary. For example, with respect to plants, a detailed description of the appearance (form, height, leaves, flowers, fruits, seeds, habitat, etc.) and of the plant’s main uses was given because it facilitates the identification of the plant by other native speakers.

The vernacular definition is followed by a translation of the definition in the target languages French and English. If an equivalent French and English term (e.g. a standard common botanical name) existed, it was added to the translated definition, as in this entry for the headword tumu īnaīna:

tumu īnaīna n., Xylosma suaveolens, Usage: HO, TH11

Marquesan definition: tumu ’akau e to’u meta te hohonu, mea fatea te manamana i vaho, e to’u e fā meta, ’e’eva, mea titatita te ’au fi’i o’o, pua ’iki ma’ita, kakano ke’eke’e me he kakano o te va’ova’o, e tupu i no he tau īvi, taha mo’o, na te kūkū me te koma’o e kai te kakano (HO).

French translation with common botanical name: Xylosma suaveolens; arbre de trois mètres d’hauteur, les branches s’etendent vers les côtés de 3 à 4 mètres, feuilles touffues, petites fleurs blanches, graines noires comme les graines de l’arbre de Premna (Premna serratifolia), il se trouve au plateau et sur des collines dans les endroits secs, le ptilope dupetit-thouars (Ptilinopus dupetithouarsii) et la rousserolle des Marquises (Acrocephalus percernis) mangent ses graines.

English translation with common botanical name: East Polynesian Xylosma; tree up to 3 meters in height, long branches to its side (up to 3 to 4 meters), dense, bushy leaves, small white flowers, black seeds similar to the Premna tree (Premna serratifolia), found on the high plateaus and on the hill tops in dry locations, the white-capped fruit-dove (Ptilinopus dupetithouarsii) and the Marquesan reed-warbler (Acrocephalus percernis) eat its seeds.

11. HO and TH are abbreviations for different island vernaculars: HO = Hiva ’Oa dialect, TH = Tahuata dialect.

Other, more detailed encyclopaedic information was transformed into encyclopaedic articles describing cultural uses by using keywords in capital letters (e.g. ’APAU ‘medicine, medicinal treatment’, PENI ‘paint’, HA’INA ‘handicraft’, ’AVAĪ’A ‘fishing’, etc.). The lexical entry ’anetāi ‘coral tree’ (Erythrina variegata (or E. corallodendrum)), for example, contains the following encyclopaedic descriptions of cultural uses:

HA’INA: To’oa te ’akau no te hana te hei ku’uhē no te ’ana’ana o te ’akau. To’otīa te kakano no te tui hei.
’AVAĪ’A: Ia pua te ’anetāi ua kai te te’ape, momona pao te ī’a, to īa kerehi me he pua o te ’anetāi, mo’ai oko te ī’a.
ARTISANAT: On prend le bois pour fabriquer le “hei ku’uhe” (sorte de couronne) à cause de sa qualité legère. Les graines sont prises pour confectionner des couronnes.
PÊCHE: Quand l’arbre est en fleurs, cela annonce une période très favorable pour pêcher les perches (Lutjanus kasmira); les graisses autour du foie de ce poisson deviennent très grasses et ont une couleur rouge semblable aux fleurs d’erythrine.
HANDICRAFT: The wood is taken to make the breast ornament "hei ku’uhe" due to its light quality. The seeds are used to make necklaces.
FISHING: When the coral tree is in flower, an abundance of bluestripe snappers (Lutjanus kasmira) can be caught; during the flowering period of the coral tree, the liver of this fish is extremely fatty and has a colouring similar to that of the red flowers of the coral tree.

The encyclopaedic articles can vary greatly in length. In general, details of preparation forms (e.g. plant medicine, dance costumes, medical treatment, etc.) are described briefly in an instructive style. The first encyclopaedic articles had structural similarities to spoken instructive texts with a high frequency of discourse particles indicating the sequence of events (e.g. of a preparation form). Most of these first encyclopaedic articles were re-written, often in a briefer and denser style avoiding repetitiveness. In glossaries which focus on one tree or plant with all its botanical varieties (e.g. the breadfruit tree and its varieties), the descriptions of cultural uses in the encyclopaedic article section of the lexical entry were organised with respect to the different plant parts (e.g. fruit, stem, bark, leaves, sap, roots, etc.). More will be said about this type of ethnobotanical approach in the next section.
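The entry layout described in this section (a headword, definitions in several languages, and keyword-tagged encyclopaedic articles) could be modelled roughly as follows. This is only a minimal sketch in Python and not the LEXUS schema: the names `LexicalEntry`, `EncyclopaedicArticle` and `articles_for` are hypothetical, and the sample texts are abridged from the coral tree entry quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class EncyclopaedicArticle:
    """A keyword-tagged description of one cultural use (hypothetical model)."""
    keyword: str                 # vernacular keyword in capitals, e.g. "HA’INA"
    texts: dict[str, str]        # language label -> article text


@dataclass
class LexicalEntry:
    """Illustrative entry layout; the field names are assumptions."""
    headword: str
    gloss: str
    scientific_name: str
    definitions: dict[str, str] = field(default_factory=dict)  # language -> definition
    articles: list[EncyclopaedicArticle] = field(default_factory=list)

    def articles_for(self, keyword: str) -> list[EncyclopaedicArticle]:
        """Return the encyclopaedic articles filed under one keyword."""
        return [a for a in self.articles if a.keyword == keyword]


# Sample data abridged from the coral tree entry quoted above.
entry = LexicalEntry(
    headword="’anetāi",
    gloss="coral tree",
    scientific_name="Erythrina variegata",
    articles=[
        EncyclopaedicArticle(
            keyword="HA’INA",
            texts={"English": "The seeds are used to make necklaces."},
        ),
        EncyclopaedicArticle(
            keyword="’AVAĪ’A",
            texts={"English": "When the coral tree is in flower, an abundance "
                              "of bluestripe snappers can be caught."},
        ),
    ],
)

handicraft = entry.articles_for("HA’INA")
print(handicraft[0].texts["English"])
```

Keeping the articles as a list keyed by keyword, rather than as fixed fields, mirrors the observation that articles vary greatly in number and length from entry to entry.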

3.4. Cultural knowledge spaces and folk taxonomies

As already mentioned above, the new technology of relational linking in LEXUS/ViCoS not only opens up the possibility of viewing, for example, typical semantic relations such as synonymy, antonymy, meronymy, etc. of a lexical entry in one and the same space, but also allows the integration of all kinds of information in whatever way defined by the user. Unlike the commonly used one-to-one hyperlinking mechanism, relational links – created in ViCoS knowledge spaces12 – visualise multiple links in the same space. The users can view several nodes simultaneously on the computer screen, which allows them to navigate according to their interest. All nodes can contain further links to other lexical, multimedia or archive data (containing sessions of specific cultural uses) which are opened up once the user clicks on the nodes. The screenshot in Figure 2 shows how parts of the breadfruit ontology are realised in ViCoS.

[Screenshot of the ViCoS Editor and Navigator. Legend (relation types): preparation_shown_in, is_a, is_product_of, is_material_for, is_part_of, is_variety_of, is_image_of, is_medicine_for, is_topic_in. Node labels include tumu mei, mei, mei ka'aku pukupuku, mei hōi, mei autēa, mei aravei, mei hinu, mei maō'i, mei 'ape, mei kopumoko, haīka mā, mā tehītō and galerie de photos.]

Figure 2. Screenshot of parts of the breadfruit ontology in ViCoS
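The difference between one-to-one hyperlinks and relational links can be made concrete with a small typed-relation graph. The following Python sketch is an illustration only and does not reproduce the ViCoS implementation: the class `KnowledgeSpace` and its methods are hypothetical, and the three sample links are assumed for the sake of the example, since the screenshot does not spell out which relation holds between which nodes.

```python
class KnowledgeSpace:
    """Nodes joined by named relations (hypothetical stand-in for a ViCoS space)."""

    def __init__(self) -> None:
        # node -> list of (relation, other node, direction)
        self._links: dict[str, list[tuple[str, str, str]]] = {}

    def relate(self, source: str, relation: str, target: str) -> None:
        """Record 'source relation target' so it is reachable from both ends."""
        self._links.setdefault(source, []).append((relation, target, "out"))
        self._links.setdefault(target, []).append((relation, source, "in"))

    def focus(self, node: str) -> list[tuple[str, str, str]]:
        """All links that touch a node, i.e. what is shown when it is in focus."""
        return sorted(self._links.get(node, []))


space = KnowledgeSpace()
# Assumed links for illustration; relation names are taken from the legend.
space.relate("mei hōi", "is_variety_of", "mei")
space.relate("mei autēa", "is_variety_of", "mei")
space.relate("mā", "is_product_of", "mei")

for relation, other, direction in space.focus("mei"):
    print(relation, other, direction)
```

Because every link is stored at both of its end points, moving the focus from one node to another shows all relations of the new node at once, which is the behaviour a single hyperlink cannot provide.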

12. Note that in the Kirrkirr software, too, multiple relational links are viewed in the same space.

In the upper left part of the screenshot only those relations are visualised which are related to the node mei ‘breadfruit’. When navigating e.g. to the node mā ‘fermented breadfruit’ it becomes the focus in the ViCoS Editor and Navigator showing all relations connected to mā. The navigation can continue as long as relations have been created between elements of the LEXUS lexical database. The advantage of such a tool for the speech community as well as the scientific community is obvious. Speech community members can define and organise how words and other (cultural) data are grouped together and what relations they hold to each other, thus creating knowledge spaces which are meaningful to speech community members. Due to the flexible scheme of linking, all kinds of data13 can be combined in whatever way possible according to the particular interest of the user. Linguistic researchers could create knowledge spaces to view a particular semantic field or domain of investigation (e.g. CUT and BREAK verbals). Researchers interested in oral literature, for example, can include proper names of protagonists in the lexical database, establish the relationships they hold to each other (e.g. is_father_of, is_younger_sister_of, is_sister-in-law_of, etc.) in ViCoS knowledge spaces and make links to the archive where the respective narratives are stored (cf. Figure 3). ViCoS can not only visualise various relations; important information (e.g. complex family relationships) which is usually included neither in the lexical entry nor in the metadata of the archive sessions can also be established in ViCoS.

The indigenous representation and organisation of data in ViCoS have been collected and prepared in two different ways. Our first approach was to create knowledge spaces from an ethnobotanical perspective as the traditional material culture of the Marquesas and Tuamotu islands is – for the most part – based on plants. A plant was taken as a point of departure (or anchor) and then all the cultural uses which are connected with that plant were established.
The cultural uses of the most important trees in the Tuamotuan (coconut tree) and Marquesan cultures (breadfruit, coconut and banyan tree) were elicited in detail by creating a kind of informal ontology for each tree: all the different parts of the plants were named and related to the ways in which they are used in the traditional culture (food, preparation, plant medicine, handicrafts, canoe-building, house-building, etc.). It was the intention to establish a kind of mini-encyclopaedia about one specific plant which would have links to all the indigenous and encyclopaedic knowledge available in our lexicon and archive in a structured way (cf. Figure 2 above). We refer to them as cultural knowledge spaces, ethno-ontologies or informal ontologies (Zinn et al. 2008; Zinn 2008: 890).

The cultural knowledge spaces of these trees were first created with so-called meta-cards which are paper cards with different shapes and colours. Consultants wrote keywords (e.g. usage of a plant part) on the meta-cards and arranged them according to their own cultural associations. With the meta-cards consultants could change, add to and rearrange their initial organisation. On the basis of these meta-card ontologies, consultants were further interviewed and a wealth of indigenous encyclopaedic information was elicited and video-taped about the different preparation forms, techniques, function and usage, customs and rituals connected with the trees or artefacts fabricated from these trees, etc. The elicitation of indigenous encyclopaedic knowledge was undertaken together with the indigenous anthropologist of the team, Edgar Tetahiotupa.

13. The linking of multimedia files in ViCoS is still very restricted, but there are workaround solutions for creating links to the archive and multimedia files. One can create a database in LEXUS which contains the relevant multimedia files, archive links or photo galleries without further lexical data. Within that database one creates a data category (e.g. ‘link to photo gallery’) which could be used as a link in the ViCoS knowledge space to access the database which contains the multimedia data or archive links in LEXUS (cf. Figure 2).

[Screenshot of the ViCoS Editor and Navigator. Legend (relation types): is_father_of, is_ancestral_spirit_of, is_son_of, is_adoptive_sister_of, is_younger_sister_of, is_sister-in-law_of. Node labels: Teupokootiū, Teupokootekahi, Paevao Pepei’u, Tohi’akau, Moni, Keikahanui, Pa‘ehitu.]

Figure 3. An example for kinship relations represented in ViCoS

Cultural knowledge spaces are informal ontologies which are solely based on native speakers’ associations between elements of the lexicon in LEXUS (Zinn 2008: 890–894). They are not formal ontologies like SUMO or WordNet, which are built for machine reasoning rather than for human consumption in educational settings of language revival (Zinn et al. 2008). Work on ontology building in endangered speech communities has so far been attempted by Rau et al. (2009: 199–209) for the Austronesian language Yami, but the researchers “have come to realize that it is not possible to describe an ontology based on sophisticated machine reasoning. Any ontology description ... requires triangulation of various resources of human interpretation” (Rau et al. 2009: 208). Whereas Rau et al. (2009: 192) pursue a ‘formalized model of existing indigenous knowledge’, our objective is guided more by practical aspects of usability for the speech community and better visualisation of cultural connections in a language documentation archive.

Our second approach was to create so-called folk taxonomies (Conklin 1962; Bulmer 1970; Coward and Grimes 2000: 138–142) for which speech community members had to classify aspects of the natural environment by creating classes of one specific cultural domain and establishing hierarchical structures based on cultural uses and associations. Conklin (1962: 50) observes that folk taxa “belong simultaneously to several distinct hierarchical structures” of which some are based on form and appearance whereas others are based on culture-specific associations and uses. In other words, “functional categories” of plants and animals, e.g. culture-specific products such as containers, clothing, ornaments, medicine, food dishes, etc., form part of folk taxonomies as much as the categories and classes formed on the basis of form and appearance of things from the natural environment.
For our work on folk taxonomies we decided to focus on the domain of marine life because this domain – next to plants – plays an important role in the two Polynesian cultures. Thus far, the work has only been accomplished in the Marquesan speech community. The first step in establishing a folk taxonomy is to find out under which generic term a lexeme is classed in the vernacular, revealing the higher category of a lexeme (Coward and Grimes 2000: 138). However, between the highest level of the taxonomy (i.e. generic term), called life form, and the specific name of a thing, called terminal taxa – which relates to the western notion of species – there can be many more intermediate levels called intermediate taxa (Coward and Grimes 2000: 138). It is particularly difficult to find out about these intermediate taxa – if they exist at all – which often deviate greatly from scientific sub-classes (e.g. families, phyla, varieties, subspecies, etc.). Within taxonomies which are hierarchically structured with respect to classes, the field researcher working on folk taxonomies often has the problem of integrating different ontogenetic stages (different names for juvenile vs. adult animal) and part-of-relationships (tree-stem sap) which can be important folk categories (Conklin 1962: 50).

For our folk taxonomy we classified 203 different species of marine life by building up hierarchical structures and classes which are meaningful to the speech community members (e.g. culture-specific uses such as fishing techniques, food preparation, etc.). The classification of a particular domain is of course best done with real objects or the entities in question. However, it is very difficult to have samples of a large number of marine animals at the same time and for the duration of the classification process of the folk taxonomy, which can take two weeks or longer. For this reason we prepared photos of marine species and attached them to blank file cards on which sufficient space was left to code the data and make additional notes. The Marquesan consultants then sorted the cards according to their understanding of classes and sub-classes (intermediate taxa) and found generic names for these classes if they existed. Differently coloured post-it notes were used for generic terms and intermediate (vernacular) taxa. Altogether, we made 22 indigenous classifications (incl. marine animal/fish families, fishing techniques, fishing times, fishing tools, fishing locations, fishing baits, food dishes, lunar fishing calendar, etc.) containing sub-classes with intermediate taxa. Not all classes have generic terms.
If there were no generic terms, consultants were encouraged to give descriptive labels. These non-lexicalised descriptions express the most important characteristics of a particular class which are meaningful to them (e.g. niho ta’ata’a ‘sharp teeth’, īpu pe’ehu ‘soft shell’, īpu fe’o ‘hard shell’). Some of the classes correspond more or less to general scientific groupings such as īpu pe’ehu (crustaceans) and īpu fe’o (shellfish), whereas other classes such as ki’i fe’o, unahi ko’e ‘thick skin, no scaling’ comprise six different fish families: Ostraciidae (boxfish, cofferfish, cowfish), Tetraodontinae (puffer, toby), Diodontidae (porcupine fish), Balistidae (triggerfish), Labridae (wrasse) and Pterorinae (lionfish). These six fish families are very different in form and appearance, but they are grouped together due to a quality they share (i.e. thick skin) as well as a functional aspect, namely that they do not need to be scaled after fishing. Furthermore, it is interesting to note that the classes are not always formed with regard to the linguistic classification. For example, some fish names which are formed with the generic terms of particular classes such as ūme ‘unicornfish’ and tamano ‘jackfish, chubs’, are not necessarily classified with fish which bear the same generic term. For example, ūme tiaporo ‘longhorn cowfish’ belongs to a different class than ūme tatihi, ūme kuripo, ūme mei, etc. which all belong to the unicornfish family. Furthermore, the term ’eitano ‘seafood’ comprises crustaceans as well as shellfish, but in the classifications the consultants clearly separated the group of crustaceans (īpu pe’ehu ‘soft shell’) and shellfish (īpu fe’o ‘hard shell’) into two groups (cf. above). In ViCoS these indigenous classifications can be visualised coherently and contrasted with scientific classifications via scientific names of fish families (e.g. Balistidae (triggerfish) vs. Labridae (wrasse), etc.).

Altogether only three consultants participated in the task, so our folk taxonomy does not yet represent a general indigenous classification of marine life for the Marquesan speech community. More data will have to be collected with a larger number of consultants to see whether there are regular patterns of classification across the speech community. However, the consultants, of whom two also worked as field assistants on the dictionary, felt that this task greatly helped them in structuring and eliciting encyclopaedic knowledge which is now being transformed into encyclopaedic articles for the fish lexicon. Again, this task shows that the documentation of lexical and cultural knowledge is best achieved when it is embedded in a very neatly defined thematic domain. Moreover, cultural connections can be much more easily established when the whole domain of investigation is highly contextualised.
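The contrast between folk classes and scientific groupings reported above can be tabulated with a few lines of code. The sketch below is a hedged illustration in Python: the mapping only restates the class memberships named in the text, while the function names `group_by_folk_class` and `cuts_across_science` are hypothetical and do not belong to LEXUS or ViCoS.

```python
from collections import defaultdict

# Folk class reported for each scientific grouping in the text above.
FOLK_CLASS_OF = {
    "crustaceans": "īpu pe’ehu",
    "shellfish": "īpu fe’o",
    "Ostraciidae": "ki’i fe’o, unahi ko’e",
    "Tetraodontinae": "ki’i fe’o, unahi ko’e",
    "Diodontidae": "ki’i fe’o, unahi ko’e",
    "Balistidae": "ki’i fe’o, unahi ko’e",
    "Labridae": "ki’i fe’o, unahi ko’e",
    "Pterorinae": "ki’i fe’o, unahi ko’e",
}


def group_by_folk_class(folk_class_of: dict[str, str]) -> dict[str, list[str]]:
    """Invert the mapping: folk class -> sorted scientific groupings."""
    grouped = defaultdict(list)
    for scientific, folk in folk_class_of.items():
        grouped[folk].append(scientific)
    return {folk: sorted(members) for folk, members in grouped.items()}


def cuts_across_science(grouped: dict[str, list[str]]) -> list[str]:
    """Folk classes that bundle more than one scientific grouping."""
    return [folk for folk, members in grouped.items() if len(members) > 1]


grouped = group_by_folk_class(FOLK_CLASS_OF)
for folk in cuts_across_science(grouped):
    print(folk, "->", ", ".join(grouped[folk]))
```

With more consultants, one such mapping per consultant could be compared to check whether the same groupings recur across the speech community.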

4. Web-based collaboration with the speech community

Internet facilities are becoming more and more accessible in the remotest areas of the world, which makes it technically possible to continue the cooperation between researchers and speech communities outside fieldwork periods. The web-based editing possibilities of LEXUS were particularly designed to motivate the speech community members to actively participate in the documentation of their languages by adding and editing dictionary entries and enriching them with encyclopaedic information via the web (Ringersma and Rybka 2009), and potentially, in the absence of the researcher. However, there are many problems connected with such a collaboration framework which basically has a wiki-like set-up (cf. also Rau et al. 2009: 194–195). In this section it will be discussed why web-based collaboration with endangered speech communities is difficult to achieve, drawing on our own experience and the numerous discussions we had with the Marquesan and Tuamotuan language activists and field assistants.

At the beginning of the project the idea of online cooperation seemed very appealing both to the linguistic team and the speech communities. The language activists in the Tuamotuan speech community, for example, hoped that the web-based possibilities of LEXUS would promote a community-wide participation of documenting endangered cultural knowledge on a larger scale. Community members could be involved who would otherwise not be able to contribute due to difficult inter-island communications in the Tuamotuan archipelago:14 getting to remote parts of the Tuamotuan islands is expensive and time-consuming. For the linguistic team a web-based collaboration initially seemed appealing because the collaboration with the speech communities could continue outside fieldwork periods which would ensure a continuous growth of linguistic material in a short period of time. However, it is not sufficient simply to make a web-based tool available in order to ensure online cooperation, nor can one assume that an encyclopaedic lexicon will be easily created in a wiki-like manner by the speech community. For a successful online cooperation with LEXUS, there needs to be a substantial amount of capacity building in the speech community and a number of community-internal obstacles have to be overcome as well, which will be described in the following sub-sections.
Some of the obstacles are culture-specific, but a number of them can be generalised for endangered speech communities. If the lexicon creation is open to the whole speech community, there needs to be a system established in LEXUS to ensure that the contributions are coordinated and merged into the lexicon. In LEXUS this concept of collaborative lexicon creation will be realized via so-called collaborative workspaces.15 The whole set-up of collaborative workspaces is a crucial prerequisite in collaborative lexicon creation and the manner of the design of collaborative workspaces is closely connected to many of the community-internal obstacles which will be described below. We will therefore begin with the basic concept of collaborative workspaces.

14. Note that the sparsely-scattered Tuamotuan archipelago covers a geographic area approximately corresponding to a triangle between North Germany, Bucharest and London.

15. LEXUS does not yet have the full set-up of collaborative workspaces, but so far can only distinguish between readers and writers. The concept described below was the proposal set forth by the software developers to the linguistic team and the speech community. It gave rise to many discussions which are relevant for the discussion of web-based tools for language documentation.

4.1. Collaborative workspaces

At the beginning of the MEL-project the developers of LEXUS proposed a basic model of collaborative workspaces. The idea of a collaborative workspace works as follows. For example, a linguistic team which has already created a lexical database in Toolbox imports it into LEXUS and publishes it in a central storage place. The researcher then defines access rights for the user groups. Authorised users then copy the lexicon version into their personal workspace and make changes to the lexicon. After editing the lexicon the authorised user submits the new lexicon version which is then merged with the lexicon version in the central storage place. An administrator is solely in control of the merging (i.e. synchronisation) process and can accept or refuse changes (Cablitz, Ringersma, and Kemps-Snijders 2007: 414).

4.1.1. Basic challenges of the model of collaborative workspaces

Although there is some control over the merging process, the crucial question is who would be a suitable administrator. If it were the field researcher it would usually mean that she or he is a non-native speaker who cannot accurately judge whether or not vernacular definitions are well-formed. If the administrator is a native speaker, he or she would be able to make a judgement about the well-formedness of vernacular definitions, but might have too little knowledge about lexicography (e.g. how to analyse the different senses of a headword, verbals with differing degrees of transitivity, etc.). Either way, the administrator will not be able to judge fully on the validity of a changed or new lexical entry. Many concerns were raised in discussions held between members of the speech communities and the linguistic team during 2007. They revealed that the idea of collaborative workspaces working on a wiki-type basis leads to serious problems which could have counterproductive outcomes (cf. §4.2.1 and 4.2.2). It was therefore decided to adapt the model more closely to the needs of the speech community, as outlined in the next section.
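The basic workspace model of §4.1 (copy from central storage, edit privately, submit, and merge under the administrator's control) can be sketched in a few lines. The following Python fragment is an assumption-laden illustration rather than LEXUS code: the class `CentralLexicon`, its methods, and the `accept` callback standing in for the administrator's judgement are all hypothetical, and entries are reduced to a headword with a single gloss.

```python
from typing import Callable, Optional


class CentralLexicon:
    """Published lexicon in the central storage place (hypothetical model)."""

    def __init__(self, entries: dict[str, str]) -> None:
        self.entries = dict(entries)

    def checkout(self) -> dict[str, str]:
        """Give an authorised user a private copy to edit."""
        return dict(self.entries)

    def merge(
        self,
        submitted: dict[str, str],
        accept: Callable[[str, Optional[str], str], bool],
    ) -> list[str]:
        """Merge new or changed entries that the administrator accepts.

        `accept(headword, old, new)` models the administrator's decision.
        Deletions are not handled in this sketch.
        """
        merged = []
        for headword, new in submitted.items():
            old = self.entries.get(headword)
            if new != old and accept(headword, old, new):
                self.entries[headword] = new
                merged.append(headword)
        return merged


central = CentralLexicon({"manoni": "wooden ring"})

workspace = central.checkout()               # personal workspace copy
workspace["manoni"] = "ring"                 # a changed entry
workspace["mā"] = "fermented breadfruit"     # a new entry

# Example policy: the administrator accepts new entries but refuses rewrites.
merged = central.merge(workspace, accept=lambda headword, old, new: old is None)
print(merged)
```

The single `accept` callback makes the weak point discussed in §4.1.1 visible: every decision about validity rests on whatever one administrator is able to judge.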

4.1.2. Design of collaborative workspaces by the speech community

The speech community suggested a quality-controlled system of collaborative workspaces. In this model, the contributions of the speech community are evaluated by a panel of moderators which should ideally consist of native speakers of all the different island vernaculars of the archipelago. There should be a flexible system of hierarchical rights such as “reader”, “writer”, “moderator” and “administrator”. Readers would only have access to the published version of the lexicon in the central storage place, whereas writers could download the lexicon from the central storage place into their own personal workspace and change and add content to the lexical entries. The moderators have the most important function in this model of collaborative workspaces: they review the contributions and decide whether they should be published or not. Each member of the panel of moderators specialises in a particular cultural domain (e.g. song and dance, story-telling, food preparation, plant medicine, fishing, etc.). Each moderator reviews only those entries which fall into their area of cultural expertise. The final decision of accepting, rejecting or modifying submitted contributions is a joint decision of the panel of moderators which is regarded as a more “democratic” approach than having one administrator who makes all the decisions. The administrator, who delegates the new and changed entries to the respective moderators, manages all users, gives permissions, merges the data into the lexical database and publishes it in the central storage place, has to follow the decisions made by the panel of moderators. The administrator could be the primary editor of the lexicon – in our case the linguistic fieldworker – who could have a say in spelling matters, the presentation of linguistic information and semantic analysis of a headword. The system of collaborative workspaces as proposed by the speech communities is not without some problematic aspects.
The first of these is the question of how such a panel is selected. For the Tuamotuan speech community the function of the panel is ideally taken over by a language academy which is elected by an external committee. Each member of the speech community can apply to become an academy member. Experience with the Marquesan academy, however, has shown that the choice of academy members is often politically motivated rather than a selection by competence and knowledge. The replacement of old with new academy members has been largely the choice and personal preference of the academy’s director rather than an election by the whole academy body or an external committee. This has led to academy members being held in low esteem and an unwillingness in the Marquesan speech community to cooperate with them (Cablitz 2010: 39–40). Tuamotuan speech community members have made similar observations during the election of their academy in 2010.

The whole process of moderating contributions is very time-consuming and work-intensive. The administrator who finalises the work would constantly have to communicate and negotiate between the contributors (i.e. the writers) and the panel of moderators. For example, if several entry writers submit different definitions of the same headword (i.e. same sense of a headword), it would be difficult to review these multiple drafts efficiently at one time. Another problem is that changes cannot be easily traced in a changed entry, which further complicates the final editing process for the administrator. One of the solutions for facilitating the whole editing and merging process as well as encouraging a community-wide participation would be the introduction of so-called whiteboarding tools. Whiteboarding tools are web-based tools which allow an informal collaborative editing of documents and files. Ideally such a tool should be made available with the published ‘read only’ version of the lexicon. In this way speech community members could add comments, make suggestions, etc. on specific parts of a lexical entry without making irretrievable changes to the lexical database, as illustrated in Figure 4.
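The quality-controlled model proposed by the speech community in §4.1.2 combines graded access rights with domain-specific review. The Python sketch below illustrates one possible reading of that proposal and is not part of LEXUS: the action names in `PERMISSIONS`, the functions `reviewers_for` and `panel_accepts`, and the moderator names are hypothetical, and the strict-majority rule is an assumption, since the proposal only speaks of a joint decision of the panel.

```python
from enum import Enum


class Role(Enum):
    READER = "reader"
    WRITER = "writer"
    MODERATOR = "moderator"
    ADMINISTRATOR = "administrator"


# Actions per role, paraphrasing the proposal; the action names are invented.
PERMISSIONS = {
    Role.READER: {"read"},
    Role.WRITER: {"read", "edit_own_copy", "submit"},
    Role.MODERATOR: {"read", "review"},
    Role.ADMINISTRATOR: {"read", "delegate", "manage_users", "merge", "publish"},
}


def reviewers_for(domain: str, panel: dict[str, set[str]]) -> list[str]:
    """Moderators whose area of expertise covers the entry's cultural domain."""
    return sorted(name for name, domains in panel.items() if domain in domains)


def panel_accepts(votes: dict[str, bool]) -> bool:
    """Assumed rule: a strict majority of the votes cast must be in favour."""
    return sum(votes.values()) * 2 > len(votes)


# Hypothetical panel: moderator name -> cultural domains of expertise.
panel = {
    "moderator_a": {"fishing"},
    "moderator_b": {"plant medicine", "food preparation"},
    "moderator_c": {"song and dance", "story-telling"},
}

print(reviewers_for("fishing", panel))
print(panel_accepts({"moderator_a": True, "moderator_b": True, "moderator_c": False}))
```

In this reading the administrator holds the `merge` and `publish` actions but not `review`, which reflects the requirement that the administrator follows the panel's decisions.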
Apart from difficulties in administering the workflow, there is an obvious desire by both speech communities16 to have some control over the quality and the content of the contributions. Similar negative reactions about unmonitored dictionary contributors have been reported by Rau et al. (2009: 194) for the Yami speech community. Contributors to their wiki dictionary feared that ‘painstakingly entered lexical entries would disappear’ because they would be re-edited by other contributors as soon as the data was published (Rau et al. 2009: 195). These concerns were also articulated by the Tuamotuan and Marquesan community members. The speech community members who have participated in the MEL-project were mainly concerned that a wiki-like lexicon creation would result in an “editing war” where speech community members would constantly re-write lexical entries of other speech community members because they believe that they have a more profound linguistic and cultural knowledge. To them, collaborative creation of the lexicon seemed too idealistic an approach, and uncontrolled contributions in a wiki-like setup would not yield the desired result.

16. The set-up of collaborative workspaces and the workflow was also intensely discussed with Marquesan speech community members, whose concerns were very similar to those of the Tuamotuan community.

Figure 4. Example of a whiteboarding tool: a dictionary entry with comments. [Screenshot not reproduced; the comments attached to the entry read: “Vahaka is not a good word, po'o is better”, “patahi is written pātāhi!” and “Also used on Nuku Hiva, add NH”.]

While some of these difficulties are probably of a universal nature, there are also culture-specific factors at work. One serious problem is that language consultants do not necessarily share the same metalinguistic and cultural knowledge about words. The indigenous Polynesian languages are currently undergoing rapid linguistic change caused not only by the dominant contact language French, but also by Tahitian (Cablitz 2010: 49). Depending
on the age of the consultants and their upbringing, the metalinguistic knowledge is in general very heterogeneous. The problem is also rooted in their traditional society: the transmission of cultural knowledge was – and still is – by no means a public affair and it was only transmitted to selected persons in the community. Consequently some speakers possess detailed cultural knowledge, whereas others – belonging to the same generation – only have rudimentary knowledge. Moreover, the continuous loss of their linguistic and cultural heritage also feeds many insecurities among the speakers and is often grounds for conflicts between speech community members over what is authentic and inauthentic knowledge. Community members often accuse each other of (re-)inventing and transforming the indigenous language and culture. Knowledgeable – often older – speakers are frequently insulted and stigmatised as liars because their knowledge is not commonly shared with other community members. As a consequence, knowledgeable speakers withdraw from participating in the documentation of their languages.

Another problem of community-wide lexicon creation is that both Polynesian communities are still very much anchored in their oral traditions. Despite rapid westernisation and schooling in the last 40 years, the Tuamotuan and Marquesan communities have not really developed a writing tradition. The most knowledgeable speech community members, whom one ideally wants to engage in the creation of an encyclopaedic lexicon, often cannot read or write, not to mention their lack of IT skills. Even those speech community members who are literate are often reluctant to express their knowledge in writing. They mostly prefer to “chat” about their knowledge in informal interviews, and thus many consultants still feel most at ease when their knowledge is simply recorded, thus remaining, to some extent, bound to the oral tradition of knowledge transmission.
Some of the problems discussed in this section are specific to Polynesian communities, but the tensions and insecurities which exist due to the loss of the language and culture can probably be generalised to a number of endangered speech communities. Rau et al. (2009: 194) report that Yami community members were “suspicious and critical of the collaborative efforts” of the research team and showed negative attitudes towards researchers who were not part of the speech community. The greatest fear of Yami community members is the potential abuse of the materials which are put online and the disrespect of intellectual property rights, which meant that the wiki dictionary developed for the Yami dictionary project did not get sufficient support (Rau et al. 2009: 195). Similar fears of abuse have also been expressed by Marquesan and Tuamotuan speech community members.

4.2. Capacity building in the speech community

Web-based lexicon creation would not be possible without a substantial amount of capacity building in the speech community. For a successful online cooperation with LEXUS outside fieldwork periods, the participating speech communities need to be prepared and trained in the basic aspects of lexicography and linguistic concepts, and need to learn to use the linguistic – in particular lexicographic – software used in the project. In the previous documentation project on the Marquesas Islands (cf. §2.1 and §3), members of the speech community had already been trained in the use of linguistic and lexicographic software such as Toolbox and in some basic linguistic concepts. As we used the MDF 4.0 database type for the build-up of our lexical databases, it was essential that the local field assistants understood the structure because the Toolbox databases were imported into LEXUS where the same lexicon structure would be reproduced. Understanding lexicon structures such as the MDF Toolbox database type requires intensive training, repetition of usage and continuous familiarisation with the structure as well as the software. The Marquesan field assistants who had some basic IT skills were trained to use the Toolbox software on a daily basis over the protracted period of several field trips by writing lexical entries together with the fieldworking linguist. Capacity building also played an important role during fieldwork with the Tuamotuan speech community, as all the linguistic software used for our language documentation was new to them at the beginning of the MEL-project. 
Several training sessions with up to six participants were held two to three times a week over the full length of the field trip; basic linguistic concepts which are relevant for Polynesian linguistics such as word classification and polysemy were taught as well as lexicographic aspects of how to write example sentences, monolingual definitions and encyclopaedic articles, integrate subentries/run-ons, make dictionaries by applying Mosel’s (2004b) thematic approach, etc.

4.2.1. The need for user-friendly tools

Toolbox is a complex tool with many editing possibilities. The members of the Marquesan speech community were able to independently compile a bilingual (Marquesan-French) breadfruit mini-dictionary of 34 breadfruit varieties, which involved writing vernacular definitions and encyclopaedic articles on the ethnobotanical uses of the various plant varieties. However, a year later, on the next field trip, the community members had to be (re-)trained because they had not used the software on a daily basis and confused many of the field markers in the MDF database type. Albright and Hatton (2008: 192) also report that speech community members who were trained to work with Toolbox appeared to do well during the training, but a month later – when left on their own to complete lexical entries – they often stopped work. They thought that they had lost all their work when closing a single window, not knowing how to open it again in their project. These kinds of complaints also frequently came from the Marquesan field assistants, who thought that they had lost their lexicon files through a computer virus. Albright and Hatton (2008: 192) point out that lexicon tools such as Toolbox are not even intuitive for trained linguists, who can get confused in particular by file management tasks such as finding, opening, saving and backing up files. According to Albright and Hatton (2008: 189), speech community members who want to get involved in the documentation of their language need a tool with a simplified user interface that removes this kind of barrier. Albright and Hatton (2008) make very valuable points about the need for a simplified user interface for speech community members. Complex lexicon structures can be overwhelming, lexicographic concepts can get confused and consequently speech community members often feel lost when having to edit lexical entries on their own. 
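Confusion over field markers of the kind described above could in principle be caught by a simple automatic check before the data are merged. The following Python sketch is a minimal illustration, not a feature of Toolbox or LEXUS. It assumes backslash-coded fields written one per line and an illustrative subset of MDF markers (`\lx`, `\ps`, `\sn`, `\dv`, `\de`, `\xv`, `\xe`, `\se`, `\dt`); the marker set actually used in the project is not given here, and the function name and sample record are hypothetical.

```python
import re

# Assumed subset of MDF field markers, for illustration only.
ALLOWED = {"lx", "ps", "sn", "dv", "de", "xv", "xe", "se", "dt"}
MARKER = re.compile(r"^\\(\S+)(?:\s+(.*))?$")


def check_records(text):
    """Return (line_number, message) pairs for structural problems."""
    problems = []
    in_record = False
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        match = MARKER.match(line)
        if match is None:
            # Assumes one field per line, so unmarked text is flagged.
            problems.append((number, "text without a field marker"))
            continue
        marker = match.group(1)
        if marker not in ALLOWED:
            problems.append((number, f"unknown marker \\{marker}"))
        elif marker == "lx":
            in_record = True
        elif not in_record:
            problems.append((number, f"\\{marker} before any \\lx"))
    return problems


# Placeholder record with a stray marker and an untagged line.
sample = r"""\lx word1
\ps n
\zz stray marker
untagged text
\de placeholder definition
"""
for line_number, message in check_records(sample):
    print(line_number, message)
```

A report of this kind would only point the fieldworker to suspicious lines; deciding which field the data belong in still requires someone who knows the language and the lexicon structure.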
In our experience of working with Toolbox, speech community members often accidentally inserted field markers, leading to inconsistent lexicon structures – which is a problem for the data import into LEXUS – and entered data into the wrong fields. Another major problem is inconsistent spelling due to the fact that special characters such as macrons cannot simply be typed on the keyboard. This creates an enormous amount of editing for the fieldworker who has to integrate and merge the data produced by the speech community into the lexical databases. In the light of these problems the development of a simple user interface for the speech community had been discussed between the LEXUS developers and the speech community. However, it will be very difficult to develop one simplified user interface for all speech communities because the concept of simplicity is very much dependent on a personal selection of functionality and personal preferences. Each speech community has differing needs and demands and it is therefore difficult to develop a single implementation which takes account of the wishes of particular user groups. As long as there are no simple standards established by broad training during capacity building it will be very difficult to realize a general simplified user interface.

4.2.2. Collaboration within the speech community

Even if LEXUS had a simplified user interface and training was provided, it is unlikely that many interested community members would have all the skills necessary to produce an online encyclopaedia. The differing skills for a successful lexicon creation such as the ability to use software and computers with ease as well as a profound knowledge about the culture and language are unevenly distributed across the two endangered speech communities. Thus, cooperation between community members is essential for such a project: younger community members with good IT skills,17 but a lack of knowledge about the indigenous language and culture, would have to learn to work cooperatively together with older community members who know the language and culture, but cannot operate a computer on their own. Currently there are many efforts in this direction in the Tuamotuan speech community, but the young Tuamotuans contribute on a voluntary basis during their holidays and weekends and progress is therefore slow. Furthermore, even if speech community members with IT skills and linguistic and cultural knowledge work hand in hand, the participants would still need to learn important aspects of lexicographic work as well. There is, however, the danger that the pressure to conform to formal requirements of writing monolingual definitions and encyclopaedic articles in a certain format or style would cause participants to abandon their participation altogether.

17. Young Polynesians tend to be quite IT literate because computer skills have been part of the school curriculum since the existence of IT technology in French Polynesia i.e. since the late 1990s.

4.2.3. Can web-based tools replace fieldwork?

Last but not least we want to discuss whether or not indigenous word meaning and cultural knowledge can be documented in a wiki-like set-up outside fieldwork periods. In general, the field assistants felt that the mutual dialogue with the researcher and other consultants was necessary for a detailed investigation and documentation of the language and culture. In their view, the writing of monolingual definitions has so far been the most difficult task of our documentation project because the different senses of a headword had to be precisely analysed in order to write the definitions. It was not a particular problem for the field assistants that monolingual definitions were a newly emerging speech genre, as they quickly developed a distinctive style (cf. above, §3.3). However, they did find it more difficult to analyse word meanings and express them adequately in their own language. According to them, their language “lacks” adequate terms to describe colour and appearance. They often combined several colour terms and used French loan words (e.g. sokora < Fr. ‘chocolat’) and/or made comparisons to other plants (e.g. ’imu tai sokora mea ā vēave me he veivei ’au toa ‘brown seaweed with long strings like needles of the ironwood tree’). Occasionally it was the process of defining the meaning of a word that made them aware of its distinct senses. For example, when we first defined the term kofā they explained that it is a kind of plant stalk which bears a semantic relationship to four other stalk terms (kata, fā, veivei and kokau). Trying to explain the differences between kofā and the other stalk terms, they gave examples of the most prominent uses of kofā. Leaf stalks of coconut, papaya, banana and some giant fern trees are called kofā. The definition for kofā was then formulated as being e memau kokōa mei no he tumu mai i ’una o te ’au no te tumu ’a’e he mea to īa manamana ‘central rib or stalk of tree leaves which have otherwise no branches’. 
At a later stage they used the term kofā again, but this time referring to the rib of a pinnate leaf of trees with branches. The previous definition of kofā had to be revised. For the field assistants kofā did not seem to denote two different things at first because coconut and fern leaves are fronds which resemble pinnate leaves. However, when they were asked to make one definition – encompassing the characteristics of pinnate leaves and fronds – they came to the conclusion that kofā has two distinct senses: due to their size, coconut and fern fronds form one class with papaya and banana leaf stalks rather than with pinnate leaves and thus the first definition was kept and a second definition was added for ribs of pinnate leaves.

The definition of plant parts, body-parts or any parts of things from the natural environment is a difficult task because speakers of non-Indo-European languages can partition aspects of the natural world (including fauna, flora, landscape, the human body, etc.) differently than speakers of English or French do (e.g. pipi ū ma ‘middle part of lobster between abdomen and head’). Some observable phenomena or objects do not have names at all in the target language (e.g. pūhu’u ‘brown, cloth-like substance around trunk of banana plant’). The documentation of word meaning and indigenous knowledge can only be efficiently established in a face-to-face communication because it requires subtle questioning and explanations which go back and forth between researcher and consultant or field assistant. In a fieldwork situation, one generally has the possibility to pick up on interesting comments or follow up on interesting leads. Important details and distinctions can be made by demonstrating an action, showing the specimen (and its parts) in question in its natural cultural context, etc. The fieldworker can constantly challenge and question the definitions with regard to a word’s usage and ask for clarifications and more precise meanings. In particular the non-native perspective of the fieldworker and his or her lexical knowledge of the field language as a second language learner can be fruitful in analysing and refining word meaning. From a more general perspective, any documentation of indigenous knowledge is a piecemeal process which requires time and patience because many of the cultural activities have not been practised for a long time. Even the most knowledgeable consultants will not remember aspects of their traditional culture instantly. After a session with the fieldworker, the consultant will often come back to add and revise the documentation of cultural processes and practices or vocabulary. 
At a later stage consultants often want to replace the modern term with the “real” – often obsolete – term for things or activities. The documentation of indigenous word meaning and knowledge is a close collaborative effort between the fieldworker and the field assistants and fieldwork cannot be replaced by simply making a web-based tool available to the speech community. For the reasons listed above, the enrichment of the lexicon with linguistic and cultural knowledge is still best achieved during fieldwork periods.

5. Lexicography in documentary linguistics

This section discusses the somewhat controversial role of lexicography in documentary linguistics as well as the contribution that multimedia dictionaries – and in particular a tool such as LEXUS – can make to documentary linguistics and to language maintenance and revitalization. Many documentation projects on endangered languages compile a dictionary or lexical database as a by-product of their documentation project. It is thought of as an additional documentation device which helps to structure and annotate primary data (Himmelmann 2006: 10). In the following subsection we will argue that a lexical database should not just be a documentation aid, but an essential part of a language documentation itself.

5.1. The importance of structured lexical data in a language documentation context

The underlying belief in most language documentation initiatives is that good language documentation should be a “lasting, multipurpose record of a language” which contains a large set of samples of observable linguistic behaviour, “i.e. examples of how people actually communicate with each other” (Himmelmann 2006: 7). Dictionaries and grammars do not provide the full linguistic context on how words and constructions are actually used in everyday communicative events (e.g. how people actually converse, how they verbally interact when making, selling or negotiating something, etc.). If a documentation consists solely of a conventional dictionary plus a grammar, then the communicative practices of the speech community are not reconstructable, and the dictionary and grammar by themselves are of restricted value for researchers and practitioners outside of linguistics, for example educators, speech community members, and researchers from other disciplines (cf. Himmelmann 2006: 18–19).

Despite the generally acknowledged primacy of primary data in documentary linguistics, there are several good reasons to make dictionaries an integral part of a language documentation. One reason for this is that the annotation of words and phrases in multimedia documents (primary data) – as required for language documentation projects in the DoBeS-programme – will normally only document one sense of a word in a specific context. It will neither document the full range of meanings of a word, nor will the annotation
show in which way a word is related to other words of a language in the same way as dictionaries do. In other words: although the primary data documents how a language is actually used, the semantic complexity of words and their semantic networks are neither evident nor easily deducible. In view of this it is clear that directly accessible structured lexical data adds significantly to the practical usability of primary data to speech communities, educators and researchers from scientific disciplines other than linguistics. Furthermore, a language documentation should not only be data-oriented and multifunctional, it should ideally provide a potential basis for developing pedagogical material, thus contributing to language maintenance and revival as well. A number of other field linguists working with endangered speech communities regard dictionaries not just as a mere documentation device and a compilation of structured analysed abstractions of a language, but also as a key activity for language maintenance and revitalization when prepared and presented in an accessible format18 for the speech community (Corris et al. 2000; Kroskrity 2002; Hinton and Weigel 2002: 155; De Korne et al. 2009: 141; Rau et al. 2009; among others); for Hinton and Weigel (2002: 156) a dictionary is “a repository of tribal identity that can be used for a variety of purposes even after the language ceases to be spoken”. In many speech communities the compilation of a dictionary is still regarded as one of the most important products of a language documentation project (Mosel 2006: 68). Through a dictionary the language also acquires a higher status, or the status of a “real” language, which approaches that of the dominant local language(s).

5.2. Why multimedia dictionaries and why LEXUS

For endangered speech communities in particular, multimedia dictionaries are an increasingly popular medium to satisfy both documentation and educational needs (De Korne et al. 2009: 141). “The rich audio-visual-interactive input” is “far better than simple text, tape, or audio” and it attempts to approximate immersion education (De Korne et al. 2009: 143; Kroskrity 2002: 190; Hinton 2001) which is considered to be the best approach to language revitalization (Grenoble and Whaley 2006).

18. Cf. below and Corris et al. (2000) for a discussion of the tension between “documentation dictionaries” and “maintenance dictionaries” or lexical database vs. community dictionary (Mosel 2011: 340).

Apart from contributing to language maintenance and revival (cf. §5.3 for more details), multimedia lexicon tools such as LEXUS have the potential to be excellent resource and research tools for the scientific community. The archive-linking capacity in LEXUS creates enriched multimedia lexica which go beyond conventional dictionary making. Word meanings in the dictionary can be fully contextualised and presented in their socio-cultural contexts by linking corpus-based examples in the dictionary to the respective annotated sessions in the archive. Words in annotated sessions, on the other hand, are embedded in and linked to their dictionary entry which contains the full range of meanings of a word which is not documented in the annotated session as such (cf. above, §5.1). A multimedia lexicon created in LEXUS, with its new relational linking device in ViCoS, can consist of a dense network of lexical and cultural data with various media files which are part of a language archive. This has important implications not only for speech communities, but also for documentary linguistics. Metadata and archive structures used for example in the DoBeS archive can only establish limited connections between sessions, which can leave many cultural connections between sessions of an archive unexplored. As ethnographers such as Franchetto (2006: 183) point out, “ethnographical documentation is a crucial component in any language documentation”, and this “involves designing digital architectures with multiple and multidirectional links between different sessions and qualitatively different kinds of information such as lexica, analytical papers, photos, and so on” (Franchetto 2006: 206). LEXUS and ViCoS are tools which would allow the documentation team to come closer to this ideal. 
The new relational linking device in ViCoS, in particular, opens up the possibility to create a dense network of cultural connections between sessions as well as structured lexical data in a thematically organised way and therefore achieves a better and more user-friendly access to a culture’s network of meanings for both the scientific and the speech community.
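The two-way linking described above (dictionary examples pointing to annotated archive sessions, and words in sessions pointing back to dictionary entries) can be pictured as a pair of indexes. The following Python sketch is purely illustrative and does not reflect how LEXUS or ViCoS store their data; the function name, headwords and session identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def build_links(examples):
    """Build both link directions from (headword, session_id) pairs."""
    entry_to_sessions = defaultdict(set)
    session_to_entries = defaultdict(set)
    for headword, session_id in examples:
        entry_to_sessions[headword].add(session_id)
        session_to_entries[session_id].add(headword)
    return entry_to_sessions, session_to_entries


# Hypothetical corpus-based examples: which headword is attested in which session.
examples = [
    ("word_a", "session_01"),
    ("word_a", "session_02"),
    ("word_b", "session_02"),
]
by_entry, by_session = build_links(examples)
```

In such a structure, two sessions that share a headword become reachable from each other through the dictionary entry, which is one simple way of reading the idea of connections between sessions that metadata alone would not reveal.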

5.3. Multimedia lexica as a tool for language maintenance and revitalization

The creation of dictionaries of endangered languages is often regarded as a major activity in language maintenance and revitalization (Corris et al. 2000: 1). Dictionaries – in particular bilingual dictionaries – are important reference documents for endangered speech communities in revitalization
classes: they are important for developing vocabulary, as a spelling reference and for the language teachers themselves because they are often not fluent speakers of the endangered language which is being revitalized (Hinton and Weigel 2002: 155–156). Multimedia dictionaries are undoubtedly even more useful tools for revitalization than conventional dictionaries due to their multimedia extensions, provided that they are not overloaded with illustrations and graphics (Frawley, Hill, and Munro 2002: 10–12). Although multimedia tools are no substitute for a real interaction with a native speaker, they “can provide an approximation of the model for speaking that is produced by native speakers in a face-to-face interaction” and thus are important tools which support language teaching and language learning at the same time (Kroskrity 2002: 190).

Many projects which report on the use of multimedia tools in language revitalization (Hinton 2001; Manning, Jansz, and Indurkhya 2001; Kroskrity 2002; Canger 2002; De Korne et al. 2009; Rau et al. 2009; among others) emphasize their beneficial use. However, little is said about the usability and type of the multimedia data which is integrated into the various tools which are available at present. The multimedia documents we have integrated into our multimedia lexica are based on recordings of spoken language which commonly contain speech phenomena such as hesitations, repetitions, speech errors, break-offs, etc. The transcription and translation of these recordings have at times posed a challenge. Speech community members have often commented that these recordings are a good repertoire of their language and culture, but that they need to be edited before they can be published or used as pedagogical material in schools. Although these kinds of multimedia documents are a reflection of how the language is actually used, it is questionable whether they meet language revitalization objectives effectively. 
A learner with little knowledge of the language will not learn the language simply by listening to recordings in an immersive way. At present, it needs to be further investigated how useful unedited multimedia annotated documents of naturally spoken language are in revitalizing endangered languages. However, the learning needs of language learners in endangered speech communities are different from those of ordinary second language learners because they mostly have experience in some form with the indigenous language they want to learn (e.g. passive knowledge by listening to their grandparents, etc.). Although a tool like LEXUS has the potential of being used for language maintenance and revitalization, the creation of our multimedia lexica
in LEXUS resembles more a kind of ‘documentation dictionary’ (Corris et al. 2000: 1), ‘ethnographic dictionary’ (Pawley 2001: 237) or ‘lexical database’ (Mosel 2011: 349), but not a ‘maintenance (or learner) dictionary’ (Corris et al. 2000: 1). According to Corris et al. (2000: 7) documentation dictionaries contain as much information as possible about a lexeme, so lexical entries can be long and detailed and overcrowded with information (e.g. conventions, symbols, abbreviations) which is often confusing and intimidating for speech community members who are not familiar with dictionaries. Therefore documentation dictionaries are less likely to be useful as a tool for revitalization. Corris et al. report further that speech community members often prefer simple word lists, inflected verb forms over citation forms as headwords, short definitions no longer than two lines and no subentries (2000: 8). However, as most dictionaries are created with computational tools it is fairly easy to transform a documentation dictionary into a maintenance dictionary for the speech community (cf. Mosel’s approach of transforming a documentation dictionary into thematic mini-dictionaries for the speech community (2011: 348–349)).
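The kind of transformation mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by Mosel or in the MEL-project: the entry layout, the field names, the domain tags and the `max_chars` limit (standing in for “no longer than two lines”) are all hypothetical, and the entries are placeholders.

```python
def to_learner_entry(entry, max_chars=160):
    """Keep the headword and one short definition; drop subentries and other detail."""
    definition = entry["senses"][0]["definition"]
    if len(definition) > max_chars:
        # Cut at the last word boundary before the limit.
        definition = definition[:max_chars].rsplit(" ", 1)[0] + " ..."
    return {"headword": entry["headword"], "definition": definition}


def thematic_mini_dictionary(entries, domain):
    """Select entries tagged with one domain and simplify them, sorted by headword."""
    chosen = [e for e in entries if domain in e.get("domains", [])]
    return sorted((to_learner_entry(e) for e in chosen), key=lambda e: e["headword"])


# Placeholder documentation entries; content is illustrative only.
entries = [
    {"headword": "word_b", "domains": ["plants"],
     "senses": [{"definition": "short placeholder definition"}],
     "subentries": ["word_b compound"]},
    {"headword": "word_a", "domains": ["plants", "food"],
     "senses": [{"definition": "first sense"}, {"definition": "second sense"}]},
    {"headword": "word_c", "domains": ["fish"],
     "senses": [{"definition": "another placeholder"}]},
]
mini = thematic_mini_dictionary(entries, "plants")
```

Keeping only the first sense is a deliberate simplification for the sketch; which senses and fields a community actually wants in its dictionary is a decision to be taken with the speech community.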

6. Conclusion

Multimedia lexicon tools such as LEXUS are excellent resource and research tools for the scientific as well as endangered speech communities interested in the lexical, cultural and ethnographic documentation of endangered and underdescribed languages. The LEXUS tool, with its new feature of knowledge representation in ViCoS, can create a dense network of lexical as well as cultural data in ways which are meaningful to different kinds of user groups, allowing a multilayered organisation of lexical and cultural knowledge. For members of the speech community the creation of knowledge spaces in ViCoS can be a more natural entry point into a lexicon than a conventional paper dictionary because data of the lexical database in LEXUS can be selected according to the needs of the users, giving it the potential to become an important tool in language maintenance and revitalization. Apart from the possibility of individually organising the documented knowledge, the contextualisation of word meaning via archive linking is a major new approach in lexicography. For documentary linguistics, multimedia lexicon creation, as envisaged in our MEL-project with LEXUS, is in fact a new form of language archiving
within the structural frame of a lexicon, as it combines data from a language documentation archive with a structured lexical database. It also goes beyond language archiving with metadata as practised so far, as well as beyond conventional dictionary making, as our multimedia lexicon will consist of a dense network of lexical entries with all sorts of media files, thus presenting the meaning of words and cultural connections between archive sessions in a new way. Although the web-based possibilities of LEXUS can actively involve people from the speech community in the creation of their lexicon, community-wide online participation in a wiki-like framework is difficult to achieve in highly endangered speech communities and can create many irresolvable conflicts. Even if speech community members are trained adequately in the use of the relevant linguistic software and in the basics of lexicography, and a quality-controlled system of collaborative workspaces and a simplified user interface are implemented, it is still unlikely that online collaboration can really replace intensive fieldwork to document endangered languages and cultures. A more appropriate and informal way of online participation by endangered speech communities is the use of whiteboarding tools, which would resolve many of the problems of online participation discussed above.

Abbreviations

1, 2, 3   first, second, third person
ART       article
CAUS      causative
LD        locational-directional preposition
PASS      passive
PL        plural
POSS      possessive
RED       reduplication
SG        singular
TAM       tense, aspect, modus

References

Albright, Eric, and John Hatton. 2008. WeSay: A tool for engaging native speakers in dictionary building. In Documenting and Revitalizing Austronesian Languages, eds. D. Victoria Rau and Margaret Florey, Language Documentation & Conservation, Special Publication No. 1, 189–201. Honolulu: University of Hawai’i. http://hdl.handle.net/10125/1368.

Gabriele Cablitz

Atkins, Beryl T. Sue, and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press.

Bulmer, Ralph N. H. 1970. Which came first, the chicken or the egg-head? In Échanges et Communications: Mélanges Offerts à Claude Lévi-Strauss à l’Occasion de Son 60ième Anniversaire, eds. Jean Pouillon and Pierre Maranda, 1069–1091. The Hague: Mouton.

Cablitz, Gabriele. 2006. Marquesan - A Grammar of Space. Berlin, New York: Mouton de Gruyter.

Cablitz, Gabriele. 2010. A field report on a language documentation project on the Marquesas Islands in French Polynesia. In Endangered Austronesian, Papuan and Australian Aboriginal Languages: Essays on Language Documentation, Archiving and Revitalization, ed. Gunter Senft, 31–47. Canberra: Pacific Linguistics.

Cablitz, Gabriele, Fasan Chong, and Edgar Tetahiotupa. 2009. The documentation of endangered linguistic, lexical and cultural knowledge of the Marquesan and Tuamotuan languages of French Polynesia. In Proceedings of the 11th Pacific Science Inter-Congress and 2nd Symposium on French Research in the Pacific. Tahiti: Pacific Science Association. http://intellagence.eu.com/psi2009/output_directory/cd1/Data/articles/000323.pdf.

Cablitz, Gaby, Jacquelijn Ringersma, and Marc Kemps-Snijders. 2007. Visualizing endangered indigenous languages of French Polynesia with LEXUS. In 11th International Conference Information Visualization (IV ’07), 409–414. IEEE Computer Society.

Canger, Una. 2002. An interactive dictionary and text corpus for sixteenth- and seventeenth-century Nahuatl. In Making Dictionaries – Preserving Indigenous Languages of the Americas, eds. William Frawley, Kenneth C. Hill, and Pamela Munro, 195–218. Berkeley, CA: University of California Press.

Conklin, Harold C. 1962. Lexicographic treatment of folk taxonomies. In Problems in Lexicography, eds. Fred W. Householder and Sol Saporta, 41–59. Bloomington: Indiana University Research Center in Anthropology, Folklore, and Linguistics.

Corris, Miriam, Christopher Manning, Susan Poetsch, and Jane Simpson. 2000. Dictionaries and endangered languages. http://nlp.stanford.edu/pubs/eldic.ps.

Coward, David F., and Charles E. Grimes. 2000. Making dictionaries. A guide to lexicography and the multi-dictionary formatter. http://www.sil.org/computing/shoebox/MDF_2000.pdf.

Crowley, Terry. 2001. The indigenous linguistic response to missionary authority in the Pacific. Australian Journal of Linguistics 21(2):239–260.

De Korne, Haley, and The Burt Lake Band of Ottawa and Chippewa Indians. 2009. The pedagogical potential of multimedia dictionaries – lessons from a community dictionary project. In Indigenous Language Revitalization: Encouragement, Guidance and Lessons Learned, eds. Jon Reyhner and Louise Lockard, 141–153. Flagstaff, AZ: Northern Arizona University.

Elbert, Samuel H. 1982. Lexical diffusion in Polynesia and the Marquesan-Hawaiian relationship. Journal of the Polynesian Society 91:499–517.

Franchetto, Bruna. 2006. Ethnography in language documentation. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 183–208. Berlin, New York: Mouton de Gruyter.

Frawley, William, Kenneth C. Hill, and Pamela Munro. 2002. Making a dictionary. Ten issues. In Making Dictionaries – Preserving Indigenous Languages of the Americas, eds. William Frawley, Kenneth C. Hill, and Pamela Munro, 1–22. Berkeley, CA: University of California Press.

Grenoble, Lenore A., and Lindsay J. Whaley. 2006. Saving Languages: An Introduction to Language Revitalization. Cambridge: Cambridge University Press.

Haiman, John. 1980. Dictionaries and encyclopedias. Lingua 50:329–357.

Haviland, John. 2006. Documenting lexical knowledge. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 129–162. Berlin, New York: Mouton de Gruyter.

Herbst, Thomas, and Michael Klotz. 2003. Lexikografie. Paderborn: Schöningh UTB.

Himmelmann, Nikolaus P. 2006. Language documentation: What is it, and what is it good for. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 1–30. Berlin, New York: Mouton de Gruyter.

Hinton, Leanne. 2001. Language revitalization: An overview. In The Green Book of Language Revitalization in Practice, eds. Leanne Hinton and Ken L. Hale, 3–18. San Diego, CA: Academic Press.

Hinton, Leanne, and William F. Weigel. 2002. A dictionary for whom? Tensions between academic and nonacademic functions of bilingual dictionaries. In Making Dictionaries – Preserving Indigenous Languages of the Americas, eds. William Frawley, Kenneth C. Hill, and Pamela Munro, 155–170. Berkeley, CA: University of California Press.

Kroskrity, Paul V. 2002. Language renewal and the technologies of literacy and postliteracy. In Making Dictionaries – Preserving Indigenous Languages of the Americas, eds. William Frawley, Kenneth C. Hill, and Pamela Munro, 171–192. Berkeley, CA: University of California Press.

Manning, Christopher, Kevin Jansz, and Nitin Indurkhya. 2001. Kirrkirr: Software for browsing and visual exploration of a structured Warlpiri dictionary. Literary and Linguistic Computing 16(1):123–139.

McElvenny, James. 2008. Review of Kirrkirr. Language Documentation & Conservation 2(1):160–165. http://hdl.handle.net/10125/1796.

Mosel, Ulrike. 2004a. Complex predicates and juxtapositional constructions in Samoan. In Complex Predicates in Oceanic Languages, eds. Isabelle Bril and Françoise Ozanne-Rivierre, 263–296. Berlin, New York: Mouton de Gruyter.

Mosel, Ulrike. 2004b. Dictionary making in endangered speech communities. In Language Documentation and Description, Volume 2, ed. Peter K. Austin, 39–54. London: School of Oriental and African Studies.

Mosel, Ulrike. 2004c. Inventing communicative events: Conflicts arising from the aims of language documentation. Language Archives Newsletter 1(3):3–4.

Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.

Mosel, Ulrike. 2011. Lexicography in endangered language communities. In The Cambridge Handbook of Endangered Languages, eds. Peter K. Austin and Julia Sallabank, 337–353. Cambridge: Cambridge University Press.

Mühlhäusler, Peter. 1990. ‘Reducing’ Pacific languages to writing. In Ideologies of Language, eds. John E. Joseph and Talbot J. Taylor, 189–205. London: Routledge.

Mühlhäusler, Peter. 1996. Linguistic Ecology. Language Change and Linguistic Imperialism in the Pacific Region. London, New York: Routledge.

Pawley, Andrew. 1993. A language which defies description by ordinary means. In The Role of Theory in Language Description, ed. William A. Foley, 87–129. Berlin, New York: Mouton de Gruyter.

Pawley, Andrew. 2001. Some problems of describing linguistic and ecological knowledge. In On Biocultural Diversity: Linking Language, Knowledge, and the Environment, ed. Luisa Maffi, 228–247. Washington, DC: Smithsonian Institution Press.

Rau, D. Victoria, Meng-Chien Yang, Hui-Huan Ann Chang, and Maa-Neu Dong. 2009. Online dictionary and ontology building for Austronesian languages in Taiwan. Language Documentation & Conservation 3(2):192–212. http://hdl.handle.net/10125/4439.

Ringersma, Jacquelijn, and Konrad Rybka. 2009. Lexus manual. Published first version. http://www.mpi.nl/corpus/manuals/manual-lexus.pdf, downloaded June 2009.

Svensén, Bo, ed. 1993. Practical Lexicography: Principles and Methods of Dictionary-making. Oxford: Oxford University Press.

Yang, Meng-Chien, Hsin-Ta Chou, Hui-Shiuan Guo, and Gia-Ping Chen. 2008. On designing the Formosan multimedia word dictionaries by a participatory process. In Documenting and Revitalizing Austronesian Languages, eds. D. Victoria Rau and Margaret Florey, Language Documentation & Conservation, Special Publication No. 1, 111–133. Honolulu: University of Hawai’i. http://hdl.handle.net/10125/1359.

Zinn, Claus. 2008. Conceptual spaces in ViCoS. In The Semantic Web: Research and Applications, eds. Sean Bechhofer, Manfred Hauswirth, Jörg Hoffmann, and Manolis Koubarakis, 890–894. Berlin: Springer.

Zinn, Claus, Gabriele Cablitz, Jacquelijn Ringersma, Marc Kemps-Snijders, and Peter Wittenburg. 2008. Constructing knowledge spaces from linguistic resources. Presentation at the 18th International Congress of Linguistics, Workshop 12 on Linguistic Studies of Ontology: From Lexical Semantics to Formal Ontologies and Back, July 21–26, 2008, Seoul, Republic of Korea.

Chapter 11
What does it take to make an ethnographic dictionary? On the treatment of fish and tree names in dictionaries of Oceanic languages∗

Andrew Pawley

1. Introduction

Some lexicographers hold that it is impossible to make a good first general dictionary of any language. In his highly regarded textbook Dictionaries: The Art and Craft of Lexicography, Sidney Landau writes that

A really new dictionary would be a dreadful piece of work, missing innumerable basic words and senses, replete with absurdities and unspeakable errors, studded with biases and interlarded with irrelevant provincialisms. Noah Webster’s American Dictionary of the English Language of 1828, though far from being entirely new, was new enough to subscribe to many of these defects... Fortunately, very few dictionaries are really new, and none of the general, staff-written, commercial dictionaries published by major dictionary houses are. (Landau 1984: 35–36)

Landau was clearly thinking of dictionaries of metropolitan languages, where the lexicographers are native speakers of the target language and have the advantage of access to a large written corpus, including materials by specialists in scores of different fields. No doubt he would be even more pessimistic about the chances of making good first dictionaries of languages spoken by what I will refer to as ‘traditional’ societies (roughly, those with subsistence economies). Such languages usually have no written tradition or only a limited one. The dictionaries are generally written by one or a few scholars who are not native speakers of the target language, assisted by native-speaking consultants who have no prior knowledge of lexicography. Sometimes the compilers themselves have no previous experience of making a dictionary. For the many languages of traditional societies that are in danger of disappearing the first dictionary is likely also to be the last.

This paper identifies some of the challenges faced in making a first general dictionary of a language of a traditional society and considers how to minimize the kinds of defects mentioned by Landau. I will begin with some very general questions:

1. What kind of dictionary is to be the goal? In particular, what kinds of semantic information are to be given about headwords? Is there a principled basis for drawing a line between definitions and cultural or encyclopaedic information?

2. What are the best ways of discovering the words of a language? Are major gaps in the lexical record inevitable?

In discussing these matters I will draw mainly on examples from just one major domain of the lexicon, animal and plant names, and their treatment in a sample of some 30 bilingual dictionaries of languages belonging to the Oceanic branch of Austronesian.¹ I choose to focus on this domain for a number of reasons.

(i) Animals and plants are important, economically and socially, in the lives of Oceanic communities and names for animal and plant taxa make up a significant proportion of the lexicon. World-wide comparative studies show that farming communities living in environments with a rich diversity of plant species generally distinguish between 500 and 1500 plant taxa (Brown 1985).² I am not aware of a comparable world-wide survey concerning animal names but it is probable that most coastal communities on high islands in the Pacific tropical zone distinguish between 600 and 1000 animal taxa, with fish names being the largest component.³

(ii) Terminologies for fauna and flora highlight important issues in semantic description. They are typically associated with complex cultural knowledge and practices and are organised into quite complex taxonomies that are not isomorphic with those of the biological sciences or with the defining vernacular language. The task of describing this knowledge underlines the ethnographic and interdisciplinary nature of dictionary-making.

(iii) In many Oceanic societies traditional knowledge of plants and animals is diminishing as these societies are drawn into the modern industrial world.

∗ It is a pleasure to contribute to a volume in honour of Ulrike Mosel, whose contributions to descriptive and documentary linguistics I greatly admire and with whom I have had many stimulating discussions about dictionary-making. Thanks are due to Geoffrey Haig, Frank Lichtenberk and Claudia Wegener for their valuable comments on a draft of this paper and to Claudia for her eagle-eyed copy-editing.

¹ Oceanic contains more than 400 languages of Melanesia together with the languages of the Polynesian Triangle and most of the languages of Micronesia. Most Oceanic languages have fewer than 10,000 speakers. Their speech communities were, traditionally, subsistence farmers and, in many cases, also fishers.

² The largest reliably recorded inventory of vernacular plant names from one traditional community appears to be about 1,800–2,000, for Hanunóo, of Mindoro, Philippines, who speak an Austronesian language (Conklin 1954). Puku’i and Elbert (1971: ix) mention that Mary Neale and Edward Handy list over 2,300 plant names for Hawaiian but a number “are dubious”. Henderson and Hancock (1988) list more than 800 plant names for Kwara’ae, of Malaita. Fox’s (1978) dictionary of Arosi (of Makira), gives about 770 names for plants, of which 194 denote varieties of cultivated plants. For Cèmuhî, a language of northern New Caledonia (a region with a relatively small flora), Rivierre (1994) lists 557 taxa, of which 178 represent cultivated plants.

2. What kind of general dictionary

A pivotal decision concerns what kind of general dictionary is to be the goal. I refer specifically to the nature of what may be called the ‘explicatory’ parts of an entry, those that give semantic and pragmatic information about the headword. These parts consist chiefly of (i) the definition or gloss, (ii) additional information about cultural context and reference, (iii) information about semantic relations to other lexical units (synonyms, antonyms, superordinates, hyponyms, etc.) and (iv) illustrative examples showing the lexical unit in typical contexts of use.

In a typology of general dictionaries based on the semantic information they provide one can distinguish between two idealized extremes: dictionaries designed as translation aids and dictionaries that we might call ethnographic (or descriptive). Translation aid dictionaries give headwords in L1 and provide words or phrases in L2 that are translation equivalents for these. Ethnographic dictionaries provide explications intended to reflect native speakers’ (sometimes variable) understandings and use of lexical units. They seek to give definitions that define complex lexical concepts in terms of simpler ones, using vernacular vocabulary, and also provide additional information of types (ii)–(iv) above. (I am not so naïve as to think that any dictionary can fully achieve these objectives; I am speaking of an ideal.)

Most bilingual dictionaries are primarily intended to be translation aids. The best monolingual dictionaries, by contrast, are closer to the ethnographic type. Dictionaries of languages of traditional societies are usually bilingual, with a main part containing headwords in the target language (L1) and glosses in a major language (L2), in combination with a reverse finder list that allows the user to look up words in L2 and find relevant entries in L1. The glosses mainly consist of words or phrases intended to be approximate translation equivalents. Analytic definitions are given only where no translation equivalent is available.

While conceding the importance of providing translation equivalents where possible, I believe that scholars compiling first bilingual dictionaries of languages of traditional societies should aim at rich semantic descriptions, providing analytic definitions wherever these will give a more precise and usefully informative account of the meaning of the lexical unit than approximate translation equivalents, and also including supplementary information of types (ii)–(iv) where this serves the same purpose. See Cablitz (this volume, Section 3.2) for discussion of the question of to what extent it is appropriate to record cultural information in a dictionary.

By giving rich semantic descriptions an ethnographic dictionary on the one hand provides linguistic and cultural information likely to be valued by members of the speech community and on the other hand stands as a reference work for scientific purposes. However, the creation of a good ethnographic dictionary presents huge challenges. Some of these challenges will be considered in the sections that follow.

³ Speakers of Wayan (a dialect of Western Fijian spoken on Waya Island, Yasawa group, Fiji) distinguish about 800 names for kinds of animals. These include about 450 fish taxa and 230 names for marine invertebrates. Land invertebrates are of little economic importance on Waya but upwards of 70 taxa are named. As one moves eastwards across the central Pacific the number of land bird, mammal and reptile species drops off sharply and Wayan terminologies for indigenous birds (about 35 names) and reptiles (about 20) are small.

3. What are the best ways of discovering the words of a language

A lexicographer making a first dictionary has to discover the words of the language and their meanings (the basic lexical unit being the sense unit, because each sense requires its own definition). What are the best ways of going about this task, so as to avoid “missing innumerable basic words” and making “unspeakable errors” in definitions or glosses?

A corpus of natural connected discourse is a valuable tool in the search for words and senses. But for an Oceanic language the lexicographer(s) will generally have available a very modest-sized corpus. In many documentation projects the number of words in the annotated corpus is less than 30,000. It is unrealistic to imagine that one can obtain anything like a near exhaustive list of lexical units from such a small sample. Most words and senses are low frequency items and even a ten million word corpus made up of diverse speech genres is likely to contain only a small proportion of such items. For example, many names for kinds of animals and plants will only be obtainable by elicitation.

An effective elicitation strategy is to pursue words by semantic fields. A lexicon can be viewed as falling into many different domains of meaning: kinship, body parts, house parts, canoe parts, bodily conditions and health, emotions, landscape, spatial relations, weather, seasons, months, numbers, colours, temperature, texture, sounds, speaking, cutting, breaking, locomotion, transport and so on. Terms belonging to the same semantic field display more or less systematic semantic relations, i.e. they form a structured terminology. This fact allows a structured approach to the search for words and the formulation of definitions, one that has been pursued fruitfully by Ulrike Mosel (Mosel 2011).

Using a dictionary or wordlist from another language to elicit words in the target language (either for a particular semantic field or more generally) can yield quick and copious returns. I refer to the practice of asking bilingual informants to provide translation equivalents in the target language. However, my experience is that this method will lead to lots of errors unless it is combined with a good knowledge of the target language, cross-checking with a range of informants, and observation or eliciting of contextual use.
Let me give an example: In 1967 I undertook a first spell of 11 weeks of fieldwork on Waya Island, in the Yasawa group, Fiji, and this yielded an ‘instant’ dictionary of some 5000 roots together with thousands of derived words and compounds. These were obtained largely by eliciting equivalents of Bauan (Standard Fijian) items, using the Bauan dictionary, and proceeding alphabetically. The process can be compared to using a dictionary of Dutch to elicit words in English from English-Dutch bilinguals. Some years later, after I had resumed intensive fieldwork on Waya and become fluent in the language, I realized that the instant dictionary was riddled with errors – cases where Bauan words had a different meaning or range of senses than their cognate in Wayan, or where I had included Bauan words not current or at best rarely used in Wayan, but most often, cases of wrong definitions resulting from misunderstandings between linguist and informant. I then spent part of the next several field seasons checking and correcting the errors. It is hard to say whether, on balance, using the Bauan elicitation list wasted more time than it saved.

What I learned the hard way is that there is no substitute for being able to speak the language. Once you are reasonably fluent you can more easily talk to a wide range of people in the community, argue the point, eavesdrop, ask complicated questions and elicit suitable illustrative examples, and you are much better placed to critically evaluate the information that comes in and so to reduce errors.

While fluency in the field language is the ideal, it may be difficult to achieve in the space of a three-year project with limited in situ field work. In some contexts it is possible for the linguist to partially compensate for lack of fluency in the vernacular by close collaboration with bilingual native speakers. (See Cablitz (this volume) and Jung and Himmelmann (this volume) for discussion of the crucial role of native-speaking consultants in linguistic data-gathering and analysis.)

When eliciting terms for certain fields, such as plants and animals, lexicographers need help, in the form of specialist researchers and/or works of reference. In recent decades some excellent handbooks for the flora and fauna of certain regions of Oceania have appeared. However, there are traps in using illustrations in handbooks to elicit terms. In the case of animals and plants, for example, informants are unable to judge size from illustrations, and pictures alone usually do not provide information on behaviour and habitat. In some cases even very expert informants looking at illustrations make wrong equations. Checking and rechecking with a variety of knowledgeable people and, if possible, with actual specimens and a variety of illustrations, will reduce the error rate.
But in the case of terms for fauna and flora, my experience is that unless the lexicographer enlists the aid of specialists, many terms will be missed and many others will be misidentified or will remain unidentified.⁴

4. Finding a method to assess coverage of fish and tree names in Oceanic dictionaries

Most of the Oceanic dictionaries cited in this paper are the first and only substantial dictionaries of these languages. Contrary to Landau, it is not my impression that they “are missing innumerable basic words and senses”, if “basic” refers to high frequency items. However, it is certainly the case that many of the dictionaries are missing many words of middle to low frequency. This becomes obvious if one looks closely at the treatment of particular semantic fields. I turn now to a brief survey of the coverage of fish and tree names in about 30 Oceanic dictionaries. I will discuss two methods for determining whether coverage in a particular dictionary is close to exhaustive, and will touch briefly on a third.

⁴ For a fuller account of methodological issues in ethnobiology see Bulmer (1992).

4.1. Fish names

Comparative evidence, drawing on the most thorough collections of terms for those Oceanic fishing communities where the widest range of fish species and genera are present, indicates that such communities generally distinguish between 300 and 500 fish taxa.⁵ The number tends to fall as one moves further east in the Pacific, where the number of fish species and genera diminishes.

⁵ Sources (dictionaries and other works) for contemporary languages that figure in the discussion are listed below. Works that are not general dictionaries are marked with a star before the name of the author(s); these are mainly survey reports focusing on names of marine fauna.

New Guinea and Bismarck Archipelago: Kiriwina (Trobriand Islands, PNG): Lawton 1998; Kuanua (New Britain): Lanyon-Orgill 1960; Titan (Admiralty group): *Akimichi and Sakiyama 1991.

Western Solomons: Cheke Holo: White 1988; Marovo: Hviding 1990, *Hviding 2005; Roviana: Waterhouse 1949; Takū (Polynesian Outlier, north of Bougainville): Moyle in press; Teop: Shoffner 1976.

Eastern Solomons: Arosi: Fox 1978; Gela: Fox 1955; *Foale 1998; Lau: Fox 1974; Owa: Mellow 2009; Toqabaqita: *Henderson and Hancock 1988; Lichtenberk 2008.

Vanuatu and Tikopia: Paamese: Crowley 1992; Kwamera: Lindstrom 1986; Tikopia (Te Motu Province, Solomon Islands): Firth 1985; Lenakel: Lynch 1977.

New Caledonia: Cèmuhî: Rivierre 1994; Nyelâyu: Ozanne-Rivierre 1998; Paicî: Rivierre 1983; Xârâcùù: Moyse-Faurie and Néchéro-Jorédie 1989.

Fiji: Bauan (Standard Fijian): Capell 1941; Wayan (dialect of Western Fijian): Pawley and Sayaba 2003; Rotuman: Churchward 1940, Inia et al. 1998.

Micronesia: Carolinean (Saipan, Marianas): Jackson and Marck 1991; Kapingamarangi (Polynesian Outlier, central Carolines): Lieber and Dikepa 1974; Marshallese: Abo et al. 1976; Palauan: Helfman and Randall 1973, Johannes 1981; Ponapean: Rehg and Sohl 1979; Puluwat: Elbert 1972; Satawalese: *Akimichi 1980; Kiribati: Thaman and Tebano, n.d.

Polynesian Triangle: Marquesan: *Lavondès 1977; Niuatoputapu: *Dye 1983; Niuean: Sperlich 1997; Rarotongan: Buse and Taringa 1996; Tongan: Churchward 1959; Uvean: *Rensch 1983.

The crudest method of assessing how thorough the coverage of fish names is in a dictionary of a fishing community’s language is to compare the dictionary’s tally with tallies obtained for languages where a thorough survey of fish names has been carried out. (Of course allowance must be made for regional variations in the diversity of species.) Table 1 compares data for nine languages for which fairly thorough surveys of fish names have been done: five are from western Melanesia and western Micronesia and four from Fiji and Polynesia. Here and in later tables a star before a language name indicates that the source is a study that focuses on fish names, not a general dictionary.

Table 1. Total recorded fish names in some Pacific Islands languages

Western Melanesia and western Micronesia
*Satawal.   *Gela   *Marovo   *Palauan   *Titan
400+        368     350+      336        289

Fiji and Polynesia
Wayan   *Uvean   *Marquesan   *Niuatoputapu
458     284      260          210

Table 2 gives the tallies obtained by going through a sample of substantial Oceanic dictionaries. In some cases their finder lists treat ‘sharks’, ‘rays’ and ‘eels’ separately from ‘fish’, i.e. by ‘fish’ they mean only ‘typical fish’. In such cases I have combined the lists so as to include ‘sharks’, etc. under ‘fish’.

Table 2. Total fish taxa in some substantial Oceanic dictionaries

Western Melanesia
Kuanua   Arosi   Owa   Lau   Toqabaqita   Gela   Cheke Holo   Roviana
301      214     203   149   143          136    133          80

Fiji, New Caledonia and Vanuatu
Bauan   Cèmuhî   Paicî   Nyelâyu   Xârâcùù   Lenakel   Kwamera
200     163      147     123       121       41        15

Micronesia
Marshallese   Ponapean   Carolinean   Kiribati
231           157        154          151

Polynesia, including the Outliers Takū and Kapingamarangi
Takū   Kapinga.   Tikopia   Niuean   Tongan   Rarotongan
280    262        220       163      158      124
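The tally comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a procedure taken from the chapter: the function names are hypothetical, the survey figures are the western Melanesia and western Micronesia tallies from Table 1 (with "400+" and "350+" entered as their lower bounds), and the dictionary figures are the Western Melanesia row of Table 2. Using the lowest and highest survey tallies as a benchmark range is an assumption made for the example.

```python
# Survey-based tallies from Table 1 (western Melanesia and western Micronesia).
# "400+" and "350+" are entered as lower bounds; this is a simplifying assumption.
SURVEY_TALLIES = {
    "Satawalese": 400,
    "Gela": 368,
    "Marovo": 350,
    "Palauan": 336,
    "Titan": 289,
}

# Dictionary tallies from Table 2 (Western Melanesia).
DICTIONARY_TALLIES = {
    "Kuanua": 301,
    "Arosi": 214,
    "Owa": 203,
    "Lau": 149,
    "Toqabaqita": 143,
    "Gela": 136,
    "Cheke Holo": 133,
    "Roviana": 80,
}


def shortfall_range(dictionary_tally, survey_tallies):
    """Return (low, high): how many names a dictionary falls short of the
    smallest and the largest survey tally, floored at zero."""
    low = max(0, min(survey_tallies) - dictionary_tally)
    high = max(0, max(survey_tallies) - dictionary_tally)
    return low, high


def coverage_report(dictionary_tallies, survey_tallies):
    """Build one (language, tally, low, high) row per dictionary."""
    benchmarks = list(survey_tallies.values())
    return [
        (language, tally, *shortfall_range(tally, benchmarks))
        for language, tally in dictionary_tallies.items()
    ]


if __name__ == "__main__":
    for language, tally, low, high in coverage_report(DICTIONARY_TALLIES, SURVEY_TALLIES):
        print(f"{language:12s} {tally:4d} names, shortfall roughly {low} to {high}")
```

Where a survey exists for the same language, as with Gela, the direct difference between the survey tally and the dictionary tally is a better guide than the regional range.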

It can be seen that the totals vary greatly. Comparison with Table 1 suggests that most of these inventories are missing between 100 and 250 of the taxa distinguished by the speech community. This is demonstrably the case for Gela: Fox’s (1955) very substantial dictionary of Gela contains only 136 fish names, more than 200 fewer than the 368 reported in the survey of Gela fishing done by a marine biologist (Foale 1998). Similarly, the large Bauan (Standard Fijian) dictionary contains about 200 fish names, less than half the number recorded for Wayan, another Fijian language. For fishing communities in Melanesia and western Micronesia, where the fish fauna are relatively rich, lists numbering below 300 are probably far from complete. In the case of communities of the Polynesian Triangle and eastern Micronesia, we can expect somewhat lower totals.

A second method for detecting gaps is to compare the breakdown of names for taxa into uninomials and binomials. Uninomials typically apply to taxa belonging to the level of folk generic, binomials to folk specifics (see Section 6). In the best documented Oceanic fish taxonomies uninomials usually amount to between 70 and 80 percent of total taxa, and binomials to between 20 and 30 percent. Percentages that deviate markedly from this range stand out as suspiciously anomalous. Table 3 gives the percentages for 14 Oceanic languages recorded in dictionaries (unmarked) and surveys of fish names (marked *).

Table 3. Uninomial and binomial fish names in some Oceanic languages

              Wayan   *Satawal.   *Titan   *Gela   *Marovo
uninomials    352     278         279      252     194
binomials     106     122         8        100     122
% binomials   23      30          3        28      38

              Uvean   *Teop   *Kapinga.   Tikopia   Niuatoputapu
uninomials    180     168     148         215       138
binomials     104     29      114         5         71
% binomials   36      15      43          2         34

              Cèmuhî   Kiribati   Cheke Holo   Xârâcùù
uninomials    132      132        123          116
binomials     31       19         10           5
% binomials   24       14         7            4
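The second method lends itself to a small calculation. The Python sketch below takes the uninomial and binomial counts for five languages from Table 3, computes the share of binomials, and flags shares that fall well outside the 20 to 30 percent range mentioned above. The five-point tolerance used to decide what counts as "markedly" outside the range, the function names, and the back-of-envelope gap estimate (which assumes the recorded uninomials are complete) are all assumptions made for illustration, not the author's own procedure.

```python
# (uninomials, binomials) for a few languages, copied from Table 3.
FISH_NAME_COUNTS = {
    "Wayan": (352, 106),
    "Satawalese": (278, 122),
    "Titan": (279, 8),
    "Tikopia": (215, 5),
    "Kapingamarangi": (148, 114),
}

# Expected share of binomials in well-documented fish taxonomies (from the text).
EXPECTED_BINOMIAL_SHARE = (0.20, 0.30)


def binomial_share(uninomials, binomials):
    """Proportion of all recorded names that are binomials."""
    return binomials / (uninomials + binomials)


def classify(uninomials, binomials, expected=EXPECTED_BINOMIAL_SHARE, tolerance=0.05):
    """Flag a list whose binomial share lies well outside the expected range.
    The tolerance is an assumed cut-off, not a figure from the chapter."""
    share = binomial_share(uninomials, binomials)
    low, high = expected
    if share < low - tolerance:
        return "too few binomials"
    if share > high + tolerance:
        return "too many binomials"
    return "within range"


def missing_binomials(uninomials, binomials, expected=EXPECTED_BINOMIAL_SHARE):
    """Rough (low, high) estimate of unrecorded binomials, assuming the
    uninomial count is complete. If binomials make up a share s of all
    names, then binomials = uninomials * s / (1 - s)."""
    low, high = expected
    expected_low = uninomials * low / (1 - low)
    expected_high = uninomials * high / (1 - high)
    return max(0.0, expected_low - binomials), max(0.0, expected_high - binomials)


if __name__ == "__main__":
    for language, (uni, bi) in FISH_NAME_COUNTS.items():
        share = 100 * binomial_share(uni, bi)
        gap_low, gap_high = missing_binomials(uni, bi)
        print(f"{language:15s} {share:5.1f}% binomials, {classify(uni, bi)}, "
              f"estimated gap {gap_low:.0f} to {gap_high:.0f}")
```

Because the estimate treats the uninomial list as complete, it is best read as an order-of-magnitude check rather than a count.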

At one extreme, only 5 out of 220 recorded Tikopia fish names (Firth 1985) are binomials. Similarly, the survey of Titan (Akimichi and Sakiyama 1991) yielded only eight binomials among 287 listed fish names, whereas the survey of Satawalese (Akimichi 1980) yielded 122 binomials out of 400 names. We conclude that, in these cases, the Tikopia and Titan lists are probably missing between 60 and 120 binomials. At the other extreme, in the case of Kapingamarangi, there are 43% binomials (114 out of 262) and we can conclude that the list of binomials probably includes some ad hoc descriptive forms. Most of the larger dictionaries do poorly in their coverage of binomials. Their coverage of uninomials is much better, but even there it is clear that, in many cases, coverage is incomplete.

The third method is more fine-grained and requires painstaking family-by-family comparison of fish names. The first steps are to note how many names representing each of the major families of fish are present in the best documented languages, and then to obtain an average and a range of variation for this sample. The next step is to see how dictionary tallies compare with these averages and ranges, looking for cases where there is a striking shortfall in the number of taxa recorded for particular families. This method makes it possible to locate quite precisely some of the gaps in coverage, but for reasons of space I will not tabulate results here (see Pawley 2011a for details).

It has been suggested to me that two Oceanic fishing communities, exploiting a similar marine environment, may differ greatly in the extent to which they have elaborated their lexicon of uninomial and/or binomial fish names. That is to say, large differences in the numbers of fish taxa between two communities with similar means of subsistence may not always be due to oversights on the part of the lexicographer, but may reflect genuine differences in the vernacular lexicon.

This possibility cannot be ruled out – our data include only a few well-controlled case studies (such as the Gela one) where the dictionary coverage can be compared with that of a thorough independent survey. However, we are looking for general trends and the general trends are clear. It is striking that in every case where a careful survey has been done of an Oceanic fishing community’s lexicon of fish names, the number of fish names distinguished has been above 200 and generally in the 300–450 range. Similar findings have been reported for fishing communities in Indonesia (Quick 2010; Taylor 1990). The fact that most dictionaries of Oceanic languages spoken by fishing communities report much lower figures cannot plausibly be attributed to random variation in the vernacular lexicons.
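The proportions behind the comparison of binomials can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original study: it takes the four tallies quoted above (binomials and total fish names for Tikopia, Titan, Satawalese and Kapingamarangi) and computes the share of binomials in each list. The function and variable names are illustrative assumptions.

```python
# Tallies quoted in the text: (binomials, total recorded fish names).
SURVEYS = {
    "Tikopia": (5, 220),
    "Titan": (8, 287),
    "Satawalese": (122, 400),
    "Kapingamarangi": (114, 262),
}


def binomial_share(binomials: int, total: int) -> float:
    """Return binomials as a percentage of all recorded names."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * binomials / total


def report(surveys):
    """Return (language, share) pairs sorted from lowest to highest share."""
    rows = [
        (language, round(binomial_share(binomials, total), 1))
        for language, (binomials, total) in surveys.items()
    ]
    # Sorting puts the two extremes at either end of the list.
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for language, share in report(SURVEYS):
        print(f"{language:<15} {share:5.1f}%")
```

Sorting by share simply makes the two extremes discussed in the text easy to see; the percentages say nothing by themselves about which binomials are conventional names and which are ad hoc descriptions.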

A recurrent problem in elicitation is that of distinguishing between names that are genuine binomials and ad hoc descriptions. When asked to discriminate between specimens, informants may offer descriptive names, such as ‘spotted X’, ‘spiky X’, ‘red X’, which may be descriptions rather than conventional names. For this reason it is important to make independent checks with a range of informants. I had this problem in eliciting Wayan fish names using illustrations. An overly helpful expert informant provided names for about 30 kinds of butterflyfish. Cross-checking indicated that only a very few were conventional names. Another potential source of confusion concerns binomials which are conventional but denote functional categories, like ‘fish of the reef’ vs ‘fish of the deep sea’, which cross-cut the core categories in the taxonomic hierarchy.

4.2. Tree names

The first two of these methods can readily be applied to plant names, again making allowances for regional variations in diversity of flora. Table 4 compares total names of plants that English speakers classify as ‘trees’ (including palms and pandans) in dictionaries of 15 Oceanic languages. Eight of these are spoken on large high islands of western Melanesia (six of them in the Solomons), where the flora are relatively diverse, and seven are spoken in southern Melanesia (New Caledonia and Vanuatu) and the central Pacific, where the diversity is less.

Table 4. Total names of trees in some Oceanic dictionaries

Western Melanesia
  Arosi         339
  Kuanua        295
  Lau           204
  Cheke Holo    188
  Roviana       180
  Gela          179
  Kiriwina      170
  Toqabaqita    157

Southern Melanesia and the central Pacific
  Wayan         222
  Niuean        160
  Cèmuhî        156
  Paamese        84
  Rotuman        71
  Kwamera        61
  Lenakel        58
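As a rough illustration of how the totals in Table 4 can be set against a regional yardstick, the sketch below uses the Wayan figure of 222, which rests on systematic collecting by a botanist (Gardner and Pawley 2006), as a benchmark for southern Melanesia and the central Pacific. It is a minimal sketch under stated assumptions: the simple subtraction treats the floras as comparable, the threshold of 100 is chosen only for illustration, and the function names are hypothetical.

```python
# Tree-name totals from Table 4 (southern Melanesia and the central Pacific).
SOUTHERN_TOTALS = {
    "Wayan": 222,
    "Niuean": 160,
    "Cèmuhî": 156,
    "Paamese": 84,
    "Rotuman": 71,
    "Kwamera": 61,
    "Lenakel": 58,
}

# Assumption: the Wayan list is treated as close to exhaustive.
YARDSTICK = "Wayan"


def gaps_against_yardstick(totals, yardstick):
    """Return, for each other language, the yardstick total minus its own total."""
    benchmark = totals[yardstick]
    return {
        language: benchmark - count
        for language, count in totals.items()
        if language != yardstick
    }


def flag_large_gaps(gaps, threshold=100):
    """List the languages whose gap exceeds the (illustrative) threshold."""
    return sorted(language for language, gap in gaps.items() if gap > threshold)


if __name__ == "__main__":
    gaps = gaps_against_yardstick(SOUTHERN_TOTALS, YARDSTICK)
    for language, gap in sorted(gaps.items(), key=lambda item: item[1]):
        print(f"{language:<10} {gap:4d} fewer names than {YARDSTICK}")
    print("Gap above 100:", ", ".join(flag_large_gaps(gaps)))
```

The gap is only a crude indicator. Islands differ in the richness of their flora, so a shortfall against Wayan suggests where coverage should be checked rather than measuring exactly how many names are missing.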

A comparison of totals for the six Solomon Islands languages is revealing. The tally of 339 tree names in Fox’s dictionary of Arosi (spoken on Makira) is the highest for this region. By this measure, the totals of 204, 179 and 157 in the dictionaries of Lau (Malaita), Gela (Florida group) and Toqabaqita (Malaita) are below expectations, as are the tallies of 188 for Cheke Holo (Santa Isabel) and 180 for Roviana. Kuanua, which scores almost as high as Arosi, is spoken much further west, in New Britain.

The tree flora of southern Melanesia and the central Pacific is a good deal less diverse than that of western Melanesia. The total of 222 tree taxa for Wayan Fijian, based on systematic collecting by a botanist (Gardner and Pawley 2006), is probably close to exhaustive for this language and provides a rough yardstick. It suggests that the dictionaries of Paamese, Kwamera, Lenakel (all Vanuatu) and Rotuman, which record between 58 and 84 tree names, are probably missing upwards of 100 taxa.

4.3. Conclusion

Most Oceanic dictionaries show major shortfalls in their coverage of fish and tree names. The broad conclusion we can draw is that getting all the names for indigenous animal and plant taxa distinguished by a typical Oceanic community is a formidable task and that without the help of specialists, lexicographers are likely to miss a high proportion of them.

5. Problems in defining the meanings of lexical units

Getting names is only the first step in the lexicographical treatment of animal and plant taxa. It remains to provide accurate identifications and definitions of each taxon. Each vernacular name should be matched with the class of referents it is applied to – whether this be a particular species, a genus, a growth stage, male vs female animal, a grouping based on habitat, edibility or size, and so on. In the case of folk specifics and folk generics (see below), ideally this should include accurate scientific identifications with Latin names.

The pioneering lexicographers in the Pacific Islands seldom had specialists, or natural history handbooks on the flora and fauna of their regions, to help them make scientific identifications. Dictionary searches for cognate sets for animal and plant names often yield results like the following. Proto Oceanic *rakumu, which referred to a kind of crab, is quite widely reflected in contemporary languages. However, the sources for these languages give such imprecise identifications that it is impossible to determine its range of reference in Proto Oceanic.

Proto Oceanic *rakumu ‘kind of large crab, probably a land crab’
  Admiralties: Lou roum ‘kind of crab’
  Papuan Tip: Kiriwina lakum ‘small mud crab’
  Papuan Tip: Muyuw lakum ‘crab’
  Papuan Tip: Gapapaiwa rakum ‘kind of crab’
  North N. Guinea: Manam rakum ‘k. big, red crab’
  Meso-Melanesian: Tabar raku ‘crab’
  Meso-Melanesian: Roviana garumu ‘Cardisoma carnifex, white land crab’ (metathesis)
  SE Solomonic: Bugotu ragomu ‘a crab’
  Micronesian: Ponapean rokumw ‘species of small land crab’
  Micronesian: Puluwat róókum ‘large land crab’, róókum(ihát) ‘common black sea crabs’
  Micronesian: Woleaian ragumw u ‘crab, generic’
  N. C. Vanuatu: S.E. Ambrym oum ‘land crab’
  S. Vanuatu: Sye (n)roGum ‘kind of crab’

It should be noted that this cognate set stands in sharp contrast to a number of other cases where a Proto Oceanic term for an animal or plant category can be given a highly specific identification (many such examples can be found in Ross, Pawley, and Osmond 2008, 2011). The latter cases are those where dictionary sources for a subset of contemporary languages belonging to diverse subgroups yield cognates that agree on the identification.

Today’s lexicographers can often call on a range of quite good handbooks for domains of flora and fauna to take to the field. And they are usually better placed than the pioneering generations were to have specimens identified by specialists. For instance, exemplary work has been done on the lexicons of New Caledonian languages by scholars from CNRS/LACITO (Centre national de la recherche scientifique/Langues et civilisations à tradition orale), who have been able to draw on a reservoir of specialist help and a good range of handbooks. A number of such dictionaries, e.g. Ozanne-Rivierre (1984, 1998), Rivierre (1983, 1994), Moyse-Faurie and Néchéro-Jorédie (1989), provide scientific determinations to family, genus or species level for around 90 percent of recorded plant and animal names.

Of course, it should not be assumed that there will be a one-to-one correspondence between folk taxa and a particular species, genus or family recognised by biologists. Sometimes several species will be subsumed in a folk taxon. At other times two or more folk taxa will correspond to a single species, with different taxa corresponding, e.g., to different stages of maturation or to adult males vs females and juvenile males. This is commonly the case, for instance, with certain fish species, where up to five different taxa corresponding to different maturational stages may be distinguished.

Making scientific identifications is the primary but not the only reason dictionary-makers need the help of specialists in natural history. A second reason is that specialists are better qualified to describe the key characteristics of species and their relationships to other species and to investigate their practical uses. A third, already noted above, is that without specialists it is likely that names for many of the less important taxa will be missed. Conversely, a survey conducted solely by biologists with almost no knowledge of the target language is likely to yield an inaccurate account of vernacular terms and taxonomies. The ideal is a close collaboration of linguists and biologists.

6. Drawing a line between definitions and cultural or encyclopaedic information

In explicating the meaning of a lexical unit, are there principled grounds for drawing a line between linguistic and encyclopaedic knowledge, or between definitions and cultural information? If so, what are the distinguishing criteria? If not, what practical criteria should be invoked? These questions haunt every writer of descriptive or ethnographic dictionaries (see Cablitz, this volume). What is the meaning of sheep to the community of English speakers? One can say that sheep are four-legged animals reared for their wool and meat. One can add that they eat grass and other leafy plants, that they come in many breeds, noted for different characteristics, that their skin is used in making leather, that they belong to the genus Ovis and are closely related to goats, that they are found wild in the mountains of Asia, Africa, Europe and North America, that they live in herds (called ‘flocks’), that they are proverbially easily led, that they make bleating sounds, that they normally bear a single young, that traditionally they were tended by shepherds, assisted by specially bred sheepdogs, and so on. The list of characteristics is indefinitely large and there is no obvious boundary between characteristics central and peripheral to the meaning of sheep.

I agree with those who argue that we lack a principled basis for drawing a line between lexical and encyclopaedic knowledge.6 The question ‘What is the meaning of sheep?’ is probably wrong-headed. When formulating a dictionary explication of sheep, it makes more sense to ask ‘Of the many characteristics of sheep known to English speakers, which are the most salient?’ and ‘For the various users of the dictionary, what is likely to be the most useful information to include?’

A few plant name entries from the Wayan dictionary follow which are instructive in two respects: (1) they provide fairly rich descriptions, mainly due to the work of a professional botanist; (2) they show the difficulty of drawing the line between information that belongs in a dictionary and information better left to an encyclopaedia or to technical botanical works.

DALI1, n. Tree (kai) taxon: Cajanus cajan (Leguminosae), pigeon pea. (Name from Hindi dal.) Introduced and sporadically cultivated in Waya for its pea-like seeds. Bushy shrub, occasionally cultivated for its edible seeds. Stems ribbed, leaves 3-foliolate, narrowly oval, pointed, pea-flowers red-brown outside, yellow and crimson-striped within, pods straight, about 6cm long. Ei kani vinā na dali magā ei vākari. Cajan seeds are good eating if they are curried.

DAMANU, n. Tree (kai) taxon: Calophyllum vitiense (Guttiferae). Straight-trunked forest tree of higher altitude, leaves narrow-oval, shiny grey and brown, flowers white. Timber valued for house posts, rafters, furniture, boats.

DIGI2, n. Fern (kai) taxon: generic, includes at least the following two medium-sized terrestrial ferns: (1) Nephrolepis biserrata (Davalliaceae). Ladder fern. Locally common in open mid-altitude forest. (2) Sphaerostephanos invisus (Thelypteridaceae). Common on dry grassy hillsides. The two vascular strands in the leaf-stalk were once used in necklace-making. Subtaxa include borete, mata, vativati.

DILO, n. Tree (kai) taxon: Calophyllum inophyllum (Guttiferae). Common shore tree, sticky cream-coloured sap, spikes of white day-fragrant flowers, round green-purple spongy fruit with a hard stone. Timber used especially for boat ribs. An infusion of the leaves is applied to sore eyes. Oil of the seed used in massage (and formerly to treat leprosy).

6. Haiman (1980) offers detailed arguments in support of this position.

The descriptions provided for these taxa by the botanist working on Wayan were in most cases a good deal more detailed than are shown in the dictionary entries. As the lexicographer in this case, I adopted the rule of thumb of omitting details of plant morphology likely to be of interest only to a botanist while retaining information relating to general appearance, habitat and uses. But it can be seen that the entries are not entirely consistent in this respect. In the entry for DALI, for instance, more details of leaf forms and flowers (“Stems ribbed, leaves 3-foliolate, narrowly oval, pointed, pea-flowers red-brown outside, yellow and crimson-striped within, pods straight, about 6cm long”) are retained than in the entry for DAMANU.

7. Problems in defining the semantic range and semantic relations of major generics: the case of life forms for fish and trees

To analyse and describe the semantic structure of the lexicon, lexicographers need to have a theory of how lexical systems are structured and a metalanguage for describing the relevant categories and relations. Cross-linguistic comparisons show that plant and animal terms typically participate in several systems of semantic classification (Berlin 1992; Conklin 1962). Two such systems (not always completely distinct) are kinds of taxonomies, in which categories are placed in contrastive and hierarchically organised relations. There are (1) taxonomies based on natural kinds, distinguished by morphological and behavioural characteristics; (2) functional taxonomies, where categories are based on use and context. Other systems of classification include (3) partonomies, i.e. part-whole relations, such as parts of a plant or animal, and (4) maturational or growth stages.

Let us consider just folk taxonomies of type (1). Over the years various frameworks have been proposed for describing such systems. For example, for a four-level taxonomy, such as the series animal > dog > terrier > fox terrier, one can simply refer to the most inclusive level as primary, the next as secondary, then tertiary and quaternary, respectively. A more ambitious framework is that advocated by Brent Berlin and his associates. The following is a summary of the system of rank distinctions in folk taxonomies and generalisations about nomenclature given in Berlin (1992).

Life-form. A life-form is a high-order generic taxon, one that (i) distinguishes a highly distinctive morphotype which (ii) is not included in any other taxon other than kingdom, (iii) includes many (sometimes hundreds of) lower-order taxa which share the characteristic morphology of the type, and (iv) is named by a uninomial. Examples of English life-form taxa are tree, vine, fish, bird and insect.

Folk generic. A folk generic (or folk genus) is a ‘natural’ category in two senses, one perceptual, the other linguistic. First, the members of this category are usually marked off from non-members by multiple characters of morphology and behaviour or ecological adaptation that will be evident to any close observer. English examples are mullet, trout, oak, pine, spider, ant, centipede. Second, the folk generic (rather than the life form) is the usual way of referring to particular plants or animals if their identity is known. Depending on various factors, a folk genus may correspond to a single species in the biologist’s taxonomy, to a number of species or a genus, or to a number of genera or families. In addition, the category is named by a uninomial rather than a binomial.

Folk specific. Some folk generics divide into a number of folk specifics (or folk species), usually just a few taxa that contrast in a limited number of features with other members of the generic. Except for domesticated animals and plants, folk specifics are usually the lowest-level taxa distinguished. Berlin (1992) says that folk specific names are usually compounds, consisting of the generic name plus a modifier, e.g. mako shark vs hammerhead shark, trapdoor spider vs huntsman spider. However, Bulmer (1970) finds that a fair number of folk specific names for animals, among the Kalam, are primary lexemes.

Berlin names three other ranks that are sometimes distinguished in folk taxonomies, but these need not concern us here.

Determining the scope of life form taxa is often tricky. While informants agree on the focal membership, they may disagree about peripheral cases. For example, among Oceanic languages there is considerable variation in the boundaries of the general term for fish. Generally the term can be applied to various aquatic creatures other than fish: typically it is extended to whales and dolphins, in some languages to turtles and crocodiles, in some also to octopus and squid, and in some cases most or all water-dwelling animals are included (Pawley 2011b). As Table 5 indicates, few Oceanic dictionaries make much of an effort to specify the range of reference of the generic for fish and fish-like animals. Definitions can be roughly scaled according to their level of informativeness.

Table 5. Definitions of the generic for fish and fish-like animals in some Oceanic dictionaries

Group A. Definitions that give ‘fish’ or ‘fish, sea creature’ without further definition
  Arosi: i’a, a fish.
  Cheke Holo: sasa, fish (generic).
  Marshallese: ek, fish.
  Mota: iga, a fish.
  Owa: aiga, fish, sea creature.
  Rarotongan: ika, fish.
  Roviana: igana, the generic name for fish.

Group B. Definitions that give a partial but rather imprecise listing of members
  Puluwat: yiik, fish (including porpoises and whales but not squid).
  Samoan: i’a, the general name for fishes, except the bonito (Thymnus) and shellfish (Mollusca and Crustacea). On Tutuila the bonito is called i’a. (Pratt 1911)
  Takū: ika, generic term for fish, including marine animals, turtles and two species of clam.
  Tikopia: ika, generic category with primary reference to fish, but including allied creatures, e.g. turtle, cetaceans. [Examples also refer to crabs.]
  Tongan: ika, fish. Also turtles (fonu) and whales (tofua’a) but not eels, cuttle-fish, or jelly-fish.

Group C. Definitions that try to be comprehensive
  Gela: iga, a creature of the sea, fish, mollusc, crayfish, whale, squid, sea anemone, etc.
  Paamese: mesau, 1. fish. 2. any sea dweller (including also turtles, dolphins, shellfish, etc.).
  Toqabaqita: iqa, 1. fish (generic term). 2. Also denotes a superordinate category that includes fish, whales, dolphins, turtles, dugongs.
  Wayan: ika, 1. Typical fish, true fish, syn. ika dū. This category includes all gill-breathing fish with fins, including sharks, rays and eels. 2. Fish and certain fish-like creatures. A generic which includes all true fish (see sense 1) and dolphins. Most informants also regard turtles (ikabula) as ika. Some also include octopus (sulua) and squid (suluanu). Universally excluded are crustaceans (crabs, lobsters, etc.), molluscs with shells, sea cucumbers, sea urchins and jellyfish. (There follows a full list of names of ika.)
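One way to keep informant disagreement visible in a dictionary database, rather than flattening it into a single list of members, is to record membership by degree of agreement. The following is a minimal sketch in Python, offered only as an illustration: the record layout and field names are hypothetical and are not drawn from any of the dictionaries cited, and the data are limited to what sense 2 of the Wayan entry for ika in Table 5 reports.

```python
from dataclasses import dataclass, field


# Hypothetical record layout for a generic term with fuzzy boundaries.
# The field names are illustrative assumptions, not an established schema.
@dataclass
class GenericEntry:
    language: str
    headword: str
    included: set = field(default_factory=set)         # accepted by all informants
    most_informants: set = field(default_factory=set)  # accepted by most
    some_informants: set = field(default_factory=set)  # accepted by some only
    excluded: set = field(default_factory=set)         # rejected by all

    def status(self, creature: str) -> str:
        """Describe how a creature relates to the generic category."""
        if creature in self.included:
            return "included"
        if creature in self.most_informants:
            return "included by most informants"
        if creature in self.some_informants:
            return "included by some informants"
        if creature in self.excluded:
            return "excluded"
        return "not recorded"


# Membership as reported in sense 2 of the Wayan entry for ika.
wayan_ika = GenericEntry(
    language="Wayan",
    headword="ika",
    included={"true fish", "dolphin"},
    most_informants={"turtle"},
    some_informants={"octopus", "squid"},
    excluded={"crustacean", "shelled mollusc", "sea cucumber",
              "sea urchin", "jellyfish"},
)

if __name__ == "__main__":
    for creature in ("dolphin", "turtle", "squid", "sea urchin", "whale"):
        print(f"{creature}: {wayan_ika.status(creature)}")
```

A creature that the entry does not mention, such as the whale here, is reported as not recorded rather than guessed at, which keeps gaps in the documentation distinct from explicit exclusions.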

Group A definitions show no awareness of the fact that ‘fish’ is a highly problematic definition. Group B definitions make an effort to give an exhaustive account of the membership but the definitions tend to be vague or incomplete. Group C definitions come closest to being exhaustive, but largely avoid the problem of disagreement among informants and fuzziness at the boundaries of generic categories.

The treatment of ‘tree’ terms shows a similar variation in quality. All Oceanic languages have a ‘tree’ life form but they vary as to which plants are included (Evans 2008). For instance, some include palms and pandans, others do not; some include tree-ferns and cycads, others do not. In some languages the term for ‘tree’ is used, in certain specialized contexts, as a generic for all plants.

Table 6. Definitions of generic term for ‘tree’ in a sample of Oceanic languages

Group A. Definitions that give ‘tree’ or ‘tree, plant’, without further definition
  Puluwat: yirå, tree.
  Owa: apenagai, tree.
  Cheke Holo: gaiiju, tree (generic).
  Roviana: huda, tree.
  Mota: tangae, a tree.
  Rarotongan: rākau, tree, bush, plant.
  Samoan: lā’au, a tree, a plant. (Pratt 1911)
  Tongan: ’akau, tree, plant.
  Paamese: āi, tree.
  Marshallese: kaan, tree. [Examples refer to coconut and pandanus as well as branching trees.]

Group B. Definitions that give a partial but generally incomplete listing of tree-like members
  Takū: lākau, generic which includes all trees, shrubs, plants, creepers and sticks.
  Tikopia: rakau, generic term for member of vegetable kingdom, usually woody plant, including shrub, herb, but not vegetable or grass.
  Niuean: akau, tree (refers to large or substantial trees; shrubs or small trees are generally known as lākau).
  Arosi: ’ai, a tree or plant having stem and branches; not used of a fern cycad, sago palm, coconut, etc. but used of small plants, e.g. balsam. It is not therefore the true equivalent of English “tree”.
  Gela: gai, a branching plant, shrub or tree, i.e. balsam, croton, and banyan are all gai, but not a palm or coconut (gaimane).
  Toqabaqita: ai, tree (does not include palm trees, cycads, tree-ferns, bamboos, banana trees).
  Wayan: kai, generic for trees and shrubs, and occasionally low bushy plants. Includes palms and pandans. Used in certain compounds as a generic for all plants.

Group A definitions show little or no awareness that the English gloss ‘tree’ is not a sufficiently precise definition of the vernacular generic. Group B definitions recognize that the range of reference of the generic is not satisfactorily captured by such general terms as ‘tree’ and ‘plant’. However, their attempts to define by specifying diagnostic characters and/or by listing members are of varying adequacy.

8. Concluding remarks

Finally, let us return to the question: Is it possible to make a first general dictionary of a language that is not ‘dreadful’? Our examination of some 30 Oceanic dictionaries has been largely confined to aspects of their treatment of terms for fish and trees – too narrow a basis for a general assessment. It must be conceded that only a few of the dictionaries score well in their coverage of the fish and tree lexicons. Most do poorly, both in terms of exhaustiveness of the wordlists and quality of the expository information. This does not necessarily make them dreadful works overall. It is my impression that many of the major Oceanic dictionaries do passably well in their treatment of a number of major lexical domains.

However, my purpose in this paper has been to use the example of fish and tree lexicons to highlight some of the challenges of method and scale that face anyone intending to do a first general dictionary of a language. Trial and error in my own attempts at dictionary-making has taught me that such a project requires considerable expertise, or the help of experts, in various specialised fields of knowledge in addition to botany and zoology. Experience has also taught me that the more fluent the lexicographer is in the target language, and the more fluent native speaker assistants are in the defining language, the easier it is to achieve accuracy, especially in respect of definitions and sense discriminations. Except that it is never easy.

The hard truth is that making an accurate and close-to-comprehensive general dictionary of the language of a traditional society needs many thousands of hours of labour, i.e. the equivalent of several full-time years on the job.7 In most documentation contexts, which operate within a 3–4 year framework, with many competing objectives, it would be unrealistic to try to compile a large general dictionary. These practical considerations, among others, have led Ulrike Mosel (2011) to advocate doing ‘thematic dictionaries’, mini-dictionaries each of which treats a single culturally important domain, as an alternative to compiling general dictionaries. This has the virtue of allowing the researcher to produce in quite a short time one or more small reference works of interest both to members of the speech community and to academics.

7. Anyone thinking of embarking on an ethnographic dictionary should note these words of warning offered by a French colleague and accomplished lexicographer: “La confection d’un dictionnaire encyclopédique est un exercice dangereux, qui peut induire une régression à un stade libidinal archaïque.” (J-C. Rivierre, p.c.) [The making of an encyclopaedic dictionary is a dangerous exercise, which can induce a regression to a primal libidinal stage.]

References

Abo, Takaji, Byron Bender, Alfred Capelle, and Tony Debrum. 1976. Marshallese–English Dictionary. Honolulu: University of Hawai’i Press.
Akimichi, Tomoya. 1980. Bad fish or good fish: The ethnoichthyology of the Satawalese (Central Carolines, Micronesia). Museum of Ethnology, Osaka. Bulletin of the National Museum of Ethnology 6(1):66–133.
Akimichi, Tomoya, and Osamu Sakiyama. 1991. Manus fish names. Bulletin of the National Museum of Ethnology, Tokyo 16(1):1–29.
Berlin, Brent. 1992. Ethnobiological Classification: Principles of Categorization of Plants and Animals in Traditional Societies. Princeton, NJ: Princeton University Press.
Brown, Cecil H. 1985. Mode of subsistence and folk biological taxonomy. Current Anthropology 26:43–62.
Bulmer, Ralph N. H. 1970. Which came first, the chicken or the egg-head? In Échanges et Communications: Mélanges Offerts à Claude Lévi-Strauss à l’Occasion de Son 60ième Anniversaire, eds. Jean Pouillon and Pierre Maranda, 1069–1091. The Hague: Mouton.
Bulmer, Ralph N. H. 1992. Field methods in ethno-zoology with special reference to the New Guinea Highlands. In Studying and Describing Unwritten Languages, Questionnaire 12, Ethnozoology, eds. Luc Bouquiaux and Jacqueline M. C. Thomas, 526–556. Dallas: Summer Institute of Linguistics.
Buse, Jasper, and Raututi Taringa. 1996. Cook Islands Maori Dictionary. (Eds. Bruce Biggs and Rangi Moeka’a). Canberra: Cook Islands Ministry of Education.
Capell, Arthur C. 1941. A New Fijian Dictionary. Suva: Government Printer.
Churchward, Clerk Maxwell. 1940. Rotuman Grammar and Dictionary. Sydney: Australasian Medical Publishing Co.
Churchward, Clerk Maxwell. 1959. Tongan Dictionary. Oxford: Oxford University Press.
Conklin, Harold C. 1954. The Relation of Hanunóo Culture to the Plant World. New Haven, CT: Yale University.
Conklin, Harold C. 1962. Lexicographic treatment of folk taxonomies. In Problems in Lexicography, eds. Fred W. Householder and Sol Saporta, 41–59. Bloomington: Indiana University Research Center in Anthropology, Folklore, and Linguistics.
Crowley, Terry. 1992. A Dictionary of Paamese. Canberra: Pacific Linguistics.
Dye, Tom. 1983. Fish and fishing on Niuatoputapu. Oceania 53(3):247–271.
Elbert, Samuel H. 1972. Puluwat Dictionary. Canberra: Pacific Linguistics.
Evans, Bethwyn. 2008. Ethnobotanical classification. In The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society. Vol. 3. Plants, eds. Malcolm Ross, Andrew Pawley, and Meredith Osmond, 53–84. Canberra: Pacific Linguistics.
Firth, Raymond. 1985. Tikopia–English Dictionary. Oxford, Auckland: Oxford University Press, Auckland University Press.
Foale, Simon J. 1998. The role of customary marine tenure and local knowledge in fishery management at West Nggela, Solomon Islands. PhD thesis, University of Melbourne.
Fox, Charles E. 1955. Nggela Dictionary. Auckland: The Unity Press.
Fox, Charles E. 1974. Lau Dictionary. Canberra: Pacific Linguistics.
Fox, Charles E. 1978. Arosi Dictionary. Canberra: Pacific Linguistics.
Gardner, Rhys O., and Andrew Pawley. 2006. Annotated list of local plant names from Waya Island, Fiji. Records of the Auckland Museum 43:97–108.
Haiman, John. 1980. Dictionaries and encyclopedias. Lingua 50:329–357.
Helfman, Gene S., and John E. Randall. 1973. Palauan fish names. Pacific Science 27:136–153.
Henderson, C. P., and I. R. Hancock. 1988. A Guide to the Useful Plants of Solomon Islands. Honiara: Research Department, Ministry of Agriculture and Lands.
Hviding, Edvard. 1990. Draft Environmental Dictionary, Marovo Language. Centre for Development Studies, University of Bergen, Norway. In cooperation with the Western Province Division of Culture, Gizo.
Hviding, Edvard. 2005. Reef and Rainforest. An Environmental Encyclopaedia of Marovo Lagoon, Solomon Islands. Paris: UNESCO.
Inia, Elizabeth, Sofie Arntsen, Hans Schmidt, Jan Rensel, and Alan Howard. 1998. A New Rotuman Dictionary. An English–Rotuman Wordlist and C. Maxwell Churchward’s Rotuman–English Dictionary. Suva: Institute of Pacific Studies, University of the South Pacific.
Jackson, Frederick H., and Jeffrey C. Marck. 1991. Carolinean–English Dictionary. Honolulu: University of Hawai’i Press.
Johannes, Robert E. 1981. Words of the Lagoon. Fishing and Marine Lore in the Palau District of Micronesia. Berkeley: University of California Press.
Landau, Sidney. 1984. Dictionaries: The Art and Craft of Lexicography. Cambridge: Cambridge University Press.
Lanyon-Orgill, Peter A. 1960. A Dictionary of the Raluana Language. Vancouver: The author.
Lavondès, Henri. 1977. Les noms de poissons marquisiens. Journal de la Société des Océanistes 33:79–112.
Lawton, Ralph. 1998. Kiriwina word list: English to Kiriwina. Printout. Canberra.
Lichtenberk, Frantisek. 2008. Toqabaqita Dictionary. Canberra: Pacific Linguistics.
Lieber, Michael D., and Halio H. Dikepa. 1974. Kapingamarangi Lexicon. Honolulu: University of Hawai’i Press.
Lindstrom, Lamont. 1986. Kwamera Dictionary. Nikukua sai Nagkiariien Ninife. Canberra: Pacific Linguistics.
Lynch, John. 1977. Lenakel Dictionary. Canberra: Pacific Linguistics.
Mellow, Greg. 2009. Owa dictionary. Electronic file, submitted to Pacific Linguistics.
Mosel, Ulrike. 2011. Lexicography in endangered language communities. In The Cambridge Handbook of Endangered Languages, eds. Peter K. Austin and Julia Sallabank, 337–353. Cambridge: Cambridge University Press.
Moyle, Richard. In press. Takū Dictionary. Canberra: Pacific Linguistics.
Moyse-Faurie, Claire, and Marie-Adèle Néchéro-Jorédie. 1989. Dictionnaire xârâcùù–français suivi d’un lexique français–xârâcùù (Nouvelle-Calédonie). Noumea: Ed. Populaires.
Ozanne-Rivierre, Françoise. 1984. Dictionnaire iaai–français (Ouvéa, Nouvelle-Calédonie), suivi d’un lexique français–iaai. Paris: Société d’Études Linguistiques et Anthropologiques de France.
Ozanne-Rivierre, Françoise. 1998. Le Nyelâyu de Balade (Nouvelle-Calédonie). Paris: Peters.
Pawley, Andrew. 2011a. Stability and change in Oceanic fish names. In The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society. Vol. 4. Animals, eds. Malcolm D. Ross, Andrew Pawley, and Meredith Osmond, 137–160. Canberra: Pacific Linguistics.
Pawley, Andrew. 2011b. Were turtles fish in Proto Oceanic? Semantic reconstruction and change in some terms for animal categories in Oceanic languages. In The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society. Vol. 4. Animals, eds. Malcolm D. Ross, Andrew Pawley, and Meredith Osmond, 421–452. Canberra: Pacific Linguistics.
Pawley, Andrew, and Timoci Sayaba. 2003. Words of Waya: A dictionary of the Wayan dialect of the Western Fijian language. Printout, 1200 pp. Department of Linguistics, Research School of Pacific and Asian Studies.
Pratt, George. 1911. A Grammar and Dictionary of the Samoan Language. 4th edition. Apia: Malua Printing Press.
Puku’i, Mary Kawena, and Samuel Elbert. 1971. Hawaiian Dictionary. Honolulu: University of Hawai’i Press.
Quick, Phil. 2010. Assaying the scope and number of fish names in Pendau based on Pawley’s Oceanic benchmarks. In A Journey Through Austronesian and Papuan Linguistic and Cultural Space. Papers in Honour of Andrew Pawley, eds. John Bowden, Nikolaus P. Himmelmann, and Malcolm Ross, 621–631. Canberra: Pacific Linguistics.
Rehg, Kenneth L., and Damian G. Sohl. 1979. Ponapean–English Dictionary. Honolulu: University of Hawai’i Press.
Rensch, Karl H. 1983. Fish names of Wallis Island (Uvea). Pacific Studies 7(1):59–90.
Rivierre, Jean-Claude. 1983. Dictionnaire paicî–français suivi d’un lexique français–paicî. Paris: Société d’Études Linguistiques et Anthropologiques de France.

Rivierre, Jean-Claude. 1994. Dictionnaire cèmuhî–français suivi d’un lexique français–cèmuhî. Paris: Société d’Études Linguistiques et Anthropologiques de France, vol. no. 345.
Ross, Malcolm, Andrew Pawley, and Meredith Osmond, eds. 2008. The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society. Vol. 3. Plants. Canberra: Pacific Linguistics.
Ross, Malcolm, Andrew Pawley, and Meredith Osmond, eds. 2011. The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society. Vol. 4. Animals. Canberra: Pacific Linguistics.
Shoffner, Robert Kirk. 1976. The economy and cultural ecology of Teop: An analysis of the fishing, gardening, and cash cropping systems in a Melanesian society. PhD thesis, University of Hawai’i, Honolulu.
Sperlich, Wolfgang, ed. 1997. Tohi vagahau Niue. Niue Language Dictionary. Honolulu: Government of Niue and Department of Linguistics, University of Hawai’i.
Taylor, Paul M. 1990. The Folk Biology of the Tobelo People. A Study in Folk Classification. Washington, DC: Smithsonian Institution Press.
Thaman, Randy, and Temakai Tebano. N. d. Kiribati plant and fish names. A preliminary listing. Atoll Research Programme, University of the South Pacific.
Waterhouse, J. H. Lawry. 1949. A Roviana and English Dictionary with English–Roviana Index and List of Natural History Objects. Revised and enlarged by L. M. Jones. Sydney: Epworth Printing and Publishing House.
White, Geoffrey M. 1988. Cheke Holo (Maringe/Hograno) Dictionary. Canberra: Pacific Linguistics.

Part IV. Interaction with speech communities

Chapter 12 Language is power: The impact of fieldwork on community politics∗ Even Hovdhaugen and Åshild Næss

1. Introduction

The ethical and political aspects of language documentation work have been brought increasingly to the forefront in recent literature. This is a vast and complex field, where no universally valid solutions can be offered, since both the nature of the issues and the appropriate way of resolving them will vary widely between communities and situations. In the words of Grinevald (2006: 351), “the different agents involved create a maze of commitments of often conflicting nature, and ... one of the major challenges of fieldwork is to juggle all these constraints, requirements, and commitments”.

This paper presents a case study from our own fieldwork experience as a basis for addressing some of the complex issues raised by the presence of a fieldworker in a small local community: what happens when there is disagreement within a community as to whether, how, and where the documentation work should be carried out, and how the burdens and benefits associated with such work are to be distributed? We will discuss how such issues may affect not only the relationship between researcher and language community, but also politics and power relations within the community itself.

While it is generally taken for granted that a visiting researcher should as far as possible avoid getting involved in issues of local politics, this is in practice impossible, because the presence of a visiting researcher in a small community is local politics. Indeed, Rice (2006), while stating that a fieldworker should avoid getting caught up in internal political issues, goes on to list a number of ways in which the presence of a researcher affects a community, including the distribution of money as payment for consultants’ work, choosing who gets to do such work, and deciding where the linguist stays and works; these are all, ultimately, political issues in the sense that they concern the distribution of potentially scarce material resources, and of personal and communal prestige associated with the presence of and interaction with a researcher from outside. While not being able to offer any general solutions, we will point out some potential loci of conflict which are probably to some extent present in most fieldwork settings.

Language documentation work is done in a variety of settings. Our experiences come from “classical” fieldwork, where researchers from outside spend time in a location where a language is spoken and work within the language community for purposes of language documentation and description. However, in many cases work is done in towns or cities with isolated members or small groups of a scattered or decimated language community. In such cases, the issues to be raised here – of who speaks for the community, of whose permission is required and of who decides how the benefits and privileges attached to such work are to be distributed – may be even more acutely relevant and even more difficult to resolve. While we will not specifically discuss such situations here, we note that they are likely to become increasingly common as a result of language shift and the loss of traditional ways of life, and that as a result, the kinds of issues discussed in this paper are likely to require increased attention in language documentation work.

∗ The authors would like to thank Benedicte H. Frostad, Anders Vaa, and the editors of this volume for helpful comments on earlier versions, while stressing that the views expressed in this paper, as well as any factual errors, are entirely our own responsibility. We would also like to emphasize that during our many years of work in Temotu Province, we have been met in the overwhelming majority of cases with helpfulness and generosity; our discussion of the conflict described below is meant as an example of issues which it may be important for linguistic fieldworkers to take into consideration, and not in any way as criticism of the people of the area, for whom we hold the deepest respect and gratitude.

2. The issues

Three main issues arise from the case study presented below. All have to do, to some extent, with local politics and with the distribution of power and privilege within a community, and all are interrelated. Specifically, they concern the relevance of existing legal and administrative structures to the particular issues associated with language documentation projects, and how to handle cases where these structures turn out to be inadequate or open to dispute.

2.1. Permits vs. permissions

There are normally procedures to be followed at the level of national and/or local administration in order to obtain legal permission to carry out research in a country other than one’s own. It is taken for granted that any responsible research project should follow such procedures and obtain all necessary legal documents required to carry out a project in compliance with the requirements set by central authorities. However, such legal permission from a central body of authority does not necessarily translate into ready permission from the local community where a language is actually spoken. In many of the countries where endangered languages are spoken and documentation work is most urgently needed, the central administration is relatively weak, and the communities where one may wish to carry out the work are far removed both geographically and in terms of everyday goals and values from those granting the research permits. Coming to a village and presenting a permit from a central authority may trigger goodwill towards a project; it may also, however, cause resentment and a sense of having the community’s own authority over its linguistic and cultural heritage weakened through the imposition of decrees from above. For this reason, funding bodies for language documentation work often require a researcher to document explicit permission from the language community in which the work will actually be undertaken. This, however, potentially leads to a second problem, that of exactly whose permission can or should be sought.

2.2. Community and authority

The question of how to obtain consent for a documentation project, and from whom, arises not only in the tension between central and local authorities, but also within a language community. There is considerable discussion of this in the literature, for example in Thieberger and Musgrave (2007), who address the issue of potential conflict between individuals, who may give permission for material collected from them to be used in certain ways, and the community as a whole, which may wish to place restrictions on the use of such material. In general, guidelines for ethics in documentation work stress the importance of obtaining explicit consent both from the individuals participating in the project, and from the community whose language is to be documented (see e.g. the ethics guidelines of the Linguistics Department of the Max Planck Institute for Evolutionary Anthropology,¹ or of the DoBeS programme²).

Crucially, however, a “language community” is rarely a homogeneous entity; rather it consists of a complex web of subcommunities such as towns, villages, clans, or families, and each of these subcommunities comes with its own power structures. Even those subcommunities not directly involved in a documentation project may feel entitled to have a say in how the documentation work is carried out, and this may lead to conflict not just between the community and the researcher, but also between different parts of the language community itself.

The fieldwork literature typically emphasizes that a fieldworker should only go where her presence is accepted, if not actively desired – “don’t go where you are not wanted” (e.g. Bowern 2008). It is not necessarily realistic, however, to expect approval from all parts of a community, or for the same terms to be acceptable to all, and differences in expectations and desires may trigger tensions within the community which the fieldworker may not be able to defuse or resolve.

Again, the complexity of negotiating consent from different bodies and levels of authority within a community is well known from the literature; for example, Grinevald (2006: 354) acknowledges that “[c]learly one must deal with a variety of constituencies from whom to get consent”, while Bowern (2008: 153) exhorts the would-be fieldworker to “make sure that you are seeking permission from the right people”. It is often assumed, however, that the main challenge facing the fieldworker is that of finding out whose permission is needed; that is, that there is a pre-existing and well-defined set of “right people” who must be identified and consulted. However, a quite different problem arises when there is no clear consensus within the community on who the relevant authorities are; the question of whose permission is needed may simply not have a definite answer, and attempts to claim authority over a documentation effort may have far-reaching political effects within a community.

1. http://www.eva.mpg.de/lingua/resources/ethics.php
2. http://www.mpi.nl/DOBES/ethical_legal_aspects/DOBES-coc-v2.pdf

2.3. Copyrights and profits

The question of who owns the copyrights to materials produced through language documentation has also been addressed in previous literature (e.g. Dwyer 2006: 46–48, Thieberger and Musgrave 2007: 33–34). A related problem, however, is reaching agreement on exactly when and how the question of copyrights is relevant. In the DoBeS “Code of Conduct”,³ for example, it is stated that “No one is allowed to use the recorded and analyzed data for commercial purposes without permission from the speech community” and that “It is the speech community that deals with the issue of commercial uses. The speech community not only gives permission, but also decides all aspects involved in the commercial use (copyrights, sharing profits, etc).”

The problem here is deciding what exactly counts as “commercial use”. A product which is often agreed on as an outcome of documentation valuable to both researcher and speech community is a collection of texts in the documented language reproduced in digital or printed form. Such a product has value both to the community, as it preserves traditional oral literature and may be used as a basis for literacy training, and to the research community as primary material for linguistic analysis. Naturally, the publication of a text collection requires the explicit permission of those who contribute to it. But does it constitute “commercial use” of the material? The researcher is likely not to see it that way, since she knows that the commercial market for such products is practically non-existent, and since finding a commercial publisher for such materials is generally difficult; typically, such materials will be produced using the project’s own money, i.e. the researcher spends money producing the materials rather than making money from them.

But it cannot be taken for granted that the community will share this view. It is not uncommon to hear stories of researchers “stealing” traditional cultural materials and making large profits from them (cf. e.g. McLaughlin and Sall 2001: 196), and unlikely as such stories may sound to the researcher’s ears, they may nevertheless inform the community’s understanding of the situation. Indeed, given many third-world communities’ experience with Westerners as people who visit their areas solely for the purpose of making money – through plantations, logging, mining or other commercial ventures – the idea of a non-profit research project, and of Western institutions paying for such projects, may be extremely difficult to explain. This may be a particular problem for linguistics, which, as Bowern (2008: 161–162) points out, has often been carried out as a means to other ends, such as anthropology or missionary efforts; the idea that someone might be paid to study a remote minority language with no ulterior motives may be difficult to accept.

3. http://www.mpi.nl/DOBES/ethical_legal_aspects/DOBES-coc-v2.pdf

3. The setting

Our experiences come from fieldwork in several different language communities around the world, but primarily in the Pacific. The experiences to be discussed here stem from the documentation and description work carried out in Temotu Province in the Solomon Islands in 2002–2007, as part of the multidisciplinary research project Identity Matters, funded by the Norwegian Research Council. The linguistic part of this project aimed at studying the two languages of the Reef Islands, Vaeakau-Taumako (Pileni) and Äiwoo, and the contact situation between them, and it based itself on fieldwork in both language communities. Some of the specific issues associated with doing fieldwork as an outsider in Melanesia have been insightfully discussed by Dobrin (2005, 2008), who points out that the ability to engage the interest of outsiders and to enter into relationships of mutual exchange and commitment is a fundamental cultural value of the area. From a more general perspective, the questions surrounding the allocation of resources connected with a documentation project, discussed in the introduction, are highly pertinent in the area where we have worked: firstly, because it consists of small islands poor in material resources, where possibilities for earning a cash income are extremely limited; secondly, because the “bigman” system of social organization, where political influence is based on the resources and community support that an individual is able to command, means that the personal prestige that may derive from close association with visiting researchers and their work is a valuable commodity in itself. On the face of it, we were reasonably well prepared to tackle such issues when the Identity Matters project started. It built on previous work in the area, with Hovdhaugen having made two previous visits to Pileni island in order to make a start on studying the language there – the first time accompanied by anthropologist Ingjerd Hoëm, the second time by Næss. 
During these visits, experience and personal contacts had been built up which proved extremely valuable when Hovdhaugen returned in 2003 to present the Identity Matters project to the local chiefs and communities in different islands in the area, and to make arrangements concerning working conditions for researchers in the new project: where they would stay, who would act as main consultants, how much they would be paid per hour of work, how much the community would receive in return for hosting the researchers, and not least, who would be responsible for receiving and allocating this money, as there were several candidates for this role – paramount chiefs, village councils, pastors, churches, etc. The fact that Hovdhaugen was able to communicate in the Vaeakau-Taumako language made these meetings a lot easier than would probably otherwise have been the case.

We were well aware of the potential impact of our presence in these small village communities, and made every effort to negotiate clear agreements concerning the practical and financial arrangements for our fieldwork. These arrangements usually involved both a contribution to village funds and a modest hourly compensation for those speakers taking time out from their daily activities to work with the linguists. While these terms were, in principle, generally accepted, the more detailed arrangements concerning which villages would receive the benefits of having a visiting researcher stay, and which individuals would get the opportunity through consultant work of earning some much-needed cash income, proved more difficult to resolve to everyone’s satisfaction.

Mosel (2006: 71–72) addresses the near-impossibility of a fieldworker selecting consultants directly; as a guest in the community, she has to work with those individuals whom the community considers suitable and who can spare the time from their daily duties. Our principle has been, once a village has been selected for a research stay, to ask local authorities – chiefs and elders – for help in selecting suitable consultants, according to some basic criteria specified by the researcher.
This ensures that decisions are seen to be made by those whose decision-making authority is unquestionable, and makes it possible to take into account local notions of fairness and entitlement which the researcher, at least initially, will know little about. Of course individuals may still feel slighted and conflict may ensue, but such conflict will to a lesser extent be perceived as a direct result of the researcher’s personal conduct.

More complicated, at least in our case, is the question of where – in which village and under whose direct responsibility – a researcher is to be located during her stay. The ideal solution is of course for the researcher to divide her time about equally between the different communities, not only for political but also for scientific reasons, as a language documentation project should strive to include as much of the existing dialectal variation as possible. In practice, however, this may not always be possible, for several reasons. Firstly, time limitations and logistical challenges may mean that only very little time may be spent in each location, drastically reducing the efficiency of the work to be done. Secondly, other practical issues may mean that some locations are more suitable than others.

In our case, a general, though unwritten, agreement existed concerning where the researchers would stay and how they would circulate between villages. In the spring of 2005 Hovdhaugen came to the area with two MA students, who were to spend time in two different villages. This was one point that necessarily influenced Hovdhaugen’s choice of location. Should the students experience practical problems or need help in their work, they had to be able to contact Hovdhaugen, meaning that both he and the students had to have access to the local radio network. But several islands and villages did not have a working radio. Usually the battery needed charging, and the villages lacked the equipment for this. Furthermore, Hovdhaugen had a few years previously been diagnosed with Parkinson’s disease, and this affliction influenced his accommodation needs. Whereas previously he had stayed and worked in the holau or men’s house, which also served as a meeting house and general social gathering place for the men of the community, he now needed quieter surroundings in order to be able to work efficiently. This meant that accommodation plans had to be rethought, and this was felt by some as a breach of an unwritten contract. As a result, some communities and individuals felt slighted, and this may have contributed to the open conflict that followed.

4. The conflict

The project plan, including the budget, had been approved by the Ministry of Education in Honiara, the national capital; by the Premier of Temotu Province; and by the council of chiefs in the villages where members of the research team stayed. We felt fairly confident that in obtaining permission from these bodies, we had followed correct procedures and consulted all relevant authorities. However, in 2005, a group of chiefs on one of the islands questioned our right to collect traditional stories, on the basis that the project had not been approved at the level of the island. This was initially communicated by a representative of the group approaching the student working on the island in question and demanding that he stop his work immediately. This demand caused considerable consternation in the village where he was staying, where everyone, at least on the face of it, was supportive of the project, and where it had been explicitly approved by the village authorities.

An individual island, however, is not a recognized administrative entity either locally or by national legislation. The island in question consists of four larger villages and a number of smaller villages, and our project had been approved by the villages where the work was being carried out. The whole island is part of a larger administrative district, which is the unit recognized at the national level. The attempt at setting up the island as a distinct level of authority therefore amounted to a re-casting of the political structure of the area, and caused some confusion locally.

The protestors called a public meeting to discuss the project, its legitimacy and the terms for its continued operation in the islands. In the week before this meeting Hovdhaugen had long discussions via radio almost every morning with one of the central actors behind the protests and the proposed meeting. Since the radio network is public, a number of people, both on the island concerned and in the neighbouring area, could listen in on these discussions, and they were often continued with locals in the area where Hovdhaugen was staying, many of whom expressed their support.

The main objections to our project were that we had no right to record and print the traditional stories of the language community, and that no central or provincial authority, nor the chiefs in other villages, had any authority to give us permits to do so. More specifically, our motivation for wishing to collect such stories was questioned.
Explaining the link between “studying the language”, which was what our paperwork stated that we were there to do, and collecting traditional texts, proved difficult. A complicating factor here may be that most people in the region encounter outside interest in studying their language only in the context of Bible translation, which of course has a rather different goal and methodology.

As discussions proceeded, it became clear that the main motivation for the objections was of a financial nature. As noted above, explaining the scientific motivation behind a linguistic documentation project is often difficult, and the protestors reasoned that there must be a profit motive behind the work that we wished to do. Why, after all, would a faraway country like Norway spend so much money to send us to a remote island group to gather language data? If traditional stories were to be published, there should be financial profit involved, and the question of copyrights was perceived as being poorly handled. Hovdhaugen conceded the latter point, but also stressed that there was very little chance of any profits being made from the planned text collections; on the contrary, our publications were printed in Norway using the project’s money, and a significant number of copies were given to the local communities free of charge. Indeed, one major motivation for our wishing to produce printed collections of texts was for the communities to receive written materials to use in literacy training, a goal supported by many; these publications, which comprised both text collections and a short dictionary as well as a small collection of hymns (Hovdhaugen, Hoëm, and Næss 2002; Hovdhaugen 2006; Hovdhaugen and Næss 2006; Hovdhaugen and Tekilamata 2006) were, for the most part, received with enthusiasm.

The conflict came to a head when it transpired that the protesting chiefs had asked the local police officer in the islands to come to the meeting with the police boat and arrest Hovdhaugen if he was found to have done something illegal. But after hours of discussion, including thorough scrutiny of the research permit from the department of education in Honiara and the letter from the provincial Premier in which he asked every village to do its best to support us in our work, the meeting ended without any important decisions being taken. However, a general wish was expressed that the question of copyright and ownership of linguistic materials should be addressed more carefully, a wish which Hovdhaugen accepted as legitimate.

5. The aftermath

Triggered by the presence of outside researchers, this political conflict ultimately involved a struggle for authority within the local community: given that there was no formal level of authority corresponding to the language community as a whole, who had the right to make decisions on the community’s behalf? While the conflict petered out without any dramatic short-term results, a pertinent question is whether it may have had any lasting effects on local politics, or on the relationship between locals and research teams from outside. We have not been able to confidently identify any such effects. The claim of authority based on a non-existent administrative level was recognized as illegitimate, and the process triggered, if anything, explicit statements of support during later research visits; on one occasion a group of village chiefs provided a written statement of support for our project, asserting explicitly that authority in this matter rested with them and so reaffirming the illegitimacy of the attempt at redefining local political structure. Nevertheless, the very fact that such explicit statements of authority were deemed to be necessary may be indicative of subtle shifts in power relations or status within the local communities; such changes are of course difficult for us to discover.

Furthermore, the conflict may to some extent have changed patterns of interaction between different groups in the area. Traditionally, the two language communities in the Reef Islands – the Polynesian-speaking group in the Outer Reefs and the Äiwoo speakers in the Main Reefs – have had little day-to-day contact. There is some intermarriage between the two groups, and people from the Polynesian villages travel to the trade store in the Main Reefs for supplies, but people identify strongly with their village and language community, and there is little contact or collaboration across the linguistic border. As our project spanned both communities, however, both had a vested interest in the situation, and the conflict brought together supporters of the project from both sides. On the evening before the meeting, several of Hovdhaugen’s friends both from the Main and Outer Reefs came to his house to discuss strategies for the meeting. We are not aware that meetings of political leaders from the two language communities have been a practice in the past, and in a longer-term context it may well be that an increase in such contact will turn out to be the main political impact of our project in the islands.
For small island communities having to deal increasingly with the effects of modernization and globalization, increased collaboration and the ability to put traditional borders aside for the sake of promoting common interests are clearly assets, and our presence may, in a small way, have contributed to this.

As a final point, it should be noted that conflicts of authority of the kind we have described may have their source in preexisting conflicts between individuals or groups that the fieldworker cannot be expected to be aware of – they may go back years or even decades, and may be dragged to the surface through the new situation created by the arrival of the researcher. In our case, it was clear that the protests were motivated as much by individuals’ sense of entitlement and injustice as by legitimate formal claims. One key actor in particular was known to have been very hostile to previous linguistic efforts in the area, including ones largely carried out by locals; acceding to this person’s wish to play a central role in our work might have prevented the particular situation which arose, but would have been practically and scientifically infeasible, as well as creating a potential source of resentment from other prominent personalities in the area.

6. Conclusions

A familiarity with the structures of power and authority in the region where one is working is essential for any documentation enterprise. Such familiarity is only built up by experience, which means that the chances of making mistakes in the early phases of a project are great, as is well documented in the literature on language documentation. However, the existing power structures may not always be adequate for the handling of the particular issues raised by a language documentation project, and this may lead to conflict not only between project participants and community members, but also between different parts of a language community. If there is no body of authority associated with the language community as a whole (as distinct from, e.g., an administrative district which may subsume several language communities), the question of whose permission is required may simply not have an unequivocal answer. This raises a number of questions:

1. Under such circumstances, how far can a researcher be expected to go in obtaining permission to carry out documentation work? Is it necessary – and realistic – to acquire permission from all parties who may consider themselves entitled to an opinion, even if they are not directly involved in the project itself?
2. If this is deemed to be the case, how does one identify and approach all interested parties? In our case, the objections came from a self-styled body of authority which was not recognized locally as legitimate. In other words, there simply was no way to approach this body beforehand, as technically it did not exist until it was set up as a means of protesting against the project.
3. When there is no consensus within a community as to who has the authority to grant permission, how does the researcher establish whose claims are legitimate? It is tempting to assume that those who desire the project to go ahead are those whose opinions should be listened to, but can their dismissal of the objections be accepted without question?


4. In cases where a project creates divisions and conflict within a community, is it ethical to carry on with the project at all? Do the benefits for those who wish for documentation to be carried out – benefits which are typically assumed to extend to all of the language community, even those who may have objections – outweigh the damage that may be done to interpersonal relations and political equilibrium within the community? This question is of central importance not only to language documentation as such, but to the entire discipline of linguistics, since it plays a significant part in determining what language data is made available to linguistic science.

As is often the case, the answers to these questions will differ from case to case and from area to area. They should, however, be reflected upon as part of the planning of a documentation project, since our experience shows that it is in practice impossible for a visiting researcher to predict the impact which a research project may have on local politics. One may strive to treat everyone fairly and to make the proper arrangements with the appropriate local authorities; but local power struggles and individuals’ quest for personal prestige may nevertheless give rise to conflicts which may not only be uncomfortable for the researchers, but may shake up a whole community and create new divisions and alliances which cross-cut traditional structures. When the dust settles and the researcher leaves, local power relations may well have changed as a result.

References

Bowern, Claire. 2008. Linguistic Fieldwork: A Practical Guide. Basingstoke: Palgrave Macmillan.
Dobrin, Lise. 2005. When our values conflict with theirs: Linguistics and community empowerment in Melanesia. In Language Documentation and Description, Volume 3, ed. Peter K. Austin, 45–52. London: School of Oriental and African Studies.
Dobrin, Lise. 2008. From linguistic elicitation to eliciting the linguist: Lessons in community empowerment from Melanesia. Language 84(2):300–324.
Dwyer, Arienne M. 2006. Ethics and practicalities of cooperative fieldwork and analysis. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 31–66. Berlin, New York: Mouton de Gruyter.


Grinevald, Colette. 2006. Worrying about ethics and wondering about “informed consent”: Fieldwork from an Americanist perspective. In Lesser-Known Languages of South Asia: Status and Policies, Case Studies and Applications of Information Technology, eds. Anju Saxena and Lars Borin, 339–370. Berlin, New York: Mouton de Gruyter.
Hovdhaugen, Even. 2006. A Short Dictionary of the Vaeakau-Taumako Language. Oslo: The Kon-Tiki Museum.
Hovdhaugen, Even, Ingjerd Hoëm, and Åshild Næss. 2002. Pileni Texts with a Pileni-English Vocabulary and an English-Pileni Finderlist. Oslo: The Kon-Tiki Museum.
Hovdhaugen, Even, and Åshild Næss. 2006. Stories from Vaeakau and Taumako / A lalakhai ma talanga o Vaeakau ma Taumako. Oslo: The Kon-Tiki Museum.
Hovdhaugen, Even, and Christian Tekilamata. 2006. Christmas Carols from Vaeakau. University of Oslo: Department of Linguistics and Scandinavian Studies.
McLaughlin, Fiona, and Thierno Seydou Sall. 2001. The give and take of fieldwork: Noun classes and other concerns in Fatick, Senegal. In Linguistic Fieldwork, eds. Paul Newman and Martha Ratliff, 189–210. Cambridge: Cambridge University Press.
Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Rice, Keren. 2006. Ethical issues in linguistic fieldwork: An overview. Journal of Academic Ethics 4:123–155.
Thieberger, Nick, and Simon Musgrave. 2007. Documentary linguistics and ethical issues. In Language Documentation and Description, Volume 4, ed. Peter K. Austin, 26–37. London: School of Oriental and African Studies.

Chapter 13
Sustaining Vurës: Making products of language documentation accessible to multiple audiences
Catriona Hyslop Malau

1. Introduction

With the emergence of language documentation as a distinct sub-discipline of linguistics, and recent advances in technology, there is now a strong emphasis within field-based linguistics on the importance of producing high quality video and audio language data which is annotated and stored in a digital archive (e.g. Gippert, Himmelmann, and Mosel 2006; Austin and Grenoble 2007). Once archived, the unedited data can be accessed (depending on imposed restrictions) not only by linguists and members of the speech community, but also by any other interested parties. However, if the documentation is only stored as unedited recordings in a digital archive, apart from linguists and members of the language community, most likely it will only be accessed by researchers in disciplines closely linked to linguistics, such as anthropology, ethnomusicology and archaeology, in particular those with a focus on the immediate language area or family. A number of linguists working on language documentation have therefore emphasised the importance of considering from the outset the broad variety of users who could potentially utilise and benefit from a language documentation corpus, and planning and carrying out the data collection accordingly (e.g. Dwyer 2006; Austin and Grenoble 2007). With this issue in mind, as part of our DoBeS language documentation project focussing on the Vurës and Vera’a languages of Vanuatu, we made a decision to produce edited versions of some of our video recordings as a distinct output of our project. We considered that if we took advantage of the opportunity to use all available resources to create high quality documentary films, we could reach a significantly wider range of users. This could then have a substantial impact on the value of our project, with its primary aim of documenting endangered languages and through documentation promoting


language maintenance and linguistic diversity. Thus while Seifart (this volume) considers the different motivations for documenting endangered languages, I consider a related issue, that of attempting to target a wide range of users of language documentation. Cablitz (this volume) has also considered the issue of making language documentation work accessible to a wide audience, with her work on a multimedia encyclopaedic lexicon that is user-friendly for the language community and others. We took a similar approach and this paper presents, as a case study, the films produced by our project. Le Kal Vurës ‘Sustaining Vurës’ is a two DVD set of documentaries presenting aspects of Vurës life and language. Through discussion of the diverse range of audiences who have accessed the films, the paper shows how these and other films of this nature can be important tools for supporting language and cultural maintenance, both within and beyond the language community.

1.1. Project background

Our project focuses on the documentation of two closely related languages spoken on the island of Vanua Lava, one of the Banks Islands situated in North Vanuatu: Vera’a, the more endangered of the two with approximately 300 speakers, and Vurës, the dominant indigenous language of the island, which is nevertheless endangered with only around 1,200 speakers. There are approximately 100 languages spoken in Vanuatu (Lynch and Crowley 2001) for a population which in 2009 was 234,023 (Vanuatu Census, Office 2009). With an average of little more than 2,000 speakers per language, Vanuatu has earned the status of the country with the greatest density of languages per capita in the world. The low number of speakers of each language, combined with the fact that the lingua franca and national language, Bislama, and the official languages, French and English, impact on the status and transmission of the indigenous languages, means that all of the languages of Vanuatu are endangered to some extent, although the threat is not high in comparison with languages in other hotspots across the globe (Harrison 2007). The majority of the Vanuatu languages are under-documented (Lynch and Crowley 2001), particularly with regard to archived language corpora, and language documentation projects such as ours are essential for increasing our understanding of the languages and producing documentation which can assist with language maintenance.
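As a rough check on the figures quoted above, the per-language average and the share of the national population that speaks Vurës can be worked out directly. The short Python sketch below is illustrative only: the variable names are my own, and the inputs are the approximate values given in the text, so the results should be read as orders of magnitude rather than precise statistics.

```python
# Rough check of the speaker figures quoted in the text.
# All inputs are the approximate values given there; names are illustrative.
population_2009 = 234_023   # population of Vanuatu in 2009
languages = 100             # approximate number of languages
vures_speakers = 1_200      # approximate number of Vurës speakers

# Simple average: total population divided by number of languages.
average_speakers = population_2009 / languages

# Vurës speakers as a fraction of the national population.
vures_share = vures_speakers / population_2009

print(f"Average speakers per language: {average_speakers:,.0f}")
print(f"Vurës speakers as share of population: {vures_share:.2%}")
```

Dividing the whole population by the number of languages is a simplification, since it ignores multilingualism and speakers of non-indigenous languages, but it is enough to show where the figure of "little more than 2,000" comes from.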


The primary aim of our research is to produce a comprehensive record of the language within its cultural context. In an effort to document the most significant aspects of the speakers’ culture and environment, the project has brought together a team which includes, apart from linguists, specialists from varied disciplines: a marine biologist (Katherine Holmes), an anthropologist (Sabine Hess), an ethnomusicologist (Raymond Ammann), and a botanist (Philemon Ala).

1.2. The documentaries

We have produced two documentaries in DVD format which centre on the Vurës language and culture, presented under the single title Le Kal Vurës ‘Sustaining Vurës’. The Vurës title translates more literally as ‘lift up Vurës’, indicating our objective for the documentaries to be used both to promote the maintenance of the language within the language community, and also to celebrate linguistic diversity and raise the status of minority languages such as Vurës outside the community. To the best of my knowledge, these are the first films produced entirely in an indigenous language of Vanuatu. Each of the two documentaries has a common theme and includes a number of short films focussing on a specific topic. The first DVD, O tere tar¯ni vovo¯non ‘Water harvests’ comprises three short films, each demonstrating a fishing activity: the traditional technique of harvesting the marine worm un ‘palolo’; the process of catching fish employing the leaves of a poisonous vine; and the weaving and setting of a freshwater prawn trap. The second DVD, O tere tar¯ni vetvet ‘Ways of weaving’ includes twelve films, each demonstrating the process and techniques for weaving or plaiting a different basket or mat. The fishing and weaving practices presented in the films are quite unique to the region, thus making the documentaries an important ethnographic contribution in terms of documentation of the cultural activities. The activities presented in the fishing film are also noteworthy for the opportunities they provide for discussion of conservation in traditional fishing practices, which became an important secondary focus of the films due to our collaboration with conservation marine biologist, Katherine Holmes (see §3.3). The topic of weaving was chosen for various reasons, most significant of which was our desire to produce a comparative record demonstrating the variation in basketry styles relative to other areas in Vanuatu. In the Banks islands a greater


diversity of styles has been preserved compared to other islands in Vanuatu. There are a variety of fibres which are used and basket styles which are produced today on Vanua Lava which, in areas south of the Banks group, are either not used at all or about which only a small number of basket producers retain knowledge.

1.3. Production of the films

All video and audio footage for the films was recorded by me or Katherine Holmes, not by professional film makers. However, we engaged a small film production company, GoodEyeDear, to edit and post-produce the films. GoodEyeDear director, Gavin Banks, also created all the graphics and DVD interface. To realise this aspect of the project we depended on the support of our project host, Prof. Ulrike Mosel, and our project funder, the Volkswagen Foundation, who recognised the films as worthwhile outputs of a language documentation project and agreed to the use of project funds to pay for production costs. The documentaries are still low budget productions, a point which is, of course, evident in the final product, but we feel (and the reaction to the films would confirm) that they are sufficiently professional to justify targeting them at an audience beyond the language community. I will return to the point of funding for film projects in §5, in considering the positive and negative aspects for linguists of investing time in producing outputs such as documentaries.

2. The films as language maintenance tools

The primary audience for these documentaries was always intended to be the Vurës people, for the speakers and their descendants to be able to use them to assist in maintaining and passing on aspects of both their language and their culture. True to this purpose, the audio is entirely in the Vurës language. When editing the films we always needed to keep our primary target audience in mind, while also considering the fact that we wanted the films to be accessible and of interest to people outside Vanua Lava, both within and outside of Vanuatu. Every aspect of the films is thus trilingual in Vurës, English and Bislama, the national language of Vanuatu. The title and information on the cover and DVD faces, plus the menus within the DVDs are in all three languages. The Vurës version plays with no subtitles, while the English and Bislama versions present subtitles in the chosen language.


Aside from the fact that the Vurës language is the medium used to present the communication in the films, these films are not about the language and do not present any overt discussion of their purpose as language awareness and maintenance tools. However, there are three ways in which this purpose is made evident, two of which are minor yet nevertheless significant points, the other being a more explicit technique employed for highlighting the language used as one of the themes of the films. The first point is simply the fact that these documentaries are entirely in the Vurës language. It is a highly significant matter for a film to be made – and to be commercially available – in a minority language spoken by a little over a thousand people. Within the sociolinguistic context of Vanuatu with its 100 languages, the existence of the films and knowledge of their existence within and outside the community serves immediately to elevate the status of the language. In §4 I discuss further the significance of films being produced in minority languages. The second point is that while the language maintenance issue is not discussed within the films, there is an introductory written text, accessible through the main DVD menu, which provides background on the language and language endangerment issue. The text discusses the fact that some of the languages spoken on Vanua Lava have already been lost, and that while Vurës is being passed on to children today, the community recognise that it is under threat from English, French and Bislama. It is stated that this is the reason why the language documentation project was supported by the community and why the DVDs were produced. The third way in which these films can be identified as having the language used as a focus is considerably more explicit, with a clear pedagogical intent. 
Each of the documentaries contains a number of dictionary entries – presented on the screen as an excerpt of a page from a dictionary (as exemplified by Figure 1) – which highlight and provide translations and definitions for certain key terms used in the films. The aim was to choose the most significant technical terms that were related to the topic of each film, particularly those which were more likely to represent restricted knowledge, thus enabling the dictionary entries to serve as an important record of the meanings of the technical terms. For example, in both documentaries a number of plants are referred to which are used for producing woven artefacts or in fishing activities. The dictionary entries for these words are presented complete with the scientific identification for the plant. Thus, the film serves as a record of


particular cultural activities, including the vernacular names for species that have specific cultural uses, and the linked dictionary entry serves as an accurate scientific record linking the identification to the use as it is represented in the film.

Figure 1. Screenshot of a dictionary excerpt

The idea behind including the dictionary entries within the films themselves was to highlight the fact that the language being used is one of the themes of the films, and to compel the viewers, whether they be members of the Vurës community or not, to consider its high significance. For the Vurës speakers, now and in the future, the entries for each headword can be used as a teaching tool, particularly in relation to those words which are used in the domain of cultural practices that are not now widely observed. Linking dictionary entries to video in which the use and denotatum of the words is clearly illustrated has a much greater value than a dictionary entry presented as text alone. This is particularly true in the case of scientific identifications and technical terms, where the translation or definition may not, in reality, be sufficient to unambiguously retrieve the correct meaning and range of use of the word or expression. I will give two different types of examples. Firstly, including the language term and its scientific name alongside contextual video footage which illustrates clearly the species and the natural environment in which it is found, is a comprehensive record which can be used to assist the


language community in identifying the precise class of referents and preserving the word with its associated meaning and cultural knowledge. Mere inclusion of a universally accepted scientific identification, on the other hand, means that the information regarding the exact species will not be easily accessed by members of the indigenous community who do not possess the background scientific knowledge and ready access to relevant sources, such as are available on the internet. A different type of example is that of terms used to explain the process of producing woven or plaited basketry. In producing a bilingual/multilingual dictionary, it can be very difficult to produce a translation, particularly for words referring to actions and abstract concepts, which is sufficiently precise to enable someone to access the accurate meaning, so that they could reproduce the correct associated action. For example, the definition I have provided for qeleg, ‘plait together two strips of coconut leaflets or pandanus leaves as first stage in making basket or mat’, is not sufficiently precise to enable someone to reproduce the action which qeleg refers to. Incorporating the use of the word in context, with the translation (in both English and Bislama), alongside demonstration of the action as a stage of the plaiting procedure which it is a part of, is a practical method for enabling someone to reproduce the correct action. The dictionary entries presented within the films are taken from a trilingual Vurës-English-Bislama dictionary that we are currently compiling as part of our documentation project, and their format and the information included is typical of a bilingual/multilingual dictionary. The headword, given in the accepted Vurës orthography, is followed by pronunciation presented in IPA, then the word class. The English translation is given first, then Bislama, followed by a scientific definition where relevant. 
Where it was considered necessary, additional encyclopaedic information was also included. Where the relevant sense is a secondary sense of the word, the primary sense is included as a further record linking the senses. There are a total of 77 definitions of terms in the two documentaries. All key weaving terms are presented, including both those terms whose meaning and use is completely restricted to the domain of weaving, and those which also have other more general senses outside weaving. For example, the primary sense of the word sal is ‘road, path’. The same term used in weaving means ‘working strip of material being used for weaving, plaiting’. Both senses are given, and thus the connection between the primary sense and the secondary sense is made clear through the presentation of the definition.
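The entry format described above can be pictured as a simple record with optional fields. The following Python sketch is only an illustration of that structure: the class name, field names and rendering format are hypothetical and are not taken from the dictionary project itself. The example uses the entry for sal, filling in only the two senses quoted in the text and leaving the pronunciation, word class and Bislama translation empty, since these are not given here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DictionaryEntry:
    """One on-screen entry; field order follows the description in the text."""
    headword: str                          # in the accepted Vurës orthography
    ipa: Optional[str] = None              # pronunciation in IPA
    word_class: Optional[str] = None
    english: Optional[str] = None
    bislama: Optional[str] = None
    scientific: Optional[str] = None       # scientific identification, where relevant
    encyclopaedic: Optional[str] = None    # additional information, where needed
    primary_sense: Optional[str] = None    # given when the entry shows a secondary sense

    def render(self) -> str:
        """Join the fields that are present, in the order described in the text."""
        labelled = [
            ("IPA", self.ipa),
            ("class", self.word_class),
            ("English", self.english),
            ("Bislama", self.bislama),
            ("scientific", self.scientific),
            ("note", self.encyclopaedic),
            ("primary sense", self.primary_sense),
        ]
        parts = [self.headword]
        parts += [f"{label}: {value}" for label, value in labelled if value]
        return " | ".join(parts)


# The weaving sense of sal, linked to its primary sense as described in the text.
sal = DictionaryEntry(
    headword="sal",
    english="working strip of material being used for weaving, plaiting",
    primary_sense="road, path",
)
print(sal.render())
```

Keeping the primary sense as a separate optional field mirrors the practice described above of showing the link between a general sense and its specialised use in weaving.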


Terms specific to fishing and other featured activities are also included, both those that are more technical and also more general terms that play an important role in discussion of the featured activities. Definitions for 15 different plant species are included: nine in the weaving documentary and six in the fishing documentary. Each dictionary entry appears on screen directly after the utterance unit the defined word occurs in, thus linking the word and its translation/definition directly to the use of the word in context. Most of the content of the films is of a procedural genre, and the entries are linked in such a way that as the speaker(s) demonstrate the process, explaining their actions, when they use a key term the film then cuts to a screen where first the headword alone, then the full dictionary excerpt appears. The dictionary page remains on screen for eight seconds and then cuts back to the film. In presenting the dictionary entries in this way, the aim was to make them a feature of the film without considerably disrupting the flow of the depicted procedures.

3. The film audiences

At the time of writing of this paper it is less than a year since our documentaries were completed and first presented to their primary audience, the Vurës language community. After first being shown to the community, the films were launched at the Vanuatu Museum in February 2010. In the short time since, the films have already been viewed by a number of different audience groups, demonstrating the validity of producing an output of a language documentation project that targets a variety of users. Below I discuss the different audience groups and describe some of the ways in which a film that is in an indigenous language and centres on cultural themes, can have different impacts and functions for different groups of users.

3.1. The Vurës language community

I have already stated clearly that the primary audience for these films is intended to be the language community. But do the speakers of the language recognise the films as being significant and offering any worthwhile contribution in terms of supporting linguistic and cultural identity and maintenance? Apart from a general sense that those who have watched the films feel pride in seeing themselves and their culture represented on a ‘real’ DVD, the films


appear to have already had an impact on viewers in terms of issues relating to transmission of cultural knowledge. One significant example is the response to the film N¯en a so o un ‘We gather the palolo worms’. ‘Palolo’ are segmented sea worms belonging to various species of the genus Palola or Eunice. They are gathered for consumption when they rise to the surface of the sea close to shore after spawning. The opportunity to collect the palolo only arises on a few nights each year, and thus the activities associated with the gathering are part of a rare yet significant cultural event. (See Mondragón 2004 for further explanation, discussion and references relating to palolo harvesting in the Torres islands, directly to the north of Vanua Lava.) Whilst all of the Vurës speaking villages are located in a very small area of southwest Vanua Lava, there is considerable variation in exploitation of marine resources, depending on the proximity of the village to the coast and reef. This is true to the extent that many people have remarked, after watching the palolo film, that they have never been involved in or witnessed the gathering of the palolo. For these viewers, watching the film has been their first exposure to a natural phenomenon and cultural practice that is an annual event for members of the language community living less than an hour’s walk from their village. Production of the film has thus presented members of the community with access to cultural knowledge regarding an activity that they were previously aware of only vaguely, without being privy to specific details regarding the cultural practice, and the language associated with the activity. This provides opportunities for discussion and transmission of knowledge possessed by select groups within the language community, rather than the language community as a whole. 
A second more general point is connected to the widespread misperception that once a record is made and stored in an archive, then there is no further effort required to ensure that the knowledge is retained. This issue is reflected in the reaction of Eli Malau, a Vurës village chief and active member of the Vanuatu Cultural Centre’s fieldworker program, who is intensively involved in our research and has assisted in all aspects of documentation: after watching the films he commented with pride on how he perceives their potential usefulness as a cultural maintenance tool within his language community. He applauds the production of the films as an accessible means of feeding research outcomes back into the community and sees that there is even more motivation to work on cultural maintenance when material outputs are produced and returned to the community. He plans to use the films, particularly the documentation of weaving, as a starting point for workshops on transmitting cultural knowledge.

3.2. The wider Vanuatu community

There is great potential for these documentaries to be used by different groups outside the Vurës community, within Vanuatu. The films have been broadcast by the national television station, Television Blong Vanuatu, which is proactive in screening locally made productions, as a means of promoting cultural maintenance. The films have relevance throughout the nation as they depict familiar activities and can be used to discuss the local diversity of cultural practices. People in Vanuatu are very interested in slight variations in cultural practices between islands and this is a common conversation topic. Yet, few people have the opportunity to travel widely to other islands within Vanuatu. Producing documentaries like ours offers groups within Vanuatu the opportunity to view and gain insight into other Vanuatu cultures. Although the films are in Vurës, a language spoken by less than 1% of the population of Vanuatu, these documentaries could be used in schools throughout Vanuatu for their content, for cultural studies. We have not yet explored this avenue for dissemination of the films, but intend to do so.

3.3. Conservation groups and other researchers outside Linguistics

Our main aim, as linguists and ethnographers, is to document and to support the maintenance of traditional knowledge, and to enable language communities to pass on this knowledge through our documentation. However, when it comes to such cultural practices as fishing, there can often be a certain amount of conflict between traditional practices and environmentally sustainable practices, and linguists or ethnographers do not want to and should not be engaged in the promotion of unsustainable practices. The three films on the DVD O tere tar¯ni vovo¯non ‘Water harvests’ each describe a traditional fishing activity. Although these films do document environmentally unsustainable fishing practices, namely the mass stunning of fish, they can nevertheless be used to promote discussion of issues surrounding the sustainability of traditional practices.


Katherine Holmes, conservation marine biologist on our project team, is currently director of the Papua New Guinea Marine Program of the Wildlife Conservation Society. In this role she has already been making use of, and will continue to use, our documentary focussing on fishing practices on Vanua Lava. In conservation workshops throughout Papua New Guinea she has been showing both the film demonstrating the use of poison to catch fish and that centring on gathering palolo worms, with two distinct purposes in mind. The technique for stunning fish which is demonstrated in our film N¯en a vun o mes ‘We poison the fish’, using leaves of the vine Derris elegans, is highly effective, to the extent that its use is actively discouraged by conservationists as an unsustainable fishing practice (Katherine Holmes, p.c.). Vurës speaker Eli Malau presents that message at the end of the film, commenting that in the past this technique was only employed when a large number of fish were required for a significant cultural event and people possessed the cultural knowledge that the practice was not sustainable for regular use. Despite its unsustainability, it is a significant cultural practice and it is important that the knowledge be maintained. In different areas of Papua New Guinea, both with communities and with conservation groups, Holmes is able to use the film with a dual purpose: to ascertain where the technique is practised and the details associated with it, and as a basis for discussion of the unsustainability of the practice. For this audience, only the content of the films is relevant, not the language used.

3.4. A general audience

There is a dearth of films available depicting Vanuatu culture, and fewer still including commentary in indigenous languages. There are a number of locally produced DVDs available, showcasing Vanuatu music, both traditional music and modern stringband. However, few films highlighting aspects of Vanuatu culture have been produced.¹ Certainly our films are the first which are solely in a Vanuatu language, with a purely ethnographic focus. Although our film is a low budget production and would only be of interest to a select audience, the DVDs are being sold through the Vanuatu Museum. Making films in minority indigenous languages available to a general audience is an important way of promoting language and cultural maintenance. This is true not only in terms of raising public awareness about the issue of language endangerment, but also considering the fact that the status of the language within the language community is elevated when the people are aware that a film depicting their language and culture is being disseminated outside the community.

1. Notable exceptions are the recent award-winning French language films, Sevrapek City (Broto and Tzerikiantz 2009) and Le Salaire du Poète (Wittersheim 2008).

4. Films in and on minority languages

The issue of language endangerment has been receiving considerable attention in the mainstream media, and a number of films have been produced on the topic which have reached a popular audience and served to increase general awareness about language endangerment and the need for language documentation. Prominent examples are the 2008 film The Linguists, featuring David Harrison and Greg Anderson and their work on documentation of endangered languages, and Janus Billeskov Jansen’s 2005 film In Languages We Live – voices of the world. Although I do not intend to claim that we are the first or the only researchers to produce films of this nature, few entire films have been produced to date which are not just about, but also in, a minority indigenous language. I believe that researchers who are engaged in language documentation should be taking more opportunities to promote linguistic diversity by using language documentation data to produce films telling real stories, aimed at a range of audiences. Such films represent an important middle ground in contrast to other representations of minority languages that are currently being presented on screen. The success of Rolf de Heer’s film Ten Canoes is an excellent example of how production of a film in a minority language can have a significant impact on increasing awareness of language endangerment and contributing to language maintenance. Apart from the impact of having an award-winning film in a minority language, spin-offs from Ten Canoes have actively contributed to language and cultural maintenance.2 I do not suggest that linguists working on language documentation projects are in a position to produce feature films, yet, as mentioned, films such as those discussed in this paper represent an important middle ground. We should take advantage of the opportunity presented through building language documentation projects to incorporate the production of purposeful non-academic outputs.

2. http://www.vertigoproductions.com.au/information.php?film_id=11&display=extras

But what is the reality of such an expectation, that linguists, employed as academic researchers, and in most cases also engaged in tertiary teaching, will be in a position to dedicate the time required for production of such non-academic outputs? The tasks required in language documentation are already many and varied: record texts and elicit other language data in the field, including metadata; transcribe and translate the recordings; further annotate the recordings using specialised software tools; archive the recordings along with metadata; write a grammatical description and analytical papers; produce a dictionary and literacy materials for the community; and more. These are incredibly time-consuming activities (cf. Wittenburg 2007). Given, in addition, the lack of importance that is generally placed on outputs such as these within academia, it is an even greater expectation to propose that more projects should be giving emphasis to these non-academic, popular outputs. Yet projects such as The Sorosoro Program3 and National Geographic’s Enduring Voices Project4, which present excerpts of films in indigenous languages, show how collaboration between linguists and filmmakers can offer opportunities to increase the potential audience of language documentation. Currently the video excerpts presented on the Enduring Voices YouTube channel5 do not provide sufficient background information and translations of content to capture a wide audience, but this initiative is a step in the right direction in terms of presenting the work of language documentation. The Sorosoro Program in particular offers a model of how language documentation and maintenance work can be done, engaging professional filmmakers to work with linguists to produce video footage that is both aesthetically pleasing (cf. Dimmendaal 2010) and provides linguists with data that enables further analysis and preservation in language archives.

3. http://www.sorosoro.org/en/
4. http://travel.nationalgeographic.com/travel/enduring-voices/
5. http://www.youtube.com/enduringvoices

5. Conclusion

In conclusion, this paper has demonstrated the diverse range of audiences that can benefit from outputs of a language documentation project, despite the fact that the different audiences have varied needs and interests. Placing emphasis on the production of non-academic outputs such as these reminds us of the primary goal of language documentation: to support language maintenance and linguistic diversity. Proposing that all language documentation projects should produce such outputs is not unproblematic. As discussed, the workload involved in language documentation is already beyond most academic linguists and many language documentation projects have neither the time nor the funding to dedicate to producing these outputs. However, we can address this issue both in the long term and in the short term. For the long term we can work towards promoting the worth of these outputs, both to funders and to the academic community. And in the short term, those language documentation projects that do not have the resources available to dedicate to such outputs can nevertheless focus on recording high quality audio, video and image data that could potentially be used for producing such outputs in the future; data that includes a wealth of information on focussed topics and goes beyond being basic language data, by depicting aspects of the lives of the speakers in an aesthetically pleasing way.

References

Austin, Peter K., and Lenore A. Grenoble. 2007. Current trends in language documentation. In Language Documentation and Description, Volume 4, ed. Peter K. Austin, 12–25. London: School of Oriental and African Studies.
Broto, Emmanuel, and Fabienne Tzerikiantz (Producer & Director). 2009. Sevrapek City. [Motion picture].
Dimmendaal, Gerrit J. 2010. Language description and the “new paradigm”: What linguists may learn from ethnocinematographers. Language Documentation & Conservation 4:152–158.
Dwyer, Arienne M. 2006. Ethics and practicalities of cooperative fieldwork and analysis. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 31–66. Berlin, New York: Mouton de Gruyter.
Gippert, Jost, Nikolaus P. Himmelmann, and Ulrike Mosel, eds. 2006. Essentials of Language Documentation. Berlin, New York: Mouton de Gruyter.
Harrison, K. David, ed. 2007. When Languages Die: The Extinction of the World’s Languages and the Erosion of Human Knowledge. Oxford: Oxford University Press.
Lynch, John, and Terry Crowley, eds. 2001. Languages of Vanuatu: A New Survey and Bibliography. Canberra: Pacific Linguistics.
Mondragón, Carlos. 2004. Of winds, worms and Mana: The traditional calendar of the Torres Islands, Vanuatu. Oceania 74:289–308.
Vanuatu National Statistics Office. 2009. National Census of Housing and Population. Port Vila, Vanuatu: Ministry of Finance and Economic Management.
Wittenburg, Peter. 2007. DoBeS/MPI Archive Issues. Presentation at the National Science Foundation Workshop on Documenting Endangered Languages, Durham, New Hampshire (October 2007), available online at http://www.lat-mpi.eu/papers/papers-2007/Presentations/newhampshire-talk.pdf (accessed 2011/03/19).
Wittersheim, Eric (Producer & Director). 2008. Le Salaire du Poète. [Motion picture].

Chapter 14
Filming with native speaker commentary∗
Anna Margetts

1. Introduction

As discussed by Seifart (this volume), there tend to be competing motivations for documenting endangered languages. Documentation projects within the DoBeS program aim at building corpora of authentic data of spoken language that (a) are interesting not only for linguistics but also for other disciplines; (b) are annotated with transcriptions, translations and comments so that they can be understood without prior knowledge of the language; and (c) can be used by the speech community for language maintenance and revitalisation. An example of a conflict arising from such competing motivations, and its resolution, is discussed in Mosel (2004, 2009) for the Teop language documentation project. Mosel (2004: 3) notes that language documenters need “to find a balance between what is interesting for them and their special field of expertise, what is relevant for other non-linguistic disciplines and what meets the expectations of the speech community.” In response to the Teop speech community’s reluctance to have linguistically accurate transcriptions of recordings made publicly available, the documentation project created written native-speaker-edited versions of oral texts which were considered more acceptable and desirable by the community. The raw transcriptions are still archived, but are less publicly available.

∗ I would like to thank Birgit Hellwig and the editors of this volume for valuable feedback on an earlier draft. I also thank Andrew Margetts, who is behind many of the technical details reported here. As always, I cordially thank the communities and individuals on Saliba and Logea Island who have supported our work. More speakers have been involved with the Saliba-Logea project than can be listed here. Community members who have worked with us on transcribing, annotating, translating, and editing texts include Nebo Joseph, Rose Meina, Meggie Alaluku, Matthew Hawele, Penesia Eric, Mila Kelwau, and Morris Alaluku. For their commentaries on the recordings discussed here I thank Balosi Leman, Alaluku Leman and Mr January. DoBeS project members in Australia include Anna Margetts, Carmen Dawuda, Andrew Margetts, and John Hajek; Ulrike Mosel was the German host, and Kipiro Damas from the PNG National Herbarium in Lae joined as a botanical consultant.

This chapter is concerned with commentaries accompanying video recordings as a research methodology and means of data collection, again as a response to the tension arising from the different aims within a documentation project. It discusses the benefits of this technique and addresses the nature of the data that can be collected by this method. Commentaries have in the past received relatively little attention as a genre or as a data collection method. However, more recently there has been some discussion of this topic in the field of documentary linguistics. Cablitz (2008) describes several methods and recording setups for the creation of procedural documents. In one of the techniques, one person performs the activity (e.g. traditional food or medicine preparation) while another speaker provides a commentary on what is happening. As Cablitz observes, such commentaries do not constitute “examples of how people actually communicate with each other” as called for by Himmelmann (2006: 7), but create a new type of communicative event along the lines described by Mosel (2004, 2009). Like Mosel, she notes that new types of communicative events, while not traditional, are interesting in themselves as they allow one to observe the process of using language in a new situation and provide new insights into a language’s expressive potential (Mosel 2004: 4; Cablitz 2008).

Another recent use of commentaries is the technique of orally providing annotations to primary data, thereby shortcutting the transcription and written annotation process. In the context of documentation work with the Cup’ik community, Woodbury (2003) argues for the recording of such oral commentaries instead of focusing exclusively on written annotations.

... we will use the time of the few elder Cup’ik translators with wide English and Cup’ik vocabularies to produce running UN [United Nations] style translations of many more materials, and then have younger speakers flag the obscure words or usages for special attention. We are also considering not transcribing everything – instead starting with hard-to-hear tapes and asking elders to ‘respeak’ them to a second tape slowly so that anyone with training in hearing the language can make the transcription if they wish. (2003:45)

In a similar vein, Simons (2008), Bird (2010), and Reiman (2010) discuss the method of Basic Oral Language Documentation (BOLD) where oral annotation replaces transcription and written annotation. In place of the traditional corpus which consists of data that has been transcribed and marked up with written annotations, the BOLD method builds an oral documentation corpus where the compiled data is annotated orally and then archived (Simons 2008: 23). In this method, commentaries are tagged onto media files containing the primary data; it has been applied to languages in Guinea-Bissau (Reiman 2010) and in Papua New Guinea as part of the BOLD PNG project (Bird 2010).1

1. http://www.boldpng.info/

This chapter introduces a methodology that is distinct from the BOLD approach, but bears some similarities to that of Cablitz (2008). I will be primarily concerned with commentaries to video recorded events, where the commentary itself constitutes the primary linguistic data (sections 2.1 to 2.3). However, I also address scenarios where the commentary provides annotations to linguistic events or other types of performances (section 2.4).

The chapter is based on research within the DoBeS Saliba-Logea documentation project in Papua New Guinea, which started in 2004 but continues earlier research with the Saliba community under way since 1995. There are about 2500 speakers of Saliba-Logea, who live predominantly as subsistence farmers on Saliba and Logea Island and the surrounding area in Milne Bay Province of Papua New Guinea. The two main dialects, Saliba and Logea, are mutually intelligible and differ mainly in their lexicon. During our fieldwork we experimented with different methodologies of data gathering and of collaborating with the community. Some of the techniques had been conceived of before the beginning of the project, informed by previous fieldwork, such as interviews of older speakers conducted by younger family members. Other techniques were developed in the course of the project as new situations arose. Most of our texts are transcribed recordings of oral language but the database also includes a small number of written texts. So far the project has created a corpus of about 37 hours of transcriptions.
Some recordings are of natural conversations, others are based on elicitation with non-verbal stimuli, including the frog story, the pear story and a number of the stimuli developed at the MPI for Psycholinguistics.2 However, the majority of sessions in the database follow the common format of monologuous narratives where one speaker tells a story or provides a description of certain activities.

2. http://fieldmanuals.mpi.nl/


The dominance of narratives is due to a number of factors including preference by the community for texts such as traditional narratives as recordworthy, as described by Mosel (2009), but no doubt also through a bias on the part of the fieldworkers, as described by Foley (2003). Overall, conversational data is more difficult to record because speakers do not necessarily recognise conversation as an event worthy of documentation and the topics of conversation are often more private and less suitable to be shared publicly (e.g. gossip). It is also easier to prompt speakers to tell a story than to simply have a conversation. In addition, conversational data is generally more complex due to the presence of multiple speakers in one session and overlapping speech. It is therefore more difficult and time-consuming to process, annotate, and analyse. The commentaries discussed in this chapter extend the project’s database in two ways. One of the commentaries we recorded provided a new text type to the corpus – a sports commentary which consists of a stream of spontaneous language. Apart from specific linguistic features which are of interest in this text, the addition of the commentary broadened the range of communicative events represented in the corpus. This is of value in itself as a broad spectrum of text types is essential both from the point of view of documentation and for the purpose of linguistic description (see Foley 2003, Mosel 2009 among others). The second commentary provided essentially conversational data which, as mentioned, is still underrepresented in the Saliba-Logea corpus and is one of the genres more difficult to record. The project’s experience in two different recording situations and with different speakers shows that commentaries do not constitute a single text type or genre but depend on the speaker’s interpretation of the task. Some recordings in our database capture community events and cultural practices. 
They are classified as event documentation. If they contain linguistic data of sufficient quality, they are transcribed and fed into the project’s normal data workflow. Otherwise they are deposited to the archive with metadata but without further annotations. These sessions include recordings of canoe building and repair, house building, parties, sports matches, drumming, and other types of performances. The remainder of this chapter is concerned with commentaries as a method of enhancing the value of such primarily nonlinguistic recordings. Audio commentaries about the filmed events make the recordings more useful and interesting to the community but they also provide valuable language data for the project’s corpus.

2. Filming with running commentary

On past occasions the team had filmed some special events for the community. Ad hoc attempts to add linguistic value to such recordings by asking from behind the camera what people were doing tended to create only minimal interactions, like the exchange in (1).

(1) Q: Saha kwa gina-ginauli?
       what 2PL RED-do
       ‘What are you doing?’
    A: Ya nekwa-nekwali.
       1SG RED-peel.tubers
       ‘I’m peeling tubers!’

The answer in (1) was given by a woman at a wedding who was peeling an enormous taro tuber which her family had brought for the feast to support the relatives who were hosting the party. It was a special type of taro and a special type of gift, and there could have been much more said about it, but the question was clearly not suitable to elicit a more detailed response. When the documentation team was asked to film a soccer match and the surrounding community activities, we faced a day of filming with little chance to record any usable linguistic data. In this context we decided to invite a speaker to provide a running commentary of the match. This strategy was successful and elicited a stream of spontaneous spoken language. We applied this technique in another context with similar success and, in hindsight, realised that we could have enhanced earlier recordings by employing this method. In the following I discuss the recordings for which we invited commentaries, as well as one occasion where we did not, and show how a commentary would have improved the value of that recording both for the community and for the database.

2.1. The soccer match

The soccer match was the season final of the Sawasawaga area and a big community event.3 We invited a speaker who had previously worked with the project on recordings to provide the commentary. We set him up with a radio microphone which allowed him freedom of movement independent of the camera. He gave it a very good go but was somewhat shy in his reporting and when there was not much happening on the soccer field there are many long pauses in the commentary. At some point, without warning, the microphone was handed to another speaker and we did not know for sure where or who the new commentator was (this is an interesting aspect of radio microphones: one can lose the speaker!). The new commentator was an older speaker with whom we had not worked before. He had lived in the capital Port Moresby for many years, was a sports fan, and had seen many matches on TV with commentaries in English. He was clearly familiar with sports commentaries as a genre and his commentary was engaging and professional. It went on virtually without pauses for the remainder of the two-hour match, covering lulls in the game with comments on the teams, their supporters, the referee, and the weather (perfect: no rain and not too hot). He also included short interviews with the referees – clearly in order to make the commentary livelier and more engaging.4

3. The final was between “Cycas” from Bwasitau and the “Station Warriors” from Sawasawaga. The match had been postponed several times because of a death in a neighbouring community and of the uncertainty when the funeral would be held. The teams’ preparation for the match included all-night prayer meetings which continued each time the match was postponed. Players must have been very tired by the time the day came and it was a relatively slow match, ending in a 0:0 draw. In the ensuing penalty shoot-out, Sawasawaga won 1:0.
4. The commentaries were provided by Balosi Leman and Mr January. The recordings of the match were fed into the data workflow as four sessions: SoccerMatch_01EZ (commentary Balosi Leman), SoccerMatch_02FA, SoccerMatch_03FA, SoccerMatch_04FA (commentary by Mr January across three video tapes).

There are a number of interesting aspects to the recorded data. The commentary constitutes a novel type of communicative event in the Saliba-Logea context and a new text type in the project’s database and so adds to the variety of texts represented in the corpus. As discussed, a wide spectrum of text genres is crucial both from the point of view of documentation and for the purpose of linguistic description. Foley (2003) and Mosel (2009) among others discuss the fact that different text types are rich in different constructions and grammatical descriptions can as a result differ considerably if they are primarily based on different text types. Mosel (2009: 17) demonstrates this with topic constructions in Teop. A broad spectrum of text types is also important for other reasons. Seifart (2008) discusses the representativeness of documentation corpora, and
Foley (2003: 86) warns that “the effect of our native language ideology on our products of description can be heavily disguised ... The only corrective I can suggest for our possibly misleading descriptive flights of fancy is ... stay close to the full range of data, all register and genre types”. While the commentary is monologuous it is very different from other monologues in the corpus, such as more planned narratives, but also from spontaneously produced monologues. Running sports commentaries are a unique text type in that the commentator is not aware of the outcome of the match while reporting and therefore cannot structure the text according to a certain result.5 The speech in the soccer commentary is spontaneous and off the cuff in this sense as the commentator invents techniques of filling the gaps and producing a stream of continuous fluent language. It is not an established genre in Saliba-Logea and the commentator clearly produces an imitation of a western-style sports commentary. In this sense the commentary data can be considered artificial while at the same time clearly highly spontaneous. Beyond constituting a new text type, the commentary contains some interesting linguistic features including several instances of the reciprocal prefix which is extremely rare in the Saliba-Logea text data: (2)

Se hai-bayobayoa molosi.
3PL RECP-struggle real
‘They are really struggling with each other.’

The text also contains the only examples of code switching between Saliba and Tok Pisin in the corpus.6 The commentator started out in Saliba but then switched between Tok Pisin and Saliba several times. Two examples are given in (3) and (4).7

5. Presumably the genre only emerged in response to advances in media technology and did not exist before the advent of radio. However, commentary-style reporting could in principle also take place for the benefit of a physically present but non-seeing audience.
6. English is the areal lingua franca in Milne Bay Province and typically only people who lived in other parts of PNG know Tok Pisin. This has been changing somewhat in recent years, at least in the provincial capital Alotau, with the influx of people from other parts of PNG.
7. Items like winim ‘win’ and tim ‘team’ are considered loan words here rather than instances of code switching.


(3) Saliba
    ta kaikewa kabo kaiteya tim ye winim,
    1INCL watch TAM which team 3SG win
    ‘We watch which team will win,’
    nige kabina ta kata.
    NEG its.nature 1INCL know
    ‘we don’t know’
    Tok Pisin
    Aah! tufella tim, tufella tim.
    INTRJ two team two team
    ‘Ah, two teams, two teams.’

(4) Tok Pisin
    i kisim lau i pasim finish,
    3SG get go 3SG pass finished
    ‘he goes and gets it he passes it,’
    Saliba
    kabo kaiheya ye lau-lau sola
    TAM game 3SG RED-go still
    ‘the game is still going’
    sola nige hada hesauna ye tole.
    still NEG goal other 3SG put
    ‘nobody put in a goal yet.’

Another characteristic of the commentary concerns the structuring of the text through parallel clauses and repetitions. Parallel constructions describing the teams and their actions seem to be designed to highlight the teams’ equality and possibly also the commentator’s impartiality. Examples are given in (5) and (6).

(5) Cycas ye bayao kalili na
    team.name 3SG strong very CONJ
    ‘The Cycas team is really strong and’
    Station warriors hinage ye bayao kalili na.
    team.name also 3SG strong very CONJ
    ‘the Station warriors team is also really strong.’

(6) Cycas yo-di tau support meta ye bado kalili
    team.name CLF-3PL man support TOPIC 3PL many very
    ‘There are very many Cycas supporters,’
    na Station warriors hinage yo-di tau support ye bado kalili
    CONJ team.name also CLF-3PL man support 3PL many very
    ‘there are also many Station warriors supporters.’

In terms of repetitions, the commentary shows a different pattern from other text types. It features continuing repetition of phrases while the action they describe is ongoing, as is common in western-style sports commentaries. (See Müller 2007 and Lavric et al. 2008 for discussion on the linguistics of football.)

(7) Cycas ye hai ye dobi-dobima,
    team.name 3SG get 3SG RED-come.down
    ‘The Cycas team got it, it’s coming down,’
    ye dobi-dobima ye dobi-dobima ye dobi-dobima.
    3SG RED-come.down 3SG RED-come.down 3SG RED-come.down
    ‘it’s coming down, it’s coming down, it’s coming down.’

Besides providing linguistic material, the commentary also enhanced the value of the recording for the community by adding some of the excitement of the match to the recording. By naming players, stating which team is on the ball, and indicating who has a chance of scoring, it also provides information which cannot necessarily be gleaned from the video.

2.2. The canoe race

On another occasion the documentation team was preparing to film a children’s canoe race. We again invited one of the organisers to provide a running commentary.8 The style of this commentary is very different from that accompanying the soccer match. The commentator is basically not speaking for the microphone but to the other race organiser, to by-standing spectators, and to the racing participants out on the bay. Overall the data is conversational rather than monologuous and it constitutes some of the clearest audio recordings of casual conversations in the database. From the point of view of conversational data, the drawback of the recordings is of course that none of the speakers is in the picture, as the camera is trained on the racing canoes, and that when interlocutors are not close to the commentator only his side of the conversation is recorded through the lapel microphone. However, the audio quality is still better than some of our other audio-only conversational data. Apart from the conversational nature of the commentary, interesting aspects of the data include examples of boating and sailing terminology in their natural context of use, as in (8) to (10). Such terminology is otherwise mainly represented through elicitation and is rare in the corpus.

(8) Se kuke.
    3PL set.sail
    ‘They are setting sail.’

(9) Se giyuli.
    3PL go.around.point
    ‘They are sailing around the point.’

(10) Se yatowa.
     3PL tack
     ‘They are tacking.’

8. The race was initiated by Balosi Leman, the sailing canoe expert and soccer commentator with whom we had worked before. He organised it with the help of his brother Alaluku Leman who also provided the commentary.

The recordings also include examples of English vocabulary and of code-switching between Saliba and English. Speakers tend to consciously avoid this during recordings and so, compared to every-day language, code-switching is underrepresented in the database. Some examples are given in (11) to (13).

(11) Nige ko wose! Ko wose ko disqualified.
     NEG 2SG paddle 2SG paddle 2SG disqualified
     ‘Don’t paddle. If you paddle you will be disqualified.’

(12) Next time, huya hesau namwa-namwa-na kabo ku kita-gai.
     next time time other RED-good-3SG.POSS TAM 2SG see-1EXCL
     ‘You’ll see us next time.’

(13) You are safe to walk, kwa lau namwa-namwa, you act properly.
     you are safe to walk 2PL go RED-good you act properly
     ‘You are safe to walk, walk safely, act properly.’

Again, the commentary enhanced the video footage both in terms of adding linguistic data for documentation and analysis and in terms of benefit for the community. Without the explanations of what is happening the video images basically just show a few sailing canoes and are otherwise uninterpretable.

2.3. The gwalisaekeno feast

There are other events that were filmed as part of the project, such as weddings, feasts, or the gathering and cooking of malamala (sea worms), which would have benefited from a running commentary. In this section I describe, based on the example of a traditional feast, how a commentary would have enhanced the documentation.

A gwalisaekeno feast is a significant community event with many aspects and sub-events that are worth documenting. The feast is part of the traditional arrangements of celebrating and cementing a marriage and the ensuing bond between families. The feast is held by the woman’s or the man’s side after the couple has lived together for a few years. One family calls the other for the gwalisaekeno at a set date. Both sides collect pigs from people on whom they can call and some will be slaughtered for the feast. The extended family will come to support either side by bringing contributions of foodstuffs. The hosts prepare a feast for their in-laws but will not partake themselves. The visiting party brings pigs and food gifts and there is competition to match or out-do the other side’s efforts. The visitors also bring gifts of special firewood as a present for the mother-in-law to compensate her for the loss of help by her daughter or son who is now married. If the visitors come from further away they will typically arrive by boat, blowing a cone shell trumpet or a modern substitute which indicates that the boat is carrying a gift of pigs. As they arrive, the visitors run into the village carrying the pigs, yams and other goods on decorated stretchers. The mother of the visiting party or another female relative will launch into a tirade about the bad and lazy behaviour of the son or daughter-in-law. She will complain about the good-for-nothing youngster who doesn’t help, doesn’t know how to work hard, and doesn’t fetch water – often cracking into laughter while performing this traditional task. At the end of the party the visitors return to their village with their gifts and the host family distributes the received goods to the members of their own extended family who supported them with their contributions. The visiting side typically reciprocates after a few years with a corresponding feast. Following this counter exchange the marriage is settled and divorce would be quite involved for both sides, because the exchanged gifts would have to be returned or compensated for.

We filmed the preparations of a gwalisaekeno feast on the side of the visiting party, their arrival by boat, and the party itself.9 The recordings are more or less free of usable linguistic data (apart from some speeches) but they provide an interesting documentation of one of the last traditional feasts still practiced in the area. However, the video documentation is basically only as good as the background knowledge of the viewer because many aspects of the recordings are not self-explanatory. As described, there are many aspects of this traditional feast which follow regular patterns and which are part of the cultural knowledge of the community, but their meaning and significance, or even the fact that they are taking place, cannot necessarily be gleaned from video footage of the events. A running commentary by a community member could have bridged this gap and made the documentation more meaningful for outside viewers by explaining what was happening. In the course of the project we also recorded procedural texts and interviews with several speakers about gwalisaekeno feasts.10 These sessions are located in a different branch of our data tree but they are linked to the video recording of the feast through metadata and the project’s file naming practice.
A running commentary during filming would have strengthened the link between the two types of recordings (the event itself and the meta-texts about the event) by naming sub-events as they occurred, such as the traditional complaints by the mother, the carrying of the firewood, and the pig slaughter, which are described in the meta-texts but filmed without commentary in the documentation. Of course, it is possible to add a running commentary after the event by reviewing the recording with speakers, and this would provide a useful annotation to the gwalisaekeno videos. However, such commentaries constitute a separate step in the data gathering and processing (and therefore may or may not in fact happen). One also needs to be aware that adding a commentary later is a different technique which will result in different kinds of linguistic data. For example, deictic terms employed by a commentator on site may be quite different from those used by a commentator who is reviewing the recordings on a screen and commenting on them.11

9. The relevant sessions are Gwalisaekeno_03AN, Gwalisaekeno_04AN, and Gwalisaekeno_05AN.
10. See e.g. Gwalisaekeno_01BF, Gwalisaekeno_02BF, and Giyahi_01AA.
11. In principle, such different types of commentaries could be used to create parallel corpora in order to explicitly investigate the differences between them. A similar methodology is described by Mosel (2009) to compare a narrative about killing a chicken and a procedural text on the same topic, both elicited with the same visual stimuli. See also Mosel (2008, 2009) on comparing parallel corpora of raw oral versions and edited versions of the same texts.

2.4. Commentary as annotation

So far we have been concerned with commentaries as a means of adding linguistic data to recordings of primarily non-linguistic events, similar to the methodology discussed by Cablitz (2008). However, the technique can also be applied to record annotations to linguistic events or other types of performances, as in the case of the BOLD methodology (Simons 2008, Bird 2010, and Reiman 2010), and to provide metadata. The oral recording of basic metadata like the date, location and the names of the participants at the beginning of a session is already a standard in language documentation, but commentaries can be used much more extensively along these lines. The previous section discussed the value that a commentary to the gwalisaekeno recordings could have added by providing explanations on the activities being filmed. But besides commenting on the activities, it could also have provided extended metadata-type information such as the key performers’ names and their position in the respective families (e.g. the mother’s eldest brother), or annotations such as which aspects of the party are traditional and which are modern takes on traditional themes.

Depending on the type of performance being filmed, the audio recording of the event itself can be very important. This would be the case with the speeches that were held as part of the gwalisaekeno feast, or when recording musical performances or festivals. Ideally, these aspects of the signal would not be inseparably overlaid with the commentary, and so a recording of the event without the commentary should be created in parallel. This can be achieved by using separate microphones, one for each channel. With this setup, one microphone is dedicated to the commentary while the other records the event. This allows for annotations and comments to be recorded simultaneously with the actual event without interfering with it. See Margetts and Margetts (in print) for technical details on such recordings.
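
The two-microphone setup lends itself to a simple post-processing step. The following is a minimal sketch only, assuming an uncompressed two-channel PCM WAV file in which the first (left) channel carries the commentary microphone and the second (right) channel carries the event microphone. The channel assignment, the function names deinterleave and split_stereo_wav, and all file names are illustrative assumptions and do not describe the project's actual workflow.

```python
import wave


def deinterleave(frames: bytes, sample_width: int) -> tuple[bytes, bytes]:
    """Split interleaved two-channel PCM data into two mono byte strings.

    In a stereo PCM stream every frame holds one sample per channel,
    first channel followed by second channel.
    """
    frame_size = 2 * sample_width
    if len(frames) % frame_size:
        raise ValueError("data is not a whole number of stereo frames")
    first = bytearray()
    second = bytearray()
    for start in range(0, len(frames), frame_size):
        first += frames[start:start + sample_width]
        second += frames[start + sample_width:start + frame_size]
    return bytes(first), bytes(second)


def split_stereo_wav(source: str, commentary_out: str, event_out: str) -> None:
    """Write each channel of a stereo WAV file to its own mono WAV file.

    Assumption: channel 1 = commentary microphone, channel 2 = event microphone.
    """
    with wave.open(source, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel recording")
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    commentary, event = deinterleave(frames, sample_width)

    for path, data in ((commentary_out, commentary), (event_out, event)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(sample_width)
            out.setframerate(frame_rate)
            out.writeframes(data)
```

Under these assumptions, a call such as split_stereo_wav("session.wav", "session_commentary.wav", "session_event.wav") (hypothetical file names) would yield one mono file per microphone, so that the audio of the event remains available without the overlaid commentary while the commentary can be transcribed separately.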

3. Conclusion

Language documentation projects aim to meet several, sometimes seemingly conflicting, goals and are subject to different expectations by different parties. Making materials produced for the community valuable for linguistic research and vice versa can reduce the tension between conflicting demands and make for more productive fieldwork. Commentaries have so far been little discussed as a method of data gathering. In particular, running commentaries on recordings of primarily non-linguistic events prove to be a good source of high-quality data and a good way of recording annotations. Apart from making filming more worthwhile from the point of view of the documentation project, running commentaries also enhance the quality of the documentation for the community. The problem with many non-linguistic recordings is that they are not necessarily self-explanatory and often require additional information to be interpreted, even for people who may have been present during recording. Commentaries can therefore both make the recordings more interesting to watch and create a more detailed and comprehensive documentation. The running commentaries we recorded contain interesting linguistic data, including some features otherwise rare in the database, and they contributed a new text type to the Saliba-Logea corpus, thus adding to the variety of speech types within the database. Questions to address when preparing to film with a commentary include which events are appropriate to be filmed and who is a good commentator. Filming on invitation by the community is a good place to start. But in our experience, recordings of some events for which we were initially reluctant to ask permission to film, such as mourning rituals or funerals, might in hindsight have delighted the community as a memento and a documentation of the important event. It can be hard to find the right people to ask for permission to film because they may be occupied with scripted roles in the event, e.g. as mourners.
Inquiring ahead of time about the appropriateness of filming certain events is therefore indispensable. Another point to be taken into account is the personality of the commentator. It helps if they like to talk and are engaged with the event. Another aspect is whether the commentator is perceived as appropriate to comment on the event, and what their relation is to the people and events being filmed. This includes aspects like their seniority, their knowledge of the event, (perceived) partiality, family affiliations, etc. In sum, inviting a running commentary can help in creating a richer documentation for the project and for the community and should perhaps be the norm rather than the exception for video recordings of primarily non-linguistic events. Running commentaries are also a productive technique for annotating linguistic events and performances and for recording metadata.

Abbreviations

1, 2, 3   first, second, third person
CLF       classifier
CONJ      conjunction
EXCL      exclusive
INCL      inclusive
INTRJ     interjection
NEG       negation
PL        plural
POSS      possessive
RECP      reciprocal
RED       reduplication
SG        singular
TAM       tense, aspect, mood
TOPIC     topic marker

References

Bird, Steven. 2010. A scalable method for preserving oral literature from small languages. Proceedings of the 12th International Conference on Asia-Pacific Digital Libraries, Gold Coast, Australia, June 2010. http://www.boldpng.info/.

Cablitz, Gabriele. 2008. The making of procedural documents on the Marquesas and Tuamotu islands (French Polynesia). Presentation at Language Documentation Methods in Focus (DoBeS meeting, June 2008).

Foley, William A. 2003. Genre, register and language documentation in literate and preliterate communities. In Language Documentation and Description, Volume 1, ed. Peter K. Austin, 85–98. London: School of Oriental and African Studies.

Himmelmann, Nikolaus P. 2006. The challenges of segmenting spoken language. In Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 253–274. Berlin, New York: Mouton de Gruyter.

Lavric, Eva, Gerhard Pisek, Andrew Skinner, and Wolfgang Stadler, eds. 2008. The Linguistics of Football. Tübingen: Gunter Narr.

Margetts, Anna, and Andrew Margetts. To print. Audio and video recording techniques for linguistic research. In The Oxford Handbook of Linguistic Fieldwork, ed. Nicholas Thieberger. Oxford: Oxford University Press.

Mosel, Ulrike. 2004. Inventing communicative events: Conflicts arising from the aims of language documentation. Language Archives Newsletter 1(3):3–4.

Mosel, Ulrike. 2008. Putting oral narratives into writing – experiences from a language documentation project in Bougainville, Papua New Guinea. Presentation at the Simposio Internacional Contacto de lenguas y documentación, Buenos Aires, CAIYT (August 2008), available online as ‘Oral and written versions of Teop legends’ at http://www.linguistik.uni-kiel.de/mosel_publikationen.htm#download (accessed 2011/02/26).

Mosel, Ulrike. 2009. Collecting data for grammars of previously unresearched languages. Unpublished manuscript, available online at http://www.linguistik.uni-kiel.de/mosel_publikationen.htm#download (accessed 2011/02/26).

Müller, Torsten. 2007. Football, Language and Linguistics. Tübingen: Gunter Narr.

Reiman, D. Will. 2010. Basic oral language documentation. Language Documentation & Conservation 4:254–268.

Seifart, Frank. 2008. On the representativeness of language documentation. In Language Documentation and Description, Volume 5, ed. Peter K. Austin, 60–76. London: School of Oriental and African Studies.

Simons, Gary F. 2008. The rise of documentary linguistics and a new kind of corpus. Presentation at the 5th National Natural Language Research Symposium, De La Salle University, Manila (November 2008), available online at http://www.sil.org/~simonsg/presentation/doc%20ling.pdf (accessed 2011/02/26).

Woodbury, Anthony C. 2003. Defining documentary linguistics. In Language Documentation and Description, Volume 1, ed. Peter K. Austin, 35–51. London: School of Oriental and African Studies.
